PECE Data Management


Link to the Preliminary Report (COMPLETE DRAFT, v. 0.8)

Link to the Presentation at the RDA 6th Plenary, CNAM, Paris


This document describes the data management capabilities being built for the Platform for Experimental and Collaborative Ethnography (PECE, pronounced "peace"). Since February 2015, the PECE Design Group has been working to implement the recommendations of the Research Data Alliance's Practical Policies Working Group. The goal of the PECE Design Group is to develop best-practice data management capabilities customized to meet the special needs of empirical humanities researchers.

The Platform for Experimental, Collaborative Ethnography (PECE) is an open source (Drupal-based) digital platform designed to support distributed, collaborative work with diverse types of research data. While designed to support diverse experimental ethnography projects, it provides a general model for the digital humanities, and particularly for what we have termed the empirical digital humanities (including work in history, anthropology, folklore, and other fields that collect and analyze primary data, using hermeneutic or interpretive techniques). PECE provides a place to archive and share primary data generated by empirical humanities scholars, facilitates analytic collaboration, and encourages experimentation with diverse modes of publication. The platform also encourages humanities scholars to experiment with and better understand digitally-mediated, interdisciplinary collaboration, provides opportunities to involve students in humanities research as it progresses, and quickens the public availability of humanities research, in an open access form. PECE also provides the capacity to experiment with new forms of peer review for humanities research, and functions as a portal to a suite of open source tools useful for humanities research, including tools developed in data science for other scientific communities.


Note on Ethnographic Data Curation

The data curator is responsible for designing and implementing processes to ensure long-term preservation, proper management, and sharing of ethnographic data. Given the nature of ethnographic inquiry, it is not possible to specify either a fixed workflow or a “one-size-fits-all” solution for data management. Therefore, we aimed for flexible guidelines in PECE that are general enough to accommodate different ethnographic projects and yet specific enough to guarantee data management and data sharing between collections. One of the key roles of the data curator and manager is to balance attention to the privacy issues that are part of most (if not all) ethnographic research projects with creating the conditions for data sharing.


Data Management Definition

We designed and implemented a set of practical policies for data management following the recommendations of the RDA's WG-PP and the National Science Foundation's Data Management Plan (DMP) of September 2011. Ethnographic research data management in PECE encompasses four inter-related dimensions: preservation, disposition, privacy, and collaboration. In designing PECE, we aimed to balance the need to preserve privacy and anonymity with the need to create conditions for ethnographic data sharing and collaborative analysis among ethnographers and co-participants of our projects. In practical terms, we implemented and extended features of the Drupal web framework to ensure proper data management. With respect to these four constitutive dimensions of PECE data management:

1. Preservation: our data model is the first step in organizing digital ethnographic data for future archiving by library repositories. We do not aim to replace robust data repository solutions with PECE, but to help organize our ethnographic collections for long-term storage by our colleagues in the library and information sciences. To contribute to the task of ethnographic data preservation, we describe data types with rich metadata and linked data mapping (which allows for better discoverability, replication, and, what is key for ethnographers, the capacity to ask questions across several ethnographic collections). In practical terms, we offer a way to organize, analyze, and replicate digital ethnographic collections that will be useful not only for ethnographers but also for other professionals working on the digital preservation of scholarly archives. Our metadata description includes important fields such as provenance (which in the context of ethnographic projects covers information about the project, field sites, researchers, and methodological and theoretical orientations), in addition to fields for contributors, licensing, tags, and permissions.

2. Disposition: Having data preservation as one of our goals, we took one step further in organizing our data for archival, collaborative analysis, and sharing by creating open (as in public) interfaces for data science experts to harvest PECE open access data.

3. Privacy: Ethnographic projects are first and foremost based on the engagement between researchers and research participants for the interpretation of specific sociocultural processes. This engagement raises a myriad of privacy and ethical concerns. Various types of content (from participant observation or interviews, for instance) cannot be shared publicly because they contain sensitive information. Committed to preserving our research co-participants' privacy, the PECE Team designed the platform around the need to flag certain types of content as restricted from public view. In addition to a simple permission system based on user roles, we are planning to implement public-key encryption for our data store in the next version of the platform.

4. Collaboration: By leveraging Open Source-based web technologies (including semantic extensions) to support ethnographic projects, PECE aims to help advance modes of collaborative inquiry. For this purpose, data management is one of the key practices to ensure the use of open formats, flexible copyright licenses, and web standards to facilitate collaboration. The PECE Team made the design choice of running the platform on an established web framework, Drupal, to foster collaboration on many levels: its community size and geographic distribution (which spans East Asia, Western and Eastern Europe, and the Americas); and its development community, formed by companies (big and small), local community chapters and conferences, large international conferences, as well as numerous book publications and web resources with rich documentation for all levels of skill. Several companies and news outlets (with big and small datasets) run on Drupal with millions of articles, creating a vibrant community which cooperates to develop public, common resources by sharing code and documentation, as well as experiences among admins and users. For PECE's sustainability in particular, this collaborative dimension is fundamental. We decided to rely on and contribute to upstream Drupal development and to help with contributed modules by testing, reporting, and fixing bugs. Specifically, we are contributing a set of tools for collaborative annotation that will be of great value for the academic community already running research projects on the Drupal framework.


Implementation of RDA Practical Policies

1. Contextual metadata extraction:

We are working on an open API with access control that will allow for harvesting data (with rich metadata). This will work with XML and JSON end-points for every content type of our platform. The web framework we use has native support for linked data. We have one collaborator, Dominic DiFranzo, working on RDF extensions and mappings for PECE.
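
As an illustration of how a harvester might consume such a JSON end-point, here is a minimal sketch. The payload shape and the field names (`id`, `title`, `license`, `tags`, `access`) are assumptions made for the example and are not PECE's actual output.

```python
import json

# Hypothetical artifact listing; the real end-point output may be shaped differently.
SAMPLE_RESPONSE = """
[
  {"id": 1, "title": "Interview transcript", "license": "CC BY-SA 4.0",
   "tags": ["interview"], "access": "open"},
  {"id": 2, "title": "Field notes", "license": "CC BY-SA 4.0",
   "tags": ["fieldnote"], "access": "restricted"}
]
"""


def harvest_open_artifacts(raw_json):
    """Return the metadata records of artifacts marked as open access."""
    records = json.loads(raw_json)
    return [record for record in records if record.get("access") == "open"]


for record in harvest_open_artifacts(SAMPLE_RESPONSE):
    print(record["id"], record["title"], record["license"])
```

In a real harvester the string would come from an HTTP request to the public end-point; it is inlined here so the sketch runs without network access.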

See the example of implementation of this policy here (PDF)
See the example config file for: public endpoint, private endpoint, artifacts
See the example JSON output for an artifact listing here

Link to the draft policy implementation on Google Docs


2. Data access control:

We specified user roles and permissions as well as an easy way to publish 'restricted' content. We have also been looking into data encryption and system-level 'hardening' for our platform. The first public release of PECE will come pre-configured for admins and users.
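
The role-based check described above can be sketched as follows. The role names, their ordering, and the `restricted` and `min_role` fields are hypothetical; in PECE the actual check is delegated to Drupal's permission system.

```python
# Hypothetical roles ranked from least to most privileged.
ROLE_RANK = {"anonymous": 0, "contributor": 1, "researcher": 2, "administrator": 3}


def can_view(user_role, artifact):
    """Open artifacts are visible to everyone; restricted ones need a minimum role."""
    if not artifact.get("restricted", False):
        return True
    required = artifact.get("min_role", "researcher")  # assumed default
    return ROLE_RANK[user_role] >= ROLE_RANK[required]


field_notes = {"title": "Field notes", "restricted": True}
print(can_view("anonymous", field_notes))   # False
print(can_view("researcher", field_notes))  # True
```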

See the example of implementation of this policy here (PDF)

Link to the draft policy implementation on Google Docs


3. Data backup:

We have a setup for automatic backups of the platform (filesystem and DB), which will be installed by default for every PECE instance. We will also document and recommend that users keep a redundant backup by generating regular snapshots of the virtual machine where PECE runs.
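
To make the idea of scheduled filesystem backups concrete, here is a minimal sketch that archives a directory under a timestamped name and prunes older archives. It is not the backup module configured for PECE; the archive naming scheme and the `keep` parameter are assumptions, and a database dump would have to be handled separately.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def backup_directory(source, dest_dir, keep=7, now=None):
    """Archive `source` into `dest_dir` and keep only the `keep` newest archives."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    source, dest_dir = Path(source), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"pece-files-{stamp}.tar.gz"  # hypothetical naming scheme
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)
    # Timestamped names sort chronologically, so the oldest archives come first.
    archives = sorted(dest_dir.glob("pece-files-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()
    return archive
```

The destination directory should live outside the directory being archived, and a scheduler such as cron would call the function at the desired interval.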

See the example of implementation of this policy here (PDF)
See the two configuration files: settings_profile.txt and schedule_profile.txt.

Link to the draft policy implementation on Google Docs


4. Data format control:

We specified and implemented restrictions for types of content that can be uploaded (only "web safe" and Open Document Formats will be permitted to be uploaded). For text documents, only PDF and ODF documents will be supported. For audio and video files, only web-safe and open formats will be supported (with the exception of mp4, which is supported by modern browsers): webm, ogg, ogv, mp4, m4v, m4a, wav, mp3.
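
The upload restriction amounts to an extension whitelist, which can be sketched as below. The audio and video extensions are the ones listed above; the specific ODF extensions (`odt`, `ods`, `odp`) are an assumption, since the policy text only says "ODF documents".

```python
from pathlib import Path

# "pdf" and the media extensions come from the policy text; the ODF
# extensions are assumed for illustration.
DOCUMENT_EXTENSIONS = {"pdf", "odt", "ods", "odp"}
MEDIA_EXTENSIONS = {"webm", "ogg", "ogv", "mp4", "m4v", "m4a", "wav", "mp3"}
ALLOWED_EXTENSIONS = DOCUMENT_EXTENSIONS | MEDIA_EXTENSIONS


def is_allowed_upload(filename):
    """Accept a file only if its final extension is on the whitelist."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in ALLOWED_EXTENSIONS


for name in ("interview.MP3", "notes.docx", "report.pdf"):
    print(name, is_allowed_upload(name))
```

Checking the extension alone does not verify the file's actual contents; a production check would also inspect the MIME type.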

See the example of implementation of this policy here (PDF).

Link to the draft of this policy implementation on Google Docs.


5. Data retention:

Ethnographic projects tend not to have “embargo periods” and ethnographic data tends not to have “expiration dates”, whereas both are quite common in digital data management in science and engineering disciplines. There are particular reasons for this difference. First, ethnographers tend not to share “raw data” but rather drafts of partial and preliminary analyses with other ethnographers and other research groups. The very concept of “raw data” is foreign to most contemporary ethnographic projects, since data only acquires meaning in the context of a particular ethnographic project. To put it differently, data must refer to what we call “conditions of production” to acquire particular meaning and become useful for research purposes. Ethnographic data is generated in the context of human relationships in general and forms of human and non-human interaction in particular. Without information on these basic foundations of data production, ethnographic research data is neither useful to nor usable by other researchers. Lastly, expiration dates are not common for ethnographic data because such data represents documents of not only anthropological and sociological interest but, in many cases, historical importance. They can be used for building archives and for comparative efforts at any point in the future as long as they are properly stored, extensively described, and made available through flexible licensing schemas and interoperable data management systems with open, public interfaces.

See the example of implementation of this policy here (PDF).
Access the example script included in the technical description of this policy implementation.

Link to the draft policy implementation on Google Docs.


6. Disposition:

According to the Research Data Alliance's workgroup on “practical policies” for data management (RDA WG-PP), “disposition” policies are triggered whenever a retention period has been reached, in order to delete or archive a digital object. For the needs of the PECE project in particular, “disposition” further represents the need to organize information in a way that allows ethnographic data to be readily available for sharing across platforms and research groups in the humanities and social sciences. There are two specific approaches to disposition, which encompass both the general orientation of the RDA WG-PP and the specific needs of the PECE project: 1) making it simple and straightforward for users to apply flexible copyright licenses to their pieces of data; and 2) triggering a disposition policy when an expiration period has been reached.
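
The second approach, triggering a policy once an expiration period is reached, can be sketched as a small decision function. The `expires` and `on_expiry` fields and the default of archiving rather than deleting are assumptions made for the example.

```python
from datetime import date


def disposition_action(artifact, today):
    """Return 'keep', 'archive', or 'delete' for an artifact on a given day."""
    expires = artifact.get("expires")
    if expires is None or today < expires:
        return "keep"  # no retention period set, or not yet reached
    # Assumed default: archive unless the artifact explicitly asks for deletion.
    return artifact.get("on_expiry", "archive")


artifact = {"title": "Draft memo", "expires": date(2016, 1, 1)}
print(disposition_action(artifact, date(2015, 12, 31)))  # keep
print(disposition_action(artifact, date(2016, 1, 1)))    # archive
```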

See the example of implementation of this policy here (PDF).

Link to the draft policy implementation on Google Docs.


7. Integrity and Replication:

Data integrity checking is performed primarily by the Drupal framework (through its Schema API) in conjunction with its database back-end, MariaDB: CRUD operations are handled by the Schema API, offering an abstraction layer for database operations on PECE/Drupal data structures, and the database server guarantees integrity through ACID (atomicity, consistency, isolation, and durability) conditions for all data transactions. To automatically check the integrity of database tables, we use the extension module “dba”, which allows for checking, reporting, and repairing data corruption on a regular basis. Data replication can be handled in many ways on PECE: 1) automated replication between production, testing, and backup instances for redundancy and/or performance (for advanced PECE administrators using our VM distribution: we discuss this configuration in the “PECE Technical Specification” document); 2) scheduled, automated server “snapshot” generation performed by the hosting service company to save the state of a particular instance; and, last but not least, 3) using the PECE Open API to replicate the data of a particular instance. This last option allows for easy integration with large-scale data repositories, as described in the section on “Contextual metadata extraction” of this document. For administrators with *nix expertise, replication is also conveniently done with Drush (and batch operations using shell scripting).

See the example of implementation of this policy here (PDF).
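
The integrity mechanisms described in this section operate at the database level. As a complementary illustration only, and not a mechanism described in this document, the following sketch shows how an administrator could verify that uploaded files were replicated intact by comparing SHA-256 checksums between a source and a replica directory. All function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to `root`) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_mismatches(source_manifest, replica_manifest):
    """List paths that are missing from the replica or whose checksums differ."""
    return sorted(
        path
        for path, checksum in source_manifest.items()
        if replica_manifest.get(path) != checksum
    )
```

Calling `find_mismatches(build_manifest(source_dir), build_manifest(replica_dir))` returns an empty list when every source file exists in the replica with identical contents.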

Link to the draft policy implementation on Google Docs.


8. Notification:

Drupal core provides logging capabilities through its watchdog() function, which registers system events such as available updates, security issues, and user account events; these can then be reported to administrators, researchers, and collaborators. The severity of events in Drupal follows RFC 3164 (which specifies the BSD syslog protocol). PECE has specific needs, however, that require extending the standard email notification system of Drupal. Automated notification capabilities are handled on PECE by security modules (as explained in the “Data Access and Security” section) and messaging modules. These capabilities include the ability to report all sorts of events to the user on various levels: system level (related to the platform itself), account level (related to specific users), and content level (related to additions, modifications, and deletions of artifacts). PECE's notification system follows “user roles” when addressing specific users with respect to the nature of the event. It also supports notifications that are addressed to research groups via PECE's group functionality: OG member subscribe and OG new content creation, change, or deletion. There are two types of notification, email and in-system, which notify users and administrators via their email contact or upon log-in, respectively.

See the example of implementation of this policy here (PDF).
Access the example scripts: PECE_rules_artifact_change_config.txt and PECE_rules_artifact_expired_config.txt
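
The combination of syslog severities, event levels, and user roles can be sketched as follows. The eight severity levels are those defined in RFC 3164; the routing table, the role names, and the email threshold are assumptions for illustration and do not reproduce the Rules configuration referenced above.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Syslog severities as defined in RFC 3164 (lower means more severe)."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFO = 6
    DEBUG = 7


# Hypothetical routing table: which roles are told about which level of event.
RECIPIENT_ROLES = {
    "system": {"administrator"},
    "account": {"administrator"},
    "content": {"researcher", "collaborator"},
}


def plan_notifications(level, severity, email_threshold=Severity.WARNING):
    """Return the roles to notify and the channels to use for an event."""
    channels = ["in-system"]
    if severity <= email_threshold:
        channels.append("email")  # only sufficiently severe events are emailed
    return RECIPIENT_ROLES[level], channels


print(plan_notifications("content", Severity.NOTICE))
print(plan_notifications("system", Severity.CRITICAL))
```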

Link to the draft policy implementation on Google Docs.


9. Restricted searching:

PECE is shipped with scalable search functionalities. For this purpose, the platform comes with an extension for a search server back-end. Search servers are important for our web framework (Drupal) because they allow for powerful discovery capabilities in a big corpus of text (and across different corpora of texts). Drupal's native search is known to underperform on a SQL database holding more than 50k documents/nodes. Another important benefit of having a search database back-end is the ability to search across different PECE instances, both for identifying ethnographic content and for asking research questions across several ethnographic collections. We have tested alternatives (such as Apache Solr and Elasticsearch) and have planned, but not yet implemented, our scalable searching capabilities. We are planning to use three search back-ends: one is our web framework's native search mechanism; another is a connector from our platform to an Elasticsearch back-end; and, finally, we will provide a SPARQL endpoint. We will use the Elasticsearch back-end for searching content in the platform (which will follow the RDA policy for restricted content, displaying restricted content only to users who have permission). The Elasticsearch and SPARQL back-ends will be used to index and query across several PECE instances.
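
The restricted-search rule, in which matching content is returned only to users permitted to see it, can be sketched independently of any particular search server. The in-memory index and the `allowed_groups` field are assumptions for illustration; a real deployment would apply the equivalent filter inside the search back-end's query.

```python
def restricted_search(index, query, user_groups):
    """Return ids of matching documents the user is permitted to see."""
    query = query.lower()
    results = []
    for doc in index:
        if query not in doc["text"].lower():
            continue
        allowed = doc.get("allowed_groups")  # None means the document is public
        if allowed is None or allowed & user_groups:
            results.append(doc["id"])
    return results


INDEX = [
    {"id": 1, "text": "Public memo on air quality", "allowed_groups": None},
    {"id": 2, "text": "Interview on air quality", "allowed_groups": {"asthma-project"}},
]

print(restricted_search(INDEX, "air quality", set()))               # [1]
print(restricted_search(INDEX, "air quality", {"asthma-project"}))  # [1, 2]
```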

See the example of implementation of this policy here (PDF).

Link to the draft policy implementation on Google Docs


10. Instance Cost Reports:

PECE depends on a set of Free and Open Source technologies that underpin the Drupal framework: *nix system tools (such as cron, drush, df, awk, and bash, as well as multimedia manipulation tools such as FFmpeg), a database server (such as MariaDB), scripting languages (such as PHP and JavaScript), and a set of contributed libraries that are used for data manipulation, management, and security purposes. Given the level of complexity of the system as a whole, we recommend that PECE users rely on Drupal managed hosting services offered by web hosting companies. This option is recommended for PECE administrators who are not experienced in *nix system administration. In order to provide PECE administrators with data on monthly usage for calculating costs, PECE relies on basic descriptive statistics that are generated by the Drupal core module “statistics”, as well as information about disk usage that is gathered in the back-end at every cron run. This information is very useful when estimating data transfers and calculating incurred hosting costs. Fully automated gathering and reporting of the usage of computational resources (such as CPU time, I/O, and individual artifact sizes) is planned for version 2.0 of the platform.
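
As a rough illustration of how disk usage and transfer figures translate into a monthly estimate, consider the sketch below. The per-gigabyte prices are placeholders, not actual hosting rates, and the function name is hypothetical; PECE itself gathers disk usage through cron and the Drupal “statistics” module as described above.

```python
import shutil

BYTES_PER_GB = 1024 ** 3


def monthly_cost_estimate(used_bytes, transferred_bytes,
                          storage_price_per_gb, transfer_price_per_gb):
    """Estimate a monthly bill from storage used and data transferred."""
    storage_cost = (used_bytes / BYTES_PER_GB) * storage_price_per_gb
    transfer_cost = (transferred_bytes / BYTES_PER_GB) * transfer_price_per_gb
    return storage_cost + transfer_cost


# Disk usage of the volume holding the current directory, similar to `df`.
usage = shutil.disk_usage(".")
estimate = monthly_cost_estimate(
    used_bytes=usage.used,
    transferred_bytes=100 * BYTES_PER_GB,  # placeholder monthly transfer volume
    storage_price_per_gb=0.10,             # placeholder price
    transfer_price_per_gb=0.05,            # placeholder price
)
print(f"Estimated monthly cost: {estimate:.2f}")
```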

See the example of implementation of this policy here (PDF).

Link to the draft policy implementation on Google Docs

See the cost estimates spreadsheet
See the PECE Sustainability Plan here


11. Use agreements:

We have already drafted a user agreement and a privacy policy for our platform (see 'Appendix' below). We are now working with lawyers from the Berkman Center for Internet and Society at Harvard to draft the legal version of these documents, which will be included in the user-registration page. Every user requesting a PECE account will have to read and agree to our terms (comprising three sections: User Agreement, Users' Conduct, and Privacy Terms).

See the example of implementation of this policy here (PDF).

Link to the draft policy implementation on Google Docs


APPENDIX

User Agreement

The Platform for Experimental and Collaborative Ethnography (PECE), hereby represented by the PECE Team (see “Team-members-list”), is a web platform for collaborative work around the tasks of archival, analysis, and sharing of ethnographic data. By using PECE, you accept the following terms and conditions, including those in our code of conduct and privacy policy documents. If you do not accept these terms, please refrain from using our platform. PECE is licensed under the GNU General Public License (GPL) version 3. The platform is based on the Drupal framework, which is licensed under the General Public License (GPL) version 2 or later, including its contributed modules. Other third-party software included in Drupal and PECE is licensed under compatible Free Software licenses, which are included in our source code repository for public access.

PECE was created to promote Open Access, Open Data, and Open Standards in the humanities and social sciences. In order to achieve this goal, all of our generated content is licensed under the “Creative Commons Attribution-ShareAlike 4.0 International” license by default, unless otherwise noted for specific pieces of content. Users are responsible for specifying the license they want for their own content or, if the content is being uploaded by a contributor who is not the original author, the license chosen by the copyright owner of the collected materials.

All uploaded content is the sole responsibility of the person who published it. The PECE team is not responsible for the content posted by users of the platform and cannot monitor all the published content. Registered researchers and contributors are responsible for their use of the platform. They are also responsible for anonymizing the ethnographic data they upload to the platform if it carries any potential privacy issue. PECE comes with no warranty or guarantee of fitness for any particular use, as described by its software license, GPL v.3. There are no restrictions on its use, copying, study, modification, or redistribution, as described in the GPL v.3 license. We make no warranty as to the reliability, accessibility, or quality of our web services. When using the platform you agree that the use of our services is at your sole and exclusive risk. The PECE team is not liable for any direct, indirect, incidental, consequential, or exemplary damages, including but not limited to harm and damage to research participants of any kind and in the context of any research project making use of the platform. We worked to minimize the security risks of the platform, but we cannot guarantee the complete security, anonymity, and confidentiality of the data posted on the platform.

Therefore, when in doubt regarding the privacy and ethical implications of a particular piece of data, please refrain from uploading it to the web. Always contact your research co-participants and rely on the guidance of your IRB committee to discuss and decide the best course of action as soon as you identify any potential privacy and security issue.

We reserve the right to change these terms at any time. If we make major changes, we will notify our users in a clear and prominent manner.

User's Conduct

Collaborative and open projects on the Internet are prone to social dynamics that can be harmful to the participants due to disagreements that are marked by sociocultural differences as well as differences in technical/academic background. We have learned from experience with Free and Open Source communities that such dynamics are more common than we would like to admit, given the premise of our gathering in the first place: to advance a culture of collaboration and sharing in the context of software development. In order to respond to recurrent events of harassment and misconduct – involving recurrent attacks against ethnic and gender minorities – we must find ways to speak openly as well as privately about these issues.

To promote a positive environment for collaborative work, the PECE team has agreed on a code of conduct which consists of a set of basic orientations for social interaction on our platform. When accepting our terms, you also agree to follow our code of conduct:

- We expect our users to be collaborative and welcoming, patient and respectful to other researchers and contributors of different academic traditions;
- Do not use or post content with discriminatory language except when needed for research purposes;
- Assume good faith when encountering and engaging with other users in different groups of the PECE community;
- Help promote openness among our academic circles by encouraging the usage of flexible copyright licenses and open formats when using PECE and other technologies;
- Avoid using words that can be harmful to another user.

Misconduct and/or offensive content can be reported at: cocreport@worldpece.org. When submitting your report, feel free to send us an anonymous email. If you do not anonymize your information, we will not publicize your name or your email address under any circumstances. We strongly recommend that you use public-key encryption (OpenPGP extensions in your email client) to protect the contents of your report. Our public key can be found on any keyserver under the email <cocreport@worldpece.org> and the fingerprint: 0E29 01C9 4D6B 502C 58E4 93B1 F1AA FBD9 59DE C514. Feel free to contact one of the PECE Team members to arrange a face-to-face meeting at professional conferences for key exchange and signing.

Privacy Policy

PECE does not collect or store any data on its users, except for the personal data that is given by the users themselves when creating profiles (“registration information”).

In order to achieve its mission of promoting collaborative work among researchers in humanities and social sciences, PECE privileges open data and open standards. It follows closely the best practices of Free and Open Source communities when dealing with privacy and security concerns. Open and full disclosure of any security problem is our responsibility. Following the Free and Open Source community practice, we do not hide problems from our users.

When designing and implementing software for our platform, we prioritized our research participants' right to anonymity, confidentiality, and privacy. PECE users (in the major roles of collaborators and researchers) are responsible for specifying the permission settings for every piece of content they upload, that is, whether a piece of data will be private (accessible only to registered researchers, PECE researchers with IRB approval, or collaborators) or public (also accessible to anonymous Internet users).

PECE was designed in accordance with the ethical guidelines of professional anthropological associations, such as the American Anthropological Association (AAA), Associação Brasileira de Antropologia (ABA), and the World Council of Anthropological Associations (WCAA). We follow the core ethical principles of protecting our research participants' rights to privacy, anonymity, and confidentiality. We aim to cause no harm to research participants or to any social group, directly or indirectly, as a consequence of research work on our platform. We subordinate our research goals to the ethical concerns and privacy needs of our research participants. Our goal is to encourage wider data sharing while protecting privacy: the sharing of research data should not come at the expense of protecting confidentiality. PECE will be used for academic research and dissemination of data and research results in collaboration with other researchers and academic collaborators: data contributed by researchers and contributors will be controlled by the contributors themselves through permission settings they must specify during data entry. We do not collect or share users' data or browser fingerprints, nor do we read, collect, or analyze communication between users.

We reserve the right to change this policy at any time. If we make changes, we will notify our users in a clear and prominent manner.


PECE Team
Alli Morgan, Brandon Costello-Kuhn, Dominic DiFranzo, Kim Fortun, Lindsay Poirier, Luis Felipe R. Murillo, Brian Callahan, Michael Fortun, Rodolfo Hernandez.
Contact: profmikefortun@gmail.com, lfmurillo@cyber.law.harvard.edu