Best Practices

Data Publication


Protected Data

Protected data in the Data Depot Repository (DDR) are generally (but not always) included within interdisciplinary and social science research projects that study human subjects, which always require approval from an Institutional Review Board (IRB). We developed a data model and onboarding instructions in coordination with our CONVERGE partners to manage this type of data within our curation and publication pipelines. Additionally, CONVERGE has a series of check sheets that outline how researchers should manage data that could contain sensitive information; these check sheets have also been published in the DDR.

Natural hazards research also encompasses data with granular geographical locations and images that may capture people who are not the focus of the research or would not fall under the purview of an IRB. See both the Privacy Policy within our Terms of Use

Data de-identification, especially for large datasets, can be demanding. Users working with the RAPID facility may discuss pre-processing steps with the facility before uploading to DesignSafe.

To publish protected data, researchers should follow these steps and requirements:

  1. Do not publish HIPAA, FERPA, FISMA, PII data or sensitive information in the DDR.

  2. Protected data and any related documentation (reports, planning documents, field notes, etc.) must be processed to remove identifying information before publication. No direct identifiers and up to three indirect identifiers are allowed. Direct identifiers include items such as participant names, participant initials, facial photographs (unless expressly authorized by participants), home addresses, phone numbers, email addresses, vehicle identifiers, biometric data, names of relatives, social security numbers and dates of birth or other dates specific to individuals. Indirect identifiers are identifiers that, taken together, could be used to deduce someone’s identity. Examples of indirect identifiers include gender, household and family compositions, places of birth, year of birth or age, ethnicity, general geographic indicators like postal code, and socioeconomic data such as occupation, education, workplace or annual income.

  3. Look at the de-identification resources below to find answers for processing protected data.

  4. If a researcher has obtained consent from the subjects to publish PII (images, age, address), this should be clearly stated in the publication description. Without exception, the IRB documentation, including the informed consent statement, should also be available in the documentation.

  5. If a researcher needs to restrict public access to data because it includes HIPAA, FERPA, PII or other sensitive information, or if de-identification precludes the data from being meaningful, it is possible to publish only metadata about the data in the landing page along with descriptive information about the dataset. The dataset will show as Restricted.

  6. IRB documentation should be included in the publication in all cases so that users clearly understand the restrictions imposed for the protected data. In addition, authors may publish the dataset instrument, provided that it does not include any form of protected information. 

  7. Users interested in restricted data can contact the project PI or designated point of contact through their published email address to request access to the data and to discuss the conditions for its reuse.

  8. The responsibility for maintaining and managing a restricted dataset for the long term lies with the authors, and they can use TACC's Protected Data Services if they see fit.

  9. Please contact DDR through a help ticket or join curation office hours prior to preparing this type of publication.
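
As a first-pass aid for step 2, the rule "no direct identifiers and up to three indirect identifiers" can be sketched as a check on the column names of a tabular dataset. This is only an illustration under stated assumptions: the identifier lists below are abbreviated from the examples in step 2, the column naming scheme and the function name screen_columns are hypothetical, and a name-based check cannot detect identifiers hidden in free text, images or file names. It does not replace manual review or the resources listed below.

```python
# Hypothetical, abbreviated lists based on the examples in step 2.
# Real datasets will use other column names; extend these as needed.
DIRECT_IDENTIFIERS = {
    "name", "initials", "home_address", "phone_number", "email",
    "social_security_number", "date_of_birth",
}
INDIRECT_IDENTIFIERS = {
    "gender", "household_composition", "place_of_birth", "year_of_birth",
    "age", "ethnicity", "postal_code", "occupation", "education",
    "workplace", "annual_income",
}
MAX_INDIRECT = 3  # "up to three indirect identifiers are allowed"


def screen_columns(columns):
    """Return (direct, indirect, ok) for an iterable of column names.

    ok is True only when no direct identifiers are present and at most
    MAX_INDIRECT indirect identifiers are present.
    """
    normalized = {c.strip().lower().replace(" ", "_") for c in columns}
    direct = sorted(normalized & DIRECT_IDENTIFIERS)
    indirect = sorted(normalized & INDIRECT_IDENTIFIERS)
    ok = not direct and len(indirect) <= MAX_INDIRECT
    return direct, indirect, ok


if __name__ == "__main__":
    direct, indirect, ok = screen_columns(["Age", "Gender", "damage_state"])
    print(direct, indirect, ok)
```

A dataset that fails this check still needs the de-identification steps described above; a dataset that passes it still needs a human review.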

De-identification Resources

The NISTIR 8053 publication De-Identification of Personal Information provides all the definitions and approaches to reduce privacy risk and enable research. 

Another NIST resource including de-identification tools. 

Johns Hopkins Libraries Data Services Applications to Assist in De-identification of Human Subjects Research Data

Reusing Data Resources in your Publication

Researchers frequently use data, code, papers or reports from other sources in their experiments, simulations and field research projects as input files, to integrate with data they create, or as references, and they may want to republish them. It is good practice to make sure that these data can be reused and republished appropriately, and to give credit to the data creators. Citing reused sources is also important to provide context and provenance to the project. In the DDR you can republish or reference reused data following these guidelines:

  1. If you use external data in a specific experiment, mission or simulation, cite it in the Referenced Data field.

  2. Use the Related Work field at project level to include citations for the data you reused as well as your own publication related to the data reuse.

  3. Include the cited resource title and corresponding DOI in https format; this way, users will be directed to the cited resource. 

  4. Be aware of the original license of the reused data. The license will specify if and how you can modify, distribute, and cite the reused data.

  5. If you have reused images from other sources (online, databases, publications, etc.), be aware that they may be copyrighted. We recommend using these instructions for how to use and cite them.
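
For item 3, DOIs copied from papers or websites often arrive in different forms ("doi:10...", "http://dx.doi.org/10...", or the bare identifier). The following is a minimal sketch of how such strings could be normalized to the https form before pasting them into the Referenced Data or Related Work fields. The function name doi_to_https and the pattern used to recognize a DOI are our own assumptions, not part of the DDR interface, and the DOI in the example is a placeholder, not a real record.

```python
import re

# Loose pattern for a bare DOI: "10.", a 4-9 digit registrant code,
# a slash, and a non-empty suffix. This is an assumption that covers
# common modern DOIs, not a full validation of the DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes commonly found in front of a DOI.
PREFIX_PATTERN = re.compile(
    r"^(https?://(dx\.)?doi\.org/|doi:\s*)", flags=re.IGNORECASE
)


def doi_to_https(doi):
    """Return the https://doi.org/ URL form of a DOI string."""
    cleaned = PREFIX_PATTERN.sub("", doi.strip())
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    return f"https://doi.org/{cleaned}"


if __name__ == "__main__":
    # Placeholder DOI for illustration only.
    print(doi_to_https("doi:10.1234/example-5678"))
```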

Timely Data Publication 

Although no firm timeline requirements are specified for data publishing, researchers are expected to publish in a timely manner. Recommended timelines for publishing different types of research data (i.e., Experimental, Simulation, and Reconnaissance) are listed in Table 1.

Guidelines specific to RAPID reconnaissance data can be found at rapid.designsafe-ci.org/media/filer_public/b3/82/b38231fb-21c9-41f8-b658-f516dfee87c8/rapid-designsafe_curation_guidelines_v3.pdf
 

Table 1. Recommended Publishing Timeline for Different Data Types

Project/Data Type                       | Recommended Publishing Timeline
Experimental                            | 12 months from completion of experiment
Simulation                              | 12 months from completion of simulations
Reconnaissance: Immediate Post-Disaster | 3 months from returning from the field
Reconnaissance: Follow-up Research      | 6 months from returning from the field
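
To plan ahead, the recommended timelines in Table 1 can be turned into a target date. The sketch below is only a planning aid: the dictionary keys and function names are hypothetical labels we chose for the four rows of the table, and since the timelines are recommendations, the computed date is a target rather than a deadline.

```python
import calendar
from datetime import date

# Months recommended in Table 1; the keys are hypothetical labels.
RECOMMENDED_MONTHS = {
    "experimental": 12,
    "simulation": 12,
    "reconnaissance_immediate": 3,
    "reconnaissance_followup": 6,
}


def add_months(start, months):
    """Add calendar months to a date, clamping to the month's last day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def recommended_publish_by(data_type, reference_date):
    """Target date counted from the completion or field-return date."""
    return add_months(reference_date, RECOMMENDED_MONTHS[data_type])


if __name__ == "__main__":
    # A team returning from the field on 30 November 2023.
    print(recommended_publish_by("reconnaissance_immediate", date(2023, 11, 30)))
```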


Public Accessibility Delay

This refers to the time during which a project is not made broadly accessible while awaiting the review and publication of a corresponding journal paper. In some cases, users need the data DOI to submit their manuscript for review. We work with users on the following options:

  • Provide access to reviewers via My Projects before the data is published. There is no DOI involved and the review is not anonymous.

  • Curate and complete the dataset so that it receives a DOI, and make the DataCite DOI metadata not web indexable. The publication will be findable within DDR. It will eventually be indexed by search engines.

  • Curate and complete the dataset so that it receives a DOI, and remove the publication from the list of published datasets in the DDR.

Users who request that their publications remain non-indexable and/or non-viewable should let the DDR team know through a Help ticket when the respective paper is accepted so that we can change the publication's status. Without exception, all non-indexable and non-viewable publications in the DDR will be made publicly available after one year, even if the corresponding paper is not published.

We ask users to submit a Help ticket to Data Curation and Publication so we can work with them on the best alternative for their case. See our Public Accessibility Delay Policy for more information on the limitations and expectations for accessibility delays.

Licensing

Within DesignSafe, you will choose a license to distribute your material. We offer licenses with few to no restrictions because, by placing fewer demands on reusers, they are more effective at enabling reproducible science. Because the DesignSafe Data Depot is an open repository, the following licenses are offered:

  • For datasets: ODC-PDDL and ODC-BY
  • For copyrightable materials (for example, documents, workflows, designs, etc.): CC0 and CC-BY
  • For code: GPL

You should select appropriate licenses for your publication after identifying which license best fits your needs and institutional standards. Note that datasets are not copyrightable materials, but works such as reports, instruments, presentations and learning objects are.

Please select only one license per publication with a DOI. 
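
The pairing of material types and licenses listed above can be written down as a small lookup, which may be handy for teams that prepare several publications and want to check their choices beforehand. This is a sketch only: the material type keys and the function name check_license are hypothetical, and the license abbreviations are the ones used in the list above.

```python
# License abbreviations as listed above; the keys are hypothetical labels.
LICENSES_BY_MATERIAL = {
    "dataset": ("ODC-BY", "ODC-PDDL"),
    "work": ("CC-BY", "CC0"),
    "software": ("GPL",),
}


def check_license(material_type, license_id):
    """Return True if license_id is offered for material_type."""
    try:
        offered = LICENSES_BY_MATERIAL[material_type]
    except KeyError:
        raise ValueError(f"unknown material type: {material_type!r}") from None
    return license_id in offered


if __name__ == "__main__":
    print(check_license("dataset", "ODC-BY"))  # offered for datasets
    print(check_license("dataset", "CC-BY"))   # offered for works, not datasets
```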


Available Licenses for Publishing Datasets in DesignSafe

DATASETS

If you are publishing data, such as simulation or experimental data, choose between:

Open Data Commons Attribution
Recommended for datasets

  • You allow others to freely share, reuse, and adapt your data/database.
  • You expect to be attributed for any public use of the data/database.

Please read the License Website

Open Data Commons Public Domain Dedication
Consider and read carefully

  • You allow others to freely share, modify, and use this data/database for any purpose without any restrictions.
  • You do not expect to be attributed for it.

Please read the License Website

WORKS

If you are publishing papers, presentations, learning objects, workflows, designs, etc., choose between:

Creative Commons Attribution
Recommended for reports, instruments, learning objects, etc.

  • You allow others to freely share, reuse, and adapt your work.
  • You expect to be attributed for any public use of your work.
  • You retain your copyright.

Please read the License Website

Creative Commons Public Domain Dedication
Consider and read carefully

  • You allow others to freely share, modify, and use this work for any purpose without any restrictions.
  • You do not expect to be attributed for it.
  • You give all of your rights away.

Please read the License Website

SOFTWARE

If you are publishing community software, scripts, libraries, applications, etc., choose the following:

GNU General Public License

  • You give permission to modify, copy, and redistribute the work or any derivative version.
  • The licensee is free to choose whether or not to charge a fee for services that use this work.
  • The licensee cannot impose further restrictions on the rights granted by this license.

Please read the License Website

Subsequent Publishing

With the exception of Project Type Other, which is a one-time publication, the DDR makes it possible to publish datasets or works subsequently. A project can be conceived as an umbrella under which reports or learning materials, code, and datasets from distinct experiments, simulations, hybrid simulations or field research missions can be published at different times, whether because they happen at different time periods, involve distinct authors, or need to be released more promptly. Each new product will have its own citation and DOI, and users may select a different license if that is appropriate for the material (e.g., a user publishing a data report will use a Creative Commons license, and an Open Data Commons license to publish the data). The subsequent publication will be linked to the umbrella project via the citation, and to the other published products in the project through metadata.

After a first publication, users can upload more data, create a new experiment, simulation, hybrid simulation or mission, and proceed to curate it. Users should be aware that, at present, they cannot publish the new product themselves through the publication pipeline. After curation and before advancing through the Publish My Project button, they should submit a Help ticket or attend curation office hours so that the DDR team can assist and publish the new product.

Amends and Version Control 

Once a dataset is published, users can do two things to improve and/or continue their data publication: amends and version control. Amends involve correcting certain metadata fields without making major changes to the existing published record, and version control covers changes to the data. Once a dataset is published, however, we do not allow title or author changes. If those changes need to be made due to omission or mistake, users have to submit a Help ticket and discuss the change with the data curator. If applicable, changes will be made by the curation team.

Amends include:

  • Improving descriptions: often after the curator reviews the publication, or following versioning, users need to clarify or enhance descriptions.
  • Adding related works: when papers citing a dataset are published, we encourage users to add the references in Related Works to improve data understandability, cross-referencing and citation count.
  • Changing the order of authors: even though DDR has interactive tools to set the order of authors in the publication pipeline, users sometimes require further changes after publication due to oversight.

Version control includes:

  • Adding files to or deleting files from a published dataset.
  • Documenting the nature of the changes, which will show publicly in the landing page.
  • Descriptions of the nature of the changes are displayed for users to see what changed and are stored as metadata.
  • In the citation and landing pages, different versions of a dataset will have the same DOI and different version numbers.
  • The DOI will always resolve to the latest version of the data publication. 
  • Users will always be able to access previous versions through the landing page.

When using Amend and Version, take the following into consideration:

Amend only updates the latest version of a publication (if there is only one version, that version is the target). Only the specified fields in the metadata form will be updated. The order of authors must be confirmed before the amendments can be submitted.

Version will create a new published version of a project. This pipeline will allow you to select a new set of files to publish, and whatever is selected in the pipeline is what will be published, nothing else. Additionally, the order of authors can be updated. 

Important: Any changes to the project’s metadata will also be updated (this update is limited to the same fields allowed in the Amend section), so there is no need to amend a newly versioned project unless you have made a mistake in the latest version.
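
The versioning behavior described above (one DOI for all versions, the DOI resolving to the latest version, earlier versions remaining accessible, and each version containing only the files selected for it) can be pictured with a small data model. This is a conceptual sketch, not the DDR implementation: the class and method names are hypothetical, and the DOI in the example is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class PublicationVersion:
    number: int
    files: list
    change_note: str


@dataclass
class Publication:
    doi: str  # shared by every version of the publication
    versions: list = field(default_factory=list)

    def publish_version(self, files, change_note=""):
        """Publish a new version containing exactly the selected files."""
        version = PublicationVersion(
            number=len(self.versions) + 1,
            files=list(files),
            change_note=change_note,
        )
        self.versions.append(version)
        return version

    def resolve(self):
        """What the DOI points to: always the latest version."""
        return self.versions[-1]

    def get_version(self, number):
        """Earlier versions stay accessible by version number."""
        return self.versions[number - 1]


if __name__ == "__main__":
    pub = Publication(doi="10.1234/example-5678")  # placeholder DOI
    pub.publish_version(["a.csv"])
    pub.publish_version(["a.csv", "b.csv"], change_note="Added b.csv")
    print(pub.resolve().number, pub.get_version(1).files)
```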

Leave Data Feedback

We welcome feedback from users about the published datasets. To leave feedback, users can click on the "Leave Feedback" button at the top of the data presentation on the data publication landing pages. We suggest that feedback be written in positive, constructive language. The following are examples of feedback questions and concerns:

  • Questions about the dataset that are not answered in the published metadata and/or documentation.
  • Missing documentation.
  • Questions about the method/instruments used to generate the data.
  • Questions about data validation.
  • Doubts/concerns about data organization and/or inability to find desired files.
  • Interest in bibliography about the data/related to the data.
  • Interest in reusing the data.
  • Comments about the experience of reusing the data.
  • Request to access raw data if not published.
  • Congratulations.

Marketing Datasets

Datasets take a lot of work to produce; they are important research products. By creating a complete, organized, and clearly described publication in DDR, users are inviting others to reuse and cite their data. Researchers using published data from DDR must cite it using the DOI, which relies on the DataCite schema for accurate citation. For convenience, users can retrieve a formatted citation from the published data landing page. It is recommended to insert the citations in the reference section of the paper to facilitate citation tracking and count.

When using social media or any presentation platform to communicate research, it is important to include the proper citation and DOI in presentations, emails, tweets, professional blog posts, etc. A researcher does not actually need to reuse a dataset to cite it, but rather may cite it to point out or review something about a dataset (e.g., how it was collected, its uniqueness, certain facts, etc.). This is similar to the process of citing other papers within a literature review section.