Data Curation & Publication Guidelines

Guidance and best practices for publishing data

 

Introduction

Data curation generates data that is easy to find and re-use because it has been organized and described. Below is guidance for data curation using the NHERI DesignSafe cyberinfrastructure (CI) to share and publish natural hazards engineering data. If questions or issues arise, on-demand assistance from a curator is available through the DesignSafe ticketing system.

Data curation is made up of all the activities undertaken to generate organized and documented data that is easy to re-use.  Using data management tools in DesignSafe, researchers are empowered to progressively curate their own data as their research progresses. When curation is complete, researchers can publish the dataset with a permanent digital object identifier (DOI) that allows the data to be easily located on the web and cited.  Features are in place to ensure the authenticity, integrity, security and persistence of the datasets for open access. DesignSafe is committed to the continuity of data preservation beyond the conclusion of the DesignSafe project.

To cite the use of DesignSafe in your research, please reference the following paper:

Rathje, E., Dawson, C., Padgett, J.E., Pinelli, J.-P., Stanzione, D., Adair, A., Arduino, P., Brandenberg, S.J., Cockerill, T., Dey, C., Esteva, M., Haan, Jr., F.L., Hanlon, M., Kareem, A., Lowes, L., Mock, S., and Mosqueda, G. 2017. “DesignSafe: A New Cyberinfrastructure for Natural Hazards Engineering,” ASCE Natural Hazards Review, doi: 10.1061/(ASCE)NH.1527-6996.0000246

 

Data Sharing and Publishing

DesignSafe provides an end-to-end data management, analysis and publication platform for both experimental and simulation-based research.  Within the DesignSafe Data Depot, researchers have access to a private “My Data” space, a collaborative “My Projects” space, and a “Published” space for published datasets.

Any files from a research project (data, processing scripts, analysis products, models, etc.) can be stored in DesignSafe from the start of the project and shared among project team members. They are kept private within a Project space until the research team publishes them. Files can be curated for eventual publication from the moment they are uploaded to a project, which eases the burden of this work at the end of the project.

Research teams curate their own data in DesignSafe, using tools provided in the CI’s Data Depot. These tools facilitate organizing, categorizing and describing data. After researchers curate their data and request to publish it in the CI, the data is vetted to ensure that it meets minimum descriptive requirements (see details below), and then receives a permanent digital object identifier (DOI) for persistent identification and ease of data sharing and reuse on the web.

Researchers using published data from the DesignSafe Data Depot must cite it using the DOI, which relies on the DataCite schema for accurate citation (http://schema.datacite.org/).
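As an illustration of DOI-based citation, the following minimal sketch assembles a citation string from a few DataCite-style fields, loosely following the "Creator (PublicationYear): Title. Publisher. Identifier" pattern. The function name, the field values, and the DOI shown are hypothetical placeholders, not a real dataset or a DesignSafe API; always use the citation displayed on the dataset's landing page.

```python
def format_citation(creators, year, title, publisher, doi):
    """Build a citation string from DataCite-style fields.

    Pattern assumed here: Creator (PublicationYear): Title. Publisher. Identifier
    """
    authors = "; ".join(creators)
    return f"{authors} ({year}): {title}. {publisher}. https://doi.org/{doi}"


# Placeholder values for illustration only (not a real dataset or DOI).
print(format_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2020,
    title="Example Dataset",
    publisher="DesignSafe-CI",
    doi="10.0000/example",
))
```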

 

Responsibilities and Timelines

Researchers working at a NHERI Experimental Facility (EF) will receive their bulk data files via the Data Depot. NHERI EF staff will deposit the data files into an existing Project created for the research project. For all other types of research (e.g., simulation, or experimental work performed at a non-NHERI lab), the research team is responsible for uploading its data to the Data Depot. As noted previously, the research team is also responsible for data curation and publishing. Although no firm timeline requirements are specified for data publishing, researchers are expected to publish in a timely manner. Recommended timelines for publishing different types of research data (i.e., Experimental, Simulation, and Reconnaissance) are listed in Table 1.

Table 1.  Recommended Publishing Timeline for Different Data Types

| Project/Data Type | Recommended Publishing Timeline |
| --- | --- |
| Experimental | 12 months from completion of experiment |
| Simulation | 12 months from completion of simulations |
| Reconnaissance: Immediate Post-Disaster | 3 months from returning from the field |
| Reconnaissance: Follow-up Research | 6 months from returning from the field |
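For planning purposes, the recommended windows in Table 1 can be turned into target dates. The sketch below is one possible way to do that; the dictionary keys and function names are hypothetical, and clamping to the last day of a shorter month is an assumption of this example rather than a DesignSafe rule.

```python
import calendar
from datetime import date

# Recommended publishing windows from Table 1, in months.
# The key names are hypothetical labels chosen for this sketch.
RECOMMENDED_MONTHS = {
    "experimental": 12,
    "simulation": 12,
    "reconnaissance_immediate": 3,
    "reconnaissance_follow_up": 6,
}


def add_months(start, months):
    """Return the date `months` after `start`, clamped to the month's last day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def recommended_publish_by(data_type, milestone):
    """Milestone is the experiment/simulation completion or field-return date."""
    return add_months(milestone, RECOMMENDED_MONTHS[data_type])


# A team returning from the field on 30 Nov 2023 (immediate post-disaster).
print(recommended_publish_by("reconnaissance_immediate", date(2023, 11, 30)))
```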

 

Licensing

Within DesignSafe, you will choose a license for your material. Because the DesignSafe Data Depot is an open repository, the following licenses will be offered:

  • For datasets: ODC-PDDL and ODC-BY
  • For copyrightable materials (for example, documents, workflows, designs, etc.): CC0 and CC-BY
  • For code: any open, non-commercial license (for example, GPL)

Tables 2-4 summarize these licenses to help you identify which one best fits your needs and institutional standards. Note that datasets are not copyrightable materials.

Tables 2-4. Datasets, Copyrightable Materials & Code Licensing

DATASETS

| License | Full Name | Features |
| --- | --- | --- |
| ODC-PDDL | ODC Public Domain Dedication and License | Allows users to freely share, modify, and use a work for any purpose and without any restrictions. This license is intended for use on databases or their contents (“data”), either together or individually. |
| ODC-BY | Open Data Commons Attribution License | Allows users to freely copy, distribute, and use databases and the data they contain; to create new works from the database and/or data; and to modify, transform and build upon the data and/or database, with the restriction that the data and/or database must be cited for attribution. |

 

MATERIALS
(e.g., publications, white papers, presentations, learning objects, workflows, designs, etc.)

| License | Full Name | Features |
| --- | --- | --- |
| CC0 | Creative Commons Public Domain Dedication License | Work is dedicated to the public domain by waiving all rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. The work may be copied, modified, distributed and performed, even for commercial purposes, all without asking permission. |
| CC-BY | Creative Commons Attribution License | Work may be copied, distributed, adapted, remixed, transformed and built upon in any medium or format, with the requirement that it is cited for attribution. |

 

CODE
(e.g., community software, scripts, libraries, applications, etc.)

| License | Full Name | Features |
| --- | --- | --- |
| GPL | GNU General Public License | Software/code may be copied, distributed and modified as long as changes and their dates are tracked in source files. Any modifications to GPL-licensed code, and any software that includes GPL-licensed code (via compiler), must also be made available under the GPL, along with build and install instructions. |

 

Data Archiving and Preservation

Depositing your data and associated research project materials in the DesignSafe Data Depot will meet NSF requirements for data management. DesignSafe will persistently maintain all uploaded data on storage resources at the Texas Advanced Computing Center, and these resources are redundant and geographically replicated. DesignSafe operates a dedicated Fedora repository platform to ensure the authenticity, integrity, security and persistence of published datasets for open access. 

 

Minimum Metadata Requirements

Overview of minimal metadata implementation in DesignSafe

Metadata is information that describes data. Schemas provide a structured way for users to share metadata within and across domains. Schemas typically include labels (or elements) and their definitions, and may have different levels of descriptive granularity.

Because there is no standard schema to describe natural hazards engineering research and data, DesignSafe offers metadata sets to describe key components of datasets in the CI. These were developed in close consultation with researchers and Experimental Facilities. All of the community terms gathered for use in DesignSafe are documented in an interactive meta-dictionary called YAMZ, along with their definitions. The terms are evolving, and they can and will be expanded, updated and corrected with input from the community as DesignSafe curation pipelines become more widely adopted.

DesignSafe’s metadata approach maps community terms to elements of widely-used, standardized schemas so that metadata can be exchanged with other platforms. Elements from the NEES data model are also included. The schemas to which terms have been mapped are: Dublin Core for description of the research project and the data publication, PROV to display provenance relationships between data and the processes from which it derives, and DataCite for DOI assignment and citation.

Due to variations in research domains and their methods, users may not need to use all of the elements available to describe their research. However, we identified a minimum set of metadata terms that represent the structure of the data, are useful for discovery, and will allow proper citation of data. When users request to publish data in the CI, the system will check for completeness of these core terms and whether data are associated with them. To ensure the quality of published data in DesignSafe’s Data Depot, publication will be granted once the minimum standard is completed by the user or research team. The minimum element set is shown below.
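The completeness check described above could be sketched roughly as follows. This is an illustrative assumption about how such a check might work, not DesignSafe's implementation; the term names ("title", "description", "authors", "files") are hypothetical stand-ins and do not represent the actual minimum element set.

```python
# Hypothetical term names for illustration only; the actual minimum
# element set is defined by DesignSafe, not by this sketch.
REQUIRED_TERMS = ["title", "description", "authors"]


def is_filled(value):
    """Return True if a metadata value carries real content."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def publication_problems(record, required_terms=REQUIRED_TERMS):
    """List the reasons a record would fail the completeness check."""
    problems = [
        f"missing required term: {term}"
        for term in required_terms
        if not is_filled(record.get(term))
    ]
    # The check also covers whether data files are attached to the record.
    if not is_filled(record.get("files")):
        problems.append("no data files associated with the record")
    return problems


# A draft with no description and an empty author list.
draft = {"title": "Example shake table test", "authors": [], "files": ["run1.csv"]}
for problem in publication_problems(draft):
    print(problem)
```

An empty list returned by `publication_problems` would correspond to a record that meets the minimum standard in this simplified model.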

KEY (to help understand usage of the terms in the minimum set)

(bold) Denotes the structure of the data. For example, an experimental project may have more than one experiment and more than one corresponding analysis.

(*) The metadata is repeatable, with multiple entries allowed.

($) Recommended if exists. For example, not every project will include an analysis.