CURATION & PUBLICATION POLICIES

Data Curation


Data curation involves the organization, description, representation, permanent publication, and preservation of datasets in compliance with community best practices and FAIR data principles. In the DDR, data curation is a joint responsibility between the researchers that generate data and the DDR team. Researchers understand better the logic and functions of the datasets they create, and our team's role is to help them make these datasets FAIR-compliant. 

Our goal is to enable researchers to curate their data from the beginning of a research project and turn it into publications through interactive pipelines and consultation with data curators. The DDR has and continues to invest efforts in developing and testing curation and publication pipelines based on data models designed with input from the NHERI community.  

Data Management Plan

For natural hazards researchers submitting proposals to the NSF using any of the NHERI network facilities/resources, or alternative facilities/resources, we developed Data Management guidelines that explain how to use the DDR for data curation and publication. See Data Management Plan at: https://www.designsafe-ci.org/rw/user-guides/ and https://converge.colorado.edu/data/data-management

Data Models

To facilitate data curation of the diverse and large datasets generated in the fields associated with natural hazards, we worked with experts in natural hazards research to develop five data models that encompass the following types of datasets: experimental, simulation, field research, hybrid simulation, and other data products (See: 10.3390/publications7030051; 10.2218/ijdc.v13i1.661) as well as lists of specialized vocabulary. Based on the Core Scientific Metadata Model, these data models were designed considering the community's research practices and workflows, the need for documenting these processes (provenance), and using terms common to the field. The models highlight the structure and components of natural hazards research projects across time, tests, geographical locations, provenance, and instrumentation. Researchers in our community have presented on the design, implementation and use of these models broadly. 

In the DDR web interface the data models are implemented as interactive functions with instructions that guide the researchers through the curation and publication tasks. As researchers move through the tion pipelines, the interactive features reinforce data and metadata completeness and thus the quality of the  publication. The process will not move forward if requirements for metadata are not in place (See Metadata in Best Practices), or if key files are missing.  

Metadata

Up to date, there is no standard metadata schema to describe natural hazards data. In DDR we follow a combination of standard metadata schemas and expert-contributed vocabularies to help users describe and find data. 

Embedded in the DDR data models are categories and terms as metadata elements that experts in the NHERI network contributed and deemed important for data explainability and reuse. Categories reflect the structure and components of the research dataset, and the terms describe these components.  The structure and components of the published datasets are represented on the dataset landing pages and through the Data Diagram presented for each dataset. 

Due to variations in their research methods, researchers may not need all the categories and terms available to describe and represent their datasets. However, we have identified a core set of metadata that allows proper data representation, explainability, and citation. These sets of core metadata are shown for each data model in our Metadata Requirements in Best Practices. 

To further describe datasets, the curation interface offers the possibility to add both predefined and custom file tags. Predefined file tags are specialized terms provided by the natural hazard community; their use is optional, but highly recommended. The lists of tags are evolving for each data model, continuing to be expanded, updated, and corrected as we gather feedback and observe how researchers use them in their publications. 

For purposes of metadata exchange and interoperability, the elements and tags in the data models are mapped to widely-used, standardized metadata schemas. These are: Dublin Core for description of the dataset project, DDI (Data Documentation Initiative) for social science data, DataCite for DOI assignment and citation, and PROV-O to record the structure of the dataset. Metadata mapping is substantiated as the data is ingested into Fedora. Users can download the standardized metadata in the publications landing page. 

Metadata and Data Quality

The diversity and quantity of research methods, instruments, and data formats in natural hazards research is vast and highly specialized. For this reason, we conceive of data quality as a collaboration between the researchers and the DDR.  In consultation with the larger NHERI network we are continuously observing and defining best practices that emerge from our community's understanding and standards.

We address data quality from a variety of perspectives:

Metadata quality: Metadata is fundamental to data explainability and reuse. To support metadata quality we provide onboarding descriptions of all metadata elements, indicate which ones are required, and suggest how to complete them. Requirements for core metadata elements are automatically reinforced within the publication pipeline and the dataset will not be published if those are not fulfilled.  Metadata can be accessed by users in standardized formats on the projects’ landing pages. 

Data content quality: Different groups in the NHERI network have developed benchmarks and guidelines for data quality assurance, including StEER, CONVERGE and RAPID. In turn, each NHERI Experimental Facility has methods and criteria in place for ensuring and assessing data quality during and after experiments are conducted. Most of the data curated and published along NHERI guidelines in the DDR are related to peer-reviewed research projects and papers, speaking to the relevance and standards of their design and outputs. Still, the community acknowledges that for very large datasets the opportunity for detailed quality assessment emerges after publication, as data are analyzed and turned into knowledge. Because work in many projects continues after publication, both for the data producers and reusers, the community has the opportunity to version datasets. 

Data completeness and representation: We understand data completeness as the presence of all relevant files that enable reproducibility, understandability, and therefore reuse. This may include readme files, data dictionaries and data reports, as well as data files. The DDR complies with data completeness by recommending and requesting users to include required data to fullfill the data model required categories indispensable for a publication understandibility and reuse. During the publication process the system verifies that those categories have data assigned to them.The Data Diagram on the landing page reflects which relevant data categories are present in each publication. A similar process happens for metadata during the publication pipeline; metadata is automatically vetted against the research community’s Metadata Requirements before moving on to receive a DOI for persistent identification.

We also support citation best practices for datasets reused in our publications. When users reuse data from other sources in their data projects, they have the opportunity to include them in the metadata through the Related Works and Referenced Data fields. 

Data publications review: Once a month, data curators meet to review new publications. These reviews show us how the community is using and understanding the models, and allows verifying the overall quality of the data publications. When we identify curation problems (e.g. insufficient or unclear descriptions, file or category  misplacement, etc.) that could not be automatically detected, we contact the researchers and work on solving these issues. Based on the feedback, users have the possibility to amend/improve their descriptions and to version their datasets (See amends and version control).

Curation and Publication Assistance 

We believe that researchers are best prepared to tell the story of their projects through their data publications; our role is to enable them to communicate their research to the public by providing flexible and easy to use curation resources and guidance that abide by publication best practices. To support researchers organizing, categorizing and describing their data, we provide interactive pipelines with onboarding instructions, different modes of training and documentation, and one-on-one help.

Interactive pipelines: The DDR interface is designed to facilitate large scale data curation and publication through interactive step-by-step capabilities aided by onboarding instructions. This includes the possibility to categorize and tag multiple files in relation to the data models, and to establish relations between categories via diagrams that are intuitive to data producers and easy to understand for data consumers. Onboarding instructions including vocabulary definitions, suggestions for best practices, availability of controlled terms, and automated quality control checks are in place. 

One-on-one support: We hold virtual office hours twice a week during which a curator is available for real-time consulting with individuals and teams. Other virtual consulting times can be scheduled on demand. Users can also submit Help tickets, which are answered within 24 hours, as well as send emails to the curators. Users also communicate with curatorial staff via the DesignSafe Slack channel. The curatorial staff includes a natural hazards engineer, a data librarian, and a USEX specialist. Furthermore, developers are on call to assist when needed. 

Guidance on Best Practices: Curatorial staff prepares guides and video tutorials, including special training materials for Undergraduate Research Experience students and for Graduate Students working at Experimental Facilities.

Webinars by Researchers: Various researchers in our community contribute to our curation and publication efforts by conducting webinars in which they relay their data curation and publication experiences. Some examples are webinars on curation and publication of hybrid simulations, field research and social sciences datasets.