Best Practices for Data Curation and Publication

What is Data Curation?

Data curation involves organizing, describing, quality-controlling, preserving, and providing access to data so that it is easy to reuse, with the goal of making your data publication FAIR and ensuring that it will be useful for generations to come.

Step-by-step data curation instructions can be found in How to Curate Data?, and we strongly recommend following the onboarding instructions in the My Project curation and publication interface. We are available for additional help during our Virtual Office Hours and via help tickets. The DDR Policies communicate requirements for managing and publishing data. General curation best practices exist for Curation Quality and Curating Various Types of Research Data.

Curation Quality

Curation quality involves ensuring the accuracy, completeness, consistency, and reliability of your dataset. In the DDR, curation quality policies as well as the interactive curation and publication functions are geared towards ensuring excellence in data publications. Each data publication is unique; it reflects and provides evidence of the research work of individuals and teams. It is the user's responsibility to publish data that meets the best standards of their profession, and the DDR's commitment is to help them achieve those standards. In addition, the following best practices are highly recommended to ensure that a dataset is curated for reusability and reproducibility:

  • Check for completeness of data transfer. Users sometimes upload folders with large numbers of files, and an interrupted transfer can leave some files behind. Please check that every folder contains all the files intended; a minimal verification sketch appears after this list.
  • Use quality control methods to review the data for errors or to improve its fitness for the research application. Methods may involve calibration, validation, normalization, resizing, improved resolution, cleaning, transformation to open formats, noise reduction, sub-sampling, performance testing, consolidation, further documentation, etc. Always explain in the data report or readme file the quality control methods you used so that other users are aware of them.
  • When publishing tabular data, include a data dictionary in the data report to explain the meaning of the column fields.
  • Data dictionaries are also useful to clarify acronyms, abbreviations, or codes for measurements used in your data or documentation.
  • It is possible to publish both raw and curated data. Raw data comes directly from recording instruments (cameras, apps, sensors, scanners, etc.). When raw data is corrected, calibrated, reviewed, edited, or post-processed in any way for publication, it is considered curated. Researchers who wish to publish both should carefully consider whether both sets are necessary and how another researcher might use them. Always clarify whether your data is raw or curated in the description, with file tags, or in a data report, and include information about the method used to curate it.
  • Researchers generate enormous numbers of images and other files. While there are no restrictions on the number of files in a project, be selective so as not to overwhelm users. Make sure that every image has a purpose and illustrates a process or a function, and avoid redundancy. Use file tags to describe images so that users know what you want to highlight. Apply the same approach to other data formats.
  • Researchers who publish large numbers of files (for example, simulation outputs or field research collections) should consider publishing scripts/tools that allow selection/visualization/post-processing of files of interest so users can review and subset the data.
  • When publishing scripts/code, always include a readme file that explains clearly how to use the scripts.
  • Before publishing scripts and notebooks, always make sure your code works properly.
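As a minimal sketch of the transfer check mentioned above, the following compares a local copy of a folder against the uploaded copy by file name and SHA-256 checksum. The folder paths are hypothetical, and large files may warrant chunked hashing instead of reading whole files into memory.

```python
import hashlib
from pathlib import Path

def manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

local = manifest(Path("local_copy/experiment_01"))      # hypothetical paths
uploaded = manifest(Path("uploaded_copy/experiment_01"))

missing = sorted(set(local) - set(uploaded))
mismatched = sorted(f for f in set(local) & set(uploaded)
                    if local[f] != uploaded[f])

print("missing files:", missing or "none")
print("checksum mismatches:", mismatched or "none")
```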

AI-Ready Curation Quality

AI-ready curation quality is about ensuring data is clean, organized, structured, unbiased, and includes the contextual information needed to support AI workflows, leading to secure and meaningful outcomes. Overall, it points to achieving research reproducibility. DDR's policies and best practices cover many of the requirements for publishing AI-ready data, and citations show that many types of datasets published in DDR are used in AI applications. Additional recommendations apply to datasets specifically created to train particular ML models, or to serve as benchmarks and testbeds across many applications. In those cases, beyond following all the applicable policies and best practices for publishing a quality dataset, researchers should also:

  • Reference the public model trained with the data in the Referenced Data and Software field.
  • Document in the data report the results of the trained model, including the model's performance on the published dataset. If the results are published in a paper, reference the paper in Related Work.
  • Reference any data that was reused to generate the training/benchmark dataset.

The idea is that AI-ready data showcases a network of resources: the data, the model, and the model's performance/results. The following are examples of datasets published in DDR that comply with AI readiness.

Del-Castillo-Negrete, C., B. Pachev, P. Arora, E. Valseth, C. Dawson (2023). "Alaska 1m Surge Events - Nome, Red Dog Dock, Unalakleet (1992 - 2022)", in Alaska Storm Surge Events. DesignSafe-CI. https://doi.org/10.17603/ds2-4rnb-j321

Massey, C., D. Townsend, B. Rosser, R. Morgenstern, K. Jones, B. Lukovic, J. Davidson (2021). "Version 2.0 of the landslide inventory for the Mw 7.8 14 November 2016, Kaikōura Earthquake", in V2.0 Landslide inventory for the Mw7.8 14 November 2016, Kaikōura Earthquake, New Zealand. DesignSafe-CI. https://doi.org/10.17603/ds2-1ftv-hm22

Curating Various Types of Research Data

Proprietary Formats

Many instruments used in natural hazards research produce proprietary file formats. Many of those files can be converted to open formats within their corresponding software prior to uploading to DDR. Excel and MATLAB file formats, in turn, are proprietary formats frequently used in this community for analysis. Instead of Excel spreadsheet files, it is best to publish data in CSV format so it can be used by different software. In some cases, however, conversion may distort the data structures, so the files cannot be converted. Always retain an original copy of any structured data before attempting conversions, and check the two against each other for fidelity. It is also possible to upload and publish both the proprietary and the converted version in the DDR, especially if you consider the proprietary format convenient for data reuse. A hedged conversion sketch follows.
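As a minimal sketch of such conversions, assuming pandas (with openpyxl) and SciPy are installed and using hypothetical file names, the following exports an Excel workbook and a simple MATLAB .mat file to CSV. Spot-check the output against the originals, since dtypes, dates, and precision can shift.

```python
import pandas as pd
from scipy.io import loadmat  # handles .mat files saved as v7.2 or earlier

# Excel workbook -> one CSV per sheet (reading .xlsx requires openpyxl)
sheets = pd.read_excel("sensor_log.xlsx", sheet_name=None)  # dict of DataFrames
for name, df in sheets.items():
    df.to_csv(f"sensor_log_{name}.csv", index=False)

# MATLAB .mat -> CSV; works for simple 2-D numeric arrays
mat = loadmat("shake_table_run.mat")
for name, values in mat.items():
    if not name.startswith("__"):  # skip loadmat's metadata entries
        pd.DataFrame(values).to_csv(f"shake_table_run_{name}.csv", index=False)
```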

When publishing proprietary files that are ubiquitous in the field, please refer to the following Data Curation Primers. 

Compressed Data

Users who upload data as a zip file should unzip it before curating and publishing, as zip files prevent others from directly viewing and making sense of the published data and the repository. If you upload compressed files to My Data, you can unzip them using the extraction utility available in the workspace before copying the data to My Project for curation and publication. A minimal programmatic alternative is sketched below.
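For those working in a notebook rather than with the workspace extraction utility, here is a minimal sketch using Python's standard library and a hypothetical archive path:

```python
import zipfile
from pathlib import Path

archive = Path("MyData/field_photos.zip")  # hypothetical path
target = archive.with_suffix("")           # extract to MyData/field_photos/

with zipfile.ZipFile(archive) as zf:
    zf.extractall(target)
    print(f"extracted {len(zf.namelist())} entries to {target}")
```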

Simulation Data

When curating and publishing simulation datasets, researchers should follow this best practices document. The document addresses the needs and recommendations of the numerical modeling community and is informed by the experience of engineers who conduct simulations. These best practices focus on attaining published datasets with precise descriptions of the simulation's design, access to the software used, and, when possible, the complete publication of inputs and all outputs. Tying these pieces together requires documentation to understand the research motivation, origin, processing, and functions of the simulation dataset in line with FAIR principles.

Geospatial Data

We encourage the use of recommended Geospatial data formats. Within the Data Depot Tools and Applications we provide open source software for users to share and analyze geospatial data. QGIS accommodates most open format datasets, and HazMapper is capable of visualizing geo-tagged photos and GeoJSON files. To access this software, users should create an account in DesignSafe.

Understanding that ArcGIS software is widespread in this community, it is possible to upload both proprietary and recommended geospatial data formats to the DDR. When publishing feature and raster files, make sure that all the files relevant for reuse, such as the projection and header files, are included in the publication. For shapefiles, for example, it is important to publish the .shp (the file that contains the geometry for all features), .shx (the file that indexes the geometry), and .dbf (the file that stores feature attributes in a tabular format) files, along with the .prj file that carries the projection definition. A quick completeness check is sketched below.
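As a minimal sketch of such a completeness check, using Python's standard library and a hypothetical shapefile name:

```python
from pathlib import Path

REQUIRED = {".shp", ".shx", ".dbf"}
RECOMMENDED = {".prj"}  # carries the projection definition

shp = Path("data/landslide_inventory.shp")  # hypothetical file name
present = {p.suffix.lower() for p in shp.parent.glob(shp.stem + ".*")}

print("missing required:", sorted(REQUIRED - present) or "none")
print("missing recommended:", sorted(RECOMMENDED - present) or "none")
```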

Point Cloud Data

Point cloud data should be published in open, non-proprietary formats that ensure long-term usability and interoperability (e.g., LAS/LAZ, E57). The standard formats for publication are LAS and its compressed counterpart LAZ, both of which are widely supported by geospatial software and community tools. These formats store the raw point measurements along with important metadata, including the coordinate reference system (CRS), which is essential to accurately locate the data in space and make it usable in mapping and analysis. In DesignSafe, Potree is available for users to view point cloud datasets. Through the Potree Converter application, non-proprietary point cloud files can be converted to a Potree-readable format for visualization in DesignSafe.

Authors are encouraged to:

  • Publish their point cloud data in LAS/LAZ format.
  • Create a HazMapper map (optional) to allow users to interactively preview the point cloud data from the dataset's landing page, supporting discovery, while the published LAS/LAZ files remain directly downloadable from DesignSafe. See HazMapper > Collaboration for details on how to create and publish a map on the project-level landing page.

Before publishing, ensure that:

  • Point cloud files are provided in open, non-proprietary formats (e.g., LAS, LAZ).
  • The coordinate reference system (CRS) is correctly defined and stored within the file metadata. This is critical so users know exactly where the data is located on the Earth's surface and can align it with other spatial datasets; see the sketch after this list.
  • A README file is provided that describes the directory structure, which files are located where, and a key to the file naming convention used. This "map" to your data will allow reusers to find the files they want to use.
  • Precursor/raw sensor files used to generate the LAS/LAZ may be included when applicable. Evaluate whether these are valuable for researchers who wish to reprocess the data. When publishing precursor/raw files, do not simply deposit a dump from the instruments: remove instrument-related files that are not useful for reuse, and clearly document in the README file or the Data Report how the raw files relate to the LAS/LAZ products and how they could be reused.
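If laspy is installed, a quick header check can confirm that a CRS is stored before you publish. This is a minimal sketch, assuming laspy 2.x and a hypothetical file name; reading .laz additionally requires a backend such as lazrs.

```python
import laspy  # assumes laspy 2.x; .laz files also need a LAZ backend

las = laspy.read("lidar_survey_tile_0042.laz")  # hypothetical file name
crs = las.header.parse_crs()  # returns a pyproj CRS, or None if undefined

if crs is None:
    print("No CRS stored in the header: define one before publishing.")
else:
    print(f"CRS: {crs.name}; {las.header.point_count:,} points")
```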

Research Software

In alignment with FAIR for Research Software (FAIR4RS) recommendations, research software published in DDR should be feature complete and designed to be actively maintained, enhanced, and versioned. Whether published as Project Type "Other" or Project Type "Research Software", it should include sufficient documentation about its provenance (CodeMeta metadata) and how to use it (Readme).

A CodeMeta JSON file should be generated online and uploaded to the GitHub repository, or uploaded along with the software files to Project Type "Other". When filling out the CodeMeta file, include as much information as possible, which allows others to understand, reuse, and attribute your software. A minimal sketch of such a file follows.
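As a minimal sketch of the shape of such a file, with hypothetical placeholder values (the online generator covers the full field list), written here with Python's standard library:

```python
import json

codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "surge-postprocessor",  # hypothetical project name
    "description": "Scripts to post-process storm surge simulation output.",
    "codeRepository": "https://github.com/example/surge-postprocessor",
    "programmingLanguage": "Python",
    "version": "1.0.0",
    "license": "https://spdx.org/licenses/BSD-3-Clause",
    "author": [
        {"@type": "Person", "givenName": "Ada", "familyName": "Example"}
    ],
}

with open("codemeta.json", "w") as f:
    json.dump(codemeta, f, indent=2)
```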

A Readme file including details of how to install and use the software should also be part of the publication. To be reproducible, the software publication needs to include documentation about, and links to, the libraries and dependencies needed to run it. Links to dependencies can be noted in the Referenced Data and Software entry, which is available in the curation forms. If allowed by their licenses, dependency files can and should be included as part of the publication package for ease of software installation. We ask researchers to test that the software can be installed according to the instructions and that it works properly at the time of publication.

Users often include small subsets of data or demo files with their research software publications. GitHub is not a data repository, so datasets should not be included in the GitHub repository. If the data cannot be published independently, software and data should be published as project type "Other". In that case, the source/provenance of the demo dataset should be noted in the Readme file and/or pointed at in the Referenced Data and Software entry.

Jupyter Notebooks as Research Software

If planning to publish a Jupyter Notebook in the DDR, the content must comply with the definition of research software available in the Research Software Publication Policy. Thus, the content of the Jupyter Notebook must be a contribution to solving a scientific problem. It should also comply with Research Software Best Practices outlined above.

Jupyter Notebooks, whether published via GitHub or as Project Type "Other", should be self-contained and should not depend on file paths on DesignSafe computing resources or external file systems. Please see the instructions on how to publish Jupyter Notebooks.

Many users publish notebooks along with a sample dataset or in connection with a published dataset. Always reference the dataset in the Referenced Data and Software entry. A sketch of keeping notebook paths self-contained follows.
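A common way a notebook ends up not being self-contained is a hard-coded path into DesignSafe's file system. Here is a minimal sketch of the alternative, using hypothetical file names:

```python
from pathlib import Path

import pandas as pd

# Resolve bundled sample data relative to the notebook's working directory
# rather than an absolute DesignSafe path; file names are hypothetical.
DATA_DIR = Path.cwd() / "sample_data"
df = pd.read_csv(DATA_DIR / "demo_records.csv")

# Avoid absolute paths like the following, which break outside DesignSafe:
# df = pd.read_csv("/home/jupyter/MyData/demo_records.csv")
df.head()
```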

Organizing and Naming Files

The DDR data models provide categories for users to organize their data in a standardized way. Categories highlight the main components of the dataset in relation to the research method used. Each category has a form that must be filled with metadata to describe and represent its characteristics, and there are onboarding instructions on what kind of information is suitable for each metadata field. Once files are categorized, do not modify them through an SSH connection or through Globus. If you need to make changes, remove the category, deselect the files, and start over.

Under each category, corresponding files are represented as a list to facilitate accessibility of the dataset. However, curation of large datasets may require further organization of files into folders, which can be done prior to uploading the data or in My Project. In doing so, it is best to avoid deeply nested folders, as browsing through folder hierarchies on the web is slow and makes it harder to know where files are located.

To avoid excessive nesting and improve navigation, users can devise a file naming convention, ideally during the research planning phase and prior to gathering the data. Well-constructed folder and file names make it possible to identify files by succinctly expressing their content and their relations to other files, conveying the key information you want others to have; a hedged example follows the naming recommendations below.

Naming Convention Ideas
  • Project acronym (good for branding the data)
  • Location/spatial coordinates
  • Type of data
  • Type of structure tested
  • Simulation or experiment type
  • Test/run number and parameters/conditions
  • Natural hazard
  • Damage type
  • Version number of file
Naming Convention Recommendations
  • File naming should be meaningful, both to the team and to others, and should be kept short.
  • Consider the folder name as complementary to the file naming convention.
  • File names should not have spaces, periods, accents, or special characters, as these may cause errors within the storage systems.
  • Always keep the file extension.
  • The meaning and components of the file naming convention should be documented in the Data Report so that others can identify files.
  • Do not repeat file or folder names across different experiments, collections and missions, or simulations, as this confuses other users. Always include another piece of information to distinguish files and folders across groupings.
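As a minimal sketch of one possible schema, combining elements from the lists above with hypothetical values:

```python
def make_name(project: str, hazard: str, data_type: str,
              run: int, version: int, ext: str) -> str:
    """Compose a file name from convention elements: underscores instead of
    spaces, and no periods other than the extension separator."""
    return f"{project}_{hazard}_{data_type}_run{run:03d}_v{version}.{ext}"

print(make_name("SURGE22", "hurricane", "waterlevel", run=3, version=1, ext="csv"))
# -> SURGE22_hurricane_waterlevel_run003_v1.csv
```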

For long-term preservation and interoperability purposes, we recommend and promote storing and publishing data in open formats. Below is an adaptation of the list of recommended formats for data and documentation by Stanford Libraries. Where available, we include a link to the Data Curation Primers, peer-reviewed documents that can be used as guidelines for curating research data.

Tagging

Users can tag folders and files for ease of data comprehension and reuse by others. While tagging is not required, we recommend it because it helps others identify file contents on the dataset landing page. The available tags are agreed-upon terms contributed by experts in the field. If the available tags do not apply, feel free to add custom tags; you may also submit a ticket informing the curation team about the need to incorporate them into the list. Using tags to clarify file functions is also a way of avoiding excessive folder nesting.

Writing Helpful Titles, Keywords, and Descriptions

Well-written titles, keywords, and descriptions are key for the understanding, discovery, and marketing of your research. Here are some tips for crafting them:

Except for the data type "Other," the Data Depot Repository (DDR) requires titles at both the project level and the dataset level (mission, experiment, simulation, hybrid simulation). The dataset level title serves as the primary citation title, while the project title functions as a complementary secondary title. The citation format is shown below:

Author/s (year of publication). "Dataset Title", in Project title 

Example Citation:

Stark, N., M. Gardner, M. Grilliot, A. Lyda, K. Dedinsky, J. Mueller, C. Pezoldt, J. Hubler, C. Castro-Bolinaga, A. Schueller, W. Zhan, M. Haefeli, S. Burghardt, M. Wondolowski, S. Holberg, M. Hassan, J. Parker, J. Laurel-Castillo, L. Eggensberger, E. Nichols, H. Herndon, P. Wang, M. Olabarrieta Lizaso, B. Raubenheimer, Y. Hashash, S. ADUSEI, N. Jafari (2025). "NEER/GEER: Hurricanes Helene & Milton Dataset", in Multidisciplinary Pre, During and Post Storm Data Collection. DesignSafe-CI. https://doi.org/10.17603/ds2-m8h3-5802

Project Level Titles

  • Remember that this is the sub-title of the dataset citation.
  • Keep titles short, between 50 and 60 characters long.
  • Use descriptive words that point to the key themes and characteristics of the research project.
  • Do not repeat the title of the dataset.
  • Do not repeat the title of a paper, which could create confusion.
  • Avoid acronyms, whether abbreviated or spelled out.

Dataset Level Titles

  • Remember that this is the title that will be featured in the citation.
  • Keep titles short, between 50 and 60 characters long.
  • Use descriptive words that point to the key themes and characteristics of the dataset.
  • Do not repeat the title of the research project or use "Dataset for: ...(title of research project)".
  • Do not repeat the title of a paper, to avoid duplication of DOIs, or use "Replication Dataset for: ...(title of the paper)".
  • Avoid acronyms, whether abbreviated or spelled out.

Keywords

  • Remember that other researchers use keywords to find datasets in the Data Depot or online. 
  • When applying keywords, researchers must think about how others would search for this particular dataset.
  • When applicable, use keywords to indicate the type of hazard, research method, technology, problem addressed, and purpose. 
  • Repeating words used in the description and titles as keywords increases the chances that the dataset will be discovered. 

Consult the CONVERGE check sheet for guidance in choosing helpful keywords. 

Descriptions

Except for the data type "Other", DesignSafe requires descriptions at the project level and at the dataset level (mission, experiment, simulation, hybrid simulation). The two descriptions are complementary. The project level description addresses the high-level objectives of the research, who participates, why the research is unique, how many datasets will be published, and who will benefit from them. The dataset level description focuses on the scope and content of the dataset and how it was obtained. Below are general recommendations that apply to both descriptions.

  • Minimum 200 words, maximum 300, for each level.

Project Level Descriptions

  • Begin with a general statement that provides context to the study by which the dataset was created (e.g., The system under investigation…). 
  • Mention the type(s) of hazard being studied (wind, earthquake, wildfire,  multi-hazard, etc.). 
  • Address the research problem that the data is helping to solve.
  • Do not copy the abstract of the paper, as that describes the research results and not the dataset itself.
  • Address who will benefit from reusing the data and how (reproducibility, generating new studies, validation, machine learning, etc.).
  • Use language that can reach experts as well as broader audiences.
  • Use words that you think will help or are used by others to find datasets like yours online. Repeat these words across the title, descriptions, and keywords to increase online discoverability.
  • Avoid using acronyms if possible. Spell out the full acronym if you must.

Dataset Level Descriptions

  • Datasets should be described as standalone research outputs so they can be understood independently from related research products such as a published paper or research code. Focus on describing the dataset. You may begin the text with "This dataset…"
  • Provide an overview of the methodology by which the dataset was obtained.
  • Provide a very brief overview of the scope and contents of the dataset and how it is organized. 
  • Indicate if the data was quality controlled; you may go into more detail in the Data Report.
  • Keep descriptions concise and engaging; further details about the dataset should be introduced in a Data Report.
  • Use language that can reach experts as well as layperson audiences.
  • Use words that you think others will use to find datasets like yours online.
  • Repeat these words across the title, description, and keywords to increase online discoverability.
  • Avoid using acronyms if possible. Spell out the full acronym if you must.

Writing Effective Data Reports

Understanding the provenance, structure, and functions of a dataset often requires more detail than can be conveyed in the metadata fields provided by the DDR landing pages. This is especially so for data generated from large-scale experiments and simulations, such as many of those housed in DDR. DDR recommends that datasets be published along with adequate documentation. In DDR this documentation takes the form of data reports, readme files, field research planning documents, or any other record that guides users to understand and reuse a dataset. Such documentation focuses on describing the methodology, characteristics, content, and organization of the datasets as well as any post-processing or quality control performed on them. This guide refers to these different documents as Data Reports.

Except for the "Other" project type, all the DDR project types have categories in which to include this kind of documentation. The "Other" project type is used by many researchers to publish databases, historical data, benchmark datasets, and ML training datasets, and these should also include a data report. For Simulation, Experimental, and Hybrid Simulation projects, there is a Report category to which the data report can be assigned. In Field Research, the documentation can be included under the Document Collection or the Research Planning Documents categories.

Below are recommendations for what to include in data reports for all dataset types, as well as more specific ones for simulation, experiment, and field research datasets. We suggest that researchers pick the ones suitable for their particular cases. While some information may seem redundant with what is entered in the forms during curation, the goal is for the data report to provide comprehensive, expanded documentation about the dataset's technical provenance and functionality that users can download along with the data.

General

Applicable for all datasets:

  • Name the report file with a convention consistent with the rest of the files in the dataset.
  • Include the title of the project and of the dataset(s), the authors, and dates on the first page.
  • Provide context to the dataset publication by including a brief overview of the research problem that the data is trying to solve.
  • Many datasets are composed of different types of data/file formats. List the dataset contents and the functions of the different data types.
  • Include the methodology used to collect/generate the data.
  • Include information about the instrument/s used to collect/generate the data, and if/how they were calibrated.
  • Explain how the data is organized including a schema of the folder structure.
  • Describe the folder structure and the file naming convention used to organize and identify the files.
  • Include a data dictionary to clarify the meaning of acronyms, variables, or any other information that is coded or synthesized in the data (e.g., labels in tabular data).
  • Include a statement of how the quality of the data was evaluated for completeness and accuracy, and whether any post-processing such as normalization, removal of errors, or validation was completed. If the data was not processed and is distributed as is, note this as well.
  • Indicate if files were converted and to which format.
  • Researchers generally refer to raw data as the data obtained directly from any type of recording instrument. Indicate if the dataset includes raw data, converted data, and/or post-processed data.
  • If data are distributed as .bin files, indicate the structure of the binary format as well as the software or scripts used to read the files (a minimal sketch follows this list).
  • Reporting a dataset's performance evaluation is applicable to benchmark or testbed datasets that have been used to train a model or for numerical validation.
  • Describe the usage of any Jupyter notebook or code that you are publishing to visualize, read in, or post-process the data. Include instructions on how to use them.
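As a minimal sketch of documenting a binary layout, assuming a hypothetical record format (little-endian uint32 sample id, float64 timestamp, float32 acceleration) and a file whose length is a multiple of the record size:

```python
import struct

# Hypothetical record layout: uint32 id, float64 timestamp, float32 accel
RECORD = struct.Struct("<Idf")  # 16 bytes per record, little-endian

with open("run003_accel.bin", "rb") as f:  # hypothetical file name
    while chunk := f.read(RECORD.size):
        sample_id, timestamp, accel = RECORD.unpack(chunk)
        print(sample_id, timestamp, accel)
```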

Simulations

  • Provide information about the software used to conduct the simulation and its version. Point to its repository and citation if available.
  • Describe any ancillary scripts/code used to run the simulation software.
  • Describe how the simulation runs were structured including a graph of the workflow.
  • Include spatial and temporal scales and resolution.
  • Note the typical runtime of the simulation workflow and the system configuration.
  • Report on the computational environment where the simulations were run.
  • Reference the source/citation of input files (taken from published experiments or external repositories) when applicable.

Field Research

Engineering and Social Science
  • Briefly describe the characteristics of the natural hazard event that was observed.
  • If the field research is not associated with a natural hazard event, describe the unit of analysis that is being observed.
  • Report on the data collection design, sampling criteria, and coordination between data collectors if applicable.
  • Describe the methodology including type/s of observation/s.
  • Indicate corresponding instrumentation used to collect different types of data. Include details of their make, model and setup as needed.
  • Note any relevant information about the instrument set up and calibration.
  • Note the correspondence between the phenomena observed, the type of instruments used and the corresponding dataset.
  • For survey and interview data include the questionnaire and interview protocol and report on the population size and participant characteristics.
  • For human subjects data, describe all issues regarding protection of personally identifiable information and confidentiality.

Experiments

  • Provide details of the facility/lab and equipment used.
  • Describe the experimental design/model configuration.
  • Describe/graph the model/specimen construction/layout.
  • Indicate sensor types and disposition.
  • Indicate calibration.
  • Record pre-test documentation.
  • Describe the test protocol including frequency and loads.
  • Note the format in which the data was captured.
  • Record the units in which the data was captured and if there was conversion to other units.
  • Indicate whether the data is published raw and/or processed.
  • Record any data post-processing activity and note software used.

Reusing and Citing Resources in the Datasets Landing Pages

Reusing Data and Software

In their projects, researchers frequently reuse data as input files, for validation, or to integrate with the data that they collect. They also reuse open software and may modify it for their purposes. To make sure that data and software can be reused, modified, and shared appropriately, researchers should always check and follow what is stated in the resource's license. Licenses are standard ways for content creators to grant permission for others to reuse their work under certain conditions. The license will, for example, indicate whether the creators want to be attributed with a citation. In that regard, it is good practice to always provide the citation of the software and data that was reused, even if the author does not require it, because authorship is part of the resource's provenance and citing gives credit to the creators.

Citing Papers and Reused Data and Software

DDR provides ways to cite different types of resources that give context to, cite, or were used to create a new data publication. Located in the curation forms, these are:

  • Related Work: to cite data, papers, websites, reports, or presentations that provide context for, have a direct relationship to, or cite the published dataset. These resources can be external to DDR or published in DDR. The type of connection between the resource and the published dataset must be selected by the user: context (informs the published dataset), link (points to a resource that is complementary to, derives from, or is derived from the published dataset), or is cited by (the work references the published dataset).
  • Referenced Data and Software: to cite data or software that have been reused to create the published dataset.

Note that:

  • There could be more than one Related Work and Referenced Data or Software per published dataset.  
  • Related Works and Referenced Data or Software can be added at any time during or after the dataset's publication; in the latter case, this is done through Amends or Versioning.
  • In the form, researchers should include the cited resource's title and its DOI as an https URL; this way, other users will be directed to the cited resource.
  • When a DOI is used, the platform submits this information to a third party (Crossref via DataCite) that assigns the citations to the corresponding resources so that authors are credited appropriately.
  • In order to reuse images from other sources (online, databases, publications, etc.), users should look into their rights and licensing restrictions and follow them appropriately. Users should also follow instructions on how to cite the images.

Rights

DDR publishes data and other types of materials, some of which, such as reports, presentations, and learning materials, are protected under U.S. copyright law. The U.S. Copyright Office considers that information that is discovered, as opposed to created, is not subject to copyright. This applies to the majority of datasets published in DDR, which are conceived as collections of facts.

Data may not, however, be only facts. That can be the case for certain types of field notes, answers from human subject interviews, or photographs. In turn, databases or other systems in which data is organized may be protected, even though the data within them is not. In addition, anything created solely by AI cannot be copyrighted. For updated information about this important topic, see the University of Texas Libraries' Copyright Issues in AI.

As an open repository committed to responsible access, we offer different licenses under which authors can share their data publicly and establish the conditions under which it can be reused by others. Options offered in the DDR are Creative Commons licenses (with attribution or public domain) and Open Data licenses (with and without attribution). The former cover creative work under copyright law, and the latter are designed specifically for data. All licensing choices in DesignSafe involve the least restrictive conditions under which data and works can be shared and reused. In general, any creative work, such as papers, reports, presentations, social science research instruments, learning objects, and posters, can be shared under a Creative Commons license, while for data it is best to choose an Open Data license. When consulting with our community, most expressed that they want to be attributed, which means that reusers of data should add the dataset citation to the reference section of the papers they publish using the data. Please refer to the Licenses documentation, and visit us during our Virtual Office Hours to discuss with the data curator any doubts about choosing the right license.

Amends and Versioning

Once a dataset is published, users can amend or version their data publications. Amends involve correcting certain metadata fields without changes to the existing data files. Versioning involves amends and/or changes to the data. These actions can be performed by the authors through My Project.

Amends include:

  • Improving descriptions and adding keywords: after the curator reviews the publication, or following versioning, users may need to clarify or enhance descriptions and keywords.
  • Improving or adding metadata: adding information to a non-required field, or changing or improving already completed fields.
  • Changing the order of authors: even though DDR has interactive tools to set the order of authors in the publication pipeline, users may require changes after publication.
  • Adding funders and awards: it is always important to credit the agencies that funded the research that produced the data.
  • Adding Related Works: when papers citing a dataset are published, we encourage users to add the references in Related Works (option: is cited by) to improve data understandability, cross-referencing, and citation counts. Other related works can be added as context, informing the published dataset.
  • Adding Referenced Data and Software: when other datasets and/or research software were used to create the published dataset.

Version control includes:

  • Adding files to or deleting files from a published dataset.
  • Documenting the nature of the changes, which will show publicly on the landing page.
  • Descriptions of the nature of the changes are displayed so users can see what changed, and are stored as metadata.
  • Any changes to the project’s metadata will also be updated (this update is limited to the same fields allowed in the Amend section), so there is no need to amend a newly versioned project unless you have made a mistake in the latest version.

Versioning has these implications for the project’s DOI:

  • In the citation and landing pages, different versions of a dataset will have the same DOI with different version numbers.
  • The DOI will always resolve to the latest version of the data publication.
  • Users will always be able to access previous versions through the landing page.

When implementing amends and versioning, consider the following:

  • Amend will only update the latest version of a publication (if there is only one version, that will be the target). Only the specified fields in the metadata form will be updated. The order of authors must be confirmed before the amendments can be submitted.
  • Once a dataset is published, major changes to the title or author changes are not permitted. If changes must be made due to omission or a mistake, users should submit a Help ticket and discuss the change with the data curator. If applicable, changes will be made by the curation team.

Leaving Data Feedback

Users can click a “Leave Feedback” button on the projects’ landing pages to provide comments on any publication. This feedback is transformed into a ticket and forwarded to the curation team for any needed actions, including contacting the authors. In addition, it is possible for users to message the authors directly as their contact information is available via the authors field in the datasets landing pages. However, leaving the feedback in the form allows tracking of the issues raised. The following are examples of feedback questions and concerns:

  • Questions about the dataset that are not answered in the published metadata and/or documentation.
  • Missing documentation.
  • Questions about the method/instruments used to generate the data.
  • Questions about data validation.
  • Doubts/concerns about data organization, and/or inability to find desired files.
  • Interest in bibliography about the data/related to the data.
  • Interest in reusing the data.
  • Comments about the experience of reusing the data.
  • Request to access raw data if not published.
  • Requests to access restricted data if the authors authorize it. 
  • Issues related to published code that is not working properly.
  • Congratulations.