Balancing the needs of consumers and producers for scientific data collections

Enabling data citations for large data collections.

AmeriFlux includes 390 sites with data and has recorded 19,900 unique downloads since late 2015; over 950 papers have been published using these data.
[Image Credit: Courtesy AmeriFlux]

The Science

Easy-to-use data citation methods are needed to address current challenges around integrating data from dozens or hundreds of datasets that large-scale DOE projects generate.  

The Impact

Enabling proper citation of large data collections will make it possible to track citations to individual datasets, and will allow machine learning and AI applications to use large-scale integrated datasets and cite them accurately. We aim to support discoverable and reusable data through accurate citation counts so that authors receive appropriate credit for their work.

Summary

Recent emphasis on, and requirements for, open data publication have led to significant increases in data availability in the Earth sciences, which is critical to data integration. Currently, data are often published in a repository with an identifier and citation, similar to those for papers. Subsequent publications that use the data are expected to provide a citation in the reference section of the paper. However, the format of the data citation is still evolving, particularly with regard to citing dynamic data, subsets, and collections of data. Considering the motivations of those who contribute the data and those who use them, the most pressing need is to create user-friendly solutions that provide credit and enable accurate citation of integrated data.

Easy-to-use data citations are needed to address social and technical challenges around data integration. Studies that integrate data from dozens or hundreds of datasets must often include data citations in supplementary material due to page limits. However, citations in the supplementary material are not indexed, making it difficult to track citations and thus to give credit to the data producer. In this paper, we discuss our experiences and the challenges we have encountered with current citation guidance. We also review the relative merits of the currently available mechanisms designed to enable compact citation of collections of data, such as data collections, data papers, and dynamic data citations. We consider these options for three scenarios: a domain-specific data collection, a data repository, and a large-scale, multidisciplinary project. We propose a new mechanism to enable citation of multiple datasets and credit to data producers, and convene a community of practice to address current social and technical challenges.
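
As a rough illustration of why a single collection-level citation could still yield per-dataset credit, the following Python sketch expands each cited collection identifier into its member datasets and tallies one citation for each. It is not the mechanism proposed in the paper; the class names, the function, and the DOIs are hypothetical placeholders chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """A single published dataset (hypothetical record layout)."""
    doi: str
    producer: str


@dataclass
class Collection:
    """A citable collection that lists its member datasets (hypothetical)."""
    doi: str
    title: str
    members: list = field(default_factory=list)


def credit_from_citations(cited_dois, collections):
    """Return {dataset DOI: citation count}, counting one citation per
    member dataset each time its collection DOI is cited.
    DOIs that do not match a known collection are ignored."""
    by_doi = {c.doi: c for c in collections}
    counts = {}
    for doi in cited_dois:
        collection = by_doi.get(doi)
        if collection is None:
            continue
        for dataset in collection.members:
            counts[dataset.doi] = counts.get(dataset.doi, 0) + 1
    return counts


# Placeholder identifiers, not real DOIs.
site_a = Dataset(doi="10.0000/example.site-a", producer="Team A")
site_b = Dataset(doi="10.0000/example.site-b", producer="Team B")
synthesis = Collection(
    doi="10.0000/example.collection",
    title="Example synthesis collection",
    members=[site_a, site_b],
)

# Two papers cite the collection; a third cites something unrelated.
citations = [
    "10.0000/example.collection",
    "10.0000/example.collection",
    "10.0000/unrelated",
]
credit = credit_from_citations(citations, [synthesis])
print(credit)
```

In this sketch a paper needs only one reference-list entry for the collection, yet each contributing dataset still accumulates a citation count that can be reported back to its producer.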

Principal Investigator

Deb Agarwal
Lawrence Berkeley National Laboratory
daagarwal@lbl.gov

Program Manager

Paul Bayer
U.S. Department of Energy, Biological and Environmental Research (SC-33)
Environmental System Science
paul.bayer@science.doe.gov

Funding

This work was funded through the AmeriFlux Management Project and the ESS-DIVE repository by the U.S. DOE’s Office of Science Biological and Environmental Research under contract number DE-AC02-05CH11231 to LBNL as part of its Earth and Environmental Systems Science Division Data Management program.

References

Agarwal, D. A., J. Damerow, C. Varadharajan, D. S. Christianson, et al. "Balancing the needs of consumers and producers for scientific data collections." Ecological Informatics 62, 101251 (2021). https://doi.org/10.1016/j.ecoinf.2021.101251