A Guide to Using GitHub for Developing and Versioning Data Standards and Reporting Formats

Teams of scientists across DOE National Laboratories use GitHub in a novel way for collaboration on developing community data standards.

Scientists make several recommendations for using GitHub to version control reporting formats, including providing descriptive README files, versioning documentation with semantic versioning, and archiving all content in a long-term data repository.

[Reprinted under a Creative Commons Attribution 4.0 International License (CC BY 4.0) from Crystal-Ornelas, R., et al. “A Guide to Using GitHub for Developing and Versioning Data Standards and Reporting Formats.” Earth and Space Science 8 (8), (2021). DOI: 10.1029/2021ea001797]

The Science

Earth and environmental data standards are an important way to make data FAIR (Findable, Accessible, Interoperable, and Reusable). However, there is no agreed-upon way for groups to share and collaborate on the standards. Some groups host standards on static websites, while others circulate templates in proprietary formats. Therefore, scientists working together across the Department of Energy’s (DOE) National Labs have outlined a set of best practices to guide research communities in disseminating and collaborating on standards. Their main recommendation is that researchers use the version control platform GitHub to openly share data standards, organize feedback from their user community, and clearly track changes to the standards over time.

The Impact

A systematic review resulted in several key recommendations for researchers looking to develop data reporting formats for their diverse datasets. First, scientists suggest that GitHub, a website typically used for collaboration on computer code, can also be used for open and transparent collaboration on reporting format documentation. Beyond using GitHub as a collaborative platform, scientists provide a review of tools within GitHub that benefit those looking to bring more researchers into the data standardization process (e.g., submitting feedback using GitHub issues or creating project websites using GitBook or GitHub Pages).

Summary

Over the past three years, the Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) data repository has worked with six teams of community partners across the National Lab network to develop data reporting formats for some of the complex ESS data that are submitted to ESS-DIVE. The teams needed a web platform to host their data reporting format documentation and templates that fulfilled several requirements. The web platform needed to (1) track changes to multiple documents over time, (2) facilitate collaboration between researchers, and (3) display content openly and transparently.

To determine a path forward, the teams conducted a systematic review of over 100 data standards in earth and environmental science and explored how data standards documentation was hosted on the internet. Across the 108 data standards that were reviewed, there was no universal way that researchers chose to publish their data standards. The review revealed that 32 researchers used GitHub as the platform to manage their associated documents and templates. Though GitHub is typically used for collaboration on computer code, it meets the three criteria outlined above for collaboration on reporting formats. Thus, the teams selected it as the platform for hosting ESS-DIVE’s data reporting formats.

Based on the results of this systematic review, several best practices for leveraging GitHub features for collaboration on reporting formats were identified. First, GitHub repositories should contain descriptive README files that help orient first-time users to the reporting formats and include information like usage licenses and recommended citations. Second, semantic versioning should be used to indicate when data reporting format documents have been updated in major or minor ways (e.g., v2.0.0 or v1.1.0, respectively). Lastly, GitHub Issues are built into every repository and allow anyone with a GitHub account to provide feedback on the reporting formats. Taken together, these features make GitHub an open and transparent way to host, version, and collaborate on community-led earth and environmental science data and metadata reporting formats.
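The semantic versioning recommendation can be illustrated with a short Python sketch. It is not taken from the paper or from ESS-DIVE tooling. The names Version, parse_version, and bump are hypothetical, and the mapping of change types to version parts (an incompatible change raises the major number, a backward-compatible addition raises the minor number, a small correction raises the patch number) follows the general semantic versioning convention rather than any rule stated by the authors.

```python
import re
from typing import NamedTuple


class Version(NamedTuple):
    """A MAJOR.MINOR.PATCH version, printed with a leading 'v'."""
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"v{self.major}.{self.minor}.{self.patch}"


# Accepts tags such as "v1.0.0" or "1.0.0" (the leading "v" is optional).
_VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag: str) -> Version:
    """Turn a tag string into a Version, rejecting malformed tags."""
    match = _VERSION_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {tag!r}")
    return Version(*(int(part) for part in match.groups()))


def bump(version: Version, change: str) -> Version:
    """Return the next version for a 'major', 'minor', or 'patch' change.

    Assumed convention: lower-order parts reset to zero when a
    higher-order part is incremented.
    """
    if change == "major":
        return Version(version.major + 1, 0, 0)
    if change == "minor":
        return Version(version.major, version.minor + 1, 0)
    if change == "patch":
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown change type: {change!r}")


current = parse_version("v1.0.0")
print(bump(current, "minor"))  # a minor update to the reporting format
print(bump(current, "major"))  # a major update to the reporting format
```

Under these assumptions, a minor update to a format at v1.0.0 yields v1.1.0 and a major update yields v2.0.0, matching the two examples in the text. A team could apply the resulting string as a Git tag or GitHub release name so that each published version of the documentation stays retrievable.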

Principal Investigator

Deb Agarwal
Lawrence Berkeley National Laboratory
daagarwal@lbl.gov

Program Manager

Daniel Stover
U.S. Department of Energy, Biological and Environmental Research (SC-33)
Environmental System Science
daniel.stover@science.doe.gov

Funding

This work was funded through the Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) data repository by the Office of Biological and Environmental Research (BER) in the U.S. Department of Energy’s (DOE) Office of Science under contract number DE-AC02-05CH11231. Work was also supported by the iNtegration, Artificial Intelligence Analytical Data Services (iNAIADS) Early Career Research Award, funded by DOE BER under Berkeley Lab Contract Number DE-AC02-05CH11231. Additional support was provided through DOE contract number DE-SC0012704 to Brookhaven National Laboratory. Reporting format development was supported by ESS-DIVE’s Community Funds through DOE BER.

References

Crystal-Ornelas, R., et al. "A Guide to Using GitHub for Developing and Versioning Data Standards and Reporting Formats." Earth and Space Science 8 (8), (2021). https://doi.org/10.1029/2021ea001797.