2024 Abstracts

A Processing Pipeline for High-Volume, High-Quality Environmental Sensor Data

Authors

Stephanie C. Pennington1* (stephanie.pennington@pnnl.gov), Roy Rich2, Selina Cheng2, Ben Bond-Lamberty1, Xingyuan Chen3, Kurt Maier3, Michael Degan3, Vanessa Bailey3

Institutions

1Joint Global Change Research Institute, Pacific Northwest National Laboratory, College Park, MD; 2Smithsonian Environmental Research Center, Edgewater, MD; 3Pacific Northwest National Laboratory, Richland, WA

Abstract

The Coastal Observations, Mechanisms, and Predictions Across Systems and Scales-Field, Measurements, and Experiments (COMPASS-FME) project has a network of observational sites across the Chesapeake Bay and western Lake Erie regions, extensively instrumented with soil, vegetation, and weather sensors logging data every 15 minutes. Such high-resolution environmental monitoring requires sophisticated data management to provide timely, high-quality data to researchers and the public. COMPASS-FME is organized around Findable, Accessible, Interoperable, and Reusable (FAIR) data principles and prioritizes rapid model-experiment iteration. A major goal is therefore to make these site sensor data rapidly available for quality assurance/quality control (QA/QC), analysis, and model ingestion, and openly available on ESS-DIVE within 1 year of collection. A combination of hardware and software workflows makes this possible. Sensor data are automatically uploaded to Smithsonian Environmental Research Center and Dropbox servers. A multi-step processing pipeline assigns a unique hash to each observation, ensuring traceability; reformats and unit-transforms the data; flags out-of-bounds or out-of-service problems; propagates quality flags based on the physical site setup; and automatically generates extensive site-specific metadata. The pipeline runs in R and uses technologies such as Quarto for logs and documentation; a SQL database for data quality flags; automated code testing; and both algorithmic and human QA/QC. Data are released at three levels: L0 (close to raw), L1 (limited QA/QC, full metadata), and L2 (highest quality, filtered and averaged). This open-source system runs on the COMPASS computing cluster and has processed over half a billion raw observations to produce the recent v1.0 data release of 97.3 million observations over 2 years.
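
As an illustration of two of the pipeline steps described above (assigning a unique hash to each observation and flagging out-of-bounds or out-of-service values), here is a minimal sketch. It is written in Python for illustration only; the actual pipeline runs in R, and this is not its implementation. The record fields, the `BOUNDS` table, the service-window format, and the function names are all hypothetical.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical bounds table: variable name -> (min, max) in output units.
BOUNDS = {"soil_temp_C": (-20.0, 50.0)}


@dataclass
class Observation:
    site: str
    sensor: str
    timestamp: str  # ISO 8601 in a single fixed format, e.g. "2023-06-01T12:15:00Z"
    variable: str
    value: float
    obs_id: str = ""
    out_of_bounds: bool = False
    out_of_service: bool = False


def observation_id(obs):
    """Derive a deterministic ID from the fields that identify an observation."""
    key = "|".join([obs.site, obs.sensor, obs.timestamp, obs.variable, repr(obs.value)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def process(obs, service_windows=()):
    """Attach an ID and set quality flags.

    service_windows is an assumed list of (sensor, start, end) tuples marking
    periods when a sensor was known to be out of service. Timestamps are
    compared as strings, which is only valid when they share one ISO format.
    """
    obs.obs_id = observation_id(obs)
    lo, hi = BOUNDS.get(obs.variable, (float("-inf"), float("inf")))
    obs.out_of_bounds = not (lo <= obs.value <= hi)
    obs.out_of_service = any(
        sensor == obs.sensor and start <= obs.timestamp < end
        for sensor, start, end in service_windows
    )
    return obs
```

Because the ID is computed only from the observation's own fields, reprocessing the same raw record yields the same ID, which is one way to obtain the traceability the abstract describes. Flags are recorded alongside the value rather than used to drop it, so later stages can decide how to filter.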