A Processing Pipeline for High-Volume, High-Quality Environmental Sensor Data

Authors

Stephanie C. Pennington¹* ([email protected]), Roy Rich², Selina Cheng², Ben Bond-Lamberty¹, Xingyuan Chen³, Kurt Maier³, Michael Degan³, Vanessa Bailey³

Institutions

¹Joint Global Change Research Institute, Pacific Northwest National Laboratory, College Park, MD; ²Smithsonian Environmental Research Center, Edgewater, MD; ³Pacific Northwest National Laboratory, Richland, WA

URLs

https://compass.pnnl.gov/FME/COMPASSFME

Abstract

The Coastal Observations, Mechanisms, and Predictions Across Systems and Scales-Field, Measurements, and Experiments (COMPASS-FME) project has a network of observational sites across the Chesapeake Bay and western Lake Erie regions, extensively instrumented with soil, vegetation, and weather sensors logging data every 15 minutes. Such high-resolution environmental monitoring requires sophisticated data management to provide quality, timely data to researchers and the public. COMPASS-FME is organized around Findable, Accessible, Interoperable, and Reusable (FAIR) data principles and prioritizes rapid model-experiment iteration. A major goal is therefore to make these site sensor data rapidly available for quality assurance/quality control (QA/QC), analysis, and model ingestion, and openly available on ESS-DIVE within 1 year of collection. A combination of hardware and software workflows makes this possible. Sensor data are automatically uploaded to Smithsonian Environmental Research Center and Dropbox servers. A multi-step processing pipeline assigns a unique hash to each observation, ensuring traceability; reformats and unit-transforms the data; flags out-of-bounds or out-of-service problems; propagates quality flags based on the physical site setup; and automatically generates extensive site-specific metadata. The pipeline runs in R and uses technologies such as Quarto for logs and documentation; a SQL database for data quality flags; automated code testing; and both algorithmic and human QA/QC. Data are released in L0 (close to raw), L1 (limited QA/QC, full metadata), and L2 (highest quality, filtered and averaged) levels. This open-source system runs on the COMPASS computing cluster and processed over half a billion raw observations to produce the recent v1.0 data release of 97.3 million observations over 2 years.
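
As an illustration of two of the pipeline steps described above (assigning a hash to each observation and flagging out-of-bounds values), a minimal sketch follows. It is written in Python for brevity and is not the project's R code. The field names, the `BOUNDS` table, the choice of SHA-256, and the 16-character truncation are all assumptions made for this example.

```python
import hashlib

# Hypothetical plausibility limits per sensor type (assumed values, not the project's).
BOUNDS = {"soil_temp_c": (-20.0, 50.0)}


def observation_id(site, sensor, timestamp, value):
    """Derive a deterministic ID from the fields that define an observation."""
    key = f"{site}|{sensor}|{timestamp}|{value!r}"
    # Truncating the digest is an illustrative choice to keep IDs short.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def process(obs):
    """Return a copy of the observation with an ID and an out-of-bounds flag."""
    low, high = BOUNDS[obs["sensor"]]
    out = dict(obs)
    out["id"] = observation_id(
        obs["site"], obs["sensor"], obs["timestamp"], obs["value"]
    )
    # Flag the value rather than dropping it, so the raw reading stays traceable.
    out["out_of_bounds"] = not (low <= obs["value"] <= high)
    return out


# Made-up example records at a 15-minute interval.
records = [
    {"site": "SITE_A", "sensor": "soil_temp_c",
     "timestamp": "2023-06-01T00:00:00Z", "value": 18.4},
    {"site": "SITE_A", "sensor": "soil_temp_c",
     "timestamp": "2023-06-01T00:15:00Z", "value": 999.0},
]
processed = [process(r) for r in records]
```

Because the ID depends only on the observation's own fields, reprocessing the same raw record yields the same ID, which is one simple way to keep flags and downstream products traceable to their source reading.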