An Efficient Data Toolkit for Uncertainty Quantification in Ultrahigh-Resolution E3SM Land Model Simulations


Dali Wang*, Fengming Yuan, Peter Schwartz, Shih-Chieh Kao, Michelle Thornton, Daniel Ricciuto, Peter Thornton, Paul J. Hanson


Climate Change Science Institute and Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN



With the development of high-resolution datasets and the availability of supercomputing resources, large-scale land simulation at ultrahigh resolution (1 km by 1 km) has become feasible. This study presents a toolkit designed for large-scale data processing to support uncertainty quantification (UQ) in ultrahigh-resolution E3SM Land Model (uELM) simulations on massively parallel computers, such as Summit and Frontier. It also presents the first application of the toolkit: preparing atmospheric forcing and surface properties datasets for UQ in uELM simulations over North America. The simulation domain is an 8075 by 7814 grid at 1 km by 1 km resolution. The atmospheric forcing dataset is a temporally interpolated Daymet dataset with a 3-hour interval. Its total size is around 50 TB, and it contains seven variables: precipitation, shortwave radiation, longwave radiation, pressure, temperature, humidity, and wind.
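
To give a sense of how grid size, time step, and variable count combine into data volume, the following is a rough sketch only. The grid dimensions, the 3-hour interval, and the seven variables come from the text; the 4-byte storage per value, the absence of compression, the 365-day year, and the storage of every grid cell (land or not) are assumptions made here for illustration and are not statements about how the actual dataset is stored.

```python
# Back-of-the-envelope forcing volume for a regular grid.
# Assumptions (NOT from the source): 4-byte floats, no compression,
# a 365-day year, and every grid cell stored regardless of land mask.

NX, NY = 8075, 7814          # domain dimensions given in the text
N_VARS = 7                   # forcing variables listed in the text
INTERVAL_HOURS = 3           # temporal interval given in the text
BYTES_PER_VALUE = 4          # assumed single-precision storage


def steps_per_year(interval_hours: int, days: int = 365) -> int:
    """Number of time steps in one year at the given interval."""
    return (24 // interval_hours) * days


def bytes_per_year(n_cells: int, n_vars: int, interval_hours: int,
                   bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Uncompressed bytes needed to store one year of forcing data."""
    return n_cells * n_vars * steps_per_year(interval_hours) * bytes_per_value


total_cells = NX * NY
full_grid_tb = bytes_per_year(total_cells, N_VARS, INTERVAL_HOURS) / 1e12

print(f"grid cells in domain: {total_cells:,}")
print(f"time steps per year:  {steps_per_year(INTERVAL_HOURS):,}")
print(f"uncompressed full-grid volume per year: {full_grid_tb:.2f} TB")
```

Passing a smaller cell count to `bytes_per_year` (for example, land cells only) shows how strongly the estimate depends on which cells are actually stored.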

A general approach is developed to support flexible partitioning schemes (e.g., round robin, randomization, and area of interest) over the designated domain, which contains 22 million land grid cells (approximately 22,000,000 km²). A new surface properties dataset (containing over 60 variables) is derived from a 0.5-degree by 0.5-degree global surface properties dataset using nearest-neighbor mapping or linear interpolation. The final surface properties data product is about 120 GB. The computational platforms used in this study are an NVIDIA DGX station (a 20-core Intel Xeon processor, 250 GB of memory, and 2 TB of SSD storage) and a 704-node commodity-type Linux cluster (Andes) connected to a 250 PB parallel file system. Each Andes node contains two 16-core 3.0 GHz AMD EPYC processors and 256 GB of main memory. Partitioning the domain and generating the domain-dependent subsets of the atmospheric forcing dataset took approximately 10–12 hours on 10 Andes nodes. Generating the fine-resolution surface datasets took around 20 minutes on the DGX station. Further profiling confirmed that the data toolkit is an efficient, memory-bound application.
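
As an illustration of two of the partitioning schemes named above, the sketch below applies round-robin and area-of-interest selection to the land cells of a toy mask. It is a minimal example under stated assumptions, not the toolkit's implementation: the function names `round_robin` and `area_of_interest`, the `(row, col)` cell representation, and the 3 by 4 mask are all hypothetical.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col) index of a land grid cell (assumed layout)


def round_robin(cells: Sequence[Cell], n_parts: int) -> List[List[Cell]]:
    """Deal cells out in turn: cell i is assigned to partition i % n_parts."""
    return [list(cells[p::n_parts]) for p in range(n_parts)]


def area_of_interest(cells: Sequence[Cell],
                     row_range: Tuple[int, int],
                     col_range: Tuple[int, int]) -> List[Cell]:
    """Keep only cells inside a half-open index box [r0, r1) x [c0, c1)."""
    r0, r1 = row_range
    c0, c1 = col_range
    return [(r, c) for r, c in cells if r0 <= r < r1 and c0 <= c < c1]


# Toy land mask (hypothetical): 1 = land, 0 = non-land.
mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]

# Only land cells are partitioned; non-land cells are skipped entirely.
land = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]

parts = round_robin(land, n_parts=3)
aoi = area_of_interest(land, row_range=(0, 2), col_range=(1, 3))

for i, part in enumerate(parts):
    print(f"partition {i}: {part}")
print(f"area of interest: {aoi}")
```

Round robin balances the number of land cells per partition to within one cell, while an area-of-interest selection yields a spatially contiguous subset whose size depends on the chosen box.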