[Esip-cloud] Using Zarr to Store and Efficiently Access Output From Operational Numerical Weather Prediction Models

Mon Apr 19 11:58:59 EDT 2021

Hello ESIP Cloud Computing Cluster Members!

We are excited to welcome guest speakers Taylor Gowan and John Horel of the
Department of Atmospheric Sciences, University of Utah and Cloudnine
Weather to present on *Using Zarr to Store and Efficiently Access Output
From Operational Numerical Weather Prediction Models*. *Read the abstract,
questions and meeting agenda below.*

*Meeting Logistics!*
Topic: Using Zarr to Store and Efficiently Access Output From Operational
Numerical Weather Prediction Models.
Monday April 26th, 1:00-2:00 pm ET/10:00-11:00 am PT
https://us02web.zoom.us/j/86535177705?pwd=ay9yVDJ6UzNiSGRMWTFxbkNXdEJXUT09
Meeting ID: 865 3517 7705
Passcode: 354962
Find your local number: https://us02web.zoom.us/u/knxOPNBj5

*Abstract:*
Research and operations dependent on numerical weather prediction involve
synthesizing vast amounts of continuously updating output grids that
require viable, economical archival and retrieval solutions. We were the
only public archive from 2015 until recently for output from the
High-Resolution Rapid Refresh (HRRR) forecast modeling system of the
National Weather Service. That university-based archive has been relied
upon by over a thousand registered users. Fortunately, our research group
no longer needs to continue expanding beyond the current 130+ terabytes of
HRRR model output in GRIB2 format (a file type that efficiently stores
hundreds of two-dimensional variable fields for a single valid time) since
Amazon and Google are doing so now as part of the Open Data Program of the
National Oceanic and Atmospheric Administration.

Despite the highly compressible nature of GRIB2 files, they are on the
order of several hundred megabytes each, making high-volume input/output
applications challenging due to the memory and compute resources needed to
parse these files. With support from the Amazon Sustainability Data
Initiative, our group is creating and maintaining HRRR model output in an
optimized format, Zarr, in a publicly-accessible S3 bucket - hrrrzarr. That
bucket contains sets for each model run and every variable of analysis and
forecast files sectioned into 96 small chunks. The structure of the
HRRR-Zarr files is designed to allow users the flexibility to access data
they need by selecting many small files for subdomains and parameters of
interest without the overhead that comes from accessing GRIB2 files. The
workflows required to generate the Zarr files and illustrations of use
cases common to weather and machine learning applications will be presented.
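
As a rough illustration of why many small chunks help, the sketch below
counts how many of the chunks a small subdomain request would touch. The
grid size (1059 x 1799 points) and the 150 x 150 chunk size are assumptions
chosen to be consistent with the 96 chunks mentioned above; they are not
stated in this announcement, and the helper names are hypothetical.

```python
import math

# Assumed values (not stated in the announcement): a grid of 1059 rows (y)
# by 1799 columns (x), split into 150 x 150 point chunks.
NY, NX = 1059, 1799
CHUNK = 150


def chunk_grid(ny=NY, nx=NX, chunk=CHUNK):
    """Return the number of chunk rows and chunk columns covering the grid."""
    return math.ceil(ny / chunk), math.ceil(nx / chunk)


def chunks_for_subdomain(y0, y1, x0, x1, chunk=CHUNK):
    """List (row, col) chunk indices overlapping the half-open index box
    [y0, y1) x [x0, x1)."""
    rows = range(y0 // chunk, (y1 - 1) // chunk + 1)
    cols = range(x0 // chunk, (x1 - 1) // chunk + 1)
    return [(r, c) for r in rows for c in cols]


n_rows, n_cols = chunk_grid()
total_chunks = n_rows * n_cols

# Hypothetical 200 x 200 point subdomain somewhere inside the grid.
needed = chunks_for_subdomain(400, 600, 500, 700)

print(f"{len(needed)} of {total_chunks} chunks needed: {needed}")
```

Under these assumptions, a request for a regional subdomain reads only a
handful of small objects rather than parsing a full-domain GRIB2 file.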

*Discussion Questions - Choosing the right chunk shape:*

Many options exist for organizing N-dimensional data sets such as output
from numerical weather prediction models. In our case, N=6
(x-longitude, y-latitude, z-height, v-variable, t-model run time, f-model
forecast time). We chose to generate x-y chunks and generate 3-D (f, x, y)
Zarr cubes. What use cases would have benefited from alternative approaches
in terms of chunk size, compression, and dimensions? Does it make sense to
generate complete Zarr archives using other dimensional combinations or
leave that to users to create their own repositories?
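
To make the trade-off concrete, here is a minimal sketch that counts how
many chunks two common access patterns would touch under two candidate
chunk shapes. The 48 forecast hours, the 1059 x 1799 grid, the 150 x 150
tile size, and the layout names are all illustrative assumptions, not a
description of the actual archive.

```python
import math

# Assumed dimensions, ordered (f, y, x): 48 forecast hours on a
# 1059 x 1799 grid. These numbers are illustrative only.
N_F, N_Y, N_X = 48, 1059, 1799


def chunks_touched(request, chunk_shape):
    """Count chunks overlapped by a request of the given shape, assuming
    the request starts on a chunk boundary in every dimension."""
    return math.prod(math.ceil(r / c) for r, c in zip(request, chunk_shape))


# Two hypothetical chunk layouts.
layouts = {
    "cube": (N_F, 150, 150),  # all forecast hours, small x-y tile
    "map": (1, N_Y, N_X),     # one forecast hour, full domain
}

# Two hypothetical access patterns.
requests = {
    "point_series": (N_F, 1, 1),  # every forecast hour at one grid point
    "full_map": (1, N_Y, N_X),    # one forecast hour over the whole domain
}

results = {
    layout: {name: chunks_touched(req, shape) for name, req in requests.items()}
    for layout, shape in layouts.items()
}

for layout, counts in results.items():
    print(layout, counts)
```

In this sketch the (f, x, y) cube layout serves a point time series from a
single chunk but needs every tile for a full-domain map, while the
full-map layout does the opposite, which is the tension the questions
above are asking about.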

*Agenda:*

   - 20-30 minutes: Presentation
   - 20-30 minutes: Discussion questions
   - 10 minutes: Aimee, with credit to Sudhir Raj Shrestha, Rob Casey, and
   Rich Signell, provides an overview of the draft ESIP Cloud Computing
   Cluster 2021 Plan
   <https://docs.google.com/document/d/1dcGFNohOnjPLQcYwY-W5Bxs8pbayXhwjRpPfL3i-UmA/edit#heading=h.pwl6y4r2af53>
   and how cluster participants can get involved

Looking forward to seeing y'all there!
Aimee and Sudhir