[Esip-preserve] Some Thoughts on Organizing Documents for Inclusion in our List for Provenance

Bruce Barkstrom brbarkstrom at gmail.com
Fri Jan 28 14:48:36 EST 2011


Over the last couple of years, I've felt that the Earth science data
issues are being dealt with by communities that don't talk with
each other enough to have a common dialect or world view, so
each community has its own mental model.  In the professional
jargon, this leads to "semantic heterogeneity".  The question is
"what can we do to reduce the difficulties in communicating with
each other?"  Here are some suggestions:

1.  It would be helpful to identify communities of practice in data
production.  For each community, it would be useful to identify
one or two example Earth science data collections that we could
use to check completeness of the experiences we need to describe.

Here's a suggestion - probably incomplete:
a.  *Biodiversity sampling*, where the data consist of human observations
of such material as species identified and population statistics;
b.  *Regional environmental monitoring*, thinking of such data collections
as time series of aerosol concentration, ozone concentration, etc. obtained
by a modest number of sites set up in a particular region that do in situ
chemical sampling;
c.  *Field experiments*, which may include fixed sites, aircraft
measurements, ship measurements, and similar sampling strategies,
usually involving multiple instruments observing a particular region
for a specified time period;
d.  *Imaging surveys*, usually made from aircraft, in which the primary
data are images or photographs obtained on a scattered geographic
basis (meaning non-contiguous images).  Examples here include
the Hurricane Ike Damage Assessment Photo Collection, the USGS
aerial surveys for mapping, and the NSIDC Glacier Photo Collection.
e.  *Large-scale "global" networks*, in which there are numerous data
collection sites, sometimes with global coverage, usually collecting
in situ data.  Examples include the Global Historical Climatology Network
(GHCN), which collects the fundamental temperature, humidity, and
precipitation data used for such climate efforts as IPCC; the Baseline
Surface Radiation Network (BSRN), which collects data on solar irradiance
at the Earth's surface on a fairly global basis; the network of radiosonde
stations that launch radiosonde instruments for collecting temperature
and humidity profiles; and tide gauge networks.
f.  *"One off" satellite instruments*, which may involve special purpose
instruments designed to measure particular variables.  Examples could
include some of the UARS instruments, such as HALOE, as well as MAPS,
LIS, CALIPSO, GLAS, and so on.
g.  *Long-term satellite measurement programs*, including the NASA EOS
and NOAA Operational Satellites.  Some of these may be slightly
discontinuous in terms of agency support, although the participants are
trying to provide continuity.  Examples of the latter include measurements
of the solar constant (ACRIM, ERBE, SOVA, SORCE, etc.), the SAGE-type
instruments dealing with stratospheric aerosols, and so on.

2.  Production Paradigms, meaning the workflow associated with each
of these kinds of data collection approaches.  To place this in context,
it would be helpful to have the following kinds of information:

a.  *Large-scale Work Breakdown Structure*, showing at least one or two
activities involved in planning before the measurements, the activities
involved in actually collecting the data, and activities involved in
transferring the data to an archive (using an OAIS RM "Submission
Agreement").
Just to keep this from seeming too daunting, I've included a simple Gantt
chart for a Project and a second level WBS for Algorithm Development.
For now, we'd probably only want the Project level.  The basic context
that this kind of diagram supplies is an idea of the time period involved
in a particular project (or "campaign" if that's more familiar language).
As a note, the Hurricane Ike Gantt chart would have only one or two
days of actual observations.  The NSIDC Gantt chart would have about
a century of taking photos and perhaps a year of scanning the photos
into digital form and creating the metadata [which is my guess, not a
confirmed schedule item].  If the group is interested, I can build these
charts fairly quickly as PostScript or Encapsulated PostScript figures.

b.  *A Top-Level Data Flow Diagram*, showing what I call "Data Products"
(and Curt has called ESDTs) and "Algorithm Families" (which Curt might
call "PGEs").  This kind of diagram is helpful for showing where particular
kinds of content or documentation need to be available for understanding.
For example, if we've got an Algorithm Family that does geolocation for
a satellite program, that kind of algorithm will need an Earth geoid and,
if it's locating data on the Earth's surface, a Digital Elevation Model.
Surface location isn't needed by the "solar constant" instruments, and
for surface sites a very small table containing latitude, longitude, and
altitude with respect to a specified geoid might be sufficient.  Note also
that the WBS from item a would indicate where in the Project's history
this kind of information might appear.

c.  *A List of Documents Required for Understanding the Data*, which might
include source code, notebooks, contractor reports from instrument
contractors, commercial instrument provider serial numbers and calibration
procedures, etc.
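
As a rough illustration of how items b and c fit together, here is a
minimal Python sketch that holds a data flow diagram as a small graph and
walks upstream from a Data Product to collect the documents attached to
each Algorithm Family along the way.  The class, the function, and all of
the product and step names ("Level 0 counts", "Calibration", and so on)
are hypothetical placeholders for illustration, not taken from any actual
project.

```python
from dataclasses import dataclass, field


@dataclass
class AlgorithmFamily:
    """A processing step: consumes Data Products, produces Data Products."""
    name: str
    inputs: list[str]
    outputs: list[str]
    required_docs: list[str] = field(default_factory=list)


def docs_needed(product: str, families: list[AlgorithmFamily]) -> set[str]:
    """Collect documents from every Algorithm Family upstream of product."""
    needed: set[str] = set()
    seen: set[str] = set()
    pending = [product]
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        for fam in families:
            if current in fam.outputs:
                needed.update(fam.required_docs)
                pending.extend(fam.inputs)
    return needed


# Hypothetical two-step flow for a satellite instrument.
flow = [
    AlgorithmFamily("Calibration", ["Level 0 counts"],
                    ["Calibrated radiances"], ["Calibration procedures"]),
    AlgorithmFamily("Geolocation", ["Calibrated radiances"],
                    ["Geolocated radiances"],
                    ["Earth geoid", "Digital Elevation Model"]),
]

print(sorted(docs_needed("Geolocated radiances", flow)))
```

For a real project the list of families would come from the top-level
data flow diagram itself; the sketch only suggests that the diagram
and the document list can share one structure.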

3.  A Simple Description of Versioning Strategies
These should probably include:
a.  *Is there more than one version of each Data Product?*
b.  *If there is more than one version, what does a Gantt chart of
the production schedule look like?*
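
To make the second question a bit more concrete, here is a minimal Python
sketch that draws a crude text Gantt chart from a list of production runs,
one bar per version.  The version labels and dates are invented for
illustration only, and the function name and bar width are my own
assumptions; a real chart would use the project's actual schedule.

```python
from datetime import date

# Hypothetical production runs: (version label, start, end).
runs = [
    ("Version 1", date(2005, 1, 1), date(2006, 12, 31)),
    ("Version 2", date(2006, 7, 1), date(2008, 12, 31)),
    ("Version 3", date(2009, 1, 1), date(2010, 12, 31)),
]


def text_gantt(runs, width=48):
    """Render one bar per version, scaled to the overall schedule span."""
    first = min(start for _, start, _ in runs)
    last = max(end for _, _, end in runs)
    total_days = (last - first).days or 1  # avoid dividing by zero
    lines = []
    for label, start, end in runs:
        left = (start - first).days * width // total_days
        # Every run gets at least one character, however short it is.
        right = max(left + 1, (end - first).days * width // total_days)
        lines.append(f"{label:<10}|{' ' * left}{'#' * (right - left)}")
    return lines


print("\n".join(text_gantt(runs)))
```

Overlapping bars (Versions 1 and 2 here) show periods when two versions
were in production at once, which is the kind of thing this question is
meant to surface.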

I'll stop at this point - with the intent to provide at least one or two
examples.  Personally, I think it would be useful to extend the list to
include what we'd need to do some simple "production engineering"
of the type that would be needed for scaling production and access.
Such simple statistics include the total number of files produced,
the average size of a data product file, and the rate at which the
files are produced.  No doubt other production metrics would be
useful - suggestions encouraged.
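
As a sketch of how little is needed to get these numbers, the following
Python fragment computes the three statistics from a list of file records.
The records are made-up values, and the function name and the choice of
"files per day" as the rate unit are assumptions for illustration.  The
rate is simply the file count divided by the time between the first and
last file, which is only a rough estimate.

```python
from datetime import datetime

# Hypothetical file records: (production time, size in bytes).
files = [
    (datetime(2011, 1, 1, 0, 0), 40_000_000),
    (datetime(2011, 1, 1, 12, 0), 60_000_000),
    (datetime(2011, 1, 2, 0, 0), 50_000_000),
    (datetime(2011, 1, 3, 0, 0), 50_000_000),
]


def production_metrics(records):
    """Return total file count, mean file size (bytes), and files per day."""
    if not records:
        raise ValueError("need at least one file record")
    times = [t for t, _ in records]
    sizes = [s for _, s in records]
    total = len(records)
    mean_size = sum(sizes) / total
    span_days = (max(times) - min(times)).total_seconds() / 86400
    # Rate is undefined for a zero-length span (e.g. a single file).
    rate = total / span_days if span_days > 0 else None
    return total, mean_size, rate


total, mean_size, rate = production_metrics(files)
print(f"{total} files, mean size {mean_size / 1e6:.1f} MB, "
      f"{rate:.1f} files/day")
```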

Bruce B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Algorithm_Development_Gantt.eps
Type: application/postscript
Size: 13313 bytes
Desc: not available
URL: <http://www.lists.esipfed.org/pipermail/esip-preserve/attachments/20110128/4d077d2f/attachment-0002.eps>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Project_Gantt.eps
Type: application/postscript
Size: 11652 bytes
Desc: not available
URL: <http://www.lists.esipfed.org/pipermail/esip-preserve/attachments/20110128/4d077d2f/attachment-0003.eps>