[Esip-preserve] Identifier Pleasantries - Stateful and Virtual Files

Bruce Barkstrom brbarkstrom at gmail.com
Tue Dec 14 09:04:26 EST 2010


In thinking over some production scenarios, I've come across a few cases
where we could have some interesting times:

1.  In some production scenarios, files are updated with new data, and the
results of using them depend on the order in which the updates occur.  On
CERES, we have one major file that is the history of previous cloud
properties in the month before the data are used.  If you think of places
with low clouds, such as some of the tropical areas of nearly perpetual
cloud cover (Africa and the Amazon, for example), the sequence of updates
matters a good deal.
If the updates had missed a week - and then those missing days were
filled in out-of-time order - the estimate of clear-sky conditions would be
dependent on when the file was used.  This means that if you used the
file when the missing days were still missing you could get different
results than you would after they had been filled in.  Likewise, the NCDC
records for GHCN (which is the fundamental record of temperature and
humidity collected at ground stations) are updated fairly often.  The file
name isn't changed when they do an update - but the file with that name
will have different contents.  The upshot for identifiers is that the file
is volatile, so the ID is not a guarantee of identity.  [Given my
government employment,
I'd rank the prospect of changing the habits of any government agency
as down around the probability of hell freezing over (despite Dante).]
To put it a bit more formally, in this case, the state of the file updates
is important - and so the files are actually stateful.  If we were dealing
with databases, this would mean we would need to have the history of
transactions in order to understand the current state.
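
As a rough illustration of what "stateful" means for an identifier, here is
a minimal Python sketch.  Everything in it is hypothetical: the class name,
the "name@digest" identifier form, and the assumption that an update simply
appends data are mine, not anything CERES or NCDC actually does.  It only
shows that the same file name can stand for different contents depending on
which updates have arrived, and that a log of update digests lets you tell
the states apart.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StatefulFile:
    """A named file whose contents change through an ordered series of updates."""
    name: str
    content: bytes = b""
    history: list = field(default_factory=list)  # (sequence number, digest) pairs

    def apply_update(self, new_data: bytes) -> str:
        """Append new data (an assumed update model) and log the resulting digest."""
        self.content += new_data
        digest = hashlib.sha256(self.content).hexdigest()
        self.history.append((len(self.history) + 1, digest))
        return digest

    def state_id(self) -> str:
        """Identifier for the current state: the name plus a short content digest."""
        digest = hashlib.sha256(self.content).hexdigest()
        return f"{self.name}@{digest[:12]}"


# The name stays the same, but the state identifier changes once the
# missing week is filled in out of time order.
cloud_file = StatefulFile("cloud_history")
cloud_file.apply_update(b"week1;")
cloud_file.apply_update(b"week3;")
id_with_gap = cloud_file.state_id()
cloud_file.apply_update(b"week2;")
id_filled_in = cloud_file.state_id()
```

A user who recorded id_with_gap could later tell that the file had changed,
even though its name had not.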

2.  In some cases, the production history provenance graph depends
on "virtual files" that are created simply to pass data from one job to
another.  The files aren't saved, although they are part of the processing.
The expected production on NPOESS had this kind of "feature".
I'm not sure what we propose to do about this situation when we're dealing
with identifiers.  Technically, if we had to audit the history or try to
replicate previous results, it looks like the only way to reconstruct the
state of the file would be to reconstruct the "virtual files".  That could
be "interesting".
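
One possible option, offered only as a sketch and not as anything NPOESS or
this group has settled on, would be to log a digest of each virtual file as
it passes between jobs, so the provenance graph keeps a node for it even
though the bytes are discarded.  The run_job helper, the job names, and the
provenance list below are all hypothetical.

```python
import hashlib

provenance = []  # hypothetical in-memory provenance log


def run_job(job_name, func, inputs):
    """Run one job and log digests of its inputs and output; no data is saved."""
    output = func(*inputs)
    provenance.append({
        "job": job_name,
        "inputs": [hashlib.sha256(item).hexdigest() for item in inputs],
        "output": hashlib.sha256(output).hexdigest(),
    })
    return output


# Two toy jobs; "virtual" exists only in memory between them.
raw = b"radiances"
virtual = run_job("calibrate", lambda data: data.upper(), [raw])
product = run_job("grid", lambda data: data + b"-gridded", [virtual])
```

The log links the second job's input to the first job's output by digest.
That doesn't bring the virtual file back, but an audit that regenerates it
could at least check the regenerated copy against the recorded digest.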

3.  In a fair number of cases, file collections can exhibit similar kinds of
"volatility".  We've already encountered the situation where MODIS sends
out "replacements" for files that had already been produced.  In many other
cases (in EOS), the team will release partially completed versions that
they're still in the process of updating.  That will also be true of
operational data collections - radiosondes, buoy and tide records, radar
data, geostationary images, and so on.  These cases are frequent enough to
be a meaningful fraction of the Earth science data we have to deal with.
We need some approach to dealing with identifiers for changeable
collections.  Al's note that we might call an unchanging collection
"closed" and one that is volatile "open" is probably a helpful piece of
nomenclature - but it isn't clear what
we would recommend to users about what these terms mean for their
data ordering practices.  Personally, I think I'd like to know about changes
since I last visited - but that is probably an impractical burden to place
on data repository or archival sites (and might require them to maintain
records of personal identification for users).  Maybe they would need
to have a record that shows what data had been included on a particular
date.  As I said in a previous note, the key issue in these cases is
not a timestamp on the data production date, but a history of the
observation time and date.
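
To make the "record that shows what data had been included on a particular
date" idea a bit more concrete, here is a small Python sketch of a change
log for an "open" collection.  The granule names, dates, and function names
are invented for illustration, and the dates here are when the collection
changed, not observation times.  The point is that a repository could
answer both "what was in the collection on date X" and "what changed since
date X" without keeping any per-user records.

```python
from datetime import date

# Hypothetical change log for an "open" collection: (date, action, file id).
changes = [
    (date(2010, 11, 1), "add", "granule_001.v1"),
    (date(2010, 11, 5), "add", "granule_002.v1"),
    (date(2010, 12, 1), "remove", "granule_001.v1"),
    (date(2010, 12, 1), "add", "granule_001.v2"),  # replacement file
]


def members_on(log, when):
    """Replay the log in date order to get the collection contents on a date."""
    members = set()
    # sorted() is stable, so same-day entries keep their logged order.
    for day, action, file_id in sorted(log, key=lambda entry: entry[0]):
        if day > when:
            break
        if action == "add":
            members.add(file_id)
        else:
            members.discard(file_id)
    return members


def changes_since(log, last_visit):
    """Return the log entries dated after the user's last visit."""
    return [entry for entry in log if entry[0] > last_visit]


november_view = members_on(changes, date(2010, 11, 10))
december_view = members_on(changes, date(2010, 12, 14))
new_for_user = changes_since(changes, date(2010, 11, 10))
```

The user supplies the date of the last visit, so the site doesn't have to
remember who visited when.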

So there we are, some more stuff to think about.

Bruce B.