[Esip-preserve] ESIP Citation Guidelines

Wed Oct 13 09:58:23 EDT 2010

On 10/11/10 19:50, alicebarkstrom at frontier.com wrote:
> Do you think it is possible to adapt the UFN approach previously
> mentioned to our earth science data?  It addresses (some, but not
> all of) the things you discuss here.
>
> <BRB>Absolutely NOT!  The UFN approach starts by assuming that data
> can be arranged in a "canonical" sequence of values and held to a
> single specified precision and representation.  There isn't anyone to
> play "pope" to provide a "canon" of data formats.  That means that
> there isn't anyone who can identify the "canonical" representation
> of a numeric data collection.

Not for all data, or not for any subset of the data?

> 1.  The NOAA GHCN adjusted precipitation data separates the
> geolocation data from the actual precip data - which is arranged in
> single year arrays (Jan as first month, Dec as the last month) using
> an ASCII encoding of five characters to represent an integer.  I
> don't think most of us would accept the notion that the data in
> memory (or in a file) that converted the ASCII to, say, double
> precision floats would suddenly render the data in memory
> "inauthentic".

So the canonicalization for that says "always compare it like this".

Why is that impossible?
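
As a rough sketch of what "always compare it like this" could mean for
data of this kind: parse the ASCII fields to integers, then hash the
values in one fixed binary form, so that the ASCII file and an
in-memory float copy compare as the same.  The record layout (twelve
five-character fields and nothing else), the choice of big-endian
32-bit integers, and the function names are all assumptions for
illustration, not the actual GHCN format or any agreed scheme.

```python
import hashlib
import struct

def parse_year_record(line, width=5, months=12):
    """Split a fixed-width ASCII line into one integer per month."""
    fields = [line[i * width:(i + 1) * width] for i in range(months)]
    return [int(f) for f in fields]

def canonical_digest(values):
    """Hash the values in one fixed form: big-endian signed 32-bit ints."""
    packed = struct.pack(">%di" % len(values), *values)
    return hashlib.sha256(packed).hexdigest()

# Hypothetical one-year record, Jan first and Dec last (made-up values).
monthly = [12, 34, 0, 105, 87, -1, 45, 60, 33, 20, 15, 9]
ascii_line = "".join("%5d" % v for v in monthly)

from_ascii = parse_year_record(ascii_line)

# The same data after a user converted it to double precision floats.
as_floats = [float(v) for v in from_ascii]
from_floats = [int(round(v)) for v in as_floats]

# Both representations reduce to the same canonical digest.
same = canonical_digest(from_ascii) == canonical_digest(from_floats)
```

The point is only that the comparison rule is attached to this one
kind of data; it says nothing about other products or formats.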

> 2.  The MODIS MOD02 data product contains the lat and long values
> for each location in the 1 km data (if I remember what we had to
> deal with on CERES).  If someone takes the spectral channels for 1
> km res data and extracts the lats and longs, and uses that for an
> analysis, I don't think most of us would assume that they've created
> "inauthentic" data.  So - is the "authentic" data the spectral
> radiances without the geolocation - or does the geolocation data
> have to accompany the spectral radiances?  If the answer is the
> spectral radiances, does the identifier have to refer just to that
> data?  If the answer is both, what identifier should someone quote
> who wants to use just the spectral data?

Now you are getting into subsets.  I'd prefer to postpone the subset
discussion by simply citing the whole file, even if you use only part
of it.  (Think of citing a fact from a paper: I just cite the DOI of
the paper as a whole.)

If you did have a valid canonicalization for the file as a whole, and
an identifier scheme that can distinguish subsets of the file, then
you could identify a valid canonicalization of the subset of the file.

We could also play with the "SEC" identifier (UUID of the
authoritative file) and tie that into identifiers for subsets.
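
One way to tie subset identifiers to the file's UUID, purely as a
sketch: derive a name-based (version 5) UUID for each subset, using
the file's UUID as the namespace.  The file UUID and the subset names
below are made up for illustration.

```python
import uuid

def subset_identifier(file_uuid, subset_name):
    """Derive a stable UUID for a subset, namespaced by the file's UUID."""
    return uuid.uuid5(file_uuid, subset_name)

# Hypothetical UUID standing in for the authoritative file's identifier.
file_id = uuid.UUID("12345678-1234-5678-1234-567812345678")

# Hypothetical subset names; any agreed naming rule would do.
radiances_id = subset_identifier(file_id, "spectral_radiances")
geolocation_id = subset_identifier(file_id, "geolocation")
```

Anyone who knows the file's UUID and the subset name gets the same
subset identifier, with no central registry needed.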

But, as I said, let's put off identifiers for subsets of files until
we can at least get identifiers for the files themselves.

> As additional indicators of the difficulty, you can take the
> different formats available for images, with the format differences
> persisting over multiple decades (bmp, jpg, tiff, ps, and eps).
> Likewise, do you really expect NASA, NOAA, and DOD to agree on
> exactly the format and representation they'll use in common?  Or,
> for that matter that NASA and ESA will agree on identical data
> formats and sequential order in files of the "same" data?

Like I said, I don't think we need to have a solution that big to have
something useful.

Right now, we have a universe of files and no way to assert SEC for
any of them, even if they are scientifically equivalent.

If we can come up with a scheme to identify scientific equivalence for
some small corner of the universe, it could be useful.

> Additionally, I think reproducibility through complete provenance
> capture helps address this (though I acknowledge it doesn't solve
> it).
>
> <brb>Don't agree at all!</brb>

Need more information.  What don't you agree with?  That capturing
provenance is useful at all?  That provenance information helps
address reproducibility?  Or do you think reproducibility is simply a
lost cause and we shouldn't even aim for it?

> We have typically relied on a trusted curator to manage and affirm
> this.  We can't prove it, (especially in the case of a malicious
> curator), but we log cryptographic digests as we produce data, and
> distribute them with the data files.  We (EOSDIS) dictate formats for
> standard data products for the "authoritative" version and that is
> what gets archived and distributed.  As you point out, this only
> checks the physical bits.
>
> <brb>Why should we rely on unverifiable curation?  What techniques
> do we have to audit the claim of identity?  I do think auditing is
> possible and can be highly reliable.  However, it means that the
> curating authority has to present a chain of evidence that will
> somehow allow independent verification of the claim of unaltered
> scientific identity.  I think we have a responsibility to avoid
> claims that cannot be substantiated - and I don't think the current
> state of our claims of "unique identifiers" can be verified
> independently.  If we go back to "Applied Cryptography", we don't
> have a verifiable trust model.  Anyone in the security business can
> (and should) shoot at what we're claiming.  I do not believe our
> current position fits with long-term preservation.</brb>

As opposed to what we do now?  We constantly make unsubstantiated
claims.  Audits are rare in this world.

I think you and I agree substantiation and audits would be good and
useful.

Today we have no good identifiers and no substantiation, and our goal
is to have both.  Are you objecting to proposing some identifiers
prior to setting up a system for substantiation?  We can certainly
head there next, but I'd be happy with self-certified compliance with
a standard as a first step.
Then we can move to independent audits, formal certification, etc.

I don't see failure to jump to a perfect end point as an argument
against taking the first baby step.
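
For reference, the digest logging mentioned in the quoted text can be
sketched as below.  It is a minimal illustration with made-up file
names and contents, not the EOSDIS implementation, and it only checks
the physical bits: it still trusts whoever produced the manifest.

```python
import hashlib

def make_manifest(files):
    """Map each file name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in files.items()}

def changed_files(files, manifest):
    """Return the names whose bytes no longer match the manifest."""
    return sorted(name for name, data in files.items()
                  if hashlib.sha256(data).hexdigest() != manifest.get(name))

# Hypothetical files as produced, with digests logged at production time.
produced = {"granule_a.dat": b"\x00\x01\x02", "granule_b.dat": b"\x03\x04"}
manifest = make_manifest(produced)

# A later copy in which one file's bits have changed.
received = {**produced, "granule_b.dat": b"\x03\x05"}
changed = changed_files(received, manifest)
```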

> In some cases, like the MODIS "process on demand" L1B, we can't do
> that.  We assert that we have the ability to reproduce an equivalent
> file (although with the current implementation, it actually performs
> what I call "reprocessing" rather than "reproducing" -- The
> difference being that reprocessing can use better versions of
> ancillary data files, or later versions of the algorithms rather
> than trying to apply a faithful attempt to make the same file.)
>
> <brb>In short - it's not reproducible at all!  Bad guarantee of
> fixity!  Indeed, it sounds like a case of near-fraud,
> misrepresenting a reproduction as a (possibly bad) replica!</brb>

I misspoke -- I shouldn't have used the word equivalent there -- they
aren't really claiming to reproduce the original files, so there is no
fixity here.  They simply make the file when you ask for it.  The
processing step will make the best file they know how to make (not
necessarily a file equivalent to what they made last time).

Curt