[Esip-preserve] ESIP Citation Guidelines

Ruth Duerr rduerr at nsidc.org
Fri Oct 15 17:02:38 EDT 2010


Hi Bruce,

Just for the record, I think that it is pointless to change the definition of existing data sets to be something other than what they were originally defined to be.  There are many data sets (or collections, if you prefer) in existence that have miscellaneous content and that could have been defined differently, perhaps even as several data sets.  That's just a reality, and neither of our opinions will ever change it.  We can talk about how it would have been nice if a data set had been defined differently; but unless there is funding to change those definitions and the archives in question believe that there are sound reasons to change them, those definitions just need to be accepted as fact.

On the other hand, I do believe that Curt's definition of an ESDT + Collection (i.e., dataset along with version) as the object that is assigned a unique DOI is sound, especially when accompanied by a date of access; but that latter bit drifts off of the identifier discussion into citation guidelines - two different discussions (though I agree identifiers are foundational to both).

Ruth


On Oct 13, 2010, at 10:28 AM, alicebarkstrom at frontier.com wrote:

> We need a "virtual bar" to hang out in between the arguments.
> 
> If you want to start in on collections (or aggregations), see
> my previous e-mail on what level of detail we need to identify
> in citations.  Ruth and I have sort of parted company on this,
> since I think the distinctions between collections are important
> and she's wanted to call everything a "dataset".
> 
> Again, I'll note that we have a limited time to talk in the
> telecon - what are the top three topics we want to discuss?
> 
> Even so, I'm inclined to feel that having a drink together would
> be a highly constructive activity at this point in the discussion.
> Unfortunately, I won't be at the Fall AGU meeting, where we would
> probably find Foley's Irish Pub to be just the kind of noisy
> atmosphere that would lubricate the conversation.
> 
> Bruce b.
> ----- Original Message -----
> From: "Mark A. Parsons" <parsonsm at nsidc.org>
> To: "Curt Tilmes" <Curt.Tilmes at nasa.gov>
> Cc: esip-preserve at lists.esipfed.org
> Sent: Wednesday, October 13, 2010 10:28:00 AM
> Subject: Re: [Esip-preserve] ESIP Citation Guidelines
> 
> I'm losing the thread on this, but I note all the discussion to date has been about files. What about database records and aggregations or collections?
> 
> Cheers,
> 
> -m. 
> On 13 Oct 2010, at 7:58 AM, Curt Tilmes wrote:
> 
>> On 10/11/10 19:50, alicebarkstrom at frontier.com wrote:
>>> Do you think it is possible to adapt the UFN approach previously
>>> mentioned to our earth science data?  It addresses (some, but not
>>> all of) the things you discuss here.
>>> 
>>> <BRB>Absolutely NOT!  The UFN approach starts by assuming that data
>>> can be arranged in a "canonical" sequence of values and held to a
>>> single specified precision and representation.  There isn't anyone to
>>> play "pope" to provide a "canon" of data formats.  That means that
>>> there isn't anyone who can identify the "canonical" representation
>>> of a numeric data collection.
>> 
>> Not for all data, or not for any subset of the data?
>> 
>>> 1.  The NOAA GHCN adjusted precipitation data separates the
>>> geolocation data from the actual precip data - which is arranged in
>>> single year arrays (Jan as first month, Dec as the last month) using
>>> an ASCII encoding of five characters to represent an integer.  I
>>> don't think most of us would accept the notion that the data in
>>> memory (or in a file) that converted the ASCII to, say, double
>>> precision floats would suddenly render the data in memory
>>> "inauthentic".
>> 
>> So the canonicalization for that says "always compare it like this".
>> 
>> Why is that impossible?
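
As a rough illustration of "always compare it like this" for the GHCN example: a minimal Python sketch, assuming one agreed canonical form (each value rounded to an integer, written as decimal text, in January-to-December order). The function names, the use of SHA-256, and the sample values are illustrative assumptions, not part of GHCN or of any proposal in this thread.

```python
import hashlib
import struct


def parse_ascii_year(record: str, width: int = 5) -> list[int]:
    """Split a fixed-width ASCII record into one integer per month."""
    return [int(record[i:i + width]) for i in range(0, len(record), width)]


def canonical_digest(values) -> str:
    """Hash values in one agreed canonical form, whatever their in-memory type.

    Assumed canonical form: each value rounded to the nearest integer,
    written as decimal text, joined by commas in the original order.
    """
    canonical = ",".join(str(int(round(v))) for v in values)
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()


# Hypothetical twelve-month record, five ASCII characters per value.
monthly = [12, 30, 0, 145, 87, 60, 41, 19, 5, 73, 110, 98]
record = "".join(f"{v:5d}" for v in monthly)

as_ints = parse_ascii_year(record)
as_doubles = [float(v) for v in as_ints]  # same data, different representation

# The raw bytes differ between the two representations...
raw_ascii = hashlib.sha256(record.encode("ascii")).hexdigest()
raw_binary = hashlib.sha256(struct.pack("<12d", *as_doubles)).hexdigest()
bytes_differ = raw_ascii != raw_binary

# ...but the canonical digests agree.
content_matches = canonical_digest(as_ints) == canonical_digest(as_doubles)
```

The sketch covers only one small corner (integer-valued monthly series); it says nothing about formats where precision or ordering cannot be agreed on.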
>> 
>> 
>>> 2.  The MODIS MOD02 data product contains the lat and long values
>>> for each location in the 1 km data (if I remember what we had to
>>> deal with on CERES).  If someone takes the spectral channels for 1
>>> km res data and extracts the lats and longs, and uses that for an
>>> analysis, I don't think most of us would assume that they've created
>>> "inauthentic" data.  So - is the "authentic" data the spectral
>>> radiances without the geolocation - or does the geolocation data
>>> have to accompany the spectral radiances?  If the answer is the
>>> spectral radiances, does the identifier have to refer just to that
>>> data?  If the answer is both, what identifier should someone quote
>>> who wants to use just the spectral data?
>> 
>> Now you are getting into subsets.  I'd prefer to simply postpone a
>> subset discussion by simply citing the whole file, even if you use
>> only part of it.  (Think of citing a fact from a paper.  I just cite
>> the DOI of the paper as a whole).
>> 
>> If you did have a valid canonicalization for the file as a whole, and
>> an identifier scheme that can distinguish subsets of the file, then
>> you could identify a valid canonicalization of the subset of the file.
>> 
>> We could also play with the "SEC" identifier (UUID of the
>> authoritative file) and tie that into identifiers for subsets.
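
One way such a tie-in could look: a minimal Python sketch, assuming the authoritative file already has a UUID and that a subset can be described by a short selector string. The selector syntax, the normalization rule, and the sample UUID are all hypothetical; `uuid.uuid5` (the standard name-based UUID) is used here only as one possible derivation.

```python
import uuid


def subset_identifier(file_uuid: uuid.UUID, selector: str) -> uuid.UUID:
    """Derive a stable identifier for a subset from the parent file's UUID.

    The selector is normalized (whitespace collapsed, lowercased) so that
    trivially different spellings map to the same identifier. The same
    (file UUID, selector) pair always yields the same result.
    """
    normalized = " ".join(selector.split()).lower()
    return uuid.uuid5(file_uuid, normalized)


# Placeholder UUID standing in for the authoritative file's identifier.
file_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")

radiances = subset_identifier(file_uuid, "variable=radiance")
geolocation = subset_identifier(file_uuid, "variable=latitude,longitude")
```

Because the derivation is deterministic, anyone holding the file's UUID and the selector can recompute the subset identifier without a central registry.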
>> 
>> But, as I said, let's put off identifiers for subsets of files until
>> we can at least get identifiers for the files themselves.
>> 
>>> As additional indicators of the difficulty, you can take the
>>> different formats available for images, with the format differences
>>> persisting over multiple decades (bmp, jpg, tiff, ps, and eps).
>>> Likewise, do you really expect NASA, NOAA, and DOD to agree on
>>> exactly the format and representation they'll use in common?  Or,
>>> for that matter that NASA and ESA will agree on identical data
>>> formats and sequential order in files of the "same" data?
>> 
>> Like I said, I don't think we need to have a solution that big to have
>> something useful.
>> 
>> Right now, we have a universe of files and no way to assert SEC for
>> any of them, even if they are scientifically equivalent.
>> 
>> If we can come up with a scheme to identify scientific equivalence for
>> some small corner of the universe, it could be useful.
>> 
>>> Additionally, I think reproducibility through complete provenance
>>> capture helps address this (though I acknowledge it doesn't solve
>>> it).
>>> 
>>> <brb>Don't agree at all!</brb>
>> 
>> Need more information.  What don't you agree with?  That capturing
>> provenance is useful at all?  That provenance information doesn't help
>> address reproducibility?  Or that reproducibility is simply a lost
>> cause and we shouldn't even aim for it?
>> 
>> 
>>> We have typically relied on a trusted curator to manage and affirm
>>> this.  We can't prove it, (especially in the case of a malicious
>>> curator), but we log cryptographic digests as we produce data, and
>>> distribute them with the data files.  We (EOSDIS) dictate formats for
>>> standard data products for the "authoritative" version and that is
>>> what gets archived and distributed.  As you point out, this only
>>> checks the physical bits.
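
A minimal Python sketch of that kind of fixity check, assuming SHA-256 digests and a simple mapping from file name to logged digest. Both are assumptions: the thread does not say which digest algorithm or manifest format is actually used, and the file name below is hypothetical. Like the approach described here, it checks only the physical bits, not scientific equivalence.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, manifest: dict[str, str]) -> bool:
    """Check a file against the digest logged for it at production time."""
    return manifest.get(path.name) == file_digest(path)


with tempfile.TemporaryDirectory() as workdir:
    granule = Path(workdir) / "granule.dat"  # hypothetical file name
    granule.write_bytes(b"example payload")

    # Digest logged when the file is produced and distributed with it.
    manifest = {granule.name: file_digest(granule)}
    unchanged = verify(granule, manifest)

    # Simulate a later alteration of the file's bytes.
    granule.write_bytes(b"example payload!")
    alteration_detected = not verify(granule, manifest)
```

Note that the check is only as trustworthy as the manifest itself; it does not address the question of who audits the curator.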
>>> 
>>> <brb>Why should we rely on unverifiable curation?  What techniques
>>> do we have to audit the claim of identity?  I do think auditing is
>>> possible and can be highly reliable.  However, it means that the
>>> curating authority has to present a chain of evidence that will
>>> somehow allow independent verification of the claim of unaltered
>>> scientific identity.  I think we have a responsibility to avoid
>>> claims that cannot be substantiated - and I don't think the current
>>> state of our claims of "unique identifiers" can be verified
>>> independently.  If we go back to "Applied Cryptography", we don't
>>> have a verifiable trust model.  Anyone in the security business can
>>> (and should) shoot at what we're claiming.  I do not believe our
>>> current position fits with long-term preservation.</brb>
>> 
>> As opposed to what we do now?  We constantly make unsubstantiated
>> claims.  Audits are rare in this world.
>> 
>> I think you and I agree substantiation and audits would be good and
>> useful.
>> 
>> If today we have no good identifiers and no substantiation, and our
>> goal is to have both good identifiers and substantiation, are you
>> objecting to proposing some identifiers prior to setting up a system
>> for substantiation?  We can certainly head there next, but I'd be
>> happy for self-certifying compliance with a standard as a first step.
>> Then we can move to independent audits, formal certification, etc.
>> 
>> I don't see failure to jump to a perfect end point as an argument
>> against taking the first baby step.
>> 
>>> In some cases, like the MODIS "process on demand" L1B, we can't do
>>> that.  We assert that we have the ability to reproduce an equivalent
>>> file (although with the current implementation, it actually performs
>>> what I call "reprocessing" rather than "reproducing" -- The
>>> difference being that reprocessing can use better versions of
>>> ancillary data files, or later versions of the algorithms rather
>>> than trying to apply a faithful attempt to make the same file.)
>>> 
>>> <brb>In short - it's not reproducible at all!  Bad guarantee of
>>> fixity!  Indeed, it sounds like a case of near-fraud,
>>> misrepresenting a reproduction as a (possibly bad) replica!</brb>
>> 
>> I misspoke -- I shouldn't have used the word equivalent there -- they
>> aren't really claiming to be reproducing the original files, no fixity
>> here.  They simply make the file when you ask for it.  The processing
>> step will make the best file they know how to make (not necessarily an
>> equivalent file to what they made last time.)
>> 
>> Curt
>> 
>> _______________________________________________
>> Esip-preserve mailing list
>> Esip-preserve at lists.esipfed.org
>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
> 


