[Esip-preserve] Identifiers Use Cases

Ruth Duerr rduerr at nsidc.org
Tue Apr 13 00:49:51 EDT 2010


Just a quick note, Bruce, about "classic" EOSDIS products.  In general  
you may be correct; but for MODIS we received "duplicate" data files  
(i.e., newly produced copies of data from ostensibly the same  
version) often enough that we had to work out a strategy for dealing  
with them...

;-) Ruth
On Apr 12, 2010, at 7:09 AM, Bruce Barkstrom wrote:

> I think it would be helpful to have some more concrete production
> and data use scenarios.  The Web site I've placed on line at oceandis
> has a whole collection of these with a table of contents at
>
> http://www.oceandis.com/metadata/Text_Documentation/Example/example_index.html
>
> Chapter 10 in this collection of pdf documents has a couple of figures
> (10.2 and 10.3) that show the buildup over time of a number of  
> versions
> (DTX's in Curt's nomenclature).  The production scenarios in these pdf
> chapters are simple compared with MODIS or CERES, but they have a fair
> amount of reality, being based on the CERES production for Level 1 data.
>
> For the largest component of Earth science data, the extensions of  
> the 2.6 PB in the
> "classic" EOSDIS data centers, production is hardly a random update
> process.  Rather, it proceeds pretty systematically.  It also seems  
> true
> that once a file is produced, it is not likely to be changed -  
> unless there
> was a cataloging error.  If a scientific error is identified, the  
> file will not be modified
> although the producer may reduce such an error when he/she creates a  
> new version of the data set.
>
> For data produced operationally, e.g. GOES images, radiosondes, HCN  
> or other
> surface networks, the producers are under such time pressure that  
> they do not
> have time to go back and revise the software or coefficients -  
> meaning that they
> do not produce new versions, although they may have changes in the  
> coefficients
> or code - sometimes noted and documented.
>
> In validation campaigns, of course, selecting data into subsets  
> that refer to particular times and places might yield a number of  
> interim files that have been processed in different ways.  These should probably
> not be regarded as "published" in the normal sense - they are not  
> "peer-reviewed",
> but are steps along the way.
>
> In short, having some well-documented production and data use  
> scenarios
> with dates of data collection, dates of production, and dates of  
> data use
> is critical to getting to the bottom of these issues.  It would also  
> be wise
> to do the appropriate "production engineering" to ensure we are  
> dealing with
> "high probability" scenarios first, rather than unrepresentative,  
> extreme
> cases.  I rather suspect that the scenarios divide into two classes:  
> one being
> the highly "regular" cases that provide the bulk of the data and a  
> second being
> the cases that cause a lot more work for the archives.
>
> Bruce B.
>
> At 06:45 AM 4/12/2010, Curt Tilmes wrote:
>> Ruth Duerr wrote:
>> > However, I think it is the citation that needs that, not the
>> > identifier for the data set.
>>
>> Yes.  That is one of the differences in the examples I showed for DOI
>> vs. PURL.  It is trivial to produce thousands of PURL identifiers, so
>> it makes sense to put the full qualification in the identifiers.  For
>> DOI, not so much, so I added an additional qualifier (in my proposed
>> case, the date/time) to the citation.  You distinguished identifiers
>> from citations with better wording on the main page.
>>
>> > I also think that, as you suggest in your use cases, the time of
>> > access is one possible mechanism for doing that
>>
>> We also looked at some hashing schemes or even arbitrary identifiers
>> that mapped to sets of granules, but nothing was as clean and easy to
>> use (and understand) for users or implementers as date/time.
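
[For concreteness, a minimal Python sketch of the kind of set-hashing scheme Curt mentions above; the granule names, delimiter, and truncation length are all hypothetical, not any scheme an archive actually adopted:

```python
import hashlib

def granule_set_id(granule_names):
    """Map an arbitrary set of granule names to a fixed-length
    identifier by hashing the sorted list.  Order-independent, but
    opaque: users cannot read a date/time back out of it."""
    digest = hashlib.sha256()
    for name in sorted(granule_names):
        digest.update(name.encode("utf-8"))
        digest.update(b"\n")  # delimiter keeps concatenation unambiguous
    return digest.hexdigest()[:16]

# The same set yields the same identifier regardless of listing order...
id_a = granule_set_id(["MOD021KM.A2010100.h09v05", "MOD021KM.A2010101.h09v05"])
id_b = granule_set_id(["MOD021KM.A2010101.h09v05", "MOD021KM.A2010100.h09v05"])
# ...while adding or removing a granule changes it.
id_c = granule_set_id(["MOD021KM.A2010100.h09v05"])
```

This illustrates the trade-off described above: a hash pins down the cited set precisely, but unlike a date/time it means nothing to a human reader. -ed.]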
>>
>> > (and that is probably the simplest mechanism from a citation
>> > standpoint though not necessarily from a user standpoint if for no
>> > other reason than it might have taken the user a month to download
>> > all the data they used and the data set may have undergone a whole
>> > host of updates over that time period).
>>
>> Ok, take that case.  How should we propose to handle it?
>>
>> In my scheme, the date/time in the citation is a point in time, so  
>> you
>> could either:
>>
>> 1. grab the original set of granules that were existing at the time
>> you start that long month of downloads and cite that date/time.
>>
>> or 2. double check the data set and grab any updates and cite the
>> later date/time.
>>
>> How else could we approach it and still maintain the precision of
>> citation?
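
[The two options above amount to resolving a point-in-time citation against the archive's holdings.  A toy Python model, assuming each granule records when it was published and when (if ever) it was superseded; the data structure and names are illustrative, not any real catalog's API:

```python
from datetime import datetime

# Hypothetical holdings: one granule is replaced mid-month.
HOLDINGS = [
    {"name": "granule-001.v1", "published": datetime(2010, 4, 1),
     "superseded": datetime(2010, 4, 15)},
    {"name": "granule-001.v2", "published": datetime(2010, 4, 15),
     "superseded": None},
    {"name": "granule-002.v1", "published": datetime(2010, 4, 1),
     "superseded": None},
]

def granules_as_of(holdings, cited_time):
    """Resolve a point-in-time citation to the set of granules
    that were current at `cited_time`."""
    return sorted(
        g["name"] for g in holdings
        if g["published"] <= cited_time
        and (g["superseded"] is None or g["superseded"] > cited_time)
    )

# Option 1: cite the start of the month-long download.
start_set = granules_as_of(HOLDINGS, datetime(2010, 4, 2))
# Option 2: re-check at the end and cite the later date/time.
end_set = granules_as_of(HOLDINGS, datetime(2010, 4, 30))
```

Either citation is precise; they simply name different sets when the data set changed during the download window. -ed.]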
>>
>> Curt
>> _______________________________________________
>> Esip-preserve mailing list
>> Esip-preserve at lists.esipfed.org
>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>
>


