[Esip-preserve] Identifiers Use Cases
Ruth Duerr
rduerr at nsidc.org
Tue Apr 13 00:49:51 EDT 2010
Just a quick note, Bruce, about "classic" EOSDIS products. In general
you may be correct; but for MODIS we received "duplicate" data files
(i.e., newly produced copies of data from ostensibly the same
version) often enough that we had to work out a strategy for dealing
with them...
;-) Ruth
On Apr 12, 2010, at 7:09 AM, Alice Barkstrom wrote:
> I think it would be helpful to have some more concrete production
> and data use scenarios. The Web site I've placed online at oceandis
> has a whole collection of these, with a table of contents at
>
> http://www.oceandis.com/metadata/Text_Documentation/Example/example_index.html
>
> Chapter 10 in this collection of PDF documents has a couple of
> figures (10.2 and 10.3) that show the buildup over time of a number
> of versions (DTX's in Curt's nomenclature). The production scenarios
> in these PDF chapters are simple compared with MODIS or CERES, but
> they have a fair amount of reality based on the CERES production for
> Level 1 data.
>
> For the largest component of Earth science data, the extensions of
> the 2.6 PB in the "classic" EOSDIS data centers, production is
> hardly a random update process. Rather, it proceeds pretty
> systematically. It also seems true that once a file is produced, it
> is not likely to be changed - unless there was a cataloging error.
> If a scientific error is identified, the file will not be modified,
> although the producer may reduce such an error when he/she creates a
> new version of the data set.
>
> For data produced operationally, e.g. GOES images, radiosondes, HCN
> or other surface networks, the producers are under such time
> pressure that they do not have time to go back and revise the
> software or coefficients - meaning that they do not produce new
> versions, although they may have changes in the coefficients or
> code - sometimes noted and documented.
>
> In validation campaigns, of course, the selection of data in subsets
> that refer to particular times and places might produce a number of
> interim files that have been worked over by different processes.
> These should probably not be regarded as "published" in the normal
> sense - they are not "peer-reviewed", but are steps along the way.
>
> In short, having some well-documented production and data use
> scenarios, with dates of data collection, dates of production, and
> dates of data use, is critical to getting to the bottom of these
> issues. It would also be wise to do the appropriate "production
> engineering" to ensure we are dealing with "high probability"
> scenarios first, rather than unrepresentative, extreme cases. I
> rather suspect that the scenarios divide into two classes: one being
> the highly "regular" cases that provide the bulk of the data, and a
> second being the cases that cause a lot more work for the archives.
>
> Bruce B.
>
> At 06:45 AM 4/12/2010, Curt Tilmes wrote:
>> Ruth Duerr wrote:
>> > However, I think it is the citation that needs that, not the
>> > identifier for the data set.
>>
>> Yes. That is one of the differences in the examples I showed for DOI
>> vs. PURL. It is trivial to produce thousands of PURL identifiers, so
>> it makes sense to put the full qualification in the identifiers.
>> For DOI, not so much, so I added an additional qualifier (in my
>> proposed case, the date/time) to the citation. You distinguished
>> identifiers from citations with better wording on the main page.
>>
>> > I also think that, as you suggest in your use cases, the time of
>> > access is one possible mechanism for doing that
>>
>> We also looked at some hashing schemes or even arbitrary identifiers
>> that mapped to sets of granules, but nothing was as clean and easy to
>> use (and understand) for users or implementers as date/time.
>>
>> > (and that is probably the simplest mechanism from a citation
>> > standpoint though not necessarily from a user standpoint if for no
>> > other reason than it might have taken the user a month to download
>> > all the data they used and the data set may have undergone a whole
>> > host of updates over that time period).
>>
>> Ok, take that case. How should we propose to handle it?
>>
>> In my scheme, the date/time in the citation is a point in time, so
>> you could either:
>>
>> 1. grab the original set of granules that existed at the time you
>> started that long month of downloads and cite that date/time, or
>>
>> 2. double-check the data set, grab any updates, and cite the later
>> date/time.
>>
>> How else could we approach it and still maintain the precision of
>> citation?
>>
>> Curt
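[A minimal sketch of the point-in-time citation scheme Curt describes,
under assumed data: granule IDs, versions, and insertion timestamps are
all hypothetical. A citation's date/time resolves to the latest version
of each granule that existed at that moment, so the two options above
simply correspond to citing the start or the end of the download window.]

```python
from datetime import datetime

# Hypothetical granule records: (granule_id, version, inserted_at).
# G001 is reprocessed mid-stream, as in Ruth's MODIS "duplicate" case.
granules = [
    ("G001", 1, datetime(2010, 1, 5)),
    ("G001", 2, datetime(2010, 3, 1)),   # newly produced copy of G001
    ("G002", 1, datetime(2010, 2, 10)),
    ("G003", 1, datetime(2010, 4, 2)),
]

def resolve_citation(granules, cited_at):
    """Return {granule_id: version} for the latest version of each
    granule that had been inserted on or before cited_at."""
    latest = {}
    for gid, version, inserted in granules:
        if inserted <= cited_at and (gid not in latest or version > latest[gid]):
            latest[gid] = version
    return latest

# Option 1: cite the start of the month-long download window.
print(resolve_citation(granules, datetime(2010, 2, 15)))
# Option 2: re-check for updates and cite the later date/time.
print(resolve_citation(granules, datetime(2010, 4, 15)))
```

Either choice keeps the citation precise: anyone re-resolving the same
date/time against the same insertion log gets exactly the same set of
granule versions.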
>> _______________________________________________
>> Esip-preserve mailing list
>> Esip-preserve at lists.esipfed.org
>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>
>