[Esip-preserve] Identifiers

Thu Feb 16 11:58:38 EST 2012

My last thought on the intractable "what is a data set" question. FRBR may help.

In separate but related correspondence with Joe Hourcle, Joe writes:

> And the only thing that you really need to worry about in FRBR is this one part:
> 
> 	'Book' has many meanings:
> 		The story
> 		The words that tell the story
> 		The packaging of the story
> 		The physical item
> 
> They then gave each of those meanings different names, and defined the standard relationships between them that are needed for cataloging.
> 
> The term 'data' isn't quite so clean, as the meanings aren't a nice progression like with 'book'.  It could mean:
> 
> 	Anything digital (computer science)
> 	Anything encoded (information science)
> 	Anything observed (most hard sciences)
> 	Anything that can serve as evidence (scientific modelers)
> 	Any collection of numbers
> 	A file containing 'data'
> 
> And to make things more fun, the OAIS definition of 'Data Object' includes physical specimens, so you'd have a whole overlapping Venn diagram of what is/isn't data within each community.
> 
> 

Enough,

-m.

On 16 Feb 2012, at 9:55 AM, Mark A. Parsons wrote:

> I have seen a data set referred to as a collection in some communities. I have also seen collection refer to a collection of data sets, like a collection from a field campaign or a particular instrument. This is also sometimes called a suite of data sets. I have also seen dataset (i.e. misspelled as one word) refer to what we often call granules. Think of datasets in THREDDS. I think these are all legitimate terms that mean things in context. I think the question Curt is really trying to address is what kind of identifiers do we assign to what kind of things, regardless of what we call them.
> 
> For DOIs, whose primary purpose is literary citation, I think they should be applied to some sort of logical grouping of data (that logic is defined by the author and stewards and might usually consider what data are typically used in conjunction with each other). These things are typically called data sets and/or collections or maybe AIPs. They can have different levels of hierarchy (or not) that get DOIs, such as one for the collection or suite and several for the data sets in that collection. DOIs are not suited for deep hierarchies or detailed identification, though, because of their financial and administrative costs. Perhaps another way of thinking about it is that DOIs should typically point to some sort of landing page, which implies that there is some sort of set or collection underlying that page, not just a single item.
> 
> More detailed items in the hierarchy or web of data need different kinds of identifiers. Perhaps also data that are arranged hierarchically may use different identifiers than those that are arranged in more of a linked-graph-based collection or distributed "e-science object". I'm sure there are other considerations as well. For example, one also needs to consider whether the identifier needs to be actionable.
> 
> While we can say in our community that certain identifiers should be used for "data sets" or "granules" or "collections" or whatever, what we really need to do is define the effective use of recommended identifiers for different types of things (including non-digital things and transformations of things). English is imprecise; one word doesn't always cut it.
> 
> Cheers,
> 
> -m. 
> 
> 
> 
> 
> On 16 Feb 2012, at 8:54 AM, Mark A. Parsons wrote:
> 
>> Personally, I think defining a data set too precisely is a fool's errand. It is the responsibility of the data authors and stewards to define something that makes sense for their designated community and slap a DOI and a name onto it.
>> 
>> To me a data set is simply a logical arrangement of data that has meaning to a designated community.
>> 
>> Your definition below, for example, does not work for many, perhaps the majority, of NSIDC data sets.
>> 
>> Cheers,
>> 
>> -m. 
>> On 16 Feb 2012, at 8:06 AM, Curt Tilmes wrote:
>> 
>>> On 02/15/2012 03:48 PM, Bruce Barkstrom wrote:
>>>> It would be useful to at least have some clear definitions of
>>>> things.
>>>> 
>>>> So - to go back to the "undefined term" "data set" does this term
>>>> refer to
>>> 
>>> Yes, we need to define it.  We keep putting it off.  Let's debate this
>>> one now.  We might not come to complete agreement, but perhaps we can
>>> refine this sufficiently to come up with something we can make use of.
>>> 
>>> 
>>> I use it for something comparable to the EOSDIS Data Model concept of
>>> Earth Science Data Type (ESDT) + Collection.
>>> 
>>> So, for example, the { "MODIS/Terra Snow Cover 5-Min L2 Swath 500m"
>>> (MOD10_L2), "Collection 5" } is one dataset.
>>> 
>>> { MOD10_L2, Collection 6 } would be a distinct dataset, and need a
>>> distinct identifier (eventually DOI).
>>> 
>>> 
>>> I'm also not wedded to the term "dataset" for this concept -- if
>>> someone can sell me on an alternative.  I just think we need some term
>>> for this concept we can all live with.  "dataset" is the most natural
>>> I can come up with.
>>> 
>>> 
>>> A couple notes for people who don't speak "NASA EOSDIS Data Model":
>>> 
>>> 1. A dataset is made up of granules.
>>> 
>>> 2. Each granule in the dataset was made in a "common" (I won't define
>>> that for now) way.
>>> 
>>> 3. Each granule in the dataset has a common format, metadata, filename
>>> convention, etc.  A reader for one granule will also be able to read
>>> another granule from the same dataset.
>>> 
>>> 
>>> I'll further add these definitions for discussion:
>>> 
>>> A "static dataset" doesn't change.  The set of granules and their
>>> particular contents is constant.
>>> 
>>> A "dynamic dataset" can change.  For example, the datasets above will
>>> grow every day since they are part of an ongoing NASA mission that
>>> keeps capturing and processing new data.  The granules that were part
>>> of the dataset yesterday and the granules that are part of the dataset
>>> today are different.  (I know this causes some folks heartburn, but it
>>> is a reality we need to accommodate.)
>>> 
>>> 
>>> Once we get dataset straight, we can talk about subsets/other
>>> aggregations.
>>> 
>>> 
>>> -- 
>>> Curt Tilmes
>>> U.S. Global Change Research Program
>>> 1717 Pennsylvania Avenue NW, Suite 250
>>> Washington, D.C. 20006, USA
>>> 
>>> +1 202-419-3479 (office)
>>> +1 443-987-6228 (cell)
>>> _______________________________________________
>>> Esip-preserve mailing list
>>> Esip-preserve at lists.esipfed.org
>>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>> 
>
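
For readers who find code easier to follow than prose, the dataset notion Curt describes in the thread above (ESDT + Collection, made up of granules, either static or dynamic) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names DatasetKey and Dataset, the add_granule method, and the granule name are all hypothetical and are not taken from the EOSDIS Data Model or any real library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetKey:
    """Hypothetical identity of a dataset: ESDT short name plus collection."""
    esdt: str        # e.g. "MOD10_L2"
    collection: str  # e.g. "5"


@dataclass
class Dataset:
    """A dataset is a set of granules that share one key."""
    key: DatasetKey
    dynamic: bool                      # True if granules may still be added
    granules: set = field(default_factory=set)

    def add_granule(self, granule_id):
        # A static dataset has a fixed set of granules, so refuse to grow it.
        if not self.dynamic:
            raise ValueError("static dataset: granule set is fixed")
        self.granules.add(granule_id)


# Same ESDT, different collections: two distinct datasets, each of which
# would need its own identifier.
c5 = Dataset(DatasetKey("MOD10_L2", "5"), dynamic=True)
c6 = Dataset(DatasetKey("MOD10_L2", "6"), dynamic=True)

# A dynamic dataset grows as the mission produces new data.
c5.add_granule("granule-0001")  # hypothetical granule name
```

The only point of the sketch is that the pair (ESDT, collection) carries the identity, while the granule membership of a dynamic dataset can differ from one day to the next without that identity changing.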