[Esip-preserve] On Earth Science Data File Uniqueness
Curt Tilmes
Curt.Tilmes at nasa.gov
Wed Feb 9 08:56:49 EST 2011
On 02/09/11 08:09, Bruce Barkstrom wrote:
> The uniqueness of UUID's wasn't the question. The point was that if
> the generator gave out "Bob", "Bill", "Jane", and so on, if there
> weren't a place to find out about who created the object and when,
> the identifier is simply another bunch of digital garbage in the
> file or the result set from the database. In other words, the ID's
> have a "social function".
Now you're getting into provenance and metadata. Those are also
critical and important issues, related to but distinct from
identifying and distinguishing data files.
There are numerous standards for those things and there are
conventions and standards for connecting them to the data, either
embedded in the data file (with a rich format), or with a tag-along
file, or by putting them in a database.
You still need to identify and distinguish the file you just made from
everything else in the world. UUIDs give you a nice, easy way to do
that.
A UUID can be a great primary key for the files in a database. If you
have your own key, it might not match someone else's key for that same
granule, or worse, it might duplicate their key for a different
granule. We're trying to converge on something that everyone could
use that would be, for all practical purposes, guaranteed to be
globally unique forever. I would argue that is useful even if there
isn't a single central database of every granule in the world forever.
(That would be nice, and we may get there sometime, but practically, I
don't see that happening in the near future.)
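
As a minimal sketch of the primary-key idea (not any actual archive
schema): the table layout, the register_granule helper, and the
filename-style id below are all hypothetical, chosen only to show that
two independently registered copies of the "same" granule get distinct
keys without any central coordination.

```python
import sqlite3
import uuid


def register_granule(conn, local_granule_id):
    """Insert a granule row keyed by a freshly generated random UUID."""
    granule_uuid = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO granules (uuid, local_granule_id) VALUES (?, ?)",
        (granule_uuid, local_granule_id),
    )
    return granule_uuid


# In-memory database with a hypothetical two-column granule table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE granules ("
    "uuid TEXT PRIMARY KEY, "
    "local_granule_id TEXT NOT NULL)"
)

# Two producers make the same granule with the same human-friendly id.
first = register_granule(conn, "EXAMPLE.A2011040.0000.hdf")
second = register_granule(conn, "EXAMPLE.A2011040.0000.hdf")

row_count = conn.execute("SELECT COUNT(*) FROM granules").fetchone()[0]
```

Both inserts succeed because the primary key is the UUID, while the
human-friendly id stays available as an ordinary, non-unique column.
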
We put a lot of thought/effort into MODIS localgranuleids, for
example, but we still blew up the database on at least one occasion by
making two granules with the same localgranuleid.
MODIS localgranuleids (filenames) include a bunch of basic metadata so
they are human friendly. They're probably pretty incoherent to
someone totally unfamiliar with them, but with a little knowledge you
can decipher them visually and tell what you're looking at.
Unfortunately, that basic metadata can end up identical, and we need
unique identifiers. We thought we'd be smart and tacked on a
"production time stamp". That should always make them unique, right?
Well, in testing, if you kick off a bunch of processing and happen to
make the same granule with the same basic metadata (the stuff in the
filename) at the same time, you end up making two granules with the
same localgranuleid.
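
The collision can be sketched as follows. The naming scheme here is a
made-up stand-in, not the real MODIS localgranuleid format, and the
one-second timestamp resolution is an assumption for illustration: two
jobs that produce the same granule within the same second end up with
identical ids, while random UUIDs generated for the same two outputs
differ.

```python
import uuid
from datetime import datetime


def timestamp_id(product, acquisition, produced_at):
    """Hypothetical filename-style id: basic metadata plus a production
    time stamp with one-second resolution."""
    return f"{product}.{acquisition}.{produced_at:%Y%j%H%M%S}.hdf"


# Two jobs make the same granule and finish within the same second.
produced_at = datetime(2011, 2, 9, 8, 56, 49)
id_a = timestamp_id("EXAMPLE", "A2011040.0000", produced_at)
id_b = timestamp_id("EXAMPLE", "A2011040.0000", produced_at)
collides = id_a == id_b

# Random (version 4) UUIDs for the same two outputs.
uuid_a = uuid.uuid4()
uuid_b = uuid.uuid4()
```
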
Today, MODIS processing is still pretty hard. Very few people try to
reproduce the processing done in the central system. I'd like to make
that easier and more accessible. If we do that, we need a clean way
to always distinguish the identifiers that everyone is using when they
make the same granule the same way. There are many complicated ways
we could do that. Each relies on people following some conventions,
or centrally registering somewhere, or any one of many different
schemes we could suggest to make unique identifiers.
There's also an easy way -- just say: use UUIDs for everything.
Curt