[Esip-preserve] On Earth Science Data File Uniqueness

Bruce Barkstrom brbarkstrom at gmail.com
Tue Feb 15 08:29:38 EST 2011


I think we're going to have to work this through in detail, meaning scenarios
at about the level required for authentication and cryptography.  MD5 and
SHA-1 both tie into bit-level content, so two files would have to be the same
at that level to get the same identifier.  This would give the same identifier
to copies of files created as backups.  The other versions of UUIDs would
have separate identifiers, but then you have the "orphan file problem" we
discussed before: you need a registry of IDs to know the backup copy is
a bit-for-bit copy of the original.  If we go this route, we're going to need
a real mathematician, probably one familiar with number theory and such.
I don't qualify; I'm only an applied mathematician.
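
A minimal Python sketch of the trade-off described above, using only the
standard library (hashlib and uuid).  The file content, the registry
dictionary, and the helper names content_id, register, and copies_of are
illustrative assumptions, not part of any agreed design.

```python
import hashlib
import uuid


def content_id(data: bytes) -> str:
    """Identifier derived purely from the bits (SHA-1 here, as in the discussion)."""
    return hashlib.sha1(data).hexdigest()


original = b"granule bytes"   # hypothetical file content
backup = bytes(original)      # bit-for-bit backup copy

# Content-derived identifiers: the original and its backup share one ID.
same_digest = content_id(original) == content_id(backup)

# Random (version 4) UUIDs: each copy gets its own ID, so the link between
# them has to be recorded somewhere; here, in a hypothetical registry.
registry = {}  # maps UUID -> content digest


def register(data: bytes) -> uuid.UUID:
    """Assign a fresh random UUID and record the content digest alongside it."""
    file_id = uuid.uuid4()
    registry[file_id] = content_id(data)
    return file_id


def copies_of(file_id: uuid.UUID) -> list:
    """Return every other registered ID whose content digest matches file_id's."""
    digest = registry[file_id]
    return [other for other, d in registry.items()
            if d == digest and other != file_id]


original_id = register(original)
backup_id = register(backup)

print(same_digest)                            # True
print(original_id != backup_id)               # True
print(copies_of(original_id) == [backup_id])  # True
```

Without the registry lookup, nothing in the two random UUIDs themselves
says that the backup is a copy of the original, which is the orphan file
problem in miniature.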

As a minor note, I believe both MD5 and SHA-1 are considered "broken"
(or "slightly flawed") cryptographic digests.  This means that there might
be some way for someone to forge IDs.  I don't know of any successful
uses of the vulnerability, but most cryptographers would probably think
that is just a matter of time.

Bruce B.

On Mon, Feb 14, 2011 at 3:34 PM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:

> On 02/09/11 14:58, Ruth Duerr wrote:
>
> Caveat: I'm not a UUID expert (though I'm considering reading up more
> on them...)
>
> > I've heard good compelling arguments for two competing best
> > practices for using UUID's (Chris' use of a message digest form of
> > UUID and Curt's pure Unique Identifier form of UUID).
>
> There are variants of the UUID algorithm that incorporate message
> digest (hash) algorithms, including MD5 (UUID version 3) and SHA-1
> (UUID version 5).  Those algorithms hash a namespace and a name to make
> the UUID, so the result is still unrelated to the actual content of the
> object being identified.
>
> There are other approaches to distinguishing objects strictly by their
> content that use a message digest of the content to make a unique
> identifier for the object.  Such schemes can only be used if there is
> never a need to distinguish objects with the same content.
>
> A digest of the content is a good way to verify the
> integrity/fixity of the content.  Such a use is orthogonal to whether
> or not that digest is the identifier of the object.
>
>
> For me, the question boils down to one issue.  Will distinct objects
> under the scope of this data model ever have the same content or not?
>
>
> To illustrate:
>
> Suppose I have two HDF files, "a" and "b".
>
> Suppose we have a process "P" that gets a subset of an HDF data file
> and produces a small image file from it.
>
> I apply P to "a" and produce image a.png.  Call this "Job 1".
>
> I apply P to "b" and produce image b.png.  Call this "Job 2".
>
> Just by chance, process P happens to pick an area of the data that
> happens to be all black.  The content of a.png is equal to the content
> of b.png.
>
> Now we're going to try to make identifiers for the two image files and
> store them in our database.
>
> With UUIDs, I make two identifiers:
>  5b849030-d964-44ef-a2a1-e3e20cd18637
>  d4cd7a92-9418-440d-9c39-a359d2b55944
>
> I record the fact that 5b849030-d964-44ef-a2a1-e3e20cd18637 was
> generated by "Job 1", using file "a" as an input.  (well, it would
> have a UUID too, but you get the picture).
>
> I record the fact that d4cd7a92-9418-440d-9c39-a359d2b55944 was
> generated by "Job 2", using file "b" as an input.
>
> I can clearly distinguish the two files by their unique identifiers.
>
> With an identifier derived from the content, I get an MD5 (e.g.) for
> the first object, 3da607d21285eb08f7d40ef8dd028d35, and store the fact
> that 3da607d21285eb08f7d40ef8dd028d35 was generated by "Job 1", using
> file "a" as an input.
>
> When I make the second file, I get the same content and therefore the
> same identifier.  I can't put it in my database as a new object, and I
> can't associate it with "Job 2".  (Well, you can, but the data model
> gets really messy: your DAGs aren't right any more.)  Then try to query
> "Who created the object?" and you get two answers!  Tomorrow you may
> get three answers!
>
>
> If we all agree that no process ever run under the scope of our data
> model will ever create distinct objects with the same content
> (including my old friend d41d8cd98f00b204e9800998ecf8427e, the MD5
> of an empty file), then we can accept that a digest of the
> content is sufficient to distinguish the objects.  If we allow for the
> possibility of duplicating the content in distinct objects, then we
> need some identifier other than the digest of the content.
>
> Curt
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>
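
The two-job scenario in the quoted message can be replayed in a few lines
of Python (standard library only).  This is a sketch under stated
assumptions: the 64 zero bytes are a stand-in for the all-black image
rather than real PNG data, the dictionaries stand in for the database,
and the example.org names are hypothetical.

```python
import hashlib
import uuid

# Stand-in bytes for two all-black images (placeholder, not real PNG data).
a_png = b"\x00" * 64      # produced by "Job 1" from file "a"
b_png = bytes(a_png)      # produced by "Job 2" from file "b", same content

jobs = [("Job 1", "a", a_png), ("Job 2", "b", b_png)]

# Scheme 1: a random UUID is the identifier; the MD5 is kept only as a
# fixity attribute.  Each job gets its own provenance record.
by_uuid = {}
for job, source, content in jobs:
    by_uuid[uuid.uuid4()] = {
        "job": job,
        "input": source,
        "md5": hashlib.md5(content).hexdigest(),
    }

# Scheme 2: the MD5 of the content is the identifier.  Both jobs land on
# the same key, so "who created this object?" has more than one answer.
by_digest = {}
for job, source, content in jobs:
    by_digest.setdefault(hashlib.md5(content).hexdigest(), []).append(job)

creators = by_digest[hashlib.md5(a_png).hexdigest()]

# Name-based UUIDs (version 5) hash a namespace plus a name, not the file
# content: the same name always gives the same UUID.
name = "http://example.org/job-1/a.png"   # hypothetical name
v5_first = uuid.uuid5(uuid.NAMESPACE_URL, name)
v5_again = uuid.uuid5(uuid.NAMESPACE_URL, name)
v5_other = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/job-2/b.png")

empty_md5 = hashlib.md5(b"").hexdigest()

print(len(by_uuid))          # 2
print(len(by_digest))        # 1
print(creators)              # ['Job 1', 'Job 2']
print(v5_first == v5_again)  # True
print(v5_first != v5_other)  # True
print(empty_md5)             # d41d8cd98f00b204e9800998ecf8427e
```

Keeping the digest as an attribute of a UUID-keyed record, as in scheme 1,
is one way to have both the fixity check and distinct identifiers for
objects that happen to share content.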