[Esip-preserve] Fwd: [CF-metadata] Multiple file datasets (was: Swath observational data)
Christopher Lynnes
Chris.Lynnes at nasa.gov
Fri Nov 20 09:30:00 EST 2009
An interesting discussion of SAFE on the cf-metadata mail list...
Begin forwarded message:
> From: Stephen Emsley <SEmsley at argans.co.uk>
> Date: November 20, 2009 8:12:09 AM EST
> To: John Caron <caron at unidata.ucar.edu>, "cf-metadata at cgd.ucar.edu" <cf-metadata at cgd.ucar.edu>
> Subject: Re: [CF-metadata] Multiple file datasets (was: Swath
> observational data)
>
>>> Can anyone summarize what SAFE does?
>
> I will give it a shot as I brought it up in the first place!
>
> The Standard Archive Format for Europe (SAFE) was developed as a
> common format for archiving to ensure long-term preservation of
> Earth Observation (EO) data holdings, both historical and
> operational. The SAFE website [www.esa.int/safe] is the official
> ESA-maintained site for the maintenance and distribution of the
> standard format, specification, XML schemas and tools.
>
> SAFE is a specialisation of the XML Formatted Data Unit (XFDU), a
> CCSDS (Consultative Committee for Space Data Systems) recommended
> standard for the packaging of data and metadata to facilitate
> information transfer and archiving. Every SAFE product is an XFDU
> package: SAFE restricts the generic XFDU package, inheriting its
> main structure from the XFDU packaging format and adding high-level
> constraints and new rules for Earth Observation ground segment data
> products.
>
> A SAFE product wraps, or references, data and associates that data
> with metadata, both global and local. SAFE product metadata contains
> basic information, such as the acquisition period, platform and
> sensor identification and a processing history to ensure
> traceability. For each included, or externally referenced, dataset,
> another layer of associated metadata may be attached, providing
> orbit and geo-location information, quality information and
> representational information.
>
> Basically a SAFE product is a directory. At the top level is a
> manifest file, written in XML, that provides a map of the contained
> datasets, defines the relationships between these datasets, and
> contains global metadata (such as platform name, acquisition period
> etc.). There is a set of required metadata defined by the SAFE
> specialisation (e.g. there is an ENVISAT specialisation, further
> restricted to apply to, say, MERIS, and still further specialised
> to, say, Level 1 processed products).
>
> The contained datasets are collections of records. They are of three
> types:
>
> Measurement Data Sets: These are typically binary format files and,
> in our case, will be netCDF-CF files. As an example, we will have 46
> measurement data products, each stored in a netCDF file (data
> record), along with a data record containing associated quality
> information and another containing status flags.
>
> Annotation Data Sets: These contain metadata and common data.
> Although this is still to be decided, in the case of Sentinel 3
> Level 2 we are considering storing a common set of coordinate data
> that is applicable to subsets of the measurement data. The manifest
> file will provide the association between specific measurement
> datasets and the associated coordinate data.
>
> Representation Data Sets: These are XML Schema descriptions of the
> measurement and annotation datasets. Firstly, this is a key concept
> for OAIS digital preservation; secondly, third-party applications
> may use these schemas for displaying or accessing the corresponding
> measurement data sets. I appreciate that it might seem a little
> 'belt-and-braces' to have an XML schema for a netCDF file (which is
> by nature self-describing) but that is how the SAFE people have
> decided to include netCDF into the convention.
>
> There is a further type of data, which can be considered as
> resources. These may be, for instance, data required for the
> generation of the end-user data products. For Level 2 data products
> they would include the Level 1 input products and possibly ECMWF
> data required for processing (although the latter might equally be
> an annotation dataset). These resources are not packaged inside a
> SAFE container but are referenced (in the manifest file) using a
> URI.
>
> All of these taken together are a SAFE package.
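The structure described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class names, field names and file names below are hypothetical and are not taken from the SAFE specification or its XML schemas. It models the three contained data set types, the associations recorded in the manifest, and resources that are referenced by URI rather than packaged.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataSet:
    """One contained data set (hypothetical model, not the SAFE schema)."""
    name: str
    kind: str  # 'measurement', 'annotation' or 'representation'
    path: str  # location relative to the product directory


@dataclass
class SafeProduct:
    """Hypothetical in-memory view of what the manifest describes."""
    global_metadata: Dict[str, str]
    datasets: List[DataSet] = field(default_factory=list)
    # measurement name -> names of the annotation data sets tied to it
    associations: Dict[str, List[str]] = field(default_factory=dict)
    # resources are not packaged; the manifest only holds their URIs
    external_resources: List[str] = field(default_factory=list)

    def of_kind(self, kind: str) -> List[DataSet]:
        return [d for d in self.datasets if d.kind == kind]

    def annotations_for(self, measurement: str) -> List[DataSet]:
        wanted = set(self.associations.get(measurement, []))
        return [d for d in self.of_kind("annotation") if d.name in wanted]


# Placeholder values only; none of these names come from a real product.
product = SafeProduct(
    global_metadata={"platform": "EXAMPLE-PLATFORM"},
    datasets=[
        DataSet("measurement_a", "measurement", "measurement_a.nc"),
        DataSet("coords", "annotation", "coords.nc"),
        DataSet("measurement_a_schema", "representation", "measurement_a.xsd"),
    ],
    associations={"measurement_a": ["coords"]},
    external_resources=["http://example.org/level1-input"],
)
```

Here product.annotations_for("measurement_a") returns the shared coordinate data set, which mirrors the role the manifest plays in tying measurement data to common coordinate data.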
>
> I hope that this provides a reasonably informative overview. The
> SAFE website is the place to go for more detailed info.
>
> Steve
>
>
> ---
> Dr Stephen Emsley
> ARGANS Limited
> Tel: +44 (0)1752 764 289
> Mobile: +44 (0)7912 515 418
>
>
> -----Original Message-----
> From: cf-metadata-bounces at cgd.ucar.edu [mailto:cf-metadata-bounces at cgd.ucar.edu] On Behalf Of John Caron
> Sent: 20 November 2009 12:30
> To: cf-metadata at cgd.ucar.edu
> Subject: [CF-metadata] Multiple file datasets (was: Swath
> observational data)
>
> This topic deserves its own heading, so here it is.
>
> Perhaps we should gather current practices and ideas. I think
> Balaji's gridspec has a proposal about this. Can anyone summarize
> what SAFE does?
>
> I'm imagining how this is actually used, e.g.:
>
> float data(y,x);
> data:coordinates = "lat@file1 lon@file2";
>
> ????
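One way to read the snippet above is that each entry of the coordinates attribute names a variable and the file holding it, joined by "@" (which this archive renders as " at "). The Python sketch below parses such an attribute. The syntax is only the hypothetical form floated above and is not part of CF, and parse_coordinates is an illustrative name.

```python
from typing import List, Optional, Tuple


def parse_coordinates(attr: str) -> List[Tuple[str, Optional[str]]]:
    """Split a coordinates attribute into (variable, file) pairs.

    An entry without '@' refers to a variable in the same file, as in
    ordinary CF usage, and is returned with the file set to None.
    """
    # Mailing-list archives render '@' as ' at '; undo that first.
    # Caveat: this would misread a variable that is literally named 'at'.
    normalised = attr.replace(" at ", "@")
    pairs = []
    for token in normalised.split():
        name, sep, target = token.partition("@")
        pairs.append((name, target if sep else None))
    return pairs


pairs = parse_coordinates("lat@file1 lon@file2")
```

The file part is kept as an opaque string here, so it could equally be a relative path or an identifier of the kind discussed further down the thread.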
>
>
>
> John Graybeal wrote:
>> I like Bryan's recommendation for a UUID or similar.
>>
>> Now I'm going to be annoying and suggest the UUID *could* be a URI,
>> or these days, an IRI (International ..).
>>
>> And I think the way of 'locating' the file should be neither in
>> packaging nor in local resolution; it should be in global namespace
>> resolution. This is the way of the future, and is already more
>> 'permanent' than either packaging or local resolution, IMHO.
>>
>> There is one form of URI in particular that is already resolvable: a
>> URL. OK, that's an old song, but I'm gonna stick to it for a while
>> longer. That form meets all the other requirements: it can be
>> registered in a resolver, it can be guaranteed unique (to the same
>> authority level as a UUID, anyway), and it is a unique string that
>> can be used to validate the link. And it has the obvious benefit of
>> being resolvable right now, for as long as the domain is held and
>> properly maintained (Good URLs don't die).
>>
>> Since the last paragraph risks starting another unique identifier
>> war, I promise not to re-engage unless someone asks me to.
>> Meanwhile, I like
>>
>> John
>>
>>
>> On Nov 19, 2009, at 22:23, Bryan Lawrence wrote:
>>
>>> On Thursday 19 November 2009 19:40:08 Jonathan Gregory wrote:
>>>>> ... In some cases, referencing attributes such as
>>>>> "coordinates" and "ancillary_variables" would, ideally, point to
>>>>> a variable in a different dataset.
>>>>
>>>> This is a general problem to which CF doesn't have a solution
>>>> because it was conceived as a convention for single netCDF files.
>>>> However we need a solution as often several files should be
>>>> treated as a single dataset.
>>>>
>>>> If the files don't overlap, i.e. their contents are complementary,
>>>> I think it should be satisfactory to allow variables in one file
>>>> to be pointed to by name from another file, with no other
>>>> mechanism being required within the file. I don't like the idea of
>>>> naming one file within another file, as that would be very
>>>> fragile. Instead, I think the file aggregation should be implied
>>>> by simply defining the group of files which are to be treated as
>>>> one file, e.g. by putting them in one directory.
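The resolve-by-name idea above can be sketched in a few lines of Python. This is an illustration under assumptions, not a CF mechanism: the group of files is represented as a plain mapping from file name to the variable names it holds (in practice it might be built by listing one directory), and locate_variable is a hypothetical helper. The check for duplicates reflects the condition that the files' contents must be complementary.

```python
from typing import Dict, Set


def locate_variable(name: str, files: Dict[str, Set[str]]) -> str:
    """Return the file holding `name` within a group treated as one dataset."""
    holders = sorted(f for f, variables in files.items() if name in variables)
    if not holders:
        raise KeyError(f"{name!r} not found in any file of the group")
    if len(holders) > 1:
        # Contents overlap, so resolution by name alone is ambiguous.
        raise ValueError(f"{name!r} is defined in several files: {holders}")
    return holders[0]


# Hypothetical group: the data variable and its coordinates live apart.
group = {
    "data.nc": {"data"},
    "coords.nc": {"lat", "lon"},
}
lat_file = locate_variable("lat", group)
```

No file names appear inside the files themselves in this scheme; the grouping is defined entirely from outside, which is what makes it robust to renaming but dependent on the files staying together.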
>>>
>>> It's the old ones that are the best ones :-) :-) this issue keeps
>>> on coming back ... :-) :-) and we keep trying to ignore it ...
>>>
>>> I think we agree that an actual physical filename including path is
>>> useless. We need both a relative link which relies on the
>>> preservation of a group of files in a particular arrangement ...
>>> AND an internal identifier so more robust linking mechanisms can be
>>> used when (if) the data ends up in a managed environment.
>>>
>>> I think it's crucial in this situation to ensure that each file has
>>> a unique identifier within it (created, for example, with uuid),
>>> because all solutions which rely on packaging are fragile (SAFE is
>>> probably better than most), but the bottom line is that users move
>>> files around ... and we need some way of ensuring that we/they can
>>> validate that the links in place are the ones that were originally
>>> intended.
>>>
>>> So relative links would also include the identifier of the intended
>>> target as well as the relative path in operating system agnostic
>>> terms.
>>>
>>> That identifier can be used in two ways: a) to validate the link (my
>>> software can always check that the variable that I just opened
>>> following a link from another one is the one that was expected by
>>> checking the container identifier), and b) to produce an identifier
>>> resolver service for the situation where the packaging has had to be
>>> broken (which might occur for performance reasons or ...)
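The two uses of the identifier described above might look like this in Python. It is a sketch under assumptions rather than anything CF defines: Link and follow are hypothetical names, ids_by_path stands in for opening each file and reading the identifier stored inside it, and resolver stands in for an identifier resolution service. The relative path is held as a PurePosixPath so that it stays operating system agnostic.

```python
import uuid
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict, Optional


@dataclass(frozen=True)
class Link:
    relative_path: PurePosixPath  # POSIX form keeps the path OS-agnostic
    target_id: str                # identifier expected inside the target


def follow(link: Link,
           ids_by_path: Dict[PurePosixPath, str],
           resolver: Dict[str, PurePosixPath]) -> Optional[PurePosixPath]:
    """Return where the intended target is, or None if it cannot be found."""
    # (a) Validate: the file at the relative path must carry the expected id.
    if ids_by_path.get(link.relative_path) == link.target_id:
        return link.relative_path
    # (b) Packaging broken or file replaced: fall back to the resolver.
    return resolver.get(link.target_id)


# Hypothetical example: the same target before and after being moved.
coords_id = str(uuid.uuid4())
link = Link(PurePosixPath("coords/geo.nc"), coords_id)

intact = {PurePosixPath("coords/geo.nc"): coords_id}
moved = {PurePosixPath("elsewhere/geo.nc"): coords_id}
resolver = {file_id: path for path, file_id in moved.items()}
```

With the package intact, follow(link, intact, {}) returns the relative path unchanged. Once the file has moved, follow(link, moved, resolver) finds it through the identifier instead, and a file that sits at the right path with the wrong identifier is rejected rather than silently used.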
>>>
>>> CF could recommend something like this ...
>>>
>>> Bryan
>>>
>>> --
>>> Bryan Lawrence
>>> Director of Environmental Archival and Associated Research
>>> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
>>> STFC, Rutherford Appleton Laboratory
>>> Phone +44 1235 445012; Fax ... 5848;
>>> Web: home.badc.rl.ac.uk/lawrence
>>> _______________________________________________
>>> CF-metadata mailing list
>>> CF-metadata at cgd.ucar.edu
>>> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
>>
>>
>> --------------
>> I have my new work email address: jgraybeal at ucsd.edu
>> --------------
>>
>> John Graybeal <mailto:jgraybeal at ucsd.edu>
>> phone: 858-534-2162
>> Development Manager
>> Ocean Observatories Initiative Cyberinfrastructure Project:
>> http://ci.oceanobservatories.org
>> Marine Metadata Interoperability Project: http://marinemetadata.org
>>
>
--
Christopher Lynnes NASA/GSFC, Code 610.2
301-614-5185