[Esip-documentation] definitive data set identification

John Graybeal jbgraybeal at sonic.net
Thu Jan 21 15:30:36 EST 2021


Hi Nan, hope you are doing well.

This problem is hard for all sorts of artifacts, and the answers I've seen in community projects have all required implementing some kind of provenance chain for all the data. As your problem statement illustrates, the problem often extends beyond pure duplication, to all kinds of derived, partial, and versioned instances of the original data.

So when a pipeline generates the results, the solution has to include the pipeline self-documenting the provenance of the new data set, that is, the processes that produce it. I know MBARI's Shore Side Data System did (and likely still does) that, and the original OOI CI system was designed to do it. Within the last decade, the existence of ontologies like PROV and PAV, and of more advanced metadata standards, allows declaring the exact relationship of a derivative artifact to its parent artifacts. (The relationship definitions can be seen in the ontologies, and can be augmented if needed to reflect a particular relationship you need.) I am pretty sure the ISO standards (19115 in particular) can specify provenance relations to the input artifacts. With any of these systems, you also need an unambiguous way to identify every existing data set in the system, which I expect OceanSITES has.
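To make that concrete, here is a minimal sketch of what a derived data set's metadata record might look like using PROV and PAV terms in JSON-LD. The dataset identifiers and the activity description are hypothetical placeholders, not real OceanSITES IDs; only the prov: and pav: property names come from the ontologies themselves.

```python
import json

# Hypothetical identifiers -- OceanSITES would substitute its own dataset IDs.
record = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "pav": "http://purl.org/pav/",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/dataset/flux-product-v2",
    # The derivative points back at the parent observational data set...
    "prov:wasDerivedFrom": {"@id": "https://example.org/dataset/met-obs-v1"},
    "pav:derivedFrom": {"@id": "https://example.org/dataset/met-obs-v1"},
    # ...and at its own previous version.
    "pav:previousVersion": {"@id": "https://example.org/dataset/flux-product-v1"},
    # The pipeline step that produced it, self-documented as a prov:Activity.
    "prov:wasGeneratedBy": {
        "@type": "prov:Activity",
        "dct:description": "Bulk flux computation, averaged to 1-hour time base",
    },
}

print(json.dumps(record, indent=2))
```

A client that understands these two properties can distinguish an original observation from any gridded, averaged, or recalibrated copy of it.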

Of course, while it is relatively straightforward to modify one workflow or system to start adding the appropriate provenance metadata to each data set, OceanSITES is dealing with an entire data ecosystem. I can't recall which data and metadata standards you are already using; hopefully they are extensible enough to add these new relations, if they don't have them already. Beyond that, there is the social and UI work of getting everyone who submits data to document its relations properly, and then any software using the system has to be "smart enough" to understand the relations in the metadata, and not display to the user, or use in computations, two data points from the same data collection.

It would be a real advance if there were a universal standard for incorporating provenance relation declarations within netCDF, for example. I'm pretty sure a few communities are already doing something like that; I just don't know for sure who's still doing it. Sorry I can't be more helpful.
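For illustration only, here is one way per-variable provenance could be carried in netCDF attributes, shown as plain Python dicts rather than a netCDF file. The attribute names (provenance_role, provenance_source) are invented for this sketch and are not an existing netCDF or CF convention; cell_methods is the one genuinely standard (CF) attribute shown.

```python
# Hypothetical per-variable attributes; only "cell_methods" is a real
# CF convention, the provenance_* names are invented for this sketch.
variables = {
    "AIRT": {
        "long_name": "air temperature",
        "provenance_role": "original",
    },
    "AIRT_copy": {
        "long_name": "air temperature (averaged to flux time base)",
        "provenance_role": "derived",
        "provenance_source": "met-v1/AIRT",   # hypothetical parent variable id
        "cell_methods": "time: mean",
    },
}

# A client can then pick out the authoritative variables directly:
originals = [name for name, attrs in variables.items()
             if attrs.get("provenance_role") == "original"]
print(originals)
# -> ['AIRT']
```

Whatever the spelling ends up being, the essential ingredients are a role flag and a pointer to the parent, at variable granularity rather than file granularity.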

John

(shameless promotion: If you need to model or prototype your metadata structure for provenance metadata, and/or create a UI for users to enter descriptive metadata into a computable text file (JSON-LD or RDF) that follows a well-defined specification, CEDAR at metadatacenter.org <http://metadatacenter.org/> may be worth a look. There's already one example of a Generic Dataset Metadata Template <https://openview.metadatacenter.org/templates/https://repo.metadatacenter.org/templates/dd6231ef-5890-48cb-9621-04c5b5577c1e> that I helped create, though we were a bit agnostic about choosing a particular provenance approach, and it's still in progress. Details on request.)

> On Jan 21, 2021, at 6:27 AM, Nan Galbraith via Esip-documentation <esip-documentation at lists.esipfed.org> wrote:
> 
> Hi all - 
> 
> The OceanSITES data management team is hoping to solve a problem 
> with identifying duplicate or secondary instances of data sets on our 
> servers. We work with in situ observational data sets, which are often 
> used by modelers and remote sensing systems. If these users unknowingly 
> access duplicate copies of data, it may skew their results by inaccurately
> weighting these data points.
> 
> We originally tried to ensure that we had only one copy of any given 
> data point on our server, but that hasn't proved to be practical. Certain 
> kinds of computed data sets, like PCO2 and surface fluxes, are more 
> useful to end users if the files contain copies of the component observed
> data variables used in their calculations. These copies may start out at a
> different rate from the originals, being gridded or averaged to match the
> time base of the related data, or, over time, the original data may change 
> slightly, as calibrations, algorithms, or clock adjustments are updated.
> 
> My question to the documentation cluster is whether you know of
> any community standards that identify a given data variable as the
> authoritative or 'original' copy. I haven't encountered any kind of
> standard for this, but I may not be looking in the right places. I feel
> that there may be a solution related to DOIs, but ... it wouldn't be
> meaningful unless our data users knew about it, and were prepared
> to use it, and if we acquired a DOI for each observed variable in a
> data set.
> 
> Any ideas on this would be very welcomed; we try, whenever possible, to 
> adopt existing standards instead of inventing our own one-off solutions.
> 
> Thanks in advance - 
> Nan Galbraith
> 
> 
> -- 
> *******************************************************
> * Nan Galbraith        Information Systems Specialist *
> * Upper Ocean Processes Group            Mail Stop 29 *
> * Woods Hole Oceanographic Institution                *
> * Woods Hole, MA 02543                 (508) 289-2444 *
> *******************************************************
> _______________________________________________
> Esip-documentation mailing list
> Esip-documentation at lists.esipfed.org
> https://lists.esipfed.org/mailman/listinfo/esip-documentation
> 

----------------------
John Graybeal
Administrator—ESIP Community Ontology Repository
jbgraybeal at sonic.net


