[Esip-preserve] Citations

Alice Barkstrom alicebarkstrom at verizon.net
Mon Apr 19 14:45:08 EDT 2010


At 11:24 AM 4/16/2010, Ruth Duerr wrote:
>Hi Bruce,
>
>So where is this table?  I couldn't find it on the wiki.

I haven't had time to insert tables into the Wiki.  Instead, I've attached
three spreadsheets (with two formats each - one for Open Office and one
for MS Excel).

One spreadsheet breaks down the collection characterization in the
NSB report.  From my perspective, this report deals more with institutions
than with the characteristics of collections that need to be worked into
the way we deal with citations.

A second spreadsheet takes some of the data collections we've been
discussing in the Identifier WG and tries to fill in key information regarding
the structure of the collections, particularly with respect to how many files
we have to deal with in a collection, how the various sub-collections fit
together, and how much volatility the collection has.  I'm pretty dissatisfied
with my own scholarship in this spreadsheet; however, perhaps it will move
us in a direction that will assist us in getting to a robust discussion.  I don't
think we can avoid the complexity of the problem - or the very large number
of files we have to deal with.  While simple use cases would fit with the needs
of small communities within the Earth sciences, our identifier schema would
break when it encounters the really large-scale problems that come with
operational data sets from satellites, such as EOS or the decadal survey
and future operational missions.
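
As a rough illustration of the structural information described above (file
counts, nesting of sub-collections, and volatility), here is a minimal Python
sketch. The class name, field names, and numbers are all hypothetical; they are
not taken from the attached spreadsheets or from any real data set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Collection:
    """One row of a hypothetical collection inventory (illustrative fields only)."""
    name: str
    file_count: int = 0      # files held directly at this level
    volatile: bool = False   # True if files are still being added or replaced
    sub_collections: List["Collection"] = field(default_factory=list)

    def total_files(self) -> int:
        """Count files at this level plus those in all nested sub-collections."""
        return self.file_count + sum(c.total_files() for c in self.sub_collections)

    def is_volatile(self) -> bool:
        """A collection is volatile if it or any sub-collection is still changing."""
        return self.volatile or any(c.is_volatile() for c in self.sub_collections)


# Made-up numbers, chosen only to show the roll-up.
example = Collection(
    name="example-mission",
    sub_collections=[
        Collection("level-1", file_count=1000),
        Collection("level-2", file_count=250, volatile=True),
    ],
)
print(example.total_files())
print(example.is_volatile())
```

Rolling the counts up recursively is one way to see how quickly the number of
files an identifier scheme must cover grows once sub-collections are nested.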

The third spreadsheet is an attempt to lay out a basis of "requirements"
for citations, using the categories in the second spreadsheet, as well as
becoming quite specific about the reason a user might want to cite data
and what they would need to reference in order to satisfy their needs.

In order to finish out this work, we really need to collect information and
data on current collections and have real, quantified production scenarios
which are still missing from this work.  I think I've got a fair amount of
information about the NCDC Precip data, the solar constant (although I
still need to dig out the details regarding the various data sources and
what the files there contain), the Hurricane Ike Damage Assessment
photo collection - and will need to do a fair amount of work to provide
realistic scenarios for CERES.

Bruce B.


>- Ruth
>
>On Apr 15, 2010, at 7:20 PM, Alice Barkstrom wrote:
>
> > I'll agree.  I found the NRC report and translated their categories
> > into a table (with some bulging at the seams).  Then, I added
> > in three categories that seem to fit - but still needed a fourth
> > for the operational data production (which is the one where latency
> > requirements force inclusion of the "perturbations" into the data
> > files because there isn't time to make the record homogeneous).
> >
> > Also, I'm pretty concerned with the recording (or, in the provenance
> > world, the provenance "tracking") of the perturbations.  In my experience,
> > there's a lot of "tacit knowledge" that producers don't write down that
> > has a serious influence on the reproducibility of data production
> > algorithms.  Or, to put it another way, I don't think the ATBDs are
> > really high-fidelity recordings of what is actually going on in the
> > operational algorithms.  In still another way of putting it, I expect
> > that the amount of time required to reconstruct an algorithm and
> > really ensure that it replicates what is being done operationally
> > is about the same amount of effort that's required to develop the
> > operational software - meaning hundreds of person-hours for some
> > of the operational or - more importantly - the climate data record code.
> >
> > Next steps on my part will be to extend the table to create
> > a context for the use cases that shows such impacts as
> > demands on the producers and demands on the accepting
> > archives - with - maybe - some comments on what users
> > experience.  Many of the comments I get are related to what
> > archives experience in trying to accommodate what they get
> > from producers.  The user experience is often something
> > different from either - and often our comments are not well
> > supported by empirical evidence from the actual user community.
> > I have strong opinions along the line that the IT community wants
> > to do certain things because they receive "good marks" from their
> > colleagues - whether or not the user community benefits.
> >
> > Bruce B.
> >
> > At 04:46 PM 4/15/2010, Mark A. Parsons wrote:
> >> I don't think there is necessarily a direct connection between Bruce's
> >> paradigms and the NSB categories. While there are often parallels, data
> >> sets can evolve through the NSB categories with use. For example, the IPA
> >> Permafrost Map began as a research collection and then, as it was improved
> >> and more consistently compiled, it became a community collection. Now it
> >> is the benchmark of permafrost distribution and is used by multiple
> >> disciplines as a reference collection. All the while, it remains in
> >> Bruce's category 2.  So while some comparison to the NSB categories is
> >> instructive, it isn't exact.
> >>
> >> Cheers,
> >>
> >> -m.
> >> On 15 Apr 2010, at 8:13 AM, Alice Barkstrom wrote:
> >>
> >> > I suspect that the production paradigms create a collection organization
> >> > structure that could stabilize our understanding and ensure
> >> > representativeness to the use cases we choose.  This kind of structural
> >> > work would also provide a checklist that could be used to make it easier
> >> > to classify the kind of cases we're dealing with.  I'll take a look at
> >> > the NSB report and see if I can merge the suggestion I made yesterday
> >> > with that categorization.
> >> >
> >> > Bruce B.
> >> >
> >> > At 06:52 PM 4/14/2010, Ruth Duerr wrote:
> >> >> Actually these descriptions correspond pretty well to the descriptions
> >> >> of research, resource, and reference collections in the report NSB
> >> >> (National Science Board). 2005. Long-Lived Digital Data Collections:
> >> >> Enabling Research and Education in the 21st Century. Washington, DC:
> >> >> National Science Foundation. 87 pp., despite the fact that you are
> >> >> talking about production approaches and they are talking about types
> >> >> of data.
> >> >>
> >> >> Ruth
> >> >>
> >> >> On Apr 14, 2010, at 3:07 PM, Alice Barkstrom wrote:
> >> >>
> >> >> > It may be useful to deal with a simple separation of approaches to
> >> >> > production that incorporates the size of the groups involved:
> >> >> >
> >> >> > 1.  Single author production and publication - classic sociological
> >> >> > scenario that has supported a great deal of previous work
> >> >> >
> >> >> > Scenario: author collects measurements, analyzes the data, and writes
> >> >> > up a summary paper; data may be preserved on paper, or in electronic
> >> >> > files; peer-review accomplished by submission of paper to journal, with
> >> >> > a moderate number (three to five) of referees; data publication would
> >> >> > involve having paper or electronic copies of data accepted by a library
> >> >> > or data center
> >> >> >
> >> >> > 2.  Working group production and publication - field experiment (of a
> >> >> > variety of different kinds) would be a typical example
> >> >> >
> >> >> > Scenario: group sets up equipment, with single person in charge of each
> >> >> > instrument that will collect data, management of WG done by one or two
> >> >> > people (PI); data from individual instruments combined and intercompared
> >> >> > within the group; data preserved in electronic files - which may be
> >> >> > distributed amongst the WG; each instrument's scientist writes up a paper
> >> >> > on his or her data; peer-review accomplished by submission of papers to a
> >> >> > journal special issue and perhaps a special editor who selects a fair
> >> >> > number of referees; data publication requires formal accession planning
> >> >> > by a data center owing to the volume of data and the cost of curation
> >> >> >
> >> >> > 3.  Large-scale production and publication - "Big Science" owing to the
> >> >> > size of the effort involved
> >> >> >
> >> >> > Scenario: instrument and producer teams selected by large scale proposal
> >> >> > effort - may involve one hundred to two hundred people over a decade;
> >> >> > long time period (5 years is typical) of preparation before data
> >> >> > collection begins, including design of production system and data
> >> >> > production software; substantial pre-collection peer-review, including
> >> >> > ATBDs and related algorithm outlines, as well as such documentation as
> >> >> > coordinate transformations, data formats, calibration plans and
> >> >> > procedures, etc.; production highly rigid, with extensive planning and
> >> >> > scheduling; periodic (two to three times per year) science team reviews
> >> >> > of progress - stretching out over a decade or more; multiple
> >> >> > publications, both jointly as a team and as individual contributions to
> >> >> > journals; multiple calibration and validation exercises in support of
> >> >> > establishing bounds on uncertainties; peer-review may involve
> >> >> > intercomparisons with competing instruments or data sources; data
> >> >> > publication requires resources for large-scale, special purpose data
> >> >> > centers owing to cost of computing resources, storage resources, and
> >> >> > curation over long periods.
> >> >> >
> >> >> > These could be neatened up - and perhaps enumerated.  We really need
> >> >> > samples of each different kind of scenario and group interaction.  Is it
> >> >> > worth writing these thoughts up into a format that can go into the wiki?
> >> >> >
> >> >> > Bruce B.
> >> >> >
> >> >> >
> >> >> > At 04:06 PM 4/14/2010, Mark A. Parsons wrote:
> >> >> >> After hearing today's discussion, I thought it might be useful for
> >> >> >> everyone to see the essay that Ruth and I wrote on citations.
> >> >> >>
> >> >> >> Cheers,
> >> >> >>
> >> >> >> -m.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On 14 Apr 2010, at 9:38 AM, Ruth Duerr wrote:
> >> >> >>
> >> >> >> > Wednesday March 10, 1 pm MST (3 pm EST)
> >> >> >> > Telephone: 877-326-0011
> >> >> >> > Meeting #: *4917475*
> >> >> >> > Agenda:
> >> >> >> >
> >> >> >> > - Identifiers paper status
> >> >> >> > - Identifiers testbed report
> >> >> >> > - Status of report on AGU townhall
> >> >> >> > - Provenance paper status
> >> >> >> > - Data management recommendations status
> >> >> >> > - Summer ESIP meeting plans
> >> >> >> > _______________________________________________
> >> >> >> > Esip-preserve mailing list
> >> >> >> > Esip-preserve at lists.esipfed.org
> >> >> >> > http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
> >> >> >>
> >> >
> >
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Citation_Scenarios.xls
Type: application/octet-stream
Size: 17408 bytes
Desc: not available
URL: <http://www.lists.esipfed.org/pipermail/esip-preserve/attachments/20100419/fc8031f6/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Citation_Scenarios.ods
Type: application/vnd.oasis.opendocument.spreadsh
Size: 10701 bytes
Desc: not available
URL: <http://www.lists.esipfed.org/pipermail/esip-preserve/attachments/20100419/fc8031f6/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Item_Inventory.xls
Type: application/octet-stream
Size: 26112 bytes
Desc: not available
URL: <http://www.lists.esipfed.org/pipermail/esip-preserve/attachments/20100419/fc8031f6/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Item_Inventory.ods
Type: application/vnd.oasis.opendocument.spreadsh
Size: 13825 bytes
Desc: not available
URL: <http://www.lists.esipfed.org/pipermail/esip-preserve/attachments/20100419/fc8031f6/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NSB_Digital_Data_Collections_Report_Characterization.xls
Type: application/octet-stream
Size: 9216 bytes
Desc: not available
URL: <http://www.lists.esipfed.org/pipermail/esip-preserve/attachments/20100419/fc8031f6/attachment-0005.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NSB_Digital_Data_Collections_Report_Characterization.ods
Type: application/vnd.oasis.opendocument.spreadsh
Size: 10134 bytes
Desc: not available
URL: <http://www.lists.esipfed.org/pipermail/esip-preserve/attachments/20100419/fc8031f6/attachment-0005.bin>

