Enabling Long-Term Access to Scientific, Technical and Medical Data Collections

by Jeff Rothenberg, Stijn Hoorens

RAND Health Quarterly, 2012; 1(4):13

Abstract

In recent decades, online access to large, high quality data collections has led to a new, deeper level of sharing and analysis, potentially accelerating and improving the quality of scientific research. These online datasets are becoming imperative at all stages of the research process, particularly in scientific, technical and medical (STM) disciplines. Since libraries have a traditional responsibility to guarantee the availability of the output of scholarly research, they have a potentially important role to play in facilitating long-term access to these resources. Yet, the role of a national library in the realm of STM data remains unclear. This article presents the results of a scoping study that addresses the potential role of the British Library (BL) in facilitating access to relevant datasets in the biosciences and environmental science. The aim of this study is to assist the BL in developing an appropriate strategy that would enable it to establish a role for itself in the intake, curation, archiving, and preservation of STM reference datasets, in order to provide access to these datasets for research purposes. The focus of this study is to explore a range of alternative strategies for the BL, which might be different for different types of databases or for data supporting different research fields or disciplines.

For more information, see RAND TR-567-BL at https://www.rand.org/pubs/technical_reports/TR567.html

Full Text

In recent decades, online access to large, high quality data collections has led to a new, deeper level of sharing and analysis, potentially accelerating and improving the quality of scientific research. These online datasets are becoming imperative at all stages of the research process, particularly in scientific, technical and medical (STM) disciplines. Since libraries have a traditional responsibility to guarantee the availability of the output of scholarly research, they have a potentially important role to play in facilitating long-term access to these resources. Yet the role of a national library in the realm of STM data remains unclear.

This article presents the results of a scoping study that addresses the potential role of the British Library (BL) in facilitating access to relevant datasets in the biosciences and environmental science. The aim of this study is to assist the BL in developing an appropriate strategy that would enable it to establish a role for itself in the intake, curation, archiving and preservation of STM reference datasets, in order to provide access to these datasets for research purposes. The focus of this study is to explore a range of alternative strategies for the BL, which might be different for different types of databases or for data supporting different research fields or disciplines.

Characterising the Dimensions of Reference Data Collections

To develop a strategy aimed at providing access to these resources, a comprehensive picture is needed of the diverse ways in which research data are produced and offered. Since the BL might function as a gateway to these resources, it is equally important to characterise the interests and needs of the potential users of these datasets. We therefore distinguished between the supply domain of datasets on the one hand and the use of such data, i.e., the demand domain, on the other. As illustrated in Table 1, a set of seven supply-side dimensions and a set of six demand-side dimensions have been developed. Each dimension has several attributes, which allow each candidate database to be characterised and a set of options for the BL to be delineated for each attribute.

The identified attributes can have different values that represent the variation among the data collections' characteristics. We explored the online resources of a small sample of candidate data collections, and, to the extent possible, reviewed documentation about their ownership, management, data processing and validation methods, access mechanisms, query interfaces, browsing capabilities, metadata, etc. The identified dimensions, their attributes and the span of plausible values on the supply and demand side are given in Table 1.

Table 1

Span of Plausible Attribute Values in Supply- and Demand-Side Dimensions

| Dimension Number | Dimension | Attribute | Attribute Values |
|---|---|---|---|
| S1 | Access | Restriction | none, role-based (e.g., government, commercial, individual), location-or-affiliation-based (e.g., by country, agency, professional society), by-registration, requiring unpaid-membership, paid-membership, use-payment (unlimited or by data-item, query, dataset, etc.) |
| | | Access media | online-only, offline-only, on-or-offline |
| | | Granularity | attribute, data-item, query-result, subset, dataset |
| | | Functionality | low, medium, high |
| | | Software-requirements | generic, modifiable, free-download, server-resident-proprietary, proprietary |
| S2 | Scale, dynamism, coverage and completeness | Scale | small, medium, large |
| | | Dynamism of discipline | frozen, static, dynamic, volatile |
| | | Dynamism (of database) | frozen, static, dynamic, volatile |
| | | Temporal-depth | historical, current-only, multiple versions/editions |
| | | Coverage | narrow, medium, broad |
| | | Completeness | low, medium, high |
| | | Collection-strategy | passive, active |
| | | Processing | none, minimal, significant, intensive |
| | | Validation | none, minimal, significant, intensive |
| | | Timeliness | low, medium, high |
| S3 | Disciplinary usage | Cross-discipline | no, somewhat, yes |
| | | Disciplines | <discipline designations> |
| | | Level of user support | low, medium, high |
| S4 | Interface | User-interface | menu, graphical-selection, text-query, graphics-input |
| | | Programmable-interfaces | no, server-support, framework-support, API |
| S5 | Interoperability | Self-describing data | no, somewhat, yes |
| | | Semantic transparency | no, somewhat, yes |
| | | Linkage-to-other-collections | no, somewhat, yes |
| | | Use of semantic standards | no, somewhat, yes |
| | | Cross-domain semantic crosswalks | no, somewhat, yes |
| | | Programmable interfaces | non-existent, unique, standard, open |
| S6 | Ownership, funding, governance, management and contributors | Reputation | low, medium, high |
| | | Involvement | low, medium, high |
| | | Accessibility | low, medium, high |
| | | Funding-level | low, medium, high |
| | | Funding-reliability | low, medium, high |
| | | Governance-quality | low, medium, high |
| | | Sustainability | short-term, medium-term, indefinite |
| S7 | Attribution & IP | Attribution completeness | low, medium, high |
| | | Attribution accuracy | low, medium, high |
| | | Attribution granularity | low, medium, high |
| | | Licensing, registration, agreements with owners | inapplicable, minimal, partial, complete |
| | | End-user licensing | inapplicable, minimal, partial, complete |
| | | Redaction/anonymisation of data | inapplicable, minimal, partial, complete |
| D1 | Research methodology, funding and stakeholder requirements | Required-access-granularity | attribute, data-item, query-result, dataset, database |
| | | Required-metadata | low, medium, high |
| | | Required-access-to-models | low, medium, high |
| | | Methods | <method designations> |
| | | Publication/distribution requirements | <various> |
| D2 | Discovery methods | Search-engines | generic, specialised |
| | | Discovery metadata | generic, specialised |
| | | Other discovery resources | <indexes, catalogues, etc.> |
| D3 | Query style | Expressivity | low, medium, high |
| | | Desired-interface | menu, graphical-selection, text-query, graphics-input |
| | | Required-programmable-access | low, medium, high |
| D4 | Federation | Need-to-federate | low, medium, high |
| | | Required-metadata-support | low, medium, high |
| D5 | Cross-disciplinary usage | Cross-disciplinary-usage | low, medium, high |
| | | Required-metadata-support | low, medium, high |
| D6 | Timeliness and temporal access | Required-recency | low, medium, high |
| | | Required-timestamp-granularity | low, medium, high |
| | | Desired-update-method | asynchronous, time-stamped, transaction-based |
| | | Required-temporal-access | historical, current-only, versioned, multi-epoch, reconstructible |
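
To make the characterisation scheme concrete, the attribute values in Table 1 can be treated as controlled vocabularies against which a candidate database's profile is checked. The following is a minimal sketch rather than anything used in the study: it encodes only a handful of Table 1 attributes, and the names `ALLOWED_VALUES`, `validate_profile` and `example_profile` are hypothetical, as is the example database profile itself.

```python
# Illustrative subset of Table 1, keyed by (dimension number, attribute).
# The value lists are copied from Table 1; the structure is an assumption.
ALLOWED_VALUES = {
    ("S1", "Access media"): ("online-only", "offline-only", "on-or-offline"),
    ("S1", "Functionality"): ("low", "medium", "high"),
    ("S2", "Scale"): ("small", "medium", "large"),
    ("S2", "Dynamism (of database)"): ("frozen", "static", "dynamic", "volatile"),
    ("D4", "Need-to-federate"): ("low", "medium", "high"),
}


def validate_profile(profile):
    """Return a list of problems found in a database profile (empty if none)."""
    problems = []
    for key, value in profile.items():
        allowed = ALLOWED_VALUES.get(key)
        if allowed is None:
            problems.append(f"unknown attribute: {key}")
        elif value not in allowed:
            problems.append(f"{key}: {value!r} is not one of {allowed}")
    return problems


# Hypothetical profile of a candidate data collection (not from the study).
example_profile = {
    ("S1", "Access media"): "online-only",
    ("S2", "Scale"): "large",
    ("S2", "Dynamism (of database)"): "dynamic",
    ("D4", "Need-to-federate"): "medium",
}

if __name__ == "__main__":
    for problem in validate_profile(example_profile):
        print(problem)
```

Keeping supply-side (S) and demand-side (D) attributes in one vocabulary keyed by dimension number mirrors the layout of Table 1; a fuller encoding would simply add the remaining rows.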

Bundles of Strategic Options for a National Library

Analysis of the sample of candidate collections has led to the identification of a range of options, each addressing one salient attribute value or a small set of them. For example, the BL should (or should not) hold a given dataset itself, or should (or should not) develop and provide its own metadata and query or access mechanisms for a given dataset.

As an initial exercise in how the BL might develop a strategy for providing long-term access to these high quality reference data collections, we specified three exemplary clusters of attribute values, each of which characterises a class of databases. Each such attribute cluster defines a bundle of options that, taken together, can be considered a strategy.

  1. The first cluster of attributes can be labelled as neutral: it represents the issues arising in the sample of databases that were investigated. For this cluster, the national library might consider providing transparent access to the data collections.
  2. The second cluster of attributes represents a class of databases involving a complex, demanding set of requirements combined with relatively minimal support by the database itself. For this cluster, the national library might consider providing gateway access to the data collections.
  3. The third cluster represents a class of databases involving a simple, undemanding set of requirements, combined with relatively good support by the database itself. These data collections have minimal access restriction, and their supporting mechanisms are relatively simple. For this cluster, the national library might consider providing transparent access to the data collections.

The three bundles of options associated with these attribute clusters should be considered indicative strategies rather than definitive ones. The “demanding” and “undemanding” clusters have been deliberately formulated as two extreme ends on a spectrum of plausible cases. The BL may choose different options, depending on its missions and policies.
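
The mapping from attribute clusters to indicative strategies can be sketched in a few lines. The access mode suggested for each cluster is taken from the list above; everything else is an assumption made for illustration, in particular the scoring rule in `classify_cluster`, its thresholds, and the idea of summarising a database by two lists of low/medium/high levels. The study does not define the clusters numerically.

```python
# Ordinal encoding of the low/medium/high scale used throughout Table 1.
LEVELS = {"low": 0, "medium": 1, "high": 2}

# Access mode suggested for each cluster, as stated in the text above.
SUGGESTED_ACCESS = {
    "neutral": "transparent access",
    "demanding": "gateway access",
    "undemanding": "transparent access",
}


def classify_cluster(requirement_levels, support_levels):
    """Assign a database to a cluster (hypothetical rule, not from the study).

    requirement_levels: low/medium/high ratings of how demanding the
        requirements placed on the database are.
    support_levels: low/medium/high ratings of how well the database
        itself supports those requirements.
    """
    demand = sum(LEVELS[v] for v in requirement_levels) / len(requirement_levels)
    support = sum(LEVELS[v] for v in support_levels) / len(support_levels)
    if demand >= 1.5 and support <= 0.5:
        return "demanding"
    if demand <= 0.5 and support >= 1.5:
        return "undemanding"
    return "neutral"


if __name__ == "__main__":
    cluster = classify_cluster(["high", "high", "medium"], ["low", "low"])
    print(cluster, "->", SUGGESTED_ACCESS[cluster])
```

Because the "demanding" and "undemanding" clusters are deliberately extreme, most real databases would fall between them under any such rule, which is why the sketch treats "neutral" as the default.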

Lessons and Next Steps

The option bundles presented are only a starting point. The BL's strategy with respect to any given database should be decided on the basis of an overall assessment of the importance and uniqueness of that database, its relevance to the BL's policies with regard to STM data, and the BL's assessment of the degree to which users of the database would benefit from having the BL apply its own curatorial, preservation, or access resources to the database.

Although the limited resources of our study enabled us to obtain reasonable information for most supply-side attributes, details of ownership and funding (accessibility of owners, owner reputation, reliability of funding, etc.) could in many cases only be inferred by our necessarily informal methods. Demand-side attributes were even harder to obtain; our values for most of these attributes are derived deductively rather than empirically. These values need to be validated and revised on the basis of future demand-side analysis.

The results of this study should therefore be replicated with greater depth and resources, using a larger number and wider range of sample databases, augmented by demand-side input from researchers and user groups. The more in-depth examination should employ direct contact with database administrators, parent organisations, data processing managers, discipline-based organisations whose members use the database, and user communities. This should help fill in the supply-side attributes for each database as well as provide demand-side attributes, whose values were supplied largely by assumption in the current study.

RAND Health Quarterly is produced by the RAND Corporation. ISSN 2162-8254.