In recent decades, online access to large, high-quality data collections has enabled a new, deeper level of sharing and analysis, potentially accelerating scientific research and improving its quality. These online datasets are becoming indispensable at all stages of the research process, particularly in scientific, technical and medical (STM) fields. Since libraries have a traditional responsibility to guarantee the availability of the output of scholarly research, they have a potentially important role to play in facilitating long-term access to these resources. Yet the role of a national library in the realm of STM data remains unclear.
This article presents the results of a scoping study that addresses the potential role of the British Library (BL) in facilitating access to relevant datasets in the biosciences and environmental science. The aim of this study is to assist the BL in developing an appropriate strategy that would enable it to establish a role for itself in the intake, curation, archiving and preservation of STM reference datasets, in order to provide access to these datasets for research purposes. The study explores a range of alternative strategies for the BL, which might differ by type of database or by the research field or discipline that the data support.
Characterising the Dimensions of Reference Data Collections
In order to develop a strategy for providing access to these resources, a comprehensive picture is needed of the diverse ways in which research data are produced and offered. At the same time, since the BL might function as a gateway to these resources, it is equally important to characterise the interests and needs of the potential users of these datasets. We therefore distinguished between the supply of datasets (the supply domain) and the use of such data (the demand domain). As illustrated in Table 1, a set of seven supply-side dimensions and a set of six demand-side dimensions have been developed. Each dimension has several attributes to allow for a characterisation of each candidate database and to delineate a set of options for the BL related to each attribute.
The identified attributes can have different values that represent the variation among the data collections' characteristics. We explored the online resources of a small sample of candidate data collections, and, to the extent possible, reviewed documentation about their ownership, management, data processing and validation methods, access mechanisms, query interfaces, browsing capabilities, metadata, etc. The identified dimensions, their attributes and the span of plausible values on the supply and demand side are given in Table 1.
Table 1
Span of Plausible Attribute Values in Supply- and Demand-Side Dimensions

| Dimension Number | Dimension | Attribute | Attribute Values |
|---|---|---|---|
| S1 | Access | Restriction | none, role-based (e.g., government, commercial, individual), location-or-affiliation-based (e.g., by country, agency, professional society), by-registration, requiring unpaid-membership, paid-membership, use-payment (unlimited or by data-item, query, dataset, etc.) |
| | | Access media | online-only, offline-only, on-or-offline |
| | | Granularity | attribute, data-item, query-result, subset, dataset |
| | | Functionality | low, medium, high |
| | | Software-requirements | generic, modifiable, free-download, server-resident-proprietary, proprietary |
| S2 | Scale, dynamism, coverage and completeness | Scale | small, medium, large |
| | | Dynamism of discipline | frozen, static, dynamic, volatile |
| | | Dynamism (of database) | frozen, static, dynamic, volatile |
| | | Temporal-depth | historical, current-only, multiple versions/editions |
| | | Coverage | narrow, medium, broad |
| | | Completeness | low, medium, high |
| | | Collection-strategy | passive, active |
| | | Processing | none, minimal, significant, intensive |
| | | Validation | none, minimal, significant, intensive |
| | | Timeliness | low, medium, high |
| S3 | Disciplinary usage | Cross-discipline | no, somewhat, yes |
| | | Disciplines | <discipline designations> |
| | | Level of user support | low, medium, high |
| S4 | Interface | User-interface | menu, graphical-selection, text-query, graphics-input |
| | | Programmable-interfaces | no, server-support, framework-support, API |
| S5 | Interoperability | Self-describing data | no, somewhat, yes |
| | | Semantic transparency | no, somewhat, yes |
| | | Linkage-to-other-collections | no, somewhat, yes |
| | | Use of semantic standards | no, somewhat, yes |
| | | Cross-domain semantic crosswalks | no, somewhat, yes |
| | | Programmable interfaces | non-existent, unique, standard, open |
| S6 | Ownership, funding, governance, management and contributors | Reputation | low, medium, high |
| | | Involvement | low, medium, high |
| | | Accessibility | low, medium, high |
| | | Funding-level | low, medium, high |
| | | Funding-reliability | low, medium, high |
| | | Governance-quality | low, medium, high |
| | | Sustainability | short-term, medium-term, indefinite |
| S7 | Attribution & IP | Attribution completeness | low, medium, high |
| | | Attribution accuracy | low, medium, high |
| | | Attribution granularity | low, medium, high |
| | | Licensing, registration, agreements with owners | inapplicable, minimal, partial, complete |
| | | End-user licensing | inapplicable, minimal, partial, complete |
| | | Redaction/anonymisation of data | inapplicable, minimal, partial, complete |
| D1 | Research methodology, funding and stakeholder requirements | Required-access-granularity | attribute, data-item, query-result, dataset, database |
| | | Required-metadata | low, medium, high |
| | | Required-access-to-models | low, medium, high |
| | | Methods | <method designations> |
| | | Publication/distribution requirements | <various> |
| D2 | Discovery methods | Search-engines | generic, specialised |
| | | Discovery metadata | generic, specialised |
| | | Other discovery resources | <indexes, catalogues, etc.> |
| D3 | Query style | Expressivity | low, medium, high |
| | | Desired-interface | menu, graphical-selection, text-query, graphics-input |
| | | Required-programmable-access | low, medium, high |
| D4 | Federation | Need-to-federate | low, medium, high |
| | | Required-metadata-support | low, medium, high |
| D5 | Cross-disciplinary usage | Cross-disciplinary-usage | low, medium, high |
| | | Required-metadata-support | low, medium, high |
| D6 | Timeliness and temporal access | Required-recency | low, medium, high |
| | | Required-timestamp-granularity | low, medium, high |
| | | Desired-update-method | asynchronous, time-stamped, transaction-based |
| | | Required-temporal-access | historical, current-only, versioned, multi-epoch, reconstructible |

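
To make the characterisation scheme more concrete, the following minimal Python sketch records a candidate database's attribute values and checks each one against the span of plausible values in Table 1. It is an illustration only, not part of the study: the names `ALLOWED_VALUES`, `DatabaseProfile`, `assign` and `unassessed`, the `(dimension, attribute)` key layout and the database name `example-db` are all hypothetical, and only a handful of attributes from the table are included.

```python
from dataclasses import dataclass, field

# Allowed values for a few attributes, copied from Table 1.
# Only a subset is listed; the (dimension, attribute) key layout is an assumption.
ALLOWED_VALUES = {
    ("S1", "Access media"): {"online-only", "offline-only", "on-or-offline"},
    ("S2", "Scale"): {"small", "medium", "large"},
    ("S2", "Dynamism (of database)"): {"frozen", "static", "dynamic", "volatile"},
    ("S6", "Sustainability"): {"short-term", "medium-term", "indefinite"},
    ("D4", "Need-to-federate"): {"low", "medium", "high"},
}


@dataclass
class DatabaseProfile:
    """Hypothetical record of one candidate database's attribute values."""

    name: str
    values: dict = field(default_factory=dict)

    def assign(self, dimension: str, attribute: str, value: str) -> None:
        """Store a value after checking it against the allowed span."""
        key = (dimension, attribute)
        if key not in ALLOWED_VALUES:
            raise KeyError(f"unknown attribute: {dimension} / {attribute}")
        if value not in ALLOWED_VALUES[key]:
            raise ValueError(f"{value!r} is not a plausible value for {attribute}")
        self.values[key] = value

    def unassessed(self) -> list:
        """Return the attributes that still have no value, in sorted order."""
        return sorted(key for key in ALLOWED_VALUES if key not in self.values)


# Usage with a made-up database name.
profile = DatabaseProfile("example-db")
profile.assign("S2", "Scale", "large")
profile.assign("S6", "Sustainability", "indefinite")
print(profile.unassessed())
```

Keeping the allowed values in one place means that a value outside the table's span is rejected at entry, and `unassessed` shows which attributes still lack a value for a given database.
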
Bundles of Strategic Options for a National Library
Analysis of the sample of candidate collections has led to the identification of a range of options, each of which addresses one attribute value or a small set of salient attribute values. Examples of such options include: the BL should (or should not) hold a given dataset itself, or should (or should not) develop and provide its own metadata and query or access mechanisms for a given dataset.
As an initial exercise in how the BL might develop a strategy for providing long-term access to these high-quality reference data collections, we specified three exemplary clusters of attribute values, each of which characterises a class of databases. Each such attribute cluster defines a bundle of options that, taken together, can be considered a strategy.
- The first cluster of attributes can be labelled as neutral: it represents the issues arising in the sample of databases that were investigated. For this cluster, the national library might consider providing transparent access to the data collections.
- The second cluster of attributes represents a class of databases involving a complex, demanding set of requirements combined with relatively minimal support by the database itself. For this cluster, the national library might consider providing gateway access to the data collections.
- The third cluster represents a class of databases involving a simple, undemanding set of requirements, combined with relatively good support by the database itself. These data collections have minimal access restriction, and their supporting mechanisms are relatively simple. For this cluster, the national library might consider providing transparent access to the data collections.
The three bundles of options associated with these attribute clusters should be considered indicative strategies rather than definitive ones. The “demanding” and “undemanding” clusters have been deliberately formulated as the two extreme ends of a spectrum of plausible cases. The BL may choose different options, depending on its missions and policies.
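
The mapping from attribute clusters to indicative strategies can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the study's method. Collapsing a database profile into two summary ratings (`requirements` and `support`) is hypothetical, as is treating every case between the two extremes as neutral. The function names `classify_cluster` and `indicative_strategy` are invented for this example. Only the cluster labels and their associated access strategies come from the text.

```python
# Indicative strategy named in the text for each cluster.
STRATEGY_BY_CLUSTER = {
    "neutral": "transparent access",
    "demanding": "gateway access",
    "undemanding": "transparent access",
}


def classify_cluster(requirements: str, support: str) -> str:
    """Assign a cluster from two hypothetical summary ratings.

    Both arguments take the values "low", "medium" or "high". Reducing a
    profile to these two ratings is an assumption of this sketch.
    """
    levels = {"low", "medium", "high"}
    if requirements not in levels or support not in levels:
        raise ValueError("ratings must be 'low', 'medium' or 'high'")
    if requirements == "high" and support == "low":
        return "demanding"
    if requirements == "low" and support == "high":
        return "undemanding"
    # Assumption: everything between the two extremes is treated as neutral.
    return "neutral"


def indicative_strategy(requirements: str, support: str) -> str:
    """Look up the indicative strategy for the cluster a database falls into."""
    return STRATEGY_BY_CLUSTER[classify_cluster(requirements, support)]


# Usage: a database with demanding requirements and little built-in support.
print(indicative_strategy("high", "low"))
```

A lookup table keeps the strategies easy to revise, which suits bundles that are meant to be indicative rather than definitive.
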
Lessons and Next Steps
The option bundles presented are only a starting point. The BL's strategy with respect to any given database should be decided on the basis of an overall assessment of the importance and uniqueness of that database, its relevance to the BL's policies with regard to STM data, and the BL's assessment of the degree to which users of the database would benefit from having the BL apply its own curatorial, preservation, or access resources to the database.
Although the limited resources of our study enabled us to obtain reasonable information for most supply-side attributes, details of ownership and funding (accessibility of owners, owner reputation, reliability of funding, etc.) could in many cases only be inferred by our necessarily informal methods. Demand-side attributes were even harder to obtain; our values for most of these attributes are derived deductively rather than empirically. These values need to be validated and revised on the basis of future demand-side analysis.
The results of this study should therefore be replicated with greater depth and resources, using a larger number and wider range of sample databases, augmented by demand-side input from researchers and user groups. The more in-depth examination should employ direct contact with database administrators, parent organisations, data processing managers, discipline-based organisations whose members use the database, and user communities. This should help fill in the supply-side attributes for each database and provide the demand-side attributes, whose values rested largely on assumptions in the current study.