It has been almost a month since the news broke that international for-profit publisher Elsevier purchased open-access-publishing software company bepress. By now the immediate shock is waning and institutions are considering the implications of what that takeover means for their open access publishing support. No matter what decisions are made, interoperable and easily shared metadata describing scholarly works is essential for any platform to play a role in the scholarly communication landscape. Good and consistent metadata is crucial for sharing or migrating data.
At the beginning of our appointment, we 2016–2017 SHARE digital curation associates were asked to assist the SHARE development team in “gathering, cleaning, linking, and enhancing metadata.” Our first task was to look at the quality of our own metadata. Bonding over food and drink at a SHARE meeting, a few of us who currently use Digital Commons, a bepress product for managing and publishing scholarly works online, discussed the possibility of collaborating on our metadata curation efforts. For our initial exploration, we bepress institutional repository mavens began by looking at our own harvested data, and quickly discovered that we had similar problem areas and deficiencies.
First, simple Dublin Core (oai_dc) is the default prefix for the bepress OAI-PMH endpoint. The chief issue with harvesting simple Dublin Core metadata is that it lacks nuance, granularity, and sometimes even important pieces of data. For example, the digital object identifier (DOI) is not exposed in oai_dc. When this unique identifier is unavailable to the SHARE harvester, it may be difficult to detect possible duplicate records. Moreover, SHARE and other harvesters cannot take advantage of the ability to simply link information using uniform resource identifiers like DOIs, connecting an author to their institution, or a grant to the final research output. (See Rick Johnson’s post, “SHARE Metadata Is Stitching Together the Research Life Cycle.”)
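To make the missing-identifier problem concrete, here is a minimal sketch of how a harvester might look for a DOI in a simple Dublin Core record. The sample record, the repository URL, and the `find_dois` helper are hypothetical illustrations, not SHARE's actual harvester code, and the DOI pattern is a rough approximation.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the simple Dublin Core element set used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical oai_dc payload: the only identifier is a repository URL.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Article</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:identifier>https://repository.example.edu/articles/1</dc:identifier>
</oai_dc:dc>
"""

# Rough approximation of a DOI; real-world DOIs can be messier.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def find_dois(record_xml):
    """Return every DOI-like string found in the record's dc:identifier fields."""
    root = ET.fromstring(record_xml)
    dois = []
    for element in root.iter(f"{{{DC_NS}}}identifier"):
        match = DOI_PATTERN.search(element.text or "")
        if match:
            dois.append(match.group(0))
    return dois
```

When a record like the sample carries only a repository URL, `find_dois` returns an empty list, and a harvester has no unique identifier to use for duplicate detection or linking.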
We also found that the Dublin Core field labelled “type” needs context because it is used both for the Dublin Core Metadata Initiative (DCMI) Type Vocabulary and for something similar to genre. The situation is a bit muddled for bepress customers because oai_dc uses “text” as the default for everything. Image files, for example, are automatically assigned the type “text” until someone changes the type to “image.” We noted, too, that Digital Commons has a required “document_type” field that could be mapped to “dc:type,” but this same field is used in journals for sections in the table of contents. These variant uses mean that the facets cannot be limited to a controlled set of terms.
Similarly, context is needed for metadata fields that support the OpenURL format, such as volume number, issue number, and first and last pages. Isolated from the rest of the record, these values appear as orphaned digits and can be difficult to interpret in a different setting.
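One way to picture a cleaner “type” mapping is the sketch below. The DCMI Type Vocabulary terms are the published ones, but the `document_type` values, the mapping table, and the `dcmi_type_for` helper are assumptions made up for illustration; they do not describe how Digital Commons behaves.

```python
# The twelve terms of the DCMI Type Vocabulary.
DCMI_TYPES = frozenset({
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
})

# Hypothetical local document_type values mapped to DCMI terms.
DOCUMENT_TYPE_TO_DCMI = {
    "article": "Text",
    "thesis": "Text",
    "photograph": "StillImage",
    "dataset": "Dataset",
    "video": "MovingImage",
}


def dcmi_type_for(document_type):
    """Return the DCMI term for a local document_type, or None if unmapped.

    Returning None instead of silently defaulting to "Text" leaves
    unmapped values visible so a curator can review them.
    """
    key = document_type.strip().lower()
    return DOCUMENT_TYPE_TO_DCMI.get(key)
```

With a table like this, a journal section heading that was entered as a document type comes back as `None` rather than being labelled “Text,” which makes the variant uses easier to spot.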
Also in Digital Commons oai_dc, the “publisher” field defaults to the name of the repository. It is not apparent how one might include both an institution name and a repository name (or if this is desirable) in the metadata.
A final example is the flat “author” field of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which does not expose affiliations, identifiers, or roles. This data would be invaluable in helping SHARE to disambiguate author names and connect related events in the research cycle, such as matching a data set to a research paper and its authors.
As we crunched through different examples and varying formats, we drew up some initial ideas to share with other repository minders, our “Recommendations for DC Repository Managers,” first presented at ACRL 2017 in a poster session.
We are currently in varying stages of progress in curating our own data. For example, at Western University, we have begun to polish our electronic theses and dissertations (ETD) metadata by adding language fields and supervisor name fields, and by ensuring that dates, such as embargo dates, are correctly encoded, included, and exposed for harvesting.
By the way, we certainly don’t want to give bepress users or any potential metadata provider the impression that having your metadata harvested by SHARE is a difficult task. Nothing could be further from the truth. The SHARE team provides a simple form to fill out, and you only need to provide information about the OAI feed for simple Dublin Core. Or you can send a note to email@example.com and a team member will contact you to go through the necessary steps to start the harvesting process. SHARE ingests and normalizes the data, transforming it into the SHARE schema. However, if we, the repository managers, are able to provide well-defined, standardized data, we minimize the effort of mapping and matching, which benefits any migration, sharing, or re-use.
In our second phase of exploration we looked for a standardized way to express our data. We cast a wider net and mapped the various bepress flavors of OAI to the evolving SHARE schema and investigated a variety of other well-defined outputs, including DataCite. We looked at documentation for other repositories and reviewed the wider repository environment to ensure our ideas went beyond our own repositories.
As the project was nearing its finish, and our document was almost complete, we discovered that SHARE had posted its own general recommendations informed in part by discussions and meetings with all the curation associates:
- Every OAI source supports oai_dc, but they usually also support at least one other format that has richer, more structured data, like oai_datacite or mods.
- Choose the format that seems to have the most useful data for SHARE, especially if a transformer for that format already exists.
- Choose oai_dc only as a last resort.
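These recommendations can be sketched as a simple preference order. The `verb` and `metadataPrefix` parameters come from the OAI-PMH protocol, but the ranking of formats, the helper names, and the endpoint URL in the usage note are assumptions for illustration only.

```python
from urllib.parse import urlencode

# Assumed preference order: richer formats first, oai_dc as the last resort.
PREFERRED_PREFIXES = ("oai_datacite", "mods", "oai_dc")


def choose_prefix(available, preferred=PREFERRED_PREFIXES):
    """Pick the first preferred metadata prefix that the source supports."""
    for prefix in preferred:
        if prefix in available:
            return prefix
    raise ValueError("source supports none of the preferred prefixes")


def list_records_url(base_url, prefix):
    """Build an OAI-PMH ListRecords request URL for the chosen prefix."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": prefix})
    return f"{base_url}?{query}"
```

For a hypothetical source at `https://repository.example.edu/oai` that advertises both `oai_dc` and `mods`, `choose_prefix` selects `mods`, and `list_records_url` builds the matching harvest request.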
The result of our examination, beyond the general recommendations, features guidelines with detailed suggestions for specific fields. For example, we note that the “date” field is problematic because there are so many possibilities (created, published, issued, modified, etc.) and no means of distinguishing among the various date traits. Another example we wish to improve is the handling of “format.” This is a derived value in bepress metadata and there is no way to indicate controlled terms. And while qualified Dublin Core will expose a DOI, other unique identifiers remain hidden. Please see the full report of our findings, “Best Practices for Mapping Digital Commons Metadata for Harvesting by SHARE.”
We believe that employing these recommendations will improve bepress metadata for sharing and for harvesting. We hope that other metadata specialists, including the broader Digital Commons community, will give feedback on our recommendations, as we feel that if a majority can agree on the same practices, it will be easier for bepress to implement our suggestions. Please review our recommendations for alignment between Digital Commons metadata schema and SHARE. We welcome comments or feedback on the document, or you may contact any of us directly with your thoughts and ideas!