Rick’s MetaTips | 24 May 2016

Rick’s MetaTips: Assessing and Improving SHARE’s Metadata Structure

Rick Johnson, image courtesy of University of Notre Dame Hesburgh Libraries

This month I would like to provide an update on developments around SHARE’s metadata schema. Now, a little over a year since the launch of SHARE Notify, SHARE has captured over 7 million research events. In doing so, we have made great progress in our mission to make research more discoverable. We have also laid the groundwork to take the next steps toward goals we have had in mind for SHARE from the beginning. These goals are driving a reassessment of our metadata structure:

  1. Increase links among related materials, events, and researchers
  2. Harness partial and additional metadata retrieved from multiple data providers and link across those data providers
  3. Enable better grouping and filtering of research events
  4. Improve the interoperability of SHARE’s metadata so that it is more consumable by other systems

As I have mentioned in previous posts, connections like these could be enabled by data providers populating key metadata elements in the SHARE schema (“Rick’s MetaTips: SHARE Metadata Is Stitching Together the Research Life Cycle”) and/or providing unique identifiers so that SHARE can link events, people, and resources together (“Rick’s MetaTips: It’s All about the Links”).

However, the SHARE schema can still be tuned, and data collection efforts focused, to better align with the priorities above. SHARE’s metadata structure will therefore be enhanced to place greater emphasis on:

  • Unique identifiers over plaintext values (e.g., ORCIDs in addition to author names), addressing priorities 1 and 2 above
  • Uniform resource identifier (URI) values over free-form text, addressing priorities 2 and 3
  • Increased accommodation for geolocation and applied controlled vocabularies (e.g., utilizing the URI schemes mentioned above), addressing priorities 3 and 4
  • Accommodating records with multiple data providers, addressing priority 2
  • Provenance of records across data providers, addressing priority 2
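
To make the last two points concrete, here is a minimal Python sketch of combining records from several data providers while remembering which provider supplied each value. It is an illustration under stated assumptions, not SHARE's actual schema or pipeline: the `MergedRecord` class, the `merge_by_doi` function, the provider names, and the DOI are all hypothetical, and the ORCID is only a sample value. The "first non-empty value wins" rule is likewise just one simple choice.

```python
from dataclasses import dataclass, field


@dataclass
class MergedRecord:
    """One research event assembled from several data providers (hypothetical)."""
    doi: str
    # field name -> value chosen for the merged record
    values: dict = field(default_factory=dict)
    # field name -> provider that supplied the chosen value (provenance)
    provenance: dict = field(default_factory=dict)
    # every provider that reported this DOI, in order of arrival
    providers: list = field(default_factory=list)


def merge_by_doi(raw_records):
    """Group provider records by DOI, keeping the first non-empty value per field."""
    merged = {}
    for rec in raw_records:
        doi = rec["doi"].lower()  # DOIs are case-insensitive, so normalize first
        out = merged.setdefault(doi, MergedRecord(doi=doi))
        if rec["provider"] not in out.providers:
            out.providers.append(rec["provider"])
        for name, value in rec["fields"].items():
            if value and name not in out.values:
                out.values[name] = value
                out.provenance[name] = rec["provider"]
    return merged


# Made-up input: two providers describe the same item with partial metadata.
raw = [
    {"provider": "repo_a", "doi": "10.1234/ABC",
     "fields": {"title": "A Study", "orcid": ""}},
    {"provider": "repo_b", "doi": "10.1234/abc",
     "fields": {"title": "A study.", "orcid": "0000-0002-1825-0097"}},
]
record = merge_by_doi(raw)["10.1234/abc"]
print(record.providers)
print(record.provenance)
```

Here the title comes from the first provider and the ORCID from the second, and the `provenance` mapping records that split so a later reviewer can see where each value originated.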

More unique identifiers will enhance our ability to link records across data providers and across research event types, while URI values (rather than free-form text) will make SHARE’s metadata more consumable by, and interoperable with, other systems. A human can easily see that records with similar text values are roughly equivalent, but it is very difficult for software to confidently link and compare items that use different words with similar meanings. URI values, by contrast, can be read from a common table with consistent meanings and common usage patterns. For example, a reference to “Paris” could be interpreted as Paris, France; Paris, Texas; or many other places called Paris. If the GeoNames URI were used instead of plaintext, it could be unequivocally interpreted as Paris, France. Similarly, grouping related subject terms like “Biology” and “Life Sciences” can be very difficult with plaintext alone. A preference for URI values will therefore make the SHARE schema more machine-actionable.
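
The difference between plaintext and URI values can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the `example.org` URIs are placeholders standing in for real GeoNames or subject-vocabulary URIs, and mapping “Biology” and “Life Sciences” to one concept is a simplification, not a statement about any actual vocabulary.

```python
# Hypothetical lookup tables. The example.org URIs are placeholders, not real
# GeoNames or controlled-vocabulary identifiers.
PLACE_URIS = {
    ("paris", "fr"): "https://example.org/places/paris-france",
    ("paris", "us-tx"): "https://example.org/places/paris-texas",
}

SUBJECT_URIS = {
    "biology": "https://example.org/subjects/life-sciences",
    "life sciences": "https://example.org/subjects/life-sciences",
}


def candidate_places(name):
    """Return every URI that a bare place name could refer to."""
    key = name.strip().lower()
    return sorted(uri for (place, _region), uri in PLACE_URIS.items() if place == key)


def subject_uri(term):
    """Map a free-text subject term to a vocabulary URI, or None if unknown."""
    return SUBJECT_URIS.get(term.strip().lower())


# The bare string "Paris" matches more than one place, so it is ambiguous.
print(candidate_places("Paris"))

# Two different strings resolve to the same URI, so they group together.
print(subject_uri("Biology") == subject_uri("Life Sciences"))
```

A record that stores one of the place URIs directly needs no such guesswork, which is the point of preferring URIs at the schema level.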

Restructuring the metadata schema is only half the story. As mentioned in our April announcement about SHARE’s Curation Associates Program, much of our focus going forward will be on harnessing machine-learning techniques combined with expert metadata enrichment. This is an important distinction for SHARE that should not be overlooked: manual record curation and enhancement greatly improve the effectiveness of any automated metadata enrichment. By weaving expert and automated enrichment workflows together, we can leverage many automated techniques that might not be sufficient without human review or intervention. Because not every match that automation produces will be 100% certain, we look to:

  • Apply a percentage of authority and confidence to imputed metadata elements based on data provider, shared unique identifiers (e.g., DOI, ORCID), shared names, etc.
  • Flag and highlight the records that most need, and will benefit most from, validation by the curation associates, based upon the confidence percentage.
  • Flag some records for manual review when de-duplicating and combining records from multiple data providers, to validate that matches are correct. (Curation associates will also help manually de-duplicate records that are not identified through automation.)
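
One way to picture the confidence-and-review workflow above is the following Python sketch. It is not SHARE's scoring method: the evidence weights, the 0.8 review threshold, the function names, and the sample records are all assumptions chosen only to show the shape of the idea, and sorting the review queue by lowest confidence first is one possible reading of which records benefit most from validation.

```python
# Hypothetical weights: how much each kind of shared evidence contributes.
EVIDENCE_WEIGHTS = {"doi": 0.6, "orcid": 0.3, "name": 0.1}
REVIEW_THRESHOLD = 0.8  # assumed cut-off below which a curator checks the match


def match_confidence(a, b):
    """Sum the weights of the fields on which two records agree (0.0 to 1.0)."""
    score = 0.0
    for key, weight in EVIDENCE_WEIGHTS.items():
        if a.get(key) and a.get(key) == b.get(key):
            score += weight
    return round(score, 2)


def triage(pairs):
    """Split candidate matches into auto-accepted and needs-review lists."""
    accepted, review = [], []
    for a, b in pairs:
        confidence = match_confidence(a, b)
        target = accepted if confidence >= REVIEW_THRESHOLD else review
        target.append((a["id"], b["id"], confidence))
    # Lowest confidence first, so curators see the weakest matches at the top.
    review.sort(key=lambda item: item[2])
    return accepted, review


# Made-up candidate matches between records from two providers.
pairs = [
    ({"id": "a1", "doi": "10.1234/abc", "orcid": "0000-0002-1825-0097", "name": "R. Example"},
     {"id": "b1", "doi": "10.1234/abc", "orcid": "0000-0002-1825-0097", "name": "Robin Example"}),
    ({"id": "a2", "name": "J. Sample"},
     {"id": "b2", "name": "J. Sample"}),
]
accepted, review = triage(pairs)
print(accepted)  # [('a1', 'b1', 0.9)]
print(review)    # [('a2', 'b2', 0.1)]
```

The first pair shares a DOI and an ORCID, so it clears the threshold even though the name strings differ. The second pair shares only a name, so it lands in the review queue for a curation associate.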

Additionally, we anticipate utilizing automated processing and expert enrichment to:

  • Replace plaintext values with URIs as controlled vocabulary values
  • Utilize unique identifiers to correlate, link, and combine metadata records across multiple sources
  • Explore applying search-term aliasing (e.g., when searching on the term “malaria,” events with “marsh fever” should also be returned)
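
Search-term aliasing, the last item above, could look something like this Python sketch. The alias table, function names, and sample events are hypothetical; a real system would more likely draw aliases from a controlled vocabulary and run them through a search engine than scan text in a loop.

```python
# Hypothetical alias groups: terms in the same set are treated as equivalent.
ALIAS_GROUPS = [
    {"malaria", "marsh fever"},
]


def expand_query(term):
    """Return the search term plus any known aliases, lower-cased."""
    term = term.strip().lower()
    for group in ALIAS_GROUPS:
        if term in group:
            return set(group)
    return {term}


def search(events, term):
    """Return titles of events whose text mentions the term or any alias."""
    terms = expand_query(term)
    return [e["title"] for e in events
            if any(t in e["text"].lower() for t in terms)]


# Made-up research events.
events = [
    {"title": "Event 1", "text": "A survey of malaria interventions"},
    {"title": "Event 2", "text": "Historical accounts of marsh fever"},
    {"title": "Event 3", "text": "Crop yields under drought"},
]
print(search(events, "malaria"))  # ['Event 1', 'Event 2']
```

Because the alias group is symmetric, searching on “marsh fever” returns the same two events.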

We’d love to hear your thoughts on these plans both now and when we share an update on our progress at our upcoming SHARE Community Meeting in July.

By Rick Johnson

rick.johnson@nd.edu
Tags Data Curation, metadata, DOIs, Rick's MetaTips, Curation Associates, machine learning
All content is © copyright SHARE and available under a CC-BY 4.0 license.

Association of Research Libraries
21 Dupont Circle NW #800
Washington, DC 20036
202-296-2296
info@www.share-research.org
  • Credits
  • Accessibility
  • Privacy Policy
  • Brand Guidelines
  • Dashboard
This site uses cookies. By clicking 'I understand', you are agreeing to our use of cookies. More Info...
I Understand
Privacy & Cookies Policy

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may affect your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Non-necessary
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.
SAVE & ACCEPT