
This month I would like to provide an update on developments around SHARE’s metadata schema. Now, a little over a year since the launch of SHARE Notify, SHARE has captured over 7 million research events. In doing so, we have made great progress in our mission to make research more discoverable. We have also laid the groundwork to take the next steps toward goals we have had in mind for SHARE from the beginning. These goals are driving a reassessment of our metadata structure:
- Increase links among related materials, events, and researchers
- Harness partial and additional metadata retrieved from multiple data providers and link across those data providers
- Enable better grouping and filtering of research events
- Improve the interoperability of SHARE’s metadata so that it is more consumable by other systems
As I have mentioned in previous posts, connections like these could be enabled by data providers populating key metadata elements in the SHARE schema (“Rick’s MetaTips: SHARE Metadata Is Stitching Together the Research Life Cycle”) and/or providing unique identifiers so that SHARE can link events, people, and resources together (“Rick’s MetaTips: It’s All about the Links”).
However, the SHARE schema can still be tuned, and data collection efforts focused, to better align with the above priorities. SHARE’s metadata structure will therefore be enhanced to place greater emphasis on:
- Unique identifiers over plaintext-based values, e.g., ORCIDs in addition to author names (addresses priorities 1 and 2 above)
- Uniform resource identifier (URI) values over free-form text (addresses priorities 2 and 3)
- Increased accommodation for geolocation and applied controlled vocabularies, e.g., using the URI schemes mentioned above (addresses priorities 3 and 4)
- Accommodation of records with multiple data providers (addresses priority 2)
- Provenance of records across data providers (addresses priority 2)
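To make the last two points concrete, here is a minimal sketch of how records from several data providers might be combined while keeping track of which provider supplied each value. The field names, provider names, DOI, and the `merge_records` helper are all hypothetical illustrations, not part of the SHARE schema.

```python
from collections import defaultdict


def merge_records(records):
    """Group records by DOI and remember which providers supplied each field value."""
    merged = defaultdict(lambda: defaultdict(dict))
    for rec in records:
        doi = rec["doi"].lower()  # DOIs are case-insensitive, so normalize before matching
        for field, value in rec.items():
            if field in ("doi", "provider"):
                continue
            # Provenance: append the provider to the list for this exact value.
            merged[doi][field].setdefault(value, []).append(rec["provider"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {doi: {f: dict(v) for f, v in fields.items()} for doi, fields in merged.items()}


# Hypothetical records describing the same research event from two providers.
records = [
    {"provider": "repo_a", "doi": "10.1234/example", "title": "A Study",
     "author_orcid": "0000-0002-1825-0097"},
    {"provider": "repo_b", "doi": "10.1234/EXAMPLE", "title": "A Study",
     "funder": "Example Foundation"},
]
merged = merge_records(records)
```

In this sketch the combined record keeps the partial metadata from each provider (an ORCID from one, a funder from the other), and every value carries the list of providers that asserted it.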
More unique identifiers will enhance our ability to link records across data providers and across research event types, while URI values (rather than free-form text) will make SHARE’s metadata more consumable by and interoperable with other systems. A human can easily see that records with similar text values are roughly equivalent, but it is very difficult for software to confidently link and compare items that use different words with similar meanings. URI values, by contrast, can be read from a common table with consistent meanings and common usage patterns. For example, a reference to “Paris” could be interpreted as Paris, France; Paris, Texas; or many other places called Paris. If the GeoNames URI for Paris, France, were used instead of plaintext, the value could be interpreted unequivocally. Similarly, grouping related subject terms like “Biology” and “Life Sciences” can be very difficult when they are just plaintext. A preference for URI values will therefore make the SHARE schema more machine-actionable.
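The Paris example can be sketched in a few lines. The records and the `group_by` helper below are hypothetical, and the GeoNames identifiers are quoted from memory, so they should be verified against geonames.org before being relied on.

```python
# GeoNames URIs (identifiers quoted from memory; verify before reuse).
PARIS_FRANCE = "https://sws.geonames.org/2988507/"
PARIS_TEXAS = "https://sws.geonames.org/4717560/"

# Hypothetical records carrying both a free-text place and a URI.
records = [
    {"id": "r1", "place_text": "Paris", "place_uri": PARIS_FRANCE},
    {"id": "r2", "place_text": "Paris, France", "place_uri": PARIS_FRANCE},
    {"id": "r3", "place_text": "Paris", "place_uri": PARIS_TEXAS},
]


def group_by(records, key):
    """Group record ids by the value found under `key`."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec["id"])
    return groups


by_text = group_by(records, "place_text")  # mixes r1 (France) with r3 (Texas)
by_uri = group_by(records, "place_uri")    # keeps the two cities apart
```

Grouping on the text value lumps the French and Texan records together and separates two records about the same city, whereas grouping on the URI gets both cases right.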
Restructuring the metadata schema is only half the story. As mentioned in our April announcement about SHARE’s Curation Associates Program, much of our focus going forward will be on harnessing machine-learning techniques combined with expert metadata enrichment. This is an important distinction for SHARE that should not be overlooked: manual record curation and enhancement greatly improve the effectiveness of any automated metadata enrichment. By weaving expert and automated enrichment workflows together, we are able to leverage many automated techniques that might not be sufficient without human review or intervention. Because the matches that automation produces will not all carry 100% certainty, we look to:
- Apply a percentage of authority and confidence to imputed metadata elements based on data provider, shared unique identifiers (e.g., DOI, ORCID), shared names, etc.
- Flag and highlight which records need and will benefit the most from validation by the curation associates, based upon the confidence percentage.
- Flag some records for manual review when de-duplicating and combining records from multiple data providers, to validate that matches are correct. (Curation associates will also help manually de-duplicate records that are not identified through automation.)
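As a rough illustration of the workflow above, the sketch below assigns a confidence percentage to a candidate match and flags uncertain matches for a curation associate. The weights, the threshold, and the function names are assumptions chosen for the example; they do not describe SHARE's actual scoring.

```python
# Hypothetical evidence weights, in percentage points.
WEIGHTS = {"doi": 60, "orcid": 30, "title": 10}
REVIEW_THRESHOLD = 70  # matches scoring below this go to a human reviewer


def match_confidence(a, b):
    """Sum the weights of the fields that are present and equal in both records."""
    score = 0
    for field, weight in WEIGHTS.items():
        if a.get(field) and a.get(field) == b.get(field):
            score += weight
    return score


def needs_review(a, b):
    """Flag pairs with some shared evidence but not enough to merge automatically."""
    return 0 < match_confidence(a, b) < REVIEW_THRESHOLD


# Hypothetical records from different data providers.
rec_a = {"doi": "10.1234/example", "orcid": "0000-0002-1825-0097", "title": "A Study"}
rec_b = {"doi": "10.1234/example", "orcid": "0000-0002-1825-0097"}
rec_c = {"title": "A Study"}
rec_d = {"title": "Something Else"}
```

Here `rec_a` and `rec_b` share a DOI and an ORCID and would be merged automatically, while `rec_a` and `rec_c` share only a title and would be routed to manual review.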
Additionally, we anticipate utilizing automated processing and expert enrichment to:
- Replace plaintext values with URIs as controlled vocabulary values
- Utilize unique identifiers to correlate, link, and combine metadata records across multiple sources
- Explore applying search-term aliasing (e.g., when searching on the term “malaria,” events with “marsh fever” should also be returned)
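Search-term aliasing could look something like the following minimal sketch. The alias table, the sample events, and the `expand` and `search` helpers are hypothetical; a real implementation would more likely draw its aliases from a controlled vocabulary than from a hand-written dictionary.

```python
# Hypothetical alias table: canonical term -> alternative names.
ALIASES = {"malaria": {"marsh fever"}}


def expand(term):
    """Return the term together with every alias that shares its meaning."""
    term = term.lower()
    for canonical, alternatives in ALIASES.items():
        if term == canonical or term in alternatives:
            return {canonical} | alternatives
    return {term}


def search(term, events):
    """Return ids of events whose text mentions the term or any of its aliases."""
    terms = expand(term)
    return [e["id"] for e in events if any(t in e["text"].lower() for t in terms)]


# Hypothetical research events.
events = [
    {"id": "e1", "text": "Malaria incidence in coastal regions"},
    {"id": "e2", "text": "Historical accounts of marsh fever"},
    {"id": "e3", "text": "Dengue surveillance methods"},
]
hits = search("malaria", events)
```

With this table, a search for either “malaria” or “marsh fever” returns both `e1` and `e2`.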
We’d love to hear your thoughts on these plans both now and when we share an update on our progress at our upcoming SHARE Community Meeting in July.