Developers at the Center for Open Science working on the SHARE project are constantly looking for ways to improve SHARE’s highly variable metadata about scholarly and research activity. One challenging task is to add subject areas so that users can have more options and control when searching and filtering documents. Since we have metadata on more than 6 million documents in the SHARE data set, manually labeling the documents would be very tough. Therefore, we need to rely on an automated process to add subject labels to these documents with fairly high precision. That’s where machine learning comes in.
To tackle the problem, I built a multi-label document classification model using training data from the Public Library of Science (PLOS) application programming interface (API). PLOS stores more than 160,000 documents with explicitly labeled subject areas that fit within the PLOS taxonomy. The documents from PLOS contain titles and abstracts that can be used to generate features for the classification model. The PLOS taxonomy has a hierarchical structure that contains more than 10,000 terms, but we focus on its 11 top-level subject areas for our classification model.
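As a rough illustration of how labeled training documents can be pulled from PLOS, here is a minimal sketch against the public PLOS Solr search endpoint. The endpoint URL and the field names (`title`, `abstract`, `subject`) are assumptions for illustration, not a description of the exact queries we ran:

```python
import json
import urllib.parse
import urllib.request

# Public PLOS Solr search endpoint; the field names requested below
# ("title", "abstract", "subject") are assumptions for illustration.
PLOS_SEARCH_URL = "http://api.plos.org/search"

def build_url(start=0, rows=100):
    """Build a query URL requesting titles, abstracts, and subject labels."""
    params = {
        "q": "*:*",                        # match all documents
        "fl": "id,title,abstract,subject", # fields to return
        "wt": "json",
        "start": start,                    # paging offset
        "rows": rows,                      # page size
    }
    return PLOS_SEARCH_URL + "?" + urllib.parse.urlencode(params)

def fetch_page(start=0, rows=100):
    """Fetch one page of labeled documents (requires network access)."""
    with urllib.request.urlopen(build_url(start, rows), timeout=30) as resp:
        return json.load(resp)["response"]["docs"]
```

Paging through results with `start` and `rows` makes it straightforward to accumulate the full labeled corpus without loading it in one request.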
The PLOS training documents provide abundant training data for our supervised machine-learning model. We applied several preprocessing methods to address issues in the data set; the details are illustrated in a follow-up post on the high dimensional space blog, while this post presents an overview of our workflow.
To begin with, certain features are extracted from each document, such as the number of times each word (or term) appears in the document (its “term frequency”), often reweighted by how rare the term is across the corpus (“tf-idf,” for term frequency-inverse document frequency). This representation is called a “bag-of-words” model, or an “n-gram” model when short sequences of adjacent words are counted as well. The extracted features are then used by an automated classifier to map a document into a category (subject area). Since this is a multi-label classification problem (each document can have multiple subject areas), we trained 11 one-vs.-rest classifiers, where each classifier was exclusively used to identify whether or not a given document belongs to one particular subject area. For example, when training the “Earth sciences” classifier, all documents that have “Earth sciences” as one of their subject areas will be labeled 1 and all others will be labeled 0. Training classifiers separately allowed for greater tuning flexibility and allowed us to deploy a selection of classifiers with good precision while continuing to improve other ones. The best classifiers could achieve over 90 percent precision, while others need further optimization. Nevertheless, we are confident the model will keep improving over time with more feature engineering (e.g., adding word2vec), more diverse training data, and more parameter optimization.
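The workflow above can be sketched in a few lines of scikit-learn. This is a toy illustration of the technique, not our production code: the documents and subject labels are invented, and only two of the 11 subject areas are shown. It extracts tf-idf bag-of-words features and trains one independent binary (one-vs.-rest) classifier per subject area:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; titles/abstracts and subject labels are invented for illustration.
docs = [
    "seismic activity along the fault line",
    "gene expression in tumor cells",
    "plate tectonics and volcanic eruptions",
    "protein folding and cell signaling",
]
labels = [
    {"Earth sciences"},
    {"Biology and life sciences"},
    {"Earth sciences"},
    {"Biology and life sciences"},
]
subjects = ["Earth sciences", "Biology and life sciences"]

# Bag-of-words features: unigram and bigram tf-idf weights.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# One-vs.-rest: one independent binary classifier per subject area,
# so each can be tuned and deployed separately.
classifiers = {}
for subject in subjects:
    y = [1 if subject in doc_labels else 0 for doc_labels in labels]
    clf = LogisticRegression()
    clf.fit(X, y)
    classifiers[subject] = clf

# A new document gets every subject whose classifier fires.
new_doc = vectorizer.transform(["earthquake near a volcanic fault"])
predicted = [s for s, clf in classifiers.items() if clf.predict(new_doc)[0] == 1]
```

Keeping the classifiers in a plain dictionary mirrors the deployment flexibility mentioned above: a well-performing classifier can be shipped while a weaker one stays in training.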
Finally, we need to take into consideration the scalability of our framework. The traditional methods described above require all training data to be loaded into memory at once. To accommodate increasing training data size, I built a framework that can utilize batch training methods and feed in data one chunk at a time.
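A minimal sketch of such out-of-core training, under the assumption of scikit-learn tooling (the toy document stream below is invented for illustration): a stateless `HashingVectorizer` transforms each chunk without a full-corpus vocabulary pass, and `SGDClassifier.partial_fit` updates one binary subject classifier incrementally:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical stream of (abstract text, binary "Earth sciences" label) pairs;
# in practice this would be read lazily from disk or an API, chunk by chunk.
stream = [
    ("seismic waves and fault rupture", 1),
    ("tumor suppressor gene pathways", 0),
    ("volcanic ash dispersion models", 1),
    ("cell membrane protein transport", 0),
]

# HashingVectorizer keeps no vocabulary state, so each chunk can be
# transformed independently; SGDClassifier learns via partial_fit.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()

def chunks(data, size):
    """Yield successive fixed-size chunks of the stream."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

for chunk in chunks(stream, 2):
    texts, labels = zip(*chunk)
    X = vectorizer.transform(texts)
    # classes is required on the first partial_fit call so the model knows
    # all possible labels up front; passing it on later calls is harmless.
    clf.partial_fit(X, labels, classes=[0, 1])
```

Because each chunk is vectorized and consumed independently, memory use stays flat no matter how large the training set grows.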
The follow-up post on the high dimensional space blog further explains the detailed preprocessing, feature engineering, and modeling steps. As a bonus we also show how to use Google’s TensorFlow open source software library for machine learning to build a convolutional neural network for the text-classification problem.
Big thanks to Katherine Schinkel, who contributed to model selection and metrics, and credits to Erin Braswell and Brian Gorges for editing help!