MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text. It consists of four main packages:
- Classify - contains the machine learning algorithms for extraction and classification as well as data structures for storing non-text data, classifiers, and evaluations of experiments. The classify package can stand on its own, so it should not call any of the other packages.
- Text - this package contains the classes necessary to process text data such as emails. The text package also contains Mixup (which stands for My Information eXtraction and Understanding Program), which is a matching language for modifying TextLabels.
- UI - as the name implies, this package provides a user interface for running learning experiments on text data.
- Util - provides utilities such as the command line processor and GUI framework.
To download and run MinorThird, take a look at the Getting Started tutorial.
As stated above, the classify package is where the learning is performed, and it can be used on its own to perform experiments on non-text data (i.e. a classification label (such as POS or NEG) and a list of features (such as symptoms)). How to use the classify package is documented in the Classify Package Tutorial.
The ui package contains several classes for viewing, editing, and running experiments on text data. To learn how to put data in a format that minorthird can recognize, look at the Labeling and Loading Data Tutorial.
Before getting started looking at different classes, here is some minorthird terminology that is helpful to know:
- Document - a single example or file. For example: if a directory of emails is loaded into minorthird, each email is a separate document.
- TextToken - a particular substring of a particular document.
- Span - a series of adjacent tokens from the same document.
- SpanType - a string label that is associated with a span. It is binary: a span either has the type or it does not.
- SpanProp - a string label that is associated with a span. It is multi-class: the property takes one of several values.
- TextBase - a collection of documents.
- TextLabels - assertions about types and properties of certain spans in a TextBase. In other words, the structure that stores the labels of each document in the TextBase.
- Classifier - a structure that holds what minorthird has learned from the training documents. This structure holds all the tokens from a TextBase and how strongly they are associated with the learned spanType.
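The terminology above can be pictured with a small, self-contained sketch. Everything in it is hypothetical and written only for illustration: the class `TerminologySketch`, its `Token` type, the whitespace tokenizer, and the `entityKind` property are assumptions, not MinorThird's own classes or API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the terminology; these are NOT MinorThird's classes.
class TerminologySketch {

    /** A token: a particular substring [start, end) of a particular document. */
    static final class Token {
        final String documentId;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        Token(String documentId, int start, int end) {
            this.documentId = documentId;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits text on whitespace (an assumed, deliberately simple tokenizer). */
    static List<Token> tokenize(String documentId, String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(documentId, start, i));
            }
        }
        return tokens;
    }

    /** A span is a run of adjacent tokens; this returns the text it covers. */
    static String spanText(String text, List<Token> tokens, int firstToken, int lastToken) {
        return text.substring(tokens.get(firstToken).start, tokens.get(lastToken).end);
    }

    /** Builds a key that identifies a span by document and token range. */
    static String spanKey(String documentId, int firstToken, int lastToken) {
        return documentId + ":" + firstToken + "-" + lastToken;
    }

    public static void main(String[] args) {
        // A "text base" here is just a map from document id to document text.
        Map<String, String> textBase = new HashMap<>();
        textBase.put("email1", "Meet Jane Doe at noon");

        List<Token> tokens = tokenize("email1", textBase.get("email1"));
        String jane = spanKey("email1", 1, 2); // tokens 1..2 cover "Jane Doe"

        // Span types are binary: a span is either in the set for a type or not.
        Map<String, Set<String>> spanTypes = new HashMap<>();
        spanTypes.computeIfAbsent("name", k -> new HashSet<>()).add(jane);

        // Span properties are multi-class: each property maps a span to one value.
        Map<String, Map<String, String>> spanProps = new HashMap<>();
        spanProps.computeIfAbsent("entityKind", k -> new HashMap<>()).put(jane, "person");

        System.out.println(spanText(textBase.get("email1"), tokens, 1, 2));
        System.out.println(spanTypes.get("name").contains(jane));
        System.out.println(spanProps.get("entityKind").get(jane));
    }
}
```

In this picture the two maps together play the role of the labels: they make assertions about spans without changing the documents themselves.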
Here is a breakdown of the ui classes and their functionality (click on each class name to view the tutorial for that class):
ViewLabels - viewing tool for a collection of documents and their labels which are loaded into minorthird
RunMixup and DebugMixup - run a mixup program on documents loaded into minorthird. Running a mixup program annotates a set of documents based on the rules defined in the program. Therefore, RunMixup will create new labels for the documents you loaded. DebugMixup runs a mixup program and pops up a TextBaseEditor so that Mixup results may be hand corrected.
Extraction - extracts PORTIONS of a document. For example: names, places, noun phrases.
- TrainExtractor - tells minorthird to learn a certain spanType (such as name) based on a set of labeled documents you give it. This class will output a classifier which can be tested on other labeled documents or used to annotate unlabeled documents.
- TestExtractor - requires a set of labeled text documents and a classifier. Allows you to test the performance of a classifier on a set of test text documents. Reminder: the test documents MUST be hand labeled in order for you to get the results of how accurate your classifier is. This tool will NOT output the predicted labels of the classifier; it outputs statistics on how the classifier performed. If you would like the classifier's predicted labels, run ApplyAnnotator.
- TrainTestExtractor - requires that you input either a set of labeled documents that will be split or a set of labeled train documents as well as a set of labeled test documents. This tool will tell minorthird to create a classifier for a certain spanType and test it on documents either input by the user or created by the splitter.
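To see why the test documents must be hand labeled, it helps to look at how extraction performance can be scored. The sketch below compares predicted spans against hand-labeled ones and computes span-level precision, recall, and F1. These are common extraction statistics; whether they match exactly what TestExtractor reports is an assumption, and the class `ExtractionScoreSketch` and the `document:first-last` span keys are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: span-level scoring of predictions against hand labels.
// This is not MinorThird code; spans are plain "document:first-last" strings.
class ExtractionScoreSketch {

    /** Counts predicted spans that exactly match a hand-labeled span. */
    static int truePositives(Set<String> gold, Set<String> predicted) {
        Set<String> both = new HashSet<>(predicted);
        both.retainAll(gold);
        return both.size();
    }

    /** Fraction of predicted spans that are correct. */
    static double precision(Set<String> gold, Set<String> predicted) {
        return predicted.isEmpty() ? 0.0
                : (double) truePositives(gold, predicted) / predicted.size();
    }

    /** Fraction of hand-labeled spans that were found. */
    static double recall(Set<String> gold, Set<String> predicted) {
        return gold.isEmpty() ? 0.0
                : (double) truePositives(gold, predicted) / gold.size();
    }

    /** Harmonic mean of precision and recall. */
    static double f1(Set<String> gold, Set<String> predicted) {
        double p = precision(gold, predicted);
        double r = recall(gold, predicted);
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hand labels: without these there is nothing to score against.
        Set<String> gold = new HashSet<>(Arrays.asList("d1:1-2", "d1:5-5", "d2:0-1"));
        // Spans a trained extractor might predict on the same documents.
        Set<String> predicted = new HashSet<>(
                Arrays.asList("d1:1-2", "d2:0-1", "d2:4-4", "d2:7-8"));

        System.out.println("precision = " + precision(gold, predicted));
        System.out.println("recall    = " + recall(gold, predicted));
        System.out.println("f1        = " + f1(gold, predicted));
    }
}
```

Exact-match scoring is the strictest choice; a predicted span that overlaps a hand label but has different boundaries counts as an error here.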
Classification - classifies ENTIRE documents. For example: classify an email as real or spam.
- TrainClassifier - tells minorthird to learn a spanType (such as spam) and create a classifier for this spanType based on a set of labeled documents that you input. The classifier can be used to test other labeled documents or output predicted labels for unlabeled documents.
- TestClassifier - requires a set of labeled text documents and a classifier. Allows you to test the performance of a classifier on a set of test text documents. Reminder: the test documents MUST be hand labeled in order for you to get the results of how accurate your classifier is. This tool will NOT output the predicted labels of the classifier. If you would like the classifier's predicted labels, run ApplyAnnotator.
- TrainTestClassifier - requires that you input either a set of labeled documents that will be split or a set of labeled train documents as well as a set of labeled test documents. This tool will tell minorthird to create a classifier for a certain spanType and test it on documents either input by the user or created by the splitter.
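When a single set of labeled documents is supplied, it has to be divided into a training part and a test part. The sketch below shows one simple way to do that: a seeded random split. It is an illustration only; the class `SplitSketch`, the 70/30 fraction, and the seed are assumptions, and minorthird's own splitters may work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a random train/test split; not MinorThird's splitter.
class SplitSketch {

    /** Returns [train, test] after shuffling a copy of the ids with a fixed seed. */
    static List<List<String>> split(List<String> documentIds, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        List<String> shuffled = new ArrayList<>(documentIds);
        Collections.shuffle(shuffled, new Random(seed)); // same seed, same split
        int cut = (int) Math.round(shuffled.size() * trainFraction);

        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList(
                "e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10");
        List<List<String>> parts = split(ids, 0.7, 42L);
        System.out.println("train: " + parts.get(0));
        System.out.println("test:  " + parts.get(1));
    }
}
```

Keeping the two parts disjoint matters: testing on documents the classifier was trained on would overstate its accuracy.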
MultiClassification - works the same way as Classification except it can learn, test, and annotate multiple dimensions per document. For example: it can learn and classify both color and shape. Note: this is different from learning one dimension with several options (e.g. learning whether something is square, rectangular, circular, etc.). In that case, you are only learning one label rather than several!
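The difference between several dimensions and one dimension with several options is easiest to see in the data. In the sketch below each document carries one label per dimension (`color` and `shape`), and each dimension is scored on its own. The class `MultiDimensionSketch` and its helpers are hypothetical, and scoring each dimension independently is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: documents labeled along several dimensions at once.
// Not MinorThird code.
class MultiDimensionSketch {

    /** Builds the labels for one document: one value per dimension. */
    static Map<String, String> labels(String color, String shape) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("color", color);
        m.put("shape", shape);
        return m;
    }

    /** Fraction of documents whose predicted label matches for one dimension. */
    static double accuracy(List<Map<String, String>> actual,
                           List<Map<String, String>> predicted,
                           String dimension) {
        if (actual.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            String truth = actual.get(i).get(dimension);
            if (truth != null && truth.equals(predicted.get(i).get(dimension))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> actual = Arrays.asList(
                labels("red", "square"), labels("blue", "circle"), labels("red", "circle"));
        List<Map<String, String>> predicted = Arrays.asList(
                labels("red", "square"), labels("blue", "square"), labels("red", "circle"));

        // Each dimension is its own classification problem with its own score.
        System.out.println("color accuracy = " + accuracy(actual, predicted, "color"));
        System.out.println("shape accuracy = " + accuracy(actual, predicted, "shape"));
    }
}
```

A single-dimension problem would keep only one of these keys, however many values that key can take.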
OnlineLearner - the OnlineTextClassifierLearner allows you to add Documents to a learner by passing in a document string rather than a document span. The OnlineTextClassifierLearner also returns a TextClassifier with a call to getTextClassifier() which returns the score of a document string rather than a document span.
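The idea of a learner that accepts raw document strings one at a time, and a classifier that scores raw document strings, can be sketched as follows. This is not the OnlineTextClassifierLearner API: the class `OnlineLearnerSketch`, its method signatures, the bag-of-words features, and the perceptron update are all assumptions chosen to keep the example small, and the learner and the classifier are folded into one object for brevity.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of online learning from document strings.
// Uses a simple perceptron over lowercased words; not MinorThird code.
class OnlineLearnerSketch {

    private final Map<String, Double> weights = new HashMap<>();

    /** Scores a raw document string; a positive score favors the target class. */
    double score(String document) {
        double total = 0.0;
        for (String word : words(document)) {
            total += weights.getOrDefault(word, 0.0);
        }
        return total;
    }

    /** Adds one labeled document string, updating weights only on a mistake. */
    void addDocument(String document, boolean inClass) {
        double target = inClass ? 1.0 : -1.0;
        if (target * score(document) <= 0.0) {
            for (String word : words(document)) {
                weights.merge(word, target, Double::sum);
            }
        }
    }

    /** Splits a document into lowercased, whitespace-separated words. */
    private static String[] words(String document) {
        String trimmed = document.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        OnlineLearnerSketch learner = new OnlineLearnerSketch();
        // Documents arrive one at a time as plain strings.
        learner.addDocument("cheap pills now", true);   // spam
        learner.addDocument("meeting at noon", false);  // not spam

        System.out.println(learner.score("cheap pills"));
        System.out.println(learner.score("meeting noon"));
    }
}
```

Because the weights are updated after each document, the learner can be queried for a score at any point without retraining on the earlier documents.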
ApplyAnnotator - apply a saved classifier to a set of documents to output a set of predicted labels. You can use this to either label unlabeled data or compare the predicted labels to actual labels.
EditLabels - tool for adding and/or removing labels from the collection of documents you loaded into minorthird and saving a new labels document. Useful for debugging the results of ApplyAnnotator.
Viewing and comparing evaluations of Extraction and Classification experiments is explained in the EvaluationGroup Tutorial.