Minorthird is a collection of methods for learning to extract entities and categorize text.


Some basic concepts: in Minorthird, a collection of documents is stored in a {@link edu.cmu.minorthird.text.TextBase}. Annotations about these documents are stored in a corresponding {@link edu.cmu.minorthird.text.TextLabels} object. Each annotation asserts a category or property for a word, a document, or a subsequence of words (aka a {@link edu.cmu.minorthird.text.Span}). TextLabels can store information from many sources: they might hold annotations produced by human labelers (perhaps using a GUI tool like the {@link edu.cmu.minorthird.text.gui.TextBaseEditor}), annotations produced by a hand-written program, or annotations produced by a learned program. Multiple TextLabels objects can annotate a single TextBase, if necessary.
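
The relationship between these three concepts can be illustrated with a small, self-contained sketch. The classes below (ToyTextBase, ToySpan, ToyTextLabels) and their methods are hypothetical stand-ins written for this illustration; they are not the Minorthird classes and do not reproduce their signatures. The sketch assumes whitespace tokenization and half-open token ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins, NOT the Minorthird API: a toy model of how a
// text base, spans, and label sets relate to one another.
public class AnnotationSketch {

    /** A contiguous run of tokens [start, end) inside one document. */
    static final class ToySpan {
        final String docId;
        final int start;
        final int end;

        ToySpan(String docId, int start, int end) {
            this.docId = docId;
            this.start = start;
            this.end = end;
        }
    }

    /** Holds tokenized documents, keyed by document id. */
    static final class ToyTextBase {
        private final Map<String, String[]> docs = new LinkedHashMap<>();

        void loadDocument(String docId, String text) {
            // Assumption: plain whitespace tokenization.
            docs.put(docId, text.trim().split("\\s+"));
        }

        String textOf(ToySpan span) {
            String[] tokens = docs.get(span.docId);
            return String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
        }
    }

    /** Holds assertions of the form "this span has this type". */
    static final class ToyTextLabels {
        private final Map<String, List<ToySpan>> spansByType = new LinkedHashMap<>();

        void addToType(ToySpan span, String type) {
            spansByType.computeIfAbsent(type, k -> new ArrayList<>()).add(span);
        }

        List<ToySpan> spansOfType(String type) {
            return spansByType.getOrDefault(type, Collections.emptyList());
        }
    }

    public static void main(String[] args) {
        ToyTextBase base = new ToyTextBase();
        base.loadDocument("doc1", "Alice Smith visited Acme Labs yesterday");

        // Two independent label sets annotating the same text base.
        ToyTextLabels humanLabels = new ToyTextLabels();
        ToyTextLabels programLabels = new ToyTextLabels();
        humanLabels.addToType(new ToySpan("doc1", 0, 2), "person");
        programLabels.addToType(new ToySpan("doc1", 3, 5), "organization");

        for (ToySpan s : humanLabels.spansOfType("person")) {
            System.out.println("person: " + base.textOf(s));
        }
        for (ToySpan s : programLabels.spansOfType("organization")) {
            System.out.println("organization: " + base.textOf(s));
        }
    }
}
```

Keeping the labels separate from the text is what lets several label sets, from different sources, describe the same documents without copying or modifying them.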

Annotated TextBases can be stored in many ways, so a "repository" can be configured to hold a set of TextLabels and their associated TextBases. TextLabels in the repository are loaded with the {@link edu.cmu.minorthird.text.FancyLoader}. TextLabels and TextBases can also be loaded directly with the {@link edu.cmu.minorthird.text.TextBaseLoader} and the {@link edu.cmu.minorthird.text.gui.TextBaseEditor}.

Moderately complex annotation programs can be implemented with {@link edu.cmu.minorthird.text.mixup.Mixup}, a special-purpose annotation language that is part of Minorthird. Mixup can also be used to generate features for learning algorithms. A sequence of Mixup commands can be combined in a {@link edu.cmu.minorthird.text.mixup.MixupProgram}. The {@link edu.cmu.minorthird.text.gui.MixupDebugger} is a GUI tool for testing a MixupProgram.

Minorthird contains a number of methods for learning to extract Spans from a document, or learning to classify Spans. Top-level programs for conducting learning experiments and for training, testing, and applying {@link edu.cmu.minorthird.text.Annotator}s can be found in the {@link edu.cmu.minorthird.ui} package. (The {@link edu.cmu.minorthird.ui.Help} class is a main program that, when invoked, lists the relevant main methods.)

Under the hood, learning is performed using classes from the {@link edu.cmu.minorthird.classify} package. A {@link edu.cmu.minorthird.classify.ClassifierLearner} learns a {@link edu.cmu.minorthird.classify.Classifier} from a set of labeled {@link edu.cmu.minorthird.classify.Example}s, usually stored in a {@link edu.cmu.minorthird.classify.Dataset}. Several sequential classification algorithms are also implemented in the package {@link edu.cmu.minorthird.classify.sequential}. The classify package is independent of the {@link edu.cmu.minorthird.text} package, but is linked to it by the routines in {@link edu.cmu.minorthird.text.learn}. Most importantly, the {@link edu.cmu.minorthird.text.learn.SpanFE} class implements what is essentially a small feature extraction sub-language, embedded in Java, which makes it easy to generate a wide variety of features of a document, token, or Span. This language is made more powerful by its ability to base features on annotations stored in the {@link edu.cmu.minorthird.text.TextLabels} that are associated with the Span.
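
The learner/classifier/example/dataset division described above can be sketched in a few lines of plain Java. Everything in this sketch is an assumption made for illustration: ToyExample, ToyClassifier, and ToyPerceptronLearner are hypothetical names, not the Minorthird interfaces, and the simple perceptron update is used only because it is short, not because it is what any particular Minorthird learner does. Features are modeled as a map from feature name to weight, roughly the kind of output a feature extractor might produce for a Span.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins, NOT the Minorthird API: a learner consumes a
// dataset of labeled examples and produces a classifier.
public class LearnerSketch {

    /** A feature vector plus a binary label (+1 or -1). */
    static final class ToyExample {
        final Map<String, Double> features;
        final int label;

        ToyExample(Map<String, Double> features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Scores a feature vector; a positive score means the positive class. */
    interface ToyClassifier {
        double score(Map<String, Double> features);
    }

    /** A basic perceptron, chosen here only for brevity. */
    static final class ToyPerceptronLearner {
        private final int epochs;

        ToyPerceptronLearner(int epochs) {
            this.epochs = epochs;
        }

        ToyClassifier learn(List<ToyExample> dataset) {
            final Map<String, Double> weights = new HashMap<>();
            for (int epoch = 0; epoch < epochs; epoch++) {
                for (ToyExample ex : dataset) {
                    // Update the weights only when the example is misclassified.
                    if (ex.label * dot(weights, ex.features) <= 0) {
                        for (Map.Entry<String, Double> f : ex.features.entrySet()) {
                            weights.merge(f.getKey(), ex.label * f.getValue(), Double::sum);
                        }
                    }
                }
            }
            return x -> dot(weights, x);
        }
    }

    /** Sparse dot product; features missing from the weights count as zero. */
    static double dot(Map<String, Double> weights, Map<String, Double> x) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : x.entrySet()) {
            sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return sum;
    }

    /** Builds a binary feature vector from feature names. */
    static Map<String, Double> features(String... names) {
        Map<String, Double> map = new HashMap<>();
        for (String name : names) {
            map.put(name, 1.0);
        }
        return map;
    }

    /** A tiny made-up dataset: is a token part of a person name? */
    static List<ToyExample> toyDataset() {
        List<ToyExample> data = new ArrayList<>();
        data.add(new ToyExample(features("capitalized", "afterTitle"), +1));
        data.add(new ToyExample(features("capitalized"), +1));
        data.add(new ToyExample(features("lowercase"), -1));
        data.add(new ToyExample(features("lowercase", "hasDigit"), -1));
        return data;
    }

    public static void main(String[] args) {
        ToyClassifier classifier = new ToyPerceptronLearner(10).learn(toyDataset());
        System.out.println("capitalized token score: " + classifier.score(features("capitalized")));
        System.out.println("lowercase token score: " + classifier.score(features("lowercase")));
    }
}
```

Note that nothing in the learner refers to text: it sees only feature maps and labels. This mirrors the separation between the classify package and the text package, with feature extraction acting as the bridge between them.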