Package edu.cmu.minorthird

Minorthird is a collection of methods for learning to extract entities and categorize text.


Class Summary
Minorthird: A launch bar for Minorthird applications.
 

Package edu.cmu.minorthird Description

Minorthird is a collection of methods for learning to extract entities and categorize text.

Some basic concepts: in Minorthird, a collection of documents is stored in a TextBase. Annotations about these documents are stored in a corresponding TextLabels object. Each annotation asserts a category or property for a word, a document, or a subsequence of words (also known as a Span). TextLabels store information from many sources: they might hold annotations produced by human labelers (perhaps using a GUI tool like the TextBaseEditor), annotations produced by a hand-written program, or annotations produced by a learned program. Multiple TextLabels can annotate a single TextBase, if necessary.
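The relationship between documents, spans, and stand-off annotations can be pictured with a minimal sketch. The classes below (SketchSpan, SketchTextBase, SketchTextLabels) are hypothetical stand-ins written for this illustration and are not the Minorthird classes or their signatures; the whitespace tokenizer is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT the Minorthird API.

/** A contiguous run of tokens [start, end) inside one document. */
final class SketchSpan {
    final String documentId;
    final int start;
    final int end;

    SketchSpan(String documentId, int start, int end) {
        this.documentId = documentId;
        this.start = start;
        this.end = end;
    }
}

/** Holds tokenized documents keyed by document id. */
class SketchTextBase {
    private final Map<String, String[]> tokensById = new LinkedHashMap<>();

    void loadDocument(String documentId, String text) {
        // Assumption: naive whitespace tokenization is enough for the sketch.
        tokensById.put(documentId, text.trim().split("\\s+"));
    }

    /** Returns the tokens covered by the span, joined with single spaces. */
    String textOf(SketchSpan span) {
        String[] tokens = tokensById.get(span.documentId);
        return String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
    }
}

/** Stand-off annotations: category name mapped to the spans given that category. */
class SketchTextLabels {
    private final Map<String, List<SketchSpan>> spansByType = new LinkedHashMap<>();

    void label(SketchSpan span, String type) {
        spansByType.computeIfAbsent(type, k -> new ArrayList<>()).add(span);
    }

    List<SketchSpan> spansOfType(String type) {
        return spansByType.getOrDefault(type, Collections.emptyList());
    }
}

class TextModelDemo {
    public static void main(String[] args) {
        SketchTextBase base = new SketchTextBase();
        base.loadDocument("d1", "Dr. Jane Smith visited Pittsburgh");

        // Two independent label sets annotate the same text base.
        SketchTextLabels humanLabels = new SketchTextLabels();
        humanLabels.label(new SketchSpan("d1", 1, 3), "person");

        SketchTextLabels programLabels = new SketchTextLabels();
        programLabels.label(new SketchSpan("d1", 4, 5), "location");

        for (SketchSpan span : humanLabels.spansOfType("person")) {
            System.out.println("person: " + base.textOf(span));
        }
        for (SketchSpan span : programLabels.spansOfType("location")) {
            System.out.println("location: " + base.textOf(span));
        }
    }
}
```

Keeping the labels outside the text base is what lets several label sets (human, hand-written, learned) coexist over the same documents without modifying them.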

More about text manipulation and processing can be found in the Javadocs for the minorthird.text and minorthird.text.mixup packages.

Annotated TextBases can be stored in many ways, so a "repository" can be configured to hold a set of TextLabels and their associated TextBases. TextLabels in the repository are loaded with the FancyLoader. TextLabels and TextBases can also be loaded directly with the TextBaseLoader and the TextBaseEditor.

Moderately complex annotation programs can be implemented with Mixup, a special-purpose annotation language that is part of Minorthird. Mixup can also be used to generate features for learning algorithms. A sequence of Mixup commands can be combined in a MixupProgram. The MixupDebugger is a GUI tool for testing a MixupProgram.

Minorthird contains a number of methods for learning to extract Spans from a document, or learning to classify Spans. Top-level programs for conducting learning experiments and training, testing and applying Annotators can be found in the edu.cmu.minorthird.ui package. (The Help class is a main program that, when invoked, lists the relevant main methods.)

Under the hood, learning is performed using classes from the edu.cmu.minorthird.classify package. A ClassifierLearner learns a Classifier from a set of labeled Examples, usually stored in a Dataset. Several sequential classification algorithms are also implemented in the package edu.cmu.minorthird.classify.sequential. The classify package is independent of the edu.cmu.minorthird.text package, but is linked to it by the routines in edu.cmu.minorthird.text.learn. Most importantly, the SpanFE class implements what is essentially a small feature extraction sub-language, embedded in Java, which makes it easy to generate a wide variety of features of a document, token, or Span. This language is even more powerful because it can base features on annotations stored in TextLabels that are associated with the Span.
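The learner/classifier split described above can be sketched in a few lines. Everything here is an assumption made for illustration: the interfaces SketchLearner and SketchClassifier, the SketchExample class, and the bag-of-words helper are hypothetical and do not reproduce Minorthird signatures, and the perceptron is simply a convenient, well-known learner chosen for the sketch, not necessarily what any Minorthird class uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interfaces for illustration only; NOT Minorthird signatures.

/** A labeled example: sparse feature vector plus a binary label (+1 or -1). */
class SketchExample {
    final Map<String, Double> features;
    final int label;

    SketchExample(Map<String, Double> features, int label) {
        this.features = features;
        this.label = label;
    }
}

interface SketchClassifier {
    /** A positive score predicts the positive class. */
    double score(Map<String, Double> features);
}

interface SketchLearner {
    SketchClassifier learn(List<SketchExample> dataset);
}

/** Plain perceptron: adjust weights by the features of each misclassified example. */
class SketchPerceptron implements SketchLearner {
    private final int epochs;

    SketchPerceptron(int epochs) {
        this.epochs = epochs;
    }

    @Override
    public SketchClassifier learn(List<SketchExample> dataset) {
        Map<String, Double> weights = new HashMap<>();
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (SketchExample ex : dataset) {
                // A non-positive margin means the example is misclassified.
                if (ex.label * dot(weights, ex.features) <= 0) {
                    for (Map.Entry<String, Double> f : ex.features.entrySet()) {
                        weights.merge(f.getKey(), ex.label * f.getValue(), Double::sum);
                    }
                }
            }
        }
        return features -> dot(weights, features);
    }

    static double dot(Map<String, Double> weights, Map<String, Double> features) {
        double sum = 0.0;
        for (Map.Entry<String, Double> f : features.entrySet()) {
            sum += weights.getOrDefault(f.getKey(), 0.0) * f.getValue();
        }
        return sum;
    }
}

class SketchFeatures {
    /** Lowercased bag-of-words counts for the tokens of a span. */
    static Map<String, Double> bagOfWords(String... tokens) {
        Map<String, Double> features = new HashMap<>();
        for (String token : tokens) {
            features.merge("word=" + token.toLowerCase(), 1.0, Double::sum);
        }
        return features;
    }
}

class LearnerDemo {
    static List<SketchExample> toyDataset() {
        List<SketchExample> dataset = new ArrayList<>();
        dataset.add(new SketchExample(SketchFeatures.bagOfWords("great", "movie"), +1));
        dataset.add(new SketchExample(SketchFeatures.bagOfWords("terrible", "movie"), -1));
        dataset.add(new SketchExample(SketchFeatures.bagOfWords("great", "plot"), +1));
        dataset.add(new SketchExample(SketchFeatures.bagOfWords("terrible", "plot"), -1));
        return dataset;
    }

    public static void main(String[] args) {
        SketchClassifier classifier = new SketchPerceptron(5).learn(toyDataset());
        System.out.println(classifier.score(SketchFeatures.bagOfWords("great", "acting")));
        System.out.println(classifier.score(SketchFeatures.bagOfWords("terrible", "acting")));
    }
}
```

Because the learner only sees feature maps and labels, it stays independent of how text is stored; only the feature extraction step needs to know about tokens and spans, which mirrors the package separation described above.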