Package edu.cmu.minorthird.text

Storing and manipulating annotated text.


Interface Summary
Annotator Something that extends a text labeling with additional annotations.
MonotonicTextLabels Maintains assertions about 'types' and 'properties' of contiguous Spans of these Seq's.
MutableTextLabels Maintains assertions about 'types' and 'properties' of Spans.
Span A series of adjacent Tokens from the same document.
SpanFinder Finds subspans of document spans.
TextBase Maintains information about what's in a set of documents.
TextLabels Access assertions about 'types' and 'properties' of contiguous Spans of these Seq's.
Token An interned version of a string.
Tokenizer  
Trie.ResultIterator An extension of Span.Looper which also returns the ids associated with a Span.
 

Class Summary
AbstractAnnotator Generic implementation of an annotator.
AbstractSpanFinder Abstract implementation of a SpanFinder.
AbstractTextBase  
AnnotatorLoader Analogous to a ClassLoader, this finds annotators by name, so they can be applied to a set of labels.
BasicSpan Implements the Span interface.
BasicTextBase Maintains information about what's in a set of documents.
BasicTextLabels Maintains assertions about 'types' and 'properties' of contiguous Spans of these TextToken's.
BoneheadStemmer A very simple stemming algorithm.
BOWClassifierWrapper Deprecated.  
CharAnnotation Represents a stand-off annotation by character, i.e., offset, length, and type. The stand-off annotation is generally immutable.
CompoundTokenizer  
DefaultAnnotatorLoader Default version of AnnotatorLoader.
Dependencies Deprecated. Use the require mechanism instead.
Details Detailed information about assertions in a TextLabels object
Document This class holds a single text 'document'.
EmptyLabels An empty text labeling.
EncapsulatedAnnotator An annotator that 'requires' some type of annotation, but exports only a selected set of spanTypes (maybe all of them) from the annotated documents.
EncapsulatingAnnotatorLoader AnnotatorLoader which contains locally a list of Annotator definitions, in the form of a list of class files, and/or mixup files.
ExtractAbbrev The ExtractAbbrev class implements a simple algorithm for extraction of abbreviations and their definitions from biomedical text.
FancyLoader Configurable method of loading data objects.
FilterTokenizer This implementation of the Tokenizer interface is used for filtering a text base based on a specified spantype.
LabeledDirectory  
MinorTagger An echo-like server that labels popular entities in an input text using XML tags. Important setup notes: (1) Comment out the line require 'pos'; in "lib/mixup/np.mixup", because this server already does POS tagging (otherwise MontyLingua will be invoked more than once). (2) If you encounter a 'rcwangName.mixup' not found error, copy apps/names/lib/rcwangName.mixup and apps/names/lib/newnames.txt to lib/mixup, or make sure rcwangName.mixup and newnames.txt are on your classpath. (3) To use Minorthird's createXMLmarkup function instead of this class's own, uncomment the line tagged = labelsLoader.createXMLmarkup(tempFile.getName(), labels); and comment out the line tagged = createXML(in, labelsLoader.saveTypesAsXML(labels)); in the "tag" function.
MixupAnnotator Annotate labels using a mixup program.
MixupFinder Finds spans using a mixup expression evaluated in a fixed labeling.
MutableTextBase  
NestedTextLabels A TextLabels which is defined by two TextLabels's.
POSTagger Adds part of speech tags to a TextLabels.
RegexTokenizer Maintains information about what's in a set of documents.
SampleTextBases Some sample inputs to facilitate testing.
SimpleTextLoader A no options loader.
SpanDifference Compares two sets of spans.
SpanDifference.Invoker  
SpanDifference.Looper A Span.Looper which also passes out two additional types of information about each returned span s: if s is a FALSE_POS, FALSE_NEG, or TRUE_POS, relative to the original spans.
SpanTypeTokenizer This implementation of the Tokenizer interface is used for re-tokenizing documents based on a specified spantype.
SplitTokenizer  
StopWords Builtin stoplist words (from SMART)
StringAnnotator An abstract annotator that is based on marking up substrings within a string, using the CharAnnotation class.
SubSpan A span that is a subset of another span
SummarizeLabels Main routine that loads a TextLabels object and summarizes its properties.
TextBaseLoader Configurable Text Loader.
TextBaseManager Manages the mappings between TextBases.
TextLabelsLoader Loads and saves the contents of a TextLabels into a file.
TextToken Identifies a particular substring of a particular document.
Trie Efficient scheme for matching a rote list of sequences of tokens.
URLAnnotator Annotate substrings that are legal URLs.
 

Package edu.cmu.minorthird.text Description

Storing and manipulating annotated text.

Basic Concepts In this Package

A TextToken is a "token" (usually a single word in a document), plus some additional information that allows one to find out where this token occurred. Specifically, one can recover the string that contained the token, a shorter string identifier of this "document" string, and the character offsets of the token, i.e., where it appeared in the document string.
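As a rough illustration of the information a TextToken carries, the sketch below stores a document identifier, the document string, and a character offset and length. `OffsetTokenSketch` and all of its members are hypothetical names chosen for this example; they are not the Minorthird TextToken API.

```java
// Hypothetical illustration only; not the Minorthird TextToken class.
final class OffsetTokenSketch {
    final String documentId; // short identifier of the document string
    final String document;   // the full document string
    final int lo;            // character offset where the token starts
    final int length;        // number of characters in the token

    OffsetTokenSketch(String documentId, String document, int lo, int length) {
        this.documentId = documentId;
        this.document = document;
        this.lo = lo;
        this.length = length;
    }

    // Recover the token's text from the document string and its offsets.
    String text() {
        return document.substring(lo, lo + length);
    }
}
```

Because the token keeps only offsets into the shared document string, many tokens can refer to one document without copying its text.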

A Span is a sequence of adjacent TextTokens from the same document.

Spans and TextTokens are considered to be inherently ordered. If two Spans or TextTokens are from different documents, they are ordered lexicographically based on the identifiers of those documents. Within a single document, TextTokens are ordered according to their position in the document, and Spans are ordered according to their leftmost TextToken (using the rightmost TextToken to break ties).
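The ordering rules above can be sketched as a comparator. This is a minimal illustration, not Minorthird's implementation: `SketchSpan` and its fields are hypothetical stand-ins for a Span, and the direction of the tie-break (the span ending earlier sorts first) is an assumption the text does not state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a Span: a document id plus the positions of
// its leftmost and rightmost tokens. Not part of the Minorthird API.
final class SketchSpan {
    final String documentId;
    final int loToken; // index of the leftmost token
    final int hiToken; // index of the rightmost token

    SketchSpan(String documentId, int loToken, int hiToken) {
        this.documentId = documentId;
        this.loToken = loToken;
        this.hiToken = hiToken;
    }

    @Override
    public String toString() {
        return documentId + "[" + loToken + ".." + hiToken + "]";
    }
}

final class SpanOrderSketch {
    // Document ids first (lexicographic), then leftmost token, then the
    // rightmost token to break ties (assumed: earlier end sorts first).
    static final Comparator<SketchSpan> ORDER =
        Comparator.comparing((SketchSpan s) -> s.documentId)
                  .thenComparingInt(s -> s.loToken)
                  .thenComparingInt(s -> s.hiToken);

    // Returns a sorted copy and leaves the input list untouched.
    static List<SketchSpan> sorted(List<SketchSpan> spans) {
        List<SketchSpan> copy = new ArrayList<>(spans);
        Collections.sort(copy, ORDER);
        return copy;
    }
}
```

With this comparator, two spans from document "a" that both start at token 2 are separated only by where they end, and every span from "a" precedes every span from "b".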

A TextBase is a collection of tokenized "document" strings, accessible as Spans.

A TextLabels contains markup for a TextBase. This markup can consist of

There are three varieties of TextLabels. A TextLabels can only be read, not modified. A MonotonicTextLabels can be modified by changing attribute values, adding new attribute values, or adding Spans to a type; however, Spans cannot be removed from a type. A MutableTextLabels allows Spans to be removed from a type as well (i.e., it is fully mutable).
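The three levels of access can be pictured as a chain of interfaces, each adding one capability to the one before. The sketch below is an assumption-laden illustration: the interface names, the method names (`hasType`, `addToType`, `removeFromType`), and the use of plain strings for spans are all hypothetical and are not taken from the Minorthird API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Read-only view: assertions can be queried but not changed.
interface ReadOnlyLabelsSketch {
    boolean hasType(String span, String type);
}

// Monotonic view: spans can be added to a type, never removed.
interface MonotonicLabelsSketch extends ReadOnlyLabelsSketch {
    void addToType(String span, String type);
}

// Fully mutable view: spans can also be removed from a type.
interface MutableLabelsSketch extends MonotonicLabelsSketch {
    void removeFromType(String span, String type);
}

// One hypothetical implementation backing all three views.
final class SimpleLabelsSketch implements MutableLabelsSketch {
    private final Map<String, Set<String>> spansByType = new HashMap<>();

    public boolean hasType(String span, String type) {
        return spansByType
            .getOrDefault(type, Collections.<String>emptySet())
            .contains(span);
    }

    public void addToType(String span, String type) {
        spansByType.computeIfAbsent(type, t -> new TreeSet<>()).add(span);
    }

    public void removeFromType(String span, String type) {
        Set<String> spans = spansByType.get(type);
        if (spans != null) {
            spans.remove(span);
        }
    }
}
```

Code that should only add markup can accept a `MonotonicLabelsSketch` parameter, so the compiler rules out removals without any runtime check.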

Annotators and AnnotatorLoaders

Markup in a TextLabels object is usually provided by an Annotator. A sort of subroutine-calling mechanism for Annotators is provided by the textLabels.require call, the textLabels.isAnnotatedBy call, and the AnnotatorLoader mechanism. If one Annotator relies on the output of another (for instance, an NP chunker requires POS tags), it should use the textLabels.require method to make sure that the annotation is present. textLabels.require then uses an AnnotatorLoader to find an Annotator that will produce the required annotation type, using the annotatorLoader.findAnnotator method. Annotators record the fact that they have been run on a textLabels object by using the textLabels.setAnnotatedBy(...) method; this ensures that an Annotator is not run more than once on the same labels.
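The control flow described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the Minorthird implementation: `AnnotatorSketch`, `LoaderSketch`, `LabelsSketch`, and `RequireDemoSketch` are hypothetical types, the method signatures are guesses modeled on the names in the text, and the `log` field exists only to make the execution order visible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical annotator: adds markup to a labels object.
interface AnnotatorSketch {
    void annotate(LabelsSketch labels);
}

// Hypothetical loader: maps an annotation type to an annotator.
interface LoaderSketch {
    AnnotatorSketch findAnnotator(String annotationType);
}

final class LabelsSketch {
    private final Set<String> annotatedBy = new HashSet<>();
    private final LoaderSketch loader;
    final List<String> log = new ArrayList<>(); // order annotators ran in

    LabelsSketch(LoaderSketch loader) {
        this.loader = loader;
    }

    boolean isAnnotatedBy(String type) {
        return annotatedBy.contains(type);
    }

    void setAnnotatedBy(String type) {
        annotatedBy.add(type);
    }

    // Run the annotator for 'type' unless that annotation is already present.
    void require(String type) {
        if (isAnnotatedBy(type)) {
            return;
        }
        AnnotatorSketch annotator = loader.findAnnotator(type);
        if (annotator == null) {
            throw new IllegalStateException("no annotator for " + type);
        }
        annotator.annotate(this);
    }
}

final class RequireDemoSketch {
    static LabelsSketch run() {
        Map<String, AnnotatorSketch> registry = new HashMap<>();
        registry.put("pos", target -> {
            target.log.add("pos");
            target.setAnnotatedBy("pos");
        });
        registry.put("np", target -> {
            target.require("pos"); // precondition: POS tags must exist
            target.log.add("np");
            target.setAnnotatedBy("np");
        });
        LabelsSketch labels = new LabelsSketch(registry::get);
        labels.require("np");
        labels.require("pos"); // already annotated, so nothing runs
        return labels;
    }
}
```

In this model, requiring "np" first triggers the "pos" annotator, and the later explicit request for "pos" does nothing because the labels already record that annotation.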

Taken together, these mechanisms provide something in between a programming language for annotations and a simple planner for constructing annotations. As a planner, each Annotator corresponds to an operator: its preconditions are specified by calls to "require", and its postconditions are specified by calls to "setAnnotatedBy" (or in mixup, by "provide" statements). The AnnotatorLoader corresponds to a backwards-chaining planner, and its decisions about what Annotator to use are how the plan is constructed.

However, the AnnotatorLoader doesn't do anything fancy to find Annotators: in response to a "require" call for label "foo", the AnnotatorLoader looks for a file "foo.mixup" or a Java class named "foo", in that order. So the default behavior is simple enough that it looks more like a programming language, with the AnnotatorLoader being just a binding mechanism.

There are several ways the binding mechanism can be modified.

  1. In the require call, one can specify a filename in addition to a desired label type (in mixup, this is the second argument to the "require" call). This causes this filename to be used instead of the default "foo.mixup" or Java class "foo".
  2. In the annotators.config file (usually located in minorthird/config), one can specify default filenames for a set of label types "foo". These will be used instead of "foo.mixup", unless some other filename is specified.
  3. The rules above rely on low-level routines to find files (like mixup files) and find Java classes. In the DefaultAnnotatorLoader, this is done using the system ClassLoader. One can also specify a non-default AnnotatorLoader in a call to require, which uses different rules to find files.

    The main use of this mechanism is the EncapsulatingAnnotatorLoader, which contains a cache of files and/or Java classes that it will use in preference to anything on the classpath. This is useful if you want to bundle a bunch of Annotators along with a classifier or extractor that uses them.
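The first two rules above amount to a lookup order for the name bound to a required label. The sketch below models one reading of that order; `BindingSketch` and its methods are hypothetical, the `class:` prefix is just a marker for this example, and whether a configured default also suppresses the Java class fallback is an assumption (here it does not, since the text only says it replaces "foo.mixup").

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the binding rules; not the Minorthird loader.
final class BindingSketch {
    // Stands in for the defaults read from annotators.config.
    private final Map<String, String> configDefaults = new HashMap<>();

    void setConfigDefault(String label, String fileName) {
        configDefaults.put(label, fileName);
    }

    // Names to try, in order, for a required label. 'explicitFileName'
    // is the optional filename given in the require call, or null.
    List<String> candidates(String label, String explicitFileName) {
        List<String> out = new ArrayList<>();
        if (explicitFileName != null) {
            // Rule 1: an explicit filename replaces both defaults.
            out.add(explicitFileName);
            return out;
        }
        String configured = configDefaults.get(label);
        // Rule 2: a configured default replaces "<label>.mixup".
        out.add(configured != null ? configured : label + ".mixup");
        // Default fallback: a Java class named after the label.
        out.add("class:" + label);
        return out;
    }
}
```

For a label with no overrides, the candidates are the mixup file followed by the class, which matches the default order described earlier.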

Currently, AnnotatorLoaders are not used for loading Mixup resources like dictionary files, only for loading Annotators.

NestedTextLabels

A NestedTextLabels is an odd sort of implementation of a MonotonicTextLabels. It combines two TextLabels's, an "inner" one and an "outer" one, such that the outer one can be monotonically added to, but the inner one is never modified. Semantically, the markup in a NestedTextLabels is the union of the markup in the inner and outer TextLabels's, except that property values in the outer TextLabels "shadow" values in the inner TextLabels. This has several possible uses, for instance:

  1. One can change a TextLabels and then "back out" the changes by (a) creating a NestedTextLabels with an empty "outer" MonotonicTextLabels, (b) monotonically adding to this new "outer" TextLabels, and then (c) discarding the NestedTextLabels and reverting to the old "inner" TextLabels to undo the modifications.
  2. One can easily construct and view the union of two TextLabels's (or at least, some well-defined approximation of this), while still being able to modify either underlying TextLabels. For instance, one can construct a single TextLabels which contains the output of a MixupProgram, plus some hand-labeled "ground truth" data, while still being able to re-run the program and get new output and/or edit the "ground truth".
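The inner/outer semantics can be sketched with two small stores. This is an illustrative model only: `FlatLabelsSketch` and `NestedLabelsSketch` are hypothetical classes, spans are represented as plain strings, and the method names are assumptions rather than the Minorthird API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical flat store of property values and type assertions.
final class FlatLabelsSketch {
    private final Map<String, String> props = new HashMap<>();
    private final Set<String> typed = new HashSet<>();

    void setProperty(String span, String prop, String value) {
        props.put(span + "|" + prop, value);
    }

    String getProperty(String span, String prop) {
        return props.get(span + "|" + prop);
    }

    void addToType(String span, String type) {
        typed.add(span + "|" + type);
    }

    boolean hasType(String span, String type) {
        return typed.contains(span + "|" + type);
    }
}

// Union of an untouched "inner" store and a writable "outer" store.
final class NestedLabelsSketch {
    private final FlatLabelsSketch inner; // only ever read
    private final FlatLabelsSketch outer = new FlatLabelsSketch();

    NestedLabelsSketch(FlatLabelsSketch inner) {
        this.inner = inner;
    }

    // All writes go to the outer store.
    void setProperty(String span, String prop, String value) {
        outer.setProperty(span, prop, value);
    }

    void addToType(String span, String type) {
        outer.addToType(span, type);
    }

    // Outer property values shadow inner ones.
    String getProperty(String span, String prop) {
        String value = outer.getProperty(span, prop);
        return value != null ? value : inner.getProperty(span, prop);
    }

    // Type assertions are the union of both stores.
    boolean hasType(String span, String type) {
        return outer.hasType(span, type) || inner.hasType(span, type);
    }
}
```

Dropping the `NestedLabelsSketch` instance and continuing with the inner store corresponds to step (c) of the first use case: every change made through the nested view disappears, because none of it was written to the inner store.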