Package edu.cmu.minorthird.text

Storing and manipulating annotated text.

Interface Summary
Annotator	Something that extends a text labeling with additional annotations.
MonotonicTextLabels	Maintains assertions about 'types' and 'properties' of contiguous Spans of these Seq's.
MutableTextLabels	Maintains assertions about 'types' and 'properties' of Spans.
Span	A series of adjacent Tokens from the same document.
SpanFinder	Finds subspans of document spans.
TextBase	Maintains information about what's in a set of documents.
TextLabels	Access assertions about 'types' and 'properties' of contiguous Spans of these Seq's.
Token	An interned version of a string.
Tokenizer
Trie.ResultIterator	An extension of Span.Looper which also returns the ids associated with a Span.

Class Summary
AbstractAnnotator	Generic implementation of an annotator.
AbstractSpanFinder	Abstract implementation of a SpanFinder.
AbstractTextBase
AnnotatorLoader	Analogous to a ClassLoader, this finds annotators by name, so they can be applied to a set of labels.
BasicSpan	Implements the Span interface.
BasicTextBase	Maintains information about what's in a set of documents.
BasicTextLabels	Maintains assertions about 'types' and 'properties' of contiguous Spans of these TextToken's.
BoneheadStemmer	A very simple stemming algorithm.
BOWClassifierWrapper	Deprecated.
CharAnnotation	Represents a stand-off annotation by character, i.e., offset, length, and type. The stand-off annotation is generally immutable.
CompoundTokenizer
DefaultAnnotatorLoader	Default version of AnnotatorLoader.
Dependencies	Deprecated. Use the require mechanism instead.
Details	Detailed information about assertions in a TextLabels object
Document	This class holds a single text 'document'.
EmptyLabels	An empty text labeling.
EncapsulatedAnnotator	An annotator that 'requires' some type of annotation, but exports only a selected set of spanTypes (maybe all of them) from the annotated documents.
EncapsulatingAnnotatorLoader	AnnotatorLoader which contains locally a list of Annotator definitions, in the form of a list of class files, and/or mixup files.
ExtractAbbrev	The ExtractAbbrev class implements a simple algorithm for extraction of abbreviations and their definitions from biomedical text.
FancyLoader	Configurable method of loading data objects.
FilterTokenizer	This implementation of the Tokenizer interface is used for filtering a text base based on a specified spantype.
LabeledDirectory
MinorTagger	An echo-like server that labels popular entities in an input text using XML tags. Important: (1) Comment out the line require 'pos'; in "lib/mixup/np.mixup", because this server already does POS tagging (otherwise MontyLingua will be invoked more than once). (2) If you encounter a 'rcwangName.mixup' not found error, copy apps/names/lib/rcwangName.mixup and apps/names/lib/newnames.txt to lib/mixup, or make sure rcwangName.mixup and newnames.txt are on your classpath. (3) To use Minorthird's createXMLmarkup function instead of this server's own, uncomment the line tagged = labelsLoader.createXMLmarkup(tempFile.getName(), labels); and comment out the line tagged = createXML(in, labelsLoader.saveTypesAsXML(labels)); in the "tag" function.
MixupAnnotator	Annotate labels using a mixup program.
MixupFinder	Finds spans using a mixup expression evaluated in a fixed labeling.
MutableTextBase
NestedTextLabels	A TextLabels which is defined by two TextLabels's.
POSTagger	Adds part of speech tags to a TextLabels.
RegexTokenizer	Maintains information about what's in a set of documents.
SampleTextBases	Some sample inputs to facilitate testing.
SimpleTextLoader	A no options loader.
SpanDifference	Compares two sets of spans.
SpanDifference.Invoker
SpanDifference.Looper	A Span.Looper which also passes out two additional types of information about each returned span s: if s is a FALSE_POS, FALSE_NEG, or TRUE_POS, relative to the original spans.
SpanTypeTokenizer	This implementation of the Tokenizer interface is used for re-tokenizing documents based on a specified spantype.
SplitTokenizer
StopWords	Builtin stoplist words (from SMART)
StringAnnotator	An abstract annotator that is based on marking up substrings within a string, using the CharAnnotation class.
SubSpan	A span that is a subset of another span
SummarizeLabels	Main routine that loads a TextLabels object and summarizes its properties.
TextBaseLoader	Configurable Text Loader.
TextBaseManager	Manages the mappings between TextBases.
TextLabelsLoader	Loads and saves the contents of a TextLabels into a file.
TextToken	Identifies a particular substring of a particular document.
Trie	Efficient scheme for matching a rote list of sequences of tokens.
URLAnnotator	Annotate substrings that are legal URLs.

Package edu.cmu.minorthird.text Description

Basic Concepts In this Package

A TextToken is a "token" (usually a single word in a document), plus some additional information that allows one to find out where this token occurred. Specifically, one can recover the string that contained the token, a shorter string identifier of this "document" string, and the character offsets of the token, i.e., where it appeared in the document string.
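
The idea of a token that remembers where it came from can be sketched as follows. This is a minimal illustration, not the Minorthird TextToken class: the name SketchToken, its fields, and its text() method are all hypothetical.

```java
/** Hypothetical stand-in for a token that records where it occurred. */
final class SketchToken {
    final String documentId; // short identifier of the document string
    final String document;   // the full string that contains the token
    final int offset;        // character offset of the token's first character
    final int length;        // number of characters in the token

    SketchToken(String documentId, String document, int offset, int length) {
        this.documentId = documentId;
        this.document = document;
        this.offset = offset;
        this.length = length;
    }

    /** Recovers the token's text from the document string and the offsets. */
    String text() {
        return document.substring(offset, offset + length);
    }

    public static void main(String[] args) {
        SketchToken token = new SketchToken("d1", "the quick fox", 4, 5);
        System.out.println(token.documentId + ": " + token.text()); // prints "d1: quick"
    }
}
```

Storing offsets rather than a copy of the word keeps the link back to the surrounding document, which is what stand-off markup relies on.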

A Span is a sequence of adjacent TextTokens from the same document.

Spans and TextTokens are considered to be inherently ordered. If two Spans or TextTokens are from different documents, they are ordered lexicographically based on the identifiers of those documents. Within a single document, TextTokens are ordered according to their position in the document, and Spans are ordered according to their leftmost TextToken (using the rightmost TextToken to break ties).
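
The ordering rule for Spans can be sketched with a comparator. This is a minimal sketch under stated assumptions: SketchSpan and SpanOrderingSketch are hypothetical names, spans are reduced to a document id plus two token indices, and ties are assumed to be broken with the smaller rightmost index first (the text does not say which direction is used).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a span: a document id plus inclusive token positions. */
final class SketchSpan {
    final String documentId;
    final int firstToken; // index of the leftmost token
    final int lastToken;  // index of the rightmost token

    SketchSpan(String documentId, int firstToken, int lastToken) {
        this.documentId = documentId;
        this.firstToken = firstToken;
        this.lastToken = lastToken;
    }

    @Override
    public String toString() {
        return documentId + "[" + firstToken + ".." + lastToken + "]";
    }
}

final class SpanOrderingSketch {
    /** Document id first, then leftmost token, then rightmost token as tie-breaker. */
    static final Comparator<SketchSpan> ORDER =
        Comparator.comparing((SketchSpan s) -> s.documentId)
                  .thenComparingInt(s -> s.firstToken)
                  .thenComparingInt(s -> s.lastToken);

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<SketchSpan> sorted(List<SketchSpan> spans) {
        List<SketchSpan> copy = new ArrayList<>(spans);
        Collections.sort(copy, ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<SketchSpan> spans = Arrays.asList(
            new SketchSpan("doc2", 0, 1),
            new SketchSpan("doc1", 3, 5),
            new SketchSpan("doc1", 3, 4),
            new SketchSpan("doc1", 0, 9));
        // prints [doc1[0..9], doc1[3..4], doc1[3..5], doc2[0..1]]
        System.out.println(sorted(spans));
    }
}
```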

A TextBase is a collection of tokenized "document" strings, accessible as Spans.

A TextLabels contains markup for a TextBase. This markup can consist of

String-valued properties of individual TextTokens (i.e., individual occurrences of words in the TextBase).
String-valued properties of Spans of TextTokens in the TextBase.
Groupings of Spans into "types". A Span can belong to multiple types, and unlike properties, it is possible to quickly find all Spans of a given type in a TextLabels, or find all Spans of a given type in a specific document.

There are several varieties of TextLabels's. A TextLabels can only be read, not modified. A MonotonicTextLabels can be modified by changing attribute values, adding new attribute values, or adding Spans to a type; however, Spans cannot be removed from a type. A MutableTextLabels allows Spans to be removed from a type as well (i.e., it is mutable).
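
The three levels of access (read-only, monotonic, mutable) can be pictured as a chain of interfaces, each adding one capability. The sketch below is illustrative only: the interface and class names are hypothetical, spans are represented as plain strings, and properties are omitted to keep it short.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Read-only view: spans can be looked up by type but not changed. */
interface ReadOnlyLabelsSketch {
    Set<String> spansOfType(String type);
}

/** Monotonic view: spans may be added to a type, never removed. */
interface MonotonicLabelsSketch extends ReadOnlyLabelsSketch {
    void addToType(String span, String type);
}

/** Mutable view: additionally allows removing a span from a type. */
interface MutableLabelsSketch extends MonotonicLabelsSketch {
    boolean removeFromType(String span, String type);
}

/** Simple in-memory implementation, indexed by type for fast lookup. */
final class InMemoryLabelsSketch implements MutableLabelsSketch {
    private final Map<String, Set<String>> spansByType = new HashMap<>();

    public Set<String> spansOfType(String type) {
        Set<String> spans = spansByType.get(type);
        return spans == null ? Collections.<String>emptySet()
                             : Collections.unmodifiableSet(spans);
    }

    public void addToType(String span, String type) {
        spansByType.computeIfAbsent(type, k -> new HashSet<>()).add(span);
    }

    public boolean removeFromType(String span, String type) {
        Set<String> spans = spansByType.get(type);
        return spans != null && spans.remove(span);
    }
}
```

Code that should only add markup can be handed the object typed as MonotonicLabelsSketch, so that removal is not available at compile time.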

Annotators and AnnotatorLoaders

Markup in a TextLabels object is usually provided by an Annotator. A sort of subroutine-calling mechanism for Annotators is provided by the textLabels.require call, the textLabels.isAnnotatedBy call, and the AnnotatorLoader mechanism. If one Annotator relies on the output of another (for instance, an NP chunker requires POS tags), it should use the textLabels.require method to make sure that the annotation is present. textLabels.require then uses an AnnotatorLoader to find an Annotator that will produce the required annotation type, using the annotatorLoader.findAnnotator method. Annotators record the fact that they have been run on a textLabels object by using the textLabels.setAnnotatedBy(...) method; this ensures that Annotators are not run more than once.

Taken together, these mechanisms provide something in between a programming language for annotations and a simple planner for constructing annotations. As a planner, each Annotator corresponds to an operator: its preconditions are specified by calls to "require", and its postconditions are specified by calls to "setAnnotatedBy" (or in mixup, by "provide" statements). The AnnotatorLoader corresponds to a backwards-chaining planner, and its decisions about what Annotator to use are how the plan is constructed.
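
The interplay of require, isAnnotatedBy, setAnnotatedBy, and findAnnotator can be sketched with a toy model. The classes below (SketchAnnotator, SketchLoader, SketchLabels, RequireSketch) are hypothetical and do not reproduce the Minorthird signatures; the runLog field exists only to make the order of execution visible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical annotator: adds markup to the labels it is given. */
interface SketchAnnotator {
    void annotate(SketchLabels labels);
}

/** Hypothetical loader: maps an annotation name to the annotator that provides it. */
final class SketchLoader {
    private final Map<String, SketchAnnotator> byName = new HashMap<>();

    void register(String name, SketchAnnotator annotator) {
        byName.put(name, annotator);
    }

    SketchAnnotator findAnnotator(String name) {
        SketchAnnotator found = byName.get(name);
        if (found == null) {
            throw new IllegalStateException("no annotator provides '" + name + "'");
        }
        return found;
    }
}

/** Hypothetical labels object that remembers which annotations are present. */
final class SketchLabels {
    private final Set<String> annotatedBy = new HashSet<>();
    final List<String> runLog = new ArrayList<>(); // order in which annotators ran

    boolean isAnnotatedBy(String name) { return annotatedBy.contains(name); }

    void setAnnotatedBy(String name) { annotatedBy.add(name); }

    /** Precondition check: run the providing annotator only if needed. */
    void require(String name, SketchLoader loader) {
        if (!isAnnotatedBy(name)) {
            loader.findAnnotator(name).annotate(this);
        }
    }
}

final class RequireSketch {
    /** Builds a loader in which the "np" annotator depends on "pos". */
    static SketchLoader exampleLoader() {
        SketchLoader loader = new SketchLoader();
        loader.register("pos", labels -> {
            labels.runLog.add("pos");
            labels.setAnnotatedBy("pos");   // postcondition
        });
        loader.register("np", labels -> {
            labels.require("pos", loader);  // precondition
            labels.runLog.add("np");
            labels.setAnnotatedBy("np");    // postcondition
        });
        return loader;
    }

    public static void main(String[] args) {
        SketchLoader loader = exampleLoader();
        SketchLabels labels = new SketchLabels();
        labels.require("np", loader);
        labels.require("np", loader); // already annotated, so nothing runs
        System.out.println(labels.runLog); // prints [pos, np]
    }
}
```

Requiring "np" pulls in "pos" first, which is the backward-chaining behavior described above in miniature.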

However, the AnnotatorLoader doesn't do anything fancy to find Annotators: in response to a "require" call for label "foo", the AnnotatorLoader looks for a file "foo.mixup" or a Java class named "foo", in that order. So the default behavior is simple enough that it looks more like a programming language, with the AnnotatorLoader being just a binding mechanism.

There are several ways the binding mechanism can be modified.

In the require call, one can specify a filename in addition to a desired label type (in mixup, this is the second argument to the "require" call). This causes the filename to be used instead of the default "foo.mixup" or Java class "foo".
In the annotators.config file (usually located in minorthird/config), one can specify default filenames for a set of label types "foo". These will be used instead of "foo.mixup", unless some other filename is specified.
The rules above rely on low-level routines to find files (like mixup files) and find Java classes. In the DefaultAnnotatorLoader, this is done using the system ClassLoader. One can also specify a non-default AnnotatorLoader in a call to require, which uses different rules to find files.
The main use of this mechanism is the EncapsulatingAnnotatorLoader, which contains a cache of files and/or Java classes that it will use in preference to anything on the classpath. This is useful if you want to bundle a bunch of Annotators along with a classifier or extractor that uses them.
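
The lookup order described above (explicit filename, then a default from annotators.config, then "foo.mixup", then a Java class named "foo") can be sketched as a single function. This is an assumption-laden illustration: BindingSketch and resolve are hypothetical, files and classes are modeled as plain sets of names rather than looked up through a ClassLoader, and explicit or configured filenames are returned without checking that they exist.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class BindingSketch {
    /**
     * Resolves a required label to a source, trying in order:
     * explicit file, configured default file, "<label>.mixup", class "<label>".
     */
    static Optional<String> resolve(String label,
                                    String explicitFile,
                                    Map<String, String> configDefaults,
                                    Set<String> filesOnPath,
                                    Set<String> classesOnPath) {
        if (explicitFile != null) {
            return Optional.of("file:" + explicitFile);
        }
        String configured = configDefaults.get(label);
        if (configured != null) {
            return Optional.of("file:" + configured);
        }
        String defaultFile = label + ".mixup";
        if (filesOnPath.contains(defaultFile)) {
            return Optional.of("file:" + defaultFile);
        }
        if (classesOnPath.contains(label)) {
            return Optional.of("class:" + label);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("pos", "myPos.mixup"); // hypothetical config entry
        Set<String> files = new HashSet<>(Collections.singleton("np.mixup"));
        Set<String> classes = new HashSet<>(Collections.singleton("date"));

        System.out.println(resolve("np", null, config, files, classes));
        System.out.println(resolve("pos", null, config, files, classes));
        System.out.println(resolve("pos", "other.mixup", config, files, classes));
        System.out.println(resolve("date", null, config, files, classes));
    }
}
```

An encapsulating loader would fit the same shape by consulting its own bundled files and classes before the sets passed in here.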

Currently, AnnotatorLoaders are not used for loading Mixup resources like dictionary files, only for loading Annotators.

NestedTextLabels

A NestedTextLabels is an odd sort of implementation of a MonotonicTextLabels. It combines two TextLabels's, an "inner" one and an "outer" one, such that the outer one can be monotonically added to, but the inner one is never modified. Semantically, the markup in a NestedTextLabels is the union of the markup in the inner and outer TextLabels's, except that property values in the outer TextLabels "shadow" values in the inner TextLabels. This has several possible uses, for instance:

One can change a TextLabels and then "back out" the changes by (a) creating a NestedTextLabels with an empty "outer" MonotonicTextLabels, (b) monotonically adding to this new "outer" TextLabels, and then (c) discarding the NestedTextLabels and reverting to the old "inner" TextLabels to undo the modifications.
One can easily construct and view the union of two TextLabels's (or at least, some well-defined approximation of this), while still being able to modify either underlying TextLabels. For instance, one can construct a single TextLabels which contains the output of a MixupProgram, plus some hand-labeled "ground truth" data, while still being able to re-run the program and get new output and/or edit the "ground truth".
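
The shadowing and union semantics can be sketched with two layers of plain collections. This is a simplified model, not the NestedTextLabels implementation: NestedLabelsSketch is a hypothetical class, spans are strings, there is a single implicit span type, and each span carries at most one property value.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class NestedLabelsSketch {
    private final Map<String, String> innerProps; // inner layer: only ever read
    private final Set<String> innerSpans;
    private final Map<String, String> outerProps = new HashMap<>(); // outer layer: receives all writes
    private final Set<String> outerSpans = new HashSet<>();

    NestedLabelsSketch(Map<String, String> innerProps, Set<String> innerSpans) {
        this.innerProps = innerProps;
        this.innerSpans = innerSpans;
    }

    /** Writes go to the outer layer only. */
    void setProperty(String span, String value) { outerProps.put(span, value); }

    void addSpan(String span) { outerSpans.add(span); }

    /** Outer values shadow inner values for the same span. */
    String getProperty(String span) {
        String shadowing = outerProps.get(span);
        return shadowing != null ? shadowing : innerProps.get(span);
    }

    /** The visible spans are the union of both layers. */
    Set<String> spans() {
        Set<String> union = new HashSet<>(innerSpans);
        union.addAll(outerSpans);
        return union;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("d1:0-1", "NN");
        Set<String> spans = new HashSet<>();
        spans.add("d1:0-1");

        NestedLabelsSketch nested = new NestedLabelsSketch(props, spans);
        nested.setProperty("d1:0-1", "VB");
        nested.addSpan("d1:2-3");

        System.out.println(nested.getProperty("d1:0-1")); // prints VB
        System.out.println(props.get("d1:0-1"));          // prints NN
        System.out.println(nested.spans().size());        // prints 2
    }
}
```

Dropping the nested object and going back to the inner collections undoes every change, which is the "back out" use described above.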