Basic Concepts in This Package

Storing and manipulating annotated text

A {@link edu.cmu.minorthird.text.TextToken} is a "token" (usually a single word in a document), plus some additional information that allows one to find out where this token occurred. Specifically, one can recover the string that contained the token, a shorter string identifier for this "document" string, and the character offsets of the token (i.e., where it appeared in the document string).
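To make the idea of a token that remembers its origin concrete, here is a minimal sketch. It is not the MinorThird TextToken class: the class name TokenSketch, its fields, and the method asString are all hypothetical, chosen only to illustrate the three pieces of information described above (document string, document identifier, character offsets).

```java
// Hypothetical illustration only; this is NOT the MinorThird TextToken API.
public class TokenSketch {
    final String documentId; // short identifier of the "document" string
    final String document;   // the full string that contains the token
    final int lo;            // character offset where the token starts
    final int length;        // number of characters in the token

    TokenSketch(String documentId, String document, int lo, int length) {
        this.documentId = documentId;
        this.document = document;
        this.lo = lo;
        this.length = length;
    }

    // Recover the token text from the document string and the offsets.
    String asString() {
        return document.substring(lo, lo + length);
    }

    public static void main(String[] args) {
        String doc = "William Cohen teaches at CMU";
        TokenSketch token = new TokenSketch("doc1", doc, 8, 5);
        // Prints the token text, its start offset, and its document identifier.
        System.out.println(token.asString() + " @ " + token.lo + " in " + token.documentId);
    }
}
```

Storing offsets rather than a copy of the text is what lets a token be traced back to its exact position in the document.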

A {@link edu.cmu.minorthird.text.Span} is a sequence of adjacent TextTokens from the same document.

Spans and TextTokens are considered to be inherently ordered. If two Spans or TextTokens are from different documents, they are ordered lexicographically based on the identifiers of those documents. Within a single document, TextTokens are ordered according to their position in the document, and Spans are ordered according to their leftmost TextToken (using the rightmost TextToken to break ties).
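The ordering rule for Spans can be sketched as a three-step comparison. The class below is a hypothetical stand-in, not the MinorThird Span class: it represents a span only by a document identifier and the indices of its leftmost and rightmost tokens. The direction of the tie-break (a span ending earlier sorts first) is an assumption; the text above says only that the rightmost token breaks ties.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this is NOT the MinorThird Span API.
public class SpanOrderSketch implements Comparable<SpanOrderSketch> {
    final String documentId; // identifier of the containing document
    final int firstToken;    // index of the leftmost token
    final int lastToken;     // index of the rightmost token

    SpanOrderSketch(String documentId, int firstToken, int lastToken) {
        this.documentId = documentId;
        this.firstToken = firstToken;
        this.lastToken = lastToken;
    }

    @Override
    public int compareTo(SpanOrderSketch other) {
        // 1. Different documents: lexicographic order of document identifiers.
        int byDocument = documentId.compareTo(other.documentId);
        if (byDocument != 0) return byDocument;
        // 2. Same document: order by leftmost token.
        int byLeft = Integer.compare(firstToken, other.firstToken);
        if (byLeft != 0) return byLeft;
        // 3. Tie on leftmost token: rightmost token decides (assumed ascending).
        return Integer.compare(lastToken, other.lastToken);
    }

    @Override
    public String toString() {
        return documentId + "[" + firstToken + ".." + lastToken + "]";
    }

    public static void main(String[] args) {
        List<SpanOrderSketch> spans = new ArrayList<>();
        spans.add(new SpanOrderSketch("docB", 0, 1));
        spans.add(new SpanOrderSketch("docA", 2, 4));
        spans.add(new SpanOrderSketch("docA", 2, 3));
        spans.add(new SpanOrderSketch("docA", 0, 5));
        Collections.sort(spans);
        // Prints [docA[0..5], docA[2..3], docA[2..4], docB[0..1]]
        System.out.println(spans);
    }
}
```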

A {@link edu.cmu.minorthird.text.TextBase} is a collection of tokenized "document" strings, accessible as Spans.

A {@link edu.cmu.minorthird.text.TextLabels} contains markup for a {@link edu.cmu.minorthird.text.TextBase}. This markup can consist of

There are a few different varieties of TextLabels. A plain {@link edu.cmu.minorthird.text.TextLabels} can only be read, not modified. A {@link edu.cmu.minorthird.text.MonotonicTextLabels} can be modified by changing attribute values, adding new attribute values, or adding Spans to a type; however, Spans cannot be removed from a type. A {@link edu.cmu.minorthird.text.MutableTextLabels} allows Spans to be removed from a type as well (i.e., it is fully mutable).
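The three levels of modifiability (read-only, add-only, and fully mutable) map naturally onto a chain of interfaces, each extending the previous one. The sketch below shows that layering under simplifying assumptions: the interface names, the method names, and the use of plain strings in place of Spans are all hypothetical and are not taken from the MinorThird API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; names and signatures are NOT the MinorThird API.
public class LabelLevelsSketch {
    // Level 1: read-only access.
    interface ReadableLabels {
        boolean hasType(String span, String type);
    }

    // Level 2: spans may be added to a type, but never removed.
    interface MonotonicLabels extends ReadableLabels {
        void addToType(String span, String type);
    }

    // Level 3: spans may also be removed from a type.
    interface MutableLabels extends MonotonicLabels {
        void removeFromType(String span, String type);
    }

    static class SimpleLabels implements MutableLabels {
        private final Map<String, Set<String>> spansByType = new HashMap<>();

        public boolean hasType(String span, String type) {
            Set<String> spans = spansByType.get(type);
            return spans != null && spans.contains(span);
        }

        public void addToType(String span, String type) {
            spansByType.computeIfAbsent(type, k -> new HashSet<>()).add(span);
        }

        public void removeFromType(String span, String type) {
            Set<String> spans = spansByType.get(type);
            if (spans != null) spans.remove(span);
        }
    }

    public static void main(String[] args) {
        SimpleLabels labels = new SimpleLabels();
        MonotonicLabels addOnly = labels;  // callers holding this view cannot remove
        ReadableLabels readOnly = labels;  // callers holding this view cannot modify
        addOnly.addToType("doc1[0..1]", "name");
        System.out.println(readOnly.hasType("doc1[0..1]", "name")); // prints true
        labels.removeFromType("doc1[0..1]", "name");
        System.out.println(readOnly.hasType("doc1[0..1]", "name")); // prints false
    }
}
```

Handing out the narrowest interface a caller needs lets the compiler enforce, for example, that a component which should only add markup cannot delete any.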

A {@link edu.cmu.minorthird.text.NestedTextLabels} is an odd sort of implementation of a MonotonicTextLabels. It combines two TextLabels, an "inner" one and an "outer" one, such that the outer one can be monotonically added to, but the inner one is never modified. Semantically, the markup in a NestedTextLabels is the union of the markup in the inner and outer TextLabels, except that property values in the outer TextLabels "shadow" values in the inner TextLabels. This has several possible uses, for instance:

  1. One can change a TextLabels and then "back out" the changes by (a) creating a NestedTextLabels with an empty "outer" MonotonicTextLabels, (b) monotonically adding to this new "outer" TextLabels, and then (c) discarding the NestedTextLabels and reverting to the old "inner" TextLabels to undo the modifications.
  2. One can easily construct and view the union of two TextLabels (or at least, some well-defined approximation of this), while still being able to modify either underlying TextLabels. For instance, one can construct a single TextLabels which contains the output of a {@link edu.cmu.minorthird.text.mixup.MixupProgram}, plus some hand-labeled "ground truth" data, while still being able to re-run the program and get new output and/or edit the "ground truth".
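The shadowing and "back out" behaviour described above can be illustrated with two maps standing in for the inner and outer property stores. This is a sketch of the idea only: the class NestedLabelsSketch, its methods, and the string-keyed properties are hypothetical and do not reflect how NestedTextLabels is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT the MinorThird NestedTextLabels API.
public class NestedLabelsSketch {
    private final Map<String, String> inner;                   // never written through this view
    private final Map<String, String> outer = new HashMap<>(); // receives all additions

    NestedLabelsSketch(Map<String, String> inner) {
        this.inner = inner;
    }

    // All writes go to the outer layer, so the inner layer stays untouched.
    void setProperty(String key, String value) {
        outer.put(key, value);
    }

    // Reads see the union of both layers; outer values shadow inner ones.
    String getProperty(String key) {
        return outer.containsKey(key) ? outer.get(key) : inner.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("doc1:status", "draft");

        NestedLabelsSketch view = new NestedLabelsSketch(base);
        view.setProperty("doc1:status", "reviewed"); // shadows the inner value
        view.setProperty("doc2:status", "new");      // exists only in the outer layer

        System.out.println(view.getProperty("doc1:status")); // prints reviewed
        System.out.println(view.getProperty("doc2:status")); // prints new
        // "Backing out" means discarding the view: the base is unchanged.
        System.out.println(base.get("doc1:status"));         // prints draft
    }
}
```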