Basic Concepts In this Package
Storing and manipulating annotated text
A {@link edu.cmu.minorthird.text.TextToken} is a "token" (usually a single
word in a document), plus some additional information that allows one to
find out where this word/token occurred. Specifically, one can recover the
string that contained the token, a shorter string identifier of
this "document" string, and the character offsets of the token, i.e., where
it appeared in the document string.
A {@link edu.cmu.minorthird.text.Span} is a sequence of adjacent TextTokens
from the same document.
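As a rough illustration of these two ideas, the sketch below records, for each token, its document identifier, the document string, and its character offsets, and then recovers the text of a run of adjacent tokens. The names TokenSketch, Token, tokenize and spanText are hypothetical and are not part of this package's API; the whitespace tokenizer is also just an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSketch {
    /** Hypothetical stand-in for a token: a word plus where it occurred. */
    static final class Token {
        final String docId;    // short identifier of the document string
        final String document; // the full document string
        final int lo;          // character offset where the token starts
        final int length;      // number of characters in the token

        Token(String docId, String document, int lo, int length) {
            this.docId = docId;
            this.document = document;
            this.lo = lo;
            this.length = length;
        }

        /** The token's own text, recovered from the document and offsets. */
        String text() {
            return document.substring(lo, lo + length);
        }
    }

    /** Splits a document on whitespace, recording each token's offsets. */
    static List<Token> tokenize(String docId, String document) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < document.length()) {
            while (i < document.length() && Character.isWhitespace(document.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < document.length() && !Character.isWhitespace(document.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(docId, document, start, i - start));
            }
        }
        return tokens;
    }

    /** Text covered by `count` adjacent tokens starting at index `first`. */
    static String spanText(List<Token> tokens, int first, int count) {
        Token a = tokens.get(first);
        Token b = tokens.get(first + count - 1);
        return a.document.substring(a.lo, b.lo + b.length);
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("doc1", "the quick brown fox");
        Token t = tokens.get(1);
        System.out.println(t.docId + ": " + t.text() + " @ " + t.lo); // doc1: quick @ 4
        System.out.println(spanText(tokens, 1, 2));                   // quick brown
    }
}
```

Keeping only offsets into the document string, rather than copies of each word, is what makes it possible to get from a token back to its surrounding context.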
Spans and TextTokens are considered to be inherently ordered. If two
Spans or TextTokens are from different documents, they are ordered lexicographically
based on the identifiers of those documents. Within a single document,
TextTokens are ordered according to their position in the document, and Spans
are ordered according to their leftmost TextToken (using the rightmost
TextToken to break ties).
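The ordering just described can be sketched as a three-level comparator. This is only an illustration: SpanOrderSketch and SimpleSpan are hypothetical names, spans are reduced to a document identifier plus first and last token indices, and breaking ties so that the span ending earlier comes first is an assumption, since the text above does not state the direction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SpanOrderSketch {
    /** Hypothetical span: a document id plus its first and last token index. */
    static final class SimpleSpan {
        final String docId;
        final int leftToken;
        final int rightToken;

        SimpleSpan(String docId, int leftToken, int rightToken) {
            this.docId = docId;
            this.leftToken = leftToken;
            this.rightToken = rightToken;
        }

        @Override
        public String toString() {
            return docId + "[" + leftToken + ".." + rightToken + "]";
        }
    }

    /** Document id first, then leftmost token, then rightmost token (assumed ascending). */
    static final Comparator<SimpleSpan> ORDER =
        Comparator.comparing((SimpleSpan s) -> s.docId)
                  .thenComparingInt(s -> s.leftToken)
                  .thenComparingInt(s -> s.rightToken);

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<SimpleSpan> sorted(List<SimpleSpan> spans) {
        List<SimpleSpan> copy = new ArrayList<>(spans);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<SimpleSpan> spans = Arrays.asList(
            new SimpleSpan("docB", 0, 1),
            new SimpleSpan("docA", 2, 5),
            new SimpleSpan("docA", 2, 3),
            new SimpleSpan("docA", 0, 9));
        // prints [docA[0..9], docA[2..3], docA[2..5], docB[0..1]]
        System.out.println(sorted(spans));
    }
}
```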
A {@link edu.cmu.minorthird.text.TextBase} is a collection of tokenized
"document" strings, accessible as Spans.
A {@link edu.cmu.minorthird.text.TextLabels} contains markup
for a {@link edu.cmu.minorthird.text.TextBase}. This markup can consist
of
-
String-valued properties of individual TextTokens (i.e., individual occurrences
of words in the TextBase).
-
String-valued properties of Spans of TextTokens in the TextBase.
-
Groupings of Spans into "types". A Span can belong to multiple types, and
unlike properties, it is possible to quickly find all Spans of a given
type in a TextLabels, or find all Spans of a given type in a specific document.
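One way to picture the difference between properties and types is as two differently indexed tables. The sketch below is a simplified model under stated assumptions, not this package's implementation: LabelsSketch and its methods are hypothetical, and tokens and Spans are represented by plain string keys such as "doc1:0-2". Properties are keyed by the item they describe, so finding every item with a given value would need a full scan; types are indexed by type name and then by document, so both lookups mentioned above are direct.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class LabelsSketch {
    // item key (token or span) -> property name -> value
    private final Map<String, Map<String, String>> properties = new HashMap<>();
    // type name -> document id -> span keys of that type in that document
    private final Map<String, Map<String, Set<String>>> types = new HashMap<>();

    void setProperty(String item, String prop, String value) {
        properties.computeIfAbsent(item, k -> new HashMap<>()).put(prop, value);
    }

    String getProperty(String item, String prop) {
        Map<String, String> props = properties.get(item);
        return props == null ? null : props.get(prop);
    }

    /** A span may be added to any number of types. */
    void addToType(String docId, String span, String type) {
        types.computeIfAbsent(type, k -> new TreeMap<>())
             .computeIfAbsent(docId, k -> new TreeSet<>())
             .add(span);
    }

    /** All spans of a type in one document: two map lookups, no scan. */
    Set<String> spansOfType(String type, String docId) {
        Set<String> result = new TreeSet<>();
        Map<String, Set<String>> byDoc = types.get(type);
        if (byDoc != null && byDoc.get(docId) != null) {
            result.addAll(byDoc.get(docId));
        }
        return result;
    }

    /** All spans of a type across every document. */
    Set<String> spansOfType(String type) {
        Set<String> result = new TreeSet<>();
        Map<String, Set<String>> byDoc = types.get(type);
        if (byDoc != null) {
            for (Set<String> spans : byDoc.values()) {
                result.addAll(spans);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LabelsSketch labels = new LabelsSketch();
        labels.setProperty("doc1:token3", "partOfSpeech", "noun");
        labels.addToType("doc1", "doc1:0-2", "name");
        labels.addToType("doc1", "doc1:0-2", "person"); // same span, second type
        labels.addToType("doc2", "doc2:4-5", "name");
        System.out.println(labels.spansOfType("name"));         // [doc1:0-2, doc2:4-5]
        System.out.println(labels.spansOfType("name", "doc2")); // [doc2:4-5]
    }
}
```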
There are several varieties of TextLabels. A {@link edu.cmu.minorthird.text.TextLabels}
can only be read, not modified. A {@link edu.cmu.minorthird.text.MonotonicTextLabels}
can be modified by changing property values, adding new property values,
or adding Spans to a type; however, Spans cannot be removed from a type.
A {@link edu.cmu.minorthird.text.MutableTextLabels} also allows Spans to
be removed from a type (i.e., it is fully mutable).
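The three levels of access described above (read-only, add-only, and fully mutable) can be modeled as a chain of interfaces, each extending the previous one. The sketch below is illustrative only: the interface and class names are hypothetical and deliberately differ from the real ones in this package, and Spans are again reduced to string keys.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LabelLevelsSketch {
    /** Read-only view: queries only. */
    interface ReadableLabels {
        boolean hasType(String span, String type);
    }

    /** Add-only ("monotonic") view: may grow but never shrink. */
    interface GrowableLabels extends ReadableLabels {
        void addToType(String span, String type);
    }

    /** Fully mutable view: spans can also be removed from a type. */
    interface EditableLabels extends GrowableLabels {
        void removeFromType(String span, String type);
    }

    /** One implementation can be handed out under any of the three views. */
    static final class SetBackedLabels implements EditableLabels {
        private final Map<String, Set<String>> spansByType = new HashMap<>();

        @Override
        public boolean hasType(String span, String type) {
            Set<String> spans = spansByType.get(type);
            return spans != null && spans.contains(span);
        }

        @Override
        public void addToType(String span, String type) {
            spansByType.computeIfAbsent(type, k -> new HashSet<>()).add(span);
        }

        @Override
        public void removeFromType(String span, String type) {
            Set<String> spans = spansByType.get(type);
            if (spans != null) {
                spans.remove(span);
            }
        }
    }

    public static void main(String[] args) {
        EditableLabels editable = new SetBackedLabels();
        GrowableLabels growable = editable;  // same object, narrower view
        ReadableLabels readable = editable;  // narrower still

        growable.addToType("doc1:0-2", "name");
        System.out.println(readable.hasType("doc1:0-2", "name")); // true

        // growable.removeFromType(...) would not compile: the add-only view
        // simply has no such method.
        editable.removeFromType("doc1:0-2", "name");
        System.out.println(readable.hasType("doc1:0-2", "name")); // false
    }
}
```

With this layout, code that should only read or only add markup can be given the narrower interface, and the compiler enforces the restriction.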
A {@link edu.cmu.minorthird.text.NestedTextLabels} is an unusual
implementation of MonotonicTextLabels. It combines two TextLabels,
an "inner" one and an "outer" one, such that the outer one can be monotonically
added to, but the inner one is never modified. Semantically, the markup
in a NestedTextLabels is the union of the markup in the inner and outer
TextLabels, except that property values in the outer TextLabels "shadow"
values in the inner TextLabels. This has several possible uses, for instance:
-
One can change a TextLabels and then "back out" the changes by (a)
creating a NestedTextLabels with an empty "outer" MonotonicTextLabels, (b)
monotonically adding to this new "outer" TextLabels, and then (c) discarding
the NestedTextLabels and reverting to the old "inner" TextLabels to undo
the modifications.
-
One can easily construct and view the union of two TextLabels (or at
least, some well-defined approximation of this), while still being able
to modify either underlying TextLabels. For instance, one can construct
a single TextLabels which contains the output of a {@link edu.cmu.minorthird.text.mixup.MixupProgram},
plus some hand-labeled "ground truth" data, while still being able to re-run
the program and get new output and/or edit the "ground truth".
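The inner/outer semantics and the "back out" use can be sketched with two layers of plain maps. This is a minimal model under stated assumptions, not the NestedTextLabels implementation: NestedSketch and Nested are hypothetical names, each item carries a single property value, and Spans are string keys. Writes go only to the outer layer, property reads consult the outer layer first (shadowing), and type queries return the union of both layers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NestedSketch {
    static final class Nested {
        private final Map<String, String> innerProps;        // never modified here
        private final Map<String, Set<String>> innerTypes;   // never modified here
        private final Map<String, String> outerProps = new HashMap<>();
        private final Map<String, Set<String>> outerTypes = new HashMap<>();

        Nested(Map<String, String> innerProps, Map<String, Set<String>> innerTypes) {
            this.innerProps = innerProps;
            this.innerTypes = innerTypes;
        }

        /** Writes always land in the outer layer. */
        void setProperty(String item, String value) {
            outerProps.put(item, value);
        }

        /** Outer values shadow inner values for the same item. */
        String getProperty(String item) {
            return outerProps.containsKey(item) ? outerProps.get(item) : innerProps.get(item);
        }

        void addToType(String span, String type) {
            outerTypes.computeIfAbsent(type, k -> new TreeSet<>()).add(span);
        }

        /** Spans of a type are the union of the inner and outer layers. */
        Set<String> spansOfType(String type) {
            Set<String> all = new TreeSet<>();
            if (innerTypes.get(type) != null) {
                all.addAll(innerTypes.get(type));
            }
            if (outerTypes.get(type) != null) {
                all.addAll(outerTypes.get(type));
            }
            return all;
        }
    }

    public static void main(String[] args) {
        // The "inner" labels, e.g. hand-labeled data we want to protect.
        Map<String, String> innerProps = new HashMap<>();
        innerProps.put("doc1:token3", "noun");
        Map<String, Set<String>> innerTypes = new HashMap<>();
        innerTypes.computeIfAbsent("name", k -> new TreeSet<>()).add("doc1:0-2");

        // (a) wrap with an empty outer layer, (b) add to it monotonically.
        Nested nested = new Nested(innerProps, innerTypes);
        nested.setProperty("doc1:token3", "verb");
        nested.addToType("doc1:5-6", "name");

        System.out.println(nested.getProperty("doc1:token3")); // verb
        System.out.println(nested.spansOfType("name"));        // [doc1:0-2, doc1:5-6]

        // (c) discard `nested`; the inner labels are exactly as they were.
        System.out.println(innerProps.get("doc1:token3"));     // noun
        System.out.println(innerTypes.get("name"));            // [doc1:0-2]
    }
}
```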