Interface Summary | |
---|---|
Annotator | Something that extends a text labeling with additional annotations. |
MonotonicTextLabels | Maintains assertions about 'types' and 'properties' of contiguous Spans of these Seq's. |
MutableTextLabels | Maintains assertions about 'types' and 'properties' of Spans. |
Span | A series of adjacent Tokens from the same document. |
SpanFinder | Finds subspans of document spans. |
TextBase | Maintains information about what's in a set of documents. |
TextLabels | Access assertions about 'types' and 'properties' of contiguous Spans of these Seq's. |
Token | An interned version of a string. |
Tokenizer | |
Trie.ResultIterator | An extension of Span.Looper which also returns the ids associated with a Span. |
Class Summary | |
---|---|
AbstractAnnotator | Generic implementation of an annotator. |
AbstractSpanFinder | Abstract implementation of a SpanFinder. |
AbstractTextBase | |
AnnotatorLoader | Analogous to a ClassLoader, this finds annotators by name, so they can be applied to a set of labels. |
BasicSpan | Implements the Span interface. |
BasicTextBase | Maintains information about what's in a set of documents. |
BasicTextLabels | Maintains assertions about 'types' and 'properties' of contiguous Spans of these TextToken's. |
BoneheadStemmer | A very simple stemming algorithm. |
BOWClassifierWrapper | Deprecated. |
CharAnnotation | Represents a stand-off annotation by character, i.e., offset, length, and type. The stand-off annotation is generally immutable. |
CompoundTokenizer | |
DefaultAnnotatorLoader | Default version of AnnotatorLoader. |
Dependencies | Deprecated. Use the require mechanism instead. |
Details | Detailed information about assertions in a TextLabels object |
Document | This class holds a single text 'document'. |
EmptyLabels | An empty text labeling. |
EncapsulatedAnnotator | An annotator that 'requires' some type of annotation, but exports only a selected set of spanTypes (maybe all of them) from the annotated documents. |
EncapsulatingAnnotatorLoader | AnnotatorLoader which contains locally a list of Annotator definitions, in the form of a list of class files, and/or mixup files. |
ExtractAbbrev | The ExtractAbbrev class implements a simple algorithm for extraction of abbreviations and their definitions from biomedical text. |
FancyLoader | Configurable method of loading data objects. |
FilterTokenizer | This implementation of the Tokenizer interface is used for filtering a text base based on a specified spantype. |
LabeledDirectory | |
MinorTagger | An echo-like server that labels popular entities in an input text using XML tags. Important usage notes: 1) Comment out the line: require 'pos'; in "lib/mixup/np.mixup", because this server already does POS tagging (otherwise MontyLingua will be invoked more than once). 2) If you encounter a 'rcwangName.mixup' not found error, copy apps/names/lib/rcwangName.mixup and apps/names/lib/newnames.txt to lib/mixup, or make sure rcwangName.mixup and newnames.txt are on your classpath. 3) To use Minorthird's createXMLmarkup function instead of the author's own, uncomment the line: tagged = labelsLoader.createXMLmarkup(tempFile.getName(), labels); and comment out the line: tagged = createXML(in, labelsLoader.saveTypesAsXML(labels)); in the "tag" function. |
MixupAnnotator | Annotate labels using a mixup program. |
MixupFinder | Finds spans using a mixup expression evaluated in a fixed labeling. |
MutableTextBase | |
NestedTextLabels | A TextLabels which is defined by two TextLabels objects. |
POSTagger | Adds part of speech tags to a TextLabels. |
RegexTokenizer | Maintains information about what's in a set of documents. |
SampleTextBases | Some sample inputs to facilitate testing. |
SimpleTextLoader | A no options loader. |
SpanDifference | Compares two sets of spans. |
SpanDifference.Invoker | |
SpanDifference.Looper | A Span.Looper which also passes out two additional types of information about each returned span s: if s is a FALSE_POS, FALSE_NEG, or TRUE_POS, relative to the original spans. |
SpanTypeTokenizer | This implementation of the Tokenizer interface is used for re-tokenizing documents based on a specified spantype. |
SplitTokenizer | |
StopWords | Builtin stoplist words (from SMART) |
StringAnnotator | An abstract annotator that is based on marking up substrings within a string, using the CharAnnotation class. |
SubSpan | A span that is a subset of another span |
SummarizeLabels | Main routine that loads a TextLabels object and summarizes its properties. |
TextBaseLoader | Configurable Text Loader. |
TextBaseManager | Manages the mappings between TextBases. |
TextLabelsLoader | Loads and saves the contents of a TextLabels into a file. |
TextToken | Identifies a particular substring of a particular document. |
Trie | Efficient scheme for matching a rote list of sequences of tokens. |
URLAnnotator | Annotate substrings that are legal URLs. |
Storing and manipulating annotated text.
A TextToken
is a "token" (usually a single word in a
document), plus some additional information that allows one to
find out where this word/token occurred. Specifically, one can
recover the string that contained the token, a shorter string
identifier of this "document" string, and the character
offsets of the token--i.e., where it appeared in the document
string.
A Span
is a sequence of adjacent TextTokens from the
same document.
Spans and TextTokens are considered to be inherently ordered. If two Spans or TextTokens are from different documents, they are ordered lexicographically based on the identifiers of those documents. Within a single document, TextTokens are ordered according to their position in the document, and Spans are ordered according to their leftmost TextToken (using the rightmost TextToken to break ties).
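The ordering described above can be sketched as a comparator. This is a minimal illustration, not the Minorthird implementation: ToySpan and ToySpanOrder are hypothetical stand-ins, token positions are modeled as plain integer indices, and ties on the leftmost token are assumed to be broken by the smaller rightmost position first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a span: a document id plus the positions of its
// leftmost and rightmost tokens. This is not the Minorthird Span interface.
final class ToySpan {
    final String documentId;
    final int leftmost;   // index of the first token in the span
    final int rightmost;  // index of the last token in the span

    ToySpan(String documentId, int leftmost, int rightmost) {
        this.documentId = documentId;
        this.leftmost = leftmost;
        this.rightmost = rightmost;
    }

    @Override
    public String toString() {
        return documentId + "[" + leftmost + ".." + rightmost + "]";
    }
}

final class ToySpanOrder {
    // Document id first (lexicographic), then leftmost token, then rightmost.
    static final Comparator<ToySpan> ORDER = new Comparator<ToySpan>() {
        @Override
        public int compare(ToySpan a, ToySpan b) {
            int byDocument = a.documentId.compareTo(b.documentId);
            if (byDocument != 0) return byDocument;
            int byLeft = Integer.compare(a.leftmost, b.leftmost);
            if (byLeft != 0) return byLeft;
            return Integer.compare(a.rightmost, b.rightmost);
        }
    };

    // Returns a sorted copy, leaving the input list untouched.
    static List<ToySpan> sorted(List<ToySpan> spans) {
        List<ToySpan> copy = new ArrayList<ToySpan>(spans);
        Collections.sort(copy, ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<ToySpan> spans = Arrays.asList(
            new ToySpan("docB", 0, 1),
            new ToySpan("docA", 2, 5),
            new ToySpan("docA", 2, 3),
            new ToySpan("docA", 0, 9));
        System.out.println(sorted(spans));
    }
}
```

Spans from "docA" sort before those from "docB" regardless of position, and the two spans starting at token 2 are separated only by their rightmost token.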
A TextBase
is a collection of tokenized "document"
strings, accessible as Spans.
A TextLabels contains markup for a TextBase. This markup can consist of assertions about the 'types' and 'properties' of Spans. A TextLabels can only be read, not modified.
A MonotonicTextLabels
can be modified by
changing attribute values, adding new attribute values, or adding
Spans to a type; however, Spans cannot be removed from a type. A
MutableTextLabels
also allows Spans to be
removed from a type (i.e., it is fully mutable).
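The three levels of access described above (read-only, monotonic, and fully mutable) can be pictured as a chain of interfaces, each adding capabilities to the previous one. The sketch below is only a toy model: the names ToyLabels, ToyMonotonicLabels, ToyMutableLabels, and ToyBasicLabels, their method names, and the use of plain strings for spans are all assumptions made for illustration, not the package's actual signatures.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Read-only view: queries only.
interface ToyLabels {
    boolean hasType(String span, String type);
}

// Monotonic view: spans may be added to a type but never removed.
interface ToyMonotonicLabels extends ToyLabels {
    void addToType(String span, String type);
}

// Fully mutable view: spans may also be removed from a type.
interface ToyMutableLabels extends ToyMonotonicLabels {
    void removeFromType(String span, String type);
}

// One concrete class can back all three views.
final class ToyBasicLabels implements ToyMutableLabels {
    private final Map<String, Set<String>> spansByType = new HashMap<String, Set<String>>();

    public boolean hasType(String span, String type) {
        Set<String> spans = spansByType.get(type);
        return spans != null && spans.contains(span);
    }

    public void addToType(String span, String type) {
        Set<String> spans = spansByType.get(type);
        if (spans == null) {
            spans = new HashSet<String>();
            spansByType.put(type, spans);
        }
        spans.add(span);
    }

    public void removeFromType(String span, String type) {
        Set<String> spans = spansByType.get(type);
        if (spans != null) spans.remove(span);
    }
}

final class ToyLabelsDemo {
    public static void main(String[] args) {
        ToyBasicLabels labels = new ToyBasicLabels();
        ToyMonotonicLabels addOnly = labels; // cannot call removeFromType through this view
        addOnly.addToType("doc1[0..1]", "name");
        ToyLabels readOnly = labels;         // cannot modify through this view
        System.out.println(readOnly.hasType("doc1[0..1]", "name"));
    }
}
```

Handing a caller the narrowest view it needs lets the compiler enforce the "can be added to, but not removed from" contract.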
Markup in a TextLabels object is usually provided by an Annotator
. A sort of subroutine-calling
mechanism for Annotators is provided by the
textLabels.require
call, the
textLabels.isAnnotatedBy
call, and the AnnotatorLoader
mechanism. If one
Annotator relies on the output of another (for instance, an NP
chunker requires POS tags), it should use the
textLabels.require
method to make sure that the
annotation is present. textLabels.require
then uses
an AnnotatorLoader to find an Annotator that will produce the
required annotation type, using the
annotatorLoader.findAnnotator
method. Annotators
record the fact that they have been run on a textLabels object by
using the textLabels.setAnnotatedBy(...)
method;
this ensures that an Annotator is not run more than once.
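The interplay of require, isAnnotatedBy, setAnnotatedBy, and findAnnotator described above can be sketched with a small self-contained model. Everything here is a simplified stand-in: the Toy classes are hypothetical, their signatures are assumptions rather than the real Minorthird ones, and the "annotation" is just an entry appended to a log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

interface ToyAnnotator {
    void annotate(ToyTextLabels labels);
}

interface ToyAnnotatorLoader {
    // Returns an annotator producing the given annotation type, or null.
    ToyAnnotator findAnnotator(String annotationType);
}

final class ToyTextLabels {
    private final Set<String> annotatedBy = new HashSet<String>();
    private final ToyAnnotatorLoader loader;
    final List<String> log = new ArrayList<String>(); // records which annotators ran

    ToyTextLabels(ToyAnnotatorLoader loader) {
        this.loader = loader;
    }

    boolean isAnnotatedBy(String annotationType) {
        return annotatedBy.contains(annotationType);
    }

    void setAnnotatedBy(String annotationType) {
        annotatedBy.add(annotationType);
    }

    // Ensures the annotation is present, running an annotator only if needed.
    void require(String annotationType) {
        if (isAnnotatedBy(annotationType)) return;
        ToyAnnotator annotator = loader.findAnnotator(annotationType);
        if (annotator == null) {
            throw new IllegalStateException("no annotator for " + annotationType);
        }
        annotator.annotate(this);
    }
}

final class ToyRequireDemo {
    static ToyTextLabels build() {
        final Map<String, ToyAnnotator> registry = new HashMap<String, ToyAnnotator>();
        registry.put("pos", new ToyAnnotator() {
            public void annotate(ToyTextLabels labels) {
                labels.log.add("pos");
                labels.setAnnotatedBy("pos");
            }
        });
        registry.put("np", new ToyAnnotator() {
            public void annotate(ToyTextLabels labels) {
                labels.require("pos"); // the NP chunker depends on POS tags
                labels.log.add("np");
                labels.setAnnotatedBy("np");
            }
        });
        return new ToyTextLabels(new ToyAnnotatorLoader() {
            public ToyAnnotator findAnnotator(String annotationType) {
                return registry.get(annotationType);
            }
        });
    }

    public static void main(String[] args) {
        ToyTextLabels labels = build();
        labels.require("np");
        labels.require("np");  // already annotated: nothing runs
        labels.require("pos"); // already annotated: nothing runs
        System.out.println(labels.log);
    }
}
```

Requiring "np" first triggers "pos" as a dependency, and later require calls are no-ops because each annotator recorded itself with setAnnotatedBy.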
Taken together, these mechanisms provide something in between a programming language for annotations and a simple planner for constructing annotations. As a planner, each Annotator corresponds to an operator: its preconditions are specified by calls to "require", and its postconditions are specified by calls to "setAnnotatedBy" (or in mixup, by "provide" statements). The AnnotatorLoader corresponds to a backwards-chaining planner, and its decisions about which Annotator to use are how the plan is constructed.
However, the AnnotatorLoader doesn't do anything fancy to find Annotators: in response to a "require" call for label "foo", the AnnotatorLoader looks for a file "foo.mixup" or a Java class named "foo", in that order. So the default behavior is simple enough that it looks more like a programming language, with the AnnotatorLoader being just a binding mechanism.
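The default lookup order can be sketched as follows. This is a toy model under stated assumptions: ToyDefaultLookup is a hypothetical class, and the sets of available mixup files and class names stand in for whatever the real loader can actually see.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ToyDefaultLookup {
    // Returns a description of what would be bound to the label, or null if
    // neither a mixup file nor a class with a matching name is available.
    static String resolve(String label, Set<String> mixupFiles, Set<String> classNames) {
        String fileName = label + ".mixup";
        if (mixupFiles.contains(fileName)) return "mixup file " + fileName; // tried first
        if (classNames.contains(label)) return "Java class " + label;       // tried second
        return null;
    }

    public static void main(String[] args) {
        Set<String> files = new HashSet<String>(Arrays.asList("pos.mixup"));
        Set<String> classes = new HashSet<String>(Arrays.asList("pos", "np"));
        System.out.println(resolve("pos", files, classes));  // both exist: the file wins
        System.out.println(resolve("np", files, classes));   // only the class exists
        System.out.println(resolve("date", files, classes)); // nothing matches
    }
}
```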
There are several ways the binding mechanism can be modified.
- In a require call, one can specify a filename in addition to a desired label type (in mixup, this is the second argument to the "require" call). This filename is then used instead of the default "foo.mixup" or Java class "foo".
- In the annotators.config file (usually located in minorthird/config), one can specify default filenames for a set of label types "foo". These are used instead of "foo.mixup", unless some other filename is specified.
- In the DefaultAnnotatorLoader, files are found using the system ClassLoader. One can also specify a non-default AnnotatorLoader in a call to require, which uses different rules to find files.
The main use of this mechanism is the EncapsulatingAnnotatorLoader
, which
contains a cache of files and/or Java classes that it will use in
preference to anything on the classpath. This is useful
if you want to bundle a bunch of Annotators along with
a classifier or extractor that uses them.
Currently, AnnotatorLoaders are not used for loading Mixup resources like dictionary files, only for loading Annotators.
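The filename precedence implied by the first two options above (an explicit filename in the require call, then a default from annotators.config, then "foo.mixup") can be sketched as a single function. ToyBinding and chooseFile are hypothetical names, the config file is modeled as an in-memory map, and the file names in the example are made up; parsing annotators.config and the choice of AnnotatorLoader are not modeled.

```java
import java.util.HashMap;
import java.util.Map;

final class ToyBinding {
    // explicitFile may be null when the require call names only a label type.
    static String chooseFile(String label, String explicitFile,
                             Map<String, String> configDefaults) {
        if (explicitFile != null) return explicitFile;   // named in the require call
        String configured = configDefaults.get(label);
        if (configured != null) return configured;       // default from the config
        return label + ".mixup";                         // built-in default
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<String, String>();
        config.put("pos", "myPos.mixup"); // hypothetical config entry
        System.out.println(chooseFile("pos", "special.mixup", config));
        System.out.println(chooseFile("pos", null, config));
        System.out.println(chooseFile("np", null, config));
    }
}
```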
A NestedTextLabels
is an odd
sort of implementation of a MonotonicTextLabels. It combines two
TextLabels objects, an "inner" one and an "outer" one, such that the
outer one can be monotonically added to, but the inner one is
never modified. Semantically, the markup in a NestedTextLabels is
the union of the markup in the inner and outer TextLabels objects,
except that property values in the outer TextLabels "shadow"
values in the inner TextLabels.
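The union-with-shadowing semantics can be illustrated with a toy model. ToyNestedLabels is a hypothetical class, not the real NestedTextLabels: spans and property keys are plain strings, and typed spans are stored as "span#type" entries purely to keep the sketch short.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ToyNestedLabels {
    private final Map<String, String> innerProperties; // read through, never written
    private final Set<String> innerTyped;              // entries like "span#type"
    private final Map<String, String> outerProperties = new HashMap<String, String>();
    private final Set<String> outerTyped = new HashSet<String>();

    ToyNestedLabels(Map<String, String> innerProperties, Set<String> innerTyped) {
        this.innerProperties = Collections.unmodifiableMap(innerProperties);
        this.innerTyped = Collections.unmodifiableSet(innerTyped);
    }

    // Additions always go to the outer labeling.
    void addToType(String span, String type) {
        outerTyped.add(span + "#" + type);
    }

    // Types are the union of inner and outer.
    boolean hasType(String span, String type) {
        String key = span + "#" + type;
        return outerTyped.contains(key) || innerTyped.contains(key);
    }

    void setProperty(String key, String value) {
        outerProperties.put(key, value);
    }

    // Outer values shadow inner values for the same key.
    String getProperty(String key) {
        if (outerProperties.containsKey(key)) return outerProperties.get(key);
        return innerProperties.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> innerProps = new HashMap<String, String>();
        innerProps.put("s1.color", "red");
        Set<String> innerTyped = new HashSet<String>();
        innerTyped.add("s1#name");

        ToyNestedLabels nested = new ToyNestedLabels(innerProps, innerTyped);
        nested.setProperty("s1.color", "green"); // shadows the inner value
        nested.addToType("s2", "name");          // stored in the outer labeling only

        System.out.println(nested.getProperty("s1.color"));
        System.out.println(nested.hasType("s1", "name") && nested.hasType("s2", "name"));
        System.out.println(innerProps.get("s1.color")); // the inner data is untouched
    }
}
```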
This has several possible uses, for instance:
- Working with the output of a MixupProgram, plus some hand-labeled "ground truth" data, while still being able to re-run the program and get new output and/or edit the "ground truth".