Class Summary | |
---|---|
Mixup | A simple pattern-matching and information extraction language. |
Mixup.MixupTokenizer | |
MixupInterpreter | |
MixupProgram | Modify a textlabeling using a series of mixup expressions. |
Statement | |
Exception Summary | |
---|---|
Mixup.ParseException | Signals an error in parsing a mixup document. |
Little language for creating annotations for text. The annotations are stored in a TextLabels.
From the ground up: an SPC matches a single token. The basic SPCs are:

- `any` matches any token.
- `eq('foo')` matches the token foo. This can also be abbreviated as `'foo'` (in single quotes).
- `re('regex')` matches any token whose string value matches the given regular expression (from the java.util.regex package). For instance, `re('^\\d+$')` matches any token that is a sequence of digits.
- `a(foo)` matches any token whose string value is in the dictionary named foo. Dictionaries are defined in a TextLabels. For example, `a(weekday)` might match one of 'sun', 'mon', 'tues', ..., 'sat'.
- `foo:bar` matches any token that has been tagged as having the value bar for the property foo. For example, `pos:det` might match a determiner.
An SPC can be negated by prefixing it with an exclamation mark (`!`).
A conjunction of (optionally negated) SPCs can be formed with angle brackets and commas. For instance, `<a(month),!'may'>` might match any of 'jan', 'feb', ..., 'april', 'june', ..., or 'december'.
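The single-token patterns above can be read as predicates over tokens. The following Python sketch illustrates that reading only; it is not the Minorthird implementation. The `Token` class, the `DICTIONARIES` table, and all function names are hypothetical stand-ins, and `regex` assumes that a token matches when the expression is found somewhere in its string value.

```python
import re

class Token:
    """Hypothetical token: a string value plus named properties."""
    def __init__(self, text, **props):
        self.text = text
        self.props = props

# Hypothetical stand-in for the dictionaries defined in a TextLabels.
DICTIONARIES = {"month": {"jan", "feb", "april", "may", "june", "december"}}

def any_token():                       # any
    return lambda tok: True

def eq(value):                         # eq('foo') or 'foo'
    return lambda tok: tok.text == value

def regex(pattern):                    # re('regex'); "found in" is an assumption
    compiled = re.compile(pattern)
    return lambda tok: compiled.search(tok.text) is not None

def a(dict_name):                      # a(foo)
    return lambda tok: tok.text in DICTIONARIES.get(dict_name, set())

def prop(name, value):                 # foo:bar
    return lambda tok: tok.props.get(name) == value

def neg(spc):                          # !spc
    return lambda tok: not spc(tok)

def conj(*spcs):                       # <spc1,spc2,...>
    return lambda tok: all(spc(tok) for spc in spcs)

# Counterpart of <a(month),!'may'>
month_not_may = conj(a("month"), neg(eq("may")))
```

With these definitions, `month_not_may(Token("june"))` is true and `month_not_may(Token("may"))` is false.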
An RPC is formed by appending `*`, `+`, `?`, or `{i,j}` (where i and j are numbers) to an SPC. The RPC `any*` can be abbreviated as `...`.
An RPC matches any sequence of between i and j tokens such that every token in the sequence matches the underlying SPC.
For example:

- `a(name){1,3}` matches any sequence of 1-3 tokens in the 'name' dictionary.
- `<!a(punct),!'and'>*` matches any sequence of tokens that are not in the 'punct' dictionary and are not the token 'and'.
- `pos:noun?` matches a one-token sequence with the 'pos' property set to 'noun', or an empty sequence.
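The length bounds can be made concrete with a small Python sketch. It illustrates the semantics described above and is not Minorthird code: `rpc_spans` is a hypothetical helper, tokens are plain strings, and reading `*`, `+`, and `?` as the bounds (0, unbounded), (1, unbounded), and (0, 1) is an assumption based on their usual regular-expression meaning.

```python
def rpc_spans(tokens, spc, lo, hi=None):
    """Return (start, end) pairs, end exclusive, for every run of lo..hi
    consecutive tokens that all satisfy spc. hi=None means no upper bound."""
    spans = []
    n = len(tokens)
    for start in range(n + 1):
        if lo == 0:
            spans.append((start, start))      # the empty sequence
        end = start
        # Grow the run one token at a time while the SPC keeps matching.
        while end < n and spc(tokens[end]) and (hi is None or end - start < hi):
            end += 1
            if end - start >= lo:
                spans.append((start, end))
    return spans

# Counterpart of a(name){1,3}, with a hypothetical 'name' dictionary.
names = {"john", "q", "smith"}
tokens = ["dr", "john", "q", "smith", "said"]
name_spans = rpc_spans(tokens, lambda t: t in names, 1, 3)
```

Here `name_spans` holds six spans, from the single token `john` at `(1, 2)` up to the full three-token name at `(1, 4)`.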
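An RPC can be preceded by `L` or followed by `R` to require that the match cannot be extended in that direction. For example:

- `pos:adj+` matches any sequence of adjectives, whereas `L pos:adj+` only matches a sequence of adjectives that does NOT have an adjective immediately to the left of it.
- `any{3,5}` matches any sequence of 3-5 tokens, whereas `any{3,5} R` only matches a sequence of 3-5 tokens that can't be extended to the right; in other words, a sequence that is either exactly 5 tokens long, or which ends with the final token of a document.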
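The effect of `L` and `R` can be sketched in Python as a filter over candidate spans. This is an illustration under stated assumptions, not Minorthird's algorithm: `runs`, `cannot_extend`, and `restrict` are hypothetical helpers, and "cannot be extended" is taken to mean that the span is already at its maximum length, sits at the edge of the document, or is next to a token that fails the SPC. The text above spells this out only for its two examples.

```python
def runs(tokens, spc, lo, hi=None):
    """All (start, end) spans, end exclusive, of lo..hi tokens matching spc."""
    n = len(tokens)
    out = []
    for start in range(n + 1):
        for end in range(start, n + 1):
            length = end - start
            if length < lo or (hi is not None and length > hi):
                continue
            if all(spc(tok) for tok in tokens[start:end]):
                out.append((start, end))
    return out

def cannot_extend(tokens, spc, span, hi, side):
    """True if span cannot grow by one token on the given side ('L' or 'R')."""
    start, end = span
    if hi is not None and end - start >= hi:
        return True                        # already at maximum length
    neighbor = start - 1 if side == "L" else end
    if neighbor < 0 or neighbor >= len(tokens):
        return True                        # at the edge of the document
    return not spc(tokens[neighbor])       # the neighbor fails the SPC

def restrict(tokens, spc, lo, hi, side):
    """Spans matched by the RPC that also satisfy the L or R restriction."""
    return [s for s in runs(tokens, spc, lo, hi)
            if cannot_extend(tokens, spc, s, hi, side)]

# Counterpart of L pos:adj+, with tokens reduced to their 'pos' values.
tags = ["det", "adj", "adj", "noun"]
left_maximal = restrict(tags, lambda t: t == "adj", 1, None, "L")

# Counterpart of any{3,5} R on a six-token document.
right_maximal = restrict(list("abcdef"), lambda t: True, 3, 5, "R")
```

In the first case the span `(2, 3)` is dropped because an adjective sits immediately to its left. In the second, every surviving span is either five tokens long or ends at the last token.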
An RPC can also have the form `@foo` or `@foo?`, where foo is a type. The RPC `@foo` matches a span of type 'foo'. The RPC `@foo?` matches either a span of type foo or an empty sequence.
RPCs can be written one after another to match longer token sequences:

- `... ',' 'Ph' '.' 'D'` matches any token sequence ending in ", Ph.D".
- `... '(' !eq(')'){,10} ')' ...` matches any sequence containing a parenthesized expression with at most 10 tokens in it.
Square brackets mark the part of a matched sequence to extract. For example:

- `... a(endOfSent) [ re('^[A-Z]') !a(endOfSent){3,} a(endOfSent)] ...` will extract "sentences" (roughly; really, every sequence of at least three words between things in the endOfSent dictionary).
- `... [any any] ...` will extract all token bi-grams.
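To make the two bracketed examples concrete, here is a hand-written Python equivalent of each. It is a sketch of what the expressions extract, not of how Mixup evaluates them: the function names and the contents of `END_OF_SENT` are assumptions, and tokens are plain strings.

```python
import re

# Hypothetical contents of the endOfSent dictionary.
END_OF_SENT = {".", "?", "!"}

def extract_sentences(tokens):
    """Emulates ... a(endOfSent) [ re('^[A-Z]') !a(endOfSent){3,} a(endOfSent)] ..."""
    found = []
    for i, tok in enumerate(tokens):
        if tok not in END_OF_SENT:            # a(endOfSent) before the bracket
            continue
        start = i + 1
        if start >= len(tokens) or not re.search(r"^[A-Z]", tokens[start]):
            continue                          # re('^[A-Z]') must match here
        end = start + 1
        while end < len(tokens) and tokens[end] not in END_OF_SENT:
            end += 1                          # !a(endOfSent) repeated
        # Need at least 3 such tokens, then a closing a(endOfSent).
        if end < len(tokens) and end - (start + 1) >= 3:
            found.append(tokens[start:end + 1])
    return found

def extract_bigrams(tokens):
    """Emulates ... [any any] ...: every pair of adjacent tokens."""
    return [tokens[i:i + 2] for i in range(len(tokens) - 1)]

text = "Hi . The cat sat down . Go . It was a dark night !".split()
sentences = extract_sentences(text)
```

Note that the opening "Hi ." is not extracted, because nothing from the endOfSent dictionary precedes it, and "Go ." is skipped because it has fewer than three tokens between its first word and its final punctuation.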
The MixupProgram class allows a series of statements to be executed, one after another, in order to modify a text labeling. Most of these statements are based on evaluating Mixup patterns, and then modifying the labels in response to those patterns.
The types of Mixup statements are:

- `defDict D = W1,W2,...,Wk`: adds words W1...Wk to dictionary D. If Wi is in double quotes, then Wi is interpreted as a filename, and all lines from that file are loaded into the dictionary.
- `provide ANNOTATION_TYPE`: puts a marker in the labels indicating that annotations of the given type are present.
- `require ANNOTATION_TYPE,FILE`: checks whether annotations of the given type are present in the current labels. If not, the mixup program in FILE is executed. (FILE might be in single quotes.)
- `defSpanType TYPE SPAN_GENERATOR`: adds all spans generated by the SPAN_GENERATOR to the given TYPE. There are several types of SPAN_GENERATORs:
  - `=T: EXPR` runs the Mixup expression EXPR on every span of type T, and returns all spans extracted by it.
  - `=T- EXPR` runs the Mixup expression EXPR on every span of type T, and returns all spans S in T such that nothing was successfully extracted by EXPR.
  - `=T~ re REGEX,N` runs the Java 1.4 regular expression REGEX on the string associated with each span S in T, and returns the span associated with the N-th group in that REGEX. If the N-th group of the regex matches something that doesn't align with token boundaries, the closest legal token span will be used instead.
- `defSpanProp PROP:VAL SPAN_GENERATOR`: same, but asserts that property PROP has value VAL for all generated spans.
- `defTokenProp PROP:VAL SPAN_GENERATOR`: same, but asserts that property PROP has value VAL for all tokens contained in a generated span.
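The difference between the `=T: EXPR` and `=T- EXPR` span generators can be sketched in a few lines of Python. This is only an illustration of the behavior described above: the function names are hypothetical, spans are modeled as lists of token strings, and `extract_numbers` is a hand-written stand-in for a Mixup expression such as `... [re('^[0-9]+$')] ...`.

```python
import re

def generate_extracted(spans_of_type, extract):
    """'=T: EXPR': every span that extract() pulls out of any span of type T."""
    return [found for span in spans_of_type for found in extract(span)]

def generate_unmatched(spans_of_type, extract):
    """'=T- EXPR': the spans of type T from which extract() pulls out nothing."""
    return [span for span in spans_of_type if not extract(span)]

def extract_numbers(span):
    """Hypothetical stand-in for the expression ... [re('^[0-9]+$')] ..."""
    return [[tok] for tok in span if re.search(r"^[0-9]+$", tok)]

# Two spans of some hypothetical type T.
captions = [["lane", "3", "of", "12"], ["control", "cells"]]

numbers = generate_extracted(captions, extract_numbers)
without_numbers = generate_unmatched(captions, extract_numbers)
```

The first generator yields the extracted sub-spans (`3` and `12`), while the second yields whole spans of type T, here only the caption that contains no number.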
    //=============================================================================
    // Extract phrases about cells from biomedical captions.
    //
    // known current bugs:
    //   need better sentence-starting rules, not using stems
    //     (sentence start should be based on linguistically proper use of ":")
    //   need to discard things with unbalanced parens
    //   undesirable examples:
    //     "in Hela-tet Of f cells" extracts "f cells"
    //     "in contrast cells" extracts "in contrast cells"
    //     "respective cells" extracts "respective cells"
    //=============================================================================

    // words that might start a plural noun phrase about cells
    defDict pluralStart = ,, no, with, within, from, of, the, these, all, in, on, only, for, by, to, other, have, indicate, represent, show, and, or;

    // end of a plural noun phrase about cells - ie, a plural cell-related noun
    defDict pluralEnd = cells,strains,clones;

    // end of a singular noun phrase about cells
    defDict singEnd = cell,strain,clone;

    // start of a singular noun phrase about cells
    defDict singStart = ,, with, from, of, the, in, on, or, a, an, each, to, other, indicate, represent, and, or, per;

    // numbers
    defDict number = one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty;

    // simplify syntax for these, since there's no good way to quote them
    defDict openParen = (;
    defDict closeParen = );

    // 'context' is anything near a cell end. This is used to restrict search
    defSpanType end =: ... [a(pluralEnd)] ... || ... [a(singEnd)] ...;
    defSpanType context =: any+ [ any{15} @end any{2}] ... || [ any{,15} @end any{2}] ... ;

    // the start of a sentence might have a panel label like (a) before it.
    defSpanType sentStart =context: ... ['.' a(openParen) !a(closeParen){1,4} a(closeParen)] ... ;
    defSpanType sentStart =context: ... ['.' ] re('^[A-Z]') ... ;

    // something to ignore (not extract) that precedes a plural noun phrase
    defSpanType ignoredPluralStart =context: ... [stem:a(pluralStart)] ...;
    defSpanType ignoredPluralStart =context: ... [stem:a(pluralStart) a(number) ] ...;
    defSpanType ignoredPluralStart =context: ... [stem:a(pluralStart) re('^[0-9]+$') ] ...;
    defSpanType ignoredPluralStart =context: ... [@sentStart] ...;

    // something to ignore (not extract) that precedes a singular noun phrase
    defSpanType ignoredSingStart =context: ... [stem:a(singStart)] ...;
    defSpanType ignoredSingStart =context: ... [@sentStart] ...;

    // don't allow 'breaks' (commas, periods, etc) in the adjectives that qualify a
    // cell-related noun.
    defDict breakPunct = ,, .;
    defSpanType qualifiers =context: ... [<!a(breakPunct)>{1,8}] ...;

    // finally define noun phrases as start,qualifiers,end
    defSpanType cell =context: ... @ignoredPluralStart [@qualifiers a(pluralEnd)] ... ;
    defSpanType cell =context: ... @ignoredSingStart [@qualifiers a(singEnd)] ... ;

    // other cases seem to be like 'strain XY123' and 'strain XY123-fobar'
    defSpanType cell =context: ... ['strain' re('^[A-Z]+[0-9]+$') '-' any] ... ;
    defSpanType cell =context: ... ['strain' re('^[A-Z]+[0-9]+$') !'-'] ... ;