Mixup - "Little" language for creating annotations for text.
At the simplest level, Mixup is a pattern language for Spans (i.e. token
sequences). Each expression is defined relative to a labeling, called a
{@link edu.cmu.minorthird.text.TextLabels}. From the ground up:
-
A "simple pattern component" (SPC) matches a single token. The SPC's
include:
-
any matches any token.
-
eq('foo') matches the token foo. This can also be abbreviated
as 'foo' (in single quotes).
-
re('regex') matches any token whose string value matches the given
regular expression (from the java.util.regex package). For instance, re('^\\d+$')
matches any token that consists entirely of digits.
-
a(foo) matches any token whose string value is in the dictionary
named foo. Dictionaries are defined in a TextLabels. For example,
a(weekday) might match one of 'sun', 'mon', 'tues', ..., 'sat'.
-
foo:bar matches any token that has been tagged as having the value
bar for the property foo. For example, pos:det
might match a determiner.
-
SPC's can be negated by prefixing them with a bang (!). A conjunction
of (optionally, negated) SPC's can be formed with angle brackets and commas,
for instance: <a(month),!may> might match any of 'jan', 'feb',
..., 'april', 'june', ..., or 'december'.
-
A "repeated pattern component" (RPC) matches a sequence of adjacent
tokens. An RPC is formed by appending one of the regex-like postfix
operators *, +, ?, or {i,j} (where
i and j are numbers) to a SPC. The RPC any*
can be abbreviated as .... An RPC matches any sequence of between
i and j tokens such that every token in the sequence
matches the underlying SPC. For example:
-
a(name){1,3} matches any sequence of 1-3 tokens in the 'name'
dictionary.
-
<!a(punct),!'and'>* matches any sequence of tokens that are
not in the 'punct' dictionary and are not the token 'and'.
-
pos:noun? matches a one-token sequence with the 'pos' property
set to 'noun', or an empty sequence.
-
A "repeated pattern component" (RPC) can also be preceded by the token
'L' or followed by the token 'R'. An RPC modified by an 'L' matches only if
the sequence it corresponds to cannot be extended one token to the left and
still match. An RPC modified by an 'R' is analogous: it matches only if the
sequence cannot be extended one token to the right. For instance:
-
pos:adj+
matches any sequence of adjectives (if that's what 'pos:adj' means).
However, L pos:adj+ only matches a sequence of adjectives that
does NOT have an adjective immediately to the left of it.
-
any{3,5}
matches any sequence of 3-5 tokens. However, any{3,5} R only
matches a sequence of 3-5 tokens that can't be extended to the right---in
other words, a sequence that is either exactly 5 tokens long, or which
ends with the final token of a document.
-
A "repeated pattern component" (RPC) can also be either @foo or
@foo?, where foo is a type. The RPC @foo matches
a span of type 'foo'. The RPC @foo? matches either a span of type
foo or an empty sequence.
-
A "mixup pattern" is a sequence of RPC's concatenated together. A mixup pattern
matches a token sequence if the sequence can be divided into consecutive
subsequences, one per RPC and in the same order, such that each RPC matches
its subsequence. For instance:
-
... ',' 'Ph' '.' 'D' matches any token sequence ending in ", Ph.D".
-
... '(' !eq(')'){,10} ')' ... matches any sequence containing
a parenthesized expression with at most 10 tokens in it.
-
Returning for a moment to the 'L' and 'R' operators, which say that a matched
sequence can't be extended to the left or right: note that "can't be extended"
can be interpreted two ways: either (a) any extension causes that RPC to
fail to match, or (b) any extension causes that RPC to fail to match, or
else causes some other RPC elsewhere in the mixup pattern to fail.
The implementation currently adopts the first interpretation, (a).
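As a rough illustration of the pattern semantics described above, the following Python sketch models an SPC as a predicate on one token and an RPC as an SPC plus repetition bounds. It is not MinorThird code: every function name here is hypothetical, tokens are plain strings, dictionaries are plain sets, and property tests (foo:bar) and span types (@foo) are left out. It also assumes that re searches within the token, as the anchored example re('^\\d+$') suggests. The last function checks right-extension under interpretation (a) only.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

SPC = Callable[[str], bool]            # tests a single token
RPC = Tuple[SPC, int, Optional[int]]   # (spc, min count, max count; None = unbounded)

def any_tok() -> SPC:
    return lambda tok: True

def eq(word: str) -> SPC:
    return lambda tok: tok == word

def regex(pattern: str) -> SPC:
    compiled = re.compile(pattern)
    # Assumption: "find" semantics, so anchors like ^ and $ are meaningful.
    return lambda tok: compiled.search(tok) is not None

def in_dict(words: Iterable[str]) -> SPC:
    members = set(words)
    return lambda tok: tok in members

def neg(spc: SPC) -> SPC:
    return lambda tok: not spc(tok)

def conj(*spcs: SPC) -> SPC:
    return lambda tok: all(spc(tok) for spc in spcs)

def matches(pattern: List[RPC], tokens: List[str]) -> bool:
    """True if the RPCs, taken in order, cover the whole token list."""
    if not pattern:
        return not tokens
    (spc, lo, hi), rest = pattern[0], pattern[1:]
    limit = len(tokens) if hi is None else min(hi, len(tokens))
    for n in range(limit + 1):
        if n > 0 and not spc(tokens[n - 1]):
            break  # a longer run would include this failing token too
        if n >= lo and matches(rest, tokens[n:]):
            return True
    return False

def extendable_right(rpc: RPC, tokens: List[str], start: int, end: int) -> bool:
    """Interpretation (a): could this RPC alone also match tokens[start:end + 1]?

    Assumes tokens[start:end] is already a run matched by the RPC.
    """
    spc, _lo, hi = rpc
    return end < len(tokens) and (hi is None or end - start < hi) and spc(tokens[end])

# Usage: "..." is any*, and a bare SPC is an RPC repeated exactly once.
DOTS: RPC = (any_tok(), 0, None)

def one(spc: SPC) -> RPC:
    return (spc, 1, 1)

phd = [DOTS, one(eq(",")), one(eq("Ph")), one(eq(".")), one(eq("D"))]
print(matches(phd, ["Jane", "Doe", ",", "Ph", ".", "D"]))  # True

not_may = conj(in_dict(["jan", "feb", "may", "june"]), neg(eq("may")))
print(not_may("june"), not_may("may"))  # True False

three_to_five: RPC = (any_tok(), 3, 5)
doc = ["t0", "t1", "t2", "t3", "t4", "t5", "t6"]
print(extendable_right(three_to_five, doc, 0, 4))  # True: only 4 tokens so far
print(extendable_right(three_to_five, doc, 0, 5))  # False: already 5 tokens long
print(extendable_right(three_to_five, doc, 3, 7))  # False: ends at the last token
```

The recursive matcher tries every allowed run length for the first RPC and then matches the remaining RPCs against what is left, which mirrors the "every token matches up with some RPC" description. It is meant to clarify semantics, not to be efficient.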
Extraction with Mixup
Mixup is normally used for extraction, not matching. For extraction, every
Mixup expression should contain matching left and right square brackets.
For each Span that the expression is matched against, and for every
possible way the expression can be matched, the subspan of tokens
matching the RPC's inside the square brackets will be extracted.
For example:
-
... a(endOfSent) [ re('^[A-Z]') !a(endOfSent){3,} a(endOfSent)] ...
will extract rough "sentences": every sequence that follows an endOfSent
token, starts with a capitalized token, continues with at least three tokens
not in the endOfSent dictionary, and ends with an endOfSent token.
-
... [any any] ... will extract all token bi-grams.
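To make "every possible way the expression can be matched" concrete, here is a small Python sketch of bracket extraction. It is only an illustration under stated assumptions, not the MinorThird implementation: the names ends and extract are hypothetical, an SPC is modeled as a predicate on a token string, an RPC as (spc, min, max), and the pattern is supplied already split into the parts before, inside, and after the square brackets.

```python
from typing import Callable, List, Optional, Set, Tuple

SPC = Callable[[str], bool]            # tests a single token
RPC = Tuple[SPC, int, Optional[int]]   # (spc, min count, max count; None = unbounded)

def ends(pattern: List[RPC], tokens: List[str], start: int) -> Set[int]:
    """All positions where `pattern` can stop if it starts matching at `start`."""
    if not pattern:
        return {start}
    (spc, lo, hi), rest = pattern[0], pattern[1:]
    found: Set[int] = set()
    pos = start
    while True:
        count = pos - start
        if count >= lo:
            found |= ends(rest, tokens, pos)
        at_max = hi is not None and count >= hi
        if at_max or pos >= len(tokens) or not spc(tokens[pos]):
            break
        pos += 1
    return found

def extract(left: List[RPC], inside: List[RPC], right: List[RPC],
            tokens: List[str]) -> List[Tuple[int, int]]:
    """Bracketed subspans (start, end) over every way `left [inside] right` matches."""
    spans: Set[Tuple[int, int]] = set()
    for i in ends(left, tokens, 0):
        for j in ends(inside, tokens, i):
            # The part after the brackets must consume the rest of the span.
            if len(tokens) in ends(right, tokens, j):
                spans.add((i, j))
    return sorted(spans)

# Usage: the bigram pattern  ... [any any] ...
ANY: SPC = lambda tok: True
DOTS: RPC = (ANY, 0, None)
ONE_ANY: RPC = (ANY, 1, 1)

tokens = ["a", "b", "c", "d"]
bigrams = extract([DOTS], [ONE_ANY, ONE_ANY], [DOTS], tokens)
print([tokens[i:j] for i, j in bigrams])  # [['a', 'b'], ['b', 'c'], ['c', 'd']]
```

Collecting the results in a set means that a subspan reachable through several different matches is reported once; whether Mixup itself de-duplicates in this way is not stated above, so treat that as a choice of the sketch.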
Mixup Programs
The MixupProgram class allows a series of statements to be executed, one
after another, in order to modify a text labeling. Most of these statements
are based on evaluating Mixup patterns, and then modifying the labels in
response to those patterns.
The types of Mixup statements are:
-
defDict D = W1,W2,...,Wk: adds words W1...Wk to dictionary D.
If Wi is in double quotes, then Wi is interpreted as a filename, and all
lines from that file are loaded in the dictionary.
-
provide ANNOTATION_TYPE: puts a marker in the labels indicating that annotations
of the given type are present.
-
require ANNOTATION_TYPE,FILE: checks whether annotations of the given
type are present in the current labels. If not, the mixup program in FILE
is executed. (FILE may be enclosed in single quotes.)
-
defSpanType TYPE SPAN_GENERATOR: adds all spans generated by the
SPAN_GENERATOR to the given TYPE. There are several types of SPAN_GENERATOR's.
-
=T: EXPR runs the Mixup expression EXPR on every span of type
T, and returns all spans extracted by it.
-
=T- EXPR runs the Mixup expression EXPR on every span of type
T, and returns all spans S in T such that nothing was successfully extracted
by EXPR.
-
=T~ re REGEX,N runs the Java 1.4 regular expression REGEX on the
string associated with each span S in T, and returns the span associated
with the N-th group in that REGEX. If the N-th group of the regex matches
something that doesn't align with token boundaries, the closest legal token
span will be used instead.
-
defSpanProp PROP:VAL SPAN_GENERATOR: like defSpanType, but asserts that property
PROP has value VAL for all generated spans.
-
defTokenProp PROP:VAL SPAN_GENERATOR: like defSpanProp, but asserts that
property PROP has value VAL for all tokens contained in a generated span.
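The effect of these statements on a labeling can be pictured with a toy model. The Python sketch below is an assumption-laden illustration, not the MinorThird TextLabels or MixupProgram API: the class ToyLabels and its method names are hypothetical, a Mixup expression is stood in for by a Python function that returns extracted spans, the program that require would load from FILE is passed as a callable, and the name "top" for the whole-document span is an assumption of the sketch.

```python
from typing import Callable, Dict, List, Set, Tuple

Span = Tuple[int, int]                                # (start, end), end exclusive
Extractor = Callable[[List[str], Span], List[Span]]   # stands in for a Mixup EXPR

class ToyLabels:
    """Hypothetical, simplified labeling over one tokenized document."""

    def __init__(self, tokens: List[str]):
        self.tokens = tokens
        self.dicts: Dict[str, Set[str]] = {}
        # Assumption: "top" holds a single span covering the whole document.
        self.types: Dict[str, Set[Span]] = {"top": {(0, len(tokens))}}
        self.provided: Set[str] = set()

    def def_dict(self, name: str, *words: str) -> None:
        """defDict D = W1,...,Wk: add words to dictionary D."""
        self.dicts.setdefault(name, set()).update(words)

    def provide(self, annotation_type: str) -> None:
        """provide T: record that annotations of type T are present."""
        self.provided.add(annotation_type)

    def require(self, annotation_type: str,
                program: Callable[["ToyLabels"], None]) -> None:
        """require T,FILE: run the program only if T has not been provided."""
        if annotation_type not in self.provided:
            program(self)

    def def_span_type(self, new_type: str, source: str, expr: Extractor,
                      negate: bool = False) -> None:
        """negate=False models '=T: EXPR'; negate=True models '=T- EXPR'."""
        target = self.types.setdefault(new_type, set())
        for span in sorted(self.types.get(source, set())):
            extracted = expr(self.tokens, span)
            if negate:
                if not extracted:
                    target.add(span)   # keep spans where nothing was extracted
            else:
                target.update(extracted)

# Usage: roughly  defDict pluralEnd = cells,strains,clones;
#                 defSpanType end =: ... [a(pluralEnd)] ...;
labels = ToyLabels("the cells and one strain".split())
labels.def_dict("pluralEnd", "cells", "strains", "clones")

def plural_end(tokens: List[str], span: Span) -> List[Span]:
    start, end = span
    return [(i, i + 1) for i in range(start, end)
            if tokens[i] in labels.dicts["pluralEnd"]]

labels.def_span_type("end", "top", plural_end)
print(sorted(labels.types["end"]))  # [(1, 2)]

ran: List[str] = []
labels.require("pos", lambda l: (ran.append("tagger"), l.provide("pos")))
labels.require("pos", lambda l: ran.append("tagger again"))
print(ran)  # ['tagger']
```

The second require call does nothing because the first program called provide, which is the point of pairing the two statements.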
An Example Mixup Program
Here's an extended example.
//=============================================================================
// Extract phrases about cells from biomedical captions.
//
// known current bugs:
// need better sentence-starting rules, not using stems
// (sentence start should be based on linguistically proper use of ":")
// need to discard things with unbalanced parens
// undesirable examples:
// "in Hela-tet Of f cells" extracts "f cells"
// "in contrast cells" extracts "in contrast cells"
// "respective cells" extracts "respective cells"
//=============================================================================
// words that might start a plural noun phrase about cells
defDict pluralStart = ,, no, with, within, from, of, the, these, all, in, on, only, for, by, to, other,
have, indicate, represent, show, and, or;
// end of a plural noun phrase about cells - ie, a plural cell-related noun
defDict pluralEnd = cells,strains,clones;
// end of a singular noun phrase about cells
defDict singEnd = cell,strain,clone;
// start of a singular noun phrase about cells
defDict singStart = ,, with, from, of, the, in, on, or, a, an, each, to, other, indicate, represent,
and, or, per;
// numbers
defDict number = one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve,
thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty;
// simplify syntax for these, since there's no good way to quote them
defDict openParen = (;
defDict closeParen = );
// 'context' is anything near a cell end. This is used to restrict search
defSpanType end =: ... [a(pluralEnd)] ... || ... [a(singEnd)] ...;
defSpanType context =: any+ [ any{15} @end any{2}] ... || [ any{,15} @end any{2}] ... ;
// the start of a sentence might have a panel label like (a) before it.
defSpanType sentStart =context: ... ['.' a(openParen) !a(closeParen){1,4} a(closeParen)] ... ;
defSpanType sentStart =context: ... ['.' ] re('^[A-Z]') ... ;
// something to ignore (not extract) that precedes a plural noun phrase
defSpanType ignoredPluralStart =context: ... [stem:a(pluralStart)] ...;
defSpanType ignoredPluralStart =context: ... [stem:a(pluralStart) a(number) ] ...;
defSpanType ignoredPluralStart =context: ... [stem:a(pluralStart) re('^[0-9]+$') ] ...;
defSpanType ignoredPluralStart =context: ... [@sentStart] ...;
// something to ignore (not extract) that precedes a singular noun phrase
defSpanType ignoredSingStart =context: ... [stem:a(singStart)] ...;
defSpanType ignoredSingStart =context: ... [@sentStart] ...;
// don't allow 'breaks' (commas, periods, etc) in the adjectives that qualify a
// cell-related noun.
defDict breakPunct = ,, .;
defSpanType qualifiers =context:
... [!a(breakPunct){1,8}] ...;
// finally define noun phrases as start,qualifiers,end
defSpanType cell =context: ... @ignoredPluralStart [@qualifiers a(pluralEnd)] ... ;
defSpanType cell =context: ... @ignoredSingStart [@qualifiers a(singEnd)] ... ;
// other cases seem to be like 'strain XY123' and 'strain XY123-fobar'
defSpanType cell =context: ... ['strain' re('^[A-Z]+[0-9]+$') '-' any] ... ;
defSpanType cell =context: ... ['strain' re('^[A-Z]+[0-9]+$') !'-'] ... ;
Last modified: Sun Feb 08 20:31:21 Eastern Standard Time 2004