edu.cmu.minorthird.text.mixup
Class Mixup

java.lang.Object
  extended by edu.cmu.minorthird.text.mixup.Mixup
All Implemented Interfaces:
java.io.Serializable

public class Mixup
extends java.lang.Object
implements java.io.Serializable

A simple pattern-matching and information extraction language.

 EXAMPLE:
 ... in('begin') @number? [ any{2,5} in('end') ] ... && [!in('begin')*] && [!in('end')*]

 BNF:
 simplePrim -> [!] simplePrim1
 simplePrim1 -> id | a(DICT) | ai(DICT) | eq(CONST) | eqi(CONST) | re(REGEX)
 | any | ... | PROPERTY:VALUE | PROPERTY:a(DICT)
 prim -> < simplePrim [,simplePrim]* > | simplePrim
 repeatedPrim -> [L] prim [R] repeat | @type | @type?
 repeat -> {int,int} | {,int} | {int,} | {int} | ? | * | +
 pattern -> | repeatedPrim pattern
 basicExpr -> pattern [ pattern ] pattern 
 basicExpr -> (expr)
 expr -> basicExpr "||" expr 
 expr -> basicExpr "&&" expr
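
One way to read the repeat production is as a (min, max) pair of token counts. The following is a minimal sketch, not minorthird code: RepeatSketch, parseRepeat and UNBOUNDED are hypothetical names. The meanings of ?, * and + follow the class description; reading {int} as an exact count, {,int} as 0 to int, and {int,} as unbounded above is an assumption by analogy with regular expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that reads a Mixup repeat operator as a {min, max} pair. */
class RepeatSketch {
    static final int UNBOUNDED = Integer.MAX_VALUE;

    // Matches {int,int}, {,int}, {int,} and {int}; malformed cases are rejected below.
    private static final Pattern BRACES = Pattern.compile("\\{(\\d*)(,?)(\\d*)\\}");

    static int[] parseRepeat(String op) {
        if (op.equals("?")) return new int[] {0, 1};
        if (op.equals("*")) return new int[] {0, UNBOUNDED};
        if (op.equals("+")) return new int[] {1, UNBOUNDED};

        Matcher m = BRACES.matcher(op);
        if (!m.matches()) throw new IllegalArgumentException("not a repeat operator: " + op);
        String lo = m.group(1);
        boolean hasComma = !m.group(2).isEmpty();
        String hi = m.group(3);

        if (!hasComma) {
            // {int}: assumed to mean exactly that many tokens.
            if (lo.isEmpty()) throw new IllegalArgumentException("empty braces: " + op);
            int n = Integer.parseInt(lo);
            return new int[] {n, n};
        }
        if (lo.isEmpty() && hi.isEmpty()) throw new IllegalArgumentException("no bounds: " + op);
        int min = lo.isEmpty() ? 0 : Integer.parseInt(lo);          // {,int}: assumed lower bound 0
        int max = hi.isEmpty() ? UNBOUNDED : Integer.parseInt(hi);  // {int,}: assumed no upper bound
        return new int[] {min, max};
    }
}
```

For instance, parseRepeat("{2,5}") yields the pair (2, 5), the bounds used by any{2,5} in the EXAMPLE above.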

 SEMANTICS:
 basicExpr is a pattern match: like a regex, but it returns all matches, not just the longest one
 token-level tests:
 eq('foo') checks that the token is exactly 'foo'
 'foo' is short for eq('foo')
 re('regex') checks that the token matches the regex
 eqi('foo') checks that the lowercased token is 'foo'
 a(bar) checks that the token is in dictionary 'bar'
 ai(bar) checks that the token is in dictionary 'bar', ignoring case
 color:red checks that the token has property 'color' set to 'red'
 color:a(primaryColor) checks that the value of the token's property 'color' is in the dictionary 'primaryColor'
 !test is the negation of test
 <test1, test2, ...> conjoins token-level tests
 any is true for any token
 token-sequences:
 test? is 0 or 1 tokens matching test
 test+ is 1+ tokens matching test
 test* is 0+ tokens matching test
 test{3,7} is between 3 and 7 tokens matching test
 ... is equivalent to any*
 @foo matches a span of type foo
 @foo? matches a span of type foo or the empty sequence
 L means the sequence can't be extended to the left and still match
 R means the sequence can't be extended to the right and still match
 expr || expr is union
 expr1 && expr2 is piping: generate with expr1, filter with expr2
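
The semantics above can be modeled in a few lines of plain Java. This is a toy sketch under stated assumptions, not the minorthird implementation: MixupSemanticsSketch and all of its methods are hypothetical names, tokens are plain strings, and spans are [start, end) index pairs. It shows token-level tests as predicates, a repeated test that returns every matching span rather than only the longest, and an L/R-style filter that keeps only spans that cannot be extended.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

/** Toy model of the semantics above; not the minorthird implementation. */
class MixupSemanticsSketch {

    // Token-level tests, modeled as predicates over one token.
    static Predicate<String> eq(String s) { return t -> t.equals(s); }
    static Predicate<String> eqi(String s) { return t -> t.toLowerCase(Locale.ROOT).equals(s); }
    static Predicate<String> re(String regex) { return t -> t.matches(regex); }
    static Predicate<String> a(Set<String> dict) { return dict::contains; }
    static Predicate<String> not(Predicate<String> test) { return test.negate(); }

    /** Conjunction (the angle-bracket form in the BNF): the token must pass every test. */
    @SafeVarargs
    static Predicate<String> all(Predicate<String>... tests) {
        return t -> {
            for (Predicate<String> test : tests) if (!test.test(t)) return false;
            return true;
        };
    }

    /** test{min,max}: every [start, end) span of min..max tokens that all pass the test. */
    static List<int[]> allMatches(String[] tokens, Predicate<String> test, int min, int max) {
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start <= tokens.length; start++) {
            for (int end = start; end <= tokens.length && end - start <= max; end++) {
                if (end > start && !test.test(tokens[end - 1])) break; // run of matching tokens ended
                if (end - start >= min) out.add(new int[] {start, end});
            }
        }
        return out;
    }

    /** L ... R: keep spans that cannot grow by one token on either side and still be a match. */
    static List<int[]> maximal(List<int[]> matches) {
        Set<Long> seen = new HashSet<>();
        for (int[] m : matches) seen.add(key(m[0], m[1]));
        List<int[]> out = new ArrayList<>();
        for (int[] m : matches) {
            boolean growsLeft = seen.contains(key(m[0] - 1, m[1]));
            boolean growsRight = seen.contains(key(m[0], m[1] + 1));
            if (!growsLeft && !growsRight) out.add(m);
        }
        return out;
    }

    private static long key(int start, int end) { return ((long) start << 32) | (end & 0xffffffffL); }

    public static void main(String[] args) {
        String[] tokens = {"a", "1", "2", "3", "b"};
        List<int[]> matches = allMatches(tokens, re("\\d+"), 1, 5);
        System.out.println(matches.size() + " matches, " + maximal(matches).size() + " maximal");
    }
}
```

On the five tokens in main, the repeated digit test yields six spans, of which only [1, 4) survives the maximality filter. In the same model, || would concatenate two match lists and && would keep the spans of the first list that the second expression accepts.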
 
The name's an acronym for My Information eXtraction and Understanding Package.

Author:
William Cohen
See Also:
Serialized Form

Nested Class Summary
static class Mixup.MixupTokenizer
           
static class Mixup.ParseException
          Signals an error in parsing a mixup document.
 
Field Summary
static int maxNumberOfMatches
          Without constraints, the maximum number of times a mixup expression can extract something from a document of length N is O(N*N), since any token can be the start or end of an extracted span.
static int maxNumberOfMatchesPerToken
          Without constraints, the maximum number of times a mixup expression can extract something from a document of length N is O(N*N), since any token can be the start or end of an extracted span.
static int minMatchesToApplyConstraints
          Without constraints, the maximum number of times a mixup expression can extract something from a document of length N is O(N*N).
static java.util.regex.Pattern tokenizerPattern
           
 
Constructor Summary
Mixup(Mixup.MixupTokenizer tok)
           
Mixup(java.lang.String pattern)
          Create a new mixup query.
 
Method Summary
 java.util.Iterator<Span> extract(TextLabels labels, java.util.Iterator<Span> spanLooper)
          Extract subspans from each generated span using the mixup expression.
static void main(java.lang.String[] args)
           
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

minMatchesToApplyConstraints

public static int minMatchesToApplyConstraints
Without constraints, the maximum number of times a mixup expression can extract something from a document of length N is O(N*N). The maxNumberOfMatches... variables constrain this behavior, for efficiency. This variable is the threshold after which those constraints take effect.


maxNumberOfMatchesPerToken

public static int maxNumberOfMatchesPerToken
Without constraints, the maximum number of times a mixup expression can extract something from a document of length N is O(N*N), since any token can be the start or end of an extracted span. The maxNumberOfMatchesPerToken value limits this to maxNumberOfMatchesPerToken*N.


maxNumberOfMatches

public static int maxNumberOfMatches
Without constraints, the maximum number of times a mixup expression can extract something from a document of length N is O(N*N), since any token can be the start or end of an extracted span. This field caps the total number of matches at a fixed number.
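
The O(N*N) figure and the per-token bound can be checked with simple arithmetic. The sketch below is illustrative only: MatchBoundSketch and its methods are hypothetical names, and the document length and cap used in main are made-up values, not the library defaults.

```java
/** Arithmetic behind the O(N*N) claim; not minorthird code. */
class MatchBoundSketch {

    /** Non-empty contiguous spans in n tokens, counted by enumerating (start, end) pairs. */
    static long countSpans(int n) {
        long count = 0;
        for (int start = 0; start < n; start++) {
            for (int end = start + 1; end <= n; end++) {
                count++;
            }
        }
        return count;
    }

    /** Closed form of the same count: n(n+1)/2, which grows as O(n*n). */
    static long closedForm(long n) { return n * (n + 1) / 2; }

    /** Bound stated for maxNumberOfMatchesPerToken: perToken * n matches. */
    static long perTokenBound(long n, long perToken) { return perToken * n; }

    public static void main(String[] args) {
        int n = 1000;       // hypothetical document length
        long perToken = 5;  // hypothetical cap, not the library default
        System.out.println(countSpans(n) + " spans unconstrained");
        System.out.println(perTokenBound(n, perToken) + " matches with the per-token cap");
    }
}
```

With these made-up values, a 1000-token document has 500500 candidate spans, while a per-token cap of 5 bounds the result at 5000 matches.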


tokenizerPattern

public static final java.util.regex.Pattern tokenizerPattern


Constructor Detail

Mixup

public Mixup(java.lang.String pattern)
      throws Mixup.ParseException
Create a new mixup query.

Throws:
Mixup.ParseException

Mixup

public Mixup(Mixup.MixupTokenizer tok)
      throws Mixup.ParseException
Throws:
Mixup.ParseException


Method Detail

extract

public java.util.Iterator<Span> extract(TextLabels labels,
                                        java.util.Iterator<Span> spanLooper)
Extract subspans from each generated span using the mixup expression.
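
The shape of this method (one iterator of spans in, one iterator of subspans out) can be sketched without the library. This is a generic model, not the minorthird implementation: ExtractSketch is a hypothetical name, the matcher function stands in for the compiled mixup expression, and lazy evaluation is an assumption, since the documentation does not say whether results are computed eagerly.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Generic model of "extract subspans from each generated span"; not minorthird code. */
class ExtractSketch {

    /** For each span from spanLooper, yields every subspan the matcher finds in it. */
    static <S> Iterator<S> extract(Iterator<S> spanLooper, Function<S, List<S>> matcher) {
        return new Iterator<S>() {
            private Iterator<S> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Advance to the next outer span that produces at least one subspan.
                while (!current.hasNext() && spanLooper.hasNext()) {
                    current = matcher.apply(spanLooper.next()).iterator();
                }
                return current.hasNext();
            }

            @Override
            public S next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }
}
```

In this model an outer span with no matches contributes nothing to the result, and the output preserves the order of the outer spans.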


toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

main

public static void main(java.lang.String[] args)