edu.cmu.minorthird.text.mixup
Class MixupProgram

java.lang.Object
  extended by edu.cmu.minorthird.text.mixup.MixupProgram
All Implemented Interfaces:
java.io.Serializable

public class MixupProgram
extends java.lang.Object
implements java.io.Serializable

Modify a text labeling using a series of Mixup expressions.

 BNF:
 STATEMENT -> declareSpanType TYPE
 STATEMENT -> provide ID
 STATEMENT -> require ID [,FILE]
 STATEMENT -> annotateWith FILE
 STATEMENT -> defDict [+case] NAME = ID, ... , ID
 STATEMENT -> defTokenProp PROP:VALUE = GEN
 STATEMENT -> defSpanProp PROP:VALUE = GEN
 STATEMENT -> defSpanType TYPE2 = GEN
 STATEMENT -> defLevel NAME = LEVELDEF
 STATEMENT -> onLevel NAME
 STATEMENT -> offLevel NAME
 STATEMENT -> importFromLevel NAME TYPE = TYPE

 LEVELDEF -> filter TYPE
 LEVELDEF -> pseudotoken TYPE
 LEVELDEF -> split TOKEN
 LEVELDEF -> re 'REGEX'

 GEN -> [TYPE]: MIXUP-EXPR
 GEN -> [TYPE]- MIXUP-EXPR
 GEN -> [TYPE]~ re 'REGEX',NUMBER
 GEN -> [TYPE]~ trie phrase1, phrase2, ... ;

 statements are semicolon-separated
 // and comments look like this (C++ style)

 SEMANTICS:
 execute each command in order, saving spans/tokens as types, and asserting properties
 '=:' can be replaced with '=TYPE:', in which case the expr will be applied to
 each span of the given type, rather than all top-level spans

 defDict FOO = bar,baz,bat stores a lowercase version of each word in the dictionary
 defDict +case FOO = blah,Bar,baZ stores each word in the dictionary, preserving case

 in dictionaries and tries, a double-quoted word "foo.txt" means to
 find foo.txt on the classpath and store all lines from the file as
 words (after trimming them).

 TYPE: MIXUP-EXPR finds all spans inside a span of type TYPE that match the expression
 TYPE- MIXUP-EXPR finds all spans inside a span of type TYPE that do not contain anything matching MIXUP-EXPR

 

Mixup is a matching language for modifying TextLabels. It can label spans with a given TYPE (the new label for that token span) and assign properties to spans (much like labels, but 'invisible'). More documentation for Mixup programs is in the package-level documents for Mixup.

Briefly, a Mixup program will look something like this:

 require "req1"; //requires that "req1" type spans have already been labeled.  If not, the default annotator
 //for "req1" will be used.
 require "req2", "req2.mixup"; 
 //file 'req2.mixup' will be run to provide "req2" labels if they are not already there
 //if "req2" labels were already generated by a different annotator, they will be used
 //and 'req2.mixup' won't be called.
 provide "xyz"; //this program will annotate the text with "xyz" labels
 defDict titleWord = mr, ms, mrs, dr; 
 //defines a dictionary (with scope of this program execution) called 'titleWord'
 //containing the values "mr", "ms", "mrs", "dr" 
 defDict myDictionary = "dictionary.txt"; 
 //defines a dictionary called 'myDictionary' with values taken from the file "dictionary.txt"
 defTokenProp title:true =: ... [ai(titleWord)] ... ; //finds all spans matching a word in the dictionary titleWord
 //those spans are given the property "title" with value "true" (a string, not boolean)
 //if the span previously had the "title" property with a different value, that is replaced
 // the "..." before and after indicate that it doesn't matter what comes before or after the token
 //to be labeled.  if I said "=: [ai(titleWord)];" the document would need to be JUST a titleword.
 defTokenProp titlePunc:1 =: ... title:true [','] ... || ... title:true ['.'] ... ;
 //spans "." or "," preceded by a title are given the property titlePunc with value "1"
 //note that the entire '... title:true [','] ...' is an expression; or operators ("||") must be
 // between expressions, not within them
 defSpanType fullTitle =: ...[title:true titlePunc:1?R] ...;
 //label a span as "fullTitle" if there is a title span optionally followed by a titlePunc span
 //but not more than one (from the R)
 defSpanType the =: ... [eqi('the')] ...; 
 //labels occurrences of "the" ignoring case (eq = equals, adding i ignores case)
 defTokenProp aProp:t =: ...[] ...; 
 //tokens which have the title=true property AND are labeled as req1
 //are given the property aProp=t
 defTokenProp address:x =: ... [@fullTitle any] !a(myDictionary) ...; 
 //label spans of one 'fullTitle' (the @ is needed
 //before types) and the following token, whatever it is, 
 // which are followed by something other than a myDictionary word
 defTokenProp capProp:on =req2: ... [re('^[A-Z]$')] ...; 
 //on spans of type req2, match tokens fitting the given regular expression
 defSpanType listSet =: ... [address+R] ...; 
 //label as listSet spans of 1 or more address tokens, going all the way to the
 //rightmost possible token - example: blah address1 address2 address3 blah 
 // - will return three spans: "address3", "address2 address3", and "address1 address2 address3"
 defSpanType adList =: ... [L address+ R] ...; //as above but only returns the longest span
 defSpanType header =: [L address* R] ...; 
 //label longest span of 0 or more address tokens at the beginning of the document
 defSpanType shortList =: ... [address{2,3}] ...; //label spans of 2 or 3 address tokens
 defSpanType xyz =header: ...[capProp] ...; //providing the promised xyz labeling
 //creates a new level where each document is a span with spanType
 defLevel newLevel = filter spanType;
 //creates a new level where tokens of spanType are combined into a single token
 defLevel newLevel = pseudotoken spanType;
 //creates a new level where the textBase is retokenized by splitting at a certain token
 defLevel newLevel = split '.';
 //create a new level where the textBase is retokenized using a regular expression
 defLevel newLevel = re '([^\n]+)';
 //switches the current textBase and labels to the named level
 onLevel levelName;
 //returns to root (or original) level - levelName is the name of the child level which you are switching off
 offLevel childLevelName;
 //Imports spans of Type in the child level to spans of newType in the parent level
 importFromLevel childLevelName newType = type;
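
The comments on listSet and adList above describe which spans each repetition pattern returns. The following plain-Java sketch reproduces those described results for a sequence of tokens. It is an illustration under assumptions, not the Mixup matcher: the class RepetitionSketch, its method names, and the boolean array marking which tokens carry the address property are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the span results described above; not part of Minorthird. */
public class RepetitionSketch {

    /** Index of the last token in the run of matching tokens that begins at start. */
    private static int runEnd(boolean[] matches, int start) {
        int end = start;
        while (end + 1 < matches.length && matches[end + 1]) {
            end++;
        }
        return end;
    }

    /** Joins tokens[start..end] (inclusive) with single spaces. */
    private static String text(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end + 1));
    }

    /** Models [address+R]: from every matching start, extend right as far as possible. */
    public static List<String> rightGreedy(String[] tokens, boolean[] matches) {
        List<String> spans = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            if (matches[start]) {
                spans.add(text(tokens, start, runEnd(matches, start)));
            }
        }
        return spans;
    }

    /** Models [L address+ R]: keep only the longest span of each run. */
    public static List<String> longest(String[] tokens, boolean[] matches) {
        List<String> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (matches[i]) {
                int end = runEnd(matches, i);
                spans.add(text(tokens, i, end));
                i = end + 1; // skip past the run so no shorter span is emitted
            } else {
                i++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = {"blah", "address1", "address2", "address3", "blah"};
        boolean[] matches = {false, true, true, true, false};
        // prints [address1 address2 address3, address2 address3, address3]
        System.out.println(rightGreedy(tokens, matches));
        // prints [address1 address2 address3]
        System.out.println(longest(tokens, matches));
    }
}
```

The first call yields the same three spans listed in the listSet comment (in a different order), and the second yields the single longest span described for adList.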
 

Author:
William Cohen
See Also:
Serialized Form

Field Summary
static java.util.Set<java.lang.String> legalKeywords
           
 
Constructor Summary
MixupProgram()
           
MixupProgram(java.io.File file)
          Create a MixupProgram from the contents of a file.
MixupProgram(java.lang.String program)
          Create a MixupProgram from a single string of semicolon-separated statements.
MixupProgram(java.lang.String[] statements)
          Create a MixupProgram from an array of statements
 
Method Summary
 void addStatement(Mixup.MixupTokenizer tok, java.lang.String keyword)
          Add a single statement to the current mixup program.
 void addStatement(java.lang.String statement)
          Add a single statement to the current mixup program.
 void eval(MonotonicTextLabels labels)
          Deprecated. Use MixupInterpreter to evaluate mixup programs
 MonotonicTextLabels eval(MonotonicTextLabels labels, TextBase tb)
          Deprecated. Use MixupInterpreter to evaluate mixup programs
 Statement[] getStatements()
           
static void main(java.lang.String[] args)
          Usage: programFile textFile/directory [outfile]. Evaluates the given program file against the specified data (either a file or a directory of files). If an outfile is specified, it outputs the types as operators to that file.
 java.lang.String toString()
          List the program
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

legalKeywords

public static java.util.Set<java.lang.String> legalKeywords
Constructor Detail

MixupProgram

public MixupProgram()

MixupProgram

public MixupProgram(java.lang.String[] statements)
             throws Mixup.ParseException
Create a MixupProgram from an array of statements

Throws:
Mixup.ParseException

MixupProgram

public MixupProgram(java.lang.String program)
             throws Mixup.ParseException
Create a MixupProgram from a single string of semicolon-separated statements.

Throws:
Mixup.ParseException

MixupProgram

public MixupProgram(java.io.File file)
             throws Mixup.ParseException,
                    java.io.FileNotFoundException,
                    java.io.IOException
Create a MixupProgram from the contents of a file.

Throws:
Mixup.ParseException
java.io.FileNotFoundException
java.io.IOException
Method Detail

eval

public MonotonicTextLabels eval(MonotonicTextLabels labels,
                                TextBase tb)
Deprecated. Use MixupInterpreter to evaluate mixup programs


eval

public void eval(MonotonicTextLabels labels)
Deprecated. Use MixupInterpreter to evaluate mixup programs


addStatement

public void addStatement(Mixup.MixupTokenizer tok,
                         java.lang.String keyword)
                  throws Mixup.ParseException
Add a single statement to the current mixup program.

Throws:
Mixup.ParseException

addStatement

public void addStatement(java.lang.String statement)
                  throws Mixup.ParseException
Add a single statement to the current mixup program.

Throws:
Mixup.ParseException

getStatements

public Statement[] getStatements()

toString

public java.lang.String toString()
List the program

Overrides:
toString in class java.lang.Object

main

public static void main(java.lang.String[] args)
Usage: programFile textFile/directory [outfile]. Evaluates the given program file against the specified data (either a file or a directory of files). If an outfile is specified, it outputs the types as operators to that file.