Mixup Tutorial

 

Mixup is a simple pattern-matching and information extraction language included in Minorthird.  The name is an acronym for My Information eXtraction and Understanding Package.  You can run a mixup program in Minorthird using the UI package, which is covered in the next section.  To understand Mixup better, it may help to look at the javadocs for Mixup and MixupProgram.

 

1. Writing and Running Mixup Programs

Minorthird’s language for manipulating text is Mixup.  Sample mixup programs can be found on William’s website (http://wcohen.com) under Teaching, Slides, notes, and sample files from the first day’s lecture.  Here is a sample program (sample1.mixup) with commentary:

 

defSpanType source1 = title: ... '(' [ ... ] ')' ;

 

In this line of mixup, defSpanType source1 declares source1 as a span type, and the expression to the right of the equal sign gives the pattern that identifies source1 spans.  Here the pattern says that source1 is the text between the parentheses in the title.  Each part of the expression means the following:

defSpanType                -           keyword

source1                        -           name of the defined spanType

title:                               -          start with title and match it against the pattern defined in the remainder of the expression

...                                 -           anything

'('                                -           the left parenthesis token

[                                   -           START of the extracted span

...                                 -           anything

]                                   -           END of the extracted span

')'                                -           the right parenthesis token
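
The pattern can be read as a search over token positions.  The following Python sketch is not Minorthird code; it only mimics how the expression picks out the bracketed span.  It assumes the title has already been split into tokens and that the pattern has to account for every token of the title, so the right parenthesis must be the last token.  The function name and the sample title are made up for illustration.

```python
def match_source1(tokens):
    """Mimic  ... '(' [ ... ] ')'  over a list of tokens.

    Returns every (start, end) slice that the brackets could capture.
    Assumption: the pattern must consume the whole token list, so ')'
    has to be the final token.
    """
    spans = []
    if not tokens or tokens[-1] != ")":
        return spans
    close = len(tokens) - 1
    for i, tok in enumerate(tokens[:close]):
        if tok == "(":
            # The leading '...' absorbs tokens[:i]; the bracketed '...'
            # absorbs everything between the two parentheses.
            spans.append((i + 1, close))
    return spans


# Hypothetical tokenized title, not taken from the sample data.
title = ["Rates", "rise", "again", "(", "Reuters", ")"]
for start, end in match_source1(title):
    print(title[start:end])
```

Under these assumptions a title with several left parentheses would yield one candidate span per left parenthesis, since each `...` may absorb any number of tokens.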

 

defSpanType source2 = description: [ !'-'+R ] '-' ... ;

 

This line of mixup is very similar to the line above, but contains a few new expressions:

!                                   -           not this token

+                                  -           one or more times

R                                  -           Extend to the right

 

To see the parameters for running a mixup program, type:

% java -Xmx500M edu.cmu.minorthird.ui.RunMixup -help

 

Now let's try running a sample mixup program.  First make sure the sample programs are in your minorthird/lib/mixup directory.  Then try:

% java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -showResult

            -  The -showResult parameter displays the output graphically

                                                            OR

% java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -gui

            -  Press the “Start Task” button to run the program

 

When the program is done running, a window like this will appear:

 

This window looks similar to the one that appeared when you ran ViewLabels; however, there are now 6 span types rather than 4, since sample1.mixup defined two more span types: source1 and source2.  To see what the mixup program extracted, go to the SpanTypes tab and highlight source1 and source2.

The file sample1a.mixup demonstrates what happens if a mixup expression contains + instead of +R.  Unlike languages that extend patterns greedily, mixup takes each pattern literally and backtracks as needed.  To see how this works, run:

% java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1a.mixup -showResult -saveAs foo.labels

 

Note: -saveAs FILE saves the result to FILE in a computer-readable format, and it works for most ui programs.

 

When the window appears, highlight the source2 spans.  Knowing that source2 is any prefix that ends before a '-', you can see that this does not work right.  Now run sample1.mixup again and see that it does work right with +R rather than just +.
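
One way to picture the difference is to enumerate the spans each form would accept.  The Python sketch below is not Minorthird code.  It assumes that a plain + may stop after any number of matching tokens, while +R keeps only a repetition that cannot be extended further to the right.  The pattern it simulates, `... [ !'-'+ ] ...`, is a hypothetical one chosen for illustration and is not the contents of sample1a.mixup; the function name and sample tokens are invented as well.

```python
def runs(tokens, right_maximal=False):
    """Enumerate spans made of one or more non-'-' tokens.

    right_maximal=False mimics a plain '+': every non-empty run counts.
    right_maximal=True mimics '+R' (assumed semantics): a run counts only
    if the token after it is '-' or the text ends there.
    """
    spans = []
    n = len(tokens)
    for start in range(n):
        end = start
        while end < n and tokens[end] != "-":
            end += 1
            if right_maximal and end < n and tokens[end] != "-":
                continue  # the run could still grow to the right
            spans.append((start, end))
    return spans


# Hypothetical tokenized description.
desc = ["AP", "News", "-", "Stocks", "fell"]
print(len(runs(desc)))                      # candidate spans with plain +
print(len(runs(desc, right_maximal=True)))  # candidate spans with +R
```

In this sketch the plain form also returns partial runs such as the single token "AP", which is the kind of unwanted extra match the R marker is meant to rule out.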

 

The lessons from these two sample mixup programs are:

1)      Use L and R on repeated expressions when you can

2)      Use non-determinism when you need to

a.       Ex: defSpanType bigram = description: ... [ any any ] ... ;
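
The bigram pattern leans on non-determinism: because each `...` may absorb any number of tokens, the bracketed pair can land at every position in the description.  A small Python sketch of that reading follows; it is not Minorthird code, and the function name and sample tokens are invented.

```python
def bigrams(tokens):
    """Mimic  ... [ any any ] ...  : every pair of adjacent tokens."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]


# Hypothetical tokenized description.
desc = ["Stocks", "fell", "on", "Monday"]
for pair in bigrams(desc):
    print(pair)
```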

 

Another example is sample2.mixup.  Take a look at it, then run:

% java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample2.mixup -showResult

 

Now let's take a look at some annotators:

1)      Open sample3.mixup (don’t look at it yet)

2)      Run: java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample3.mixup -showResult

a.       This will take a while…

3)      Now take a look at sample3.mixup

a.       ‘require’ asks for some type of annotation

b.      Annotators are usually found in $MINORTHIRD/lib/mixup

c.       Annotators can be re-defined in “annotators.config”, which is usually in $MINORTHIRD/config/annotators.config

4)      When RunMixup has finished, save the result of the computation to save time later on.  To do this, click the SaveAs button at the bottom middle of the top left window (you will have to scroll to get there).  Note: File->SaveAs does not work in this case; it is only for serializable objects.

5)      Now pick out some useful tags and save them in small-newsdir.labels

% perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels | grep addToType | cut -d" " -f5 | sort | uniq -c

% perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels > small-newsdir.labels

% java -Xmx500M edu.cmu.minorthird.ui.ViewLabels -labels small-newsdir
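
For readers less familiar with perl one-liners, the filtering step can be sketched in Python.  The sketch keeps a line when its fifth whitespace-separated field matches the same alternation, and counts the kept addToType lines per type.  The sample lines are hypothetical; the only thing taken from the commands above is that the fifth field holds the type name.

```python
import re
from collections import Counter

# Same alternation as in the perl one-liners.
KEEP = re.compile(
    r"(description|year|title|source|pubDate|link|extracted|"
    r"contentArea|body|NP|Name|NNP)"
)


def keep_line(line):
    """True if the fifth whitespace-separated field matches KEEP."""
    fields = line.split()
    return len(fields) > 4 and KEEP.search(fields[4]) is not None


def count_types(lines):
    """Count kept addToType lines per type, like sort | uniq -c."""
    return Counter(
        line.split()[4]
        for line in lines
        if keep_line(line) and "addToType" in line
    )


# Hypothetical label lines; real files may differ in detail.
sample = [
    "addToType doc1 0 5 title",
    "addToType doc1 6 3 NP",
    "addToType doc1 6 3 VB",
    "addToType doc2 0 4 title",
]
kept = [line for line in sample if keep_line(line)]
print(len(kept))
print(count_types(sample))
```

Like the perl pattern, the regular expression is unanchored, so a type such as NNP is kept both because of the NNP alternative and because it contains NP.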

 

Note: To find the labels for -labels FOO: (1) look in the repository, (2) look for the directory FOO, and (3) look for FOO.labels for the markup, ignoring in-line markup.

 

2. The Mixup Debugger and Label Editor

The Mixup debugger lets you edit your labels and your labeling program in parallel.  To see how this works, copy saved-handLabeled.labels to handLabeled.labels and try:

 

% java -Xmx500M edu.cmu.minorthird.ui.DebugMixup -labels small-newsdir -edit handLabeled.labels -mixup sample5.mixup

 

A window that looks like this will appear (without the highlighting at first):

 

To highlight extracted companies (which were defined by the mixup program), select extracted_company from the first pull-down menu on the section divider.  All the extracted companies will turn yellow (you may have to scroll down a little to find any).  Then, to view the true companies, which were defined by handLabeled.labels, select true_company from the second pull-down menu.  All hand-labeled companies that were properly extracted by the mixup program will turn green, all companies that were missed by the mixup program will turn blue, and false positives will turn red.  (See above picture for reference.)
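
The three colors correspond to a simple set comparison between the extracted and the hand-labeled spans.  The Python sketch below illustrates that comparison only; it is not how the debugger is implemented.  It assumes spans are compared for exact equality (how the tool treats partial overlaps is not covered here), and the span tuples are hypothetical.

```python
def classify(extracted, true):
    """Split spans into the three groups the debugger colors.

    Spans are (doc_id, start, end) tuples compared for exact equality,
    which is a simplifying assumption.
    """
    extracted, true = set(extracted), set(true)
    return {
        "green": extracted & true,  # hand-labeled and extracted
        "blue": true - extracted,   # hand-labeled but missed
        "red": extracted - true,    # extracted but not hand-labeled
    }


# Hypothetical spans.
extracted_company = [("doc1", 0, 2), ("doc1", 10, 12)]
true_company = [("doc1", 0, 2), ("doc2", 4, 6)]
groups = classify(extracted_company, true_company)
for color in ("green", "blue", "red"):
    print(color, sorted(groups[color]))
```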

 

To edit the labels, click on a document, and click the Import button at the bottom of the window.  This will import all the extracted company labels.  To correct these labels, click the Next button to step through them and click Delete on any that is a false positive.  To add a label, highlight the span and click Add.  When you are finished labeling a document, click Export.  Click Save when you finish.

 

Some Tricks:

1)      On the right-hand side of the center bar, replace -top- with -body- to focus the window on what you care about.

2)      Replace -top- with -extracted company- and move the slider to look for extractions in context.

 

When you're close enough with the debugging, you might want to hand the task over to someone else to get more training data.  First run the current program:

 

% java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample5.mixup -saveAs sample5.labels

 

Now take the relevant part of its output, and your hand-labeling results, and merge them:

 

% grep extracted_company sample5.labels > labelingTask.labels

% cat handLabeled.labels >> labelingTask.labels
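
The merge above amounts to keeping the program-output lines that mention extracted_company and appending the hand labels.  A Python sketch of the same steps is shown here for clarity; the function name and the sample lines are hypothetical, and, like grep, the filter matches the keyword anywhere in the line.

```python
def build_labeling_task(program_lines, hand_lines, keep="extracted_company"):
    """Keep program-output lines mentioning `keep`, then append hand labels."""
    return [line for line in program_lines if keep in line] + list(hand_lines)


# Hypothetical label lines.
program_output = [
    "addToType doc1 0 2 extracted_company",
    "addToType doc1 5 3 NP",
]
hand_labeled = ["addToType doc1 0 2 true_company"]
for line in build_labeling_task(program_output, hand_labeled):
    print(line)
```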

 

Now run the labeling tool (which is somewhat stripped down, though probably not enough) on the result:

 

% java -Xmx500M edu.cmu.minorthird.ui.EditLabels -labels small-newsdir -edit labelingTask.labels -extractedType extracted_company -trueType true_company

 
