MinorThird - FAQ
Frequently Asked Questions Page
Maintained by Quinten Mercer
MinorThird stands for "Methods for Identifying Names and Ontological Relationships in Text
using Heuristics for Identifying Relationships in Data".
It is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text.
Please check the License and Use section in the Minorthird homepage.
A04 Where do I find some documentation?
First, check the Documentation section in the Minorthird homepage. Other overview documents and additional info can be found in:
B. Download, Setup and Configuration
You can download it from SourceForge at http://sourceforge.net/projects/minorthird/.
Also, check http://minorthird.sourceforge.net/ for more detailed information. You’ll need Ant and a recent version (1.4 or higher) of the JDK.
Most of the time, the environment variables are not set correctly. Please check that the following variables are set in your environment before trying to compile:
ANT_HOME -> should be set to the directory where you put ANT (for instance, /usr0/local/apache-ant-1.6.1)
JAVA_HOME -> should be set to wherever you put jdk (for instance, /usr0/local/j2sdk1.4.2_04)
MINORTHIRD -> set to wherever you installed minorthird (for instance, /usr0/project/minorthird)
CLASSPATH -> should be set according to the directions in the $MINORTHIRD/script/setup.xx file (for instance, using csh, if your $MINORTHIRD is /usr0/project/minorthird, you need to set your CLASSPATH to $MINORTHIRD:$MINORTHIRD/class:$MINORTHIRD/lib:$MINORTHIRD/lib/minorThirdIncludes.jar:$MINORTHIRD/lib/mixup:$MINORTHIRD/config:$MINORTHIRD/lib/montylingua)
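As a quick sanity check on the CLASSPATH value, the sketch below assembles the same list of entries from a MINORTHIRD root directory. The class MinorThirdClasspath and its build method are hypothetical helpers written for this FAQ, not part of MinorThird, and the code assumes a JDK much newer than 1.4 (it uses generics and String.join).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of MinorThird): assembles the CLASSPATH
// entries listed above from a given MINORTHIRD root directory.
class MinorThirdClasspath {
    static String build(String root, String separator) {
        // Suffixes taken from the csh example in this FAQ.
        String[] suffixes = {
            "", "/class", "/lib", "/lib/minorThirdIncludes.jar",
            "/lib/mixup", "/config", "/lib/montylingua"
        };
        List<String> entries = new ArrayList<>();
        for (String suffix : suffixes) {
            entries.add(root + suffix);
        }
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // Falls back to the example path from the FAQ if MINORTHIRD is unset.
        String root = System.getenv("MINORTHIRD");
        if (root == null) {
            root = "/usr0/project/minorthird";
        }
        System.out.println(build(root, ":"));
    }
}
```

Comparing the printed line with the output of "echo $CLASSPATH" shows quickly whether an entry is missing.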
Try doing 'cvs update -dP' to get rid of the duplicate files first.
Try adjusting the JVM memory settings (initial heap size, maximum heap size, etc.). Type java -X for details.
For instance, try java -Xmx500m edu.cmu.minorthird....
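To confirm that a heap setting such as -Xmx500m actually took effect, a small stand-alone class can report the limits the JVM is running with. HeapReport is a hypothetical name used only for this sketch; it relies on nothing beyond the standard java.lang.Runtime methods.

```java
// Hypothetical stand-alone check (not part of MinorThird): prints the heap
// limits of the JVM it runs in, so an -Xms/-Xmx setting can be verified.
class HeapReport {
    // Converts a byte count to whole megabytes.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap (MB):       " + toMegabytes(rt.maxMemory()));
        System.out.println("allocated heap (MB): " + toMegabytes(rt.totalMemory()));
        System.out.println("free in heap (MB):   " + toMegabytes(rt.freeMemory()));
    }
}
```

Running it as "java -Xmx500m HeapReport" should report a maximum close to 500 MB; the exact figure can differ slightly between JVMs.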
Please check your environment variables (CLASSPATH, JAVA_HOME, etc.) first. Then check http://www.inonit.com/cygwin/faq/.
C01 If I want to run a classification experiment, in which format should I transform the data to be compatible with minorthird? How do I perform a classification experiment if I already have a dataset with all features extracted?
The possible dataset formats are specified in the documentation of the class DatasetLoader. One way to run a classification experiment is, after formatting your data into minorthird format, to use NumericDemo.java to select which experiment you want. In this file you can switch between different classifiers, different splits, test sets, etc.
You can use DatasetLoader to load a dataset from SVM-light format (see the method loadSVM). You then have a standard Dataset and should be able to do whatever you want with it.
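For readers unfamiliar with the SVM-light layout, each line holds a target value followed by index:value pairs, optionally ending in a "# comment". The sketch below parses one such line with plain Java. SvmLightLine is a hypothetical class written for illustration; it does not call DatasetLoader, and it is an assumption that the files you feed to MinorThird follow this basic layout (special features such as qid: are not handled here).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the SVM-light line layout:
//   <target> <index>:<value> <index>:<value> ... # optional comment
class SvmLightLine {
    final double label;
    final Map<Integer, Double> features;

    SvmLightLine(double label, Map<Integer, Double> features) {
        this.label = label;
        this.features = features;
    }

    static SvmLightLine parse(String line) {
        // Drop an optional trailing "# comment".
        int hash = line.indexOf('#');
        String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
        String[] parts = body.split("\\s+");
        double label = Double.parseDouble(parts[0]);
        Map<Integer, Double> features = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int colon = parts[i].indexOf(':');
            int index = Integer.parseInt(parts[i].substring(0, colon));
            double value = Double.parseDouble(parts[i].substring(colon + 1));
            features.put(index, value);
        }
        return new SvmLightLine(label, features);
    }

    public static void main(String[] args) {
        SvmLightLine ex = parse("+1 1:0.5 3:1.2 # doc7");
        System.out.println(ex.label + " " + ex.features);
    }
}
```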
Yes. If you specify the learner on the command line, with -learn, you can specify any learner you like, not only the ones that pop up in the GUI. This could include a OneVsAll-based learner.
C04 For the one-vs-all case, what's the data format and how do we communicate the k different classes to the learning algorithm? What kind of span types do I need - over the entire input? Over a single token (or some other span of tokens)? If either are possible, what impact does it have on the algorithm?
To classify documents into multiple classes using the ui package tools (like TrainTestClassifier) you should define a span property,
which is a mapping from spans to strings. For instance, if the span types deleteCommand, insertCommand, replaceCommand have been
defined, you could use this mixup to define the span prop "whatCommand":
defSpanProp whatCommand:delete =: [@deleteCommand];
defSpanProp whatCommand:insert =: [@insertCommand];
defSpanProp whatCommand:replace =: [@replaceCommand];
Right now you can't see span properties in the various viewers.
After the property is defined, you can specify a oneVsAll learner with the -learner option, and specify that you want to train against the
"whatCommand" property with the option "-spanProp whatCommand" (replacing "-spanType deleteCommand", or whatever). The result of
training will be an Annotator that assigns some other span property (as specified by the "-output" option) to a document.
To summarize the training, you need to:
- specify the class (insert, delete, replace) for every training
document, using some spanProperty (e.g. whatCommand)
- tell the ui what spanProperty you're training against (using
the -spanProp option)
- tell the ui what learner to use, and make sure it's a learner that
can handle non-binary data
After training (e.g. using TestClassifier, or the ApplyAnnotator method in the ui package) the learned annotator will add to every document its
predicted class as the value for the span property specified by "-output".
The set of labels will be inferred from data.
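To make the one-vs-all idea concrete, the sketch below shows the generic reduction: the label set is collected from the training labels, and each class then gets its own binary "this class vs. the rest" problem. OneVsAllSketch is a hypothetical class for illustration only; it is the textbook reduction and is not taken from MinorThird's own learner code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration of the generic one-vs-all reduction.
class OneVsAllSketch {
    // Collects the distinct class labels that occur in the training data.
    static SortedSet<String> inferLabels(List<String> docLabels) {
        return new TreeSet<>(docLabels);
    }

    // Relabels the data for one binary subproblem: target class vs. the rest.
    static List<Boolean> binarize(List<String> docLabels, String target) {
        List<Boolean> out = new ArrayList<>();
        for (String label : docLabels) {
            out.add(label.equals(target));
        }
        return out;
    }

    public static void main(String[] args) {
        // One whatCommand value per training document (made-up toy data).
        List<String> labels = Arrays.asList("insert", "delete", "insert", "replace");
        for (String target : inferLabels(labels)) {
            System.out.println(target + " vs. rest: " + binarize(labels, target));
        }
    }
}
```

With k distinct property values this yields k binary training sets, which is why no separate list of classes has to be passed to the learner.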
C06 If I wanted to run an Extractor Learner that used an experimental NameFE class as the feature extractor, how would I specify that on a command line? Or is it a matter of changing the source files and compiling it with the new one?
Follow the output of "java edu.cmu.minorthird.ui.TrainTestExtractor -help". If you've put NameFE on the classpath (which it's not, by default) then
you can specify it
(a) with the option '-fe "new NameFE()"'
(b) in the gui, if you add the class name of NameFE into your selectableTypes.txt file
There's a bug in the UI which keeps you from inspecting the changed FE, but this does work.
If you want to parameterize NameFE, the simplest approach is to add the parameters to the constructor, so you can specify them directly in the -fe
argument. If you look at the javadocs for util.CommandLineProcessor and have your NameFE implementation implement the
CommandLineProcessor.Configurable interface then you can pass in additional arguments on the command line as well.
The supported token extractors (Recommended.TokenFE(), e.g.) can be configured in the gui, or by first explicitly specifying them with -fe,
and then using one of the command line options, which you can discover by using '-fe Recommended.TokenFE() -help' or looking at the javadocs.
All of the supported FEs need mixup that provides a specified type of annotation.
Many of the extraction learners work by reducing extraction to tagging, i.e., labeling each word with one label, like "inside a name" vs "outside a name".
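The reduction to tagging can be pictured with a few lines of plain Java: given the tokens of a document and the token range of one annotated name, every token receives an "inside" or "outside" label. TaggingReduction is a hypothetical class for illustration; real extraction learners may use richer label sets (for example separate begin and end labels), which this sketch does not attempt.

```java
import java.util.Arrays;

// Hypothetical illustration of reducing extraction to per-token tagging.
class TaggingReduction {
    // Labels each token "inside" if its index falls in [start, end),
    // and "outside" otherwise.
    static String[] label(String[] tokens, int start, int end) {
        String[] labels = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            labels[i] = (i >= start && i < end) ? "inside" : "outside";
        }
        return labels;
    }

    public static void main(String[] args) {
        String[] tokens = {"Please", "email", "William", "Cohen", "today"};
        // The name span covers tokens 2 and 3.
        System.out.println(Arrays.toString(label(tokens, 2, 4)));
    }
}
```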
Text is translated from strings to tokens in text.TextBase. A Span is basically just a sequence of tokens. The conversion from Spans to
Instances is inside text.learn.SpanFE.
C11 While I have it working with a feature extractor that I write and compile, I currently have no way to really test this at all other than look at the final error values and see that they have changed.
I recommend (1) engineering as much of the FE process as possible in mixup, and using the ui.LabelViewer to check the results, and (2) using
the database viewer to view the final results of extraction.
Use the options: -showResult -showTestDetails
The feature extractor should indeed be serializable. If it's not you can still perform cross-validation experiments, but an error will be (or
at least should be) thrown when you try to write it out, e.g. with TrainExtractor.
Serialization only saves instance-dependent (non-static) data. It doesn't save the code associated with a class, so you'll need to have
the feature extraction code on your classpath when you load it back in.
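The point about static data can be checked with standard Java serialization alone. In the sketch below, ToyFE stands in for a feature extractor; it and SerializationSketch are hypothetical names, and nothing here uses MinorThird classes. The instance field survives a write/read round trip, while the static field is not part of the saved bytes and simply keeps whatever value the loaded class currently has.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration of what Java serialization does and does not save.
class SerializationSketch {
    static class ToyFE implements Serializable {
        private static final long serialVersionUID = 1L;
        static String sharedSetting = "default"; // static: never written out
        int windowSize;                           // instance data: written out

        ToyFE(int windowSize) {
            this.windowSize = windowSize;
        }
    }

    // Serializes an object to a byte array.
    static byte[] write(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new IllegalStateException("could not serialize", e);
        }
        return bytes.toByteArray();
    }

    // Reads an object back; the class itself must be on the classpath.
    static Object read(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not deserialize", e);
        }
    }

    public static void main(String[] args) {
        ToyFE.sharedSetting = "tuned";
        byte[] saved = write(new ToyFE(3));
        ToyFE.sharedSetting = "default";   // simulate a fresh JVM
        ToyFE loaded = (ToyFE) read(saved);
        System.out.println(loaded.windowSize + " " + ToyFE.sharedSetting);
    }
}
```

If the class were missing from the classpath at load time, readObject would fail with a ClassNotFoundException, which is the situation the answer above warns about.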
It's probably possible to hack the mixup interpreter to serialize mixup code and dictionaries along with an extractor, if you really want to do
that (if the dictionaries are pretty big, you might not want to save multiple copies!).
There are three ways to learn to extract multiple types:
D. Other Issues
Bugs?!?! There are no bugs in this package ;-) But in case you really think you found one, please email William W. Cohen.