MinorThird - FAQ

Frequently Asked Questions Page

Maintained by Quinten Mercer
Last update: 14 May 2004

A.  Introduction

B.  Download, Setup and Configuration

C.  Experiments

D.  Other Issues

A. Introduction

A01 What is MinorThird?

 

MinorThird stands for "Methods for Identifying Names and Ontological Relationships in Text using Heuristics for Identifying Relationships in Data".

It is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text.

 

A02 Does it really work?

Depends mostly on the weather in Pittsburgh.

A03 Do I need a license to use MinorThird?

Please check the License and Use section on the MinorThird homepage.

A04 Where do I find some documentation?

First, check the Documentation section on the MinorThird homepage. Other overview documents and additional information can be found in:

minorthird.html, textpackage.html and mixuppackage.html.

B. Download, Setup and Configuration

B01 How can I download it?

 

You can download it from SourceForge at http://sourceforge.net/projects/minorthird/.

Also, check http://minorthird.sourceforge.net/ for more detailed information. You’ll need Ant and a recent version (1.4 or higher) of the JDK.

B02 Why can't I make it compile?

Most of the time, the environment variables are not set correctly. Please check that the following variables are set in your environment before trying to compile:

ANT_HOME -> should be set to the directory where you installed Ant (for instance, /usr0/local/apache-ant-1.6.1)

JAVA_HOME -> should be set to the directory where you installed the JDK (for instance, /usr0/local/j2sdk1.4.2_04)

MINORTHIRD -> should be set to the directory where you installed MinorThird (for instance, /usr0/project/minorthird)

CLASSPATH -> should be set according to the directions in the $MINORTHIRD/script/setup.xx file (for instance, using csh, if your $MINORTHIRD is /usr0/project/minorthird, you need to set your CLASSPATH to $MINORTHIRD:$MINORTHIRD/class:$MINORTHIRD/lib:$MINORTHIRD/lib/minorThirdIncludes.jar:$MINORTHIRD/lib/mixup:$MINORTHIRD/config:$MINORTHIRD/lib/montylingua)
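If you are not sure which of these variables is missing, a small stand-alone check can help. The following is only a sketch and is not part of MinorThird; the class name EnvCheck and its method are made up for illustration. It reports which of the four variables listed above are unset or empty in the current environment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not shipped with MinorThird.
public class EnvCheck {
    // The four variables named in this FAQ entry.
    static final String[] REQUIRED = {"ANT_HOME", "JAVA_HOME", "MINORTHIRD", "CLASSPATH"};

    // Returns the names of required variables that are unset or empty.
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> unset = missing(System.getenv());
        if (unset.isEmpty()) {
            System.out.println("All required variables are set.");
        } else {
            System.out.println("Not set: " + unset);
        }
    }
}
```

Note that this only checks that the variables exist; it does not verify that the directories they point to are correct.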

B03 I'm having trouble generating the documentation files. Why can't I compile the javadoc?

Try doing 'cvs update -dP' to get rid of the duplicate files first.

B04 I'm frequently running out of memory. What should I do?

 

Try adjusting the JVM memory settings (initial heap size, maximum heap size, etc.). Type java -X for details.

For instance, try java -Xmx500m edu.cmu.minorthird.... 
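To confirm that a heap option such as -Xmx actually took effect, you can print the limits the JVM reports. This is a minimal sketch using only the standard Runtime class; the class name HeapInfo is made up for illustration and is not part of MinorThird.

```java
// Hypothetical helper, not shipped with MinorThird.
public class HeapInfo {
    // Converts a byte count to whole megabytes.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the most the JVM will try to use for the heap.
        System.out.println("max heap (MB):     " + toMegabytes(rt.maxMemory()));
        // totalMemory() is what the JVM has currently reserved.
        System.out.println("current heap (MB): " + toMegabytes(rt.totalMemory()));
        // freeMemory() is the unused part of the current heap.
        System.out.println("free in heap (MB): " + toMegabytes(rt.freeMemory()));
    }
}
```

Running it as "java -Xmx500m HeapInfo" should report a maximum close to 500 MB; the exact figure can differ slightly between JVMs.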

 

B05 I'm having trouble running MinorThird under Cygwin. What should I do?

Please check your environment variables (CLASSPATH, JAVA_HOME, etc) first. Then check http://www.inonit.com/cygwin/faq/.

 

C. Experiments

C01 If I want to run a classification experiment, what format should my data be in to be compatible with MinorThird? How do I perform a classification experiment if I already have a dataset with all features extracted?

The possible dataset formats are specified in the documentation of the class DatasetLoader. One way to run a classification experiment is to format your data into MinorThird format and then use NumericDemo.java to select the experiment you want. In this file you can change the classifier, the splits, the test sets, etc.

 

C02 How can I run a classification experiment using data in SVMlight format?

 

You can use DatasetLoader to return a dataset from SVMlight format (see method loadSVM). Then you have a standard Dataset and should be able to do whatever you want with it.
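If you are unsure whether your file is valid SVMlight input, it may help to recall the line format: a numeric label, followed by index:value pairs, with an optional comment after '#'. The sketch below parses one such line. It is an illustration of the file format only, not MinorThird's loadSVM implementation; the class SvmLightLine is made up, and it does not handle special tokens such as qid.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the SVMlight line format; not MinorThird code.
public class SvmLightLine {
    final double label;
    final Map<Integer, Double> features;

    SvmLightLine(double label, Map<Integer, Double> features) {
        this.label = label;
        this.features = features;
    }

    // Parses "<label> <index>:<value> ... # optional comment".
    // Assumes a non-empty, well-formed line; malformed input throws NumberFormatException.
    static SvmLightLine parse(String line) {
        int hash = line.indexOf('#');
        String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
        String[] fields = body.split("\\s+");
        double label = Double.parseDouble(fields[0]);
        Map<Integer, Double> features = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int colon = fields[i].indexOf(':');
            int index = Integer.parseInt(fields[i].substring(0, colon));
            double value = Double.parseDouble(fields[i].substring(colon + 1));
            features.put(index, value);
        }
        return new SvmLightLine(label, features);
    }

    public static void main(String[] args) {
        SvmLightLine parsed = parse("+1 1:0.5 7:2.0 # an example line");
        System.out.println("label=" + parsed.label + " features=" + parsed.features);
    }
}
```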

C03 Is there a way to use non-recommended learners (like something using one-vs-all) using the minorthird.ui routines?

Yes. If you specify the learner on the command line with -learner, you can specify any learner you like, not only the ones that pop up in the GUI. This could include a OneVsAll-based learner.

C04 For the one-vs-all case, what's the data format and how do we communicate the k different classes to the learning algorithm? What kind of span types do I need - over the entire input? Over a single token (or some other span of tokens)? If either are possible, what impact does it have on the algorithm?

To classify documents into multiple classes using the ui package tools (like TrainTestClassifier) you should define a span property, which is a mapping from spans to strings. For instance, if the spanTypes deleteCommand, insertCommand and replaceCommand have been defined, you could use this mixup to define the span property "whatCommand":

defSpanProp whatCommand:delete =: [@deleteCommand];

defSpanProp whatCommand:insert =: [@insertCommand];

defSpanProp whatCommand:replace =: [@replaceCommand];

Right now you can't see span properties in the various viewers.

After the property is defined, you can specify a OneVsAll learner with the -learner option, and specify that you want to train against the "whatCommand" property with the option "-spanProp whatCommand" (replacing "-spanType deleteCommand", or whatever). The result of training will be an Annotator that assigns some other span property (as specified by the "-output" option) to a document.

To summarize the training, you need to:

- specify the class (insert, delete, replace) for every training document, using some spanProperty (e.g. whatCommand)

- tell the ui what spanProperty you're training against (using "-spanProp whatCommand")

- tell the ui what learner to use, and make sure it's a learner that can handle non-binary data

When the learned annotator is applied after training (e.g. using TestClassifier, or the ApplyAnnotator method in the ui package), it will add to every document its predicted class as the value of the span property specified by "-output".

C05 How do I communicate the k classes to OneVsAll? Does it simply take all declared span types in the training set? Or do I pass a set of labels?

The set of labels will be inferred from data.
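To make "inferred from data" concrete, here is a rough sketch of the general one-vs-all idea: collect the distinct labels seen in training, then build one binary problem per label. This is a generic illustration under that assumption and not MinorThird's OneVsAll code; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration of one-vs-all; not MinorThird code.
public class OneVsAllSketch {
    // Collects the distinct class labels that occur in the training data.
    static TreeSet<String> inferLabels(List<String> trainingLabels) {
        return new TreeSet<>(trainingLabels);
    }

    // For each inferred label, builds a binary target list:
    // true where the example carries that label, false otherwise.
    static Map<String, List<Boolean>> binaryProblems(List<String> trainingLabels) {
        Map<String, List<Boolean>> problems = new LinkedHashMap<>();
        for (String label : inferLabels(trainingLabels)) {
            List<Boolean> targets = new ArrayList<>();
            for (String actual : trainingLabels) {
                targets.add(actual.equals(label));
            }
            problems.put(label, targets);
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("insert", "delete", "insert", "replace");
        System.out.println("classes: " + inferLabels(labels));
        System.out.println("binary problems: " + binaryProblems(labels));
    }
}
```

One binary learner would then be trained per entry, which is why no explicit list of classes has to be passed in.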

C06 If I wanted to run an Extractor Learner that used an experimental NameFE class as the feature extractor, how would I specify that on the command line? Or is it a matter of changing the source files and compiling it with the new one?

Follow the output of "java edu.cmu.minorthird.ui.TrainTestExtractor -help". If you've put NameFE on the classpath (which it's not, by default) then you can specify it

(a) with the option '-fe "new NameFE()"'

(b) in the gui, if you add the class name of NameFE to your selectableTypes.txt file

There's a bug in the UI which keeps you from inspecting the changed FE, but this does work.

If you want to parameterize NameFE, the simplest approach is to add the parameters to the constructor, so you can specify them directly in the -fe argument. If you look at the javadocs for util.CommandLineProcessor and have your NameFE implementation implement the CommandLineProcessor.Configurable interface, then you can pass in additional arguments on the command line as well.

C07 How can the supported token extractors be configured?

The supported token extractors (e.g. Recommended.TokenFE()) can be configured in the gui, or by first explicitly specifying them with -fe and then using one of the command line options, which you can discover by using '-fe Recommended.TokenFE() -help' or by looking at the javadocs.

All of the supported FEs need mixup that provides a specified type of annotation.

C08 Where would I look to get a better understanding of how the spans are built up?

Try text.learn.SpanFE.

C09 When I look at the features of each span (in the window I get when checking "displayDataSetBeforeLearning"), each span seems to have only a single token. Is that correct?

Many of the extraction learners work by reducing extraction to tagging, i.e., labeling each word with one label, like "inside a name" vs "outside a name".
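The reduction to tagging can be pictured as follows: given the tokens of a document and the token range of a labeled name, every token becomes its own training example with an "inside" or "outside" label. This is only a sketch of the idea, not the code MinorThird uses; TaggingSketch and its label strings are made up.

```java
import java.util.Arrays;

// Hypothetical illustration of reducing extraction to tagging; not MinorThird code.
public class TaggingSketch {
    // Labels each token "inside" if its position lies in [start, end), else "outside".
    static String[] tag(String[] tokens, int start, int end) {
        String[] labels = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            labels[i] = (i >= start && i < end) ? "inside" : "outside";
        }
        return labels;
    }

    public static void main(String[] args) {
        String[] tokens = {"Please", "email", "William", "Cohen", "today"};
        // The name covers token positions 2 and 3.
        String[] labels = tag(tokens, 2, 4);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "\t" + labels[i]);
        }
    }
}
```

This is why the dataset shows one token per example even though the extracted names themselves can span several tokens.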

C10 Is it true that there would be multiple tokens per span? Where is the code that translates the sentences from string into a set of tokens/spans?

Text is translated from strings to tokens in text.TextBase. A Span is basically just a sequence of tokens. The conversion from Spans to Instances is inside text.learn.SpanFE.

C11 While I have it working with a feature extractor that I write and compile, I currently have no way to really test this at all other than look at the final error values and see that they have changed.

I recommend (1) engineering as much of the FE process as possible in mixup, and using the ui.LabelViewer to check the results, and (2) using the database viewer to view the final results of extraction.

C12 How do I get additional details (weights, features, etc) in a name extraction test?

Use the options: -showResult -showTestDetails

C13 Must the feature extractor be declared as a serializable class, when building a serialized model and passing in a hand-coded feature extractor?

The feature extractor should indeed be serializable. If it's not, you can still perform cross-validation experiments, but an error will be (or at least should be) thrown when you try to write it out, e.g. with TrainExtractor.

Serialization only saves instance-dependent (non-static) data; it doesn't save the code associated with a class, so you'll need to have the feature extraction code on your classpath when you load it back in.

It's probably possible to hack the mixup interpreter to serialize mixup code and dictionaries along with an extractor, if you really want to do that (if the dictionaries are pretty big, you might not want to save multiple copies!).
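The behaviour described above is standard Java serialization and can be tried outside MinorThird. In the sketch below, ToyFE is a made-up stand-in for a hand-coded feature extractor: because it implements Serializable, its instance data survives a round trip, while writing an object that is not Serializable fails at write time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration of Java serialization; not MinorThird code.
public class SerializationSketch {
    // Made-up stand-in for a hand-coded feature extractor.
    static class ToyFE implements Serializable {
        private static final long serialVersionUID = 1L;
        final int windowSize; // instance data: written to the stream

        ToyFE(int windowSize) {
            this.windowSize = windowSize;
        }
    }

    // Serializes an object to a byte array.
    static byte[] save(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    // Reads an object back; its class must be on the classpath at this point.
    static Object load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyFE copy = (ToyFE) load(save(new ToyFE(3)));
        System.out.println("restored windowSize = " + copy.windowSize);

        try {
            save(new Object()); // java.lang.Object does not implement Serializable
        } catch (NotSerializableException e) {
            System.out.println("cannot write: " + e.getMessage());
        }
    }
}
```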

C14 How can I run an experiment to learn multiple types?

There are three ways to learn to extract multiple types:

  1. Define a spanProperty and pass that into TrainTestExtractor with the spanProp option, eg:

    % java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanProp inCapsBecause

  2. Use the -spanProp option with a comma-separated list of non-overlapping types, eg:

    % java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanProp inCapsBecauseStart,trueName

  3. You can also run ui.TrainExtractor multiple times and learn multiple extractors, which of course might extract overlapping spans.

In the first two cases what is learned is an extractor that inserts a new spanProperty (by default named "_prediction").

D. Other Issues

D01 How do I report bugs?



Bugs?!?! There are no bugs in this package ;-) But in case you really think you found one, please email William W. Cohen.