MinorThird - Frequently Asked Questions


Page maintained by Quinten Mercer. Last update: 14 May 2004

A. Introduction
  What is MinorThird?
  Does it really work?
  Do I need a license to use MinorThird?
  Where do I find some documentation?

B. Download, Setup and Configuration
  How can I download it?
  How do I compile it?
  I'm having trouble generating the documentation files. Why can't I compile javadoc?
  I'm running out of memory. What should I do?
  I'm having problems when trying to run minorthird under Cygwin. What now?

C. Experiments
  If I want to run a classification experiment, in which format should I transform the data to be compatible with minorthird?
  How do I perform a classification experiment if I already have a dataset with all features extracted?
  How can I run a classification experiment using data in SVMlight format?
  Is there a way to use non-recommended learners (like something using one-vs-all) using the minorthird.ui routines?
  For the one-vs-all case, what's the data format and how do we communicate the k different classes to the learning algorithm? What kind of span types do I need - over the entire input? Over a single token (or some other span of tokens)? If either is possible, what impact does it have on the algorithm?
  How do I communicate the k classes to OneVsAll? Does it simply take all declared span types in the training set? Or do I pass a set of labels?
  If I wanted to run an Extractor Learner that used an experimental NameFE class as the feature extractor, how would I specify that on a command line? Or is it a matter of changing the source files and compiling it with the new one?
  How can the supported token extractors be configured?
  Where would I look to get a better understanding of how the spans are built up?
  When I look at the features of each span (in the window I get when checking "displayDataSetBeforeLearning"), each span seems to have only a single token. Is it correct?
  Is it true that there would be multiple tokens per span? Where is the code that translates the sentences from string into a set of tokens/spans?
  While I have it working with a feature extractor that I write and compile, I currently have no way to really test this at all other than look at the final error values and see that they have changed.
  How do I get additional details (weights, features, etc) in a name extraction test?
  Must the feature extractor be declared as a serializable class, when building a serialized model and passing in a hand-coded feature extractor?
  How can I run an experiment to learn multiple types?

D. Other Issues
  How do I report bugs?
  Your question…

A. Introduction

A01 What is MinorThird?
MinorThird stands for "Methods for Identifying Names and Ontological Relationships in Text using Heuristics for Identifying Relationships in Data". It is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text.

A02 Does it really work?
Depends mostly on the weather in Pittsburgh.

A03 Do I need a license to use Minorthird?
Please check the License and Use section on the Minorthird homepage.

A04 Where do I find some documentation?
First, check the Documentation section on the Minorthird homepage. Other overview documents and additional information can be found in minorthird.html, textpackage.html and mixuppackage.html.

B. Download, Setup and Configuration

B01 How can I download it?
You can download it from SourceForge at http://sourceforge.net/projects/minorthird/. Also, check http://minorthird.sourceforge.net/ for more detailed information. You'll need Ant and a recent version (1.4 or higher) of the JDK.

B02 Why can't I make it compile?
Most of the time, the environment variables are not set correctly.
Please check that the following variables are set in your environment before trying to compile:
  ANT_HOME -> should be set to the directory where you put Ant (for instance, /usr0/local/apache-ant-1.6.1)
  JAVA_HOME -> should be set to wherever you put the JDK (for instance, /usr0/local/j2sdk1.4.2_04)
  MINORTHIRD -> should be set to wherever you installed minorthird (for instance, /usr0/project/minorthird)
  CLASSPATH -> should be set according to the directions in the $MINORTHIRD/script/setup.xx file (for instance, using csh, if your $MINORTHIRD is /usr0/project/minorthird, you need to set your CLASSPATH to $MINORTHIRD:$MINORTHIRD/class:$MINORTHIRD/lib:$MINORTHIRD/lib/minorThirdIncludes.jar:$MINORTHIRD/lib/mixup:$MINORTHIRD/config:$MINORTHIRD/lib/montylingua)

B03 I'm having trouble generating the documentation files. Why can't I compile javadoc?
Try running 'cvs update -dP' first to get rid of the duplicate files.

B04 I'm frequently running out of memory. What should I do?
Try adjusting the JVM memory settings (initial heap size, maximum heap size, etc). Type java -X for details. For instance, try java -Xmx500m edu.cmu.minorthird....

B05 I'm having problems when trying to run minorthird under Cygwin. What now?
Please check your environment variables (CLASSPATH, JAVA_HOME, etc) first. Then check http://www.inonit.com/cygwin/faq/.

C. Experiments

C01 If I want to run a classification experiment, in which format should I transform the data to be compatible with minorthird? How do I perform a classification experiment if I already have a dataset with all features extracted?
The possible dataset formats are specified in the documentation of the class DatasetLoader. One way to run a classification experiment is, after formatting your data into minorthird format, to use NumericDemo.java to select which experiment you want. In this file you can switch between different classifiers, splits, test sets, etc.

C02 How can I run a classification experiment using data in SVMlight format?
You can use DatasetLoader to return a dataset from SVMlight format (see method loadSVM). Then you have a standard Dataset and should be able to do whatever you want with it.

C03 Is there a way to use non-recommended learners (like something using one-vs-all) using the minorthird.ui routines?
Yes. If you specify the learner on the command line, with -learner, you can specify any learner you like, not only the ones that pop up in the GUI. This could include a OneVsAll based learner.

C04 For the one-vs-all case, what's the data format and how do we communicate the k different classes to the learning algorithm? What kind of span types do I need - over the entire input? Over a single token (or some other span of tokens)? If either is possible, what impact does it have on the algorithm?
To classify documents into multiple classes using the ui package tools (like TrainTestClassifier) you should define a span property, which is a mapping from spans to strings. For instance, if the spanTypes deleteCommand, insertCommand, replaceCommand have been defined, you could use this mixup to define the span property "whatCommand":

  defSpanProp whatCommand:delete =: [@deleteCommand];
  defSpanProp whatCommand:insert =: [@insertCommand];
  defSpanProp whatCommand:replace =: [@replaceCommand];

Right now you can't see span properties in the various viewers.

After the property is defined, you can specify a OneVsAll learner with the -learner option, and specify that you want to train against the "whatCommand" property with the option "-spanProp whatCommand" (replacing "-spanType deleteCommand", or whatever). The result of training will be an Annotator that assigns some other span property (as specified by the "-output" option) to a document.

To summarize the training, you need to:
- specify the class (insert, delete, replace) for every training document, using some spanProperty (e.g. whatCommand)
- tell the ui what spanProperty you're training against (using "-spanProp whatCommand")
- tell the ui what learner to use, and make sure it's a learner that can handle non-binary data

After training, when the learned annotator is applied (e.g. using TestClassifier, or the ApplyAnnotator method in the ui package), it will add to every document its predicted class as the value of the span property specified by "-output".

C05 How do I communicate the k classes to OneVsAll? Does it simply take all declared span types in the training set? Or do I pass a set of labels?
The set of labels will be inferred from the data.

C06 If I wanted to run an Extractor Learner that used an experimental NameFE class as the feature extractor, how would I specify that on a command line? Or is it a matter of changing the source files and compiling it with the new one?
Follow the output of "java edu.cmu.minorthird.ui.TrainTestExtractor -help". If you've put NameFE on the classpath (which it's not, by default) then you can specify it
(a) with the option '-fe "new NameFE()"'
(b) in the gui, if you add the class name of NameFE to your selectableTypes.txt file
There's a bug in the UI which keeps you from inspecting the changed FE, but this does work.

If you want to parameterize NameFE, the simplest approach is to add the parameters to the constructor, so you can specify them directly in the -fe argument. If you look at the javadocs for util.CommandLineProcessor and have your NameFE implementation implement the CommandLineProcessor.Configurable interface, then you can pass in additional arguments on the command line as well.

C07 How can the supported token extractors be configured?
The supported token extractors (e.g. Recommended.TokenFE()) can be configured in the gui, or by first explicitly specifying them with -fe and then using one of the command line options, which you can discover by using '-fe Recommended.TokenFE() -help' or by looking at the javadocs.
All of the supported FEs need mixup that provides a specified type of annotation.

C08 Where would I look to get a better understanding of how the spans are built up?
Try text.learn.SpanFE.

C09 When I look at the features of each span (in the window I get when checking "displayDataSetBeforeLearning"), each span seems to have only a single token. Is it correct?
Many of the extraction learners work by reducing extraction to tagging, i.e., labeling each word with one label, like "inside a name" vs "outside a name".

C10 Is it true that there would be multiple tokens per span? Where is the code that translates the sentences from string into a set of tokens/spans?
Text is translated from strings to tokens in text.TextBase. A Span is basically just a sequence of tokens. The conversion from Spans to Instances is inside text.learn.SpanFE.

C11 While I have it working with a feature extractor that I write and compile, I currently have no way to really test this at all other than look at the final error values and see that they have changed.
I recommend (1) engineering as much of the FE process as possible in mixup, and using the ui.LabelViewer to check the results, and (2) using the database viewer to view the final results of extraction.

C12 How do I get additional details (weights, features, etc) in a name extraction test?
Use the options: -showResult -showTestDetails

C13 Must the feature extractor be declared as a serializable class, when building a serialized model and passing in a hand-coded feature extractor?
The feature extractor should indeed be serializable. If it's not, you can still perform cross-validation experiments, but an error will be (or at least should be) thrown when you try to write it out, e.g. with TrainExtractor.

Serialization only saves instance-dependent (non-static) data. It doesn't save the code associated with a class, so you'll need to have the feature extraction code on your classpath when you load it back in.
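The serialization behaviour described in C13 is standard Java and can be seen without MinorThird at all. The following is a minimal sketch using only java.io; ToyFE and PlainFE are hypothetical stand-ins for a hand-coded feature extractor and are not MinorThird classes. It shows that instance fields survive a round trip, static fields are not stored, and a class that does not implement Serializable fails at write time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSketch {

    // Hypothetical stand-in for a hand-coded feature extractor (not a MinorThird class).
    static class ToyFE implements Serializable {
        private static final long serialVersionUID = 1L;
        static int sharedCounter = 0; // static: never written to the stream
        int windowSize;               // instance field: written to the stream

        ToyFE(int windowSize) {
            this.windowSize = windowSize;
        }
    }

    // Hypothetical extractor that does not implement Serializable.
    static class PlainFE {
        int windowSize = 3;
    }

    // Serialize an object into an in-memory byte array.
    static byte[] toBytes(Object o) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        }
        return buf.toByteArray();
    }

    // Rebuild an object from bytes; its class must be on the classpath.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    // Returns false when writing fails because the class is not Serializable.
    static boolean canSerialize(Object o) throws IOException {
        try {
            toBytes(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        ToyFE fe = new ToyFE(3);
        ToyFE.sharedCounter = 42;
        byte[] saved = toBytes(fe);

        ToyFE.sharedCounter = 0; // imitate a fresh JVM, where statics start over
        ToyFE restored = (ToyFE) fromBytes(saved);

        System.out.println(restored.windowSize);         // prints 3
        System.out.println(ToyFE.sharedCounter);         // prints 0 (42 was not stored)
        System.out.println(canSerialize(new PlainFE())); // prints false
    }
}
```

In a genuinely separate JVM, fromBytes would throw ClassNotFoundException if the ToyFE class file were missing from the classpath, which is the situation the answer above warns about.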
It's probably possible to hack the mixup interpreter to serialize mixup code and dictionaries along with an extractor, if you really want to do that (if the dictionaries are pretty big, you might not want to save multiple copies!).

C14 How can I run an experiment to learn multiple types?
There are three ways to learn to extract multiple types:
- Define a spanProperty and pass that into TrainTestExtractor with the spanProp option, e.g.:
    % java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanProp inCapsBecause
- Use the -spanProp option with a comma-separated list of non-overlapping types, e.g.:
    % java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanProp inCapsBecauseStart,trueName
- Run ui.TrainExtractor multiple times and learn multiple extractors, which of course might extract overlapping spans.
In the first two cases what is learned is an extractor that inserts a new spanProperty (by default named "_prediction").

D. Other Issues

D01 How do I report bugs?
Bugs?!?! There are no bugs in this package ;-) But in case you really think you found one, please email William W. Cohen.