Labeling and Loading Data Tutorial
1. Labeling Data
UI programs load a collection of text documents as data. Documents may be a collection of files in a directory or one document per line in a single file. The collection of documents is stored in a TextBase. Annotations about these documents are stored in a corresponding TextLabels object. Each annotation asserts a category or property for a word, a document, or a subsequence of words (aka a Span).
TextLabels store information from many sources: they might hold annotations produced by human labelers (perhaps using a GUI tool like the TextBaseEditor), annotations produced by a hand-written program, or annotations produced by a learned program.
TextLabels can be loaded in two ways: 1) a labels file, 2) embedded XML tags.
Example of text labels in a labels file:
addToType doc1 184 3 name
addToType doc1 189 11 name
addToType doc2 205 3 name
In a labels file the first word is the operation (here addToType), the second word is the document name, the third word is the starting token of the span, the fourth word is the length of the span, and the last word is the label.
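The line layout described above can be sketched as a small parser. This is only an illustration in plain Java, not MinorThird's own loader: the class LabelsLineSketch and its SpanLabel holder are hypothetical names, and the sketch handles addToType lines only, treating the two numbers simply as the start and length given in the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; it is NOT part of MinorThird.
class LabelsLineSketch {

    /** One parsed line: document name, span start, span length, label. */
    static final class SpanLabel {
        final String docName;
        final int start;
        final int length;
        final String label;

        SpanLabel(String docName, int start, int length, String label) {
            this.docName = docName;
            this.start = start;
            this.length = length;
            this.label = label;
        }

        @Override
        public String toString() {
            return docName + " start=" + start + " length=" + length + " label=" + label;
        }
    }

    /** Parses one line of the form: addToType DOC START LENGTH LABEL. */
    static SpanLabel parseLine(String line) {
        String[] words = line.trim().split("\\s+");
        if (words.length != 5 || !words[0].equals("addToType")) {
            throw new IllegalArgumentException("Not an addToType line: " + line);
        }
        return new SpanLabel(words[1],
                Integer.parseInt(words[2]),
                Integer.parseInt(words[3]),
                words[4]);
    }

    /** Parses every non-blank line, preserving file order. */
    static List<SpanLabel> parseAll(List<String> lines) {
        List<SpanLabel> parsed = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                parsed.add(parseLine(line));
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // The three example lines from the tutorial.
        List<String> lines = Arrays.asList(
                "addToType doc1 184 3 name",
                "addToType doc1 189 11 name",
                "addToType doc2 205 3 name");
        for (SpanLabel s : parseAll(lines)) {
            System.out.println(s);
        }
    }
}
```

A reader like this is mainly useful for inspecting or sanity-checking a labels file; creating and editing labels is still best left to the GUI tools.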
Labels files are not meant to be created by hand. There are a few GUI tools in MinorThird that allow users to view documents graphically, add or edit labels, and save their work in a labels file.
To label whole documents, such as when labeling emails as spam or real, try using the TextBaseLabeler:
java -Xmx500M edu.cmu.minorthird.text.gui.TextBaseLabeler DATA_DIRECTORY DATA.labels
where DATA_DIRECTORY is where your documents are stored and DATA.labels is where you would like to save your labels. If you would like MinorThird to automatically load your hand-edited labels, the name of the labels file should be the same as the directory. For example, if you have a directory named foo, you will want to name your labels file foo.labels.
A window that looks like this will appear:
To select a document, click on it in the top panel and the text from that document will appear in the bottom panel. To label the currently selected document, pick a label from the pull-down menu (if the label you would like is on that menu) or type in a new label next to New class. Once you have picked your label, simply press the Accept class button and go to the next document you would like to label.
If you look at the documentation index, you will see that there are two other classes for editing TextBases called DebugMixup and EditLabels. These classes are useful when you have some results and are interested in hand-correcting them, but will not help you label unlabeled documents.
If you would like to hand-label unlabeled extraction data (such as names or places), inserting embedded XML tags is probably your best option.
Example of a document with embedded labels:
The <location>
In this type of document a labeled span lies between an opening < > marker and its closing </ > marker. The label is the word inside the markers.
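To make the tag convention concrete, here is a minimal sketch in plain Java that pulls labeled spans out of a tagged string with a regular expression. It is not how MinorThird itself parses documents; the class EmbeddedTagSketch and the sample sentence are made up for illustration, and the sketch assumes tags are not nested and carry no attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is NOT part of MinorThird.
class EmbeddedTagSketch {

    // Matches <label>text</label>. The backreference \1 requires the closing
    // marker to repeat the label of the opening marker.
    private static final Pattern TAGGED =
            Pattern.compile("<(\\w+)>(.*?)</\\1>", Pattern.DOTALL);

    /** Returns "label=text" for each tagged span, in document order. */
    static List<String> labeledSpans(String document) {
        List<String> spans = new ArrayList<>();
        Matcher m = TAGGED.matcher(document);
        while (m.find()) {
            spans.add(m.group(1) + "=" + m.group(2));
        }
        return spans;
    }

    /** Removes the markers and keeps the text they enclose. */
    static String stripTags(String document) {
        return TAGGED.matcher(document).replaceAll("$2");
    }

    public static void main(String[] args) {
        // Made-up sample text, not taken from the tutorial's data.
        String doc = "Meet <name>Ada</name> in <location>London</location>.";
        System.out.println(labeledSpans(doc));
        System.out.println(stripTags(doc));
    }
}
```

Because the pattern is non-greedy and ignores nesting, a tag wrapped inside another tag would be returned as part of the outer span's text rather than as its own span.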
To get a better feel for what is happening in MinorThird, you can look at the javadocs. To construct the javadocs type: % ant javadoc. An older version of the docs is also available on William's website or at http://minorthird.sourceforge.net/javadoc
Notes:
How to Load your Labeled Data
A) Simple Loading: specify the location of all data in the GUI or by typing -labels PATH/DATA_DIRECTORY on the command line
   Note: if you name a .labels file the same as your directory, it will load automatically, as will embedded XML tags
B) More advanced users (who would have a lot of data) may want to use a repository structure. To do this you must create a data.properties file in the config directory which points to a directory called repository. The repository directory must then contain three folders: data, labels, and loaders. For more information on how to set up a repository, contact the current maintainer of MinorThird listed on the front page.
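The folder layout described in B) can be created with a few lines of plain Java. This is a sketch only: the class RepositoryLayoutSketch is a hypothetical name, the parent directory is whatever you pass in, and the sketch deliberately does not write data.properties, because the property names it needs are not given here.

```java
import java.io.File;

// Hypothetical helper for illustration; it is NOT part of MinorThird.
class RepositoryLayoutSketch {

    /** Creates parent/repository with the data, labels, and loaders folders. */
    static File createLayout(File parent) {
        File repository = new File(parent, "repository");
        for (String name : new String[] {"data", "labels", "loaders"}) {
            File dir = new File(repository, name);
            // mkdirs() returns false when the directory already exists,
            // so also accept an existing directory.
            if (!dir.mkdirs() && !dir.isDirectory()) {
                throw new IllegalStateException("Could not create " + dir);
            }
        }
        return repository;
    }

    public static void main(String[] args) {
        // Assumption: the first argument is the parent directory, default ".".
        File parent = new File(args.length > 0 ? args[0] : ".");
        System.out.println("Created " + createLayout(parent).getAbsolutePath());
    }
}
```

The data.properties file in the config directory still has to be pointed at the resulting repository directory by hand.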