Labeling and Loading Data Tutorial

 

1. Labeling Data

UI programs load a collection of text documents as data. Documents may be a collection of files in a directory or one document per line in a single file. The collection of documents is stored in a TextBase. Annotations about these documents are stored in a corresponding TextLabels object. Each annotation asserts a category or property for a word, a document, or a subsequence of words (also known as a Span). TextLabels store information from many sources: they might hold annotations produced by human labelers (perhaps using a GUI tool like the TextBaseEditor), by a hand-written program, or by a learned program.

 

TextLabels can be loaded in two ways: 1) from a labels file, or 2) from embedded XML tags.

 

Example of text labels in a labels file:

addToType doc1 184 3 name

addToType doc1 189 11 name

addToType doc2 205 3 name

 

In a labels file, the first word of each line is the operation (here addToType), the second word is the document name, the third word is the starting token of the span, the fourth word is the length of the span, and the last word is the label.
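To make the line layout concrete, here is a minimal sketch in plain Java that splits such a line into its parts. It is not MinorThird's own loader; the class LabelsFileSketch and its SpanLabel holder are hypothetical names used only for illustration, and the sketch assumes every entry has exactly five whitespace-separated words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MinorThird: reads the five-word layout described above.
class LabelsFileSketch {
    /** One entry of a labels file: document name, span start, span length, label. */
    static final class SpanLabel {
        final String document;
        final int start;
        final int length;
        final String label;

        SpanLabel(String document, int start, int length, String label) {
            this.document = document;
            this.start = start;
            this.length = length;
            this.label = label;
        }
    }

    /** Parses one line such as "addToType doc1 184 3 name"; returns null for any other line. */
    static SpanLabel parseLine(String line) {
        String[] words = line.trim().split("\\s+");
        if (words.length != 5 || !words[0].equals("addToType")) {
            return null;
        }
        return new SpanLabel(words[1], Integer.parseInt(words[2]),
                Integer.parseInt(words[3]), words[4]);
    }

    /** Collects the entries from all recognized lines and skips the rest (such as blank lines). */
    static List<SpanLabel> parseAll(List<String> lines) {
        List<SpanLabel> entries = new ArrayList<>();
        for (String line : lines) {
            SpanLabel entry = parseLine(line);
            if (entry != null) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        SpanLabel s = parseLine("addToType doc1 184 3 name");
        System.out.println(s.document + " " + s.start + " " + s.length + " " + s.label);
    }
}
```

Returning null for unrecognized lines keeps the sketch tolerant of blank lines and of any other operations a real labels file might contain.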

 

Labels files are not meant to be created by hand. MinorThird includes a few GUI tools that let users view documents graphically, add or edit labels, and save their work in a labels file.

 

To label whole documents, for example when labeling emails as spam or real, try using the TextBaseLabeler:

java -Xmx500M edu.cmu.minorthird.text.gui.TextBaseLabeler DATA_DIRECTORY DATA.labels

where DATA_DIRECTORY is where your documents are stored and DATA.labels is where you would like to save your labels. If you would like MinorThird to load your hand-edited labels automatically, the labels file should have the same name as the directory. For example, if you have a directory named foo, name your labels file foo.labels.
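The naming convention can be sketched in a few lines of plain Java. The class LabelsFileNameSketch is a hypothetical helper, not part of MinorThird, and placing the labels file next to the data directory is an assumption of this sketch; the text above only fixes the file's name.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of MinorThird.
class LabelsFileNameSketch {
    /**
     * Returns DIR.labels for a data directory DIR.
     * Assumption: the labels file is a sibling of the directory.
     */
    static Path labelsFileFor(Path dataDirectory) {
        String name = dataDirectory.getFileName().toString();
        return dataDirectory.resolveSibling(name + ".labels");
    }

    public static void main(String[] args) {
        System.out.println(labelsFileFor(Paths.get("foo")));
    }
}
```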

 

A labeling window will appear.

To select a document, click on it in the top panel; the text of that document will appear in the bottom panel. To label the currently selected document, pick a label from the pull-down menu (if the label you want is on that menu) or type a new label next to New class. Once you have picked your label, press the Accept class button and move on to the next document you would like to label.

 

If you look at the documentation index, you will see two other classes for editing TextBases, called DebugMixup and EditLabels. These classes are useful when you already have some results and want to hand-correct them, but they will not help you label unlabeled documents.

 

If you would like to hand-label unlabeled extraction data (such as names or places), inserting embedded XML tags is probably your best option.

 

Example of a document with embedded labels:

The <location>Pittsburgh</location> Steelers headed by coach <name>Bill Cowher</name> are going to <location>Cleveland</location> to play the Browns.

In this type of document, a labeled span lies between an opening < > marker and its closing </ > marker. The label is the word inside the markers.
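As an illustration of how such markers can be read, the following sketch pulls out label and span pairs with a regular expression. It is plain Java and not MinorThird's own loader; the class EmbeddedLabelSketch is a hypothetical name, and the sketch assumes tags are simple words and are not nested inside one another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of MinorThird. Assumes flat (non-nested) tags.
class EmbeddedLabelSketch {
    // Group 1 is the label, group 2 the labeled text; \1 forces the closing tag to match the opening one.
    private static final Pattern TAGGED = Pattern.compile("<(\\w+)>(.*?)</\\1>");

    /** Returns "label=text" entries in the order they appear in the document. */
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TAGGED.matcher(text);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    /** Returns the document text with the markers removed and the labeled text kept. */
    static String stripTags(String text) {
        return TAGGED.matcher(text).replaceAll("$2");
    }

    public static void main(String[] args) {
        String doc = "The <location>Pittsburgh</location> Steelers headed by coach <name>Bill Cowher</name>";
        System.out.println(extract(doc));
        System.out.println(stripTags(doc));
    }
}
```

A regular expression is enough for flat tags like the ones in the example; nested or overlapping spans would need a real XML parser.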

 

To get a better feel for what is happening in MinorThird, you can look at the javadocs. To build the javadocs, type: % ant javadoc. An older version of the docs is also available on William's website or at http://minorthird.sourceforge.net/javadoc

 

Notes:

  • What you are able to mark up is defined by the TextLabels API.
  • spanTypes are the "most central" construct; tokenProps and others are not as well supported (saving, viewing, etc.).
  • The metaphor is that the toolkit is a programming language for annotations.
    • As a programming language, it needs subroutines and libraries. These are invoked with textLabels.require(), return their output by adding annotations, and report completion/status information through textLabels.isAnnotatedBy().

 

2. How to Load Your Labeled Data

A) Simple loading: specify the location of all data in the GUI, or type -labels PATH/DATA_DIRECTORY on the command line.

Note: if you give a .labels file the same name as your directory, it will load automatically, as will embedded XML tags.

B) More advanced users (who have a lot of data) may want to use a repository structure. To do this, create a data.properties file in the config directory that points to a directory called repository. The repository directory must then contain three folders: data, labels, and loaders. For more information on how to set up a repository, contact the current maintainer of MinorThird listed on the front page.
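Before pointing data.properties at a repository, it can help to confirm that the three folders exist. The sketch below is plain Java with a hypothetical class name, RepositoryLayoutSketch; it only checks the folder names given above and does not read or write data.properties, whose contents are not covered here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MinorThird.
class RepositoryLayoutSketch {
    // Folder names taken from the description of the repository structure.
    static final String[] REQUIRED = {"data", "labels", "loaders"};

    /** Returns the names of required folders that are missing under the repository directory. */
    static List<String> missingFolders(Path repository) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isDirectory(repository.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The default path "repository" is only a placeholder for this sketch.
        Path repo = Paths.get(args.length > 0 ? args[0] : "repository");
        System.out.println("Missing folders: " + missingFolders(repo));
    }
}
```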

 
