edu.cmu.minorthird.text
Class TextBaseLoader

java.lang.Object
  extended by edu.cmu.minorthird.text.TextBaseLoader

public class TextBaseLoader
extends java.lang.Object

Configurable Text Loader.

Usage: Configure a loader object using one of the constructors, then call load(File) with a File that points to your data (a single file or a directory). load(File) returns the TextBase object for the data.

 Default: 
 TextBaseLoader tbl = new TextBaseLoader();
 Loads One Document per File and uses embedded labels 
 ------------------------------------------------------
 Specify Document Style
 TextBaseLoader tbl = new TextBaseLoader(TextBaseLoader.DOC_PER_LINE); // Loads One document per line
 TextBaseLoader tbl = new TextBaseLoader(TextBaseLoader.DOC_PER_FILE); // Loads One document per file
 ------------------------------------------------------
 Specify document type and whether to use embedded Labels
 // ex: Loads one doc per line and ignores embedded labels
 TextBaseLoader tbl = new TextBaseLoader(TextBaseLoader.DOC_PER_LINE, false); 
 ------------------------------------------------------
 Specify document type, whether to use embedded labels, and whether to recurse directories
 // ex: Loads one doc per file, uses embedded labels, and recurses directories
 TextBaseLoader tbl = new TextBaseLoader(TextBaseLoader.DOC_PER_FILE, true, true); 
 

In ALL cases use: tbl.load(FILE);
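The difference between the two document styles can be illustrated with a small self-contained sketch. It does not use or reproduce the MinorThird implementation: the class DocumentStyleSketch, the constant values, the document ID scheme, and the choice to skip blank lines are all assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in that only illustrates the two document styles. */
final class DocumentStyleSketch {
    // Placeholder values; the real constant values are not stated in this page.
    static final int DOC_PER_LINE = 0;
    static final int DOC_PER_FILE = 1;

    /** Splits the text of one file into documents keyed by an illustrative ID. */
    static Map<String, String> split(String fileName, String content, int style) {
        Map<String, String> docs = new LinkedHashMap<>();
        if (style == DOC_PER_FILE) {
            // The whole file becomes a single document.
            docs.put(fileName, content);
            return docs;
        }
        // DOC_PER_LINE: every non-blank line becomes its own document.
        String[] lines = content.split("\\R");
        for (int i = 0; i < lines.length; i++) {
            if (!lines[i].isEmpty()) {
                docs.put(fileName + "@line:" + (i + 1), lines[i]);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        String content = "first document\nsecond document\n";
        System.out.println(split("notes.txt", content, DOC_PER_LINE).keySet());
        System.out.println(split("notes.txt", content, DOC_PER_FILE).keySet());
    }
}
```

With the sample content above, the per-line style yields two documents and the per-file style yields one.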

Author:
William Cohen, Kevin Steppe, Cameron Williams, Quinten Mercer

Field Summary
static int DIRECTORY_NAME
           
static int DOC_PER_FILE
           
static int DOC_PER_LINE
           
static int FILE_NAME
           
static boolean IGNORE_XML
           
static int IN_FILE
           
static int NONE
           
static boolean USE_XML
           
 
Constructor Summary
TextBaseLoader()
          Default constructor.
TextBaseLoader(int documentStyle)
          Specifies the document style to use, but leaves all other properties to their defaults.
TextBaseLoader(int documentStyle, boolean use_markup)
           
TextBaseLoader(int documentStyle, boolean use_markup, boolean recurseDirectories)
           
TextBaseLoader(int documentStyle, int docID)
          Deprecated.  
TextBaseLoader(int documentStyle, int docID, boolean use_markup)
          Deprecated.  
TextBaseLoader(int documentStyle, int docID, int groupID, int categoryID)
          Deprecated.  
TextBaseLoader(int documentStyle, int docID, int groupID, int categoryID, boolean labelsInFile, boolean recurseDirectories)
          Deprecated.  
 
Method Summary
 MutableTextLabels getLabels()
          Gets the labeling generated by tags in the data file.
protected  java.lang.String labelLine(java.lang.String line, java.lang.StringBuffer docBuffer, java.lang.String docId, java.util.List<edu.cmu.minorthird.text.TextBaseLoader.CharSpan> spanList)
          Takes a single line of text.
 MutableTextBase load(java.io.File dataLocation)
          Load data from the given location according to the configuration and whether the location is a directory or not. Calling load a second time will load into the same text base (thus the second call returns documents from both the first and second locations).
 MutableTextBase load(java.io.File dataLocation, Tokenizer tok)
          Load data from the given location according to the configuration and whether the location is a directory or not. Calling load a second time will load into the same text base (thus the second call returns documents from both the first and second locations).
static MutableTextLabels loadDirOfTaggedFiles(java.io.File dir)
          Deprecated.  
static TextBase loadDocPerLine(java.io.File file, boolean hasGroupID)
          Deprecated.  
 void loadTaggedFiles(TextBase base, java.io.File dir)
          Deprecated.  
 MutableTextBase loadWordPerLineFile(java.io.File file)
          Load a document where each word has its own line and is followed by three descriptor words.
 void setDocumentStyle(int style)
          Sets the document style for loaded documents.
 void setLabelsInFile(boolean b)
          Sets whether the loader should use or ignore XML markup in the files.
 void setRecurseDirectories(boolean rec)
          Sets whether the loader should recurse directories when loading docs.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NONE

public static final int NONE
See Also:
Constant Field Values

DIRECTORY_NAME

public static final int DIRECTORY_NAME
See Also:
Constant Field Values

FILE_NAME

public static final int FILE_NAME
See Also:
Constant Field Values

IN_FILE

public static final int IN_FILE
See Also:
Constant Field Values

DOC_PER_LINE

public static final int DOC_PER_LINE
See Also:
Constant Field Values

DOC_PER_FILE

public static final int DOC_PER_FILE
See Also:
Constant Field Values

USE_XML

public static final boolean USE_XML
See Also:
Constant Field Values

IGNORE_XML

public static final boolean IGNORE_XML
See Also:
Constant Field Values
Constructor Detail

TextBaseLoader

public TextBaseLoader()
Default constructor. It will load each file as a single document, use XML markup, and NOT recurse directories.


TextBaseLoader

public TextBaseLoader(int documentStyle)
Specifies the document style to use, but leaves all other properties to their defaults.


TextBaseLoader

public TextBaseLoader(int documentStyle,
                      boolean use_markup)

TextBaseLoader

public TextBaseLoader(int documentStyle,
                      boolean use_markup,
                      boolean recurseDirectories)

TextBaseLoader

public TextBaseLoader(int documentStyle,
                      int docID)
Deprecated. 


TextBaseLoader

public TextBaseLoader(int documentStyle,
                      int docID,
                      boolean use_markup)
Deprecated. 


TextBaseLoader

public TextBaseLoader(int documentStyle,
                      int docID,
                      int groupID,
                      int categoryID)
Deprecated. 


TextBaseLoader

public TextBaseLoader(int documentStyle,
                      int docID,
                      int groupID,
                      int categoryID,
                      boolean labelsInFile,
                      boolean recurseDirectories)
Deprecated. 

Method Detail

load

public MutableTextBase load(java.io.File dataLocation)
                     throws java.io.IOException,
                            java.text.ParseException
Load data from the given location according to the configuration and whether the location is a directory or not. Calling load a second time will load into the same text base (thus the second call returns documents from both the first and second locations). Use setTextBase(null) to reset the text base.

Parameters:
dataLocation - File representation of location (single file or directory)
Returns:
the loaded TextBase
Throws:
java.io.IOException - problem reading the file
java.text.ParseException - problem with the XML of the internal tagging
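The accumulating behaviour of repeated load calls can be pictured with the following sketch. It is a hypothetical stand-in and not MinorThird code: AccumulatingLoaderSketch, its map-based "text base", and the reset() method (standing in for the setTextBase(null) call mentioned above) are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in showing how repeated loads share one text base. */
final class AccumulatingLoaderSketch {
    // Illustrative "text base": document ID mapped to document text.
    private Map<String, String> textBase = new LinkedHashMap<>();

    /** Adds one document and returns the shared text base, as repeated loads would. */
    Map<String, String> load(String docId, String text) {
        textBase.put(docId, text);
        return textBase;
    }

    /** Plays the role of resetting the text base: later loads start from empty. */
    void reset() {
        textBase = new LinkedHashMap<>();
    }

    public static void main(String[] args) {
        AccumulatingLoaderSketch loader = new AccumulatingLoaderSketch();
        loader.load("first", "text from the first location");
        // The second call returns documents from both locations.
        System.out.println(loader.load("second", "text from the second location").keySet());
        loader.reset();
        System.out.println(loader.load("third", "text after a reset").keySet());
    }
}
```

If separate text bases are wanted for separate locations, the simplest approach under these assumptions is to reset between loads or to use a fresh loader for each location.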

load

public MutableTextBase load(java.io.File dataLocation,
                            Tokenizer tok)
                     throws java.io.IOException,
                            java.text.ParseException
Load data from the given location according to the configuration and whether the location is a directory or not. Calling load a second time will load into the same text base (thus the second call returns documents from both the first and second locations). Use setTextBase(null) to reset the text base.

Parameters:
dataLocation - File representation of location (single file or directory)
Returns:
the loaded TextBase
Throws:
java.io.IOException - problem reading the file
java.text.ParseException - problem with the XML of the internal tagging

loadWordPerLineFile

public MutableTextBase loadWordPerLineFile(java.io.File file)
                                    throws java.io.IOException,
                                           java.io.FileNotFoundException
Load a document where each word has its own line and is followed by three descriptor words. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag.

Throws:
java.io.IOException
java.io.FileNotFoundException
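The four-column layout described for loadWordPerLineFile can be sketched as a small standalone parser. This is not the MinorThird implementation: WordPerLineSketch and TokenRow are hypothetical names, and the whitespace separator, the skipping of blank lines, and the sample rows are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical parser for lines of the form: word POS-tag chunk-tag entity-tag. */
final class WordPerLineSketch {
    /** One parsed line. */
    static final class TokenRow {
        final String word;
        final String pos;
        final String chunk;
        final String entity;

        TokenRow(String word, String pos, String chunk, String entity) {
            this.word = word;
            this.pos = pos;
            this.chunk = chunk;
            this.entity = entity;
        }
    }

    /** Parses the lines of one file; blank lines are skipped (an assumption). */
    static List<TokenRow> parse(List<String> lines) {
        List<TokenRow> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Assumes the four items are separated by whitespace.
            String[] fields = trimmed.split("\\s+");
            if (fields.length != 4) {
                throw new IllegalArgumentException("expected 4 fields: " + line);
            }
            rows.add(new TokenRow(fields[0], fields[1], fields[2], fields[3]));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Illustrative rows only; the tag names are made up for this sketch.
        List<String> lines = Arrays.asList("Pittsburgh NNP I-NP I-LOC", "", "rains VBZ I-VP O");
        for (TokenRow row : parse(lines)) {
            System.out.println(row.word + " -> " + row.entity);
        }
    }
}
```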

setLabelsInFile

public void setLabelsInFile(boolean b)
Sets whether the loader should use or ignore XML markup in the files.

Valid values are: TextBaseLoader.IGNORE_XML and TextBaseLoader.USE_XML


setDocumentStyle

public void setDocumentStyle(int style)
Sets the document style for loaded documents.

Valid styles are: TextBaseLoader.DOC_PER_LINE and TextBaseLoader.DOC_PER_FILE


setRecurseDirectories

public void setRecurseDirectories(boolean rec)
Sets whether the loader should recurse directories when loading docs.
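What recursing directories means for file collection can be sketched with the standard java.nio.file API. This is a hypothetical illustration and not the loader's actual code; RecurseDirectoriesSketch and collectFiles are made-up names, and sorting the result is an arbitrary choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper contrasting flat and recursive file collection. */
final class RecurseDirectoriesSketch {
    /** Returns the regular files under dir, descending into subdirectories only if recurse is true. */
    static List<Path> collectFiles(Path dir, boolean recurse) {
        // Files.walk visits the whole tree; Files.list visits direct children only.
        try (Stream<Path> entries = recurse ? Files.walk(dir) : Files.list(dir)) {
            return entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Under these assumptions, a file in a subdirectory is returned only when recurse is true.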


getLabels

public MutableTextLabels getLabels()
Gets the labeling generated by tags in the data file.


loadDirOfTaggedFiles

public static MutableTextLabels loadDirOfTaggedFiles(java.io.File dir)
                                              throws java.text.ParseException,
                                                     java.io.IOException
Deprecated. 

Loads one document per file in a directory; labels are embedded in the data as XML tags. NB: Don't use this if the data isn't labeled - it will remove things that look like XML tags, which could cause problems. Returns the TextLabels object; the text base is embedded in it.

Throws:
java.text.ParseException
java.io.IOException

loadTaggedFiles

public void loadTaggedFiles(TextBase base,
                            java.io.File dir)
                     throws java.io.IOException,
                            java.io.FileNotFoundException
Deprecated. 

Throws:
java.io.IOException
java.io.FileNotFoundException

loadDocPerLine

public static TextBase loadDocPerLine(java.io.File file,
                                      boolean hasGroupID)
                               throws java.text.ParseException,
                                      java.io.IOException
Deprecated. 

Throws:
java.text.ParseException
java.io.IOException

labelLine

protected java.lang.String labelLine(java.lang.String line,
                                     java.lang.StringBuffer docBuffer,
                                     java.lang.String docId,
                                     java.util.List<edu.cmu.minorthird.text.TextBaseLoader.CharSpan> spanList)
                              throws java.text.ParseException
Takes a single line of text. Uses the markupPattern field to remove labelings (which must be XML-styled). These labelings are added to the span list.

Parameters:
line - a single line of text whose labels are to be parsed
spanList - list of span labelings
Returns:
a String with the labelings removed
Throws:
java.text.ParseException - improper xml format will cause a parse exception
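The idea behind labelLine (strip XML-style tags from a line and record the labeled character spans) can be sketched as follows. This is a simplified illustration and not the actual method: LabelLineSketch, Span, and the TAG pattern are hypothetical, offsets are relative to the stripped line rather than to a document buffer, tags are assumed to open and close on the same line, and an unchecked IllegalArgumentException is thrown where the real method declares java.text.ParseException.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: removes XML-style tags and records the spans they enclosed. */
final class LabelLineSketch {
    /** A labeled character range [start, end) in the stripped text. */
    static final class Span {
        final String label;
        final int start;
        final int end;

        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed tag syntax: <name> or </name>, without attributes.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z_][\\w.-]*)>");

    /** Returns the line with tags removed; closed spans are appended to spanList. */
    static String strip(String line, List<Span> spanList) {
        StringBuilder out = new StringBuilder();
        Deque<Span> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(line);
        int last = 0;
        while (m.find()) {
            out.append(line, last, m.start()); // copy the text before the tag
            last = m.end();
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(new Span(name, out.length(), -1)); // opening tag
            } else {
                if (open.isEmpty() || !open.peek().label.equals(name)) {
                    throw new IllegalArgumentException("unbalanced tag: " + m.group());
                }
                spanList.add(new Span(name, open.pop().start, out.length()));
            }
        }
        out.append(line, last, line.length());
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("unclosed tag: " + open.peek().label);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        System.out.println(strip("Call <name>William Cohen</name> today", spans));
        System.out.println(spans.get(0).label + " " + spans.get(0).start + " " + spans.get(0).end);
    }
}
```

Note that any text that merely looks like a tag is removed as well, which is why unlabeled data should be loaded with markup ignored.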