edu.cmu.minorthird.text
Class RegexTokenizer

java.lang.Object
  extended by edu.cmu.minorthird.text.RegexTokenizer
All Implemented Interfaces:
Tokenizer

public class RegexTokenizer
extends java.lang.Object
implements Tokenizer

Maintains information about what's in a set of documents. Specifically, this contains a set of character sequences (TextTokens) drawn from a set of containing documents, typically found by tokenization.


Field Summary
 java.lang.String regexPattern
           
static java.lang.String standardTokenRegexPattern
           
static java.lang.String TOKEN_REGEX_DEFAULT_VALUE
           
static java.lang.String TOKEN_REGEX_PROP
          Name of the property controlling how text is split into tokens.
 
Constructor Summary
RegexTokenizer()
           
RegexTokenizer(java.lang.String pattern)
           
 
Method Summary
 TextToken[] splitIntoTokens(Document document)
          Tokenize a document.
 java.lang.String[] splitIntoTokens(java.lang.String string)
          Tokenize a string.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TOKEN_REGEX_PROP

public static final java.lang.String TOKEN_REGEX_PROP
Name of the property controlling how text is split into tokens.

See Also:
Constant Field Values

TOKEN_REGEX_DEFAULT_VALUE

public static final java.lang.String TOKEN_REGEX_DEFAULT_VALUE
See Also:
Constant Field Values

standardTokenRegexPattern

public static java.lang.String standardTokenRegexPattern

regexPattern

public java.lang.String regexPattern

Constructor Detail

RegexTokenizer

public RegexTokenizer()

RegexTokenizer

public RegexTokenizer(java.lang.String pattern)

Method Detail

splitIntoTokens

public java.lang.String[] splitIntoTokens(java.lang.String string)
Tokenize a string.

Specified by:
splitIntoTokens in interface Tokenizer
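
This page does not describe how the pattern is applied, so the following is only a minimal sketch of regex-driven string tokenization using java.util.regex. The class name RegexTokenizerSketch, the method tokenize, and the pattern EXAMPLE_PATTERN are all hypothetical; the pattern is not claimed to be the value of TOKEN_REGEX_DEFAULT_VALUE, and the sketch assumes each token is captured by group 1 of the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only; this is NOT the minorthird RegexTokenizer.
class RegexTokenizerSketch {

    // Hypothetical pattern (an assumption): a run of digits, a run of
    // letters, or a single non-word character, with surrounding
    // whitespace skipped. The token itself is capturing group 1.
    static final String EXAMPLE_PATTERN = "\\s*([0-9]+|[a-zA-Z]+|\\W)\\s*";

    // Scans the text left to right and collects group 1 of each match.
    // Assumes the supplied pattern defines at least one capturing group.
    static String[] tokenize(String pattern, String text) {
        Matcher matcher = Pattern.compile(pattern).matcher(text);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group(1));
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize(EXAMPLE_PATTERN, "Hello, world 42!")) {
            System.out.println(token);
        }
    }
}
```

With the example pattern, punctuation becomes its own token and whitespace between tokens is discarded. A different pattern passed as the first argument would change both behaviors.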

splitIntoTokens

public TextToken[] splitIntoTokens(Document document)
Tokenize a document.

Specified by:
splitIntoTokens in interface Tokenizer
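
The Document overload returns TextToken objects rather than plain strings. The fields of TextToken are not shown on this page, so the sketch below only illustrates one plausible idea: recording where each token sits in the source text instead of copying its characters. The names TokenSpanSketch, Span, and spans are hypothetical, and the sketch again assumes the token is captured by group 1 of the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only; it does not use the minorthird Document
// or TextToken types, whose details are not given on this page.
class TokenSpanSketch {

    // Hypothetical value type: start offset and length of one token.
    static final class Span {
        final int start;
        final int length;

        Span(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    // Collects the offsets of capturing group 1 for each successive match.
    // Assumes the supplied pattern defines at least one capturing group.
    static List<Span> spans(String pattern, String text) {
        Matcher matcher = Pattern.compile(pattern).matcher(text);
        List<Span> result = new ArrayList<>();
        while (matcher.find()) {
            int start = matcher.start(1);
            result.add(new Span(start, matcher.end(1) - start));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "ab 12";
        // Same hypothetical pattern as an assumption, not the library default.
        for (Span s : spans("\\s*([0-9]+|[a-zA-Z]+|\\W)\\s*", text)) {
            System.out.println(s.start + "+" + s.length + ": "
                    + text.substring(s.start, s.start + s.length));
        }
    }
}
```

Keeping offsets rather than substrings lets each token be mapped back to its position in the containing text.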