edu.cmu.minorthird.text
Class RegexTokenizer
java.lang.Object
edu.cmu.minorthird.text.RegexTokenizer
- All Implemented Interfaces:
- Tokenizer
public class RegexTokenizer
- extends java.lang.Object
- implements Tokenizer
Maintains information about the contents of a set of documents.
Specifically, this holds a set of character sequences (TextTokens)
drawn from a set of containing documents, typically found by
tokenization.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TOKEN_REGEX_PROP
public static final java.lang.String TOKEN_REGEX_PROP
- How to split tokens up
- See Also:
- Constant Field Values
TOKEN_REGEX_DEFAULT_VALUE
public static final java.lang.String TOKEN_REGEX_DEFAULT_VALUE
- See Also:
- Constant Field Values
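The two constants above suggest a property name paired with a fallback pattern. The following is a minimal sketch of that kind of lookup, not the class's actual code: the property key, the default pattern, and the class and method names are all hypothetical, since the real constant values are not shown on this page.

```java
import java.util.Properties;

class TokenRegexPropertySketch {
    // Hypothetical stand-ins: the real values of TOKEN_REGEX_PROP and
    // TOKEN_REGEX_DEFAULT_VALUE are not given in this page.
    static final String ASSUMED_PROP = "example.token.regex";
    static final String ASSUMED_DEFAULT = "\\s*([0-9]+|[a-zA-Z]+|\\W)\\s*";

    // Return the configured pattern, or the assumed default when the key is absent.
    static String resolvePattern(Properties props) {
        return props.getProperty(ASSUMED_PROP, ASSUMED_DEFAULT);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolvePattern(props)); // falls back to the assumed default

        props.setProperty(ASSUMED_PROP, "\\s*(\\S+)\\s*");
        System.out.println(resolvePattern(props)); // uses the configured pattern
    }
}
```

Whether RegexTokenizer reads its pattern this way is an assumption; the sketch only illustrates the common "property with default" idiom.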
standardTokenRegexPattern
public static java.lang.String standardTokenRegexPattern
regexPattern
public java.lang.String regexPattern
RegexTokenizer
public RegexTokenizer()
RegexTokenizer
public RegexTokenizer(java.lang.String pattern)
splitIntoTokens
public java.lang.String[] splitIntoTokens(java.lang.String string)
- Tokenize a string.
- Specified by:
splitIntoTokens
in interface Tokenizer
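As a rough illustration of regex-driven string tokenization, the sketch below uses only java.util.regex and collects the first capture group of each successive match. The pattern shown is an assumption chosen for the example, and RegexSplitSketch is a hypothetical class, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSplitSketch {
    // Assumed pattern for illustration: runs of digits, runs of letters,
    // or a single non-word character, with surrounding whitespace skipped.
    static final String ASSUMED_PATTERN = "\\s*([0-9]+|[a-zA-Z]+|\\W)\\s*";

    // Collect capture group 1 of every successive match as one token.
    static String[] splitIntoTokens(String pattern, String string) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(pattern).matcher(string);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : splitIntoTokens(ASSUMED_PATTERN, "Hello, world 42!")) {
            System.out.println(token);
        }
    }
}
```

With this assumed pattern, the input "Hello, world 42!" yields the tokens Hello, a comma, world, 42, and an exclamation mark. A different pattern passed to the RegexTokenizer(String) constructor would presumably change how the text is split.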
splitIntoTokens
public TextToken[] splitIntoTokens(Document document)
- Tokenize a document.
- Specified by:
splitIntoTokens
in interface Tokenizer
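Tokenizing a document differs from tokenizing a string in that each token keeps its position in the source text. The sketch below shows one way to record offsets with java.util.regex. Span is a hypothetical stand-in for TextToken, a plain String stands in for Document, and the pattern is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetTokenSketch {
    // Hypothetical stand-in for TextToken: a token's position in its document text.
    static final class Span {
        final int start;
        final int length;

        Span(int start, int length) {
            this.start = start;
            this.length = length;
        }

        // Recover the token text from the document it was found in.
        String textIn(String documentText) {
            return documentText.substring(start, start + length);
        }
    }

    // Record the offset and length of capture group 1 for every match.
    static List<Span> tokenize(String pattern, String documentText) {
        List<Span> spans = new ArrayList<>();
        Matcher m = Pattern.compile(pattern).matcher(documentText);
        while (m.find()) {
            spans.add(new Span(m.start(1), m.end(1) - m.start(1)));
        }
        return spans;
    }

    public static void main(String[] args) {
        String doc = "ab 12";
        for (Span s : tokenize("\\s*([0-9]+|[a-zA-Z]+|\\W)\\s*", doc)) {
            System.out.println(s.start + "+" + s.length + ": " + s.textIn(doc));
        }
    }
}
```

Storing offsets rather than copied strings lets a token refer back to the exact region of its containing document, which fits the class description's mention of character sequences drawn from containing documents.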