edu.cmu.minorthird.util
Class LineProcessingUtil

java.lang.Object
  extended by edu.cmu.minorthird.util.LineProcessingUtil

public class LineProcessingUtil
extends java.lang.Object

Line-processing utilities: regular-expression matching, appending features to a StringBuffer in svmformat, and similar helpers.

Author:
Vitor R. Carvalho (vitor [at] cs..cmu...)

Constructor Summary
LineProcessingUtil()
           
 
Method Summary
static void addFeature(java.lang.String line, java.lang.String regexp, java.lang.String featureName, java.lang.StringBuffer features_out)
          If the regexp matches a substring of the line, appends " featureName=1" to the string buffer. Useful for producing external datasets in Minorthird format.
static double AtoZPercentage(java.lang.String line)
          Returns the percentage of A-Z or a-z characters in a line
static java.lang.String[] getMessageLines(java.lang.String tmp)
          Method to split a message (string format) into lines
static int indentNumber(java.lang.String line)
           
static double indentPercentage(java.lang.String line)
          Returns the percentage of tab characters in a line
static boolean lineMatcher(java.lang.String patternStr, java.lang.String tmpstr)
          Returns true if the input, or any substring of it, matches the pattern.
static int numberOfMatches(java.lang.String expression, java.lang.String line)
           
static double punctuationPercentage(java.lang.String line)
          Returns the percentage of punctuation (\p{Punct}) characters in a line
static TextLabels readBsh(java.io.File dir, java.io.File envfile)
           
static java.lang.String readFile(java.lang.String in)
          Method to read a file and turn it into a string - based on rcwang's code
static boolean startWithSameInitialPunctCharacters(java.lang.String tmp, java.lang.String tmp1)
          Detects two consecutive lines that start with the same punctuation (\p{Punct}) character
static double wordCharactersPercentage(java.lang.String line)
          Returns the percentage of word characters (\w) in a line
static void writeToOutputFile(java.lang.String outputFileName, java.lang.StringBuffer aux)
          Writes the contents of a String Buffer to an output file
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LineProcessingUtil

public LineProcessingUtil()
Method Detail

lineMatcher

public static boolean lineMatcher(java.lang.String patternStr,
                                  java.lang.String tmpstr)
Returns true if the input, or any substring of it, matches the pattern.

Parameters:
patternStr - regexp (in String format)
tmpstr - line to be matched to regexp (in String format)
Returns:
true (if pattern is matched) or false (otherwise)
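
The description suggests a substring search rather than a whole-line match. A minimal stand-alone sketch of that reading follows. It assumes the method behaves like Matcher.find(); the class name LineMatcherSketch is hypothetical and this is not the library's source.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not the Minorthird implementation.
final class LineMatcherSketch {

    // Assumption: "matches" means the pattern occurs somewhere in the line
    // (Matcher.find), not that it must cover the whole line (Matcher.matches).
    static boolean lineMatcher(String patternStr, String line) {
        return Pattern.compile(patternStr).matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(lineMatcher("^>+", "> quoted reply")); // true
        System.out.println(lineMatcher("^>+", "plain text"));     // false
    }
}
```

If the real method instead requires a whole-line match, a pattern can be wrapped as .*pattern.* to get the same effect.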

addFeature

public static void addFeature(java.lang.String line,
                              java.lang.String regexp,
                              java.lang.String featureName,
                              java.lang.StringBuffer features_out)
If the regexp matches a substring of the line, appends " featureName=1" to the string buffer. Useful for producing external datasets in Minorthird format.

Parameters:
line - in String format
regexp - in String format
featureName - feature name to add if the regexp matches a substring of the line
features_out - StringBuffer to which the feature should be added
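
The following sketch shows how the documented behavior could be used to build one feature line from several regular expressions. It is a stand-alone approximation: the class name AddFeatureSketch and the feature names hasRegards and isQuoted are made up for illustration, and the real method may differ in details such as whitespace handling.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not the Minorthird implementation.
final class AddFeatureSketch {

    // Appends " featureName=1" when the regexp is found anywhere in the line.
    // Assumption: nothing is appended when there is no match.
    static void addFeature(String line, String regexp, String featureName, StringBuffer out) {
        if (Pattern.compile(regexp).matcher(line).find()) {
            out.append(' ').append(featureName).append("=1");
        }
    }

    public static void main(String[] args) {
        StringBuffer features = new StringBuffer();
        addFeature("Best regards,", "(?i)regards", "hasRegards", features);
        addFeature("Best regards,", "^>", "isQuoted", features);
        System.out.println(features); // prints " hasRegards=1"
    }
}
```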

punctuationPercentage

public static double punctuationPercentage(java.lang.String line)
Returns the percentage of punctuation (\p{Punct}) characters in a line

Parameters:
line - in String format
Returns:
the percentage of punctuation characters in the line
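
The page does not say whether the result is scaled to 0-1 or 0-100, nor what happens for an empty line. The sketch below assumes a fraction in [0, 1] and returns 0.0 for an empty line; both choices, and the name PunctuationShareSketch, are assumptions rather than documented behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not the Minorthird implementation.
final class PunctuationShareSketch {

    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    // Assumption: the share is count / line length, as a fraction in [0, 1].
    static double punctuationShare(String line) {
        if (line.isEmpty()) {
            return 0.0; // assumption: avoid dividing by zero on empty lines
        }
        Matcher m = PUNCT.matcher(line);
        int count = 0;
        while (m.find()) {
            count++; // one hit per punctuation character
        }
        return (double) count / line.length();
    }

    public static void main(String[] args) {
        System.out.println(punctuationShare("a,b!")); // 0.5
    }
}
```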

AtoZPercentage

public static double AtoZPercentage(java.lang.String line)
Returns the percentage of A-Z or a-z characters in a line

Parameters:
line - in String format
Returns:
the percentage of [a-z] or [A-Z] characters in the line

wordCharactersPercentage

public static double wordCharactersPercentage(java.lang.String line)
Returns the percentage of word characters (\w) in a line

Parameters:
line - in String format
Returns:
the percentage of "\w" characters in the line
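
AtoZPercentage and wordCharactersPercentage differ only in the character class they count: \w also covers digits and the underscore. The sketch below makes that difference concrete with a single hypothetical helper, share, which again assumes a fraction in [0, 1]; it is not the library's code.

```java
// Hypothetical stand-in for illustration only; not the Minorthird implementation.
final class LetterVersusWordCharSketch {

    // Fraction of characters in the line that belong to the given character class.
    static double share(String line, String charClassRegex) {
        if (line.isEmpty()) {
            return 0.0; // assumption: empty lines yield 0
        }
        // Delete every matching character and measure how much shorter the line gets.
        int matched = line.length() - line.replaceAll(charClassRegex, "").length();
        return (double) matched / line.length();
    }

    public static void main(String[] args) {
        String line = "ab_12: x"; // 8 characters
        System.out.println(share(line, "[A-Za-z]")); // 0.375 (a, b, x)
        System.out.println(share(line, "\\w"));      // 0.75  (a, b, _, 1, 2, x)
    }
}
```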

indentPercentage

public static double indentPercentage(java.lang.String line)
Returns the percentage of tab characters in a line

Parameters:
line - in String format
Returns:
the percentage of "\t" characters in the line

indentNumber

public static int indentNumber(java.lang.String line)

numberOfMatches

public static int numberOfMatches(java.lang.String expression,
                                  java.lang.String line)

startWithSameInitialPunctCharacters

public static boolean startWithSameInitialPunctCharacters(java.lang.String tmp,
                                                          java.lang.String tmp1)
Detects two consecutive lines that start with the same punctuation (\p{Punct}) character

Parameters:
tmp - line1 in String format
tmp1 - line2 in String format
Returns:
true if both lines start with the same punctuation symbol
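
One plausible reading of this check, useful for spotting quoted-reply or bullet-list blocks, is sketched below. It assumes that only the very first character of each line is compared and that leading whitespace is not skipped; the page does not confirm either point, and the name SameLeadingPunctSketch is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not the Minorthird implementation.
final class SameLeadingPunctSketch {

    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    // True when both lines are non-empty and begin with the same punctuation character.
    static boolean sameLeadingPunct(String first, String second) {
        if (first.isEmpty() || second.isEmpty()) {
            return false;
        }
        char c = first.charAt(0);
        return c == second.charAt(0) && PUNCT.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameLeadingPunct("> quoted", "> also quoted")); // true
        System.out.println(sameLeadingPunct("- item", "* item"));          // false
        System.out.println(sameLeadingPunct("same", "start"));             // false: 's' is not punctuation
    }
}
```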

getMessageLines

public static java.lang.String[] getMessageLines(java.lang.String tmp)
Method to split a message (string format) into lines

Parameters:
tmp - message as String
Returns:
message lines in a String[]
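
The page does not state which line separators are recognized. As a rough sketch, assuming both Unix (\n) and Windows (\r\n) endings should split the message, the behavior could look like the following; MessageLinesSketch is a hypothetical name.

```java
// Hypothetical stand-in for illustration only; not the Minorthird implementation.
final class MessageLinesSketch {

    // Assumption: lines end in "\n" or "\r\n".
    // Note that String.split drops trailing empty strings but keeps interior blank lines.
    static String[] getMessageLines(String message) {
        return message.split("\\r?\\n");
    }

    public static void main(String[] args) {
        String[] lines = getMessageLines("Hi Bob,\r\nSee you at 5.\nThanks");
        for (String line : lines) {
            System.out.println("[" + line + "]");
        }
    }
}
```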

readFile

public static java.lang.String readFile(java.lang.String in)
                                 throws java.io.IOException
Method to read a file and turn it into a string - based on rcwang's code

Parameters:
in - String with the name of file
Returns:
the contents of the file as a String
Throws:
java.io.IOException

writeToOutputFile

public static void writeToOutputFile(java.lang.String outputFileName,
                                     java.lang.StringBuffer aux)
                              throws java.io.IOException
Writes the contents of a String Buffer to an output file

Parameters:
outputFileName - output File name (as a String)
aux - string buffer to be written to output file
Throws:
java.io.IOException
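
readFile and writeToOutputFile are natural counterparts, so a round trip is a convenient way to picture them. The sketch below is a stand-alone approximation using only java.io. The character encoding (platform default here) and the handling of line endings are assumptions, since the page specifies neither, and FileRoundTripSketch is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical stand-in for illustration only; not the Minorthird implementation.
final class FileRoundTripSketch {

    // Writes the buffer's text to the named file, replacing any existing content.
    static void writeToOutputFile(String outputFileName, StringBuffer contents) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outputFileName))) {
            out.write(contents.toString());
        }
    }

    // Reads the whole file character by character, so line endings are preserved as-is.
    static String readFile(String fileName) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("features", ".txt");
        tmp.deleteOnExit();
        writeToOutputFile(tmp.getPath(), new StringBuffer("line1 hasRegards=1\n"));
        System.out.print(readFile(tmp.getPath())); // prints "line1 hasRegards=1"
    }
}
```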

readBsh

public static TextLabels readBsh(java.io.File dir,
                                 java.io.File envfile)
                          throws java.lang.Exception
Throws:
java.lang.Exception