The main purposes of this work were:
to process the raw text files to produce the data that will be used by the prediction algorithm;
to describe the goals for the app that will do the prediction.
The work was complicated by the size of the files. Standard tasks that would run easily on a standard PC with small files take a long time to run, and in some cases the R application just hangs. So I had to try different approaches until I managed to write code that processes the data quickly and reliably.
Here are some basic statistics of the files to be analyzed:
# Get data about the raw files
library(gridExtra)  # provides grid.table()
Files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
# File sizes in bytes (file.size is vectorized)
FileSize <- file.size(Files)
# Number of lines in each file
NumberOfLines <- sapply(Files, function(f) length(readLines(f, skipNul = TRUE)),
    USE.NAMES = FALSE)
FilesStatistics <- data.frame(Files, FileSize, NumberOfLines)
# Show the data in a table
grid.table(format(FilesStatistics, big.mark = ","), rows = NULL)
We carried out the following tasks to clean the data: removing URLs, removing special characters and numbers, removing profanities, collapsing extra white spaces and converting the text to lower case.
We didn’t use the tm library to carry out these tasks, because it was too slow. This library has lots of useful functions for this type of job, but we found that it was just too slow for the size of the data we were handling.
We didn’t do any stemming (the reduction of inflected words to their word stem, base or root form), because we think it doesn’t make sense in our application. We want to suggest words with the appropriate inflection, rather than the stemmed version of the word.
The code to do the cleanup looks like this:
library(stringr)
# Remove urls
urlPattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
myData <- str_replace_all(myData, urlPattern, " ")
# Remove special chars (keep only letters, apostrophes, hyphens and spaces)
myData <- str_replace_all(myData, "[^A-Za-z'\\- ]", " ")
# Remove profanities (whole words only, ignoring case, so that
# words like "class" are not damaged)
profanities <- c("arse", "ass", "asshole", "bastard", "bitch", "bollocks", "crap",
    "damn", "fuck", "goddam", "shit")
profanitiesPattern <- regex(paste0("\\b(?:", paste(profanities, collapse = "|"), ")\\b"),
    ignore_case = TRUE)
myData <- str_replace_all(myData, profanitiesPattern, " ")
# Collapse white spaces
spacePattern <- "\\s+"
myData <- str_replace_all(myData, spacePattern, " ")
# Convert the text to lower case
myData <- tolower(myData)
The prediction algorithm will use a library of frequent words and frequent word combinations (also called n-grams, where a 2-gram is a combination of 2 words, like “very much”, a 3-gram is a combination of 3 words, etc.). Using what the user has typed so far and the library of term frequencies, the algorithm will attempt to match the longest possible sequence of words in order to predict the next one.
In practice we limited the n-grams to n = 3. This is because the number of possible word combinations grows exponentially with n, and calculating 4-grams would require a lot of computing power and/or time.
So, to implement our algorithm, we need to write code that calculates the frequency of n-grams with n = 1,2,3, to produce the libraries of term frequencies needed for the predictions.
Our code produces 3 files for each raw file provided, one for each n-gram size. The code is a bit more complicated than it should be, because of performance issues with the “tm” package. It first creates a “corpus” (a large and structured set of texts) from the cleaned-up file. It then creates a “document term matrix” from the corpus. This is a matrix that has one row for each document (in our case a document is a line from the cleaned-up file) and one column for each term (or combination of n words). The element (i,j) of the matrix is the number of times that term j is present in document i. Most of the elements of this matrix are zero, because a line of text contains only a few of the large number of possible terms.
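The matching idea described above (try the longest sequence of typed words first, then fall back to shorter ones) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the frequency tables are made-up toy data, and `predictNext` is a hypothetical helper that is not part of the code written for this report.

```r
# Toy frequency tables (made-up numbers, for illustration only).
unigrams <- data.frame(term = c("the", "much", "you"),
                       freq = c(50, 8, 20), stringsAsFactors = FALSE)
bigrams <- data.frame(term = c("very much", "very well", "thank you"),
                      freq = c(12, 7, 30), stringsAsFactors = FALSE)
trigrams <- data.frame(term = c("thank you very", "you very much"),
                       freq = c(9, 6), stringsAsFactors = FALSE)

# Hypothetical helper: tables[[n]] holds the n-gram frequencies.
predictNext <- function(text, tables = list(unigrams, bigrams, trigrams)) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  maxHistory <- min(length(words), length(tables) - 1)
  # Try the longest history first (2 words, then 1 word)
  for (k in rev(seq_len(maxHistory))) {
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    candidates <- tables[[k + 1]]
    hits <- candidates[startsWith(candidates$term, prefix), ]
    if (nrow(hits) > 0) {
      best <- hits$term[which.max(hits$freq)]
      # The prediction is the last word of the best matching n-gram
      return(tail(strsplit(best, " ", fixed = TRUE)[[1]], 1))
    }
  }
  # Nothing matched: fall back to the most frequent single word
  tables[[1]]$term[which.max(tables[[1]]$freq)]
}

predictNext("thank you very")  # matched through the 3-gram table
predictNext("it went very")    # falls back to the 2-gram table
predictNext("hello")           # falls back to the most frequent word
```

With the toy tables above, the first two calls suggest "much" and the last one suggests "the". The real app may well use a different scoring rule; the sketch only shows the order in which the tables are consulted.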
We have to add all the elements in a column to calculate the total number of times the corresponding term appears in the original raw file.
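To make the structure of the document term matrix concrete, here is a small base R sketch that builds one by hand for three made-up lines and sums its columns. It uses a dense matrix only because the example is tiny; the real data needs a sparse representation. The variable names are illustrative and do not come from the report's code.

```r
# Three tiny "documents" (lines), made up for illustration
exampleLines <- c("very much so", "thank you very much", "thank you")
tokens <- strsplit(exampleLines, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# One row per document, one column per term
countTerms <- function(words) as.numeric(table(factor(words, levels = terms)))
dtm <- t(vapply(tokens, countTerms, numeric(length(terms))))
colnames(dtm) <- terms

# Total number of occurrences of each term across all documents
termFrequencies <- colSums(dtm)
print(dtm)
print(termFrequencies)
```

Here the element in row 2, column "much" is 1, because the second line contains "much" once, and the column sum for "much" is 2, because the term appears in two lines.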
The tweaks we had to implement to avoid R crashing or taking too long include:
We calculated frequencies only for terms that appear at least a minimum number of times (the lowFreq parameter in the code). We set this minimum to 1000, but the value is arbitrary and can be changed. The idea is that for the prediction algorithm we want to use small library files of very frequent terms, rather than a huge library that takes lots of memory and more time to read, since low-frequency words are probably bad guesses anyway. It is better to give the user a likely prediction very quickly than to take a long time and suggest a word that is probably wrong.
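The effect of the frequency threshold can be illustrated with a toy example in base R. The word list and the threshold of 2 are made up (the threshold stands in for the 1000 used in this report), and `minCount` and `libraryTerms` are illustrative names only.

```r
# Made-up list of words; in the report the terms come from the n-gram counts
words <- c("the", "the", "the", "cat", "the", "cat", "sat")
counts <- table(words)

# Keep only the terms that reach the threshold, most frequent first
minCount <- 2
frequent <- sort(counts[counts >= minCount], decreasing = TRUE)
libraryTerms <- data.frame(term = names(frequent),
                           freq = as.integer(frequent),
                           stringsAsFactors = FALSE)
print(libraryTerms)
```

Only "the" (4 occurrences) and "cat" (2 occurrences) survive the filter, while "sat" is dropped. A higher threshold gives a smaller library at the cost of never suggesting rarer terms.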
Here is an extract of the code used to calculate the term frequencies:
# Libraries used; sourceFile, ngram, lowFreq and outputFile are set by the caller
library(tm)
library(RWeka)
library(textmineR)
# Read file
myData <- readLines(sourceFile, skipNul = TRUE)
# Create corpus object
docs <- VCorpus(VectorSource(myData))
# Create document term matrix
myTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
dtm <- DocumentTermMatrix(docs, control = list(tokenize = myTokenizer))
# Remove corpus from RAM to free memory
rm(docs)
# Create sparse matrix
dtm_sparse <- MakeSparseDTM(dtm)
# Get frequent terms
freqTerms <- findFreqTerms(dtm, lowfreq = lowFreq)
# Iterate over the columns and calculate the frequencies
for (col in 1:ncol(dtm)) {
    # Get term
    word <- dtm$dimnames$Terms[col]
    if (word %in% freqTerms) {
        # Get frequency of term
        freq <- sum(dtm_sparse[, col])
        # Save term and frequency to file
        write(paste(word, freq, sep = ","), file = outputFile, append = TRUE)
    }
}
Here is what we found after running our code.
par(mfrow=c(1,3))
makeNgramPlot("en_US.blogs.1.gramFrequencies.txt")
makeNgramPlot("en_US.news.1.gramFrequencies.txt")
makeNgramPlot("en_US.twitter.1.gramFrequencies.txt")
par(mfrow=c(1,3))
makeNgramPlot("en_US.blogs.2.gramFrequencies.txt")
makeNgramPlot("en_US.news.2.gramFrequencies.txt")
makeNgramPlot("en_US.twitter.2.gramFrequencies.txt")
par(mfrow=c(1,3))
makeNgramPlot("en_US.blogs.3.gramFrequencies.txt")
makeNgramPlot("en_US.news.3.gramFrequencies.txt")
makeNgramPlot("en_US.twitter.3.gramFrequencies.txt")
We didn’t remove apostrophes in our cleanup, but the tm library by default removes them when creating the document term matrix, so the apostrophes are missing in our results. We will have to investigate how to change this default behaviour in order to keep them.
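One possible direction for that investigation is a tokenizer that splits on white space only, so that apostrophes survive. The base R sketch below shows the idea; `makeNgrams` is a hypothetical helper, not a function of tm or RWeka, and whether it can be plugged into the document term matrix creation with acceptable performance is an assumption that we have not tested.

```r
# Hypothetical helper: returns the n-grams of a line, splitting on
# white space only so that words like "don't" stay intact
makeNgrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) {
    return(character(0))
  }
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

makeNgrams("i don't know", 2)
```

For the line "i don't know" the 2-grams are "i don't" and "don't know", with the apostrophe preserved.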
# Does the data cleanup of all the raw files and
# then calculates the n-gram frequencies
cleanUpRawFiles <- function() {
    # Remove urls, special chars, numbers and profanities,
    # collapse extra white spaces and convert to lowercase
    preProcess("en_US.blogs.txt")
    preProcess("en_US.news.txt")
    preProcess("en_US.twitter.txt")
    # The twitter file has too many lines to build a corpus
    reduceNumberOfLines("en_US.twitter.txt.preprocessed")
    # Results are written to the *.gramFrequencies.txt files
    getNgramFrequencies("en_US.blogs.txt.preprocessed")
    getNgramFrequencies("en_US.news.txt.preprocessed")
    getNgramFrequencies("en_US.twitter.txt.preprocessed.reduced")
}
# Helper function that actually does the data
# cleaning of a raw file
preProcess <- function(sourceFile) {
    library(stringr)
    validateFile(sourceFile)
    logActivity("Reading data.")
    # Read data
    myData <- readLines(sourceFile, skipNul = TRUE)
    # Remove urls
    logActivity("Removing urls.")
    urlPattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    myData <- str_replace_all(myData, urlPattern, " ")
    # Remove special chars
    logActivity("Removing special characters and numbers.")
    myData <- str_replace_all(myData, "[^A-Za-z'\\- ]", " ")
    # Remove profanities (whole words only, ignoring case)
    logActivity("Removing profanities.")
    profanities <- c("arse", "ass", "asshole", "bastard",
        "bitch", "bollocks", "crap", "damn", "fuck",
        "goddam", "shit")
    profanitiesPattern <- regex(paste0("\\b(?:", paste(profanities,
        collapse = "|"), ")\\b"), ignore_case = TRUE)
    myData <- str_replace_all(myData, profanitiesPattern, " ")
    # Collapse white spaces
    logActivity("Removing extra white spaces.")
    spacePattern <- "\\s+"
    myData <- str_replace_all(myData, spacePattern, " ")
    # Convert the text to lower case
    logActivity("Converting to lower case.")
    myData <- tolower(myData)
    writeLines(myData, paste(sourceFile, "preprocessed", sep = "."))
    logActivity("Finished processing.")
}
# Auxiliary function to combine lines from a text
# file, if the number of lines exceeds the
# parameter n (n defaults to 1 million). It writes
# a file that has n or fewer lines. Needed
# because the creation of a corpus with more than 1
# million lines crashes my computer
reduceNumberOfLines <- function(sourceFile, n = 1e+06) {
    library(stringr)
    validateFile(sourceFile)
    logActivity("Reading data.")
    # Read data
    myData <- readLines(sourceFile, skipNul = TRUE)
    initialRows <- length(myData)
    reductionRate <- floor(initialRows/n) + 1
    print(paste("Reduction rate:", reductionRate, sep = " "))
    # Assign each line to a group of reductionRate
    # consecutive lines
    groups <- ceiling(seq_len(initialRows)/reductionRate)
    # Concatenate the lines of each group into one line
    reducedData <- vapply(split(myData, groups), paste,
        character(1), collapse = " ", USE.NAMES = FALSE)
    reducedData <- str_replace_all(reducedData, "\\s+", " ")
    logActivity("Saving data.")
    outputFile <- paste(sourceFile, "reduced", sep = ".")
    writeLines(reducedData, outputFile)
}
# Auxiliary function that checks if a file parameter
# is correct
validateFile <- function(sourceFile) {
    if (is.null(sourceFile)) {
        stop("sourceFile not provided")
    }
    if (!file.exists(sourceFile)) {
        stop(paste(sourceFile, "does not exist", sep = " "))
    }
}
# Calculates frequencies of n-grams that appear at
# least lowFreq times and saves them to a file
getNgramFrequencies <- function(inputFile, lowFreq = 1000) {
    library(tm)
    library(RWeka)
    library(stringr)
    library(textmineR)
    for (ngram in 1:3) {
        # The corpus is recreated on each pass because it is
        # removed from RAM after building the matrix
        docs <- createCorpora(inputFile)
        prefix <- str_replace_all(inputFile, "\\.txt|\\.reduced|\\.preprocessed",
            "")
        outputFile <- paste(prefix, ngram, "gramFrequencies",
            "txt", sep = ".")
        # Creating document term matrix
        logActivity("Creating document term matrix.")
        myTokenizer <- function(x) NGramTokenizer(x,
            Weka_control(min = ngram, max = ngram))
        dtm <- DocumentTermMatrix(docs, control = list(tokenize = myTokenizer))
        rm(docs)
        # Create sparse matrix
        logActivity("Creating sparse matrix.")
        dtm_sparse <- MakeSparseDTM(dtm)
        # Get frequent terms
        logActivity("Getting frequent terms.")
        freqTerms <- findFreqTerms(dtm, lowfreq = lowFreq)
        logActivity("Calculating frequencies and saving to file.")
        i <- 0
        j <- 0
        # Iterate over the columns and calculate the
        # frequencies
        for (col in 1:ncol(dtm)) {
            i <- i + 1
            if (i%%500 == 0) {
                # message() is needed: a bare sprintf() inside
                # a loop prints nothing
                message(sprintf("%d records processed", i))
            }
            # Get term
            word <- dtm$dimnames$Terms[col]
            if (word %in% freqTerms) {
                j <- j + 1
                if (j%%500 == 0) {
                    message(sprintf("Added %d records to file", j))
                }
                # Get frequency of term
                freq <- sum(dtm_sparse[, col])
                # Save term and frequency to file
                write(paste(word, freq, sep = ","),
                    file = outputFile, append = TRUE)
            }
        }
    }
}
# Returns a corpus object corresponding to the
# data of the file provided
createCorpora <- function(sourceFile) {
    library(tm)
    validateFile(sourceFile)
    logActivity("Reading data.")
    # Read data
    myData <- readLines(sourceFile, skipNul = TRUE)
    # Generate corpus
    logActivity("Generating corpus.")
    docs <- VCorpus(VectorSource(myData))
    logActivity("Finished creating corpus.")
    return(docs)
}
# Utility to display a message that includes the
# time. Used to keep track of the time that each
# action takes
logActivity <- function(text) {
    print(paste(text, Sys.time(), sep = " "))
}
# Makes a barplot of the 10 most used n-grams.
# Reads the data from the files produced by
# getNgramFrequencies()
makeNgramPlot <- function(file) {
    validateFile(file)
    myData <- read.csv(file, header = FALSE)
    # Sort by frequency (column V2) and keep the first 10 rows
    top10 <- head(myData[order(myData$V2, decreasing = TRUE), ], 10)
    names(top10) <- c("Term", "Frequency")
    barplot(top10$Frequency, names.arg = top10$Term,
        las = 2)
}
# Makes barplots of the frequencies of the top 10
# n-grams from all raw files (one row of 3 plots
# per raw file)
makeNgramPlots <- function() {
    par(mfrow = c(1, 3))
    makeNgramPlot("en_US.blogs.1.gramFrequencies.txt")
    makeNgramPlot("en_US.blogs.2.gramFrequencies.txt")
    makeNgramPlot("en_US.blogs.3.gramFrequencies.txt")
    makeNgramPlot("en_US.news.1.gramFrequencies.txt")
    makeNgramPlot("en_US.news.2.gramFrequencies.txt")
    makeNgramPlot("en_US.news.3.gramFrequencies.txt")
    makeNgramPlot("en_US.twitter.1.gramFrequencies.txt")
    makeNgramPlot("en_US.twitter.2.gramFrequencies.txt")
    makeNgramPlot("en_US.twitter.3.gramFrequencies.txt")
}