Exploratory analysis of the text data that will be used in the final project

Summary

The main purposes of this work were:

to process the raw text files to produce the data that will be used by the prediction algorithm
to describe the goals for the app that will do the prediction

The work was complicated by the size of the files. Tasks that run easily on a standard PC when the files are small take a long time on files this large, and in some cases the R application just hangs. So I had to try different approaches until I managed to write code that processes the data quickly and reliably.

Raw data and clean up process

Here are some basic statistics of the files to be analyzed:

# Get data about the raw files
library(gridExtra)  # provides grid.table()
Files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
FileSize <- file.size(Files)  # size in bytes
NumberOfLines <- sapply(Files, function(f) length(readLines(f, skipNul = TRUE)), 
    USE.NAMES = FALSE)
FilesStatistics <- data.frame(Files, FileSize, NumberOfLines)

# Show the data in a table
grid.table(format(FilesStatistics, big.mark = ",", small.interval = 3), rows = NULL)
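
Because the files are large, one option for counting lines is to read each file in chunks instead of loading it whole. The following is only a minimal sketch: the helper name countLinesInChunks and the chunk size are assumptions, not part of the code used for this report, and the demonstration runs on a small temporary file.

```r
# Hypothetical helper (not part of the original code): counts the lines of a
# text file by reading it in chunks, so the whole file never sits in memory
countLinesInChunks <- function(path, chunkSize = 10000) {
    con <- file(path, open = "r")
    on.exit(close(con))
    total <- 0
    repeat {
        # Each call continues where the previous one stopped
        chunk <- readLines(con, n = chunkSize, skipNul = TRUE)
        if (length(chunk) == 0) break
        total <- total + length(chunk)
    }
    total
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:25), tmp)
demoCount <- countLinesInChunks(tmp, chunkSize = 10)
print(demoCount)
```

The same helper could be called on each of the three raw files in place of length(readLines(...)).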

We carried out the following tasks to clean the data:

Removal of urls
Removal of characters that are not letters of the English alphabet or the special characters “-” (dash), “’” (apostrophe) and space
Removal of text that contains profanities
Removal of extra white spaces
Conversion of all uppercase letters to lowercase

We didn’t use the tm library to carry out these tasks. This library has lots of useful functions for this type of job, but we found that it was just too slow for the size of the data we were handling.

We didn’t do any stemming (the reduction of inflected words to their word stem, base or root form), because we think that in our application, it doesn’t make sense. We want to suggest the words with the appropriate inflection, rather than the stem version of the word.

The code to do the clean up looks like this:

    library(stringr)
    
    # Remove urls
    urlPattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    myData <- str_replace_all(myData, urlPattern, " ")
    
    # Remove special chars
    myData <- str_replace_all(myData, "[^A-Za-z'\\- ]", " ")
    
    # Remove profanities (whole words only, ignoring case,
    # so that words such as "class" are left intact)
    profanities <- c("arse", "ass", "asshole", "bastard", "bitch", "bollocks", "crap",
                     "damn", "fuck", "goddam", "shit")
    profanitiesPattern <- paste0("\\b(?:", paste(profanities, collapse = "|"), ")\\b")
    myData <- str_replace_all(myData, regex(profanitiesPattern, ignore_case = TRUE), " ")
    
    # Collapse white spaces
    spacePattern <- "\\s+"
    myData <- str_replace_all(myData, spacePattern, " ")
    
    # Convert the text to lower case
    myData <- tolower(myData)
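
To see the effect of these steps on a single line, here is a small base R sketch. It is an illustration only: the helper cleanLine, the simplified URL pattern and the two-word profanity list are assumptions made to keep the example short, and it lowercases before removing profanities so that the match does not depend on case.

```r
# Hypothetical helper: base R version of the clean-up steps, for illustration
cleanLine <- function(x, badWords = c("damn", "crap")) {
    x <- gsub("http[s]?://\\S+", " ", x, perl = TRUE)   # simplified URL pattern (assumption)
    x <- gsub("[^A-Za-z' -]", " ", x, perl = TRUE)      # keep letters, apostrophe, dash, space
    x <- tolower(x)                                     # lower case
    badPattern <- paste0("\\b(?:", paste(badWords, collapse = "|"), ")\\b")
    x <- gsub(badPattern, " ", x, perl = TRUE)          # remove whole-word profanities
    x <- gsub("\\s+", " ", x, perl = TRUE)              # collapse white spaces
    trimws(x)
}

sampleLine <- "Visit http://example.com NOW!!  It's a well-known  damn site, 100% free"
cleaned <- cleanLine(sampleLine)
print(cleaned)
```

The URL, the digits and the punctuation are dropped, while the apostrophe in "it's" and the dash in "well-known" survive.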

Goals for our prediction application

The prediction algorithm will use a library of frequent words and frequent word combinations (also called n-grams, where a 2-gram is a combination of 2 words, like “very much”, a 3-gram is a combination of 3 words, etc.). Using what the user has typed so far and the library of term frequencies, the algorithm will attempt to match the longest possible sequence of words in order to predict the next one.

The practical limit for n-grams is 3. This is because the number of possible word combinations grows exponentially with the number of words, and calculating 4-grams would require a lot of computing power and/or time.
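
The matching of the longest possible sequence could be sketched as follows. This is not the final algorithm, only one plausible reading of the description above: the toy frequency libraries, their counts and the helper names bestContinuation and predictNext are all made up for illustration.

```r
# Toy frequency libraries (made-up counts, for illustration only)
unigrams <- c("the" = 50, "you" = 30)
bigrams  <- c("very much" = 12, "very good" = 9, "thank you" = 20)
trigrams <- c("thank you very" = 7, "you very much" = 5)

# Hypothetical helper: most frequent word that follows `context` in
# `ngramCounts`, or NA if no n-gram starts with that context
bestContinuation <- function(context, ngramCounts) {
    prefix <- paste0(context, " ")
    hits <- ngramCounts[startsWith(names(ngramCounts), prefix)]
    if (length(hits) == 0) return(NA_character_)
    best <- names(hits)[which.max(hits)]
    substring(best, nchar(prefix) + 1)
}

# Hypothetical helper: try the longest context first, then shorter ones
predictNext <- function(typed) {
    words <- strsplit(tolower(trimws(typed)), "\\s+")[[1]]
    n <- length(words)
    # Last two words against the 3-grams
    if (n >= 2) {
        guess <- bestContinuation(paste(words[(n - 1):n], collapse = " "), trigrams)
        if (!is.na(guess)) return(guess)
    }
    # Last word against the 2-grams
    if (n >= 1) {
        guess <- bestContinuation(words[n], bigrams)
        if (!is.na(guess)) return(guess)
    }
    # Fall back to the most frequent single word
    names(unigrams)[which.max(unigrams)]
}

print(predictNext("thank you"))
print(predictNext("I like it very"))
print(predictNext("hello"))
```

With these toy libraries the first call is answered from the 3-grams, the second from the 2-grams and the third from the single words.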

R code to generate n-gram libraries from the supplied data

So, to implement our algorithm, we need to write code that calculates the frequency of n-grams with n = 1,2,3, to produce the libraries of term frequencies needed for the predictions.

Our code produces 3 files for each raw file provided, one for each n-gram size. The code is a bit more complicated than it should be, because of performance issues with the “tm” package. It first creates a corpus (a large and structured set of texts) from the cleaned up file. It then creates a “document term matrix” from the corpus. This is a matrix that has one row for each document (in our case a document is a line from the cleaned up file) and one column for each term (or combination of n words). The element (i,j) of the matrix is the number of times that term j is present in document i. Most of the elements of this matrix are zero, because a line of text contains only a few of the large number of possible terms.

We have to add all the elements in a column to calculate the total number of times the corresponding term appears in the original raw file.
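
As a small illustration of this structure, the sketch below builds a document term matrix by hand for three made-up lines, using base R only (no tm), and adds up its columns. The variable names are hypothetical.

```r
# Three toy "documents" (lines of cleaned text)
docsToy <- c("the cat sat", "the cat ran", "a dog ran")

# Split each line into words and list every distinct term
tokens <- strsplit(docsToy, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Document term matrix: one row per document, one column per term
countsPerDoc <- vapply(tokens,
    function(tk) as.integer(table(factor(tk, levels = terms))),
    integer(length(terms)))
dtmToy <- t(countsPerDoc)
colnames(dtmToy) <- terms

# Total frequency of each term = sum of its column
termTotals <- colSums(dtmToy)
print(dtmToy)
print(termTotals)
```

Even in this tiny example half of the entries are zero, which is why a sparse representation pays off on the real files.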

The tweaks we had to implement to avoid R crashing or taking too long include:

combining rows to reduce their number. The Twitter file has almost 3 million lines. The corpus object created by the tm package is too big for my PC when the number of lines is more than 1 million, so I had to reduce the number of lines by combining every 3 rows into 1
converting the document term matrix to a sparse matrix. This considerably speeds up the adding of all elements in a column
deleting the corpus object after the document term matrix has been created. This is because the corpus takes several gigabytes of memory, due to the size of the files provided
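
The first tweak, combining every few rows into one, can be sketched in a vectorized way as shown below. The helper combineLines is hypothetical and is not the function used for this report (that one is listed in the appendix).

```r
# Hypothetical helper: merge every `groupSize` consecutive lines into one
combineLines <- function(lines, groupSize = 3) {
    # Lines 1..groupSize get id 1, the next groupSize lines get id 2, etc.
    groupId <- ceiling(seq_along(lines) / groupSize)
    as.vector(tapply(lines, groupId, paste, collapse = " "))
}

toyLines <- c("one", "two", "three", "four", "five", "six", "seven")
combined <- combineLines(toyLines, groupSize = 3)
print(combined)
```

One side effect worth keeping in mind is that joining lines creates a few n-grams that span the boundary between two original lines.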

We calculated frequencies only for terms that appear at least a minimum number of times (the lowFreq parameter in the code). We set lowFreq = 1000, but this value is arbitrary and can be changed. The idea is that for the prediction algorithm we want to use small library files of very frequent terms, rather than a huge library that takes lots of memory and more time to read, since low frequency terms are probably bad guesses anyway. It is better to give the user a likely prediction very quickly than to take a long time to suggest a word that is probably wrong.

Here is an extract of the code used to calculate the term frequencies:

library(tm)
library(RWeka)
library(textmineR)

# Read file
myData <- readLines(sourceFile, skipNul = TRUE)

# Create corpus object
docs <- VCorpus(VectorSource(myData))

# Create document term matrix
myTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
dtm <- DocumentTermMatrix(docs, control = list(tokenize = myTokenizer))

# Remove corpus from RAM to free memory
rm(docs)
        
# Create sparse matrix
dtm_sparse <- MakeSparseDTM(dtm)

# Get frequent terms
freqTerms <- findFreqTerms(dtm, lowfreq = lowFreq)

# Iterate over the columns and calculate the frequencies
for (col in 1:ncol(dtm)){
    # Get term
    word <- dtm$dimnames$Terms[col]
    if (word %in% freqTerms){
        # Get frequency of term
        freq <- sum(dtm_sparse[,col])
        # Save term and frequency to file
        write(paste(word, freq, sep = ","), file = outputFile, append = TRUE)
    }
}
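
The loop above handles one term at a time. A vectorized variant of the same idea is sketched below on a toy matrix with made-up counts; it uses base R only, and the names toyDtm, toyLowFreq and toyOutput are hypothetical stand-ins for the real objects.

```r
# Toy document term matrix (made-up counts): rows are documents, columns are terms
toyDtm <- matrix(c(2, 1, 1,
                   1, 1, 0,
                   3, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("the", "cat", "ran")))
toyLowFreq <- 2   # stand-in for lowFreq

# Sum every column at once, then keep the terms that reach the threshold
totals <- colSums(toyDtm)
frequent <- totals[totals >= toyLowFreq]

# Write all "term,frequency" pairs in a single call
toyOutput <- tempfile(fileext = ".txt")
write.table(data.frame(names(frequent), frequent), file = toyOutput,
            sep = ",", quote = FALSE, row.names = FALSE, col.names = FALSE)
print(readLines(toyOutput))
```

Writing the file once avoids reopening it for every term, which may matter when many terms pass the threshold.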

Frequent terms found

Here is what we found after running our code.

Single Terms

par(mfrow=c(1,3))
makeNgramPlot("en_US.blogs.1.gramFrequencies.txt")
makeNgramPlot("en_US.news.1.gramFrequencies.txt")
makeNgramPlot("en_US.twitter.1.gramFrequencies.txt")

2-grams

par(mfrow=c(1,3))
makeNgramPlot("en_US.blogs.2.gramFrequencies.txt")
makeNgramPlot("en_US.news.2.gramFrequencies.txt")
makeNgramPlot("en_US.twitter.2.gramFrequencies.txt")

3-grams

par(mfrow=c(1,3))
makeNgramPlot("en_US.blogs.3.gramFrequencies.txt")
makeNgramPlot("en_US.news.3.gramFrequencies.txt")
makeNgramPlot("en_US.twitter.3.gramFrequencies.txt")

We didn’t remove apostrophes in our cleanup, but by default they are removed when the document term matrix is created with the tm library, so the apostrophes are missing from our results. We will have to investigate how to change this default behaviour in order to keep them.
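
One possible workaround, not yet tried on the real files, would be to count the n-grams without tm and split the cleaned text on spaces only. The sketch below assumes the text has already gone through the clean-up; the helper countNgrams is hypothetical and is not tuned for files of this size.

```r
# Hypothetical helper: count n-grams in a character vector of cleaned lines,
# splitting on spaces only so that apostrophes stay inside the words
countNgrams <- function(lines, n) {
    grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
        words <- words[words != ""]
        if (length(words) < n) return(character(0))
        starts <- seq_len(length(words) - n + 1)
        # Paste together the n words that start at each position
        vapply(starts, function(s) paste(words[s:(s + n - 1)], collapse = " "),
               character(1))
    }))
    sort(table(grams), decreasing = TRUE)
}

toyText <- c("i don't know", "i don't think so", "don't know")
bigramCounts <- countNgrams(toyText, 2)
print(bigramCounts)
```

In this toy input "don't" stays a single word in every 2-gram that contains it.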

Appendix: Full source code

# Does the data cleanup of all the raw files and
# calculates their n-gram frequencies
cleanUpRawFiles <- function() {
    # Remove urls, special chars, numbers, profanities
    # and extra white spaces, and convert to lowercase
    preProcess("en_US.blogs.txt")
    preProcess("en_US.news.txt")
    preProcess("en_US.twitter.txt")
    reduceNumberOfLines("en_US.twitter.txt.preprocessed")
    
    # The frequencies are saved to files, so nothing is returned
    getNgramFrequencies("en_US.blogs.txt.preprocessed")
    getNgramFrequencies("en_US.news.txt.preprocessed")
    getNgramFrequencies("en_US.twitter.txt.preprocessed.reduced")
}

# Helper function that actually does the data
# cleaning of a raw file
preProcess <- function(sourceFile) {
    library(stringr)
    validateFile(sourceFile)
    
    logActivity("Reading data.")
    # Read data
    myData <- readLines(sourceFile, skipNul = TRUE)
    
    # Remove urls
    logActivity("Removing urls.")
    urlPattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    myData <- str_replace_all(myData, urlPattern, " ")
    
    # Remove special chars
    logActivity("Removing special characters and numbers.")
    myData <- str_replace_all(myData, "[^A-Za-z'\\- ]", 
        " ")
    
    # Remove profanities (whole words only, ignoring case,
    # so that words such as "class" are left intact)
    logActivity("Removing profanities")
    profanities <- c("arse", "ass", "asshole", "bastard", 
        "bitch", "bollocks", "crap", "damn", "fuck", 
        "goddam", "shit")
    profanitiesPattern <- paste0("\\b(?:", paste(profanities, 
        collapse = "|"), ")\\b")
    myData <- str_replace_all(myData, regex(profanitiesPattern, 
        ignore_case = TRUE), " ")
    
    # Collapse white spaces
    logActivity("Removing extra white spaces.")
    spacePattern <- "\\s+"
    myData <- str_replace_all(myData, spacePattern, 
        " ")
    
    # Convert the text to lower case
    logActivity("Converting to lower case.")
    myData <- tolower(myData)
    
    writeLines(myData, paste(sourceFile, "preprocessed", 
        sep = "."))
    
    logActivity("Finished processing.")
    # return(docs)
}

# Auxiliary function to combine lines from a text
# file, if the number of lines exceeds the
# parameter n (n defaults to 1 million). It writes
# a file that has 1 million or fewer lines. Needed
# because the creation of a corpus with more than 1
# million lines crashes my computer
reduceNumberOfLines <- function(sourceFile, n = 1e+06) {
    library(stringr)
    validateFile(sourceFile)
    
    logActivity("Reading data.")
    # Read data
    myData <- readLines(sourceFile, skipNul = TRUE)
    initialRows <- length(myData)
    reductionRate <- floor((initialRows/n)) + 1
    print(paste("Reduction rate:", reductionRate, sep = " "))
    
    logActivity("Saving data.")
    outputFile <- paste(sourceFile, "reduced", sep = ".")
    # Number of combined lines needed to cover the whole file
    numGroups <- ceiling(initialRows/reductionRate)
    for (i in 1:numGroups) {
        concatenatedText <- ""
        for (j in 1:reductionRate) {
            # Use <= so that the last line of the file is not dropped
            if ((i - 1) * reductionRate + j <= initialRows) {
                textToAdd <- myData[(i - 1) * reductionRate + 
                  j]
                concatenatedText <- paste(concatenatedText, 
                  textToAdd, sep = " ")
                concatenatedText <- str_replace_all(concatenatedText, 
                  "\\s+", " ")
            }
        }
        write(concatenatedText, file = outputFile, 
            append = TRUE)
        processedRecords <- i * reductionRate
        if (i%%5000 == 0) {
            # sprintf() alone prints nothing inside a loop
            print(sprintf("%d records processed", processedRecords))
        }
    }
}

# Auxiliary function that checks that a file parameter
# is provided and that the file exists
validateFile <- function(sourceFile) {
    if (is.null(sourceFile)) {
        stop("sourceFile not provided")
    }
    if (!file.exists(sourceFile)) {
        stop(paste(sourceFile, "does not exist", sep = " "))
    }
}

# Calculates frequencies of n-grams that appear at
# least lowFreq times and saves them to a file
getNgramFrequencies <- function(inputFile, lowFreq = 1000) {
    library(tm)
    library(RWeka)
    library(stringr)
    library(textmineR)
    
    
    for (ngram in 1:3) {
        docs <- createCorpora(inputFile)
        prefix <- str_replace_all(inputFile, "\\.txt|\\.reduced|\\.preprocessed", 
            "")
        outputFile <- paste(prefix, ngram, "gramFrequencies", 
            "txt", sep = ".")
        
        # Creating document term matrix
        logActivity("Creating document term matrix.")
        myTokenizer <- function(x) NGramTokenizer(x, 
            Weka_control(min = ngram, max = ngram))
        dtm <- DocumentTermMatrix(docs, control = list(tokenize = myTokenizer))
        rm(docs)
        
        # Create sparse matrix
        logActivity("Creating sparse matrix.")
        dtm_sparse <- MakeSparseDTM(dtm)
        
        # Get frequent terms
        logActivity("Getting frequent terms.")
        freqTerms <- findFreqTerms(dtm, lowfreq = lowFreq)
        
        logActivity("Calculating frequencies and saving to file.")
        i <- 0
        j <- 0
        # Iterate over the columns and calculate the
        # frequencies
        for (col in 1:ncol(dtm)) {
            i <- i + 1
            if (i%%500 == 0) {
                # sprintf() alone prints nothing inside a loop
                print(sprintf("%d records processed", i))
            }
            
            # Get term
            word <- dtm$dimnames$Terms[col]
            if (word %in% freqTerms) {
                j <- j + 1
                if (j%%500 == 0) {
                  print(sprintf("Added %d records to file", 
                    j))
                }
                # Get frequency of term
                freq <- sum(dtm_sparse[, col])
                # Save term and frequency to file
                write(paste(word, freq, sep = ","), 
                  file = outputFile, append = TRUE)
            }
        }
    }
    
}

# Returns a corpus object corresponding to the
# data of the file provided
createCorpora <- function(sourceFile) {
    library(tm)
    validateFile(sourceFile)
    
    logActivity("Reading data.")
    # Read data
    myData <- readLines(sourceFile, skipNul = TRUE)
    
    # Generate corpus
    logActivity("Generating corpus.")
    docs <- VCorpus(VectorSource(myData))
    logActivity("Finished creating corpus.")
    return(docs)
}

# Utility to display a message that includes the
# time. Used to keep track of the time that each
# action takes
logActivity <- function(text) {
    print(paste(text, Sys.time(), sep = " "))
}

# Makes a barplot of the 10 most used n-grams
# Reads the data from the files produced by
# getNgramFrequencies()
makeNgramPlot <- function(file) {
    validateFile(file)
    myData <- read.csv(file, header = FALSE)
    # Sort by frequency and keep the 10 most frequent terms
    top10 <- head(myData[order(myData$V2, decreasing = TRUE), 
        ], 10)
    names(top10) <- c("Term", "Frequency")
    barplot(top10$Frequency, names.arg = top10$Term, 
        las = 2)
}

# Makes barplots of the frequencies of the top 10
# n-grams from all raw files
makeNgramPlots <- function() {
    # One row of plots per n-gram size, one column per raw file
    par(mfrow = c(3, 3))
    for (ngram in 1:3) {
        for (rawFile in c("blogs", "news", "twitter")) {
            makeNgramPlot(paste("en_US", rawFile, ngram, 
                "gramFrequencies", "txt", sep = "."))
        }
    }
}

Assignment 1 for Capstone Project

Jose Rosengurtt