library(R.utils)
library(tm)
library(dplyr)
library(slam)
library(ggplot2)
library(RWeka)
library(parallel)

options(mc.cores=1)

Introduction

For the Data Science capstone project we’ll be building a word prediction model based on a set of documents. In this report I’ll perform the loading of the documents and will perform an exploratory data analysis.

Loading the data

Downloading

First off we’ll need to download the necessary files.

#Set the URL for the download
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" 
#To prevent unnecessary downloading, only perform this if the final directory is not available
if(!dir.exists("./Data/final"))
{
    download.file(url, destfile = "./Data/Coursera-Swiftkey.zip", method = 'curl', quiet=TRUE)
    unzip("./Data/Coursera-Swiftkey.zip")
}

Let’s take a look at what we downloaded:

writeLines(dir('./Data/final', recursive=TRUE))
## de_DE/de_DE.blogs.txt
## de_DE/de_DE.news.txt
## de_DE/de_DE.twitter.txt
## en_US/en_US.blogs.txt
## en_US/en_US.news.txt
## en_US/en_US.twitter.txt
## fi_FI/fi_FI.blogs.txt
## fi_FI/fi_FI.news.txt
## fi_FI/fi_FI.twitter.txt
## ru_RU/ru_RU.blogs.txt
## ru_RU/ru_RU.news.txt
## ru_RU/ru_RU.twitter.txt

There are a total of twelve different documents divided over four languages: English (EN), German (DE), Finnish (FI) and Russion (RU). For this report we’ll only be focusing on the English documents. These range in size between 167 and 225 MB. So how many lines are we talking about here? Let’s take the blog file as an example.

countLines('./Data/final/en_US/en_US.blogs.txt')
## [1] 899288
## attr(,"lastLineHasNewline")
## [1] TRUE

Wow, that’s a lot. Let’s take a look at the first few lines of the English blog, to see what we’re dealing with:

writeLines(readLines('./Data/final/en_US/en_US.blogs.txt')[1:3])
## In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## We love you Mr. Brown.
## Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.

It’s clear that every line has no connection to the previous. We can assume that every line in this file is a separate sample and thus a document in its own right.

Sampling

Of course a textfile of this size will not fit in the memory of most computers. We’ll be making use of sample files to bypass this problem. A 5% sample should be sufficient for analyses purposes. This will still give us a large number of samples per file to work with. For this purpose I’ve written two functions. The first one will take a sample of a file and write it at a given location. It could be possible to read the file in chunks, but I figured a 200+ MB file should pose no problems for a computer with 8 GB memory.

sampleFile <- function(source, destination)
{
    con <- readLines(source) #Read the file in memory
     #Create a sample based on a vector of random binominal values with a 10 percent chance of success. 
    sample <- con[rbinom(length(con), 1, 0.05) > 0 ]
    rm(con) #Remove the file from memory
    writeLines(sample, destination) #Write the sample to its destination
}

Next we’ll also write a function to trawl through the different documents and call the sampleFile function for each of them. It takes a sourcedir variable with the root folder for all the documents and a destdir folder with the root for where all files should be written to.

processFiles <- function(sourcedir, destdir)
{
    if(!dir.exists(destdir))
        dir.create(destdir)
    
    #This function section will create the subfolders if they're not present. 
    dirs <- list.dirs(sourcedir) 
    dirs <- dirs[nchar(dirs) > 0]
    dirs <- paste(destdir,dirs, sep='/')
    #Invisible to prevent return values
    invisible(sapply(dirs, function(x){if(!dir.exists(x))dir.create(x)})) 
    
    #Find all text files hidden within the subfolders
    files <- dir(sourcedir, "*.txt", recursive = T) 
    
    for( i in 1:length(files)) #For each found list
    {
        #Create a full source filepath
        source <- paste(sourcedir, files[i], sep='/') 
        #Create a full destination filepath
        destination <- paste(destdir, files[i], sep='/') 
        #Call the sampleFile function to process.
        sampleFile(source, destination)  
    }
    
}

Next we set the seed and start with our sampling.

#We'll set a seed for reproducibility
set.seed(12345)
if(!dir.exists('./Data/samples'))
    processFiles('./Data/final', './Data/samples')

We’ll now have a set of sample files only 5% of their original size. This should be small enough for general analysis purposes.

Analysis

For our analysis, we’ll be focussing on just the English files. We’ll create a corpus (a collection of documents) of all the English files (blogs, tweets and news messages). Because each line is a separate entity, we’ll be adding each line as a separate document in the corpus.

Corpus

Since we want each line to be a separate document, we will first need to split up each file. To prevent any unnecessary strain on the harddrive, we’ll process each file in memory before feeding it to the corpus. For this purpose a special function is written. I prefer to write everything in functions, it allows for reusability in the future.

#Parameter is the language folder where the documents are located
createCorpus <- function(sourcedir) 
{
    #Extract the language from the subfolder (Last folder)
    lan <- tail(strsplit(sourcedir, '/')[[1]], 1)
    
    #Assemble a vector with the files and their full pathnames
    sourcedocs <- list.files(sourcedir, full.names = T )
    #Assemble a second vector with just the filenames
    sourcefile <- list.files(sourcedir, full.names = F )
    #Create a dummy variable. We'll combine the corpus when ready. 
    corp <- NULL
    #For each document
    for(i in 1:length(sourcedocs))
    {
        #Read the files into a variable
        corpfile <- readLines(sourcedocs[i], encoding = lan)
        #Create a temporary corpus file based on the corpfile Vector
        corptemp  <- VCorpus(VectorSource(corpfile)
                         , readerControl = list(reader=readPlain, language = lan))
        
        #Add the filename to the metadata
        meta(corptemp, "source") <- sourcefile[i]
        #If it's the first document then just create the corpus
        if(i == 1)
            corp <- corptemp
        #Otherwise combine both of them to one large corpus
        else
            corp <- c(corp, corptemp)
        
    }
    #Return corp value
    corp
}

#Call the function
corp <- createCorpus('./Data/samples/en_US')
#Display basic information
corp
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 1
## Content:  documents: 213710

Our corpus shows that it contains over 200,000 documents. Since we added the sourcefile to the metadata during creation, we can even see from which sourcefile the data came from:

group_by(meta(corp), source) %>% summarise(total_docs = n())
## # A tibble: 3 × 2
##              source total_docs
##               <chr>      <int>
## 1   en_US.blogs.txt      45216
## 2    en_US.news.txt      50308
## 3 en_US.twitter.txt     118186

Cleaning

A problem with language is that it’s not structured. A lot of junk can be inserted. Let’s consider the goal of this exercise. We intend to create a body of text to base our predictions on. This means that we don’t need:

  • Numbers
  • Punctuation
  • Twitter tags

But because we want to predict a written language, we can’t shorten or remove words. It’s tempting to shorten our list of possible words by removing stopwords or “stemming” words (shortening similar words to their common root).

For now we’ll clean up our text with the following actions:

  • Change the characters :, - and / to space.
  • Bring all words to lower case
  • Remove all numbers
  • Remove all punctuation

First we’ll create a custom function to change certain patterns to a space. This function is not mine but copied from the helpful blog Eight2Late. The idea is that some people tend to write words like “Therefore I say:Hi!”. Without cleaning this up, tm will transform this as “therefore i sayhi”. By first replacing the colon with a space, we’ll preserve the words in their intended form.

toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})

So now let’s clean up our text:

corp <- tm_map(corp, toSpace, "-")
corp <- tm_map(corp, toSpace, ":")
corp <- tm_map(corp, toSpace, "/")

Next we need to clean up our twitter documents. All those hashtags and handles need to be replaced.
We’ll make two new functions and apply them to our corpus. The regex expression is adapted from this stackoverflow post

labelHandle <- content_transformer(function(x) {return (gsub("\\S*@(?:\\[[^\\]]+\\]|\\S+)", "twitterhandle", x))})
labelHashtag <- content_transformer(function(x) {return (gsub("\\S*#(?:\\[[^\\]]+\\]|\\S+)", "twitterhashtag", x))})

corp <- tm_map(corp, labelHandle)
corp <- tm_map(corp, labelHashtag)

And finally we’ll remove numbers, punctuation and bring all our text to lowercase.

corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(tolower))

Document Term Matrix

It’s nice that we have our words cleaned up, but we know very little about what we have without looking at specific samples. To fix this we’ll create a DocumentTermMatrix. A sparse matrix with a column for each unique word and a row for each document.

corpdtm <- DocumentTermMatrix(corp)
corpdtm 
## <<DocumentTermMatrix (documents: 213710, terms: 121580)>>
## Non-/sparse entries: 3469719/25979392081
## Sparsity           : 100%
## Maximal term length: 322
## Weighting          : term frequency (tf)

We got a collection of 213710 documents, which matches our corpus, and we have 121580 terms, in this case distinct words. That’s a lot of words, I’m sure there are some words that are rarely used. Let’s take a look at what we’ve got.

dat <- data.frame(word = colnames(corpdtm), total = col_sums(corpdtm), mean = col_means(corpdtm))
dat <- arrange(dat, desc(total))

Our top 5 words are:

head(dat, 5)
##   word  total      mean
## 1  the 237116 1.1095222
## 2  and 119352 0.5584764
## 3  for  54846 0.2566375
## 4 that  51919 0.2429414
## 5  you  46798 0.2189790

This isn’t unexpected. “The”" is the most popular word and appears on average at least once in every document. One thing that’s quite clear is that the frequency of the word is rapidly dropping. Let’s take a look at how fast this drops, we’ll present it in log10.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.429   0.699   5.375

Most words have very few occurances, mostly one or two times. The words in our top 6 are a not the norm. So what kind of words can be found in the bottom end of our word list?

tail(dat, 3)
##               word total         mean
## 121578 谢谢可依you     1 4.679238e-06
## 121579  轻灵的笑声     1 4.679238e-06
## 121580      追星梦     1 4.679238e-06

These are clearly not words that are supposed to be in an english blog. It’s safe to say that if a word only appears once in 121580 documents, there are not enough observations to make a prediction. For now we’ll leave the words in, but we’ll filter them out when we create our n-grams.

So we looked at some word statistics. Let’s also look at some document information. What is the average word length of a document? Is there any difference between news, blog and twitter documents?

dat <- data.frame(source = meta(corp)$source, total = row_sums(corpdtm))
g <- ggplot(data=dat, aes(x=source, y = total))
g <- g+geom_boxplot(aes(col=source)) + ylab("Word count")
print(g)

Of course the Twitter source contains the fewest words per document. With a 140 character limit that’s to be expected. It’s clear that the news articles have a more uniform size, the distance between the 25th and 75 quartile is a lot smaller. The fact that many documents have a small word count shows that it was a good idea to consider each line a document in it’s own right. If we didn’t then we would eventually end up with n-grams constructed over multiple documents, these wouldn’t make any sense and would pollute our model.

Tri-grams

Now that we took a look at the words in our document, it’s time to take a look at the possible three-word combinations (tri-grams) in this corpus. We’ll construct a new term matrix with all the possible tri-grams.

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdmngram <- TermDocumentMatrix(corp, control = list(tokenize = TrigramTokenizer))
tdmngram
## <<TermDocumentMatrix (terms: 3344954, documents: 213710)>>
## Non-/sparse entries: 4561985/714845557355
## Sparsity           : 100%
## Maximal term length: 134
## Weighting          : term frequency (tf)

Wow, we have a LOT more terms now. Just like our word list, let’s create an overview of the top 6 tri-grams:

dat <- data.frame(word = rownames(tdmngram), total = row_sums(tdmngram), mean = row_means(tdmngram))
dat <- arrange(dat, desc(total))
head(dat)
##             word total        mean
## 1     one of the  1684 0.007879837
## 2       a lot of  1513 0.007079687
## 3 thanks for the  1217 0.005694633
## 4        to be a   907 0.004244069
## 5    going to be   896 0.004192597
## 6     the end of   759 0.003551542

One thing that’s noticeable is that the mean of even the top tri-gram is significantly lower than the mean of the top word. Let’s see the number of occurances in log10 displayed in a histograph.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05124 0.00000 3.22600

It’s clear we need to clean up our term document matrix. We could filter on sparsity, the percentage of total documents that a certain term does NOT appear in, but our list of documents is so big that even our term document matrix tells us that we have attained a 100% sparsity.
As proof, look what happens when we filter our matrix to 99.9% sparsity:

removeSparseTerms(tdmngram, 0.999)
## <<TermDocumentMatrix (terms: 139, documents: 213710)>>
## Non-/sparse entries: 52094/29653596
## Sparsity           : 100%
## Maximal term length: 18
## Weighting          : term frequency (tf)

We end up with only 139 terms. Not nearly enough to create a good model. Instead, we’ll filter on minimum amount of occurances. We’ll maintain that a tri-gram should appear at least 10 times for it to be included in our matrix.

tdmngram <-  tdmngram[findFreqTerms(tdmngram, lowfreq = 10),]
tdmngram
## <<TermDocumentMatrix (terms: 24373, documents: 213710)>>
## Non-/sparse entries: 613233/5208140597
## Sparsity           : 100%
## Maximal term length: 44
## Weighting          : term frequency (tf)

There we go, we got 24373 terms that have occured at least once in our overview. Let’s take a new look at our data. First we’ll collect and order the terms again. After that we’ll check our last 6 terms to see if they indeed have at least 10 occurances.

dat <- data.frame(word = rownames(tdmngram), total = row_sums(tdmngram), mean = row_means(tdmngram))
dat <- arrange(dat, desc(total))
tail(dat)
##                 word total         mean
## 24368    you were on    10 4.679238e-05
## 24369    you were to    10 4.679238e-05
## 24370    you who are    10 4.679238e-05
## 24371    you will do    10 4.679238e-05
## 24372 you would love    10 4.679238e-05
## 24373    you you can    10 4.679238e-05

There we go, all of them at least 10 occurances. Let’s see how the new distribution looks like:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.079   1.204   1.274   1.398   3.226

This looks a lot better. It’s still in log 10, but at least the huge spike of one hit wonders has been removed.

So, let’s poke around in our data. What’s better, love or hate?

filter(dat, word == "i love you" | word == "i hate you")
##         word total         mean
## 1 i love you   443 0.0020729025
## 2 i hate you    44 0.0002058865

Wow, ten times more love than hate in this corpus. But besides you, what do people love most? :) Here’s the top 10:

head(filter(dat, grepl("i love ", word)), 10)
##           word total         mean
## 1   i love you   443 0.0020729025
## 2   i love the   203 0.0009498854
## 3    i love it   174 0.0008141875
## 4    i love my   125 0.0005849048
## 5  i love that   113 0.0005287539
## 6  i love this    79 0.0003696598
## 7    i love to    67 0.0003135090
## 8  i love your    65 0.0003041505
## 9   i love how    64 0.0002994712
## 10    i love u    50 0.0002339619

It’s still you :)

How to create a model

To create a model, I would start with a 2, 3 and 4-gram matrix. Each of these would be used to find the most frequent 2nd, 3rd or 4th word in each sequence. For example, with a 4-gram matrix, I would create a list of most popular sequences just like the previous examples. However, I would then split the sentence of four words into two seperate columns. “I love vanilla icecream” would turn into c(“I love vanilla”, “icecream”). The total count would indicate it’s popularity. So next time someone types “I love vanilla” it would look up all 4-grams which begin with that sentence and take the most popular one. In case the combination of words can not be found, a new search will be performed in the 3-gram matrix for the last two words and so on.

Final notes

So far our exploratory analysis. Certain things might change in the future. For example the 10 count cut-off for the tri-gram matrix might seem too harsh when it’s time to create prediction models. I might decide to lower or raise the cut-off to increase accuracy or reduce system load respectively.