library(R.utils)
library(tm)
library(dplyr)
library(slam)
library(ggplot2)
library(RWeka)
library(parallel)
options(mc.cores=1)
For the Data Science capstone project we’ll be building a word prediction model based on a set of documents. In this report I’ll perform the loading of the documents and will perform an exploratory data analysis.
First off we’ll need to download the necessary files.
#Set the URL for the download
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#To prevent unnecessary downloading, only perform this if the final directory is not available
if(!dir.exists("./Data/final"))
{
download.file(url, destfile = "./Data/Coursera-Swiftkey.zip", method = 'curl', quiet=TRUE)
unzip("./Data/Coursera-Swiftkey.zip")
}
Let’s take a look at what we downloaded:
writeLines(dir('./Data/final', recursive=TRUE))
## de_DE/de_DE.blogs.txt
## de_DE/de_DE.news.txt
## de_DE/de_DE.twitter.txt
## en_US/en_US.blogs.txt
## en_US/en_US.news.txt
## en_US/en_US.twitter.txt
## fi_FI/fi_FI.blogs.txt
## fi_FI/fi_FI.news.txt
## fi_FI/fi_FI.twitter.txt
## ru_RU/ru_RU.blogs.txt
## ru_RU/ru_RU.news.txt
## ru_RU/ru_RU.twitter.txt
There are a total of twelve different documents divided over four languages: English (EN), German (DE), Finnish (FI) and Russion (RU). For this report we’ll only be focusing on the English documents. These range in size between 167 and 225 MB. So how many lines are we talking about here? Let’s take the blog file as an example.
countLines('./Data/final/en_US/en_US.blogs.txt')
## [1] 899288
## attr(,"lastLineHasNewline")
## [1] TRUE
Wow, that’s a lot. Let’s take a look at the first few lines of the English blog, to see what we’re dealing with:
writeLines(readLines('./Data/final/en_US/en_US.blogs.txt')[1:3])
## In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## We love you Mr. Brown.
## Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
It’s clear that every line has no connection to the previous. We can assume that every line in this file is a separate sample and thus a document in its own right.
Of course a textfile of this size will not fit in the memory of most computers. We’ll be making use of sample files to bypass this problem. A 5% sample should be sufficient for analyses purposes. This will still give us a large number of samples per file to work with. For this purpose I’ve written two functions. The first one will take a sample of a file and write it at a given location. It could be possible to read the file in chunks, but I figured a 200+ MB file should pose no problems for a computer with 8 GB memory.
sampleFile <- function(source, destination)
{
con <- readLines(source) #Read the file in memory
#Create a sample based on a vector of random binominal values with a 10 percent chance of success.
sample <- con[rbinom(length(con), 1, 0.05) > 0 ]
rm(con) #Remove the file from memory
writeLines(sample, destination) #Write the sample to its destination
}
Next we’ll also write a function to trawl through the different documents and call the sampleFile function for each of them. It takes a sourcedir variable with the root folder for all the documents and a destdir folder with the root for where all files should be written to.
processFiles <- function(sourcedir, destdir)
{
if(!dir.exists(destdir))
dir.create(destdir)
#This function section will create the subfolders if they're not present.
dirs <- list.dirs(sourcedir)
dirs <- dirs[nchar(dirs) > 0]
dirs <- paste(destdir,dirs, sep='/')
#Invisible to prevent return values
invisible(sapply(dirs, function(x){if(!dir.exists(x))dir.create(x)}))
#Find all text files hidden within the subfolders
files <- dir(sourcedir, "*.txt", recursive = T)
for( i in 1:length(files)) #For each found list
{
#Create a full source filepath
source <- paste(sourcedir, files[i], sep='/')
#Create a full destination filepath
destination <- paste(destdir, files[i], sep='/')
#Call the sampleFile function to process.
sampleFile(source, destination)
}
}
Next we set the seed and start with our sampling.
#We'll set a seed for reproducibility
set.seed(12345)
if(!dir.exists('./Data/samples'))
processFiles('./Data/final', './Data/samples')
We’ll now have a set of sample files only 5% of their original size. This should be small enough for general analysis purposes.
For our analysis, we’ll be focussing on just the English files. We’ll create a corpus (a collection of documents) of all the English files (blogs, tweets and news messages). Because each line is a separate entity, we’ll be adding each line as a separate document in the corpus.
Since we want each line to be a separate document, we will first need to split up each file. To prevent any unnecessary strain on the harddrive, we’ll process each file in memory before feeding it to the corpus. For this purpose a special function is written. I prefer to write everything in functions, it allows for reusability in the future.
#Parameter is the language folder where the documents are located
createCorpus <- function(sourcedir)
{
#Extract the language from the subfolder (Last folder)
lan <- tail(strsplit(sourcedir, '/')[[1]], 1)
#Assemble a vector with the files and their full pathnames
sourcedocs <- list.files(sourcedir, full.names = T )
#Assemble a second vector with just the filenames
sourcefile <- list.files(sourcedir, full.names = F )
#Create a dummy variable. We'll combine the corpus when ready.
corp <- NULL
#For each document
for(i in 1:length(sourcedocs))
{
#Read the files into a variable
corpfile <- readLines(sourcedocs[i], encoding = lan)
#Create a temporary corpus file based on the corpfile Vector
corptemp <- VCorpus(VectorSource(corpfile)
, readerControl = list(reader=readPlain, language = lan))
#Add the filename to the metadata
meta(corptemp, "source") <- sourcefile[i]
#If it's the first document then just create the corpus
if(i == 1)
corp <- corptemp
#Otherwise combine both of them to one large corpus
else
corp <- c(corp, corptemp)
}
#Return corp value
corp
}
#Call the function
corp <- createCorpus('./Data/samples/en_US')
#Display basic information
corp
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 1
## Content: documents: 213710
Our corpus shows that it contains over 200,000 documents. Since we added the sourcefile to the metadata during creation, we can even see from which sourcefile the data came from:
group_by(meta(corp), source) %>% summarise(total_docs = n())
## # A tibble: 3 × 2
## source total_docs
## <chr> <int>
## 1 en_US.blogs.txt 45216
## 2 en_US.news.txt 50308
## 3 en_US.twitter.txt 118186
A problem with language is that it’s not structured. A lot of junk can be inserted. Let’s consider the goal of this exercise. We intend to create a body of text to base our predictions on. This means that we don’t need:
But because we want to predict a written language, we can’t shorten or remove words. It’s tempting to shorten our list of possible words by removing stopwords or “stemming” words (shortening similar words to their common root).
For now we’ll clean up our text with the following actions:
First we’ll create a custom function to change certain patterns to a space. This function is not mine but copied from the helpful blog Eight2Late. The idea is that some people tend to write words like “Therefore I say:Hi!”. Without cleaning this up, tm will transform this as “therefore i sayhi”. By first replacing the colon with a space, we’ll preserve the words in their intended form.
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
So now let’s clean up our text:
corp <- tm_map(corp, toSpace, "-")
corp <- tm_map(corp, toSpace, ":")
corp <- tm_map(corp, toSpace, "/")
Next we need to clean up our twitter documents. All those hashtags and handles need to be replaced.
We’ll make two new functions and apply them to our corpus. The regex expression is adapted from this stackoverflow post
labelHandle <- content_transformer(function(x) {return (gsub("\\S*@(?:\\[[^\\]]+\\]|\\S+)", "twitterhandle", x))})
labelHashtag <- content_transformer(function(x) {return (gsub("\\S*#(?:\\[[^\\]]+\\]|\\S+)", "twitterhashtag", x))})
corp <- tm_map(corp, labelHandle)
corp <- tm_map(corp, labelHashtag)
And finally we’ll remove numbers, punctuation and bring all our text to lowercase.
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(tolower))
It’s nice that we have our words cleaned up, but we know very little about what we have without looking at specific samples. To fix this we’ll create a DocumentTermMatrix. A sparse matrix with a column for each unique word and a row for each document.
corpdtm <- DocumentTermMatrix(corp)
corpdtm
## <<DocumentTermMatrix (documents: 213710, terms: 121580)>>
## Non-/sparse entries: 3469719/25979392081
## Sparsity : 100%
## Maximal term length: 322
## Weighting : term frequency (tf)
We got a collection of 213710 documents, which matches our corpus, and we have 121580 terms, in this case distinct words. That’s a lot of words, I’m sure there are some words that are rarely used. Let’s take a look at what we’ve got.
dat <- data.frame(word = colnames(corpdtm), total = col_sums(corpdtm), mean = col_means(corpdtm))
dat <- arrange(dat, desc(total))
Our top 5 words are:
head(dat, 5)
## word total mean
## 1 the 237116 1.1095222
## 2 and 119352 0.5584764
## 3 for 54846 0.2566375
## 4 that 51919 0.2429414
## 5 you 46798 0.2189790
This isn’t unexpected. “The”" is the most popular word and appears on average at least once in every document. One thing that’s quite clear is that the frequency of the word is rapidly dropping. Let’s take a look at how fast this drops, we’ll present it in log10.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.429 0.699 5.375
Most words have very few occurances, mostly one or two times. The words in our top 6 are a not the norm. So what kind of words can be found in the bottom end of our word list?
tail(dat, 3)
## word total mean
## 121578 谢谢可依you 1 4.679238e-06
## 121579 轻灵的笑声 1 4.679238e-06
## 121580 追星梦 1 4.679238e-06
These are clearly not words that are supposed to be in an english blog. It’s safe to say that if a word only appears once in 121580 documents, there are not enough observations to make a prediction. For now we’ll leave the words in, but we’ll filter them out when we create our n-grams.
So we looked at some word statistics. Let’s also look at some document information. What is the average word length of a document? Is there any difference between news, blog and twitter documents?
dat <- data.frame(source = meta(corp)$source, total = row_sums(corpdtm))
g <- ggplot(data=dat, aes(x=source, y = total))
g <- g+geom_boxplot(aes(col=source)) + ylab("Word count")
print(g)
Of course the Twitter source contains the fewest words per document. With a 140 character limit that’s to be expected. It’s clear that the news articles have a more uniform size, the distance between the 25th and 75 quartile is a lot smaller. The fact that many documents have a small word count shows that it was a good idea to consider each line a document in it’s own right. If we didn’t then we would eventually end up with n-grams constructed over multiple documents, these wouldn’t make any sense and would pollute our model.
Now that we took a look at the words in our document, it’s time to take a look at the possible three-word combinations (tri-grams) in this corpus. We’ll construct a new term matrix with all the possible tri-grams.
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdmngram <- TermDocumentMatrix(corp, control = list(tokenize = TrigramTokenizer))
tdmngram
## <<TermDocumentMatrix (terms: 3344954, documents: 213710)>>
## Non-/sparse entries: 4561985/714845557355
## Sparsity : 100%
## Maximal term length: 134
## Weighting : term frequency (tf)
Wow, we have a LOT more terms now. Just like our word list, let’s create an overview of the top 6 tri-grams:
dat <- data.frame(word = rownames(tdmngram), total = row_sums(tdmngram), mean = row_means(tdmngram))
dat <- arrange(dat, desc(total))
head(dat)
## word total mean
## 1 one of the 1684 0.007879837
## 2 a lot of 1513 0.007079687
## 3 thanks for the 1217 0.005694633
## 4 to be a 907 0.004244069
## 5 going to be 896 0.004192597
## 6 the end of 759 0.003551542
One thing that’s noticeable is that the mean of even the top tri-gram is significantly lower than the mean of the top word. Let’s see the number of occurances in log10 displayed in a histograph.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05124 0.00000 3.22600
It’s clear we need to clean up our term document matrix. We could filter on sparsity, the percentage of total documents that a certain term does NOT appear in, but our list of documents is so big that even our term document matrix tells us that we have attained a 100% sparsity.
As proof, look what happens when we filter our matrix to 99.9% sparsity:
removeSparseTerms(tdmngram, 0.999)
## <<TermDocumentMatrix (terms: 139, documents: 213710)>>
## Non-/sparse entries: 52094/29653596
## Sparsity : 100%
## Maximal term length: 18
## Weighting : term frequency (tf)
We end up with only 139 terms. Not nearly enough to create a good model. Instead, we’ll filter on minimum amount of occurances. We’ll maintain that a tri-gram should appear at least 10 times for it to be included in our matrix.
tdmngram <- tdmngram[findFreqTerms(tdmngram, lowfreq = 10),]
tdmngram
## <<TermDocumentMatrix (terms: 24373, documents: 213710)>>
## Non-/sparse entries: 613233/5208140597
## Sparsity : 100%
## Maximal term length: 44
## Weighting : term frequency (tf)
There we go, we got 24373 terms that have occured at least once in our overview. Let’s take a new look at our data. First we’ll collect and order the terms again. After that we’ll check our last 6 terms to see if they indeed have at least 10 occurances.
dat <- data.frame(word = rownames(tdmngram), total = row_sums(tdmngram), mean = row_means(tdmngram))
dat <- arrange(dat, desc(total))
tail(dat)
## word total mean
## 24368 you were on 10 4.679238e-05
## 24369 you were to 10 4.679238e-05
## 24370 you who are 10 4.679238e-05
## 24371 you will do 10 4.679238e-05
## 24372 you would love 10 4.679238e-05
## 24373 you you can 10 4.679238e-05
There we go, all of them at least 10 occurances. Let’s see how the new distribution looks like:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.079 1.204 1.274 1.398 3.226
This looks a lot better. It’s still in log 10, but at least the huge spike of one hit wonders has been removed.
So, let’s poke around in our data. What’s better, love or hate?
filter(dat, word == "i love you" | word == "i hate you")
## word total mean
## 1 i love you 443 0.0020729025
## 2 i hate you 44 0.0002058865
Wow, ten times more love than hate in this corpus. But besides you, what do people love most? :) Here’s the top 10:
head(filter(dat, grepl("i love ", word)), 10)
## word total mean
## 1 i love you 443 0.0020729025
## 2 i love the 203 0.0009498854
## 3 i love it 174 0.0008141875
## 4 i love my 125 0.0005849048
## 5 i love that 113 0.0005287539
## 6 i love this 79 0.0003696598
## 7 i love to 67 0.0003135090
## 8 i love your 65 0.0003041505
## 9 i love how 64 0.0002994712
## 10 i love u 50 0.0002339619
It’s still you :)
To create a model, I would start with a 2, 3 and 4-gram matrix. Each of these would be used to find the most frequent 2nd, 3rd or 4th word in each sequence. For example, with a 4-gram matrix, I would create a list of most popular sequences just like the previous examples. However, I would then split the sentence of four words into two seperate columns. “I love vanilla icecream” would turn into c(“I love vanilla”, “icecream”). The total count would indicate it’s popularity. So next time someone types “I love vanilla” it would look up all 4-grams which begin with that sentence and take the most popular one. In case the combination of words can not be found, a new search will be performed in the 3-gram matrix for the last two words and so on.
So far our exploratory analysis. Certain things might change in the future. For example the 10 count cut-off for the tri-gram matrix might seem too harsh when it’s time to create prediction models. I might decide to lower or raise the cut-off to increase accuracy or reduce system load respectively.