To access the version of this report without code, here is the link: http://rpubs.com/rdsn/138846

 

Overview

People are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.

 

Introduction

This project is about creating a Shiny application designed to make text predictions. To do so, we will use the HC Corpora dataset, perform an exploratory analysis, and implement a predictive model that predicts the next word.

 

This Milestone Report presents our understanding of the data, our exploratory analysis, and our plans for implementing a predictive algorithm.

 

Data

HC Corpora is a collection of corpora for various languages, freely available to download. The corpora have been collected from numerous webpages by a web crawler.
More details on the corpora can be found here: http://www.corpora.heliohost.org/aboutcorpus.html. The dataset provides 3 text files (.txt) in each of 4 languages: German, English, Finnish, and Russian.
We will focus here only on the English files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

Here is the link to download the dataset: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Data characteristics

Once the data are loaded into R, we can observe the following characteristics for each file:

# Loading the libraries
library(NLP)
library(tm)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
cname <- file.path("~", "Documents", "Coursera", "Data science - Johns Hopkins University", "Capstone Project", "final", "en_US")

docs <- Corpus(DirSource(cname))
# size of files
blogSize <- as.character(round(file.info("~/Documents/Coursera/Data science - Johns Hopkins University/Capstone Project/final/en_US/en_US.blogs.txt")$size / 1024^2,2))
newSize <- as.character(round(file.info("~/Documents/Coursera/Data science - Johns Hopkins University/Capstone Project/final/en_US/en_US.news.txt")$size / 1024^2,2))
twitterSize <- as.character(round(file.info("~/Documents/Coursera/Data science - Johns Hopkins University/Capstone Project/final/en_US/en_US.twitter.txt")$size / 1024^2,2))
# number of lines per file
blogLines <- as.character(format(length(docs[[1]]$content), big.mark = " "))
newsLines <- as.character(format(length(docs[[2]]$content), big.mark = " "))
twitterLines <- as.character(format(length(docs[[3]]$content), big.mark = " "))
# maximum number of characters per entry
blogMax <- as.character(format(max(nchar(docs[[1]]$content)), big.mark = " "))
newsMax <- as.character(format(max(nchar(docs[[2]]$content)), big.mark = " "))
twitterMax <- as.character(format(max(nchar(docs[[3]]$content)), big.mark = " "))
# minimum number of characters per entry
blogMin <- as.character(min(nchar(docs[[1]]$content)))
newsMin <- as.character(min(nchar(docs[[2]]$content)))
twitterMin <- as.character(min(nchar(docs[[3]]$content)))
# total character count per file (sum of nchar over all entries)
blogWcount <- format(sum(nchar(docs[[1]]$content)), big.mark = " ")
newsWcount <- format(sum(nchar(docs[[2]]$content)), big.mark = " ")
twitterWcount <- format(sum(nchar(docs[[3]]$content)), big.mark = " ")
Data      Size (MB)   Number of lines   Min characters per entry   Max characters per entry   Total character count
Blogs     200.42      899 288           1                          40 833                     206 824 505
News      196.28      1 010 242         1                          11 384                     203 223 159
Twitter   159.36      2 360 148         2                          140                        162 096 031
  • Twitter presents a set of short entries (140 characters max) written in an informal manner.
  • News presents a set of entries written in a formal manner but focused on a narrower range of topics.
  • Blogs sits somewhere between News and Twitter: there is less noise (informal writing) in the texts than in Twitter’s entries, and a wider range of topics is discussed than in News’ entries.

Data subsetting

Each of the three files discussed above is large enough to make computation times a problem. To avoid these issues, we sample each file so as to obtain much smaller objects that remain representative of the originals. Here are the characteristics of those subsets:

# Subsetting
set.seed(1)
subset <- docs
subset[[1]]$content <- subset[[1]]$content[as.logical(rbinom(length(subset[[1]]$content),1,prob = 0.15))]
subset[[2]]$content <- subset[[2]]$content[as.logical(rbinom(length(subset[[2]]$content),1,prob = 0.15))]
subset[[3]]$content <- subset[[3]]$content[as.logical(rbinom(length(subset[[3]]$content),1,prob = 0.15))]

# save(subset, file="subset.RData")
# load("subset.RData")

rm(docs)
# size of the subsets in memory
blogSize2 <- as.character(round(object.size(subset[[1]]$content) / 1024^2,2))
newSize2 <- as.character(round(object.size(subset[[2]]$content) / 1024^2,2))
twitterSize2 <- as.character(round(object.size(subset[[3]]$content) / 1024^2,2))
# number of lines per file
blogLines2 <- as.character(format(length(subset[[1]]$content), big.mark = " "))
newsLines2 <- as.character(format(length(subset[[2]]$content), big.mark = " "))
twitterLines2 <- as.character(format(length(subset[[3]]$content), big.mark = " "))
# maximum number of characters per entry
blogMax2 <- as.character(format(max(nchar(subset[[1]]$content)), big.mark = " "))
newsMax2 <- as.character(format(max(nchar(subset[[2]]$content)), big.mark = " "))
twitterMax2 <- as.character(format(max(nchar(subset[[3]]$content)), big.mark = " "))
# minimum number of characters per entry
blogMin2 <- as.character(min(nchar(subset[[1]]$content)))
newsMin2 <- as.character(min(nchar(subset[[2]]$content)))
twitterMin2 <- as.character(min(nchar(subset[[3]]$content)))
# total character count per subset (sum of nchar over all entries)
blogWcount2 <- sum(nchar(subset[[1]]$content))
blogWcount2t <- format(blogWcount2, big.mark = " ")
newsWcount2 <- sum(nchar(subset[[2]]$content))
newsWcount2t <- format(newsWcount2, big.mark = " ")
twitterWcount2 <- sum(nchar(subset[[3]]$content))
twitterWcount2t <- format(twitterWcount2, big.mark = " ")
Data      Size in memory (MB)   Number of lines   Min characters per entry   Max characters per entry   Total character count
Blogs     37.24                 134 734           1                          37 191                     30 990 754
News      6.57                  151 267           1                          37 191                     4 888 176
Twitter   8.18                  354 075           2                          3 845                      5 357 746

Data Preprocessing

In order to analyse these datasets properly, we need to perform some transformations that will “clean” the data:

  • Removing punctuation: punctuation marks and other special characters are treated by the computer like any other token, although they are not useful for our prediction task.
  • Removing numbers: like punctuation, numbers are not useful for prediction, and removing them also makes our dataset a little smaller.
  • Converting everything to lowercase: for our analysis, we want every word to be treated the same, whether it contains capital letters or not. For example, we want the word “String” to be considered the same as the word “string”.
  • Stemming documents: stemming lets the computer recognize words with different endings as the same token. For example, “stayed”, “stays” and “stay” will be recognized as the same word, as they are all forms of the verb “stay”.
  • Removing whitespace: after all those transformations, documents are left with extra whitespace that we want to remove. For example, “ average” becomes “average”.
  • NOT removing stop words: we could have chosen to remove stop words like “a”, “and”, “also”, but for this analysis those words carry value for predicting the next word, so we choose to keep them.

 

Let’s look at this example to see how those transformations perform:

Before Preprocessing

subset[[1]]$content[13]
## [1] "Origin: Middle English: from Old French joie, based on Latin gaudium, from gaudere ‘rejoice’"

library(SnowballC)

sub <- tm_map(subset, removePunctuation)
sub <- tm_map(sub, removeNumbers)
sub <- tm_map(sub, content_transformer(tolower))
sub <- tm_map(sub, stemDocument, language = "english")
sub <- tm_map(sub, stripWhitespace)

# save(sub, file="sub.RData")
# load("sub.RData")

rm(subset)

After Preprocessing

sub[[1]]$content[13]
## [1] "origin middl english from old french joie base on latin gaudium from gauder rejoic"

 

Exploratory Analysis

Tokenization

Now that we have preprocessed the data, we can build a document-term matrix that will allow us to perform some exploratory analysis and observe word frequencies and n-gram frequencies.

# Creating a document-term matrix (the commented save/load lines below can be used for caching)
mat <- DocumentTermMatrix(sub)
    
# save(mat, file="Mat.RData")
# load("Mat.RData")

 

Word frequencies

# Build a dataframe containing the frequency for each word, sorted in decreasing order
freq <- sort(colSums(as.matrix(mat)), decreasing=TRUE)
wordF <- data.frame(word=names(freq), freq=freq)
# wordF$F <- wordF$freq * 100 /(blogWcount2 + newsWcount2 + twitterWcount2)
library(ggplot2)
require(gridExtra)
## Loading required package: gridExtra
g <- ggplot(wordF[wordF$freq>35000, ], aes(x=word, y=freq))
g <- g + geom_bar(stat="identity", fill = "darkblue")
g <- g + xlab("") + ylab("Frequency (number of occurrences)")
g <- g + ggtitle("Words that appear more than 35 000\ntimes in the 3 Datasets")
g <- g + coord_flip()

h <- ggplot(wordF, aes(x = freq)) + geom_histogram(breaks=seq(0, 20, by = 1), fill="green", alpha = .7)
h <- h + xlab("Frequency (number of occurrences)") + ylab("Number of words")
h <- h + ggtitle("Word frequencies")
h <- h + xlim(0,20)

grid.arrange(g, h, nrow = 1, ncol = 2)

As we can see in this graph, the words that occur most frequently in the 3 texts are mainly connecting words, as we could have expected; for example, the most frequent word here is “the”. This shows that, to build an efficient predictive model, we will have to go further and consider patterns beyond single-word frequencies, namely n-gram frequencies.

 

N-gram frequencies

An n-gram is a contiguous sequence of n items from a given sequence of text.
So in this analysis, we split all the strings into n-grams (n being the number of contiguous words) and look at their frequencies across the 3 text files.
For the purpose of this analysis, we look at 1-grams, 2-grams, and 3-grams.

options(java.parameters = "-Xmx2048m")
library(RWeka)
library(SnowballC)

# 1-gram
if(!file.exists("Unimat.RData")){
    options(mc.cores=1)
    UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
    unimat <- TermDocumentMatrix(sub, control = list(tokenize = UnigramTokenizer))
    save(unimat, file="Unimat.RData")
}else{
      load("Unimat.RData")
}

UgramFreq <- sort(rowSums(as.matrix(unimat)), decreasing=TRUE)
UgramF <- data.frame(word=names(UgramFreq), freq=UgramFreq)

# 2-gram
if(!file.exists("Bimat.RData")){
    options(mc.cores=1)
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    bimat <- TermDocumentMatrix(sub, control = list(tokenize = BigramTokenizer))
    save(bimat, file="Bimat.RData")
}else{
      load("Bimat.RData")
}

BgramFreq <- sort(rowSums(as.matrix(bimat)), decreasing=TRUE)
BgramF <- data.frame(word=names(BgramFreq), freq=BgramFreq)

# 3-gram
if(!file.exists("Trimat.RData")){
    options(mc.cores=1)
    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    trimat <- TermDocumentMatrix(sub, control = list(tokenize = TrigramTokenizer))
    save(trimat, file="Trimat.RData")
}else{
      load("Trimat.RData")
}

TgramFreq <- sort(rowSums(as.matrix(trimat)), decreasing=TRUE)
TgramF <- data.frame(word=names(TgramFreq), freq=TgramFreq)
require(gridExtra)

g <- ggplot(UgramF[UgramF$freq>35000, ], aes(x=word, y=freq))
g <- g + geom_bar(stat="identity", fill = "darkblue")
g <- g + xlab("") + ylab("Frequency (number of occurrences)")
g <- g + ggtitle("1-grams that appear more than 35 000\ntimes in the three Datasets")
g <- g + coord_flip()

h <- ggplot(BgramF[BgramF$freq>10000, ], aes(x=word, y=freq))
h <- h + geom_bar(stat="identity", fill = "green")
h <- h + xlab("") + ylab("Frequency (number of occurrences)")
h <- h + ggtitle("2-grams that appear more than 10 000\ntimes in the three Datasets")
h <- h + coord_flip()

i <- ggplot(TgramF[TgramF$freq>1000, ], aes(x=word, y=freq))
i <- i + geom_bar(stat="identity", fill = "red")
i <- i + xlab("") + ylab("Frequency (number of occurrences)")
i <- i + ggtitle("3-grams that appear more than 1 000\ntimes in the three Datasets")
i <- i + coord_flip()

grid.arrange(g, h, i, nrow = 2, ncol = 2)

UgramLength <- as.character(format(length(UgramF$word),big.mark = " "))
BgramLength <- as.character(format(length(BgramF$word),big.mark = " "))
TgramLength <- as.character(format(length(TgramF$word),big.mark = " "))

Here are the characteristics of each n-gram set defined above:

  • 1-gram: 96 257 different 1-grams; max 1-gram frequency = 359 869
  • 2-gram: 1 337 085 different 2-grams; max 2-gram frequency = 36 175
  • 3-gram: 3 503 532 different 3-grams; max 3-gram frequency = 2 796

Looking at the graphs above, we can see that the frequencies are very skewed: the most frequent n-grams (n = 1, 2 or 3) occur far more often than the others. Furthermore, connecting words such as “the” or “and” have much higher frequencies than the rest.

 

Other steps

Profanity filtering: there are some words that we do not want to predict, and that we do not want to use to build our prediction model either. Those words must be removed. However, we cannot simply remove the words themselves: we have to remove the entire lines in which they appear, because those lines become unusable for prediction once the profane words are taken out.
To perform this task, we can use a list of profane words, for example the one available at https://gist.github.com/jamiew/1112488. A sketch of this filtering step, using such a list, is shown below.
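
As an illustration, here is a minimal sketch in R of such a filter, assuming the word list has been downloaded locally as profanity.txt (one word per line); the file name and the removeProfaneLines helper are hypothetical and not part of the code above.

# Hypothetical sketch: drop every line that contains a word from the profanity list
# (assumes "profanity.txt" holds one word per line, without regex metacharacters)
removeProfaneLines <- function(lines, badwords) {
    # build one regular expression matching any profane word as a whole word
    pattern <- paste0("\\b(", paste(badwords, collapse = "|"), ")\\b")
    # keep only the lines that contain none of those words
    lines[!grepl(pattern, lines, ignore.case = TRUE)]
}

badwords <- readLines("profanity.txt", warn = FALSE)
sub[[1]]$content <- removeProfaneLines(sub[[1]]$content, badwords)
sub[[2]]$content <- removeProfaneLines(sub[[2]]$content, badwords)
sub[[3]]$content <- removeProfaneLines(sub[[3]]$content, badwords)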

 

# number of most frequent words needed to cover 90% of all word instances
perc = 0
total <- sum(UgramF$freq)    # total number of word instances in the subset
totW <- length(UgramF$word)  # total number of distinct words
i = 1
while (perc <= 0.9){
    # words are sorted by decreasing frequency, so we accumulate from the most frequent
    perc = perc + UgramF$freq[i]/total
    i = i+1
}

Covering 90% of word instances: to cover 90% of all word instances in the subset defined above, we only need the 4 044 most frequent words out of the 96 257 distinct words in the subsetted corpus, i.e. around 4% of the vocabulary.
This gives us a large margin to reduce the size of the dictionary our predictive model will use, and therefore to reduce computation time.

 

Plan

Here are some directions we will explore to build our predictive model:

  • Markov chains: the assumption that the probability of a word depends only on the previous word, which generalizes to depending only on the few previous words. We can use the n-grams defined above to estimate the probability of occurrence of a word.
  • Maximum Likelihood Estimation (MLE): estimating probabilities by counting occurrences in the corpus and normalizing the counts. For example, to estimate the probability of a word Y given that its previous word is X, we take the count of the 2-gram “X Y” and divide it by the sum of the counts of all 2-grams starting with the word X (see the sketch after this list).
  • Logarithms: we could use log probabilities instead of raw probabilities; since probabilities are combined by multiplication, this avoids working with extremely small numbers.
  • Backoff models: we will encounter many n-grams that are unobserved in the corpus. One way to tackle this issue is to use a backoff model: if an n-gram is unobserved, we look at the corresponding (n-1)-gram, and so on until a match is found in the corpus.
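
As an illustration of the MLE and backoff ideas, here is a minimal sketch in R, assuming the UgramF and BgramF frequency tables built above; the mleBigram and predictNext helpers are hypothetical and only show the shape of the model, not its final implementation.

# Hypothetical sketch: MLE bigram probability with a naive backoff to unigrams
mleBigram <- function(x, y) {
    # P(y | x) = count("x y") / sum of counts of all bigrams starting with "x"
    xy <- BgramF$freq[BgramF$word == paste(x, y)]
    xAll <- sum(BgramF$freq[grepl(paste0("^", x, " "), BgramF$word)])
    if (length(xy) == 0 || xAll == 0) return(0)
    xy / xAll
}

predictNext <- function(x, n = 3) {
    # bigrams starting with x, already sorted by decreasing frequency
    cand <- BgramF[grepl(paste0("^", x, " "), BgramF$word), ]
    if (nrow(cand) > 0) {
        # return the second word of the n most frequent matching bigrams
        return(sapply(strsplit(as.character(head(cand$word, n)), " "), `[`, 2))
    }
    # backoff: no observed bigram starting with x, fall back to the most frequent unigrams
    as.character(head(UgramF$word, n))
}

mleBigram("look", "like")   # estimated P("like" | "look")
predictNext("look")         # three candidate next words after "look"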