Building a predictive algorithm

Have you ever wondered how your smartphone corrects your spelling and even suggests the next word as you type a text message? Spelling correction and type-ahead suggestions are based on prediction techniques that combine statistics with grammar rules (language syntax).

This report presents the steps in building a word prediction app as part of the Coursera/Johns Hopkins Data Science specialization capstone.

Natural Language Processing pipeline

A prediction algorithm is built from patterns observed in corpora of documents collected in the past. The steps to build the algorithm are summarized below.
* Data acquisition
* Cleaning and transformation
* Slicing and sampling
* Modeling (n-gram model)
* Predictive algorithm

Data Acquisition from Corpora

The data comes from HC Corpora, a set of free corpora available for learning and research purposes. See the readme file at About Corpus for details on the corpora available.

The RMD file looks for the data files in the folder where the RMD file is located. If the data files are not present, the code downloads them and saves the data file in zip format. The files are then expanded into a sub-folder, ./final. If you are running the code from the R console, please set the working directory to the directory containing the .RMD file using setwd().

# To ensure reproducible results, let's download the files from the source.
# If the data file does not exist, download it and save the zip file.
if (!file.exists("Coursera-SwiftKey.zip")) {
        download.file(
                "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", method = "curl")
}

# expand the zip file; old files will be overwritten
unzip("Coursera-SwiftKey.zip", overwrite = TRUE)

The corpora contain data files in four languages. We are interested in the English tweets, blogs and news stored in the ./final sub-folder.
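
The later chunks read the English files from a folder referred to as DataLocation, which the report does not define explicitly. A minimal sketch, assuming the standard layout of the Coursera-SwiftKey archive (adjust the path if your copy differs):

# inspect the unzipped layout: one sub-folder per language under ./final
list.files("./final")

# assumed location of the English files, used by the chunks below as DataLocation
DataLocation <- "./final/en_US"
list.files(DataLocation)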

Exploratory Analysis

Let’s show how much data is present in the Corpus by source.

# read the three English files
news=readLines(paste(DataLocation,"/","en_US.news.txt", sep=""),encoding='UTF-8',skipNul=TRUE)
blogs=readLines(paste(DataLocation,"/","en_US.blogs.txt", sep=""),encoding='UTF-8',skipNul=TRUE)
tweets=readLines(paste(DataLocation,"/","en_US.twitter.txt", sep=""),encoding='UTF-8',skipNul=TRUE)

# how many characters in each?
tweets_char = nchar(tweets)  # number of characters in each tweet
blogs_char  = nchar(blogs)   # number of characters in each blog post
news_char   = nchar(news)    # number of characters in each news article

# how many documents in each file?
news_num = length(news)
blogs_num = length(blogs)
tweets_num = length(tweets)

# how many words in each file?
news_Words <- sum(stri_count_words(news))
blogs_Words <- sum(stri_count_words(blogs))
twitter_Words <- sum(stri_count_words(tweets))

# file sizes in MB
blogs_file_size = round(file.size(paste(DataLocation,"/","en_US.blogs.txt", sep=""))/1024^2, digits=1)
news_file_size = round(file.size(paste(DataLocation,"/","en_US.news.txt", sep=""))/1024^2,digits=1)
tweets_file_size = round(file.size(paste(DataLocation,"/","en_US.twitter.txt", sep=""))/1024^2, digits = 1)

filenames  <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
filesize   <- c(blogs_file_size, news_file_size, tweets_file_size)   # file sizes in MB
wordCounts <- c(blogs_Words, news_Words, twitter_Words)              # same order as filenames
numDocs    <- c(blogs_num, news_num, tweets_num)

dataSummary <- data.frame(filenames, filesize, numDocs, wordCounts)
colnames(dataSummary) <- c("Source", "Size (MB)", "Documents", "Words")
# print the table
kable(dataSummary, format="markdown")
|Source            |Size (MB) |Documents |Words    |
|:-----------------|:---------|:---------|:--------|
|en_US.blogs.txt   |200.4     |899288    |37546246 |
|en_US.news.txt    |196.3     |1010242   |34762395 |
|en_US.twitter.txt |159.4     |2360148   |30093410 |

Slicing and Sampling

For the training exercise, I am using a 3% sample of the data from the corpora.

# read the profanity word list (used later to filter out swear words)
profanity <- readLines(paste(getwd(), "/", "Bad-Words-master/profanity.txt", sep = ""), encoding = 'UTF-8', skipNul = TRUE)

set.seed(80)  # fixed seed so the sampling is reproducible
SampleNews   = sample(news,   round(0.03 * news_num))    # 3% as a sample for training
SampleBlogs  = sample(blogs,  round(0.03 * blogs_num))
SampleTweets = sample(tweets, round(0.03 * tweets_num))

# combine the sample vectors into a single character vector
sampleData = c(SampleBlogs, SampleNews, SampleTweets)
if (!dir.exists("./TrainingData")) dir.create("./TrainingData")
writeLines(sampleData, "./TrainingData/TrainingData.txt")

Cleaning and Transforming

For the purposes of Natural Language Processing (NLP), we don't need punctuation, extra white space, etc. The cleaning steps are listed below (a minimal sketch with the tm package follows the list):
* remove extra white space
* convert all words to lower case
* remove stop words, which carry little meaning for NLP, e.g. the, at, is, which and on
* remove all punctuation
* remove all numbers
* remove profanity and swear words, downloaded from (https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en)
* remove suffixes from the words (stemming), e.g. -es, -ed, -s
* convert the documents to plain text
* remove special characters/foreign words
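
The cleaning chunk itself is not shown in the report, so here is a minimal sketch of how these steps could be applied with the tm package. It assumes the sampleData and profanity objects from the sampling chunk, and it names the resulting corpus SwiftKey because that is the object the modeling code below operates on.

library(tm)
library(SnowballC)   # provides stemDocument()

# build a corpus from the sampled text
SwiftKey <- VCorpus(VectorSource(sampleData))

# apply the cleaning steps listed above
SwiftKey <- tm_map(SwiftKey, content_transformer(tolower))        # lower case
SwiftKey <- tm_map(SwiftKey, removeWords, stopwords("english"))   # stop words
SwiftKey <- tm_map(SwiftKey, removeWords, profanity)              # profanity filter
SwiftKey <- tm_map(SwiftKey, removePunctuation)                   # punctuation
SwiftKey <- tm_map(SwiftKey, removeNumbers)                       # numbers
SwiftKey <- tm_map(SwiftKey, content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = " ")))  # special/foreign characters
SwiftKey <- tm_map(SwiftKey, stemDocument)                        # stemming
SwiftKey <- tm_map(SwiftKey, stripWhitespace)                     # extra white space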

After the data is cleaned up, we create a large matrix that records each token (word) against the document it appears in, a.k.a. a document-term matrix (DTM). Let's map the frequency of the words. Instead of using a histogram, which needs bins on a numeric scale, I am using a bar plot that shows the words as labels.

Modeling

I am using the RWeka and tm packages to tokenize the content of the corpora into clusters of 1, 2 and 3 words (n-grams).

Unigrams

We build a fun word cloud to show the most frequently occurring single words.

Unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
UniDoc  <- DocumentTermMatrix(SwiftKey, control = list(tokenize = Unigram))

UniDoc.matrix <- as.matrix(UniDoc)
frequency <- colSums(UniDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
UniGramFrequency <- data.frame(word=names(frequency),freq=frequency)   

#build a wordcloud
colspectrum <- brewer.pal(6, "Dark2")   
wordcloud(names(frequency), frequency, max.words=50, rot.per=0.1, colors=colspectrum) 
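
The frequency bar plot mentioned in the exploratory section is not shown for single words, so here is a sketch that mirrors the bigram and trigram plots below, reusing the UniGramFrequency data frame built above (the cut-off of eight words is an arbitrary choice for readability).

# bar plot of the most frequent single words
head(UniGramFrequency, 8) %>%
  ggplot(., aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", colour = "blue", fill = "lightgreen") +
  ggtitle("Unigrams with the highest frequencies") +
  xlab("Unigrams") + ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))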

Tokenize word pairs

Building bigram tokens.

BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

biDTM <- DocumentTermMatrix(SwiftKey, control = list(tokenize = BigramTokenizer))

biDTM2 <- as.matrix(biDTM)
frequency <- colSums(biDTM2)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)

#show the first few most used word pairs
BiGramFrequency <- data.frame(word=names(frequency),freq=frequency)

BiGramFrequency %>%
  ggplot(., aes(x=reorder(word, -freq),freq)) +
  geom_bar(stat="identity",colour="blue",fill="lightgreen") +
  ggtitle("Bigrams with the highest frequencies") +
  xlab("Bigrams") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Trigram

Lastly, we tokenize sequences of three words.

TrigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

TriDTM <- DocumentTermMatrix(SwiftKey, control = list(tokenize = TrigramTokenizer))

TriDTM2 <- as.matrix(TriDTM)
frequency <- colSums(TriDTM2)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)
TriGramFrequency <- data.frame(word=names(frequency),freq=frequency)

#show the first few most used word groups
head(TriGramFrequency, 10)
##                                          word freq
## new york city                   new york city  147
## two years ago                   two years ago  130
## cant wait see                   cant wait see  129
## president barack obama president barack obama  112
## happy mothers day           happy mothers day  107
## new york times                 new york times   95
## caprera hotel venice     caprera hotel venice   84
## hotel venice italy         hotel venice italy   84
TriGramFrequency %>%
  ggplot(., aes(x=reorder(word, -freq),freq)) +
  geom_bar(stat="identity",colour="blue",fill="lightgreen") +
  ggtitle("Trigrams with the highest frequencies") +
  xlab("Trigrams") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Next steps

I plan to use the n-gram data as the foundation for further analysis. We will build quadgram tokens as well and use them in the prediction model; a sketch of the quadgram tokenizer is shown below. After this basic analysis, the goal is to build different prediction algorithms based on popular models. Finally, the prediction algorithm will be packaged as a Shiny app and hosted on shinyapps.io.
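
The quadgram tokenizer would follow the same pattern as the bigram and trigram tokenizers above. The sketch below also adds a toy next-word lookup; predictNext is a hypothetical helper for illustration only, not the final prediction model.

# quadgram tokenizer, same pattern as the bigram/trigram tokenizers
QuadgramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)

QuadDTM <- DocumentTermMatrix(SwiftKey, control = list(tokenize = QuadgramTokenizer))

# toy next-word lookup: return the last word of the most frequent trigram
# that starts with the given two words (hypothetical helper, illustration only)
predictNext <- function(lastTwoWords, trigramFreq) {
  hits <- trigramFreq[grepl(paste0("^", lastTwoWords, " "), trigramFreq$word), ]
  if (nrow(hits) == 0) return(NA_character_)
  sub(".* ", "", as.character(hits$word[which.max(hits$freq)]))
}

predictNext("new york", TriGramFrequency)   # "city", given the counts shown above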

Reproducibility instructions

This report uses the following R packages. Please install them with install.packages() and load them.
1. require(tm); require(RWeka); require(ggplot2); require(wordcloud); require(RColorBrewer); require(xtable); require(knitr); require(SnowballC); require(stringi); require(R.utils)
2. For reproducibility, set the seed to 80 using the set.seed() command.
3. The sampling size is 3%.
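
Putting those three points together, a minimal setup sketch (the package names, seed and sample rate are taken directly from the list above):

# install any missing packages with install.packages(), then load them all
pkgs <- c("tm", "RWeka", "ggplot2", "wordcloud", "RColorBrewer", "xtable",
          "knitr", "SnowballC", "stringi", "R.utils")
invisible(lapply(pkgs, require, character.only = TRUE))

set.seed(80)        # reproducible sampling
sampleRate <- 0.03  # 3% of each source file, as used in the sampling chunk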