Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner for this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
“I went to the”
the keyboard presents three options for what the next word might be. For example, the three words might be “gym”, “store”, and “restaurant”. In this capstone I will work on understanding and building predictive text models like those used by SwiftKey.
The data for this project comes from SwiftKey. For the purposes of this project, I will be using only the English data files.
I will be using the tm package for mining this data. I have also created a helpers file that loads all the required libraries and functions.
source("helpers.R")
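The contents of helpers.R are not shown in this report; a minimal sketch of what it might contain, assuming it only loads the libraries and defines the helper functions used below:

# helpers.R (hypothetical sketch)
library(tm)   # text mining: Corpus, DirSource, tm_map, and the transformations
# ...plus helper functions such as preProcess(), sketched later in this report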
First, I read a small sample of the data so we can review its format. Here are two sample lines from the blogs data:
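(A minimal sketch of how these lines might be read; the path matches the corpus-loading code later in this report, and the same pattern applies to the news and twitter files.)

blogs_sample <- readLines("./COursera-SwiftKey/final/en_us/en_US.blogs.txt",
                          n = 2, encoding = "UTF-8")
blogs_sample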
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “godsâ€."
## [2] "We love you Mr. Brown."
Here are two sample lines from the news data:
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
Here are two sample lines from the twitter data:
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
Before loading the entire corpus, I read a couple of lines to see the format of the data, and I used those lines to develop the preprocessing functions (part of the helpers.R file).
cname <- file.path(".","COursera-SwiftKey","final","en_us")
docs <- Corpus(DirSource(cname))
## Warning: incomplete final line found on
## './COursera-SwiftKey/final/en_us/en_US.news.txt'
Total number of words before preprocessing
##                        blogs  news twitter
## Total number of words 899288 77259 2360148
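These totals could be computed in several ways; a minimal sketch, assuming the raw lines of each file are in memory as character vectors (word_count is a hypothetical helper, not necessarily the one used here):

# Count whitespace-separated tokens across a character vector of lines
word_count <- function(lines) {
  sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
}
word_count(blogs_sample)  # e.g. on the two sample blog lines above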
For the pre-processing phase, I created functions that remove all the odd characters (an artifact of the UTF-8 data), numbers, and punctuation, convert all text to lowercase, and strip extra whitespace. I also removed the stop words, to reduce the size of the corpus.
docs <- preProcess(docs)
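preProcess() is defined in helpers.R and not shown above; a sketch of what it might look like using standard tm transformations (the iconv() call for the ASCII conversion is my assumption about how the odd characters were filtered):

preProcess <- function(docs) {
  # Drop non-ASCII characters left over from the UTF-8 source files
  toASCII <- content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = ""))
  docs <- tm_map(docs, toASCII)
  docs <- tm_map(docs, content_transformer(tolower))       # lowercase
  docs <- tm_map(docs, removeNumbers)                      # strip digits
  docs <- tm_map(docs, removePunctuation)                  # strip punctuation
  docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stop words
  tm_map(docs, stripWhitespace)                            # collapse spaces
}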
List of the top words in the blogs and news data:
head(blogstable)
## sb_blogs
##     the      to     and      of       a       I
## 1659151 1043878 1015714  862906  857102  738534
head(newstable)
## sb_news
##    the     to    and      a     of     in
## 131810  68417  65167  63401  58675  47526
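A minimal sketch of how such a frequency table might be built, assuming blogs_lines is a hypothetical character vector holding the raw lines of the blogs file:

# Tokenize on whitespace and tabulate word frequencies
sb_blogs   <- unlist(strsplit(blogs_lines, "\\s+"))
blogstable <- sort(table(sb_blogs), decreasing = TRUE)
head(blogstable)

Since stop words such as “the” and “to” still top these lists, the tables above were presumably computed before the stop-word removal step.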
Total number of words after preprocessing:
##                                                blogs    news
## Total number of words after PreProcessing: 37334131 2643969
Looking through the data files, I found a great many odd characters. After worrying about them for a while, I was able to filter all of them out by converting the text to ASCII. I also found that numbers, some smileys, email addresses, and web addresses are really not that important for data modelling.
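A sketch of how such tokens could be filtered with regular expressions (these patterns are my assumption, not necessarily the ones in helpers.R):

# Remove web addresses, email addresses, simple smileys, and numbers
clean_line <- function(x) {
  x <- gsub("(https?://|www\\.)\\S+", " ", x)    # web addresses
  x <- gsub("\\S+@\\S+\\.[A-Za-z]{2,}", " ", x)  # email addresses
  x <- gsub("[:;=8][-o*']?[)(DPp]", " ", x)      # simple smileys
  gsub("[0-9]+", " ", x)                         # numbers
}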
Summary of words in the blogs and news documents (sorted by number of occurrences). I am only displaying the first one hundred words, so they can be plotted. I am also not displaying the twitter data, as it is analogous to the other data files.
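A sketch of the kind of plot used here, assuming blogstable is the sorted frequency table built above:

# Bar plot of the 100 most frequent words in the blogs data
top100 <- head(blogstable, 100)
barplot(top100, las = 2, cex.names = 0.5,
        main = "Top 100 words in the blogs data", ylab = "Frequency")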
The goal of the application is to predict the next word in a sentence. We will develop a model based on Google-style n-grams, applying the Markov assumption to the chain rule: instead of conditioning on the entire history, we estimate the probability of a word w_i from only the last few words, say the last three.
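For concreteness, the chain rule with a third-order Markov assumption, and its maximum-likelihood estimate from n-gram counts, can be written as:

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})$$

$$\hat{P}(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) = \frac{\text{count}(w_{i-3}, w_{i-2}, w_{i-1}, w_i)}{\text{count}(w_{i-3}, w_{i-2}, w_{i-1})}$$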
I am still worried about how the Shiny Apps server will handle so much load. I will probably deploy the model to Yhat servers as well, so a client application can consume the predictions as JSON.