Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner for this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
“I went to the”
the keyboard presents three options for what the next word might be. For example, the three words might be “gym”, “store”, and “restaurant”. In this capstone I will work on understanding and building predictive text models like those used by SwiftKey.
The data for this project comes from SwiftKey. For the purposes of this project, I will be using only the English data files.
I will be using the tm package for mining this data. I have also created a helpers file that loads all the required libraries and functions.
source("helpers.R")
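The contents of helpers.R are not shown in this report; a minimal sketch of what it might contain, assuming it only loads the libraries and defines the helper functions used below:

# helpers.R (hypothetical sketch)
library(tm)   # text mining: Corpus, DirSource, tm_map, and the transformations
# ...plus helper functions such as preProcess(), sketched later in this report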
First, I read a small sample of the data so we can review its format. Here are two sample lines from the blogs data:
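(A minimal sketch of how these lines might be read; the path matches the corpus-loading code later in this report, and the same pattern applies to the news and twitter files.)

blogs_sample <- readLines("./COursera-SwiftKey/final/en_us/en_US.blogs.txt",
                          n = 2, encoding = "UTF-8")
blogs_sample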
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “godsâ€."
## [2] "We love you Mr. Brown."
Here are two sample lines from the news data:
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
Here are two sample lines from the twitter data:
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
Before loading the entire corpus, I read a couple of lines to see the format of the data, and I used those lines to develop the preprocessing functions (part of the helpers.R file).
cname <- file.path(".","COursera-SwiftKey","final","en_us")
docs <- Corpus(DirSource(cname))
## Warning: incomplete final line found on
## './COursera-SwiftKey/final/en_us/en_US.news.txt'
Total number of words before preprocessing
##                        blogs  news twitter
## Total number of words 899288 77259 2360148
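These totals could be computed in several ways; a minimal sketch, assuming the raw lines of each file are in memory as character vectors (word_count is a hypothetical helper, not necessarily the one used here):

# Count whitespace-separated tokens across a character vector of lines
word_count <- function(lines) {
  sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
}
word_count(blogs_sample)  # e.g. on the two sample blog lines above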
For the pre-processing phase, I created functions that remove all the odd characters (an artifact of the UTF-8 data), numbers, and punctuation, convert all text to lowercase, and strip extra whitespace. I also removed the stop words, to reduce the size of the corpus.
docs <- preProcess(docs)
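preProcess() is defined in helpers.R and not shown above; a sketch of what it might look like using standard tm transformations (the iconv() call for the ASCII conversion is my assumption about how the odd characters were filtered):

preProcess <- function(docs) {
  # Drop non-ASCII characters left over from the UTF-8 source files
  toASCII <- content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = ""))
  docs <- tm_map(docs, toASCII)
  docs <- tm_map(docs, content_transformer(tolower))       # lowercase
  docs <- tm_map(docs, removeNumbers)                      # strip digits
  docs <- tm_map(docs, removePunctuation)                  # strip punctuation
  docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stop words
  tm_map(docs, stripWhitespace)                            # collapse spaces
}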
List of the top words in the blogs and news data:
head(blogstable)
## sb_blogs
##     the      to     and      of       a       I
## 1659151 1043878 1015714  862906  857102  738534
head(newstable)
## sb_news
##    the     to    and      a     of     in
## 131810  68417  65167  63401  58675  47526
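A minimal sketch of how such a frequency table might be built, assuming blogs_lines is a hypothetical character vector holding the raw lines of the blogs file:

# Tokenize on whitespace and tabulate word frequencies
sb_blogs   <- unlist(strsplit(blogs_lines, "\\s+"))
blogstable <- sort(table(sb_blogs), decreasing = TRUE)
head(blogstable)

Since stop words such as “the” and “to” still top these lists, the tables above were presumably computed before the stop-word removal step.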
Total number of words after preprocessing:
##                                                blogs    news
## Total number of words after PreProcessing: 37334131 2643969
Looking through the data files, I found a great many odd characters. After worrying about them for a while, I was able to filter all of them out by converting the text to ASCII. I also found that numbers, some smileys, email addresses, and web addresses are really not that important for data modelling.
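A sketch of how such tokens could be filtered with regular expressions (these patterns are my assumption, not necessarily the ones in helpers.R):

# Remove web addresses, email addresses, simple smileys, and numbers
clean_line <- function(x) {
  x <- gsub("(https?://|www\\.)\\S+", " ", x)    # web addresses
  x <- gsub("\\S+@\\S+\\.[A-Za-z]{2,}", " ", x)  # email addresses
  x <- gsub("[:;=8][-o*']?[)(DPp]", " ", x)      # simple smileys
  gsub("[0-9]+", " ", x)                         # numbers
}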
Summary of words in the blogs and news documents (sorted by number of occurrences). I am only displaying the first one hundred words, so they can be plotted. I am also not displaying the twitter data, as it is analogous to the other data files.
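A sketch of the kind of plot used here, assuming blogstable is the sorted frequency table built above:

# Bar plot of the 100 most frequent words in the blogs data
top100 <- head(blogstable, 100)
barplot(top100, las = 2, cex.names = 0.5,
        main = "Top 100 words in the blogs data", ylab = "Frequency")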
The goal of the application is to predict the next word in a sentence. We will develop a model based on Google-style n-grams, applying the Markov assumption to the chain rule: instead of conditioning on the entire history, we estimate the probability of a word w_i from only the last few words, say the last three.
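For concreteness, the chain rule with a third-order Markov assumption, and its maximum-likelihood estimate from n-gram counts, can be written as:

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})$$

$$\hat{P}(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) = \frac{\text{count}(w_{i-3}, w_{i-2}, w_{i-1}, w_i)}{\text{count}(w_{i-3}, w_{i-2}, w_{i-1})}$$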
I am still worried about how the Shiny Apps server will handle so much load. I will probably deploy the model to Yhat servers as well, so a client application can consume the predictions as JSON.