Project Background

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone and a leading software company, has built a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, or restaurant. In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.

Introduction

The goal of this report is to demonstrate that we have become familiar with the data and that we are on track to create the prediction algorithm. The report is kept concise: it explains the major features of the data we have identified and briefly summarizes our plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager.

Preparing data

Download the data

We downloaded the data from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip to start. We use the English database, but may later consider the three other databases in German, Russian, and Finnish.

Loading the dataset

twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul
blog <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8") 
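
The warnings above indicate that a few lines of the Twitter file contain embedded nul characters. They are harmless for our purposes, but readLines() can also be told to drop them silently via its skipNul argument, as in this optional variant:

# skipNul = TRUE silently drops embedded nul characters instead of warning
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8",
                     skipNul = TRUE)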

Preparing libraries

library(ggplot2)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(knitr)
library(RWeka)
library(RWekajars)
library(rJava)

library(wordcloud)
## Loading required package: RColorBrewer
library(stringi)
library(stringr)
library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
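
For reference, all of the packages above are available from CRAN and can be installed as below; note that RWeka and rJava additionally assume a working Java installation on the machine.

# One-time setup; RWeka/rJava also require Java to be installed on the system
install.packages(c("ggplot2", "tm", "knitr", "RWeka", "wordcloud",
                   "stringi", "stringr", "dplyr"))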

Basic summary statistics about the datasets

Preview the dataset

length(twitter)
## [1] 2360148
length(blog)
## [1] 899288
length(news)
## [1] 1010242
head(twitter)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
head(blog)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+201C>gods<U+201D>."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"
head(news)
## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"                                                                                                                                                                                                                                                            
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."

Summary statistics about the datasets

# File sizes on disk, converted from bytes to megabytes
size_twitter <- file.info("final/en_US/en_US.twitter.txt")$size/1024^2
size_blog <- file.info("final/en_US/en_US.blogs.txt")$size/1024^2
size_news <- file.info("final/en_US/en_US.news.txt")$size/1024^2

# Split each line on spaces to get rough word tokens
words_twitter <- strsplit(twitter, " ")
words_blog <- strsplit(blog, " ")
words_news <- strsplit(news, " ")

# Total words per corpus: sum the token counts over all lines
wordCount_twitter <- sum(sapply(words_twitter, FUN=length, simplify=TRUE))
wordCount_blog <- sum(sapply(words_blog, FUN=length, simplify=TRUE))
wordCount_news <- sum(sapply(words_news, FUN=length, simplify=TRUE))

summary_table <- data.frame(filename = c("twitter","blog","news"),
                            file_size_MB = c(size_twitter, size_blog, size_news),
                            num_lines = c(length(twitter),length(blog),length(news)),
                            wordCount = c(wordCount_twitter, wordCount_blog, wordCount_news))


kable(x=summary_table, col.names=c("File Name" , "File Size (MB)" , "Line Count" , "Word Count"), digits=1)
File Name   File Size (MB)   Line Count   Word Count
twitter              159.4      2360148     30373543
blog                 200.4       899288     37334131
news                 196.3      1010242     34372530
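
Note that splitting on single spaces is only a rough word tokenization. Since stringi is already loaded, its boundary-based word counter is a more robust alternative; a sketch for one file:

# Count words using ICU word boundaries rather than single spaces
wordCount_twitter2 <- sum(stri_count_words(twitter))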

Data Pre-Processing

This dataset is fairly large, so to speed up pre-processing we build models on a sample. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to the results that would be obtained using all the data. We create a sub-sample by reading in a random subset of each original file and combining them into a separate sampling dataset.

set.seed(1234)  # arbitrary seed so the sample is reproducible
sampleTwitter <- twitter[sample(1:length(twitter), 10000)]
sampleBlog <- blog[sample(1:length(blog), 10000)]
sampleNews <- news[sample(1:length(news), 10000)]

sampleText <- c(sampleBlog, sampleNews, sampleTwitter)

samplingdata <- Corpus(VectorSource(sampleText))
rm(sampleText)

Clean up the data: Tokenization and Profanity filtering

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers.
  2. Profanity filtering - removing profanity and other words we do not want to predict (a sketch follows the chunk below).

# Normalize the sample: lower-case, collapse whitespace, drop punctuation and numbers
samplingdata <- tm_map(samplingdata, content_transformer(tolower))
samplingdata <- tm_map(samplingdata, stripWhitespace)
samplingdata <- tm_map(samplingdata, removePunctuation)
samplingdata <- tm_map(samplingdata, removeNumbers)
samplingdata <- tm_map(samplingdata, PlainTextDocument)
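
The chunk above covers the tokenization-related cleanup but not profanity filtering. A minimal sketch, assuming a plain-text word list is available (profanity.txt is a placeholder for whichever published list is chosen):

# Hypothetical word list, one term per line; substitute any list you prefer
profanity <- readLines("profanity.txt", encoding = "UTF-8", warn = FALSE)
samplingdata <- tm_map(samplingdata, removeWords, profanity)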

Exploratory Data Analysis

Build basic n-gram model

We examine the frequencies of words and word pairs, building figures and tables to show how these frequencies vary across the data.

# Keep tm on a single core (works around known RWeka/rJava issues with parallel processing)
options(mc.cores=1)

# Collapse a term-document matrix into a data frame of terms sorted by frequency
getFrequency <- function(tdm) {
  datafrequency <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(datafrequency), freq = datafrequency))
}

# RWeka tokenizers for two- and three-word sequences
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Bar chart of the 30 most frequent terms in a frequency data frame
makePlot <- function(data, label) {
  ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
         labs(x = label, y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 10, hjust = 1)) +
         geom_bar(stat = "identity", fill = "orange")
}

# Get frequencies of most common n-grams in data sample
unigramsGraph <- getFrequency(removeSparseTerms(TermDocumentMatrix(samplingdata), 0.9999))
bigramsGraph <- getFrequency(removeSparseTerms(TermDocumentMatrix(samplingdata, control = list(tokenize = bigram)), 0.9999))
trigramsGraph <- getFrequency(removeSparseTerms(TermDocumentMatrix(samplingdata, control = list(tokenize = trigram)), 0.9999))
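
Before plotting, it is worth a quick sanity check that the frequency tables look sensible, for example:

# Peek at the five most frequent bigrams in the sample
head(bigramsGraph, 5)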

Unigram Analysis

makePlot(unigramsGraph, "Top 30 Unigrams")

Bigram Analysis

makePlot(bigramsGraph, "Top 30 Bigrams")

Trigram Analysis

makePlot(trigramsGraph, "Top 30 Trigrams")

Construct Word Cloud

Remove stop words from the sampling data and create a Word Cloud to check the difference.
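
A minimal sketch of this step, reusing the getFrequency() helper and the tm and wordcloud packages loaded earlier (the word limit and palette are arbitrary choices):

# Drop common English stop words, then plot the most frequent remaining terms
nostopdata <- tm_map(samplingdata, removeWords, stopwords("english"))
cloudFreq <- getFrequency(TermDocumentMatrix(nostopdata))
wordcloud(words = cloudFreq$word, freq = cloudFreq$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))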

What’s Next?

In this report, the exploratory analysis built n-gram frequency tables and plots for unigrams, bigrams, and trigrams; the prediction model may extend the same approach to quadgrams, as sketched below. Performance and speed will be the biggest concerns for the upcoming prediction algorithm project.
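
A sketch of the quadgram extension, reusing the tokenizer pattern and helpers defined above:

# Same pattern as bigram()/trigram(), extended to four-word sequences
quadgram <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
quadgramsGraph <- getFrequency(removeSparseTerms(
  TermDocumentMatrix(samplingdata, control = list(tokenize = quadgram)), 0.9999))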

These are some tasks for the plan: