Overview

This text prediction model was built for Johns Hopkins’ “Data Science Capstone” class on Coursera, the final class in the 10-course Data Science Specialization. The model aims to accurately predict the word a user intends to type next. This can be used to improve the experience of typing on a hand-held device by saving the user time and reducing misspellings and errors.

Data

The text files used for this analysis can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

#load data
blogs <- readLines("../Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
twitter <- readLines("../Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
## Warning in readLines("../Coursera-SwiftKey/final/en_US/
## en_US.twitter.txt", : line 167155 appears to contain an embedded nul
## Warning in readLines("../Coursera-SwiftKey/final/en_US/
## en_US.twitter.txt", : line 268547 appears to contain an embedded nul
## Warning in readLines("../Coursera-SwiftKey/final/en_US/
## en_US.twitter.txt", : line 1274086 appears to contain an embedded nul
## Warning in readLines("../Coursera-SwiftKey/final/en_US/
## en_US.twitter.txt", : line 1759032 appears to contain an embedded nul
news <- readLines("../Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8")
## Warning in readLines("../Coursera-SwiftKey/final/en_US/en_US.news.txt", :
## incomplete final line found on '../Coursera-SwiftKey/final/en_US/
## en_US.news.txt'
#print snapshot of data
head(blogs)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"

Common Natural Language Processing Steps

Below is some general information about text mining and the common steps taken in natural language processing.

Automatic summarization
Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.

Tokenization
Split raw text into smaller units such as words or sentences (tokens) so they can be counted and modeled.

N-gram modeling
Count sequences of n consecutive words (for example bigrams and trigrams) and use their frequencies to estimate how likely a word is to follow a given phrase. This is the approach explored in this report.

Text normalization
Standardize text before analysis, for example by converting to lower case and removing punctuation, numbers, and extra whitespace, as done in the pre-processing step below.

Pre-Processing

Because the provided text data is large and processing power is limited, we extract a small sample (0.1% of each file) to build the model. In addition, we prepare the data for analysis by applying some common cleaning functions.

#use a small sample of data set due to memory limitations
set.seed(123)
blogs.sample <- blogs[sample(length(blogs), length(blogs)*.1/100)] #.1%
twitter.sample <- twitter[sample(length(twitter), length(twitter)*.1/100)] #.1%
news.sample <- news[sample(length(news), length(news)*.1/100)] #.1%
sample <- c(blogs.sample, twitter.sample, news.sample)

#convert to ASCII encoding to get rid of non-standard characters
sample <- iconv(sample, to = "ASCII", sub = "")

#convert to corpus
library(tm)
## Loading required package: NLP
corpus <- VCorpus(VectorSource(sample))

#Convert all words to lower case
corpus <- tm_map(corpus, content_transformer(tolower))

#remove punctuation
corpus <- tm_map(corpus, removePunctuation)

#remove numbers
corpus <- tm_map(corpus, removeNumbers)

#strip whitespace
corpus <- tm_map(corpus, stripWhitespace)

Exploratory Analysis

An important first step is to gain a general understanding of the data. We begin by computing basic statistics for each of the provided datasets and then look at the most frequent terms and combinations of terms.

#Dataset statistics

#file sizes
fs_blogs <- file.info("../Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size
fs_twitter <- file.info("../Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size
fs_news <- file.info("../Coursera-SwiftKey/final/en_US/en_US.news.txt")$size
fs <- c(fs_blogs, fs_twitter, fs_news)
fs <- fs/1000000
#line counts
lc_blogs <- length(blogs)
lc_twitter <- length(twitter)
lc_news <- length(news)
lc <- c(lc_blogs, lc_twitter, lc_news)
#word count
library(stringi)
wc_blogs <- sum(stri_count_words(blogs))
wc_twitter <- sum(stri_count_words(twitter))
wc_news <- sum(stri_count_words(news))
wc <- c(wc_blogs, wc_twitter, wc_news)

#save to dataframe
stat <- data.frame(c("blogs", "twitter", "news"), fs, lc, wc)
colnames(stat) <- c("Data Set", "Size in Megabytes", "Line Counts", "Word Count")

#Print
library(knitr)
kable(stat, format = "markdown")
|Data Set | Size in Megabytes| Line Counts| Word Count|
|:--------|-----------------:|-----------:|----------:|
|blogs    |          210.1600|      899288|   37546246|
|twitter  |          167.1053|     2360148|   30093369|
|news     |          205.8119|       77259|    2674536|
#count word frequency
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
## 
##     ngrams
## The following object is masked from 'package:base':
## 
##     Filter
frequent_terms <- freq_terms(corpus,100)

#plot word frequency
plot(frequent_terms[1:10,])

library(wordcloud)
wordcloud(words = frequent_terms$WORD, freq = frequent_terms$FREQ, max.words = 100)

#What are the frequencies of 2-grams and 3-grams in the dataset? 
library(RWeka)
twogramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
threegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
dtm_twogram <- DocumentTermMatrix(corpus, control=list(tokenize=twogramTokenizer))
dtm_threegram <- DocumentTermMatrix(corpus, control=list(tokenize=threegramTokenizer))
twogram_freq <- sort(colSums(as.matrix(dtm_twogram)), decreasing = TRUE)
threegram_freq <- sort(colSums(as.matrix(dtm_threegram)), decreasing = TRUE)

#graph results
barplot(twogram_freq[1:30],
        ylab='frequency',
        main='top 30 most frequent 2-word combinations',
        names.arg = names(twogram_freq)[1:30],
        col = 'red', las=2, cex.names = .7)

barplot(threegram_freq[1:30],
        ylab='frequency',
        main='top 30 most frequent 3-word combinations',
        names.arg = names(threegram_freq)[1:30],
        col = 'blue', las=2, cex.names = .7)

#Additional questions to consider:

# How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%? (A rough approach is sketched after these questions.)

# How do you evaluate how many of the words come from foreign languages?

# Can you think of a way to increase the coverage, either by identifying words that may not be in the corpora or by using a smaller number of words in the dictionary to cover the same number of phrases?
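
The first two questions can be explored directly from a unigram frequency table. The sketch below is illustrative only: the object names (dtm_onegram, onegram_freq, coverage) are my own, it reuses the same tm machinery applied to the 2-grams and 3-grams above, and the check against qdapDictionaries::GradyAugmented is just one rough heuristic for flagging words that may be foreign, misspelled, or slang.

#illustrative sketch: dictionary coverage and possible foreign/unknown words
dtm_onegram <- DocumentTermMatrix(corpus)
onegram_freq <- sort(colSums(as.matrix(dtm_onegram)), decreasing = TRUE)

#cumulative share of all word instances covered by the top-ranked words
coverage <- cumsum(onegram_freq) / sum(onegram_freq)
which(coverage >= 0.5)[1] #unique words needed to cover 50% of word instances
which(coverage >= 0.9)[1] #unique words needed to cover 90% of word instances

#words not found in an English word list are candidates for foreign,
#misspelled, or slang terms
sum(!names(onegram_freq) %in% qdapDictionaries::GradyAugmented)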

Plans

The next step will be to test a few different models, including text2vec, to evaluate which one performs best. Additionally, I may research ways of training the model on more than 0.1% of the provided text in order to increase accuracy. In any case, prediction speed will be critical for the implementation of this model.
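
As a simple baseline to compare candidate models against, one option is a frequency-based back-off over the 2-gram and 3-gram tables already built above. The sketch below is illustrative only: predict_next_word is a hypothetical helper rather than part of any package, and a production version would need smoothing and a much larger training sample.

#illustrative sketch of a naive back-off predictor using the n-gram tables
#from the exploratory step (threegram_freq, twogram_freq, frequent_terms)
predict_next_word <- function(phrase, n = 3) {
  #clean the input the same way the corpus was cleaned
  phrase <- tolower(gsub("[[:punct:][:digit:]]", "", phrase))
  words <- unlist(strsplit(trimws(phrase), "\\s+"))
  words <- words[words != ""]
  #try trigrams first, using the last two words as the prefix
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- threegram_freq[grepl(paste0("^", prefix, " "), names(threegram_freq))]
    if (length(hits) > 0) return(head(sub(".* ", "", names(hits)), n))
  }
  #back off to bigrams, using only the last word
  if (length(words) >= 1) {
    prefix <- tail(words, 1)
    hits <- twogram_freq[grepl(paste0("^", prefix, " "), names(twogram_freq))]
    if (length(hits) > 0) return(head(sub(".* ", "", names(hits)), n))
  }
  #fall back to the most frequent single words
  head(frequent_terms$WORD, n)
}

#example call: predict_next_word("thanks for the")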