Introduction

The goal of this analysis is to develop a predictive text model based on text data provided by SwiftKey. The final model will be presented in the form of a Shiny app, which will suggest the next word to the user after receiving a text input. The main goals for this milestone report are:

  1. Demonstrate that the data have been successfully obtained and uploaded.
  2. Create basic summary statistics to describe the data.
  3. Report any interesting findings based off the initial analysis.
  4. Discuss a plan for completing the modeling of the data set.

Packages

library(knitr)
library(stringi)
library(tm)
library(quanteda)
library(ggplot2)
library(wordcloud)

Loading the Data

The initial data supplied by Swiftkey contains text messages from three separate sources: Blogs, News, & Twitter. Once downloaded and unzipped these files are available in 4 separate folders designated for each of the available languages: German (DE), English (US), Finnish (FI), & Russian (RU). For the purposes of this analysis we will focus only on the English versions of the three sources.

durl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

#Determine if it has already been downloaded
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(durl)
  unzip("Coursera-SwiftKey.zip")
}

#Get the 3 data sets
blogUS <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = T)
newsUS <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = T)
twitrUS <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = T)

Example Entries

Before summarizing the data, a sample entry was chosen from each of the three data sets to give the user a basic idea of the format of the data. A single row (101) was haphazardly chosen from each data set with the resulting entried found below. As one would expect from the title of the sources, the blog entry is much longer and the news and tweet are much shorter.

 blogUS[101]
## [1] "I have mixed emotions about the start to this book. As I’ve already mentioned I very much enjoyed the characters and the voice of the protagonist. But as much as I was enjoying it, it took me a little while to get into the story. We’re presented with a huge case of insta-love right from the start. Lucas, the mysterious and hot new neighbor, comes on strong. I’m pretty sure they start making out on their second or third meeting! I’m warning you now. There’s not a lot of rationalization for the insta-love…you won’t really understand the reasons until the end of the book. Which is why, in retrospect, I kinda like that I had to wait for all of the info to be revealed."
 newsUS[101]
## [1] "For Square 19, we find ourselves in a neighborhood that was long considered to be the wrong side of the tracks."
 twitrUS[101]
## [1] "hahahahhahah u just made my day :D"

Summary Statistics

To summarize the data, the size, the number of lines (length), the number of words, and number characters were calculated for each data set. A quick review of the summaries indicates that there are many small tweets and fewer but longer blog entries.

#Determine size in MB
blogsize <- file.size("final/en_US/en_US.blogs.txt")/1024/1024  
newsize <- file.size("final/en_US/en_US.news.txt")/1024/1024  
twittersize <- file.size("final/en_US/en_US.twitter.txt")/1024/1024  

#Determine the number of lines in the files
bloglength <- length(blogUS)
newslength <- length(newsUS)
twitterlength <- length(twitrUS)

#Determine the word count
blogword <- sum(stri_count_words(blogUS))
newsword <- sum(stri_count_words(newsUS))
twitterword <- sum(stri_count_words(twitrUS))

#Detemine the character count
blogChar <- sum(nchar(blogUS))
newsChar <- sum(nchar(newsUS))
twitterChar <- sum(nchar(twitrUS))

file <- c("blogs", "news", "twitter")
sizeMB <- c(blogsize, newsize, twittersize)
length <- c(bloglength, newslength, twitterlength)
wordCt <- c(blogword, newsword, twitterword)
charCt <- c(blogChar,newsChar,twitterChar)

sumInfo <- data.frame(file, sizeMB, length, wordCt,charCt)

#sumInfo
knitr::kable(sumInfo)
file sizeMB length wordCt charCt
blogs 200.4242 899288 37546806 206824505
news 196.2775 77259 2674561 15639408
twitter 159.3641 2360148 30096690 162096241
#blogsize
#newsize
#twittersize
#bloglength
#newslength
#twitterlength

Sampling the data

Due to the a large size of the data sets, we will want to sample from each of the 3 files and create a small subset for developing our models. One strategy would be to take a consistent sized sample from each file but due to the divergent sizes of the 3 sources we will opt for a percentage from each, keeping the relative sizes of the three types within the subset representative to their size in the original files. A 1.2944% sample was taken to achieve approximately n=1000 lines from the smaller news data set (and a total data set with approximately 43,188 = .012944(899,288 + 77,259 + 2,360,148))

set.seed(4321)
data.sample <- c(sample(blogUS, length(blogUS) * 0.012944),
                 sample(newsUS, length(newsUS) * 0.012944),
                 sample(twitrUS, length(twitrUS) * 0.012944))
saveRDS(data.sample, 'subset.rds')

sampledt <- readRDS("subset.rds")

# Create a Corpus
train <- corpus(sampledt)
ndoc(train)
## [1] 43189

Data Cleaning

We next remove all of the unnecessary detritus from the text (numbers, punctuation, URL, separators, symbols) and change all entries to lowercase.

trainToken <- tokens(train,
                      what="word",
                      remove_numbers = TRUE,         #Remove numbers
                      remove_punct = TRUE,           #Remove punctuation
                      remove_url =TRUE,              #Remove web addresses 
                      remove_separators = TRUE,      #Remove separators
                      remove_symbols = TRUE,         #Remove symbols
                      verbose = quanteda_options("verbose"))

trainNoStop <- tokens_remove(trainToken, pattern = stopwords("en"))  #Remove common "stopwords"

trainToken <- tokens_tolower(trainToken)             #enforce lowercase
trainNoStop <- tokens_tolower(trainNoStop)

Profanity

We remove any profanity from the data set, using a standard offensive/profane word list was obtained from the website of Luis von Ahn “https://www.cs.cmu.edu/~biglou/resources/”. Any words found on this list were removed from the training data before any modeling began.

badwords_url <-"http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
download.file(badwords_url, destfile = "bad_words.txt", quiet = TRUE)
badwords <- readLines("bad_words.txt", encoding="UTF-8")

cltrainToken <- tokens_remove(trainToken, pattern = badwords)
cltrainNoSt <- tokens_remove(trainNoStop, pattern = badwords)

Initial Analysis - Tokenization

To begin the real analysis of the text data sets the first step is to convert the raw text into tokens which represent the words found in the text. For this analysis we will do 3 n-grams (1,2, & 3) which will correspond to single words, word pairs, and triplets. After the tokens are defined a document-feature matrix is create for each of the n-grams. The final step in this initial analysis is to create bar graphs and wordclouds illuminating the most common n-grams for each category.

Unigrams

ngrams_1NoSt <- tokens_ngrams(cltrainNoSt, n = 1, concatenator = " ")
ngrams_1 <- tokens_ngrams(cltrainToken, n = 1, concatenator = " ")
uniDfm <- dfm(ngrams_1NoSt, 
                    remove_padding = TRUE,
                    #tolower = TRUE,           #done in the token stage
                    #remove = bad.words,       #done in the token stage  
                    verbose = FALSE)
unig100 <- topfeatures(uniDfm, 100)
unig100df <- data.frame(unigram = names(unig100), freq = unig100)
kable(head(unig100df))
unigram freq
just just 3366
like like 2799
one one 2746
can can 2486
get get 2421
time time 2262
Top 20 Words
g1 <- ggplot(unig100df[1:20,],aes(x=reorder(unigram, -freq),y=freq))+ geom_bar(stat="identity", fill = "blue")
g1 <- g1 + theme(axis.text.x=element_text(angle=45, hjust = 1))
g1 <- g1 + labs(title="Top 20 Single word Frequency",x="Word",y="Frequency")
g1

wordcloud(words = unig100df$unigram,
              freq = unig100df$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35,
              colors=brewer.pal(9, "Set1"))

Bigrams

ngrams_2NoSt <- tokens_ngrams(cltrainNoSt, n = 2, concatenator = " ")
ngrams_2 <- tokens_ngrams(cltrainToken, n = 2, concatenator = " ")
biDfm <- dfm(ngrams_2NoSt, 
                    remove_padding = TRUE,
                    #tolower = TRUE,           #done in the token stage
                    #remove = bad.words,       #done in the token stage  
                    verbose = FALSE)
big100 <- topfeatures(biDfm, 100)
big100df <- data.frame(bigram = names(big100), freq = big100)
head(big100df)
##                          bigram freq
## right now             right now  307
## last night           last night  178
## feel like             feel like  144
## looking forward looking forward  128
## can get                 can get  125
## looks like           looks like  121
Top 20 Words
g1 <- ggplot(big100df[1:20,],aes(x=reorder(bigram, -freq),y=freq))+ geom_bar(stat="identity", fill = "blue")
g1 <- g1 + theme(axis.text.x=element_text(angle=45, hjust = 1))
g1 <- g1 + labs(title="Top 20 Bi-word Frequency",x="Word Pairs",y="Frequency")
g1

wordcloud(words = big100df$bigram,
              freq = big100df$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35,
              colors=brewer.pal(9, "Set1"))

Trigrams

A review of the trigrams barchart reveals one problem with this type of text analysis. The 3rd and 5th most common trigrams are “happy mothers day” and “happy mother’s day” respectively. Clearly these two groupings should be combined into a single category, which would make them the most common trigram in the data set. Although the tokens were created utilizing remove_punct = TRUE to remove punctuation, the quanteda package removes tokens that are entirely punctuation, but it typically preserves apostrophes and hyphens within words (like “mother’s”) to maintain word integrity. Removal of the apostrophe can be achieved before tokens are constructed by hardcoding the removal of them but due to the loss of information from contractions (he’ll becomes hell and we’re becomes were) this action in generally not recommended.This specific situation could be correctly manually but for the purposes of this analysis, this is ignored.

ngrams_3NoSt <- tokens_ngrams(cltrainNoSt, n = 3, concatenator = " ")
ngrams_3 <- tokens_ngrams(cltrainToken, n = 3, concatenator = " ")
triDfm <- dfm(ngrams_3NoSt, 
                    remove_padding = TRUE,
                    #tolower = TRUE,           #done in the token stage
                    #remove = bad.words,       #done in the token stage  
                    verbose = FALSE)
trig100 <- topfeatures(triDfm, 100)
trig100df <- data.frame(trigram = names(trig100), freq = trig100)
head(trig100df)
##                               trigram freq
## happy new year         happy new year   32
## let us know               let us know   29
## happy mothers day   happy mothers day   21
## new york city           new york city   20
## happy mother's day happy mother's day   18
## cinco de mayo           cinco de mayo   16
Top 20 Words
g1 <- ggplot(trig100df[1:20,],aes(x=reorder(trigram, -freq),y=freq))+ geom_bar(stat="identity", fill = "blue")
g1 <- g1 + theme(axis.text.x=element_text(angle=45, hjust = 1))
g1 <- g1 + labs(title="Top 20 tri-word Frequency",x="Word Triplets",y="Frequency")
g1

wordcloud(words = trig100df$trigram,
              freq = trig100df$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35,
              colors=brewer.pal(9, "Set1"))
mtext("Popular Trigrams", side=3, line=3, cex=1.5)

Prediction Model and Shiny App

The frequency data calculated and summarized here will be the basis of my predictive model. I plan to utilize this data to compute observed and unobserved token probabilities for input words and phrases. The model will be executed in a Shiny App and I plan to utilize the n-gram data frames rather than the original data within the app to save processing time and memory.