The goal of this analysis is to develop a predictive text model based on text data provided by SwiftKey. The final model will be presented in the form of a Shiny app, which will suggest the next word to the user after receiving a text input. The main goals for this milestone report are to load and summarize the data, draw a representative sample, clean it, and perform an initial exploratory n-gram analysis.
library(knitr)     #report tables (kable)
library(stringi)   #fast string utilities (word counts)
library(tm)        #text mining framework
library(quanteda)  #tokenization, n-grams, document-feature matrices
library(ggplot2)   #bar charts
library(wordcloud) #wordclouds
The initial data supplied by SwiftKey contains text from three separate sources: blogs, news, and Twitter. Once downloaded and unzipped, these files are available in four separate folders, one for each of the available languages: German (DE), English (US), Finnish (FI), and Russian (RU). For the purposes of this analysis we will focus only on the English versions of the three sources.
durl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#Determine if it has already been downloaded
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file(durl, destfile = "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
#Get the 3 data sets
blogUS <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
newsUS <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitrUS <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Before summarizing the data, a sample entry was chosen from each of the three data sets to give the reader a basic idea of the format of the data. A single row (101) was arbitrarily chosen from each data set, with the resulting entries shown below. As one would expect from the nature of the sources, the blog entry is much longer, while the news excerpt and the tweet are much shorter.
blogUS[101]
## [1] "I have mixed emotions about the start to this book. As I’ve already mentioned I very much enjoyed the characters and the voice of the protagonist. But as much as I was enjoying it, it took me a little while to get into the story. We’re presented with a huge case of insta-love right from the start. Lucas, the mysterious and hot new neighbor, comes on strong. I’m pretty sure they start making out on their second or third meeting! I’m warning you now. There’s not a lot of rationalization for the insta-love…you won’t really understand the reasons until the end of the book. Which is why, in retrospect, I kinda like that I had to wait for all of the info to be revealed."
newsUS[101]
## [1] "For Square 19, we find ourselves in a neighborhood that was long considered to be the wrong side of the tracks."
twitrUS[101]
## [1] "hahahahhahah u just made my day :D"
To summarize the data, the file size (MB), the number of lines (length), the number of words, and the number of characters were calculated for each data set. A quick review of the summaries indicates that there are many short tweets and fewer but longer blog entries.
#Determine size in MB
blogsize <- file.size("final/en_US/en_US.blogs.txt")/1024/1024
newsize <- file.size("final/en_US/en_US.news.txt")/1024/1024
twittersize <- file.size("final/en_US/en_US.twitter.txt")/1024/1024
#Determine the number of lines in the files
bloglength <- length(blogUS)
newslength <- length(newsUS)
twitterlength <- length(twitrUS)
#Determine the word count
blogword <- sum(stri_count_words(blogUS))
newsword <- sum(stri_count_words(newsUS))
twitterword <- sum(stri_count_words(twitrUS))
#Determine the character count
blogChar <- sum(nchar(blogUS))
newsChar <- sum(nchar(newsUS))
twitterChar <- sum(nchar(twitrUS))
file <- c("blogs", "news", "twitter")
sizeMB <- c(blogsize, newsize, twittersize)
length <- c(bloglength, newslength, twitterlength)
wordCt <- c(blogword, newsword, twitterword)
charCt <- c(blogChar,newsChar,twitterChar)
sumInfo <- data.frame(file, sizeMB, length, wordCt,charCt)
knitr::kable(sumInfo)
| file | sizeMB | length | wordCt | charCt |
|---|---|---|---|---|
| blogs | 200.4242 | 899288 | 37546806 | 206824505 |
| news | 196.2775 | 77259 | 2674561 | 15639408 |
| twitter | 159.3641 | 2360148 | 30096690 | 162096241 |
Due to the large size of the data sets, we will sample from each of the three files and create a small subset for developing our models. One strategy would be to take a fixed-size sample from each file, but because the three sources differ so much in size, we instead take the same percentage from each, keeping the relative sizes of the three types within the subset representative of their sizes in the original files. A 1.2944% sample was taken to obtain approximately n = 1000 lines from the smaller news data set, yielding a subset of approximately 43,189 lines in total (0.012944 × (899,288 + 77,259 + 2,360,148)).
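As a quick sanity check, the sampling fraction can be derived directly from the line counts computed above (the variable names match those defined earlier):
frac <- 1000 / length(newsUS)  #target ~1,000 news lines; frac ~ 0.012944
#sample() truncates a fractional size, so the subset should total 43,189 lines:
sum(floor(frac * c(length(blogUS), length(newsUS), length(twitrUS))))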
set.seed(4321)
data.sample <- c(sample(blogUS, length(blogUS) * 0.012944),
sample(newsUS, length(newsUS) * 0.012944),
sample(twitrUS, length(twitrUS) * 0.012944))
saveRDS(data.sample, "subset.rds") #cache the subset so sampling need not be repeated
sampledt <- readRDS("subset.rds")
# Create a Corpus
train <- corpus(sampledt)
ndoc(train)
## [1] 43189
We next remove the unnecessary detritus from the text (numbers, punctuation, URLs, separators, and symbols) and convert all entries to lowercase.
trainToken <- tokens(train,
what="word",
remove_numbers = TRUE, #Remove numbers
remove_punct = TRUE, #Remove punctuation
remove_url =TRUE, #Remove web addresses
remove_separators = TRUE, #Remove separators
remove_symbols = TRUE, #Remove symbols
verbose = quanteda_options("verbose"))
trainNoStop <- tokens_remove(trainToken, pattern = stopwords("en")) #Remove common "stopwords"
trainToken <- tokens_tolower(trainToken) #enforce lowercase
trainNoStop <- tokens_tolower(trainNoStop)
We remove any profanity from the data set, using a standard offensive/profane word list obtained from the website of Luis von Ahn (https://www.cs.cmu.edu/~biglou/resources/). Any words found on this list were removed from the training data before any modeling began.
badwords_url <-"http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
download.file(badwords_url, destfile = "bad_words.txt", quiet = TRUE)
badwords <- readLines("bad_words.txt", encoding="UTF-8")
cltrainToken <- tokens_remove(trainToken, pattern = badwords)
cltrainNoSt <- tokens_remove(trainNoStop, pattern = badwords)
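As a rough, illustrative check of the filter's effect, quanteda's ntoken() can be used to compare total token counts before and after removal:
#Approximate number of tokens dropped by the profanity filter
sum(ntoken(trainToken)) - sum(ntoken(cltrainToken))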
To begin the real analysis of the text, the cleaned tokens are combined into n-grams. For this analysis we will use three n-gram sizes (n = 1, 2, and 3), corresponding to single words, word pairs, and word triplets. After the n-grams are constructed, a document-feature matrix is created for each. The final step in this initial analysis is to create bar graphs and wordclouds highlighting the most common n-grams in each category.
ngrams_1NoSt <- tokens_ngrams(cltrainNoSt, n = 1, concatenator = " ")
ngrams_1 <- tokens_ngrams(cltrainToken, n = 1, concatenator = " ")
uniDfm <- dfm(ngrams_1NoSt,
remove_padding = TRUE,
#tolower = TRUE, #done in the token stage
#remove = bad.words, #done in the token stage
verbose = FALSE)
unig100 <- topfeatures(uniDfm, 100)
unig100df <- data.frame(unigram = names(unig100), freq = unig100, row.names = NULL)
kable(head(unig100df))
| unigram | freq |
|---|---|
| just | 3366 |
| like | 2799 |
| one | 2746 |
| can | 2486 |
| get | 2421 |
| time | 2262 |
g1 <- ggplot(unig100df[1:20,],aes(x=reorder(unigram, -freq),y=freq))+ geom_bar(stat="identity", fill = "blue")
g1 <- g1 + theme(axis.text.x=element_text(angle=45, hjust = 1))
g1 <- g1 + labs(title="Top 20 Single word Frequency",x="Word",y="Frequency")
g1
wordcloud(words = unig100df$unigram,
freq = unig100df$freq,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(9, "Set1"))
ngrams_2NoSt <- tokens_ngrams(cltrainNoSt, n = 2, concatenator = " ")
ngrams_2 <- tokens_ngrams(cltrainToken, n = 2, concatenator = " ")
biDfm <- dfm(ngrams_2NoSt,
remove_padding = TRUE,
#tolower = TRUE, #done in the token stage
#remove = bad.words, #done in the token stage
verbose = FALSE)
big100 <- topfeatures(biDfm, 100)
big100df <- data.frame(bigram = names(big100), freq = big100, row.names = NULL)
head(big100df)
##            bigram freq
## 1       right now  307
## 2      last night  178
## 3       feel like  144
## 4 looking forward  128
## 5         can get  125
## 6      looks like  121
g1 <- ggplot(big100df[1:20,],aes(x=reorder(bigram, -freq),y=freq))+ geom_bar(stat="identity", fill = "blue")
g1 <- g1 + theme(axis.text.x=element_text(angle=45, hjust = 1))
g1 <- g1 + labs(title="Top 20 Bi-word Frequency",x="Word Pairs",y="Frequency")
g1
wordcloud(words = big100df$bigram,
freq = big100df$freq,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(9, "Set1"))
A review of the trigram bar chart reveals one problem with this type of text analysis. The 3rd and 5th most common trigrams are “happy mothers day” and “happy mother's day” respectively. Clearly these two groupings should be combined into a single category, which would make them the most common trigram in the data set. Although the tokens were created with remove_punct = TRUE, quanteda removes only tokens that consist entirely of punctuation; it preserves apostrophes and hyphens within words (like “mother's”) to maintain word integrity. The apostrophes could be stripped out before the tokens are constructed, but due to the loss of information from contractions (he'll becomes hell and we're becomes were) this is generally not recommended. This specific situation could be corrected manually, but for the purposes of this analysis it is ignored.
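A small illustration of the information loss from hardcoded apostrophe removal:
#Stripping straight and curly apostrophes collapses contractions into
#unrelated words, which is why that cleanup step is avoided here.
gsub("['’]", "", c("he'll", "we're", "mother's"))
## [1] "hell"    "were"    "mothers"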
ngrams_3NoSt <- tokens_ngrams(cltrainNoSt, n = 3, concatenator = " ")
ngrams_3 <- tokens_ngrams(cltrainToken, n = 3, concatenator = " ")
triDfm <- dfm(ngrams_3NoSt,
remove_padding = TRUE,
#tolower = TRUE, #done in the token stage
#remove = bad.words, #done in the token stage
verbose = FALSE)
trig100 <- topfeatures(triDfm, 100)
trig100df <- data.frame(trigram = names(trig100), freq = trig100, row.names = NULL)
head(trig100df)
##              trigram freq
## 1     happy new year   32
## 2        let us know   29
## 3  happy mothers day   21
## 4      new york city   20
## 5 happy mother's day   18
## 6      cinco de mayo   16
g1 <- ggplot(trig100df[1:20,],aes(x=reorder(trigram, -freq),y=freq))+ geom_bar(stat="identity", fill = "blue")
g1 <- g1 + theme(axis.text.x=element_text(angle=45, hjust = 1))
g1 <- g1 + labs(title="Top 20 tri-word Frequency",x="Word Triplets",y="Frequency")
g1
wordcloud(words = trig100df$trigram,
freq = trig100df$freq,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(9, "Set1"))
mtext("Popular Trigrams", side=3, line=3, cex=1.5)
The frequency data calculated and summarized here will form the basis of my predictive model. I plan to use these data to compute observed and unobserved n-gram probabilities for input words and phrases. The model will be deployed as a Shiny app, and the app will work from the n-gram data frames rather than the original data to save processing time and memory.
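To make that plan concrete, below is a minimal sketch of the intended lookup (illustrative only, not the final model): given the last words of the input, prefer a trigram match and back off to bigrams. It assumes complete n-gram frequency tables shaped like the big100df and trig100df data frames above; predict_next is a hypothetical helper name.
predict_next <- function(input, bi_df, tri_df) {
  words <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  n <- length(words)
  if (n >= 2) {
    #Trigram match: entries whose first two words are the last two input words
    prefix <- paste(words[n - 1], words[n], "")
    hits <- tri_df[startsWith(tri_df$trigram, prefix), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$trigram[which.max(hits$freq)]))
  }
  #Back off to bigrams keyed on the last input word
  prefix <- paste(words[n], "")
  hits <- bi_df[startsWith(bi_df$bigram, prefix), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$bigram[which.max(hits$freq)]))
  NA_character_  #no match; the full model would back off further to unigrams
}
predict_next("I was looking", big100df, trig100df)  #likely "forward"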