Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, the corporate partner in this capstone project, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
“I went to the”
the keyboard presents three options for what the next word might be; for example, gym, store, or restaurant. In this project, I will work on understanding and building predictive text models like those used by SwiftKey.
This project will start by analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, I will build a predictive text product.
First, I downloaded the data set from the link provided on the Coursera website. It is a compilation of blog posts, news articles, and Twitter posts scraped from the web.
if(!file.exists("Coursera-SwiftKey.zip")) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
file.rename("final", "Data")
}
Next, I read each file into a character vector using the readLines function.
# Read each file line by line, closing each connection when done
con <- file("Data/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con)
close(con)
con <- file("Data/en_US/en_US.news.txt", "r")
news <- readLines(con)
close(con)
con <- file("Data/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con)
close(con)
See some basic information about each file below.
| file | size (MB) | lines | characters | longest line (characters) |
|---|---|---|---|---|
| blogs | 200.4 | 899,288 | 206,824,505 | 40,833 |
| news | 196.3 | 1,010,242 | 203,223,159 | 11,384 |
| twitter | 159.4 | 2,360,148 | 162,096,031 | 140 |
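For reference, the sketch below shows one way these per-file summaries could be computed with base R. summarize_file is just a helper written for this report, and exact figures may vary slightly depending on encoding and how lines are counted.
# One way the per-file summaries above could be computed (base R only);
# summarize_file() is a helper for this report
summarize_file <- function(path, lines) {
  data.frame(
    file         = basename(path),
    size.mb      = round(file.size(path) / 1024^2, 1),  # file size in megabytes
    lines        = length(lines),                        # number of lines
    characters   = sum(nchar(lines)),                    # total characters
    longest.line = max(nchar(lines))                     # longest line, in characters
  )
}

rbind(summarize_file("Data/en_US/en_US.blogs.txt", blogs),
      summarize_file("Data/en_US/en_US.news.txt", news),
      summarize_file("Data/en_US/en_US.twitter.txt", twitter))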
After combining all three sets of data, I used the sample function to randomly select five percent of all lines for my sample set. Using functions from the {tm} package along with base R, I converted the sample to lower case and removed stop words, punctuation, numbers, and non-ASCII characters.
library(tm)
set.seed(2017-10-29)

# Combine the three sources and sample five percent of all lines
vector.all <- c(blogs, news, twitter)
vector.sample <- sample(vector.all, length(vector.all) * 0.05)

# Normalize case, then strip stop words, punctuation, numbers, and non-ASCII characters
vector.sample <- tolower(vector.sample)
vector.sample <- removeWords(vector.sample, stopwords("en"))
vector.sample <- removePunctuation(vector.sample, preserve_intra_word_dashes = TRUE)
vector.sample <- removeNumbers(vector.sample)
vector.sample <- iconv(vector.sample, "UTF-8", "ASCII", sub = "")
I also filtered profane language and extra whitespace from the sample.
if(!file.exists("bad-words.txt")) {
download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", "bad-words.txt")
}
con <- file("bad-words.txt", "r")
bad.words <- readLines(con)
close(con)
bad.words <- bad.words[-1]
vector.sample <- removeWords(vector.sample, bad.words)
vector.sample <- stripWhitespace(vector.sample)
In order to avoid repeating the above steps, I wrote my clean sample to a text file.
if(!file.exists("en_US.clean.sample.txt")) {
writeLines(vector.sample, "en_US.clean.sample.txt")
}
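In later sessions, the cleaned sample can then be read straight back in rather than re-running the sampling and cleaning steps:
# Reload the cleaned sample instead of repeating the steps above
vector.sample <- readLines("en_US.clean.sample.txt")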
Using functions from the {tokenizers} package, I generated term frequency tables for unigram, bigram, and trigram models. Plots of each appear below.
library(dplyr)
library(tokenizers)
library(ggplot2)   # needed for the frequency plots below
# Collapse the sample into one string so each tokenizer sees a single document
sample.text <- paste(vector.sample, collapse = " ")
vector.token.one <- as.data.frame(table(tokenize_words(sample.text))) %>% arrange(desc(Freq))
names(vector.token.one)[1] <- "string"
vector.token.two <- as.data.frame(table(tokenize_ngrams(sample.text, n = 2, n_min = 2))) %>% arrange(desc(Freq))
names(vector.token.two)[1] <- "string"
vector.token.three <- as.data.frame(table(tokenize_ngrams(sample.text, n = 3, n_min = 3))) %>% arrange(desc(Freq))
names(vector.token.three)[1] <- "string"
The plot below depicts the top 15 most frequent words in the sample.
ggplot(vector.token.one[1:15,], aes(x = reorder(string, desc(Freq)), y = Freq)) +
geom_bar(fill = "coral1", stat = "identity") +
theme_TCR() +
scale_y_continuous(expand = c(0,0)) +
xlab("Top 15 Words") +
ylab("Frequency")
The next plot depicts the top 15 most frequent two-word phrases in the sample.
ggplot(vector.token.two[1:15,], aes(x = reorder(string, desc(Freq)), y = Freq)) +
geom_bar(fill = "coral1", stat = "identity") +
theme_TCR() +
scale_y_continuous(expand = c(0,0)) +
xlab("Top 15 Two-Word Phrases") +
ylab("Frequency") +
theme(axis.text.x = element_text(margin = margin(-23,0,30,0)),
axis.line.x = element_line(color = "#222222", size = 0.5))
And the final plot in this section depicts the top 15 most frequent three-word phrases in the sample.
ggplot(vector.token.three[1:15,], aes(x = reorder(string, desc(Freq)), y = Freq)) +
geom_bar(fill = "coral1", stat = "identity") +
theme_TCR() +
scale_y_continuous(expand = c(0,0)) +
xlab("Top 15 Three-Word Phrases") +
ylab("Frequency") +
theme(axis.text.x = element_text(margin = margin(-35,0,30,0)),
axis.line.x = element_line(color = "#222222", size = 0.5))
Using the function below, I determined how many words from the sample covered 50 and 90 percent of all word instances, respectively.
getCoverage <- function(coverage) {
  # Number of unique words, taken from most to least frequent, needed to
  # cover the given proportion of all word instances in the sample
  reqFreq <- coverage * sum(vector.token.one$Freq)
  freq <- 0
  for (i in 1:nrow(vector.token.one)) {
    freq <- freq + vector.token.one$Freq[i]
    if (freq >= reqFreq) {
      return(i)
    }
  }
  nrow(vector.token.one)
}
getCoverage(0.5) ; getCoverage(0.9)
## [1] 945
## [1] 14846
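For reference, the same figures can be computed in a single vectorized step; getCoverage2 below is just an alternative formulation of the loop above, written for this report:
# Vectorized equivalent: the first rank at which the cumulative word
# frequency reaches the required coverage
getCoverage2 <- function(coverage) {
  which(cumsum(vector.token.one$Freq) >= coverage * sum(vector.token.one$Freq))[1]
}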
The most notable finding so far is how skewed the word-frequency distribution is: fewer than a thousand unique words cover half of all word instances in the sample, while roughly fifteen thousand cover ninety percent. Beyond that, there have not been any particularly surprising results.
In order to build a prediction algorithm, I am considering using the lightweight predict_Backoff function from the {ANLP} package. Its documentation is here. My future Shiny app will take a string of text as an input, and return the top three suggestions for the next word, much like SwiftKey’s product.
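To make the planned behavior concrete, the sketch below shows one way a simple backoff lookup could work using the bigram and trigram tables generated above. This is only an illustration of the general idea; predictNextWord is a placeholder helper written for this report, not the predict_Backoff implementation from {ANLP}.
# Illustrative sketch of a simple backoff lookup built on the n-gram tables above;
# predictNextWord() is a placeholder for this report, not ANLP's predict_Backoff
predictNextWord <- function(input, n = 3) {
  words <- unlist(tokenize_words(input))
  # Last token of each matched n-gram, i.e. the candidate next word
  last.words <- function(grams) {
    vapply(strsplit(grams, " ", fixed = TRUE), tail, character(1), 1)
  }
  # All n-grams in a frequency table that start with the given prefix
  match.prefix <- function(tbl, prefix) {
    grams <- as.character(tbl$string)
    last.words(grams[startsWith(grams, paste0(prefix, " "))])
  }
  candidates <- character(0)
  # Trigram table first: condition on the last two words of the input
  if (length(words) >= 2) {
    candidates <- c(candidates, match.prefix(vector.token.three,
                                             paste(tail(words, 2), collapse = " ")))
  }
  # Back off to the bigram table: condition on the last word only
  if (length(words) >= 1) {
    candidates <- c(candidates, match.prefix(vector.token.two, tail(words, 1)))
  }
  # Final fallback: the most frequent single words overall
  candidates <- c(candidates, as.character(vector.token.one$string))
  head(unique(candidates), n)
}

predictNextWord("I went to the")
Note that because stop words were removed during cleaning, a query like "I went to the" would mostly fall through to the unigram fallback in this sketch; a production model would likely need to be trained on text that retains stop words.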