Around the world, people spend an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of this smart keyboard is its predictive text models: when someone types "I went to the", the keyboard presents three options for what the next word might be, for example gym, store, or restaurant. In this capstone I will work on understanding and building predictive text models like those used by SwiftKey.
What follows is an intermediate R Markdown report that describes, in plain language, plots, and code, my exploratory analysis of the course dataset.
The dataset is downloaded from the following URL: Capstone Dataset. At this stage, I load the relevant libraries as well as the three datasets: Twitter, news, and blogs. In this project, only the English-language datasets are considered.
library(readr)
library(stringi)
library(tidyverse)
library(tidytext)
library(wordcloud)
We have English-language data from three different sources:
blogs (en_US.blogs.txt)
news (en_US.news.txt)
Twitter (en_US.twitter.txt)
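Before reading the files, the raw archive can be downloaded and unzipped. The sketch below is one way to do this; the download URL, the paths inside the archive, and the D:/ destination are assumptions that should be adapted to your own setup.
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" # assumed course URL
zip_file <- "D:/Coursera-SwiftKey.zip"
if (!file.exists("D:/en_US.twitter.txt")) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
  # the English files are assumed to sit under final/en_US/ inside the archive
  unzip(zip_file,
        files = c("final/en_US/en_US.twitter.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.blogs.txt"),
        exdir = "D:/", junkpaths = TRUE)
}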
twitter <- readLines("D:/en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
news <- readLines("D:/en_US.news.txt", skipNul = TRUE, encoding = "UTF-8")
## Warning in readLines("D:/en_US.news.txt", skipNul = TRUE, encoding = "UTF-8"):
## incomplete final line found on 'D:/en_US.news.txt'
blogs <- readLines("D:/en_US.blogs.txt", skipNul = TRUE, encoding = "UTF-8")
Now that the datasets have been imported, let's collect some preliminary statistics.
First, the size (MB) of each object in memory:
object_size <- sapply(list(twitter=twitter, news = news, blogs = blogs), function(x) object.size(x)/1024^2)
object_size
## twitter news blogs
## 318.98975 19.76917 255.35453
Number of lines:
number_lines <- sapply(list(twitter=twitter, news = news, blogs = blogs), length)
number_lines
## twitter news blogs
## 2360148 77259 899288
Number of characters:
number_chars <- sapply(list(twitter=twitter, news = news, blogs = blogs), function(x) nchar(x) |> sum())
number_chars
## twitter news blogs
## 162096241 15639408 206824505
Number of words (splitting each line on spaces; stringi-based word counts are also reported below):
number_words <- sapply(list(twitter=twitter, news = news, blogs = blogs), function(x) str_split(x, " ") |> unlist() |> length())
number_words
File sizes on disk (MB):
file.info("D:/en_US.twitter.txt")$size/1024^2
## [1] 159.3641
file.info("D:/en_US.blogs.txt")$size/1024^2
## [1] 200.4242
file.info("D:/en_US.news.txt")$size/1024^2
## [1] 196.2775
Alternatively, using the stringi package:
sapply(list(twitter=twitter, news = news, blogs = blogs), function(x) stri_count_words(x) |> sum())
## twitter news blogs
## 30093413 2674536 37546250
sapply(list(twitter=twitter, news = news, blogs = blogs), stri_stats_general)
## twitter news blogs
## Lines 2360148 77259 899288
## LinesNEmpty 2360148 77259 899288
## Chars 162096241 15639408 206824382
## CharsNWhite 134082806 13072698 170389539
sapply(list(twitter=twitter, news = news, blogs = blogs), stri_stats_general)[4,]
## twitter news blogs
## 134082806 13072698 170389539
# num words per file
sapply(list(twitter=twitter, news = news, blogs = blogs), stri_stats_latex)[4,]
## twitter news blogs
## 30451170 2651432 37570839
sapply(list(twitter=twitter, news = news, blogs = blogs), stri_stats_latex)
## twitter news blogs
## CharsWord 125570778 12476453 162464653
## CharsCmdEnvir 3032 0 9
## CharsWhite 35958529 3096618 42636700
## Words 30451170 2651432 37570839
## Cmds 963 0 3
## Envirs 0 0 0
qplot(twitter %>% str_count("\\S+"),
geom = "histogram",
main = "US Twitter",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 5)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
qplot(blogs %>% str_count("\\S+"),
geom = "histogram",
main = "US Blogs",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 5)
qplot(news %>% str_count("\\S+"),
geom = "histogram",
main = "US News",
xlab = "Words per Line",
ylab = "Frequency",
binwidth = 5)
## Sampling the data
As we can see, the datasets are quite large. For that reason, I randomly sample 10,000 lines from each dataset and, at the same time, combine them into a single dataset.
set.seed(007)
sampleData <- sample(twitter,10000) %>%
transform(file = "twitter") %>%
rbind(sample(news,10000) %>% transform(file = "news")) %>%
rbind(sample(blogs,10000) %>% transform(file = "blogs"))
sampleData_tidy <- sampleData %>%
as_tibble() %>%
mutate(linenumber = row_number(), .before= 1) %>%
rename(text0 = X_data)
Now, let’s remove digits and punctuation and perform some tokenization:
# tokenization
library(tidytext)
sampleData_tidy <- sampleData_tidy %>%
mutate(text0 = str_remove_all(text0, "[:digit:]")) %>%
mutate(text0 = str_remove_all(text0, "[:punct:]")) %>%
unnest_tokens(word, text0) %>%
mutate(word = tolower(word))
As required, I also remove profanity and stop words along the way:
download.file(url = "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt",
destfile = "D:/bad-words.txt", quiet = TRUE, method = "auto")
bad_words <- read_delim("D:/bad-words.txt",
delim = "\t", escape_double = FALSE,
col_names = FALSE, trim_ws = TRUE)
## Rows: 1383 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): X1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bad_words <- setNames(bad_words, c("word"))
data(stop_words)
sampleData_tidy <- sampleData_tidy %>%
anti_join(stop_words, by = "word") %>%
anti_join(bad_words, by = "word")
In this exercise, the analysis is conducted per source file; grouped operations like this are an important advantage of the tidytext approach.
sampleData_tidy %>%
group_by(file) %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 10)
## # A tibble: 30 × 3
## # Groups: file [3]
## file word n
## <chr> <chr> <int>
## 1 blogs time 960
## 2 blogs im 701
## 3 blogs people 621
## 4 blogs dont 618
## 5 blogs day 577
## 6 blogs love 479
## 7 blogs life 445
## 8 blogs book 331
## 9 blogs didnt 310
## 10 blogs ive 300
## # ℹ 20 more rows
sampleData_tidy %>%
group_by(file) %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 10) %>%
ggplot() +
geom_col(aes(x = word, y = n, fill = file)) +
facet_wrap(~ file, scales = "free") +
coord_flip()
sampleData_tidy %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 100) %>%
with(., wordcloud(words = word, freq = n, colors = "blue"))
In this bigram exercise, I do not remove stop words because they could be important for the prediction task.
tidy_bigrams <- sampleData %>%
rename(text = X_data) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
mutate(bigram0=bigram) %>%
separate_wider_delim(cols = bigram0, delim = " ", names = c("word1", "word2")) %>%
anti_join(bad_words, by = c("word1" = "word")) %>%
anti_join(bad_words, by = c("word2" = "word")) %>%
filter(!is.na(bigram)) %>%
select(-word1, -word2)
tidy_bigrams %>%
count(bigram, sort = TRUE) %>%
top_n(n=15, n) %>%
ggplot(aes(x = fct_reorder(bigram, n), y = n)) +
geom_col(fill = "blue") +
coord_flip() +
theme_light() +
labs(x = "Words", y = "Frequency")
In this trigram exercise, I do not remove stop words because they could be important for the prediction task.
tidy_trigrams <- sampleData %>%
rename(text = X_data) %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
mutate(trigram0=trigram) %>%
separate_wider_delim(cols = trigram0, delim = " ", names = c("word1", "word2", "word3")) %>%
anti_join(bad_words, by = c("word1" = "word")) %>%
anti_join(bad_words, by = c("word2" = "word")) %>%
anti_join(bad_words, by = c("word3" = "word")) %>%
filter(!is.na(trigram)) %>%
select(-starts_with("word"))
tidy_trigrams %>%
count(trigram, sort = TRUE) %>%
top_n(n=15, n) %>%
ggplot(aes(x = fct_reorder(trigram, n), y = n)) +
geom_col(fill = "blue") +
coord_flip() +
theme_light() +
labs(x = "Words", y = "Frequency")
The same exercise, but now with the tm and tokenizers packages.
I clean the data using tm_map, keeping stop words since they are useful for predicting the next word.
library(tm)
corpus <- sampleData %>%
setNames(c("text", "doc_id")) %>%
DataframeSource() %>%
VCorpus()
corpus <- tm_map(corpus, content_transformer(tolower)) # convert to lowercase
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove multiple whitespace
corpus <- tm_map(corpus, content_transformer(function(x, pattern) str_remove_all(x, "/|@|\\|")))
#corpus <- tm_map(corpus, content_transformer(function(x) str_remove_all(x, "[^[:graph:]]"))) #Cleaning all non ASCII characters
corpus <- tm_map(corpus, removeWords, unlist(bad_words)) #Removing Profanities
#corpus <- tm_map(corpus,removeWords,c(stopwords("english"),letters))# Deleting all English stopwords and any stray letters left
corpus <- tm_map(corpus, removeWords, letters) # remove stray single letters
corpus <- tm_map(corpus, stripWhitespace) # remove multiple whitespace
corpus[[1]]$content
## [1] " holding onto the hope of owning my own biz that makes difference for folks time to get serious hugs"
content(corpus[[1]])
## [1] " holding onto the hope of owning my own biz that makes difference for folks time to get serious hugs"
# convert corpus to a dataframe
corpus_df <- data.frame(text = unlist(sapply(corpus, '[', "content")), stringsAsFactors = FALSE)
head(corpus_df)
## text
## 1 holding onto the hope of owning my own biz that makes difference for folks time to get serious hugs
## 2 tweet tweet tweet voicegram
## 3 oh girl
## 4 you are amazing way to make me cry you
## 5 its sarah michelle gellar bring back ringer
## 6 butwhy bring it up now will you bring it up if no bg cant believe wrote that makes the final four
tdm <- TermDocumentMatrix(corpus) |> removeSparseTerms(0.99)
freq <- as.matrix(tdm) %>% rowSums()
wordFreq <- data.frame(word = names(freq), freq = freq) %>% arrange(desc(freq))
wordFreq |> head()
## word freq
## the the 43904
## and and 22393
## that that 9515
## for for 9068
## you you 6498
## with with 6325
wordFreq %>%
{wordcloud(words = .$word,
freq = .$freq,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(9, "Set2"))}
## Warning in brewer.pal(9, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
wordFreq[1:10, ] %>%
ggplot(aes(x = fct_reorder(word, freq), y = freq)) +
geom_col(fill = "blue") +
coord_flip() +
theme_light() +
labs(x = "Words", y = "Frequency")
library(tokenizers)
## Warning: package 'tokenizers' was built under R version 4.3.3
tokenize_words(corpus_df |> unlist(),
lowercase = TRUE,
#stopwords = ("to"),
strip_punct = TRUE,
strip_numeric = TRUE,
simplify = TRUE) %>%
unlist() %>%
as_tibble() %>%
count(value, sort = TRUE) %>%
head()
## # A tibble: 6 × 2
## value n
## <chr> <int>
## 1 the 44055
## 2 to 24059
## 3 and 22454
## 4 of 18765
## 5 in 14848
## 6 that 9736
tokenize_words(corpus_df |> unlist(),
lowercase = TRUE,
# stopwords = ("to"),
strip_punct = TRUE,
strip_numeric = TRUE,
simplify = TRUE) %>%
unlist() %>%
as_tibble() %>%
count(value, sort = TRUE) %>%
na.omit() %>%
slice_max(n, n= 10) %>%
ggplot(aes(x = fct_reorder(value, n), y = n)) +
geom_col(fill = "blue") +
coord_flip() +
theme_light() +
labs(x = "Words", y = "Frequency")
tokenize_ngrams(corpus_df$text ,
lowercase = TRUE,
n = 2L,
n_min = 2,
# stopwords = character(),
ngram_delim = " ",
simplify = FALSE) %>%
unlist() %>%
as_tibble() %>%
count(value, sort = TRUE) %>%
head()
## # A tibble: 6 × 2
## value n
## <chr> <int>
## 1 of the 4054
## 2 in the 3840
## 3 to the 1997
## 4 on the 1725
## 5 for the 1663
## 6 to be 1472
tokenize_ngrams(corpus_df |> unlist(),
lowercase = TRUE,
n = 2L,
n_min = 2,
# stopwords = character(),
ngram_delim = " ",
simplify = FALSE) %>%
unlist() %>%
as_tibble() %>%
count(value, sort = TRUE) %>%
na.omit() %>%
slice_max(n, n= 10) %>%
ggplot(aes(x = fct_reorder(value, n), y = n)) +
geom_col(fill = "blue") +
coord_flip() +
theme_light() +
labs(x = "Words", y = "Frequency")
tokenize_ngrams(corpus_df$text,
lowercase = TRUE,
n = 3L,
n_min = 3,
# stopwords = character(),
ngram_delim = " ",
simplify = FALSE) %>%
unlist() %>%
as_tibble() %>%
count(value, sort = TRUE) %>%
head()
## # A tibble: 6 × 2
## value n
## <chr> <int>
## 1 <NA> 1320
## 2 one of the 305
## 3 going to be 158
## 4 the end of 140
## 5 out of the 125
## 6 some of the 123
tokenize_ngrams(corpus_df |> unlist(),
lowercase = TRUE,
n = 3L,
n_min = 3,
# stopwords = character(),
ngram_delim = " ",
simplify = FALSE) %>%
unlist() %>%
as_tibble() %>%
count(value, sort = TRUE) %>%
na.omit() %>%
top_n(n= 10, n) %>%
ggplot(aes(x = fct_reorder(value, n), y = n)) +
geom_col(fill = "blue") +
coord_flip() +
theme_light() +
labs(x = "Words", y = "Frequency")
The same exercise, but now with the quanteda library. This time, I remove stop words to see the differences.
library(quanteda)
q_corpus <- corpus(sampleData$X_data)
text_tokens <- tokens(q_corpus,
what="word1",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_url =TRUE,
remove_separators = TRUE,
remove_symbols = TRUE,
verbose = quanteda_options("verbose"))
text_tokens <- tokens_remove(text_tokens, pattern = stopwords("en"))
text_tokens <- tokens_remove(text_tokens, pattern = unlist(bad_words))
unigram <- tokens_ngrams(text_tokens, n = 1, concatenator = " ")
# create the document-feature matrix
text_unigram <- dfm(unigram,
tolower = TRUE,
remove_padding = TRUE,
verbose = FALSE)
text_unigram <- text_unigram %>% dfm_remove(unlist(bad_words))
topfeatures(text_unigram, 100) %>%
data.frame(names = names(.), freq = .) %>%
head()
## names freq
## said said 2838
## one one 2563
## just just 2226
## like like 2183
## can can 2064
## time time 1803
topfeatures(text_unigram, 100) %>%
data.frame(names = names(.), freq = .) %>%
{wordcloud(words = .$names,
freq = .$freq,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(9, "Set2"))}
## Warning in brewer.pal(9, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
topfeatures(text_unigram, 10) %>%
data.frame(names = names(.), freq = .) %>%
ggplot(aes(x = fct_reorder(names, freq), y = freq)) +
geom_col(fill = "blue") +
coord_flip() +
theme_light() +
labs(x = "Words", y = "Frequency")
bigram <- tokens_ngrams(text_tokens, n = 2, concatenator = " ")
# create the document-feature matrix
text_bigram <- dfm(bigram,
tolower = TRUE,
remove_padding = TRUE,
verbose = FALSE)
text_bigram <- text_bigram %>% dfm_remove(unlist(bad_words))
topfeatures(text_bigram, 100) %>%
data.frame(names = names(.), freq = .) %>%
head()
## names freq
## new york new york 171
## last year last year 169
## right now right now 152
## last week last week 136
## years ago years ago 133
## high school high school 121
topfeatures(text_bigram, 100) %>%
data.frame(names = names(.), freq = .) %>%
{wordcloud(words = .$names,
freq = .$freq,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(9, "Set2"))}
topfeatures(text_bigram, 10) %>%
data.frame(names = names(.), freq = .) %>%
ggplot(aes(x = fct_reorder(names, freq), y = freq)) +
geom_col(fill = "blue") +
coord_flip() +
theme_light() +
labs(x = "Words", y = "Frequency")
trigram <- tokens_ngrams(text_tokens, n = 3, concatenator = " ")
# create the document-feature matrix
text_trigram <- dfm(trigram,
tolower = TRUE,
remove_padding = TRUE,
verbose = FALSE)
text_trigram <- text_trigram %>% dfm_remove(unlist(bad_words))
topfeatures(text_trigram, 100) %>%
data.frame(names = names(.), freq = .) %>%
head()
## names freq
## swag swag swag swag swag swag 23
## new york city new york city 19
## president barack obama president barack obama 16
## nec mobilepro pda nec mobilepro pda 14
## mobile south carolina mobile south carolina 13
## happy mothers day happy mothers day 12
topfeatures(text_trigram, 20) %>%
data.frame(names = names(.), freq = .) %>%
{wordcloud(words = .$names,
freq = .$freq,
min.freq = 1,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(9, "Set2"))}
topfeatures(text_trigram, 10) %>%
data.frame(names = names(.), freq = .) %>%
ggplot(aes(x = fct_reorder(names, freq), y = freq)) +
geom_col(fill = "blue") +
coord_flip() +
theme_light() +
labs(x = "Words", y = "Frequency")
Following the exploratory data analysis, the next steps involve finalizing the predictive algorithm, deploying a Shiny app, and creating a presentation deck.
The predictive algorithm will employ an n-gram backoff model. The model first looks for the longest n-grams (trigrams, then bigrams) that match the end of the input text; if no suitable match is found, it progressively backs off to shorter n-grams, ultimately relying on unigram frequencies. The model will be trained on a larger sample than the one used in this exploratory analysis, to improve prediction accuracy.
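As a rough illustration of the backoff idea (not the final implementation), the sketch below assumes pre-computed frequency tables: hypothetical tibbles trigram_freq (columns word1, word2, prediction, n), bigram_freq (word1, prediction, n), and unigram_freq (prediction, n), built from counts like those above.
# Minimal backoff sketch; the frequency tables named here are assumptions.
predict_next_word <- function(input, k = 3) {
  words <- str_split(str_to_lower(input), "\\s+")[[1]]
  last_two <- tail(words, 2)
  # 1. try trigrams whose first two words match the last two typed words
  cand <- trigram_freq %>%
    filter(word1 == last_two[1], word2 == last_two[2]) %>%
    slice_max(n, n = k)
  # 2. back off to bigrams matching the last typed word
  if (nrow(cand) == 0) {
    cand <- bigram_freq %>%
      filter(word1 == tail(words, 1)) %>%
      slice_max(n, n = k)
  }
  # 3. fall back to the overall most frequent words
  if (nrow(cand) == 0) {
    cand <- unigram_freq %>% slice_max(n, n = k)
  }
  cand$prediction
}
# Example call: predict_next_word("I went to the")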
The Shiny app will provide an interactive interface. Users can input text, and the app will instantly suggest the most likely next word based on the model’s predictions.
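A minimal sketch of such an app, assuming the hypothetical predict_next_word() function from the previous sketch, could look like this:
library(shiny)
ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Type a phrase:", value = "I went to the"),
  textOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderText({
    req(input$phrase)
    # show the top candidate words returned by the (hypothetical) backoff model
    paste(predict_next_word(input$phrase), collapse = " | ")
  })
}
# shinyApp(ui, server)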