Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types "I went to the", the keyboard presents three options for what the next word might be, for example gym, store, or restaurant. In this capstone I will work on understanding and building predictive text models like those used by SwiftKey.

What follows is an intermediate R Markdown report that describes, in plain language, plots, and code, my exploratory analysis of the course data set.

Preliminary steps: import libraries and datasets

The dataset is downloaded from the following URL: Capstone Dataset. At this stage, I load the relevant libraries as well as the three datasets: Twitter, news and blogs. In this project, only the English-language datasets are considered.

library(readr)
library(stringi)
library(tidyverse)
library(tidytext)
library(wordcloud)

We have English-language data from three different sources:

blogs (en_US.blogs.txt)
news (en_US.news.txt)
Twitter (en_US.twitter.txt)

twitter <- readLines("D:/en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
news <- readLines("D:/en_US.news.txt", skipNul = TRUE, encoding = "UTF-8")
## Warning in readLines("D:/en_US.news.txt", skipNul = TRUE, encoding = "UTF-8"):
## incomplete final line found on 'D:/en_US.news.txt'
blogs <- readLines("D:/en_US.blogs.txt", skipNul = TRUE, encoding = "UTF-8")

Preliminary stats

Now that the datasets have been imported, let’s collect some preliminary stats.

First, the size (in MB) of each object in memory:

object_size <- sapply(list(twitter=twitter, news = news, blogs = blogs), function(x) object.size(x)/1024^2)
object_size
##   twitter      news     blogs 
## 318.98975  19.76917 255.35453

Number of lines:

number_lines <- sapply(list(twitter=twitter, news = news, blogs = blogs), length)
number_lines
## twitter    news   blogs 
## 2360148   77259  899288

Number of characters:

number_chars <- sapply(list(twitter=twitter, news = news, blogs = blogs), function(x) nchar(x) |> sum())
number_chars
##   twitter      news     blogs 
## 162096241  15639408 206824505

Number of words (using a simple whitespace split; the stringi-based counts further below are more precise):

number_words <- sapply(list(twitter = twitter, news = news, blogs = blogs),
                       function(x) str_split(x, " ") |> unlist() |> length())
number_words

File sizes (in MB):

file.info("D:/en_US.twitter.txt")$size/1024^2
## [1] 159.3641
file.info("D:/en_US.blogs.txt")$size/1024^2
## [1] 200.4242
file.info("D:/en_US.news.txt")$size/1024^2
## [1] 196.2775

Alternatives, using the stringi package:

sapply(list(twitter=twitter, news = news, blogs = blogs), function(x) stri_count_words(x) |> sum())
##  twitter     news    blogs 
## 30093413  2674536 37546250
sapply(list(twitter=twitter, news = news, blogs = blogs), stri_stats_general)
##               twitter     news     blogs
## Lines         2360148    77259    899288
## LinesNEmpty   2360148    77259    899288
## Chars       162096241 15639408 206824382
## CharsNWhite 134082806 13072698 170389539
sapply(list(twitter=twitter, news = news, blogs = blogs), stri_stats_general)[4,]
##   twitter      news     blogs 
## 134082806  13072698 170389539
# num words per file
sapply(list(twitter=twitter, news = news, blogs = blogs), stri_stats_latex)[4,]
##  twitter     news    blogs 
## 30451170  2651432 37570839
sapply(list(twitter=twitter, news = news, blogs = blogs), stri_stats_latex)
##                 twitter     news     blogs
## CharsWord     125570778 12476453 162464653
## CharsCmdEnvir      3032        0         9
## CharsWhite     35958529  3096618  42636700
## Words          30451170  2651432  37570839
## Cmds                963        0         3
## Envirs                0        0         0

Histogram of Words per Line

qplot(twitter %>% str_count(boundary("word")),
               geom = "histogram",
               main = "US Twitter",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 5)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

qplot(blogs %>% str_count(boundary("word")),
               geom = "histogram",
               main = "US Blogs",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 5)

qplot(news %>% str_count(boundary("word")),
               geom = "histogram",
               main = "US News",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 5)

Sampling the data

As we can see, the datasets are quite large. For that reason, I randomly sample 10,000 lines from each dataset and, at the same time, combine them into a single dataset.

set.seed(007)
sampleData <- sample(twitter,10000) %>% 
  transform(file = "twitter") %>% 
  rbind(sample(news,10000) %>% transform(file = "news")) %>% 
  rbind(sample(blogs,10000) %>% transform(file = "blogs"))

Data cleaning and tokenization

Using the tidytext framework

sampleData_tidy <- sampleData %>%
  as_tibble() %>% 
  mutate(linenumber = row_number(), .before= 1) %>% 
  rename(text0 = X_data)
  

Now, let’s remove digits and punctuation and perform some tokenization:

# tokenization
library(tidytext)
sampleData_tidy <- sampleData_tidy %>%
  mutate(text0 = str_remove_all(text0, "[:digit:]")) %>% 
  mutate(text0 = str_remove_all(text0, "[:punct:]")) %>% 
  unnest_tokens(word, text0) %>% 
  mutate(word = tolower(word))

As required, I also remove profanities and stop words along the way:

download.file(url = "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", 
               destfile = "D:/bad-words.txt", quiet = TRUE, method =  "auto")

bad_words <- read_delim("D:/bad-words.txt", 
    delim = "\t", escape_double = FALSE, 
    col_names = FALSE, trim_ws = TRUE)
## Rows: 1383 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): X1
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bad_words <- setNames(bad_words, c("word"))  
data(stop_words)

sampleData_tidy <- sampleData_tidy %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(bad_words, by = "word")

Basic Summary of Data

In this exercise, the analysis is conducted per source file; keeping the file label alongside each token is an important advantage of the tidytext framework.

sampleData_tidy %>% 
  group_by(file) %>% 
  count(word, sort = TRUE) %>% 
  slice_max(n, n = 10)
## # A tibble: 30 × 3
## # Groups:   file [3]
##    file  word       n
##    <chr> <chr>  <int>
##  1 blogs time     960
##  2 blogs im       701
##  3 blogs people   621
##  4 blogs dont     618
##  5 blogs day      577
##  6 blogs love     479
##  7 blogs life     445
##  8 blogs book     331
##  9 blogs didnt    310
## 10 blogs ive      300
## # ℹ 20 more rows
sampleData_tidy %>% 
  group_by(file) %>% 
  count(word, sort = TRUE) %>% 
  slice_max(n, n = 10) %>% 
  ggplot() +
  geom_col(aes(x = word, y = n, fill = file)) +
  facet_wrap(~ file, scales = "free") +
  coord_flip()

Word cloud

sampleData_tidy %>% 
  count(word, sort = TRUE) %>% 
  slice_max(n, n = 100) %>% 
  with(., wordcloud(words = word, freq = n, colors = "blue"))

Bigrams

In this bigram exercise, I do not remove stopwords because they could be important for the prediction exercise.

tidy_bigrams <- sampleData %>% 
  rename(text = X_data) %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  mutate(bigram0=bigram) %>% 
  separate_wider_delim(cols = bigram0, delim = " ", names = c("word1", "word2")) %>% 
  anti_join(bad_words, by = c("word1" = "word")) %>% 
  anti_join(bad_words, by = c("word2" = "word")) %>%
  filter(!is.na(bigram)) %>% 
  select(-word1, -word2) 




tidy_bigrams %>% 
  count(bigram, sort = TRUE) %>% 
  top_n(n=15, n) %>% 
  ggplot(aes(x = fct_reorder(bigram, n), y = n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "Words", y = "Frequency")
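Since these bigram counts will eventually feed the prediction model, here is a quick illustration (not part of the analysis pipeline above, just a sketch of the idea) of how they could be turned into a naive next-word lookup table, keeping the most frequent second word for each first word:

# Illustration only: a naive next-word table built from the bigram counts
bigram_lookup <- tidy_bigrams %>% 
  count(bigram, sort = TRUE) %>% 
  separate_wider_delim(bigram, delim = " ", names = c("word1", "word2")) %>% 
  group_by(word1) %>% 
  slice_max(n, n = 1, with_ties = FALSE) %>% 
  ungroup()

# e.g. the most frequent word observed after "of" in the sample
bigram_lookup %>% filter(word1 == "of")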

Trigrams

In this trigram exercise, I do not remove stopwords because they could be important for the prediction exercise.

tidy_trigrams <- sampleData %>% 
  rename(text = X_data) %>% 
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
  mutate(trigram0=trigram) %>% 
  separate_wider_delim(cols = trigram0, delim = " ", names = c("word1", "word2", "word3")) %>% 
  anti_join(bad_words, by = c("word1" = "word")) %>% 
  anti_join(bad_words, by = c("word2" = "word")) %>%
  anti_join(bad_words, by = c("word3" = "word")) %>%
  filter(!is.na(trigram)) %>% 
  select(-starts_with("word")) 




tidy_trigrams %>% 
  count(trigram, sort = TRUE) %>% 
  top_n(n=15, n) %>% 
  ggplot(aes(x = fct_reorder(trigram, n), y = n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "Words", y = "Frequency")

Using the tm and tokenizers libraries

The same exercise, but now with the tm and tokenizers packages.

Clean the data using tm_map:

I keep stop words in this exercise, as they are useful for predicting the next word.

library(tm)
corpus <- sampleData %>% 
  setNames(c("text", "doc_id")) %>% 
  DataframeSource() %>% 
  VCorpus()



corpus <- tm_map(corpus, content_transformer(tolower)) # convert to lowercase
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove multiple whitespace
corpus <- tm_map(corpus, content_transformer(function(x, pattern) str_remove_all(x, "/|@|\\|")))
#corpus <- tm_map(corpus, content_transformer(function(x) str_remove_all(x, "[^[:graph:]]"))) #Cleaning all non ASCII characters
corpus <- tm_map(corpus, removeWords, unlist(bad_words)) #Removing Profanities
#corpus <- tm_map(corpus,removeWords,c(stopwords("english"),letters))# Deleting all English stopwords and any stray letters left
corpus <- tm_map(corpus, removeWords, letters) # remove any stray single letters
corpus <- tm_map(corpus, stripWhitespace) # remove multiple whitespace
corpus[[1]]$content
## [1] " holding onto the hope of owning my own biz that makes difference for folks time to get serious hugs"
content(corpus[[1]])
## [1] " holding onto the hope of owning my own biz that makes difference for folks time to get serious hugs"
# convert corpus to a dataframe
corpus_df <- data.frame(text = unlist(sapply(corpus, '[', "content")), stringsAsFactors = FALSE)

head(corpus_df)
##                                                                                                   text
## 1  holding onto the hope of owning my own biz that makes difference for folks time to get serious hugs
## 2                                                                          tweet tweet tweet voicegram
## 3                                                                                              oh girl
## 4                                                               you are amazing way to make me cry you
## 5                                                          its sarah michelle gellar bring back ringer
## 6    butwhy bring it up now will you bring it up if no bg cant believe wrote that makes the final four
tdm <- TermDocumentMatrix(corpus) |> removeSparseTerms(0.99)
freq <- as.matrix(tdm) %>% rowSums()
wordFreq <- data.frame(word = names(freq), freq = freq) %>% arrange(desc(freq))
wordFreq |> head()
##      word  freq
## the   the 43904
## and   and 22393
## that that  9515
## for   for  9068
## you   you  6498
## with with  6325
wordFreq %>% 
  {wordcloud(words = .$word,
              freq = .$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35,
              colors=brewer.pal(9, "Set2"))}
## Warning in brewer.pal(9, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

wordFreq[1:10, ] %>% 
  ggplot(aes(x = fct_reorder(word, freq), y = freq)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "Words", y = "Frequency")

library(tokenizers)
## Warning: package 'tokenizers' was built under R version 4.3.3
tokenize_words(corpus_df |> unlist(),
               lowercase = TRUE,
               #stopwords = ("to"),
               strip_punct = TRUE, 
               strip_numeric = TRUE,
               simplify = TRUE)  %>%  
  unlist() %>% 
  as_tibble() %>% 
  count(value, sort = TRUE) %>% 
  head()
## # A tibble: 6 × 2
##   value     n
##   <chr> <int>
## 1 the   44055
## 2 to    24059
## 3 and   22454
## 4 of    18765
## 5 in    14848
## 6 that   9736
tokenize_words(corpus_df |> unlist(),
               lowercase = TRUE,
               # stopwords = ("to"),
               strip_punct = TRUE, 
               strip_numeric = TRUE,
               simplify = TRUE)  %>%  
  unlist() %>% 
  as_tibble() %>% 
  count(value, sort = TRUE) %>% 
  na.omit() %>% 
  slice_max(n, n= 10) %>% 
  ggplot(aes(x = fct_reorder(value, n), y = n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "Words", y = "Frequency")

tokenize_ngrams(corpus_df$text ,
                lowercase = TRUE,
                n = 2L,
                n_min = 2,
                # stopwords = character(),
                ngram_delim = " ",
                simplify = FALSE) %>% 
  unlist() %>% 
  as_tibble() %>% 
  count(value, sort = TRUE) %>% 
  head()
## # A tibble: 6 × 2
##   value       n
##   <chr>   <int>
## 1 of the   4054
## 2 in the   3840
## 3 to the   1997
## 4 on the   1725
## 5 for the  1663
## 6 to be    1472
tokenize_ngrams(corpus_df |> unlist(),
                lowercase = TRUE,
                n = 2L,
                n_min = 2,
                # stopwords = character(),
                ngram_delim = " ",
                simplify = FALSE) %>% 
  unlist() %>% 
  as_tibble() %>% 
  count(value, sort = TRUE) %>% 
  na.omit() %>% 
  slice_max(n, n= 10) %>% 
  ggplot(aes(x = fct_reorder(value, n), y = n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "Words", y = "Frequency")

tokenize_ngrams(corpus_df$text,
                lowercase = TRUE,
                n = 3L,
                n_min = 3,
                # stopwords = character(),
                ngram_delim = " ",
                simplify = FALSE) %>% 
  unlist() %>% 
  as_tibble() %>% 
  count(value, sort = TRUE) %>% 
  head()
## # A tibble: 6 × 2
##   value           n
##   <chr>       <int>
## 1 <NA>         1320
## 2 one of the    305
## 3 going to be   158
## 4 the end of    140
## 5 out of the    125
## 6 some of the   123
tokenize_ngrams(corpus_df |> unlist(),
                lowercase = TRUE,
                n = 3L,
                n_min = 3,
                # stopwords = character(),
                ngram_delim = " ",
                simplify = FALSE) %>% 
  unlist() %>% 
  as_tibble() %>% 
  count(value, sort = TRUE) %>% 
  na.omit() %>% 
  top_n(n= 10, n) %>% 
  ggplot(aes(x = fct_reorder(value, n), y = n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "Words", y = "Frequency")

Quanteda

The same exercise, but now with the quanteda library. This time, I remove stop words to see the differences.

library(quanteda)

q_corpus <- corpus(sampleData$X_data)

text_tokens <- tokens(q_corpus,
                      what="word1",
                      remove_numbers = TRUE,
                      remove_punct = TRUE,
                      remove_url =TRUE,
                      remove_separators = TRUE,
                      remove_symbols = TRUE,
                      verbose = quanteda_options("verbose"))

text_tokens <-  tokens_remove(text_tokens, pattern = stopwords("en"))
text_tokens <-  tokens_remove(text_tokens, pattern = unlist(bad_words))

Unigrams

unigram <- tokens_ngrams(text_tokens, n = 1, concatenator = " ")

# create the document-feature matrix
text_unigram <- dfm(unigram, 
                    tolower = TRUE,
                    remove_padding = TRUE,
                    verbose = FALSE)
text_unigram <- text_unigram %>% dfm_remove(unlist(bad_words))


topfeatures(text_unigram, 100) %>% 
  data.frame(names = names(.), freq = .) %>% 
  head()
##      names freq
## said  said 2838
## one    one 2563
## just  just 2226
## like  like 2183
## can    can 2064
## time  time 1803
topfeatures(text_unigram, 100) %>% 
  data.frame(names = names(.), freq = .) %>% 
  {wordcloud(words = .$names,
              freq = .$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35,
              colors=brewer.pal(9, "Set2"))}
## Warning in brewer.pal(9, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

topfeatures(text_unigram, 10) %>% 
  data.frame(names = names(.), freq = .) %>% 
  ggplot(aes(x = fct_reorder(names, freq), y = freq)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "Words", y = "Frequency")

Bigrams

bigram <- tokens_ngrams(text_tokens, n = 2, concatenator = " ")

# create the document-feature matrix
text_bigram <- dfm(bigram, 
                    tolower = TRUE,
                    remove_padding = TRUE,
                    verbose = FALSE)
text_bigram <- text_bigram %>% dfm_remove(unlist(bad_words))


topfeatures(text_bigram, 100) %>% 
  data.frame(names = names(.), freq = .) %>% 
  head()
##                   names freq
## new york       new york  171
## last year     last year  169
## right now     right now  152
## last week     last week  136
## years ago     years ago  133
## high school high school  121
topfeatures(text_bigram, 100) %>% 
  data.frame(names = names(.), freq = .) %>% 
  {wordcloud(words = .$names,
              freq = .$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35,
              colors=brewer.pal(9, "Set2"))}

topfeatures(text_bigram, 10) %>% 
  data.frame(names = names(.), freq = .) %>% 
  ggplot(aes(x = fct_reorder(names, freq), y = freq)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "Words", y = "Frequency")

Trigrams

trigram <- tokens_ngrams(text_tokens, n = 3, concatenator = " ")

# create the document-feature matrix
text_trigram <- dfm(trigram, 
                    tolower = TRUE,
                    remove_padding = TRUE,
                    verbose = FALSE)
text_trigram <- text_trigram %>% dfm_remove(unlist(bad_words))


topfeatures(text_trigram, 100) %>% 
  data.frame(names = names(.), freq = .) %>% 
  head()
##                                         names freq
## swag swag swag                 swag swag swag   23
## new york city                   new york city   19
## president barack obama president barack obama   16
## nec mobilepro pda           nec mobilepro pda   14
## mobile south carolina   mobile south carolina   13
## happy mothers day           happy mothers day   12
topfeatures(text_trigram, 20) %>% 
  data.frame(names = names(.), freq = .) %>% 
  {wordcloud(words = .$names,
              freq = .$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35,
              colors=brewer.pal(9, "Set2"))}

topfeatures(text_trigram, 10) %>% 
  data.frame(names = names(.), freq = .) %>% 
  ggplot(aes(x = fct_reorder(names, freq), y = freq)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_light() +
  labs(x = "Words", y = "Frequency")

Next steps: Predictive Model and Shiny App

Following the exploratory data analysis, the next steps involve finalizing the predictive algorithm, deploying a Shiny app, and creating a presentation deck.

The predictive algorithm will employ an n-gram backoff model. The model first looks for the longest n-gram (a trigram, then a bigram) whose beginning matches the end of the input text. If no suitable match is found, it progressively backs off to shorter n-grams, ultimately relying on unigram probabilities. The model will be trained on a larger sample than the one used in this exploratory analysis, to improve prediction accuracy.
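As a rough sketch of the backoff logic (a simplified illustration, not the final implementation), the function below assumes pre-computed frequency tables, here called trigram_freq, bigram_freq and unigram_freq; these names, and their columns prefix (the preceding words), next_word and n, are placeholders (unigram_freq only needs next_word and n):

# Simplified backoff sketch. trigram_freq, bigram_freq and unigram_freq are
# assumed (placeholder) lookup tables with columns prefix, next_word and n.
predict_next_word <- function(phrase, trigram_freq, bigram_freq, unigram_freq, k = 3) {
  words <- str_split(str_to_lower(phrase), "\\s+")[[1]]
  last2 <- paste(tail(words, 2), collapse = " ")
  last1 <- tail(words, 1)
  
  hits <- filter(trigram_freq, prefix == last2)                       # longest match first
  if (nrow(hits) == 0) hits <- filter(bigram_freq, prefix == last1)   # back off to bigrams
  if (nrow(hits) == 0) hits <- unigram_freq                           # back off to unigrams
  hits %>% slice_max(n, n = k, with_ties = FALSE) %>% pull(next_word)
}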

The Shiny app will provide an interactive interface. Users can input text, and the app will instantly suggest the most likely next word based on the model’s predictions.
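As a very rough mock-up of that interface (assuming the predict_next_word() helper and the placeholder frequency tables sketched above; the layout and names are illustrative, not the final app):

library(shiny)

# Minimal mock-up: a text box for the phrase and the top predictions shown below it.
# predict_next_word() and the *_freq tables are the placeholders sketched above.
ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  textOutput("prediction")
)

server <- function(input, output, session) {
  output$prediction <- renderText({
    req(nzchar(input$phrase))
    paste(predict_next_word(input$phrase, trigram_freq, bigram_freq, unigram_freq),
          collapse = " | ")
  })
}

shinyApp(ui, server)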