This report explores a corpus downloaded for the Coursera Data Science Capstone by Johns Hopkins University. We examine how well this dataset can support our goal of developing a smart keyboard word prediction application.
Let’s begin by setting up our R environment.
library(data.table)
library(dplyr)
library(ggplot2)
library(quanteda)
options(scipen = 999)
blogs <- readLines('./Coursera-SwiftKey/final/en_US/en_US.blogs.txt')
news <- readLines('./Coursera-SwiftKey/final/en_US/en_US.news.txt')
twitter <- readLines('./Coursera-SwiftKey/final/en_US/en_US.twitter.txt')
str(blogs)
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>." ...
str(news)
## chr [1:77259] "He wasn't home alone, apparently." ...
str(twitter)
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
There are 899,288 documents in blogs, 77,259 documents in news and 2,360,148 documents in twitter.
Due to limited computational resources, we will explore only 5% of the available data, selected by simple random sampling.
# randomly retain a fixed percentage of documents from a character vector
rand_samp <- function(table, pct_retained, seed){
  k <- length(table)
  smp_size <- floor(pct_retained * k)
  set.seed(seed)
  keep_ind <- sample(seq_len(k), size = smp_size)
  table <- table[keep_ind]
  return(table)
}
# retain 5% for exploration
blogs <- rand_samp(blogs, .05, 123)
news <- rand_samp(news, .05, 456)
twitter <- rand_samp(twitter, .05, 789)
After sample selection, there are 44,964 documents in blogs, 3,862 in news and 118,007 in twitter for exploration.
We perform a simple data cleansing step to collapse repeated whitespace and strip leading/trailing whitespace.
clean_text <- function(string){
  string <- gsub('\\s+', ' ', string) # collapse runs of whitespace into a single space
  string <- trimws(string, 'both') # remove leading/trailing whitespace
  return(string)
}
blogs <- clean_text(blogs)
news <- clean_text(news)
twitter <- clean_text(twitter)
# count characters
nchar_blogs <- data.table(n_char = nchar(blogs))
nchar_news <- data.table(n_char = nchar(news))
nchar_twitter <- data.table(n_char = nchar(twitter))
nchar_blogs$dataset <- 'Blogs'
nchar_news$dataset <- 'News'
nchar_twitter$dataset <- 'Twitter'
# consolidate summary
nchar_all <- rbindlist(list(nchar_blogs, nchar_news, nchar_twitter),
idcol = F, use.names = T, fill = F)
# ggplot
ggplot(data = nchar_all, aes(n_char)) +
geom_histogram(bins = 20) +
facet_grid(. ~ dataset, scales = 'free') +
labs(title = 'Number of Characters per Document')
Next, let’s explore the number of words per document.
# count tokens
ntoken_blogs <- data.table(n_words = ntoken(blogs, remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T))
ntoken_news <- data.table(n_words = ntoken(news, remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T))
ntoken_twitter <- data.table(n_words = ntoken(twitter, remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T))
ntoken_blogs$dataset <- 'Blogs'
ntoken_news$dataset <- 'News'
ntoken_twitter$dataset <- 'Twitter'
# consolidate summary
ntoken_all <- rbindlist(list(ntoken_blogs, ntoken_news, ntoken_twitter),
idcol = F, use.names = T, fill = F)
# ggplot
ggplot(data = ntoken_all, aes(n_words)) +
geom_histogram(bins = 20) +
facet_grid(. ~ dataset, scales = 'free') +
labs(title = 'Number of Words per Document',
subtitle = 'Excluded numbers, punctuations, symbols and urls')
We’ll use quanteda’s built-in stopword list to remove common English function words from our documents. Let’s print out the list of stopwords for easy reference.
text_to_remove <- stopwords('english') %>% unique() %>% sort()
text_to_remove
## [1] "a" "about" "above" "after" "again"
## [6] "against" "all" "am" "an" "and"
## [11] "any" "are" "aren't" "as" "at"
## [16] "be" "because" "been" "before" "being"
## [21] "below" "between" "both" "but" "by"
## [26] "can't" "cannot" "could" "couldn't" "did"
## [31] "didn't" "do" "does" "doesn't" "doing"
## [36] "don't" "down" "during" "each" "few"
## [41] "for" "from" "further" "had" "hadn't"
## [46] "has" "hasn't" "have" "haven't" "having"
## [51] "he" "he'd" "he'll" "he's" "her"
## [56] "here" "here's" "hers" "herself" "him"
## [61] "himself" "his" "how" "how's" "i"
## [66] "i'd" "i'll" "i'm" "i've" "if"
## [71] "in" "into" "is" "isn't" "it"
## [76] "it's" "its" "itself" "let's" "me"
## [81] "more" "most" "mustn't" "my" "myself"
## [86] "no" "nor" "not" "of" "off"
## [91] "on" "once" "only" "or" "other"
## [96] "ought" "our" "ours" "ourselves" "out"
## [101] "over" "own" "same" "shan't" "she"
## [106] "she'd" "she'll" "she's" "should" "shouldn't"
## [111] "so" "some" "such" "than" "that"
## [116] "that's" "the" "their" "theirs" "them"
## [121] "themselves" "then" "there" "there's" "these"
## [126] "they" "they'd" "they'll" "they're" "they've"
## [131] "this" "those" "through" "to" "too"
## [136] "under" "until" "up" "very" "was"
## [141] "wasn't" "we" "we'd" "we'll" "we're"
## [146] "we've" "were" "weren't" "what" "what's"
## [151] "when" "when's" "where" "where's" "which"
## [156] "while" "who" "who's" "whom" "why"
## [161] "why's" "will" "with" "won't" "would"
## [166] "wouldn't" "you" "you'd" "you'll" "you're"
## [171] "you've" "your" "yours" "yourself" "yourselves"
We will also exclude profanity during the model-building process; however, these words are retained for now in our EDA.
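As a rough sketch of how that later exclusion could work (the file profanity_list.txt and the object profanity are placeholders, not part of this project), quanteda’s tokens_remove() can drop a custom word list in the same way it drops stopwords below:
# hypothetical illustration only -- not run during this EDA
# profanity <- readLines('./profanity_list.txt')   # placeholder path to a profanity word list
# toks <- tokens(c(blogs, news, twitter), what = 'word')
# toks <- tokens_remove(toks, profanity)            # same mechanism as tokens_remove(text_to_remove)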
# unigram frequency
unigram <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
tokens_remove(text_to_remove) %>%
dfm(tolower = T, ngrams = 1) %>%
textstat_frequency() %>%
arrange(-frequency) %>%
slice(1:10) %>%
as.data.table()
# ggplot
ggplot(data = unigram, aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(title = 'Top 10 unigrams by Frequency',
subtitle = 'Excluded non-ASCII characters, numbers, punctuations, symbols, urls and stopwords',
x = 'Feature')
The top 3 words by frequency in our corpora are “just”, “like” and “one”.
# bigram frequency
bigram <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
tokens_remove(text_to_remove) %>%
dfm(tolower = T, ngrams = 2) %>%
textstat_frequency() %>%
arrange(-frequency) %>%
slice(1:10) %>%
mutate(feature = gsub('_', ' ', feature)) %>%
as.data.table()
# ggplot
ggplot(data = bigram, aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(title = 'Top 10 bigrams by Frequency',
subtitle = 'Excluded non-ASCII characters, numbers, punctuations, symbols, urls and stopwords',
x = 'Feature')
The top 3 bigrams by frequency in our corpora are “right now”, “last night” and “feel like”.
# trigram frequency
trigram <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
tokens_remove(text_to_remove) %>%
dfm(tolower = T, ngrams = 3) %>%
textstat_frequency() %>%
arrange(-frequency) %>%
slice(1:10) %>%
mutate(feature = gsub('_', ' ', feature)) %>%
as.data.table()
# ggplot
ggplot(data = trigram, aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(title = 'Top 10 trigrams by Frequency',
subtitle = 'Excluded non-ASCII characters, numbers, punctuations, symbols, urls and stopwords',
x = 'Feature')
The top 3 trigrams by frequency in our corpora are “let us know”, “happy new year” and “happy mother’s day”.
As a baseline, we will use an n-gram model for our smart keyboard word prediction application. This model predicts the next word based on frequently occurring n-grams that begin with the input word. Let’s go through some questions to consider before building the model.
The model contains a large list of n-grams and their frequency/probability of occurrence. Given the input word new, we query the model to produce a sample table like the one below.
## feature frequency probability
## 1: new york 40 0.40
## 2: new city 30 0.30
## 3: new times 15 0.15
## 4: new york city 10 0.10
## 5: new york times 5 0.05
The above represents a probability table for a single transition in a Markov chain model. Given the word new, the model would suggest york, producing new york.
Iterating this process, the list of choices would now be:
## feature frequency probability
## 1: new york city 10 0.67
## 2: new york times 5 0.33
Hence, given the input new york, the next suggested word would be city, completing new york city. This process repeats recursively up to n, where n is the longest chain of words in our model. A minimal sketch of this lookup is shown below.
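The sketch below assumes a hypothetical data.table ngram_dt shaped like the sample tables above; the table name and the helper predict_next() are illustrative only, not part of the course materials.
# toy lookup table in the shape of the sample output above (data.table is loaded earlier)
ngram_dt <- data.table(
  feature = c('new york', 'new city', 'new times', 'new york city', 'new york times'),
  frequency = c(40, 30, 15, 10, 5)
)
# suggest the most frequent continuation of an input prefix
predict_next <- function(dt, prefix){
  cand <- dt[startsWith(feature, paste0(prefix, ' '))]    # n-grams extending the prefix
  cand[, probability := frequency / sum(frequency)]       # renormalise over the candidates
  cand <- cand[order(-probability)]
  sub(paste0('^', prefix, ' '), '', cand$feature[1])      # return only the word that follows the prefix
}
predict_next(ngram_dt, 'new york')   # expected suggestion: "city"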
Let’s look at the dictionary coverage of the words in our data.
# word frequencies
dictionary_coverage <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
dfm(tolower = T, ngrams = 1) %>%
textstat_frequency() %>%
arrange(-frequency) %>%
mutate(coverage = cumsum(frequency) / sum(frequency)) %>%
select(feature, frequency, rank, coverage) %>%
as.data.table()
# 80/20 rule
pareto <- dictionary_coverage %>%
filter(abs(.8 - coverage) == min(abs(.8 - coverage))) %>%
select(rank, coverage) %>%
as.data.table()
# ggplot
ggplot(data = dictionary_coverage, aes(x = rank, y = coverage)) +
geom_line() +
labs(title = 'Dictionary Coverage of Words',
subtitle = 'Excludes non-ASCII characters, numbers, punctuations, symbols, urls and stopwords')
The top 2,047 most frequent terms (i.e. 1.8% of all terms) cover 80% of all word occurrences in our corpora. We can use this information to find a balance between our model’s vocabulary/complexity and its efficiency.
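As a rough sketch of how this cutoff could be turned into a trimmed vocabulary, reusing the dictionary_coverage table computed above (vocab_80 is an illustrative name):
# words needed to reach roughly 80% token coverage
vocab_80 <- dictionary_coverage %>%
  filter(coverage <= .80) %>%
  select(feature) %>%
  as.data.table()
nrow(vocab_80)   # approximately the 2,047 terms noted above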
# word frequencies
seq_coverage <- function(ngram){
x <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
dfm(tolower = T, ngrams = ngram) %>%
textstat_frequency() %>%
filter(docfreq > 1) %>%
mutate(feature = gsub('_', ' ', feature), ngram = ngram) %>%
group_by(ngram) %>%
summarise(no_of_features = n()) %>%
ungroup() %>%
as.data.table()
return(x)
}
n2 <- seq_coverage(2)
n3 <- seq_coverage(3)
n4 <- seq_coverage(4)
n5 <- seq_coverage(5)
n6 <- seq_coverage(6)
n7 <- seq_coverage(7)
optimal_n <- rbindlist(list(n2, n3, n4, n5, n6, n7),
idcol = F, use.names = T, fill = F) %>%
mutate(coverage = cumsum(no_of_features) / sum(no_of_features)) %>%
as.data.table()
# ggplot
ggplot(data = optimal_n, aes(x = ngram, y = no_of_features)) +
geom_step() +
geom_point() +
labs(title = 'Number of Features by N-gram',
subtitle = 'Excludes non-ASCII characters, numbers, punctuations, symbols and urls')
optimal_n
## ngram no_of_features coverage
## 1: 2 251070 0.4169248
## 2: 3 227212 0.7942311
## 3: 4 88342 0.9409311
## 4: 5 24626 0.9818248
## 5: 6 7472 0.9942328
## 6: 7 3473 1.0000000
Note: Only features occurring in more than one document are retained in the exploration above.
By selecting n <= 4 in our n-gram model, we would cover approximately 94.09% of the word sequences (2- to 7-grams) counted above. We can use this information to choose a suitable cutoff for the n-grams in our model.
Let’s explore why “smoothing” is required for an n-gram model. To answer this question, we revisit the simplified model built earlier, with the input new york.
## feature frequency probability
## 1: new york city 10 0.67
## 2: new york times 5 0.33
In the scenario above, our data contains only two trigram choices for new york. However, there could be other word sequences, such as new york state or new york pizza, that are unseen in our model.
“Smoothing” addresses this by reserving some probability mass for n-grams that do not appear in our corpora, so that unseen sequences can still be predicted.
Common smoothing methods include Laplace (add-one) smoothing, Good-Turing discounting and Kneser-Ney smoothing. An example of Laplace smoothing, i.e. +1 smoothing, is illustrated below.
## Feature co-occurrence matrix of: 5 by 5 features.
## 5 x 5 sparse Matrix of class "fcm"
## features
## features new york city times state
## new . 2 . . .
## york . . 1 1 .
## city . . . . .
## times . . . . .
## state . . . . .
## 5 x 5 Matrix of class "dgeMatrix"
## features
## features new york city times state
## new 1 3 1 1 1
## york 1 1 2 2 1
## city 1 1 1 1 1
## times 1 1 1 1 1
## state 1 1 1 1 1
After “smoothing”, we are able to predict new york state because the york-state pair now has a non-zero count (see row 2, column 5). This concept can be refined further to better distribute prediction probability among the unseen n-grams.
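A minimal sketch of how matrices like the ones above could be produced with quanteda’s fcm() and add-one smoothing is shown below; the toy documents are assumptions chosen for illustration, and the exact corpus and arguments behind the printed output are not shown in this report.
# toy documents for illustration only
toy <- c('new york city', 'new york times', 'state')
co_occur <- fcm(tokens(toy), context = 'window', window = 1)   # adjacent-word co-occurrence counts
co_occur
co_occur + 1   # Laplace (+1) smoothing: every word pair now has a non-zero count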
We will validate the model on an unseen test set, using two metrics: prediction accuracy and perplexity.
Accuracy: A binary evaluation of whether the predicted word matches the actual next word.
Perplexity: Measures how confidently the model predicts the next word; lower values indicate a better fit.
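As a rough sketch of the perplexity calculation, assuming a hypothetical vector p of the probabilities our model assigned to each actual next word in the test set:
p <- c(.40, .10, .05, .30)          # illustrative probabilities for four test predictions
perplexity <- exp(-mean(log(p)))    # exp of the average negative log-probability
perplexity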
When an n-gram is unobserved, we back off to the (n-1)-gram recursively until a word prediction is available. In other words, if no matching 3-gram is found, a 2-gram is used to provide partial context. Below we demonstrate a simple application of the backoff algorithm to predict the full sentence “new york city is fabulous” (legend: input word, suggested word, strikethrough for a word unavailable to the algorithm); a minimal code sketch follows the walkthrough.
Step-by-step algorithm for the Backoff Model
Step 1: new york
Step 2: new york city
Step 3: new york city is (No 4-gram predicted on 3-gram input “new york city”)
Step 4: new york city is (Backed off from 4-gram model. No 3-gram predicted on 2-gram input “york city”)
Step 5: new york city is (Backed off from 3-gram model and a 2-gram model is found on input “city”)
Step 6: new york city is fabulous
Step 7: new york city is fabulous
And we’re done! Whenever an input is unobserved at the n-gram level, the model backs off to the (n-1)-gram model.
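A minimal sketch of this backoff logic, assuming a hypothetical lookup function next_word(prefix) that returns the most frequent continuation of a prefix from our n-gram tables, or NA if none exists (both next_word() and backoff_predict() are illustrative names, not part of the course materials):
# recursively shorten the prefix until a prediction is found
backoff_predict <- function(prefix_words, next_word){
  if (length(prefix_words) == 0) return(NA_character_)          # nothing left to back off to
  suggestion <- next_word(paste(prefix_words, collapse = ' '))  # query the n-gram table for this prefix
  if (!is.na(suggestion)) return(suggestion)
  backoff_predict(prefix_words[-1], next_word)                  # back off: drop the first word and retry
}
# usage: backoff_predict(c('new', 'york', 'city'), next_word)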
Thank you for reading this EDA. Our next steps concentrate on: