This report explores a corpus downloaded for the Coursera Data Science Capstone by Johns Hopkins University. We examine how well this dataset can support our goal of developing a smart keyboard word prediction application.
Let’s begin by setting up our R environment.
library(data.table)
library(dplyr)
library(ggplot2)
library(quanteda)
options(scipen = 999)
blogs <- readLines('./Coursera-SwiftKey/final/en_US/en_US.blogs.txt')
news <- readLines('./Coursera-SwiftKey/final/en_US/en_US.news.txt')
twitter <- readLines('./Coursera-SwiftKey/final/en_US/en_US.twitter.txt')
str(blogs)
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>." ...
str(news)
## chr [1:77259] "He wasn't home alone, apparently." ...
str(twitter)
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
There are 899,288 documents in blogs, 77,259 documents in news and 2,360,148 documents in twitter.
Due to limited computational resources, we will explore only 5% of the available data, selected by simple random sampling.
# randomly retain a fixed percentage of documents from a character vector
rand_samp <- function(table, pct_retained, seed){
  k <- length(table)
  smp_size <- floor(pct_retained * k)
  set.seed(seed)
  keep_ind <- sample(seq_len(k), size = smp_size)
  table <- table[keep_ind]
  return(table)
}
# retain 5% for exploration
blogs <- rand_samp(blogs, .05, 123)
news <- rand_samp(news, .05, 456)
twitter <- rand_samp(twitter, .05, 789)
After sample selection, there are 44,964 documents in blogs, 3,862 in news and 118,007 in twitter for exploration.
We perform a simple data cleansing step to collapse repeated whitespace and strip leading/trailing whitespace.
clean_text <- function(string){
  string <- gsub('\\s+', ' ', string) # collapse runs of whitespace into a single space
  string <- trimws(string, 'both') # remove leading/trailing whitespace
  return(string)
}
blogs <- clean_text(blogs)
news <- clean_text(news)
twitter <- clean_text(twitter)
# count characters
nchar_blogs <- data.table(n_char = nchar(blogs))
nchar_news <- data.table(n_char = nchar(news))
nchar_twitter <- data.table(n_char = nchar(twitter))
nchar_blogs$dataset <- 'Blogs'
nchar_news$dataset <- 'News'
nchar_twitter$dataset <- 'Twitter'
# consolidate summary
nchar_all <- rbindlist(list(nchar_blogs, nchar_news, nchar_twitter),
idcol = F, use.names = T, fill = F)
# ggplot
ggplot(data = nchar_all, aes(n_char)) +
geom_histogram(bins = 20) +
facet_grid(. ~ dataset, scales = 'free') +
labs(title = 'Number of Characters per Document')
Next, let’s explore the number of words per document.
# count tokens
ntoken_blogs <- data.table(n_words = ntoken(blogs, remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T))
ntoken_news <- data.table(n_words = ntoken(news, remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T))
ntoken_twitter <- data.table(n_words = ntoken(twitter, remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T))
ntoken_blogs$dataset <- 'Blogs'
ntoken_news$dataset <- 'News'
ntoken_twitter$dataset <- 'Twitter'
# consolidate summary
ntoken_all <- rbindlist(list(ntoken_blogs, ntoken_news, ntoken_twitter),
idcol = F, use.names = T, fill = F)
# ggplot
ggplot(data = ntoken_all, aes(n_words)) +
geom_histogram(bins = 20) +
facet_grid(. ~ dataset, scales = 'free') +
labs(title = 'Number of Words per Document',
subtitle = 'Excluded numbers, punctuations, symbols and urls')
We’ll use quanteda’s built-in stopword list to remove common English function words from our documents. Let’s print out the list of stopwords for easy reference.
text_to_remove <- stopwords('english') %>% unique() %>% sort()
text_to_remove
## [1] "a" "about" "above" "after" "again"
## [6] "against" "all" "am" "an" "and"
## [11] "any" "are" "aren't" "as" "at"
## [16] "be" "because" "been" "before" "being"
## [21] "below" "between" "both" "but" "by"
## [26] "can't" "cannot" "could" "couldn't" "did"
## [31] "didn't" "do" "does" "doesn't" "doing"
## [36] "don't" "down" "during" "each" "few"
## [41] "for" "from" "further" "had" "hadn't"
## [46] "has" "hasn't" "have" "haven't" "having"
## [51] "he" "he'd" "he'll" "he's" "her"
## [56] "here" "here's" "hers" "herself" "him"
## [61] "himself" "his" "how" "how's" "i"
## [66] "i'd" "i'll" "i'm" "i've" "if"
## [71] "in" "into" "is" "isn't" "it"
## [76] "it's" "its" "itself" "let's" "me"
## [81] "more" "most" "mustn't" "my" "myself"
## [86] "no" "nor" "not" "of" "off"
## [91] "on" "once" "only" "or" "other"
## [96] "ought" "our" "ours" "ourselves" "out"
## [101] "over" "own" "same" "shan't" "she"
## [106] "she'd" "she'll" "she's" "should" "shouldn't"
## [111] "so" "some" "such" "than" "that"
## [116] "that's" "the" "their" "theirs" "them"
## [121] "themselves" "then" "there" "there's" "these"
## [126] "they" "they'd" "they'll" "they're" "they've"
## [131] "this" "those" "through" "to" "too"
## [136] "under" "until" "up" "very" "was"
## [141] "wasn't" "we" "we'd" "we'll" "we're"
## [146] "we've" "were" "weren't" "what" "what's"
## [151] "when" "when's" "where" "where's" "which"
## [156] "while" "who" "who's" "whom" "why"
## [161] "why's" "will" "with" "won't" "would"
## [166] "wouldn't" "you" "you'd" "you'll" "you're"
## [171] "you've" "your" "yours" "yourself" "yourselves"
We will also exclude profanity during the model-building process; however, these words are retained for now in our EDA.
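As a rough sketch of how that later exclusion could work (the file profanity_list.txt and the object profanity are placeholders, not part of this project), quanteda’s tokens_remove() can drop a custom word list in the same way it drops stopwords below:
# hypothetical illustration only -- not run during this EDA
# profanity <- readLines('./profanity_list.txt')   # placeholder path to a profanity word list
# toks <- tokens(c(blogs, news, twitter), what = 'word')
# toks <- tokens_remove(toks, profanity)            # same mechanism as tokens_remove(text_to_remove)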
# unigram frequency
unigram <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
tokens_remove(text_to_remove) %>%
dfm(tolower = T, ngrams = 1) %>%
textstat_frequency() %>%
arrange(-frequency) %>%
slice(1:10) %>%
as.data.table()
# ggplot
ggplot(data = unigram, aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(title = 'Top 10 unigrams by Frequency',
subtitle = 'Excluded non-ASCII characters, numbers, punctuations, symbols, urls and stopwords',
x = 'Feature')
The top 3 words by frequency in our corpora are “just”, “like” and “one”.
# bigram frequency
bigram <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
tokens_remove(text_to_remove) %>%
dfm(tolower = T, ngrams = 2) %>%
textstat_frequency() %>%
arrange(-frequency) %>%
slice(1:10) %>%
mutate(feature = gsub('_', ' ', feature)) %>%
as.data.table()
# ggplot
ggplot(data = bigram, aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(title = 'Top 10 bigrams by Frequency',
subtitle = 'Excluded non-ASCII characters, numbers, punctuations, symbols, urls and stopwords',
x = 'Feature')
The top 3 bigrams by frequency in our corpora are “right now”, “last night” and “feel like”.
# trigram frequency
trigram <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
tokens_remove(text_to_remove) %>%
dfm(tolower = T, ngrams = 3) %>%
textstat_frequency() %>%
arrange(-frequency) %>%
slice(1:10) %>%
mutate(feature = gsub('_', ' ', feature)) %>%
as.data.table()
# ggplot
ggplot(data = trigram, aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(title = 'Top 10 trigrams by Frequency',
subtitle = 'Excluded non-ASCII characters, numbers, punctuations, symbols, urls and stopwords',
x = 'Feature')
The top 3 trigrams by frequency in our corpora are “let us know”, “happy new year” and “happy mother’s day”.
As a baseline, we will use an n-gram model for our smart keyboard word prediction application. This model predicts the next word based on frequently occurring n-grams that begin with the input word. Let’s go through some questions to consider before building the model.
The model contains a large list of n-grams and their frequency/probability of occurrence. Given the input word new, we query the model to produce a sample table like the one below.
## feature frequency probability
## 1: new york 40 0.40
## 2: new city 30 0.30
## 3: new times 15 0.15
## 4: new york city 10 0.10
## 5: new york times 5 0.05
The above represents a probability table for a single transition in a Markov chain model. Given the word new, the model would suggest york, producing new york.
Iterating this process, the list of choices would now be:
## feature frequency probability
## 1: new york city 10 0.67
## 2: new york times 5 0.33
Hence, given the input new york, the next suggested word would be city, completing new york city. This process repeats recursively up to n, where n is the longest chain of words in our model. A minimal sketch of this lookup is shown below.
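The sketch below assumes a hypothetical data.table ngram_dt shaped like the sample tables above; the table name and the helper predict_next() are illustrative only, not part of the course materials.
# toy lookup table in the shape of the sample output above (data.table is loaded earlier)
ngram_dt <- data.table(
  feature = c('new york', 'new city', 'new times', 'new york city', 'new york times'),
  frequency = c(40, 30, 15, 10, 5)
)
# suggest the most frequent continuation of an input prefix
predict_next <- function(dt, prefix){
  cand <- dt[startsWith(feature, paste0(prefix, ' '))]    # n-grams extending the prefix
  cand[, probability := frequency / sum(frequency)]       # renormalise over the candidates
  cand <- cand[order(-probability)]
  sub(paste0('^', prefix, ' '), '', cand$feature[1])      # return only the word that follows the prefix
}
predict_next(ngram_dt, 'new york')   # expected suggestion: "city"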
Let’s look at the dictionary coverage of the words in our data.
# word frequencies
dictionary_coverage <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
dfm(tolower = T, ngrams = 1) %>%
textstat_frequency() %>%
arrange(-frequency) %>%
mutate(coverage = cumsum(frequency) / sum(frequency)) %>%
select(feature, frequency, rank, coverage) %>%
as.data.table()
# 80/20 rule
pareto <- dictionary_coverage %>%
filter(abs(.8 - coverage) == min(abs(.8 - coverage))) %>%
select(rank, coverage) %>%
as.data.table()
# ggplot
ggplot(data = dictionary_coverage, aes(x = rank, y = coverage)) +
geom_line() +
labs(title = 'Dictionary Coverage of Words',
subtitle = 'Excludes non-ASCII characters, numbers, punctuations, symbols, urls and stopwords')
The top 2,047 most frequent terms (i.e. 1.8% of all terms) cover 80% of all word occurrences in our corpora. We can use this information to find a balance between our model’s vocabulary/complexity and its efficiency.
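As a rough sketch of how this cutoff could be turned into a trimmed vocabulary, reusing the dictionary_coverage table computed above (vocab_80 is an illustrative name):
# words needed to reach roughly 80% token coverage
vocab_80 <- dictionary_coverage %>%
  filter(coverage <= .80) %>%
  select(feature) %>%
  as.data.table()
nrow(vocab_80)   # approximately the 2,047 terms noted above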
# word frequencies
seq_coverage <- function(ngram){
x <- c(blogs, news, twitter) %>%
iconv('latin1', 'ASCII', sub = '') %>%
tokens(what = 'word', remove_numbers = T, remove_punct = T,
remove_symbols = T, remove_url = T) %>%
dfm(tolower = T, ngrams = ngram) %>%
textstat_frequency() %>%
filter(docfreq > 1) %>%
mutate(feature = gsub('_', ' ', feature), ngram = ngram) %>%
group_by(ngram) %>%
summarise(no_of_features = n()) %>%
ungroup() %>%
as.data.table()
return(x)
}
n2 <- seq_coverage(2)
n3 <- seq_coverage(3)
n4 <- seq_coverage(4)
n5 <- seq_coverage(5)
n6 <- seq_coverage(6)
n7 <- seq_coverage(7)
optimal_n <- rbindlist(list(n2, n3, n4, n5, n6, n7),
idcol = F, use.names = T, fill = F) %>%
mutate(coverage = cumsum(no_of_features) / sum(no_of_features)) %>%
as.data.table()
# ggplot
ggplot(data = optimal_n, aes(x = ngram, y = no_of_features)) +
geom_step() +
geom_point() +
labs(title = 'Number of Features by N-gram',
subtitle = 'Excludes non-ASCII characters, numbers, punctuations, symbols and urls')
optimal_n
## ngram no_of_features coverage
## 1: 2 251070 0.4169248
## 2: 3 227212 0.7942311
## 3: 4 88342 0.9409311
## 4: 5 24626 0.9818248
## 5: 6 7472 0.9942328
## 6: 7 3473 1.0000000
Note: Only features occurring in more than one document are retained in the exploration above.
By selecting n <= 4 in our n-gram model, we would cover approximately 94.09% of the word sequences (2- to 7-grams) counted above. We can use this information to choose a suitable cutoff for the n-grams in our model.
Let’s explore why “smoothing” is required for an n-gram model. To answer this question, we revisit the simplified model built earlier, with the input new york.
## feature frequency probability
## 1: new york city 10 0.67
## 2: new york times 5 0.33
In the scenario above, our data contains only two trigram choices for new york. However, there could be other word sequences, such as new york state or new york pizza, that are unseen in our model.
“Smoothing” addresses this by reserving some probability mass for n-grams that do not appear in our corpora, so that unseen sequences can still be predicted.
Common smoothing methods include Laplace (add-one) smoothing, Good-Turing discounting and Kneser-Ney smoothing. An example of Laplace smoothing, i.e. +1 smoothing, is illustrated below.
## Feature co-occurrence matrix of: 5 by 5 features.
## 5 x 5 sparse Matrix of class "fcm"
## features
## features new york city times state
## new . 2 . . .
## york . . 1 1 .
## city . . . . .
## times . . . . .
## state . . . . .
## 5 x 5 Matrix of class "dgeMatrix"
## features
## features new york city times state
## new 1 3 1 1 1
## york 1 1 2 2 1
## city 1 1 1 1 1
## times 1 1 1 1 1
## state 1 1 1 1 1
After “smoothing”, we are able to predict new york state because the york-state pair now has a non-zero count (see row 2, column 5). This concept can be refined further to better distribute prediction probability among the unseen n-grams.
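A minimal sketch of how matrices like the ones above could be produced with quanteda’s fcm() and add-one smoothing is shown below; the toy documents are assumptions chosen for illustration, and the exact corpus and arguments behind the printed output are not shown in this report.
# toy documents for illustration only
toy <- c('new york city', 'new york times', 'state')
co_occur <- fcm(tokens(toy), context = 'window', window = 1)   # adjacent-word co-occurrence counts
co_occur
co_occur + 1   # Laplace (+1) smoothing: every word pair now has a non-zero count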
We will validate the model on an unseen test set, using two metrics: prediction accuracy and perplexity.
Accuracy: A binary evaluation of whether the predicted word matches the actual next word.
Perplexity: Measures how confidently the model predicts the next word; lower values indicate a better fit.
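As a rough sketch of the perplexity calculation, assuming a hypothetical vector p of the probabilities our model assigned to each actual next word in the test set:
p <- c(.40, .10, .05, .30)          # illustrative probabilities for four test predictions
perplexity <- exp(-mean(log(p)))    # exp of the average negative log-probability
perplexity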
When an n-gram is unobserved, we back off to the (n-1)-gram recursively until a word prediction is available. In other words, if no matching 3-gram is found, a 2-gram is used to provide partial context. Below we demonstrate a simple application of the backoff algorithm to predict the full sentence “new york city is fabulous” (legend: input word, suggested word, strikethrough for a word unavailable to the algorithm); a minimal code sketch follows the walkthrough.
Step-by-step algorithm for the Backoff Model
Step 1: new york
Step 2: new york city
Step 3: new york city is (No 4-gram predicted on 3-gram input “new york city”)
Step 4: new york city is (Backed off from 4-gram model. No 3-gram predicted on 2-gram input “york city”)
Step 5: new york city is (Backed off from 3-gram model and a 2-gram model is found on input “city”)
Step 6: new york city is fabulous
Step 7: new york city is fabulous
And we’re done! Whenever an input is unobserved at the n-gram level, the model backs off to the (n-1)-gram model.
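A minimal sketch of this backoff logic, assuming a hypothetical lookup function next_word(prefix) that returns the most frequent continuation of a prefix from our n-gram tables, or NA if none exists (both next_word() and backoff_predict() are illustrative names, not part of the course materials):
# recursively shorten the prefix until a prediction is found
backoff_predict <- function(prefix_words, next_word){
  if (length(prefix_words) == 0) return(NA_character_)          # nothing left to back off to
  suggestion <- next_word(paste(prefix_words, collapse = ' '))  # query the n-gram table for this prefix
  if (!is.na(suggestion)) return(suggestion)
  backoff_predict(prefix_words[-1], next_word)                  # back off: drop the first word and retry
}
# usage: backoff_predict(c('new', 'york', 'city'), next_word)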
Thank you for reading this EDA. Our next steps concentrate on: