About the project

This is a capstone project for the Data Science Specialization offered by Johns Hopkins University through Coursera. The capstone project is developed in partnership with SwiftKey, a leading company in the field of predictive text analytics.

SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices, with keyboards for both Android and iOS. One cornerstone of this smart keyboard is a set of predictive text models built with Natural Language Processing (NLP) techniques.

The goal of this Capstone project is to understand and build predictive text models like those used by SwiftKey. The product will be an application that takes in a word phrase and returns next-word choices.

Analysis of text data with natural language processing is a new application area introduced at the end of the Data Science Specialization. A rationale for this choice of capstone topic is the belief that a practicing data scientist will frequently be confronted with new data types and problems, and that a big part of the fun and challenge of being a data scientist is figuring out how to work with these new data types to build data products people love.

The results of the capstone will be reported as follows:

  1. An intermediate R Markdown report that describes, in plain language, plots, and code, the exploratory analysis of the course data set.
  2. Development of the predictive model and its application to real data to check how well it works.
  3. A Shiny app that takes a phrase (multiple words) as input and, once the user clicks submit, predicts the next word.
  4. A 5-slide deck created with R Presentations pitching the algorithm and app to a boss or investor.

The current report concerns the first phase: exploratory analysis.

The capstone dataset

The data come from a corpus called HC Corpora, which was collected from publicly available sources by a web crawler. The crawler checks for language, so that the texts consist mainly of the desired language. The data are collections of texts organized into four languages.

Download and load the data

The code below downloads the dataset.

if(!file.exists("data")){dir.create("data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fileDest <- "C:/Users/...../Coursera_SwiftKey.zip"

download.file(fileUrl, destfile = fileDest)
dateDownloaded <- date()
list.files()

Load the data

Large databases comprising text in a target language are commonly used when generating language models. In this exercise, the English database will be used; the three other databases (German, Russian, and Finnish) will not be considered in this project.

The code below loads the main packages and reads in only the English text files.

## load the main packages used in this report
library(readtext)   # for reading the raw text files
library(quanteda)   # for corpus construction, tokenization and n-grams

## After setting the working directory, get the data directory
data_dir <- getwd()

## read in the text data: all English files
en_texts <- readtext(paste0(data_dir, "/final/en_US/*"))

Explore the English corpus

Using quanteda, a corpus object is built from the English texts and explored. The code below accomplishes this task.

# create a corpus 
en_corpus <- corpus(en_texts)

# summarize the texts from the corpus 
summ <- as.data.frame(summary(en_corpus))   
summ$Size <- c(format(object.size(en_texts[1,2]), "GB", digits = 3), 
             format(object.size(en_texts[2,2]), "GB", digits = 3),
             format(object.size(en_texts[3,2]), "GB", digits = 3)) 

# extract the numerical values from character string of sizes

x1 <- summ$Size[1]
x1 <- as.numeric(stringr::word(x1, 1, 1))

x2 <- summ$Size[2]
x2 <- as.numeric(stringr::word(x2, 1, 1))

x3 <- summ$Size[3]
x3 <- as.numeric(stringr::word(x3, 1, 1))

# organize in a table

summ <- rbind(summ,
      data.frame(Text = "Corpus",
                 Types = sum(summ$Types),
                 Tokens = sum(summ$Tokens),
                 Sentences = sum(summ$Sentences),
                 Size = paste(sum(x1,x2,x3), "Gb", sep = " "))) 

summ$TTR <- round(c(summ$Types/summ$Tokens), 3)

# output summary table

knitr::kable(summ, caption = "Summary description of the English corpus")
Summary description of the English corpus

Text                Types     Tokens     Sentences   Size       TTR
en_US.blogs.txt     530014    44346797   2020072     0.195 Gb   0.012
en_US.news.txt      118765    3113070    140515      0.015 Gb   0.038
en_US.twitter.txt   581869    36898241   2583841     0.153 Gb   0.016
Corpus              1230648   84358108   4744428     0.363 Gb   0.015

The English corpus includes three text document types: Blogs, News, and Twitter. The number of words (tokens) and word types are shown for each document. The type-token ratio (TTR) is the number of different words (types) in a document divided by its total number of words (tokens), an index of lexical diversity or degree of word variation.
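
As a quick check of this definition (not part of the pipeline), the TTR of the Blogs document can be reproduced from the table values above:

# TTR = Types / Tokens, here for the Blogs document
530014 / 44346797   # approximately 0.012, matching the TTR column above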

Compared to the other two documents, the News document appears to show about twice as much word variation, with a TTR of 0.038, more than double that of Blogs and Twitter. However, TTR is known to be sensitive to text length and may not be a reliable comparative index (https://en.wikipedia.org/wiki/Lexical_diversity); note that the News document is also far shorter than the other two.

Together, the three documents consist of over 84 million words and occupy about 0.36 Gb of memory.

Random sample for analysis

Given the fairly large size of the English corpus, the predictive model will be built on a random sample of the text data. A well-chosen sample should give a good approximation to the results that would be obtained using all the data, while keeping memory use manageable.

To get a representative sample of the population of texts specific to Blogs, News, and Twitter, random sampling is performed within each document type. Furthermore, because the next word is predicted within the context of a sentence, the sentence is treated as the unit of sampling.

The next task is to write a function to randomly sample sentences.

randSampleFN <- function(corp, p = 0.1) {
        corp_sent <- corpus_reshape(corp, to = 'sentences')
        n1 <- ndoc(corp_sent)       # number of sentence-level documents
        n <- round(n1 * p)          # size of the p*100% random sample
        corp_samp <- corpus_sample(corp_sent, size = n, replace = FALSE)
        return(corp_samp)
}

Next, apply the function to randomly select 10% of the sentences from each document; also display some sample texts.

x <- corpus(en_texts[1,2])  
set.seed(4321)
blogs_corp <- randSampleFN(x, p=0.1) 

x <- corpus(en_texts[2,2])
set.seed(4321)
news_corp <- randSampleFN(x, p=0.1) 

x <- corpus(en_texts[3,2])
set.seed(4321)
tweet_corp <- randSampleFN(x, p=0.1) 
rm(x)

# print some example texts
blogs_corp$documents[1:5,1]
## [1] "(For Oswald Plummer it is Delia's bottom that is of especial interest â\200\" as evidenced by the fact that his hand is at present handling it â\200\" but we will come to that.)"
## [2] "Keep some space for try another food in another shop as well!!!"                                                                                                            
## [3] "â\200\231 nonsense then maybe thereâ\200\231d be an interesting show here."                                                                                                             
## [4] "Practice at least one."                                                                                                                                                     
## [5] "It's such an outrageous story that one is reminded of Josef Goebbels' famous dictum that \"the bigger the lie, the more people will believe it.\""
news_corp$documents[1:5,1]
## [1] "But only in recent years, experts say, have they begun routinely disenrolling Indians deemed inauthentic members of a group."                                                                  
## [2] "My daughter is a modern dancer, working in New York."                                                                                                                                          
## [3] "During our conversation, I learned he was the first in his family to go to college, and that his undergraduate studies were made possible in part by a Pell Grant from the federal government."
## [4] "I really enjoyed this evening.\""                                                                                                                                                              
## [5] "Adams was taken Saturday afternoon in critical condition to St. Louis University Hospital, after he shot himself in the head."
tweet_corp$documents[1:5,1]
## [1] "you're back only twitches if you're sexy DUH mine does it all the time ;) Maybe it'sme: can stay off forever!"
## [2] "Happy Birthday to !"                                                                                          
## [3] "Definitely something to be afraid of. should bring back Assy McGee."                                          
## [4] "Foundation honors muckraking journalists."                                                                    
## [5] "Starts 5:30 #Fulfillment takes resolve."

The code below provides a summary of the final analytic sample.

summ2 <- data.frame(Corpus = c("blogs", "news", "twitter"),
           Size = c(format(object.size(blogs_corp), "GB", digits = 3),
             format(object.size(news_corp), "GB", digits = 3),
             format(object.size(tweet_corp), "GB", digits = 3)),
           Sentences = c(ndoc(blogs_corp), ndoc(news_corp),
                         ndoc(tweet_corp)))


knitr::kable(summ2, caption = "Summary description of the analytic sample")
Summary description of the analytic sample

Corpus    Size       Sentences
blogs     0.05 Gb    202007
news      0.004 Gb   14051
twitter   0.053 Gb   258384

Training and test data

The analytic text data are now divided into training and testing sets in a ratio of 80% to 20%. The training set will be used to estimate the model's statistical parameters, which will then be used to compute probabilities on the test set. The split is done separately by document to minimize potential bias of one set against the other with respect to the data source domain. Because the sentences were already drawn by random sampling, a sequential 80/20 split within each document is effectively random.

# blogs
n <- length(blogs_corp$documents[,1])
n.train <- round(0.8*n) 

train.blogs <- blogs_corp$documents[1:n.train, 1]
test.blogs <- blogs_corp$documents[(n.train + 1):n, 1]

# news
n <- length(news_corp$documents[,1])
n.train <- round(0.8*n)

train.news <- news_corp$documents[1:n.train, 1]
test.news <- news_corp$documents[(n.train + 1):n, 1]

# twitter
n <- length(tweet_corp$documents[,1])
n.train <- round(0.8*n)

train.tweet <- tweet_corp$documents[1:n.train, 1]
test.tweet <- tweet_corp$documents[(n.train + 1):n, 1]

Combine documents for later modeling

The Blogs, News, and Twitter documents come from different domains, as reflected, for example, by their different TTRs. Nevertheless, to further reduce the overall data size, they will be combined into one dataset for modeling purposes.

The following code combines the documents, currently represented as vectors of sentences, separately for the training and testing data.

### Combine sentence vectors for training, testing samples

train <- c(train.blogs, train.news, train.tweet)
length(train)
## [1] 379554
train[1:10]
##  [1] "(For Oswald Plummer it is Delia's bottom that is of especial interest â\200\" as evidenced by the fact that his hand is at present handling it â\200\" but we will come to that.)"
##  [2] "Keep some space for try another food in another shop as well!!!"                                                                                                            
##  [3] "â\200\231 nonsense then maybe thereâ\200\231d be an interesting show here."                                                                                                             
##  [4] "Practice at least one."                                                                                                                                                     
##  [5] "It's such an outrageous story that one is reminded of Josef Goebbels' famous dictum that \"the bigger the lie, the more people will believe it.\""                          
##  [6] "His wood working studio was adjacent and he had to walk through the pea gravel to come and go."                                                                             
##  [7] "Also it can be helpful to play the melody on a piano and sing with it to get the notes right in the head."                                                                  
##  [8] "In Genesis 15: God told Abraham that he would have a son."                                                                                                                  
##  [9] "I am sure I can't be the only paranoid wife of a deployed service member."                                                                                                  
## [10] "Mim is a top position now Sarika."
rm("train.blogs", "train.news", "train.tweet")

test <- c(test.blogs, test.news, test.tweet)
length(test)
## [1] 94888
rm("test.blogs", "test.news", "test.tweet")

Preprocessing the text content

The data cleaning approach chosen for the free text content is influenced by the nature of the NLP task at hand. The primary interest is in sequential words within a sentence. Therefore the sentence structure, a string of consecutive words, will be preserved through preprocessing.

To improve computational efficiency, specified text characters are removed so that only the 26 letters of the English alphabet are retained.

It is common practice in text analytics to exclude the standard stopwords (e.g. the, and, for) to reduce the size of the working data object, since removing these functional, frequent words rarely changes the meaning of a sentence. However, precisely because they are so commonly used, it is decided not to exclude them from the next-word prediction dictionary: end users should benefit most from being helped to type the words they use most often.
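
For reference, the sketch below lists a few of these standard English stopwords; it assumes quanteda's stopwords() function (re-exported from the stopwords package), which is not otherwise used in this report.

# a few of the standard English stopwords that are deliberately kept here
head(stopwords("english"), 10)
# typically begins with "i", "me", "my", "myself", "we", ...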

A function is written to clean the sentences; it is modified from http://amunategui.github.io/speak-like-a-doctor/. Specific details of the preprocessing are given in the comments alongside the code.

Clean_SentencesFn <- function(text_blob) {
        # swap all sentence ends with code 'ootoo'
        text_blob <- gsub(pattern=';|\\.|!|\\?', x=text_blob, replacement='ootoo')
        
        # convert encoding from Latin-1 to ASCII 
        text_blob <- iconv(text_blob, "latin1", "ASCII", sub="")
        
        # avoid converting e.g. "U.S." to "us"
        text_blob <- gsub(pattern="U.S.A.| U.S.A | U.S | U.S.| U S | U.S | u s | u.s | u.s.a |united states |United States", x=text_blob, replacement='USA')
        
        # remove all non-alpha text (numbers etc)
        text_blob <- gsub(pattern="[^[:alpha:]]", x=text_blob, replacement = ' ')
        
        # force all characters to lower case
        text_blob <- tolower(text_blob)

        # Restore instances of e.g. "can't" when not appropriately converted   
        text_blob <- gsub("\u0092", "'", text_blob)
        text_blob <- gsub("\u0093|\u0094", "", text_blob)
        
        # remove any single-character words
        text_blob <- gsub(pattern="\\W*\\b\\w{1}\\b", x=text_blob, replacement=' ')
        
        # remove contiguous spaces
        text_blob <- gsub(pattern="\\s+", x=text_blob, replacement=' ')
        
        # split sentences by split code
        sentence_vector <- unlist(strsplit(x=text_blob, split='ootoo',fixed = TRUE))
        return (sentence_vector)
}

Both the training and testing data are cleaned by applying Clean_SentencesFn. Six sample sentences are printed to check that the cleaning worked.

# Apply function

train.clean <- Clean_SentencesFn(paste(train, collapse = " "))
test.clean <- Clean_SentencesFn(paste(test, collapse = " "))

train.clean[1:6]
## [1] " for oswald plummer it is delia bottom that is of especial interest as evidenced by the fact that his hand is at present handling it but we will come to that"
## [2] " keep some space for try another food in another shop as well"                                                                                                
## [3] ""                                                                                                                                                             
## [4] ""                                                                                                                                                             
## [5] " nonsense then maybe thered be an interesting show here"                                                                                                      
## [6] " practice at least one"

The results show a few empty elements in the text vectors. The following code removes them and confirms that none remain:

# delete empty vector elements: ''

train.clean <- train.clean[train.clean != ""]
test.clean <- test.clean[test.clean != ""]

table(train.clean == "")
## 
##  FALSE 
## 463200
table(test.clean == "")
## 
##  FALSE 
## 115590

Remove non-English words

Next, the train.clean dataset should contain only English words. This requires an English vocabulary list against which to check each word in the training data.

A free online English dictionary has been found useful for building a basic algorithm. It is available for download at https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt .

# download dictionary

fileUrl <- "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
fileDest <- "C:/Users/Peter/DataScience/MachineLearningCourse/TM_CapstoneProj/words_alpha.txt"

download.file(fileUrl, destfile = fileDest)
dateDownloaded <- date()

Next, read in the dictionary file.

dictEn <- read.csv(file = "words_alpha.txt", header = FALSE, sep = ",", stringsAsFactors = FALSE)

dim(dictEn)
## [1] 370099      1
head(dictEn)
##       V1
## 1      a
## 2     aa
## 3    aaa
## 4    aah
## 5  aahed
## 6 aahing

There are over 370,000 words in the downloaded English dictionary.

Once loaded into R, the English word list is converted to a quanteda dictionary object, which will be used to remove non-English words from the training data.

mydict <- dictionary(list(engwords = dictEn[,1]))

The constructed mydict will be applied in the next step when creating n-grams.
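
As a quick illustration of this filtering (a minimal sketch with a made-up token, not part of the pipeline), tokens_select() keeps only the words found in mydict:

# toy example: the made-up token "qwzxy" should be absent from the English word list
toy <- tokens("this sentence contains one qwzxy word")
tokens_select(toy, pattern = mydict, selection = "keep", case_insensitive = TRUE)
# expected: "qwzxy" is dropped, the genuine English words are kept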

There are 463,200 final sentences in the training data and 115,590 in the test data.

Tokenization

The next step in this NLP pipeline is to split the text into smaller units (tokens) corresponding to words or combinations of words (e.g. pairs, triplets) in their natural sequence as used in a sentence.

Specifically, the algorithm will use 2-grams (two adjacent words) and 3-grams (three adjacent words) created from the clean training data, with non-English words removed, to predict the next word from the previous one or two words in the phrase; the toy example below illustrates what such n-grams look like.
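
The toy example below (illustrative only, not part of the pipeline) shows what quanteda's tokens_ngrams() produces for a short phrase, using the same space concatenator as in the function defined later.

# toy example of 2-gram and 3-gram construction
toy <- tokens("thanks for the follow")
tokens_ngrams(toy, n = 2, concatenator = " ")   # "thanks for" "for the" "the follow"
tokens_ngrams(toy, n = 3, concatenator = " ")   # "thanks for the" "for the follow"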

These tokens (1-grams, 2-grams, 3-grams) are stored in the n-gram dictionary that will be part of the text predictive model app to be developed.

To evaluate the performance of the predictive algorithm, the test data will be subjected to exactly the same preprocessing to ensure consistency between training and testing.

Construction of n-grams

Here a function is written to tokenize input text data into n-grams, with the desired number of words (n) specified and non-English words (i.e. words not found in mydict) removed. This function outputs a document-feature matrix (dfm).

# Function that creates a dfm containing only English words

dfm_usedict_Fn <- function(text_set, dict, n_gram = 1) {
    if (n_gram < 2) {
        dtm <- text_set %>% tokens() %>% tokens_select(pattern = dict, selection = "keep", 
            case_insensitive = TRUE) %>% dfm()
    } else {
        dtm <- text_set %>% tokens() %>% tokens_select(pattern = dict, selection = "keep", 
            case_insensitive = TRUE) %>% tokens_ngrams(n = n_gram, concatenator = " ") %>% 
            dfm()
    }
    return(dtm)
}

Next, apply the function above to create n-grams from the training data, and inspect frequency tables ordered with the most frequent n-grams at the top.

1-gram

dfm_1gram_train <- dfm_usedict_Fn(text_set = train.clean, dict = mydict, n_gram = 1)

format(object.size(dfm_1gram_train), "GB", digits = 3)
## [1] "0.119 Gb"
head(textstat_frequency(dfm_1gram_train)[, 1:4])
##   feature frequency rank docfreq
## 1     the    235811    1  153531
## 2      to    154406    2  120147
## 3     and    127667    3  102207
## 4      of    103850    4   82405
## 5      in     82467    5   71313
## 6     you     73832    6   60882
tail(textstat_frequency(dfm_1gram_train)[, 1:4])
##             feature frequency  rank docfreq
## 56088  impermanence         1 56088       1
## 56089  proctologist         1 56089       1
## 56090      consults         1 56090       1
## 56091         vitae         1 56091       1
## 56092        cristi         1 56092       1
## 56093 biomechanical         1 56093       1

2-gram

dfm_2gram_train <- dfm_usedict_Fn(text_set = train.clean, dict = mydict, n_gram = 2)

format(object.size(dfm_2gram_train), "GB", digits = 3)
## [1] "0.212 Gb"
head(textstat_frequency(dfm_2gram_train)[, 1:4])
##   feature frequency rank docfreq
## 1  of the     20794    1   19092
## 2  in the     19535    2   18603
## 3 for the     11047    3   10811
## 4  to the     10839    4   10516
## 5  on the     10454    5   10162
## 6   to be      9561    6    9292
tail(textstat_frequency(dfm_2gram_train)[, 1:4])
##                 feature frequency    rank docfreq
## 1354618     was suspect         1 1354618       1
## 1354619 suspect because         1 1354619       1
## 1354620   because ideas         1 1354620       1
## 1354621        grow big         1 1354621       1
## 1354622    grow because         1 1354622       1
## 1354623     on slippery         1 1354623       1

3-gram

dfm_3gram_train <- dfm_usedict_Fn(text_set = train.clean, dict = mydict, n_gram = 3)

format(object.size(dfm_3gram_train), "GB", digits = 3)
## [1] "0.348 Gb"
head(textstat_frequency(dfm_3gram_train)[, 1:4])
##              feature frequency rank docfreq
## 1     thanks for the      1832    1    1831
## 2         one of the      1720    2    1708
## 3        going to be      1050    3    1047
## 4      thank you for       837    4     836
## 5 looking forward to       802    5     801
## 6         the end of       780    6     775
tail(textstat_frequency(dfm_3gram_train)[, 1:4])
##                  feature frequency    rank docfreq
## 3044466   they tried not         1 3044466       1
## 3044467 show their shock         1 3044467       1
## 3044468   their shock as         1 3044468       1
## 3044469    shock as they         1 3044469       1
## 3044470       did as she         1 3044470       1
## 3044471     as she asked         1 3044471       1

Description of n-grams

Word clouds of 1-grams

Some words are more frequent than others. Word clouds provide a visual analytics approach to identifying the most frequent words.

textplot_wordcloud(dfm_1gram_train, min_size = 2, max_size = 5, max_words = 100, 
    rotation = 0.25, color = rev(RColorBrewer::brewer.pal(11, "RdBu")))

As expected, the stopwords are the most frequently typed words.

A quantitative alternative to word clouds is to plot the distributions of word frequencies.

Frequency distribution of 1-grams

library(ggplot2)   # for the frequency plots

freqData <- head(textstat_frequency(dfm_1gram_train), 30)

ggplot(freqData, aes(x = reorder(feature, frequency), y = frequency/1000)) + 
    geom_bar(stat = "identity") + coord_flip() + 
    labs(title = "30 most frequent words", x = "1-gram", y = "Frequency (thousands)")

rm(freqData)

Frequency distribution of 2-grams and 3-grams

library(gridExtra)   # for arranging the two plots side by side

freqData <- head(textstat_frequency(dfm_2gram_train), 30)
g1 <- ggplot(freqData, aes(x = reorder(feature, frequency), y = frequency/1000)) + 
    geom_bar(stat = "identity") + coord_flip() + 
    labs(title = "Bigrams", x = "2-gram", y = "Frequency (thousands)")

freqData <- head(textstat_frequency(dfm_3gram_train), 30)
g2 <- ggplot(freqData, aes(x = reorder(feature, frequency), y = frequency/1000)) + 
    geom_bar(stat = "identity") + coord_flip() + 
    labs(title = "Trigrams", x = "3-gram", y = "Frequency (thousands)")

grid.arrange(g1, g2, ncol = 2)

rm(freqData)

Proportions of n-grams with frequency of 1

dat <- textstat_frequency(dfm_1gram_train)
x1.1 <- nrow(dat)  # number of unique 1-grams (types)
x1.2 <- nrow(dat[dat$frequency == 1, ])  # 1-grams seen only once
x1.3 <- round(x1.2/x1.1, 2)

dat <- textstat_frequency(dfm_2gram_train)
x2.1 <- nrow(dat)  # number of unique 2-grams
x2.2 <- nrow(dat[dat$frequency == 1, ])
x2.3 <- round(x2.2/x2.1, 2)

dat <- textstat_frequency(dfm_3gram_train)
x3.1 <- nrow(dat)  # number of unique 3-grams
x3.2 <- nrow(dat[dat$frequency == 1, ])
x3.3 <- round(x3.2/x3.1, 2)

singles <- data.frame(n_types = c(x1.1, x2.1, x3.1),
                      n_freq_of_1 = c(x1.2, x2.2, x3.2),
                      p_freq_of_1 = c(x1.3, x2.3, x3.3))

row.names(singles) <- c("1-gram", "2-gram", "3-gram")

rm(dat)

knitr::kable(singles, caption = "Proportion of n-grams that appear once")
Proportion of n-grams that appear once

         n_types   n_freq_of_1   p_freq_of_1
1-gram     56093         15212          0.27
2-gram   1354623        992816          0.73
3-gram   3044471       2712279          0.89

Many of the n-grams are rarely used, yet they contribute substantially to the size of the model dictionary. About a quarter (27%) of the 1-grams occur only once; worse still, close to 90% of the 3-grams occur only once. It may be more computationally efficient, and no less accurate, to reclassify these singletons as unseen n-grams in the dictionary.
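
A minimal sketch of how these singletons could be dropped, assuming quanteda's dfm_trim() (whose minimum-frequency argument is min_count in quanteda 1.x, renamed min_termfreq in later versions):

# sketch: drop 2-grams and 3-grams that occur only once
dfm_2gram_trim <- dfm_trim(dfm_2gram_train, min_count = 2)
dfm_3gram_trim <- dfm_trim(dfm_3gram_train, min_count = 2)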

Coverage

It is of interest to know how many unique words are needed in a frequency-sorted dictionary to cover, say, 50% of all word instances in the language.

A function is written to calculate coverage statistics; it accepts a frequency-sorted data frame of n-grams (as returned by textstat_frequency()) and a label as input.

# Coverage function

coverageFn <- function(dat, datName) {
    n <- sum(dat$frequency)
    dat$prop <- dat$frequency/n
    dat$cumprop <- cumsum(dat$prop)
    
    # calculate coverage statistics
    x <- nrow(dat)
    x.5 <- nrow(dat[dat$cumprop <= 0.5, ])
    x.9 <- nrow(dat[dat$cumprop <= 0.9, ])
    x.95 <- nrow(dat[dat$cumprop <= 0.95, ])
    x.975 <- nrow(dat[dat$cumprop <= 0.975, ])
    x.99 <- nrow(dat[dat$cumprop <= 0.99, ])
    
    # organize into datatable
    d <- data.frame(Number = c(x.5, x.9, x.95, x.975, x.99, x), Proportion = c(round(x.5/x, 
        3), round(x.9/x, 3), round(x.95/x, 3), round(x.975/x, 3), round(x.99/x, 
        3), 1))
    row.names(d) <- c("Cover 50% of all word instances", "Cover 90% of all word instances", 
        "Cover 95% of all word instances", "Cover 97.5% of all word instances", 
        "Cover 99% of all word instances", "Cover 100% of all word instances")
    names(d) <- c(paste0(datName, "_N"), paste0(datName, "_Prop"))
    return(d)
}

First, the textstat_frequency() function of quanteda is applied to a dfm to output a data frame of n-grams ranked by frequency. Then coverageFn is applied to this output, separately for each n-gram order:

1-gram

dfm_1gram_train %>% textstat_frequency() %>% coverageFn(., datName = "Unigram")
##                                  Unigram_N Unigram_Prop
## Cover 50% of all word instances        121        0.002
## Cover 90% of all word instances       4821        0.086
## Cover 95% of all word instances       9916        0.177
## Cover 97.5% of all word instances     16752        0.299
## Cover 99% of all word instances      27374        0.488
## Cover 100% of all word instances     56093        1.000

2-gram

dfm_2gram_train %>% textstat_frequency() %>% coverageFn(., datName = "Bigram")
##                                  Bigram_N Bigram_Prop
## Cover 50% of all word instances     28731       0.021
## Cover 90% of all word instances    888927       0.656
## Cover 95% of all word instances   1121775       0.828
## Cover 97.5% of all word instances  1238199       0.914
## Cover 99% of all word instances   1308053       0.966
## Cover 100% of all word instances  1354623       1.000

3-gram

dfm_3gram_train %>% textstat_frequency() %>% coverageFn(., datName = "Trigram")
##                                  Trigram_N Trigram_Prop
## Cover 50% of all word instances     928958        0.305
## Cover 90% of all word instances    2621368        0.861
## Cover 95% of all word instances    2832919        0.931
## Cover 97.5% of all word instances   2938695        0.965
## Cover 99% of all word instances    3002160        0.986
## Cover 100% of all word instances   3044471        1.000

While fewer than 50% of the unique words cover 99% of all word instances in the corpus, essentially all of the unique bigrams and trigrams are required for the same coverage. This suggests that it may be more efficient, in terms of dictionary size, to keep only the top ~50% most frequent 1-grams. For bigrams and trigrams, a better option may be to treat the rarely seen ones (e.g. the singletons) as unseen n-grams.
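
A sketch of how a frequency-sorted unigram dictionary could be cut at a chosen coverage level (illustrative only; it reuses textstat_frequency() and the cumulative-proportion logic of coverageFn above):

# sketch: keep only the most frequent 1-grams needed to cover 99% of word instances
freq1 <- textstat_frequency(dfm_1gram_train)
freq1$cumprop <- cumsum(freq1$frequency / sum(freq1$frequency))
dict_1gram <- freq1[freq1$cumprop <= 0.99, ]
nrow(dict_1gram)   # roughly 27,000 of the 56,093 unique words (see the table above)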

Next step

N-gram model dictionary

A look-up n-gram dictionary with an acceptable data size (< 1.0 Gb) will be produced from the current 1-grams (0.1 Gb), 2-grams (0.2 Gb) and 3-grams (0.35 Gb). This will also limit the amount of memory (physical RAM) required to run the model in R. The dictionary will be saved as a .csv data file, with each n-gram occupying a row together with its frequency proportion.
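
A minimal sketch of how such a look-up file might be written, using the 3-gram dfm above (the output file name is illustrative):

# sketch: convert the 3-gram dfm to a frequency table with proportions and save as CSV
freq3 <- textstat_frequency(dfm_3gram_train)
freq3$prop <- freq3$frequency / sum(freq3$frequency)
write.csv(freq3[, c("feature", "frequency", "prop")],
          file = "ngram3_dictionary.csv", row.names = FALSE)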

Prediction algorithm

The Markov assumption will be applied: the probability of the next word depends only on the previous n-1 words (the history), where n = 2 (bigram) or n = 3 (trigram) in this case.

For example, when the input is a 2-word phrase (w1 w2), that is, a bigram, the algorithm will apply the chain rule to estimate the probability of the next word w3 to complete the string w1 w2 w3 (trigram):

P(w3 | w1 w2) = P(w1 w2 w3) / [P(w1) P(w2 | w1)]

The proportion (probability) column of the n-gram dictionary will be adjusted by smoothing, using the Good-Turing estimation method (https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation). The smoothing assigns non-zero probabilities to tokens that are absent from the dictionary but supplied as input phrases, e.g. from the test data. The number of n-grams (estimated probabilities) in the dictionary is the number of parameters to work with; to reduce it, n-grams seen only once will be treated the same as unseen n-grams and represented by a single parameter.
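
As a rough illustration of the Good-Turing idea (a sketch of the standard formula c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of n-grams seen exactly c times; not the final implementation):

# sketch: Good-Turing quantities from the 3-gram frequency table
freq3 <- textstat_frequency(dfm_3gram_train)
Nc <- table(freq3$frequency)                  # N(c): number of trigrams seen exactly c times
c_star_1 <- 2 * Nc["2"] / Nc["1"]             # adjusted count for trigrams seen once
p_unseen <- Nc["1"] / sum(freq3$frequency)    # probability mass reserved for unseen trigrams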

Also, a Katz back-off approach will be considered for estimating the probabilities of unobserved n-grams, whose numbers are expected to be high for bigrams and trigrams. For the basic model, if no matching 3-gram is found the algorithm backs off to the matching 2-gram with the highest count, and if no matching 2-gram is found it backs off to the most frequent 1-gram; a sketch is given below.
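
A minimal sketch of this back-off look-up, assuming frequency tables freq1, freq2 and freq3 produced by textstat_frequency() on the corresponding dfms (the function name and structure are illustrative, not the final app code):

# sketch: back-off next-word look-up (trigram -> bigram -> most frequent unigram)
predict_next_word <- function(phrase, freq1, freq2, freq3) {
    words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
    # try trigrams whose first two words match the end of the input phrase
    if (length(words) == 2) {
        hits <- freq3[startsWith(freq3$feature, paste(words[1], words[2], "")), ]
        if (nrow(hits) > 0) return(stringr::word(hits$feature[1], 3))
    }
    # back off to bigrams whose first word matches the last input word
    hits <- freq2[startsWith(freq2$feature, paste(tail(words, 1), "")), ]
    if (nrow(hits) > 0) return(stringr::word(hits$feature[1], 2))
    # back off to the most frequent unigram
    freq1$feature[1]
}

# hypothetical usage: predict_next_word("thanks for", freq1, freq2, freq3)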

N-gram model accuracy

The testing data will be similarly tokenized. The bigrams and trigrams will be split into an input phrase and a known next word, e.g. a trigram is split into a 2-word input phrase and the next word. The algorithm will take each input phrase and predict the most likely next word; the percentage correctly predicted will be the measure of its accuracy.
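
A sketch of how this accuracy could be computed from test-set trigrams, reusing dfm_usedict_Fn() and the predict_next_word() sketch above (illustrative only, and slow if run over the full test set):

# sketch: proportion of test trigrams whose third word is predicted correctly
test_tri <- textstat_frequency(dfm_usedict_Fn(test.clean, dict = mydict, n_gram = 3))
input  <- stringr::word(test_tri$feature, 1, 2)   # first two words = input phrase
actual <- stringr::word(test_tri$feature, 3)      # third word = known next word
predicted <- vapply(input, predict_next_word, character(1),
                    freq1 = freq1, freq2 = freq2, freq3 = freq3)
mean(predicted == actual)   # percent correctly predicted (as a proportion)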

Appendix

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gridExtra_2.3  tibble_1.4.2   ggplot2_2.2.1  rio_0.5.10    
## [5] dplyr_0.7.4    readtext_0.50  quanteda_1.2.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16       RColorBrewer_1.1-2 formatR_1.5       
##  [4] highr_0.6          cellranger_1.1.0   compiler_3.5.0    
##  [7] pillar_1.2.2       plyr_1.8.4         bindr_0.1.1       
## [10] forcats_0.3.0      tools_3.5.0        stopwords_0.9.0   
## [13] digest_0.6.15      lubridate_1.7.4    evaluate_0.10.1   
## [16] gtable_0.2.0       lattice_0.20-35    pkgconfig_2.0.1   
## [19] rlang_0.2.0        openxlsx_4.0.17    Matrix_1.2-14     
## [22] fastmatch_1.1-0    curl_3.2           yaml_2.1.19       
## [25] haven_1.1.1        bindrcpp_0.2.2     stringr_1.3.0     
## [28] httr_1.3.1         knitr_1.20         rprojroot_1.3-2   
## [31] grid_3.5.0         glue_1.2.0         data.table_1.11.0 
## [34] R6_2.2.2           readxl_1.1.0       foreign_0.8-70    
## [37] rmarkdown_1.9      spacyr_0.9.9       magrittr_1.5      
## [40] backports_1.1.2    scales_0.5.0       htmltools_0.3.6   
## [43] assertthat_0.2.0   colorspace_1.3-2   labeling_0.3      
## [46] stringi_1.1.7      lazyeval_0.2.1     RcppParallel_4.4.0
## [49] munsell_0.4.3