A Model to Predict the Next Word You Are Typing

This report describes my work on the John Hopkins’ Data Science Capstone project (provided via Coursera). I discuss pre-processing of the data including some exploratory analysis, and the design of the model. All scripts are included in the appendix and can be used to replicate the model that is used in the resulting shiny app.

Project Description from Coursera

This section primarily uses text directly cited from the Coursera website of the John Hopkins’ data Science Capstone project.

    Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

    "I went to the"

    the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

    This course will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, you will use the knowledge you gained in data products to build a predictive text product you can show off to your family, friends, and potential employers.

    The capstone includes the following deliverables:

    - An intermediate R markdown report that describes in plain language, plots, and code your exploratory analysis of the course data set. (the main focus of the current report)
    - A Shiny app that takes as input a phrase (multiple words), one clicks submit, and it predicts the next word.
    - A 5 slide deck created with R presentations pitching your algorithm and app to your boss or investor.

Executive summary

This report provides a description of the following steps:

data description
data sampling into train, validation, and test sets
data pre-processing including generation of ngrams
exploratory analysis
model design and further steps needed to prepare the data for the model
Appendix including all R scripts needed to: pre-process the data; develop the model; and, run the model

Data Description

This section primarily uses text directly cited from the Coursera website of the John Hopkins’ data Science Capstone project. Some comments and data summaries are provided by the author.

    The corpora are collected from publicly available sources by a web crawler. The crawler checks for language, so as to mainly get texts consisting of the desired language*.

    Each entry is tagged with it's date of publication. Where user comments are included they will be tagged with the date of the main entry.

    Each entry is tagged with the type of entry, based on the type of website it is collected from (e.g. newspaper or personal blog) If possible, each entry is tagged with one or more subjects based on the title or keywords of the entry (e.g. if the entry comes from the sports section of a newspaper it will be tagged with "sports" subject).In many cases it's not feasible to tag the entries (for example, it's not really practical to tag each individual Twitter entry, though I've got some ideas which might be implemented in the future) or no subject is found by the automated process, in which case the entry is tagged with a '0'.

    To save space, the subject and type is given as a numerical code.

    The files have been language filtered but may still contain some foreign text.

    Once the raw corpus has been collected, it is parsed further, to remove duplicate entries and split into individual lines. Approximately 50% of each entry is then deleted. Since you cannot fully recreate any entries, the entries are anonymised and this is a non-profit venture I believe that it would fall under Fair Use.

    This exercise uses the files named LOCALE.blogs.txt where LOCALE is the each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora.

The data is sourced from the following location:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The blogs corpus is comprised of paragraphs, where each paragraph is a line (or document) in the corpus. The news corpus is also comprised of paragraphs, where each paragraph is a line (or document) in the corpus. The twitter corpus is comprised of tweets, where each tweet is a line (or document) in the corpus. Initial summarization of the data provides the size of each corpus, number of lines and number of words in the table below.

##           file.name size(MB) num.of.lines num.of.words
## 1   en_US.blogs.txt   200.42       899288     37334085
## 2    en_US.news.txt   196.28      1010242     34372814
## 3 en_US.twitter.txt   159.36      2360148     30373565

We observe that the the number of words are overall similar across the three source documents. However, since each line in the blogs and news documents is a paragraph, the twitter document contains more than twice as many lines as the news and blogs documents.

Another thing to note is that each file is already plenty large in terms of MB size. R keeps the data into memory and appears to handle it very inefficiently, meaning the size of the data explodes when loaded and processed in R. As a result it most often won’t be tractable to work with the combined corpus directly to pre-process the data.

The R code used to produce this table is presented in Appendix D.

Data sampling

The data was split into 4 sub samples for each corpus source: 15% testing sample, 15% validation/development sample, 70% training sample. Out of the 70% training sample a 10% sub sample is selected to test code. Each source, blogs/news/twitter is sub sampled separately, and subsequently the separate source corpora are aggregated into a 15/15/70/10 corpus including all three sources. The random selection of lines from each corpus is achieved with binomial random sampling.

The 10% training sample is used to test code, and will likely further be used to test initial models. The 70% training sample is used for exploratory analysis and will be used later to train the models if my machine has enough memory to support this much data.

The R code used to produce the samples is presented in Appendix A.

Data pre-processing and preparation for analysis

The original data, as obtained by us from Coursera-SwiftKey, contains many irregularities that need to be addressed before the data is ready for exploratory analysis or modeling. For example the data contains emails, http/s addresses, emojis, contractions as “don’t”, upper/lower case, sympols as & @, etc… that have to either be removed or replaced/expanded (“don’t” expands into do not, e.g.) before ngrams are created.

The pre-processing code has taken many, many iterations and testing of each procedure. I hope I am not missing something important as a transformation, but that likely is not the case. Please advise if you see something that may not be producing the intended effect, or some majorly important transformation I am missing.

We apply the following transformations using the tm package in R, in the exact sequence described below:

Step 1. Remove duplicate rows
Step 2. Separate words connected with - or /. This helps later when we remove punctuation, so that two words do not stick to each
Step 3. Establish end of sentence, and replace it with “EEOSS”. Identify abbreviations as H.S.B.C. and replace them with “AABRR”. Identify numbers and replace them with “AABRR”. We need to do this so that when we are forming ngrams, the 2/3/4ngrams are not formed over end of sentences, abbreviations or numbers. For example consider the following tweet line
```
  "RT : According to the N.R.T $16.3 BILLION was spent on #MothersDay last year! Of the $16.3 BILLION spent $1 BILLION was spent on flowers"
```

If we were to directly exclude numbers, and then punctuation, we may form a 3gram reading “to the BILLION” which is incorrect as it does not really exist in this sentence. In addition, if we just remove end of sentence in “!”, we would form a 4gram that reads “year of the BILLION”, which also was not intended by the person who wrote the tweet.

Step 4. Remove email and http/s links
Step 5. Remove retweet entries (RT:), Remove @people, twitter usernames
Step 6. Transform to lower case. For the purposes of our models “Hello” and “hello” should be treated as the same piece of information.
Step 7. Remove/replace &, @, ’m, ’s, ’are, ’ll, and all other contractions
Step 8. Remove emoji’s, emoticons like ❤#RelayForLife ❤“😖😰💔💔”
Step 9. Remove g, mg, lbs, s, m, h etc abbreviations; Removes all single letters except “a” and “i”
Step 10. Remove numbers (if such still exist)
Step 11. Remove punctuation. I remove punctuation via the tm_map available function, while “preserving preserve_intra_word_dashes = TRUE” if such dashes still exist. In addition I separately remove the following symbols that were not removed by the tm_map functionality: ““”,””“,”‘“,”’“.
Step 12. Remove profanity. I don’t want my predictions to suggest profanity words. For this purpose i may train a model with these words in and remove them in the end by re-calibrating the model on non-profanity data. Or else, which makes more sense, is to remove them from the beginning and not train the model on those. The latter choice is applied. To remove profanity a list with profanity words is downloaded from:
```
  https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
```

and the list is used to filter out the “bad words”.

Step 13. Remove stop words. From this point forward i keep two types of corpora - one with stop words included and one with stop words removed. In many NLP applications removing stop words is appropriate. However, for the current application I believe stop words need to be used to train the model, so that we provide a natural language prediction. The corpora with stop words removed are, however, used for exploratory analysis as indicated in next sections.
Step 14. Remove whitespace. The substitutions and deletions applied in the previous steps lead to extra white spaces that are typically removed in NLP applications. I use the tm_map functionality for this white-space removal.

Stemming, reducing word types to their stems, is a further step that can be applied to NLP problems. However, for this particular application we rather work with the complete word type, then with its stem. This will provide for more correct model training and prediction.

The preprocessing is done by corpus source. The ready pre-processed corpora are exported and ready for ngram generation. The R code used to pre-process the samples is presented in Appendix B.

Tokenization and nGram Generation

NGrams are a popular method introduced by Jurafsky and Martin (2000) for NLP when applied to text prediction (and other tasks) problems. The ngram is a sequence of words found in the text that can be further sorted by frequency of appearance. For example UniGrams are 1 word ngrams, while TriGrams are sequences of three words as “I will go” or “Happy New Year”. NGrams are very useful for simple text prediction models, because, ngrams that appear with higher frequency are the ones that have a higher chance (probabilistically speaking) of appearing in prediction problems out of sample. For example a sequence of two words “I will” (first and second word) can be the base for 100s if not 1000s TriGrams (possible third word). However, among the possible third words in the training sample that form TriGrams with “I will”, some appear more often than others and will be selected with higher probability in prediction out of sample.

In case a TriGram that contains the first two words “I will” is not found in the model, then the model can refer to BiGrams containing a first word “will” and use this information to predict the third word in “I will …”. Naturally if no BiGrams exist, the model can refer to the stored UniGrams and suggest the highest probability single word as found in the training sample.

One obvious disadvantage to NGrams modeling is that not all possible third words to “I will …” are available in the training sets. Smoothing techniques can be applied to the available and unavailable three grams containing “I will …”, where probability mass is transferred from available in the training sample TriGrams to the ones that have not appeared even once. There are a few less or more sophisticated techniques for probability smoothing, and some of these techniques will be used in modeling to improve the performance of the model out of sample.

One quickly materializing problem with higher order ngrams, like 3ngrams and 4ngrams, is that very few 3ngrams and 4ngrams appear with higher frequency, and most 3/4ngrams appear just once (called singletons). This leads to enormous dataframes and potential problems with memory availability to process these dataframes. A known method is to remove the singletons when storing the 3/4ngrams.

NGrams are generated and are used for the moment to explore the data in a subsequent section. The R code used to generate and store 1/2/3/4grams is presented in Appendix C. For the moment the ngrams are stored as they are generated. In a later step I will look into storing the ngrams based on indexing of single words to see if this approach can reduce the memory demands of the subsequent modeling tasks. This will also likely be necessary to be able to operate the shiny app that has to be developed once the model is ready.

Exploratory analysis

Top Ngrams frequency

In this subsection we directly measure the top 20 most frequent 1/2/3grams for different corpora. It is important to look at the data with and without stop words excluded. For the purposes of modeling we need to include stop words. However, for exploratory analysis it is more interesting to remove the stop words and get a sense of the main topics prevailing in different type of corpora.

First we present the 1/2/3gram top 20 grams by frequency when stop words are not excluded from the corpora. The most common word is “The”, which represents about 5% coverage of the whole sample dictionary. Overall we see few surprises here - the top 20 ngrams here mainly represent some combination using a stop word and represent a very large portion of the dictionary.

Figure: Top 20 ngrams for the combined Corpus when stop words are not excluded from the corpus:

Next we present the 1/2/3gram top 20 grams by frequency when stop words are excluded from the corpora. The most common word is “will”, which represents about 0.75% coverage of the whole sample dictionary. Overall we see more variety here - the top 20 ngrams here have very low probability each, and the probability of each lower ranked TriGram drops very quickly compared to UniGrams and DiGrams. In addition, in terms of context we notice that likely the combined corpus represent relatively recent text since President Barack Obama is a fairly frequent TriGram. We see that we also have over-representation in the corpus of topics around holidays. Mostly, however, both DiGrams and TriGrams represent highly frequently used general expressions of the English language (pardon my lack of linguistic terminology).

Figure: Top 20 ngrams for the combined Corpus when stop words are excluded from the corpus:

For the beauty of it I also present a wordcloud plot of the combined corpus Trigrams when stop words are excluded from the corpus. More of the text topics and sentiments can be deduced from this more detailed plot.

Figure: Wordcloud plot of TriGrams for the combined Corpus when stop words are excluded from the corpus:

Next I present the TriGram top 20 grams by frequency for the three separate corpora - Blogs/News?Twitter - when stop words are NOT excluded from the corpora. The overall goal is to see if there are any topical/sentiment differences. We observe some specific for Blogs/Twitter TriGrams that will be unlikely for News articles as “I have been” and “I want to”. In the News corpus we observe also very specific for News corpus TriGrams as “The United States” and “according to the”. The Twitter corpus also reveals specific TriGrams that would rarely be observed in Blogs and News corpora as “thanks for the” and “looking forward to”.

Figure: Top 20 ngrams for the Blogs, News, Twitter corpora when stop words are not excluded from the corpus:

Last I present the TriGram top 20 grams by frequency for the three separate corpora - Blogs/News?Twitter - when stop words are excluded from the corpora. We observe some very specific for Blogs topics like “Amazon LLC”, “I pretty sure”, specific News topics as “senior court judge”, “President Barack Obama”, and specific tweets expressions as “let us go” or “I will take”. Topical differentiation between the three corpora based on TriGrams makes sense and is somewhat a loose confirmation that the data processing has gone relatively well.

Figure: Top 20 ngrams for the Blogs, News, Twitter corpora when stop words are not excluded from the corpus:

TTR - types of words to tokens ratio

The next characteristic of the data that we explore is the so called TTR: word.type to word.tokes ratio (Richards, 1987). The TTR demonstrates how rich the language is in a particular corpus. The higher TTR the more unique words are used to achieve to number of tokens. We calculate number of tokens (all words used) in the corpus, the number of word types (number of unique words) used in the corpus, and the respective TTR in the tables below. We do this for corpora that has stop words excluded or included.

## [1] "Word Types to tokens ratio (TTR) for corpora with NO stop words removed"

##    Source   Tokens Word.Types         TTR
## 1   Blogs 26296381     260377 0.009901629
## 2    News 23990968     195090 0.008131810
## 3 Twitter 20942380     331649 0.015836261
## 4  Corpus 71229729     574816 0.008069889

Note that the Twitter corpus is much richer than the Blogs and News corpora. Probable reason is that with a limited number of words allowed to use for a tweet people look for more precise words, which leads to richness of the Twitter corpus.

## [1] "Word Types to tokens ratio (TTR) for corpora with stop words removed"

##    Source   Tokens Word.Types        TTR
## 1   Blogs 13641106     260254 0.01907866
## 2    News 13851582     194967 0.01407543
## 3 Twitter 11814786     331526 0.02806026
## 4  Corpus 39307474     574693 0.01462045

We observe that the richness of the corpora increases when the stop words are excluded, which is expected. Stop words are extremely high frequency, low information words.

Cumulative frequences for 1grams

An interesting question to answer is what portion of your unique words comprise 50%, 75%, 90% of your sample dictionary. We present such information graphically first, as we can more clearly see the speed by which the dictionary coverage increases with additional unique words (word types). We constrain these plots to the first 50000 unique words in the 1grams. We present the speed of coverage increase for the separate source corpora - blogs/news/tweets - as well as for the combined corpus.

Figure: Cumulative frequences for 1grams excluding stop words:

We observe that we need about 1000 words to reach 50% coverage in each corpus, and about 50000 unique words to reach 95% coverage of our sample dictionary. It is interesting to note that coverage initially increases much faster for Twitter corpus than the other corpora. Even though this is the richest corpus in terms of TTR, we still have a smaller set of unique words that are represented with higher frequency as compared to the Blogs and News corpora.

Next we also calculate some dictionary coverage quantiles. The idea is to see how many unique words we need for, say 50% coverage. In addition to this we want to see how quickly the number of words increases to achieve an increase of 10% extra in coverage.

Figure: Quantiles, and ranges of quantiles in the combined corpus 1gram with stop words excluded:

## [1] "number of words comprising 49.99 to 50.01% of the dictionary:  966 966"

## [1] "number of words comprising 50.01 to 60.01% of the dictionary:  967 1720"

## [1] "number of words comprising 60.01 to 70.01% of the dictionary:  1721 3108"

## [1] "number of words comprising 70.01 to 80.01% of the dictionary:  3109 6067"

## [1] "number of words comprising 80.01 to 90.01% of the dictionary:  6068 15048"

## [1] "number of words comprising 90.01 to 95.01% of the dictionary:  15048 32788"

We note that we need about a 1000 unique words to achieve 50% coverage in the dictionary when stop words are excluded. From this quantile forward we need to about double the number of unique words in the corpus for each marginal 10% increase in coverage. The coverage speed significantly slows down beyond 90% coverage. For example to add another 5% coverage beyond 90% we need to move from 15,000 unique words to 32,000 unique words. The combined corpus clearly contains a small number of unique words that are represented very frequently and a large set of singletons.

The R code for all plots and tables presented in the section of exploratory analysis are included in Appendix D.

Model

The model uses Katz back-off method with Good-Turing smoothing, based on a 4gram. A Good-Turing (GT) with k<=5 smoothing algorithm is used to calculate the frequency of appearance of a given word in the training set given its history (GoodTuring Freq). As a natural next step conditional probabilities (GoodTuring Prob) of a word appearing given its history are calculated based on the GoodTuring Frequencies. Katz probabilities are calculated in case of a back-off as described in Jurafsky and Martin (2007), and Katz (1987).

The model development takes two steps. First, I calculate the Good-Turing (GT) Smoothing matrix for k<=5. The goal of this step is to redistribute some of the probability mass from low count ngrams, to ngrams that have not appeared in the training dataset. With k<=5 counts are redistributed from ngrams that appear in the training data 5 or fewer times into ngrams that appear zero times. Then using this GT matrix I smooth the counts for k<=5 and calculate conditional probabilities based on the GT counts. This results in lower than normal conditional probabilities for k in [1:5], and a positive probability for k=0. This completes any data pre-processing, and the data now is ready to load in the model. The scripts to complete this step are included in Appendix E.

The model itself takes a user unput text/phrase, cleans it, and depending on the length of the phrase executes a Katz back-off algorithm from the highest possible ngram, starting from the 4gram. Katz back-off simply starts looking for a match in the highest possible n-gram, it takes all matches available there and then starts looking for extra matches that have not appeared in the higher level ngram, but appear in lower level ngrams. The probability of matches that appear in lower ngrams depends on the GT probability of the ngram, multiplied by Katz Alpha. Katz alpha is calculated in the process of finding matches. The resulting probability of each match is called Katz probability in this paper. Finally, all matches that came from all possible ngrams are sorted by Katz probability. The scripts to run the model are included in Appendix F.

The model outputs a large table with predicted words, and Appendix G provides a short script to subset this table by Katz-probability groups to present on the main page of the Shiny app.

As suggested by Katz (1987), discarding all singletons (ngrams that appear only once in the training data) “does not affect the performance of the model”. Discarding all signletons significantly reduces the amount of time needed for the model to calculate the predictions in an “every-day-use” app. For the purposes of the predictions calculated in the shiny app I discarded all singletons.

Out-of-sample (OOS) validation

Jurafsky and Martin (2007) suggest that Katz Back-of method is better than other smoothing methods as well as better than weighing predictions across ngrams. For this reason, and due to limited time, I went after Katz back-off model and never had the chance to fully develop and test simpler models or more sophisticated alternative models. There are no models to compare and in this case testing the model via the perplexity test is not useful.

Instead, I rely on an out-of-sample test to find how the model predicts unseen text from the 15% validation sub-sample. I clean the 15% validation data set and form the 1,2,3,4 ngrams based on this validation data. Then starting from ngram=2 I feed the history of each row from the specific validation ngram to the Katz back-off algorithm based off the 70% training data set. For example I take the validation 3gram and feed its historic component (the first two words) the the Katz back-off algorithm, which uses the model based on 1,2,3 grams from the 70% training set and backs-off to lower ngram when necessary, calculates Katz probabilities and sorts the predictions by Katz probability. Then I take the top three words from this prediction and compare them to the third word from the validation 3gram. Then I simply sum up the matches and divide by the total frequency of 3grams.

This test is very expensive and for this reason I remove the singletons from the validation 15% sample. For example to calculate the test for the 2gram with excluded singletons took me about 24h calculations with the available to me resources. The results presented below do not incorporate our ability to predict the singletons, which most likely leads to a too optimistic result from the OOS validation.

2 gram results - unique ngrams 841646, total ngrams frequency 11112663, top 1 word match frequency = 16.49%, top 3 words match frequency = 28.64%.
3 gram results - unique ngrams 1000565, total ngrams frequency 5687561, top 1 word match frequency = 30.10%, top 3 words match frequency = 47.08%.
4 gram results - unique ngrams 534743, total ngrams frequency 2004421, top 1 word match frequency = 45.11%, top 3 words match frequency = 64.22%.

The scripts to run the OOS tests are located in Appendix H.

Shiny app

I explored ways to reduce file sizes and run the app in an efficient way so that a response to user’s input is very quick. I found the following helps a lot:

Saving data in Rdata is a much smaller file than csv for example. Loading R data is about 10 times faster than loading csv.
Not much efficiency is gained by replacing a 4gram Katz back-off model with a 3gram Katz back-off model. The gain is at best 15% savings in time per prediction.
Katz back-off generates predictions across more than one ngram, say you find some matches in a 4 gram, extra matches in 3 gram, etc… This creates three tables that have to be rbind. Rbind is very expensive, and i found that if I take just the first 1000 rows of each table that has to go into rbind, then the rbind process reduces to almost no time needed. In addition the gains in time in sorting the final table with predictions is of a similar magnitued.
I found that the biggest efficiency in time needed per prediciton is achieved by removing singletons. Using alternative ways to achieve a task does not have as much of an impact as removing singletons from your model.

The shiny server has memory limitations for the free version of the server. The better 4gram model was running out of memory on the server, and for this reason i provide a link to the 3gram model that fits in the shiny app server memory limitation of 1GB.

The shiny app is located here

References

As I started this assignment this was a brand new area for me. I mainly learned from Jurafsky, Martin, and Manning contributions to the field. I have not been able to find another source that so carefully walks you step by step though NLP for the particular application of text prediction at a conceptual level. I apply the Katz back-off method and Katz’s original paper was instrumental in calculating the Katz probabilities.

I have also used numerous resources, too many to keep track of, on Regular Expressions, to adapt those to the data pre-processing needs of this project.

Jurafsky, D. & Martin, J.H. (2007). Speech and language processing: An introduction to natural language processing, computational linguistics and speech recognition. Englewood Cliffs, NJ: Prentice Hall.
Dan Jurafsky & Chris Manning, Stamford Course Lectures on NLP: https://www.youtube.com/watch?v=nfoudtpBV68
Katz, Slava (1987). Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech, AND Signal Pprocessing, Vol. ASP-35
Stackoverflow
Richards, B. (1987). Type/token ratios: What do they really tell us? Journal of Child Language, 14, pp. 201209. Doi: 10:1017/S0305000900012885. Retrieved from http://hfroehlich.files.wordpress.com/2013/05/s0305000900012885a.pdf.

Appendix

The Appendix mainly includes all the code that I have used and the steps I applied the code. One can use the code and fuly replicate my results.

For all calculations I used a personal 2012 MAC-book Pro computer with a 2.5 GHz Intel Core i5 and 16 GB 1600 MHz DDR3 installed. I have many other applications and a browser with about 30 Tabs open at the same time I am pre-processing data. With this set up I am often close to maxing out the available memory, and sometimes I max it out and R crashes. For this reason, I often split the pre-processing of the large training samples by blogs/news/twitter (or into halves or thirds) and later aggregate them into a combined sample in a separate step.

Appendix A: Data subsampling

I sub sample the data into 3 main samples - training (70%), validation/development (15%), testing (15%). I further take a random 10% sample coming from the 70% training sample to use for testing my code. Working with a large corpus is very time consuming.

# set working folder
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# A function to extract samples
textFileSample <- function(infile, outfile, selection) { 
  conn <- file(infile, "r")
  fulltext <- readLines(conn)
  nlines <- length(fulltext)
  close(conn)
  
  conn <- file(outfile, "w")

  for (i in 1:nlines) {
    if (selection[i]==1) { cat(fulltext[i], file=conn, sep="\n") }
  }
  close(conn)
  
  paste("Saved", sum(selection), "lines to file", outfile)
}

###################
## blogs ##########
###################

# read the txt file and count the number of lines in blogs document
conn <- file("./data/en_US/en_US.blogs.txt", "r")
fulltext <- readLines(conn)
numrows <- length(fulltext)
close(conn)
rm(fulltext)

# create indexes for samples for BLOGS
set.seed(42)
fraction1 <- 0.15      # 15% size of test sample
fraction2 <- 0.17      # approx 15% size of validation sample
fraction3 <- 0.14      # approx 10% size of training subsample out of full 70% train sample

# draw random samples based on rbinom
test.index <- rbinom(numrows, 1, fraction1)            # test sample index
val_train.index <- 1-test.index                        # full training + validation sample

val.subindex <- rbinom(numrows, 1, fraction2)          
val.index <- val_train.index*val.subindex              # validation sample index
train.index <- 1- test.index - val.index               # full training sample index

trainsample.subindex <- rbinom(numrows, 1, fraction3)
trainsample.index <- train.index*trainsample.subindex  # training sub-sample index

# extract subsamples and store
textFileSample("./data/en_US/en_US.blogs.txt", "./data/en_US/sample/blogs.test15pct.txt",
               test.index)
textFileSample("./data/en_US/en_US.blogs.txt", "./data/en_US/sample/blogs.val15pct.txt",
               val.index)
textFileSample("./data/en_US/en_US.blogs.txt", "./data/en_US/sample/blogs.train70pct.txt",
               train.index)                  
textFileSample("./data/en_US/en_US.blogs.txt", "./data/en_US/sample/blogs.train10pct.txt",
               trainsample.index)

###################
## news ##########
###################

# read the txt file and count the number of lines in blogs document
conn <- file("./data/en_US/en_US.news.txt", "r")
fulltext <- readLines(conn)
numrows <- length(fulltext)
close(conn)
rm(fulltext)

# create indexes for samples for NEWS
set.seed(42)
fraction1 <- 0.15                                    # 15% size of test sample
fraction2 <- 0.17                                    # approx 15% size of validation sample
fraction3 <- 0.14                                    # approx 10% size of training subsample out of full train sample

test.index <- rbinom(numrows, 1, fraction1)            # test sample index
val_train.index <- 1-test.index                        # full training + validation sample

val.subindex <- rbinom(numrows, 1, fraction2)          
val.index <- val_train.index*val.subindex              # validation sample index
train.index <- 1- test.index - val.index               # full training sample index

trainsample.subindex <- rbinom(numrows, 1, fraction3)
trainsample.index <- train.index*trainsample.subindex  # training sub-sample index

# extract subsamples and store
textFileSample("./data/en_US/en_US.news.txt", "./data/en_US/sample/news.test15pct.txt",
               test.index)
textFileSample("./data/en_US/en_US.news.txt", "./data/en_US/sample/news.val15pct.txt",
               val.index)
textFileSample("./data/en_US/en_US.news.txt", "./data/en_US/sample/news.train70pct.txt",
               train.index)                  
textFileSample("./data/en_US/en_US.news.txt", "./data/en_US/sample/news.train10pct.txt",
               trainsample.index)

###################
## twitter ########
###################

# load data and count number of lines in twitter document
conn <- file("./data/en_US/en_US.twitter.txt", "r")
fulltext <- readLines(conn)
numrows <- length(fulltext)
close(conn)
rm(fulltext)

# create indexes for samples for TWITTER
set.seed(42)
fraction1 <- 0.15                                    # 15% size of test sample
fraction2 <- 0.17                                    # approx 15% size of validation sample
fraction3 <- 0.14                                    # approx 10% size of training subsample out of full train sample

test.index <- rbinom(numrows, 1, fraction1)            # test sample index
val_train.index <- 1-test.index                        # full training + validation sample

val.subindex <- rbinom(numrows, 1, fraction2)          
val.index <- val_train.index*val.subindex              # validation sample index
train.index <- 1- test.index - val.index               # full training sample index

trainsample.subindex <- rbinom(numrows, 1, fraction3)
trainsample.index <- train.index*trainsample.subindex  # training sub-sample index

# extract subsamples
textFileSample("./data/en_US/en_US.twitter.txt", "./data/en_US/sample/twitter.test15pct.txt",
               test.index)
textFileSample("./data/en_US/en_US.twitter.txt", "./data/en_US/sample/twitter.val15pct.txt",
               val.index)
textFileSample("./data/en_US/en_US.twitter.txt", "./data/en_US/sample/twitter.train70pct.txt",
               train.index)                  
textFileSample("./data/en_US/en_US.twitter.txt", "./data/en_US/sample/twitter.train10pct.txt",
               trainsample.index)

Next i combine blogs-news-tweets sub samples into combined-corpus sub samples:

##############################################
## combine a 10% sample of all threee sources
##############################################

# Combine train 70% sub-samples into one
conn <- file("./data/en_US/sample/blogs.train70pct.txt", "r")
        file1 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/news.train70pct.txt", "r")
        file2 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/twitter.train70pct.txt", "r")
        file3 <- readLines(conn)
        close(conn)
        
conn <- file("./data/en_US/sample/train70pct.txt", "w")
        cat(c(file1,file2,file3), file=conn, sep="\n")
        close(conn)
rm(file1, file2, file3)

# Combine train 10% sub-samples into one
conn <- file("./data/en_US/sample/blogs.train10pct.txt", "r")
        file1 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/news.train10pct.txt", "r")
        file2 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/twitter.train10pct.txt", "r")
        file3 <- readLines(conn)
        close(conn)
        
conn <- file("./data/en_US/sample/train10pct.txt", "w")
        cat(c(file1,file2,file3), file=conn, sep="\n")
        close(conn)
rm(file1, file2, file3)


# Combine test 15% sub-samples into one
conn <- file("./data/en_US/sample/blogs.test15pct.txt", "r")
        file1 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/news.test15pct.txt", "r")
        file2 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/twitter.test15pct.txt", "r")
        file3 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/test15pct.txt", "w")
        cat(c(file1,file2,file3), file=conn, sep="\n")
        close(conn)
rm(file1, file2, file3)

# Combine validation 15% sub-samples into one
conn <- file("./data/en_US/sample/blogs.val15pct.txt", "r")
        file1 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/news.val15pct.txt", "r")
        file2 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/twitter.val15pct.txt", "r")
        file3 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/val15pct.txt", "w")
        cat(c(file1,file2,file3), file=conn, sep="\n")
        close(conn)
rm(file1, file2, file3)

The resulting data sets from this sub sampling are:

blogs.train70pct.text, news.train70pct.text, twitter.train70pct.text
blogs.train10pct.text, news.train10pct.text, twitter.train10pct.text
blogs.test15pct.text, news.test15pct.text, twitter.test15pct.text
blogs.val15pct.text, news.val15pct.text, twitter.val15pct.text
train70pct.text, train10pct.text, test15pct.text, val15pct.text

Appendix B: Data pre-processing; Corpus prep for analysis

Set up pre-processing first:

library(tm) #load text mining library
library(textclean) #load auxiliary library 1

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10") #sets R's working directory 

# using a DirSource it loads a txt file into a VCorpus for preprocessing
# the files that need pre-processing in general are the sub-samples 
# from the previous section. But we start with the training samples
# blogs.train70pct.txt, news.train70pct.txt, twitter.train70pct.txt, train10pct.txt

# for this example code I use the train10pct.txt sample, whihc i place in the "./data/en_US/train.sample" folder 

sample.data <- VCorpus(DirSource("./data/en_US/train.sample", encoding = "UTF-8"), 
                 readerControl=list(language="en"))
summary(sample.data)
docs <- sample.data

# content-transformer custom functions
# Some custom functions that are needed to transform the corpus using content_transformer 
# functionality of tm package
toEmpty <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))}) # pattern to empty
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))}) # pattern to space

# turns ? and ! and . into an end of sentence identifier EEOSS. The purpose is that when forming ngrams we 
# discard ngrams that are formed on the edges of sentences and combine words from more than one sentence
toEOS1 <- content_transformer(function(x, pattern) {return (gsub("\\? |\\?$|\\! |\\!$", " EEOSS ", x))})
toEOS2 <- content_transformer(function(x, pattern) {return (gsub("\\. |\\.$", " EEOSS ", x))})

# turns abbreviations as H.S.B.C. into an identifier AABRR. The purpose is that when forming ngrams we 
# discard ngrams that are formed on the edges of such abbreviations as these ngrams are incorrect 
# from information perspective
toABR <- content_transformer(function(x, pattern) {return (gsub("[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\. ", " AABRR ", x))})

# turns numbers into an identifier NNUMM. The purpose is that when forming ngrams we 
# discard ngrams that are formed on the edges of such numbers as these ngrams are incorrect 
# from information perspective. A common approach is the 'remove numbers' with tm package.
# However, this will lead to formation of incorrect ngrams later on
toNUM <- content_transformer(function(x, pattern) {return (gsub("[0-9]+"," NNUMM ",x))})

# expand & and @ 
toAnd <- content_transformer(function(x) {return (gsub("&", "and", x))})
toAt <- content_transformer(function(x) {return (gsub("@", "at", x))})

# replace contractions using replace_contractions function from library library(textclean)
replaceContraction <- content_transformer(function(x) {return (replace_contraction(x))})
toHaveNot <- content_transformer(function(x) {return (gsub("haven't", "have not", x))})
toHadNot <- content_transformer(function(x) {return (gsub("hadn't", "had not", x))})

# remove duplicates
removeDup <- content_transformer(function(x) {return (unique(x))})

Apply a sequence of pre-processing steps on each text sample:

# 1. Remove duplicates
docs <- tm_map(docs, removeDup)

# 2. Separate words connected with - or /
# this helps later when we remove punctuation, so that two words do not stick to each otehr
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "/")

# 3. Establish end of sentence
# the sequence of application of the functions is important here
docs <- tm_map(docs, toEOS1)
docs <- tm_map(docs, toABR)
docs <- tm_map(docs, toEOS2)
docs <- tm_map(docs, toNUM)

# 4. Remove email and http/s
docs <- tm_map(docs, toEmpty, "\\S+@\\S+") # email
docs <- tm_map(docs, toEmpty, "[Hh}ttp([^ ]+)") # html links

# 5. Remove retweet entries, Remove @ people, twitter usernames
docs <- tm_map(docs, toEmpty, "RT | via") # retweets
docs <- tm_map(docs, toEmpty, "@([^ ]+)") # people
docs <- tm_map(docs, toEmpty, "[@][a - zA - Z0 - 9_]{1,15}") # usernames

# 6. Transform to lower case (need to wrap in content_transformer) (time consuming procedure)
docs <- tm_map(docs,content_transformer(tolower))

# 7. Remove/replace &, @, 'm, 's, 'are, 'll, etc...
docs <- tm_map(docs, toAnd) 
docs <- tm_map(docs, toAt) 
docs <- tm_map(docs, replaceContraction)

        # remove this contraction in say "mother's day"
        # before it is removed by remove_punctuation
        docs <- tm_map(docs, toEmpty, "'s") 

        # i found that havn't and hadn't did not get replaced with replace_contractions
        docs <- tm_map(docs, toHaveNot)
        docs <- tm_map(docs, toHadNot)

# 8. Remove emoji's, emoticons like ❤#RelayForLife ❤"😖😰💔💔"
docs <- tm_map(docs, toEmpty, "[^\x01-\x7F]")

# 9. Remove g, mg, lbs etc; removes all single letters except "a" and "i"

docs <- tm_map(docs, toSpace, " [1-9]+g ") # grams
docs <- tm_map(docs, toSpace, " [1-9]+mg ") # miligrams, etc
docs <- tm_map(docs, toSpace, " [1-9]+kg ")
docs <- tm_map(docs, toSpace, " [1-9]+lbs ")
docs <- tm_map(docs, toSpace, " [1-9]+s ") # seconds, etc
docs <- tm_map(docs, toSpace, " [1-9]+m ")
docs <- tm_map(docs, toSpace, " [1-9]+h ")
docs <- tm_map(docs, toSpace, " +g ") # grams
docs <- tm_map(docs, toSpace, " +mg ") # miligrams, etc
docs <- tm_map(docs, toSpace, " +kg ")
docs <- tm_map(docs, toSpace, " +lbs ")
docs <- tm_map(docs, toSpace, " +s ") # seconds, etc
docs <- tm_map(docs, toSpace, " +m ")
docs <- tm_map(docs, toSpace, " +h ")
docs <- tm_map(docs, toSpace, " +lbs ")
docs <- tm_map(docs, toSpace, " +kg ")

docs <- tm_map(docs, toSpace, " [b-hj-z] ") # all single-letter words except a and i

# 10. Remove numbers
docs <- tm_map(docs, removeNumbers)

# 11. Remove punctuation
# Note that this removes hashtags from tweets. I examined the data and most often the
# hashtag precedes a word of meaning and not a name of a person. If it were the latter
# it would be logical to remove the whole word that is preceded by a hashtag. However, 
# this strategy would lead to loss in meaning in the sentences and potentially affect
# the predictive power of the model. In addition, hopefully words with hashtags 
# that dont have a general meaning will not repeat often, and will have a low weight
# in calibrating the model via the ngrams. 
# Result: REMOVE HASHTAGS only, not words w/ hashtag
docs <- tm_map(docs, removePunctuation, preserve_intra_word_dashes = TRUE)

        # 11.1. Remove ": removing punctuation did not remove these
        docs <- tm_map(docs, toEmpty, "“")
        docs <- tm_map(docs, toEmpty, "”")
        docs <- tm_map(docs, toEmpty, "‘")
        docs <- tm_map(docs, toEmpty, "’")

# 12. Remove profanity (time consuming procedure)
# I don't want my predictions to suggest profanity words. For this purpose i may
# train a model with these words in and remove them in the end by recalibrating the 
# model on non-profanity data. Or else, which makes more sense, is to remove them
# from the beginning and not train the model on those. The latter choice is applied. 

# Donwload a list with bad words - we need this

# install.packages("RCurl")
library(RCurl)

profanity.source<-"https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
if (!file.exists("./data/en_US/profanity/en_bws.txt")){download.file(url=profanity.source , 
                destfile = "./data/en_US/profanity/en_bws.txt", method='libcurl')}

# remove bad words from corpora
bw.data<-read.table("./data/en_US/profanity/en_bws.txt", header=FALSE, sep="\n", strip.white=TRUE) # load bad words
names(bw.data)<-"bad.words"
docs<- tm_map(docs, removeWords, bw.data[,1]) # removal of profanity

# 13. Remove stop words
# From this point forward i keep two types of corpora - one with stop words included
# and one with stop words removed. The latter sample always has "nonstop" in the name of the file
docs.nonstop<-tm_map(docs, removeWords, stopwords("english"))

# 14. Remove whitespace

# The previous transformations generate a lot of extra whitespace in the lines of text; 
# applying the stripWhitespace transformation deals with this problem.
# I have tried different methods for this, but finally decided to use the tm package
docs <- tm_map(docs, stripWhitespace)
docs.nonstop <- tm_map(docs.nonstop, stripWhitespace)

# 15. Stem documents
# For the purposes of this porject I have made a choice to not use a stemmed corpus
# Logically we need to work with word types, not stems from them to form correct ngrams
# However, I offer the code i have tested for this step

# # install.packages("SnowballC")
# library("SnowballC")
# docs.stem <- tm_map(docs,stemDocument)
# docs.nonstop.stem <- tm_map(docs.nonstop,stemDocument)

# 16. Save data
# At this point i export the corpus with two versions - stop words included or not
# The reason is that pre-processing is very time consuming and shold be done once

save(docs, docs.nonstop, 
     file = "./data/en_US/train.sample/Rdata_output/train10pct.preprocsd.Rdata")

write(docs[[1]][[1]],"./data/en_US/train.sample/txt_output/train10pct.docs.preprocsd.txt")
write(docs.nonstop[[1]][[1]],"./data/en_US/train.sample/txt_output/train10pct.docs.nonstop.preprocsd.txt")

The resulting training files from this pre-processing step are (pre-processed in name:

blogs.train70pct.docs.pre-processed.text (also news., twitter. versions)
blogs.train70pct.docs.nonstop.pre-processed.text (also news., twitter. versions)
train10pct.docs.pre-processed.text, train10pct.docs.nonstop.pre-processed.text (to be used for testing code)
twitter.train10pct.docs.pre-processed.text, twitter.train10pct.docs.nonstop.pre-processed.text (to be used for testing code)

Appendix C: Tokenization and nGram Generation

I use package stylo to do tokenization. Since the ngrams can become very large, especially for 3grams and 4 grams, I tokenize the train70pct samples separately for blogs/news/tweets and aggregate the ngrams in a subsequent step.

# install.packages("stylo")
library(stylo)

###*********************** convert text to words with stylo **********************

### this step is done consecutively for docs[1] and doc.nonstop[1]
### and is applied to the train70pct and train10pct corpora from the previous step

### here we provide the example code for docs[1]
docs.words<- txt.to.words(docs[1])
###*******************************************************************************

###
# create data frames of one, two and three ngrams
###

###
# unigram
###
df1grm<-data.frame(table(make.ngrams(docs.words[[1]], ngram.size = 1))) # create 1gram

# remove rows with eeoss, nnumm, aabrr as these are not useful
df1grm <- df1grm[which(df1grm$Var1!="eeoss"),]
df1grm <- df1grm[which(df1grm$Var1!="nnumm"),]
df1grm <- df1grm[which(df1grm$Var1!="aabrr"),]

# calculate relative frequencies
df1grm$RelFreq <- round(100*(df1grm$Freq/sum(df1grm$Freq)),2)

# Create a sorted table "sdf*" by decending frequency count
sdf1grm<-df1grm[order(df1grm$Freq, decreasing = TRUE),];

# Calculate relative cumulative frequencies, and column with row numbers.
# this allows one to find how many words are e.g. 50% of dictionary to answer
# a question like this later on
sdf1grm$CumSum <- cumsum(sdf1grm$Freq); sdf1grm$RelCumSum <- 100*(sdf1grm$CumSum/sum(sdf1grm$Freq))
sdf1grm$RowNum <- 1:dim(sdf1grm)[1]
colnames(sdf1grm)<-c("OneGram","Frequency","RelFrequency", "CumFreq", "RelCumFreq", "RowNum")

# Build frequency of frequency table for subsequent Good-Turing smoothing
uni.freqfreq<-data.frame(Uni=table(sdf1grm$Frequency))

###
# bigram
###

df2grm<-data.frame(table(make.ngrams(docs.words[[1]], ngram.size = 2))) # create 2gram

# calculate relative frequencies
df2grm$RelFreq <- round(100*(df2grm$Freq/sum(df2grm$Freq)),2)

# remove rows with eeoss, nnumm, aabrr as these are not useful
# note that any bigrams, tri/fourgrams that contain any of these three words are incorrect
# because they are not sequences of words that were actually used in the text
# grepl looks for a pattern in the text of each row
eos.index<-grepl("eeoss",df2grm$Var1)
df2grm<-df2grm[!eos.index,]
num.index<-grepl("nnumm",df2grm$Var1)
df2grm<-df2grm[!num.index,]
abr.index<-grepl("aabrr",df2grm$Var1)
df2grm<-df2grm[!abr.index,]

# Create a sorted table "sdf*" by decending frequency count
sdf2grm<-df2grm[order(df2grm$Freq, decreasing = TRUE),]; colnames(sdf2grm)<-c("TwoGram","Frequency","RelFrequency")

# Build frequency of frequency table for Good-Turing smoothing
di.freqfreq<-data.frame(Di=table(sdf2grm$Frequency))

###
# trigram
###

df3grm<-data.frame(table(make.ngrams(docs.words[[1]], ngram.size = 3))) # create 3gram

# calculate relative frequencies
df3grm$RelFreq <- round(100*(df3grm$Freq/sum(df3grm$Freq)),2)

# remove rows with eeoss, nnumm, aabrr as these are not useful
eos.index<-grepl("eeoss",df3grm$Var1)
df3grm<-df3grm[!eos.index,]
num.index<-grepl("nnumm",df3grm$Var1)
df3grm<-df3grm[!num.index,]
abr.index<-grepl("aabrr",df3grm$Var1)
df3grm<-df3grm[!abr.index,]

# Create a sorted table "sdf*" by decending frequency count
sdf3grm<-df3grm[order(df3grm$Freq, decreasing = TRUE),]; colnames(sdf3grm)<-c("TriGram","Frequency","RelFrequency")

# separate singletons
sdf3grm.2pl<-sdf3grm[which(sdf3grm$Frequency>=2),]
sdf3grm.singl<-sdf3grm[which(sdf3grm$Frequency==1),]

# Build frequency of frequency table for Good-Turing smoothing
tri.freqfreq<-data.frame(Tri=table(sdf3grm$Frequency))

###
# fourgram
###

df4grm<-data.frame(table(make.ngrams(docs.words[[1]], ngram.size = 4)))  # create 4gram

# calculate relative frequencies
df4grm$RelFreq <- round(100*(df4grm$Freq/sum(df4grm$Freq)),2)

# remove rows with eeoss, nnumm, aabrr as these are not useful
eos.index<-grepl("eeoss",df4grm$Var1)
df4grm<-df4grm[!eos.index,]
num.index<-grepl("nnumm",df4grm$Var1)
df4grm<-df4grm[!num.index,]
abr.index<-grepl("aabrr",df4grm$Var1)
df4grm<-df4grm[!abr.index,]

# Create a sorted table "sdf*" by decending frequency count
sdf4grm<-df4grm[order(df4grm$Freq, decreasing = TRUE),]; colnames(sdf4grm)<-c("FourGram","Frequency","RelFrequency")

# separate singletons
sdf4grm.2pl<-sdf4grm[which(sdf4grm$Frequency>=2),]
sdf4grm.singl<-sdf4grm[which(sdf4grm$Frequency==1),]

# Build frequency of frequency table for Good-Turing smoothing
four.freqfreq<-data.frame(Four=table(sdf4grm$Frequency))

###
# Save ngrams data
###

save(docs, docs.words, sdf1grm, sdf2grm, sdf3grm, sdf3grm.2pl, sdf3grm.singl, 
     sdf4grm, sdf4grm.2pl, sdf4grm.singl, uni.freqfreq, di.freqfreq, tri.freqfreq, four.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train10pct_ngram_docs.Rdata")

# (for illustrative purposes only)
# save ngram data for the alternative nonstop corpus 
save(docs.nonstop, docs.words, sdf1grm, sdf2grm, sdf3grm, sdf3grm.2pl, sdf3grm.singl, 
     sdf4grm, sdf4grm.2pl, sdf4grm.singl, uni.freqfreq, di.freqfreq, tri.freqfreq, four.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train10pct_ngram_docs.nonstop.Rdata")

Input files for the last step were:

blogs.train70pct.pre-processed.Rdata, news.train70pct.pre-processed.Rdata, twitter.train70pct.pre-processed.Rdata
train10pct.pre-processed.Rdata

Output files for the last step were:

blogs.train70pct_ngram_docs.Rdata, news.train70pct_ngram_docs.Rdata, twitter.train70pct_ngram_docs.Rdata
blogs.train70pct_ngram_docs.nonstop.Rdata, news.train70pct_ngram_docs.nonstop.Rdata, twitter.train70pct_ngram_docs.nonstop.Rdata
train10pct_ngram_docs.Rdata, train10pct_ngram_docs.nonstop.Rdata

In the next step I aggregate the blogs/news/twitter train70pct samples ngrams into combined ngrams. I provide an example of the code used for files with included stop words. The same code is applied to files with stop words excluded.

###*************************************
### aggregate large training sample ####
###*************************************

####
## with stop words
####

# To load RData, one object at a time, we use library R.utils
# install.packages("R.utils")

library(R.utils); 

###
## unigram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

sdf1grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf1grm"]] 
sdf1grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf1grm"]]
sdf1grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf1grm"]]

# vertically bind the ngrams across three sources
df1grm <- rbind(sdf1grm.blogs[,1:2], sdf1grm.news[,1:2], sdf1grm.twitter[,1:2])

rm(sdf1grm.blogs,sdf1grm.news,sdf1grm.twitter)

## group and summarize using dplyr library
library(dplyr)

sdf1grm <- df1grm %>% group_by(OneGram) %>% summarize(Frequency = sum(Frequency) )
rm(df1grm)
sdf1grm <- as.data.frame(sdf1grm); sdf1grm <- sdf1grm[order(-sdf1grm$Frequency),]

# calculate relative frequencies
sdf1grm$RelFreq <- round(100*(sdf1grm$Frequency/sum(sdf1grm$Frequency)),2)

## Calculate relative cumulative frequencies, and column with row numbers.
## this allows one to find how many words are e.g. 50% of dictionary
sdf1grm$CumSum <- cumsum(sdf1grm$Freq); sdf1grm$RelCumSum <- 100*(sdf1grm$CumSum/sum(sdf1grm$Freq))
sdf1grm$RowNum <- 1:dim(sdf1grm)[1]
colnames(sdf1grm)<-c("OneGram","Frequency","RelFrequency", "CumFreq", "RelCumFreq", "RowNum")

# Build frequency of frequency table for Good-Turing smoothing
uni.freqfreq<-data.frame(Uni=table(sdf1grm$Frequency))

# save combined data for 1gram
save(sdf1grm, uni.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.Rdata")
rm(sdf1grm, uni.freqfreq)

###
## bigram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

sdf2grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf2grm"]] 
sdf2grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf2grm"]]
sdf2grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf2grm"]]

# vertically bind the ngrams across three sources
df2grm <- rbind(sdf2grm.blogs[,1:2], sdf2grm.news[,1:2], sdf2grm.twitter[,1:2])

rm(sdf2grm.blogs,sdf2grm.news,sdf2grm.twitter)

## group and summarize using dplyr library
library(dplyr)

sdf2grm <- df2grm %>% group_by(TwoGram) %>% summarize(Frequency = sum(Frequency) )
rm(df2grm)
sdf2grm <- as.data.frame(sdf2grm); sdf2grm <- sdf2grm[order(-sdf2grm$Frequency),]

# calculate relative frequencies
sdf2grm$RelFreq <- round(100*(sdf2grm$Frequency/sum(sdf2grm$Frequency)),2)

# Build frequency of frequency table for Good-Turing smoothing
di.freqfreq<-data.frame(Di=table(sdf2grm$Frequency))

# save combined data for 2gram
save(sdf2grm, di.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train70pct_2gram_docs.Rdata")
rm(sdf2grm, di.freqfreq)

###
## trigram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

sdf3grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf3grm"]] 
sdf3grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf3grm"]]
sdf3grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf3grm"]]

# vertically bind the ngrams across three sources
df3grm <- rbind(sdf3grm.blogs[,1:2], sdf3grm.news[,1:2], sdf3grm.twitter[,1:2])

rm(sdf3grm.blogs,sdf3grm.news,sdf3grm.twitter)

## group and summarize using dplyr library
library(dplyr)

sdf3grm <- df3grm %>% group_by(TriGram) %>% summarize(Frequency = sum(Frequency) )
rm(df3grm)
sdf3grm <- as.data.frame(sdf3grm); sdf3grm <- sdf3grm[order(-sdf3grm$Frequency),]

# calculate relative frequencies
sdf3grm$RelFreq <- round(100*(sdf3grm$Frequency/sum(sdf3grm$Frequency)),2)

# Build frequency of frequency table for Good-Turing smoothing
tri.freqfreq<-data.frame(Tri=table(sdf3grm$Frequency))

# separate singletons
sdf3grm.2pl<-sdf3grm[which(sdf3grm$Frequency>=2),]
sdf3grm.singl<-sdf3grm[which(sdf3grm$Frequency==1),]

# save combined data for 3gram
save(sdf3grm, tri.freqfreq, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_3gram_docs.Rdata")
# save pre-processed data
save(sdf3grm.2pl, tri.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train70pct_3gram.2pl_docs.Rdata")
# save pre-processed data
save(sdf3grm.singl, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_3gram.singl_docs.Rdata")

rm(sdf3grm, tri.freqfreq, sdf3grm.2pl, sdf3grm.singl)

###
## fourgram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

sdf4grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf4grm"]] 
sdf4grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf4grm"]]
sdf4grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf4grm"]]

# vertically bind the ngrams across three sources
df4grm <- rbind(sdf4grm.blogs[,1:2], sdf4grm.news[,1:2], sdf4grm.twitter[,1:2])

rm(sdf4grm.blogs,sdf4grm.news,sdf4grm.twitter)

## group and summarize using dplyr library
library(dplyr)

sdf4grm <- df4grm %>% group_by(FourGram) %>% summarize(Frequency = sum(Frequency) )
rm(df4grm)
sdf4grm <- as.data.frame(sdf4grm); sdf4grm <- sdf4grm[order(-sdf4grm$Frequency),]

# calculate relative frequencies
sdf4grm$RelFreq <- round(100*(sdf4grm$Frequency/sum(sdf4grm$Frequency)),2)

# Build frequency of frequency table for Good-Turing smoothing
four.freqfreq<-data.frame(Four=table(sdf4grm$Frequency))

# separate singletons
sdf4grm.2pl<-sdf4grm[which(sdf4grm$Frequency>=2),]
sdf4grm.singl<-sdf4grm[which(sdf4grm$Frequency==1),]

# save combined data for 4gram
save(sdf4grm, four.freqfreq, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_4gram_docs.Rdata")
# save pre-processed data
save(sdf4grm.2pl, four.freqfreq, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_4gram.2pl_docs.Rdata")
# save pre-processed data
save(sdf4grm.singl, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_4gram.singl_docs.Rdata")

rm(sdf4grm, four.freqfreq, sdf4grm.2pl, sdf4grm.singl)

Output files for the last step were:

blogs.train70pct_ngram_docs.Rdata, news.train70pct_ngram_docs.Rdata, twitter.train70pct_ngram_docs.Rdata
blogs.train70pct_ngram_docs.nonstop.Rdata, news.train70pct_ngram_docs.nonstop.Rdata, twitter.train70pct_ngram_docs.nonstop.Rdata

Output files from the last step were:

train70pct_1gram_docs.Rdata, train70pct_2gram_docs.Rdata
train70pct_3gram_docs.Rdata, train70pct_3gram.2pl_docs.Rdata, train70pct_3gram.singl_docs.Rdata
train70pct_4gram_docs.Rdata, train70pct_4gram.2pl_docs.Rdata, train70pct_4gram.singl_docs.Rdata
except for the train70pct_3gram.singl and train70pct_4gram.singl, the Rdata files also include the frequence of frequence tables

Extracting from RData some dataframes needed for exploratory analysis

I need to be able to load some 1gram and 3gram dataframes quicker for the purposes of exploratory analysis. I extract these dataframes with the following code.

# TriGrams
 sdf3grm.2pl.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf3grm.2pl"]] 
 sdf3grm.2pl.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf3grm.2pl"]]
 sdf3grm.2pl.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf3grm.2pl"]]
 save(sdf3grm.2pl.blogs, file = "./data/en_US/train.sample/Rdata_output/blogs.train70pct_3gram.2pl_docs.Rdata")
 save(sdf3grm.2pl.news, file = "./data/en_US/train.sample/Rdata_output/news.train70pct_3gram.2pl_docs.Rdata")
 save(sdf3grm.2pl.twitter, file = "./data/en_US/train.sample/Rdata_output/twitter.train70pct_3gram.2pl_docs.Rdata")
 
 sdf3grm.2pl.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.nonstop.Rdata")[["sdf3grm.2pl"]] 
 sdf3grm.2pl.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.nonstop.Rdata")[["sdf3grm.2pl"]]
 sdf3grm.2pl.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.nonstop.Rdata")[["sdf3grm.2pl"]]
 
 save(sdf3grm.2pl.blogs, file = "./data/en_US/train.sample/Rdata_output/blogs.train70pct_3gram.2pl_docs.nonstop.Rdata")
 save(sdf3grm.2pl.news, file = "./data/en_US/train.sample/Rdata_output/news.train70pct_3gram.2pl_docs.nonstop.Rdata")
 save(sdf3grm.2pl.twitter, file = "./data/en_US/train.sample/Rdata_output/twitter.train70pct_3gram.2pl_docs.nonstop.Rdata")

# UniGrams  
sdf1grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.nonstop.Rdata")[["sdf1grm"]] 
sdf1grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.nonstop.Rdata")[["sdf1grm"]]
sdf1grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.nonstop.Rdata")[["sdf1grm"]]

save(sdf1grm.blogs, file = "./data/en_US/train.sample/Rdata_output/blogs.train70pct_1gram_docs.nonstop.Rdata")
save(sdf1grm.news, file = "./data/en_US/train.sample/Rdata_output/news.train70pct_1gram_docs.nonstop.Rdata")
save(sdf1grm.twitter, file = "./data/en_US/train.sample/Rdata_output/twitter.train70pct_1gram_docs.nonstop.Rdata")

Appendix D: Data Characteristics and Exporatory Analysis

Characteristics of original sourced data

The code below generates the table in section “Data Description”. It works with the original three corpora provided by Swift-Coursera.

datafolder <- "/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10/data/en_US"
flist <- list.files(path=datafolder, recursive=F, pattern=".*en_.*.txt")

# lapply returns a list
l <- lapply(paste(datafolder, flist, sep="/"), function(f) {
        fsize <- file.info(f)[1]/1024/1024
        con <- file(f, open="r")
        lines <- readLines(con)
#        nchars <- lapply(lines, nchar)
#        maxchars <- which.max(nchars)
        nwords <- sum(sapply(strsplit(lines, "\\s+"), length))
        close(con)
        # round to two decimal points, but if the first two decimals are 00, then still display them
        return(c(f, format(round(fsize, 2), nsmall=2), length(lines), nwords))
})

# unlist the list and display the results
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=T))
colnames(df) <- c("file", "size(MB)", "num.of.lines", "num.of.words")

filenames <- data.frame(file.name = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
df <- cbind(filenames,df[,2:4])
df

NGrams based exploratory analysis

R code for Ngrams related plots in the “Exploratory Analysis” section.

Top 20 highest frequency 1gram, 2gram and 3 grams for ngrams including stop words for the combined train70pct corpus:

###
# ngram plots including stop words
###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# load 1/2/3grams 
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_2gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_3gram.2pl_docs.Rdata", verbose = FALSE)

# extract top 20 ngrams by frequency
top20oneg<-sdf1grm[1:20,]; top20oneg$OneGram <- as.character(top20oneg$OneGram)
top20twog<-sdf2grm[1:20,]; top20twog$TwoGram <- as.character(top20twog$TwoGram)
top20trig<-sdf3grm.2pl[1:20,]; top20trig$TriGram <- as.character(top20trig$TriGram)

rm(sdf1grm, sdf2grm, sdf3grm)

library(ggplot2)
#install.packages("ggpubr")
library(ggpubr) # use to align three plots in one row

p1 <- ggplot (top20oneg, aes(x = reorder(OneGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "lightgreen" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "OneGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

p2 <- ggplot (top20twog, aes(x = reorder(TwoGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "lightblue" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFreq, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "TwoGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

p3 <- ggplot (top20trig, aes(x = reorder(TriGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFreq, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "TriGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

# Arrange plots (3,1)
 ggarrange(p1, p2, p3 + rremove("x.text"), 
#           labels = c("A", "B", "C"),
          labels = NULL,
          ncol = 3, nrow = 1)

Top 20 highest frequency 1gram, 2gram and 3 grams for ngrams excluding stop words for the combined train70pct corpus:

###
# ngram plots excluding stop words
###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# load 1/2/3grams  
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_2gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)
 
# extract top 20 ngrams by frequency
top20oneg<-sdf1grm[1:20,]; top20oneg$OneGram <- as.character(top20oneg$OneGram)
top20twog<-sdf2grm[1:20,]; top20twog$TwoGram <- as.character(top20twog$TwoGram)
top20trig<-sdf3grm.2pl[1:20,]; top20trig$TriGram <- as.character(top20trig$TriGram)

rm(sdf1grm, sdf2grm, sdf3grm)

library(ggplot2)
#install.packages("ggpubr")
library(ggpubr) # use to align three plots in one row

p1 <- ggplot (top20oneg, aes(x = reorder(OneGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "lightgreen" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "OneGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

p2 <- ggplot (top20twog, aes(x = reorder(TwoGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "lightblue" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFreq, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "TwoGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

p3 <- ggplot (top20trig, aes(x = reorder(TriGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFreq, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "TriGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

# Arrange plots (3,1)
 ggarrange(p1, p2, p3 + rremove("x.text"), 
#           labels = c("A", "B", "C"),
          labels = NULL,
          ncol = 3, nrow = 1)

Wordcloud frequency of 3 grams for ngrams excluding stop words for the combined train70pct corpus:

###
# wordcloud plots of 3grams with excluded stop words
###
 
# install.packages("wordcloud")
library(wordcloud)
 
load("./data/en_US/train.sample/Rdata_output/train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)

wordcloud(words = sdf3grm.2pl$TriGram, freq = sdf3grm.2pl$Freq, min.freq = 3,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2")) #,scale=c(2, 1.2))

Top 20 highest frequency 1gram, 2gram and 3 grams for ngrams including stop words for the three separate - blogs/news/twitter - corpora:

 ###
 # 3grams from each source including stop words
 ###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_3gram.2pl_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_3gram.2pl_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_3gram.2pl_docs.Rdata", verbose = FALSE)
 
 # extract top 20 by freq
 top20trig.blogs<-sdf3grm.2pl.blogs[1:20,]; top20trig.blogs$TriGram <- as.character(top20trig.blogs$TriGram)
 top20trig.news<-sdf3grm.2pl.news[1:20,]; top20trig.news$TriGram <- as.character(top20trig.news$TriGram)
 top20trig.twitter<-sdf3grm.2pl.twitter[1:20,]; top20trig.twitter$TriGram <- as.character(top20trig.twitter$TriGram)

 library(ggplot2)
 #install.packages("ggpubr")
 library(ggpubr)
 
 pp1 <- ggplot (top20trig.blogs, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram Blogs" ) +
         ylab( "Freq & RelFreq %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()
 
 pp2 <- ggplot (top20trig.news, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram News" ) +
         ylab( "Freq & RelFreq %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()
 
 pp3 <- ggplot (top20trig.twitter, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram Twits" ) +
         ylab( "Freq & RelFrequency %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()

# Arrange plots (3,1)  
 ggarrange(pp1, pp2, pp3 + rremove("x.text"), 
           #labels = c("A", "B", "C"),
           labels = NULL,
           ncol = 3, nrow = 1)

Top 20 highest frequency 1gram, 2gram and 3 grams for ngrams excluding stop words for the three separate - blogs/news/twitter - corpora:

###
# 3grams from each source excluding stop words
###
 
rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")
  
load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)
 
 # extract top 20 by freq
 top20trig.blogs<-sdf3grm.2pl.blogs[1:20,]; top20trig.blogs$TriGram <- as.character(top20trig.blogs$TriGram)
 top20trig.news<-sdf3grm.2pl.news[1:20,]; top20trig.news$TriGram <- as.character(top20trig.news$TriGram)
 top20trig.twitter<-sdf3grm.2pl.twitter[1:20,]; top20trig.twitter$TriGram <- as.character(top20trig.twitter$TriGram)

 library(ggplot2)
 #install.packages("ggpubr")
 library(ggpubr)
 
 pp1 <- ggplot (top20trig.blogs, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram Blogs" ) +
         ylab( "Freq & RelFreq %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()
 
 pp2 <- ggplot (top20trig.news, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram News" ) +
         ylab( "Freq & RelFreq %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()
 
 pp3 <- ggplot (top20trig.twitter, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram Twits" ) +
         ylab( "Freq & RelFrequency %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()

# Arrange plots (3,1)  
 ggarrange(pp1, pp2, pp3 + rremove("x.text"), 
           #labels = c("A", "B", "C"),
           labels = NULL,
           ncol = 3, nrow = 1)

Cumulative frequencies for 1grams excluding stop words:

###
# Cumulative frequences for 1grams excluding stop words
###
 
# How many unique words do you need in a frequency sorted dictionary to cover 50% 
# of all word instances in the language? 90%?
 
rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")


load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)

# plot number of words vs culative relative frequency.
# this shows how dictinary coverage increases with addition of more words
# it is an increasing with a decreasing rate process. first deriv+, second deriv-
plot(sdf1grm$RowNum[1:50000], sdf1grm$RelCumFreq[1:50000], type="l", lty=1, 
     ylab = "Cumulative Relative Frequency (0%-100% coverage)", xlab = "Top Frequency Words")
lines(sdf1grm.blogs$RelCumFreq, type="l", lty=1, col = "blue")
lines(sdf1grm.news$RelCumFreq, type="l", lty=1, col = "green")
lines(sdf1grm.twitter$RelCumFreq, type="l", lty=1, col = "red")
legend("bottomright", 
       legend = c("Combined", "Blogs", "News", "Tweets"), 
       col = c("black","blue", "green", "red" ), 
       pch = c(19,19, 19, 19), bty = "n", pt.cex = 2, cex = 1.2, 
       text.col = c("black","blue", "green", "red" ), horiz = F 
       #       inset = c(0.1, 0.1))
)

# find some quantiles, that show how quickly the number of words needed rises to 
# achieve more covereage of the dictionary
paste("number of words comprising 49.99 to 50.01% of the dictionary: ", range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 49.99 & sdf1grm$RelCumFreq < 50.01])[1], range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 49.99 & sdf1grm$RelCumFreq < 50.01])[2])
paste("number of words comprising 50.01 to 60.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 50.01 & sdf1grm$RelCumFreq < 60.01])[1],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 50.01 & sdf1grm$RelCumFreq < 60.01])[2])
paste("number of words comprising 60.01 to 70.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 60.01 & sdf1grm$RelCumFreq < 70.01])[1],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 60.01 & sdf1grm$RelCumFreq < 70.01])[2])
paste("number of words comprising 70.01 to 80.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 70.01 & sdf1grm$RelCumFreq < 80.01])[1],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 70.01 & sdf1grm$RelCumFreq < 80.01])[2])
paste("number of words comprising 80.01 to 90.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 80.01 & sdf1grm$RelCumFreq < 90.01])[1],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 80.01 & sdf1grm$RelCumFreq < 90.01])[2])
paste("number of words comprising 90.01 to 95.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 89.99 & sdf1grm$RelCumFreq < 90.01])[2],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 90.01 & sdf1grm$RelCumFreq < 95.01])[2])

TTR calculations for 1grams including stop words

###
# Generates TTR: word.type to token ratio for 1gram including stop words. 
# It demonstrates how rich the language is in a particular corpora. 
# The higher TTR the more unique words are used to achieve to number of tokens
###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_1gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_1gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_1gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.Rdata", verbose = FALSE)

# calculate TTR by source and combined
blogs.tokens <- sum(sdf1grm.blogs$Frequency); blogs.types <- dim(sdf1grm.blogs)[1];
blogs.TTR <- blogs.types/blogs.tokens;
news.tokens <- sum(sdf1grm.news$Frequency); news.types <- dim(sdf1grm.news)[1];
news.TTR <- news.types/news.tokens;
twitter.tokens <- sum(sdf1grm.twitter$Frequency); twitter.types <- dim(sdf1grm.twitter)[1];
twitter.TTR <- twitter.types/twitter.tokens;
tokens <- sum( sdf1grm$Frequency); types <- dim(sdf1grm)[1];
TTR <- types/tokens;

TTR <- data.frame("Source" = c("Blogs", "News", "Twitter", "Corpus"),
                          "Tokens" = c(blogs.tokens,news.tokens,twitter.tokens,tokens), 
                          "Word.Types" = c(blogs.types,news.types,twitter.types,types),
                          "TTR" = c(blogs.TTR,news.TTR,twitter.TTR,TTR))
print("Word Types to tokens ratio (TTR) for corpora with NO stop words removed")
TTR

TTR calculations for 1grams excluding stop words

###
# Generates TTR: word.type to token ratio for 1gram excluding stop words. 
# It demonstrates how rich the language is in a particular corpora. 
# The higher TTR the more unique words are used to achieve to number of tokens
###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)

# calculate TTR by source and combined
blogs.nonstop.tokens <- sum(sdf1grm.blogs$Frequency); blogs.nonstop.types <- dim(sdf1grm.blogs)[1];
blogs.nonstop.TTR <- blogs.nonstop.types/blogs.nonstop.tokens;
news.nonstop.tokens <- sum(sdf1grm.news$Frequency); news.nonstop.types <- dim(sdf1grm.news)[1];
news.nonstop.TTR <- news.nonstop.types/news.nonstop.tokens;
twitter.nonstop.tokens <- sum(sdf1grm.twitter$Frequency); twitter.nonstop.types <- dim(sdf1grm.twitter)[1];
twitter.nonstop.TTR <- twitter.nonstop.types/twitter.nonstop.tokens;
nonstop.tokens <- sum( sdf1grm$Frequency); nonstop.types <- dim(sdf1grm)[1];
nonstop.TTR <- nonstop.types/nonstop.tokens;


TTR.nonstop <- data.frame("Source" = c("Blogs", "News", "Twitter", "Corpus"),
                "Tokens" = c(blogs.nonstop.tokens,news.nonstop.tokens,twitter.nonstop.tokens,nonstop.tokens), 
                "Word.Types" = c(blogs.nonstop.types,news.nonstop.types,twitter.nonstop.types,nonstop.types),
                "TTR" = c(blogs.nonstop.TTR,news.nonstop.TTR,twitter.nonstop.TTR,nonstop.TTR))
print("Word Types to tokens ratio (TTR) for corpora with stop words removed")
TTR.nonstop

Appendix E: Good-Turing smoothing, GT counts, and GT conditional probabilities

The next block of script calculates the Good-Turing smoothing matrix for the counts/frequencies. The result is a small matrix for k=5.

###******************************************************
##******* Derive Good-Turing Smoothing matrix *******####
###******************************************************

# First calcuate the GTS matrix for k=5. After this we use the GT matrix to calculate 
# adjusted counts, and conditional probabilities on the adjusted counts. 
# Once conditional probs are ready we can split the 3grams and 4 grams to singletons and 2plus grams
# the 2plus grams will then be used for the model and to be loaded into the shiny app

# build empty GTS matrix
gtsm<-matrix(c(seq(0,6,1),rep(0,28)),nrow=7,ncol=5,
             dimnames = list(c(seq(0,6,1)),c("count","uni","di","tri", "four")))

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

###
# 1gram
###
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.Rdata", verbose=T)

length_1grm <- dim(sdf1grm)[1]

# fill up gtsm matrix
gtsm[1,2] <- uni.freqfreq[1,2]
gtsm[2:7,2] <- uni.freqfreq[1:6,2]

kFactor <- 6*gtsm[7,2]/gtsm[2,2] # for k = 5
for (c in 0:5){
        num<-((c+1)*gtsm[c+2,2]/gtsm[c+1,2])-(c)*kFactor
        den<- 1-kFactor
        gtsm[c+1,2]<-num/den
}
rm(sdf1grm, uni.freqfreq)

###
# 2gram
###
load("./data/en_US/train.sample/Rdata_output/train70pct_2gram_docs.Rdata", verbose=T)
# fill up gtsm matrix
gtsm[1,3] <- length_1grm^2 - dim(sdf2grm)[1]
gtsm[2:7,3] <- di.freqfreq[1:6,2]

kFactor <- 6*gtsm[7,3]/gtsm[2,3] # for k = 5
for (c in 0:5){
        num<-(c+1)*gtsm[c+2,3]/gtsm[c+1,3]-(c)*kFactor
        den<- 1-kFactor
        gtsm[c+1,3]<-num/den
}
rm(sdf2grm, di.freqfreq)

###
# 3gram
###
load("./data/en_US/train.sample/Rdata_output/train70pct_3gram_docs.Rdata", verbose=T)
# fill up gtsm matrix
gtsm[1,4] <- length_1grm^3 - dim(sdf3grm)[1]
gtsm[2:7,4] <- tri.freqfreq[1:6,2]

kFactor <- 6*gtsm[7,4]/gtsm[2,4] # for k = 5
for (c in 0:5){
        num<-(c+1)*gtsm[c+2,4]/gtsm[c+1,4]-(c)*kFactor
        den<- 1-kFactor
        gtsm[c+1,4]<-num/den
}
rm(sdf3grm, tri.freqfreq)

###
# 4gram
###
load("./data/en_US/train.sample/Rdata_output/train70pct_4gram_docs.Rdata", verbose=T)
# fill up gtsm matrix                                          
gtsm[1,5] <- length_1grm^4 - dim(sdf4grm)[1]
gtsm[2:7,5] <- four.freqfreq[1:6,2]

kFactor <- 6*gtsm[7,5]/gtsm[2,5] # for k = 5
for (c in 0:5){
        num<-(c+1)*gtsm[c+2,5]/gtsm[c+1,5]-(c)*kFactor
        den<- 1-kFactor
        gtsm[c+1,5]<-num/den
}
rm(sdf4grm, four.freqfreq)

save(gtsm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_gtsm.docs.Rdata")

The next block of script uses the Good-Turing smoothing matrix for the counts/frequencies and calculates the adjusted GT counts, and resulting conditional probabilities. The result are the columns GTfreq, and GTprob.

###**************************************************************
#### calculate adjusted counts and conditional probabilities ####
###**************************************************************

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# load GTS matrix
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_gtsm.docs.Rdata", verbose=T)

###
# 1gram
###

load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.Rdata", verbose=T)
sdf1grm <- sdf1grm[,1:2]
gtsm.1grm <- gtsm[,1:2]

# split the frame in 2 at k=5
sdf1grm.1to5 <-sdf1grm[sdf1grm$Frequency < 6,] 
sdf1grm.6pl <-sdf1grm[sdf1grm$Frequency >= 6,] 

# calculate discounted counts for c=1to5
sdf1grm.1to5 <- merge(sdf1grm.1to5,gtsm.1grm, by.x="Frequency", by.y="count", all.x = TRUE)
sdf1grm.1to5 <- sdf1grm.1to5[,c(2,1,3)]; names(sdf1grm.1to5) <- c("pred", "freq", "GTfreq")

# calculate discounted counts for c=1to5
sdf1grm.6pl$GTfreq <- sdf1grm.6pl$Frequency; names(sdf1grm.6pl) <- c("pred", "freq", "GTfreq")

# rbind and calculate conditional probability
sdf1grm <- rbind(sdf1grm.6pl,sdf1grm.1to5)
sdf1grm$GTprob <- sdf1grm$GTfreq / sum(sdf1grm$freq); sdf1grm <- sdf1grm[order(-sdf1grm$GTfreq),]   

# export
sdf1grm <- sdf1grm[,c("pred", "GTfreq", "GTprob")]
save(sdf1grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_1grm.GTfreqprob.docs.Rdata")

sdf1grm <- sdf1grm[,c("pred", "GTprob")]
save(sdf1grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_1grm.GTprob.docs.Rdata")
rm(sdf1grm)


###
# 2gram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# load GTS matrix
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_gtsm.docs.Rdata", verbose=T)


load("./data/en_US/train.sample/Rdata_output/train70pct_2gram_docs.Rdata", verbose=T)
sdf2grm <- sdf2grm[,1:2]
gtsm.2grm <- gtsm[,c(1,3)]

# split the frame in 2 at k=5
sdf2grm.1to5 <-sdf2grm[sdf2grm$Frequency < 6,] 
sdf2grm.6pl <-sdf2grm[sdf2grm$Frequency >= 6,] 

# calculate discounted counts for c=1to5
sdf2grm.1to5 <- merge(sdf2grm.1to5,gtsm.2grm, by.x="Frequency", by.y="count", all.x = TRUE)
sdf2grm.1to5 <- sdf2grm.1to5[,c(2,1,3)]; names(sdf2grm.1to5) <- c("TwoGram", "freq", "GTfreq")

# calculate discounted counts for c=6pl
sdf2grm.6pl$GTfreq <- sdf2grm.6pl$Frequency; names(sdf2grm.6pl) <- c("TwoGram", "freq", "GTfreq")

# rbind and calculate conditional probability
sdf2grm <- rbind(sdf2grm.6pl,sdf2grm.1to5); sdf2grm <- sdf2grm[order(-sdf2grm$GTfreq),]

# extract 1grams from first word in two-gram, w1
sdf2grm$hist<-sub(" .*","",sdf2grm$TwoGram)
# extract 1grams from last word in twogram, w2
sdf2grm$pred<-sub(".* ","",sdf2grm$TwoGram)

# Calculate conditional probabilities
library(dplyr)
        # data frame with history data
        hist <- as.data.frame(sdf2grm[,c("hist","freq")])
        
        # calculate w1 group counts
        hist.gr <- hist %>% group_by(hist) %>% summarise(hist.count = sum(freq)) %>% arrange(-hist.count) %>% as.data.frame()

        # left join the w1.gr counts
        sdf2grm <- merge(sdf2grm,hist.gr, by="hist" , all.x = TRUE)
        
        # 2 gram conditional probability
        sdf2grm$GTprob <-sdf2grm$GTfreq / sdf2grm$hist.count
        sdf2grm <- sdf2grm[order(sdf2grm$hist, -sdf2grm$GTprob),]
        
        rm(hist, hist.gr)
        
        sdf2grm <- sdf2grm[,c(1,5,6,4,7)]
        sdf2gram.good <- sdf2grm
        
# export

save(sdf2grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_2grm.GTfreqprob.docs.Rdata")
     
sdf2grm <- sdf2grm[,c("hist", "pred","GTprob")]
save(sdf2grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_2grm.GTprob.docs.Rdata")
        
rm(sdf2grm)

###
# 3gram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# load GTS matrix
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_gtsm.docs.Rdata", verbose=T)


load("./data/en_US/train.sample/Rdata_output/train70pct_3gram_docs.Rdata", verbose=T)

sdf3grm <- sdf3grm[,1:2]
gtsm.3grm <- gtsm[,c(1,4)]

# split the frame in 2 at k=5
sdf3grm.1to5 <-sdf3grm[sdf3grm$Frequency < 6,] 
sdf3grm.6pl <-sdf3grm[sdf3grm$Frequency >= 6,] 
rm(sdf3grm)

# calculate discounted counts for c=1to5
sdf3grm.1to5 <- merge(sdf3grm.1to5,gtsm.3grm, by.x="Frequency", by.y="count", all.x = TRUE)
sdf3grm.1to5 <- sdf3grm.1to5[,c(2,1,3)]; names(sdf3grm.1to5) <- c("TriGram", "freq", "GTfreq")

# calculate discounted counts for c=6pl
sdf3grm.6pl$GTfreq <- sdf3grm.6pl$Frequency; names(sdf3grm.6pl) <- c("TriGram", "freq", "GTfreq")

# rbind and calculate conditional probability
sdf3grm <- rbind(sdf3grm.6pl,sdf3grm.1to5); sdf3grm <- sdf3grm[order(-sdf3grm$GTfreq),]


# extract 2grams from first words in trigram
DiGram<-sub(" ","#",sdf3grm$TriGram)
DiGram<-sub(" .*","",DiGram)
sdf3grm$hist <- sub("#"," ",DiGram) # w12
rm(DiGram)

# extract 1grams from last word in trigram
sdf3grm$pred <- sub(".* ","",sdf3grm$TriGram) # w3


# Calculate conditional probabilities
library(dplyr)
        # data frame with history data
        hist <- as.data.frame(sdf3grm[,c("hist","freq")])

        # calculate w1 group counts
        hist.gr <- hist %>% group_by(hist) %>% summarise(hist.count = sum(freq)) %>% arrange(-hist.count) %>% as.data.frame()

        # left join the w1.gr counts
        sdf3grm <- merge(sdf3grm,hist.gr, by="hist" , all.x = TRUE)

        # 3 gram conditional probability
        sdf3grm$GTprob <-sdf3grm$GTfreq / sdf3grm$hist.count
        sdf3grm <- sdf3grm[order(sdf3grm$hist, -sdf3grm$GTprob),]

        rm(hist, hist.gr)

        sdf3grm <- sdf3grm[,c(1,5,6,4,7)]
        

# export
        
save(sdf3grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_3grm.GTfreqprob.docs.Rdata")
sdf3grm.2pl<-sdf3grm[which(sdf3grm$GTfreq>1),]
save(sdf3grm.2pl, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_3grm.2pl.GTfreqprob.docs.Rdata")

sdf3grm <- sdf3grm[,c("hist", "pred","GTprob")]
save(sdf3grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_3grm.GTprob.docs.Rdata")
sdf3grm.2pl <- sdf3grm.2pl[,c("hist", "pred","GTprob")]
save(sdf3grm.2pl, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_3grm.2pl.GTprob.docs.Rdata")

rm(sdf3grm,sdf3grm.2pl)
        
###
# 4gram
###

# The 4 gram is a large data set and has to be processed in pieces. 

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# load GTS matrix
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_gtsm.docs.Rdata", verbose=T)

load("./data/en_US/train.sample/Rdata_output/train70pct_4gram_docs.Rdata", verbose=T)

sdf4grm <- sdf4grm[,1:2]
gtsm.4grm <- gtsm[,c(1,5)]

# split the data.frame in 2 at k=5, and pre-process separately, then aggregate
sdf4grm.1to5 <-sdf4grm[sdf4grm$Frequency < 6,] 
sdf4grm.6pl <-sdf4grm[sdf4grm$Frequency >= 6,] 
rm(sdf4grm)

# calculate discounted counts for c=1to5
sdf4grm.1to5 <- merge(sdf4grm.1to5,gtsm.4grm, by.x="Frequency", by.y="count", all.x = TRUE)
sdf4grm.1to5 <- sdf4grm.1to5[,c(2,1,3)]; names(sdf4grm.1to5) <- c("FourGram", "freq", "GTfreq")

# calculate discounted counts for c=6pl
sdf4grm.6pl$GTfreq <- sdf4grm.6pl$Frequency; names(sdf4grm.6pl) <- c("FourGram", "freq", "GTfreq")

# rbind and calculate conditional probability
sdf4grm <- rbind(sdf4grm.6pl,sdf4grm.1to5); sdf4grm <- sdf4grm[order(-sdf4grm$GTfreq),]

# save and reload to save memory
save(sdf4grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_interm.Rdata")

load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_interm.Rdata")

# split data.frame into three parts for the next procedure, and preprocess each part separately

# part 1
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_interm.Rdata")
sdf4grm <- sdf4grm[1:(ceiling(0.33*(dim(sdf4grm)[1]))),]

# part 2
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_interm.Rdata")
sdf4grm <- sdf4grm[(ceiling(0.33*(dim(sdf4grm)[1]))+1):(ceiling(0.66*(dim(sdf4grm)[1]))),]

# part 3
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_interm.Rdata")
sdf4grm <- sdf4grm[(ceiling(0.66*(dim(sdf4grm)[1]))+1):dim(sdf4grm)[1],]

# extract 3grams from first words in fourgram
TriGram<-sub(" ","#",sdf4grm$FourGram)
TriGram<-sub(" ","#",TriGram)
TriGram<-sub(" .*","",TriGram)
sdf4grm$hist <- gsub("#"," ",TriGram)
rm(TriGram)

# extract 1grams from last word in fourgram
sdf4grm$pred <- sub(".* ","",sdf4grm$FourGram)


save(sdf4grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_p3.Rdata")
# save(sdf4grm, 
#      file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_p2.Rdata")
# save(sdf4grm, 
#      file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_p1.Rdata")
rm(sdf4grm)

# aggregate 4gram
library(R.utils); 
sdf4grm.p1 <- loadToEnv("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_p1.Rdata")[["sdf4grm"]]
sdf4grm.p1 <- sdf4grm.p1[,2:5]
sdf4grm.p2 <- loadToEnv("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_p2.Rdata")[["sdf4grm"]]
sdf4grm.p2 <- sdf4grm.p2[,2:5]
sdf4grm.p3 <- loadToEnv("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_p3.Rdata")[["sdf4grm"]]
sdf4grm.p3 <- sdf4grm.p3[,2:5]

sdf4grm <- rbind(sdf4grm.p1, sdf4grm.p2)
rm(sdf4grm.p1,sdf4grm.p2)
sdf4grm <- rbind(sdf4grm, sdf4grm.p3)
rm(sdf4grm.p3)

save(sdf4grm, "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.docs_p123aggr.Rdata")

# Calculate conditional probabilities
library(dplyr)
        # data frame with history data
        hist <- as.data.frame(sdf4grm[,c("hist","freq")])

        # calculate w1 group counts
        hist.gr <- hist %>% group_by(hist) %>% summarise(hist.count = sum(freq)) %>% arrange(-hist.count) %>% as.data.frame()

        # left join the w1.gr counts
        sdf4grm <- merge(sdf4grm,hist.gr, by="hist" , all.x = TRUE)

        # 3 gram conditional probability
        sdf4grm$GTprob <-sdf4grm$GTfreq / sdf4grm$hist.count
        sdf4grm <- sdf4grm[order(sdf4grm$hist, -sdf4grm$GTprob),]

        rm(hist, hist.gr)

        sdf4grm <- sdf4grm[,c(1,5,6,4,7)]

        
# export

save(sdf4grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.GTfreqprob.docs.Rdata")
sdf4grm.2pl<-sdf4grm[which(sdf4grm$GTfreq>0.8),]
save(sdf4grm.2pl, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.2pl.GTfreqprob.docs.Rdata")

sdf4grm <- sdf4grm[,c("hist", "pred","GTprob")]
save(sdf4grm, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.GTprob.docs.Rdata")
sdf4grm.2pl <- sdf4grm.2pl[,c("hist", "pred","GTprob")]
save(sdf4grm.2pl, 
     file = "./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train70pct_4grm.2pl.GTprob.docs.Rdata")

rm(sdf4grm,sdf4grm.2pl)

The last block of script reformats the prepared ngrams to load into the model and the shiny app. The goal is to have only the most necessary information and keep the file sizes to a minimum.

###*****************************
# save files for shiny app ####
###*****************************

load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train100pct_1grm.GTfreqprob.docs.Rdata", verbose=T)
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train100pct_2grm.GTfreqprob.docs.Rdata", verbose=T)
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train100pct_3grm.2pl.GTfreqprob.docs.Rdata", verbose=T)
load("./data/en_US/train.sample/Rdata_output/CondP_and_GoodTuring/train100pct_4grm.2pl.GTfreqprob.docs.Rdata", verbose=T)

# for the 1gram we do not have history, so we create a history column populated by "NA"
OneGram <- cbind(rep("NA",times=dim(sdf1grm)[1]),sdf1grm[,c("pred", "GTfreq", "GTprob" )])
names(OneGram) <- c("hist", "pred", "GTfreq", "GTprob" )
OneGram.2pl<-OneGram[which(OneGram$GTfreq>1),]

TwoGram <- sdf2grm[,c("hist", "pred", "GTfreq", "GTprob" )]
TwoGram.2pl<-TwoGram[which(TwoGram$GTfreq>1),]

TriGram <- sdf3grm.2pl[,c("hist", "pred", "GTfreq", "GTprob" )]
FourGram <- sdf4grm.2pl[,c("hist", "pred", "GTfreq", "GTprob" )] 

save(OneGram, file = "./data/en_US/shiny.app.data/train100pct_OneGram.Rdata")
OneGram <- OneGram.2pl
save(OneGram, file = "./data/en_US/shiny.app.data/train100pct_OneGram.2pl.Rdata")
save(TwoGram, file = "./data/en_US/shiny.app.data/train100pct_TwoGram.Rdata")
TwoGram <- TwoGram.2pl
save(TwoGram, file = "./data/en_US/shiny.app.data/train100pct_TwoGram.2pl.Rdata")
save(TriGram, file = "./data/en_US/shiny.app.data/train100pct_TriGram.2pl.Rdata")
save(FourGram, file = "./data/en_US/shiny.app.data/train100pct_FourGram.2pl.Rdata")

Appendix F: Running the model

To run the model we define a few functions: 1. A function to clean the data, and 2. Four functions to run Katz back-off from different ngram levels, starting Katz back-off from a 4gram, from a 3gram, form a 2 gram and from a 1gram depending on how long the user-inputed text is, and 3 than MAIN function that takes an user-inputed phrase, and utilizes the functions in 1 and 2 to provide a table with all predictions given the user-inputed phrase.

###************************************************************************************
# this script takes the Good-Touring conditional probabilities, and an input phrase to
# predict next word. It uses an algorithm based on the Katz backoff model. The alpha 
# and beta probabilities are calcualted based on the paper by Katz:
# "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer"
###************************************************************************************


###****************************************************
# take an input phrase and clean it ####
###****************************************************

library(tm)
library(textclean)
library(RCurl)

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# load a data set with bad words
bw.data<-read.table("./data/en_US/profanity/en_bws.txt", header=FALSE, sep="\n", strip.white=TRUE)
names(bw.data)<-"bad.words"

# sample test input phrases
input <- "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
input <- "RT : According to the National Retail Federation $16.3 BILLION was spent on #MothersDay last year!!"
input <- " rtvia according to the national retail federation nnumm nnumm billion was spent on mothersday last"
input <- "Linda! Just looked at my sched & I have to hustle back to Chula for P.M. meetings, so no time 4 lunch. :( Do u meet every Fri?"
input <- "Killed my first NYC #cockroach last night w/ - sign me up for the worlds strongest man competition!"

###****************************************************
# function to clean a phrase ####
###****************************************************

cleanInput <-function(input) {
        # 1. Separate words connected with - or /
        input <- gsub("-", " ", input)
        input <- gsub("/", " ", input)
        
        # 2. Establish end of sentence, abbr, number, email, html
        input <- gsub("\\? |\\?$|\\! |\\!$", " EEOSS ", input)
        input <- gsub("[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\. ", " AABRR ", input)
        input <- gsub("\\. |\\.$", " EEOSS ", input)
        input <- gsub("[0-9]+"," NNUMM ",input)
        input <- gsub("\\S+@\\S+","EEMAILL",input) 
        input <- gsub("[Hh}ttp([^ ]+)","HHTMLL",input) 
        input <- gsub("RT | via"," RTVIA ",input) # retweets
        input <- gsub("@([^ ]+)","ATPPLE",input) # @people
        input <- gsub("[@][a - zA - Z0 - 9_]{1,15}","UUSRNMSS",input) # usernames
        
        # 3. to lower
        input <- tolower(input)
        
        # 4. Remove/replace &, @, 'm, 's, 'are, 'll, etc...
        input <- gsub(" & ", " and ", input)
        input <- gsub(" @ ", " at ", input)
        input <- replace_contraction(input)
        input <- gsub("'s", "", input) 
        input <- gsub("haven't", "have not", input)
        input <- gsub("hadn't", "had not", input)
        
        # 5. Remove emoji's, emoticons
        input <- gsub("[^\x01-\x7F]", "", input)
        
        # 6. Remove g, mg, lbs etc; removes all single letters except "a" and "i"
        
        input <- gsub(" [1-9]+g ", " ", input) # grams
        input <- gsub(" [1-9]+mg ", " ", input) # miligrams, etc
        input <- gsub(" [1-9]+kg ", " ", input)
        input <- gsub(" [1-9]+lbs ", " ", input)
        input <- gsub(" [1-9]+s ", " ", input) # seconds, etc
        input <- gsub(" [1-9]+m ", " ", input)
        input <- gsub(" [1-9]+h ", " ", input)
        input <- gsub(" +g ", " ", input) # grams
        input <- gsub(" +mg ", " ", input) # miligrams, etc
        input <- gsub(" +kg ", " ", input)
        input <- gsub(" +lbs ", " ", input)
        input <- gsub(" +s ", " ", input) # seconds, etc
        input <- gsub(" +m ", " ", input)
        input <- gsub(" +h ", " ", input)
        input <- gsub(" +lbs ", " ", input)
        input <- gsub(" +kg ", " ", input)
        
        # 7. remove punctuation
        input <- gsub("[^[:alnum:][:space:]\']", "",input)
        input <- gsub("“", "", input)
        input <- gsub("”", "", input)
        input <- gsub("‘", "", input)
        input <- gsub("’", "", input)
        
        # 8. remove all single letters eccept i and a
        input <- gsub(" u ", " you ", input)
        input <- gsub(" [b-hj-z] ", " ", input)
        
        # 9. remove profanity
        input <- removeWords(input, bw.data[,1])
        
        # 10. remove extra spaces
        # input <- gsub("^[ ]{1,10}","",input)
        # input <- gsub("[ ]{2,10}"," ",input)
        input <- stripWhitespace(input)
        # remove space at end of phrase
        input <- gsub(" $", "", input)
        return(input)
}

The next 4 script blocks are functions that execute Katz backoff from a differnet starting ngram level. Katz alpha is calculated within these functions wherever this is necessary.

###*****************************************************
### function to execute katz back-off from a 1 gram ####
###*****************************************************

predict.1grm <- function(OneGram){

prediction <- OneGram[, c("pred","GTprob","hist", "GTfreq","GTprob")]
names(prediction) <- c("Predicted", "Katz Prob", "History", "GoodTuring Freq", "GoodTuring Prob")

return(prediction)
}

###*****************************************************
### function to execute katz back-off from a 2 gram ####
###*****************************************************

# 1. extract the hist from 2 gram that corresponds to gram2.w1
# 2. extract 1grams
# 3. calculate alpha
# 4. Sort 2 and 1 grams by Katz prob, and create table to output

predict.2grm <- function(TwoGram, OneGram, gram2.w1){
        
        # subset from gram 2 where hist matches gram2.w1
        hist.gram2.match <- TwoGram[which(TwoGram$hist==gram2.w1),]
        
        # subset gram 1 into words predicted by gram 2 or not predicted
        gr1.in.gr2 <- OneGram[OneGram$pred %in% hist.gram2.match$pred,]
        gr1.notin.gr2 <- OneGram[!(OneGram$pred %in% hist.gram2.match$pred),]
        
        # calculate alpha for gram 1(call gama the denominator of alpha)
        beta.gr1 <- 1- sum(hist.gram2.match$GTprob)
        gama.gr1 <- 1 - sum(gr1.in.gr2$GTprob)
        alpha.gr1 <- beta.gr1 / gama.gr1
        
        # Calculate KatzProb for gram 1
        gr1.notin.gr2$Kprob <- gr1.notin.gr2$GTprob * alpha.gr1
        # Calculate KatzProb for gram 2
        hist.gram2.match$Kprob <- hist.gram2.match$GTprob
        
        # rbind gr1 and gr matches, and sort
        prediction <- rbind(hist.gram2.match[1:1000,],gr1.notin.gr2[1:1000,])
        prediction <- prediction[order(-prediction$Kprob),]
        prediction <- prediction[,c("pred", "Kprob", "hist", "GTfreq","GTprob")]
        names(prediction) <- c("Predicted", "Katz Prob", "History", "GoodTuring Freq", "GoodTuring Prob")
        
        rm(hist.gram2.match, gr1.in.gr2, gr1.notin.gr2, beta.gr1, gama.gr1, alpha.gr1)
        
        return(prediction)        
}

###*****************************************************
### function to execute katz back-off from a 3 gram ####
###*****************************************************

predict.3grm <- function(TriGram,TwoGram, OneGram, gram3.w12, gram2.w1){
        
        # gram 3 match and backoff to gram 2
        
        # subset from gram 3 where hist matches gram3.w12
        hist.gram3.match <- TriGram[which(TriGram$hist==gram3.w12),]
        
        # subset from gram 2 where hist matches gram2.w1
        hist.gram2.match <- TwoGram[which(TwoGram$hist==gram2.w1),]
        
        # subset gram 2 into words predicted by gram 3 or not predicted
        gr2.in.gr3 <- hist.gram2.match[hist.gram2.match$pred %in% hist.gram3.match$pred,]
        gr2.notin.gr3 <- hist.gram2.match[!(hist.gram2.match$pred %in% hist.gram3.match$pred),]
        
        # calculate alpha for gram 2(call gama the denominator of alpha)
        beta.gr2 <- 1- sum(hist.gram3.match$GTprob)
        gama.gr2 <- 1 - sum(gr2.in.gr3$GTprob)
        alpha.gr2 <- beta.gr2 / gama.gr2
        
        # Calculate KatzProb (Kprob) for gram 2
        gr2.notin.gr3$Kprob <- gr2.notin.gr3$GTprob * alpha.gr2
        # Calculate KatzProb for gram 3
        hist.gram3.match$Kprob <- hist.gram3.match$GTprob
        
        #### backoff to gram 1
       
        # subset gram 1 into words predicted by gram 2 or not predicted
        gr1.in.gr2 <- OneGram[OneGram$pred %in% hist.gram2.match$pred,]
        gr1.notin.gr2 <- OneGram[!(OneGram$pred %in% hist.gram2.match$pred),]
        
        # calculate alpha for gram 1(call gama the denominator of alpha)
        beta.gr1 <- 1- sum(hist.gram2.match$GTprob)
        gama.gr1 <- 1 - sum(gr1.in.gr2$GTprob)
        alpha.gr1 <- beta.gr1 / gama.gr1
        
        # Calculate KatzProb for gram 1
        gr1.notin.gr2$Kprob <- gr1.notin.gr2$GTprob * alpha.gr1 * alpha.gr2
        # Calculate KatzProb for gram 2
        
        # rbind gr1 and gr matches, and sort
        prediction <- rbind(hist.gram3.match[1:1000,],gr2.notin.gr3[1:1000,] ,gr1.notin.gr2[1:1000,])
        prediction <- prediction[order(-prediction$Kprob),]
        prediction <- prediction[,c("pred", "Kprob", "hist", "GTfreq","GTprob")]
        names(prediction) <- c("Predicted", "Katz Prob", "History", "GoodTuring Freq", "GoodTuring Prob")
        
        rm(hist.gram3.match, hist.gram2.match, gr2.in.gr3, gr2.notin.gr3, 
           gr1.in.gr2, gr1.notin.gr2, beta.gr2, gama.gr2, alpha.gr2,
           beta.gr1, gama.gr1, alpha.gr1)
        
        return(prediction)
}

###*****************************************************
### function to execute katz back-off from a 4 gram ####
###*****************************************************

predict.4grm <- function(FourGram, TriGram,TwoGram, OneGram, gram4.w123, gram3.w12, gram2.w1){
        
        # gram 4 match and backoff to gram 1
        
        # subset from gram 4 where hist matches gram4.w123
        hist.gram4.match <- FourGram[which(FourGram$hist==gram4.w123),]
        
        # subset from gram 3 where hist matches gram3.w12
        hist.gram3.match <- TriGram[which(TriGram$hist==gram3.w12),]
        
        # subset gram 3 into words predicted by gram 4 or not predicted
        gr3.in.gr4 <- hist.gram3.match[hist.gram3.match$pred %in% hist.gram4.match$pred,]
        gr3.notin.gr4 <- hist.gram3.match[!(hist.gram3.match$pred %in% hist.gram4.match$pred),]
        
        # calculate alpha for gram 2(call gama the denominator of alpha)
        beta.gr3 <- 1- sum(hist.gram4.match$GTprob)
        gama.gr3 <- 1 - sum(gr3.in.gr4$GTprob)
        alpha.gr3 <- beta.gr3 / gama.gr3
        
        # Calculate KatzProb (Kprob) for gram 3
        gr3.notin.gr4$Kprob <- gr3.notin.gr4$GTprob * alpha.gr3
        # Calculate KatzProb for gram 4
        hist.gram4.match$Kprob <- hist.gram4.match$GTprob
        
        # gram 3 match and backoff to gram 2
        
        # subset from gram 2 where hist matches gram2.w1
        hist.gram2.match <- TwoGram[which(TwoGram$hist==gram2.w1),]
        
        # subset gram 2 into words predicted by gram 3 or not predicted
        gr2.in.gr3 <- hist.gram2.match[hist.gram2.match$pred %in% hist.gram3.match$pred,]
        gr2.notin.gr3 <- hist.gram2.match[!(hist.gram2.match$pred %in% hist.gram3.match$pred),]
        
        # calculate alpha for gram 2(call gama the denominator of alpha)
        beta.gr2 <- 1- sum(hist.gram3.match$GTprob)
        gama.gr2 <- 1 - sum(gr2.in.gr3$GTprob)
        alpha.gr2 <- beta.gr2 / gama.gr2
        
        # Calculate KatzProb (Kprob) for gram 2
        gr2.notin.gr3$Kprob <- gr2.notin.gr3$GTprob * alpha.gr2 * alpha.gr3
        
        #### backoff to gram 1
        
        # subset gram 1 into words predicted by gram 2 or not predicted
        gr1.in.gr2 <- OneGram[OneGram$pred %in% hist.gram2.match$pred,]
        gr1.notin.gr2 <- OneGram[!(OneGram$pred %in% hist.gram2.match$pred),]
        
        # calculate alpha for gram 1(call gama the denominator of alpha)
        beta.gr1 <- 1- sum(hist.gram2.match$GTprob)
        gama.gr1 <- 1 - sum(gr1.in.gr2$GTprob)
        alpha.gr1 <- beta.gr1 / gama.gr1
        
        # Calculate KatzProb for gram 1
        gr1.notin.gr2$Kprob <- gr1.notin.gr2$GTprob * alpha.gr1 * alpha.gr2 * alpha.gr3
        # Calculate KatzProb for gram 2
        
        # rbind gr1 and gr matches, and sort
        prediction <- rbind(hist.gram4.match[1:1000,], gr3.notin.gr4[1:1000,], gr2.notin.gr3[1:1000,] ,gr1.notin.gr2[1:1000,])
        prediction <- prediction[order(-prediction$Kprob),]
        prediction <- prediction[,c("pred", "Kprob", "hist", "GTfreq","GTprob")]
        names(prediction) <- c("Predicted", "Katz Prob", "History", "GoodTuring Freq", "GoodTuring Prob")
        
        rm(hist.gram4.match, hist.gram3.match, hist.gram2.match, 
           gr3.in.gr4, gr3.notin.gr4, gr2.in.gr3, gr2.notin.gr3, 
           gr1.in.gr2, gr1.notin.gr2, beta.gr3, gama.gr3, alpha.gr3,
           beta.gr2, gama.gr2, alpha.gr2, beta.gr1, gama.gr1, alpha.gr1)
        
        return(prediction)
}

The next block of script provides the main function for running the model. It takes the input phrase and the prepared 4 ngrams and outputs a data.frame called prediction that holds a detailed view of what the algorithm predicted inlcluding the following fields:

Predicted word, Katz Probaility, History, GTfreq, GTprob

###***********************************************
### main function ####
###***********************************************

predWord.4grm <- function(input, OneGram, TwoGram, TriGram, FourGram){
        
        # 1. Check of the phrase is longer than one word
        # 2. Clean the phrase
        # 3. Check if the last word in the phrase is a break.word (defined below) and if yes = error
        # 4. Check if any of the words in the phrase is a break.word (defined below) and if yes 
        # move to an n-1 gram and repeat step 4 until you find an n-gram with no break words
        
        # number of words in user input, pre-cleaning
        n.words.input <-length(strsplit(input, "\\s+")[[1]])
        
        # error message if no input 
        if (n.words.input <1) stop("Please input at least one word")    # error handling
        
        # clean the input phrase and Count number of words in the phrase
        clean.input <- cleanInput(input)
        clean.input.words <- strsplit(clean.input, "\\s+")[[1]] # vector of the words in the clean input
        n.words <-length(clean.input.words)
        
        # a vector with words that break an ngram if located within the ngram
        ngram.break <- as.list(c("eeoss", "aabrr", "nnumm", "eemaill", "hhtmll", "rtvia", "atpple", "uusrnmss"))
        
        # if the last word is a ngram.break word, something that's not an english word, stop
        if (any(unlist(lapply(ngram.break, function(x) grepl(x,clean.input.words[n.words]))))) 
                stop("The last sequence of characters is something other than an English word.\n",
                     "Please input at least one word.")    # error handling 
        
        ###*******************************
        # phrase is at least 3 words long 
        ###*******************************
        
        # 1. extract 3,2,1 last words from phrase
        # 2. check longest 3 word extract for break.words. If yes go down to 2 and then 1 word
        # 3. once a group of words contains no break.words then we execute Katz backoff from this level of ngram
        # suppose the last 3 words contain a break.word, but the last 2 don't, then
        # we execute Katz backoff from ngram 3.
        
        if (n.words >= 3) {
                # extract the (n-1) words from ngrams from the last words in the phrase
                gram4.w123 <- paste(clean.input.words[n.words - 2],
                                    clean.input.words[n.words - 1],
                                    clean.input.words[n.words], sep=" ")
                gram3.w12 <- sub("^[a-z]+ ","",gram4.w123)
                gram2.w1 <- sub("^[a-z]+ ","",gram3.w12)
                
                # if any of the words in the ngram4.w123 is a .ngram.break word, then fail, move to n-1 ngram
                if ( any(unlist(lapply(ngram.break, function(x) grepl(x,gram4.w123)))) ){
                        # if any of the words in the ngram3.w12 is a .ngram.break word, then fail, move to n-1 ngram
                        if ( any(unlist(lapply(ngram.break, function(x) grepl(x,gram3.w12)))) ){
                                ### execute from a 2 gram as the two gram has already been checked for EOS breakwords
                                match.w1.count <- sum(TwoGram[which(TwoGram$hist==gram2.w1),"GTfreq"])
                                if (match.w1.count == 0) { # match.w1.count=0 in gram2 therefore use Katz backoff to gram1
                                        # export matches from OneGram table
                                        prediction <- predict.1grm(OneGram)
                                }
                                
                                else { # export matches from TwoGram$hist==gram2.w1
                                        prediction <- predict.2grm(TwoGram, OneGram, gram2.w1)
                                }
                                
                        }
                        else { ### execute from a 3 gram, Katz back-off 
                                match.w12.count <- sum(TriGram[which(TriGram$hist==gram3.w12),"GTfreq"])
                                if (match.w12.count == 0) { # match.w12.count=0 therefore use Katz backoff to gram2.w1
                                        match.w1.count <- sum(TwoGram[which(TwoGram$hist==gram2.w1),"GTfreq"])
                                        if (match.w1.count == 0) { # match.w1.count=0 in gram2 therefore use Katz backoff to gram1
                                                # export matches from OneGram table
                                                prediction <- predict.1grm(OneGram)
                                        }
                                        
                                        else { # export matches from TwoGram$hist==gram2.w1
                                                prediction <- predict.2grm(TwoGram, OneGram, gram2.w1)
                                        }
                                }
                                else { # export matches from TriGram$w12==gram3.w12
                                        prediction <- predict.3grm(TriGram,TwoGram, OneGram, gram3.w12, gram2.w1)            
                                }              
                        }
                        
                } else { # There are no break words in 4gram gram4.w123.
                       # execute from a 4 gram, Katz back-of
                        
                        ###
                        # start checking for matches and working backwards w Katz back-off when necessary
                        ###
                        
                        # Count how many times we find the gram4.w123 in the 4 gram table, in FourGram$w123
                        match.w123.count <- sum(FourGram[which(FourGram$hist==gram4.w123),"GTfreq"])
                        
                        if (match.w123.count == 0) { # match.w123.count=0 therefore use Katz backoff to gram3.w12
                                match.w12.count <- sum(TriGram[which(TriGram$hist==gram3.w12),"GTfreq"])
                                if (match.w12.count == 0) { # match.w12.count=0 therefore use Katz backoff to gram2.w1
                                        match.w1.count <- sum(TwoGram[which(TwoGram$hist==gram2.w1),"GTfreq"])
                                        if (match.w1.count == 0) { # match.w1.count=0 in gram2 therefore use Katz backoff to gram1
                                                # export matches from OneGram table
                                                prediction <- predict.1grm(OneGram)
                                                 }
                                        
                                        else { # export matches from TwoGram$hist==gram2.w1
                                                prediction <- predict.2grm(TwoGram, OneGram, gram2.w1)
                                        }
                                }
                                else { # export matches from TriGram$w12==gram3.w12
                                        prediction <- predict.3grm(TriGram,TwoGram, OneGram, gram3.w12, gram2.w1)            
                                }        
                        }
                        else { # export matches from FourGram$w123==gram4.w123
                                prediction <- predict.4grm(FourGram, TriGram,TwoGram, OneGram, gram4.w123, gram3.w12, gram2.w1)
                        }
                }
        }
                ###*********************
                # phrase is 2 words long
                ###*********************
                else if (n.words==2 ){
                        # extract the (n-1) words from ngrams from the last words in the phrase
                        gram3.w12 <- clean.input
                        gram2.w1 <- sub("^[a-z]+ ","",gram3.w12)
                        
                        # if any of the words in the ngram3.w12 is a .ngram.break word, then fail, move to n-1 ngram
                        if ( any(unlist(lapply(ngram.break, function(x) grepl(x,gram3.w12)))) ){
                                ### execute from a 2 gram as the two gram has already been checked for EOS breakwords
                                match.w1.count <- sum(TwoGram[which(TwoGram$hist==gram2.w1),"GTfreq"])
                                if (match.w1.count == 0) { # match.w1.count=0 in gram2 therefore use Katz backoff to gram1
                                        # export matches from OneGram table
                                        prediction <- predict.1grm(OneGram)
                                }
                                
                                else { # export matches from TwoGram$hist==gram2.w1
                                        prediction <- predict.2grm(TwoGram, OneGram, gram2.w1)
                                }
                                
                        }
                        else { ### execute from a 3 gram, Katz back-off 
                                match.w12.count <- sum(TriGram[which(TriGram$hist==gram3.w12),"GTfreq"])
                                if (match.w12.count == 0) { # match.w12.count=0 therefore use Katz backoff to gram2.w1
                                        match.w1.count <- sum(TwoGram[which(TwoGram$hist==gram2.w1),"GTfreq"])
                                        if (match.w1.count == 0) { # match.w1.count=0 in gram2 therefore use Katz backoff to gram1
                                                # export matches from OneGram table
                                                prediction <- predict.1grm(OneGram)
                                        }
                                        
                                        else { # export matches from TwoGram$hist==gram2.w1
                                                prediction <- predict.2grm(TwoGram, OneGram, gram2.w1)
                                        }
                                }
                                else { # export matches from TriGram$w12==gram3.w12
                                        prediction <- predict.3grm(TriGram,TwoGram, OneGram, gram3.w12, gram2.w1)            
                                }              
                        }
                        
                }
                ###*************************
                # phrase is just 1 word long 
                ###*************************
                else {
                        gram2.w1 <- clean.input
        
                        ### execute from a 2 gram as the two gram has already been checked for EOS breakwords
                        match.w1.count <- sum(TwoGram[which(TwoGram$hist==gram2.w1),"GTfreq"])
                        if (match.w1.count == 0) { # match.w1.count=0 in gram2 therefore use Katz backoff to gram1
                                # export matches from OneGram table
                                prediction <- predict.1grm(OneGram)
                        }
                        
                        else { # export matches from TwoGram$hist==gram2.w1
                                prediction <- predict.2grm(TwoGram, OneGram, gram2.w1)
                        }
                }
return(prediction)       
}

Appendix G: Shiny app specific output formats based on a prediction from Appendix F

The first/main tab of the shiny app provides an output table where the first 10 Katz-probability groups are presented. A probabiity group is a set of predictions that have the same Katz probbaility. Thus the table can be more than 10 rows long, depending on the input phrase, but it will present only 10 probability groups (or whatever groups number is requested). The idea for this table and the design of the script is borrowed from Gerald Gendron and all credit goes to him.

groupsTable <-function(prediction,groups) {
                        ##
                        ## This function takes in the long prediction table and number of probability
                        ## groups requested by the user. A group is a set of rows from the prediction
                        ## table where the probability is the same. 
                        ##
                        ## It returns the list of information for the requested number of groups.
                        ## The table is to be used for the front page as a short output from the model
                        ## The second page of the shiny app is to hold a larger table presenting 
                        ## say 100-500 rows of the prediction table
                        ## 
                        ##
                        diff = 0 # initialize counter - compares diff in probabilities of rows
                        rows = 1 # initialize counter - counts total rows to subset from prediction table
                        nrows <- dim(prediction)[1]
                        if (nrows <= groups){        # if the total trigrams found equals one
                                groups.table <- prediction   # simply use the one phrase and return as output
                                return(groups.table)
                        } else {               # else build table of outputs based on cluster input
                                remain = nrows - 1  # counter to work through list
                                while (diff < groups && remain >0) {
                                        # take differences in KatzProb between rows 
                                        if (prediction[rows,2] - prediction[rows+1,2] > .00001) diff=diff + 1
                                        rows <-rows + 1
                                        remain <- remain - 1    # calculates when end-of-list is reached
                                }
                                if (remain == 0) {
                                        groups.table <- prediction[1:rows,]  # when list is fully used
                                        return(groups.table)
                                } else {
                                        groups.table <- prediction[1:rows-1,]  # if max is reached first
                                        return(groups.table)
                                }
                        }
                }

Appendix H: Out of sample validation tests

This appendix presents the scripts used to run the OOS validation tests. I provide examples with the test on the 2 gram and 4 gram history.

###****************************
# out of sample for the 2 gram
###****************************
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")
# model training data
load("./data/en_US/train.sample/Rdata_output/validation/train70pct_OneGram.2pl.Rdata", verbose=T)
load("./data/en_US/train.sample/Rdata_output/validation/train70pct_TwoGram.2pl.Rdata", verbose=T)
# validation data
load("./data/en_US/train.sample/Rdata_output/validation/val15pct_2gram.2pl_docs.Rdata", verbose=T)

###*****************************************************
### function to execute katz back-off from a 2 gram ###
###*****************************************************
sdf2grm <- sdf2grm.2pl

# function to excute Katz back-off from a 2 gram
predict.2grm <- function(TwoGram, OneGram, gram2.w1){
        
        # subset from gram 2 where hist matches gram2.w1
        hist.gram2.match <- TwoGram[which(TwoGram$hist==gram2.w1),]
        
        # subset gram 1 into words predicted by gram 2 or not predicted
        gr1.in.gr2 <- OneGram[OneGram$pred %in% hist.gram2.match$pred,]
        gr1.notin.gr2 <- OneGram[!(OneGram$pred %in% hist.gram2.match$pred),]
        
        # calculate alpha for gram 1(call gama the denominator of alpha)
        beta.gr1 <- 1- sum(hist.gram2.match$GTprob)
        gama.gr1 <- 1 - sum(gr1.in.gr2$GTprob)
        alpha.gr1 <- beta.gr1 / gama.gr1
        
        # Calculate KatzProb for gram 1
        gr1.notin.gr2$Kprob <- gr1.notin.gr2$GTprob * alpha.gr1
        # Calculate KatzProb for gram 2
        hist.gram2.match$Kprob <- hist.gram2.match$GTprob
        
        # rbind gr1 and gr matches, and sort
        prediction <- rbind(hist.gram2.match[1:1000,],gr1.notin.gr2[1:1000,])
        prediction <- prediction[order(-prediction$Kprob),]
        prediction <- prediction[,c("pred", "Kprob", "hist", "GTfreq","GTprob")]
        names(prediction) <- c("Predicted", "Katz Prob", "History", "GoodTuring Freq", "GoodTuring Prob")
        
        rm(hist.gram2.match, gr1.in.gr2, gr1.notin.gr2, beta.gr1, gama.gr1, alpha.gr1)
        
        return(prediction)        
}

# predicted words
for (i in 1:nrow(sdf2grm)) {
        predicted <- predict.2grm(TwoGram, OneGram, sdf2grm[i,4]) # 4 is history column
        sdf2grm$model.pred1[i] <- predicted[1,1] 
        sdf2grm$model.pred2[i] <- predicted[2,1]
        sdf2grm$model.pred3[i] <- predicted[3,1]
}

# check for matches and calculate success rate
sdf2grm$match1 <- as.numeric(sdf2grm$pred == sdf2grm$model.pred1)*sdf2grm$freq
sdf2grm$match2 <- as.numeric(sdf2grm$pred == sdf2grm$model.pred2)*sdf2grm$freq
sdf2grm$match3 <- as.numeric(sdf2grm$pred == sdf2grm$model.pred3)*sdf2grm$freq

print( oos1 <- sum(sdf2grm$match1) / sum(sdf2grm$freq) )
print( oos3 <- sum(sdf2grm$match1, sdf2grm$match2, sdf2grm$match3) / sum(sdf2grm$freq) )

###****************************
# out of sample for the 4 gram
###****************************
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")
# model training data
load("./data/en_US/train.sample/Rdata_output/validation/train70pct_OneGram.2pl.Rdata", verbose=T)
load("./data/en_US/train.sample/Rdata_output/validation/train70pct_TwoGram.2pl.Rdata", verbose=T)
load("./data/en_US/train.sample/Rdata_output/validation/train70pct_TriGram.2pl.Rdata", verbose=T)
load("./data/en_US/train.sample/Rdata_output/validation/train70pct_FourGram.2pl.Rdata", verbose=T)
# validation data
load("./data/en_US/train.sample/Rdata_output/validation/val15pct_4gram.2pl_docs.Rdata", verbose=T)

###*****************************************************
### function to execute katz back-off from a 4 gram ###
###*****************************************************
sdf4grm <- sdf4grm.2pl[,c("hist", "pred", "freq")]
rm(sdf4grm.2pl)

# function to excute Katz back-off from a 4 gram
predict.4grm <- function(FourGram, TriGram,TwoGram, OneGram, gram4.w123, gram3.w12, gram2.w1){
        
        # gram 4 match and backoff to gram 1
        
        # subset from gram 4 where hist matches gram4.w123
        hist.gram4.match <- FourGram[which(FourGram$hist==gram4.w123),]
        
        # subset from gram 3 where hist matches gram3.w12
        hist.gram3.match <- TriGram[which(TriGram$hist==gram3.w12),]
        
        # subset gram 3 into words predicted by gram 4 or not predicted
        gr3.in.gr4 <- hist.gram3.match[hist.gram3.match$pred %in% hist.gram4.match$pred,]
        gr3.notin.gr4 <- hist.gram3.match[!(hist.gram3.match$pred %in% hist.gram4.match$pred),]
        
        # calculate alpha for gram 2(call gama the denominator of alpha)
        beta.gr3 <- 1- sum(hist.gram4.match$GTprob)
        gama.gr3 <- 1 - sum(gr3.in.gr4$GTprob)
        alpha.gr3 <- beta.gr3 / gama.gr3
        
        # Calculate KatzProb (Kprob) for gram 3
        gr3.notin.gr4$Kprob <- gr3.notin.gr4$GTprob * alpha.gr3
        # Calculate KatzProb for gram 4
        hist.gram4.match$Kprob <- hist.gram4.match$GTprob
        
        # gram 3 match and backoff to gram 2
        
        # subset from gram 2 where hist matches gram2.w1
        hist.gram2.match <- TwoGram[which(TwoGram$hist==gram2.w1),]
        
        # subset gram 2 into words predicted by gram 3 or not predicted
        gr2.in.gr3 <- hist.gram2.match[hist.gram2.match$pred %in% hist.gram3.match$pred,]
        gr2.notin.gr3 <- hist.gram2.match[!(hist.gram2.match$pred %in% hist.gram3.match$pred),]
        
        # calculate alpha for gram 2(call gama the denominator of alpha)
        beta.gr2 <- 1- sum(hist.gram3.match$GTprob)
        gama.gr2 <- 1 - sum(gr2.in.gr3$GTprob)
        alpha.gr2 <- beta.gr2 / gama.gr2
        
        # Calculate KatzProb (Kprob) for gram 2
        gr2.notin.gr3$Kprob <- gr2.notin.gr3$GTprob * alpha.gr2 * alpha.gr3
        
        #### backoff to gram 1
        
        # subset gram 1 into words predicted by gram 2 or not predicted
        gr1.in.gr2 <- OneGram[OneGram$pred %in% hist.gram2.match$pred,]
        gr1.notin.gr2 <- OneGram[!(OneGram$pred %in% hist.gram2.match$pred),]
        
        # calculate alpha for gram 1(call gama the denominator of alpha)
        beta.gr1 <- 1- sum(hist.gram2.match$GTprob)
        gama.gr1 <- 1 - sum(gr1.in.gr2$GTprob)
        alpha.gr1 <- beta.gr1 / gama.gr1
        
        # Calculate KatzProb for gram 1
        gr1.notin.gr2$Kprob <- gr1.notin.gr2$GTprob * alpha.gr1 * alpha.gr2 * alpha.gr3
        # Calculate KatzProb for gram 2
        
        # rbind gr1 and gr matches, and sort
        prediction <- rbind(hist.gram4.match[1:1000,], gr3.notin.gr4[1:1000,], gr2.notin.gr3[1:1000,] ,gr1.notin.gr2[1:1000,])
        prediction <- prediction[order(-prediction$Kprob),]
        prediction <- prediction[,c("pred", "Kprob", "hist", "GTfreq","GTprob")]
        names(prediction) <- c("Predicted", "Katz Prob", "History", "GoodTuring Freq", "GoodTuring Prob")
        
        rm(hist.gram4.match, hist.gram3.match, hist.gram2.match, 
           gr3.in.gr4, gr3.notin.gr4, gr2.in.gr3, gr2.notin.gr3, 
           gr1.in.gr2, gr1.notin.gr2, beta.gr3, gama.gr3, alpha.gr3,
           beta.gr2, gama.gr2, alpha.gr2, beta.gr1, gama.gr1, alpha.gr1)
        
        return(prediction)
}


# predicted words
for (i in 1:nrow(sdf4grm)) {
        gram4.w123 <- sdf4grm[i,1]
        gram3.w12 <- sub("^[a-z]+ ","",gram4.w123)
        gram2.w1 <- sub("^[a-z]+ ","",gram3.w12)
        
        predicted <- predict.4grm(FourGram, TriGram,TwoGram, OneGram, gram4.w123, gram3.w12, gram2.w1)
        sdf4grm$model.pred1[i] <- predicted[1,1] 
        sdf4grm$model.pred2[i] <- predicted[2,1]
        sdf4grm$model.pred3[i] <- predicted[3,1]
        print(paste0("row # ", i))
}

# check for matches and calculate success rate
sdf4grm$match1 <- as.numeric(sdf4grm$pred == sdf4grm$model.pred1)*sdf4grm$freq
sdf4grm$match2 <- as.numeric(sdf4grm$pred == sdf4grm$model.pred2)*sdf4grm$freq
sdf4grm$match3 <- as.numeric(sdf4grm$pred == sdf4grm$model.pred3)*sdf4grm$freq

print( oos1 <- sum(sdf4grm$match1) / sum(sdf4grm$freq) )
print( oos3 <- sum(sdf4grm$match1, sdf4grm$match2, sdf4grm$match3, na.rm=T) / sum(sdf4grm$freq) )