Predict a Sequence of Words

This report describes my work on the John Hopkins’ Data Science Capstone project (provided via Coursera). For the current report the focus is on data pre-processing and exploratory analysis. In addition a tentative plan for modeling of the text corpus is provided.

Project Description from Coursera

This section primarily uses text directly cited from the Coursera website of the John Hopkins’ data Science Capstone project.

    Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

    "I went to the"

    the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

    This course will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, you will use the knowledge you gained in data products to build a predictive text product you can show off to your family, friends, and potential employers.

    The capstone includes the following deliverables:

    - An intermediate R markdown report that describes in plain language, plots, and code your exploratory analysis of the course data set. (the main focus of the current report)
    - A Shiny app that takes as input a phrase (multiple words), one clicks submit, and it predicts the next word.
    - A 5 slide deck created with R presentations pitching your algorithm and app to your boss or investor.

Executive summary

This report provides a description of the following steps:

data description
data sampling into train, validation, and test sets
data pre-processing including generation of ngrams
exploratory analysis
plan for modeling the text corpus
Appendix including all R code needed to process the first 4 steps above

Data Description

This section primarily uses text directly cited from the Coursera website of the John Hopkins’ data Science Capstone project. Some comments and data summaries are provided by the author.

    The corpora are collected from publicly available sources by a web crawler. The crawler checks for language, so as to mainly get texts consisting of the desired language*.

    Each entry is tagged with it's date of publication. Where user comments are included they will be tagged with the date of the main entry.

    Each entry is tagged with the type of entry, based on the type of website it is collected from (e.g. newspaper or personal blog) If possible, each entry is tagged with one or more subjects based on the title or keywords of the entry (e.g. if the entry comes from the sports section of a newspaper it will be tagged with "sports" subject).In many cases it's not feasible to tag the entries (for example, it's not really practical to tag each individual Twitter entry, though I've got some ideas which might be implemented in the future) or no subject is found by the automated process, in which case the entry is tagged with a '0'.

    To save space, the subject and type is given as a numerical code.

    The files have been language filtered but may still contain some foreign text.

    Once the raw corpus has been collected, it is parsed further, to remove duplicate entries and split into individual lines. Approximately 50% of each entry is then deleted. Since you cannot fully recreate any entries, the entries are anonymised and this is a non-profit venture I believe that it would fall under Fair Use.

    This exercise uses the files named LOCALE.blogs.txt where LOCALE is the each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora.

The data is sourced from the following location:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The blogs corpus is comprised of paragraphs, where each paragraph is a line (or document) in the corpus. The news corpus is also comprised of paragraphs, where each paragraph is a line (or document) in the corpus. The twitter corpus is comprised of tweets, where each tweet is a line (or document) in the corpus. Initial summarization of the data provides the size of each corpus, number of lines and number of words in the table below.

##           file.name size(MB) num.of.lines num.of.words
## 1   en_US.blogs.txt   200.42       899288     37334149
## 2    en_US.news.txt   196.28      1010242     34372814
## 3 en_US.twitter.txt   159.36      2360148     30373565

We observe that the the number of words are overall similar across the three source documents. However, since each line in the blogs and news documents is a paragraph, the twitter document contains more than twice as many lines as the news and blogs documents.

Another thing to note is that each file is already plenty large in terms of MB size. R keeps the data into memory and appears to handle it very inefficiently, meaning the size of the data explodes when loaded and processed in R. As a result it most often won’t be tractable to work with the combined corpus directly to pre-process the data.

The R code used to produce this table is presented in Appendix D.

Data sampling

The data was split into 4 sub samples for each corpus source: 15% testing sample, 15% validation/development sample, 70% training sample. Out of the 70% training sample a 10% sub sample is selected to test code. Each source, blogs/news/twitter is sub sampled separately, and subsequently the separate source corpora are aggregated into a 15/15/70/10 corpus including all three sources. The random selection of lines from each corpus is achieved with binomial random sampling.

The 10% training sample is used to test code, and will likely further be used to test initial models. The 70% training sample is used for exploratory analysis and will be used later to train the models if my machine has enough memory to support this much data.

The R code used to produce the samples is presented in Appendix A.

Data pre-processing and preparation for analysis

The original data, as obtained by us from Coursera-SwiftKey, contains many irregularities that need to be addressed before the data is ready for exploratory analysis or modeling. For example the data contains emails, http/s addresses, emojis, contractions as “don’t”, upper/lower case, sympols as & @, etc… that have to either be removed or replaced/expanded (“don’t” expands into do not, e.g.) before ngrams are created.

The pre-processing code has taken many, many iterations and testing of each procedure. I hope I am not missing something important as a transformation, but that likely is not the case. Please advise if you see something that may not be producing the intended effect, or some majorly important transformation I am missing.

We apply the following transformations using the tm package in R, in the exact sequence described below:

Step 1. Remove duplicate rows
Step 2. Separate words connected with - or /. This helps later when we remove punctuation, so that two words do not stick to each
Step 3. Establish end of sentence, and replace it with “EEOSS”. Identify abbreviations as H.S.B.C. and replace them with “AABRR”. Identify numbers and replace them with “AABRR”. We need to do this so that when we are forming ngrams, the 2/3/4ngrams are not formed over end of sentences, abbreviations or numbers. For example consider the following tweet line
```
  "RT : According to the N.R.T $16.3 BILLION was spent on #MothersDay last year! Of the $16.3 BILLION spent $1 BILLION was spent on flowers"
```

If we were to directly exclude numbers, and then punctuation, we may form a 3gram reading “to the BILLION” which is incorrect as it does not really exist in this sentence. In addition, if we just remove end of sentence in “!”, we would form a 4gram that reads “year of the BILLION”, which also was not intended by the person who wrote the tweet.

Step 4. Remove email and http/s links
Step 5. Remove retweet entries (RT:), Remove @people, twitter usernames
Step 6. Transform to lower case. For the purposes of our models “Hello” and “hello” should be treated as the same piece of information.
Step 7. Remove/replace &, @, ’m, ’s, ’are, ’ll, and all other contractions
Step 8. Remove emoji’s, emoticons like ❤#RelayForLife ❤“😖😰💔💔”
Step 9. Remove g, mg, lbs, s, m, h etc abbreviations; Removes all single letters except “a” and “i”
Step 10. Remove numbers (if such still exist)
Step 11. Remove punctuation. I remove punctuation via the tm_map available function, while “preserving preserve_intra_word_dashes = TRUE” if such dashes still exist. In addition I separately remove the following symbols that were not removed by the tm_map functionality: ““”,””“,”‘“,”’“.
Step 12. Remove profanity. I don’t want my predictions to suggest profanity words. For this purpose i may train a model with these words in and remove them in the end by re-calibrating the model on non-profanity data. Or else, which makes more sense, is to remove them from the beginning and not train the model on those. The latter choice is applied. To remove profanity a list with profanity words is downloaded from:
```
  https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
```

and the list is used to filter out the “bad words”.

Step 13. Remove stop words. From this point forward i keep two types of corpora - one with stop words included and one with stop words removed. In many NLP applications removing stop words is appropriate. However, for the current application I believe stop words need to be used to train the model, so that we provide a natural language prediction. The corpora with stop words removed are, however, used for exploratory analysis as indicated in next sections.
Step 14. Remove whitespace. The substitutions and deletions applied in the previous steps lead to extra white spaces that are typically removed in NLP applications. I use the tm_map functionality for this white-space removal.

Stemming, reducing word types to their stems, is a further step that can be applied to NLP problems. However, for this particular application we rather work with the complete word type, then with its stem. This will provide for more correct model training and prediction.

The preprocessing is done by corpus source. The ready pre-processed corpora are exported and ready for ngram generation. The R code used to pre-process the samples is presented in Appendix B.

Tokenization and nGram Generation

NGrams are a popular method introduced by Jurafsky and Martin (2000) for NLP when applied to text prediction (and other tasks) problems. The ngram is a sequence of words found in the text that can be further sorted by frequency of appearance. For example UniGrams are 1 word ngrams, while TriGrams are sequences of three words as “I will go” or “Happy New Year”. NGrams are very useful for simple text prediction models, because, ngrams that appear with higher frequency are the ones that have a higher chance (probabilistically speaking) of appearing in prediction problems out of sample. For example a sequence of two words “I will” (first and second word) can be the base for 100s if not 1000s TriGrams (possible third word). However, among the possible third words in the training sample that form TriGrams with “I will”, some appear more often than others and will be selected with higher probability in prediction out of sample.

In case a TriGram that contains the first two words “I will” is not found in the model, then the model can refer to BiGrams containing a first word “will” and use this information to predict the third word in “I will …”. Naturally if no BiGrams exist, the model can refer to the stored UniGrams and suggest the highest probability single word as found in the training sample.

One obvious disadvantage to NGrams modeling is that not all possible third words to “I will …” are available in the training sets. Smoothing techniques can be applied to the available and unavailable three grams containing “I will …”, where probability mass is transferred from available in the training sample TriGrams to the ones that have not appeared even once. There are a few less or more sophisticated techniques for probability smoothing, and some of these techniques will be used in modeling to improve the performance of the model out of sample.

One quickly materializing problem with higher order ngrams, like 3ngrams and 4ngrams, is that very few 3ngrams and 4ngrams appear with higher frequency, and most 3/4ngrams appear just once (called singletons). This leads to enormous dataframes and potential problems with memory availability to process these dataframes. A known method is to remove the singletons when storing the 3/4ngrams.

NGrams are generated and are used for the moment to explore the data in a subsequent section. The R code used to generate and store 1/2/3/4grams is presented in Appendix C. For the moment the ngrams are stored as they are generated. In a later step I will look into storing the ngrams based on indexing of single words to see if this approach can reduce the memory demands of the subsequent modeling tasks. This will also likely be necessary to be able to operate the shiny app that has to be developed once the model is ready.

Exploratory analysis

Top Ngrams frequency

In this subsection we directly measure the top 20 most frequent 1/2/3grams for different corpora. It is important to look at the data with and without stop words excluded. For the purposes of modeling we need to include stop words. However, for exploratory analysis it is more interesting to remove the stop words and get a sense of the main topics prevailing in different type of corpora.

First we present the 1/2/3gram top 20 grams by frequency when stop words are not excluded from the corpora. The most common word is “The”, which represents about 5% coverage of the whole sample dictionary. Overall we see few surprises here - the top 20 ngrams here mainly represent some combination using a stop word and represent a very large portion of the dictionary.

Figure: Top 20 ngrams for the combined Corpus when stop words are not excluded from the corpus:

Next we present the 1/2/3gram top 20 grams by frequency when stop words are excluded from the corpora. The most common word is “will”, which represents about 0.75% coverage of the whole sample dictionary. Overall we see more variety here - the top 20 ngrams here have very low probability each, and the probability of each lower ranked TriGram drops very quickly compared to UniGrams and DiGrams. In addition, in terms of context we notice that likely the combined corpus represent relatively recent text since President Barack Obama is a fairly frequent TriGram. We see that we also have over-representation in the corpus of topics around holidays. Mostly, however, both DiGrams and TriGrams represent highly frequently used general expressions of the English language (pardon my lack of linguistic terminology).

Figure: Top 20 ngrams for the combined Corpus when stop words are excluded from the corpus:

For the beauty of it I also present a wordcloud plot of the combined corpus Trigrams when stop words are excluded from the corpus. More of the text topics and sentiments can be deduced from this more detailed plot.

Figure: Wordcloud plot of TriGrams for the combined Corpus when stop words are excluded from the corpus:

Next I present the TriGram top 20 grams by frequency for the three separate corpora - Blogs/News?Twitter - when stop words are NOT excluded from the corpora. The overall goal is to see if there are any topical/sentiment differences. We observe some specific for Blogs/Twitter TriGrams that will be unlikely for News articles as “I have been” and “I want to”. In the News corpus we observe also very specific for News corpus TriGrams as “The United States” and “according to the”. The Twitter corpus also reveals specific TriGrams that would rarely be observed in Blogs and News corpora as “thanks for the” and “looking forward to”.

Figure: Top 20 ngrams for the Blogs, News, Twitter corpora when stop words are not excluded from the corpus:

Last I present the TriGram top 20 grams by frequency for the three separate corpora - Blogs/News?Twitter - when stop words are excluded from the corpora. We observe some very specific for Blogs topics like “Amazon LLC”, “I pretty sure”, specific News topics as “senior court judge”, “President Barack Obama”, and specific tweets expressions as “let us go” or “I will take”. Topical differentiation between the three corpora based on TriGrams makes sense and is somewhat a loose confirmation that the data processing has gone relatively well.

Figure: Top 20 ngrams for the Blogs, News, Twitter corpora when stop words are not excluded from the corpus:

TTR - types of words to tokens ratio

The next characteristic of the data that we explore is the so called TTR: word.type to word.tokes ratio (Richards, 1987). The TTR demonstrates how rich the language is in a particular corpus. The higher TTR the more unique words are used to achieve to number of tokens. We calculate number of tokens (all words used) in the corpus, the number of word types (number of unique words) used in the corpus, and the respective TTR in the tables below. We do this for corpora that has stop words excluded or included.

## [1] "Word Types to tokens ratio (TTR) for corpora with NO stop words removed"

##    Source   Tokens Word.Types         TTR
## 1   Blogs 26296381     260377 0.009901629
## 2    News 23990968     195090 0.008131810
## 3 Twitter 20942380     331649 0.015836261
## 4  Corpus 71229729     574816 0.008069889

Note that the Twitter corpus is much richer than the Blogs and News corpora. Probable reason is that with a limited number of words allowed to use for a tweet people look for more precise words, which leads to richness of the Twitter corpus.

## [1] "Word Types to tokens ratio (TTR) for corpora with stop words removed"

##    Source   Tokens Word.Types        TTR
## 1   Blogs 13641106     260254 0.01907866
## 2    News 13851582     194967 0.01407543
## 3 Twitter 11814786     331526 0.02806026
## 4  Corpus 39307474     574693 0.01462045

We observe that the richness of the corpora increases when the stop words are excluded, which is expected. Stop words are extremely high frequency, low information words.

Cumulative frequences for 1grams

An interesting question to answer is what portion of your unique words comprise 50%, 75%, 90% of your sample dictionary. We present such information graphically first, as we can more clearly see the speed by which the dictionary coverage increases with additional unique words (word types). We constrain these plots to the first 50000 unique words in the 1grams. We present the speed of coverage increase for the separate source corpora - blogs/news/tweets - as well as for the combined corpus.

Figure: Cumulative frequences for 1grams excluding stop words:

We observe that we need about 1000 words to reach 50% coverage in each corpus, and about 50000 unique words to reach 95% coverage of our sample dictionary. It is interesting to note that coverage initially increases much faster for Twitter corpus than the other corpora. Even though this is the richest corpus in terms of TTR, we still have a smaller set of unique words that are represented with higher frequency as compared to the Blogs and News corpora.

Next we also calculate some dictionary coverage quantiles. The idea is to see how many unique words we need for, say 50% coverage. In addition to this we want to see how quickly the number of words increases to achieve an increase of 10% extra in coverage.

Figure: Quantiles, and ranges of quantiles in the combined corpus 1gram with stop words excluded:

## [1] "number of words comprising 49.99 to 50.01% of the dictionary:  966 966"

## [1] "number of words comprising 50.01 to 60.01% of the dictionary:  967 1720"

## [1] "number of words comprising 60.01 to 70.01% of the dictionary:  1721 3108"

## [1] "number of words comprising 70.01 to 80.01% of the dictionary:  3109 6067"

## [1] "number of words comprising 80.01 to 90.01% of the dictionary:  6068 15048"

## [1] "number of words comprising 90.01 to 95.01% of the dictionary:  15048 32788"

We note that we need about a 1000 unique words to achieve 50% coverage in the dictionary when stop words are excluded. From this quantile forward we need to about double the number of unique words in the corpus for each marginal 10% increase in coverage. The coverage speed significantly slows down beyond 90% coverage. For example to add another 5% coverage beyond 90% we need to move from 15,000 unique words to 32,000 unique words. The combined corpus clearly contains a small number of unique words that are represented very frequently and a large set of singletons.

The R code for all plots and tables presented in the section of exploratory analysis are included in Appendix D.

Plan

Next is model development, from simpler to more sophisticated model:

I need to figure out how to save ngrams as word indexes to save memory. Also I need to read up on a smart way to filter the higher order ngrams. Do I really need singletons in TriGrams? Not sure yet.
Develop simple Ngram models based on conditional probabilities of a larger ngram given its smaller ngram base. Implement Katz Backoff approach for prediction where a missing TriGram is approximated with a lower order ngram.
Make the model more sophisticated by smoothing probabilities across low occurrence marginal words or via interpolation. Take away some probability mass from higher frequency ngrams to redistribute to lower frequency ngrams. As much as I read, this smoothing improves the performance of the model significantly. I have prepared the data for the Good-Turing smoothing. According to Jurafsky interpolation is found to be even more effective, so I need to experiment with these methods as much as time allows it.
Make the model more sophisticated by smoothing probabilities across unknown (UNK) words, not used in the train sample, that could appear in the test sample.
Obviously understand how to test my model on the validation sample. Conceptually it is clear, but how this is done in R?
Build shiny app.

References

As I started this assignment two weeks ago this was a brand new area for me. I basically knew nothing. I spent much time looking for tutorials that can be followed to describe methods to be used for the specific task of text prediction. So far I have mainly learned from Jurafsky, Martin, and Manning contributions to the field. I have not been able to find another source that so carefully walks you step by step though NLP for the particular application of text prediction at a conceptual level.

I have also used numerous resources, too many to keep track, on Regular Expressions, to adapt those to the data pre-processing needs of this project.

Jurafsky, D. & Martin, J.H. (2000). Speech and language processing: An introduction to natural language processing, computational linguistics and speech recognition. Englewood Cliffs, NJ: Prentice Hall.
Dan Jurafsky & Chris Manning, Stamford Course Lectures on NLP: https://www.youtube.com/watch?v=nfoudtpBV68
Stackoverflow
Richards, B. (1987). Type/token ratios: What do they really tell us? Journal of Child Language, 14, pp. 201209. Doi: 10:1017/S0305000900012885. Retrieved from http://hfroehlich.files.wordpress.com/2013/05/s0305000900012885a.pdf.

Appendix

The Appendix mainly includes all the code that i have used in the steps I applied the code. One can use the code and fuly replicate my results.

For all calculations I use a personal 2012 MAC-book Pro computer with a 2.5 GHz Intel Core i5 and 16 GB 1600 MHz DDR3 installed. I have many other applications and a browser with about 30 Tabs open at the same time. With this set up I am often close to maxing out the available memory, and sometimes I max it out and R crashes. For this reason, I often split the pre-processing of the large training samples by blogs/news/twitter and later aggregate them into a combined sample in a separate step.

Appendix A: Data subsampling

I sub sample the data into 3 main samples - training (70%), validation/development (15%), testing (15%). I further take a random 10% sample coming from the 70% training sample to use for testing my code. Working with a large corpus is very time consuming.

# set working folder
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# A function to extract samples
textFileSample <- function(infile, outfile, selection) { 
  conn <- file(infile, "r")
  fulltext <- readLines(conn)
  nlines <- length(fulltext)
  close(conn)
  
  conn <- file(outfile, "w")

  for (i in 1:nlines) {
    if (selection[i]==1) { cat(fulltext[i], file=conn, sep="\n") }
  }
  close(conn)
  
  paste("Saved", sum(selection), "lines to file", outfile)
}

###################
## blogs ##########
###################

# read the txt file and count the number of lines in blogs document
conn <- file("./data/en_US/en_US.blogs.txt", "r")
fulltext <- readLines(conn)
numrows <- length(fulltext)
close(conn)
rm(fulltext)

# create indexes for samples for BLOGS
set.seed(42)
fraction1 <- 0.15      # 15% size of test sample
fraction2 <- 0.17      # approx 15% size of validation sample
fraction3 <- 0.14      # approx 10% size of training subsample out of full 70% train sample

# draw random samples based on rbinom
test.index <- rbinom(numrows, 1, fraction1)            # test sample index
val_train.index <- 1-test.index                        # full training + validation sample

val.subindex <- rbinom(numrows, 1, fraction2)          
val.index <- val_train.index*val.subindex              # validation sample index
train.index <- 1- test.index - val.index               # full training sample index

trainsample.subindex <- rbinom(numrows, 1, fraction3)
trainsample.index <- train.index*trainsample.subindex  # training sub-sample index

# extract subsamples and store
textFileSample("./data/en_US/en_US.blogs.txt", "./data/en_US/sample/blogs.test15pct.txt",
               test.index)
textFileSample("./data/en_US/en_US.blogs.txt", "./data/en_US/sample/blogs.val15pct.txt",
               val.index)
textFileSample("./data/en_US/en_US.blogs.txt", "./data/en_US/sample/blogs.train70pct.txt",
               train.index)                  
textFileSample("./data/en_US/en_US.blogs.txt", "./data/en_US/sample/blogs.train10pct.txt",
               trainsample.index)

###################
## news ##########
###################

# read the txt file and count the number of lines in blogs document
conn <- file("./data/en_US/en_US.news.txt", "r")
fulltext <- readLines(conn)
numrows <- length(fulltext)
close(conn)
rm(fulltext)

# create indexes for samples for NEWS
set.seed(42)
fraction1 <- 0.15                                    # 15% size of test sample
fraction2 <- 0.17                                    # approx 15% size of validation sample
fraction3 <- 0.14                                    # approx 10% size of training subsample out of full train sample

test.index <- rbinom(numrows, 1, fraction1)            # test sample index
val_train.index <- 1-test.index                        # full training + validation sample

val.subindex <- rbinom(numrows, 1, fraction2)          
val.index <- val_train.index*val.subindex              # validation sample index
train.index <- 1- test.index - val.index               # full training sample index

trainsample.subindex <- rbinom(numrows, 1, fraction3)
trainsample.index <- train.index*trainsample.subindex  # training sub-sample index

# extract subsamples and store
textFileSample("./data/en_US/en_US.news.txt", "./data/en_US/sample/news.test15pct.txt",
               test.index)
textFileSample("./data/en_US/en_US.news.txt", "./data/en_US/sample/news.val15pct.txt",
               val.index)
textFileSample("./data/en_US/en_US.news.txt", "./data/en_US/sample/news.train70pct.txt",
               train.index)                  
textFileSample("./data/en_US/en_US.news.txt", "./data/en_US/sample/news.train10pct.txt",
               trainsample.index)

###################
## twitter ########
###################

# load data and count number of lines in twitter document
conn <- file("./data/en_US/en_US.twitter.txt", "r")
fulltext <- readLines(conn)
numrows <- length(fulltext)
close(conn)
rm(fulltext)

# create indexes for samples for TWITTER
set.seed(42)
fraction1 <- 0.15                                    # 15% size of test sample
fraction2 <- 0.17                                    # approx 15% size of validation sample
fraction3 <- 0.14                                    # approx 10% size of training subsample out of full train sample

test.index <- rbinom(numrows, 1, fraction1)            # test sample index
val_train.index <- 1-test.index                        # full training + validation sample

val.subindex <- rbinom(numrows, 1, fraction2)          
val.index <- val_train.index*val.subindex              # validation sample index
train.index <- 1- test.index - val.index               # full training sample index

trainsample.subindex <- rbinom(numrows, 1, fraction3)
trainsample.index <- train.index*trainsample.subindex  # training sub-sample index

# extract subsamples
textFileSample("./data/en_US/en_US.twitter.txt", "./data/en_US/sample/twitter.test15pct.txt",
               test.index)
textFileSample("./data/en_US/en_US.twitter.txt", "./data/en_US/sample/twitter.val15pct.txt",
               val.index)
textFileSample("./data/en_US/en_US.twitter.txt", "./data/en_US/sample/twitter.train70pct.txt",
               train.index)                  
textFileSample("./data/en_US/en_US.twitter.txt", "./data/en_US/sample/twitter.train10pct.txt",
               trainsample.index)

Next i combine blogs-news-tweets sub samples into combined-corpus sub samples:

##############################################
## combine a 10% sample of all threee sources
##############################################

# Combine train 70% sub-samples into one
conn <- file("./data/en_US/sample/blogs.train70pct.txt", "r")
        file1 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/news.train70pct.txt", "r")
        file2 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/twitter.train70pct.txt", "r")
        file3 <- readLines(conn)
        close(conn)
        
conn <- file("./data/en_US/sample/train70pct.txt", "w")
        cat(c(file1,file2,file3), file=conn, sep="\n")
        close(conn)
rm(file1, file2, file3)

# Combine train 10% sub-samples into one
conn <- file("./data/en_US/sample/blogs.train10pct.txt", "r")
        file1 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/news.train10pct.txt", "r")
        file2 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/twitter.train10pct.txt", "r")
        file3 <- readLines(conn)
        close(conn)
        
conn <- file("./data/en_US/sample/train10pct.txt", "w")
        cat(c(file1,file2,file3), file=conn, sep="\n")
        close(conn)
rm(file1, file2, file3)


# Combine test 15% sub-samples into one
conn <- file("./data/en_US/sample/blogs.test15pct.txt", "r")
        file1 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/news.test15pct.txt", "r")
        file2 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/twitter.test15pct.txt", "r")
        file3 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/test15pct.txt", "w")
        cat(c(file1,file2,file3), file=conn, sep="\n")
        close(conn)
rm(file1, file2, file3)

# Combine validation 15% sub-samples into one
conn <- file("./data/en_US/sample/blogs.val15pct.txt", "r")
        file1 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/news.val15pct.txt", "r")
        file2 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/twitter.val15pct.txt", "r")
        file3 <- readLines(conn)
        close(conn)

conn <- file("./data/en_US/sample/val15pct.txt", "w")
        cat(c(file1,file2,file3), file=conn, sep="\n")
        close(conn)
rm(file1, file2, file3)

The resulting data sets from this sub sampling are:

blogs.train70pct.text, news.train70pct.text, twitter.train70pct.text
blogs.train10pct.text, news.train10pct.text, twitter.train10pct.text
blogs.test15pct.text, news.test15pct.text, twitter.test15pct.text
blogs.val15pct.text, news.val15pct.text, twitter.val15pct.text
train70pct.text, train10pct.text, test15pct.text, val15pct.text

Appendix B: Data pre-processing; Corpus prep for analysis

Set up pre-processing first:

library(tm) #load text mining library
library(textclean) #load auxiliary library 1

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10") #sets R's working directory 

# using a DirSource it loads a txt file into a VCorpus for preprocessing
# the files that need pre-processing in general are the sub-samples 
# from the previous section. But we start with the training samples
# blogs.train70pct.txt, news.train70pct.txt, twitter.train70pct.txt, train10pct.txt

# for this example code I use the train10pct.txt sample, whihc i place in the "./data/en_US/train.sample" folder 

sample.data <- VCorpus(DirSource("./data/en_US/train.sample", encoding = "UTF-8"), 
                 readerControl=list(language="en"))
summary(sample.data)
docs <- sample.data

# content-transformer custom functions
# Some custom functions that are needed to transform the corpus using content_transformer 
# functionality of tm package
toEmpty <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))}) # pattern to empty
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))}) # pattern to space

# turns ? and ! and . into an end of sentence identifier EEOSS. The purpose is that when forming ngrams we 
# discard ngrams that are formed on the edges of sentences and combine words from more than one sentence
toEOS1 <- content_transformer(function(x, pattern) {return (gsub("\\? |\\?$|\\! |\\!$", " EEOSS ", x))})
toEOS2 <- content_transformer(function(x, pattern) {return (gsub("\\. |\\.$", " EEOSS ", x))})

# turns abbreviations as H.S.B.C. into an identifier AABRR. The purpose is that when forming ngrams we 
# discard ngrams that are formed on the edges of such abbreviations as these ngrams are incorrect 
# from information perspective
toABR <- content_transformer(function(x, pattern) {return (gsub("[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\. ", " AABRR ", x))})

# turns numbers into an identifier NNUMM. The purpose is that when forming ngrams we 
# discard ngrams that are formed on the edges of such numbers as these ngrams are incorrect 
# from information perspective. A common approach is the 'remove numbers' with tm package.
# However, this will lead to formation of incorrect ngrams later on
toNUM <- content_transformer(function(x, pattern) {return (gsub("[0-9]+"," NNUMM ",x))})

# expand & and @ 
toAnd <- content_transformer(function(x) {return (gsub("&", "and", x))})
toAt <- content_transformer(function(x) {return (gsub("@", "at", x))})

# replace contractions using replace_contractions function from library library(textclean)
replaceContraction <- content_transformer(function(x) {return (replace_contraction(x))})
toHaveNot <- content_transformer(function(x) {return (gsub("haven't", "have not", x))})
toHadNot <- content_transformer(function(x) {return (gsub("hadn't", "had not", x))})

# remove duplicates
removeDup <- content_transformer(function(x) {return (unique(x))})

Apply a sequence of pre-processing steps on each text sample:

# 1. Remove duplicates
docs <- tm_map(docs, removeDup)

# 2. Separate words connected with - or /
# this helps later when we remove punctuation, so that two words do not stick to each otehr
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "/")

# 3. Establish end of sentence
# the sequence of application of the functions is important here
docs <- tm_map(docs, toEOS1)
docs <- tm_map(docs, toABR)
docs <- tm_map(docs, toEOS2)
docs <- tm_map(docs, toNUM)

# 4. Remove email and http/s
docs <- tm_map(docs, toEmpty, "\\S+@\\S+") # email
docs <- tm_map(docs, toEmpty, "[Hh}ttp([^ ]+)") # html links

# 5. Remove retweet entries, Remove @ people, twitter usernames
docs <- tm_map(docs, toEmpty, "RT | via") # retweets
docs <- tm_map(docs, toEmpty, "@([^ ]+)") # people
docs <- tm_map(docs, toEmpty, "[@][a - zA - Z0 - 9_]{1,15}") # usernames

# 6. Transform to lower case (need to wrap in content_transformer) (time consuming procedure)
docs <- tm_map(docs,content_transformer(tolower))

# 7. Remove/replace &, @, 'm, 's, 'are, 'll, etc...
docs <- tm_map(docs, toAnd) 
docs <- tm_map(docs, toAt) 
docs <- tm_map(docs, replaceContraction)

        # remove this contraction in say "mother's day"
        # before it is removed by remove_punctuation
        docs <- tm_map(docs, toEmpty, "'s") 

        # i found that havn't and hadn't did not get replaced with replace_contractions
        docs <- tm_map(docs, toHaveNot)
        docs <- tm_map(docs, toHadNot)

# 8. Remove emoji's, emoticons like ❤#RelayForLife ❤"😖😰💔💔"
docs <- tm_map(docs, toEmpty, "[^\x01-\x7F]")

# 9. Remove g, mg, lbs etc; removes all single letters except "a" and "i"

docs <- tm_map(docs, toSpace, " [1-9]+g ") # grams
docs <- tm_map(docs, toSpace, " [1-9]+mg ") # miligrams, etc
docs <- tm_map(docs, toSpace, " [1-9]+kg ")
docs <- tm_map(docs, toSpace, " [1-9]+lbs ")
docs <- tm_map(docs, toSpace, " [1-9]+s ") # seconds, etc
docs <- tm_map(docs, toSpace, " [1-9]+m ")
docs <- tm_map(docs, toSpace, " [1-9]+h ")
docs <- tm_map(docs, toSpace, " +g ") # grams
docs <- tm_map(docs, toSpace, " +mg ") # miligrams, etc
docs <- tm_map(docs, toSpace, " +kg ")
docs <- tm_map(docs, toSpace, " +lbs ")
docs <- tm_map(docs, toSpace, " +s ") # seconds, etc
docs <- tm_map(docs, toSpace, " +m ")
docs <- tm_map(docs, toSpace, " +h ")
docs <- tm_map(docs, toSpace, " +lbs ")
docs <- tm_map(docs, toSpace, " +kg ")

docs <- tm_map(docs, toSpace, " [b-hj-z] ") # all single-letter words except a and i

# 10. Remove numbers
docs <- tm_map(docs, removeNumbers)

# 11. Remove punctuation
# Note that this removes hashtags from tweets. I examined the data and most often the
# hashtag precedes a word of meaning and not a name of a person. If it were the latter
# it would be logical to remove the whole word that is preceded by a hashtag. However, 
# this strategy would lead to loss in meaning in the sentences and potentially affect
# the predictive power of the model. In addition, hopefully words with hashtags 
# that dont have a general meaning will not repeat often, and will have a low weight
# in calibrating the model via the ngrams. 
# Result: REMOVE HASHTAGS only, not words w/ hashtag
docs <- tm_map(docs, removePunctuation, preserve_intra_word_dashes = TRUE)

        # 11.1. Remove ": removing punctuation did not remove these
        docs <- tm_map(docs, toEmpty, "“")
        docs <- tm_map(docs, toEmpty, "”")
        docs <- tm_map(docs, toEmpty, "‘")
        docs <- tm_map(docs, toEmpty, "’")

# 12. Remove profanity (time consuming procedure)
# I don't want my predictions to suggest profanity words. For this purpose i may
# train a model with these words in and remove them in the end by recalibrating the 
# model on non-profanity data. Or else, which makes more sense, is to remove them
# from the beginning and not train the model on those. The latter choice is applied. 

# Donwload a list with bad words - we need this

# install.packages("RCurl")
library(RCurl)

profanity.source<-"https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
if (!file.exists("./data/en_US/profanity/en_bws.txt")){download.file(url=profanity.source , 
                destfile = "./data/en_US/profanity/en_bws.txt", method='libcurl')}

# remove bad words from corpora
bw.data<-read.table("./data/en_US/profanity/en_bws.txt", header=FALSE, sep="\n", strip.white=TRUE) # load bad words
names(bw.data)<-"bad.words"
docs<- tm_map(docs, removeWords, bw.data[,1]) # removal of profanity

# 13. Remove stop words
# From this point forward i keep two types of corpora - one with stop words included
# and one with stop words removed. The latter sample always has "nonstop" in the name of the file
docs.nonstop<-tm_map(docs, removeWords, stopwords("english"))

# 14. Remove whitespace

# The previous transformations generate a lot of extra whitespace in the lines of text; 
# applying the stripWhitespace transformation deals with this problem.
# I have tried different methods for this, but finally decided to use the tm package
docs <- tm_map(docs, stripWhitespace)
docs.nonstop <- tm_map(docs.nonstop, stripWhitespace)

# 15. Stem documents
# For the purposes of this porject I have made a choice to not use a stemmed corpus
# Logically we need to work with word types, not stems from them to form correct ngrams
# However, I offer the code i have tested for this step

# # install.packages("SnowballC")
# library("SnowballC")
# docs.stem <- tm_map(docs,stemDocument)
# docs.nonstop.stem <- tm_map(docs.nonstop,stemDocument)

# 16. Save data
# At this point i export the corpus with two versions - stop words included or not
# The reason is that pre-processing is very time consuming and shold be done once

save(docs, docs.nonstop, 
     file = "./data/en_US/train.sample/Rdata_output/train10pct.preprocsd.Rdata")

write(docs[[1]][[1]],"./data/en_US/train.sample/txt_output/train10pct.docs.preprocsd.txt")
write(docs.nonstop[[1]][[1]],"./data/en_US/train.sample/txt_output/train10pct.docs.nonstop.preprocsd.txt")

The resulting training files from this pre-processing step are (pre-processed in name:

blogs.train70pct.docs.pre-processed.text (also news., twitter. versions)
blogs.train70pct.docs.nonstop.pre-processed.text (also news., twitter. versions)
train10pct.docs.pre-processed.text, train10pct.docs.nonstop.pre-processed.text (to be used for testing code)
twitter.train10pct.docs.pre-processed.text, twitter.train10pct.docs.nonstop.pre-processed.text (to be used for testing code)

Appendix C: Tokenization and nGram Generation

I use package stylo to do tokenization. Since the ngrams can become very large, especially for 3grams and 4 grams, I tokenize the train70pct samples separately for blogs/news/tweets and aggregate the ngrams in a subsequent step.

# install.packages("stylo")
library(stylo)

###*********************** convert text to words with stylo **********************

### this step is done consecutively for docs[1] and doc.nonstop[1]
### and is applied to the train70pct and train10pct corpora from the previous step

### here we provide the example code for docs[1]
docs.words<- txt.to.words(docs[1])
###*******************************************************************************

###
# create data frames of one, two and three ngrams
###

###
# unigram
###
df1grm<-data.frame(table(make.ngrams(docs.words[[1]], ngram.size = 1))) # create 1gram

# remove rows with eeoss, nnumm, aabrr as these are not useful
df1grm <- df1grm[which(df1grm$Var1!="eeoss"),]
df1grm <- df1grm[which(df1grm$Var1!="nnumm"),]
df1grm <- df1grm[which(df1grm$Var1!="aabrr"),]

# calculate relative frequencies
df1grm$RelFreq <- round(100*(df1grm$Freq/sum(df1grm$Freq)),2)

# Create a sorted table "sdf*" by decending frequency count
sdf1grm<-df1grm[order(df1grm$Freq, decreasing = TRUE),];

# Calculate relative cumulative frequencies, and column with row numbers.
# this allows one to find how many words are e.g. 50% of dictionary to answer
# a question like this later on
sdf1grm$CumSum <- cumsum(sdf1grm$Freq); sdf1grm$RelCumSum <- 100*(sdf1grm$CumSum/sum(sdf1grm$Freq))
sdf1grm$RowNum <- 1:dim(sdf1grm)[1]
colnames(sdf1grm)<-c("OneGram","Frequency","RelFrequency", "CumFreq", "RelCumFreq", "RowNum")

# Build frequency of frequency table for subsequent Good-Turing smoothing
uni.freqfreq<-data.frame(Uni=table(sdf1grm$Frequency))

###
# bigram
###

df2grm<-data.frame(table(make.ngrams(docs.words[[1]], ngram.size = 2))) # create 2gram

# calculate relative frequencies
df2grm$RelFreq <- round(100*(df2grm$Freq/sum(df2grm$Freq)),2)

# remove rows with eeoss, nnumm, aabrr as these are not useful
# note that any bigrams, tri/fourgrams that contain any of these three words are incorrect
# because they are not sequences of words that were actually used in the text
# grepl looks for a pattern in the text of each row
eos.index<-grepl("eeoss",df2grm$Var1)
df2grm<-df2grm[!eos.index,]
num.index<-grepl("nnumm",df2grm$Var1)
df2grm<-df2grm[!num.index,]
abr.index<-grepl("aabrr",df2grm$Var1)
df2grm<-df2grm[!abr.index,]

# Create a sorted table "sdf*" by decending frequency count
sdf2grm<-df2grm[order(df2grm$Freq, decreasing = TRUE),]; colnames(sdf2grm)<-c("TwoGram","Frequency","RelFrequency")

# Build frequency of frequency table for Good-Turing smoothing
di.freqfreq<-data.frame(Di=table(sdf2grm$Frequency))

###
# trigram
###

df3grm<-data.frame(table(make.ngrams(docs.words[[1]], ngram.size = 3))) # create 3gram

# calculate relative frequencies
df3grm$RelFreq <- round(100*(df3grm$Freq/sum(df3grm$Freq)),2)

# remove rows with eeoss, nnumm, aabrr as these are not useful
eos.index<-grepl("eeoss",df3grm$Var1)
df3grm<-df3grm[!eos.index,]
num.index<-grepl("nnumm",df3grm$Var1)
df3grm<-df3grm[!num.index,]
abr.index<-grepl("aabrr",df3grm$Var1)
df3grm<-df3grm[!abr.index,]

# Create a sorted table "sdf*" by decending frequency count
sdf3grm<-df3grm[order(df3grm$Freq, decreasing = TRUE),]; colnames(sdf3grm)<-c("TriGram","Frequency","RelFrequency")

# separate singletons
sdf3grm.2pl<-sdf3grm[which(sdf3grm$Frequency>=2),]
sdf3grm.singl<-sdf3grm[which(sdf3grm$Frequency==1),]

# Build frequency of frequency table for Good-Turing smoothing
tri.freqfreq<-data.frame(Tri=table(sdf3grm$Frequency))

###
# fourgram
###

df4grm<-data.frame(table(make.ngrams(docs.words[[1]], ngram.size = 4)))  # create 4gram

# calculate relative frequencies
df4grm$RelFreq <- round(100*(df4grm$Freq/sum(df4grm$Freq)),2)

# remove rows with eeoss, nnumm, aabrr as these are not useful
eos.index<-grepl("eeoss",df4grm$Var1)
df4grm<-df4grm[!eos.index,]
num.index<-grepl("nnumm",df4grm$Var1)
df4grm<-df4grm[!num.index,]
abr.index<-grepl("aabrr",df4grm$Var1)
df4grm<-df4grm[!abr.index,]

# Create a sorted table "sdf*" by decending frequency count
sdf4grm<-df4grm[order(df4grm$Freq, decreasing = TRUE),]; colnames(sdf4grm)<-c("FourGram","Frequency","RelFrequency")

# separate singletons
sdf4grm.2pl<-sdf4grm[which(sdf4grm$Frequency>=2),]
sdf4grm.singl<-sdf4grm[which(sdf4grm$Frequency==1),]

# Build frequency of frequency table for Good-Turing smoothing
four.freqfreq<-data.frame(Four=table(sdf4grm$Frequency))

###
# Save ngrams data
###

save(docs, docs.words, sdf1grm, sdf2grm, sdf3grm, sdf3grm.2pl, sdf3grm.singl, 
     sdf4grm, sdf4grm.2pl, sdf4grm.singl, uni.freqfreq, di.freqfreq, tri.freqfreq, four.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train10pct_ngram_docs.Rdata")

# (for illustrative purposes only)
# save ngram data for the alternative nonstop corpus 
save(docs.nonstop, docs.words, sdf1grm, sdf2grm, sdf3grm, sdf3grm.2pl, sdf3grm.singl, 
     sdf4grm, sdf4grm.2pl, sdf4grm.singl, uni.freqfreq, di.freqfreq, tri.freqfreq, four.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train10pct_ngram_docs.nonstop.Rdata")

Input files for the last step were:

blogs.train70pct.pre-processed.Rdata, news.train70pct.pre-processed.Rdata, twitter.train70pct.pre-processed.Rdata
train10pct.pre-processed.Rdata

Output files for the last step were:

blogs.train70pct_ngram_docs.Rdata, news.train70pct_ngram_docs.Rdata, twitter.train70pct_ngram_docs.Rdata
blogs.train70pct_ngram_docs.nonstop.Rdata, news.train70pct_ngram_docs.nonstop.Rdata, twitter.train70pct_ngram_docs.nonstop.Rdata
train10pct_ngram_docs.Rdata, train10pct_ngram_docs.nonstop.Rdata

In the next step I aggregate the blogs/news/twitter train70pct samples ngrams into combined ngrams. I provide an example of the code used for files with included stop words. The same code is applied to files with stop words excluded.

###*************************************
### aggregate large training sample ####
###*************************************

####
## with stop words
####

# To load RData, one object at a time, we use library R.utils
# install.packages("R.utils")

library(R.utils); 

###
## unigram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

sdf1grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf1grm"]] 
sdf1grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf1grm"]]
sdf1grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf1grm"]]

# vertically bind the ngrams across three sources
df1grm <- rbind(sdf1grm.blogs[,1:2], sdf1grm.news[,1:2], sdf1grm.twitter[,1:2])

rm(sdf1grm.blogs,sdf1grm.news,sdf1grm.twitter)

## group and summarize using dplyr library
library(dplyr)

sdf1grm <- df1grm %>% group_by(OneGram) %>% summarize(Frequency = sum(Frequency) )
rm(df1grm)
sdf1grm <- as.data.frame(sdf1grm); sdf1grm <- sdf1grm[order(-sdf1grm$Frequency),]

# calculate relative frequencies
sdf1grm$RelFreq <- round(100*(sdf1grm$Frequency/sum(sdf1grm$Frequency)),2)

## Calculate relative cumulative frequencies, and column with row numbers.
## this allows one to find how many words are e.g. 50% of dictionary
sdf1grm$CumSum <- cumsum(sdf1grm$Freq); sdf1grm$RelCumSum <- 100*(sdf1grm$CumSum/sum(sdf1grm$Freq))
sdf1grm$RowNum <- 1:dim(sdf1grm)[1]
colnames(sdf1grm)<-c("OneGram","Frequency","RelFrequency", "CumFreq", "RelCumFreq", "RowNum")

# Build frequency of frequency table for Good-Turing smoothing
uni.freqfreq<-data.frame(Uni=table(sdf1grm$Frequency))

# save combined data for 1gram
save(sdf1grm, uni.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.Rdata")
rm(sdf1grm, uni.freqfreq)

###
## bigram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

sdf2grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf2grm"]] 
sdf2grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf2grm"]]
sdf2grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf2grm"]]

# vertically bind the ngrams across three sources
df2grm <- rbind(sdf2grm.blogs[,1:2], sdf2grm.news[,1:2], sdf2grm.twitter[,1:2])

rm(sdf2grm.blogs,sdf2grm.news,sdf2grm.twitter)

## group and summarize using dplyr library
library(dplyr)

sdf2grm <- df2grm %>% group_by(TwoGram) %>% summarize(Frequency = sum(Frequency) )
rm(df2grm)
sdf2grm <- as.data.frame(sdf2grm); sdf2grm <- sdf2grm[order(-sdf2grm$Frequency),]

# calculate relative frequencies
sdf2grm$RelFreq <- round(100*(sdf2grm$Frequency/sum(sdf2grm$Frequency)),2)

# Build frequency of frequency table for Good-Turing smoothing
di.freqfreq<-data.frame(Di=table(sdf2grm$Frequency))

# save combined data for 2gram
save(sdf2grm, di.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train70pct_2gram_docs.Rdata")
rm(sdf2grm, di.freqfreq)

###
## trigram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

sdf3grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf3grm"]] 
sdf3grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf3grm"]]
sdf3grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf3grm"]]

# vertically bind the ngrams across three sources
df3grm <- rbind(sdf3grm.blogs[,1:2], sdf3grm.news[,1:2], sdf3grm.twitter[,1:2])

rm(sdf3grm.blogs,sdf3grm.news,sdf3grm.twitter)

## group and summarize using dplyr library
library(dplyr)

sdf3grm <- df3grm %>% group_by(TriGram) %>% summarize(Frequency = sum(Frequency) )
rm(df3grm)
sdf3grm <- as.data.frame(sdf3grm); sdf3grm <- sdf3grm[order(-sdf3grm$Frequency),]

# calculate relative frequencies
sdf3grm$RelFreq <- round(100*(sdf3grm$Frequency/sum(sdf3grm$Frequency)),2)

# Build frequency of frequency table for Good-Turing smoothing
tri.freqfreq<-data.frame(Tri=table(sdf3grm$Frequency))

# separate singletons
sdf3grm.2pl<-sdf3grm[which(sdf3grm$Frequency>=2),]
sdf3grm.singl<-sdf3grm[which(sdf3grm$Frequency==1),]

# save combined data for 3gram
save(sdf3grm, tri.freqfreq, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_3gram_docs.Rdata")
# save pre-processed data
save(sdf3grm.2pl, tri.freqfreq,
     file = "./data/en_US/train.sample/Rdata_output/train70pct_3gram.2pl_docs.Rdata")
# save pre-processed data
save(sdf3grm.singl, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_3gram.singl_docs.Rdata")

rm(sdf3grm, tri.freqfreq, sdf3grm.2pl, sdf3grm.singl)

###
## fourgram
###

setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

sdf4grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf4grm"]] 
sdf4grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf4grm"]]
sdf4grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf4grm"]]

# vertically bind the ngrams across three sources
df4grm <- rbind(sdf4grm.blogs[,1:2], sdf4grm.news[,1:2], sdf4grm.twitter[,1:2])

rm(sdf4grm.blogs,sdf4grm.news,sdf4grm.twitter)

## group and summarize using dplyr library
library(dplyr)

sdf4grm <- df4grm %>% group_by(FourGram) %>% summarize(Frequency = sum(Frequency) )
rm(df4grm)
sdf4grm <- as.data.frame(sdf4grm); sdf4grm <- sdf4grm[order(-sdf4grm$Frequency),]

# calculate relative frequencies
sdf4grm$RelFreq <- round(100*(sdf4grm$Frequency/sum(sdf4grm$Frequency)),2)

# Build frequency of frequency table for Good-Turing smoothing
four.freqfreq<-data.frame(Four=table(sdf4grm$Frequency))

# separate singletons
sdf4grm.2pl<-sdf4grm[which(sdf4grm$Frequency>=2),]
sdf4grm.singl<-sdf4grm[which(sdf4grm$Frequency==1),]

# save combined data for 4gram
save(sdf4grm, four.freqfreq, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_4gram_docs.Rdata")
# save pre-processed data
save(sdf4grm.2pl, four.freqfreq, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_4gram.2pl_docs.Rdata")
# save pre-processed data
save(sdf4grm.singl, 
     file = "./data/en_US/train.sample/Rdata_output/train70pct_4gram.singl_docs.Rdata")

rm(sdf4grm, four.freqfreq, sdf4grm.2pl, sdf4grm.singl)

Output files for the last step were:

blogs.train70pct_ngram_docs.Rdata, news.train70pct_ngram_docs.Rdata, twitter.train70pct_ngram_docs.Rdata
blogs.train70pct_ngram_docs.nonstop.Rdata, news.train70pct_ngram_docs.nonstop.Rdata, twitter.train70pct_ngram_docs.nonstop.Rdata

Output files from the last step were:

train70pct_1gram_docs.Rdata, train70pct_2gram_docs.Rdata
train70pct_3gram_docs.Rdata, train70pct_3gram.2pl_docs.Rdata, train70pct_3gram.singl_docs.Rdata
train70pct_4gram_docs.Rdata, train70pct_4gram.2pl_docs.Rdata, train70pct_4gram.singl_docs.Rdata
except for the train70pct_3gram.singl and train70pct_4gram.singl, the Rdata files also include the frequence of frequence tables

Extracting from RData some dataframes needed for exploratory analysis

I need to be able to load some 1gram and 3gram dataframes quicker for the purposes of exploratory analysis. I extract these dataframes with the following code.

# TriGrams
 sdf3grm.2pl.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.Rdata")[["sdf3grm.2pl"]] 
 sdf3grm.2pl.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.Rdata")[["sdf3grm.2pl"]]
 sdf3grm.2pl.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.Rdata")[["sdf3grm.2pl"]]
 save(sdf3grm.2pl.blogs, file = "./data/en_US/train.sample/Rdata_output/blogs.train70pct_3gram.2pl_docs.Rdata")
 save(sdf3grm.2pl.news, file = "./data/en_US/train.sample/Rdata_output/news.train70pct_3gram.2pl_docs.Rdata")
 save(sdf3grm.2pl.twitter, file = "./data/en_US/train.sample/Rdata_output/twitter.train70pct_3gram.2pl_docs.Rdata")
 
 sdf3grm.2pl.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.nonstop.Rdata")[["sdf3grm.2pl"]] 
 sdf3grm.2pl.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.nonstop.Rdata")[["sdf3grm.2pl"]]
 sdf3grm.2pl.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.nonstop.Rdata")[["sdf3grm.2pl"]]
 
 save(sdf3grm.2pl.blogs, file = "./data/en_US/train.sample/Rdata_output/blogs.train70pct_3gram.2pl_docs.nonstop.Rdata")
 save(sdf3grm.2pl.news, file = "./data/en_US/train.sample/Rdata_output/news.train70pct_3gram.2pl_docs.nonstop.Rdata")
 save(sdf3grm.2pl.twitter, file = "./data/en_US/train.sample/Rdata_output/twitter.train70pct_3gram.2pl_docs.nonstop.Rdata")

# UniGrams  
sdf1grm.blogs <- loadToEnv("./data/en_US/train.sample/Rdata_output/blogs.train70pct_ngram_docs.nonstop.Rdata")[["sdf1grm"]] 
sdf1grm.news <- loadToEnv("./data/en_US/train.sample/Rdata_output/news.train70pct_ngram_docs.nonstop.Rdata")[["sdf1grm"]]
sdf1grm.twitter <- loadToEnv("./data/en_US/train.sample/Rdata_output/twitter.train70pct_ngram_docs.nonstop.Rdata")[["sdf1grm"]]

save(sdf1grm.blogs, file = "./data/en_US/train.sample/Rdata_output/blogs.train70pct_1gram_docs.nonstop.Rdata")
save(sdf1grm.news, file = "./data/en_US/train.sample/Rdata_output/news.train70pct_1gram_docs.nonstop.Rdata")
save(sdf1grm.twitter, file = "./data/en_US/train.sample/Rdata_output/twitter.train70pct_1gram_docs.nonstop.Rdata")

Appendix D: Data Characteristics and Exporatory Analysis

Characteristics of original sourced data

The code below generates the table in section “Data Description”. It works with the original three corpora provided by Swift-Coursera.

datafolder <- "/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10/data/en_US"
flist <- list.files(path=datafolder, recursive=F, pattern=".*en_.*.txt")

# lapply returns a list
l <- lapply(paste(datafolder, flist, sep="/"), function(f) {
        fsize <- file.info(f)[1]/1024/1024
        con <- file(f, open="r")
        lines <- readLines(con)
#        nchars <- lapply(lines, nchar)
#        maxchars <- which.max(nchars)
        nwords <- sum(sapply(strsplit(lines, "\\s+"), length))
        close(con)
        # round to two decimal points, but if the first two decimals are 00, then still display them
        return(c(f, format(round(fsize, 2), nsmall=2), length(lines), nwords))
})

# unlist the list and display the results
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=T))
colnames(df) <- c("file", "size(MB)", "num.of.lines", "num.of.words")

filenames <- data.frame(file.name = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
df <- cbind(filenames,df[,2:4])
df

NGrams based exploratory analysis

R code for Ngrams related plots in the “Exploratory Analysis” section.

Top 20 highest frequency 1gram, 2gram and 3 grams for ngrams including stop words for the combined train70pct corpus:

###
# ngram plots including stop words
###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# load 1/2/3grams 
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_2gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_3gram.2pl_docs.Rdata", verbose = FALSE)

# extract top 20 ngrams by frequency
top20oneg<-sdf1grm[1:20,]; top20oneg$OneGram <- as.character(top20oneg$OneGram)
top20twog<-sdf2grm[1:20,]; top20twog$TwoGram <- as.character(top20twog$TwoGram)
top20trig<-sdf3grm.2pl[1:20,]; top20trig$TriGram <- as.character(top20trig$TriGram)

rm(sdf1grm, sdf2grm, sdf3grm)

library(ggplot2)
#install.packages("ggpubr")
library(ggpubr) # use to align three plots in one row

p1 <- ggplot (top20oneg, aes(x = reorder(OneGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "lightgreen" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "OneGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

p2 <- ggplot (top20twog, aes(x = reorder(TwoGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "lightblue" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFreq, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "TwoGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

p3 <- ggplot (top20trig, aes(x = reorder(TriGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFreq, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "TriGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

# Arrange plots (3,1)
 ggarrange(p1, p2, p3 + rremove("x.text"), 
#           labels = c("A", "B", "C"),
          labels = NULL,
          ncol = 3, nrow = 1)

Top 20 highest frequency 1gram, 2gram and 3 grams for ngrams excluding stop words for the combined train70pct corpus:

###
# ngram plots excluding stop words
###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

# load 1/2/3grams  
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_2gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)
 
# extract top 20 ngrams by frequency
top20oneg<-sdf1grm[1:20,]; top20oneg$OneGram <- as.character(top20oneg$OneGram)
top20twog<-sdf2grm[1:20,]; top20twog$TwoGram <- as.character(top20twog$TwoGram)
top20trig<-sdf3grm.2pl[1:20,]; top20trig$TriGram <- as.character(top20trig$TriGram)

rm(sdf1grm, sdf2grm, sdf3grm)

library(ggplot2)
#install.packages("ggpubr")
library(ggpubr) # use to align three plots in one row

p1 <- ggplot (top20oneg, aes(x = reorder(OneGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "lightgreen" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "OneGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

p2 <- ggplot (top20twog, aes(x = reorder(TwoGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "lightblue" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFreq, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "TwoGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

p3 <- ggplot (top20trig, aes(x = reorder(TriGram, Frequency), y= Frequency )) + 
        geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
        coord_flip() +
        geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
        geom_text( aes (label = paste(RelFreq, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
        xlab( "TriGram List" ) +
        ylab( "Freq & RelFreq %" ) +
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
        theme_bw()

# Arrange plots (3,1)
 ggarrange(p1, p2, p3 + rremove("x.text"), 
#           labels = c("A", "B", "C"),
          labels = NULL,
          ncol = 3, nrow = 1)

Wordcloud frequency of 3 grams for ngrams excluding stop words for the combined train70pct corpus:

###
# wordcloud plots of 3grams with excluded stop words
###
 
# install.packages("wordcloud")
library(wordcloud)
 
load("./data/en_US/train.sample/Rdata_output/train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)

wordcloud(words = sdf3grm.2pl$TriGram, freq = sdf3grm.2pl$Freq, min.freq = 3,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2")) #,scale=c(2, 1.2))

Top 20 highest frequency 1gram, 2gram and 3 grams for ngrams including stop words for the three separate - blogs/news/twitter - corpora:

 ###
 # 3grams from each source including stop words
 ###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_3gram.2pl_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_3gram.2pl_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_3gram.2pl_docs.Rdata", verbose = FALSE)
 
 # extract top 20 by freq
 top20trig.blogs<-sdf3grm.2pl.blogs[1:20,]; top20trig.blogs$TriGram <- as.character(top20trig.blogs$TriGram)
 top20trig.news<-sdf3grm.2pl.news[1:20,]; top20trig.news$TriGram <- as.character(top20trig.news$TriGram)
 top20trig.twitter<-sdf3grm.2pl.twitter[1:20,]; top20trig.twitter$TriGram <- as.character(top20trig.twitter$TriGram)

 library(ggplot2)
 #install.packages("ggpubr")
 library(ggpubr)
 
 pp1 <- ggplot (top20trig.blogs, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram Blogs" ) +
         ylab( "Freq & RelFreq %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()
 
 pp2 <- ggplot (top20trig.news, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram News" ) +
         ylab( "Freq & RelFreq %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()
 
 pp3 <- ggplot (top20trig.twitter, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram Twits" ) +
         ylab( "Freq & RelFrequency %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()

# Arrange plots (3,1)  
 ggarrange(pp1, pp2, pp3 + rremove("x.text"), 
           #labels = c("A", "B", "C"),
           labels = NULL,
           ncol = 3, nrow = 1)

Top 20 highest frequency 1gram, 2gram and 3 grams for ngrams excluding stop words for the three separate - blogs/news/twitter - corpora:

###
# 3grams from each source excluding stop words
###
 
rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")
  
load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_3gram.2pl_docs.nonstop.Rdata", verbose = FALSE)
 
 # extract top 20 by freq
 top20trig.blogs<-sdf3grm.2pl.blogs[1:20,]; top20trig.blogs$TriGram <- as.character(top20trig.blogs$TriGram)
 top20trig.news<-sdf3grm.2pl.news[1:20,]; top20trig.news$TriGram <- as.character(top20trig.news$TriGram)
 top20trig.twitter<-sdf3grm.2pl.twitter[1:20,]; top20trig.twitter$TriGram <- as.character(top20trig.twitter$TriGram)

 library(ggplot2)
 #install.packages("ggpubr")
 library(ggpubr)
 
 pp1 <- ggplot (top20trig.blogs, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram Blogs" ) +
         ylab( "Freq & RelFreq %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()
 
 pp2 <- ggplot (top20trig.news, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram News" ) +
         ylab( "Freq & RelFreq %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()
 
 pp3 <- ggplot (top20trig.twitter, aes(x = reorder(TriGram, - Frequency), y= Frequency )) + 
         geom_bar( stat = "Identity" , fill = "mistyrose" ) +  
         coord_flip() +
         geom_text( aes (label = Frequency ) , vjust = - 0.3, hjust = 1.1, size = 2.5) +
         geom_text( aes (label = paste(RelFrequency, "%") ) , vjust = + 1.30, hjust = 1.1, size = 2.5) +
         xlab( "TriGram Twits" ) +
         ylab( "Freq & RelFrequency %" ) +
         theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) ) +
         theme_bw()

# Arrange plots (3,1)  
 ggarrange(pp1, pp2, pp3 + rremove("x.text"), 
           #labels = c("A", "B", "C"),
           labels = NULL,
           ncol = 3, nrow = 1)

Cumulative frequencies for 1grams excluding stop words:

###
# Cumulative frequences for 1grams excluding stop words
###
 
# How many unique words do you need in a frequency sorted dictionary to cover 50% 
# of all word instances in the language? 90%?
 
rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")


load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)

# plot number of words vs culative relative frequency.
# this shows how dictinary coverage increases with addition of more words
# it is an increasing with a decreasing rate process. first deriv+, second deriv-
plot(sdf1grm$RowNum[1:50000], sdf1grm$RelCumFreq[1:50000], type="l", lty=1, 
     ylab = "Cumulative Relative Frequency (0%-100% coverage)", xlab = "Top Frequency Words")
lines(sdf1grm.blogs$RelCumFreq, type="l", lty=1, col = "blue")
lines(sdf1grm.news$RelCumFreq, type="l", lty=1, col = "green")
lines(sdf1grm.twitter$RelCumFreq, type="l", lty=1, col = "red")
legend("bottomright", 
       legend = c("Combined", "Blogs", "News", "Tweets"), 
       col = c("black","blue", "green", "red" ), 
       pch = c(19,19, 19, 19), bty = "n", pt.cex = 2, cex = 1.2, 
       text.col = c("black","blue", "green", "red" ), horiz = F 
       #       inset = c(0.1, 0.1))
)

# find some quantiles, that show how quickly the number of words needed rises to 
# achieve more covereage of the dictionary
paste("number of words comprising 49.99 to 50.01% of the dictionary: ", range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 49.99 & sdf1grm$RelCumFreq < 50.01])[1], range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 49.99 & sdf1grm$RelCumFreq < 50.01])[2])
paste("number of words comprising 50.01 to 60.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 50.01 & sdf1grm$RelCumFreq < 60.01])[1],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 50.01 & sdf1grm$RelCumFreq < 60.01])[2])
paste("number of words comprising 60.01 to 70.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 60.01 & sdf1grm$RelCumFreq < 70.01])[1],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 60.01 & sdf1grm$RelCumFreq < 70.01])[2])
paste("number of words comprising 70.01 to 80.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 70.01 & sdf1grm$RelCumFreq < 80.01])[1],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 70.01 & sdf1grm$RelCumFreq < 80.01])[2])
paste("number of words comprising 80.01 to 90.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 80.01 & sdf1grm$RelCumFreq < 90.01])[1],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 80.01 & sdf1grm$RelCumFreq < 90.01])[2])
paste("number of words comprising 90.01 to 95.01% of the dictionary: ",range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 89.99 & sdf1grm$RelCumFreq < 90.01])[2],range(sdf1grm$RowNum[sdf1grm$RelCumFreq > 90.01 & sdf1grm$RelCumFreq < 95.01])[2])

TTR calculations for 1grams including stop words

###
# Generates TTR: word.type to token ratio for 1gram including stop words. 
# It demonstrates how rich the language is in a particular corpora. 
# The higher TTR the more unique words are used to achieve to number of tokens
###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_1gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_1gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_1gram_docs.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.Rdata", verbose = FALSE)

# calculate TTR by source and combined
blogs.tokens <- sum(sdf1grm.blogs$Frequency); blogs.types <- dim(sdf1grm.blogs)[1];
blogs.TTR <- blogs.types/blogs.tokens;
news.tokens <- sum(sdf1grm.news$Frequency); news.types <- dim(sdf1grm.news)[1];
news.TTR <- news.types/news.tokens;
twitter.tokens <- sum(sdf1grm.twitter$Frequency); twitter.types <- dim(sdf1grm.twitter)[1];
twitter.TTR <- twitter.types/twitter.tokens;
tokens <- sum( sdf1grm$Frequency); types <- dim(sdf1grm)[1];
TTR <- types/tokens;

TTR <- data.frame("Source" = c("Blogs", "News", "Twitter", "Corpus"),
                          "Tokens" = c(blogs.tokens,news.tokens,twitter.tokens,tokens), 
                          "Word.Types" = c(blogs.types,news.types,twitter.types,types),
                          "TTR" = c(blogs.TTR,news.TTR,twitter.TTR,TTR))
print("Word Types to tokens ratio (TTR) for corpora with NO stop words removed")
TTR

TTR calculations for 1grams excluding stop words

###
# Generates TTR: word.type to token ratio for 1gram excluding stop words. 
# It demonstrates how rich the language is in a particular corpora. 
# The higher TTR the more unique words are used to achieve to number of tokens
###

rm(list=ls())
setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course10")

load("./data/en_US/train.sample/Rdata_output/blogs.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/news.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/twitter.train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)
load("./data/en_US/train.sample/Rdata_output/train70pct_1gram_docs.nonstop.Rdata", verbose = FALSE)

# calculate TTR by source and combined
blogs.nonstop.tokens <- sum(sdf1grm.blogs$Frequency); blogs.nonstop.types <- dim(sdf1grm.blogs)[1];
blogs.nonstop.TTR <- blogs.nonstop.types/blogs.nonstop.tokens;
news.nonstop.tokens <- sum(sdf1grm.news$Frequency); news.nonstop.types <- dim(sdf1grm.news)[1];
news.nonstop.TTR <- news.nonstop.types/news.nonstop.tokens;
twitter.nonstop.tokens <- sum(sdf1grm.twitter$Frequency); twitter.nonstop.types <- dim(sdf1grm.twitter)[1];
twitter.nonstop.TTR <- twitter.nonstop.types/twitter.nonstop.tokens;
nonstop.tokens <- sum( sdf1grm$Frequency); nonstop.types <- dim(sdf1grm)[1];
nonstop.TTR <- nonstop.types/nonstop.tokens;


TTR.nonstop <- data.frame("Source" = c("Blogs", "News", "Twitter", "Corpus"),
                "Tokens" = c(blogs.nonstop.tokens,news.nonstop.tokens,twitter.nonstop.tokens,nonstop.tokens), 
                "Word.Types" = c(blogs.nonstop.types,news.nonstop.types,twitter.nonstop.types,nonstop.types),
                "TTR" = c(blogs.nonstop.TTR,news.nonstop.TTR,twitter.nonstop.TTR,nonstop.TTR))
print("Word Types to tokens ratio (TTR) for corpora with stop words removed")
TTR.nonstop