Summary

This report contains an exploratory analysis of the training dataset used for the data science capstone assignment on natural language processing of English text.

We first load in three documents, each from a different source: Twitter, blogs, and news. The blogs file contains around 900,000 sentences, the news file a little over a million, and the Twitter file nearly 2.4 million. We sampled 10% of each text file to reduce computing time and consolidated the samples into a combined corpus. A summary of the combined corpus is shown below.
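The code chunks are not echoed in this report, so the sketch below shows roughly how the loading and sampling can be done. The file names, the seed, and the quanteda calls are assumptions; the summary output below was produced by an older quanteda version, where corpora were concatenated with `c()`.

```r
# Minimal sketch of the loading/sampling step (file names, seed, and the
# quanteda API are assumptions; the report's own code is not shown).
library(quanteda)

set.seed(1234)
sample_lines <- function(path, prop = 0.10) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, round(length(lines) * prop))
}

twitter <- sample_lines("en_US.twitter.txt")
blogs   <- sample_lines("en_US.blogs.txt")
news    <- sample_lines("en_US.news.txt")

# One document per source, combined into a single corpus
combined <- corpus(c(twitter = paste(twitter, collapse = " "),
                     blogs   = paste(blogs,   collapse = " "),
                     news    = paste(news,    collapse = " ")))
summary(combined)
```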

## Corpus consisting of 3 documents:
## 
##     Text  Types  Tokens Sentences
##  twitter 139824 3673367    259483
##    blogs 131686 4292950    207797
##     news 131251 3988562    186643
## 
## Source: Concatenation by c.corpus()
## Created: Tue Mar 05 09:54:33 2019
## Notes:

Next, we created document-feature matrices for unigrams, bigrams, and trigrams. For the unigram model, we removed punctuation, symbols, Twitter hashtags, and common stopwords. The ten most frequent words are displayed below.
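A minimal sketch of this preprocessing, assuming the `combined` corpus from the sketch above and quanteda's tokeniser; the exact `remove_*` arguments and the stopword list are assumptions about the steps described in the text.

```r
# Tokenise once, dropping punctuation, symbols, and Twitter hashtags
toks <- tokens(combined, remove_punct = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, pattern = "#*")   # glob pattern for hashtags

# Unigrams: additionally remove common English stopwords
dfm_uni <- dfm(tokens_remove(toks, stopwords("en")))
topfeatures(dfm_uni, 10)
```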

##  said  just   one  like   can   get  time   new   now  good 
## 30552 30277 28943 27063 25155 22821 21583 19340 18079 17915

For bigrams, we also removed punctuation, symbols, and Twitter hashtags, but kept the stopwords.
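The bigram and trigram counts reuse the same cleaned tokens, only without the stopword removal; the trigram case discussed next differs only in the n-gram size. This is again a sketch, assuming quanteda's `tokens_ngrams()`, whose default `_` concatenator matches the features shown below.

```r
# Bigrams and trigrams keep the stopwords; words are joined with "_"
dfm_bi  <- dfm(tokens_ngrams(toks, n = 2))
dfm_tri <- dfm(tokens_ngrams(toks, n = 3))

topfeatures(dfm_bi, 10)
topfeatures(dfm_tri, 10)
```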

##   of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the 
##    42774    40726    21319    20173    19629    16430    14396    12490 
##     in_a with_the 
##    12052    10815

The same preprocessing applies to the trigrams; only the n-gram size changes in the sketch above.

##     one_of_the       a_lot_of thanks_for_the        to_be_a    going_to_be 
##           3531           3040           2447           1896           1820 
##      i_want_to     out_of_the     the_end_of    some_of_the       it_was_a 
##           1546           1470           1454           1419           1382

Building an n-gram model

I built my first predictive model with a basic bigram prediction, where the algorithm looks at the preceding word and chooses the most frequent bigram beginning with that word. For example, for the sentence “I have a car”, the model looks at all the bigrams that start with “I” and chooses the one with the highest frequency, which happens to be “I have”. Next, the model chooses the most frequent bigram starting with “have”, and so on.
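The prediction code itself is not shown in the report, so the following is only a sketch of the lookup just described: precompute, for every first word, its most frequent bigram continuation, then chain those lookups to generate words. The use of `data.table` and all helper names here are assumptions.

```r
# Sketch of the basic bigram predictor: most frequent continuation per word.
library(data.table)

bi <- data.table(bigram = featnames(dfm_bi), count = colSums(dfm_bi))
bi[, c("w1", "w2") := tstrsplit(bigram, "_", fixed = TRUE, keep = 1:2)]
setorder(bi, w1, -count)
best_next <- bi[!duplicated(w1)]     # top-scoring second word for each first word

predict_next <- function(word) {
  hit <- best_next[w1 == tolower(word), w2]
  if (length(hit) == 0) NA_character_ else hit
}

# Repeatedly predict the next word from the last predicted word
generate <- function(seed, n_words = 5) {
  out <- tolower(seed)
  for (i in seq_len(n_words)) {
    nxt <- predict_next(tail(out, 1))
    if (is.na(nxt)) break
    out <- c(out, nxt)
  }
  paste(out, collapse = " ")
}

generate("I")     # seeded with the first word of "I have a beautiful car"
generate("who")
```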

Hence, this model only looks at the word immediately before the one we are trying to predict, and the results are not great. We predicted on two sentences, “I have a beautiful car” and “who let the d0gs out”. As we can see below, predicting from only the previous word produces sentences that may sound somewhat grammatical but are incorrect. Therefore, we need a better way of predicting. In the next part of the assignment, I plan to use the backoff and interpolation methods, as well as Kneser-Ney smoothing, to improve the model. For the Shiny app, I plan to optimize my model to give users a SwiftKey-like experience, where a user enters a word and the app suggests the next word, adapting dynamically as the sentence progresses.

##      [,1]                    
## [1,] "i have a lot day and"  
## [2,] "who is me first and of"