This report presents an exploratory analysis of the training dataset used for the Data Science Capstone assignment on natural language processing of English text.
We first load three documents, each from a different source: Twitter, blogs, and news. The blogs file contains around 900,000 lines, the news file a little over a million, and the Twitter file nearly 2.4 million. We sampled 10% of each text file to reduce computing time and consolidated the documents into a single combined corpus. A summary of the combined corpus is shown below.
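The loading and sampling step looks roughly like the following minimal sketch using the quanteda package. The file paths and the `read_sample` helper are illustrative assumptions rather than the exact code used.

```r
library(quanteda)

set.seed(1234)   # illustrative seed for reproducible sampling

# Hypothetical helper: read a file and keep a 10% random sample of its lines
read_sample <- function(path, frac = 0.10) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, floor(length(lines) * frac))
}

blogs   <- read_sample("en_US.blogs.txt")    # file paths are assumptions
news    <- read_sample("en_US.news.txt")
twitter <- read_sample("en_US.twitter.txt")

# Collapse each sample into one document per source and combine into one corpus
combined <- corpus(c(twitter = paste(twitter, collapse = " "),
                     blogs   = paste(blogs,   collapse = " "),
                     news    = paste(news,    collapse = " ")))
summary(combined)
```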
## Corpus consisting of 3 documents:
##
## Text Types Tokens Sentences
## twitter 139824 3673367 259483
## blogs 131686 4292950 207797
## news 131251 3988562 186643
##
## Source: Concatenation by c.corpus()
## Created: Tue Mar 05 09:54:33 2019
## Notes:
Next, we created document-feature matrices for unigrams, bigrams, and trigrams. For the unigram model, we removed punctuation, symbols, Twitter hashtags, and common English stopwords. The most frequent words are displayed below.
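A sketch of the unigram document-feature matrix, assuming the combined corpus from above. The exact cleaning options may have differed (older quanteda versions exposed a `remove_twitter` argument); here hashtags are dropped with a glob pattern instead.

```r
# Unigrams: drop punctuation, symbols, hashtags, and English stopwords
toks     <- tokens(combined, remove_punct = TRUE, remove_symbols = TRUE)
toks_uni <- tokens_remove(toks, pattern = c(stopwords("english"), "#*"))
dfm_uni  <- dfm(toks_uni)
topfeatures(dfm_uni)
```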
## said just one like can get time new now good
## 30552 30277 28943 27063 25155 22821 21583 19340 18079 17915
For bigrams, we also removed punctuation, symbols, and Twitter hashtags, but kept the stopwords.
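A corresponding sketch for the bigram matrix, reusing the token object from above but keeping stopwords (again an approximation, not the exact code):

```r
# Bigrams: same cleaning, but stopwords are kept
toks_bi <- tokens_remove(toks, pattern = "#*")
dfm_bi  <- dfm(tokens_ngrams(toks_bi, n = 2))
topfeatures(dfm_bi)
```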
## of_the in_the to_the for_the on_the to_be at_the and_the
## 42774 40726 21319 20173 19629 16430 14396 12490
## in_a with_the
## 12052 10815
The same preprocessing applies to the trigrams.
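The trigram matrix can be built the same way, only changing `n` (sketch continuing from the bigram code above):

```r
dfm_tri <- dfm(tokens_ngrams(toks_bi, n = 3))
topfeatures(dfm_tri)
```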
## one_of_the a_lot_of thanks_for_the to_be_a going_to_be
## 3531 3040 2447 1896 1820
## i_want_to out_of_the the_end_of some_of_the it_was_a
## 1546 1470 1454 1419 1382
I built my first predictive model with a basic bigram prediction, where the algorithm looks at the preceding word and chooses the most frequent bigram that begins with that word. For example, for the sentence “I have a car”, the model looks at all the bigrams that start with “I” and chooses the one with the highest frequency, which happens to be “I have”. Next, the model chooses the bigram starting with “have” that has the highest frequency in our corpus, and so on.
Hence, this model only looks at the word immediately before the one we are trying to predict. The results are not great. We predicted on two sentences, “I have a beautiful car” and “who let the d0gs out”. As we can see below, predicting from only the previous word produces sentences that may be somewhat grammatically sound but make little sense. Therefore, we need a better way of predicting. In the next part of the assignment, I plan to use the backoff and interpolation methods, as well as Kneser-Ney smoothing, to improve the model. For the Shiny app, I plan to optimize my model to give users a SwiftKey-like experience, where a user can enter a word and receive suggestions for the next word, adapting dynamically as the sentence progresses.
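A minimal sketch of such a greedy bigram predictor is shown here; the `predict_greedy` helper is hypothetical and is not the exact code that produced the output that follows.

```r
# Hypothetical sketch of the greedy bigram predictor (not the exact code
# behind the output below). Uses the bigram dfm built earlier.
bigram_freq <- colSums(dfm_bi)
parts       <- strsplit(names(bigram_freq), "_", fixed = TRUE)
first_word  <- vapply(parts, `[`, character(1), 1)
second_word <- vapply(parts, `[`, character(1), 2)

predict_greedy <- function(start, n_words = 5) {
  sentence <- start
  current  <- start
  for (i in seq_len(n_words)) {
    candidates <- which(first_word == current)
    if (length(candidates) == 0) break     # no bigram starts with this word
    best     <- candidates[which.max(bigram_freq[candidates])]
    current  <- second_word[best]
    sentence <- c(sentence, current)
  }
  paste(sentence, collapse = " ")
}

# Chain five predictions from each starting word
matrix(sapply(c("i", "who"), predict_greedy), ncol = 1)
```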
## [,1]
## [1,] "i have a lot day and"
## [2,] "who is me first and of"