Summary

The goal of this milestone report is to show the progress made on the first part of the capstone project. The exploratory data analysis was performed on the three provided documents: text files containing blog posts, news articles, and tweets. N-gram models (sequences of n consecutive words) were created for n = 1, 2, 3, and 4. The number of n-grams needed to cover 50% and 90% of the data is considerably smaller than the whole corpus, which can be exploited to reduce memory usage and model processing time. The next-word prediction algorithm will be based on Katz's back-off model.

Data Loading

The corpus (the raw text database) was created from the three provided documents. The number of sentences from each document is presented below.

##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##           2015464            140254           2574102

The corpus was then reshaped to extract the individual sentences of the texts. A sample sentence is shown below.

## Corpus consisting of 1 document.
## en_US.blogs.txt.3000 :
## "I think the peachness of it makes me think it's summer time."

Data Cleaning

Tokens, strings of contiguous characters, were then extracted from the corpus; they can be words, punctuation marks, symbols, etc. Punctuation, separators, and symbols were removed. The total number of tokens in the corpus and the number of types (unique tokens) are presented below.

##    n_token   n_type
## 1 70536673 64625571

Stop words were left in the corpus because they are important for a prediction model. Words were not stemmed because each word variant matters for prediction accuracy. All words were lower-cased and profanity was removed.
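A minimal sketch of this cleaning step with quanteda is shown below; profanity_list is a placeholder for whichever profanity lexicon is actually used.

# Assumed sketch: tokenize and clean the reshaped corpus
toks <- tokens(corp,
               remove_punct      = TRUE,    # drop punctuation marks
               remove_symbols    = TRUE,    # drop symbols
               remove_separators = TRUE)    # drop separators
toks <- tokens_tolower(toks)                # lower-case every token
toks <- tokens_remove(toks, profanity_list) # profanity_list: placeholder lexicon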

N-Gram Models

N-gram models were created for N = 1, 2, 3, and 4.
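One way these models could be built with quanteda is sketched below; the dfm names follow the g2dfm naming already used in this report, and toks is the cleaned token object assumed above.

# Assumed sketch: build n-gram document-feature matrices for n = 1..4
g1dfm <- dfm(toks)                        # unigrams
g2dfm <- dfm(tokens_ngrams(toks, n = 2))  # bigrams, joined with "_"
g3dfm <- dfm(tokens_ngrams(toks, n = 3))  # trigrams
g4dfm <- dfm(tokens_ngrams(toks, n = 4))  # 4-grams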

1-Gram

The 1-gram model shows stop words as the most frequent tokens. This is expected, since these words are the building blocks of the language. Below, a word cloud of the 100 most frequent words is presented, along with the top-ten word table and a plot of the 20 most frequent unigrams.
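The table and word cloud could be produced along the following lines (a sketch; g1dfm is the unigram dfm assumed earlier, and g1freq matches the frequency table used later in this report).

# Assumed sketch: unigram frequency table and word cloud
g1freq <- textstat_frequency(g1dfm)         # ranked frequency table
head(g1freq, 10)                            # top-ten unigrams
textplot_wordcloud(g1dfm, max_words = 100)  # 100 most frequent words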

##    feature frequency rank docfreq group
## 1      the   2939603    1 1807523   all
## 2       to   1922021    2 1430441   all
## 3      and   1598158    3 1225640   all
## 4        a   1573398    4 1222019   all
## 5        i   1501610    5 1129688   all
## 6       of   1293159    6  991405   all
## 7       in   1022482    7  858131   all
## 8      you    849173    8  685465   all
## 9       is    812081    9  715518   all
## 10     for    774644   10  690682   all

For the 1-gram model, the number of unique words needed to cover 50% and 90% of the total corpus frequency was calculated. As seen below, 150 words already give a cumulative frequency above 50%, and 8,000 words cover more than 90% of the corpus.

# Cumulative frequency share covered by the top 150 and top 8,000 unigrams
fifty  <- sum(g1freq$frequency[1:150]) / sum(g1freq$frequency)
ninety <- sum(g1freq$frequency[1:8000]) / sum(g1freq$frequency)
data.frame(Percentage = c(fifty, ninety) * 100, Words = c(150, 8000))
##   Percentage Words
## 1   51.77031   150
## 2   90.27763  8000
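More generally, the number of top-ranked words needed to reach any target coverage can be read off the cumulative frequency, as in this small sketch (words_for_coverage is a hypothetical helper, not part of the original analysis).

# Assumed sketch: smallest number of top-ranked words whose cumulative
# frequency reaches a coverage proportion p
words_for_coverage <- function(freq, p) {
  which(cumsum(freq) / sum(freq) >= p)[1]
}
words_for_coverage(g1freq$frequency, 0.5)  # words needed for 50% coverage
words_for_coverage(g1freq$frequency, 0.9)  # words needed for 90% coverage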

2-Gram

The 2-gram model is still mostly populated with combinations of stop words. Nonetheless, this model captures basic phrase patterns more accurately. The word cloud plot of the top 100 2-grams is presented, along with the top-ten 2-gram table.

g2freq <- textstat_frequency(g2dfm)  # ranked 2-gram frequency table
head(g2freq, 10)                     # top-ten 2-grams
##    feature frequency rank docfreq group
## 1   of_the    257539    1  233535   all
## 2   in_the    244138    2  230417   all
## 3  for_the    137135    3  133523   all
## 4   to_the    135548    4  130711   all
## 5   on_the    128782    5  124285   all
## 6    to_be    118434    6  114208   all
## 7   at_the     88901    7   86727   all
## 8   i_have     79759    8   77037   all
## 9  and_the     77650    9   75261   all
## 10   i_was     75705   10   71991   all
ggplotly(ggplot(g2freq[1:20,], aes(x = reorder(feature, -frequency), 
                                weight = frequency)) + 
        geom_bar(color = "steelblue", fill = "skyblue") + 
        theme(axis.text.x = element_text(angle = 45)) + 
        labs(title ="Top-20 Bi-grams", x = "Bi-grams", y = "Frequency"))

3-Gram

The 3-gram model presents some very basic phrases. As seen below, stop words are mostly used to connect nouns and verbs. The top-100 word cloud plot is presented next to the top-ten 3-gram table.

##               feature frequency rank docfreq group
## 1      thanks_for_the     23768    1   23675   all
## 2          one_of_the     21042    2   20830   all
## 3            a_lot_of     19323    3   18935   all
## 4           i_want_to     13171    4   12978   all
## 5             to_be_a     13165    5   13054   all
## 6         going_to_be     12722    6   12618   all
## 7            i_have_a     10876    7   10820   all
## 8             i_don't     10587    8   10312   all
## 9  looking_forward_to     10561    9   10538   all
## 10          i_have_to     10324   10   10232   all

4-Gram

The 4-gram model presents the most complete sentence structure and the smallest proportion of stop words. However, the frequency of each 4-gram is considerably lower than the frequencies seen in lower-order n-gram models. The word cloud plot is presented next to the top-ten table.

##                  feature frequency rank docfreq group
## 1  thanks_for_the_follow      6246    1    6241   all
## 2         the_end_of_the      4981    2    4956   all
## 3        the_rest_of_the      4648    3    4631   all
## 4          at_the_end_of      4254    4    4235   all
## 5     for_the_first_time      3779    5    3768   all
## 6       at_the_same_time      3532    6    3521   all
## 7         is_going_to_be      3528    7    3517   all
## 8      thanks_for_the_rt      3339    8    3338   all
## 9      thank_you_for_the      3169    9    3166   all
## 10     can't_wait_to_see      2997   10    2996   all

Personal Insights

The corpus is very extensive; even with modern computing power, working with the whole data set is difficult. Nonetheless, as seen in the unigram analysis, just 150 words cover about 50% of the corpus and fewer than 8,000 words cover more than 90%. With this in mind, the corpus can be trimmed to obtain a leaner model, as sketched below.
Also, the frequency of the top-10 n-grams decreases as n grows, because identical sequences of 3 or 4 words are harder to find. This results in the most frequent 4-gram being "thanks for the follow".
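The trimming mentioned above could look like the following sketch (the cut-off of 8,000 words comes from the coverage analysis; dfm_select on the named feature list is only one possible way to apply it).

# Assumed sketch: keep only the ~8,000 words that cover 90% of the corpus
keep_words  <- g1freq$feature[1:8000]
g1dfm_small <- dfm_select(g1dfm, pattern = keep_words, selection = "keep")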

Next-Word Prediction Model

The next-word prediction model will be based on Katz's back-off model. The idea is to estimate the probability of the next word given the previous one, two, or three words. With the 4-gram model, the last word of a matching 4-gram is the prediction candidate: dividing the frequency of each 4-gram that shares the observed three-word prefix by the total frequency of that prefix gives the conditional probability of each candidate. When no 4-gram matches the prefix, the model backs off to the 3-gram, 2-gram, and finally the 1-gram model.
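A minimal sketch of such a back-off lookup over the frequency tables is shown below. Here predict_next is a hypothetical helper, g3freq and g4freq are assumed to exist alongside the g1freq/g2freq tables already used, and all tables are assumed to have the feature and frequency columns shown above. It only illustrates the look-up logic; a full Katz model would also apply discounting to the counts.

# Assumed sketch: back off from 4-grams to 3-grams to 2-grams when looking up
# the most likely next word for a given three-word prefix. For a fixed prefix,
# picking the highest raw frequency is equivalent to picking the highest
# conditional probability (count of prefix + word divided by count of prefix).
predict_next <- function(prefix_words, g4freq, g3freq, g2freq) {
  tables   <- list(g4freq, g3freq, g2freq)
  prefixes <- list(tail(prefix_words, 3), tail(prefix_words, 2), tail(prefix_words, 1))
  for (i in seq_along(tables)) {
    prefix  <- paste(prefixes[[i]], collapse = "_")
    matches <- tables[[i]][startsWith(tables[[i]]$feature, paste0(prefix, "_")), ]
    if (nrow(matches) > 0) {
      best  <- matches[which.max(matches$frequency), ]
      parts <- strsplit(best$feature, "_")[[1]]
      return(tail(parts, 1))  # last word of the best-matching n-gram
    }
  }
  "the"                       # fall back to the most frequent unigram
}

predict_next(c("at", "the", "same"), g4freq, g3freq, g2freq)  # likely returns "time"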