The goal of this milestone report is to show the progress made in the first part of the capstone project. The exploratory data analysis was based on the three provided documents: text files containing blog posts, news articles, and tweets. N-gram models (sequences of n consecutive words) were created for n = 1, 2, 3, and 4. The number of n-grams needed to cover 50% and 90% of the data is considerably smaller than the whole corpus, which can be exploited to reduce memory use and model processing time. The next-word prediction algorithm will be based on Katz's back-off model.
The corpus, the raw text database, was created from the three provided documents. Below, the number of sentences in each document is presented.
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## 2015464 140254 2574102
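The report output does not show how the corpus was assembled; the following is a minimal sketch of one way to do it with quanteda. The file names, the use of readLines(), and the nsentence() call are assumptions, not the original code.
library(quanteda)
# File names are assumptions about the local copy of the provided data set
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
texts <- vapply(files,
                function(f) paste(readLines(f, encoding = "UTF-8", skipNul = TRUE),
                                  collapse = " "),
                character(1))
corp <- corpus(texts)   # one corpus document per source file
nsentence(corp)         # sentences per document; newer quanteda releases favour reshaping first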
Then, the corpus was reshaped to obtain the individual sentences of the texts. A sample sentence is shown below.
## Corpus consisting of 1 document.
## en_US.blogs.txt.3000 :
## "I think the peachness of it makes me think it's summer time."
Then, tokens were extracted from the corpus. A token is a string of contiguous characters, so it can be a word, a punctuation mark, a symbol, etc. Punctuation, separators, and symbols were removed. Below, the total number of tokens in the corpus and the number of types (unique tokens) are presented.
## n_token n_type
## 1 70536673 64625571
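A sketch of the tokenisation step, assuming the sentence-level corpus from above. How n_type was computed in the original is not shown; summing per-sentence types, as below, is one way to arrive at counts of the reported magnitude.
# Tokenise; drop punctuation, symbols, and separators
toks <- tokens(corp_sent,
               remove_punct = TRUE,
               remove_symbols = TRUE,
               remove_separators = TRUE)
# Total tokens and per-document unique types summed over the corpus
data.frame(n_token = sum(ntoken(toks)),
           n_type  = sum(ntype(toks)))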
Stop words were left in the corpus because they are important in a prediction model. Words were not stemmed, since in a prediction algorithm each word variant matters for accuracy. All words were lower-cased and profanity was removed.
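The lower-casing and profanity filtering might look like this; the profanity word-list file is a hypothetical placeholder, not something from the original analysis.
# Lower-case every token
toks <- tokens_tolower(toks)
# Remove profanity using an external word list (file name is a placeholder)
profanity <- readLines("profanity_list.txt", encoding = "UTF-8")
toks <- tokens_remove(toks, pattern = profanity)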
N-gram models were created for N = 1, 2, 3, and 4.
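One way to build the four models with quanteda, assuming the cleaned tokens object toks; the gN* names match those used later in this report.
# n-gram tokens are joined with "_" by default, e.g. "of_the"
g1dfm <- dfm(toks)
g2dfm <- dfm(tokens_ngrams(toks, n = 2))
g3dfm <- dfm(tokens_ngrams(toks, n = 3))
g4dfm <- dfm(tokens_ngrams(toks, n = 4))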
The 1-gram model shows stop words as the most frequent features. This is expected, as these words are the building blocks of the language. Below, a word cloud of the 100 most frequent words is presented, along with the top-ten word table and a bar plot of the 20 most frequent unigrams.
## feature frequency rank docfreq group
## 1 the 2939603 1 1807523 all
## 2 to 1922021 2 1430441 all
## 3 and 1598158 3 1225640 all
## 4 a 1573398 4 1222019 all
## 5 i 1501610 5 1129688 all
## 6 of 1293159 6 991405 all
## 7 in 1022482 7 858131 all
## 8 you 849173 8 685465 all
## 9 is 812081 9 715518 all
## 10 for 774644 10 690682 all
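The frequency table and word cloud above could be produced as sketched below; depending on the quanteda version, textstat_frequency() and textplot_wordcloud() live in the quanteda.textstats and quanteda.textplots companion packages.
library(quanteda.textstats)   # textstat_frequency() in quanteda >= 3
library(quanteda.textplots)   # textplot_wordcloud() in quanteda >= 3
# Frequency table behind the top-ten display above
g1freq <- textstat_frequency(g1dfm)
head(g1freq, 10)
# Word cloud of the 100 most frequent unigrams
textplot_wordcloud(g1dfm, max_words = 100)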
For the 1-gram model, the number of unique words needed to cover 50% and 90% of the total corpus frequency was calculated. As seen below, the top 150 words already account for more than 50% of the cumulative frequency, and the top 8,000 words cover more than 90% of the corpus.
# Cumulative coverage of the top 150 and top 8,000 unigrams
fifty <- sum(g1freq$frequency[1:150]) / sum(g1freq$frequency)
ninety <- sum(g1freq$frequency[1:8000]) / sum(g1freq$frequency)
data.frame(Percentage = c(fifty, ninety) * 100, Words = c(150, 8000))
## Percentage Words
## 1 51.77031 150
## 2 90.27763 8000
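Instead of hand-picking the cut-offs of 150 and 8,000 words, the exact count for any coverage target can be computed with a small helper; this function is not part of the original analysis.
# Smallest number of top-ranked words whose cumulative frequency reaches `target`
# (assumes the frequency vector is already sorted in decreasing order,
#  as returned by textstat_frequency)
coverage_words <- function(frequency, target) {
  which(cumsum(frequency) / sum(frequency) >= target)[1]
}
coverage_words(g1freq$frequency, 0.50)
coverage_words(g1freq$frequency, 0.90)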
The 2-gram model is still dominated by stop-word combinations. Nonetheless, this model is better suited to expressing basic ideas. The word cloud of the top 100 2-grams is presented, along with the top-ten 2-gram table.
# Frequency table of the 2-gram model
g2freq <- textstat_frequency(g2dfm)
head(g2freq, 10)
## feature frequency rank docfreq group
## 1 of_the 257539 1 233535 all
## 2 in_the 244138 2 230417 all
## 3 for_the 137135 3 133523 all
## 4 to_the 135548 4 130711 all
## 5 on_the 128782 5 124285 all
## 6 to_be 118434 6 114208 all
## 7 at_the 88901 7 86727 all
## 8 i_have 79759 8 77037 all
## 9 and_the 77650 9 75261 all
## 10 i_was 75705 10 71991 all
# Interactive bar plot of the 20 most frequent 2-grams
ggplotly(ggplot(g2freq[1:20, ], aes(x = reorder(feature, -frequency),
                                    weight = frequency)) +
           geom_bar(color = "steelblue", fill = "skyblue") +
           theme(axis.text.x = element_text(angle = 45)) +
           labs(title = "Top-20 Bi-grams", x = "Bi-grams", y = "Frequency"))
The 3-gram model presents some very basic sentence fragments. As seen below, stop words are mostly used to connect nouns and verbs. The word cloud of the top 100 3-grams is presented next to the top-ten 3-gram table.
## feature frequency rank docfreq group
## 1 thanks_for_the 23768 1 23675 all
## 2 one_of_the 21042 2 20830 all
## 3 a_lot_of 19323 3 18935 all
## 4 i_want_to 13171 4 12978 all
## 5 to_be_a 13165 5 13054 all
## 6 going_to_be 12722 6 12618 all
## 7 i_have_a 10876 7 10820 all
## 8 i_donâ_t 10587 8 10312 all
## 9 looking_forward_to 10561 9 10538 all
## 10 i_have_to 10324 10 10232 all
The 4-gram model presents the most complete sentence structures and the smallest proportion of stop words. Nonetheless, the frequency of each 4-gram is considerably lower than the frequencies seen in lower-order n-gram models. The word cloud is presented next to the top-ten table.
## feature frequency rank docfreq group
## 1 thanks_for_the_follow 6246 1 6241 all
## 2 the_end_of_the 4981 2 4956 all
## 3 the_rest_of_the 4648 3 4631 all
## 4 at_the_end_of 4254 4 4235 all
## 5 for_the_first_time 3779 5 3768 all
## 6 at_the_same_time 3532 6 3521 all
## 7 is_going_to_be 3528 7 3517 all
## 8 thanks_for_the_rt 3339 8 3338 all
## 9 thank_you_for_the 3169 9 3166 all
## 10 can't_wait_to_see 2997 10 2996 all
The corpus is very large; even with modern computing power, working with the whole data set is difficult. Nonetheless, as seen in the unigram analysis, just 150 words cover 50% of the corpus and fewer than 8,000 words cover more than 90%. With this in mind, the corpus can be trimmed to a leaner model.
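In practice, this trimming can be done directly on the document-feature matrices, for example by dropping rare n-grams; the threshold below is an arbitrary assumption, not a value from the original analysis.
# Keep only n-grams that occur at least 5 times; rare n-grams add little coverage
g4dfm_trimmed <- dfm_trim(g4dfm, min_termfreq = 5)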
Also, the frequency of the top-10 n-grams decreases as n grows, because identical sequences of 3 or 4 words are harder to find. As a result, the most frequent 4-gram is only "thanks for the follow".
The next-word prediction model will be based on Katz's back-off model. The idea is to obtain the probability of the next word given the previous one, two, or three words. With the 4-gram model, the last word of a matching 4-gram is the prediction: given the previous three words, the frequency of each 4-gram sharing that prefix is divided by the total frequency of all 4-grams with that prefix to obtain the probability of each candidate word. When no 4-gram matches, the model backs off to the 3-gram and 2-gram models.
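A simplified sketch of this idea is shown below. It is a plain back-off by relative frequency (often called "stupid back-off"), without the discounting that full Katz smoothing requires; the gNfreq arguments are assumed to be the textstat_frequency() tables of each n-gram model, and the function name is an illustration only.
# Simplified back-off: try the 4-gram table first, then shorter n-grams.
# No Katz discounting is applied; probabilities are plain relative frequencies.
predict_next <- function(prev_words, g4freq, g3freq, g2freq, g1freq, k = 3) {
  tables   <- list(g4freq, g3freq, g2freq)
  prefixes <- c(3, 2, 1)                      # words of context used by each table
  for (i in seq_along(tables)) {
    n <- prefixes[i]
    if (length(prev_words) < n) next          # not enough context for this order
    prefix <- paste(tail(prev_words, n), collapse = "_")
    hits <- tables[[i]][startsWith(tables[[i]]$feature, paste0(prefix, "_")), ]
    if (nrow(hits) > 0) {
      hits$prob      <- hits$frequency / sum(hits$frequency)  # relative frequency
      hits$next_word <- sub(".*_", "", hits$feature)          # last token of the n-gram
      return(head(hits[order(-hits$prob), c("next_word", "prob")], k))
    }
  }
  # No higher-order match: fall back to the most frequent single words
  head(data.frame(next_word = g1freq$feature,
                  prob = g1freq$frequency / sum(g1freq$frequency)), k)
}
# Example call (assuming the four frequency tables exist):
# predict_next(c("thanks", "for", "the"), g4freq, g3freq, g2freq, g1freq)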