The number of lines read from the blogs, news, and Twitter corpus files, respectively:

## [1] 899288
## [1] 77259
## [1] 2360148
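For reference, counts like these can be produced along the following lines; the file paths and the `readLines()` options shown are assumptions, not necessarily the exact call used here:

```r
# Read the three corpus files (paths are assumptions);
# skipNul = TRUE skips embedded NUL characters in the raw text
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Line counts, in the same order as the output above
length(blogs)
length(news)
length(twitter)
```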
## Document-feature matrix of: 60,000 documents, 60,706 features (99.98% sparse) and 1 docvar.
##          features
## docs    year thereaft oil field platform name pagan god love mr
##   text1    1        1   1     1        1    1     1   1    0  0
##   text2    0        0   0     0        0    0     0   0    1  1
##   text3    0        0   0     0        0    0     0   0    1  0
##   text4    0        0   0     0        0    0     0   0    0  0
##   text5    0        0   0     0        0    0     0   0    0  0
##   text6    0        0   0     0        0    0     0   0    0  0
## [ reached max_ndoc ... 59,994 more documents, reached max_nfeat ... 60,696 more features ]
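A matrix like the one above can be built with quanteda roughly as follows. This is a sketch: the sample of 20,000 lines per source and the exact tokenization options are assumptions (the stemmed feature "thereaft" suggests word stemming was applied):

```r
library(quanteda)

# Sample 20,000 lines from each source (60,000 documents in total)
set.seed(1234)
sampled <- c(sample(blogs, 20000), sample(news, 20000), sample(twitter, 20000))

# Tokenize, lowercase, and stem, then build the document-feature matrix
toks <- tokens(corpus(sampled), remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_wordstem(tokens_tolower(toks))
dfm_all <- dfm(toks)
print(dfm_all)
```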
##           feature frequency rank docfreq group
## 1                     947133    1   59863   all
## 2             #ff         48    2      47   all
## 3 #teamfollowback          9    3       9   all
## 4        #brewers          8    4       8   all
## 5     #nowplaying          7    5       7   all
## 6            #nba          7    5       7   all
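The table above has the shape of quanteda's `textstat_frequency()` output; a minimal sketch, assuming the `dfm_all` object from the previous snippet (in recent quanteda releases this function lives in the quanteda.textstats package):

```r
library(quanteda.textstats)

# Rank every feature by total frequency across the sample
freq_all <- textstat_frequency(dfm_all)
head(freq_all)
```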
An N-gram model predicts the occurrence of a word from its N − 1 preceding words. A bigram model (N = 2) therefore predicts a word given only the single previous word (N − 1 = 1), and a trigram model (N = 3) predicts a word from its previous two words (N − 1 = 2). Let's assume a bigram model, so we estimate the probability of a word given only the word before it. This probability is the number of times the previous word wp occurs immediately before the word wn, divided by the number of times wp occurs in the corpus:
P(wn | wp) = Count(wp wn) / Count(wp)
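As an illustration, this estimate can be computed directly from the tokens. This is a sketch: the `bigram_prob()` helper is hypothetical, and the `toks` object is assumed to be the lowercased tokens built in the earlier snippet:

```r
library(quanteda)

# Minimal sketch of the bigram maximum-likelihood estimate
# P(wn | wp) = Count(wp wn) / Count(wp)
bigram_prob <- function(toks, wp, wn) {
  unigrams <- unlist(as.list(toks), use.names = FALSE)
  bigrams  <- unlist(as.list(tokens_ngrams(toks, n = 2, concatenator = " ")),
                     use.names = FALSE)
  count_wp   <- sum(unigrams == wp)                # Count(wp)
  count_wpwn <- sum(bigrams == paste(wp, wn))      # Count(wp wn)
  if (count_wp == 0) return(NA_real_)
  count_wpwn / count_wp
}

# Example: estimated probability that "york" follows "new"
bigram_prob(toks, "new", "york")
```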
Analysing the frequencies of 2-grams and 3-grams in the corpus, I identified the most frequent word pairs and triples; a sketch of that computation follows.
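A sketch of that frequency analysis, reusing the packages and the `toks` object from the snippets above:

```r
# Frequencies of 2-grams and 3-grams in the sampled corpus
freq_2 <- textstat_frequency(dfm(tokens_ngrams(toks, n = 2)))
freq_3 <- textstat_frequency(dfm(tokens_ngrams(toks, n = 3)))
head(freq_2)
head(freq_3)
```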
Next, I will determine how many unique words a frequency-sorted dictionary needs in order to cover 50% of all word instances in the language. I will then choose the model that best predicts the situations identified in this analysis. Finally, I will build the Shiny app and publish it on the Shiny site.
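For the dictionary-coverage step, a minimal sketch assuming the `dfm_all` object built earlier:

```r
# Cumulative coverage of word instances by the most frequent words
word_freq <- sort(colSums(dfm_all), decreasing = TRUE)
coverage  <- cumsum(word_freq) / sum(word_freq)

# Number of top-ranked unique words needed to reach 50% coverage
which(coverage >= 0.5)[1]
```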