The purpose of this report is to provide a first, exploratory analysis of the en_US corpus data. This consists of tri-gram frequencies for all three documents in the corpus (blogs, news and twitter). A tri-gram is essentially a group of three words as they appear together in the document. For example, the sentence “The quick brown fox jumps over the lazy dog.” yields the tri-grams “The quick brown”, “quick brown fox”, “brown fox jumps”, “fox jumps over”, “jumps over the”, “over the lazy” and “the lazy dog”. We will also show a summary of each of the three documents in the corpus and the most common words from a sample of each. This will give us a feel for the data and the differences between the three groups of data.
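As a quick illustration, here is a minimal base-R sketch of how tri-grams can be pulled out of a single sentence (the cleaning step is deliberately simple and is only for illustration):

```r
# Minimal sketch: extract tri-grams from one sentence with base R.
sentence <- "The quick brown fox jumps over the lazy dog."
words    <- strsplit(gsub("[[:punct:]]", "", sentence), "\\s+")[[1]]
trigrams <- vapply(seq_len(length(words) - 2),
                   function(i) paste(words[i:(i + 2)], collapse = " "),
                   character(1))
trigrams
# [1] "The quick brown" "quick brown fox" "brown fox jumps" ...
```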
The second part discusses our plans for a predictive model and for a Shiny application that will use that model.
These graphs show the top 20 phrases (n-grams) from each dataset (blogs, news and twitter).
As you can see there are differences between the three data sets.These phases seem to be in step with what one would expect for blogs verses news articles verses twitter feeds. One thing of note is phrases like “Happy Valentine’s Day” show up very frequently in twitter. Not surprising to anyone who sees people posting holiday greetings on social media on the various holidays.
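For reference, here is a hedged sketch of how a top-20 tri-gram plot like the ones above could be produced. It uses the tidytext and ggplot2 packages and a hypothetical `blog_lines` sample vector; it is not necessarily the exact code behind the figures.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Assumed: blog_lines is a character vector holding a sample of blog lines.
top_trigrams <- tibble(text = blog_lines) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  slice_head(n = 20)

ggplot(top_trigrams, aes(x = reorder(trigram, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Tri-gram", y = "Frequency",
       title = "Top 20 tri-grams (blogs sample)")
```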
The numbers of lines for each document (blogs, news, twitter) are as follows:
## [1] 899288
## [1] 77259
## [1] 2360148
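A hedged sketch of how these line counts could be obtained (the file paths are the standard en_US file names and are assumed here):

```r
# Read each document and count its lines (file paths assumed).
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

length(blogs)
length(news)
length(twitter)
```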
These are the summaries for all three documents in the corpus: Blogs, News and Twitter.
## $chars
## [1] 158322265
##
## $letters
## [1] 116631145
##
## $whitespace
## [1] 37334910
##
## $punctuation
## [1] 3157642
##
## $digits
## [1] 985537
##
## $words
## [1] 19868446
##
## $sentences
## [1] 0
##
## $lines
## [1] 1
##
## $wordlens
## [1] 134210 377348 1760362 4151793 3324605 2882773 2413217 1679406
## [9] 1128687 2016044
##
## $senlens
## [1] 0 0 0 0 0 0 0 0 0 0
##
## $syllens
## [1] 6870120 7717832 3246704 1146968 252752 37772 6255 2354
## [9] 979 1712
##
## attr(,"class")
## [1] "string_summary"
## $chars
## [1] 12351435
##
## $letters
## [1] 9429741
##
## $whitespace
## [1] 2643985
##
## $punctuation
## [1] 84962
##
## $digits
## [1] 177753
##
## $words
## [1] 1576830
##
## $sentences
## [1] 0
##
## $lines
## [1] 1
##
## $wordlens
## [1] 11047 39724 118753 293331 249738 238658 211290 156040 107099 151149
##
## $senlens
## [1] 0 0 0 0 0 0 0 0 0 0
##
## $syllens
## [1] 469814 599646 298390 102730 23753 3219 629 175 57 56
##
## attr(,"class")
## [1] "string_summary"
## $chars
## [1] 125099425
##
## $letters
## [1] 93221570
##
## $whitespace
## [1] 30374728
##
## $punctuation
## [1] 405909
##
## $digits
## [1] 1015463
##
## $words
## [1] 17814991
##
## $sentences
## [1] 0
##
## $lines
## [1] 1
##
## $wordlens
## [1] 355913 829347 2212249 4431910 3058682 2280898 1927065 1137522
## [9] 672557 908847
##
## $senlens
## [1] 0 0 0 0 0 0 0 0 0 0
##
## $syllens
## [1] 7577414 6403525 2022696 675919 144180 30233 10651 5608
## [9] 2352 2877
##
## attr(,"class")
## [1] "string_summary"
In this case it’s obvious each of these documents is quite large, with blogs being the largest. Notice, though, that twitter has the most lines. This is probably because we’re looking at individual tweets rather than paragraphs. The size of blogs in particular is going to present a challenge, as it would use more memory than is practical in some if not many cases.
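The summaries above appear to be the output of a string-summary routine such as ngram::string.summary(); a minimal sketch of how output like this could be produced, assuming the character vectors read in earlier (the exact call behind the output above may differ):

```r
library(ngram)

# Collapse each document into one string and summarise it.
string.summary(concatenate(blogs))
string.summary(concatenate(news))
string.summary(concatenate(twitter))
```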
We will now look at the document-term matrix for a sample of all three documents. This shows the count for each word in each line. We can then sum these columns to get a feel for the most common words. We took a sample of 10,000 lines from each of the three documents to obtain these calculations. This is a prime example of where it’s not practical to use all of the data, as it is far too large for the calculations required.
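A hedged sketch of how one of these document-term matrices could be built with the tm package (the sample size matches the text above; the seed and cleaning steps are assumptions, not the exact pipeline):

```r
library(tm)

set.seed(1234)                       # seed chosen for the sketch only
blog_sample <- sample(blogs, 10000)  # 10,000-line sample as described above

corp <- VCorpus(VectorSource(blog_sample))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)

dtm <- DocumentTermMatrix(corp)
inspect(dtm)   # prints a summary like the output below
```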
## <<DocumentTermMatrix (documents: 10000, terms: 33426)>>
## Non-/sparse entries: 193687/334066313
## Sparsity : 100%
## Maximal term length: 98
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can get just know like now one people time will
## 1194 2 1 1 0 2 1 0 0 1 0
## 1897 0 0 0 0 0 1 0 0 0 3
## 2360 0 1 0 0 0 1 1 1 0 0
## 2750 0 0 0 1 0 2 0 0 0 0
## 558 0 0 1 0 2 0 0 0 1 0
## 6418 0 0 4 0 3 1 1 1 0 0
## 7486 0 0 1 1 1 0 4 0 0 3
## 7849 0 0 0 0 0 0 1 0 0 1
## 9025 1 0 0 0 0 0 1 0 0 0
## 9154 3 0 0 0 0 0 2 1 0 0
## <<DocumentTermMatrix (documents: 10000, terms: 33120)>>
## Non-/sparse entries: 186528/331013472
## Sparsity : 100%
## Maximal term length: 46
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs also can just last new one said two will year
## 1721 0 0 0 0 0 0 0 0 0 0
## 1738 0 0 0 0 0 0 0 0 0 0
## 2783 0 1 1 0 0 1 0 0 0 0
## 3539 0 0 0 0 0 0 0 0 0 0
## 8247 0 0 2 0 2 1 0 1 1 0
## 8461 0 0 0 0 0 0 0 0 2 0
## 8961 0 0 0 0 0 1 0 0 0 0
## 9017 0 0 0 0 0 2 0 2 4 0
## 9158 0 0 0 0 0 0 0 0 5 0
## 9340 0 1 0 0 0 1 0 1 3 0
## <<DocumentTermMatrix (documents: 10000, terms: 15800)>>
## Non-/sparse entries: 68209/157931791
## Sparsity : 100%
## Maximal term length: 36
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can day dont get good just like love thanks will
## 128 2 0 1 1 0 0 0 0 0 0
## 2588 1 0 0 1 1 0 0 0 0 0
## 2987 0 0 0 0 0 0 0 0 0 0
## 4728 0 0 0 0 0 1 0 0 0 0
## 5748 0 0 1 0 0 0 0 0 0 1
## 6972 0 0 0 0 0 0 0 0 0 0
## 7811 1 1 0 0 0 0 1 0 0 1
## 9056 0 0 0 0 0 1 1 0 0 0
## 916 0 0 0 0 0 0 0 0 0 0
## 9567 0 0 0 0 0 0 0 0 0 0
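To go from the document-term matrices to word counts, the columns can be summed; a minimal sketch using the slam package, which works on the sparse matrix directly without converting it to a dense one:

```r
library(slam)

# Sum each term's column in the sparse DTM, then sort by frequency.
term_counts <- sort(col_sums(dtm), decreasing = TRUE)
head(data.frame(Term = names(term_counts), Count = term_counts), 20)
```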
Using the matrices above we were able to calculate the most common words for each document. The most common terms for Blogs are:
## Term Count
## 803 wonderful 97
## 808 church 97
## 1135 lost 97
## 1231 front 97
## 90 cut 98
## 138 hear 98
## 222 stuff 98
## 274 within 98
## 682 company 98
## 1668 weà 98
## 2410 history 98
## 3206 public 98
## 619 goes 99
## 1418 movie 99
## 1420 perfect 99
## 1734 especially 99
## 1907 film 99
## 2115 local 99
## 4067 question 99
## 4845 child 99
One interesting thing here is that we had the word “weÔ (with a tilde) as one of the most common words. I’m not sure why that is the case (it may be a character-encoding artifact), but it bears further investigation.
The most common terms for News are:
## Term Count
## 1760 live 93
## 1160 cut 94
## 1520 behind 94
## 2672 billion 94
## 3103 running 94
## 9 close 95
## 1782 able 96
## 2260 history 96
## 23 building 97
## 1464 saturday 97
## 2581 along 97
## 224 deal 98
## 1308 nearly 98
## 1365 thing 98
## 1410 started 98
## 2487 change 98
## 2786 cleveland 98
## 14 old 99
## 746 won 99
## 803 fire 99
The most common terms for Twitter are:
## Term Count
## 107 anyone 90
## 259 live 90
## 1114 bad 90
## 382 things 91
## 470 free 91
## 589 everyone 92
## 1597 watching 92
## 39 even 94
## 462 hate 94
## 946 weekend 94
## 3 gonna 95
## 748 something 95
## 248 tomorrow 96
## 410 big 96
## 221 help 97
## 220 feel 98
## 793 yeah 98
## 78 guys 99
## 724 awesome 99
## 1116 man 99
The next step is the predictive model. The concept we will be using is a backoff model. This means that if we don’t have a match for an n-gram (let’s say a tri-gram, the three-word phrase described at the beginning of this paper), we will look for a match for an (n-1)-gram. This is what is meant by backoff. Another factor is that our training set does not include all possible word combinations, so we need to apply a discount factor to our probabilities. For example, suppose the probability of seeing “fox” after “the quick brown” is 10%; we might discount that probability by 15% to account for words that do not appear after “the quick brown” in the training set but might appear in the test set. In that case the probability would be 8.5% (0.10 × (1 − 0.15)). In addition, we are researching Markov chains as a way to implement this, as suggested by the assignment description. One thought, given that some phrases are much more common than others, is to sort the tri-grams by how common they are in that particular document, which will allow the algorithm to find them more quickly.
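A hedged sketch of the backoff lookup with a fixed discount (the table layout, the column names and the `predict_next()` function are placeholders, not the final model):

```r
# Hedged sketch: simple backoff with a fixed discount. Assumed layout:
# tri_table and bi_table are data frames with columns prefix, next_word, count.
predict_next <- function(phrase, tri_table, bi_table, discount = 0.15) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]

  # Tri-gram lookup: condition on the last two words of the phrase.
  prefix2 <- paste(tail(words, 2), collapse = " ")
  hits <- tri_table[tri_table$prefix == prefix2, ]

  # Backoff: if nothing matched, condition on the last word only (bi-grams).
  if (nrow(hits) == 0) {
    hits <- bi_table[bi_table$prefix == tail(words, 1), ]
  }
  if (nrow(hits) == 0) {
    return(data.frame(next_word = character(), prob = numeric()))
  }

  # Discounted probability, e.g. 0.10 * (1 - 0.15) = 0.085 as in the text.
  hits$prob <- (1 - discount) * hits$count / sum(hits$count)
  hits[order(-hits$prob), c("next_word", "prob")]
}
```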
In terms of the Shiny app, we are going to start simply by having the user input a three-word phrase and providing a list of possible next words. This concept may change as we move to implementation.
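A minimal sketch of what that Shiny app could look like (the layout and names are placeholders; `predict_next()` refers to the backoff sketch above):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Enter a three-word phrase:"),
  tableOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$phrase)
    # Placeholder: the backoff model sketched above supplies the candidates.
    predict_next(input$phrase, tri_table, bi_table)
  })
}

shinyApp(ui, server)
```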
There are challenges that will have to be overcome. For example, as mentioned earlier (and seen in the corpus summary), the dataset is huge, in some cases too big for the programming tools and system memory. We already worked around this when calculating common words by taking a sample of the data. Another workaround was needed when creating the tri-grams: we had to feed the input to the n-gram routine in pieces, as it would not work otherwise.
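A hedged sketch of that chunked approach to building the tri-gram frequency table (the chunk size and the use of the ngram package are assumptions):

```r
library(ngram)

chunk_size <- 50000   # assumed chunk size, tuned to available memory
tables <- list()

# Process the document in pieces and collect a phrase table per chunk.
for (start in seq(1, length(blogs), by = chunk_size)) {
  chunk <- blogs[start:min(start + chunk_size - 1, length(blogs))]
  ng    <- ngram(concatenate(chunk), n = 3)
  tables[[length(tables) + 1]] <- get.phrasetable(ng)
}

# Combine the per-chunk tables and re-aggregate the tri-gram frequencies.
all_chunks   <- do.call(rbind, tables)
trigram_freq <- aggregate(freq ~ ngrams, data = all_chunks, FUN = sum)
trigram_freq <- trigram_freq[order(-trigram_freq$freq), ]
```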