Introduction

The purpose of this report is to provide a first, exploratory analysis of the en_US corpus data. This will consist of tri-gram frequencies for all three documents in the corpus (blogs, news and twitter). A tri-gram is essentially a group of three words as they appear together in the document. For example, the sentence “The quick brown fox jumps over the lazy dog.” yields the tri-grams “The quick brown”, “quick brown fox”, “brown fox jumps”, “fox jumps over”, “jumps over the”, “over the lazy” and “the lazy dog”. We will also show a summary of each of the three documents in the corpus and the most common words from a sample of each of the three documents. This will allow us to get a feel for the data and the differences between the three groups of data.
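As a minimal illustration (a base-R sketch, not the exact code behind the analysis), tri-grams for a single sentence could be generated like this:

```r
# Minimal base-R sketch: split one sentence into words and build its tri-grams.
sentence <- "The quick brown fox jumps over the lazy dog."
words <- strsplit(gsub("[[:punct:]]", "", sentence), "\\s+")[[1]]
trigrams <- vapply(seq_len(length(words) - 2),
                   function(i) paste(words[i:(i + 2)], collapse = " "),
                   character(1))
trigrams
# [1] "The quick brown" "quick brown fox" "brown fox jumps" ...
```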

The second part will discuss our plans for a predictive model and the Shiny application that will use it.

Analysis

These graphs show the top 20 phrases (n-grams) from each dataset (blogs, news and twitter).

As you can see, there are differences between the three data sets. These phrases are in step with what one would expect for blogs versus news articles versus twitter feeds. One thing of note is that phrases like “Happy Valentine’s Day” show up very frequently in twitter, which is not surprising to anyone who sees people posting holiday greetings on social media around the various holidays.
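For reference, a plot like these could be produced along the following lines (a sketch using tidytext and ggplot2; `blogs_sample`, the sampled blog lines, is an assumed object, and the packages may differ from what was actually used):

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Count tri-grams in the (assumed) blogs sample and keep the 20 most frequent.
top_trigrams <- tibble(text = blogs_sample) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE, name = "freq") %>%
  slice_max(freq, n = 20)

# Horizontal bar chart of the top 20 tri-grams.
ggplot(top_trigrams, aes(x = reorder(trigram, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Tri-gram", y = "Count", title = "Top 20 tri-grams (blogs sample)")
```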

The number of lines for each document (blogs, news and twitter) is as follows:

## [1] 899288
## [1] 77259
## [1] 2360148
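Counts like these could be obtained with a simple read-and-count pass; a minimal sketch (the file paths are assumptions):

```r
# Count the number of lines in each corpus file (paths assumed).
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
for (f in files) {
  print(length(readLines(f, skipNul = TRUE)))
}
```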

These are the summaries for all three documents in the corpus: Blogs, News and Twitter.

## $chars
## [1] 158322265
## 
## $letters
## [1] 116631145
## 
## $whitespace
## [1] 37334910
## 
## $punctuation
## [1] 3157642
## 
## $digits
## [1] 985537
## 
## $words
## [1] 19868446
## 
## $sentences
## [1] 0
## 
## $lines
## [1] 1
## 
## $wordlens
##  [1]  134210  377348 1760362 4151793 3324605 2882773 2413217 1679406
##  [9] 1128687 2016044
## 
## $senlens
##  [1] 0 0 0 0 0 0 0 0 0 0
## 
## $syllens
##  [1] 6870120 7717832 3246704 1146968  252752   37772    6255    2354
##  [9]     979    1712
## 
## attr(,"class")
## [1] "string_summary"
## $chars
## [1] 12351435
## 
## $letters
## [1] 9429741
## 
## $whitespace
## [1] 2643985
## 
## $punctuation
## [1] 84962
## 
## $digits
## [1] 177753
## 
## $words
## [1] 1576830
## 
## $sentences
## [1] 0
## 
## $lines
## [1] 1
## 
## $wordlens
##  [1]  11047  39724 118753 293331 249738 238658 211290 156040 107099 151149
## 
## $senlens
##  [1] 0 0 0 0 0 0 0 0 0 0
## 
## $syllens
##  [1] 469814 599646 298390 102730  23753   3219    629    175     57     56
## 
## attr(,"class")
## [1] "string_summary"
## $chars
## [1] 125099425
## 
## $letters
## [1] 93221570
## 
## $whitespace
## [1] 30374728
## 
## $punctuation
## [1] 405909
## 
## $digits
## [1] 1015463
## 
## $words
## [1] 17814991
## 
## $sentences
## [1] 0
## 
## $lines
## [1] 1
## 
## $wordlens
##  [1]  355913  829347 2212249 4431910 3058682 2280898 1927065 1137522
##  [9]  672557  908847
## 
## $senlens
##  [1] 0 0 0 0 0 0 0 0 0 0
## 
## $syllens
##  [1] 7577414 6403525 2022696  675919  144180   30233   10651    5608
##  [9]    2352    2877
## 
## attr(,"class")
## [1] "string_summary"

In this case it’s obvious each of these documents is quite large, with blogs being the largest. Notice, though, that twitter has the largest number of lines. This is probably because we’re looking at “tweets” rather than paragraphs. The size of blogs in particular is going to present a challenge, as it would use more memory than is practical in some, if not many, cases.

We will now look at the document-term matrix for a sample of all three documents. This shows the count of each word in each line. We can then sum these columns to get a feel for the most common words. We took a sample of 10,000 lines from each of the three documents to obtain these calculations. This is a prime example of where it’s not practical to use all the data, as it’s far too large for the necessary calculations.
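A sketch of how such a document-term matrix and the per-term totals might be built with the tm package (the seed, preprocessing steps and object names such as `blogs` are assumptions, not the exact code used):

```r
library(tm)

set.seed(1234)                           # assumed seed
blogs_sample <- sample(blogs, 10000)     # 10,000-line sample of the blog lines

corpus <- VCorpus(VectorSource(blogs_sample))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
inspect(dtm)

# Summing the columns gives the total count of each term across the sample.
term_counts <- sort(slam::col_sums(dtm), decreasing = TRUE)
head(data.frame(Term = names(term_counts), Count = term_counts), 20)
```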

## <<DocumentTermMatrix (documents: 10000, terms: 33426)>>
## Non-/sparse entries: 193687/334066313
## Sparsity           : 100%
## Maximal term length: 98
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   can get just know like now one people time will
##   1194   2   1    1    0    2   1   0      0    1    0
##   1897   0   0    0    0    0   1   0      0    0    3
##   2360   0   1    0    0    0   1   1      1    0    0
##   2750   0   0    0    1    0   2   0      0    0    0
##   558    0   0    1    0    2   0   0      0    1    0
##   6418   0   0    4    0    3   1   1      1    0    0
##   7486   0   0    1    1    1   0   4      0    0    3
##   7849   0   0    0    0    0   0   1      0    0    1
##   9025   1   0    0    0    0   0   1      0    0    0
##   9154   3   0    0    0    0   0   2      1    0    0
## <<DocumentTermMatrix (documents: 10000, terms: 33120)>>
## Non-/sparse entries: 186528/331013472
## Sparsity           : 100%
## Maximal term length: 46
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   also can just last new one said two will year
##   1721    0   0    0    0   0   0    0   0    0    0
##   1738    0   0    0    0   0   0    0   0    0    0
##   2783    0   1    1    0   0   1    0   0    0    0
##   3539    0   0    0    0   0   0    0   0    0    0
##   8247    0   0    2    0   2   1    0   1    1    0
##   8461    0   0    0    0   0   0    0   0    2    0
##   8961    0   0    0    0   0   1    0   0    0    0
##   9017    0   0    0    0   0   2    0   2    4    0
##   9158    0   0    0    0   0   0    0   0    5    0
##   9340    0   1    0    0   0   1    0   1    3    0
## <<DocumentTermMatrix (documents: 10000, terms: 15800)>>
## Non-/sparse entries: 68209/157931791
## Sparsity           : 100%
## Maximal term length: 36
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   can day dont get good just like love thanks will
##   128    2   0    1   1    0    0    0    0      0    0
##   2588   1   0    0   1    1    0    0    0      0    0
##   2987   0   0    0   0    0    0    0    0      0    0
##   4728   0   0    0   0    0    1    0    0      0    0
##   5748   0   0    1   0    0    0    0    0      0    1
##   6972   0   0    0   0    0    0    0    0      0    0
##   7811   1   1    0   0    0    0    1    0      0    1
##   9056   0   0    0   0    0    1    1    0      0    0
##   916    0   0    0   0    0    0    0    0      0    0
##   9567   0   0    0   0    0    0    0    0      0    0

Using the matrices above we were able to calculate the most common words for each document. The most common terms for Blogs are:

##            Term Count
## 803   wonderful    97
## 808      church    97
## 1135       lost    97
## 1231      front    97
## 90          cut    98
## 138        hear    98
## 222       stuff    98
## 274      within    98
## 682     company    98
## 1668        weà   98
## 2410    history    98
## 3206     public    98
## 619        goes    99
## 1418      movie    99
## 1420    perfect    99
## 1734 especially    99
## 1907       film    99
## 2115      local    99
## 4067   question    99
## 4845      child    99

One interesting thing here is that the term “weÔ (a “we” followed by an A with a tilde, most likely a character-encoding artifact) shows up as one of the most common words. I’m not sure why that is the case, but it bears further investigation.

The most common terms for News are:

##           Term Count
## 1760      live    93
## 1160       cut    94
## 1520    behind    94
## 2672   billion    94
## 3103   running    94
## 9        close    95
## 1782      able    96
## 2260   history    96
## 23    building    97
## 1464  saturday    97
## 2581     along    97
## 224       deal    98
## 1308    nearly    98
## 1365     thing    98
## 1410   started    98
## 2487    change    98
## 2786 cleveland    98
## 14         old    99
## 746        won    99
## 803       fire    99

The most common terms for Twitter are:

##           Term Count
## 107     anyone    90
## 259       live    90
## 1114       bad    90
## 382     things    91
## 470       free    91
## 589   everyone    92
## 1597  watching    92
## 39        even    94
## 462       hate    94
## 946    weekend    94
## 3        gonna    95
## 748  something    95
## 248   tomorrow    96
## 410        big    96
## 221       help    97
## 220       feel    98
## 793       yeah    98
## 78        guys    99
## 724    awesome    99
## 1116       man    99

Plans for the Predictive Model

The next step is the predictive model. The concept we will be using is a backoff model. This means that if we don’t have a match for an n-gram (let’s say a tri-gram, the three-word phrase described at the beginning of the paper), we will look for a match for an (n-1)-gram. This is what is meant by backoff. Another factor is that our training set does not include all possible word combinations, so we need to apply a discount factor to our probabilities. For example, suppose the probability of seeing “fox” after “the quick brown” is 10%; we might discount that probability by 15% to account for words that do not appear after “the quick brown” in the training set but might appear in the test set. In that case the probability would be 8.5% (0.10 * (1 - 0.15)). In addition, we are researching Markov chains as a way to implement this, as suggested by the assignment description. One thought, given that some phrases are much more common than others, is to sort the tri-grams by how common they are in that particular document so the algorithm can find them more quickly.
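To make the idea concrete, here is a rough sketch of that kind of backoff lookup in R (the frequency tables `tri_freq`, `bi_freq` and `uni_freq` are assumed to be named count vectors built from the training data, and the 15% discount is just the example value from above; this is not the final implementation):

```r
# Sketch of a backoff prediction with a fixed discount (not the final model).
# tri_freq: named counts of "w1 w2 w3"; bi_freq: "w1 w2"; uni_freq: single words.
predict_next <- function(prefix, tri_freq, bi_freq, uni_freq, discount = 0.15) {
  words <- tail(strsplit(tolower(prefix), "\\s+")[[1]], 2)

  # Try tri-grams that start with the last two words of the prefix.
  key2 <- paste0(paste(words, collapse = " "), " ")
  cand <- tri_freq[startsWith(names(tri_freq), key2)]
  if (length(cand) > 0) {
    return(sort((1 - discount) * cand / sum(cand), decreasing = TRUE))
  }

  # Back off to bi-grams that start with the last word only.
  cand <- bi_freq[startsWith(names(bi_freq), paste0(tail(words, 1), " "))]
  if (length(cand) > 0) {
    return(sort((1 - discount) * cand / sum(cand), decreasing = TRUE))
  }

  # Last resort: unigram probabilities.
  sort(uni_freq / sum(uni_freq), decreasing = TRUE)
}
```

Keeping the frequency tables sorted by count, as mentioned above, would let a production version return the top candidates without scanning the whole table.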

In terms of the Shiny app, we’re going to start simply by having the user input a three-word phrase and providing a list of possible next words. This concept may change as we go to implementation.
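A first cut of that interface might look like the following (a sketch only; it assumes the `predict_next()` helper sketched above and pre-built frequency tables loaded into the app):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Enter a three-word phrase:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(input$phrase)
    # Show the five most likely next words for the entered phrase.
    probs <- predict_next(input$phrase, tri_freq, bi_freq, uni_freq)
    head(data.frame(Word = names(probs), Probability = as.numeric(probs)), 5)
  })
}

shinyApp(ui = ui, server = server)
```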

There are challenges that will have to be overcome. For example, as mentioned earlier (and seen in the corpus summary), the dataset is huge, in some cases too big for the programming tools and system memory. We already worked around this for calculating common words by taking a sample of the data. Another workaround was needed when creating the tri-grams: we had to feed the input to the n-gram routine in pieces, as it wouldn’t work otherwise.
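As an illustration of that workaround, a chunked tri-gram count might look like this (a sketch; the chunk size, tokenizer and packages are assumptions rather than the exact routine we used):

```r
library(dplyr)
library(tidytext)

# Count tri-grams one chunk of lines at a time, then combine the partial counts.
count_trigrams_chunked <- function(lines, chunk_size = 50000) {
  chunks <- split(lines, ceiling(seq_along(lines) / chunk_size))
  partial <- lapply(chunks, function(chunk) {
    tibble(text = chunk) %>%
      unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
      filter(!is.na(trigram)) %>%
      count(trigram)
  })
  bind_rows(partial) %>%
    group_by(trigram) %>%
    summarise(n = sum(n), .groups = "drop") %>%
    arrange(desc(n))
}
```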