The purpose of this report is to explore the contents of three files containing tweets, blog posts and newspaper articles. The makeup of lines, single words, bigrams (two-word combinations) and trigrams (three-word combinations) will be analysed.
This data will later be used to build a predictive text model to predict the next word following a single word, bigram or trigram.
| Measure | Tweets | Blogs | News |
|---|---|---|---|
| Lines | 2,360,148 | 899,288 | 1,010,242 |
| Characters | 162,096,031 | 206,824,505 | 203,223,159 |
| Mean characters per line (document) | 68.68 | 229.99 | 201.16 |
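The per-line average in the last row is the character count divided by the line count. A minimal sketch of how such figures could be gathered in one pass is shown here in Python, purely for illustration (the report's own analysis used R). The function name `line_stats` and the in-memory sample are hypothetical, and whether the original counts included newline characters is not stated, so this sketch assumes they did not.

```python
import io

def line_stats(lines):
    """Return (line count, character count, mean characters per line)."""
    n_lines = 0
    n_chars = 0
    for line in lines:
        n_lines += 1
        # Assumption: the trailing newline is not counted as a character.
        n_chars += len(line.rstrip("\n"))
    mean = n_chars / n_lines if n_lines else 0.0
    return n_lines, n_chars, mean

# Tiny in-memory stand-in for one of the three files (hypothetical content).
sample = io.StringIO("hello world\nfoo\n")
print(line_stats(sample))
```

Because the function only iterates over lines, an open file handle can be passed in place of the in-memory sample without loading the whole file.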
"A common approach in text mining is to create a term-document matrix from a corpus. In the tm package the classes TermDocumentMatrix and DocumentTermMatrix (depending on whether you want terms as rows and documents as columns, or vice versa) employ sparse matrices for corpora. Inspecting a term-document matrix displays a sample, whereas as.matrix() yields the full matrix in dense format (which can be very memory consuming for large matrices)." (Source: tm package vignette, https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf)
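To make the sparse versus dense distinction concrete, here is a small Python sketch of a term-document matrix. It is not how tm stores its matrices internally; the names `term_document_matrix` and `to_dense` and the three toy documents are assumptions made for illustration only.

```python
from collections import Counter

def term_document_matrix(docs):
    """Sparse term-document matrix as {term: {doc_index: count}}.

    Only non-zero cells are stored, which is what keeps it small.
    """
    tdm = {}
    for j, doc in enumerate(docs):
        for term, count in Counter(doc.split()).items():
            tdm.setdefault(term, {})[j] = count
    return tdm

def to_dense(tdm, n_docs):
    """Expand to a dense table: one full list of counts per term."""
    return {term: [row.get(j, 0) for j in range(n_docs)]
            for term, row in tdm.items()}

# Toy corpus (hypothetical).
docs = ["red fish blue fish", "one fish", "blue sky"]
tdm = term_document_matrix(docs)
dense = to_dense(tdm, len(docs))

print(tdm["fish"])         # sparse row: only documents containing the term
print(dense["fish"])       # dense row: one cell per document, zeros included
print(sum(dense["fish"]))  # overall term frequency is the row sum
```

The dense form stores one cell per term and document pair, so its size grows with the product of vocabulary size and document count, which is why converting a large corpus to dense format is memory hungry.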
Word counts for each corpus created using the tm package are shown below, both for the original data and after tidying. To tidy the dataset, I removed punctuation, numbers and standard English stop words, and applied stemming using Porter's stemming algorithm.
Note that the analysis below is based on a sample of 100,000 records from each file, as processing the complete files was still running after 16 hours.
| File | wordCountOriginal | wordCountTidied |
|---|---|---|
| en_US.blogs.txt | 3,260,769 | 2,009,404 |
| en_US.news.txt | 2,808,015 | 1,844,205 |
| en_US.twitter.txt | 993,231 | 658,719 |
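The drop in word count between the two columns comes from the tidying steps. The following Python sketch illustrates the idea under stated assumptions: the stop-word list is a tiny hypothetical one (tm ships a much longer list), the function name `tidy` is invented here, and Porter stemming is omitted because it would need a stemming library.

```python
import re

# Tiny illustrative stop-word list (assumption, not the list tm uses).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "i"}

def tidy(text, stopwords=STOPWORDS):
    """Lowercase, strip numbers and punctuation, then drop stop words.

    Stemming is deliberately left out of this sketch.
    """
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)   # remove numbers
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    return [w for w in text.split() if w not in stopwords]

line = "In 2012, the dog ran to the park!"
print(len(line.split()), "words before tidying")
print(len(tidy(line)), "words after tidying:", tidy(line))
```

Summing the two counts over every line of a file would give figures analogous to the original and tidied columns above.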
The ten most frequent terms after tidying:
| word | freq |
|---|---|
| said | 29,451 |
| will | 28,002 |
| one | 26,781 |
| like | 23,548 |
| just | 22,635 |
| get | 22,566 |
| time | 22,046 |
| can | 20,807 |
| year | 19,862 |
| make | 17,106 |
Examples of rare terms, each appearing only twice:
| word | freq |
|---|---|
| zoloft | 2 |
| zoni | 2 |
| zopa | 2 |
| zori | 2 |
| zorro” | 2 |
| zotto | 2 |
| zuck | 2 |
| zuzu | 2 |
| zwick | 2 |
| zyrtec | 2 |
40,096
We now switch to using the tidytext package, which is similar to tm but which I find more intuitive.
There are 8,583,202 unfiltered bigrams and 8,284,704 unfiltered trigrams.
| bigram | n |
|---|---|
| of the | 41293 |
| in the | 37778 |
| to the | 19802 |
| on the | 17591 |
| for the | 16317 |
| to be | 14114 |
| at the | 12572 |
| and the | 12288 |
| in a | 11074 |
| with the | 9860 |
| trigram | n |
|---|---|
| one of the | 3220 |
| a lot of | 2772 |
| to be a | 1490 |
| the end of | 1437 |
| going to be | 1392 |
| as well as | 1366 |
| out of the | 1355 |
| it was a | 1299 |
| some of the | 1292 |
| be able to | 1260 |
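Counts like those in the two tables above come from sliding a window of n words over each line and tallying the results. A minimal Python sketch of that idea follows. It is not the tidytext implementation: the whitespace tokeniser is a simplification, and the function names and sample lines are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n):
    """Tally n-grams line by line, so no n-gram spans two lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    return counts

# Hypothetical sample lines.
lines = ["One of the best", "one of the few", "ok"]
bigrams = count_ngrams(lines, 2)
trigrams = count_ngrams(lines, 3)
print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

A line with fewer than n words contributes nothing, which is one reason the trigram total is smaller than the bigram total.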
There are 1,285,365 filtered bigrams and 473,420 filtered trigrams (n-grams containing stop words removed).
| word1 | word2 | n |
|---|---|---|
| st | louis | 969 |
| los | angeles | 682 |
| san | francisco | 612 |
| happy | birthday | 432 |
| san | diego | 406 |
| social | media | 385 |
| ice | cream | 372 |
| real | estate | 321 |
| vice | president | 313 |
| white | house | 308 |
| word1 | word2 | word3 | n |
|---|---|---|---|
| president | barack | obama | 130 |
| st | louis | county | 101 |
| world | war | ii | 95 |
| gov | chris | christie | 88 |
| happy | mothers | day | 80 |
| happy | mother’s | day | 74 |
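The two tables above split each n-gram into separate word columns and contain no stop words, which suggests a filtering step along the following lines. This Python sketch is an assumption about the approach, not the report's actual code; the counts and the short stop-word set are made up for illustration.

```python
from collections import Counter

def filter_ngrams(counts, stopwords):
    """Keep only n-grams in which no word is a stop word.

    Keys of the result are tuples (word1, word2, ...), mirroring the
    separate word columns in the tables.
    """
    kept = Counter()
    for gram, n in counts.items():
        words = tuple(gram.split())
        if not any(w in stopwords for w in words):
            kept[words] = n
    return kept

# Hypothetical bigram counts and stop-word set.
counts = Counter({"of the": 5, "ice cream": 3, "in a": 2, "san diego": 1})
filtered = filter_ngrams(counts, {"of", "the", "in", "a"})
print(filtered.most_common())
```

Dropping an n-gram when any one of its words is a stop word removes a large share of the data, which is consistent with the much smaller filtered totals.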
For the coverage analysis we need to keep the stop words, since coverage is measured over all word instances in the language, but numbers are still excluded.
Just 143 words are needed to cover 50% of all word instances in the language and 7,543 to cover 90%.
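Coverage figures of this kind can be obtained by sorting words by frequency and accumulating counts until the target share of all word instances is reached. The Python sketch below shows the calculation on made-up frequencies; `words_needed` is a hypothetical name and the numbers are not from the report's data.

```python
from collections import Counter

def words_needed(freqs, coverage):
    """Smallest number of top-ranked words whose combined count
    reaches the given share (0..1) of all word instances."""
    total = sum(freqs.values())
    target = coverage * total
    running = 0
    for rank, (_, n) in enumerate(freqs.most_common(), start=1):
        running += n
        if running >= target:
            return rank
    return len(freqs)

# Made-up frequencies summing to 100 word instances.
freqs = Counter({"the": 50, "of": 20, "and": 15, "cat": 10, "zuzu": 5})
print(words_needed(freqs, 0.5))  # 1: "the" alone covers 50 of 100
print(words_needed(freqs, 0.9))  # 4: 50 + 20 + 15 + 10 = 95 >= 90
```

The steep gap between the 50% and 90% figures reflects how a few very common words account for most of the text while the long tail of rare words adds coverage slowly.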
I plan to build models for bigrams, trigrams and possibly quadgrams, which will predict the most probable next word given a single word, bigram or trigram.
The Shiny app will use the highest-order n-gram model available for the words entered. If the probability of the predicted next word reaches a certain threshold (TBC), that prediction will be used. Otherwise the app will back off to the (n-1)-gram model and repeat the exercise.
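The back-off logic described above can be sketched as follows. This is a minimal Python illustration under several assumptions, not the planned implementation: the threshold value of 0.5 is a placeholder (the report leaves it TBC), the names `train` and `predict` and the training lines are hypothetical, and returning the best candidate seen when no level reaches the threshold is a design choice made here that the report does not specify.

```python
from collections import Counter, defaultdict

def train(lines, max_context=3):
    """models[k] maps a k-word context tuple to a Counter of next words."""
    models = {k: defaultdict(Counter) for k in range(1, max_context + 1)}
    for line in lines:
        tokens = line.lower().split()
        for k in models:
            for i in range(len(tokens) - k):
                models[k][tuple(tokens[i:i + k])][tokens[i + k]] += 1
    return models

def predict(models, words, threshold=0.5):
    """Try the longest usable context first and back off to shorter ones
    while the top candidate's probability is below the threshold."""
    tokens = [w.lower() for w in words]
    best = None
    for k in range(min(len(tokens), max(models)), 0, -1):
        nexts = models[k].get(tuple(tokens[-k:]))
        if not nexts:
            continue  # context never seen: back off
        word, count = nexts.most_common(1)[0]
        prob = count / sum(nexts.values())
        if prob >= threshold:
            return word
        if best is None or prob > best[1]:
            best = (word, prob)
    # Assumption: fall back to the best candidate seen, or None.
    return best[0] if best else None

# Hypothetical training lines.
models = train(["i love ice cream", "i love ice cream",
                "we love ice skating", "they love you"])
print(predict(models, ["we", "love", "ice"]))  # three-word context is known
print(predict(models, ["so", "love"]))         # backs off to one-word context
print(predict(models, ["unknown"]))            # nothing matches
```

Storing the models by context length keeps the back-off loop simple, since moving to the next lower-order model is just a matter of shortening the context by one word.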
Different models will be tested and compared to see which provides the most accurate predictions.
Given the performance issues encountered, models will be trained on a subset of the data.
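The report does not say how the 100,000-record samples were drawn or how the training subset will be chosen. One possible approach, sketched here in Python as an assumption, is reservoir sampling, which picks a fixed-size uniform random sample in a single pass without holding the whole file in memory. The function name and the seed are illustrative.

```python
import random

def reservoir_sample(lines, k, seed=None):
    """Uniform random sample of k items from an iterable of unknown length
    (Algorithm R). Returns all items if there are fewer than k."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive at both ends
            if j < k:
                sample[j] = line       # replace with probability k / (i + 1)
    return sample

# Stand-in for a large file: a generator of 1,000 hypothetical lines.
source = (f"line {n}" for n in range(1000))
subset = reservoir_sample(source, 10, seed=42)
print(len(subset))
```

Fixing the seed makes the subset reproducible, which helps when comparing models trained on the same sample.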