This report explores the contents of three files containing tweets, blog posts and newspaper articles. It analyses the makeup of lines, single words, bigrams (two-word combinations) and trigrams (three-word combinations).

This data will later be used to build a predictive text model to predict the next word following a single word, bigram or trigram.

Load the Data

Measure                        Tweets         Blogs          News
Lines                          2,360,148      899,288        1,010,242
Characters                     162,096,031    206,824,505    203,223,159
Characters / Line (Document)   68.68045       229.987        201.1628
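As an illustration of how the measures in this table can be derived, here is a minimal Python sketch (the report itself was produced in R, so this is not the code behind the figures above). The function name summarise_lines and the two sample lines are assumptions made for the example.

```python
def summarise_lines(lines):
    """Return line count, character count and mean characters per line."""
    n_lines = 0
    n_chars = 0
    for line in lines:
        n_lines += 1
        # Ignore the trailing newline so it is not counted as a character.
        n_chars += len(line.rstrip("\n"))
    per_line = n_chars / n_lines if n_lines else 0.0
    return {"lines": n_lines, "characters": n_chars, "chars_per_line": per_line}


# Two made-up lines standing in for a real file.
sample = ["Happy birthday!\n", "See you at the game tonight\n"]
summary = summarise_lines(sample)
print(summary)
```

Because the function only iterates over its input, an open file handle (for example one opened on en_US.twitter.txt with UTF-8 encoding) could be passed in place of the list, so the file never has to be held in memory in full.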

Build and Transform Corpora Using the tm Package

‘A common approach in text mining is to create a term-document matrix from a corpus. In the tm package the classes TermDocumentMatrix and DocumentTermMatrix (depending on whether you want terms as rows and documents as columns, or vice versa) employ sparse matrices for corpora. Inspecting a term-document matrix displays a sample, whereas as.matrix() yields the full matrix in dense format (which can be very memory consuming for large matrices).’ https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Word counts for each corpus created using the tm package are shown below, both for the original data and after tidying. To tidy the dataset, I removed punctuation, numbers and standard English stopwords, and applied stemming using Porter's stemming algorithm.
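The tidying steps can be sketched roughly as follows. This is a Python illustration, not the tm code used for the report: the stop-word list is a tiny made-up stand-in for a standard English list, the function name tidy is hypothetical, and Porter stemming is left out because it needs a third-party library.

```python
import re

# Tiny illustrative stop-word list (an assumption); a standard English
# list is much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "i"}


def tidy(text, stopwords=STOPWORDS):
    """Lower-case the text, drop numbers, punctuation and stop words."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)   # remove numbers
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    return [word for word in text.split() if word not in stopwords]


tokens = tidy("In 2012, the Cardinals won 3 of the 5 games!")
print(tokens)
```

Comparing the number of tokens before and after a step like this is what produces the gap between the original and tidied word counts.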

Note that the analysis below is based on a sample of 100,000 records from each file, as processing the complete files was still running after 16 hours.

File                wordCountOriginal   wordCountTidied
en_US.blogs.txt     3,260,769           2,009,404
en_US.news.txt      2,808,015           1,844,205
en_US.twitter.txt   993,231             658,719

Top 10 Words

word freq
said 29,451
will 28,002
one 26,781
like 23,548
just 22,635
get 22,566
time 22,046
can 20,807
year 19,862
make 17,106

Bottom 10 Words

word freq
zoloft 2
zoni 2
zopa 2
zori 2
zorro” 2
zotto 2
zuck 2
zuzu 2
zwick 2
zyrtec 2

Number of unique words

40,096

Q1 Some words are more frequent than others - what are the distributions of word frequencies?

Process Using the tidytext Package

We now switch to using tidytext, which is similar to tm but which I find more intuitive.

Q2 What are the frequencies of bigrams and trigrams in the dataset?

Unfiltered bigrams and trigrams

There are 8,583,202 unfiltered bigrams and 8,284,704 unfiltered trigrams.
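For readers unfamiliar with n-gram counting, the idea can be sketched in a few lines of Python (an illustration only; the counts in this report come from tidytext in R). The helper names ngrams and count_ngrams and the three sample lines are assumptions, and the sketch builds n-grams within each line rather than across line boundaries.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams across lines without crossing line boundaries."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    return counts


# Made-up lines standing in for the sampled documents.
lines = ["one of the best", "one of the few", "some of the best"]
bigrams = count_ngrams(lines, 2)
trigrams = count_ngrams(lines, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

Summing the counts gives the total number of n-gram instances, and sorting them by frequency gives ranked tables like the ones in this section.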

Top Unfiltered Bigrams

bigram n
of the 41293
in the 37778
to the 19802
on the 17591
for the 16317
to be 14114
at the 12572
and the 12288
in a 11074
with the 9860

Top Unfiltered Trigrams

trigram n
one of the 3220
a lot of 2772
to be a 1490
the end of 1437
going to be 1392
as well as 1366
out of the 1355
it was a 1299
some of the 1292
be able to 1260

Filtered bigrams and trigrams

There are 1,285,365 filtered bigrams and 473,420 filtered trigrams.

Top Filtered Bigrams

word1 word2 n
st louis 969
los angeles 682
san francisco 612
happy birthday 432
san diego 406
social media 385
ice cream 372
real estate 321
vice president 313
white house 308

Top Filtered Trigrams

word1 word2 word3 n
president barack obama 130
st louis county 101
world war ii 95
gov chris christie 88
happy mothers day 80
happy mother’s day 74

Q3 How many unique words do you need in a frequency sorted dictionary to cover 50% and 90% of all word instances in the language?

For this question we need to include the stop words (as it concerns all word instances in the language) but still exclude numbers.

Just 143 words are needed to cover 50% of all word instances in the language and 7,543 to cover 90%.
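The coverage figure is a cumulative sum over the frequency-sorted dictionary: keep adding words, most frequent first, until their combined count reaches the target share of all word instances. A minimal Python sketch of that calculation follows; the function name words_needed and the toy counts are assumptions, not the report's data.

```python
from collections import Counter


def words_needed(counts, coverage):
    """Number of most-frequent words needed to reach the coverage share."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= coverage * total:
            return rank
    return len(counts)


# Toy word frequencies (made up): 100 word instances in total.
counts = Counter({"the": 50, "of": 20, "and": 15, "cat": 10, "zebra": 5})
print(words_needed(counts, 0.5))
print(words_needed(counts, 0.9))
```

The steep gap between the 50% and 90% figures reflects how skewed word frequencies are: a small number of very common words account for most of the text.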

Model Design

I plan to build models for bigrams, trigrams and possibly quadgrams, which will predict the most probable next word given a single word, bigram or trigram.

The Shiny app will use the highest-order n-gram model that fits the words entered. If the probability of the predicted next word reaches a certain threshold (TBC), that prediction will be used. Otherwise the app will back off to the (n-1)-gram model and repeat the exercise.
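The back-off idea described above might look something like the following Python sketch. It is only an illustration of the planned design, not the eventual model: the names train and predict, the placeholder threshold of 0.5, the sample lines, and the choice to return the best single-word-context guess even when it falls below the threshold are all assumptions.

```python
from collections import Counter, defaultdict


def train(lines, max_n=3):
    """Map each context of 1..max_n words to a Counter of following words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n):
                model[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return model


def predict(model, words, threshold=0.5, max_n=3):
    """Try the longest context first and back off to shorter ones."""
    tokens = [w.lower() for w in words]
    for n in range(min(max_n, len(tokens)), 0, -1):
        followers = model.get(tuple(tokens[-n:]))
        if not followers:
            continue  # context never seen: back off
        word, freq = followers.most_common(1)[0]
        probability = freq / sum(followers.values())
        # Assumption: at the shortest context, accept the best guess anyway.
        if probability >= threshold or n == 1:
            return word
    return None  # no context matched at all


# Made-up training lines.
lines = [
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the support",
    "waiting for the bus",
]
model = train(lines)
print(predict(model, ["thanks", "for", "the"]))
print(predict(model, ["unseen", "for", "the"]))
```

In the second call the three-word context has never been seen, so the sketch falls back to the two-word context "for the". A context length of three corresponds to the quadgram model mentioned above.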

Different models will be tested and compared to see which provide the highest accuracy predictions.

Given the performance issues encountered, models will be trained on a subset of the data.