The purpose of this report is to explore the contents of three files containing tweets, blog posts and newspaper articles. It analyses the makeup of lines, single words, bigrams (two-word combinations) and trigrams (three-word combinations).

This data will later be used to build a predictive text model to predict the next word following a single word, bigram or trigram.

Load the Data

Measure	Tweets	Blogs	News
Lines	2,360,148	899,288	1,010,242
Characters	162,096,031	206,824,505	203,223,159
Mean characters per line (document)	68.68	229.99	201.16

Build and Transform Corpora using the tm package

‘A common approach in text mining is to create a term-document matrix from a corpus. In the tm package the classes TermDocumentMatrix and DocumentTermMatrix (depending on whether you want terms as rows and documents as columns, or vice versa) employ sparse matrices for corpora. Inspecting a term-document matrix displays a sample, whereas as.matrix() yields the full matrix in dense format (which can be very memory consuming for large matrices).’ https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Word counts for each corpus created using the tm package are shown below, both for the original data and after tidying. To tidy the dataset, I removed punctuation, numbers and standard English stopwords, and applied stemming using Porter's stemming algorithm.
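As a rough illustration of the tidying steps described above, here is a minimal sketch in plain Python rather than the tm code used for the report. The function name `tidy` and the tiny stopword list are assumptions made for the example (tm's English stopword list is much longer), and stemming is left out to keep the sketch dependency-free.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch only).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "was"}

def tidy(text, stopwords=STOPWORDS):
    """Lower-case the text, delete punctuation and digits, drop stopwords."""
    text = text.lower()
    # Delete every character that is not an ASCII letter or whitespace.
    text = re.sub(r"[^a-z\s]", "", text)
    # Split on whitespace and filter out stopwords.
    return [word for word in text.split() if word not in stopwords]

print(tidy("In 2019, the Cardinals won 3 of the 4 games!"))
```

Because the tidied word count is taken after stopwords are dropped, it is expected to be well below the original count, as in the table below.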

Note that the analysis below is based on a sample of 100,000 records from each file, as processing the complete files was still running after 16 hours.
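The report does not say how the 100,000 records were chosen. One option, sketched here in Python as an assumption rather than the method actually used, is reservoir sampling, which draws a uniform random sample in a single pass without loading the whole file into memory. The function name `sample_lines` and the fixed seed are hypothetical.

```python
import random

def sample_lines(lines, k, seed=42):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir

# Usage on any iterable of lines, for example an open file handle.
sample = sample_lines((f"line {i}" for i in range(1000)), 10)
print(len(sample))
```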

File	Words (original)	Words (tidied)
en_US.blogs.txt	3,260,769	2,009,404
en_US.news.txt	2,808,015	1,844,205
en_US.twitter.txt	993,231	658,719

Top 10 Words

word	freq
said	29,451
will	28,002
one	26,781
like	23,548
just	22,635
get	22,566
time	22,046
can	20,807
year	19,862
make	17,106

Bottom 10 Words

word	freq
zoloft	2
zoni	2
zopa	2
zori	2
zorro”	2
zotto	2
zuck	2
zuzu	2
zwick	2
zyrtec	2

Number of unique words

40,096

Q1 Some words are more frequent than others - what are the distributions of word frequencies?

Process using the tidytext package

We now switch to using the tidytext package, which is similar to tm but which I find more intuitive.

Q2 What are the frequencies of bigrams and trigrams in the dataset?

Unfiltered bigrams and trigrams

There are 8,583,202 unfiltered bigrams and 8,284,704 unfiltered trigrams.
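To make the counting concrete, the following Python sketch shows one way to tally n-grams with a sliding window over each line's tokens. It is an illustration only and not the tidytext code behind the figures above; the helper names `ngrams` and `count_ngrams` and the whitespace tokenisation are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n):
    """Count n-grams within each line; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    return counts

lines = ["one of the best", "one of the worst of the lot"]
bigrams = count_ngrams(lines, 2)
trigrams = count_ngrams(lines, 3)
print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

A line of m tokens yields m - 1 bigrams and m - 2 trigrams, which is why the trigram total is slightly lower than the bigram total.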

Top Unfiltered Bigrams

bigram	n
of the	41293
in the	37778
to the	19802
on the	17591
for the	16317
to be	14114
at the	12572
and the	12288
in a	11074
with the	9860

Top Unfiltered Trigrams

trigram	n
one of the	3220
a lot of	2772
to be a	1490
the end of	1437
going to be	1392
as well as	1366
out of the	1355
it was a	1299
some of the	1292
be able to	1260

Filtered bigrams and trigrams

There are 1,285,365 filtered bigrams and 473,420 filtered trigrams.

Top Filtered Bigrams

word1	word2	n
st	louis	969
los	angeles	682
san	francisco	612
happy	birthday	432
san	diego	406
social	media	385
ice	cream	372
real	estate	321
vice	president	313
white	house	308

Top Filtered Trigrams

word1	word2	word3	n
president	barack	obama	130
st	louis	county	101
world	war	ii	95
gov	chris	christie	88
happy	mothers	day	80
happy	mother’s	day	74

Q3 How many unique words do you need in a frequency sorted dictionary to cover 50% and 90% of all word instances in the language?

Stop words must be included here, as the question covers all word instances in the language, but numbers are still excluded.

Just 143 words are needed to cover 50% of all word instances in the language and 7,543 to cover 90%.
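The coverage figure can be computed by sorting words by frequency and accumulating counts until the target share of all word instances is reached. The Python sketch below illustrates the idea on made-up counts; the function name `words_needed` and the toy data are assumptions, not the report's code or data.

```python
from collections import Counter

def words_needed(counts, coverage):
    """Smallest number of top-frequency words covering the given share."""
    total = sum(counts.values())
    running = 0
    # most_common() yields (word, frequency) pairs in descending order.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= coverage * total:
            return rank
    return len(counts)

# Toy frequencies, invented for illustration only.
toy = Counter({"the": 50, "of": 25, "cat": 12, "dog": 8, "zuzu": 5})
print(words_needed(toy, 0.5))
print(words_needed(toy, 0.9))
```

Because word frequencies are heavily skewed towards a few common words, the number needed grows sharply between 50% and 90% coverage.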

Model Design

I plan to build models for bigrams, trigrams and possibly quadgrams, which will predict the most probable next (n+1) word given a single word, bigram or trigram.

The Shiny app will use the highest-order model available for the words entered; if the probability of the n+1 word reaches a certain threshold (TBC), that prediction will be used. Otherwise it will back off to the (n-1)-gram model and repeat the exercise.
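The back-off idea described above might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the names `build_model` and `predict` are hypothetical, the threshold of 0.5 is a placeholder for the value still to be chosen, and returning the best guess from the longest matching context when no level reaches the threshold is an assumption the plan does not specify.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=3):
    """Map each context of 1..max_n words to a Counter of following words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n):
                model[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return model

def predict(model, words, threshold=0.5, max_n=3):
    """Try the longest context first, backing off one word at a time."""
    context = tuple(w.lower() for w in words[-max_n:])
    fallback = None
    while context:
        followers = model.get(context)
        if followers:
            word, freq = followers.most_common(1)[0]
            if freq / sum(followers.values()) >= threshold:
                return word
            if fallback is None:
                fallback = word  # assumption: keep the longest-context guess
        context = context[1:]  # drop the oldest word and retry
    return fallback

corpus = ["happy mothers day", "happy mothers day",
          "happy birthday to you", "mothers day is here"]
model = build_model(corpus)
print(predict(model, ["happy", "mothers"]))
```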

Different models will be tested and compared to see which provides the most accurate predictions.

Given the performance issues encountered, models will be trained on a subset of the data.

Data Science Specialisation Capstone Week 2

Chris Woods

18/7/2019