A zip archive was downloaded from the Coursera project assignment web page, and the files used in this report were extracted from it. These files contain English language text from three different sources: blogs, news articles, and Twitter.
The text files were read into an “R” environment and the following summary information was produced for each. As can be seen below, the maximum line length in the Twitter file matches the (old) limit of 140 characters.
Number of lines per source file:

| source_text | line_count |
|---|---|
| blogs | 899288 |
| news | 1010242 |
| twitter | 2360148 |
Longest line (in characters) per source file:

| source_text | longest_line |
|---|---|
| blogs | 40833 |
| news | 11384 |
| twitter | 140 |
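A minimal sketch of how these summaries could be produced in R is shown below; the file names (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) and the use of `readLines()` are assumptions, not the exact code used for this report.

```r
library(dplyr)
library(tibble)

# Assumed file names for the three extracted source files
files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

# For each file: read the lines, then record how many there are
# and the length of the longest one
summaries <- lapply(names(files), function(src) {
  lines <- readLines(files[[src]], encoding = "UTF-8", skipNul = TRUE)
  tibble(source_text  = src,
         line_count   = length(lines),
         longest_line = max(nchar(lines)))
})

bind_rows(summaries)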
The text lines were split into individual words (also known as tokens), and from these the number of words per source was determined:
| source_text | word_count |
|---|---|
| blogs | 37546246 |
| news | 34762395 |
| twitter | 30093369 |
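A sketch of the tokenisation step is shown below. It assumes the lines have been gathered into a tibble `all_lines` with columns `source_text` and `text` (an assumed intermediate), and that the tidytext package is used for tokenisation.

```r
library(dplyr)
library(tidytext)

# `all_lines`: one row per line of text, columns `source_text` and `text` (assumed)
words <- all_lines %>%
  unnest_tokens(word, text)          # one row per word (token)

# Word counts per source
words %>%
  count(source_text, name = "word_count")
```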
We can then see which words were used most frequently per source text (the top 10 are shown).
| source_text | word | word_count |
|---|---|---|
| news | the | 1974366 |
| blogs | the | 1860156 |
| blogs | and | 1094401 |
| blogs | to | 1069440 |
| twitter | the | 937405 |
| news | to | 906145 |
| blogs | a | 900362 |
| news | and | 889511 |
| news | a | 878035 |
| blogs | of | 876799 |
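The per-source top-10 counts could be obtained from the same `words` tibble along these lines (a sketch, reusing the assumed tibble from the previous snippet):

```r
library(dplyr)

# Ten most frequent words within each source
words %>%
  count(source_text, word, name = "word_count") %>%
  group_by(source_text) %>%
  slice_max(word_count, n = 10) %>%
  ungroup() %>%
  arrange(desc(word_count))
```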
Over all three texts, the most frequently used words are:
| word | word_count |
|---|---|
| the | 4771927 |
| to | 2764230 |
| and | 2422450 |
| a | 2389755 |
| of | 2010936 |
| in | 1657973 |
| i | 1657335 |
| for | 1103087 |
| is | 1075727 |
| that | 1042522 |
It can be seen from the above that stop words (words such as “the” and “a” - see [https://en.wikipedia.org/wiki/Stop_words]) dominate the counts. These were removed, along with digits, to show the most frequent non-stop, non-numeric words. The top 10 words are shown.
| word | word_count |
|---|---|
| time | 224774 |
| day | 175983 |
| love | 161651 |
| people | 159280 |
| life | 91716 |
| rt | 89702 |
| home | 83247 |
| week | 78095 |
| night | 77360 |
| game | 74838 |
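One way to perform this filtering is sketched below, using tidytext’s built-in `stop_words` lexicon; the exact stop word list used for the table above is an assumption.

```r
library(dplyr)
library(tidytext)

data("stop_words")   # tidytext's built-in stop word lexicon (assumed here)

clean_words <- words %>%
  anti_join(stop_words, by = "word") %>%   # remove stop words
  filter(!grepl("^[0-9]+$", word))         # remove purely numeric tokens

# Ten most frequent remaining words across all three sources
clean_words %>%
  count(word, name = "word_count", sort = TRUE) %>%
  head(10)
```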
It is interesting that “time” and “day” are used so frequently.
The plot below shows the 10 most frequently used words per source text, excluding stop words and digits (0, 1, 2, etc.).
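A possible way to build such a plot with ggplot2 is sketched below, reusing the `clean_words` tibble from the previous snippet; the plotting choices are illustrative rather than the exact figure code.

```r
library(dplyr)
library(ggplot2)
library(tidytext)   # for reorder_within() / scale_x_reordered()

# Top 10 non-stop, non-numeric words per source
top_clean <- clean_words %>%
  count(source_text, word, name = "word_count") %>%
  group_by(source_text) %>%
  slice_max(word_count, n = 10) %>%
  ungroup()

# One panel per source, bars ordered within each panel
ggplot(top_clean,
       aes(x = reorder_within(word, word_count, source_text), y = word_count)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ source_text, scales = "free_y") +
  scale_x_reordered() +
  labs(x = "word", y = "word_count")
```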
The proposed next step is to split the source texts into bigram tokens held in a single data set. Bigrams [https://en.wikipedia.org/wiki/Bigram] are two adjacent words.
Bigrams were created from the entire texts - the 10 most frequent are:
| bigram | bigram_count |
|---|---|
| of the | 431130 |
| in the | 408595 |
| to the | 213669 |
| for the | 201206 |
| on the | 197419 |
| to be | 162723 |
| at the | 142545 |
| and the | 125852 |
| in a | 119416 |
| with the | 106231 |
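The bigram tokenisation could be done in the same tidytext style, again assuming the `all_lines` tibble from earlier:

```r
library(dplyr)
library(tidytext)

# Split each line into two-word (bigram) tokens
bigrams <- all_lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))    # lines with fewer than two words yield NA bigrams

# Count bigrams over all three texts combined
bigram_counts <- bigrams %>%
  count(bigram, name = "bigram_count", sort = TRUE)

head(bigram_counts, 10)
```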
Given a word that the user has typed in, the prediction algorithm would then use the frequency counts in the bigram data set to suggest the next possible words. It could limit the prediction to four suggested words.
For example, if the user types in “the”, the algorithm would look for the most frequent bigrams whose first word is “the” and return their second words.
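A minimal sketch of this lookup is shown below; the function name `predict_next` and its arguments are hypothetical, and in practice the bigram column would be split into its two words once, up front, rather than on every call.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: given the previous word, return up to `max_suggestions`
# candidate next words, ordered by bigram frequency
predict_next <- function(prev_word, bigram_counts, max_suggestions = 4) {
  bigram_counts %>%
    separate(bigram, into = c("word1", "word2"), sep = " ") %>%
    filter(word1 == prev_word) %>%
    arrange(desc(bigram_count)) %>%
    head(max_suggestions) %>%
    pull(word2)
}

# Example: suggest up to four words to follow "the"
predict_next("the", bigram_counts)
```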
Perhaps this could be made more interesting by using the following rules for creating the list of predicted words: