Source Data

A zip archive was downloaded from the Coursera project assignment web page. Three files were extracted from it and used in this report, each containing English language text from a different source: blogs, news articles and Twitter messages.

Reading the data

The text files were read into an R environment and the following summary information was produced for each. As can be seen below, the longest Twitter line is exactly the (old) 140-character limit.

Number of lines per source file:
source_text   line_count
blogs             899288
news             1010242
twitter          2360148
Longest line (number of characters) per source file:
source_text   longest_line
blogs                40833
news                 11384
twitter                140
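
A rough sketch of the reading step is shown below. The data/ directory and file names are assumptions (the usual Coursera capstone names), not confirmed by the report.

```r
library(dplyr)

# Assumed locations of the extracted files (not confirmed by the report).
files <- c(blogs   = "data/en_US.blogs.txt",
           news    = "data/en_US.news.txt",
           twitter = "data/en_US.twitter.txt")

# One row per line of text, tagged with its source.
all_lines <- bind_rows(lapply(names(files), function(src) {
  tibble(source_text = src,
         text        = readLines(files[[src]], encoding = "UTF-8", skipNul = TRUE))
}))

# Per-source line count and longest line, as in the tables above.
all_lines %>%
  group_by(source_text) %>%
  summarise(line_count   = n(),
            longest_line = max(nchar(text)))
```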

Most Frequent Words 1

The text lines were split into individual words (also known as tokens) and, from these, the number of words per source was determined:

source_text   word_count
blogs           37546246
news            34762395
twitter         30093369
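
A sketch of the tokenisation step, assuming the tidytext package and the `all_lines` data frame from the reading sketch above:

```r
library(dplyr)
library(tidytext)

# Split each line into lower-case word tokens, one row per word.
words <- all_lines %>%
  unnest_tokens(word, text)

# Words per source, as in the table above.
words %>%
  count(source_text, name = "word_count")
```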

Most Frequent Words 2

We can then see which words were used most frequently in each source text (the 10 most frequent source/word counts are shown).

source_text   word   word_count
news          the       1974366
blogs         the       1860156
blogs         and       1094401
blogs         to        1069440
twitter       the        937405
news          to         906145
blogs         a          900362
news          and        889511
news          a          878035
blogs         of         876799
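
Continuing the sketch, the table above corresponds to counting source/word pairs and keeping the most frequent rows:

```r
words %>%
  count(source_text, word, name = "word_count", sort = TRUE) %>%
  slice_head(n = 10)
```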

Most Frequent Words 3

Over all three texts, the most frequently used words are:

word   word_count
the       4771927
to        2764230
and       2422450
a         2389755
of        2010936
in        1657973
i         1657335
for       1103087
is        1075727
that      1042522
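
The overall counts are the same idea without the source column:

```r
words %>%
  count(word, name = "word_count", sort = TRUE) %>%
  slice_head(n = 10)
```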

Most frequent non-stop words

It can be seen from the above that stop words (words such as “the” and “a” - see [https://en.wikipedia.org/wiki/Stop_words]) dominate the counts. These were removed, along with numeric tokens (0, 1, 2, 3, …), to show the most frequent non-stop, non-numeric words. The top 10 words are shown.

word     word_count
time         224774
day          175983
love         161651
people       159280
life          91716
rt            89702
home          83247
week          78095
night         77360
game          74838

It is interesting that “time” and “day” are used so frequently.
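
One way to sketch the filtering above, assuming tidytext's built-in English stop word list was used (the report does not say which list it relied on):

```r
library(dplyr)
library(tidytext)

data("stop_words")   # tidytext's English stop word lexicons

words_clean <- words %>%
  anti_join(stop_words, by = "word") %>%   # drop stop words
  filter(!grepl("^[0-9]+$", word))         # drop purely numeric tokens

words_clean %>%
  count(word, name = "word_count", sort = TRUE) %>%
  slice_head(n = 10)
```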

Plot of frequently used words per source text

The plot below shows the 10 most frequently used words per source text, excluding stop words and digits (0, 1, 2, etc.).
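
A sketch of how such a plot could be produced with ggplot2, using `words_clean` from the previous sketch:

```r
library(dplyr)
library(ggplot2)

# Top 10 non-stop, non-numeric words per source.
top_words <- words_clean %>%
  count(source_text, word, name = "word_count") %>%
  group_by(source_text) %>%
  slice_max(word_count, n = 10) %>%
  ungroup()

# One bar panel per source; tidytext::reorder_within() could be used
# instead of reorder() to order the bars within each panel.
ggplot(top_words, aes(x = reorder(word, word_count), y = word_count)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ source_text, scales = "free") +
  labs(x = NULL, y = "word count")
```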

Design of Prediction Model 1 - Bigrams

The proposed next step is to split the source texts into bigram tokens held in a single data set. Bigrams [https://en.wikipedia.org/wiki/Bigram] are pairs of adjacent words.

Bigrams were created from the combined texts - the 10 most frequent are:

bigram     bigram_count
of the           431130
in the           408595
to the           213669
for the          201206
on the           197419
to be            162723
at the           142545
and the          125852
in a             119416
with the         106231
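
A sketch of the bigram step, again using tidytext's n-gram tokeniser on `all_lines`:

```r
library(dplyr)
library(tidytext)

# Two-word tokens; lines shorter than two words yield NA and are dropped.
bigrams <- all_lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

bigrams %>%
  count(bigram, name = "bigram_count", sort = TRUE) %>%
  slice_head(n = 10)
```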

Design of Prediction Model 2

Given a word that the user has typed, the prediction algorithm would use the frequency counts in the bigram data set to suggest possible next words. The prediction could be limited to four suggested words.

For example, if the user types “the”, the algorithm would look for the most frequent bigram whose first word is “the” and return the second word of that bigram.
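
A sketch of that lookup; the word1/word2 split and the `predict_next()` helper are illustrative names, not a confirmed design:

```r
library(dplyr)
library(tidyr)

# Split each bigram into its two words, keeping the frequency counts.
bigram_counts <- bigrams %>%
  count(bigram, name = "bigram_count") %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Return up to four candidate next words for the typed word.
predict_next <- function(typed_word, counts = bigram_counts, n_suggestions = 4) {
  counts %>%
    filter(word1 == typed_word) %>%
    arrange(desc(bigram_count)) %>%
    slice_head(n = n_suggestions) %>%
    pull(word2)
}

predict_next("the")   # most frequent words following "the", ranked by count
```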

Perhaps this could be made more interesting by using the following rules for creating the list of predicted words: