Introduction

This document summarises the findings of an exploratory analysis of the en_US data sets provided for the Capstone project of the Data Science specialisation on Coursera. It reviews the basic features of the three English data files, begins to analyse the distribution of words and word combinations within the data, and closes with a brief discussion of potential modelling techniques.

Data files.

The data files analysed were downloaded from the Coursera website. The download included files for a number of languages; however, only the English data files have been analysed. The three data files analysed were:
1) en_US.blogs.txt (sourced from blogs)
2) en_US.news.txt (sourced from news articles)
3) en_US.twitter.txt (sourced from tweets)

Features of the data sets.

Metadata

The table below lists some basic information about the three files read in.

File      Number of items   Number of unique words   Total number of words
news               20,581                  349,194              33,550,925
twitter         2,360,148                  495,760              29,409,829
blogs             899,288                  460,798              36,885,227
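
As a rough illustration of how figures like these can be produced, the sketch below counts items (lines), unique words and total words for a collection of text lines. It is not the code used for this report: the `tokenize` and `summarise` names and the tokenisation rule (lowercase runs of letters and apostrophes) are assumptions made for the example, and a different rule would give different counts.

```python
import re
from collections import Counter

# Assumed tokenisation rule: lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")

def tokenize(line):
    """Split one line of text into lowercase word tokens."""
    return TOKEN_RE.findall(line.lower())

def summarise(lines):
    """Return (number of items, number of unique words, total number of words)."""
    counts = Counter()
    items = 0
    for line in lines:
        items += 1
        counts.update(tokenize(line))
    return items, len(counts), sum(counts.values())

sample = ["The cat sat on the mat.", "The dog sat too!"]
print(summarise(sample))  # (2, 7, 10)
```

For a real file, the same function could be given an open file handle, for example `summarise(open("en_US.blogs.txt", encoding="utf-8"))`, so that lines are read one at a time rather than loaded into memory together.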

Words

This table lists the most common words across the three files.

word   total_count
the      4,748,972
to       2,752,048
and      2,401,905
a        2,378,345
of       2,005,004
in       1,642,609
i        1,628,263
for      1,099,201
is       1,072,038
that     1,036,473

The graph below illustrates the distribution of words within the data set. One way to read this graph is as a measure of the “un-evenness” of the data set: a perfect diagonal line from the bottom-left to the top-right would indicate that each word appeared an equal number of times in the data set.

The distribution of words is quite concentrated: a small number of words appear very frequently, while a large number of words appear only a few times. A distribution of this nature may make the modelling somewhat difficult, as many words will have comparatively few observations.

It is also worth noting that there does not appear to be a significant difference in the distributions across the three data sets. This means that we may be able to use a single model across all three data sets, rather than having to develop different models.
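
The "perfect diagonal" reading of the word-distribution graph can be made concrete with a small sketch. The exact axes of the original graph are not stated here, so the code assumes a cumulative-share curve: words are sorted from rarest to most common and the running share of all occurrences is recorded. The `cumulative_share` function and the toy counts are hypothetical.

```python
from collections import Counter

def cumulative_share(counts):
    """Cumulative share of all occurrences, adding words from rarest to most common."""
    ordered = sorted(counts.values())
    total = sum(ordered)
    shares, running = [], 0
    for c in ordered:
        running += c
        shares.append(running / total)
    return shares

# Every word equally common: the shares rise in equal steps (a diagonal line).
even = Counter({"a": 5, "b": 5, "c": 5, "d": 5})
# One dominant word: the curve stays low and then jumps at the end.
skewed = Counter({"the": 17, "cat": 1, "sat": 1, "mat": 1})

print(cumulative_share(even))    # [0.25, 0.5, 0.75, 1.0]
print(cumulative_share(skewed))  # [0.05, 0.1, 0.15, 1.0]
```

The further the curve sags below the diagonal, the more the occurrences are concentrated in a few words.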

Two-grams

In predictive text analytics, a “two-gram” is a sequence of two consecutive items (in this case, words), and a “three-gram” is a sequence of three consecutive words. These will form the basis of the modelling.
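
A minimal sketch of how two-grams and three-grams can be extracted and counted is given below. The `ngrams` and `count_ngrams` helpers are hypothetical names, and splitting on whitespace is a simplifying assumption; punctuation handling is omitted.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n):
    """Count n-grams within each line; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    return counts

sample = ["one of the best", "of the people"]
print(ngrams("one of the best".split(), 3))
# [('one', 'of', 'the'), ('of', 'the', 'best')]
print(count_ngrams(sample, 2).most_common(1))
# [(('of', 'the'), 2)]
```

Counting within each line separately means the last word of one item is never paired with the first word of the next.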

The table below lists the most common two-grams across the data sets.

two-gram   total_count
of the         430,284
in the         411,620
to the         213,786
for the        200,948
on the         196,284
to be          161,524
at the         142,884
and the        125,459
in a           119,926
with the       105,794

The graph below shows the distribution of two-grams across the data set.

The distribution of two-grams is slightly more even than the distribution of words.

Modelling techniques

The model will likely be built to predict the most likely next word, given the previous three-gram or two-gram. In some sense this is a deterministic solution: the prediction is simply a lookup of which word has most often followed that combination of words in the past. The complication is likely to come in two ways. Firstly, the lookup must be constructed in such a way as to allow for speedy prediction. Secondly, the model must allow for instances where a word combination is entered that does not appear in the data set. This can possibly be achieved by predicting based on the individual word where the two-gram or three-gram does not appear in the data set.
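
The lookup idea can be sketched as follows. This illustrates the approach under stated assumptions and is not the final model: `build_lookup` and `predict` are hypothetical names, tokens are split on whitespace, and the last resort of returning the most common word overall when even the single previous word is unseen is an added assumption. Contexts are stored as dictionary keys, so each prediction needs at most a handful of hash lookups.

```python
from collections import Counter, defaultdict

def build_lookup(lines, max_context=3):
    """Map each context of 0..max_context preceding words to counts of the next word."""
    lookup = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_context + 1):
                if i - k < 0:
                    break  # not enough preceding words for a context this long
                lookup[tuple(tokens[i - k:i])][word] += 1
    return lookup

def predict(lookup, text, max_context=3):
    """Return the most common follower of the longest context seen in training."""
    tokens = text.lower().split()
    for k in range(min(max_context, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        if context in lookup:
            return lookup[context].most_common(1)[0][0]
    return None  # only reached when the lookup is empty

corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow",
    "looking for the exit",
]
lookup = build_lookup(corpus)
print(predict(lookup, "thanks for the"))   # follow (three-gram seen)
print(predict(lookup, "waiting for the"))  # follow (backs off to "for the")
print(predict(lookup, "near the"))         # follow (backs off to "the")
```

Storing raw counts keeps the example simple; a memory-conscious version might keep only the top follower for each context.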