Exploratory analysis of data sets; Coursera capstone

Introduction

This document briefly summarises the findings of an exploratory analysis of the en_US data sets provided for the Capstone project of the Data Science specialisation on Coursera. It reviews the basic features of the three English data files, then begins to analyse the distribution of words, and of combinations of words, within the data. It closes with a brief discussion of potential modelling techniques.

Data files.

The data files analysed were downloaded from the Coursera website. The download included files for a number of languages, but only the English data files have been analysed. The three data files analysed were:

1) en_US.blogs.txt (sourced from blogs)
2) en_US.news.txt (sourced from news articles)
3) en_US.twitter.txt (sourced from tweets)

Features of the data sets.

Metadata

The table below lists some basic information about the three files read in.

File	Number of items	Number of unique words	Total number of words
news	20,581	349,194	33,550,925
twitter	2,360,148	495,760	29,409,829
blogs	899,288	460,798	36,885,227
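
As a rough illustration of how figures like these can be produced, the sketch below counts items (lines), unique words and total words for any iterable of lines. The tokenizer is an assumption made for this example (lower-case text, runs of letters and apostrophes); the rule used for the table above is not stated, so this sketch would not necessarily reproduce those exact numbers.

```python
import re
from typing import Iterable

# Assumed tokenizer (not taken from the original analysis):
# lower-case the text and keep runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(line: str) -> list[str]:
    """Split one line of text into lower-case word tokens."""
    return TOKEN_RE.findall(line.lower())


def summarise(lines: Iterable[str]) -> dict[str, int]:
    """Count items (lines), unique words and total words."""
    items = 0
    total_words = 0
    vocabulary: set[str] = set()
    for line in lines:
        items += 1
        tokens = tokenize(line)
        total_words += len(tokens)
        vocabulary.update(tokens)
    return {
        "items": items,
        "unique_words": len(vocabulary),
        "total_words": total_words,
    }


# Tiny in-memory sample so the sketch runs without the data files.
sample = ["The cat sat on the mat.", "The dog sat too!"]
print(summarise(sample))
# {'items': 2, 'unique_words': 7, 'total_words': 10}
```

Because `summarise` accepts any iterable of lines, an open file handle for one of the data files could be passed in directly, so that the file is streamed rather than loaded into memory at once.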

Words

The table below lists the ten most common words across the three files combined.

word	total_count
the	4,748,972
to	2,752,048
and	2,401,905
a	2,378,345
of	2,005,004
in	1,642,609
i	1,628,263
for	1,099,201
is	1,072,038
that	1,036,473
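
A frequency table of this kind can be sketched with a counter per file that is then merged into one overall count. The tokenizer and the tiny sample corpora below are assumptions for illustration only, not the data or method behind the table above.

```python
import re
from collections import Counter

# Assumed tokenizer: lower-case, runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def count_words(lines):
    """Return a Counter of word frequencies for an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


# Hypothetical stand-ins for the blogs, news and twitter files.
corpora = {
    "blogs": ["The cat and the hat"],
    "news": ["The news of the day"],
    "twitter": ["I saw the cat"],
}

# Merge the per-file counts into a single total.
total = Counter()
for lines in corpora.values():
    total.update(count_words(lines))

for word, count in total.most_common(2):
    print(word, count)
# the 5
# cat 2
```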

The graph below illustrates the distribution of words within the data set. One way to read it is as a picture of the “un-evenness” of the data set: a perfect diagonal line from the bottom-left to the top-right would indicate that every word appeared an equal number of times.

The distribution of words is highly concentrated: a small number of words appear very often, and a large number of words appear only a few times. A distribution of this nature may make the modelling more difficult, because many words will have relatively few observations to learn from.
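
One way to put a number on this concentration is a cumulative coverage curve: sort the words by frequency and track what share of all word occurrences the top k words account for. The sketch below is an assumption about how such a curve could be computed (most frequent word first); it is not necessarily how the graph above was drawn. If every word appeared equally often, the curve would rise in equal steps, which corresponds to the diagonal line described above.

```python
from collections import Counter
from itertools import accumulate


def coverage_curve(counts):
    """Cumulative share of all occurrences covered by the k most frequent words."""
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def words_needed(counts, target):
    """Smallest number of distinct words whose occurrences reach the target share."""
    for k, share in enumerate(coverage_curve(counts), start=1):
        if share >= target:
            return k
    return len(counts)


# Hypothetical counts: one very common word and a short tail.
counts = Counter({"the": 6, "cat": 2, "sat": 1, "mat": 1})
print(words_needed(counts, 0.5))   # 1
print(words_needed(counts, 0.85))  # 3

# With perfectly even counts the curve rises in equal steps.
even = Counter({"a": 1, "b": 1, "c": 1, "d": 1})
print(coverage_curve(even))  # [0.25, 0.5, 0.75, 1.0]
```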

It is also worth noting that there does not appear to be a significant difference in the distributions across the three data sets. This means that we may be able to use a single model across all three data sets, rather than having to develop different models.

Two-grams

In predictive text analytics, a “two-gram” is a sequence of two consecutive items (in this case, words), and a “three-gram” is a sequence of three consecutive words. These will form the basis of the modelling.

The table below lists the ten most common two-grams across the data sets.

two_gram	total_count
of the	430,284
in the	411,620
to the	213,786
for the	200,948
on the	196,284
to be	161,524
at the	142,884
and the	125,459
in a	119,926
with the	105,794
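
Two-gram counts like these can be sketched by sliding a window of n words over each line. The example below assumes that n-grams do not cross line boundaries and reuses a simple lower-casing tokenizer; both are assumptions for illustration, since the original counting rules are not stated.

```python
import re
from collections import Counter

# Assumed tokenizer: lower-case, runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams over an iterable of lines, never spanning two lines."""
    counts = Counter()
    for line in lines:
        tokens = TOKEN_RE.findall(line.lower())
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical sample lines.
sample = ["At the end of the day", "The end of the road"]
two_grams = count_ngrams(sample, 2)
print(two_grams["of the"])   # 2
print(two_grams["the day"])  # 1
```

Calling `count_ngrams(sample, 3)` gives three-gram counts in the same way.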

The graph below shows the distribution of two-grams across the data set.

The distribution of two-grams is slightly more even than the distribution of words.

Modelling techniques

The model will likely be built to predict the most common following word, given the previous three-gram or two-gram. In some ways this is a deterministic solution: the predicted word is simply a lookup of what has most often followed that combination of words in the past.

The complications are likely to come in two ways. Firstly, the lookup must be constructed in such a way as to allow for speedy prediction. Secondly, the model must allow for instances where a word combination is entered that does not appear in the data set. This can possibly be achieved by falling back to a prediction based on the individual word where the two-gram or three-gram does not appear in the data set.
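
The lookup-with-fallback idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the final model: the function names `train` and `predict`, the tokenizer and the sample sentences are all hypothetical. Training records, for every context of one to three preceding words, which word followed most often; storing only that single best word per context in a dictionary keeps prediction to a handful of hash lookups. Prediction tries the longest context first and backs off to shorter ones, ending with the overall most common word.

```python
import re
from collections import Counter, defaultdict

# Assumed tokenizer: lower-case, runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")
MAX_CONTEXT = 3  # longest context used: the previous three words


def train(lines, max_context=MAX_CONTEXT):
    """Build a context -> most common next word table, plus a global fallback."""
    followers = defaultdict(Counter)
    unigrams = Counter()
    for line in lines:
        tokens = TOKEN_RE.findall(line.lower())
        unigrams.update(tokens)
        for i in range(1, len(tokens)):
            # Record tokens[i] as a follower of the 1, 2 and 3 words before it.
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break
                followers[tuple(tokens[i - n:i])][tokens[i]] += 1
    # Keep only the single best follower so prediction is one dict lookup.
    table = {ctx: c.most_common(1)[0][0] for ctx, c in followers.items()}
    fallback = unigrams.most_common(1)[0][0] if unigrams else None
    return table, fallback


def predict(table, fallback, text, max_context=MAX_CONTEXT):
    """Try the longest known context first, then back off to shorter ones."""
    tokens = TOKEN_RE.findall(text.lower())
    for n in range(min(max_context, len(tokens)), 0, -1):
        context = tuple(tokens[-n:])
        if context in table:
            return table[context]
    return fallback  # nothing matched: use the most common word overall


# Hypothetical training sentences.
sample = [
    "I want to go home",
    "I want to go out",
    "I want to go home now",
    "We have to leave",
]
table, fallback = train(sample)
print(predict(table, fallback, "I want to go"))  # home
print(predict(table, fallback, "They need to"))  # go
print(predict(table, fallback, "zebra"))         # to
```

On the full data set, a table of this kind would probably need pruning (for example, dropping contexts seen only once) to stay small; that threshold would be a design choice to test rather than something established here.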