This document summarises the findings of an exploratory analysis of the en_US data sets provided as part of the Capstone project for the Data Science specialisation on Coursera. It reviews the basic features of the three English data files provided, begins to analyse the distribution of words and combinations of words within the data, and closes with a brief discussion of potential modelling techniques.
The data files analysed were downloaded from the Coursera website. The download contains files for a number of languages; only the English data files have been analysed here. The three data files analysed were 1) en_US.blogs.txt (sourced from blogs), 2) en_US.news.txt (sourced from news articles) and 3) en_US.twitter.txt (sourced from tweets).
The below table lists some basic information about the three files read in.
| File | Number of items | Number of unique words | Total number of words |
|---|---|---|---|
| news | 20,581 | 349,194 | 33,550,925 |
| twitter | 2,360,148 | 495,760 | 29,409,829 |
| blogs | 899,288 | 460,798 | 36,885,227 |
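Counts of this kind can be reproduced with a short script. The following is a minimal sketch rather than the code used for the table above: the tokenisation rule (lower-cased runs of letters and apostrophes) and the function names `tokenize` and `summarise` are assumptions, so the exact figures it produces may differ from those reported.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters or apostrophes, lower-cased.
WORD_RE = re.compile(r"[a-z']+")

def tokenize(line):
    """Split one line of text into lower-case word tokens."""
    return WORD_RE.findall(line.lower())

def summarise(lines):
    """Return (number of items, number of unique words, total number of words)."""
    counts = Counter()
    n_items = 0
    for line in lines:
        n_items += 1
        counts.update(tokenize(line))
    return n_items, len(counts), sum(counts.values())

# Tiny in-memory example; each string stands in for one line of a data file.
sample = ["The cat sat on the mat.", "The dog barked!"]
print(summarise(sample))  # (2, 7, 9)
```

To apply it to one of the data files, pass an open file handle instead of the list, for example `summarise(open("en_US.blogs.txt", encoding="utf-8"))`, since iterating over a file yields one line at a time.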
This table lists the most common words across the three files.
| word | total_count |
|---|---|
| the | 4,748,972 |
| to | 2,752,048 |
| and | 2,401,905 |
| a | 2,378,345 |
| of | 2,005,004 |
| in | 1,642,609 |
| i | 1,628,263 |
| for | 1,099,201 |
| is | 1,072,038 |
| that | 1,036,473 |
The below graph illustrates the distribution of words within the data set. One way to read this graph is as a picture of the “un-evenness” of the data set: a perfect diagonal line from the bottom-left to the top-right would indicate that each word appeared an equal number of times in the data set.
The distribution of words is quite concentrated, with a small number of words appearing very often and a large number of words appearing only a few times. A distribution of this nature may make the modelling somewhat difficult, as many words will have only a small number of observations.
It is also worth noting that there does not appear to be a significant difference in the distributions across the three data sets. This means that we may be able to use a single model across all three data sets, rather than having to develop different models.
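The concentration described above can also be expressed as a number: how many distinct words are needed to cover a given share of all word occurrences. The sketch below illustrates the idea on made-up counts; the function names `coverage_curve` and `words_needed` and the toy data are assumptions, and this is not necessarily how the graph in this report was produced.

```python
from collections import Counter

def coverage_curve(counts):
    """Cumulative share of all occurrences, from most to least frequent word."""
    total = sum(counts.values())
    running = 0
    curve = []
    for _, c in counts.most_common():
        running += c
        curve.append(running / total)
    return curve

def words_needed(counts, share):
    """Smallest number of distinct words covering at least `share` of occurrences."""
    for i, covered in enumerate(coverage_curve(counts), start=1):
        if covered >= share:
            return i
    return len(counts)

# Made-up counts, for illustration only.
toy = Counter({"the": 50, "to": 30, "and": 10, "zebra": 5, "quark": 5})
print(coverage_curve(toy))      # [0.5, 0.8, 0.9, 0.95, 1.0]
print(words_needed(toy, 0.75))  # 2
```

If every word appeared equally often, the curve would rise in equal steps; the faster it climbs at the start, the more concentrated the distribution.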
In predictive text-analytics, a “two-gram” is a combination of two items (in this case, words), and a “three-gram” is a combination of three words. These will form the basis of the modelling.
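As an illustration of these definitions, the sketch below extracts and counts n-grams from a few made-up sentences. Splitting on whitespace and the helper names `ngrams` and `count_ngrams` are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the consecutive n-word sequences in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n):
    """Count n-grams line by line, so that no n-gram spans two items."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    return counts

# Made-up sentences, for illustration only.
sample = ["at the end of the day", "by the end of the week"]
two_grams = count_ngrams(sample, 2)
three_grams = count_ngrams(sample, 3)
print(two_grams[("of", "the")])           # 2
print(three_grams[("the", "end", "of")])  # 2
```

Counting within each line keeps the last word of one blog post, news article or tweet from being paired with the first word of the next.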
The below table lists the most common two-grams across the data sets.
| two-gram | total_count |
|---|---|
| of the | 430,284 |
| in the | 411,620 |
| to the | 213,786 |
| for the | 200,948 |
| on the | 196,284 |
| to be | 161,524 |
| at the | 142,884 |
| and the | 125,459 |
| in a | 119,926 |
| with the | 105,794 |
The below graph shows the distribution of two-grams across the data set.
The distribution of two-grams is slightly more even than the distribution of words.
The model will likely be built to predict the most likely next word, given the previous three-gram or two-gram. This could be considered, in some ways, a deterministic solution: the most likely word is simply looked up from what has followed this combination of words in the past. The complication is likely to come in two ways: firstly, constructing the lookup in such a way as to allow for a speedy prediction; secondly, allowing for instances where a word combination is entered that does not appear in the data set. The latter can possibly be handled by predicting based on the individual word where the two-gram or three-gram does not appear in the data set.
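The lookup-with-fallback idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the final model: the names `build_lookup` and `predict`, the whitespace tokenisation, and the choice to fall back one word at a time (three words, then two, then one, then the overall most common word) are all hypothetical.

```python
from collections import Counter, defaultdict

def build_lookup(lines, max_context=3):
    """Map each context of 0..max_context preceding words to next-word counts."""
    lookup = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_context + 1):
                if i - k < 0:
                    break  # not enough preceding words for a longer context
                lookup[tuple(tokens[i - k:i])][word] += 1
    return lookup

def predict(lookup, text, max_context=3):
    """Predict the next word, backing off to shorter contexts when unseen."""
    tokens = text.lower().split()
    for k in range(min(max_context, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        if context in lookup:  # membership test does not create new entries
            return lookup[context].most_common(1)[0][0]
    return None  # only reached when the lookup is empty

# Made-up training sentences, for illustration only.
corpus = ["i want to go home", "i want to go out",
          "i want to go home now", "you have to go"]
lookup = build_lookup(corpus)
print(predict(lookup, "i want to go"))     # home (three-word context seen)
print(predict(lookup, "they need to go"))  # home (falls back to "to go")
```

Because each context is a dictionary key, a prediction costs at most a handful of hash lookups regardless of corpus size. Memory use could be reduced by storing only the most frequent follower for each context, at the cost of being unable to offer alternative suggestions.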