1.- Overview

In this report we perform an exploratory analysis of a data set as part of the capstone project for the Data Science Coursera specialization (Johns Hopkins University). The general goal of the project is to build a model that, given an input phrase, predicts the next most likely word the user will write. The initial part of this project, and the focus of this report, is to explore the text data that will be used during the learning phase of the predictive model. To facilitate readability, we divide this report into the following sections: an initial summary of the data set, data cleaning, basic statistics and visualization, and plans to build the prediction model, followed by an appendix.

2.- Initial summary of the data set

For this project we use a data set that should be downloaded from a link provided on the Coursera website. Once unzipped, we can see that the text data we will work with consists of three text files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

In the following table we provide a summary of the raw (uncleaned) text data; the longest, shortest, and median line lengths are measured in characters.

                   Number of Lines  Longest Line  Shortest Line  Median Line  Number of Words  Number of Unique Words  Percent of Unique Words
en_US.blogs.txt             899288         40835              1          157         38636958                  405295                 1.048983
en_US.news.txt               77259          5760              2          186          2765729                   94146                 3.404021
en_US.twitter.txt          2360148           213              2           64         31256916                  452831                 1.448739
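
For reference, a minimal sketch (not the exact code used to build the table above) of how these raw statistics could be computed for one of the files:

#summary of one raw text file (illustrative sketch)
summarizeFile <- function(path) {
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
    words <- unlist(strsplit(lines, "\\s+"))
    words <- words[words != ""]
    c(nLines = length(lines),
      longestLine = max(nchar(lines)),
      shortestLine = min(nchar(lines)),
      medianLine = median(nchar(lines)),
      nWords = length(words),
      nUniqueWords = length(unique(words)))
}
summarizeFile("en_US.blogs.txt")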

3.- Data Cleaning

In this section we perform some basic cleaning of the data set. To accelerate the process of model building later on, we work on a reduced data set built by taking random samples of lines from the original text files. The cleaning includes: transforming characters to lowercase and removing numbers, URLs, email addresses, punctuation, words that contain special characters, and extra white spaces, plus some optional stemming (for testing purposes we save both a stemmed and an unstemmed version of the clean data).
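
A minimal sketch of this cleaning pipeline, assuming the tm package and an illustrative 1% random sample of lines (the file name, sample fraction, and seed are only examples, not the exact values used):

#sample and clean the text data (illustrative sketch using the tm package)
library(tm)
set.seed(1234)
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
sampleLines <- sample(lines, round(0.01 * length(lines)))   #~1% random sample of lines
corpus <- VCorpus(VectorSource(sampleLines))
corpus <- tm_map(corpus, content_transformer(tolower))      #lowercase
corpus <- tm_map(corpus, removeNumbers)                     #remove numbers
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://\\S+|www\\.\\S+")   #remove urls
corpus <- tm_map(corpus, toSpace, "\\S+@\\S+")                      #remove email addresses
corpus <- tm_map(corpus, removePunctuation)                 #remove punctuation
corpus <- tm_map(corpus, stripWhitespace)                   #remove extra white spaces
corpusStemmed <- tm_map(corpus, stemDocument)               #optional stemmed version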

4.- Basic statistics and visualization

To get a general “feeling” of the structure of the text data set, we plot here the frequencies of the words present in a randomly selected subset of the cleaned data. Only the most frequent words in the subset are plotted.
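
The data frame wf used below is assumed to hold the word frequencies of the sampled corpus (columns word and freq); a minimal sketch of how it could be built from the cleaned corpus:

#build the word-frequency data frame (illustrative sketch)
library(tm)
tdm <- TermDocumentMatrix(corpus)                          #corpus from the cleaning step
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   #total frequency of each word
wf <- data.frame(word = names(freq), freq = freq)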

#plot word frequencies (only words appearing more than 250 times in the sample)
library(ggplot2)
f <- ggplot(subset(wf, freq > 250), aes(word, freq))
f + geom_bar(stat = "identity")

From this bar plot we can see, perhaps without much surprise, that many of the frequent words belong to the group generally known as stop words: words that are common in every language but carry little information about the content of phrases and sentences. In the next visualization we instead present a word cloud of the most frequent words once the stop words have been removed. In this type of visualization, the size and color of each word are scaled according to its frequency.
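
The vector freq.noStop used below is assumed to hold the word frequencies after removing the English stop words, and pal2 a color palette; a possible way to build them:

#frequencies after stop word removal and color palette (illustrative sketch)
library(tm)
library(RColorBrewer)
corpusNoStop <- tm_map(corpus, removeWords, stopwords("english"))       #drop English stop words
tdmNoStop <- TermDocumentMatrix(corpusNoStop)
freq.noStop <- sort(rowSums(as.matrix(tdmNoStop)), decreasing = TRUE)
pal2 <- brewer.pal(8, "Dark2")                                          #palette for the word cloud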

#word cloud plot (freq.noStop and pal2 are built in the sketch above)
library(wordcloud)
wordcloud(names(freq.noStop), freq.noStop, min.freq = 3, scale = c(8, .2),
          max.words = Inf, random.order = FALSE, rot.per = .15, colors = pal2)

5.- Plans to build the prediction model

The prediction model we need to build should take a phrase as input and return as output the next most likely word for that phrase. To solve this problem we will use n-gram models. Besides selecting the general type of model, we will proceed by starting with a simple model (unigrams) on several versions of our data set (e.g., stemmed vs. unstemmed) and several sample sizes, trying to find the minimal size that still reproduces good predictions. Next we can optimize further using higher-order n-grams.
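
As a first illustration of the n-gram idea (not the final implementation; the function names and details here are made up for this sketch), a simple bigram-based predictor in base R could look as follows:

#simple bigram predictor (illustrative sketch, not the final model)
buildBigramTable <- function(lines) {
    words <- unlist(strsplit(tolower(lines), "\\s+"))
    words <- words[words != ""]
    bigrams <- paste(head(words, -1), tail(words, -1))   #note: this also joins words across line boundaries
    sort(table(bigrams), decreasing = TRUE)
}
predictNextWord <- function(phrase, bigramTable) {
    lastWord <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 1)
    hits <- bigramTable[startsWith(names(bigramTable), paste0(lastWord, " "))]
    if (length(hits) == 0) return(NA)
    substring(names(hits)[1], nchar(lastWord) + 2)       #most frequent continuation of the last word
}
#example use: bt <- buildBigramTable(sampleLines); predictNextWord("one of the", bt)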

Appendix

Expected values of some statistics in the original data

The expected value of some statistical properties of the original data set is easy to reproduce even by resampling only a small fraction of the data. This is the case for the median line length, which can be reproduced (see table 1) with ~600 sampled lines (~0.77% of the data).
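
As an illustration, assuming lines holds the text of one of the files and that the median refers to the line length in characters, the value from a small random sample can be compared against the full-data value:

#median line length from a small random sample (illustrative sketch)
set.seed(42)
median(nchar(sample(lines, 600)))   #random sample of ~600 lines
median(nchar(lines))                #full-data value for comparison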

On the other hand, if we want a sample that captures most of the unique words in our corpus, even taking 10000 lines (12% of the data) yields only 6% of the total unique words for this text file, and predictably many more lines would need to be sampled to capture most of them. We can get an idea of these numbers by estimating the growth rate of the number of unique words vs. the number of sampled lines in the text file.
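
A rough way to estimate this growth rate (again assuming lines holds the text of one file; the sample sizes are illustrative):

#growth of the number of unique words with the number of sampled lines (illustrative sketch)
sampleSizes <- c(1000, 2500, 5000, 10000)
uniqueWords <- sapply(sampleSizes, function(n) {
    s <- sample(lines, n)
    length(unique(unlist(strsplit(tolower(s), "\\s+"))))
})
plot(sampleSizes, uniqueWords, type = "b",
     xlab = "number of sampled lines", ylab = "number of unique words")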

For the purposes of this exploratory analysis we used a simplified (and small) sample of the original data, but to build the final model we will estimate the sample size required to represent more interesting statistics of the data, such as the n-gram frequencies.