1.- Overview

In this report we perform an exploratory analysis of a data set as part of the capstone project for the Data Science Coursera specialization (Johns Hopkins University). The general goal of the project is to build a model that, given an input phrase, predicts the next most likely word the user will write. The initial part of this project, and the focus of this report, is to explore the text data that will be used during the learning phase of the predictive model. To facilitate readability, we divide this report into the following sections: an initial summary of the data set, data cleaning, basic statistics and visualization, and plans to build the prediction model, followed by an appendix.

2.- Initial summary of the data set

For this project we use a data set that should be downloaded from a link provided on the Coursera website. Once unzipped, we can see that the text data we will work with consists of three text files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

In the following table we provide a summary of the raw (uncleaned) text data; the longest, shortest, and median line lengths are measured in characters.

                   Number of Lines  Longest Line  Shortest Line  Median Line  Number of Words  Number of Unique Words  Percent of Unique Words
en_US.blogs.txt             899288         40835              1          157         38636958                  405295                 1.048983
en_US.news.txt               77259          5760              2          186          2765729                   94146                 3.404021
en_US.twitter.txt          2360148           213              2           64         31256916                  452831                 1.448739
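
For reference, a minimal sketch (not the exact code used to build the table above) of how these raw statistics could be computed for one of the files:

#summary of one raw text file (illustrative sketch)
summarizeFile <- function(path) {
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
    words <- unlist(strsplit(lines, "\\s+"))
    words <- words[words != ""]
    c(nLines = length(lines),
      longestLine = max(nchar(lines)),
      shortestLine = min(nchar(lines)),
      medianLine = median(nchar(lines)),
      nWords = length(words),
      nUniqueWords = length(unique(words)))
}
summarizeFile("en_US.blogs.txt")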

3.- Data Cleaning

In this section we perform some basic cleaning of the data set. To accelerate the process of model building later on, we work on a reduced data set built by taking random samples of lines from the original text files. The cleaning includes: transforming characters to lowercase and removing numbers, URLs, email addresses, punctuation, words that contain special characters, and extra white spaces, plus some optional stemming (for testing purposes we save both a stemmed and an unstemmed version of the clean data).
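
A minimal sketch of this cleaning pipeline, assuming the tm package and an illustrative 1% random sample of lines (the file name, sample fraction, and seed are only examples, not the exact values used):

#sample and clean the text data (illustrative sketch using the tm package)
library(tm)
set.seed(1234)
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
sampleLines <- sample(lines, round(0.01 * length(lines)))   #~1% random sample of lines
corpus <- VCorpus(VectorSource(sampleLines))
corpus <- tm_map(corpus, content_transformer(tolower))      #lowercase
corpus <- tm_map(corpus, removeNumbers)                     #remove numbers
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://\\S+|www\\.\\S+")   #remove urls
corpus <- tm_map(corpus, toSpace, "\\S+@\\S+")                      #remove email addresses
corpus <- tm_map(corpus, removePunctuation)                 #remove punctuation
corpus <- tm_map(corpus, stripWhitespace)                   #remove extra white spaces
corpusStemmed <- tm_map(corpus, stemDocument)               #optional stemmed version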

4.- Basic statistics and visualization

To get a general “feeling” of the structure of the text data set, we plot here the frequencies of the words present in a randomly selected subset of the cleaned data. Only the most frequent words in the subset are plotted.
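
The data frame wf used below is assumed to hold the word frequencies of the sampled corpus (columns word and freq); a minimal sketch of how it could be built from the cleaned corpus:

#build the word-frequency data frame (illustrative sketch)
library(tm)
tdm <- TermDocumentMatrix(corpus)                          #corpus from the cleaning step
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   #total frequency of each word
wf <- data.frame(word = names(freq), freq = freq)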

#plot word frequencies (only words appearing more than 250 times in the sample)
library(ggplot2)
f <- ggplot(subset(wf, freq > 250), aes(word, freq))
f + geom_bar(stat = "identity")

From this bar plot we can see, perhaps without much surprise, that many of the frequent words belong to the group generally known as stop words: words that are common in every language but carry little information about the content of phrases and sentences. In the next visualization we instead present a word cloud of the most frequent words once the stop words have been removed. In this type of visualization, the size and color of each word are scaled according to its frequency.
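
The vector freq.noStop used below is assumed to hold the word frequencies after removing the English stop words, and pal2 a color palette; a possible way to build them:

#frequencies after stop word removal and color palette (illustrative sketch)
library(tm)
library(RColorBrewer)
corpusNoStop <- tm_map(corpus, removeWords, stopwords("english"))       #drop English stop words
tdmNoStop <- TermDocumentMatrix(corpusNoStop)
freq.noStop <- sort(rowSums(as.matrix(tdmNoStop)), decreasing = TRUE)
pal2 <- brewer.pal(8, "Dark2")                                          #palette for the word cloud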

#word cloud plot (freq.noStop and pal2 are built in the sketch above)
library(wordcloud)
wordcloud(names(freq.noStop), freq.noStop, min.freq = 3, scale = c(8, .2),
          max.words = Inf, random.order = FALSE, rot.per = .15, colors = pal2)

5.- Plans to build the prediction model

The prediction model we need to build should take a phrase as input and return as output the next most likely word for that phrase. To solve this problem we will use n-gram models. Besides selecting the general type of model, we will proceed by starting with a simple model (unigrams) on several versions of our data set (e.g., stemmed vs. unstemmed) and several sample sizes, trying to find the minimal size that still reproduces good predictions. Next we can optimize further using higher-order n-grams.
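
As a first illustration of the n-gram idea (not the final implementation; the function names and details here are made up for this sketch), a simple bigram-based predictor in base R could look as follows:

#simple bigram predictor (illustrative sketch, not the final model)
buildBigramTable <- function(lines) {
    words <- unlist(strsplit(tolower(lines), "\\s+"))
    words <- words[words != ""]
    bigrams <- paste(head(words, -1), tail(words, -1))   #note: this also joins words across line boundaries
    sort(table(bigrams), decreasing = TRUE)
}
predictNextWord <- function(phrase, bigramTable) {
    lastWord <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 1)
    hits <- bigramTable[startsWith(names(bigramTable), paste0(lastWord, " "))]
    if (length(hits) == 0) return(NA)
    substring(names(hits)[1], nchar(lastWord) + 2)       #most frequent continuation of the last word
}
#example use: bt <- buildBigramTable(sampleLines); predictNextWord("one of the", bt)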

Appendix

Expected values of some statistics in the original data

The expected value of some statistical properties of the original data set is easy to reproduce even by resampling only a small fraction of the data. This is the case for the median line length, which can be reproduced (see table 1) with ~600 sampled lines (~0.77% of the data).
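
As an illustration, assuming lines holds the text of one of the files and that the median refers to the line length in characters, the value from a small random sample can be compared against the full-data value:

#median line length from a small random sample (illustrative sketch)
set.seed(42)
median(nchar(sample(lines, 600)))   #random sample of ~600 lines
median(nchar(lines))                #full-data value for comparison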

On the other hand, if we want a sample that captures most of the unique words in our corpus, even taking 10000 lines (12% of the data) yields only 6% of the total unique words for this text file, and predictably many more lines would need to be sampled to capture most of them. We can get an idea of these numbers by estimating the growth rate of the number of unique words vs. the number of sampled lines in the text file.
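
A rough way to estimate this growth rate (again assuming lines holds the text of one file; the sample sizes are illustrative):

#growth of the number of unique words with the number of sampled lines (illustrative sketch)
sampleSizes <- c(1000, 2500, 5000, 10000)
uniqueWords <- sapply(sampleSizes, function(n) {
    s <- sample(lines, n)
    length(unique(unlist(strsplit(tolower(s), "\\s+"))))
})
plot(sampleSizes, uniqueWords, type = "b",
     xlab = "number of sampled lines", ylab = "number of unique words")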

For the purposes of this exploratory analysis we used a simplified (and small) sample of the original data, but to build the final model we will estimate the sample size required to represent more interesting statistics of the data, such as the n-gram frequencies.