Executive Summary

This document provides an exploratory analysis of the English corpus provided as part of the capstone course of the Johns Hopkins/Coursera Data Science Specialization.

Reading the data

There are three corpora from different sources: blogs, news and Twitter. The Twitter corpus is larger (in number of lines) than the others, but its lines are shorter. The news and blogs files have roughly the same number of lines.
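
As a reference, a minimal sketch of how the three files might be read is shown below. The directory layout (`final/en_US/`) and the variable names are assumptions for illustration, not necessarily the exact code used for this report.

```r
# Hypothetical paths: adjust to wherever the corpus files were unpacked.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8")
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8")

# Number of lines in each corpus (reported in the table below).
sapply(list(blogs = blogs, news = news, twitter = twitter), length)
```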

The table below shows, for each dataset, its number of lines and its size.

| Origin  | Number of lines | Size (MB) |
|:--------|----------------:|----------:|
| blogs   | 899288          | 248.5     |
| news    | 766277          | 189.4     |
| twitter | 2360148         | 301.4     |

The plot below shows the mean number of words from each source.

[Figure: mean number of words per source]
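
As an illustration, the statistic behind this plot can be computed roughly as below, assuming "mean number of words" means the whitespace-separated word count per line; the variables `blogs`, `news` and `twitter` come from the reading sketch above.

```r
# Rough word count per line: split on whitespace and count the tokens.
mean_words_per_line <- function(lines) {
  mean(sapply(strsplit(lines, "\\s+"), length))
}
sapply(list(blogs = blogs, news = news, twitter = twitter), mean_words_per_line)
```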

Data Pre-processing

The data pre-processing involved three steps, described below; a sketch of the corresponding tm calls follows the list.

  1. Removing punctuation

In this step we remove punctuation to normalize the text and strip out strange characters. In some contexts special characters can be useful, such as hashtags in tweets, but for our purpose they only wreak havoc on tokenization.

  2. Profanity Filtering

Profanities are removed from the data since they could cause discomfort if suggested by a text predictor.

  3. Word normalization

This step also serves tokenization. All characters are converted to lowercase and some spelling correction is applied. The goal is to transform the data so that words with exactly the same meaning are easy to identify. We have not stemmed the words, though, because words sharing a stem can differ considerably in meaning: a word predictor must offer the exact word a user wants, not only its stem. Stopwords are not removed either; they are a central part of the vocabulary and must be kept.
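
The sketch below shows one way these three steps could be expressed with the tm package loaded above. It assumes a recent tm version and a `profanities` character vector loaded from an external word list; spelling correction is omitted. It is an illustration of the approach, not the exact code behind this report.

```r
library(tm)

# `lines` is a character vector of raw text; `profanities` is a word list
# loaded elsewhere (both assumed for this sketch).
preprocess <- function(lines, profanities) {
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, content_transformer(tolower))  # 3. word normalization
  corpus <- tm_map(corpus, removePunctuation)             # 1. remove punctuation
  corpus <- tm_map(corpus, removeWords, profanities)      # 2. profanity filtering
  corpus <- tm_map(corpus, stripWhitespace)               # tidy up leftover spaces
  corpus
}
```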

Visualization

A closer look at the data gives better insight into the structure of each corpus. The data from each corpus is first processed as described above.
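
The wordclouds below are built from term frequencies. A minimal sketch of this kind of plot, using the wordcloud package and the hypothetical `preprocess()` helper from the previous section, might look like this.

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

# Build a term-document matrix from a cleaned corpus, sum the term
# frequencies, and plot the most frequent terms as a wordcloud.
plot_wordcloud <- function(corpus, max_words = 100) {
  tdm   <- TermDocumentMatrix(corpus)
  freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  wordcloud(names(freqs), freqs, max.words = max_words,
            random.order = FALSE, colors = brewer.pal(8, "Dark2"))
}

# Example: a wordcloud for a sample of the Twitter corpus (sampling keeps the
# term-document matrix small enough for memory).
# plot_wordcloud(preprocess(sample(twitter, 10000), profanities))
```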

Twitter Wordcloud

[Figure: Twitter wordcloud]

News Wordcloud

[Figure: News wordcloud]

Blogs Wordcloud

[Figure: Blogs wordcloud]

Conclusion

After this exploration I built an n-gram model from a large text sample that mixes all three corpora. Unigrams, bigrams, trigrams and 4-grams were extracted, and term-document matrices were built from them to feed a Markov model. This model is still in development and will be presented in a later report.
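
For reference, the n-gram counting step can be sketched in base R as below. This is an illustrative helper (`count_ngrams` and `cleaned_lines` are hypothetical names), not the exact pipeline that feeds the Markov model.

```r
# Slide a window of n words over each cleaned line and tabulate the n-grams.
count_ngrams <- function(lines, n = 2) {
  grams <- unlist(lapply(strsplit(lines, "\\s+"), function(words) {
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

# e.g. the ten most frequent bigrams in a (hypothetical) cleaned sample:
# head(count_ngrams(cleaned_lines, n = 2), 10)
```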