This document provides an exploratory analysis of the English corpus provided as part of the capstone course of the Johns Hopkins/Coursera Data Science Specialization.
There are three corpora from different sources: blogs, news and Twitter. The Twitter corpus has more lines than the others, but its lines are shorter. The news and blogs files have roughly the same number of lines.
The table below shows the number of lines and the size of each dataset.

| Origin  | Number of Lines | Size (MB) |
|:--------|----------------:|----------:|
| blogs   | 899288          | 248.5     |
| news    | 766277          | 189.4     |
| twitter | 2360148         | 301.4     |
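These counts could be reproduced with a sketch like the one below (not the exact code used for this report). It assumes the raw Coursera/SwiftKey files `en_US.blogs.txt`, `en_US.news.txt` and `en_US.twitter.txt` sit in the working directory, and it reports the in-memory size of each character vector in megabytes.

```r
# Hypothetical file names for the raw corpora; adjust paths as needed.
files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

# Read each file as a character vector of lines.
lines <- lapply(files, readLines, encoding = "UTF-8", warn = FALSE)

knitr::kable(data.frame(
  Origin          = names(files),
  Number.of.Lines = sapply(lines, length),
  Size.MB         = sapply(lines, function(x) round(as.numeric(object.size(x)) / 2^20, 1))
))
```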
The plot below shows the mean number of words per line for each source.
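The figures behind such a plot could be computed roughly as follows. This is a sketch that reuses the hypothetical `lines` list from the previous snippet and approximates a word as a whitespace-separated token.

```r
# Mean number of words per line for each source (sampling keeps it fast on large corpora).
mean_words <- sapply(lines, function(src) {
  sampled <- sample(src, min(length(src), 50000))
  mean(sapply(strsplit(sampled, "\\s+"), length))
})

barplot(mean_words, ylab = "Mean words per line",
        main = "Mean number of words per line by source")
```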
The data pre-processing involved three steps, described below; a single code sketch covering all three follows the last step.
In this step we remove punctuation to normalize the text and strip out strange characters. In some contexts special characters can be useful, such as hashtags in tweets, but for our purposes they only wreak havoc on tokenization.
Profanities are removed from the data since they could cause discomfort in a text predictor.
This step also serves tokenization. All characters are converted to lower case and some spelling correction is applied. We try to transform the data so that words with exactly the same meaning are easy to identify. We have not stemmed the words, though, because words sharing a stem can differ a lot in meaning: a word predictor must acknowledge the exact word a user wants, not only its stem. Stopwords are not removed either; they are a central part of the vocabulary and must be kept.
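A minimal sketch of the three steps using the tm package is shown below. It is not the exact code used for this report, and `sample_text` (the raw lines) and `profanity_list` (the banned words) are hypothetical placeholders.

```r
library(tm)

# `sample_text` and `profanity_list` are hypothetical placeholders, not defined in this report.
corpus <- VCorpus(VectorSource(sample_text))

corpus <- tm_map(corpus, removePunctuation)             # step 1: drop punctuation and special characters
corpus <- tm_map(corpus, removeWords, profanity_list)   # step 2: filter profanities
corpus <- tm_map(corpus, content_transformer(tolower))  # step 3: normalize case
corpus <- tm_map(corpus, stripWhitespace)               # tidy up extra spaces left behind
# Note: no stemming and no stopword removal, as discussed above.
```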
A look at the data can give better insight into the structure of each corpus. The data from each corpus is first processed as described above.
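As an illustration of what such an exploration might look like (the wordcloud package loaded for this report suggests this route), the sketch below counts term frequencies in a cleaned corpus and draws a word cloud. Here `corpus` stands for the cleaned corpus from the previous sketch, built from a modest sample so the matrix fits in memory.

```r
library(tm)
library(wordcloud)

# Term frequencies over the cleaned sample corpus.
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

head(freq, 20)                                            # most frequent terms
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
```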
After the exploration I built an n-gram model from a large text sample mixing all the corpora. Unigrams, bigrams, trigrams and 4-grams were generated, and Term-Document Matrices were built to feed a Markov model. This model is still in development and will be presented later.
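The exact tokenizer is not shown in this report; one common way to build such n-gram Term-Document Matrices is RWeka's NGramTokenizer plugged into tm, sketched below under the assumption that `corpus` is the cleaned, mixed sample (RWeka itself is not loaded above).

```r
library(tm)
library(RWeka)

# n-gram tokenizers for tm's TermDocumentMatrix.
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm_1 <- TermDocumentMatrix(corpus)                                              # unigrams
tdm_2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer)) # bigrams
tdm_3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))# trigrams

# n-gram frequencies, e.g. for estimating P(w_n | w_1 ... w_{n-1}) in a Markov model.
bigram_freq <- sort(rowSums(as.matrix(tdm_2)), decreasing = TRUE)
head(bigram_freq, 10)
```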