This report describes how the data is structured and documents the work done during exploratory analysis. The end goal of this analysis is to build a word prediction application (via Shiny).
The provided datasets consist of text from three sources: news, blogs, and Twitter.
Below are some statistics taken from the data. For clarity, the number of words in a line is derived simply as the number of non-word character separators plus one; a small sketch of this counting rule follows the table.
| data_source | num_line | num_words |
|---|---|---|
| blog | 899288 | 39386844 |
| news | 77258 | 2837474 |
| twitter | 2360148 | 32874008 |
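One way to implement that counting rule is sketched below (treating each run of non-word characters as a single separator; the function name is illustrative):

```r
# Rough word count: number of runs of non-word characters per line, plus one.
count_words <- function(lines) {
  sapply(gregexpr("\\W+", lines), function(m) sum(m > 0)) + 1
}

count_words("the quick brown fox")   # 3 separators + 1 = 4
```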
Based on these statistics, we have determined that the full dataset is too large for the current hardware to handle. Thus, we will need to sample the data for our analysis.
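The sampling could be done along the following lines (a sketch; the file paths, sample size, and seed are illustrative):

```r
set.seed(1234)   # fix the seed so the sample is reproducible

sample_lines <- function(path, n = 20000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

blog_sample    <- sample_lines("final/en_US/en_US.blogs.txt")
news_sample    <- sample_lines("final/en_US/en_US.news.txt")
twitter_sample <- sample_lines("final/en_US/en_US.twitter.txt")
```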
After sampling, below are the stats for comparison with the originals above:
| data_source | num_line | num_words |
|---|---|---|
| blog | 20000 | 878150 |
| news | 20000 | 732262 |
| twitter | 20000 | 278501 |
Now that we've sampled the data, we can begin our data exploration. Pre-processing of the text is done using the quanteda package. The transformations applied include lowercasing all text, removing numbers, punctuation, separators, and Twitter handles and mentions, and ignoring English stopwords. The words are not stemmed, however.
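A sketch of this pre-processing using quanteda's tokens()/dfm() workflow follows (argument names differ across quanteda versions, and the *_sample objects are the hypothetical ones from the sampling sketch above):

```r
library(quanteda)

build_dfm <- function(txt) {
  toks <- tokens(txt,
                 remove_numbers    = TRUE,
                 remove_punct      = TRUE,
                 remove_separators = TRUE,
                 remove_symbols    = TRUE)
  toks <- tokens_tolower(toks)
  # drop English stopwords plus Twitter handles and hashtags (glob patterns)
  toks <- tokens_remove(toks, pattern = c(stopwords("en"), "@*", "#*"))
  dfm(toks)   # no stemming is applied
}

dfm_blog    <- build_dfm(blog_sample)
dfm_news    <- build_dfm(news_sample)
dfm_twitter <- build_dfm(twitter_sample)
```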
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 10,000 documents
## ... indexing features: 32,978 feature types
## ... removed 170 features, from 174 supplied (glob) feature types
## ... created a 10000 x 32808 sparse dfm
## ... complete.
## Elapsed time: 1.22 seconds.
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 10,000 documents
## ... indexing features: 32,308 feature types
## ... removed 168 features, from 174 supplied (glob) feature types
## ... created a 10000 x 32140 sparse dfm
## ... complete.
## Elapsed time: 0.56 seconds.
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 10,000 documents
## ... indexing features: 15,522 feature types
## ... removed 167 features, from 174 supplied (glob) feature types
## ... created a 10000 x 15355 sparse dfm
## ... complete.
## Elapsed time: 0.39 seconds.
We now look at the term frequencies for the top 20 words from each source.
(Figures: top-20 term frequency plots for the Blog, News, and Twitter samples.)
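These plots can be generated from the dfm objects along these lines (a sketch; topfeatures() returns the most frequent terms of a dfm, and the plotting details are illustrative):

```r
plot_top_terms <- function(d, title, n = 20) {
  freqs <- topfeatures(d, n)   # named vector: term -> frequency
  barplot(rev(freqs), horiz = TRUE, las = 1, main = title, xlab = "Frequency")
}

plot_top_terms(dfm_blog,    "Top 20 words - blog")
plot_top_terms(dfm_news,    "Top 20 words - news")
plot_top_terms(dfm_twitter, "Top 20 words - twitter")
```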
When building the n-grams, some of the earlier pre-processing choices were relaxed: only punctuation removal and lowercasing are applied, while numbers, stopwords, and other tokens are retained in order to preserve the context of what was meant in each text. A sketch of this step is shown below.
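A sketch of how the bigram and trigram document-feature matrices might be built (object names follow the earlier sketches):

```r
build_ngram_dfm <- function(txt, n) {
  toks <- tokens(txt, remove_punct = TRUE)   # keep numbers, stopwords, etc.
  toks <- tokens_tolower(toks)
  dfm(tokens_ngrams(toks, n = n))
}

dfm_blog_2g <- build_ngram_dfm(blog_sample, 2)
dfm_blog_3g <- build_ngram_dfm(blog_sample, 3)
# ... and likewise for the news and twitter samples
```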
Below are a few graphs highlighting the frequencies of the 2-grams and 3-grams from each data source.
##
## ... indexing documents: 10,000 documents
## ... indexing features: 73,686 feature types
## ... created a 10000 x 73686 sparse dfm
## ... complete.
## Elapsed time: 0.08 seconds.
##
## ... indexing documents: 10,000 documents
## ... indexing features: 99,418 feature types
## ... created a 10000 x 99418 sparse dfm
## ... complete.
## Elapsed time: 0.08 seconds.
##
## ... indexing documents: 10,000 documents
## ... indexing features: 197,939 feature types
## ... created a 10000 x 197939 sparse dfm
## ... complete.
## Elapsed time: 0.32 seconds.
##
## ... indexing documents: 10,000 documents
## ... indexing features: 297,883 feature types
## ... created a 10000 x 297883 sparse dfm
## ... complete.
## Elapsed time: 0.47 seconds.
##
## ... indexing documents: 10,000 documents
## ... indexing features: 209,128 feature types
## ... created a 10000 x 209128 sparse dfm
## ... complete.
## Elapsed time: 0.61 seconds.
##
## ... indexing documents: 10,000 documents
## ... indexing features: 346,072 feature types
## ... created a 10000 x 346072 sparse dfm
## ... complete.
## Elapsed time: 0.67 seconds.
For the language model, I am planning to start with a simple Stupid Backoff model. The reason for this choice is to keep the amount of processing required for prediction to a minimum.
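For illustration, Stupid Backoff scores a candidate word using the relative frequency of the highest-order n-gram that was observed, backing off to shorter contexts with a fixed penalty (commonly 0.4). A rough sketch, assuming hypothetical pre-computed count tables keyed by space-joined n-grams:

```r
# Rough sketch of Stupid Backoff scoring (Brants et al., 2007). Assumes a list
# `ngram_counts` where ngram_counts[[n]] is a named numeric vector of n-gram
# counts keyed by the space-joined n-gram (these tables are hypothetical here).
stupid_backoff <- function(context, word, ngram_counts, alpha = 0.4) {
  n <- length(context) + 1
  if (n >= 2) {
    num <- ngram_counts[[n]][paste(c(context, word), collapse = " ")]
    den <- ngram_counts[[n - 1]][paste(context, collapse = " ")]
    if (!is.na(num) && !is.na(den) && den > 0) {
      return(num / den)   # observed n-gram: use its relative frequency
    }
    # unseen n-gram: back off to a shorter context with a fixed penalty
    return(alpha * stupid_backoff(context[-1], word, ngram_counts, alpha))
  }
  # unigram base case
  unigrams <- ngram_counts[[1]]
  score <- unigrams[word] / sum(unigrams)
  ifelse(is.na(score), 0, score)
}
```

Since these scores are not normalized probabilities, they are used only to rank candidate next words.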
A key challenge I expect going forward is the accuracy of the model, given that it will be built on a relatively small sample of the data.