This report explains the exploratory analysis and the goals for the eventual app and algorithm. It covers only the major features of the data and briefly summarizes my plans for creating the prediction algorithm and Shiny app.
library(NLP)        # NLP infrastructure required by tm
library(tm)         # text mining: corpus handling and term-document matrices
library(RWeka)      # n-gram tokenizers
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(stringi)    # fast string operations (word counts)
library(wordcloud)  # word cloud visualization
Basic summaries
| source | word counts | line counts |
|---|---|---|
| twitter | 30,373,832 | 2,360,148 |
| blog | 37,334,441 | 899,288 |
| news | 2,643,972 | 77,259 |
Sample summaries
| source | word counts | line counts | sample size |
|---|---|---|---|
| twitter | 3,040,137 | 236,014 | 10% |
| blog | 3,712,352 | 89,928 | 10% |
| news | 528,193 | 15,451 | 20% |
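The counts above were presumably computed with stringi (loaded above), and the samples drawn line by line. A minimal sketch of this step, assuming the raw files live under `final/en_US/` (the path and the exact sampling method are assumptions on my part):

```r
library(stringi)

# Hypothetical path; adjust to where the raw twitter file actually lives.
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

line_count <- length(twitter)                  # "line counts" column
word_count <- sum(stri_count_words(twitter))   # "word counts" column

# Keep roughly 10% of the lines for the exploratory sample.
set.seed(42)
twitter_sample <- twitter[as.logical(rbinom(line_count, 1, 0.1))]
```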
Based on a review of each of the three data sources, I decided to carry out the
following cleansing transformations (sketched in code after the list):
- remove non-ASCII characters
- change to lowercase
- remove punctuation
- remove numbers
- remove extra whitespace
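A minimal sketch of these transformations using tm, assuming `docs` is a corpus built from the sampled text (e.g. `VCorpus(VectorSource(...))`):

```r
library(tm)

clean_corpus <- function(docs) {
  # remove non-ASCII characters
  docs <- tm_map(docs, content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = "")))
  docs <- tm_map(docs, content_transformer(tolower))  # change to lowercase
  docs <- tm_map(docs, removePunctuation)             # remove punctuation
  docs <- tm_map(docs, removeNumbers)                 # remove numbers
  docs <- tm_map(docs, stripWhitespace)               # remove extra whitespace
  docs
}
```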
I decided not to remove stopwords, so as not to lower the prediction power of
the algorithm. As for bad words, I decided to mask them at runtime, in order to
comply with the requirements without losing prediction power (a sketch of such masking follows).
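A minimal sketch of runtime masking, assuming `bad_words` is a character vector loaded from a profanity list (not part of this report):

```r
mask_bad_words <- function(text, bad_words) {
  # Build one alternation pattern and replace whole-word matches with asterisks.
  pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  gsub(pattern, "****", text, ignore.case = TRUE)
}

# Example: mask_bad_words("this is a darn example", c("darn"))
# returns "this is a **** example"
```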
Additional transformations that could be helpful are deferred to a later
phase of the project:
- splitting lines that contain multiple sentences
- removing garbage tokens
- fixing misspelled words, where possible
Of the three data sources, twitter has the lowest quality: it contains a lot of noise.
However, this noise appears to be low-frequency, so its influence should be negligible.
On the other hand, news is the most accurate source, containing sentences in
proper English.
In the end I decided to use all three sources, in order to cover as much ground as possible
and increase the potential prediction power of the model.
Some words are more frequent than others. At this phase I used the 'tm' and 'RWeka' packages
to calculate the frequency distributions of key phrases, including 2-grams and 3-grams. This
information will be the foundation for predicting the next word.
Building a Term-Document Matrix
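A sketch of how a matrix like the one summarized below can be built, assuming `docs` is the cleaned three-document corpus (one document per source); the 2-gram and 3-gram variants plug RWeka's NGramTokenizer into tm:

```r
library(tm)
library(RWeka)

# Unigram term-document matrix over the cleaned corpus.
tdm <- TermDocumentMatrix(docs)

# For 2-grams and 3-grams, pass an RWeka tokenizer via the control list.
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm2 <- TermDocumentMatrix(docs, control = list(tokenize = bigram_tokenizer))
tdm3 <- TermDocumentMatrix(docs, control = list(tokenize = trigram_tokenizer))
```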
## <<TermDocumentMatrix (terms: 177361, documents: 3)>>
## Non-/sparse entries: 247608/284475
## Sparsity : 53%
## Maximal term length: 120
## Weighting : term frequency (tf)
Top 50 most common words
ggplot(head(term.freq,50), aes(x=term, y=freq)) + geom_bar(stat="identity") +
xlab("Terms") + ylab("Count") + coord_flip() + ggtitle("Top 50 most common words")
2-gram word cloud
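The cloud itself is rendered with the wordcloud package; a minimal sketch, assuming a 2-gram frequency table `bigram.freq` with the same term/freq columns as `term.freq`:

```r
library(wordcloud)
library(RColorBrewer)

set.seed(1234)  # reproducible layout
wordcloud(words = bigram.freq$term, freq = bigram.freq$freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```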
Top ten 3-grams
## term freq
## 1 thanks for the 2393
## 2 one of the 2215
## 3 a lot of 1971
## 4 to be a 1317
## 5 going to be 1282
## 6 i want to 1272
## 7 i have a 1090
## 8 looking forward to 1046
## 9 it was a 1042
## 10 thank you for 1038
Plans for the prediction algorithm and Shiny app
- Incorporate the additional cleansing transformations mentioned earlier.
- Aim for a response time of about 2 seconds in the Shiny app, even at the expense of some accuracy.
- Correctly predict the majority of next words, and minimize false positives as much as
possible, since in my own experience they are quite an annoying phenomenon.