This report explains the exploratory analysis and the goals for the eventual app and algorithm. It covers only the major features of the data and briefly summarizes my plans for creating the prediction algorithm and Shiny app.
library(NLP)        # NLP infrastructure required by tm
library(tm)         # text mining: corpus handling and term-document matrices
library(RWeka)      # n-gram tokenizers
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(stringi)    # fast string operations (word counts)
library(wordcloud)  # word cloud visualization
Basic summaries
| source | word counts | line counts |
|---|---|---|
| twitter | 30,373,832 | 2,360,148 |
| blog | 37,334,441 | 899,288 |
| news | 2,643,972 | 77,259 |
Sample summaries
| source | word counts | line counts | sample size |
|---|---|---|---|
| twitter | 3,040,137 | 236,014 | 10% |
| blog | 3,712,352 | 89,928 | 10% |
| news | 528,193 | 15,451 | 20% |
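The counts above were presumably computed with stringi (loaded above), and the samples drawn line by line. A minimal sketch of this step, assuming the raw files live under `final/en_US/` (the path and the exact sampling method are assumptions on my part):

```r
library(stringi)

# Hypothetical path; adjust to where the raw twitter file actually lives.
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

line_count <- length(twitter)                  # "line counts" column
word_count <- sum(stri_count_words(twitter))   # "word counts" column

# Keep roughly 10% of the lines for the exploratory sample.
set.seed(42)
twitter_sample <- twitter[as.logical(rbinom(line_count, 1, 0.1))]
```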
Based on a review of each of the three data sources, I decided to carry out the
following cleansing transformations (sketched in code after the list):
- remove non-ASCII characters
- change to lowercase
- remove punctuation
- remove numbers
- remove extra whitespace
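A minimal sketch of these transformations using tm, assuming `docs` is a corpus built from the sampled text (e.g. `VCorpus(VectorSource(...))`):

```r
library(tm)

clean_corpus <- function(docs) {
  # remove non-ASCII characters
  docs <- tm_map(docs, content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = "")))
  docs <- tm_map(docs, content_transformer(tolower))  # change to lowercase
  docs <- tm_map(docs, removePunctuation)             # remove punctuation
  docs <- tm_map(docs, removeNumbers)                 # remove numbers
  docs <- tm_map(docs, stripWhitespace)               # remove extra whitespace
  docs
}
```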
I decided not to remove stopwords, so as not to lower the prediction power of
the algorithm. As for bad words, I decided to mask them at runtime, in order to
comply with the requirements without losing prediction power (a sketch of such masking follows).
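A minimal sketch of runtime masking, assuming `bad_words` is a character vector loaded from a profanity list (not part of this report):

```r
mask_bad_words <- function(text, bad_words) {
  # Build one alternation pattern and replace whole-word matches with asterisks.
  pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  gsub(pattern, "****", text, ignore.case = TRUE)
}

# Example: mask_bad_words("this is a darn example", c("darn"))
# returns "this is a **** example"
```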
Additional transformations that could be helpful are deferred to a later
phase of the project:
- splitting lines that contain multiple sentences
- removing garbage tokens
- fixing misspelled words, where possible
Of the three data sources, twitter has the lowest quality: it contains a lot of noise.
However, this noise appears to be low-frequency, so its influence should be negligible.
On the other hand, news is the most accurate source, containing sentences in
proper English.
In the end I decided to use all three sources, in order to cover as much ground as possible
and increase the potential prediction power of the model.
Some words are more frequent than others. At this phase I used the 'tm' and 'RWeka' packages
to calculate the frequency distributions of key phrases, including 2-grams and 3-grams. This
information will be the foundation for predicting the next word.
Building a Term-Document Matrix
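A sketch of how a matrix like the one summarized below can be built, assuming `docs` is the cleaned three-document corpus (one document per source); the 2-gram and 3-gram variants plug RWeka's NGramTokenizer into tm:

```r
library(tm)
library(RWeka)

# Unigram term-document matrix over the cleaned corpus.
tdm <- TermDocumentMatrix(docs)

# For 2-grams and 3-grams, pass an RWeka tokenizer via the control list.
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm2 <- TermDocumentMatrix(docs, control = list(tokenize = bigram_tokenizer))
tdm3 <- TermDocumentMatrix(docs, control = list(tokenize = trigram_tokenizer))
```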
## <<TermDocumentMatrix (terms: 177361, documents: 3)>>
## Non-/sparse entries: 247608/284475
## Sparsity : 53%
## Maximal term length: 120
## Weighting : term frequency (tf)
Top 50 most common words
ggplot(head(term.freq,50), aes(x=term, y=freq)) + geom_bar(stat="identity") +
xlab("Terms") + ylab("Count") + coord_flip() + ggtitle("Top 50 most common words")
2-gram word cloud
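The cloud itself is rendered with the wordcloud package; a minimal sketch, assuming a 2-gram frequency table `bigram.freq` with the same term/freq columns as `term.freq`:

```r
library(wordcloud)
library(RColorBrewer)

set.seed(1234)  # reproducible layout
wordcloud(words = bigram.freq$term, freq = bigram.freq$freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```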
Top ten 3-grams
## term freq
## 1 thanks for the 2393
## 2 one of the 2215
## 3 a lot of 1971
## 4 to be a 1317
## 5 going to be 1282
## 6 i want to 1272
## 7 i have a 1090
## 8 looking forward to 1046
## 9 it was a 1042
## 10 thank you for 1038
Plans for the prediction algorithm and Shiny app
- Incorporate the additional cleansing transformations mentioned earlier.
- Aim for a response time of about 2 seconds in the Shiny app, even at the expense of some accuracy.
- Correctly predict the majority of next words, and minimize false positives as much as
possible, since in my own experience they are quite an annoying phenomenon.