The goal of this report is to describe the exploratory analysis of the data and the plans for the eventual app and prediction algorithm. It briefly summarizes the preprocessing pipeline and the candidate algorithm, and gives basic summary statistics about the data set.
To get started with the Data Science Capstone Project, I downloaded the Coursera SwiftKey dataset. After extraction, I chose to work with the en_US folder, which contains the following three files:
## Warning in readLines("./final/en_US/en_US.news.txt", skipNul = T): incomplete
## final line found on './final/en_US/en_US.news.txt'
##                file  size num.words num.lines
## 1   en_US.blogs.txt 200Mb  38154238    899288
## 2    en_US.news.txt 196Mb   2693898     77259
## 3 en_US.twitter.txt 160Mb  30218166   2360148
Note: the readLines warning above means en_US.news.txt was not read in completely, so its word and line counts are understated.
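For reference, here is a minimal sketch of how such a summary table can be assembled, assuming the files sit under ./final/en_US/ as in the warning above; the word count is a simple whitespace split, so exact numbers may differ:

```r
# Hedged sketch: summarise size, word count, and line count of each file.
files <- list.files("./final/en_US", pattern = "\\.txt$", full.names = TRUE)

summarise_file <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  data.frame(
    file      = basename(path),
    size      = sprintf("%.0fMb", file.size(path) / 1024^2),
    num.words = sum(lengths(strsplit(lines, "\\s+"))),
    num.lines = length(lines)
  )
}

do.call(rbind, lapply(files, summarise_file))
```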
Since the data set is quite large, preprocessing it in a single pass would be very memory intensive. So I divided the corpus into 25 smaller data sets that together constitute the entire corpus. All the preprocessing and tokenization steps are performed on these files, and the results are then compiled into one major data table.
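A minimal sketch of this chunking step, assuming the combined corpus has already been read into a character vector `lines` (the path and chunk names are illustrative):

```r
# Split the corpus into 25 roughly equal chunks and save each one separately,
# so later steps never hold the full corpus in memory at once.
dir.create("./chunks", showWarnings = FALSE)
chunk_id <- cut(seq_along(lines), breaks = 25, labels = FALSE)
chunks   <- split(lines, chunk_id)
for (i in seq_along(chunks)) {
  saveRDS(chunks[[i]], sprintf("./chunks/chunk_%02d.rds", i))
}
```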
The following preprocessing steps were performed, in order (a sketch of this pipeline follows the note below):
1. Removing URLs.
2. Removing symbols and non-ASCII characters.
3. Removing hashtags and punctuation.
4. Removing numbers.
5. Removing profanity.
Note: I haven’t removed stop words, because in this project stop words can be useful for predicting the next word.
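A minimal sketch of the cleaning pipeline in base R, assuming `lines` is one chunk of the corpus and `profanity` is a character vector of words to remove (both names are illustrative):

```r
clean_text <- function(lines, profanity) {
  lines <- gsub("(https?://|www\\.)\\S+", " ", lines)   # 1. URLs
  lines <- iconv(lines, "UTF-8", "ASCII", sub = " ")    # 2. symbols / non-ASCII
  lines <- gsub("#\\S+", " ", lines)                    # 3. hashtags
  lines <- gsub("[[:punct:]]+", " ", lines)             # 3. punctuation
  lines <- gsub("[0-9]+", " ", lines)                   # 4. numbers
  bad   <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  lines <- gsub(bad, " ", lines, ignore.case = TRUE)    # 5. profanity
  gsub("\\s+", " ", trimws(lines))                      # tidy up whitespace
}
```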
After all the preprocessing and cleaning steps, N-gram tokens were built (N = 2, 3, 4, 5) to capture the position of words relative to one another.
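A hedged sketch of how one 3-gram table can be built with quanteda and data.table, assuming `clean_lines` is a cleaned chunk from the step above (the variable names are illustrative):

```r
library(quanteda)
library(data.table)

toks   <- tokens(clean_lines)                       # word tokens
grams3 <- tokens_ngrams(toks, n = 3, concatenator = "_")
freqs  <- colSums(dfm(grams3))                      # 3-gram frequencies

dt <- data.table(feature = names(freqs), freq = as.integer(freqs))
dt[, pred := sub("^.*_", "", feature)]              # last word: the prediction
dt[, base := sub("_[^_]+$", "", feature)]           # leading words: the context
dt <- dt[freq >= 4]                                 # prune rare n-grams (see note below)
setorder(dt, -freq)
```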
Here’s a sample of the 3-gram data table:
## feature freq pred base
## 1 one_of_the 19716 the one_of
## 2 a_lot_of 19015 of a_lot
## 3 thanks_for_the 13985 the thanks_for
## 4 to_be_a 13145 a to_be
## 5 going_to_be 12560 be going_to
## 6 i_want_to 11406 to i_want
Note that the N-gram tables above were pruned: only N-grams with a frequency of at least four were kept.
For the predictive model, I am considering the Stupid Backoff model. It is a simple model and very fast relative to other algorithms, though it may be somewhat less accurate. The algorithm simply scores the probability of seeing a particular word given the previous set of words.
I will deploy the 5-gram model to predict the next word; however, if fewer than 4 words have been entered, the model backs off to the lower-order N-gram tables.
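A hedged sketch of this backoff logic, assuming `tables` is a list where `tables[[k]]` holds the k-gram data.table (k = 2..5) with columns base, pred, and freq, and `unigrams` holds single-word frequencies; `predict_next()` is an illustrative name and 0.4 is the usual Stupid Backoff penalty:

```r
library(data.table)

predict_next <- function(history, tables, unigrams, n = 3, alpha = 0.4) {
  words <- strsplit(trimws(tolower(history)), "\\s+")[[1]]
  if (length(words) == 0) return(head(unigrams[order(-freq), pred], n))
  penalty <- 1
  scores  <- list()
  for (order in min(length(words), 4):1) {        # back off from 5-grams down
    ctx  <- paste(tail(words, order), collapse = "_")
    hits <- tables[[order + 1]][base == ctx]
    if (nrow(hits) > 0)
      scores[[length(scores) + 1]] <-
        hits[, .(pred, score = penalty * freq / sum(freq))]
    penalty <- penalty * alpha                    # Stupid Backoff discount
  }
  found <- rbindlist(scores)
  if (nrow(found) == 0) return(head(unigrams[order(-freq), pred], n))
  head(found[, .(score = max(score)), by = pred][order(-score), pred], n)
}
```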
For the Shiny app, I will keep it simple: it will predict the next word and present the three best candidate predictions. As soon as the user enters a space, the three predictions will be shown immediately.
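A minimal sketch of the planned interface, reusing the hypothetical `predict_next()` from above and assuming the n-gram tables are already loaded:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(endsWith(input$phrase, " "))  # fire as soon as a space is entered
    data.frame(prediction = predict_next(input$phrase, tables, unigrams, n = 3))
  })
}

shinyApp(ui, server)
```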