Data Science Capstone Project Milestone Report

Introduction

The goal of this project is to understand the provided data, produce summary statistics, and establish an overall approach for building a prediction algorithm and a Shiny app that predicts the next most probable word from the given input.

Getting and Cleaning the Data

The data for this project were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, provided by Coursera in partnership with SwiftKey. This project uses only the files in the “en_US” folder.

The table below summarizes the three files, showing the number of lines and the number of words in each.

##   filename num_lines num_words
## 1     news   1010242  34762395
## 2  twitter   2360148  30093410
## 3    blogs    899288  37546246

The summary shows that the twitter data has the most lines, whereas the blogs data has the most words.
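
Counts of this kind can be obtained in several ways. The following is a minimal sketch in base R, not necessarily the code used for this report: the helper name `count_lines_words` is hypothetical, and words are assumed to be whitespace-separated tokens, so the totals may differ slightly from those reported above depending on the tokenizer.

```r
# Hypothetical helper (not from the report): count lines and
# whitespace-separated words in a text file.
count_lines_words <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Split every line on runs of whitespace and drop empty tokens
  tokens <- unlist(strsplit(lines, "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  data.frame(num_lines = length(lines), num_words = length(tokens))
}

# Self-contained demonstration on a small temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", "the lazy dog"), demo_path)
demo_counts <- count_lines_words(demo_path)
print(demo_counts)
```

Applying the same function to each of the three files and binding the rows together would give a summary table of the shape shown above.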

Sampling of Data

These files are large, so processing them in full takes a long time. To reduce the processing time, only a subset of the data is used, drawn by sampling so that the subset is representative. For simplicity, we take a 1% sample of the data. The table below summarizes the three sample files, showing the number of lines and the number of words in each.

##        filename num_linesSample num_wordsSample
## 1    newsSample           10102          348176
## 2 twitterSample           23601          300582
## 3   blogsSample            8992          390738
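
One way to draw such a sample is sketched below in base R. This is an illustration under stated assumptions rather than the report's actual code: the helper name `sample_lines` and the seed value are hypothetical, and lines are assumed to be drawn uniformly at random without replacement. Rounding the sample size down with `floor()` is consistent with the line counts in the table above (for example, 1% of 1010242 lines gives 10102).

```r
# Hypothetical helper (not from the report): keep a random fraction of lines.
sample_lines <- function(lines, fraction = 0.01, seed = 1234) {
  set.seed(seed)  # assumed seed, used only to make the sketch reproducible
  n_keep <- floor(length(lines) * fraction)
  # Draw line indices without replacement and keep the original order
  lines[sort(sample(length(lines), n_keep))]
}

# Demonstration on synthetic lines
demo_lines <- paste("line", seq_len(1000))
demo_sample <- sample_lines(demo_lines)
print(length(demo_sample))
```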

n-gram Plots
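
As a rough illustration of what an n-gram frequency count involves, the sketch below uses base R only. It is not the tokenizer used for this report: the helper names `make_ngrams` and `count_ngrams` are hypothetical, and the text is assumed to be lowercased and split on anything other than letters and apostrophes.

```r
# Hypothetical helper (not from the report): build n-grams from one line.
make_ngrams <- function(line, n) {
  # Assumed tokenization: lowercase, split on non-letter characters
  words <- unlist(strsplit(tolower(line), "[^a-z']+"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: n-gram frequencies over many lines, most frequent first
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

demo_text <- c("the cat sat on the mat", "the cat ran")
bigram_counts <- count_ngrams(demo_text, 2)
print(head(bigram_counts, 3))
```

The most frequent entries of such a table are what a bar chart of top n-grams would display.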

Findings

The graphs show that the data need further cleaning.

Next Step

The data require further cleaning. With the cleaned data, the next step is to build a predictive model based on these n-grams and to provide a user interface for it through a Shiny app.
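
To make the planned approach concrete, here is a minimal sketch of next-word prediction from bigram counts. It is an assumption about how such a model might look, not the final design: the helper name `predict_next` is hypothetical, the counts are invented for illustration, and a real model would likely combine several n-gram orders and handle unseen words more gracefully.

```r
# Hypothetical helper (not from the report): return the most frequent
# follower of `word`, given a named vector of bigram counts.
predict_next <- function(bigram_counts, word) {
  grams <- names(bigram_counts)
  prefix <- paste0(tolower(word), " ")
  matches <- bigram_counts[startsWith(grams, prefix)]
  if (length(matches) == 0) return(NA_character_)  # word never seen first
  best <- names(matches)[which.max(matches)]
  # Drop the first word of the bigram, leaving the predicted word
  sub("^[^ ]+ ", "", best)
}

# Toy counts, invented for illustration only
toy_counts <- c("the cat" = 3, "the dog" = 1, "cat sat" = 2)
print(predict_next(toy_counts, "the"))
```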