MILESTONE REPORT

I. Introduction

This milestone report contains an exploratory data analysis of the SwiftKey dataset provided in the Coursera Data Science Capstone course. The data consists of 3 text files containing text from 3 different sources (blogs, news, and Twitter). It can be downloaded from the link below:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The goal of this project is to perform an exploratory data analysis and to provide an overview of the dataset. The step-by-step process in developing the prediction algorithm is summarized in this report.

II. Objective

Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.

III. Exploratory Data Analysis

Data Loading

The dataset is downloaded from the link below: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The datasets consist of text files from 3 different sources:

News
Blogs
Twitter

The text data files are provided in 4 different languages:

German
English - United States
Finnish
Russian

In this project, we will only focus on the English - United States data sets.
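The loading step can be sketched as follows. This is a minimal Python illustration, not the code behind this report (the report's own source is R Markdown). The archive member paths in EN_US_MEMBERS and the function name load_en_us are assumptions made for the example; adjust them if the extracted layout differs.

```python
import io
import zipfile

# Assumed archive layout (not stated in the report): final/en_US/en_US.<source>.txt
EN_US_MEMBERS = {
    "blogs": "final/en_US/en_US.blogs.txt",
    "news": "final/en_US/en_US.news.txt",
    "twitter": "final/en_US/en_US.twitter.txt",
}


def load_en_us(zip_path, members=EN_US_MEMBERS):
    """Return {source: list of lines} for the English (US) files in the archive."""
    data = {}
    with zipfile.ZipFile(zip_path) as archive:
        for source, member in members.items():
            with archive.open(member) as raw:
                # Decode as UTF-8 and split on "\n" only, so stray control
                # characters inside a line do not break it into extra lines.
                text = io.TextIOWrapper(
                    raw, encoding="utf-8", errors="replace", newline="\n"
                )
                data[source] = [line.rstrip("\r\n") for line in text]
    return data
```

Reading directly from the zip avoids extracting the other three languages to disk.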

Data Summary

The table below shows the summary of the English - United States datasets being evaluated and to be used in the predictive algorithm.

Statistical Summary of English - United States Text Files
Dataset	File Size (in MB)	Line Count	Word Count	Minimum Word Count	Mean Word Count	Maximum Word Count
blogs	200.42	899,288	37,546,239	0	41.75	6,726
news	196.28	77,259	2,674,536	1	34.62	1,123
twitter	159.36	2,360,148	30,093,413	1	12.75	47
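A summary of this shape could be computed roughly as shown below. This is a hedged Python sketch rather than the report's actual R code: the function name summarize is hypothetical, words are counted by splitting on whitespace, and 1 MB is taken as 1024 * 1024 bytes. Both choices are assumptions, so the numbers need not match the table exactly.

```python
import os


def summarize(path):
    """Compute size, line count, and words-per-line statistics for one text file."""
    counts = []
    with open(path, encoding="utf-8", errors="replace", newline="\n") as handle:
        for line in handle:
            # Assumption: a "word" is any whitespace-separated token.
            counts.append(len(line.split()))
    total = sum(counts)
    return {
        "size_mb": os.path.getsize(path) / 1024 ** 2,  # assumes 1 MB = 1024^2 bytes
        "lines": len(counts),
        "words": total,
        "min_words": min(counts) if counts else 0,
        "mean_words": total / len(counts) if counts else 0.0,
        "max_words": max(counts) if counts else 0,
    }
```

Calling summarize once per file yields one table row per dataset.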

Data Cleaning

Before performing exploratory data analysis, the data must be cleaned. This involves removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lowercase.
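The cleaning steps listed above can be sketched in a few lines. This Python example is illustrative only and is not the report's R implementation: the names clean_line, URL_RE, and NON_LETTER_RE are hypothetical, and STOPWORDS is a tiny placeholder list where a real run would use a full stopword list.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
APOSTROPHE_RE = re.compile(r"['’]")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_line(line, stopwords=STOPWORDS):
    """Lowercase a line and strip URLs, punctuation, digits, and stopwords."""
    text = line.lower()
    text = URL_RE.sub(" ", text)         # drop URLs before punctuation is removed
    text = APOSTROPHE_RE.sub("", text)   # "it's" -> "its" rather than "it s"
    text = NON_LETTER_RE.sub(" ", text)  # digits, punctuation, special characters
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(tokens)              # split/join also collapses whitespace
```

URLs are removed before punctuation so that their fragments do not survive as stray tokens.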

Due to hardware limitations, only a 1% random sample of the lines in each file is used to demonstrate the data cleaning and exploratory data analysis.
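A 1% line sample can be drawn as in the sketch below. It is a Python illustration under stated assumptions, not the report's code: the function name sample_lines and the seed value are hypothetical, and the sample size is rounded down, which is consistent with the sampled line counts reported in the next table.

```python
import random


def sample_lines(lines, percent=1, seed=1234):
    """Return a reproducible random sample of `percent`% of the lines."""
    k = len(lines) * percent // 100  # integer arithmetic, rounds down
    rng = random.Random(seed)        # fixed seed (assumed value) for repeatability
    return rng.sample(lines, k)      # sampling without replacement
```

Fixing the seed means the same lines are selected on every run, so the summary statistics stay stable between report builds.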

The table below shows the summary of the sampled English - United States datasets being evaluated and to be used in the predictive algorithm.

Statistical Summary of the Sampled English - United States Text Files
Dataset	File Size (in MB)	Line Count	Word Count	Minimum Word Count	Mean Word Count	Maximum Word Count
blogs	0.88	8,992	376,707	1	41.89	682
news	0.07	772	27,020	1	35.00	152
twitter	0.78	23,601	301,545	1	12.78	35

Data Visualization

After transforming and cleaning the data, it is ready for some exploratory analysis to determine the most frequent unigrams, bigrams, and trigrams (sets of 1, 2, and 3 words that occur together).
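Counting the most frequent n-grams can be sketched as follows. This is a minimal Python example, not the report's R code: the function names ngrams and top_ngrams are hypothetical, tokens are assumed to be whitespace-separated after cleaning, and n-grams are assumed not to span line boundaries.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(lines, n, k=10):
    """Return the k most frequent n-grams across all lines with their counts."""
    counts = Counter()
    for line in lines:
        # Counting per line keeps n-grams from crossing line boundaries.
        counts.update(ngrams(line.split(), n))
    return counts.most_common(k)
```

For example, top_ngrams(cleaned_lines, 2) would give the ten most common bigrams, which can then be passed to any bar-chart routine.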

Next Steps For Final Project

This concludes the exploratory analysis of the dataset. The next steps of this capstone project are to finalize the predictive algorithm using N-grams (similar to the exploratory analysis above) and to deploy the algorithm as a Shiny app. The user interface of the Shiny app will consist of a text input box that allows a user to enter a word or a phrase. The app will then use the algorithm to suggest the most likely next word.
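One way such an N-gram predictor could work is sketched below. The plan above does not specify the algorithm, so this Python example assumes a simple back-off scheme: try the longest matching context first, then shorter ones, and finally the most frequent single word. The names build_model and predict_next are hypothetical, and this is not the final implementation.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=3):
    """Map each context (tuple of up to max_n - 1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])  # empty tuple for unigrams
                model[context][tokens[i + n - 1]] += 1
    return model


def predict_next(model, phrase, max_n=3):
    """Suggest the most likely next word, backing off to shorter contexts."""
    tokens = phrase.lower().split()
    for size in range(max_n - 1, -1, -1):
        if size > len(tokens):
            continue  # phrase is too short for this context length
        context = tuple(tokens[len(tokens) - size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # only reached when the model is empty
```

In the Shiny app, the text from the input box would play the role of phrase, and the returned word would be shown as the suggestion.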

See source code: https://github.com/jrnaputo/CapstoneProject/blob/master/Capstone_MilestoneReport.Rmd