MILESTONE REPORT

Data Science Capstone

JOHN JOSEPH R. NAPUTO

AUGUST 15, 2019

I. Introduction

This milestone report presents an exploratory data analysis of the SwiftKey dataset provided in the Coursera Data Science Capstone course. The data consists of 3 text files containing text from 3 different sources (blogs, news, and twitter). It can be downloaded from the link below:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The goal of this project is to perform an exploratory data analysis and to provide an overview of the dataset. This report summarizes the steps taken so far toward developing the prediction algorithm.

II. Objective

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.

  2. Create a basic report of summary statistics about the data sets.

  3. Report any interesting findings that you have amassed so far.

  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

III. Exploratory Data Analysis

Data Loading

The dataset is downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The datasets consist of text files from 3 different sources:

  1. News
  2. Blogs
  3. Twitter

The text data files are provided in 4 different languages:

  1. German
  2. English - United States
  3. Finnish
  4. Russian

In this project, we will only focus on the English - United States data sets.
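As an illustration of the loading step, here is a minimal Python sketch. It is not the code behind this report (that is the R Markdown file linked at the end). It assumes the archive has already been downloaded and unzipped, and that the English files are named `en_US.blogs.txt`, `en_US.news.txt`, and `en_US.twitter.txt` inside a single directory. The function names `read_lines` and `load_english_corpus` are hypothetical.

```python
from pathlib import Path


def read_lines(path, encoding="utf-8"):
    """Return the lines of a text file without trailing newline characters."""
    # errors="replace" keeps loading even if a few bytes are not valid UTF-8.
    with open(path, encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in handle]


def load_english_corpus(data_dir):
    """Load the three English files into a dict keyed by source name."""
    # Assumed file naming: <data_dir>/en_US.<source>.txt
    sources = ("blogs", "news", "twitter")
    return {
        name: read_lines(Path(data_dir) / f"en_US.{name}.txt")
        for name in sources
    }
```

Reading each file as a list of lines keeps the later steps (summary statistics, sampling, cleaning) simple, at the cost of holding the whole file in memory.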

Data Summary

The table below shows the summary of the English - United States datasets being evaluated and to be used in the predictive algorithm.

Statistical Summary of English - United States Text Files

Dataset   File Size (MB)   Line Count   Word Count   Min Words per Line   Mean Words per Line   Max Words per Line
blogs             200.42      899,288   37,546,239                    0                 41.75                6,726
news              196.28       77,259    2,674,536                    1                 34.62                1,123
twitter           159.36    2,360,148   30,093,413                    1                 12.75                   47
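The statistics in this table can be computed along the following lines. This is a hedged Python sketch, not the report's own code: it assumes that words are counted by splitting each line on whitespace, which may differ from the tokenization actually used, and the names `file_size_mb` and `summarize` are hypothetical.

```python
import os


def file_size_mb(path):
    """File size in megabytes (1 MB taken as 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def summarize(lines):
    """Line count, word count, and min/mean/max words per line."""
    # Assumption: a "word" is any whitespace-separated token.
    word_counts = [len(line.split()) for line in lines]
    total = sum(word_counts)
    return {
        "line_count": len(lines),
        "word_count": total,
        "min_words": min(word_counts) if word_counts else 0,
        "mean_words": round(total / len(lines), 2) if lines else 0.0,
        "max_words": max(word_counts) if word_counts else 0,
    }
```

Under this definition the mean is simply the word count divided by the line count, which matches the table: for blogs, 37,546,239 / 899,288 is approximately 41.75.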

Data Cleaning

The data must be cleaned before the exploratory data analysis is performed. Cleaning involves converting the text to lowercase and removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords.
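The cleaning steps listed above can be sketched as a single function. This Python example is illustrative only: the stopword list is a tiny placeholder rather than the list used for this report, the regular expressions are assumptions about what counts as a URL or a special character, and the name `clean_line` is hypothetical.

```python
import re

# Placeholder stopword list for illustration; a real run would use a full list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_line(text, stopwords=STOPWORDS):
    """Lowercase a line and strip URLs, non-letters, stopwords, extra spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    # Drop apostrophes first so "don't" becomes "dont" rather than "don t".
    text = text.replace("'", "").replace("\u2019", "")
    # Removes digits, punctuation, and any character outside a-z.
    text = NON_LETTER_RE.sub(" ", text)
    # split() with no argument also collapses runs of whitespace.
    tokens = [token for token in text.split() if token not in stopwords]
    return " ".join(tokens)
```

For example, `clean_line("The  cat and the hat")` returns `"cat hat"`. Note that this sketch also discards accented and other non-ASCII letters, which is a simplification.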

Due to hardware limitations, only a 1% sample of each file is used to demonstrate the data cleaning and exploratory data analysis.
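A 1% sample of lines can be drawn as in the sketch below. This is a Python illustration under stated assumptions: lines are sampled uniformly without replacement, the sample size is rounded down, and the seed value is arbitrary and only there to make the sample reproducible. The name `sample_lines` is hypothetical.

```python
import random


def sample_lines(lines, fraction=0.01, seed=2019):
    """Return a reproducible random sample of the given fraction of lines."""
    rng = random.Random(seed)  # local generator; the seed is an arbitrary choice
    k = int(len(lines) * fraction)  # round down to a whole number of lines
    return rng.sample(lines, k)
```

Rounding down is consistent with the line counts reported for the sampled files: 1% of 899,288 blog lines gives 8,992 lines.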

The table below shows the summary of the sampled English - United States datasets being evaluated and to be used in the predictive algorithm.

Statistical Summary of the Sampled English - United States Text Files

Dataset   File Size (MB)   Line Count   Word Count   Min Words per Line   Mean Words per Line   Max Words per Line
blogs               0.88        8,992      376,707                    1                 41.89                  682
news                0.07          772       27,020                    1                 35.00                  152
twitter             0.78       23,601      301,545                    1                 12.78                   35

Data Visualization

Once the data has been transformed and cleaned, it is ready for exploratory analysis to determine the most frequent unigrams, bigrams, and trigrams (sequences of 1, 2, and 3 consecutive words).
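Counting the most frequent N-grams can be sketched as follows. This Python example assumes the lines have already been cleaned and that tokens are separated by whitespace; N-grams are counted within a line and never span two lines. The names `ngrams` and `count_ngrams` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count how often each n-gram occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts
```

Calling `count_ngrams(lines, 2).most_common(20)` would then give the 20 most frequent bigrams with their counts, which is the kind of data a frequency bar chart is built from.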

Next Steps For Final Project

This concludes the exploratory analysis of the dataset. The next steps of this capstone project are to finalize the predictive algorithm using an N-gram model (similar to the N-gram analysis above) and to deploy the algorithm as a Shiny app. The user interface of the Shiny app will consist of a text input box in which a user can enter a word or a phrase. The app will then use the algorithm to suggest the most likely next word.
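One possible shape for such an N-gram predictor is sketched below in Python. The report does not specify the final algorithm, so this is only an assumption: it uses a simple back-off scheme that tries the last two words, then the last word, then falls back to the most frequent single word. The names `build_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=3):
    """Map each context (0 to max_n - 1 preceding words) to next-word counts."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model


def predict_next(model, phrase, max_n=3):
    """Suggest the most likely next word, backing off to shorter contexts."""
    tokens = phrase.lower().split()
    # Try the longest available context first, then shorter ones.
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # only reached when the model is empty
```

With a model built from the lines "i love new york" (twice) and "i love new shoes", `predict_next(model, "I love new")` returns `"york"`. A production version would also need smoothing and a strategy for ties, which this sketch leaves out.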

See source code: https://github.com/jrnaputo/CapstoneProject/blob/master/Capstone_MilestoneReport.Rmd