I. Introduction
This milestone report presents an exploratory data analysis of the SwiftKey dataset provided in the Coursera Data Science Capstone course. The data consists of 3 text files containing text from 3 different sources (blogs, news, and Twitter). It can be downloaded from the link below:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The goal of this project is to perform an exploratory data analysis and to provide an overview of the dataset. The step-by-step process in developing the prediction algorithm is summarized in this report.
II. Objective
Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you have amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.
III. Exploratory Data Analysis
Data Loading
Download the dataset from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The datasets consist of text files from 3 different sources:
- News
- Blogs
- Twitter
The text data files are provided in 4 different languages:
- German
- English - United States
- Finnish
- Russian
In this project, we will only focus on the English - United States data sets.
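The loading step can be sketched as follows. This is a minimal illustration in Python, not the code behind this report (the report's own source is the R Markdown file linked at the end). The function names `download_and_extract` and `read_lines` are hypothetical, and the location of the English files inside the archive (for example `final/en_US/en_US.blogs.txt`) is an assumption that should be checked after extraction.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the report.
DATA_URL = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"


def download_and_extract(url=DATA_URL, dest_dir="data"):
    """Download the archive once and unpack it into dest_dir (hypothetical helper)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "Coursera-SwiftKey.zip"
    if not archive.exists():
        # Skip the download when the archive is already on disk.
        urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest


def read_lines(path):
    """Read a text file into a list of lines with trailing newlines removed."""
    # errors="ignore" drops undecodable bytes instead of raising an exception.
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return [line.rstrip("\n") for line in fh]
```

A typical call would be `read_lines(download_and_extract() / "final" / "en_US" / "en_US.blogs.txt")`, assuming the archive layout described above.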
Data Summary
The table below summarizes the English - United States datasets being evaluated, which will be used in the predictive algorithm. Minimum, mean, and maximum word counts are per line.

| Dataset | File Size (MB) | Line Count | Word Count | Min Words per Line | Mean Words per Line | Max Words per Line |
|---|---|---|---|---|---|---|
| blogs | 200.42 | 899,288 | 37,546,239 | 0 | 41.75 | 6,726 |
| news | 196.28 | 77,259 | 2,674,536 | 1 | 34.62 | 1,123 |
| twitter | 159.36 | 2,360,148 | 30,093,413 | 1 | 12.75 | 47 |
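Summary statistics of this kind can be computed with a few lines of code. The Python sketch below is an illustration under stated assumptions: words are counted by splitting each line on whitespace, which may not match the tokenizer used to produce the table, so the resulting numbers can differ slightly. The helper names `summarize_lines` and `file_size_mb` are hypothetical.

```python
import os


def file_size_mb(path):
    """Return the size of a file in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 ** 2)


def summarize_lines(lines):
    """Compute line count, total word count, and per-line word statistics."""
    # Assumption: a "word" is any whitespace-separated token.
    counts = [len(line.split()) for line in lines]
    total = sum(counts)
    return {
        "line_count": len(counts),
        "word_count": total,
        "min_words": min(counts) if counts else 0,
        "mean_words": round(total / len(counts), 2) if counts else 0.0,
        "max_words": max(counts) if counts else 0,
    }
```

Applying `summarize_lines` to the lines of each file, together with `file_size_mb` on the file path, yields one row of the table per dataset.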
Data Cleaning
Before performing exploratory data analysis, the data must be cleaned. This involves removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lowercase.
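The cleaning steps listed above can be sketched as a single function. This Python example is illustrative only: the stopword list is a tiny placeholder rather than the list used for this report, the regular expressions are assumptions, and any character outside `a-z` (including accented letters) is treated as a special character.

```python
import re

# Placeholder stopword list (an assumption, far shorter than a real one).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "at"}

URL_RE = re.compile(r"(https?://|www\.)\S+")
APOSTROPHE_RE = re.compile(r"['’]")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_text(text, stopwords=STOPWORDS):
    """Lowercase the text and strip URLs, digits, punctuation, and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs
    text = APOSTROPHE_RE.sub("", text)    # "it's" -> "its" rather than "it s"
    text = NON_LETTER_RE.sub(" ", text)   # remove digits, punctuation, special characters
    # split() with no argument also collapses excess whitespace.
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(tokens)
```

Lowercasing comes first so that the later patterns only need to handle lowercase letters, and URLs are removed before punctuation so that they are dropped as whole tokens.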
Due to hardware limitations, only a 1% random sample of each file is used to demonstrate the data cleaning and exploratory data analysis.
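A reproducible 1% sample can be drawn as in the Python sketch below. The seed value and the function name `sample_lines` are assumptions made for illustration. The sample size is rounded down, which is consistent with the sampled line counts reported in this section (for example, 8,992 of 899,288 blog lines).

```python
import random


def sample_lines(lines, fraction=0.01, seed=1234):
    """Return a random sample of the given fraction of lines, without replacement."""
    rng = random.Random(seed)          # fixed seed (an assumption) for reproducibility
    k = int(len(lines) * fraction)     # sample size, rounded down
    return rng.sample(lines, k)
```

Using a dedicated `random.Random` instance keeps the sampling independent of any other random state in the program.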
The table below summarizes the sampled English - United States datasets being evaluated, which will be used in the predictive algorithm. Minimum, mean, and maximum word counts are per line.

| Dataset | File Size (MB) | Line Count | Word Count | Min Words per Line | Mean Words per Line | Max Words per Line |
|---|---|---|---|---|---|---|
| blogs | 0.88 | 8,992 | 376,707 | 1 | 41.89 | 682 |
| news | 0.07 | 772 | 27,020 | 1 | 35.00 | 152 |
| twitter | 0.78 | 23,601 | 301,545 | 1 | 12.78 | 35 |
Data Visualization
After transforming and cleaning the data, it is ready for some exploratory analysis to determine the most frequent unigrams, bigrams, and trigrams (sets of 1, 2, and 3 words that occur together).
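Counting the most frequent n-grams can be sketched as follows. This Python example assumes the input lines have already been cleaned and that tokens are separated by whitespace; the names `ngrams` and `top_ngrams` are hypothetical and are not taken from the report's source code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(lines, n, k=10):
    """Return the k most frequent n-grams across all lines with their counts."""
    counts = Counter()
    for line in lines:
        # n-grams are built per line, so they never span two lines.
        counts.update(ngrams(line.split(), n))
    return counts.most_common(k)
```

Calling `top_ngrams(lines, 1)`, `top_ngrams(lines, 2)`, and `top_ngrams(lines, 3)` gives the unigram, bigram, and trigram frequencies that would feed a bar chart.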
Next Steps For Final Project
This concludes the exploratory analysis of the dataset. The next steps of this capstone project are to finalize the predictive algorithm using n-grams (similar to the exploratory analysis above) and to deploy the algorithm as a Shiny app. The user interface of the Shiny app will consist of a text input box in which a user can enter a word or a phrase. The app will then use the algorithm to suggest the most likely next word.
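One possible shape for such an n-gram predictor is sketched below in Python. The report does not specify the final algorithm, so the approach shown here is an assumption: it looks up the longest available context (up to two preceding words) and backs off to shorter contexts when no match is found. The names `build_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=3):
    """Map each context (a tuple of 0 to max_n - 1 words) to counts of the next word."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model


def predict_next(model, phrase, max_n=3):
    """Suggest the most likely next word, backing off to shorter contexts."""
    tokens = phrase.lower().split()
    for size in range(max_n - 1, -1, -1):
        if len(tokens) < size:
            continue  # not enough words typed for a context this long
        context = tuple(tokens[-size:]) if size else ()
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # empty model: nothing to suggest
```

With an empty context the function falls back to the most frequent single word, so a suggestion is returned whenever the model has seen any text at all.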
See source code: https://github.com/jrnaputo/CapstoneProject/blob/master/Capstone_MilestoneReport.Rmd