Introduction

This document is the Milestone Report for the Data Science Specialization course offered by Johns Hopkins University on Coursera.

SwiftKey Corporation has built a smart keyboard application that makes text entry easier for users on mobile devices. The objective of this project is to develop a predictive model that suggests the next word.

This Milestone Report describes the exploratory data analysis of the Capstone Dataset.

Getting the Data

The dataset covers four locales. The file sizes for the en_US locale are as follows:

##                name      size
## 1   en_US.blogs.txt 200.42 MB
## 2    en_US.news.txt 196.28 MB
## 3 en_US.twitter.txt 159.36 MB
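
A table like the one above can be produced with base R alone. The following is a minimal sketch, not the code used for this report: the helper name `file_sizes_mb` and the `final/en_US` directory are assumptions made for illustration.

```r
# Hypothetical helper: report the size of each file in megabytes.
file_sizes_mb <- function(paths) {
  data.frame(
    name = basename(paths),
    # file.info() returns sizes in bytes; NA if a file does not exist.
    size_mb = round(file.info(paths)$size / 1024^2, 2),
    stringsAsFactors = FALSE
  )
}

# Assumed location of the unzipped en_US files; adjust to your setup.
en_us_files <- file.path(
  "final", "en_US",
  c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
)
print(file_sizes_mb(en_us_files))
```

If the files are not present at the assumed path, the size column is simply `NA` rather than an error.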

Data samples

A 1% sample was drawn from each file. The file sizes for the samples are as follows:

##                name    size
## 1   en_US.blogs.txt 2.01 MB
## 2    en_US.news.txt 1.97 MB
## 3 en_US.twitter.txt  1.6 MB
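
The report does not state how the 1% samples were drawn. One plausible approach, sketched below, is uniform random sampling of lines with a fixed seed so the sample is reproducible. The function name `sample_lines` and the seed value are assumptions.

```r
# Hypothetical helper: keep a random fraction of the lines of a text file.
sample_lines <- function(lines, fraction = 0.01, seed = 1234) {
  if (length(lines) == 0) return(lines)
  set.seed(seed)  # fixed seed so repeated runs give the same sample
  n <- max(1L, round(length(lines) * fraction))
  # Sort the sampled indices to preserve the original line order.
  lines[sort(sample.int(length(lines), n))]
}

# Toy input standing in for readLines() on one of the en_US files.
toy_lines <- paste("line", 1:1000)
toy_sample <- sample_lines(toy_lines, fraction = 0.01)
length(toy_sample)
```

On a real file, `lines` would typically come from `readLines(path, encoding = "UTF-8", skipNul = TRUE)`.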

Corpus construction

A Corpus was created in R with the ‘tm’ package. Transformations were then applied to reduce noise: converting text to lower case, removing punctuation, numbers and English stop words, applying stemming, and stripping unnecessary white space.

## <<TermDocumentMatrix (terms: 7115, documents: 3)>>
## Non-/sparse entries: 21345/0
## Sparsity           : 0%
## Maximal term length: 15
## Weighting          : term frequency (tf)
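
The transformations described above map onto standard ‘tm’ functions. The sketch below applies them in the order listed to three toy documents; the `texts` vector is invented for illustration, and the actual report presumably used one document per sample file. Stemming through `stemDocument` also requires the ‘SnowballC’ package to be installed.

```r
library(tm)

# Toy documents standing in for the blogs, news and twitter samples.
texts <- c(
  "The 2 quick brown foxes are jumping!",
  "News: markets closed higher on Friday, analysts said.",
  "Loving the weather today   #sunny"
)

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # punctuation
corpus <- tm_map(corpus, removeNumbers)                      # numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop words
corpus <- tm_map(corpus, stemDocument)                       # stemming
corpus <- tm_map(corpus, stripWhitespace)                    # extra spaces

# Term-document matrix with term-frequency weighting (the default).
tdm <- TermDocumentMatrix(corpus)
print(tdm)
```

Note that removing punctuation before stop words turns contractions such as "don't" into "dont", which no longer matches the stop word list; swapping those two steps is a common variation.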

Exploratory Data Analysis

Word sequences can be explored via N-Grams from the Corpus.
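
An N-Gram is a run of N consecutive words, and counting N-Grams shows which word sequences are most common. The report does not show how its N-Grams were built, so the following is a base R sketch under that caveat; `make_ngrams` is a hypothetical helper and the sentence is toy data.

```r
# Hypothetical helper: build all n-grams from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens <- strsplit(
  "the quick brown fox jumps over the quick brown dog", " ", fixed = TRUE
)[[1]]

bigrams <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

# Frequency table of 2-Grams, most frequent first.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
head(bigram_counts, 3)
```

Ten tokens yield nine 2-Grams and eight 3-Grams, with "the quick" and "quick brown" each occurring twice.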

Word Cloud

Linguistic context can be derived by generating a Word Cloud from the Corpus.
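
A word cloud is drawn from a table of term frequencies. The sketch below computes the frequencies in base R and, assuming the ‘wordcloud’ package is installed, plots them; the word vector is toy data and the choice of package is an assumption, since the report does not name the one it used.

```r
# Toy tokens standing in for the cleaned corpus terms.
words <- c("time", "day", "time", "love", "day", "time", "good")

# Term frequencies, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# Plot only if the optional 'wordcloud' package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(
    words = names(freq),
    freq = as.integer(freq),
    min.freq = 1,
    random.order = FALSE  # place the most frequent words in the centre
  )
}
```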

Summary

The Twitter file is smaller than the Blogs and News files, probably because tweets are short. The 1-Gram and 2-Gram models have reasonable linguistic context. The 3-Gram model does not have reasonable context and contains foreign-language words. The word cloud has loose linguistic context.

Next steps

The following can be done to improve the prediction algorithm for the Shiny app:

- Increase the sample size of the data.
- Improve the quality of the training data by removing foreign-language text.
- Take into account the maximum likelihood in the 3-Gram model and the smoothing probability in the 2-Gram model.
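
To make the maximum likelihood and smoothing step concrete, the sketch below estimates a 3-Gram probability as count(w1 w2 w3) / count(w1 w2) and a smoothed 2-Gram probability with add-one (Laplace) smoothing. The report does not say which smoothing method is planned, so add-one is an assumption chosen for simplicity; the helper names and the token vector are also illustrative.

```r
# Hypothetical helper: build all n-grams from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("i", "like", "green", "tea", "i", "like", "green", "apples",
            "i", "like", "tea")

uni <- table(tokens)
bi  <- table(make_ngrams(tokens, 2))
tri <- table(make_ngrams(tokens, 3))

# Look up a count, returning 0 for unseen n-grams.
count_of <- function(tab, key) if (key %in% names(tab)) tab[[key]] else 0L

# Maximum likelihood estimate of P(w3 | w1 w2); NA if the history is unseen.
mle_trigram <- function(w1, w2, w3) {
  denom <- count_of(bi, paste(w1, w2))
  if (denom == 0) return(NA_real_)
  count_of(tri, paste(w1, w2, w3)) / denom
}

# Add-one smoothed P(w2 | w1); V is the vocabulary size (assumed method).
laplace_bigram <- function(w1, w2) {
  V <- length(uni)
  (count_of(bi, paste(w1, w2)) + 1) / (count_of(uni, w1) + V)
}

mle_trigram("i", "like", "green")   # 2 of the 3 "i like" histories
laplace_bigram("like", "apples")    # unseen 2-Gram still gets probability
```

Here "i like" occurs three times and is followed by "green" twice, so the maximum likelihood estimate is 2/3. The unseen 2-Gram "like apples" receives (0 + 1) / (3 + 5) = 1/8 instead of zero, which is what allows the model to fall back to the 2-Gram when a 3-Gram history has not been observed.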