Milestone Report

Introduction

This report describes my exploratory analysis of the data sets provided for the Capstone Project and my goals for the eventual algorithm and app.

Data Sets Used

The data for this project was downloaded from:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The downloaded file was unzipped and the following files were selected for subsequent analysis:
en_US.blogs.txt (referred to as “Blogs”)
en_US.news.txt (referred to as “News”)
en_US.twitter.txt (referred to as “Twitter”)

Data Set Basic Characteristics

This section contains some basic characteristics of the selected data files.

Data Set Size Metrics

Data Set Character Counts

Each data set was analyzed to see how many characters it contained. The results are summarized in the following three bar plots:

Data Set Line Counts

Each data set was analyzed to see how many lines it contained. The results are summarized in the following three bar plots:

Data Set Word Counts

Each data set was analyzed to see how many words it contained. For the purposes of this analysis, a “word” is defined as a series of one or more non-space characters followed by one or more spaces. The results are summarized in the following three bar plots:
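The three size metrics above (characters, lines, and words) could be computed along the following lines. This is a minimal sketch in Python, not the code used to produce this report. The function name size_metrics and the sample text are illustrative assumptions. Words are approximated as whitespace-delimited tokens, which differs slightly from the definition given above because the last word on a line is counted even when no space follows it.

```python
def size_metrics(text):
    """Return character, line, and word counts for a block of text."""
    lines = text.splitlines()
    # Assumption: a "word" is any run of non-whitespace characters.
    words = text.split()
    return {"chars": len(text), "lines": len(lines), "words": len(words)}


# Hypothetical sample standing in for the contents of one data file.
sample = "the quick brown fox\njumps over the lazy dog\n"
metrics = size_metrics(sample)
print(metrics)
```

For the full data files, the text would be read from disk first, for example with open(path, encoding="utf-8").read(), where path points at one of the three files listed earlier.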

Data Set Word Frequencies

Each data set was analyzed to find the number of times each word was used. The ten most-observed words for each data set, in order of decreasing number of observations, are summarized in the following three bar plots:
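A word-frequency count of this kind might be sketched as follows. This is an illustration under stated assumptions rather than the code behind the plots: the function name top_words and the sample sentence are hypothetical, words are split on whitespace, and no case folding or punctuation stripping is applied.

```python
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent whitespace-delimited words with their counts."""
    counts = Counter(text.split())
    # most_common sorts by count, highest first.
    return counts.most_common(n)


# Hypothetical sample standing in for one data set.
sample = "the cat saw the dog and the dog saw the cat"
top = top_words(sample, n=3)
print(top)
```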

Data Set Word-Pair Frequencies

For each data set, 0.1% of the lines were randomly selected. That subset of lines was analyzed using 2-grams to find the most frequent word pairs. The ten most-observed word pairs for each data set are summarized in the following three bar plots:
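The sampling and 2-gram counting described above could look roughly like this. It is a sketch only: the helper names sample_lines and bigram_counts, the fixed seed, and the toy input are assumptions, and word pairs are counted within a line without crossing line boundaries, which the report does not specify.

```python
import random
from collections import Counter


def sample_lines(lines, fraction=0.001, seed=0):
    """Randomly select roughly `fraction` of the lines (at least one, if any exist)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    return rng.sample(lines, k)


def bigram_counts(lines):
    """Count adjacent word pairs (2-grams) within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Pair each token with its successor on the same line.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical toy corpus: 3000 lines, so a 0.1% sample is 3 lines.
lines = ["i love new york", "new york is big", "i love big cities"] * 1000
subset = sample_lines(lines)
pairs = bigram_counts(subset)
print(pairs.most_common(10))
```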

Plan for a Shiny app

The goal of the Shiny app is to demonstrate an algorithm for predicting the next word to be typed based upon the words previously typed.

The exact algorithm to be used is yet to be determined, but it will likely include concepts taken from n-grams and/or Markov chains.

The word frequencies used by the algorithm will be taken from the data sets described above in this document.
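Since the algorithm is not yet decided, the following is only one possible starting point: a first-order Markov (2-gram) predictor that suggests the words most often seen after the last word typed. The function names build_bigram_model and predict_next, the lowercasing step, and the toy corpus are all illustrative assumptions, and the sketch is written in Python rather than as Shiny code.

```python
from collections import Counter, defaultdict


def build_bigram_model(lines):
    """Map each word to a Counter of the words observed immediately after it."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()  # assumption: case is ignored
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def predict_next(model, text, n=3):
    """Return up to n candidate next words, most frequent first."""
    tokens = text.lower().split()
    if not tokens or tokens[-1] not in model:
        return []  # nothing typed yet, or last word never seen with a successor
    return [word for word, _ in model[tokens[-1]].most_common(n)]


# Hypothetical toy corpus standing in for the sampled data sets.
corpus = ["I love new york", "new york is big", "new shoes are nice"]
model = build_bigram_model(corpus)
print(predict_next(model, "I moved to New"))
```

A practical version would likely need longer n-grams and a fallback for unseen words, which this sketch leaves out.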