This report describes my exploratory analysis of the data sets provided for the Capstone Project and my goals for the eventual algorithm and app.
The data for this project was downloaded from
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The downloaded file was unzipped and the following files were selected for
subsequent analysis:
en_US.blogs.txt (referred to as “Blogs”)
en_US.news.txt (referred to as “News”)
en_US.twitter.txt (referred to as “Twitter”)
This section contains some basic characteristics of the selected data files.
Each data set was analyzed to see how many characters it contained.
The results are summarized in the following three bar plots:
Each data set was analyzed to see how many lines it contained. The results are summarized in the following three bar plots:
Each data set was analyzed to see how many words it contained. For
the purposes of this analysis, a “word” is defined as a series of one or
more non-space characters followed by one or more spaces. The results
are summarized in the following three bar plots:
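The character, line, and word counts described in this section could be computed along the following lines. This is a minimal Python sketch, not the code used to produce the plots: the function name `basic_counts` is hypothetical, and splitting on whitespace only approximates the definition of a “word” given above (it also counts a final word that is not followed by a space).

```python
def basic_counts(text: str) -> dict:
    """Return character, line, and word counts for one data set.

    Hypothetical helper for illustration; not taken from the report.
    """
    # Count lines by splitting on the newline character only, and drop
    # the empty string left behind by a trailing newline.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()

    # Assumption: a "word" is any run of non-whitespace characters.
    words = text.split()

    return {
        "characters": len(text),
        "lines": len(lines),
        "words": len(words),
    }
```

To apply it to one of the files, read the file into a string (for example with `open("en_US.blogs.txt", encoding="utf-8").read()`) and pass the result to `basic_counts`.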
Each data set was analyzed to find the number of times each word was
used. The ten most-observed words for each data set, in order of
decreasing number of observations, are summarized in the following three
bar plots:
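A word-frequency ranking of this kind can be sketched in Python with `collections.Counter`. The function name `top_words` is hypothetical, and the sketch assumes words are compared exactly as written (case-sensitive, punctuation attached), since the report does not state whether the text was normalized first.

```python
from collections import Counter


def top_words(text: str, n: int = 10) -> list:
    """Return the n most frequent words as (word, count) pairs.

    Hypothetical helper for illustration; not taken from the report.
    """
    # Assumption: words are whitespace-separated and are not lowercased
    # or stripped of punctuation before counting.
    counts = Counter(text.split())
    # most_common sorts by decreasing count.
    return counts.most_common(n)
```

Lowercasing the text before counting would merge variants such as “The” and “the”, which may or may not match what the plots show.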
For each of the data sets, 0.1% of the lines were randomly selected.
That subset of lines was analyzed using 2-grams to find the
most frequent word pairs. The ten most-observed word pairs for each
data set are summarized in the following three bar plots:
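The sampling and 2-gram counting steps could look roughly like the Python sketch below. The names `sample_lines` and `bigram_counts` are hypothetical, the fixed seed is only there to make the sample repeatable, and the sketch assumes that word pairs do not span line boundaries, which the report does not specify.

```python
import random
from collections import Counter


def sample_lines(lines: list, fraction: float = 0.001, seed: int = 0) -> list:
    """Randomly select a fraction of the lines (0.1% by default).

    Hypothetical helper for illustration; not taken from the report.
    """
    if not lines:
        return []
    rng = random.Random(seed)  # assumption: seeded for repeatability
    # Keep at least one line so that small inputs still yield a sample.
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)


def bigram_counts(lines: list) -> Counter:
    """Count adjacent word pairs (2-grams) within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Pair each word with its successor on the same line.
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

Calling `bigram_counts(sample_lines(lines)).most_common(10)` would then give the ten most-observed word pairs for the sampled lines.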
The goal of the Shiny app is to demonstrate an algorithm for predicting the next word to be typed based upon the words previously typed.
The exact algorithm to be used is yet to be determined, but it will likely include concepts taken from n-grams and/or Markov chains.
The word frequencies used by the algorithm will be taken from the data sets described above in this document.
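Since the algorithm is not yet decided, the following is only one possible shape it could take: a first-order Markov model built from 2-gram counts, sketched in Python. The names `build_model` and `predict_next` are hypothetical, lowercasing the input is an assumption, and a real implementation would probably need longer n-grams and a fallback for unseen words.

```python
from collections import Counter, defaultdict


def build_model(lines: list) -> dict:
    """Map each word to a Counter of the words observed right after it.

    Hypothetical sketch; the report has not settled on an algorithm.
    """
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()  # assumption: case is ignored
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def predict_next(model: dict, text: str, n: int = 3) -> list:
    """Suggest up to n likely next words given the text typed so far."""
    tokens = text.lower().split()
    if not tokens:
        return []
    # Only the last typed word is used (first-order Markov assumption).
    followers = model.get(tokens[-1])
    if not followers:
        return []  # unseen word: no suggestion in this simple sketch
    return [word for word, _ in followers.most_common(n)]
```

Using only the previous word keeps the lookup table small, at the cost of ignoring earlier context that a 3-gram or 4-gram model would capture.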