Introduction and Report Objectives

This is the milestone report for the Data Science Capstone Project. For the project details and background, please refer to the course page on Coursera: https://class.coursera.org/dsscapstone-006

These are the project objectives:

Data Source and Preparation

The data is provided by SwiftKey and can be downloaded from the Coursera website as a zip file: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Steps:

  • Download the dataset to a directory
  • Unzip the contents of the dataset
  • Use this directory as the working directory for the R processing and R Markdown steps throughout the rest of the report

The unzipped files show that the dataset covers four languages: de_DE, en_US, fi_FI and ru_RU. Each language contains three corpora: tweets, blogs and news.

For the purpose of this report, only the English data will be analyzed.
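The download and unzip steps could be scripted roughly as follows. This is a minimal sketch: the helper name `fetch_dataset` and the `data` directory are assumptions made for illustration, not part of the original workflow.

```r
# Hypothetical helper (not from the original report): download the zip once
# and unpack it into dest_dir.
fetch_dataset <- function(url, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_path <- file.path(dest_dir, basename(url))
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary zip intact on Windows
    download.file(url, destfile = zip_path, mode = "wb")
  }
  # unzip() returns the paths of the extracted files
  unzip(zip_path, exdir = dest_dir)
}

# Usage (requires network access):
# files <- fetch_dataset("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
# grep("en_US", files, value = TRUE)
```

Skipping the download when the zip already exists keeps repeated knits of the report from fetching the archive again.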

# Unzipped data in my local directory

datafile.blog <- "C:\\Users\\Mate\\Desktop\\CapeStonePrj\\final\\en_US\\en_US.blogs.txt"
# Each variable now points at the file its name describes
# (the twitter and news paths were previously swapped).
datafile.twitter <- "C:\\Users\\Mate\\Desktop\\CapeStonePrj\\final\\en_US\\en_US.twitter.txt"
datafile.news <- "C:\\Users\\Mate\\Desktop\\CapeStonePrj\\final\\en_US\\en_US.news.txt"

# Loading data into memory

blog <- readLines(datafile.blog, encoding = "UTF-8")
twitter <- readLines(datafile.twitter, encoding = "UTF-8")
news <- readLines(datafile.news, encoding = "UTF-8")
## Warning in readLines(datafile.news, encoding = "UTF-8"): incomplete
## final line found on 'C:\Users\Mate\Desktop\CapeStonePrj\final\en_US
## \en_US.news.txt'
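The "incomplete final line" warning on en_US.news.txt can be a sign that the read stopped before the real end of the file. One known cause on Windows is an embedded end-of-file control character when a file is read in text mode; whether that happened here is an assumption, since the report does not investigate it. A minimal sketch of reading through a binary connection, which sidesteps that failure mode (`read_corpus` is a hypothetical helper):

```r
# Hypothetical helper: read all lines through a binary-mode connection so
# that embedded control characters do not end the read early.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # skipNul = TRUE drops embedded NUL bytes instead of truncating lines at them
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Usage, assuming the path variables defined in the report:
# news <- read_corpus(datafile.news)
```

Comparing `length(news)` before and after such a change would show whether the original read was truncated.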

Data Analysis

Character and Line Analysis

Blogs Data Set

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539

News Data Set

##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15639408    13072698

Twitter Data Set

##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   161961561   133948120
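The field names Lines, LinesNEmpty, Chars and CharsNWhite match the output of `stringi::stri_stats_general`, which suggests that function produced the tables above; this is an inference, since the report does not show the call. The same four figures can be sketched in base R as follows (`line_stats` is a hypothetical name):

```r
# Hypothetical base-R equivalent of the four statistics shown above.
line_stats <- function(x) {
  c(
    Lines       = length(x),                               # number of lines
    LinesNEmpty = sum(grepl("[^[:space:]]", x)),           # lines with a non-white character
    Chars       = sum(nchar(x)),                           # all characters
    CharsNWhite = sum(nchar(gsub("[[:space:]]", "", x)))   # characters excluding whitespace
  )
}

line_stats(c("a b", "", "  ", "cd"))
```

For the four toy lines, two contain a non-white character, there are seven characters in total, and four of them are not whitespace.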

Blog summary Data Set

##    Length     Class      Mode 
##    899288 character character

Blog words summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
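A words-per-line summary like the one above can be sketched in base R by splitting each line on whitespace. The report does not show how its counts were obtained, so `words_per_line` is a hypothetical helper and its tokenization (whitespace only) is an assumption that may differ slightly from the original.

```r
# Hypothetical helper: count whitespace-separated tokens on each line.
words_per_line <- function(x) {
  # trimws() avoids an empty leading token; an empty line yields 0 words
  lengths(strsplit(trimws(x), "[[:space:]]+"))
}

wc <- words_per_line(c("one two three", "", "  four   five "))
summary(wc)
```

An empty line counts as zero words, which is consistent with the minimum of 0.00 reported for the blog data.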

News summary

##    Length     Class      Mode 
##     77259 character character

News words summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.62   46.00 1123.00

Twitter summary

##    Length     Class      Mode 
##   2360148 character character

Twitter words summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

Further Data Analysis

Corpus-Random Sampling

Blog sample

## [1] "\"Full time Mum.....payment is pure love\"."                                                                                                                                                                                                                                                                                               
## [2] "Port Santa's Little Helper"                                                                                                                                                                                                                                                                                                                
## [3] "What, gentle reader, are we to make of these appalling revelations? Can it be that those we have loved and respected are mere mortals? Worse, flawed mortals. My worry is that, once the initial shock has worn off, it might leave me cynical; questioning of devolution, suspicious of politicians and political systems. I do hope not."
## [4] "* Secondly please e-mail a brief statement about yourself and any DT experience you have to craftingwhenwecan@gmail.com. If you do not have any DT experience please still enter! We welcome everyone to enter in this call."                                                                                                              
## [5] "GET TO KNOW MITT ROMNEY if you are a Christian…"

News sample

## [1] "Citrus"
## [2] "In a roundtable discussion with about three dozen students at the University of South Florida, Nelson said he remains a strong proponent of the DREAM Act, the Development, Relief and Education for Alien Minors measure. The Democratic-backed bill would grant a path to citizenship to young illegal immigrants who attend college or serve in the military. It remains stalled in Congress."
## [3] "Though a parliamentary report in November on the final phase of Canada's combat mission found \"serious security, development and governance issues\" remained in Kandahar, it also concluded Canada made \"important gains\" in Afghanistan. \"We are a far better army from what we were…when we entered the conflict,\" Gen. Milner says."
## [4] "He enlisted in the Army after graduating from De Soto High School in 1997."
## [5] "6. LaGuardia (11-3-0) (5)"

Twitter sample

## [1] "I've never seen people actually read the instructions on what to do of the plane is crashing. The people next to me are taking it seriously."
## [2] "This is not the car we ordered, dad. Quiet Russ. Ed, this is not the car we ordered"
## [3] "Come on and pop dis Pill I call it Super Love u wit a Super Thug so Let ya hair down and be da Girl I know u can be. #Np Take You Home."
## [4] "So I guess Hollywood is gonna start Spiderman all over again? Now he skateboards and is a cool kid? Amazing how much Hollywood sucks."
## [5] "lol....probably so"
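Random samples like the ones above can be drawn reproducibly by fixing the seed before sampling. The sketch below is illustrative only: `sample_lines` and the seed value are assumptions, not the report's actual code.

```r
# Hypothetical helper: draw n lines at random, reproducibly.
sample_lines <- function(x, n, seed = 1234) {
  set.seed(seed)  # note: this resets the global RNG state
  x[sample.int(length(x), min(n, length(x)))]
}

# Usage, assuming `blog` holds the lines read earlier:
# sample_lines(blog, 5)
```

Capping `n` at `length(x)` keeps the helper from failing on inputs shorter than the requested sample.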

Word nGrams
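The report does not show how its n-grams were built. As a rough illustration of the idea, the base-R sketch below lowercases each line, splits it into word tokens, and counts every run of `n` consecutive tokens; `count_ngrams` and its tokenization rule are assumptions.

```r
# Hypothetical helper: frequency table of word n-grams, most frequent first.
count_ngrams <- function(lines, n) {
  # Split on anything that is not a letter or an apostrophe
  tokens <- strsplit(tolower(lines), "[^[:alpha:]']+")
  grams <- unlist(lapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]                      # drop empty tokens
    if (length(tok) < n) return(character(0))    # line too short for an n-gram
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

count_ngrams(c("the cat sat", "the cat ran"), 2)
```

N-grams are built within a line only, so a phrase never spans two unrelated tweets or blog entries.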

Word n-gram plots

Plotting the mono-gram

Plotting the bi-gram

Plotting the tri-gram

Plotting the quadri-gram

Word cloud

The word cloud was generated with wordcloud(myCorpus, scale = c(5, 0.5), max.words = 200, random.order = FALSE, ...). At this scale, 99 words (for example "things", "another" and "something") could not be fit on the page and were not plotted.

Exploration Discovery

The exploration produced the following finding:

In general, the corpora appear to have enough variety and context for a simple word prediction algorithm to be implemented.

Next Steps

Improvements and plans

  • Improve data cleaning to create a more reliable corpus
  • Use bad word filtering
  • Come up with a good sampling model so that we won’t use the entire corpus every time for n-gram creation
  • Determine the final prediction model to use
  • Test the final prediction model
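As an illustration of the planned bad word filtering, one simple option is to drop every line that contains a word from a banned list. This is only a sketch: `remove_banned` is a hypothetical helper, the word list is a placeholder, and the list entries are assumed to be plain alphabetic words with no regular-expression metacharacters.

```r
# Hypothetical helper: drop lines containing any banned word (whole-word,
# case-insensitive match).
remove_banned <- function(lines, banned) {
  pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
  keep <- !grepl(pattern, lines, ignore.case = TRUE, perl = TRUE)
  lines[keep]
}

# Placeholder word list, for illustration only
remove_banned(c("a darn day", "a fine day", "DARN it"), c("darn", "heck"))
```

Dropping whole lines is the bluntest choice; removing only the offending word would keep more data but would also create n-grams that never occurred in the source text.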

Once my prediction model is complete, I will use it in a Shiny app where someone can type a phrase and the model predicts the next word. The next-word prediction will use the bi-gram, tri-gram and quadri-gram data.
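The final prediction model has not been chosen yet. Purely as an illustration of how bi-gram, tri-gram and quadri-gram counts could drive next-word prediction, the sketch below uses a simple back-off lookup: it tries the last three words of the phrase, then the last two, then the last one, and returns the most frequent follower of the first prefix it finds. The names `build_model` and `predict_next` and the tokenization are assumptions, not the planned implementation.

```r
# Hypothetical sketch: count which word follows each 1- to 3-word prefix.
build_model <- function(lines, max_n = 4) {
  tokens <- lapply(strsplit(tolower(lines), "[^[:alpha:]']+"),
                   function(tok) tok[nzchar(tok)])
  prefix <- character(0)
  nxt <- character(0)
  for (tok in tokens) {
    for (n in 2:max_n) {
      if (length(tok) < n) next
      for (i in seq_len(length(tok) - n + 1)) {
        # Growing vectors is slow; acceptable for a small illustration only
        prefix <- c(prefix, paste(tok[i:(i + n - 2)], collapse = " "))
        nxt <- c(nxt, tok[i + n - 1])
      }
    }
  }
  table(prefix, nxt)  # rows: prefixes, columns: following words
}

# Back off from the longest available prefix to the shortest.
predict_next <- function(model, phrase, max_n = 4) {
  words <- strsplit(tolower(phrase), "[^[:alpha:]']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)
  for (k in seq(min(max_n - 1, length(words)), 1)) {
    key <- paste(tail(words, k), collapse = " ")
    if (key %in% rownames(model)) {
      row <- model[key, , drop = FALSE]
      return(colnames(row)[which.max(row)])
    }
  }
  NA_character_  # no prefix seen in the training lines
}

model <- build_model(c("i love data science", "i love data analysis",
                       "i love data science"))
predict_next(model, "I love data")
```

A dense prefix-by-word table like this would not scale to the full corpus; a real model would need a sparse structure and some form of smoothing for unseen prefixes.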