This is the milestone report for the Data Science Capstone Project. For project details and background, please refer to the course page on Coursera: https://class.coursera.org/dsscapstone-006
These are the project objectives:
The data is provided by SwiftKey and can be downloaded from the Coursera website as a zip file: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Steps:
For the purpose of this report, only the English data will be analyzed.
# Unzipped data in my local directory
datafile.blog <- "C:\\Users\\Mate\\Desktop\\CapeStonePrj\\final\\en_US\\en_US.blogs.txt"
datafile.news <- "C:\\Users\\Mate\\Desktop\\CapeStonePrj\\final\\en_US\\en_US.news.txt"
datafile.twitter <- "C:\\Users\\Mate\\Desktop\\CapeStonePrj\\final\\en_US\\en_US.twitter.txt"
# Loading data into memory
blog <- readLines(datafile.blog, encoding = "UTF-8")
news <- readLines(datafile.news, encoding = "UTF-8")
## Warning in readLines(datafile.news, encoding = "UTF-8"): incomplete
## final line found on 'C:\Users\Mate\Desktop\CapeStonePrj\final\en_US
## \en_US.news.txt'
twitter <- readLines(datafile.twitter, encoding = "UTF-8")
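The "incomplete final line" warning on the news file may indicate that a text-mode read stopped before the real end of the file, which can happen on Windows when a file contains certain control characters. A minimal sketch of one common workaround, reading through a binary-mode connection; `read_lines_binary` and the temporary demo file are illustrative names that are not part of the original analysis:

```r
# Read all lines through a binary-mode connection so that control
# characters in the file are not interpreted by the platform's text layer.
# `read_lines_binary` is a hypothetical helper name.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # skipNul drops embedded NUL bytes; warn = FALSE silences the
  # "incomplete final line" message for files without a trailing newline.
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Demo on a temporary file that contains a 0x1A byte and no final newline.
demo_path <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("first line\nsecond"), as.raw(0x1a),
    charToRaw(" line\nthird line")),
  demo_path
)
demo_lines <- read_lines_binary(demo_path)
length(demo_lines)
```

If the line count for a file changes after switching to this kind of read, the earlier count was probably truncated.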
## Dataset  Lines    LinesNEmpty  Chars      CharsNWhite
## blogs    899288   899288       206824382  170389539
## news     77259    77259        15639408   13072698
## twitter  2360148  2360148      161961561  133948120
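The column names in this output match those returned by `stringi::stri_stats_general`, although the call itself is not shown in the report. As a rough base-R sketch of what the four columns measure (the name `basic_text_stats` is made up for illustration, and its handling of whitespace may differ slightly from stringi's):

```r
# Approximate the four general statistics for a character vector of lines.
# `basic_text_stats` is a hypothetical helper, not the report's own code.
basic_text_stats <- function(lines) {
  c(
    Lines       = length(lines),                          # all lines
    LinesNEmpty = sum(grepl("[^[:space:]]", lines)),      # lines with any non-blank character
    Chars       = sum(nchar(lines, type = "chars")),      # all characters
    CharsNWhite = sum(nchar(gsub("[[:space:]]", "", lines),
                            type = "chars"))              # characters excluding whitespace
  )
}

toy_lines <- c("hello world", "", "  a b ")
toy_stats <- basic_text_stats(toy_lines)
toy_stats
```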
###Blog summary
## Length     Class      Mode
## 899288 character character
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    9.00   28.00   41.75   60.00 6726.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
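The summary values above look like words per line, though the code that produced them is not shown. A minimal base-R sketch under that assumption; `count_words` is a hypothetical helper, and splitting on whitespace is only one possible definition of a word:

```r
# Count whitespace-separated tokens in each line (empty lines give 0).
count_words <- function(lines) {
  lengths(regmatches(lines, gregexpr("[^[:space:]]+", lines)))
}

toy_lines <- c("one two three", "", "four  five")
words_per_line <- count_words(toy_lines)
summary(words_per_line)

# Histogram counts with an explicit bin width instead of a default bin count.
bin_width <- 1
breaks <- seq(0, max(words_per_line) + bin_width, by = bin_width)
bins <- hist(words_per_line, breaks = breaks, plot = FALSE)
bins$counts
```

With ggplot2, passing `binwidth` to `geom_histogram()` serves the same purpose and avoids the `stat_bin()` message.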
###News summary
## Length     Class      Mode
##  77259 character character
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   19.00   32.00   34.62   46.00 1123.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
###Twitter summary
##  Length     Class      Mode
## 2360148 character character
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    7.00   12.00   12.75   18.00   47.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
###Further Data Analysis
## [1] "\"Full time Mum.....payment is pure love\"."
## [2] "Port Santa's Little Helper"
## [3] "What, gentle reader, are we to make of these appalling revelations? Can it be that those we have loved and respected are mere mortals? Worse, flawed mortals. My worry is that, once the initial shock has worn off, it might leave me cynical; questioning of devolution, suspicious of politicians and political systems. I do hope not."
## [4] "* Secondly please e-mail a brief statement about yourself and any DT experience you have to craftingwhenwecan@gmail.com. If you do not have any DT experience please still enter! We welcome everyone to enter in this call."
## [5] "GET TO KNOW MITT ROMNEY if you are a Christian<U+0085>"
## [1] "Citrus"
## [2] "In a roundtable discussion with about three dozen students at the University of South Florida, Nelson said he remains a strong proponent of the DREAM Act, the Development, Relief and Education for Alien Minors measure. The Democratic-backed bill would grant a path to citizenship to young illegal immigrants who attend college or serve in the military. It remains stalled in Congress."
## [3] "Though a parliamentary report in November on the final phase of Canada's combat mission found \"serious security, development and governance issues\" remained in Kandahar, it also concluded Canada made \"important gains\" in Afghanistan. \"We are a far better army from what we were<U+0085>when we entered the conflict,\" Gen. Milner says."
## [4] "He enlisted in the Army after graduating from De Soto High School in 1997."
## [5] "6. LaGuardia (11-3-0) (5)"
## [1] "I've never seen people actually read the instructions on what to do of the plane is crashing. The people next to me are taking it seriously."
## [2] "This is not the car we ordered, dad. Quiet Russ. Ed, this is not the car we ordered"
## [3] "Come on and pop dis Pill I call it Super Love u wit a Super Thug so Let ya hair down and be da Girl I know u can be. #Np Take You Home."
## [4] "So I guess Hollywood is gonna start Spiderman all over again? Now he skateboards and is a cool kid? Amazing how much Hollywood sucks."
## [5] "lol....probably so"
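Excerpts like the ones above can be drawn reproducibly by fixing the random seed before sampling. A small sketch; whether the report used `sample()` in this way is an assumption, and `peek_lines` is an illustrative name:

```r
# Return n randomly chosen lines; the fixed seed makes the draw repeatable.
peek_lines <- function(lines, n = 5, seed = 1234) {
  set.seed(seed)
  lines[sample.int(length(lines), size = min(n, length(lines)))]
}

toy <- paste("line", 1:100)
first_draw <- peek_lines(toy)
second_draw <- peek_lines(toy)
identical(first_draw, second_draw)
```

The same helper with a larger `n` could also supply a working subset of each corpus, so that later steps do not have to process every line.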
Word n-gram plots
###Plotting the bi-gram
###Plotting the tri-gram
###Plotting the quadri-gram
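The n-gram frequencies behind plots of this kind can be sketched in base R. This is not the report's own code (the package messages suggest it used `tm`/`NLP`); the helper names and the simple letters-and-apostrophes tokenizer are assumptions made for illustration:

```r
# Split one line into lower-case word tokens (letters and apostrophes only).
tokenize <- function(line) {
  lower <- tolower(line)
  regmatches(lower, gregexpr("[a-z']+", lower))[[1]]
}

# Build all n-grams of one token vector as space-separated strings.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams line by line so that no n-gram spans two lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(l) make_ngrams(tokenize(l), n)))
  sort(table(grams), decreasing = TRUE)
}

toy_corpus <- c("the cat sat on the mat", "the cat ran", "a cat sat")
bigrams <- count_ngrams(toy_corpus, 2)
head(bigrams, 3)
```

Calling `count_ngrams()` with `n = 3` or `n = 4` gives the tri-gram and quadri-gram counts, and the top entries of each table are what a bar chart would display.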
###Word cloud
## Warning in wordcloud(myCorpus, scale = c(5, 0.5), max.words = 200,
## random.order = FALSE, : things could not be fit on page. It will not be
## plotted.
## (The same warning was issued for 98 more words, including "another", "something", "long", "since" and "found".)
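These warnings mean that the listed words did not fit into the plotting area at the chosen `scale`, so they are missing from the word cloud. One possible remedy is a smaller maximum font size or a lower `max.words`. The sketch below is an assumption about how that could look: it builds a toy frequency table in base R and only calls `wordcloud::wordcloud()` if the package is installed; the `scale` and `max.words` values are illustrative, not tuned for this corpus.

```r
# Toy word frequencies computed in base R.
toy_text <- c("time will tell", "good time today", "one more time today")
words <- unlist(strsplit(tolower(toy_text), "[[:space:]]+"))
freq <- sort(table(words), decreasing = TRUE)
top <- head(freq, 100)

# Draw the cloud only when the wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(top), freq = as.integer(top),
                       scale = c(3, 0.4),   # smaller fonts than c(5, 0.5)
                       min.freq = 1,
                       max.words = 100,     # fewer words than 200
                       random.order = FALSE)
}
```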
The exploratory analysis led to the following findings:
The blog data contains the most text, but no common writing style emerges. Depending on the author, a post can be formal or informal, focused or multi-topic, concise or very long. This corpus should be good for creating bi-grams and tri-grams.
The Twitter data is limited to 140 characters per tweet and uses a lot of informal writing (acronyms, colloquial short forms, abbreviations, leetspeak, etc.). It contains the most misspellings. One useful thing it can provide is hashtags. This corpus can be good for creating uni-grams.
The news data appears to have the most formal style of writing. The topics are also very focused, and the text contains very few misspellings. This corpus should be good for generating bi-grams and tri-grams.
In general, the corpora appear to have enough variety and context for a simple word prediction algorithm to be implemented.
Next steps:
* Improve data cleaning to create more reliable corpora
* Use bad word filtering
* Come up with a good sampling model so that we won't use the entire corpus all the time for n-gram creation
* Determine the final prediction model to use
* Test the final prediction model
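The cleaning and bad word filtering steps could be sketched as follows. This is only an illustration under stated assumptions: `clean_lines` is a hypothetical helper, the filter list is a harmless placeholder rather than a real profanity list, and dropping every non-ASCII letter is acceptable only because the report is limited to English text.

```r
# Lower-case the text, strip non-letters, and remove filtered words.
# Assumes the entries in `bad_words` contain no regex metacharacters.
clean_lines <- function(lines, bad_words) {
  cleaned <- tolower(lines)
  # Keep letters, apostrophes and spaces; everything else becomes a space.
  cleaned <- gsub("[^a-z' ]+", " ", cleaned)
  # Remove whole-word matches of the filter list.
  pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, " ", cleaned, perl = TRUE)
  # Collapse repeated spaces and drop lines that end up empty.
  cleaned <- trimws(gsub(" +", " ", cleaned))
  cleaned[nzchar(cleaned)]
}

placeholder_bad_words <- c("darn", "heck")
cleaned_demo <- clean_lines(
  c("What a DARN day!!", "Heck", "See you at 5pm :)"),
  placeholder_bad_words
)
cleaned_demo
```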
Once the prediction model is complete, I will use it in a Shiny app where a user can type a phrase and the model predicts the next word. Next-word prediction will draw on the bi-gram, tri-gram, and quadri-gram data.
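One simple way to combine the three n-gram tables is to try the longest matching context first and back off to shorter ones. The final model has not been chosen yet, so the following is only a sketch of that idea; the function names, the toy corpus and the plain "most frequent continuation" rule are all assumptions:

```r
# Split one line into lower-case word tokens (letters and apostrophes only).
tokenize <- function(line) {
  lower <- tolower(line)
  regmatches(lower, gregexpr("[a-z']+", lower))[[1]]
}

# Named integer vector of n-gram counts for one order n.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tokens <- tokenize(line)
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(as.character(grams))
  setNames(as.integer(counts), names(counts))
}

# Try the quadri-gram context first, then back off to tri- and bi-grams.
predict_next <- function(phrase, models) {
  tokens <- tokenize(phrase)
  for (n in c(4, 3, 2)) {
    counts <- models[[as.character(n)]]
    if (length(tokens) < n - 1 || length(counts) == 0) next
    prefix <- paste0(paste(tail(tokens, n - 1), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(substring(best, nchar(prefix) + 1))
    }
  }
  NA_character_  # no context matched at any order
}

toy_corpus <- c("i love new york", "i love new music",
                "i love new york city", "thanks for the follow")
models <- list("2" = ngram_counts(toy_corpus, 2),
               "3" = ngram_counts(toy_corpus, 3),
               "4" = ngram_counts(toy_corpus, 4))

predict_next("I love new", models)  # matched at the quadri-gram level
predict_next("brand new", models)   # backs off to the bi-gram level
```

In a Shiny app, a function like `predict_next()` would be called on the text the user has typed so far, with the n-gram tables built once in advance.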