The goal of this project is to show that you have become familiar with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Required libraries
library(tm)        # corpus creation and text cleaning
library(ggplot2)   # plotting
library(stringi)   # fast string operations (word counts)
dataURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dataDIR <- "final"
# Download and unzip the data set only if it is not already present
if (!dir.exists(dataDIR)) {
  dataZipName <- "Coursera-SwiftKey.zip"
  if (!file.exists(dataZipName))
    download.file(dataURL, dataZipName, method = "auto")
  unzip(dataZipName)
  if (dir.exists(dataDIR))
    file.remove(dataZipName)
}
file.blog    <- "final/en_US/en_US.blogs.txt"
file.twitter <- "final/en_US/en_US.twitter.txt"
file.web     <- "final/en_US/en_US.news.txt"
# Read each source into memory, one element per line
# (paths are passed directly to readLines() so connections are closed automatically)
lines.blog    <- readLines(file.blog)
lines.twitter <- readLines(file.twitter)
lines.web     <- readLines(file.web)
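To get a quick sense of scale before counting lines and words, the size of each file on disk can be checked. This is a small optional sketch using base R's file.size(); it is not part of the analysis below.
# File sizes on disk, in megabytes, for each source
round(file.size(c(file.blog, file.twitter, file.web)) / 1024^2, 1)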
Get the number of lines in each data source
length(lines.blog)
## [1] 899288
length(lines.twitter)
## [1] 2360148
length(lines.web)
## [1] 77259
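Note that the news source appears to contain far fewer lines than the other two. A likely cause (an assumption here, not verified) is an embedded control character in en_US.news.txt that ends text-mode reading early on some platforms. One possible workaround is to read the file through a binary connection, sketched below; the variable name lines.web.full is introduced only for illustration, and the analysis that follows continues with the lines read above.
# Read the news file through a binary connection to avoid early termination
con <- file(file.web, open = "rb")
lines.web.full <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
length(lines.web.full)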
Count the number of words per line in each source and summarize the distributions
# Get number of words per line
nwords.blog <- stri_count_words(lines.blog)
nwords.twitter <- stri_count_words(lines.twitter)
nwords.web <- stri_count_words(lines.web)
# Summary
summary(nwords.blog)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 29.00 42.43 61.00 6726.00
summary(nwords.twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 7.0 12.0 12.8 18.0 60.0
summary(nwords.web)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.87 46.00 1123.00
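For a more compact overview, the line counts and word-count statistics can also be gathered into a single table. This is a minimal sketch built from the objects defined above.
# One row per source: number of lines, mean and maximum words per line
data.frame(source     = c("blog", "twitter", "web"),
           lines      = c(length(lines.blog), length(lines.twitter), length(lines.web)),
           mean.words = c(mean(nwords.blog), mean(nwords.twitter), mean(nwords.web)),
           max.words  = c(max(nwords.blog), max(nwords.twitter), max(nwords.web)))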
Here I prepare a random sample of 10% of the lines from each source. This reduces processing time while keeping enough cases to remain representative of the original data.
set.seed(1234)  # for reproducibility (the seed value is arbitrary)
slines.blog    <- sample(lines.blog,    floor(0.1 * length(lines.blog)))
slines.twitter <- sample(lines.twitter, floor(0.1 * length(lines.twitter)))
slines.web     <- sample(lines.web,     floor(0.1 * length(lines.web)))
snwords.blog <- stri_count_words(slines.blog)
snwords.twitter <- stri_count_words(slines.twitter)
snwords.web <- stri_count_words(slines.web)
# Combine the sampled word counts into one data frame with a source label
df.nwords.all <- data.frame(
  nword = c(snwords.blog, snwords.twitter, snwords.web),
  type  = c(rep("blog",    length(snwords.blog)),
            rep("twitter", length(snwords.twitter)),
            rep("web",     length(snwords.web))))
Plot the probability density of the number of words per line for each source
ggplot(data = df.nwords.all) + geom_density(aes(nword)) + facet_wrap(~type, nrow = 3) + xlim(0,500)
Create a corpus and clean it to see which words occur most often. Because the computation is heavy, this is applied only to the news (web) sample as an example.
webCorpus <- Corpus(VectorSource(slines.web))
webCorpus <- tm_map(webCorpus, content_transformer(tolower))  # lower-case
webCorpus <- tm_map(webCorpus, removePunctuation)             # strip punctuation
webCorpus <- tm_map(webCorpus, removeNumbers)                 # strip numbers
webDTM <- TermDocumentMatrix(webCorpus,
                             control = list(minWordLength = 1))
mWeb <- as.matrix(webDTM)
# Term frequencies across the whole sample, most frequent first
webOrder <- sort(rowSums(mWeb), decreasing = TRUE)
Finally, display the ten most frequent words and the ten least frequent words (each appearing only once)
head(webOrder, 10)
## the and that for you was with this have but
## 15924 9306 3909 3038 2529 2371 2360 2152 1983 1841
tail(webOrder, 10)
## zoes zucchini zugangscode zur zurich zwei
## 1 1 1 1 1 1
## zymarika zytophiles zzzzs zzzzzzzz
## 1 1 1 1
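Unsurprisingly, the most frequent terms are English stop words. As a sketch of a possible further cleaning step (using tm's built-in English stop word list; the object names webCorpusNS, webOrderNS and df.top are introduced here only for illustration), removing them surfaces more informative terms, which can then be plotted with ggplot2.
# Remove common English stop words and recompute term frequencies
webCorpusNS <- tm_map(webCorpus, removeWords, stopwords("english"))
webDTMNS <- TermDocumentMatrix(webCorpusNS)
webOrderNS <- sort(rowSums(as.matrix(webDTMNS)), decreasing = TRUE)
head(webOrderNS, 10)
# Bar chart of the ten most frequent remaining terms
df.top <- data.frame(term = names(head(webOrderNS, 10)),
                     freq = as.numeric(head(webOrderNS, 10)))
ggplot(df.top, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "term", y = "frequency")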
We have performed an exploratory analysis. Some important findings: