Milestone Report for the Capstone Project

The goal of this project is to show that you have become comfortable working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you have downloaded the data and successfully loaded it.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.


Libraries and data download

Required libraries

# Load the required packages
require(tm)        # corpus creation and text cleaning
require(ggplot2)   # plots
require(stringi)   # fast word counting

# Download and unpack the SwiftKey data set if it is not already present
dataURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dataDIR <- "final"

if (!dir.exists(dataDIR)) {
    dataZipName <- "Coursera-SwiftKey.zip"
    if (!file.exists(dataZipName))
        download.file(dataURL, dataZipName, method = "auto")
    unzip(dataZipName)                  # extracts into the "final" directory
    if (dir.exists(dataDIR))
        file.remove(dataZipName)        # drop the zip once it has been extracted
}

Load data as lines

# Paths to the English-language source files
file.blog    <- "final/en_US/en_US.blogs.txt"
file.twitter <- "final/en_US/en_US.twitter.txt"
file.web     <- "final/en_US/en_US.news.txt"

# Read each source file line by line
lines.blog    <- readLines(file.blog)
lines.twitter <- readLines(file.twitter)
lines.web     <- readLines(file.web)
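
Note: the news file contains a few embedded control characters (including nuls), which can make readLines() stop early on some platforms and would explain a much lower line count for that source. A possible more robust read, assuming the files are UTF-8 encoded, is to open the connection in binary mode and skip nul characters:

# Alternative read for the news file: binary mode avoids stopping at control characters
con <- file(file.web, open = "rb")
lines.web <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
close(con)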

Exploratory analysis

Get the number of lines in each data source

length(lines.blog)
## [1] 899288
length(lines.twitter)
## [1] 2360148
length(lines.web)
## [1] 77259

Count the number of words per line in each source and summarize the distributions

# Get number of words per line
nwords.blog <- stri_count_words(lines.blog)
nwords.twitter <- stri_count_words(lines.twitter)
nwords.web <- stri_count_words(lines.web)

# Summary
summary(nwords.blog)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   42.43   61.00 6726.00
summary(nwords.twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     7.0    12.0    12.8    18.0    60.0
summary(nwords.web)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.87   46.00 1123.00
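
For a more compact overview, the same information can be gathered into a single table (a small sketch built from the objects computed above; summaryTable is just a helper name):

# One row per source: line counts, total words and mean words per line
summaryTable <- data.frame(
    source           = c("blog", "twitter", "web"),
    lines            = c(length(lines.blog), length(lines.twitter), length(lines.web)),
    totalWords       = c(sum(nwords.blog), sum(nwords.twitter), sum(nwords.web)),
    meanWordsPerLine = c(mean(nwords.blog), mean(nwords.twitter), mean(nwords.web))
)
summaryTable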

Further exploration of a sample

Here I prepare a random sample of 10% of the lines from each source. This reduces processing time while keeping enough cases to remain representative of the original data.

set.seed(1234)    # fix the random seed so the sample is reproducible (seed value arbitrary)
slines.blog    <- sample(lines.blog,    0.1 * length(lines.blog))
slines.twitter <- sample(lines.twitter, 0.1 * length(lines.twitter))
slines.web     <- sample(lines.web,     0.1 * length(lines.web))

snwords.blog <- stri_count_words(slines.blog)
snwords.twitter <- stri_count_words(slines.twitter)
snwords.web <- stri_count_words(slines.web)

df.nwords.all <- data.frame(
    nword = c(snwords.blog, snwords.twitter, snwords.web),
    type  = c(rep("blog",    length(snwords.blog)),
              rep("twitter", length(snwords.twitter)),
              rep("web",     length(snwords.web))))

Plot the probability density of the number of words per line for each source:

ggplot(data = df.nwords.all) +
    geom_density(aes(nword)) +
    facet_wrap(~type, nrow = 3) +
    xlim(0, 500)

Next, create a corpus and clean it to see which words occur most often. Because the computation is heavy, we apply this only to the news (web) sample as an example.

webCorpus <- Corpus(VectorSource(slines.web))
webCorpus <- tm_map(webCorpus, content_transformer(tolower))   # lower-case everything
webCorpus <- tm_map(webCorpus, removePunctuation)              # strip punctuation
webCorpus <- tm_map(webCorpus, removeNumbers)                  # strip digits

# Term-document matrix (with tm's defaults, words shorter than 3 characters are discarded)
webDTM <- TermDocumentMatrix(webCorpus)
mWeb <- as.matrix(webDTM)
webOrder <- sort(rowSums(mWeb), decreasing = TRUE)             # term frequencies, most frequent first

Finally, display the 10 most frequent words and the 10 least frequent words (each occurring only once):

head(webOrder, 10)
##   the   and  that   for   you   was  with  this  have   but 
## 15924  9306  3909  3038  2529  2371  2360  2152  1983  1841
tail(webOrder, 10)
##        zoes    zucchini zugangscode         zur      zurich        zwei 
##           1           1           1           1           1           1 
##    zymarika  zytophiles       zzzzs    zzzzzzzz 
##           1           1           1           1
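
As a visual complement, the most frequent terms could also be shown as a bar chart with ggplot2 (a small sketch; topWords is just a helper data frame built from webOrder):

# Bar chart of the 10 most frequent terms in the news (web) sample
topWords <- data.frame(word = names(head(webOrder, 10)),
                       freq = as.numeric(head(webOrder, 10)))
ggplot(topWords, aes(x = reorder(word, -freq), y = freq)) +
    geom_col() +
    labs(x = "word", y = "frequency", title = "10 most frequent words in the news sample")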

Conclusions

We have performed an exploratory analysis of the three English text sources (blogs, news, and Twitter). Some important findings: