Milestone Report for the Capstone Project

The goal of this project is to show that you have become comfortable working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you have downloaded the data and successfully loaded it.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.


Libraries and data download

Required libraries

# Load the required packages
require(tm)        # corpus creation and text cleaning
require(ggplot2)   # plots
require(stringi)   # fast word counting

# Download and unpack the SwiftKey data set if it is not already present
dataURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dataDIR <- "final"

if (!dir.exists(dataDIR)) {
    dataZipName <- "Coursera-SwiftKey.zip"
    if (!file.exists(dataZipName))
        download.file(dataURL, dataZipName, method = "auto")
    unzip(dataZipName)                  # extracts into the "final" directory
    if (dir.exists(dataDIR))
        file.remove(dataZipName)        # drop the zip once it has been extracted
}

Load data as lines

# Paths to the English-language source files
file.blog    <- "final/en_US/en_US.blogs.txt"
file.twitter <- "final/en_US/en_US.twitter.txt"
file.web     <- "final/en_US/en_US.news.txt"

# Read each source file line by line
lines.blog    <- readLines(file.blog)
lines.twitter <- readLines(file.twitter)
lines.web     <- readLines(file.web)
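
Note: the news file contains a few embedded control characters (including nuls), which can make readLines() stop early on some platforms and would explain a much lower line count for that source. A possible more robust read, assuming the files are UTF-8 encoded, is to open the connection in binary mode and skip nul characters:

# Alternative read for the news file: binary mode avoids stopping at control characters
con <- file(file.web, open = "rb")
lines.web <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
close(con)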

Exploratory analysis

Get the number of lines in each data source

length(lines.blog)
## [1] 899288
length(lines.twitter)
## [1] 2360148
length(lines.web)
## [1] 77259

Count the number of words per line in each source and summarize the distributions

# Get number of words per line
nwords.blog <- stri_count_words(lines.blog)
nwords.twitter <- stri_count_words(lines.twitter)
nwords.web <- stri_count_words(lines.web)

# Summary
summary(nwords.blog)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   42.43   61.00 6726.00
summary(nwords.twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     7.0    12.0    12.8    18.0    60.0
summary(nwords.web)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.87   46.00 1123.00
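
For a more compact overview, the same information can be gathered into a single table (a small sketch built from the objects computed above; summaryTable is just a helper name):

# One row per source: line counts, total words and mean words per line
summaryTable <- data.frame(
    source           = c("blog", "twitter", "web"),
    lines            = c(length(lines.blog), length(lines.twitter), length(lines.web)),
    totalWords       = c(sum(nwords.blog), sum(nwords.twitter), sum(nwords.web)),
    meanWordsPerLine = c(mean(nwords.blog), mean(nwords.twitter), mean(nwords.web))
)
summaryTable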

Further exploration of a sample

Here I prepare a random sample of 10% of the lines from each source. This reduces processing time while keeping enough cases to remain representative of the original data.

set.seed(1234)    # fix the random seed so the sample is reproducible (seed value arbitrary)
slines.blog    <- sample(lines.blog,    0.1 * length(lines.blog))
slines.twitter <- sample(lines.twitter, 0.1 * length(lines.twitter))
slines.web     <- sample(lines.web,     0.1 * length(lines.web))

snwords.blog <- stri_count_words(slines.blog)
snwords.twitter <- stri_count_words(slines.twitter)
snwords.web <- stri_count_words(slines.web)

df.nwords.all <- data.frame(
    nword = c(snwords.blog, snwords.twitter, snwords.web),
    type  = c(rep("blog",    length(snwords.blog)),
              rep("twitter", length(snwords.twitter)),
              rep("web",     length(snwords.web))))

Plot the probability density of the number of words per line for each source:

ggplot(data = df.nwords.all) +
    geom_density(aes(nword)) +
    facet_wrap(~type, nrow = 3) +
    xlim(0, 500)

Next, create a corpus and clean it to see which words occur most often. Because the computation is heavy, we apply this only to the news (web) sample as an example.

webCorpus <- Corpus(VectorSource(slines.web))
webCorpus <- tm_map(webCorpus, content_transformer(tolower))   # lower-case everything
webCorpus <- tm_map(webCorpus, removePunctuation)              # strip punctuation
webCorpus <- tm_map(webCorpus, removeNumbers)                  # strip digits

# Term-document matrix (with tm's defaults, words shorter than 3 characters are discarded)
webDTM <- TermDocumentMatrix(webCorpus)
mWeb <- as.matrix(webDTM)
webOrder <- sort(rowSums(mWeb), decreasing = TRUE)             # term frequencies, most frequent first

Finally, display the 10 most frequent words and the 10 least frequent words (each occurring only once):

head(webOrder, 10)
##   the   and  that   for   you   was  with  this  have   but 
## 15924  9306  3909  3038  2529  2371  2360  2152  1983  1841
tail(webOrder, 10)
##        zoes    zucchini zugangscode         zur      zurich        zwei 
##           1           1           1           1           1           1 
##    zymarika  zytophiles       zzzzs    zzzzzzzz 
##           1           1           1           1
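
As a visual complement, the most frequent terms could also be shown as a bar chart with ggplot2 (a small sketch; topWords is just a helper data frame built from webOrder):

# Bar chart of the 10 most frequent terms in the news (web) sample
topWords <- data.frame(word = names(head(webOrder, 10)),
                       freq = as.numeric(head(webOrder, 10)))
ggplot(topWords, aes(x = reorder(word, -freq), y = freq)) +
    geom_col() +
    labs(x = "word", y = "frequency", title = "10 most frequent words in the news sample")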

Conclusions

We have performed an exploratory analysis of the three English text sources (blogs, news, and Twitter). Some important findings: