Structure of Data Analysis Part 2

This follows the video of the same name in week 1 of the Coursera course, Reproducible Research.

Exploratory Analysis

library(kernlab)
data(spam)
set.seed(3435)
# Flip a fair coin for each row: 1 = training set, 0 = test set
trainIndicator <- rbinom(n=nrow(spam), size=1, prob=0.5)
nrow(spam)
## [1] 4601
table(trainIndicator)
## trainIndicator
##    0    1 
## 2314 2287
trainSpam <- spam[trainIndicator == 1, ]
testSpam <- spam[trainIndicator == 0, ]

names(trainSpam)
##  [1] "make"              "address"           "all"              
##  [4] "num3d"             "our"               "over"             
##  [7] "remove"            "internet"          "order"            
## [10] "mail"              "receive"           "will"             
## [13] "people"            "report"            "addresses"        
## [16] "free"              "business"          "email"            
## [19] "you"               "credit"            "your"             
## [22] "font"              "num000"            "money"            
## [25] "hp"                "hpl"               "george"           
## [28] "num650"            "lab"               "labs"             
## [31] "telnet"            "num857"            "data"             
## [34] "num415"            "num85"             "technology"       
## [37] "num1999"           "parts"             "pm"               
## [40] "direct"            "cs"                "meeting"          
## [43] "original"          "project"           "re"               
## [46] "edu"               "table"             "conference"       
## [49] "charSemicolon"     "charRoundbracket"  "charSquarebracket"
## [52] "charExclamation"   "charDollar"        "charHash"         
## [55] "capitalAve"        "capitalLong"       "capitalTotal"     
## [58] "type"
head(trainSpam)
##    make address  all num3d  our over remove internet order mail receive
## 1  0.00    0.64 0.64     0 0.32 0.00   0.00        0  0.00 0.00    0.00
## 7  0.00    0.00 0.00     0 1.92 0.00   0.00        0  0.00 0.64    0.96
## 9  0.15    0.00 0.46     0 0.61 0.00   0.30        0  0.92 0.76    0.76
## 12 0.00    0.00 0.25     0 0.38 0.25   0.25        0  0.00 0.00    0.12
## 14 0.00    0.00 0.00     0 0.90 0.00   0.90        0  0.00 0.90    0.90
## 16 0.00    0.42 0.42     0 1.27 0.00   0.42        0  0.00 1.27    0.00
##    will people report addresses free business email  you credit your font
## 1  0.64   0.00      0         0 0.32        0  1.29 1.93   0.00 0.96    0
## 7  1.28   0.00      0         0 0.96        0  0.32 3.85   0.00 0.64    0
## 9  0.92   0.00      0         0 0.00        0  0.15 1.23   3.53 2.00    0
## 12 0.12   0.12      0         0 0.00        0  0.00 1.16   0.00 0.77    0
## 14 0.00   0.90      0         0 0.00        0  0.00 2.72   0.00 0.90    0
## 16 0.00   0.00      0         0 1.27        0  0.00 1.70   0.42 1.27    0
##    num000 money hp hpl george num650 lab labs telnet num857 data num415
## 1       0  0.00  0   0      0      0   0    0      0      0 0.00      0
## 7       0  0.00  0   0      0      0   0    0      0      0 0.00      0
## 9       0  0.15  0   0      0      0   0    0      0      0 0.15      0
## 12      0  0.00  0   0      0      0   0    0      0      0 0.00      0
## 14      0  0.00  0   0      0      0   0    0      0      0 0.00      0
## 16      0  0.42  0   0      0      0   0    0      0      0 0.00      0
##    num85 technology num1999 parts pm direct cs meeting original project re
## 1      0          0    0.00     0  0   0.00  0       0      0.0       0  0
## 7      0          0    0.00     0  0   0.00  0       0      0.0       0  0
## 9      0          0    0.00     0  0   0.00  0       0      0.3       0  0
## 12     0          0    0.00     0  0   0.00  0       0      0.0       0  0
## 14     0          0    0.00     0  0   0.00  0       0      0.0       0  0
## 16     0          0    1.27     0  0   0.42  0       0      0.0       0  0
##    edu table conference charSemicolon charRoundbracket charSquarebracket
## 1    0     0          0         0.000            0.000                 0
## 7    0     0          0         0.000            0.054                 0
## 9    0     0          0         0.000            0.271                 0
## 12   0     0          0         0.022            0.044                 0
## 14   0     0          0         0.000            0.000                 0
## 16   0     0          0         0.000            0.063                 0
##    charExclamation charDollar charHash capitalAve capitalLong capitalTotal
## 1            0.778      0.000    0.000      3.756          61          278
## 7            0.164      0.054    0.000      1.671           4          112
## 9            0.181      0.203    0.022      9.744         445         1257
## 12           0.663      0.000    0.000      1.243          11          184
## 14           0.000      0.000    0.000      2.083           7           25
## 16           0.572      0.063    0.000      5.659          55          249
##    type
## 1  spam
## 7  spam
## 9  spam
## 12 spam
## 14 spam
## 16 spam
table(trainSpam$type)
## 
## nonspam    spam 
##    1381     906

A first attempt at plotting one of the terms against the type (since type is a factor, this produces a box plot).

plot(trainSpam$capitalAve ~ trainSpam$type)

The problem with the above plot is the outliers, so apply a log10 transformation. Since the log of 0 is not a finite value (see below), we must first add a small offset to every value, which is acceptable in exploratory analysis. The video adds 1, but adding a much smaller number distorts the nonzero values less.
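
A quick check of what R actually does with a zero:

log10(0)
## [1] -Inf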

plot(log10(trainSpam$capitalAve + 0.00001) ~ trainSpam$type)

Now plot some pairwise relationships among the first four terms; with 57 term columns in all, there are many more pairs we could examine.

plot(log10(trainSpam[, 1:4] + 1))

Now build a hierarchical clustering of the terms. The t function transposes a matrix-like structure, so the terms become the rows and the record values become the columns. This is what the dist function needs here: it calculates the distance between each pair of rows in a matrix-like structure, based on the values in all of the columns.
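
A toy illustration of dist (made-up values, not from the data): the Euclidean distance between the rows (0, 0) and (3, 4) is 5.

dist(matrix(c(0, 0, 3, 4), nrow = 2, byrow = TRUE))
##   1
## 2 5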

hCluster <- hclust(dist(t(trainSpam[, 1:57])))
plot(hCluster)

The first clustering is not useful because the values in one field, capitalTotal, appear to outweigh all the other fields (capitalLong also outweighs the rest). So we can transform the data first. Note that the following leaves the last two fields out! So do we need the transformation? (The video makes no mention of why two columns are omitted.)
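
My own quick check of the scale difference (output omitted, since the exact values depend on the random split):

# Compare the means of the capital-letter fields to an ordinary term
colMeans(trainSpam[, c("make", "capitalAve", "capitalLong", "capitalTotal")])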

The result shows that one cluster contains just capitalAve (average number of capitals), and another contains “you,” “your” and “will”.

hClusterUpdated <- hclust(dist(t(log10(trainSpam[, 1:55] + 1))))
plot(hClusterUpdated)

Diverging from the video now: I have left out yet another field, capitalAve, which also has disproportionate outliers, and plotted a cluster without the log10 transformation. The result is different; whether better or worse, I'm not sure.

hClusterUpdated2 <- hclust(dist(t(trainSpam[, 1:54])))
plot(hClusterUpdated2)

Statistical Prediction Modeling

This bit of code throws so many new things at us that it is clear the hand-holding in the course series is at an end, though that was already becoming apparent in week 3 of EDA.

Why add the numType field?
The type field is a factor with two levels, whose underlying numeric codes are 1 and 2. Subtracting 1 gives the 0/1 values the logistic regression below needs.

trainSpam$numType <- as.numeric(trainSpam$type) - 1
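
A quick cross-tabulation confirms the coding (factor levels are alphabetical, so nonspam comes first and maps to 0):

table(trainSpam$numType, trainSpam$type)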

What is a cost function?
It measures how far the model's predictions are from the observed values. Here the x parameter is the observed value and y is the expected (predicted) value; we want them to agree. Since we are using logistic regression, the predicted value ranges from 0 to 1, with 0.5 as the boundary: a prediction of 0.5 or less counts as a "failure" (0), otherwise a "success" (1).

So the function compares each observed value (always 0 or 1) against the prediction. If the observed value is 0 and the prediction is <= 0.5, or the observed value is 1 and the prediction is > 0.5, the cost for that case is 0. In the other cases (observed 0 with prediction > 0.5, or observed 1 with prediction <= 0.5), the cost is 1. The function sums the cost over all cases, i.e., it counts the misclassifications.

costFunction <- function(x, y) sum(x != (y > 0.5))
# One cross-validation error estimate per term
cvError <- rep(NA, 55)
# Why load the boot library?
# It contains the cv.glm function, used below.
library(boot)
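
A toy check of the cost function (made-up values, not from the data): with observed x = c(0, 1, 1, 0) and predictions y = c(0.2, 0.9, 0.3, 0.7), the first two cases are classified correctly and the last two are not, so the cost is 2.

costFunction(c(0, 1, 1, 0), c(0.2, 0.9, 0.3, 0.7))
## [1] 2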

Why use the reformulate function?
Because it builds a model formula from a character vector. What is the response parameter? It is the name of the left-hand side of the formula. So if the first parameter is a vector containing "a" and "b", and response is "x", the resulting formula is x ~ a + b.
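
For example, with illustrative names:

reformulate(c("a", "b"), response = "x")
## x ~ a + b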

What is the glm function?
It fits a "generalized" linear model, of which there is more than one kind. The parameter family = "binomial" means it fits a logistic regression model.

What is logistic regression?
Briefly, it is used when the response variable is categorical with two possible outcomes, failure (0) or success (1). Simple linear regression does not work in that case. Logistic regression uses a "logit" transformation, described here:
http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
Briefly, it models the probability that a given case results in a 0 or a 1 (consider this in the context of the cost function).
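
A small illustration of the idea, fitting just the first term, make, purely as an example (exampleFit is my own name, not from the video):

exampleFit <- glm(numType ~ make, family = "binomial", data = trainSpam)
# The fitted model is: log(p / (1 - p)) = b0 + b1 * make
# predict() returns the linear predictor (the log-odds);
# plogis inverts the logit, turning log-odds back into probabilities
head(plogis(predict(exampleFit)))  # identical to head(exampleFit$fitted.values)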

What is the cv.glm function?
It calculates the "estimated K-fold cross-validation prediction error for generalized linear models." (I have seen this term before but am not yet familiar with the details.) The first parameter is the data set, the second is a fitted generalized linear model, the third is a cost function, and the fourth is K, the number of folds: the data is divided randomly into K groups, and each group in turn is held out while the model is refit on the rest. Here K is 2, so this is 2-fold cross-validation. The cost function is called with the observed responses (trainSpam$numType, which, if you recall, contains either a 0 or a 1) as its first parameter, and the predicted values (glmFit$fitted) as its second.

for (i in 1:55) {
  lmFormula <- reformulate(names(trainSpam)[i], response = "numType")
  glmFit <- glm(lmFormula, family="binomial", data = trainSpam)
  cvError[i] <- cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

The warning above repeats many times over the 55 fits. It means that some fitted probabilities were numerically 0 or 1, which usually happens when a predictor separates the two classes (almost) perfectly for some observations; for this exploratory pass it can be ignored.

What is the final answer?
We have now built a generalized linear model for each term in trainSpam and estimated its cross-validated error. The final answer is the term with the lowest error, i.e., the single term that is most effective at predicting whether an email is spam.

names(trainSpam)[which.min(cvError)]
## [1] "charDollar"