Statistical Learning

Homework assignment 2, 0:00, January 1st, 2014

Part 1

set.seed(1)
setwd("~/Documents/Universiteit/StatLearn")
data.raw <- read.table("home2sim.txt", header = T)

# Split the data:
data <- data.raw[1:200, ]
data.test <- data.raw[201:400, ]

Method 1: LDA

Note that I changed the pch parameters used for the plot calls from circles to other shapes to show the difference between original points and sample points.

library(MASS)
mylda <- lda(y ~ ., data = data)
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
newdata <- expand.grid(x1 = seq(-2, 4, by = 0.1), x2 = seq(-3, 4, by = 0.1))

points(newdata, col = predict(mylda, newdata)$class, pch = "~")

mean(data.test$y == predict(mylda, data.test)$class)

## [1] 0.76

Method 2: LDA on first principal component

We add “dimen=1” to use the first component, in both places:

mylda <- lda(y ~ ., data = data)
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)

points(newdata, col = predict(mylda, newdata, dimen = 1)$class, pch = "~")

mean(data.test$y == predict(mylda, data.test, dimen = 1)$class)

## [1] 0.76

We see no difference from the original LDA. This is expected, because our input data is 2-dimensional and we have 2 centroids/possible values of Y. Since a linear split in a 2-dimensional space results in a 1-dimensional subspace, we were already working with a single dimension.

Method 3: QDA

We now use the qda function from the MASS package:

plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)

model <- qda(y ~ ., data = data)
points(newdata, col = predict(model, newdata)$class, pch = "~")

plot of chunk unnamed-chunk-4

mean(data.test$y == predict(model, data.test)$class)

## [1] 0.8

We see that there is now a quadratic separation, as expected. The accuracy is higher than for the previous method.

Method 4: Linear SVM

library("kernlab")
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)

model <- ksvm(y ~ ., data = data, type = "C-svc", kernel = "vanilladot", C = 1)

##  Setting default kernel parameters

points(newdata, col = predict(model, newdata) + 1, pch = "~")

plot of chunk unnamed-chunk-5

mean(data.test$y == predict(model, data.test))

## [1] 0.75

Method 5: Radial basis support vector machine


for (C in c(0.1, 1, 10, 100, 1000)) {
    model <- ksvm(y ~ ., data = data, type = "C-svc", kernel = "rbf", cross = 5, 
        C = C)
    print(cross(model))
}

## Using automatic sigma estimation (sigest) for RBF or laplace kernel 
## [1] 0.24
## Using automatic sigma estimation (sigest) for RBF or laplace kernel 
## [1] 0.2
## Using automatic sigma estimation (sigest) for RBF or laplace kernel 
## [1] 0.24
## Using automatic sigma estimation (sigest) for RBF or laplace kernel 
## [1] 0.215
## Using automatic sigma estimation (sigest) for RBF or laplace kernel 
## [1] 0.24

We see that the prediction error is the lowest for C=1, so we use that:

plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)

model <- ksvm(y ~ ., data = data, type = "C-svc", kernel = "rbf", cross = 5, 
    C = 1)

## Using automatic sigma estimation (sigest) for RBF or laplace kernel

points(newdata, col = predict(model, newdata) + 1, pch = "~")

plot of chunk unnamed-chunk-7

mean(data.test$y == predict(model, data.test))

## [1] 0.795

Method 6: CART

We use code from the assignment:

library("rpart")

mycart <- rpart(y ~ ., method = "class", data = data)
plot(mycart, uniform = TRUE)
text(mycart, use.n = TRUE, all = TRUE, cex = 0.8)

plot of chunk unnamed-chunk-8

mycart$cptable

##        CP nsplit rel error xerror    xstd
## 1 0.41237      0    1.0000 1.1856 0.07207
## 2 0.15464      1    0.5876 0.6495 0.06772
## 3 0.01546      2    0.4330 0.5567 0.06473
## 4 0.01031      4    0.4021 0.5670 0.06510
## 5 0.01000      5    0.3918 0.5773 0.06546

From the table we select the option with the lowest value for xerror, since it is an estimate of the classification error of the tree, obtained by cross validation. This option is the third, so CP=0.01546392. This gives us our pruned tree:

myprunedcart <- prune(mycart, cp = mycart$cptable[3])
plot(myprunedcart, uniform = TRUE)
text(myprunedcart, use.n = TRUE, all = TRUE, cex = 0.8)

plot of chunk unnamed-chunk-9

Applying this to our data results in:

plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
points(newdata, col = apply(predict(myprunedcart, newdata), 1, which.max), pch = "~")

plot of chunk unnamed-chunk-10

mean(data.test$y + 1 == apply(predict(myprunedcart, data.test), 1, which.max))

## [1] 0.72

Method 7: Random forest

library("randomForest")

## randomForest 4.6-7 Type rfNews() to see new features/changes/bug fixes.


model <- randomForest(factor(y) ~ ., data = data)

plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
points(newdata, col = predict(model, newdata), pch = "~")

plot of chunk unnamed-chunk-11

mean(data.test$y == predict(model, data.test))

## [1] 0.75

Method 8: “ADA-BOOST.M1”

library("adabag")  # Note: not adaboost

## Loading required package: mlbench Loading required package: caret Loading
## required package: cluster Loading required package: foreach Loading
## required package: lattice Loading required package: plyr Loading required
## package: reshape2


data$y <- as.factor(data$y)
model <- boosting(y ~ x1 + x2, data = data)

plot(data$x1, data$x2, col = as.numeric(data$y), xlab = "", ylab = "", pch = 19)
newdata$y <- as.factor(c(rep(0, 4300), rep(1, 31)))
points(newdata, col = as.numeric(predict(model, newdata)$class) + 1, pch = "~")

plot of chunk unnamed-chunk-12


data.test$y <- as.factor(data.test$y)
mean(data.test$y == predict(model, data.test)$class)

## [1] 0.695

Part 2: Spam dataset

We now run all methods again, but for the spam dataset. We do not provide the visualizations, since this is many-more dimensional.

I chose not to split the data into two equal parts, but to make the training data larger. This was done because there are columns with very sparse data (i.e. many zeroes). Without a large sample this can result in rank deficiency for some procedures, including qda.

set.seed(3)
setwd("~/Documents/Universiteit/StatLearn")
data(spam)
# Important: randomize order!
data.raw <- spam[sample(nrow(spam)), ]

# Split the data:
data <- data.raw[1:4000, ]
data.test <- data.raw[4001:4601, ]

Method 1: LDA

library(MASS)
mylda <- lda(type ~ ., data = data)

mean(data.test$type == predict(mylda, data.test)$class)

## [1] 0.8968

Method 2: LDA on (first principal) component(s)

We add “dimen=1” to use the first component:

mylda <- lda(type ~ ., data = data)

mean(data.test$type == predict(mylda, data.test, dimen = 1)$class)

## [1] 0.8968

mean(data.test$type == predict(mylda, data.test, dimen = 2)$class)

## [1] 0.8968

mean(data.test$type == predict(mylda, data.test, dimen = 3)$class)

## [1] 0.8968

mean(data.test$type == predict(mylda, data.test, dimen = 4)$class)

## [1] 0.8968

mean(data.test$type == predict(mylda, data.test, dimen = 5)$class)

## [1] 0.8968

mean(data.test$type == predict(mylda, data.test, dimen = 6)$class)

## [1] 0.8968

Again we see that the principal component is enough to fully determine the accuracy of the method.