Homework assignment 2, 0:00, January 1st, 2014
set.seed(1)
setwd("~/Documents/Universiteit/StatLearn")
data.raw <- read.table("home2sim.txt", header = TRUE)
# Split the data:
data <- data.raw[1:200, ]
data.test <- data.raw[201:400, ]
Note that I changed the pch parameters in the plot calls from circles to other symbols, to distinguish the original data points from the prediction-grid points.
library(MASS)
mylda <- lda(y ~ ., data = data)
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
# Grid of points covering the plotting region, used to visualize the predicted decision regions
newdata <- expand.grid(x1 = seq(-2, 4, by = 0.1), x2 = seq(-3, 4, by = 0.1))
points(newdata, col = predict(mylda, newdata)$class, pch = "~")
mean(data.test$y == predict(mylda, data.test)$class)
## [1] 0.76
We add “dimen=1” to use only the first linear discriminant, in both predict calls:
mylda <- lda(y ~ ., data = data)
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
points(newdata, col = predict(mylda, newdata, dimen = 1)$class, pch = "~")
mean(data.test$y == predict(mylda, data.test, dimen = 1)$class)
## [1] 0.76
We see no difference from the original LDA. This is expected: with two classes (two centroids), LDA has at most one discriminant direction (K - 1 directions for K classes), so the default fit was already working in a single dimension.
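This can be verified on the fitted object; a quick check using the mylda fit above:
# With two classes lda() fits at most one linear discriminant,
# so restricting predict() to dimen = 1 cannot change the predictions.
ncol(mylda$scaling)  # number of discriminant directions; 1 here
all(predict(mylda, data.test)$class == predict(mylda, data.test, dimen = 1)$class)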
We now use the qda function from the MASS package:
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
model <- qda(y ~ ., data = data)
points(newdata, col = predict(model, newdata)$class, pch = "~")
mean(data.test$y == predict(model, data.test)$class)
## [1] 0.8
We see that the decision boundary is now quadratic, as expected, and the test accuracy (0.80) is higher than with LDA (0.76).
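To see where the remaining errors occur, the test predictions can be cross-tabulated against the true labels; a small check using the qda fit above:
# Confusion matrix of QDA on the test set
table(truth = data.test$y, predicted = predict(model, data.test)$class)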
library("kernlab")
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
# "vanilladot" is the linear kernel
model <- ksvm(y ~ ., data = data, type = "C-svc", kernel = "vanilladot", C = 1)
## Setting default kernel parameters
points(newdata, col = predict(model, newdata) + 1, pch = "~")
mean(data.test$y == predict(model, data.test))
## [1] 0.75
for (C in c(0.1, 1, 10, 100, 1000)) {
  model <- ksvm(y ~ ., data = data, type = "C-svc", kernel = "rbf",
                cross = 5, C = C)
  print(cross(model))  # 5-fold cross-validation error for this value of C
}
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.24
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.2
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.24
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.215
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.24
We see that the cross-validated error is lowest for C = 1, so we use that:
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
model <- ksvm(y ~ ., data = data, type = "C-svc", kernel = "rbf", cross = 5,
C = 1)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
points(newdata, col = predict(model, newdata) + 1, pch = "~")
mean(data.test$y == predict(model, data.test))
## [1] 0.795
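For completeness, the same grid search can be written so that the best C is selected programmatically rather than read off the printed output; a minimal sketch under the same settings (the cross-validation folds are random, so the exact errors differ between runs):
# Collect the 5-fold cross-validation errors and pick the best C automatically
Cs <- c(0.1, 1, 10, 100, 1000)
cv.err <- sapply(Cs, function(C) cross(ksvm(y ~ ., data = data, type = "C-svc",
                                            kernel = "rbf", cross = 5, C = C)))
Cs[which.min(cv.err)]  # expected to point at C = 1, as read off above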
We use code from the assignment:
library("rpart")
mycart <- rpart(y ~ ., method = "class", data = data)
plot(mycart, uniform = TRUE)
text(mycart, use.n = TRUE, all = TRUE, cex = 0.8)
mycart$cptable
## CP nsplit rel error xerror xstd
## 1 0.41237 0 1.0000 1.1856 0.07207
## 2 0.15464 1 0.5876 0.6495 0.06772
## 3 0.01546 2 0.4330 0.5567 0.06473
## 4 0.01031 4 0.4021 0.5670 0.06510
## 5 0.01000 5 0.3918 0.5773 0.06546
From the table we select the option with the lowest value for xerror, since it is an estimate of the classification error of the tree, obtained by cross validation. This option is the third, so CP=0.01546392. This gives us our pruned tree:
myprunedcart <- prune(mycart, cp = mycart$cptable[3, "CP"])
plot(myprunedcart, uniform = TRUE)
text(myprunedcart, use.n = TRUE, all = TRUE, cex = 0.8)
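The cp value can also be chosen programmatically instead of by inspecting the table; a small sketch using the complexity table above:
# Row of the complexity table with the smallest cross-validated error
best.row <- which.min(mycart$cptable[, "xerror"])
mycart$cptable[best.row, "CP"]  # should reproduce the row-3 choice (CP = 0.01546) made above
A common refinement is the one-standard-error rule (picking the simplest tree whose xerror is within one xstd of the minimum), but here we simply take the minimum.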
Applying this to our data results in:
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
# predict() returns a matrix of class probabilities; which.max picks the predicted class (1 or 2)
points(newdata, col = apply(predict(myprunedcart, newdata), 1, which.max), pch = "~")
# y is coded 0/1 while which.max returns 1/2, hence the "+ 1"
mean(data.test$y + 1 == apply(predict(myprunedcart, data.test), 1, which.max))
## [1] 0.72
library("randomForest")
## randomForest 4.6-7 Type rfNews() to see new features/changes/bug fixes.
model <- randomForest(factor(y) ~ ., data = data)
plot(data$x1, data$x2, col = data$y + 1, xlab = "", ylab = "", pch = 19)
points(newdata, col = predict(model, newdata), pch = "~")
mean(data.test$y == predict(model, data.test))
## [1] 0.75
library("adabag") # Note: not adaboost
## Loading required package: mlbench
## Loading required package: caret
## Loading required package: cluster
## Loading required package: foreach
## Loading required package: lattice
## Loading required package: plyr
## Loading required package: reshape2
data$y <- as.factor(data$y)
model <- boosting(y ~ x1 + x2, data = data)
plot(data$x1, data$x2, col = as.numeric(data$y), xlab = "", ylab = "", pch = 19)
# Dummy y column so that predict.boosting finds the response variable in newdata;
# the values are only placeholders (the grid has 61 * 71 = 4331 points)
newdata$y <- as.factor(c(rep(0, 4300), rep(1, 31)))
points(newdata, col = as.numeric(predict(model, newdata)$class) + 1, pch = "~")
data.test$y <- as.factor(data.test$y)
mean(data.test$y == predict(model, data.test)$class)
## [1] 0.695
We now run all methods again, but for the spam dataset. We do not provide visualizations, since the data has far more than two dimensions.
I chose not to split the data into two equal parts, but to make the training set larger. This was done because some columns are very sparse (i.e. mostly zeroes); without a large training sample this can lead to rank deficiency in some procedures, including qda. (A quick check of this sparsity follows the split below.)
set.seed(3)
setwd("~/Documents/Universiteit/StatLearn")
data(spam)
# Important: randomize order!
data.raw <- spam[sample(nrow(spam)), ]
# Split the data:
data <- data.raw[1:4000, ]
data.test <- data.raw[4001:4601, ]
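As a quick illustration of the sparsity mentioned above (a sketch; the 0.95 cutoff is arbitrary and only serves as an illustration):
# Fraction of zero entries per predictor in the spam data
pred.cols <- setdiff(names(spam), "type")
zero.frac <- colMeans(spam[, pred.cols] == 0)
summary(zero.frac)
sum(zero.frac > 0.95)  # predictors that are zero in more than 95% of the messages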
library(MASS)
mylda <- lda(type ~ ., data = data)
mean(data.test$type == predict(mylda, data.test)$class)
## [1] 0.8968
We add “dimen=1” to use only the first linear discriminant, and also try higher values of dimen:
mylda <- lda(type ~ ., data = data)
mean(data.test$type == predict(mylda, data.test, dimen = 1)$class)
## [1] 0.8968
mean(data.test$type == predict(mylda, data.test, dimen = 2)$class)
## [1] 0.8968
mean(data.test$type == predict(mylda, data.test, dimen = 3)$class)
## [1] 0.8968
mean(data.test$type == predict(mylda, data.test, dimen = 4)$class)
## [1] 0.8968
mean(data.test$type == predict(mylda, data.test, dimen = 5)$class)
## [1] 0.8968
mean(data.test$type == predict(mylda, data.test, dimen = 6)$class)
## [1] 0.8968
Again we see that the first linear discriminant fully determines the predictions: with two classes there is only one discriminant direction, so the dimen argument has no effect.
We now use the qda function from the MASS package:
model <- qda(factor(type) ~ ., data = data)
mean(data.test$type == predict(model, data.test)$class)
## [1] 0.8303
library("kernlab")
model <- ksvm(type ~ ., data = data, type = "C-svc", kernel = "vanilladot",
C = 1)
## Setting default kernel parameters
mean(data.test$type == predict(model, data.test))
## [1] 0.9268
for (C in c(0.1, 1, 10, 100, 1000)) {
  model <- ksvm(type ~ ., data = data, type = "C-svc", kernel = "rbf",
                cross = 5, C = C)
  print(cross(model))  # 5-fold cross-validation error for this value of C
}
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.1005
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.067
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.069
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.0745
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
## [1] 0.08575
We see that the cross-validated error is lowest for C = 1, so we use that:
model <- ksvm(type ~ ., data = data, type = "C-svc", kernel = "rbf", cross = 5,
C = 1)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
mean(data.test$type == predict(model, data.test))
## [1] 0.9218
We use code from the assignment. We do not enlarge the plot, since it would become very cluttered and the exact tree structure is not of interest here.
library("rpart")
mycart <- rpart(type ~ ., method = "class", data = data)
plot(mycart, uniform = TRUE)
text(mycart, use.n = TRUE, all = TRUE, cex = 0.8)
mycart$cptable
## CP nsplit rel error xerror xstd
## 1 0.47532 0 1.0000 1.0000 0.01957
## 2 0.15127 1 0.5247 0.5544 0.01655
## 3 0.04304 2 0.3734 0.4709 0.01558
## 4 0.03101 4 0.2873 0.3437 0.01371
## 5 0.01139 5 0.2563 0.2981 0.01290
## 6 0.01000 7 0.2335 0.2854 0.01266
From the table we select the option with the lowest value for xerror, since it is an estimate of the classification error of the tree, obtained by cross-validation. This is the sixth (last) option, so CP=0.01. This gives us our pruned tree:
myprunedcart <- prune(mycart, cp = mycart$cptable[6, "CP"])
plot(myprunedcart, uniform = TRUE)
text(myprunedcart, use.n = TRUE, all = TRUE, cex = 0.8)
Applying this to our data results in:
# as.numeric() on the factor gives 1/2, matching the column index returned by which.max
mean(as.numeric(data.test$type) == as.numeric(apply(predict(myprunedcart, data.test),
    1, which.max)))
## [1] 0.9085
library("randomForest")
model <- randomForest(type ~ ., data = data)
mean(data.test$type == predict(model, data.test))
## [1] 0.9501
library("adabag") # Note: not adaboost
model <- boosting(type ~ ., data = data, mfinal = 22)
## Warning: Walker's alias method used: results are different from R < 2.2.0
mean(data.test$type == predict(model, data.test)$class)
## [1] 0.9584