library(readxl)   # read_excel()
library(readr)    # read_csv()
ArcLakeGroupSummary <- read_excel("~/Desktop/EPSRC Project /ArcLakeGroupSummary.xlsx")
dundeedata <- read_csv("~/Desktop/EPSRC Project /dundeedata.csv.xls")
colnames(dundeedata)[1]<-"GloboLakes_ID" # rename the GloboLID column to GloboLakes_ID to make the merge easier
Data<-merge(ArcLakeGroupSummary, dundeedata, by = "GloboLakes_ID", all = TRUE )
Data<-subset(Data, Group!="NA") # The data set is back to the original 732 rows just with extra columns of information
Data$Group<-as.factor(Data$Group)
In order to compare different models and particular parameter values, I decided to split the data into a training (80%) and test (20%) set in a stratified way. The training set will be used in stratified 5-fold cross-validation to compare the performance of different models. The model that performs best, i.e. has the lowest cross-validation error rate, will be used to produce a test error rate.
An illustration of the scheme to be used in comparing various models using stratification can be found below.
Some brief justifications for some of the decisions made:
The overall scheme - A good justification is explained here.
The 80-20 split - After trying a few different splits (e.g. 60-40, 70-30, etc.), I found that the SVMs with quadratic and radial kernels tended to overfit during cross-validation when the training set made up a smaller percentage of the data.
The 5-fold cross-validation - A smaller number of folds gives a lower-variance estimate of the cross-validation error rate. Also, since the training set contains 80% of all observations, each of the 5 folds accounts for 16% of the entire data set, which is comparable to the test set's 20%.
The stratification throughout - Ideally, the prediction for each observation should come from a model that was actually trained on examples of that observation's class. An illustration of an extreme problem that could arise if we didn't stratify is given below. In addition, as we are using a single seed, stratifying gives a more stable estimate of the CV and test error rates.
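As a small, purely hypothetical sketch of that extreme problem: with an unstratified random split, a rare class's share of the training set is left to chance and can be badly under-represented or even absent, whereas a stratified split preserves the class proportions. The toy data below are made up for illustration only.
# Hypothetical toy example - not part of the lake data.
set.seed(1)
toy <- data.frame(class = factor(c(rep("common", 95), rep("rare", 5))))
# Unstratified 80-20 split: the "rare" count in training is left to chance.
unstrat.index <- sample(seq_len(nrow(toy)), size = 0.8 * nrow(toy))
table(toy$class[unstrat.index])
# Stratified 80-20 split via caret: roughly 80% of each class ends up in training.
library(caret)
strat.index <- createDataPartition(toy$class, p = 0.8, list = FALSE)
table(toy$class[strat.index])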
In order to use each method, I first prepare a suitable data frame - splitting it into training and test sets and then splitting the training set into 5 folds.
Data1<-data.frame(Data[,c("Group","PC1","PC2")])
# Stratify the entire data set into training and test sets
set.seed(234)
library(caret)
train.index<-createDataPartition(Data1$Group, p=0.8, list = FALSE)
train.set<-Data1[train.index, ]
test.set<-Data1[-train.index, ]
# Stratify the training set into 5 folds
folds <- createFolds(y=factor(train.set$Group), k = 5, list = FALSE)
train.set$fold <- folds
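As a quick sanity check (my addition, not part of the original analysis), cross-tabulating the fold labels against the group labels should show each group spread roughly evenly across the five folds:
# Each group should appear in every fold in roughly equal numbers,
# confirming that the folds are stratified by Group.
table(train.set$fold, train.set$Group)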
Originally, I only considered QDA. However, including the performance of LDA may make for an interesting comparison.
library(MASS)     # lda() and qda()
# Using LDA to produce the CV error rate
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
lda.fit<-lda(formula = Group~PC1+PC2, data=train.data)
lda.y <- valid.data$Group
lda.predy<-predict(lda.fit, valid.data)$class
ith.test.error<- mean(lda.y!=lda.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.04745763
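To make the quantity being summed explicit: the loop above computes a weighted average of the per-fold error rates, with each fold weighted by its share of the training set,

$$
\text{CV error} = \sum_{k=1}^{5} \frac{n_k}{n_{\text{train}}}\,\text{err}_k,
$$

where $n_k$ is the number of observations in fold $k$, $n_{\text{train}}$ is the size of the training set and $\text{err}_k$ is the misclassification rate on fold $k$ for the model fitted to the other four folds.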
# Using QDA to produce the CV error rate
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
qda.fit<-qda(formula = Group~PC1+PC2, data=train.data)
qda.y <- valid.data$Group
qda.predy<-predict(qda.fit, valid.data)$class
ith.test.error<- mean(qda.y!=qda.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.04067797
In comparison, QDA performed slightly better than LDA, with cross-validation error rates of 4.07% and 4.75%, respectively.
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameter was chosen to be cost = 0.03589176. This produced a CV error rate of 1.53%.
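Since the optimization run itself is not reproduced here, the sketch below shows one way such a search over the cost parameter could be set up with the rBayesianOptimization package. The package choice, the search bounds and the helper svm.cv.cost() are my own illustrative assumptions, not necessarily what was actually used.
# Illustrative sketch only - assumed package and bounds, not the original tuning run.
library(rBayesianOptimization)
library(e1071)
# Hypothetical helper: weighted 5-fold CV error for a linear-kernel SVM at a given log10(cost).
svm.cv.cost <- function(log10cost) {
  cv.error <- 0
  for (i in 1:5) {
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i)
    fit <- svm(Group ~ PC1 + PC2, data = train.data, kernel = "linear",
               cost = 10^log10cost, scale = FALSE)
    err <- mean(valid.data$Group != predict(fit, valid.data))
    cv.error <- cv.error + (nrow(valid.data) / nrow(train.set)) * err
  }
  # BayesianOptimization maximises Score, so return the negative CV error.
  list(Score = -cv.error, Pred = 0)
}
opt <- BayesianOptimization(svm.cv.cost,
                            bounds = list(log10cost = c(-3, 3)),
                            init_points = 5, n_iter = 15, verbose = TRUE)
opt$Best_Par   # best log10(cost) found under these assumed settings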
library(e1071)    # svm()
# Using SVM linear kernel cost = 0.03589176 - the best linear kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~PC1+PC2,data = train.data, kernel="linear", cost = 0.03589176 ,scale=FALSE)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.01525424
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameters were chosen to be cost = 1797909, gamma = 0.09669255 and degree = 3. This also produced a CV error rate of 1.53%.
# Using SVM polynomial kernel cost = 1797909, gamma = 0.09669255 and degree = 3 - the best polynomial kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~PC1+PC2, data = train.data, kernel="polynomial",
cost = 1797909, gamma = 0.09669255, degree = 3)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.01525424
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameters were chosen to be cost = 25718660 and gamma = 0.0003379114. This also produced a CV error rate of 1.53%.
# Using SVM radial kernel cost = 25718660 and gamma = 0.0003379114 - the best radial kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~PC1+PC2, data = train.data, kernel="radial", cost=25718660, gamma=0.0003379114)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.01525424
Overall, there was a tie between the SVM (linear kernel, cost = 0.03589176), the SVM (polynomial kernel, cost = 1797909, gamma = 0.09669255, degree = 3) and the SVM (radial kernel, cost = 25718660, gamma = 0.0003379114). Given this tie, the SVM with the linear kernel is the simplest model and is therefore preferred. Now we will use the entire training set to train this model and test it on the test set to obtain the test error rate.
svmfit<-svm(Group~PC1+PC2,data = train.set,kernel="linear",cost = 0.03589176 ,scale=FALSE)
svmfit
##
## Call:
## svm(formula = Group ~ PC1 + PC2, data = train.set, kernel = "linear",
## cost = 0.03589176, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.03589176
## gamma: 0.5
##
## Number of Support Vectors: 78
A plot of how the model partitioned the space is given below - the observations represented by a cross are the support vectors.
The built-in plot function in the e1071 library presents the SVM classification awkwardly, placing PC1 on the y-axis and PC2 on the x-axis, so I decided to create my own classification plot over a fine grid.
xgrid<-expand.grid(X1=seq(min(Data$PC1), max(Data$PC1), length.out = 150),
X2=seq(min(Data$PC2), max(Data$PC2), length.out =150))
colnames(xgrid)<-c("PC1", "PC2")
group.train.set.pred<-predict(svmfit, xgrid)
xgrid<-cbind(xgrid, group.train.set.pred)
library(ggplot2)  # ggplot()
ggplot(xgrid, aes(x=PC1,y=PC2))+
geom_point(aes(colour=group.train.set.pred), alpha = 1/5)+
geom_point(data = train.set[-svmfit$index,], aes(x=PC1, y=PC2, colour=Group))+
geom_point(data = train.set[svmfit$index,], aes(x=PC1, y=PC2, colour=Group), shape=4) + labs(colour = "Group", title="Decision Surface With Training Set Observations")
svm.y<-test.set$Group
svm.predy<-predict(svmfit, test.set)
mean(svm.y!=svm.predy)
## [1] 0.04225352
The test error rate for this model was 4.23%.
The cross-classification table is given below.
table(svm.y, svm.predy)
## svm.predy
## svm.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 8 0 0 0 0 0 0 0
## 3 0 0 15 0 0 0 0 0 0
## 4 0 0 0 23 1 0 0 0 0
## 5 0 0 0 1 44 0 0 0 3
## 6 0 0 0 0 0 8 0 0 0
## 7 0 0 0 0 0 0 3 0 0
## 8 0 0 0 0 0 0 0 5 0
## 9 0 0 0 0 1 0 0 0 19
4 out of the 6 errors were observations of class 5 being misclassified as either group 4 or 9.
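To complement the raw counts, the table can be converted to row-wise proportions (my addition, not shown in the original output), which gives each class's misclassification rate directly:
# Row-wise proportions of the confusion table: each row sums to 1, so the
# off-diagonal entries give the proportion of that class that was misclassified.
round(prop.table(table(svm.y, svm.predy), margin = 1), 3)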
ggplot(xgrid, aes(x=PC1,y=PC2))+
geom_point(aes(colour=group.train.set.pred), alpha = 1/5)+
geom_point(data = test.set, aes(x=PC1, y=PC2, colour=Group))+ labs(colour = "Group", title="Decision Surface With Test Set Observations")
In order to use each model, I prepare a suitable data frame - splitting it into training and test sets and then splitting the training set into 5 folds.
Data2<-data.frame(Data[, c("Group", "Latitude", "Longitude", "OverallAvg")])
# Stratify the entire data set into training and test sets
set.seed(234)
library(caret)
train.index<-createDataPartition(Data2$Group, p=0.8, list = FALSE)
train.set<-Data2[train.index, ]
test.set<-Data2[-train.index, ]
# Stratify the training set into 5 folds
folds <- createFolds(y=factor(train.set$Group), k = 5, list = FALSE)
train.set$fold <- folds
Here we again implement both LDA and QDA; the comparison may be insightful.
# Using LDA to produce the CV error rate
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
lda.fit<-lda(formula = Group~ Longitude + Latitude + OverallAvg, data=train.data)
lda.y <- valid.data$Group
lda.predy<-predict(lda.fit, valid.data)$class
ith.test.error<- mean(lda.y!=lda.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.08983051
# Using QDA to produce the CV error rate
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
qda.fit<-qda(formula = Group~ Longitude + Latitude + OverallAvg, data=train.data)
qda.y <- valid.data$Group
qda.predy<-predict(qda.fit, valid.data)$class
ith.test.error<- mean(qda.y!=qda.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.05932203
As seen in the output above, QDA performed much better than LDA, with cross-validation error rates of 5.93% and 8.98%, respectively. The sizeable gap between the two rates suggests that more flexible models may perform better here.
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameter was chosen to be cost = 14.76168. This produced a CV error rate of 5.59%.
# Using SVM linear kernel cost = 14.76168 - the best linear kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data,
kernel="linear", cost=14.76168, scale=FALSE)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.0559322
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameters were chosen to be cost = 485165195, gamma = 0.001458117 and degree = 1. This produced a CV error rate of 4.58%.
# Using SVM polynomial kernel cost = 485165195, gamma = 0.001458117 and degree = 1 - the best polynomial kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data,
kernel="polynomial", cost = 485165195, gamma = 0.001458117, degree = 1)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.04576271
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameters were chosen to be cost = 5958.305 and gamma = 0.006992461. This produced a CV error rate of 3.73%.
# Using SVM radial kernel cost = 5958.305 and gamma = 0.006992461 - the best radial kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data,
kernel="radial", cost = 5958.305, gamma = 0.006992461)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.03728814
Overall, the best performing model was the SVM (radial kernel, cost = 5958.305, gamma = 0.006992461). Now we will use the entire training set to train this model and test it on the test set to obtain the test error rate.
svmfit<-svm(Group ~ Longitude + Latitude + OverallAvg,data = train.set,
kernel="radial", cost = 5958.305, gamma = 0.006992461)
svmfit
##
## Call:
## svm(formula = Group ~ Longitude + Latitude + OverallAvg, data = train.set,
## kernel = "radial", cost = 5958.305, gamma = 0.006992461)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 5958.305
## gamma: 0.006992461
##
## Number of Support Vectors: 104
svm.y<-test.set$Group
svm.predy<-predict(svmfit, test.set)
mean(svm.y!=svm.predy)
## [1] 0.04929577
The test error rate for this model was 4.93%. The cross-classification table is given below.
table(svm.y, svm.predy)
## svm.predy
## svm.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 8 0 0 0 0 0 0 0
## 3 0 0 15 0 0 0 0 0 0
## 4 0 0 0 24 0 0 0 0 0
## 5 0 0 0 3 44 0 0 0 1
## 6 0 0 1 0 0 7 0 0 0
## 7 0 0 0 0 0 0 2 1 0
## 8 0 0 0 0 0 0 0 5 0
## 9 0 0 0 0 1 0 0 0 19
Interestingly, 4 of the 7 errors were observations of class 5 misclassified as either group 4 or group 9. A high rate of misclassification for class 5 was also evident when PC1 and PC2 were used as explanatory variables.
svmfit<-svm(Group ~ Longitude + Latitude + OverallAvg,data = train.set,
kernel="radial", cost = 5958.305, gamma = 0.006992461)
xgrid<-expand.grid(X1=seq(min(Data$Longitude), max(Data$Longitude), length.out =70),
X2=seq(min(Data$Latitude), max(Data$Latitude), length.out = 70),
X3=seq(min(Data$OverallAvg), max(Data$OverallAvg), length.out = 70))
colnames(xgrid)<-c("Longitude", "Latitude", "OverallAvg")
group.train.set.pred<-predict(svmfit, xgrid)
xgrid<-cbind(xgrid, group.train.set.pred)
There is no built-in function to plot the SVM classification in more than two dimensions. A plot of how the model partitions the space is given below; it was created by predicting over a fine grid (70 points along each of the three axes, i.e. 70^3 = 343,000 grid points).
library(plotly)   # plot_ly()
interactive <- plot_ly() %>%
add_trace(
x = ~xgrid$Longitude,
y = ~xgrid$Latitude,
z= ~xgrid$OverallAvg,
mode = "markers",
color= ~ xgrid$group.train.set.pred,
opacity=0.05,
text = ~paste("Predicted Group: ", xgrid$group.train.set.pred)) %>%
add_trace(
x = ~test.set$Longitude,
y = ~test.set$Latitude,
z= ~test.set$OverallAvg,
mode = "markers",
color= ~ test.set$Group,
opacity=1,
text = ~paste("Group: ", test.set$Group))%>%
layout(
title = "Predicted 3d space with test observations",
scene = list(
xaxis = list(title = "Longitude"),
yaxis = list(title = "Latitude"),
zaxis = list(title = "OverallAvg")))%>%
layout(annotations=list(yref="paper", xref="paper", y=1.05, x=1.1,text= "Predicted / Actual", showarrow=F))
interactive
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
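The warning appears because there are nine groups but the default Set2 palette supplies only eight colours. One possible workaround (my suggestion, untested against these data) is to pass an explicit nine-colour palette, for example Set1 from RColorBrewer, via the colors argument:
# Possible fix (assumption): use a palette with at least nine colours so the
# nine groups no longer exceed Set2's maximum of eight.
nine.cols <- RColorBrewer::brewer.pal(9, "Set1")   # "Set1" provides 9 colours
interactive2 <- plot_ly(colors = nine.cols) %>%
  add_trace(x = ~xgrid$Longitude, y = ~xgrid$Latitude, z = ~xgrid$OverallAvg,
            mode = "markers", color = ~xgrid$group.train.set.pred, opacity = 0.05) %>%
  add_trace(x = ~test.set$Longitude, y = ~test.set$Latitude, z = ~test.set$OverallAvg,
            mode = "markers", color = ~test.set$Group, opacity = 1)
interactive2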