library(readxl)   # read_excel()
library(readr)    # read_csv()
ArcLakeGroupSummary <- read_excel("~/Desktop/EPSRC Project /ArcLakeGroupSummary.xlsx")
dundeedata <- read_csv("~/Desktop/EPSRC Project /dundeedata.csv.xls")
colnames(dundeedata)[1]<-"GloboLakes_ID" # rename the GloboLID column to GloboLakes_ID to make the merge easier
Data<-merge(ArcLakeGroupSummary, dundeedata, by = "GloboLakes_ID", all = TRUE )
Data<-subset(Data, Group!="NA") # The data set is back to the original 732 rows just with extra columns of information
Data$Group<-as.factor(Data$Group)
In order to compare different models and particular parameter values, I decided to split the data into a training (80%) and test (20%) set in a stratified way. The training set will be used in stratified 5-fold cross-validation to compare the performance of different models. The model that performs best, i.e. has the lowest cross-validation error rate, will be used to produce a test error rate.
An illustration of the scheme to be used in comparing various models using stratification can be found below.
Some brief justifications for some of the decisions made:
The overall scheme - A good justification is explained here.
The 80-20 split - After trying a few different splits (e.g. 60-40, 70-30, etc.), I found that the SVMs with quadratic and radial kernels tended to overfit during cross-validation when the training set made up a smaller percentage of the data.
The 5-fold cross-validation - A smaller number of folds gives a lower-variance estimate of the cross-validation error rate. Also, since the training set contains 80% of all observations, each of the 5 folds accounts for 16% of the entire data set, which is comparable to the test set's 20%.
The stratification throughout - Ideally, the prediction for each observation should come from a model that was actually trained on examples of that observation's class. An illustration of an extreme problem that could arise if we didn't stratify is given below. In addition, as we are using a single seed, stratifying gives a more stable estimate of the CV and test error rates.
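As a small, purely hypothetical sketch of that extreme problem: with an unstratified random split, a rare class's share of the training set is left to chance and can be badly under-represented or even absent, whereas a stratified split preserves the class proportions. The toy data below are made up for illustration only.
# Hypothetical toy example - not part of the lake data.
set.seed(1)
toy <- data.frame(class = factor(c(rep("common", 95), rep("rare", 5))))
# Unstratified 80-20 split: the "rare" count in training is left to chance.
unstrat.index <- sample(seq_len(nrow(toy)), size = 0.8 * nrow(toy))
table(toy$class[unstrat.index])
# Stratified 80-20 split via caret: roughly 80% of each class ends up in training.
library(caret)
strat.index <- createDataPartition(toy$class, p = 0.8, list = FALSE)
table(toy$class[strat.index])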
In order to use each method, I first prepare a suitable data frame - splitting it into training and test sets and then splitting the training set into 5 folds.
Data1<-data.frame(Data[,c("Group","PC1","PC2")])
# Stratify the entire data set into training and test sets
set.seed(234)
library(caret)
train.index<-createDataPartition(Data1$Group, p=0.8, list = FALSE)
train.set<-Data1[train.index, ]
test.set<-Data1[-train.index, ]
# Stratify the training set into 5 folds
folds <- createFolds(y=factor(train.set$Group), k = 5, list = FALSE)
train.set$fold <- folds
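As a quick sanity check (my addition, not part of the original analysis), cross-tabulating the fold labels against the group labels should show each group spread roughly evenly across the five folds:
# Each group should appear in every fold in roughly equal numbers,
# confirming that the folds are stratified by Group.
table(train.set$fold, train.set$Group)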
Originally, I only considered QDA. However, including the performance of LDA may make for an interesting comparison.
library(MASS)     # lda() and qda()
# Using LDA to produce the CV error rate
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
lda.fit<-lda(formula = Group~PC1+PC2, data=train.data)
lda.y <- valid.data$Group
lda.predy<-predict(lda.fit, valid.data)$class
ith.test.error<- mean(lda.y!=lda.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.04745763
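To make the quantity being summed explicit: the loop above computes a weighted average of the per-fold error rates, with each fold weighted by its share of the training set,

$$
\text{CV error} = \sum_{k=1}^{5} \frac{n_k}{n_{\text{train}}}\,\text{err}_k,
$$

where $n_k$ is the number of observations in fold $k$, $n_{\text{train}}$ is the size of the training set and $\text{err}_k$ is the misclassification rate on fold $k$ for the model fitted to the other four folds.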
# Using QDA to produce the CV error rate
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
qda.fit<-qda(formula = Group~PC1+PC2, data=train.data)
qda.y <- valid.data$Group
qda.predy<-predict(qda.fit, valid.data)$class
ith.test.error<- mean(qda.y!=qda.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.04067797
In comparison, QDA performed slightly better than LDA, with cross-validation error rates of 4.07% and 4.75%, respectively.
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameter was chosen to be cost = 0.03589176. This produced a CV error rate of 1.53%.
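Since the optimization run itself is not reproduced here, the sketch below shows one way such a search over the cost parameter could be set up with the rBayesianOptimization package. The package choice, the search bounds and the helper svm.cv.cost() are my own illustrative assumptions, not necessarily what was actually used.
# Illustrative sketch only - assumed package and bounds, not the original tuning run.
library(rBayesianOptimization)
library(e1071)
# Hypothetical helper: weighted 5-fold CV error for a linear-kernel SVM at a given log10(cost).
svm.cv.cost <- function(log10cost) {
  cv.error <- 0
  for (i in 1:5) {
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i)
    fit <- svm(Group ~ PC1 + PC2, data = train.data, kernel = "linear",
               cost = 10^log10cost, scale = FALSE)
    err <- mean(valid.data$Group != predict(fit, valid.data))
    cv.error <- cv.error + (nrow(valid.data) / nrow(train.set)) * err
  }
  # BayesianOptimization maximises Score, so return the negative CV error.
  list(Score = -cv.error, Pred = 0)
}
opt <- BayesianOptimization(svm.cv.cost,
                            bounds = list(log10cost = c(-3, 3)),
                            init_points = 5, n_iter = 15, verbose = TRUE)
opt$Best_Par   # best log10(cost) found under these assumed settings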
library(e1071)    # svm()
# Using SVM linear kernel cost = 0.03589176 - the best linear kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~PC1+PC2,data = train.data, kernel="linear", cost = 0.03589176 ,scale=FALSE)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.01525424
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameters were chosen to be cost = 1797909, gamma = 0.09669255 and degree = 3. This also produced a CV error rate of 1.53%.
# Using SVM polynomial kernel cost = 1797909, gamma = 0.09669255 and degree = 3 - the best polynomial kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~PC1+PC2, data = train.data, kernel="polynomial",
cost = 1797909, gamma = 0.09669255, degree = 3)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.01525424
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameters were chosen to be cost = 25718660 and gamma = 0.0003379114. This also produced a CV error rate of 1.53%.
# Using SVM radial kernel cost = 25718660 and gamma = 0.0003379114 - the best radial kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~PC1+PC2, data = train.data, kernel="radial", cost=25718660, gamma=0.0003379114)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.01525424
Overall, there was a tie between the SVM (linear kernel, cost = 0.03589176), the SVM (polynomial kernel, cost = 1797909, gamma = 0.09669255, degree = 3) and the SVM (radial kernel, cost = 25718660, gamma = 0.0003379114). Given this tie, the SVM with the linear kernel is the simplest model and is therefore preferred. Now we will use the entire training set to train this model and test it on the test set to obtain the test error rate.
svmfit<-svm(Group~PC1+PC2,data = train.set,kernel="linear",cost = 0.03589176 ,scale=FALSE)
svmfit
##
## Call:
## svm(formula = Group ~ PC1 + PC2, data = train.set, kernel = "linear",
## cost = 0.03589176, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.03589176
## gamma: 0.5
##
## Number of Support Vectors: 78
A plot of how the model partitioned the space is given below - the observations represented by a cross are the support vectors.
The built-in plot function in the e1071 library presents the SVM classification awkwardly, placing PC1 on the y-axis and PC2 on the x-axis, so I decided to create my own classification plot over a fine grid.
xgrid<-expand.grid(X1=seq(min(Data$PC1), max(Data$PC1), length.out = 150),
X2=seq(min(Data$PC2), max(Data$PC2), length.out =150))
colnames(xgrid)<-c("PC1", "PC2")
group.train.set.pred<-predict(svmfit, xgrid)
xgrid<-cbind(xgrid, group.train.set.pred)
library(ggplot2)  # ggplot()
ggplot(xgrid, aes(x=PC1,y=PC2))+
geom_point(aes(colour=group.train.set.pred), alpha = 1/5)+
geom_point(data = train.set[-svmfit$index,], aes(x=PC1, y=PC2, colour=Group))+
geom_point(data = train.set[svmfit$index,], aes(x=PC1, y=PC2, colour=Group), shape=4) + labs(colour = "Group", title="Decision Surface With Training Set Observations")
svm.y<-test.set$Group
svm.predy<-predict(svmfit, test.set)
mean(svm.y!=svm.predy)
## [1] 0.04225352
The test error rate for this model was 4.23%.
The cross-classification table is given below.
table(svm.y, svm.predy)
## svm.predy
## svm.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 8 0 0 0 0 0 0 0
## 3 0 0 15 0 0 0 0 0 0
## 4 0 0 0 23 1 0 0 0 0
## 5 0 0 0 1 44 0 0 0 3
## 6 0 0 0 0 0 8 0 0 0
## 7 0 0 0 0 0 0 3 0 0
## 8 0 0 0 0 0 0 0 5 0
## 9 0 0 0 0 1 0 0 0 19
4 out of the 6 errors were observations of class 5 being misclassified as either group 4 or 9.
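To complement the raw counts, the table can be converted to row-wise proportions (my addition, not shown in the original output), which gives each class's misclassification rate directly:
# Row-wise proportions of the confusion table: each row sums to 1, so the
# off-diagonal entries give the proportion of that class that was misclassified.
round(prop.table(table(svm.y, svm.predy), margin = 1), 3)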
ggplot(xgrid, aes(x=PC1,y=PC2))+
geom_point(aes(colour=group.train.set.pred), alpha = 1/5)+
geom_point(data = test.set, aes(x=PC1, y=PC2, colour=Group))+ labs(colour = "Group", title="Decision Surface With Test Set Observations")
In order to use each model, I prepare a suitable data frame - splitting it into training and test sets and then splitting the training set into 5 folds.
Data2<-data.frame(Data[, c("Group", "Latitude", "Longitude", "OverallAvg")])
# Stratify the entire data set into training and test sets
set.seed(234)
library(caret)
train.index<-createDataPartition(Data2$Group, p=0.8, list = FALSE)
train.set<-Data2[train.index, ]
test.set<-Data2[-train.index, ]
# Stratify the training set into 5 folds
folds <- createFolds(y=factor(train.set$Group), k = 5, list = FALSE)
train.set$fold <- folds
Here we again implement both LDA and QDA; the comparison may be insightful.
# Using LDA to produce the CV error rate
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
lda.fit<-lda(formula = Group~ Longitude + Latitude + OverallAvg, data=train.data)
lda.y <- valid.data$Group
lda.predy<-predict(lda.fit, valid.data)$class
ith.test.error<- mean(lda.y!=lda.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.08983051
# Using QDA to produce the CV error rate
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
qda.fit<-qda(formula = Group~ Longitude + Latitude + OverallAvg, data=train.data)
qda.y <- valid.data$Group
qda.predy<-predict(qda.fit, valid.data)$class
ith.test.error<- mean(qda.y!=qda.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.05932203
As seen in the output above, QDA performed much better than LDA, with cross-validation error rates of 5.93% and 8.98%, respectively. The sizeable gap between the two rates suggests that more flexible models may perform better here.
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameter was chosen to be cost = 14.76168. This produced a CV error rate of 5.59%.
# Using SVM linear kernel cost = 14.76168 - the best linear kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data,
kernel="linear", cost=14.76168, scale=FALSE)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.0559322
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameters were chosen to be cost = 485165195, gamma = 0.001458117 and degree = 1. This produced a CV error rate of 4.58%.
# Using SVM polynomial kernel cost = 485165195, gamma = 0.001458117 and degree = 1 - the best polynomial kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data,
kernel="polynomial", cost = 485165195, gamma = 0.001458117, degree = 1)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.04576271
To explore the optimal hyperparameter settings for this model, Bayesian optimization was used - a more detailed description can be found here. It is not included in this file as it took over an hour to run.
The optimal hyperparameters were chosen to be cost = 5958.305 and gamma = 0.006992461. This produced a CV error rate of 3.73%.
# Using SVM radial kernel cost = 5958.305 and gamma = 0.006992461 - the best radial kernel SVM model considered.
CV.error<-NULL
for (i in 1:5) {
valid.data <- subset(train.set, fold == i)
train.data <- subset(train.set, fold != i)
svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data,
kernel="radial", cost = 5958.305, gamma = 0.006992461)
svm.y<-valid.data$Group
svm.predy<-predict(svmfit, valid.data)
ith.test.error<- mean(svm.y!=svm.predy)
CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)
}
sum(CV.error)
## [1] 0.03728814
Overall, the best performing model was the SVM (radial kernel, cost = 5958.305, gamma = 0.006992461). Now we will use the entire training set to train this model and test it on the test set to obtain the test error rate.
svmfit<-svm(Group ~ Longitude + Latitude + OverallAvg,data = train.set,
kernel="radial", cost = 5958.305, gamma = 0.006992461)
svmfit
##
## Call:
## svm(formula = Group ~ Longitude + Latitude + OverallAvg, data = train.set,
## kernel = "radial", cost = 5958.305, gamma = 0.006992461)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 5958.305
## gamma: 0.006992461
##
## Number of Support Vectors: 104
svm.y<-test.set$Group
svm.predy<-predict(svmfit, test.set)
mean(svm.y!=svm.predy)
## [1] 0.04929577
The test error rate for this model was 4.93%. The cross-classification table is given below.
table(svm.y, svm.predy)
## svm.predy
## svm.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 8 0 0 0 0 0 0 0
## 3 0 0 15 0 0 0 0 0 0
## 4 0 0 0 24 0 0 0 0 0
## 5 0 0 0 3 44 0 0 0 1
## 6 0 0 1 0 0 7 0 0 0
## 7 0 0 0 0 0 0 2 1 0
## 8 0 0 0 0 0 0 0 5 0
## 9 0 0 0 0 1 0 0 0 19
Interestingly, 4 of the 7 errors were observations of class 5 misclassified as either group 4 or group 9. A high rate of misclassification for class 5 was also evident when PC1 and PC2 were used as explanatory variables.
svmfit<-svm(Group ~ Longitude + Latitude + OverallAvg,data = train.set,
kernel="radial", cost = 5958.305, gamma = 0.006992461)
xgrid<-expand.grid(X1=seq(min(Data$Longitude), max(Data$Longitude), length.out =70),
X2=seq(min(Data$Latitude), max(Data$Latitude), length.out = 70),
X3=seq(min(Data$OverallAvg), max(Data$OverallAvg), length.out = 70))
colnames(xgrid)<-c("Longitude", "Latitude", "OverallAvg")
group.train.set.pred<-predict(svmfit, xgrid)
xgrid<-cbind(xgrid, group.train.set.pred)
There is no built-in function to plot the SVM classification in more than two dimensions. A plot of how the model partitions the space is given below; it was created by predicting over a fine grid (70 points along each of the three axes, i.e. 70^3 = 343,000 grid points).
library(plotly)   # plot_ly()
interactive <- plot_ly() %>%
add_trace(
x = ~xgrid$Longitude,
y = ~xgrid$Latitude,
z= ~xgrid$OverallAvg,
mode = "markers",
color= ~ xgrid$group.train.set.pred,
opacity=0.05,
text = ~paste("Predicted Group: ", xgrid$group.train.set.pred)) %>%
add_trace(
x = ~test.set$Longitude,
y = ~test.set$Latitude,
z= ~test.set$OverallAvg,
mode = "markers",
color= ~ test.set$Group,
opacity=1,
text = ~paste("Group: ", test.set$Group))%>%
layout(
title = "Predicted 3d space with test observations",
scene = list(
xaxis = list(title = "Longitude"),
yaxis = list(title = "Latitude"),
zaxis = list(title = "OverallAvg")))%>%
layout(annotations=list(yref="paper", xref="paper", y=1.05, x=1.1,text= "Predicted / Actual", showarrow=F))
interactive
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
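The warning appears because there are nine groups but the default Set2 palette supplies only eight colours. One possible workaround (my suggestion, untested against these data) is to pass an explicit nine-colour palette, for example Set1 from RColorBrewer, via the colors argument:
# Possible fix (assumption): use a palette with at least nine colours so the
# nine groups no longer exceed Set2's maximum of eight.
nine.cols <- RColorBrewer::brewer.pal(9, "Set1")   # "Set1" provides 9 colours
interactive2 <- plot_ly(colors = nine.cols) %>%
  add_trace(x = ~xgrid$Longitude, y = ~xgrid$Latitude, z = ~xgrid$OverallAvg,
            mode = "markers", color = ~xgrid$group.train.set.pred, opacity = 0.05) %>%
  add_trace(x = ~test.set$Longitude, y = ~test.set$Latitude, z = ~test.set$OverallAvg,
            mode = "markers", color = ~test.set$Group, opacity = 1)
interactive2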