1 Reading in the Data

library(readxl) # for read_excel()
library(readr)  # for read_csv()

ArcLakeGroupSummary <- read_excel("~/Desktop/EPSRC Project /ArcLakeGroupSummary.xlsx")
dundeedata <- read_csv("~/Desktop/EPSRC Project /dundeedata.csv.xls")

colnames(dundeedata)[1]<-"GloboLakes_ID" # change the GloboLID column name to GloboLakes_ID to make the merge easier.

Data<-merge(ArcLakeGroupSummary, dundeedata, by = "GloboLakes_ID", all = TRUE )
Data<-subset(Data, Group!="NA") # Drop rows with no Group label - the data set is back to the original 732 rows, just with extra columns of information

Data$Group<-as.factor(Data$Group)

2 Scheme for Comparing Classification Methods

In order to compare different models and particular parameter values, I decided to split the data into a training (80%) and test (20%) set in a stratified way. The training set will be used in stratified 5-fold cross-validation to compare the performance of different models. The model that performs best, i.e. has the lowest cross-validation error rate, will then be used to produce a test error rate.
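Concretely, for each candidate model the reported cross-validation error rate is the fold-size-weighted average of the per-fold misclassification rates,

$$\text{CV error} = \sum_{i=1}^{5} \frac{n_i}{n_{\text{train}}}\,\text{err}_i,$$

where $n_i$ is the number of observations in fold $i$, $n_{\text{train}}$ is the size of the training set, and $\text{err}_i$ is the misclassification rate on fold $i$ for the model trained on the remaining four folds.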

An illustration of the stratified scheme used to compare the various models can be found below.

Some brief justifications for the decisions made:

3 For PC1 + PC2

In order to use each method, I first prepare a suitable data frame - splitting it into training and test sets and then splitting the training set into 5 folds.

Data1<-data.frame(Data[,c("Group","PC1","PC2")])

# Split the full data set into training and test sets in a stratified way

set.seed(234)

library(caret)
train.index<-createDataPartition(Data1$Group, p=0.8, list = FALSE)
train.set<-Data1[train.index, ]
test.set<-Data1[-train.index, ]

# Stratify the training set into 5 folds

folds <- createFolds(y=factor(train.set$Group), k = 5, list = FALSE)
train.set$fold <- folds

3.1 LDA & QDA

Originally, I only considered QDA. However, including the performance of LDA may make for an interesting comparison.
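For reference, the two methods differ only in their covariance assumption: both model the predictors within each group as multivariate normal,

$$\text{LDA: } X \mid \text{Group}=k \sim \mathcal{N}(\mu_k, \Sigma), \qquad \text{QDA: } X \mid \text{Group}=k \sim \mathcal{N}(\mu_k, \Sigma_k),$$

so LDA (one shared covariance matrix) gives linear decision boundaries while QDA (one covariance matrix per group) gives quadratic ones.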

  # Using LDA to produce the CV error rate

  library(MASS) # provides lda() and qda()

  CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    lda.fit<-lda(formula = Group~PC1+PC2, data=train.data)
    lda.y <- valid.data$Group
    lda.predy<-predict(lda.fit, valid.data)$class
    
    ith.test.error<- mean(lda.y!=lda.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error) 
  }
  
  sum(CV.error)
## [1] 0.04745763
  # Using QDA to produce the CV error rate

  CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    qda.fit<-qda(formula = Group~PC1+PC2, data=train.data)
    qda.y <- valid.data$Group
    qda.predy<-predict(qda.fit, valid.data)$class
    
    ith.test.error<- mean(qda.y!=qda.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error) 
  }
  
  sum(CV.error)
## [1] 0.04067797

QDA performed better than LDA, with the two methods producing cross-validation error rates of 4.07% and 4.75%, respectively.

3.2 SVM Linear Kernel

In exploring different cost values for the SVM with a linear kernel, I initially tried a couple of hundred values. However, since fitting all of these models was computationally intensive and smaller cost values tended to perform better, many of the larger values are omitted here.
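For context, the cost argument is the constant $C$ in the soft-margin objective solved by libsvm (e1071 handles the multi-class problem by combining binary classifiers one-against-one, but the role of $C$ is the same):

$$\min_{\beta,\,\beta_0,\,\xi}\ \frac{1}{2}\lVert\beta\rVert^{2} + C\sum_{i=1}^{n}\xi_i \qquad \text{subject to } y_i\left(x_i^{\top}\beta + \beta_0\right) \ge 1 - \xi_i,\ \ \xi_i \ge 0,$$

so small values of $C$ tolerate more margin violations (heavier regularisation) while large values fit the training data more tightly.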

  # Searching for the best SVM with linear kernel changing cost

  library(e1071) # provides svm()

  costs<-c(seq(0.05, 0.5, by=0.01 ),1, 10, 50, 100, 150, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
    
  CV.errors<-numeric(length(costs))
    
  for(j in 1:length(costs)){
      
        errors<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~PC1+PC2,data = train.data, kernel="linear", cost=costs[j] ,scale=FALSE)
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    errors<-c(errors,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  CV.errors[j]<-sum(errors)
    
  }
  
  min(CV.errors)  
## [1] 0.01525424
  costs[which.min(CV.errors)]
## [1] 0.27

Of all the cost values considered, the SVM with linear kernel had the lowest CV error rate when the cost was set to 0.27. The CV error rate for this model was 1.53% - a little over a third of the QDA CV error rate.
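As an optional cross-check of the manual loop above, e1071 also provides a tune() helper that performs its own grid search. A minimal sketch is given below; note that tune() uses its default 10-fold cross-validation rather than the stratified 5 folds defined earlier, so the error estimates will not match exactly.

  # Cross-check with e1071's built-in grid search (default 10-fold CV)
  library(e1071)
  tune.out <- tune(svm, Group ~ PC1 + PC2, data = train.set,
                   kernel = "linear", scale = FALSE,
                   ranges = list(cost = c(0.1, 0.27, 1, 10, 100)))
  tune.out$best.parameters  # cost value with the lowest CV error
  tune.out$best.performance # the corresponding CV error estimate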

Below is a plot of different cost values and the CV error rates produced.

  library(ggplot2)
  library(plotly)

  interactive<-ggplot(data = data.frame(cbind(costs, CV.errors)), aes(x=costs, y=CV.errors)) + geom_point()
  ggplotly(interactive)  

Here we retrieve and confirm the best performing SVM with linear kernel.

    # Using SVM linear kernel cost = 0.27 - the best linear kernel SVM model considered.
    
     CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~PC1+PC2,data = train.data, kernel="linear", cost=0.27 ,scale=FALSE)
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
    sum(CV.error)
## [1] 0.01525424

3.3 SVM Polynomial Kernel

Using the SVM with polynomial kernel, low values of degree with moderate costs tended to perform well. The coef0 argument was kept at 0, as changing it did not produce better models. The default gamma value (1 / number of predictors) was also kept, since varying it as well would make the number of models to fit unmanageable - I'll look more deeply into this in the near future.
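For reference, e1071 uses the libsvm polynomial kernel

$$K(u, v) = \left(\gamma\, u^{\top} v + \text{coef0}\right)^{\text{degree}},$$

so with degree = 1 and coef0 = 0 the kernel reduces to a rescaled inner product, i.e. essentially a linear kernel.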

    # Searching for the best SVM with polynomial kernel changing cost and degree

    costs<-c(seq(0.1, 0.5, by=0.2 ),1, 10, 50, 100, 150, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000)
    degrees<-c(1:8)
    
    matrix.errors<-matrix(NA, nrow = length(costs), ncol = length(degrees))
    for(j in 1:length(costs)){
      for(l in 1:length(degrees)){
        
         CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~PC1+PC2, data = train.data, kernel="polynomial", cost=costs[j], degree=degrees[l])
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  matrix.errors[j, l]<-sum(CV.error)
        
      }
    }
    
    min(matrix.errors)
## [1] 0.01525424
    # turn matrix.errors into a column vector
    
    xgrid<-expand.grid(X1=costs, X2=degrees)
    colnames(xgrid)<-c("costs", "degrees")
    
    CV.Errors<-as.vector(matrix.errors)
    
    xgrid<-cbind(xgrid, CV.Errors)
    
    xgrid[which.min(CV.Errors), ]
##    costs degrees  CV.Errors
## 10   300       1 0.01525424

Of all the different combinations of parameter values considered, the best performance occurred when degree was 1 and cost was 300. The CV error rate was 1.53% - identical to the best performing SVM with linear kernel, which is unsurprising since a degree-1 polynomial kernel with coef0 = 0 is essentially a rescaled linear kernel.

Retrieving and confirming the best performing SVM with polynomial kernel.

  # Using SVM polynomial kernel degree = 1, cost = 300 - the best polynomial kernel SVM model considered.
  
      CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~PC1+PC2, data = train.data, kernel="polynomial",degree = 1, cost=300)
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  sum(CV.error)
## [1] 0.01525424

3.4 SVM Radial Kernel

Using the SVM with radial kernel, low values of gamma with moderate values of cost tended to perform well.
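For reference, the radial (RBF) kernel used by e1071 is

$$K(u, v) = \exp\!\left(-\gamma\,\lVert u - v \rVert^{2}\right),$$

so small gamma values give smooth, slowly varying decision boundaries while large values allow the boundary to wrap tightly around individual observations.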

  # Searching for the best SVM with radial kernel changing cost and gamma

    costs<-c(seq(0.1, 0.5, by=0.2 ),1, 10, 50, 100, 150, 200, 300, 400, 500, 40000)
    
    gammas<-c(0.01, seq(0, 0.5, by=0.2),seq(1, 100, by=20), seq(100, 1000, by=200))
  
    matrix.errors<-matrix(NA, nrow = length(costs), ncol = length(gammas))

    for(j in 1:length(costs)){
      for(l in 1:length(gammas)){
              CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~PC1+PC2, data = train.data, kernel="radial", cost=costs[j], gamma=gammas[l])
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  matrix.errors[j, l]<-sum(CV.error)
      }
    }
  min(matrix.errors)
## [1] 0.01525424
  # turn matrix.errors into a column vector
  
    xgrid<-expand.grid(X1=costs, X2=gammas)
    colnames(xgrid)<-c("costs", "gammas")
    
    CV.Errors<-as.vector(matrix.errors)
    
    xgrid<-cbind(xgrid, CV.Errors)
    
    xgrid[which.min(CV.Errors), ]
##    costs gammas  CV.Errors
## 13 40000   0.01 0.01525424

According to the output above, the lowest CV error rate over the grid (1.53%) actually occurred at the extreme values cost = 40000 and gamma = 0.01. The more moderate combination carried forward below, cost = 300 with gamma = 0.2, produced a CV error rate of 2.20%.

    # Using SVM radial kernel cost = 300, gamma = 0.2 - the radial kernel SVM model carried forward for comparison.
  
      CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~PC1+PC2, data = train.data, kernel="radial", cost=300, gamma=0.2)
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  sum(CV.error)
## [1] 0.0220339

3.5 The Best Model

Overall, of the models confirmed above, there was a tie between the SVM (linear kernel, cost = 0.27) and the SVM (polynomial kernel, degree = 1, cost = 300). In the case of this tie, the SVM with linear kernel is the simpler model and is therefore preferred. Now we will train this model on the entire training set and evaluate it on the test set to obtain a test error rate.

    svmfit<-svm(Group~PC1+PC2,data = train.set,kernel="linear",cost=0.27 ,scale=FALSE)

    svmfit
## 
## Call:
## svm(formula = Group ~ PC1 + PC2, data = train.set, kernel = "linear", 
##     cost = 0.27, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.27 
##       gamma:  0.5 
## 
## Number of Support Vectors:  57

A plot of how the model partitioned the space is given below - the observations represented by a cross are the support vectors.

The in-built plot function in the e1071 library plots the SVM classification in an awkward way - PC1 is put on the y-axis and PC2 on the x-axis. I therefore decided to create my own classification plot over a fine grid.

xgrid<-expand.grid(X1=seq(min(Data$PC1), max(Data$PC1), length.out = 150), X2=seq(min(Data$PC2), max(Data$PC2), length.out = 150))

colnames(xgrid)<-c("PC1", "PC2")
group.train.set.pred<-predict(svmfit, xgrid)
xgrid<-cbind(xgrid, group.train.set.pred)

ggplot(xgrid, aes(x=PC1,y=PC2))+
  geom_point(aes(colour=group.train.set.pred), alpha = 1/5)+
  geom_point(data = train.set[-svmfit$index, ], aes(x=PC1, y=PC2, colour=Group))+
  geom_point(data = train.set[svmfit$index, ], aes(x=PC1, y=PC2, colour=Group), shape=4)+
  labs(colour = "Group", title = "Decision Surface With Training Set Observations")

    svm.y<-test.set$Group
    svm.predy<-predict(svmfit, test.set)
    
    mean(svm.y!=svm.predy)
## [1] 0.03521127

The test error rate for this model was 3.52%.

The cross-classification table is given below.

    table(svm.y, svm.predy)
##      svm.predy
## svm.y  1  2  3  4  5  6  7  8  9
##     1 11  0  0  0  0  0  0  0  0
##     2  0  8  0  0  0  0  0  0  0
##     3  0  0 15  0  0  0  0  0  0
##     4  0  0  0 23  1  0  0  0  0
##     5  0  0  0  1 44  0  0  0  3
##     6  0  0  0  0  0  8  0  0  0
##     7  0  0  0  0  0  0  3  0  0
##     8  0  0  0  0  0  0  0  5  0
##     9  0  0  0  0  0  0  0  0 20

4 out of the 5 errors were observations of class 5 being misclassified as either group 4 or 9.
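A quick way to quantify this is to convert the table into per-class misclassification rates; the short sketch below reuses svm.y and svm.predy from the chunk above.

    # Per-class misclassification rate: 1 minus the proportion of each true
    # class that falls on the diagonal of the cross-classification table
    tab <- table(svm.y, svm.predy)
    round(1 - diag(tab) / rowSums(tab), 3)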

ggplot(xgrid, aes(x=PC1,y=PC2))+
  geom_point(aes(colour=group.train.set.pred), alpha = 1/5)+
  geom_point(data = test.set, aes(x=PC1, y=PC2, colour=Group))+
  labs(colour = "Group", title = "Decision Surface With Test Observations")

4 For Longitude + Latitude + OverallAvg

In order to use each model, I prepare a suitable data frame - splitting it into training and test sets and then splitting the training set into 5 folds.

Data2<-data.frame(Data[, c("Group", "Latitude", "Longitude", "OverallAvg")])

# Split the full data set into training and test sets in a stratified way

set.seed(234)

library(caret)
train.index<-createDataPartition(Data2$Group, p=0.8, list = FALSE)
train.set<-Data2[train.index, ]
test.set<-Data2[-train.index, ]

# Stratify the training set into 5 folds

folds <- createFolds(y=factor(train.set$Group), k = 5, list = FALSE)
train.set$fold <- folds

4.1 LDA & QDA

Here we again fit both LDA and QDA; the comparison may be insightful.

  # Using LDA to produce the CV error rate

  CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    lda.fit<-lda(formula = Group~ Longitude + Latitude + OverallAvg, data=train.data)
    lda.y <- valid.data$Group
    lda.predy<-predict(lda.fit, valid.data)$class
    
    ith.test.error<- mean(lda.y!=lda.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error) 
  }
  
  sum(CV.error)
## [1] 0.08983051
  # Using QDA to produce the CV error rate

  CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    qda.fit<-qda(formula = Group~ Longitude + Latitude + OverallAvg, data=train.data)
    qda.y <- valid.data$Group
    qda.predy<-predict(qda.fit, valid.data)$class
    
    ith.test.error<- mean(qda.y!=qda.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error) 
  }
  
  sum(CV.error)
## [1] 0.05932203

As seen in the output above, QDA performed much better than LDA, with the two models producing cross-validation error rates of 5.93% and 8.98%, respectively. The sizeable gap between the two rates suggests the group boundaries are far from linear, so more flexible models may be expected to perform better here.

4.2 SVM Linear Kernel

Using an SVM with linear kernel, low values of cost tended to perform well.

  # Searching for the best SVM with linear kernel changing cost

  costs<-c(seq(0.05, 0.5, by=0.01 ),1, 10, 50, 100, 150, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
    
  CV.errors<-numeric(length(costs))
    
  for(j in 1:length(costs)){
      
        errors<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data, kernel="linear", cost=costs[j] ,scale=FALSE)
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    errors<-c(errors,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  CV.errors[j]<-sum(errors)
    
  }
  
  min(CV.errors)  
## [1] 0.0559322
  costs[which.min(CV.errors)]
## [1] 10

Of all the cost values considered, the SVM with linear kernel had the lowest CV error rate when the cost was set to 10. The CV error rate for this model was 5.59% - slightly better than the performance of QDA.

Below is a plot of different cost values and the CV error rates produced.

  interactive<-ggplot(data = data.frame(cbind(costs, CV.errors)), aes(x=costs, y=CV.errors)) + geom_point()
  ggplotly(interactive)  

Retrieving and confirming the best performing SVM with linear kernel.

    # Using SVM linear kernel cost = 10 - the best linear kernel SVM model considered.
    
     CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data, kernel="linear", cost=10 ,scale=FALSE)
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
    sum(CV.error)
## [1] 0.0559322

4.3 SVM Polynomial Kernel

Using the SVM with polynomial kernel, low values of degree with moderate costs tended to perform well. The treatment of coef0 and gamma is the same as described previously.

    # Searching for the best SVM with polynomial kernel changing cost and degree

    costs<-c(seq(0.1, 0.5, by=0.2 ),1, 10, 50, 100, 150, 200, 300, 400, 500, 600, 700, 800, 900, 1000, seq(1000, 3000, by=500))
    degrees<-c(1:8)
    
    matrix.errors<-matrix(NA, nrow = length(costs), ncol = length(degrees))
    for(j in 1:length(costs)){
      for(l in 1:length(degrees)){
        
         CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~ Longitude + Latitude + OverallAvg,data = train.data, kernel="polynomial", cost=costs[j], degree=degrees[l])
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  matrix.errors[j, l]<-sum(CV.error)
        
      }
    }
    
    min(matrix.errors)
## [1] 0.04576271
    # turn matrix.errors into a column vector
    
    xgrid<-expand.grid(X1=costs, X2=degrees)
    colnames(xgrid)<-c("costs", "degrees")
    
    CV.Errors<-as.vector(matrix.errors)
    
    xgrid<-cbind(xgrid, CV.Errors)
    
    xgrid[which.min(CV.Errors), ]
##    costs degrees  CV.Errors
## 12   500       1 0.04576271

Of all the different combinations of parameter values considered, the best performance occurred when degree was 1 and cost was 500. The CV error rate was 4.58%.

Retrieving and confirming the best performing SVM with polynomial kernel.

  # Using SVM polynomial kernel degree = 1, cost = 500 - the best polynomial kernel SVM model considered.
  
      CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data, kernel="polynomial",degree = 1, cost=500)
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  sum(CV.error)
## [1] 0.04576271

4.4 SVM Radial Kernel

Using the SVM with radial kernel, low values of gamma with moderate values of cost tended to perform well.

  # Searching for the best SVM with radial kernel changing cost and gamma
  # After varying cost and gamma, moderate-to-high costs with low gammas appear to be good choices.

    costs<-c(seq(0.1, 0.5, by=0.2 ),1, 10, 50, 100, 150, 200, 300, 400, 500)
    
    gammas<-c(seq(0, 0.5, by=0.2),seq(1, 100, by=20), seq(100, 1000, by=200))
  
    matrix.errors<-matrix(NA, nrow = length(costs), ncol = length(gammas))

    for(j in 1:length(costs)){
      for(l in 1:length(gammas)){
              CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data, kernel="radial", cost=costs[j], gamma=gammas[l])
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  matrix.errors[j, l]<-sum(CV.error)
      }
    }
  min(matrix.errors)
## [1] 0.04915254
  # turn matrix.errors into a column vector
  
    xgrid<-expand.grid(X1=costs, X2=gammas)
    colnames(xgrid)<-c("costs", "gammas")
    
    CV.Errors<-as.vector(matrix.errors)
    
    xgrid<-cbind(xgrid, CV.Errors)
    
    xgrid[which.min(CV.Errors), ]
##    costs gammas  CV.Errors
## 22   300    0.2 0.04915254

Of all the different combinations of parameter values considered, the best performance occurred when gamma was 0.2 and cost was 300 - the same combination carried forward in the PC1 & PC2 analysis. The CV error rate here was 4.92%.

Retrieving and confirming the best performing SVM with radial kernel.

    # Using SVM radial kernel cost = 300, gamma = 0.2 - the best radial kernel SVM model considered.
  
      CV.error<-NULL 
  
    for (i in 1:5) { 
    valid.data <- subset(train.set, fold == i)
    train.data <- subset(train.set, fold != i) 
    
    svmfit<-svm(Group~ Longitude + Latitude + OverallAvg, data = train.data, kernel="radial", cost=300, gamma=0.2)
    svm.y<-valid.data$Group
    svm.predy<-predict(svmfit, valid.data)
    
    ith.test.error<- mean(svm.y!=svm.predy) 
    CV.error<-c(CV.error,(nrow(valid.data)/nrow(train.set))*ith.test.error)  
  }
  
  sum(CV.error)
## [1] 0.04915254

4.5 The Best Model

Overall, the best performing model was the SVM (polynomial kernel, degree = 1, cost = 500). Now we will train this model on the entire training set and evaluate it on the test set to obtain the test error rate.

    svmfit<-svm(Group ~ Longitude + Latitude + OverallAvg,data = train.set, kernel="polynomial", degree=1, cost=500)
    svm.y<-test.set$Group
    svm.predy<-predict(svmfit, test.set)
    
    mean(svm.y!=svm.predy)
## [1] 0.04929577

The test error rate for this model was 4.93%. The cross-classification table is given below.

    table(svm.y, svm.predy)
##      svm.predy
## svm.y  1  2  3  4  5  6  7  8  9
##     1 11  0  0  0  0  0  0  0  0
##     2  0  8  0  0  0  0  0  0  0
##     3  0  0 15  0  0  0  0  0  0
##     4  0  0  0 24  0  0  0  0  0
##     5  0  0  0  3 44  0  0  0  1
##     6  0  0  1  0  0  7  0  0  0
##     7  0  0  0  0  0  0  2  1  0
##     8  0  0  0  0  0  0  0  5  0
##     9  0  0  0  0  1  0  0  0 19

Interestingly, 4 out of the 7 errors were observations of class 5 being misclassified as either group 4 or 9. High misclassification of class 5 was also evident when PC1 and PC2 were used as explanatory variables.

    svmfit<-svm(Group ~ Longitude + Latitude + OverallAvg,data = train.set, kernel="polynomial", degree=1, cost=500)

xgrid<-expand.grid(X1=seq(min(Data$Longitude), max(Data$Longitude), length.out = 70), X2=seq(min(Data$Latitude), max(Data$Latitude), length.out = 70), X3=seq(min(Data$OverallAvg), max(Data$OverallAvg), length.out = 70))

colnames(xgrid)<-c("Longitude", "Latitude", "OverallAvg")
group.train.set.pred<-predict(svmfit, xgrid)
xgrid<-cbind(xgrid, group.train.set.pred)

There is no in-built function for plotting the full SVM classification in three dimensions, so the plot of how the model partitions the space, given below, was created by predicting over the fine grid defined above.

interactive <- plot_ly() %>% 
  add_trace(
    x = ~xgrid$Longitude, 
    y = ~xgrid$Latitude, 
    z= ~xgrid$OverallAvg, 
    type = "scatter3d",
    mode = "markers",
    color= ~ xgrid$group.train.set.pred, 
    opacity=0.05, 
    text = ~paste("Predicted Group: ", xgrid$group.train.set.pred)) %>% 
  add_trace(
    x = ~test.set$Longitude, 
    y = ~test.set$Latitude, 
    z= ~test.set$OverallAvg, 
    type = "scatter3d",
    mode = "markers", 
    color= ~ test.set$Group, 
    opacity=1, 
    text = ~paste("Group: ", test.set$Group))%>%
  layout(
    title = "Predicted 3d space with test observations",
    scene = list(
      xaxis = list(title = "Longitude"),
      yaxis = list(title = "Latitude"),
      zaxis = list(title = "OverallAvg")))%>% 
  layout(annotations=list(yref="paper", xref="paper", y=1.05, x=1.1,text= "Predicted / Actual", showarrow=F))

interactive
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

If you deselect particular items in the legend, you can see the misclassifications given in the cross-classification table above.

5 Conclusion

A very brief conclusion / overview:

Using PC1 & PC2 as explanatory variables, SVM (linear kernel, cost = 0.27) had the lowest CV error rate (1.53%) and had a test error rate of 3.52%.

Using Longitude, Latitude & OverallAvg as explanatory variables, SVM (polynomial kernel, degree = 1, cost = 500) had the lowest CV error rate (4.58%) and had a test error rate of 4.93%.