library(ggplot2)
library(stringr)
library(DT)
library(corrplot)
library(leaflet)
library(klaR)
library(plotly)
library(plyr)
library(caret)
library(MASS) # for LDA and QDA
library(readxl)
Data <- read_excel("~/Desktop/EPSRC Project /ArcLakeGroupSummary.xlsx")
The data set consists of 732 observations with 9 attributes: 8 continuous and 1 factor. Since the total number of observations is not very large, it seems sensible to use some form of cross validation for this problem, e.g. leave-one-out, K-fold or random subsampling.
Data$Group<-factor(Data$Group)
a <- ggplot(Data, aes(x=as.numeric(Group), fill=Group)) + geom_bar(position = "dodge")+
labs(x="Group" , y="Number of Observations",title="Histogram of Observations in each Group") +
scale_x_continuous(breaks = c(1,2,3,4,5,6,7,8,9))
ggplotly(a)
Data$Group<-as.numeric(Data$Group)
There are 9 classification groups with an unequal number of observations in each, which may cause some technical difficulties for classification. As can be seen in the histogram above, groups 7 and 8 have very few observations, with 19 and 29 respectively, while group 5 has a relatively large number (244). As a result, we need to be careful whenever partitioning the data: a purely random split into training and test sets could produce incomparable sets, e.g. a test set containing no observations from group 7. Consequently, we may want to stratify any partition we make to prevent this.
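As a minimal sketch of what such a stratified split could look like, using createDataPartition from the caret package loaded above (the 70/30 split proportion here is an arbitrary illustration, not a choice used later in this report):

set.seed(234)
# sampling is done within each group, so all 9 groups appear in both sets
train.idx <- createDataPartition(factor(Data$Group), p = 0.7, list = FALSE)
train <- Data[train.idx, ]
test  <- Data[-train.idx, ]
table(test$Group)  # even group 7, with its 19 lakes, is represented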
In this section we will explore how a few potential classification methods might perform when considering only the first two principal components, for simplicity. This follows on from the exploratory data analysis carried out previously.
Data$Group<-factor(Data$Group)
ggplot(Data, aes(PC1, PC2, colour=Group)) +
geom_point() + stat_ellipse() + labs(title="PC1 vs. PC2")
Data$Group<-as.numeric(Data$Group)
As can be seen above, considering only the first two principal components as features gives groups with a wide variety of shapes: groups 4 and 5 are very elliptical, whilst groups 3 and 7 are fairly circular. There are also overlapping groups and outliers. Consequently, one would expect the more flexible classification methods to produce lower in-sample and out-of-sample error rates here; for example, QDA and KNN may perform better than LDA because their class boundaries are more flexible.
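As a rough in-sample sketch of this expectation (indicative only, since in-sample error understates out-of-sample error; d is a hypothetical working copy of the data):

d <- Data
d$Group <- factor(d$Group)               # lda()/qda() expect a factor response
lda.fit <- lda(Group~PC1+PC2, data = d)
qda.fit <- qda(Group~PC1+PC2, data = d)
mean(predict(lda.fit)$class != d$Group)  # in-sample error rate, LDA
mean(predict(qda.fit)$class != d$Group)  # in-sample error rate, QDA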
For this classification problem, there are a number of loss functions we could use to provide us with an error metric. However, for simplicity we will only consider the frequently used 0-1 loss function.
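Under 0-1 loss a prediction costs 1 when it is wrong and 0 when it is right, so the average loss is simply the misclassification rate. A one-line helper (hypothetical, for illustration):

zero.one.error <- function(truth, pred) mean(truth != pred)  # proportion misclassified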
There are a number of ways in which we can split our data in preparation for using our classification methods. Two of the main approaches are to split the data into training, validation and test sets, or to use cross validation. As the data set is of medium size, we would prefer cross validation to get the most out of our data.
In terms of cross validation, we are going to compare 5-fold, 10-fold, 20-fold, 50-fold and 100-fold cross validation. Leave-one-out was not considered because it is too computationally intensive. Furthermore, K > 100 is not considered here because stratified cross validation breaks down: with that many folds they can no longer be properly stratified.
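A quick check makes this breakdown concrete: the smallest group size caps how many folds can each contain every class (a sketch, reusing the group sizes quoted earlier):

table(Data$Group)       # group sizes; groups 7 and 8 have only 19 and 29 lakes
min(table(Data$Group))  # 19: with far more folds than this, some folds must miss group 7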
In addition, we will compare the effects of stratifying and not stratifying the folds.
It has been suggested in the literature that stratified 10-fold cross validation tends to work well in practice.
We have filtered our variables, as we originally had many more; so, in a sense, we are working with cherry-picked variables. Because this selection happened outside the cross validation loop, our cross validation error rates may be optimistic rather than conservative estimates; see page 245 of The Elements of Statistical Learning.
We must take care when using the principal components, as they were created in a special way (functional principal component (FPC) analysis was applied to smooth seasonal curves), so interpretation may be difficult.
In this analysis we will focus on quadratic discriminant analysis (QDA).
In the QDA predictions we will compare the following: non-stratified versus stratified folds; K = 5, 10, 20, 50 and 100; and the first two principal components versus the original continuous variables as features. Only the continuous variables were considered as features; consequently, Type was not used.
To keep the code concise and tidy, I created a function to perform non-stratified K-fold cross validation for QDA.
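Both cross validation functions below compute the weighted estimate described in The Elements of Statistical Learning; the weighting matters because the folds need not all contain the same number of observations. If fold $k$ holds $n_k$ of the $n$ observations and has misclassification rate $\mathrm{err}_k$, the estimate and its standard deviation are

$$\widehat{\mathrm{CV}} = \sum_{k=1}^{K} \frac{n_k}{n}\,\mathrm{err}_k, \qquad \widehat{\mathrm{sd}} = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\left(\mathrm{err}_k - \overline{\mathrm{err}}\right)^2},$$

where $\overline{\mathrm{err}}$ is the mean of the $K$ fold error rates.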
cv.qda <- function(data=Data, model=Group~., y="Group", K=10, seed=234) {
  # Preparation
  n <- nrow(data)  # number of observations
  set.seed(seed)   # fix the seed so the folds are reproducible
  # Partition the data into K subsets: repeat 1:K enough times to cover n
  # observations, then sample n of these labels (so the folds may differ in size)
  Ceil <- ceiling(n/K)
  Fold <- sample(rep(1:K, Ceil), n)
  data <- cbind(data, Fold)  # attach a fold column to the data
  CV.error <- NULL           # weighted error contribution of each fold
  all.k.fold.errors <- NULL  # raw error rate of each fold, used for the standard deviation
  # Build the model K times
  for (i in 1:K) {
    test.data <- subset(data, Fold == i)   # the ith fold is the test data
    train.data <- subset(data, Fold != i)  # the remaining folds are the training data
    qda.fit <- qda(formula = model, data = train.data)  # fit QDA on the training data
    qda.y <- test.data[, y]                         # observed groups in the test fold
    qda.predy <- predict(qda.fit, test.data)$class  # predicted groups for the test fold
    ith.test.error <- mean(qda.y != qda.predy)      # error rate on the ith fold
    all.k.fold.errors <- c(all.k.fold.errors, ith.test.error)
    # Weight each fold's error by its share of the data, since the folds need not all
    # be the same size; summing these gives the CV estimate from The Elements of
    # Statistical Learning (see https://www.youtube.com/watch?v=nZAM5OXrktY, 10:55)
    CV.error <- c(CV.error, (nrow(test.data)/n) * ith.test.error)
  }
  # Sample standard deviation of the K fold error rates
  SD.CV.error <- sqrt(sum((all.k.fold.errors - mean(all.k.fold.errors))^2) / (K - 1))
  # Output
  list(call = model, K = K, qda_error_rate = sum(CV.error),
       qda_sd_error_rate = SD.CV.error, seed = seed)
}
}
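As a quick illustration, the function can be called like so (this is the same call used for one of the tabulated 10-fold entries further below):

cv.qda(data = Data, model = Group~PC1+PC2, y = "Group", K = 10, seed = 234)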
To keep the code concise and tidy, I also created a function to perform stratified K-fold cross validation for QDA.

strat.cv.qda <- function(data=Data, model=Group~., y="Group", K=10, seed=234) {
  # Preparation
  n <- nrow(data)  # number of observations
  set.seed(seed)   # fix the seed so the folds are reproducible
  # createFolds samples within the levels of a factor y, in an attempt to balance
  # the class distributions across the K folds
  folds <- createFolds(y = factor(data$Group), k = K, list = FALSE)
  data$fold <- folds
  data <- as.data.frame(data)
  CV.error <- NULL           # weighted error contribution of each fold
  all.k.fold.errors <- NULL  # raw error rate of each fold, used for the standard deviation
  # Build the model K times
  for (i in 1:K) {
    test.data <- subset(data, fold == i)   # the ith fold is the test data
    train.data <- subset(data, fold != i)  # the remaining folds are the training data
    qda.fit <- qda(formula = model, data = train.data)  # fit QDA on the training data
    qda.y <- test.data[, y]                         # observed groups in the test fold
    qda.predy <- predict(qda.fit, test.data)$class  # predicted groups for the test fold
    ith.test.error <- mean(qda.y != qda.predy)      # error rate on the ith fold
    all.k.fold.errors <- c(all.k.fold.errors, ith.test.error)
    # weight each fold's error by its share of the data, exactly as in cv.qda above
    CV.error <- c(CV.error, (nrow(test.data)/n) * ith.test.error)
  }
  # Sample standard deviation of the K fold error rates
  SD.CV.error <- sqrt(sum((all.k.fold.errors - mean(all.k.fold.errors))^2) / (K - 1))
  # Output
  list(call = model, K = K, qda_error_rate = sum(CV.error),
       qda_sd_error_rate = SD.CV.error, seed = seed)
}
We will need a matrix to store the error rates that the various methods produce using the principal components as features. We can also plot this information.
Matrix<-matrix(data=NA, nrow = 10, ncol = 4)
Matrix<-as.data.frame(Matrix)
colnames(Matrix)<-c("Folds", "Error", "Est.Variance", "Partitioning")
Matrix[1:5,4]<-"Non-Stratified"
Matrix[6:10,4]<-"Stratified"
Ks <- c(5, 10, 20, 50, 100)
# Non-Stratified
for (j in seq_along(Ks)) {
  info <- cv.qda(data = Data, model = Group~PC1+PC2, y = "Group", K = Ks[j], seed = 234)
  Matrix[j, "Folds"] <- Ks[j]
  Matrix[j, "Error"] <- info$qda_error_rate
  # squaring the estimated standard deviation gives the estimated variance
  Matrix[j, "Est.Variance"] <- info$qda_sd_error_rate^2
}
# Stratified
for (j in seq_along(Ks)) {
  info <- strat.cv.qda(data = Data, model = Group~PC1+PC2, y = "Group", K = Ks[j], seed = 234)
  Matrix[j + 5, "Folds"] <- Ks[j]
  Matrix[j + 5, "Error"] <- info$qda_error_rate
  Matrix[j + 5, "Est.Variance"] <- info$qda_sd_error_rate^2
}
ggplot(Matrix, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
  geom_line() + geom_point() +
  # error bars show the error rate plus/minus the estimated variance
  geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5) +
  labs(y="Error Rate", title="K-fold Cross Validation for QDA using the 2 Principal Components") +
  scale_x_continuous(breaks = seq(0, 100, by = 10))
Matrix
## Folds Error Est.Variance Partitioning
## 1 5 0.05054645 5.710179e-05 Non-Stratified
## 2 10 0.05054645 1.713665e-03 Non-Stratified
## 3 20 0.05191257 1.696827e-03 Non-Stratified
## 4 50 0.04918033 4.153454e-03 Non-Stratified
## 5 100 0.04644809 6.532396e-03 Non-Stratified
## 6 5 0.04781421 3.531916e-04 Stratified
## 7 10 0.04781421 8.897777e-04 Stratified
## 8 20 0.04644809 1.631997e-03 Stratified
## 9 50 0.04644809 2.733969e-03 Stratified
## 10 100 0.04918033 7.539460e-03 Stratified
We will also need a matrix to store the error rates that the various methods produce using the original continuous variables as features. Again, we can plot this information.
Matrix.1<-matrix(data=NA, nrow = 10, ncol = 4)
Matrix.1<-as.data.frame(Matrix.1)
colnames(Matrix.1)<-c("Folds", "Error", "Est.Variance", "Partitioning")
Matrix.1[1:5,4]<-"Non-Stratified"
Matrix.1[6:10,4]<-"Stratified"
# Non-Stratified (reusing Ks = c(5, 10, 20, 50, 100) from above)
for (j in seq_along(Ks)) {
  info <- cv.qda(data = Data,
                 model = Group~Latitude+Longitude+Elevation+LakeSize+OverallAvg+OverallMeanAmp,
                 y = "Group", K = Ks[j], seed = 234)
  Matrix.1[j, "Folds"] <- Ks[j]
  Matrix.1[j, "Error"] <- info$qda_error_rate
  Matrix.1[j, "Est.Variance"] <- info$qda_sd_error_rate^2
}
# Stratified
for (j in seq_along(Ks)) {
  info <- strat.cv.qda(data = Data,
                       model = Group~Latitude+Longitude+Elevation+LakeSize+OverallAvg+OverallMeanAmp,
                       y = "Group", K = Ks[j], seed = 234)
  Matrix.1[j + 5, "Folds"] <- Ks[j]
  Matrix.1[j + 5, "Error"] <- info$qda_error_rate
  Matrix.1[j + 5, "Est.Variance"] <- info$qda_sd_error_rate^2
}
ggplot(Matrix.1, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
  geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5) +
  geom_line() + geom_point() +
  labs(y="Error Rate", title="K-fold Cross Validation for QDA using the Original Variables") +
  scale_x_continuous(breaks = seq(0, 100, by = 10))
Matrix.1
## Folds Error Est.Variance Partitioning
## 1 5 0.09562842 0.0003345115 Non-Stratified
## 2 10 0.09699454 0.0019399860 Non-Stratified
## 3 20 0.09699454 0.0025963501 Non-Stratified
## 4 50 0.09972678 0.0083510953 Non-Stratified
## 5 100 0.09562842 0.0114167266 Non-Stratified
## 6 5 0.09153005 0.0001084575 Stratified
## 7 10 0.09289617 0.0008829981 Stratified
## 8 20 0.09426230 0.0021670630 Stratified
## 9 50 0.10109290 0.0066579763 Stratified
## 10 100 0.09699454 0.0136770966 Stratified
In conclusion, of the methods considered, I would recommend stratified 5-fold cross validation with the 2 principal components as features.
3.2 Comments on the Results of Prediction
On the whole, for a low to moderate number of folds (5 to 50), stratified cross validation outperforms non-stratified cross validation in terms of both error rate and estimated variance.
Using the 2 principal components, the stratified CV error rates for 5-, 10-, 20- and 50-fold cross validation are lower than the corresponding non-stratified rates; using the original variables, the same holds for 5-, 10- and 20-fold cross validation.
Using the 2 principal components, the stratified CV estimated variances for 10-, 20- and 50-fold cross validation are lower than the corresponding non-stratified variances; similarly, using the original variables, this holds for 5-, 10-, 20- and 50-fold cross validation.
As expected, larger values of K result in larger estimated variances in almost all cases: each fold then contains fewer observations, so the individual fold error rates are noisier.
Using the 2 principal components as features gives lower error rates than using the original continuous variables. The largest error rate using the 2 principal components (0.05191257) is little more than half the lowest error rate using the original variables (0.09153005).
In the 2 principal components case, stratified 5-fold cross validation appears to be one of the best methods: its error rate is amongst the lowest and its estimated variance very small. In the original-variable case, stratified 5-fold cross validation performs best in terms of both error rate and estimated variance.
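For completeness, a minimal sketch of acting on this recommendation, i.e. fitting the chosen QDA model on all observations (the object name final.qda is illustrative, not from the analysis above):

Data$Group <- factor(Data$Group)              # qda() expects a factor response
final.qda <- qda(Group~PC1+PC2, data = Data)  # fit the recommended model on all 732 lakes
head(predict(final.qda, Data)$class)          # predicted groups for the first few lakes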