library(ggplot2)
library(stringr)
library(DT)
library(corrplot)
library(leaflet)
library(klaR)
library(plotly)
library(plyr)
library(caret)
library(MASS) # for LDA and QDA
library(readxl)
Data <- read_excel("~/Desktop/EPSRC Project /ArcLakeGroupSummary.xlsx")
The data set consists of 732 observations with 9 attributes: 8 continuous and 1 factor. Since the total number of observations is not very large, it seems sensible to use some form of cross validation for this problem, e.g. leave-one-out, K-fold or random subsampling.
Data$Group<-factor(Data$Group)
a <- ggplot(Data, aes(x=as.numeric(Group), fill=Group)) + geom_bar(position = "dodge")+
labs(x="Group" , y="Number of Observations",title="Histogram of Observations in each Group") +
scale_x_continuous(breaks = c(1,2,3,4,5,6,7,8,9))
ggplotly(a)
Data$Group<-as.numeric(Data$Group)
There are 9 classification groups with an unequal number of observations in each, which may cause some technical difficulties for classification. As can be seen in the histogram above, groups 7 and 8 have very few observations, with 19 and 29 respectively, while group 5 has a relatively large number (244). As a result, we need to be careful whenever partitioning the data: a purely random split into training and test sets could produce incomparable sets, e.g. a test set containing no observations from group 7. Consequently, we may want to stratify any partition we make to prevent this.
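As a minimal sketch of what such a stratified split could look like, using createDataPartition from the caret package loaded above (the 70/30 split proportion here is an arbitrary illustration, not a choice used later in this report):

set.seed(234)
# sampling is done within each group, so all 9 groups appear in both sets
train.idx <- createDataPartition(factor(Data$Group), p = 0.7, list = FALSE)
train <- Data[train.idx, ]
test  <- Data[-train.idx, ]
table(test$Group)  # even group 7, with its 19 lakes, is represented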
In this section we will explore how a few potential classification methods might perform when considering only the first two principal components, for simplicity. This follows on from the exploratory data analysis carried out previously.
Data$Group<-factor(Data$Group)
ggplot(Data, aes(PC1, PC2, colour=Group)) +
geom_point() + stat_ellipse() + labs(title="PC1 vs. PC2")
Data$Group<-as.numeric(Data$Group)
As can be seen above, considering only the first two principal components as features gives groups with a wide variety of shapes: groups 4 and 5 are very elliptical, whilst groups 3 and 7 are fairly circular. There are also overlapping groups and outliers. Consequently, one would expect the more flexible classification methods to produce lower in-sample and out-of-sample error rates here; for example, QDA and KNN may perform better than LDA because their class boundaries are more flexible.
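As a rough in-sample sketch of this expectation (indicative only, since in-sample error understates out-of-sample error; d is a hypothetical working copy of the data):

d <- Data
d$Group <- factor(d$Group)               # lda()/qda() expect a factor response
lda.fit <- lda(Group~PC1+PC2, data = d)
qda.fit <- qda(Group~PC1+PC2, data = d)
mean(predict(lda.fit)$class != d$Group)  # in-sample error rate, LDA
mean(predict(qda.fit)$class != d$Group)  # in-sample error rate, QDA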
For this classification problem, there are a number of loss functions we could use to provide us with an error metric. However, for simplicity we will only consider the frequently used 0-1 loss function.
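Under 0-1 loss a prediction costs 1 when it is wrong and 0 when it is right, so the average loss is simply the misclassification rate. A one-line helper (hypothetical, for illustration):

zero.one.error <- function(truth, pred) mean(truth != pred)  # proportion misclassified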
There are a number of ways in which we can split our data in preparation for using our classification methods. Two of the main approaches are to split the data into training, validation and test sets, or to use cross validation. As the data set is of medium size, we would prefer cross validation to get the most out of our data.
In terms of cross validation, we are going to compare 5-fold, 10-fold, 20-fold, 50-fold and 100-fold cross validation. Leave-one-out was not considered because it is too computationally intensive. Furthermore, K > 100 is not considered here because stratified cross validation breaks down: with that many folds they can no longer be properly stratified.
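A quick check makes this breakdown concrete: the smallest group size caps how many folds can each contain every class (a sketch, reusing the group sizes quoted earlier):

table(Data$Group)       # group sizes; groups 7 and 8 have only 19 and 29 lakes
min(table(Data$Group))  # 19: with far more folds than this, some folds must miss group 7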
In addition, we will compare the effects of stratifying and not stratifying the folds.
It has been suggested in the literature that stratified 10-fold cross validation tends to work well in practice.
We have filtered our variables, as we originally had many more; so, in a sense, we are working with cherry-picked variables. Because this selection happened outside the cross validation loop, our cross validation error rates may be optimistic rather than conservative estimates; see page 245 of The Elements of Statistical Learning.
We must take care when using the principal components, as they were created in a special way (functional principal component (FPC) analysis was applied to smooth seasonal curves), so interpretation may be difficult.
In this analysis we will focus on quadratic discriminant analysis (QDA).
In the QDA predictions we will compare the following: non-stratified versus stratified folds; K = 5, 10, 20, 50 and 100; and the first two principal components versus the original continuous variables as features. Only the continuous variables were considered as features; consequently, Type was not used.
To keep the code concise and tidy, I created a function to perform non-stratified K-fold cross validation for QDA.
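Both cross validation functions below compute the weighted estimate described in The Elements of Statistical Learning; the weighting matters because the folds need not all contain the same number of observations. If fold $k$ holds $n_k$ of the $n$ observations and has misclassification rate $\mathrm{err}_k$, the estimate and its standard deviation are

$$\widehat{\mathrm{CV}} = \sum_{k=1}^{K} \frac{n_k}{n}\,\mathrm{err}_k, \qquad \widehat{\mathrm{sd}} = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\left(\mathrm{err}_k - \overline{\mathrm{err}}\right)^2},$$

where $\overline{\mathrm{err}}$ is the mean of the $K$ fold error rates.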
cv.qda <- function(data=Data, model=Group~., y="Group", K=10, seed=234) {
  # Preparation
  n <- nrow(data)  # number of observations
  set.seed(seed)   # fix the seed so the folds are reproducible
  # Partition the data into K subsets: repeat 1:K enough times to cover n
  # observations, then sample n of these labels (so the folds may differ in size)
  Ceil <- ceiling(n/K)
  Fold <- sample(rep(1:K, Ceil), n)
  data <- cbind(data, Fold)  # attach a fold column to the data
  CV.error <- NULL           # weighted error contribution of each fold
  all.k.fold.errors <- NULL  # raw error rate of each fold, used for the standard deviation
  # Build the model K times
  for (i in 1:K) {
    test.data <- subset(data, Fold == i)   # the ith fold is the test data
    train.data <- subset(data, Fold != i)  # the remaining folds are the training data
    qda.fit <- qda(formula = model, data = train.data)  # fit QDA on the training data
    qda.y <- test.data[, y]                         # observed groups in the test fold
    qda.predy <- predict(qda.fit, test.data)$class  # predicted groups for the test fold
    ith.test.error <- mean(qda.y != qda.predy)      # error rate on the ith fold
    all.k.fold.errors <- c(all.k.fold.errors, ith.test.error)
    # Weight each fold's error by its share of the data, since the folds need not all
    # be the same size; summing these gives the CV estimate from The Elements of
    # Statistical Learning (see https://www.youtube.com/watch?v=nZAM5OXrktY, 10:55)
    CV.error <- c(CV.error, (nrow(test.data)/n) * ith.test.error)
  }
  # Sample standard deviation of the K fold error rates
  SD.CV.error <- sqrt(sum((all.k.fold.errors - mean(all.k.fold.errors))^2) / (K - 1))
  # Output
  list(call = model, K = K, qda_error_rate = sum(CV.error),
       qda_sd_error_rate = SD.CV.error, seed = seed)
}
}
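As a quick illustration, the function can be called like so (this is the same call used for one of the tabulated 10-fold entries further below):

cv.qda(data = Data, model = Group~PC1+PC2, y = "Group", K = 10, seed = 234)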
To keep the code concise and tidy, I also created a function to perform stratified K-fold cross validation for QDA.

strat.cv.qda <- function(data=Data, model=Group~., y="Group", K=10, seed=234) {
  # Preparation
  n <- nrow(data)  # number of observations
  set.seed(seed)   # fix the seed so the folds are reproducible
  # createFolds samples within the levels of a factor y, in an attempt to balance
  # the class distributions across the K folds
  folds <- createFolds(y = factor(data$Group), k = K, list = FALSE)
  data$fold <- folds
  data <- as.data.frame(data)
  CV.error <- NULL           # weighted error contribution of each fold
  all.k.fold.errors <- NULL  # raw error rate of each fold, used for the standard deviation
  # Build the model K times
  for (i in 1:K) {
    test.data <- subset(data, fold == i)   # the ith fold is the test data
    train.data <- subset(data, fold != i)  # the remaining folds are the training data
    qda.fit <- qda(formula = model, data = train.data)  # fit QDA on the training data
    qda.y <- test.data[, y]                         # observed groups in the test fold
    qda.predy <- predict(qda.fit, test.data)$class  # predicted groups for the test fold
    ith.test.error <- mean(qda.y != qda.predy)      # error rate on the ith fold
    all.k.fold.errors <- c(all.k.fold.errors, ith.test.error)
    # weight each fold's error by its share of the data, exactly as in cv.qda above
    CV.error <- c(CV.error, (nrow(test.data)/n) * ith.test.error)
  }
  # Sample standard deviation of the K fold error rates
  SD.CV.error <- sqrt(sum((all.k.fold.errors - mean(all.k.fold.errors))^2) / (K - 1))
  # Output
  list(call = model, K = K, qda_error_rate = sum(CV.error),
       qda_sd_error_rate = SD.CV.error, seed = seed)
}
We will need a matrix to store the error rates that the various methods produce using the principal components as features. We can also plot this information.
Matrix<-matrix(data=NA, nrow = 10, ncol = 4)
Matrix<-as.data.frame(Matrix)
colnames(Matrix)<-c("Folds", "Error", "Est.Variance", "Partitioning")
Matrix[1:5,4]<-"Non-Stratified"
Matrix[6:10,4]<-"Stratified"
Ks <- c(5, 10, 20, 50, 100)
# Non-Stratified
for (j in seq_along(Ks)) {
  info <- cv.qda(data = Data, model = Group~PC1+PC2, y = "Group", K = Ks[j], seed = 234)
  Matrix[j, "Folds"] <- Ks[j]
  Matrix[j, "Error"] <- info$qda_error_rate
  # squaring the estimated standard deviation gives the estimated variance
  Matrix[j, "Est.Variance"] <- info$qda_sd_error_rate^2
}
# Stratified
for (j in seq_along(Ks)) {
  info <- strat.cv.qda(data = Data, model = Group~PC1+PC2, y = "Group", K = Ks[j], seed = 234)
  Matrix[j + 5, "Folds"] <- Ks[j]
  Matrix[j + 5, "Error"] <- info$qda_error_rate
  Matrix[j + 5, "Est.Variance"] <- info$qda_sd_error_rate^2
}
ggplot(Matrix, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
  geom_line() + geom_point() +
  # error bars show the error rate plus/minus the estimated variance
  geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5) +
  labs(y="Error Rate", title="K-fold Cross Validation for QDA using the 2 Principal Components") +
  scale_x_continuous(breaks = seq(0, 100, by = 10))
Matrix
## Folds Error Est.Variance Partitioning
## 1 5 0.05054645 5.710179e-05 Non-Stratified
## 2 10 0.05054645 1.713665e-03 Non-Stratified
## 3 20 0.05191257 1.696827e-03 Non-Stratified
## 4 50 0.04918033 4.153454e-03 Non-Stratified
## 5 100 0.04644809 6.532396e-03 Non-Stratified
## 6 5 0.04781421 3.531916e-04 Stratified
## 7 10 0.04781421 8.897777e-04 Stratified
## 8 20 0.04644809 1.631997e-03 Stratified
## 9 50 0.04644809 2.733969e-03 Stratified
## 10 100 0.04918033 7.539460e-03 Stratified
We will also need a matrix to store the error rates that the various methods produce using the original continuous variables as features. Again, we can plot this information.
Matrix.1<-matrix(data=NA, nrow = 10, ncol = 4)
Matrix.1<-as.data.frame(Matrix.1)
colnames(Matrix.1)<-c("Folds", "Error", "Est.Variance", "Partitioning")
Matrix.1[1:5,4]<-"Non-Stratified"
Matrix.1[6:10,4]<-"Stratified"
# Non-Stratified (reusing Ks = c(5, 10, 20, 50, 100) from above)
for (j in seq_along(Ks)) {
  info <- cv.qda(data = Data,
                 model = Group~Latitude+Longitude+Elevation+LakeSize+OverallAvg+OverallMeanAmp,
                 y = "Group", K = Ks[j], seed = 234)
  Matrix.1[j, "Folds"] <- Ks[j]
  Matrix.1[j, "Error"] <- info$qda_error_rate
  Matrix.1[j, "Est.Variance"] <- info$qda_sd_error_rate^2
}
# Stratified
for (j in seq_along(Ks)) {
  info <- strat.cv.qda(data = Data,
                       model = Group~Latitude+Longitude+Elevation+LakeSize+OverallAvg+OverallMeanAmp,
                       y = "Group", K = Ks[j], seed = 234)
  Matrix.1[j + 5, "Folds"] <- Ks[j]
  Matrix.1[j + 5, "Error"] <- info$qda_error_rate
  Matrix.1[j + 5, "Est.Variance"] <- info$qda_sd_error_rate^2
}
ggplot(Matrix.1, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
  geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5) +
  geom_line() + geom_point() +
  labs(y="Error Rate", title="K-fold Cross Validation for QDA using the Original Variables") +
  scale_x_continuous(breaks = seq(0, 100, by = 10))
Matrix.1
## Folds Error Est.Variance Partitioning
## 1 5 0.09562842 0.0003345115 Non-Stratified
## 2 10 0.09699454 0.0019399860 Non-Stratified
## 3 20 0.09699454 0.0025963501 Non-Stratified
## 4 50 0.09972678 0.0083510953 Non-Stratified
## 5 100 0.09562842 0.0114167266 Non-Stratified
## 6 5 0.09153005 0.0001084575 Stratified
## 7 10 0.09289617 0.0008829981 Stratified
## 8 20 0.09426230 0.0021670630 Stratified
## 9 50 0.10109290 0.0066579763 Stratified
## 10 100 0.09699454 0.0136770966 Stratified
In conclusion, of the methods considered, I would recommend stratified 5-fold cross validation with the 2 principal components as features.
3.2 Comments on the Results of Prediction
On the whole, for a low to moderate number of folds (5 to 50), stratified cross validation outperforms non-stratified cross validation in terms of both error rate and estimated variance.
Using the 2 principal components, the stratified CV error rates for 5-, 10-, 20- and 50-fold cross validation are lower than the corresponding non-stratified rates; using the original variables, the same holds for 5-, 10- and 20-fold cross validation.
Using the 2 principal components, the stratified CV estimated variances for 10-, 20- and 50-fold cross validation are lower than the corresponding non-stratified variances; similarly, using the original variables, this holds for 5-, 10-, 20- and 50-fold cross validation.
As expected, larger values of K result in larger estimated variances in almost all cases: each fold then contains fewer observations, so the individual fold error rates are noisier.
Using the 2 principal components as features gives lower error rates than using the original continuous variables. The largest error rate using the 2 principal components (0.05191257) is little more than half the lowest error rate using the original variables (0.09153005).
In the 2 principal components case, stratified 5-fold cross validation appears to be one of the best methods: its error rate is amongst the lowest and its estimated variance very small. In the original-variable case, stratified 5-fold cross validation performs best in terms of both error rate and estimated variance.
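For completeness, a minimal sketch of acting on this recommendation, i.e. fitting the chosen QDA model on all observations (the object name final.qda is illustrative, not from the analysis above):

Data$Group <- factor(Data$Group)              # qda() expects a factor response
final.qda <- qda(Group~PC1+PC2, data = Data)  # fit the recommended model on all 732 lakes
head(predict(final.qda, Data)$class)          # predicted groups for the first few lakes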