library(MASS)
library(ggplot2)
library(plotly)
library(klaR)   # provides partimat() for the partition plots below
library(readxl)
Data <- read_excel("~/Desktop/EPSRC Project /ArcLakeGroupSummary.xlsx")
The data set consists of 732 observations with 9 attributes: 8 continuous and 1 factor. Since the total number of observations is not very large, it seems sensible to use some form of cross-validation for this problem, e.g. leave-one-out, k-fold, or repeated random subsampling.
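As a minimal sketch of how k-fold cross-validation indices could be set up in base R (the choice of k = 10 is an illustrative assumption, not something fixed by the analysis):
set.seed(1)
k <- 10
# Randomly assign each of the 732 observations to one of k folds of (nearly) equal size
fold_id <- sample(rep(1:k, length.out = nrow(Data)))
table(fold_id)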
a <- ggplot(Data, aes(Group)) + geom_bar() +
  labs(y = "Number of Observations", title = "Bar Chart of Observations in each Group") +
  scale_x_continuous(breaks = 1:9)
ggplotly(a)
There are 9 classification groups with an unequal number of observations in each, which may cause some technical difficulties for classification. As can be seen in the bar chart above, groups 7 and 8 have very few observations, with 19 and 29 respectively, whereas group 5 has a relatively large number, 244. As a result, we need to be careful whenever partitioning the data. For example, if we were to split it into training and test sets in a purely random way we could end up with two incomparable sets, i.e. the test set could contain no observations from group 7 at all; the small simulation sketched below checks how often this actually happens. Consequently, we may want to stratify any partition by group to prevent this.
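As a rough check (assuming a 70/30 train/test split and 1000 random repetitions, neither of which is fixed by the text), we can estimate how often a purely random split leaves the test set with no group 7 observations, and contrast it with a simple stratified split:
set.seed(1)
n <- nrow(Data)
n_train <- round(0.7 * n)
# Proportion of purely random splits whose test set contains no group 7 observations
no_group7 <- replicate(1000, {
  train <- sample(seq_len(n), n_train)
  !any(Data$Group[-train] == 7)
})
mean(no_group7)
# A stratified alternative: sample 70% of the rows within each group,
# so every group is guaranteed to appear in both the training and test sets
strat_train <- unlist(lapply(split(seq_len(n), Data$Group), function(idx)
  sample(idx, round(0.7 * length(idx)))))
table(Data$Group[strat_train])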
In this section we will explore a few classical classification methods for this data set, namely LDA and QDA. Leading on from the exploratory data analysis carried out previously, we'll look at the 2-D partitions these methods produce over the first two principal components.
Data$Group <- factor(Data$Group)   # treat the group labels as a categorical response
partimat(Group ~ PC2 + PC1, data = Data, method = "lda")
partimat(Group ~ PC2 + PC1, data = Data, method = "qda")
These are the partitions of the entire data set under LDA and QDA, giving an insight into how each method would divide up the space; at this stage the plots are purely exploratory. The consequences of partitioning the space in this way, and whether it is a sensible approach for this data set, are left for the later results section.
For this classification problem there are a number of loss functions we could use as an error metric, for example squared loss, logistic loss, or hinge loss. However, for simplicity we will only consider the frequently used 0-1 loss, under which each misclassified observation incurs a cost of 1 and each correctly classified observation a cost of 0, so the resulting error metric is simply the misclassification rate.
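As a rough sketch of how this 0-1 loss could be computed in practice (here using an LDA model on PC1 and PC2 fitted with leave-one-out cross-validation via the CV = TRUE option of MASS::lda; the particular model is only illustrative):
# With 0-1 loss, the cross-validated risk is just the misclassification rate
lda_cv <- lda(Group ~ PC1 + PC2, data = Data, CV = TRUE)  # leave-one-out class predictions
mean(lda_cv$class != Data$Group)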