library(readxl)   # read_excel()
library(readr)    # read_csv()
library(ggplot2)  # plots
library(plotly)   # ggplotly() interactive versions of the plots

ArcLakeGroupSummary <- read_excel("~/Desktop/EPSRC Project /ArcLakeGroupSummary.xlsx")
dundeedata <- read_csv("~/Desktop/EPSRC Project /dundeedata.csv.xls")
colnames(dundeedata)[1]<-"GloboLakes_ID" # change the GloboLID column name to GloboLakes_ID to make the merge easier.
Data<-merge(ArcLakeGroupSummary, dundeedata, by = "GloboLakes_ID", all = TRUE )
Data<-subset(Data, Group!="NA") # The data set is back to the original 732 rows just with extra columns of information
dim(Data)
## [1] 732 24
1_L_Area_Perimeter_UTM_(997L).AREA_sqkm & 1_L_Area_Perimeter_UTM_(997L).PERIMETER_km give the area and perimeter of each lake. There are 5 missing values for each of these variables in the data set.
KG_Coding, KG_Class, KG_ID & KG_RGB - these four columns all correspond to Köppen-Geiger climate classifications. They carry the same information coded four different ways: an abbreviated class code, a description of the class, an ID number and the RGB colours for creating standardised colour maps. There are 5 missing values for each variable in the data set. These variables have 26 levels (excluding NA).
KG_Coding is a string of either 2 or 3 letters.
unique(Data$KG_Coding)
## [1] "Dfb" "Aw" "BWk" "Dwc" "Dfc" "BWh" "BSk" "Cwb" "BSh" "Cfa" "Am"
## [12] "Dwb" "Dsb" "Dfa" "Cwa" "ET" "Csc" NA "Af" "Cfb" "Csb" "Csa"
## [23] "Cfc" "As" "Dsc" "Dwa" "Dfd"
Main Climate | Precipitation | Temperature |
---|---|---|
A: Equatorial | W: Desert | h: Hot Arid |
B: Arid | S: Steppe | k: Cold Arid |
C: Warm Temp | f: Fully Humid | a: Hot Summer |
D: Snow | s: Summer Dry | b: Warm Summer |
E: Polar | w: Winter Dry | c: Cool Summer |
- | m: Monsoonal | d: Extremely Continental |
- | - | F: Polar Frost |
- | - | T: Polar Tundra |
KG_Class is the KG_Coding written out in words. KG_ID is a two-digit number code representing the KG_Coding. KG_RGB is a colour coding for the KG_Coding which is used for Köppen-Geiger climate maps.
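For illustration, a three-letter code such as "Dfb" decomposes positionally into the three components in the table above; a short sketch (not part of the original analysis):
code <- "Dfb"
substr(code, 1, 1)  # "D" -> Snow (main climate)
substr(code, 2, 2)  # "f" -> Fully Humid (precipitation)
substr(code, 3, 3)  # "b" -> Warm Summer (temperature)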
TEOW is another climate classification factor - terrestrial ecoregions of the world. There are 5 missing values for this variable. This variable has 14 levels (excluding NA).
Rocktype & GLiM - the predominant type of rock in the lake and a coded version of this (GLiM). There are 5 missing values for these variables. These variables have 14 levels (excluding NA).
RiverDensity - The density of rivers in a pre-defined catchment area around the lake. There are 126 missing values for this variable.
1_c_Area_Perimeter_UTM_(996c).AREA_sqkm & 1_c_Area_Perimeter_UTM_(996c).PERIMETER_km give the area and perimeter of the catchment area around each lake. There are 5 missing values for these variables.
Four of the variables had heavily skewed distributions, so the logarithm was taken. Their names were also fairly long, so they were shortened.
Data$`1_L_Area_Perimeter_UTM_(997L).AREA_sqkm`<-log(Data$`1_L_Area_Perimeter_UTM_(997L).AREA_sqkm`)
Data$`1_L_Area_Perimeter_UTM_(997L).PERIMETER_km`<-log(Data$`1_L_Area_Perimeter_UTM_(997L).PERIMETER_km`)
Data$`1_c_Area_Perimeter_UTM_(996c).AREA_sqkm`<-log(Data$`1_c_Area_Perimeter_UTM_(996c).AREA_sqkm`)
Data$`1_c_Area_Perimeter_UTM_(996c).PERIMETER_km`<-log(Data$`1_c_Area_Perimeter_UTM_(996c).PERIMETER_km`)
colnames(Data)[13]<-"Log.Lake.Area"
colnames(Data)[14]<-"Log.Lake.Perimeter"
colnames(Data)[23]<-"Log.Catchment.Area"
colnames(Data)[24]<-"Log.Catchment.Perimeter"
The Rocktype variable had a level called “no data” affecting one observation; this appears to be a slight mistake, so it was changed to NA (the corresponding GLiM level, 15, was recoded in the same way).
levels(Data$Rocktype)[levels(Data$Rocktype)=="no data"]<-NA
levels(Data$GLiM)[levels(Data$GLiM)==15]<-NA
str(Data)
## 'data.frame': 732 obs. of 24 variables:
## $ GloboLakes_ID : num 2 3 4 5 6 7 8 9 10 11 ...
## $ LakeName : chr "SUPERIOR" "VICTORIA" "ARAL Sea" "HURON" ...
## $ Group : Factor w/ 9 levels "1","2","3","4",..: 4 6 9 5 9 6 4 4 6 4 ...
## $ Latitude : num 47.7 -1.3 45.1 44.8 43.9 ...
## $ Longitude : num -88.2 33.2 60.1 -82.2 -87.1 ...
## $ Type : chr "Lake" "Lake" "Lake" "Lake" ...
## $ Elevation : num 184 1140 42 176 176 837 450 157 485 158 ...
## $ LakeSize : num 4094 2313 1535 2763 2710 ...
## $ OverallAvg : num 6.1 25.14 11.11 8.35 9.42 ...
## $ OverallMeanAmp : num 14.65 1.56 26.72 19.28 19.26 ...
## $ PC1 : num -41.3 100.1 -11.1 -29.7 -24.2 ...
## $ PC2 : num -19.132 -13.852 35.134 0.822 3.668 ...
## $ Log.Lake.Area : num 11.3 11.1 11.1 11 11 ...
## $ Log.Lake.Perimeter : num 8.15 8.7 8.18 8.28 7.72 ...
## $ KG_Coding : chr "Dfb" "Aw" "BWk" "Dfb" ...
## $ KG_Class : chr "Snow, fully humid, warm summer" "Equatorial savannah with dry winter" "Arid desert cold" "Snow, fully humid, warm summer" ...
## $ KG_ID : Factor w/ 26 levels "11","12","13",..: 18 4 5 18 18 4 25 19 4 19 ...
## $ KG_RGB : chr "99, 0, 99" "255, 204, 204" "255, 255, 99" "99, 0, 99" ...
## $ TEOW : chr "Temperate Broadleaf and Mixed Forests" "Tropical and Subtropical Grasslands, Savannas and Shrublands" "Deserts and Xeric Shrublands" "Temperate Broadleaf and Mixed Forests" ...
## $ Rocktype : Factor w/ 14 levels "acid plutonic rocks",..: 14 14 14 14 14 9 14 10 14 10 ...
## $ GLiM : Factor w/ 14 levels "1","2","3","4",..: 11 11 11 11 11 8 11 5 11 5 ...
## $ RiverDensity : num 0.124 0.131 0.203 0.171 0.132 ...
## $ Log.Catchment.Area : num 12.3 12.7 14 12.9 12.1 ...
## $ Log.Catchment.Perimeter: num 8.33 8.51 9.08 8.69 7.95 ...
Plotting the boxplot overlaid by scatter points for each group will give us an idea as to how the groups differ with respect to log lake area. Observations with NA were removed.
a<-ggplot(Data, aes(x=Group, y=Log.Lake.Area))+geom_boxplot()+geom_jitter(width=0.2, aes(text = paste("Lake:", LakeName), colour=Group))
## Warning: Ignoring unknown aesthetics: text
ggplotly(a)
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
Using log lake area on its own does not appear to allow us to discriminate between groups, as their distributions do not differ greatly in the way we would desire. Bearing in mind that we've used the log scale, groups 3, 4, 5, 6 and 9 have some observations with very large areas.
Plotting the boxplot overlaid by scatter points for each group will give us an idea as to how the groups differ with respect to log lake perimeter.
b<-ggplot(Data, aes(x=Group,y=Log.Lake.Perimeter))+geom_boxplot()+geom_jitter(width=0.2, aes(text = paste("Lake:", LakeName), colour=Group))
## Warning: Ignoring unknown aesthetics: text
ggplotly(b)
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
The distributions of observations in each group are similar to those in the log lake area plot, not offering a great deal of discrimination.
Plotting a bar chart of KG_Coding, colour coded by group, may offer some form of discrimination.
c<-ggplot(Data, aes(x=KG_Coding, fill=Group))+geom_bar()+theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7))
ggplotly(c)
The KG_Coding variable offers distinct discrimination between groups.
Deselect all groups in the legend and then select only the groups mentioned; a cross-tabulation check is sketched after these lists.
Clear 2 class discrimination: (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (2,7), (2,9), (3,4), (3,5), (3,7), (3,8), (3,9), (4,6), (4,7), (4,8), (4,9), (5,6), (6,7).
Clear 3 class discrimination: (1,3,4), (1,3,5), (2,3,4), (2,3,5), (2,3,7), (2,4,7), (3,4,7), (3,4,8), (3,4,9).
Clear 4 class discrimination: (2,3,4,7).
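A claimed two-class separation can also be checked numerically rather than by toggling the plotly legend; a minimal sketch using groups 2 and 3 as an arbitrary illustrative pair:
# Cross-tabulate two groups against KG_Coding: clear discrimination shows up as
# the two rows having their non-zero counts in (almost) disjoint columns
with(droplevels(subset(Data, Group %in% c("2", "3"))), table(Group, KG_Coding))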
Plotting a bar chart of TEOW, colour coded by group, may offer some form of discrimination.
d<-ggplot(Data, aes(x=TEOW, fill=Group))+geom_bar()+theme(axis.text.x = element_text(angle = 90, hjust = 1, size=6))
ggplotly(d)
The TEOW variable offers some distinct discrimination between groups.
Clear 2 class discrimination: (1,3), (1,4), (1,6), (2,4), (3,4), (3,5), (3,7), (3,9), (4,6), (4,7), (5,6), (6,7), (6,9).
Clear 3 class discrimination: (3,4,7), (4,6,7).
In comparison, KG_Coding appears to be better at discriminating between groups than TEOW.
After plotting a bar chart of rock type, colour coded by group, it turns out that rock type does not offer any clear discrimination between groups.
e<-ggplot(Data, aes(x=Rocktype, fill=Group))+geom_bar()+theme(axis.text.x = element_text(angle = 90, hjust = 1, size=6))
ggplotly(e)
Plotting the boxplot overlaid by scatter points for each group will give us an idea as to how groups differ with respect to river density.
f<-ggplot(Data, aes(x=Group, y=RiverDensity))+geom_boxplot()+geom_jitter(width=0.2, aes(text = paste("Lake:", LakeName), colour=Group))
## Warning: Ignoring unknown aesthetics: text
ggplotly(f)
## Warning: Removed 126 rows containing non-finite values (stat_boxplot).
Using river density does not appear to allow us to discriminate between groups, as their distributions do not greatly differ.
126 observations are missing - these are mainly from groups 4 & 5, with a few observations from groups 3 & 9.
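Where these missing values fall can be checked with a one-liner (an illustrative check, not part of the original output):
# Count missing RiverDensity values by group
table(Data$Group[is.na(Data$RiverDensity)])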
Plotting the boxplot overlaid by scatter points for each group shows that log catchment area does not offer much discrimination between groups.
g<-ggplot(Data, aes(x=Group, y=Log.Catchment.Area))+geom_boxplot()+geom_jitter(width=0.2, aes(text = paste("Lake:", LakeName),colour=Group))
## Warning: Ignoring unknown aesthetics: text
ggplotly(g)
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
Plotting the boxplot overlaid by scatter points for each group shows that log catchment perimeter does not offer much discrimination between groups.
h<-ggplot(Data, aes(x=Group, y=Log.Catchment.Perimeter))+geom_boxplot()+geom_jitter(width=0.2, aes(text = paste("Lake:", LakeName), colour=Group))
## Warning: Ignoring unknown aesthetics: text
ggplotly(h)
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
Seeing as KG_Coding allowed us to discriminate between some groups, and given the way the variable encodes information about main climate, precipitation and temperature, we can extract each of these components to create new, simpler variables and explore the discrimination they offer.
The extracted main climate variable offers some level of discrimination.
Main_Climate<-matrix(data = NA, nrow = 732, ncol = 1)
colnames(Main_Climate)[1]<-"Main_Climate"
for(i in 1:nrow(Data)){
if(grepl("A.*", Data$KG_Coding[i])) {Main_Climate[i,1]<-"Equitorial"}
if(grepl("B.*", Data$KG_Coding[i])) {Main_Climate[i,1]<-"Arid"}
if(grepl("C.*", Data$KG_Coding[i])) {Main_Climate[i,1]<-"Warm Temp"}
if(grepl("D.*", Data$KG_Coding[i])) {Main_Climate[i,1]<-"Snow"}
if(grepl("E.*", Data$KG_Coding[i])) {Main_Climate[i,1]<-"Polar"}
}
Data<-cbind(Data, Main_Climate)
i<-ggplot(Data, aes(x=Main_Climate, fill=Group))+geom_bar()
ggplotly(i)
From the plot above, groups 4 and 5 mainly occur in polar, snow and arid climates. Group 3 mainly occurs in equatorial climates.
Clear 2 class discrimination: (1,4), (2,3), (2,4), (3,4), (3,5), (4,6), (4,7), (4,8), (5,6).
Clear 3 class discrimination: (2,3,4).
Precipitation<-matrix(data = NA, nrow = 732, ncol = 1)
colnames(Precipitation)[1]<-"Precipitation"
for(i in 1:nrow(Data)){
if(grepl(".?W.?", Data$KG_Coding[i])) {Precipitation[i,1]<-"Dessert"}
if(grepl(".?S.?", Data$KG_Coding[i])) {Precipitation[i,1]<-"Steppe"}
if(grepl(".?f.?", Data$KG_Coding[i])) {Precipitation[i,1]<-"Fully Humid"}
if(grepl(".?s.?", Data$KG_Coding[i])) {Precipitation[i,1]<-"Summer Dry"}
if(grepl(".?w.?", Data$KG_Coding[i])) {Precipitation[i,1]<-"Winter Dry"}
if(grepl(".?m.?", Data$KG_Coding[i])) {Precipitation[i,1]<-"Monsoonal"}
}
Data<-cbind(Data, Precipitation)
j<-ggplot(Data, aes(x=Precipitation, fill=Group))+geom_bar()
ggplotly(j)
The extracted precipitation variable does not offer any clear discrimination between groups. There is a bit of lost information in the NAs. Group 5 usually occurs under fully humid precipitation; however, the data set's observations are mainly fully humid anyway.
Temperature<-matrix(data = NA, nrow = 732, ncol = 1)
colnames(Temperature)[1]<-"Temperature"
for(i in 1:nrow(Data)){
if(grepl(".*h", Data$KG_Coding[i])) {Temperature[i,1]<-"Hot Arid"}
if(grepl(".*k", Data$KG_Coding[i])) {Temperature[i,1]<-"Cold Arid"}
if(grepl(".*a", Data$KG_Coding[i])) {Temperature[i,1]<-"Hot Summer"}
if(grepl(".*b", Data$KG_Coding[i])) {Temperature[i,1]<-"Warm Summer"}
if(grepl(".*c", Data$KG_Coding[i])) {Temperature[i,1]<-"Cool Summer"}
if(grepl(".*d", Data$KG_Coding[i])) {Temperature[i,1]<-"Extremely Continental"}
if(grepl(".*F", Data$KG_Coding[i])) {Temperature[i,1]<-"Polar Frost"}
if(grepl(".*T", Data$KG_Coding[i])) {Temperature[i,1]<-"Polar Tundra"}
}
Data<-cbind(Data, Temperature)
k<-ggplot(Data, aes(x=Temperature, fill=Group))+geom_bar()+theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 7))
ggplotly(k)
Using the extracted temperature variable, there is a lot of lost information in the NAs. This is mainly due to equatorial observations not having a third letter in their code for temperature. Group 5 predominantly occurs in places with cool summers and warm summers. Groups 9 and 1 predominantly occur in places with hot and warm summers. For the most part, the other groups straddle a wide variety of temperature levels.
The extracted temperature variable offers some level of discrimination.
Clear 2 class discrimination: (1,4), (2,4), (2,5), (2,7), (4,8), (4,9).
In terms of QDA and cross-validation, we would ideally like to explore how various subsets of the entire set of variables perform under both stratified and non-stratified partitioning. A more concise and elegant way to do this would be to present it in a Shiny app (this is being worked on). The code for the draft app can be found here.
Here we will look at how various subsets of continuous variables performed in terms of QDA classification. When using the new variables, the data set was reduced to exclude observations with a corresponding missing value.
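The Matrix.* objects plotted below are data frames with columns Folds, Error, Est.Variance and Partitioning, one row per combination of K and partitioning scheme. The helper that produced them is not reproduced in this document; a minimal sketch of the non-stratified case, assuming MASS::qda and the column names introduced above, might look like this:
library(MASS)  # qda()

cv_qda <- function(data, model, K, seed = 234) {
  # keep only rows that are complete for the variables in the model formula
  data <- data[complete.cases(data[, all.vars(model)]), ]
  set.seed(seed)
  folds <- sample(rep(1:K, length.out = nrow(data)))  # non-stratified random fold assignment
  fold_errors <- sapply(1:K, function(k) {
    fit <- qda(model, data = data[folds != k, ])
    pred <- predict(fit, newdata = data[folds == k, ])$class
    mean(pred != data[folds == k, as.character(model[[2]])])  # misclassification rate on held-out fold
  })
  data.frame(Folds = K, Error = mean(fold_errors),
             Est.Variance = var(fold_errors), Partitioning = "Non-Stratified")
}

# hypothetical usage: build a Matrix.*-style data frame over several values of K
# do.call(rbind, lapply(c(5, 10, 20, 50, 100), function(k)
#   cv_qda(Data, Group ~ Latitude + Longitude + OverallAvg, K = k)))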
ggplot(Matrix.2, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using Latitude")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.3, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using Longitude")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.4, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using OverallAvg")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.5, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using OverallMeanAmp")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.6, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using Latitude+Longitude")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.7, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using OverallAvg+OverallMeanAmp")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.8, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using Latitude+OverallMeanAmp")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.9, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using Latitude+Longitude+OverallMeanAmp")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.10, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using Latitude+Longitude+OverallAvg")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
Matrix.10
## Folds Error Est.Variance Partitioning
## 1 5 0.05874317 0.0001883527 Non-Stratified
## 2 10 0.05601093 0.0016416102 Non-Stratified
## 3 20 0.05601093 0.0018837242 Non-Stratified
## 4 50 0.06010929 0.0045429473 Non-Stratified
## 5 100 0.05464481 0.0086983932 Non-Stratified
## 6 5 0.05601093 0.0003879013 Stratified
## 7 10 0.06147541 0.0018220023 Stratified
## 8 20 0.05601093 0.0023755192 Stratified
## 9 50 0.05464481 0.0039068774 Stratified
## 10 100 0.05874317 0.0075298059 Stratified
ggplot(Matrix.11, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using Latitude+Longitude+OverallAvg+OverallMeanAmp")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.12, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold CV for QDA using Latitude+Longitude+OverallAvg+Elevation")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
ggplot(Matrix.13, aes(x=Folds, y=Error, group=Partitioning, color=Partitioning)) +
geom_errorbar(aes(ymin=Error-Est.Variance, ymax=Error+Est.Variance), width=3.5)+
geom_line() + geom_point()+
labs(y="Error Rate", title="K-fold Cross Validation for QDA using Log.Catchment.Area")+
scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100))
On their own, the new continuous variables all had fairly high error rates (above 0.6) and did not offer much when combined with other sets of continuous variables.
Interestingly, one of the best-performing subsets of variables was the combination of Latitude, Longitude & OverallAvg, producing error rates comparable to those produced by the two principal components. The significance of this set of variables is that they essentially represent the location and temperature of the lake. This insight could help to form the basis of future classification and the inclusion of categorical variables relating to temperature and location. The extracted categorical variables (Main_Climate & Temperature) may be useful in classification.
Looking specifically at stratified 5-fold cross-validation, as it performed pretty well, we would like to get an idea of how the misclassifications are occurring. Tabulating this information will give us an insight into where we may want to focus our attention with respect to separating groups.
The set of variables we will look at are {PC1, PC2} and {Latitude, Longitude, OverallAvg}.
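strat.cv.qda() is a helper written for this project (its full definition is not reproduced here); it performs stratified K-fold CV for QDA and returns the overall and per-group error rates together with the fold-level classification tables. The key difference from ordinary K-fold CV is the fold assignment, which can be sketched as follows (an illustration, not the actual implementation):
stratified_folds <- function(y, K, seed = 234) {
  set.seed(seed)
  folds <- integer(length(y))
  for (g in levels(y)) {
    idx <- which(y == g)
    # spread each group's observations as evenly as possible across the K folds
    folds[idx] <- sample(rep(1:K, length.out = length(idx)))
  }
  folds
}

# table(Data$Group, stratified_folds(Data$Group, K = 5))  # every fold contains ~1/K of each group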
strat.cv.qda(data=Data, model=Group~PC1+PC2, y = "Group", K = 5, seed = 234)
## $call
## Group ~ PC1 + PC2
##
## $K
## [1] 5
##
## $qda_error_rate
## [1] 0.04781421
##
## $Group1_error_rate
## [1] 0.03614009
##
## $Group2_error_rate
## [1] 0.04493018
##
## $Group3_error_rate
## [1] 0
##
## $Group4_error_rate
## [1] 0.04980874
##
## $Group5_error_rate
## [1] 0.05336523
##
## $Group6_error_rate
## [1] 0
##
## $Group7_error_rate
## [1] 0.09972678
##
## $Group8_error_rate
## [1] 0.03301457
##
## $Group9_error_rate
## [1] 0.09016393
##
## $qda_sd_error_rate
## [1] 0.01879339
##
## $Classification_tables
## $Classification_tables[[1]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 1 8 0 0 0 0 0 0 0
## 3 0 0 16 0 0 0 0 0 0
## 4 0 0 0 21 3 0 0 0 0
## 5 0 0 0 1 46 0 0 0 2
## 6 0 0 0 0 0 8 0 0 0
## 7 0 0 0 0 0 0 3 1 0
## 8 0 0 0 0 0 0 0 6 0
## 9 0 0 0 0 1 0 0 0 19
##
## $Classification_tables[[2]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 9 0 0 0 0 0 0 0
## 3 0 0 16 0 0 0 0 0 0
## 4 0 0 0 24 0 0 0 0 0
## 5 0 0 0 3 43 0 0 0 2
## 6 0 0 0 0 0 8 0 0 0
## 7 0 0 0 0 0 0 3 0 0
## 8 0 0 0 0 0 0 1 5 0
## 9 0 0 0 0 0 0 0 0 20
##
## $Classification_tables[[3]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 10 0 0 0 0 0 0 0 1
## 2 0 8 0 0 0 0 0 0 0
## 3 0 0 15 0 0 0 0 0 0
## 4 0 0 0 23 1 0 0 0 0
## 5 0 0 0 0 49 0 0 0 0
## 6 0 0 0 0 0 9 0 0 0
## 7 0 0 0 0 0 0 4 0 0
## 8 0 0 0 0 0 0 0 6 0
## 9 0 0 0 0 1 0 0 0 19
##
## $Classification_tables[[4]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 10 0 0 0 0 0 0 0 1
## 2 0 8 0 0 0 0 0 0 0
## 3 0 0 16 0 0 0 0 0 0
## 4 0 0 0 23 1 0 0 0 0
## 5 0 0 0 3 46 0 0 0 0
## 6 0 0 0 0 0 8 0 0 0
## 7 0 0 0 0 0 0 3 1 0
## 8 0 0 0 0 0 0 0 5 0
## 9 2 0 0 0 2 0 0 0 16
##
## $Classification_tables[[5]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 8 0 0 0 1 0 0 0
## 3 0 0 16 0 0 0 0 0 0
## 4 0 0 0 24 1 0 0 0 0
## 5 0 0 0 0 47 0 0 0 2
## 6 0 0 0 0 0 9 0 0 0
## 7 0 0 0 0 0 0 4 0 0
## 8 0 0 0 0 0 0 0 6 0
## 9 0 0 0 0 3 0 0 0 17
##
##
## $seed
## [1] 234
Using PC1 & PC2 as explanatory variables, groups 3 & 6 have an error rate of 0%, i.e. they are perfectly classified. Groups 7 & 9 have the highest error rates, at 9.97% and 9.02% respectively.
Some interesting talking points:
Whenever an observation of group 7 was misclassified it was classified as group 8, and vice versa. This may suggest that observations in the two groups are very alike in terms of their feature values, as the two principal components capture 97.3% of the variability in the data. Looking into this further, the only way in which observations in groups 7 & 8 appear to differ significantly is in terms of OverallAvg / PC1. Observations in the two groups are very close geographically.
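A quick numerical check of this claim (an illustrative snippet, not part of the original output):
# Compare the OverallAvg distributions of groups 7 and 8
with(droplevels(subset(Data, Group %in% c("7", "8"))), tapply(OverallAvg, Group, summary))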
As a side note, it's also nice to see that the folds are properly stratified.
strat.cv.qda(data=Data, model=Group~Latitude+Longitude+OverallAvg, y = "Group", K = 5, seed = 234)
## $call
## Group ~ Latitude + Longitude + OverallAvg
##
## $K
## [1] 5
##
## $qda_error_rate
## [1] 0.05601093
##
## $Group1_error_rate
## [1] 0.03651267
##
## $Group2_error_rate
## [1] 0.06754706
##
## $Group3_error_rate
## [1] 0
##
## $Group4_error_rate
## [1] 0.09130237
##
## $Group5_error_rate
## [1] 0.05743629
##
## $Group6_error_rate
## [1] 0.02510246
##
## $Group7_error_rate
## [1] 0.1657559
##
## $Group8_error_rate
## [1] 0
##
## $Group9_error_rate
## [1] 0.07008197
##
## $qda_sd_error_rate
## [1] 0.01969521
##
## $Classification_tables
## $Classification_tables[[1]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 9 1 0 0 0 0 0 0 1
## 2 1 8 0 0 0 0 0 0 0
## 3 0 0 16 0 0 0 0 0 0
## 4 0 0 0 22 2 0 0 0 0
## 5 0 0 0 2 45 0 0 0 2
## 6 0 0 1 0 0 7 0 0 0
## 7 0 0 0 0 0 0 3 1 0
## 8 0 0 0 0 0 0 0 6 0
## 9 0 0 0 0 1 0 0 0 19
##
## $Classification_tables[[2]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 9 0 0 0 0 0 0 0
## 3 0 0 16 0 0 0 0 0 0
## 4 0 0 0 23 1 0 0 0 0
## 5 0 0 0 5 42 0 0 0 1
## 6 0 0 0 0 0 8 0 0 0
## 7 0 0 0 0 0 0 2 1 0
## 8 0 0 0 0 0 0 0 6 0
## 9 0 0 0 0 0 0 0 0 20
##
## $Classification_tables[[3]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 8 0 0 0 0 0 0 0
## 3 0 0 15 0 0 0 0 0 0
## 4 0 0 0 19 5 0 0 0 0
## 5 0 0 0 0 49 0 0 0 0
## 6 0 0 0 0 0 9 0 0 0
## 7 0 0 0 0 0 0 4 0 0
## 8 0 0 0 0 0 0 0 6 0
## 9 0 0 0 0 1 0 0 0 19
##
## $Classification_tables[[4]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 8 0 0 0 0 0 0 0
## 3 0 0 16 0 0 0 0 0 0
## 4 0 0 0 21 3 0 0 0 0
## 5 0 0 0 3 46 0 0 0 0
## 6 0 0 0 0 0 8 0 0 0
## 7 0 0 0 0 0 0 3 1 0
## 8 0 0 0 0 0 0 0 5 0
## 9 0 0 0 0 3 0 0 0 17
##
## $Classification_tables[[5]]
## qda.predy
## qda.y 1 2 3 4 5 6 7 8 9
## 1 11 0 0 0 0 0 0 0 0
## 2 0 7 0 0 0 2 0 0 0
## 3 0 0 16 0 0 0 0 0 0
## 4 0 0 0 25 0 0 0 0 0
## 5 0 0 0 1 48 0 0 0 0
## 6 0 0 0 0 0 9 0 0 0
## 7 0 0 0 0 0 0 4 0 0
## 8 0 0 0 0 0 0 0 6 0
## 9 1 0 0 0 1 0 0 0 18
##
##
## $seed
## [1] 234
Using Latitude, Longitude & OverallAvg as explanatory variables, groups 3 & 8 have an error rate of 0%, i.e. they are perfectly classified. Groups 7 & 4 have the highest error rates, at 16.58% and 9.13% respectively.
Interestingly, in both cases, observations of group 3 were not misclassified. Also, in both cases, the error rate for group 7 was the highest of all 9 groups. The high misclassification rates of group 7 may also be a symptom of its low prevalence in the data set.
Some interesting talking points:
Focusing on the volume of mistakes over the 5 stratified folds, there were 41 mistakes (12+8+6+10+5); 22 of these (4+6+5+6+1) were either observations of group 4 mistaken for group 5 or observations of group 5 mistaken for group 4. We may want to look at ways to more clearly discriminate between these two groups.
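To see these confusions in aggregate, the five fold-level tables can simply be summed. A sketch, assuming the output of the second strat.cv.qda() call above has been stored in a (hypothetical) object named cv_latlongavg:
# cv_latlongavg <- strat.cv.qda(data=Data, model=Group~Latitude+Longitude+OverallAvg, y="Group", K=5, seed=234)
Reduce(`+`, cv_latlongavg$Classification_tables)  # total 9x9 confusion table over the 5 stratified folds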