Concho Water Snakes Classification

Kay Royo

03/5/2022

Introduction

The main purpose of this project is to develop a classification rule for the gender of Concho Water Snakes and to allocate observations to the two classes (female and male) using the data displayed in Table 1 in the Appendix at the end of this report. The strategy is to partition the set of all sample outcomes into two regions, each corresponding to one population. Both graphical and numerical techniques are used to explore the data and the association between its variables. The primary question of interest for this analysis is stated as follows.

Can the tail length and snout to vent length be used to classify Concho Water Snakes genders?

The following tasks are performed in order to analyze the data for this project and provide an answer to the question of interest stated above.

Obtain overall summary statistics to describe the data.
Use graphical methods such as a pairwise scatter plot of the data color coded by population, side-by-side boxplots, main effects plot, and density plots to visualize the data and show how the quantitative variables are distributed across the populations.
Specify the minimum ECM rule for the gender of Concho Water Snakes under the assumptions of multivariate normal populations with a common variance-covariance matrix, equal prior probabilities, and equal costs of misclassification. In addition, add the separating line to the scatter plot of the data.
For the classification rule obtained, determine the confusion matrix of the data and compute the Apparent Error Rate (APER).
For the classification rule obtained, determine the confusion matrix using the holdout procedure.

The Materials and Methods section below describes the data and the methods used for this analysis. The Results section that follows provides a detailed interpretation of the results. The R code used for this analysis is attached in the Appendix section at the end of this report.

Materials and Methods

Data

The data used for this project are displayed in Table 1 in the Appendix at the end of this report and consist of measurements for Concho Water Snakes. The dataset contains one categorical variable, X1, which represents gender (female or male), and two quantitative variables, X2 and X3. The variable X2 represents tail length in millimeters and X3 represents snout to vent length in millimeters. The dataset includes a total of 66 observations across both genders: $n_1=37$ observations of tail length and snout to vent length for female Concho Water Snakes and $n_2=29$ observations for male Concho Water Snakes. The observations in this dataset can be denoted as $X_{\ell j}$, where $\ell=1,\cdots,g$ and $j=1,\cdots,n_{\ell}$, such that

\[\text{Female: } X_{11}, X_{12}, \cdots, X_{1n_1}\]

\[\text{Male: } X_{21}, X_{22}, \cdots, X_{2n_2}\]

\[ \begin{aligned} p&\text{: total number of features in a single observation = 2}\\ g&\text{: total number of populations = 2}\\ n_\ell&\text{: number of observations in the $\ell$th population} \end{aligned} \]

Methods

I. Summary Statistics

The summary statistics chart of the entire dataset, shown as Data Frame Summary in the Results and Appendix sections, is generated using the dfSummary() function in the summarytools package. This function produces a summary table with statistics, frequencies, and graphs for all variables in the data frame. The summary statistics reported by this method are the mean, standard deviation, minimum, median, maximum, interquartile range (IQR), and coefficient of variation (CV). The formulas for some of these summary measures are shown below. The upper quartile (Q3) and lower quartile (Q1) are obtained using the summary() function.

The formula for mean is: $\overline{X}_{k} = \frac{1}{n}\sum_{j=1}^{n} X_{kj}$, where $k=1,2$ indexes the two features (tail length and snout to vent length), $j=1,\cdots, n$, and $n = 66$
Standard deviation: $s_k=\sqrt{\frac{\sum_{j=1}^{n}(X_{kj}-\overline{X}_k)^2}{n -1}}$, where $k=1,2$, $j=1,\cdots, n$, and $n = 66$
$IQR = 75th \ percentile \ - \ 25th \ percentile = Q_3 - Q_1$
$CV = \frac{standard \ deviation}{mean} = \frac{s_k}{\overline{X}_k}$
$Q_3 = \frac{3}{4}(n+1)^{th} \ term$ and $Q_1=\frac{1}{4}(n+1)^{th} \ term$ of the ordered observations, where $n = 66$
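
As a rough illustration of the formulas above, the R sketch below computes each measure for a small made-up vector (the values are hypothetical and are not taken from Table 1). One point worth noting is that summary(), quantile(), and IQR() use R's default interpolation rule (type = 7), whereas the $(n+1)$-position rule written above corresponds to type = 6, so the two can give slightly different quartiles.

```r
# Hypothetical tail lengths in mm (illustration only, not from Table 1)
x <- c(150, 160, 165, 170, 175, 180, 190)

x_mean <- mean(x)        # sample mean
x_sd   <- sd(x)          # sample standard deviation (divides by n - 1)
x_cv   <- x_sd / x_mean  # coefficient of variation

# Quartiles under two interpolation rules
q_type7 <- quantile(x, probs = c(0.25, 0.75), type = 7)  # default used by summary()
q_type6 <- quantile(x, probs = c(0.25, 0.75), type = 6)  # (n + 1) * p position rule

x_iqr <- IQR(x)          # Q3 - Q1 under the default type 7 rule

print(c(mean = x_mean, sd = x_sd, cv = x_cv, iqr = x_iqr))
print(rbind(type7 = q_type7, type6 = q_type6))
```

With only seven values the two rules disagree visibly; with 66 observations the difference is usually small.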

The summary statistics chart of the data grouped by gender is generated using the st() function in the vtable package (Table 3). This function produces a summary table of statistical measures. The summary statistics reported by this method are the number of observations in each population (N), mean, standard deviation, minimum, maximum, upper quartile (75th percentile), and lower quartile (25th percentile). The formulas for some of these summary measures are shown above. The variance-covariance matrices for the two genders (populations) are generated using the cov() function.

II. Data Visualizations

The pairwise correlation plot is created using the ggpairs() function. This visualization includes density plots to show the approximate distribution of the two quantitative variables, scatter plot and correlation values to show the association between Tail Length (X2) and Snout to Vent Length (X3) in each population. The side-by-side boxplots of X2 and X3 are generated for each gender using the ggplot() function in the ggplot2 package in order to display the approximate distributions of these two variables within each population.

III. Minimum ECM Rule

The assumptions stated below are used to obtain the estimated Minimum Expected Cost of Misclassification (ECM) Rule.

Multivariate normal distributions
Common variance covariance matrix: $\Sigma_1= \Sigma_2=\Sigma$
Equal prior probabilities: $\frac{p_2}{p_1}=1$
Equal costs of misclassification: $C(1|2)=C(2|1)$

The formula used to obtain the ECM is $ECM=C(2|1)P(2|1)p_1 + C(1|2)P(1|2)p_2$

The goal is to identify the regions $R_1$ and $R_2$ that minimize the ECM. These regions are the following.

\[R_1: \ \ all \ \ \underline{x} \ \ s.t. \ \ \frac{f_1(x)}{f_2(x)} \ge \frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}\]

\[R_2: \ \ all \ \ \underline{x} \ \ s.t. \ \ \frac{f_1(x)}{f_2(x)} < \frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}\]

Using this, the Minimum ECM Rule can be defined as follows.

Allocate $\underline{X}$ to $\pi_1$ if $\underline{X}$ in $R_1$: all $\underline{X}$ s.t. $\frac{f_1(x)}{f_2(x)} \ge \frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}$
Allocate $\underline{X}$ to $\pi_2$ otherwise
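
To see why a linear rule arises in the normal case, it may help to sketch the intermediate step. The sketch assumes $f_\ell$ is the $N_p(\mu_\ell, \Sigma)$ density with a common $\Sigma$; the symbols $\mu_1$ and $\mu_2$ (population mean vectors) are introduced here only for illustration. Taking logs of the density ratio, the quadratic terms in $x$ cancel:

```latex
\[
\ln\frac{f_1(x)}{f_2(x)}
= -\tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)
  + \tfrac{1}{2}(x-\mu_2)^T\Sigma^{-1}(x-\mu_2)
= (\mu_1-\mu_2)^T\Sigma^{-1}x
  - \tfrac{1}{2}(\mu_1-\mu_2)^T\Sigma^{-1}(\mu_1+\mu_2)
\]
```

The condition $\frac{f_1(x)}{f_2(x)} \ge \frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}$ is therefore equivalent to this linear function of $x$ being at least $\ln[\frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}]$. Replacing the population quantities with sample means and a pooled covariance estimate gives a sample version of the rule.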

Alternatively, the Minimum ECM Rule for two normal populations where $\Sigma_1= \Sigma_2=\Sigma$ can also be estimated using the following method.

Allocate $\underline{X}$ to $\pi_1$ if $(\overline{X_1}- \overline{X_2})^TS_{pooled}^{-1}\underline{X}-\frac{1}{2}(\overline{X_1}- \overline{X_2})^TS_{pooled}^{-1}(\overline{X_1}+ \overline{X_2})\ge \ln[\frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}]$
Allocate $\underline{X}$ to $\pi_2$ otherwise

where

$X_{11}, \cdots, X_{1n_1}$ are random samples from $\pi_1$ (Female)

$X_{21}, \cdots, X_{2n_2}$ are random samples from $\pi_2$ (Male)

\[\overline{X_\ell}=\frac{1}{n_{\ell}} \sum^{n_{\ell}}_{j=1}X_{\ell j}, \ \ \ \ell=1,2\]

\[S_{\ell} = \frac{1}{n_{\ell}-1} \sum^{n_{\ell}}_{j=1}(X_{\ell j} - \overline{X}_{\ell})(X_{\ell j} - \overline{X}_{\ell})^T, \ \ \ \ell=1,2\]

\[S_{pooled} = \frac{(n_1 -1)S_1+(n_2-1)S_2}{(n_1+n_2-2)}\]
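
The estimation steps above can be sketched in R as follows. This is a minimal illustration under the stated assumptions (common covariance matrix, with equal costs and priors as the default), not the code used for the report: the function names fit_min_ecm() and allocate() and the toy samples are hypothetical.

```r
# Estimate the linear minimum ECM rule from two samples (rows = observations).
# log_ratio is ln[(C(1|2)/C(2|1)) * (p2/p1)]; it is 0 for equal costs and priors.
fit_min_ecm <- function(x1, x2, log_ratio = 0) {
  x1 <- as.matrix(x1)
  x2 <- as.matrix(x2)
  n1 <- nrow(x1)
  n2 <- nrow(x2)
  xbar1 <- colMeans(x1)
  xbar2 <- colMeans(x2)
  s_pooled <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)
  a <- solve(s_pooled, xbar1 - xbar2)   # S_pooled^{-1} (xbar1 - xbar2)
  m <- sum(a * (xbar1 + xbar2)) / 2     # midpoint between the projected means
  list(a = a, m = m, cutoff = m + log_ratio)
}

# Allocate each row of x to population 1 or 2.
allocate <- function(rule, x) {
  score <- as.vector(as.matrix(x) %*% rule$a)
  ifelse(score >= rule$cutoff, 1L, 2L)
}

# Hypothetical toy samples (tail length, snout to vent length), not Table 1
x1 <- cbind(X2 = c(170, 175, 180, 165, 185), X3 = c(480, 490, 500, 470, 505))
x2 <- cbind(X2 = c(150, 160, 165, 155, 170), X3 = c(540, 560, 570, 545, 580))

rule <- fit_min_ecm(x1, x2)
print(rule$a)
print(allocate(rule, rbind(x1, x2)))
```

Using solve(s_pooled, b) rather than explicitly inverting the matrix is a small numerical convenience; both give the same coefficients here.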

IV. Confusion Matrix and Apparent Error Rate (APER)

Confusion Matrix

The confusion matrix can be described as follows.

Confusion Matrix	Classified as $\pi_1$	Classified as $\pi_2$
Actual population ($\pi_1$)	$n_{1C}$	$n_{1M}=n_1-n_{1C}$
Actual population ($\pi_2$)	$n_{2M}=n_2-n_{2C}$	$n_{2C}$

where

$n_{1C}$ is the number of $\pi_1$ items correctly classified

$n_{2C}$ is the number of $\pi_2$ items correctly classified

$n_{1M}$ is the number of $\pi_1$ items misclassified as $\pi_2$ items

$n_{2M}$ is the number of $\pi_2$ items misclassified as $\pi_1$ items

Apparent Error Rate (APER)

The APER is the proportion of observations in the sample that are misclassified by the rule. The formula used for this project is defined below.

\[APER = \frac{n_{1M}+n_{2M}}{n_1 + n_2}\]
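
A minimal R sketch of this calculation is shown below. The labels are hypothetical (six made-up items), and both factors are given the same levels so that the table is square with the correct classifications on the diagonal.

```r
# Hypothetical actual and classified labels for six items (not the snake data)
lv <- c("pi1", "pi2")
actual     <- factor(c("pi1", "pi1", "pi1", "pi2", "pi2", "pi2"), levels = lv)
classified <- factor(c("pi1", "pi1", "pi2", "pi2", "pi2", "pi1"), levels = lv)

# Rows are actual populations, columns are classified populations
cm <- table(Actual = actual, Classified = classified)

# Off-diagonal counts are n_1M and n_2M
n_misclassified <- sum(cm) - sum(diag(cm))
aper <- n_misclassified / sum(cm)

print(cm)
print(aper)
```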

V. Confusion Matrix (Holdout Procedure)

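Lachenbruch's "holdout" procedure is another way to obtain a confusion matrix, and it gives a less optimistic estimate of the error rate because each observation is classified by a rule that was built without it. This approach involves omitting one observation from $\pi_1$, developing a classifier based on the remaining $n_1-1$ and $n_2$ observations, and classifying the holdout observation. This is repeated until every $\pi_1$ observation has been held out once, and the same steps are then repeated for $\pi_2$.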
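
The loop below is a sketch of the holdout procedure written out explicitly. It uses two species from R's built-in iris data as a stand-in (an assumption made only so the example runs without the snake data file) and lda() from the MASS package with equal priors.

```r
library(MASS)  # provides lda()

# Stand-in data (assumption): two iris species and two measurements
dat <- droplevels(subset(iris, Species != "setosa",
                         select = c(Species, Sepal.Length, Sepal.Width)))

held_out_class <- character(nrow(dat))
for (i in seq_len(nrow(dat))) {
  # Fit the rule without observation i, then classify observation i
  fit <- lda(Species ~ ., data = dat[-i, ], prior = c(0.5, 0.5))
  held_out_class[i] <- as.character(predict(fit, newdata = dat[i, ])$class)
}

holdout_cm <- table(Actual = dat$Species,
                    Classified = factor(held_out_class,
                                        levels = levels(dat$Species)))
holdout_error <- (sum(holdout_cm) - sum(diag(holdout_cm))) / sum(holdout_cm)

print(holdout_cm)
print(holdout_error)
```

In practice the loop is not needed: lda() with CV = TRUE, as used in the Appendix, performs leave-one-out cross-validation in a single call.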

Results

I. Summary Statistics

The following chart displays the general summary statistics of the data explored in this project. Both quantitative variables, $X2$ (tail length) and $X3$ (snout to vent length), contain integer values, and $X1$ (gender) is a categorical variable with two levels. The dataset contains 37 observations for female and 29 observations for male Concho Water Snakes, so the sample sizes are not equal across the two genders. In the Graph column of this chart, the plot for each variable shows the shape of its distribution; both X2 and X3 appear approximately normally distributed. The chart also shows that none of the three variables has missing values. The second column shows the mean, standard deviation, minimum, median, maximum, interquartile range (IQR), and coefficient of variation (CV) for each quantitative variable. The number of distinct values in each quantitative variable is also included, showing that X3 has more distinct values (58) than X2 (41). The upper and lower quartiles are shown in Table 2 below. Since X2 and X3 are two different measurements, it is not necessary to compare them with each other. However, the summary statistics of each variable can be compared across the two genders, and both variables are used to classify female and male Concho Water Snakes.

Data Frame Summary

data

Dimensions: 66 x 3
Duplicates: 0

Variable	Stats / Values	Freqs (% of Valid)	Graph	Missing
X1 [factor]	1. Female; 2. Male	37 (56.1%); 29 (43.9%)	[graph]	0 (0.0%)
X2 [integer]	Mean (sd) : 167.6 (21.1); min ≤ med ≤ max: 115 ≤ 170 ≤ 211; IQR (CV) : 23.5 (0.1)	41 distinct values	[graph]	0 (0.0%)
X3 [integer]	Mean (sd) : 517.3 (68.6); min ≤ med ≤ max: 361 ≤ 498 ≤ 683; IQR (CV) : 97.2 (0.1)	58 distinct values	[graph]	0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.0)
2022-08-25

Table 2: X2 and X3 Summary
	X2	X3
	Min. :115.0	Min. :361.0
	1st Qu.:158.2	1st Qu.:475.0
	Median :170.0	Median :498.0
	Mean :167.6	Mean :517.3
	3rd Qu.:181.8	3rd Qu.:572.2
	Max. :211.0	Max. :683.0

The summary statistics for each gender are displayed in Table 3 below. The Mean column shows that the averages of both X2 and X3 differ between the two genders. Female Concho Water Snakes have a higher average tail length (X2: 172.676 mm vs. 161.103 mm) than males, while male Concho Water Snakes have a higher average snout to vent length (X3: 553.793 mm vs. 488.73 mm) than females. This observation is helpful in answering the question of whether female and male Concho Water Snakes can be classified using these two measurements. Based on the Std. Dev. column, both X2 and X3 have a higher standard deviation in Male than in Female, so these variables are more dispersed in the Male sample. The difference is very small for X2 (20.498 vs. 20.322) but sizable for X3 (73.183 vs. 49.173).

Table 3: Summary statistics grouped by gender
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
X1: Female
X2	37	172.676	20.322	127	165	183	211
X3	37	488.73	49.173	376	462	528	574
X1: Male
X2	29	161.103	20.498	115	145	175	205
X3	29	553.793	73.183	361	493	606	683

The variance-covariance matrix for Female Concho Water Snakes is displayed in Table 4 below. The variance of X2 (tail length) in this population is 413.0030, while the variance of X3 (snout to vent length) is 2417.9805. The covariance between X2 and X3 is 872.6877, which is positive, suggesting that the two variables tend to increase or decrease in tandem.

Table 4: Covariance for Female
	X2	X3
X2	413.0030	872.6877
X3	872.6877	2417.9805

The variance-covariance matrix for Male Concho Water Snakes is displayed in Table 5 below. The variance of X2 (tail length) in this population is 420.1675, while the variance of X3 (snout to vent length) is 5355.741. The covariance between X2 and X3 is 1401.844, which is positive, suggesting that the two variables also increase or decrease in tandem in this population, as in the female population. The X2 variances are nearly equal in the two populations (420.1675 vs. 413.0030), but the X3 variance and the covariance are considerably larger in Male than in Female, which is worth keeping in mind when assuming a common variance-covariance matrix.

Table 5: Covariance for Male
	X2	X3
X2	420.1675	1401.844
X3	1401.8436	5355.741
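
Because covariances depend on the measurement scale, it can help to convert the two matrices to correlations before comparing them. The sketch below does this with cov2cor(), using the values copied from Tables 4 and 5 (the object names are illustrative); it suggests correlations of about 0.87 for Female and about 0.93 for Male.

```r
vars <- c("X2", "X3")

# Values copied from Table 4 (Female) and Table 5 (Male)
s_female <- matrix(c(413.0030, 872.6877,
                     872.6877, 2417.9805),
                   nrow = 2, byrow = TRUE, dimnames = list(vars, vars))
s_male <- matrix(c(420.1675, 1401.8436,
                   1401.8436, 5355.741),
                 nrow = 2, byrow = TRUE, dimnames = list(vars, vars))

# Correlation = covariance / (sd of X2 * sd of X3)
r_female <- cov2cor(s_female)["X2", "X3"]
r_male   <- cov2cor(s_male)["X2", "X3"]

print(c(Female = r_female, Male = r_male))
```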

II. Data Visualizations

Based on the pairwise plot below (Figure 1), the correlations between X2 and X3 are similar for the two genders. There is a high positive correlation between X2 and X3 both across the two populations (genders) and within each population. In this figure, the density plots also show that the two variables are approximately normally distributed in both populations and do not deviate strongly from the normality assumption. The exceptions are X2 in Female and X3 in Male, which appear to have somewhat bimodal distributions.

Figure 1: Pairwise scatter plot by gender

The side-by-side boxplots of X2 and X3 for the two genders are shown below (Figure 2). Based on this plot, there are two outliers, the 127 mm and 133 mm tail lengths in Female. The plot also supports the previous finding that X2 tends to be higher in Female than in Male, while X3 tends to be noticeably higher in Male than in Female. However, this finding should be investigated further with formal quantitative methods to verify whether X2 and X3 indeed differ across the two genders.

Figure 2: Side-by-side boxplots

Based on the 95% confidence ellipses for male and female Concho Water Snakes shown below, some points ($x_2$, $x_3$) from the Female population fall within the Male cluster. Therefore, we can expect some misclassification.

Figure 3: Gender confidence ellipse

III. Minimum ECM Rule

Assuming multivariate normal distributions with a common variance-covariance matrix, equal prior probabilities ($p_1=p_2=0.5$), and equal costs of misclassification, the Minimum Expected Cost of Misclassification (ECM) Rule for the two genders of Concho Water Snakes, based on the outputs below, is defined as follows, where $\pi_1$ is Female and $\pi_2$ is Male. This result is obtained using the methods specified in Part III of the Methods section.

Allocate $(X_2, X_3)$ to $\pi_1$ if: $0.3564042X_2 - 0.1238378X_3 \ge -5.07174$

Allocate $(X_2, X_3)$ to $\pi_2$ if: $0.3564042X_2- 0.1238378X_3 < -5.07174$

##          [,1]
## X2  0.3564042
## X3 -0.1238378

##          [,1]
## [1,] -5.07174
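
As a cross-check, the coefficients and cutoff can be approximately reproduced from the rounded summaries in Tables 3 to 5, and the rule can then be applied to a new measurement. This is only a sketch: the inputs are rounded, so agreement is approximate, and classify_snake() is a hypothetical helper, not part of the original analysis.

```r
# Summaries copied from Tables 3-5 (rounded values)
n_f <- 37
n_m <- 29
mean_f <- c(X2 = 172.676, X3 = 488.73)
mean_m <- c(X2 = 161.103, X3 = 553.793)
s_f <- matrix(c(413.0030, 872.6877, 872.6877, 2417.9805), nrow = 2)
s_m <- matrix(c(420.1675, 1401.8436, 1401.8436, 5355.741), nrow = 2)

# Pooled covariance, discriminant coefficients, and midpoint cutoff
s_pooled <- ((n_f - 1) * s_f + (n_m - 1) * s_m) / (n_f + n_m - 2)
a <- unname(solve(s_pooled, mean_f - mean_m))
m <- sum(a * (mean_f + mean_m)) / 2

# Hypothetical helper: allocate one snake from tail length (x2) and
# snout to vent length (x3), both in mm
classify_snake <- function(x2, x3) {
  if (a[1] * x2 + a[2] * x3 >= m) "Female" else "Male"
}

print(a)
print(m)
print(classify_snake(172.676, 488.73))   # evaluated at the Female mean
print(classify_snake(161.103, 553.793))  # evaluated at the Male mean
```

The separating line in the $(X_2, X_3)$ plane follows from setting the score equal to the cutoff, which gives an intercept of m / a[2] and a slope of -a[1] / a[2].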

Figure 4: Gender confidence ellipse with the separating line

IV. Confusion Matrix & APER

Confusion Matrix

The confusion matrix of the data is

Confusion Matrix	Classified as Female	Classified as Male
Actual population (Female)	34	3
Actual population (Male)	2	27

where

$34$ is the number of $\pi_1$ items correctly classified

$27$ is the number of $\pi_2$ items correctly classified

$3$ is the number of $\pi_1$ items misclassified as $\pi_2$ items

$2$ is the number of $\pi_2$ items misclassified as $\pi_1$ items

##         p
##          Female Male
##   Female     34    3
##   Male        2   27

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     34    2
##     Male        3   27
##                                          
##                Accuracy : 0.9242         
##                  95% CI : (0.832, 0.9749)
##     No Information Rate : 0.5606         
##     P-Value [Acc > NIR] : 7.579e-11      
##                                          
##                   Kappa : 0.8468         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9189         
##             Specificity : 0.9310         
##          Pos Pred Value : 0.9444         
##          Neg Pred Value : 0.9000         
##              Prevalence : 0.5606         
##          Detection Rate : 0.5152         
##    Detection Prevalence : 0.5455         
##       Balanced Accuracy : 0.9250         
##                                          
##        'Positive' Class : Female         
##

Apparent Error Rate (APER)

The Apparent Error Rate (APER), which is calculated using the formula provided in the Methods section, is displayed below. About 7.6% of the 66 observations are misclassified by the rule.

\[APER = \frac{3+2}{66} = 0.07575758\]

## [1] 0.07575758

V. Confusion Matrix (Holdout Procedure)

The confusion matrix obtained using the holdout procedure is identical to the confusion matrix obtained in the previous section. Figure 5 below, which shows the partition plot, also shows three misclassified observations from the Female population and two misclassified observations from the Male population, consistent with the misclassifications shown in Figure 4 above. The estimated expected actual error rate (AER) is therefore also 0.07575758.

##         
##          Female Male
##   Female     34    3
##   Male        2   27

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     34    2
##     Male        3   27
##                                          
##                Accuracy : 0.9242         
##                  95% CI : (0.832, 0.9749)
##     No Information Rate : 0.5606         
##     P-Value [Acc > NIR] : 7.579e-11      
##                                          
##                   Kappa : 0.8468         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9189         
##             Specificity : 0.9310         
##          Pos Pred Value : 0.9444         
##          Neg Pred Value : 0.9000         
##              Prevalence : 0.5606         
##          Detection Rate : 0.5152         
##    Detection Prevalence : 0.5455         
##       Balanced Accuracy : 0.9250         
##                                          
##        'Positive' Class : Female         
##

## Actual Error Rate (AER): 0.07575758

Figure 5: Partition Plot

Appendix

#load data
data <- read.table("C:/Users/kayan/R-Projects/STA135/Concho_Water_Snakes/Project_4_Data.txt", header = TRUE)
data <- as.data.frame(data)

#display data (table options are arguments of datatable(), not of the caption tag)
library(DT)
datatable(data,
          caption = htmltools::tags$caption(
                  style = 'caption-side: bottom; text-align: center;',
                  'Table 1: ', htmltools::em('Data')),
          rownames = FALSE, filter = "top",
          options = list(pageLength = 5, autoWidth = TRUE, scrollX = FALSE,
                         columnDefs = list(list(width = '50px', targets = "_all"))))

#change variable type
data['X1'] <- lapply(data['X1'], as.factor)

I. Summary Statistics

#view summary statistics
library(summarytools)
print(dfSummary(data, varnumbers   = FALSE, 
                valid.col    = FALSE, 
                graph.magnif = 0.7), method = "render")

Data Frame Summary

data

Dimensions: 66 x 3
Duplicates: 0

Variable	Stats / Values	Freqs (% of Valid)	Graph	Missing
X1 [factor]	1. Female; 2. Male	37 (56.1%); 29 (43.9%)	[graph]	0 (0.0%)
X2 [integer]	Mean (sd) : 167.6 (21.1); min ≤ med ≤ max: 115 ≤ 170 ≤ 211; IQR (CV) : 23.5 (0.1)	41 distinct values	[graph]	0 (0.0%)
X3 [integer]	Mean (sd) : 517.3 (68.6); min ≤ med ≤ max: 361 ≤ 498 ≤ 683; IQR (CV) : 97.2 (0.1)	58 distinct values	[graph]	0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.0)
2022-08-25

# get quartile
library(knitr)
library(kableExtra)
g<- summary(data)
kable_styling(kable(g[, 2:3], caption = "Table 2: X2 and X3 Summary"), position = "center")

Table 2: X2 and X3 Summary
	X2	X3
	Min. :115.0	Min. :361.0
	1st Qu.:158.2	1st Qu.:475.0
	Median :170.0	Median :498.0
	Mean :167.6	Mean :517.3
	3rd Qu.:181.8	3rd Qu.:572.2
	Max. :211.0	Max. :683.0

##Summary statistics of variables grouped by gender
library(vtable)
st(data, group = 'X1',group.long = TRUE, group.test = TRUE, title = "Table 3: Summary statistics grouped by gender")

Table 3: Summary statistics grouped by gender
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
X1: Female
X2	37	172.676	20.322	127	165	183	211
X3	37	488.73	49.173	376	462	528	574
X1: Male
X2	29	161.103	20.498	115	145	175	205
X3	29	553.793	73.183	361	493	606	683

#create subsets for each gender
df1 <- subset(data,X1=="Female")
df2 <- subset(data,X1=="Male")

#covariance in female 
library(formattable)
library(kableExtra)
quantitative<-c('X2','X3')
c <-cov(df1[,c(quantitative)], use = "pairwise.complete.obs")
cov_df<-as.data.frame(c)
cov_df %>%
  kable(escape = F, caption = "Table 4: Covariance for Female") %>%
  kable_styling(full_width = F)

Table 4: Covariance for Female
	X2	X3
X2	413.0030	872.6877
X3	872.6877	2417.9805

#create subsets for each gender
df1 <- subset(data,X1=="Female")
df2 <- subset(data,X1=="Male")

#covariance in male 
library(formattable)
library(kableExtra)
quantitative<-c('X2','X3')
c <-cov(df2[,c(quantitative)], use = "pairwise.complete.obs")
cov_df<-as.data.frame(c)
cov_df %>%
  kable(escape = F, caption = "Table 5: Covariance for Male") %>%
  kable_styling(full_width = F)

Table 5: Covariance for Male
	X2	X3
X2	420.1675	1401.844
X3	1401.8436	5355.741

II. Data Visualizations

##pairwise
library(ggplot2)
library(GGally)
library(dplyr)
library(plotly)
gender = data$X1
gender <-as.factor(gender)
p <- ggpairs(data[, 2:3]
            , aes(color = gender, alpha = 0.9)
            , upper = list(continuous = "points")
            , lower = list(continuous = "cor")
            )
for(i in 1:p$nrow) {
  for(j in 1:p$ncol){
    p[i,j] <- p[i,j] + 
        scale_fill_manual(values=c("darkorchid","springgreen2")) +
        scale_color_manual(values=c("darkorchid","springgreen2"))  }
}

p

Figure 1: Pairwise scatter plot by gender

library(plotly)

a1<-ggplot(data, aes(x = X1, y = X2)) +
  geom_jitter(alpha = .5, aes(color = X1), size=2) +
  geom_boxplot(alpha = .5, color = "dodgerblue4")+
  scale_color_manual(values =c("darkorchid","springgreen2"))+
  labs(y = "Tail Length (X2) ", x=" ")+
  theme(legend.position="none")


a2<-ggplot(data, aes(x = X1, y =  X3)) +
  geom_jitter(alpha = .5, aes(color = X1), size=2) +
  geom_boxplot(alpha = .5, color = "dodgerblue4")+
  scale_color_manual(values = c("darkorchid","springgreen2"))+
  labs(y = "Snout to Vent Length (X3) ", x=" ")+
  theme(legend.position="none")

subplot(a1, a2, shareX= T, titleY=T, nrows = 2)

Figure 2: Side-by-side boxplots

# Scatter plot by group
library(ggplot2)
dt<-data %>% mutate(cluster=as.factor(X1))
p<-  dt %>%ggplot(aes(x = X2, y = X3, color = cluster)) +geom_point(alpha =0.6, size = 1.8) +
  scale_color_manual(values = c("darkorchid","springgreen2"))+ guides(fill=guide_legend(title="spectral cluster"))+
  labs(x = "Tail Length (mm)", y = "Snout to Vent Length (mm)",title = "Gender confidence ellipse")+
  stat_ellipse(level=0.95, alpha=0.2, show.legend=F)+
    theme(plot.title = element_text(hjust = 0.5))
ggplotly(p)

Figure 3: Gender confidence ellipse

III. Minimum ECM Rule

#number of observations per gender (n2: Female, n3: Male)
n <- by(data[,2:3], data$X1, nrow)
n2 <- n[1][[1]]
n3 <- n[2][[1]]

#variable means for each gender (m2: Female, m3: Male)
means <- by(data[,2:3], data$X1, colMeans)
m2 <- means[1][[1]]
m3 <- means[2][[1]]

#sample var-cov matrix for each gender (s2: Female, s3: Male)
s <- by(data[,2:3], data$X1, var)
s2 <- s[1][[1]]
s3 <- s[2][[1]] 

#pooled var-cov matrix (Spooled)
sp <- ((n2-1)*s2 + (n3-1)*s3)/(n2+n3-2)

#discriminant coefficients: inverse of Spooled times (Female mean - Male mean)
(a <- solve(sp)%*%(m2-m3))

##          [,1]
## X2  0.3564042
## X3 -0.1238378

(m <- (t(a)%*%m2+t(a)%*%m3)/2)

##          [,1]
## [1,] -5.07174

library(ggplot2)
dt<-data %>% mutate(cluster=as.factor(X1))
p<-  dt %>%ggplot(aes(x = X2, y = X3, color = cluster)) +geom_point(alpha =0.6, size = 1.8) +
  scale_color_manual(values = c("darkorchid","springgreen2"))+ guides(fill=guide_legend(title="spectral cluster"))+
  labs(x = "Tail Length (mm)", y = "Snout to Vent Length (mm)",title = "Gender confidence ellipse")+
  stat_ellipse(level=0.95, alpha=0.2, show.legend=F)+geom_abline(intercept = m/a[2], slope = -a[1]/a[2])+
    theme(plot.title = element_text(hjust = 0.5))
ggplotly(p)

Figure 4: Gender confidence ellipse with the separating line

IV. Confusion Matrix and Apparent Error Rate (APER)

predictions <- t(a)%*%t(data[, 2:3]) < m[1]
p <- factor(predictions, labels=c("Female", "Male"))
(confusion_matrix <- table(data$X1, p))

##         p
##          Female Male
##   Female     34    3
##   Male        2   27

library(caret)
confusionMatrix(p, data$X1)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     34    2
##     Male        3   27
##                                          
##                Accuracy : 0.9242         
##                  95% CI : (0.832, 0.9749)
##     No Information Rate : 0.5606         
##     P-Value [Acc > NIR] : 7.579e-11      
##                                          
##                   Kappa : 0.8468         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9189         
##             Specificity : 0.9310         
##          Pos Pred Value : 0.9444         
##          Neg Pred Value : 0.9000         
##              Prevalence : 0.5606         
##          Detection Rate : 0.5152         
##    Detection Prevalence : 0.5455         
##       Balanced Accuracy : 0.9250         
##                                          
##        'Positive' Class : Female         
##

((confusion_matrix[1,2] + confusion_matrix[2,1])/nrow(data))

## [1] 0.07575758

V. Confusion Matrix (Holdout Procedure)

library(MASS)
library(tidyverse)
library(caret)
lda_fit_holdout <- lda(X1~X2+X3, data = data, CV=TRUE, prior = c(0.5,0.5))
(confusion_matrix_holdout <- table(data$X1, lda_fit_holdout$class))

##         
##          Female Male
##   Female     34    3
##   Male        2   27

confusionMatrix(lda_fit_holdout$class, data$X1)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     34    2
##     Male        3   27
##                                          
##                Accuracy : 0.9242         
##                  95% CI : (0.832, 0.9749)
##     No Information Rate : 0.5606         
##     P-Value [Acc > NIR] : 7.579e-11      
##                                          
##                   Kappa : 0.8468         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9189         
##             Specificity : 0.9310         
##          Pos Pred Value : 0.9444         
##          Neg Pred Value : 0.9000         
##              Prevalence : 0.5606         
##          Detection Rate : 0.5152         
##    Detection Prevalence : 0.5455         
##       Balanced Accuracy : 0.9250         
##                                          
##        'Positive' Class : Female         
##

aer<-(confusion_matrix_holdout[1,2] + confusion_matrix_holdout[2,1])/(nrow(data))

cat("Actual Error Rate (AER):" , aer)

## Actual Error Rate (AER): 0.07575758

library(klaR)
partimat(X1~X2+X3, data = data, method = "lda")

Figure 5: Partition Plot

Session information

sessionInfo()

## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] klaR_1.7-1         forcats_0.5.1      stringr_1.4.0      purrr_0.3.4       
##  [5] readr_2.1.2        tidyr_1.2.0        tibble_3.1.7       tidyverse_1.3.1   
##  [9] MASS_7.3-56        caret_6.0-93       lattice_0.20-45    plotly_4.10.0     
## [13] dplyr_1.0.9        GGally_2.1.2       ggplot2_3.3.6      formattable_0.2.1 
## [17] vtable_1.3.4       kableExtra_1.3.4   knitr_1.39         summarytools_1.0.1
## [21] DT_0.23           
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.4.0         backports_1.4.1      systemfonts_1.0.4   
##   [4] plyr_1.8.7           lazyeval_0.2.2       splines_4.2.0       
##   [7] crosstalk_1.2.0      listenv_0.8.0        pryr_0.1.5          
##  [10] digest_0.6.29        foreach_1.5.2        htmltools_0.5.2     
##  [13] magick_2.7.3         fansi_1.0.3          magrittr_2.0.3      
##  [16] checkmate_2.1.0      tzdb_0.3.0           recipes_1.0.1       
##  [19] globals_0.16.0       modelr_0.1.8         gower_1.0.0         
##  [22] matrixStats_0.62.0   svglite_2.1.0        hardhat_1.2.0       
##  [25] rmdformats_1.0.4     colorspace_2.0-3     rvest_1.0.2         
##  [28] haven_2.5.0          xfun_0.31            tcltk_4.2.0         
##  [31] crayon_1.5.1         jsonlite_1.8.0       survival_3.3-1      
##  [34] iterators_1.0.14     glue_1.6.2           gtable_0.3.0        
##  [37] ipred_0.9-13         webshot_0.5.3        questionr_0.7.7     
##  [40] future.apply_1.9.0   rapportools_1.1      scales_1.2.0        
##  [43] DBI_1.1.3            miniUI_0.1.1.1       Rcpp_1.0.8.3        
##  [46] xtable_1.8-4         viridisLite_0.4.0    proxy_0.4-27        
##  [49] stats4_4.2.0         lava_1.6.10          prodlim_2019.11.13  
##  [52] htmlwidgets_1.5.4    httr_1.4.3           RColorBrewer_1.1-3  
##  [55] ellipsis_0.3.2       pkgconfig_2.0.3      reshape_0.8.9       
##  [58] farver_2.1.0         nnet_7.3-17          sass_0.4.1          
##  [61] dbplyr_2.2.0         utf8_1.2.2           tidyselect_1.1.2    
##  [64] labeling_0.4.2       rlang_1.0.4          reshape2_1.4.4      
##  [67] later_1.3.0          munsell_0.5.0        cellranger_1.1.0    
##  [70] tools_4.2.0          cli_3.3.0            generics_0.1.2      
##  [73] sjlabelled_1.2.0     broom_0.8.0          evaluate_0.15       
##  [76] fastmap_1.1.0        yaml_2.3.5           ModelMetrics_1.2.2.2
##  [79] fs_1.5.2             pander_0.6.5         future_1.27.0       
##  [82] nlme_3.1-157         mime_0.12            xml2_1.3.3          
##  [85] compiler_4.2.0       rstudioapi_0.13      e1071_1.7-11        
##  [88] reprex_2.0.1         bslib_0.3.1          stringi_1.7.6       
##  [91] highr_0.9            Matrix_1.4-1         vctrs_0.4.1         
##  [94] pillar_1.7.0         lifecycle_1.0.1      combinat_0.0-8      
##  [97] jquerylib_0.1.4      data.table_1.14.2    insight_0.18.2      
## [100] httpuv_1.6.5         R6_2.5.1             bookdown_0.27       
## [103] promises_1.2.0.1     parallelly_1.32.1    codetools_0.2-18    
## [106] assertthat_0.2.1     withr_2.5.0          parallel_4.2.0      
## [109] hms_1.1.1            labelled_2.9.1       grid_4.2.0          
## [112] rpart_4.1.16         timeDate_3043.102    class_7.3-20        
## [115] rmarkdown_2.14       pROC_1.18.0          shiny_1.7.2         
## [118] lubridate_1.8.0      base64enc_0.1-3
