Introduction
The main purpose of this project is to develop a classification rule for the gender of Concho Water Snakes and to allocate observations to one of two classes (female or male) using the data displayed in Table 1 in the Appendix section at the end of this report. The strategy partitions the set of all sample outcomes into two regions, each corresponding to one population. Both graphical and numerical techniques are used to explore the data and the association between its variables. The primary question of interest for this analysis is stated as follows.
- Can tail length and snout to vent length be used to classify the gender of Concho Water Snakes?
The following tasks are performed in order to analyze the data and answer the question of interest stated above.
Obtain overall summary statistics to describe the data.
Use graphical methods such as a pairwise scatter plot of the data color coded by population, side-by-side boxplots, and density plots to visualize the data and show how the quantitative variables are distributed across the populations.
Specify the minimum ECM rule for the gender of Concho Water Snakes, assuming multivariate normal distributions with a common variance-covariance matrix, equal prior probabilities, and equal costs of misclassification. In addition, add the separating line to the scatter plot of the data.
For the classification rule obtained, determine the confusion matrix of the data and compute the Apparent Error Rate (APER).
For the classification rule obtained, determine the confusion matrix using the holdout procedure.
The Material and Methods section below describes the data and the methods used for this analysis. The Results section that follows provides a detailed interpretation of the results. The R code used for this analysis is attached in the Appendix section at the end of this report.
Material and Methods
Data
The data used for this project are displayed in Table 1 in the Appendix section at the end of this report and consist of measurements for Concho Water Snakes. The dataset contains one categorical variable, X1, which represents gender (male or female), and two quantitative variables, X2 and X3. The variable X2 represents tail length in millimeters and X3 represents snout to vent length in millimeters. The dataset includes a total of 66 observations across both genders. Of these, \(n_1=37\) observations of tail length and snout to vent length are for female Concho Water Snakes and \(n_2=29\) observations are for male Concho Water Snakes. The observations in this dataset can be denoted as \(X_{\ell j}\) where \(\ell=1,\cdots,g\) and \(j=1,\cdots,n_{\ell}\) such that
\[Female: X_{11}, X_{12}, \cdots, X_{1n_1}\] \[Male: X_{21}, X_{22}, \cdots, X_{2n_2} \] \[ \begin{aligned} p&\text{: total number of features in a single observation = 2}\\ g&\text{: total number of populations = 2}\\ n_\ell&\text{: number of observations in the $\ell$th population} \end{aligned} \]
Methods
I. Summary Statistics
The summary statistics chart of the entire dataset, shown as Data Frame Summary in the Results and Appendix sections, is generated using the dfSummary() function in the summarytools package. This function produces a summary table with statistics, frequencies, and graphs for all variables in the data frame. The computed summary statistics included in this method are the mean, standard deviation, minimum, median, maximum, interquartile range (IQR), and coefficient of variation (CV). The formulas to compute some of these summary measures are shown below. The upper quartile (Q3) and lower quartile (Q1) are obtained using the summary() function.
The formula for mean is: \(\overline{X}_{p} = \frac{1}{n_p}\sum_{j=1}^{n_p} X_{pj}\), where \(p=1,2\) for two features tail length and snout to vent length, \(j=1,\cdots, n_p\) and \(n_p = 66\)
Standard deviation: \(s_p=\sqrt{\frac{\sum_{j=1}^{n_p}(X_{pj}-\overline{X}_p)^2}{n_p -1}}\), where \(p=1,2\), \(j=1,\cdots, n_p\) and \(n_p = 66\)
\(IQR = 75th \ percentile \ - \ 25th \ percentile = Q3 - Q1\)
\(CV = \frac{standard \ deviation}{mean}\)
\(Q_3 = \frac{3}{4}(n_p+1)^{th} \ term\) and \(Q_1=\frac{1}{4}(n_p+1)^{th} \ term\), where \(p=1,2\) and \(n_p = 66\)
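As a minimal sketch of how these formulas map onto base R, the block below computes each measure for a small vector. The vector `x` is hypothetical and is not taken from the snake data. One point worth noting: the \((n_p+1)\) position formula for the quartiles corresponds to `type = 6` in `quantile()`, whereas `summary()` and `IQR()` use the default `type = 7`, so the two conventions can give slightly different quartiles.

```r
# Hypothetical tail lengths in mm, used only to illustrate the formulas.
x <- c(115, 127, 133, 150, 158, 165, 170, 175, 183, 205, 211)

x_mean <- mean(x)
x_sd   <- sd(x)            # uses the n - 1 denominator
x_cv   <- x_sd / x_mean    # coefficient of variation

# Quartiles located at the (n + 1)/4-th and 3(n + 1)/4-th ordered terms
q_type6 <- unname(quantile(x, probs = c(0.25, 0.75), type = 6))

# Default quartiles used by quantile(), summary(), and IQR() (type 7)
q_type7 <- unname(quantile(x, probs = c(0.25, 0.75)))
x_iqr   <- IQR(x)          # equals q_type7[2] - q_type7[1]
```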
The summary statistics chart of the data grouped by gender is generated using the st() function in the vtable package (Table 3). This function produces a summary table with statistical measures. The computed summary statistics included in this method are the number of observations in each population (N), mean, standard deviation, minimum, maximum, upper quartile (75th percentile), and lower quartile (25th percentile). The formulas to compute some of these summary measures are shown above. The variance-covariance matrices for the two genders (populations) are generated using the cov() function.
II. Data Visualizations
The pairwise correlation plot is created using the ggpairs() function. This visualization includes density plots to show the approximate distribution of the two quantitative variables, scatter plot and correlation values to show the association between Tail Length (X2) and Snout to Vent Length (X3) in each population. The side-by-side boxplots of X2 and X3 are generated for each gender using the ggplot() function in the ggplot2 package in order to display the approximate distributions of these two variables within each population.
III. Minimum ECM Rule
The assumptions stated below are used to obtain the estimated Minimum Expected Cost of Misclassification (ECM) Rule.
Multivariate normal distributions
Common variance covariance matrix: \(\Sigma_1= \Sigma_2=\Sigma\)
Equal prior probabilities: \(\frac{p_2}{p_1}=1\)
Equal costs of misclassification: \(C(1|2)=C(2|1)\)
The formula used to obtain the ECM is \(ECM=C(2|1)P(2|1)p_1 + C(1|2)P(1|2)p_2\)
The goal is to identify the regions \(R_1\) and \(R_2\) that minimize the ECM. These regions are the following.
\[R_1: \ \ all \ \ \underline{x} \ \ s.t. \ \ \frac{f_1(x)}{f_2(x)} \ge \frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}\] \[R_2: \ \ all \ \ \underline{x} \ \ s.t. \ \ \frac{f_1(x)}{f_2(x)} < \frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}\] Using this, the Minimum ECM Rule can be defined as follows.
Allocate \(\underline{X}\) to \(\pi_1\) if \(\underline{X}\) in \(R_1\): all \(\underline{X}\) s.t. \(\frac{f_1(x)}{f_2(x)} \ge \frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}\)
Allocate \(\underline{X}\) to \(\pi_2\) otherwise
Alternatively, the Minimum ECM Rule for two normal populations where \(\Sigma_1= \Sigma_2=\Sigma\) can also be estimated using the following method.
Allocate \(\underline{X}\) to \(\pi_1\) if \((\overline{X_1}- \overline{X_2})^TS_{pooled}^{-1}\underline{X}-\frac{1}{2}(\overline{X_1}- \overline{X_2})^TS_{pooled}^{-1}(\overline{X_1}+ \overline{X_2})\ge \ln\left[\frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}\right]\)
Allocate \(\underline{X}\) to \(\pi_2\) otherwise
where
\(X_{11}, \cdots, X_{1n_1}\) are random samples from \(\pi_1\) (Female)
\(X_{21}, \cdots, X_{2n_2}\) are random samples from \(\pi_2\) (Male)
\[\overline{X_\ell}=\frac{1}{n_{\ell}} \sum^{n_{\ell}}_{j=1}X_{\ell j}, \ \ \ \ell=1,2\]
\[S_{\ell} = \frac{1}{n_{\ell}-1} \sum^{n_{\ell}}_{j=1}(X_{\ell j} - \overline{X}_{\ell})(X_{\ell j} - \overline{X}_{\ell})^T, \ \ \ \ell=1,2\]
\[S_{pooled} = \frac{(n_1 -1)S_1+(n_2-1)S_2}{(n_1+n_2-2)}\]
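Under the assumptions of equal prior probabilities and equal costs of misclassification, the sample rule reduces to comparing a linear score with a midpoint. The symbols \(\hat{a}\) and \(\hat{m}\) are shorthand introduced here for illustration only; they are not notation used elsewhere in this report.

\[
\begin{aligned}
\ln\left[\frac{C(1|2)}{C(2|1)}\frac{p_2}{p_1}\right] &= \ln(1) = 0\\
\hat{a} &= S_{pooled}^{-1}(\overline{X_1}-\overline{X_2})\\
\hat{m} &= \frac{1}{2}\hat{a}^T(\overline{X_1}+\overline{X_2})\\
\text{Allocate } \underline{X} \text{ to } \pi_1 \text{ if } \hat{a}^T\underline{X} &\ge \hat{m}, \text{ and to } \pi_2 \text{ otherwise}
\end{aligned}
\]

The boundary \(\hat{a}^T\underline{X} = \hat{m}\) is a straight line in the (X2, X3) plane, which is why the rule can be drawn as a separating line on the scatter plot.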
IV. Confusion Matrix and Apparent Error Rate (APER)
Confusion Matrix
The confusion matrix can be described as follows.
| Confusion Matrix | \(\pi_1\) | \(\pi_2\) |
|---|---|---|
| Actual population (\(\pi_1\)) | \(n_{1C}\) | \(n_{1M}=n_1-n_{1C}\) |
| Actual population (\(\pi_2\)) | \(n_{2M}=n_2-n_{2C}\) | \(n_{2C}\) |
where
\(n_{1C}\) is the number of \(\pi_1\) items correctly classified
\(n_{2C}\) is the number of \(\pi_2\) items correctly classified
\(n_{1M}\) is the number of \(\pi_1\) items misclassified as \(\pi_2\)
\(n_{2M}\) is the number of \(\pi_2\) items misclassified as \(\pi_1\)
Apparent Error Rate (APER)
The formula used to calculate the APER for this project is defined below.
\[APER = \frac{n_{1M}+n_{2M}}{n_1 + n_2}\]
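The APER formula can be sketched as a small R helper. The function name `aper()` and the example counts are hypothetical and chosen only for illustration; the helper assumes a square confusion matrix with actual populations in the rows and predicted populations in the columns.

```r
# Hypothetical helper: APER from a confusion matrix whose rows are the
# actual populations and whose columns are the predicted populations.
aper <- function(cm) {
  stopifnot(is.matrix(cm), nrow(cm) == ncol(cm))
  misclassified <- sum(cm) - sum(diag(cm))  # off-diagonal counts: n_1M + n_2M
  misclassified / sum(cm)                   # divide by n_1 + n_2
}

# Illustrative counts only (not taken from the data): n_1 = 10, n_2 = 8
cm_example <- matrix(c(9, 1,
                       2, 6),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(actual = c("pi1", "pi2"),
                                     predicted = c("pi1", "pi2")))
aper_example <- aper(cm_example)  # (1 + 2) / 18
```

Because a two-way `table()` object is also a matrix, the same helper can be applied to the output of `table(actual, predicted)`.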
V. Confusion Matrix (Holdout Procedure)
Lachenbruch’s “holdout” procedure is another way to obtain a confusion matrix, and it gives a better estimate of the error rate. This approach involves omitting one observation from \(\pi_1\), developing a classifier based on the remaining \(n_1-1\) and \(n_2\) observations, and classifying the held-out observation. This is repeated until every \(\pi_1\) observation has been classified, and the same steps are then repeated for \(\pi_2\).
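The holdout steps can be sketched as an explicit leave-one-out loop in base R. This is an illustration on simulated data under the equal-priors, equal-costs linear rule; the helper `classify_linear()`, the simulated group means and spreads, and the group sizes are all assumptions made for the example, and this is not the code used for the analysis in this report.

```r
# Hypothetical two-group data; the group means, spread, and sizes are arbitrary.
set.seed(1)
x1 <- cbind(rnorm(20, 170, 20), rnorm(20, 490, 50))  # stand-in for pi_1
x2 <- cbind(rnorm(15, 160, 20), rnorm(15, 550, 70))  # stand-in for pi_2

# Fit the linear rule (equal priors and costs) and classify one new point.
classify_linear <- function(train1, train2, x_new) {
  n1 <- nrow(train1); n2 <- nrow(train2)
  m1 <- colMeans(train1); m2 <- colMeans(train2)
  sp <- ((n1 - 1) * cov(train1) + (n2 - 1) * cov(train2)) / (n1 + n2 - 2)
  a  <- solve(sp, m1 - m2)              # discriminant coefficients
  m  <- sum(a * (m1 + m2)) / 2          # midpoint of the projected means
  if (sum(a * x_new) >= m) 1L else 2L   # 1 = pi_1, 2 = pi_2
}

# Hold out each pi_1 observation in turn, then each pi_2 observation.
pred1 <- sapply(seq_len(nrow(x1)), function(j)
  classify_linear(x1[-j, , drop = FALSE], x2, x1[j, ]))
pred2 <- sapply(seq_len(nrow(x2)), function(j)
  classify_linear(x1, x2[-j, , drop = FALSE], x2[j, ]))

# Holdout confusion matrix and the resulting error-rate estimate.
holdout_cm <- table(actual    = rep(1:2, c(nrow(x1), nrow(x2))),
                    predicted = factor(c(pred1, pred2), levels = 1:2))
holdout_error <- (holdout_cm[1, 2] + holdout_cm[2, 1]) / sum(holdout_cm)
```

Refitting the rule for every held-out observation is the most direct way to express the procedure; it is cheap for samples of this size.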
Results
I. Summary Statistics
The following chart displays the general summary statistics of the data explored in this project. Both quantitative variables, \(X2\) (tail length) and \(X3\) (snout to vent length), contain integer values, and \(X1\) (gender) is a categorical variable with two levels. The dataset contains 37 observations for female and 29 observations for male Concho Water Snakes, so the sample sizes are not equal across the two genders. In the Graph column of this chart, the bar graph for each variable shows the shape of its distribution; both X2 and X3 are approximately normally distributed. The chart also shows that none of the three variables has missing values. The second column of the chart shows the computed mean, standard deviation, minimum, median, maximum, interquartile range (IQR), and coefficient of variation (CV) for each quantitative variable. The number of distinct values in each quantitative variable is also included, showing that X3 has more distinct values (58) than X2 (41). The upper and lower quartiles are shown in Table 2 below. Since X2 and X3 measure different quantities, there is no need to compare them with each other. However, the summary statistics of each variable can be compared across the two genders, which indicates whether these variables are useful for classifying female and male Concho Water Snakes.
Data Frame Summary
data
Dimensions: 66 x 3
Duplicates: 0
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|
| X1 [factor] | | | | 0 (0.0%) |
| X2 [integer] | | 41 distinct values | | 0 (0.0%) |
| X3 [integer] | | 58 distinct values | | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.0)
2022-08-25
| X2 | X3 |
|---|---|
| Min. :115.0 | Min. :361.0 |
| 1st Qu.:158.2 | 1st Qu.:475.0 |
| Median :170.0 | Median :498.0 |
| Mean :167.6 | Mean :517.3 |
| 3rd Qu.:181.8 | 3rd Qu.:572.2 |
| Max. :211.0 | Max. :683.0 |
The summary statistics for each gender are displayed in Table 3 below. The Mean column shows that the averages of X2 and X3 differ between the two genders. Female Concho Water Snakes have a higher average tail length (X2) than males (172.676 mm versus 161.103 mm), while males have a higher average snout to vent length (X3) than females (553.793 mm versus 488.73 mm). This observation is helpful in answering the question of whether female and male Concho Water Snakes can be classified using these two measurements. Based on the Std. Dev. column, both X2 and X3 have a higher standard deviation in Male than in Female, so these variables are more dispersed in Male. The difference is very small for X2 (20.498 versus 20.322) but considerable for X3 (73.183 versus 49.173).
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| X1: Female | |||||||
| X2 | 37 | 172.676 | 20.322 | 127 | 165 | 183 | 211 |
| X3 | 37 | 488.73 | 49.173 | 376 | 462 | 528 | 574 |
| X1: Male | |||||||
| X2 | 29 | 161.103 | 20.498 | 115 | 145 | 175 | 205 |
| X3 | 29 | 553.793 | 73.183 | 361 | 493 | 606 | 683 |
The variance-covariance matrix for Female Concho Water Snakes is displayed in Table 4 below. The variance of X2 (tail length) in this population is 413.0030, while the variance of X3 (snout to vent length) is 2417.9805. The covariance between X2 and X3 is 872.6877, which is positive, suggesting that the two variables tend to increase or decrease in tandem.
| | X2 | X3 |
|---|---|---|
| X2 | 413.0030 | 872.6877 |
| X3 | 872.6877 | 2417.9805 |
The variance-covariance matrix for Male Concho Water Snakes is displayed in Table 5 below. The variance of X2 (tail length) in this population is 420.1675, while the variance of X3 (snout to vent length) is 5355.741. The covariance between X2 and X3 is 1401.844, which is positive, suggesting that the two variables also increase or decrease in tandem in this population, as in the female population. The X2 variances are nearly equal in the two populations, but the variance of X3 and the covariance are considerably higher in Male than in Female.
| | X2 | X3 |
|---|---|---|
| X2 | 420.1675 | 1401.844 |
| X3 | 1401.8436 | 5355.741 |
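As a quick check on how the two reported matrices compare, the sketch below converts each covariance matrix to a correlation and takes the ratio of the Male to Female variances. The matrix entries are the values reported for the two genders; the object names (`s_female`, `s_male`, `var_ratio`) are hypothetical and introduced only for this example.

```r
# Covariance matrices as reported for each gender
vars <- c("X2", "X3")
s_female <- matrix(c(413.0030,  872.6877,
                     872.6877, 2417.9805), nrow = 2, byrow = TRUE,
                   dimnames = list(vars, vars))
s_male   <- matrix(c(420.1675,  1401.8436,
                     1401.8436, 5355.741), nrow = 2, byrow = TRUE,
                   dimnames = list(vars, vars))

# Within-gender correlation between X2 and X3
r_female <- cov2cor(s_female)["X2", "X3"]
r_male   <- cov2cor(s_male)["X2", "X3"]

# Ratio of Male to Female variances for each variable
var_ratio <- diag(s_male) / diag(s_female)
```

A ratio close to 1 for X2 and above 2 for X3 is worth keeping in mind when assuming a common variance-covariance matrix for the two populations.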
II. Data Visualizations
Based on the pairwise plot below (Figure 1), the correlations between X2 and X3 are similar for the two genders. There is a strong positive correlation between X2 and X3 both across the two populations (genders) and within each population. The density plots in this figure also show that the two variables are approximately normally distributed in both populations and do not deviate substantially from the normality assumption. The exceptions are X2 in Female and X3 in Male, which appear to have somewhat bimodal distributions.
Figure 1: Pairwise scatter plot by gender
The side-by-side boxplots of X2 and X3 for the two genders are shown below (Figure 2). Based on this plot, there are two outliers among the female tail lengths, at 127 mm and 133 mm. The plot also supports the previous finding that X2 is higher on average in Female than in Male, while X3 is noticeably higher on average in Male than in Female. However, other quantitative methods are needed to verify whether X2 and X3 indeed differ across the two genders.
Figure 2: Side-by-side boxplots
Based on the 95% confidence ellipses for male and female Concho Water Snakes shown below (Figure 3), some points (\(x_2\), \(x_3\)) from the Female population fall inside the Male ellipse. Therefore, some misclassification can be expected.
Figure 3: Gender confidence ellipse
III. Minimum ECM Rule
Assuming multivariate normal distributions with common variance covariance matrix, equal prior probabilities (\(p_1=p_2=0.5\)), and equal costs of misclassification, the Minimum Expected Cost of Misclassification (ECM) Rule for two genders of Concho Water Snakes using the outputs below is defined as follows. This result is also obtained using the methods specified in Part III of the Material and Methods section.
Allocate \((X_2, X_3)\) to \(\pi_1\) (Female) if: \(0.3564042X_2 - 0.1238378X_3 \ge -5.07174\)
Allocate \((X_2, X_3)\) to \(\pi_2\) (Male) if: \(0.3564042X_2- 0.1238378X_3 < -5.07174\)
## [,1]
## X2 0.3564042
## X3 -0.1238378
## [,1]
## [1,] -5.07174
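The coefficients and threshold in the output above can be approximately reproduced from the reported group sizes, group means, and covariance matrices alone. The sketch below does this; because the inputs are rounded as printed in the report, the results agree with the output only to about three decimal places. The object names and the helper `classify_snake()` are hypothetical and introduced only for illustration.

```r
# Reported group sizes, group means, and covariance matrices (rounded as printed)
n_f <- 37; n_m <- 29
mean_f <- c(X2 = 172.676, X3 = 488.730)
mean_m <- c(X2 = 161.103, X3 = 553.793)
s_f <- matrix(c(413.0030, 872.6877, 872.6877, 2417.9805), nrow = 2)
s_m <- matrix(c(420.1675, 1401.8436, 1401.8436, 5355.741), nrow = 2)

# Pooled covariance, discriminant coefficients, and midpoint
s_pooled <- ((n_f - 1) * s_f + (n_m - 1) * s_m) / (n_f + n_m - 2)
a_hat <- unname(solve(s_pooled, mean_f - mean_m))
m_hat <- sum(a_hat * (mean_f + mean_m)) / 2

# Hypothetical helper applying the rule to one (tail length, snout to vent length) pair
classify_snake <- function(x2, x3) {
  if (a_hat[1] * x2 + a_hat[2] * x3 >= m_hat) "Female" else "Male"
}
at_female_mean <- classify_snake(172.676, 488.730)
at_male_mean   <- classify_snake(161.103, 553.793)
```

Classifying each group's own mean vector is a simple sanity check: a sensible rule should assign the Female mean to Female and the Male mean to Male.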
Figure 4: Gender confidence ellipse with separating line
IV. Confusion Matrix & APER
Confusion Matrix
The confusion matrix of the data is
| Confusion Matrix | Classified as Female (\(\pi_1\)) | Classified as Male (\(\pi_2\)) |
|---|---|---|
| Actual population: Female (\(\pi_1\)) | 34 | 3 |
| Actual population: Male (\(\pi_2\)) | 2 | 27 |
where
\(34\) is the number of \(\pi_1\) items correctly classified
\(27\) is the number of \(\pi_2\) items correctly classified
\(3\) is the number of \(\pi_1\) items misclassified as \(\pi_2\)
\(2\) is the number of \(\pi_2\) items misclassified as \(\pi_1\)
## p
## Female Male
## Female 34 3
## Male 2 27
## Confusion Matrix and Statistics
##
## Reference
## Prediction Female Male
## Female 34 2
## Male 3 27
##
## Accuracy : 0.9242
## 95% CI : (0.832, 0.9749)
## No Information Rate : 0.5606
## P-Value [Acc > NIR] : 7.579e-11
##
## Kappa : 0.8468
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9189
## Specificity : 0.9310
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.9000
## Prevalence : 0.5606
## Detection Rate : 0.5152
## Detection Prevalence : 0.5455
## Balanced Accuracy : 0.9250
##
## 'Positive' Class : Female
##
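The statistics in the output above follow directly from the four counts, with Female treated as the positive class. Note that this output places predictions in the rows and the reference (actual) classes in the columns, so it is the transpose of the first table. The worked values below are rounded to four decimal places.

\[
\begin{aligned}
\text{Accuracy} &= \frac{34+27}{66} \approx 0.9242 & \text{Sensitivity} &= \frac{34}{37} \approx 0.9189\\
\text{Specificity} &= \frac{27}{29} \approx 0.9310 & \text{Pos Pred Value} &= \frac{34}{34+2} \approx 0.9444\\
\text{Neg Pred Value} &= \frac{27}{27+3} = 0.9000 & \text{Balanced Accuracy} &= \frac{0.9189+0.9310}{2} \approx 0.9250
\end{aligned}
\]

The accuracy is also the complement of the apparent error rate, since both are computed from the same 66 classified observations.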
Apparent Error Rate (APER)
The Apparent Error Rate (APER), calculated using the formula provided in the Methods section, is displayed below.
\[APER = \frac{3+2}{66} = 0.07575758\]
## [1] 0.07575758
V. Confusion Matrix (Holdout Procedure)
The confusion matrix obtained using the holdout procedure is identical to the confusion matrix obtained in the previous section. The partition plot in Figure 5 below also shows three misclassified observations from the Female population and two misclassified observations from the Male population, which agrees with the misclassifications shown in Figure 4 above. The estimate of the expected actual error rate (AER) is also 0.07575758.
##
## Female Male
## Female 34 3
## Male 2 27
## Confusion Matrix and Statistics
##
## Reference
## Prediction Female Male
## Female 34 2
## Male 3 27
##
## Accuracy : 0.9242
## 95% CI : (0.832, 0.9749)
## No Information Rate : 0.5606
## P-Value [Acc > NIR] : 7.579e-11
##
## Kappa : 0.8468
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9189
## Specificity : 0.9310
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.9000
## Prevalence : 0.5606
## Detection Rate : 0.5152
## Detection Prevalence : 0.5455
## Balanced Accuracy : 0.9250
##
## 'Positive' Class : Female
##
## Actual Error Rate (AER): 0.07575758
Figure 5: Partition Plot
Appendix
#load data
data <- read.table("C:/Users/kayan/R-Projects/STA135/Concho_Water_Snakes/Project_4_Data.txt", header=T)
data <- as.data.frame(data)
#display data
library(DT)
datatable(data, caption = htmltools::tags$caption(
style = 'caption-side: bottom; text-align: center;',
'Table 1: ', htmltools::em('Data')), rownames = FALSE, filter = "top", options = list(pageLength = 5, autoWidth = TRUE, scrollX = FALSE, columnDefs = list(list(width = '50px', targets = "_all"))))
#change variable type
data['X1'] <- lapply(data['X1'], as.factor)
I. Summary Statistics
#view summary statistics
library(summarytools)
print(dfSummary(data, varnumbers = FALSE,
valid.col = FALSE,
graph.magnif = 0.7), method = "render")
# get quartile
library(knitr)
library(kableExtra)
g<- summary(data)
kable_styling(kable(g[, 2:3], caption = "Table 2: X2 and X3 Summary"), position = "center")
| X2 | X3 |
|---|---|
| Min. :115.0 | Min. :361.0 |
| 1st Qu.:158.2 | 1st Qu.:475.0 |
| Median :170.0 | Median :498.0 |
| Mean :167.6 | Mean :517.3 |
| 3rd Qu.:181.8 | 3rd Qu.:572.2 |
| Max. :211.0 | Max. :683.0 |
##Summary statistics of variables grouped by gender
library(vtable)
st(data, group = 'X1',group.long = TRUE, group.test = TRUE, title = "Table 3: Summary statistics grouped by gender")
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| X1: Female | |||||||
| X2 | 37 | 172.676 | 20.322 | 127 | 165 | 183 | 211 |
| X3 | 37 | 488.73 | 49.173 | 376 | 462 | 528 | 574 |
| X1: Male | |||||||
| X2 | 29 | 161.103 | 20.498 | 115 | 145 | 175 | 205 |
| X3 | 29 | 553.793 | 73.183 | 361 | 493 | 606 | 683 |
#create subsets for each gender
df1 <- subset(data,X1=="Female")
df2 <- subset(data,X1=="Male")
#covariance in female
library(formattable)
library(kableExtra)
quantitative<-c('X2','X3')
c <-cov(df1[,c(quantitative)], use = "pairwise.complete.obs")
cov_df<-as.data.frame(c)
cov_df %>%
kable(escape = F, caption = "Table 4: Covariance for Female") %>%
kable_styling(full_width = F)
| | X2 | X3 |
|---|---|---|
| X2 | 413.0030 | 872.6877 |
| X3 | 872.6877 | 2417.9805 |
#create subsets for each gender
df1 <- subset(data,X1=="Female")
df2 <- subset(data,X1=="Male")
#covariance in male
library(formattable)
library(kableExtra)
quantitative<-c('X2','X3')
c <-cov(df2[,c(quantitative)], use = "pairwise.complete.obs")
cov_df<-as.data.frame(c)
cov_df %>%
kable(escape = F, caption = "Table 5: Covariance for Male") %>%
kable_styling(full_width = F)
| | X2 | X3 |
|---|---|---|
| X2 | 420.1675 | 1401.844 |
| X3 | 1401.8436 | 5355.741 |
II. Data Visualizations
##pairwise
library(ggplot2)
library(GGally)
library(dplyr)
library(plotly)
gender = data$X1
gender <-as.factor(gender)
p <- ggpairs(data [,2:3]
, aes(color = gender, alpha = 0.9)
, upper = list(continuous = "points")
, lower = list(continuous = "cor")
)
for(i in 1:p$nrow) {
for(j in 1:p$ncol){
p[i,j] <- p[i,j] +
scale_fill_manual(values=c("darkorchid","springgreen2")) +
scale_color_manual(values=c("darkorchid","springgreen2")) }
}
p
Figure 1: Pairwise scatter plot by gender
library(plotly)
a1<-ggplot(data, aes(x = X1, y = X2)) +
geom_jitter(alpha = .5, aes(color = X1), size=2) +
geom_boxplot(alpha = .5, color = "dodgerblue4")+
scale_color_manual(values =c("darkorchid","springgreen2"))+
labs(y = "Tail Length (X2) ", x=" ")+
theme(legend.position="none")
a2<-ggplot(data, aes(x = X1, y = X3)) +
geom_jitter(alpha = .5, aes(color = X1), size=2) +
geom_boxplot(alpha = .5, color = "dodgerblue4")+
scale_color_manual(values = c("darkorchid","springgreen2"))+
labs(y = "Snout to Vent Length (X3) ", x=" ")+
theme(legend.position="none")
subplot(a1, a2, shareX= T, titleY=T, nrows = 2)
Figure 2: Side-by-side boxplots
# Scatter plot by group
library(ggplot2)
dt<-data %>% mutate(cluster=as.factor(X1))
p<- dt %>%ggplot(aes(x = X2, y = X3, color = cluster)) +geom_point(alpha =0.6, size = 1.8) +
scale_color_manual(values = c("darkorchid","springgreen2"))+ guides(fill=guide_legend(title="spectral cluster"))+
labs(x = "Tail Length (mm)", y = "Snout to Vent Length (mm)",title = "Gender confidence ellipse")+
stat_ellipse(level=0.95, alpha=0.2, show.legend=F)+
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p)
Figure 3: Gender confidence ellipse
III. Minimum ECM Rule
#number of observations per gender (factor levels are ordered Female, Male)
n <- by(data[,2:3], data$X1, nrow)
n2 <- n[1][[1]] #Female
n3 <- n[2][[1]] #Male
#variable means for each gender
means <- by(data[,2:3], data$X1, colMeans)
m2 <- means[1][[1]] #Female
m3 <- means[2][[1]] #Male
#sample var-cov matrix for each gender
s <- by(data[,2:3], data$X1, var)
s2 <- s[1][[1]] #Female
s3 <- s[2][[1]] #Male
#Spooled
sp <- ((n2-1)*s2 + (n3-1)*s3)/(n2+n3-2)
#discriminant coefficients: inverse of Spooled times (Female mean - Male mean)
(a <- solve(sp)%*%(m2-m3))
## [,1]
## X2 0.3564042
## X3 -0.1238378
#midpoint between the two projected group means (threshold of the rule)
(m <- (t(a)%*%m2+t(a)%*%m3)/2)
## [,1]
## [1,] -5.07174
library(ggplot2)
dt<-data %>% mutate(cluster=as.factor(X1))
p<- dt %>%ggplot(aes(x = X2, y = X3, color = cluster)) +geom_point(alpha =0.6, size = 1.8) +
scale_color_manual(values = c("darkorchid","springgreen2"))+ guides(fill=guide_legend(title="spectral cluster"))+
labs(x = "Tail Length (mm)", y = "Snout to Vent Length (mm)",title = "Gender confidence ellipse")+
stat_ellipse(level=0.95, alpha=0.2, show.legend=F)+geom_abline(intercept = m/a[2], slope = -a[1]/a[2])+
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p)
Figure 4: Gender confidence ellipse with separating line
IV. Confusion Matrix and Apparent Error Rate (APER)
#TRUE when the linear score falls below the threshold, i.e. classified as Male
predictions <- t(a)%*%t(data[, 2:3]) < m[1]
p <- factor(predictions, labels=c("Female", "Male"))
(confusion_matrix <- table(data$X1, p))
## p
## Female Male
## Female 34 3
## Male 2 27
library(caret)
confusionMatrix(p, data$X1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Female Male
## Female 34 2
## Male 3 27
##
## Accuracy : 0.9242
## 95% CI : (0.832, 0.9749)
## No Information Rate : 0.5606
## P-Value [Acc > NIR] : 7.579e-11
##
## Kappa : 0.8468
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9189
## Specificity : 0.9310
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.9000
## Prevalence : 0.5606
## Detection Rate : 0.5152
## Detection Prevalence : 0.5455
## Balanced Accuracy : 0.9250
##
## 'Positive' Class : Female
##
((confusion_matrix[1,2] + confusion_matrix[2,1])/nrow(data))
## [1] 0.07575758
V. Confusion Matrix (Holdout Procedure)
library(MASS)
library(tidyverse)
library(caret)
lda_fit_holdout <- lda(X1~X2+X3, data = data, CV=TRUE, prior = c(0.5,0.5))
(confusion_matrix_holdout <- table(data$X1, lda_fit_holdout$class))
##
## Female Male
## Female 34 3
## Male 2 27
confusionMatrix(lda_fit_holdout$class, data$X1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Female Male
## Female 34 2
## Male 3 27
##
## Accuracy : 0.9242
## 95% CI : (0.832, 0.9749)
## No Information Rate : 0.5606
## P-Value [Acc > NIR] : 7.579e-11
##
## Kappa : 0.8468
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9189
## Specificity : 0.9310
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.9000
## Prevalence : 0.5606
## Detection Rate : 0.5152
## Detection Prevalence : 0.5455
## Balanced Accuracy : 0.9250
##
## 'Positive' Class : Female
##
aer<-(confusion_matrix_holdout[1,2] + confusion_matrix_holdout[2,1])/(nrow(data))
cat("Actual Error Rate (AER):" , aer)
## Actual Error Rate (AER): 0.07575758
library(klaR)
partimat(X1~X2+X3, data = data, method = "lda")
Figure 5: Partition Plot
Session information
sessionInfo()
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] klaR_1.7-1 forcats_0.5.1 stringr_1.4.0 purrr_0.3.4
## [5] readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 tidyverse_1.3.1
## [9] MASS_7.3-56 caret_6.0-93 lattice_0.20-45 plotly_4.10.0
## [13] dplyr_1.0.9 GGally_2.1.2 ggplot2_3.3.6 formattable_0.2.1
## [17] vtable_1.3.4 kableExtra_1.3.4 knitr_1.39 summarytools_1.0.1
## [21] DT_0.23
##
## loaded via a namespace (and not attached):
## [1] readxl_1.4.0 backports_1.4.1 systemfonts_1.0.4
## [4] plyr_1.8.7 lazyeval_0.2.2 splines_4.2.0
## [7] crosstalk_1.2.0 listenv_0.8.0 pryr_0.1.5
## [10] digest_0.6.29 foreach_1.5.2 htmltools_0.5.2
## [13] magick_2.7.3 fansi_1.0.3 magrittr_2.0.3
## [16] checkmate_2.1.0 tzdb_0.3.0 recipes_1.0.1
## [19] globals_0.16.0 modelr_0.1.8 gower_1.0.0
## [22] matrixStats_0.62.0 svglite_2.1.0 hardhat_1.2.0
## [25] rmdformats_1.0.4 colorspace_2.0-3 rvest_1.0.2
## [28] haven_2.5.0 xfun_0.31 tcltk_4.2.0
## [31] crayon_1.5.1 jsonlite_1.8.0 survival_3.3-1
## [34] iterators_1.0.14 glue_1.6.2 gtable_0.3.0
## [37] ipred_0.9-13 webshot_0.5.3 questionr_0.7.7
## [40] future.apply_1.9.0 rapportools_1.1 scales_1.2.0
## [43] DBI_1.1.3 miniUI_0.1.1.1 Rcpp_1.0.8.3
## [46] xtable_1.8-4 viridisLite_0.4.0 proxy_0.4-27
## [49] stats4_4.2.0 lava_1.6.10 prodlim_2019.11.13
## [52] htmlwidgets_1.5.4 httr_1.4.3 RColorBrewer_1.1-3
## [55] ellipsis_0.3.2 pkgconfig_2.0.3 reshape_0.8.9
## [58] farver_2.1.0 nnet_7.3-17 sass_0.4.1
## [61] dbplyr_2.2.0 utf8_1.2.2 tidyselect_1.1.2
## [64] labeling_0.4.2 rlang_1.0.4 reshape2_1.4.4
## [67] later_1.3.0 munsell_0.5.0 cellranger_1.1.0
## [70] tools_4.2.0 cli_3.3.0 generics_0.1.2
## [73] sjlabelled_1.2.0 broom_0.8.0 evaluate_0.15
## [76] fastmap_1.1.0 yaml_2.3.5 ModelMetrics_1.2.2.2
## [79] fs_1.5.2 pander_0.6.5 future_1.27.0
## [82] nlme_3.1-157 mime_0.12 xml2_1.3.3
## [85] compiler_4.2.0 rstudioapi_0.13 e1071_1.7-11
## [88] reprex_2.0.1 bslib_0.3.1 stringi_1.7.6
## [91] highr_0.9 Matrix_1.4-1 vctrs_0.4.1
## [94] pillar_1.7.0 lifecycle_1.0.1 combinat_0.0-8
## [97] jquerylib_0.1.4 data.table_1.14.2 insight_0.18.2
## [100] httpuv_1.6.5 R6_2.5.1 bookdown_0.27
## [103] promises_1.2.0.1 parallelly_1.32.1 codetools_0.2-18
## [106] assertthat_0.2.1 withr_2.5.0 parallel_4.2.0
## [109] hms_1.1.1 labelled_2.9.1 grid_4.2.0
## [112] rpart_4.1.16 timeDate_3043.102 class_7.3-20
## [115] rmarkdown_2.14 pROC_1.18.0 shiny_1.7.2
## [118] lubridate_1.8.0 base64enc_0.1-3