1 Introduction

Classifying dry beans based on their physical characteristics is important for understanding seed quality. Seed quality determines the market value of the beans, so it is essential information for farmers when planting. Because classifying beans manually is time-consuming and difficult, especially for large samples, statistical models can make the classification simpler and more reliable. A model that predicts and classifies different types of dry beans with good accuracy can therefore be used for efficient and accurate classification. In this paper, I apply several classification methods to a dry bean dataset and determine which method performs best according to the misclassification error, a measure of a model's predictive accuracy.

# importing the dataset
drybeans <- readxl::read_xlsx("Dry_Bean_Dataset.xlsx")
# convert "Class" into a factor variable
drybeans$Class <- as.factor(drybeans$Class)
# create a numeric version of the class label (Class_num)
drybeans$Class_num <- as.numeric(drybeans$Class)

2 Preliminary Data Analysis

I used the dry bean dataset retrieved from the UCI Machine Learning Repository. The data were collected by taking images of grains with a high-resolution camera. The dataset consists of a total of 13,611 observations on 17 variables, which represent attributes obtained with a computer vision system from the images of the grains: Area, Perimeter, MajorAxisLength, MinorAxisLength, AspectRation, Eccentricity, ConvexArea, EquivDiameter, Extent, Solidity, roundness, Compactness, ShapeFactor1, ShapeFactor2, ShapeFactor3, ShapeFactor4, and the bean type (Class).

Class is the response variable and the remaining variables serve as predictors when I fit models to the dataset. As part of the exploratory data analysis, I first examined the correlations among the variables using a correlation plot. Figure 1 below summarizes the pairwise correlations.

# correlation plot (the Class and Class_num columns are excluded)
library(corrplot)
corr_drybeans <- cor(drybeans[c(-17, -18)])
corrplot(corr_drybeans, order = 'hclust', addrect = 2)

Figure 1: Correlation plot of the variables in the dry beans dataset

Based on Figure 1, every variable except Extent is highly correlated with at least one of the other variables.
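
One quick way to check this from the correlation matrix computed above is to look at each variable's largest absolute correlation with any other variable; the sketch below reuses the corr_drybeans object from the previous chunk.

# sketch: largest absolute correlation of each variable with any other variable
cm <- corr_drybeans
diag(cm) <- NA # ignore each variable's correlation with itself
round(sort(apply(abs(cm), 1, max, na.rm = TRUE), decreasing = TRUE), 2)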

In addition, I produced histograms and Q-Q plots to examine the distributions of the variables in the dataset using the hist(), qqnorm(), and qqline() functions. Figure 2 shows the histograms of the variables in the dry bean dataset.

# Check for Normality
# Histograms
par(mfrow=c(4,4),mar = c(2, 2, 1, .2), adj = 0.5)
hist(drybeans$Area, main = "Area")
hist(drybeans$Perimeter, main = "Perimeter")
hist(drybeans$MajorAxisLength, main = "MajorAxisLength")
hist(drybeans$MinorAxisLength, main = "MinorAxisLength")
hist(drybeans$AspectRation, main = "AspectRation")
hist(drybeans$Eccentricity, main = "Eccentricity")
hist(drybeans$ConvexArea, main = "ConvexArea")
hist(drybeans$EquivDiameter, main = "EquivDiameter")
hist(drybeans$Extent, main = "Extent")
hist(drybeans$Solidity, main = "Solidity")
hist(drybeans$roundness, main = "roundness")
hist(drybeans$Compactness, main = "Compactness")
hist(drybeans$ShapeFactor1, main = "ShapeFactor 1")
hist(drybeans$ShapeFactor2, main = "ShapeFactor 2")
hist(drybeans$ShapeFactor3, main = "ShapeFactor 3")
hist(drybeans$ShapeFactor4, main = "ShapeFactor 4")

Figure 2: Histogram of the variables in the dry beans dataset

The variables Area, Perimeter, MajorAxisLength, MinorAxisLength, ConvexArea, and EquivDiameter are positively skewed, whereas Eccentricity, Extent, Solidity, roundness, and ShapeFactor4 are negatively skewed. AspectRation, Compactness, ShapeFactor1, and ShapeFactor3 are slightly skewed but look close to normally distributed.
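
These visual impressions can also be checked numerically. The sketch below computes the sample skewness of each numeric predictor with skewness() from the e1071 package; the choice of package is an assumption (any sample-skewness function would do).

# sketch: sample skewness of each numeric predictor (positive = right-skewed)
library(e1071)
round(sapply(drybeans[, 1:16], skewness), 2)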

Figure 3 shows the Q-Q plots of the variables. Compactness and ShapeFactor3 appear close to normally distributed, as most of their observations lie on the diagonal line. The other variables are skewed or otherwise depart from normality, as the tails of their distributions do not follow the reference line.

# QQ Plots
par(mfrow=c(4,4),mar = c(2, 2, 1, .2), adj = 0.5)
qqnorm(drybeans$Area, main = "Area")
qqline(drybeans$Area)
qqnorm(drybeans$Perimeter, main = "Perimeter")
qqline(drybeans$Perimeter)
qqnorm(drybeans$MajorAxisLength, main = "Major Axis Length")
qqline(drybeans$MajorAxisLength)
qqnorm(drybeans$MinorAxisLength, main = "Minor Axis Length")
qqline(drybeans$MinorAxisLength)
qqnorm(drybeans$AspectRation, main = "Aspect Ratio")
qqline(drybeans$AspectRation)
qqnorm(drybeans$Eccentricity, main = "Eccentricity")
qqline(drybeans$Eccentricity)
qqnorm(drybeans$ConvexArea, main = "Convex Area")
qqline(drybeans$ConvexArea)
qqnorm(drybeans$EquivDiameter, main = "Equiv. Diameter")
qqline(drybeans$EquivDiameter)
qqnorm(drybeans$Extent, main = "Extent")
qqline(drybeans$Extent)
qqnorm(drybeans$Solidity, main = "Solidity")
qqline(drybeans$Solidity)
qqnorm(drybeans$roundness, main = "Roundness")
qqline(drybeans$roundness)
qqnorm(drybeans$Compactness, main = "Compactness")
qqline(drybeans$Compactness)
qqnorm(drybeans$ShapeFactor1, main = "Shape Factor 1")
qqline(drybeans$ShapeFactor1)
qqnorm(drybeans$ShapeFactor2, main = "Shape Factor 2")
qqline(drybeans$ShapeFactor2)
qqnorm(drybeans$ShapeFactor3, main = "Shape Factor 3")
qqline(drybeans$ShapeFactor3)
qqnorm(drybeans$ShapeFactor4, main = "Shape Factor 4")
qqline(drybeans$ShapeFactor4)

Figure 3: Q-Q plots of the variables in the dry beans dataset

3 Modeling and Analysis

# Standardize the variables
drybeans$AreaScaled <- scale(drybeans$Area)
drybeans$PerimeterScaled <- scale(drybeans$Perimeter)
drybeans$MJALScaled <- scale(drybeans$MajorAxisLength)
drybeans$MnALScaled <- scale(drybeans$MinorAxisLength)
drybeans$AspectRatioScaled <- scale(drybeans$AspectRation)
drybeans$EccentricityScaled <- scale(drybeans$Eccentricity)
drybeans$CAScaled <- scale(drybeans$ConvexArea)
drybeans$EDScaled <- scale(drybeans$EquivDiameter)
drybeans$ExtentScaled <- scale(drybeans$Extent)
drybeans$SolidityScaled <- scale(drybeans$Solidity)
drybeans$roundnessScaled <- scale(drybeans$roundness)
drybeans$CompactnessScaled <- scale(drybeans$Compactness)
drybeans$SF1Scaled <- scale(drybeans$ShapeFactor1)
drybeans$SF2Scaled <- scale(drybeans$ShapeFactor2)
drybeans$SF3Scaled <- scale(drybeans$ShapeFactor3)
drybeans$SF4Scaled <- scale(drybeans$ShapeFactor4)
# keep only Class and the standardized predictors
drybeans_scaled_selected <- drybeans[c(-1:-16, -18)]
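
As an aside, the same standardization could be written more compactly by applying scale() to all 16 numeric columns at once; the sketch below is an equivalent alternative to the chunk above, not the code used to produce the results in this paper.

# sketch: standardize all 16 numeric predictors in one step
scaled_predictors <- as.data.frame(scale(drybeans[, 1:16]))
names(scaled_predictors) <- paste0(names(drybeans)[1:16], "Scaled")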

Before fitting the classification and clustering models, the variables were standardized because they are measured on very different scales. The data were then split into a training set (70%, 9528 observations) and a test set (30%, 4083 observations).
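
The splitting code is not shown above; the sketch below illustrates how such a 70/30 split could be produced from the standardized data. The object names train_drybeans and test_drybeans are the ones used in the rest of the paper, but the seed value is an arbitrary assumption.

# sketch: 70/30 split of the standardized data into training and test sets
set.seed(123) # arbitrary seed, for reproducibility only
train_index <- sample(seq_len(nrow(drybeans_scaled_selected)), size = round(0.7 * nrow(drybeans_scaled_selected)))
train_drybeans <- drybeans_scaled_selected[train_index, ]
test_drybeans <- drybeans_scaled_selected[-train_index, ]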

3.1 Linear Discriminant Analysis

Linear discriminant analysis (LDA) classifies an observation using a linear decision boundary, with the goal of separating the groups of observations as much as possible. Using the lda() function from the MASS package, I fitted an LDA model to the training set. Table 1 summarizes the coefficients of the first linear discriminant function obtained from the fitted model.

# LDA (the lda() function is in the MASS package)
library(MASS)
lda_drybeans <- lda(train_drybeans[,-1], train_drybeans$Class)
lda_drybeans
## Call:
## lda(train_drybeans[, -1], train_drybeans$Class)
## 
## Prior probabilities of groups:
##   BARBUNYA     BOMBAY       CALI   DERMASON      HOROZ      SEKER       SIRA 
## 0.09813182 0.03883291 0.11849286 0.26322418 0.13769941 0.15207809 0.19154072 
## 
## Group means:
##           AreaScaled PerimeterScaled MJALScaled   MnALScaled AspectRatioScaled
## BARBUNYA  0.56766607       0.8846161  0.5762181  0.845997109       -0.16591466
## BOMBAY    4.07937371       3.3937400  3.1775562  3.798465347        0.02479384
## CALI      0.77543650       0.9521121  1.0506913  0.767273659        0.60941393
## DERMASON -0.71078715      -0.8831077 -0.8545677 -0.809385389       -0.37523167
## HOROZ     0.01930038       0.2984888  0.6085328 -0.400474103        1.78588811
## SEKER    -0.44898057      -0.5963801 -0.8048710 -0.005653568       -1.37530421
## SIRA     -0.28227503      -0.2731313 -0.2412632 -0.251868075       -0.05610279
##          EccentricityScaled    CAScaled   EDScaled ExtentScaled SolidityScaled
## BARBUNYA         0.02938597  0.57562762  0.7440038  -0.03265870    -0.91094755
## BOMBAY           0.22361057  4.07136222  3.6294705   0.54150573    -0.05970743
## CALI             0.69503118  0.77802616  0.9632634   0.16894697    -0.43890345
## DERMASON        -0.15343821 -0.71140753 -0.8635820   0.07288982     0.23044698
## HOROZ            1.26309075  0.02158167  0.1285628  -0.89426746    -0.37469997
## SEKER           -1.81375874 -0.45336310 -0.4750163   0.44866314     0.69283264
## SIRA             0.17527362 -0.28382275 -0.2468598  -0.01817183     0.16924254
##          roundnessScaled CompactnessScaled  SF1Scaled   SF2Scaled   SF3Scaled
## BARBUNYA      -1.2180145        0.09319318 -1.0699003 -0.53303448  0.06650663
## BOMBAY        -0.1615943       -0.13152309 -2.7564264 -1.46384548 -0.15893647
## CALI          -0.4558779       -0.69919751 -0.9885524 -1.02487822 -0.71279878
## DERMASON       0.5868548        0.31059246  1.0470863  0.74005932  0.28191571
## HOROZ         -1.3144166       -1.59834808  0.3900488 -1.11940152 -1.52852930
## SEKER          1.2044744        1.57778635 -0.2058675  1.38843123  1.63889032
## SIRA           0.1939376       -0.03806288  0.1337314 -0.05576821 -0.07027227
##            SF4Scaled
## BARBUNYA  0.15154641
## BOMBAY   -0.74463907
## CALI     -1.01726090
## DERMASON  0.42136537
## HOROZ    -0.75830799
## SEKER     0.75709591
## SIRA      0.06794326
## 
## Coefficients of linear discriminants:
##                             LD1          LD2           LD3           LD4
## AreaScaled          29.07372316 -21.70587745  -56.83635619  -35.00329995
## PerimeterScaled     11.59147220  -7.20914665   -1.07930691    1.92854577
## MJALScaled          43.89381035 -33.94449809  -73.83073013   -1.08031895
## MnALScaled          27.44250637 -12.16601734  -78.97074055   -9.05947459
## AspectRatioScaled   -7.01173882   0.08258481  -31.13351383   23.73178426
## EccentricityScaled  -0.68613116   4.41230856   14.74624261  -12.91442924
## CAScaled           -24.92320458  15.96661981   51.57436479   20.04835930
## EDScaled           -91.53345875  53.54737832  154.28726662   27.81450091
## ExtentScaled        -0.06248009   0.02923517    0.03057347    0.10826802
## SolidityScaled      -0.18618452   0.38132351    0.41859788    0.02260822
## roundnessScaled      1.88927927  -1.61761360   -0.64436630   -0.37775053
## CompactnessScaled   -4.37666244 -53.56345991 -228.45161097  179.84898015
## SF1Scaled           -2.68653241  -4.86546641    3.66349364    3.80386643
## SF2Scaled           -5.44078030   8.60833939   -8.05837132    0.76871531
## SF3Scaled            2.08762112  43.28082061  226.65660557 -164.09639613
## SF4Scaled            0.79849338  -0.75914707   -0.69225694   -0.34433132
##                             LD5          LD6
## AreaScaled         -20.05166361   7.46073940
## PerimeterScaled     17.79303136  -7.28359163
## MJALScaled         -57.59532048 -36.26049065
## MnALScaled         -43.35612311 -21.53733293
## AspectRatioScaled   22.75575594  24.49225397
## EccentricityScaled   1.14285845  -8.13389200
## CAScaled            27.36861153  15.85846512
## EDScaled            69.84726158  25.15200973
## ExtentScaled        -0.04959757  -0.05507111
## SolidityScaled       0.24679894   0.32312694
## roundnessScaled      1.04955332  -1.02904972
## CompactnessScaled   47.44284373  95.59156865
## SF1Scaled            0.82699852 -11.40937382
## SF2Scaled           -5.61676025  -2.43154882
## SF3Scaled          -20.60787010 -85.50896942
## SF4Scaled           -1.22094269  -1.15387252
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5    LD6 
## 0.5398 0.2175 0.1068 0.0911 0.0303 0.0145
# Table 1 (Coefficients for the first linear discriminant function from the LDA model)
kbl(coefficients(lda_drybeans)[,1], booktabs = T, col.names = "Coefficient") %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 1: Coefficients for the first linear discriminant function from the LDA model", general_title = "")
Coefficient
AreaScaled 29.0737232
PerimeterScaled 11.5914722
MJALScaled 43.8938103
MnALScaled 27.4425064
AspectRatioScaled -7.0117388
EccentricityScaled -0.6861312
CAScaled -24.9232046
EDScaled -91.5334587
ExtentScaled -0.0624801
SolidityScaled -0.1861845
roundnessScaled 1.8892793
CompactnessScaled -4.3766624
SF1Scaled -2.6865324
SF2Scaled -5.4407803
SF3Scaled 2.0876211
SF4Scaled 0.7984934
Table 1: Coefficients for the first linear discriminant function from the LDA model

Based on the coefficients in Table 1, the first linear discriminant is a linear combination of the standardized predictor variables, defined as:

\[LD1 = 29.0737232 \times AreaScaled + 11.5914722 \times PerimeterScaled + 43.8938103 \times MJALScaled + 27.4425064 \times MnALScaled - 7.0117388 \times AspectRatioScaled - 0.6861312 \times EccentricityScaled - 24.9232046 \times CAScaled - 91.5334587 \times EDScaled - 0.0624801 \times ExtentScaled - 0.1861845 \times SolidityScaled + 1.8892793 \times roundnessScaled - 4.3766624 \times CompactnessScaled - 2.6865324 \times SF1Scaled - 5.4407803 \times SF2Scaled + 2.0876211 \times SF3Scaled + 0.7984934 \times SF4Scaled. \tag{1}\]

Since EDScaled (the standardized equivalent diameter) has the largest coefficient in absolute value, it has the largest effect on the first linear discriminant function. Conversely, ExtentScaled (the standardized extent) has the smallest coefficient in absolute value and therefore the least effect on the first linear discriminant function.
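
Given the fitted lda_drybeans object, these two variables can also be identified programmatically; a short sketch:

# sketch: variables with the largest and smallest |coefficient| on LD1
ld1 <- coefficients(lda_drybeans)[, 1]
names(which.max(abs(ld1))) # largest effect; "EDScaled" per Table 1
names(which.min(abs(ld1))) # smallest effect; "ExtentScaled" per Table 1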

Table 2 summarizes the prior probabilities from the LDA model. By default these are the proportions of training observations belonging to each bean type: 9.8% of the training observations are Barbunya, 3.9% Bombay, 11.8% Cali, 26.3% Dermason, 13.8% Horoz, 15.2% Seker, and 19.2% Sira.
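
Because lda() uses the training class proportions as its default priors, these values can be reproduced directly from the training labels; the table_LDA_pp object used in the next chunk is presumably built from them. A sketch:

# sketch: the default priors equal the class proportions in the training set
round(prop.table(table(train_drybeans$Class)), 3)
# lda_drybeans$prior # should match the values above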

# Table 2 (Table of Prior Probabilities from LDA)
kbl(table_LDA_pp, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 2: Prior probabilities from the LDA model", general_title = "")
Drybeans Class Barbunya Bombay Cali Dermason Horoz Seker Sira
Prior Probabilities 0.098 0.039 0.118 0.263 0.138 0.152 0.192
Table 2: Prior probabilities from the LDA model
# LDA prediction
lda_pred <- predict(lda_drybeans, newdata = test_drybeans[,-1], type = "class")
table_LDA <- table(test_drybeans$Class, lda_pred$class, dnn = c('Actual Group','Predicted Group'))

Table 3 is the confusion matrix obtained by applying the fitted LDA model to the test set. The diagonal entries give the numbers of correctly classified observations, and the off-diagonal entries give the numbers of misclassified observations.

# Table 3 (Confusion matrix from the LDA model)
kbl(table_LDA, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 3: Confusion matrix from the LDA model", general_title = "")
BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
BARBUNYA 323 0 28 0 2 3 31
BOMBAY 0 152 0 0 0 0 0
CALI 1 0 477 0 7 0 16
DERMASON 1 0 0 893 1 15 128
HOROZ 2 0 12 3 571 0 28
SEKER 4 0 0 8 0 523 43
SIRA 1 0 1 46 7 1 755
Table 3: Confusion matrix from the LDA model

Table 4 summarizes the percentage of correctly classified observations for each bean type. The LDA model correctly classified every Bombay observation in the test set.
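
The percentages in Table 4 can be obtained from the confusion matrix by dividing each diagonal entry by its row total; the table_LDA_pc object used in the next chunk is presumably built this way. A sketch:

# sketch: per-class accuracy (%) from the LDA confusion matrix
round(100 * diag(table_LDA) / rowSums(table_LDA), 3)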

# Table 4 (% of correctly classified observations by the LDA model)
kbl(table_LDA_pc, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 4: Percentages of correctly classified observations by the LDA model", general_title = "")
Drybeans Class Barbunya Bombay Cali Dermason Horoz Seker Sira
% of Correctly Classified Observations 83.463 100 95.21 86.031 92.695 90.484 93.095
Table 4: Percentages of correctly classified observations by the LDA model
# Computing the misclassification error from the LDA model
sum(diag(prop.table(table_LDA))) # overall accuracy (proportion of correctly classified observations)
## [1] 0.9047269
1-sum(diag(prop.table(table_LDA))) # misclassification error
## [1] 0.09527308

To assess the performance of the LDA model, I computed its misclassification error on the test set, which is 0.09527308; that is, about 9.5% of the test observations were misclassified.

3.2 Quadratic Discriminant Analysis

Quadratic discriminant analysis (QDA) is similar to LDA in that it also aims to separate the classes of observations as much as possible, but it uses a decision boundary that is a quadratic function of the predictors. QDA assumes that each class of the response variable has its own covariance matrix.

# QDA
qda_drybeans <- qda(train_drybeans[,-1], grouping = train_drybeans$Class)

Table 5 summarizes the prior probabilities from the QDA model. The values match those in Table 2 for the LDA model; this is expected, because both sets of priors are estimated as the class proportions of the same training set.

# Table 5 (Prior probabilities from the QDA model)
kbl(table_QDA_pp, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 5: Prior probabilities from the QDA model", general_title = "")
Drybeans Class Barbunya Bombay Cali Dermason Horoz Seker Sira
Prior Probabilities 0.098 0.039 0.118 0.263 0.138 0.152 0.192
Table 5: Prior probabilities from the QDA model
# QDA prediction
qda_pred <- predict(qda_drybeans, newdata = test_drybeans[,-1], type = "class")
table_QDA <- table(test_drybeans$Class, qda_pred$class, dnn = c('Actual Group','Predicted Group'))

Table 6 is the confusion matrix for the QDA model. As before, the diagonal entries give the numbers of correctly classified observations and the off-diagonal entries give the numbers of misclassified observations.

# Table 6 (Confusion matrix from the QDA model)
kbl(table_QDA, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 6: Confusion matrix from the QDA model", general_title = "")
BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
BARBUNYA 345 0 25 0 2 1 14
BOMBAY 0 152 0 0 0 0 0
CALI 19 0 465 0 12 0 5
DERMASON 1 0 0 895 6 20 116
HOROZ 1 0 8 7 589 0 11
SEKER 7 0 0 9 0 537 25
SIRA 4 0 2 53 15 8 729
Table 6: Confusion matrix from the QDA model

Table 7 summarizes the percentages of correctly classified observations. Like the LDA model, the QDA model correctly classified every Bombay observation in the test set; relative to LDA, the accuracy improved for some bean types and worsened for others.

# Table 7 (% of correctly classified observations by the QDA model)
kbl(table_QDA_pc, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 7: Percentages of correctly classified observations by the QDA model", general_title = "")
Drybeans Class Barbunya Bombay Cali Dermason Horoz Seker Sira
% of Correctly Classified Observations 89.147 100 92.814 86.224 95.617 92.907 89.889
Table 7: Percentages of correctly classified observations by the QDA model
# Computing the misclassification error for the QDA model
sum(diag(prop.table(table_QDA)))
## [1] 0.9091354
1-sum(diag(prop.table(table_QDA)))
## [1] 0.09086456

The misclassification error for the QDA model is 0.09086456, slightly lower than that of the LDA model, so the QDA model slightly outperformed the LDA model.

3.3 K-Nearest Neighbors

K-nearest neighbors (KNN) is a non-parametric classification method that assigns an observation to a group based on the classes of its K closest training observations. Using knn() from the class package, I fitted KNN models to the training set with K = 1, 5, and 10.

# KNN models (knn() is in the class package)
library(class)
# KNN (K=1)
drybeans_knn_1 <- knn(train = train_drybeans[,-1], test = test_drybeans[,-1], cl = train_drybeans$Class, k = 1, prob = F)
# KNN (K=5)
drybeans_knn_5 <- knn(train = train_drybeans[,-1], test = test_drybeans[,-1], cl = train_drybeans$Class, k = 5, prob = F)
# KNN (K=10)
drybeans_knn_10 <- knn(train = train_drybeans[,-1], test = test_drybeans[,-1], cl = train_drybeans$Class, k = 10, prob = F)
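
The confusion tables KNN1_table, KNN5_table, and KNN10_table used below to compute the misclassification errors are not built in the chunk above; the sketch that follows shows one way to construct them, using the same convention as for LDA and QDA (actual classes in rows, predicted classes in columns).

# sketch: confusion matrices for the three KNN fits
KNN1_table <- table(test_drybeans$Class, drybeans_knn_1, dnn = c('Actual Group','Predicted Group'))
KNN5_table <- table(test_drybeans$Class, drybeans_knn_5, dnn = c('Actual Group','Predicted Group'))
KNN10_table <- table(test_drybeans$Class, drybeans_knn_10, dnn = c('Actual Group','Predicted Group'))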

Table 8 summarizes the percentages of correctly classified observations for the KNN model with K = 1. Like the LDA and QDA models, it correctly classified every Bombay observation; relative to those models, its accuracy improved for some bean types and worsened for others.

# Table 8 (% of correctly classified observations by the KNN model with K = 1)
kbl(table_KNN1_pc, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 8: Percentages of correctly classified observations by the KNN model with K = 1", general_title = "")
Drybeans Class Barbunya Bombay Cali Dermason Horoz Seker Sira
% of Correctly Classified Observations 90.256 100 92.958 89.293 93.831 93.717 84.302
Table 8: Percentages of correctly classified observations by the KNN model with K = 1

Table 9 summarizes the percentages of correctly classified observations for the KNN model with K = 5. It is more accurate than the model with K = 1, as the percentage of correctly classified observations is at least as high for every bean type.

# Table 9 (% of correctly classified observations by the KNN model with K = 5)
kbl(table_KNN5_pc, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 9: Percentages of correctly classified observations by the KNN model with K = 5", general_title = "")
Drybeans Class Barbunya Bombay Cali Dermason Horoz Seker Sira
% of Correctly Classified Observations 94.906 100 94.556 90.458 95.574 94.774 85.181
Table 9: Percentages of correctly classified observations by the KNN model with K = 5

Table 10 summarizes the percentages of correctly classified observations for the KNN model with K = 10. Compared with the KNN models with K = 1 and K = 5, the accuracy improved for some bean types and worsened for others.

# Table 10 (% of correctly classified observations by the KNN model with K = 10)
kbl(table_KNN10_pc, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 10: Percentages of correctly classified observations by the KNN model with K = 10", general_title = "")
Drybeans Class Barbunya Bombay Cali Dermason Horoz Seker Sira
% of Correctly Classified Observations 95.935 100 94.6 90.181 95.738 94.894 84.551
Table 10: Percentages of correctly classified observations by the KNN model with K = 10
# Computing the misclassification errors from the KNN models
ME_KNN1 <- 1-sum(diag(KNN1_table))/sum(KNN1_table)
ME_KNN5 <- 1-sum(diag(KNN5_table))/sum(KNN5_table)
ME_KNN10 <- 1-sum(diag(KNN10_table))/sum(KNN10_table)
ME_KNN1
## [1] 0.09453833
ME_KNN5
## [1] 0.08008817
ME_KNN10
## [1] 0.07861866
# Table 11 (Misclassification Error from KNN Methods)
kbl(table_KNN_ME, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 11: Misclassification Error from KNN Methods", general_title = "")
KNN Methods K = 1 K = 5 K = 10
Misclassification Error 0.09453833 0.07984325 0.08057801
Table 11: Misclassification Error from KNN Methods

The misclassification errors of the KNN models with K = 1, 5, and 10 are reported in Table 11. Since the model with K = 5 has the lowest error, it classified the observations most accurately among the KNN models.

3.4 Classification Tree Model

A classification tree is another method; it partitions the predictor space into a number of simple regions and predicts the categorical response based on the region an observation falls into, with the splitting rule at each node determined from the training observations. Using the rpart() function from the rpart package, I fitted a classification tree to the training set.

# classification tree (rpart() is in the rpart package)
library(rpart)
tree_drybeans <- rpart(Class ~ ., data = train_drybeans)

Figure 4 is the tree diagram showing how observations in the training set are classified based on the values of the predictor variables. The tree uses the standardized major axis length, shape factor 1, minor axis length, compactness, and roundness to separate the observations into the different types of dry beans.

# tree diagram (prp() is in the rpart.plot package)
library(rpart.plot)
prp(tree_drybeans, type = 4, cex = 0.8)

Figure 4: Tree diagram from the classification model on the dry beans dataset

# tree prediction
set.seed(1)
tree_pred <- predict(tree_drybeans, newdata=test_drybeans, type="class")
tree_table<- table(tree_pred, test_drybeans$Class)

Table 12 summarizes the percentages of correctly classified observations for the tree model. Compared with the KNN models, the accuracy for Barbunya dropped substantially, with only 72.8% of those observations predicted correctly. Like the other models, the tree model correctly classified every Bombay observation in the test set.

# Table 12 (% of correctly classified observations by the tree model)
kbl(table_tree_pc, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 12:  Percentages of correctly classified observations by the tree model", general_title = "")
Drybeans Class Barbunya Bombay Cali Dermason Horoz Seker Sira
% of Correctly Classified Observations 72.809 100 90.687 87.008 96.954 85.592 83.594
Table 12: Percentages of correctly classified observations by the tree model

The misclassification error of the tree model is 0.1305413. Since this is the largest error among the classification models considered, the tree model was outperformed by all of the other classifiers.
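
The code computing this error is not shown above; given the confusion table tree_table built earlier, it could be obtained in the same way as for the other models. A sketch:

# sketch: misclassification error of the tree model
1 - sum(diag(tree_table)) / sum(tree_table)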

3.5 Comparison with a Clustering Model (K-Means)

In addition to the classification methods above, I applied a clustering method to the dry bean dataset. Unlike classification models, clustering models group observations based on their similarity without using the class labels, so they do not classify observations in the way classification models do. For this paper, I used the K-means model. Since this method requires the number of clusters to be chosen in advance, I first selected K using a plot of the within-group (error) sum of squares of the K-means solutions for different numbers of groups. Figure 5 shows this plot. The curve begins to flatten at K = 2, so I fitted a K-means model with two clusters; however, because there are seven types of dry beans in the dataset, I also fitted a model with K = 7.

# choosing the number of clusters via the within-group (error) sum of squares
n <- nrow(drybeans_scaled_selected[,-1])
wss <- rep(0, 10)
wss[1] <- (n - 1) * sum(sapply(drybeans_scaled_selected[,-1], var)) # K = 1: total sum of squares
for (i in 2:10)
  wss[i] <- sum(kmeans(drybeans_scaled_selected[,-1], centers = i)$withinss)
plot(1:10, wss, type = "b", xlab = "Number of groups", ylab = "Within groups sum of squares")

Figure 5: The plot of error sum of squares for determining the number of clusters for the K-means model

Figure 6 below shows the cluster plots after fitting K-means models with K = 2 and K = 7 to the test set. The clustering was not very successful for this dataset, as parts of the clusters overlap one another.
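
The objects p1 and p2 combined in the next chunk are the two cluster plots; they are not created in the code shown in this paper, so the sketch below illustrates one way they could be produced, using fviz_cluster() from the factoextra package. The use of that package, the seed, and the nstart setting are all assumptions.

# sketch: fit K-means with K = 2 and K = 7 on the scaled test predictors and build the cluster plots
library(factoextra)
set.seed(1) # arbitrary seed for reproducible cluster assignments
km2 <- kmeans(test_drybeans[,-1], centers = 2, nstart = 25)
km7 <- kmeans(test_drybeans[,-1], centers = 7, nstart = 25)
p1 <- fviz_cluster(km2, data = test_drybeans[,-1], geom = "point", main = "K = 2")
p2 <- fviz_cluster(km7, data = test_drybeans[,-1], geom = "point", main = "K = 7")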

# combine the two cluster plots side by side (ggarrange() is in the ggpubr package)
library(ggpubr)
kmeans_plots_combined <- ggarrange(p1, p2, ncol = 2, nrow = 1, common.legend = FALSE) + theme(plot.margin = margin(0.1, 0.1, 1, 0.1, "cm"))
kmeans_plots_combined

Figure 6: K-means clustering with K = 2 (left) and K = 7 (right)

4 Discussion

Classification models help group observations according to their characteristics. To evaluate and compare the performance of the classification models discussed above, I compared their misclassification errors; Table 13 summarizes these errors.

# Table 13 (Summary of Classification Models)
kbl(table_summary_models, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 13: Summary of Classification Models", general_title = "")
Model LDA QDA KNN (K=1) KNN (K=5) KNN (K=10) Classification Tree
Misclassification Error 0.09527308 0.09086456 0.09453833 0.07984325 0.08057801 0.1305413
Table 13: Summary of Classification Models

Based on Table 13, the method with the smallest misclassification error is the KNN model with K = 5, so it is the best of the methods considered here for predicting the types of dry beans in this dataset. In future work, I would consider applying other classification methods and comparing their predictive accuracy with that of the models discussed in this paper.

5 References

UCI Machine Learning Repository: Dry Bean Dataset [Data set]. (n.d.). Retrieved December 18, 2022, from https://archive.ics.uci.edu/ml/datasets/dry+bean+dataset