Multi-Classification of Dry Beans Using Machine Learning

## Libraries
library(knitr)
library(dplyr)
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(corrplot)
library(class)
library(caret)
library(rpart)
library(maptree)
library(pca3d)
library(MASS)
library(tibble)
library(randomForest)
library(e1071)
library(kernlab)
library(kableExtra)
library(gridExtra)
library(reshape)
library(GGally)
library(readr)
library(readxl)
library(float)
library(patchwork)
library(ggeasy)
theme_set(theme_classic())
#load data
labeled <- read.csv('labeled.csv') %>% dplyr::select(-X)
sampA <- read.csv('samp.A.csv')%>% dplyr::select(-X)
sampB <- read.csv('samp.B.csv')%>% dplyr::select(-X)
sampC <- read.csv('samp.C.csv')%>% dplyr::select(-X)
#convert Class into factor
labeled$Class <- as.factor(labeled$Class)

#set up a new variable 'Roundness'
#Roundness = 4*Area*pi/(perimeter)^2 (refer to the dry bean paper)
roundness <- 4*pi*labeled$Area/(labeled$Perimeter)^2
labeled <- add_column(labeled, Roundness = roundness, .after = 7)
sampA$Roundness <- 4*pi*sampA$Area/(sampA$Perimeter)^2
sampB$Roundness <- 4*pi*sampB$Area/(sampB$Perimeter)^2
sampC$Roundness <- 4*pi*sampC$Area/(sampC$Perimeter)^2

#check for duplicate rows
dup.rows = sum(labeled%>%duplicated(), sampA%>%duplicated(),
               sampB%>%duplicated(),sampC%>%duplicated())

Photo Credits: nitsuki

Aim of the Project

Selecting the best seed species is one of the main concerns for both bean producers and the market. Since different genotypes are cultivated worldwide, it is important to separate the best seed variety from the mixed dry bean population; otherwise, the market value of these mixed species of beans could drop enormously (Varankaya & Ceyhan, 2012).

The aim of this project is to develop a supervised machine learning algorithm to perform a multi-classification of dry bean species harvested from a population cultivation on a single farm. A supervised machine learning algorithm is one that relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data.

Exploratory Data Analysis

There are two datasets used in this project: the ‘labeled’ and the ‘unlabeled’ dataset. The labeled dataset, which contains the classes of the dry beans, is used to train the various ML models and is therefore also referred to as the training dataset. It contains 3000 observations and 8 variables. The dependent variable has 6 levels (classes): BOMBAY, CALI, DERMASON, HOROZ, SEKER, and SIRA; these classes represent the different varieties of dry beans. In the training set, each class has 500 observations.

The unlabeled dataset is made up of three separate samples: Sample A, B, and C. It has 7 variables, which do not include the Class variable (hence unlabeled), and 3131 observations in total. Roundness, which measures how closely the shape of a bean approaches a perfect circle, is calculated and added as an additional predictor variable to both the labeled and unlabeled datasets (Koklu & Ozkan, 2020). Below are the summary statistics of the variables in the training (labeled) data.

Summary of Data

summary.stats <- round(as.data.frame((labeled[,-9])%>%
                                       psych::describe())%>%
                         dplyr::select(n,mean, sd, median, min, max, range, se), 2)

kbl(summary.stats, caption="Statistical distribution of features of dry beans varieties (in pixels) - Label")%>%
  kable_classic(full_width = F, html_font = "Cambria", font_size = 12)
Statistical distribution of features of dry beans varieties (in pixels) - Label

| Variable | n | mean | sd | median | min | max | range | se |
|---|---|---|---|---|---|---|---|---|
| Area | 3000 | 69874.98 | 49578.52 | 48714.50 | 20645.00 | 251320.00 | 230675.00 | 905.18 |
| Perimeter | 3000 | 1012.24 | 347.75 | 941.90 | 384.17 | 2164.10 | 1779.93 | 6.35 |
| MajorAxisLength | 3000 | 362.05 | 124.52 | 332.90 | 161.52 | 740.97 | 579.45 | 2.27 |
| MinorAxisLength | 3000 | 225.19 | 73.35 | 202.73 | 106.00 | 473.39 | 367.39 | 1.34 |
| Eccentricity | 3000 | 0.76 | 0.10 | 0.77 | 0.30 | 0.94 | 0.64 | 0.00 |
| ConvexArea | 3000 | 70944.12 | 50382.27 | 50807.50 | 8912.00 | 259965.00 | 251053.00 | 919.85 |
| Extent | 3000 | 0.75 | 0.05 | 0.77 | 0.57 | 0.85 | 0.28 | 0.00 |
| Roundness | 3000 | 0.84 | 0.29 | 0.77 | 0.39 | 2.06 | 1.66 | 0.01 |

The variables Area and ConvexArea had the largest range in all four datasets. Because there are large differences in the ranges of the variables, variables with larger ranges can dominate those with smaller ranges, which may lead to biased results; it is therefore necessary to transform/scale these variables before fitting distance-based models such as K-Nearest Neighbors (KNN) and Support Vector Machines (SVM).
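
As a minimal sketch of this scaling (assuming the labeled data loaded above), base R's scale() standardizes each predictor to mean 0 and standard deviation 1; the same idea is applied before the KNN models later on.

# Minimal sketch: standardize the numeric predictors of the labeled data so
# that no single feature dominates distance calculations in KNN or SVM.
scaled.predictors <- as.data.frame(scale(labeled %>% dplyr::select(-Class)))

# each column should now have mean ~0 and standard deviation 1
round(colMeans(scaled.predictors), 2)
round(apply(scaled.predictors, 2, sd), 2)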

Class Covariances

Below are the class variances, which measure variability around the mean. The variance of each variable by class shows evidence of non-constant variance. Based on the approximately normal distributions and the non-constant variance evident in the data, I expect models that assume distinct covariances among classes, such as Quadratic Discriminant Analysis (QDA), to perform well.

var.tab1 <- labeled%>%group_by(Class)%>%
  summarize(Area=var(Area),Perimeter=var(Perimeter),
            Maj.Axis.=var(MajorAxisLength),Min.Axis.=var(MinorAxisLength), 
            Eccentricity=var(Eccentricity), var.ConvexArea=var(ConvexArea), 
            Extent=var(Extent), Roundness=var(Roundness))


kbl(var.tab1, caption = "Variance of Variable by Class")%>%
  kable_classic(full_width = F, html_font = "Cambria", font_size = 12)
Variance of Variable by Class

| Class | Area | Perimeter | Maj.Axis. | Min.Axis. | Eccentricity | var.ConvexArea | Extent | Roundness |
|---|---|---|---|---|---|---|---|---|
| BOMBAY | 552697974 | 32015.80 | 3177.8992 | 826.6840 | 0.5472634 | 259965 | 0.8502428 | 1.262811 |
| CALI | 91528410 | 26272.79 | 1188.1780 | 491.4581 | 0.6183656 | 117510 | 0.8427527 | 1.512446 |
| DERMASON | 24651963 | 22913.81 | 696.7121 | 498.7684 | 0.5494947 | 56174 | 0.8471957 | 2.055745 |
| HOROZ | 56765885 | 24960.58 | 1252.4894 | 456.5644 | 0.7227374 | 82462 | 0.8420894 | 1.620064 |
| SEKER | 22567179 | 24170.41 | 736.8507 | 419.4165 | 0.3006355 | 65674 | 0.8183099 | 1.990205 |
| SIRA | 22641401 | 23977.06 | 782.4715 | 401.8085 | 0.6098838 | 73945 | 0.8418021 | 1.718602 |
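
As a quick sketch (assuming the labeled data frame above), the full covariance matrix of the predictors can also be computed within each class; the differences among these matrices are exactly what QDA exploits.

# Illustrative sketch: per-class covariance matrices of the predictors.
# QDA estimates one such matrix per class instead of pooling them.
class.covs <- lapply(split(labeled %>% dplyr::select(-Class), labeled$Class), cov)

# inspect, e.g., the covariance matrix for BOMBAY
round(class.covs[["BOMBAY"]], 2)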

Histograms and Violin Plots

The histograms from the labeled data (Figure 1) show evidence of multimodal behavior in the variables, which means that at least one of the classes of beans is very distinct from the others. Upon further checks, I found that this multimodality is caused by the BOMBAY bean type in both the labeled set and Sample A, but not in Samples B and C. This implies that there may be very few predictions of BOMBAY for Samples B and C.

theme_set(theme_classic())

a <- labeled%>%ggplot() + 
  geom_histogram(aes(x=Area/1000), fill="brown", col="white") + 
  labs(title = "Histogram  of Area") + xlab("Area(in 1000s)") + ggeasy::easy_center_title()
    
b <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=Area/10000), col="brown")  + 
  labs(title = "Violin plot of Class vs Area") + 
  geom_violin(aes(x=Class, y=Area/10000), alpha=0.4) + ylab("Area") + ggeasy::easy_center_title()


c <- labeled%>%ggplot() +
  geom_histogram(aes(x=Perimeter/10), fill="brown", col="white")+xlab("Peri")
    
d <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=Perimeter/10), col="brown")   + 
  geom_violin(aes(x=Class, y=Perimeter/10), alpha=0.4) + ylab("Peri")




g <- labeled%>%ggplot() +
  geom_histogram(aes(x=MinorAxisLength), fill="brown", col="white") + xlab("Min.Axis")
    
h <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=MinorAxisLength), col="brown")   + 
  geom_violin(aes(x=Class, y=MinorAxisLength), alpha=0.4) + ylab("Min.Axis")


i <- labeled%>%ggplot() +
  geom_histogram(aes(x=ConvexArea), fill="brown", col="white") + xlab("Co.Area")
    
j <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=ConvexArea), col="brown")   + 
  geom_violin(aes(x=Class, y=ConvexArea), alpha=0.4) + ylab("Co.Area")



k <- labeled%>%ggplot() +
  geom_histogram(aes(x=Eccentricity), fill="brown", col="white") + xlab("Eccen")
    
l <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=Eccentricity), col="brown")   + 
  geom_violin(aes(x=Class, y=Eccentricity), alpha=0.4) + ylab("Eccen")


m <- labeled%>%ggplot() +
  geom_histogram(aes(x=Extent), fill="brown", col="white") + xlab("Extent")
    
n <- labeled%>%ggplot() + 
  stat_summary(aes(x=Class, y=Extent), col="brown")   + 
  geom_violin(aes(x=Class, y=Extent), alpha=0.4) + ylab("Extent")




img <- ((a + b) + plot_layout(widths = c(1, 2)))/
  ((c+d)+ plot_layout(widths = c(1, 2)))/ 
  ((g+h)+ plot_layout(widths = c(1, 2)))/
  ((i+j)+ plot_layout(widths = c(1, 2)))/
  ((k+l)+ plot_layout(widths = c(1, 2)))/
  ((m+n)+ plot_layout(widths = c(1, 2)))
 # plot_annotation(tag_levels = "A",tag_suffix = ")")

ggsave(filename = "explo1.png",
 plot = img,
 width = 10, height = 8,
 dpi = 700)  

Figure 1: Exploratory analysis of the labeled dataset

The violin plots from the labeled data show that BOMBAY and CALI beans are very distinct from the other beans. The average points suggest that Roundness and Extent are strong predictors for SEKER, and that Eccentricity is a good predictor of HOROZ. The violin plots for each class show that most of the class distributions are approximately normal, except for the distributions of Roundness and Extent. From these distributions, we expect BOMBAY and CALI to be easily predicted by our models.

Correlation of Variables

Most of the variables, except for Eccentricity, Extent, and Roundness, are highly correlated with one another. This behavior is also seen in the correlations of the variables within each class (diagram not shown).

library(ellipse)
library(RColorBrewer)

# Build a panel of 100 colors with RColorBrewer
my_colors <- brewer.pal(3, "Accent")
my_colors <- colorRampPalette(my_colors)(100)


corrplot(cor(labeled%>% dplyr::select(-Class)), method = 'ellipse', type = "lower", outline = TRUE, tl.col = "brown", col=my_colors)
Correlation plot

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a technique for reducing the dimensionality of a dataset, increasing interpretability while minimizing the loss of information. It does this by creating new uncorrelated variables (the principal components) that successively maximize variance; each component is a linear combination of the initial variables. The principal component analysis below indicates that the first 3 components explain more than 90% of the variance in the dataset, and the first four components explain almost all of it.

#####pca#####
library(factoextra)
pca.labeled <- prcomp(labeled %>% dplyr::select(-Class), scale = TRUE)
pca.sampA <- prcomp(sampA, scale = TRUE)
pca.sampB <- prcomp(sampB, scale = TRUE)
pca.sampC <- prcomp(sampC, scale = TRUE)

cum.var <- data.frame(PC = factor(paste0("PC", 1:8), levels = paste0("PC", 1:8)),
                      cumvar = cumsum(pca.labeled$sdev^2 / sum(pca.labeled$sdev^2)))


pc <- fviz_eig(pca.labeled, 
         choice = c("variance"), 
         ggtheme = theme_classic(), 
         geom = c("bar"), 
         barfill = "brown", 
         barcolor = "black",  
         main = "Principal Component Plot",
         xlab = "Principal Components") + ggeasy::easy_center_title()

cum <- ggplot(cum.var, aes(x=PC, y=cumvar)) + 
  geom_point(col = "brown", size=2) + 
  geom_hline(yintercept=0.9, linetype="dashed", color = "brown", size=1) +
  labs(y="Cumulative Var. Explained",
  x ="Principal Components",
  title="Cumulative Variance Explained") + ggeasy::easy_center_title()


(pc + cum) + plot_layout(widths = c(1, 2))
Principal Component Analysis

## Construct labeled.sc dataset and pca dataset

#construct scaled labeled data
labeled.sc <- as.data.frame(scale(labeled %>% dplyr::select(-Class)))
labeled.sc$Class <- labeled$Class

#construct pca label data
labeled.pca <- as.data.frame(pca.labeled$x)
labeled.pca$Class <- labeled$Class

Leave-One-Out Cross Validation (LOOCV)

Leave-one-out cross-validation is a special case of cross-validation in which the number of folds equals the number of instances in the dataset. The learning algorithm is applied once for each instance, using all other instances as the training set and the selected instance as a single-item test set; this is done iteratively for all n observations in the data. LOOCV is less biased and reduces the chance of overestimating the test error rate compared with other validation approaches.
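
As a minimal sketch of the idea (slow if actually run on all 3000 rows, and shown here with LDA only as a placeholder classifier), LOOCV can be written as an explicit loop in which each observation is held out once.

# Illustrative LOOCV sketch (assumes the labeled data frame above):
# refit the model n times, each time predicting the single held-out row.
n.obs <- nrow(labeled)
loocv.pred <- factor(rep(NA, n.obs), levels = levels(labeled$Class))
for (i in 1:n.obs) {
  fit <- MASS::lda(Class ~ ., data = labeled[-i, ])
  loocv.pred[i] <- predict(fit, newdata = labeled[i, ])$class
}
mean(loocv.pred == labeled$Class)  # LOOCV estimate of prediction accuracy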

K-Nearest Neighbors (KNN)

The KNN algorithm assumes that similar things are near to each other. To select K (the number of neighbors around a particular data point), we run the KNN algorithm several times with different values of K and choose the K that gives the highest leave-one-out cross-validation (LOOCV) average prediction accuracy.

#ctrl <- trainControl(method = "LOOCV")
#
#knn.train <- train(
#  Class ~.,
#  data = labeled,
#  method = "knn",
#  tuneGrid = data.frame(k=1:20),
#  trControl = ctrl,
#  ## Center and scale the predictors for the training
#  ## set and all future samples.
#  preProc = c("scale")
#)
#ggplot(knn.train) + labs(y="Accuracy(LOOCV)",
#  x ="Numb of Neighbors",
#  title="Best Number of Neighbors")

## KNN 


set.seed(12345)
Acc.lab <- NULL
Acc.pca <- NULL
for (k in 1:20) {
  knn.lab <- knn.cv(labeled.sc[,1:8], cl=labeled.sc$Class, k)
  knn.pca <- knn.cv(labeled.pca[,1:4], cl=labeled.pca$Class, k)
  Acc.lab[k] <- mean(knn.lab==labeled.sc$Class)
  Acc.pca[k] <- mean(knn.pca==labeled.pca$Class)
}
AC <- tibble(k=1:20, ACC.lab=Acc.lab, ACC.pca=Acc.pca)

k.lab <- which(Acc.lab==max(AC$ACC.lab)) # result is 16
k.pca <- which(Acc.pca==max(AC$ACC.pca)) # result is 15



ggplot(AC) + 
  geom_line(aes(x=k, y=ACC.lab), color = "brown", size=1) +
  geom_point(aes(x=k, y=ACC.lab), color = "black", size=3) + 
  geom_vline(xintercept = k.lab[1], linetype="dashed", 
             color = "brown", size=1)+
  labs(y="Accuracy(LOOCV)",
  x ="Number of Neighbors",
  title="Optimal Number of Neighbors")+ 
  ggeasy::easy_center_title()

optk.pca <- ggplot(AC) + 
  geom_line(aes(x=k, y=ACC.pca), color = "brown", size=1) +
  geom_point(aes(x=k, y=ACC.pca), color = "black", size=3) + 
  geom_vline(xintercept = k.pca[1], linetype="dashed", 
             color = "brown", size=1)+
  labs(y="Accuracy(LOOCV)",
  x ="Number of Neighbors",
  title="Best Number of Neighbors")+ 
  ggeasy::easy_center_title()

knn <- knn.cv(labeled.sc[,1:8],
                     cl=labeled.sc$Class, k=min(k.lab))



conf <- confusionMatrix(data = knn, reference = labeled$Class)

knn.acc <- conf$overall[1]

The optimal number of nearest neighbors is K = 16.
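
Although the final classification below uses QDA, the tuned KNN (k = 16) could in principle be applied to the unlabeled samples as well; the following is only a hedged sketch, which scales the samples with the training-set means and standard deviations.

# Hedged sketch (not the final model): classify the unlabeled samples with
# the tuned KNN, after putting them on the same scale as the training data.
train.X  <- labeled %>% dplyr::select(-Class)
centers  <- colMeans(train.X)
spreads  <- apply(train.X, 2, sd)

samp.all <- rbind(sampA, sampB, sampC)
samp.sc  <- scale(samp.all, center = centers, scale = spreads)

knn.samp <- class::knn(train = scale(train.X), test = samp.sc,
                       cl = labeled$Class, k = min(k.lab))
table(knn.samp)  # predicted class counts for the unlabeled samples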

Quadratic Discriminant Analysis (QDA)

Quadratic discriminant analysis is a general form of Bayesian discrimination. It is quite similar to linear discriminant analysis, except that the assumption that all classes share a common covariance matrix is relaxed, so each class is allowed its own covariance matrix.

#ctrl <- trainControl(method = "LOOCV")
#
#qda <- train(
#  Class ~.,
#  data = labeled,
#  method = "qda",
#  #tuneGrid = data.frame(k=1:20),
#  trControl = ctrl,
#  ## Center and scale the predictors for the training
#  ## set and all future samples.
#  #preProc = c("scale")
#)

qda <- qda(Class~., data = labeled, CV = TRUE)


qda.conf <- confusionMatrix(data = qda$class, reference = labeled$Class)


qda.acc <- qda.conf$overall[1]

Random Forest

The random forest classifier combines a large number of individual decision trees; each tree produces a class prediction, and the class with the most votes becomes the model’s prediction. The out-of-bag (OOB) error is used to select the optimal number of variables considered in each tree (mtry).

#ctrl <- trainControl(method = "LOOCV", number = 1)
#
#rf <- train(
#  Class ~.,
#  data = labeled,
#  method = "rf",
#  tuneGrid = data.frame(mtry=1:(ncol(labeled)-1)),
#  trControl = ctrl,
#  ## Center and scale the predictors for the training
#  ## set and all future samples.
#  #preProc = c("scale")
#)
#ggplot(rf)

# all variables
set.seed(12345)
n <- ncol(labeled) - 1
errRate <- numeric(n)
for (i in 1:n){  
  m <- randomForest(Class~., data=labeled, mtry=i)  
  errRate[i] <- mean(m$err.rate)  
}  
#a= which.min(errRate) 




rf_sum <- data.frame(cbind(mtry=1:8, error_rate=errRate, acc = 1-errRate)) 

# my result is 2
mtry_best <- ggplot(rf_sum, aes(x=mtry, y=acc)) + 
  geom_vline(xintercept = which.min(rf_sum$error_rate), linetype="dashed", color = "brown", size=1)+
  geom_line(color = "brown", size=1)+
  geom_point(color = "black", size=3) +
  labs(y="Accuracy (OOB)",
  x ="Number of Variables",
  title="Optimal Number of Variables")
  
#rf.model <- 
  ((mtry_best + ~plot(m, main = 'Bagging Error'))) + plot_layout(widths = c(1, 2))

#confusionMatrix(labeled$Class, m$predicted)

rff <- randomForest(Class~., data = labeled, mtry=2)

rf.conf <- confusionMatrix(data = rff$predicted, reference = labeled$Class)

rf.acc <- rf.conf$overall[1]

#ggsave(filename = "rf.model.png",
# width = 10, height = 7,
# dpi = 700) 

The optimal number of variables to consider in each tree of the random forest is 2. The out-of-bag error rates for each class of dry beans indicate that the error levels out after about 200 trees; 500 trees is used as the number of trees (ntree) in the final model.
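
As a short sketch of that choice (essentially the final fit above with ntree stated explicitly), the OOB error curve can be inspected directly.

# Sketch: refit the final forest with ntree stated explicitly and check
# that the OOB error has stabilized well before 500 trees.
set.seed(12345)
rf.500 <- randomForest(Class ~ ., data = labeled, mtry = 2, ntree = 500)
plot(rf.500, main = "OOB error vs. number of trees")
tail(rf.500$err.rate[, "OOB"], 1)  # OOB error at ntree = 500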

Model Performance

There is very little difference in the performance of the three models. The QDA model performs slightly better, with a higher accuracy than the KNN and random forest models. Any of these models could be used as the final model.

acc.data <- data.frame(cbind(Model=c("KNN", "QDA", "RF"), rbind(table(knn), table(qda$class), table(rff$predicted)), Accuracy=round(rbind(knn.acc, qda.acc, rf.acc),3)), row.names = NULL )
#acc.data
model.plot <- ggplot(acc.data, aes(x=Model, y=Accuracy)) + geom_col(width = .60, aes(fill=Model), show.legend = F) + ggeasy::easy_center_title()


(wrap_elements(gridExtra::tableGrob(acc.data)) / model.plot ) + plot_layout(widths = c(1, 2)) + plot_annotation(
 title = "Model Prediction Counts and Accuracy")

pred.label.dat <- data.frame(class = qda$class,
                             Eccentricity = labeled$Eccentricity,
                             Extent = labeled$Extent)

Visualizing Best Selected Model

The diagram below shows that the QDA model performs very well on the training dataset; its LOOCV classifications are very similar to the true labels.

grid.arrange(
ggplot(labeled)+geom_point(aes(x=Extent, y=Eccentricity,col=Class), size=.5)+ labs(title = "True Labeled")+ 
  theme(legend.position="bottom") + ggeasy::easy_center_title(),
ggplot(pred.label.dat)+geom_point(aes(x=Extent, y=Eccentricity,col=class), size=.5)+ labs(title = "LOOCV qda Labeled") + theme(legend.position="bottom") + ggeasy::easy_center_title(), ncol=2)
Final selected model (QDA)

Results (Classification of Sample Data)

The class predictions for the unlabeled samples are very similar in pattern to the true labeled data.

sample_data <- data.frame(rbind(sampA, sampB, sampC))

qda <- qda(Class~., data = labeled, CV = FALSE)
  


qda_predict <- predict(qda, newdata = sample_data)

#table(qda_predict$class)


pred.samp.dat <- as.data.frame(cbind("class"=qda_predict$class,sample_data))
grid.arrange(
ggplot(labeled)+geom_point(aes(x=Extent, y=Eccentricity,col=Class), size=.5)+ labs(title = "True Labeled")+ 
  theme(legend.position="bottom") + ggeasy::easy_center_title(),
ggplot(pred.samp.dat)+geom_point(aes(x=Extent, y=Eccentricity,col=class), size=.5)+ labs(title = "Unlabeled") + theme(legend.position="bottom") + ggeasy::easy_center_title(), ncol=2)

Acknowledgements

I would like to give credit to the following persons for their valuable contributions to all aspects of this project: Jie Hu, Kenneth Annan, Siyi Lui, and Iftekhar Chowdhury.

References

  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.

  2. Heuzé V., Tran G., Nozière P., & Lebas F. (2015). Common Bean (Phaseolus vulgaris), Feedipedia.org – Animal Feed Resources Information System – A programme by INRA, CIRAD, AFZ and FAO, http://www.feedipedia.org/node/266 (accessed on 29 April 2021).

  3. Koklu, M., & Ozkan, I. A. (2020). Multiclass classification of dry beans using computer vision and machine learning techniques. Computers and Electronics in Agriculture, 174, 105507. doi:10.1016/j.compag.2020.105507

  4. Varankaya, S., & Ceyhan, E. (2012). Problems Encountered in Bean Farming in the Central Anatolia Region and Solution Suggestions. Selçuk Tarım Bilim. Journal. 26, 15–26.

  5. https://en.m.wikipedia.org/wiki/Sensitivity_and_specificity

  6. https://www.geeksforgeeks.org/loocvleave-one-out-cross-validation-in-r-programming/

  7. https://towardsdatascience.com/what-is-out-of-bag-oob-score-in-random-forest-a7fa23d710

  8. https://alekhyo.medium.com/interview-questions-on-svm-bf13e5fbcca8