Multi-Classification of Dry Beans Using Machine Learning
## Libraries
library(knitr)
library(dplyr)
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(corrplot)
library(class)
library(caret)
library(rpart)
library(maptree)
library(pca3d)
library(MASS)
library(tibble)
library(randomForest)
library(e1071)
library(kernlab)
library(kableExtra)
library(gridExtra)
library(reshape)
library(GGally)
library(readr)
library(readxl)
library(float)
library(patchwork)
library(ggeasy)
theme_set(theme_classic())
#load data
labeled <- read.csv('labeled.csv') %>% dplyr::select(-X)
sampA <- read.csv('samp.A.csv')%>% dplyr::select(-X)
sampB <- read.csv('samp.B.csv')%>% dplyr::select(-X)
sampC <- read.csv('samp.C.csv')%>% dplyr::select(-X)
#convert Class into factor
labeled$Class <- as.factor(labeled$Class)
#set up a new variable 'Roundness'
#Roundness = 4*Area*pi/(perimeter)^2 (refer to the dry bean paper)
Roundness <- 4*pi*labeled$Area/(labeled$Perimeter)^2
labeled <- add_column(labeled, Roundness = Roundness, .after = 7)
sampA$Roundness <- 4*pi*sampA$Area/(sampA$Perimeter)^2
sampB$Roundness <- 4*pi*sampB$Area/(sampB$Perimeter)^2
sampC$Roundness <- 4*pi*sampC$Area/(sampC$Perimeter)^2
#check for duplicate rows
dup.rows = sum(labeled%>%duplicated(), sampA%>%duplicated(),
sampB%>%duplicated(), sampC%>%duplicated())
Photo Credits: nitsuki
Aim of the Project
Selecting the best seed species is one of the main concerns for both bean producers and the market. Since different genotypes are cultivated worldwide, it is important to separate the best seed variety from a mixed dry bean population; otherwise, the market value of these mixed bean species could drop enormously (Varankaya & Ceyhan, 2012).
The aim of this project is to develop a supervised machine learning algorithm to perform multi-class classification of dry bean species harvested from population cultivation on a single farm. A supervised machine learning algorithm is one that relies on labeled input data to learn a function that produces an appropriate output when given new, unlabeled data.
Exploratory Data Analysis
There are two datasets used in this project: the 'labeled' and the 'unlabeled' dataset. The labeled dataset, which contains the classes of the dry beans, will be used to train the various ML models and is therefore also referred to as the training dataset. The training dataset contains 3000 observations and 8 variables. The dependent variable has 6 levels (classes): BOMBAY, CALI, DERMASON, HOROZ, SEKER, and SIRA; these classes represent the different varieties of dry beans. In the training set, each class has 500 observations.
The unlabeled dataset is made up of three separate samples: Sample A, Sample B, and Sample C. It has 7 variables and does not include the Class variable, hence 'unlabeled'. The three samples contain 3131 observations in total. Roundness, which measures how closely the shape of a bean approaches a perfect circle, is calculated and added as an additional predictor variable to both the labeled and unlabeled datasets (Koklu & Ozkan, 2020). Below are the summary statistics of the variables in the training (labeled) data.
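Following the Roundness calculation in the code above (and Koklu & Ozkan, 2020), the new feature is defined as
$$\text{Roundness} = \frac{4\pi \cdot \text{Area}}{\text{Perimeter}^2},$$
which equals 1 for a perfect circle and decreases as the bean shape becomes more elongated.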
Summary of Data
summary.stats <- round(as.data.frame((labeled[,-9])%>%
psych::describe())%>%
dplyr::select(n,mean, sd, median, min, max, range, se), 2)
kbl(summary.stats, caption="Statistical distribution of features of dry beans varieties (in pixels) - Label")%>%
kable_classic(full_width = F, html_font = "Cambria", font_size = 12)

| Variable | n | mean | sd | median | min | max | range | se |
|---|---|---|---|---|---|---|---|---|
| Area | 3000 | 69874.98 | 49578.52 | 48714.50 | 20645.00 | 251320.00 | 230675.00 | 905.18 |
| Perimeter | 3000 | 1012.24 | 347.75 | 941.90 | 384.17 | 2164.10 | 1779.93 | 6.35 |
| MajorAxisLength | 3000 | 362.05 | 124.52 | 332.90 | 161.52 | 740.97 | 579.45 | 2.27 |
| MinorAxisLength | 3000 | 225.19 | 73.35 | 202.73 | 106.00 | 473.39 | 367.39 | 1.34 |
| Eccentricity | 3000 | 0.76 | 0.10 | 0.77 | 0.30 | 0.94 | 0.64 | 0.00 |
| ConvexArea | 3000 | 70944.12 | 50382.27 | 50807.50 | 8912.00 | 259965.00 | 251053.00 | 919.85 |
| Extent | 3000 | 0.75 | 0.05 | 0.77 | 0.57 | 0.85 | 0.28 | 0.00 |
| Roundness | 3000 | 0.84 | 0.29 | 0.77 | 0.39 | 2.06 | 1.66 | 0.01 |
The variables Area and ConvexArea had the largest range in all four datasets. There are large differences in the ranges of the variables; variables with larger ranges can dominate those with smaller ranges, which may lead to biased results. It is therefore necessary to scale these variables before fitting distance-based models such as K-Nearest Neighbors and SVM.
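As a minimal sketch of that scaling step (assuming the labeled and sampA objects created above; num.cols, train.center, train.scale, labeled.std, and sampA.sc are illustrative names not used elsewhere in this report), an unlabeled sample would be standardized with the training set's means and standard deviations rather than its own:
#Standardize the predictors using the training (labeled) data's parameters,
#then apply the same centering and scaling to an unlabeled sample
num.cols <- setdiff(names(labeled), "Class")
train.center <- sapply(labeled[, num.cols], mean)
train.scale <- sapply(labeled[, num.cols], sd)
labeled.std <- as.data.frame(scale(labeled[, num.cols], center = train.center, scale = train.scale))
sampA.sc <- as.data.frame(scale(sampA[, num.cols], center = train.center, scale = train.scale))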
Class Covariances
Below are the class variances, which measure variability around the class means. The variance of each variable by class shows evidence of non-constant variance. Given the approximately normal distributions and the non-constant variance evident in the data, I expect models that assume distinct covariance matrices among classes, such as Quadratic Discriminant Analysis (QDA), to perform well.
var.tab1 <- labeled%>%group_by(Class)%>%
summarize(Area=var(Area),Perimeter=var(Perimeter),
Maj.Axis.=var(MajorAxisLength),Min.Axis.=var(MinorAxisLength),
Eccentricity=var(Eccentricity), ConvexArea=var(ConvexArea),
Extent=var(Extent), Roundness=var(Roundness))
kbl(var.tab1, caption = "Variance of Variable by Class")%>%
kable_classic(full_width = F, html_font = "Cambria", font_size = 12)

| Class | Area | Perimeter | Maj.Axis. | Min.Axis. | Eccentricity | ConvexArea | Extent | Roundness |
|---|---|---|---|---|---|---|---|---|
| BOMBAY | 552697974 | 32015.80 | 3177.8992 | 826.6840 | 0.5472634 | 259965 | 0.8502428 | 1.262811 |
| CALI | 91528410 | 26272.79 | 1188.1780 | 491.4581 | 0.6183656 | 117510 | 0.8427527 | 1.512446 |
| DERMASON | 24651963 | 22913.81 | 696.7121 | 498.7684 | 0.5494947 | 56174 | 0.8471957 | 2.055745 |
| HOROZ | 56765885 | 24960.58 | 1252.4894 | 456.5644 | 0.7227374 | 82462 | 0.8420894 | 1.620064 |
| SEKER | 22567179 | 24170.41 | 736.8507 | 419.4165 | 0.3006355 | 65674 | 0.8183099 | 1.990205 |
| SIRA | 22641401 | 23977.06 | 782.4715 | 401.8085 | 0.6098838 | 73945 | 0.8418021 | 1.718602 |
Histograms and Violin Plots
The histograms from the labeled data (Figure 1) show evidence of multimodal behavior in the variables, which means that at least one of the bean classes is very distinct from the others. Upon further checks, I found that this multimodality is caused by the BOMBAY bean type in both the labeled set and Sample A, but not in Samples B and C. This multimodal behavior implies that there might be very few BOMBAY predictions for Samples B and C.
theme_set(theme_classic())
a <- labeled%>%ggplot() +
geom_histogram(aes(x=Area/1000), fill="brown", col="white") +
labs(title = "Histogram of Area") + xlab("Area(in 1000s)") + ggeasy::easy_center_title()
b <- labeled%>%ggplot() +
stat_summary(aes(x=Class, y=Area/10000), col="brown") +
labs(title = "Boxplot of class vs Area") +
geom_violin(aes(x=Class, y=Area/10000), alpha=0.4) + ylab("Area") + ggeasy::easy_center_title()
c <- labeled%>%ggplot() +
geom_histogram(aes(x=Perimeter/10), fill="brown", col="white")+xlab("Peri")
d <- labeled%>%ggplot() +
stat_summary(aes(x=Class, y=Perimeter/10), col="brown") +
geom_violin(aes(x=Class, y=Perimeter/10), alpha=0.4) + ylab("Peri")
g <- labeled%>%ggplot() +
geom_histogram(aes(x=MinorAxisLength), fill="brown", col="white") + xlab("MIX.Length")
h <- labeled%>%ggplot() +
stat_summary(aes(x=Class, y=MinorAxisLength), col="brown") +
geom_violin(aes(x=Class, y=MinorAxisLength), alpha=0.4) + ylab("MIX.Length")
i <- labeled%>%ggplot() +
geom_histogram(aes(x=ConvexArea), fill="brown", col="white") + xlab("Co.Area")
j <- labeled%>%ggplot() +
stat_summary(aes(x=Class, y=ConvexArea), col="brown") +
geom_violin(aes(x=Class, y=ConvexArea), alpha=0.4) + ylab("Co.Area")
k <- labeled%>%ggplot() +
geom_histogram(aes(x=Eccentricity), fill="brown", col="white") + xlab("Eccen")
l <- labeled%>%ggplot() +
stat_summary(aes(x=Class, y=Eccentricity), col="brown") +
geom_violin(aes(x=Class, y=Eccentricity), alpha=0.4) + ylab("Eccen")
m <- labeled%>%ggplot() +
geom_histogram(aes(x=Extent), fill="brown", col="white") + xlab("Extent")
n <- labeled%>%ggplot() +
stat_summary(aes(x=Class, y=Extent), col="brown") +
geom_violin(aes(x=Class, y=Extent), alpha=0.4) + ylab("Extent")
img <- ((a + b) + plot_layout(widths = c(1, 2)))/
((c+d)+ plot_layout(widths = c(1, 2)))/
((g+h)+ plot_layout(widths = c(1, 2)))/
((i+j)+ plot_layout(widths = c(1, 2)))/
((k+l)+ plot_layout(widths = c(1, 2)))/
((m+n)+ plot_layout(widths = c(1, 2)))
# plot_annotation(tag_levels = "A",tag_suffix = ")")
ggsave(filename = "explo1.png",
width = 10, height = 8,
dpi = 700) The violin plots from the labeled data show that BOMBAY and CALI beans are very distinct from the other beans. It can be seen from the average points that Roundness and Extent seems to be a strong predictor for the SEKER. Eccentricity seems to be a good predictor to HOROZ. The violin plots for each class shows that most of the class distributions are approximately normal except for the distributions for Roundness and Extent. From these distributions, we expect BOMBAY and CALI to be easily predicted by our models.
Correlation of Variables
Most of the variables except for Eccentricity, Extent, and Roundness, are highly correlated with each other. This behavior is also seen in the correlation of the variables by classes (diagram not shown).
library(ellipse)
library(RColorBrewer)
# Build a panel of 100 colors with RColorBrewer
my_colors <- brewer.pal(3, "Accent")
my_colors <- colorRampPalette(my_colors)(100)
corrplot(cor(labeled%>% dplyr::select(-Class)), method = 'ellipse', type = "lower", outline = TRUE, tl.col = "brown", col=my_colors)
Correlation plot
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a technique for reducing the dimensionality of datasets, increasing interpretability while minimizing the loss of information. It does this by creating new, uncorrelated variables that successively maximize variance; these principal components are constructed as linear combinations of the initial variables. The principal component analysis below indicates that the first 3 components explain more than 90% of the total variance in the dataset, and the first four components explain almost all of the variance.
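As a quick check of these proportions (a minimal sketch; pca.labeled is the same prcomp object fitted in the chunk below), the importance table returned by summary() on a prcomp object reports the proportion of variance and the cumulative proportion per component:
#Explained-variance summary straight from the prcomp fit
pca.labeled <- prcomp(labeled %>% dplyr::select(-Class), scale = TRUE)
round(summary(pca.labeled)$importance, 3)
#rows: Standard deviation, Proportion of Variance, Cumulative Proportion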
#####pca#####
library(factoextra)
pca.labeled <- prcomp(labeled %>% dplyr::select(-Class), scale = TRUE)
pca.sampA <- prcomp(sampA, scale = TRUE)
pca.sampB <- prcomp(sampB, scale = TRUE)
pca.sampC <- prcomp(sampC, scale = TRUE)
cum.var <- data.frame(PC = factor(paste0("PC", 1:8), levels = paste0("PC", 1:8)),
cumsum = cumsum(pca.labeled$sdev^2 / sum(pca.labeled$sdev^2)))
pc <- fviz_eig(pca.labeled,
choice = c("variance"),
ggtheme = theme_classic(),
geom = c("bar"),
barfill = "brown",
barcolor = "black",
main = "Principal Component Plot",
xlab = "Principal Components") + ggeasy::easy_center_title()
cum <- ggplot(cum.var, aes(x=PC, y=cumsum)) +
geom_point(col = "brown", size=2) +
geom_hline(yintercept=0.9, linetype="dashed", color = "brown", size=1) +
labs(y="Cum. Var. Explained",
x ="Principal Components",
title="Cumulative Variance Explained") + ggeasy::easy_center_title()
(pc + cum) + plot_layout(widths = c(1, 2))
Principal Component Analysis
## Construct labeled.sc dataset and pca dataset
#construct scaled label data
labeled.sc <- as.data.frame(scale(labeled %>% dplyr::select(-Class)))
labeled.sc$Class <- labeled$Class
#construct pca label data
labeled.pca <- as.data.frame(pca.labeled$x)
labeled.pca$Class <- labeled$Class
Leave-One-Out Cross Validation (LOOCV)
Leave-one-out cross-validation is a special case of cross-validation in which the number of folds equals the number of instances in the dataset. The learning algorithm is applied once for each instance, using all other instances as the training set and the selected instance as a single-item test set; this is repeated for all n observations. LOOCV is less biased and reduces the chance of overestimating the test error rate compared to other validation approaches.
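As a deliberately explicit (and slow) sketch of the idea, here is a manual LOOCV loop using qda() from MASS (loaded above); the models below rely on built-in LOOCV options (knn.cv() and qda(..., CV = TRUE)) instead of this loop, and the names n.obs, fit.i, and loocv.pred are introduced only for this illustration.
#Manual LOOCV: for each row i, train on all other rows and predict row i
n.obs <- nrow(labeled)
loocv.pred <- character(n.obs)
for (i in 1:n.obs) {
fit.i <- qda(Class ~ ., data = labeled[-i, ])
loocv.pred[i] <- as.character(predict(fit.i, newdata = labeled[i, ])$class)
}
mean(loocv.pred == labeled$Class) #LOOCV estimate of prediction accuracy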
K-Nearest Neighbors (KNN)
The KNN algorithm assumes that similar observations lie close to each other. To select K (the number of neighbors around a particular data point), we run the KNN algorithm several times with different values of K and choose the K that gives the highest leave-one-out cross-validation (LOOCV) average prediction accuracy.
#ctrl <- trainControl(method = "LOOCV")
#
#knn.train <- train(
# Class ~.,
# data = labeled,
# method = "knn",
# tuneGrid = data.frame(k=1:20),
# trControl = ctrl,
# ## Center and scale the predictors for the training
# ## set and all future samples.
# preProc = c("scale")
#)
#ggplot(knn.train) + labs(y="Accuracy(LOOCV)",
# x ="Numb of Neighbors",
# title="Best Number of Neighbors")
## KNN
set.seed(12345)
Acc.lab <- NULL
Acc.pca <- NULL
for (k in 1:20) {
knn.lab <- knn.cv(labeled.sc[,1:8],cl=labeled$Class, k)
knn.pca <- knn.cv(labeled.pca[,1:4],cl=labeled.pca$Class, k)
Acc.lab[k] <- mean(knn.lab==labeled.sc$Class)
Acc.pca[k] <- mean(knn.pca==labeled.pca$Class)
}
AC <- as_tibble(cbind(k=1:20, ACC.lab=Acc.lab, ACC.pca=Acc.pca))
k.lab <- which(Acc.lab==max(AC$ACC.lab)) # result is 16
k.pca <- which(Acc.pca==max(AC$ACC.pca)) # result is 15
ggplot(AC) +
geom_line(aes(x=k, y=Acc.lab), color = "brown", size=1) +
geom_point(aes(x=k, y=Acc.lab), color = "black", size=3) +
geom_vline(xintercept = k.lab[1], linetype="dashed",
color = "brown", size=1)+
labs(y="Accuracy(LOOCV)",
x ="Numb of Neighbors",
title="Optimal Number of Neighbors")+
ggeasy::easy_center_title()
optk.pca <- ggplot(AC) +
geom_line(aes(x=k, y=Acc.pca), color = "brown", size=1) +
geom_point(aes(x=k, y=Acc.pca), color = "black", size=3) +
geom_vline(xintercept = k.pca[1], linetype="dashed",
color = "brown", size=1)+
labs(y="Accuracy(LOOCV)",
x ="Numb of Neighbors",
title="Best Number of Neighbors")+
ggeasy::easy_center_title()
knn <- knn.cv(labeled.sc[,1:8],
cl=labeled.sc$Class, k=min(k.lab))
conf <- confusionMatrix(data = knn, reference = labeled$Class)
knn.acc <- conf$overall[1]
The optimal number of nearest neighbors is 16.
Quadratic Discriminant Analysis (QDA)
Quadratic discriminant analysis (QDA) is a general form of Bayesian discrimination. It is quite similar to linear discriminant analysis, except that we relax the assumption that the covariance matrices of all classes are equal; each class is allowed its own covariance matrix.
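Concretely, QDA assigns an observation x to the class k that maximizes the quadratic discriminant score below (standard form, as in James et al., 2013), where each class keeps its own covariance matrix Σ_k, mean μ_k, and prior probability π_k:
$$\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) + \log\pi_k$$
Because the quadratic term depends on k, the resulting decision boundaries are quadratic rather than linear.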
#ctrl <- trainControl(method = "LOOCV")
#
#qda <- train(
# Class ~.,
# data = labeled,
# method = "qda",
# #tuneGrid = data.frame(k=1:20),
# trControl = ctrl,
# ## Center and scale the predictors for the training
# ## set and all future samples.
# #preProc = c("scale")
#)
qda <- qda(Class~., data = labeled, CV = TRUE)
qda.conf <- confusionMatrix(data = qda$class, reference = labeled$Class)
qda.acc <- qda.conf$overall[1]
Random Forest
The Random Forest classifier combines a large number of individual decision trees; each tree produces a class prediction, and the class with the most votes becomes the model's prediction. The out-of-bag (OOB) error is used here to select the optimal number of features (mtry) considered in each tree.
#ctrl <- trainControl(method = "LOOCV", number = 1)
#
#rf <- train(
# Class ~.,
# data = labeled,
# method = "rf",
# tuneGrid = data.frame(mtry=1:(ncol(data)-1)),
# trControl = ctrl,
# ## Center and scale the predictors for the training
# ## set and all future samples.
# #preProc = c("scale")
#)
#ggplot(rf)
# all variables
set.seed(12345)
n <- ncol(labeled) -1
errRate <- c(1)
for (i in 1:n){
m <- randomForest(Class~., data=labeled, mtry=i)
err<-mean(m$err.rate)
errRate[i] <- err
}
#a= which.min(errRate)
rf_sum <- data.frame(cbind(mtry=1:8, error_rate=errRate, acc = 1-errRate))
# my result is 2
mtry_best <- ggplot(rf_sum, aes(x=mtry, y=acc)) +
geom_vline(xintercept = which.min(rf_sum$error_rate) , linetype="dashed", color = "brown", size=1)+
geom_line(color = "brown", size=1)+
geom_point(color = "black", size=3) +
labs(y="Accuracy(LOOCV)",
x ="Numb of Variables",
title="Opt Num of Variables")
#rf.model <-
((mtry_best + ~plot(m, main = 'Bagging Error'))) + plot_layout(widths = c(1, 2))
#confusionMatrix(labeled$Class, m$predicted)
rff <- randomForest(Class~., data = labeled, mtry=2)
rf.conf <- confusionMatrix(data = rff$predicted, reference = labeled$Class)
rf.acc <- rf.conf$overall[1]
#ggsave(filename = "rf.model.png",
# width = 10, height = 7,
# dpi = 700)
The optimal number of variables (mtry) to consider in each tree of the random forest is 2. The out-of-bag error rates for each class of dry beans indicate that the error levels out after about 200 trees; 500 trees (the randomForest default) is used as the number of trees (ntree) in the final model.
Model Performance
There is very little difference in the performance of the three models. The QDA model performs slightly better, with the highest accuracy, than the KNN and random forest models. Any of these models could be used as the final model.
acc.data <- data.frame(cbind(Model=c("KNN", "QDA", "RF"), rbind(table(knn), table(qda$class), table(rff$predicted)), Accuracy=round(rbind(knn.acc, qda.acc, rf.acc),3)), row.names = NULL )
#acc.data
model.plot <- ggplot(acc.data, aes(x=Model, y=Accuracy)) + geom_col(width = .60, aes(fill=Model), show.legend = F) + ggeasy::easy_center_title()
(wrap_elements(gridExtra::tableGrob(acc.data)) / model.plot ) + plot_layout(widths = c(1, 2)) + plot_annotation(
title = "Model Predictions(TPR+TNR) and Accuracy")pred.label.dat <- as.data.frame(cbind("class"=qda$class,
"Eccentricity"=labeled$Eccentricity,
"Extent"=labeled$Extent))%>%
mutate(class=as.factor(ifelse(class=="1", "BOMBAY", ifelse(class=="2","CALI", ifelse(class=="3", "DERMASON", ifelse(class=="4", "HOROZ", ifelse(class=="5","SEKER", "SIRA" )))))))
Visualizing Best Selected Model
The diagram below shows that the QDA model performs very well on the training dataset; the LOOCV class assignments are very similar to the true labels.
grid.arrange(
ggplot(labeled)+geom_point(aes(x=Extent, y=Eccentricity,col=Class), size=.5)+ labs(title = "True Labeled")+
theme(legend.position="bottom") + ggeasy::easy_center_title(),
ggplot(pred.label.dat)+geom_point(aes(x=Extent, y=Eccentricity,col=class), size=.5)+ labs(title = "LOOCV QDA Labeled") + theme(legend.position="bottom") + ggeasy::easy_center_title(), ncol=2)
Final selected model (QDA)
Results (Classification of Sample Data)
Class predictions for the combined sample data are very similar in pattern to the true labeled data.
sample_data <- data.frame(rbind(sampA, sampB, sampC))
qda <- qda(Class~., data = labeled, CV = FALSE)
qda_predict <- predict(qda, newdata = sample_data)
#table(qda_predict$class)
pred.samp.dat <- as.data.frame(cbind("class"=qda_predict$class,sample_data))
grid.arrange(
ggplot(labeled)+geom_point(aes(x=Extent, y=Eccentricity,col=Class), size=.5)+ labs(title = "True Labeled")+
theme(legend.position="bottom") + ggeasy::easy_center_title(),
ggplot(pred.samp.dat)+geom_point(aes(x=Extent, y=Eccentricity,col=class), size=.5)+ labs(title = "Unlabeled") + theme(legend.position="bottom") + ggeasy::easy_center_title(), ncol=2)
Acknowledgement
I would like to give credit to the following people for their valuable contributions to all aspects of this project: Jie Hu, Kenneth Annan, Siyi Lui, and Iftekhar Chowdhury.
References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, p. 18). New York: Springer.
Heuzé V., Tran G., Nozière P., & Lebas F. (2015). Common Bean (Phaseolus vulgaris), Feedipedia.org – Animal Feed Resources Information System – A programme by INRA, CIRAD, AFZ and FAO, http://www.feedipedia.org/node/266 (accessed on 29 April 2021).
Koklu, M., & Ozkan, I. A. (2020). Multiclass classification of dry beans using computer vision and machine learning techniques. Computers and Electronics in Agriculture, 174, 105507. doi:10.1016/j.compag.2020.105507
Varankaya, S., & Ceyhan, E. (2012). Problems Encountered in Bean Farming in the Central Anatolia Region and Solution Suggestions. Selçuk Tarım Bilim. Journal. 26, 15–26.
https://www.geeksforgeeks.org/loocvleave-one-out-cross-validation-in-r-programming/
https://towardsdatascience.com/what-is-out-of-bag-oob-score-in-random-forest-a7fa23d710