library(readxl)
library(tidyverse)
library(tidymodels)
library(scales)
theme_set(theme_light())
library(visdat)
library(inspectdf)
library(embed)
The dataset is from Tableau MakeoverMonday 2021 week 14 (https://data.world/makeovermonday/2021w14/workspace/file?filename=Dry_Bean_Dataset.xlsx). The original source is:
KOKLU, M. and OZKAN, I.A. (2020), “Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques.” Computers and Electronics in Agriculture, 174, 105507. DOI: https://doi.org/10.1016/j.compag.2020.105507
The data is complete, with no missing values. All variables are numerical except Class, the categorical variable used as the classification target.
dataset <- read_xlsx("Dry_Bean_Dataset.xlsx")
dataset %>%
vis_dat()
Since all predictors are numerical, we can check their distributions. Some of the variables have long tails and some are bimodal, which suggests that some bean classes will be quite distinct from the others.
dataset %>%
select(-`Bean ID`) %>%
inspect_num() %>%
show_plot()
We can also check the correlations between the variables. Since quite a few of the numerical variables are feature-engineered from the base measurements, we observe many highly correlated pairs.
dataset %>%
select(-`Bean ID`) %>%
select(where(is.numeric)) %>%
vis_cor()
Given that all predictors are numerical, we will perform PCA on the dataset to see whether we can drop some of the variables.
recipe <-
recipe(Class ~ ., data = dataset) %>%
update_role(`Bean ID`, new_role = "id variable")
summary(recipe)
## # A tibble: 18 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 Bean ID numeric id variable original
## 2 Area numeric predictor original
## 3 Perimeter numeric predictor original
## 4 MajorAxisLength numeric predictor original
## 5 MinorAxisLength numeric predictor original
## 6 AspectRation numeric predictor original
## 7 Eccentricity numeric predictor original
## 8 ConvexArea numeric predictor original
## 9 EquivDiameter numeric predictor original
## 10 Extent numeric predictor original
## 11 Solidity numeric predictor original
## 12 roundness numeric predictor original
## 13 Compactness numeric predictor original
## 14 ShapeFactor1 numeric predictor original
## 15 ShapeFactor2 numeric predictor original
## 16 ShapeFactor3 numeric predictor original
## 17 ShapeFactor4 numeric predictor original
## 18 Class nominal outcome original
Using all the numerical predictors, we normalise the data and then perform PCA.
pca_rec <-
recipe %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_prep <- prep(pca_rec)
After performing the PCA, we plot the cumulative variance explained by the components. The first two principal components already cover more than 80% of the variation.
tidy(pca_prep, 2, type = "variance") %>%
filter(terms == "cumulative percent variance") %>%
ggplot(aes(x = component, y = value/100)) +
geom_line() +
scale_x_continuous(breaks = seq(2, 16, 2),
name = "Principal Component") +
scale_y_continuous(labels = percent,
name = "Cumulative %") +
labs(title = "Principal Component Cumulative Variance")
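To read the exact figures rather than eyeballing the plot, we can filter the same variance table directly. A quick check using the tidy() output from above (component is stored as a number here, so we can subset the first few):
tidy(pca_prep, 2, type = "variance") %>%
  filter(terms == "cumulative percent variance",
         component <= 4)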
Since we only need a few principal components, we look at how the variables load onto each of them. PC3 and PC4 put weight on only a few variables, adding a small amount to the overall cumulative variance explained.
tidied_pca <- tidy(pca_prep, 2)
tidied_pca %>%
filter(component %in% paste0("PC", 1:4)) %>%
mutate(component = fct_inorder(component)) %>%
ggplot(aes(value, terms, fill = terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ component, nrow = 1) +
labs(x = "Variable Weights by Principal Component",
y = "Variable",
title = "Principal Component Weights per Variable")
We make a scatter plot of the first two principal components. The BOMBAY beans clearly stand out from the other types, while the rest overlap quite a bit.
juice(pca_prep) %>%
ggplot(aes(x = PC1, y = PC2, colour = Class)) +
geom_point(size = 1) +
labs(title = "PCA Separation")
We repeat the same process with similar steps for UMAP.
umap_rec <-
recipe %>%
step_normalize(all_predictors()) %>%
step_umap(all_predictors()) ## swap step_pca() for step_umap()
umap_prep <- prep(umap_rec)
Again BOMBAY stands out clearly, and there is more separation among the bean types compared with PCA, although DERMASON still overlaps with SIRA.
juice(umap_prep) %>%
ggplot(aes(x = umap_1, y = umap_2, colour = Class)) +
geom_point(size = 1) +
labs(title = "UMAP Separation")
We will use correlation to guide variable selection, and use Boruta to check variable importance.
library(corrplot)
dataset_corr <-
dataset %>%
select(-c(`Bean ID`, Class)) %>%
cor()
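To visualise the correlation matrix, we pass it to corrplot(). A minimal sketch, where method = "color" and the hierarchical-clustering order are one reasonable choice of display options rather than the only one:
corrplot(dataset_corr,
         method = "color",
         order = "hclust",
         tl.cex = 0.7)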
From the Boruta package, we see roundness and ShapeFactor4 are the most important attributes, although the others contribute importance values in a similar range, as shown below.
library(Boruta)
dataset_mod <-
dataset %>%
mutate(Class = as_factor(Class))
set.seed(1234)
dataset_boruta <-
Boruta(Class ~ ., data = dataset_mod[, -1]) ## drop the Bean ID column
print(dataset_boruta)
## Boruta performed 11 iterations in 3.26003 mins.
## 16 attributes confirmed important: Area, AspectRation, Compactness,
## ConvexArea, Eccentricity and 11 more;
## No attributes deemed unimportant.
attStats(dataset_boruta)
## meanImp medianImp minImp maxImp normHits decision
## Area 29.46689 29.53581 27.67640 30.93551 1 Confirmed
## Perimeter 34.29386 34.26594 32.79419 35.46343 1 Confirmed
## MajorAxisLength 27.19730 27.21149 26.46158 28.34735 1 Confirmed
## MinorAxisLength 36.48262 36.68610 35.24287 38.71170 1 Confirmed
## AspectRation 33.24349 33.58719 31.64350 34.68507 1 Confirmed
## Eccentricity 33.59207 33.37761 32.36574 35.29870 1 Confirmed
## ConvexArea 30.46398 30.50181 29.10034 31.79761 1 Confirmed
## EquivDiameter 30.12447 30.33384 28.13035 31.57752 1 Confirmed
## Extent 20.41291 20.65353 19.47552 21.22571 1 Confirmed
## Solidity 43.70001 43.79988 41.45425 46.36539 1 Confirmed
## roundness 62.22659 61.06600 60.13434 65.97331 1 Confirmed
## Compactness 42.35870 42.37120 40.70326 45.54776 1 Confirmed
## ShapeFactor1 41.52202 41.03857 40.45824 43.17576 1 Confirmed
## ShapeFactor2 26.51903 26.43647 25.39231 27.76872 1 Confirmed
## ShapeFactor3 42.44447 42.27166 40.47600 45.30099 1 Confirmed
## ShapeFactor4 66.78403 66.35163 65.06075 69.52765 1 Confirmed
plot(dataset_boruta)
We will use various classification models to predict the class.
We split the data into training and testing sets at an 80/20 ratio, then set up 5-fold cross-validation on the training set, stratified by Class.
set.seed(1234)
dataset_split <- initial_split(dataset,
prop = 0.8,
strata = Class)
training_set <- training(dataset_split)
testing_set <- testing(dataset_split)
set.seed(123)
training_cv_folds <- vfold_cv(training_set,
v = 5,
strata = Class)
For the pre-processing, we change the role of Bean ID so it won't be used for prediction, use a correlation filter to remove highly correlated predictors (threshold 0.9), then normalise the remaining predictors for modelling.
recipe_model_all <-
recipe(Class ~., data = training_set) %>%
update_role(`Bean ID`, new_role = "id") %>%
step_corr(all_predictors(), threshold = 0.9) %>%
step_normalize(all_predictors())
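As a quick sanity check on what step_corr() removes, we can prep the recipe and bake the training set to list the surviving predictors (which columns get dropped depends on the data and the 0.9 threshold):
recipe_model_all %>%
  prep() %>%
  bake(new_data = NULL) %>%
  names()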
First we use a regularised multinomial logistic regression model, tuning the penalty and mixture.
## multinomial logistic regression
logistic_spec <-
multinom_reg(penalty = tune(),
mixture = tune()) %>%
set_mode("classification") %>%
set_engine("glmnet")
logistic_workflow <-
workflow() %>%
add_recipe(recipe_model_all) %>%
add_model(logistic_spec)
set.seed(1234)
logistic_tune <-
tune_grid(logistic_workflow,
resamples = training_cv_folds,
grid = 5)
Checking the performance metrics, the logistic model reaches an accuracy of about 0.925 across the tuned candidates, except Model 2. The ROC AUC is above 0.995.
logistic_tune %>%
collect_metrics() %>%
ggplot(aes(y = mean, x = .config, color = .metric)) +
geom_point() +
facet_wrap(~.metric, scales = "free_y") +
scale_x_discrete(labels = c(paste0("Model", " ", 1:5))) +
labs(x = "",
y = "Mean Value",
title = "Logistic Model Performance Metric") +
theme(legend.position = "none")
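To read the exact numbers behind the plot, show_best() lists the top tuning candidates for a chosen metric:
logistic_tune %>%
  show_best("accuracy")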
Next we use a random forest model with the ranger engine.
## random forest
ranger_spec <-
rand_forest(mtry = tune(),
min_n = tune(),
trees = 1000) %>%
set_mode("classification") %>%
set_engine("ranger")
ranger_workflow <-
workflow() %>%
add_recipe(recipe_model_all) %>%
add_model(ranger_spec)
set.seed(1234)
ranger_tune <-
tune_grid(ranger_workflow,
resamples = training_cv_folds,
grid = 5)
The average accuracy is about 0.926, slightly higher than the logistic model.
ranger_tune %>%
collect_metrics() %>%
ggplot(aes(y = mean, x = .config, color = .metric)) +
geom_point() +
facet_wrap(~.metric, scales = "free_y") +
scale_x_discrete(labels = c(paste0("Model", " ", 1:5))) +
labs(x = "",
y = "Mean Value",
title = "Random Forest Model Performance Metric") +
theme(legend.position = "none")
Next, k-nearest neighbours. Since we know there are 7 types of beans, we fix neighbors = 7 and tune only the weighting function. (Strictly, neighbors is the number of neighbouring points used in the vote, not the number of classes, so this is a heuristic choice rather than a requirement.)
## knn
kknn_spec <-
nearest_neighbor(neighbors = 7,
weight_func = tune()) %>%
set_mode("classification") %>%
set_engine("kknn")
kknn_workflow <-
workflow() %>%
add_recipe(recipe_model_all) %>%
add_model(kknn_spec)
set.seed(81707)
kknn_tune <-
tune_grid(kknn_workflow,
resamples = training_cv_folds,
grid = 5)
The average performance is lower than that of the logistic model.
kknn_tune %>%
collect_metrics() %>%
ggplot(aes(y = mean, x = .config, color = .metric)) +
geom_point() +
facet_wrap(~.metric, scales = "free_y") +
scale_x_discrete(labels = c(paste0("Model", " ", 1:5))) +
labs(x = "",
y = "Mean Value",
title = "K-nearest Neighbour Model Performance Metric") +
theme(legend.position = "none")
Support vector machine with a radial basis function (RBF) kernel.
## SVM
svm_spec <-
svm_rbf(cost = tune(),
rbf_sigma = tune()) %>%
set_mode("classification") %>%
set_engine("kernlab")
svm_workflow <-
workflow() %>%
add_recipe(recipe_model_all) %>%
add_model(svm_spec)
set.seed(81707)
svm_tune <-
tune_grid(svm_workflow,
resamples = training_cv_folds,
grid = 5)
Although two of the tuned models perform well at about 0.925 accuracy, the others perform poorly. The SVM is less consistent than the previous models given the current tuning grid size.
svm_tune %>%
collect_metrics() %>%
ggplot(aes(y = mean, x = .config, color = .metric)) +
geom_point() +
facet_wrap(~.metric, scales = "free_y") +
scale_x_discrete(labels = c(paste0("Model", " ", 1:5))) +
labs(x = "",
y = "Mean Value",
title = "Support Vector Machine Model Performance Metric") +
theme(legend.position = "none")
Finally we select the best model for each algorithm and fit it on the full training set.
logistic_best <- logistic_tune %>%
select_best("accuracy")
logistic_final <- logistic_workflow %>%
finalize_workflow(logistic_best) %>%
fit(data = training_set)
ranger_best <- ranger_tune %>%
select_best("accuracy")
ranger_final <- ranger_workflow %>%
finalize_workflow(ranger_best) %>%
fit(data = training_set)
kknn_best <- kknn_tune %>%
select_best("accuracy")
kknn_final <- kknn_workflow %>%
finalize_workflow(kknn_best) %>%
fit(data = training_set)
svm_best <- svm_tune %>%
select_best("accuracy")
svm_final <- svm_workflow %>%
finalize_workflow(svm_best) %>%
fit(data = training_set)
The vip package offers a way to see the importance of the predictors.
library(vip)
plot_logistic <- logistic_final %>%
pull_workflow_fit() %>%
vip()
## for ranger, the importance measure must be requested in set_engine()
## model-specific importance is not available for kknn and svm
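As a sketch of how this works for the random forest, we can re-set the engine with an importance measure before refitting the finalised model (importance = "impurity" is one of ranger's options; "permutation" is the other common choice):
ranger_imp_spec <-
  ranger_spec %>%
  finalize_model(ranger_best) %>%
  set_engine("ranger", importance = "impurity")
plot_ranger <-
  workflow() %>%
  add_recipe(recipe_model_all) %>%
  add_model(ranger_imp_spec) %>%
  fit(data = training_set) %>%
  pull_workflow_fit() %>%
  vip()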
We fit each finalised workflow with last_fit(), which trains on the full training set and evaluates on the testing set, then plot the results in a single graph. The random forest model from the ranger package gives the best overall performance compared with the other models.
logistic_final_all <- logistic_final %>%
last_fit(split = dataset_split)
ranger_final_all <- ranger_final %>%
last_fit(split = dataset_split)
kknn_final_all <- kknn_final %>%
last_fit(split = dataset_split)
svm_final_all <- svm_final %>%
last_fit(split = dataset_split)
model_final_all_list <- list(logistic_final_all, ranger_final_all, kknn_final_all, svm_final_all)
map(model_final_all_list, collect_metrics) %>%
bind_rows(.id = "model") %>%
ggplot(aes(x = model, y = .estimate)) +
geom_point(size = 3) +
scale_x_discrete(labels = c("logistic", "ranger", "kknn", "svm")) +
facet_wrap(~.metric) +
labs(title = "Final testing set fit performance",
x = NULL,
y = "Performance Measure")
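To see where the remaining test-set errors come from (for instance the DERMASON/SIRA overlap visible in the UMAP plot), we can inspect the confusion matrix of the best model, using the predictions stored by last_fit():
ranger_final_all %>%
  collect_predictions() %>%
  conf_mat(truth = Class, estimate = .pred_class)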