Load Packages

library(readxl)
library(tidyverse)
library(tidymodels)
library(scales)
theme_set(theme_light())
library(visdat)
library(inspectdf)
library(embed)

The dataset is from Tableau MakeoverMonday 2021 week 14 (https://data.world/makeovermonday/2021w14/workspace/file?filename=Dry_Bean_Dataset.xlsx). The original source is DOI: https://doi.org/10.1016/j.compag.2020.105507

KOKLU, M. and OZKAN, I.A., (2020), “Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques.” Computers and Electronics in Agriculture, 174, 105507.

Overview of the Dataset

The data is complete with no missing values. All variables are numerical except Class, a categorical variable that serves as the classification target.

dataset <- read_xlsx("Dry_Bean_Dataset.xlsx")
dataset %>% 
  vis_dat()

Since all the values are numerical, we can check their distributions. We notice some variables have long tails and some are bimodal, which suggests some bean classes will be quite distinct from the others.

dataset %>% 
  select(-`Bean ID`) %>% 
  inspect_num() %>% 
  show_plot()

We can also check the correlations between all the variables. Since quite a few of the numerical variables are feature engineered from the base measurements, we observe many highly correlated pairs.

dataset %>% 
  select(-`Bean ID`) %>% 
  select(where(is.numeric)) %>% 
  vis_cor()

Clustering Check

PCA

Given that all the variables are numerical, we will perform PCA on the dataset to see if we can drop some of them.

recipe <- 
  recipe(Class ~ ., data = dataset) %>%
  update_role(`Bean ID`, new_role = "id variable")

summary(recipe)
## # A tibble: 18 x 4
##    variable        type    role        source  
##    <chr>           <chr>   <chr>       <chr>   
##  1 Bean ID         numeric id variable original
##  2 Area            numeric predictor   original
##  3 Perimeter       numeric predictor   original
##  4 MajorAxisLength numeric predictor   original
##  5 MinorAxisLength numeric predictor   original
##  6 AspectRation    numeric predictor   original
##  7 Eccentricity    numeric predictor   original
##  8 ConvexArea      numeric predictor   original
##  9 EquivDiameter   numeric predictor   original
## 10 Extent          numeric predictor   original
## 11 Solidity        numeric predictor   original
## 12 roundness       numeric predictor   original
## 13 Compactness     numeric predictor   original
## 14 ShapeFactor1    numeric predictor   original
## 15 ShapeFactor2    numeric predictor   original
## 16 ShapeFactor3    numeric predictor   original
## 17 ShapeFactor4    numeric predictor   original
## 18 Class           nominal outcome     original

Starting from the recipe, we normalise all the predictors and then perform PCA on the dataset.

pca_rec <- 
  recipe %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors())
  
pca_prep <- prep(pca_rec)
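
By default step_pca keeps five components. If we instead wanted to keep however many components are needed to cover a set share of the variance, step_pca also accepts a threshold argument; a minimal sketch, where the 0.9 cut-off is an illustrative assumption rather than a tuned choice:

pca_rec_90 <- 
  recipe %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), threshold = 0.9) ## keep components covering 90% of total variance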

After performing the PCA, we plot the cumulative variance explained by the components. The first two principal components already cover more than 80% of the variation.

tidy(pca_prep, 2, type = "variance") %>% 
  filter(terms == "cumulative percent variance") %>% 
  ggplot(aes(x = component, y = value/100)) +
  geom_line() +
  scale_x_continuous(breaks = seq(2, 16, 2),
                     name = "Principal Component") +
  scale_y_continuous(labels = percent,
                     name = "Cumulative %") +
  labs(title = "Principal Component Cumulative Variance")

Since we only need a few principal components, we look at how each variable loads onto them. PC3 and PC4 put weight on only a few variables while adding to the overall cumulative variance explained.

tidied_pca <- tidy(pca_prep, 2)

tidied_pca %>% 
  filter(component %in% paste0("PC", 1:4)) %>% 
  mutate(component = fct_inorder(component)) %>% 
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ component, nrow = 1) +
  labs(x = "Variable Weights by Principle Component",
       y = "Variable",
       title = "Principle Component Weights per Variable ")

We draw a scatter plot of the first two principal components. The BOMBAY bean clearly stands out from the other types, while the rest overlap quite a bit.

juice(pca_prep) %>% 
  ggplot(aes(x = PC1, y = PC2, colour = Class)) +
  geom_point(size = 1) +
  labs(title = "PCA Separtion")

UMAP

We repeat the process with similar steps for UMAP.

umap_rec <- 
  recipe %>% 
  step_normalize(all_predictors()) %>% 
  step_umap(all_predictors()) ## step_umap replaces step_pca
  
umap_prep <- prep(umap_rec)
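
UMAP is stochastic, so for reproducibility we could fix its seed; step_umap also exposes hyper-parameters such as neighbors and min_dist. A minimal sketch, where the values are illustrative assumptions rather than tuned choices:

umap_rec_fixed <- 
  recipe %>% 
  step_normalize(all_predictors()) %>% 
  step_umap(all_predictors(),
            neighbors = 15,        ## size of the local neighbourhood
            min_dist = 0.1,        ## minimum spacing of points in the embedding
            seed = c(1234, 5678))  ## step_umap takes a two-integer seed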

Again BOMBAY stands out clearly, and there is more separation among the beans compared with PCA. DERMASON still overlaps with SIRA.

juice(umap_prep) %>% 
  ggplot(aes(x = umap_1, y = umap_2, colour = Class)) +
  geom_point(size = 1) +
  labs(title = "UMAP Separtion")

Variable Selection

We will use correlation to guide variable selection, and Boruta to check the results.

Correlation

library(corrplot)

dataset_corr <- 
  dataset %>% 
  select(-c(`Bean ID`, Class)) %>% 
  cor()
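
The corrplot package loaded above can then draw the matrix; a minimal sketch, with illustrative display options:

corrplot(dataset_corr,
         method = "color",  ## shade each cell by its correlation
         order = "hclust",  ## cluster correlated variables together
         tl.cex = 0.7)      ## shrink the variable labels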

From the Boruta package, we see roundness and ShapeFactor4 are the most important variables, although the others contribute importance values in a broadly similar range, as shown below.

library(Boruta)

dataset_mod <- 
  dataset %>% 
  mutate(Class = as_factor(Class))

set.seed(1234)
dataset_boruta <- 
  Boruta(Class ~., data = dataset_mod[, -1])

print(dataset_boruta)
## Boruta performed 11 iterations in 3.26003 mins.
##  16 attributes confirmed important: Area, AspectRation, Compactness,
## ConvexArea, Eccentricity and 11 more;
##  No attributes deemed unimportant.
attStats(dataset_boruta) 
##                  meanImp medianImp   minImp   maxImp normHits  decision
## Area            29.46689  29.53581 27.67640 30.93551        1 Confirmed
## Perimeter       34.29386  34.26594 32.79419 35.46343        1 Confirmed
## MajorAxisLength 27.19730  27.21149 26.46158 28.34735        1 Confirmed
## MinorAxisLength 36.48262  36.68610 35.24287 38.71170        1 Confirmed
## AspectRation    33.24349  33.58719 31.64350 34.68507        1 Confirmed
## Eccentricity    33.59207  33.37761 32.36574 35.29870        1 Confirmed
## ConvexArea      30.46398  30.50181 29.10034 31.79761        1 Confirmed
## EquivDiameter   30.12447  30.33384 28.13035 31.57752        1 Confirmed
## Extent          20.41291  20.65353 19.47552 21.22571        1 Confirmed
## Solidity        43.70001  43.79988 41.45425 46.36539        1 Confirmed
## roundness       62.22659  61.06600 60.13434 65.97331        1 Confirmed
## Compactness     42.35870  42.37120 40.70326 45.54776        1 Confirmed
## ShapeFactor1    41.52202  41.03857 40.45824 43.17576        1 Confirmed
## ShapeFactor2    26.51903  26.43647 25.39231 27.76872        1 Confirmed
## ShapeFactor3    42.44447  42.27166 40.47600 45.30099        1 Confirmed
## ShapeFactor4    66.78403  66.35163 65.06075 69.52765        1 Confirmed
plot(dataset_boruta)

Modelling

We will use various classification models to predict the bean class.

We split the data into training and testing sets at an 80/20 ratio, using dataset_mod so that Class is a factor as the classification engines require, and set up 5-fold cross-validation on the training set.

set.seed(1234)
dataset_split <- initial_split(dataset_mod,
                               prop = 0.8,
                               strata = Class) 
training_set <- training(dataset_split)
testing_set <- testing(dataset_split)

set.seed(123)
training_cv_folds <- vfold_cv(training_set,
                              v = 5,
                              strata = Class)

For pre-processing, we change the role of Bean ID so it won't be used for prediction, filter out highly correlated variables with step_corr (threshold 0.9), then normalise the predictors for modelling.

recipe_model_all <- 
  recipe(Class ~., data = training_set) %>% 
  update_role(`Bean ID`, new_role = "id") %>% 
  step_corr(all_predictors(), threshold = 0.9) %>% 
  step_normalize(all_predictors())
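
As a quick check, we can see which predictors step_corr actually drops at the 0.9 threshold by prepping the recipe and tidying that step; this is a side inspection, not part of the modelling pipeline:

recipe_model_all %>% 
  prep() %>% 
  tidy(number = 1) ## step_corr is step 1; terms lists the removed columns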

First we use a (regularised) multinomial logistic model.

## multinomial logistic regression
logistic_spec <- 
  multinom_reg(penalty = tune(),
               mixture = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("glmnet")
  

logistic_workflow <- 
  workflow() %>% 
  add_recipe(recipe_model_all) %>% 
  add_model(logistic_spec)

set.seed(1234)
logistic_tune <- 
  tune_grid(logistic_workflow, 
            resamples = training_cv_folds,
            grid = 5)

Checking the performance metrics, we see the logistic model reaches an accuracy of about 0.925 for the different tuned candidates, except Model 2. The ROC AUC is above 0.995.

logistic_tune %>%
  collect_metrics() %>%
  ggplot(aes(y = mean, x = .config, color = .metric)) +
  geom_point() +
  facet_wrap(~.metric, scales = "free_y") +
  scale_x_discrete(labels = c(paste0("Model", " ", 1:5))) +
  labs(x = "",
         y = "Mean Value", 
         title = "Logistic Model Performance Metric") +
  theme(legend.position = "none")

Next we use a random forest model with the ranger engine.

## random forest
ranger_spec <-
  rand_forest(mtry = tune(),
              min_n = tune(),
              trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

ranger_workflow <-
  workflow() %>%
  add_recipe(recipe_model_all) %>%
  add_model(ranger_spec)

set.seed(1234)
ranger_tune <-
  tune_grid(ranger_workflow,
            resamples = training_cv_folds,
            grid = 5)

The average accuracy is about 0.926, slightly higher than the logistic model's.

ranger_tune %>%
  collect_metrics() %>%
  ggplot(aes(y = mean, x = .config, color = .metric)) +
  geom_point() +
  facet_wrap(~.metric, scales = "free_y") +
  scale_x_discrete(labels = c(paste0("Model", " ", 1:5))) +
  labs(x = "",
         y = "Mean Value", 
         title = "Random Forest Model Performance Metric") +
  theme(legend.position = "none")

Next, k-nearest neighbours: since we know there are 7 types of beans, we set the number of neighbours to 7.

## knn
kknn_spec <- 
  nearest_neighbor(neighbors = 7, 
                   weight_func = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("kknn") 

kknn_workflow <- 
  workflow() %>% 
  add_recipe(recipe_model_all) %>% 
  add_model(kknn_spec) 

set.seed(81707)
kknn_tune <-
  tune_grid(kknn_workflow, 
            resamples = training_cv_folds, 
            grid = 5)

The average performance is lower than the logistic model's.

kknn_tune %>%
  collect_metrics() %>%
  ggplot(aes(y = mean, x = .config, color = .metric)) +
  geom_point() +
  facet_wrap(~.metric, scales = "free_y") +
  scale_x_discrete(labels = c(paste0("Model", " ", 1:5))) +
  labs(x = "",
         y = "Mean Value", 
         title = "K-nearest Neighbour Model Performance Metric") +
  theme(legend.position = "none")

Finally, a support vector machine with a radial basis kernel.

## SVM
svm_spec <-
  svm_rbf(cost = tune(),
          rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_workflow <-
  workflow() %>%
  add_recipe(recipe_model_all) %>%
  add_model(svm_spec)

set.seed(81707)
svm_tune <-
  tune_grid(svm_workflow,
            resamples = training_cv_folds,
            grid = 5)

Although two of the tuned candidates performed well at about 0.925 accuracy, the others did poorly. The SVM is less consistent than the previous models given the current grid size.

svm_tune %>%
  collect_metrics() %>%
  ggplot(aes(y = mean, x = .config, color = .metric)) +
  geom_point() +
  facet_wrap(~.metric, scales = "free_y") +
  scale_x_discrete(labels = c(paste0("Model", " ", 1:5))) +
  labs(x = "",
         y = "Mean Value", 
         title = "Support Vector Machine Model Performance Metric") +
  theme(legend.position = "none")

Finally we collect the best model for each algorithm.

logistic_best <- logistic_tune %>% 
  select_best("accuracy")
logistic_final <- logistic_workflow %>% 
  finalize_workflow(logistic_best) %>% 
  fit(data = training_set)

ranger_best <- ranger_tune %>% 
  select_best("accuracy")
ranger_final <- ranger_workflow %>% 
  finalize_workflow(ranger_best) %>% 
  fit(data = training_set)

kknn_best <- kknn_tune %>% 
  select_best("accuracy")
kknn_final <- kknn_workflow %>% 
  finalize_workflow(kknn_best) %>% 
  fit(data = training_set)

svm_best <- svm_tune %>% 
  select_best("accuracy")
svm_final <- svm_workflow %>% 
  finalize_workflow(svm_best) %>% 
  fit(data = training_set)

The vip package offers a way to see the importance of each variable.

library(vip)

plot_logistic <- logistic_final %>% 
  pull_workflow_fit() %>% 
  vip()

## for the ranger model, the importance mode must be chosen in set_engine()

## vip importance scores are not available for the kknn and svm engines
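
As the note above says, ranger only reports importance when a mode is set in set_engine(). A sketch of how the tuned random forest could be refitted so vip() works, assuming the impurity mode (permutation is another option):

ranger_vip_spec <-
  rand_forest(mtry = tune(),
              min_n = tune(),
              trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity") ## record impurity importance

plot_ranger <- ranger_workflow %>%
  update_model(ranger_vip_spec) %>%
  finalize_workflow(ranger_best) %>%
  fit(data = training_set) %>%
  pull_workflow_fit() %>%
  vip()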

Group the best model outputs and plot them in a single graph. The random forest model from the ranger package gives the best overall measures compared with the other models.

logistic_final_all <- logistic_final %>% 
  last_fit(split = dataset_split)

ranger_final_all <- ranger_final %>% 
  last_fit(split = dataset_split)

kknn_final_all <- kknn_final %>% 
  last_fit(split = dataset_split)

svm_final_all <- svm_final %>% 
  last_fit(split = dataset_split)

model_final_all_list <- list(logistic_final_all, ranger_final_all, kknn_final_all, svm_final_all)

map(model_final_all_list, collect_metrics) %>% 
  bind_rows(.id = "model") %>% 
  ggplot(aes(x = model, y = .estimate)) +
  geom_point(size = 3) +
  scale_x_discrete(labels = c("logistic", "ranger", "kknn", "svm")) +
  facet_wrap(~.metric) +
  labs(title = "Final testing set fit performance",
       x = NULL,
       y = "Performance Measure")