The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
# Load the Soybean dataset and the plotting library
library(mlbench)
library(ggplot2)
data(Soybean)
# Get column names
cols <- names(Soybean)
# Create one bar plot per column; .data[[col_name]] looks the column up by name
lapply(cols, function(col_name) {
  ggplot(data = Soybean, aes(x = .data[[col_name]])) +
    geom_bar(fill = "skyblue") +
    coord_flip() +
    labs(title = col_name, x = col_name, y = "Count") +
    theme_minimal()
})
## [Output: 36 bar plots, one per column of Soybean]
Based on the bar plots above for each of the categorical predictors, the
mycelium, sclerotia, and leaf.mild variables have degenerate distributions:
each takes only a few unique values, and nearly all observations fall into a single one of them.
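The visual judgment above can be cross-checked numerically. The following is a minimal sketch in base R, not the method used in this analysis: `nzv_metrics` is a hypothetical helper, and the cutoffs (frequency ratio above 95/5, at most 10% unique values) are borrowed from the defaults of `caret::nearZeroVar`. The toy data frame is made up for illustration only.

```r
# Hypothetical helper: per-column frequency ratio and percent of unique values.
nzv_metrics <- function(df) {
  # Ratio of the most common value's count to the second most common one
  freq_ratio <- vapply(df, function(x) {
    counts <- sort(as.vector(table(x)), decreasing = TRUE)  # table() drops NA
    if (length(counts) < 2) Inf else counts[1] / counts[2]
  }, numeric(1))
  # Number of distinct non-missing values as a percentage of the row count
  pct_unique <- vapply(df, function(x) {
    100 * length(unique(x[!is.na(x)])) / length(x)
  }, numeric(1))
  data.frame(
    freq_ratio = freq_ratio,
    pct_unique = pct_unique,
    flagged    = freq_ratio > 95 / 5 & pct_unique <= 10,  # assumed cutoffs
    row.names  = names(df)
  )
}

# Made-up toy data: one balanced factor and one heavily skewed factor
toy <- data.frame(
  balanced = factor(rep(c("0", "1"), each = 50)),
  skewed   = factor(c(rep("0", 98), rep("1", 2)))
)
result <- nzv_metrics(toy)
print(result)
```

With the data loaded, `nzv_metrics(Soybean)` would give the same metrics for every column, so the flagged rows can be compared with the variables picked out from the plots.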
Strategy for handling missing data: First, eliminate predictors with degenerate distributions or excessive missing values, that is, predictors with near-zero variance or with more than roughly 30% to 40% missing data. Based on the outcomes from (b), this means dropping mycelium, sclerotia, and leaf.mild. For the remaining predictors, apply model-based imputation, since they are categorical and their missingness is related to the class variable.
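The filtering step of this strategy could be sketched as follows. This is a minimal base R illustration under stated assumptions: `pct_missing` is a hypothetical helper, the 30% cutoff is one of the two thresholds mentioned above, and the toy data frame is made up. It covers only the removal of high-missingness predictors and a check of missingness by class, not the imputation itself.

```r
# Hypothetical helper: percentage of missing values in each column.
pct_missing <- function(df) {
  100 * colMeans(is.na(df))
}

# Made-up toy data: x2 is missing for every row of class "a"
toy <- data.frame(
  Class = factor(rep(c("a", "b"), each = 5)),
  x1    = c(1, 2, NA, 1, 2, 1, 2, 1, 2, 1),
  x2    = c(rep(NA, 5), 1, 2, 1, 2, 1)
)

miss <- pct_missing(toy)
threshold <- 30                         # assumed cutoff, in percent
keep <- names(miss)[miss <= threshold]  # columns at or below the cutoff
reduced <- toy[, keep, drop = FALSE]

# Share of rows with at least one missing predictor, split by class
predictors <- setdiff(names(toy), "Class")
incomplete <- !complete.cases(toy[, predictors, drop = FALSE])
miss_by_class <- tapply(incomplete, toy$Class, mean)

print(miss)
print(miss_by_class)
```

Applied to the real data, `pct_missing(Soybean)` and the `tapply` call with `Soybean$Class` would show which predictors exceed the cutoff and whether incomplete rows concentrate in particular classes.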