The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench) data(Glass) str(Glass)
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.5.2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.5.2
library(e1071)
## Warning: package 'e1071' was built under R version 4.5.2
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:ggplot2':
##
## element
library(forcats)
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.5.2
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ rsample 1.3.1
## ✔ dials 1.4.2 ✔ tailor 0.1.0
## ✔ infer 1.1.0 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.4.1 ✔ workflowsets 1.1.1
## ✔ recipes 1.3.1 ✔ yardstick 1.3.2
## Warning: package 'dials' was built under R version 4.5.2
## Warning: package 'infer' was built under R version 4.5.2
## Warning: package 'modeldata' was built under R version 4.5.2
## Warning: package 'parsnip' was built under R version 4.5.2
## Warning: package 'rsample' was built under R version 4.5.2
## Warning: package 'tailor' was built under R version 4.5.2
## Warning: package 'tune' was built under R version 4.5.2
## Warning: package 'workflows' was built under R version 4.5.2
## Warning: package 'workflowsets' was built under R version 4.5.2
## Warning: package 'yardstick' was built under R version 4.5.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ e1071::element() masks ggplot2::element()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
# load and quick structure check w/ data
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(Glass)
## [1] 214 10
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
# target distribution
table(Glass$Type)
##
## 1 2 3 5 6 7
## 70 76 17 13 9 29
# check missing values
colSums(is.na(Glass))
## RI Na Mg Al Si K Ca Ba Fe Type
## 0 0 0 0 0 0 0 0 0 0
# put predictors into a long format for easy faceted plots
glass_long <- Glass %>%
pivot_longer(cols = -Type, names_to = "predictor", values_to = "value")
# histograms for each predictor
ggplot(glass_long, aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~ predictor, scales = "free") +
labs(
title = "Glass predictors: Histograms",
x = "Value",
y = "COunt")
# density plots
ggplot(glass_long, aes(x = value)) +
geom_density() +
facet_wrap(~ predictor, scales = "free") +
labs(
title = "Glass predictors: Density plots",
x = "Value",
y = "Density")
# boxplots
ggplot(glass_long, aes(x = predictor, y = value)) +
geom_boxplot() +
coord_flip() +
labs(
title = "Glass predictors: Boxplots (outlier check)",
x = "",
y = "Value")
# pairwise relationships
GGally::ggpairs(
Glass,
columns = 1:9,
aes(alpha = 0.6)
)
# correlation heatmap
cor_mat <- cor(Glass %>%
select(-Type),
use = "complete.obs")
ggcorrplot(
cor_mat,
hc.order = TRUE,
type = "lower",
lab = TRUE
) +
labs(title = "Correlation heatmap (numeric predictors)")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggcorrplot package.
## Please report the issue at <https://github.com/kassambara/ggcorrplot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# visualize how distributions differ by class
ggplot(glass_long, aes(x = Type, y = value)) +
geom_boxplot() +
facet_wrap(~ predictor, scales = "free") +
labs(title = "Predictors by Glass Type", x = "Type", y = "Value")
Using histograms and density plots, I saw that RI, Na, Al, and Si each have one main peak, so their values cluster in a single range. Some other predictors are clearly right skewed and have lots of values near zero. Ba and Fe are the clearest examples because most samples are exactly zero and only a few samples are above zero, which creates a long tail. From the boxplots, Ca and K also look like they have a few unusually large values compared to the rest of the data.
To look at how predictors relate to each other, I used a scatterplot matrix and a correlation heatmap. The heatmap shows a strong positive relationship between RI and Ca, around 0.81. It also shows a moderate negative relationship between Si and RI, around negative 0.54. Overall, the predictors are a mix of measurements that are somewhat related to each other, plus a few variables that are very skewed.
# function to flag outliers using boxplot rule
is_outlier_iqr <- function(x) {
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
(x < lower) | (x > upper)
}
# compute skewness and number of outliers and number of zeros
outlier_skew_summary <- Glass %>%
select(-Type) %>% # exclude target class
summarise(across(
everything(),
list(
skewness = ~ e1071::skewness(.x, na.rm = TRUE, type = 2),
outliers = ~ sum(is_outlier_iqr(.x), na.rm = TRUE),
zeros = ~ sum(.x == 0, na.rm = TRUE)
),
.names = "{.col}_{.fn}"
)) %>%
pivot_longer(
cols = everything(),
names_to = c("predictor", "metric"),
names_sep = "_",
values_to = "value"
) %>%
pivot_wider(names_from = metric, values_from = value) %>%
arrange(desc(abs(skewness)))
outlier_skew_summary
## # A tibble: 9 × 4
## predictor skewness outliers zeros
## <chr> <dbl> <dbl> <dbl>
## 1 K 6.55 7 30
## 2 Ba 3.42 38 176
## 3 Ca 2.05 26 0
## 4 Fe 1.75 12 144
## 5 RI 1.63 17 0
## 6 Mg -1.15 0 42
## 7 Al 0.907 18 0
## 8 Si -0.730 12 0
## 9 Na 0.454 7 0
# show the top 3 most skewed predictors
top_skew <- outlier_skew_summary %>%
slice_head(n = 3)
top_skew
## # A tibble: 3 × 4
## predictor skewness outliers zeros
## <chr> <dbl> <dbl> <dbl>
## 1 K 6.55 7 30
## 2 Ba 3.42 38 176
## 3 Ca 2.05 26 0
Yes, I do see outliers in the Glass predictors using the 1.5 times IQR rule. A few variables have quite a lot of values flagged as outliers, especially Ba with 38 outliers, Ca with 26 outliers, and Fe with 12 outliers. This matches what I saw in the boxplots, where most values are in a tight range but a small number of points sit much higher than the rest.
I also see clear skewness in several predictors. K is the most skewed, with skewness 6.55, so it has a strong right tail. Ba is also very right skewed with skewness 3.42, and it has 176 zeros, which means most samples have Ba equal to zero and only a few have larger values. Ca is right skewed too with skewness 2.05. On the other hand, Mg and Si are left skewed, with skewness values of negative 1.15 and negative 0.73. Overall, the heavy skew and the large number of zeros in some variables, especially Ba and Fe, help explain why the IQR rule flags outliers so often.
# build a preprocessing recipe
rec <- recipe(Type ~ ., data = Glass) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
rec
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 9
##
## ── Operations
## • Yeo-Johnson transformation on: all_numeric_predictors()
## • Centering and scaling for: all_numeric_predictors()
# apply the preprocessing to see the transformed data
prep_rec <- prep(rec)
Glass_processed <- bake(prep_rec, new_data = NULL)
summary(Glass_processed)
## RI Na Mg Al
## Min. :-2.3759 Min. :-3.68296 Min. :-1.7352 Min. :-3.09581
## 1st Qu.:-0.6068 1st Qu.:-0.59533 1st Qu.:-0.7782 1st Qu.:-0.45123
## Median :-0.2257 Median :-0.09993 Median : 0.5240 Median :-0.07732
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.2608 3rd Qu.: 0.53801 3rd Qu.: 0.6666 3rd Qu.: 0.46462
## Max. : 5.1252 Max. : 4.25332 Max. : 1.8719 Max. : 3.15444
## Si K Ca Ba
## Min. :-3.6679 Min. :-1.6158 Min. :-4.6093 Min. :-0.3521
## 1st Qu.:-0.4789 1st Qu.:-0.9812 1st Qu.:-0.4686 1st Qu.:-0.3521
## Median : 0.1795 Median : 0.4648 Median :-0.1394 Median :-0.3521
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5636 3rd Qu.: 0.5934 3rd Qu.: 0.3281 3rd Qu.:-0.3521
## Max. : 3.5622 Max. : 3.4444 Max. : 3.2396 Max. : 5.9832
## Fe Type
## Min. :-0.5851 1:70
## 1st Qu.:-0.5851 2:76
## Median :-0.5851 3:17
## Mean : 0.0000 5:13
## 3rd Qu.: 0.4412 6: 9
## Max. : 4.6490 7:29
From the skewness and outlier results in part b, a few predictors are very skewed, especially K with skewness 6.55 and Ba with skewness 3.42. Ba also has a lot of zeros, with 176 samples at zero, so most values pile up at the bottom and only a few are much larger. That creates a strong right tail and makes the IQR rule flag a lot of points as outliers. To make the data easier for a classifier to learn from, it helps to transform the most skewed variables. Using log1p on K, Ba, and Fe is a simple option because it works with zeros and it shrinks the extreme high values.
For a more general approach, I would apply a Yeo Johnson transformation to all numeric predictors since it reduces skewness and can handle zeros. After that, I would normalize the predictors so they are on the same scale, which is important for models like KNN and SVM. Since some predictors are correlated, like RI and Ca, PCA could also be used to reduce overlap between variables and make the model more stable.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via: library(mlbench) data(Soybean)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
dim(Soybean)
## [1] 683 36
table(Soybean$Class)
##
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
# confirm missing values are NA
soybean_clean <- Soybean %>%
mutate(across(
where(is.factor),
~ {
x <- as.character(.x)
x[x == "?"] <- NA
factor(x)
}
))
# define predictor names
predictors <- setdiff(names(soybean_clean), "Class")
# build a frequency table for every predictor
freq_df <- map_dfr(predictors, function(v) {
tab <- table(soybean_clean[[v]], useNA = "ifany")
tibble(
predictor = v,
level = names(tab),
n = as.integer(tab),
pct = as.integer(tab) / sum(tab)
)
})
# show one predictor's frequency distribution
freq_df %>%
filter(predictor == predictors[1]) %>%
arrange(desc(n))
## # A tibble: 8 × 4
## predictor level n pct
## <chr> <chr> <int> <dbl>
## 1 date 5 149 0.218
## 2 date 4 131 0.192
## 3 date 3 118 0.173
## 4 date 2 93 0.136
## 5 date 6 90 0.132
## 6 date 1 75 0.110
## 7 date 0 26 0.0381
## 8 date <NA> 1 0.00146
deg_summary <- map_dfr(predictors, function(v) {
x <- soybean_clean[[v]]
x_nm <- x[!is.na(x)]
n_nm <- length(x_nm)
tab <- sort(table(x_nm), decreasing = TRUE)
n_levels <- length(tab)
top_n <- if (n_levels >= 1) as.integer(tab[1]) else 0
top_pct <- if (n_nm > 0) top_n / n_nm else NA_real_
second_n <- if (n_levels >= 2) as.integer(tab[2]) else 0
freq_ratio <- if (second_n == 0) Inf else top_n / second_n
pct_unique <- if (n_nm > 0) (n_levels / n_nm) * 100 else NA_real_
zero_var <- (n_levels <= 1)
near_zero <- (!zero_var) && (freq_ratio > 19) && (pct_unique < 10)
tibble(
predictor = v,
missing_pct = mean(is.na(x)),
n_nonmissing = n_nm,
n_levels = n_levels,
top_level = if (n_levels >= 1) names(tab)[1] else NA_character_,
top_pct = top_pct,
freq_ratio = freq_ratio,
pct_unique = pct_unique,
zero_var = zero_var,
near_zero = near_zero
)
}) %>%
arrange(desc(zero_var), desc(near_zero), desc(missing_pct), desc(top_pct))
deg_summary
## # A tibble: 35 × 10
## predictor missing_pct n_nonmissing n_levels top_level top_pct freq_ratio
## <chr> <dbl> <int> <int> <chr> <dbl> <dbl>
## 1 leaf.mild 0.158 575 3 0 0.930 26.8
## 2 mycelium 0.0556 645 2 0 0.991 106.
## 3 sclerotia 0.0556 645 2 0 0.969 31.2
## 4 lodging 0.177 562 2 0 0.925 12.4
## 5 hail 0.177 562 2 0 0.774 3.43
## 6 sever 0.177 562 3 1 0.573 1.65
## 7 seed.tmt 0.177 562 3 0 0.543 1.37
## 8 germ 0.164 571 3 1 0.373 1.10
## 9 shriveling 0.155 577 2 0 0.934 14.2
## 10 seed.discolor 0.155 577 2 0 0.889 8.02
## # ℹ 25 more rows
## # ℹ 3 more variables: pct_unique <dbl>, zero_var <lgl>, near_zero <lgl>
degenerate_vars <- deg_summary %>%
filter(zero_var | near_zero) %>%
pull(predictor)
degenerate_vars
## [1] "leaf.mild" "mycelium" "sclerotia"
I looked at the frequency distributions for all categorical predictors by making frequency tables and including the NA counts. Most variables only have a few levels, usually around two to four, which makes sense since they describe symptoms and environment conditions. Using a near zero variance style check, I found that leaf.mild, mycelium, and sclerotia look degenerate because one category shows up in almost all the rows and there is very little variation. Since these variables do not change much across the dataset, they probably will not add much useful information to a classification model, so they are good candidates to remove or to combine levels during preprocessing.
missing_by_var <- tibble(
predictor = predictors,
missing_pct = map_dbl(predictors, ~ mean(is.na(soybean_clean[[.x]])))
) %>%
arrange(desc(missing_pct))
missing_by_var
## # A tibble: 35 × 2
## predictor missing_pct
## <chr> <dbl>
## 1 hail 0.177
## 2 sever 0.177
## 3 seed.tmt 0.177
## 4 lodging 0.177
## 5 germ 0.164
## 6 leaf.mild 0.158
## 7 fruiting.bodies 0.155
## 8 fruit.spots 0.155
## 9 seed.discolor 0.155
## 10 shriveling 0.155
## # ℹ 25 more rows
# plot missing % by predictor
ggplot(missing_by_var, aes(x = reorder(predictor, missing_pct), y = missing_pct)) +
geom_col() +
coord_flip() +
labs(title = "Percent missing by predictor", x = NULL, y = "Missing proportion")
# heatmap for each predictor and missing % within each class
missing_by_class <- map_dfr(predictors, function(v) {
soybean_clean %>%
group_by(Class) %>%
summarise(missing_pct = mean(is.na(.data[[v]])), .groups = "drop") %>%
mutate(predictor = v)
})
ggplot(missing_by_class, aes(x = predictor, y = Class, fill = missing_pct)) +
geom_tile() +
labs(title = "Missingness pattern by Class (heatmap)", x = "Predictor", y = "Class") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
chi_results <- map_dfr(predictors, function(v) {
miss <- is.na(soybean_clean[[v]])
tbl <- table(miss, soybean_clean$Class)
test <- suppressWarnings(chisq.test(tbl))
tibble(
predictor = v,
p_value = test$p.value
)
}) %>%
mutate(p_adj = p.adjust(p_value, method = "BH")) %>%
arrange(p_adj)
chi_results
## # A tibble: 35 × 3
## predictor p_value p_adj
## <chr> <dbl> <dbl>
## 1 precip 2.29e-133 6.69e-133
## 2 temp 2.29e-133 6.69e-133
## 3 crop.hist 2.29e-133 6.69e-133
## 4 plant.growth 2.29e-133 6.69e-133
## 5 stem 2.29e-133 6.69e-133
## 6 stem.cankers 2.29e-133 6.69e-133
## 7 canker.lesion 2.29e-133 6.69e-133
## 8 ext.decay 2.29e-133 6.69e-133
## 9 mycelium 2.29e-133 6.69e-133
## 10 int.discolor 2.29e-133 6.69e-133
## # ℹ 25 more rows
Some predictors have more missing values than others. The most missing ones are around 17.7 percent, including hail, sever, seed.tmt, and lodging. Germ is also fairly high at about 16.4 percent, and leaf.mild is about 15.8 percent. A lot of the other variables fall in the middle range with around 10 to 15 percent missing, and a few have very little missing data.
The missing pattern also seems connected to the outcome classes. In the heatmap, some classes show clearly higher missing rates for certain predictors. The chi square tests support this too, since many predictors have extremely small adjusted p values, around 10 to the minus 133 level. This suggests the missingness is not random across classes, so missing values may carry information and should be handled carefully when preprocessing the data.
# identify predictors with very high missingness
high_missing_vars <- missing_by_var %>%
filter(missing_pct > 0.40) %>%
pull(predictor)
# combine degenerate predictors
to_drop <- unique(c(degenerate_vars, high_missing_vars))
to_drop
## [1] "leaf.mild" "mycelium" "sclerotia"
# recipe strategy for categorical missing data
soy_rec <- recipe(Class ~ ., data = soybean_clean) %>%
step_nzv(all_predictors()) %>%
step_other(all_nominal_predictors(), threshold = 0.01) %>%
step_unknown(all_nominal_predictors()) %>%
step_impute_mode(all_nominal_predictors())
soy_rec
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 35
##
## ── Operations
## • Sparse, unbalanced variable filter on: all_predictors()
## • Collapsing factor levels for: all_nominal_predictors()
## • Unknown factor level assignment for: all_nominal_predictors()
## • Mode imputation for: all_nominal_predictors()
prep_rec <- prep(soy_rec)
Soybean_processed <- bake(prep_rec, new_data = NULL)
glimpse(Soybean_processed)
## Rows: 683
## Columns: 33
## $ date <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ <fct> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ fruit.pods <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Class <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
To handle missing data in the Soybean dataset, I used a simple preprocessing plan that fits categorical variables. First, I removed predictors that had almost no variation because they do not add much information for classification. From the frequency check, leaf.mild, mycelium, and sclerotia were flagged as near zero variance, so I removed them.
Next, since missingness looked related to the class labels in part b, I treated missing values as meaningful by turning NA into an explicit unknown category. This keeps the missingness pattern instead of replacing missing values with the most common level. I also grouped very rare levels into an other category to reduce sparsity and make the model more stable. As a backup, I used mode imputation in case any missing values still remain after adding the unknown level. If the model needs numeric inputs, I would also use dummy encoding to convert the categorical predictors into binary columns.