The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
The exercise indicates we can access this data with library(mlbench), data(Glass), and str(Glass).
In addition to loading the data, we will load all other libraries needed for the entire lab.
library(mlbench)   # Glass and Soybean datasets
library(tidyr)     # pivot_longer() for reshaping
library(dplyr)     # data manipulation (masks stats::filter and stats::lag)
library(ggplot2)   # plotting
library(GGally)    # ggplot2 extensions
library(gridExtra) # arranging multiple plots
library(corrplot)  # correlation matrix plots
library(moments)   # skewness and kurtosis
library(MASS)      # note: masks dplyr::select, hence dplyr::select() below
library(caret)     # BoxCoxTrans(), nearZeroVar()
library(e1071)     # note: masks moments::skewness and moments::kurtosis
Load and check data structure
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Currently the variable Type is coded as numbers; I would like to replace these with more descriptive labels based on the dataset documentation.
Glass$Type <- factor(Glass$Type, levels = c(1,2,3,4,5,6,7),
labels = c("building_windows_float_processed",
"building_windows_non_float_processed",
"vehicle_windows_float_processed",
"vehicle_windows_non_float_processed", # Note: no observations
"containers",
"tableware",
"headlamps"))
Now let’s reshape the dataset to long format and plot a histogram for each predictor.
Glass_long <- Glass %>%
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value")
histogram_plots <- Glass_long %>%
  ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue3", color = "black", alpha = 0.7) +
facet_wrap(~ Predictor, scales = "free", ncol = 3) +
theme_minimal(base_size = 14) +
labs(title = "Histograms of Predictors",
subtitle = "Showing distribution of each predictor variable",
x = "Value", y = "Count")
print(histogram_plots)
Fe (Iron), Ba (Barium), and K (Potassium) have long right tails, indicating a small number of extreme high values. Ca (Calcium) has a moderate right skew. Mg (Magnesium) is bimodal, with a cluster near zero and a left-skewed main mass. Na (Sodium) and RI (Refractive Index) show multiple peaks, suggesting that different glass types have different compositions. Finally, Al (Aluminum) and Si (Silicon) have more symmetric distributions.
boxplot_plot <- Glass %>%
dplyr::select(where(is.numeric)) %>% # important! select only numeric columns
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(y = Value)) +
geom_boxplot(fill = "skyblue2", color = "black", alpha = 0.7) +
facet_wrap(~ Predictor, scales = 'free', ncol = 3) + # this ensures correct scaling
theme_minimal(base_size = 10) +
labs(
title = "Boxplots of Numerical Predictors",
subtitle = "Showing the spread and potential outliers for each variable",
y = "Value"
) +
theme(strip.text = element_text(size = 14, face = "bold")) # this improves the facet labels
print(boxplot_plot)
We can see that outliers are present, and several predictors have long upper whiskers, confirming the positive skew seen in the histograms.
- Ba (Barium) has extreme values far outside the interquartile range (IQR).
- K (Potassium) and Fe (Iron) also have many outliers, with values well above the typical range.
- Ca (Calcium) has a tighter spread but still shows some extreme values.
- Mg (Magnesium) shows two clusters, indicating possible differences between glass types.
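To back up the visual impression with numbers, here is a quick sketch that counts values outside the conventional 1.5 × IQR fences for each predictor (the fence rule is a heuristic, not a formal outlier definition):
# Count values beyond the 1.5 * IQR fences for each numeric predictor
outlier_counts <- Glass %>%
  dplyr::select(where(is.numeric)) %>%
  summarise(across(everything(), function(x) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    fence <- 1.5 * (q[2] - q[1])
    sum(x < q[1] - fence | x > q[2] + fence)
  }))
print(outlier_counts)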
# Compute correlation matrix (only for numeric variables)
cor_matrix <- cor(Glass %>% dplyr::select(where(is.numeric)))
# Create the correlation plot
corrplot(cor_matrix, method = "color", type = "lower",
tl.col = "black", tl.srt = 45,
addCoef.col = "black", # to add correlation coefficients
number.cex = 0.8, # adjust number size
col = colorRampPalette(c("blue", "white", "red"))(200))
This graph clearly illustrates several notable correlations:
- Ca and refractive index (0.81): a strong positive correlation, possibly indicating that the refractive index is heavily influenced by calcium content.
- Si and refractive index (-0.54): a moderate negative correlation, meaning the refractive index tends to decrease as silicon increases.
- Mg vs. Al (-0.48) and Ba vs. Mg (-0.49): suggesting that certain glass compositions favor one element over the other.
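Since caret is already loaded, its findCorrelation() function can flag predictors whose pairwise correlations exceed a chosen cutoff; a sketch with an illustrative cutoff of 0.75:
# Flag predictors with pairwise |correlation| above an illustrative 0.75 cutoff
high_corr <- findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)
print(high_corr) # candidates to drop or combine before modeling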
type_distribution_plot <- Glass %>%
count(Type) %>%
ggplot(aes(x = reorder(Type, -n), y = n, fill = Type)) + # to order by frequency
geom_bar(stat = "identity", alpha = 0.8, color = "black") +
theme_minimal(base_size = 14) +
labs(title = "Distribution of Glass Types",
subtitle = "Number of samples for each category",
x = "Glass Type",
y = "Count") +
scale_fill_brewer(palette = "Set3") + # color-friendly palette
theme(
legend.position = "none", # x-axis already labels types
axis.text.x = element_text(angle = 30, hjust = 1, size = 10), # Rotate for readability
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 14, face = "italic")
)
print(type_distribution_plot)
We can see that the most common glass types in the dataset are building windows, both float and non-float processed, whereas containers and tableware have the fewest samples. The problem with this distribution is that a classifier may struggle to predict rare categories like "tableware" because of the few available examples, which could hurt classification performance.
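A quick proportion table makes the imbalance concrete:
# Share of samples in each glass type
round(prop.table(table(Glass$Type)), 2)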
Do there appear to be any outliers in the data? Are any predictors skewed?
skewness_values <- Glass %>%
  dplyr::select(where(is.numeric)) %>%
  summarise(across(everything(), \(x) skewness(x, na.rm = TRUE)))
print(skewness_values)
## RI Na Mg Al Si K Ca Ba
## 1 1.602715 0.4478343 -1.136452 0.8946104 -0.7202392 6.460089 2.018446 3.36868
## Fe
## 1 1.729811
As observed in the previous graphs, the dataset contains several outliers. Ba, Fe, K, and Mg show extreme values far from the interquartile range, confirming the skew in their distributions. K (6.46), Ba (3.37), Ca (2.02), and Fe (1.73) are highly right-skewed, indicating a few extreme high values, while Mg (-1.14) is negatively skewed, meaning most values are high with a few smaller ones. Na and the refractive index show multimodal distributions, suggesting variations in glass composition, while Al and Si appear more symmetrically distributed. The correlation plot reveals strong relationships, notably between Ca and refractive index (0.81) and Si and refractive index (-0.54), possibly reflecting how these elements influence refractive properties. Finally, the glass type distribution shows a class imbalance: building windows dominate, while containers and tableware are underrepresented, potentially affecting model performance.
Are there any relevant transformations of one or more predictors that might improve the classification model?
Based on the skewness values we can selectively transform the highly skewed predictors. Now that we know which variables actually need transformation, we apply Box-Cox (or a log transform for special cases like Ba and Fe). Box-Cox struggles with variables that contain many zeros: it requires strictly positive values, so a zero-heavy variable must be shifted first, which can distort the data or produce misleading transformations.
skewness_values <- Glass %>%
  summarise(across(where(is.numeric), \(x) skewness(x, na.rm = TRUE)))
# based on skewness values identify predictors that need Box-Cox (|skewness| > 1.5) **excluding Ba and Fe** because of mostly 0s composition
skewed_predictors <- names(skewness_values)[abs(skewness_values[1,]) > 1.5]
skewed_predictors <- setdiff(skewed_predictors, c("Ba", "Fe")) # Remove Ba and Fe
# Apply Box-Cox only to these variables
Glass_transformed <- Glass
for (col in skewed_predictors) {
shift_value <- abs(min(Glass_transformed[[col]], na.rm = TRUE)) + 1e-4
response <- Glass_transformed[[col]] + shift_value
bc <- BoxCoxTrans(response)
Glass_transformed[[col]] <- predict(bc, response)
}
# Apply log(x+1) transformation to Ba and Fe
Glass_transformed$Ba <- log1p(Glass$Ba)
Glass_transformed$Fe <- log1p(Glass$Fe)
summary(Glass_transformed)
## RI Na Mg Al
## Min. :0.4453 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:0.4455 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :0.4455 Median :13.30 Median :3.480 Median :1.360
## Mean :0.4455 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:0.4456 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :0.4461 Max. :17.38 Max. :4.490 Max. :3.500
##
## Si K Ca Ba
## Min. :69.81 Min. :-2.4372 Min. :0.4958 Min. :0.0000
## 1st Qu.:72.28 1st Qu.:-1.4204 1st Qu.:0.4973 1st Qu.:0.0000
## Median :72.79 Median :-0.5245 Median :0.4975 Median :0.0000
## Mean :72.65 Mean :-0.8889 Mean :0.4975 Mean :0.1094
## 3rd Qu.:73.09 3rd Qu.:-0.4484 3rd Qu.:0.4977 3rd Qu.:0.0000
## Max. :75.41 Max. : 2.6901 Max. :0.4989 Max. :1.4231
##
## Fe Type
## Min. :0.00000 building_windows_float_processed :70
## 1st Qu.:0.00000 building_windows_non_float_processed:76
## Median :0.00000 vehicle_windows_float_processed :17
## Mean :0.05159 vehicle_windows_non_float_processed : 0
## 3rd Qu.:0.09531 containers :13
## Max. :0.41211 tableware : 9
## headlamps :29
To compare the original skewness with the transformed values more easily:
original_skewness <- Glass %>%
  dplyr::select(where(is.numeric)) %>%
  summarise(across(everything(), \(x) moments::skewness(x, na.rm = TRUE)))
transformed_skewness <- Glass_transformed %>%
  dplyr::select(where(is.numeric)) %>%
  summarise(across(everything(), \(x) moments::skewness(x, na.rm = TRUE)))
skewness_comparison <- bind_rows(original_skewness, transformed_skewness) %>%
mutate(Status = c("Original", "Transformed")) %>%
relocate(Status)
print(skewness_comparison)
## Status RI Na Mg Al Si K
## 1 Original 1.614015 0.4509917 -1.144465 0.9009179 -0.7253173 6.505635834
## 2 Transformed 1.595250 0.4509917 -1.144465 0.9009179 -0.7253173 0.009603982
## Ca Ba Fe
## 1 2.0326774 3.392431 1.742007
## 2 0.1099598 2.707521 1.575222
After the transformations, we see that skewness was successfully reduced in some variables while others remained skewed. K (6.51 to 0.0096) and Ca (2.03 to 0.11) are now nearly symmetric, showing that Box-Cox worked well. However, Ba (3.39 to 2.71) and Fe (1.74 to 1.58) remain skewed despite the log(x+1) transformation, most likely because of the large number of zeros in these variables, which makes full normalization difficult. RI was transformed but barely improved (1.61 to 1.60), while Na, Mg, Al, and Si were correctly left untransformed and are unchanged. Since Box-Cox effectively normalized K and Ca but log(x+1) could not fix Ba and Fe, these last two might be better represented as binary variables (presence/absence).
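A minimal sketch of that last idea, using hypothetical indicator names Ba_present and Fe_present:
# Recode Ba and Fe as presence/absence indicators (hypothetical feature names)
# log1p(x) > 0 exactly when the original x > 0, so the transform does not affect this
Glass_transformed <- Glass_transformed %>%
  mutate(Ba_present = ifelse(Ba > 0, "present", "absent"),
         Fe_present = ifelse(Fe > 0, "present", "absent"))
# Cross-tabulate presence against glass type
table(Glass_transformed$Type, Glass_transformed$Ba_present)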
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Again, we start by loading the dataset and checking the structure:
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
Then we can check the frequency distributions of the categorical predictors. Since most predictors are categorical, we plot a bar chart for each:
columns <- colnames(Soybean)
invisible(lapply(columns, function(col) {
  print(
    ggplot(Soybean, aes(x = .data[[col]])) + # .data pronoun replaces deprecated aes_string()
      geom_bar(fill = "skyblue3", color = "black") +
      coord_flip() +
      ggtitle(col) +
      theme_minimal()
  )
}))
This produces too many plots to inspect comfortably. A more efficient approach is to compute a frequency summary for each categorical predictor, identify near-zero variance predictors, and check for missing values.
I will show a table with:
- max_prop: the largest category proportion (closer to 100% means near-zero variance).
- most_common_category: the dominant category.
- num_categories: the total number of unique categories in the predictor.
# Identify problematic categorical variables
summarize_categorical_issues <- function(df) {
categorical_vars <- names(Filter(is.factor, df))
summary_data <- lapply(categorical_vars, function(var) {
counts <- table(df[[var]])
proportions <- prop.table(counts) * 100
max_prop <- max(proportions)
list(
variable = var,
most_common_category = names(which.max(counts)),
max_prop = round(max_prop, 2),
num_categories = length(counts)
)
})
summary_df <- do.call(rbind, lapply(summary_data, as.data.frame))
summary_df <- summary_df[order(-summary_df$max_prop), ]
return(summary_df)
}
# Run function on Soybean dataset
summary_table <- summarize_categorical_issues(Soybean)
# Print the summary table
print(summary_table)
## variable most_common_category max_prop num_categories
## 26 mycelium 0 99.07 2
## 28 sclerotia 0 96.90 2
## 35 shriveling 0 93.41 2
## 19 leaf.mild 0 93.04 3
## 21 lodging 0 92.53 2
## 18 leaf.malf 0 92.49 2
## 27 int.discolor 0 90.08 3
## 34 seed.size 0 90.02 2
## 33 seed.discolor 0 88.91 2
## 13 leaves 1 88.73 2
## 32 mold.growth 0 88.66 2
## 36 roots 0 84.51 3
## 17 leaf.shread 0 83.53 2
## 24 fruiting.bodies 0 81.98 2
## 31 seed 0 80.54 2
## 6 hail 0 77.40 2
## 25 ext.decay 0 77.05 3
## 4 precip 2 71.16 3
## 29 fruit.pods 0 67.95 4
## 12 plant.growth 0 66.12 2
## 30 fruit.spots 0 59.79 4
## 15 leaf.marg 0 59.60 3
## 22 stem.cankers 0 58.76 4
## 9 sever 1 57.30 3
## 5 temp 1 57.27 3
## 14 leaf.halo 2 57.10 3
## 20 stem 1 55.62 2
## 3 plant.stand 0 54.71 2
## 16 leaf.size 1 54.59 3
## 10 seed.tmt 0 54.27 3
## 23 canker.lesion 0 49.61 4
## 11 germ 1 37.30 3
## 8 area.dam 1 33.28 4
## 7 crop.hist 2 32.83 4
## 2 date 5 21.85 7
## 1 Class brown-spot 13.47 19
With the table we can more easily see that the frequency distributions of the categorical predictors in the Soybean dataset reveal several degenerate distributions.
Many variables exhibit near-zero variance, meaning they are overwhelmingly dominated by a single category, providing little to no useful variation for predictive modeling.
For example, mycelium (99.07%), sclerotia (96.90%), shriveling (93.41%), and leaf.mild (93.04%) are almost entirely composed of a single category, making them redundant features.
Additionally, other variables such as lodging, leaf.malf, int.discolor, and seed.size have over 90% of their observations in one category, further limiting their value. Variables like roots, leaf.shread, and fruiting.bodies also show strong imbalances, exceeding 80% dominance in one category.
While some predictors, such as date, crop.hist, area.dam, and germ, show more balanced distributions, the target variable (Class) itself is imbalanced, with brown-spot (13.47%) being the most frequent disease.
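As a cross-check, caret's nearZeroVar() automates this diagnostic using its default frequency-ratio and unique-value thresholds, so the flagged set may differ slightly from the table above:
# Flag near-zero variance predictors with caret's default thresholds
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ] # predictors flagged as near-zero variance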
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
We can start with a total count of missing values per predictor:
Soybean %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>% # across() replaces superseded summarise_all()
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
  ggplot(aes(y = reorder(variable, missing_count), x = missing_count, fill = missing_count > 0)) +
  geom_col() +
  labs(title = "Missing Values per Predictor",
       x = "Number of Missing Values", y = "Variable") +
  scale_fill_manual(values = c("grey", "red"), labels = c("Complete", "Missing")) +
  theme_minimal()
This first plot highlights the extent of missing data across different predictors in the dataset. Severe missingness is observed in variables like sever, seed.tmt, lodging, hail, and germ, each having over 100 missing values. This suggests that some plant condition-related predictors may be inconsistently recorded.
missingness_plot <- Soybean %>%
group_by(Class) %>%
mutate(class_Total = n()) %>%
ungroup() %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
summarise(Missing = n(), class_Total = first(class_Total)) %>%
mutate(Proportion = Missing / class_Total) %>%
ggplot(aes(x = reorder(Class, -Proportion), y = Proportion, fill = Proportion)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_gradient(low = "skyblue", high = "red") +
theme_minimal(base_size = 12) +
labs(title = "Proportion of Missing Values per Class",
x = "Disease Class",
y = "Proportion of Missing Data") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Print the plot
print(missingness_plot)
This second plot shows the proportion of missing data across disease classes and reveals that missingness is not random but class-dependent. Notably, the 2-4-d injury, cyst nematode, diaporthe pod & stem blight, and herbicide injury classes exhibit nearly 100% missingness, indicating these classes may have been poorly documented or contain minimal data. Phytophthora rot shows moderate missingness but retains a notable amount of complete data. These patterns could affect classification performance and should be addressed through imputation or exclusion strategies.
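A quick numeric check of that class dependence, computing the share of rows with at least one missing value within each class:
# Proportion of incomplete rows per class, highest first
sort(tapply(!complete.cases(Soybean), Soybean$Class, mean), decreasing = TRUE)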
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Given that the missing data are both predictor-specific and class-dependent, I would suggest adopting a mixed approach that balances data retention with model accuracy.
We saw that predictors like sever, seed.tmt, lodging, hail, and germ have high proportions of missing values (>15%). If these variables carry vital information for classification, imputation may be the way to go; but if their contribution is minor, removing them prevents introducing noise or bias and might be more effective. A reasonable rule is to remove predictors with more than 25% missing values.
Regarding classes, we can follow a similar approach, motivated by classes like 2-4-d injury, cyst nematode, diaporthe pod & stem blight, and herbicide injury, which exhibit nearly 100% missingness in some predictors. If a class has too many missing values across all predictors, say more than 75%, we should consider removing it.
For the remaining cases we can impute: fill with the most common category for variables with less than 15% missing data, and use predictive imputation such as decision trees or k-nearest neighbors for variables with 15-30% missing data, as sketched below.
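A minimal sketch of that strategy, assuming the thresholds above; the mode_impute() helper is made up for illustration, and a predictive imputer (e.g., VIM::kNN(), if that package is available) could replace it for the 15-30% band:
# 1. Drop predictors with more than 25% missing values
na_prop <- colMeans(is.na(Soybean))
keep_cols <- names(na_prop)[na_prop <= 0.25]
Soybean_clean <- Soybean[, keep_cols]
# 2. Impute remaining NAs with the most common category (hypothetical helper)
mode_impute <- function(x) {
  if (anyNA(x)) {
    tab <- table(x) # table() ignores NAs by default
    x[is.na(x)] <- names(tab)[which.max(tab)]
  }
  x
}
Soybean_clean <- as.data.frame(lapply(Soybean_clean, mode_impute))
sum(is.na(Soybean_clean)) # should be 0 after imputation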