Author: Farhod Ibragimov
library(knitr)
library(kableExtra)
library(mlbench)
library(ggplot2)
library(tidyverse)
library(e1071)
library(caret)
library(corrplot)
library(gridExtra)
library(funModeling)
library(rstatix)
library(Amelia)
Glass_long <- GlassX |>
mutate(row_id = row_number()) |>
pivot_longer(cols = -row_id, names_to = "Variable", values_to = "Value")
skewValues <- apply(GlassX, 2, skewness)
skewValues <- sort(skewValues, decreasing = TRUE)
skew_df <- data.frame(
Variable = names(skewValues),
Skewness = skewValues
)
Glass_long2 <- Glass_long |>
left_join(skew_df, by = "Variable") |>
mutate(
Variable = reorder(Variable, -Skewness)
)
ggplot(Glass_long2, aes(Value)) +
geom_histogram(bins = 25, fill = "#008080") +
facet_wrap(~Variable, scales = "free")+
labs(title = 'Distributions of variables')
skewValues
## K Ba Ca Fe RI Al Na
## 6.4600889 3.3686800 2.0184463 1.7298107 1.6027151 0.8946104 0.4478343
## Si Mg
## -0.7202392 -1.1364523
Distribution and Skewness Analysis
K (skewness =
6.46)
K is extremely right skewed. Histogram shows most values very
close to zero, and few large values (outliers) up to 6. This creates
very long right tail. Very strong asymmetry.
Ba (skewness =
3.37)
Ba is heavily right skewed. Almost all observations are 0, and
only small number of values are larger. Histogram confirms strong
concentration at zero with long right tail. I have a suspicion that
outliers could have relationship with some outcome types. I will do
further analysis later
Ca (skewness =
2.02)
Ca shows strong positive skewness. Most values are around 8–9,
but some values go much higher, which creates right tail.
Fe (skewness =
1.73)
Fe is right skewed. Many values are equal to 0, and few
observations are larger. Histogram clearly shows long right tail. I’ll
do analysis of these outliers later.
RI (skewness =
1.60)
RI has moderate right skewness. Distribution is mostly
concentrated but moderate right tail is visible.
Al (skewness =
0.89)
Al has mild right skewness. Distribution is not symmetric but
skewness is not very strong.
Na (skewness =
0.45)
Na is close to symmetric. Slight right skew, but overall
distribution looks fairly balanced.
Si (skewness =
-0.72)
Si shows mild negative skewness. Histogram shows longer left
tail with few smaller values.
Mg (skewness =
-1.14)
Mg has moderate negative skewness. Most values are around 3–4,
but some small values create left tail.
Several predictors (especially K, Ba and Fe) show very strong right skewness. Fe and Ca are also noticeably skewed. Mg and Si show negative skewness. Such strong skewness may affect some classification methods that assume normal distribution, so transformation like log or Box-Cox may be considered for highly skewed variables.
correlationsGlass <- cor(GlassX)
corrplot(correlationsGlass,
method = "color",
tl.cex = 0.9
)
From correlations plot I see that Ca and RI are likely most correlated variables. Rest of variables have mild or less correlations.
strongCorr <- correlationsGlass |>
as.data.frame() |>
rownames_to_column("Pred1") |>
pivot_longer(-Pred1, names_to = "Pred2", values_to = "Correlation") |>
filter(Pred1 < Pred2) |>
filter(abs(Correlation) > 0.8) |>
arrange(desc(abs(Correlation)))
strongCorr
As we see from analysis above correlation coefficient for Ca and RI is 0.81 and is moderate.
ggplot(GlassX, aes(Ca, RI)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE)
As we see scatterplot of Ca and RI also show linear correlation. I would consider applying PCA to these variables or removal one of these features for modelling purposes.
Near-zero variance
nearZeroVar(GlassX, saveMetrics = TRUE)
The near-zero variance analysis show that none of predictors have
zero or near-zero variances. We can see all near-zero variance
(nzv) are flagged FALSE.
However, Ba (freqRatio = 88) and Fe
(freqRatio = 20.57) show very large frequency ratios,
meaning that one value (zero values) appears much more frequently than
others. This is consistent with histogram results, where many values are
equal to zero.
Even frequency ratio of Ba and Fe exceed the
freqRatio cutoff (>19) each have
percentUnique above 10%, so they are not classified as
near-zero variance.
Glass_long_out <- Glass_long |>
group_by(Variable) |>
mutate(
Q1 = quantile(Value, 0.25, na.rm = TRUE),
Q3 = quantile(Value, 0.75, na.rm = TRUE),
IQR = Q3 - Q1,
lower_thres = Q1 - 1.5 * IQR,
upper_thres = Q3 + 1.5 * IQR,
Outlier = (Value < lower_thres) | (Value > upper_thres)
) |>
ungroup()
outlier_summary <- Glass_long_out |>
group_by(Variable) |>
summarise(
n_outliers = sum(Outlier, na.rm = TRUE),
outliers_percentage = round(mean(Outlier, na.rm = TRUE) * 100, 2)
) |>
arrange(desc(n_outliers))
outlier_summary
Using 1.5×IQR rule, Ba has the highest number of outliers
(17.8%), followed by Ca (12.1%). This matches what we saw in
histograms, where both variables show long right tails and strong
skewness.
Al and RI have moderate amount of outliers, while
Fe and Si have smaller but still noticeable
percentage. Mg has no detected outliers, which means its
distribution is more compact and stable.
It is interesting that K has very high skewness (6.46), but only 3.3% of observations are flagged as outliers. This shows that skewness and outliers are not the same thing:
For this chemical data, large number of outliers in Ba and Ca likely represent heavy tail behavior and not measurement errors
ggplot(Glass_long_out, aes(Variable, Value)) +
geom_boxplot(outlier.shape = NA) +
facet_wrap(~Variable, scales = "free") +
geom_jitter(
data = Glass_long_out |> filter(Outlier == TRUE),
aes(colour = Outlier),
width = 0.001,
size = 0.7,
alpha = 0.7
) +
scale_color_manual(values = c(`TRUE` = "red")) +
theme(legend.position = "none")
Boxplots confirm the outlier results from IQR analysis.
Ba clearly shows the largest number of extreme values, with
many red points above zero, which matches its high skewness and heavy
right tail. Also boxplot confirms that majority of Ba values
are zeroes.
Ca also has several high outliers, especially strong positive
right skewness.
Al and RI show moderate number of extreme
observations, while Fe and Si have smaller but visible
outliers.
K has very high skewness, but fewer IQR outliers compared to
Ba, which means distribution is asymmetric but most values are
still concentrated near lower range.
Mg does not show any outliers.
Overall, the boxplots support previous findings that Ba, Fe and Ca have heavy tail behavior, and these extreme values likely represent real variation in glass composition rather than measurement errors.
Outlier in variable K
max(Glass$K)
## [1] 6.21
Glass[which.max(Glass$K),]
boxplot(K ~ Type, data = Glass,
col = "#008080",
main = "K by Glass Type")
The boxplot of K by glass type shows clear differences
between classes.
Types 1, 2 and 3 have similar distributions with moderate values and
small spread.
Type 5 stands out with much larger variability and extreme high values,
including one extreme observation above 6. It appears abnormal and
requires further investigation.
Type 6 has almost all values equal to zero.
Type 7 shows several high outliers.
The strong overall skewness of K is mainly driven by extreme values in
types 5 and 7.
I would investigate if this is error occured at data entry or if it is
normal during chemical processes.
mean(GlassX$Ba == 0)
## [1] 0.8224299
mean(GlassX$Fe == 0)
## [1] 0.6728972
prop.table(table(type, GlassX$Ba == 0), margin = 1)
##
## type FALSE TRUE
## 1 0.04285714 0.95714286
## 2 0.07894737 0.92105263
## 3 0.05882353 0.94117647
## 5 0.15384615 0.84615385
## 6 0.00000000 1.00000000
## 7 0.89655172 0.10344828
The proportion analysis shows that Ba equals zero for
majority of observations in almost all glass types. For types 1–6, the
range 84% - 96% of samples have Ba = 0, and for type 6 it is
100%.
However, type 7 shows something completely different - there are almost
90% of observations have Ba larger than zero.
This suggests that Ba is strongly related to class label and
may help distinguish type 7 from other glass types. Even though Ba is
highly skewed and zero-heavy, it may contain important classification
information.
Furthermore, because of these findings and small variance of Ba
I wouldn’t apply PCA to these variable since PCA gives more weight to
features with large variances when constructing components, and
variables with lower variances may be pushed into lower
components.
Also Log(Ba +1) transformation is not appropriate to apply. The reason
is because log transformation can compress separation between zero and
positive values. This can lead to reduction of its help to separate type
7 in the model.
Overall, I wouldn’t remove and transform this variable.
Same analysis can be applied to Fe variable.
prop.table(table(type, GlassX$Fe == 0), margin = 1)
##
## type FALSE TRUE
## 1 0.3571429 0.6428571
## 2 0.4210526 0.5789474
## 3 0.2941176 0.7058824
## 5 0.1538462 0.8461538
## 6 0.0000000 1.0000000
## 7 0.2068966 0.7931034
The proportion table for Fe shows that majority of
observations have Fe equal to zero across almost all glass types.
For types 1–3, around 58%–71% of samples have Fe = 0. For type
5 it is about 85%, and for type 6 it is 100%. Type 7 also has high
proportion of zeros (around 79%).
Compared to Ba, Fe does not show clear separation between one
specific type class and others. Even though Fe is skewed and
zero-heavy, it does not appear to strongly distinguish one glass type
from the rest based only on zero versus non-zero values.
Overall, at this point i would leave this feature as is and make a decision at modelling stage.
boxcox_plot <- function(x, var_name, bins = 25) {
bc_model <- BoxCoxTrans(x)
lambda <- round(bc_model$lambda, 3)
x_bxcx <- predict(bc_model, x)
par(mfrow = c(1,2))
hist(x,
breaks = bins,
main = paste(var_name, "(original)"),
col = "#008080")
hist(x_bxcx,
breaks = bins,
main = paste(var_name, "(Box-Cox lambda =", lambda, ")"),
col = "#008080")
par(mfrow = c(1,1))
invisible(list(model = bc_model, transformed = x_bxcx))
}
boxcox_plot(Glass$Ca, 'Ca')
The Box-Cox transformation (lambda = -1.1) makes Ca
distribution more symmetric compared to the original. In the original
histogram, Ca shows noticeable right skew with few larger
values in the range of 12 - 16.
After transformation, the right tail is compressed and the distribution
looks more balanced around the center. The transformation reduces
skewness and stabilizes spread of values.
However, since Ca was not extremely skewed (skewness = 2.02),
so the improvement is moderate.
boxcox_plot(Glass$Al, 'Al')
Boxcox transformation of Al improved distibution.
It is more symmetric and has reduced right skewness.
However, since the original skewness was not very strong, the overall
improvement is moderate.
boxcox_plot(Glass$RI, 'RI')
For Ri, Box-Cox transformation estimated lambda = -2, which
is a inverse transformation. The original RI distribution was
already fairly concentrated with only moderate right skewness.
After transformation, the distribution looks compressed and less
normal.
Since RI was not extremely skewed, applying transformation does
not appear necessary and may reduce interpretability without significant
benefit.
Al_bc <- BoxCoxTrans(GlassX$Al)
Ca_bc <- BoxCoxTrans(GlassX$Ca)
K_bc <- BoxCoxTrans(GlassX$K)
Glass_prep <- GlassX |>
mutate(
Al = predict(Al_bc, Al),
Ca = predict(Ca_bc, Ca),
K = log(K +0.1)
)
scale_vars <- setdiff(names(Glass_prep), c("Ba", "Fe"))
preProc <- preProcess(Glass_prep[, scale_vars],
method = c("center", "scale"))
Glass_final <- Glass_prep
Glass_final[, scale_vars] <- predict(preProc, Glass_prep[, scale_vars])
# Check result
summary(Glass_final)
## RI Na Mg Al
## Min. :-2.3759 Min. :-3.2793 Min. :-1.8611 Min. :-3.12421
## 1st Qu.:-0.6068 1st Qu.:-0.6127 1st Qu.:-0.3948 1st Qu.:-0.45169
## Median :-0.2257 Median :-0.1321 Median : 0.5515 Median :-0.08726
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.2608 3rd Qu.: 0.5108 3rd Qu.: 0.6347 3rd Qu.: 0.44750
## Max. : 5.1252 Max. : 4.8642 Max. : 1.2517 Max. : 3.32208
## Si K Ca Ba
## Min. :-3.6679 Min. :-1.7924 Min. :-4.6168 Min. :0.000
## 1st Qu.:-0.4789 1st Qu.:-0.8188 1st Qu.:-0.4676 1st Qu.:0.000
## Median : 0.1795 Median : 0.4961 Median :-0.1401 Median :0.000
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean :0.175
## 3rd Qu.: 0.5636 3rd Qu.: 0.5944 3rd Qu.: 0.3253 3rd Qu.:0.000
## Max. : 3.5622 Max. : 3.2545 Max. : 3.2694 Max. :3.150
## Fe
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.05701
## 3rd Qu.:0.10000
## Max. :0.51000
stats_before <- data.frame(
Variable = names(GlassX),
Min_Before = sapply(GlassX, min),
Max_Before = sapply(GlassX, max),
Mean_Before= sapply(GlassX, mean),
SD_Before = sapply(GlassX, sd)
)
stats_after <- data.frame(
Variable = names(Glass_final),
Min_After = sapply(Glass_final, min),
Max_After = sapply(Glass_final, max),
Mean_After = sapply(Glass_final, mean),
SD_After = sapply(Glass_final, sd)
)
stats_compare <- merge(stats_before, stats_after, by = "Variable")
kable(stats_compare, digits = 3, booktabs = TRUE) |>
kable_styling(latex_options = c("scale_down"), font_size = 8)
| Variable | Min_Before | Max_Before | Mean_Before | SD_Before | Min_After | Max_After | Mean_After | SD_After |
|---|---|---|---|---|---|---|---|---|
| Al | 0.290 | 3.500 | 1.445 | 0.499 | -3.124 | 3.322 | 0.000 | 1.000 |
| Ba | 0.000 | 3.150 | 0.175 | 0.497 | 0.000 | 3.150 | 0.175 | 0.497 |
| Ca | 5.430 | 16.190 | 8.957 | 1.423 | -4.617 | 3.269 | 0.000 | 1.000 |
| Fe | 0.000 | 0.510 | 0.057 | 0.097 | 0.000 | 0.510 | 0.057 | 0.097 |
| K | 0.000 | 6.210 | 0.497 | 0.652 | -1.792 | 3.255 | 0.000 | 1.000 |
| Mg | 0.000 | 4.490 | 2.685 | 1.442 | -1.861 | 1.252 | 0.000 | 1.000 |
| Na | 10.730 | 17.380 | 13.408 | 0.817 | -3.279 | 4.864 | 0.000 | 1.000 |
| RI | 1.511 | 1.534 | 1.518 | 0.003 | -2.376 | 5.125 | 0.000 | 1.000 |
| Si | 69.810 | 75.410 | 72.651 | 0.775 | -3.668 | 3.562 | 0.000 | 1.000 |
Several preprocessing steps were applied to improve the data. Box-Cox
transformation was applied to Al and Ca because their
distributions were right skewed. After transformation, both variables
became more symmetric and their spread looks more stable.
For K, I applied log(K + 0.1) transformation since it had very
strong positive skewness and extreme large outlier. This transformation
reduced the right tail and compressed extreme observations, but still
kept relative differences between values.
After that, centering and scaling were applied to all predictors
except Ba and Fe. These two variables were kept in
original scale because they are zero-heavy and Ba contain
important class information. Scaling helped to make variables comparable
since they were originally measured on very different scales.
As I mentioned earlier I wouldn’t apply PCA transformation on all
predictors at this point because:
It would hide Ba into lower components, reducing it’s dominance in distinguishing of type 7
There are only 9 predictors and 214 observations, therefore
dimensionality reduction probably will not help (actually likely will
hurt).
Overall, the transformations improved stability of the data without removing important structural patterns.
3.2. The soybean data can also be found at the UC Irvine Machine Learning
Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
data("Soybean")
#Soybean
Class <- Soybean$Class
Predictors <- Soybean |> select(-Class)
#Predictors
Let’s check for near-zero variance predictors
nzv <- nearZeroVar(Predictors, saveMetrics = TRUE)
flagged <- rownames(nzv[nzv$nzv == TRUE, ])
flagged
## [1] "leaf.mild" "mycelium" "sclerotia"
The nearZeroVar() function identified following
predictors as near-zero variance:
leaf.mild
mycelium
sclerotia
This means that marginally (overall distribution), these variables are dominated by one level and have very little variation across the dataset.
But before removing them I want to check how these predictors are dominant within each class
for (v in flagged) {
cat("\n***********************************************\n")
cat("Predictor:", v, "\n")
tab <- table(Predictors[[v]], Class, useNA = "ifany")
prop_by_class <- prop.table(tab, margin = 2)
dominant <- apply(prop_by_class, 2,
function(col) rownames(prop_by_class)[which.max(col)])
print(data.frame(
Class = names(dominant),
Dominant_Level = dominant,
row.names = NULL
))
}
##
## ***********************************************
## Predictor: leaf.mild
## Class Dominant_Level
## 1 2-4-d-injury <NA>
## 2 alternarialeaf-spot 0
## 3 anthracnose 0
## 4 bacterial-blight 0
## 5 bacterial-pustule 0
## 6 brown-spot 0
## 7 brown-stem-rot 0
## 8 charcoal-rot 0
## 9 cyst-nematode <NA>
## 10 diaporthe-pod-&-stem-blight <NA>
## 11 diaporthe-stem-canker 0
## 12 downy-mildew 2
## 13 frog-eye-leaf-spot 0
## 14 herbicide-injury <NA>
## 15 phyllosticta-leaf-spot 0
## 16 phytophthora-rot <NA>
## 17 powdery-mildew 1
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
##
## ***********************************************
## Predictor: mycelium
## Class Dominant_Level
## 1 2-4-d-injury <NA>
## 2 alternarialeaf-spot 0
## 3 anthracnose 0
## 4 bacterial-blight 0
## 5 bacterial-pustule 0
## 6 brown-spot 0
## 7 brown-stem-rot 0
## 8 charcoal-rot 0
## 9 cyst-nematode <NA>
## 10 diaporthe-pod-&-stem-blight 0
## 11 diaporthe-stem-canker 0
## 12 downy-mildew 0
## 13 frog-eye-leaf-spot 0
## 14 herbicide-injury <NA>
## 15 phyllosticta-leaf-spot 0
## 16 phytophthora-rot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
##
## ***********************************************
## Predictor: sclerotia
## Class Dominant_Level
## 1 2-4-d-injury <NA>
## 2 alternarialeaf-spot 0
## 3 anthracnose 0
## 4 bacterial-blight 0
## 5 bacterial-pustule 0
## 6 brown-spot 0
## 7 brown-stem-rot 0
## 8 charcoal-rot 1
## 9 cyst-nematode <NA>
## 10 diaporthe-pod-&-stem-blight 0
## 11 diaporthe-stem-canker 0
## 12 downy-mildew 0
## 13 frog-eye-leaf-spot 0
## 14 herbicide-injury <NA>
## 15 phyllosticta-leaf-spot 0
## 16 phytophthora-rot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
After examining the dominant level within each disease class, we can see that the situation is more complex.
leaf.mild - the dominant level differs across
several classes. Some diseases are mostly <NA>, while
others are mostly 0, and for example downy-mildew
is dominated by level 2 and powdery-mildew by level 1. This
shows that even though the variable looks as near-zero variance, it
behaves different depending on the class. So looks like it contains
meaningful class information.
mycelium - most classes are dominated by level 0,
and only few classes show <NA> as dominant. This
variable appears more flat across classes compared to
leaf.mild.
sclerotia - most classes again have dominant level
0, but charcoal-rot clearly has dominant level 1. This suggests
that this predictor may help distinguish that particular disease from
others. So even if it is near-zero variance marginally, it still has
some conditional structure.
Overall, although these three predictors are indicated as near-zero
variance based on marginal frequency distributions,
leaf.mildand sclerotia show differences across
disease classes when we look conditionally. Therefore, they should not
be automatically removed only because of near-zero variance screening.
At the same time myceliumappears to be a strong candidate
to cut.
b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
missing_data_perc <- sapply(Soybean,
function(x) sum(is.na(x))/length(x) * 100)
sort(missing_data_perc[missing_data_perc>0], decreasing = TRUE)
## hail sever seed.tmt lodging germ
## 17.7159590 17.7159590 17.7159590 17.7159590 16.3982430
## leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
## 15.8125915 15.5197657 15.5197657 15.5197657 15.5197657
## leaf.shread seed mold.growth seed.size leaf.halo
## 14.6412884 13.4699854 13.4699854 13.4699854 12.2986823
## leaf.marg leaf.size leaf.malf fruit.pods precip
## 12.2986823 12.2986823 12.2986823 12.2986823 5.5636896
## stem.cankers canker.lesion ext.decay mycelium int.discolor
## 5.5636896 5.5636896 5.5636896 5.5636896 5.5636896
## sclerotia plant.stand roots temp crop.hist
## 5.5636896 5.2708638 4.5387994 4.3923865 2.3426061
## plant.growth stem date area.dam
## 2.3426061 2.3426061 0.1464129 0.1464129
As we see certain predictors are more likely to be missing. The
variables hail, sever, seed.tmt,
lodging, germand leaf.mild have the highest
missing proportions (around 16–18%), while some predictors such as
date and area.dam have almost no missing
values. This indicates that missingness is not evenly distributed across
predictors and is specific for each predictor.
table(Soybean$Class, is.na(Soybean$hail))
##
## FALSE TRUE
## 2-4-d-injury 0 16
## alternarialeaf-spot 91 0
## anthracnose 44 0
## bacterial-blight 20 0
## bacterial-pustule 20 0
## brown-spot 92 0
## brown-stem-rot 44 0
## charcoal-rot 20 0
## cyst-nematode 0 14
## diaporthe-pod-&-stem-blight 0 15
## diaporthe-stem-canker 20 0
## downy-mildew 20 0
## frog-eye-leaf-spot 91 0
## herbicide-injury 0 8
## phyllosticta-leaf-spot 20 0
## phytophthora-rot 20 68
## powdery-mildew 20 0
## purple-seed-stain 20 0
## rhizoctonia-root-rot 20 0
For the predictor hail, missingness is clearly
class-dependent. Several disease classes such as
2-4-d-injury, cyst-nematode, and
diaporthe-pod-&-stem-blight have 100% missing values,
while most other classes have no missing at all. This indicates that
missingness is not random and is strongly related to disease type.
It also sounds logical biologically, since hail can cause physical
damage to soybean plants and may be relevant only for certain diseases.
Therefore, hail appears to have informative missingness, meaning that
the presence of NA itself may contain predictive information.
missing_prop <- prop.table(table(Soybean$Class, is.na(Soybean$hail)), margin = 1)
# Multiply by 100 and round for easier reading
round(missing_prop * 100, 1)
##
## FALSE TRUE
## 2-4-d-injury 0.0 100.0
## alternarialeaf-spot 100.0 0.0
## anthracnose 100.0 0.0
## bacterial-blight 100.0 0.0
## bacterial-pustule 100.0 0.0
## brown-spot 100.0 0.0
## brown-stem-rot 100.0 0.0
## charcoal-rot 100.0 0.0
## cyst-nematode 0.0 100.0
## diaporthe-pod-&-stem-blight 0.0 100.0
## diaporthe-stem-canker 100.0 0.0
## downy-mildew 100.0 0.0
## frog-eye-leaf-spot 100.0 0.0
## herbicide-injury 0.0 100.0
## phyllosticta-leaf-spot 100.0 0.0
## phytophthora-rot 22.7 77.3
## powdery-mildew 100.0 0.0
## purple-seed-stain 100.0 0.0
## rhizoctonia-root-rot 100.0 0.0
miss_by_class_summary <- function(df, outcome = "Class") {
Class <- df[[outcome]]
preds <- setdiff(names(df), outcome)
out <- lapply(preds, function(p) {
miss_rate_by_class <- tapply(is.na(df[[p]]), Class, mean)
max_rate <- max(miss_rate_by_class)
max_class <- names(which.max(miss_rate_by_class))
data.frame(
predictor = p,
max_missing_rate = max_rate,
class_with_max_missing = max_class,
stringsAsFactors = FALSE
)
})
out <- do.call(rbind, out)
out <- out[out$max_missing_rate > 0, ]
out[order(-out$max_missing_rate), ]
}
miss_summary <- miss_by_class_summary(Soybean, outcome = "Class")
miss_summary$max_missing_percent <-
round(miss_summary$max_missing_rate * 100, 1)
miss_summary[, c("predictor",
"max_missing_percent",
"class_with_max_missing")]
The results show that many predictors have 100% missing values within
specific classes. For example, a large number of variables are
completely missing for class 2-4-d-injury. Also, several
leaf-related variables are fully missing for cyst-nematode.
It appears that missingness can be class-related and is not random.
c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
I think looking at variables one by one is probably the most
reasonable strategy here. Chapter 3 clearly says that before fixing or
removing missing values, we need to understand why they are missing in
the first place. If we blindly impute everything, we may destroy useful
structure in the data. For example, if hail was not
recorded for some class and is mostly NA in that particular outcome
class, and we apply something like KNN imputation, we could lose that
structured missingness and replace it with artificial values.
Based on the analysis, and especially since most predictors are categorical, it makes more sense to apply different approaches instead of one single rule for all predictors.
The most important step would be to investigate why the data are missing. If possible, I would try to trace back the data collection process and collect the missing values. Sometimes missingness is caused by recording errors or incomplete measurement. If the original data cannot be recovered, then we need to decide whether to remove the predictor or apply imputation.
Some predictors (like hail and several others) are 100%
missing for certain classes. That is clearly not random. It looks
structural and class-related. In this case, NA itself contains
information. So replacing missing values with the mode would be wrong. A
better idea is either to treat NA as its own category (“Missing”) or to
use models like tree-based methods that can naturally handle missing
data.
If a predictor has many missing values but they are not clearly related to class, then it may not be very useful. Keeping such a variable may only introduce noise. In that case, removing it might be more stable than imputing many artificial values.
If only a few values are missing and there is no class dependency, then simple imputation is reasonable. For categorical variables, replacing with the most frequent level may be enough. More complex methods like KNN imputation are possible, but for this dataset size it may be unnecessary and could even introduce instability.