DATA624_Assignment4

Author: Farhod Ibragimov

library(knitr)
library(kableExtra)
library(mlbench)
library(ggplot2)
library(tidyverse)
library(e1071)
library(caret)
library(corrplot)
library(gridExtra)
library(funModeling)
library(rstatix)
library(Amelia)

3.1. The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

data("Glass")
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

type <- Glass$Type
GlassX <- Glass |> select(-Type)

summary(GlassX)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe         
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.05701  
##  3rd Qu.:0.10000  
##  Max.   :0.51000

table(type)

## type
##  1  2  3  5  6  7 
## 70 76 17 13  9 29

(a) Visualizing distributions and relationships

Distributions: histograms

Glass_long <- GlassX |>
  mutate(row_id = row_number()) |>
  pivot_longer(cols = -row_id, names_to = "Variable", values_to = "Value")

skewValues <- apply(GlassX, 2, skewness)
skewValues <- sort(skewValues, decreasing = TRUE)

skew_df <- data.frame(
  Variable = names(skewValues),
  Skewness = skewValues
)

Glass_long2 <- Glass_long |>
  left_join(skew_df, by = "Variable") |>
  mutate(
    Variable = reorder(Variable, -Skewness)
  )

ggplot(Glass_long2, aes(Value)) +
  geom_histogram(bins = 25, fill = "#008080") +
  facet_wrap(~Variable, scales = "free")+
  labs(title = 'Distributions of variables')

skewValues

##          K         Ba         Ca         Fe         RI         Al         Na 
##  6.4600889  3.3686800  2.0184463  1.7298107  1.6027151  0.8946104  0.4478343 
##         Si         Mg 
## -0.7202392 -1.1364523

Distribution and Skewness Analysis

K (skewness = 6.46)
K is extremely right skewed. Histogram shows most values very close to zero, and few large values (outliers) up to 6. This creates very long right tail. Very strong asymmetry.
Ba (skewness = 3.37)
Ba is heavily right skewed. Almost all observations are 0, and only small number of values are larger. Histogram confirms strong concentration at zero with long right tail. I have a suspicion that outliers could have relationship with some outcome types. I will do further analysis later
Ca (skewness = 2.02)
Ca shows strong positive skewness. Most values are around 8–9, but some values go much higher, which creates right tail.
Fe (skewness = 1.73)
Fe is right skewed. Many values are equal to 0, and few observations are larger. Histogram clearly shows long right tail. I’ll do analysis of these outliers later.
RI (skewness = 1.60)
RI has moderate right skewness. Distribution is mostly concentrated but moderate right tail is visible.
Al (skewness = 0.89)
Al has mild right skewness. Distribution is not symmetric but skewness is not very strong.
Na (skewness = 0.45)
Na is close to symmetric. Slight right skew, but overall distribution looks fairly balanced.
Si (skewness = -0.72)
Si shows mild negative skewness. Histogram shows longer left tail with few smaller values.
Mg (skewness = -1.14)
Mg has moderate negative skewness. Most values are around 3–4, but some small values create left tail.

Overall Conclusion

Several predictors (especially K, Ba and Fe) show very strong right skewness. Fe and Ca are also noticeably skewed. Mg and Si show negative skewness. Such strong skewness may affect some classification methods that assume normal distribution, so transformation like log or Box-Cox may be considered for highly skewed variables.

Correlation matrix

correlationsGlass <- cor(GlassX)
corrplot(correlationsGlass, 
         method = "color", 
         tl.cex = 0.9
         )

From correlations plot I see that Ca and RI are likely most correlated variables. Rest of variables have mild or less correlations.

strongCorr <- correlationsGlass |>
  as.data.frame() |>
  rownames_to_column("Pred1") |>
  pivot_longer(-Pred1, names_to = "Pred2", values_to = "Correlation") |>
  filter(Pred1 < Pred2) |>
  filter(abs(Correlation) > 0.8) |>
  arrange(desc(abs(Correlation)))

strongCorr

As we see from analysis above correlation coefficient for Ca and RI is 0.81 and is moderate.

ggplot(GlassX, aes(Ca, RI)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE)

As we see scatterplot of Ca and RI also show linear correlation. I would consider applying PCA to these variables or removal one of these features for modelling purposes.

(b) Outliers and net-zero-variance

Near-zero variance

nearZeroVar(GlassX, saveMetrics = TRUE)

The near-zero variance analysis show that none of predictors have zero or near-zero variances. We can see all near-zero variance (nzv) are flagged FALSE.
However, Ba (freqRatio = 88) and Fe (freqRatio = 20.57) show very large frequency ratios, meaning that one value (zero values) appears much more frequently than others. This is consistent with histogram results, where many values are equal to zero.
Even frequency ratio of Ba and Fe exceed the freqRatio cutoff (>19) each have percentUnique above 10%, so they are not classified as near-zero variance.

IQR-based outliers

Glass_long_out <- Glass_long |>
  group_by(Variable) |>
  mutate(
    Q1 = quantile(Value, 0.25, na.rm = TRUE),
    Q3 = quantile(Value, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    lower_thres = Q1 - 1.5 * IQR,
    upper_thres = Q3 + 1.5 * IQR,
    Outlier = (Value < lower_thres) | (Value > upper_thres)
  ) |>
  ungroup()

outlier_summary <- Glass_long_out |>
  group_by(Variable) |>
  summarise(
    n_outliers = sum(Outlier, na.rm = TRUE),
    outliers_percentage = round(mean(Outlier, na.rm = TRUE) * 100, 2)
  ) |>
  arrange(desc(n_outliers))

outlier_summary

Using 1.5×IQR rule, Ba has the highest number of outliers (17.8%), followed by Ca (12.1%). This matches what we saw in histograms, where both variables show long right tails and strong skewness.
Al and RI have moderate amount of outliers, while Fe and Si have smaller but still noticeable percentage. Mg has no detected outliers, which means its distribution is more compact and stable.

It is interesting that K has very high skewness (6.46), but only 3.3% of observations are flagged as outliers. This shows that skewness and outliers are not the same thing:

skewness measures asymmetry of distribution, while IQR method detects extreme values relative to quartiles.

For this chemical data, large number of outliers in Ba and Ca likely represent heavy tail behavior and not measurement errors

ggplot(Glass_long_out, aes(Variable, Value)) +
  geom_boxplot(outlier.shape = NA) +
  facet_wrap(~Variable, scales = "free") +
  geom_jitter(
    data = Glass_long_out |> filter(Outlier == TRUE),
    aes(colour = Outlier),
    width = 0.001,
    size = 0.7,
    alpha = 0.7
  ) +
  scale_color_manual(values = c(`TRUE` = "red")) +
  theme(legend.position = "none")

Boxplots confirm the outlier results from IQR analysis.

Ba clearly shows the largest number of extreme values, with many red points above zero, which matches its high skewness and heavy right tail. Also boxplot confirms that majority of Ba values are zeroes.
Ca also has several high outliers, especially strong positive right skewness.
Al and RI show moderate number of extreme observations, while Fe and Si have smaller but visible outliers.
K has very high skewness, but fewer IQR outliers compared to Ba, which means distribution is asymmetric but most values are still concentrated near lower range.
Mg does not show any outliers.

Overall, the boxplots support previous findings that Ba, Fe and Ca have heavy tail behavior, and these extreme values likely represent real variation in glass composition rather than measurement errors.

Outlier in variable K

max(Glass$K)

## [1] 6.21

Glass[which.max(Glass$K),]

boxplot(K ~ Type, data = Glass,
        col = "#008080",
        main = "K by Glass Type")

The boxplot of K by glass type shows clear differences between classes.
Types 1, 2 and 3 have similar distributions with moderate values and small spread.
Type 5 stands out with much larger variability and extreme high values, including one extreme observation above 6. It appears abnormal and requires further investigation.
Type 6 has almost all values equal to zero.
Type 7 shows several high outliers.
The strong overall skewness of K is mainly driven by extreme values in types 5 and 7.
I would investigate if this is error occured at data entry or if it is normal during chemical processes.

Zero-heavy predictors Ba and Fe

mean(GlassX$Ba == 0)

## [1] 0.8224299

mean(GlassX$Fe == 0)

## [1] 0.6728972

prop.table(table(type, GlassX$Ba == 0), margin = 1)

##     
## type      FALSE       TRUE
##    1 0.04285714 0.95714286
##    2 0.07894737 0.92105263
##    3 0.05882353 0.94117647
##    5 0.15384615 0.84615385
##    6 0.00000000 1.00000000
##    7 0.89655172 0.10344828

The proportion analysis shows that Ba equals zero for majority of observations in almost all glass types. For types 1–6, the range 84% - 96% of samples have Ba = 0, and for type 6 it is 100%.
However, type 7 shows something completely different - there are almost 90% of observations have Ba larger than zero.
This suggests that Ba is strongly related to class label and may help distinguish type 7 from other glass types. Even though Ba is highly skewed and zero-heavy, it may contain important classification information.
Furthermore, because of these findings and small variance of Ba I wouldn’t apply PCA to these variable since PCA gives more weight to features with large variances when constructing components, and variables with lower variances may be pushed into lower components.
Also Log(Ba +1) transformation is not appropriate to apply. The reason is because log transformation can compress separation between zero and positive values. This can lead to reduction of its help to separate type 7 in the model.

Overall, I wouldn’t remove and transform this variable.

Same analysis can be applied to Fe variable.

prop.table(table(type, GlassX$Fe == 0), margin = 1)

##     
## type     FALSE      TRUE
##    1 0.3571429 0.6428571
##    2 0.4210526 0.5789474
##    3 0.2941176 0.7058824
##    5 0.1538462 0.8461538
##    6 0.0000000 1.0000000
##    7 0.2068966 0.7931034

The proportion table for Fe shows that majority of observations have Fe equal to zero across almost all glass types.
For types 1–3, around 58%–71% of samples have Fe = 0. For type 5 it is about 85%, and for type 6 it is 100%. Type 7 also has high proportion of zeros (around 79%).
Compared to Ba, Fe does not show clear separation between one specific type class and others. Even though Fe is skewed and zero-heavy, it does not appear to strongly distinguish one glass type from the rest based only on zero versus non-zero values.

Overall, at this point i would leave this feature as is and make a decision at modelling stage.

(c) Transformations that might help classification

Box-Cox

boxcox_plot <- function(x, var_name, bins = 25) {
  
  bc_model <- BoxCoxTrans(x)
  lambda <- round(bc_model$lambda, 3)
  
  x_bxcx <- predict(bc_model, x)

  par(mfrow = c(1,2))
  
  hist(x,
       breaks = bins,
       main = paste(var_name, "(original)"),
       col = "#008080")
  
  hist(x_bxcx,
       breaks = bins,
       main = paste(var_name, "(Box-Cox lambda =", lambda, ")"),
       col = "#008080")
  
  par(mfrow = c(1,1))

  invisible(list(model = bc_model, transformed = x_bxcx))
}

boxcox_plot(Glass$Ca, 'Ca')

The Box-Cox transformation (lambda = -1.1) makes Ca distribution more symmetric compared to the original. In the original histogram, Ca shows noticeable right skew with few larger values in the range of 12 - 16.
After transformation, the right tail is compressed and the distribution looks more balanced around the center. The transformation reduces skewness and stabilizes spread of values.
However, since Ca was not extremely skewed (skewness = 2.02), so the improvement is moderate.

boxcox_plot(Glass$Al, 'Al')

Boxcox transformation of Al improved distibution.
It is more symmetric and has reduced right skewness.
However, since the original skewness was not very strong, the overall improvement is moderate.

boxcox_plot(Glass$RI, 'RI')

For Ri, Box-Cox transformation estimated lambda = -2, which is a inverse transformation. The original RI distribution was already fairly concentrated with only moderate right skewness.
After transformation, the distribution looks compressed and less normal.
Since RI was not extremely skewed, applying transformation does not appear necessary and may reduce interpretability without significant benefit.

(c.3) Center and scale

Al_bc <- BoxCoxTrans(GlassX$Al)
Ca_bc <- BoxCoxTrans(GlassX$Ca)
K_bc <- BoxCoxTrans(GlassX$K)
Glass_prep <- GlassX |>
  mutate(
    Al = predict(Al_bc, Al),
    Ca = predict(Ca_bc, Ca),
    K = log(K +0.1)
  )

scale_vars <- setdiff(names(Glass_prep), c("Ba", "Fe"))

preProc <- preProcess(Glass_prep[, scale_vars], 
                      method = c("center", "scale"))

Glass_final <- Glass_prep
Glass_final[, scale_vars] <- predict(preProc, Glass_prep[, scale_vars])

# Check result
summary(Glass_final)

##        RI                Na                Mg                Al          
##  Min.   :-2.3759   Min.   :-3.2793   Min.   :-1.8611   Min.   :-3.12421  
##  1st Qu.:-0.6068   1st Qu.:-0.6127   1st Qu.:-0.3948   1st Qu.:-0.45169  
##  Median :-0.2257   Median :-0.1321   Median : 0.5515   Median :-0.08726  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.2608   3rd Qu.: 0.5108   3rd Qu.: 0.6347   3rd Qu.: 0.44750  
##  Max.   : 5.1252   Max.   : 4.8642   Max.   : 1.2517   Max.   : 3.32208  
##        Si                K                 Ca                Ba       
##  Min.   :-3.6679   Min.   :-1.7924   Min.   :-4.6168   Min.   :0.000  
##  1st Qu.:-0.4789   1st Qu.:-0.8188   1st Qu.:-0.4676   1st Qu.:0.000  
##  Median : 0.1795   Median : 0.4961   Median :-0.1401   Median :0.000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   :0.175  
##  3rd Qu.: 0.5636   3rd Qu.: 0.5944   3rd Qu.: 0.3253   3rd Qu.:0.000  
##  Max.   : 3.5622   Max.   : 3.2545   Max.   : 3.2694   Max.   :3.150  
##        Fe         
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.05701  
##  3rd Qu.:0.10000  
##  Max.   :0.51000

stats_before <- data.frame(
  Variable   = names(GlassX),
  Min_Before = sapply(GlassX, min),
  Max_Before = sapply(GlassX, max),
  Mean_Before= sapply(GlassX, mean),
  SD_Before  = sapply(GlassX, sd)
)

stats_after <- data.frame(
  Variable   = names(Glass_final),
  Min_After  = sapply(Glass_final, min),
  Max_After  = sapply(Glass_final, max),
  Mean_After = sapply(Glass_final, mean),
  SD_After   = sapply(Glass_final, sd)
)

stats_compare <- merge(stats_before, stats_after, by = "Variable")

kable(stats_compare, digits = 3, booktabs = TRUE) |>
  kable_styling(latex_options = c("scale_down"), font_size = 8)

Variable	Min_Before	Max_Before	Mean_Before	SD_Before	Min_After	Max_After	Mean_After	SD_After
Al	0.290	3.500	1.445	0.499	-3.124	3.322	0.000	1.000
Ba	0.000	3.150	0.175	0.497	0.000	3.150	0.175	0.497
Ca	5.430	16.190	8.957	1.423	-4.617	3.269	0.000	1.000
Fe	0.000	0.510	0.057	0.097	0.000	0.510	0.057	0.097
K	0.000	6.210	0.497	0.652	-1.792	3.255	0.000	1.000
Mg	0.000	4.490	2.685	1.442	-1.861	1.252	0.000	1.000
Na	10.730	17.380	13.408	0.817	-3.279	4.864	0.000	1.000
RI	1.511	1.534	1.518	0.003	-2.376	5.125	0.000	1.000
Si	69.810	75.410	72.651	0.775	-3.668	3.562	0.000	1.000

Several preprocessing steps were applied to improve the data. Box-Cox transformation was applied to Al and Ca because their distributions were right skewed. After transformation, both variables became more symmetric and their spread looks more stable.
For K, I applied log(K + 0.1) transformation since it had very strong positive skewness and extreme large outlier. This transformation reduced the right tail and compressed extreme observations, but still kept relative differences between values.

After that, centering and scaling were applied to all predictors except Ba and Fe. These two variables were kept in original scale because they are zero-heavy and Ba contain important class information. Scaling helped to make variables comparable since they were originally measured on very different scales.
As I mentioned earlier I wouldn’t apply PCA transformation on all predictors at this point because:

It would hide Ba into lower components, reducing it’s dominance in distinguishing of type 7
There are only 9 predictors and 214 observations, therefore dimensionality reduction probably will not help (actually likely will hurt).

Overall, the transformations improved stability of the data without removing important structural patterns.

3.2. The soybean data can also be found at the UC Irvine Machine Learning

Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

data("Soybean")
#Soybean
Class <- Soybean$Class
Predictors <- Soybean |> select(-Class)
#Predictors

Let’s check for near-zero variance predictors

nzv <- nearZeroVar(Predictors, saveMetrics = TRUE)

flagged <- rownames(nzv[nzv$nzv == TRUE, ])
flagged

## [1] "leaf.mild" "mycelium"  "sclerotia"

The nearZeroVar() function identified following predictors as near-zero variance:

leaf.mild
mycelium
sclerotia

This means that marginally (overall distribution), these variables are dominated by one level and have very little variation across the dataset.

But before removing them I want to check how these predictors are dominant within each class

for (v in flagged) {

  cat("\n***********************************************\n")
  cat("Predictor:", v, "\n")

  tab <- table(Predictors[[v]], Class, useNA = "ifany")
  prop_by_class <- prop.table(tab, margin = 2)

  dominant <- apply(prop_by_class, 2,
                    function(col) rownames(prop_by_class)[which.max(col)])

  print(data.frame(
    Class = names(dominant),
    Dominant_Level = dominant,
    row.names = NULL
  ))
}

## 
## ***********************************************
## Predictor: leaf.mild 
##                          Class Dominant_Level
## 1                 2-4-d-injury           <NA>
## 2          alternarialeaf-spot              0
## 3                  anthracnose              0
## 4             bacterial-blight              0
## 5            bacterial-pustule              0
## 6                   brown-spot              0
## 7               brown-stem-rot              0
## 8                 charcoal-rot              0
## 9                cyst-nematode           <NA>
## 10 diaporthe-pod-&-stem-blight           <NA>
## 11       diaporthe-stem-canker              0
## 12                downy-mildew              2
## 13          frog-eye-leaf-spot              0
## 14            herbicide-injury           <NA>
## 15      phyllosticta-leaf-spot              0
## 16            phytophthora-rot           <NA>
## 17              powdery-mildew              1
## 18           purple-seed-stain              0
## 19        rhizoctonia-root-rot              0
## 
## ***********************************************
## Predictor: mycelium 
##                          Class Dominant_Level
## 1                 2-4-d-injury           <NA>
## 2          alternarialeaf-spot              0
## 3                  anthracnose              0
## 4             bacterial-blight              0
## 5            bacterial-pustule              0
## 6                   brown-spot              0
## 7               brown-stem-rot              0
## 8                 charcoal-rot              0
## 9                cyst-nematode           <NA>
## 10 diaporthe-pod-&-stem-blight              0
## 11       diaporthe-stem-canker              0
## 12                downy-mildew              0
## 13          frog-eye-leaf-spot              0
## 14            herbicide-injury           <NA>
## 15      phyllosticta-leaf-spot              0
## 16            phytophthora-rot              0
## 17              powdery-mildew              0
## 18           purple-seed-stain              0
## 19        rhizoctonia-root-rot              0
## 
## ***********************************************
## Predictor: sclerotia 
##                          Class Dominant_Level
## 1                 2-4-d-injury           <NA>
## 2          alternarialeaf-spot              0
## 3                  anthracnose              0
## 4             bacterial-blight              0
## 5            bacterial-pustule              0
## 6                   brown-spot              0
## 7               brown-stem-rot              0
## 8                 charcoal-rot              1
## 9                cyst-nematode           <NA>
## 10 diaporthe-pod-&-stem-blight              0
## 11       diaporthe-stem-canker              0
## 12                downy-mildew              0
## 13          frog-eye-leaf-spot              0
## 14            herbicide-injury           <NA>
## 15      phyllosticta-leaf-spot              0
## 16            phytophthora-rot              0
## 17              powdery-mildew              0
## 18           purple-seed-stain              0
## 19        rhizoctonia-root-rot              0

After examining the dominant level within each disease class, we can see that the situation is more complex.

leaf.mild - the dominant level differs across several classes. Some diseases are mostly <NA>, while others are mostly 0, and for example downy-mildew is dominated by level 2 and powdery-mildew by level 1. This shows that even though the variable looks as near-zero variance, it behaves different depending on the class. So looks like it contains meaningful class information.
mycelium - most classes are dominated by level 0, and only few classes show <NA> as dominant. This variable appears more flat across classes compared to leaf.mild.
sclerotia - most classes again have dominant level 0, but charcoal-rot clearly has dominant level 1. This suggests that this predictor may help distinguish that particular disease from others. So even if it is near-zero variance marginally, it still has some conditional structure.

Overall, although these three predictors are indicated as near-zero variance based on marginal frequency distributions, leaf.mildand sclerotia show differences across disease classes when we look conditionally. Therefore, they should not be automatically removed only because of near-zero variance screening. At the same time myceliumappears to be a strong candidate to cut.

b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

missing_data_perc <- sapply(Soybean, 
                            function(x) sum(is.na(x))/length(x) * 100)
sort(missing_data_perc[missing_data_perc>0], decreasing = TRUE)

##            hail           sever        seed.tmt         lodging            germ 
##      17.7159590      17.7159590      17.7159590      17.7159590      16.3982430 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##      15.8125915      15.5197657      15.5197657      15.5197657      15.5197657 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##      14.6412884      13.4699854      13.4699854      13.4699854      12.2986823 
##       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
##      12.2986823      12.2986823      12.2986823      12.2986823       5.5636896 
##    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
##       5.5636896       5.5636896       5.5636896       5.5636896       5.5636896 
##       sclerotia     plant.stand           roots            temp       crop.hist 
##       5.5636896       5.2708638       4.5387994       4.3923865       2.3426061 
##    plant.growth            stem            date        area.dam 
##       2.3426061       2.3426061       0.1464129       0.1464129

As we see certain predictors are more likely to be missing. The variables hail, sever, seed.tmt, lodging, germand leaf.mild have the highest missing proportions (around 16–18%), while some predictors such as date and area.dam have almost no missing values. This indicates that missingness is not evenly distributed across predictors and is specific for each predictor.

table(Soybean$Class, is.na(Soybean$hail))

##                              
##                               FALSE TRUE
##   2-4-d-injury                    0   16
##   alternarialeaf-spot            91    0
##   anthracnose                    44    0
##   bacterial-blight               20    0
##   bacterial-pustule              20    0
##   brown-spot                     92    0
##   brown-stem-rot                 44    0
##   charcoal-rot                   20    0
##   cyst-nematode                   0   14
##   diaporthe-pod-&-stem-blight     0   15
##   diaporthe-stem-canker          20    0
##   downy-mildew                   20    0
##   frog-eye-leaf-spot             91    0
##   herbicide-injury                0    8
##   phyllosticta-leaf-spot         20    0
##   phytophthora-rot               20   68
##   powdery-mildew                 20    0
##   purple-seed-stain              20    0
##   rhizoctonia-root-rot           20    0

For the predictor hail, missingness is clearly class-dependent. Several disease classes such as 2-4-d-injury, cyst-nematode, and diaporthe-pod-&-stem-blight have 100% missing values, while most other classes have no missing at all. This indicates that missingness is not random and is strongly related to disease type.
It also sounds logical biologically, since hail can cause physical damage to soybean plants and may be relevant only for certain diseases. Therefore, hail appears to have informative missingness, meaning that the presence of NA itself may contain predictive information.

missing_prop <- prop.table(table(Soybean$Class, is.na(Soybean$hail)), margin = 1)

# Multiply by 100 and round for easier reading
round(missing_prop * 100, 1)

##                              
##                               FALSE  TRUE
##   2-4-d-injury                  0.0 100.0
##   alternarialeaf-spot         100.0   0.0
##   anthracnose                 100.0   0.0
##   bacterial-blight            100.0   0.0
##   bacterial-pustule           100.0   0.0
##   brown-spot                  100.0   0.0
##   brown-stem-rot              100.0   0.0
##   charcoal-rot                100.0   0.0
##   cyst-nematode                 0.0 100.0
##   diaporthe-pod-&-stem-blight   0.0 100.0
##   diaporthe-stem-canker       100.0   0.0
##   downy-mildew                100.0   0.0
##   frog-eye-leaf-spot          100.0   0.0
##   herbicide-injury              0.0 100.0
##   phyllosticta-leaf-spot      100.0   0.0
##   phytophthora-rot             22.7  77.3
##   powdery-mildew              100.0   0.0
##   purple-seed-stain           100.0   0.0
##   rhizoctonia-root-rot        100.0   0.0

miss_by_class_summary <- function(df, outcome = "Class") {
  Class <- df[[outcome]]
  preds <- setdiff(names(df), outcome)

  out <- lapply(preds, function(p) {
    miss_rate_by_class <- tapply(is.na(df[[p]]), Class, mean)
    max_rate <- max(miss_rate_by_class)
    max_class <- names(which.max(miss_rate_by_class))
    data.frame(
      predictor = p,
      max_missing_rate = max_rate,
      class_with_max_missing = max_class,
      stringsAsFactors = FALSE
    )
  })

  out <- do.call(rbind, out)
  out <- out[out$max_missing_rate > 0, ]          
  out[order(-out$max_missing_rate), ]
}

miss_summary <- miss_by_class_summary(Soybean, outcome = "Class")
miss_summary$max_missing_percent <- 
  round(miss_summary$max_missing_rate * 100, 1)

miss_summary[, c("predictor", 
                 "max_missing_percent", 
                 "class_with_max_missing")]

The results show that many predictors have 100% missing values within specific classes. For example, a large number of variables are completely missing for class 2-4-d-injury. Also, several leaf-related variables are fully missing for cyst-nematode. It appears that missingness can be class-related and is not random.

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I think looking at variables one by one is probably the most reasonable strategy here. Chapter 3 clearly says that before fixing or removing missing values, we need to understand why they are missing in the first place. If we blindly impute everything, we may destroy useful structure in the data. For example, if hail was not recorded for some class and is mostly NA in that particular outcome class, and we apply something like KNN imputation, we could lose that structured missingness and replace it with artificial values.

Based on the analysis, and especially since most predictors are categorical, it makes more sense to apply different approaches instead of one single rule for all predictors.

The most important step would be to investigate why the data are missing. If possible, I would try to trace back the data collection process and collect the missing values. Sometimes missingness is caused by recording errors or incomplete measurement. If the original data cannot be recovered, then we need to decide whether to remove the predictor or apply imputation.

1. Variables with informative missingness

Some predictors (like hail and several others) are 100% missing for certain classes. That is clearly not random. It looks structural and class-related. In this case, NA itself contains information. So replacing missing values with the mode would be wrong. A better idea is either to treat NA as its own category (“Missing”) or to use models like tree-based methods that can naturally handle missing data.

2. Variables with large but not informative missingness

If a predictor has many missing values but they are not clearly related to class, then it may not be very useful. Keeping such a variable may only introduce noise. In that case, removing it might be more stable than imputing many artificial values.

3. Variables with small amount of missing

If only a few values are missing and there is no class dependency, then simple imputation is reasonable. For categorical variables, replacing with the most frequent level may be enough. More complex methods like KNN imputation are possible, but for this dataset size it may be unnecessary and could even introduce instability.