Data Pre-processing

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

a.

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

The exercise indicates we can access this data by running library(mlbench), data(Glass), and str(Glass).

In addition to loading the data, we load all the other libraries needed for the entire lab.

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.2
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.2
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(moments)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
library(e1071) 
## 
## Attaching package: 'e1071'
## The following objects are masked from 'package:moments':
## 
##     kurtosis, moment, skewness

Load and check data structure

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Currently the variable Type is coded as numbers; I would like to replace these with more descriptive labels based on the documentation of the dataset.

Glass$Type <- factor(Glass$Type, levels = c(1,2,3,4,5,6,7),
                     labels = c("building_windows_float_processed",
                                "building_windows_non_float_processed",
                                "vehicle_windows_float_processed",
                                "vehicle_windows_non_float_processed",  # Note: no observations
                                "containers",
                                "tableware",
                                "headlamps"))

Now let’s reshape the dataset to long format and plot a histogram of each predictor.

Glass_long <- Glass %>%
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value")

histogram_plots <- Glass_long %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue3", color = "black", alpha = 0.7) +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal(base_size = 14) +
  labs(title = "Histograms of Predictors",
       subtitle = "Showing distribution of each predictor variable",
       x = "Value", y = "Count")
print(histogram_plots)

Fe (Iron), Ba (Barium), and K (Potassium) have long right tails, indicating a small number of extreme high values, while Mg (Magnesium) has a long left tail driven by a cluster of zero values. Ca (Calcium) has a moderate right skew. Na (Sodium) and RI (Refractive Index) show multiple peaks, suggesting that different glass types have different compositions. Finally, Al (Aluminum) and Si (Silicon) have more symmetrical distributions.

boxplot_plot <- Glass %>%
   dplyr::select(where(is.numeric)) %>%  # important! select only numeric columns
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(y = Value)) + 
  geom_boxplot(fill = "skyblue2", color = "black", alpha = 0.7) + 
  facet_wrap(~ Predictor, scales = 'free', ncol = 3) +  # this ensures correct scaling
  theme_minimal(base_size = 10) +
  labs(
    title = "Boxplots of Numerical Predictors",
    subtitle = "Showing the spread and potential outliers for each variable",
    y = "Value"
  ) +
  theme(strip.text = element_text(size = 14, face = "bold"))  # improves facet labels
print(boxplot_plot)

We can see that there are outliers present, and several predictors have long upper whiskers, confirming the positive skew seen in the histograms.

- Ba (Barium) has extreme values far from the interquartile range (IQR).
- K (Potassium) and Fe (Iron) also have many outliers, with values significantly higher than the typical range.
- Ca (Calcium) has a tighter spread, but still has some extreme values.
- Mg (Magnesium) shows two clusters, indicating possible differences between glass types.
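To back up this visual impression with numbers, we can count how many observations fall outside the conventional 1.5 x IQR fences for each predictor. This is a quick exploratory sketch, not part of the original exercise:

# Count observations beyond the 1.5*IQR fences for each numeric predictor
outlier_counts <- Glass %>%
  dplyr::select(where(is.numeric)) %>%
  summarise(across(everything(), function(x) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    fence <- 1.5 * (q[2] - q[1])
    sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
  }))
print(outlier_counts)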

# Compute correlation matrix (only for numeric variables)
cor_matrix <- cor(Glass %>%  dplyr::select(where(is.numeric)))

# Create the correlation plot
corrplot(cor_matrix, method = "color", type = "lower", 
         tl.col = "black", tl.srt = 45,  
         addCoef.col = "black",  # to add correlation coefficients
         number.cex = 0.8,  # adjust number size
         col = colorRampPalette(c("blue", "white", "red"))(200)) 

This graph clearly illustrates several notable correlations:

- Calcium and Refractive Index (0.81): possibly indicating that the refractive index is heavily influenced by calcium content.
- Silicon and Refractive Index (-0.54): a moderate negative correlation, meaning that as silicon increases, the refractive index tends to decrease.
- Mg vs Al (-0.48) and Ba vs Mg (-0.49): suggesting that certain glass compositions favor one element over another.
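If we later wanted to reduce this collinearity before modeling, findCorrelation() from the already loaded caret package can suggest which predictors to drop. A minimal sketch, with 0.75 as an illustrative cutoff:

# Flag predictors whose pairwise correlations exceed the (illustrative) 0.75 cutoff
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75)
colnames(cor_matrix)[high_cor]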

type_distribution_plot <- Glass %>%
  count(Type) %>%  
  ggplot(aes(x = reorder(Type, -n), y = n, fill = Type)) +  # to order by frequency
  geom_bar(stat = "identity", alpha = 0.8, color = "black") +
  theme_minimal(base_size = 14) +
  labs(title = "Distribution of Glass Types",
    subtitle = "Number of samples for each category",
    x = "Glass Type",
    y = "Count") +
  scale_fill_brewer(palette = "Set3") +  # color-friendly palette
  theme(
    legend.position = "none",  # x-axis already labels types
    axis.text.x = element_text(angle = 30, hjust = 1, size = 10),  # Rotate for readability
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 14, face = "italic")
  )
print(type_distribution_plot)

The most common glass types in the dataset are building windows, both float and non-float processed, whereas containers and tableware have the fewest samples. The problem with this distribution is that a classifier may struggle to predict rare categories like tableware due to the small number of examples, which could hurt classification performance.
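To quantify this imbalance, we can also look at the class proportions directly:

# Relative frequency (in %) of each glass type
round(prop.table(table(Glass$Type)) * 100, 1)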

b.

Do there appear to be any outliers in the data? Are any predictors skewed?

skewness_values <- Glass %>%
  dplyr::select(where(is.numeric)) %>%
  summarise(across(everything(), \(x) skewness(x, na.rm = TRUE)))
print(skewness_values)
##         RI        Na        Mg        Al         Si        K       Ca      Ba
## 1 1.602715 0.4478343 -1.136452 0.8946104 -0.7202392 6.460089 2.018446 3.36868
##         Fe
## 1 1.729811

As observed in the previous graphs, the dataset contains several outliers. Ba, Fe, K, and Mg show extreme values far from the interquartile range, confirming the skew in their distributions. K (6.46), Ba (3.37), Ca (2.02), and Fe (1.73) are highly right-skewed, suggesting the presence of a few extreme high values, while Mg (-1.14) is negatively skewed, meaning most values are high with a few smaller ones. Na and the refractive index show multimodal distributions, suggesting variation across glass compositions, whereas Al and Si appear more symmetrically distributed. The correlation plot reveals strong relationships, notably between Ca and the refractive index (0.81) and Si and the refractive index (-0.54), possibly indicating how these elements influence refractive properties. The glass type distribution shows a class imbalance: building windows dominate while containers and tableware are underrepresented, potentially affecting model performance.

c.

Are there any relevant transformations of one or more predictors that might improve the classification model?

Based on the skewness values, we can selectively transform the highly skewed predictors. Now that we know which variables actually need transformation, we apply Box-Cox (or log(x+1) for special cases like Ba and Fe). Box-Cox requires strictly positive values and struggles when a variable contains many zeros: in that case it can distort the data or produce misleading transformations, which is why Ba and Fe are handled separately.

skewness_values <- Glass %>%
  summarise(across(where(is.numeric), \(x) skewness(x, na.rm = TRUE)))

# Based on the skewness values, identify predictors that need Box-Cox (|skewness| > 1.5), excluding Ba and Fe because they are mostly zeros
skewed_predictors <- names(skewness_values)[abs(skewness_values[1,]) > 1.5]
skewed_predictors <- setdiff(skewed_predictors, c("Ba", "Fe"))  # Remove Ba and Fe

# Apply Box-Cox only to these variables
Glass_transformed <- Glass
for (col in skewed_predictors) {
  # Shift values so they are strictly positive, as required by Box-Cox
  shift_value <- abs(min(Glass_transformed[[col]], na.rm = TRUE)) + 1e-4
  response <- Glass_transformed[[col]] + shift_value
  # Estimate lambda with caret's BoxCoxTrans and apply the transformation
  bc <- BoxCoxTrans(response)
  Glass_transformed[[col]] <- predict(bc, response)
}

# Apply log(x+1) transformation to Ba and Fe
Glass_transformed$Ba <- log1p(Glass$Ba)
Glass_transformed$Fe <- log1p(Glass$Fe)

summary(Glass_transformed)
##        RI               Na              Mg              Al       
##  Min.   :0.4453   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:0.4455   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :0.4455   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :0.4455   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:0.4456   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :0.4461   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##                                                                  
##        Si              K                 Ca               Ba        
##  Min.   :69.81   Min.   :-2.4372   Min.   :0.4958   Min.   :0.0000  
##  1st Qu.:72.28   1st Qu.:-1.4204   1st Qu.:0.4973   1st Qu.:0.0000  
##  Median :72.79   Median :-0.5245   Median :0.4975   Median :0.0000  
##  Mean   :72.65   Mean   :-0.8889   Mean   :0.4975   Mean   :0.1094  
##  3rd Qu.:73.09   3rd Qu.:-0.4484   3rd Qu.:0.4977   3rd Qu.:0.0000  
##  Max.   :75.41   Max.   : 2.6901   Max.   :0.4989   Max.   :1.4231  
##                                                                     
##        Fe                                            Type   
##  Min.   :0.00000   building_windows_float_processed    :70  
##  1st Qu.:0.00000   building_windows_non_float_processed:76  
##  Median :0.00000   vehicle_windows_float_processed     :17  
##  Mean   :0.05159   vehicle_windows_non_float_processed : 0  
##  3rd Qu.:0.09531   containers                          :13  
##  Max.   :0.41211   tableware                           : 9  
##                    headlamps                           :29

To more easily compare the original skewness values with the transformed ones:

original_skewness <- Glass %>%
  dplyr::select(where(is.numeric)) %>%
  summarise(across(everything(), \(x) moments::skewness(x, na.rm = TRUE)))

transformed_skewness <- Glass_transformed %>%
  dplyr::select(where(is.numeric)) %>%
  summarise(across(everything(), \(x) moments::skewness(x, na.rm = TRUE)))

skewness_comparison <- bind_rows(original_skewness, transformed_skewness) %>%
  mutate(Status = c("Original", "Transformed")) %>%
  relocate(Status)

print(skewness_comparison)
##        Status       RI        Na        Mg        Al         Si           K
## 1    Original 1.614015 0.4509917 -1.144465 0.9009179 -0.7253173 6.505635834
## 2 Transformed 1.595250 0.4509917 -1.144465 0.9009179 -0.7253173 0.009603982
##          Ca       Ba       Fe
## 1 2.0326774 3.392431 1.742007
## 2 0.1099598 2.707521 1.575222

After the transformations, skewness was successfully reduced in some variables while others remained skewed. K (6.51 to 0.01) and Ca (2.03 to 0.11) are now nearly symmetric, showing that Box-Cox worked well. However, Ba (3.39 to 2.71) and Fe (1.74 to 1.58) remain skewed despite the log(x+1) transformation, likely because of the large number of zeros in these variables, which makes full normalization difficult. RI (1.61 to 1.60), Na, Mg, Al, and Si remained essentially unchanged, indicating they were correctly left untransformed. While Box-Cox effectively normalized K and Ca, Ba and Fe might be better represented as binary (presence/absence) variables, as sketched below.
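As a possible follow-up, not applied in the remainder of this lab, Ba and Fe could be recoded as presence/absence indicators. A minimal sketch (Glass_binary, Ba_present, and Fe_present are hypothetical names):

# Hypothetical recoding: 1 if the element is present at all, 0 otherwise
Glass_binary <- Glass_transformed %>%
  mutate(Ba_present = factor(as.integer(Glass$Ba > 0)),
         Fe_present = factor(as.integer(Glass$Fe > 0)))
table(Ba = Glass_binary$Ba_present, Fe = Glass_binary$Fe_present)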

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

a.

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Again, we start by loading the dataset and checking the structure:

data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Next, we examine the frequency distributions of the categorical predictors. Since most predictors are factors, we plot the distribution of each one with a bar chart:

columns <- colnames(Soybean)

invisible(lapply(columns, function(col) {
  print(
    ggplot(Soybean, aes(x = .data[[col]])) +
      geom_bar(fill = "skyblue3", color = "black") +  
      coord_flip() +
      ggtitle(col) +
      theme_minimal()
  )
}))

(One bar chart is rendered for each of the 36 variables.)

That produces too many plots to scan comfortably, so let's try a more efficient approach: compute a frequency summary for each categorical predictor, identify near-zero-variance predictors, and check for missing values.

I will show a table with:

- max_prop: the largest category proportion (closer to 100% means near-zero variance).
- most_common_category: the dominant category.
- num_categories: the total number of unique categories in the predictor.

# Identify problematic categorical variables
summarize_categorical_issues <- function(df) {
  categorical_vars <- names(Filter(is.factor, df))  
  
  summary_data <- lapply(categorical_vars, function(var) {
    counts <- table(df[[var]])
    proportions <- prop.table(counts) * 100 
    max_prop <- max(proportions)
    
    list(
      variable = var,
      most_common_category = names(which.max(counts)),
      max_prop = round(max_prop, 2),
      num_categories = length(counts)
    )
  })
  
  summary_df <- do.call(rbind, lapply(summary_data, as.data.frame))
  summary_df <- summary_df[order(-summary_df$max_prop), ]  
  return(summary_df)
}

# Run function on Soybean dataset
summary_table <- summarize_categorical_issues(Soybean)

# Print the summary table
print(summary_table)
##           variable most_common_category max_prop num_categories
## 26        mycelium                    0    99.07              2
## 28       sclerotia                    0    96.90              2
## 35      shriveling                    0    93.41              2
## 19       leaf.mild                    0    93.04              3
## 21         lodging                    0    92.53              2
## 18       leaf.malf                    0    92.49              2
## 27    int.discolor                    0    90.08              3
## 34       seed.size                    0    90.02              2
## 33   seed.discolor                    0    88.91              2
## 13          leaves                    1    88.73              2
## 32     mold.growth                    0    88.66              2
## 36           roots                    0    84.51              3
## 17     leaf.shread                    0    83.53              2
## 24 fruiting.bodies                    0    81.98              2
## 31            seed                    0    80.54              2
## 6             hail                    0    77.40              2
## 25       ext.decay                    0    77.05              3
## 4           precip                    2    71.16              3
## 29      fruit.pods                    0    67.95              4
## 12    plant.growth                    0    66.12              2
## 30     fruit.spots                    0    59.79              4
## 15       leaf.marg                    0    59.60              3
## 22    stem.cankers                    0    58.76              4
## 9            sever                    1    57.30              3
## 5             temp                    1    57.27              3
## 14       leaf.halo                    2    57.10              3
## 20            stem                    1    55.62              2
## 3      plant.stand                    0    54.71              2
## 16       leaf.size                    1    54.59              3
## 10        seed.tmt                    0    54.27              3
## 23   canker.lesion                    0    49.61              4
## 11            germ                    1    37.30              3
## 8         area.dam                    1    33.28              4
## 7        crop.hist                    2    32.83              4
## 2             date                    5    21.85              7
## 1            Class           brown-spot    13.47             19

With the table we can more easily see that the frequency distributions of the categorical predictors in the Soybean dataset reveal several degenerate distributions.

Many variables exhibit near-zero variance, meaning they are overwhelmingly dominated by a single category, providing little to no useful variation for predictive modeling.

For example, mycelium (99.07%), sclerotia (96.90%), shriveling (93.41%), and leaf.mild (93.04%) are almost entirely composed of a single category, making them essentially uninformative features.

Additionally, other variables such as lodging, leaf.malf, int.discolor, and seed.size have over 90% of their observations in one category, further limiting their value. Variables like roots, leaf.shread, and fruiting.bodies also show strong imbalances, exceeding 80% dominance in one category.

While some predictors, such as date, crop.hist, area.dam, and germ, show more balanced distributions, the target variable (Class) itself is imbalanced, with brown-spot (13.47%) being the most frequent disease.
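The same conclusion can be reached programmatically with caret's nearZeroVar(), which flags degenerate predictors using its default frequency-ratio and unique-value thresholds. A minimal sketch:

# Flag near-zero-variance predictors in the Soybean data
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]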

b.

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

We can start by looking at the total count of missing values per predictor:

Soybean %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
  ggplot(aes(y = reorder(variable, missing_count), x = missing_count, fill = missing_count > 0)) +
  geom_col() +
  labs(title = "Missing Values per Predictor",
       x = "Number of Missing Values", y = "Variable") +
  scale_fill_manual(values=c("grey", "red"), labels = c("Complete", "Missing")) +
  theme_minimal()

This first plot highlights the extent of missing data across different predictors in the dataset. Severe missingness is observed in variables like sever, seed.tmt, lodging, hail, and germ, each having over 100 missing values. This suggests that some plant condition-related predictors may be inconsistently recorded.

missingness_plot <- Soybean %>%
  group_by(Class) %>%
  mutate(class_Total = n()) %>%
  ungroup() %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(Missing = n(), class_Total = first(class_Total)) %>%
  mutate(Proportion = Missing / class_Total) %>%
  ggplot(aes(x = reorder(Class, -Proportion), y = Proportion, fill = Proportion)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_gradient(low = "skyblue", high = "red") +
  theme_minimal(base_size = 12) +
  labs(title = "Proportion of Missing Values per Class",
       x = "Disease Class",
       y = "Proportion of Missing Data") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Print the plot
print(missingness_plot)

This second plot shows, for each disease class, the proportion of rows that contain at least one missing value; it makes clear that missingness is not random but class-dependent. Notably, the 2-4-d injury, cyst nematode, diaporthe pod & stem blight, and herbicide injury classes are almost 100% incomplete, indicating these classes may have been poorly documented or contain minimal data. Phytophthora rot shows substantial missingness but retains a notable amount of complete data. This indicates that missing-data patterns could impact classification performance and should be addressed through imputation or exclusion strategies.
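To see which predictors drive the missingness within each class, we can also tabulate per-class missing counts. An exploratory sketch (missing_by_class is a hypothetical name):

# Number of missing values per predictor within each class, worst offenders first
missing_by_class <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(-Class, names_to = "Predictor", values_to = "Missing") %>%
  filter(Missing > 0) %>%
  arrange(desc(Missing))
head(missing_by_class, 10)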

c.

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Given that the missing data are both predictor-specific and class-dependent, I would suggest adopting a mixed approach that balances data retention with model accuracy.

We saw that predictors like sever, seed.tmt, lodging, hail, and germ have high proportions of missing values (>15%). If these variables carry information that is vital for classification, imputation may be the way to go; but if their contribution is minor, removing them avoids introducing noise or bias and might be more effective. A reasonable rule is to remove any predictor with more than 25% missing values.

Regarding classes, we can follow a similar approach, made obvious by classes like 2-4-d injury, cyst nematode, diaporthe pod & stem blight, and herbicide injury, which exhibit nearly 100% missingness in some predictors. If a class has too many missing values across all predictors, say more than 75% of its rows incomplete, we should consider removing it.

For the remaining cases we can impute: fill with the most common category for variables with less than 15% missing data, and perhaps use predictive imputation such as decision trees or k-nearest neighbors for variables with 15-30% missing data.
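A minimal sketch of this combined strategy, using the illustrative thresholds discussed above (drop predictors with more than 25% missing values, drop classes with more than 75% incomplete rows, then mode-impute what remains; a kNN-style imputer could replace the mode step for the 15-30% range, and the helper impute_mode is defined just for this illustration):

# 1. Drop predictors with more than 25% missing values
missing_prop <- colMeans(is.na(Soybean))
Soybean_reduced <- Soybean[, missing_prop <= 0.25]

# 2. Drop classes in which more than 75% of the rows are incomplete
incomplete_by_class <- tapply(!complete.cases(Soybean), Soybean$Class, mean)
keep_classes <- names(incomplete_by_class)[incomplete_by_class <= 0.75]
Soybean_reduced <- droplevels(Soybean_reduced[Soybean_reduced$Class %in% keep_classes, ])

# 3. Impute remaining missing values with the most common category per predictor
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
Soybean_imputed <- Soybean_reduced %>%
  mutate(across(-Class, impute_mode))

These cut-offs are judgment calls rather than fixed rules; the best choice would depend on how much each predictor and class actually contributes to the final model.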