The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

3.1a Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# Packages used for plotting and reshaping throughout
library(ggplot2)
library(reshape2)

# Histograms for each predictor
Glass_long <- reshape2::melt(Glass, id.vars = "Type")
ggplot(Glass_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "green", color = "white") +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

# Boxplots by Type
ggplot(Glass_long, aes(x = Type, y = value, fill = Type)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

# Correlation heatmap of the numeric predictors (Type removed)
library(dplyr)
library(corrplot)

Glass2 <- dplyr::select(Glass, -Type)
m <- cor(Glass2)
corrplot(m, method = "color", type = "upper", tl.col = "black", order = "hclust")

  • Distributions (Histograms):
    • Na and Si are roughly symmetric, while RI and Al show only mild to moderate skew.
    • K, Ca, Ba, and Fe are skewed with a lot of near-zero values.
    • Mg shows a bimodal distribution.
  • Class Separation (Boxplots):
    • Mg and Al show strong separation between glass types, making them highly informative predictors.
    • Ca and Si show moderate separation — they vary across classes but with more overlap compared to Mg and Al.
    • Na also shows some class-level shifts, but not as distinct as Mg or Al.
    • Ba is almost always zero but spikes in Type 7, making it useful only for that class.
    • Fe has very low variance across all classes, suggesting limited usefulness.
    • RI shows small shifts but a lot of overlap, so it may be more useful in combination with other predictors.
  • Relationships (Correlation Heatmap):
    • Mg is strongly negatively correlated with Al and Ba (glass types with higher Mg tend to have lower Al and Ba), suggesting these predictors carry overlapping information and may cause multicollinearity if all are included (a quick numeric follow-up is sketched after this list).
    • RI is positively correlated with Ca and negatively correlated with Si, suggesting overlap in the information they provide about glass composition.
    • Al shows a mild positive correlation with Ba, indicating partial overlap, though not as strong as with Mg.
    • Some predictors (such as Fe and K) show little correlation with each other. These are effectively “independent” predictors, but because their values are nearly constant (as seen in the earlier plots), they may not add much classification power.
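
A quick numeric follow-up to the heatmap (a sketch, assuming the caret package is installed): findCorrelation() flags predictors whose pairwise correlations exceed a chosen cutoff, and a sorted table shows the strongest pairs. It reuses the correlation matrix m computed above.

library(caret)

# Predictors with at least one pairwise correlation above the cutoff
findCorrelation(m, cutoff = 0.75, names = TRUE)

# Largest absolute pairwise correlations
corr_pairs <- as.data.frame(as.table(m))
corr_pairs <- subset(corr_pairs, as.character(Var1) < as.character(Var2))
head(corr_pairs[order(-abs(corr_pairs$Freq)), ], 5)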

3.1b Do there appear to be any outliers in the data? Are any predictors skewed?

# Histograms for each predictor
Glass_long <- reshape2::melt(Glass, id.vars = "Type")
ggplot(Glass_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "purple", color = "white") +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

# Boxplots by Type
ggplot(Glass_long, aes(x = Type, y = value, fill = Type)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

Skewed predictors:

  • Ba, Fe: strong skew
  • Ca, RI, K: mild skew
  • Al: moderate skew
  • Mg: bimodal
  • Na, Si: nearly symmetric

Outliers: Many of the predictors show clear outliers (a quick skewness and IQR-based check is sketched after this list).

  • K has an extreme high (~6) and other smaller highs (2–4).
  • Ca has unusually high values (>15) and some lows (<7).
  • Na shows both high (>16) and low (<12) outliers.
  • Si has extremes on both ends (>74 and <71).
  • Al has high values above 3.
  • Fe shows a few higher values (>0.3) despite mostly being near zero.
  • Ba is almost always around zero, with very few strong outliers (>2).
  • RI also has mild outliers both low (<1.515) and high (>1.530).
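
A quick numeric check of these visual impressions (a sketch, assuming the e1071 package is installed): skewness per predictor, plus a count of observations falling outside the usual 1.5 * IQR boxplot whiskers.

library(e1071)

glass_pred <- Glass[, -which(names(Glass) == "Type")]

# Skewness per predictor (large positive values = strong right skew)
sort(sapply(glass_pred, skewness), decreasing = TRUE)

# Observations beyond 1.5 * IQR from the quartiles, per predictor
outlier_count <- sapply(glass_pred, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- diff(q)
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})
sort(outlier_count, decreasing = TRUE)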

3.1c Are there any relevant transformations of one or more predictors that might improve the classification model?

# Duplicate dataset
Glass_trans <- Glass

# Variables that benefit from transformation
skewed_vars <- c("K", "Al", "Mg", "Ca")

# log transformation to skewed variables only
Glass_trans[skewed_vars] <- log1p(Glass_trans[skewed_vars])

# Keep Na, Si, RI, Ba, Fe as original (to be scaled later but not transformed)

# Scale all numeric predictors 
Glass_num <- Glass_trans[, -which(names(Glass_trans) == "Type")]
Glass_scaled <- as.data.frame(scale(Glass_num))
Glass_scaled$Type <- Glass$Type

# Compare distributions before vs after
Glass_long_orig <- melt(Glass[, -which(names(Glass) == "Type")])
## No id variables; using all as measure variables
Glass_long_orig$dataset <- "Original"

Glass_long_trans <- melt(Glass_scaled[, -which(names(Glass_scaled) == "Type")])
## No id variables; using all as measure variables
Glass_long_trans$dataset <- "Transformed"

Glass_long <- rbind(Glass_long_orig, Glass_long_trans)

ggplot(Glass_long, aes(x = value, fill = dataset)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  facet_wrap(~variable, scales = "free") +
  theme_minimal() +
  labs(title = "Selective Transformations on Glass Predictors")

Some predictors (K, Al, Ca) were noticeably skewed and Mg was bimodal with many near-zero values, so I applied a log1p transformation to these four to reduce skewness and improve symmetry. Predictors that were already fairly symmetric (Na, Si, RI) were only centered and scaled. The zero-inflated variables (Ba, Fe) were left untransformed, because standard transformations such as log or Box-Cox do not resolve the heavy concentration of zeros.

These selective transformations improve classification by reducing skewness and ensuring all predictors contribute on comparable ranges.

An alternative would be the Box-Cox family of transformations, which automatically estimates the best power transformation for each variable; a sketch using caret::preProcess is shown below. Here I chose the targeted log approach to address the skewed predictors directly without over-processing the well-behaved ones.
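
A minimal sketch of that alternative, assuming the caret package is installed. Box-Cox requires strictly positive values, so the Yeo-Johnson variant is used here to also cover the zero-inflated columns (K, Ba, Fe); preProcess() estimates the transformations and applies centering and scaling in one step.

library(caret)

# Estimate power transformations plus centering/scaling on the predictors only
pp <- preProcess(Glass[, -which(names(Glass) == "Type")],
                 method = c("YeoJohnson", "center", "scale"))
pp   # summary of the estimated pre-processing steps

Glass_pp <- predict(pp, Glass[, -which(names(Glass) == "Type")])
Glass_pp$Type <- Glass$Type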

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
?Soybean

3.2a Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

library(tidyr)

# Proportion of observations in the single most frequent level of each predictor
majority_props <- Soybean %>%
  dplyr::select(-Class) %>%
  summarise(across(everything(), ~ max(table(.)) / length(.))) %>%
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "PropMajority")

ggplot(majority_props, aes(x = reorder(Predictor, PropMajority), 
                           y = PropMajority, fill = PropMajority)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_minimal() +
  labs(title = "Proportion of Majority Class per Predictor",
       x = "Predictor", y = "Proportion Majority Class")

# Get top 5 most degenerate predictors
top_degenerate <- majority_props %>%
  arrange(desc(PropMajority)) %>%
  head(5)

top_degenerate
## # A tibble: 5 × 2
##   Predictor    PropMajority
##   <chr>               <dbl>
## 1 mycelium            0.936
## 2 sclerotia           0.915
## 3 leaves              0.887
## 4 int.discolor        0.851
## 5 leaf.malf           0.811

Yes, many predictors in the soybean dataset are degenerate: not all 35 predictors will be equally useful for classification, and highly degenerate ones may not add much value (a near-zero-variance check is sketched after the list below).

  • Predictors like mycelium, sclerotia, leaves, int.discolor and leaf.malf are highly degenerate, with over 80–90% of cases falling into a single category.
    • Almost all observations for mycelium are “0,” which means it does not contribute much to distinguishing among the disease classes.
  • Some predictors, such as fruit.pods and stem.cankers, show moderate degeneracy: the distribution is skewed but not completely dominated by one category.
  • Other predictors, such as date, germ, and crop.hist, are well distributed and informative; they are likely to contribute more to the classification model.
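
The same conclusion can be reached with a near-zero-variance check of the kind discussed in the chapter (a sketch, assuming the caret package is installed): nearZeroVar() reports the frequency ratio of the two most common levels and the percentage of unique values, and flags degenerate predictors.

library(caret)

nzv <- nearZeroVar(Soybean[, -1], saveMetrics = TRUE)   # drop Class

# Predictors flagged as near-zero variance, most lopsided first
flagged <- nzv[nzv$nzv, ]
flagged[order(-flagged$freqRatio), ]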

3.2c Develop a strategy for handling missing data, either by eliminating predictors or imputation.

# Duplicate dataset
soy <- Soybean

# Check missing data per predictor
missing_summary <- colSums(is.na(soy)) / nrow(soy)
missing_summary <- sort(missing_summary, decreasing = TRUE)

# Show top predictors with missing data
missing_summary[1:15]
##            hail           sever        seed.tmt         lodging            germ 
##       0.1771596       0.1771596       0.1771596       0.1771596       0.1639824 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##       0.1581259       0.1551977       0.1551977       0.1551977       0.1551977 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##       0.1464129       0.1346999       0.1346999       0.1346999       0.1229868
# Drop predictors with too much missing data (> 15%)
high_missing <- names(missing_summary[missing_summary > 0.15])
soy_reduced <- soy %>%
  dplyr::select(-dplyr::all_of(high_missing))

# Define function for mode/median imputation
mode_impute <- function(x) {
  if (is.factor(x) || is.character(x)) {
    return(names(which.max(table(x))))  # mode for categorical
  } else {
    return(median(x, na.rm = TRUE))     # median for numeric
  }
}

# Apply imputation across the dataset (the replacement keeps factors as
# factors, since the mode is always an existing level)
soy_imputed <- soy_reduced %>%
  mutate(across(-Class, ~ { x <- .; x[is.na(x)] <- mode_impute(x); x }))

# Verify no missing values remain
colSums(is.na(soy_imputed))
##         Class          date   plant.stand        precip          temp 
##             0             0             0             0             0 
##     crop.hist      area.dam  plant.growth        leaves     leaf.halo 
##             0             0             0             0             0 
##     leaf.marg     leaf.size   leaf.shread     leaf.malf          stem 
##             0             0             0             0             0 
##  stem.cankers canker.lesion     ext.decay      mycelium  int.discolor 
##             0             0             0             0             0 
##     sclerotia    fruit.pods          seed   mold.growth     seed.size 
##             0             0             0             0             0 
##         roots 
##             0
# Before: proportion missing
missing_df <- data.frame(
  Predictor = names(missing_summary),
  MissingRate = missing_summary
)

before <- ggplot(missing_df, aes(x = reorder(Predictor, -MissingRate), y = MissingRate)) +
  geom_bar(stat = "identity", fill = "red") +
  coord_flip() +
  labs(title = "Before Imputation: Missing Data by Predictor",
       y = "Proportion Missing", x = "Predictor") +
  theme_minimal()

# After: confirm no missing values
missing_after <- colSums(is.na(soy_imputed)) / nrow(soy_imputed)
missing_after_df <- data.frame(
  Predictor = names(missing_after),
  MissingRate = missing_after
)

after <- ggplot(missing_after_df, aes(x = reorder(Predictor, -MissingRate), y = MissingRate)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "After Imputation: Missing Data by Predictor",
       y = "Proportion Missing", x = "Predictor") +
  theme_minimal()

before

after

To address the missing data in the soybean dataset (several predictors have up to ~18% of their values missing), I applied the following strategy:

  • Drop high-missing predictors (>15% missing):
    • Variables like hail, sever, seed.tmt, lodging, and germ had too much missing data.
    • Keeping them would require heavy imputation, which risks distorting their distributions and introducing noise, so I removed them entirely.
  • Impute remaining predictors:
    • For the remaining variables, missing values were imputed with the mode for categorical predictors and the median for numeric ones (in this dataset every predictor is a factor, so the mode is what actually gets used). This fills in missing values with realistic levels while avoiding the distortion that mean imputation can cause for skewed variables with extreme values.

Verification: After this process, every retained predictor had 0% missing values, confirming that the approach handled the missing data effectively. A quick check that imputation did not materially shift the level proportions is sketched below.

With this strategy, dropping prevents highly incomplete variables from overwhelming the model, while imputation preserves moderately incomplete predictors so that useful information is not discarded entirely.
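
A minimal sanity check (a sketch; leaf.halo is just one example of a retained predictor with a nontrivial missing rate): compare the level proportions before and after imputation to confirm the mode fill did not noticeably distort the distribution.

# Level proportions before (NAs excluded) vs. after imputation
round(prop.table(table(Soybean$leaf.halo)), 3)
round(prop.table(table(soy_imputed$leaf.halo)), 3)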