Exercise 3.1

library(mlbench)
data(Glass)
library(ggplot2)
library(tidyr)
library(dplyr)

Part A

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# Histograms of glass predictors
Glass %>%
  pivot_longer(-Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "white") +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Histograms of Glass Predictors") +
  theme(
    plot.title=element_text(hjust=0.5)
  )

# Type frequencies
ggplot(data = Glass, aes(Type)) +
  geom_bar() +
  labs(title = "Frequencies by type") +
  theme(
    plot.title=element_text(hjust=0.5)
  )

# Correlation plot
library(corrplot)
glass_cor <- cor(Glass[ , -10])  # Type is the 10th column
corrplot(glass_cor, 
         method = "color", 
         type = "lower",
         tl.col = "black", 
         tl.srt = 45)

According to the histograms:

  • Al and Ca look approximately normal.
  • Na and RI are skewed right.
  • Ba, Fe, and K have many 0 values with some outliers.
  • Mg seems to be bimodal with peaks around 0 and 3.5.

The bar plot reveals class imbalance.

  • Type 1 & 2 are the most frequent
  • Type 6 is the least frequent
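The imbalance can also be put in numbers. The snippet below is a minimal sketch that tabulates the counts and percentage share of each glass type; the name `type_counts` is introduced here for illustration only.

```r
library(mlbench)
data(Glass)

# Count observations per glass type
type_counts <- table(Glass$Type)
type_counts

# Share of each type as a percentage of all samples
round(100 * prop.table(type_counts), 1)
```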

According to the correlation plot:

  • There is a strong positive correlation between Ca and RI.
  • There is a moderate positive correlation between the following pairs:
    • Ba and Al
    • Ba and Na
    • K and Al
  • There is a moderate negative correlation between the following pairs:
    • Si and RI
    • Ba and Mg
    • Al and Mg
    • Ca and Mg
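
Reading pair strengths off a colored grid is approximate, so it can help to list the correlations directly. The following is a rough sketch that ranks every predictor pair by absolute correlation; `cor_pairs` is a name made up for this example.

```r
library(mlbench)
data(Glass)

glass_cor <- cor(Glass[, -10])  # drop Type, the 10th column

# Keep each pair once by blanking the upper triangle and the diagonal
glass_cor[upper.tri(glass_cor, diag = TRUE)] <- NA

# Reshape the matrix into one row per predictor pair
cor_pairs <- as.data.frame(as.table(glass_cor), stringsAsFactors = FALSE)
names(cor_pairs) <- c("Var1", "Var2", "r")
cor_pairs <- cor_pairs[!is.na(cor_pairs$r), ]

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 8)
```

Sorting on the absolute value keeps strong negative pairs next to strong positive ones in the same listing.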

Part B

Do there appear to be any outliers in the data? Are any predictors skewed?

Glass %>%
  pivot_longer(-Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_boxplot() +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Boxplots of Glass Predictors") +
  theme(
    plot.title=element_text(hjust=0.5)
  )

The boxplots confirm that Ba, Fe, and K have strong outliers, as a few extreme values extend far beyond the whiskers. Ca has slight outliers at the higher values. Na and RI show mild outliers. Si shows extreme values at both ends but seems fairly symmetric.

library(moments)
sapply(Glass[ , -10], skewness)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6140150  0.4509917 -1.1444648  0.9009179 -0.7253173  6.5056358  2.0326774 
##         Ba         Fe 
##  3.3924309  1.7420068

From the skewness() output, we can see that Na, Al, and Si are roughly symmetric, as their skewness values are within 1 of zero (Na is the closest at 0.45). RI, K, Ca, Ba, and Fe are all right-skewed to various degrees, with skewness above 1, and K has the strongest right skew at 6.51. Mg is left-skewed (-1.14), which reflects the cluster of zero values seen in the histogram.

Part C

Are there any relevant transformations of one or more predictors that might improve the classification model?

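Since RI, K, Ca, Ba, and Fe are all right-skewed, a log or Box-Cox transformation could help reduce skewness and make the distributions more symmetric. Both require strictly positive values, so they apply directly to RI and Ca. K, Ba, and Fe contain zeros and would need a shifted log such as log(x + 1) or a Yeo-Johnson transformation instead.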

For Na, Al, and Si, I believe a transformation is not strictly necessary since the distributions are already roughly symmetric. However, there is a slight right skew for Na and Al and a slight left skew for Si, so a log or Box-Cox transformation may still be beneficial.

Since the predictors are on different scales, it would be good to standardize them by applying z-score standardization.

Mg is bimodal, with a cluster of values at 0, and a monotone transformation such as a log or power transform cannot remove bimodality, so I would leave Mg untransformed.
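
As a rough sketch of how these steps could be combined, the code below uses caret's `preProcess()` with the Yeo-Johnson method, which accepts zeros, followed by centering and scaling. The caret package is an assumption here, since it is not loaded anywhere in the original analysis, and the choice of Yeo-Johnson for every column (rather than Box-Cox for the strictly positive ones) is a simplification. The names `predictors`, `pp`, and `glass_trans` are illustrative.

```r
library(mlbench)
library(caret)
library(moments)
data(Glass)

predictors <- Glass[, -10]  # drop Type, the 10th column

# Zeros per predictor: columns with zeros cannot take a plain log or Box-Cox
colSums(predictors == 0)

# Estimate Yeo-Johnson parameters, then center and scale each predictor
pp <- preProcess(predictors, method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, predictors)

# Compare skewness before and after the transformation
round(cbind(before = sapply(predictors, skewness),
            after  = sapply(glass_trans, skewness)), 2)
```

In a modeling workflow the `preProcess()` parameters would be estimated on the training set only and then applied to the test set with `predict()`.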

Exercise 3.2

data(Soybean)

Part A

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 

All predictors have at least two levels. Many predictors have a substantial number of NAs; the only predictor without NAs is leaves.

# bar plot for each predictor
vars <- names(Soybean)[names(Soybean) != "Class"]

for (v in vars) {
  p <- ggplot(Soybean, aes(x = .data[[v]])) +  # aes_string() is deprecated
    geom_bar(fill = "lightpink") +
    labs(title = paste("Distribution of", v), x = v) +
    theme(
      plot.title = element_text(hjust = 0.5)
    )
  print(p)
}

For the majority of the predictors, zero is the most common value. Most of the predictors have imbalanced distributions, with a few exceptions such as germ, area.dam, crop.hist, seed.tmt, and plant.stand, which are more balanced.

Mycelium, sclerotia, shriveling, and leaf.mild are not strictly degenerate since they have more than one category, but nearly all of their observations fall in the zero category (for example, 639 of the 645 non-missing mycelium values), so they are nearly degenerate. These predictors vary little and are likely to add little value to the model.
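
One way to make "almost degenerate" concrete is the frequency ratio: the count of the most common level divided by the count of the second most common. The sketch below computes it in base R for every predictor; `freq_ratio` is a name chosen for this example, and any cutoff used to flag a predictor is a judgment call rather than a fixed rule.

```r
library(mlbench)
data(Soybean)

# Ratio of the most common level to the second most common, per predictor
freq_ratio <- sapply(Soybean[, names(Soybean) != "Class"], function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NAs by default
  as.numeric(counts[1]) / as.numeric(counts[2])
})

# Largest ratios first: these are the most lopsided predictors
round(sort(freq_ratio, decreasing = TRUE), 1)
```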

Part B

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Missingness by predictor
Soybean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  ggplot(aes(x = reorder(Variable, -Missing), y = Missing)) +
  geom_col(fill = "limegreen") +
  coord_flip() +
  labs(title = "Missing Values by Predictor",
       x = "Predictor", y = "Number of Missing Values") +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

# Missingness by predictor + class
Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.))), .groups = "drop") %>%
  pivot_longer(-Class, names_to = "Variable", values_to = "PropMissing") %>%
  ggplot(aes(x = Variable, y = Class, fill = PropMissing)) +
  geom_tile() +
  scale_fill_gradient(low = "darkgreen", high = "white") +
  labs(title = "Proportion Missing by Predictor and Class",
       x = "Predictor", y = "Class", fill = "Proportion Missing") +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
    )

The column chart shows that hail, lodging, seed.tmt, and sever have the most missing values (121 each), followed by germ (112).

The plot of the proportion of missing values by class and predictor is very helpful, as it shows that the missing values occur in only a few classes: 2-4-d-injury, phytophthora-rot, herbicide-injury, diaporthe-pod-&-stem-blight, and cyst-nematode. This means the values are unlikely to be missing at random; the missingness is tied to the class.
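
The same pattern can be checked numerically. This is a minimal sketch that counts, for each class, how many rows contain at least one missing value; `has_na` and `na_by_class` are names introduced for illustration.

```r
library(mlbench)
data(Soybean)

# TRUE for every row with at least one missing value
has_na <- !complete.cases(Soybean)

# Number of incomplete rows in each class
na_by_class <- tapply(has_na, Soybean$Class, sum)
na_by_class[na_by_class > 0]

# Proportion of incomplete rows within those classes
round(tapply(has_na, Soybean$Class, mean)[na_by_class > 0], 2)
```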

Part C

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

# remove the nearly-degenerate predictors
soy_new <- Soybean %>%
  select(-c(mycelium, sclerotia, leaf.mild, shriveling)) 

# Impute by replacing all missing values with the global most common category for that column. 
cols <- setdiff(names(soy_new), "Class")

soy_imputed <- soy_new %>%
  mutate(across(all_of(cols), as.character)) %>%
  mutate(across(all_of(cols), ~ {
    x <- .
    x[is.na(x)] <- {
      x0 <- x[!is.na(x)]
      ux <- unique(x0); ux[which.max(tabulate(match(x0, ux)))]
    }
    x
  })) %>%
  mutate(across(everything(), as.factor))

colSums(is.na(soy_imputed))  # should be all zeros
##           Class            date     plant.stand          precip            temp 
##               0               0               0               0               0 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##               0               0               0               0               0 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##               0               0               0               0               0 
##       leaf.size     leaf.shread       leaf.malf            stem         lodging 
##               0               0               0               0               0 
##    stem.cankers   canker.lesion fruiting.bodies       ext.decay    int.discolor 
##               0               0               0               0               0 
##      fruit.pods     fruit.spots            seed     mold.growth   seed.discolor 
##               0               0               0               0               0 
##       seed.size           roots 
##               0               0
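
The same mode imputation can also be written as a small helper that works on the factors directly, which avoids the round trip through character and keeps each column's original levels. This is only a sketch of an alternative formulation; `impute_mode`, `soy_mode`, and `pred_cols` are hypothetical names, not part of the analysis above.

```r
library(mlbench)
data(Soybean)

# Hypothetical helper: replace NAs in a factor with its most frequent level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() ignores NAs
  x[is.na(x)] <- mode_level
  x
}

soy_mode <- Soybean
pred_cols <- setdiff(names(soy_mode), "Class")

# Apply the helper to every predictor, leaving Class untouched
soy_mode[pred_cols] <- lapply(soy_mode[pred_cols], impute_mode)

sum(is.na(soy_mode))  # total remaining NAs
```

Ties for the most frequent level are broken by `which.max()`, which returns the first maximum.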

After we impute, let’s take a look at the predictors’ distributions before and after the imputation to make sure it didn’t distort the variables.

# bar plot for each predictor post-imputation
vars2 <- names(soy_imputed)[names(soy_imputed) != "Class"]

for (v in vars2) {
  p <- ggplot(soy_imputed, aes(x = .data[[v]])) +  # aes_string() is deprecated
    geom_bar(fill = "salmon") +
    labs(title = paste("Distribution of", v, "\nPost-Imputation"), x = v) +
    theme(
      plot.title = element_text(hjust = 0.5)
    )
  print(p)
}

We can see that there are no more NAs in the predictors. After imputation, there are still no strictly degenerate predictors, but a few are dominated by a single category and could be considered nearly degenerate (mold.growth, seed.discolor, seed.size, and leaf.malf).