Homework 4

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .pdf for your run code.

3.1)

The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

# Load required libraries
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'stringr' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
data(Glass)

#Histograms of Numerical Predictors
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

#Boxplots of Numerical Predictors
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

#Correlation Heatmap
Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot() 

#Bar Chart of Target Variable
Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")

The Glass dataset exhibits varied distributional characteristics: Aluminum (Al), Barium (Ba), Calcium (Ca), Iron (Fe), Potassium (K), and Refractive Index (RI) display right skewness, with Ba and Fe concentrated near zero, while Magnesium (Mg) and Silicon (Si) show left skewness, Mg being bimodal, and Sodium (Na) approximates a normal distribution with a slight right tail; Glass Type is predominantly focused on Types 1, 2, and 7. Notable correlations include a strong positive relationship between RI and Ca, negative correlations between RI and Si, Al and Mg, Ca and Mg, Ba and Mg, and a positive correlation between Ba and Al.

# Boxplots to detect outliers
Glass %>%
  select(-Type) %>%  # Exclude the target variable
  gather(key = "Predictor", value = "Value") %>%
  ggplot(aes(x = Predictor, y = Value)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Boxplots for Outlier Detection", x = "Predictor", y = "Value")

# Skewness check
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.2
skew_values <- Glass %>%
  select(-Type) %>%
  summarise_all(funs(skewness))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(skew_values)
##         RI        Na        Mg        Al         Si        K       Ca      Ba
## 1 1.602715 0.4478343 -1.136452 0.8946104 -0.7202392 6.460089 2.018446 3.36868
##         Fe
## 1 1.729811

The boxplot visualization of the Glass dataset’s numerical predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe) reveals distinct distribution patterns: Ba, Fe, and K exhibit significant outliers and skewness, necessitating log transformations or robust scaling for improved model performance. Conversely, RI, Mg, Na, and Si display more consistent distributions, suggesting they could serve as stable inputs.

# Log-transform skewed predictors (adding a small constant to avoid log(0))
Glass_transformed <- Glass %>%
  mutate(
    Ba = log1p(Ba),  # Log transform Ba
    K = log1p(K),    # Log transform K
    Fe = log1p(Fe)   # Log transform Fe
  )

# Visualize distributions after transformation
Glass_transformed %>%
  select(Ba, K, Fe) %>%
  gather(key = "Predictor", value = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 20, fill = "lightgreen", color = "black") +
  facet_wrap(~Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Distributions After Log Transformation", x = "Value", y = "Frequency")

After a log transformation, Barium (Ba) and Iron (Fe) exhibit a concentration of values near zero, indicating minimal presence, with a small spread of higher values suggesting limited variability; conversely, Potassium (K) displays a broader distribution, peaking around 0.5 and extending to 2.0, demonstrating more retained variability. This transformation successfully reduces skewness, making the data more model-ready, though Ba and Fe’s limited variance suggests potentially lower predictive power compared to K, which retains more variability.

3.2)

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# Load the Soybean dataset and others
library(tidyverse)
library(mlbench)
data(Soybean)

# Explore categorical predictors
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
columns <- colnames(Soybean)

lapply(columns, function(col) {
  ggplot(Soybean, aes_string(col, fill = col)) +
    geom_bar(color = "black", alpha = 0.8) +  # Add borders and transparency for better visuals
    coord_flip() +
    scale_fill_viridis_d(option = "C") +      # Use Viridis color palette for consistency
    theme_minimal() +
    ggtitle(paste("Frequency Distribution:", col)) +
    labs(x = col, y = "Count") +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      axis.text.y = element_text(size = 10),
      axis.text.x = element_text(size = 10)
    )
})
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

## 
## [[10]]

## 
## [[11]]

## 
## [[12]]

## 
## [[13]]

## 
## [[14]]

## 
## [[15]]

## 
## [[16]]

## 
## [[17]]

## 
## [[18]]

## 
## [[19]]

## 
## [[20]]

## 
## [[21]]

## 
## [[22]]

## 
## [[23]]

## 
## [[24]]

## 
## [[25]]

## 
## [[26]]

## 
## [[27]]

## 
## [[28]]

## 
## [[29]]

## 
## [[30]]

## 
## [[31]]

## 
## [[32]]

## 
## [[33]]

## 
## [[34]]

## 
## [[35]]

## 
## [[36]]

Essentially, degenerate distributions represent variables that consistently have a single value; in this dataset, mycelium and sclerotia appear to exhibit this characteristic. Similarly, leaf.mild and leaf.malf show a strong tendency towards a single value, particularly when missing data is accounted for, indicating nearly one-sided distributions.

  1. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Missing values across variables
Soybean %>%
  summarise_all(~sum(is.na(.))) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing_count") %>%
  mutate(missing = ifelse(missing_count > 0, "Missing", "Not Missing")) %>%
  ggplot(aes(x = reorder(variables, -missing_count), y = missing_count, fill = missing)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("red", "grey")) +
  labs(
    title = "Missing Data Proportions by Variable",
    x = "Variables",
    y = "Count of Missing Values",
    fill = "Missing Status"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10)
  )

The missing data appears to be non-random, concentrated within specific case groups. After removing these five distinct case groups, the dataset is complete, suggesting the missingness is tied to these particular classifications

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation
#  Missing proportions by class
Soybean %>%
  group_by(Class) %>%
  mutate(class_Total = n()) %>%
  ungroup() %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(
    Missing = n(),
    class_Total = first(class_Total),
    Proportion = Missing / class_Total
  ) %>%
  ggplot(aes(x = reorder(Class, -Proportion), y = Proportion, fill = Class)) +
  geom_col(color = "black") +
  scale_fill_viridis_d(option = "B") +
  labs(
    title = "Proportion of Missing Data by Class",
    x = "Class",
    y = "Proportion of Missing Data"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    legend.position = "none"
  )

Two potential strategies exist for addressing the missing data: First, the five case groups containing the missing values could be entirely removed to achieve a complete dataset. Alternatively, the dataset could be divided, isolating these five groups. Within these isolated groups, K-Nearest Neighbors (KNN) imputation could be utilized to fill the missing values. If certain variables within these groups are consistently devoid of data, they could be eliminated from the subsetted data to maintain data integrity.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.