Exercises: 3.1 and 3.2 from the Kuhn and Johnson book.

Exercise 3.1

3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification.The data consist of 214 glass samples labeled as one of seven class categories.There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench) data(Glass) str(Glass)

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
tbl <- rbind(table(Glass$Type), round(table(Glass$Type)/(nrow(Glass))*100.0,2)) 
rownames(tbl) <- c("Count of Glass type: ", "% of Dataset from Glass type: ")
tbl
##                                    1     2     3     5    6     7
## Count of Glass type:           70.00 76.00 17.00 13.00 9.00 29.00
## % of Dataset from Glass type:  32.71 35.51  7.94  6.07 4.21 13.55

Clearly, types 1 and 2 make up an outsized share of the glass types in the data set, which I’ll want to keep in mind. The predictor vlaues that correspond to those types will also make up an outsized share.

a

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Glass %>%
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal() +
  labs(
    title = "Density Plots of Numeric Predictors by Glass Type",
    x = "Value",
    y = "Density"
  )  

Glass %>%
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value, fill = Type, color = Type)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal() +
  labs(
    title = "Density Plots of Numeric Predictors by Glass Type",
    x = "Value",
    y = "Density"
  )  

This gave me an idea of the overall distributions for each predictor. I can also see how somewhat bi-modal distributions in some areas appear to be related to the distributions when controlled by glass Type. For example for Al, the distributions look more normal for Types 1 and 2. But the distributions look less normal and have different centers for types 3 and 7. I’m interested in exploring transformations at this point, but I see that this is a latter part of the question, so I’ll hold off until then.

I want to get a better idea of the relationship between predictors both visually + in correlation terms.

# Using GGally for this first part

Glass %>%
  select(-Type) %>% 
  ggpairs(
    lower = list(continuous = wrap("points", alpha = 0.4, size = 0.5)),
    title = "Scatterplot Matrix of Glass Predictors"
  ) +
  theme_minimal()

## Doing correlation again for the colors: 
glass_cor <- Glass %>%
  select(-Type) %>%
  cor() 

corrplot(glass_cor, 
         method = "color", 
         type = "lower",
         tl.col = "black", 
         tl.srt = 45,
         addCoef.col = "black"
         ) 

  • So the correlation between Ca and RL is quite high at 0.81 - so here we have a strong positive correlation between the two variables.
  • Come other honorable mentions around the +/- 0.5 mark are:
    • Rl and Si: negatvie
    • Mg and Al: negative
    • Mg and Ca: negative
    • Mg and Ba: negative
    • Al and Ba: positive

b

  1. Do there appear to be any outliers in the data? Are any predictors skewed?

I wanted to see the outliers via boxplot, but the shaped of the distribution via violin plot, so I ended up layering both.

Glass %>%
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value, y = "")) +  
  geom_violin(
    fill = "steelblue", 
    alpha = 0.5, 
    outlier.color = "red", # making outliers easier to see
    outlier.size = 2,
    outlier.shape = 16
  ) +
  geom_boxplot(
    width = 0.15,               # Keeps the boxplot skinny so it fits inside the violin
    fill = "white",             # Makes the boxplot stand out against the colored violin
    outlier.color = "red",      # Keeps the outliers bright red
    outlier.size = 1.5,
    outlier.shape = 16,
    alpha = 0.7
  ) +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  theme(
    axis.text.y = element_blank(), 
    axis.ticks.y = element_blank()
  ) +
  labs(
    title = "Violin of Glass Predictors (Outliers and Skewness)",
    x = "Predictor Value",
    y = ""
  )
## Warning in geom_violin(fill = "steelblue", alpha = 0.5, outlier.color = "red", : Ignoring unknown parameters: `outlier.colour`, `outlier.size`, and
## `outlier.shape`

At first glance, there are definitely outliers. But I’m going to take a second look while controlling for glass Type, since I know the distributions for each predicotr vary by glass type.

Glass %>%
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value, y = Type, fill = Type)) + 
  geom_violin(
    alpha = 0.5, 
    outlier.color = "red", 
    outlier.size = 1.5,
    outlier.shape = 16
  ) +
  geom_boxplot(
    width = 0.15,               # Keeps the boxplot skinny so it fits inside the violin
    fill = "white",             # Makes the boxplot stand out against the colored violin
    outlier.color = "red",      # Keeps the outliers bright red
    outlier.size = 1.5,
    outlier.shape = 16,
    alpha = 0.7
  ) +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  theme(
    legend.position = "none"  
  ) +
  labs(
    title = "Violin plots of Glass Predictors Controlled by Glass Type",
    x = "Predictor Value",
    y = "Glass Type"
  )
## Warning in geom_violin(alpha = 0.5, outlier.color = "red", outlier.size = 1.5, : Ignoring unknown parameters: `outlier.colour`, `outlier.size`, and
## `outlier.shape`

Type 5 and Type 6 glass have some weird shapes that might imply heavier skew and outliers across almost all predictors, but there is also so little data for these two types, that this makes sense how a few outliers are holding a more powerful sway on the shape of the data.

With that noted:

  • Al might be a bit right skewed, seemingly by Type 5 outliers.
  • Mg might be a bit left skewed from Type 2 outliers. But it’s also uniquely has spreads with very different centers depemnding on the glass type and creating a bimodal distribution.
  • Ba, Fe, K, RI seem right skewed by outliers across nearly all Types.

c

  1. Are there any relevant transformations of one or more predictors that might improve the classification model?
  • Centering and Scaling might improve the classification model. Because the predictors have both things like refractive index (RI) versus percentage of an element, the scale is different, Centering and scaling would make the mean 0 and standard deviation 1 so that models that need more numeric stability can perform better.
  • Box-Cox Transformation may be a good fit to handle the skewed predictors explored above. This might mean for Al, Ba, Fe, K and RI.
# Finding the appropriate transformations
yj_model <- preProcess(Glass %>% select(-Type), method = "YeoJohnson")

# Extract the lambda values
lambdas <- yj_model$yj
 
lambda_table <- data.frame(
  Predictor = names(lambdas), 
  Lambda = round(lambdas, 3)
)
 
print(lambda_table, row.names = FALSE)
##  Predictor Lambda
##         Na -0.187
##         Mg  2.237
##         Al  0.003
##          K -0.984
##         Ca -1.357

I tested this out, and actually this was better fit for Na, Mg, Al, K and Ca according to the YeoJohnson method. Probably due to the number of 0s in some of the other variables. These transformations would help the classification model. That said, the clearest transformations of these variables would of course be Mg, K and maybe Ca, becuase their lamda valeus can be more easily rounded to a number that is more interpretable as well.

Exercise 3.2

3.2 The soy bean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g.,temperature, precipitation) and plant conditions (e.g.,left spots,mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via: > library(mlbench) > data(Soybean) > ## See ?Soybean for details

data(Soybean)
head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0
?Soybean
## starting httpd help server ... done

a

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
summary(Soybean) 
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
# splitting this into 3 chunks because I wanted it to look right in the knitted document

Soybean %>%
  select(2:13) %>% 
  mutate(across(everything(), as.character)) %>% 
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "steelblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free_x", ncol = 4) + 
  theme_minimal() +
  labs(title = "Soybean Predictors (Group 1 of 3)", x = "Level", y = "Count")

Soybean %>%
  select(14:25) %>% 
  mutate(across(everything(), as.character)) %>% 
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "darkslateblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free_x", ncol = 4) + 
  theme_minimal() +
  labs(title = "Soybean Predictors (Group 2 of 3)", x = "Level", y = "Count")

Soybean %>%
  select(26:36) %>% 
  mutate(across(everything(), as.character)) %>% 
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "cadetblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free_x", ncol = 4) + 
  theme_minimal() +
  labs(title = "Soybean Predictors (Group 3 of 3)", x = "Level", y = "Count")

# Use the function from the package the book author made
nzv_analysis <- caret::nearZeroVar(Soybean[, -1], saveMetrics = TRUE)

# Filtering on 'nzv == TRUE' should meet the book requirements: FreqRatio > 19 AND PercentUnique < 10
degenerate_vars <- nzv_analysis[nzv_analysis$nzv == TRUE, ]

# Output table
degenerate_output <- data.frame(
  Predictor = rownames(degenerate_vars),
  FreqRatio = round(degenerate_vars$freqRatio, 2),
  PercentUnique = round(degenerate_vars$percentUnique, 2),
  Is_Degenerate = degenerate_vars$nzv
) 
print(degenerate_output, row.names = FALSE)
##  Predictor FreqRatio PercentUnique Is_Degenerate
##  leaf.mild     26.75          0.44          TRUE
##   mycelium    106.50          0.29          TRUE
##  sclerotia     31.25          0.29          TRUE

Yes, the leaf.mild, mycelium and sclerotia are degenerate with one or more unque categories that are only present for a very small amount of the data.

b

  1. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
par(mar = c(10, 4, 4, 2)) 

# Using VIM to see patterns of missingness
VIM::aggr(Soybean, 
     col = c('navyblue','red'), 
     numbers = TRUE, 
     sortVars = TRUE, 
     # las = 2 rotates the x-axis labels to be vertical
     # cex.axis = .5 makes the font smaller so they all fit
     labels = names(Soybean), 
     cex.axis = 0.5, 
     las = 2,
     gap = 3, 
     ylab = c("Proportion of Missingness","Pattern"))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000
# Creating a heatmap for percentage of missingness grouped by Class 
na_by_class <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(cols = -Class, names_to = "Predictor", values_to = "Percent_NA")
 
ggplot(na_by_class, aes(x = Predictor, y = Class, fill = Percent_NA)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "firebrick") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 8)) +
  labs(
    title = "Heatmap of Missing Data by Disease Class",
    subtitle = "Darker red = higher concentration of NAs",
    x = "Predictors",
    y = "Disease Class",
    fill = "% Missing"
  )

From the first graph

Many of these variables have over 15% missingness (I won’t re-write the names because there are so many). Lukcily, of our degenerate distirbutions, the missingness is only around 5% for 2 of the 3. It’s over 15% for leaf.mild though, which is troublesome.

I can also see from the right-hand side plot that many of these variables are missing together, alongside other values in a certain pattern. This means that the missingness isn’t just random among predictors.

From the heatmap

Yes, I can see that certain predictor data is essentially missing for certain classes.

c

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputations

I thought through a strategy for this:

  1. I’d start by eliminating the three variables that were near-zero variance. They cannot add much information, can add problems, and one of them has a 15% missingness rate on top of it.

  2. For the data that I am able to impute (more on this on the next bullet point), I’d use KNN imputation. This is pretty suitable for the type of data we have here, when the missingness clumps together among similar cases. Plus, some other imputation methods simply wouldn’t be suited to this. Mice, for example, assume missing at random (MAR) data, and it looks like that is not what we have here.

  3. Lastly there is data that I should be certain not to impute. There were classes + predictors in my last plot that were 100% missing. I’d either need to remove this data from the data set (since we should never be trying to predict those values), or I’d need to add a feature flag for that missingness.