library(tidyverse)
library(mlbench)
library(corrplot)
library(MASS)
In this document we work through Exercises 3.1 and 3.2 from Applied Predictive Modeling by Kuhn and Johnson.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Below we use histograms to visualize how each predictor is distributed, boxplots to explore outliers, and a correlation plot to examine the relationships among the variables.
data(Glass)
# Drop the class label so that only the nine predictors remain
Glass <- Glass |>
  dplyr::select(-Type)
# Histograms of each predictor to examine the shape of its distribution
par(mfrow = c(2, 5))
for (i in names(Glass))
  hist(Glass[[i]], main = i, xlab = "Value")
# Boxplots of each predictor to highlight outliers
par(mfrow = c(2, 5))
for (i in names(Glass))
  boxplot(Glass[[i]], main = i, xlab = "Value")
# Correlation matrix of the predictors, visualized with corrplot
Glass |>
  cor() |>
  corrplot()
Outliers are very apparent in almost every distribution besides Mg and Na. The extreme frequency of outliers in the other variables is a consequence of their skewness: Fe, Ba, and K are all strongly right-skewed, with Ca slightly less so in comparison, while Mg is bimodal and fairly left-skewed. It is also helpful to note that many of these skews come from a capping effect: these variables cannot go below 0, so 0 becomes a mode.
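To quantify this, we can compute the sample skewness of each predictor. The short sketch below assumes the e1071 package is installed; large positive values correspond to the right-skewed variables noted above.
# Sample skewness of each predictor; large positive values indicate strong right skew
sapply(Glass, e1071::skewness)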
A Box-Cox transformation would help normalize the strong skew in K, Ca, Ba, and Fe. Mg, however, could not be normalized this way: no single monotone transformation will reshape a bimodal distribution into an approximately normal one while preserving the data's integrity.
# Shift and scale Ba away from zero, then profile the Box-Cox likelihood
b <- boxcox(lm((Glass$Ba + 1) * 100 ~ 1))
lambda <- -1.9  # chosen value of lambda for the transformation
# Apply the Box-Cox transform with the chosen lambda and inspect the result
hist((((Glass$Ba + 1) * 100) ^ lambda - 1) / lambda)
Yet upon attempting to Box-Cox transform Ba we run into a hurdle: the transform cannot handle values of 0, so the data must first be shifted (and here also scaled) away from 0. Even with these additive and multiplicative adjustments, the variable remains too skewed to transform properly. It should therefore be noted that Box-Cox transformations, even when lambda can be optimized automatically, are not a guaranteed cure for normalizing a distribution.
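For comparison, the automatic optimization does work on a strictly positive predictor such as Ca. The sketch below assumes the caret package is installed and uses its BoxCoxTrans() helper; Ba, with its many zeros, cannot be handled the same way.
# caret::BoxCoxTrans estimates lambda automatically, but only for strictly positive data
ca_bc <- caret::BoxCoxTrans(Glass$Ca)
hist(predict(ca_bc, Glass$Ca), main = "Ca after Box-Cox", xlab = "Value")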
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
# Drop the class label and summarize the 35 categorical predictors
Soybean <- Soybean |>
  dplyr::select(-Class)
summary(Soybean)
## date plant.stand precip temp hail crop.hist
## 5 :149 0 :354 0 : 74 0 : 80 0 :435 0 : 65
## 4 :131 1 :293 1 :112 1 :374 1 :127 1 :165
## 3 :118 NA's: 36 2 :459 2 :199 NA's:121 2 :219
## 2 : 93 NA's: 38 NA's: 30 3 :218
## 6 : 90 NA's: 16
## (Other):101
## NA's : 1
## area.dam sever seed.tmt germ plant.growth leaves leaf.halo
## 0 :123 0 :195 0 :305 0 :165 0 :441 0: 77 0 :221
## 1 :227 1 :322 1 :222 1 :213 1 :226 1:606 1 : 36
## 2 :145 2 : 45 2 : 35 2 :193 NA's: 16 2 :342
## 3 :187 NA's:121 NA's:121 NA's:112 NA's: 84
## NA's: 1
##
##
## leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging
## 0 :357 0 : 51 0 :487 0 :554 0 :535 0 :296 0 :520
## 1 : 21 1 :327 1 : 96 1 : 45 1 : 20 1 :371 1 : 42
## 2 :221 2 :221 NA's:100 NA's: 84 2 : 20 NA's: 16 NA's:121
## NA's: 84 NA's: 84 NA's:108
##
##
##
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 0 :379 0 :320 0 :473 0 :497 0 :639 0 :581
## 1 : 39 1 : 83 1 :104 1 :135 1 : 6 1 : 44
## 2 : 36 2 :177 NA's:106 2 : 13 NA's: 38 2 : 20
## 3 :191 3 : 65 NA's: 38 NA's: 38
## NA's: 38 NA's: 38
##
##
## sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor
## 0 :625 0 :407 0 :345 0 :476 0 :524 0 :513
## 1 : 20 1 :130 1 : 75 1 :115 1 : 67 1 : 64
## NA's: 38 2 : 14 2 : 57 NA's: 92 NA's: 92 NA's:106
## 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## seed.size shriveling roots
## 0 :532 0 :539 0 :551
## 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
A categorical predictor is considered degenerate (near-zero variance) when two conditions both hold: the fraction of unique values relative to the sample size is below roughly 10%, and the frequency of the second most prevalent value is only about 5% of that of the most prevalent value (a ratio of roughly 20 to 1). Every predictor here satisfies the first condition, since each has only a handful of levels. Both conditions hold, however, only for mycelium, sclerotia, shriveling, and leaf.mild. We should consider removing these predictors, as they may hurt the predictive power of a model that is sensitive to near-zero-variance predictors.
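These two cutoffs can be checked directly; the sketch below assumes the caret package is installed and uses its nearZeroVar() function, whose default freqCut of 95/5 and uniqueCut of 10 correspond to the criteria above.
# Frequency ratio, percent unique values, and near-zero-variance flag per predictor
caret::nearZeroVar(Soybean, saveMetrics = TRUE)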
The groups of predictors most likely to be missing are those describing the fruit, seed, and leaf attributes. There is a clear pattern of missingness related to the classes. Looking at the documentation for the data, we see that for many of these soybean attributes there is no level available to indicate the absence of the feature. For example, the leaf attributes almost all share the same count of 84 NAs, and the same is true of the count of 38 that recurs among the stem and canker attributes. It is very likely that most of these NAs encode the absence of the feature in question, which is important information in itself.
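A quick tally of missing values per predictor makes these repeated block totals easy to see:
# Count missing values in each predictor; repeated totals (e.g., 84 for the leaf
# attributes, 38 for the stem and canker attributes) suggest variables that go missing together
sort(colSums(is.na(Soybean)), decreasing = TRUE)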
For the missing values we have deemed meaningful, i.e., those that indicate the absence of an attribute, we should recode the NAs as a new factor level corresponding to absence. Some NAs will disappear anyway when the degenerate predictors (mycelium, sclerotia, shriveling, and leaf.mild) are removed. Finally, any remaining NAs that fall into neither of these categories can be imputed through KNN imputation.
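A minimal sketch of the first two steps is below, assuming forcats 1.0.0 or later for fct_na_value_to_level(); in practice only the columns whose NAs truly encode absence would be recoded this way, with the rest left for imputation.
# Drop the degenerate predictors, then recode NAs as an explicit "absent" level.
# Note: this recodes every remaining NA; a real workflow would restrict the recoding
# to the absence-type columns and impute the rest (e.g., with KNN).
Soybean_clean <- Soybean |>
  dplyr::select(-mycelium, -sclerotia, -shriveling, -leaf.mild) |>
  mutate(across(everything(), ~ forcats::fct_na_value_to_level(.x, level = "absent")))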