3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(DataExplorer)
library(naniar)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(grid)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# Histograms of the nine numeric predictors (the Type factor is excluded)
par(mfrow = c(3, 4))
par(mai = c(.3, .3, .3, .3))
factors <- c("Type")
variables <- names(Glass)
for (i in 1:(length(variables) - 1)) {
    if (!variables[i] %in% factors) {
        hist(Glass[[variables[i]]], main = variables[i], col = "lightblue")
    }
}

The distributions vary considerably across the numeric predictors; several of the element percentages are clearly skewed rather than approximately normal:

ggplot(data=Glass,aes(Type)) +
    geom_bar() +
    labs(title='Type Frequencies')

The shape of a factor variable matters less than that of a numeric predictor, provided its distribution is not degenerate; however, the observations are heavily concentrated in types 1 and 2, which may have meaningful relationships with the other predictors.
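
A quick tally makes the class imbalance concrete; this is a minimal sketch using dplyr::count(), which is already loaded via the tidyverse:

Glass |>
    count(Type) |>
    mutate(prop = round(n / sum(n), 2))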

Let’s build a correlation plot to see the strength of the linear relationships among the predictors:

corrplot(cor(Glass |> dplyr::select(-Type)),
         method="color",
         diag=FALSE,
         type="lower",
         addCoef.col = "black",
         number.cex=0.70)

Most predictor pairs are not strongly correlated with one another, although several predictors are moderately linearly related to RI and Mg. The one exception is Ca and RI, with a correlation of 0.81, which is by far the strongest positive or negative relationship.
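
As a cross-check on the plot, the strongest pairwise correlation can be extracted programmatically. This is a small base-R sketch, not part of the original analysis:

# Identify the most strongly correlated pair of predictors
cor_mat <- cor(Glass |> dplyr::select(-Type))
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA   # keep each pair only once
idx <- which(abs(cor_mat) == max(abs(cor_mat), na.rm = TRUE), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[idx[, 1]],
           var2 = colnames(cor_mat)[idx[, 2]],
           correlation = cor_mat[idx])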

  (b) Do there appear to be any outliers in the data? Are any predictors skewed?
Glass |>
    select_if(is.numeric) |>
    pivot_longer(cols = everything()) |>
    # filter(name != 'Si') |>   # optionally drop Si, which dominates the scale
    ggplot(aes(y = value, colour = name)) +
    geom_boxplot()

The boxplot indicates a few outliers in most of the predictors, although they are most pronounced and frequent in Ca and Na. There are also a few outliers in Si, Ba, Al, and K.

As noted above when reviewing the distributions, many predictors are skewed: RI, Mg, Al, Si, K, Ca, Ba, and Fe. Almost all of the elements show some degree of skew in this data set.
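
Sample skewness statistics back up the visual impression; a short sketch assuming the e1071 package is installed (it is not loaded above):

library(e1071)
# Skewness of each numeric predictor, largest first
Glass |>
    select_if(is.numeric) |>
    sapply(skewness) |>
    sort(decreasing = TRUE)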

  (c) Are there any relevant transformations of one or more predictors that might improve the classification model?
# Replace exact zeros with a small positive value so Box-Cox can be applied
glass_bc_prep <- Glass |>
    select_if(is.numeric) |>
    mutate(rownum = row_number()) |>
    pivot_longer(cols = c('Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'RI', 'Fe')) |>
    mutate(value_adj = ifelse(value == 0, 0.001, value)) |>
    pivot_wider(id_cols = rownum, names_from = name, values_from = value_adj) |>
    dplyr::select(-rownum)
# Estimate the Box-Cox lambda for each predictor from an intercept-only model
skewed <- colnames(glass_bc_prep)
lambdas <- c()

for (i in seq_along(skewed)) {
    bc <- boxcox(lm(glass_bc_prep[[skewed[i]]] ~ 1),
                 lambda = seq(-2, 2, length.out = 81),
                 plotit = FALSE)
    lambdas <- append(lambdas, bc$x[which.max(bc$y)])
}
lambdas <- data.frame(skewed = skewed, lambdas = round(lambdas, 2))
knitr::kable(lambdas, format = "simple")
skewed    lambdas
-------  --------
Na          -0.10
Mg           0.55
Al           0.50
Si           2.00
K            0.35
Ca          -1.10
Ba          -0.85
RI          -2.00
Fe          -0.45

Based on the optimal power transformations suggested by Box-Cox, most of the columns do not have a simple, clearly beneficial transformation, but we will replot them all to see what the distributions look like after applying the estimated lambdas.

# Apply the estimated lambdas from the table above to each predictor
glass_bc <- glass_bc_prep |>
    mutate(na_mod = Na^-0.1,  mg_mod = Mg^0.55, al_mod = Al^0.5,
           si_mod = Si^2,     k_mod  = K^0.35,  ca_mod = Ca^-1.1,
           ba_mod = Ba^-0.85, ri_mod = RI^-2,   fe_mod = Fe^-0.45) |>
    dplyr::select(contains('mod'))
par(mfrow = c(3, 3))
par(mai = c(.3, .3, .3, .3))
bc_vars <- names(glass_bc)
for (i in seq_along(bc_vars)) {
    hist(glass_bc[[bc_vars[i]]], main = bc_vars[i], col = "lightblue")
}

After applying these maximum-likelihood transformations to the Glass data, few of them look clearly beneficial. Na and Al appear closer to normal, as expected, and the Ca and Si transformations may be useful, but the remaining transformations do not appear to be effective.
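
As an alternative to the zero-offset-plus-Box-Cox approach above, the Yeo-Johnson transformation is defined at zero, so it handles the exact zeros in K, Ba, and Fe directly. This sketch assumes the caret package is installed:

library(caret)
# Estimate Yeo-Johnson transformations for all numeric predictors at once
pp <- preProcess(Glass |> dplyr::select(-Type), method = "YeoJohnson")
glass_yj <- predict(pp, Glass |> dplyr::select(-Type))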

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
  (a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 

There is a substantial number of missing (NA) values across many of the predictor columns; we explore these in the plots below. Most of the predictors have at least two levels, although the observations are heavily concentrated in one of them.

# guidance found here: https://stackoverflow.com/questions/67158295/bar-plot-for-each-column-in-a-data-frame-in-r
# Build one bar plot per column, colored by level, with rotated x-axis labels
all_plots <- lapply(names(Soybean), function(col) {
  ggplot(Soybean, aes(.data[[col]], fill = .data[[col]])) +
    geom_bar(position = "dodge") +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
})


The first plot shows the frequency of each of the 19 classes in the Soybean data set; the remaining panels show the individual predictor distributions.

grid.arrange(grobs=all_plots[1],ncol=1)

grid.arrange(grobs=all_plots[2:10], ncol= 3)

grid.arrange(grobs=all_plots[11:19], ncol= 3)

grid.arrange(grobs=all_plots[20:29], ncol= 3)

grid.arrange(grobs=all_plots[30:36], ncol= 3)

Looking at the frequency distributions, many of the features take on only two or three values, and it would require further investigation to see how often specific values of different predictors co-occur, as in the sketch below. In general the zero level is the most common for the vast majority of the features, although there are certainly exceptions to that pattern (e.g., stem).
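
As a starting point for that kind of pairwise review, a simple cross-tabulation shows how often specific levels of two predictors co-occur; leaf.halo and leaf.marg are purely illustrative choices:

# Cross-tab of two predictors, including their missing values
with(Soybean, table(leaf.halo, leaf.marg, useNA = "ifany"))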

Strictly speaking, no column takes a single value, so none meets the definition of a degenerate distribution, although mycelium and sclerotia come close, with very few cases outside the zero level.
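
A programmatic check for near-degenerate predictors is available via caret's nearZeroVar(), assuming caret is installed (it is not loaded above); the nearly degenerate columns noted here should be flagged:

library(caret)
# Flag predictors with near-zero variance (one heavily dominant level)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]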

  (b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
p1 <- plot_missing(Soybean, missing_only = TRUE,
                   ggtheme = theme_classic())

Given that several groups of columns are missing exactly the same number of records, it is likely that those predictors are missing together for the same observations and are closely related to one another.

A few features have the highest rate of missing values, 17.72% (121 of 683 records): hail, lodging, seed.tmt, and sever, with germ close behind at 16.40%. Most of the column groups have a small enough share of NAs that imputing new values could be reasonable.
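
The same information is available in tabular form from naniar, which is already loaded:

# Missing-value counts and percentages per variable, highest first
miss_var_summary(Soybean) |> head(10)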

gg_miss_fct(x = Soybean, fct = Class) + labs(title = 'Missing Values by Class')

This missing-value plot grouped by Class is very useful for seeing where the missing values are concentrated. It appears that only a few classes account for all of the missing values, which warrants special treatment or review to understand why this concentration occurs. It seems unlikely that these values are missing completely at random; more likely the missingness is related to the class in some way.
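
To list exactly which classes contain the incomplete records, a short dplyr sketch:

# Count incomplete rows per class; only classes with at least one NA are shown
Soybean |>
    mutate(incomplete = !complete.cases(Soybean)) |>
    group_by(Class) |>
    summarise(n = n(), n_incomplete = sum(incomplete)) |>
    filter(n_incomplete > 0)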

  (c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

While imputation might make sense for diaporthe-pod-&-stem-canker or phytophthora-rot, it would probably be more practical to consider separately the classes that contain all of the missing values and determine whether they represent slightly different types of soybeans, or whether there is some rationale that explains why they have so many missing values. Is there truly some unusual measurement error that corrupted these values? Without understanding their differences from and similarities to the other classes, it would be challenging to estimate them reasonably via imputation. A model built on the remainder of the data (i.e., the classes without missing values) might be a useful starting point.
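
A minimal sketch of that starting point, keeping only the fully observed classes and deferring imputation for the remaining classes to a later step:

# Classes that contain at least one incomplete record
classes_with_na <- Soybean |>
    filter(!complete.cases(Soybean)) |>
    distinct(Class) |>
    pull(Class) |>
    as.character()

# Fully observed subset to use as an initial modeling data set
soybean_complete <- Soybean |>
    filter(!Class %in% classes_with_na) |>
    droplevels()

dim(soybean_complete)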