1

The UC Irvine Machine Learning Repository6 contains a data set related to my_glass identification. The data consist of 214 my_glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
  2. Do there appear to be any outliers in the data? Are any predictors skewed?
  3. Are there any relevant transformations of one or more predictors that might improve the classification model?
library(mlbench)
library(tidyverse)
library(fpp3)
library(gridExtra)
library(GGally)
library(fabletools)
library(caret)
library(ggplot2)
library("e1071")
library("reshape2")
 data(Glass)

my_glass<-Glass
# List of predictor names
predictors <- c("RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe")

# Loop over each predictor and plot its distribution
plot_list <- lapply(predictors, function(var) {
  median_value <- median(Glass[[var]], na.rm = TRUE)
  ggplot(my_glass, aes_string(var)) +
    geom_density( fill = "lightgreen", color = "black") +geom_vline(xintercept = median_value, color = "blue", linetype = "dashed") +
    
    ggtitle(paste("Dis", var))
})
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Display all histograms in a grid
grid.arrange(grobs = plot_list, ncol = 3)

 ggpairs(my_glass[, predictors])

plot_list2 <- lapply(predictors, function(var) {
  ggplot(my_glass, aes_string(x = '""', y = var)) +  # use aes_string() for dynamic variable names
    geom_boxplot() +
    ggtitle(paste("Boxplot of", var))  # concatenate string with 'paste()'
})
grid.arrange(grobs = plot_list2, ncol = 3)

predictors_gl<- my_glass[sapply(my_glass, is.numeric)]
skewValues <- apply( predictors_gl ,2, skewness)
head(skewValues)
##         RI         Na         Mg         Al         Si          K 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889

There are multiple varibles with many values of zeros as wll at outliers far out to on K. RI: 1.6027 (right-skewed) Na: 0.4478 (slightly right-skewed) Mg: -1.1365 (left-skewed) Al: 0.8946 (slightly right-skewed) Si: -0.7202 (left-skewed) K: 6.4601 (highly right-skewed

nearZeroVar(my_glass)
## integer(0)

Because we have multiple predictors with many observations of zero the Box Cox tranformation will not work here, instead we can try a Spatial Sign tranformation to mitigate the efects of outliers.

splom(~ my_glass[,-10], pch = 16, col = rgb(.2, .9, .4, .5), cex = .6,main= " Pre transformation")

cs_glass<-preProcess(my_glass[,-10], ,methos = c("center","scale"))
cspreddata<- predict(cs_glass, newdata = my_glass[,-10])
ss_glass <- spatialSign(cspreddata)
splom(~ ss_glass, pch = 16, col = rgb(.9, .4, .11, .5), cex = .6, main="Center scale Spatial Sign ")

2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
  2. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes? (c)Develop a strategy for handling missing data, either by eliminating predictors or imputation.
 data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
lapply(Soybean, function(x) if (is.factor(x)) table(x))
## $Class
## x
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20 
## 
## $date
## x
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90 
## 
## $plant.stand
## x
##   0   1 
## 354 293 
## 
## $precip
## x
##   0   1   2 
##  74 112 459 
## 
## $temp
## x
##   0   1   2 
##  80 374 199 
## 
## $hail
## x
##   0   1 
## 435 127 
## 
## $crop.hist
## x
##   0   1   2   3 
##  65 165 219 218 
## 
## $area.dam
## x
##   0   1   2   3 
## 123 227 145 187 
## 
## $sever
## x
##   0   1   2 
## 195 322  45 
## 
## $seed.tmt
## x
##   0   1   2 
## 305 222  35 
## 
## $germ
## x
##   0   1   2 
## 165 213 193 
## 
## $plant.growth
## x
##   0   1 
## 441 226 
## 
## $leaves
## x
##   0   1 
##  77 606 
## 
## $leaf.halo
## x
##   0   1   2 
## 221  36 342 
## 
## $leaf.marg
## x
##   0   1   2 
## 357  21 221 
## 
## $leaf.size
## x
##   0   1   2 
##  51 327 221 
## 
## $leaf.shread
## x
##   0   1 
## 487  96 
## 
## $leaf.malf
## x
##   0   1 
## 554  45 
## 
## $leaf.mild
## x
##   0   1   2 
## 535  20  20 
## 
## $stem
## x
##   0   1 
## 296 371 
## 
## $lodging
## x
##   0   1 
## 520  42 
## 
## $stem.cankers
## x
##   0   1   2   3 
## 379  39  36 191 
## 
## $canker.lesion
## x
##   0   1   2   3 
## 320  83 177  65 
## 
## $fruiting.bodies
## x
##   0   1 
## 473 104 
## 
## $ext.decay
## x
##   0   1   2 
## 497 135  13 
## 
## $mycelium
## x
##   0   1 
## 639   6 
## 
## $int.discolor
## x
##   0   1   2 
## 581  44  20 
## 
## $sclerotia
## x
##   0   1 
## 625  20 
## 
## $fruit.pods
## x
##   0   1   2   3 
## 407 130  14  48 
## 
## $fruit.spots
## x
##   0   1   2   4 
## 345  75  57 100 
## 
## $seed
## x
##   0   1 
## 476 115 
## 
## $mold.growth
## x
##   0   1 
## 524  67 
## 
## $seed.discolor
## x
##   0   1 
## 513  64 
## 
## $seed.size
## x
##   0   1 
## 532  59 
## 
## $shriveling
## x
##   0   1 
## 539  38 
## 
## $roots
## x
##   0   1   2 
## 551  86  15

Some variables have very skewed distributions, indicating potential degeneration:

leaves: Most cases are in category 1 (606 ), mycelium: Category 0 dominates (639 ) sclerotia: Most cases (625) are in category 0.

Now lets see about missing data

missing_data <- sapply(Soybean, function(x) sum(is.na(x)))
missing_data <- data.frame(Predictor = names(missing_data), MissingCount = missing_data)
missing_data <- missing_data[missing_data$MissingCount > 0, ]
print(missing_data)
##                       Predictor MissingCount
## date                       date            1
## plant.stand         plant.stand           36
## precip                   precip           38
## temp                       temp           30
## hail                       hail          121
## crop.hist             crop.hist           16
## area.dam               area.dam            1
## sever                     sever          121
## seed.tmt               seed.tmt          121
## germ                       germ          112
## plant.growth       plant.growth           16
## leaf.halo             leaf.halo           84
## leaf.marg             leaf.marg           84
## leaf.size             leaf.size           84
## leaf.shread         leaf.shread          100
## leaf.malf             leaf.malf           84
## leaf.mild             leaf.mild          108
## stem                       stem           16
## lodging                 lodging          121
## stem.cankers       stem.cankers           38
## canker.lesion     canker.lesion           38
## fruiting.bodies fruiting.bodies          106
## ext.decay             ext.decay           38
## mycelium               mycelium           38
## int.discolor       int.discolor           38
## sclerotia             sclerotia           38
## fruit.pods           fruit.pods           84
## fruit.spots         fruit.spots          106
## seed                       seed           92
## mold.growth         mold.growth           92
## seed.discolor     seed.discolor          106
## seed.size             seed.size           92
## shriveling           shriveling          106
## roots                     roots           31
missing_data$missing_percent <- (missing_data$MissingCount / nrow(Soybean)) * 100
print(missing_data)
##                       Predictor MissingCount missing_percent
## date                       date            1       0.1464129
## plant.stand         plant.stand           36       5.2708638
## precip                   precip           38       5.5636896
## temp                       temp           30       4.3923865
## hail                       hail          121      17.7159590
## crop.hist             crop.hist           16       2.3426061
## area.dam               area.dam            1       0.1464129
## sever                     sever          121      17.7159590
## seed.tmt               seed.tmt          121      17.7159590
## germ                       germ          112      16.3982430
## plant.growth       plant.growth           16       2.3426061
## leaf.halo             leaf.halo           84      12.2986823
## leaf.marg             leaf.marg           84      12.2986823
## leaf.size             leaf.size           84      12.2986823
## leaf.shread         leaf.shread          100      14.6412884
## leaf.malf             leaf.malf           84      12.2986823
## leaf.mild             leaf.mild          108      15.8125915
## stem                       stem           16       2.3426061
## lodging                 lodging          121      17.7159590
## stem.cankers       stem.cankers           38       5.5636896
## canker.lesion     canker.lesion           38       5.5636896
## fruiting.bodies fruiting.bodies          106      15.5197657
## ext.decay             ext.decay           38       5.5636896
## mycelium               mycelium           38       5.5636896
## int.discolor       int.discolor           38       5.5636896
## sclerotia             sclerotia           38       5.5636896
## fruit.pods           fruit.pods           84      12.2986823
## fruit.spots         fruit.spots          106      15.5197657
## seed                       seed           92      13.4699854
## mold.growth         mold.growth           92      13.4699854
## seed.discolor     seed.discolor          106      15.5197657
## seed.size             seed.size           92      13.4699854
## shriveling           shriveling          106      15.5197657
## roots                     roots           31       4.5387994

Lets fidnout if its related to a class by grouping by class

library(dplyr)

class_missing <- Soybean %>%
    group_by(Class) %>%
    summarise(across(everything(), ~sum(is.na(.)), .names = "Missing_{col}"))

print(class_missing)
## # A tibble: 19 × 36
##    Class            Missing_date Missing_plant.stand Missing_precip Missing_temp
##    <fct>                   <int>               <int>          <int>        <int>
##  1 2-4-d-injury                1                  16             16           16
##  2 alternarialeaf-…            0                   0              0            0
##  3 anthracnose                 0                   0              0            0
##  4 bacterial-blight            0                   0              0            0
##  5 bacterial-pustu…            0                   0              0            0
##  6 brown-spot                  0                   0              0            0
##  7 brown-stem-rot              0                   0              0            0
##  8 charcoal-rot                0                   0              0            0
##  9 cyst-nematode               0                  14             14           14
## 10 diaporthe-pod-&…            0                   6              0            0
## 11 diaporthe-stem-…            0                   0              0            0
## 12 downy-mildew                0                   0              0            0
## 13 frog-eye-leaf-s…            0                   0              0            0
## 14 herbicide-injury            0                   0              8            0
## 15 phyllosticta-le…            0                   0              0            0
## 16 phytophthora-rot            0                   0              0            0
## 17 powdery-mildew              0                   0              0            0
## 18 purple-seed-sta…            0                   0              0            0
## 19 rhizoctonia-roo…            0                   0              0            0
## # ℹ 31 more variables: Missing_hail <int>, Missing_crop.hist <int>,
## #   Missing_area.dam <int>, Missing_sever <int>, Missing_seed.tmt <int>,
## #   Missing_germ <int>, Missing_plant.growth <int>, Missing_leaves <int>,
## #   Missing_leaf.halo <int>, Missing_leaf.marg <int>, Missing_leaf.size <int>,
## #   Missing_leaf.shread <int>, Missing_leaf.malf <int>,
## #   Missing_leaf.mild <int>, Missing_stem <int>, Missing_lodging <int>,
## #   Missing_stem.cankers <int>, Missing_canker.lesion <int>, …

Lets make a map of missing predictors per class

missing_heatmap <- melt(class_missing, id.vars = "Class")
ggplot(missing_heatmap, aes(x = variable, y = Class)) +
    geom_tile(aes(fill = value), color = "white") +
    scale_fill_gradient(low = "white", high = "maroon") +
    theme_minimal() +
    labs(title = "Missing Data Pattern by Class", x = "Predictors", y = "Class")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))

From the output above, we observe that there are several predictors with a significant amount of missing data, while others are fully complete. The classes “2-4-d-injury,” “cyst-nematode,” and “herbicide-injury” exhibit nearly complete missingness for many predictors, whereas “phytophthora-rot” shows a high rate of missing data across multiple predictors.

We could consider removing the classes that have a high rate of missing values, as imputation might be challenging due to complete missingness in certain classes. In this case, I would prefer to use a model that is not sensitive to imbalance or missing predictors, such as Naive Bayes. This approach can help mitigate the impact of the missing data while still allowing for effective classification.. ```