3.1. The UC Irvine Machine Learning Repository
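
The code below assumes the following packages are loaded: mlbench provides the Glass and Soybean data sets, tidyr and ggplot2/GGally handle the reshaping and plotting, e1071 provides skewness, and caret provides BoxCoxTrans and nearZeroVar.

library(mlbench)  # Glass and Soybean data sets
library(tidyr)    # pivot_longer
library(ggplot2)  # ggplot
library(GGally)   # ggpairs
library(e1071)    # skewness
library(caret)    # BoxCoxTrans, nearZeroVar, preProcess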

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Glass_long <- pivot_longer(Glass, -Type, names_to="predictor", values_to="value")

ggpairs(Glass)

The ggpairs correlation values show that most pairwise correlations between predictors are weak, with the majority at or below 0.44 in absolute value. The one exception is the relationship between the refractive index (RI) and the Calcium (Ca) content, which is the only strongly correlated pair.
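
To put a number on that relationship, we can pull the pairwise correlations directly from the numeric predictors:

cor_matrix <- cor(Glass[, 1:9])  # correlations among the numeric predictors
cor_matrix["RI", "Ca"]           # the one strongly correlated pair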

ggplot(Glass_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~predictor, scales="free")

The density plots show how each predictor's values are distributed. Looking at these we can see that while Al, Na, and Si all have relatively normal-looking, unimodal distributions, Mg, K, and RI appear bimodal, and the remaining predictors look heavily skewed.

b) Do there appear to be any outliers in the data? Are any predictors skewed?

ggplot(Glass_long, aes(x=Type, y=value)) +
  geom_boxplot() +
  facet_wrap(~predictor, scales="free")

To see the outliers more clearly we can look at boxplots. Split by glass type, every predictor except Magnesium shows a large number of outliers.

ggplot(Glass_long, aes(x=value)) +
  geom_boxplot() +
  facet_wrap(~predictor, scales="free")

The boxplots also confirm our reading of the density plots. The outliers for Si, Na, Al, and Ca mostly lie on both sides of the interquartile range, while Fe, Ba, and K have outliers only to the right of their interquartile ranges, giving their distributions long right tails (i.e., they are right-skewed).
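
To back this up numerically, a small sketch that counts the points falling outside the usual 1.5 x IQR whiskers for each predictor (count_outliers is a helper written here, not a library function):

# Count points outside the boxplot whisker rule (1.5 * IQR) per predictor
count_outliers <- function(x) {
  qs  <- quantile(x, c(0.25, 0.75))
  iqr <- qs[2] - qs[1]
  sum(x < qs[1] - 1.5 * iqr | x > qs[2] + 1.5 * iqr)
}
sapply(Glass[, 1:9], count_outliers)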

We can also verify this with the skewness function from the e1071 package:

num_vars <- Glass[, 1:9]
skew_before <- apply(num_vars, 2, skewness)
skew_before
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

Ordering the predictors from least to most skewed: Na, Si, and Al have skewness values between -1 and 1; RI, Mg, Fe, and Ca lie roughly within +/-2; and Ba and K are by far the most skewed. This is in line with the plots we saw before.
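
Sorting the absolute skewness values makes that ordering explicit:

sort(abs(skew_before))  # least to most skewed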

c) Are there any relevant transformations of one or more predictors that might improve the classification model?

We could use the Box-Cox transformation, but since Box-Cox requires strictly positive data we would first have to add a small constant to the predictors that contain zero values (Mg, K, Ba, Fe) so as not to break the transformation. With this transformation we should be able to address the skewness and improve the classification model.

# Add a small offset so Box-Cox (which requires strictly positive data)
# can handle the predictors that contain zeros
Glass[, c("Mg","K","Ba","Fe")] <- Glass[, c("Mg","K","Ba","Fe")] + 1e-6

# Estimate a Box-Cox transformation for x, apply it, and return the new skewness
boxcox_skewness <- function(x) {
  BCT <- BoxCoxTrans(x)
  skewness(predict(BCT, x))
}

# Column 10 is the Type factor, so drop it before applying
skew_after <- apply(Glass[, -10], 2, boxcox_skewness)
skew_after
##          RI          Na          Mg          Al          Si           K 
##  1.56566039  0.03384644 -1.43270870  0.09105899 -0.65090568 -0.78216211 
##          Ca          Ba          Fe 
## -0.19395573  1.67566612  0.74424403

Comparing the skewness values before and after the transformation:

skew_table <- data.frame(
  Predictor = names(skew_before),
  Skew_Before = round(skew_before, 2),
  Skew_After  = round(skew_after, 2)
)
skew_table
##    Predictor Skew_Before Skew_After
## RI        RI        1.60       1.57
## Na        Na        0.45       0.03
## Mg        Mg       -1.14      -1.43
## Al        Al        0.89       0.09
## Si        Si       -0.72      -0.65
## K          K        6.46      -0.78
## Ca        Ca        2.02      -0.19
## Ba        Ba        3.37       1.68
## Fe        Fe        1.73       0.74

The transformation handled the most skewed predictors well, with K going from 6.46 to -0.78 and Ba going from 3.37 to 1.68, so the transformed predictors should be a much better input set for a classification model. Note, however, that RI barely improves and Mg actually gets slightly worse, so Box-Cox is not a cure-all here.
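
As an alternative to adding an offset by hand, caret's preProcess also supports the Yeo-Johnson transformation, a Box-Cox variant that is defined at zero, so the zero-containing predictors can be transformed directly. A sketch of that route:

# Yeo-Johnson handles zeros natively, no offset needed
yj <- preProcess(Glass[, 1:9], method = "YeoJohnson")
Glass_yj <- predict(yj, Glass[, 1:9])
apply(Glass_yj, 2, skewness)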

3.2. The Soybean Data

data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Models are often crippled by predictors with degenerate distributions, for example predictors with a single unique value, or values so unbalanced that they have near-zero variance. It is usually advantageous to find and remove such predictors. The caret package has the function nearZeroVar, which returns the column numbers of predictors that meet these criteria.

NZV_variables <- nearZeroVar(Soybean)
colnames(Soybean)[NZV_variables]
## [1] "leaf.mild" "mycelium"  "sclerotia"

We see that the three variables that meet this criterion are leaf.mild, mycelium, and sclerotia. Removing them should benefit the model.
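
Dropping them is a one-liner:

Soybean_filtered <- Soybean[, -NZV_variables]  # remove near-zero-variance predictors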

c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The two main options are eliminating predictors or imputing missing values. A predictor that is missing for entire classes provides no information for those classes; in this data the missing values are concentrated in a handful of classes, including 2-4-d-injury, cyst-nematode, and herbicide-injury. For the remaining predictors we could impute with the most frequent category, or with k-nearest neighbors if correlations exist among the predictors. Since missingness itself appears to be related to class, at least for the classes with the most missing values, we could also encode it as a new indicator variable rather than discarding it; a sketch of the counting and imputation steps follows.
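
A minimal sketch of both pieces of the strategy, assuming most-frequent-category imputation (impute_mode is a helper defined here, not a library function):

# Missing values per predictor, most affected first
sort(colSums(is.na(Soybean)), decreasing = TRUE)

# Proportion of incomplete rows within each class
incomplete <- !complete.cases(Soybean)
sort(tapply(incomplete, Soybean$Class, mean), decreasing = TRUE)

# Most-frequent-category imputation for the factor predictors
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))  # table() drops NAs by default
  x
}
Soybean_imputed <- Soybean
Soybean_imputed[, -1] <- lapply(Soybean_imputed[, -1], impute_mode)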