Exercises

3.1.

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.5.2
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

To analyze the distributions, we use Density Plots for individual variables and a Correlation Matrix for relationships.

Individual Distributions:

RI (Refractive Index) and Ca: Show a nearly normal distribution but with a long right tail (outliers).

Mg: Highly bimodal. Samples tend to have either ~0% or ~3.5% Magnesium. This is a key “signature” for certain glass types.

K, Ba, Fe: Most samples are clustered at 0. These are “sparse” predictors with extreme right skewness.

Relationships:

RI vs. Ca: There is a strong positive linear relationship (\(r \approx 0.81\)).

Other elements show very weak correlations, suggesting they provide independent information to the model.

library(mlbench)
library(caret)
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Loading required package: lattice
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.2
## corrplot 0.95 loaded
data(Glass)

# 1. Density plots to understand individual distributions
featurePlot(x = Glass[, 1:9], 
            y = Glass$Type, 
            plot = "density", 
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")),
            auto.key = list(columns = 3))

# 2. Correlation matrix to understand relationships between predictors
glass_cor <- cor(Glass[, 1:9])
corrplot(glass_cor, method = "number", type = "upper")

  1. Do there appear to be any outliers in the data? Are any predictors skewed?
library(e1071)
## Warning: package 'e1071' was built under R version 4.5.2
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:ggplot2':
## 
##     element
# 1. Calculate skewness for all numeric predictors
skew_values <- apply(Glass[, 1:9], 2, skewness)
print("Skewness Values:")
## [1] "Skewness Values:"
print(sort(skew_values, decreasing = TRUE))
##          K         Ba         Ca         Fe         RI         Al         Na 
##  6.4600889  3.3686800  2.0184463  1.7298107  1.6027151  0.8946104  0.4478343 
##         Si         Mg 
## -0.7202392 -1.1364523
# 2. Identify outliers using boxplots

featurePlot(x = Glass[, 1:9], 
            y = Glass$Type, 
            plot = "box", 
            scales = list(y = list(relation="free")))

Outliers: K and Ba have several extreme values that are many standard deviations away from the mean. These will likely pull the “center” of a linear model away from the majority of the data.

  1. Are there relevant transformations of one or more predictors that might improve the classification model?

To improve a classification model (like LDA or Neural Networks), Kuhn & Johnson suggest:

Yeo-Johnson Transformation: Since many predictors (Ba, Fe, K) have zeros, the standard Box-Cox cannot be used. Yeo-Johnson will stabilize the variance and reduce skewness.

Centering and Scaling: Essential because the units vary (RI is around 1.5, while Si is around 70).

Spatial Sign: In Chapter 3, the authors emphasize that if outliers remain after transformation, a Spatial Sign transformation projects the data onto a unit sphere, effectively “moving” outliers closer to the rest of the data.

# 1. Define the pre-processing requirements: Center, Scale, and Yeo-Johnson

glass_transform_plan <- preProcess(Glass[, 1:9], method = c("center", "scale", "YeoJohnson"))

# 2. Apply the transformations to the data
glass_transformed <- predict(glass_transform_plan, Glass[, 1:9])

# 3. Verify improvement in skewness
new_skew <- apply(glass_transformed, 2, skewness)
print("Skewness after Yeo-Johnson:")
## [1] "Skewness after Yeo-Johnson:"
print(sort(new_skew, decreasing = TRUE))
##            Ba            Fe            RI            Al            Na 
##  3.3686799688  1.7298107096  1.6027150827  0.0002128321 -0.0088476749 
##             K            Ca            Si            Mg 
## -0.0708227694 -0.2063893005 -0.7202392108 -0.8770969306

3.2.

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
?Soybean
## starting httpd help server ... done
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Kuhn & Johnson describe “degenerate” predictors as those with Near-Zero Variance (NZV). These variables have a single value for almost all samples, providing no information to the model.

library(caret)
data(Soybean)

# Manual check for degenerate predictors
check_nzv <- function(x) {
  tab <- table(x, useNA = "no")
  if(length(tab) <= 1) return(TRUE)
  tab <- sort(tab, decreasing = TRUE)
  ratio <- tab[1] / tab[2]
  return(ratio > 19)
  }

degenerate_vars <- sapply(Soybean[, -1], check_nzv)
print(names(degenerate_vars[degenerate_vars == TRUE]))
## [1] "leaf.mild.0" "mycelium.0"  "sclerotia.0"

Analysis:

Several predictors (like leaf.mild, mycelium, and sclerotia) are degenerate.

They have a high frequency ratio (the most common value appears much more often than the second).

In Chapter 3, the authors recommend removing these variables because they can cause numerical instability in models like Linear Discriminant Analysis.

  1. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The authors emphasize that we must determine if data is Missing At Random (MAR) or if there is a pattern related to the outcome.

library(VIM)
## Warning: package 'VIM' was built under R version 4.5.2
## Loading required package: colorspace
## Warning: package 'colorspace' was built under R version 4.5.2
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
# 1. Visualize missingness pattern
aggr(Soybean, prop = FALSE, numbers = TRUE, sortVars = TRUE)

## 
##  Variables sorted by number of missings: 
##         Variable Count
##             hail   121
##            sever   121
##         seed.tmt   121
##          lodging   121
##             germ   112
##        leaf.mild   108
##  fruiting.bodies   106
##      fruit.spots   106
##    seed.discolor   106
##       shriveling   106
##      leaf.shread   100
##             seed    92
##      mold.growth    92
##        seed.size    92
##        leaf.halo    84
##        leaf.marg    84
##        leaf.size    84
##        leaf.malf    84
##       fruit.pods    84
##           precip    38
##     stem.cankers    38
##    canker.lesion    38
##        ext.decay    38
##         mycelium    38
##     int.discolor    38
##        sclerotia    38
##      plant.stand    36
##            roots    31
##             temp    30
##        crop.hist    16
##     plant.growth    16
##             stem    16
##             date     1
##         area.dam     1
##            Class     0
##           leaves     0
# 2. Check if missingness is related to the Class
missing_by_class <- table(Soybean$Class, complete.cases(Soybean))
print(missing_by_class)
##                              
##                               FALSE TRUE
##   2-4-d-injury                   16    0
##   alternarialeaf-spot             0   91
##   anthracnose                     0   44
##   bacterial-blight                0   20
##   bacterial-pustule               0   20
##   brown-spot                      0   92
##   brown-stem-rot                  0   44
##   charcoal-rot                    0   20
##   cyst-nematode                  14    0
##   diaporthe-pod-&-stem-blight    15    0
##   diaporthe-stem-canker           0   20
##   downy-mildew                    0   20
##   frog-eye-leaf-spot              0   91
##   herbicide-injury                8    0
##   phyllosticta-leaf-spot          0   20
##   phytophthora-rot               68   20
##   powdery-mildew                  0   20
##   purple-seed-stain               0   20
##   rhizoctonia-root-rot            0   20

Analysis:

Approximately 18% of the data is missing. Predictors like hail, sever, and seed.tmt have the most missing values.

The missing data is not random. Specific classes (e.g., phytophthora-rot) contain almost all the missing values. If we remove rows with missing data, we might delete entire categories of diseases from our model.

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
library(mlbench)
data(Soybean)

# 1. REMOVE DEGENERATE PREDICTORS (Manually)

degenerate_cols <- c("leaf.mild", "mycelium", "sclerotia")
soybean_filtered <- Soybean[, !(names(Soybean) %in% degenerate_cols)]

# 2. HANDLE MISSING DATA

soybean_clean <- soybean_filtered

for(col_name in names(soybean_clean)) {
  if(is.factor(soybean_clean[[col_name]])) {
    # Add a new factor level called "NA_category"
    levels(soybean_clean[[col_name]]) <- c(levels(soybean_clean[[col_name]]), "NA_category")
    
    # Replace actual NA values with this new level
    soybean_clean[is.na(soybean_clean[[col_name]]), col_name] <- "NA_category"
  }
}

# 3. VERIFY
print(paste("Remaining NAs:", sum(is.na(soybean_clean))))
## [1] "Remaining NAs: 0"
summary(soybean_clean$hail)
##           0           1 NA_category 
##         435         127         121

Analysis:

Filtering: We manually removed the predictors identified in part (a). By removing leaf.mild, mycelium, and sclerotia, we prevent the model from struggling with variables that have no predictive variance.

Informative Coding: Instead of “guessing” values (imputation), we converted NA into a specific level (NA_category).

Reasoning: In the Soybean dataset, missing values are not random. Certain disease classes are missing specific measurements. By making “Missing” a category, the model can learn that “the absence of this symptom” is actually a diagnostic feature of that specific disease.