The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.5.2
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
To analyze the distributions, we use Density Plots for individual variables and a Correlation Matrix for relationships.
Individual Distributions:
RI (Refractive Index) and Ca: Show a nearly normal distribution but with a long right tail (outliers).
Mg: Highly bimodal. Samples tend to have either ~0% or ~3.5% Magnesium. This is a key “signature” for certain glass types.
K, Ba, Fe: Most samples are clustered at 0. These are “sparse” predictors with extreme right skewness.
Relationships:
RI vs. Ca: There is a strong positive linear relationship (\(r \approx 0.81\)).
Other elements show very weak correlations, suggesting they provide independent information to the model.
library(mlbench)
library(caret)
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Loading required package: lattice
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.2
## corrplot 0.95 loaded
data(Glass)
# 1. Density plots to understand individual distributions
featurePlot(x = Glass[, 1:9],
y = Glass$Type,
plot = "density",
scales = list(x = list(relation="free"),
y = list(relation="free")),
auto.key = list(columns = 3))
# 2. Correlation matrix to understand relationships between predictors
glass_cor <- cor(Glass[, 1:9])
corrplot(glass_cor, method = "number", type = "upper")
library(e1071)
## Warning: package 'e1071' was built under R version 4.5.2
##
## Attaching package: 'e1071'
## The following object is masked from 'package:ggplot2':
##
## element
# 1. Calculate skewness for all numeric predictors
skew_values <- apply(Glass[, 1:9], 2, skewness)
print("Skewness Values:")
## [1] "Skewness Values:"
print(sort(skew_values, decreasing = TRUE))
## K Ba Ca Fe RI Al Na
## 6.4600889 3.3686800 2.0184463 1.7298107 1.6027151 0.8946104 0.4478343
## Si Mg
## -0.7202392 -1.1364523
# 2. Identify outliers using boxplots
featurePlot(x = Glass[, 1:9],
y = Glass$Type,
plot = "box",
scales = list(y = list(relation="free")))
Outliers: K and Ba have several extreme values that are many standard deviations away from the mean. These will likely pull the “center” of a linear model away from the majority of the data.
To improve a classification model (like LDA or Neural Networks), Kuhn & Johnson suggest:
Yeo-Johnson Transformation: Since many predictors (Ba, Fe, K) have zeros, the standard Box-Cox cannot be used. Yeo-Johnson will stabilize the variance and reduce skewness.
Centering and Scaling: Essential because the units vary (RI is around 1.5, while Si is around 70).
Spatial Sign: In Chapter 3, the authors emphasize that if outliers remain after transformation, a Spatial Sign transformation projects the data onto a unit sphere, effectively “moving” outliers closer to the rest of the data.
# 1. Define the pre-processing requirements: Center, Scale, and Yeo-Johnson
glass_transform_plan <- preProcess(Glass[, 1:9], method = c("center", "scale", "YeoJohnson"))
# 2. Apply the transformations to the data
glass_transformed <- predict(glass_transform_plan, Glass[, 1:9])
# 3. Verify improvement in skewness
new_skew <- apply(glass_transformed, 2, skewness)
print("Skewness after Yeo-Johnson:")
## [1] "Skewness after Yeo-Johnson:"
print(sort(new_skew, decreasing = TRUE))
## Ba Fe RI Al Na
## 3.3686799688 1.7298107096 1.6027150827 0.0002128321 -0.0088476749
## K Ca Si Mg
## -0.0708227694 -0.2063893005 -0.7202392108 -0.8770969306
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
?Soybean
## starting httpd help server ... done
Kuhn & Johnson describe “degenerate” predictors as those with Near-Zero Variance (NZV). These variables have a single value for almost all samples, providing no information to the model.
library(caret)
data(Soybean)
# Manual check for degenerate predictors
check_nzv <- function(x) {
tab <- table(x, useNA = "no")
if(length(tab) <= 1) return(TRUE)
tab <- sort(tab, decreasing = TRUE)
ratio <- tab[1] / tab[2]
return(ratio > 19)
}
degenerate_vars <- sapply(Soybean[, -1], check_nzv)
print(names(degenerate_vars[degenerate_vars == TRUE]))
## [1] "leaf.mild.0" "mycelium.0" "sclerotia.0"
Analysis:
Several predictors (like leaf.mild, mycelium, and sclerotia) are degenerate.
They have a high frequency ratio (the most common value appears much more often than the second).
In Chapter 3, the authors recommend removing these variables because they can cause numerical instability in models like Linear Discriminant Analysis.
The authors emphasize that we must determine if data is Missing At Random (MAR) or if there is a pattern related to the outcome.
library(VIM)
## Warning: package 'VIM' was built under R version 4.5.2
## Loading required package: colorspace
## Warning: package 'colorspace' was built under R version 4.5.2
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# 1. Visualize missingness pattern
aggr(Soybean, prop = FALSE, numbers = TRUE, sortVars = TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## hail 121
## sever 121
## seed.tmt 121
## lodging 121
## germ 112
## leaf.mild 108
## fruiting.bodies 106
## fruit.spots 106
## seed.discolor 106
## shriveling 106
## leaf.shread 100
## seed 92
## mold.growth 92
## seed.size 92
## leaf.halo 84
## leaf.marg 84
## leaf.size 84
## leaf.malf 84
## fruit.pods 84
## precip 38
## stem.cankers 38
## canker.lesion 38
## ext.decay 38
## mycelium 38
## int.discolor 38
## sclerotia 38
## plant.stand 36
## roots 31
## temp 30
## crop.hist 16
## plant.growth 16
## stem 16
## date 1
## area.dam 1
## Class 0
## leaves 0
# 2. Check if missingness is related to the Class
missing_by_class <- table(Soybean$Class, complete.cases(Soybean))
print(missing_by_class)
##
## FALSE TRUE
## 2-4-d-injury 16 0
## alternarialeaf-spot 0 91
## anthracnose 0 44
## bacterial-blight 0 20
## bacterial-pustule 0 20
## brown-spot 0 92
## brown-stem-rot 0 44
## charcoal-rot 0 20
## cyst-nematode 14 0
## diaporthe-pod-&-stem-blight 15 0
## diaporthe-stem-canker 0 20
## downy-mildew 0 20
## frog-eye-leaf-spot 0 91
## herbicide-injury 8 0
## phyllosticta-leaf-spot 0 20
## phytophthora-rot 68 20
## powdery-mildew 0 20
## purple-seed-stain 0 20
## rhizoctonia-root-rot 0 20
Analysis:
Approximately 18% of the data is missing. Predictors like hail, sever, and seed.tmt have the most missing values.
The missing data is not random. Specific classes (e.g., phytophthora-rot) contain almost all the missing values. If we remove rows with missing data, we might delete entire categories of diseases from our model.
library(mlbench)
data(Soybean)
# 1. REMOVE DEGENERATE PREDICTORS (Manually)
degenerate_cols <- c("leaf.mild", "mycelium", "sclerotia")
soybean_filtered <- Soybean[, !(names(Soybean) %in% degenerate_cols)]
# 2. HANDLE MISSING DATA
soybean_clean <- soybean_filtered
for(col_name in names(soybean_clean)) {
if(is.factor(soybean_clean[[col_name]])) {
# Add a new factor level called "NA_category"
levels(soybean_clean[[col_name]]) <- c(levels(soybean_clean[[col_name]]), "NA_category")
# Replace actual NA values with this new level
soybean_clean[is.na(soybean_clean[[col_name]]), col_name] <- "NA_category"
}
}
# 3. VERIFY
print(paste("Remaining NAs:", sum(is.na(soybean_clean))))
## [1] "Remaining NAs: 0"
summary(soybean_clean$hail)
## 0 1 NA_category
## 435 127 121
Analysis:
Filtering: We manually removed the predictors identified in part (a). By removing leaf.mild, mycelium, and sclerotia, we prevent the model from struggling with variables that have no predictive variance.
Informative Coding: Instead of “guessing” values (imputation), we converted NA into a specific level (NA_category).
Reasoning: In the Soybean dataset, missing values are not random. Certain disease classes are missing specific measurements. By making “Missing” a category, the model can learn that “the absence of this symptom” is actually a diagnostic feature of that specific disease.