The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Interestingly, this data set was motivated by forensic science: if you can identify a piece of broken glass by its chemical signature or refractive index, you can check whether it is consistent with glass found elsewhere, establishing a potential link to a crime scene.
# Load packages
library(mlbench)
library(ggplot2)
library(gridExtra)
library(caret)
library(corrplot)
library(e1071)
library(knitr)
library(dplyr)
# Load and view data
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
We produced histograms to give a glimpse into the distributions of the different predictors. It looks like Sodium, Aluminium, Silicon, and Calcium are always present in glass, but in a number of samples no Magnesium, Potassium, Barium, or Iron is present.
The basic formula for glass is 70-75% silicon dioxide, 12-15% sodium oxide, about 10% calcium oxide, and a little aluminum oxide to improve its properties (durability, temperature and chemical resistance, and the ability to be shaped at high temperatures). The four elements with some zero values, on the other hand, are used for specialty glass, such as lenses.
While the distributions look skewed, we won’t be able to apply a log or Box-Cox transformation without addressing the zeros (possibly by adding a small constant such as 0.0001 to all of the values).
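As a sketch of that workaround (it assumes the Glass data and caret loaded above; the 0.0001 offset is just the illustrative value mentioned), a zero-heavy predictor such as Mg could be shifted before estimating a Box-Cox transformation:
# Sketch: shift a zero-heavy predictor by a small constant so it is strictly
# positive, then estimate and apply a Box-Cox transformation with caret
mg_shifted <- Glass$Mg + 0.0001
bc_mg <- BoxCoxTrans(mg_shifted)             # estimate lambda on the shifted values
mg_transformed <- predict(bc_mg, mg_shifted) # apply the transformation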
# Plot histograms for each predictor
plot_list <- lapply(names(Glass[, -ncol(Glass)]), function(var) {
ggplot(Glass, aes_string(x = var)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
ggtitle(paste("Distribution of", var)) +
theme_minimal()
})
# Arrange plots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
There are no near-zero variance predictors, meaning we don’t have to exclude any predictors for being nearly constant.
# Identify near-zero variance predictors
#nearZeroVar(Glass)
# Result: None
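For a fuller picture (a sketch), saveMetrics = TRUE returns the frequency-ratio and unique-value diagnostics behind the check rather than just the indices of flagged columns:
# Sketch: show the near-zero variance diagnostics (freqRatio, percentUnique,
# zeroVar, nzv) for every column of Glass
nearZeroVar(Glass, saveMetrics = TRUE)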
Calcium is highly correlated with refractive index. One role of calcium oxide in glass is to prevent crystals from forming, which would make the glass more opaque.
Silicon dioxide is negatively correlated with refractive index, but that’s probably because more silica means less calcium oxide or other additives that improve the optical qualities of the glass.
# Calculate correlation matrix
correlations <- cor(Glass[, sapply(Glass, is.numeric)])
# Plot correlation matrix
corrplot(correlations, method = "square", type = "lower", order = "original",
tl.cex = 0.8, tl.col = "black", cl.cex = 0.8)
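To put numbers on these relationships (a sketch; the 0.75 cutoff is an arbitrary choice for illustration), we can inspect the Ca/RI coefficient directly and let caret's findCorrelation() flag highly correlated predictors:
# Sketch: the Ca vs. refractive index correlation, plus caret::findCorrelation()
# to flag predictors whose pairwise correlations exceed an arbitrary 0.75 cutoff
cor(Glass$Ca, Glass$RI)
high_corr <- findCorrelation(correlations, cutoff = 0.75)
colnames(correlations)[high_corr]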
Do there appear to be any outliers in the data? Are any predictors skewed?
With boxplots, 50% of the values fall inside the box, values within 1.5 times the interquartile range (the box height) beyond either end of the box are captured by the whiskers, and any points beyond that are considered outliers and plotted individually, here in red.
Because there are so many outliers, after we transform our data, we should center and scale it.
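As a rough check on that claim (a sketch applying the same 1.5 × IQR rule the boxplots use), we can count how many points fall beyond the whiskers for each predictor:
# Sketch: count observations more than 1.5 * IQR beyond the quartiles for each
# numeric predictor, mirroring the outlier rule used by the boxplots below
sapply(Glass[, sapply(Glass, is.numeric)], function(x) {
  qs <- quantile(x, c(0.25, 0.75))
  iqr <- qs[2] - qs[1]
  sum(x < qs[1] - 1.5 * iqr | x > qs[2] + 1.5 * iqr)
})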
# Create boxplots for each numeric column
numeric_columns <- names(Glass)[sapply(Glass, is.numeric)]
plot_list <- lapply(numeric_columns, function(col) {
ggplot(Glass, aes_string(y = col)) +
geom_boxplot(outlier.color = "red", fill = "lightblue") +
ggtitle(paste("Boxplot of", col)) +
theme_minimal()
})
# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
Yes, all of the variables show some skewness; Potassium, Barium, and Calcium are the most strongly right-skewed, while Magnesium and Silicon are left-skewed.
# Calculate skewness for all numeric columns in Glass
skewness_values <- sapply(Glass[, sapply(Glass, is.numeric)], skewness)
# Display skewness values
kable(data.frame(Skewness = skewness_values), align="c")
| Predictor | Skewness |
|---|---|
| RI | 1.6027151 |
| Na | 0.4478343 |
| Mg | -1.1364523 |
| Al | 0.8946104 |
| Si | -0.7202392 |
| K | 6.4600889 |
| Ca | 2.0184463 |
| Ba | 3.3686800 |
| Fe | 1.7298107 |
Are there any relevant transformations of one or more predictors that might improve the classification model?
Since Box-Cox only works on positive values, and four of our predictors contain many zeros, we will only be able to transform the other five predictors.
# Apply Box-Cox transformation to each numeric column
GlassTrans <- Glass # Copy original data
for (col in names(Glass[, sapply(Glass, is.numeric)])) {
# Check if the column is strictly positive
if (all(Glass[[col]] > 0)) {
# Create Box-Cox transformation object
bc_trans <- BoxCoxTrans(Glass[[col]])
# Apply transformation and store result
GlassTrans[[col]] <- predict(bc_trans, Glass[[col]])
} else {
cat("Skipping Box-Cox for column:", col, "- contains non-positive values\n")
}
}
## Skipping Box-Cox for column: Mg - contains non-positive values
## Skipping Box-Cox for column: K - contains non-positive values
## Skipping Box-Cox for column: Ba - contains non-positive values
## Skipping Box-Cox for column: Fe - contains non-positive values
Now that we’ve transformed the data, we can center and scale it.
# Center and scale the transformed data
GlassTrans[, sapply(GlassTrans, is.numeric)] <- scale(GlassTrans[, sapply(GlassTrans, is.numeric)])
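As an aside (a sketch, not the approach used above), caret's preProcess() can bundle the Box-Cox, centering, and scaling steps into a single recipe, which is convenient when the same transformation must later be applied to new data:
# Sketch: estimate Box-Cox (where possible), centering, and scaling as one
# preProcess() object, then apply it to the numeric predictors; columns that
# cannot be Box-Cox transformed (those containing zeros) are skipped by that
# step but still centered and scaled
pp <- preProcess(Glass[, sapply(Glass, is.numeric)], method = c("BoxCox", "center", "scale"))
GlassPP <- predict(pp, Glass[, sapply(Glass, is.numeric)])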
Here we display the data after transformation, centering, and scaling.
# Plot histograms for each transformed predictor
plot_list <- lapply(names(GlassTrans[, sapply(GlassTrans, is.numeric)]), function(col) {
ggplot(GlassTrans, aes_string(x = col)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
ggtitle(paste("Distribution of", col)) +
theme_minimal()
})
# Display all histograms in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
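As a quick sanity check (a sketch), the skewness values can be recomputed on the transformed data; the Box-Cox-transformed predictors should now be noticeably less skewed:
# Sketch: recompute skewness on the transformed, centered, and scaled predictors
sapply(GlassTrans[, sapply(GlassTrans, is.numeric)], skewness)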
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf size, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
There are several predictors with degenerate distributions; the exact number depends on the frequency threshold you deem degenerate.
Here we load the data and suppress the output that would list the category frequencies for every predictor, since there are so many of them, and display only two.
Notice how leaf.size has three categories with a healthy number of records in each. mycelium, on the other hand, is considered degenerate because fewer than 1% of its records fall in one of its two categories. Including it in our model could cause unpredictable fitting, and the model might not learn the pattern of the rare category at all.
# Load the dataset
data(Soybean)
# Check the structure to identify categorical predictors
#str(Soybean)
# Calculate frequency distributions for each categorical predictor
categorical_columns <- names(Soybean)[sapply(Soybean, is.factor)]
frequency_distributions <- lapply(Soybean[categorical_columns], table)
# Display frequency distributions
#frequency_distributions
# Display two of the frequency distributions
frequency_distributions[c("leaf.size", "mycelium")]
## $leaf.size
##
## 0 1 2
## 51 327 221
##
## $mycelium
##
## 0 1
## 639 6
Depending on how we set our threshold, we can remove more or fewer degenerate predictors. If we set the threshold to 99%, we only remove the mycelium predictor. If we set it to 95%, we also remove sclerotia. Setting the threshold to 90% would remove eight predictors in total, the ones flagged TRUE below.
# Check for degenerate distributions (single or dominant category)
degenerate_distributions <- sapply(frequency_distributions, function(x) {
max(x) / sum(x) > 0.90 # Adjust threshold as needed
})
# Show results
degenerate_distributions
## Class date plant.stand precip temp
## FALSE FALSE FALSE FALSE FALSE
## hail crop.hist area.dam sever seed.tmt
## FALSE FALSE FALSE FALSE FALSE
## germ plant.growth leaves leaf.halo leaf.marg
## FALSE FALSE FALSE FALSE FALSE
## leaf.size leaf.shread leaf.malf leaf.mild stem
## FALSE FALSE TRUE TRUE FALSE
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## TRUE FALSE FALSE FALSE FALSE
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## TRUE TRUE TRUE FALSE FALSE
## seed mold.growth seed.discolor seed.size shriveling
## FALSE FALSE FALSE TRUE TRUE
## roots
## FALSE
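A related check (a sketch using caret's nearZeroVar(), whose default frequency-ratio and unique-value cutoffs differ from the simple 90% dominance rule above, so the flagged set may not match exactly):
# Sketch: caret's nearZeroVar() flags near-zero variance predictors using its
# default freqRatio and percentUnique cutoffs; compare with the 90% rule above
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, c("freqRatio", "percentUnique")]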
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Yes, certain predictors are more likely to be missing, and all of the missing values occur in just five of the classes.
If I sum the NA values across the whole dataset (there are no NAs in the Class column) and divide by the number of predictor cells, I get ~9.77% of the predictor values missing, not 18%.
The documentation says the value “dna” means “does not apply”; however, there are no dna values in the data.
# Proportion of predictor values that are missing (Class excluded from the denominator)
sum(is.na(Soybean))/(prod(dim(Soybean)) - nrow(Soybean))
## [1] 0.09776197
Here we list the predictors by percent missing. This may be where the ~18% figure came from, as the four predictors with the most missingness are each roughly 18% missing.
# Calculate percentage of missing values per column
missing_per_column <- sapply(Soybean, function(x) {
sum(is.na(x)) / length(x) * 100
})
# Convert to a data frame and order by missing percentage
missing_df <- data.frame(Predictor = names(missing_per_column),
MissingPercentage = missing_per_column)
missing_df <- missing_df[order(-missing_df$MissingPercentage), ]
rownames(missing_df) <- NULL
# Display the ranked predictors by missing percentage
#print(missing_df)
kable(missing_df, align="c")
| Predictor | Missing (%) |
|---|---|
| hail | 17.7159590 |
| sever | 17.7159590 |
| seed.tmt | 17.7159590 |
| lodging | 17.7159590 |
| germ | 16.3982430 |
| leaf.mild | 15.8125915 |
| fruiting.bodies | 15.5197657 |
| fruit.spots | 15.5197657 |
| seed.discolor | 15.5197657 |
| shriveling | 15.5197657 |
| leaf.shread | 14.6412884 |
| seed | 13.4699854 |
| mold.growth | 13.4699854 |
| seed.size | 13.4699854 |
| leaf.halo | 12.2986823 |
| leaf.marg | 12.2986823 |
| leaf.size | 12.2986823 |
| leaf.malf | 12.2986823 |
| fruit.pods | 12.2986823 |
| precip | 5.5636896 |
| stem.cankers | 5.5636896 |
| canker.lesion | 5.5636896 |
| ext.decay | 5.5636896 |
| mycelium | 5.5636896 |
| int.discolor | 5.5636896 |
| sclerotia | 5.5636896 |
| plant.stand | 5.2708638 |
| roots | 4.5387994 |
| temp | 4.3923865 |
| crop.hist | 2.3426061 |
| plant.growth | 2.3426061 |
| stem | 2.3426061 |
| date | 0.1464129 |
| area.dam | 0.1464129 |
| Class | 0.0000000 |
| leaves | 0.0000000 |
Here we display the number of missing values by class. Only five of the classes contain any missing values.
# Make list of values in Class
classes <- unique(Soybean$Class)
# Calculate the total number of missing values in each class
missing_by_class <- lapply(classes, function(class_name) {
# Filter the data for the current class
class_data <- Soybean[Soybean$Class == class_name, ]
# Count the total number of missing values for this class
sum(rowSums(is.na(class_data)))
})
# Name the list elements by class for easier reference
names(missing_by_class) <- classes
# Convert to dataframe
missing_by_class <- as.data.frame(missing_by_class)
rownames(missing_by_class) <- "Missing"
# Display the result
t(missing_by_class)
## Missing
## diaporthe.stem.canker 0
## charcoal.rot 0
## rhizoctonia.root.rot 0
## phytophthora.rot 1214
## brown.stem.rot 0
## powdery.mildew 0
## downy.mildew 0
## brown.spot 0
## bacterial.blight 0
## bacterial.pustule 0
## purple.seed.stain 0
## anthracnose 0
## phyllosticta.leaf.spot 0
## alternarialeaf.spot 0
## frog.eye.leaf.spot 0
## diaporthe.pod...stem.blight 177
## cyst.nematode 336
## X2.4.d.injury 450
## herbicide.injury 160
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
I ended up skimming the 1980 paper (https://www.mli.gmu.edu/papers/79-80/80-2.pdf) that this data came from, and there’s no discussion about why they eliminated the last four conditions but did not exclude phytophthora-rot, all five of which have missing values.
My guess is that the last four don’t have enough records or examples to impute the missing values.
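To check that (a quick sketch; classes_with_na is a helper name introduced here), we can count how many records each class with missing values actually has:
# Sketch: identify the classes that contain any missing values and show how
# many records each of those classes has in total
classes_with_na <- unique(Soybean$Class[rowSums(is.na(Soybean)) > 0])
table(Soybean$Class)[as.character(classes_with_na)]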
Another possibility is that phytophthora rot may be difficult to classify because its symptoms are similar to those of the other classes in the model, whereas the four excluded classes can be easily ruled out before modeling. For instance, it may be that cyst nematode has a singular diagnosis (are there cysts, with nematodes?) and so couldn’t be captured effectively in the model unless we used one model for the first fifteen classes and wrapped it in a decision tree for the other four. Two of the four excluded classes are herbicide damage (unspecified and 2-4-d), so they may be easy to rule out, too.
Either way, our strategy will be to exclude those last four classes.
I reviewed phytophthora rot as well, and I don’t believe there are enough records to impute it: for each predictor with missingness, either 55 or 68 of the records are missing data, yet there are only 88 phytophthora-rot records in total.
My strategy would be to exclude it.
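A minimal sketch of that exclusion strategy, reusing the classes_with_na helper defined above:
# Sketch: drop every class that contains missing values and remove the
# now-unused factor levels
SoybeanComplete <- droplevels(Soybean[!(Soybean$Class %in% classes_with_na), ])
sum(is.na(SoybeanComplete))  # should be 0 once those classes are removed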
Alternatively, we could make an indicator for missingness. Maybe that’s what “dna” meant in the ??mlbench::Soybean description: it isn’t in the data itself, but the original authors may have recorded the missing phytophthora-rot values for the predictors with missingness as “dna”.
I don’t think that would be a very good strategy, though: since only one remaining class would have a missingness indicator, the model would simply learn that the indicator means phytophthora rot.
Here’s the number of missing values by predictor for Phytophthora rot records:
SoybeanPhytoRot <- Soybean[Soybean$Class == "phytophthora-rot", ]
# Count the number of NA values in each column
missing_counts <- colSums(is.na(SoybeanPhytoRot))
# Display the result
print(missing_counts)
## Class date plant.stand precip temp
## 0 0 0 0 0
## hail crop.hist area.dam sever seed.tmt
## 68 0 0 68 68
## germ plant.growth leaves leaf.halo leaf.marg
## 68 0 0 55 55
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 55 55 55 55 0
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 68 0 0 68 0
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 0 0 0 68 68
## seed mold.growth seed.discolor seed.size shriveling
## 68 68 68 68 68
## roots
## 0
These exercises come from *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson.