The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Interestingly, this data set was motivated by forensic science: if you can identify a piece of broken glass by its chemical signature or refractive index, you can check whether it is consistent with glass found elsewhere, establishing a potential link to a crime scene.
# Load packages
library(mlbench)
library(ggplot2)
library(gridExtra)
library(caret)
library(corrplot)
library(e1071)
library(knitr)
library(dplyr)
# Load and view data
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
We produced histograms to give a glimpse into the distributions of the different predictors. It looks like Sodium, Aluminium, Silicon, and Calcium are always present in glass, but in a number of samples no Magnesium, Potassium, Barium, or Iron is present.
The basic formula for glass is 70-75% silicon dioxide, 12-15% sodium oxide, about 10% calcium oxide, and a little aluminum oxide to improve its properties (durability, temperature and chemical resistance, and the ability to be shaped at high temperatures). The four elements with some zero values, on the other hand, are used for specialty glass, such as lenses.
While the distributions look skewed, we won’t be able to apply a log or Box-Cox transformation without addressing the zeros (possibly by adding a small constant such as 0.0001 to all of the values).
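As a sketch of that workaround (it assumes the Glass data and caret loaded above; the 0.0001 offset is just the illustrative value mentioned), a zero-heavy predictor such as Mg could be shifted before estimating a Box-Cox transformation:
# Sketch: shift a zero-heavy predictor by a small constant so it is strictly
# positive, then estimate and apply a Box-Cox transformation with caret
mg_shifted <- Glass$Mg + 0.0001
bc_mg <- BoxCoxTrans(mg_shifted)             # estimate lambda on the shifted values
mg_transformed <- predict(bc_mg, mg_shifted) # apply the transformation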
# Plot histograms for each predictor
plot_list <- lapply(names(Glass[, -ncol(Glass)]), function(var) {
ggplot(Glass, aes_string(x = var)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
ggtitle(paste("Distribution of", var)) +
theme_minimal()
})
# Arrange plots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
There are no near-zero variance predictors, meaning we don’t have to exclude any predictors for being nearly constant.
# Identify near-zero variance predictors
#nearZeroVar(Glass)
# Result: None
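For a fuller picture (a sketch), saveMetrics = TRUE returns the frequency-ratio and unique-value diagnostics behind the check rather than just the indices of flagged columns:
# Sketch: show the near-zero variance diagnostics (freqRatio, percentUnique,
# zeroVar, nzv) for every column of Glass
nearZeroVar(Glass, saveMetrics = TRUE)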
Calcium is highly correlated with refractive index. One role of calcium oxide in glass is to prevent crystals from forming, which would make the glass more opaque.
Silicon dioxide is negatively correlated with refractive index, but that’s probably because more silica means less calcium oxide or other additives that improve the optical qualities of the glass.
# Calculate correlation matrix
correlations <- cor(Glass[, sapply(Glass, is.numeric)])
# Plot correlation matrix
corrplot(correlations, method = "square", type = "lower", order = "original",
tl.cex = 0.8, tl.col = "black", cl.cex = 0.8)
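To put numbers on these relationships (a sketch; the 0.75 cutoff is an arbitrary choice for illustration), we can inspect the Ca/RI coefficient directly and let caret's findCorrelation() flag highly correlated predictors:
# Sketch: the Ca vs. refractive index correlation, plus caret::findCorrelation()
# to flag predictors whose pairwise correlations exceed an arbitrary 0.75 cutoff
cor(Glass$Ca, Glass$RI)
high_corr <- findCorrelation(correlations, cutoff = 0.75)
colnames(correlations)[high_corr]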
Do there appear to be any outliers in the data? Are any predictors skewed?
With boxplots, 50% of the values fall inside the box, values within 1.5 times the interquartile range (the box height) beyond either end of the box are captured by the whiskers, and any points beyond that are considered outliers and plotted individually, here in red.
Because there are so many outliers, after we transform our data, we should center and scale it.
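As a rough check on that claim (a sketch applying the same 1.5 × IQR rule the boxplots use), we can count how many points fall beyond the whiskers for each predictor:
# Sketch: count observations more than 1.5 * IQR beyond the quartiles for each
# numeric predictor, mirroring the outlier rule used by the boxplots below
sapply(Glass[, sapply(Glass, is.numeric)], function(x) {
  qs <- quantile(x, c(0.25, 0.75))
  iqr <- qs[2] - qs[1]
  sum(x < qs[1] - 1.5 * iqr | x > qs[2] + 1.5 * iqr)
})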
# Create boxplots for each numeric column
numeric_columns <- names(Glass)[sapply(Glass, is.numeric)]
plot_list <- lapply(numeric_columns, function(col) {
ggplot(Glass, aes_string(y = col)) +
geom_boxplot(outlier.color = "red", fill = "lightblue") +
ggtitle(paste("Boxplot of", col)) +
theme_minimal()
})
# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
Yes, all of the variables show some skewness; Potassium, Barium, and Calcium are the most strongly right-skewed, while Magnesium and Silicon are left-skewed.
# Calculate skewness for all numeric columns in Glass
skewness_values <- sapply(Glass[, sapply(Glass, is.numeric)], skewness)
# Display skewness values
kable(data.frame(Skewness = skewness_values), align="c")
| Predictor | Skewness |
|---|---|
| RI | 1.6027151 |
| Na | 0.4478343 |
| Mg | -1.1364523 |
| Al | 0.8946104 |
| Si | -0.7202392 |
| K | 6.4600889 |
| Ca | 2.0184463 |
| Ba | 3.3686800 |
| Fe | 1.7298107 |
Are there any relevant transformations of one or more predictors that might improve the classification model?
Since Box-Cox only works on positive values, and four of our predictors contain many zeros, we will only be able to transform the other five predictors.
# Apply Box-Cox transformation to each numeric column
GlassTrans <- Glass # Copy original data
for (col in names(Glass[, sapply(Glass, is.numeric)])) {
# Check if the column is strictly positive
if (all(Glass[[col]] > 0)) {
# Create Box-Cox transformation object
bc_trans <- BoxCoxTrans(Glass[[col]])
# Apply transformation and store result
GlassTrans[[col]] <- predict(bc_trans, Glass[[col]])
} else {
cat("Skipping Box-Cox for column:", col, "- contains non-positive values\n")
}
}
## Skipping Box-Cox for column: Mg - contains non-positive values
## Skipping Box-Cox for column: K - contains non-positive values
## Skipping Box-Cox for column: Ba - contains non-positive values
## Skipping Box-Cox for column: Fe - contains non-positive values
Now that we’ve transformed the data, we can center and scale it.
# Center and scale the transformed data
GlassTrans[, sapply(GlassTrans, is.numeric)] <- scale(GlassTrans[, sapply(GlassTrans, is.numeric)])
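As an aside (a sketch, not the approach used above), caret's preProcess() can bundle the Box-Cox, centering, and scaling steps into a single recipe, which is convenient when the same transformation must later be applied to new data:
# Sketch: estimate Box-Cox (where possible), centering, and scaling as one
# preProcess() object, then apply it to the numeric predictors; columns that
# cannot be Box-Cox transformed (those containing zeros) are skipped by that
# step but still centered and scaled
pp <- preProcess(Glass[, sapply(Glass, is.numeric)], method = c("BoxCox", "center", "scale"))
GlassPP <- predict(pp, Glass[, sapply(Glass, is.numeric)])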
Here we display the data after transformation, centering, and scaling.
# Plot histograms for each transformed predictor
plot_list <- lapply(names(GlassTrans[, sapply(GlassTrans, is.numeric)]), function(col) {
ggplot(GlassTrans, aes_string(x = col)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
ggtitle(paste("Distribution of", col)) +
theme_minimal()
})
# Display all histograms in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
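As a quick sanity check (a sketch), the skewness values can be recomputed on the transformed data; the Box-Cox-transformed predictors should now be noticeably less skewed:
# Sketch: recompute skewness on the transformed, centered, and scaled predictors
sapply(GlassTrans[, sapply(GlassTrans, is.numeric)], skewness)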
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf size, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
There are several predictors with degenerate distributions; the exact number depends on the frequency threshold you deem degenerate.
Here we load the data and suppress the output that would list the category frequencies for every predictor, since there are so many of them, and display only two.
Notice how leaf.size has three categories with a healthy number of records in each. mycelium, on the other hand, is considered degenerate because fewer than 1% of its records fall in one of its two categories. Including it in our model could cause unpredictable fitting, and the model might not learn the pattern of the rare category at all.
# Load the dataset
data(Soybean)
# Check the structure to identify categorical predictors
#str(Soybean)
# Calculate frequency distributions for each categorical predictor
categorical_columns <- names(Soybean)[sapply(Soybean, is.factor)]
frequency_distributions <- lapply(Soybean[categorical_columns], table)
# Display frequency distributions
#frequency_distributions
# Display two of the frequency distributions
frequency_distributions[c("leaf.size", "mycelium")]
## $leaf.size
##
## 0 1 2
## 51 327 221
##
## $mycelium
##
## 0 1
## 639 6
Depending on how we set our threshold, we can remove more or fewer degenerate predictors. If we set the threshold to 99%, we only remove the mycelium predictor. If we set it to 95%, we also remove sclerotia. Setting the threshold to 90% would remove eight predictors in total, the ones flagged TRUE below.
# Check for degenerate distributions (single or dominant category)
degenerate_distributions <- sapply(frequency_distributions, function(x) {
max(x) / sum(x) > 0.90 # Adjust threshold as needed
})
# Show results
degenerate_distributions
## Class date plant.stand precip temp
## FALSE FALSE FALSE FALSE FALSE
## hail crop.hist area.dam sever seed.tmt
## FALSE FALSE FALSE FALSE FALSE
## germ plant.growth leaves leaf.halo leaf.marg
## FALSE FALSE FALSE FALSE FALSE
## leaf.size leaf.shread leaf.malf leaf.mild stem
## FALSE FALSE TRUE TRUE FALSE
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## TRUE FALSE FALSE FALSE FALSE
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## TRUE TRUE TRUE FALSE FALSE
## seed mold.growth seed.discolor seed.size shriveling
## FALSE FALSE FALSE TRUE TRUE
## roots
## FALSE
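A related check (a sketch using caret's nearZeroVar(), whose default frequency-ratio and unique-value cutoffs differ from the simple 90% dominance rule above, so the flagged set may not match exactly):
# Sketch: caret's nearZeroVar() flags near-zero variance predictors using its
# default freqRatio and percentUnique cutoffs; compare with the 90% rule above
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, c("freqRatio", "percentUnique")]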
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Yes, certain predictors are more likely to be missing, and all of the missing values occur in just five of the classes.
If I sum the NA values across the whole dataset (there are no NAs in the Class column) and divide by the number of predictor cells, I get ~9.77% of the predictor values missing, not 18%.
The documentation says the value “dna” means “does not apply”; however, there are no dna values in the data.
# Proportion of predictor values that are missing (Class excluded from the denominator)
sum(is.na(Soybean))/(prod(dim(Soybean)) - nrow(Soybean))
## [1] 0.09776197
Here we list the predictors by percent missing. This may be where the ~18% figure came from, as the four predictors with the most missingness are each roughly 18% missing.
# Calculate percentage of missing values per column
missing_per_column <- sapply(Soybean, function(x) {
sum(is.na(x)) / length(x) * 100
})
# Convert to a data frame and order by missing percentage
missing_df <- data.frame(Predictor = names(missing_per_column),
MissingPercentage = missing_per_column)
missing_df <- missing_df[order(-missing_df$MissingPercentage), ]
rownames(missing_df) <- NULL
# Display the ranked predictors by missing percentage
#print(missing_df)
kable(missing_df, align="c")
| Predictor | Missing (%) |
|---|---|
| hail | 17.7159590 |
| sever | 17.7159590 |
| seed.tmt | 17.7159590 |
| lodging | 17.7159590 |
| germ | 16.3982430 |
| leaf.mild | 15.8125915 |
| fruiting.bodies | 15.5197657 |
| fruit.spots | 15.5197657 |
| seed.discolor | 15.5197657 |
| shriveling | 15.5197657 |
| leaf.shread | 14.6412884 |
| seed | 13.4699854 |
| mold.growth | 13.4699854 |
| seed.size | 13.4699854 |
| leaf.halo | 12.2986823 |
| leaf.marg | 12.2986823 |
| leaf.size | 12.2986823 |
| leaf.malf | 12.2986823 |
| fruit.pods | 12.2986823 |
| precip | 5.5636896 |
| stem.cankers | 5.5636896 |
| canker.lesion | 5.5636896 |
| ext.decay | 5.5636896 |
| mycelium | 5.5636896 |
| int.discolor | 5.5636896 |
| sclerotia | 5.5636896 |
| plant.stand | 5.2708638 |
| roots | 4.5387994 |
| temp | 4.3923865 |
| crop.hist | 2.3426061 |
| plant.growth | 2.3426061 |
| stem | 2.3426061 |
| date | 0.1464129 |
| area.dam | 0.1464129 |
| Class | 0.0000000 |
| leaves | 0.0000000 |
Here we display the number of missing values by class. Only five of the classes contain any missing values.
# Make list of values in Class
classes <- unique(Soybean$Class)
# Calculate the total number of missing values in each class
missing_by_class <- lapply(classes, function(class_name) {
# Filter the data for the current class
class_data <- Soybean[Soybean$Class == class_name, ]
# Count the total number of missing values for this class
sum(rowSums(is.na(class_data)))
})
# Name the list elements by class for easier reference
names(missing_by_class) <- classes
# Convert to dataframe
missing_by_class <- as.data.frame(missing_by_class)
rownames(missing_by_class) <- "Missing"
# Display the result
t(missing_by_class)
## Missing
## diaporthe.stem.canker 0
## charcoal.rot 0
## rhizoctonia.root.rot 0
## phytophthora.rot 1214
## brown.stem.rot 0
## powdery.mildew 0
## downy.mildew 0
## brown.spot 0
## bacterial.blight 0
## bacterial.pustule 0
## purple.seed.stain 0
## anthracnose 0
## phyllosticta.leaf.spot 0
## alternarialeaf.spot 0
## frog.eye.leaf.spot 0
## diaporthe.pod...stem.blight 177
## cyst.nematode 336
## X2.4.d.injury 450
## herbicide.injury 160
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
I ended up skimming the 1980 paper (https://www.mli.gmu.edu/papers/79-80/80-2.pdf) that this data came from, and there’s no discussion about why they eliminated the last four conditions but did not exclude phytophthora-rot, all five of which have missing values.
My guess is that the last four don’t have enough records or examples to impute the missing values.
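To check that (a quick sketch; classes_with_na is a helper name introduced here), we can count how many records each class with missing values actually has:
# Sketch: identify the classes that contain any missing values and show how
# many records each of those classes has in total
classes_with_na <- unique(Soybean$Class[rowSums(is.na(Soybean)) > 0])
table(Soybean$Class)[as.character(classes_with_na)]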
Another possibility is that phytophthora rot may be difficult to classify because its symptoms are similar to those of the other classes in the model, whereas the four excluded classes can be easily ruled out before modeling. For instance, it may be that cyst nematode has a singular diagnosis (are there cysts, with nematodes?) and so couldn’t be captured effectively in the model unless we used one model for the first fifteen classes and wrapped it in a decision tree for the other four. Two of the four excluded classes are herbicide damage (unspecified and 2-4-d), so they may be easy to rule out, too.
Either way, our strategy will be to exclude those last four classes.
I reviewed phytophthora rot as well, and I don’t believe there are enough records to impute it: for each predictor with missingness, either 55 or 68 of the records are missing data, yet there are only 88 phytophthora-rot records in total.
My strategy would be to exclude it.
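A minimal sketch of that exclusion strategy, reusing the classes_with_na helper defined above:
# Sketch: drop every class that contains missing values and remove the
# now-unused factor levels
SoybeanComplete <- droplevels(Soybean[!(Soybean$Class %in% classes_with_na), ])
sum(is.na(SoybeanComplete))  # should be 0 once those classes are removed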
Alternatively, we could make an indicator for missingness. Maybe that’s what “dna” meant in the ??mlbench::Soybean description: it isn’t in the data itself, but the original authors may have recorded the missing phytophthora-rot values for the predictors with missingness as “dna”.
I don’t think that would be a very good strategy, though: since only one remaining class would have a missingness indicator, the model would simply learn that the indicator means phytophthora rot.
Here’s the number of missing values by predictor for Phytophthora rot records:
SoybeanPhytoRot <- Soybean[Soybean$Class == "phytophthora-rot", ]
# Count the number of NA values in each column
missing_counts <- colSums(is.na(SoybeanPhytoRot))
# Display the result
print(missing_counts)
## Class date plant.stand precip temp
## 0 0 0 0 0
## hail crop.hist area.dam sever seed.tmt
## 68 0 0 68 68
## germ plant.growth leaves leaf.halo leaf.marg
## 68 0 0 55 55
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 55 55 55 55 0
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 68 0 0 68 0
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 0 0 0 68 68
## seed mold.growth seed.discolor seed.size shriveling
## 68 68 68 68 68
## roots
## 0
These exercises come from *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson.