3.1 and 3.2 in Kuhn & Johnson - Applied Predictive Modeling
library(caret)
library(corrplot)
library(dplyr)
library(e1071)
library(fable)
library(fpp3)
library(ggplot2)
library(GGally)
library(mlbench)
library(MASS)
library(patchwork)
library(tsibble)
library(tidyr)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
glass_predictors <- Glass[, -10]  # Exclude the Type outcome
# Distribution: histogram of each predictor
par(mfrow = c(2, 5))
for (i in 1:ncol(glass_predictors)) {
  title <- paste(colnames(glass_predictors)[i], "Distribution")
  hist(glass_predictors[, i], xlab = colnames(glass_predictors)[i], main = title)}
# Correlation
glass_predictors %>%
cor() %>%
corrplot::corrplot(method = "number", diag = FALSE, type = 'lower')
Very few of the predictor distributions resemble a normal distribution: most show clear skew, and outliers are visible right away for almost every element. The correlation plot also reveals relationships between predictors, most notably a strong positive correlation between the refractive index (RI) and Ca.
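To make the relationships between predictors explicit, the pairwise correlations can also be ranked; a minimal base-R sketch using the glass_predictors object defined above:
# Rank the pairwise correlations (complements the corrplot above)
cor_mat <- cor(glass_predictors)
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA       # keep each pair once
cor_pairs <- subset(as.data.frame(as.table(cor_mat)), !is.na(Freq))
head(cor_pairs[order(-abs(cor_pairs$Freq)), ], 5)    # strongest relationships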
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
par(mfrow = c(2,5))
# Box plot to show outliers
for (i in 1:ncol(glass_predictors)) {
title <- paste(colnames(glass_predictors)[i], "Boxplot", sep = " ")
boxplot(glass_predictors[,i], main = title, ylab = colnames(glass_predictors)[i])}
# Checking skewness
skew_values <- apply(glass_predictors, 2, skewness)
skew_summary <- data.frame(Predictor = colnames(glass_predictors), Skewness = skew_values)
print(skew_summary)
## Predictor Skewness
## RI RI 1.6027151
## Na Na 0.4478343
## Mg Mg -1.1364523
## Al Al 0.8946104
## Si Si -0.7202392
## K K 6.4600889
## Ca Ca 2.0184463
## Ba Ba 3.3686800
## Fe Fe 1.7298107
Boxplots are useful here because they flag as individual dots any point lying more than 1.5 times the interquartile range beyond the first or third quartile. Every predictor except magnesium (Mg) has at least one flagged outlier. The skewness values confirm that most variables are skewed, with potassium (K) by far the most extreme (6.46), while Na, Si, and Al have the smallest absolute skewness and the most nearly normal-looking histograms.
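For reference, here is the rule the boxplots actually apply, worked through for potassium (K) as an example (base R only, using the glass_predictors object above):
# A point is flagged as an outlier if it lies more than 1.5 * IQR beyond Q1 or Q3
q <- quantile(glass_predictors$K, c(0.25, 0.75))
k_iqr <- diff(q)
lower_fence <- q[1] - 1.5 * k_iqr
upper_fence <- q[2] + 1.5 * k_iqr
sum(glass_predictors$K < lower_fence | glass_predictors$K > upper_fence)  # number of flagged points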
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
# Box-Cox, center, and scale (Ch. 3.2)
glass_normalized <- preProcess(glass_predictors, method = c("BoxCox", "center", "scale"))
glass_normalized_predict <- predict(glass_normalized, glass_predictors)
# Plot
glass_normalized_predict %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram() +
  ggtitle("Predictor Distributions - Box-Cox Transformed") +
  facet_wrap(~ key, scales = "free")
# Box-plot after transform
par(mfrow = c(2,5))
for (i in 1:ncol(glass_normalized_predict)) {
title <- paste(colnames(glass_normalized_predict)[i], "Box-Cox Transformed Boxplot", sep = " ")
boxplot(glass_normalized_predict[,i], main = title, ylab = colnames(glass_normalized_predict)[i])}
# PCA
glass_normalized_pca <- preProcess(glass_predictors, method = c("BoxCox","center", "scale", "pca"))
glass_normalized_pca_predict <- predict(glass_normalized_pca, glass_predictors)
# Plot
glass_normalized_pca_predict %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram() +
  ggtitle("PCA of Box-Cox Transformed Predictors") +
  facet_wrap(~ key, scales = "free")
Section 3.2 of our textbook describes centering and scaling: the average of each predictor is subtracted from its values, giving the predictor a mean of 0, and each value is then divided by the predictor's standard deviation, coercing the values to a common standard deviation of 1. The Box-Cox transformation helps normalize the data by estimating the optimal lambda from a family of power transformations (Pg. 32). After transforming, the distributions of most of the nine predictors are somewhat improved; the exceptions are Ba, Fe, K, and Mg, which contain zero values, so the Box-Cox transformation (defined only for strictly positive data) is not applied to them. Section 3.3 discusses principal component analysis, which seeks linear combinations of the predictors that capture the most variance. Together, these transformations help normalize the predictors and should improve the classification model.
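To check these claims against the fitted objects, here is a brief sketch using the glass_predictors and glass_normalized_predict objects above; caret's BoxCoxTrans() is shown for Ca only because Ca is strictly positive.
# Estimated Box-Cox lambda for one strictly positive predictor (illustrative)
BoxCoxTrans(glass_predictors$Ca)
# After centering and scaling, every column should have mean ~0 and sd ~1
round(colMeans(glass_normalized_predict), 3)
round(apply(glass_normalized_predict, 2, sd), 3)
# Cumulative proportion of variance captured by the principal components
summary(prcomp(glass_normalized_predict))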
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
?Soybean
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# Frequency table for each predictor variable
#str(Soybean)
# Names of the categorical columns (every column in Soybean is a factor);
# the 'Class' outcome is kept here and filtered out of the table further below
categorical_columns <- names(Soybean)[sapply(Soybean, is.factor)]
# Create empty data frame to store counts
count_table <- data.frame()
# Loop through the categorical columns
for (col in categorical_columns) {
# Get the frequency distribution for the current column
freq <- table(Soybean[[col]])
# Convert to data frame for neat arrangement
freq_df <- as.data.frame(freq)
# Assign column names
colnames(freq_df) <- c("Level", "Count")
# Add the variable name as a new column (row name)
freq_df$Variable <- col
# Bind this table to the main table
count_table <- rbind(count_table, freq_df)
}
# Reshape the data to a wide format
count_table_wide <- count_table %>%
pivot_wider(names_from = Level, values_from = Count, values_fill = list(Count = 0))
# I'll remove 'Class' from the count table, and its associated columns
count_table_wide <- count_table_wide %>%
filter(Variable != "Class")
# Define the allowed levels
allowed_levels <- as.character(0:6)
# Get the names of the columns (excluding the 'Variable' column)
columns_to_keep <- colnames(count_table_wide)
# Keep only columns that are in allowed_levels, and 'Variable'
columns_to_keep <- c("Variable", columns_to_keep[columns_to_keep %in% allowed_levels])
# Select the relevant columns from count_table_wide
count_table_wide <- count_table_wide[, columns_to_keep]
print(count_table_wide)
## # A tibble: 35 × 8
## Variable `0` `1` `2` `3` `4` `5` `6`
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 date 26 75 93 118 131 149 90
## 2 plant.stand 354 293 0 0 0 0 0
## 3 precip 74 112 459 0 0 0 0
## 4 temp 80 374 199 0 0 0 0
## 5 hail 435 127 0 0 0 0 0
## 6 crop.hist 65 165 219 218 0 0 0
## 7 area.dam 123 227 145 187 0 0 0
## 8 sever 195 322 45 0 0 0 0
## 9 seed.tmt 305 222 35 0 0 0 0
## 10 germ 165 213 193 0 0 0 0
## # ℹ 25 more rows
# Plot all variables to see degeneracy frequency
plot_list <- list()
for (var in names(Soybean)) {
if (is.factor(Soybean[[var]])) {
p <- ggplot(Soybean, aes(x = .data[[var]])) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = var, y = 'Count') +
ggtitle(var)
plot_list[[var]] <- p
}
}
plot_grid <- wrap_plots(plot_list, ncol = 4)
plot_grid
Our textbook mentions that “some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables” (Pg. 44). A degenerate predictor is one that provides little meaningful information because it has almost no variation, i.e., it is nearly constant or varies very rarely. From the bar charts, most of the predictors have highly unbalanced distributions, with the bulk of observations in one or two levels, while only a few, such as date, crop.hist, and area.dam, are more evenly spread.
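The visual check can be formalized with caret's nearZeroVar(), which applies the frequency-ratio and percent-unique-values rules from Chapter 3; any predictors flagged by this short sketch would be candidates for removal before modeling.
# Flag near-zero-variance (degenerate) predictors using caret's default thresholds
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]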
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
# Impute missing values with the class-wise mean of each field (predictors temporarily coded as numeric)
Soybean_imputed <- Soybean %>%
mutate(across(-Class, ~as.numeric(as.character(.)))) %>%
group_by(Class) %>%
mutate(across(everything(), ~if_else(is.na(.), mean(., na.rm = TRUE), .))) %>%
ungroup() %>%
mutate(across(-Class, ~factor(., ordered = TRUE)))
print("Original Soybean Data")
## [1] "Original Soybean Data"
print(sapply(Soybean, function(x) sum(is.na(x))))
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
print("Imputed Soybean Data ")
## [1] "Imputed Soybean Data "
print(sapply(Soybean_imputed, function(x) sum(is.na(x))))
## Class date plant.stand precip temp
## 0 0 0 0 0
## hail crop.hist area.dam sever seed.tmt
## 0 0 0 0 0
## germ plant.growth leaves leaf.halo leaf.marg
## 0 0 0 0 0
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 0 0 0 0 0
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 0 0 0 0
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 0 0 0 0 0
## seed mold.growth seed.discolor seed.size shriveling
## 0 0 0 0 0
## roots
## 0
The textbook offers insight into missing data, and the pattern here suggests that some values are informatively missing: the missingness is concentrated in a handful of classes, so the fact that a value is missing is itself related to the outcome. Because the data set is fairly small, dropping rows with missing values would discard too much information. Alternatively, we could check whether any predictors are highly collinear or have near-zero variance and remove those predictors entirely. If class-specific imputation did not suffice, k-nearest-neighbor imputation or the MICE (Multiple Imputation by Chained Equations) method could be used instead.
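As an illustration of the k-nearest-neighbor option mentioned above, here is a minimal sketch using caret's knnImpute method, assuming the ordinal levels are coded as numbers (as in the class-mean approach) and an arbitrary k = 5. Note that knnImpute also centers and scales the result, and it would fail for any row whose predictors are all missing, so such rows would need separate handling.
# k-NN imputation sketch (numeric coding of the ordinal levels; k = 5 is arbitrary)
soy_numeric <- Soybean %>%
  mutate(across(-Class, ~ as.numeric(as.character(.))))
soy_pred <- soy_numeric[, names(soy_numeric) != "Class"]
knn_pp <- preProcess(soy_pred, method = "knnImpute", k = 5)
soy_knn_imputed <- predict(knn_pp, soy_pred)
sum(is.na(soy_knn_imputed))  # 0 if no row had every predictor missing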