Data 624 HW 4

3.1 The UC Irvine Machine Learning Repository contains a dataset related to glass identification. The dataset consists of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index (RI) and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. This dataset will be used to explore the distribution of predictor variables, identify outliers, and investigate transformations that may improve a classification model.

The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

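First, let's examine the overall distribution of the glass type variable.

# Packages used in the Glass analysis
library(ggplot2)   # plots
library(reshape2)  # melt()
library(dplyr)     # %>% and mutate()

ggplot(Glass, aes(x = Type)) +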
  geom_bar(fill = "lightpink2", color = "black") +
  labs(
    title = "Distribution by Type of Glass",
    x = "Glass Type",
    y = "Count",
    caption = "UCI Machine Learning Repository"
  ) +
  coord_flip() +
  theme_minimal()

The dataset shows an uneven distribution of glass types. Types 1 and 2 are by far the most common, Types 3, 5, 6, and 7 appear much less frequently, and Type 4 does not appear at all (the Type factor has only six levels). This imbalance could make it harder for a model to classify the less common types accurately, so resampling or class weighting may help if classification is the goal. Next, consider the distributions of the predictors.
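
The imbalance can also be read directly from the Type factor. The following is a minimal sketch in base R, assuming the mlbench package is installed; the names class_summary and imbalance_ratio are introduced here for illustration only.

```r
library(mlbench)
data(Glass)

# Count and percentage share of samples per glass type
type_counts <- table(Glass$Type)
type_share  <- round(100 * prop.table(type_counts), 1)

class_summary <- data.frame(
  Type    = names(type_counts),
  Count   = as.vector(type_counts),
  Percent = as.vector(type_share),
  stringsAsFactors = FALSE
)
print(class_summary)

# Ratio of the largest class to the smallest, a rough measure of imbalance
imbalance_ratio <- max(type_counts) / min(type_counts)
print(imbalance_ratio)
```

A ratio well above 1 would support the reading of the bar chart; how large is "too large" depends on the model being fit.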

# Melt dataset for easier visualization
Glass_melted <- melt(Glass, id.vars = "Type")

# Histogram for distribution of predictor variables
ggplot(Glass_melted, aes(x = value, fill = variable)) +
  geom_histogram(color = "black", bins = 30) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distribution of Predictor Variables", x = "Value", y = "Frequency") +
  theme_minimal()

The predictors show varying distribution patterns. Na, Al, and Si are roughly symmetric and bell-shaped. RI and Ca are right-skewed, each with a long upper tail. Mg is bimodal, with a cluster of values at zero and a second mode near 3.5, which makes it left-skewed overall. K, Ba, and Fe are strongly right-skewed, with most values at or near zero and a few extreme values that could affect model performance. Log or square root transformations may reduce the right skew, although they cannot remove the spike at zero in Ba and Fe.

# Remove the categorical variable "Type" to retain only numeric variables
Glass_numeric <- Glass[, sapply(Glass, is.numeric)]

# Compute the correlation matrix
correlation_matrix <- round(cor(Glass_numeric), 2)

# Function to get the lower triangle of the correlation matrix
get_lower_tri <- function(correlation_matrix){
  correlation_matrix[upper.tri(correlation_matrix)] <- NA
  return(correlation_matrix)
}

# Function to get the upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
  correlation_matrix[lower.tri(correlation_matrix)] <- NA
  return(correlation_matrix)
}

# Get the upper triangle
upper_tri <- get_upper_tri(correlation_matrix)

# Melt the correlation matrix for visualization
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Create the heatmap
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 15, hjust = 1)) +
  coord_fixed()

# Add correlation values as labels
ggheatmap + 
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x=element_text(size=rel(0.8), angle=90),
    axis.text.y=element_text(size=rel(0.8)),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank(),
    legend.justification = c(1, 0),
    legend.position = "inside",
    legend.position.inside = c(0.6, 0.7),  # numeric legend.position is deprecated since ggplot2 3.5.0
    legend.direction = "horizontal") +
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                               title.position = "top", title.hjust = 0.5))

The correlation heatmap suggests that most variables do not exhibit strong correlations with each other. However, there are a few notable exceptions. The strongest positive correlation is observed between Ca and RI, with a coefficient of 0.81, indicating a strong relationship. Additionally, Ba and Al show a mild correlation. On the other hand, the strongest negative relationship is between Si and RI, suggesting that as one increases, the other tends to decrease.
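
The same reading can be checked without the plot by ranking the predictor pairs by absolute correlation. This is a sketch in base R, assuming mlbench is installed; cor_pairs is a name introduced here for illustration.

```r
library(mlbench)
data(Glass)

# Pearson correlations among the numeric predictors
cors <- cor(Glass[, sapply(Glass, is.numeric)])

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cors), arr.ind = TRUE)
cor_pairs <- data.frame(
  Var1 = rownames(cors)[idx[, "row"]],
  Var2 = colnames(cors)[idx[, "col"]],
  r    = round(cors[idx], 2),
  stringsAsFactors = FALSE
)

# Rank pairs by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
rownames(cor_pairs) <- NULL
print(head(cor_pairs, 5))
```

Sorting on the absolute value keeps strong negative pairs, such as Si and RI, near the top alongside the positive ones.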

b. Do there appear to be any outliers in the data? Are any predictors skewed?

# Reshape to long format for faceting (Glass has no log_RI or sqrt_Ca columns to exclude)
Glass_melted <- melt(Glass, id.vars = "Type")

ggplot(Glass_melted, aes(x = value, fill = variable)) +
  geom_histogram(color = "black", bins = 30) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distribution of Predictor Variables", x = "Value", y = "Frequency") +
  theme_minimal()

Yes to both. As noted in part (a), Na, Al, and Si are roughly symmetric, while the remaining predictors are skewed. K, Ba, Ca, Fe, and RI are right-skewed (sample skewness of 6.46, 3.37, 2.02, 1.73, and 1.60, as computed in part (c)), and Mg is left-skewed (-1.14) because of its cluster of zero values. K, Ba, and Fe also contain a few values far above the bulk of the data; these are candidate outliers that could influence model performance. Logarithmic or square root transformations may reduce the right skew of these predictors.
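
To put rough numbers on the outlier question, one option is the boxplot rule, which flags points more than 1.5 times the IQR beyond the hinges. This is a sketch under that assumption, not a formal outlier test; count_outliers is a helper name introduced here.

```r
library(mlbench)
data(Glass)

# Numeric predictors only (drops the Type factor)
predictors <- Glass[, sapply(Glass, is.numeric)]

# Count points beyond the boxplot whiskers (1.5 * IQR from the hinges)
count_outliers <- function(x) length(boxplot.stats(x)$out)

outlier_counts <- sapply(predictors, count_outliers)
print(sort(outlier_counts, decreasing = TRUE))
```

For a predictor that is mostly zero, as Ba appears to be, the IQR can itself be zero, in which case this rule flags every non-zero value. Those counts are better read as "non-zero samples" than as outliers.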

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

# Remove non-numeric columns
Glass_numeric <- Glass %>% select_if(is.numeric)

# Calculate skewness before transformation
skewness_before <- sapply(Glass_numeric, e1071::skewness)

# Apply candidate transformations to reduce skewness
# (the Box-Cox estimates that motivated them are not shown in this section)
Glass_transformed <- Glass_numeric %>%
  mutate(
    log_Ba = log1p(Ba),
    log_Fe = log1p(Fe),
    log_K = log1p(K),
    log_Mg = log1p(Mg),  # Mg is a non-negative percentage, so abs() is not needed
    inv_RI = 1 / (RI^2),
    sqrt_Al = sqrt(Al),
    sq_Si = Si^2
  )

# Calculate skewness after transformation
skewness_after <- sapply(Glass_transformed, e1071::skewness)

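# Combine before and after skewness in a table
# NOTE: skewness_after[names(skewness_before)] selects the untransformed columns,
# so both columns of the table report the same values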
skewness_comparison <- data.frame(
  Variable = names(skewness_before),
  Skewness_Before = round(skewness_before, 4),
  Skewness_After = round(skewness_after[names(skewness_before)], 4)
)

# Print skewness comparison table
print(skewness_comparison)

##    Variable Skewness_Before Skewness_After
## RI       RI          1.6027         1.6027
## Na       Na          0.4478         0.4478
## Mg       Mg         -1.1365        -1.1365
## Al       Al          0.8946         0.8946
## Si       Si         -0.7202        -0.7202
## K         K          6.4601         6.4601
## Ca       Ca          2.0184         2.0184
## Ba       Ba          3.3687         3.3687
## Fe       Fe          1.7298         1.7298

# Melt transformed dataset for better visualization
Glass_transformed_melted <- melt(Glass_transformed)

## No id variables; using all as measure variables

# Density Plot to visualize transformed distributions
ggplot(Glass_transformed_melted, aes(x = value, fill = variable)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Density Plot of Predictor Variables (After Transformation)", x = "Value", y = "Density") +
  theme_minimal()

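Several predictors are strongly skewed, which could hurt models that are sensitive to predictor distributions. As candidate remedies, log1p transformations were applied to Ba, Fe, K, and Mg, a square root to Al, an inverse square to RI, and a square to Si; Ca and Na were left unchanged. The skewness table above cannot confirm the effect of these choices, because its two columns are identical: indexing skewness_after by the original names returns the untransformed columns. The density plots are therefore the evidence to check. Two caveats apply. Ba and Fe are mostly zero, so a log transformation compresses their right tails but leaves the spike at zero in place. Mg is left-skewed, and a concave transformation such as a log is expected to make it more left-skewed, not less. With those caveats, the transformations may still give a more balanced set of predictors and could improve classification accuracy.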
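
A like-for-like comparison needs each predictor paired with its own transformed version. The sketch below does that in base R. The transforms list and the skew() helper are names introduced here for illustration, and skew() is written to match what I understand to be the default (type 3) estimator of e1071::skewness, which is an assumption worth verifying against that package.

```r
library(mlbench)
data(Glass)

# Numeric predictors only
glass_num <- Glass[, sapply(Glass, is.numeric)]

# Sample skewness: third central moment over the cubed sample sd
# (assumed equivalent to e1071::skewness with its default type)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / sd(x)^3
}

# Transformations from the section above, keyed by the predictor they act on
transforms <- list(
  Ba = log1p,
  Fe = log1p,
  K  = log1p,
  Mg = log1p,
  RI = function(x) 1 / x^2,
  Al = sqrt,
  Si = function(x) x^2
)

# Skewness of each predictor before and after its own transformation
vars <- names(transforms)
skew_pairs <- data.frame(
  Variable = vars,
  Before   = round(sapply(vars, function(v) skew(glass_num[[v]])), 4),
  After    = round(sapply(vars, function(v) skew(transforms[[v]](glass_num[[v]]))), 4),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(skew_pairs)
```

Values of After closer to zero than Before indicate that a transformation helped; a value that moves further from zero suggests that transformation should be dropped or replaced.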

3.2 The soybean dataset can be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The dataset contains 35 predictors that are mostly categorical, including information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data(Soybean)
## See ?Soybean for details

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# Load necessary libraries
library(mlbench)
library(ggplot2)
data(Soybean)

# Names of the categorical (factor) columns
categorical_vars <- names(Filter(is.factor, Soybean))

# One bar chart per categorical column
for (var in categorical_vars) {
  p <- ggplot(Soybean, aes(x = .data[[var]])) +  # .data[[var]] replaces the deprecated aes_string()
    geom_bar(fill = "lightblue", color = "black") +
    labs(title = paste("Frequency Distribution of", var), x = var, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p)
}
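
The bar charts can be complemented with a numeric screen for degenerate (near-zero variance) predictors. The sketch below computes, for each predictor, the ratio of the most common level to the second most common and the percentage of unique values. The cutoffs used (a frequency ratio above 20 and fewer than 10% unique values) are a common rule of thumb and are an assumption here, as are the helper names freq_ratio and pct_unique.

```r
library(mlbench)
data(Soybean)

# Predictors only (drop the outcome)
predictors <- Soybean[, names(Soybean) != "Class"]

# Frequency ratio: count of the most common level over the second most common
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Percentage of distinct observed values relative to the number of rows
pct_unique <- function(x) 100 * length(unique(x[!is.na(x)])) / length(x)

degeneracy <- data.frame(
  Predictor = names(predictors),
  FreqRatio = round(sapply(predictors, freq_ratio), 2),
  PctUnique = round(sapply(predictors, pct_unique), 2),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Flag predictors that exceed both assumed cutoffs
flagged <- degeneracy$Predictor[degeneracy$FreqRatio > 20 & degeneracy$PctUnique < 10]

print(head(degeneracy[order(-degeneracy$FreqRatio), ], 5))
print(flagged)
```

Any predictor listed in flagged is dominated by a single level and is a candidate for removal before fitting models that are sensitive to such predictors.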

b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Count missing values per column (needs dplyr and tidyr)
library(dplyr)
library(tidyr)
missing_summary <- Soybean %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  arrange(desc(Missing_Count))

# Print missing summary
knitr::kable(missing_summary, caption = "Count of Missing Values per Predictor")

Count of Missing Values per Predictor
Variable	Missing_Count
hail	121
sever	121
seed.tmt	121
lodging	121
germ	112
leaf.mild	108
fruiting.bodies	106
fruit.spots	106
seed.discolor	106
shriveling	106
leaf.shread	100
seed	92
mold.growth	92
seed.size	92
leaf.halo	84
leaf.marg	84
leaf.size	84
leaf.malf	84
fruit.pods	84
precip	38
stem.cankers	38
canker.lesion	38
ext.decay	38
mycelium	38
int.discolor	38
sclerotia	38
plant.stand	36
roots	31
temp	30
crop.hist	16
plant.growth	16
stem	16
date	1
area.dam	1
Class	0
leaves	0

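About 18% of the samples have missing values, but the predictors are not equally affected. hail, sever, seed.tmt, and lodging are the most incomplete, each missing in 121 of 683 rows (about 17.7%), while Class and leaves have no missing values and date and area.dam have only one each. The per-predictor counts alone cannot show whether missingness is related to the classes. However, groups of predictors share exactly the same counts (121, 106, 92, 84, and 38), which suggests that the same rows are missing whole sets of predictors. The pattern should therefore be checked by tabulating incomplete rows by Class before concluding that it is unrelated to the outcome.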
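
One way to check the relationship with the outcome is to count, for each class, how many rows have at least one missing value. This is a sketch in base R, assuming mlbench is installed; by_class is a name introduced here for illustration.

```r
library(mlbench)
data(Soybean)

# TRUE for rows that have at least one missing value
has_missing <- !complete.cases(Soybean)

# Rows per class, and incomplete rows per class (all 19 levels are kept)
by_class <- data.frame(
  Class      = levels(Soybean$Class),
  Rows       = as.vector(table(Soybean$Class)),
  Incomplete = as.vector(table(Soybean$Class[has_missing])),
  stringsAsFactors = FALSE
)
by_class$PctIncomplete <- round(100 * by_class$Incomplete / by_class$Rows, 1)

# Show only the classes that contain incomplete rows
print(by_class[by_class$Incomplete > 0, ])
```

If the incomplete rows fall in only a few classes, missingness is informative about the outcome, and that should shape the strategy chosen in part (c).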

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

# Step 1: Check for Missing Data
cat("Summary of missing values per column:\n")

## Summary of missing values per column:

missing_values <- colSums(is.na(Soybean))
print(missing_values)

##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

# Step 2: Remove Columns with Too Many Missing Values (Threshold: 40%)
threshold <- 0.4 * nrow(Soybean)  # Define threshold (40% missing)
Soybean <- Soybean[, colSums(is.na(Soybean)) < threshold]
cat("\nColumns retained after removing high-missing predictors:\n")

## 
## Columns retained after removing high-missing predictors:

print(names(Soybean))

##  [1] "Class"           "date"            "plant.stand"     "precip"         
##  [5] "temp"            "hail"            "crop.hist"       "area.dam"       
##  [9] "sever"           "seed.tmt"        "germ"            "plant.growth"   
## [13] "leaves"          "leaf.halo"       "leaf.marg"       "leaf.size"      
## [17] "leaf.shread"     "leaf.malf"       "leaf.mild"       "stem"           
## [21] "lodging"         "stem.cankers"    "canker.lesion"   "fruiting.bodies"
## [25] "ext.decay"       "mycelium"        "int.discolor"    "sclerotia"      
## [29] "fruit.pods"      "fruit.spots"     "seed"            "mold.growth"    
## [33] "seed.discolor"   "seed.size"       "shriveling"      "roots"

# Step 3: Handle Missing Data
mode_impute <- function(x) {
  if (any(is.na(x))) {
    x[is.na(x)] <- names(sort(table(x), decreasing=TRUE)[1])
  }
  return(x)
}
Soybean[] <- lapply(Soybean, mode_impute)

# Step 4: Verify that all missing values have been handled
cat("\nFinal check for missing values (should be 0):\n")

## 
## Final check for missing values (should be 0):

print(sum(is.na(Soybean)))  # Should print 0 if all missing values are handled

## [1] 0

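To address missing data in the Soybean dataset, I first counted the missing values per predictor. A 40% threshold was set for dropping predictors, but the most incomplete predictors are missing in only 121 of 683 rows (about 17.7%), so none were removed and all 36 columns were retained. The remaining missing values were filled by mode imputation, replacing each missing entry with the most frequent category of its column. A final check confirmed that the dataset contains zero missing values. Mode imputation is simple, but it inflates the most common level of each predictor, so it may blur differences between classes if missingness is related to the class.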
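
The effect of the 40% cutoff can be verified on a fresh copy of the data by comparing each column's share of missing values against it. This is a small sketch in base R; missing_share, cutoff, and to_drop are names introduced here.

```r
library(mlbench)
data(Soybean)

# Share of missing values in each column
missing_share <- colMeans(is.na(Soybean))

# Columns at or above the 40% cutoff used in Step 2
cutoff  <- 0.4
to_drop <- names(missing_share)[missing_share >= cutoff]

# The five most incomplete columns, in percent
print(round(100 * sort(missing_share, decreasing = TRUE)[1:5], 1))

# An empty result means no column reaches the cutoff
print(to_drop)
```

A lower cutoff would be needed for Step 2 to remove anything; whether dropping is preferable to imputation is a judgment call that depends on the model.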