Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your RPubs link along with a .pdf of your run code.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
# Load required packages (tidyverse covers dplyr, tidyr, ggplot2, and stringr)
library(mlbench)
library(tidyverse)
# Load the Glass dataset
data("Glass")
glass_data <- as.data.frame(Glass)
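A quick structural check (predictor types and per-class counts) helps verify that the data loaded as described:
# Sanity check: dimensions, predictor types, and class balance
str(glass_data)
table(glass_data$Type)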
# Plot histograms
glass_data %>%
  gather(key = "Feature", value = "Value", -Type) %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "darkblue", alpha = 0.6) +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Features in Glass Dataset", x = "Value", y = "Count")
# Boxplots of each feature by glass type
glass_data %>%
  gather(key = "Feature", value = "Value", -Type) %>%
  ggplot(aes(x = as.factor(Type), y = Value)) +
  geom_boxplot() +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal() +
  labs(title = "Boxplot of Features by Glass Type", x = "Glass Type", y = "Value")
High correlations between predictors can lead to multicollinearity, making coefficient estimates in models unstable; we may need to remove highly correlated variables or use dimensionality reduction to eliminate the collinearity. Variables with low correlations to the others, like Fe and Ba, may add unique predictive power to the model. Strong correlations help identify which variables might be redundant or need special handling.
The correlation heatmap below lets us sort the predictor pairs into three rough groups: moderate to high positive correlations, moderate to strong negative correlations, and low to no correlation.
# Calculate the correlation matrix over the numeric predictors
library(corrplot)
cor_matrix <- cor(glass_data %>% select_if(is.numeric))
# Plot correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.6,
         title = "Correlation Heatmap of Glass Dataset Predictors",
         addCoef.col = "white",
         number.cex = 0.7, mar = c(0, 0, 1, 0),
         col = colorRampPalette(c("blue", "white", "red"))(200))
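As a follow-up, highly correlated predictors can be flagged automatically; a minimal sketch using caret's findCorrelation, assuming caret is installed and using a 0.75 cutoff (a tunable choice, not something dictated by the data):
# Flag predictors whose pairwise correlations exceed the cutoff (caret)
library(caret)
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)
high_cor  # candidates for removal before modeling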
Address Skewness with Log Transformations
Highly skewed variables in the dataset (e.g., Ba, Fe, K, Mg) may benefit from a log transformation to make their distributions more normal-like, improving model performance. In practice, though, the log transformation appears to have limited impact on the skewness of these variables and may not improve the model.
# Apply log transformation; the +1 offset handles zero values
glass_data <- glass_data %>%
  mutate(
    Ba_log = log(Ba + 1),
    Fe_log = log(Fe + 1),
    K_log = log(K + 1)
  )
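To quantify the claim that the log transform has limited impact, we can compare skewness before and after; a short sketch, assuming the e1071 package is available for its skewness() function:
# Skewness before vs. after the log transform (e1071 assumed installed)
library(e1071)
sapply(glass_data[, c("Ba", "Fe", "K", "Ba_log", "Fe_log", "K_log")], skewness)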
# Combine before and after log transformation in one plot
glass_data %>%
  gather(key = "Feature", value = "Value", Ba, Fe, K, Ba_log, Fe_log, K_log) %>%
  mutate(
    Transformation = ifelse(str_detect(Feature, "_log"), "After Log", "Before Log"),
    BaseFeature = str_replace(Feature, "_log", "")
  ) %>%
  ggplot(aes(x = Value, fill = Transformation)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  facet_grid(rows = vars(Transformation), cols = vars(BaseFeature), scales = "free") +
  theme_minimal() +
  labs(title = "Before and After Log Transformation", x = "Value", y = "Count") +
  scale_fill_manual(values = c("darkblue", "orange"))
Normalize or Standardize
Variables like Na, Ca, and RI have different ranges and scales. Standardization (mean = 0, standard deviation = 1) or normalization (scaling to a fixed range) ensures that predictors contribute comparably to the model. As the plots below show, standardization shifts and rescales the data substantially for some variables, potentially improving the performance of scale-sensitive models.
# Select original variables for standardization
numeric_vars <- glass_data %>% select(Ba, Fe, K, Na, Mg, Al, Si, Ca, RI)
# Standardize
glass_data_scaled <- as.data.frame(scale(numeric_vars))
colnames(glass_data_scaled) <- paste0(colnames(glass_data_scaled), "_scaled")
# Bind original and scaled values
glass_data <- bind_cols(glass_data, glass_data_scaled)
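An equivalent centering and scaling step can be done with caret's preProcess, which is convenient when the same transformation must later be applied to new data; a minimal sketch, assuming caret is available:
# Center and scale with caret; predict() applies the stored transformation
library(caret)
pp <- preProcess(numeric_vars, method = c("center", "scale"))
glass_scaled_caret <- predict(pp, numeric_vars)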
# Plot combined values
glass_data %>%
  gather(key = "Feature", value = "Value",
         Ba, Fe, K, Na, Mg, Al, Si, Ca, RI,
         Ba_scaled, Fe_scaled, K_scaled, Na_scaled, Mg_scaled,
         Al_scaled, Si_scaled, Ca_scaled, RI_scaled) %>%
  mutate(Transformation = ifelse(str_detect(Feature, "_scaled"),
                                 "After Standardization", "Before Standardization"),
         BaseFeature = str_remove(Feature, "_scaled")) %>%
  ggplot(aes(x = Value, fill = Transformation)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  facet_wrap(~ BaseFeature, scales = "free") +
  theme_minimal() +
  labs(title = "Before and After Standardization", x = "Value", y = "Count") +
  scale_fill_manual(values = c("darkblue", "red"))
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The Class plot shows the proportion of missing and available data for each class in the soybean dataset. Most classes have a very high percentage of available data (near or at 100%). The few exceptions are charcoal rot, diaporthe-stem-canker, and phyllosticta-leaf-spot, which have higher proportions of missing data (over 50%). This suggests that some classes contain more incomplete observations, which could potentially affect modeling performed on those specific classes.
# Load the Soybean dataset
data("Soybean")
soybean_data <- as.data.frame(Soybean)
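A quick check of the dimensions and the outcome confirms the description above:
# Confirm dimensions and the number of outcome classes
dim(soybean_data)            # 683 rows, 36 columns (35 predictors + Class)
nlevels(soybean_data$Class)  # 19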
# Calculate the missingness for each column by Class
missing_summary <- soybean_data %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(-Class, names_to = "Variable", values_to = "Missing_Percentage")
# Calculate available data percentage and reshape for plotting
missing_summary <- missing_summary %>%
  mutate(Available_Percentage = 1 - Missing_Percentage) %>%
  pivot_longer(cols = c("Missing_Percentage", "Available_Percentage"),
               names_to = "Data_Type", values_to = "Percentage")
# Labels for "Data_Type"
missing_summary$Data_Type <- factor(missing_summary$Data_Type,
                                    levels = c("Available_Percentage", "Missing_Percentage"),
                                    labels = c("Available Data", "Missing Data"))
# Plot 100% stacked bar chart
ggplot(missing_summary, aes(x = Class, y = Percentage, fill = Data_Type)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Percentage of Missing and Available Data by Class in Soybean Dataset",
       x = "Class", y = "Percentage") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
    axis.text.y = element_text(size = 8),
    axis.title = element_text(size = 8),
    plot.title = element_text(size = 12)
  ) +
  scale_fill_manual(values = c("Available Data" = "darkblue", "Missing Data" = "lightgray"))
The distributions of the selected predictors in the soybean dataset show several variables with notable missing values (labeled “NA”). For variables like lodging, seed.discolor, and germ, the counts of missing values are relatively high, which may influence the balance of the observed values and impact the analysis of these predictors. Other variables, such as fruit.spots and fruiting.bodies, have fewer missing values, suggesting a more reliable distribution of observed data. In general, the proportion of missing values affects both interpretation and model performance.
These 12 predictors have the highest percentage of missing values.
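Rather than hardcoding the list, the 12 predictors can be derived directly from the per-variable missingness; a short sketch using the tidyverse tools already loaded:
# Rank predictors by the fraction of missing values and keep the top 12
soybean_data %>%
  summarise(across(-Class, ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Pct_Missing") %>%
  arrange(desc(Pct_Missing)) %>%
  slice_head(n = 12)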
# Predictors with the highest percentages of missing values
variables_to_include <- c("hail", "sever", "seed.tmt", "lodging", "germ", "leaf.mild",
                          "fruiting.bodies", "fruit.spots", "seed.discolor", "shriveling",
                          "leaf.shread", "seed")
# Convert the selected columns to character and reshape to long format
soybean_data_long <- soybean_data %>%
  mutate(across(all_of(variables_to_include), as.character)) %>%
  pivot_longer(cols = all_of(variables_to_include),
               names_to = "Variable", values_to = "Value")
# Plot bar charts
ggplot(soybean_data_long, aes(x = Value)) +
  geom_bar(fill = "orange") +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Frequency Distributions of Selected Predictors in the Soybean Dataset",
       x = "Value", y = "Count") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 8),
    strip.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  )
The bar plot shows the percentage of missing data across variables in the soybean dataset. The variables sever, seed.tmt, and lodging have the highest percentages of missing data, each exceeding roughly 17.5%, which could significantly impact analyses and modeling involving them. Variables like area.dam, leaves, and date have almost no missing values, suggesting that they are more complete and reliable for analysis. The varying levels of missingness across predictors mean we need to either impute or exclude data, and we address the variables with higher percentages of missingness to avoid biasing the model.
# Bar plot of missing data per variable, as a percentage (gg_miss_var from naniar)
library(naniar)
gg_miss_var(soybean_data, show_pct = TRUE) +
  labs(title = "Percentage of Missing Data per Variable")
The MCAR (Missing Completely At Random) test checks whether the missing data are randomly distributed without any systematic pattern. Here, the large statistic (3712.484) and a p-value of 0 indicate that the missing data are not MCAR: there is a significant pattern to how the data are missing, and the missingness is likely related to other variables in the dataset. If the data were missing completely at random, we could potentially just remove the incomplete records; instead, the MCAR test points to imputation as the more appropriate way to handle the missing data.
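The result below was presumably produced with naniar's mcar_test; a minimal sketch of such a call, with factor columns coerced to numeric codes so the test statistic can be computed (an assumption about how the original was run):
# Little's MCAR test on the soybean data (naniar)
library(naniar)
soybean_data %>%
  mutate(across(everything(), as.numeric)) %>%  # factor levels -> integer codes
  mcar_test()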
## # A tibble: 1 × 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 3712. 130 0 9
To further check the potential impact of the missing data, we plot a missing data correlation matrix. A correlation matrix of missing data helps identify patterns in missingness. The strong blue patterns in this matrix indicate strong positive correlations between missing values across variables, suggesting that if one variable has missing data, it is likely that other correlated variables also have missing values. The large clusters of missingness may imply that the missing data is not random and is associated with certain variables.
# Create missing indicators for each variable
missing_indicators <- soybean_data %>%
  mutate(across(everything(), ~ as.numeric(is.na(.)), .names = "miss_{col}"))
# Keep only the indicator columns, dropping miss_Class and miss_leaves
# (those columns have no missing values, and zero variance breaks cor())
missing_indicators_numeric <- missing_indicators %>%
  select(starts_with("miss_")) %>%
  select(-miss_Class, -miss_leaves)
# Compute correlation matrix for missingness
cor_matrix <- cor(missing_indicators_numeric)
# Plot the correlation matrix
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.6,
         cl.cex = 0.6,
         number.cex = 0.6)
We use K-Nearest Neighbors (K-NN) imputation to fill in the missing values in the soybean dataset. K-NN considers the similarity between observations, making it suitable for preserving patterns in a classification model. By using the closest neighbors to fill in gaps, K-NN maintains the relationships between features and the target class.
# Perform kNN imputation (kNN from VIM); imp_var = FALSE drops the indicator columns
library(VIM)
soybean_data_imputed <- kNN(soybean_data, k = 5, imp_var = FALSE)
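A quick sanity check confirms that the imputation filled every gap:
# After kNN imputation no missing values should remain
sum(is.na(soybean_data_imputed))  # expected: 0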
The plot shows the distribution of imputed values across the variables in the soybean dataset; each red point represents a newly imputed value. The visualization helps identify unusual patterns or clusters that may indicate the imputation process introduced bias or inconsistencies. Clusters high on the y-axis (e.g., values 3 or 4) may suggest that imputed values are concentrated at higher levels, reflecting patterns in the original data or possible outliers. The clusters around zero imply that a significant number of missing values were filled with low values, indicating that zero (or near-zero) values were common among the nearest neighbors found by the k-NN algorithm.
# Build a dataset containing only the imputed values:
# blank out every cell that was observed in the original data
diff_data <- soybean_data_imputed
diff_data[!is.na(soybean_data)] <- NA
# Convert all columns to character
diff_data_char <- diff_data %>%
  mutate(across(everything(), as.character))
# Reshape to long format, keeping only the imputed cells
diff_long <- diff_data_char %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  drop_na(Value)
# Plot the imputed values
ggplot(diff_long, aes(x = Variable, y = Value)) +
  geom_jitter(alpha = 0.7, color = "red", size = 0.5) +  # small points to limit overplotting
  labs(title = "Imputed Values in the Soybean Dataset", x = "Variable", y = "Imputed Value") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    axis.text.y = element_text(size = 8),
    axis.title = element_text(size = 10),
    plot.title = element_text(size = 12)
  )
This plot compares the distributions of original and imputed data for selected variables in the soybean dataset. For most variables, the imputed values closely resemble the original distribution, but there are some differences (e.g., lodging and seed.tmt) where imputation shifts the frequencies. Such changes in distribution may affect the predictive power of a variable: imputed values can alter relationships between predictors and the target class and introduce bias or inaccuracies into the model.
Again, we focus on the same 12 predictors with the highest percentages of missing values.
# Convert all columns to character
soybean_data_char <- soybean_data %>%
  mutate(across(everything(), as.character))
soybean_data_imputed_char <- soybean_data_imputed %>%
  mutate(across(everything(), as.character))
# Combine original and imputed datasets in long format
original_long <- soybean_data_char %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  mutate(Source = "Original")
imputed_long <- soybean_data_imputed_char %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  mutate(Source = "Imputed")
combined_long <- bind_rows(original_long, imputed_long)
# Restrict the combined data to the 12 high-missingness predictors
variables_to_include <- c("hail", "sever", "seed.tmt", "lodging", "germ", "leaf.mild",
                          "fruiting.bodies", "fruit.spots", "seed.discolor", "shriveling",
                          "leaf.shread", "seed")
filtered_combined_long <- combined_long %>%
  filter(Variable %in% variables_to_include)
# Plot bar charts to compare distributions
ggplot(filtered_combined_long, aes(x = Value, fill = Source)) +
  geom_bar(position = "dodge", alpha = 0.7) +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Comparison of Original and Imputed Data Distributions for Selected Variables",
       x = "Value", y = "Count") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8),
    strip.text = element_text(size = 8)
  ) +
  scale_fill_manual(values = c("Original" = "orange", "Imputed" = "red"))