Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your RPubs link along with a .pdf of your run code.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
# Load required packages (tidyverse covers dplyr, tidyr, ggplot2, and stringr)
library(mlbench)
library(tidyverse)
# Load the Glass dataset
data("Glass")
glass_data <- as.data.frame(Glass)
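A quick structural check (predictor types and per-class counts) helps verify that the data loaded as described:
# Sanity check: dimensions, predictor types, and class balance
str(glass_data)
table(glass_data$Type)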
# Plot histograms
glass_data %>%
  gather(key = "Feature", value = "Value", -Type) %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "darkblue", alpha = 0.6) +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Features in Glass Dataset", x = "Value", y = "Count")
# Boxplots of each feature by glass type
glass_data %>%
  gather(key = "Feature", value = "Value", -Type) %>%
  ggplot(aes(x = as.factor(Type), y = Value)) +
  geom_boxplot() +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal() +
  labs(title = "Boxplot of Features by Glass Type", x = "Glass Type", y = "Value")
High correlations between predictors can lead to multicollinearity, making coefficient estimates in models unstable; we may need to remove highly correlated variables or use dimensionality reduction to eliminate the collinearity. Variables with low correlations to the others, like Fe and Ba, may add unique predictive power to the model. Strong correlations help identify which variables might be redundant or need special handling.
The correlation heatmap below lets us sort the predictor pairs into three rough groups: moderate to high positive correlations, moderate to strong negative correlations, and low to no correlation.
# Calculate the correlation matrix over the numeric predictors
library(corrplot)
cor_matrix <- cor(glass_data %>% select_if(is.numeric))
# Plot correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.6,
         title = "Correlation Heatmap of Glass Dataset Predictors",
         addCoef.col = "white",
         number.cex = 0.7, mar = c(0, 0, 1, 0),
         col = colorRampPalette(c("blue", "white", "red"))(200))
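As a follow-up, highly correlated predictors can be flagged automatically; a minimal sketch using caret's findCorrelation, assuming caret is installed and using a 0.75 cutoff (a tunable choice, not something dictated by the data):
# Flag predictors whose pairwise correlations exceed the cutoff (caret)
library(caret)
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)
high_cor  # candidates for removal before modeling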
Address Skewness with Log Transformations
Highly skewed variables in the dataset (e.g., Ba, Fe, K, Mg) may benefit from a log transformation to make their distributions more normal-like, improving model performance. In practice, though, the log transformation appears to have limited impact on the skewness of these variables and may not improve the model.
# Apply log transformation; the +1 offset handles zero values
glass_data <- glass_data %>%
  mutate(
    Ba_log = log(Ba + 1),
    Fe_log = log(Fe + 1),
    K_log = log(K + 1)
  )
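To quantify the claim that the log transform has limited impact, we can compare skewness before and after; a short sketch, assuming the e1071 package is available for its skewness() function:
# Skewness before vs. after the log transform (e1071 assumed installed)
library(e1071)
sapply(glass_data[, c("Ba", "Fe", "K", "Ba_log", "Fe_log", "K_log")], skewness)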
# Combine before and after log transformation in one plot
glass_data %>%
  gather(key = "Feature", value = "Value", Ba, Fe, K, Ba_log, Fe_log, K_log) %>%
  mutate(
    Transformation = ifelse(str_detect(Feature, "_log"), "After Log", "Before Log"),
    BaseFeature = str_replace(Feature, "_log", "")
  ) %>%
  ggplot(aes(x = Value, fill = Transformation)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  facet_grid(rows = vars(Transformation), cols = vars(BaseFeature), scales = "free") +
  theme_minimal() +
  labs(title = "Before and After Log Transformation", x = "Value", y = "Count") +
  scale_fill_manual(values = c("darkblue", "orange"))
Normalize or Standardize
Variables like Na, Ca, and RI have different ranges and scales. Standardization (mean = 0, standard deviation = 1) or normalization (scaling to a fixed range) ensures that predictors contribute comparably to the model. As the plots below show, standardization shifts and rescales the data substantially for some variables, potentially improving the performance of scale-sensitive models.
# Select original variables for standardization
numeric_vars <- glass_data %>% select(Ba, Fe, K, Na, Mg, Al, Si, Ca, RI)
# Standardize
glass_data_scaled <- as.data.frame(scale(numeric_vars))
colnames(glass_data_scaled) <- paste0(colnames(glass_data_scaled), "_scaled")
# Bind original and scaled values
glass_data <- bind_cols(glass_data, glass_data_scaled)
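An equivalent centering and scaling step can be done with caret's preProcess, which is convenient when the same transformation must later be applied to new data; a minimal sketch, assuming caret is available:
# Center and scale with caret; predict() applies the stored transformation
library(caret)
pp <- preProcess(numeric_vars, method = c("center", "scale"))
glass_scaled_caret <- predict(pp, numeric_vars)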
# Plot combined values
glass_data %>%
  gather(key = "Feature", value = "Value",
         Ba, Fe, K, Na, Mg, Al, Si, Ca, RI,
         Ba_scaled, Fe_scaled, K_scaled, Na_scaled, Mg_scaled,
         Al_scaled, Si_scaled, Ca_scaled, RI_scaled) %>%
  mutate(Transformation = ifelse(str_detect(Feature, "_scaled"),
                                 "After Standardization", "Before Standardization"),
         BaseFeature = str_remove(Feature, "_scaled")) %>%
  ggplot(aes(x = Value, fill = Transformation)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  facet_wrap(~ BaseFeature, scales = "free") +
  theme_minimal() +
  labs(title = "Before and After Standardization", x = "Value", y = "Count") +
  scale_fill_manual(values = c("darkblue", "red"))
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The Class plot shows the proportion of missing and available data for each class in the soybean dataset. Most classes have a very high percentage of available data (near or at 100%). The few exceptions are charcoal rot, diaporthe-stem-canker, and phyllosticta-leaf-spot, which have higher proportions of missing data (over 50%). This suggests that some classes contain more incomplete observations, which could potentially affect modeling performed on those specific classes.
# Load the Soybean dataset
data("Soybean")
soybean_data <- as.data.frame(Soybean)
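A quick check of the dimensions and the outcome confirms the description above:
# Confirm dimensions and the number of outcome classes
dim(soybean_data)            # 683 rows, 36 columns (35 predictors + Class)
nlevels(soybean_data$Class)  # 19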
# Calculate the missingness for each column by Class
missing_summary <- soybean_data %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(-Class, names_to = "Variable", values_to = "Missing_Percentage")
# Calculate available data percentage and reshape for plotting
missing_summary <- missing_summary %>%
  mutate(Available_Percentage = 1 - Missing_Percentage) %>%
  pivot_longer(cols = c("Missing_Percentage", "Available_Percentage"),
               names_to = "Data_Type", values_to = "Percentage")
# Labels for "Data_Type"
missing_summary$Data_Type <- factor(missing_summary$Data_Type,
                                    levels = c("Available_Percentage", "Missing_Percentage"),
                                    labels = c("Available Data", "Missing Data"))
# Plot 100% stacked bar chart
ggplot(missing_summary, aes(x = Class, y = Percentage, fill = Data_Type)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Percentage of Missing and Available Data by Class in Soybean Dataset",
       x = "Class", y = "Percentage") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
    axis.text.y = element_text(size = 8),
    axis.title = element_text(size = 8),
    plot.title = element_text(size = 12)
  ) +
  scale_fill_manual(values = c("Available Data" = "darkblue", "Missing Data" = "lightgray"))
The distributions of the selected predictors in the soybean dataset show several variables with notable missing values (labeled “NA”). For variables like lodging, seed.discolor, and germ, the counts of missing values are relatively high, which may influence the balance of the observed values and impact the analysis of these predictors. Other variables, such as fruit.spots and fruiting.bodies, have fewer missing values, suggesting a more reliable distribution of observed data. In general, the proportion of missing values affects both interpretation and model performance.
These 12 predictors have the highest percentage of missing values.
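Rather than hardcoding the list, the 12 predictors can be derived directly from the per-variable missingness; a short sketch using the tidyverse tools already loaded:
# Rank predictors by the fraction of missing values and keep the top 12
soybean_data %>%
  summarise(across(-Class, ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Pct_Missing") %>%
  arrange(desc(Pct_Missing)) %>%
  slice_head(n = 12)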
# Predictors with the highest percentages of missing values
variables_to_include <- c("hail", "sever", "seed.tmt", "lodging", "germ", "leaf.mild",
                          "fruiting.bodies", "fruit.spots", "seed.discolor", "shriveling",
                          "leaf.shread", "seed")
# Convert the selected columns to character and reshape to long format
soybean_data_long <- soybean_data %>%
  mutate(across(all_of(variables_to_include), as.character)) %>%
  pivot_longer(cols = all_of(variables_to_include),
               names_to = "Variable", values_to = "Value")
# Plot bar charts
ggplot(soybean_data_long, aes(x = Value)) +
  geom_bar(fill = "orange") +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Frequency Distributions of Selected Predictors in the Soybean Dataset",
       x = "Value", y = "Count") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 8),
    strip.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  )
The bar plot shows the percentage of missing data across variables in the soybean dataset. The variables sever, seed.tmt, and lodging have the highest percentages of missing data, each exceeding roughly 17.5%, which could significantly impact analyses and modeling involving them. Variables like area.dam, leaves, and date have almost no missing values, suggesting that they are more complete and reliable for analysis. The varying levels of missingness across predictors mean we need to either impute or exclude data, and we address the variables with higher percentages of missingness to avoid biasing the model.
# Bar plot of missing data per variable, as a percentage (gg_miss_var from naniar)
library(naniar)
gg_miss_var(soybean_data, show_pct = TRUE) +
  labs(title = "Percentage of Missing Data per Variable")
The MCAR (Missing Completely At Random) test checks whether the missing data are randomly distributed without any systematic pattern. Here, the large statistic (3712.484) and a p-value of 0 indicate that the missing data are not MCAR: there is a significant pattern to how the data are missing, and the missingness is likely related to other variables in the dataset. If the data were missing completely at random, we could potentially just remove the incomplete records; instead, the MCAR test points to imputation as the more appropriate way to handle the missing data.
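The result below was presumably produced with naniar's mcar_test; a minimal sketch of such a call, with factor columns coerced to numeric codes so the test statistic can be computed (an assumption about how the original was run):
# Little's MCAR test on the soybean data (naniar)
library(naniar)
soybean_data %>%
  mutate(across(everything(), as.numeric)) %>%  # factor levels -> integer codes
  mcar_test()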
## # A tibble: 1 × 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 3712. 130 0 9
To further check the potential impact of the missing data, we plot a missing data correlation matrix. A correlation matrix of missing data helps identify patterns in missingness. The strong blue patterns in this matrix indicate strong positive correlations between missing values across variables, suggesting that if one variable has missing data, it is likely that other correlated variables also have missing values. The large clusters of missingness may imply that the missing data is not random and is associated with certain variables.
# Create missing indicators for each variable
missing_indicators <- soybean_data %>%
  mutate(across(everything(), ~ as.numeric(is.na(.)), .names = "miss_{col}"))
# Keep only the indicator columns, dropping miss_Class and miss_leaves
# (those columns have no missing values, and zero variance breaks cor())
missing_indicators_numeric <- missing_indicators %>%
  select(starts_with("miss_")) %>%
  select(-miss_Class, -miss_leaves)
# Compute correlation matrix for missingness
cor_matrix <- cor(missing_indicators_numeric)
# Plot the correlation matrix
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.6,
         cl.cex = 0.6,
         number.cex = 0.6)
We use K-Nearest Neighbors (K-NN) imputation to fill in the missing values in the soybean dataset. K-NN considers the similarity between observations, making it suitable for preserving patterns in a classification model. By using the closest neighbors to fill in gaps, K-NN maintains the relationships between features and the target class.
# Perform kNN imputation (kNN from VIM); imp_var = FALSE drops the indicator columns
library(VIM)
soybean_data_imputed <- kNN(soybean_data, k = 5, imp_var = FALSE)
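A quick sanity check confirms that the imputation filled every gap:
# After kNN imputation no missing values should remain
sum(is.na(soybean_data_imputed))  # expected: 0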
The plot shows the distribution of imputed values across the variables in the soybean dataset; each red point represents a newly imputed value. The visualization helps identify unusual patterns or clusters that may indicate the imputation process introduced bias or inconsistencies. Clusters high on the y-axis (e.g., values 3 or 4) may suggest that imputed values are concentrated at higher levels, reflecting patterns in the original data or possible outliers. The clusters around zero imply that a significant number of missing values were filled with low values, indicating that zero (or near-zero) values were common among the nearest neighbors found by the k-NN algorithm.
# Build a dataset containing only the imputed values:
# blank out every cell that was observed in the original data
diff_data <- soybean_data_imputed
diff_data[!is.na(soybean_data)] <- NA
# Convert all columns to character
diff_data_char <- diff_data %>%
  mutate(across(everything(), as.character))
# Reshape to long format, keeping only the imputed cells
diff_long <- diff_data_char %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  drop_na(Value)
# Plot the imputed values
ggplot(diff_long, aes(x = Variable, y = Value)) +
  geom_jitter(alpha = 0.7, color = "red", size = 0.5) +  # small points to limit overplotting
  labs(title = "Imputed Values in the Soybean Dataset", x = "Variable", y = "Imputed Value") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    axis.text.y = element_text(size = 8),
    axis.title = element_text(size = 10),
    plot.title = element_text(size = 12)
  )
This plot compares the distributions of original and imputed data for selected variables in the soybean dataset. For most variables, the imputed values closely resemble the original distribution, but there are some differences (e.g., lodging and seed.tmt) where imputation shifts the frequencies. Such changes in distribution may affect the predictive power of a variable: imputed values can alter relationships between predictors and the target class and introduce bias or inaccuracies into the model.
Again, we focus on the same 12 predictors with the highest percentages of missing values.
# Convert all columns to character
soybean_data_char <- soybean_data %>%
  mutate(across(everything(), as.character))
soybean_data_imputed_char <- soybean_data_imputed %>%
  mutate(across(everything(), as.character))
# Combine original and imputed datasets in long format
original_long <- soybean_data_char %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  mutate(Source = "Original")
imputed_long <- soybean_data_imputed_char %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  mutate(Source = "Imputed")
combined_long <- bind_rows(original_long, imputed_long)
# Restrict the combined data to the 12 high-missingness predictors
variables_to_include <- c("hail", "sever", "seed.tmt", "lodging", "germ", "leaf.mild",
                          "fruiting.bodies", "fruit.spots", "seed.discolor", "shriveling",
                          "leaf.shread", "seed")
filtered_combined_long <- combined_long %>%
  filter(Variable %in% variables_to_include)
# Plot bar charts to compare distributions
ggplot(filtered_combined_long, aes(x = Value, fill = Source)) +
  geom_bar(position = "dodge", alpha = 0.7) +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Comparison of Original and Imputed Data Distributions for Selected Variables",
       x = "Value", y = "Count") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8),
    strip.text = element_text(size = 8)
  ) +
  scale_fill_manual(values = c("Original" = "orange", "Imputed" = "red"))