The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. Excersize 1
library(mlbench)
library(ggplot2)
library(reshape2)
library(dplyr)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
par(mfrow = c(3, 3))
for(i in 1:9) {
hist(Glass[, i],
main = paste("Histogram of", names(Glass)[i]),
xlab = names(Glass)[i],
col = "lightblue")
}
The histograms reveal different patterns of concentration distributions for each element, with some elements showing symmetrical distributions and others exhibiting skewness.
# Boxplots for each variable
par(mfrow = c(3, 3)) # Reset plotting layout
for(i in 1:9) {
boxplot(Glass[, i],
main = paste("Boxplot of", names(Glass)[i]),
xlab = names(Glass)[i],
col = "lightgreen")
}
pairs(Glass[, 1:9],
main = "Scatterplot Matrix of Predictors",
col = as.factor(Glass$type))
This matrix illustrates the relationships among different variables, which are represented along the axes: Fe, Ba, Ca, K, Si, Al, Mg, Na, and RI. Each cell in the matrix features a scatterplot that demonstrates the correlation between two variables. The values along the axes correspond to the measurements of these variables. This matrix is utilized to visualize correlations among multiple variables simultaneously. For instance, the cell at the intersection of “Fe” and “Ba” displays a scatterplot of Fe values plotted against Ba values, allowing for a visual assessment of their correlation. The empty cells indicate that the relevant scatterplots are not included in this matrix. Overall, the matrix is a significant tool for exploratory
cor_mat <- cor(Glass[, 1:9])
melted_cor <- melt(cor_mat)
ggplot(melted_cor, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1)) +
coord_fixed() +
ggtitle("Correlation Heatmap of Predictors")
#Enhanced pairwise plot using GGally
library(GGally)
ggpairs(Glass, columns = 1:9,
aes(color = type), data = Glass) +
ggtitle("GGpairs: Pairwise Plot of Predictors")
The pairwise plot visualizes the relationships between all pairs of predictors in the glass dataset, showing correlations and distributions to aid in understanding data structure and potential multicollinearity.
b. Do there appear to be any outliers in the data? Are any predictors skewed?
Yes, several predictors are skewed. RI, Mg,
Al, Si, K, Ca,
Ba, and Fe show significant skewness. Na shows
mild skewness.
#Check skewness
Glass %>%
summarize(across(where(is.numeric),
list(skewness = ~ moments::skewness(.x, na.rm = TRUE))))
## RI_skewness Na_skewness Mg_skewness Al_skewness Si_skewness K_skewness
## 1 1.614015 0.4509917 -1.144465 0.9009179 -0.7253173 6.505636
## Ca_skewness Ba_skewness Fe_skewness
## 1 2.032677 3.392431 1.742007
c. Are there any relevant transformations of one or more predictors that might improve the classification model?
There are several transformations that can be applied to the
predictors to improve the classification model, especially when dealing
with skewed data. For positively skewed RI, K,
Ca, Ba, and Fe logarithmic
transformation log(X) or the square root
sqrt(X) can be helpful. For negatively skewed
Mg the inverse power transformation 1/X can
help.
par(mfrow = c(3, 2), mar = c(4, 3, 3, 1) + 0.1)
trans_vars <- c("RI", "K", "Ca", "Ba", "Fe")
for(var in trans_vars) {
Glass[[paste0("log_", var)]] <- log(Glass[[var]])
Glass[[paste0("sqrt_", var)]] <- sqrt(Glass[[var]])
}
# Visualize the transformed variables
for(var in trans_vars) {
hist(Glass[[paste0("log_", var)]],
main = paste("Histogram of log(", var, ")", sep = ""),
xlab = paste("log(", var, ")", sep = ""),
col = "lightgreen")
# Histogram for square root transformation
hist(Glass[[paste0("sqrt_", var)]],
main = paste("Histogram of sqrt(", var, ")", sep = ""),
xlab = paste("sqrt(", var, ")", sep = ""),
col = "lightcoral")
}
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
predictor_names <- setdiff(names(Soybean), "Class")
for(name in predictor_names) {
cat("\nFrequency for", name, ":\n")
print(table(Soybean[[name]], useNA = "ifany"))
}
##
## Frequency for date :
##
## 0 1 2 3 4 5 6 <NA>
## 26 75 93 118 131 149 90 1
##
## Frequency for plant.stand :
##
## 0 1 <NA>
## 354 293 36
##
## Frequency for precip :
##
## 0 1 2 <NA>
## 74 112 459 38
##
## Frequency for temp :
##
## 0 1 2 <NA>
## 80 374 199 30
##
## Frequency for hail :
##
## 0 1 <NA>
## 435 127 121
##
## Frequency for crop.hist :
##
## 0 1 2 3 <NA>
## 65 165 219 218 16
##
## Frequency for area.dam :
##
## 0 1 2 3 <NA>
## 123 227 145 187 1
##
## Frequency for sever :
##
## 0 1 2 <NA>
## 195 322 45 121
##
## Frequency for seed.tmt :
##
## 0 1 2 <NA>
## 305 222 35 121
##
## Frequency for germ :
##
## 0 1 2 <NA>
## 165 213 193 112
##
## Frequency for plant.growth :
##
## 0 1 <NA>
## 441 226 16
##
## Frequency for leaves :
##
## 0 1
## 77 606
##
## Frequency for leaf.halo :
##
## 0 1 2 <NA>
## 221 36 342 84
##
## Frequency for leaf.marg :
##
## 0 1 2 <NA>
## 357 21 221 84
##
## Frequency for leaf.size :
##
## 0 1 2 <NA>
## 51 327 221 84
##
## Frequency for leaf.shread :
##
## 0 1 <NA>
## 487 96 100
##
## Frequency for leaf.malf :
##
## 0 1 <NA>
## 554 45 84
##
## Frequency for leaf.mild :
##
## 0 1 2 <NA>
## 535 20 20 108
##
## Frequency for stem :
##
## 0 1 <NA>
## 296 371 16
##
## Frequency for lodging :
##
## 0 1 <NA>
## 520 42 121
##
## Frequency for stem.cankers :
##
## 0 1 2 3 <NA>
## 379 39 36 191 38
##
## Frequency for canker.lesion :
##
## 0 1 2 3 <NA>
## 320 83 177 65 38
##
## Frequency for fruiting.bodies :
##
## 0 1 <NA>
## 473 104 106
##
## Frequency for ext.decay :
##
## 0 1 2 <NA>
## 497 135 13 38
##
## Frequency for mycelium :
##
## 0 1 <NA>
## 639 6 38
##
## Frequency for int.discolor :
##
## 0 1 2 <NA>
## 581 44 20 38
##
## Frequency for sclerotia :
##
## 0 1 <NA>
## 625 20 38
##
## Frequency for fruit.pods :
##
## 0 1 2 3 <NA>
## 407 130 14 48 84
##
## Frequency for fruit.spots :
##
## 0 1 2 4 <NA>
## 345 75 57 100 106
##
## Frequency for seed :
##
## 0 1 <NA>
## 476 115 92
##
## Frequency for mold.growth :
##
## 0 1 <NA>
## 524 67 92
##
## Frequency for seed.discolor :
##
## 0 1 <NA>
## 513 64 106
##
## Frequency for seed.size :
##
## 0 1 <NA>
## 532 59 92
##
## Frequency for shriveling :
##
## 0 1 <NA>
## 539 38 106
##
## Frequency for roots :
##
## 0 1 2 <NA>
## 551 86 15 31
par(mfrow=c(2,2))
for(name in predictor_names) { # Plotting the first 4 predictors for demonstration
counts <- table(Soybean[[name]])
barplot(counts, main=paste("Distribution of", name), col="skyblue")
}
b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
#Check the missing data
sum(is.na(Soybean))
## [1] 2337
# Calculate missing data per variable
missing_counts <- sapply(Soybean, function(x) sum(is.na(x)))
missing_percentages <- missing_counts / nrow(Soybean)*100
print(missing_percentages)
## Class date plant.stand precip temp
## 0.0000000 0.1464129 5.2708638 5.5636896 4.3923865
## hail crop.hist area.dam sever seed.tmt
## 17.7159590 2.3426061 0.1464129 17.7159590 17.7159590
## germ plant.growth leaves leaf.halo leaf.marg
## 16.3982430 2.3426061 0.0000000 12.2986823 12.2986823
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 12.2986823 14.6412884 12.2986823 15.8125915 2.3426061
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 17.7159590 5.5636896 5.5636896 15.5197657 5.5636896
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 5.5636896 5.5636896 5.5636896 12.2986823 15.5197657
## seed mold.growth seed.discolor seed.size shriveling
## 13.4699854 13.4699854 15.5197657 13.4699854 15.5197657
## roots
## 4.5387994
The highest likelihood of missing data - with over 17% of the data
missing is in hail, sever,
seed.tmt and lodging.
#Investigate missing data by class for 4 predictors (as an example)
example_predictor <- predictor_names[4]
print(table(Soybean$Class, is.na(Soybean[[example_predictor]])))
##
## FALSE TRUE
## 2-4-d-injury 0 16
## alternarialeaf-spot 91 0
## anthracnose 44 0
## bacterial-blight 20 0
## bacterial-pustule 20 0
## brown-spot 92 0
## brown-stem-rot 44 0
## charcoal-rot 20 0
## cyst-nematode 0 14
## diaporthe-pod-&-stem-blight 15 0
## diaporthe-stem-canker 20 0
## downy-mildew 20 0
## frog-eye-leaf-spot 91 0
## herbicide-injury 8 0
## phyllosticta-leaf-spot 20 0
## phytophthora-rot 88 0
## powdery-mildew 20 0
## purple-seed-stain 20 0
## rhizoctonia-root-rot 20 0
c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
We can use simple approach of imputation using the mode
# Function to impute mode
impute_mode <- function(x) {
if(any(is.na(x))) {
mode_val <- names(sort(table(x), decreasing = TRUE))[1]
x[is.na(x)] <- mode_val
}
return(x)
}
Soybean_imputed <- Soybean
for(name in predictor_names) {
Soybean_imputed[[name]] <- impute_mode(Soybean_imputed[[name]])
}
#Check if missing data remains after imputation
sum(is.na(Soybean_imputed))
## [1] 0
par(mfrow=c(1,2))
barplot(table(Soybean[[example_predictor]], useNA = "ifany"), main = "Before Imputation", col="tomato")
barplot(table(Soybean_imputed[[example_predictor]], useNA = "ifany"), main = "After Imputation", col="seagreen")