The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Do there appear to be any outliers in the data? Are any predictors skewed?
Are there any relevant transformations of one or more predictors that might improve the classification model?
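library(mlbench) # provides the Glass and Soybean data sets
library(tidyverse) # ggplot2, dplyr, and tidyr are used below
data(Glass)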
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.3.0
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
# Calculate the correlation matrix
correlation_matrix <- cor(Glass |> select(-Type))
# Melt the correlation matrix
melted_correlation <- melt(correlation_matrix)
ggplot(melted_correlation, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "green", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Heatmap of Glass Predictors")
# Load necessary libraries
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.0
## corrplot 0.92 loaded
library(dplyr)
# Load the Glass dataset
data("Glass")
# Calculate the correlation matrix
correlation_matrix <- cor(Glass |> select(-Type))
# Create the correlation plot with coefficients overlaid and the color legend on the right
corrplot(correlation_matrix,
method = "circle", # Circle method for visualization
type = "full", # Show the full matrix
order = "hclust", # Hierarchical clustering order
tl.col = "black", # Text label color
tl.srt = 45, # Text label rotation
addCoef.col = "black", # Add correlation coefficients in black
diag = TRUE, # Show the diagonal
cl.pos = "r") # Position color legend to the right
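The correlation plots show the relationships visually; a short table of the strongest pairs can make them easier to read off. The sketch below is one possible way to do that. The helper name `strong_pairs()` and the 0.5 cutoff are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: list predictor pairs whose absolute correlation
# exceeds a cutoff (0.5 is an arbitrary illustrative threshold).
strong_pairs <- function(cor_mat, cutoff = 0.5) {
  # Use only the upper triangle so each pair is reported once
  idx <- which(upper.tri(cor_mat) & abs(cor_mat) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first
  out[order(-abs(out$r)), ]
}

# Apply to the Glass predictors when mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Glass", package = "mlbench")
  glass_cor <- cor(Glass[, names(Glass) != "Type"])
  print(strong_pairs(glass_cor))
}
```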
# Gather data into a long format for easier plotting
glass_long <- Glass %>%
select(RI, Na, Mg, Al, Si, K, Ca, Ba, Fe) %>%
pivot_longer(cols = everything(), names_to = "Element", values_to = "Value")
# Create density plot using facet_wrap
ggplot(glass_long, aes(x = Value, fill = Element)) +
geom_density(color = "black", alpha = 0.5) +
facet_wrap(~ Element, scales = "free", nrow = 3) +
theme_minimal() + # Use a minimal theme for cleaner look
labs(title = "Density Plots of Glass Components", x = "Value", y = "Density") +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
Conclusion: All of the predictor variables are skewed. RI, Na, Al, K, Ca, Ba, and Fe are positively skewed, while Mg and Si are negatively skewed.
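The direction of skew read from the density plots can be checked numerically. The following is a minimal sketch using a moment-based skewness statistic; `sample_skewness()` is a hypothetical helper written for illustration, and packages such as e1071 provide a `skewness()` function with slightly different small-sample conventions.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Positive values indicate a right (positive) skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Apply to each Glass predictor when mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Glass", package = "mlbench")
  predictors <- Glass[, names(Glass) != "Type"]
  print(round(sort(sapply(predictors, sample_skewness)), 2))
}
```

Values near zero suggest a roughly symmetric distribution, so the sign and size of each statistic can be compared against the visual conclusion.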
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.2.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Load the Glass dataset
data("Glass")
# Define the list of numeric variables to plot
numeric_vars <- c("Ba", "Ca", "Fe", "K", "Na", "RI", "Si", "Mg", "Al")
# Create individual box plots for each variable with outliers highlighted in red
plots <- lapply(numeric_vars, function(var) {
  # .data[[var]] replaces the deprecated aes_string()
  ggplot(Glass, aes(y = .data[[var]])) +
    geom_boxplot(fill = "lightblue", color = "black", outlier.colour = "red") +
    ggtitle(paste("Box Plot of", var)) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(y = var)
})
# Arrange the plots in a grid layout
grid.arrange(grobs = plots, nrow = 3, ncol = 3)
The box plots above show that every predictor except Mg has outliers.
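The outliers flagged in the box plots can also be counted. This sketch uses base R's `boxplot.stats()`, which applies the usual 1.5 × IQR rule; `count_outliers()` is a hypothetical wrapper. Because `boxplot.stats()` computes hinges slightly differently from ggplot2's quartiles, the counts may differ a little from the red points in the plots.

```r
# Count points beyond coef * IQR from the hinges (the box plot rule)
count_outliers <- function(x, coef = 1.5) {
  length(boxplot.stats(x, coef = coef)$out)
}

# Apply to each Glass predictor when mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Glass", package = "mlbench")
  predictors <- Glass[, names(Glass) != "Type"]
  print(sapply(predictors, count_outliers))
}
```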
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(MASS) # For Box-Cox transformation
## Warning: package 'MASS' was built under R version 4.3.0
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Part C: I chose the Box-Cox method to normalize Al and Mg. Because Box-Cox estimates the transformation parameter from the data, it is more flexible than a fixed transformation. Both variables showed complex distributions, so this method seemed the most appropriate. Box-Cox requires strictly positive values, and Mg contains zeros, so Mg has to be shifted before it is transformed.
For distributions like RI and K, which have more than one peak, I opted for a square-root transformation to better center the data without heavily changing the shape or slopes of the distribution.
Ca, Ba, and Fe are right-skewed, so I opted for a stronger transformation, the log. However, it did not change the distributions of Ba and Fe much, probably because many of their values are exactly zero.
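For reference, the Box-Cox family maps a strictly positive value x to (x^λ − 1)/λ when λ ≠ 0 and to log(x) when λ = 0. A minimal sketch of that definition follows; `box_cox()` and its `tol` argument are hypothetical names used only for illustration.

```r
# Box-Cox transform for a given lambda.
# Input must be strictly positive; lambda near 0 falls back to log().
box_cox <- function(x, lambda, tol = 1e-8) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (abs(lambda) < tol) {
    log(x)
  } else {
    (x^lambda - 1) / lambda
  }
}

box_cox(c(1, 4, 9), lambda = 0.5)  # square-root-like transform
box_cox(c(1, 4, 9), lambda = 0)    # log transform
```

Handling λ = 0 separately avoids a division by zero if the estimated parameter lands on zero.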
# Box-Cox transformation on Al
lamb.Al <- boxcox((Glass$Al) ~ 1)
# Find optimal lambda for Al
lambda_Al <- lamb.Al$x[which.max(lamb.Al$y)] #0.50
Glass <- Glass %>%
mutate(Al_transformed = (Al^lambda_Al - 1) / lambda_Al)
p4.2 <- ggplot(Glass, aes(x = Al_transformed)) +
geom_density(fill = "lightblue", alpha = 0.5) +
ggtitle("Box-Cox Transformed Al (Aluminum)")
#################
# Mg contains zeros and Box-Cox requires strictly positive values,
# so shift every observation by the same constant (shifting only the
# zeros could reorder the data)
Glass <- Glass %>%
  mutate(Mg_shifted = Mg + 1)
# Box-Cox profile likelihood for the shifted Mg
bc.Mg <- boxcox(Glass$Mg_shifted ~ 1)
# Find the lambda that maximizes the log-likelihood
lambda_Mg <- bc.Mg$x[which.max(bc.Mg$y)]
# Apply the Box-Cox transformation to the shifted Mg
Glass <- Glass %>%
  mutate(Mg_transformed = (Mg_shifted^lambda_Mg - 1) / lambda_Mg)
p3.2 <- ggplot(Glass, aes(x = Mg_transformed)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  ggtitle("Box-Cox Transformed Mg (Magnesium)")
library(ggplot2)
library(gridExtra)
# Transformations
Glass$Ri_sqrt <- sqrt(Glass$RI)
Glass$K_sqrt <- sqrt(Glass$K)
Glass$Ca_log <- log(Glass$Ca + 1)
Glass$Ba_log <- log(Glass$Ba + 1)
Glass$Fe_log <- log(Glass$Fe + 1)
# List of variables and titles for plotting
variables <- c("Ri_sqrt", "K_sqrt", "Ca_log", "Ba_log", "Fe_log")
titles <- c("Ri_sqrt", "K_sqrt", "Ca_log", "Ba_log", "Fe_log")
# Create a list to store plots
plots <- list()
# Generate plots
for (i in seq_along(variables)) {
  # .data[[...]] replaces the deprecated aes_string()
  plots[[i]] <- ggplot(Glass, aes(x = .data[[variables[i]]])) +
    geom_density(fill = "purple", alpha = 0.5) +
    labs(x = variables[i]) +
    ggtitle(titles[i])
}
# Arrange plots in a grid
grid.arrange(grobs = plots, nrow = 3, ncol = 2)
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
data("Soybean")
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Load the Soybean dataset
data("Soybean")
# Check the structure of the dataset to identify categorical variables
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Identify categorical variables (e.g., those with factor type)
categorical_vars <- sapply(Soybean, is.factor)
# Create frequency tables for categorical variables
frequency_tables <- lapply(names(Soybean)[categorical_vars], function(var) {
table(Soybean[[var]])
})
# Display frequency tables
names(frequency_tables) <- names(Soybean)[categorical_vars]
frequency_tables
## $Class
##
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
##
## $date
##
## 0 1 2 3 4 5 6
## 26 75 93 118 131 149 90
##
## $plant.stand
##
## 0 1
## 354 293
##
## $precip
##
## 0 1 2
## 74 112 459
##
## $temp
##
## 0 1 2
## 80 374 199
##
## $hail
##
## 0 1
## 435 127
##
## $crop.hist
##
## 0 1 2 3
## 65 165 219 218
##
## $area.dam
##
## 0 1 2 3
## 123 227 145 187
##
## $sever
##
## 0 1 2
## 195 322 45
##
## $seed.tmt
##
## 0 1 2
## 305 222 35
##
## $germ
##
## 0 1 2
## 165 213 193
##
## $plant.growth
##
## 0 1
## 441 226
##
## $leaves
##
## 0 1
## 77 606
##
## $leaf.halo
##
## 0 1 2
## 221 36 342
##
## $leaf.marg
##
## 0 1 2
## 357 21 221
##
## $leaf.size
##
## 0 1 2
## 51 327 221
##
## $leaf.shread
##
## 0 1
## 487 96
##
## $leaf.malf
##
## 0 1
## 554 45
##
## $leaf.mild
##
## 0 1 2
## 535 20 20
##
## $stem
##
## 0 1
## 296 371
##
## $lodging
##
## 0 1
## 520 42
##
## $stem.cankers
##
## 0 1 2 3
## 379 39 36 191
##
## $canker.lesion
##
## 0 1 2 3
## 320 83 177 65
##
## $fruiting.bodies
##
## 0 1
## 473 104
##
## $ext.decay
##
## 0 1 2
## 497 135 13
##
## $mycelium
##
## 0 1
## 639 6
##
## $int.discolor
##
## 0 1 2
## 581 44 20
##
## $sclerotia
##
## 0 1
## 625 20
##
## $fruit.pods
##
## 0 1 2 3
## 407 130 14 48
##
## $fruit.spots
##
## 0 1 2 4
## 345 75 57 100
##
## $seed
##
## 0 1
## 476 115
##
## $mold.growth
##
## 0 1
## 524 67
##
## $seed.discolor
##
## 0 1
## 513 64
##
## $seed.size
##
## 0 1
## 532 59
##
## $shriveling
##
## 0 1
## 539 38
##
## $roots
##
## 0 1 2
## 551 86 15
# Create bar plots for categorical variables
for (var in names(frequency_tables)) {
  # .data[[var]] replaces the deprecated aes_string()
  p <- ggplot(Soybean, aes(x = .data[[var]])) +
    geom_bar(fill = "blue", alpha = 0.7) +
    ggtitle(paste("Frequency Distribution of", var)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = var, y = "Count")
  # Print the plot
  print(p)
}
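One way to judge whether a distribution is degenerate is the ratio of the most common level's count to the second most common level's count. In the frequency tables above, mycelium (639 vs. 6), sclerotia (625 vs. 20), and leaf.mild (535 vs. 20) stand out with ratios of 106.5, 31.25, and 26.75. The sketch below computes that ratio for every predictor; `freq_ratio()` is a hypothetical helper, and the cutoff of 19 (that is, 95/5) is an assumption borrowed from the default `freqCut` of `caret::nearZeroVar()`.

```r
# Ratio of the most frequent level to the second most frequent level.
# table() ignores NA values by default.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Flag Soybean predictors above the assumed cutoff when mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Soybean", package = "mlbench")
  ratios <- sapply(Soybean[, names(Soybean) != "Class"], freq_ratio)
  print(sort(ratios[ratios > 19], decreasing = TRUE))
}
```

A high ratio alone does not prove a predictor is uninformative, so flagged predictors are candidates for review rather than automatic removal.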
# Check for missing values
missing_summary <- sapply(Soybean, function(x) sum(is.na(x)))
missing_summary <- data.frame(Predictor = names(missing_summary), Missing = missing_summary)
missing_summary <- missing_summary %>% filter(Missing > 0)
# Display predictors with missing values
print(missing_summary)
## Predictor Missing
## date date 1
## plant.stand plant.stand 36
## precip precip 38
## temp temp 30
## hail hail 121
## crop.hist crop.hist 16
## area.dam area.dam 1
## sever sever 121
## seed.tmt seed.tmt 121
## germ germ 112
## plant.growth plant.growth 16
## leaf.halo leaf.halo 84
## leaf.marg leaf.marg 84
## leaf.size leaf.size 84
## leaf.shread leaf.shread 100
## leaf.malf leaf.malf 84
## leaf.mild leaf.mild 108
## stem stem 16
## lodging lodging 121
## stem.cankers stem.cankers 38
## canker.lesion canker.lesion 38
## fruiting.bodies fruiting.bodies 106
## ext.decay ext.decay 38
## mycelium mycelium 38
## int.discolor int.discolor 38
## sclerotia sclerotia 38
## fruit.pods fruit.pods 84
## fruit.spots fruit.spots 106
## seed seed 92
## mold.growth mold.growth 92
## seed.discolor seed.discolor 106
## seed.size seed.size 92
## shriveling shriveling 106
## roots roots 31
c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
A variable like hail is missing 121 values, while another like precip is missing 38. Dropping every row that contains a missing value would discard at least 121 of the 683 rows and could bias the model toward the observations that happen to be complete.
Instead, I would impute the missing values with a technique such as K-nearest neighbors.
Summary of Steps:
1. Identify predictors with excessive missing values.
2. Remove those predictors if they are not crucial.
3. Apply mean/median imputation for numerical predictors (none in this data set, where every predictor is a factor).
4. Apply mode imputation for categorical predictors.
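The mode-imputation step for categorical predictors can be sketched in base R as follows. This is only an illustration under simple assumptions: `impute_mode()` is a hypothetical helper intended for factor or character columns, and ties are broken by taking the first level with the maximum count.

```r
# Replace NA values in a factor or character vector with its most
# frequent observed level; other vectors are returned unchanged.
impute_mode <- function(x) {
  if (!(is.factor(x) || is.character(x)) || !anyNA(x)) return(x)
  counts <- table(x)  # NA is excluded from the counts
  if (length(counts) == 0) return(x)
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Apply to every Soybean column when mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Soybean", package = "mlbench")
  Soybean_imputed <- Soybean
  Soybean_imputed[] <- lapply(Soybean, impute_mode)
  print(sum(is.na(Soybean_imputed)))
}
```

Mode imputation is simple but ignores relationships between predictors, which is the motivation for a neighbor-based method such as the K-nearest-neighbors approach mentioned above.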
# Bar plot of missing values
ggplot(missing_summary, aes(x = reorder(Predictor, -Missing), y = Missing)) +
geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
ggtitle("Missing Values by Predictor") +
labs(x = "Predictor", y = "Number of Missing Values") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Create a missingness indicator for each predictor
missing_indicators <- as.data.frame(sapply(Soybean, function(x) is.na(x)))
# Add the Class variable to the indicators
missing_indicators$Class <- Soybean$Class
# Calculate the proportion of missing values for each class
missing_by_class <- missing_indicators %>%
  group_by(Class) %>%
  # Pass arguments through an anonymous function; supplying them via
  # `...` in across() is deprecated as of dplyr 1.1.0
  summarize(across(everything(), \(x) mean(x, na.rm = TRUE)))
# Melt the data for visualization
missing_by_class_long <- melt(missing_by_class, id.vars = "Class")
# Bar plot of missing data by class
ggplot(missing_by_class_long, aes(x = Class, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("Proportion of Missing Data by Class") +
labs(x = "Class", y = "Proportion of Missing Values") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
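With 35 predictors dodged inside each class, the bar plot can be hard to read. A compact alternative is to count, for each class, how many rows contain at least one missing value. The sketch below is one way to do that; `incomplete_by_class()` is a hypothetical helper and not part of the original analysis.

```r
# For each class: number of rows, number of rows with any NA,
# and the proportion of incomplete rows.
incomplete_by_class <- function(df, class_col = "Class") {
  incomplete <- !complete.cases(df)
  cls <- factor(df[[class_col]])
  n_rows <- tapply(incomplete, cls, length)
  n_incomplete <- tapply(incomplete, cls, sum)
  out <- data.frame(
    class = names(n_rows),
    n_rows = as.integer(n_rows),
    n_incomplete = as.integer(n_incomplete),
    stringsAsFactors = FALSE
  )
  out$prop_incomplete <- out$n_incomplete / out$n_rows
  # Classes with the highest share of incomplete rows first
  out[order(-out$prop_incomplete), ]
}

# Apply to the Soybean data when mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Soybean", package = "mlbench")
  print(incomplete_by_class(Soybean))
}
```

If the incomplete rows turn out to be concentrated in a few classes, dropping them would remove most of the information about those classes, which bears directly on the choice between elimination and imputation.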