Exercise 3.1.

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  • Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

  • Do there appear to be any outliers in the data? Are any predictors skewed?

  • Are there any relevant transformations of one or more predictors that might improve the classification model?

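library(mlbench)    # provides the Glass and Soybean data sets
library(tidyverse)  # ggplot2, dplyr, and tidyr are used below
data(Glass)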
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.3.0
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
# Calculate the correlation matrix
correlation_matrix <- cor(Glass |> select(-Type))

# Melt the correlation matrix
melted_correlation <- melt(correlation_matrix)
ggplot(melted_correlation, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "green", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), space = "Lab", 
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap of Glass Predictors")

# Load necessary libraries
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.0
## corrplot 0.92 loaded
library(dplyr)

# Load the Glass dataset
data("Glass")

# Calculate the correlation matrix
correlation_matrix <- cor(Glass |> select(-Type))

# Create the correlation plot with the color legend on the right side
corrplot(correlation_matrix, 
         method = "circle",      # Circle method for visualization
         type = "full",          # Show the full matrix
         order = "hclust",       # Hierarchical clustering order
         tl.col = "black",       # Text label color
         tl.srt = 45,            # Text label rotation
         addCoef.col = "black",  # Add correlation coefficients in black
         diag = TRUE,            # Show the diagonal
         cl.pos = "r")           # Position color legend to the right

  • Do there appear to be any outliers in the data? Are any predictors skewed?
# Gather data into a long format for easier plotting
glass_long <- Glass %>%
  select(RI, Na, Mg, Al, Si, K, Ca, Ba, Fe) %>%
  pivot_longer(cols = everything(), names_to = "Element", values_to = "Value")
# Create density plot using facet_wrap
ggplot(glass_long, aes(x = Value, fill = Element)) + 
  geom_density(color = "black", alpha = 0.5) + 
  facet_wrap(~ Element, scales = "free", nrow = 3) + 
  theme_minimal() +  # Use a minimal theme for cleaner look
  labs(title = "Density Plots of Glass Components", x = "Value", y = "Density") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))

Conclusion: all of the predictor variables are skewed. RI, Na, Al, K, Ca, Ba, and Fe are positively skewed, while Mg and Si are negatively skewed.
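
The visual impression of skew can be cross-checked with a number. The sketch below is a minimal base-R illustration and not part of the original analysis: `sample_skewness` is a hypothetical helper that computes the moment-based skewness coefficient, and the two vectors are toy data rather than values from Glass.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2).
# Positive values indicate a long right tail, negative values a long left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)   # second central moment
  m3 <- mean(centered^3)   # third central moment
  m3 / m2^(3 / 2)
}

# Toy vectors (not taken from the Glass data)
right_tailed <- c(0, 0, 0, 0, 1, 2, 9)
left_tailed  <- -right_tailed

sample_skewness(right_tailed)  # positive
sample_skewness(left_tailed)   # negative
```

Assuming the Glass data frame is loaded, `sapply(Glass[, 1:9], sample_skewness)` would return one coefficient per predictor, since the first nine columns are the predictors.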

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.2.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Load the Glass dataset
data("Glass")

# Define the list of numeric variables to plot
numeric_vars <- c("Ba", "Ca", "Fe", "K", "Na", "RI", "Si", "Mg", "Al")

# Create individual box plots for each variable with outliers highlighted in red
plots <- lapply(numeric_vars, function(var) {
  # .data[[var]] replaces the deprecated aes_string()
  ggplot(Glass, aes(y = .data[[var]])) + 
    geom_boxplot(fill = "lightblue", color = "black", outlier.colour = "red") + 
    ggtitle(paste("Box Plot of", var)) + 
    theme_minimal() + 
    theme(plot.title = element_text(hjust = 0.5)) + 
    labs(y = var)
})
# Arrange the plots in a grid layout
grid.arrange(grobs = plots, nrow = 3, ncol = 3)

In the box plots above, every predictor except Mg shows outliers.
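
The red points can also be counted rather than read off the plots. The following is a sketch under stated assumptions: `count_boxplot_outliers` is a hypothetical helper built on base R's `boxplot.stats()`, which applies the conventional 1.5 × IQR rule (quartile definitions can differ slightly between implementations, so counts may not match a given plot exactly). The data frame is toy data, not Glass.

```r
# Hypothetical helper: number of values lying more than 1.5 * IQR
# beyond the hinges, as reported by boxplot.stats()
count_boxplot_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Toy data frame (not the Glass data): column b has one extreme value
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7),
                  b = c(1, 2, 3, 4, 5, 6, 100))

outlier_counts <- sapply(toy, count_boxplot_outliers)
outlier_counts
```

With the Glass data loaded, the same helper could be applied as `sapply(Glass[, numeric_vars], count_boxplot_outliers)`.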

  • Are there any relevant transformations of one or more predictors that might improve the classification model?
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(MASS)  # For Box-Cox transformation
## Warning: package 'MASS' was built under R version 4.3.0
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

Part C: I chose the Box-Cox method to normalize Al and Mg. Because Box-Cox estimates the transformation parameter (lambda) from the data, it is more flexible than a fixed transformation. Both variables showed complex distributions, so this method appeared most appropriate. Mg contains exact zeros, so it has to be shifted to strictly positive values before Box-Cox can be applied.

For distributions like RI and K, which have more than one peak, I opted for a square-root transformation to better center the data without heavily changing the shape or slopes of the distribution.

For Ca, Ba, and Fe, which are right-skewed, I opted for a stronger transformation, the log, computed as log(x + 1) because Ba and Fe contain zeros. However, it did not change the distributions of Ba and Fe much.
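
The three choices above are related: the square root and the log are special cases of the Box-Cox family (lambda = 0.5 up to a linear rescaling, and lambda = 0). A minimal sketch of the transform itself, assuming a hypothetical helper named `box_cox` that is not part of MASS:

```r
# Hypothetical helper: Box-Cox transform for strictly positive x.
# (x^lambda - 1) / lambda for lambda != 0, and log(x) in the limit lambda -> 0.
box_cox <- function(x, lambda, tol = 1e-8) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (abs(lambda) < tol) log(x) else (x^lambda - 1) / lambda
}

box_cox(4, 0.5)      # (sqrt(4) - 1) / 0.5 = 2
box_cox(4, 1)        # 4 - 1 = 3, a pure shift
box_cox(exp(1), 0)   # log(e) = 1
```

The positivity check is the reason a variable with zeros needs an offset before either the log or a Box-Cox transform can be applied.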

# Box-Cox transformation on Al
lamb.Al <- boxcox((Glass$Al) ~ 1)

# Find optimal lambda for Al
lambda_Al <- lamb.Al$x[which.max(lamb.Al$y)] #0.50

Glass <- Glass %>%
  mutate(Al_transformed = (Al^lambda_Al - 1) / lambda_Al)

p4.2 <- ggplot(Glass, aes(x = Al_transformed)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  ggtitle("Box-Cox Transformed Al (Aluminum)") 

#################
  
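# Mg contains exact zeros and Box-Cox requires strictly positive values.
# Shift every observation by the same constant so the ordering is preserved
# and the original Mg column is left untouched.
Glass <- Glass %>%
  mutate(Mg_shifted = Mg + 1)

# Box-Cox transformation for Mg
bc.Mg <- boxcox(Glass$Mg_shifted ~ 1)

# Find lambda for Mg transformation
lamb.Mg <- bc.Mg$x[which.max(bc.Mg$y)]

# Apply the Box-Cox transformation to Mg
Glass <- Glass %>%
  mutate(Mg_transformed = (Mg_shifted^lamb.Mg - 1) / lamb.Mg)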

p3.2 <- ggplot(Glass, aes(x = Mg_transformed)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  ggtitle("Box-Cox Transformed Mg (Magnesium)")
library(ggplot2)
library(gridExtra)

# Transformations
Glass$Ri_sqrt <- sqrt(Glass$RI)
Glass$K_sqrt <- sqrt(Glass$K)
Glass$Ca_log <- log(Glass$Ca + 1)
Glass$Ba_log <- log(Glass$Ba + 1)
Glass$Fe_log <- log(Glass$Fe + 1)

# Variables to plot (each name doubles as the panel title)
variables <- c("Ri_sqrt", "K_sqrt", "Ca_log", "Ba_log", "Fe_log")

# Generate one density plot per transformed variable.
# lapply() gives each plot its own copy of v, and .data[[v]] replaces
# the deprecated aes_string().
plots <- lapply(variables, function(v) {
  ggplot(Glass, aes(x = .data[[v]])) +
    geom_density(fill = "purple", alpha = 0.5) +
    ggtitle(v)
})

# Arrange plots in a grid
grid.arrange(grobs = plots, nrow = 3, ncol = 2)

Exercise 3.2:

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

  • a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

  • b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

  • c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

data("Soybean")

head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
# Load necessary libraries
library(ggplot2)
library(dplyr)

# Load the Soybean dataset
data("Soybean")

# Check the structure of the dataset to identify categorical variables
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Identify categorical variables (e.g., those with factor type)
categorical_vars <- sapply(Soybean, is.factor)

# Create frequency tables for categorical variables
frequency_tables <- lapply(names(Soybean)[categorical_vars], function(var) {
  table(Soybean[[var]])
})
# Display frequency tables
names(frequency_tables) <- names(Soybean)[categorical_vars]
frequency_tables
## $Class
## 
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20 
## 
## $date
## 
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90 
## 
## $plant.stand
## 
##   0   1 
## 354 293 
## 
## $precip
## 
##   0   1   2 
##  74 112 459 
## 
## $temp
## 
##   0   1   2 
##  80 374 199 
## 
## $hail
## 
##   0   1 
## 435 127 
## 
## $crop.hist
## 
##   0   1   2   3 
##  65 165 219 218 
## 
## $area.dam
## 
##   0   1   2   3 
## 123 227 145 187 
## 
## $sever
## 
##   0   1   2 
## 195 322  45 
## 
## $seed.tmt
## 
##   0   1   2 
## 305 222  35 
## 
## $germ
## 
##   0   1   2 
## 165 213 193 
## 
## $plant.growth
## 
##   0   1 
## 441 226 
## 
## $leaves
## 
##   0   1 
##  77 606 
## 
## $leaf.halo
## 
##   0   1   2 
## 221  36 342 
## 
## $leaf.marg
## 
##   0   1   2 
## 357  21 221 
## 
## $leaf.size
## 
##   0   1   2 
##  51 327 221 
## 
## $leaf.shread
## 
##   0   1 
## 487  96 
## 
## $leaf.malf
## 
##   0   1 
## 554  45 
## 
## $leaf.mild
## 
##   0   1   2 
## 535  20  20 
## 
## $stem
## 
##   0   1 
## 296 371 
## 
## $lodging
## 
##   0   1 
## 520  42 
## 
## $stem.cankers
## 
##   0   1   2   3 
## 379  39  36 191 
## 
## $canker.lesion
## 
##   0   1   2   3 
## 320  83 177  65 
## 
## $fruiting.bodies
## 
##   0   1 
## 473 104 
## 
## $ext.decay
## 
##   0   1   2 
## 497 135  13 
## 
## $mycelium
## 
##   0   1 
## 639   6 
## 
## $int.discolor
## 
##   0   1   2 
## 581  44  20 
## 
## $sclerotia
## 
##   0   1 
## 625  20 
## 
## $fruit.pods
## 
##   0   1   2   3 
## 407 130  14  48 
## 
## $fruit.spots
## 
##   0   1   2   4 
## 345  75  57 100 
## 
## $seed
## 
##   0   1 
## 476 115 
## 
## $mold.growth
## 
##   0   1 
## 524  67 
## 
## $seed.discolor
## 
##   0   1 
## 513  64 
## 
## $seed.size
## 
##   0   1 
## 532  59 
## 
## $shriveling
## 
##   0   1 
## 539  38 
## 
## $roots
## 
##   0   1   2 
## 551  86  15
# Create bar plots for categorical variables
for (var in names(frequency_tables)) {
  p <- ggplot(Soybean, aes(x = .data[[var]])) +
    geom_bar(fill = "blue", alpha = 0.7) +
    ggtitle(paste("Frequency Distribution of", var)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = var, y = "Count")
  
  # Print the plot
  print(p)
}
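
Part (a) asks whether any distributions are degenerate. One common rule of thumb flags a predictor when its most frequent level vastly outnumbers the second most frequent one. The sketch below is an illustration under stated assumptions: `freq_ratio` is a hypothetical helper, the cutoff of 20 is an assumed rule-of-thumb threshold, and the factors are rebuilt from the counts in the frequency tables above (missing values omitted).

```r
# Hypothetical helper: ratio of the most common level's count to the
# second most common level's count (table() ignores NA values)
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Factors rebuilt from the counts reported in the frequency tables above
rebuilt <- list(
  mycelium  = factor(rep(c("0", "1"), times = c(639, 6))),
  sclerotia = factor(rep(c("0", "1"), times = c(625, 20))),
  leaf.mild = factor(rep(c("0", "1", "2"), times = c(535, 20, 20))),
  germ      = factor(rep(c("0", "1", "2"), times = c(165, 213, 193)))
)

ratios <- sapply(rebuilt, freq_ratio)
ratios

# Assumed cutoff: a ratio above 20 suggests a near-degenerate predictor
flagged <- names(ratios)[ratios > 20]
flagged
```

On these counts, mycelium, sclerotia, and leaf.mild exceed the assumed cutoff, while a balanced predictor such as germ does not. With the full data loaded, `sapply(Soybean[-1], freq_ratio)` would screen every predictor the same way.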

# Check for missing values
missing_summary <- sapply(Soybean, function(x) sum(is.na(x)))
missing_summary <- data.frame(Predictor = names(missing_summary), Missing = missing_summary)
missing_summary <- missing_summary %>% filter(Missing > 0)

# Display predictors with missing values
print(missing_summary)
##                       Predictor Missing
## date                       date       1
## plant.stand         plant.stand      36
## precip                   precip      38
## temp                       temp      30
## hail                       hail     121
## crop.hist             crop.hist      16
## area.dam               area.dam       1
## sever                     sever     121
## seed.tmt               seed.tmt     121
## germ                       germ     112
## plant.growth       plant.growth      16
## leaf.halo             leaf.halo      84
## leaf.marg             leaf.marg      84
## leaf.size             leaf.size      84
## leaf.shread         leaf.shread     100
## leaf.malf             leaf.malf      84
## leaf.mild             leaf.mild     108
## stem                       stem      16
## lodging                 lodging     121
## stem.cankers       stem.cankers      38
## canker.lesion     canker.lesion      38
## fruiting.bodies fruiting.bodies     106
## ext.decay             ext.decay      38
## mycelium               mycelium      38
## int.discolor       int.discolor      38
## sclerotia             sclerotia      38
## fruit.pods           fruit.pods      84
## fruit.spots         fruit.spots     106
## seed                       seed      92
## mold.growth         mold.growth      92
## seed.discolor     seed.discolor     106
## seed.size             seed.size      92
## shriveling           shriveling     106
## roots                     roots      31

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

A variable such as hail is missing 121 values, while precip is missing only 38. Dropping every row that contains a missing value would discard a large share of the data and could bias the model toward the variables and samples that were recorded more completely.

Instead, I would impute the missing values with a technique such as K-nearest neighbors.

Summary of Steps:

  • Identify predictors with excessive missing values.

  • Remove those predictors if they are not crucial.

  • Apply mean/median imputation for numerical predictors.

  • Apply mode imputation for categorical predictors.
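
The mode-imputation step for categorical predictors can be sketched in base R as follows. This is a minimal illustration, not the imputation actually run for this report: `impute_mode` is a hypothetical helper and the factor is toy data.

```r
# Hypothetical helper: replace NA in a factor with its most frequent level.
# Ties go to the level that comes first in levels(x).
impute_mode <- function(x) {
  stopifnot(is.factor(x))
  counts <- table(x)                        # NA values are not counted
  mode_level <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_level
  x
}

# Toy factor (not taken from Soybean)
toy_factor <- factor(c("0", "1", "1", NA, "1", NA, "0"))
imputed <- impute_mode(toy_factor)
imputed
```

Assuming Class stays the first column, `Soybean[-1] <- lapply(Soybean[-1], impute_mode)` would apply the helper to every predictor. Mode imputation ignores relationships between predictors, which is the main argument for a neighbor-based method instead.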

# Bar plot of missing values
ggplot(missing_summary, aes(x = reorder(Predictor, -Missing), y = Missing)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
  ggtitle("Missing Values by Predictor") +
  labs(x = "Predictor", y = "Number of Missing Values") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Create a missingness indicator for each predictor
missing_indicators <- as.data.frame(sapply(Soybean, function(x) is.na(x)))

# Add the Class variable to the indicators
missing_indicators$Class <- Soybean$Class
# Calculate the proportion of missing values for each class
missing_by_class <- missing_indicators %>%
  group_by(Class) %>%
  summarize(across(everything(), \(x) mean(x, na.rm = TRUE)))
# Melt the data for visualization
missing_by_class_long <- melt(missing_by_class, id.vars = "Class")
# Bar plot of missing data by class
ggplot(missing_by_class_long, aes(x = Class, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Proportion of Missing Data by Class") +
  labs(x = "Class", y = "Proportion of Missing Values") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
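
A more compact way to check whether missingness is tied to the classes is the share of incomplete rows within each class. The sketch below uses base R on a toy data frame with hypothetical column names; it is an illustration, not output from the Soybean data.

```r
# Toy data frame (hypothetical, not the Soybean data)
toy_df <- data.frame(
  Class = factor(c("A", "A", "B", "B", "B")),
  x1    = c(1, NA, 2, 3, 4),
  x2    = c("u", "v", NA, "w", NA)
)

# TRUE for rows with at least one missing predictor value
predictors <- setdiff(names(toy_df), "Class")
incomplete <- !complete.cases(toy_df[, predictors])

# Proportion of incomplete rows within each class
incomplete_by_class <- tapply(incomplete, toy_df$Class, mean)
incomplete_by_class
```

Replacing `toy_df` with `Soybean` would give one proportion per disease class; classes at or near 1 would indicate that the missing data are concentrated in particular classes.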