Homework 4

3.1

library(mlbench)
library(tidyverse)
library(GGally)
library(e1071)
library(caret)
library(nnet)
library(kernlab)

# Load the data
data(Glass)

# Check the structure of the dataset
glimpse(Glass)

## Rows: 214
## Columns: 10
## $ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.…
## $ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13…
## $ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,…
## $ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,…
## $ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72…
## $ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,…
## $ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,…
## $ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,…
## $ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# Histogram for each predictor
Glass %>%
  pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "blue4", alpha = 0.7) +
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  ggtitle("Distribution of Predictor Variables")

The histograms show that several of the predictors are skewed. K, Ba, Ca, and RI have long right tails, while Mg and Si lean to the left. Furthermore, Ba and Fe have a large number of zero values, indicating that these elements may be absent from some of the glass types.

# Scatterplot matrix to visualize relationships between predictors
ggpairs(Glass, aes(color = as.factor(Type), alpha = 0.5))

The scatterplot matrix suggests that Ca and K are inversely related, while Na and RI show only a very weak correlation. The classes overlap in most panels, suggesting that certain types of glass have a similar composition.
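Relationships read off a scatterplot matrix are easy to misjudge, so it can help to list the pairwise correlations numerically. The following is a minimal sketch in base R; the helper name `top_correlations` and the toy data frame are hypothetical and used only for illustration.

```r
# Pairwise Pearson correlations among the numeric columns of a data frame,
# returned as a long table sorted by absolute correlation (strongest first).
top_correlations <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  cm <- cor(num, use = "pairwise.complete.obs")
  idx <- which(upper.tri(cm), arr.ind = TRUE)  # each pair once
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Toy data (not the Glass data): a and c are perfectly inversely related.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 4, 6, 8.5),
  c = c(4, 3, 2, 1),
  g = factor(c("x", "y", "x", "y"))  # non-numeric columns are ignored
)
top_correlations(toy)
```

With the homework data this would be called as top_correlations(Glass), which drops the factor Type automatically and makes the sign and strength of each pair explicit.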

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

# Boxplots to check for outliers
Glass %>%
  pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "orange", alpha = 0.7) +
  theme_minimal() +
  coord_flip() +
  ggtitle("Boxplots of Predictor Variables")

# Checking skewness
apply(Glass[, -10], 2, function(x) skewness(x, na.rm = TRUE))

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

Both the boxplots and the skewness statistics indicate that K is strongly right-skewed (skewness 6.46), characterized by a large number of small values and a few very large values that are candidates for outliers. Ba (3.37) and Ca (2.02) are also clearly right-skewed, followed by Fe (1.73) and RI (1.60), while Mg (-1.14) and Si (-0.72) are left-skewed. Na (0.45) is the closest to symmetric. In short, most of the predictors are skewed to some degree.
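To put a number on the outlier question, one option is to count the points that fall outside the boxplot whiskers (the usual 1.5 * IQR rule) for each predictor. This is a rough sketch in base R; the helper name `count_boxplot_outliers` and the toy data are hypothetical.

```r
# Count points outside the boxplot whiskers (1.5 * IQR rule) per numeric column.
count_boxplot_outliers <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  sapply(num, function(x) length(boxplot.stats(x)$out))
}

# Toy data (not the Glass data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 100),  # 100 lies far beyond the upper whisker
  y = c(1, 2, 3, 4, 5, 6)     # no point is flagged
)
count_boxplot_outliers(toy)
```

On the homework data the call would be count_boxplot_outliers(Glass). For predictors that are mostly zero, such as Ba, the rule may flag most non-zero values, so the counts should be read together with the histograms rather than on their own.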

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Original classification model

set.seed(1048)  # Ensure reproducibility
trainIndex <- createDataPartition(Glass$Type, p = 0.8, list = FALSE)
trainData <- Glass[trainIndex, ]
testData <- Glass[-trainIndex, ]

# Train an SVM model (without transformations)
svmModel_cv <- train(
  Type ~ Na + Mg + K + Ba,
  data = trainData,
  method = "svmLinear",
  trControl = trainControl(method = "cv", number = 10)
)

# Evaluate the model
print(svmModel_cv)

## Support Vector Machines with Linear Kernel 
## 
## 174 samples
##   4 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 156, 157, 156, 158, 155, 157, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5068047  0.2972592
## 
## Tuning parameter 'C' was held constant at a value of 1

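Several transformations could plausibly improve the classification model, for example a log or Yeo-Johnson transformation to reduce skewness, or centering and scaling. Here we try a log transformation of K and Ba, the two most right-skewed predictors in the model.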

Using Log Transformation

# log(0) is undefined and Ba is mostly zeros, so shift the values before taking logs.
# The shift is computed from the training data only and reused for the test data.
min_Ba <- min(trainData$Ba)
min_K <- min(trainData$K)

# Shift so that the smallest training value becomes strictly positive
shift_value_Ba <- abs(min_Ba) + 1
shift_value_K <- abs(min_K) + 1

trainData$Ba <- log(trainData$Ba + shift_value_Ba)
trainData$K <- log(trainData$K + shift_value_K)
testData$Ba <- log(testData$Ba + shift_value_Ba)
testData$K <- log(testData$K + shift_value_K)

# Subtract log(shift) so that a raw value of 0 still maps to 0.
# This only moves each variable by a constant; it is a no-op when the shift is 1.
trainData$Ba <- trainData$Ba - log(shift_value_Ba)
trainData$K <- trainData$K - log(shift_value_K)
testData$Ba <- testData$Ba - log(shift_value_Ba)
testData$K <- testData$K - log(shift_value_K)

# Train a classification model
svmModel_cv2 <- train(Type ~ Na + Mg + K + Ba, data = trainData, method = "svmLinear", trControl = trainControl(method = "cv", number = 10))

# Evaluate the model
print(svmModel_cv2)

## Support Vector Machines with Linear Kernel 
## 
## 174 samples
##   4 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 157, 157, 157, 154, 156, 158, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5596487  0.3750261
## 
## Tuning parameter 'C' was held constant at a value of 1

After applying a log transformation to the skewed predictors (Ba and K), cross-validated accuracy rose from 0.507 to 0.560, an increase of about 0.05, and Kappa rose from 0.297 to 0.375. This improvement is modest, and because the two runs were resampled on different folds and no formal test was performed, we cannot call it statistically significant. We could explore other models to determine whether the effect of the log transformation is more pronounced. However, this result is enough to show that transforming skewed predictors can help.
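The two cross-validation summaries above report different fold sizes (156, 157, 156, ... versus 157, 157, 157, ...), so part of the accuracy difference may be resampling noise. One way to get a fairer comparison is to resample both models on the same folds and compare them fold by fold with caret's resamples() and diff(). The sketch below assumes the training minimum of Ba and K is 0, so that the shift used above equals 1 and log1p() is equivalent; the object names (train_raw, train_log, folds, fit_raw, fit_log) are illustrative only.

```r
library(mlbench)
library(caret)  # method = "svmLinear" also needs the kernlab package installed

data(Glass)
set.seed(1048)
trainIndex <- createDataPartition(Glass$Type, p = 0.8, list = FALSE)
train_raw <- Glass[trainIndex, ]

# Log-transformed copy; log1p(x) = log(x + 1) handles zeros in Ba and K.
train_log <- train_raw
train_log$Ba <- log1p(train_raw$Ba)
train_log$K <- log1p(train_raw$K)

# One shared set of 10 folds so both models are resampled on identical rows.
folds <- createFolds(train_raw$Type, k = 10, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = folds)

fit_raw <- train(Type ~ Na + Mg + K + Ba, data = train_raw,
                 method = "svmLinear", trControl = ctrl)
fit_log <- train(Type ~ Na + Mg + K + Ba, data = train_log,
                 method = "svmLinear", trControl = ctrl)

# Paired, fold-level comparison of accuracy and Kappa.
rs <- resamples(list(raw = fit_raw, log = fit_log))
summary(diff(rs))
```

The summary of diff(rs) reports the estimated difference between the two models for each metric along with a p-value, which gives some basis for judging whether the gain from the transformation is more than noise.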

3.2

# Load the data
data(Soybean)

# Check the structure of the dataset
glimpse(Soybean)

## Rows: 683
## Columns: 36
## $ Class           <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
## $ date            <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand     <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip          <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp            <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail            <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist       <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam        <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever           <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt        <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ            <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg       <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size       <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.mild       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging         <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion   <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mycelium        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ sclerotia       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fruit.pods      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots     <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# Investigate the categorical predictors
categorical_columns <- sapply(Soybean, is.factor)
categorical_data <- Soybean[, categorical_columns]

# Check frequency distributions for categorical variables
summary(categorical_data)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##

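From the summary, we observe that the outcome has 19 distinct disease classes and that their distribution is very uneven. There are 92 observations of brown-spot, while the smallest class, herbicide-injury, has only 8 (see the class table in part (b)). Among the predictors, several are close to degenerate because a single level dominates: mycelium (639 vs. 6), sclerotia (625 vs. 20), and leaf.mild (535 vs. 20 and 20). These predictors carry very little variation and are candidates for removal.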
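A degenerate (near-zero variance) predictor can be screened for by comparing the count of the most common level with the count of the second most common level. The following is a minimal base R sketch; the helper name `freq_ratio` and the toy data are hypothetical. The caret package offers a similar screen in nearZeroVar(), whose default frequency-ratio cutoff is 95/5.

```r
# For each factor column, compute the ratio of the most common level's count
# to the second most common level's count, plus the share of the top level.
freq_ratio <- function(df) {
  fac <- df[, sapply(df, is.factor), drop = FALSE]
  res <- lapply(fac, function(x) {
    counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
    second <- if (length(counts) > 1) counts[[2]] else 0
    c(freq_ratio = if (second > 0) counts[[1]] / second else Inf,
      top_share = counts[[1]] / sum(counts))
  })
  out <- as.data.frame(do.call(rbind, res))
  out[order(-out$freq_ratio), ]  # most unbalanced predictors first
}

# Toy data (not the Soybean data)
toy <- data.frame(
  rare = factor(c(rep("0", 97), rep("1", 3))),  # heavily unbalanced
  even = factor(rep(c("a", "b"), 50))           # perfectly balanced
)
freq_ratio(toy)
```

On the homework data the call would be freq_ratio(Soybean[, -1]), leaving out the Class column. Large ratios, such as the 639 to 6 split seen for mycelium in the summary, would appear at the top of the table.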

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

colSums(is.na(Soybean))

##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

# Check the proportion of missing data for each predictor
missing_data <- colSums(is.na(Soybean)) / nrow(Soybean)
missing_data

##           Class            date     plant.stand          precip            temp 
##     0.000000000     0.001464129     0.052708638     0.055636896     0.043923865 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##     0.177159590     0.023426061     0.001464129     0.177159590     0.177159590 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##     0.163982430     0.023426061     0.000000000     0.122986823     0.122986823 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##     0.122986823     0.146412884     0.122986823     0.158125915     0.023426061 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##     0.177159590     0.055636896     0.055636896     0.155197657     0.055636896 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##     0.055636896     0.055636896     0.055636896     0.122986823     0.155197657 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##     0.134699854     0.134699854     0.155197657     0.134699854     0.155197657 
##           roots 
##     0.045387994

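The proportions above show that missingness is concentrated in a subset of the predictors. hail, sever, seed.tmt, and lodging are each missing in 121 of 683 rows (17.7%), followed by germ (16.4%) and leaf.mild (15.8%); fruiting.bodies, fruit.spots, seed.discolor, and shriveling are each at 15.5%. In contrast, Class and leaves are complete, and date and area.dam are each missing only one value.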

# Class frequencies, for comparison with the per-class missing counts below
table(Soybean$Class, useNA = "ifany")

## 
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20

# Count missing values by class
missing_by_class <- Soybean %>%
  group_by(Class) %>%
  summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))

glimpse(missing_by_class)

## Rows: 19
## Columns: 36
## $ Class                   <fct> 2-4-d-injury, alternarialeaf-spot, anthracnose…
## $ missing_date            <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ missing_plant.stand     <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 6, 0, 0, 0, 0, 0,…
## $ missing_precip          <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_temp            <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 0, 0,…
## $ missing_hail            <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 8, 0…
## $ missing_crop.hist       <int> 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ missing_area.dam        <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ missing_sever           <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 8, 0…
## $ missing_seed.tmt        <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 8, 0…
## $ missing_germ            <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 6, 0, 0, 0, 8, 0,…
## $ missing_plant.growth    <int> 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ missing_leaves          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ missing_leaf.halo       <int> 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 0, 0,…
## $ missing_leaf.marg       <int> 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 0, 0,…
## $ missing_leaf.size       <int> 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 0, 0,…
## $ missing_leaf.shread     <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 0, 0…
## $ missing_leaf.malf       <int> 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 0, 0,…
## $ missing_leaf.mild       <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 8, 0…
## $ missing_stem            <int> 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ missing_lodging         <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 15, 0, 0, 0, 8, 0…
## $ missing_stem.cankers    <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_canker.lesion   <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_fruiting.bodies <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_ext.decay       <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_mycelium        <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_int.discolor    <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_sclerotia       <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_fruit.pods      <int> 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ missing_fruit.spots     <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_seed            <int> 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, …
## $ missing_mold.growth     <int> 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, …
## $ missing_seed.discolor   <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_seed.size       <int> 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, …
## $ missing_shriveling      <int> 16, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 8, 0,…
## $ missing_roots           <int> 16, 0, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0,…

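The per-class counts show that missingness is concentrated in a few classes. For 2-4-d-injury (16 observations), cyst-nematode (14), diaporthe-pod-&-stem-blight (15), and herbicide-injury (8), many predictors have a missing count equal to the class size, meaning that every observation in the class lacks those predictors. Classes such as alternarialeaf-spot and brown-spot show no missing values at all. Rows containing NAs therefore tend to be missing several predictors at once, and they originate from the same classes. This indicates that the pattern of missingness is strongly related to the classes. Note that the glimpse output is truncated, so the last few classes are not visible here and would need to be checked separately.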
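Because glimpse() truncates the per-class counts, a compact alternative is to compute, for each class, the share of rows that have at least one missing predictor. This is a sketch in base R; the helper name `incomplete_by_class` and the toy data are hypothetical.

```r
# For each class: number of rows, number of rows with at least one missing
# predictor, and the share of such rows.
incomplete_by_class <- function(df, class_col) {
  predictors <- df[, setdiff(names(df), class_col), drop = FALSE]
  has_na <- !complete.cases(predictors)
  n <- tapply(has_na, df[[class_col]], length)
  k <- tapply(has_na, df[[class_col]], sum)
  data.frame(
    class = names(n),
    n = as.vector(n),
    incomplete = as.vector(k),
    share = as.vector(k / n)
  )
}

# Toy data (not the Soybean data): every row of class "a" has a missing value.
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "b")),
  p1 = c(1, NA, 3, 4, 5),
  p2 = c(NA, NA, 1, 1, 1)
)
incomplete_by_class(toy, "Class")
```

On the homework data the call would be incomplete_by_class(Soybean, "Class"), which returns one row per class, including the classes that the truncated glimpse output does not show.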

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

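In this situation, there are several ways to handle the missing data. Since no predictor is missing more than about 18% of its values, imputation is a reasonable alternative to dropping predictors. Because the predictors are categorical, mode imputation within each class is a simple option. However, the within-class mode is undefined when a class is missing a predictor entirely (for example, hail for 2-4-d-injury), so those cases need a fallback such as the overall mode or an explicit "missing" level. Because the missingness is closely tied to the classes, it may also be informative in itself, so we can create indicator variables for the missing values. These approaches preserve useful information while limiting the negative impact of missing data on model performance.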
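The strategy described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper names `mode_level` and `impute_mode_by_class` and the toy data are hypothetical, the overall mode is used as the fallback when a class has no observed values, and ties are broken by taking the first level. Imputing by class uses the outcome, so it is only possible where the class is known, such as in the training data.

```r
# Most frequent non-missing value of a character vector (NA if none observed).
mode_level <- function(x) {
  counts <- table(x)  # NA values are excluded by default
  if (sum(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

# Replace NAs in factor x with the mode of x within each class; fall back to
# the overall mode when a class has no observed values at all.
impute_mode_by_class <- function(x, class) {
  stopifnot(is.factor(x))
  x_chr <- as.character(x)
  class_chr <- as.character(class)
  overall <- mode_level(x_chr)
  for (cl in unique(class_chr)) {
    rows <- which(class_chr == cl & is.na(x_chr))
    if (length(rows) == 0) next
    m <- mode_level(x_chr[class_chr == cl])
    x_chr[rows] <- if (is.na(m)) overall else m
  }
  factor(x_chr, levels = levels(x), ordered = is.ordered(x))
}

# Toy data (not the Soybean data): class "b" has no observed hail values.
toy <- data.frame(
  Class = factor(c("a", "a", "a", "b", "b", "c", "c", "c")),
  hail = factor(c("1", "1", NA, NA, NA, "0", "0", "0"), levels = c("0", "1"))
)
toy$hail_missing <- as.integer(is.na(toy$hail))  # indicator, recorded before imputing
toy$hail <- impute_mode_by_class(toy$hail, toy$Class)
toy
```

In the toy data, the missing value in class "a" is filled with that class's mode, while the two rows of class "b" fall back to the overall mode; the hail_missing column keeps a record of which values were imputed.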