3.1

library(mlbench)
library(tidyverse)
library(GGally)
library(e1071)
library(caret)
library(nnet)
library(kernlab)
# Load the data
data(Glass)

# Check the structure of the dataset
glimpse(Glass)
## Rows: 214
## Columns: 10
## $ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.…
## $ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13…
## $ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,…
## $ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,…
## $ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72…
## $ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,…
## $ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,…
## $ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,…
## $ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

(a) Using visualizations, explore the predi tor variables to understand their distributions as well as the relationships between predictors.

# Histogram for each predictor
Glass %>%
  pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "blue4", alpha = 0.7) +
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  ggtitle("Distribution of Predictor Variables")

The histograms demonstrate that some of predictors display a right-skewed distribution, such as Si, Na, and Ca. Furthermore, some variables like Ba and Fe have a lot of zero values, indicating that they might not be found in some of the glass types.

# Scatterplot matrix to visualize relationships between predictors
ggpairs(Glass, aes(color = as.factor(Type), alpha = 0.5))

From the scatterplot matrix, it is evident that Ca and K appear to be inversely related, while Na and RI show a very weak positive correlation. Some categories overlap, suggesting that certain types of glass have a similar composition.

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

# Boxplots to check for outliers
Glass %>%
  pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "orange", alpha = 0.7) +
  theme_minimal() +
  coord_flip() +
  ggtitle("Boxplots of Predictor Variables")

# Checking skewness
apply(Glass[, -10], 2, function(x) skewness(x, na.rm = TRUE))
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

Both the boxplots and the skewness test indicate that variable K exhibits a right-skewed distribution, characterized by a large number of small values and a few large values. Additionally, Ba and Ca also display mild skewness. This suggests that there are a few predictor variables with skewed distributions.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Original classification model

set.seed(1048)  # Ensure reproducibility
trainIndex <- createDataPartition(Glass$Type, p = 0.8, list = FALSE)
trainData <- Glass[trainIndex, ]
testData <- Glass[-trainIndex, ]

# 3. Train an SVM model (without transformations)
svmModel_cv <- train(Type ~ Na + Mg + K + Ba, data = trainData, method = "svmLinear", trControl = trainControl(method = "cv", number = 10))

# Evaluate the model
print(svmModel_cv)
## Support Vector Machines with Linear Kernel 
## 
## 174 samples
##   4 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 156, 157, 156, 158, 155, 157, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5068047  0.2972592
## 
## Tuning parameter 'C' was held constant at a value of 1

There are many transformations of the predictors that can improve the classification model.

Using Log Transformation

# Ensure all values are positive before applying log transformation
min_Ba <- min(trainData$Ba)
min_K <- min(trainData$K)

# Shift the data to make all values positive
shift_value_Ba <- abs(min_Ba) + 1
shift_value_K <- abs(min_K) + 1

trainData$Ba <- log(trainData$Ba + shift_value_Ba)
trainData$K <- log(trainData$K + shift_value_K)
testData$Ba <- log(testData$Ba + shift_value_Ba)
testData$K <- log(testData$K + shift_value_K)

# Remove shift effect
trainData$Ba <- trainData$Ba - log(shift_value_Ba)
trainData$K <- trainData$K - log(shift_value_K)
testData$Ba <- testData$Ba - log(shift_value_Ba)
testData$K <- testData$K - log(shift_value_K)
# Train a classification model
svmModel_cv2 <- train(Type ~ Na + Mg + K + Ba, data = trainData, method = "svmLinear", trControl = trainControl(method = "cv", number = 10))

# Evaluate the model
print(svmModel_cv2)
## Support Vector Machines with Linear Kernel 
## 
## 174 samples
##   4 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 157, 157, 157, 154, 156, 158, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5596487  0.3750261
## 
## Tuning parameter 'C' was held constant at a value of 1

After applying a log transformation to the skewed predictors (Ba and K), we observed a 0.05 increase in the model’s accuracy. Although this improvement is noteworthy, it is not significant since Ba and K are not major predictors in the model. We can explore other models to determine if the effects of the log transformations are more pronounced. However, this result is already sufficient to demonstrate the improvement.

3.2

# Load the data
data(Soybean)

# Check the structure of the dataset
glimpse(Soybean)
## Rows: 683
## Columns: 36
## $ Class           <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
## $ date            <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand     <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip          <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp            <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail            <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist       <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam        <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever           <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt        <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ            <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg       <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size       <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.mild       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging         <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion   <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mycelium        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ sclerotia       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fruit.pods      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots     <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# Investigate the categorical predictors
categorical_columns <- sapply(Soybean, is.factor)
categorical_data <- Soybean[, categorical_columns]

# Check frequency distributions for categorical variables
summary(categorical_data)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 

From the summary, we observe that there are 19 distinct classes of diseases in the dataset, and the distribution is very uneven. There are 92 observations of brown-spot, while other classes may have only a dozen.

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

In this situation, there are several methods to handle missing data. Since the percentage of missing data is less than 30%, we can use mode imputation for each class, as the predictors are categorical variables. If the missingness of the predictors is significant to the dataset, we can create indicators for the missing values. These approaches preserve useful information while ensuring that the missing data does not negatively impact model performance.