(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(mlbench)    # provides the Glass data
library(corrplot)
library(tidyverse)

data(Glass)

# Correlation matrix of the nine numeric predictors (column 10 is Type)
corr_matrix <- cor(Glass[, 1:9])

# Visualize the correlation matrix as a heatmap
corrplot(corr_matrix, method = "color", addCoef.col = "black", tl.cex = 0.8)

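# Histogram of each numeric predictor, one panel per variable
Glass %>%
  select_if(is.numeric) %>%
  gather() %>%   # reshape to long format: columns `key` and `value`
  ggplot(aes(value)) +
  geom_histogram(bins = 15) +
  facet_wrap(~key, scales = 'free')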

# Box plot of each predictor (Type is the outcome, so it is dropped)
Glass %>%
  select(!Type) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free')

Judging from the heatmap, RI (refractive index) and Ca (calcium) have the strongest positive correlation, while RI and Mg (magnesium) are only weakly correlated.
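Reading pairs off a heatmap is error-prone, so the ranking can also be checked numerically. The following is a minimal sketch, assuming the Glass data from mlbench as used above; the names `pair_tbl`, `var1`, `var2`, and `r` are illustrative only.

```r
library(mlbench)
data(Glass)

corr_matrix <- cor(Glass[, 1:9])

# Keep each pair once by blanking the diagonal and the lower triangle
corr_matrix[lower.tri(corr_matrix, diag = TRUE)] <- NA

# Flatten the matrix to one row per predictor pair
pair_tbl <- as.data.frame(as.table(corr_matrix))
names(pair_tbl) <- c("var1", "var2", "r")
pair_tbl <- pair_tbl[!is.na(pair_tbl$r), ]

# Sort by absolute correlation, strongest first
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
head(pair_tbl, 5)   # strongest relationships
tail(pair_tbl, 5)   # weakest relationships
```

Sorting by absolute value separates "strongest negative" from "closest to zero", two readings that the word "lowest" can blur.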

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Most of the predictors are right-skewed to some degree; the exceptions are Mg (magnesium) and Si (silicon), and of these Si looks closest to a normal distribution. The box plots show outliers for every predictor except Mg.
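The visual impression can be backed by numbers. The sketch below computes a moment-based sample skewness and counts the points that fall outside the box plot whiskers (the 1.5 × IQR rule used by `boxplot.stats`). The helper `skew` and the table `summary_tbl` are hypothetical names introduced for illustration, not part of the original analysis.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, 1:9]

# Moment-based sample skewness: positive = right-skewed, negative = left-skewed
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

summary_tbl <- data.frame(
  skewness   = sapply(predictors, skew),
  n_outliers = sapply(predictors, function(x) length(boxplot.stats(x)$out))
)

# Most right-skewed predictors first
summary_tbl[order(-summary_tbl$skewness), ]
```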

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

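A Box-Cox transformation suits the right-skewed predictors: it helps stabilize variance and brings the distribution closer to normal. It requires strictly positive values, so any predictor that contains zeros needs a different treatment, such as a Yeo-Johnson transformation. A spatial sign transformation is useful for the predictors with evident outliers: after centering and scaling, it projects each sample onto the unit sphere, which reduces the influence of extreme values.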
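One way to apply both transformations is `preProcess()` from the caret package. This is a sketch under the assumption that caret is installed; the original analysis does not say which tool it would use. In caret, the Box-Cox step is estimated only for predictors whose values are all positive, and the spatial sign is applied after centering and scaling.

```r
library(mlbench)
library(caret)   # assumption: caret is available; not loaded in the original
data(Glass)

predictors <- Glass[, 1:9]

# Estimate the transformations from the data
pp <- preProcess(predictors,
                 method = c("BoxCox", "center", "scale", "spatialSign"))

# Apply them; Type is left out because it is the outcome
glass_transformed <- predict(pp, predictors)

# After the spatial sign step every sample lies on the unit sphere
row_norms <- sqrt(rowSums(as.matrix(glass_transformed)^2))
summary(row_norms)
```

Because the spatial sign treats the predictors as a group, it should be applied to all of them together rather than to individual columns.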

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

A predictor's distribution is degenerate when nearly all observations fall into a single category, so that the remaining categories have very few or no observations.

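# Bar chart of category frequencies, one panel per predictor.
# drop_na() keeps only complete rows, so the counts exclude
# every row that has a missing value in any predictor.
Soybean %>%
  select(!Class) %>%
  drop_na() %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~ key) +
  labs(title = "Soybean")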

In our case, ‘mycelium’, ‘sclerotia’, and ‘roots’ appear to be degenerate, as the plot above shows.
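The plot can be complemented with a frequency ratio: the count of the most common category divided by the count of the second most common one. A large ratio indicates a nearly constant predictor. The sketch below uses base R only; `freq_ratio` is an illustrative name, and no cutoff is implied by the original text.

```r
library(mlbench)
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# Ratio of the most frequent to the second most frequent category.
# table() ignores NA and counts unused factor levels as zero,
# so a ratio of Inf means only one category is actually observed.
freq_ratio <- sapply(predictors, function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1] / counts[2])
})

# Predictors with the most lopsided distributions first
head(sort(freq_ratio, decreasing = TRUE), 5)
```

Unlike the plot, this calculation keeps incomplete rows and drops only the missing values within each predictor.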

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I’d use mode imputation to handle the missing data in this dataset. Because most of the predictors are categorical, each missing value can be filled with the most frequent category of its predictor. The approach is simple and avoids discarding rows or columns, so the structure of the dataset and the coding of the categorical variables stay intact. After imputation, the analysis can proceed on a complete dataset.
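Mode imputation as described can be sketched in a few lines of base R. The helpers `mode_value` and `impute_mode` and the result `soy_imputed` are hypothetical names introduced here; ties between equally frequent categories are resolved by taking the first level.

```r
library(mlbench)
data(Soybean)

# Most frequent observed category of a factor (NA is ignored by table())
mode_value <- function(x) names(which.max(table(x)))

# Replace missing entries with the mode, keeping the factor levels unchanged
impute_mode <- function(x) {
  x[is.na(x)] <- mode_value(x)
  x
}

pred_cols <- setdiff(names(Soybean), "Class")

soy_imputed <- Soybean
soy_imputed[pred_cols] <- lapply(Soybean[pred_cols], impute_mode)

# Missing values before and after imputation
c(before = sum(is.na(Soybean)), after = sum(is.na(soy_imputed)))
```

Filling every gap with the mode makes the dominant category even more dominant, so it is worth rerunning the frequency plots on the imputed data.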