library(ggplot2)
library(corrplot)
library(gridExtra)
library(dplyr) 
library(tidyverse)
library(tidyr)
library(mlbench)  # provides the Glass and Soybean data sets
data(Glass)

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Predictors")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Predictors")

Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot() 

Several predictors, including Al, Ba, Ca, Fe, K, and the refractive index (RI), have right-skewed distributions. Fe and Ba are concentrated near zero, which indicates that most samples contain them only in trace amounts. Mg is left-skewed and bimodal, Na is close to normal with a slight right tail, and Si is left-skewed.
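The visual impressions above can be checked numerically. The following is a minimal sketch, assuming the moment-based (population) definition of skewness; the helper name `skewness` is introduced here for illustration and is not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Moment-based skewness: third central moment divided by the
# 1.5th power of the second central moment.
# Positive values indicate a right tail, negative values a left tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Apply to every numeric predictor (the factor column Type is skipped).
predictors <- Glass[, sapply(Glass, is.numeric)]
skew <- sort(sapply(predictors, skewness), decreasing = TRUE)
round(skew, 2)
```

Values far from zero point to the predictors that are most likely to need a transformation.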

Positive correlations: RI and Ca are strongly positively correlated (as calcium increases, the refractive index tends to increase), as are Ba and Al. Negative correlations: notable pairs include RI and Si, Al and Mg, Ca and Mg, and Ba and Mg, which suggests that the members of each pair tend to vary inversely.
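To read the pairwise relationships off as numbers rather than from the plot, the correlation matrix can be flattened into a ranked table. This is a sketch only; the object name `cor_pairs` is an illustrative choice.

```r
library(mlbench)
data(Glass)

cors <- cor(Glass[, sapply(Glass, is.numeric)])

# Keep each pair once by taking the upper triangle of the symmetric matrix.
idx <- which(upper.tri(cors), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)

# Strongest relationships first, regardless of sign.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 8)
```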

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

There appear to be outliers in almost all of the predictors, with the exception of Mg. Several predictors are also skewed; see the distribution summary in part (a).
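One way to back up the boxplot reading is to count the points that fall outside the whiskers. The sketch below assumes the common 1.5 × IQR rule as the working definition of an outlier; `count_outliers` is a hypothetical helper, and its quartiles can differ slightly from the hinges that `geom_boxplot()` draws.

```r
library(mlbench)
data(Glass)

# Count points more than 1.5 * IQR below the first quartile
# or above the third quartile.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

sapply(Glass[, sapply(Glass, is.numeric)], count_outliers)
```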

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

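Right-skewed variables (e.g., Na, Al, K, Ba, Fe) can benefit from log, square root, or Box-Cox transformations to reduce skewness. Because K, Ba, and Fe contain zeros, the log and Box-Cox transformations need a positive offset first (for example, log(x + 1)) or a zero-tolerant alternative such as Yeo-Johnson.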

Left-skewed variables (e.g., Mg, Si) can be reflected first (each value subtracted from the maximum plus one) and then log- or square-root-transformed.

For mildly skewed variables such as Ca and RI, a square root transformation is a reasonable, gentler choice.
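Whether a transformation actually helps can be checked by comparing skewness before and after. The sketch below is an illustration under stated assumptions: it uses `log1p()` (that is, log(1 + x)) because K, Ba, and Fe contain zeros, it covers only a subset of the predictors named above, and the `skewness` helper is defined here rather than taken from a package.

```r
library(mlbench)
data(Glass)

# Moment-based skewness (population form).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# A subset of the right-skewed predictors; K, Ba, and Fe contain zeros,
# so a plain log() would produce -Inf.
right_skewed <- c("Al", "K", "Ba", "Fe")

before <- sapply(Glass[right_skewed], skewness)
after  <- sapply(Glass[right_skewed], function(x) skewness(log1p(x)))

round(cbind(before, after), 2)
```

If a predictor's skewness barely moves, a different family of transformations is worth trying for that predictor.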

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

columns <- colnames(Soybean)

# One bar chart per column; .data[[col]] replaces the deprecated aes_string()
lapply(columns,
  function(col) {
    ggplot(Soybean,
           aes(.data[[col]])) + geom_bar() + coord_flip() + ggtitle(col)})

Degenerate distributions occur when a variable takes only a single value or almost always takes the same value. mycelium and sclerotia appear to be degenerate. leaf.mild and leaf.malf are also almost entirely concentrated in one level.
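The same judgment can be made with two summary numbers per predictor: the ratio of the most frequent value's count to the second most frequent, and the percentage of distinct values relative to the sample size. The sketch below computes both in base R; `degeneracy` is a hypothetical helper, and the thresholds in the note that follows are a rule of thumb, not a hard cutoff.

```r
library(mlbench)
data(Soybean)

# For one column: frequency ratio (most common / second most common)
# and percentage of distinct observed values relative to the sample size.
degeneracy <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  counts <- counts[counts > 0]                 # ignore unused factor levels
  c(freq_ratio     = if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf,
    percent_unique = 100 * length(counts) / length(x))
}

# Column 1 is the outcome (Class), so it is excluded.
metrics <- as.data.frame(t(sapply(Soybean[, -1], degeneracy)))
metrics[order(-metrics$freq_ratio), ][1:5, ]
```

A frequency ratio above roughly 20 combined with fewer than about 10% distinct values is a common flag for a near-zero-variance predictor.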

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I would start by eliminating predictors with a high proportion of missing data. The remaining missing values can then be imputed, for example with k-nearest neighbors (KNN), which fills each gap using the most similar samples. Another option is to remove the classes that account for most of the missing values.
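Before choosing between these options, it helps to see where the missing values sit. The following sketch summarizes missingness by predictor and by outcome class; the object names are illustrative, and no imputation is performed here.

```r
library(mlbench)
data(Soybean)

# Share of missing values in each predictor, largest first.
# Column 1 is the outcome (Class), so it is excluded.
miss_by_predictor <- sort(colMeans(is.na(Soybean[, -1])), decreasing = TRUE)
round(head(miss_by_predictor, 10), 3)

# Share of rows with at least one missing value, by outcome class.
incomplete <- !complete.cases(Soybean)
miss_by_class <- tapply(incomplete, Soybean$Class, mean)
round(sort(miss_by_class, decreasing = TRUE), 3)
```

If the incomplete rows turn out to be concentrated in a few classes, dropping predictors and dropping classes lead to quite different trade-offs, so both tables are worth reading before deciding.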