(a) Using visualizations, explore the predictor variables to
understand their distributions as well as the relationships between
predictors.
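library(mlbench); library(corrplot); library(tidyverse)
data(Glass)
# Correlations among the nine numeric predictors (columns 1-9; Type is excluded)
corr_matrix <- cor(Glass[, 1:9])
# Visualize the correlation matrix as a heatmap
corrplot(corr_matrix, method = "color", addCoef.col = "black", tl.cex = 0.8)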

# Histogram of each numeric predictor, one panel per predictor
Glass %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 15) +
  facet_wrap(~key, scales = 'free')

# Box plot of each predictor (Type dropped), used to spot outliers
Glass %>%
  select(!Type) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free')

The correlation heatmap shows that RI (Refractive Index) and Ca
(Calcium) have the strongest positive correlation. RI and Mg (Magnesium)
are only weakly correlated; the strongest negative correlation is
between RI and Si (Silicon).
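Reading the strongest and weakest pairs off a heatmap is easy to get wrong, so it can help to list the pairs in sorted order. The following is a minimal sketch in base R on a small synthetic data frame (`toy`, `x1`, `x2`, `x3`, and `pair_cors` are illustrative names, not part of the original analysis); the same steps should apply to `cor(Glass[, 1:9])`.

```r
# Synthetic stand-in for a table of numeric predictors (assumption, not the Glass data)
set.seed(1)
x1 <- rnorm(50)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(50, sd = 0.1),   # built to track x1 closely
  x3 = -x1 + rnorm(50, sd = 2)     # built to be loosely, negatively related to x1
)

cm <- cor(toy)

# Keep each pair once by taking only the upper triangle of the matrix
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_cors <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort from most positive to most negative correlation
pair_cors <- pair_cors[order(pair_cors$r, decreasing = TRUE), ]
print(pair_cors)
```

The first row is the strongest positive pair and the last row the most negative one; sorting by `abs(r)` instead would rank pairs by strength regardless of sign.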
(b) Do there appear to be any outliers in the data? Are any
predictors skewed?
Most of the predictors are right skewed to some degree; the
exceptions are Mg (Magnesium) and Si (Silicon), and of the two, Si is
the closer to a normal distribution. The box plots show clear outliers
for every predictor except Mg, which does not appear to have any.
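The visual impressions of skewness and outliers can be backed up with numbers. The sketch below uses base R only and a small synthetic data frame; `skewness`, `count_outliers`, and `toy` are hypothetical names introduced for illustration. Skewness here is the moment-based definition (third central moment over the 1.5 power of the second), and outliers are the points outside the box plot whiskers as reported by `boxplot.stats()`.

```r
# Moment-based sample skewness: > 0 suggests right skew, < 0 left skew
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Number of points beyond the box plot whiskers (1.5 * IQR rule)
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Synthetic example (assumption): one symmetric and one right-skewed column
toy <- data.frame(
  sym   = c(1, 2, 3, 4, 5, 6, 7),
  right = c(1, 1, 1, 2, 2, 3, 10)
)

skew_values   <- sapply(toy, skewness)
outlier_count <- sapply(toy, count_outliers)
print(round(skew_values, 2))
print(outlier_count)
```

Applied to the numeric columns of `Glass`, the same two `sapply()` calls would give one skewness value and one outlier count per predictor to compare against the histograms and box plots.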
(a) Investigate the frequency distributions for the
categorical predictors. Are any of the distributions degenerate in the
ways discussed earlier in this chapter?
A predictor's distribution is degenerate when nearly all observations
fall into a single category, so that the remaining categories have very
few or no observations.
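# Bar chart of category frequencies for each predictor.
# Note: drop_na() keeps only rows with no missing value in any predictor.
Soybean %>%
  select(!Class) %>%
  drop_na() %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~ key) +
  labs(title = "Soybean")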

In our case ‘mycelium’, ‘sclerotia’, and ‘roots’ appear to be
degenerate, as seen in the plot above.
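The visual check for degenerate predictors can be complemented with a numeric rule of thumb: flag a predictor when the ratio of its most common to its second most common category is large and the share of unique values is small. The sketch below uses base R and synthetic data; `freq_summary` and `toy` are hypothetical names, the counts are made up for illustration, and the cutoffs (ratio above 95/5, fewer than 10% unique values) are an assumption borrowed from the defaults of `caret::nearZeroVar()`.

```r
# Frequency ratio and percentage of unique values for one categorical predictor
freq_summary <- function(x) {
  counts <- sort(as.numeric(table(x)), decreasing = TRUE)  # NAs are excluded by table()
  c(
    freq_ratio = if (length(counts) > 1) counts[1] / counts[2] else Inf,
    pct_unique = 100 * sum(counts > 0) / length(x)
  )
}

# Synthetic example (assumption): one nearly constant and one balanced predictor
toy <- data.frame(
  nearly_constant = factor(c(rep("0", 98), rep("1", 2))),
  balanced        = factor(c(rep("0", 40), rep("1", 60)))
)

stats <- t(sapply(toy, freq_summary))   # one row per predictor
degenerate <- stats[, "freq_ratio"] > 95 / 5 & stats[, "pct_unique"] < 10
print(stats)
print(degenerate)
```

Running `freq_summary()` over the Soybean predictors (for example with `sapply(Soybean[, -1], freq_summary)`, assuming `Class` is the first column) would show whether the predictors singled out from the plot also exceed these cutoffs.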
(c) Develop a strategy for handling missing data, either by
eliminating predictors or imputation.
I’d use mode imputation to handle the missing data in this dataset.
Since most of the predictors are categorical, each missing value can be
replaced with the most frequent category of its predictor. This approach
is simple and avoids discarding rows or columns, so no observations are
lost and the categorical variables keep their original levels. The
trade-off is that it inflates the most common category and ignores
relationships between predictors, so it is best suited to predictors
with a small share of missing values. After imputation, the analysis can
proceed on a complete dataset.
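The mode-imputation strategy can be sketched in a few lines of base R. This is a minimal illustration on a synthetic data frame, not the author's implementation; `impute_mode` and `toy` are hypothetical names, and the helper is written for factor or character columns only.

```r
# Replace NAs in a factor/character vector with its most frequent observed value
impute_mode <- function(x) {
  if (all(is.na(x))) return(x)                    # nothing to impute from
  counts <- table(x)                              # NAs are excluded by default
  mode_value <- names(counts)[which.max(counts)]  # ties resolve to the first level
  x[is.na(x)] <- mode_value
  x
}

# Synthetic example (assumption): two categorical predictors with gaps
toy <- data.frame(
  hail  = factor(c("0", "1", NA, "0", "0")),
  roots = factor(c("0", NA, "0", "1", NA))
)

toy_imputed <- toy
toy_imputed[] <- lapply(toy, impute_mode)   # apply column by column, keep data frame shape

print(colSums(is.na(toy)))          # missing values before
print(colSums(is.na(toy_imputed)))  # missing values after
```

For the Soybean data the same pattern would be applied to every column except `Class`. Comparing the bar charts before and after imputation is a quick way to check how much the most common category has been inflated.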