library(ggplot2)
library(corrplot)
library(gridExtra)
library(dplyr)
library(tidyverse)
library(tidyr)
# Histograms of every numeric predictor, one panel per variable
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 15) +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Predictors")

# Boxplots of every numeric predictor, to highlight outliers
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Predictors")

# Pairwise correlations between the numeric predictors
Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot()
Several predictors, including Al, Ba, Ca, Fe, K, and the refractive index
(RI), show right-skewed distributions. Fe and Ba have values
concentrated near zero, indicating that they are present only in trace
amounts in most samples. Mg is left-skewed and bimodal, Na is nearly
normally distributed with a slight right tail, and Si is left-skewed.
Positive correlations: A strong positive relationship exists between RI and Ca (as calcium increases, the refractive index tends to increase) and between Ba and Al.
Negative correlations: Notable negative correlations include RI and Si, Al and Mg, Ca and Mg, and Ba and Mg, suggesting that the variables in each of these pairs tend to vary inversely.
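Reading pairs off a correlation plot can be backed up by ranking the coefficients directly. The following is a minimal sketch, not the method used above: it uses the built-in `mtcars` data as a stand-in so that it runs without extra packages, and the object names (`num`, `cm`, `pairs`) are illustrative assumptions. For the glass data, `num` would be the numeric columns of `Glass`.

```r
# Stand-in data (assumption): mtcars ships with base R; swap in the
# numeric columns of Glass to rank the correlations discussed above.
num <- mtcars[, sapply(mtcars, is.numeric)]
cm <- cor(num)

# Keep each pair once by taking only the upper triangle of the matrix
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute correlation, strongest first
pairs <- pairs[order(-abs(pairs$r)), ]
print(head(pairs, 5))
```

Sorting on the absolute value puts strong negative and strong positive relationships in the same list, with the sign of `r` showing the direction.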
There appear to be outliers in almost all of the predictors, with the exception of Mg. Several of the predictors are also skewed (see 3.1a).
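The visual impression from the boxplots can be checked by counting the points that fall outside the whiskers. This is a rough sketch on a small made-up data frame (`toy` and its columns are hypothetical); `boxplot.stats()` applies the same 1.5 × IQR rule that a default boxplot uses to draw individual points.

```r
# Made-up data (assumption): one well-behaved column, one with a single extreme value
toy <- data.frame(
  clean = 1:11,
  spiky = c(1:10, 100)
)

# Count the points beyond the boxplot whiskers in each column
outlier_counts <- sapply(toy, function(x) length(boxplot.stats(x)$out))
print(outlier_counts)
```

The same `sapply()` call could be run on the numeric columns of `Glass` to get a per-predictor outlier count.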
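Right-skewed variables (e.g., Na, Al, K, Ba, Fe) may benefit from log, square root, or Box-Cox transformations to reduce skewness. Because K, Ba, and Fe contain zeros, the log and Box-Cox transformations require adding a small positive offset first.
Left-skewed variables (e.g., Mg, Si) can be made more symmetric by reflecting them (subtracting each value from the maximum plus one) and then applying a log or square root transformation.
For slightly skewed variables like Ca and RI, a square root transformation may be enough to address the mild skewness.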
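The effect of these transformations can be sketched on synthetic data. The example below is illustrative only: the vectors are simulated rather than taken from the glass data, and `skew()` is a hypothetical helper that computes the sample skewness (third standardized moment).

```r
# Hypothetical helper: sample skewness (third standardized moment)
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(42)
right_skewed <- rexp(500)                          # simulated, long right tail
left_skewed  <- max(right_skewed) - right_skewed   # mirror image, long left tail

# Right skew: log(1 + x) is defined at zero, unlike log(x)
log_t  <- log1p(right_skewed)
sqrt_t <- sqrt(right_skewed)

# Left skew: reflect so the long tail points right, then take the square root
reflect_sqrt <- sqrt(max(left_skewed) + 1 - left_skewed)

print(round(c(
  right_raw    = skew(right_skewed),
  right_log1p  = skew(log_t),
  right_sqrt   = skew(sqrt_t),
  left_raw     = skew(left_skewed),
  left_reflect = skew(reflect_sqrt)
), 2))
```

Note that reflecting a variable reverses its ordering, so large original values become small transformed values; this matters when interpreting model coefficients afterwards.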
The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
columns <- colnames(Soybean)
# Draw one horizontal bar chart per column to show its frequency distribution.
# aes(.data[[col]]) replaces aes_string(), which is deprecated since ggplot2 3.0.0.
lapply(columns, function(col) {
  ggplot(Soybean, aes(.data[[col]])) +
    geom_bar() +
    coord_flip() +
    ggtitle(col)
})
Degenerate distributions occur when a variable takes on only a single
value or almost always takes one value. The predictors mycelium and
sclerotia appear to be degenerate. leaf.mild and leaf.malf are also
heavily concentrated in a single level.
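One way to make "almost always takes one value" concrete is to compute the share of observations that fall in the most common level. The sketch below uses a small made-up data frame; `top_share()` is a hypothetical helper, and the 0.95 cutoff is an arbitrary assumption rather than a standard threshold.

```r
# Hypothetical helper: share of non-missing observations in the most common value
top_share <- function(x) {
  counts <- table(x, useNA = "no")
  if (length(counts) == 0 || sum(counts) == 0) return(NA_real_)
  max(counts) / sum(counts)
}

# Made-up data (assumption) standing in for the categorical predictors
toy <- data.frame(
  balanced = factor(rep(c("a", "b"), each = 50)),
  lopsided = factor(c(rep("0", 97), rep("1", 3))),
  constant = factor(rep("x", 100))
)

shares <- sapply(toy, top_share)

# Flag columns dominated by one level; 0.95 is an arbitrary cutoff
flagged <- names(shares)[shares >= 0.95]
print(round(shares, 2))
print(flagged)
```

Applied to the real data, `sapply(Soybean, top_share)` would give the same summary for every column, including `Class`.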
I would start by eliminating predictors with a high proportion of missing values. The remaining missing values can then be imputed, for example with KNN imputation, which estimates each missing value from the most similar observations. Another option is to remove the classes that account for most of the missing values.
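Before choosing between dropping and imputing, it helps to measure where the missing values sit. The following is a minimal sketch on a made-up data frame (`toy`, `x1`, and `x2` are hypothetical); it reports the missing share per predictor and the share of incomplete rows per class. It does not perform the KNN imputation itself.

```r
# Made-up data (assumption): three classes, two predictors with gaps
toy <- data.frame(
  Class = factor(rep(c("A", "B", "C"), each = 4)),
  x1 = c(1, 2, NA, 4, NA, NA, NA, NA, 1, 2, 3, 4),
  x2 = c(NA, 1, 1, 1, NA, NA, 2, 2, 3, 3, 3, 3)
)
predictors <- toy[, setdiff(names(toy), "Class"), drop = FALSE]

# Share of missing values in each predictor
miss_by_col <- colMeans(is.na(predictors))

# Share of rows with at least one missing predictor, within each class
incomplete <- !complete.cases(predictors)
miss_by_class <- tapply(incomplete, toy$Class, mean)

print(round(miss_by_col, 2))
print(miss_by_class)
```

If the missing values turn out to be confined to a few classes, as for class B in this toy example, that pattern informs whether removing those classes or imputing within them is more defensible.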