3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(DataExplorer)
library(naniar)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(grid)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
# Histogram of each numeric predictor in a 3 x 4 grid
par(mfrow = c(3, 4))
par(mai = c(.3, .3, .3, .3))
factors <- c("Type")
variables <- names(Glass)
for (i in 1:(length(variables) - 1)) {
  if (!variables[i] %in% factors) {
    hist(Glass[[variables[i]]], main = variables[i], col = "lightblue")
  }
}
The distributions vary across the numeric element predictors:
ggplot(data = Glass, aes(Type)) +
  geom_bar() +
  labs(title = 'Type Frequencies')
The shape of a factor variable matters somewhat less than the shape of a numeric distribution, provided the distribution is non-degenerate; however, the much heavier concentration in types 1 and 2 may have meaningful relationships with the other predictors.
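To put a number on that imbalance, here is a minimal tally using dplyr (already loaded); prop is just a derived column name:
Glass |>
  count(Type) |>
  mutate(prop = round(n / sum(n), 3))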
Let’s build a correlation plot to see the strength of the linear relationships among the predictors:
corrplot(cor(Glass |> dplyr::select(-Type)),
         method = "color",
         diag = FALSE,
         type = "lower",
         addCoef.col = "black",
         number.cex = 0.70)
Most of the predictors are not strongly correlated with one another, although several are moderately linearly related to RI and Mg. The one exception is Ca and RI, which have a correlation of 0.81, by far the strongest relationship in either direction.
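To confirm that reading programmatically, one option (a sketch using base R plus the tidyverse, all loaded above) is to melt the lower triangle of the correlation matrix and rank the pairs:
# Rank predictor pairs by absolute correlation; Ca and RI should top the list
cors <- cor(Glass |> dplyr::select(-Type))
cors[upper.tri(cors, diag = TRUE)] <- NA
as.data.frame(as.table(cors)) |>
  drop_na() |>
  arrange(desc(abs(Freq))) |>
  head(5)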
Glass |>
  select_if(is.numeric) |>
  pivot_longer(cols = everything()) |>
  ggplot(aes(y = value, colour = name)) +
  geom_boxplot()
The boxplots indicate outliers in most of the predictors, most pronounced and most frequent in Ca and Na; a few outliers also exist for Si, Ba, Al, and K. As noted above when reviewing the distributions, many of the predictors are skewed: RI, Mg, Al, Si, K, Ca, Ba, and Fe. Nearly all of the elements show some skew in this dataset.
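To quantify the skew rather than judging it visually, a short sketch using e1071::skewness (e1071 is an assumption here; it is not loaded above):
# Sample skewness per numeric predictor; values far from 0 indicate skew
Glass |>
  select_if(is.numeric) |>
  summarise(across(everything(), e1071::skewness)) |>
  pivot_longer(everything(), names_to = "predictor", values_to = "skewness") |>
  arrange(desc(abs(skewness)))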
glass_bc_prep <- Glass |>
  dplyr::select(Na, Mg, Al, Si, K, Ca, Ba, RI, Fe) |>
  # Box-Cox requires strictly positive values, so replace exact zeros
  mutate(across(everything(), ~ ifelse(.x == 0, 0.001, .x)))
skewed <- colnames(glass_bc_prep)
lambdas <- numeric(0)
for (col in skewed) {
  # Profile the Box-Cox log-likelihood for an intercept-only model over [-2, 2]
  bc <- boxcox(lm(glass_bc_prep[[col]] ~ 1),
               lambda = seq(-2, 2, length.out = 81),
               plotit = FALSE)
  lambdas <- c(lambdas, bc$x[which.max(bc$y)])
}
lambdas <- data.frame(skewed, lambdas)
knitr::kable(lambdas, format = "simple")
| skewed | lambdas |
|---|---|
| Na | -0.10 |
| Mg | 0.55 |
| Al | 0.50 |
| Si | 2.00 |
| K | 0.35 |
| Ca | -1.10 |
| Ba | -0.85 |
| RI | -2.00 |
| Fe | -0.45 |
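As a cross-check on the hand-rolled grid search, caret::BoxCoxTrans estimates a lambda per column (caret is an assumption, not loaded above; its estimates may differ slightly because it snaps lambdas near 0 or 1 and returns NA when no transformation is supported):
# Per-column Box-Cox lambda estimates from caret, for comparison with the table
sapply(glass_bc_prep, function(x) caret::BoxCoxTrans(x)$lambda)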
Based on the best power transformations proposed by Box-Cox, most of the columns do not appear to have a clean transformation (several lambdas sit at or near the ±2 search boundary), but we will re-plot them all to see the new distributions after applying the transformations.
# Apply the estimated lambdas from the table above as power transforms
glass_bc <- glass_bc_prep |>
  mutate(na_mod = Na^-0.1,  mg_mod = Mg^0.55,
         al_mod = Al^0.5,   si_mod = Si^2,
         k_mod  = K^0.35,   ca_mod = Ca^-1.1,
         ba_mod = Ba^-0.85, ri_mod = RI^-2,
         fe_mod = Fe^-0.45) |>
  dplyr::select(contains('mod'))
par(mfrow = c(3, 3))
par(mai = c(.3, .3, .3, .3))
bc_vars <- names(glass_bc)
# Loop over all nine transformed predictors (no factor column to skip here)
for (i in seq_along(bc_vars)) {
  hist(glass_bc[[bc_vars[i]]], main = bc_vars[i], col = "lightblue")
}
After applying the maximum-likelihood transformations to the Glass data set, few seem genuinely beneficial. Na and Al appear to have been normalized as expected, and the Ca and Si transformations could perhaps be useful, but the remainder do not appear to be effective.
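One way to make "beneficial" concrete (again a sketch assuming e1071) is to compare skewness before and after the transforms; note that negative powers reverse a variable's orientation, so compare magnitudes rather than signs:
# Skewness before vs. after the power transforms; magnitudes nearer 0 = improvement
data.frame(predictor = names(glass_bc_prep),
           before = sapply(glass_bc_prep, e1071::skewness),
           after  = unname(sapply(glass_bc, e1071::skewness)))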
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
There are a fair number of missing values across many of the predictor columns; we will explore these occurrences in later plots. Most of the predictors take at least two levels, although the observations are heavily concentrated in one of them.
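A quick count makes that concrete (a one-liner in base R):
# Missing values per column, largest first
head(sort(colSums(is.na(Soybean)), decreasing = TRUE), 10)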
# Guidance: https://stackoverflow.com/questions/67158295/bar-plot-for-each-column-in-a-data-frame-in-r
all_plots <- lapply(names(Soybean), function(col) {
  ggplot(Soybean, aes(.data[[col]])) +
    geom_bar(aes(fill = .data[[col]]), position = "dodge") +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
})
The first plot shows the wide variety of classes in the Soybean dataset; the grids that follow show the level frequencies for each predictor.
grid.arrange(grobs=all_plots[1],ncol=1)
grid.arrange(grobs=all_plots[2:10], ncol= 3)
grid.arrange(grobs=all_plots[11:19], ncol= 3)
grid.arrange(grobs=all_plots[20:29], ncol= 3)
grid.arrange(grobs=all_plots[30:36], ncol= 3)
Looking at the frequency distributions of the different predictors, many features take on only two or three values, and further research would be needed to establish how often specific pairs of values co-occur across predictors. In general the zero level is the most common one for the vast majority of features, although there are certainly exceptions to that pattern (e.g., stem).
Technically, no column takes a single value, which is what the definition of a degenerate distribution requires, although mycelium and sclerotia have very few cases other than zero.
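caret::nearZeroVar automates this check (caret assumed, not loaded above), flagging predictors whose most common value overwhelmingly dominates:
# Near-zero-variance predictors; likely flags mycelium and sclerotia, among others
caret::nearZeroVar(Soybean, names = TRUE)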
p1 <- plot_missing(Soybean, missing_only = TRUE,
                   ggtheme = theme_classic())
Given that multiple groups of columns are missing the same number of records, it is likely that those predictors are closely interrelated. A few features share the highest rate of missing values (hail, sever, seed.tmt, and lodging, each at 17.72%, with germ close behind at 16.40%). The other groups of columns mostly have few enough NA values that imputing replacements might be appropriate.
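naniar (already loaded) can also show which columns tend to be missing together, confirming whether the equal missing counts really are the same rows:
# UpSet plot of co-occurring missingness patterns across columns
gg_miss_upset(Soybean)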
gg_miss_fct(x = Soybean, fct = Class) + labs(title = 'Missing Values by Class')
This missing-frequency plot grouped by a factor is extremely useful for seeing the concentration of missing values by Class. It appears that only a few classes account for all of the missing values, which warrants special treatment or review to understand why the concentration is occurring: it seems unlikely that these values are missing at random, and more likely that the missingness corresponds to the classes in some way.
While imputation might make sense for diaporthe-pod-&-stem-canker or phytophthora-rot, it would probably be practical to consider separately the classes that contain all the null values and determine whether these are slightly different types of soybeans, or whether there is a rationale for why so many of their values are missing. Is there truly some unusual measurement error that corrupted these values? Without understanding the differences from, and similarities to, the other Class types, it would be challenging to estimate them reasonably through imputation. A model built on the remainder of the data (i.e., the classes without nulls) might be a useful starting point.
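A minimal sketch of that starting point: tabulate which classes contain incomplete rows, then keep the complete remainder for an initial model (the variable name soy_complete is mine):
# Which classes account for the incomplete rows?
table(droplevels(Soybean$Class[!complete.cases(Soybean)]))

# Complete-data subset: rows remaining after dropping any row with an NA
soy_complete <- Soybean[complete.cases(Soybean), ]
nrow(soy_complete)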