The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
sum(rowSums(is.na(Glass)) > 0)
## [1] 0
There are no missing values, so we can go straight to plotting histograms for each numeric column.
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
plot_histograms <- function(df, bins = 30, ncol = 2) {
  # Build a histogram for each numeric predictor column (the response, Type, is a factor and is skipped)
  plots <- lapply(names(df), function(col) {
    if (is.numeric(df[[col]])) {
      ggplot(df, aes(x = .data[[col]])) +
        geom_histogram(fill = "blue", color = "black", bins = bins) +
        labs(title = paste("Histogram of", col), x = col, y = "Count") +
        theme_minimal()
    }
  })
  # Drop the NULL entries produced by non-numeric columns before arranging the grid
  plots <- Filter(Negate(is.null), plots)
  do.call(grid.arrange, c(plots, ncol = ncol))
}
plot_histograms(Glass, bins = 25, ncol = 3)
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
Glass %>%
select(where(is.numeric)) %>%
apply(., 2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(Glass)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Do there appear to be any outliers in the data? Are any predictors skewed?
Looking at the histograms, there do appear to be some outliers in the data. The RI column has a couple of values out near the 1.53 mark, the Na column has one value much higher than the rest, and the K column has an obvious outlier near 6. The Ba and Fe columns also contain a handful of extreme values, so the data set does include a few outliers.
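As a supplementary check (a minimal sketch using ggplot2 and tidyr, which are already loaded), per-predictor boxplots make these extreme values easier to see than the histograms alone:
# Reshape the numeric predictors to long format and draw one boxplot per predictor
Glass %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = "", y = value)) +
  geom_boxplot(outlier.colour = "red") +
  facet_wrap(~ predictor, scales = "free_y") +
  labs(x = NULL, y = "Value") +
  theme_minimal()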
Several columns also appear to be skewed. The RI, Al, Ca, Ba, Fe, K, and Na columns are all right skewed (most values are low, with a long tail of higher values), while the Mg and Si columns appear left skewed (most values are high, with a tail of lower values). The skewness statistics above confirm this: K (6.46), Ba (3.37), and Ca (2.02) are the most strongly right skewed.
Looking at the scatter plots, RI and Ca are strongly positively correlated (about 0.810), and the next strongest relationship is the negative correlation between Si and RI (about -0.542). A few other predictor pairs show moderate correlations in the 0.4 range, but those two pairs stand out. Since RI and Ca carry much of the same information, we could consider keeping only one of them in a model.
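To quantify this beyond the ggpairs panel, the correlation matrix can be computed directly (a quick sketch; the caret::findCorrelation call assumes the caret package is installed and simply suggests which member of each highly correlated pair could be dropped at a chosen cutoff):
# Correlation matrix of the numeric predictors
cor_mat <- Glass %>%
  select(where(is.numeric)) %>%
  cor()
round(cor_mat["RI", ], 3)
# Suggest predictors to drop when pairwise |correlation| exceeds 0.75
caret::findCorrelation(cor_mat, cutoff = 0.75)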
Are there any relevant transformations of one or more predictors that might improve the classification model?
Yes, there are transformations that may improve the model. We could apply Box-Cox (or Yeo-Johnson) transformations to all of the predictors and let the data choose the optimal lambda values. Based on the distributions alone, I would apply a log-type transformation to the strongly right-skewed predictors (RI, Ba, K, Fe, Ca); note that Ba, K, and Fe contain zeros, so a plain log is undefined there and something like log(x + 1) or Yeo-Johnson is needed. For the strongly left-skewed Mg column an exponential or squaring transformation is more appropriate, and Si and Na are close enough to normal to leave untransformed.
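As a concrete sketch (assuming the caret package is installed; Yeo-Johnson is used here because Ba, K, and Fe contain zeros, which rules out a plain Box-Cox/log transform for those columns):
library(caret)
# Estimate Yeo-Johnson transformations for the nine predictors (column 10 is Type)
pp <- preProcess(Glass[, -10], method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, Glass[, -10])
# Skewness should move closer to zero after the transformation
apply(glass_trans, 2, skewness)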
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
library(mlbench)
data(Soybean)
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
Get info about the Soybean data set.
?Soybean
Function to plot frequency distributions for categorical variables
plot_cat_dists <- function(df, ncol = 6) {
  # Build a bar chart of level frequencies for each categorical predictor (Class is excluded)
  plots <- lapply(names(df), function(col) {
    if (is.factor(df[[col]]) && col != "Class") {
      ggplot(df, aes(x = .data[[col]])) +
        geom_bar() +
        labs(title = col, x = col, y = "Count") +
        theme_minimal()
    }
  })
  # Drop the NULL entry produced by the Class column before arranging the grid
  plots <- Filter(Negate(is.null), plots)
  do.call(grid.arrange, c(plots, ncol = ncol))
}
Plot the frequency distributions for categorical variables.
plot_cat_dists(Soybean)
Looking at the output above, a few variables take on primarily one value, making them degenerate. The variables “mycelium”, “sclerotia”, and “shriveling” are clearly degenerate, and “leaf.mild”, “leaf.malf”, “leaves”, “lodging”, “seed.size”, “seed.discolor”, and “mold.growth” are somewhat degenerate, with most samples falling in a single level; this can be checked with caret::nearZeroVar, as sketched below.
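A sketch of that check (assuming the caret package is installed); nearZeroVar flags predictors whose dominant-level frequency ratio and percentage of unique values cross its default thresholds:
library(caret)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
# Show only the predictors flagged as near-zero variance
nzv[nzv$nzv, ]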
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Get the percentage of missing data for each column.
missing_data <- colSums(is.na(Soybean)) / nrow(Soybean) * 100
missing_data_sorted <- sort(missing_data[missing_data > 0], decreasing = TRUE)
missing_data_sorted
## hail sever seed.tmt lodging germ
## 17.7159590 17.7159590 17.7159590 17.7159590 16.3982430
## leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
## 15.8125915 15.5197657 15.5197657 15.5197657 15.5197657
## leaf.shread seed mold.growth seed.size leaf.halo
## 14.6412884 13.4699854 13.4699854 13.4699854 12.2986823
## leaf.marg leaf.size leaf.malf fruit.pods precip
## 12.2986823 12.2986823 12.2986823 12.2986823 5.5636896
## stem.cankers canker.lesion ext.decay mycelium int.discolor
## 5.5636896 5.5636896 5.5636896 5.5636896 5.5636896
## sclerotia plant.stand roots temp crop.hist
## 5.5636896 5.2708638 4.5387994 4.3923865 2.3426061
## plant.growth stem date area.dam
## 2.3426061 2.3426061 0.1464129 0.1464129
From the output above we can see that the four most incomplete predictors (hail, sever, seed.tmt, and lodging) are missing at exactly the same rate, about 17.7%. It is very unlikely that four columns would share the exact same amount of missing data by chance, which suggests the missing values occur together in the same rows and points to an underlying pattern, most plausibly one tied to particular classes rather than random missingness.
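One way to check that directly (a sketch using the already-loaded dplyr; no results are assumed here) is to count, per class, how many rows contain any missing value:
# For each class, count rows and how many of them have at least one NA
Soybean %>%
  mutate(n_missing = rowSums(is.na(across(-Class)))) %>%
  group_by(Class) %>%
  summarise(rows = n(),
            rows_with_na = sum(n_missing > 0)) %>%
  arrange(desc(rows_with_na))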
My strategy for handling the missing data would be to remove the four predictors with the same ~17.7% missing rate; too much of each is missing for imputation to be reliable. For the remaining predictors I would impute the missing values, for example with a KNN-style imputer, and I would also check whether relationships between predictors could support a simple model-based imputation wherever KNN performs poorly. I would be cautious with predictors that are more than about 15% missing, since imputed values there are less trustworthy, but it is still worth attempting to fill them in with one of those two approaches.
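A rough sketch of that plan (assuming the mice package is installed; a factor-aware imputer is used here because KNN imputation in caret operates on numeric data, while the Soybean predictors are factors):
library(mice)
# Drop the four predictors that are ~18% missing, then impute the rest;
# mice picks an imputation method per column (e.g., logistic/polytomous regression for factors)
soy_reduced <- Soybean %>% select(-hail, -sever, -seed.tmt, -lodging)
imp <- mice(soy_reduced, m = 1, maxit = 5, seed = 123, printFlag = FALSE)
soy_complete <- complete(imp)
sum(is.na(soy_complete))  # expect this to be 0 (or close to it) after imputation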