Libraries
library(mlbench)
library(corrplot)
library(e1071)
library(visdat)
library(naniar)
library(dplyr)
library(tidyr)
library(ggplot2)
Exercise 3.1
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
Part A
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
glass_df <- Glass[, 1:9]
glass_gather <- glass_df %>%
  gather(key = 'variable', value = 'value')
ggplot(glass_gather) +
  geom_histogram(aes(x = value, y = ..density..), bins = 50) +
  geom_density(aes(x = value), color = 'red') +
  facet_wrap(. ~ variable, scales = 'free', ncol = 4)
The histograms above tell us a great deal about the distributions of the predictors in our glass classification problem. Al, Ca, Na, RI, and Si are all relatively centered; each exhibits a slight right skew except Si, which exhibits a slight left skew. Ba, Fe, and K are extremely right skewed. Finally, Mg is strongly left skewed and also shows some bimodal behavior.
glass_df %>%
  cor() %>%
  corrplot()
Understanding the relationships between predictors is essential to building an effective predictive model. From the correlation plot above we can see that most features do not have a strong positive correlation with each other; only Ca and RI exhibit a strong positive correlation. Conversely, several features are negatively correlated, such as Si-RI and Ba-Mg.
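To double-check the visual read of the correlation plot, we can list the strongest pairwise correlations numerically. A minimal base-R sketch (the cor_pairs name is just illustrative):
cor_mat   <- cor(glass_df)
cor_pairs <- as.data.frame(as.table(cor_mat))   # columns: Var1, Var2, Freq
cor_pairs <- subset(cor_pairs, as.character(Var1) < as.character(Var2))  # drop self/duplicate pairs
cor_pairs <- cor_pairs[order(-abs(cor_pairs$Freq)), ]
head(cor_pairs, 5)   # Ca-RI should appear near the top with a positive value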
Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Glass Types")
Finally, looking at the target variable we can see that this is a multi-class classification problem with imbalanced classes.
Part B
Do there appear to be any outliers in the data? Are any predictors skewed?
glass_df %>%
  apply(., 2, skewness)
##        RI        Na        Mg        Al        Si         K        Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
##        Ba        Fe
## 3.3686800 1.7298107
There certainly appear to be some outliers, most notably in RI, Ba, and K. The skewness values above also confirm that most of the predictors are skewed, with K, Ba, and Ca the most extreme.
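To see the outliers directly rather than inferring them from skewness alone, a faceted boxplot of the gathered predictors works well; a small sketch reusing glass_gather from Part A:
ggplot(glass_gather, aes(x = '', y = value)) +
  geom_boxplot() +                                     # points beyond the whiskers flag potential outliers
  facet_wrap(. ~ variable, scales = 'free', ncol = 4)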
Part C
Are there any relevant transformations of one or more predictors that might improve the classification model?
Most of the skewed features could benefit from a log transformation, although Ba, Fe, and K contain zero values, so a shifted log (or Yeo-Johnson) would be needed for those. In general I would attempt a Box-Cox transformation and see how the resulting features look distributionally. We also saw some strong pairwise correlation (e.g., between Ca and RI), so a PCA transformation could be used to avoid that collinearity.
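As a sketch of how this could be done, caret's preProcess covers both steps; caret is not loaded above, so treating it as available is an assumption, and Yeo-Johnson is used here instead of Box-Cox because Ba, Fe, and K contain zeros:
library(caret)
# Estimate per-column Yeo-Johnson transformations plus centering and scaling
pp <- preProcess(glass_df, method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, glass_df)
apply(glass_trans, 2, skewness)   # compare with the skewness values above
# Adding "pca" to the method list also handles the Ca/RI collinearity
pp_pca <- preProcess(glass_df, method = c("YeoJohnson", "center", "scale", "pca"))
glass_pca <- predict(pp_pca, glass_df)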
Exercise 3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Part A
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
data(Soybean)
head(Soybean)
data(Soybean)
columns <- colnames(Soybean)
lapply(columns,
       function(col) {
         ggplot(Soybean, aes_string(col)) + geom_bar() + ggtitle(col)
       })
## (output: a list of 36 bar charts, one per column; plots omitted)
Degenerate distributions are those where nearly all of the instances fall in a single category, so the predictor has almost no variance. We can see this type of behavior for the following features: mycelium, leaves, and shriveling.
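Rather than judging the bar charts by eye, the near-zero-variance diagnostic in caret can flag degenerate predictors directly; a minimal sketch, assuming caret is available (it is not loaded above):
library(caret)
# freqRatio compares the most common level to the second most common;
# nzv == TRUE marks predictors with a degenerate (near-zero-variance) distribution
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]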
Part B
Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
vis_miss(Soybean)
gg_miss_upset(Soybean,
              nsets = 20,
              nintersects = NA)
columns <- colnames(Soybean)
new_cols <- append(columns, "missing")
sb2 <- cbind(Soybean, rowSums(is.na(Soybean)))
colnames(sb2) <- new_cols
ggplot(sb2) +
  geom_histogram(aes(x = missing, y = ..density..), bins = 50) +
  geom_density(aes(x = missing), color = 'red') +
  facet_wrap(. ~ Class, scales = 'free', ncol = 4)
Based on the charts above, there does appear to be a pattern in the missing values that is determined by the class. Three classes stand out: 2-4-d-injury, diaporthe-pod-&-stem-blight, and phytophthora-rot.
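A per-class summary of the missingness makes the same point numerically; a minimal dplyr sketch (the n_missing name is just illustrative):
Soybean %>%
  mutate(n_missing = rowSums(is.na(across(-Class)))) %>%
  group_by(Class) %>%
  summarise(rows = n(),
            pct_rows_with_na = mean(n_missing > 0),
            avg_na_per_row = mean(n_missing)) %>%
  arrange(desc(pct_rows_with_na))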
Part C
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
There are a few strategies for handling these missing values. Because the missing values are tied to a handful of classes, we could simply remove those classes (or the worst-affected predictors) altogether, but that discards information we may want to keep. My choice would be a more advanced approach such as k-nearest-neighbour (KNN) imputation, which fills in each missing value from the most similar complete records.
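As a sketch of what that could look like for factor predictors, the VIM package's kNN() function handles mixed-type data; VIM is not loaded above, so its availability (and the object names below) are assumptions:
library(VIM)
# Impute each missing value from the k = 5 most similar rows; kNN() also appends
# logical *_imp indicator columns recording which cells were imputed
soy_imputed <- kNN(Soybean, k = 5)
soy_imputed <- soy_imputed[, colnames(Soybean)]   # keep only the original columns
sum(is.na(soy_imputed))                           # should now be 0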