DATA 624 Homework 4
Exercise 3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
library(caret)
library(corrplot)
library(dplyr)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
First, we create a subset of just our predictor variables.
predictors <- Glass |>
select(-Type)
head(predictors)
Now we can plot the distributions using histograms and boxplots.
# plot distributions
# histograms
par(mfrow = c(3, 3))
par(mai = c(.3, .3, .3, .3))
for (predictor in names(predictors)) {
  hist(predictors[[predictor]], main = predictor, col = 'lightblue')
}
# boxplots
par(mfrow = c(3, 3))
par(mai = c(.25, .25, .25, .25))
for (predictor in names(predictors)) {
  boxplot(predictors[[predictor]],
          main = predictor,
          col = 'lightblue',
          horizontal = TRUE)
}
Now we can visualize the relationships between the predictor variables
using a correlation plot and scatter plots.
# relationships between predictors
# correlation plot
corrplot(cor(predictors),
         method = "color",
         diag = FALSE,
         type = "lower",
         addCoef.col = "black",
         number.cex = 0.70)
# pairplot
pairs(predictors)
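To complement the plots, the strongest pairwise correlations can also be listed numerically. The following is a minimal sketch in base R; the names cor_matrix and pairs_df are illustrative and not part of the original analysis.

```r
library(mlbench)

data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

cor_matrix <- cor(predictors)

# Blank out the upper triangle and diagonal so each pair appears only once
cor_matrix[upper.tri(cor_matrix, diag = TRUE)] <- NA

# Reshape the matrix into one row per predictor pair
pairs_df <- as.data.frame(as.table(cor_matrix), stringsAsFactors = FALSE)
names(pairs_df) <- c("var1", "var2", "correlation")
pairs_df <- pairs_df[!is.na(pairs_df$correlation), ]

# Sort by absolute correlation, strongest first
pairs_df <- pairs_df[order(-abs(pairs_df$correlation)), ]
head(pairs_df, 5)
```

Sorting by absolute value keeps strong negative relationships near the top alongside strong positive ones.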
b. Do there appear to be any outliers in the data? Are any predictors
skewed?
Na appears to be approximately normally distributed with a slight right skew. Al, RI, and Ca are also right skewed. Fe, Ba, and K are all severely right skewed. Si is left skewed, and Mg is bimodal and left skewed.
From the boxplots, we see a number of outliers for all but Mg.
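These visual impressions can be checked numerically. The sketch below computes a moment-based sample skewness and counts the points that fall beyond the boxplot whiskers for each predictor. The helper names sample_skewness and count_outliers are hypothetical, and the whisker rule is the default used by boxplot.stats() (1.5 times the box length).

```r
library(mlbench)

data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Moment-based sample skewness: positive = right skew, negative = left skew
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Number of points beyond the boxplot whiskers (default coef = 1.5)
count_outliers <- function(x) length(boxplot.stats(x)$out)

summary_df <- data.frame(
  skewness = round(sapply(predictors, sample_skewness), 2),
  outliers = sapply(predictors, count_outliers)
)

# Most skewed predictors first
summary_df[order(-abs(summary_df$skewness)), ]
```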
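We could apply Box-Cox transformations to reduce the skewness of some of the variables. Box-Cox requires strictly positive values, so it cannot be applied directly to predictors that contain zeros, such as Ba and Fe. We could also use the spatial sign transformation to reduce the influence of outliers.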
# Box-Cox transformation of Al
par(mfrow=c(1,2))
BoxCoxTrans(predictors$Al)
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 1.190 1.360 1.445 1.630 3.500
##
## Largest/Smallest: 12.1
## Sample Skewness: 0.895
##
## Estimated Lambda: 0.5
hist(predictors$Al, main='Original Distribution of Al')
hist(predictors$Al^0.5, main='Transformed (Lambda = 0.5)')
# Spatial sign transformation of predictors
boxplot(predictors, main='Original Distributions')
boxplot(caret::spatialSign(scale(predictors)), main='Spatial Sign Transformed')
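The two transformations above could also be combined into a single preprocessing step with caret's preProcess(). This is a sketch rather than a recommended pipeline, and the object names pp and transformed are illustrative. Because Box-Cox only applies to strictly positive data, predictors containing zeros are left untransformed by that step.

```r
library(mlbench)
library(caret)

data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Estimate Box-Cox lambdas where possible, then center, scale,
# and project each sample onto the unit sphere (spatial sign)
pp <- preProcess(predictors,
                 method = c("BoxCox", "center", "scale", "spatialSign"))
pp

# Apply the estimated transformations to the predictors
transformed <- predict(pp, predictors)
boxplot(transformed, main = 'Box-Cox + Spatial Sign')
```

After the spatial sign step every row has unit length, so no single sample can sit far from the rest of the data.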
Exercise 3.2 The soybean data can also be found at the UC Irvine Machine
Learning Repository. Data were collected to predict disease in 683
soybeans. The 35 predictors are mostly categorical and include
information on the environmental conditions (e.g., temperature,
precipitation) and plant conditions (e.g., leaf spots, mold growth). The
outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
predictors <- Soybean |>
select(-Class)
for (predictor in names(predictors)) {
  print(
    # .data[[predictor]] looks up the column by name inside aes()
    ggplot(data = predictors, aes(x = .data[[predictor]])) +
      geom_bar() +
      labs(title = paste("Bar plot of", predictor), x = predictor)
  )
}
Many of the predictors have missing values. Several of the predictors
are also very imbalanced, with almost all of the observations falling
into a single level: leaf.malf, leaf.mild, lodging, mycelium,
int.discolor, sclerotia, mold.growth, seed.discolor, seed.size, and
shriveling.
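One way to quantify this imbalance, rather than judging it from the bar plots alone, is caret's nearZeroVar(). The sketch below uses the function's default cutoffs (freqCut = 95/5 and uniqueCut = 10), which may not match the visual judgment made above; the name nzv_metrics is illustrative.

```r
library(mlbench)
library(caret)

data(Soybean)
predictors <- Soybean[, names(Soybean) != "Class"]

# freqRatio: count of the most common level divided by the second most common
# percentUnique: distinct values as a percentage of the number of rows
nzv_metrics <- nearZeroVar(predictors, saveMetrics = TRUE)

# Most imbalanced predictors first
nzv_metrics[order(-nzv_metrics$freqRatio), ]
```

Predictors with nzv equal to TRUE are the ones flagged as near-zero variance under those defaults.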
We can calculate the percentage of data missing from each variable.
data.frame('percent_missing' = sort(round(colMeans((is.na(predictors)) * 100), 2), decreasing = T))
hail, sever, seed.tmt, and lodging have the highest proportion of missing values, with over 17% of the observations in each of these columns missing.
# count missing values in every predictor, separately for each class
missing_df <- Soybean |>
  group_by(Class) |>
  summarise(across(everything(), ~ sum(is.na(.x))))
# total missing values per class
missing_classes <- missing_df |>
  select(-Class) |>
  rowSums()
missing_classes_df <- data.frame('Class' = missing_df$Class,
                                 'missing' = missing_classes)
missing_classes_df |>
  ggplot(aes(x = missing, y = reorder(Class, missing))) +
  geom_col(fill = 'red') +
  labs(title = 'Missing Values per Class', y = 'Class', x = 'Missing Values')
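Five classes (phytophthora-rot, 2-4-d-injury, cyst-nematode,
diaporthe-pod-&-stem-blight, and herbicide-injury) account for all
the missing values in the dataset, which suggests that the missingness
is related to the outcome class rather than occurring at random.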
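To see how much of each class is affected, we can also compute the share of rows in each class that contain at least one missing value. This is a base R sketch; the names incomplete, pct, and by_class are illustrative.

```r
library(mlbench)

data(Soybean)

# TRUE for every row with at least one missing value
incomplete <- !complete.cases(Soybean)

# Share of incomplete rows within each class
pct <- tapply(incomplete, Soybean$Class, mean)

by_class <- data.frame(
  class = names(pct),
  n_rows = as.vector(table(Soybean$Class)),
  pct_incomplete = round(100 * as.vector(pct), 1)
)

# Only the classes that contain missing data
by_class[by_class$pct_incomplete > 0, ]
```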
We could use KNN imputation to fill in the missing data. We could also eliminate variables with too many missing values.
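The second option can be sketched as follows. The 15% cutoff (max_missing) is a hypothetical choice for illustration only, not a value taken from the analysis above, and the names reduced and dropped are likewise illustrative.

```r
library(mlbench)

data(Soybean)
predictors <- Soybean[, names(Soybean) != "Class"]

# Hypothetical cutoff: drop predictors with more than 15% missing values
max_missing <- 0.15

# Fraction of missing values in each predictor
missing_frac <- colMeans(is.na(predictors))

dropped <- names(missing_frac)[missing_frac > max_missing]
reduced <- predictors[, missing_frac <= max_missing, drop = FALSE]

dropped
dim(reduced)
```

Any missing values left in the remaining predictors would still need to be imputed or otherwise handled before modeling.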