DATA 624 Homework 4

Exercise 3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
library(caret)
library(corrplot)
library(dplyr)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First, we create a subset of just our predictor variables.

predictors <- Glass |>
  select(-Type)

head(predictors)

Now we can plot the distributions using histograms and boxplots.

# plot distributions
# histogram
par(mfrow=c(3,3))
par(mai=c(.3,.3,.3,.3))
for (predictor in names(predictors)) {
  hist(predictors[[predictor]], main = predictor, col='lightblue')
}

# boxplot
par(mfrow=c(3,3))
par(mai=c(.25,.25,.25,.25))
for (predictor in names(predictors)) {
  boxplot(predictors[[predictor]], 
          main = predictor, 
          col='lightblue',
          horizontal = TRUE)
}

Now we can visualize the relationships between the predictor variables using a correlation plot and scatter plots.

# relationships between predictors
# correlation plot
corrplot(cor(predictors), 
         method="color",
         diag=FALSE,
         type="lower",
         addCoef.col = "black",
         number.cex=0.70)

# pairplot
pairs(predictors)
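
The correlation plot can be backed up with a ranked list of predictor pairs. The following is a minimal sketch in base R; the names `glass_predictors` and `cor_pairs` are introduced here for illustration and are not part of the original analysis.

```r
library(mlbench)  # provides the Glass data

data(Glass)
glass_predictors <- Glass[, names(Glass) != "Type"]

# Correlation matrix, keeping each unordered pair once (upper triangle)
cors <- cor(glass_predictors)
idx <- which(upper.tri(cors), arr.ind = TRUE)

cor_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = round(cors[idx], 2)
)

# Strongest relationships (positive or negative) first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 5)
```

Sorting by absolute value keeps strong negative relationships near the top alongside strong positive ones.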

b. Do there appear to be any outliers in the data? Are any predictors skewed?

Na appears roughly normal with a slight right skew. Al, RI, and Ca are also right-skewed. Fe, Ba, and K are all severely right-skewed. Si is left-skewed, and Mg is bimodal and left-skewed.

The boxplots show outliers (points beyond the whiskers, which by default extend 1.5 times the interquartile range from the box) for every predictor except Mg.
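
The visual impressions above can be checked numerically. The sketch below assumes a simple moment-based definition of skewness and the default boxplot rule for outliers; the helper names `sample_skewness` and `count_outliers` are hypothetical and introduced only for this example.

```r
library(mlbench)  # provides the Glass data

data(Glass)
glass_predictors <- Glass[, names(Glass) != "Type"]

# Moment-based sample skewness: third central moment / (second central moment)^1.5
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Number of points beyond the boxplot whiskers (default coef = 1.5)
count_outliers <- function(x) length(boxplot.stats(x)$out)

skew_summary <- data.frame(
  skewness   = round(sapply(glass_predictors, sample_skewness), 2),
  n_outliers = sapply(glass_predictors, count_outliers)
)

# Most skewed predictors first
skew_summary[order(-abs(skew_summary$skewness)), ]
```

Positive values indicate a right skew and negative values a left skew, so the sign can be compared directly against the histograms.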

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

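We could apply Box-Cox transformations to reduce the skewness of some of the variables. Box-Cox requires strictly positive values, so predictors that contain zeros, such as Ba and Fe, would need a different approach (for example, a Yeo-Johnson transformation). We could also use the spatial sign transformation to reduce the influence of outliers.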

# Box-Cox transformation of Al
par(mfrow=c(1,2))
BoxCoxTrans(predictors$Al) 
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   1.190   1.360   1.445   1.630   3.500 
## 
## Largest/Smallest: 12.1 
## Sample Skewness: 0.895 
## 
## Estimated Lambda: 0.5
hist(predictors$Al, main='Original Distribution of Al')
hist(sqrt(predictors$Al), main='Transformed (Lambda = 0.5)')

# Spatial sign transformation of predictors (centered and scaled first)
boxplot(predictors, main='Original Distributions')
boxplot(caret::spatialSign(scale(predictors)), main='Spatial Sign Transformed')
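
To apply a transformation to all nine predictors at once, one option is `caret::preProcess`. The sketch below is an assumption about how this could be done, not part of the original analysis: it swaps in the Yeo-Johnson transformation (which tolerates zeros) for Box-Cox, then centers and scales. The names `glass_predictors`, `pp`, and `glass_transformed` are illustrative.

```r
library(mlbench)  # provides the Glass data
library(caret)    # provides preProcess()

data(Glass)
glass_predictors <- Glass[, names(Glass) != "Type"]

# Estimate the transformations from the data:
# Yeo-Johnson for skewness, then centering and scaling
pp <- preProcess(glass_predictors,
                 method = c("YeoJohnson", "center", "scale"))

# Apply the estimated transformations
glass_transformed <- predict(pp, glass_predictors)

# Compare one predictor before and after
par(mfrow = c(1, 2))
hist(glass_predictors$Al, main = "Al (original)")
hist(glass_transformed$Al, main = "Al (transformed)")
```

In a modeling workflow the `preProcess` object would be estimated on the training set only and then applied to held-out data with `predict()`.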

Exercise 3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

predictors <- Soybean |>
  select(-Class)

# one bar plot per predictor; .data[[predictor]] maps a column chosen by name
for (predictor in names(predictors)) {
  print(
    ggplot(data = predictors, aes(x = .data[[predictor]])) +
      geom_bar() +
      labs(title = paste("Bar plot of", predictor), x = predictor)
  )
}

Many of the predictors have missing values. Several predictors are also very imbalanced, with almost all of the observations falling into a single level: leaf.malf, leaf.mild, lodging, mycelium, int.discolor, sclerotia, mold.growth, seed.discolor, seed.size, and shriveling.
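
The imbalance seen in the bar plots can also be screened with `caret::nearZeroVar`, which reports the ratio of the most common to the second most common level and the percentage of unique values for each column. This is a sketch using the function's default cutoffs, which are an assumption here rather than a choice made in the original analysis.

```r
library(mlbench)  # provides the Soybean data
library(caret)    # provides nearZeroVar()

data(Soybean)
soy_predictors <- Soybean[, names(Soybean) != "Class"]

# saveMetrics = TRUE returns the per-column metrics instead of column indices
nzv_metrics <- nearZeroVar(soy_predictors, saveMetrics = TRUE)

# Most imbalanced predictors first
nzv_metrics[order(-nzv_metrics$freqRatio), ]

# Predictors flagged as near-zero variance under the default cutoffs
flagged <- rownames(nzv_metrics)[nzv_metrics$nzv]
flagged
```

A predictor can look imbalanced in a bar plot without being flagged, since the default rule requires both a high frequency ratio and a low percentage of unique values.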

b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

We can calculate the percentage of data missing from each variable.

data.frame(percent_missing = sort(round(colMeans(is.na(predictors)) * 100, 2), decreasing = TRUE))

hail, sever, seed.tmt, and lodging have the highest proportion of missing values, with over 17% of the observations in each of these columns missing.

missing_df <- Soybean |>
  group_by(Class) |>
  summarise(across(everything(), ~ sum(is.na(.x))))

missing_classes <- missing_df |>
  select(-Class) |>
  rowSums()

missing_classes_df <- data.frame('Class' = missing_df$Class,
                                 'missing' = missing_classes)

missing_classes_df |>
  ggplot(aes(x = missing, y = reorder(Class, missing))) +
  geom_col(fill='red') +
  labs(title = 'Missing Values per Class', y = 'Class', x = 'Missing Values')

Five classes (phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury) account for all of the missing values in the dataset, so the pattern of missing data is strongly related to class.
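
Raw counts of missing values depend on how many samples each class has, so it can help to look at the share of incomplete rows per class as well. The following base R sketch is an illustration; the names `has_na` and `by_class` are introduced here.

```r
library(mlbench)  # provides the Soybean data

data(Soybean)

# TRUE for every row with at least one missing value
has_na <- !complete.cases(Soybean)

by_class <- data.frame(
  n_rows       = as.vector(table(Soybean$Class)),
  n_incomplete = as.vector(tapply(has_na, Soybean$Class, sum)),
  row.names    = levels(Soybean$Class)
)
by_class$pct_incomplete <- round(100 * by_class$n_incomplete / by_class$n_rows, 1)

# Only the classes that contain incomplete rows
by_class[by_class$n_incomplete > 0, ]
```

A class in which every row is incomplete would disappear entirely if incomplete rows were dropped, which this table makes easy to spot.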

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

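We could use KNN imputation to fill in the missing data. We could also eliminate predictors with too many missing values. Because all of the missing values fall in five classes, simply dropping incomplete rows would disproportionately remove observations from those classes, so imputation or predictor removal is preferable to row deletion.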
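
As a rough illustration of such a strategy, the sketch below drops predictors above a missingness cutoff and fills the remaining gaps with each predictor's most frequent level. Mode imputation is used here only as a simpler stand-in for KNN imputation, which on these factor predictors would first need a numeric encoding or a mixed-type distance. The cutoff `max_missing` and the helper `impute_mode` are hypothetical choices made for this example.

```r
library(mlbench)  # provides the Soybean data

data(Soybean)
soy_predictors <- Soybean[, names(Soybean) != "Class"]

# Hypothetical cutoff: drop predictors with more than 15% missing values
max_missing <- 0.15
miss_rate <- colMeans(is.na(soy_predictors))
kept <- soy_predictors[, miss_rate <= max_missing, drop = FALSE]

# Replace NAs in a factor with its most frequent observed level
impute_mode <- function(x) {
  if (anyNA(x)) {
    counts <- table(x)  # table() ignores NA by default
    x[is.na(x)] <- names(counts)[which.max(counts)]
  }
  x
}

# Apply column by column while keeping the data frame structure
soy_imputed <- kept
soy_imputed[] <- lapply(kept, impute_mode)

# Check that no missing values remain
anyNA(soy_imputed)
```

Mode imputation ignores the relationship between missingness and class noted above, so a class-aware or neighbor-based method may preserve more signal.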