Homework 4

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling.

Problem 3.1

3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

Let's load the necessary libraries and data to work through the exercise.

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between the predictors.
library(ggplot2)

hist(Glass$RI, main="Histogram of Refractive Index", xlab="Refractive Index")

The histogram shows that RI is unimodal but right-skewed. Most values are concentrated between 1.51 and 1.53, with a few outliers on the higher end.

Now, I will create histograms for all predictor variables to examine their distributions.

par(mfrow=c(3,3))
for(i in 1:9) {
  hist(Glass[,i], main=names(Glass)[i], xlab=names(Glass)[i])
}

par(mfrow=c(1,1))

In the histograms, I observe that RI, Na, and Si appear to be approximately normally distributed. Al, Ca, Ba, and Fe show right skewness, with several outliers present in these variables. Mg stands out with a large cluster of values at zero.

To address the last part of question a, I will create a pairs plot to visualize the relationships between the predictors.

pairs(Glass[,1:9], main="Pairs Plot of Glass Predictors", pch=21, bg=Glass$Type)

The pairs plot reveals several relationships between the predictors. For example, there is a negative correlation between Si and Ca, while RI and Ca are strongly positively correlated. Other predictors, such as Mg and K, do not show strong relationships with the rest.
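To quantify what the pairs plot suggests, we can also look at the correlation matrix of the nine predictors (a quick numeric check, rounded for readability):

round(cor(Glass[, 1:9]), 2)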

  b. Do there appear to be any outliers in the data? Are any predictors skewed?
boxplot(Glass[,1:9], main="Boxplots of Glass Predictors", las=2)

The boxplots indicate that several predictors have outliers. For instance, Na, Al, K, Ca, Ba, and Fe all show points that fall outside the whiskers, indicating potential outliers. Several predictors are also skewed: K, Ba, and Fe are strongly right-skewed, while Mg is left-skewed because of its cluster of zeros.
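To back the visual impression with numbers, we can compute the sample skewness of each predictor; this sketch assumes the e1071 package is installed:

library(e1071)
# Positive values indicate right skew, negative values left skew
sapply(Glass[, 1:9], skewness)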

  c. Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes. The goal of transforming a predictor is to make its distribution more symmetric, reduce skewness, or stabilize variance. Common transformations include log, square root, and Box-Cox. One caveat: Mg, K, Ba, and Fe contain zeros, so log and Box-Cox transformations cannot be applied to them directly. A Yeo-Johnson transformation (or log(x + 1)) handles zeros and could help normalize right-skewed predictors such as K, Ca, Ba, and Fe, and reduce the impact of outliers.
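As a sketch of one way to do this with caret (an assumption about tooling, not the only option), preProcess can estimate a Yeo-Johnson transformation for each predictor:

library(caret)

# Yeo-Johnson accommodates the zeros in Mg, K, Ba, and Fe that Box-Cox cannot
pp <- preProcess(Glass[, 1:9], method = "YeoJohnson")
Glass_trans <- predict(pp, Glass[, 1:9])
summary(Glass_trans)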

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spot, mold growth). The outcome labels consist of 19 distinct classes.

Let's start by loading the data and examining its structure.

library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

We can plot barplots for each categorical predictor to visualize their frequency distributions. Note: a degenerate (near-zero variance) distribution is one in which a single category dominates while the remaining categories occur very rarely.

library(ggplot2)

par(mfrow=c(3,3))
for(i in 2:36) {  # column 1 is the Class outcome, not a predictor
  barplot(table(Soybean[,i]), main=names(Soybean)[i], xlab=names(Soybean)[i])
}

par(mfrow=c(1,1))

From the barplots, we can see that some predictors have degenerate distributions. For leaf.mild, lodging, mycelium, and fruiting.bodies, one category dominates while the other categories have very few observations. This could cause problems in modeling, since the model may not learn effectively from categories with so few instances.
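caret's nearZeroVar() function implements the chapter's near-zero-variance heuristics (the frequency ratio and the percent of unique values), so we can check the visual impression programmatically; I would expect it to flag predictors such as leaf.mild, mycelium, and sclerotia:

library(caret)
# Return the names of near-zero-variance predictors
nearZeroVar(Soybean, names = TRUE)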

  b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

We can calculate the percentage of missing values for each predictor and visualize the pattern of missing data.

missing_percent <- colSums(is.na(Soybean)) / nrow(Soybean) * 100
barplot(missing_percent, main="Percentage of Missing Values by Predictor", las=2, ylab="Percentage Missing")

high_missing <- missing_percent[missing_percent > 15]
barplot(high_missing, main="Predictors with More Than 15% Missing Values", las=2, ylab="Percentage Missing")
abline(h=15, col="red", lty=2)

From the bar plot, we can see that ten predictors exceed 15% missing values: hail, sever, seed.tmt, germ, leaf.mild, lodging, fruiting.bodies, fruit.spots, seed.discolor, and shriveling. To see whether the missingness follows a pattern across classes, we can create a heatmap of missing-value rates by predictor and class.

library(dplyr)
library(tidyr)

missing_by_class <- Soybean |>
  group_by(Class) |>
  summarise(across(everything(), ~ mean(is.na(.)))) |>
  pivot_longer(-Class, names_to = "predictor", values_to = "missing_rate")

ggplot(missing_by_class, aes(x = predictor, y = Class, fill = missing_rate)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Missing Data Rates by Predictor and Class",
       x = "Predictor", y = "Class", fill = "Missing %") +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

In the heatmap, we can observe that the missing values are concentrated in a few classes rather than spread evenly: classes such as phytophthora-rot and 2-4-d-injury show high missing rates across many predictors, while most other classes have essentially none. This suggests that the pattern of missing data is related to the classes.
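A quick tabulation makes this concrete: counting the rows that contain at least one NA, by class, shows which classes account for the missingness.

# Rows with at least one missing value, tabulated by class
incomplete <- Soybean[!complete.cases(Soybean), ]
table(droplevels(incomplete$Class))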

  c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
  • Remove predictors with a high percentage of missing values: predictors with more than 15% missing values (hail, sever, seed.tmt, germ, leaf.mild, lodging, fruiting.bodies, fruit.spots, seed.discolor, and shriveling) could be removed from the dataset to avoid introducing bias or noise into the model.
Soybean_cleaned <- Soybean |>
  select(-c(hail, sever, seed.tmt, germ, leaf.mild, lodging,
            fruiting.bodies, fruit.spots, seed.discolor, shriveling))

str(Soybean_cleaned)
## 'data.frame':    683 obs. of  26 variables:
##  $ Class        : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date         : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand  : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip       : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp         : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ crop.hist    : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam     : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo    : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg    : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size    : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion: Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ ext.decay    : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods   : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots        : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  • Impute missing values for predictors with low to moderate missingness: for predictors with less than 15% missing values, we can use imputation methods such as mode imputation (for categorical variables) or k-nearest neighbors (KNN) imputation to fill in the missing values; sketches of two simple options follow.

One lightweight alternative for categorical variables is to make the missingness explicit rather than impute it: the forcats::fct_explicit_na() function replaces NA values with a new "(Missing)" level, preserving the fact that the value was absent (in forcats 1.0.0 and later, fct_na_value_to_level() is the recommended replacement).

df_explicit <- Soybean_cleaned |>
  mutate(across(where(is.factor), ~ forcats::fct_explicit_na(.)))
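If true mode imputation is preferred, a minimal sketch is below; impute_mode is a hypothetical helper written for this exercise, not a library function, and this simple approach ignores the class-dependence of the missingness observed above.

library(dplyr)

# Hypothetical helper: replace NAs in a factor with its most frequent level
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}

Soybean_imputed <- Soybean_cleaned |>
  mutate(across(-Class, impute_mode))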
  • Use models that can handle missing data: Some machine learning algorithms, such as decision trees and random forests, can handle missing data internally. Using these models can help mitigate the impact of missing values without the need for extensive preprocessing.
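As a sketch of this third option (assuming a tree-based model is acceptable for this problem), rpart fits CART models using surrogate splits, so the NAs can be left in place:

library(rpart)

# rpart handles missing predictor values via surrogate splits by default
fit <- rpart(Class ~ ., data = Soybean)
printcp(fit)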