library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Chapter 3 Exercise 1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

# This code provided by the textbook
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

ggplot(data = Glass, aes(x = RI)) + geom_histogram(binwidth = 0.0005)

ggplot(data = Glass, aes(x = Na)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = Mg)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = Al)) + geom_histogram(binwidth = 0.05)

ggplot(data = Glass, aes(x = Si)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = K)) + geom_histogram(binwidth = 0.01)

ggplot(data = Glass, aes(x = Ca)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = Ba)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = Fe)) + geom_histogram(binwidth = 0.01)
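
The nine histograms above can also be drawn as a single faceted plot by reshaping the predictors to long format. A minimal sketch using the already-loaded tidyr and ggplot2 (the bin count of 30 is an arbitrary choice):

# Pivot the nine numeric predictors to long format and facet by predictor
Glass %>%
  pivot_longer(RI:Fe, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")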

The relationships between the predictors can be explored using scatterplots. To determine the most interesting relationships to investigate, I’ll start by looking for correlation coefficients close to 1 or -1 between pairs of variables.

cor_table <- round(cor(Glass[,1:9]), 2)
print(cor_table)
##       RI    Na    Mg    Al    Si     K    Ca    Ba    Fe
## RI  1.00 -0.19 -0.12 -0.41 -0.54 -0.29  0.81  0.00  0.14
## Na -0.19  1.00 -0.27  0.16 -0.07 -0.27 -0.28  0.33 -0.24
## Mg -0.12 -0.27  1.00 -0.48 -0.17  0.01 -0.44 -0.49  0.08
## Al -0.41  0.16 -0.48  1.00 -0.01  0.33 -0.26  0.48 -0.07
## Si -0.54 -0.07 -0.17 -0.01  1.00 -0.19 -0.21 -0.10 -0.09
## K  -0.29 -0.27  0.01  0.33 -0.19  1.00 -0.32 -0.04 -0.01
## Ca  0.81 -0.28 -0.44 -0.26 -0.21 -0.32  1.00 -0.11  0.12
## Ba  0.00  0.33 -0.49  0.48 -0.10 -0.04 -0.11  1.00 -0.06
## Fe  0.14 -0.24  0.08 -0.07 -0.09 -0.01  0.12 -0.06  1.00

The only strong (|cor| > 0.7) correlation in the table is between RI and Ca (0.81). The scatterplot of these variables is shown below.

ggplot(Glass, aes(x = RI, y = Ca)) + geom_point()

The strongest remaining correlations are between RI and Si (-0.54), Mg and Al (-0.48), Mg and Ba (-0.49), and Al and Ba (0.48). It’s interesting that Mg is correlated with Al and Ba, which are also correlated with one another.

The scatterplots of these relationships are shown below.

ggplot(Glass, aes(x = RI, y = Si)) + geom_point()

ggplot(Glass, aes(x = Al, y = Mg)) + geom_point()

ggplot(Glass, aes(x = Ba, y = Mg)) + geom_point()

ggplot(Glass, aes(x = Al, y = Ba)) + geom_point()

It’s clear from these scatterplots that the moderate correlations among Al, Mg, and Ba are driven largely by the many 0 values in Mg and Ba, rather than by a strong relationship among the non-zero values. On the other hand, the scatterplot of RI and Si does show a noticeable inverse relationship.
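
One way to confirm this is to recompute a correlation using only the observations where both variables are non-zero; a quick sketch for the Mg and Ba pair:

# Correlation between Mg and Ba restricted to the non-zero observations
with(subset(Glass, Mg > 0 & Ba > 0), cor(Mg, Ba))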

(b)

Do there appear to be any outliers in the data? Are any predictors skewed?

Skew

RI, Na, Al, and Ca are right-skewed.

Si is left-skewed.

Mg is left-skewed and K is right-skewed, with the additional complication that each of these variables has quite a few 0 values.

Ba and Fe are mostly zeros with a small number of positive non-zero values.
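
These visual impressions can be checked numerically with a skewness statistic. A minimal sketch using the e1071 package (assumed to be installed); positive values indicate right skew, negative values left skew:

library(e1071)

# Sample skewness for each of the nine numeric predictors
apply(Glass[, 1:9], 2, skewness)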

Outliers

Outliers are visible on the box plots of the data.

ggplot(data = Glass, aes(x = RI)) + geom_boxplot()

ggplot(data = Glass, aes(x = Na)) + geom_boxplot()

ggplot(data = Glass, aes(x = Mg)) + geom_boxplot()

ggplot(data = Glass, aes(x = Al)) + geom_boxplot()

ggplot(data = Glass, aes(x = Si)) + geom_boxplot()

ggplot(data = Glass, aes(x = K)) + geom_boxplot()

ggplot(data = Glass, aes(x = Ca)) + geom_boxplot()

ggplot(data = Glass, aes(x = Ba)) + geom_boxplot()

ggplot(data = Glass, aes(x = Fe)) + geom_boxplot()

The box plots of every variable except Mg show at least some outliers. Some noteworthy takeaways (the numeric cutoffs are the boxplots’ Tukey fences, computed as sketched after this list):

  • RI, Na, Al, Si, and Ca have high and low outliers.
  • K, Ba, and Fe only have high outliers.
  • Any Si value that is not between 71 and 74 is labeled as an outlier.
  • Any K value greater than 1.25 is labeled as an outlier.
  • All non-zero Ba values are labeled as outliers.
  • All Fe values greater than 0.25 are labeled as outliers.
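
A sketch of how those fences (1.5 times the IQR beyond the quartiles, geom_boxplot()'s default) can be computed, using Si as an example:

# Lower and upper fences for Si: Q1 - 1.5*IQR and Q3 + 1.5*IQR
quantile(Glass$Si, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(Glass$Si)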

(c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

It might be appropriate to remove either RI or Ca to reduce collinearity. The remaining variables should be centered and scaled: subtract the mean of each variable from its values, then divide by that variable’s standard deviation. I would then consider applying the spatial sign transformation to the variables with outliers, especially those with both high and low outliers (RI, Na, Al, Si, and Ca). For the most skewed variables, a Box-Cox transformation could also help, although it must be applied before centering (it requires strictly positive values) and cannot be used on zero-heavy predictors such as Mg, K, Ba, and Fe; the Yeo-Johnson transformation is an alternative that accepts zeros.
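
These steps could be chained with caret’s preProcess(), which applies its transformations in a fixed internal order (transformation, then centering and scaling, then the spatial sign). A sketch, assuming the caret package is installed; Yeo-Johnson is substituted for Box-Cox here because several predictors contain zeros:

library(caret)

# Estimate the transformations from the nine predictors...
glass_pp <- preProcess(Glass[, 1:9],
                       method = c("YeoJohnson", "center", "scale", "spatialSign"))

# ...and apply them to produce the transformed predictor set
glass_transformed <- predict(glass_pp, Glass[, 1:9])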

Chapter 3 Exercise 2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

# This code provided by the textbook
library(mlbench)
data(Soybean)

(a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

From page 45 in the textbook, the criteria for degenerate predictors are:

  • The fraction of unique values over the sample size is low (say 10 %).
  • The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).
# Get the number of observations (rows) in the data frame
print(nrow(Soybean))
## [1] 683

There are 683 observations in the dataset. Therefore, for a variable to meet the first criterion for a degenerate predictor, it must have 68 or fewer unique values.

# Create a list of columns and an empty list for the number of unique values in the column
col_list <- colnames(Soybean)
col_list <- col_list[-1]
unique_values <- c()

# Find the number of unique values in each column and append it to the list
for (i in 2:ncol(Soybean)) {
  unique_values <- append(unique_values, length(unique(Soybean[,i])))
}

# Create a data frame of the results and print them in descending order of the number of unique values
degeneracy_df <- data.frame(col_list, unique_values)
print(head(degeneracy_df %>% arrange(desc(unique_values))))
##        col_list unique_values
## 1          date             8
## 2     crop.hist             5
## 3      area.dam             5
## 4  stem.cankers             5
## 5 canker.lesion             5
## 6    fruit.pods             5

No column has more than 8 unique values, so every predictor satisfies the first criterion and has the potential to qualify as degenerate. To check the second criterion for degeneracy, it is necessary to determine the frequency of the two most prevalent values for each variable.

# Create empty lists for the frequency of the two most prevalent values for each variable
prev_1 <- c()
prev_2 <- c()

# Find the frequency of the two most prevalent values for each column and append them to the lists
for (i in 2:ncol(Soybean)) {
  sorted_prev <- sort(table(Soybean[,i]), decreasing = TRUE)
  prev_1 <- append(prev_1, sorted_prev[1])
  prev_2 <- append(prev_2, sorted_prev[2])
}

# Add the columns to the DataFrame and calculate the ratio of the frequencies of the two most prevalent values
degeneracy_df$prev_1 <- prev_1
degeneracy_df$prev_2 <- prev_2
degeneracy_df <- degeneracy_df %>% mutate(ratio = prev_1 / prev_2)

# Filter the data for ratios above 20 and print the results
print(degeneracy_df %>% filter(ratio >= 20) %>% arrange(desc(ratio)))
##    col_list unique_values prev_1 prev_2  ratio
## 1  mycelium             3    639      6 106.50
## 2 sclerotia             3    625     20  31.25
## 3 leaf.mild             4    535     20  26.75

The mycelium, sclerotia, and leaf.mild variables are degenerate according to the criteria provided.
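
As a cross-check, caret’s nearZeroVar() applies essentially the same two criteria. A sketch, assuming the caret package is installed (freqCut = 20 mirrors the textbook’s ratio threshold; caret’s own default is 95/5):

library(caret)

# saveMetrics = TRUE returns the frequency ratio, percent of unique values,
# and a near-zero-variance flag for every predictor
nearZeroVar(Soybean[, -1], freqCut = 20, uniqueCut = 10, saveMetrics = TRUE)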

(b)

(c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The three variables previously identified as degenerate (mycelium, sclerotia, and leaf.mild) can be eliminated. From there, I would use imputation to fill in the remaining missing values, but since missing data are concentrated in only 5 of the classes, I would first introduce a new variable that counts the number of missing predictors for each observation. That indicator could prove as informative as the existing predictors. I’d also consider removing the variables with more than 15% of their values missing (shown below) to see whether that improves the model.
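
The missing_df data frame used below is assumed to have been built in part (b); a minimal reconstruction consistent with the printed columns:

# Count and proportion of missing values for each predictor (Class excluded)
missing_df <- data.frame(
  col_list = colnames(Soybean)[-1],
  num_missing_values = colSums(is.na(Soybean[, -1]))
) %>%
  mutate(prob_missing = num_missing_values / nrow(Soybean))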

print(missing_df %>% filter(prob_missing >= 0.15) %>% arrange(desc(prob_missing)))
##           col_list num_missing_values prob_missing
## 1             hail                121    0.1771596
## 2            sever                121    0.1771596
## 3         seed.tmt                121    0.1771596
## 4          lodging                121    0.1771596
## 5             germ                112    0.1639824
## 6        leaf.mild                108    0.1581259
## 7  fruiting.bodies                106    0.1551977
## 8      fruit.spots                106    0.1551977
## 9    seed.discolor                106    0.1551977
## 10      shriveling                106    0.1551977