library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories (only six of which appear in this data set). There are nine predictors, including the refractive index and the percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
# This code provided by the textbook
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
The distributions of the individual predictors can be explored with histograms.
ggplot(data = Glass, aes(x = RI)) + geom_histogram(binwidth = 0.0005)
ggplot(data = Glass, aes(x = Na)) + geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Mg)) + geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Al)) + geom_histogram(binwidth = 0.05)
ggplot(data = Glass, aes(x = Si)) + geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = K)) + geom_histogram(binwidth = 0.01)
ggplot(data = Glass, aes(x = Ca)) + geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Ba)) + geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Fe)) + geom_histogram(binwidth = 0.01)
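As a more compact alternative to the nine separate calls above (a sketch of my own, not part of the original analysis), the predictors could be reshaped to long format and drawn as a single faceted plot with free x scales:
# Reshape the nine numeric predictors to long format and draw one histogram per predictor
Glass %>%
  pivot_longer(cols = RI:Fe, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free_x")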
The relationships between the predictors can be explored using scatterplots. To determine the most interesting relationships to investigate, I’ll start by looking for correlation coefficients close to 1 or -1 between pairs of variables.
cor_table <- round(cor(Glass[,1:9]), 2)
print(cor_table)
## RI Na Mg Al Si K Ca Ba Fe
## RI 1.00 -0.19 -0.12 -0.41 -0.54 -0.29 0.81 0.00 0.14
## Na -0.19 1.00 -0.27 0.16 -0.07 -0.27 -0.28 0.33 -0.24
## Mg -0.12 -0.27 1.00 -0.48 -0.17 0.01 -0.44 -0.49 0.08
## Al -0.41 0.16 -0.48 1.00 -0.01 0.33 -0.26 0.48 -0.07
## Si -0.54 -0.07 -0.17 -0.01 1.00 -0.19 -0.21 -0.10 -0.09
## K -0.29 -0.27 0.01 0.33 -0.19 1.00 -0.32 -0.04 -0.01
## Ca 0.81 -0.28 -0.44 -0.26 -0.21 -0.32 1.00 -0.11 0.12
## Ba 0.00 0.33 -0.49 0.48 -0.10 -0.04 -0.11 1.00 -0.06
## Fe 0.14 -0.24 0.08 -0.07 -0.09 -0.01 0.12 -0.06 1.00
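To pull the stronger pairs out of the matrix programmatically rather than by eye (a sketch; the 0.45 cutoff is my own choice, picked to surface the pairs discussed below), the upper triangle can be filtered:
# Keep upper-triangle entries of the correlation matrix whose absolute value is at least 0.45
cor_mat <- cor(Glass[, 1:9])
high_idx <- which(abs(cor_mat) >= 0.45 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high_idx[, "row"]],
           var2 = colnames(cor_mat)[high_idx[, "col"]],
           cor  = round(cor_mat[high_idx], 2))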
The only strong (|cor| > 0.7) correlation in the table is between RI and Ca (0.81). The scatterplot of these variables is shown below.
ggplot(Glass, aes(x = RI, y = Ca)) + geom_point()
The strongest remaining correlations are between RI and Si (-0.54), Mg and Ba (-0.49), Mg and Al (-0.48), and Al and Ba (0.48). It’s interesting that Mg is correlated with both Al and Ba, which are also correlated with one another.
The scatterplots of these relationships are shown below.
ggplot(Glass, aes(x = RI, y = Si)) + geom_point()
ggplot(Glass, aes(x = Al, y = Mg)) + geom_point()
ggplot(Glass, aes(x = Ba, y = Mg)) + geom_point()
ggplot(Glass, aes(x = Al, y = Ba)) + geom_point()
It’s clear from these scatterplots that the stronger correlations among Al, Mg, and Ba are driven by the many 0 values in Mg and Ba rather than by a strong relationship between the non-zero values. On the other hand, the scatterplot of RI and Si does show a noticeable inverse relationship.
RI, Na, Al, and Ca are right-skewed.
Si is left-skewed.
Mg is left-skewed and K is right-skewed, with the additional complication that each of these variables has quite a few 0 values.
Ba and Fe are mostly zeros, with a small number of positive non-zero values.
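These visual impressions can be checked numerically (a sketch assuming the e1071 package; positive skewness means right-skewed, negative means left-skewed):
library(e1071)
# Sample skewness of each numeric predictor
round(apply(Glass[, 1:9], 2, skewness), 2)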
Outliers are visible on the box plots of the data.
ggplot(data = Glass, aes(x = RI)) + geom_boxplot()
ggplot(data = Glass, aes(x = Na)) + geom_boxplot()
ggplot(data = Glass, aes(x = Mg)) + geom_boxplot()
ggplot(data = Glass, aes(x = Al)) + geom_boxplot()
ggplot(data = Glass, aes(x = Si)) + geom_boxplot()
ggplot(data = Glass, aes(x = K)) + geom_boxplot()
ggplot(data = Glass, aes(x = Ca)) + geom_boxplot()
ggplot(data = Glass, aes(x = Ba)) + geom_boxplot()
ggplot(data = Glass, aes(x = Fe)) + geom_boxplot()
The box plots of every variable except Mg show at least some outliers. Some noteworthy takeaways:
It might be appropriate to remove either RI or Ca to account for collinearity. The remaining variables should be centered and scaled: subtract each variable’s mean from its values, then divide by that variable’s standard deviation. After that, I would consider applying a spatial sign transformation to the variables with outliers, especially those with both high and low outliers (RI, Na, Al, Si, and Ca). Alternatively, for the variables that show the most skew, a Box-Cox transformation could be an appropriate step instead.
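A sketch of that preprocessing chain, assuming the caret package (preProcess() applies the methods in its own fixed order: transformation first, then centering and scaling, then spatial sign; Box-Cox lambdas are only estimated for strictly positive predictors, so columns containing zeros pass through that step untransformed):
library(caret)
# Estimate Box-Cox, centering, scaling, and spatial-sign transformations from the predictors
glass_pp <- preProcess(Glass[, 1:9],
                       method = c("BoxCox", "center", "scale", "spatialSign"))
# Apply the transformations
glass_trans <- predict(glass_pp, Glass[, 1:9])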
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
# This code provided by the textbook
library(mlbench)
data(Soybean)
From page 45 in the textbook, the criteria for degenerate predictors are that (1) the fraction of unique values over the sample size is low (say, 10% or less), and (2) the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say, around 20).
# Get the number of observations from the dimensions of the data frame
print(dim(Soybean)[1])
## [1] 683
There are 683 observations in the dataset. Therefore, for a variable to meet the first criterion for a degenerate predictor, it must have 68 or fewer unique values.
# Create a list of predictor columns and an empty vector for the number of unique values in each column
col_list <- colnames(Soybean)
col_list <- col_list[-1]
unique_values <- c()
# Find the number of unique values in each column and append it to the list
for (i in 2:ncol(Soybean)) {
unique_values <- append(unique_values, length(unique(Soybean[,i])))
}
# Create a data frame of the results and print them in descending order of the number of unique values
degeneracy_df <- data.frame(col_list, unique_values)
print(head(degeneracy_df %>% arrange(desc(unique_values))))
## col_list unique_values
## 1 date 8
## 2 crop.hist 5
## 3 area.dam 5
## 4 stem.cankers 5
## 5 canker.lesion 5
## 6 fruit.pods 5
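The same counts can be obtained more compactly (a sketch; unique() counts NA as a distinct value here, matching the loop above):
# Number of distinct values (including NA) per predictor, highest first
sort(sapply(Soybean[, -1], function(x) length(unique(x))), decreasing = TRUE)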
No column has more than 8 unique values, so every column satisfies the first criterion and has the potential to qualify as degenerate. To check the second criterion, it is necessary to determine the frequency of the two most prevalent values of each variable.
# Create empty vectors for the frequency of the two most prevalent values of each variable
prev_1 <- c()
prev_2 <- c()
# Find the frequency of the two most prevalent values for each column and append them to the lists
for (i in 2:ncol(Soybean)) {
sorted_prev <- sort(table(Soybean[,i]), decreasing = TRUE)
prev_1 <- append(prev_1, sorted_prev[1])
prev_2 <- append(prev_2, sorted_prev[2])
}
# Add the columns to the data frame and calculate the ratio of the frequencies of the two most prevalent values
degeneracy_df$prev_1 <- prev_1
degeneracy_df$prev_2 <- prev_2
degeneracy_df <- degeneracy_df %>% mutate(ratio = prev_1 / prev_2)
# Filter the data for ratios of at least 20 and print the results
print(degeneracy_df %>% filter(ratio >= 20) %>% arrange(desc(ratio)))
## col_list unique_values prev_1 prev_2 ratio
## 1 mycelium 3 639 6 106.50
## 2 sclerotia 3 625 20 31.25
## 3 leaf.mild 4 535 20 26.75
The mycelium, sclerotia, and leaf.mild variables are degenerate according to the criteria provided.
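As a cross-check (a sketch assuming the caret package), nearZeroVar() applies essentially the same two criteria; freqCut = 20 mirrors the ratio threshold used above and uniqueCut = 10 is the percent-unique threshold:
library(caret)
# Near-zero-variance metrics per predictor; nzv == TRUE flags the degenerate columns
nzv_metrics <- nearZeroVar(Soybean[, -1], freqCut = 20, uniqueCut = 10, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]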
The three variables that were previously identified as degenerate (mycelium, sclerotia, and leaf.mild) can be eliminated. From there, I would use imputation to fill in the remaining missing values, but since missing data are associated with only 5 of the classes, I would first introduce a new variable representing the number of missing values for each observation. It is possible that this variable would be as important as the pre-existing ones. I’d also consider removing the variables with more than 15% of their values missing (shown below) to see whether that improved the model.
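For completeness, a missing_df with the columns printed below can be built along these lines (a minimal sketch; column names follow the printed output):
# Per-predictor count and proportion of missing values
missing_df <- data.frame(col_list = col_list,
                         num_missing_values = sapply(Soybean[, -1], function(x) sum(is.na(x))),
                         row.names = NULL) %>%
  mutate(prob_missing = num_missing_values / nrow(Soybean))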
print(missing_df %>% filter(prob_missing >= 0.15) %>% arrange(desc(prob_missing)))
## col_list num_missing_values prob_missing
## 1 hail 121 0.1771596
## 2 sever 121 0.1771596
## 3 seed.tmt 121 0.1771596
## 4 lodging 121 0.1771596
## 5 germ 112 0.1639824
## 6 leaf.mild 108 0.1581259
## 7 fruiting.bodies 106 0.1551977
## 8 fruit.spots 106 0.1551977
## 9 seed.discolor 106 0.1551977
## 10 shriveling 106 0.1551977
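A sketch of the first of those steps together with the proposed missing-count indicator (imputation and any further variable removal would follow and are not shown):
soy_reduced <- Soybean
# Add the proposed indicator: number of missing predictor values in each observation
soy_reduced$n_missing <- rowSums(is.na(Soybean[, -1]))
# Drop the three degenerate predictors identified earlier
soy_reduced <- soy_reduced[, !(names(soy_reduced) %in% c("mycelium", "sclerotia", "leaf.mild"))]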