Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling.
3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Let's load the necessary libraries to work through the exercise.
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(ggplot2)
hist(Glass$RI, main="Histogram of Refractive Index", xlab="Refractive Index")
The histogram shows that the RI variable is approximately normally distributed, with a slight right skew. Most values are concentrated between 1.51 and 1.53, with a few outliers on the higher end.
Now, I will create histograms for all predictor variables to examine their distributions.
# Draw a histogram of each of the nine predictors in a 3 x 3 grid
par(mfrow=c(3,3))
for(i in 1:9) {
  hist(Glass[,i], main=names(Glass)[i], xlab=names(Glass)[i])
}
par(mfrow=c(1,1))
In the histograms, I observe that RI, Na, and Si appear to be approximately normally distributed. Al, K, Ca, Ba, and Fe show right skewness, with several outliers present in these variables. Mg looks bimodal, with one cluster of values near 3.5 and a second at zero.
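To put numbers on these impressions, we can compute each predictor's sample skewness with the skewness() function from the e1071 package (positive values indicate right skew):

library(e1071)
sapply(Glass[, 1:9], skewness)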
To address the final part of question (a), I will create a pairs plot to visualize the relationships between the predictors.
pairs(Glass[,1:9], main="Pairs Plot of Glass Predictors", pch=21, bg=Glass$Type)
The pairs plot reveals several relationships between the predictors. For example, RI and Ca show a strong positive correlation, and there are mild negative correlations between Si and Ca and between Ca and Ba. Some predictors, such as Mg and K, do not show strong relationships with the others.
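We can back up these visual impressions with the correlation matrix of the predictors:

round(cor(Glass[, 1:9]), 2)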
boxplot(Glass[,1:9], main="Boxplots of Glass Predictors", las=2)
The boxplots indicate that several predictors have outliers. For instance, Na, Al, K, Ca, Ba, and Fe all show points that fall beyond the whiskers, indicating potential outliers. Some predictors also exhibit skewness; for example, Mg is left skewed, pulled down by its cluster of zero values.
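As a rough quantitative check, we can count how many points each boxplot flags beyond its whiskers (boxplot.stats() applies the usual 1.5 x IQR rule):

sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))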
Yes. The goal of transforming a predictor is to make its distribution more symmetric, reduce skewness, or stabilize variance. Common transformations include the log, square root, and Box-Cox transformations. For example, transforming right-skewed predictors like K, Ca, Ba, and Fe could help normalize their distributions and reduce the impact of outliers. Note, however, that K, Ba, and Fe contain exact zeros, so a plain log transformation is not directly applicable; an offset such as log(x + 1), or the Yeo-Johnson extension of Box-Cox, handles zeros gracefully.
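A sketch of this idea with caret (assuming the caret and e1071 packages are installed); the Yeo-Johnson variant is used because several predictors contain zeros:

library(caret)
# Estimate a Yeo-Johnson transformation (plus centering and scaling) per predictor
trans <- preProcess(Glass[, 1:9], method = c("YeoJohnson", "center", "scale"))
Glass_trans <- predict(trans, Glass[, 1:9])
# Skewness after transformation should be much closer to zero
sapply(Glass_trans, e1071::skewness)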
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spot, mold growth). The outcome labels consist of 19 distinct classes.
Let's start by loading the data and examining its structure.
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
We can plot barplots for each categorical predictor to visualize their frequency distributions. Note: a degenerate distribution would have one category that dominates the distribution, while other categories have very few observations.
library(ggplot2)
par(mfrow=c(3,3))
# Column 1 is the Class outcome, so the 35 predictors are columns 2:36
for(i in 2:36) {
  barplot(table(Soybean[,i]), main=names(Soybean)[i], xlab=names(Soybean)[i])
}
par(mfrow=c(1,1))
From the barplots, we can see that some predictors have degenerate distributions. The predictors “leaf.mild”, “lodging”, “mycelium”, and “fruiting.bodies” each have one category that dominates the distribution, while the other categories have very few observations. This could cause problems in modeling, since the model may not learn effectively from categories with very few instances.
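We can check this more formally with the nearZeroVar() function from caret, which flags predictors whose most frequent value dominates the rest (a sketch, assuming caret is installed):

library(caret)
# Indices of near-zero-variance predictors (column 1, Class, is excluded)
nzv <- nearZeroVar(Soybean[, -1])
names(Soybean[, -1])[nzv]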
We can calculate the percentage of missing values for each predictor and visualize the pattern of missing data.
missing_percent <- colSums(is.na(Soybean)) / nrow(Soybean) * 100
#print(missing_percent)
barplot(missing_percent, main="Percentage of Missing Values by Predictor", las=2, ylab="Percentage Missing")
high_missing <- missing_percent[missing_percent > 15]
#print(high_missing)
barplot(high_missing, main="Predictors with More Than 15% Missing Values", las=2, ylab="Percentage Missing")
abline(h=15, col="red", lty=2)
From the bar plot, we can see that several predictors have a high percentage of missing values, with some exceeding 15%; these include hail, sever, seed.tmt, germ, leaf.mild, lodging, fruiting.bodies, fruit.spots, seed.discolor, and shriveling. To see whether the missingness is related to the classes, we can create a heatmap of missing-value rates by predictor and class. This will show whether certain classes have more missing data than others.
library(dplyr)
library(tidyr)

missing_by_class <- Soybean |>
  group_by(Class) |>
  summarise(across(everything(), ~ mean(is.na(.)))) |>
  pivot_longer(-Class, names_to = "predictor", values_to = "missing_rate")
ggplot(missing_by_class, aes(x = predictor, y = Class, fill = missing_rate)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Missing Data Rates by Predictor and Class",
       x = "Predictor", y = "Class", fill = "Missing rate") +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
In the heatmap, we can see that the missing values are not spread evenly across classes: they are concentrated in a handful of classes, such as 2-4-d-injury, cyst-nematode, and herbicide-injury, where predictors like “hail” and “seed.tmt” are missing for nearly every sample. This suggests the pattern of missing data is related to the classes (informative missingness) rather than occurring at random.
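We can make this concrete by computing the share of incomplete rows within each class using base R:

incomplete_by_class <- tapply(!complete.cases(Soybean), Soybean$Class, mean)
sort(incomplete_by_class, decreasing = TRUE)

Classes where this share is near 1 account for essentially all of the missing data.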
Given this pattern, one simple strategy is to drop the ten predictors with more than 15% missing values:
Soybean_cleaned <- Soybean |>
  select(-c(hail, sever, seed.tmt, germ, leaf.mild, lodging,
            fruiting.bodies, fruit.spots, seed.discolor, shriveling))
str(Soybean_cleaned)
## 'data.frame': 683 obs. of 26 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion: Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
For the missing values that remain, two simple strategies are mode imputation (replacing each NA with the predictor's most frequent level) and making the missingness explicit as its own level. Here I use the forcats::fct_explicit_na() function, which replaces NA values with a new level “(Missing)”; note that recent versions of forcats supersede this function with fct_na_value_to_level().
df_explicit <- Soybean_cleaned |>
mutate(across(where(is.factor), ~ forcats::fct_explicit_na(.)))
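For comparison, here is a minimal mode-imputation sketch; impute_mode() is a helper written for this exercise, not a library function:

impute_mode <- function(x) {
  # table() drops NAs by default, so which.max() picks the modal observed level
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
Soybean_imputed <- Soybean_cleaned |>
  mutate(across(where(is.factor), impute_mode))

Making NA an explicit level preserves the information that a value was missing, whereas mode imputation discards it; given the class-related missingness seen above, the explicit-level approach seems safer here.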