pacman::p_load(
tidyverse,
janitor,
fpp3,
mlbench,
AppliedPredictiveModeling,
skimr,
e1071,
caret,
impute,
corrplot,
GGally,
recipes
)
Homework 4
3.1
3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
'data.frame': 214 obs. of 10 variables:
$ RI : num 1.52 1.52 1.52 1.52 1.52 ...
$ Na : num 13.6 13.9 13.5 13.2 13.3 ...
$ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
$ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
$ Si : num 71.8 72.7 73 72.6 73.1 ...
$ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
$ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
$ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
$ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
$ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Glass |> head()
  RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
Glass |> skim()

| Name | Glass |
| Number of rows | 214 |
| Number of columns | 10 |
| Column type frequency: | |
| factor | 1 |
| numeric | 9 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Type | 0 | 1 | FALSE | 6 | 2: 76, 1: 70, 7: 29, 3: 17 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| RI | 0 | 1 | 1.52 | 0.00 | 1.51 | 1.52 | 1.52 | 1.52 | 1.53 | ▁▇▂▁▁ |
| Na | 0 | 1 | 13.41 | 0.82 | 10.73 | 12.91 | 13.30 | 13.83 | 17.38 | ▁▇▆▁▁ |
| Mg | 0 | 1 | 2.68 | 1.44 | 0.00 | 2.11 | 3.48 | 3.60 | 4.49 | ▃▁▁▇▅ |
| Al | 0 | 1 | 1.44 | 0.50 | 0.29 | 1.19 | 1.36 | 1.63 | 3.50 | ▂▇▃▁▁ |
| Si | 0 | 1 | 72.65 | 0.77 | 69.81 | 72.28 | 72.79 | 73.09 | 75.41 | ▁▂▇▂▁ |
| K | 0 | 1 | 0.50 | 0.65 | 0.00 | 0.12 | 0.56 | 0.61 | 6.21 | ▇▁▁▁▁ |
| Ca | 0 | 1 | 8.96 | 1.42 | 5.43 | 8.24 | 8.60 | 9.17 | 16.19 | ▁▇▁▁▁ |
| Ba | 0 | 1 | 0.18 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 | 3.15 | ▇▁▁▁▁ |
| Fe | 0 | 1 | 0.06 | 0.10 | 0.00 | 0.00 | 0.00 | 0.10 | 0.51 | ▇▁▁▁▁ |
- Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
First, I use ggpairs() to visualize the distributions and pairwise correlations for all combinations of variables.
# GGpairs for all variables
ggpairs(Glass)
This single call returns a grid of pairwise plots and correlations, giving a general view of the relationships among all variables at once.
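As a hedged side note (not part of the output above), the same ggpairs() call can color each panel by Type to surface class-specific structure; the columns = 1:9 index assumes the predictor order shown by str().
# Optional variant: color each panel by glass Type to look for class structure
ggpairs(
  Glass,
  mapping = aes(colour = Type),
  columns = 1:9 # the nine predictors, per the str() output above
)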
To zoom in and get a better sense of the distributions of just the predictors, I create histograms for each.
# Histograms of predictors
predictors <- Glass |> select(-Type)
predictors |>
pivot_longer(everything(), names_to = "predictor") |>
ggplot(aes(value)) +
geom_histogram() +
facet_wrap(~predictor, scales = "free") +
labs(
subtitle = "Histograms show some normal distributions and high occurrences of 0s skewing data.",
x = "Percentages of elements or refractive index (RI)",
y = "Frequency"
) +
theme_classic()
Al, Ca, Na, RI, and Si look roughly normally distributed, though Al, Ca, Na, and RI are skewed right. Ba, Fe, K, and Mg have many 0 values that skew their distributions right, and Mg is bimodal.
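As a quick, hedged check of the zero-inflation observation (not in the original output), I can count exact zeros per predictor using the predictors object defined above.
# Count exact-zero values per predictor to quantify the spikes at 0
predictors |>
  pivot_longer(everything(), names_to = "predictor") |>
  group_by(predictor) |>
  summarize(n_zero = sum(value == 0), prop_zero = mean(value == 0)) |>
  arrange(desc(prop_zero))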
Next, I visualize the relationships between predictors with a correlation matrix.
# Find the highest absolute correlations
correlations <- cor(predictors)
high_corr <- correlations |>
as.data.frame() |>
rownames_to_column(var = "Var1") |>
pivot_longer(-Var1, names_to = "Var2", values_to = "corr") |>
rowwise() |>
mutate(
pair = paste(sort(c(Var1, Var2)), collapse = " and "),
abs_value = abs(corr),
level = case_when(
abs_value > 0.75 ~ "highly",
abs_value > 0.5 ~ "moderately",
abs_value > 0.25 ~ "slightly",
TRUE ~ "none"
)
) |>
# Filter out self-correlations and duplicates
filter(Var1 != Var2) |>
distinct(pair, .keep_all = TRUE) |>
arrange(desc(abs_value)) |>
select(pair, corr, level)
high_corr |> head()
# A tibble: 6 × 3
# Rowwise:
pair corr level
<chr> <dbl> <chr>
1 Ca and RI 0.810 highly
2 RI and Si -0.542 moderately
3 Ba and Mg -0.492 slightly
4 Al and Mg -0.482 slightly
5 Al and Ba 0.479 slightly
6 Ca and Mg -0.444 slightly
# Correlation matrix to visualize relationships between predictors.
corrplot(correlations, order = "hclust")
From the correlation matrix:
Ca and RI are highly correlated with a correlation of 0.81.
RI and Si are moderately correlated with a correlation of -0.54.
Ba and Mg are slightly correlated with a correlation of -0.49.
Al and Mg are slightly correlated with a correlation of -0.48.
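A related caret utility (a hedged aside, not used above) is findCorrelation(), which recommends predictors to drop when pairwise correlations exceed a cutoff; high_corr_index is an illustrative name.
# Flag predictors whose pairwise correlation exceeds 0.75
high_corr_index <- findCorrelation(correlations, cutoff = 0.75)
colnames(predictors)[high_corr_index] # likely Ca or RI, given their 0.81 correlation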
- Do there appear to be any outliers in the data? Are any predictors skewed?
# Calculate the sample skewness statistic for each predictor
skewValues <- apply(predictors, 2, skewness) |>
as.data.frame() |>
clean_names() |>
mutate(
abs_value = abs(apply_predictors_2_skewness),
level = case_when(
apply_predictors_2_skewness > 1 ~ "highly right skewed",
apply_predictors_2_skewness > 0.5 ~ "moderately right skewed",
apply_predictors_2_skewness < -1 ~ "highly left skewed",
apply_predictors_2_skewness < -0.5 ~ "moderately left skewed",
TRUE ~ "not skewed"
)
) |>
arrange(desc(abs_value)) |>
rownames_to_column(var = "predictor") |>
select(predictor, apply_predictors_2_skewness, level)
head(skewValues)
  predictor apply_predictors_2_skewness level
1 K 6.460089 highly right skewed
2 Ba 3.368680 highly right skewed
3 Ca 2.018446 highly right skewed
4 Fe 1.729811 highly right skewed
5 RI 1.602715 highly right skewed
6 Mg -1.136452 highly left skewed
# Get highly skewed predictors
high_skew <- skewValues |>
filter(level %in% c("highly right skewed", "highly left skewed")) |>
pull(predictor)
K (6.46), Ba (3.37), Ca (2.02), Fe (1.73), and RI (1.6) are highly right skewed; Mg is highly left skewed (-1.14); Al is moderately right skewed (0.89).
As for outliers, the heavy tails behind these skewness values point to outlying points in several predictors. For example, in the skim() summaries above, K has a maximum of 6.21 against a 75th percentile of 0.61, and Ba and Ca have maxima far above their upper quartiles.
- Are there any relevant transformations of one or more predictors that might improve the classification model?
Yes. I would center and scale the predictors and apply a Box-Cox transformation, which empirically estimates an appropriate power transformation for each strictly positive predictor.
# Center and scale the data, and apply Box-Cox transformation to predictors
transformations <- preProcess(
Glass,
method = c("BoxCox", "center", "scale")
)
# Display and get transformation details
transformations
Created from 214 samples and 10 variables
Pre-processing:
- Box-Cox transformation (5)
- centered (9)
- ignored (1)
- scaled (9)
Lambda estimates for Box-Cox transformation:
-2, -0.1, 0.5, 2, -1.1
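For reference, these lambdas index the Box-Cox family x* = (x^lambda - 1) / lambda, with log(x) when lambda = 0. The sketch below (my own box_cox() helper, not a caret function) applies one lambda by hand; pairing the last lambda (-1.1) with Ca follows the reported variable order and is an assumption, so treat it as illustrative.
# Hand-rolled Box-Cox transform, matching the family used by preProcess()
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
# Assumed pairing: Ca gets lambda = -1.1 (last in the reported order)
skewness(Glass$Ca)                # raw skewness, about 2.02
skewness(box_cox(Glass$Ca, -1.1)) # should be much closer to 0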
# Store comma-separated lists of the Box-Cox-transformed and centered/scaled
# predictor names for reference in the text below
box_cox_transformed_predictors <- transformations$method$BoxCox |>
paste(collapse = ", ")
centered_scaled_predictors <- transformations$method$center |>
paste(collapse = ", ")
# Apply the transformations to the highly skewed predictors
transformed_glass <- predict(transformations, Glass)
# Now generate histograms of the transformed predictors
transformed_glass |>
select(-Type) |>
pivot_longer(everything(), names_to = "predictor") |>
ggplot(aes(value)) +
geom_histogram() +
facet_wrap(~predictor, scales = "free") +
labs(
subtitle = "Histograms show some normal distributions and high occurrences of 0s skewing data.",
x = "Transformed percentages of elements or refractive index (RI)",
y = "Frequency"
) +
theme_classic()
After the Box-Cox transformation (applied to RI, Na, Al, Si, and Ca), the histograms show distributions that are closer to normal, and all numeric predictors are now centered at 0 and scaled to unit variance.
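As an aside, the same preprocessing can be sketched with the recipes package (used in 3.2 below); this is a hedged, roughly equivalent alternative rather than the workflow actually used here, and glass_recipe / glass_transformed are my own names. Like preProcess(), step_BoxCox() skips predictors that are not strictly positive.
# recipes-based version of the Box-Cox / center / scale preprocessing
glass_recipe <- recipe(Type ~ ., data = Glass) |>
  step_BoxCox(all_numeric_predictors()) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())
glass_transformed <- glass_recipe |>
  prep() |>
  bake(new_data = NULL)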
3.2
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
skim(Soybean)

| Name | Soybean |
| Number of rows | 683 |
| Number of columns | 36 |
| Column type frequency: | |
| factor | 36 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Class | 0 | 1.00 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
| date | 1 | 1.00 | FALSE | 7 | 5: 149, 4: 131, 3: 118, 2: 93 |
| plant.stand | 36 | 0.95 | TRUE | 2 | 0: 354, 1: 293 |
| precip | 38 | 0.94 | TRUE | 3 | 2: 459, 1: 112, 0: 74 |
| temp | 30 | 0.96 | TRUE | 3 | 1: 374, 2: 199, 0: 80 |
| hail | 121 | 0.82 | FALSE | 2 | 0: 435, 1: 127 |
| crop.hist | 16 | 0.98 | FALSE | 4 | 2: 219, 3: 218, 1: 165, 0: 65 |
| area.dam | 1 | 1.00 | FALSE | 4 | 1: 227, 3: 187, 2: 145, 0: 123 |
| sever | 121 | 0.82 | FALSE | 3 | 1: 322, 0: 195, 2: 45 |
| seed.tmt | 121 | 0.82 | FALSE | 3 | 0: 305, 1: 222, 2: 35 |
| germ | 112 | 0.84 | TRUE | 3 | 1: 213, 2: 193, 0: 165 |
| plant.growth | 16 | 0.98 | FALSE | 2 | 0: 441, 1: 226 |
| leaves | 0 | 1.00 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 84 | 0.88 | FALSE | 3 | 2: 342, 0: 221, 1: 36 |
| leaf.marg | 84 | 0.88 | FALSE | 3 | 0: 357, 2: 221, 1: 21 |
| leaf.size | 84 | 0.88 | TRUE | 3 | 1: 327, 2: 221, 0: 51 |
| leaf.shread | 100 | 0.85 | FALSE | 2 | 0: 487, 1: 96 |
| leaf.malf | 84 | 0.88 | FALSE | 2 | 0: 554, 1: 45 |
| leaf.mild | 108 | 0.84 | FALSE | 3 | 0: 535, 1: 20, 2: 20 |
| stem | 16 | 0.98 | FALSE | 2 | 1: 371, 0: 296 |
| lodging | 121 | 0.82 | FALSE | 2 | 0: 520, 1: 42 |
| stem.cankers | 38 | 0.94 | FALSE | 4 | 0: 379, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 38 | 0.94 | FALSE | 4 | 0: 320, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 106 | 0.84 | FALSE | 2 | 0: 473, 1: 104 |
| ext.decay | 38 | 0.94 | FALSE | 3 | 0: 497, 1: 135, 2: 13 |
| mycelium | 38 | 0.94 | FALSE | 2 | 0: 639, 1: 6 |
| int.discolor | 38 | 0.94 | FALSE | 3 | 0: 581, 1: 44, 2: 20 |
| sclerotia | 38 | 0.94 | FALSE | 2 | 0: 625, 1: 20 |
| fruit.pods | 84 | 0.88 | FALSE | 4 | 0: 407, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 106 | 0.84 | FALSE | 4 | 0: 345, 4: 100, 1: 75, 2: 57 |
| seed | 92 | 0.87 | FALSE | 2 | 0: 476, 1: 115 |
| mold.growth | 92 | 0.87 | FALSE | 2 | 0: 524, 1: 67 |
| seed.discolor | 106 | 0.84 | FALSE | 2 | 0: 513, 1: 64 |
| seed.size | 92 | 0.87 | FALSE | 2 | 0: 532, 1: 59 |
| shriveling | 106 | 0.84 | FALSE | 2 | 0: 539, 1: 38 |
| roots | 31 | 0.95 | FALSE | 3 | 0: 551, 1: 86, 2: 15 |
From skimming the data, I see that all 35 predictors are factor variables and there is some missingness.
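A quick, hedged sanity check of the overall missingness (my reading of the "roughly 18%" figure quoted in the next question is the share of incomplete rows):
# Share of rows with at least one missing value (roughly 18%)
mean(!complete.cases(Soybean))
# Share of individual predictor cells that are missing, for comparison
Soybean |>
  select(-Class) |>
  is.na() |>
  mean()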
- Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
First, I plot all predictors to visualize their frequency distributions.
Soybean |>
# Get predictors
select(-Class) |>
mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
pivot_longer(everything(), names_to = "predictor") |>
ggplot(aes(value)) +
geom_bar() +
facet_wrap(~predictor, scales = "free") +
labs(
subtitle = "Categorical Variables",
x = "",
y = "Frequency"
) +
theme_classic()
A few variables appear to have very low variance. For example, nearly all mycelium observations take the same value.
Next, I check for degenerate predictors using the caret::nearZeroVar() function. Variables with a single unique value or with near-zero variance are degenerate. Specifically, a predictor is flagged as degenerate if both of the following hold:
- The fraction of unique values over the sample size is low (say, under 10%). I store this percentage in the percentUnique column below.
- The ratio of the frequency of the most prevalent value to that of the second most prevalent value is large (say, around 20 or more). I store this ratio in the freqRatio column below (illustrated by hand for mycelium just after this list).
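To make the two criteria concrete, here is a small by-hand check for mycelium (a hedged sketch using the counts visible in the skim output above, 639 zeros versus 6 ones; myc_counts is just a temporary name):
# freqRatio: count of the most common level divided by the second most common
myc_counts <- sort(table(Soybean$mycelium), decreasing = TRUE)
myc_counts[1] / myc_counts[2] # 639 / 6 = 106.5, far above the cutoff of 20
# percentUnique: distinct non-missing values as a percentage of the sample size
n_distinct(Soybean$mycelium, na.rm = TRUE) / nrow(Soybean) * 100 # well under 10%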
# Get near zero variance information
degenerate_predictors <- Soybean |>
select(-Class) |>
# Set cutoffs for degenerate predictors
nearZeroVar(freqCut = 20, uniqueCut = 10, saveMetrics = TRUE) |>
as.data.frame() |>
rownames_to_column("predictor") |>
# Filter for predictors with near zero variance
filter(nzv == TRUE)
degenerate_predictors
  predictor freqRatio percentUnique zeroVar nzv
1 leaf.mild 26.75 0.4392387 FALSE TRUE
2 mycelium 106.50 0.2928258 FALSE TRUE
3 sclerotia 31.25 0.2928258 FALSE TRUE
The predictors leaf.mild, mycelium, and sclerotia are degenerate.
- Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Calculate missing values by predictor
missing_by_predictor <- Soybean |>
select(-Class) |>
mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
pivot_longer(
cols = everything(),
names_to = "predictor",
values_to = "value"
) |>
group_by(predictor) |>
summarize(
n = n(),
n_missing = sum(is.na(value)),
percent_missing = n_missing / n * 100,
ten_perc_miss = percent_missing > 10,
) |>
arrange(desc(percent_missing)) |>
ungroup()
ten_perc_miss <- missing_by_predictor |>
filter(ten_perc_miss == TRUE) |>
pull(predictor) |>
paste(collapse = ", ")
# Plot missing values by predictor
missing_by_predictor |>
ggplot(aes(x = reorder(predictor, percent_missing), y = percent_missing)) +
geom_col() +
coord_flip() +
labs(
title = "Sever, seed.tmt, lodging, and hail have the most missingness.",
x = "Predictor",
y = "Percent Missing"
) +
theme_classic()
sever, seed.tmt, lodging, and hail have the highest percentage of missing values, each missing in nearly 18% of samples. In total, 19 predictors (hail, lodging, seed.tmt, sever, germ, leaf.mild, fruit.spots, fruiting.bodies, seed.discolor, shriveling, leaf.shread, mold.growth, seed, seed.size, fruit.pods, leaf.halo, leaf.malf, leaf.marg, and leaf.size) are missing in over 10% of observations.
Next, I check if the pattern of missing data is related to the class.
missing_by_class <- Soybean |>
group_by(Class) |>
mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
pivot_longer(-Class, names_to = "predictor") |>
group_by(Class, predictor) |>
summarize(
n_missing = sum(is.na(value)),
perc_missing = n_missing / n() * 100,
) |>
arrange(desc(n_missing))
# Plot missing values by predictor and class
missing_by_class |>
ggplot(aes(x = reorder(Class, perc_missing), y = perc_missing, fill = predictor)) +
geom_col(position = "dodge") +
coord_flip() +
labs(
title = "A few select classes have high rates of missingness.",
x = "Class",
y = "Percent Missing"
) +
theme_classic()
missing_classes <- missing_by_class |>
group_by(Class) |>
summarize(
n_missing = sum(n_missing)
) |>
arrange(desc(n_missing)) |>
filter(n_missing > 0) |>
pull(Class) |>
paste(collapse = ", ")Yes, it appears that there is a pattern of missingness related to class. phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury have missing predictor data, and other classes don’t.
- Develop a strategy for handling missing data, either by eliminating predictors or imputation.
I will remove the degenerate variables and then impute the remaining missing values using the recipes package. step_impute_mode() replaces each categorical predictor's missing values with that predictor's mode (its most frequent level).
# Remove degenerate vars
Soybean_without_degenerate <- Soybean |>
select(-c(degenerate_predictors$predictor))
# Create the recipe, imputing missing categorical predictor values with their mode
soybean_recipe <- recipe(Class ~ ., data = Soybean_without_degenerate) |>
step_impute_mode(all_nominal_predictors())
# Prepare and apply the recipe
prepped_recipe <- prep(soybean_recipe)
imputed_soybean <- bake(prepped_recipe, new_data = Soybean_without_degenerate)
# Show number of missings for all variables
imputed_soybean |> skim()| Name | imputed_soybean |
| Number of rows | 683 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| factor | 33 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| date | 0 | 1 | FALSE | 7 | 5: 150, 4: 131, 3: 118, 2: 93 |
| plant.stand | 0 | 1 | TRUE | 2 | 0: 390, 1: 293 |
| precip | 0 | 1 | TRUE | 3 | 2: 497, 1: 112, 0: 74 |
| temp | 0 | 1 | TRUE | 3 | 1: 404, 2: 199, 0: 80 |
| hail | 0 | 1 | FALSE | 2 | 0: 556, 1: 127 |
| crop.hist | 0 | 1 | FALSE | 4 | 2: 235, 3: 218, 1: 165, 0: 65 |
| area.dam | 0 | 1 | FALSE | 4 | 1: 228, 3: 187, 2: 145, 0: 123 |
| sever | 0 | 1 | FALSE | 3 | 1: 443, 0: 195, 2: 45 |
| seed.tmt | 0 | 1 | FALSE | 3 | 0: 426, 1: 222, 2: 35 |
| germ | 0 | 1 | TRUE | 3 | 1: 325, 2: 193, 0: 165 |
| plant.growth | 0 | 1 | FALSE | 2 | 0: 457, 1: 226 |
| leaves | 0 | 1 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 0 | 1 | FALSE | 3 | 2: 426, 0: 221, 1: 36 |
| leaf.marg | 0 | 1 | FALSE | 3 | 0: 441, 2: 221, 1: 21 |
| leaf.size | 0 | 1 | TRUE | 3 | 1: 411, 2: 221, 0: 51 |
| leaf.shread | 0 | 1 | FALSE | 2 | 0: 587, 1: 96 |
| leaf.malf | 0 | 1 | FALSE | 2 | 0: 638, 1: 45 |
| stem | 0 | 1 | FALSE | 2 | 1: 387, 0: 296 |
| lodging | 0 | 1 | FALSE | 2 | 0: 641, 1: 42 |
| stem.cankers | 0 | 1 | FALSE | 4 | 0: 417, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 0 | 1 | FALSE | 4 | 0: 358, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 0 | 1 | FALSE | 2 | 0: 579, 1: 104 |
| ext.decay | 0 | 1 | FALSE | 3 | 0: 535, 1: 135, 2: 13 |
| int.discolor | 0 | 1 | FALSE | 3 | 0: 619, 1: 44, 2: 20 |
| fruit.pods | 0 | 1 | FALSE | 4 | 0: 491, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 0 | 1 | FALSE | 4 | 0: 451, 4: 100, 1: 75, 2: 57 |
| seed | 0 | 1 | FALSE | 2 | 0: 568, 1: 115 |
| mold.growth | 0 | 1 | FALSE | 2 | 0: 616, 1: 67 |
| seed.discolor | 0 | 1 | FALSE | 2 | 0: 619, 1: 64 |
| seed.size | 0 | 1 | FALSE | 2 | 0: 624, 1: 59 |
| shriveling | 0 | 1 | FALSE | 2 | 0: 645, 1: 38 |
| roots | 0 | 1 | FALSE | 3 | 0: 582, 1: 86, 2: 15 |
| Class | 0 | 1 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
# Plot all (imputed) predictors
imputed_soybean |>
# Get predictors
select(-Class) |>
mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
pivot_longer(everything(), names_to = "predictor") |>
ggplot(aes(value)) +
geom_bar() +
facet_wrap(~predictor, scales = "free") +
labs(
subtitle = "Categorical Variables",
x = "",
y = "Frequency"
) +
theme_classic()
# Check if there are any more degenerate predictors after imputation
imputed_soybean |>
select(-Class) |>
nearZeroVar(freqCut = 20, uniqueCut = 10, saveMetrics = TRUE) |>
as.data.frame() |>
rownames_to_column("predictor") |>
# Filter for predictors with near zero variance
filter(nzv == TRUE)
[1] predictor freqRatio percentUnique zeroVar nzv
<0 rows> (or 0-length row.names)
I imputed the missing values with the mode of each predictor, so no missing predictor data remain. After imputation, I also confirmed that none of the remaining predictors have become degenerate (mode imputation did not push any predictor into near-zero variance).
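If mode imputation proves too crude (it ignores relationships between predictors and samples), one possible refinement is nearest-neighbor imputation. This is a hedged sketch, and soybean_knn_recipe / imputed_soybean_knn are illustrative names, assuming step_impute_knn() handles these nominal predictors as expected.
# Alternative strategy: impute each missing value from its 5 nearest neighbors
soybean_knn_recipe <- recipe(Class ~ ., data = Soybean_without_degenerate) |>
  step_impute_knn(all_nominal_predictors(), neighbors = 5)
imputed_soybean_knn <- soybean_knn_recipe |>
  prep() |>
  bake(new_data = NULL)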