Objective:

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .pdf for your run code.

3.1

3.1. The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via: > library(mlbench) > data(Glass) > str(Glass) ‘data.frame’: 214 obs. of 10 variables: $ RI : num 1.52 1.52 1.52 1.52 1.52 … $ Na : num 13.6 13.9 13.5 13.2 13.3 … $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 … $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 … $ Si : num 71.8 72.7 73 72.6 73.1 … $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 … $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 … $ Ba : num 0 0 0 0 0 0 0 0 0 0 … $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 … $ Type: Factor w/ 6 levels “1”,“2”,“3”,“5”,..: 1 1 1 1 1 1 1 1 1 1 …

(a)Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.4.3
data(Glass)

# Transforming Glass to long
glass_long <- Glass |>
  pivot_longer(-Type, names_to = "var", values_to = "value")

# Histogram for glass predictors
ggplot(glass_long, aes(value)) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~ var, scales = "free") +
  labs(title = "Glass predictors")

# Distribution by class
ggplot(glass_long, aes(value, Type)) +
  geom_density_ridges(scale = .9, rel_min_height = .01) +
  facet_wrap(~ var, scales = "free_x") +
  labs(title = "Distribution by Class Densities")
## Picking joint bandwidth of 0.161
## Picking joint bandwidth of 0.234
## Picking joint bandwidth of 0.298
## Picking joint bandwidth of 0.131
## Picking joint bandwidth of 0.186
## Picking joint bandwidth of 0.33
## Picking joint bandwidth of 0.191
## Picking joint bandwidth of 0.000739
## Picking joint bandwidth of 0.293

GGally::ggpairs(Glass, columns = 1:9, aes(color = Type), progress = FALSE)
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero

Glass |> # Heatmap
  select(-Type) |>
  cor() |>
  as.data.frame() |>
  rownames_to_column("x") |>
  pivot_longer(-x, names_to = "y", values_to = "r") |>
  ggplot(aes(x, y, fill = r)) +
  geom_tile() + scale_fill_gradient2(limits = c(-1,1)) +
  coord_equal() + theme(axis.text.x = element_text(angle=45, hjust=1))

(b) Do there appear to be any outliers in the data? Are any predictors skewed? According to the results of question a, especially within the histograms and the pairs/scatter matrix, you can see clear outliers and predictors skewed such as Ba, Fe, and K being right-skewed with Ba and Fe having big whole number values with mostly zeros, and K having very few values. While Ca, Mg, and Na are slightly skewed towards the right with Rl being slightly left skewed.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes there are relevant transformations of one or more predictors based on the initial long transformation for the histograms and class conditional densities, and all thanks to these improved classification models we can tell that several predictors are mostly right skewed or zero inflated.

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via: > library(mlbench) > data(Soybean) > ## See ?Soybean for details

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)
library(dplyr)
library(tidyr)

cat_df <- Soybean %>%
  select(-Class) %>%
  select(where(is.factor)) %>%
  mutate(across(everything(), as.character))
# Long transformation
freq <- cat_df %>%
  pivot_longer(everything(), names_to = "var", values_to = "level") %>%
  count(var, level, name = "n") %>%
  group_by(var) %>%
  mutate(prop = n/sum(n))
# Use profile for each predictor
profile <- freq %>%
  summarise(
    n_levels   = n_distinct(level),
    top_level  = level[which.max(prop)],
    top_prop   = max(prop),
    min_n      = min(n),
    n_rare     = sum(n < 5),
    single_lvl = n_levels == 1,
    near_zero  = top_prop >= 0.95
  ) %>%
  ungroup() %>%
  arrange(desc(single_lvl), desc(near_zero), desc(n_rare))
degenerate <- profile %>% filter(single_lvl | near_zero | n_rare > 0)


profile
degenerate
profile %>% arrange(desc(top_prop)) %>% slice_head(n = 10)
ggplot(freq, aes(level, prop)) +
  geom_col() +
  facet_wrap(~ var, scales = "free_x") +
  labs(title = "Frequency distributions of categorical predictors",
       x = "Level", y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

miss_by_var <- Soybean |>
  select(-Class) |>
  summarise(across(everything(), ~mean(is.na(.x)))) |>
  pivot_longer(everything(), names_to="variable", values_to="prop_missing") |>
  arrange(desc(prop_missing))
miss_by_var %>% print(n=Inf)
## # A tibble: 35 × 2
##    variable        prop_missing
##    <chr>                  <dbl>
##  1 hail                 0.177  
##  2 sever                0.177  
##  3 seed.tmt             0.177  
##  4 lodging              0.177  
##  5 germ                 0.164  
##  6 leaf.mild            0.158  
##  7 fruiting.bodies      0.155  
##  8 fruit.spots          0.155  
##  9 seed.discolor        0.155  
## 10 shriveling           0.155  
## 11 leaf.shread          0.146  
## 12 seed                 0.135  
## 13 mold.growth          0.135  
## 14 seed.size            0.135  
## 15 leaf.halo            0.123  
## 16 leaf.marg            0.123  
## 17 leaf.size            0.123  
## 18 leaf.malf            0.123  
## 19 fruit.pods           0.123  
## 20 precip               0.0556 
## 21 stem.cankers         0.0556 
## 22 canker.lesion        0.0556 
## 23 ext.decay            0.0556 
## 24 mycelium             0.0556 
## 25 int.discolor         0.0556 
## 26 sclerotia            0.0556 
## 27 plant.stand          0.0527 
## 28 roots                0.0454 
## 29 temp                 0.0439 
## 30 crop.hist            0.0234 
## 31 plant.growth         0.0234 
## 32 stem                 0.0234 
## 33 date                 0.00146
## 34 area.dam             0.00146
## 35 leaves               0
# Missingness: Barplot
ggplot(miss_by_var, aes(reorder(variable, prop_missing), prop_missing)) +
  geom_col() + coord_flip() +
  labs(x = "", y = "Proportion missing", title = "Missingness by predictor")

# Need to pre-coerce to character
sb_char <- Soybean %>%
  mutate(across(-Class, ~ na_if(as.character(.x), "?")))
# Missingness: Heatmap
miss_by_class <- sb_char %>%
  pivot_longer(-Class, names_to="variable", values_to="value") %>%
  mutate(miss = is.na(value)) %>%
  group_by(Class, variable) %>%
  summarise(prop_missing = mean(miss), .groups="drop")

ggplot(miss_by_class, aes(variable, Class, fill = prop_missing)) +
  geom_tile() +
  scale_fill_gradient(limits = c(0, 1), name = "Missing %") +
  labs(x = "", y = "Class", title = "Missingness per class & predictor") +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1))

chi_results <- sb_char %>%
  pivot_longer(-Class, names_to="variable", values_to="value") %>%
  mutate(miss = factor(is.na(value), levels=c(FALSE, TRUE))) %>%
  group_by(variable) %>%
  summarise(
    p_value = {
      tab <- table(miss, Class)
      suppressWarnings(chisq.test(tab, simulate.p.value = TRUE, B = 5000)$p.value)
    },
    .groups="drop"
  ) %>%
  mutate(p_adj = p.adjust(p_value, method="BH")) %>%
  arrange(p_adj)

chi_results %>% print(n=Inf)
## # A tibble: 35 × 3
##    variable           p_value      p_adj
##    <chr>                <dbl>      <dbl>
##  1 canker.lesion     0.000200   0.000212
##  2 crop.hist         0.000200   0.000212
##  3 ext.decay         0.000200   0.000212
##  4 fruit.pods        0.000200   0.000212
##  5 fruit.spots       0.000200   0.000212
##  6 fruiting.bodies   0.000200   0.000212
##  7 germ              0.000200   0.000212
##  8 hail              0.000200   0.000212
##  9 int.discolor      0.000200   0.000212
## 10 leaf.halo         0.000200   0.000212
## 11 leaf.malf         0.000200   0.000212
## 12 leaf.marg         0.000200   0.000212
## 13 leaf.mild         0.000200   0.000212
## 14 leaf.shread       0.000200   0.000212
## 15 leaf.size         0.000200   0.000212
## 16 lodging           0.000200   0.000212
## 17 mold.growth       0.000200   0.000212
## 18 mycelium          0.000200   0.000212
## 19 plant.growth      0.000200   0.000212
## 20 plant.stand       0.000200   0.000212
## 21 precip            0.000200   0.000212
## 22 roots             0.000200   0.000212
## 23 sclerotia         0.000200   0.000212
## 24 seed              0.000200   0.000212
## 25 seed.discolor     0.000200   0.000212
## 26 seed.size         0.000200   0.000212
## 27 seed.tmt          0.000200   0.000212
## 28 sever             0.000200   0.000212
## 29 shriveling        0.000200   0.000212
## 30 stem              0.000200   0.000212
## 31 stem.cankers      0.000200   0.000212
## 32 temp              0.000200   0.000212
## 33 date              0.0784     0.0808  
## 34 area.dam          0.0814     0.0814  
## 35 leaves          NaN        NaN
chi_results %>% filter(p_adj < 0.05)

There are particular predictors that are more likely to be missing such as sever, seed.tmt, lodging, hail, and germ according to the EDA compared to most. The pattern as the heatmap and chi-squared tests indicate missingness varies by class (many p_adj < 0.05), so it’s class related rather than random.

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation. By using recipes library, my plan standardizes missing values, drops predictors with too many NAs, and then imputes the rest using statistics used for training only (mode for categorical, median for numeric). This produces a clean usable dataset as a result.

library(recipes)
## Warning: package 'recipes' was built under R version 4.4.3
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:stringr':
## 
##     fixed
## The following object is masked from 'package:stats':
## 
##     step
rec <- recipe(Class ~ ., data = Soybean) %>%
  step_mutate(across(-Class, ~ na_if(as.character(.x), "?"))) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_impute_median(all_numeric_predictors())