Homework 4

Author

Naomi Buell

pacman::p_load(
    tidyverse,
    janitor,
    fpp3,
    mlbench,
    AppliedPredictiveModeling,
    skimr,
    e1071,
    caret,
    impute,
    corrplot,
    GGally,
    recipes
)

3.1

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

data(Glass)
str(Glass)
'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Glass |> head()
       RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
Glass |> skim()
Data summary
Name Glass
Number of rows 214
Number of columns 10
_______________________
Column type frequency:
factor 1
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Type 0 1 FALSE 6 2: 76, 1: 70, 7: 29, 3: 17

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
RI 0 1 1.52 0.00 1.51 1.52 1.52 1.52 1.53 ▁▇▂▁▁
Na 0 1 13.41 0.82 10.73 12.91 13.30 13.83 17.38 ▁▇▆▁▁
Mg 0 1 2.68 1.44 0.00 2.11 3.48 3.60 4.49 ▃▁▁▇▅
Al 0 1 1.44 0.50 0.29 1.19 1.36 1.63 3.50 ▂▇▃▁▁
Si 0 1 72.65 0.77 69.81 72.28 72.79 73.09 75.41 ▁▂▇▂▁
K 0 1 0.50 0.65 0.00 0.12 0.56 0.61 6.21 ▇▁▁▁▁
Ca 0 1 8.96 1.42 5.43 8.24 8.60 9.17 16.19 ▁▇▁▁▁
Ba 0 1 0.18 0.50 0.00 0.00 0.00 0.00 3.15 ▇▁▁▁▁
Fe 0 1 0.06 0.10 0.00 0.00 0.00 0.10 0.51 ▇▁▁▁▁
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First, I visualize the relationships and correlations for all pairs of variables with ggpairs().

# GGpairs for all variables
ggpairs(Glass)

This returns a grid of pairwise plots and correlations, giving a general view of the relationships between all variables at once.

To zoom in and get a better sense of the distributions of just the predictors, I create histograms for each.

# Histograms of predictors
predictors <- Glass |> select(-Type)

predictors |>
    pivot_longer(everything(), names_to = "predictor") |>
    ggplot(aes(value)) +
    geom_histogram() +
    facet_wrap(~predictor, scales = "free") +
    labs(
        subtitle = "Histograms show some normal distributions and high occurrences of 0s skewing data.",
        x = "Percentages of elements or refractive index (RI)",
        y = "Frequency"
    ) +
    theme_classic()

  • Al, Ca, Na, RI, and Si look roughly normally distributed, though Al, Ca, Na, and RI are skewed right.

  • Ba, Fe, and K have many 0 values that skew them right. Mg also has a spike at 0, but it is bimodal and skewed left.

Next, I visualize the relationships between predictors with a correlation matrix.

# Find the highest absolute correlations
correlations <- cor(predictors)

high_corr <- correlations |>
    as.data.frame() |>
    rownames_to_column(var = "Var1") |>
    pivot_longer(-Var1, names_to = "Var2", values_to = "corr") |>
    rowwise() |>
    mutate(
        pair = paste(sort(c(Var1, Var2)), collapse = " and "),
        abs_value = abs(corr),
        level = case_when(
            abs_value > 0.75 ~ "highly",
            abs_value > 0.5 ~ "moderately",
            abs_value > 0.25 ~ "slightly",
            TRUE ~ "none"
        )
    ) |>
    # Filter out self-correlations and duplicates
    filter(Var1 != Var2) |>
    distinct(pair, .keep_all = TRUE) |>
    arrange(desc(abs_value)) |>
    select(pair, corr, level)

high_corr |> head()
# A tibble: 6 × 3
# Rowwise: 
  pair        corr level     
  <chr>      <dbl> <chr>     
1 Ca and RI  0.810 highly    
2 RI and Si -0.542 moderately
3 Ba and Mg -0.492 slightly  
4 Al and Mg -0.482 slightly  
5 Al and Ba  0.479 slightly  
6 Ca and Mg -0.444 slightly  
# Correlation matrix to visualize relationships between predictors.
corrplot(correlations, order = "hclust")

From the correlation matrix,

  • Ca and RI are highly correlated with a correlation of 0.81.

  • RI and Si are moderately correlated with a correlation of -0.54.

  • Ba and Mg are slightly correlated with a correlation of -0.49.

  • Al and Mg are slightly correlated with a correlation of -0.48.
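As a follow-up sketch (not part of the original output), caret::findCorrelation() can flag which member of a highly correlated pair a correlation filter would drop at a chosen cutoff; with a cutoff of 0.75, only the Ca/RI pair qualifies.

# Flag predictors a correlation filter would remove at |r| > 0.75
findCorrelation(correlations, cutoff = 0.75, names = TRUE)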

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

# Calculate the sample skewness statistic for each predictor
skewValues <- apply(predictors, 2, skewness) |>
    as.data.frame() |>
    clean_names() |>
    mutate(
        abs_value = abs(apply_predictors_2_skewness),
        level = case_when(
            apply_predictors_2_skewness > 1 ~ "highly right skewed",
            apply_predictors_2_skewness > 0.5 ~ "moderately right skewed",
            apply_predictors_2_skewness < -1 ~ "highly left skewed",
            apply_predictors_2_skewness < -0.5 ~ "moderately left skewed",
            TRUE ~ "not skewed"
        )
    ) |>
    arrange(desc(abs_value)) |>
    rownames_to_column(var = "predictor") |>
    select(predictor, apply_predictors_2_skewness, level)

head(skewValues)
  predictor apply_predictors_2_skewness               level
1         K                    6.460089 highly right skewed
2        Ba                    3.368680 highly right skewed
3        Ca                    2.018446 highly right skewed
4        Fe                    1.729811 highly right skewed
5        RI                    1.602715 highly right skewed
6        Mg                   -1.136452  highly left skewed
# Get highly skewed predictors
high_skew <- skewValues |>
    filter(level %in% c("highly right skewed", "highly left skewed")) |>
    pull(predictor)

  • K is highly right skewed with a sample skewness statistic of 6.46.

  • Ba is highly right skewed with a sample skewness statistic of 3.37.

  • Ca is highly right skewed with a sample skewness statistic of 2.02.

  • Fe is highly right skewed with a sample skewness statistic of 1.73.

  • RI is highly right skewed with a sample skewness statistic of 1.6.

  • Mg is highly left skewed with a sample skewness statistic of -1.14.

  • Al is moderately right skewed with a sample skewness statistic of 0.89.

The histograms and skewness statistics also suggest outliers: K, Ba, Ca, and Fe each have long right tails, with a handful of values far from the bulk of the data (for example, K's maximum of 6.21 is well above its median of 0.56).

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes. I would center and scale the predictors and apply a Box-Cox transformation, which empirically estimates an appropriate power transformation for each strictly positive predictor.

# Center and scale the data, and apply Box-Cox transformation to predictors
transformations <- preProcess(
    Glass,
    method = c("BoxCox", "center", "scale")
)

# Display and get transformation details
transformations
Created from 214 samples and 10 variables

Pre-processing:
  - Box-Cox transformation (5)
  - centered (9)
  - ignored (1)
  - scaled (9)

Lambda estimates for Box-Cox transformation:
-2, -0.1, 0.5, 2, -1.1
box_cox_transformed_predictors <- transformations$method$BoxCox |>
    paste(collapse = ", ")
centered_scaled_predictors <- transformations$method$center |>
    paste(collapse = ", ")

# Apply the transformations to the highly skewed predictors
transformed_glass <- predict(transformations, Glass)

# Now generate histograms of the transformed predictors
transformed_glass |>
    select(-Type) |>
    pivot_longer(everything(), names_to = "predictor") |>
    ggplot(aes(value)) +
    geom_histogram() +
    facet_wrap(~predictor, scales = "free") +
    labs(
        subtitle = "Histograms show some normal distributions and high occurrences of 0s skewing data.",
        x = "Transformed percentages of elements or refractive index (RI)",
        y = "Frequency"
    ) +
    theme_classic()

After applying the Box-Cox transformation, which preProcess() estimated only for RI, Na, Al, Si, and Ca, the histograms show that those predictors are now more normally distributed. All predictors are now centered at 0 and scaled to unit variance.
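Box-Cox can only be estimated for strictly positive predictors, which is why preProcess() skipped the others. A quick check (a sketch using the predictors data frame defined earlier) confirms which predictors contain zeros:

# Predictors with a minimum of 0 cannot be Box-Cox transformed
predictors |>
    summarise(across(everything(), min)) |>
    pivot_longer(everything(), names_to = "predictor", values_to = "min_value") |>
    filter(min_value <= 0)
# Ba, Fe, K, and Mg all have minimums of 0, so preProcess() skips them for Box-Cox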

3.2

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
skim(Soybean)
Data summary
Name Soybean
Number of rows 683
Number of columns 36
_______________________
Column type frequency:
factor 36
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Class 0 1.00 FALSE 19 bro: 92, alt: 91, fro: 91, phy: 88
date 1 1.00 FALSE 7 5: 149, 4: 131, 3: 118, 2: 93
plant.stand 36 0.95 TRUE 2 0: 354, 1: 293
precip 38 0.94 TRUE 3 2: 459, 1: 112, 0: 74
temp 30 0.96 TRUE 3 1: 374, 2: 199, 0: 80
hail 121 0.82 FALSE 2 0: 435, 1: 127
crop.hist 16 0.98 FALSE 4 2: 219, 3: 218, 1: 165, 0: 65
area.dam 1 1.00 FALSE 4 1: 227, 3: 187, 2: 145, 0: 123
sever 121 0.82 FALSE 3 1: 322, 0: 195, 2: 45
seed.tmt 121 0.82 FALSE 3 0: 305, 1: 222, 2: 35
germ 112 0.84 TRUE 3 1: 213, 2: 193, 0: 165
plant.growth 16 0.98 FALSE 2 0: 441, 1: 226
leaves 0 1.00 FALSE 2 1: 606, 0: 77
leaf.halo 84 0.88 FALSE 3 2: 342, 0: 221, 1: 36
leaf.marg 84 0.88 FALSE 3 0: 357, 2: 221, 1: 21
leaf.size 84 0.88 TRUE 3 1: 327, 2: 221, 0: 51
leaf.shread 100 0.85 FALSE 2 0: 487, 1: 96
leaf.malf 84 0.88 FALSE 2 0: 554, 1: 45
leaf.mild 108 0.84 FALSE 3 0: 535, 1: 20, 2: 20
stem 16 0.98 FALSE 2 1: 371, 0: 296
lodging 121 0.82 FALSE 2 0: 520, 1: 42
stem.cankers 38 0.94 FALSE 4 0: 379, 3: 191, 1: 39, 2: 36
canker.lesion 38 0.94 FALSE 4 0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies 106 0.84 FALSE 2 0: 473, 1: 104
ext.decay 38 0.94 FALSE 3 0: 497, 1: 135, 2: 13
mycelium 38 0.94 FALSE 2 0: 639, 1: 6
int.discolor 38 0.94 FALSE 3 0: 581, 1: 44, 2: 20
sclerotia 38 0.94 FALSE 2 0: 625, 1: 20
fruit.pods 84 0.88 FALSE 4 0: 407, 1: 130, 3: 48, 2: 14
fruit.spots 106 0.84 FALSE 4 0: 345, 4: 100, 1: 75, 2: 57
seed 92 0.87 FALSE 2 0: 476, 1: 115
mold.growth 92 0.87 FALSE 2 0: 524, 1: 67
seed.discolor 106 0.84 FALSE 2 0: 513, 1: 64
seed.size 92 0.87 FALSE 2 0: 532, 1: 59
shriveling 106 0.84 FALSE 2 0: 539, 1: 38
roots 31 0.95 FALSE 3 0: 551, 1: 86, 2: 15

From skimming the data, I see that all 35 predictors are factor variables and there is some missingness.
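As a quick sanity check before the questions below (a sketch, not in the original output), the overall amount of missing data can be summarized two ways: the share of missing predictor cells and the share of incomplete samples.

# Overall missingness in the predictors
soy_predictors <- Soybean |> select(-Class)
mean(is.na(soy_predictors)) * 100            # percent of predictor cells that are NA
mean(!complete.cases(soy_predictors)) * 100  # percent of samples with at least one NA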

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

First, I plot all predictors to visualize their frequency distributions.

Soybean |>
    # Get predictors
    select(-Class) |>
    mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
    pivot_longer(everything(), names_to = "predictor") |>
    ggplot(aes(value)) +
    geom_bar() +
    facet_wrap(~predictor, scales = "free") +
    labs(
        subtitle = "Categorical Variables",
        x = "",
        y = "Frequency"
    ) +
    theme_classic()

A few variables appear to have very low variance. For example, nearly all mycelium observations take the same value (639 of the 645 non-missing observations are 0).

Next, I check for degenerate predictors using the caret::nearZeroVar() function. Variables with a single unique value or with near-zero variance are degenerate. Specifically, a predictor is flagged as near-zero variance when both of the following hold (a quick manual check for one predictor follows this list):

  • The fraction of unique values over the sample size is low (here, the uniqueCut threshold is 10%). I store the percentage of unique values out of the total number of observations in the percentUnique column below.

  • The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (here, the freqCut threshold is 20). I store this ratio in the freqRatio column below.
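To make these two statistics concrete, here is a minimal sketch (not part of the original analysis) that computes them by hand for mycelium, the predictor the bar charts suggest is most extreme; nearZeroVar() does the same for every predictor in the next chunk.

# Manual computation of the near-zero-variance statistics for mycelium
mycelium_counts <- sort(table(Soybean$mycelium), decreasing = TRUE)

# Most common level frequency over second most common (639 / 6 = 106.5)
freq_ratio <- mycelium_counts[1] / mycelium_counts[2]

# Unique non-missing values as a percentage of all 683 rows (2 / 683 = 0.29%)
percent_unique <- length(unique(na.omit(Soybean$mycelium))) / nrow(Soybean) * 100

freq_ratio
percent_unique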

# Get near zero variance information
degenerate_predictors <- Soybean |>
    select(-Class) |>
    # Set cutoffs for degenerate predictors
    nearZeroVar(freqCut = 20, uniqueCut = 10, saveMetrics = TRUE) |>
    as.data.frame() |>
    rownames_to_column("predictor") |>
    # Filter for predictors with near zero variance
    filter(nzv == TRUE)

degenerate_predictors
  predictor freqRatio percentUnique zeroVar  nzv
1 leaf.mild     26.75     0.4392387   FALSE TRUE
2  mycelium    106.50     0.2928258   FALSE TRUE
3 sclerotia     31.25     0.2928258   FALSE TRUE

The predictors leaf.mild, mycelium, and sclerotia are degenerate under these cutoffs.

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Calculate missing values by predictor
missing_by_predictor <- Soybean |>
    select(-Class) |>
    mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
    pivot_longer(
        cols = everything(),
        names_to = "predictor",
        values_to = "value"
    ) |>
    group_by(predictor) |>
    summarize(
        n = n(),
        n_missing = sum(is.na(value)),
        percent_missing = n_missing / n * 100,
        ten_perc_miss = percent_missing > 10,
    ) |>
    arrange(desc(percent_missing)) |>
    ungroup()

ten_perc_miss <- missing_by_predictor |>
    filter(ten_perc_miss == TRUE) |>
    pull(predictor) |>
    paste(collapse = ", ")

# Plot missing values by predictor
missing_by_predictor |>
    ggplot(aes(x = reorder(predictor, percent_missing), y = percent_missing)) +
    geom_col() +
    coord_flip() +
    labs(
        title = "Sever, seed.tmt, lodging, and hail have the most missingness.",
        x = "Predictor",
        y = "Percent Missing"
    ) +
    theme_classic()

sever, seed.tmt, lodging, and hail have the highest percentage of missing values (121 of 683 observations, about 17.7% each). The following predictors are each missing more than 10% of observations: hail, lodging, seed.tmt, sever, germ, leaf.mild, fruit.spots, fruiting.bodies, seed.discolor, shriveling, leaf.shread, mold.growth, seed, seed.size, fruit.pods, leaf.halo, leaf.malf, leaf.marg, and leaf.size.

Next, I check if the pattern of missing data is related to the class.

missing_by_class <- Soybean |>
    group_by(Class) |>
    mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
    pivot_longer(-Class, names_to = "predictor") |>
    group_by(Class, predictor) |>
    summarize(
        n_missing = sum(is.na(value)),
        perc_missing = n_missing / n() * 100,
    ) |>
    arrange(desc(n_missing))

# Plot missing values by predictor and class
missing_by_class |>
    ggplot(aes(x = reorder(Class, perc_missing), y = perc_missing, fill = predictor)) +
    geom_col(position = "dodge") +
    coord_flip() +
    labs(
        title = "A few select classes have high rates of missingness.",
        x = "Class",
        y = "Percent Missing"
    ) +
    theme_classic()

missing_classes <- missing_by_class |>
    group_by(Class) |>
    summarize(
        n_missing = sum(n_missing)
    ) |>
    arrange(desc(n_missing)) |>
    filter(n_missing > 0) |>
    pull(Class) |>
    paste(collapse = ", ")

Yes, the pattern of missing data is related to class: only phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury have missing predictor data; all other classes are complete, as the tabulation below confirms.
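To quantify this at the sample level (a sketch, not in the original output), the share of incomplete rows within each class can be tabulated directly; only the five classes named above should show nonzero percentages.

# Share of rows with at least one missing predictor, by class
Soybean |>
    mutate(incomplete = !complete.cases(pick(-Class))) |>  # pick() selects all columns except Class (dplyr >= 1.1.0)
    group_by(Class) |>
    summarize(pct_incomplete = mean(incomplete) * 100) |>
    arrange(desc(pct_incomplete))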

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I will remove the degenerate variables identified above and then impute the remaining missing values using the recipes package. I use step_impute_mode() to replace missing values of each categorical predictor with that predictor's most frequent level (its mode).

# Remove degenerate vars
Soybean_without_degenerate <- Soybean |>
    select(-c(degenerate_predictors$predictor))

# Create the recipe, imputing missing categorical predictor values with their mode
soybean_recipe <- recipe(Class ~ ., data = Soybean_without_degenerate) |> 
    step_impute_mode(all_nominal_predictors())

# Prepare and apply the recipe
prepped_recipe <- prep(soybean_recipe)
imputed_soybean <- bake(prepped_recipe, new_data = Soybean)

# Show number of missings for all variables
imputed_soybean |> skim()
Data summary
Name imputed_soybean
Number of rows 683
Number of columns 33
_______________________
Column type frequency:
factor 33
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
date 0 1 FALSE 7 5: 150, 4: 131, 3: 118, 2: 93
plant.stand 0 1 TRUE 2 0: 390, 1: 293
precip 0 1 TRUE 3 2: 497, 1: 112, 0: 74
temp 0 1 TRUE 3 1: 404, 2: 199, 0: 80
hail 0 1 FALSE 2 0: 556, 1: 127
crop.hist 0 1 FALSE 4 2: 235, 3: 218, 1: 165, 0: 65
area.dam 0 1 FALSE 4 1: 228, 3: 187, 2: 145, 0: 123
sever 0 1 FALSE 3 1: 443, 0: 195, 2: 45
seed.tmt 0 1 FALSE 3 0: 426, 1: 222, 2: 35
germ 0 1 TRUE 3 1: 325, 2: 193, 0: 165
plant.growth 0 1 FALSE 2 0: 457, 1: 226
leaves 0 1 FALSE 2 1: 606, 0: 77
leaf.halo 0 1 FALSE 3 2: 426, 0: 221, 1: 36
leaf.marg 0 1 FALSE 3 0: 441, 2: 221, 1: 21
leaf.size 0 1 TRUE 3 1: 411, 2: 221, 0: 51
leaf.shread 0 1 FALSE 2 0: 587, 1: 96
leaf.malf 0 1 FALSE 2 0: 638, 1: 45
stem 0 1 FALSE 2 1: 387, 0: 296
lodging 0 1 FALSE 2 0: 641, 1: 42
stem.cankers 0 1 FALSE 4 0: 417, 3: 191, 1: 39, 2: 36
canker.lesion 0 1 FALSE 4 0: 358, 2: 177, 1: 83, 3: 65
fruiting.bodies 0 1 FALSE 2 0: 579, 1: 104
ext.decay 0 1 FALSE 3 0: 535, 1: 135, 2: 13
int.discolor 0 1 FALSE 3 0: 619, 1: 44, 2: 20
fruit.pods 0 1 FALSE 4 0: 491, 1: 130, 3: 48, 2: 14
fruit.spots 0 1 FALSE 4 0: 451, 4: 100, 1: 75, 2: 57
seed 0 1 FALSE 2 0: 568, 1: 115
mold.growth 0 1 FALSE 2 0: 616, 1: 67
seed.discolor 0 1 FALSE 2 0: 619, 1: 64
seed.size 0 1 FALSE 2 0: 624, 1: 59
shriveling 0 1 FALSE 2 0: 645, 1: 38
roots 0 1 FALSE 3 0: 582, 1: 86, 2: 15
Class 0 1 FALSE 19 bro: 92, alt: 91, fro: 91, phy: 88
# Plot all (imputed) predictors
imputed_soybean |>
    # Get predictors
    select(-Class) |>
    mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
    pivot_longer(everything(), names_to = "predictor") |>
    ggplot(aes(value)) +
    geom_bar() +
    facet_wrap(~predictor, scales = "free") +
    labs(
        subtitle = "Categorical Variables",
        x = "",
        y = "Frequency"
    ) +
    theme_classic()

# Check if there are any more degenerate predictors after imputation
imputed_soybean |>
    select(-Class) |>
    nearZeroVar(freqCut = 20, uniqueCut = 10, saveMetrics = TRUE) |>
    as.data.frame() |>
    rownames_to_column("predictor") |>
    # Filter for predictors with near zero variance
    filter(nzv == TRUE)
[1] predictor     freqRatio     percentUnique zeroVar       nzv          
<0 rows> (or 0-length row.names)

I imputed the missing values with each predictor's mode, so there are no longer any missing predictor values. After removing the three degenerate predictors and imputing, I also confirmed that mode imputation did not push any remaining predictor into near-zero variance.
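One hedged alternative worth noting (a sketch, not the approach taken above): since the missingness is concentrated in a few classes, it may itself carry signal. recipes::step_unknown() keeps that signal by recoding NA in nominal predictors as an explicit level instead of imputing it away.

# Alternative: encode missingness as its own level rather than imputing the mode
soybean_recipe_alt <- recipe(Class ~ ., data = Soybean_without_degenerate) |>
    step_unknown(all_nominal_predictors(), new_level = "missing")

imputed_soybean_alt <- bake(prep(soybean_recipe_alt), new_data = NULL)

# Former NAs now appear as the level "missing"; no rows or predictors are dropped
imputed_soybean_alt |> count(hail)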