Homework 4

Author

Naomi Buell

pacman::p_load(
    tidyverse,
    janitor,
    fpp3,
    mlbench,
    AppliedPredictiveModeling,
    skimr,
    e1071,
    caret,
    impute,
    corrplot,
    GGally,
    recipes
)

3.1

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

data(Glass)
str(Glass)
'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Glass |> head()
       RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
Glass |> skim()
Data summary
Name Glass
Number of rows 214
Number of columns 10
_______________________
Column type frequency:
factor 1
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Type 0 1 FALSE 6 2: 76, 1: 70, 7: 29, 3: 17

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
RI 0 1 1.52 0.00 1.51 1.52 1.52 1.52 1.53 ▁▇▂▁▁
Na 0 1 13.41 0.82 10.73 12.91 13.30 13.83 17.38 ▁▇▆▁▁
Mg 0 1 2.68 1.44 0.00 2.11 3.48 3.60 4.49 ▃▁▁▇▅
Al 0 1 1.44 0.50 0.29 1.19 1.36 1.63 3.50 ▂▇▃▁▁
Si 0 1 72.65 0.77 69.81 72.28 72.79 73.09 75.41 ▁▂▇▂▁
K 0 1 0.50 0.65 0.00 0.12 0.56 0.61 6.21 ▇▁▁▁▁
Ca 0 1 8.96 1.42 5.43 8.24 8.60 9.17 16.19 ▁▇▁▁▁
Ba 0 1 0.18 0.50 0.00 0.00 0.00 0.00 3.15 ▇▁▁▁▁
Fe 0 1 0.06 0.10 0.00 0.00 0.00 0.10 0.51 ▇▁▁▁▁
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First, I visualize the relationships and correlations for all pairs of variables with ggpairs().

# GGpairs for all variables
ggpairs(Glass)

This returns a grid of pairwise plots and correlations, giving a general view of the relationships between all variables at once.

To zoom in and get a better sense of the distributions of just the predictors, I create histograms for each.

# Histograms of predictors
predictors <- Glass |> select(-Type)

predictors |>
    pivot_longer(everything(), names_to = "predictor") |>
    ggplot(aes(value)) +
    geom_histogram() +
    facet_wrap(~predictor, scales = "free") +
    labs(
        subtitle = "Histograms show some normal distributions and high occurrences of 0s skewing data.",
        x = "Percentages of elements or refractive index (RI)",
        y = "Frequency"
    ) +
    theme_classic()

  • Al, Ca, Na, RI, and Si look roughly normally distributed, though Al, Ca, Na, and RI are skewed right.

  • Ba, Fe, and K have many 0 values that skew them right. Mg also has a spike at 0, but it is bimodal and skewed left.

Next, I visualize the relationships between predictors with a correlation matrix.

# Find the highest absolute correlations
correlations <- cor(predictors)

high_corr <- correlations |>
    as.data.frame() |>
    rownames_to_column(var = "Var1") |>
    pivot_longer(-Var1, names_to = "Var2", values_to = "corr") |>
    rowwise() |>
    mutate(
        pair = paste(sort(c(Var1, Var2)), collapse = " and "),
        abs_value = abs(corr),
        level = case_when(
            abs_value > 0.75 ~ "highly",
            abs_value > 0.5 ~ "moderately",
            abs_value > 0.25 ~ "slightly",
            TRUE ~ "none"
        )
    ) |>
    # Filter out self-correlations and duplicates
    filter(Var1 != Var2) |>
    distinct(pair, .keep_all = TRUE) |>
    arrange(desc(abs_value)) |>
    select(pair, corr, level)

high_corr |> head()
# A tibble: 6 × 3
# Rowwise: 
  pair        corr level     
  <chr>      <dbl> <chr>     
1 Ca and RI  0.810 highly    
2 RI and Si -0.542 moderately
3 Ba and Mg -0.492 slightly  
4 Al and Mg -0.482 slightly  
5 Al and Ba  0.479 slightly  
6 Ca and Mg -0.444 slightly  
# Correlation matrix to visualize relationships between predictors.
corrplot(correlations, order = "hclust")

From the correlation matrix,

  • Ca and RI are highly correlated with a correlation of 0.81.

  • RI and Si are moderately correlated with a correlation of -0.54.

  • Ba and Mg are slightly correlated with a correlation of -0.49.

  • Al and Mg are slightly correlated with a correlation of -0.48.
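As a follow-up sketch (not part of the original output), caret::findCorrelation() can flag which member of a highly correlated pair a correlation filter would drop at a chosen cutoff; with a cutoff of 0.75, only the Ca/RI pair qualifies.

# Flag predictors a correlation filter would remove at |r| > 0.75
findCorrelation(correlations, cutoff = 0.75, names = TRUE)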

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

# Calculate the sample skewness statistic for each predictor
skewValues <- apply(predictors, 2, skewness) |>
    as.data.frame() |>
    clean_names() |>
    mutate(
        abs_value = abs(apply_predictors_2_skewness),
        level = case_when(
            apply_predictors_2_skewness > 1 ~ "highly right skewed",
            apply_predictors_2_skewness > 0.5 ~ "moderately right skewed",
            apply_predictors_2_skewness < -1 ~ "highly left skewed",
            apply_predictors_2_skewness < -0.5 ~ "moderately left skewed",
            TRUE ~ "not skewed"
        )
    ) |>
    arrange(desc(abs_value)) |>
    rownames_to_column(var = "predictor") |>
    select(predictor, apply_predictors_2_skewness, level)

head(skewValues)
  predictor apply_predictors_2_skewness               level
1         K                    6.460089 highly right skewed
2        Ba                    3.368680 highly right skewed
3        Ca                    2.018446 highly right skewed
4        Fe                    1.729811 highly right skewed
5        RI                    1.602715 highly right skewed
6        Mg                   -1.136452  highly left skewed
# Get highly skewed predictors
high_skew <- skewValues |>
    filter(level %in% c("highly right skewed", "highly left skewed")) |>
    pull(predictor)

  • K is highly right skewed with a sample skewness statistic of 6.46.

  • Ba is highly right skewed with a sample skewness statistic of 3.37.

  • Ca is highly right skewed with a sample skewness statistic of 2.02.

  • Fe is highly right skewed with a sample skewness statistic of 1.73.

  • RI is highly right skewed with a sample skewness statistic of 1.6.

  • Mg is highly left skewed with a sample skewness statistic of -1.14.

  • Al is moderately right skewed with a sample skewness statistic of 0.89.

The histograms and skewness statistics also suggest outliers: K, Ba, Ca, and Fe each have long right tails, with a handful of values far from the bulk of the data (for example, K's maximum of 6.21 is well above its median of 0.56).

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes. I would center and scale the predictors and apply a Box-Cox transformation, which empirically estimates an appropriate power transformation for each strictly positive predictor.

# Center and scale the data, and apply Box-Cox transformation to predictors
transformations <- preProcess(
    Glass,
    method = c("BoxCox", "center", "scale")
)

# Display and get transformation details
transformations
Created from 214 samples and 10 variables

Pre-processing:
  - Box-Cox transformation (5)
  - centered (9)
  - ignored (1)
  - scaled (9)

Lambda estimates for Box-Cox transformation:
-2, -0.1, 0.5, 2, -1.1
box_cox_transformed_predictors <- transformations$method$BoxCox |>
    paste(collapse = ", ")
centered_scaled_predictors <- transformations$method$center |>
    paste(collapse = ", ")

# Apply the transformations to the highly skewed predictors
transformed_glass <- predict(transformations, Glass)

# Now generate histograms of the transformed predictors
transformed_glass |>
    select(-Type) |>
    pivot_longer(everything(), names_to = "predictor") |>
    ggplot(aes(value)) +
    geom_histogram() +
    facet_wrap(~predictor, scales = "free") +
    labs(
        subtitle = "Histograms show some normal distributions and high occurrences of 0s skewing data.",
        x = "Transformed percentages of elements or refractive index (RI)",
        y = "Frequency"
    ) +
    theme_classic()

After applying the Box-Cox transformation, which preProcess() estimated only for RI, Na, Al, Si, and Ca, the histograms show that those predictors are now more normally distributed. All predictors are now centered at 0 and scaled to unit variance.
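Box-Cox can only be estimated for strictly positive predictors, which is why preProcess() skipped the others. A quick check (a sketch using the predictors data frame defined earlier) confirms which predictors contain zeros:

# Predictors with a minimum of 0 cannot be Box-Cox transformed
predictors |>
    summarise(across(everything(), min)) |>
    pivot_longer(everything(), names_to = "predictor", values_to = "min_value") |>
    filter(min_value <= 0)
# Ba, Fe, K, and Mg all have minimums of 0, so preProcess() skips them for Box-Cox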

3.2

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
skim(Soybean)
Data summary
Name Soybean
Number of rows 683
Number of columns 36
_______________________
Column type frequency:
factor 36
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Class 0 1.00 FALSE 19 bro: 92, alt: 91, fro: 91, phy: 88
date 1 1.00 FALSE 7 5: 149, 4: 131, 3: 118, 2: 93
plant.stand 36 0.95 TRUE 2 0: 354, 1: 293
precip 38 0.94 TRUE 3 2: 459, 1: 112, 0: 74
temp 30 0.96 TRUE 3 1: 374, 2: 199, 0: 80
hail 121 0.82 FALSE 2 0: 435, 1: 127
crop.hist 16 0.98 FALSE 4 2: 219, 3: 218, 1: 165, 0: 65
area.dam 1 1.00 FALSE 4 1: 227, 3: 187, 2: 145, 0: 123
sever 121 0.82 FALSE 3 1: 322, 0: 195, 2: 45
seed.tmt 121 0.82 FALSE 3 0: 305, 1: 222, 2: 35
germ 112 0.84 TRUE 3 1: 213, 2: 193, 0: 165
plant.growth 16 0.98 FALSE 2 0: 441, 1: 226
leaves 0 1.00 FALSE 2 1: 606, 0: 77
leaf.halo 84 0.88 FALSE 3 2: 342, 0: 221, 1: 36
leaf.marg 84 0.88 FALSE 3 0: 357, 2: 221, 1: 21
leaf.size 84 0.88 TRUE 3 1: 327, 2: 221, 0: 51
leaf.shread 100 0.85 FALSE 2 0: 487, 1: 96
leaf.malf 84 0.88 FALSE 2 0: 554, 1: 45
leaf.mild 108 0.84 FALSE 3 0: 535, 1: 20, 2: 20
stem 16 0.98 FALSE 2 1: 371, 0: 296
lodging 121 0.82 FALSE 2 0: 520, 1: 42
stem.cankers 38 0.94 FALSE 4 0: 379, 3: 191, 1: 39, 2: 36
canker.lesion 38 0.94 FALSE 4 0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies 106 0.84 FALSE 2 0: 473, 1: 104
ext.decay 38 0.94 FALSE 3 0: 497, 1: 135, 2: 13
mycelium 38 0.94 FALSE 2 0: 639, 1: 6
int.discolor 38 0.94 FALSE 3 0: 581, 1: 44, 2: 20
sclerotia 38 0.94 FALSE 2 0: 625, 1: 20
fruit.pods 84 0.88 FALSE 4 0: 407, 1: 130, 3: 48, 2: 14
fruit.spots 106 0.84 FALSE 4 0: 345, 4: 100, 1: 75, 2: 57
seed 92 0.87 FALSE 2 0: 476, 1: 115
mold.growth 92 0.87 FALSE 2 0: 524, 1: 67
seed.discolor 106 0.84 FALSE 2 0: 513, 1: 64
seed.size 92 0.87 FALSE 2 0: 532, 1: 59
shriveling 106 0.84 FALSE 2 0: 539, 1: 38
roots 31 0.95 FALSE 3 0: 551, 1: 86, 2: 15

From skimming the data, I see that all 35 predictors are factor variables and there is some missingness.
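As a quick sanity check before the questions below (a sketch, not in the original output), the overall amount of missing data can be summarized two ways: the share of missing predictor cells and the share of incomplete samples.

# Overall missingness in the predictors
soy_predictors <- Soybean |> select(-Class)
mean(is.na(soy_predictors)) * 100            # percent of predictor cells that are NA
mean(!complete.cases(soy_predictors)) * 100  # percent of samples with at least one NA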

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

First, I plot all predictors to visualize their frequency distributions.

Soybean |>
    # Get predictors
    select(-Class) |>
    mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
    pivot_longer(everything(), names_to = "predictor") |>
    ggplot(aes(value)) +
    geom_bar() +
    facet_wrap(~predictor, scales = "free") +
    labs(
        subtitle = "Categorical Variables",
        x = "",
        y = "Frequency"
    ) +
    theme_classic()

A few variables appear to have very low variance. For example, nearly all mycelium observations take the same value (639 of the 645 non-missing observations are 0).

Next, I check for degenerate predictors using the caret::nearZeroVar() function. Variables with a single unique value or with near-zero variance are degenerate. Specifically, a predictor is flagged as near-zero variance when both of the following hold (a quick manual check for one predictor follows this list):

  • The fraction of unique values over the sample size is low (here, the uniqueCut threshold is 10%). I store the percentage of unique values out of the total number of observations in the percentUnique column below.

  • The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (here, the freqCut threshold is 20). I store this ratio in the freqRatio column below.
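To make these two statistics concrete, here is a minimal sketch (not part of the original analysis) that computes them by hand for mycelium, the predictor the bar charts suggest is most extreme; nearZeroVar() does the same for every predictor in the next chunk.

# Manual computation of the near-zero-variance statistics for mycelium
mycelium_counts <- sort(table(Soybean$mycelium), decreasing = TRUE)

# Most common level frequency over second most common (639 / 6 = 106.5)
freq_ratio <- mycelium_counts[1] / mycelium_counts[2]

# Unique non-missing values as a percentage of all 683 rows (2 / 683 = 0.29%)
percent_unique <- length(unique(na.omit(Soybean$mycelium))) / nrow(Soybean) * 100

freq_ratio
percent_unique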

# Get near zero variance information
degenerate_predictors <- Soybean |>
    select(-Class) |>
    # Set cutoffs for degenerate predictors
    nearZeroVar(freqCut = 20, uniqueCut = 10, saveMetrics = TRUE) |>
    as.data.frame() |>
    rownames_to_column("predictor") |>
    # Filter for predictors with near zero variance
    filter(nzv == TRUE)

degenerate_predictors
  predictor freqRatio percentUnique zeroVar  nzv
1 leaf.mild     26.75     0.4392387   FALSE TRUE
2  mycelium    106.50     0.2928258   FALSE TRUE
3 sclerotia     31.25     0.2928258   FALSE TRUE

The predictors leaf.mild, mycelium, and sclerotia are degenerate under these cutoffs.

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Calculate missing values by predictor
missing_by_predictor <- Soybean |>
    select(-Class) |>
    mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
    pivot_longer(
        cols = everything(),
        names_to = "predictor",
        values_to = "value"
    ) |>
    group_by(predictor) |>
    summarize(
        n = n(),
        n_missing = sum(is.na(value)),
        percent_missing = n_missing / n * 100,
        ten_perc_miss = percent_missing > 10,
    ) |>
    arrange(desc(percent_missing)) |>
    ungroup()

ten_perc_miss <- missing_by_predictor |>
    filter(ten_perc_miss == TRUE) |>
    pull(predictor) |>
    paste(collapse = ", ")

# Plot missing values by predictor
missing_by_predictor |>
    ggplot(aes(x = reorder(predictor, percent_missing), y = percent_missing)) +
    geom_col() +
    coord_flip() +
    labs(
        title = "Sever, seed.tmt, lodging, and hail have the most missingness.",
        x = "Predictor",
        y = "Percent Missing"
    ) +
    theme_classic()

sever, seed.tmt, lodging, and hail have the highest percentage of missing values (121 of 683 observations, about 17.7% each). The following predictors are each missing more than 10% of observations: hail, lodging, seed.tmt, sever, germ, leaf.mild, fruit.spots, fruiting.bodies, seed.discolor, shriveling, leaf.shread, mold.growth, seed, seed.size, fruit.pods, leaf.halo, leaf.malf, leaf.marg, and leaf.size.

Next, I check if the pattern of missing data is related to the class.

missing_by_class <- Soybean |>
    group_by(Class) |>
    mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
    pivot_longer(-Class, names_to = "predictor") |>
    group_by(Class, predictor) |>
    summarize(
        n_missing = sum(is.na(value)),
        perc_missing = n_missing / n() * 100,
    ) |>
    arrange(desc(n_missing))

# Plot missing values by predictor and class
missing_by_class |>
    ggplot(aes(x = reorder(Class, perc_missing), y = perc_missing, fill = predictor)) +
    geom_col(position = "dodge") +
    coord_flip() +
    labs(
        title = "A few select classes have high rates of missingness.",
        x = "Class",
        y = "Percent Missing"
    ) +
    theme_classic()

missing_classes <- missing_by_class |>
    group_by(Class) |>
    summarize(
        n_missing = sum(n_missing)
    ) |>
    arrange(desc(n_missing)) |>
    filter(n_missing > 0) |>
    pull(Class) |>
    paste(collapse = ", ")

Yes, the pattern of missing data is related to class: only phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury have missing predictor data; all other classes are complete, as the tabulation below confirms.
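To quantify this at the sample level (a sketch, not in the original output), the share of incomplete rows within each class can be tabulated directly; only the five classes named above should show nonzero percentages.

# Share of rows with at least one missing predictor, by class
Soybean |>
    mutate(incomplete = !complete.cases(pick(-Class))) |>  # pick() selects all columns except Class (dplyr >= 1.1.0)
    group_by(Class) |>
    summarize(pct_incomplete = mean(incomplete) * 100) |>
    arrange(desc(pct_incomplete))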

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I will remove the degenerate variables identified above and then impute the remaining missing values using the recipes package. I use step_impute_mode() to replace missing values of each categorical predictor with that predictor's most frequent level (its mode).

# Remove degenerate vars
Soybean_without_degenerate <- Soybean |>
    select(-c(degenerate_predictors$predictor))

# Create the recipe, imputing missing categorical predictor values with their mode
soybean_recipe <- recipe(Class ~ ., data = Soybean_without_degenerate) |> 
    step_impute_mode(all_nominal_predictors())

# Prepare and apply the recipe
prepped_recipe <- prep(soybean_recipe)
imputed_soybean <- bake(prepped_recipe, new_data = Soybean)

# Show number of missings for all variables
imputed_soybean |> skim()
Data summary
Name imputed_soybean
Number of rows 683
Number of columns 33
_______________________
Column type frequency:
factor 33
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
date 0 1 FALSE 7 5: 150, 4: 131, 3: 118, 2: 93
plant.stand 0 1 TRUE 2 0: 390, 1: 293
precip 0 1 TRUE 3 2: 497, 1: 112, 0: 74
temp 0 1 TRUE 3 1: 404, 2: 199, 0: 80
hail 0 1 FALSE 2 0: 556, 1: 127
crop.hist 0 1 FALSE 4 2: 235, 3: 218, 1: 165, 0: 65
area.dam 0 1 FALSE 4 1: 228, 3: 187, 2: 145, 0: 123
sever 0 1 FALSE 3 1: 443, 0: 195, 2: 45
seed.tmt 0 1 FALSE 3 0: 426, 1: 222, 2: 35
germ 0 1 TRUE 3 1: 325, 2: 193, 0: 165
plant.growth 0 1 FALSE 2 0: 457, 1: 226
leaves 0 1 FALSE 2 1: 606, 0: 77
leaf.halo 0 1 FALSE 3 2: 426, 0: 221, 1: 36
leaf.marg 0 1 FALSE 3 0: 441, 2: 221, 1: 21
leaf.size 0 1 TRUE 3 1: 411, 2: 221, 0: 51
leaf.shread 0 1 FALSE 2 0: 587, 1: 96
leaf.malf 0 1 FALSE 2 0: 638, 1: 45
stem 0 1 FALSE 2 1: 387, 0: 296
lodging 0 1 FALSE 2 0: 641, 1: 42
stem.cankers 0 1 FALSE 4 0: 417, 3: 191, 1: 39, 2: 36
canker.lesion 0 1 FALSE 4 0: 358, 2: 177, 1: 83, 3: 65
fruiting.bodies 0 1 FALSE 2 0: 579, 1: 104
ext.decay 0 1 FALSE 3 0: 535, 1: 135, 2: 13
int.discolor 0 1 FALSE 3 0: 619, 1: 44, 2: 20
fruit.pods 0 1 FALSE 4 0: 491, 1: 130, 3: 48, 2: 14
fruit.spots 0 1 FALSE 4 0: 451, 4: 100, 1: 75, 2: 57
seed 0 1 FALSE 2 0: 568, 1: 115
mold.growth 0 1 FALSE 2 0: 616, 1: 67
seed.discolor 0 1 FALSE 2 0: 619, 1: 64
seed.size 0 1 FALSE 2 0: 624, 1: 59
shriveling 0 1 FALSE 2 0: 645, 1: 38
roots 0 1 FALSE 3 0: 582, 1: 86, 2: 15
Class 0 1 FALSE 19 bro: 92, alt: 91, fro: 91, phy: 88
# Plot all (imputed) predictors
imputed_soybean |>
    # Get predictors
    select(-Class) |>
    mutate(across(everything(), ~ as.character(.) |> as.factor())) |>
    pivot_longer(everything(), names_to = "predictor") |>
    ggplot(aes(value)) +
    geom_bar() +
    facet_wrap(~predictor, scales = "free") +
    labs(
        subtitle = "Categorical Variables",
        x = "",
        y = "Frequency"
    ) +
    theme_classic()

# Check if there are any more degenerate predictors after imputation
imputed_soybean |>
    select(-Class) |>
    nearZeroVar(freqCut = 20, uniqueCut = 10, saveMetrics = TRUE) |>
    as.data.frame() |>
    rownames_to_column("predictor") |>
    # Filter for predictors with near zero variance
    filter(nzv == TRUE)
[1] predictor     freqRatio     percentUnique zeroVar       nzv          
<0 rows> (or 0-length row.names)

I imputed the missing values with each predictor's mode, so there are no longer any missing predictor values. After removing the three degenerate predictors and imputing, I also confirmed that mode imputation did not push any remaining predictor into near-zero variance.
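One hedged alternative worth noting (a sketch, not the approach taken above): since the missingness is concentrated in a few classes, it may itself carry signal. recipes::step_unknown() keeps that signal by recoding NA in nominal predictors as an explicit level instead of imputing it away.

# Alternative: encode missingness as its own level rather than imputing the mode
soybean_recipe_alt <- recipe(Class ~ ., data = Soybean_without_degenerate) |>
    step_unknown(all_nominal_predictors(), new_level = "missing")

imputed_soybean_alt <- bake(prep(soybean_recipe_alt), new_data = NULL)

# Former NAs now appear as the level "missing"; no rows or predictors are dropped
imputed_soybean_alt |> count(hail)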