Data 624 Homework 4

library(fpp3)
library(tidyverse)
library(ggrepel)
library(seasonal)
library(fabletools)
library(mlbench)
library(corrplot)
library(AppliedPredictiveModeling)
library(GGally)
library(patchwork)
library(caret)
library(e1071)
theme_set(theme_bw())

3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# Correlation matrix of the nine numeric predictors (columns 1-9; Type is excluded)
corr_glass = cor(Glass[1:9])
corrplot(corr_glass, order = "hclust")

Glass |>
  select(-Type) |>
  ggpairs()

Looking at the ggpairs plot generated above, the predictor variables have a variety of distributions. These are discussed in more detail under the question on outliers and skewness below.

With regard to relationships between predictors, there appears to be a strong positive relationship between RI and Ca. There also appears to be a positive relationship between Ba and Al, and negative relationships between the following pairs: RI and Al, RI and Si, Al and Mg, Ca and Mg, and Ba and Mg. These pairs were identified using a cutoff on the correlation coefficient: greater than 0.4 for positive relationships, or less than -0.4 for negative ones.
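As a rough cross-check on the pairs read off the plot, the strong correlations can also be listed programmatically. The sketch below is not part of the original analysis: the names idx and strong_pairs are my own, and the 0.4 cutoff simply mirrors the threshold described above.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the nine numeric predictors
corr_glass = cor(Glass[1:9])

# Positions in the upper triangle (each pair once) where |r| exceeds 0.4
idx = which(upper.tri(corr_glass) & abs(corr_glass) > 0.4, arr.ind = TRUE)

# One row per strongly correlated pair
strong_pairs = data.frame(
  var1 = rownames(corr_glass)[idx[, "row"]],
  var2 = colnames(corr_glass)[idx[, "col"]],
  r    = round(corr_glass[idx], 2)
)

# Strongest relationships first
strong_pairs[order(-abs(strong_pairs$r)), ]
```

Using only the upper triangle avoids listing each pair twice and skips the diagonal of 1s.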

Do there appear to be any outliers in the data? Are any predictors skewed?

ri_box = Glass |>
  ggplot(aes(RI)) +
  geom_boxplot() +
  scale_y_continuous(labels = NULL) +
  theme(axis.ticks.x = element_blank()) +
  coord_flip()

na_box = Glass |>
  ggplot(aes(Na)) +
  geom_boxplot() +
  scale_y_continuous(labels = NULL) +
  theme(axis.ticks.x = element_blank()) +
  coord_flip()

mg_box = Glass |>
  ggplot(aes(Mg)) +
  geom_boxplot() +
  scale_y_continuous(labels = NULL) +
  theme(axis.ticks.x = element_blank()) +
  coord_flip()

al_box = Glass |>
  ggplot(aes(Al)) +
  geom_boxplot() +
  scale_y_continuous(labels = NULL) +
  theme(axis.ticks.x = element_blank()) +
  coord_flip()

si_box = Glass |>
  ggplot(aes(Si)) +
  geom_boxplot() +
  scale_y_continuous(labels = NULL) +
  theme(axis.ticks.x = element_blank()) +
  coord_flip()

k_box = Glass |>
  ggplot(aes(K)) +
  geom_boxplot() +
  scale_y_continuous(labels = NULL) +
  theme(axis.ticks.x = element_blank()) +
  coord_flip()

ca_box = Glass |>
  ggplot(aes(Ca)) +
  geom_boxplot() +
  scale_y_continuous(labels = NULL) +
  theme(axis.ticks.x = element_blank()) +
  coord_flip()

ba_box = Glass |>
  ggplot(aes(Ba)) +
  geom_boxplot() +
  scale_y_continuous(labels = NULL) +
  theme(axis.ticks.x = element_blank()) +
  coord_flip()

fe_box = Glass |>
  ggplot(aes(Fe)) +
  geom_boxplot() +
  scale_y_continuous(labels = NULL) +
  theme(axis.ticks.x = element_blank()) +
  coord_flip()

(ri_box + na_box + mg_box) / (al_box + si_box + k_box) / (ca_box + ba_box + fe_box)

Based on the boxplots above, all predictor variables except Mg appear to have outliers. The predictor variables with the most extreme outliers appear to be K, Ba, and Fe.

Looking at the ggpairs plot generated above, the following predictor variables have nearly normal distributions: Na, Al, Si, and Ca. The following predictor variables appear to have a right-skewed distribution: RI, K, Ba, and Fe. The remaining predictor variable, Mg, appears to have a left-skewed distribution.
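The visual impressions of skewness can be checked against a sample skewness statistic. The sketch below is an addition rather than part of the original analysis. It uses skewness() from the e1071 package, which is already loaded at the top of this document, and skew_vals is a name I chose for illustration. Values near zero suggest a roughly symmetric distribution, positive values suggest right skew, and negative values suggest left skew.

```r
library(mlbench)
library(e1071)
data(Glass)

# Sample skewness of each of the nine numeric predictors
skew_vals = sapply(Glass[1:9], skewness)

# Most right-skewed predictors first
round(sort(skew_vals, decreasing = TRUE), 2)
```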

Are there any relevant transformations of one or more predictors that might improve the classification model?

Glass |>
  select(-Type) |>
  summarize(
    across(RI:Fe, \(x) BoxCoxTrans(x)$lambda)
  )

Based on the output above, which gives the estimated lambda value for a Box-Cox transformation of each predictor, I would do the following:

Take the inverse square of RI (1/RI^2)
Take the log of Na
Take the square root of Al
Take the square of Si
Take the inverse of Ca

The remaining variables (Mg, K, Ba, and Fe) would be left untransformed. The Box-Cox transformation requires strictly positive values, and the str() output above already shows zeros in Ba and Fe.
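A minimal sketch of how these transformations could be applied is shown below. It is an illustration, not the original analysis. It uses simple power forms rather than the scaled Box-Cox formula, the rounded lambdas in the comment are assumptions implied by the list above, and glass_trans is a hypothetical name. Note that the two inverse transformations reverse the ordering of the values.

```r
library(mlbench)
data(Glass)

# Simple power versions of the suggested transformations
# (assumed rounded lambdas: RI = -2, Na = 0, Al = 0.5, Si = 2, Ca = -1)
glass_trans = transform(
  Glass,
  RI = 1 / RI^2,   # inverse square
  Na = log(Na),    # log
  Al = sqrt(Al),   # square root
  Si = Si^2,       # square
  Ca = 1 / Ca      # inverse
)

# Mg, K, Ba, Fe, and Type are left unchanged
summary(glass_trans[c("RI", "Na", "Al", "Si", "Ca")])
```

An alternative would be to let caret estimate and apply the transformations in one step with preProcess(method = "BoxCox"), which should skip predictors that are not strictly positive.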

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
## See ?Soybean for details

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Before looking into the frequency distributions, I wanted to ensure that all of the factor levels listed in ?Soybean were present in each variable. Below is the structure of the Soybean data frame, which includes the number of levels for each factor.

str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

When comparing this to the levels listed in ?Soybean, all of the counts matched except for the predictor variable fruit.spots. The output above shows it having only 4 levels, whereas the documentation lists 5 levels for fruit.spots. To account for this, the Soybean data frame was modified below to include the missing level ("3").

Soybean = Soybean |>
  mutate(fruit.spots = fct_expand(fruit.spots, "3", after = 3))

str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 5 levels "0","1","2","3",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

# Draw a bar chart of level counts for every predictor except date
for (v in colnames(Soybean |> select(-Class, -date))){
  print(Soybean |>
          ggplot(aes(x = .data[[v]])) + geom_bar() +
          scale_x_discrete(drop = FALSE) + labs(x = v))
}

Based on the bar charts above, no predictor variable has a strictly degenerate distribution. That is, no predictor takes only one value across all observations, which would result in zero variance. Some variables do have a large imbalance in counts, however. The table below shows the predictor variables with the greatest frequency ratios, defined here as the ratio of the count of the most frequent level to the count of the least frequent level.

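# Empty data frame to collect one row of ratios per predictor
soybean_df = data.frame(variable = character(), factor_ratio = numeric(), frequency_ratio = numeric())

for (v in colnames(Soybean |> select(-Class, -date))){
  # Count observations per level, keeping levels that never occur
  soybean_count = Soybean |>
    count(Soybean[,v], .drop = FALSE)

  colnames(soybean_count)[1] = v

  # Fraction of distinct values relative to the total number of observations
  soybean_fact_ratio = n_distinct(soybean_count[,v]) / sum(soybean_count$n)
  # Highest and lowest level counts, ignoring the row of missing values
  soybean_max = max(soybean_count[!is.na(soybean_count[1]),]$n)
  soybean_min = min(soybean_count[!is.na(soybean_count[1]),]$n)
  soybean_ratio = soybean_max / soybean_min

  soybean_df = soybean_df |>
    add_row(variable = v, factor_ratio = soybean_fact_ratio, frequency_ratio = soybean_ratio)
}

# Predictors with the most imbalanced level counts first
soybean_df |> arrange(desc(frequency_ratio))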
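For comparison, the caret package (already loaded above) provides nearZeroVar(), which uses a slightly different definition: its freqRatio is the count of the most common value divided by the count of the second most common value, rather than the max-over-min ratio computed here. The sketch below is an addition for illustration. The names soy_env, soy, and nzv_metrics are my own, and the data are loaded into a separate environment so the modified Soybean data frame above is not overwritten.

```r
library(caret)

# Load a fresh copy of the data without touching the modified Soybean object
soy_env = new.env()
data(Soybean, package = "mlbench", envir = soy_env)
soy = soy_env$Soybean

# Per-predictor metrics: freqRatio, percentUnique, zeroVar, nzv
# (drop the first column, Class, so only the 35 predictors are checked)
nzv_metrics = nearZeroVar(soy[, -1], saveMetrics = TRUE)

# Predictors flagged as near-zero variance under caret's default cutoffs
nzv_metrics[nzv_metrics$nzv, ]
```

With the default cutoffs, a predictor is flagged when its frequency ratio is above 95/5 and its percentage of unique values is below 10.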

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

na_count = Soybean |>
  summarize(across(everything(), \(x) sum(is.na(x)))) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "count") |>
  arrange(desc(count))

na_count

Based on the table above, the predictors hail, sever, seed.tmt, and lodging have the highest number of data points missing at 121 each, followed by germ at 112.

Soybean |>
  group_by(Class, hail) |>
  count() |>
  filter(is.na(hail)) |>
  arrange(desc(n))

Soybean |>
  group_by(Class, sever) |>
  count() |>
  filter(is.na(sever)) |>
  arrange(desc(n))

Soybean |>
  group_by(Class, seed.tmt) |>
  count() |>
  filter(is.na(seed.tmt)) |>
  arrange(desc(n))

Soybean |>
  group_by(Class, lodging) |>
  count() |>
  filter(is.na(lodging)) |>
  arrange(desc(n))

Soybean |>
  group_by(Class, germ) |>
  count() |>
  filter(is.na(germ)) |>
  arrange(desc(n))

Looking at the counts above, the same 5 classes appear for the missing data: phytophthora-rot, 2-4-d-injury, cyst-nematode, herbicide-injury, and diaporthe-pod-&-stem-blight, suggesting there’s a pattern between missing data and these 5 classes.
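The per-predictor counts above can be condensed into a single view of missingness by class. The sketch below is one possible way to do that and is not part of the original analysis. The names soy_env, soy, has_na, and missing_by_class are my own, and a fresh copy of the data is loaded into a separate environment so the modified Soybean data frame is left untouched.

```r
# Load a fresh copy of the data without touching the modified Soybean object
soy_env = new.env()
data(Soybean, package = "mlbench", envir = soy_env)
soy = soy_env$Soybean

# TRUE for every row with at least one missing predictor value
predictors = soy[, setdiff(names(soy), "Class")]
has_na = !complete.cases(predictors)

# Share of rows with any missing value, by class
missing_by_class = tapply(has_na, soy$Class, mean)

# Only the classes that have missing data, largest share first
sort(missing_by_class[missing_by_class > 0], decreasing = TRUE)
```

A share close to 1 for a class would mean that nearly every observation in that class is incomplete, which would point to missingness driven by the class itself rather than occurring at random.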

Develop a strategy for handling missing data, either by eliminating predictors or imputation

The first thing I would do with regard to handling missing data is look at predictors whose levels have a large discrepancy in frequency. The table below is filtered to frequency ratios of at least 20. This, combined with the fraction of unique values to total observations being less than 0.1 (10%) for all variables, suggests that these predictor variables should be removed from the model.

soybean_df |> 
  filter(frequency_ratio >= 20) |>
  arrange(desc(frequency_ratio))

After removing the above predictor variables, the correlations between the remaining predictor variables were analyzed to see if there are any relationships that could be used for imputation. The code below calculates the correlation across all predictor variables after the removal of the above predictors.

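# Names of the predictors whose frequency ratio is at least 20
high_freq_ratio = soybean_df |>
  filter(frequency_ratio >= 20) |>
  arrange(desc(frequency_ratio)) |>
  pull(variable)

# Missing-value counts for the variables that remain
na_count |>
  filter(!(variable %in% high_freq_ratio))

# Correlations between the remaining predictors, using factor level codes as numbers
soybean_cor = Soybean |>
  select(-all_of(high_freq_ratio)) |>
  select(-Class, -date) |>
  mutate(across(everything(), as.numeric)) |>
  cor(use = "na.or.complete") |>
  round(2)

# Blank out weak correlations (|r| <= 0.5) so only the strong ones remain visible
soybean_cor[abs(soybean_cor) <= 0.5] = NA
soybean_cor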

##                 plant.stand precip temp hail crop.hist area.dam sever seed.tmt
## plant.stand               1     NA   NA   NA        NA       NA    NA       NA
## precip                   NA      1   NA   NA        NA       NA    NA       NA
## temp                     NA     NA    1   NA        NA       NA    NA       NA
## hail                     NA     NA   NA    1        NA       NA    NA       NA
## crop.hist                NA     NA   NA   NA         1       NA    NA       NA
## area.dam                 NA     NA   NA   NA        NA        1    NA       NA
## sever                    NA     NA   NA   NA        NA       NA     1       NA
## seed.tmt                 NA     NA   NA   NA        NA       NA    NA        1
## germ                     NA     NA   NA   NA        NA       NA    NA       NA
## plant.growth             NA     NA   NA   NA        NA       NA    NA       NA
## leaves                   NA     NA   NA   NA        NA       NA    NA       NA
## leaf.halo                NA     NA   NA   NA        NA       NA    NA       NA
## leaf.marg                NA     NA   NA   NA        NA       NA    NA       NA
## leaf.size                NA     NA   NA   NA        NA       NA    NA       NA
## leaf.shread              NA     NA   NA   NA        NA       NA    NA       NA
## leaf.malf                NA     NA   NA   NA        NA       NA    NA       NA
## stem                     NA     NA   NA   NA        NA       NA    NA       NA
## lodging                  NA     NA   NA   NA        NA       NA    NA       NA
## stem.cankers             NA     NA   NA   NA        NA       NA    NA       NA
## canker.lesion            NA     NA   NA   NA        NA       NA    NA       NA
## fruiting.bodies          NA     NA   NA   NA        NA       NA    NA       NA
## seed                     NA     NA   NA   NA        NA       NA    NA       NA
## mold.growth              NA     NA   NA   NA        NA       NA    NA       NA
## seed.discolor            NA     NA   NA   NA        NA       NA    NA       NA
## seed.size                NA     NA   NA   NA        NA       NA    NA       NA
## shriveling               NA     NA   NA   NA        NA       NA    NA       NA
##                 germ plant.growth leaves leaf.halo leaf.marg leaf.size
## plant.stand       NA           NA     NA        NA        NA        NA
## precip            NA           NA     NA        NA        NA        NA
## temp              NA           NA     NA        NA        NA        NA
## hail              NA           NA     NA        NA        NA        NA
## crop.hist         NA           NA     NA        NA        NA        NA
## area.dam          NA           NA     NA        NA        NA        NA
## sever             NA           NA     NA        NA        NA        NA
## seed.tmt          NA           NA     NA        NA        NA        NA
## germ               1           NA     NA        NA        NA        NA
## plant.growth      NA            1     NA        NA        NA        NA
## leaves            NA           NA      1        NA        NA        NA
## leaf.halo         NA           NA     NA      1.00     -0.98     -0.78
## leaf.marg         NA           NA     NA     -0.98      1.00      0.82
## leaf.size         NA           NA     NA     -0.78      0.82      1.00
## leaf.shread       NA           NA     NA        NA        NA        NA
## leaf.malf         NA           NA     NA        NA        NA        NA
## stem              NA           NA     NA        NA        NA      0.52
## lodging           NA           NA     NA        NA        NA        NA
## stem.cankers      NA           NA     NA        NA        NA        NA
## canker.lesion     NA           NA     NA        NA        NA        NA
## fruiting.bodies   NA           NA     NA        NA        NA        NA
## seed              NA           NA     NA        NA        NA        NA
## mold.growth       NA           NA     NA        NA        NA        NA
## seed.discolor     NA           NA     NA        NA        NA        NA
## seed.size         NA           NA     NA        NA        NA        NA
## shriveling        NA           NA     NA        NA        NA        NA
##                 leaf.shread leaf.malf stem lodging stem.cankers canker.lesion
## plant.stand              NA        NA   NA      NA           NA            NA
## precip                   NA        NA   NA      NA           NA            NA
## temp                     NA        NA   NA      NA           NA            NA
## hail                     NA        NA   NA      NA           NA            NA
## crop.hist                NA        NA   NA      NA           NA            NA
## area.dam                 NA        NA   NA      NA           NA            NA
## sever                    NA        NA   NA      NA           NA            NA
## seed.tmt                 NA        NA   NA      NA           NA            NA
## germ                     NA        NA   NA      NA           NA            NA
## plant.growth             NA        NA   NA      NA           NA            NA
## leaves                   NA        NA   NA      NA           NA            NA
## leaf.halo                NA        NA   NA      NA           NA            NA
## leaf.marg                NA        NA   NA      NA           NA            NA
## leaf.size                NA        NA 0.52      NA           NA            NA
## leaf.shread               1        NA   NA      NA           NA            NA
## leaf.malf                NA         1   NA      NA           NA            NA
## stem                     NA        NA 1.00      NA         0.70           0.7
## lodging                  NA        NA   NA       1           NA            NA
## stem.cankers             NA        NA 0.70      NA         1.00            NA
## canker.lesion            NA        NA 0.70      NA           NA           1.0
## fruiting.bodies          NA        NA   NA      NA         0.61            NA
## seed                     NA        NA   NA      NA           NA            NA
## mold.growth              NA        NA   NA      NA           NA            NA
## seed.discolor            NA        NA   NA      NA           NA            NA
## seed.size                NA        NA   NA      NA           NA            NA
## shriveling               NA        NA   NA      NA           NA            NA
##                 fruiting.bodies seed mold.growth seed.discolor seed.size
## plant.stand                  NA   NA          NA            NA        NA
## precip                       NA   NA          NA            NA        NA
## temp                         NA   NA          NA            NA        NA
## hail                         NA   NA          NA            NA        NA
## crop.hist                    NA   NA          NA            NA        NA
## area.dam                     NA   NA          NA            NA        NA
## sever                        NA   NA          NA            NA        NA
## seed.tmt                     NA   NA          NA            NA        NA
## germ                         NA   NA          NA            NA        NA
## plant.growth                 NA   NA          NA            NA        NA
## leaves                       NA   NA          NA            NA        NA
## leaf.halo                    NA   NA          NA            NA        NA
## leaf.marg                    NA   NA          NA            NA        NA
## leaf.size                    NA   NA          NA            NA        NA
## leaf.shread                  NA   NA          NA            NA        NA
## leaf.malf                    NA   NA          NA            NA        NA
## stem                         NA   NA          NA            NA        NA
## lodging                      NA   NA          NA            NA        NA
## stem.cankers               0.61   NA          NA            NA        NA
## canker.lesion                NA   NA          NA            NA        NA
## fruiting.bodies            1.00   NA          NA            NA        NA
## seed                         NA 1.00        0.74           0.7        NA
## mold.growth                  NA 0.74        1.00            NA      0.55
## seed.discolor                NA 0.70          NA           1.0        NA
## seed.size                    NA   NA        0.55            NA      1.00
## shriveling                   NA   NA          NA            NA      0.79
##                 shriveling
## plant.stand             NA
## precip                  NA
## temp                    NA
## hail                    NA
## crop.hist               NA
## area.dam                NA
## sever                   NA
## seed.tmt                NA
## germ                    NA
## plant.growth            NA
## leaves                  NA
## leaf.halo               NA
## leaf.marg               NA
## leaf.size               NA
## leaf.shread             NA
## leaf.malf               NA
## stem                    NA
## lodging                 NA
## stem.cankers            NA
## canker.lesion           NA
## fruiting.bodies         NA
## seed                    NA
## mold.growth             NA
## seed.discolor           NA
## seed.size             0.79
## shriveling            1.00

Looking at possible relationships between predictors, several pairs have strong relationships that could be used to impute missing values. The correlation matrix above was filtered to show only correlations with an absolute value greater than 0.5. Where one predictor in a pair is missing and the other is observed, the observed value can be used to impute the missing one. Based on the matrix above, the following relationships could be used for imputation:

leaf.halo & leaf.marg
leaf.halo & leaf.size
leaf.size & leaf.marg
leaf.size & stem
stem & stem.cankers
stem & canker.lesion
stem.cankers & fruiting.bodies
seed & mold.growth
seed & seed.discolor
seed.size & mold.growth
seed.size & shriveling

For the remainder of the missing values, if computational power is a concern, then replacing any missing values with the mode of that predictor variable is a solution for imputation. If computational power is not a concern, then a more advanced method such as Multiple Imputation by Chained Equations (MICE) could be used for better accuracy within the model.
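The mode-based option can be sketched in a few lines of base R. This is an illustration under my own assumptions, not a final implementation: impute_mode, soy, and soy_imputed are hypothetical names, ties between levels are broken by taking the first level, and a fresh copy of the data is loaded into a separate environment so the modified Soybean data frame is left untouched.

```r
# Load a fresh copy of the data without touching the modified Soybean object
soy_env = new.env()
data(Soybean, package = "mlbench", envir = soy_env)
soy = soy_env$Soybean

# Replace missing values in a factor with its most frequent level
impute_mode = function(x) {
  counts = table(x)                          # table() ignores NA by default
  mode_level = names(counts)[which.max(counts)]
  x[is.na(x)] = mode_level                   # level already exists, so the factor stays valid
  x
}

# Apply to every predictor (all columns except the first, Class)
soy_imputed = soy
soy_imputed[-1] = lapply(soy[-1], impute_mode)

# Number of missing values remaining after imputation
sum(is.na(soy_imputed))
```

Because mode imputation ignores the class-related missingness pattern noted earlier, it would be worth comparing model performance against a model-based approach before settling on it.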

Michael Drake

2026-02-25