DATA 624: Assignment 04

Author

Curtis Elsasser

Published

March 2, 2025

Setup

library(GGally)
library(mlbench)
library(tidyverse)

skewness <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  sum((x - m)^3) / ((n - 1) * s^3)
}

Assignment

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .pdf for your run code.

Problem 3.1

The UC Irvine Machine Learning Repository 6 contains a dataset related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

The problem is to forecast the type of class on basis of the chemical analysis. The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence (if it is correctly identified!).

data(Glass)
glimpse(Glass)

Rows: 214
Columns: 10
$ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.…
$ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13…
$ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,…
$ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,…
$ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72…
$ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,…
$ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,…
$ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,…
$ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

summary(Glass)

       RI              Na              Mg              Al       
 Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
 1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
 Median :1.518   Median :13.30   Median :3.480   Median :1.360  
 Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
 3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
 Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
       Si              K                Ca               Ba       
 Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
 1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
 Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
 Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
 3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
 Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
       Fe          Type  
 Min.   :0.00000   1:70  
 1st Qu.:0.00000   2:76  
 Median :0.00000   3:17  
 Mean   :0.05701   5:13  
 3rd Qu.:0.10000   6: 9  
 Max.   :0.51000   7:29

Element Unit?

It looks like the unit of each element predictor is a percentage of the composition of the glass. I’m not sure about RI (refractive index). Let’s try confirm this suspicion. Each row’s elements should be sum to 100.

Glass |>
  mutate(Total = Na + Mg + Al + Si + K + Ca + Ba + Fe) |>
  summarize(
    Min = min(Total),
    Max = max(Total)
  )

    Min   Max
1 99.02 100.1

Alright, that’s enough confirmation for me that the predictors are all percentages of the total composition of each observation. I think it’s safe to assume that the refractive index is a measure of how much light is bent when it passes through the glass. What are its units? According to Wikipedia, “…the refractive index (or refraction index) of an optical medium is the ratio of the apparent speed of light in the air or vacuum to the speed in the medium.”

3.1.1

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Visualizations

First, let’s get a feeling for the distribution of the predictor variables. We shall create histograms and boxplots for each of the predictors.

Glass |>
  pivot_longer(cols = Na:Fe, names_to = "Predictor", values_to = "Value") |>
  group_by(Type, Predictor) |>
  summarize(Value = mean(Value)) |>
  # get this guy to sort. I remember this solution from 607. I still think its weird.
  mutate(
    Predictor = factor(Predictor, levels = unique(Predictor[order(Value)]))
  ) |>
  ggplot(mapping = aes(x = Predictor, y = Value)) +
  geom_col() +
  facet_wrap(~Type, scales = "free") +
  labs(
    title = "Composition of Glass by Type",
    x = "Element",
    y = "Percentage"
  )

`summarise()` has grouped output by 'Type'. You can override using the
`.groups` argument.

It’s looking very exponential. Let’s see it without Si which dominates

Glass |>
  pivot_longer(cols = c(Na, Mg, Al, K, Ca, Ba, Fe), names_to = "Predictor", values_to = "Value") |>
  group_by(Type, Predictor) |>
  summarize(Value = mean(Value)) |>
  mutate(
    Predictor = factor(Predictor, levels = unique(Predictor[order(Value)]))
  ) |>
  ggplot(mapping = aes(x = Predictor, y = Value)) +
  geom_col() +
  facet_wrap(~Type, scales = "free") +
  labs(
    title = "Composition of Glass (minus Si) by Type",
    x = "Element",
    y = "Total Percentage"
  )

`summarise()` has grouped output by 'Type'. You can override using the
`.groups` argument.

Glass |>
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") |>
  ggplot(mapping = aes(x = Value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Predictor, scales = "free") +
  labs(
    title = "Distribution of Predictors in Glass Dataset",
    x = "Value",
    y = "Count"
  )

Glass |>
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") |>
  ggplot(mapping = aes(x = Value)) +
  geom_boxplot() +
  facet_wrap(~Predictor, scales = "free") +
  labs(
    title = "Boxplots of Predictors in Glass Dataset",
    x = "Value",
    y = "Count"
  )

Glass |>
  select(RI:Fe) |>
  ggpairs(
    title = "Pairwise Scatterplots of Predictors in Glass Dataset"
  )

3.1.2

Do there appear to be any outliers in the data? Are any predictors skewed?

Answer

Yes, as per the \(\pm 1.5\times IQR\) rule, there are many outliers in the data. The boxplots show that there are outliers in the predictors: Al, Ba, Ca, Fe, K, Na, RI and Si; the only predictor missing from this list is Mg:

Al: outliers above and below the whiskers.
Ba: outliers above the whiskers.
Ca: outliers above and below the whiskers.
Fe: outliers above the whiskers.
K: outliers above the whiskers.
Mg: no outliers.
Na: outliers above and below the whiskers.
Si: outliers above and below the whiskers.
RI: most outliers are above the whiskers.

The histograms and boxplots show that the predictors are to varying degrees not normally distributed. The predictors Al, Ba, Ca, Fe, K, Na, and RI are right-skewed, while the predictors Mg and Si are left-skewed. We will calculate their skewness to confirm this. Ba, Fe, K and Mg are a little strange. They all have significant numbers of observations with 0%. I presume they are elements that induce special qualities in glass. We’ll follow up on that if we have time.

skewness(Glass$Al)

[1] 0.8988105

skewness(Glass$Ba)

[1] 3.384495

skewness(Glass$Ca)

[1] 2.027923

skewness(Glass$Fe)

[1] 1.737932

skewness(Glass$K)

[1] 6.490418

skewness(Glass$Mg)

[1] -1.141788

skewness(Glass$Na)

[1] 0.4499368

skewness(Glass$Si)

[1] -0.7236206

skewness(Glass$RI)

[1] 1.61024

3.1.3

Are there any relevant transformations of one or more predictors that might improve the classification model?

I would be inclined to apply a log transformation to all of the predictors for two reasons:

All but two of the the predictors are right-skewed. A log transformation would help to normalize the distributions of these predictors.
The predictors are exponentially distributed. A log transformation would help to linearize the relationships between the predictors and the outcome variable.

I might consider an exponential transformation for the predictors Mg and Si, but I am hesitant, because Si is already so dominant. It accounts for ~70% of the total composition of each type of glass. I would be concerned that an exponential transformation would make it even more dominant.

I am curious about trimming as well. As noted before, Ba, Fe, K and Mg are all have significant numbers of observations with 0%. They will be resistant to a straightforward log transformation.

Let’s see what a uniform \(log(x+1)\) transformation does to the distribution of the predictors.

Glass |>
  mutate(across(Na:Fe, log1p)) |>
  pivot_longer(cols = Na:Fe, names_to = "Predictor", values_to = "Value") |>
  group_by(Type, Predictor) |>
  summarize(Value = mean(Value)) |>
  # get this guy to sort. I remember this solution from 607. I still think its weird.
  mutate(
    Predictor = factor(Predictor, levels = unique(Predictor[order(Value)]))
  ) |>
  ggplot(mapping = aes(x = Predictor, y = Value)) +
  geom_col() +
  facet_wrap(~Type, scales = "free") +
  labs(
    title = "Composition of Glass by Type",
    x = "Element",
    y = "log(%+1)"
  )

`summarise()` has grouped output by 'Type'. You can override using the
`.groups` argument.

Glass |>
  select(RI:Fe) |>
  mutate(across(everything(), log1p)) |>
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
  ggplot(mapping = aes(x = Value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Predictor, scales = "free") +
  labs(
    title = "Distribution of Log-Transformed Predictors in Glass Dataset",
    x = "Value",
    y = "log(%+1)"
  )

Let’s see what happens if we get rid of the 0% observations. If for no other reason than to help us to see the distributions of the predictors more clearly.

Glass |>
  select(RI:Fe) |>
  mutate(across(everything(), log1p)) |>
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
  filter(Value > 0) |>
  ggplot(mapping = aes(x = Value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Predictor, scales = "free") +
  labs(
    title = "Distribution of Log-Transformed Predictors in Glass Dataset",
    subtitle = "0% observations removed",
    x = "Value",
    y = "log(%+1)"
  )

Ha, I don’t know if that made things better or worse! We can now see Ba’s and Fe’s distribution, neither of which normal at all.

Problem 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g.,left spots, mold growth). The outcome labels consist of 19 distinct classes.

Load the data and have a peek:

data(Soybean) 
glimpse(Soybean)

Rows: 683
Columns: 36
$ Class           <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
$ date            <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
$ plant.stand     <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ precip          <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
$ temp            <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
$ hail            <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
$ crop.hist       <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
$ area.dam        <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
$ sever           <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
$ seed.tmt        <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
$ germ            <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
$ plant.growth    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ leaves          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ leaf.halo       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ leaf.marg       <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ leaf.size       <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ leaf.shread     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ leaf.malf       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ leaf.mild       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stem            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ lodging         <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
$ stem.cankers    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
$ canker.lesion   <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
$ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ext.decay       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ mycelium        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ int.discolor    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
$ sclerotia       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
$ fruit.pods      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ fruit.spots     <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ seed            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ mold.growth     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ seed.discolor   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ seed.size       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ shriveling      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ roots           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

summary(Soybean)

                 Class          date     plant.stand  precip      temp    
 brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
 alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
 frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
 phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
 anthracnose        : 44   6      : 90                                    
 brown-stem-rot     : 44   (Other):101                                    
 (Other)            :233   NA's   :  1                                    
   hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
 0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
 1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
 NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
            3   :218   3   :187   NA's:121   NA's:121   NA's:112               
            NA's: 16   NA's:  1                                                
                                                                               
                                                                               
 leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
 0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
 1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
         2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
         NA's: 84   NA's: 84   NA's: 84                          NA's:108  
                                                                           
                                                                           
                                                                           
   stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
 0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
 1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
 NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
                       3   :191     3   : 65                      NA's: 38  
                       NA's: 38     NA's: 38                                
                                                                            
                                                                            
 mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
 0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
 1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
 NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
            NA's: 38                3   : 48   4   :100              
                                    NA's: 84   NA's:106              
                                                                     
                                                                     
 mold.growth seed.discolor seed.size  shriveling  roots    
 0   :524    0   :513      0   :532   0   :539   0   :551  
 1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
 NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
                                                 NA's: 31

levels(Soybean$Class)

 [1] "2-4-d-injury"                "alternarialeaf-spot"        
 [3] "anthracnose"                 "bacterial-blight"           
 [5] "bacterial-pustule"           "brown-spot"                 
 [7] "brown-stem-rot"              "charcoal-rot"               
 [9] "cyst-nematode"               "diaporthe-pod-&-stem-blight"
[11] "diaporthe-stem-canker"       "downy-mildew"               
[13] "frog-eye-leaf-spot"          "herbicide-injury"           
[15] "phyllosticta-leaf-spot"      "phytophthora-rot"           
[17] "powdery-mildew"              "purple-seed-stain"          
[19] "rhizoctonia-root-rot"

There are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value “dna” means does not apply. The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.

# The unjustified
levels(Soybean$Class)[16:19]

[1] "phytophthora-rot"     "powdery-mildew"       "purple-seed-stain"   
[4] "rhizoctonia-root-rot"

Category Table

I have changed the factor levels to reflect our modification. See the callout in section 3.2.1 for more information.

Predictor	Categories
date	NA(0), apr(1), may(2), june(3), july(4), aug(5), sept(6), oct(7)
plant.stand	NA(0), normal(1), lt-normal(2)
precip	NA(0), lt-norm(1), norm(2), gt-norm(3)
temp	NA(0), lt-norm(1), norm(2), gt-norm(3)
hail	NA(0), yes(1), no(2)
crop.hist	NA(0), dif-lst-yr(1), s-l-y(2), s-l-3-y(3), s-l-7-y(4)
area.dam	NA(0), scatter(1), low-area(2), upper-ar(3), whole-field(4)
sever	NA(0), minor(1), pot-severe(2), severe(3)
seed.tmt	NA(0), none(1), fungicide(2), other(3)
germ	NA(0), 91-211%(1), 81-89%(2), lt-81%(3)
plant.growth	NA(0), norm(1), abnorm(2)
leaves	NA(0), norm(1), abnorm(2)
leaf.halo	NA(0), absent(1), yellow-halos(2), no-yellow-halos(3)
leaf.marg	NA(0), w-s-marg(1), no-w-s-marg(2), dna(3)
leaf.size	NA(0), lt-2/8(1), gt-2/8(2), dna(3)
leaf.shread	NA(0), absent(1), present(2)
leaf.malf	NA(0), absent(1), present(2)
leaf.mild	NA(0), absent(1), upper-surf(2), lower-surf(3)
stem	NA(0), norm(1), abnorm(2)
lodging	NA(0), yes(1), no(2)
stem.cankers	NA(0), absent(1), below-soil(2), above-s(3), ab-sec-nde(4)
canker.lesion	NA(0), dna(1), brown(2), dk-brown-blk(3), tan(4)
fruiting.bodies	NA(0), absent(1), present(2)
ext.decay	NA(0), absent(1), firm-and-dry(2), watery(3)
mycelium	NA(0), absent(1), present(2)
int.discolor	NA(0), none(1), brown(2), black(3)
sclerotia	NA(0), absent(1), present(2)
fruit.pods	NA(0), norm(1), diseased(2), few-present(3), dna(4)
fruit.spots	NA(0), absent(1), col(2), br-w/blk-speck(3), distort(4), dna(5)
seed	NA(0), norm(1), abnorm(2)
mold.growth	NA(0), absent(1), present(2)
seed.discolor	NA(0), absent(1), present(2)
seed.size	NA(0), norm(1), lt-norm(2)
shriveling	NA(0), absent(1), present(2)
roots	NA(0), norm(1), rotted(2), galls-cysts(3)

3.2.1

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Answer

Why is as.numeric adding 1 to each factor level?

The as.numeric function is adding 1 to each factor level because the first level of a factor is 1, not 0. This is a little confusing, but the first level of a factor is always 1. And, when you use as.numeric on a factor in R, it converts the factor levels to their underlying integer codes, not the actual values of the factor. This is because factors are stored internally as integers with corresponding levels. And the values of the factor levels are stored as a character vector. So, converting the value is not a safe alternative.

Ah, I see. The as.numeric function is adding 1 to each factor level because the first level of a factor is 1, not 0. Well, we will take advantage of this. We will convert the NA values to 0 so that we can get a grasp of how many there are per predictor.

Soybean |>
  select(-Class) |>
  mutate(across(everything(), ~if_else(is.na(.), 0, as.numeric(.)))) |>
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
  group_by(Predictor, Value) |>
  summarize(Count = n()) |>
  ggplot(mapping = aes(x = Value, y = Count)) +
  geom_col() +
  facet_wrap(~Predictor, scales = "free") +
  labs(
    title = "Distribution of Categorical Predictors in Soybean Dataset",
    x = "Value",
    y = "Count"
  )

`summarise()` has grouped output by 'Predictor'. You can override using the
`.groups` argument.

Degenerate?

To be honest, I was not totally clear on what a degenerate distribution is. I searched the chapter and found one reference to “degenerate”. I researched it a little and found that for categorical predictors, a degenerate distribution is one in which one of the categories has a very small number of observations or one category dominates the others. That we have. Looking at our distribution plot up above we can see that the following all have degenerate distributions:

int.discolor: most belong 0 which is our encoding of NA.
leaf.malf: most belong to 1, which is actually 0, which is “absent”.
leaf.mild: there are a fair number of NAs, but 0 (“absent”) dominates.
leaves: 1 (“abnormal”) dominates.
mycellium: 0 (“absent”) dominates.
sclerotia: 0 (“absent”) dominates.

There are some others with significant differences. It’s really a matter of what threshold (percentage) one adopts as the cutoff as being “degenerate”.

3.2.2

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Answer

Those above the average threshold (18%) are as follows:

Soybean |>
  select(-Class) |>
  mutate(across(everything(), ~ if_else(is.na(.), 0, as.numeric(.)))) |>
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
  group_by(Predictor) |>
  summarize(
    NAFraction = sum(Value == 0) / n()
  ) |>
  arrange(desc(NAFraction))

# A tibble: 35 × 2
   Predictor       NAFraction
   <chr>                <dbl>
 1 hail                 0.177
 2 lodging              0.177
 3 seed.tmt             0.177
 4 sever                0.177
 5 germ                 0.164
 6 leaf.mild            0.158
 7 fruit.spots          0.155
 8 fruiting.bodies      0.155
 9 seed.discolor        0.155
10 shriveling           0.155
# ℹ 25 more rows

Let’s see if there is any correlation to the classes. We will do this by creating a data.frame of the counts of each class for each predictor.

Soybean |>
  mutate(across(-Class, ~ if_else(is.na(.), 0, as.numeric(.)))) |>
  pivot_longer(cols = -Class, names_to = "Predictor", values_to = "Value") |>
  group_by(Class) |>
  summarize(
    CountNA = sum(Value == 0),
    CountAll = n(),
    NAFraction = CountNA / CountAll
  ) |>
  filter(NAFraction > 0) |>
  arrange(desc(NAFraction))

# A tibble: 5 × 4
  Class                       CountNA CountAll NAFraction
  <fct>                         <int>    <int>      <dbl>
1 2-4-d-injury                    450      560      0.804
2 cyst-nematode                   336      490      0.686
3 herbicide-injury                160      280      0.571
4 phytophthora-rot               1214     3080      0.394
5 diaporthe-pod-&-stem-blight     177      525      0.337

3.2.3

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Answer

Let’s first look for correlation between the predictors. We will use the GGally package to create a correlation plot.

Soybean |>
  select(-Class) |>
  mutate(across(everything(), ~ if_else(is.na(.), 0, as.numeric(.)))) |>
  ggpairs(
    title = "Pairwise Scatterplots of Predictors in Soybean Dataset"
  )

Hahaha, that was ridiculous. 35 predictors by 35 predictors yielded ~612 plots that looked comical and completely illegible on an 8” wide page. We really don’t need every every pairwise plot. The missing classes are isolated to 5 different classes. We might be able to impute the missing data based on the non NA composition of the other classes.

Wait, let’s not give up on correlation. Let try a correlation matrix instead. Uh-oh, it’s a matrix. I have never worked with one in R before. I searched for and found a way to convert it to a data.frame. Let’s see if we can find any high correlations.

cor_matrix <- Soybean |>
  select(-Class) |>
  mutate(across(everything(), ~ if_else(is.na(.), 0, as.numeric(.)))) |>
  cor()

class(cor_matrix)

[1] "matrix" "array"

as.data.frame(as.table(cor_matrix)) |> 
  filter(Var1 != Var2 & abs(Freq) > 0.80) |>
  arrange(desc(abs(Freq))) |>
  print()

              Var1            Var2      Freq
1      mold.growth            seed 0.8686516
2             seed     mold.growth 0.8686516
3        leaf.size       leaf.marg 0.8573120
4        leaf.marg       leaf.size 0.8573120
5        seed.size     mold.growth 0.8380320
6      mold.growth       seed.size 0.8380320
7        seed.size            seed 0.8323945
8             seed       seed.size 0.8323945
9       shriveling     mold.growth 0.8317743
10     mold.growth      shriveling 0.8317743
11       sclerotia    int.discolor 0.8285093
12    int.discolor       sclerotia 0.8285093
13      shriveling fruiting.bodies 0.8117533
14 fruiting.bodies      shriveling 0.8117533
15      shriveling   seed.discolor 0.8102875
16   seed.discolor      shriveling 0.8102875

I somewhat arbitrarily chose 0.80 as the threshold for high correlation. There are 8 pairs of predictors that are highly correlated. My strategy would be to remove one of the predictors from each unique pair to reduce collinearity.