data("Glass")
|> select(-Type) |> ggpairs(progress = FALSE) Glass
hw4DATA624
Assignment
We are required to complete questions 3.1 and 3.2 from chapter 3 of “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.
3.1 - UC Irvine Machine Learning - Glass Data
A - exploring the data
As described in the book, the data consist of 214 glass samples labeled as one of seven categories. There are nine predictors in the data set: RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe. We are tasked with visualizing the data to explore the predictor variables, understand their distributions, and examine the relationships between variables.
First, I use the ggpairs function to visualize the distribution of each variable and the correlations between variables.
library(mlbench)   # provides the Glass (and later Soybean) data sets
library(GGally)    # provides ggpairs()
library(tidyverse) # dplyr, tidyr, and ggplot2
data("Glass")
Glass |> select(-Type) |> ggpairs(progress = FALSE)
Additionally, I use the describe function from the psych package to generate a table of summary statistics for each predictor variable and its distribution.
psych::describe(Glass[1:9])
   vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
RI    1 214  1.52 0.00   1.52    1.52 0.00  1.51  1.53  0.02  1.60     4.72 0.00
Na    2 214 13.41 0.82  13.30   13.38 0.64 10.73 17.38  6.65  0.45     2.90 0.06
Mg    3 214  2.68 1.44   3.48    2.87 0.30  0.00  4.49  4.49 -1.14    -0.45 0.10
Al    4 214  1.44 0.50   1.36    1.41 0.31  0.29  3.50  3.21  0.89     1.94 0.03
Si    5 214 72.65 0.77  72.79   72.71 0.57 69.81 75.41  5.60 -0.72     2.82 0.05
K     6 214  0.50 0.65   0.56    0.43 0.17  0.00  6.21  6.21  6.46    52.87 0.04
Ca    7 214  8.96 1.42   8.60    8.74 0.66  5.43 16.19 10.76  2.02     6.41 0.10
Ba    8 214  0.18 0.50   0.00    0.03 0.00  0.00  3.15  3.15  3.37    12.08 0.03
Fe    9 214  0.06 0.10   0.00    0.04 0.00  0.00  0.51  0.51  1.73     2.52 0.01
B - Outliers and skewness
Based on the above plot and summary table, all of the predictor variables are skewed: RI, Na, Al, K, Ca, Ba, and Fe have positive skew, while Mg and Si have negative skew. All of the predictors except Mg also have positive kurtosis, some extremely so (K has a kurtosis of 52.87). This suggests that there are outliers in most of the predictor variables.
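The chapter computes skewness with the e1071 package; as a quick cross-check of the psych output above, a minimal sketch:
library(e1071)
# skewness of each predictor; values far from zero indicate asymmetry
apply(Glass[, 1:9], 2, skewness)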
Additionally, the box plots for each predictor are visible below, with red points signifying outliers based on the IQR. As is evident below, all predictors besides Mg have outliers.
Glass_longer <- Glass |> select(-Type) |> pivot_longer(cols = everything(), names_to = 'Predictor', values_to = 'Value')

Glass_longer |> ggplot(aes(x = Predictor, y = Value)) + geom_boxplot(outlier.color = 'red') + facet_wrap(~Predictor, scales = 'free_y')
C - Transformations of predictors
First, I try taking the log of the predictors, except for K, Ba, and Fe; those three have observations equal to zero, which would produce a log value of -Inf.
For K, Ba, and Fe, I instead take the square root.
Glass_Transformed <- Glass
Glass_Transformed[1:5] <- log(Glass_Transformed[1:5])  # log of RI, Na, Mg, Al, Si
Glass_Transformed[7] <- log(Glass_Transformed[7])      # log of Ca
Glass_Transformed[6] <- sqrt(Glass_Transformed[6])     # square root of K
Glass_Transformed[8:9] <- sqrt(Glass_Transformed[8:9]) # square root of Ba and Fe
Glass_Transformed |> select(-Type) |> ggpairs(progress = FALSE)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_text()`). (This warning repeats for each affected panel.)
Warning: Removed 42 rows containing non-finite outside the scale range
(`stat_density()`).
This approach produces missing values (Mg also contains zeros, so its log is -Inf, which accounts for the 42 non-finite rows removed in the warnings above) and does not meaningfully improve the skewness of the data.
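One workaround for the zero-valued predictors, shown here only as a sketch and not carried through the rest of the analysis, is log1p(), i.e. log(1 + x), which is defined at zero:
# log1p() tolerates the zero observations in Mg, K, Ba, and Fe without producing -Inf
Glass_log1p <- Glass
Glass_log1p[1:9] <- lapply(Glass[1:9], log1p)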
Next, I try the Box-Cox transformation for each predictor.
library(caret)
Glass_box_cox <- Glass

RI_predictor <- BoxCoxTrans(Glass_box_cox$RI)
RI_transformed <- predict(RI_predictor, Glass_box_cox$RI)

Na_predictor <- BoxCoxTrans(Glass_box_cox$Na)
Na_transformed <- predict(Na_predictor, Glass_box_cox$Na)

Mg_predictor <- BoxCoxTrans(Glass_box_cox$Mg)
Mg_transformed <- predict(Mg_predictor, Glass_box_cox$Mg)

Al_predictor <- BoxCoxTrans(Glass_box_cox$Al)
Al_transformed <- predict(Al_predictor, Glass_box_cox$Al)

Si_predictor <- BoxCoxTrans(Glass_box_cox$Si)
Si_transformed <- predict(Si_predictor, Glass_box_cox$Si)

K_predictor <- BoxCoxTrans(Glass_box_cox$K)
K_transformed <- predict(K_predictor, Glass_box_cox$K)

Ca_predictor <- BoxCoxTrans(Glass_box_cox$Ca)
Ca_transformed <- predict(Ca_predictor, Glass_box_cox$Ca)

Ba_predictor <- BoxCoxTrans(Glass_box_cox$Ba)
Ba_transformed <- predict(Ba_predictor, Glass_box_cox$Ba)

Fe_predictor <- BoxCoxTrans(Glass_box_cox$Fe)
Fe_transformed <- predict(Fe_predictor, Glass_box_cox$Fe)
transformed_predictors <- data.frame(RI_transformed, Na_transformed, Mg_transformed,
                                     Al_transformed, Si_transformed, K_transformed,
                                     Ca_transformed, Ba_transformed, Fe_transformed)

transformed_predictors |> ggpairs(progress = FALSE)
transformed_predictors_longer <- transformed_predictors |> pivot_longer(cols = everything(), names_to = 'Predictor', values_to = 'Value')

transformed_predictors_longer |> ggplot(aes(x = Predictor, y = Value)) + geom_boxplot(outlier.color = 'red') + facet_wrap(~Predictor, scales = 'free_y')
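The nine BoxCoxTrans/predict pairs above can also be written more compactly; a sketch assuming, as above, that the predictors sit in columns 1 through 9:
# fit a Box-Cox transformation per column and apply it in a single pass
transformed_list <- lapply(Glass[1:9], function(x) predict(BoxCoxTrans(x), x))
transformed_predictors_compact <- as.data.frame(transformed_list)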
The Box-Cox transformations do not make a meaningful difference to the skewness of the predictors or to the presence of outliers.
The predictors in this data set are difficult to transform: neither the log, square root, nor Box-Cox transformations meaningfully reduced skewness or the number of outliers. Further transformations would need to be considered before modeling; alternatively, one could choose a model that is less sensitive to skewness and outliers in the predictors, such as a tree-based model or a support vector machine.
3.2 - UC Irvine Machine Learning - Soybean Data
data("Soybean")
head(Soybean)
Class date plant.stand precip temp hail crop.hist area.dam
1 diaporthe-stem-canker 6 0 2 1 0 1 1
2 diaporthe-stem-canker 4 0 2 1 0 2 0
3 diaporthe-stem-canker 3 0 2 1 0 1 0
4 diaporthe-stem-canker 3 0 2 1 0 1 0
5 diaporthe-stem-canker 6 0 2 1 0 2 0
6 diaporthe-stem-canker 5 0 2 1 0 3 0
sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
1 1 0 0 1 1 0 2 2
2 2 1 1 1 1 0 2 2
3 2 1 2 1 1 0 2 2
4 2 0 1 1 1 0 2 2
5 1 0 2 1 1 0 2 2
6 1 0 1 1 1 0 2 2
leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
1 0 0 0 1 1 3 1
2 0 0 0 1 0 3 1
3 0 0 0 1 0 3 0
4 0 0 0 1 0 3 0
5 0 0 0 1 0 3 1
6 0 0 0 1 0 3 0
fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
1 1 1 0 0 0 0
2 1 1 0 0 0 0
3 1 1 0 0 0 0
4 1 1 0 0 0 0
5 1 1 0 0 0 0
6 1 1 0 0 0 0
fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
1 4 0 0 0 0 0 0
2 4 0 0 0 0 0 0
3 4 0 0 0 0 0 0
4 4 0 0 0 0 0 0
5 4 0 0 0 0 0 0
6 4 0 0 0 0 0 0
A - categorical variables
Below is a summary of the variables in the Soybean data set:
summary(Soybean)
Class date plant.stand precip temp
brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
anthracnose : 44 6 : 90
brown-stem-rot : 44 (Other):101
(Other) :233 NA's : 1
hail crop.hist area.dam sever seed.tmt germ plant.growth
0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
3 :218 3 :187 NA's:121 NA's:121 NA's:112
NA's: 16 NA's: 1
leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
NA's: 84 NA's: 84 NA's: 84 NA's:108
stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
3 :191 3 : 65 NA's: 38
NA's: 38 NA's: 38
mycelium int.discolor sclerotia fruit.pods fruit.spots seed
0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
NA's: 38 3 : 48 4 :100
NA's: 84 NA's:106
mold.growth seed.discolor seed.size shriveling roots
0 :524 0 :513 0 :532 0 :539 0 :551
1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
NA's: 31
The chapter discusses degenerate distributions, where a variable has zero variance (i.e., it is constant) or where the vast majority of observations are concentrated in a single value with only a small number of other unique values (near-zero variance).
A rule of thumb for detecting near-zero-variance predictors is:
- the fraction of unique values over the sample size is low (say, 10%), and
- the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (around 20).
Since these variables are categorical with only a few levels each, the fraction of unique values is low for every predictor, so the deciding criterion is the frequency ratio.
To compute it, I create a function that tabulates the frequency of each unique value and takes the ratio of the most common value's frequency to the second most common.
frequency_ratio <- function(columnToTest) {
  value_counts <- sort(table(columnToTest))  # frequency of each unique value, ascending
  length_of_summary <- length(value_counts)
  # ratio of the most prevalent value's frequency to the second most prevalent
  ratio <- value_counts[length_of_summary] / value_counts[length_of_summary - 1]
  return(ratio)
}
Next, I apply this function to each column of the Soybean data set and filter for ratios that are >= 20.
predictor_Frequency_Ratio <- apply(Soybean, 2, frequency_ratio)

as.data.frame(predictor_Frequency_Ratio) |> filter(predictor_Frequency_Ratio >= 20)
predictor_Frequency_Ratio
leaf.mild 26.75
mycelium 106.50
sclerotia 31.25
The distributions for leaf.mild, mycelium, and sclerotia are degenerate: their frequency ratios far exceed the rule-of-thumb cutoff of 20, so these predictors have near-zero variance.
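For reference, caret (loaded above) automates both rule-of-thumb checks with nearZeroVar(); with saveMetrics = TRUE it returns the frequency-ratio and percent-unique diagnostics for every column:
nzv_metrics <- nearZeroVar(Soybean[, -1], saveMetrics = TRUE) # drop the Class column
nzv_metrics[nzv_metrics$nzv, ] # rows flagged as near-zero variance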
B - Missing data
We are asked to check whether some predictors are more likely than others to have missing data.
Below I create a function that returns the percentage of observations that are NA for a given variable, then apply it to each column of the Soybean data set.
na_percentage <- function(columnToTest) {
  return(sum(is.na(columnToTest)) / length(columnToTest))  # share of observations that are NA
}
soybean_predictor_na <- apply(Soybean, 2, na_percentage)

sort(soybean_predictor_na, decreasing = TRUE)
hail sever seed.tmt lodging germ
0.177159590 0.177159590 0.177159590 0.177159590 0.163982430
leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
0.158125915 0.155197657 0.155197657 0.155197657 0.155197657
leaf.shread seed mold.growth seed.size leaf.halo
0.146412884 0.134699854 0.134699854 0.134699854 0.122986823
leaf.marg leaf.size leaf.malf fruit.pods precip
0.122986823 0.122986823 0.122986823 0.122986823 0.055636896
stem.cankers canker.lesion ext.decay mycelium int.discolor
0.055636896 0.055636896 0.055636896 0.055636896 0.055636896
sclerotia plant.stand roots temp crop.hist
0.055636896 0.052708638 0.045387994 0.043923865 0.023426061
plant.growth stem date area.dam Class
0.023426061 0.023426061 0.001464129 0.001464129 0.000000000
leaves
0.000000000
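As an aside, base R computes the same proportions in one call, which would make the helper function unnecessary:
# is.na() returns a logical matrix; colMeans() gives the NA share per column
sort(colMeans(is.na(Soybean)), decreasing = TRUE)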
Based on the vector above, hail, sever, seed.tmt, and lodging are tied for the highest percentage of NA values at 17.7%. The variables from hail down through fruit.pods all have double-digit NA percentages (fruit.pods is at 12.3%); below that, the NA percentage drops to about 5.6%. The variables from hail through fruit.pods are therefore the most likely to have NA values.
We are then asked whether certain classes tend to have more NAs than others. Below, I count the number of NAs by class and variable (e.g., for the class 2-4-d-injury, how many times the date column is NA), then sum the counts of NAs across each row. The five classes listed below account for all of the NAs; the remaining classes have no NAs for any variable.
class_Na_summary <- Soybean |> group_by(Class) |> summarise(across(everything(), ~ sum(is.na(.x))))

class_Na_summary$Na_count <- rowSums(class_Na_summary[, 2:36])

class_Na_summary |> select(Class, Na_count) |> filter(Na_count > 0)
# A tibble: 5 × 2
Class Na_count
<fct> <dbl>
1 2-4-d-injury 450
2 cyst-nematode 336
3 diaporthe-pod-&-stem-blight 177
4 herbicide-injury 160
5 phytophthora-rot 1214
C - handling missing data
Based on the above, I would eliminate the leaf.mild, mycelium, and sclerotia variables as they are degenerate.
soybean_updated <- Soybean |> select(-c("leaf.mild", "mycelium", "sclerotia"))
summary(soybean_updated)
Class date plant.stand precip temp
brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
anthracnose : 44 6 : 90
brown-stem-rot : 44 (Other):101
(Other) :233 NA's : 1
hail crop.hist area.dam sever seed.tmt germ plant.growth
0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
3 :218 3 :187 NA's:121 NA's:121 NA's:112
NA's: 16 NA's: 1
leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem
0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :296
1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 :371
2 :342 2 :221 2 :221 NA's:100 NA's: 84 NA's: 16
NA's: 84 NA's: 84 NA's: 84
lodging stem.cankers canker.lesion fruiting.bodies ext.decay int.discolor
0 :520 0 :379 0 :320 0 :473 0 :497 0 :581
1 : 42 1 : 39 1 : 83 1 :104 1 :135 1 : 44
NA's:121 2 : 36 2 :177 NA's:106 2 : 13 2 : 20
3 :191 3 : 65 NA's: 38 NA's: 38
NA's: 38 NA's: 38
fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
0 :407 0 :345 0 :476 0 :524 0 :513 0 :532
1 :130 1 : 75 1 :115 1 : 67 1 : 64 1 : 59
2 : 14 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92
3 : 48 4 :100
NA's: 84 NA's:106
shriveling roots
0 :539 0 :551
1 : 38 1 : 86
NA's:106 2 : 15
NA's: 31
From the remaining variables, we can consider which ones have the highest percentage of NAs:
soybean_predictor_na_updated <- apply(soybean_updated, 2, na_percentage)

soybean_predictor_na_updated |> as.data.frame() |> filter(soybean_predictor_na_updated > 0.1)
soybean_predictor_na_updated
hail 0.1771596
sever 0.1771596
seed.tmt 0.1771596
germ 0.1639824
leaf.halo 0.1229868
leaf.marg 0.1229868
leaf.size 0.1229868
leaf.shread 0.1464129
leaf.malf 0.1229868
lodging 0.1771596
fruiting.bodies 0.1551977
fruit.pods 0.1229868
fruit.spots 0.1551977
seed 0.1346999
mold.growth 0.1346999
seed.discolor 0.1551977
seed.size 0.1346999
shriveling 0.1551977
These variables have a high percentage of NA values (other variables also have NAs, but at lower percentages). Deciding whether to eliminate these variables or to impute values requires more understanding of the data: while the NA percentages seem high, it is unclear whether the missingness is informative, since the meanings of these categorical variables are not provided. Given the prevalence of NAs, I would suggest a model that can handle missing data directly, such as a tree-based model. The alternatives are to drop observations with missing values or to impute them with an approach such as K-nearest neighbors.
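To illustrate the drop-observations option (imputing categorical predictors would require a factor-aware method, which I have not run here), a minimal sketch:
# keep only fully observed rows and check what fraction of the data survives
soybean_complete <- soybean_updated[complete.cases(soybean_updated), ]
nrow(soybean_complete) / nrow(soybean_updated)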