Data624 - Homework 4

Author

Anthony Josue Roman

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)
data(Glass)
str(Glass)
'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

3.1 part A

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Distribution Plots

library(tidyverse)
library(GGally)

glass_x <- Glass %>% select(-Type)

glass_long <- glass_x %>%
  pivot_longer(cols = everything(),
               names_to = "Predictor",
               values_to = "Value")

ggplot(glass_long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal()

Boxplots

glass_long2 <- Glass %>%
  pivot_longer(cols = -Type,
               names_to = "Predictor",
               values_to = "Value")

ggplot(glass_long2, aes(x = Type, y = Value)) +
  geom_boxplot(fill = "tomato") +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal()

Correlation Matrix

cor_matrix <- cor(glass_x)

library(corrplot)
corrplot(cor_matrix, method = "color", tl.cex = 0.8)

ggpairs(glass_x)

From the histograms, Na and Si are roughly symmetric with a moderate spread, and Al has a mild right tail. RI and Ca are unimodal but have a noticeable right tail. Ba and Fe are strongly right-skewed, with most values at or near zero and a few large ones. K is also strongly right-skewed: most values are below 1, but a few are much larger (up to 6.21). Mg is left-skewed, with a cluster of values near zero.

Boxplots of the predictor variables according to glass type show that several predictor variables can differentiate the classes. For example, Ba can differentiate Type 7 glasses from the other types. Similarly, Mg has a high degree of separation among the classes, especially Types 5, 6, and 7, in which the values are close to zero. Several predictor variables show the presence of outliers, such as Ba, Fe, K, and Ca.

The correlation matrix shows that most predictor pairs are only weakly to moderately related. The strongest relationship is the positive correlation between RI and Ca (about 0.81). RI and Si are negatively correlated (about -0.54), which is a moderate rather than a strong relationship.
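To make the reading of the correlation matrix less dependent on eyeballing the plot, the predictor pairs can be ranked by absolute correlation. The following is a minimal base R sketch; `top_correlations` and the toy data frame are illustrative names that are not part of the original analysis.

```r
# Rank predictor pairs by absolute Pearson correlation (base R only).
# `top_correlations` is a hypothetical helper, not part of the original analysis.
top_correlations <- function(x, n = 5) {
  cm <- cor(x)
  # Upper triangle only, so each unordered pair appears once
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Toy data for illustration only (not the Glass data)
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, 6, 8, 11),
                  c = c(5, 3, 4, 1, 2))
top_correlations(toy, n = 2)
```

Assuming `glass_x` from the chunk above, `top_correlations(glass_x)` would be expected to place the RI and Ca pair at or near the top.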

3.1 Part B

Do there appear to be any outliers in the data? Are any predictors skewed?

library(e1071)

skew_vals <- apply(glass_x, 2, skewness)
round(skew_vals, 2)
   RI    Na    Mg    Al    Si     K    Ca    Ba    Fe 
 1.60  0.45 -1.14  0.89 -0.72  6.46  2.02  3.37  1.73 

The skewness values confirm what the histograms and boxplots suggested: several predictors are clearly skewed. K (6.46) and Ba (3.37) are the most extreme, indicating heavy right tails and likely some strong outliers in these two predictors. Ca (2.02), Fe (1.73), and RI (1.60) are also substantially right-skewed.

Mg has a skewness of -1.14, indicating pronounced left skew. Na (0.45) is roughly symmetric, while Al (0.89) and Si (-0.72) show moderate right and left skew, respectively. As for outliers, the boxplots in Part A show points beyond the whiskers for Ba, Fe, K, and Ca in particular, so outliers do appear to be present.
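As a cross-check on the numbers above, sample skewness can be computed directly, and outliers can be counted with the boxplot rule (points more than 1.5 times the interquartile range beyond the hinges). This is a sketch under stated assumptions: `skew_type3` and `count_outliers` are hypothetical helpers, and the skewness formula is assumed to match the default (type 3) estimator in e1071.

```r
# Moment-based sample skewness with the ((n - 1) / n)^(3/2) correction.
# Assumed to correspond to e1071::skewness(x, type = 3); hypothetical helper.
skew_type3 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  (m3 / m2^1.5) * ((n - 1) / n)^1.5
}

# Number of points flagged by the 1.5 * IQR boxplot rule (hypothetical helper)
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Toy vectors for illustration only
symmetric <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 2, 3, 4, 100)

skew_type3(symmetric)       # zero for a symmetric sample
skew_type3(right_tail)      # positive for a right tail
count_outliers(right_tail)  # the value 100 lies beyond the upper fence
```

With `glass_x` available, `sapply(glass_x, count_outliers)` would give a per-predictor count to set beside the skewness values.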

3.1 Part C

Are there any relevant transformations of one or more predictors that might improve the classification model?

library(caret)

bc_K <- BoxCoxTrans(Glass$K)
bc_K
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 

Lambda could not be estimated; no transformation is applied
K_trans <- predict(bc_K, Glass$K)

bc_K$lambda
[1] NA
par(mfrow = c(1,2))
hist(Glass$K, main = "Original K")
hist(log(Glass$K + 1), main = "Log(K + 1)")

The predictor K is highly right-skewed and contains zeros. Box-Cox requires strictly positive values, so lambda could not be estimated (bc_K$lambda is NA) and no transformation was applied. A log(K + 1) transformation was used instead, which handles the zeros and visibly reduces the right skew in the histogram. Other skewed predictors may benefit from similar treatment: Ba and Fe also contain zeros, so log(x + 1) or a Yeo-Johnson transformation would be appropriate, whereas Ca is strictly positive and could be handled with Box-Cox directly. Additionally, centering and scaling the predictors may improve the performance of distance-based classification methods.
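The effect of zeros on a log-type transformation can be seen on a small example. This sketch uses made-up, K-like values (not the Glass data) and a hypothetical `skew` helper; it only illustrates why log(x + 1) works where a plain log does not, and how centering and scaling would follow.

```r
# Made-up, K-like values: a couple of zeros and one large value
k_like <- c(0, 0, 0.1, 0.5, 0.6, 0.6, 0.7, 1.5, 6.2)

# Moment-based sample skewness (hypothetical helper)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

log(k_like)                # zeros map to -Inf, so a plain log is unusable here
k_log1p <- log1p(k_like)   # log(1 + x) stays finite at zero

skew_before <- skew(k_like)
skew_after  <- skew(k_log1p)   # smaller than skew_before for these values

# Center and scale the transformed values (mean 0, standard deviation 1)
k_scaled <- as.numeric(scale(k_log1p))
```

For the full predictor set, one option is caret's `preProcess()` with `method = c("YeoJohnson", "center", "scale")`, which applies a zero-tolerant power transformation and standardization in one step; whether that improves a given classifier would need to be checked by resampling.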

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data(Soybean)

str(Soybean)
'data.frame':   683 obs. of  36 variables:
 $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
 $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
 $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
 $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
 $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
 $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
 $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
 $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
 $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
 $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
 $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
 $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
 $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
 $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

3.2 Part A

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

library(dplyr)

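# Class is the first column of Soybean, so drop it by name;
# dropping the last column would remove the predictor roots instead
freq_list <- lapply(Soybean[, names(Soybean) != "Class"], table)

# Class distribution, shown for reference
table(Soybean$Class)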

               2-4-d-injury         alternarialeaf-spot 
                         16                          91 
                anthracnose            bacterial-blight 
                         44                          20 
          bacterial-pustule                  brown-spot 
                         20                          92 
             brown-stem-rot                charcoal-rot 
                         44                          20 
              cyst-nematode diaporthe-pod-&-stem-blight 
                         14                          15 
      diaporthe-stem-canker                downy-mildew 
                         20                          20 
         frog-eye-leaf-spot            herbicide-injury 
                         91                           8 
     phyllosticta-leaf-spot            phytophthora-rot 
                         20                          88 
             powdery-mildew           purple-seed-stain 
                         20                          20 
       rhizoctonia-root-rot 
                         20 
library(caret)

nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv
                 freqRatio percentUnique zeroVar   nzv
Class             1.010989     2.7818448   FALSE FALSE
date              1.137405     1.0248902   FALSE FALSE
plant.stand       1.208191     0.2928258   FALSE FALSE
precip            4.098214     0.4392387   FALSE FALSE
temp              1.879397     0.4392387   FALSE FALSE
hail              3.425197     0.2928258   FALSE FALSE
crop.hist         1.004587     0.5856515   FALSE FALSE
area.dam          1.213904     0.5856515   FALSE FALSE
sever             1.651282     0.4392387   FALSE FALSE
seed.tmt          1.373874     0.4392387   FALSE FALSE
germ              1.103627     0.4392387   FALSE FALSE
plant.growth      1.951327     0.2928258   FALSE FALSE
leaves            7.870130     0.2928258   FALSE FALSE
leaf.halo         1.547511     0.4392387   FALSE FALSE
leaf.marg         1.615385     0.4392387   FALSE FALSE
leaf.size         1.479638     0.4392387   FALSE FALSE
leaf.shread       5.072917     0.2928258   FALSE FALSE
leaf.malf        12.311111     0.2928258   FALSE FALSE
leaf.mild        26.750000     0.4392387   FALSE  TRUE
stem              1.253378     0.2928258   FALSE FALSE
lodging          12.380952     0.2928258   FALSE FALSE
stem.cankers      1.984293     0.5856515   FALSE FALSE
canker.lesion     1.807910     0.5856515   FALSE FALSE
fruiting.bodies   4.548077     0.2928258   FALSE FALSE
ext.decay         3.681481     0.4392387   FALSE FALSE
mycelium        106.500000     0.2928258   FALSE  TRUE
int.discolor     13.204545     0.4392387   FALSE FALSE
sclerotia        31.250000     0.2928258   FALSE  TRUE
fruit.pods        3.130769     0.5856515   FALSE FALSE
fruit.spots       3.450000     0.5856515   FALSE FALSE
seed              4.139130     0.2928258   FALSE FALSE
mold.growth       7.820896     0.2928258   FALSE FALSE
seed.discolor     8.015625     0.2928258   FALSE FALSE
seed.size         9.016949     0.2928258   FALSE FALSE
shriveling       14.184211     0.2928258   FALSE FALSE
roots             6.406977     0.4392387   FALSE FALSE

Inspecting the categorical predictors shows that several have highly unbalanced level distributions. nearZeroVar() flags leaf.mild, mycelium, and sclerotia as near-zero-variance predictors: in each, one level dominates, with frequency ratios (most common level to second most common) of 26.75, 106.5, and 31.25, respectively. No predictor has strictly zero variance, but these three are degenerate in the sense discussed in the chapter. Several others, such as shriveling (14.18), int.discolor (13.20), lodging (12.38), and leaf.malf (12.31), are also quite unbalanced but fall below the default frequency-ratio cutoff of 19 (95/5).
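The two metrics in the table above can be reproduced by hand, which makes the flagging rule explicit. This is a rough base R sketch, not caret's implementation: `nzv_metrics` is a hypothetical helper, and the cutoffs are assumed to mirror the nearZeroVar() defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Frequency ratio and percent unique for one predictor (hypothetical helper).
# Cutoffs are assumed to mirror the nearZeroVar() defaults.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)   # table() drops NA by default
  # Ratio of the most common level to the second most common level
  freq_ratio <- if (length(counts) >= 2) counts[[1]] / counts[[2]] else Inf
  # Distinct non-missing values as a percentage of all samples
  pct_unique <- 100 * length(unique(x[!is.na(x)])) / length(x)
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = pct_unique,
    nzv           = freq_ratio > freq_cut && pct_unique <= unique_cut
  )
}

# Toy predictors for illustration only
rare     <- c(rep("0", 96), rep("1", 4))   # one dominant level
balanced <- rep(c("0", "1"), 50)           # evenly split

nzv_metrics(rare)       # freqRatio 24, flagged
nzv_metrics(balanced)   # freqRatio 1, not flagged
```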

3.2 Part B

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

colSums(is.na(Soybean))
          Class            date     plant.stand          precip            temp 
              0               1              36              38              30 
           hail       crop.hist        area.dam           sever        seed.tmt 
            121              16               1             121             121 
           germ    plant.growth          leaves       leaf.halo       leaf.marg 
            112              16               0              84              84 
      leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
             84             100              84             108              16 
        lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
            121              38              38             106              38 
       mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
             38              38              38              84             106 
           seed     mold.growth   seed.discolor       seed.size      shriveling 
             92              92             106              92             106 
          roots 
             31 
missing_pct <- colSums(is.na(Soybean)) / nrow(Soybean) * 100
round(missing_pct, 2)
          Class            date     plant.stand          precip            temp 
           0.00            0.15            5.27            5.56            4.39 
           hail       crop.hist        area.dam           sever        seed.tmt 
          17.72            2.34            0.15           17.72           17.72 
           germ    plant.growth          leaves       leaf.halo       leaf.marg 
          16.40            2.34            0.00           12.30           12.30 
      leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
          12.30           14.64           12.30           15.81            2.34 
        lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
          17.72            5.56            5.56           15.52            5.56 
       mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
           5.56            5.56            5.56           12.30           15.52 
           seed     mold.growth   seed.discolor       seed.size      shriveling 
          13.47           13.47           15.52           13.47           15.52 
          roots 
           4.54 
sort(missing_pct, decreasing = TRUE)
           hail           sever        seed.tmt         lodging            germ 
     17.7159590      17.7159590      17.7159590      17.7159590      16.3982430 
      leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
     15.8125915      15.5197657      15.5197657      15.5197657      15.5197657 
    leaf.shread            seed     mold.growth       seed.size       leaf.halo 
     14.6412884      13.4699854      13.4699854      13.4699854      12.2986823 
      leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
     12.2986823      12.2986823      12.2986823      12.2986823       5.5636896 
   stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
      5.5636896       5.5636896       5.5636896       5.5636896       5.5636896 
      sclerotia     plant.stand           roots            temp       crop.hist 
      5.5636896       5.2708638       4.5387994       4.3923865       2.3426061 
   plant.growth            stem            date        area.dam           Class 
      2.3426061       2.3426061       0.1464129       0.1464129       0.0000000 
         leaves 
      0.0000000 
library(dplyr)

Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(),
                   ~ mean(is.na(.)) * 100))
# A tibble: 19 × 36
   Class   date plant.stand precip  temp  hail crop.hist area.dam sever seed.tmt
   <fct>  <dbl>       <dbl>  <dbl> <dbl> <dbl>     <dbl>    <dbl> <dbl>    <dbl>
 1 2-4-d…  6.25         100    100   100 100         100     6.25 100      100  
 2 alter…  0              0      0     0   0           0     0      0        0  
 3 anthr…  0              0      0     0   0           0     0      0        0  
 4 bacte…  0              0      0     0   0           0     0      0        0  
 5 bacte…  0              0      0     0   0           0     0      0        0  
 6 brown…  0              0      0     0   0           0     0      0        0  
 7 brown…  0              0      0     0   0           0     0      0        0  
 8 charc…  0              0      0     0   0           0     0      0        0  
 9 cyst-…  0            100    100   100 100           0     0    100      100  
10 diapo…  0             40      0     0 100           0     0    100      100  
11 diapo…  0              0      0     0   0           0     0      0        0  
12 downy…  0              0      0     0   0           0     0      0        0  
13 frog-…  0              0      0     0   0           0     0      0        0  
14 herbi…  0              0    100     0 100           0     0    100      100  
15 phyll…  0              0      0     0   0           0     0      0        0  
16 phyto…  0              0      0     0  77.3         0     0     77.3     77.3
17 powde…  0              0      0     0   0           0     0      0        0  
18 purpl…  0              0      0     0   0           0     0      0        0  
19 rhizo…  0              0      0     0   0           0     0      0        0  
# ℹ 26 more variables: germ <dbl>, plant.growth <dbl>, leaves <dbl>,
#   leaf.halo <dbl>, leaf.marg <dbl>, leaf.size <dbl>, leaf.shread <dbl>,
#   leaf.malf <dbl>, leaf.mild <dbl>, stem <dbl>, lodging <dbl>,
#   stem.cankers <dbl>, canker.lesion <dbl>, fruiting.bodies <dbl>,
#   ext.decay <dbl>, mycelium <dbl>, int.discolor <dbl>, sclerotia <dbl>,
#   fruit.pods <dbl>, fruit.spots <dbl>, seed <dbl>, mold.growth <dbl>,
#   seed.discolor <dbl>, seed.size <dbl>, shriveling <dbl>, roots <dbl>

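The missing values are not spread uniformly across the predictors. hail, sever, seed.tmt, and lodging are each missing in 121 of the 683 samples (17.7%), in line with the roughly 18% figure in the exercise, and germ, leaf.mild, fruiting.bodies, fruit.spots, seed.discolor, and shriveling are each missing in more than 15% of samples. At the other end, Class and leaves are complete, and date and area.dam each have a single missing value. Across all cells of the data frame, the column counts sum to 2,337 missing values out of 24,588 (about 9.5%).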

Breaking the missingness down by class shows that it is concentrated in a handful of disease types. In the columns displayed, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury are 100% missing for hail, sever, and seed.tmt, and phytophthora-rot is 77.3% missing for the same predictors, while the remaining classes show no missing values in those columns. The missingness is therefore not completely at random; it depends strongly on the disease class.
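A compact way to summarize this is to count incomplete rows per class, and to separate the share of missing cells from the share of rows with at least one missing value. The sketch below uses base R on a toy data frame; `missing_by_class` and the toy columns are hypothetical names used only for illustration.

```r
# Count rows with at least one missing predictor, per class (hypothetical helper)
missing_by_class <- function(df, class_col) {
  predictors <- df[, names(df) != class_col, drop = FALSE]
  incomplete <- !complete.cases(predictors)
  groups <- df[[class_col]]
  data.frame(
    class      = names(tapply(incomplete, groups, sum)),
    n          = as.vector(tapply(incomplete, groups, length)),
    incomplete = as.vector(tapply(incomplete, groups, sum)),
    stringsAsFactors = FALSE
  )
}

# Toy data for illustration only: class "a" is always incomplete, "b" never
toy <- data.frame(
  Class = c("a", "a", "b", "b", "b"),
  x     = c(1, NA, 1, 1, 1),
  y     = c(NA, NA, 2, 2, 2)
)

missing_by_class(toy, "Class")
mean(is.na(toy))             # share of missing cells
mean(!complete.cases(toy))   # share of rows with at least one missing value
```

On the homework data the call would be `missing_by_class(Soybean, "Class")`.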

3.2 Part C

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Soybean_missing_level <- Soybean

# Class is the first column, so start at column 2: this covers every predictor
# (including roots, the last column) and leaves the outcome untouched
for (j in 2:ncol(Soybean_missing_level)) {
  x <- Soybean_missing_level[[j]]
  
  levels(x) <- c(levels(x), "Missing")
  
  x[is.na(x)] <- "Missing"
  
  Soybean_missing_level[[j]] <- x
}

# Check every column, not just a subset
colSums(is.na(Soybean_missing_level))
          Class            date     plant.stand          precip            temp 
              0               0               0               0               0 
           hail       crop.hist        area.dam           sever        seed.tmt 
              0               0               0               0               0 
           germ    plant.growth          leaves       leaf.halo       leaf.marg 
              0               0               0               0               0 
      leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
              0               0               0               0               0 
        lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
              0               0               0               0               0 
       mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
              0               0               0               0               0 
           seed     mold.growth   seed.discolor       seed.size      shriveling 
              0               0               0               0               0 
          roots 
              0 

Because a substantial share of the values is missing and the amount of missingness varies strongly across classes, deleting incomplete rows (listwise deletion) would discard important information and could bias the data set, potentially removing most or all samples of some classes. Instead, the missing values were handled by adding an explicit “Missing” level to each predictor, which retains whatever information the missingness pattern carries. In addition, the near-zero-variance predictors identified in Part A (leaf.mild, mycelium, and sclerotia) may be removed to reduce noise.
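For comparison, the imputation route mentioned in the exercise could be as simple as filling each predictor with its most frequent level. The sketch below is not what was done above; `impute_mode` is a hypothetical helper shown on a toy factor. Note that it discards the missingness pattern, which Part B suggests is informative, so the “Missing” level approach is likely the safer choice here.

```r
# Replace NA with the most frequent observed level (hypothetical helper).
# Ties are broken in favor of the first level in table() order.
impute_mode <- function(x) {
  counts <- table(x)                           # NA excluded by default
  mode_level <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_level
  x
}

# Toy factor for illustration only
toy_factor <- factor(c("0", "1", "1", NA, "1", NA))
filled <- impute_mode(toy_factor)
filled
```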