Exercises

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

'data.frame': 214 obs. of 10 variables:
 $ RI  : num 1.52 1.52 1.52 1.52 1.52 …
 $ Na  : num 13.6 13.9 13.5 13.2 13.3 …
 $ Mg  : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 …
 $ Al  : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 …
 $ Si  : num 71.8 72.7 73 72.6 73.1 …
 $ K   : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 …
 $ Ca  : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 …
 $ Ba  : num 0 0 0 0 0 0 0 0 0 0 …
 $ Fe  : num 0 0 0 0 0.26 0 0 0 0.11 …
 $ Type: Factor w/ 6 levels "1", "2", "3", "5", …

library(mlbench)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
library(corrplot)
## corrplot 0.95 loaded
data(Glass)

glimpse(Glass)
## Rows: 214
## Columns: 10
## $ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.…
## $ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13…
## $ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,…
## $ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,…
## $ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72…
## $ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,…
## $ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,…
## $ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,…
## $ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

    glass_pred <- Glass %>% select(-Type)
    
    glass_long <- glass_pred %>%
      pivot_longer(cols = everything(),
                   names_to = "Variable",
                   values_to = "Value")
    
    ggplot(glass_long, aes(Value)) +
      geom_histogram(bins = 20, fill = "skyblue", color = "black") +
      facet_wrap(~Variable, scales = "free") +
      theme_minimal()

    ggplot(glass_long, aes(x = Variable, y = Value)) +
      geom_boxplot(fill = "tomato") +
      coord_flip() +
      theme_minimal()

    The predictor variables show a variety of distributional shapes. RI, Na, Si, and Al appear approximately symmetric, while Ba, Fe, and K are strongly right-skewed, with many small (often exactly zero) values and a few large observations; Ca is right-skewed to a lesser degree. The boxplots reveal several extreme observations, particularly for Ba and Fe, which likely correspond to rare but valid glass compositions rather than recording errors.

    Scatterplots and the correlation matrix indicate relationships among predictors. In particular, RI is strongly positively related to Ca and negatively related to Si (and, more weakly, to Mg), and several other element pairs exhibit moderate correlations. This is expected: the composition variables are percentages of the same sample, so an increase in one component tends to be accompanied by decreases in others. The predictors are therefore not independent.

    Overall, the visualizations show non-normality, skewness, outliers, and correlations among predictors, all of which should be considered before building a classification model.
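The correlation matrix and scatterplots mentioned above can be produced along the following lines. This is a minimal sketch, assuming the mlbench and corrplot packages are installed; the plotting options (`method`, `type`, `order`) are illustrative choices, not the ones used for the original analysis.

```r
library(mlbench)
library(corrplot)
data(Glass)

# Keep the nine numeric predictors; Type is the outcome
glass_pred <- Glass[, names(Glass) != "Type"]

# Pairwise Pearson correlations between predictors
cor_mat <- cor(glass_pred)
round(cor_mat, 2)

# Correlation plot with coefficients printed, ordered by hierarchical clustering
corrplot(cor_mat, method = "number", type = "upper", order = "hclust")

# Base-graphics scatterplot matrix of all predictor pairs
pairs(glass_pred, pch = 19, cex = 0.4)
```

Ordering by clustering tends to place related elements next to each other, which makes groups of correlated predictors easier to spot.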

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

    The visualizations show that the dataset contains both outliers and skewed predictors. In particular, Ba, Fe, and K are highly right-skewed and contain extreme values, while RI, Na, and Si are roughly symmetric. These characteristics suggest that a transformation and scaling could improve the performance of subsequent classification models, especially those sensitive to skewness and scale. Because Ba and Fe contain many exact zeros, where a plain log is undefined, log(x + 1) or a Yeo-Johnson transformation is a safer choice than log(x).
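The visual impressions can be backed with numbers. The sketch below computes a moment-based skewness for each predictor and counts the points that fall outside the boxplot whiskers. The `skewness` helper is a hypothetical function written here for illustration (packages such as e1071 provide their own versions, which may use a slightly different formula).

```r
library(mlbench)
data(Glass)

glass_pred <- Glass[, names(Glass) != "Type"]

# Hypothetical helper: moment-based sample skewness (no extra packages needed)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Skewness per predictor, most right-skewed first
skew <- sapply(glass_pred, skewness)
round(sort(skew, decreasing = TRUE), 2)

# Number of points beyond the boxplot whiskers (1.5 * IQR rule) per predictor
n_out <- sapply(glass_pred, function(x) length(boxplot.stats(x)$out))
n_out
```

Values near zero indicate approximate symmetry; large positive values indicate a long right tail. For a predictor that is zero in most samples, the 1.5 * IQR rule flags every non-zero value, so the outlier counts should be read with that in mind.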

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

    Yes. Because several predictors (particularly Ba, Fe, and K) are strongly right-skewed and contain extreme values, a transformation that compresses the upper tail is appropriate to reduce skewness and the influence of outliers. A plain log is undefined at zero, and Ba and Fe contain many exact zeros, so log(x + 1) or a Yeo-Johnson transformation is the safer choice. Additionally, the predictors are measured on very different scales, so centering and scaling is advisable for models based on distances or dot products (for example, K-nearest neighbors or support vector machines), where large-scale variables would otherwise dominate. Applying the transformation followed by standardization should improve the stability, and often the performance, of such models.
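A minimal sketch of this preprocessing in base R, assuming the three predictors named above are the ones to transform. It uses `log1p()`, that is log(1 + x), so that zeros remain finite, and then standardizes every column with `scale()`.

```r
library(mlbench)
data(Glass)

glass_pred <- Glass[, names(Glass) != "Type"]

# Assumption: these are the columns judged strongly right-skewed in the text
skewed <- c("Ba", "Fe", "K")

# log1p(x) = log(1 + x) is defined at zero, unlike log(x)
glass_tr <- glass_pred
glass_tr[skewed] <- lapply(glass_tr[skewed], log1p)

# Center each column to mean 0 and scale to standard deviation 1
glass_std <- as.data.frame(scale(glass_tr))

round(colMeans(glass_std), 3)
round(sapply(glass_std, sd), 3)
```

A transformation cannot spread out a large spike of identical zero values, so Ba and Fe will remain skewed to some degree afterwards. caret's `preProcess()` with `method = c("YeoJohnson", "center", "scale")` is an alternative that estimates the transformation from the data.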

3.2. Soybean Data Exercise

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

library(mlbench)
data(Soybean)

dim(Soybean)
## [1] 683  36
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

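Several predictors are degenerate or nearly so. For mycelium, sclerotia, and leaf.mild, a single level accounts for about 93–99% of the non-missing observations and the remaining levels are very rare. lodging and shriveling are also heavily imbalanced (about 93% in one level), though less extreme. These predictors vary little across samples and are likely to provide limited discriminatory information for predicting disease class. Note that the proportions below are computed over non-missing values only, since table() drops NA by default.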

for(i in names(Soybean)){
  cat("\n", i, "\n")
  print(prop.table(table(Soybean[[i]])))
}
## 
##  Class 
## 
##                2-4-d-injury         alternarialeaf-spot 
##                  0.02342606                  0.13323572 
##                 anthracnose            bacterial-blight 
##                  0.06442167                  0.02928258 
##           bacterial-pustule                  brown-spot 
##                  0.02928258                  0.13469985 
##              brown-stem-rot                charcoal-rot 
##                  0.06442167                  0.02928258 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                  0.02049780                  0.02196193 
##       diaporthe-stem-canker                downy-mildew 
##                  0.02928258                  0.02928258 
##          frog-eye-leaf-spot            herbicide-injury 
##                  0.13323572                  0.01171303 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                  0.02928258                  0.12884334 
##              powdery-mildew           purple-seed-stain 
##                  0.02928258                  0.02928258 
##        rhizoctonia-root-rot 
##                  0.02928258 
## 
##  date 
## 
##          0          1          2          3          4          5          6 
## 0.03812317 0.10997067 0.13636364 0.17302053 0.19208211 0.21847507 0.13196481 
## 
##  plant.stand 
## 
##         0         1 
## 0.5471406 0.4528594 
## 
##  precip 
## 
##         0         1         2 
## 0.1147287 0.1736434 0.7116279 
## 
##  temp 
## 
##         0         1         2 
## 0.1225115 0.5727412 0.3047473 
## 
##  hail 
## 
##         0         1 
## 0.7740214 0.2259786 
## 
##  crop.hist 
## 
##          0          1          2          3 
## 0.09745127 0.24737631 0.32833583 0.32683658 
## 
##  area.dam 
## 
##         0         1         2         3 
## 0.1803519 0.3328446 0.2126100 0.2741935 
## 
##  sever 
## 
##          0          1          2 
## 0.34697509 0.57295374 0.08007117 
## 
##  seed.tmt 
## 
##          0          1          2 
## 0.54270463 0.39501779 0.06227758 
## 
##  germ 
## 
##         0         1         2 
## 0.2889667 0.3730298 0.3380035 
## 
##  plant.growth 
## 
##         0         1 
## 0.6611694 0.3388306 
## 
##  leaves 
## 
##         0         1 
## 0.1127379 0.8872621 
## 
##  leaf.halo 
## 
##          0          1          2 
## 0.36894825 0.06010017 0.57095159 
## 
##  leaf.marg 
## 
##          0          1          2 
## 0.59599332 0.03505843 0.36894825 
## 
##  leaf.size 
## 
##         0         1         2 
## 0.0851419 0.5459098 0.3689482 
## 
##  leaf.shread 
## 
##         0         1 
## 0.8353345 0.1646655 
## 
##  leaf.malf 
## 
##          0          1 
## 0.92487479 0.07512521 
## 
##  leaf.mild 
## 
##          0          1          2 
## 0.93043478 0.03478261 0.03478261 
## 
##  stem 
## 
##         0         1 
## 0.4437781 0.5562219 
## 
##  lodging 
## 
##         0         1 
## 0.9252669 0.0747331 
## 
##  stem.cankers 
## 
##          0          1          2          3 
## 0.58759690 0.06046512 0.05581395 0.29612403 
## 
##  canker.lesion 
## 
##         0         1         2         3 
## 0.4961240 0.1286822 0.2744186 0.1007752 
## 
##  fruiting.bodies 
## 
##         0         1 
## 0.8197574 0.1802426 
## 
##  ext.decay 
## 
##          0          1          2 
## 0.77054264 0.20930233 0.02015504 
## 
##  mycelium 
## 
##           0           1 
## 0.990697674 0.009302326 
## 
##  int.discolor 
## 
##          0          1          2 
## 0.90077519 0.06821705 0.03100775 
## 
##  sclerotia 
## 
##          0          1 
## 0.96899225 0.03100775 
## 
##  fruit.pods 
## 
##          0          1          2          3 
## 0.67946578 0.21702838 0.02337229 0.08013356 
## 
##  fruit.spots 
## 
##          0          1          2          4 
## 0.59792028 0.12998267 0.09878683 0.17331023 
## 
##  seed 
## 
##         0         1 
## 0.8054146 0.1945854 
## 
##  mold.growth 
## 
##         0         1 
## 0.8866328 0.1133672 
## 
##  seed.discolor 
## 
##         0         1 
## 0.8890815 0.1109185 
## 
##  seed.size 
## 
##         0         1 
## 0.9001692 0.0998308 
## 
##  shriveling 
## 
##          0          1 
## 0.93414211 0.06585789 
## 
##  roots 
## 
##          0          1          2 
## 0.84509202 0.13190184 0.02300613
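As a rough numerical check on the tables above, the sketch below computes the ratio of the most common level to the second most common level for each predictor and flags those above 95/5 = 19. That cutoff is the default `freqCut` in caret's `nearZeroVar()`, which additionally requires a low percentage of unique values (satisfied by all of these factors). The `freq_ratio` helper is a hypothetical function written for illustration.

```r
library(mlbench)
data(Soybean)

# Hypothetical helper: count of the most common level divided by the
# count of the second most common level (NA values are ignored by table())
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2 || counts[2] == 0) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Exclude the outcome and compute the ratio for every predictor
predictors <- Soybean[, names(Soybean) != "Class"]
ratios <- sapply(predictors, freq_ratio)

# Predictors whose dominant level outnumbers the runner-up by more than 19 to 1
sort(ratios[ratios > 19], decreasing = TRUE)
```

With this cutoff only leaf.mild, mycelium, and sclerotia are flagged; lodging and shriveling have ratios of roughly 12 to 14, so they are imbalanced but fall below the threshold.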
  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

    Approximately 18% of the observations contain at least one missing value. Missingness is not evenly distributed across predictors: some variables have substantially more missing values than others. The pattern also appears to be related to the disease class, which suggests the values are not missing completely at random and that the missingness itself is informative. One plausible explanation is that some diseases can be diagnosed visually, so certain measurements were not recorded for those samples.
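These statements can be checked directly. The following base R sketch reports the share of incomplete rows, the proportion of missing values per predictor, and the proportion of incomplete rows within each class; the object names are illustrative.

```r
library(mlbench)
data(Soybean)

# Share of rows with at least one missing value
incomplete <- !complete.cases(Soybean)
mean(incomplete)

# Proportion of missing values per column, largest first
miss_by_pred <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
round(head(miss_by_pred, 10), 3)

# Proportion of incomplete rows within each disease class
miss_by_class <- tapply(incomplete, Soybean$Class, mean)
round(sort(miss_by_class, decreasing = TRUE), 3)
```

If the per-class proportions are close to 0 for most classes and large for a few, missingness is concentrated in those classes, which is the pattern described above.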

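  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

    Deleting incomplete observations is unattractive: the data set is small, and because missingness is tied to class, dropping rows would disproportionately thin out the affected classes. Predictors with excessive missingness may be removed. For the remaining predictors, one option is class-conditional mode imputation, which replaces each missing value with the most frequent category within the same disease class. This preserves relationships between predictors and disease outcomes. The caveat is that it uses the outcome, so it can only be applied where the class is known (for example, in the training set), and it should be carried out inside resampling to avoid optimistic performance estimates.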
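A minimal sketch of class-conditional mode imputation in base R. Both helpers (`mode_level`, `impute_mode_by_class`) are hypothetical names introduced here. One added assumption: when a predictor has no observed values within a class, the sketch falls back to the overall mode, since a class-specific mode does not exist in that case.

```r
library(mlbench)
data(Soybean)

# Hypothetical helper: most frequent non-missing level, or NA if none observed
mode_level <- function(x) {
  tab <- table(x)                      # table() drops NA by default
  if (sum(tab) == 0) return(NA_character_)
  names(tab)[which.max(tab)]
}

# Hypothetical helper: fill NAs with the within-class mode,
# falling back to the overall mode when a class has no observed values
impute_mode_by_class <- function(df, class_col = "Class") {
  out <- df
  classes <- as.character(df[[class_col]])
  for (v in setdiff(names(df), class_col)) {
    overall <- mode_level(df[[v]])
    class_modes <- tapply(df[[v]], df[[class_col]], mode_level)
    fill <- class_modes[classes]       # mode of each row's own class
    fill[is.na(fill)] <- overall       # fallback for fully missing classes
    miss <- is.na(df[[v]])
    out[[v]][miss] <- fill[miss]       # factor levels are preserved
  }
  out
}

soy_imputed <- impute_mode_by_class(Soybean)
sum(is.na(soy_imputed))
```

Observed values are left untouched and the factor levels (including ordered factors) are preserved, so the imputed data frame can be passed to the same modeling code as the original.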

3.3. QSAR Blood-Brain Barrier Exercise

Chapter 5 introduces Quantitative Structure-Activity Relationship (QSAR) modeling where the characteristics of a chemical compound are used to predict other chemical properties. The caret package contains a QSAR data set from Mente and Lombardo (2005). Here, the ability of a chemical to permeate the blood-brain barrier was experimentally determined for 208 compounds. 134 descriptors were measured for each compound.

  1. Start R and use these commands to load the data:

    library(caret)
    data(BloodBrain)
    # use ?BloodBrain to see more details

    The numeric outcome is contained in the vector logBBB while the predictors are in the data frame bbbDescr.

    library(caret)
    ## Loading required package: lattice
    ## 
    ## Attaching package: 'caret'
    ## The following object is masked from 'package:purrr':
    ## 
    ##     lift
    data(BloodBrain)
    
    # Check for zero / near-zero variance predictors
    nzv <- nearZeroVar(bbbDescr)
    length(nzv)
    ## [1] 7
    nzv
    ## [1]  3 16 17 22 25 50 60
  2. Do any of the individual predictors have degenerate distributions?

    Yes. caret's nearZeroVar() flagged seven predictors (columns 3, 16, 17, 22, 25, 50, and 60 of bbbDescr) as near-zero variance. These descriptors take almost the same value for nearly all compounds and therefore carry little predictive information, so they are reasonable candidates for removal before modeling.
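To see why each predictor was flagged, `nearZeroVar()` can return its underlying metrics instead of column indices. A short sketch, assuming the default `freqCut` and `uniqueCut` settings:

```r
library(caret)
data(BloodBrain)

# saveMetrics = TRUE returns one row per predictor with the columns
# freqRatio, percentUnique, zeroVar, and nzv
nzv_metrics <- nearZeroVar(bbbDescr, saveMetrics = TRUE)

# Show only the predictors flagged as near-zero variance
nzv_metrics[nzv_metrics$nzv, ]
```

A large freqRatio together with a small percentUnique means one value dominates the column, which is the sense in which these distributions are degenerate.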

  3. Generally speaking, are there strong relationships between the predictor data? If so, how could correlations in the predictor set be reduced? Does this have a dramatic effect on the number of predictors available for modeling?

    data(BloodBrain)
    
    # 1. Find near-zero variance predictors
    nzv <- nearZeroVar(bbbDescr)
    
    # 2. Remove them
    bbbDescr_clean <- bbbDescr[, -nzv]
    
    # 3. Compute correlation matrix
    cor_matrix <- cor(bbbDescr_clean)
    
    # 4. Identify highly correlated predictors
    highCorr <- findCorrelation(cor_matrix, cutoff = 0.90)
    
    # How many?
    length(highCorr)
    ## [1] 35
    # 5. Remove correlated predictors
    bbbDescr_reduced <- bbbDescr_clean[, -highCorr]
    
    # Check dimensions
    dim(bbbDescr)
    ## [1] 208 134
    dim(bbbDescr_clean)
    ## [1] 208 127
    dim(bbbDescr_reduced)
    ## [1] 208  92

Do any of the individual predictors have degenerate distributions?
Yes. Using nearZeroVar(), 7 predictors were identified as near-zero variance. These variables had almost no variability across the 208 compounds and therefore contained little useful predictive information. They were removed before modeling.

Are there strong relationships between the predictor data?
Yes. A correlation analysis showed substantial multicollinearity among the descriptors. Using a cutoff of 0.90, findCorrelation() flagged 35 predictors for removal because they were highly correlated with other predictors. This is expected, since many QSAR descriptors measure similar chemical properties such as molecular size, polarity, and surface area.
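The extent of the collinearity can be summarized by looking at the pairwise correlations themselves. The sketch below repeats the near-zero variance filter and then counts the predictor pairs whose absolute correlation exceeds the 0.90 cutoff; the object names are illustrative.

```r
library(caret)
data(BloodBrain)

# Drop near-zero variance predictors first (guarding against an empty result)
nzv <- nearZeroVar(bbbDescr)
descr <- if (length(nzv) > 0) bbbDescr[, -nzv] else bbbDescr

# Absolute pairwise correlations, each pair counted once
cor_matrix <- cor(descr)
abs_cor <- abs(cor_matrix[upper.tri(cor_matrix)])

summary(abs_cor)
sum(abs_cor > 0.90)   # number of predictor pairs above the cutoff
```

The number of pairs above the cutoff is generally larger than the number of predictors removed, because removing one predictor can resolve several high-correlation pairs at once.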

How could correlations be reduced?
Correlations were reduced by removing highly correlated predictors using findCorrelation. Other possible methods include principal component analysis (PCA), partial least squares (PLS), or regularized regression methods.
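As an illustration of the PCA alternative, the sketch below uses base R's `prcomp()` on the centered and scaled descriptors and reports how many components are needed to retain 95% of the variance. The 95% threshold is an assumption chosen for illustration; caret's `preProcess()` with `method = "pca"` offers a similar option through its `thresh` argument.

```r
library(caret)
data(BloodBrain)

# Remove near-zero variance predictors so every column can be scaled
nzv <- nearZeroVar(bbbDescr)
descr <- if (length(nzv) > 0) bbbDescr[, -nzv] else bbbDescr

# PCA on standardized predictors
pca <- prcomp(descr, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the leading components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the 95% threshold
n_comp_95 <- which(cum_var >= 0.95)[1]
n_comp_95
```

Unlike filtering with findCorrelation, PCA keeps information from all descriptors but replaces them with uncorrelated linear combinations, which are harder to interpret chemically.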

Does this have a dramatic effect on the number of predictors available for modeling?
Yes. The dataset initially contained 134 predictors. After removing 7 near-zero variance predictors and 35 highly correlated predictors, 92 predictors remained, a reduction of 42 predictors or about 31%. Because the removed predictors were largely redundant with those that were kept, little chemical information should be lost, and the smaller set can improve stability for models that are sensitive to collinearity.