In this homework assignment I submit exercises 3.1 and 3.2 from Kuhn and Johnson's Applied Predictive Modeling.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(GGally)
library(tidyverse)
ggpairs(Glass, upper = list(continuous = wrap("cor", size = 2)))
Using the ggpairs function we can see the distributions of the predictors along the diagonal, the correlation coefficients for each pair of predictors in the upper triangle, and each pair's scatterplot in the lower triangle.
Some of the distribution plots show that certain elements have a strong tendency to register as 0: Ba, Fe, and potentially K. Because most samples sit at 0, the relatively few non-zero values for these elements may register as outliers. RI, Na, Al, Si, and Ca appear relatively normally distributed, while K, Ba, and Fe are very strongly right-skewed.
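These skew claims can also be checked numerically. As a quick sketch, the skewness function from the e1071 package (the same function the chapter uses) can be applied to each predictor; column 10 is the Type outcome and is excluded:

library(e1071)
# Sample skewness of each numeric predictor; large positive values
# confirm the strong right skews seen for K, Ba, and Fe
sapply(Glass[, -10], skewness)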
A Box-Cox transformation may help make the distributions of certain predictors, such as RI, Mg, Na, and Ca, more normal (noting that Box-Cox applies only to strictly positive values). Another transformation to consider is principal component analysis, which can reduce the number of collinear predictors such as Ca and RI.
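As a minimal sketch of both ideas, caret's preProcess can chain the two steps; it estimates Box-Cox only for predictors where the transformation is defined, then replaces the correlated predictors with orthogonal components. The object names pp and Glass_pca are illustrative:

library(caret)
# Box-Cox (where applicable), then center/scale, then PCA;
# column 10 is the Type outcome, so it is excluded
pp <- preProcess(Glass[, -10],
                 method = c("BoxCox", "center", "scale", "pca"))
Glass_pca <- predict(pp, Glass[, -10])
head(Glass_pca)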
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
data("Soybean")
#help(Soybean)
par(mar = c(2, 2, 2, 2)) # Adjust margin size (bottom, left, top, right)
# Coerce each factor to its numeric level codes and plot a histogram of
# level frequencies; column 1 is the Class outcome, so it is skipped
for (col in 2:ncol(Soybean)) {
  hist(as.numeric(Soybean[, col]),
       main = colnames(Soybean)[col],
       xlab = colnames(Soybean)[col])
}
## Using the nearZeroVar function from the chapter to return the names of predictors that might be degenerate
library(caret)
nearZeroVar(Soybean, names = TRUE)
## [1] "leaf.mild" "mycelium" "sclerotia"
Degenerate distributions, in the sense discussed in the chapter, are distributions of predictors with zero variance (a predictor variable with a single unique value) or near-zero variance. There are a few near-zero-variance predictors here that can be considered degenerate: leaf.mild, mycelium, and sclerotia.
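To see why these three are flagged, nearZeroVar's saveMetrics argument (part of caret's API) returns the frequency-ratio and percent-unique metrics behind the flag; a quick sketch:

# Inspect the metrics that drive the near-zero-variance decision
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ] # only the flagged predictors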
I would not eliminate predictors, as there are NA values across almost all of the predictors, and no single predictor appears to be a large outlier in terms of its NA count.
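A quick way to verify that missingness is spread across the predictors (a sketch):

# Count NAs per column, most-missing first
sort(colSums(is.na(Soybean)), decreasing = TRUE)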
For handling this data set's missing values we first have to remember that the predictor variables are categorical, not continuous, so a numeric mean or median is not meaningful. I would instead impute the most frequent (modal) value within similar groups; for example, if a row is missing its fruit.pods value, we could fill it with the most common fruit.pods value among other rows with the same Class, date, etc. A sketch of this idea follows.
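The following is a minimal sketch of mode imputation, assuming (as a simplification of the "same Class, date, etc." idea) that grouping by Class alone is enough to define similar rows; impute_mode and Soybean_imputed are hypothetical names:

library(tidyverse)

# Replace NAs in a factor with the group's most frequent non-NA level
impute_mode <- function(x) {
  tab <- table(x)                 # table() drops NAs for factors
  if (length(tab) == 0) return(x) # every value missing in this group
  replace(x, is.na(x), names(which.max(tab)))
}

Soybean_imputed <- Soybean |>
  group_by(Class) |>
  mutate(across(everything(), impute_mode)) |>
  ungroup()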