All exercises can be found in “Applied Predictive Modelling” written by Max Kuhn and Kjell Johnson.

Setup

The following packages are required to rerun this .rmd file:

seed_num <- 200
library(tidyverse)
library(AppliedPredictiveModeling)
library(corrplot)
library(mlbench)
library(e1071)
library(caret)
library(fpp3)

Exercise 3.1

Description

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labelled as one of seven class categories. There are nine predictors, including the refractive index and percentage of the eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Part (a)

Description

Using visualizations, explore the predictor variables to understand these distributions as well as the relationships between predictors.

Solution

First, the cell below separates the data in Glass into its predictors (X) and target (y):

X <- select(Glass, -Type)
y <- Glass$Type

The cell below plots histograms of all the predictor variables in order to view their distributions:

plt_data <- X %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

ggplot(plt_data, aes(x = value)) +
  geom_histogram(fill = 'steel blue', bins=30) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histograms of All Predictors", x = "Value", y = "Frequency") +
  theme_minimal()

Next, the above plot is recreated, but this time includes separate distributions for each category of the target variable:

plt_data <- Glass %>%
  pivot_longer(cols = -Type, names_to = "variable", values_to = "value")

ggplot(plt_data, aes(x = value, fill = Type)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histograms of All Predictors (Separated by Class)",
       x = "Value",
       y = "Frequency") +
  theme_minimal()

To evaluate the relationships between the predictors themselves, the cell below creates a correlation plot:

cor_matrix <- cor(X)

corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black")

Part (b)

Description

Do there appear to be any outliers in the data? Are any predictors skewed?

Solution

The plots produced in part (a) reveal a lot about the structure of the Glass dataset and how the predictor variables relate to themselves and the target variable, Type. The histogram of the predictors reveal that fields Al, Ca, Mg, are the ones that most closely follow a normal distribution while the rest all have some level of skew. This skewness is not equal however amongst the remaining predictors, with fields Fe and Ba being the worst offenders. We can check statistically which of the predictors can be considered normal using a Shapiro-Wilk test:

col_tests <- function(X){
  # This function takes in a dataframe X as an input and returns the following
  # for each of its columns:
  #     - Determines whether or not the data is normal
  #     - Determines the skewness of the data
  
  # initialize empty dataframe for values
  col_tests <- data.frame(
    Variable = character(),
    SW_P_Value = numeric(),
    Normal = character(),
    Skewness = numeric(),
    High_Skew = character()
  )

  # loop through dataframe columns
  for (column in names(X)){
  
    # perform shapiro wilk test and calculate skewness
    vals <- as.numeric(na.omit(X[[column]]))
    p_val <- shapiro.test(vals)$p.value
    skew <- skewness(vals)
  
    # determine if data is normal
    normal <- FALSE
    if (p_val > 0.05){
      normal <- TRUE
    }
  
    # determine if data is highly skewed
    high_skew <- FALSE
    if (skew > 2 | skew < -2){
      high_skew <- TRUE
    }
  
    # append data to col_tests dataframe
    row_to_add <- data.frame(
      Variable = column,
      P_Value = p_val,
      Normal = normal,
      Skewness = skew,
      High_Skew = high_skew
    )
    col_tests <- rbind(col_tests, row_to_add)
  }
  return(col_tests)
}

col_tests(X)

##   Variable      P_Value Normal   Skewness High_Skew
## 1       RI 1.076671e-12  FALSE  1.6027151     FALSE
## 2       Na 3.465543e-07  FALSE  0.4478343     FALSE
## 3       Mg 2.390921e-19  FALSE -1.1364523     FALSE
## 4       Al 2.083156e-07  FALSE  0.8946104     FALSE
## 5       Si 2.175032e-09  FALSE -0.7202392     FALSE
## 6        K 2.172188e-25  FALSE  6.4600889      TRUE
## 7       Ca 4.286584e-16  FALSE  2.0184463      TRUE
## 8       Ba 5.383302e-26  FALSE  3.3686800      TRUE
## 9       Fe 1.156668e-20  FALSE  1.7298107     FALSE

Using the Shapiro-Wilk test, it appears as though none of the variables can be considered normal. Despite this, only three predictors were categorized as having high skew.

The histograms of each predictor also reveal that a number of them likely contain outliers. In particular, the Ba, Fe, and K fields all appear to have values that are far outside the area of the main distribution. We can check for this quantitatively by determining if any of the fields have values that are outside the third or first quartile by one and half times their inter-quartile range (IQF):

outlier_test <- function(X){
  # This function takes in a dataframe X as an input and returns the number of
  # outliers in each column.
  
  # initialize empty dataframe for values
  outlier_test <- data.frame(
    Variable = character(),
    Outliers = numeric(),
    Outlier_Percentage = numeric()
  )

# loop through dataframe columns
  for (column in names(X)){
  
    # determine the 1st and 3rd quartiles and the IQR
    quar_1 <- quantile(X[[column]])[2]
    quar_3 <- quantile(X[[column]])[4] 
    iqr_range <- quar_3 - quar_1
  
    # check for outliers 
    outliers <- 0
    for (val in X[[column]]){
      min_limit <- quar_1 - (1.5 * iqr_range)
      max_limit <- quar_3 + (1.5 * iqr_range)
      if( val < min_limit | val > max_limit){
        outliers <- outliers + 1
      }
    }
    
    # determine the fraction of values in the column that are outliers
    outlier_frac <- round((outliers / nrow(X)) * 100, 2)
    
    # append data to outlier_test dataframe                             
    row_to_add <- data.frame(
      Variable = column,
      Outliers = outliers,
      Outlier_Percentage = outlier_frac
    )
    outlier_test <- rbind(outlier_test, row_to_add)
  }
  return(outlier_test)
}

outlier_test(X)

##   Variable Outliers Outlier_Percentage
## 1       RI       17               7.94
## 2       Na        7               3.27
## 3       Mg        0               0.00
## 4       Al       18               8.41
## 5       Si       12               5.61
## 6        K        7               3.27
## 7       Ca       26              12.15
## 8       Ba       38              17.76
## 9       Fe       12               5.61

The results above show that only one of the predictors does not contain outliers, with the percentage of outliers in a column reaching as high as 17.75%.

Looking at the histograms of each predictor separated by class provides insight into which of the predictors might actually be useful when making predictions. For example, the Al, Na, and Si fields exhibit a range of distributions that appear shifted from one another by class. This means that the information they contain could be useful in determining the category that new observations are a part of.

Lastly, the correlation plot produced above reveals the the predictors do not seem to suffer from a high degree of multicollinearity. Typically, predictor pairs with correlations greater than 0.7 are considered too high, and the only pair of variables that breaks that threshold is RI and Ca.

Part (c)

Descriptions

Are there any relevant transformations of one or more predictors that might improve the classification model?

Solution

Part (b) emphasized two possible issues present with the predictor variables: the presence of outliers and high skewness.

Box-Cox transformations can be used to help quell the possible negative effects of skewness by stabilizing the variance and keeping it closer to constant across the entire range of values in a field. Box-Cox transformations can only be performed if all values are positive, which means this transformation needs to come before any centering and scaling that would transform the mean to 0.

# k ca ba
BC_K <- BoxCoxTrans(X$K + 0.0001) # need to add a small bit to avoid 0 values
BC_Ca <- BoxCoxTrans(X$Ca) 
BC_Ba <- BoxCoxTrans(X$Ba)

X$K <- predict(BC_K, X$K)
X$Ca <- predict(BC_Ca, X$Ca)
X$Ba <- predict(BC_Ba, X$Ba)

The output below shows that the Box-Cox transformations resulted in two of the three problematic predictors no longer having high skew.

col_tests(X)

##   Variable      P_Value Normal    Skewness High_Skew
## 1       RI 1.076671e-12  FALSE  1.60271508     FALSE
## 2       Na 3.465543e-07  FALSE  0.44783426     FALSE
## 3       Mg 2.390921e-19  FALSE -1.13645228     FALSE
## 4       Al 2.083156e-07  FALSE  0.89461042     FALSE
## 5       Si 2.175032e-09  FALSE -0.72023921     FALSE
## 6        K 4.893872e-15  FALSE -0.06559656     FALSE
## 7       Ca 1.133295e-11  FALSE -0.19395573     FALSE
## 8       Ba 5.383302e-26  FALSE  3.36867997      TRUE
## 9       Fe 1.156668e-20  FALSE  1.72981071     FALSE

To help inhibit the influence of outliers, a spatial sign transformation can be applied. Spatial sign transformations transform each data point to a unit vector, which scales their magnitudes down to 1 effectively neutralizing their extreme values. Because all but 1 predictor contained outliers, we can apply this transformation to the entire dataset. However, the data also needs to be centered and scaled (transformed so that the mean \(\mu\) and standard deviation \(\sigma\) of each column are 0 and 1, respectively) before doing so. To perform the scaling and spatial sign transformation all in one pipeline process, we can invoke the preProcess function (from the caret library):

pipeline <- preProcess(X, method = c("center", "scale", "spatialSign"))
X <- predict(pipeline, X)

The output below reveals that this transformation significantly reduced the number of outliers in the dataset:

outlier_test(X)

##   Variable Outliers Outlier_Percentage
## 1       RI        1               0.47
## 2       Na        0               0.00
## 3       Mg        0               0.00
## 4       Al        0               0.00
## 5       Si        0               0.00
## 6        K        0               0.00
## 7       Ca        0               0.00
## 8       Ba       28              13.08
## 9       Fe        0               0.00

The combination of these transformations should result in a dataset with a higher degree of predictive power.

Exercise 3.2

Description

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g. temperature, precipitation) and plant conditions (e.g. left spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

data(Soybean)

Part (a)

Description

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Solution

The cell below uses the col_tests function created earlier in order to test the predictor columns of the Soybean dataset:

X <- select(Soybean, -Class)
y <- Soybean$Class

col_tests(X) %>% arrange(by_group = desc(Skewness))

##           Variable      P_Value Normal    Skewness High_Skew
## 1         mycelium 3.793219e-48  FALSE 10.19921824      TRUE
## 2        sclerotia 1.631058e-46  FALSE  5.39870500      TRUE
## 3        leaf.mild 1.703066e-42  FALSE  3.95290557      TRUE
## 4       shriveling 1.181246e-42  FALSE  3.49157641      TRUE
## 5     int.discolor 5.735058e-43  FALSE  3.33861193      TRUE
## 6          lodging 8.570681e-42  FALSE  3.22582942      TRUE
## 7        leaf.malf 7.907694e-43  FALSE  3.21564565      TRUE
## 8        seed.size 1.678775e-41  FALSE  2.66303034      TRUE
## 9    seed.discolor 1.177280e-40  FALSE  2.47154019      TRUE
## 10           roots 1.149566e-40  FALSE  2.45781443      TRUE
## 11     mold.growth 6.053615e-41  FALSE  2.43281985      TRUE
## 12      fruit.pods 1.001294e-34  FALSE  1.83817833     FALSE
## 13     leaf.shread 7.396918e-39  FALSE  1.80367508     FALSE
## 14       ext.decay 8.899983e-38  FALSE  1.69543723     FALSE
## 15 fruiting.bodies 3.411697e-38  FALSE  1.65939253     FALSE
## 16            seed 4.003077e-38  FALSE  1.53904600     FALSE
## 17            hail 1.807481e-36  FALSE  1.30690508     FALSE
## 18     fruit.spots 5.695136e-31  FALSE  0.94650965     FALSE
## 19        seed.tmt 1.398438e-29  FALSE  0.73966698     FALSE
## 20    plant.growth 1.310802e-36  FALSE  0.67949699     FALSE
## 21    stem.cankers 1.676351e-33  FALSE  0.60983130     FALSE
## 22   canker.lesion 9.881052e-29  FALSE  0.51457211     FALSE
## 23       leaf.marg 1.407673e-33  FALSE  0.46484621     FALSE
## 24     plant.stand 6.291546e-35  FALSE  0.18896734     FALSE
## 25           sever 4.309345e-28  FALSE  0.17391297     FALSE
## 26        area.dam 2.684910e-24  FALSE  0.01799005     FALSE
## 27            germ 6.842615e-26  FALSE -0.08680952     FALSE
## 28            temp 6.455716e-29  FALSE -0.15829545     FALSE
## 29            stem 1.999130e-35  FALSE -0.22581409     FALSE
## 30       leaf.size 1.340863e-28  FALSE -0.24946067     FALSE
## 31            date 4.350968e-17  FALSE -0.30397011     FALSE
## 32       crop.hist 2.525308e-24  FALSE -0.39757148     FALSE
## 33       leaf.halo 6.950384e-33  FALSE -0.41080342     FALSE
## 34          precip 1.015203e-35  FALSE -1.41630633     FALSE
## 35          leaves 2.331813e-43  FALSE -2.44354029      TRUE

We see that 11 of the 35 rows exhibit a high degree of skewness, and none of the predictor distributions are considered normal according to a Shapiro-Wilk test.

Part (b)

Description

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Solution

The following outputs the percentage of values in each field that are missing:

missing_data_cols <- data.frame(
  Column = names(X),
  MissingValues = 
    round(sort(colSums(is.na(X)), decreasing = TRUE) / nrow(X), 4) * 100
)

rownames(missing_data_cols) <- NULL

ggplot(missing_data_cols, 
       aes(x = reorder(Column, MissingValues), y = MissingValues)) +
  geom_bar(stat = "identity", fill = "steel blue") +
  labs(
    title = "% of Values Missing Per Column", 
    x = "Predictor Name", 
    y = "% Missing Values") +
  coord_flip() +
  theme_minimal()

Based on the above plot, there are clearly some predictors with a much higher degree of missing data compared to others. It also looks as though predictors that might have come from the same source (i.e. weather based predictors, leaf measurements, mycology based predictors) all have constant levels of missing information.

Next, the cell below evaluates missing values by class:

missing_data_class <- Soybean %>%
  group_by(Class) %>%
  summarise(MissingValues = sum(is.na(across(everything()))),
            TotalValues = n())

missing_data_class$TotalValues <- missing_data_class$TotalValues * ncol(X)
missing_data_class$FracMissing <- missing_data_class$MissingValues / missing_data_class$TotalValues


ggplot(missing_data_class, 
       aes(x = reorder(Class, FracMissing), y = FracMissing)) +
  geom_bar(stat = "identity", fill='steel blue') +
  labs(title = "Missing Values by Class", x = "Class", y = "Number of Missing Values") +
  coord_flip()

The plot above is much more enlightening as to the cause of the missing data. It appears as though all missing data comes from observations of five different class, each seeming to represent some kind of injury or disease (i.e. cyst, rot, blight, etc.). As such, it might have been impossible to take measurements of these predictors while the soybean plants are affected with any of these conditions.

Part (c)

Description

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Solution

Since all of the predictors have some level of missing data, it would likely be incorrect to remove any of them. As such, a data imputation strategy is necessary. Because each of the predictor fields are categorical, a mode imputation strategy will likely be effective. More specifically, because the missing values seem to be class driven, we can implement mode imputation by class in which missing values in each class are imputed with the most frequent value within that class:

# develop function to calc mode that can be used by mutate later on
Mode <- function(x) {
  ux <- unique(na.omit(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# mode imputation by class
imputed <- Soybean %>%
  group_by(Class) %>%
  mutate(across(everything(), ~ ifelse(is.na(.), Mode(.), .)))

The cell above shows how this imputation strategy can be implemented in R.

Data 624 - HW 4 - Data Preprocessing

William Jasmine

2024-09-29

Setup

Exercise 3.1

Description

Part (a)

Description

Solution

Part (b)

Description

Solution

Part (c)

Descriptions

Solution

Exercise 3.2

Description

Part (a)

Description

Solution

Part (b)

Description

Solution

Part (c)

Description

Solution