library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
library(e1071)
library(lattice)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

3.1

The UC Irvine Machine Learning Repository contains a dataset related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
#View(Glass)

glass_data <- Glass %>%
  select(-Type) %>%
  pivot_longer(everything(), names_to = "predictors", values_to = "values")
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
#histograms for each predictor
glass_data %>%
  ggplot(aes(x = values)) + 
  geom_histogram()+
  facet_wrap(~predictors, scales = "free") +
  theme_minimal() + 
  labs(title = "Histograms of Glass Dataset Predictors")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#correlation plot
Glass %>%
  select(1:9) %>%
  ggpairs()

Based on the histograms, the data for each predictor is not normally distributed. Most of the predictors are right-skewed. “Mg” predictor looks bimodal.

  1. Do there appear to be any outliers in the data? Are any predictors skewed?
glass_data |> 
  ggplot(aes(x = values)) +
  geom_boxplot() +
  facet_wrap(~ predictors, scales = "free") +
  labs(title = "Distributions of Glass Predictors")

#need to remove the "type" column from the dataset to calculate skewness or pivot back to wide format
glass <- Glass %>%
  select(-Type)

skewValues <- apply(glass, 2, skewness)
head(skewValues) # not sure why it is not calculating skewness for Ca, Ba, and Fe
##         RI         Na         Mg         Al         Si          K 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889

Each predictor has a lot of outliers, except Mg.

  1. Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes, I would perform a few different transformations. Boxcox transformation, log transformation, scaling.

glass_trans <- preProcess(glass, method = c("BoxCox", "center", "scale"))
head(glass_trans)
## $dim
## [1] 214   9
## 
## $bc
## $bc$RI
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.511   1.517   1.518   1.518   1.519   1.534 
## 
## Largest/Smallest: 1.02 
## Sample Skewness: 1.6 
## 
## Estimated Lambda: -2 
## 
## 
## $bc$Na
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.73   12.91   13.30   13.41   13.82   17.38 
## 
## Largest/Smallest: 1.62 
## Sample Skewness: 0.448 
## 
## Estimated Lambda: -0.1 
## With fudge factor, Lambda = 0 will be used for transformations
## 
## 
## $bc$Al
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   1.190   1.360   1.445   1.630   3.500 
## 
## Largest/Smallest: 12.1 
## Sample Skewness: 0.895 
## 
## Estimated Lambda: 0.5 
## 
## 
## $bc$Si
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   69.81   72.28   72.79   72.65   73.09   75.41 
## 
## Largest/Smallest: 1.08 
## Sample Skewness: -0.72 
## 
## Estimated Lambda: 2 
## 
## 
## $bc$Ca
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.430   8.240   8.600   8.957   9.172  16.190 
## 
## Largest/Smallest: 2.98 
## Sample Skewness: 2.02 
## 
## Estimated Lambda: -1.1 
## 
## 
## 
## $yj
## NULL
## 
## $et
## NULL
## 
## $invHyperbolicSine
## NULL
## 
## $mean
##           RI           Na           Mg           Al           Si            K 
## 2.831185e-01 2.594009e+00 2.684533e+00 3.684509e-01 2.638878e+03 4.970561e-01 
##           Ca           Ba           Fe 
## 8.256036e-01 1.750467e-01 5.700935e-02
glass_transformed <- predict(glass_trans, newdata = glass)


glass_transformed %>%
  pivot_longer(everything(), names_to = "predictors", values_to = "values") %>%
  ggplot(aes(x = values)) + 
  geom_histogram()+
  facet_wrap(~predictors, scales = "free") +
  theme_minimal() + 
  labs(title = "Histograms of Glass Dataset Transformed Predictors")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After the transformations, the variables appear to be better distributed.

3.2.

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions(e.g.,temperature,precipitation)and plant conditions(e.g.,left spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
## See ?Soybean for details
?Soybean
## Help on topic 'Soybean' was found in the following packages:
## 
##   Package               Library
##   mlbench               /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##   nlme                  /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## 
## 
## Using the first match ...
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
#A degenerate distribution happens when a variable has very little variability
Soybean |>
  select(-Class) |>
  mutate(across(where(is.factor), as.character)) |>
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
  group_by(Predictor, Value) |>
  summarize(Count = n()) |>
  ggplot(mapping = aes(x = Value, y = Count)) +
  geom_col() +
  facet_wrap(~Predictor, scales = "free") +
  labs(
    title = "Distribution of Predictors"
  )
## `summarise()` has grouped output by 'Predictor'. You can override using the
## `.groups` argument.

Several predictors show degenerate distribution,meaning that one category occurring far more frequently than others. For some of the predictors there is a high proportion of the NA values.

  1. Roughly18% of the data are missing.Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
mean(is.na(Soybean)) * 100
## [1] 9.504636
missing_perc <- colSums(is.na(Soybean)) / nrow(Soybean) * 100
sort(missing_perc, decreasing = TRUE)
##            hail           sever        seed.tmt         lodging            germ 
##      17.7159590      17.7159590      17.7159590      17.7159590      16.3982430 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##      15.8125915      15.5197657      15.5197657      15.5197657      15.5197657 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##      14.6412884      13.4699854      13.4699854      13.4699854      12.2986823 
##       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
##      12.2986823      12.2986823      12.2986823      12.2986823       5.5636896 
##    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
##       5.5636896       5.5636896       5.5636896       5.5636896       5.5636896 
##       sclerotia     plant.stand           roots            temp       crop.hist 
##       5.5636896       5.2708638       4.5387994       4.3923865       2.3426061 
##    plant.growth            stem            date        area.dam           Class 
##       2.3426061       2.3426061       0.1464129       0.1464129       0.0000000 
##          leaves 
##       0.0000000
missing_data <- Soybean %>%
  select(-Class) %>%
  mutate_all(~ is.na(.))

missing_data$Class <- Soybean$Class

missing_by_class_all <- missing_data %>%
  gather(key = "predictor", value = "missing", -Class) %>%
  group_by(Class, predictor) %>%
  summarise(missing_percentage = mean(missing) * 100) %>%
  ungroup()
## `summarise()` has grouped output by 'Class'. You can override using the
## `.groups` argument.
# View the results
head(missing_by_class_all)
## # A tibble: 6 × 3
##   Class        predictor     missing_percentage
##   <fct>        <chr>                      <dbl>
## 1 2-4-d-injury area.dam                    6.25
## 2 2-4-d-injury canker.lesion             100   
## 3 2-4-d-injury crop.hist                 100   
## 4 2-4-d-injury date                        6.25
## 5 2-4-d-injury ext.decay                 100   
## 6 2-4-d-injury fruit.pods                100
ggplot(missing_by_class_all, aes(x = Class, y = missing_percentage, fill = predictor)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip()+
  labs(title = "Missingness of Predictors by Class",
       x = "Class",
       y = "Percentage of Missing Data") +
  theme_minimal()

It does appear that the missng data is releated to class, and particualrly to “phytophthora-rot”, “herbicide-injury”, “diaporther-pod-stem-blight”, “cyst-nematode”, “2-4-d-injury”

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

We can eliminate degenerate variables since they don’t have any variations across categories. For the predictors with low missigness, we can impute the missing values using the mode. And I think we would need to separately explore the missing data in those specific classes and use multiple imputations.