In this assignment, problems 3.1 and 3.2 from Kuhn and Johnson's book, Applied Predictive Modeling, are solved.

Question-3.1:

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

Libraries

library(mlbench)
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.5
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.3.2
## ✔ lubridate   1.9.3     ✔ fable       0.3.4
## ✔ ggplot2     3.5.1     ✔ fabletools  0.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:fabletools':
## 
##     MAE, RMSE
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(e1071) 
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:fabletools':
## 
##     interpolate
library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:caret':
## 
##     lift
library(dplyr)
library(corrplot)
## corrplot 0.94 loaded

Get Data

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
#?Glass

There are 10 variables in the Glass data set: the "Type" variable is the response, and the remaining nine are predictors, all of which are numeric. To visualize their distributions, a histogram and a boxplot will be plotted for each predictor. Additionally, a correlation plot will be produced to examine the relationships between predictors.

  a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Data visualizations for the predictor variables:

# Plot histograms
Glass |>
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(value)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms for the Numerical Predictors")

# Plot box plots
Glass |>
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots for the Numerical Predictors")

Correlations of predictors:

Glass |>
  keep(is.numeric) |>
  cor() |>
  corrplot() 

Glass |>
  keep(is.numeric) |>
  cor() |>
  round(2)
##       RI    Na    Mg    Al    Si     K    Ca    Ba    Fe
## RI  1.00 -0.19 -0.12 -0.41 -0.54 -0.29  0.81  0.00  0.14
## Na -0.19  1.00 -0.27  0.16 -0.07 -0.27 -0.28  0.33 -0.24
## Mg -0.12 -0.27  1.00 -0.48 -0.17  0.01 -0.44 -0.49  0.08
## Al -0.41  0.16 -0.48  1.00 -0.01  0.33 -0.26  0.48 -0.07
## Si -0.54 -0.07 -0.17 -0.01  1.00 -0.19 -0.21 -0.10 -0.09
## K  -0.29 -0.27  0.01  0.33 -0.19  1.00 -0.32 -0.04 -0.01
## Ca  0.81 -0.28 -0.44 -0.26 -0.21 -0.32  1.00 -0.11  0.12
## Ba  0.00  0.33 -0.49  0.48 -0.10 -0.04 -0.11  1.00 -0.06
## Fe  0.14 -0.24  0.08 -0.07 -0.09 -0.01  0.12 -0.06  1.00

The correlation matrix shows that the refractive index (RI) has a strong positive correlation with calcium (Ca) (0.81). In contrast, RI has a moderate negative correlation with silicon (Si) (-0.54). Other notable relationships include a moderate negative correlation between aluminum (Al) and magnesium (Mg) (-0.48), and a weaker positive correlation between sodium (Na) and barium (Ba) (0.33). Most other correlations are weak, suggesting minimal relationships among those elements.
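
Given the strong relationship between RI and Ca, highly correlated predictor pairs can also be flagged programmatically with caret's findCorrelation(). A minimal sketch (the 0.75 cutoff is an illustrative choice, not prescribed by the text):

# Flag predictors with pairwise correlations above the cutoff;
# findCorrelation() suggests which columns to consider removing.
cor_matrix <- Glass |> keep(is.numeric) |> cor()
high_corr <- findCorrelation(cor_matrix, cutoff = 0.75)
colnames(cor_matrix)[high_corr]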

  b. Do there appear to be any outliers in the data? Are any predictors skewed?

From the boxplots, the predictors refractive index (RI), sodium (Na), potassium (K), calcium (Ca), and iron (Fe) appear to contain potential outliers. The evidence is strongest for potassium and iron.
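
To quantify the boxplot impression, the values outside the conventional 1.5 × IQR whiskers can be counted per predictor. A rough sketch (this is the boxplot rule of thumb, not a formal outlier test):

# Count, for each predictor, the values beyond the 1.5*IQR whiskers
Glass |>
  keep(is.numeric) |>
  map_int(\(x) {
    q <- quantile(x, c(0.25, 0.75))
    sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
  })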

The histograms above show that sodium is slightly right skewed, aluminum is moderately right skewed, and the refractive index, potassium, calcium, barium, and iron are strongly right skewed. In contrast, magnesium and silicon are left skewed, with magnesium showing stronger left skewness than silicon.

The skewness statistic for each predictor confirms this:

Glass1 <- Glass |> keep(is.numeric)
(skewValues <- apply(Glass1, 2, skewness))
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
  c. Are there any relevant transformations of one or more predictors that might improve the classification model?

Since all of the predictors are skewed to some extent, replacing the data with the log, square root, or inverse may help remove the skew. For the strongly right-skewed predictors RI, K, Ca, Ba, and Fe, a log transformation may work well (after accounting for the zero values in K, Ba, and Fe). Magnesium is strongly left skewed, so a power transformation such as squaring may be more appropriate. Additionally, the Box-Cox method can be applied to estimate an appropriate transformation for each predictor individually.

Glass |>
  keep(is.numeric) |>
  summarise(across(everything(), ~ BoxCoxTrans(.x)$lambda))
##   RI   Na Mg  Al Si  K   Ca Ba Fe
## 1 -2 -0.1 NA 0.5  2 NA -1.1 NA NA

From the estimated lambda values above, sodium (λ ≈ -0.1, close to 0) suggests a log transformation is reasonable. Aluminum (λ = 0.5) points to a square root, silicon (λ = 2) to a square, and the refractive index (λ = -2) and calcium (λ ≈ -1.1) to inverse-square and inverse transformations, respectively. The NA values for magnesium, potassium, barium, and iron arise because those predictors contain zeros, so a Box-Cox lambda cannot be estimated for them; a transformation that allows non-positive values, such as Yeo-Johnson, is needed instead.
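
Since Box-Cox cannot handle the zero values in Mg, K, Ba, and Fe, one option is the Yeo-Johnson transformation, which is defined for zero and negative values. A minimal sketch using caret's preProcess():

# Yeo-Johnson accommodates the zeros that break Box-Cox
yj <- preProcess(Glass1, method = "YeoJohnson")
Glass_trans <- predict(yj, Glass1)
apply(Glass_trans, 2, skewness)  # skewness after transformation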

Question-3.2:

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# ?Soybean
head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0
  a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# Exclude response variable 'Class' from the data set
Soybean1 <- Soybean[, !(names(Soybean) %in% "Class")]

# Bar plots to see frequency distribution of 35 categorical predictors
columns <- colnames(Soybean1)
for (col in columns) {
  print(ggplot(Soybean1, aes(.data[[col]])) +
          geom_bar() +
          ggtitle(col))
}

Based on the frequency distributions, mycelium and sclerotia appear to be degenerate: nearly all of their observations fall in a single level. The distributions of leaf.mild and leaf.malf also look heavily imbalanced, especially once the missing values are excluded.
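
This impression can be checked formally with caret's nearZeroVar(), which flags predictors with near-zero variance (the chapter's criterion for degenerate distributions). A quick sketch:

# Flag predictors whose distributions are near-degenerate
nzv <- nearZeroVar(Soybean1, saveMetrics = TRUE)
nzv[nzv$nzv, ]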

  b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Create bar plots to see the percentage of missing values in the predictor variables:

# Calculate missing percentages
miss_percent <- colSums(is.na(Soybean1)) / nrow(Soybean1) * 100

# Create data frame with missing percentages
miss_df <- data.frame(Predictor = names(miss_percent),
                          Miss_Percent = miss_percent)

# Bar plot of missing percentages for predictors
ggplot(miss_df, aes(x = reorder(Predictor, -Miss_Percent), 
                        y = Miss_Percent)) +
  geom_bar(stat = "identity", fill = "red") +
  coord_flip() +
  labs(title = "Percentage of Missing Values in Predictors",
       x = "Predictors",
       y = " % Missing") +
  theme_minimal()

Determine missing counts by class to identify which classes account for the missing values:

# Count missing values by Class
missing_by_class <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}")) %>%
  pivot_longer(cols = starts_with("missing_"),
               names_to = "Predictor", 
               values_to = "Missing_Count") %>%
  mutate(Predictor = sub("missing_", "", Predictor))

# See missing counts by class
head(missing_by_class,20)
## # A tibble: 20 × 3
##    Class        Predictor    Missing_Count
##    <fct>        <chr>                <int>
##  1 2-4-d-injury date                     1
##  2 2-4-d-injury plant.stand             16
##  3 2-4-d-injury precip                  16
##  4 2-4-d-injury temp                    16
##  5 2-4-d-injury hail                    16
##  6 2-4-d-injury crop.hist               16
##  7 2-4-d-injury area.dam                 1
##  8 2-4-d-injury sever                   16
##  9 2-4-d-injury seed.tmt                16
## 10 2-4-d-injury germ                    16
## 11 2-4-d-injury plant.growth            16
## 12 2-4-d-injury leaves                   0
## 13 2-4-d-injury leaf.halo                0
## 14 2-4-d-injury leaf.marg                0
## 15 2-4-d-injury leaf.size                0
## 16 2-4-d-injury leaf.shread             16
## 17 2-4-d-injury leaf.malf                0
## 18 2-4-d-injury leaf.mild               16
## 19 2-4-d-injury stem                    16
## 20 2-4-d-injury lodging                 16
# Classes with missing counts
classes_with_missing <- missing_by_class %>%
  filter(Missing_Count > 0) %>%
  distinct(Class)  

# See distinct classes
print(classes_with_missing)
## # A tibble: 5 × 1
##   Class                      
##   <fct>                      
## 1 2-4-d-injury               
## 2 cyst-nematode              
## 3 diaporthe-pod-&-stem-blight
## 4 herbicide-injury           
## 5 phytophthora-rot

Bar plot of missing percentages after removing the classes that contain missing values:

# Define classes to remove
classes_to_remove <- c("2-4-d-injury", "cyst-nematode", "diaporthe-pod-&-stem-blight", "herbicide-injury", "phytophthora-rot")

# Filter out defined classes from Soybean data set
Soybean2 <- Soybean %>% 
  filter(!Class %in% classes_to_remove)

# Calculate missing percentages
miss_percent <- colSums(is.na(Soybean2)) / nrow(Soybean2) * 100

# Create data frame for missing percentages
miss_data <- data.frame(Predictor = names(miss_percent),
                      Miss_Percent = miss_percent)

# Bar plot of missing percentages for predictors
ggplot(miss_data, aes(x = reorder(Predictor, -Miss_Percent), 
                     y = Miss_Percent)) +
  geom_bar(stat = "identity", fill = "red") +
  coord_flip() +
  labs(title = "Percentage of Missing Values in Predictors",
       x = "Predictors",
       y = "% Missing") +
  theme_minimal()

From the first bar plot above, it can be seen that, except for the predictor "leaves", the remaining 34 predictors all have some percentage of missing values in the Soybean data set. The missing values are concentrated in five classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot, so there is a clear pattern relating the missing data to these five classes. After removing these classes, no missing data remain, as the second bar plot above shows.
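
As a quick numerical check of that conclusion:

# Confirm that no missing values remain after dropping the five classes
sum(is.na(Soybean2))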

  c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

To handle the missing data in the Soybean data set, I will adopt a strategy that combines elimination and imputation. First, I will assess the volume of missing data for each predictor: any predictor with 20% or more missing values will be removed from the analysis. The 20% threshold is chosen based on the sample size of the data. Next, for the remaining predictors with missing values, I will use imputation to fill in the gaps. Simple mode imputation (the factor analogue of a mean or median fill) can replace the missing values, or a more advanced technique such as K-Nearest Neighbors (KNN) can better estimate them. Combining these approaches should yield a more reliable data set while minimizing the impact of the missing data.
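
A minimal sketch of this elimination-plus-imputation strategy, using a simple mode fill for the remaining factor predictors (a KNN imputer from a package such as VIM or mice could replace the mode step; the 20% threshold and object names are illustrative):

# Step 1: drop predictors with 20% or more missing values
miss_frac <- colMeans(is.na(Soybean1))
Soybean_reduced <- Soybean1[, miss_frac < 0.20]

# Step 2: fill remaining NAs with each factor's most frequent level
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
Soybean_imputed <- as.data.frame(lapply(Soybean_reduced, impute_mode))

# Verify that no missing values remain
sum(is.na(Soybean_imputed))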