3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

    Answer: The correlation plot shows mostly weak relationships between the predictors, with one strong exception: RI and Ca are highly positively correlated. The histograms show that several predictors (Ba, Fe, and K) contain many zero values, some are skewed, and the rest look roughly normal.

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

    Answer: The box plots show that most predictors have outliers; Mg is the main exception. The histograms show that some predictors are right skewed, while others are approximately normally distributed.

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

    Answer: A Box-Cox transformation is a good candidate: it reduces the right skewness seen in the histograms and helps stabilize the variance, bringing the distributions closer to normal. The predictors with many zeros (Ba, Fe) change little, since Box-Cox is only defined for strictly positive values; a sketch is given after the exploratory code below.

library(mlbench)
library(corrplot)
library(fpp3)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
cor_matrix <- cor(Glass[, 1:9])  # correlation matrix; avoid shadowing base::matrix()

corrplot(cor_matrix, method = "color", addCoef.col = "blue", tl.cex = 0.8)

Glass |>
  select_if(is.numeric) |>
  gather() |>
  ggplot(aes(value)) + 
  geom_histogram(bins = 50) + 
  facet_wrap(~key, scales = 'free')

Glass |>
  select_if(is.numeric) |>  
  gather() |>
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free')
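
As a sketch of the Box-Cox idea from part 3, assuming the caret and e1071 packages are installed (neither is loaded above): estimate a lambda per predictor, apply the transformation, and compare skewness before and after. caret only estimates a transformation for strictly positive predictors, so the zero-heavy ones (Ba, Fe) pass through unchanged.

library(caret)
library(e1071)

# Estimate a Box-Cox lambda for each predictor; variables containing
# zeros cannot be Box-Cox transformed and are left as-is.
pp <- preProcess(Glass[, 1:9], method = "BoxCox")
glass_trans <- predict(pp, Glass[, 1:9])

# Compare per-predictor skewness before and after the transformation.
sapply(Glass[, 1:9], skewness)
sapply(glass_trans, skewness)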

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

    Answer: A degenerate distribution here means one where nearly all samples take a single value (near-zero variance), not merely a skewed one. The bar charts show that several predictors are dominated by one level (mycelium and sclerotia, for example) and so look degenerate; the near-zero-variance check sketched after the bar-chart code below identifies them programmatically.

  2. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

    Answer: Yes. The missing-percentage bar chart shows that some predictors (hail, sever, seed.tmt, and lodging among them) are missing far more often than others. By class, phytophthora-rot accounts for 68 incomplete samples, far more than any other class; the rest range from 8 (herbicide-injury) to 14-16. Only 5 of the 19 classes contain any missing values at all, so the pattern of missing data does appear to be related to the classes.

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

    Answer: For predictors with modest missingness (say 5-10%), imputation is reasonable; since these predictors are categorical, mode imputation (or a kNN imputer) is more appropriate than a mean or median. A predictor missing a very large share of its values (around 60%, say) should be dropped instead. A sketch of this strategy follows the output at the end of this section.

library(mlbench)
data(Soybean)

Soybean |>
  select(!Class) |>  
  gather() |>
  ggplot(aes(value)) + 
  geom_bar() +  
  facet_wrap(~key, scales = "free")
## Warning: attributes are not identical across measure variables; they will be
## dropped
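
To make the degeneracy check in part 1 concrete, here is a minimal sketch using the caret package (an assumption; it is not loaded elsewhere in this section). nearZeroVar() flags predictors where one value dominates and few distinct values occur.

library(caret)

# saveMetrics = TRUE returns the frequency-ratio and percent-unique
# diagnostics for every predictor; the nzv column marks degenerate ones.
# Drop the first column (Class) so only predictors are checked.
nzv <- nearZeroVar(Soybean[, -1], saveMetrics = TRUE)
nzv[nzv$nzv, ]

On this data the check should single out a few predictors (mycelium and sclerotia among them) whose bar charts above are dominated by a single level.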

perc <- Soybean |>
  summarise_all(~mean(is.na(.)) * 100) |>
  gather(key = "predictor", value = "perc") 

ggplot(perc, aes(x = reorder(predictor, perc), y = perc)) +
  geom_col() +
  labs(x = "Predictor",
       y = "Missing Percentage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Soybean |>
  filter_all(any_vars(is.na(.))) |>
  select(Class) |>
  group_by(Class) |>
  summarise(count = n()) 
## # A tibble: 5 × 2
##   Class                       count
##   <fct>                       <int>
## 1 2-4-d-injury                   16
## 2 cyst-nematode                  14
## 3 diaporthe-pod-&-stem-blight    15
## 4 herbicide-injury                8
## 5 phytophthora-rot               68
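
A sketch of the strategy from part 3, with illustrative numbers: the 50% cutoff and the impute_mode() helper below are assumptions, not part of the exercise. Predictors missing more than half their values are dropped; the remaining gaps in each categorical predictor are filled with its most frequent level.

library(dplyr)

# Fraction of missing values per column; predictors above the cutoff
# are eliminated rather than imputed.
drop_cols <- names(Soybean)[colMeans(is.na(Soybean)) > 0.5]

# Mode imputation for a factor: replace NAs with the most frequent level.
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}

soy_imputed <- Soybean |>
  select(-all_of(drop_cols)) |>
  mutate(across(-Class, impute_mode))

sum(is.na(soy_imputed))  # 0 once every predictor has been imputed

No predictor in this data actually exceeds the 50% cutoff (the worst cases are around 18% missing), so here the strategy reduces to mode imputation; a kNN or model-based imputer would be a reasonable alternative for these categorical predictors.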