Load libraries

library(tidyverse)
library(fpp3)
library(corrplot)

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# Reshape the nine predictors to long format for faceted plots
glass2 <- Glass %>%
  select(-Type) %>%
  pivot_longer(cols = everything(), names_to = "name", values_to = "value")

# One histogram per predictor, each on its own scale
ggplot(glass2, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~name, scales = "free") +
  labs(title = "Glass Distribution")

# Keep all nine predictors (columns 1 to 9); column 10 is the Type outcome
glass <- Glass[1:9]

corrplot(cor(glass),
  method = "number",
  type = "upper")

ggplot(glass2, aes(name, value)) +
  geom_boxplot() +
  labs(title = "Boxplot for Glass") 

ggplot(Glass, aes(Type)) +
  geom_bar() +
  labs(title = "Count for Type for Glass")
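
To complement the correlation plot, the strongest pairwise relationships can also be listed numerically. The following is a minimal sketch in base R; the object names `cors` and `pairs` are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the nine predictors (Type is the outcome)
cors <- cor(Glass[, names(Glass) != "Type"])

# Keep each pair once by blanking the lower triangle and the diagonal
cors[lower.tri(cors, diag = TRUE)] <- NA
pairs <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
names(pairs) <- c("var1", "var2", "correlation")
pairs <- pairs[!is.na(pairs$correlation), ]

# Strongest relationships first, ranked by absolute correlation
head(pairs[order(-abs(pairs$correlation)), ], 5)
```

Ranking by absolute value keeps strong negative relationships visible alongside positive ones.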

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

Si stands out in the combined boxplot because its values are far larger than those of the other predictors, most of which lie below 20; this is a difference in scale rather than a sign of outlying samples. In the histograms, the distributions of Ba, Fe, and K are strongly right-skewed. Al, Ca, and RI are also right-skewed, but less so than Ba, Fe, and K.
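
The visual impression of skewness and outliers can be backed up with numbers. The sketch below computes a sample skewness statistic and counts points flagged by the 1.5 * IQR boxplot rule for each predictor. The helper names `skewness` and `n_outliers` are hypothetical, and the boxplot rule is only one possible definition of an outlier.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Sample skewness: third central moment over the 1.5 power of the second
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Number of points beyond 1.5 * IQR from the quartiles (the boxplot rule)
n_outliers <- function(x) length(boxplot.stats(x)$out)

summary_table <- data.frame(
  skewness = round(sapply(predictors, skewness), 2),
  n_outliers = sapply(predictors, n_outliers)
)

# Most right-skewed predictors first
summary_table[order(-summary_table$skewness), ]
```

Because the rule is applied to each predictor separately, a predictor that simply sits on a larger scale is not flagged for that reason alone.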

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

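A log or Box-Cox transformation would reduce the right skewness in Ba, Fe, and K. Because Ba and Fe contain zeros (visible in the str() output above) and both transformations require strictly positive values, an offset such as log(x + 1) or a Yeo-Johnson transformation would be needed in practice.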
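
The effect of a log-type transformation on the skewed predictors can be checked numerically. The following is a minimal sketch that uses `log1p()`, that is log(1 + x), as a stand-in for the log transformation, since Ba and Fe contain zeros. The `skewness` helper is hypothetical, and the choice of `log1p()` is an assumption rather than the method used in the original analysis.

```r
library(mlbench)
data(Glass)

# Sample skewness: third central moment over the 1.5 power of the second
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skewed <- c("Ba", "Fe", "K")

# Skewness on the original scale and after log(1 + x)
before <- sapply(Glass[skewed], skewness)
after <- sapply(Glass[skewed], function(x) skewness(log1p(x)))

round(rbind(before, after), 2)
```

Comparing the two rows shows how much of the skewness the transformation removes for each predictor; predictors with many exact zeros may remain noticeably skewed.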

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data("Soybean")
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The distributions are degenerate in a sense: many predictors have a substantial share of missing values or are concentrated in only a few levels. Examples include hail, sclerotia, seed, shriveling, stem, leaves, seed.size, seed.discolor, mycelium, lodging, plant.growth, and roots.

soybean_data <- Soybean %>%
  select(-Class, -date) %>%
  # Convert factor levels to integer codes so every predictor shares one value column
  mutate(across(where(is.factor), as.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "name", values_to = "value")

# geom_bar() counts each level directly; missing values are dropped with a warning
ggplot(soybean_data, aes(x = value)) +
  geom_bar() +
  facet_wrap(~name, scales = "free") +
  labs(title = "Soybean Predictor Distributions")
## Warning: Removed 2336 rows containing non-finite outside the scale range
## (`stat_count()`).
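
Degeneracy can also be checked numerically instead of by eye. The sketch below computes, for each predictor, the ratio of the most common level's count to the second most common level's count, and the percentage of distinct values relative to the sample size. The cutoffs of 19 and 10 are assumptions chosen for illustration (they mirror commonly used near-zero-variance defaults), and the names `freq_ratio`, `pct_unique`, and `flagged` are illustrative.

```r
library(mlbench)
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# Ratio of the most common level's count to the second most common level's count
freq_ratio <- sapply(predictors, function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() ignores NA by default
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1]) / as.numeric(counts[2])
})

# Percentage of distinct observed values relative to the number of samples
pct_unique <- sapply(predictors, function(x) {
  100 * length(unique(na.omit(x))) / length(x)
})

# Assumed cutoffs: a very dominant level and very few distinct values
flagged <- names(freq_ratio)[freq_ratio > 19 & pct_unique < 10]
flagged
```

A predictor flagged this way carries little information for most samples and is a candidate for removal before modeling.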

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The predictors with the most missing data are hail, sever, seed.tmt, and lodging, with 121 missing values each. The data show that the classes 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot are the main contributors to the missing data.

colSums(is.na(Soybean))
##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31
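
The raw counts above are easier to compare as percentages. The sketch below ranks the columns by their share of missing values and also computes the share of rows with at least one missing value, which is presumably what the 18% figure in the question refers to (an assumption). The names `pct_missing` and `pct_incomplete_rows` are illustrative.

```r
library(mlbench)
data(Soybean)

# Percentage of missing values per column, largest first
pct_missing <- sort(100 * colMeans(is.na(Soybean)), decreasing = TRUE)
round(head(pct_missing, 5), 1)

# Percentage of rows with at least one missing value
pct_incomplete_rows <- 100 * mean(!complete.cases(Soybean))
round(pct_incomplete_rows, 1)
```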
Soybean %>%
  group_by(Class) %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  # A sum of NA marks a class in which the predictor has at least one missing value
  summarise(across(where(is.numeric), \(x) sum(x, na.rm = FALSE)))
## # A tibble: 19 × 36
##    Class   date plant.stand precip  temp  hail crop.hist area.dam sever seed.tmt
##    <fct>  <dbl>       <dbl>  <dbl> <dbl> <dbl>     <dbl>    <dbl> <dbl>    <dbl>
##  1 2-4-d…    NA          NA     NA    NA    NA        NA       NA    NA       NA
##  2 alter…   552         124    264   233   101       261      236   129      146
##  3 anthr…   241          67    132    99    55       122      122    79       71
##  4 bacte…    90          25     50    43    30        56       50    30       30
##  5 bacte…    75          29     48    39    30        53       50    28       26
##  6 brown…   293         127    266   194   103       291      296   179      135
##  7 brown…   234          55     53    81    53       140      122    95       66
##  8 charc…   115          20     20    55    31        55       70    40       30
##  9 cyst-…    58          NA     NA    NA    NA        45       34    NA       NA
## 10 diapo…    88          NA     43    45    NA        44       54    NA       NA
## 11 diapo…   110          20     60    40    21        61       23    46       29
## 12 downy…    90          31     60    35    29        56       50    34       30
## 13 frog-…   501         119    263   219   101       266      227   134      143
## 14 herbi…    15          16     NA     8    NA        12       20    NA       NA
## 15 phyll…    68          31     31    50    29        50       52    26       32
## 16 phyto…   266         176    234   195    NA       262      178    NA       NA
## 17 powde…    95          31     31    30    29        50       50    30       34
## 18 purpl…   113          20     60    39    29        50       50    20       28
## 19 rhizo…    45          38     60    20    22        50       40    51       24
## # ℹ 26 more variables: germ <dbl>, plant.growth <dbl>, leaves <dbl>,
## #   leaf.halo <dbl>, leaf.marg <dbl>, leaf.size <dbl>, leaf.shread <dbl>,
## #   leaf.malf <dbl>, leaf.mild <dbl>, stem <dbl>, lodging <dbl>,
## #   stem.cankers <dbl>, canker.lesion <dbl>, fruiting.bodies <dbl>,
## #   ext.decay <dbl>, mycelium <dbl>, int.discolor <dbl>, sclerotia <dbl>,
## #   fruit.pods <dbl>, fruit.spots <dbl>, seed <dbl>, mold.growth <dbl>,
## #   seed.discolor <dbl>, seed.size <dbl>, shriveling <dbl>, roots <dbl>
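
The NA entries in the table above only indicate that a class has some missing values. A more direct view is to count the incomplete rows per class. The following is a minimal base R sketch; `has_missing` and `missing_by_class` are illustrative names.

```r
library(mlbench)
data(Soybean)

# TRUE for each row that has at least one missing predictor value
has_missing <- !complete.cases(Soybean[, names(Soybean) != "Class"])

# Incomplete and total rows per class (both follow the factor level order)
incomplete <- tapply(has_missing, Soybean$Class, sum)
total <- table(Soybean$Class)

missing_by_class <- data.frame(
  class = names(incomplete),
  incomplete_rows = as.integer(incomplete),
  total_rows = as.integer(total),
  row.names = NULL
)

# Show only the classes that contain incomplete rows
missing_by_class[missing_by_class$incomplete_rows > 0, ]
```

Comparing `incomplete_rows` with `total_rows` shows whether missingness affects a whole class or only part of it, which matters when deciding between dropping rows and imputing.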
  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

One strategy for handling the missing data is imputation with K-nearest neighbors: for each sample with a missing value, find the most similar samples and use their values to fill in the gap. Each missing entry is therefore filled in from the surrounding data points rather than discarded.
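
The idea can be sketched in base R for categorical predictors. This is a simplified illustration under stated assumptions, not a definitive implementation: the helper `knn_impute_mode` is hypothetical, distance is taken as the mismatch rate over the predictors observed in both rows, and a missing value is replaced by the most common level among the k nearest rows that have that predictor observed. The choice of k = 5 is arbitrary.

```r
library(mlbench)
data(Soybean)

# Predictors as a character matrix so levels can be compared directly
X <- as.matrix(Soybean[, names(Soybean) != "Class"])

# Hypothetical helper: impute with the most common value among the k nearest rows
knn_impute_mode <- function(X, k = 5) {
  imputed <- X
  for (i in which(rowSums(is.na(X)) > 0)) {
    row_i <- matrix(X[i, ], nrow(X), ncol(X), byrow = TRUE)
    # Columns observed in both row i and each candidate row
    both_obs <- !is.na(X) & !is.na(row_i)
    # Mismatch rate over the jointly observed columns
    dist <- rowSums((X != row_i) & both_obs, na.rm = TRUE) / rowSums(both_obs)
    dist[i] <- Inf  # a row cannot be its own neighbor
    for (j in which(is.na(X[i, ]))) {
      # Only rows with column j observed and a defined distance can donate a value
      donors <- which(!is.na(X[, j]) & is.finite(dist))
      nearest <- donors[order(dist[donors])][seq_len(min(k, length(donors)))]
      if (length(nearest) > 0) {
        imputed[i, j] <- names(which.max(table(X[nearest, j])))
      }
    }
  }
  imputed
}

imputed <- knn_impute_mode(X, k = 5)

# Number of missing values left after imputation
sum(is.na(imputed))
```

Neighbors are always drawn from the original matrix `X`, so imputed values never feed into later imputations. In a modeling workflow the neighbors should come from the training set only, to avoid leaking information from the test set.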