library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::lag()         masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.2
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(mice)
## Warning: package 'mice' was built under R version 4.3.2
## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(tidyr)
data(Glass)
data(Soybean)

Question 3.1:

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Answer:

str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
# Plotting histogram
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

# Plotting Boxplot chart
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

#Finding correlation between pairs
Glass %>%
  keep(is.numeric) %>%
  
  cor() %>%
  corrplot() 

The histograms show that Al, Ca, and Si have fairly symmetric distributions, while Fe, K, and Ba are right-skewed, with most values concentrated near zero and a few high values. Mg is different: it is left-skewed, with most values near 3.5 and a second cluster at zero. The boxplots show outliers, indicated by points beyond the whiskers, for several variables, including Al, Na, and K. Most pairwise correlations are weak to moderate. The main exception is RI and Ca, which are strongly positively correlated (about 0.8); RI and Si are moderately negatively correlated.

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

Answer:

Yes, both the histograms and the boxplots indicate outliers and skewed predictors. In the boxplots, several predictors (e.g. Al, Ba, Fe, K, Na, and RI) show points that lie outside the whiskers, indicating potential outliers. In the histograms, Ba, Fe, and K are right-skewed, while Mg is left-skewed with a cluster of values at zero.
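The visual impressions above can be cross-checked numerically. The following is a minimal sketch, assuming `skewness()` from the already loaded e1071 package and the usual 1.5 * IQR whisker rule from base R's `boxplot.stats()`; the object names `glass_num`, `glass_skew`, and `n_outliers` are illustrative and not part of the original analysis.

```r
library(mlbench)  # Glass data
library(e1071)    # skewness()

data(Glass)

# Keep the nine numeric predictors (drops the Type factor)
glass_num <- Glass[sapply(Glass, is.numeric)]

# Sample skewness per predictor: > 0 suggests right skew, < 0 left skew
glass_skew <- sapply(glass_num, skewness)
round(sort(glass_skew, decreasing = TRUE), 2)

# Number of points beyond the boxplot whiskers (1.5 * IQR rule)
n_outliers <- sapply(glass_num, function(x) length(boxplot.stats(x)$out))
n_outliers
```

Predictors with a large absolute skewness are the natural candidates for the transformations discussed in the next part.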

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

Answer:

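Ba, Fe, and K are strongly right-skewed with many values at or near zero, so they are candidates for a log-type transformation. Because they contain zeros, a plain log is undefined; log(x + 1) or a Yeo-Johnson transformation would be needed instead. The estimated Box-Cox lambdas are shown in the table below; they are NA for Mg, K, Ba, and Fe because Box-Cox requires strictly positive values. The estimates suggest squaring Si (lambda = 2), taking the inverse square of RI (lambda = -2), and taking the square root of Al (lambda = 0.5). Ca is correlated with other predictors, so it would also be interesting to see how the model performs without it.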

# Estimate the Box-Cox lambda for each numeric predictor
Glass %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), ~ BoxCoxTrans(.x)$lambda))
##   RI   Na Mg  Al Si  K   Ca Ba Fe
## 1 -2 -0.1 NA 0.5  2 NA -1.1 NA NA
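The NA lambdas can be traced to zeros in the data, and a transformation that tolerates zeros is one way around them. The sketch below is an assumption-laden alternative, not the approach used above: it swaps in the Yeo-Johnson transformation through caret's `preProcess()`. The names `has_zero`, `pp`, and `glass_yj` are illustrative.

```r
library(mlbench)  # Glass data
library(caret)    # preProcess()

data(Glass)
glass_num <- Glass[sapply(Glass, is.numeric)]

# Box-Cox needs strictly positive values; list predictors that violate this
has_zero <- sapply(glass_num, function(x) any(x <= 0))
names(has_zero)[has_zero]

# Yeo-Johnson (swapped in here as an assumption) accepts zeros
pp <- preProcess(glass_num, method = "YeoJohnson")
glass_yj <- predict(pp, glass_num)

# Inspect the transformed predictors
summary(glass_yj)
```

Whether the transformed predictors actually help depends on the classifier; tree-based models, for example, are largely insensitive to monotone transformations.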

Question 3.2:

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Answer:

Soybean %>% 
  select(-Class) %>%
  gather() %>%
  
  ggplot(aes(value)) + 
    geom_bar() + 
    facet_wrap(~ key)
## Warning: attributes are not identical across measure variables; they will be
## dropped
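The bar charts can be complemented with a numeric check for degenerate distributions. This is a minimal sketch, assuming caret's `nearZeroVar()` with its default cutoffs is an acceptable stand-in for the near-zero-variance criteria discussed in the chapter; `soy_predictors`, `nzv`, and `degenerate` are illustrative names.

```r
library(mlbench)  # Soybean data
library(caret)    # nearZeroVar()

data(Soybean)

# Drop the outcome so that only predictors are screened
soy_predictors <- Soybean[, names(Soybean) != "Class"]

# freqRatio: most common value count / second most common value count
# percentUnique: distinct values as a percentage of the sample size
nzv <- nearZeroVar(soy_predictors, saveMetrics = TRUE)

# Predictors flagged as near-zero variance under the default cutoffs
degenerate <- rownames(nzv)[nzv$nzv]
nzv[nzv$nzv, c("freqRatio", "percentUnique")]
```

Flagged predictors are dominated by a single level, so they carry little information and can cause problems for some models when resampling.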

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Answer:

# Flag every cell as missing or not, then plot the share per variable
Soybean %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values = c("skyblue", "blue"))
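The same information can be read off as numbers rather than bar lengths. The following base R sketch is an illustration only; `miss_by_pred` and `share_incomplete_rows` are hypothetical names introduced here.

```r
library(mlbench)  # Soybean data

data(Soybean)

# Share of missing values per variable, largest first
miss_by_pred <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
round(head(miss_by_pred, 10), 3)

# Share of rows with at least one missing value
share_incomplete_rows <- mean(!complete.cases(Soybean))
share_incomplete_rows
```

Sorting by the missing share makes it easy to see which predictors are most affected.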

Soybean %>%
  group_by(Class) %>%
  mutate(class_Total = n()) %>%
  ungroup() %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  mutate(Missing = n(),
         Proportion =  Missing / class_Total) %>% 
  ungroup()%>%
  select(Class, Proportion) %>%
  distinct() 
## # A tibble: 5 × 2
##   Class                       Proportion
##   <fct>                            <dbl>
## 1 phytophthora-rot                 0.773
## 2 diaporthe-pod-&-stem-blight      1    
## 3 cyst-nematode                    1    
## 4 2-4-d-injury                     1    
## 5 herbicide-injury                 1

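Only five classes contain rows with missing values: ‘phytophthora-rot’, ‘diaporthe-pod-&-stem-blight’, ‘cyst-nematode’, ‘2-4-d-injury’, and ‘herbicide-injury’. Every row in the last four classes has at least one missing value, as do about 77% of the ‘phytophthora-rot’ rows, so the missingness is strongly related to class. If we filter out these classes, no missing data remain.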

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Answer:

One way to handle the missing data is to remove those five classes from the data entirely, although the resulting model can then no longer predict them.

# Drop the five classes with incomplete rows, then re-plot missingness
Soybean %>%
  filter(!Class %in% c("phytophthora-rot", "diaporthe-pod-&-stem-blight", "cyst-nematode",
                       "2-4-d-injury", "herbicide-injury")) %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values with Missing Classes Removed",
       x = "Proportion") +
  scale_fill_manual(values = c("skyblue", "blue"))

As the plot shows, no missing values remain once these classes are removed.
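The result can also be confirmed without a plot. This is a small sketch in base R, assuming the classes to drop are derived from the data rather than typed by hand; `incomplete_classes` and `soy_complete` are illustrative names.

```r
library(mlbench)  # Soybean data

data(Soybean)

# Classes that contain at least one incomplete row
incomplete_classes <- unique(Soybean$Class[!complete.cases(Soybean)])
as.character(incomplete_classes)

# Remove those classes and drop their now-empty factor levels
soy_complete <- droplevels(Soybean[!Soybean$Class %in% incomplete_classes, ])

# FALSE means no missing values remain
anyNA(soy_complete)

# Number of outcome classes before and after the filter
c(before = nlevels(Soybean$Class), after = nlevels(soy_complete$Class))
```

Comparing the class counts before and after makes the cost of this strategy explicit: the missing values disappear, but so do the classes that contained them.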