### 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(mlbench) 
library(purrr)
library(corrplot)
## corrplot 0.94 loaded
data(Glass) 
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Glass Types")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots")

Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot() 
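
To complement the correlation plot, the pairwise correlations can also be listed numerically so the strongest relationships are easy to read off. The following is a minimal sketch in base R; the object names `cor_mat` and `cor_pairs` are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the nine numeric predictors (Type is a factor and is excluded)
cor_mat <- cor(Glass[, sapply(Glass, is.numeric)])

# Keep each unordered pair once by taking the upper triangle
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 5)
```

Sorting by absolute value keeps strong negative relationships near the top alongside strong positive ones.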

(b) Do there appear to be any outliers in the data? Are any predictors skewed?
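
One way to back up a visual reading of the histograms and boxplots is to compute a skewness statistic and an outlier count for each predictor. The sketch below is an assumption about how this could be done, not part of the original analysis: the helper names `skewness` and `n_outliers` are hypothetical, skewness is computed in its simple moment form, and outliers are counted with the 1.5 × IQR rule that boxplot whiskers use.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, sapply(Glass, is.numeric)]

# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 3/2 (both divided by n)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Number of points lying more than 1.5 * IQR beyond the quartiles
n_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

summary_tbl <- data.frame(
  skewness   = sapply(predictors, skewness),
  n_outliers = sapply(predictors, n_outliers)
)

# Most skewed predictors first
summary_tbl[order(-abs(summary_tbl$skewness)), ]
```

For a predictor whose interquartile range is zero, this rule counts every value that differs from the quartiles as an outlier, so the counts should be read together with the plots.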

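(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

A Box-Cox transformation (with an estimated lambda) or a log transformation could reduce the skew in the positively skewed predictors. Both require strictly positive values, so predictors that contain zeros (Ba and Fe, for example) would need an offset or a Yeo-Johnson transformation instead. Centering and scaling may also help, since the predictors are on different scales.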
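
The log-and-standardize idea can be sketched in base R as follows. This is an illustration under stated assumptions rather than the analysis actually run here: the skewness cutoff of 1 is an arbitrary choice, `log1p()` is used in place of a plain log so that zero values stay defined, and the names `skewness`, `skewed`, `transformed`, and `standardized` are hypothetical.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, sapply(Glass, is.numeric)]

# Moment-based sample skewness
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Assumption: treat skewness above 1 as strongly right-skewed
skewed <- names(which(sapply(predictors, skewness) > 1))

# log1p(x) = log(1 + x), which is defined for the zero percentages
transformed <- predictors
transformed[skewed] <- lapply(predictors[skewed], log1p)

# Center each predictor to mean 0 and scale it to standard deviation 1
standardized <- as.data.frame(scale(transformed))

# Skewness before and after, for comparison
round(cbind(before = sapply(predictors, skewness),
            after  = sapply(standardized, skewness)), 2)
```

Standardizing does not change skewness, so any difference between the two columns comes from the log step alone. For an estimated power transformation, `caret::preProcess()` accepts `method = "BoxCox"` or `method = "YeoJohnson"` together with `"center"` and `"scale"`.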

### 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)  ## See ?Soybean for details

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# Faceted bar charts (geom_bar + facet_wrap) are hard to read with 35 predictors,
# so flag near-zero-variance predictors with caret::nearZeroVar() instead
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
index <- nearZeroVar(Soybean)
colnames(Soybean)[index]
## [1] "leaf.mild" "mycelium"  "sclerotia"
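
To see why a predictor gets flagged, the two quantities behind the near-zero-variance check can be computed by hand: the ratio of the most common level's frequency to the second most common, and the percentage of distinct values. The sketch below is a base R approximation of that check, using the default cutoffs of `caret::nearZeroVar()` (`freqCut = 95/5`, `uniqueCut = 10`); the helper names `freq_ratio` and `pct_unique` are hypothetical.

```r
library(mlbench)
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# Count of the most common level divided by the count of the runner-up
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  # Only one observed level: treat the ratio as infinite
  if (length(counts) < 2 || counts[2] == 0) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Distinct observed values as a percentage of the number of rows
pct_unique <- function(x) 100 * length(unique(x[!is.na(x)])) / length(x)

metrics <- data.frame(
  freq_ratio = sapply(predictors, freq_ratio),
  pct_unique = sapply(predictors, pct_unique)
)

# Flag predictors dominated by one level and with few distinct values
flagged <- metrics[metrics$freq_ratio > 95 / 5 & metrics$pct_unique <= 10, ]
flagged[order(-flagged$freq_ratio), ]

# Raw frequency table for one flagged predictor, including missing values
table(Soybean$mycelium, useNA = "ifany")
```

Printing the raw frequency table alongside the metrics makes the imbalance concrete for any single predictor.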

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Proportion of missing values in each column, largest first
missing_proportions <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)

missing_df <- data.frame(variable = names(missing_proportions), missing_proportion = missing_proportions)

ggplot(missing_df, aes(x = reorder(variable, missing_proportion), y = missing_proportion)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Proportion of Missing Values",
       x = "Variable",
       y = "Proportion") +
  theme_minimal()

# Count the rows with at least one missing value, by class
classes_w_missing <- Soybean %>%
  filter_all(any_vars(is.na(.))) %>%
  select(Class) %>%
  group_by(Class) %>%
  summarise(count = n())
classes_w_missing
## # A tibble: 5 × 2
##   Class                       count
##   <fct>                       <int>
## 1 2-4-d-injury                   16
## 2 cyst-nematode                  14
## 3 diaporthe-pod-&-stem-blight    15
## 4 herbicide-injury                8
## 5 phytophthora-rot               68
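
The counts above are easier to interpret next to each class's total size. The following sketch, which is an addition and uses the hypothetical names `class_missing`, `n_rows`, `n_incomplete`, and `prop_incomplete`, computes the share of incomplete rows per class.

```r
library(mlbench)
library(dplyr)
data(Soybean)

class_missing <- Soybean %>%
  # TRUE for rows that contain at least one NA
  mutate(incomplete = !complete.cases(Soybean)) %>%
  group_by(Class) %>%
  summarise(
    n_rows          = n(),
    n_incomplete    = sum(incomplete),
    prop_incomplete = mean(incomplete)
  ) %>%
  # Keep only classes that have some missing data
  filter(n_incomplete > 0) %>%
  arrange(desc(prop_incomplete))

class_missing
```

A class with `prop_incomplete` equal to 1 has no complete rows at all, which would indicate that missingness is tied to the class label rather than occurring at random.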

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

# For each class with incomplete rows, list the predictors that are
# missing in every row of that class
for (cls in as.character(classes_w_missing$Class)) {
  class_rows <- Soybean %>%
    filter(Class == cls)

  # Proportion of missing values per column within this class
  class_missing_prop <- colMeans(is.na(class_rows))
  missing_100 <- names(class_missing_prop[class_missing_prop == 1])

  print(paste("Class:", cls))
  if (length(missing_100) > 0) {
    print(paste("Variables with 100% missing:", paste(missing_100, collapse = ", ")))
  } else {
    print("No variables with 100% missing values.")
  }
}
## [1] "Class: 2-4-d-injury"
## [1] "Variables with 100% missing: plant.stand, precip, temp, hail, crop.hist, sever, seed.tmt, germ, plant.growth, leaf.shread, leaf.mild, stem, lodging, stem.cankers, canker.lesion, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, fruit.pods, fruit.spots, seed, mold.growth, seed.discolor, seed.size, shriveling, roots"
## [1] "Class: cyst-nematode"
## [1] "Variables with 100% missing: plant.stand, precip, temp, hail, sever, seed.tmt, germ, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, stem.cankers, canker.lesion, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, fruit.spots, seed.discolor, shriveling"
## [1] "Class: diaporthe-pod-&-stem-blight"
## [1] "Variables with 100% missing: hail, sever, seed.tmt, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, roots"
## [1] "Class: herbicide-injury"
## [1] "Variables with 100% missing: precip, hail, sever, seed.tmt, germ, leaf.mild, lodging, stem.cankers, canker.lesion, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, fruit.spots, seed, mold.growth, seed.discolor, seed.size, shriveling"
## [1] "Class: phytophthora-rot"
## [1] "No variables with 100% missing values."
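
One possible imputation strategy is to replace each missing value with the most frequent level of its predictor (mode imputation). The sketch below is only an illustration of that option, not a recommendation drawn from the output above; `impute_mode` and `soy_imputed` are hypothetical names.

```r
library(mlbench)
data(Soybean)

# Replace the NAs in a factor with its most frequent observed level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() ignores NA
  x[is.na(x)] <- mode_level
  x
}

# Impute every predictor; the outcome column Class is left untouched
predictor_cols <- setdiff(names(Soybean), "Class")
soy_imputed <- Soybean
soy_imputed[predictor_cols] <- lapply(Soybean[predictor_cols], impute_mode)

# Number of missing cells before and after imputation
sum(is.na(Soybean))
sum(is.na(soy_imputed))
```

Because several predictors are missing for every row of some classes, mode imputation fills those rows with values borrowed from other classes. Alternatives worth weighing include dropping the affected predictors or encoding missingness as its own factor level, for example with `addNA()`.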