library(mlbench)
library(AppliedPredictiveModeling)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DT)
library(grid)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(e1071)

Question 3.1

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

A.)

pred_columns <- colnames(Glass)
pred_columns  <- pred_columns[-10]
glass_predictors <- Glass[,-10]
## Plot all histograms to see the distribution of predictor variables
par(mfrow = c(5,2))
par(mar = c(2,2,2,2))
for (i in col(Glass[1,])) {
  title <- paste("Histogram of",colnames(Glass)[i], "values", sep = " ") 
  hist(as.numeric(Glass[,i]), xlab= colnames(Glass)[i], main =title) 
}

## For easier visualization, lets pivot data to be long rather than wide.
long_glass <- glass_predictors %>%
  pivot_longer(cols = RI:Fe, names_to = "name")

## Plot boxplots of predictor distribution
long_glass %>%
  ggplot(aes(y= value)) +
  geom_boxplot() + 
  facet_wrap(~ name, scales = "free_y")

Many of the plots do not appear to have a normal distribution.

## create a correlation plot of all the predictors
pred_corr <- cor(glass_predictors)
ggcorrplot::ggcorrplot(pred_corr, hc.order = TRUE, lab = TRUE) + 
  labs(title = "Correlation Between Predictors for Type of Glass")

Using a corr plot we can visualize the relationships between the predictors. Most predictors have negative or no correlation with other predictors. We can see that calcium has a strong positive correlation with the refractive index. Silicon has a strong negative corelation with the refractive index. There is a strong negative correlation between Magnesium and aluminium and magnesium and barium.

B.)

There are a variety of outliers in the data. We can see that potassium generally falls within 1 unit, but there are several points far out of the range.Iron has a range of 0 to 0.1 with outliers from 0.25 to 0.5. Most predictors do not follow a normal distribution and are skewed. The skewness function can be used to determine the magnitude of the skewness.

## Calculate skewness values for all predictors

glass_skewness <- apply(glass_predictors, 2,skewness)
glass_skewness
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

By calculating skewness values, we can see that Na, Al and Si are the least skewed, with values that fall between 1 and -1.All other predictors are classified as being highly skewed as they have values over 1. A box cox transformation to help with skewness would be recommmended.

C.)

A box-cox transformation is the first one that I would suggest.

glass_normalized <- preProcess(glass_predictors, method = "BoxCox")
glass_normalized_predict <- predict(glass_normalized, glass_predictors)
apply(glass_normalized_predict, 2, skewness)
##          RI          Na          Mg          Al          Si           K 
##  1.56566039  0.03384644 -1.13645228  0.09105899 -0.65090568  6.46008890 
##          Ca          Ba          Fe 
## -0.19395573  3.36867997  1.72981071

By applying only a boxcox transformation to the data, many columns remain almost unchanged. The most significant impovement in skewness is to the Ca and Al distribution which became far less skewed. According to the textbook, other transormations such as centering, scaling, and applying pca can improve skewness numbers. Below I have applied these transformations.

trans_glass <-preProcess(glass_predictors, method = c("BoxCox","center", "scale", "pca"))
trans_glass
## Created from 214 samples and 9 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (0)
##   - principal component signal extraction (9)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance
transformed <-predict(trans_glass,glass_predictors)
## Plot all histograms to see the distribution of predictor variables
par(mfrow = c(5,2))
par(mar = c(2,2,2,2))
for (i in col(transformed[1,])) {
  title <- paste("Histogram of",colnames(transformed)[i], "values", sep = " ") 
  hist(as.numeric(transformed[,i]), xlab= colnames(transformed)[i], main =title) 
}

We can see that by applying the standard transformations, outlined by the textbook (boxcox, centering, scaling, and pca) we have created more normal distributions, and removed complexity of model through PCA.

apply(transformed, 2,skewness)
##         PC1         PC2         PC3         PC4         PC5         PC6 
##  0.07032755  1.29194177 -2.44424443  0.04517354  0.24376825 -0.15801104 
##         PC7 
## -1.74132708

Although some PCs are still skewed, the majority of them fall between -1 and 1, which is only moderate skewness. Previously, the largest amount of skewness to the predictors was 6.4, while after transformations it is about -2.4

Question 3.2

library(mlbench)
data(Soybean)
## Determine the categorical predictors:

A.)

## Pivot data to longer format so that it can be visualized easier with ggplot
longer_soybean <- Soybean[2:36] %>% mutate_if(is.factor,as.numeric) %>% pivot_longer(cols = everything())
longer_soybean
## # A tibble: 23,905 × 2
##    name        value
##    <chr>       <dbl>
##  1 date            7
##  2 plant.stand     1
##  3 precip          3
##  4 temp            2
##  5 hail            1
##  6 crop.hist       2
##  7 area.dam        2
##  8 sever           2
##  9 seed.tmt        1
## 10 germ            1
## # ℹ 23,895 more rows
## create a list of the column names so that they can be selected for visualizations
col_names <- as.list(longer_soybean["name"] %>% unique())
longer_soybean %>% 
  filter(name %in% col_names$name[1:18]) %>%
    ggplot(aes(x= value)) +
      geom_histogram(stat = "count") + 
      facet_wrap(vars(name),)
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
## Warning: Removed 1157 rows containing non-finite values (`stat_count()`).

longer_soybean %>% 
  filter(name %in% col_names$name[18:38]) %>%
    ggplot(aes(x= value)) +
      geom_histogram(stat = "count") + 
      facet_wrap(vars(name),)
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
## Warning: Removed 1288 rows containing non-finite values (`stat_count()`).

While it doesn’t seem that the dataset has any zero variance predictors, there are some that may be near-zero variance predictors. Predictors such as lodging, sclerotia, mycelium, and shriveling look to be near-zero variance predictors.

B.)

sum(is.na(Soybean))
## [1] 2337

The total misssing values in the soybean dataset is 2337

total_missing <- as.data.frame(colSums(is.na(Soybean)))
total_missing %>% arrange(desc(`colSums(is.na(Soybean))`))
##                 colSums(is.na(Soybean))
## hail                                121
## sever                               121
## seed.tmt                            121
## lodging                             121
## germ                                112
## leaf.mild                           108
## fruiting.bodies                     106
## fruit.spots                         106
## seed.discolor                       106
## shriveling                          106
## leaf.shread                         100
## seed                                 92
## mold.growth                          92
## seed.size                            92
## leaf.halo                            84
## leaf.marg                            84
## leaf.size                            84
## leaf.malf                            84
## fruit.pods                           84
## precip                               38
## stem.cankers                         38
## canker.lesion                        38
## ext.decay                            38
## mycelium                             38
## int.discolor                         38
## sclerotia                            38
## plant.stand                          36
## roots                                31
## temp                                 30
## crop.hist                            16
## plant.growth                         16
## stem                                 16
## date                                  1
## area.dam                              1
## Class                                 0
## leaves                                0

We can see that several groups have over 100 value missing. The hail, sever, seed.tmt, lodging and germ contain the most missing values.

## Look at the missing values that are in the Soybean dataset

missing_soybean <- Soybean %>% group_by(Class) %>% summarise(across(date:roots, ~sum(is.na(.))))

missing_soybean
## # A tibble: 19 × 36
##    Class   date plant.stand precip  temp  hail crop.hist area.dam sever seed.tmt
##    <fct>  <int>       <int>  <int> <int> <int>     <int>    <int> <int>    <int>
##  1 2-4-d…     1          16     16    16    16        16        1    16       16
##  2 alter…     0           0      0     0     0         0        0     0        0
##  3 anthr…     0           0      0     0     0         0        0     0        0
##  4 bacte…     0           0      0     0     0         0        0     0        0
##  5 bacte…     0           0      0     0     0         0        0     0        0
##  6 brown…     0           0      0     0     0         0        0     0        0
##  7 brown…     0           0      0     0     0         0        0     0        0
##  8 charc…     0           0      0     0     0         0        0     0        0
##  9 cyst-…     0          14     14    14    14         0        0    14       14
## 10 diapo…     0           6      0     0    15         0        0    15       15
## 11 diapo…     0           0      0     0     0         0        0     0        0
## 12 downy…     0           0      0     0     0         0        0     0        0
## 13 frog-…     0           0      0     0     0         0        0     0        0
## 14 herbi…     0           0      8     0     8         0        0     8        8
## 15 phyll…     0           0      0     0     0         0        0     0        0
## 16 phyto…     0           0      0     0    68         0        0    68       68
## 17 powde…     0           0      0     0     0         0        0     0        0
## 18 purpl…     0           0      0     0     0         0        0     0        0
## 19 rhizo…     0           0      0     0     0         0        0     0        0
## # ℹ 26 more variables: germ <int>, plant.growth <int>, leaves <int>,
## #   leaf.halo <int>, leaf.marg <int>, leaf.size <int>, leaf.shread <int>,
## #   leaf.malf <int>, leaf.mild <int>, stem <int>, lodging <int>,
## #   stem.cankers <int>, canker.lesion <int>, fruiting.bodies <int>,
## #   ext.decay <int>, mycelium <int>, int.discolor <int>, sclerotia <int>,
## #   fruit.pods <int>, fruit.spots <int>, seed <int>, mold.growth <int>,
## #   seed.discolor <int>, seed.size <int>, shriveling <int>, roots <int>

The phytophthora-rot class accounts for 602 missing values

(602 / 2337) * 100
## [1] 25.75952

This accounts for approximately 26 % of the total missing values.

C.)

By following the guidelines laid out by the textbook, I would first look at the relationships between the predictors. If the predictor is identified as a “near zero” variance predictor, then it would make sense to simply remove it, as it may improve model performance anyway. Following that I would look at correlation between predictors. If there is a strong enough corellation between two predictors, I would remove the one with the problematic missing values. If these options were not available, I would impute the missing values, using a common method, such as the K-nearest model.