library(mlbench)
library(AppliedPredictiveModeling)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DT)
library(grid)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(e1071)
Question 3.1
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
A.)
pred_columns <- colnames(Glass)
pred_columns <- pred_columns[-10]
glass_predictors <- Glass[,-10]
## Plot all histograms to see the distribution of predictor variables
par(mfrow = c(5,2))
par(mar = c(2,2,2,2))
for (i in col(Glass[1,])) {
title <- paste("Histogram of",colnames(Glass)[i], "values", sep = " ")
hist(as.numeric(Glass[,i]), xlab= colnames(Glass)[i], main =title)
}

## For easier visualization, lets pivot data to be long rather than wide.
long_glass <- glass_predictors %>%
pivot_longer(cols = RI:Fe, names_to = "name")
## Plot boxplots of predictor distribution
long_glass %>%
ggplot(aes(y= value)) +
geom_boxplot() +
facet_wrap(~ name, scales = "free_y")

Many of the plots do not appear to have a normal distribution.
## create a correlation plot of all the predictors
pred_corr <- cor(glass_predictors)
ggcorrplot::ggcorrplot(pred_corr, hc.order = TRUE, lab = TRUE) +
labs(title = "Correlation Between Predictors for Type of Glass")

Using a corr plot we can visualize the relationships between the
predictors. Most predictors have negative or no correlation with other
predictors. We can see that calcium has a strong positive correlation
with the refractive index. Silicon has a strong negative corelation with
the refractive index. There is a strong negative correlation between
Magnesium and aluminium and magnesium and barium.
B.)
There are a variety of outliers in the data. We can see that
potassium generally falls within 1 unit, but there are several points
far out of the range.Iron has a range of 0 to 0.1 with outliers from
0.25 to 0.5. Most predictors do not follow a normal distribution and are
skewed. The skewness function can be used to determine the magnitude of
the skewness.
## Calculate skewness values for all predictors
glass_skewness <- apply(glass_predictors, 2,skewness)
glass_skewness
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
By calculating skewness values, we can see that Na, Al and Si are
the least skewed, with values that fall between 1 and -1.All other
predictors are classified as being highly skewed as they have values
over 1. A box cox transformation to help with skewness would be
recommmended.
C.)
By applying only a boxcox transformation to the data, many columns
remain almost unchanged. The most significant impovement in skewness is
to the Ca and Al distribution which became far less skewed. According to
the textbook, other transormations such as centering, scaling, and
applying pca can improve skewness numbers. Below I have applied these
transformations.
trans_glass <-preProcess(glass_predictors, method = c("BoxCox","center", "scale", "pca"))
trans_glass
## Created from 214 samples and 9 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - centered (9)
## - ignored (0)
## - principal component signal extraction (9)
## - scaled (9)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance
transformed <-predict(trans_glass,glass_predictors)
## Plot all histograms to see the distribution of predictor variables
par(mfrow = c(5,2))
par(mar = c(2,2,2,2))
for (i in col(transformed[1,])) {
title <- paste("Histogram of",colnames(transformed)[i], "values", sep = " ")
hist(as.numeric(transformed[,i]), xlab= colnames(transformed)[i], main =title)
}

We can see that by applying the standard transformations, outlined
by the textbook (boxcox, centering, scaling, and pca) we have created
more normal distributions, and removed complexity of model through
PCA.
apply(transformed, 2,skewness)
## PC1 PC2 PC3 PC4 PC5 PC6
## 0.07032755 1.29194177 -2.44424443 0.04517354 0.24376825 -0.15801104
## PC7
## -1.74132708
Although some PCs are still skewed, the majority of them fall
between -1 and 1, which is only moderate skewness. Previously, the
largest amount of skewness to the predictors was 6.4, while after
transformations it is about -2.4
Question 3.2
library(mlbench)
data(Soybean)
## Determine the categorical predictors:
A.)
## Pivot data to longer format so that it can be visualized easier with ggplot
longer_soybean <- Soybean[2:36] %>% mutate_if(is.factor,as.numeric) %>% pivot_longer(cols = everything())
longer_soybean
## # A tibble: 23,905 × 2
## name value
## <chr> <dbl>
## 1 date 7
## 2 plant.stand 1
## 3 precip 3
## 4 temp 2
## 5 hail 1
## 6 crop.hist 2
## 7 area.dam 2
## 8 sever 2
## 9 seed.tmt 1
## 10 germ 1
## # ℹ 23,895 more rows
## create a list of the column names so that they can be selected for visualizations
col_names <- as.list(longer_soybean["name"] %>% unique())
longer_soybean %>%
filter(name %in% col_names$name[1:18]) %>%
ggplot(aes(x= value)) +
geom_histogram(stat = "count") +
facet_wrap(vars(name),)
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
## Warning: Removed 1157 rows containing non-finite values (`stat_count()`).

longer_soybean %>%
filter(name %in% col_names$name[18:38]) %>%
ggplot(aes(x= value)) +
geom_histogram(stat = "count") +
facet_wrap(vars(name),)
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
## Warning: Removed 1288 rows containing non-finite values (`stat_count()`).

While it doesn’t seem that the dataset has any zero variance
predictors, there are some that may be near-zero variance predictors.
Predictors such as lodging, sclerotia, mycelium, and shriveling look to
be near-zero variance predictors.
B.)
sum(is.na(Soybean))
## [1] 2337
The total misssing values in the soybean dataset is 2337
total_missing <- as.data.frame(colSums(is.na(Soybean)))
total_missing %>% arrange(desc(`colSums(is.na(Soybean))`))
## colSums(is.na(Soybean))
## hail 121
## sever 121
## seed.tmt 121
## lodging 121
## germ 112
## leaf.mild 108
## fruiting.bodies 106
## fruit.spots 106
## seed.discolor 106
## shriveling 106
## leaf.shread 100
## seed 92
## mold.growth 92
## seed.size 92
## leaf.halo 84
## leaf.marg 84
## leaf.size 84
## leaf.malf 84
## fruit.pods 84
## precip 38
## stem.cankers 38
## canker.lesion 38
## ext.decay 38
## mycelium 38
## int.discolor 38
## sclerotia 38
## plant.stand 36
## roots 31
## temp 30
## crop.hist 16
## plant.growth 16
## stem 16
## date 1
## area.dam 1
## Class 0
## leaves 0
We can see that several groups have over 100 value missing. The
hail, sever, seed.tmt, lodging and germ contain the most missing
values.
## Look at the missing values that are in the Soybean dataset
missing_soybean <- Soybean %>% group_by(Class) %>% summarise(across(date:roots, ~sum(is.na(.))))
missing_soybean
## # A tibble: 19 × 36
## Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 2-4-d… 1 16 16 16 16 16 1 16 16
## 2 alter… 0 0 0 0 0 0 0 0 0
## 3 anthr… 0 0 0 0 0 0 0 0 0
## 4 bacte… 0 0 0 0 0 0 0 0 0
## 5 bacte… 0 0 0 0 0 0 0 0 0
## 6 brown… 0 0 0 0 0 0 0 0 0
## 7 brown… 0 0 0 0 0 0 0 0 0
## 8 charc… 0 0 0 0 0 0 0 0 0
## 9 cyst-… 0 14 14 14 14 0 0 14 14
## 10 diapo… 0 6 0 0 15 0 0 15 15
## 11 diapo… 0 0 0 0 0 0 0 0 0
## 12 downy… 0 0 0 0 0 0 0 0 0
## 13 frog-… 0 0 0 0 0 0 0 0 0
## 14 herbi… 0 0 8 0 8 0 0 8 8
## 15 phyll… 0 0 0 0 0 0 0 0 0
## 16 phyto… 0 0 0 0 68 0 0 68 68
## 17 powde… 0 0 0 0 0 0 0 0 0
## 18 purpl… 0 0 0 0 0 0 0 0 0
## 19 rhizo… 0 0 0 0 0 0 0 0 0
## # ℹ 26 more variables: germ <int>, plant.growth <int>, leaves <int>,
## # leaf.halo <int>, leaf.marg <int>, leaf.size <int>, leaf.shread <int>,
## # leaf.malf <int>, leaf.mild <int>, stem <int>, lodging <int>,
## # stem.cankers <int>, canker.lesion <int>, fruiting.bodies <int>,
## # ext.decay <int>, mycelium <int>, int.discolor <int>, sclerotia <int>,
## # fruit.pods <int>, fruit.spots <int>, seed <int>, mold.growth <int>,
## # seed.discolor <int>, seed.size <int>, shriveling <int>, roots <int>
The phytophthora-rot class accounts for 602 missing values
(602 / 2337) * 100
## [1] 25.75952
This accounts for approximately 26 % of the total missing
values.
C.)
By following the guidelines laid out by the textbook, I would first
look at the relationships between the predictors. If the predictor is
identified as a “near zero” variance predictor, then it would make sense
to simply remove it, as it may improve model performance anyway.
Following that I would look at correlation between predictors. If there
is a strong enough corellation between two predictors, I would remove
the one with the problematic missing values. If these options were not
available, I would impute the missing values, using a common method,
such as the K-nearest model.