library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
library(e1071)
library(lattice)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
The UC Irvine Machine Learning Repository contains a dataset related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
#View(Glass)
glass_data <- Glass %>%
select(-Type) %>%
pivot_longer(everything(), names_to = "predictors", values_to = "values")
#histograms for each predictor
glass_data %>%
ggplot(aes(x = values)) +
geom_histogram()+
facet_wrap(~predictors, scales = "free") +
theme_minimal() +
labs(title = "Histograms of Glass Dataset Predictors")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#correlation plot
Glass %>%
select(1:9) %>%
ggpairs()
Based on the histograms, the data for each predictor is not
normally distributed. Most of the predictors are right-skewed. “Mg”
predictor looks bimodal.
glass_data |>
ggplot(aes(x = values)) +
geom_boxplot() +
facet_wrap(~ predictors, scales = "free") +
labs(title = "Distributions of Glass Predictors")
#need to remove the "type" column from the dataset to calculate skewness or pivot back to wide format
glass <- Glass %>%
select(-Type)
skewValues <- apply(glass, 2, skewness)
head(skewValues) # not sure why it is not calculating skewness for Ca, Ba, and Fe
## RI Na Mg Al Si K
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889
Each predictor has a lot of outliers, except Mg.
Yes, I would perform a few different transformations. Boxcox transformation, log transformation, scaling.
glass_trans <- preProcess(glass, method = c("BoxCox", "center", "scale"))
head(glass_trans)
## $dim
## [1] 214 9
##
## $bc
## $bc$RI
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.511 1.517 1.518 1.518 1.519 1.534
##
## Largest/Smallest: 1.02
## Sample Skewness: 1.6
##
## Estimated Lambda: -2
##
##
## $bc$Na
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.73 12.91 13.30 13.41 13.82 17.38
##
## Largest/Smallest: 1.62
## Sample Skewness: 0.448
##
## Estimated Lambda: -0.1
## With fudge factor, Lambda = 0 will be used for transformations
##
##
## $bc$Al
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 1.190 1.360 1.445 1.630 3.500
##
## Largest/Smallest: 12.1
## Sample Skewness: 0.895
##
## Estimated Lambda: 0.5
##
##
## $bc$Si
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 69.81 72.28 72.79 72.65 73.09 75.41
##
## Largest/Smallest: 1.08
## Sample Skewness: -0.72
##
## Estimated Lambda: 2
##
##
## $bc$Ca
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.430 8.240 8.600 8.957 9.172 16.190
##
## Largest/Smallest: 2.98
## Sample Skewness: 2.02
##
## Estimated Lambda: -1.1
##
##
##
## $yj
## NULL
##
## $et
## NULL
##
## $invHyperbolicSine
## NULL
##
## $mean
## RI Na Mg Al Si K
## 2.831185e-01 2.594009e+00 2.684533e+00 3.684509e-01 2.638878e+03 4.970561e-01
## Ca Ba Fe
## 8.256036e-01 1.750467e-01 5.700935e-02
glass_transformed <- predict(glass_trans, newdata = glass)
glass_transformed %>%
pivot_longer(everything(), names_to = "predictors", values_to = "values") %>%
ggplot(aes(x = values)) +
geom_histogram()+
facet_wrap(~predictors, scales = "free") +
theme_minimal() +
labs(title = "Histograms of Glass Dataset Transformed Predictors")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
After the transformations, the variables appear to be better
distributed.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions(e.g.,temperature,precipitation)and plant conditions(e.g.,left spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
## See ?Soybean for details
?Soybean
## Help on topic 'Soybean' was found in the following packages:
##
## Package Library
## mlbench /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## nlme /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##
##
## Using the first match ...
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
#A degenerate distribution happens when a variable has very little variability
Soybean |>
select(-Class) |>
mutate(across(where(is.factor), as.character)) |>
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
group_by(Predictor, Value) |>
summarize(Count = n()) |>
ggplot(mapping = aes(x = Value, y = Count)) +
geom_col() +
facet_wrap(~Predictor, scales = "free") +
labs(
title = "Distribution of Predictors"
)
## `summarise()` has grouped output by 'Predictor'. You can override using the
## `.groups` argument.
Several predictors show degenerate distribution,meaning that one
category occurring far more frequently than others. For some of the
predictors there is a high proportion of the NA values.
mean(is.na(Soybean)) * 100
## [1] 9.504636
missing_perc <- colSums(is.na(Soybean)) / nrow(Soybean) * 100
sort(missing_perc, decreasing = TRUE)
## hail sever seed.tmt lodging germ
## 17.7159590 17.7159590 17.7159590 17.7159590 16.3982430
## leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
## 15.8125915 15.5197657 15.5197657 15.5197657 15.5197657
## leaf.shread seed mold.growth seed.size leaf.halo
## 14.6412884 13.4699854 13.4699854 13.4699854 12.2986823
## leaf.marg leaf.size leaf.malf fruit.pods precip
## 12.2986823 12.2986823 12.2986823 12.2986823 5.5636896
## stem.cankers canker.lesion ext.decay mycelium int.discolor
## 5.5636896 5.5636896 5.5636896 5.5636896 5.5636896
## sclerotia plant.stand roots temp crop.hist
## 5.5636896 5.2708638 4.5387994 4.3923865 2.3426061
## plant.growth stem date area.dam Class
## 2.3426061 2.3426061 0.1464129 0.1464129 0.0000000
## leaves
## 0.0000000
missing_data <- Soybean %>%
select(-Class) %>%
mutate_all(~ is.na(.))
missing_data$Class <- Soybean$Class
missing_by_class_all <- missing_data %>%
gather(key = "predictor", value = "missing", -Class) %>%
group_by(Class, predictor) %>%
summarise(missing_percentage = mean(missing) * 100) %>%
ungroup()
## `summarise()` has grouped output by 'Class'. You can override using the
## `.groups` argument.
# View the results
head(missing_by_class_all)
## # A tibble: 6 × 3
## Class predictor missing_percentage
## <fct> <chr> <dbl>
## 1 2-4-d-injury area.dam 6.25
## 2 2-4-d-injury canker.lesion 100
## 3 2-4-d-injury crop.hist 100
## 4 2-4-d-injury date 6.25
## 5 2-4-d-injury ext.decay 100
## 6 2-4-d-injury fruit.pods 100
ggplot(missing_by_class_all, aes(x = Class, y = missing_percentage, fill = predictor)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip()+
labs(title = "Missingness of Predictors by Class",
x = "Class",
y = "Percentage of Missing Data") +
theme_minimal()
It does appear that the missng data is releated to class, and
particualrly to “phytophthora-rot”, “herbicide-injury”,
“diaporther-pod-stem-blight”, “cyst-nematode”,
“2-4-d-injury”
We can eliminate degenerate variables since they don’t have any variations across categories. For the predictors with low missigness, we can impute the missing values using the mode. And I think we would need to separately explore the missing data in those specific classes and use multiple imputations.