Data 624 Homework 4

library(tidyverse)
library(mlbench)
library(kableExtra)
library(reactable)
library(GGally)
library(caret)
library(e1071)
library(univOutl)
library(moments)
library(outliers)
library(cowplot)
library(mice)
library(VIM)

3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

set.seed(1234)

data("Glass")

str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

reactable(Glass)

ggpairs(Glass, columns=1:9, title = "Predictor Variables") +
  theme(plot.title = element_text(hjust=0.4))

predictor_variables <- Glass[1:9]

predictor_variables |>
  cor() |>
  corrplot::corrplot(method = "number")

The correlation matrix shows a strong positive pairwise correlation between RI and Ca, which is the clearest sign of collinearity among the predictors.
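To make the correlation check concrete, the following base-R sketch lists predictor pairs whose absolute correlation exceeds a cutoff. The helper name `high_cor_pairs`, the 0.75 cutoff, and the toy data frame are assumptions for illustration only; in this analysis the function would be applied to `predictor_variables`.

```r
# List variable pairs whose absolute Pearson correlation exceeds a cutoff.
# `high_cor_pairs` and the 0.75 cutoff are illustrative assumptions.
high_cor_pairs <- function(df, cutoff = 0.75) {
  cm <- cor(df)
  # Look only at the upper triangle so each pair is reported once.
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data: b is an exact linear function of a; c is only weakly related to a.
toy <- data.frame(
  a = 1:10,
  b = 2 * (1:10) + 1,
  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)
pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

Calling `high_cor_pairs(predictor_variables)` would return the same kind of table for the Glass predictors, which is easier to scan than reading values off the correlation plot.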

predictor_variables |>
  cor() |>
  corrplot::corrplot(method = "square")

predictor_variables |>
  gather() |> 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()+
  ggtitle("Histogram of Predictor Variables") +
  theme(plot.title = element_text(hjust=0.5))

predictor_variables %>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_density() +
  facet_wrap(~key, scales = 'free')+
  ggtitle("Density Plot of Predictor Variables") +
  theme(plot.title = element_text(hjust=0.5))

predictor_variables %>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free')+
  ggtitle("Box Plot of Predictor Variables") +
  theme(plot.title = element_text(hjust=0.5))

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

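# Column-wise skewness; moments:: is explicit because e1071 also exports skewness()
skew_var <- predictor_variables |>
  moments::skewness()
kbl(skew_var) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                latex_options = "scale_down", full_width = FALSE)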

	x
RI	1.6140150
Na	0.4509917
Mg	-1.1444648
Al	0.9009179
Si	-0.7253173
K	6.5056358
Ca	2.0326774
Ba	3.3924309
Fe	1.7420068

Every predictor shows at least some skewness. K has the largest positive skewness (6.51), followed by Ba (3.39), Ca (2.03), Fe (1.74), and RI (1.61). Mg (-1.14) and Si (-0.73) are left-skewed. Na (0.45), Si, and Al (0.90) have the skewness values closest to zero, so they are the closest to symmetric, although a small skewness alone does not establish normality. In terms of outliers, the box plots show points beyond the whiskers for every predictor except Mg. K has the most extreme outlier of all the predictors.
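The two checks used in this answer, skewness and box-plot outliers, can be sketched numerically in base R. The helpers `skew` and `n_outliers` and the toy data are assumptions for illustration. `skew` uses a common moment-based definition, and whether it matches the `skewness()` call above to every digit is not verified here.

```r
# Moment-based sample skewness: m3 / m2^(3/2), using divide-by-n moments.
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Count the points a box plot would draw as outliers
# (more than 1.5 * IQR beyond the hinges).
n_outliers <- function(x) length(boxplot.stats(x)$out)

# Toy data: `sym` is symmetric, `right` has one large value in the upper tail.
toy <- data.frame(sym = c(1, 2, 3, 4, 5), right = c(1, 1, 2, 2, 20))
summary_tbl <- data.frame(
  skewness = sapply(toy, skew),
  outliers = sapply(toy, n_outliers)
)
print(summary_tbl)
```

Applying the same two `sapply()` calls to `predictor_variables` would put the skewness and the outlier count for each Glass predictor side by side.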

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

I will apply a Box-Cox transformation, together with centering and scaling, to the predictor variables and check whether it reduces their skewness, which could help the classification model:

trans <- preProcess(predictor_variables, method = c("BoxCox", "center", "scale"))

transformed <- predict(trans, predictor_variables)

transformed |>
  gather() |> 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()+
  ggtitle("Histogram of Transformed Predictor Variables") +
  theme(plot.title = element_text(hjust=0.5))

transformed |>
  gather() |> 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_density()+
  ggtitle("Density Plot of Transformed Predictor Variables") +
  theme(plot.title = element_text(hjust=0.5))

transformed %>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free')+
  ggtitle("Box Plot of Transformed Predictor Variables") +
  theme(plot.title = element_text(hjust=0.5))

# Skewness after the Box-Cox, center, and scale steps
trans_var <- transformed |>
  moments::skewness()
kbl(trans_var) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                latex_options = "scale_down", full_width = FALSE)

	x
RI	1.5766991
Na	0.0340851
Mg	-1.1444648
Al	0.0917010
Si	-0.6554949
K	6.5056358
Ca	-0.1953232
Ba	3.3924309
Fe	1.7420068

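# Skewness of the original predictors, repeated here for comparison
skew_var <- predictor_variables |>
  moments::skewness()
kbl(skew_var) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                latex_options = "scale_down", full_width = FALSE)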

	x
RI	1.6140150
Na	0.4509917
Mg	-1.1444648
Al	0.9009179
Si	-0.7253173
K	6.5056358
Ca	2.0326774
Ba	3.3924309
Fe	1.7420068

After the Box-Cox transformation, the predictors that improve the most are Al, Na, and Ca, whose skewness values move close to zero (0.09, 0.03, and -0.20). The histograms of these three predictors also look closer to a normal distribution. RI and Si improve only slightly (1.61 to 1.58 and -0.73 to -0.66), with no visible change in the shape of their distributions compared with the original data. Mg, K, Ba, and Fe keep exactly the same skewness as before. This is consistent with the Box-Cox transformation being defined only for strictly positive values, so predictors that contain zeros are left untransformed.
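The behavior described above can be illustrated with a small base-R sketch of the Box-Cox transform for a fixed lambda. The helper `box_cox`, the choice of lambda, and the toy vectors are assumptions for illustration. caret estimates lambda from the data, which this sketch does not attempt.

```r
# Box-Cox transform for a fixed lambda; defined only for strictly positive x.
box_cox <- function(x, lambda) {
  if (any(x <= 0, na.rm = TRUE)) {
    stop("Box-Cox requires strictly positive values")
  }
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Moment-based skewness, used only to compare before and after.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy right-skewed, strictly positive sample (powers of two).
x <- 2^(0:6)
skew_before <- skew(x)
skew_after  <- skew(box_cox(x, lambda = 0))  # lambda = 0 is the log transform

# A predictor containing zeros is rejected rather than transformed.
zero_result <- tryCatch(box_cox(c(0, 1, 2), 0.5), error = function(e) "skipped")
print(c(before = skew_before, after = skew_after))
```

For predictors with zeros, such as Ba and Fe, one option to try is `method = "YeoJohnson"` in `preProcess`, which accepts zero and negative values. Whether it would help here has not been tested.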

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

data(Soybean)
## See ?Soybean for details

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

reactable(Soybean)

Soybean %>%
  select(-Class)%>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_bar()+
  facet_wrap(~ key) +
  ggtitle("Bar Plot of Soybean Categorical Predictor Variables") +
  theme(plot.title = element_text(hjust=0.5))

nearZeroVar(Soybean, names=TRUE)

## [1] "leaf.mild" "mycelium"  "sclerotia"

According to Kuhn and Johnson, a degenerate distribution is one where a single value accounts for the vast majority of samples, or where a handful of unique values occur with very low frequencies. Based on this definition, the bar plots of the categorical predictors point to several candidates. roots and shriveling stand out as variables that are dominated by a single value. The nearZeroVar function from the caret package flags leaf.mild, mycelium, and sclerotia as near-zero-variance predictors, and their bar plots support this conclusion.
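The two quantities behind a near-zero-variance check can be sketched in base R. The helper `nzv_metrics` and the toy factors are assumptions for illustration. The default cutoffs of 95/5 and 10 mirror the documented defaults of `nearZeroVar` (`freqCut` and `uniqueCut`), but this is not caret's implementation.

```r
# Frequency ratio and percent of unique values for one categorical predictor.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)  # table() ignores NA by default
  counts <- counts[counts > 0]                 # drop unused factor levels
  # Ratio of the most common value to the second most common value.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  list(
    freq_ratio    = freq_ratio,
    pct_unique    = pct_unique,
    near_zero_var = freq_ratio > freq_cut && pct_unique < unique_cut
  )
}

# Toy predictors: one dominated by a single level, one balanced.
dominated <- factor(c(rep("0", 97), rep("1", 3)))
balanced  <- factor(rep(c("0", "1"), each = 50))
res_dominated <- nzv_metrics(dominated)
res_balanced  <- nzv_metrics(balanced)
print(res_dominated)
```

Running `lapply(Soybean[-1], nzv_metrics)` would show how far each predictor is from the cutoffs, which helps explain why a variable can look lopsided in the bar plot yet not be flagged.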

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Count missing values per column
na_soybean <- sapply(Soybean, function(y) sum(is.na(y))) %>% as.data.frame()
colnames(na_soybean) <- "NA_Count"
na_soybean$Variable <- rownames(na_soybean)
na_soybean %>%
  select(Variable, NA_Count) %>%
  arrange(desc(NA_Count)) %>%
  # Percentage of the 683 rows that are missing, rounded to whole percent
  mutate(Missing_Data_Percentage = round(NA_Count / nrow(Soybean), 2) * 100) %>%
  kable(caption = "<center>Missing Data Count and Percentage", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                latex_options = "scale_down", full_width = FALSE)

Missing Data Count and Percentage
	Variable	NA_Count	Missing_Data_Percentage
hail	hail	121	18
sever	sever	121	18
seed.tmt	seed.tmt	121	18
lodging	lodging	121	18
germ	germ	112	16
leaf.mild	leaf.mild	108	16
fruiting.bodies	fruiting.bodies	106	16
fruit.spots	fruit.spots	106	16
seed.discolor	seed.discolor	106	16
shriveling	shriveling	106	16
leaf.shread	leaf.shread	100	15
seed	seed	92	13
mold.growth	mold.growth	92	13
seed.size	seed.size	92	13
leaf.halo	leaf.halo	84	12
leaf.marg	leaf.marg	84	12
leaf.size	leaf.size	84	12
leaf.malf	leaf.malf	84	12
fruit.pods	fruit.pods	84	12
precip	precip	38	6
stem.cankers	stem.cankers	38	6
canker.lesion	canker.lesion	38	6
ext.decay	ext.decay	38	6
mycelium	mycelium	38	6
int.discolor	int.discolor	38	6
sclerotia	sclerotia	38	6
plant.stand	plant.stand	36	5
roots	roots	31	5
temp	temp	30	4
crop.hist	crop.hist	16	2
plant.growth	plant.growth	16	2
stem	stem	16	2
date	date	1	0
area.dam	area.dam	1	0
Class	Class	0	0
leaves	leaves	0	0

The predictors with the most missing values are hail, sever, seed.tmt, and lodging, each missing in 121 samples (about 18%).

# Number of rows with at least one missing value, per class
class_na <- Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  mutate(Missing_Values = n()) %>%
  ungroup() %>%
  distinct(Class, Missing_Values)

kbl(class_na) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                latex_options = "scale_down", full_width = FALSE)

Class	Missing_Values
phytophthora-rot	68
diaporthe-pod-&-stem-blight	15
cyst-nematode	14
2-4-d-injury	16
herbicide-injury	8

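Only five of the 19 classes contain rows with missing values, so the pattern of missing data is related to the class. phytophthora-rot has the most incomplete rows, 68 of the 121.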
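Raw counts of incomplete rows depend on class size, so it can help to report the share of incomplete rows per class as well. The following base-R sketch does that. The helper `incomplete_by_class` and the toy data frame are assumptions for illustration.

```r
# Rows with at least one NA, counted and expressed as a share of each class.
incomplete_by_class <- function(df, class_col) {
  cls <- factor(df[[class_col]])
  incomplete <- !complete.cases(df)
  out <- data.frame(
    n_rows       = as.vector(table(cls)),
    n_incomplete = as.vector(tapply(incomplete, cls, sum)),
    row.names    = levels(cls)
  )
  out$pct_incomplete <- 100 * out$n_incomplete / out$n_rows
  out
}

# Toy data: class A has two incomplete rows out of three, class B has none.
toy <- data.frame(
  Class = c("A", "A", "A", "B", "B"),
  x     = c(1, NA, NA, 4, 5),
  y     = c("u", "v", NA, "w", "z")
)
res <- incomplete_by_class(toy, "Class")
print(res)
```

With the real data, `incomplete_by_class(Soybean, "Class")` would show whether the five affected classes are missing values in some of their rows or in all of them.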

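# labels must come from the data set itself; names(data) would refer to the base function data()
aggr_plot <- aggr(Soybean, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(Soybean), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))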

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The nearZeroVar function from the caret package can help identify predictors that should be eliminated from the model, which can improve model performance. For imputation, the mice package can fill in missing values for the predictors used in predictive models. One method within mice is "pmm", or predictive mean matching. PMM predicts a value for each missing entry and then replaces the missing entry with an observed value taken from a case whose prediction is close. It is the default imputation method within mice. More information about this method can be found here: https://bookdown.org/mwheymans/bookmi/multiple-imputation.html
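The two-part strategy, eliminating predictors and then imputing, can be sketched in base R. This sketch uses mode imputation as a much simpler stand-in for PMM, so it does not reproduce what mice does. The helper `impute_mode`, the 0.5 threshold, and the toy data are assumptions for illustration.

```r
# Step 1: drop predictors whose missing fraction exceeds `max_missing`.
# Step 2: fill the remaining NAs in factor columns with the most frequent level.
# This is mode imputation, not predictive mean matching.
impute_mode <- function(df, max_missing = 0.5) {
  keep <- colMeans(is.na(df)) <= max_missing
  df <- df[, keep, drop = FALSE]
  df[] <- lapply(df, function(col) {
    if (is.factor(col) && anyNA(col)) {
      mode_value <- names(which.max(table(col)))  # ties resolve to the first level
      col[is.na(col)] <- mode_value
    }
    col
  })
  df
}

# Toy data: column b is 60% missing and is dropped; a and c are imputed.
toy <- data.frame(
  a = factor(c("0", "0", "1", NA, "0")),
  b = factor(c(NA, NA, NA, "1", "0")),
  c = factor(c("x", "y", "y", NA, "y"))
)
filled <- impute_mode(toy)
print(filled)
```

Mode imputation ignores relationships between predictors and the class, which is the main reason to prefer a model-based method such as PMM for the Soybean data, where missingness is concentrated in a few classes.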


Mohamed Hassan-El Serafi

2024-02-25
