Homework 4 - Data Preprocessing/Overfitting

Imports

library(gridExtra)
library(ggplot2)
library(cowplot)
options(scipen=10000)
library(mlbench)
library(tidyverse)
library(corrplot)
library(AppliedPredictiveModeling)
library(caret)
library(DataExplorer)
library(kableExtra)
library(mice)

3.1

The UC Irvine Machine Learning Repository6 contains a data set related to glass identiﬁcation. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Data Overview

data(Glass)
data(Soybean)
head(Glass) |>  kable() |> kable_styling() |>  kable_classic(full_width = T)

RI	Na	Mg	Al	Si	K	Ca	Fe	Type
1.52101	13.64	4.49	1.10	71.78	0.06	8.75	0.00	1
1.51761	13.89	3.60	1.36	72.73	0.48	7.83	0.00	1
1.51618	13.53	3.55	1.54	72.99	0.39	7.78	0.00	1
1.51766	13.21	3.69	1.29	72.61	0.57	8.22	0.00	1
1.51742	13.27	3.62	1.24	73.08	0.55	8.07	0.00	1
1.51596	12.79	3.61	1.62	72.97	0.64	8.07	0.26	1

GlassCont <- Glass |> dplyr::select(-Type)

# plotGraphs(GlassCont,FALSE)

Glass |>
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free', ncol = 2) +
  ggtitle("Histograms of Numerical Predictors")

Glass |>
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(value)) + 
  geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4) + 
  facet_wrap(~key, scales = 'free',  ncol = 2) +
  ggtitle("Boxplots of Numerical Predictors")

Glass |>
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")

## Correlation Plot

GlassCont <- Glass |> dplyr::select(-Type)

correlations <- cor(GlassCont)

corrplot(correlations, order = "hclust")

highCorr <- findCorrelation(correlations, cutoff = .75)

GlassUnco <- Glass[,-highCorr]

The correlation plot shows that in general, variables are not highligh correlated with eachother. The expecion is Ca, which is Highly correlated with RI.

correlations |>  kable() |> kable_styling() |>  kable_classic(full_width = T)

	RI	Na	Mg	Al	Si	K	Ca	Ba	Fe
RI	1.0000000	-0.1918854	-0.1222740	-0.4073260	-0.5420522	-0.2898327	0.8104027	-0.0003860	0.1430096
Na	-0.1918854	1.0000000	-0.2737320	0.1567937	-0.0698088	-0.2660865	-0.2754425	0.3266029	-0.2413464
Mg	-0.1222740	-0.2737320	1.0000000	-0.4817985	-0.1659267	0.0053957	-0.4437500	-0.4922621	0.0830595
Al	-0.4073260	0.1567937	-0.4817985	1.0000000	-0.0055237	0.3259584	-0.2595920	0.4794039	-0.0744022
Si	-0.5420522	-0.0698088	-0.1659267	-0.0055237	1.0000000	-0.1933309	-0.2087322	-0.1021513	-0.0942007
K	-0.2898327	-0.2660865	0.0053957	0.3259584	-0.1933309	1.0000000	-0.3178362	-0.0426181	-0.0077190
Ca	0.8104027	-0.2754425	-0.4437500	-0.2595920	-0.2087322	-0.3178362	1.0000000	-0.1128410	0.1249682
Ba	-0.0003860	0.3266029	-0.4922621	0.4794039	-0.1021513	-0.0426181	-0.1128410	1.0000000	-0.0586918
Fe	0.1430096	-0.2413464	0.0830595	-0.0744022	-0.0942007	-0.0077190	0.1249682	-0.0586918	1.0000000

Do there appear to be any outliers in the data? Are any predictors skewed?

The Ri variable is right-skewed, showing outliers at both ends, though the right tail is more pronounced with more extreme values. Similarly, the AI variable is right-skewed, with a more prominent right tail and outliers on both sides, but the right side contains more significant outliers.

The NA variable is also right-skewed, with four outliers on the right and two on the left. The left-side outliers are more extreme, significantly influencing the shape of the distribution and creating an imbalance in how the distribution is perceived.

In contrast, the Mg variable is left-skewed without any outliers, resulting in a smoother distribution. The SI variable shows left-skewness with outliers at both ends, but the left tail is more prominent, drawing attention to this side of the distribution.

Another version of the SI variable is right-skewed, with notable outliers in the right tail, where one specific observation is particularly extreme and stands out.

The CA variable is right-skewed with outliers on both sides, although the right tail is more prominent, a pattern observed consistently across multiple observations.

Lastly, the Ba variable has a highly sparse distribution, with most values at zero. The few non-zero values are considered outliers, making it difficult to assess the distribution meaningfully as it is dominated by zero values.

Are there any relevant transformations of one or more predictors that might improve the classiﬁcation model?

Let’s transform the data with boxcox transformation to see how the model could be improved.

lambdas <- Glass |>
  keep(is.numeric) |>
  mutate_all(funs(BoxCoxTrans(.)$lambda)) |> head(1)

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

lambdas

##   RI   Na Mg  Al Si  K   Ca Ba Fe
## 1 -2 -0.1 NA 0.5  2 NA -1.1 NA NA

dataset <- GlassCont |> dplyr::select(-c(Mg,K,Ba,Fe))

for (i in 1:ncol(dataset)) {
  transformed <- BoxCoxTrans( dataset[,i])
  transformed_data <- predict(transformed,dataset[,i] )
  
  plot_h <- ggplot(dataset) + geom_histogram(aes(x= transformed_data))+xlab(paste(names(dataset)[i], " with Lambda =",transformed$lambda))
  box_p <- ggplot(dataset) +  geom_boxplot(aes(x= transformed_data),outlier.colour="red", outlier.shape=8,outlier.size=4)+xlab(paste(names(dataset)[i], " with Lambda =",transformed$lambda))
  
  grid.arrange(plot_h, box_p ,ncol=2)

}

The table summarizes the variables and their respective Box-Cox transformations. However, some variables, such as Fe, K, Mg, and Ba, contain zero values, which is problematic when applying the Box-Cox transformation, as it results in undefined values (NaN). To address this issue, we will add a constant to the dataset equals to the absolute value of the minimum value in de data plus 2, allowing us to apply the Box-Cox transformation and observe its effects.

dataset <- Glass |> dplyr::select(c(Mg,K,Ba,Fe))

for (i in 1:ncol(dataset)) {
  shift_value <- abs(min( dataset[,i])) + 2
  newVal <- dataset[,i] +  shift_value
  transformed <- BoxCoxTrans(newVal)
  transformed_data <- predict(transformed,newVal )
  
  plot_h <- ggplot(dataset) + geom_histogram(aes(x= transformed_data))+xlab(paste(names(dataset)[i], " with constant 1 and Lambda =",transformed$lambda))
  box_p <- ggplot(dataset) +  geom_boxplot(aes(x= transformed_data),outlier.colour="red", outlier.shape=8,outlier.size=4)+xlab(paste(names(dataset)[i], " with constant 1 and Lambda =",transformed$lambda))
  
  grid.arrange(plot_h, box_p ,ncol=2)

}

Outliers can limit the effectiveness of transformations like Box-Cox by dominating the process, preventing significant changes in the distribution. For both the data increased by a constant and the original data, the minimal impact of the transformation may be due to the strong influence of the outliers, which skew the results and reduce the transformation’s ability to correct the data. Handling the outliers separately could improve the transformation’s effectiveness and help normalize the data.

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmen- tal conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

Data Overview

head(Soybean) |>  kable() |> kable_styling() |>  kable_classic(full_width = F) |>  scroll_box(width = "100%")

Class	date	precip	temp	crop.hist	area.dam	sever	seed.tmt	germ	plant.growth	leaves	leaf.marg	leaf.size	stem	lodging	stem.cankers	canker.lesion	fruiting.bodies	ext.decay	fruit.spots
diaporthe-stem-canker	6	2	1	1	1	1	0	0	1	1	2	2	1	1	3	1	1	1	4
diaporthe-stem-canker	4	2	1	2	0	2	1	1	1	1	2	2	1	0	3	1	1	1	4
diaporthe-stem-canker	3	2	1	1	0	2	1	2	1	1	2	2	1	0	3	0	1	1	4
diaporthe-stem-canker	3	2	1	1	0	2	0	1	1	1	2	2	1	0	3	0	1	1	4
diaporthe-stem-canker	6	2	1	2	0	1	0	2	1	1	2	2	1	0	3	1	1	1	4
diaporthe-stem-canker	5	2	1	3	0	1	0	1	1	1	2	2	1	0	3	0	1	1	4

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

par(mfrow=c(2, 2))

for (i in 1:ncol(Soybean)) {
  var <- names(Soybean)[i]
  print(ggplot(Soybean,aes_string(var))+geom_bar() + coord_flip()+ ggtitle(var))

}

Based on the charts above, leaf.mild, mycelium, and sclerotia appear to have all their values centered around a single point. We can confirm this by checking for variables with near-zero variance using the nearZeroVar function.

near_zero <- nearZeroVar(Soybean)

soybean_zero <- Soybean[,near_zero]

for (i in 1:ncol(soybean_zero)) {
  var <- names(soybean_zero)[i]
  print(ggplot(soybean_zero,aes_string(var))+geom_bar() + coord_flip()+ ggtitle(var))

}

As observed, most of the observations have a value of 0, with a few exceptions that account for no more than two records (ignoring NA values).

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Missing Data per Variable

missing_percentages <- colSums(is.na(Soybean)) / nrow(Soybean) * 100
missing_percentages   |>  kable(col.names = c("Variables","Percentage")) |> kable_styling() |>  kable_classic(full_width = F)

Variables	Percentage
Class	0.0000000
date	0.1464129
plant.stand	5.2708638
precip	5.5636896
temp	4.3923865
hail	17.7159590
crop.hist	2.3426061
area.dam	0.1464129
sever	17.7159590
seed.tmt	17.7159590
germ	16.3982430
plant.growth	2.3426061
leaves	0.0000000
leaf.halo	12.2986823
leaf.marg	12.2986823
leaf.size	12.2986823
leaf.shread	14.6412884
leaf.malf	12.2986823
leaf.mild	15.8125915
stem	2.3426061
lodging	17.7159590
stem.cankers	5.5636896
canker.lesion	5.5636896
fruiting.bodies	15.5197657
ext.decay	5.5636896
mycelium	5.5636896
int.discolor	5.5636896
sclerotia	5.5636896
fruit.pods	12.2986823
fruit.spots	15.5197657
seed	13.4699854
mold.growth	13.4699854
seed.discolor	15.5197657
seed.size	13.4699854
shriveling	15.5197657
roots	4.5387994

plot_intro(Soybean, title = 'Missing Information on Soybean Dataset',
           ggtheme = theme_minimal())

# Plot missing volume in Column 
plot_missing(Soybean[,1:(ncol(Soybean)/2)],title = 'Information about Missing Value in Soybean dataset Part 1',ggtheme = theme_minimal())

plot_missing(Soybean[,(ncol(Soybean)/2):ncol(Soybean)],title = 'Information about Missing Value in Soybean dataset Part 2',ggtheme = theme_minimal())

Missing Data by Class

sb_miss_count <- Soybean |> mutate(nul=rowSums(is.na(Soybean)))|>
                      group_by(Class)|> summarize(miss=sum(nul)) |> filter(miss!=0)
sb_miss_count |>  kable() |> kable_styling() |>  kable_classic(full_width = T)

Class	miss
2-4-d-injury	450
cyst-nematode	336
diaporthe-pod-&-stem-blight	177
herbicide-injury	160
phytophthora-rot	1214

There is a significant imbalance in the number of missing values across these categories. For instance, phytophthora-rot has 1214 missing values, which is much higher compared to herbicide-injury (160) or diaporthe-pod-&-stem-blight (177). This could suggest that certain conditions or circumstances lead to a higher likelihood of missing values for specific categories.

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

For low missingness (<5%), including variables like Class, plant.stand, and precip, we apply simple imputation using the mean or median for numerical data and the mode for categorical data. These methods are sufficient given the minimal missing data and the low likelihood of introducing bias.

For moderate to high missingness (5%-17%), such as date, stem.cankers, germ, hail, lodging, and leaf.mild, we use advanced imputation techniques like KNN or MICE. These methods are more robust, preserving relationships between variables and minimizing bias, even when a substantial portion of data is missing. This ensures that critical variables can still be included in the analysis without distorting the underlying patterns in the data.

In this case, I will use MICE, which allows for flexible imputation based on the type of variable (e.g., numerical or categorical) and the relationships between variables. By default, MICE uses Predictive Mean Matching (PMM) for continuous variables, ensuring that imputed values are realistic by matching them to observed data. This method is particularly well-suited for moderate levels of missingness, as it maintains the integrity of the data while avoiding extreme or unrealistic imputed values.

imputed_data <- mice(Soybean, m = 5, method = 'pmm', maxit = 5, seed = 123, printFlag = FALSE)

soybean_complete_data <- complete(imputed_data, 1)

Dataset with Imputed Data

plot_intro(soybean_complete_data, title = 'Missing Information on Soybean Dataset',
           ggtheme = theme_minimal())

for (i in 1:ncol(soybean_complete_data)) {
  var <- names(soybean_complete_data)[i]
  print(ggplot(soybean_complete_data,aes_string(var))+geom_bar() + coord_flip()+ ggtitle(var))

}

Homework 4 - Data Preprocessing/Overfitting

Lewris Mota Sanchez

2024-09-28

Imports

3.1

3.2

Missing Data per Variable

Missing Data by Class

Dataset with Imputed Data