Preprocessing/Overfitting Homework

3.1.* The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:*

It looks like Na and Ba as well as Al and Ba might be highly correlated. ### Pairs Plot EDA

Fe Ba and Ca and K all look skewed, so we’ll flag that for later transformation

There also seems to be some correlation between variables that might show itself once some box-cox transformations occur.

There also seem to be some near-zero variance candidates such as K as well we should look out for.

We should also note that our response RI and the predictor Ca seem to be highly positively correlated.

Missing Values Visualization

lets visualize the missing values in the dataset

⊕the str function in base will give us this same info, but I wanted to inlcude the vis_miss function from the naniar package here. This can be very helpful to determine if the data is structurally missing or there is informative missingness

library(naniar)

## Warning: package 'naniar' was built under R version 4.1.2

## 
## Attaching package: 'naniar'

## The following object is masked from 'package:tsibble':
## 
##     pedestrian

vis_miss(Glass)

It appears no data is missing

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Being an “outlier” can fall under many definitions. Let’s start with some boxplots to quickly visualize the data

library(tidyverse)
piv<-pivot_longer(Glass,cols = -c(RI,Type), names_to = 'predictors', values_to='vals')
ggplot(piv%>%select(-c(RI,Type)), aes(y=vals, group=predictors))+geom_boxplot()+facet_wrap(~predictors, scale = 'free')+ggtitle('Boxplots of Predictors in Glass Dataset')

There are significant outliers in these data. I suspect from previous investigation this could be fixed by correcting the skew.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Depending on which models we might choose to use, I would reccomend centering, scaling and applying box-cox transformations to most of these predictors.

Skewness

skewValues<-apply(Glass%>%select(-Type),2,skewness)
head(skewValues)

##         RI         Na         Mg         Al         Si          K 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889

These variables are skewed, performing a box-cox transformation will likely correct the skew to make the distribution more normal.

While this might make certain types of models sensitive to outliers perform better, we must take care. Outliers could indicate that there is data belonging to a sub population that might be correlated with our response.

Here is an example of how a transformation can eliminate outliers.

a<-ggplot(Glass, aes(y=Ba))+geom_boxplot()

b<-ggplot(Glass, aes(y=log(Ba)))+geom_boxplot()

ggpubr::ggarrange(a,b)

## Warning: Removed 176 rows containing non-finite values (stat_boxplot).

Filtering high correlation Data

Building off our EDA, Lets remove any predictors that are too highly correlated with eachother

#get correlation table
predictors<-Glass%>%select(-c(RI,Type))
correlations <- cor(predictors)
highCorr<-findCorrelation(correlations, cutoff = .75)
length(highCorr)

## [1] 0

head(highCorr)

## integer(0)

#filteredSegData<-segData[,-highCorr]

It doesnt look like any of our data is too highly correlated to need to be removed.

Binning

Binning variables can be advantageous depending on the purpose of our model.

When you bin predictors, you often sacrifice accuracy for inteligibility and understanding.

A predictor to bin might be K

ggplot()

If we binned K into above 1 and below 1 we might decide that this variable was a near-zero variance predictor and remove it, or make a dummy variable that allowed us to model on the distribution below 1 if the flag for below 1 was raised.

ggplot(Glass%>%filter(K<1), aes(x=K,y=RI))+geom_point()+ggtitle('RI vs K with K Below 1')

Preprocessing for PCA

We could also complete many of these transformations at once using the preProcess function, especially if we wanted to use PCA

trans<-preProcess(Glass, 
                  method = c('BoxCox', 'center','scale','pca'))
trans

## Created from 214 samples and 10 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (1)
##   - principal component signal extraction (9)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance

Applying the transformations

transformed<-predict(trans,Glass%>%select(-Type))
head(transformed[,1:7])

##          PC1        PC2        PC3        PC4        PC5        PC6         PC7
## 1 -1.2126444 -0.3942139 -0.1730756 -1.7193852  0.1913387 -0.3686905 -0.47956412
## 2  0.6179073 -0.7020476 -0.5507034 -0.8575350  0.1566312  0.0618975 -0.08525582
## 3  0.9907027 -0.8876886 -0.6452946 -0.3027716  0.1363025 -0.1739302 -0.39412058
## 4  0.1510212 -0.9042336 -0.1622361 -0.4521567  0.4291846 -0.2951884 -0.10178442
## 5  0.3582849 -1.0160965 -0.5763959 -0.1667831  0.3634192 -0.3289072  0.13934291
## 6  0.3408017 -1.3565637  0.7451275  1.0568333 -1.7762845  0.1433598 -0.23899752

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

Load Data

library(mlbench)
data(Soybean)

The Soybean database provides 19 classes of disease that a Soybean can get .

There are 35 categorical variables and 1 date variable.

A notable attribute of this dataset is the 4 final predictors with lots of missing values.

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Near-zero variance predictors

The first thing I want to check is near-zero variance predictors.

These can cause problems when

-fraction of unique vals over sample size is low (ex: 10%) -ratio of frequency of MOST prevalent val to the frequency of SECOND most prevalent is large (ex: 20)

Lets look at a random variable to check for near-zero variance visually

ggplot(Soybean, aes(x=germ))+geom_bar()+
  ggtitle('Count of Samples by Germ encoding')

Now we will use the nearZero function to pull variables that meet this requirement

zero_cols<-nearZeroVar(Soybean)

Soybean%>%select(zero_cols)%>%colnames()

## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(zero_cols)` instead of `zero_cols` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

## [1] "leaf.mild" "mycelium"  "sclerotia"

Lets look at one of these near-zero variance columns

ggplot(Soybean, aes(x=leaf.mild))+geom_bar()+
  ggtitle('Count of Samples by leaf.mild encoding')

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Lets look at where the data is most missing

vis_miss(Soybean)

Notably, I am getting a different percent missing than the book suggests.

Based on our visualizations, we can conclude that the missing data is not randomly distributed.

It seems that the missing data is grouped into specific samples.

Informative Missingness

I want to examine if the missingness has to do with being a member of one of the Classes

Soybean%>%
  filter(!complete.cases(.))%>%
  group_by(Class)%>%
  summarize(n())

## # A tibble: 5 x 2
##   Class                       `n()`
##   <fct>                       <int>
## 1 2-4-d-injury                   16
## 2 cyst-nematode                  14
## 3 diaporthe-pod-&-stem-blight    15
## 4 herbicide-injury                8
## 5 phytophthora-rot               68

As we can see, the missingness in the data is contained in just 5 of the 19 classes.

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

First method for handling the missind data is to do nothing.If we select a tree based model, no imputation or deletion of data is required.

Our second choice is to omit the missing data. If we deem the dataset large enough, and the missingness is concentrated in a small enough amount of the data, omission is a fine choice.

Our third choice is to impute the data. We can build models to predict the values of the missing data before we build our classification model. A commonly used technique is k-nearest neighbor which locates the samples with missing data in the “beanspace” and imputes the missing value with the most common value of its closest neighbors.

Imputation can bring along uncertainty, as well as increased computational time. Every time a new missing value needs to be imputed, the entire dataset is needed.