3.1, 3.2
3.1.* The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:*
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
first lets look at the pairs plot to get a feel for the data
GGally::ggpairs(Glass)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s also check the corrplot for the predictors
GGally::ggcorr(cor(Glass%>%select(-c(RI,Type))))
It looks like Na and Ba as well as Al and Ba might be highly correlated. ### Pairs Plot EDA
Fe Ba and Ca and K all look skewed, so we’ll flag that for later transformation
There also seems to be some correlation between variables that might show itself once some box-cox transformations occur.
There also seem to be some near-zero variance candidates such as K as well we should look out for.
We should also note that our response RI and the predictor Ca seem to be highly positively correlated.
lets visualize the missing values in the dataset
the str function in base will give us this same info, but I wanted to inlcude the vis_miss function from the naniar package here. This can be very helpful to determine if the data is structurally missing or there is informative missingness
library(naniar)
## Warning: package 'naniar' was built under R version 4.1.2
##
## Attaching package: 'naniar'
## The following object is masked from 'package:tsibble':
##
## pedestrian
vis_miss(Glass)
It appears no data is missing
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
Being an “outlier” can fall under many definitions. Let’s start with some boxplots to quickly visualize the data
library(tidyverse)
piv<-pivot_longer(Glass,cols = -c(RI,Type), names_to = 'predictors', values_to='vals')
ggplot(piv%>%select(-c(RI,Type)), aes(y=vals, group=predictors))+geom_boxplot()+facet_wrap(~predictors, scale = 'free')+ggtitle('Boxplots of Predictors in Glass Dataset')
There are significant outliers in these data. I suspect from previous investigation this could be fixed by correcting the skew.
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Depending on which models we might choose to use, I would reccomend centering, scaling and applying box-cox transformations to most of these predictors.
skewValues<-apply(Glass%>%select(-Type),2,skewness)
head(skewValues)
## RI Na Mg Al Si K
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889
These variables are skewed, performing a box-cox transformation will likely correct the skew to make the distribution more normal.
While this might make certain types of models sensitive to outliers perform better, we must take care. Outliers could indicate that there is data belonging to a sub population that might be correlated with our response.
Here is an example of how a transformation can eliminate outliers.
a<-ggplot(Glass, aes(y=Ba))+geom_boxplot()
b<-ggplot(Glass, aes(y=log(Ba)))+geom_boxplot()
ggpubr::ggarrange(a,b)
## Warning: Removed 176 rows containing non-finite values (stat_boxplot).
Building off our EDA, Lets remove any predictors that are too highly correlated with eachother
#get correlation table
predictors<-Glass%>%select(-c(RI,Type))
correlations <- cor(predictors)
highCorr<-findCorrelation(correlations, cutoff = .75)
length(highCorr)
## [1] 0
head(highCorr)
## integer(0)
#filteredSegData<-segData[,-highCorr]
It doesnt look like any of our data is too highly correlated to need to be removed.
Binning variables can be advantageous depending on the purpose of our model.
When you bin predictors, you often sacrifice accuracy for inteligibility and understanding.
A predictor to bin might be K
ggplot()
If we binned K into
above 1 and below 1 we might decide that this variable was a near-zero variance predictor and remove it, or make a dummy variable that allowed us to model on the distribution below 1 if the flag for below 1 was raised.
ggplot(Glass%>%filter(K<1), aes(x=K,y=RI))+geom_point()+ggtitle('RI vs K with K Below 1')
We could also complete many of these transformations at once using the preProcess function, especially if we wanted to use PCA
trans<-preProcess(Glass,
method = c('BoxCox', 'center','scale','pca'))
trans
## Created from 214 samples and 10 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - centered (9)
## - ignored (1)
## - principal component signal extraction (9)
## - scaled (9)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance
Applying the transformations
transformed<-predict(trans,Glass%>%select(-Type))
head(transformed[,1:7])
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## 1 -1.2126444 -0.3942139 -0.1730756 -1.7193852 0.1913387 -0.3686905 -0.47956412
## 2 0.6179073 -0.7020476 -0.5507034 -0.8575350 0.1566312 0.0618975 -0.08525582
## 3 0.9907027 -0.8876886 -0.6452946 -0.3027716 0.1363025 -0.1739302 -0.39412058
## 4 0.1510212 -0.9042336 -0.1622361 -0.4521567 0.4291846 -0.2951884 -0.10178442
## 5 0.3582849 -1.0160965 -0.5763959 -0.1667831 0.3634192 -0.3289072 0.13934291
## 6 0.3408017 -1.3565637 0.7451275 1.0568333 -1.7762845 0.1433598 -0.23899752
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
library(mlbench)
data(Soybean)
The Soybean database provides 19 classes of disease that a Soybean can get .
There are 35 categorical variables and 1 date variable.
A notable attribute of this dataset is the 4 final predictors with lots of missing values.
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
The first thing I want to check is near-zero variance predictors.
These can cause problems when
-fraction of unique vals over sample size is low (ex: 10%) -ratio of frequency of MOST prevalent val to the frequency of SECOND most prevalent is large (ex: 20)
Lets look at a random variable to check for near-zero variance visually
ggplot(Soybean, aes(x=germ))+geom_bar()+
ggtitle('Count of Samples by Germ encoding')
Now we will use the nearZero function to pull variables that meet this requirement
zero_cols<-nearZeroVar(Soybean)
Soybean%>%select(zero_cols)%>%colnames()
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(zero_cols)` instead of `zero_cols` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## [1] "leaf.mild" "mycelium" "sclerotia"
Lets look at one of these near-zero variance columns
ggplot(Soybean, aes(x=leaf.mild))+geom_bar()+
ggtitle('Count of Samples by leaf.mild encoding')
(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Lets look at where the data is most missing
vis_miss(Soybean)
Notably, I am getting a different percent missing than the book suggests.
Based on our visualizations, we can conclude that the missing data is not randomly distributed.
It seems that the missing data is grouped into specific samples.
I want to examine if the missingness has to do with being a member of one of the Classes
Soybean%>%
filter(!complete.cases(.))%>%
group_by(Class)%>%
summarize(n())
## # A tibble: 5 x 2
## Class `n()`
## <fct> <int>
## 1 2-4-d-injury 16
## 2 cyst-nematode 14
## 3 diaporthe-pod-&-stem-blight 15
## 4 herbicide-injury 8
## 5 phytophthora-rot 68
As we can see, the missingness in the data is contained in just 5 of the 19 classes.
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
First method for handling the missind data is to do nothing.If we select a tree based model, no imputation or deletion of data is required.
Our second choice is to omit the missing data. If we deem the dataset large enough, and the missingness is concentrated in a small enough amount of the data, omission is a fine choice.
Our third choice is to impute the data. We can build models to predict the values of the missing data before we build our classification model. A commonly used technique is k-nearest neighbor which locates the samples with missing data in the “beanspace” and imputes the missing value with the most common value of its closest neighbors.
Imputation can bring along uncertainty, as well as increased computational time. Every time a new missing value needs to be imputed, the entire dataset is needed.