The purpose of this assignment is to explore the data pre-processing exercises from the book Applied Predictive Modeling.

#Libraries required
library(mlbench)
library(naniar)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggpubr)
library(caret)
library(purrr)
library(e1071)
library(inspectdf)
library(DataExplorer)

Exercise 3.1

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Missing Values

#Check the sum of NA values in the data.
sum(is.na(Glass))
## [1] 0
#Another way to visualize missing values.
vis_miss(Glass)

There are no missing values in the data.

Distribution

plot_histogram(Glass)

plot_density(Glass)

From the histograms and density plots above, we can note the following:

  1. Some distributions are approximately normal, such as Na and Al.

  2. Others are left- or right-skewed, such as Ca, Si, and RI.

  3. Ba and Fe are unimodal, with most of their values concentrated at zero.

  4. Mg shows a bimodal distribution.

#Explore our non-numeric factor variable
ggplot(Glass, aes(Type)) +
  geom_bar()

The bar chart of the Type factor variable shows that types 1, 2, and 7 have the highest frequencies, so the classes are noticeably imbalanced.
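
The exact class counts can be verified directly:

#Frequency of each glass type
table(Glass$Type)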

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

The distribution plots above suggest that outliers are present in several of the predictors.
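
Boxplots of the standardized predictors make potential outliers easier to spot. A minimal base-graphics sketch (assuming Type is column 10 of Glass, as in the str output above):

#Scale the nine numeric predictors and draw one boxplot per column;
#points beyond the whiskers are potential outliers
boxplot(scale(Glass[, -10]), las = 2, main = "Scaled Glass Predictors")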

We can examine the skewness in more detail using the skewness function from the e1071 package.

RI <- Glass$RI
skewness(RI)
## [1] 1.602715

The skewness of RI is about 1.60. A positive value indicates that the RI distribution is right-skewed, with a tail extending toward higher values.

Next, create a data frame listing each variable, its skewness value, and an interpretation.

#Create a dataframe for each variable and its skewness value with the interpretation

Variable <- c("RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba","Fe")
Skew_Value <- c( 1.614015, 0.4509917, -1.144465, 0.9009179, -0.7253173, 6.505636, 2.032677, 3.392431, 1.742007)
Interpretation <- c("Right-skewed", "Right-skewed", "Left-skewed", "Right-skewed", "Left-skewed","Right-skewed", "Right-skewed", "Right-skewed", "Right-skewed" )

df <- data.frame(Variable, Skew_Value, Interpretation)
df
##   Variable Skew_Value Interpretation
## 1       RI  1.6140150   Right-skewed
## 2       Na  0.4509917   Right-skewed
## 3       Mg -1.1444650    Left-skewed
## 4       Al  0.9009179   Right-skewed
## 5       Si -0.7253173    Left-skewed
## 6        K  6.5056360   Right-skewed
## 7       Ca  2.0326770   Right-skewed
## 8       Ba  3.3924310   Right-skewed
## 9       Fe  1.7420070   Right-skewed
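
These values do not need to be hard-coded; they can be computed in a single call. The type = 1 variant appears to match the figures tabulated above, while the default (type = 3) produced the 1.602715 reported earlier for RI:

#Compute skewness for all nine numeric predictors at once
sapply(Glass[, -10], e1071::skewness, type = 1)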

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes. Given the skewness and outliers observed above, transformations such as Box-Cox (to reduce skewness), together with centering and scaling, may improve a classification model.

The preProcess function from caret can transform, center, scale, or impute values, and can also extract principal components.

Transformation <- preProcess(Glass, method = c("BoxCox", "center", "scale", "pca"))
Transformation
## Created from 214 samples and 10 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (1)
##   - principal component signal extraction (9)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance

Now, let's see which transformation was applied to each variable:

Transformation[["method"]][["BoxCox"]]
## [1] "RI" "Na" "Al" "Si" "Ca"
Transformation[["method"]][["center"]]
## [1] "RI" "Na" "Mg" "Al" "Si" "K"  "Ca" "Ba" "Fe"
Transformation[["method"]][["scale"]]
## [1] "RI" "Na" "Mg" "Al" "Si" "K"  "Ca" "Ba" "Fe"
Transformation[["method"]][["pca"]]
## [1] "RI" "Na" "Mg" "Al" "Si" "K"  "Ca" "Ba" "Fe"
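
Note that preProcess only estimates the transformations; applying them requires predict. A minimal sketch, where the result should contain the seven principal components plus the untouched Type factor:

#Apply the estimated transformations to the data
transformed <- predict(Transformation, Glass)
str(transformed)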

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
?Soybean
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

#Since all variables are factors, we will check the distributions for categorical predictors

inspectdf::inspect_cat(Soybean) %>%
    show_plot()

In each row, the width of a bar represents the frequency of the corresponding level.

The grey segments represent missing values.

Some distributions are degenerate, such as mycelium and sclerotia, where nearly all observations fall in a single level.
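
We can confirm this with the nearZeroVar function from caret, which flags near-zero-variance predictors; a quick sketch:

#Flag degenerate predictors; rows with nzv == TRUE have near-zero variance
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]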

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

We can investigate and visualize the missing values using vis_miss.

#visualize missing values.
vis_miss(Soybean)

The graph above shows that 9.5% of all individual cells are missing; the roughly 18% quoted in the exercise corresponds to the share of samples (rows) that contain at least one missing value.
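
To see which predictors are most affected, and whether missingness is concentrated in particular classes, a quick sketch:

#Missing cells per predictor, highest first
sort(colSums(is.na(Soybean)), decreasing = TRUE)

#Share of incomplete rows within each class
Soybean %>%
  mutate(has_missing = !complete.cases(Soybean)) %>%
  group_by(Class) %>%
  summarise(prop_incomplete = mean(has_missing)) %>%
  arrange(desc(prop_incomplete))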

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Imputation would be a better strategy than eliminating predictors, since the affected predictors still carry useful information; for these categorical variables, multiple imputation is a natural choice.
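
A minimal sketch using the mice package (an assumption: mice is not loaded above and would need to be installed separately). Its default methods handle factor variables, and mice::complete is namespaced to avoid the clash with tidyr::complete:

#Multiple imputation with 5 imputed data sets; the default methods
#(e.g., polyreg, logreg) handle the factor predictors
library(mice)
imp <- mice(Soybean, m = 5, printFlag = FALSE)
Soybean_imputed <- mice::complete(imp, 1)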