Data 624 Homework 4

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)  # provides the Glass and Soybean data sets
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# Make a copy of the Glass dataset and remove the categorical variable - "Type".
glassCopy <- subset(Glass, select = -Type) 

# Plot the distribution of each predictor variable.
# Requires dplyr (%>%), tidyr (gather), and ggplot2.
glassCopy %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = 'free') +
  # Set the colors as fixed values outside aes() so they are used literally.
  geom_histogram(bins = 16, color = 'red', fill = 'brown') +
  theme_light() +
  ggtitle('Distribution of Predictor Variables')

# Create a correlation matrix of the predictor variables (requires the corrplot package).
corrplot(cor(glassCopy))

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Looking at the "Distribution of Predictor Variables" plot above, we can see that some of the variables are close to normally distributed (Al, Ca, Na, RI, and Si), while the remaining variables are skewed (Ba, Fe, K, and Mg). Ba, Fe, and K are skewed to the right. K has outliers near 3 and 6, and there are many outliers in Al, Ba, Ca, Mg, Fe, and RI.
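
The visual impressions above can be cross-checked numerically. The following is a minimal base R sketch, not part of the original analysis: `sample_skewness()` and `count_outliers()` are hypothetical helper names, and `toy` is made-up data standing in for the Glass predictors. It computes the moment-based skewness coefficient and counts the points lying more than 1.5 times the interquartile range beyond the hinges, which is the rule a boxplot uses.

```r
# Sample skewness: the standardized third central moment.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Count points beyond 1.5 * IQR from the hinges, as flagged on a boxplot.
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Hypothetical toy data standing in for the Glass predictors.
toy <- data.frame(
  symmetric   = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  right_heavy = c(0, 0, 0, 0, 0, 0, 0.1, 0.3, 3)
)

summary_table <- data.frame(
  skewness = sapply(toy, sample_skewness),
  outliers = sapply(toy, count_outliers)
)
print(summary_table)
```

Applied to the homework data, the same two `sapply()` calls could be run on `glassCopy`. A skewness near zero suggests a roughly symmetric predictor, and a clearly positive value suggests a right skew.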

The correlation matrix tells us that most of the variables are not strongly related. The exceptions are the pairs Si and RI, Ca and RI, and Ba and Mg.
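
Reading strong pairs off a correlation plot can be backed up with a short listing. The sketch below is an illustration under assumptions: `strong_pairs()` is a hypothetical helper, the 0.5 cutoff is an arbitrary choice, and `toy` is made-up data used in place of `glassCopy`. It returns every predictor pair whose absolute Pearson correlation exceeds the cutoff.

```r
# Return predictor pairs whose absolute Pearson correlation exceeds `cutoff`.
strong_pairs <- function(df, cutoff = 0.5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Look only at the upper triangle so each pair is reported once.
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Hypothetical toy data; in the homework this would be glassCopy.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
  c = c(5, 1, 4, 2, 6, 3)
)

pairs_found <- strong_pairs(toy, cutoff = 0.5)
print(pairs_found)
```

In the toy data only `a` and `b` are reported, because `c` is nearly uncorrelated with both.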

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

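Yes. Applying a Box-Cox or log transformation to the skewed variables (Ba, Fe, K, and Mg) might improve the classification model. Both transformations require strictly positive values, and some of these predictors contain zeros (Ba and Fe visibly so in the str() output above), so in practice a small offset or a zero-tolerant alternative such as log(1 + x) or the Yeo-Johnson transformation would be needed.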
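
To illustrate the effect of such a transformation, here is a small base R sketch. It swaps in `log1p()` (that is, log(1 + x)) rather than Box-Cox, because the made-up vector `x` contains zeros, as Ba and Fe do. The values in `x` and the helper `sample_skewness()` are assumptions for illustration and are not taken from the Glass data.

```r
# Moment-based skewness coefficient.
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical zero-inflated, right-skewed values (similar in shape to Ba or Fe).
x <- c(0, 0, 0, 0, 0, 0.09, 0.11, 0.26, 0.64, 1.59, 3.15)

# log(x) would return -Inf for the zeros, so use log1p(x) = log(1 + x).
x_log1p <- log1p(x)

cat("skewness before:", sample_skewness(x), "\n")
cat("skewness after: ", sample_skewness(x_log1p), "\n")
```

The transformed values stay finite and the skewness coefficient drops, although a predictor that is mostly zeros will remain far from symmetric under any monotone transformation.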

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

nearZeroVar(Soybean, saveMetrics = TRUE) %>%
  kable(caption = 'Variables Near Zero Variance Status Report') %>%
  kable_styling()

Variables Near Zero Variance Status Report
	freqRatio	percentUnique	zeroVar	nzv
Class	1.010989	2.7818448	FALSE	FALSE
date	1.137405	1.0248902	FALSE	FALSE
plant.stand	1.208191	0.2928258	FALSE	FALSE
precip	4.098214	0.4392387	FALSE	FALSE
temp	1.879397	0.4392387	FALSE	FALSE
hail	3.425197	0.2928258	FALSE	FALSE
crop.hist	1.004587	0.5856515	FALSE	FALSE
area.dam	1.213904	0.5856515	FALSE	FALSE
sever	1.651282	0.4392387	FALSE	FALSE
seed.tmt	1.373874	0.4392387	FALSE	FALSE
germ	1.103627	0.4392387	FALSE	FALSE
plant.growth	1.951327	0.2928258	FALSE	FALSE
leaves	7.870130	0.2928258	FALSE	FALSE
leaf.halo	1.547511	0.4392387	FALSE	FALSE
leaf.marg	1.615385	0.4392387	FALSE	FALSE
leaf.size	1.479638	0.4392387	FALSE	FALSE
leaf.shread	5.072917	0.2928258	FALSE	FALSE
leaf.malf	12.311111	0.2928258	FALSE	FALSE
leaf.mild	26.750000	0.4392387	FALSE	TRUE
stem	1.253378	0.2928258	FALSE	FALSE
lodging	12.380952	0.2928258	FALSE	FALSE
stem.cankers	1.984293	0.5856515	FALSE	FALSE
canker.lesion	1.807910	0.5856515	FALSE	FALSE
fruiting.bodies	4.548077	0.2928258	FALSE	FALSE
ext.decay	3.681481	0.4392387	FALSE	FALSE
mycelium	106.500000	0.2928258	FALSE	TRUE
int.discolor	13.204546	0.4392387	FALSE	FALSE
sclerotia	31.250000	0.2928258	FALSE	TRUE
fruit.pods	3.130769	0.5856515	FALSE	FALSE
fruit.spots	3.450000	0.5856515	FALSE	FALSE
seed	4.139130	0.2928258	FALSE	FALSE
mold.growth	7.820895	0.2928258	FALSE	FALSE
seed.discolor	8.015625	0.2928258	FALSE	FALSE
seed.size	9.016949	0.2928258	FALSE	FALSE
shriveling	14.184211	0.2928258	FALSE	FALSE
roots	6.406977	0.4392387	FALSE	FALSE

# Search for degenerate distributions in the Soybean dataset.
degenerateDistributions <- nearZeroVar(Soybean)
colnames(Soybean)[degenerateDistributions]

## [1] "leaf.mild" "mycelium"  "sclerotia"

As the "Variables Near Zero Variance Status Report" table and the nearZeroVar() search results above show, there are three variables in the Soybean dataset with degenerate distributions: leaf.mild, mycelium, and sclerotia.
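
The two columns of the report can be reproduced by hand to see why a predictor is flagged. The sketch below is a simplified base R approximation, not caret's implementation: `nzv_metrics()` is a hypothetical helper, and the cutoffs of 95/5 for the frequency ratio and 10 for the percentage of unique values are, to my understanding, the defaults of caret::nearZeroVar(). The two vectors are made-up data.

```r
# Near-zero-variance metrics for a single predictor:
#   freqRatio     = count of the most common value / count of the second most common
#   percentUnique = 100 * number of distinct values / number of samples
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  counts <- counts[counts > 0]                 # ignore unused factor levels
  zero_var <- length(counts) <= 1
  freq_ratio <- if (zero_var) 0 else counts[[1]] / counts[[2]]
  percent_unique <- 100 * length(counts) / length(x)
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    zeroVar       = zero_var,
    nzv           = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Hypothetical predictors: one badly unbalanced, one balanced.
unbalanced <- c(rep("0", 96), rep("1", 4))
balanced   <- c(rep("0", 50), rep("1", 50))

print(rbind(unbalanced = nzv_metrics(unbalanced), balanced = nzv_metrics(balanced)))
```

This matches the pattern in the report: leaf.mild, mycelium, and sclerotia all have a freqRatio above 19 together with a very small percentUnique, while int.discolor (freqRatio of about 13.2) falls below the cutoff and is not flagged.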

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Print out a table of missing values by column (sorted in descending order).
missingValuesOrdered <- order(-colSums(is.na(Soybean)))

kable(colSums(is.na(Soybean))[missingValuesOrdered], caption = 'Missing Values By Column') %>%
    kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive')) %>% 
    scroll_box(width = '100%', height = '600px')

Missing Values By Column
	x
hail	121
sever	121
seed.tmt	121
lodging	121
germ	112
leaf.mild	108
fruiting.bodies	106
fruit.spots	106
seed.discolor	106
shriveling	106
leaf.shread	100
seed	92
mold.growth	92
seed.size	92
leaf.halo	84
leaf.marg	84
leaf.size	84
leaf.malf	84
fruit.pods	84
precip	38
stem.cankers	38
canker.lesion	38
ext.decay	38
mycelium	38
int.discolor	38
sclerotia	38
plant.stand	36
roots	31
temp	30
crop.hist	16
plant.growth	16
stem	16
date	1
area.dam	1
Class	0
leaves	0

# Print a table containing a count of missing values by class.
classesMissingValues <- Soybean %>%
  mutate(nul = rowSums(is.na(Soybean))) %>%
  group_by(Class) %>%
  summarize(missing = sum(nul)) %>%
  filter(missing != 0)

kable(classesMissingValues, caption = 'Missing Values By Class') %>%
      kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive')) %>% 
      scroll_box(width = '100%')

Missing Values By Class
Class	missing
2-4-d-injury	450
cyst-nematode	336
diaporthe-pod-&-stem-blight	177
herbicide-injury	160
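phytophthora-rot	1214

The two tables above show that missingness is concentrated in particular predictors and classes. hail, sever, seed.tmt, and lodging have the most missing values (121 each), while Class and leaves have none. Every missing value falls within just five of the 19 classes, with phytophthora-rot accounting for the largest count (1214), so the pattern of missing data is clearly related to the classes.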
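
Raw counts of missing cells depend on how many samples each class has, so a per-class proportion can be easier to compare. The following base R sketch shows one way to compute the share of rows with at least one missing predictor in each class. The data frame `toy` is made-up data standing in for Soybean, and the variable names are assumptions for illustration.

```r
# Hypothetical toy data standing in for Soybean: a class label plus two predictors.
toy <- data.frame(
  Class = c("A", "A", "A", "B", "B", "C", "C", "C"),
  hail  = c("0", "1", "0", NA, NA, "0", NA, "1"),
  sever = c("1", "1", "2", NA, NA, "0", "1", "1"),
  stringsAsFactors = FALSE
)

# Flag rows that have at least one missing predictor value.
incomplete <- !complete.cases(toy[, setdiff(names(toy), "Class")])

# Percentage of incomplete rows within each class.
pct_incomplete <- 100 * tapply(incomplete, toy$Class, mean)
print(round(pct_incomplete, 1))
```

A class at or near 100% would indicate that its samples were recorded with systematically fewer measurements, which is a different situation from values missing at random across all classes.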

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

For this question, I imputed the missing values with the mice() function from the mice (Multivariate Imputation by Chained Equations) package, using classification and regression trees (method = 'cart'). The before-and-after counts of missing values below show that the imputation removed all missing values from the dataset.

#' mice_imputation - Mice Imputation.
#'
#' Given a dataset, runs the MICE algorithm on the dataset
#' to impute both numerical and categorical missing values.
#'
#' @param dataframe A dataframe on which to run the MICE algorithm.
#'
#' @return The passed dataset with missing values imputed to complete values.
#'
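mice_imputation <- function(dataframe) {
  # m = 1 produces a single imputed dataset; 'cart' uses classification and regression trees.
  imputation <- mice(dataframe, m = 1, method = 'cart', printFlag = FALSE)
  # Return the completed dataset visibly rather than as the value of an assignment.
  mice::complete(imputation)
}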

# Check for empty values prior to imputing the data.
sapply(Soybean, function(x) sum(is.na(x))) %>%
  sort(decreasing = TRUE) %>%
  kable(caption = 'Missing Values Count Before Imputation') %>%
  kable_styling()

Missing Values Count Before Imputation
	x
hail	121
sever	121
seed.tmt	121
lodging	121
germ	112
leaf.mild	108
fruiting.bodies	106
fruit.spots	106
seed.discolor	106
shriveling	106
leaf.shread	100
seed	92
mold.growth	92
seed.size	92
leaf.halo	84
leaf.marg	84
leaf.size	84
leaf.malf	84
fruit.pods	84
precip	38
stem.cankers	38
canker.lesion	38
ext.decay	38
mycelium	38
int.discolor	38
sclerotia	38
plant.stand	36
roots	31
temp	30
crop.hist	16
plant.growth	16
stem	16
date	1
area.dam	1
Class	0
leaves	0

# Check for empty values once again after running the MICE imputation on the data.
sapply(mice_imputation(Soybean), function(x) sum(is.na(x))) %>%
  sort(decreasing = TRUE) %>%
  kable(caption = 'Missing Values Count After Imputation') %>%
  kable_styling()

Missing Values Count After Imputation
	x
Class	0
date	0
plant.stand	0
precip	0
temp	0
hail	0
crop.hist	0
area.dam	0
sever	0
seed.tmt	0
germ	0
plant.growth	0
leaves	0
leaf.halo	0
leaf.marg	0
leaf.size	0
leaf.shread	0
leaf.malf	0
leaf.mild	0
stem	0
lodging	0
stem.cankers	0
canker.lesion	0
fruiting.bodies	0
ext.decay	0
mycelium	0
int.discolor	0
sclerotia	0
fruit.pods	0
fruit.spots	0
seed	0
mold.growth	0
seed.discolor	0
seed.size	0
shriveling	0
roots	0

Week 5 Data Preprocessing/Overfitting

Stephen Haslett

9/26/2021
