R Markdown

The data can be accessed via:

library(AppliedPredictiveModeling)
library(mlbench)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.94 loaded
library(purrr)
library(tidyr)

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
#Create a heat map with corrplot to see whether there is any correlation between the predictors

Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot() 

#Basic bar chart to see the distribution of glass types

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Glass Distributions")

#Histogram of each numeric predictor

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Predictors")

#Box-and-whisker plot of each numeric predictor

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Predictors")

It appears that some predictors are more normally distributed than others (e.g., Al, Ca), while the remainder are skewed. Looking at the correlation plot, Ca and RI have the strongest positive correlation among all the predictors, while RI-Si, Al-Mg, Ca-Mg, and Ba-Mg show large negative correlations. From the bar plot, types 2, 1, and 7 have the largest counts, meaning the data is heavily concentrated in those classes.
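To double-check the reading of the correlation plot, the pairwise correlations can also be ranked numerically. A minimal sketch (the top-five cutoff is just an arbitrary choice for illustration):

cor_mat <- Glass %>%
  keep(is.numeric) %>%
  cor()

# Keep only one copy of each pair, then sort by absolute correlation
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA
as.data.frame(as.table(cor_mat)) %>%
  drop_na() %>%
  rename(correlation = Freq) %>%
  arrange(desc(abs(correlation))) %>%
  head(5)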

b. Do there appear to be any outliers in the data? Are any predictors skewed?

Yes, there appear to be outliers, namely in the K, Fe, Na, Ba, and RI predictors. Most of the predictors are skewed as well: Al, Ca, Fe, K, and Ba are right-skewed, while Mg and Si are left-skewed.
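To back up the skewness reading, per-predictor skewness statistics can be computed directly. A quick sketch using skewness() from the e1071 package (assumed to be installed):

library(e1071)

# Apply skewness() to each numeric predictor and sort from most
# right-skewed to most left-skewed
Glass %>%
  keep(is.numeric) %>%
  map_dbl(skewness) %>%
  sort(decreasing = TRUE)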

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

The Box-Cox transformation may be useful since a good number of the predictors are right-skewed. It helps stabilize the variance and reduce the skew so that the data are closer to normally distributed. Because some of the predictors also contain outliers, the spatial sign transformation would be helpful to minimize the impact of those outliers on the data.
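As a sketch of how this might look with caret's preProcess() (assuming the caret package is available; note that Box-Cox only applies to strictly positive predictors, so columns containing zeros such as Ba and Fe are skipped, and the spatial sign is applied after centering and scaling):

library(caret)

# Estimate Box-Cox, centering/scaling, and spatial sign transformations
# on the numeric predictors (Type, the outcome, is excluded)
glass_pp <- preProcess(Glass %>% keep(is.numeric),
                       method = c("BoxCox", "center", "scale", "spatialSign"))
glass_trans <- predict(glass_pp, Glass %>% keep(is.numeric))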

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)

Soybean %>%
  select(!Class) %>%
  drop_na() %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_bar() +
  facet_wrap(~ key)
## Warning: attributes are not identical across measure variables; they will be
## dropped

A degenerate distribution is one in which the variable takes primarily a single value and the other values occur at very low rates. From the bar charts, mycelium, sclerotia, and roots appear to be degenerate.
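One way to check this more formally is caret's nearZeroVar(), which flags predictors with near-zero variance. A sketch (assuming caret is installed):

library(caret)

# Frequency ratio and percent-unique metrics for each predictor;
# rows with nzv = TRUE are the degenerate candidates
nzv_metrics <- nearZeroVar(Soybean %>% select(!Class), saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]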

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

As a strategy, I would use mode imputation to handle the missing data in this set. Because the majority of predictors are categorical, we can fill in missing values with the most frequent category of each predictor. This approach avoids losing too much data by removing rows or columns, and it preserves the dataset's structure and the integrity of the categorical variables while eliminating the gaps in the data.
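A minimal sketch of this idea (the impute_mode helper is hypothetical, written here only for illustration):

# Hypothetical helper: replace NAs in a factor with its most frequent level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))
  x[is.na(x)] <- mode_level
  x
}

Soybean_imputed <- Soybean %>%
  mutate(across(-Class, impute_mode))

# No missing values should remain in the predictors
sum(is.na(Soybean_imputed))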