The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)   # Glass and Soybean data sets
library(dplyr)     # data manipulation
library(corrplot)  # correlation plots
library(car)       # powerTransform / yjPower
library(mice)      # multiple imputation
data(Glass)
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# Pairwise correlations among the nine predictors (class label Type excluded)
corr <- Glass %>% subset(select = -c(Type)) %>% cor(use = 'pairwise.complete.obs')
corrplot.mixed(corr, upper = 'square', lower.col = "black")
X <- Glass[, 1:9]
par(mfrow = c(3, 3))
# Histogram of each predictor to inspect its distribution
for (i in 1:ncol(X)) {
  hist(X[, i], xlab = names(X)[i], main = paste(names(X)[i], "Histogram"),
       col = "steelblue")
}
# Pairwise scatterplots to inspect relationships between predictors
pairs(X, main = "Scatterplot Matrix")
Inference: From the histograms, RI, Na, Al, and Si appear to have relatively normal (symmetric) distributions, while the remaining predictors appear non-normal (asymmetric). Most variable pairs in the scatterplot matrix do not show strong correlations; the exceptions are the RI, Ca pair, which shows a clearly positive relationship, and possibly the RI, Si pair, which shows a slight negative relationship.
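As a quick numeric check of the two pairs called out above, the relevant entries of the corr matrix computed earlier can be printed:

# Correlations between RI and the two predictors flagged in the scatterplot matrix
round(corr[c("Ca", "Si"), "RI"], 2)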
Do there appear to be any outliers in the data? Are any predictors skewed?
X <- Glass[, 1:9]
par(mfrow = c(3, 3))
# Boxplots of each predictor to highlight outliers
for (i in 1:ncol(X)) {
  boxplot(X[, i], ylab = names(X)[i], horizontal = TRUE,
          main = paste(names(X)[i], "Boxplot"), col = "steelblue")
}
# Density plots of each predictor to assess skewness
for (i in 1:ncol(X)) {
  d <- density(X[, i], na.rm = TRUE)
  plot(d, main = paste(names(X)[i], "Density"))
  polygon(d, col = "steelblue")
}
Inference: The boxplots show outliers in every variable except Mg, with the most extreme outliers appearing in K and Ba. The boxplots also reveal skewing in several variables, although they are hard to read for variables with extreme outliers. Taken together, the boxplots and density plots show that Mg is left-skewed, while K, Ba, and Fe are right-skewed; Ca also appears slightly right-skewed.
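To quantify the skewness seen in the plots, the sample skewness of each predictor can be computed; a minimal sketch, assuming the e1071 package (not loaded above) is installed:

library(e1071)
# Sample skewness per predictor: positive values = right skew, negative = left skew
round(apply(Glass[, 1:9], 2, skewness, na.rm = TRUE), 2)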
Are there any relevant transformations of one or more predictors that might improve the classification model?
# Estimate Yeo-Johnson transformation parameters for each predictor (car package)
summary(powerTransform(Glass[,1:9], family="yjPower"))$result[,1:2]
## Est Power Rounded Pwr
## RI -25.5304405 -25.53
## Na 1.3571336 1.00
## Mg 1.7422811 1.74
## Al 0.9935500 1.00
## Si 10.9318996 10.93
## K -0.1354867 0.00
## Ca 0.6753708 0.50
## Ba -6.7971854 -6.80
## Fe -14.8719453 -14.87
Inference: The power transformation with family = "yjPower" (Yeo-Johnson) was used because some predictors contain zero values even though none are negative, and the Box-Cox family requires strictly positive data. The estimated and rounded lambdas suggest that transforming several predictors (for example, a log-like transformation for K and a square root for Ca) might improve the classification model.
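As a minimal sketch of applying these estimates, the fitted lambdas can be passed to car's yjPower to transform each predictor:

# Transform each predictor with its estimated Yeo-Johnson lambda (car is loaded above)
pt <- powerTransform(Glass[, 1:9], family = "yjPower")
Glass_t <- as.data.frame(mapply(yjPower, Glass[, 1:9], coef(pt)))

Re-drawing the histograms on Glass_t would show whether the transformed predictors look more symmetric.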
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
if (!require("mlbench")) install.packages("mlbench")
if (!require("caret")) install.packages("caret")
if (!require("VIM")) install.packages("mlbench")
if (!require("dplyr")) install.packages("dplyr")
if (!require("mice")) install.packages("mice")
library(mlbench)
data(Soybean)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
X <- Soybean[, 2:36]
par(mfrow = c(3, 6))
# Smoothed density scatterplots; factor levels are coerced to numeric codes
for (i in 1:ncol(X)) {
  smoothScatter(X[, i], ylab = names(X)[i])
}
Inference: The smoothed density scatterplots show that all variables have few unique values. A few predictors may be degenerate because nearly all of their observations fall in a single level; the two most glaring examples are mycelium and sclerotia.
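This can be checked formally with caret's nearZeroVar, which flags predictors with near-zero variance; a minimal sketch (caret was loaded above via require):

library(caret)
# saveMetrics = TRUE returns the frequency-ratio and percent-unique diagnostics
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]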
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
library(VIM)
# Missingness per variable (bar chart) and per variable combination (grid)
aggr(Soybean, prop = c(TRUE, TRUE), bars = TRUE, numbers = TRUE, sortVars = TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
Inference: The visualizations produced by the aggr function in the VIM package show a bar chart with the proportion of missing values per variable and a grid with the proportion of each missingness pattern across variable combinations. The bar chart shows that several predictor variables have over 15% of their values missing. The grid shows that roughly 82% of the rows are complete cases, consistent with the problem statement that about 18% of the data are missing.
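To address the second part of the question, one way to check whether missingness is related to the classes is to compute the proportion of missing cells within each class; a minimal sketch:

# Proportion of missing cells within each class (Class is column 1)
miss_by_class <- sapply(split(Soybean[, -1], Soybean$Class),
                        function(d) mean(is.na(d)))
round(sort(miss_by_class, decreasing = TRUE), 3)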
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
# Impute all missing values via predictive mean matching (pmm)
MICE <- mice(Soybean, method = "pmm", printFlag = FALSE, seed = 624)
## Warning: Number of logged events: 1674
aggr(complete(MICE), prop = c(T, T), bars=T, numbers=T, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## Class 0
## date 0
## plant.stand 0
## precip 0
## temp 0
## hail 0
## crop.hist 0
## area.dam 0
## sever 0
## seed.tmt 0
## germ 0
## plant.growth 0
## leaves 0
## leaf.halo 0
## leaf.marg 0
## leaf.size 0
## leaf.shread 0
## leaf.malf 0
## leaf.mild 0
## stem 0
## lodging 0
## stem.cankers 0
## canker.lesion 0
## fruiting.bodies 0
## ext.decay 0
## mycelium 0
## int.discolor 0
## sclerotia 0
## fruit.pods 0
## fruit.spots 0
## seed 0
## mold.growth 0
## seed.discolor 0
## seed.size 0
## shriveling 0
## roots 0
Inference: Multivariate Imputation by Chained Equations (MICE) assumes values are missing at random. It is implemented by first filling in all missing values with a simple method, then, for each variable in turn, removing its imputed values and re-imputing them from a model regressed on the other variables; this remove-and-regress step is cycled over the whole dataset for a fixed number of iterations. The aggr output above confirms that the completed data contain no missing values.
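As the alternative strategy named in the question, predictors with heavy missingness could instead be eliminated before modeling; a minimal sketch, where the 15% cutoff is an arbitrary illustrative threshold:

# Drop predictors with more than 15% missing values (cutoff is illustrative)
miss_prop <- colMeans(is.na(Soybean))
Soybean_reduced <- Soybean[, miss_prop <= 0.15]
ncol(Soybean_reduced)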