The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven glass types (only six of which actually occur in these data). There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
if (!require("mlbench")) install.packages("mlbench")
if (!require("car")) install.packages("car")
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
The Histogram bins the data by value and plots the frequency of each bin against the ordered values; the height of each bar shows how many observations fall in that bin. The Scatterplot Matrix is a grid with a scatterplot for each pair of predictors; the labels along the diagonal indicate which variable occupies each row and column.
X <- Glass[,1:9]
par(mfrow = c(3, 3))
for (i in 1:ncol(X)) {
hist(X[ ,i], xlab = names(X[i]), main = paste(names(X[i]), "Histogram"), col="steelblue")
}
pairs(X, main="Scatterplot Matrix") # equivalent to plot(X)
From the histograms of the predictor variables it appears that RI, Na, Al, and Si have relatively normal (symmetric) distributions, while the remaining predictors appear non-normal (asymmetric). Most of the variable pairs in the scatterplot matrix do not appear to show strong correlations. The exceptions are the RI, Ca pair, which shows a clearly positive relationship, and possibly the RI, Si pair, which shows a slight negative relationship.
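These visual impressions can be checked numerically with a correlation matrix; a minimal sketch (RI and Ca should produce the largest positive coefficient, roughly 0.8):
# Pairwise Pearson correlations between the nine predictors,
# rounded for readability.
round(cor(X), 2)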
Do there appear to be any outliers in the data? Are any predictors skewed?
The Boxplot displays the data in quartiles. Inside the box lie the data falling between the 25th and 75th percentiles, also called the 1st and 3rd quartiles. The line inside the box represents the median, also known as the 50th percentile or 2nd quartile. Whiskers extend from the box, and any values beyond the whiskers are considered outliers. Symmetric distributions have a box with equally sized whiskers; skewed distributions have a box with one long and one short whisker. The Density Plot is much like a histogram in that it is a visual representation of the frequency of the data, but it uses a continuous curve rather than discrete bins, which makes it slightly better for visualizing skew.
X <- Glass[,1:9]
par(mfrow = c(3, 3))
for (i in 1:ncol(X)) {
boxplot(X[ ,i], ylab = names(X[i]), horizontal=T,
main = paste(names(X[i]), "Boxplot"), col="steelblue")
}
for (i in 1:ncol(X)) {
d <- density(X[,i], na.rm = TRUE)
plot(d, main = paste(names(X[i]), "Density"))
polygon(d, col="steelblue")
}
The Boxplots show outliers in every variable except Mg, with the most extreme outliers appearing in K and Ba. The Boxplots also reveal skewing in several variables, although skew is hard to judge for the variables with extreme outliers. Using both the Boxplots and the Density Plots, it can be seen that Mg is left-skewed, while K, Ba, and Fe are right-skewed. Ca also appears slightly right-skewed.
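The visual assessment of skew can be quantified with a sample skewness statistic. A minimal sketch using the skewness() function from the e1071 package (an assumption; any skewness implementation would do), where positive values indicate right skew and negative values left skew:
if (!require("e1071")) install.packages("e1071")
library(e1071)
# K, Ba, and Fe should produce large positive values; Mg a negative one.
apply(X, 2, skewness)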
Are there any relevant transformations of one or more predictors that might improve the classification model?
The Box-Cox transformation is used to normalize data and can improve a downstream model. Given the vector \(\mathbf{x}\) of positive values, the Box and Cox procedure estimates the power \(\lambda\) such that the transformed data \(\mathbf{y}\) are as close to normal as possible, where \(y=(x^\lambda-1)/\lambda\) for \(\lambda\neq0\) and \(y=\log x\) for \(\lambda=0\); informally, \(\mathbf{y}\approx\mathbf{x}^\lambda\) up to shift and scale. The powerTransform() function in the car package calculates the Box-Cox transformation using the maximum likelihood approach and returns the estimated values along with convenient rounded values that lie within 1.96 standard errors of the maximum likelihood estimate.
library(car)
summary(powerTransform(Glass[,1:9], family="yjPower"))$result[,1:2]
## Est Power Rounded Pwr
## RI -25.5304405 -25.53
## Na 1.3571336 1.00
## Mg 1.7422811 1.74
## Al 0.9935500 1.00
## Si 10.9318996 10.93
## K -0.1354867 0.00
## Ca 0.6753708 0.50
## Ba -6.7971854 -6.80
## Fe -14.8719453 -14.87
The powerTransform() function with a family="bcPower" parameter works only for strictly positive numbers, \(\mathbb{R}^+\). Although all the variables are non-negative, some include zeros. Therefore the family="yjPower" (Yeo-Johnson) parameter, which also accommodates zero and negative values, is used instead. (The alternative would be to shift every variable slightly to the right, say by \(1\mathrm{e}{-10}=1\cdot10^{-10}\), to conform with the strictly positive requirement.) The Box-Cox rounded \(\lambda\)'s calculated by the powerTransform() function show that transformations might improve the classification model. The cleanest suggested transformations are a log transformation for K (rounded \(\lambda=0\)) and a square-root transformation for Ca (rounded \(\lambda=0.5\)); Na and Al (rounded \(\lambda=1\)) need no transformation, while the extreme estimated powers for RI, Si, Ba, and Fe do not correspond to any simple, interpretable transformation.
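A minimal sketch applying the two clean rounded powers (treating \(\lambda=0\) as log1p(), which matches the Yeo-Johnson transformation for non-negative data, and \(\lambda=0.5\) as a square root, equivalent up to shift and scale):
# Apply the suggested transformations and inspect the results.
K_t  <- log1p(Glass$K)   # lambda = 0: log(x + 1) handles the zeros in K
Ca_t <- sqrt(Glass$Ca)   # lambda = 0.5: square root
par(mfrow = c(1, 2))
hist(K_t, xlab = "K", main = "Transformed K", col = "steelblue")
hist(Ca_t, xlab = "Ca", main = "Transformed Ca", col = "steelblue")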
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
if (!require("mlbench")) install.packages("mlbench")
if (!require("caret")) install.packages("caret")
if (!require("VIM")) install.packages("VIM")
if (!require("dplyr")) install.packages("dplyr")
if (!require("mice")) install.packages("mice")
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Degenerate distributions are those where the predictor variable has a single unique value (zero variance) or a handful of unique values that occur with very low frequencies (near-zero variance). The smoothScatter() function produces a smoothed color density representation of a scatterplot, which is useful for eyeballing the value frequencies of categorical data. The nearZeroVar() function from the caret package examines the uniqueness of the data and returns a table indicating whether each variable has zero or near-zero variance.
X <- Soybean[,2:36]
par(mfrow = c(3, 6))
for (i in 1:ncol(X)) {
smoothScatter(X[ ,i], ylab = names(X[i]))
}
library(caret)
nearZeroVar(X, names = TRUE, saveMetrics=T)
## freqRatio percentUnique zeroVar nzv
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
The Smoothed Density Scatterplots show that all variables have few unique values. A few could possibly be degenerate due to the low frequencies of some values. The two most glaring examples are mycelium and sclerotia, whose Smoothed Density Scatterplots show one solid color across the whole chart. The variables leaf.mild and int.discolor appear to show near-zero variance as well, but running the nearZeroVar() function on the data shows that only mycelium, sclerotia, and leaf.mild have near-zero variance ("nzv"). No variable has zero variance ("zeroVar").
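If those degenerate predictors were to be removed before modeling, nearZeroVar() with its default arguments returns the offending column indices directly; a minimal sketch:
# Drop the near-zero variance columns (leaf.mild, mycelium, sclerotia).
degenerate <- nearZeroVar(X)
X_filtered <- X[, -degenerate]
dim(X_filtered)  # 683 rows and 32 of the original 35 predictors remain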
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The aggr() function in the VIM package plots and calculates the amount of missing values in each variable. The dplyr package is useful for wrangling data into aggregate summaries and is used here to find whether the pattern of missing data is related to the classes.
library(VIM)
aggr(Soybean, prop = c(T, T), bars=T, numbers=T, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
library(dplyr)
Soybean %>%
mutate(Total = n()) %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
mutate(Missing = n(), Proportion=Missing/Total) %>%
select(Class, Missing, Proportion) %>%
unique()
## Source: local data frame [5 x 3]
## Groups: Class [5]
##
## Class Missing Proportion
## <fctr> <int> <dbl>
## 1 phytophthora-rot 68 0.09956076
## 2 diaporthe-pod-&-stem-blight 15 0.02196193
## 3 cyst-nematode 14 0.02049780
## 4 2-4-d-injury 16 0.02342606
## 5 herbicide-injury 8 0.01171303
The visualizations produced by the aggr() function in the VIM package show a bar chart with the proportion of missing data per variable as well as a grid with the proportion of missing data for variable combinations. The bar chart shows that several predictor variables have over 15% of their values missing. The top row of the grid shows that 82% of the cases are complete, in accordance with the problem description (roughly 18% missing); each remaining row highlights the missing values for the group of variables detailed on the x-axis. The non-graphical output of the function gives the exact proportion of missing values per variable. Checking whether the pattern of missing data is related to the classes amounts to checking whether some classes hold most of the incomplete cases, which is accomplished by filtering, grouping, and mutating the data with dplyr. The majority of the missing values are in the phytophthora-rot class, which accounts for nearly 10% of all cases being incomplete. Only four of the eighteen other classes contain any incomplete cases, each accounting for just 1–2% of the cases. The pattern of missing data is therefore related to the classes, driven mostly by phytophthora-rot.
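A compact cross-check of the same pattern is a table of complete versus incomplete cases per class; a minimal sketch in base R:
# Rows are class labels; FALSE = incomplete case, TRUE = complete case.
# Only five of the nineteen classes should show any FALSE counts.
table(Soybean$Class, complete.cases(Soybean))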
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
The mice() function in the mice package conducts Multivariate Imputation by Chained Equations (MICE) on multivariate datasets with missing values. The function offers over 20 imputation methods that can be applied to the data. The one used with these data is the predictive mean matching method, currently the one most often recommended in online forums. After the imputations are made, a complete dataset is created using the complete() function. The aggr() function from the VIM package used in the previous example (which plots and calculates the amount of missing values in each variable) is then rerun for comparison.
library(mice)
MICE <- mice(Soybean, method="pmm", printFlag=F, seed=624)
aggr(complete(MICE), prop = c(T, T), bars=T, numbers=T, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## Class 0
## date 0
## plant.stand 0
## precip 0
## temp 0
## hail 0
## crop.hist 0
## area.dam 0
## sever 0
## seed.tmt 0
## germ 0
## plant.growth 0
## leaves 0
## leaf.halo 0
## leaf.marg 0
## leaf.size 0
## leaf.shread 0
## leaf.malf 0
## leaf.mild 0
## stem 0
## lodging 0
## stem.cankers 0
## canker.lesion 0
## fruiting.bodies 0
## ext.decay 0
## mycelium 0
## int.discolor 0
## sclerotia 0
## fruit.pods 0
## fruit.spots 0
## seed 0
## mold.growth 0
## seed.discolor 0
## seed.size 0
## shriveling 0
## roots 0
Multivariate Imputation by Chained Equations (MICE) assumes values are missing at random. It is implemented by imputing missing data for all variables with a simple method, removing the imputations for one variable, imputing the removed data using regression on the remaining variables, repeating the remove-and-regress imputation for every other imputed variable, and then continuing the remove-and-regress loop over the whole dataset \(n\) times. The simple imputation method used here is Predictive Mean Matching (PMM), which "imputes missing values by means of the nearest-neighbor donor with distance based on the expected values of the missing variables conditional on the observed covariates." Given that this is a categorical dataset, the regression method used is logistic regression. After applying this imputation using the mice() function from the mice package, there are no missing data left in the dataset. The aggr() function from the VIM package used in the previous example now shows no missing data in any variable.
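The elimination alternative mentioned in the prompt could also be sketched, either by dropping the predictors with the heaviest missingness (the 15% threshold here is an assumption, not a rule) or by dropping incomplete cases outright:
# Option 1: drop predictors with more than 15% missing values.
keep <- colMeans(is.na(Soybean)) <= 0.15
Soybean_reduced <- Soybean[, keep]
# Option 2: drop incomplete cases entirely (loses roughly 18% of the rows).
Soybean_complete <- Soybean[complete.cases(Soybean), ]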
https://rpubs.com/josezuniga/253955
https://rpubs.com/josezuniga/269297
http://appliedpredictivemodeling.com/
https://www.rdocumentation.org/packages
http://archive.ics.uci.edu/ml/index.html
https://robjhyndman.com/uwafiles/fpp-notes.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
https://archive.ics.uci.edu/ml/datasets/Glass+Identification
https://cran.r-project.org/web/packages/forecast/forecast.pdf
http://appliedpredictivemodeling.com/blog/2014/11/12/solutions-on-github
http://www.stefvanbuuren.nl/publications/2014%20Semicontinuous%20-%20Stat%20Neerl.pdf