The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
library(dplyr)
library(psych)
library(corrplot)
library(e1071)
library(car)
library(caret)
library(tidyr)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
All nine predictors are numeric, and the column “Type” is a factor with six levels: 1, 2, 3, 5, 6, and 7 (the description mentions seven categories, but only six occur in the data). There are no missing data in the table.
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
Si has by far the highest percentage of all elements (69.81% to 75.41%). Fe has the lowest (0% to 0.51%, with a mean of about 0.06%).
Types “1” and “2” have 70 and 76 samples, which together make up about 68% of the 214 samples. Type “6” has only 9 samples, the fewest of the six types.
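The class shares can be checked directly from the counts in the summary() output. A minimal sketch in base R, with the counts copied by hand from that output (the name type_counts is only illustrative):

```r
# Class counts copied from the summary(Glass) output above
type_counts <- c("1" = 70, "2" = 76, "3" = 17, "5" = 13, "6" = 9, "7" = 29)

# Share of each glass type, in percent
type_share <- round(100 * type_counts / sum(type_counts), 1)
print(type_share)

# Combined share of the two largest classes, in percent
top_two_share <- 100 * sum(type_counts[c("1", "2")]) / sum(type_counts)
print(round(top_two_share, 1))
```

With such unbalanced classes, a stratified split would probably be worth considering when the data are later divided for modeling.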
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
The relationships between predictors can be shown with pairwise plots.
pairs.panels(Glass[,-10],show.points=FALSE,gap=FALSE)
“Na” and “Si” seem close to normally distributed
“RI” and “Al” are moderately right-skewed
“Fe”, “K”, “Ba”, “Ca” are strongly right-skewed
“Mg” does not seem to be normally distributed
The correlation plot and the pairwise plots show which predictors are strongly correlated with each other.
corrplot(cor(Glass[,-10]))
“RI” - “Ca” (0.81)
“RI” - “Si” (-0.54)
Such correlations can potentially cause collinearity problems during model building.
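Reading strongly correlated pairs off a plot is easy to get wrong, so it can help to list them programmatically. The following is a minimal base R sketch; high_cor_pairs, its cutoff argument, and the toy data are illustrative assumptions, not part of any package used above.

```r
# List predictor pairs whose absolute Pearson correlation exceeds a cutoff.
# `high_cor_pairs` is an illustrative helper name, not a library function.
high_cor_pairs <- function(df, cutoff = 0.5) {
  cm <- cor(df)
  # Keep each pair once by looking only at the upper triangle
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  # Strongest pairs first
  out[order(-abs(out$r)), , drop = FALSE]
}

# Toy data (made up for illustration): b is built from a, c is independent noise
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))
high_cor_pairs(toy, cutoff = 0.5)
```

For the glass data, the call would be high_cor_pairs(Glass[, -10], cutoff = 0.5).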
cor(Glass[,-10], as.numeric(Glass[,10]))
## [,1]
## RI -0.168739357
## Na 0.506424080
## Mg -0.728159518
## Al 0.591197598
## Si 0.149690687
## K -0.025834560
## Ca -0.008997841
## Ba 0.577676375
## Fe -0.183206747
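The correlations between each predictor and the glass type suggest that “Na”, “Mg”, “Al”, and “Ba” are the most strongly associated with glass type (absolute correlation above 0.5), which potentially makes them good predictors. Because “Type” is a nominal factor that was converted to integer codes here, these coefficients are only a rough indication.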
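Since Type is a nominal factor, the integer codes used in the correlation above carry no natural distance between classes. One possible alternative, sketched here under that assumption, is to rank the predictors by a Kruskal-Wallis statistic, which only asks whether a predictor's distribution differs between classes. The helper name rank_by_kruskal and the toy data are illustrative.

```r
# Rank numeric predictors by the Kruskal-Wallis statistic against a factor.
# `rank_by_kruskal` is an illustrative helper name, not a library function.
rank_by_kruskal <- function(predictors, class) {
  stats <- vapply(
    predictors,
    function(x) unname(kruskal.test(x, class)$statistic),
    numeric(1)
  )
  sort(stats, decreasing = TRUE)
}

# Toy data (made up): `shifted` differs by group, `noise` does not
set.seed(42)
grp <- factor(rep(c("a", "b", "c"), each = 30))
toy <- data.frame(
  shifted = rnorm(90) + rep(c(0, 2, 4), each = 30),
  noise   = rnorm(90)
)
rank_by_kruskal(toy, grp)
```

For the glass data, the call would be rank_by_kruskal(Glass[, -10], Glass$Type). A larger statistic indicates a stronger difference between classes.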
Do there appear to be any outliers in the data? Are any predictors skewed?
The box plots below display outliers: any values outside the whiskers are considered outliers.
glass_pred <- Glass[, -10]
par(mfrow = c(3, 3))
for (i in seq_along(glass_pred)) {
  boxplot(glass_pred[, i], ylab = names(glass_pred)[i], horizontal = TRUE)
}
apply(Glass[,-10],2,skewness)
## RI Na Mg Al Si K
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889
## Ca Ba Fe
## 2.0184463 3.3686800 1.7298107
The box plots show that all predictors except Mg have outliers. Identifying outliers and assessing whether they are influential points is an important part of any modeling process.
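The outliers can also be counted rather than read off the plots. The sketch below uses boxplot.stats() from base R, which applies the same 1.5 * IQR whisker rule as boxplot() by default. The helper name count_outliers and the toy data are illustrative.

```r
# Count points beyond the boxplot whiskers (1.5 * IQR rule) for each column.
# `count_outliers` is an illustrative helper name.
count_outliers <- function(df) {
  vapply(df, function(x) length(boxplot.stats(x)$out), integer(1))
}

# Toy data (made up): `with_out` has two extreme values, `clean` has none
toy <- data.frame(
  clean    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  with_out = c(1, 2, 3, 4, 5, 6, 7, 8, 50, -40)
)
count_outliers(toy)
```

For the glass data, the call would be count_outliers(Glass[, -10]).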
The computed skewness values confirm the findings discussed above: the most highly skewed predictors are K (6.46), Ba (3.37), Ca (2.02), and Fe (1.73). RI (1.60) is also clearly right-skewed.
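For reference, the skewness statistic can be written out by hand. The sketch below divides the third central moment by the cubed sample standard deviation. It is assumed here to correspond to the default estimator (type = 3) of e1071::skewness(), which is worth checking against the package documentation. The name sample_skewness is illustrative.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
# `sample_skewness` is an illustrative name; it is assumed to match the
# default (type = 3) estimator of e1071::skewness().
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

sample_skewness(c(1, 2, 3, 4, 5))      # symmetric values
sample_skewness(c(0, 0, 0, 0, 1, 10))  # long right tail
```

Values near 0 indicate a roughly symmetric distribution, positive values a longer right tail, and negative values a longer left tail.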
Are there any relevant transformations of one or more predictors that might improve the classification model?
The Yeo-Johnson transformation was selected as a normalizing transformation. It can be thought of as an extension of the Box-Cox transformation: it handles zero and negative values, whereas Box-Cox only handles strictly positive values. This matters here because several predictors (Mg, K, Ba, Fe) contain zeros. Both transformations can be used to make the data more nearly normal.
# Yeo-Johnson transformation
glass_trans <- preProcess(Glass[, -10], method = "YeoJohnson")
pred <- predict(glass_trans, Glass[, -10])
# Check skewness after the Yeo-Johnson transformation
apply(pred, 2, skewness)
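To make the difference from Box-Cox concrete, the transformation itself can be written as a small function. This is only a sketch of the formula for a given lambda. caret estimates lambda from the data, whereas here it has to be supplied by hand, and the name yeo_johnson is illustrative. The input is assumed to contain no missing values.

```r
# Yeo-Johnson transformation of a numeric vector for a given lambda.
# `yeo_johnson` is an illustrative name; lambda must be supplied by hand.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0

  # Non-negative values: Box-Cox applied to y + 1
  if (abs(lambda) < 1e-8) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }

  # Negative values: Box-Cox with power 2 - lambda applied to 1 - y
  if (abs(lambda - 2) < 1e-8) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

x <- c(-2, -0.5, 0, 0.5, 2)
yeo_johnson(x, lambda = 1)  # lambda = 1 leaves the values unchanged
yeo_johnson(x, lambda = 0)  # log-type transform for the non-negative part
```

Because zeros are mapped to zero for any lambda, a predictor that is mostly zeros keeps its spike at zero after the transformation.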
## RI Na Mg Al Si
## 1.6027150827 -0.0088476749 -0.8770969306 0.0002128304 -0.7202392108
## K Ca Ba Fe
## -0.0708227694 -0.2063893005 3.3686799688 1.7298107096
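The Yeo-Johnson transformation reduced the skewness of some variables, but not all of them. Na, Al, K, and Ca are now approximately symmetric (skewness close to 0). Mg improved only slightly (-1.14 to -0.88), and the skewness of RI, Si, Ba, and Fe is unchanged, which suggests that no transformation was applied to them.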
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
As we can see, all variables are factors (some of them ordered).
X <- Soybean[, 1:36]
par(mfrow = c(3, 6))
for (i in seq_along(X)) {
  barplot(table(X[, i]), xlab = names(X)[i])
}
A degenerate distribution is a probability distribution in a space (discrete or continuous) with support only on a space of lower dimension. For a single predictor, this means that all of the probability mass sits on one value. As the book puts it: “Some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables… such an uninformative variable may have little effect on the calculations.”
The plots indicate that “sclerotia”, “leaf.mild”, and “mycelium” are close to being zero-variance predictors (a zero-variance predictor has a single unique value): almost all of their samples fall into one level.
Using nearZeroVar() we can confirm which variables are close to zero-variance predictors.
nearZeroVar() diagnoses predictors that have one unique value (i.e. are zero-variance predictors) or predictors that have both of the following characteristics: they have very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large.
https://www.rdocumentation.org/packages/caret/versions/6.0-84/topics/nearZeroVar
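The two criteria quoted above can be made concrete with a small calculation. The sketch below computes the frequency ratio and the percentage of unique values from level counts. The helper name nzv_metrics is illustrative, and the cut-offs 95/5 and 10 are taken to be the documented defaults of nearZeroVar() (freqCut and uniqueCut). The level counts are those reported by summary() for the Soybean data, with NAs excluded.

```r
# The two quantities behind a near-zero-variance check, computed from counts.
# `nzv_metrics` is an illustrative helper; freq_cut and unique_cut mirror the
# documented defaults of caret::nearZeroVar() (freqCut = 95/5, uniqueCut = 10).
nzv_metrics <- function(counts, n_samples, freq_cut = 95 / 5, unique_cut = 10) {
  sorted <- sort(counts, decreasing = TRUE)
  # Most common value versus second most common value
  freq_ratio <- if (length(sorted) > 1) sorted[[1]] / sorted[[2]] else 0
  # Number of distinct values relative to the number of samples
  percent_unique <- 100 * length(counts) / n_samples
  list(
    freq_ratio     = freq_ratio,
    percent_unique = percent_unique,
    near_zero_var  = length(counts) == 1 ||
      (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Level counts as reported by summary() for the Soybean data (683 rows)
nzv_metrics(c("0" = 639, "1" = 6), n_samples = 683)   # mycelium
nzv_metrics(c("0" = 520, "1" = 42), n_samples = 683)  # lodging, for contrast
```

Here mycelium exceeds the frequency-ratio cut-off by a wide margin, while lodging stays below it even though it is also quite unbalanced.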
# Near zero variance predictors
library(caret)
nearZeroVar(Soybean)
## [1] 19 26 28
nearZeroVar(Soybean, names = TRUE)
## [1] "leaf.mild" "mycelium" "sclerotia"
nearZeroVar() confirms that “leaf.mild”, “mycelium”, and “sclerotia” are near-zero-variance predictors. These variables can be removed before model building.
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The percentage of missing values for each variable is calculated below:
colMeans(is.na(Soybean)) * 100
## Class date plant.stand precip
## 0.0000000 0.1464129 5.2708638 5.5636896
## temp hail crop.hist area.dam
## 4.3923865 17.7159590 2.3426061 0.1464129
## sever seed.tmt germ plant.growth
## 17.7159590 17.7159590 16.3982430 2.3426061
## leaves leaf.halo leaf.marg leaf.size
## 0.0000000 12.2986823 12.2986823 12.2986823
## leaf.shread leaf.malf leaf.mild stem
## 14.6412884 12.2986823 15.8125915 2.3426061
## lodging stem.cankers canker.lesion fruiting.bodies
## 17.7159590 5.5636896 5.5636896 15.5197657
## ext.decay mycelium int.discolor sclerotia
## 5.5636896 5.5636896 5.5636896 5.5636896
## fruit.pods fruit.spots seed mold.growth
## 12.2986823 15.5197657 13.4699854 13.4699854
## seed.discolor seed.size shriveling roots
## 15.5197657 13.4699854 15.5197657 4.5387994
The following variables have the largest percentage of missing values (17.7% each): “sever”, “lodging”, “hail”, and “seed.tmt”. These are the predictors most likely to be missing.
Apart from checking the NAs in each predictor, we also checked which classes contain rows with missing data.
Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(na = n()) %>%
  arrange(desc(na))
## # A tibble: 5 x 2
## Class na
## <fct> <int>
## 1 phytophthora-rot 68
## 2 2-4-d-injury 16
## 3 diaporthe-pod-&-stem-blight 15
## 4 cyst-nematode 14
## 5 herbicide-injury 8
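Only five of the 19 classes contain incomplete rows. “phytophthora-rot” has the most (68 of its 88 samples), followed by “2-4-d-injury” (16), “diaporthe-pod-&-stem-blight” (15), “cyst-nematode” (14), and “herbicide-injury” (8). Note that these counts are rows with at least one NA, not individual missing values. We can conclude that the pattern of missing data is related to the classes.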
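Raw counts depend on class size, so the share of incomplete rows within each class can be a more direct way to judge whether missingness is tied to the class. A minimal base R sketch follows; incomplete_share and the toy data frame are illustrative.

```r
# Share of rows with at least one missing value, computed per class (percent).
# `incomplete_share` and the toy data frame are illustrative only.
incomplete_share <- function(df, class_col) {
  incomplete <- !complete.cases(df)
  share <- tapply(incomplete, df[[class_col]], mean)
  sort(100 * share, decreasing = TRUE)
}

# Toy data (made up): class "c" is always incomplete, class "b" never is
toy <- data.frame(
  Class = factor(c("a", "a", "a", "b", "b", "c")),
  x     = c(1, NA, 3, 4, 5, NA),
  y     = c("u", "v", NA, "w", "x", "y")
)
incomplete_share(toy, "Class")
```

For the soybean data, the call would be incomplete_share(Soybean, "Class").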
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
If missingness is not informative, we could potentially remove the affected predictors, but we have quite a lot of missing data. Let’s examine the distribution of missing values further with aggr() from the VIM package. This function plots the amount of missing/imputed values in each variable and in certain combinations of variables.
library(VIM)
aggr(Soybean, col = c('grey', 'pink'), sortVars = TRUE, numbers = TRUE, cex.axis = 0.5)
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
There are a lot of missing data for some predictors. It may be possible to remove the predictors with the largest number of missing values (for example “hail”) if they are not informative; this could be checked with a chi-square test. Even after such a removal, however, we would still be left with a lot of missing values. There are several techniques that help to deal with missing values. We are going to apply one of them, a kNN-based method. The assumption behind using kNN for missing values is that a missing value can be approximated by the values of the points that are closest to it, based on the other variables.
kNN() performs imputation of missing data in a data frame using the k-Nearest Neighbour algorithm.
Soybean_imp <- kNN(Soybean, variable = c("hail"), k = 5)
summary(Soybean_imp)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0:551 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1:132 1 :165 1 :227 1 :322 1 :222 1 :213
## 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots hail_imp
## 0 :551 Mode :logical
## 1 : 86 FALSE:562
## 2 : 15 TRUE :121
## NA's: 31
##
##
##
As we can see, there are no missing values in the “hail” variable now; the added “hail_imp” column flags the 121 imputed rows. The same approach can be applied to the other variables.
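The nearest-neighbour idea described above can be illustrated with a deliberately simplified sketch. This is not the VIM::kNN() implementation: the helper knn_impute_mode is hypothetical, it handles a single categorical column, and it uses a plain mismatch count as the distance. The toy data are made up.

```r
# Simplified nearest-neighbour imputation for one categorical column.
# This is NOT the VIM::kNN() implementation; `knn_impute_mode` is an
# illustrative helper that uses a plain mismatch distance.
knn_impute_mode <- function(df, target, k = 5) {
  others <- setdiff(names(df), target)
  feats <- as.matrix(data.frame(lapply(df[others], as.character),
                                stringsAsFactors = FALSE))
  donors <- which(!is.na(df[[target]]))
  result <- df[[target]]

  for (i in which(is.na(df[[target]]))) {
    # Fraction of mismatching predictor values; comparisons with NA are skipped
    d <- vapply(donors, function(j) {
      mean(feats[i, ] != feats[j, ], na.rm = TRUE)
    }, numeric(1))
    nearest <- donors[order(d)][seq_len(min(k, length(donors)))]
    votes <- table(df[[target]][nearest])
    # Most frequent level among the neighbours (ties go to the first level)
    result[i] <- names(votes)[which.max(votes)]
  }
  result
}

# Toy data (made up): the last row resembles the rows where hail is "0"
toy <- data.frame(
  a    = factor(c("x", "x", "x", "y", "y", "y", "x")),
  b    = factor(c("p", "p", "p", "q", "q", "q", "p")),
  hail = factor(c("0", "0", "0", "1", "1", "1", NA))
)
knn_impute_mode(toy, "hail", k = 3)
```

The missing entry is filled with the level that is most common among the k most similar complete rows, which is the intuition behind the imputation shown above.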