The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
## Warning: package 'mlbench' was built under R version 3.5.3
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
X <- Glass[,1:9]
par(mfrow = c(3, 3))
for (i in 1:ncol(X)) {
hist(X[ ,i], xlab = names(X[i]), main = names(X[i]))
}
Based on the histograms above, RI, Na, Al, and Si appear to be approximately normally distributed. The remaining predictors do not appear to be approximately normal.
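A visual reading of histograms can be backed up with a sample skewness statistic. The sketch below is an addition rather than part of the original analysis: `sample_skewness` is a hypothetical helper that uses the moment-based formula, and it is applied to Glass only if that object is already loaded. Values near 0 suggest a roughly symmetric distribution, while large positive values indicate a long right tail.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Hypothetical helper written in base R; not part of the original analysis.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / (mean(d^2)^1.5)
}

# Apply to the nine predictors when Glass has been loaded with data(Glass).
if (exists("Glass")) {
  print(round(sapply(Glass[, 1:9], sample_skewness), 2))
}
```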
library(corrplot)
## corrplot 0.84 loaded
y <- cor(Glass[1:9])
corrplot(y, method="number")
Predictors RI and Ca are strongly correlated with each other, so they carry largely redundant information, and it is reasonable to keep only one of them. The remaining pairs of predictors are weakly to moderately correlated.
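Reading strong correlations off a plot can be supplemented by listing the pairs programmatically. The following is a minimal sketch: `high_cor_pairs` is a hypothetical helper, and the 0.75 cutoff is an assumed threshold chosen for illustration, not a value taken from the analysis above.

```r
# List predictor pairs whose absolute Pearson correlation exceeds a cutoff.
# `high_cor_pairs` and the default cutoff of 0.75 are illustrative assumptions.
high_cor_pairs <- function(df, cutoff = 0.75) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Keep the upper triangle only so each pair is reported once.
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Apply to the Glass predictors when the data set is already loaded.
if (exists("Glass")) {
  print(high_cor_pairs(Glass[, 1:9]))
}
```

The caret package offers `findCorrelation()` for the related task of choosing which columns to drop; the helper above only reports the pairs.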
par(mfrow = c(3, 3))
for (i in 1:ncol(Glass[1:9])){
boxplot(Glass[,i], xlab=colnames(Glass[1:9])[i], horizontal=T)
}
The box in each plot spans the 25th to the 75th percentile, and the middle line marks the median (50th percentile). Values outside the whiskers are considered outliers. Every predictor has outliers except Mg. Predictors K and Ba show the most extreme outliers.
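The number of points beyond the whiskers can also be counted directly. This sketch assumes the default whisker rule of 1.5 times the interquartile range, which is what `boxplot.stats()` applies; `count_outliers` is a hypothetical helper name.

```r
# Count the points that fall beyond the boxplot whiskers in each column.
# boxplot.stats() uses the same default 1.5 * IQR rule as boxplot().
count_outliers <- function(df) {
  sapply(df, function(x) length(boxplot.stats(x)$out))
}

# Apply to the Glass predictors when the data set is already loaded.
if (exists("Glass")) {
  print(count_outliers(Glass[, 1:9]))
}
```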
We saw that RI and Ca are highly correlated, so one of the two can be removed. We also saw that some of the predictors are skewed. Applying the Box-Cox transformation to predictors K, Ba, Mg, and Fe could make their distributions more symmetric. Note that Box-Cox requires strictly positive values, so predictors that contain zeros (Ba and Fe, for example) would first need a small offset or an alternative such as the Yeo-Johnson transformation. A data transformation that can help minimize the problem of outliers is the spatial sign transformation. After the predictors are centered and scaled, it projects each sample onto the unit sphere, so all samples lie at the same distance from the center.
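The spatial sign transformation described above can be sketched in a few lines of base R. `spatial_sign` is a hypothetical helper; it assumes all columns are numeric, that no column is constant, and that no sample sits exactly at the center after scaling (either case would produce NaN values).

```r
# Spatial sign: center and scale each predictor, then divide every row by
# its Euclidean norm so that each sample lands on the unit sphere.
# Assumes numeric, non-constant columns.
spatial_sign <- function(df) {
  z <- scale(as.matrix(df))       # mean 0 and standard deviation 1 per column
  norms <- sqrt(rowSums(z^2))     # Euclidean length of each row
  z / norms                       # recycling divides row i by norms[i]
}

# Apply to the Glass predictors when the data set is already loaded.
if (exists("Glass")) {
  glass_ss <- spatial_sign(Glass[, 1:9])
  print(range(rowSums(glass_ss^2)))   # squared row lengths after the transform
}
```

In caret, the same idea is available through `preProcess()` with `method = c("center", "scale", "spatialSign")`.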
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots
## 0 :551
## 1 : 86
## 2 : 15
## NA's: 31
##
##
##
A degenerate distribution occurs when a predictor has a single unique value (zero variance) or only a handful of unique values that occur with very low frequencies (near-zero variance). Below, the nearZeroVar() function from the caret package is used to check for such predictors. The resulting table shows, for each variable, whether it has zero or near-zero variance. The variables mycelium, sclerotia, and leaf.mild have near-zero variance (nzv is TRUE). None of the variables has zero variance. Bar plots of the three near-zero-variance predictors follow the table.
library(caret)
## Warning: package 'caret' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: ggplot2
X <- Soybean[,2:36]
nearZeroVar(X, names = TRUE, saveMetrics=T)
## freqRatio percentUnique zeroVar nzv
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
par(mfrow = c(2,2))
plot(Soybean$mycelium, main='mycelium')
plot(Soybean$sclerotia, main='sclerotia')
plot(Soybean$leaf.mild, main='leaf.mild' )
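To make the freqRatio and percentUnique columns concrete, the sketch below recomputes them for a single variable in base R. `nzv_metrics` is a hypothetical helper; the cutoffs of 95/5 and 10 are the documented defaults of `nearZeroVar()` (`freqCut` and `uniqueCut`), and the handling of missing values is an assumption chosen to match the output shown above.

```r
# Recompute the statistics that nearZeroVar() reports, for one variable.
# Hypothetical helper; NAs are dropped from the counts but kept in the
# denominator of percentUnique, which matches the table above.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- table(x[!is.na(x)])
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  zero_var <- length(counts) < 2
  # Ratio of the most common value's count to the second most common.
  freq_ratio <- if (zero_var) NA_real_ else counts[[1]] / counts[[2]]
  percent_unique <- 100 * length(counts) / length(x)
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    zeroVar       = zero_var,
    nzv           = zero_var ||
      (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# mycelium has 639 zeros, 6 ones, and 38 missing values in the summary above.
mycelium_like <- factor(c(rep("0", 639), rep("1", 6), rep(NA, 38)))
print(nzv_metrics(mycelium_like))
```

For these counts the frequency ratio is 639 / 6 = 106.5 and the percentage of unique values is 100 * 2 / 683, which agrees with the mycelium row of the table.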
The table below shows the number of missing values for each variable.
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.2
sorted <- order(-colSums(is.na(Soybean)))
kable(colSums(is.na(Soybean))[sorted])
Variable | Missing values |
---|---|
hail | 121 |
sever | 121 |
seed.tmt | 121 |
lodging | 121 |
germ | 112 |
leaf.mild | 108 |
fruiting.bodies | 106 |
fruit.spots | 106 |
seed.discolor | 106 |
shriveling | 106 |
leaf.shread | 100 |
seed | 92 |
mold.growth | 92 |
seed.size | 92 |
leaf.halo | 84 |
leaf.marg | 84 |
leaf.size | 84 |
leaf.malf | 84 |
fruit.pods | 84 |
precip | 38 |
stem.cankers | 38 |
canker.lesion | 38 |
ext.decay | 38 |
mycelium | 38 |
int.discolor | 38 |
sclerotia | 38 |
plant.stand | 36 |
roots | 31 |
temp | 30 |
crop.hist | 16 |
plant.growth | 16 |
stem | 16 |
date | 1 |
area.dam | 1 |
Class | 0 |
leaves | 0 |
The table below lists the classes that have missing data, together with the total number of missing values in each. There are 19 classes, and only 5 of them have missing values. The class phytophthora-rot has the largest number of missing values. Because the missing values are confined to a few classes, the pattern of missing data is related to the class.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
kable(Soybean %>% mutate(nul=rowSums(is.na(Soybean))) %>% group_by(Class) %>% summarize(missing=sum(nul)) %>% filter(missing!=0))
Class | missing |
---|---|
2-4-d-injury | 450 |
cyst-nematode | 336 |
diaporthe-pod-&-stem-blight | 177 |
herbicide-injury | 160 |
phytophthora-rot | 1214 |
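Totals per class depend on how many samples each class has, so it can help to also look at the share of rows in each class that contain at least one missing value. The sketch below does this in base R; `missing_by_group` is a hypothetical helper and is not part of the original analysis.

```r
# Per-group missingness: total number of missing cells, and the share of
# rows in the group that have at least one missing value.
missing_by_group <- function(df, group) {
  na_per_row <- rowSums(is.na(df))
  groups <- df[[group]]
  cells <- tapply(na_per_row, groups, sum)
  data.frame(
    group           = names(cells),
    missing_cells   = as.vector(cells),
    rows_incomplete = as.vector(tapply(na_per_row > 0, groups, mean)),
    stringsAsFactors = FALSE
  )
}

# Apply to the soybean data when it is already loaded.
if (exists("Soybean")) {
  by_class <- missing_by_group(Soybean, "Class")
  print(by_class[by_class$missing_cells > 0, ])
}
```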
From the previous exercise, we know that leaf.mild, mycelium, and sclerotia have degenerate distributions, so we can remove them from the model. The table below shows the fraction of missing data for each variable, which ranges from 0 to about 0.18. Variables Class and leaves do not have any missing values, so nothing needs to be done for them. For variables that take only the values 0 and 1 and have a small to moderate fraction of missing values, we can randomly impute 0 or 1 for the missing values. These variables are plant.stand, hail, plant.growth, leaf.shread, leaf.malf, lodging, stem, fruiting.bodies, seed, mold.growth, seed.discolor, seed.size, and shriveling. For the remaining variables, we can use k-nearest neighbors to impute values.
# Fraction of rows with a missing value in each column (missing / total rows)
result <- colMeans(is.na(Soybean))
# Sort ascending and reorder the names with the same index so labels stay aligned
sorted <- order(result)
df <- data.frame(colnames(Soybean)[sorted], result[sorted])
row.names(df) <- NULL
colnames(df) <- c("Variable", "Fraction Missing")
kable(df)
Variable | Fraction Missing |
---|---|
Class | 0.0000000 |
leaves | 0.0000000 |
date | 0.0014641 |
area.dam | 0.0014641 |
crop.hist | 0.0234261 |
plant.growth | 0.0234261 |
stem | 0.0234261 |
temp | 0.0439239 |
roots | 0.0453880 |
plant.stand | 0.0527086 |
precip | 0.0556369 |
stem.cankers | 0.0556369 |
canker.lesion | 0.0556369 |
ext.decay | 0.0556369 |
mycelium | 0.0556369 |
int.discolor | 0.0556369 |
sclerotia | 0.0556369 |
leaf.halo | 0.1229868 |
leaf.marg | 0.1229868 |
leaf.size | 0.1229868 |
leaf.malf | 0.1229868 |
fruit.pods | 0.1229868 |
seed | 0.1346999 |
mold.growth | 0.1346999 |
seed.size | 0.1346999 |
leaf.shread | 0.1464129 |
fruiting.bodies | 0.1551977 |
fruit.spots | 0.1551977 |
seed.discolor | 0.1551977 |
shriveling | 0.1551977 |
leaf.mild | 0.1581259 |
germ | 0.1639824 |
hail | 0.1771596 |
sever | 0.1771596 |
seed.tmt | 0.1771596 |
lodging | 0.1771596 |
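The random imputation suggested for the two-level variables could look like the sketch below. It is one possible reading of that strategy, not the original author's code: `impute_random` is a hypothetical helper that draws each replacement from the observed values of the same variable, so the imputed values follow the observed proportions rather than a 50/50 split. The seed and the two example columns are arbitrary choices for illustration. k-nearest-neighbor imputation is not shown here.

```r
# Replace each NA with a random draw from the observed values of the same
# variable. Hypothetical helper; works for factors and for plain vectors.
impute_random <- function(x) {
  miss <- is.na(x)
  obs <- x[!miss]
  if (any(miss) && length(obs) > 0) {
    # Index with sample.int() to avoid sample()'s special case for a
    # single numeric value.
    x[miss] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
  }
  x
}

# Apply to two of the two-level predictors when Soybean is already loaded.
if (exists("Soybean")) {
  set.seed(123)                        # arbitrary seed, for reproducibility
  binary_vars <- c("hail", "lodging")  # example columns only
  soybean_imputed <- Soybean
  soybean_imputed[binary_vars] <- lapply(Soybean[binary_vars], impute_random)
  print(colSums(is.na(soybean_imputed[binary_vars])))
}
```

Because the missing values are concentrated in a few classes, drawing from the overall distribution ignores class information; sampling within each class would be a natural refinement.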