library(ggplot2)
library(mlbench)
library(e1071)
library(caret)
The Glass data come from the UC Irvine Machine Learning Repository.
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
hist(Glass$RI)
hist(Glass$Na)
hist(Glass$Mg)
hist(Glass$Al)
hist(Glass$Si)
hist(Glass$K)
hist(Glass$Ca)
hist(Glass$Ba)
hist(Glass$Fe)
Based on the histograms above, the Mg, K, Ba, and Fe predictors are not normally distributed and are strongly skewed. The remaining five predictors are approximately normally distributed, with only slight skewness.
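Since e1071 is already loaded, its skewness function can quantify this observation; the lines below are a sketch, not part of the original output.
# Sample skewness of each predictor: positive = right skew, negative = left skew
apply(Glass[, 1:9], 2, skewness)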
The table and plot below give the count of observations for each type of glass.
table(Glass$Type)
##
## 1 2 3 5 6 7
## 70 76 17 13 9 29
barplot(table(Glass$Type))
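Since ggplot2 is loaded, an equivalent plot can be drawn with geom_bar; this is a sketch, not part of the original output.
# Same class counts as the base-graphics barplot above, drawn with ggplot2
ggplot(Glass, aes(x = Type)) + geom_bar()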
The cor function helps us check for collinearity among the predictors.
gp <- Glass[,1:9]
cor(gp, use = "all.obs")
## RI Na Mg Al Si
## RI 1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220
## Na -0.1918853790 1.00000000 -0.273731961 0.15679367 -0.06980881
## Mg -0.1222740393 -0.27373196 1.000000000 -0.48179851 -0.16592672
## Al -0.4073260341 0.15679367 -0.481798509 1.00000000 -0.00552372
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372 1.00000000
## K -0.2898327111 -0.26608650 0.005395667 0.32595845 -0.19333085
## Ca 0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215
## Ba -0.0003860189 0.32660288 -0.492262118 0.47940390 -0.10215131
## Fe 0.1430096093 -0.24134641 0.083059529 -0.07440215 -0.09420073
## K Ca Ba Fe
## RI -0.289832711 0.8104027 -0.0003860189 0.143009609
## Na -0.266086504 -0.2754425 0.3266028795 -0.241346411
## Mg 0.005395667 -0.4437500 -0.4922621178 0.083059529
## Al 0.325958446 -0.2595920 0.4794039017 -0.074402151
## Si -0.193330854 -0.2087322 -0.1021513105 -0.094200731
## K 1.000000000 -0.3178362 -0.0426180594 -0.007719049
## Ca -0.317836155 1.0000000 -0.1128409671 0.124968219
## Ba -0.042618059 -0.1128410 1.0000000000 -0.058691755
## Fe -0.007719049 0.1249682 -0.0586917554 1.000000000
Indeed, we see high collinearity between the RI and Ca predictors (r ≈ 0.81), as shown in the plot below.
plot(Glass$RI, Glass$Ca)
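caret's findCorrelation function can flag such predictors automatically; the cutoff of 0.75 below is an assumed threshold, and this sketch is not part of the original output.
# Flag predictors whose pairwise correlations exceed the cutoff;
# given r ~ 0.81 between RI and Ca, one of the two should be flagged
highCor <- findCorrelation(cor(gp), cutoff = 0.75)
names(gp)[highCor]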
The Fe predictor is strongly right-skewed, with outliers. This is illustrated by the summary and box plot below.
summary(Glass$Fe)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000
boxplot(Glass$Fe)
A log transformation reduces the right skew of the Fe predictor; note that hist silently drops the zero values of Fe, since log(0) is not finite.
hist(log(Glass$Fe))
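Alternatively, a Yeo-Johnson transformation, which unlike Box-Cox handles zero values, could be estimated with caret's preProcess; this is a sketch, not part of the original analysis.
# Estimate a Yeo-Johnson transformation for Fe and plot the transformed values
ppFe <- preProcess(Glass["Fe"], method = "YeoJohnson")
hist(predict(ppFe, Glass["Fe"])$Fe)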
The Soybean data also come from the UC Irvine Machine Learning Repository.
data(Soybean)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
There are 3 predictors with degenerate distributions, listed below:
nzv <- nearZeroVar(Soybean)
names(Soybean)[nzv]
## [1] "leaf.mild" "mycelium" "sclerotia"
The plot below illustrates the disproportionate distribution of values for the leaf.mild predictor, as a percentage of all observations:
barplot((table(Soybean$leaf.mild) * 100) / nrow(Soybean))
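The saveMetrics argument of nearZeroVar reports the frequency ratio and percentage of unique values behind these flags; this sketch is not part of the original output.
# Show the degeneracy metrics for the flagged predictors only
nzvMetrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzvMetrics[nzvMetrics$nzv, ]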
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
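As a quick check (a sketch, not part of the original output), the share of incomplete samples and the per-predictor missing counts can be computed directly.
# Percentage of rows with at least one missing value (roughly 18%)
mean(!complete.cases(Soybean)) * 100
# Predictors ranked by number of missing values
head(sort(colSums(is.na(Soybean)), decreasing = TRUE))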
Based on the summary below, four predictors (hail, sever, seed.tmt, and lodging) are tied for the most missing values, with 121 each:
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots
## 0 :551
## 1 : 86
## 2 : 15
## NA's: 31
##
##
##
There appears to be a pattern of missing data related to certain classes. The plot below shows this pattern, which is nearly identical across the predictors with the most missing data.
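As a complementary check (a sketch, not part of the original output), the proportion of missing predictor cells within each class can be computed directly.
# Fraction of missing predictor values per class; most classes are complete
missByClass <- sapply(split(Soybean[, -1], Soybean$Class),
                      function(d) mean(is.na(d)))
round(sort(missByClass, decreasing = TRUE), 2)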
barplot(table(Soybean[is.na(Soybean$lodging), 'Class']))
table(Soybean[is.na(Soybean$lodging), 'Class'])
##
## 2-4-d-injury alternarialeaf-spot
## 16 0
## anthracnose bacterial-blight
## 0 0
## bacterial-pustule brown-spot
## 0 0
## brown-stem-rot charcoal-rot
## 0 0
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 0 0
## frog-eye-leaf-spot herbicide-injury
## 0 8
## phyllosticta-leaf-spot phytophthora-rot
## 0 68
## powdery-mildew purple-seed-stain
## 0 0
## rhizoctonia-root-rot
## 0
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Given that roughly 18% of the samples are incomplete, it is better to impute the missing values than to drop those rows. A common pattern in the data is that, within a class, a predictor often takes only a single value, so imputing that exact value is safe. Such predictors, however, are likely to be eliminated from the model anyway because they have zero or near-zero variance. Here is an illustration of this point for the phytophthora-rot class:
summary(Soybean[Soybean$Class == 'phytophthora-rot',])
## Class date plant.stand precip temp hail
## phytophthora-rot :88 0: 7 0: 0 0: 0 0: 9 0 :14
## 2-4-d-injury : 0 1:23 1:88 1:30 1:51 1 : 6
## alternarialeaf-spot: 0 2:25 2:58 2:28 NA's:68
## anthracnose : 0 3:27
## bacterial-blight : 0 4: 6
## bacterial-pustule : 0 5: 0
## (Other) : 0 6: 0
## crop.hist area.dam sever seed.tmt germ plant.growth leaves
## 0: 6 0: 0 0 : 0 0 :10 0 : 7 0: 0 0: 0
## 1:20 1:87 1 : 7 1 :10 1 : 7 1:88 1:88
## 2:32 2: 0 2 :13 2 : 0 2 : 6
## 3:30 3: 1 NA's:68 NA's:68 NA's:68
##
##
##
## leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem
## 0 :33 0 : 0 0 : 0 0 :33 0 :33 0 :33 0: 0
## 1 : 0 1 : 0 1 : 0 1 : 0 1 : 0 1 : 0 1:88
## 2 : 0 2 :33 2 :33 NA's:55 NA's:55 2 : 0
## NA's:55 NA's:55 NA's:55 NA's:55
##
##
##
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium
## 0 :18 0: 6 0: 0 0 :20 0:69 0:88
## 1 : 2 1:19 1: 0 1 : 0 1: 6 1: 0
## NA's:68 2:30 2:88 NA's:68 2:13
## 3:33 3: 0
##
##
##
## int.discolor sclerotia fruit.pods fruit.spots seed mold.growth
## 0:88 0:88 0 : 0 0 : 0 0 :20 0 :20
## 1: 0 1: 0 1 : 0 1 : 0 1 : 0 1 : 0
## 2: 0 2 : 0 2 : 0 NA's:68 NA's:68
## 3 :20 4 :20
## NA's:68 NA's:68
##
##
## seed.discolor seed.size shriveling roots
## 0 :20 0 :20 0 :20 0:20
## 1 : 0 1 : 0 1 : 0 1:68
## NA's:68 NA's:68 NA's:68 2: 0
##
##
##
##
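Following this observation, here is a minimal sketch of within-class mode imputation; the impute_class_mode helper is hypothetical, not from the original analysis.
# Hypothetical helper: fill each predictor's NAs with the most frequent
# level observed within the same class (skipping all-NA class/predictor pairs)
impute_class_mode <- function(df, class_col = "Class") {
  preds <- setdiff(names(df), class_col)
  for (cls in levels(df[[class_col]])) {
    idx <- df[[class_col]] == cls
    for (p in preds) {
      vals <- df[idx, p]
      if (anyNA(vals) && !all(is.na(vals))) {
        mode_lvl <- names(which.max(table(vals)))
        df[idx & is.na(df[[p]]), p] <- mode_lvl
      }
    }
  }
  df
}
soyModeImputed <- impute_class_mode(Soybean)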
For predictors that take several distinct values, a more sophisticated imputation method should be used. If a variable with missing values is correlated with other predictors, a K-nearest-neighbors imputation is a reasonable choice.
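A minimal sketch of that approach with caret's knnImpute method follows; knnImpute requires numeric input, so the factor levels are coerced to integer codes with data.matrix, an assumption of this sketch rather than part of the original analysis.
# Coerce the ordinal factor levels to integer codes, then impute each missing
# value from its 5 nearest neighbors; knnImpute also centers and scales
soyNum <- data.matrix(Soybean[, -1])
ppKnn <- preProcess(soyNum, method = "knnImpute")
soyImputed <- predict(ppKnn, soyNum)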