3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214
glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and
percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
> library(mlbench)
> data(Glass)
> str(Glass)
'data.frame': 214 obs. of 10 variables:
$ RI : num 1.52 1.52 1.52 1.52 1.52 ...
$ Na : num 13.6 13.9 13.5 13.2 13.3 ...
$ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
$ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
$ Si : num 71.8 72.7 73 72.6 73.1 ...
$ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
$ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
$ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
$ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
$ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the
relationships between predictors.
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a)
glass <- subset(Glass, select = -Type)
predictors <- colnames(glass)
par(mfrow = c(3, 3))
# One histogram per predictor, arranged in a 3 x 3 grid
for (i in seq_along(predictors)) {
  hist(glass[, i], main = predictors[i], xlab = predictors[i])
}
The distributions differ considerably across predictors. Some are roughly symmetric and close to normal (e.g., Na, Al), while others are clearly non-normal (e.g., Ba, Fe, K).
Relationships between predictors:
library(corrplot)
corrplot(cor(Glass[, 1:9]), method = 'square')
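The correlation plot can be complemented by a ranked list of the strongest pairwise correlations. The following is a minimal sketch in base R; the object names cor_mat and cor_pairs are illustrative and are not part of the original solution.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the nine numeric predictors (column 10 is Type)
cor_mat <- cor(Glass[, 1:9])

# Keep each pair once by blanking the diagonal and the upper triangle
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Reshape to long format: one row per predictor pair
cor_pairs <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "r")
cor_pairs <- cor_pairs[!is.na(cor_pairs$r), ]

# Show the five pairs with the largest absolute correlation
head(cor_pairs[order(-abs(cor_pairs$r)), ], 5)
```

Pairs near the top of this list are candidates for closer inspection with a scatter plot.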
(b)
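One way to approach this question numerically is to compute a skewness statistic for each predictor and count the points that a boxplot would flag. The sketch below uses base R only. The helper skew() is a hypothetical function written for this example (a simple moment-based estimate), and the 1.5 x IQR boxplot rule is just one possible convention for calling a point an outlier.

```r
library(mlbench)
data(Glass)

# Moment-based sample skewness: third central moment over variance^(3/2)
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Number of points beyond the boxplot whiskers (1.5 x IQR rule)
n_flagged <- function(x) length(boxplot.stats(x)$out)

shape_tbl <- data.frame(
  skewness  = sapply(Glass[, 1:9], skew),
  n_flagged = sapply(Glass[, 1:9], n_flagged)
)

# Most strongly skewed predictors first
shape_tbl[order(-abs(shape_tbl$skewness)), ]
```

Values of skewness far from zero indicate asymmetric distributions. For predictors that are mostly zero, many nonzero values will be flagged by the boxplot rule even though they may be legitimate measurements.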
(c)
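A possible starting point for this part is a power transformation. Because some predictors contain exact zeros (Ba and Fe in the output above), a Box-Cox transformation, which requires strictly positive values, cannot be applied to them directly. The sketch below assumes the caret package is installed and uses its Yeo-Johnson option instead; this choice is an assumption of the example, not the original author's method.

```r
library(mlbench)
library(caret)
data(Glass)

predictors_only <- Glass[, 1:9]

# Estimate a Yeo-Johnson transformation per predictor, then center and scale
pp <- preProcess(predictors_only, method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, predictors_only)

# Re-draw the histograms to compare with the untransformed versions
par(mfrow = c(3, 3))
for (nm in names(glass_trans)) {
  hist(glass_trans[[nm]], main = nm, xlab = nm)
}
```

A monotone transformation cannot spread out a large mass of identical values, so predictors dominated by zeros will likely still look degenerate afterwards.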
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict
disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental
conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels
consist of 19 distinct classes.
The data can be loaded via:
> library(mlbench)
> data(Soybean)
> ## See ?Soybean for details
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate
in the ways discussed earlier in this chapter?
(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is
the pattern of missing data related to the classes?
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots
## 0 :551
## 1 : 86
## 2 : 15
## NA's: 31
##
##
##
(a)
sb_freq <- Soybean
head(Soybean[,2:length(sb_freq)])
## date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ
## 1 6 0 2 1 0 1 1 1 0 0
## 2 4 0 2 1 0 2 0 2 1 1
## 3 3 0 2 1 0 1 0 2 1 2
## 4 3 0 2 1 0 1 0 2 0 1
## 5 6 0 2 1 0 2 0 1 0 2
## 6 5 0 2 1 0 3 0 1 0 1
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf
## 1 1 1 0 2 2 0 0
## 2 1 1 0 2 2 0 0
## 3 1 1 0 2 2 0 0
## 4 1 1 0 2 2 0 0
## 5 1 1 0 2 2 0 0
## 6 1 1 0 2 2 0 0
## leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies
## 1 0 1 1 3 1 1
## 2 0 1 0 3 1 1
## 3 0 1 0 3 0 1
## 4 0 1 0 3 0 1
## 5 0 1 0 3 1 1
## 6 0 1 0 3 0 1
## ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 1 1 0 0 0 0 4 0
## 2 1 0 0 0 0 4 0
## 3 1 0 0 0 0 4 0
## 4 1 0 0 0 0 4 0
## 5 1 0 0 0 0 4 0
## 6 1 0 0 0 0 4 0
## mold.growth seed.discolor seed.size shriveling roots
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
sb_freq[, 2:length(sb_freq)] <- lapply(sb_freq[, 2:length(sb_freq)], function(x) as.numeric(as.character(x)))
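library(ggplot2)
library(reshape2)
# Reshape to long format (Class is the id column) and draw one bar chart per predictor
ggplot(data = melt(sb_freq, id.vars = "Class"), mapping = aes(x = value)) +
  geom_bar() +
  facet_wrap(~variable, scales = 'free_x')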
## Warning: Removed 2337 rows containing non-finite values (stat_count).
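The bar charts can be backed up with a numeric check for degenerate distributions. The sketch below is written in base R and assumes a common rule of thumb: a predictor is flagged when the ratio of its most frequent to its second most frequent value exceeds 20 and fewer than 10% of its values are distinct. Both cutoffs are assumptions of this example. Applied to the counts in the summary above, this rule flags leaf.mild (535 vs. 20), mycelium (639 vs. 6), and sclerotia (625 vs. 20).

```r
library(mlbench)
data(Soybean)

predictors_only <- Soybean[, names(Soybean) != "Class"]

# Ratio of the most common level's count to the second most common (NAs ignored)
freq_ratio <- sapply(predictors_only, function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) Inf else as.numeric(counts[1]) / as.numeric(counts[2])
})

# Percentage of distinct observed values relative to the number of rows
pct_unique <- sapply(predictors_only, function(x) {
  100 * length(unique(na.omit(x))) / length(x)
})

degenerate <- data.frame(freq_ratio, pct_unique)

# Predictors meeting both conditions of the rule of thumb
flagged <- degenerate[degenerate$freq_ratio > 20 & degenerate$pct_unique < 10, ]
flagged
```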
(b)
Soybean$na_count <- apply(Soybean, 1, function(x) sum(is.na(x)))
colSums(is.na(Soybean))
## Class date plant.stand precip
## 0 1 36 38
## temp hail crop.hist area.dam
## 30 121 16 1
## sever seed.tmt germ plant.growth
## 121 121 112 16
## leaves leaf.halo leaf.marg leaf.size
## 0 84 84 84
## leaf.shread leaf.malf leaf.mild stem
## 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies
## 121 38 38 106
## ext.decay mycelium int.discolor sclerotia
## 38 38 38 38
## fruit.pods fruit.spots seed mold.growth
## 84 106 92 92
## seed.discolor seed.size shriveling roots
## 106 92 106 31
## na_count
## 0
library(dplyr)
# Total number of missing values per class, largest first
(Soybean %>%
  select(Class, na_count) %>%
  group_by(Class) %>%
  summarise(na_count = sum(na_count)) %>%
  arrange(desc(na_count)))
## # A tibble: 19 x 2
## Class na_count
## <fct> <int>
## 1 phytophthora-rot 1214
## 2 2-4-d-injury 450
## 3 cyst-nematode 336
## 4 diaporthe-pod-&-stem-blight 177
## 5 herbicide-injury 160
## 6 alternarialeaf-spot 0
## 7 anthracnose 0
## 8 bacterial-blight 0
## 9 bacterial-pustule 0
## 10 brown-spot 0
## 11 brown-stem-rot 0
## 12 charcoal-rot 0
## 13 diaporthe-stem-canker 0
## 14 downy-mildew 0
## 15 frog-eye-leaf-spot 0
## 16 phyllosticta-leaf-spot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
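The table above shows that all missing values fall in five classes, while the remaining 14 classes are complete. Because raw counts depend on class size, it may be easier to compare the share of incomplete samples per class. The following is a minimal sketch; it loads a fresh copy of the data into a separate environment (named env here, an illustrative choice) so that the Soybean object with its na_count column is left untouched.

```r
library(mlbench)

# Load an unmodified copy of the data without overwriting the working Soybean object
env <- new.env()
data(Soybean, package = "mlbench", envir = env)
soy <- env$Soybean

# TRUE for every sample that has at least one missing predictor
row_incomplete <- !complete.cases(soy[, names(soy) != "Class"])

# Share of incomplete samples within each class
incomplete_rate <- tapply(row_incomplete, soy$Class, mean)
round(sort(incomplete_rate, decreasing = TRUE), 2)
```

A rate close to 1 for a class would mean that almost every sample in that class has at least one missing predictor, which is relevant when deciding whether to drop rows or impute.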
(c)
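The mice package implements multivariate imputation by chained equations (MICE).
library(mice)
# Drop na_count (column 37) and eight predictors by position: shriveling, seed.size,
# seed.discolor, sclerotia, mycelium, lodging, leaf.mild, and leaf.malf
sb <- Soybean[, -c(37, 35, 34, 33, 28, 26, 21, 19, 18)]
# Convert the remaining factor predictors to numeric codes so that cor() can be applied
sb[, 2:length(sb)] <- lapply(sb[, 2:length(sb)], function(x) as.numeric(as.character(x)))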
str(sb[,-1])
## 'data.frame': 683 obs. of 27 variables:
## $ date : num 6 4 3 3 6 5 5 4 6 4 ...
## $ plant.stand : num 0 0 0 0 0 0 0 0 0 0 ...
## $ precip : num 2 2 2 2 2 2 2 2 2 2 ...
## $ temp : num 1 1 1 1 1 1 1 1 1 1 ...
## $ hail : num 0 0 0 0 0 0 0 1 0 0 ...
## $ crop.hist : num 1 2 1 1 2 3 2 1 3 2 ...
## $ area.dam : num 1 0 0 0 0 0 0 0 0 0 ...
## $ sever : num 1 2 2 2 1 1 1 1 1 2 ...
## $ seed.tmt : num 0 1 1 0 0 0 1 0 1 0 ...
## $ germ : num 0 1 2 1 2 1 0 2 1 2 ...
## $ plant.growth : num 1 1 1 1 1 1 1 1 1 1 ...
## $ leaves : num 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.halo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ leaf.marg : num 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.size : num 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.shread : num 0 0 0 0 0 0 0 0 0 0 ...
## $ stem : num 1 1 1 1 1 1 1 1 1 1 ...
## $ stem.cankers : num 3 3 3 3 3 3 3 3 3 3 ...
## $ canker.lesion : num 1 1 0 0 1 0 1 1 1 1 ...
## $ fruiting.bodies: num 1 1 1 1 1 1 1 1 1 1 ...
## $ ext.decay : num 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fruit.pods : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fruit.spots : num 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mold.growth : num 0 0 0 0 0 0 0 0 0 0 ...
## $ roots : num 0 0 0 0 0 0 0 0 0 0 ...
summary(sb[,-1])
## date plant.stand precip temp
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:1.000
## Median :4.000 Median :0.0000 Median :2.000 Median :1.000
## Mean :3.554 Mean :0.4529 Mean :1.597 Mean :1.182
## 3rd Qu.:5.000 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :6.000 Max. :1.0000 Max. :2.000 Max. :2.000
## NA's :1 NA's :36 NA's :38 NA's :30
## hail crop.hist area.dam sever
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.000 Median :2.000 Median :1.000 Median :1.0000
## Mean :0.226 Mean :1.885 Mean :1.581 Mean :0.7331
## 3rd Qu.:0.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :1.000 Max. :3.000 Max. :3.000 Max. :2.0000
## NA's :121 NA's :16 NA's :1 NA's :121
## seed.tmt germ plant.growth leaves
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :1.000 Median :0.0000 Median :1.0000
## Mean :0.5196 Mean :1.049 Mean :0.3388 Mean :0.8873
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2.0000 Max. :2.000 Max. :1.0000 Max. :1.0000
## NA's :121 NA's :112 NA's :16
## leaf.halo leaf.marg leaf.size leaf.shread
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :2.000 Median :0.000 Median :1.000 Median :0.0000
## Mean :1.202 Mean :0.773 Mean :1.284 Mean :0.1647
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :1.0000
## NA's :84 NA's :84 NA's :84 NA's :100
## stem stem.cankers canker.lesion fruiting.bodies
## Min. :0.0000 Min. :0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.00 Median :1.0000 Median :0.0000
## Mean :0.5562 Mean :1.06 Mean :0.9798 Mean :0.1802
## 3rd Qu.:1.0000 3rd Qu.:3.00 3rd Qu.:2.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.00 Max. :3.0000 Max. :1.0000
## NA's :16 NA's :38 NA's :38 NA's :106
## ext.decay int.discolor fruit.pods fruit.spots
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.000
## Mean :0.2496 Mean :0.1302 Mean :0.5042 Mean :1.021
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :2.0000 Max. :2.0000 Max. :3.0000 Max. :4.000
## NA's :38 NA's :38 NA's :84 NA's :106
## seed mold.growth roots
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1946 Mean :0.1134 Mean :0.1779
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :2.0000
## NA's :92 NA's :92 NA's :31
correlated_sb <- cor(sb[,-1], use="pairwise.complete.obs")
corrplot(correlated_sb, method = "square", order = "hclust")
Observations and possible actions:
leaf.marg is strongly correlated with both leaf.halo and leaf.size.
Removing leaf.marg, and possibly leaf.halo and leaf.size as well, would reduce this redundancy.
fruit.spots and plant.growth are also strongly correlated.
fruit.spots has far more missing values than plant.growth (106 versus 16), so it is the better candidate for removal.
For imputation, k-nearest neighbors could be used to fill in the remaining missing values.
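The k-nearest neighbors idea mentioned above could be sketched as follows. This example assumes the VIM package is installed and uses its kNN() function, which can handle categorical predictors. VIM is not used anywhere in the original solution, and k = 5 is an arbitrary choice for illustration. Class is excluded from both the imputed variables and the distance calculation so that the outcome does not influence the imputed values.

```r
library(mlbench)
library(VIM)

# Load an unmodified copy of the data without overwriting the working Soybean object
env <- new.env()
data(Soybean, package = "mlbench", envir = env)
soy <- env$Soybean

predictor_names <- setdiff(names(soy), "Class")

# Impute each predictor from its 5 nearest neighbors, measured on the predictors only
soy_imputed <- kNN(
  soy,
  variable = predictor_names,
  dist_var = predictor_names,
  k        = 5,
  imp_var  = FALSE   # do not append the logical *_imp indicator columns
)

# Missing values before and after imputation
c(before = sum(is.na(soy)), after = sum(is.na(soy_imputed)))
```

In a modeling workflow the imputation would normally be estimated on the training set only and then applied to held-out data, so this full-data version is only a starting point.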