The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(mlbench)   # Glass data set
library(dplyr)     # select(), %>%
library(tidyr)     # gather()
library(ggplot2)
data(Glass)
# Reshape the nine predictors to long format: one row per (variable, value) pair.
predictor_vars <- Glass %>%
  select(-Type) %>%
  gather(key = 'predictor_variable', value = 'value')
# Plot a density-scaled histogram with a density curve for each predictor variable.
ggplot(predictor_vars) +
  geom_histogram(aes(x = value, y = after_stat(density)), bins = 30, fill = 'blue') +
  labs(title = 'Distributions of Predictor Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_density(aes(x = value), color = 'red') +
  facet_wrap(~ predictor_variable, scales = 'free', ncol = 3)
# Plot the number of samples in each glass type (the outcome).
ggplot(Glass, aes(x = Type)) +
  geom_bar(fill = 'blue') +
  labs(title = 'Distribution of Categorical Variables') +
  theme(plot.title = element_text(hjust = 0.5))
The Ba, Fe, and K columns are heavily right-skewed. Their values are bounded below at 0, and a few large outliers stretch the right tail of each distribution.
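To put a number on the skew seen in the histograms, a small sketch like the following could be used. The helper `sample_skewness` is a hypothetical name introduced here for illustration, not part of the original analysis. It uses the simple moment-based estimator; packages such as e1071 provide variants with slightly different small-sample corrections.

```r
library(mlbench)  # provides the Glass data set

data(Glass)

# Hypothetical helper: moment-based sample skewness, i.e. the third central
# moment divided by the second central moment raised to the power 1.5.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Drop the outcome and compute the skewness of each of the nine predictors.
predictors <- Glass[, names(Glass) != "Type"]
skew_values <- sort(sapply(predictors, sample_skewness), decreasing = TRUE)
print(round(skew_values, 2))
```

Values well above 0 indicate a long right tail, values well below 0 a long left tail, and values near 0 an approximately symmetric distribution.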
The distributions of Ba, Fe, K, Mg, RI, and Si are not symmetric. A model using these variables would likely benefit from centering, scaling, and a Box-Cox transformation.
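One possible way to apply these three steps is caret's `preProcess`, sketched below. This is an illustration under assumptions, not the transformation actually fitted in this analysis. The object names `pre_proc` and `glass_transformed` are made up here. Note that the Box-Cox transformation is only defined for strictly positive values, so as far as I understand caret will leave predictors containing zeros (several of the element percentages here) untransformed; `"YeoJohnson"` is an alternative method in `preProcess` that accepts zeros.

```r
library(mlbench)  # Glass data set
library(caret)    # preProcess()

data(Glass)

# Keep only the nine numeric predictors.
predictors <- Glass[, names(Glass) != "Type"]

# Estimate the transformations from the data: Box-Cox (where applicable),
# then centering and scaling.
pre_proc <- preProcess(predictors, method = c("BoxCox", "center", "scale"))
print(pre_proc)

# Apply the estimated transformations to the predictors.
glass_transformed <- predict(pre_proc, predictors)
summary(glass_transformed)
```

Re-plotting the histograms on `glass_transformed` would show how much of the asymmetry remains after the transformation.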
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Using nearZeroVar shows that the leaf.mild, mycelium, and sclerotia columns meet the conditions for degenerate (near-zero-variance) features.
data(Soybean)
summary(Soybean)
## Class                    date         plant.stand  precip    temp
## brown-spot         : 92  5      :149  0   :354     0   : 74  0   : 80
## alternarialeaf-spot: 91  4      :131  1   :293     1   :112  1   :374
## frog-eye-leaf-spot : 91  3      :118  NA's: 36     2   :459  2   :199
## phytophthora-rot   : 88  2      : 93               NA's: 38  NA's: 30
## anthracnose        : 44  6      : 90
## brown-stem-rot     : 44  (Other):101
## (Other)            :233  NA's   :  1
## hail      crop.hist  area.dam  sever     seed.tmt  germ      plant.growth
## 0   :435  0   : 65   0   :123  0   :195  0   :305  0   :165  0   :441
## 1   :127  1   :165   1   :227  1   :322  1   :222  1   :213  1   :226
## NA's:121  2   :219   2   :145  2   : 45  2   : 35  2   :193  NA's: 16
##           3   :218   3   :187  NA's:121  NA's:121  NA's:112
##           NA's: 16   NA's:  1
##
##
## leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild
## 0: 77   0   :221   0   :357   0   : 51   0   :487     0   :554   0   :535
## 1:606   1   : 36   1   : 21   1   :327   1   : 96     1   : 45   1   : 20
##         2   :342   2   :221   2   :221   NA's:100     NA's: 84   2   : 20
##         NA's: 84   NA's: 84   NA's: 84                           NA's:108
##
##
##
## stem      lodging   stem.cankers  canker.lesion  fruiting.bodies  ext.decay
## 0   :296  0   :520  0   :379      0   :320       0   :473         0   :497
## 1   :371  1   : 42  1   : 39      1   : 83       1   :104         1   :135
## NA's: 16  NA's:121  2   : 36      2   :177       NA's:106         2   : 13
##                     3   :191      3   : 65                        NA's: 38
##                     NA's: 38      NA's: 38
##
##
## mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed
## 0   :639  0   :581      0   :625   0   :407    0   :345     0   :476
## 1   :  6  1   : 44      1   : 20   1   :130    1   : 75     1   :115
## NA's: 38  2   : 20      NA's: 38   2   : 14    2   : 57     NA's: 92
##           NA's: 38                 3   : 48    4   :100
##                                    NA's: 84    NA's:106
##
##
## mold.growth  seed.discolor  seed.size  shriveling  roots
## 0   :524     0   :513       0   :532   0   :539    0   :551
## 1   : 67     1   : 64       1   : 59   1   : 38    1   : 86
## NA's: 92     NA's:106       NA's: 92   NA's:106    2   : 15
##                                                    NA's: 31
##
##
##
library(caret)  # provides nearZeroVar()
# Column indices of the near-zero-variance predictors.
nearZeroVar(Soybean)
## [1] 19 26 28
colnames(Soybean[,c(19,26,28)])
## [1] "leaf.mild" "mycelium" "sclerotia"
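To see why these three columns are flagged, `nearZeroVar` can return the metrics behind its decision via `saveMetrics = TRUE`. The sketch below is an illustration rather than part of the original analysis, and `nzv_metrics` is a name chosen here. With caret's default cutoffs (`freqCut = 95/5`, `uniqueCut = 10`), a column is flagged when the ratio of its most common to second most common value exceeds 19 and its percentage of unique values is below 10. Using the counts in the summary above, the ratios are 535/20 = 26.75 for leaf.mild, 639/6 = 106.5 for mycelium, and 625/20 = 31.25 for sclerotia.

```r
library(mlbench)  # Soybean data set
library(caret)    # nearZeroVar()

data(Soybean)

# Return a data frame with freqRatio, percentUnique, zeroVar, and nzv
# for every column instead of only the flagged column indices.
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Show only the columns flagged as near-zero-variance.
print(nzv_metrics[nzv_metrics$nzv, ])
```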
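The visualization below of missing values per feature shows a stepwise pattern: groups of features share the same number of missing values (for example, hail, sever, seed.tmt, and lodging each have 121), which suggests that these values tend to be missing together in the same samples.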
# Count the missing values in each column of the Soybean data.
soybean_missing_counts <- data.frame(
  Feature = names(Soybean),
  NA_Count = colSums(is.na(Soybean))
)
# Horizontal bar chart with features ordered by their number of missing values.
ggplot(soybean_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) +
  geom_col(fill = 'blue') +
  labs(title = 'Soybean Missing Counts') +
  theme(plot.title = element_text(hjust = 0.5))
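The steps in the chart can also be read off directly by grouping features that have the same number of missing values. The following is a minimal sketch for that check; `na_counts` and `na_groups` are names introduced here for illustration.

```r
library(mlbench)  # Soybean data set

data(Soybean)

# Number of missing values per column, keeping only columns with at least one.
na_counts <- colSums(is.na(Soybean))
na_counts <- na_counts[na_counts > 0]

# Group feature names by their missing-value count; each list element is one
# "step" in the bar chart. rev() lists the largest counts first.
na_groups <- split(names(na_counts), na_counts)
print(rev(na_groups))
```

Features that land in the same group are candidates for having been left blank together, which could then be checked row by row.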
I used the mice package, which invokes predictive mean matching, to impute the missing data. The summary below shows the distribution of the predictor variables after imputation.
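# `imputed` is the object returned by mice(). The namespace prefix avoids
# ambiguity with tidyr::complete().
impute_df <- mice::complete(imputed, 1)
summary(impute_df)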
## Class                    date   plant.stand  precip  temp   hail
## brown-spot         : 92  0: 26  0:390        0: 96   0: 83  0:486
## alternarialeaf-spot: 91  1: 75  1:293        1:114   1:382  1:197
## frog-eye-leaf-spot : 91  2: 93               2:473   2:218
## phytophthora-rot   : 88  3:118
## anthracnose        : 44  4:131
## brown-stem-rot     : 44  5:149
## (Other)            :233  6: 91
## crop.hist  area.dam  sever  seed.tmt  germ   plant.growth  leaves  leaf.halo
## 0: 67      0:123     0:316  0:335     0:165  0:441         0: 77   0:276
## 1:169      1:227     1:322  1:263     1:213  1:242         1:606   1: 36
## 2:225      2:145     2: 45  2: 85     2:305                        2:371
## 3:222      3:188
##
##
##
## leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem   lodging
## 0:372      0:121      0:575        0:575      0:643      0:312  0:549
## 1: 21      1:327      1:108        1:108      1: 20      1:371  1:134
## 2:290      2:235                              2: 20
##
##
##
##
## stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor
## 0:417         0:332          0:503            0:535      0:651     0:615
## 1: 39         1: 87          1:180            1:135      1: 32     1: 44
## 2: 36         2:177                           2: 13                2: 24
## 3:191         3: 87
##
##
##
## sclerotia  fruit.pods  fruit.spots  seed   mold.growth  seed.discolor  seed.size
## 0:625      0:423       0:373        0:476  0:544        0:589          0:624
## 1: 58      1:130       1: 81        1:207  1:139        1: 94          1: 59
##            2: 14       2: 72
##            3:116       4:157
##
##
##
## shriveling  roots
## 0:627       0:582
## 1: 56       1: 86
##             2: 15
##
##
##
##
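The call that created `imputed` does not appear above. A minimal sketch of what such a call could look like is given below, assuming predictive mean matching is requested explicitly with `method = "pmm"` (mice's defaults for factor columns are logistic and polytomous regression). The values of `m`, `maxit`, and `seed` are assumptions chosen for illustration, so this sketch will not necessarily reproduce the counts in the summary above.

```r
library(mlbench)  # Soybean data set
library(mice)     # mice(), complete()

data(Soybean)

# Assumed settings: a single imputed data set (m = 1), five iterations,
# and a fixed seed so the result is repeatable.
imputed <- mice(Soybean, method = "pmm", m = 1, maxit = 5,
                seed = 123, printFlag = FALSE)

# Extract the first (and here only) completed data set.
impute_df <- mice::complete(imputed, 1)

# Check how many missing values remain in each column.
print(colSums(is.na(impute_df)))
```

Because imputation is stochastic, comparing `summary(impute_df)` across a few different seeds is a quick way to see how sensitive the imputed category counts are.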