Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling.
The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(fpp3)
library(mlbench)
library(corrplot)
library(moments)
library(tidyverse)
library(VIM)
library(naniar)
data(Glass)
str(Glass)## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(Glass)## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
Glass |> pivot_longer(cols = c(1:9), names_to = 'predictors', values_to = 'value') |> ggplot()+ geom_histogram(aes(value))+
facet_wrap(~predictors, scales='free')+
labs(title='Distributions of All Predictors')corrplot(cor(Glass[,1:9]), method='number', type='lower')
Al, Ba, CA, Fe, K and RI look right skewed. However, BA, Fe and K have
many data at zero. Mg is left skewed with many data at zero. Na and Si
are quite normal. RI has strong correlation with Al, Ca and Si. RI also
has some correlation with K. Other strong correlation include Mg and
Al/Ba/Ca, Ba and Al/Na/Mg.
Glass |> pivot_longer(cols = c(1:9), names_to = 'predictors', values_to = 'value') |> ggplot()+ geom_boxplot(aes(value), outlier.colour = 'green')+
facet_wrap(~predictors, scales='free')+
labs(title='Boxplot of All Predictors')apply(Glass[-10], 2, skewness)## RI Na Mg Al Si K Ca
## 1.6140150 0.4509917 -1.1444648 0.9009179 -0.7253173 6.5056358 2.0326774
## Ba Fe
## 3.3924309 1.7420068
All green dots above are outliners and only Mg has no outliner. According to the skewness check, Mg and Si are left skewed and the rest are right skewed. Na is the one closet to normal distribution.
log_transform <- function(x) {
if_else(x != 0, log(x + 1e-11), log(1))
}
Glass_log<-Glass|>
select(RI, Al, K, Ca, Ba, Fe)|> mutate_all(log_transform)
Glass_log |> pivot_longer(cols = c(1:6), names_to = 'predictors_log', values_to = 'value_log') |> ggplot()+ geom_histogram(aes(value_log))+
facet_wrap(~predictors_log, scales='free')+
labs(title='Distributions of All Predictors')apply(Glass_log, 2, skewness)## RI Al K Ca Ba Fe
## 1.6015381 -0.8405381 -1.1189706 1.0588744 -3.4060277 -1.2504705
There are transformations including logarithmic, square root, box-cox and reciprocal to handle the skewness. Since most of the predictors are right skewed, a logarithmic transformation is used. However, it’s recommended to remove those zeros first before transformation.
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
data("Soybean")
summary(Soybean)## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
A degenerate distribution is a probability distribution in a space with support only on a manifold of lower dimension, and according to others a distribution with support only at a single point. By the latter definition, it is a deterministic distribution and takes only a single value. This distribution satisfies the definition of “random variable” even though it does not appear random in the everyday sense of the word; hence it is considered degenerate.
Soybean|> gather() |> ggplot(aes(value)) +
geom_bar()+
facet_wrap(~key, scales='free')+
labs(title='Frequency Distributions For All Categorical Predictors')
plant.stand, hail, plant.growth, leaves, leaf.shread, leaf.malf, stem,
lodging, fruiting.bodies, mycelium, sclerotia, seed, mold.growth,
seed.discolor, seed.size and shriveling are degenerate.
missing <- aggr(Soybean, col=c('red','orange'), numbers=TRUE,
sortVars=TRUE, labels=names(Soybean), cex.axis=.5, gap=1,
ylab=c('Histogram of missing data','Pattern'))##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
gg_miss_upset(Soybean, nsets=10)
Hail, sever, seed.tmt and lodging has the most missing values. There are
pattern of missing values between these 4 predictors.
library(caret)## 载入需要的程辑包:lattice
##
## 载入程辑包:'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following objects are masked from 'package:fabletools':
##
## MAE, RMSE
# Identification of near zero variance predictors
nearZeroVar(Soybean,freqCut = 100/5, uniqueCut = 15)## [1] 19 26 28
# Substitute the missing value by KNN
Soybean_new <- Soybean[,-c(19,26,28)] |>
kNN()
aggr(Soybean_new, col=c('orange'), numbers=TRUE, sortVars=TRUE, labels=names(Soybean_new), cex.axis=.5, gap=1,ylab=c('Histogram of missing data','Pattern'))##
## Variables sorted by number of missings:
## Variable Count
## Class 0
## date 0
## plant.stand 0
## precip 0
## temp 0
## hail 0
## crop.hist 0
## area.dam 0
## sever 0
## seed.tmt 0
## germ 0
## plant.growth 0
## leaves 0
## leaf.halo 0
## leaf.marg 0
## leaf.size 0
## leaf.shread 0
## leaf.malf 0
## stem 0
## lodging 0
## stem.cankers 0
## canker.lesion 0
## fruiting.bodies 0
## ext.decay 0
## int.discolor 0
## fruit.pods 0
## fruit.spots 0
## seed 0
## mold.growth 0
## seed.discolor 0
## seed.size 0
## shriveling 0
## roots 0
## Class_imp 0
## date_imp 0
## plant.stand_imp 0
## precip_imp 0
## temp_imp 0
## hail_imp 0
## crop.hist_imp 0
## area.dam_imp 0
## sever_imp 0
## seed.tmt_imp 0
## germ_imp 0
## plant.growth_imp 0
## leaves_imp 0
## leaf.halo_imp 0
## leaf.marg_imp 0
## leaf.size_imp 0
## leaf.shread_imp 0
## leaf.malf_imp 0
## stem_imp 0
## lodging_imp 0
## stem.cankers_imp 0
## canker.lesion_imp 0
## fruiting.bodies_imp 0
## ext.decay_imp 0
## int.discolor_imp 0
## fruit.pods_imp 0
## fruit.spots_imp 0
## seed_imp 0
## mold.growth_imp 0
## seed.discolor_imp 0
## seed.size_imp 0
## shriveling_imp 0
## roots_imp 0
Using KNN to substitute the missing data and the result can be found above.