The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)  # provides the Glass and Soybean data sets
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(psych)  # describe()
describe(Glass)
## vars n mean sd median trimmed mad min max range skew kurtosis
## RI 1 214 1.52 0.00 1.52 1.52 0.00 1.51 1.53 0.02 1.60 4.72
## Na 2 214 13.41 0.82 13.30 13.38 0.64 10.73 17.38 6.65 0.45 2.90
## Mg 3 214 2.68 1.44 3.48 2.87 0.30 0.00 4.49 4.49 -1.14 -0.45
## Al 4 214 1.44 0.50 1.36 1.41 0.31 0.29 3.50 3.21 0.89 1.94
## Si 5 214 72.65 0.77 72.79 72.71 0.57 69.81 75.41 5.60 -0.72 2.82
## K 6 214 0.50 0.65 0.56 0.43 0.17 0.00 6.21 6.21 6.46 52.87
## Ca 7 214 8.96 1.42 8.60 8.74 0.66 5.43 16.19 10.76 2.02 6.41
## Ba 8 214 0.18 0.50 0.00 0.03 0.00 0.00 3.15 3.15 3.37 12.08
## Fe 9 214 0.06 0.10 0.00 0.04 0.00 0.00 0.51 0.51 1.73 2.52
## Type* 10 214 2.54 1.71 2.00 2.31 1.48 1.00 6.00 5.00 1.04 -0.29
## se
## RI 0.00
## Na 0.06
## Mg 0.10
## Al 0.03
## Si 0.05
## K 0.04
## Ca 0.10
## Ba 0.03
## Fe 0.01
## Type* 0.12
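library(ggplot2)
library(reshape2)  # melt()
library(corrplot)
# Density of each predictor (melt() keeps the factor Type as the id variable)
ggplot(melt(Glass), aes(value)) +
  facet_wrap(~ variable, scales = "free") +
  geom_density()
# Plot a correlation graph of the nine predictors (column 10, Type, is dropped) while ignoring NAs
cor.glass <- cor(Glass[-10], use = "pairwise.complete.obs")
corrplot(cor.glass, method = "color", addCoef.col = "black", na.label = "NA", number.cex = 0.5, tl.cex = 0.5)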
There are no missing values in the dataframe. Mg, K, Ba and Fe all have zeroes in their values.
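Both observations can be checked with column-wise counts. The following is a minimal base R sketch on a made-up data frame (`toy` and its values are illustrative, not the Glass data); the same two calls could be applied to `Glass[-10]`.

```r
# Toy data frame standing in for a few predictors (values are made up)
toy <- data.frame(
  Na = c(13.6, 13.9, 13.5, 13.2),
  Ba = c(0, 0, 0.09, 0),
  Fe = c(0, 0.26, 0, NA)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))

# Number of exact zeros per column (NAs are skipped)
zero_counts <- colSums(toy == 0, na.rm = TRUE)

print(na_counts)
print(zero_counts)
```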
There is a strong correlation between Ca and RI (0.81). Some other pairs of variables exhibit moderate correlation: Ba/Al (0.48), Si/RI (-0.54), and Ba/Mg (-0.49).
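Reading pairs off a correlation plot by eye is error-prone, so it can help to list them programmatically. This is a sketch in base R on made-up data; the `cutoff` of 0.45 is a hypothetical threshold for "at least moderate", not a value taken from the analysis.

```r
# Toy data (made up): y is nearly linear in x, z is unrelated
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
  z = c(5, 1, 4, 2, 6, 3)
)

cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once by blanking the diagonal and the lower triangle
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Long format: one row per pair of variables
pairs <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(pairs) <- c("var1", "var2", "r")
pairs <- pairs[!is.na(pairs$r), ]

cutoff <- 0.45  # hypothetical threshold
strong <- pairs[abs(pairs$r) >= cutoff, ]
strong <- strong[order(-abs(strong$r)), ]
print(strong)
```

Replacing `toy` with `Glass[-10]` would give the corresponding list for the glass predictors.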
K might have outliers as the tail is very long and narrow. Additionally, RI, Ca, Ba, and Fe appear to be right-skewed.
library(e1071)  # skewness(); assumed to be the package the function comes from
skewValues <- apply(Glass[-10], 2, skewness)
skewValues
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
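For reference, one common definition of sample skewness divides the third central moment by the cubed sample standard deviation. The sketch below writes that definition out in base R; `sample_skewness` is a hypothetical helper, the vectors are made up, and it is an assumption that this is the variant computed by the `skewness` call above.

```r
# Moment-based sample skewness: mean((x - mean(x))^3) / sd(x)^3
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

symmetric <- c(1, 2, 3, 4, 5)
right_tailed <- c(0, 0, 0, 0, 1, 10)  # made-up values with a long right tail

sample_skewness(symmetric)      # zero for a symmetric sample
sample_skewness(right_tailed)   # positive for a right-skewed sample
```

A positive value indicates a longer right tail, a negative value a longer left tail, which is how the signs for Mg and Si versus K and Ba should be read.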
library(caret)  # preProcess(), spatialSign(), nearZeroVar()
glass_transformed <- preProcess(Glass[-10], method = c('BoxCox', 'center', 'scale'))
glass_transformed
## Created from 214 samples and 9 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - centered (9)
## - ignored (0)
## - scaled (9)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
glass_transformed_df <- predict(glass_transformed, Glass[-10])
ggplot(melt(glass_transformed_df), aes(value)) +
  facet_wrap(~ variable, scales = "free") +
  geom_density()
# spatialSign() returns a matrix, so melt() names the column index Var2
ss_df <- spatialSign(glass_transformed_df)
ggplot(melt(ss_df), aes(value)) +
  facet_wrap(~ Var2, scales = "free") +
  geom_density()
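The spatial sign projects every sample onto the unit sphere by dividing each row by its Euclidean norm, which is why it is applied after centering and scaling. A minimal base R sketch of that idea follows; `spatial_sign` is a hypothetical helper written for illustration, not caret's implementation, and the matrix is made up.

```r
# Divide each row by its Euclidean norm
spatial_sign <- function(m) {
  m <- as.matrix(m)
  norms <- sqrt(rowSums(m^2))
  norms[norms == 0] <- 1  # leave all-zero rows unchanged instead of dividing by zero
  m / norms               # norms recycles down the columns, so row i is divided by norms[i]
}

# Rows 1 and 2 point in the same direction but differ in magnitude by a factor of 10
toy <- rbind(c(3, 4), c(0.3, 0.4), c(-30, 40))
ss <- spatial_sign(toy)
print(ss)
print(rowSums(ss^2))
```

After the transformation only the direction of each sample is kept, so an extreme sample can no longer sit far from the bulk of the data.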
# Log-transform the most skewed predictors; log1p(x) = log(x + 1) is defined at zero
Glass <- transform(Glass,
                   K_log  = log1p(K),
                   Ba_log = log1p(Ba),
                   Ca_log = log1p(Ca),
                   Fe_log = log1p(Fe))
ggplot(melt(Glass), aes(value)) +
  facet_wrap(~ variable, scales = "free") +
  geom_density()
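Neither centering and scaling, Box-Cox, the spatial sign, nor a log(x + 1) transformation was sufficient to fix the distributions of the most problematic variables. Box-Cox was applied to only five of the nine predictors because it requires strictly positive values, and Mg, K, Ba and Fe all contain zeros, so the two most skewed predictors (K and Ba) were left untouched by it.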
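The positivity requirement of Box-Cox can be made explicit with a small base R sketch. `box_cox` is a hypothetical helper, the data frame is made up, and `lambda` is supplied by hand here, whereas `preProcess` estimates it from the data.

```r
# Box-Cox transform for a given lambda; defined only for strictly positive x
box_cox <- function(x, lambda) {
  if (any(x <= 0, na.rm = TRUE)) stop("Box-Cox needs strictly positive values")
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

toy <- data.frame(
  Ca = c(8.75, 7.83, 7.78, 8.22),  # strictly positive
  Ba = c(0, 0, 0.09, 0)            # contains zeros
)

# Columns that qualify for Box-Cox
eligible <- names(toy)[sapply(toy, function(col) all(col > 0, na.rm = TRUE))]
print(eligible)

ca_bc <- box_cox(toy$Ca, lambda = -1.1)  # hand-picked lambda, for illustration only
log_ba <- log1p(toy$Ba)                  # log(x + 1) is a fallback that tolerates zeros
```

For predictors dominated by exact zeros, no monotone transformation can spread out the spike at zero, which is consistent with the density plots changing little.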
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
describe(Soybean)
## vars n mean sd median trimmed mad min max range skew
## Class* 1 683 9.30 5.51 8 9.18 7.41 1 19 18 0.11
## date* 2 682 4.55 1.69 5 4.62 1.48 1 7 6 -0.30
## plant.stand* 3 647 1.45 0.50 1 1.44 0.00 1 2 1 0.19
## precip* 4 645 2.60 0.69 3 2.74 0.00 1 3 2 -1.42
## temp* 5 653 2.18 0.63 2 2.23 0.00 1 3 2 -0.16
## hail* 6 562 1.23 0.42 1 1.16 0.00 1 2 1 1.31
## crop.hist* 7 667 2.88 0.98 3 2.98 1.48 1 4 3 -0.40
## area.dam* 8 682 2.58 1.07 2 2.60 1.48 1 4 3 0.02
## sever* 9 562 1.73 0.60 2 1.69 0.00 1 3 2 0.17
## seed.tmt* 10 562 1.52 0.61 1 1.45 0.00 1 3 2 0.74
## germ* 11 571 2.05 0.79 2 2.06 1.48 1 3 2 -0.09
## plant.growth* 12 667 1.34 0.47 1 1.30 0.00 1 2 1 0.68
## leaves* 13 683 1.89 0.32 2 1.98 0.00 1 2 1 -2.44
## leaf.halo* 14 599 2.20 0.95 3 2.25 0.00 1 3 2 -0.41
## leaf.marg* 15 599 1.77 0.96 1 1.72 0.00 1 3 2 0.46
## leaf.size* 16 599 2.28 0.61 2 2.34 0.00 1 3 2 -0.25
## leaf.shread* 17 583 1.16 0.37 1 1.08 0.00 1 2 1 1.80
## leaf.malf* 18 599 1.08 0.26 1 1.00 0.00 1 2 1 3.22
## leaf.mild* 19 575 1.10 0.40 1 1.00 0.00 1 3 2 3.95
## stem* 20 667 1.56 0.50 2 1.57 0.00 1 2 1 -0.23
## lodging* 21 562 1.07 0.26 1 1.00 0.00 1 2 1 3.23
## stem.cankers* 22 645 2.06 1.35 1 1.95 0.00 1 4 3 0.61
## canker.lesion* 23 645 1.98 1.08 2 1.85 1.48 1 4 3 0.51
## fruiting.bodies* 24 577 1.18 0.38 1 1.10 0.00 1 2 1 1.66
## ext.decay* 25 645 1.25 0.48 1 1.16 0.00 1 3 2 1.70
## mycelium* 26 645 1.01 0.10 1 1.00 0.00 1 2 1 10.20
## int.discolor* 27 645 1.13 0.42 1 1.00 0.00 1 3 2 3.34
## sclerotia* 28 645 1.03 0.17 1 1.00 0.00 1 2 1 5.40
## fruit.pods* 29 599 1.50 0.88 1 1.28 0.00 1 4 3 1.84
## fruit.spots* 30 577 1.85 1.17 1 1.69 0.00 1 4 3 0.95
## seed* 31 591 1.19 0.40 1 1.12 0.00 1 2 1 1.54
## mold.growth* 32 591 1.11 0.32 1 1.02 0.00 1 2 1 2.43
## seed.discolor* 33 577 1.11 0.31 1 1.02 0.00 1 2 1 2.47
## seed.size* 34 591 1.10 0.30 1 1.00 0.00 1 2 1 2.66
## shriveling* 35 577 1.07 0.25 1 1.00 0.00 1 2 1 3.49
## roots* 36 652 1.18 0.44 1 1.07 0.00 1 3 2 2.46
## kurtosis se
## Class* -1.38 0.21
## date* -0.90 0.06
## plant.stand* -1.97 0.02
## precip* 0.55 0.03
## temp* -0.58 0.02
## hail* -0.29 0.02
## crop.hist* -0.92 0.04
## area.dam* -1.29 0.04
## sever* -0.56 0.03
## seed.tmt* -0.44 0.03
## germ* -1.40 0.03
## plant.growth* -1.54 0.02
## leaves* 3.98 0.01
## leaf.halo* -1.76 0.04
## leaf.marg* -1.75 0.04
## leaf.size* -0.63 0.02
## leaf.shread* 1.26 0.02
## leaf.malf* 8.35 0.01
## leaf.mild* 14.68 0.02
## stem* -1.95 0.02
## lodging* 8.42 0.01
## stem.cankers* -1.51 0.05
## canker.lesion* -1.24 0.04
## fruiting.bodies* 0.75 0.02
## ext.decay* 1.98 0.02
## mycelium* 102.18 0.00
## int.discolor* 10.57 0.02
## sclerotia* 27.19 0.01
## fruit.pods* 2.41 0.04
## fruit.spots* -0.76 0.05
## seed* 0.37 0.02
## mold.growth* 3.93 0.01
## seed.discolor* 4.12 0.01
## seed.size* 5.10 0.01
## shriveling* 10.21 0.01
## roots* 5.49 0.02
The 35 predictors are categorical, with levels coded as numbers that do not explain themselves. Some variables, such as precipitation and temperature, appear to be continuous data binned into categories. We look up the meaning of the codes by calling ?Soybean.
[,1] Class the 19 classes
[,2] date apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3] plant.stand normal(0),lt-normal(1).
[,4] precip lt-norm(0),norm(1),gt-norm(2).
[,5] temp lt-norm(0),norm(1),gt-norm(2).
[,6] hail yes(0),no(1).
[,7] crop.hist dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8] area.dam scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9] sever minor(0),pot-severe(1),severe(2).
[,10] seed.tmt none(0),fungicide(1),other(2).
[,11] germ 90-100%(0),80-89%(1),lt-80%(2).
[,12] plant.growth norm(0),abnorm(1).
[,13] leaves norm(0),abnorm(1).
[,14] leaf.halo absent(0),yellow-halos(1),no-yellow-halos(2).
[,15] leaf.marg w-s-marg(0),no-w-s-marg(1),dna(2).
[,16] leaf.size lt-1/8(0),gt-1/8(1),dna(2).
[,17] leaf.shread absent(0),present(1).
[,18] leaf.malf absent(0),present(1).
[,19] leaf.mild absent(0),upper-surf(1),lower-surf(2).
[,20] stem norm(0),abnorm(1).
[,21] lodging yes(0),no(1).
[,22] stem.cankers absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23] canker.lesion dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24] fruiting.bodies absent(0),present(1).
[,25] ext.decay absent(0),firm-and-dry(1),watery(2).
[,26] mycelium absent(0),present(1).
[,27] int.discolor none(0),brown(1),black(2).
[,28] sclerotia absent(0),present(1).
[,29] fruit.pods norm(0),diseased(1),few-present(2),dna(3).
[,30] fruit.spots absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31] seed norm(0),abnorm(1).
[,32] mold.growth absent(0),present(1).
[,33] seed.discolor absent(0),present(1).
[,34] seed.size norm(0),lt-norm(1).
[,35] shriveling absent(0),present(1).
[,36] roots norm(0),rotted(1),galls-cysts(2).
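The codes above can be attached to the factors as labels so that tables and plots are readable without the help page. This is a sketch for a single variable, using the documented coding of `precip` on a made-up vector; `precip_codes` and `precip_labeled` are illustrative names.

```r
# Made-up values in the same 0/1/2 coding as precip
precip_codes <- factor(c("2", "2", "0", "1", NA),
                       levels = c("0", "1", "2"), ordered = TRUE)

# Map the codes to the labels given in the documentation
precip_labeled <- factor(
  precip_codes,
  levels  = c("0", "1", "2"),
  labels  = c("lt-norm", "norm", "gt-norm"),
  ordered = TRUE
)

table(precip_labeled, useNA = "ifany")
```

Missing values stay `NA` after relabeling, so the missingness pattern is not altered by this step.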
Additionally, there are near-zero-variance predictors that carry very little information and are candidates for removal:
names(Soybean[,nearZeroVar(Soybean)])
## [1] "leaf.mild" "mycelium" "sclerotia"
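`nearZeroVar` flags a predictor when the most common value is far more frequent than the second most common one and the share of distinct values is small; its documented defaults are `freqCut = 95/5` and `uniqueCut = 10`. The sketch below reproduces that idea in base R on made-up vectors. `nzv_metrics` and `is_near_zero_var` are hypothetical helpers and may differ from caret in details such as the handling of missing values.

```r
# Frequency ratio (most common / second most common) and percentage of distinct values
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NAs by default
  counts <- counts[counts > 0]                 # ignore unused factor levels
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  m <- nzv_metrics(x)
  m[["freq_ratio"]] > freq_cut && m[["pct_unique"]] < unique_cut
}

rare     <- c(rep("0", 98), rep("1", 2))  # made up: one level dominates
balanced <- rep(c("0", "1"), 50)          # made up: two equally common levels

nzv_metrics(rare)
is_near_zero_var(rare)
is_near_zero_var(balanced)
```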
# One bar plot per predictor, with NA shown as its own bar
par(mar = c(2, 1, 2, 1), mfrow = c(5, 7))
for (i in colnames(Soybean[-1])) {
  tab <- table(Soybean[[i]], useNA = 'always')
  names(tab)[is.na(names(tab))] <- "NA"
  barplot(tab, main = i)
}
Almost all variables have some missing data, and in some cases the number of NAs equals or exceeds the count of an individual factor level (e.g., sever, fruit.spots).
Based on the above, we could eliminate the degenerate predictors that have near-zero variance. We could also remove the classes that account for a large share of the missing values. Because so many values are missing, imputation might not be the best solution.
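Before choosing between dropping predictors, dropping classes, or imputing, it helps to quantify the missingness by predictor and by class. This is a minimal base R sketch on a made-up data frame (`toy` and its values are illustrative, not the Soybean data); the same steps could be run on `Soybean` with `Class` as the grouping variable.

```r
# Toy data: class "b" is missing hail and sever entirely (values are made up)
toy <- data.frame(
  Class = c("a", "a", "a", "b", "b", "b"),
  hail  = c("0", "1", "0", NA, NA, NA),
  sever = c("1", NA, "2", NA, NA, NA),
  date  = c("3", "4", "5", "6", "2", "1")
)

predictors <- toy[, names(toy) != "Class"]

# Percentage of missing values in each predictor
pct_missing <- 100 * colMeans(is.na(predictors))

# Percentage of rows with at least one missing predictor, by class
incomplete <- !complete.cases(predictors)
pct_incomplete_by_class <- 100 * tapply(incomplete, toy$Class, mean)

print(pct_missing)
print(pct_incomplete_by_class)
```

If the missing values turn out to be concentrated in a few classes, the missingness is informative about the outcome, which is a reason to be cautious about both row deletion and naive imputation.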