library(mlbench)
## Warning: package 'mlbench' was built under R version 4.1.2
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
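library(dplyr)
library(PerformanceAnalytics)
library(corrplot)
# Predictors only (drop the class label)
data <- Glass %>% select(-Type)
chart.Correlation(data, bg=c("blue","red","yellow"), pch=21)
glass_cor <- cor(Glass[1:9])
corrplot(glass_cor, method="number")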
The chart shows a strong correlation between RI and Ca, which means the two predictors carry largely the same information. It would be reasonable to remove one of them from the model to avoid redundancy. The other variables show moderate to weak correlations.
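To make the visual reading of the correlation plot explicit, the strongly correlated pairs can also be listed numerically. The following is a minimal base R sketch; the cutoff of 0.75 and the names `cutoff`, `idx`, and `high_pairs` are our own illustrative choices, not values taken from the textbook.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the nine numeric predictors (Type is excluded).
glass_cor <- cor(Glass[, 1:9])

# Keep each pair only once by blanking the diagonal and the lower triangle.
glass_cor[lower.tri(glass_cor, diag = TRUE)] <- NA

# Assumed, illustrative cutoff for "strong" correlation.
cutoff <- 0.75
idx <- which(abs(glass_cor) > cutoff, arr.ind = TRUE)

# One row per predictor pair whose absolute correlation exceeds the cutoff.
high_pairs <- data.frame(
  var1 = rownames(glass_cor)[idx[, "row"]],
  var2 = colnames(glass_cor)[idx[, "col"]],
  r    = glass_cor[idx]
)
print(high_pairs)
```

Any pair listed here is a candidate for dropping one of its members. Which member to drop is a modeling decision that this sketch does not make.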
par(mfrow=c(2, 2))
colnames <- dimnames(Glass)[[2]]
for (i in 1:9) {
d <- density(Glass[,i])
plot(d, type="n", main=colnames[i])
polygon(d, col="blue", border="gray")
}
The plots show that some of the variables have distributions close to normal. Na, Al, Si, and Ba (apart from a right skew) can be considered approximately normal.
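library(reshape2)
library(ggplot2)
library(gridExtra)
# Reshape to long format: one row per (Type, variable, value)
df <- melt(Glass)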
## Using Type as id variables
plot1 <- ggplot(data = df, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable), outlier.colour = "red",
               outlier.shape = 8, outlier.size = 4) +
  scale_y_continuous(name = "Predictors", breaks = seq(0, 4, 0.5),
                     limits = c(0, 4)) +
  coord_flip()
plot2 <- ggplot(data = df, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable), outlier.colour = "red",
               outlier.shape = 8, outlier.size = 4) +
  scale_y_continuous(name = "Predictors", breaks = seq(5, 15, 2),
                     limits = c(5, 15)) +
  coord_flip()
plot3 <- ggplot(data = df, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable), outlier.colour = "red",
               outlier.shape = 8, outlier.size = 4) +
  scale_y_continuous(name = "Predictors", breaks = seq(69, 76, 1),
                     limits = c(69, 76)) +
  coord_flip()
grid.arrange(plot1, plot2, plot3, nrow = 2)
## Warning: Removed 645 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1501 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1712 rows containing non-finite values (stat_boxplot).
There are outliers in all of the predictor variables, and the plots show that some of the predictors are skewed: Mg is left-skewed, while K, Ba, and Fe are right-skewed. Ca also seems to be slightly right-skewed.
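The direction of the skew can be checked numerically as well. Below is a small sketch that computes the sample skewness (third central moment divided by the variance raised to the power 3/2) in base R. The helper `skewness` is defined here for illustration and is not a function used elsewhere in this report.

```r
library(mlbench)
data(Glass)

# Sample skewness: third central moment over variance^(3/2).
# Negative values indicate left skew, positive values right skew.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Skewness of each of the nine predictors, sorted from most left-skewed
# to most right-skewed.
skew <- sapply(Glass[, 1:9], skewness)
print(round(sort(skew), 2))
```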
We could apply a Box-Cox transformation to the dataset to reduce skewness, for example using the forecast package. The Box-Cox method uses maximum likelihood estimation to determine the transformation parameter \(\lambda\), which could help improve the classification model. As mentioned in the textbook, we could also use the caret function preProcess, which can transform, center, scale, or impute values, as well as apply the spatial sign transformation and feature extraction. The function estimates the quantities required for the transformation.
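As a rough illustration of the preProcess route, the sketch below estimates a Box-Cox transformation together with centering and scaling on the nine Glass predictors, then applies it with predict. The object names `predictors`, `pp`, and `glass_trans` are our own. Box-Cox is only defined for strictly positive values, so to our understanding caret leaves predictors that contain zeros (such as Ba or Fe) without a Box-Cox step; this was not verified in this report.

```r
library(mlbench)
library(caret)
data(Glass)

# Numeric predictors only; the class label Type is left out.
predictors <- Glass[, 1:9]

# Estimate the transformation parameters (lambda, means, standard deviations).
pp <- preProcess(predictors, method = c("BoxCox", "center", "scale"))

# Apply the estimated transformations to the data.
glass_trans <- predict(pp, predictors)

print(pp)
summary(glass_trans$Ca)
```

Estimating the parameters in one step and applying them in another means the same transformation can later be applied to new samples with predict.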
The data can be loaded via:
library(mlbench)
data(Soybean)
#?Soybean
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
for (i in 2:ncol(Soybean)){
temp <-as.data.frame(table(Soybean[[i]]))
temp$col = colnames(Soybean[i])
assign(paste0("freq_",colnames(Soybean[i])),temp)
rm(temp)
}
all <- do.call("rbind",mget(ls(pattern = "^freq_.*")))
head(all,4)
## Var1 Freq col
## freq_area.dam.1 0 123 area.dam
## freq_area.dam.2 1 227 area.dam
## freq_area.dam.3 2 145 area.dam
## freq_area.dam.4 3 187 area.dam
freq_date
## Var1 Freq col
## 1 0 26 date
## 2 1 75 date
## 3 2 93 date
## 4 3 118 date
## 5 4 131 date
## 6 5 149 date
## 7 6 90 date
freq_temp
## Var1 Freq col
## 1 0 80 temp
## 2 1 374 temp
## 3 2 199 temp
freq_hail
## Var1 Freq col
## 1 0 435 hail
## 2 1 127 hail
freq_leaves
## Var1 Freq col
## 1 0 77 leaves
## 2 1 606 leaves
par(mfrow = c(3,3))
for(i in 2:ncol(Soybean)) {
plot(Soybean[i], main = colnames(Soybean[i]))
}
library(caret)
# Check for zero- or near-zero-variance predictors
var <- nearZeroVar(Soybean)
par(mfrow = c(2,2))
for(i in var) {
plot(Soybean[i], main = colnames(Soybean[i]))
}
colSums(is.na(Soybean))
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
# Total number of missing cells per class; keep only classes with at least one
sb <- Soybean %>% mutate(nul = rowSums(is.na(Soybean))) %>%
  group_by(Class) %>% summarize(miss = sum(nul)) %>% filter(miss != 0)
library(VIM)
# Label the plot with the Soybean column names (not those of the Glass data)
aggr_plot <- aggr(Soybean, col=c('black','red'), numbers=TRUE, sortVars=TRUE, labels=names(Soybean), cex.axis=.7, gap=3, ylab=c("Missing data","Model"))
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
sb
## # A tibble: 5 x 2
## Class miss
## <fct> <dbl>
## 1 2-4-d-injury 450
## 2 cyst-nematode 336
## 3 diaporthe-pod-&-stem-blight 177
## 4 herbicide-injury 160
## 5 phytophthora-rot 1214
The missing values were summarized by class. Out of the 19 classes, only 5 have missing values, with phytophthora-rot having the most (1214 missing cells). The missing data therefore appear to follow a pattern related to the class.
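A complementary view is to count, per class, how many samples have at least one missing value and what share of the class that represents. The sketch below does this in base R; the names `incomplete` and `by_class` are introduced here for illustration only.

```r
library(mlbench)
data(Soybean)

# Flag rows that contain at least one missing value.
incomplete <- !complete.cases(Soybean)

# Incomplete and total row counts per class, in factor-level order.
by_class <- data.frame(
  incomplete = as.vector(tapply(incomplete, Soybean$Class, sum)),
  total      = as.vector(table(Soybean$Class)),
  row.names  = levels(Soybean$Class)
)
by_class$share <- by_class$incomplete / by_class$total

# Show only the classes that are affected by missing data.
print(by_class[by_class$incomplete > 0, ])
```

If the share is close to 1 for the affected classes, removing incomplete rows would remove most of those classes from the data, which is one argument for imputing instead.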
sb <- knnImputation(Soybean[,-1])
colSums(is.na(sb))
## date plant.stand precip temp hail
## 0 0 0 0 0
## crop.hist area.dam sever seed.tmt germ
## 0 0 0 0 0
## plant.growth leaves leaf.halo leaf.marg leaf.size
## 0 0 0 0 0
## leaf.shread leaf.malf leaf.mild stem lodging
## 0 0 0 0 0
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium
## 0 0 0 0 0
## int.discolor sclerotia fruit.pods fruit.spots seed
## 0 0 0 0 0
## mold.growth seed.discolor seed.size shriveling roots
## 0 0 0 0 0
Due to the missing values, we were not able to compute reliable correlations between the variables. Had two predictors been strongly correlated, we would have removed the one with the higher percentage of missing values. As a general rule, predictors with more than 5% missing values should be dropped, because such a predictor may not provide reliable information to the model. Here we used k-nearest neighbors to impute the missing values in our dataset.
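The 5% rule of thumb can be checked directly against the data. The sketch below computes the share of missing values per predictor and lists those above the threshold; `miss_share`, `threshold`, and `over` are illustrative names, and the threshold is simply the 5% figure mentioned above.

```r
library(mlbench)
data(Soybean)

# Share of missing values per predictor (the Class column is excluded).
miss_share <- colMeans(is.na(Soybean[, -1]))

# Threshold taken from the 5% rule of thumb discussed above.
threshold <- 0.05

# Predictors exceeding the threshold, sorted from most to least missing.
over <- sort(miss_share[miss_share > threshold], decreasing = TRUE)
print(round(over, 3))
```

Many predictors exceed the threshold, so dropping all of them would discard a large part of the dataset. This supports the choice of imputation here.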