The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
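library(corrplot)  # corrplot.mixed()
# Drop the class label so only the nine predictors remain
Glass_sub <- subset(Glass, select = -Type)
predictors <- colnames(Glass_sub)
# One histogram per predictor on a 3 x 3 grid
par(mfrow = c(3,3))
for(i in 1:9) {
  hist(Glass_sub[,i], main = predictors[i])
}
# Pairwise correlations between the predictors
corrplot.mixed(cor(Glass_sub[,1:9]), lower.col = "black", number.cex = .7)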
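The correlation plot can be complemented by a numeric listing of the strongest pairs. The following is a minimal base R sketch, not part of the original analysis: `high_cor_pairs()` and its `cutoff` argument are hypothetical names, the 0.75 default is an arbitrary illustration, and the `toy` data frame is made up purely to exercise the function.

```r
# List predictor pairs whose absolute Pearson correlation reaches a cutoff.
# high_cor_pairs() is a hypothetical helper, not a caret or corrplot function.
high_cor_pairs <- function(df, cutoff = 0.75) {
  cm <- cor(df)
  # Keep the upper triangle only so each pair is reported once
  idx <- which(abs(cm) >= cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r = cm[idx],
             stringsAsFactors = FALSE)
}

# Made-up toy data: y is a linear function of x, z is unrelated noise
toy <- data.frame(x = 1:10,
                  y = 2 * (1:10) + 1,
                  z = c(5, 3, 8, 1, 9, 2, 7, 4, 10, 6))
toy_pairs <- high_cor_pairs(toy, cutoff = 0.99)
print(toy_pairs)
```

With the Glass data loaded, `high_cor_pairs(Glass_sub)` would return the pairs that stand out in the plot above.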
Do there appear to be any outliers in the data? Are any predictors skewed?
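library(tidyr)    # pivot_longer()
library(dplyr)    # %>%, mutate(), filter()
library(ggplot2)  # ggplot(), geom_jitter()
# Reshape to long format: one row per (sample, predictor) measurement
Glass_gather <- Glass %>%
  pivot_longer(-Type, names_to = "predictors", values_to = "measurement", values_drop_na = TRUE) %>%
  mutate(predictors = as.factor(predictors))
# Boxplots mark points beyond 1.5 * IQR from the box as potential outliers
par(mfrow = c(3,3))
for(i in 1:9) {
  boxplot(Glass_sub[,i], main = predictors[i], horizontal = TRUE)
}
# Jittered measurements by glass type for four of the predictors
Glass_gather %>%
  filter(predictors %in% c('Na', 'K', 'Ba', 'Ca')) %>%
  ggplot(aes(x = Type, y = measurement, color = predictors)) +
  geom_jitter() +
  scale_color_brewer(palette = "Set1") +
  theme_light()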
par(mfrow = c(3,3))
for(i in 1:9) {
  hist(Glass_sub[,i], main = predictors[i])
}
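Skewness can also be quantified rather than read off the histograms. The sketch below computes the sample skewness statistic in base R, where values near zero suggest a roughly symmetric distribution and large positive values indicate right skew. The function name `sample_skewness()` is a hypothetical helper introduced here, and the Glass step only runs if the `mlbench` package is available.

```r
# Sample skewness: sum((x - mean)^3) / ((n - 1) * v^(3/2)),
# with v = sum((x - mean)^2) / (n - 1).
# sample_skewness() is a hypothetical helper name.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  v <- sum((x - m)^2) / (n - 1)
  sum((x - m)^3) / ((n - 1) * v^1.5)
}

# Sanity checks on small vectors
skew_symmetric <- sample_skewness(1:5)              # symmetric data
skew_right     <- sample_skewness(c(1, 1, 1, 1, 10)) # long right tail

# Optional: apply to the nine Glass predictors when mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Glass", package = "mlbench")
  print(round(sapply(Glass[, 1:9], sample_skewness), 2))
}
```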
Are there any relevant transformations of one or more predictors that might improve the classification model?
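library(caret)  # preProcess()
# Box-Cox is only defined for strictly positive values, so predictors that
# contain zeros (such as Ba and Fe) are expected to be centered and scaled only.
# The Type factor is not numeric and is passed through unchanged.
glass_process <- preProcess(Glass, method = c('BoxCox', 'center', 'scale'))
glass_update <- predict(glass_process, Glass)
# Histograms of the transformed predictors
par(mfrow = c(3,3))
for(i in 1:9) {
  hist(glass_update[,i], main = predictors[i])
}
# Boxplots of the transformed predictors
par(mfrow = c(3,3))
for(i in 1:9) {
  boxplot(glass_update[,i], main = predictors[i], horizontal = TRUE)
}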
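To make the positivity requirement of the Box-Cox transformation concrete, here is a minimal base R sketch of the transformation for a fixed `lambda`. It is an illustration only, not how `caret` estimates `lambda`; the names `box_cox()` and `box_cox_ok()` are hypothetical helpers introduced here.

```r
# Box-Cox transform for a given lambda:
#   (x^lambda - 1) / lambda  when lambda != 0
#   log(x)                   when lambda == 0
# box_cox() and box_cox_ok() are hypothetical helper names.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # undefined for zero or negative values
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# TRUE when every non-missing value is strictly positive
box_cox_ok <- function(x) all(x > 0, na.rm = TRUE)

bc_log    <- box_cox(c(1, 2, 4), lambda = 0)  # same as log()
bc_linear <- box_cox(c(1, 2, 4), lambda = 1)  # a simple shift: x - 1
ok_with_zero <- box_cox_ok(c(0, 0.26, 0.11))  # a zero makes the check fail
```

With the Glass data loaded, `sapply(Glass_sub, box_cox_ok)` shows which predictors qualify. For those that do not, a transformation that tolerates zeros, such as the `'YeoJohnson'` method of `preProcess`, may be worth comparing.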
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
data("Soybean")  # ships with mlbench, loaded above
nearZeroVar(Soybean)
## [1] 19 26 28
names(Soybean[,nearZeroVar(Soybean)])
## [1] "leaf.mild" "mycelium" "sclerotia"
summary(Soybean[19])
## leaf.mild
## 0 :535
## 1 : 20
## 2 : 20
## NA's:108
summary(Soybean[26])
## mycelium
## 0 :639
## 1 : 6
## NA's: 38
summary(Soybean[28])
## sclerotia
## 0 :625
## 1 : 20
## NA's: 38
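The two quantities behind a near-zero-variance flag can be computed by hand to see why these three predictors were selected. The sketch below is a base R approximation of that logic, assuming the `caret` defaults of a frequency ratio above 95/5 and fewer than 10 percent unique values; `nzv_metrics()` is a hypothetical helper, and the `mycelium` vector is rebuilt from the counts printed above rather than read from the data.

```r
# Frequency ratio (most common / second most common value) and
# percentage of unique values, ignoring NA when counting values.
# nzv_metrics() is a hypothetical helper, not the caret implementation.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  counts <- counts[counts > 0]                 # ignore unused factor levels
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Rebuild mycelium from the summary above: 639 zeros, 6 ones, 38 missing
mycelium <- rep(c(0, 1, NA), times = c(639, 6, 38))
mycelium_metrics <- nzv_metrics(mycelium)
print(mycelium_metrics)
```

A frequency ratio of 639 / 6 = 106.5 is far above the 95/5 = 19 cutoff, and two distinct values out of 683 rows is well under 10 percent, so both conditions are met.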
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
library(naniar)  # gg_miss_fct()
# Heatmap of the percentage missing in each predictor, split by Class
gg_miss_fct(Soybean, fct = Class)
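The same question can be answered numerically by computing the share of missing cells per class. This is a base R sketch under the assumption that one column holds the class label and every other column is a predictor; `missing_by_group()` is a hypothetical helper, and the `toy` data frame is made up to show the calculation.

```r
# Fraction of predictor cells that are NA within each group.
# missing_by_group() is a hypothetical helper name.
missing_by_group <- function(df, group_col) {
  preds <- df[, setdiff(names(df), group_col), drop = FALSE]
  # Missing cells per row, summed within each group
  n_missing <- tapply(rowSums(is.na(preds)), df[[group_col]], sum)
  # Total predictor cells within each group
  n_cells <- tapply(rep(ncol(preds), nrow(preds)), df[[group_col]], sum)
  n_missing / n_cells
}

# Made-up toy data: group "b" is missing far more often than group "a"
toy <- data.frame(Class = c("a", "a", "b", "b"),
                  x = c(1, NA, NA, NA),
                  y = c(1, 2, 3, NA))
toy_missing <- missing_by_group(toy, "Class")
print(toy_missing)

# Optional: apply to Soybean when mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Soybean", package = "mlbench")
  print(round(sort(missing_by_group(Soybean, "Class"), decreasing = TRUE), 2))
  # Share of missing values per predictor, largest first
  print(round(sort(colMeans(is.na(Soybean)), decreasing = TRUE), 2))
}
```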
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
library(data.table)  # fwrite(), fread()
# Copy the data frame
Soybean_update <- Soybean
# Round-trip through a file so every column is read back as character,
# which preserves the original level labels (0, 1, 2, ...)
fwrite(Soybean, "soybean.temp")
Soybean_update <- fread("soybean.temp", colClasses = "character")
# Convert every column except Class to numeric
Soybean_update <- Soybean_update %>%
  mutate_at(vars(-Class), as.numeric)
# Replace each NA with the column mean. The predictors are categorical codes,
# so imputed cells take non-integer values (for example 0.226 for hail).
Soybean_update <- Soybean_update %>%
  mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))
summary(Soybean_update)
## Class date plant.stand precip
## Length:683 Min. :0.000 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1.000
## Mode :character Median :4.000 Median :0.0000 Median :2.000
## Mean :3.554 Mean :0.4529 Mean :1.597
## 3rd Qu.:5.000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :6.000 Max. :1.0000 Max. :2.000
## temp hail crop.hist area.dam
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :0.000 Median :2.000 Median :1.000
## Mean :1.182 Mean :0.226 Mean :1.885 Mean :1.581
## 3rd Qu.:2.000 3rd Qu.:0.226 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :2.000 Max. :1.000 Max. :3.000 Max. :3.000
## sever seed.tmt germ plant.growth
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.0000
## Median :1.0000 Median :0.5196 Median :1.000 Median :0.0000
## Mean :0.7331 Mean :0.5196 Mean :1.049 Mean :0.3388
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :2.0000 Max. :2.0000 Max. :2.000 Max. :1.0000
## leaves leaf.halo leaf.marg leaf.size
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.000
## Median :1.0000 Median :2.000 Median :0.000 Median :1.000
## Mean :0.8873 Mean :1.202 Mean :0.773 Mean :1.284
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :1.0000 Max. :2.000 Max. :2.000 Max. :2.000
## leaf.shread leaf.malf leaf.mild stem
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :1.0000
## Mean :0.1647 Mean :0.07513 Mean :0.1043 Mean :0.5562
## 3rd Qu.:0.1647 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :2.0000 Max. :1.0000
## lodging stem.cankers canker.lesion fruiting.bodies
## Min. :0.00000 Min. :0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.00 Median :0.9798 Median :0.0000
## Mean :0.07473 Mean :1.06 Mean :0.9798 Mean :0.1802
## 3rd Qu.:0.00000 3rd Qu.:3.00 3rd Qu.:2.0000 3rd Qu.:0.1802
## Max. :1.00000 Max. :3.00 Max. :3.0000 Max. :1.0000
## ext.decay mycelium int.discolor sclerotia
## Min. :0.0000 Min. :0.000000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.000000 Median :0.0000 Median :0.00000
## Mean :0.2496 Mean :0.009302 Mean :0.1302 Mean :0.03101
## 3rd Qu.:0.2496 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :2.0000 Max. :1.000000 Max. :2.0000 Max. :1.00000
## fruit.pods fruit.spots seed mold.growth
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :0.0000
## Mean :0.5042 Mean :1.021 Mean :0.1946 Mean :0.1134
## 3rd Qu.:1.0000 3rd Qu.:1.021 3rd Qu.:0.1946 3rd Qu.:0.0000
## Max. :3.0000 Max. :4.000 Max. :1.0000 Max. :1.0000
## seed.discolor seed.size shriveling roots
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.1109 Mean :0.09983 Mean :0.06586 Mean :0.1779
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :2.0000
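Because the predictors are categorical, mean imputation produces values that match no real category, which is visible above where the third quartile of hail is 0.226. One alternative, swapped in here for comparison and not the method used above, is to fill each missing value with the most frequent level. The sketch below is a minimal base R version; `impute_mode()` is a hypothetical helper, ties go to the first level in table order, and the toy factor is made up.

```r
# Replace NA in a factor or character vector with its most frequent value.
# impute_mode() is a hypothetical helper name; ties resolve to the first
# level in table() order.
impute_mode <- function(x) {
  stopifnot(is.factor(x) || is.character(x))
  counts <- table(x)                   # NA is excluded by default
  if (length(counts) == 0 || all(counts == 0)) return(x)  # nothing to impute from
  mode_value <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_value
  x
}

# Made-up toy factor: "0" is the most common level
toy_factor <- factor(c("0", "1", "0", NA))
toy_imputed <- impute_mode(toy_factor)
print(toy_imputed)

# Optional: impute every predictor of Soybean when mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Soybean", package = "mlbench")
  Soybean_mode <- Soybean
  Soybean_mode[-1] <- lapply(Soybean[-1], impute_mode)  # column 1 is Class
  print(sum(is.na(Soybean_mode)))
}
```

Mode imputation keeps every value a valid category, but it also inflates the dominant level, so predictors with a large share of missing values may be better candidates for removal.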