suppressMessages(library(ggplot2))
suppressMessages(library(GGally))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(e1071))
suppressMessages(library(caret))
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories (one category has no samples, so the Type factor below has six levels). There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between the predictors.
glass <- as.data.frame(Glass)
summary(glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
glass_long <- glass %>%
pivot_longer(cols = c(1:9), names_to = "predictor", values_to = "value")
ggplot(glass_long, aes(x = value)) +
geom_boxplot(fill = "cornflowerblue", color = "black") +
facet_wrap(~ predictor, scales = "free_x") +
theme_minimal()
ggpairs(glass, columns = 1:9, progress = FALSE)
Calcium and the refractive index are highly positively correlated. RI is also negatively correlated with both silicon and aluminum. Magnesium is negatively correlated with barium, calcium, and aluminum. Aluminum is positively correlated with barium.
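These relationships can be checked numerically; a quick sketch that prints the pairwise correlation matrix of the predictors:
round(cor(glass[, 1:9]), 2)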
Do there appear to be any outliers in the data? Are any predictors skewed?
skewValues <- apply(glass[,1:9], 2, skewness)
skewValues
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
Yes. Based on the skew values and the box plots, potassium, barium, and calcium are heavily right-skewed, and iron and the refractive index are also right-skewed. Sodium and aluminum are roughly symmetric, while silicon and magnesium are left-skewed. Based on the box plots, there also appear to be outliers in every predictor except magnesium.
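To back up the box plots, a small sketch that counts values falling outside the usual 1.5 × IQR whiskers for each predictor:
sapply(glass[, 1:9], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  # Count points beyond the standard box-plot whisker limits
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})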
Are there any relevant transformations of one or more predictors that might improve the classification model?
First, I would apply a log transformation to Ca and RI, since they are strongly right-skewed and all of their values are greater than 1, so the log is well defined. Then I would center and scale the data.
glass$RI <- log(glass$RI)
glass$Ca <- log(glass$Ca)
# Center and scale all nine predictors (including Na and Al)
glass[, 1:9] <- scale(glass[, 1:9])
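As an alternative to these manual steps, caret's preProcess() can estimate a Box-Cox transformation (applied only to strictly positive predictors) and then center and scale in one call; a sketch on the untransformed data:
pp <- preProcess(as.data.frame(Glass)[, 1:9], method = c("BoxCox", "center", "scale"))
glass_trans <- predict(pp, as.data.frame(Glass)[, 1:9])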
Finally, feature extraction through principal component analysis would likely help to account for redundancy among the several predictors that are correlated with each other.
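For example, using the predictors already centered and scaled above:
glass_pca <- prcomp(glass[, 1:9])
summary(glass_pca)  # proportion of variance explained by each component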
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
soybean <- as.data.frame(Soybean)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
summary(soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
nearZeroVar(soybean)
## [1] 19 26 28
For all of the categorical variables, the fraction of unique values over the sample size is low (< 10%), since no variable has anywhere near 68 (10% of 683) distinct levels. Therefore, to see whether any distributions are degenerate as described in the text, we need only examine the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value. By this ratio, three categorical variables stand out as degenerate: leaf.mild (535/20 = 26.75), mycelium (639/6 = 106.5), and sclerotia (625/20 = 31.25). These are exactly the columns (19, 26, and 28) flagged by nearZeroVar() above, and the model would likely be improved by eliminating them.
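The frequency ratios can also be read directly from caret's near-zero-variance diagnostics, which flag the same three variables:
nzv <- nearZeroVar(soybean, saveMetrics = TRUE)
nzv[nzv$nzv, c("freqRatio", "percentUnique")]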
soybean <- soybean %>%
select(-leaf.mild, -mycelium, -sclerotia)
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
percent_na <- sort(sapply(soybean, function(x) mean(is.na(x)) * 100))
print(percent_na)
## Class leaves date area.dam crop.hist
## 0.0000000 0.0000000 0.1464129 0.1464129 2.3426061
## plant.growth stem temp roots plant.stand
## 2.3426061 2.3426061 4.3923865 4.5387994 5.2708638
## precip stem.cankers canker.lesion ext.decay int.discolor
## 5.5636896 5.5636896 5.5636896 5.5636896 5.5636896
## leaf.halo leaf.marg leaf.size leaf.malf fruit.pods
## 12.2986823 12.2986823 12.2986823 12.2986823 12.2986823
## seed mold.growth seed.size leaf.shread fruiting.bodies
## 13.4699854 13.4699854 13.4699854 14.6412884 15.5197657
## fruit.spots seed.discolor shriveling germ hail
## 15.5197657 15.5197657 15.5197657 16.3982430 17.7159590
## sever seed.tmt lodging
## 17.7159590 17.7159590 17.7159590
Yes, some predictors are more likely to be missing than others. Four predictors (hail, sever, seed.tmt, and lodging) have over 17% missing values, while three (area.dam, date, and leaves) have less than 1%; leaves has no missing values at all.
soybean %>%
  group_by(Class) %>%
  summarize(total_na = sum(is.na(pick(everything())))) %>%
  arrange(desc(total_na))
## # A tibble: 19 × 2
## Class total_na
## <fct> <dbl>
## 1 phytophthora-rot 1159
## 2 2-4-d-injury 402
## 3 cyst-nematode 294
## 4 diaporthe-pod-&-stem-blight 162
## 5 herbicide-injury 136
## 6 alternarialeaf-spot 0
## 7 anthracnose 0
## 8 bacterial-blight 0
## 9 bacterial-pustule 0
## 10 brown-spot 0
## 11 brown-stem-rot 0
## 12 charcoal-rot 0
## 13 diaporthe-stem-canker 0
## 14 downy-mildew 0
## 15 frog-eye-leaf-spot 0
## 16 phyllosticta-leaf-spot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
The pattern of missing data is clearly related to the classes: just 5 of the 19 classes account for all of the missing values.
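To see where those missing values sit, a quick sketch listing the predictors with any NAs for the class with the most missing data:
soybean %>%
  filter(Class == "phytophthora-rot") %>%
  summarize(across(everything(), ~ sum(is.na(.x)))) %>%
  select(where(~ .x > 0))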
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Based on the context of the data, I would handle missing values with imputation. First, I would consider the levels of each predictor. For predictors where a 0 indicates that the feature was absent, normal, or did not apply, I would replace missing values with 0: researchers may not have checked for those features given the plant's disease classification, suggesting the feature is simply not characteristic of that disease. For the remaining missing values, I would use k-nearest-neighbors imputation to fill them in based on similar samples.
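A minimal sketch of the k-nearest-neighbors step, assuming the VIM package (not loaded above), whose kNN() handles factor predictors directly; the zero-replacement step would be applied first per the reasoning above:
library(VIM)
# imp_var = FALSE suppresses the extra TRUE/FALSE indicator columns kNN adds
soybean_imputed <- kNN(soybean, k = 5, imp_var = FALSE)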