The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The Glass data set was loaded, and some basic information, such as the dimensions, structure, and missing values, was returned. The data are complete; there are no missing values.
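The code throughout this report relies on several packages. The exact list is an assumption inferred from the functions used below, but a likely set is loaded first:
# packages assumed from the functions used below (list inferred, not stated in the original)
library(mlbench)       # Glass and Soybean data sets
library(dplyr)         # data manipulation
library(tidyr)         # pivot_longer()
library(ggplot2)       # plotting
library(DataExplorer)  # introduce(), plot_qq(), plot_scatterplot(), plot_bar(), plot_missing()
library(corrplot)      # correlation matrix plot
library(e1071)         # skewness()
library(extremevalues) # getOutliers()
library(caret)         # preProcess(), nearZeroVar()
library(DMwR)          # knnImputation()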
# load data
data(Glass)
# get dimensions
dim(Glass)
## [1] 214 10
# check out the structure
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
# check for missing values
introduce(Glass)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 214 10 1 9 0
## total_missing_values complete_rows total_observations memory_usage
## 1 0 214 2140 19984
# dataframe of predictors only
predictors <- Glass %>%
dplyr::select(1:9)
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Visualizations used to understand the distributions of the variables included density plots, boxplots, and Q-Q plots.
# density of the predictors
Glass %>%
pivot_longer(1:9) %>%
ggplot(aes(x=value)) +
geom_density(fill="blue", alpha=0.4)+
facet_wrap(~name, scales="free") +
theme_minimal()
# boxplots
Glass %>%
pivot_longer(1:9) %>%
ggplot(aes(x=value)) +
geom_boxplot(fill="blue", alpha=0.4)+
facet_wrap(~name, scales="free") +
theme_minimal()
# QQ
plot_qq(predictors)
# distributions of predictors vs target
plot_scatterplot(Glass, by = "Type", sampled_rows = 1000L)
Correlation between all predictors can be visualized with a correlation matrix. High correlations appear in dark blue and dark red, indicating positive and negative correlation, respectively. The correlation matrix, however, only measures linear correlation and is not useful for relationships that are not linear; a scatterplot matrix can be used to look for those.
# correlation matrix of predictors
corrplot(cor(predictors), order = "hclust")
# scatterplot matrix of predictors
pairs(predictors)
Do there appear to be any outliers in the data? Are any predictors skewed?
The visualizations above give a good sense of the data. In particular, a number of the distributions are skewed: K, Ba, Ca, and Fe. Even the more symmetric predictors show some degree of skew, and this observation can guide decisions about transforming the data before modeling. Skewness was calculated for each variable to confirm the visual impression.
A number of the variables also appear to have extreme values, Ba, Na, and K in particular. The getOutliers() function provides a way of indexing potential outliers in a variable. Depending on the classification method used (i.e., the model's sensitivity to extreme values), the outliers can be removed, kept, or otherwise handled; a removal sketch follows the output below.
# quantify the skewness of each predictor
apply(predictors, 2, skewness) %>% sort(decreasing = T)
## K Ba Ca Fe RI Al Na
## 6.4600889 3.3686800 2.0184463 1.7298107 1.6027151 0.8946104 0.4478343
## Si Mg
## -0.7202392 -1.1364523
# index outliers
getOutliers(predictors$K)$iRight
## [1] 164 172 173 186 187 202 208
getOutliers(predictors$Ba)$iRight
## [1] 62 107 129 164 186 187 190 191 192 193 194 195 196 197 198 199 200 201 203
## [20] 204 205 206 207 208 209 210 211 212 213 214
getOutliers(predictors$Na)$iRight
## [1] 185 190
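The analysis below keeps all 214 observations; purely as an illustration of the removal option, the rows flagged above could be dropped before modeling:
# example only (not applied below): drop the union of rows flagged in K, Ba, and Na
outlier_rows <- unique(c(getOutliers(predictors$K)$iRight,
                         getOutliers(predictors$Ba)$iRight,
                         getOutliers(predictors$Na)$iRight))
predictors_trimmed <- predictors[-outlier_rows, ]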
Are there any relevant transformations of one or more predictors that might improve the classification model?
As determined previously, most of the variables contain at least some skew, so Box-Cox transformations would be beneficial. A Box-Cox transformation was applied to RI, Ca, Na, Al, and Si and reduced their skewness overall. (Box-Cox requires strictly positive values, so the predictors containing zeros, Mg, K, Ba, and Fe, are left untransformed by preProcess().) The elemental predictors were originally on the same scale (percent abundance), but the Box-Cox-transformed data are not; for that reason, centering and scaling were also applied.
# transformation to scale and center
trans1 <- preProcess(predictors, method=c("BoxCox","center","scale"))
transform1 <- predict(trans1, predictors)
head(transform1)
## RI Na Mg Al Si K
## 1 0.8756898 0.3133883 1.2517037 -0.65520274 -1.12729016 -0.67013422
## 2 -0.2471367 0.6129977 0.6346799 -0.08726137 0.09719851 -0.02615193
## 3 -0.7216425 0.1798164 0.6000157 0.27454124 0.43512776 -0.16414813
## 4 -0.2305698 -0.2150217 0.6970756 -0.23439154 -0.05836211 0.11184428
## 5 -0.3101056 -0.1402661 0.6485456 -0.34194384 0.55238422 0.08117845
## 6 -0.7947626 -0.7480171 0.6416128 0.42852325 0.40909039 0.21917466
## Ca Ba Fe
## 1 -0.01194308 -0.3520514 -0.5850791
## 2 -0.87918677 -0.3520514 -0.5850791
## 3 -0.93250264 -0.3520514 -0.5850791
## 4 -0.48665658 -0.3520514 -0.5850791
## 5 -0.63291627 -0.3520514 -0.5850791
## 6 -0.63291627 -0.3520514 2.0832652
transform1 %>%
pivot_longer(1:9) %>%
ggplot(aes(x=value)) +
geom_density(fill="blue", alpha=0.4)+
facet_wrap(~name, scales="free") +
theme_minimal()
apply(transform1, 2, skewness) %>% sort(decreasing = T)
## K Ba Fe RI Al Na
## 6.46008890 3.36867997 1.72981071 1.56566039 0.09105899 0.03384644
## Ca Si Mg
## -0.19395573 -0.65090568 -1.13645228
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
First, some basic information regarding the dataset…
# load data
data("Soybean")
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# overview of missing data
introduce(Soybean)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 683 36 36 0 0
## total_missing_values complete_rows total_observations memory_usage
## 1 2337 562 24588 128600
# dataframe of just the predictors
predictors <- Soybean %>% dplyr::select(-Class)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
There are three variables that can be categorized as near-zero-variance variables: leaf.mild, mycelium, and sclerotia. These variables will offer no value to the classification model and will be removed from the data set in a later section.
# plot distributions
plot_bar(predictors)
# find and remove degenerates
degenerates <- nearZeroVar(predictors)
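To confirm which columns were flagged (the printed output is not reproduced here, but it should match the three variables named above):
# names of the near-zero-variance predictors
colnames(predictors)[degenerates]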
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The missing data are related to only 5 of the 19 classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot. Almost every observation in these classes has missing data. This could reflect difficulties collecting data for these classes, but whatever the reason, the missingness is informative: the pattern of missing data is related to the outcome. The missing values also come in discrete bins (i.e., several predictors are missing 121 times, 106 times, and so on), which further supports the claim that they are not missing at random.
# plot of missing values by attribute
plot_missing(Soybean)
# get counts of missing values by predictor
counts <- apply(is.na(predictors), 2, sum) %>% sort(decreasing=T)
counts
## hail sever seed.tmt lodging germ
## 121 121 121 121 112
## leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
## 108 106 106 106 106
## leaf.shread seed mold.growth seed.size leaf.halo
## 100 92 92 92 84
## leaf.marg leaf.size leaf.malf fruit.pods precip
## 84 84 84 84 38
## stem.cankers canker.lesion ext.decay mycelium int.discolor
## 38 38 38 38 38
## sclerotia plant.stand roots temp crop.hist
## 38 36 31 30 16
## plant.growth stem date area.dam leaves
## 16 16 1 1 0
# table of Soybean Class counts
counts <- table(Soybean$Class) %>%
data.frame()
# classes with missing observations
Soybean[!complete.cases(Soybean),] %>%
group_by(Var1=Class) %>%
summarize(missing_data=n()) %>%
left_join(counts, by="Var1")
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
## Var1 missing_data Freq
## <fct> <int> <int>
## 1 2-4-d-injury 16 16
## 2 cyst-nematode 14 14
## 3 diaporthe-pod-&-stem-blight 15 15
## 4 herbicide-injury 8 8
## 5 phytophthora-rot 68 88
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
First, the degenerate variables determined in part (a) can be removed since they offer no predictive value. That drops the number of predictors down to 32.
There are a number of ways to deal with the remaining missing values:
One option would be to filter the data for complete cases only. This is an expensive option: as noted above, roughly 18% of the observations (the 121 incomplete rows) would be eliminated, and since there are only 683 observations to begin with, the predictive performance of the model could suffer. It would also effectively reduce the number of distinct classes of the target from 19 to 15, since four of the five classes with missing data have no complete cases. The importance of those classes would have to be determined by the modeler.
Another option is to create a new variable that indicates whether or not an observation has a missing value. As stated previously, the missingness appears to be informative, and the presence of a missing value would suggest one of the five classes. (Sketches of this option and the complete-cases option appear after these paragraphs.)
Finally, imputation can be used to estimate the missing values. This approach adds a layer of uncertainty to the model and would probably not be my first choice, but the values are filled in this way below anyway. After imputation the number of missing values drops to zero and no observations are lost.
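For reference, minimal sketches of the first two options follow; they are not applied in the remainder of this report, and the has_missing column name is illustrative only.
# option 1 (sketch, not applied): keep complete cases only
Soybean_complete <- Soybean[complete.cases(Soybean), ]
# option 2 (sketch, not applied): flag observations containing any missing value
# (has_missing is an illustrative name, not part of the original analysis)
Soybean_flagged <- Soybean
Soybean_flagged$has_missing <- factor(ifelse(complete.cases(Soybean), "no", "yes"))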
# first eliminate degenerates; the indices from nearZeroVar() refer to the
# predictors data frame (which excludes Class), so drop the flagged columns by name
Soybean <- Soybean[, !(colnames(Soybean) %in% colnames(predictors)[degenerates])]
# impute the remaining missing values using k-nearest neighbours
Soybean_imputed <- knnImputation(Soybean, k=3)
introduce(Soybean_imputed)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 683 33 33 0 0
## total_missing_values complete_rows total_observations memory_usage
## 1 0 683 22539 118112