The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# Load the packages used for reshaping and plotting (tidyr also provides %>%).
library(tidyr)
library(ggplot2)
library(corrplot)
# Make a copy of the Glass dataset without the categorical outcome, "Type".
glassCopy <- subset(Glass, select = -Type)
# Plot the distribution of each predictor variable.
# The colors are set as constants on the geom rather than inside aes().
glassCopy %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = 'free') +
  geom_histogram(bins = 16, color = 'red', fill = 'brown') +
  theme_light() +
  ggtitle('Distribution of Predictor Variables')
# Plot the correlation matrix of the predictor variables.
corrplot(cor(glassCopy))
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
Looking at the "Distribution of Predictor Variables" plot above, some of the variables are close to normally distributed (Al, Ca, Na, RI, and Si), whilst the remaining variables are skewed (Ba, Fe, K, and Mg). Ba, Fe, and K are skewed to the right. K has outliers near 3 and 6, and there are many outlying values in Al, Ba, Ca, Mg, Fe, and RI.
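The visual impression of skewness can be checked numerically. The sketch below assumes the Glass data from mlbench and uses a hypothetical helper, sampleSkewness(), that computes a moment-based sample skewness in base R. This is one of several common definitions, so values from a package function may differ slightly.

```r
library(mlbench)
data(Glass)

# Hypothetical helper: moment-based sample skewness, i.e. the sum of cubed
# deviations divided by (n - 1) times the cubed standard deviation.
sampleSkewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum((x - mean(x))^3) / ((n - 1) * sd(x)^3)
}

# Skewness of each numeric predictor, sorted from most negative to most positive.
glassPredictors <- subset(Glass, select = -Type)
glassSkewness <- sort(sapply(glassPredictors, sampleSkewness))
print(round(glassSkewness, 2))
```

Values near zero suggest a roughly symmetric distribution, while large positive values indicate a long right tail.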
The correlation matrix tells us that most of the variables are not strongly related. The exceptions are the pairs Si and RI, Ca and RI, and Ba and Mg.
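To back up the reading of the correlation plot with numbers, the pairs can be listed directly. The following is a minimal sketch in base R; the cutoff of 0.4 is an arbitrary assumption chosen for illustration, and the names corPairs and strongPairs are hypothetical.

```r
library(mlbench)
data(Glass)

glassPredictors <- subset(Glass, select = -Type)
corMatrix <- cor(glassPredictors)

# Keep each pair once by looking only at the upper triangle.
upper <- which(upper.tri(corMatrix), arr.ind = TRUE)
corPairs <- data.frame(
  var1 = rownames(corMatrix)[upper[, "row"]],
  var2 = colnames(corMatrix)[upper[, "col"]],
  correlation = corMatrix[upper.tri(corMatrix)]
)

# Assumed cutoff of 0.4 in absolute value; adjust as needed.
strongPairs <- subset(corPairs, abs(correlation) > 0.4)
strongPairs <- strongPairs[order(-abs(strongPairs$correlation)), ]
print(strongPairs)
```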
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
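Yes. Applying a Box-Cox or log transformation to the skewed variables (Ba, Fe, K, and Mg) might improve the classification model. Both transformations require strictly positive values, however, and some of these predictors contain zeros (Ba and Fe, for example, in the str() output above), so a small offset would have to be added first, or a Yeo-Johnson transformation, which accepts zeros, could be used instead.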
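One way to try such a transformation is caret's preProcess() function. The sketch below uses the Yeo-Johnson method, which is my assumption rather than something the exercise prescribes, because it tolerates the zeros present in predictors such as Ba and Fe. Centering and scaling are added as a common companion step.

```r
library(mlbench)
library(caret)
data(Glass)

glassPredictors <- subset(Glass, select = -Type)

# Estimate a Yeo-Johnson transformation per predictor, then center and scale.
# Yeo-Johnson is used here instead of Box-Cox because it accepts zeros.
glassPreProcess <- preProcess(glassPredictors,
                              method = c("YeoJohnson", "center", "scale"))
glassTransformed <- predict(glassPreProcess, glassPredictors)

# Compare the summaries before and after for one skewed predictor.
print(summary(glassPredictors$K))
print(summary(glassTransformed$K))
```

Whether the transformation actually helps depends on the classifier; tree-based models, for instance, are largely insensitive to monotone transformations of the predictors.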
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
library(caret)
library(knitr)
library(kableExtra)
nearZeroVar(Soybean, saveMetrics = TRUE) %>%
  kable(caption = 'Variables Near Zero Variance Status Report') %>%
  kable_styling()
| | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| Class | 1.010989 | 2.7818448 | FALSE | FALSE |
| date | 1.137405 | 1.0248902 | FALSE | FALSE |
| plant.stand | 1.208191 | 0.2928258 | FALSE | FALSE |
| precip | 4.098214 | 0.4392387 | FALSE | FALSE |
| temp | 1.879397 | 0.4392387 | FALSE | FALSE |
| hail | 3.425197 | 0.2928258 | FALSE | FALSE |
| crop.hist | 1.004587 | 0.5856515 | FALSE | FALSE |
| area.dam | 1.213904 | 0.5856515 | FALSE | FALSE |
| sever | 1.651282 | 0.4392387 | FALSE | FALSE |
| seed.tmt | 1.373874 | 0.4392387 | FALSE | FALSE |
| germ | 1.103627 | 0.4392387 | FALSE | FALSE |
| plant.growth | 1.951327 | 0.2928258 | FALSE | FALSE |
| leaves | 7.870130 | 0.2928258 | FALSE | FALSE |
| leaf.halo | 1.547511 | 0.4392387 | FALSE | FALSE |
| leaf.marg | 1.615385 | 0.4392387 | FALSE | FALSE |
| leaf.size | 1.479638 | 0.4392387 | FALSE | FALSE |
| leaf.shread | 5.072917 | 0.2928258 | FALSE | FALSE |
| leaf.malf | 12.311111 | 0.2928258 | FALSE | FALSE |
| leaf.mild | 26.750000 | 0.4392387 | FALSE | TRUE |
| stem | 1.253378 | 0.2928258 | FALSE | FALSE |
| lodging | 12.380952 | 0.2928258 | FALSE | FALSE |
| stem.cankers | 1.984293 | 0.5856515 | FALSE | FALSE |
| canker.lesion | 1.807910 | 0.5856515 | FALSE | FALSE |
| fruiting.bodies | 4.548077 | 0.2928258 | FALSE | FALSE |
| ext.decay | 3.681481 | 0.4392387 | FALSE | FALSE |
| mycelium | 106.500000 | 0.2928258 | FALSE | TRUE |
| int.discolor | 13.204546 | 0.4392387 | FALSE | FALSE |
| sclerotia | 31.250000 | 0.2928258 | FALSE | TRUE |
| fruit.pods | 3.130769 | 0.5856515 | FALSE | FALSE |
| fruit.spots | 3.450000 | 0.5856515 | FALSE | FALSE |
| seed | 4.139130 | 0.2928258 | FALSE | FALSE |
| mold.growth | 7.820895 | 0.2928258 | FALSE | FALSE |
| seed.discolor | 8.015625 | 0.2928258 | FALSE | FALSE |
| seed.size | 9.016949 | 0.2928258 | FALSE | FALSE |
| shriveling | 14.184211 | 0.2928258 | FALSE | FALSE |
| roots | 6.406977 | 0.4392387 | FALSE | FALSE |
# Search for degenerate distributions in the Soybean dataset.
degenerateDistributions <- nearZeroVar(Soybean)
colnames(Soybean)[degenerateDistributions]
## [1] "leaf.mild" "mycelium" "sclerotia"
As shown in the "Variables Near Zero Variance Status Report" table and the nearZeroVar() search results above, three variables in the Soybean dataset have degenerate distributions: leaf.mild, mycelium, and sclerotia.
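To see why a predictor is flagged, the two metrics in the report can be recomputed by hand. The sketch below uses a hypothetical helper, nzvMetrics(), written in base R: freqRatio is the count of the most common value divided by the count of the second most common, and percentUnique is the number of distinct values as a percentage of the sample size. With caret's default cutoffs, a predictor is flagged when freqRatio exceeds 95/5 and percentUnique is below 10.

```r
library(mlbench)
data(Soybean)

# Hypothetical helper reproducing the two near-zero-variance metrics
# for a single column (missing values are ignored when counting).
nzvMetrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  c(freqRatio = as.numeric(counts[1] / counts[2]),
    percentUnique = 100 * length(unique(x[!is.na(x)])) / length(x))
}

# mycelium is one of the three flagged predictors.
myceliumMetrics <- nzvMetrics(Soybean$mycelium)
print(myceliumMetrics)

# caret's default cutoffs: freqRatio above 95/5 and percentUnique below 10.
flagged <- myceliumMetrics[["freqRatio"]] > 95 / 5 &&
  myceliumMetrics[["percentUnique"]] < 10
print(flagged)
```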
(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Print out a table of missing values by column (sorted in descending order).
missingValuesOrdered <- order(-colSums(is.na(Soybean)))
kable(colSums(is.na(Soybean))[missingValuesOrdered], caption = 'Missing Values By Column') %>%
  kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive')) %>%
  scroll_box(width = '100%', height = '600px')
| | x |
|---|---|
| hail | 121 |
| sever | 121 |
| seed.tmt | 121 |
| lodging | 121 |
| germ | 112 |
| leaf.mild | 108 |
| fruiting.bodies | 106 |
| fruit.spots | 106 |
| seed.discolor | 106 |
| shriveling | 106 |
| leaf.shread | 100 |
| seed | 92 |
| mold.growth | 92 |
| seed.size | 92 |
| leaf.halo | 84 |
| leaf.marg | 84 |
| leaf.size | 84 |
| leaf.malf | 84 |
| fruit.pods | 84 |
| precip | 38 |
| stem.cankers | 38 |
| canker.lesion | 38 |
| ext.decay | 38 |
| mycelium | 38 |
| int.discolor | 38 |
| sclerotia | 38 |
| plant.stand | 36 |
| roots | 31 |
| temp | 30 |
| crop.hist | 16 |
| plant.growth | 16 |
| stem | 16 |
| date | 1 |
| area.dam | 1 |
| Class | 0 |
| leaves | 0 |
# Print a table containing a count of missing values by class.
library(dplyr)
classesMissingValues <- Soybean %>%
  mutate(nul = rowSums(is.na(Soybean))) %>%
  group_by(Class) %>%
  summarize(missing = sum(nul)) %>%
  filter(missing != 0)
kable(classesMissingValues, caption = 'Missing Values By Class') %>%
  kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive')) %>%
  scroll_box(width = '100%')
| Class | missing |
|---|---|
| 2-4-d-injury | 450 |
| cyst-nematode | 336 |
| diaporthe-pod-&-stem-blight | 177 |
| herbicide-injury | 160 |
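| phytophthora-rot | 1214 |

The two tables show that the missing values are concentrated in particular predictors and classes. hail, sever, seed.tmt, and lodging are each missing for 121 of the 683 samples (about 18%), whereas Class and leaves are never missing. Only five of the 19 classes contain any missing values, and phytophthora-rot alone accounts for 1214 of the 2337 missing cells, so the pattern of missing data is clearly related to the classes.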
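Raw counts of missing cells depend on how many samples each class has, so it can also help to look at the share of incomplete samples within each class. The following base R sketch does this; the names incompleteRow and missingByClass are hypothetical.

```r
library(mlbench)
data(Soybean)

# TRUE for every sample that has at least one missing predictor value.
incompleteRow <- !complete.cases(Soybean)

# Number of samples, number of incomplete samples, and their share per class.
classSizes <- table(Soybean$Class)
incompleteCounts <- tapply(incompleteRow, Soybean$Class, sum)
missingByClass <- data.frame(
  samples = as.vector(classSizes),
  incomplete = as.vector(incompleteCounts),
  proportion = round(as.vector(incompleteCounts) / as.vector(classSizes), 3),
  row.names = names(classSizes)
)

# Show only the classes that contain incomplete samples.
print(missingByClass[missingByClass$incomplete > 0, ])
```

A proportion close to 1 for a class would suggest that the missingness is informative about that class, which matters when deciding between imputation and removing predictors.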
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
For this question, I decided to impute the missing values with the mice() function from the mice (Multivariate Imputation by Chained Equations) package, using classification and regression trees (method = 'cart') and a single imputed dataset (m = 1). The missing-value counts before and after imputation, shown in the tables that follow, confirm that the imputation removed all missing values from the dataset.
#' mice_imputation - Mice Imputation.
#'
#' Given a dataset, runs the MICE algorithm on the dataset
#' to impute both numerical and categorical missing values.
#'
#' @param dataframe A dataframe on which to run the MICE algorithm.
#'
#' @return The passed dataset with missing values imputed to complete values.
#'
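mice_imputation <- function(dataframe) {
  # Run a single CART-based imputation without printing the iteration log.
  imputation <- mice::mice(dataframe, m = 1, method = 'cart', printFlag = FALSE)
  # Return the completed dataset explicitly rather than as an invisible assignment.
  mice::complete(imputation)
}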
# Check for empty values prior to imputing the data.
sapply(Soybean, function(x) sum(is.na(x))) %>% sort(decreasing = TRUE) %>% kable(caption = 'Missing Values Count Before Imputation') %>% kable_styling()
| | x |
|---|---|
| hail | 121 |
| sever | 121 |
| seed.tmt | 121 |
| lodging | 121 |
| germ | 112 |
| leaf.mild | 108 |
| fruiting.bodies | 106 |
| fruit.spots | 106 |
| seed.discolor | 106 |
| shriveling | 106 |
| leaf.shread | 100 |
| seed | 92 |
| mold.growth | 92 |
| seed.size | 92 |
| leaf.halo | 84 |
| leaf.marg | 84 |
| leaf.size | 84 |
| leaf.malf | 84 |
| fruit.pods | 84 |
| precip | 38 |
| stem.cankers | 38 |
| canker.lesion | 38 |
| ext.decay | 38 |
| mycelium | 38 |
| int.discolor | 38 |
| sclerotia | 38 |
| plant.stand | 36 |
| roots | 31 |
| temp | 30 |
| crop.hist | 16 |
| plant.growth | 16 |
| stem | 16 |
| date | 1 |
| area.dam | 1 |
| Class | 0 |
| leaves | 0 |
# Check for empty values once again after running the MICE imputation on the data.
sapply(mice_imputation(Soybean), function(x) sum(is.na(x))) %>% sort(decreasing = TRUE) %>% kable(caption = 'Missing Values Count After Imputation') %>% kable_styling()
| | x |
|---|---|
| Class | 0 |
| date | 0 |
| plant.stand | 0 |
| precip | 0 |
| temp | 0 |
| hail | 0 |
| crop.hist | 0 |
| area.dam | 0 |
| sever | 0 |
| seed.tmt | 0 |
| germ | 0 |
| plant.growth | 0 |
| leaves | 0 |
| leaf.halo | 0 |
| leaf.marg | 0 |
| leaf.size | 0 |
| leaf.shread | 0 |
| leaf.malf | 0 |
| leaf.mild | 0 |
| stem | 0 |
| lodging | 0 |
| stem.cankers | 0 |
| canker.lesion | 0 |
| fruiting.bodies | 0 |
| ext.decay | 0 |
| mycelium | 0 |
| int.discolor | 0 |
| sclerotia | 0 |
| fruit.pods | 0 |
| fruit.spots | 0 |
| seed | 0 |
| mold.growth | 0 |
| seed.discolor | 0 |
| seed.size | 0 |
| shriveling | 0 |
| roots | 0 |
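
A zero count of missing values shows that the imputation ran to completion, but not that the imputed values are plausible. As a rough check, the level proportions of a heavily missing predictor can be compared before and after imputation. The sketch below assumes the same CART settings as above; the seed value of 123 is an arbitrary choice added only for reproducibility.

```r
library(mlbench)
library(mice)
data(Soybean)

# Single CART imputation; the seed is an arbitrary value for reproducibility.
imputation <- mice(Soybean, m = 1, method = "cart", seed = 123, printFlag = FALSE)
soybeanImputed <- mice::complete(imputation)

# Level proportions of one heavily missing predictor, before and after.
proportionsBefore <- prop.table(table(Soybean$hail))
proportionsAfter <- prop.table(table(soybeanImputed$hail))
print(round(rbind(before = proportionsBefore, after = proportionsAfter), 3))
```

Large shifts in the proportions would not necessarily mean the imputation is wrong, since the missing values are tied to specific classes, but they are worth inspecting before modeling.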