The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and the percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Do there appear to be any outliers in the data? Are any predictors skewed?
Are there any relevant transformations of one or more predictors that might improve the classification model?
The predictor variables vary widely in their distributions, as shown in the histograms below. For several predictors (Ba, Fe, K, Mg), a large number of observations are zero. Most of the remaining predictors are roughly normally distributed with some skewness, so these skewed variables would likely benefit from a Box-Cox transformation.
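The histograms can be reproduced with something like the sketch below; the faceting layout and a bin count of 30 are assumptions here, not necessarily the settings used for the original figure.
library(dplyr)
library(tidyr)
library(ggplot2)

# Reshape the nine numeric predictors to long format and draw one histogram per predictor
Glass %>%
  select(-Type) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")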
The correlation plot below shows that Ca and RI are highly correlated with one another, with a correlation of 0.81. All of the other variable pairs have an absolute correlation of less than 0.55 and are therefore at most moderately correlated.
library(corrplot)

# Correlation matrix of the numeric predictors, lower triangle only
corrplot(cor(dplyr::select_if(Glass, is.numeric), use = "na.or.complete"),
         method = "number",
         type = "lower",
         diag = FALSE,
         number.cex = 0.75,
         tl.cex = 0.5)
The boxplots below show that there are indeed some outliers present in the data. We also have some skewed predictors (Al, Ca, Na, RI, Si), as shown in the histograms in Problem 3.1a.
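A per-predictor boxplot makes the outliers easy to see. The following is a minimal sketch of such a plot; the faceting and free scales are choices made here for illustration, not necessarily the original settings.
# One boxplot per numeric predictor; free scales keep each predictor on its own axis
Glass %>%
  select(-Type) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = "", y = value)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free") +
  labs(x = NULL)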
I think that all of the variables would benefit from a Box-Cox transformation. Mg has a bimodal distribution and might not benefit from such a transformation, but let's try transforming all of the variables.
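One way to apply the transformation is caret's preProcess() with method = "BoxCox", sketched below; using caret is an assumption (the original tool is not shown), and the object names bc_trans and glass_bc as well as the bin count are illustrative. Note that Box-Cox requires strictly positive values, so caret simply skips predictors containing zeros.
library(caret)

# Estimate Box-Cox transformations for the numeric predictors and apply them
bc_trans <- preProcess(Glass[, -10], method = "BoxCox")
glass_bc <- predict(bc_trans, Glass[, -10])

# Histograms of the transformed predictors
glass_bc %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")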
As expected, the plot above shows that Al, Ca, Na, RI, and Si have been transformed and are now closer to a normal distribution, while the variables with a large number of zero observations were left untransformed (Box-Cox requires strictly positive values) and so did not benefit.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
## See ?Soybean for details
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)

# Bar chart of level frequencies for each categorical predictor
Soybean %>%
  keep(is.factor) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free", ncol = 2) +
  geom_bar() +
  scale_x_discrete(guide = guide_axis(angle = 90))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Looking at the bar charts for each of the categorical predictors, we can see a degenerate distribution in the mycelium and sclerotia variables. We can also see somewhat degenerate distributions in the seed.discolor, seed.size, lodging, leaf.malf, and shriveling variables.
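A quick programmatic check, assuming caret is available, is nearZeroVar(), which computes frequency-ratio and percent-unique metrics and flags near-zero-variance (degenerate) predictors:
library(caret)

# The nzv column flags predictors with (near) degenerate distributions
nearZeroVar(Soybean, saveMetrics = TRUE)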
Soybean %>%
summarise_all(list(~is.na(.)))%>%
pivot_longer(everything(),
names_to = "variables", values_to="missing") %>%
count(variables, missing) %>%
filter(missing == TRUE) %>%
arrange(desc(n)) %>%
mutate(percent_missing = (n/nrow(Soybean))*100) %>%
head()
## # A tibble: 6 × 4
## variables missing n percent_missing
## <chr> <lgl> <int> <dbl>
## 1 hail TRUE 121 17.7
## 2 lodging TRUE 121 17.7
## 3 seed.tmt TRUE 121 17.7
## 4 sever TRUE 121 17.7
## 5 germ TRUE 112 16.4
## 6 leaf.mild TRUE 108 15.8
The tibble above shows the six variables with the most missing data. Four of the six (hail, lodging, seed.tmt, and sever) are each missing 17.7% of their values, while germ is missing 16.4% and leaf.mild 15.8%. The fact that the first four variables have exactly the same amount of missing data suggests that there may be a pattern to the missingness in these particular variables.
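One way to check whether the missingness is also related to the classes is to count incomplete rows within each Class. The sketch below is an additional check, not part of the original analysis, and the column names incomplete, n_incomplete, and pct_incomplete are illustrative.
# Share of incomplete rows within each outcome class
Soybean %>%
  mutate(incomplete = !complete.cases(Soybean)) %>%
  group_by(Class) %>%
  summarise(n_incomplete = sum(incomplete),
            pct_incomplete = 100 * mean(incomplete)) %>%
  arrange(desc(pct_incomplete))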
In general, imputation by the mean or median is acceptable if missing values account for only about 5% of the sample (Peng et al., 2006). However, should the degree of missingness exceed 20%, these simple imputation approaches will artificially reduce variability, because values are imputed at the center of the variable's distribution.
I decided to employ another technique to handle the missing values: multiple imputation using the mice package (Multivariate Imputation by Chained Equations).
The mice package in R implements a methodology where each incomplete variable is imputed by a separate model. Alice points out that plausible values are drawn from a distribution specifically designed for each missing data point. Many imputation methods can be used within the package; the one selected for the data analyzed in this report is PMM (predictive mean matching), which is used for quantitative data.
Van Buuren explains that PMM works by selecting values from the observed data that would most plausibly belong to the variable in the observation with the missing value. The advantage is that only values that actually occur in the observed data are used, so, for example, no negative values will be imputed. PMM also circumvents the shrinking of standard errors by using multiple regression models: the variability between the different imputed values gives a wider, but more correct, standard error. Uncertainty is inherent in imputation, which is why having multiple imputed values is important. Marshall et al. (2010) also point out that:
“Another simulation study that addressed skewed data concluded that predictive mean matching ‘may be the preferred approach provided that less than 50% of the cases have missing data…’”
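The imputation step itself is not shown above; a minimal sketch of how it might be run with mice is below. The call leaves the per-variable method to mice's defaults (the report's stated choice of PMM could be set via the method argument), and m, seed, and the object names imp and temp_Soybean are illustrative assumptions, with temp_Soybean taken to be the completed data used in the plot that follows.
library(mice)

# Build one imputation model per incomplete variable and fill in the gaps;
# m, seed, and the object names are illustrative, not the original settings
imp <- mice(Soybean, m = 5, seed = 123, printFlag = FALSE)
temp_Soybean <- mice::complete(imp)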
# Missing-value counts per variable after imputation
# (theme_classic() must come before theme() so the blank y-axis title is not overridden)
temp_Soybean %>%
  summarise_all(list(~is.na(.))) %>%
  pivot_longer(everything(),
               names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col() +
  scale_fill_manual(values = c("skyblue3", "gold")) +
  theme_classic() +
  theme(axis.title.y = element_blank())