The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
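library(tidyverse)  # dplyr, tidyr, ggplot2
library(mlbench)    # assumed source of the Glass and Soybean data
data(Glass)

# Histogram of each predictor, filled by glass type
Glass %>%
  gather(key='predictor', value='response', -Type) %>%
  ggplot(aes(response, fill=Type)) +
  geom_histogram(bins=20) +
  facet_wrap(~predictor, scales='free') +
  labs(x='', y='') +
  scale_x_continuous(expand=c(0.02, 0.02, 0.02, 0.02)) +
  scale_y_continuous(expand=c(0, 0, 0.05, 0.05)) +
  scale_fill_brewer(palette='Set1') +
  theme_bw()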
The nine predictors differ greatly in scale and shape. Predictors such as Al, Na, and Mg appear to have strong predictive value: their distributions show some separation by glass type, which suggests they may help a model discriminate between classes. Conversely, predictors like Ba and Fe have such a small range of values that it is unclear whether they can even be used to build a model; in both cases most observations are essentially 0. Other predictors are closer to normally distributed and may be of some use as well. The predictor values may need to be scaled, as they are measured on greatly different scales.
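The share of zeros and the spread of each predictor can be checked numerically rather than read off the histograms. The following is a minimal sketch, assuming `Glass` comes from the `mlbench` package; the names `predictors` and `zero_share` are introduced here for illustration only.

```r
library(mlbench)  # assumed source of the Glass data frame
data(Glass)

# All columns except the class label
predictors <- Glass[, names(Glass) != "Type"]

# Share of observations that are exactly zero, per predictor
zero_share <- sapply(predictors, function(x) mean(x == 0))
round(sort(zero_share, decreasing = TRUE), 2)

# Minimum and maximum of each predictor, to compare measurement scales
t(sapply(predictors, range))
```

Predictors with a large zero share are the ones whose histograms collapse into a single bar near the origin.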
# Violin plot of each predictor, with quartile lines
Glass %>%
  gather(key='predictor', value='response', -Type) %>%
  ggplot(aes(predictor, response)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  facet_wrap(~predictor, scales='free') +
  labs(x='', y='') +
  theme_bw() +
  theme(panel.border = element_blank(),
        strip.background = element_rect(fill='grey80', color='white'))
# Boxplot of each predictor, to make outliers explicit
Glass %>%
  gather(key='predictor', value='response', -Type) %>%
  ggplot(aes(predictor, response)) +
  geom_boxplot() +
  facet_wrap(~predictor, scales='free') +
  labs(x='', y='') +
  theme_bw() +
  theme(panel.border = element_blank(),
        strip.background = element_rect(fill='grey80', color='white'))
Comparing the violin plots and the boxplots, we can see that Ba, Fe, and K are heavily skewed with many outliers, while Al, Ca, and Na are closer to normally distributed. Mg is bimodal. The boxplots flag every point that lies more than 1.5 times the interquartile range beyond the quartiles, which is the conventional definition of an outlier.
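The outliers drawn in the boxplots can also be counted. This is a rough sketch using base R's `boxplot.stats()`, which applies the same 1.5 times IQR rule; it computes the hinges slightly differently from ggplot2, so the counts may differ marginally from the points plotted. It assumes `Glass` is loaded from `mlbench`, and `outlier_counts` is an illustrative name.

```r
library(mlbench)  # assumed source of the Glass data frame
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# boxplot.stats() returns the points beyond 1.5 * IQR from the hinges in $out
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
sort(outlier_counts, decreasing = TRUE)
```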
The transformations that should be applied depend on the type of model being developed. For example, random forests are robust to predictors that are skewed or that differ by orders of magnitude. On the other hand, if the model is a multiple logistic regression (or a GAM), the predictors will need to be transformed.
In this case, the predictors should be scaled, as they are all on greatly different scales. For example, Si and Fe differ by orders of magnitude. Next, we would want to address the skew of the predictors. This can be done with a Box-Cox transformation or by simply taking the natural log of the predictor values, keeping in mind that both require strictly positive inputs and several predictors here contain zeros. Ba may need to be removed entirely, as it is a near-zero variance predictor and this can cause trouble for regressions. Conversely, for tree-based models these modifications would not be necessary, as trees are robust to skewed predictors on different scales.
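The effect of a skew-reducing transformation followed by scaling can be sketched in base R. Because a plain log or Box-Cox transformation is undefined at zero, this sketch swaps in `log1p()`, that is log(1 + x), as a simple stand-in; it is not the Box-Cox procedure itself. The `skewness()` helper and the names `logged` and `scaled` are hypothetical, and `Glass` is assumed to come from `mlbench`.

```r
library(mlbench)  # assumed source of the Glass data frame
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# log1p(x) = log(1 + x) is defined at zero, unlike log(x)
logged <- as.data.frame(lapply(predictors, log1p))

# Center each column to mean 0 and scale to unit standard deviation
scaled <- as.data.frame(scale(logged))

# Skewness before and after the log1p step
round(cbind(raw   = sapply(predictors, skewness),
            log1p = sapply(logged, skewness)), 2)
```

Comparing the two columns shows which predictors the transformation helps; whether the remaining skew is acceptable is a judgment call that depends on the model.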
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
# Bar chart of level counts for every categorical predictor
Soybean %>%
  select_if(is.factor) %>%
  gather(key='predictor', value='value', -Class) %>%
  count(predictor, value) %>%
  ggplot(aes(value, n)) +
  geom_col() +  # counts are precomputed, so geom_col() replaces geom_histogram(stat='identity')
  facet_wrap(~predictor, scales='free') +
  labs(x='', y='') +
  scale_y_continuous(expand=c(0, 0, 0.05, 0.05)) +
  scale_x_discrete(expand=c(0.15, 0.15, 0.15, 0.15)) +
  theme_bw() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())
There are a few nearly degenerate distributions, that is, distributions where a single value is so dominant that the predictor carries essentially no information. These are int.discolor, leaf.mild, mycelium, sclerotia, and possibly shriveling. It is important to note that none of these is truly degenerate, as every level of each predictor is observed at least once. However, the distributions are so unbalanced that they may cause problems, depending on the model selected. Each of these predictors should be investigated further before it is removed; removing a predictor should be a last resort.
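One way to put numbers on "nearly degenerate" is the pair of diagnostics used by `caret::nearZeroVar()`: the ratio of the most frequent level to the second most frequent, and the percentage of distinct values. That function flags a predictor by default when the ratio exceeds 95/5 and the percentage is below 10. The sketch below recomputes both in base R; `nzv_stats()` is a hypothetical helper written for this illustration, not caret's implementation, and `Soybean` is assumed to come from `mlbench`.

```r
library(mlbench)  # assumed source of the Soybean data frame
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# Hypothetical helper: frequency ratio and percent of distinct values
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() ignores NA by default
  counts <- counts[counts > 0]                 # drop unobserved levels
  c(freq_ratio = if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf,
    pct_unique = 100 * length(counts) / length(x))
}

stats <- as.data.frame(t(sapply(predictors, nzv_stats)))

# Predictors with the most unbalanced level counts first
head(stats[order(-stats$freq_ratio), ])
```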
##
## Variables sorted by number of missings:
##  Variable         Count
##  hail               121
##  sever              121
##  seed.tmt           121
##  lodging            121
##  germ               112
##  leaf.mild          108
##  fruiting.bodies    106
##  fruit.spots        106
##  seed.discolor      106
##  shriveling         106
##  leaf.shread        100
##  seed                92
##  mold.growth         92
##  seed.size           92
##  leaf.halo           84
##  leaf.marg           84
##  leaf.size           84
##  leaf.malf           84
##  fruit.pods          84
##  precip              38
##  stem.cankers        38
##  canker.lesion       38
##  ext.decay           38
##  mycelium            38
##  int.discolor        38
##  sclerotia           38
##  plant.stand         36
##  roots               31
##  temp                30
##  crop.hist           16
##  plant.growth        16
##  stem                16
##  date                 1
##  area.dam             1
##  Class                0
##  leaves               0
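The per-predictor counts listed above can be reproduced with base R alone. This is a minimal sketch, assuming `Soybean` is loaded from `mlbench`; it is not necessarily the call that produced the output above, and `missing_counts` is an illustrative name.

```r
library(mlbench)  # assumed source of the Soybean data frame
data(Soybean)

# Number of missing values per column, largest first
missing_counts <- sort(colSums(is.na(Soybean)), decreasing = TRUE)
head(missing_counts)
```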
There is a strong pattern in the missing data. The missingness summary above indicates that if a sample is missing a single predictor, it is likely also missing several others. In fact, there are no samples missing only one or two predictors. This is problematic because it makes imputation dangerous: we would be filling in numerous values for individual samples.
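The claim about how many predictors each sample is missing can be checked by counting missing values per row instead of per column. A short sketch, again assuming `Soybean` from `mlbench`, with `missing_per_row` as an illustrative name:

```r
library(mlbench)  # assumed source of the Soybean data frame
data(Soybean)

# Number of missing predictors in each sample
missing_per_row <- rowSums(is.na(Soybean))

# How many samples are missing 0, 1, 2, ... predictors
table(missing_per_row)
```

A table with entries at 0 and then only at larger values would confirm that missingness is clustered within samples.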
# Complete vs. incomplete observations within each class
Soybean %>%
  mutate(complete = complete.cases(.)) %>%
  count(Class, complete) %>%
  ggplot(aes(reorder(Class, -n, FUN=sum), n, fill=complete)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(expand=c(0, 0, 0.05, 0.05)) +
  scale_fill_brewer(palette='Set1') +
  labs(y='Observations',
       x='Soybean Type',
       fill='Complete Observations') +
  theme_bw() +
  theme(panel.grid.major.y = element_blank(),
        panel.border = element_blank(),
        legend.position = c(0.65, 0.75),
        legend.background = element_rect(fill='grey90', color='black'),
        legend.key = element_rect(color='grey90'))
The missing data is entirely contained within 5 of the classes. For 4 of these classes, every single observation is missing multiple predictors, and for the fifth a sizeable majority of samples are missing multiple predictors.
The strategy taken depends on the goal of the model. If I were given free rein, I would eliminate the 5 classes that are missing predictors. This would leave me with approximately 82% of the available data and, more importantly, no missing data. Of course, eliminating classes is not something to be taken lightly (and may not even be appropriate given the task). If I am required to keep all of the classes, the next step would be to determine which (if any) of the predictors can be imputed and which may be better left out due to the number of missing values. For example, hail, sever, seed.tmt, and lodging are each missing in 121 observations. Furthermore, several plots (not shown) demonstrate that the soybeans that are missing data are all missing the same data. Imputing values is therefore a bad idea.
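The class-elimination strategy can be sketched as follows. The code reports both the share of complete rows and the share of rows left after dropping every class that contains an incomplete row; the two need not coincide if an affected class also contains some complete rows. It assumes `Soybean` from `mlbench`, and the names `complete`, `incomplete_classes`, and `kept` are illustrative.

```r
library(mlbench)  # assumed source of the Soybean data frame
data(Soybean)

complete <- complete.cases(Soybean)

# Share of rows with no missing values
mean(complete)

# Classes that contain at least one incomplete row
incomplete_classes <- unique(as.character(Soybean$Class[!complete]))
incomplete_classes

# Drop those classes entirely and remove their now-empty factor levels
kept <- droplevels(Soybean[!Soybean$Class %in% incomplete_classes, ])

nrow(kept) / nrow(Soybean)  # share of rows retained
anyNA(kept)                 # FALSE by construction: only fully complete classes remain
```

If the modeling task requires all classes, the same `incomplete_classes` vector identifies where imputation would have to do the most work.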