library(mlbench)
library(tidyverse)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)  # loads the Glass data frame into the environment; data() returns only the dataset's name, so assigning its result is not useful
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
# Histograms of each predictor to check for skewness
Glass %>%
  select(!Type) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~key, scales = 'free')
# Boxplots of each predictor to check for outliers
Glass %>%
  select(!Type) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free')
The histograms show that several predictors are skewed, and the boxplots show that all predictors except ‘Mg’ have outliers. A Box-Cox transformation is applicable for the skewed predictors, and a spatial sign transformation can mitigate the outliers.
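As a sketch of how these transformations could be applied (assuming the caret and e1071 packages are installed; neither is loaded above), we can quantify the skewness and then let caret's preProcess() chain the transformations:
library(caret)
library(e1071)
# Skewness statistic for each predictor (values far from 0 indicate skew)
apply(Glass %>% select(!Type), 2, skewness)
# Box-Cox lambdas are estimated only for strictly positive predictors, so
# columns containing zeros (e.g., Ba, Fe) are skipped by that step; centering
# and scaling are applied before the spatial sign projection, as it requires
pp <- preProcess(Glass %>% select(!Type),
                 method = c("BoxCox", "center", "scale", "spatialSign"))
glass_trans <- predict(pp, Glass %>% select(!Type))
head(glass_trans)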
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
# See ?Soybean for details
# Bar charts of each categorical predictor's level frequencies (complete cases only)
Soybean %>%
  select(!Class) %>%
  drop_na() %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~key)
Degenerate distributions are those where the variable takes essentially a single value, with all other values occurring at a very low rate. Here ‘mycelium’, ‘sclerotia’, and ‘roots’ seem to be degenerate.
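As a more formal check (a sketch, assuming caret is available, as loaded above), caret's nearZeroVar() flags predictors with near-zero variance:
# Returns the names of the near-degenerate predictors
nearZeroVar(Soybean, names = TRUE)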
# Count missing vs. non-missing values per variable
Soybean %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(),
               names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col()
# Which classes contain the rows with at least one missing value?
Soybean %>%
  filter(if_any(everything(), is.na)) %>%
  count(Class, name = "count")
## # A tibble: 5 x 2
## Class count
## <fct> <int>
## 1 2-4-d-injury 16
## 2 cyst-nematode 14
## 3 diaporthe-pod-&-stem-blight 15
## 4 herbicide-injury 8
## 5 phytophthora-rot 68
All of the missing values fall within the five classes shown above. Among the predictors, ‘hail’, ‘lodging’, ‘sever’, and ‘seed.tmt’ appear to have the most missing values.
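To make that ranking precise, we can sort the predictors by their number of missing values using the same tidyverse tools as above:
Soybean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing)) %>%
  head(10)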
Imputation is one way to account for missing data without discarding it altogether. Candidate fill-in values include the minimum, maximum, mean, or median of a column; since the Soybean predictors are categorical, the analogous choice here is to fill each column with its most frequent level (the mode).
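A minimal sketch of that strategy (the impute_mode() helper is my own illustration, not a function from any package):
impute_mode <- function(x) {
  # Replace NAs in a factor with its most frequent observed level
  if (anyNA(x)) {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
soybean_imputed <- Soybean %>% mutate(across(!Class, impute_mode))
sum(is.na(soybean_imputed))  # should now be 0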