The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
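library(mlbench)    # Glass and Soybean data sets
library(tidyverse)  # dplyr, tidyr, purrr, ggplot2
library(corrplot)   # corrplot()
library(caret)      # BoxCoxTrans()
library(e1071)      # skewness(); assumed source, the moments package also provides one

data(Glass)

# Histograms of the numeric predictors
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 15) +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

# Boxplots of the numeric predictors
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

# Pairwise correlations between the numeric predictors
Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot()

# Number of samples in each glass type
Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")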
It can be seen that:
* Al is slightly right skewed
* Ba is right skewed and mostly centered around 0
* Ca is right skewed
* Fe is right skewed and mostly centered around 0
* K is right skewed
* Mg is left skewed and bimodal
* Na is almost normal with a slight right tail
* RI is right skewed
* Si is left skewed
* Type is mostly centered around Types 1, 2, and 7
There also seems to be a strong positive correlation between RI and Ca. There are notable negative correlations between RI and Si, Al and Mg, Ca and Mg, and Ba and Mg, and a notable positive correlation between Ba and Al.
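To complement the correlation plot, the pairs can also be ranked numerically. The following is a minimal sketch in base R; the names `glass_num`, `cor_mat`, and `cor_pairs` are illustrative and not part of the original analysis.

```r
library(mlbench)

data(Glass)

# Correlation matrix of the nine numeric predictors (Type is a factor and is dropped)
glass_num <- Glass[, sapply(Glass, is.numeric)]
cor_mat <- cor(glass_num)

# Keep each pair once by looking only at the upper triangle
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 6)
```

Sorting on the absolute value keeps strong negative correlations, such as those involving Mg, next to the strong positive ones.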
There appear to be outliers in Ba, K, RI, Ca, Fe, and possibly Na. Several predictors are also skewed, as mentioned in 3.1.a.
Glass %>%
keep(is.numeric) %>%
apply(., 2, skewness) %>%
round(4)
## RI Na Mg Al Si K Ca Ba Fe
## 1.6027 0.4478 -1.1365 0.8946 -0.7202 6.4601 2.0184 3.3687 1.7298
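The outliers noted above from the boxplots can also be counted directly. This is a rough sketch that assumes the usual boxplot rule (points more than 1.5 times the IQR beyond the hinges); `glass_num` and `outlier_counts` are illustrative names.

```r
library(mlbench)

data(Glass)
glass_num <- Glass[, sapply(Glass, is.numeric)]

# boxplot.stats() returns in $out the points that a boxplot would draw
# beyond the whiskers (default coef = 1.5)
outlier_counts <- sapply(glass_num, function(x) length(boxplot.stats(x)$out))

# Predictors with the most flagged points first
sort(outlier_counts, decreasing = TRUE)
```

For a predictor whose values are mostly zero, the IQR can collapse to 0, in which case every non-zero value is flagged. Counts for such predictors say more about the mass at zero than about unusual samples.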
Since Ba, Fe, and K are strongly right skewed, with a concentration of points at low values, they may benefit from a log transformation. These predictors contain zeros, so an offset such as log(x + 1) would be needed. Mg is left skewed and also contains zeros, so a log transformation is less likely to help there. The table below shows the optimal Box-Cox lambdas. RI can be inverse squared (lambda = -2), Si can be squared (lambda = 2), and Al can be square rooted (lambda = 0.5). It would also be interesting to see how the model performs without Ca, as it is correlated with several other variables.
Glass %>%
  keep(is.numeric) %>%
  summarise(across(everything(), ~ BoxCoxTrans(.x)$lambda))
## RI Na Mg Al Si K Ca Ba Fe
## 1 -2 -0.1 NA 0.5 2 NA -1.1 NA NA
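The NA entries in the table are most likely due to zeros: Box-Cox is only defined for strictly positive data, and `BoxCoxTrans()` does not estimate lambda otherwise. The sketch below checks which predictors contain zeros and compares skewness before and after `log1p()`, that is, log(x + 1). It assumes `skewness()` comes from the e1071 package; `zero_counts`, `with_zeros`, and `skew_compare` are illustrative names.

```r
library(mlbench)
library(e1071)  # skewness(); assumed to be the package used in this report

data(Glass)
glass_num <- Glass[, sapply(Glass, is.numeric)]

# Number of exact zeros per predictor; Box-Cox needs strictly positive data
zero_counts <- sapply(glass_num, function(x) sum(x == 0))
zero_counts

# Skewness before and after log(x + 1) for the predictors that contain zeros
with_zeros <- names(zero_counts)[zero_counts > 0]
skew_compare <- data.frame(
  before = sapply(glass_num[with_zeros], skewness),
  after  = sapply(glass_num[with_zeros], function(x) skewness(log1p(x)))
)
round(skew_compare, 4)
```

A concave transformation such as `log1p()` lowers the skewness statistic. That helps right-skewed predictors but pushes an already left-skewed predictor further negative, so it should be checked per variable rather than applied across the board.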
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
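data(Soybean)
columns <- colnames(Soybean)

# One horizontal bar chart per column (predictors and Class)
lapply(columns,
       function(col) {
         # .data[[col]] replaces the deprecated aes_string()
         ggplot(Soybean, aes(.data[[col]])) +
           geom_bar() +
           coord_flip() +
           ggtitle(col)
       })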
Degenerate distributions are ones that take on only one possible value. mycelium and sclerotia seem to be degenerate. leaf.mild and leaf.malf also appear almost one-sided once the missing values are discounted.
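The visual impression can be cross-checked with caret's `nearZeroVar()`, which flags predictors by the frequency ratio of the two most common values and the percentage of unique values. This is a sketch using the function's default cutoffs (`freqCut = 95/5`, `uniqueCut = 10`); the names `soy_predictors` and `nzv` are illustrative.

```r
library(mlbench)
library(caret)

data(Soybean)

# Drop the outcome so that only predictors are screened
soy_predictors <- Soybean[, names(Soybean) != "Class"]

# saveMetrics = TRUE returns one row per predictor with
# freqRatio, percentUnique, zeroVar, and nzv columns
nzv <- nearZeroVar(soy_predictors, saveMetrics = TRUE)

# Predictors flagged as near-zero variance
nzv[nzv$nzv, ]
```

The cutoffs are a judgment call. A predictor that looks one-sided in a bar chart may still fall just under the default frequency ratio, so the metrics table is worth reading alongside the plots.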
Soybean %>%
mutate(across(everything(), is.na)) %>%
pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
count(variables, missing) %>%
ggplot(aes(y = variables, x=n, fill = missing))+
geom_col(position = "fill") +
labs(title = "Proportion of Missing Values",
x = "Proportion") +
scale_fill_manual(values=c("grey","red"))
Soybean %>%
group_by(Class) %>%
mutate(class_Total = n()) %>%
ungroup() %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
mutate(Missing = n(),
Proportion = Missing / class_Total) %>%
ungroup()%>%
select(Class, Proportion) %>%
distinct()
## # A tibble: 5 x 2
## Class Proportion
## <fct> <dbl>
## 1 phytophthora-rot 0.773
## 2 diaporthe-pod-&-stem-blight 1
## 3 cyst-nematode 1
## 4 2-4-d-injury 1
## 5 herbicide-injury 1
Soybean %>%
filter(!Class %in% c("phytophthora-rot", "diaporthe-pod-&-stem-blight", "cyst-nematode",
"2-4-d-injury", "herbicide-injury")) %>%
mutate(across(everything(), is.na)) %>%
pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
count(variables, missing) %>%
ggplot(aes(y = variables, x=n, fill = missing))+
geom_col(position = "fill") +
labs(title = "Proportion of Missing Values with Missing Classes Removed",
x = "Proportion") +
scale_fill_manual(values=c("grey","red"))
There does seem to be a pattern: the cases with missing data are concentrated in certain classes. After those five classes are removed from the data, there appears to be no missing data left.
One strategy would be to remove those five classes from the data completely. Alternatively, the data can be subset by class, with those five classes handled separately, and the variables with missing values imputed using KNN. Any variables that have no data at all for those classes can be removed from the subsetted data set.
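The first strategy, removing the affected classes, can be sketched in base R as follows. The names `classes_with_na` and `soy_complete` are illustrative, and the sketch does not cover the KNN imputation alternative.

```r
library(mlbench)

data(Soybean)

# Classes in which at least one row has a missing value
incomplete <- !complete.cases(Soybean)
classes_with_na <- unique(as.character(Soybean$Class[incomplete]))
classes_with_na

# Remove those classes and drop their now-empty factor levels
soy_complete <- droplevels(Soybean[!Soybean$Class %in% classes_with_na, ])

# Check that no missing values remain and see how many classes are left
anyNA(soy_complete)
nlevels(soy_complete$Class)
```

Removing whole classes means the resulting model cannot predict them at all, including the class that is only partly incomplete, so this trade-off should be weighed against imputation.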