library(mlbench)
library(tidyverse)
library(corrplot)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Glass %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_histogram(bins = 15, aes(x = value, y = after_stat(density)), fill = "gray", color = "#e9ecef") +
geom_density(aes(x = value), color = 'red', lwd = 0.8) +
facet_wrap(~key, scales = 'free') +
ggtitle("Histograms of Numerical Predictors")
Glass %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplots of Numerical Predictors")
Glass %>%
keep(is.numeric) %>%
cor() %>%
corrplot()
Glass %>%
ggplot() +
geom_bar(aes(x = Type)) +
ggtitle("Distribution of Types of Glass")
Strong negative relationships:
Ca / Mg; RI / Si; RI / Al; Mg / Al; Mg / Ba
Strong positive relationships:
Ca / RI; K / Al; Al / Ba; Na / Ba
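The pairs listed above were read off the correlation plot. A minimal sketch of how they could be pulled out of the correlation matrix numerically is shown here, using only base R and `mlbench`. The cutoff of 0.4 and the names `cor_pairs` and `strong_pairs` are assumptions made for illustration, not values taken from the analysis.

```r
library(mlbench)

data(Glass)

# Correlation matrix of the nine numeric predictors (Type is a factor and is dropped).
num_cols <- sapply(Glass, is.numeric)
cor_mat  <- cor(Glass[, num_cols])

# Look only at the upper triangle so that each pair appears once.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Hypothetical cutoff for calling a relationship "strong".
cutoff <- 0.4
strong_pairs <- cor_pairs[abs(cor_pairs$r) >= cutoff, ]
strong_pairs <- strong_pairs[order(strong_pairs$r), ]
print(strong_pairs, row.names = FALSE)
```

Sorting by `r` puts the negative relationships first and the positive ones last, which mirrors the two groups above. A different cutoff will change which pairs qualify.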
Several predictors appear to contain outliers. Ba, Ca, Fe, K, and RI stand out in the boxplots, and Mg and Na also have observations that lie far from the rest of their values. Several predictors are skewed as well: Al, Ba, Ca, Fe, K, Na, and RI are right-skewed, while Mg and Si are left-skewed.
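A Box-Cox transformation could reduce the skewness and may improve the classification model. Box-Cox requires strictly positive values, so predictors that contain zeros (such as Ba and Fe) need another option, for example a Yeo-Johnson transformation. For the predictors with outliers we can use a log or square root transformation, or a spatial sign transformation.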
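As a rough illustration of the transformations mentioned above, the sketch below compares skewness before and after a log transformation and then applies a spatial sign transformation by hand. It assumes `log1p()` (that is, log(1 + x)) as a stand-in for the plain log so that zeros stay finite, and the helper `skewness()` is defined here rather than taken from a package. It does not implement Box-Cox itself.

```r
library(mlbench)

data(Glass)
predictors <- Glass[, sapply(Glass, is.numeric)]

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

# log1p() computes log(1 + x), which stays finite for zeros such as those in Ba and Fe.
logged <- as.data.frame(lapply(predictors, log1p))

skew_table <- data.frame(
  predictor = names(predictors),
  raw       = sapply(predictors, skewness),
  log1p     = sapply(logged, skewness),
  row.names = NULL
)
print(skew_table)

# Spatial sign: center and scale every predictor, then project each sample
# onto the unit sphere by dividing its row by the row's Euclidean norm.
scaled       <- scale(predictors)
row_norms    <- sqrt(rowSums(scaled^2))
spatial_sign <- scaled / row_norms
```

A log-type transformation only helps right-skewed predictors, so the table is worth checking column by column. The spatial sign works on all predictors jointly, which is why centering and scaling come first.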
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
Soybean %>% select(!Class) %>% drop_na() %>% gather() %>% ggplot(aes(value)) + geom_bar() + facet_wrap(~ key)
## Warning: attributes are not identical across measure variables; they will be
## dropped
Degenerate distributions are those in which a variable takes a single value for nearly all observations while the other values occur at a very low rate. Here ‘mycelium’, ‘sclerotia’, and ‘roots’ appear to be degenerate.
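The judgment above comes from the bar charts. One way to put a number on it is to compare the count of the most common level with the count of the second most common level for each predictor. The sketch below does this in base R. The helper `freq_ratio()` and the cutoff of 95/5 are assumptions for illustration (95/5 is the default `freqCut` in `caret::nearZeroVar()`), and the predictors it flags may differ from the ones picked out visually.

```r
library(mlbench)

data(Soybean)
predictors <- Soybean[, names(Soybean) != "Class"]

# Ratio of the most common level's count to the second most common level's count.
# table() ignores missing values by default.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2 || counts[2] == 0) return(Inf)
  as.numeric(counts[1] / counts[2])
}

ratios <- sort(sapply(predictors, freq_ratio), decreasing = TRUE)

# Hypothetical cutoff: flag predictors whose top level is at least 19 times
# as frequent as the runner-up.
cutoff <- 95 / 5
print(round(ratios[ratios > cutoff], 1))
```

A large ratio means one level dominates. Predictors near the cutoff are worth a second look at their bar charts before deciding to drop them.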
**(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?**
Soybean %>%
mutate(across(everything(), is.na)) %>%
pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
count(variables, missing) %>%
ggplot(aes(y = variables, x = n, fill = missing)) +
geom_col(position = "fill") +
labs(title = "Proportion of Missing Values",
x = "Proportion") +
scale_fill_manual(values = c("grey", "violet"))
Soybean %>%
group_by(Class) %>%
mutate(class_Total = n()) %>%
ungroup() %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
mutate(Missing = n(),
Proportion = Missing / class_Total) %>%
ungroup()%>%
select(Class, Proportion) %>%
distinct()
## # A tibble: 5 × 2
## Class Proportion
## <fct> <dbl>
## 1 phytophthora-rot 0.773
## 2 diaporthe-pod-&-stem-blight 1
## 3 cyst-nematode 1
## 4 2-4-d-injury 1
## 5 herbicide-injury 1
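There does seem to be a pattern: every observation with missing data belongs to one of five classes. In four of them (diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, and herbicide-injury) every sample has at least one missing value, and in phytophthora-rot about 77% of the samples do. Once those five classes are removed, no missing data remain.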
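To answer the per-predictor part of the question with numbers rather than a plot, a short base R sketch is given below. It is a simplified alternative to the tidyverse pipelines above, and the object names `missing_by_predictor` and `by_class` are made up for this example.

```r
library(mlbench)

data(Soybean)

# Share of missing values in each column, largest first.
missing_by_predictor <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
print(round(head(missing_by_predictor, 10), 3))

# Number of incomplete rows in each class, next to the class size.
incomplete <- !complete.cases(Soybean)
by_class <- data.frame(
  class      = levels(Soybean$Class),
  n          = as.vector(table(Soybean$Class)),
  incomplete = as.vector(table(Soybean$Class[incomplete]))
)
by_class$proportion <- by_class$incomplete / by_class$n
print(by_class[by_class$incomplete > 0, ], row.names = FALSE)
```

The second table should list the same five classes as the tibble above, with the class sizes added for context.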
**(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.**
Imputation lets us account for missing data without discarding it altogether. Candidate fill values for a column include its maximum, minimum, mean, or median. Of these, filling with the column mean would be the best approach for numeric predictors. Because most of the Soybean predictors are categorical, the mean is not defined for them, and the most frequent level (the mode) plays the same role.
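A minimal sketch of this kind of single-value imputation follows, assuming the mode is used for categorical predictors and the mean for any numeric ones. The helper `impute_simple()` is hypothetical and written for this example; it is one simple option, not the only reasonable strategy.

```r
library(mlbench)

data(Soybean)

# Fill missing entries with the column mean (numeric) or the most frequent
# observed level (anything else). Ties in the mode go to the first level.
impute_simple <- function(x) {
  if (!anyNA(x)) return(x)
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

soy_imputed <- Soybean
predictor_names <- setdiff(names(Soybean), "Class")
soy_imputed[predictor_names] <- lapply(Soybean[predictor_names], impute_simple)

# TRUE would mean some missing values survived the imputation.
anyNA(soy_imputed)
```

Since the missing values are concentrated in five classes, a fill value computed over the whole column mostly reflects the other classes. Computing the mode within each class, or dropping the affected predictors, may be worth comparing against this baseline.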