library(mlbench)
library(tidyverse)
library(caret)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Glass %>%
select(-Type) %>%
pivot_longer(cols = everything(), names_to = "predictors", values_to = "vals") %>%
ggplot(aes(x = vals))+
geom_histogram(bins = 30, fill = "coral", color = "black", alpha = 0.5)+
facet_wrap(~ predictors, scales = "free")+
theme_minimal()
library(corrplot)
## corrplot 0.92 loaded
# Remove the 'Type' column so only the numeric predictors remain
Glass_filtered <- Glass %>%
select(-Type)
# Calculate the correlation matrix
cor_matrix <- cor(Glass_filtered, use = "complete.obs")
# Create the correlation plot
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7,
col = colorRampPalette(c("darkgreen", "white", "coral"))(200))
Checking for skewness, there are a few clear examples of right skew, namely the predictors Ba, Fe, and K, while Mg exhibits a left skew.
Al, Ca, Si, Na, and RI exhibit the most centrality across the data.
As for correlation, Ca and RI, as well as Al and Fe, exhibit the strongest positive relationships, while Si and K, and Ba and Mg, have the strongest negative relationships.
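These skewness observations can be checked numerically; a minimal sketch using e1071::skewness() (the e1071 package is assumed to be available, as it is not loaded elsewhere in this write-up; positive values indicate right skew, negative values left skew):
library(e1071)
# Skewness of each numeric predictor; positive = right skew, negative = left skew
Glass %>%
  select(-Type) %>%
  summarise(across(everything(), skewness)) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "skewness") %>%
  arrange(desc(skewness))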
# Using the same long-format reshape as in part (a):
Glass_long <- Glass %>% select(-Type) %>%
pivot_longer(cols = everything(), names_to = "predictors", values_to = "vals")
ggplot(Glass_long, aes(x = predictors, y = vals)) +
geom_boxplot(fill = "coral", color = "black", alpha = 0.7) +
labs(title = "Boxplots for All Variables in the Glass Dataset",
x = "Variable",
y = "Value") +
theme_minimal() +
facet_wrap(~ predictors, scales = "free")
* Nearly all of the predictors in this dataset contain significant outliers;
Mg is the main exception (a rough count using the boxplot rule is sketched below).
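A minimal sketch counting how many values of each predictor fall beyond the 1.5 * IQR boxplot fences:
# Count values flagged by the 1.5 * IQR boxplot rule for each predictor
Glass %>%
  select(-Type) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "vals") %>%
  group_by(predictor) %>%
  summarise(n_outliers = sum(vals < quantile(vals, 0.25) - 1.5 * IQR(vals) |
    vals > quantile(vals, 0.75) + 1.5 * IQR(vals)))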
bc_glass <- Glass %>%
select( -Type) %>%
preProcess(method = c("BoxCox"))
bc_glass
## Created from 214 samples and 5 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - ignored (0)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
There are a few different transformations that could work for this dataset. To start, a Box-Cox transformation could be very helpful here, since it can stabilize skewness and non-normal distributions across the dataset (only a few variables look close to normally distributed). A square-root transformation could also be used; however, this problem has Box-Cox written all over it, given the variability, skewness, and outliers present in the data.
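To actually apply the estimated transformations, the preProcess object can be passed to predict(); a minimal sketch using the bc_glass object above. (Box-Cox requires strictly positive values, so predictors containing zeros are skipped, which is why only five lambda estimates are reported; caret's "YeoJohnson" method is one alternative that tolerates zeros.)
# Apply the Box-Cox transformations estimated by preProcess() to the predictors
glass_transformed <- predict(bc_glass, Glass %>% select(-Type))
head(glass_transformed)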
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
library(ggplot2)
library(mlbench)
data(Soybean)
Soybean %>%
select(-Class) %>%
gather() %>%
ggplot(aes(x = value))+
geom_bar()+
facet_wrap(~ key)+
ggtitle(label = "Soybean Categorical Dispersion")
## Warning: attributes are not identical across measure variables; they will be
## dropped
A degenerate distribution is one that places essentially all of its probability on a single value (a single category in this instance); this is also referred to as a constant distribution. There do seem to be quite a few predictor distributions in the dataset that are (nearly) degenerate; rather than list them by eye, they can be flagged programmatically, as sketched below.
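A minimal sketch using caret's nearZeroVar(), which flags predictors whose distributions are (nearly) constant:
# Flag predictors with (near-)zero variance, i.e. near-degenerate distributions
nzv <- nearZeroVar(Soybean %>% select(-Class), saveMetrics = TRUE)
nzv[nzv$nzv, ]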
Dealing with missing data is tricky and depends on what you are trying to address and achieve with your research. I have not been able to find any firm best practice on when missingness justifies dropping a variable (i.e., "this column is missing more than 15% of its data, so drop it"), so I think imputation is the way to go here. To impute across the missing predictors I will use MICE; the basic code is found here: https://libguides.princeton.edu/R-Missingdata. The imputation will use predictive mean matching (PMM) to fill in the missing values.
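Before imputing, it helps to quantify how much of each predictor is actually missing; a minimal sketch of the proportion of NAs per column:
# Proportion of missing values in each column, highest first
Soybean %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "prop_missing") %>%
  arrange(desc(prop_missing))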
library(mice)
## Warning: package 'mice' was built under R version 4.3.3
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
imputed_data <- mice(Soybean, m=5, method = "pmm", print=FALSE)
## Warning: Number of logged events: 1666
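The warning about logged events can be inspected through the returned mids object, which records skipped or adjusted steps in its loggedEvents component:
# Inspect what mice logged during imputation (first few events)
head(imputed_data$loggedEvents)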
Just pulling out the first completed dataset created by PMM as an example:
complete_data_1 <- complete(imputed_data, action = 1)
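A quick sanity check that the completed dataset no longer contains missing values:
# Should return 0 if every NA was imputed
sum(is.na(complete_data_1))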