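# Load the packages and data used below
library(mlbench)    # Glass and Soybean data sets
library(tidyverse)  # dplyr, tidyr, ggplot2
library(caret)      # BoxCoxTrans()
data(Glass)
# Check variables on dataset
head(Glass)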
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
# Histograms of the numeric predictors, one panel per variable
Glass %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(col = 'red', bins = 10) +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Predictor Variables")
# Boxplots of the same predictors
Glass %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot(col = 'orange') +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Predictor Variables")
Based on the histograms, Na is the closest to a normal distribution, with only a slight right skew. Most of the other variables are right skewed. The exceptions are Mg, which is left skewed, and Si, which could be considered bimodal with a slight left skew. There also seems to be a strong correlation between Ba, Na, and RI, although these plots cannot show that directly and it would need to be confirmed with a correlation matrix.
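The visual impressions of skewness and correlation can be checked numerically. The following is only a minimal base R sketch: `sample_skewness` is a hypothetical helper that is not part of the original analysis, and it uses the simple moment-based definition, so its values may differ slightly from the skewness reported by other packages. The `mlbench` part runs only if that package is installed.

```r
# Moment-based sample skewness: third central moment over the cubed
# (population) standard deviation. Positive = right skew, negative = left skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy check on values with a long right tail (illustrative, not Glass data)
sample_skewness(c(1, 1, 2, 2, 3, 10))

# If mlbench is available, apply the helper to every numeric Glass column
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  numeric_cols <- Glass[sapply(Glass, is.numeric)]
  print(round(sort(sapply(numeric_cols, sample_skewness)), 2))
  # Pairwise correlations for the three variables suspected to be related
  print(round(cor(numeric_cols[, c("RI", "Na", "Ba")]), 2))
}
```

A skewness value near zero suggests a roughly symmetric variable, and correlations close to 1 or -1 would support the suspected relationship.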
The boxplots reveal a considerable number of outliers for all variables except Mg and Fe.
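The outlier impression can also be turned into a count per variable. This sketch assumes the usual 1.5 × IQR whisker rule as implemented by base R's `boxplot.stats()`; `count_outliers` is a hypothetical helper name, and ggplot2 may compute the quartiles slightly differently, so the counts need not match the plotted points exactly.

```r
# Number of points beyond the boxplot whiskers (1.5 * IQR from the hinges)
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Toy check: one value far above the rest (illustrative, not Glass data)
count_outliers(c(1, 2, 3, 4, 5, 100))

# If mlbench is available, count outliers in every numeric Glass column
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  numeric_cols <- Glass[sapply(Glass, is.numeric)]
  print(sort(sapply(numeric_cols, count_outliers), decreasing = TRUE))
}
```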
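Based on what I have learned in this course so far, I would suggest a Box-Cox transformation to reduce the skewness of the variables and obtain more symmetric, closer-to-normal distributions. Because Box-Cox requires strictly positive values, it cannot be applied directly to variables that contain zeros, such as Ba and Fe.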
par(mfrow=c(1,2))
BoxCoxTrans(Glass$Si)
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   69.81   72.28   72.79   72.65   73.09   75.41
##
## Largest/Smallest: 1.08
## Sample Skewness: -0.72
##
## Estimated Lambda: 2
hist(Glass$Si, main='Original Distribution of Si')
# Use the estimated lambda (2), not 0.5: Box-Cox is (x^lambda - 1) / lambda
hist((Glass$Si^2 - 1) / 2, main='Transformed (Lambda = 2)')
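To make the transformation step explicit, here is a minimal base R sketch of the Box-Cox formula itself. `box_cox` is a hypothetical helper written for illustration, not a caret function, and the six Si values are the ones printed by `head(Glass)` above.

```r
# Box-Cox transform for a given lambda; only defined for strictly positive data
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (abs(lambda) < 1e-8) {
    log(x)                     # limiting case as lambda approaches 0
  } else {
    (x^lambda - 1) / lambda
  }
}

# First six Si values from the head(Glass) output
si_head <- c(71.78, 72.73, 72.99, 72.61, 73.08, 72.97)

# Apply the lambda estimated by BoxCoxTrans for Si
box_cox(si_head, lambda = 2)
```

The positivity check fails for a column that contains zeros, which is one reason the transformation cannot be applied blindly to every Glass predictor.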
library(mlbench)
data(Soybean)
head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0
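# Convert every factor predictor to a number so it can be binned.
# Note: as.numeric() on a factor returns 1-based level codes, not the original labels.
# date is kept as an identifier column here, so it does not get its own panel.
New_Soybean <- Soybean %>%
  select(-Class) %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  pivot_longer(cols = -c(date), names_to = "name", values_to = "value")
# Plot
ggplot(New_Soybean, aes(value)) +
  geom_histogram(col = 'red', bins = 10) +
  facet_wrap(vars(name)) +
  ggtitle("Soybean Variables Histogram")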
## Warning: Removed 2336 rows containing non-finite outside the scale range
## (`stat_bin()`).
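Some of the predictors in this dataset have missing values, which is why the histogram call warns that 2336 non-finite values were removed. Others are imbalanced, with most observations falling into a single level. One possible reason for this impression is that each predictor is being examined on its own, one variable at a time.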
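Both observations can be quantified per predictor. The sketch below is a minimal base R example under stated assumptions: `missing_share` and `top_level_share` are hypothetical helpers, the toy data frame holds made-up values rather than Soybean rows, and the `mlbench` part runs only if the package is installed.

```r
# Share of missing values in each column, largest first
missing_share <- function(df) sort(colMeans(is.na(df)), decreasing = TRUE)

# Share of non-missing observations that fall in the most common level
# (values close to 1 indicate a heavily imbalanced predictor)
top_level_share <- function(x) max(table(x)) / sum(!is.na(x))

# Toy data frame (illustrative values only)
toy <- data.frame(
  a = factor(c("0", "0", "0", "0", "1", NA)),
  b = factor(c("0", "1", "2", NA, NA, NA))
)
missing_share(toy)
sapply(toy, top_level_share)

# If mlbench is available, list the ten most affected Soybean predictors
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  predictors <- Soybean[, names(Soybean) != "Class"]
  print(round(head(missing_share(predictors), 10), 3))
  print(round(head(sort(sapply(predictors, top_level_share),
                        decreasing = TRUE), 10), 3))
}
```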
There are several ways to handle missing data. The mice package in R is one of them: it imputes each variable that has missing data using its own model. KNN imputation is another option: it fills in a missing predictor value using the values of the most similar observations. However, it is recommended to eliminate variables with a high percentage of missing values rather than impute them.
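As a rough illustration of these options, the sketch below first drops sparse columns and then runs a chained-equation imputation. The assumptions are: `drop_sparse` is a hypothetical helper, its 0.5 cutoff is arbitrary rather than a recommended value, and the imputation is shown on `nhanes`, the small example data set that ships with mice, instead of Soybean to keep the example quick. The mice part runs only if that package is installed.

```r
# Drop columns whose share of missing values exceeds a cutoff.
# The 0.5 default is an arbitrary illustration, not a recommended value.
drop_sparse <- function(df, max_missing = 0.5) {
  df[, colMeans(is.na(df)) <= max_missing, drop = FALSE]
}

# Toy data frame (illustrative values only)
toy <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(NA, NA, NA, 1, 2),
  z = 1:5
)
names(drop_sparse(toy))  # y is 60% missing and is dropped

# Chained-equation imputation on mice's built-in nhanes example data
if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(mice::nhanes, m = 1, printFlag = FALSE, seed = 1)
  completed <- mice::complete(imp)
  print(colSums(is.na(completed)))  # remaining missing values per column
}
```

For KNN imputation, caret's `preProcess()` offers `method = "knnImpute"`, but it works on numeric predictors and also centers and scales them, so the Soybean factors would have to be recoded first.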