The plot below, a density plot overlaid on a histogram for each variable, shows that more variables than those identified by the skewness statistics appear to have skewed distributions. Some, like Ca and RI, may benefit from transformation; others, such as Ba and Fe, may not benefit even after transformation because their distributions are too far from normal.
Glass %>%
keep(is.numeric) %>%
gather() %>%
group_by(key) %>%
ggplot(data = ., aes(value)) +
geom_histogram(bins = 30, aes(y = ..density..)) +
geom_density(alpha = 0.3, color = NA, fill = 'yellow') +
facet_wrap(~key, scales = 'free') +
labs(x = NULL, y = NULL) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
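As a quick numeric cross-check of that visual impression, the predictors can be ranked by skewness directly (an illustrative sketch using the same skewness() function as the summary statistics further below):
# rank the numeric predictors by skewness (illustrative check)
Glass %>%
keep(is.numeric) %>%
sapply(skewness, na.rm = TRUE) %>%
sort(decreasing = TRUE) %>%
round(2)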
The correlation matrix below shows the pairwise Pearson correlation coefficients between all variables in the data. The first column, which represents the relationship between the response variable and each predictor, further supports the observations made from the scatterplots. The matrix also shows relationships among some of the predictors; the strongest correlation is between Ca and RI.
# Calculate pairwise pearson correlation and display as upper matrix plot
Glass %>%
keep(is.numeric) %>%
cor(method = 'pearson', use = 'pairwise.complete.obs') %>%
ggcorrplot(corr = ., method = 'square', type = 'upper'
, lab = TRUE, lab_size = 3, lab_col = 'grey20')
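To confirm the strongest pair numerically, the largest off-diagonal correlation can be located directly (an illustrative sketch; corMat is a temporary object introduced only for this check):
# locate the largest off-diagonal correlation (confirms the Ca-RI pair noted above)
corMat <- Glass %>% keep(is.numeric) %>% cor(method = 'pearson', use = 'pairwise.complete.obs')
diag(corMat) <- NA
which(abs(corMat) == max(abs(corMat), na.rm = TRUE), arr.ind = TRUE)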
The histograms and density plots show that some of the predictors have long tails, indicative of outliers. The plot below shows a boxplot and violin plot for each variable. Many of the variables appear to have outliers, with predictors like Ba, Fe and K appearing to have the most potential outliers. How these outliers are handled will depend on a mix of reference data and statistical techniques.
# boxplot with violin plot overlaid for all variables
Glass %>%
keep(is.numeric) %>%
gather() %>%
group_by(key) %>%
ggplot(data = ., aes(x = '', y = value)) +
geom_boxplot() +
geom_violin(alpha = 0.3, color = NA, fill = 'lightgreen') +
labs(x = NULL, y = NULL) +
theme(axis.ticks.y=element_blank()) +
facet_wrap(~key, scales = 'free', ncol = 2) +
coord_flip()
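To quantify that visual impression, a rough count of points outside the boxplot whiskers (the usual 1.5 x IQR rule) can be made per predictor; this is an illustrative sketch, not part of the outlier handling itself:
# count points beyond 1.5*IQR of the quartiles for each predictor (illustrative only)
sapply(Glass %>% keep(is.numeric), function(x) {
qs <- quantile(x, c(.25, .75), na.rm = TRUE)
fence <- 1.5 * IQR(x, na.rm = TRUE)
sum(x < qs[1] - fence | x > qs[2] + fence, na.rm = TRUE)
})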
The following tables are the summary statistics for all of the numeric variables in the dataset. They have been broken up into two tables for ease of reading. While some aspects of the variables' distributions are easier to see in the visualizations, two statistics stand out: K has strongly positive skewness and kurtosis (so many outliers are expected for this variable), which may need to be addressed before developing any type of model; other predictors show skewness and kurtosis to a lesser degree.
From https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm, we know a normal distribution has a skewness of 0 and a kurtosis of 3. The kurtosis values reported below are excess kurtosis (kurtosis minus 3), so values near 0 for both statistics indicate an approximately normal shape.
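As a quick sanity check of that benchmark (an illustrative simulation, not part of the analysis), a large normal sample should give skewness and excess kurtosis both near 0 with the functions used here:
# skewness and excess kurtosis of a simulated normal sample should both be near 0
set.seed(1)
z <- rnorm(1e5)
c(skewness = skewness(z), kurtosis = kurtosis(z))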
statFuns <-
funs(missing = sum(is.na(.))
, min = min(., na.rm = TRUE)
, Q1 = quantile(., .25, na.rm = TRUE)
, mean = mean(., na.rm = TRUE)
, median = median(., na.rm = TRUE)
, Q3 = quantile(., .75, na.rm = TRUE)
, max = max(., na.rm = TRUE)
, sd = sd(., na.rm = TRUE)
, mad = mad(., na.rm = TRUE)
, skewness = skewness(., na.rm = TRUE)
, kurtosis = kurtosis(., na.rm = TRUE)
)
# Create data frame of basic summary stats
dfsum <-
Glass %>%
keep(is.numeric) %>%
summarise_all(statFuns) %>%
gather() %>%
separate(key, c('metric', 'stat'), sep = '(_)(?!.*_)') %>%
spread(stat, value) %>%
dplyr::select(metric, names(statFuns))
# print summary stats as two tables for easier reading
# stats for table 1
vTbl1Stats <- c('metric', 'min', 'Q1', 'mean', 'median', 'Q3', 'max')
pandoc.table(dfsum[, vTbl1Stats], missing = '-')
##
## ------------------------------------------------------------
## metric min Q1 mean median Q3 max
## -------- ------- -------- --------- -------- ------- -------
## Al 0.29 1.19 1.445 1.36 1.63 3.5
##
## Ba 0 0 0.175 0 0 3.15
##
## Ca 5.43 8.24 8.957 8.6 9.172 16.19
##
## Fe 0 0 0.05701 0 0.1 0.51
##
## K 0 0.1225 0.4971 0.555 0.61 6.21
##
## Mg 0 2.115 2.685 3.48 3.6 4.49
##
## Na 10.73 12.91 13.41 13.3 13.83 17.38
##
## RI 1.511 1.517 1.518 1.518 1.519 1.534
##
## Si 69.81 72.28 72.65 72.79 73.09 75.41
## ------------------------------------------------------------
pandoc.table(dfsum[
, c('metric', setdiff(colnames(dfsum), vTbl1Stats))], missing = '-')
##
## --------------------------------------------------------------
## metric missing sd mad skewness kurtosis
## -------- --------- ---------- ---------- ---------- ----------
## Al 0 0.4993 0.3113 0.8946 1.938
##
## Ba 0 0.4972 0 3.369 12.08
##
## Ca 0 1.423 0.6598 2.018 6.41
##
## Fe 0 0.09744 0 1.73 2.52
##
## K 0 0.6522 0.1705 6.46 52.87
##
## Mg 0 1.442 0.3039 -1.136 -0.4527
##
## Na 0 0.8166 0.6449 0.4478 2.898
##
## RI 0 0.003037 0.001875 1.603 4.717
##
## Si 0 0.7745 0.5708 -0.7202 2.816
## --------------------------------------------------------------
Unknown outliers will be identified using the Median Absolute Deviation (MAD); as a measure of central tendency, the median is less affected by extreme values than the mean. Any value more than three MADs from the median will be considered an outlier and dropped, and the outlier-removed data will be stored in a separate dataframe. This rule removes about 59% of the observations, which is far too large a loss to address through outlier removal, so we will keep using the original dataframe for the next question.
# create function to evaluate whether a value is an outlier
Glass_num <- Glass %>% keep(is.numeric)
Glass_num$index <- c(1:nrow(Glass_num))
madOutlier <- function(x, cutoff = 3){
madrange <- mad(x)*c(-1,1)*cutoff
lower <- x < median(x) + madrange[1]
upper <- x > median(x) + madrange[2]
return(lower | upper)
}
# flag outliers in each predictor column (excluding the row index)
dfOutlier <- as.data.frame(apply(Glass_num[, names(Glass_num) != 'index'], 2, madOutlier))
# retain INDEX of outliers to remove
outlierIndex <-
Glass_num$index[
apply(
dfOutlier
, 1, any)]
# df after outliers are removed
Glass_num_mad <- Glass_num[!(Glass_num$index %in% outlierIndex),]
round( length(outlierIndex) / nrow(Glass_num) * 100, 2)
## [1] 59.35
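The per-variable flag counts explain why the loss is so large: Ba and Fe have a MAD of 0 (their medians dominate), so any value different from their medians gets flagged under this rule. The counts below are an illustrative diagnostic based on the dfOutlier frame built above:
# number of rows flagged by the 3-MAD rule for each predictor (illustrative diagnostic)
sort(colSums(dfOutlier), decreasing = TRUE)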
The data contains enough observations that some deviation from a normal distribution is acceptable.
Box-Cox does not allow zero values, so a constant of 1 was added to the zero-inflated variables Ba, K, Mg and Fe before applying the method. Except for Al, Ca and Na, most of the variables did not become appreciably more normal after Box-Cox.
Yeo-Johnson works like Box-Cox, but unlike Box-Cox it also accommodates zero and negative values.
From the summary statistics below, we can see that Yeo-Johnson left the zero-inflated variables such as Ba and Fe untransformed, as well as the already fairly normal variables such as Si and RI.
In general, Yeo-Johnson transforms most of the variables more effectively than Box-Cox.
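For reference, a minimal sketch of the Yeo-Johnson transform for a single, fixed lambda shows how zero and negative values remain defined (caret's preProcess estimates lambda itself; this helper is purely illustrative and not the caret implementation):
# illustrative Yeo-Johnson transform for a fixed lambda
yeo_johnson <- function(y, lambda) {
out <- numeric(length(y))
pos <- y >= 0
out[pos] <- if (abs(lambda) > 1e-8) ((y[pos] + 1)^lambda - 1) / lambda else log(y[pos] + 1)
out[!pos] <- if (abs(lambda - 2) > 1e-8) -(((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda)) else -log(-y[!pos] + 1)
return(out)
}
# e.g. yeo_johnson(c(-1, 0, 2.5), lambda = 0.5) is defined at zero and negative inputs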
# apply box-cox power transformations to heavily skewed variables
# TODO revise to use either lapply & function or purrr map commands
# apply yeo-johnson -> it works for negative or zero values
pre <- preProcess(Glass, method = 'YeoJohnson')
yj <- predict(pre,Glass)
Glass$Ba_yj <- yj$Ba
Glass$K_yj <- yj$K
Glass$Mg_yj <- yj$Mg
Glass$Fe_yj <- yj$Fe
Glass$Al_yj <- yj$Al
Glass$Ca_yj <- yj$Ca
Glass$Na_yj <- yj$Na
Glass$RI_yj <- yj$RI
Glass$Si_yj <- yj$Si
# These variables contain zero values, so add 1 to allow the Box-Cox transform
Ba = BoxCoxTrans(Glass$Ba + 1) #as.numeric(powerTransform(Glass$Ba + 1)$lambda)
K = BoxCoxTrans(Glass$K + 1) #as.numeric(powerTransform(Glass$K + 1)$lambda)
Mg = BoxCoxTrans(Glass$Mg + 1) #as.numeric(powerTransform(Glass$Mg + 1)$lambda)
Fe = BoxCoxTrans(Glass$Fe + 1) #as.numeric(powerTransform(Glass$Fe + 1)$lambda)
Al = BoxCoxTrans(Glass$Al) #as.numeric(powerTransform(Glass$Al)$lambda)
Ca = BoxCoxTrans(Glass$Ca) #as.numeric(powerTransform(Glass$Ca)$lambda)
Na = BoxCoxTrans(Glass$Na) #as.numeric(powerTransform(Glass$Na)$lambda)
RI = BoxCoxTrans(Glass$RI) #as.numeric(powerTransform(Glass$RI)$lambda)
Si = BoxCoxTrans(Glass$Si) #as.numeric(powerTransform(Glass$Si)$lambda)
Glass$Ba_BC <-
predict(Ba,Glass$Ba+1)
Glass$Ca_BC <-
predict(Ca,Glass$Ca)
Glass$K_BC <-
predict(K,Glass$K+1)
Glass$Mg_BC <-
predict(Mg,Glass$Mg+1)
Glass$Na_BC <-
predict(Na,Glass$Na)
Glass$RI_BC <-
predict(RI,Glass$RI)
Glass$Si_BC <-
predict(Si,Glass$Si)
Glass$Fe_BC <-
predict(Fe,Glass$Fe+1)
Glass$Al_BC <-
predict(Al,Glass$Al)
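The lambda each Box-Cox fit selected can be read from the BoxCoxTrans objects above (a quick illustrative check; caret reports NA when no suitable lambda was found, in which case predict() returns the data unchanged):
# estimated Box-Cox lambdas for each variable (illustrative check)
sapply(list(Ba = Ba, K = K, Mg = Mg, Fe = Fe, Al = Al, Ca = Ca, Na = Na, RI = RI, Si = Si)
, function(bc) bc$lambda)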
# plot histogram to show impact of box-cox & YeoJohnson
Glass %>%
dplyr::select(matches('Ba|Ca|K|Mg|Na')) %>%
gather() %>%
ggplot(data = ., aes(value)) +
geom_histogram(bins = 30) +
facet_wrap(~key, ncol = 3, scales = 'free')
Glass %>%
dplyr::select(matches('RI|Si|Fe|Al')) %>%
gather() %>%
ggplot(data = ., aes(value)) +
geom_histogram(bins = 30) +
facet_wrap(~key, ncol = 3, scales = 'free')
# Let's re-run the stat summary again
# Create data frame of basic summary stats
dfsum <-
Glass %>%
keep(is.numeric) %>%
summarise_all(statFuns) %>%
gather() %>%
separate(key, c('metric', 'stat'), sep = '(_)(?!.*_)') %>% # split at the last underscore so names like Al_BC stay intact
spread(stat, value) %>%
dplyr::select(metric, names(statFuns))
# print summary stats as two tables for easier reading
# stats for table 1
vTbl1Stats <- c('metric', 'min', 'Q1', 'mean', 'median', 'Q3', 'max')
pandoc.table(dfsum[, vTbl1Stats], missing = '-')
##
## ----------------------------------------------------------------
## metric min Q1 mean median Q3 max
## -------- -------- -------- --------- -------- --------- --------
## Al 0.29 1.19 1.445 1.36 1.63 3.5
##
## Al_BC -0.923 0.1817 0.3685 0.3324 0.5534 1.742
##
## Al_yj 0.2547 0.7849 0.8753 0.8598 0.9685 1.508
##
## Ba 0 0 0.175 0 0 3.15
##
## Ba_BC 0 0 0.05545 0 0 0.471
##
## Ba_yj 0 0 0.175 0 0 3.15
##
## Ca 5.43 8.24 8.957 8.6 9.172 16.19
##
## Ca_BC 0.7677 0.8197 0.8256 0.8238 0.8297 0.8666
##
## Ca_yj 0.6778 0.7007 0.7033 0.7025 0.7051 0.7212
##
## Fe 0 0 0.05701 0 0.1 0.51
##
## Fe_BC 0 0 0.04289 0 0.08678 0.2807
##
## Fe_yj 0 0 0.05701 0 0.1 0.51
##
## K 0 0.1225 0.4971 0.555 0.61 6.21
##
## K_BC 0 0.1091 0.2771 0.3569 0.3789 0.8613
##
## K_yj 0 0.1092 0.2781 0.3581 0.3802 0.8709
##
## Mg 0 2.115 2.685 3.48 3.6 4.49
##
## Mg_BC 0 4.353 7.323 9.535 10.08 14.57
##
## Mg_yj 0 5.231 9.486 12.35 13.13 19.72
##
## Na 10.73 12.91 13.41 13.3 13.83 17.38
##
## Na_BC 2.373 2.558 2.594 2.588 2.626 2.855
##
## Na_yj 1.973 2.079 2.1 2.096 2.118 2.245
##
## RI 1.511 1.517 1.518 1.518 1.519 1.534
##
## RI_BC 0.281 0.2826 0.2831 0.2829 0.2833 0.2875
##
## RI_yj 1.511 1.517 1.518 1.518 1.519 1.534
##
## Si 69.81 72.28 72.65 72.79 73.09 75.41
##
## Si_BC 2436 2612 2639 2649 2670 2843
##
## Si_yj 69.81 72.28 72.65 72.79 73.09 75.41
## ----------------------------------------------------------------
pandoc.table(dfsum[
, c('metric', setdiff(colnames(dfsum), vTbl1Stats))], missing = '-')
##
## -----------------------------------------------------------------
## metric missing sd mad skewness kurtosis
## -------- --------- ----------- ----------- ----------- ----------
## Al 0 0.4993 0.3113 0.8946 1.938
##
## Al_BC 0 0.4134 0.2644 0.09106 1.168
##
## Al_yj 0 0.2005 0.1316 0.0002128 0.9978
##
## Ba 0 0.4972 0 3.369 12.08
##
## Ba_BC 0 0.131 0 2.136 2.92
##
## Ba_yj 0 0.4972 0 3.369 12.08
##
## Ca 0 1.423 0.6598 2.018 6.41
##
## Ca_BC 0 0.01254 0.00717 -0.194 4.291
##
## Ca_yj 0 0.00553 0.003179 -0.2064 4.247
##
## Fe 0 0.09744 0 1.73 2.52
##
## Fe_BC 0 0.06896 0 1.327 0.453
##
## Fe_yj 0 0.09744 0 1.73 2.52
##
## K 0 0.6522 0.1705 6.46 52.87
##
## K_BC 0 0.1711 0.06567 -0.08698 0.1301
##
## K_yj 0 0.1721 0.06618 -0.07082 0.1747
##
## Mg 0 1.442 0.3039 -1.136 -0.4527
##
## Mg_BC 0 4.161 1.392 -0.9274 -0.7869
##
## Mg_yj 0 5.467 1.984 -0.8771 -0.8481
##
## Na 0 0.8166 0.6449 0.4478 2.898
##
## Na_BC 0 0.06062 0.04903 0.03385 2.5
##
## Na_yj 0 0.03426 0.02774 -0.008848 2.49
##
## RI 0 0.003037 0.001875 1.603 4.717
##
## RI_BC 0 0.0008634 0.0005372 1.566 4.553
##
## RI_yj 0 0.003037 0.001875 1.603 4.717
##
## Si 0 0.7745 0.5708 -0.7202 2.816
##
## Si_BC 0 56.06 41.55 -0.6509 2.781
##
## Si_yj 0 0.7745 0.5708 -0.7202 2.816
## -----------------------------------------------------------------
The variables with degenerate distributions are those with near-zero variance: they are dominated by a single value (here, mostly zeroes), so they have very few unique values and a very high ratio of the most frequent value to the next most frequent. Those variables are leaf.mild, mycelium and sclerotia.
paste("Degenerate variables:", paste(names(Soybean[,nearZeroVar(Soybean)]),collapse = ', '))
## [1] "Degenerate variables: leaf.mild, mycelium, sclerotia"
Soybean[2:15] %>%
gather() %>%
group_by(key) %>%
ggplot(data = ., aes(value)) +
stat_count() +
facet_wrap(~key, scales = 'free') +
labs(x = NULL, y = NULL) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Soybean[16:length(Soybean)] %>%
gather() %>%
group_by(key) %>%
ggplot(data = ., aes(value)) +
stat_count() +
facet_wrap(~key, scales = 'free') +
labs(x = NULL, y = NULL) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Soybean[1] %>%
gather() %>%
group_by(key) %>%
ggplot(data = ., aes(value)) +
stat_count() +
facet_wrap(~key, scales = 'free') +
labs(x = NULL, y = NULL) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Mode and KNN imputation will be used for the missing categorical values. As the plots below show, mode imputation replaces NAs with the most frequently occurring level, so that level becomes heavily over-represented while all other levels are under-represented; the distribution of the imputed data can therefore be highly biased. KNN imputation, in contrast, spreads the imputed values across the levels more evenly and so minimizes this bias.
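A tiny, hypothetical example (toy data, not from the Soybean set) illustrates why mode imputation biases the distribution:
# toy factor with missing values: all NAs collapse onto the majority level
x <- factor(c('a', 'a', 'a', 'b', 'c', NA, NA, NA))
x_mode <- x
x_mode[is.na(x_mode)] <- names(sort(table(x_mode), decreasing = TRUE))[1]
table(x)      # a: 3, b: 1, c: 1 (NAs excluded)
table(x_mode) # a: 6, b: 1, c: 1 -- 'a' is now over-represented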
# subset columns for imputation
dfimpute <- subset(Soybean, select = - c(Class, leaves) )
df <- dfimpute
# perform mode imputation: replace NAs with the most frequent level of each categorical column
catCols <- sapply(df, function(x) !is.numeric(x))
df[, catCols] <- apply(df[, catCols], 2, function(x) {
x[is.na(x)] <- names(sort(table(x), decreasing = TRUE)[1])
x })
# create dataframe for plotting imputation impact
plotlist <- list()
for (i in 1:length(names(df))){
t = nrow(unique(df[i]))
name <- names(df[i])
missingness <- c(rep("Pre Imputation", t), rep("Post Imputation", t)) # Pre/post imputation
Category <- as.factor(rep(names(table(dfimpute[i])), 2)) # Categories
Count <- c(as.numeric(table(dfimpute[i])), as.numeric(table(df[i]))) # Count of categories
plotlist[[i]] <- data.frame(name, missingness, Category, Count) }
alldata <- do.call(rbind, plotlist)
plot1 = alldata[alldata$name %in% names(df)[1:15],]
plot2 = alldata[alldata$name %in% names(df)[16:length(df)],]
# let's see Pre vs Post imputation
ggplot(plot1, aes(Category, Count, fill = missingness)) + # Create plot
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set2") +
facet_wrap(~name, scales = 'free') +
theme(legend.title = element_blank()) +
ggtitle("Mode Imputation")
ggplot(plot2, aes(Category, Count, fill = missingness)) + # Create plot
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set2") +
facet_wrap(~name, scales = 'free') +
theme(legend.title = element_blank()) +
ggtitle("Mode Imputation")
# perform KNN imputation
knnOutput <- knnImputation(dfimpute)
df <- knnOutput
# create dataframe for plotting imputation impact
plotlist <- list()
for (i in 1:length(names(df))){
t = nrow(unique(df[i]))
name <- names(df[i])
missingness <- c(rep("Pre Imputation", t), rep("Post Imputation", t)) # Pre/post imputation
Category <- as.factor(rep(names(table(dfimpute[i])), 2)) # Categories
Count <- c(as.numeric(table(dfimpute[i])), as.numeric(table(df[i]))) # Count of categories
plotlist[[i]] <- data.frame(name, missingness, Category, Count) }
alldata <- do.call(rbind, plotlist)
plot1 = alldata[alldata$name %in% names(df)[1:15],]
plot2 = alldata[alldata$name %in% names(df)[16:length(df)],]
# let's see Pre vs Post imputation
ggplot(plot1, aes(Category, Count, fill = missingness)) + # Create plot
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set2") +
facet_wrap(~name, scales = 'free') +
theme(legend.title = element_blank()) +
ggtitle("KNN Imputation")
ggplot(plot2, aes(Category, Count, fill = missingness)) + # Create plot
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set2") +
facet_wrap(~name, scales = 'free') +
theme(legend.title = element_blank()) +
ggtitle("KNN Imputation")