library(caret)
library(corrplot)
library(DMwR2)
library(dplyr)
library(GGally)
library(ggplot2)
library(knitr)
library(psych)
library(mlbench)
library(tidyr)
data(Glass)
data(Soybean)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

3.1
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Several predictors are skewed: Mg is left skewed, while Al, K, Ca, Ba, and Fe are right skewed. Ca and RI are strongly positively correlated. Most observations fall in Type classes 1, 2, and 7.

# Histogram of each predictor, one free-scaled facet per variable
p <- Glass[,1:9]
p %>%
  gather() %>%
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(bins = 30)

# Density plots give a smoother view of the same distributions
p %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_density() +
  facet_wrap(~key, scales = 'free')

# Pairwise scatter plots of the predictors
pairs(p)

# Correlation matrix of the predictors; Ca and RI show the strongest
# positive correlation
r <- cor(p)
corrplot.mixed(r,
               lower.col = "red",
               number.cex = .8,
               mar = c(0, 0, 1, 0))

  (b) Do there appear to be any outliers in the data? Are any predictors skewed?
    Yes to both. The boxplots below show outliers in K, Ba, Fe, Ca, Na, and RI, and the skewness chart confirms that K, Ba, and Fe are strongly right skewed while Mg is left skewed.
# Boxplots of the raw predictors; Si and Na dominate because the
# predictors are on very different scales
p %>%
  gather() %>%
  ggplot(aes(x=key,y=value,color=key)) +
    geom_boxplot()

# Divide each column by its own standard deviation so the boxplots are on
# a comparable scale (scale() applies the divisor column-wise)
pred.norm <- as.data.frame(scale(p, center = FALSE, scale = apply(p, 2, sd)))
pred.norm %>%
  gather() %>%
  ggplot(aes(x=key,y=value,color=key)) +
    geom_boxplot()

# Bar chart of per-predictor skewness, computed with psych::describe()
pr <- describe(p)
ggplot(pr, aes(x = row.names(pr), y = skew)) +
  geom_col()

  (c) Are there any relevant transformations of one or more predictors that might improve the classification model?
    Yes. A Box-Cox transformation can reduce the skew in the strictly positive predictors (RI, Na, Al, Si, Ca). K, Ba, and Fe contain zeros, so for them a Yeo-Johnson transformation or log(x + 1) is more appropriate, since Box-Cox and a plain log require positive values.
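
A minimal sketch with caret's preProcess() (the particular combination of methods here is an assumption, not something the exercise prescribes):

# Yeo-Johnson handles the zero-valued columns (K, Ba, Fe) that Box-Cox cannot;
# centering and scaling are added because most classifiers benefit from them
pp <- preProcess(Glass[, 1:9], method = c("YeoJohnson", "center", "scale"))
glass.trans <- predict(pp, Glass[, 1:9])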

3.2
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Yes. Three predictors have degenerate distributions: leaf.mild, mycelium, and sclerotia. In each one, nearly every observation takes the same level, so the variance is close to zero.

# caret::nearZeroVar flags predictors whose variance is near zero
x <- nearZeroVar(Soybean)
colnames(Soybean)[x]
## [1] "leaf.mild" "mycelium"  "sclerotia"
  (b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
    Yes. Some predictors are missing far more often than others; hail, sever, seed.tmt, and lodging are each missing in roughly 18% of the rows. The pattern is also related to the classes: the incomplete rows are concentrated in a handful of classes, with phytophthora-rot contributing the most, as the per-class tally after the plot confirms.
# Share of missing values in each predictor
Soybean %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  scale_fill_manual(values = c("lightgrey", "lightblue"))
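
To check the class relationship directly, incomplete rows can be tallied per class (a small sketch; complete.cases() flags the rows that contain any NA):

# Only a handful of classes contribute incomplete rows
Soybean %>%
  mutate(incomplete = !complete.cases(.)) %>%
  count(Class, incomplete) %>%
  filter(incomplete)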

  (c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
    A reasonable strategy is to remove the three near-zero-variance predictors identified in part (a), since they carry almost no information, and impute the remaining missing values with k-nearest neighbours (knnImputation() from DMwR2, loaded above). Alternatively, because every predictor is categorical, NA could be encoded as its own factor level so a model can use missingness itself as a signal. A sketch of the first option follows.
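
A minimal sketch of that strategy, combining nearZeroVar() from caret with knnImputation() from DMwR2 (both loaded above); the choice of k = 10 and holding Class out of the imputation are assumptions:

# Drop the three degenerate predictors found in part (a)
soy <- Soybean[, -nearZeroVar(Soybean)]
# Fill each remaining NA from the 10 nearest complete cases; Class (column 1)
# is excluded so the imputation does not use the labels
soy.imp <- knnImputation(soy[, -1], k = 10)
soy.imp$Class <- soy$Class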