library(caret)
library(corrplot)
library(DMwR2)
library(dplyr)
library(GGally)
library(ggplot2)
library(knitr)
library(psych)
library(mlbench)
library(tidyr)
data(Glass)
data(Soybean)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
3.1
(a) Using visualizations, explore the predictor variables to understand
their distributions as well as the relationships between
predictors.
Several predictors are skewed. Mg is left-skewed, with a cluster of values
at zero, and Al is slightly right-skewed. Ca and RI are strongly positively
correlated. Most observations belong to Types 1, 2, and 7.
p <- Glass[, 1:9]
p %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_density() +
  facet_wrap(~ key, scales = "free")
pairs(p)
r <- cor(p)
corrplot.mixed(r,
               lower.col = "red",
               number.cex = .8,
               mar = c(0, 0, 1, 0))
p %>%
  gather() %>%
  ggplot(aes(x = key, y = value, color = key)) +
  geom_boxplot()
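# Divide each column by its own standard deviation. Note that
# p / apply(p, 2, sd) recycles the vector down the rows, not across columns.
pred.norm <- sweep(p, 2, apply(p, 2, sd), "/")
pred.norm %>%
  gather() %>%
  ggplot(aes(x = key, y = value, color = key)) +
  geom_boxplot() +
  scale_y_continuous()
# Skewness of each predictor, taken from psych::describe()
pr <- describe(p)
ggplot(pr, aes(x = row.names(pr), y = skew)) +
  geom_bar(stat = "identity")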
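As a numeric cross-check on the plots in 3.1, the sketch below computes a sample skewness for each predictor, the RI-Ca correlation, and the class counts using base R. The helper `skewness_simple` is a hypothetical name introduced here; it uses the plain moment-based estimate, which may differ slightly from the value reported by `psych::describe()`.

```r
library(mlbench)
data(Glass)

# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2)
skewness_simple <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

predictors <- Glass[, 1:9]
skew <- sapply(predictors, skewness_simple)

# Negative values indicate left skew, positive values right skew
print(round(sort(skew), 2))

# Pearson correlation between refractive index and calcium
ri_ca <- cor(predictors$RI, predictors$Ca)
print(round(ri_ca, 2))

# Class frequencies for the response
print(table(Glass$Type))
```

Sorting the skewness values makes it easy to see which predictors sit furthest from symmetry and might be candidates for a transformation.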
3.2
(a) Investigate the frequency distributions for the categorical
predictors. Are any of the distributions degenerate in the ways
discussed earlier in this chapter?
The following predictors have degenerate (near-zero variance) distributions:
“leaf.mild” “mycelium” “sclerotia”
In each of them a single level dominates the observed values, so the
variance is close to zero.
x <- nearZeroVar(Soybean)
colnames(Soybean)[x]
## [1] "leaf.mild" "mycelium" "sclerotia"
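To see why these three columns are flagged, one option is to call `nearZeroVar()` with `saveMetrics = TRUE`, which returns the frequency ratio and the percentage of unique values behind the decision. This is a minimal sketch that assumes caret's default cutoffs (`freqCut = 95/5`, `uniqueCut = 10`); the names `nzv_metrics` and `flagged` are introduced here for illustration.

```r
library(caret)
library(mlbench)
data(Soybean)

# saveMetrics = TRUE returns one row of metrics per column
# instead of the indices of the flagged columns
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Keep only the predictors flagged as near-zero variance
flagged <- nzv_metrics[nzv_metrics$nzv, ]
print(flagged)

# Level counts for one flagged predictor, including missing values
print(table(Soybean$mycelium, useNA = "ifany"))
```

A high `freqRatio` together with a low `percentUnique` is what marks a predictor as near-zero variance under these defaults.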
# Flag missing values per cell, then plot the missing share for each variable
Soybean %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  scale_fill_manual(values = c("lightgrey", "lightblue"))
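The missingness shown in the plot can also be tabulated. The sketch below uses only base R to rank the variables by their share of missing values and to count incomplete rows; the names `missing_frac` and `n_incomplete` are introduced here for illustration.

```r
library(mlbench)
data(Soybean)

# Proportion of NA values in each column, largest first
missing_frac <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
print(round(head(missing_frac, 10), 3))

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(Soybean))
print(n_incomplete)
```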