3.1. The UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html) contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
#install.packages('mlbench')
#install.packages('corrplot')
#install.packages("moments")
library(moments)
library(mlbench)
library(corrplot)
## corrplot 0.92 loaded
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.7
## ✔ tidyr 1.1.4 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Types 1 and 2 have the highest counts. The predictors include both left- and right-skewed distributions. In the correlation plot, RI and Ca have a strong positive correlation, and Ba and Na and Al and K are also positively correlated, while Ba and Mg, Al and Mg, Al and RI, Ca and Mg, and Si and RI are negatively correlated.
For judging whether a predictor is left- or right-skewed, stat_density() is easier to read than geom_histogram().
# Histograms of the numeric predictors
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 15) +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")
# Density curves of the numeric predictors
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  stat_density() +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Distributions of Numerical Predictors")
# Box plots of the numeric predictors
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")
# Number of samples per glass type
Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")
x <- cor(Glass[1:9])
corrplot(x, method="number")
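Reading pairs off the correlation plot by eye is error-prone, so a minimal sketch of listing them programmatically may help. The helper name `high_cor_pairs`, the 0.75 cutoff, and the simulated data frame are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: list predictor pairs whose absolute correlation
# exceeds a cutoff (0.75 is an assumed, adjustable threshold).
high_cor_pairs <- function(df, cutoff = 0.75) {
  cm <- cor(df)
  # Look only above the diagonal so each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = round(cm[idx], 2)
  )
}

# Simulated data: x2 is built from x1, x3 is independent
set.seed(42)
x1 <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.2), x3 = rnorm(200))
flagged <- high_cor_pairs(toy)
flagged
```

For the data in this section, the equivalent call would be `high_cor_pairs(Glass[1:9])`, which reuses the same columns passed to `cor()` above.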
We can re-use geom_boxplot(): every predictor except Mg shows outliers.
The skewness of RI, Mg, K, Ca, Ba, and Fe is less than -1 or greater than 1, so these distributions are highly skewed.
The skewness of Si and Al is between -1 and -0.5 or between 0.5 and 1, so these distributions are moderately skewed.
The skewness of Na is between -0.5 and 0.5, so its distribution is approximately symmetric.
From the signs, Mg and Si are left-skewed while the others are right-skewed.
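# Box plots of the numeric predictors (same plot as before)
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")
# Skewness of each numeric predictor, rounded to two decimals
Glass %>%
  keep(is.numeric) %>%
  apply(., 2, skewness) %>%
  round(2)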
## RI Na Mg Al Si K Ca Ba Fe
## 1.61 0.45 -1.14 0.90 -0.73 6.51 2.03 3.39 1.74
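To make the rule of thumb above concrete, here is a minimal base R sketch. `sample_skewness` and `skew_label` are hypothetical helper names. The formula is the third standardized moment with population-style denominators, which is the quantity `moments::skewness()` reports. The `reported` vector simply copies the rounded values from the output above.

```r
# Third standardized moment; NAs are dropped here for convenience
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Rule-of-thumb labels used in the text
skew_label <- function(s) {
  if (abs(s) > 1) {
    "highly skewed"
  } else if (abs(s) >= 0.5) {
    "moderately skewed"
  } else {
    "approximately symmetric"
  }
}

# Rounded skewness values copied from the output above
reported <- c(RI = 1.61, Na = 0.45, Mg = -1.14, Al = 0.90, Si = -0.73,
              K = 6.51, Ca = 2.03, Ba = 3.39, Fe = 1.74)
labels <- vapply(reported, skew_label, character(1))
labels
```

The sign of each value gives the direction: negative for left skew, positive for right skew.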
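I tried hiding the outliers in the box plots (outlier.shape = NA) to see whether the picture improves; however, I do not see a difference. Note that this only suppresses the outlier points in the plot; it does not remove them from the data.
I think removing either Ca or RI may be a good step, since they are highly correlated. RI, Mg, K, Ca, Ba, and Fe are candidates for Box-Cox transformations to make their distributions closer to normal. Box-Cox requires strictly positive values, however, and Ba and Fe contain zeros (see the str() output), so those predictors would need a different treatment, such as a Yeo-Johnson transformation.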
# Box plots with outlier points drawn
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")
# Same box plots with outlier points hidden
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot(outlier.shape = NA) +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")
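The Box-Cox idea mentioned above can be sketched in base R as follows. This is an illustration under stated assumptions, not the procedure used in this report. The helper names, the search interval of [-2, 2] for lambda, and the simulated log-normal data are all hypothetical. Lambda is chosen by maximizing the profile log-likelihood of a single sample.

```r
# Box-Cox transform for a fixed lambda (log when lambda is close to 0)
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for one sample, up to a constant
box_cox_loglik <- function(lambda, x) {
  y <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

# Hypothetical helper: pick lambda in [-2, 2] by maximum likelihood
estimate_lambda <- function(x) {
  stopifnot(all(x > 0))  # Box-Cox is undefined for zeros and negatives
  optimize(box_cox_loglik, interval = c(-2, 2), x = x, maximum = TRUE)$maximum
}

skew <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

set.seed(1)
x <- rlnorm(500)  # simulated right-skewed, strictly positive data
lambda_hat <- estimate_lambda(x)
x_bc <- box_cox(x, lambda_hat)
c(lambda = lambda_hat, skew_before = skew(x), skew_after = skew(x_bc))
```

The `stopifnot(all(x > 0))` guard is the reason predictors containing zeros cannot be passed to this helper directly.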
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html). Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
I first tried stat_density(); however, because most of the predictors are categorical, the resulting plots do not clearly answer the question.
Soybean %>%
  select(-Class) %>%
  gather() %>%
  ggplot(aes(value)) +
  stat_density() +
  facet_wrap(~ key) +
  labs(title = "Distribution of Soybean")
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Removed 2 rows containing missing values (position_stack).
leaf.malf, leaf.mild, leaf.shread, lodging, mold.growth, mycelium, roots, sclerotia, seed, and seed.discolor may be near-zero variance predictors.
From the book: consider a predictor variable that has a single unique value; we refer to this type of data as a zero variance predictor.
Similarly, some predictors might have only a handful of unique values that occur with very low frequencies. These “near-zero variance predictors” may have a single value for the vast majority of the samples.
Soybean %>%
  select(-Class) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~ key) +
  labs(title = "Distribution of Soybean")
## Warning: attributes are not identical across measure variables;
## they will be dropped
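The near-zero variance definition quoted above can also be checked numerically instead of by eye. The sketch below is a base R illustration with a hypothetical helper, `nzv_metrics`. It computes two statistics: the ratio of the most frequent value's count to the second most frequent value's count, and the percentage of unique values. The cutoffs of 20 and 10% are assumed defaults taken from the book's rule of thumb, and the toy data frame is made up.

```r
# Hypothetical helper: flag predictors with a large frequency ratio
# and a small percentage of unique values.
nzv_metrics <- function(df, freq_cut = 20, unique_cut = 10) {
  freq_ratio <- vapply(df, function(col) {
    counts <- sort(table(col), decreasing = TRUE)  # table() drops NA by default
    counts <- counts[counts > 0]                   # ignore unused factor levels
    if (length(counts) < 2) Inf else counts[[1]] / counts[[2]]
  }, numeric(1))
  pct_unique <- vapply(df, function(col) {
    100 * length(unique(col[!is.na(col)])) / length(col)
  }, numeric(1))
  data.frame(
    freq_ratio = freq_ratio,
    pct_unique = pct_unique,
    near_zero_var = freq_ratio > freq_cut & pct_unique < unique_cut
  )
}

# Made-up example: one balanced, one lopsided, one constant predictor
toy <- data.frame(
  balanced = rep(c("a", "b"), each = 50),
  lopsided = c(rep("0", 98), rep("1", 2)),
  constant = rep("x", 100),
  stringsAsFactors = TRUE
)
res <- nzv_metrics(toy)
res
```

Applied to this section's data, the call would be `nzv_metrics(Soybean[, -1])`, assuming Class is the first column. The caret package provides `nearZeroVar()` for the same purpose.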
As the output below shows, hail, sever, seed.tmt, and lodging have the highest share of missing values, roughly 18% each.
missingCols <- sort(colSums(is.na(Soybean)))
percentofmissing <- missingCols / nrow(Soybean) * 100
percentofmissing
## Class leaves date area.dam crop.hist
## 0.0000000 0.0000000 0.1464129 0.1464129 2.3426061
## plant.growth stem temp roots plant.stand
## 2.3426061 2.3426061 4.3923865 4.5387994 5.2708638
## precip stem.cankers canker.lesion ext.decay mycelium
## 5.5636896 5.5636896 5.5636896 5.5636896 5.5636896
## int.discolor sclerotia leaf.halo leaf.marg leaf.size
## 5.5636896 5.5636896 12.2986823 12.2986823 12.2986823
## leaf.malf fruit.pods seed mold.growth seed.size
## 12.2986823 12.2986823 13.4699854 13.4699854 13.4699854
## leaf.shread fruiting.bodies fruit.spots seed.discolor shriveling
## 14.6412884 15.5197657 15.5197657 15.5197657 15.5197657
## leaf.mild germ hail sever seed.tmt
## 15.8125915 16.3982430 17.7159590 17.7159590 17.7159590
## lodging
## 17.7159590
Only 5 of the 19 classes have rows with missing values, and phytophthora-rot has the most (68 rows). Apart from this concentration in a few classes, I do not see a further pattern.
# Count rows with at least one missing value, per class
Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  mutate(Missing = n()) %>%
  select(Class, Missing) %>%
  unique()
## # A tibble: 5 × 2
## # Groups: Class [5]
## Class Missing
## <fct> <int>
## 1 phytophthora-rot 68
## 2 diaporthe-pod-&-stem-blight 15
## 3 cyst-nematode 14
## 4 2-4-d-injury 16
## 5 herbicide-injury 8
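The counts above are easier to interpret next to each class's total number of rows, because a class with 15 incomplete rows out of 15 tells a different story from one with 15 out of 150. A minimal base R sketch follows. `missing_by_class` is a hypothetical helper and the toy data frame is made up for illustration.

```r
# Hypothetical helper: incomplete rows per class, as a count and a
# percentage of that class's rows.
missing_by_class <- function(df, class_col) {
  cls <- factor(df[[class_col]])
  incomplete <- !complete.cases(df)        # TRUE if any column is NA
  n_total <- tapply(incomplete, cls, length)
  n_incomplete <- tapply(incomplete, cls, sum)
  data.frame(
    class = names(n_total),
    n_total = as.vector(n_total),
    n_incomplete = as.vector(n_incomplete),
    pct_incomplete = 100 * as.vector(n_incomplete) / as.vector(n_total)
  )
}

# Made-up example with three classes
toy <- data.frame(
  Class = c("a", "a", "a", "b", "b", "c"),
  x = c(1, NA, 3, 4, 5, NA),
  y = c("u", "v", NA, "w", "x", "y")
)
by_class <- missing_by_class(toy, "Class")
by_class
```

For this section's data, the call would be `missing_by_class(Soybean, "Class")`.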
I think we should review why values are missing for phytophthora-rot, since most of the incomplete rows fall into that class. We could then delete the rows with missing values, provided the missingness is not informative. As the book says: "There are cases where the missing values might be concentrated in specific samples. For large data sets, removal of samples based on missing values is not a problem, assuming that the missingness is not informative."