3.1. The UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html) contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

#install.packages("mlbench")
#install.packages("corrplot")
#install.packages("moments")
library(moments)   # skewness()
library(mlbench)   # Glass and Soybean data sets
library(corrplot)
## corrplot 0.92 loaded
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.6     ✔ dplyr   1.0.7
## ✔ tidyr   1.1.4     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Types 1 and 2 have the highest counts. Several predictors are left- or right-skewed. From the correlation plot, RI and Ca show a strong positive correlation (about 0.81), with weaker positive correlations between Ba and Na, Al and K, and Al and Ba; RI and Si, Mg and Al, Mg and Ba, and Mg and Ca are negatively correlated.

For judging whether a distribution is left- or right-skewed, the histograms from geom_histogram() are harder to read than the density plots from stat_density().

Glass %>%
  keep(is.numeric) %>%   # keep the nine numeric predictors, dropping Type
  gather() %>%           # long format: key = predictor name, value = measurement
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  stat_density() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Distributions of Numerical Predictors")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")

correlations <- cor(Glass[, 1:9])   # pairwise correlations of the nine predictors
corrplot(correlations, method = "number")

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

Re-using the boxplots from above, we can see that every predictor except Mg has outliers.

The skewness of RI, Mg, K, Ca, Ba, and Fe is less than -1 or greater than 1, so these distributions are highly skewed.

The skewness of Si and Al falls between -1 and -0.5 or between 0.5 and 1, so these distributions are moderately skewed.

The skewness of Na is between -0.5 and 0.5, so its distribution is approximately symmetric.

Since Mg and Si have negative skewness, they are left-skewed, while the others are right-skewed.

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Glass %>%
  keep(is.numeric) %>%
  apply(2, skewness) %>%   # column-wise skewness (moments package)
  round(2)
##    RI    Na    Mg    Al    Si     K    Ca    Ba    Fe 
##  1.61  0.45 -1.14  0.90 -0.73  6.51  2.03  3.39  1.74
  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

I tried excluding the outliers from the boxplots to see whether that would make the distributions clearer, but I do not see much difference.

Then I think removing Ca or RI may be a good step, since they are highly correlated, and RI, Mg, K, Ca, Ba, and Fe could be given Box-Cox transformations to make their distributions more normal. One caveat: Box-Cox is only defined for strictly positive values, so the predictors containing zeros (Mg, K, Ba, Fe) would need an offset or a Yeo-Johnson transformation instead; a sketch follows the boxplots below.

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot(outlier.shape = NA) +   # hide the outlier points
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors (Outliers Hidden)")
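
Both ideas can be sketched with the caret package (an assumption on my part; caret is not loaded anywhere above). findCorrelation() suggests which member of a highly correlated pair to drop, and preProcess() estimates the transformations; Yeo-Johnson is used here in place of Box-Cox because it accepts zeros.

library(caret)

predictors <- Glass[, 1:9]

# Suggest a predictor to drop from each highly correlated pair;
# with cutoff = 0.75 this flags one of the RI/Ca pair (r is about 0.81).
drop_idx <- findCorrelation(cor(predictors), cutoff = 0.75)
names(predictors)[drop_idx]

# Yeo-Johnson rather than Box-Cox: Mg, K, Ba, and Fe contain zeros,
# and Box-Cox is only defined for strictly positive values.
yj <- preProcess(predictors[, -drop_idx], method = "YeoJohnson")
transformed <- predict(yj, predictors[, -drop_idx])
round(apply(transformed, 2, skewness), 2)   # skewness after transformation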

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

I first tried stat_density(), but for these categorical predictors it is not a clear way to answer the question; the bar charts further below work better.

Soybean %>%
  select(-Class)%>%
  gather() %>% 
  ggplot(aes(value)) +
  stat_density() + 
  facet_wrap(~ key) +
  labs(title = "Distribution of Soybean")
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Removed 2 rows containing missing values (position_stack).

leaf.malf, leaf.mild, leaf.shread, lodging, mold.growth, mycelium, roots, sclerotia, seed, and seed.discolor may be near-zero variance predictors.

Consider a predictor variable that has a single unique value; we refer to this type of data as a zero variance predictor.

Some predictors might have only a handful of unique values that occur with very low frequencies. These “near-zero variance predictors” may have a single value for the vast majority of the samples.

Soybean %>%
  select(-Class)%>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_bar()+
  facet_wrap(~ key) +
  labs(title = "Distribution of Soybean")
## Warning: attributes are not identical across measure variables;
## they will be dropped
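
That guess can be checked programmatically with caret's nearZeroVar() (again assuming the caret package, which is not loaded above); it computes the frequency-ratio and percent-unique diagnostics the chapter describes.

library(caret)

# saveMetrics = TRUE returns freqRatio, percentUnique, and the
# zeroVar/nzv flags for every column rather than just the indices.
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]   # only the columns flagged as near-zero variance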

  2. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

We can see that hail, sever, seed.tmt, and lodging each have roughly 18% missing values, matching that overall figure.

missingCols <- sort(colSums(is.na(Soybean)))            # NA count per column
percentofmissing <- missingCols / nrow(Soybean) * 100   # as a percentage of the 683 samples
percentofmissing 
##           Class          leaves            date        area.dam       crop.hist 
##       0.0000000       0.0000000       0.1464129       0.1464129       2.3426061 
##    plant.growth            stem            temp           roots     plant.stand 
##       2.3426061       2.3426061       4.3923865       4.5387994       5.2708638 
##          precip    stem.cankers   canker.lesion       ext.decay        mycelium 
##       5.5636896       5.5636896       5.5636896       5.5636896       5.5636896 
##    int.discolor       sclerotia       leaf.halo       leaf.marg       leaf.size 
##       5.5636896       5.5636896      12.2986823      12.2986823      12.2986823 
##       leaf.malf      fruit.pods            seed     mold.growth       seed.size 
##      12.2986823      12.2986823      13.4699854      13.4699854      13.4699854 
##     leaf.shread fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##      14.6412884      15.5197657      15.5197657      15.5197657      15.5197657 
##       leaf.mild            germ            hail           sever        seed.tmt 
##      15.8125915      16.3982430      17.7159590      17.7159590      17.7159590 
##         lodging 
##      17.7159590

Only 5 of the 19 classes have any missing values at all, and phytophthora-rot accounts for by far the most, so the missing data do appear to be concentrated in particular classes.

Soybean %>%
  filter(!complete.cases(.)) %>%   # keep only the incomplete samples
  group_by(Class) %>%
  mutate(Missing = n()) %>%        # incomplete-sample count per class
  select(Class, Missing) %>%
  unique()
## # A tibble: 5 × 2
## # Groups:   Class [5]
##   Class                       Missing
##   <fct>                         <int>
## 1 phytophthora-rot                 68
## 2 diaporthe-pod-&-stem-blight      15
## 3 cyst-nematode                    14
## 4 2-4-d-injury                     16
## 5 herbicide-injury                  8
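
Raw counts can be misleading when the classes have different sizes, so as a check we can also compute the fraction of each class's samples that are incomplete; a short sketch:

Soybean %>%
  mutate(incomplete = !complete.cases(.)) %>%   # flag rows with any NA
  group_by(Class) %>%
  summarise(n = n(),
            missing = sum(incomplete),
            prop = round(mean(incomplete), 2)) %>%
  filter(missing > 0)
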
  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I think we should first review why the values are missing for phytophthora-rot, since most of the missing values fall into that class. As the book says, there are cases where the missing values are concentrated in specific samples, and for large data sets, removal of samples based on missing values is not a problem, assuming that the missingness is not informative. Under that assumption, we can delete the incomplete samples.
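
A minimal sketch of that strategy, with a simple imputation alternative. The impute_mode() helper is my own illustration rather than anything from the text: it fills each factor's missing values with its most frequent level, which keeps all 683 samples but ignores the class-related missingness noted above.

# Option 1: remove the incomplete samples (683 - 121 = 562 remain)
soy_complete <- Soybean[complete.cases(Soybean), ]
nrow(soy_complete)

# Option 2: hypothetical mode imputation for the factor predictors
impute_mode <- function(x) {
  if (anyNA(x)) {
    x[is.na(x)] <- names(which.max(table(x)))   # most frequent level
  }
  x
}
soy_imputed <- Soybean %>% mutate(across(-Class, impute_mode))
sum(is.na(soy_imputed))   # should be 0 after imputation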