library(dplyr)
library(tidyr)
library(e1071)
library(corrplot)
library(naniar)
library(ggplot2)

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

part a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(mlbench)
data(Glass)

Distribution of Numeric Variables (Elements)

glass_longer <- Glass %>%
  pivot_longer(
    cols = -Type,
    names_to = "Variable",
    values_to = "Value"
  )

ggplot(glass_longer, aes(x=Value) ) + 
  geom_histogram(bins = 15, fill = "lightblue", color = "black") + 
  facet_wrap( ~Variable, scales = "free") + 
  labs(title = "Histograms of Numerical Variables in the Glass Dataset")

Distribution of the Elements:

  • Al: Right-Skewed (may benefit from a log transformation; Box-Cox is a more flexible alternative because all values are positive)
  • Ba: Extremely Right-Skewed with many zeros (a plain log is undefined at 0, so log(x + 1) or a Yeo-Johnson transformation is a safer choice)
  • Ca: Approximately normal but slightly right-skewed
  • Fe: Extremely Right-Skewed with many zeros (same caveat as Ba)
  • K: Highly Right-Skewed and contains zeros (same caveat as Ba)
  • Mg: Bimodal
  • Na: Approximately normal but slightly right-skewed
  • RI: Approximately normal but slightly right-skewed
  • Si: Approximately normal but slightly left-skewed
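
The visual impressions above can be checked numerically. The following is a minimal sketch, assuming the `skewness()` function from the e1071 package (loaded at the top but not otherwise used here) is an acceptable measure; values near 0 suggest symmetry, and large positive values suggest a long right tail.

```r
library(mlbench)
library(e1071)

data(Glass)

# Sample skewness for each numeric predictor (Type is the class label)
glass_skew <- sapply(Glass[, names(Glass) != "Type"], skewness)

# Most right-skewed predictors first
sort(glass_skew, decreasing = TRUE)
```

Sorting the result makes it easy to see which predictors are candidates for a transformation.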

Distribution of Categorical Variable (Glass Type)

ggplot(Glass, aes(x=Type))+
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Glass Types")

Severe class imbalance is present. The data description lists seven glass types, but only six appear in the data: there are no observations of type 4. Types 1 and 2 are the majority classes.
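
The imbalance can also be tabulated directly. This is a small sketch using base R only; the object name `type_counts` is illustrative.

```r
library(mlbench)
data(Glass)

# Counts and proportions of each glass type
type_counts <- table(Glass$Type)
type_counts
round(prop.table(type_counts), 3)

# Type is stored as a factor; list the labels that are actually present
levels(Glass$Type)
```

Because `Type` is a factor, `levels()` shows which class labels exist in the data, which is a quick way to confirm that a label is absent rather than merely rare.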

Distribution of Predictors by Glass Type

ggplot(glass_longer, aes(x = Type, y = Value, fill = Type)) + 
  geom_boxplot() + 
  facet_wrap( ~Variable, scales = "free_y") + 
  labs(title = "Distribution of Element Percentages by Glass Types")

The following elements may be key predictors for glass classification because their distributions overlap the least across glass types.

  • Magnesium (Mg): Glass types 1, 2, and 3 have the highest magnesium content, while glass types 5, 6, and 7 have lower or 0 magnesium content.

  • Barium (Ba): Glass type 7 has higher barium content, while the other glass types have close to 0 barium content.

  • Aluminum (Al): Glass type 7 has the highest median aluminum content, followed by type 5. Glass types 1, 2, and 3 have lower median aluminum content.

Type 5 has the highest median Calcium (Ca) and Refractive Index (RI).

Type 7 has the lowest median Refractive Index (RI) but the highest median Silicon (Si).

Correlation

corrplot(cor(Glass %>%
               select(where(is.numeric))),
         type = "lower", 
         addCoef.col = "black",
         number.cex = 0.7)

Refractive Index (RI) and Calcium (Ca) are strongly positively correlated (r = 0.81).

Aluminum (Al) and Barium (Ba) have a moderate positive correlation (r = 0.48).

The following pairs have moderate negative correlation:

  • Refractive Index (RI) and Silicon (Si) (r = -0.54)
  • Barium (Ba) and Magnesium (Mg) (r = -0.49)
  • Aluminum (Al) and Magnesium (Mg) (r = -0.48)
  • Calcium (Ca) and Magnesium (Mg) (r = -0.44)
  • Aluminum (Al) and Refractive Index (RI) (r = -0.41)

Multicollinearity is present. Because Refractive Index (RI) and Calcium (Ca) are strongly positively correlated (r = 0.81), including both in a regression-type model may be redundant. We can consider removing one of them for models that are sensitive to collinear predictors.
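
Highly correlated pairs can also be pulled out of the correlation matrix programmatically. The sketch below uses base R only; the 0.75 cutoff and the names `cutoff` and `high_pairs` are assumptions made for illustration, not values taken from the analysis above.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the numeric predictors
cor_mat <- cor(Glass[, sapply(Glass, is.numeric)])

# Hypothetical cutoff for "high" correlation; adjust as needed
cutoff <- 0.75

# Look only above the diagonal so each pair is reported once
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 2)
)
high_pairs
```

Any pair listed in `high_pairs` is a candidate for dropping one of its two members before fitting a collinearity-sensitive model.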

part b

Do there appear to be any outliers in the data? Are any predictors skewed?

There appear to be outliers in almost every numeric variable.

  • Al: Both high and low value outliers present
  • Ba: Significant amount of high value outliers present
  • Ca: Both high and low value outliers present
  • Fe: high value outliers present
  • K: A few high value outliers present
  • Mg: No apparent outliers (the distribution is bimodal)
  • Na: A few high and low value outliers present
  • RI: A few high and low value outliers present
  • Si: A few high and low value outliers present

# geom_boxplot() has no `bins` argument, so it is omitted here
ggplot(glass_longer, aes(x=Value) ) + 
  geom_boxplot(fill = "lightblue", color = "black") + 
  facet_wrap( ~Variable, scales = "free") + 
  labs(title = "Boxplot of Numerical Variables in the Glass Dataset")
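
The boxplots flag points using the usual 1.5 × IQR rule, and the same rule can be applied numerically to count flagged points per predictor. This is a sketch using `boxplot.stats()` from base R; the names `predictors` and `outlier_counts` are illustrative.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# boxplot.stats() returns the points beyond 1.5 * IQR from the hinges in $out
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))

sort(outlier_counts, decreasing = TRUE)
```

Note that for a predictor that is mostly zeros, this rule flags nearly every non-zero value, so a high count does not necessarily indicate bad data.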

part c

Are there any relevant transformations of one or more predictors that might improve the classification model?

Ba, Fe, and K are extremely right-skewed and contain zeros. A plain logarithm is undefined at 0, and the Box-Cox transformation requires strictly positive values, so these predictors need either a small offset, such as log(x + 1), or a transformation that accepts zeros, such as Yeo-Johnson. Box-Cox remains an option for skewed predictors whose values are all positive, such as Al.

Outliers are also present. If there are only a few, we can consider removing them; if there are many, we can impute them instead. We should be cautious, however, because unusually high values can be exactly what makes a predictor useful for certain models. For example, barium can be a key predictor for classification: glass type 7 has higher barium content, while the other glass types have close to 0 barium content.
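
As a rough check of how much a zero-safe log transformation helps, the sketch below applies `log1p()` (that is, log(1 + x)) to the three predictors and compares skewness before and after. It assumes e1071's `skewness()` as the measure; the object names are illustrative, and this is one possible approach rather than the only reasonable one.

```r
library(mlbench)
library(e1071)
data(Glass)

skewed_vars <- c("Ba", "Fe", "K")

# log1p(x) = log(1 + x) is defined at 0, unlike log(x)
glass_log1p <- Glass
glass_log1p[skewed_vars] <- lapply(Glass[skewed_vars], log1p)

# Compare skewness before and after the transformation
skew_before <- sapply(Glass[skewed_vars], skewness)
skew_after  <- sapply(glass_log1p[skewed_vars], skewness)
round(rbind(before = skew_before, after = skew_after), 2)
```

A predictor that is mostly zeros will remain skewed after any monotone transformation, so the comparison mainly shows how much of the long right tail is compressed.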

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

part a

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)

head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0

Distribution of Class

ggplot(Soybean, aes(x=Class))+
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Class") + 
  theme(axis.text.x = element_text(angle = 45,hjust = 1, vjust = 1 ))

Distribution of Variables

soybean_char <- Soybean %>%
  mutate(across(-Class, as.character))

soybean_longer <- soybean_char %>%
  pivot_longer(
    cols = -Class,
    names_to = "Variable",
    values_to = "Value"
  )
ggplot(soybean_longer, aes(x=Value)) + 
  geom_bar(fill = "lightblue", color = "black") + 
  facet_wrap( ~Variable, scales = "free") + 
  labs(title = "Bar Plot of Variables in the Soybean Dataset")

The most evident degenerate distributions belong to variables with a single dominant bar and very little elsewhere (near-zero variance). The following predictors show this pattern:

  • mycelium
  • sclerotia

highest_props <- soybean_longer%>%
  group_by(Variable, Value)%>%
  summarise(Count =n(), .groups = "drop")%>%
  group_by(Variable)%>%
  mutate(Proportion = Count/sum(Count))%>%
  slice_max(Proportion, n =1 , with_ties = FALSE)%>%
  select(Variable, Top_value = Value, Proportion)%>%
  arrange(desc(Proportion))
  
highest_props
## # A tibble: 35 × 3
## # Groups:   Variable [35]
##    Variable     Top_value Proportion
##    <chr>        <chr>          <dbl>
##  1 mycelium     0              0.936
##  2 sclerotia    0              0.915
##  3 leaves       1              0.887
##  4 int.discolor 0              0.851
##  5 leaf.malf    0              0.811
##  6 roots        0              0.807
##  7 shriveling   0              0.789
##  8 leaf.mild    0              0.783
##  9 seed.size    0              0.779
## 10 mold.growth  0              0.767
## # ℹ 25 more rows
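
A complementary check is the ratio of the most common value to the second most common, combined with the share of distinct values. The sketch below uses base R only. The cutoffs (ratio above 20, fewer than 10% unique values) are an assumed rule of thumb, and unlike the table above this version ignores missing values rather than counting them as a category.

```r
library(mlbench)
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# Ratio of the most common level to the second most common (NAs ignored)
freq_ratio <- sapply(predictors, function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)
  counts[1] / counts[2]
})

# Percentage of distinct observed values relative to the number of rows
pct_unique <- sapply(predictors, function(x) {
  100 * length(unique(na.omit(x))) / length(x)
})

# Assumed cutoffs for this sketch: ratio above 20 and under 10% unique values
degenerate <- names(freq_ratio)[freq_ratio > 20 & pct_unique < 10]
degenerate
```

Predictors returned in `degenerate` are candidates for removal before fitting models that are sensitive to near-zero-variance inputs.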

part b

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Missing Data in Predictors

The ten predictors with the most missing data are:

  • hail
  • sever
  • seed.tmt
  • lodging
  • germ
  • leaf.mild
  • fruiting.bodies
  • fruit.spots
  • seed.discolor
  • shriveling

# code adapted from "https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html"
miss_var_summary(Soybean)
## # A tibble: 36 × 3
##    variable        n_miss pct_miss
##    <chr>            <int>    <num>
##  1 hail               121     17.7
##  2 sever              121     17.7
##  3 seed.tmt           121     17.7
##  4 lodging            121     17.7
##  5 germ               112     16.4
##  6 leaf.mild          108     15.8
##  7 fruiting.bodies    106     15.5
##  8 fruit.spots        106     15.5
##  9 seed.discolor      106     15.5
## 10 shriveling         106     15.5
## # ℹ 26 more rows

Missing Values by Class

Missingness is not uniform across classes. Five classes (e.g., 2-4-d-injury and cyst-nematode) have close to 100% missing values for many predictors, while the other classes (e.g., anthracnose) have close to 0%.

Some predictors, such as stem.cankers and shriveling, have a high percentage of missing values across several classes.

# code found from "https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html"
gg_miss_fct(x = Soybean, fct = Class)
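
The plot can be backed up with a per-class count of incomplete rows. This is a minimal base R sketch; `has_missing` and `class_missing` are illustrative names.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing predictor value
has_missing <- !complete.cases(Soybean)

# Rows per class, and how many of them are incomplete
class_missing <- data.frame(
  n_rows       = as.vector(table(Soybean$Class)),
  n_incomplete = as.vector(tapply(has_missing, Soybean$Class, sum)),
  row.names    = levels(Soybean$Class)
)
class_missing$pct_incomplete <-
  round(100 * class_missing$n_incomplete / class_missing$n_rows, 1)

# Keep only the classes that have any incomplete rows
class_missing[class_missing$n_incomplete > 0, ]
```

If the incomplete rows are concentrated in a handful of classes, the missingness is informative about the class and should not be treated as random.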

part c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Five classes have a high percentage of missing values across most predictors. If observations from these classes make up a small share of the dataset (e.g., less than 10%), we can consider removing them to simplify modeling.

If they represent a significant portion, we should keep them, remove the predictors with the highest percentage of missing values, and impute the rest using KNN, MICE, or another predictive model.
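
The second strategy can be sketched as follows. To keep the example dependency-free, it substitutes simple mode imputation for KNN or MICE, which is a deliberate simplification and not the method proposed above. The 15% threshold and the helper name `impute_mode` are hypothetical choices for illustration.

```r
library(mlbench)
data(Soybean)

# Hypothetical threshold: drop predictors with more than 15% missing values
max_missing <- 0.15
pct_missing <- colMeans(is.na(Soybean))
soy_reduced <- Soybean[, pct_missing <= max_missing]

# Illustrative helper: replace NAs in a factor with its most common level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))
  x[is.na(x)] <- mode_level
  x
}

soy_imputed <- soy_reduced
predictor_cols <- setdiff(names(soy_imputed), "Class")
soy_imputed[predictor_cols] <- lapply(soy_imputed[predictor_cols], impute_mode)

# Confirm that no missing values remain
sum(is.na(soy_imputed))
```

Mode imputation ignores the relationship between missingness and class noted in part b, so in practice a model-based imputer or an explicit "missing" level may be preferable.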