3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)
library(dplyr)
library(ggplot2)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

ggplot(data = Glass, aes(x = Na)) + 
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x= Mg)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Al)) + 
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Si)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=K)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Ca)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Ba)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Fe)) +
  geom_histogram(binwidth = 0.05)

ggplot(data = Glass, aes(x=RI)) + 
  geom_histogram(binwidth = 0.001)
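The nine histograms above can also be drawn in a single faceted figure, which makes the shapes easier to compare side by side. This is a minimal sketch, not the approach used above: `glass_long` and `p_hist` are names introduced here for illustration, and `bins = 30` is an arbitrary choice rather than a tuned value.

```r
library(mlbench)
library(ggplot2)

data(Glass)

# stack() reshapes the nine numeric predictors into two columns:
# `values` (the measurement) and `ind` (the predictor name).
glass_long <- stack(Glass[, 1:9])

# One panel per predictor. Scales are left free because RI and Si
# sit on very different ranges.
p_hist <- ggplot(glass_long, aes(x = values)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ ind, scales = "free")
print(p_hist)
```

Free scales keep each panel readable, at the cost of making the axes not directly comparable across panels.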

I computed the pairwise correlations between the predictors to further explore the relationships between them.

cor_table <- round(cor(Glass[,1:9]), 2)
print(cor_table)
##       RI    Na    Mg    Al    Si     K    Ca    Ba    Fe
## RI  1.00 -0.19 -0.12 -0.41 -0.54 -0.29  0.81  0.00  0.14
## Na -0.19  1.00 -0.27  0.16 -0.07 -0.27 -0.28  0.33 -0.24
## Mg -0.12 -0.27  1.00 -0.48 -0.17  0.01 -0.44 -0.49  0.08
## Al -0.41  0.16 -0.48  1.00 -0.01  0.33 -0.26  0.48 -0.07
## Si -0.54 -0.07 -0.17 -0.01  1.00 -0.19 -0.21 -0.10 -0.09
## K  -0.29 -0.27  0.01  0.33 -0.19  1.00 -0.32 -0.04 -0.01
## Ca  0.81 -0.28 -0.44 -0.26 -0.21 -0.32  1.00 -0.11  0.12
## Ba  0.00  0.33 -0.49  0.48 -0.10 -0.04 -0.11  1.00 -0.06
## Fe  0.14 -0.24  0.08 -0.07 -0.09 -0.01  0.12 -0.06  1.00

The strongest correlation is between RI and Ca (0.81). The scatterplot of these two variables is shown below.

ggplot(Glass, aes(x = RI, y = Ca)) + geom_point()

The next strongest correlations are RI and Si (-0.54), Mg and Ba (-0.49), Mg and Al (-0.48), and Al and Ba (0.48).
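Rather than reading the strongest pairs off the matrix by eye, they can be ranked programmatically. The sketch below uses only base R; `cor_mat`, `idx`, and `cor_pairs` are illustrative names, and it works from the unrounded correlations, so the ordering of near-ties may differ slightly from the rounded table.

```r
library(mlbench)
data(Glass)

cor_mat <- cor(Glass[, 1:9])

# Keep each pair once by taking only the upper triangle.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Rank by absolute correlation, strongest first.
cor_pairs <- cor_pairs[order(abs(cor_pairs$r), decreasing = TRUE), ]
head(cor_pairs, 5)
```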

Do there appear to be any outliers in the data? Are any predictors skewed?

Most of the predictors are skewed. RI, Na, Al, K, and Ca are right skewed, while Si and Mg are left skewed.

Ba and Fe are mostly zeros, with a small number of positive values.
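The visual impression of skew can be checked numerically. The sketch below assumes the sample skewness statistic defined as the third central moment divided by the variance raised to the 3/2 power, both computed with an n - 1 denominator. The `skewness` helper is written here for illustration and is not taken from a package. Positive values indicate right skew and negative values indicate left skew.

```r
library(mlbench)
data(Glass)

# Sample skewness (assumed definition): third central moment over (n - 1),
# scaled by the sample variance to the power 3/2.
skewness <- function(x) {
  n <- length(x)
  v <- sum((x - mean(x))^2) / (n - 1)
  sum((x - mean(x))^3) / ((n - 1) * v^(3 / 2))
}

skew <- sapply(Glass[, 1:9], skewness)
sort(round(skew, 2))

# Share of samples that are exactly zero for the zero-heavy predictors.
zero_share <- colMeans(Glass[, c("Ba", "Fe")] == 0)
round(zero_share, 2)
```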

The outliers can be seen on the following boxplots.

ggplot(data = Glass, aes(x=RI)) +
  geom_boxplot()

ggplot(data = Glass, aes(x=Na)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Mg)) +
  geom_boxplot()

ggplot(data= Glass, aes(x=Al)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Si)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=K)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Ca)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Ba)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Fe)) +
  geom_boxplot()

All of the boxplots show some outliers, with the exception of Mg. RI, Na, Al, Si, and Ca have both high and low outliers, while K, Ba, and Fe only have high outliers. Interestingly, the box for Ba collapses to 0, so every non-zero point is flagged as an outlier.
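The outlier counts behind these boxplots can be tabulated directly. This sketch uses base R's `boxplot.stats()`, which flags points more than 1.5 times the hinge spread beyond the hinges. It computes hinges slightly differently from the quartiles ggplot2 uses, so the counts should be close to, but are not guaranteed to match, the plotted points. `n_out` is an illustrative name.

```r
library(mlbench)
data(Glass)

# Number of points flagged as outliers by the 1.5 * IQR rule, per predictor.
n_out <- vapply(
  Glass[, 1:9],
  function(x) length(boxplot.stats(x)$out),
  integer(1)
)
n_out
```

For Ba, the hinges are both 0, so the count equals the number of non-zero Ba values.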

Are there any relevant transformations of one or more predictors that might improve the classification model?

It may be helpful to remove RI or Ca to account for collinearity, since they are highly correlated. The remaining variables should be centered and scaled. A spatial sign transformation could be used to address outliers, especially for the variables with both high and low outliers, including RI, Na, and Al. For the variables that show the most skew, a Box-Cox transformation could also be used to improve symmetry. Because Box-Cox requires strictly positive values, predictors that contain zeros (such as Ba and Fe) would need an alternative such as the Yeo-Johnson transformation.
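The centering, scaling, and spatial sign steps can be sketched in base R as follows. Dropping Ca rather than RI is an arbitrary choice made for illustration, and the names `predictors`, `glass_scaled`, and `glass_ss` are introduced here. The skewness transformations are left out to keep the example short.

```r
library(mlbench)
data(Glass)

# Drop Ca (illustrative choice; dropping RI instead is equally plausible).
predictors <- Glass[, setdiff(names(Glass)[1:9], "Ca")]

# Center and scale first: the spatial sign treats the predictors as a
# group, so they need to be on a common scale.
glass_scaled <- scale(predictors)

# Spatial sign: divide each sample (row) by its Euclidean norm, which
# projects every sample onto the unit sphere.
row_norm <- sqrt(rowSums(glass_scaled^2))
glass_ss <- glass_scaled / row_norm
```

After this step every sample has unit length, so extreme samples are pulled in to the same distance from the center as the rest.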

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data(Soybean)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The chapter gives two criteria for degenerate predictors: (a) the fraction of unique values over the sample size is low (say 10%), and (b) the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

print(dim(Soybean)[1])
## [1] 683

To meet the first criterion, a variable must have 68 or fewer unique values (10% of the 683 samples).

sapply(Soybean, function(x) length(unique(x)))
##           Class            date     plant.stand          precip            temp 
##              19               8               3               4               4 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##               3               5               5               4               4 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##               4               3               2               4               4 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##               4               3               3               4               3 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##               3               5               5               3               4 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##               3               4               3               5               5 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##               3               3               3               3               3 
##           roots 
##               4

The largest number of unique values is 19 (for Class, the outcome), so every column meets the first criterion. Note that these counts treat NA as a value. For the second criterion, we need the frequencies of the two most prevalent values.

soy_ratio <- sapply(Soybean, function(x) {
  # Count each value (NA is excluded by default) and sort, most frequent first
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most prevalent value to the second most prevalent
  counts[1] / counts[2]
})

# Arrange in descending order
sort(soy_ratio, decreasing = TRUE)
##        mycelium.0       sclerotia.0       leaf.mild.0      shriveling.0 
##        106.500000         31.250000         26.750000         14.184211 
##    int.discolor.0         lodging.0       leaf.malf.0       seed.size.0 
##         13.204545         12.380952         12.311111          9.016949 
##   seed.discolor.0          leaves.1     mold.growth.0           roots.0 
##          8.015625          7.870130          7.820896          6.406977 
##     leaf.shread.0 fruiting.bodies.0            seed.0          precip.2 
##          5.072917          4.548077          4.139130          4.098214 
##       ext.decay.0     fruit.spots.0            hail.0      fruit.pods.0 
##          3.681481          3.450000          3.425197          3.130769 
##    stem.cankers.0    plant.growth.0            temp.1   canker.lesion.0 
##          1.984293          1.951327          1.879397          1.807910 
##           sever.1       leaf.marg.0       leaf.halo.2       leaf.size.1 
##          1.651282          1.615385          1.547511          1.479638 
##        seed.tmt.0            stem.1        area.dam.1     plant.stand.0 
##          1.373874          1.253378          1.213904          1.208191 
##            date.5            germ.1  Class.brown-spot       crop.hist.2 
##          1.137405          1.103627          1.010989          1.004587

The ratios for mycelium, sclerotia, and leaf.mild are above 20, so these three predictors are degenerate.
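Both criteria can be applied in one pass. The sketch below is a base R illustration under the thresholds quoted above (10% unique values, frequency ratio of 20). The names `degeneracy` and `flagged` are introduced here. It excludes the outcome column and ignores NA when counting unique values, which is a choice made for this example.

```r
library(mlbench)
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

degeneracy <- data.frame(
  # Percent of unique (non-missing) values relative to the sample size.
  pct_unique = sapply(predictors, function(x) {
    100 * length(unique(x[!is.na(x)])) / length(x)
  }),
  # Frequency of the most common value over the second most common.
  freq_ratio = sapply(predictors, function(x) {
    counts <- sort(table(x), decreasing = TRUE)
    counts[[1]] / counts[[2]]
  }),
  row.names = names(predictors)
)

# A predictor is flagged only if it meets both criteria.
flagged <- rownames(degeneracy)[
  degeneracy$pct_unique < 10 & degeneracy$freq_ratio > 20
]
flagged
```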

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The variables that were identified as degenerate (mycelium, sclerotia, and leaf.mild) can be eliminated. After eliminating them, I would use imputation to fill in the remaining missing data. Since the missing data are associated with 5 classes, it may be helpful to introduce a new variable that records the number of missing values in each observation. It may also be worthwhile to remove variables with higher percentages of missing data (over 15%) to reduce noise and see whether that improves the outcome.
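The pieces of this strategy can be sketched in base R. Mode imputation is used here only as one simple option for categorical predictors, not as a recommendation over model-based imputation. The helper `impute_mode` and the names `pct_missing`, `n_missing`, `classes_with_na`, and `soy_imputed` are introduced for illustration.

```r
library(mlbench)
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# Percent missing per predictor, highest first.
pct_missing <- sort(100 * colMeans(is.na(predictors)), decreasing = TRUE)
head(round(pct_missing, 1))

# Number of missing values per sample: a candidate extra feature.
n_missing <- rowSums(is.na(predictors))

# Classes that contain at least one incomplete sample.
classes_with_na <- unique(as.character(Soybean$Class[n_missing > 0]))
classes_with_na

# Mode imputation: replace NA with the most frequent observed level.
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
soy_imputed <- predictors
soy_imputed[] <- lapply(predictors, impute_mode)
```

Because the missing values are concentrated in a few classes, filling them with the overall mode can blur exactly the samples where missingness is informative, which is the motivation for keeping `n_missing` as a separate feature.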