624 hw4

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)

library(dplyr)

library(ggplot2)

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

ggplot(data = Glass, aes(x = Na)) + 
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x= Mg)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Al)) + 
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Si)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=K)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Ca)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Ba)) +
  geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x=Fe)) +
  geom_histogram(binwidth = 0.05)

ggplot(data = Glass, aes(x=RI)) + 
  geom_histogram(binwidth = 0.001)
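
The nine histograms above can also be drawn as one faceted figure, which makes the shapes easier to compare side by side. This is a minimal sketch rather than part of the original analysis: it assumes base R's `stack()` for reshaping, and the name `glass_long` and the choice of 30 bins per panel are illustrative.

```r
library(mlbench)
library(ggplot2)

data(Glass)

# Reshape the nine predictors (columns 1 to 9) to long format:
# `values` holds the measurement, `ind` holds the predictor name
glass_long <- stack(Glass[, 1:9])

# One histogram panel per predictor; scales are freed because RI and the
# element percentages cover very different ranges
p <- ggplot(glass_long, aes(x = values)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ ind, scales = "free")
print(p)
```

A fixed bin count per panel avoids choosing a separate `binwidth` for each predictor, at the cost of less control over any single panel.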

I looked for the correlation between variables to further explore the relationships between them.

cor_table <- round(cor(Glass[,1:9]), 2)
print(cor_table)

##       RI    Na    Mg    Al    Si     K    Ca    Ba    Fe
## RI  1.00 -0.19 -0.12 -0.41 -0.54 -0.29  0.81  0.00  0.14
## Na -0.19  1.00 -0.27  0.16 -0.07 -0.27 -0.28  0.33 -0.24
## Mg -0.12 -0.27  1.00 -0.48 -0.17  0.01 -0.44 -0.49  0.08
## Al -0.41  0.16 -0.48  1.00 -0.01  0.33 -0.26  0.48 -0.07
## Si -0.54 -0.07 -0.17 -0.01  1.00 -0.19 -0.21 -0.10 -0.09
## K  -0.29 -0.27  0.01  0.33 -0.19  1.00 -0.32 -0.04 -0.01
## Ca  0.81 -0.28 -0.44 -0.26 -0.21 -0.32  1.00 -0.11  0.12
## Ba  0.00  0.33 -0.49  0.48 -0.10 -0.04 -0.11  1.00 -0.06
## Fe  0.14 -0.24  0.08 -0.07 -0.09 -0.01  0.12 -0.06  1.00

The strongest correlation is between RI and Ca (0.81). The scatterplot of these variables is shown below.

The next strongest correlations are RI and Si (-0.54), Mg and Ba (-0.49), Mg and Al (-0.48), and Al and Ba (0.48).

ggplot(Glass, aes(x = RI, y = Ca)) + geom_point()
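
Reading the strongest pairs off a 9 x 9 matrix by eye is error prone, so the ranking can also be produced programmatically. The sketch below is one possible way to do it, using only base R; the names `cor_mat` and `cor_pairs` are illustrative.

```r
library(mlbench)
data(Glass)

cor_mat <- cor(Glass[, 1:9])

# Keep each unordered pair once by taking the upper triangle
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute correlation, strongest first, then round for display
cor_pairs <- cor_pairs[order(abs(cor_pairs$r), decreasing = TRUE), ]
cor_pairs$r <- round(cor_pairs$r, 2)
head(cor_pairs, 6)
```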

Do there appear to be any outliers in the data? Are any predictors skewed?

Most of the predictors are skewed. RI, Na, Al, K, and Ca are right skewed, while Si and Mg are left skewed.

Ba and Fe are mostly zeros, with the remaining values positive.
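
The direction of skew can be checked numerically instead of judged from the histograms alone. The sketch below defines a simple moment-based skewness (a positive value means a longer right tail, a negative value a longer left tail); the helper name `skewness` is defined here and is not taken from a package, and other skewness estimators differ slightly in their small-sample corrections.

```r
library(mlbench)
data(Glass)

# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew <- sapply(Glass[, 1:9], skewness)
round(sort(skew, decreasing = TRUE), 2)

# Share of samples in which each predictor is recorded as exactly zero
round(colMeans(Glass[, 1:9] == 0), 2)
```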

The outliers can be seen on the following boxplots.

ggplot(data = Glass, aes(x=RI)) +
  geom_boxplot()

ggplot(data = Glass, aes(x=Na)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Mg)) +
  geom_boxplot()

ggplot(data= Glass, aes(x=Al)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Si)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=K)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Ca)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Ba)) +
  geom_boxplot()

ggplot(data=Glass, aes(x=Fe)) +
  geom_boxplot()

All of the boxplots show some outliers, with the exception of Mg. RI, Na, Al, Si, and Ca have both high and low outliers, while K, Ba, and Fe have only high outliers. Interestingly, the box for Ba collapses to 0, so every non-zero Ba value is drawn as an outlier.
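
The number of points flagged in each boxplot can be counted directly. This sketch assumes base R's `boxplot.stats()`, which applies the usual 1.5 x IQR rule to the hinges; `geom_boxplot()` computes its quartiles slightly differently, so counts may differ by a point or two from the plots above.

```r
library(mlbench)
data(Glass)

# Count the points lying more than 1.5 * IQR beyond the hinges
n_outliers <- sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))
n_outliers

# For Ba both hinges are 0, so every non-zero value is flagged
sum(Glass$Ba != 0)
```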

Are there any relevant transformations of one or more predictors that might improve the classification model?

Because RI and Ca are highly correlated (0.81), it may be helpful to remove one of them to reduce collinearity. The remaining predictors should be centered and scaled. A spatial sign transformation could then be used to reduce the influence of outliers, especially for predictors with both high and low outliers such as RI, Na, and Al. For the most skewed predictors, a Box-Cox transformation could improve symmetry. Box-Cox requires strictly positive values, however, so predictors that contain zeros (such as Ba and Fe) would need an alternative such as a Yeo-Johnson transformation.
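
The centering, scaling, and spatial sign steps can be sketched in base R as follows. This is an illustration under stated assumptions, not a tuned preprocessing pipeline: dropping Ca rather than RI is an arbitrary choice, the skewness transformation is left out, and the names `predictors`, `scaled`, and `glass_ss` are illustrative.

```r
library(mlbench)
data(Glass)

# Drop Ca (assumed choice; dropping RI instead is equally defensible)
predictors <- Glass[, setdiff(names(Glass)[1:9], "Ca")]

# Center and scale each predictor to mean 0 and standard deviation 1
scaled <- scale(predictors)

# Spatial sign: divide each row by its Euclidean norm so that every
# sample lies on the unit sphere
row_norm <- sqrt(rowSums(scaled^2))
glass_ss <- scaled / row_norm

# Every row should now have norm 1
summary(sqrt(rowSums(glass_ss^2)))
```

Centering and scaling must come before the spatial sign, because the row norm would otherwise be dominated by the predictors with the largest raw values (Si in particular).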

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data(Soybean)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The criteria for degenerate predictors that were discussed in the chapter are:

a. The fraction of unique values over the sample size is low (say 10%).
b. The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

print(dim(Soybean)[1])

## [1] 683

To meet the first criterion, a variable must have 68 or fewer unique values (10% of the 683 observations).

sapply(Soybean, function(x) length(unique(x)))

##           Class            date     plant.stand          precip            temp 
##              19               8               3               4               4 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##               3               5               5               4               4 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##               4               3               2               4               4 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##               4               3               3               4               3 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##               3               5               5               3               4 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##               3               4               3               5               5 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##               3               3               3               3               3 
##           roots 
##               4

The maximum number of unique values is 19, so every column meets the first criterion. For the second criterion, we need the frequencies of the two most prevalent values in each column.

soy_ratio <- sapply(Soybean, function(x) {
  # Count each value (table() drops NA by default), most frequent first
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most prevalent value to the second most prevalent
  counts[1]/counts[2]
  })

# Arrange in descending order
sort(soy_ratio, decreasing = TRUE)

##        mycelium.0       sclerotia.0       leaf.mild.0      shriveling.0 
##        106.500000         31.250000         26.750000         14.184211 
##    int.discolor.0         lodging.0       leaf.malf.0       seed.size.0 
##         13.204545         12.380952         12.311111          9.016949 
##   seed.discolor.0          leaves.1     mold.growth.0           roots.0 
##          8.015625          7.870130          7.820896          6.406977 
##     leaf.shread.0 fruiting.bodies.0            seed.0          precip.2 
##          5.072917          4.548077          4.139130          4.098214 
##       ext.decay.0     fruit.spots.0            hail.0      fruit.pods.0 
##          3.681481          3.450000          3.425197          3.130769 
##    stem.cankers.0    plant.growth.0            temp.1   canker.lesion.0 
##          1.984293          1.951327          1.879397          1.807910 
##           sever.1       leaf.marg.0       leaf.halo.2       leaf.size.1 
##          1.651282          1.615385          1.547511          1.479638 
##        seed.tmt.0            stem.1        area.dam.1     plant.stand.0 
##          1.373874          1.253378          1.213904          1.208191 
##            date.5            germ.1  Class.brown-spot       crop.hist.2 
##          1.137405          1.103627          1.010989          1.004587

The ratios for mycelium, sclerotia, and leaf.mild are above 20, so these three predictors are degenerate.
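
Both criteria can be combined into a single table so that the flagged predictors fall out directly. The sketch below is one way to do this in base R; `degeneracy_metrics` is a hypothetical helper written for this example, and the cutoffs (at most 10% unique values, frequency ratio above 20) are the ones quoted in the text.

```r
library(mlbench)
data(Soybean)

# Hypothetical helper: one row per column with both degeneracy measures
degeneracy_metrics <- function(df) {
  metrics <- lapply(df, function(x) {
    counts <- sort(table(x), decreasing = TRUE)  # NA values are excluded
    c(pct_unique = 100 * sum(counts > 0) / length(x),
      freq_ratio = as.numeric(counts[1]) / as.numeric(counts[2]))
  })
  as.data.frame(do.call(rbind, metrics))
}

soy_metrics <- degeneracy_metrics(Soybean)

# A column is flagged only when it meets both criteria
soy_metrics$degenerate <- soy_metrics$pct_unique <= 10 &
  soy_metrics$freq_ratio > 20

rownames(soy_metrics)[soy_metrics$degenerate]
```

The caret package provides `nearZeroVar()` for the same purpose; its default frequency cutoff is 95/5 (19) rather than 20, so borderline columns could be classified differently.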

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Missing data per column
missing_data <- colMeans(is.na(Soybean))
sort(missing_data, decreasing =TRUE)

##            hail           sever        seed.tmt         lodging            germ 
##     0.177159590     0.177159590     0.177159590     0.177159590     0.163982430 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##     0.158125915     0.155197657     0.155197657     0.155197657     0.155197657 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##     0.146412884     0.134699854     0.134699854     0.134699854     0.122986823 
##       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
##     0.122986823     0.122986823     0.122986823     0.122986823     0.055636896 
##    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
##     0.055636896     0.055636896     0.055636896     0.055636896     0.055636896 
##       sclerotia     plant.stand           roots            temp       crop.hist 
##     0.055636896     0.052708638     0.045387994     0.043923865     0.023426061 
##    plant.growth            stem            date        area.dam           Class 
##     0.023426061     0.023426061     0.001464129     0.001464129     0.000000000 
##          leaves 
##     0.000000000

# Missing values by class
sapply(Soybean, function(x) 
  tapply(is.na(x), Soybean$Class, mean))

##                             Class   date plant.stand precip temp      hail
## 2-4-d-injury                    0 0.0625         1.0      1    1 1.0000000
## alternarialeaf-spot             0 0.0000         0.0      0    0 0.0000000
## anthracnose                     0 0.0000         0.0      0    0 0.0000000
## bacterial-blight                0 0.0000         0.0      0    0 0.0000000
## bacterial-pustule               0 0.0000         0.0      0    0 0.0000000
## brown-spot                      0 0.0000         0.0      0    0 0.0000000
## brown-stem-rot                  0 0.0000         0.0      0    0 0.0000000
## charcoal-rot                    0 0.0000         0.0      0    0 0.0000000
## cyst-nematode                   0 0.0000         1.0      1    1 1.0000000
## diaporthe-pod-&-stem-blight     0 0.0000         0.4      0    0 1.0000000
## diaporthe-stem-canker           0 0.0000         0.0      0    0 0.0000000
## downy-mildew                    0 0.0000         0.0      0    0 0.0000000
## frog-eye-leaf-spot              0 0.0000         0.0      0    0 0.0000000
## herbicide-injury                0 0.0000         0.0      1    0 1.0000000
## phyllosticta-leaf-spot          0 0.0000         0.0      0    0 0.0000000
## phytophthora-rot                0 0.0000         0.0      0    0 0.7727273
## powdery-mildew                  0 0.0000         0.0      0    0 0.0000000
## purple-seed-stain               0 0.0000         0.0      0    0 0.0000000
## rhizoctonia-root-rot            0 0.0000         0.0      0    0 0.0000000
##                             crop.hist area.dam     sever  seed.tmt      germ
## 2-4-d-injury                        1   0.0625 1.0000000 1.0000000 1.0000000
## alternarialeaf-spot                 0   0.0000 0.0000000 0.0000000 0.0000000
## anthracnose                         0   0.0000 0.0000000 0.0000000 0.0000000
## bacterial-blight                    0   0.0000 0.0000000 0.0000000 0.0000000
## bacterial-pustule                   0   0.0000 0.0000000 0.0000000 0.0000000
## brown-spot                          0   0.0000 0.0000000 0.0000000 0.0000000
## brown-stem-rot                      0   0.0000 0.0000000 0.0000000 0.0000000
## charcoal-rot                        0   0.0000 0.0000000 0.0000000 0.0000000
## cyst-nematode                       0   0.0000 1.0000000 1.0000000 1.0000000
## diaporthe-pod-&-stem-blight         0   0.0000 1.0000000 1.0000000 0.4000000
## diaporthe-stem-canker               0   0.0000 0.0000000 0.0000000 0.0000000
## downy-mildew                        0   0.0000 0.0000000 0.0000000 0.0000000
## frog-eye-leaf-spot                  0   0.0000 0.0000000 0.0000000 0.0000000
## herbicide-injury                    0   0.0000 1.0000000 1.0000000 1.0000000
## phyllosticta-leaf-spot              0   0.0000 0.0000000 0.0000000 0.0000000
## phytophthora-rot                    0   0.0000 0.7727273 0.7727273 0.7727273
## powdery-mildew                      0   0.0000 0.0000000 0.0000000 0.0000000
## purple-seed-stain                   0   0.0000 0.0000000 0.0000000 0.0000000
## rhizoctonia-root-rot                0   0.0000 0.0000000 0.0000000 0.0000000
##                             plant.growth leaves leaf.halo leaf.marg leaf.size
## 2-4-d-injury                           1      0     0.000     0.000     0.000
## alternarialeaf-spot                    0      0     0.000     0.000     0.000
## anthracnose                            0      0     0.000     0.000     0.000
## bacterial-blight                       0      0     0.000     0.000     0.000
## bacterial-pustule                      0      0     0.000     0.000     0.000
## brown-spot                             0      0     0.000     0.000     0.000
## brown-stem-rot                         0      0     0.000     0.000     0.000
## charcoal-rot                           0      0     0.000     0.000     0.000
## cyst-nematode                          0      0     1.000     1.000     1.000
## diaporthe-pod-&-stem-blight            0      0     1.000     1.000     1.000
## diaporthe-stem-canker                  0      0     0.000     0.000     0.000
## downy-mildew                           0      0     0.000     0.000     0.000
## frog-eye-leaf-spot                     0      0     0.000     0.000     0.000
## herbicide-injury                       0      0     0.000     0.000     0.000
## phyllosticta-leaf-spot                 0      0     0.000     0.000     0.000
## phytophthora-rot                       0      0     0.625     0.625     0.625
## powdery-mildew                         0      0     0.000     0.000     0.000
## purple-seed-stain                      0      0     0.000     0.000     0.000
## rhizoctonia-root-rot                   0      0     0.000     0.000     0.000
##                             leaf.shread leaf.malf leaf.mild stem   lodging
## 2-4-d-injury                      1.000     0.000     1.000    1 1.0000000
## alternarialeaf-spot               0.000     0.000     0.000    0 0.0000000
## anthracnose                       0.000     0.000     0.000    0 0.0000000
## bacterial-blight                  0.000     0.000     0.000    0 0.0000000
## bacterial-pustule                 0.000     0.000     0.000    0 0.0000000
## brown-spot                        0.000     0.000     0.000    0 0.0000000
## brown-stem-rot                    0.000     0.000     0.000    0 0.0000000
## charcoal-rot                      0.000     0.000     0.000    0 0.0000000
## cyst-nematode                     1.000     1.000     1.000    0 1.0000000
## diaporthe-pod-&-stem-blight       1.000     1.000     1.000    0 1.0000000
## diaporthe-stem-canker             0.000     0.000     0.000    0 0.0000000
## downy-mildew                      0.000     0.000     0.000    0 0.0000000
## frog-eye-leaf-spot                0.000     0.000     0.000    0 0.0000000
## herbicide-injury                  0.000     0.000     1.000    0 1.0000000
## phyllosticta-leaf-spot            0.000     0.000     0.000    0 0.0000000
## phytophthora-rot                  0.625     0.625     0.625    0 0.7727273
## powdery-mildew                    0.000     0.000     0.000    0 0.0000000
## purple-seed-stain                 0.000     0.000     0.000    0 0.0000000
## rhizoctonia-root-rot              0.000     0.000     0.000    0 0.0000000
##                             stem.cankers canker.lesion fruiting.bodies
## 2-4-d-injury                           1             1       1.0000000
## alternarialeaf-spot                    0             0       0.0000000
## anthracnose                            0             0       0.0000000
## bacterial-blight                       0             0       0.0000000
## bacterial-pustule                      0             0       0.0000000
## brown-spot                             0             0       0.0000000
## brown-stem-rot                         0             0       0.0000000
## charcoal-rot                           0             0       0.0000000
## cyst-nematode                          1             1       1.0000000
## diaporthe-pod-&-stem-blight            0             0       0.0000000
## diaporthe-stem-canker                  0             0       0.0000000
## downy-mildew                           0             0       0.0000000
## frog-eye-leaf-spot                     0             0       0.0000000
## herbicide-injury                       1             1       1.0000000
## phyllosticta-leaf-spot                 0             0       0.0000000
## phytophthora-rot                       0             0       0.7727273
## powdery-mildew                         0             0       0.0000000
## purple-seed-stain                      0             0       0.0000000
## rhizoctonia-root-rot                   0             0       0.0000000
##                             ext.decay mycelium int.discolor sclerotia
## 2-4-d-injury                        1        1            1         1
## alternarialeaf-spot                 0        0            0         0
## anthracnose                         0        0            0         0
## bacterial-blight                    0        0            0         0
## bacterial-pustule                   0        0            0         0
## brown-spot                          0        0            0         0
## brown-stem-rot                      0        0            0         0
## charcoal-rot                        0        0            0         0
## cyst-nematode                       1        1            1         1
## diaporthe-pod-&-stem-blight         0        0            0         0
## diaporthe-stem-canker               0        0            0         0
## downy-mildew                        0        0            0         0
## frog-eye-leaf-spot                  0        0            0         0
## herbicide-injury                    1        1            1         1
## phyllosticta-leaf-spot              0        0            0         0
## phytophthora-rot                    0        0            0         0
## powdery-mildew                      0        0            0         0
## purple-seed-stain                   0        0            0         0
## rhizoctonia-root-rot                0        0            0         0
##                             fruit.pods fruit.spots      seed mold.growth
## 2-4-d-injury                 1.0000000   1.0000000 1.0000000   1.0000000
## alternarialeaf-spot          0.0000000   0.0000000 0.0000000   0.0000000
## anthracnose                  0.0000000   0.0000000 0.0000000   0.0000000
## bacterial-blight             0.0000000   0.0000000 0.0000000   0.0000000
## bacterial-pustule            0.0000000   0.0000000 0.0000000   0.0000000
## brown-spot                   0.0000000   0.0000000 0.0000000   0.0000000
## brown-stem-rot               0.0000000   0.0000000 0.0000000   0.0000000
## charcoal-rot                 0.0000000   0.0000000 0.0000000   0.0000000
## cyst-nematode                0.0000000   1.0000000 0.0000000   0.0000000
## diaporthe-pod-&-stem-blight  0.0000000   0.0000000 0.0000000   0.0000000
## diaporthe-stem-canker        0.0000000   0.0000000 0.0000000   0.0000000
## downy-mildew                 0.0000000   0.0000000 0.0000000   0.0000000
## frog-eye-leaf-spot           0.0000000   0.0000000 0.0000000   0.0000000
## herbicide-injury             0.0000000   1.0000000 1.0000000   1.0000000
## phyllosticta-leaf-spot       0.0000000   0.0000000 0.0000000   0.0000000
## phytophthora-rot             0.7727273   0.7727273 0.7727273   0.7727273
## powdery-mildew               0.0000000   0.0000000 0.0000000   0.0000000
## purple-seed-stain            0.0000000   0.0000000 0.0000000   0.0000000
## rhizoctonia-root-rot         0.0000000   0.0000000 0.0000000   0.0000000
##                             seed.discolor seed.size shriveling roots
## 2-4-d-injury                    1.0000000 1.0000000  1.0000000     1
## alternarialeaf-spot             0.0000000 0.0000000  0.0000000     0
## anthracnose                     0.0000000 0.0000000  0.0000000     0
## bacterial-blight                0.0000000 0.0000000  0.0000000     0
## bacterial-pustule               0.0000000 0.0000000  0.0000000     0
## brown-spot                      0.0000000 0.0000000  0.0000000     0
## brown-stem-rot                  0.0000000 0.0000000  0.0000000     0
## charcoal-rot                    0.0000000 0.0000000  0.0000000     0
## cyst-nematode                   1.0000000 0.0000000  1.0000000     0
## diaporthe-pod-&-stem-blight     0.0000000 0.0000000  0.0000000     1
## diaporthe-stem-canker           0.0000000 0.0000000  0.0000000     0
## downy-mildew                    0.0000000 0.0000000  0.0000000     0
## frog-eye-leaf-spot              0.0000000 0.0000000  0.0000000     0
## herbicide-injury                1.0000000 1.0000000  1.0000000     0
## phyllosticta-leaf-spot          0.0000000 0.0000000  0.0000000     0
## phytophthora-rot                0.7727273 0.7727273  0.7727273     0
## powdery-mildew                  0.0000000 0.0000000  0.0000000     0
## purple-seed-stain               0.0000000 0.0000000  0.0000000     0
## rhizoctonia-root-rot            0.0000000 0.0000000  0.0000000     0

tapply(rowSums(is.na(Soybean)), 
       Soybean$Class, sum)/sum(is.na(Soybean))

##                2-4-d-injury         alternarialeaf-spot 
##                  0.19255456                  0.00000000 
##                 anthracnose            bacterial-blight 
##                  0.00000000                  0.00000000 
##           bacterial-pustule                  brown-spot 
##                  0.00000000                  0.00000000 
##              brown-stem-rot                charcoal-rot 
##                  0.00000000                  0.00000000 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                  0.14377407                  0.07573813 
##       diaporthe-stem-canker                downy-mildew 
##                  0.00000000                  0.00000000 
##          frog-eye-leaf-spot            herbicide-injury 
##                  0.00000000                  0.06846384 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                  0.00000000                  0.51946941 
##              powdery-mildew           purple-seed-stain 
##                  0.00000000                  0.00000000 
##        rhizoctonia-root-rot 
##                  0.00000000

The hail, sever, seed.tmt, and lodging variables are the most likely to be missing, each in about 17.7% of observations. Several variables are missing in fewer than 5% of observations: roots, temp, crop.hist, plant.growth, stem, date, and area.dam. Class and leaves have no missing values.

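Most classes (14 of 19) have no missing values, so the missing data are strongly related to class. Among the five classes that do, phytophthora-rot accounts for the largest share of all missing values (~52%), followed by 2-4-d-injury (~19%), cyst-nematode (~14%), diaporthe-pod-&-stem-blight (~8%), and herbicide-injury (~7%). Within 2-4-d-injury, most predictors are missing for every sample.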
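
The class pattern is also visible at the row level, by counting the samples in each class that have at least one missing value. This is a minimal sketch using base R's `complete.cases()`; the names `incomplete` and `by_class` are illustrative. The overall share of incomplete rows it reports appears to be what the "roughly 18%" in the question refers to.

```r
library(mlbench)
data(Soybean)

# A row is incomplete if at least one of its values is missing
incomplete <- !complete.cases(Soybean)

# Share of rows with any missing value
mean(incomplete)

# Number and proportion of incomplete rows within each class
by_class <- data.frame(
  n_rows       = as.vector(table(Soybean$Class)),
  n_incomplete = as.vector(tapply(incomplete, Soybean$Class, sum)),
  row.names    = levels(Soybean$Class)
)
by_class$prop_incomplete <- round(by_class$n_incomplete / by_class$n_rows, 2)
by_class[by_class$n_incomplete > 0, ]
```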

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The variables that were identified as degenerate (mycelium, sclerotia, and leaf.mild) can be eliminated. After eliminating them, I would use imputation to fill in the remaining missing data. Since the missing data are confined to 5 classes, it may be helpful to introduce a new variable that counts the missing values in each observation. It may also be worthwhile to remove the variables with the highest percentages of missing data (over 15%) to reduce noise and see whether that improves the outcome.
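
The strategy above could be sketched as follows. This is only an illustration under stated assumptions: mode imputation is swapped in as the simplest possible imputation method (a model-based method such as k-nearest neighbors would likely be preferable), and the names `soy`, `n_missing`, and `impute_mode` are hypothetical.

```r
library(mlbench)
data(Soybean)

# Predictors identified as degenerate earlier in this section
degenerate <- c("mycelium", "sclerotia", "leaf.mild")
soy <- Soybean[, !(names(Soybean) %in% degenerate)]

# Count of missing values per observation, recorded before imputing
soy$n_missing <- rowSums(is.na(soy))

# Mode imputation (assumed for simplicity): replace each NA with the
# most frequent observed level of that column
impute_mode <- function(x) {
  if (anyNA(x)) {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
soy[] <- lapply(soy, impute_mode)

# No missing values should remain
sum(is.na(soy))
```

Because the missing values are concentrated in a few classes, imputing with the overall mode pulls those classes toward the majority pattern; keeping `n_missing` as a predictor preserves some of that information.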