Data 624- HW 4

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
library(fpp2)
library(lattice)
library(tidyverse)
library(psych)
library(e1071)
library(car)
library(caret)
library(naniar)
library(VIM)
library(mice)
library(miceFast)

data(Glass)
describe(Glass)

##       vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## RI       1 214  1.52 0.00   1.52    1.52 0.00  1.51  1.53  0.02  1.60     4.72
## Na       2 214 13.41 0.82  13.30   13.38 0.64 10.73 17.38  6.65  0.45     2.90
## Mg       3 214  2.68 1.44   3.48    2.87 0.30  0.00  4.49  4.49 -1.14    -0.45
## Al       4 214  1.44 0.50   1.36    1.41 0.31  0.29  3.50  3.21  0.89     1.94
## Si       5 214 72.65 0.77  72.79   72.71 0.57 69.81 75.41  5.60 -0.72     2.82
## K        6 214  0.50 0.65   0.56    0.43 0.17  0.00  6.21  6.21  6.46    52.87
## Ca       7 214  8.96 1.42   8.60    8.74 0.66  5.43 16.19 10.76  2.02     6.41
## Ba       8 214  0.18 0.50   0.00    0.03 0.00  0.00  3.15  3.15  3.37    12.08
## Fe       9 214  0.06 0.10   0.00    0.04 0.00  0.00  0.51  0.51  1.73     2.52
## Type*   10 214  2.54 1.71   2.00    2.31 1.48  1.00  6.00  5.00  1.04    -0.29
##         se
## RI    0.00
## Na    0.06
## Mg    0.10
## Al    0.03
## Si    0.05
## K     0.04
## Ca    0.10
## Ba    0.03
## Fe    0.01
## Type* 0.12

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

pairs(~RI+Na+Mg+Al+Si+K+Ca+Ba+Fe,data=Glass,
   main="Glass ID ")
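As a numeric complement to the scatterplot matrix, the sketch below lists the predictor pairs whose absolute Pearson correlation exceeds a cutoff. The helper name high_cor_pairs, the 0.5 cutoff, and the toy data are assumptions made for illustration and are not part of the original analysis.

```r
# List variable pairs whose absolute Pearson correlation exceeds a cutoff.
# `high_cor_pairs` and `cutoff` are hypothetical names for this sketch.
high_cor_pairs <- function(df, cutoff = 0.5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Keep each pair once by looking only at the upper triangle.
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), , drop = FALSE]
}

# Toy data (made up): b is built from a, c is independent noise.
set.seed(1)
toy <- data.frame(a = rnorm(100))
toy$b <- 2 * toy$a + rnorm(100, sd = 0.1)
toy$c <- rnorm(100)

toy_pairs <- high_cor_pairs(toy, cutoff = 0.5)
print(toy_pairs)
```

With the packages above loaded, the same helper could be applied to the nine numeric predictors with high_cor_pairs(Glass[, 1:9]).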

ggplot(Glass) + 
  geom_point(mapping = aes(x = K, y = Ca, color = Type))

ggplot(Glass, aes(x = Si, y = Fe)) + 
  geom_point()

ggplot(Glass, aes(x=Type, y=Fe)) + 
    geom_boxplot(fill="slateblue", alpha=0.2) + 
    xlab("Type")

ggplot(Glass, aes(x=RI)) + 
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(Glass$RI)

## [1] 1.602715

ggplot(Glass, aes(x=Na)) + 
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(Glass$Na)

## [1] 0.4478343

ggplot(Glass, aes(x=Mg)) + 
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(Glass$Mg)

## [1] -1.136452

ggplot(Glass, aes(x=Al)) + 
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(Glass$Al)

## [1] 0.8946104

ggplot(Glass, aes(x=Si)) + 
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(Glass$Si)

## [1] -0.7202392

ggplot(Glass, aes(x=K)) + 
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(Glass$K)

## [1] 6.460089

ggplot(Glass, aes(x=Ca)) + 
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(Glass$Ca)

## [1] 2.018446

ggplot(Glass, aes(x=Ba)) + 
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(Glass$Ba)

## [1] 3.36868

ggplot(Glass, aes(x=Fe)) + 
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

skewness(Glass$Fe)

## [1] 1.729811

Do there appear to be any outliers in the data? Are any predictors skewed?

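Yes on both counts. The summary statistics and plots above suggest a few extreme points, most clearly in K (maximum 6.21 against a median of 0.56), Ba (maximum 3.15 against a median of 0), and Ca (maximum 16.19 against a median of 8.60). Several predictors are also skewed: K (6.46), Ba (3.37), Ca (2.02), Fe (1.73), and RI (1.60) are strongly right-skewed and Mg (-1.14) is left-skewed, while Na (0.45), Al (0.89), and Si (-0.72) are only mildly to moderately skewed.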
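The skewness and outlier checks can also be run for every predictor in one pass. The sketch below is a minimal base R version: skew() follows the m3 / s^3 definition that e1071 documents as its default (type 3), and n_outliers() applies the usual 1.5 x IQR boxplot rule. Both helper names and the toy data are assumptions made for illustration.

```r
# Sample skewness as m3 / s^3, with s the sample standard deviation.
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Count points lying more than 1.5 * IQR outside the quartiles.
n_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Toy data (made up): one symmetric column, one with a long right tail.
toy <- data.frame(
  sym  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  tail = c(1, 1, 1, 2, 2, 2, 3, 3, 40)
)

summary_tbl <- data.frame(
  skewness = sapply(toy, skew),
  outliers = sapply(toy, n_outliers)
)
print(round(summary_tbl, 2))
```

For the glass data, replacing toy with Glass[, 1:9] would give one row per predictor. The 1.5 x IQR rule is only one convention for flagging outliers, so the counts are a screening aid rather than a verdict.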

Are there any relevant transformations of one or more predictors that might improve the classification model?

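Yes. A power transformation can reduce the skewness noted above. The Box-Cox transformation requires strictly positive values, and Mg, K, Ba, and Fe all contain zeros, so the call below uses the Yeo-Johnson family (family="yjPower"), which also accepts zero and negative values. The powerTransform() function from the car package estimates the transformation parameter for each predictor by maximum likelihood. Its summary reports the estimate together with a convenient rounded value, which is used when it lies within roughly 1.96 standard errors of the maximum likelihood estimate; otherwise the estimate itself is shown. The rounded powers of 0 for K and 0.5 for Ca correspond to log-like and square-root-like transformations, while the extreme estimates for RI, Si, Ba, and Fe should be treated with caution.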

summary(powerTransform(Glass[,1:9], family="yjPower"))$result[,1:2]

##      Est Power Rounded Pwr
## RI -25.0853114      -25.09
## Na   1.3755562        1.00
## Mg   1.7699080        2.00
## Al   0.9773267        1.00
## Si  10.9452696       10.95
## K   -0.1441078        0.00
## Ca   0.6774333        0.50
## Ba  -6.8620464       -6.86
## Fe -14.9245600      -14.92
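To show what an estimated power does to the data, the sketch below writes out the Yeo-Johnson transformation in base R and applies it to a small made-up vector that contains zeros. The function name yeo_johnson and the toy values are assumptions made for illustration; this is a hand-written version of the standard formula, not the code that car runs internally.

```r
# Yeo-Johnson transformation for a single power `lambda`.
# Non-negative values use a shifted Box-Cox form; negative values use
# the mirrored form with power 2 - lambda.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])                 # limit as lambda -> 0
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log1p(-y[!pos])             # limit as lambda -> 2
  }
  out
}

# Toy right-skewed values including zeros (made up).
x <- c(0, 0, 0.1, 0.2, 0.5, 1, 2, 6)
x_log  <- yeo_johnson(x, 0)     # same as log(x + 1)
x_sqrt <- yeo_johnson(x, 0.5)   # square-root-like
print(round(cbind(x, x_log, x_sqrt), 3))
```

With lambda = 1 the transformation leaves the values unchanged, which is why a rounded power of 1 (as for Na and Al) means no transformation is suggested. In practice, car::yjPower() or caret's preProcess() with method = "YeoJohnson" would normally be used instead of a hand-written function.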

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

describe(Soybean)

##                  vars   n mean   sd median trimmed  mad min max range  skew
## Class*              1 683 9.30 5.51      8    9.18 7.41   1  19    18  0.11
## date*               2 682 4.55 1.69      5    4.62 1.48   1   7     6 -0.30
## plant.stand*        3 647 1.45 0.50      1    1.44 0.00   1   2     1  0.19
## precip*             4 645 2.60 0.69      3    2.74 0.00   1   3     2 -1.42
## temp*               5 653 2.18 0.63      2    2.23 0.00   1   3     2 -0.16
## hail*               6 562 1.23 0.42      1    1.16 0.00   1   2     1  1.31
## crop.hist*          7 667 2.88 0.98      3    2.98 1.48   1   4     3 -0.40
## area.dam*           8 682 2.58 1.07      2    2.60 1.48   1   4     3  0.02
## sever*              9 562 1.73 0.60      2    1.69 0.00   1   3     2  0.17
## seed.tmt*          10 562 1.52 0.61      1    1.45 0.00   1   3     2  0.74
## germ*              11 571 2.05 0.79      2    2.06 1.48   1   3     2 -0.09
## plant.growth*      12 667 1.34 0.47      1    1.30 0.00   1   2     1  0.68
## leaves*            13 683 1.89 0.32      2    1.98 0.00   1   2     1 -2.44
## leaf.halo*         14 599 2.20 0.95      3    2.25 0.00   1   3     2 -0.41
## leaf.marg*         15 599 1.77 0.96      1    1.72 0.00   1   3     2  0.46
## leaf.size*         16 599 2.28 0.61      2    2.34 0.00   1   3     2 -0.25
## leaf.shread*       17 583 1.16 0.37      1    1.08 0.00   1   2     1  1.80
## leaf.malf*         18 599 1.08 0.26      1    1.00 0.00   1   2     1  3.22
## leaf.mild*         19 575 1.10 0.40      1    1.00 0.00   1   3     2  3.95
## stem*              20 667 1.56 0.50      2    1.57 0.00   1   2     1 -0.23
## lodging*           21 562 1.07 0.26      1    1.00 0.00   1   2     1  3.23
## stem.cankers*      22 645 2.06 1.35      1    1.95 0.00   1   4     3  0.61
## canker.lesion*     23 645 1.98 1.08      2    1.85 1.48   1   4     3  0.51
## fruiting.bodies*   24 577 1.18 0.38      1    1.10 0.00   1   2     1  1.66
## ext.decay*         25 645 1.25 0.48      1    1.16 0.00   1   3     2  1.70
## mycelium*          26 645 1.01 0.10      1    1.00 0.00   1   2     1 10.20
## int.discolor*      27 645 1.13 0.42      1    1.00 0.00   1   3     2  3.34
## sclerotia*         28 645 1.03 0.17      1    1.00 0.00   1   2     1  5.40
## fruit.pods*        29 599 1.50 0.88      1    1.28 0.00   1   4     3  1.84
## fruit.spots*       30 577 1.85 1.17      1    1.69 0.00   1   4     3  0.95
## seed*              31 591 1.19 0.40      1    1.12 0.00   1   2     1  1.54
## mold.growth*       32 591 1.11 0.32      1    1.02 0.00   1   2     1  2.43
## seed.discolor*     33 577 1.11 0.31      1    1.02 0.00   1   2     1  2.47
## seed.size*         34 591 1.10 0.30      1    1.00 0.00   1   2     1  2.66
## shriveling*        35 577 1.07 0.25      1    1.00 0.00   1   2     1  3.49
## roots*             36 652 1.18 0.44      1    1.07 0.00   1   3     2  2.46
##                  kurtosis   se
## Class*              -1.38 0.21
## date*               -0.90 0.06
## plant.stand*        -1.97 0.02
## precip*              0.55 0.03
## temp*               -0.58 0.02
## hail*               -0.29 0.02
## crop.hist*          -0.92 0.04
## area.dam*           -1.29 0.04
## sever*              -0.56 0.03
## seed.tmt*           -0.44 0.03
## germ*               -1.40 0.03
## plant.growth*       -1.54 0.02
## leaves*              3.98 0.01
## leaf.halo*          -1.76 0.04
## leaf.marg*          -1.75 0.04
## leaf.size*          -0.63 0.02
## leaf.shread*         1.26 0.02
## leaf.malf*           8.35 0.01
## leaf.mild*          14.68 0.02
## stem*               -1.95 0.02
## lodging*             8.42 0.01
## stem.cankers*       -1.51 0.05
## canker.lesion*      -1.24 0.04
## fruiting.bodies*     0.75 0.02
## ext.decay*           1.98 0.02
## mycelium*          102.18 0.00
## int.discolor*       10.57 0.02
## sclerotia*          27.19 0.01
## fruit.pods*          2.41 0.04
## fruit.spots*        -0.76 0.05
## seed*                0.37 0.02
## mold.growth*         3.93 0.01
## seed.discolor*       4.12 0.01
## seed.size*           5.10 0.01
## shriveling*         10.21 0.01
## roots*               5.49 0.02

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

soy1 <- Soybean[,2:36]

nearZeroVar(soy1, names = TRUE)

## [1] "leaf.mild" "mycelium"  "sclerotia"

The nearZeroVar() function from the caret package detects degenerate predictors. A predictor has a degenerate (near-zero variance) distribution when the number of unique values is small relative to the number of samples and the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large. By these criteria, the degenerate predictors are leaf.mild, mycelium, and sclerotia.
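The two criteria can be computed by hand to see why a predictor is flagged. The sketch below is a simplified base R approximation, assuming cutoffs of 95/5 for the frequency ratio and 10 for the percentage of unique values (the defaults documented for nearZeroVar()). The helper name nzv_metrics and the toy factors are hypothetical, and details such as the handling of missing values may differ from caret.

```r
# Frequency ratio and percent-unique for one predictor, plus a flag
# when both near-zero-variance criteria are met.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  n_total <- length(x)
  counts <- sort(table(x[!is.na(x)]), decreasing = TRUE)
  counts <- counts[counts > 0]              # drop unused factor levels
  # Ratio of the most common value to the second most common value.
  freq_ratio <- if (length(counts) >= 2) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / n_total
  data.frame(
    freq_ratio = freq_ratio,
    pct_unique = pct_unique,
    nzv = freq_ratio > freq_cut & pct_unique < unique_cut
  )
}

# Toy predictors (made up): one dominated by a single level, one balanced.
toy <- data.frame(
  rare     = factor(c(rep("0", 98), rep("1", 2))),
  balanced = factor(rep(c("0", "1"), 50))
)

toy_result <- do.call(rbind, lapply(toy, nzv_metrics))
print(toy_result)
```

Calling nearZeroVar(soy1, saveMetrics = TRUE) returns caret's own version of these metrics for every predictor, which is the more reliable way to inspect the flagged columns.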

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

gg_miss_var(Soybean)

res<-summary(aggr(Soybean, sortVar=TRUE))$combinations

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000

vis_miss(Soybean, sort_miss=TRUE)

Soybean %>%
  group_by(Class) %>%
  miss_var_summary() %>%
  ggplot(aes(Class, variable, fill=pct_miss)) + geom_tile() +scale_fill_gradient(low="blue", high="yellow") +
  theme(axis.text.x=element_text(angle=90,hjust=1))

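The heat map above shows the pattern of missing data within the Soybean dataset by class. Tiles are colored by pct_miss, so yellow marks predictors with a high percentage of missing values in a given class. As the sorted table shows, hail, sever, seed.tmt, and lodging are missing most often (about 17.7% of samples each), while Class and leaves are complete. Missing data that is concentrated in particular classes can introduce bias in our models.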
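A compact way to check whether missingness depends on the class is to compute, for each class, the proportion of samples with at least one missing predictor. The sketch below does this in base R on a small made-up data frame. The helper name incomplete_by_class and the toy data are assumptions made for illustration.

```r
# Proportion of rows with at least one missing predictor, per class.
incomplete_by_class <- function(df, class_col) {
  predictors <- df[, setdiff(names(df), class_col), drop = FALSE]
  has_na <- !complete.cases(predictors)
  sort(tapply(has_na, df[[class_col]], mean), decreasing = TRUE)
}

# Toy data (made up): only class B has missing values.
toy <- data.frame(
  Class = factor(c("A", "A", "B", "B", "C", "C")),
  x1    = c(1, 2, NA, NA, 5, 6),
  x2    = factor(c("u", "v", "u", NA, "v", "u"))
)

toy_missing <- incomplete_by_class(toy, "Class")
print(toy_missing)
```

Applied as incomplete_by_class(Soybean, "Class"), values near 1 for a few classes and 0 for the rest would indicate that the missingness is informative about the class rather than random.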

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

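Imputing with the mean, median, or mode is an easy way to fill in the missing values; however, it reduces the variance in the dataset and shrinks standard errors. The call below instead uses mice() with random-forest imputation (method="rf"), which predicts each missing value from the other variables. With m=2, two imputed data sets are generated, and complete() returns the first one by default.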
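The loss of variability under simple imputation is easy to see on a categorical variable. The sketch below fills missing values in a factor with its most frequent level and compares the level counts before and after. The helper name impute_mode and the toy factor are assumptions made for illustration; this is the simple baseline discussed above, not the random-forest approach.

```r
# Replace missing values in a factor with its most frequent level.
impute_mode <- function(x) {
  stopifnot(is.factor(x))
  counts <- table(x)                      # table() ignores NA by default
  mode_value <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_value
  x
}

# Toy factor (made up) with two missing values.
toy <- factor(c("0", "1", "1", NA, "1", "2", NA, "0"))
filled <- impute_mode(toy)

print(table(toy, useNA = "ifany"))
print(table(filled))
```

Every missing value lands on the level that was already the most common, so that level's share grows and the distribution becomes more concentrated than the observed data support.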

imputed = mice(Soybean, method="rf", m=2)

## 
##  iter imp variable
##   1   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   1   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   2   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   2   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   3   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   3   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   4   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   4   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   5   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   5   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots

## Warning: Number of logged events: 340

imputed <- complete(imputed)
imputed <- as.data.frame(imputed)
gg_miss_var(imputed)

Amanda Arce

3/1/2021