## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──

## ✔ tibble      3.1.8     ✔ tsibble     1.1.3
## ✔ dplyr       1.1.0     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.0     ✔ feasts      0.3.0
## ✔ lubridate   1.9.2     ✔ fable       0.3.2
## ✔ ggplot2     3.4.1     ✔ fabletools  0.3.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## corrplot 0.92 loaded

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

A

Q: Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library('mlbench')
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

head(Glass)

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

Glass %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

Glass %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Glass %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot()

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")

A: The histograms and box plots show that none of the predictors is perfectly normal: Al, Ba, Ca, Fe, K, Na, and RI are right-skewed, while Mg and Si are left-skewed. As for correlation, the strongest positive relationship is between Ca and RI, followed by Al and Ba, Al and K, and Ba and Na. The most common types of glass are 1 and 2.
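The visual impressions above can be backed up with numbers. The following is a minimal base R sketch, not part of the original analysis: `skewness`, `skew_by_column`, and `ranked_correlations` are helper names made up for illustration, and the toy data frame is invented rather than taken from Glass.

```r
# Sample skewness: third central moment divided by the cubed standard deviation
# (moment-based form; other definitions differ by a small-sample factor).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Skewness for every numeric column of a data frame.
skew_by_column <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  vapply(num, skewness, numeric(1))
}

# Pairwise correlations as a long table, largest absolute value first.
ranked_correlations <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  cm <- cor(num)
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Toy data (made up for illustration, not the Glass values)
toy <- data.frame(
  a = c(1, 2, 3, 4, 20),
  b = c(2, 4, 6, 8, 40),
  c = c(5, 4, 3, 2, 1)
)
print(skew_by_column(toy))
print(head(ranked_correlations(toy), 3))
```

With the data loaded as above, `skew_by_column(Glass)` and `ranked_correlations(Glass)` should work the same way, since both helpers skip the non-numeric Type column. A positive skewness suggests a right-skewed predictor and a negative value a left-skewed one.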

B

Q: Do there appear to be any outliers in the data? Are any predictors skewed?

A: Yes. As noted in the answer above, several predictors are skewed. The boxplots show the most outliers for Ca, Ba, Al, Fe, and RI; the remaining predictors have fewer.

C

Q: Are there any relevant transformations of one or more predictors that might improve the classification model?

A: Yes. For the skewed predictors we could apply a Box-Cox or log transformation, and for the predictors with outliers we could apply a spatial sign transformation. To cut down on noise we could also run PCA; while it is not a transformation of individual predictors, it reduces the dimension of the data, which can make the results easier to interpret and may improve the model.
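As a rough illustration of two of the transformations mentioned above, here is a base R sketch. The helper names `box_cox` and `spatial_sign` and the toy data are assumptions for illustration only. Note that Box-Cox requires strictly positive values, so predictors that contain zeros (Ba and Fe do, per the output above) would need a shift or a different transform such as `log1p()`.

```r
# Box-Cox transform for a fixed lambda; requires strictly positive x.
# lambda = 0 is the log transform.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Spatial sign: center and scale each column, then divide every row by its
# Euclidean norm so that all samples lie on the unit sphere.
spatial_sign <- function(df) {
  z <- scale(as.matrix(df))
  norms <- sqrt(rowSums(z^2))
  norms[norms == 0] <- 1  # leave an all-zero row unchanged
  z / norms
}

# Toy data (made up for illustration): x has one extreme value
toy <- data.frame(x = c(1, 2, 3, 4, 50), y = c(10, 12, 11, 13, 12))
print(box_cox(toy$x, lambda = 0))
ss <- spatial_sign(toy)
print(rowSums(ss^2))
```

In this sketch lambda is supplied by hand; in practice it would be estimated from the data. The spatial sign step should be applied to all predictors together, after centering and scaling, because it works on whole rows rather than single columns.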

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

A

Q: Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)

head(Soybean)

##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0

Soybean %>%
  select(-Class)%>%
  drop_na() %>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_bar()+
  facet_wrap(~ key)

## Warning: attributes are not identical across measure variables; they will be
## dropped

A: Bar charts help identify categorical predictors with degenerate distributions. In this plot I also removed all rows with NAs, since including them made some predictors look like they had more levels than they do. Degenerate distributions are ones with a “handful of unique values that occur with very low frequencies.” The predictor mycelium appears to be degenerate and should be removed.
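The visual check can be complemented with two simple numbers per predictor: the ratio of the most common value's frequency to the second most common, and the percentage of unique values. The base R sketch below is an illustration only; the function names and the default cutoffs (`freq_cut = 20`, `unique_cut = 10`) are assumptions chosen as a rule of thumb and can be adjusted.

```r
# For one predictor: frequency ratio of the two most common values and the
# percentage of distinct values among the non-missing samples.
degeneracy_metrics <- function(x) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  counts <- counts[counts > 0]  # ignore unused factor levels
  freq_ratio <- if (length(counts) < 2) Inf else counts[[1]] / counts[[2]]
  c(freq_ratio = freq_ratio, pct_unique = 100 * length(counts) / length(x))
}

# Flag predictors with a large frequency ratio and few unique values.
flag_degenerate <- function(df, freq_cut = 20, unique_cut = 10) {
  m <- t(vapply(df, degeneracy_metrics, c(freq_ratio = 0, pct_unique = 0)))
  data.frame(
    predictor = rownames(m),
    freq_ratio = m[, "freq_ratio"],
    pct_unique = m[, "pct_unique"],
    degenerate = m[, "freq_ratio"] > freq_cut & m[, "pct_unique"] < unique_cut,
    row.names = NULL
  )
}

# Toy data (made up for illustration, not the Soybean values)
toy <- data.frame(
  balanced = factor(rep(c("0", "1"), each = 50)),
  lopsided = factor(c(rep("0", 97), rep("1", 3)))
)
print(flag_degenerate(toy))
```

With the data loaded as above, `flag_degenerate(Soybean[, -1])` would apply the same check to the soybean predictors (the first column is Class). Unlike the plot, this sketch drops missing values per predictor rather than dropping whole rows.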

B

Q: Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

sort(apply(Soybean, 2, function(col)sum(is.na(col))/length(col)), decreasing =  T)

##            hail           sever        seed.tmt         lodging            germ 
##     0.177159590     0.177159590     0.177159590     0.177159590     0.163982430 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##     0.158125915     0.155197657     0.155197657     0.155197657     0.155197657 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##     0.146412884     0.134699854     0.134699854     0.134699854     0.122986823 
##       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
##     0.122986823     0.122986823     0.122986823     0.122986823     0.055636896 
##    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
##     0.055636896     0.055636896     0.055636896     0.055636896     0.055636896 
##       sclerotia     plant.stand           roots            temp       crop.hist 
##     0.055636896     0.052708638     0.045387994     0.043923865     0.023426061 
##    plant.growth            stem            date        area.dam           Class 
##     0.023426061     0.023426061     0.001464129     0.001464129     0.000000000 
##          leaves 
##     0.000000000

Soybean %>%
   filter_all(any_vars(is.na(.))) %>%
   select(Class) %>%
   group_by(Class) %>%
   summarise(count = n())

## # A tibble: 5 × 2
##   Class                       count
##   <fct>                       <int>
## 1 2-4-d-injury                   16
## 2 cyst-nematode                  14
## 3 diaporthe-pod-&-stem-blight    15
## 4 herbicide-injury                8
## 5 phytophthora-rot               68

A: The predictors with the most missing data are hail, sever, seed.tmt, and lodging (about 17.7% each), followed by germ, leaf.mild, and others. All of the rows containing NAs belong to the five classes above, with phytophthora-rot contributing the most (68 rows).
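Raw counts of incomplete rows are easier to judge next to the size of each class. The base R sketch below is one way to put the two side by side; `incomplete_by_class` is a made-up helper name and the toy data frame is invented for illustration.

```r
# Share of rows with at least one missing value, split by class.
incomplete_by_class <- function(df, class_col) {
  incomplete <- !complete.cases(df)
  n_total <- tapply(incomplete, df[[class_col]], length)
  n_incomplete <- tapply(incomplete, df[[class_col]], sum)
  out <- data.frame(
    class = names(n_total),
    n_total = as.integer(n_total),
    n_incomplete = as.integer(n_incomplete),
    stringsAsFactors = FALSE
  )
  out$share_incomplete <- out$n_incomplete / out$n_total
  out[order(-out$share_incomplete), ]
}

# Toy data (made up for illustration, not the Soybean values)
toy <- data.frame(
  Class = c("a", "a", "a", "b", "b", "c"),
  x = c(1, NA, 3, NA, NA, 6),
  y = c("u", "v", NA, "w", "x", "y")
)
print(incomplete_by_class(toy, "Class"))
```

With the data loaded as above, `incomplete_by_class(Soybean, "Class")` would list every class, including those with no missing values, which makes it clear whether missingness is concentrated in a few classes.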

C

Q: Develop a strategy for handling missing data, either by eliminating predictors or imputation.

soy <- Soybean %>%
   filter_all(any_vars(is.na(.))) %>%
   group_by(Class)

soy_name <- as.character(unique(soy$Class))



# For each class with missing data, print the share of NAs per predictor
for (i in soy_name) {
  class_rows <- Soybean %>%
    filter(Class == i)

  print(i)
  print(sort(apply(class_rows, 2, function(col) sum(is.na(col)) / length(col)), decreasing = TRUE))
}

## [1] "phytophthora-rot"
##            hail           sever        seed.tmt            germ         lodging 
##       0.7727273       0.7727273       0.7727273       0.7727273       0.7727273 
## fruiting.bodies      fruit.pods     fruit.spots            seed     mold.growth 
##       0.7727273       0.7727273       0.7727273       0.7727273       0.7727273 
##   seed.discolor       seed.size      shriveling       leaf.halo       leaf.marg 
##       0.7727273       0.7727273       0.7727273       0.6250000       0.6250000 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild           Class 
##       0.6250000       0.6250000       0.6250000       0.6250000       0.0000000 
##            date     plant.stand          precip            temp       crop.hist 
##       0.0000000       0.0000000       0.0000000       0.0000000       0.0000000 
##        area.dam    plant.growth          leaves            stem    stem.cankers 
##       0.0000000       0.0000000       0.0000000       0.0000000       0.0000000 
##   canker.lesion       ext.decay        mycelium    int.discolor       sclerotia 
##       0.0000000       0.0000000       0.0000000       0.0000000       0.0000000 
##           roots 
##       0.0000000 
## [1] "diaporthe-pod-&-stem-blight"
##            hail           sever        seed.tmt       leaf.halo       leaf.marg 
##             1.0             1.0             1.0             1.0             1.0 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild         lodging 
##             1.0             1.0             1.0             1.0             1.0 
##           roots     plant.stand            germ           Class            date 
##             1.0             0.4             0.4             0.0             0.0 
##          precip            temp       crop.hist        area.dam    plant.growth 
##             0.0             0.0             0.0             0.0             0.0 
##          leaves            stem    stem.cankers   canker.lesion fruiting.bodies 
##             0.0             0.0             0.0             0.0             0.0 
##       ext.decay        mycelium    int.discolor       sclerotia      fruit.pods 
##             0.0             0.0             0.0             0.0             0.0 
##     fruit.spots            seed     mold.growth   seed.discolor       seed.size 
##             0.0             0.0             0.0             0.0             0.0 
##      shriveling 
##             0.0 
## [1] "cyst-nematode"
##     plant.stand          precip            temp            hail           sever 
##               1               1               1               1               1 
##        seed.tmt            germ       leaf.halo       leaf.marg       leaf.size 
##               1               1               1               1               1 
##     leaf.shread       leaf.malf       leaf.mild         lodging    stem.cankers 
##               1               1               1               1               1 
##   canker.lesion fruiting.bodies       ext.decay        mycelium    int.discolor 
##               1               1               1               1               1 
##       sclerotia     fruit.spots   seed.discolor      shriveling           Class 
##               1               1               1               1               0 
##            date       crop.hist        area.dam    plant.growth          leaves 
##               0               0               0               0               0 
##            stem      fruit.pods            seed     mold.growth       seed.size 
##               0               0               0               0               0 
##           roots 
##               0 
## [1] "2-4-d-injury"
##     plant.stand          precip            temp            hail       crop.hist 
##          1.0000          1.0000          1.0000          1.0000          1.0000 
##           sever        seed.tmt            germ    plant.growth     leaf.shread 
##          1.0000          1.0000          1.0000          1.0000          1.0000 
##       leaf.mild            stem         lodging    stem.cankers   canker.lesion 
##          1.0000          1.0000          1.0000          1.0000          1.0000 
## fruiting.bodies       ext.decay        mycelium    int.discolor       sclerotia 
##          1.0000          1.0000          1.0000          1.0000          1.0000 
##      fruit.pods     fruit.spots            seed     mold.growth   seed.discolor 
##          1.0000          1.0000          1.0000          1.0000          1.0000 
##       seed.size      shriveling           roots            date        area.dam 
##          1.0000          1.0000          1.0000          0.0625          0.0625 
##           Class          leaves       leaf.halo       leaf.marg       leaf.size 
##          0.0000          0.0000          0.0000          0.0000          0.0000 
##       leaf.malf 
##          0.0000 
## [1] "herbicide-injury"
##          precip            hail           sever        seed.tmt            germ 
##               1               1               1               1               1 
##       leaf.mild         lodging    stem.cankers   canker.lesion fruiting.bodies 
##               1               1               1               1               1 
##       ext.decay        mycelium    int.discolor       sclerotia     fruit.spots 
##               1               1               1               1               1 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##               1               1               1               1               1 
##           Class            date     plant.stand            temp       crop.hist 
##               0               0               0               0               0 
##        area.dam    plant.growth          leaves       leaf.halo       leaf.marg 
##               0               0               0               0               0 
##       leaf.size     leaf.shread       leaf.malf            stem      fruit.pods 
##               0               0               0               0               0 
##           roots 
##               0

A: For the classes in which some predictors are 100% missing, there is nothing within the class to impute from, so we should remove the affected data rather than impute it. For “phytophthora-rot”, we could impute the values, since no predictor is 100% missing.
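One simple way to combine the two options, eliminating predictors and imputing, is sketched below in base R. Mode imputation is a stand-in chosen for illustration, not a method taken from the analysis above; the helper names and the `max_missing` cutoff are assumptions.

```r
# Most frequent non-missing value of a vector, returned as character.
mode_value <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Drop predictors whose missing share exceeds max_missing, then fill the
# remaining gaps in factor/character columns with the column mode.
drop_then_impute_mode <- function(df, max_missing = 0.5) {
  stopifnot(max_missing < 1)  # a fully missing column has no mode
  keep <- colMeans(is.na(df)) <= max_missing
  out <- df[, keep, drop = FALSE]
  for (nm in names(out)) {
    col <- out[[nm]]
    if ((is.factor(col) || is.character(col)) && anyNA(col)) {
      col[is.na(col)] <- mode_value(col)
      out[[nm]] <- col
    }
  }
  out
}

# Toy data (made up for illustration, not the Soybean values)
toy <- data.frame(
  mostly_missing = factor(c(NA, NA, NA, "1")),
  some_missing = factor(c("0", "0", NA, "1")),
  complete = c("a", "b", "a", "b")
)
imputed <- drop_then_impute_mode(toy, max_missing = 0.5)
print(imputed)
```

This sketch imputes over the whole data set. Imputing within a class, as suggested for “phytophthora-rot”, would mean applying the same helper to that class's rows only, which uses the outcome label and so should be done with care.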

Data Preprocessing/Overfitting

2023-02-25
