Homework 4

3.1

library(mlbench)
library(ggplot2)

library(corrplot)

## corrplot 0.95 loaded

library(e1071)

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:ggplot2':
## 
##     element

library(tidyr)
library(caret)

## Loading required package: lattice

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

#(a)

#Looking at the distributions

#RI - right skewed, has outliers
Glass |>
  ggplot(aes(x = RI)) + geom_histogram(bins = 15)

#Na - closer to normal, more centered but still has a right tail
Glass |>
  ggplot(aes(x = Na)) + geom_histogram(bins = 15)

#Mg - extreme outliers, left-skewed, but different from the others (bimodal)
Glass |>
  ggplot(aes(x = Mg)) + geom_histogram(bins = 15)

#Al - still has some outliers but closer to normal, if a bit right-skewed
Glass |>
  ggplot(aes(x = Al)) + geom_histogram(bins = 10)

#Si - closest to normal distribution so far
Glass |>
  ggplot(aes(x = Si)) + geom_histogram(bins = 10)

#K - extreme right-skewness, outliers
Glass |>
  ggplot(aes(x = K)) + geom_histogram(bins = 15)

#Ca - skewed to the right, a long tail of outliers on the right
Glass |>
  ggplot(aes(x = Ca)) + geom_histogram(bins = 15)

#Ba - extreme right-skewness, outliers
Glass |>
  ggplot(aes(x = Ba)) + geom_histogram(bins = 15)

#Fe - extreme right-skewness, outliers
Glass |>
  ggplot(aes(x = Fe)) + geom_histogram(bins = 15)

#Correlation matrix

glass_corr <- cor(Glass[, -10])
corrplot(glass_corr, method = "color", addCoef.col = "black")

My observations on the distributions are in the comments of the code chunk above. As for the correlations, the strongest one is the positive relationship between RI and Ca. Apart from that pair and the more moderate positive correlation between Al and Ba, there are few notable positive relationships. Interestingly, several pairs show negative correlations, which imply an inverse relationship: Si and RI, Mg and Al, and Ba and Mg, for example.
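Reading the strongest pairs off a correlation plot by eye is easy to get wrong, so a ranked list can help. The following is a minimal sketch in base R; `top_correlations()` is a hypothetical helper I made up for illustration, and the `toy` data frame is invented so that the block runs on its own.

```r
# Hypothetical helper: rank variable pairs by absolute correlation.
top_correlations <- function(df, n = 5) {
  cm <- cor(df)
  # Use the upper triangle so each pair appears only once
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first, regardless of sign
  pairs <- pairs[order(-abs(pairs$r)), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Toy data, made up purely for illustration (not the Glass data)
toy <- data.frame(
  a = 1:10,
  b = (1:10)^2,
  c = 10:1,
  d = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)
top_pairs <- top_correlations(toy, n = 3)
print(top_pairs)
```

With the data loaded above, calling `top_correlations(Glass[, -10])` should list the same pairs that stand out in the corrplot, with their signs.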

#(b)

#Getting the skewness
apply(Glass[, -10], 2, skewness)

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

#Visualize outliers with boxplots - better for outliers than histograms
Glass |>
  pivot_longer(-Type, names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = predictor, y = value)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free")

Yes, there are outliers in almost every predictor; Ba and K are especially extreme. This is a good example of why it helps to use more than one type of visualization: in the histograms, Mg appeared to have outliers, while the boxplot shows that it doesn’t really have them. Instead, its values fall into two separate clusters, like a bimodal distribution. More detailed descriptions of each distribution are already included in part (a).
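The boxplot's idea of an outlier can also be checked numerically. The sketch below relies on base R's `boxplot.stats()`, which flags points lying more than 1.5 times the interquartile range beyond the hinges. `count_outliers()` is a hypothetical helper name, and the two vectors are invented: one has a single extreme value, the other has two clusters like the pattern described for Mg.

```r
# Hypothetical helper: number of points a boxplot would draw as outliers
# (beyond 1.5 * IQR from the hinges, as computed by boxplot.stats()).
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Made-up vectors for illustration only
one_extreme  <- c(1:9, 100)               # one value far from the rest
two_clusters <- c(rep(0, 5), rep(10, 5))  # bimodal, but no isolated points

n_extreme  <- count_outliers(one_extreme)
n_clusters <- count_outliers(two_clusters)
print(c(one_extreme = n_extreme, two_clusters = n_clusters))
```

Applied to the real data, `sapply(Glass[, -10], count_outliers)` would give one count per predictor. The counts may differ slightly from what the ggplot2 boxplots show, because the two functions compute the quartiles in slightly different ways.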

As for the skewness, the textbook advises that the skewness statistic will be close to 0 if the distribution is roughly symmetric, positive if it is right-skewed, and negative if it is left-skewed, with larger magnitudes indicating stronger skew. Based on this, Na actually has the least skewed distribution (0.45) despite its outliers, so the other predictors could perhaps benefit from a transformation, depending on what the Box-Cox results show.
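To make the sign convention concrete, here is a small base R sketch of a sample skewness statistic: the third central moment divided by the cubed standard deviation. `sample_skewness()` is a hypothetical name, and the normalisation is an assumption on my part. I believe it corresponds to the default (`type = 3`) of `e1071::skewness()`, but other packages and the textbook use slightly different denominators, so small differences are expected.

```r
# Hypothetical helper: moment-based sample skewness.
# Assumption: third central moment (divided by n) over sd(x)^3,
# where sd() uses the usual n - 1 denominator.
sample_skewness <- function(x) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  m3 <- sum((x - mean(x))^3) / n
  m3 / sd(x)^3
}

# Made-up vectors for illustration only
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 1, 2, 10)  # long tail on the right
left_skewed  <- -right_skewed      # mirror image, long tail on the left

skew_values <- c(
  symmetric    = sample_skewness(symmetric),
  right_skewed = sample_skewness(right_skewed),
  left_skewed  = sample_skewness(left_skewed)
)
print(skew_values)
```

The symmetric vector gives a value of zero, and mirroring a vector flips the sign without changing the magnitude, which is the behaviour the interpretation above relies on.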

#(c)

lambda_K <- BoxCoxTrans(Glass$K)
lambda_Ca <- BoxCoxTrans(Glass$Ca)
lambda_RI <- BoxCoxTrans(Glass$RI)
lambda_Al <- BoxCoxTrans(Glass$Al)
lambda_Ba <- BoxCoxTrans(Glass$Ba)
lambda_Fe <- BoxCoxTrans(Glass$Fe)
lambda_Mg <- BoxCoxTrans(Glass$Mg)
lambda_Si <- BoxCoxTrans(Glass$Si)
lambda_Na <- BoxCoxTrans(Glass$Na)

print("Box-Cox results for K")

## [1] "Box-Cox results for K"

lambda_K

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 
## 
## Lambda could not be estimated; no transformation is applied

print("Box-Cox results for Ca")

## [1] "Box-Cox results for Ca"

lambda_Ca

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.430   8.240   8.600   8.957   9.172  16.190 
## 
## Largest/Smallest: 2.98 
## Sample Skewness: 2.02 
## 
## Estimated Lambda: -1.1

print("Box-Cox results for RI")

## [1] "Box-Cox results for RI"

lambda_RI

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.511   1.517   1.518   1.518   1.519   1.534 
## 
## Largest/Smallest: 1.02 
## Sample Skewness: 1.6 
## 
## Estimated Lambda: -2

print("Box-Cox results for Al")

## [1] "Box-Cox results for Al"

lambda_Al

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   1.190   1.360   1.445   1.630   3.500 
## 
## Largest/Smallest: 12.1 
## Sample Skewness: 0.895 
## 
## Estimated Lambda: 0.5

print("Box-Cox results for Ba")

## [1] "Box-Cox results for Ba"

lambda_Ba

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.175   0.000   3.150 
## 
## Lambda could not be estimated; no transformation is applied

print("Box-Cox results for Fe")

## [1] "Box-Cox results for Fe"

lambda_Fe

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000 
## 
## Lambda could not be estimated; no transformation is applied

print("Box-Cox results for Mg")

## [1] "Box-Cox results for Mg"

lambda_Mg

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.115   3.480   2.685   3.600   4.490 
## 
## Lambda could not be estimated; no transformation is applied

print("Box-Cox results for Si")

## [1] "Box-Cox results for Si"

lambda_Si

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   69.81   72.28   72.79   72.65   73.09   75.41 
## 
## Largest/Smallest: 1.08 
## Sample Skewness: -0.72 
## 
## Estimated Lambda: 2

print("Box-Cox results for Na")

## [1] "Box-Cox results for Na"

lambda_Na

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.73   12.91   13.30   13.41   13.82   17.38 
## 
## Largest/Smallest: 1.62 
## Sample Skewness: 0.448 
## 
## Estimated Lambda: -0.1 
## With fudge factor, Lambda = 0 will be used for transformations

Based on these results: Ca has a lambda of -1.1, meaning an inverse transformation can be used here. RI has a lambda of -2, but its largest/smallest ratio (1.02) is nowhere close to 20, so perhaps it isn’t the best candidate for a transformation. Al has a lambda of 0.5, meaning a square root transformation can be used. K, Ba, Fe, and Mg could not be transformed using Box-Cox because they contain zero values (each has a minimum of 0), and Box-Cox requires strictly positive data. For these predictors, the spatial sign transformation discussed in the textbook may be a more appropriate way to reduce the influence of extreme values. Si has a lambda of 2, which suggests a square transformation, though its skewness of -0.72 and its largest/smallest ratio of 1.08 are both well within the acceptable range based on the textbook, making a transformation less critical. Na has a lambda of -0.1, which caret rounds to 0 and which indicates a log transformation, but because Na showed no meaningful skewness (0.448), no transformation is recommended.
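The link between a lambda value and the named transformation (inverse, square root, log, square) follows from the Box-Cox formula itself. The sketch below writes that formula out in base R; `box_cox()` is a hypothetical helper, not caret's implementation, and the input values are invented.

```r
# Hypothetical helper: the Box-Cox family for strictly positive x.
#   lambda != 0: (x^lambda - 1) / lambda
#   lambda == 0: log(x)
box_cox <- function(x, lambda) {
  if (any(x <= 0, na.rm = TRUE)) {
    stop("Box-Cox requires strictly positive values")
  }
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Made-up values for illustration only
sqrt_like    <- box_cox(c(1, 4, 9), lambda = 0.5)  # rescaled square root
inverse_like <- box_cox(2, lambda = -1)            # rescaled 1 / x
log_case     <- box_cox(exp(1), lambda = 0)        # natural log

# A zero in the data makes the transformation undefined, which is why
# no lambda could be estimated for K, Ba, Fe, and Mg.
zero_fails <- inherits(try(box_cox(c(0, 1), 0.5), silent = TRUE), "try-error")

print(list(sqrt_like = sqrt_like, inverse_like = inverse_like,
           log_case = log_case, zero_fails = zero_fails))
```

With the objects fitted above, `predict(lambda_Ca, Glass$Ca)` should return the transformed Ca values using caret's own estimate, so the helper here is only meant to show what happens under the hood.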

3.2

data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

#(a)

nearZeroVar(Soybean, saveMetrics = TRUE)

##                  freqRatio percentUnique zeroVar   nzv
## Class             1.010989     2.7818448   FALSE FALSE
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE

Yes, we have some degenerate distributions. No predictor has exactly zero variance, but nearZeroVar flags three as near-zero variance: leaf.mild, with a frequency ratio of 26.75, meaning the most common value appears 26.75 times as often as the second most common; mycelium, with a frequency ratio of 106.5, which is extremely degenerate because one value dominates almost entirely; and sclerotia, with a frequency ratio of 31.25.

These near-zero variance predictors are problematic because one category overwhelmingly dominates, meaning they are unlikely to help the model.
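To show where the two columns in the table come from, here is a rough base R re-creation of the metrics. `nzv_metrics()` is a hypothetical helper, not caret's code. The cutoffs are the defaults I understand `nearZeroVar()` to use (`freqCut = 95/5`, `uniqueCut = 10`), and the two vectors are invented.

```r
# Hypothetical helper: frequency ratio and percent of unique values.
# A predictor is flagged when the ratio is above freq_cut AND the
# percent of unique values is below unique_cut.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  observed <- x[!is.na(x)]
  counts <- sort(as.vector(table(as.character(observed))), decreasing = TRUE)
  # Most common count divided by second most common count.
  # Simplification: Inf when only one value is observed.
  freq_ratio <- if (length(counts) > 1) counts[1] / counts[2] else Inf
  # Unique observed values as a percentage of all rows (NAs included)
  percent_unique <- 100 * length(unique(observed)) / length(x)
  data.frame(
    freq_ratio     = freq_ratio,
    percent_unique = percent_unique,
    nzv            = freq_ratio > freq_cut && percent_unique < unique_cut
  )
}

# Made-up predictors for illustration only
dominated <- c(rep("absent", 96), rep("present", 3), NA)
balanced  <- rep(c("a", "b"), 50)

dominated_metrics <- nzv_metrics(dominated)
balanced_metrics  <- nzv_metrics(balanced)
print(rbind(dominated = dominated_metrics, balanced = balanced_metrics))
```

Both toy predictors have the same small percentage of unique values, so it is the frequency ratio that separates them, just as it does for leaf.mild, mycelium, and sclerotia in the table above.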

#(b)

#Which predictors have the most missing values
missing_by_pred <- colSums(is.na(Soybean))
missing_by_pred[missing_by_pred > 0]

##            date     plant.stand          precip            temp            hail 
##               1              36              38              30             121 
##       crop.hist        area.dam           sever        seed.tmt            germ 
##              16               1             121             121             112 
##    plant.growth       leaf.halo       leaf.marg       leaf.size     leaf.shread 
##              16              84              84              84             100 
##       leaf.malf       leaf.mild            stem         lodging    stem.cankers 
##              84             108              16             121              38 
##   canker.lesion fruiting.bodies       ext.decay        mycelium    int.discolor 
##              38             106              38              38              38 
##       sclerotia      fruit.pods     fruit.spots            seed     mold.growth 
##              38              84             106              92              92 
##   seed.discolor       seed.size      shriveling           roots 
##             106              92             106              31

#Is missingness related to class
missing_by_class <- aggregate(is.na(Soybean[, -1]), 
                               by = list(Class = Soybean$Class), 
                               FUN = sum)
missing_by_class

##                          Class date plant.stand precip temp hail crop.hist
## 1                 2-4-d-injury    1          16     16   16   16        16
## 2          alternarialeaf-spot    0           0      0    0    0         0
## 3                  anthracnose    0           0      0    0    0         0
## 4             bacterial-blight    0           0      0    0    0         0
## 5            bacterial-pustule    0           0      0    0    0         0
## 6                   brown-spot    0           0      0    0    0         0
## 7               brown-stem-rot    0           0      0    0    0         0
## 8                 charcoal-rot    0           0      0    0    0         0
## 9                cyst-nematode    0          14     14   14   14         0
## 10 diaporthe-pod-&-stem-blight    0           6      0    0   15         0
## 11       diaporthe-stem-canker    0           0      0    0    0         0
## 12                downy-mildew    0           0      0    0    0         0
## 13          frog-eye-leaf-spot    0           0      0    0    0         0
## 14            herbicide-injury    0           0      8    0    8         0
## 15      phyllosticta-leaf-spot    0           0      0    0    0         0
## 16            phytophthora-rot    0           0      0    0   68         0
## 17              powdery-mildew    0           0      0    0    0         0
## 18           purple-seed-stain    0           0      0    0    0         0
## 19        rhizoctonia-root-rot    0           0      0    0    0         0
##    area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1         1    16       16   16           16      0         0         0
## 2         0     0        0    0            0      0         0         0
## 3         0     0        0    0            0      0         0         0
## 4         0     0        0    0            0      0         0         0
## 5         0     0        0    0            0      0         0         0
## 6         0     0        0    0            0      0         0         0
## 7         0     0        0    0            0      0         0         0
## 8         0     0        0    0            0      0         0         0
## 9         0    14       14   14            0      0        14        14
## 10        0    15       15    6            0      0        15        15
## 11        0     0        0    0            0      0         0         0
## 12        0     0        0    0            0      0         0         0
## 13        0     0        0    0            0      0         0         0
## 14        0     8        8    8            0      0         0         0
## 15        0     0        0    0            0      0         0         0
## 16        0    68       68   68            0      0        55        55
## 17        0     0        0    0            0      0         0         0
## 18        0     0        0    0            0      0         0         0
## 19        0     0        0    0            0      0         0         0
##    leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1          0          16         0        16   16      16           16
## 2          0           0         0         0    0       0            0
## 3          0           0         0         0    0       0            0
## 4          0           0         0         0    0       0            0
## 5          0           0         0         0    0       0            0
## 6          0           0         0         0    0       0            0
## 7          0           0         0         0    0       0            0
## 8          0           0         0         0    0       0            0
## 9         14          14        14        14    0      14           14
## 10        15          15        15        15    0      15            0
## 11         0           0         0         0    0       0            0
## 12         0           0         0         0    0       0            0
## 13         0           0         0         0    0       0            0
## 14         0           0         0         8    0       8            8
## 15         0           0         0         0    0       0            0
## 16        55          55        55        55    0      68            0
## 17         0           0         0         0    0       0            0
## 18         0           0         0         0    0       0            0
## 19         0           0         0         0    0       0            0
##    canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1             16              16        16       16           16        16
## 2              0               0         0        0            0         0
## 3              0               0         0        0            0         0
## 4              0               0         0        0            0         0
## 5              0               0         0        0            0         0
## 6              0               0         0        0            0         0
## 7              0               0         0        0            0         0
## 8              0               0         0        0            0         0
## 9             14              14        14       14           14        14
## 10             0               0         0        0            0         0
## 11             0               0         0        0            0         0
## 12             0               0         0        0            0         0
## 13             0               0         0        0            0         0
## 14             8               8         8        8            8         8
## 15             0               0         0        0            0         0
## 16             0              68         0        0            0         0
## 17             0               0         0        0            0         0
## 18             0               0         0        0            0         0
## 19             0               0         0        0            0         0
##    fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 1          16          16   16          16            16        16         16
## 2           0           0    0           0             0         0          0
## 3           0           0    0           0             0         0          0
## 4           0           0    0           0             0         0          0
## 5           0           0    0           0             0         0          0
## 6           0           0    0           0             0         0          0
## 7           0           0    0           0             0         0          0
## 8           0           0    0           0             0         0          0
## 9           0          14    0           0            14         0         14
## 10          0           0    0           0             0         0          0
## 11          0           0    0           0             0         0          0
## 12          0           0    0           0             0         0          0
## 13          0           0    0           0             0         0          0
## 14          0           8    8           8             8         8          8
## 15          0           0    0           0             0         0          0
## 16         68          68   68          68            68        68         68
## 17          0           0    0           0             0         0          0
## 18          0           0    0           0             0         0          0
## 19          0           0    0           0             0         0          0
##    roots
## 1     16
## 2      0
## 3      0
## 4      0
## 5      0
## 6      0
## 7      0
## 8      0
## 9      0
## 10    15
## 11     0
## 12     0
## 13     0
## 14     0
## 15     0
## 16     0
## 17     0
## 18     0
## 19     0

Soybean |>
  mutate(across(everything(), is.na)) |>
  pivot_longer(everything(), names_to = "variables", values_to = "missing") |>
  count(variables, missing) |>
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values", x = "Proportion") +
  scale_fill_manual(values = c("grey", "black"))

Soybean |>
  mutate(total_missing = rowSums(is.na(Soybean))) |>
  group_by(Class) |>
  summarise(missing = sum(total_missing)) |>
  ggplot(aes(y = reorder(Class, missing), x = missing)) +
  geom_col() +
  labs(title = "Total Missing Values by Class", x = "Total Missing", y = "Class")

Yes, certain predictors are more likely to be missing than others. hail, sever, seed.tmt, and lodging each have 121 missing values, suggesting they tend to be missing together rather than independently. Similarly, precip, stem.cankers, canker.lesion, ext.decay, mycelium, int.discolor, and sclerotia each have exactly 38 missing values, and the by-class table shows them missing in the same three classes (2-4-d-injury, cyst-nematode, and herbicide-injury), pointing to another cluster of co-missing predictors.
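Equal missing counts only suggest that predictors are missing together; checking whether the same rows are affected confirms it. The sketch below groups columns by their exact set of missing rows. `co_missing_groups()` is a hypothetical helper and `toy` is an invented data frame used so the block runs on its own.

```r
# Hypothetical helper: group columns that are missing in exactly the same rows.
co_missing_groups <- function(df) {
  na_mat <- is.na(df)
  # Keep only the columns that have at least one missing value
  na_mat <- na_mat[, colSums(na_mat) > 0, drop = FALSE]
  if (ncol(na_mat) == 0) return(list())
  # Describe each column by the indices of its missing rows
  patterns <- apply(na_mat, 2, function(col) paste(which(col), collapse = ","))
  # Columns sharing a pattern end up in the same group
  unname(split(names(patterns), patterns))
}

# Toy data, made up purely for illustration (not the Soybean data)
toy <- data.frame(
  a = c(NA, NA, 3, 4),
  b = c(NA, NA, "x", "y"),
  c = c(1, 2, NA, 4),
  d = 1:4
)
groups <- co_missing_groups(toy)
print(groups)
```

Running `co_missing_groups(Soybean)` on the real data would show whether the predictors with 121 or 38 missing values really share their missing rows, or only happen to share the total.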

These missing values are clearly related to specific classes. Only five classes account for all of the missing data: phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury. The remaining 14 classes have no missing values whatsoever. This is informative missingness as described in the textbook: the pattern of missing data is directly tied to the outcome rather than occurring at random, which is the most problematic kind of missing data to deal with.
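Totals per class do not show how much of each class is affected, which matters when deciding whether rows can be dropped. A minimal sketch of that check follows, using base R's `complete.cases()`. `incomplete_by_class()` is a hypothetical helper and the `toy` data frame is invented.

```r
# Hypothetical helper: per class, how many rows have at least one missing
# predictor value, and what share of the class that represents.
incomplete_by_class <- function(df, class_col) {
  predictors <- df[, setdiff(names(df), class_col), drop = FALSE]
  incomplete <- !complete.cases(predictors)
  n_total      <- tapply(incomplete, df[[class_col]], length)
  n_incomplete <- tapply(incomplete, df[[class_col]], sum)
  data.frame(
    Class        = names(n_total),
    n            = as.vector(n_total),
    n_incomplete = as.vector(n_incomplete),
    share        = as.vector(n_incomplete / n_total)
  )
}

# Toy data, made up purely for illustration (not the Soybean data)
toy <- data.frame(
  Class = c("A", "A", "A", "B", "B", "C"),
  x     = c(1, NA, 3, 4, 5, NA),
  y     = c("u", "v", NA, "w", "x", "y")
)
by_class <- incomplete_by_class(toy, "Class")
print(by_class)
```

On the real data, `incomplete_by_class(Soybean, "Class")` would show, for each of the five affected classes, whether every row is incomplete or only some of them.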

Given that the data are missing not at random (MNAR), imputation would be very problematic here because filling in the missing values with estimates from other predictors would ignore the fact that the missingness itself carries information. One strategy would be to remove the samples with missing data entirely if the goal is a clean dataset, since those five classes are the only source of missingness. However, this comes at a big cost: removing those samples means losing most or all of the observations from those five classes, which would leave the model with little or no ability to predict them. Alternatively, predictors with high missingness concentrated in those classes could be removed instead, which would be less destructive than losing entire classes, though we still risk losing possibly useful information. As another option, if it’s important to keep all the classes, the missing values could be treated as a separate category. For example, coding them as an additional level like “unknown” would keep the informative nature of the missingness rather than obscure it through imputation.
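The last option, coding missing values as their own category, can be sketched in a few lines of base R. `na_as_level()` is a hypothetical helper, the label "unknown" is just the example name used above, and `toy` is an invented data frame. One assumption to flag: the helper returns plain (unordered) factors, so any ordering on the original factor is dropped.

```r
# Hypothetical helper: turn NA into an explicit factor level.
# Note: the result is an unordered factor, even if the input was ordered.
na_as_level <- function(f, label = "unknown") {
  original_levels <- levels(f)
  values <- as.character(f)
  values[is.na(values)] <- label
  factor(values, levels = c(original_levels, label))
}

# Toy data, made up purely for illustration (not the Soybean data)
toy <- data.frame(
  Class  = factor(c("a", "a", "b")),
  hail   = factor(c("0", NA, "1")),
  leaves = factor(c("1", "1", "0"))
)

# Recode only the factor columns that actually contain missing values
recoded <- toy
recoded[] <- lapply(toy, function(col) {
  if (is.factor(col) && anyNA(col)) na_as_level(col) else col
})
print(recoded)
```

Columns without missing values are left untouched, so a predictor such as leaves would keep its original levels, while the affected predictors gain one extra level that the model can use as a signal.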

Hristiyana Yaneva

2026-02-26
