Exercise 3.1: Glass dataset

The Glass dataset includes 214 observations of 10 variables. The variables include:

  • 9 predictor variables (RI through Fe), all of which are numeric
  • 1 target variable (Type) which is categorical and takes the values (1, 2, 3, 5, 6, 7).
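library(mlbench)   # Glass and Soybean datasets
library(ggplot2)   # ggplot
library(GGally)    # ggpairs
library(corrplot)  # corrplot
library(e1071)     # skewness (assumed source; the original does not show which package supplies it)
library(caret)     # BoxCoxTrans, nearZeroVar
library(dplyr)     # %>%, mutate, select
library(knitr)     # kable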
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29
attach(Glass)

a. Distributions and variable relationships

To start, we can use the ggpairs function from the GGally package to visualize the distributions of the 9 predictor variables as well as their bivariate scatterplots and correlations. In this chart, the data points and the correlations are conditioned on the target Type variable.

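# pairs plot of the 9 predictors, colored by the target variable
# (Type stays in the data so the color mapping does not rely on attach)
ggpairs(Glass, aes(col = Type), columns = 1:9,
        title = "Pairs plot of predictor variables")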

From these plots, we can see that:

  • Some predictors appear to be approximately symmetric: Na, Al, and Si
  • Other predictors appear to be highly skewed: K and Ba
  • Some predictor variables appear highly correlated: RI and Ca.

Close-up examples of these variables are shown below:

  • Distribution of Na: example of approximately symmetric distribution
  • Distribution of K: example of highly skewed distribution
  • Scatterplot of RI vs. Ca: highly correlated relationship.
ggplot(Glass, aes(x = Na, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of Na: example of approximately symmetric distribution")

ggplot(Glass, aes(x = K, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of K: example of highly skewed distribution")

ggplot(Glass, aes(x = Ca, y = RI, col = Type)) + 
  geom_point() + 
  geom_smooth(aes(x = Ca, y = RI), inherit.aes = FALSE, method = "lm", se = FALSE) + 
  labs(title = "RI vs. Ca: example of correlated variables")

We can take an alternative look at relationships between the predictors by viewing the correlation matrix, using the corrplot function from the corrplot package. It is apparent that the largest correlations in absolute value are:

  • RI and Ca: positive (0.81)
  • Al and Ba: positive (0.48)
  • RI and Si: negative (-0.54)
  • Mg and Ba: negative (-0.49)
  • Mg and Al: negative (-0.48)
# correlation plot excluding the target variable
corrs <- cor(Glass[ , -10]) 
corrplot(corrs)
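
To read the largest correlations off the matrix programmatically rather than by eye, the predictor pairs can be ranked by absolute correlation. The following is a minimal base R sketch: top_correlations is a hypothetical helper name, and the toy data frame is made up for illustration (it is not the Glass data).

```r
# Rank variable pairs by absolute correlation.
# `top_correlations` is a hypothetical helper, not part of corrplot.
top_correlations <- function(corr_mat, n = 5) {
  # keep each pair once by taking the upper triangle only
  idx <- which(upper.tri(corr_mat), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(corr_mat)[idx[, "row"]],
    var2 = colnames(corr_mat)[idx[, "col"]],
    corr = corr_mat[idx],
    stringsAsFactors = FALSE
  )
  # sort by magnitude, largest first
  pairs <- pairs[order(-abs(pairs$corr)), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# toy data for illustration only (not the Glass data)
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, 6, 8, 10),
                  c = c(5, 3, 4, 1, 2))
top_pairs <- top_correlations(cor(toy), n = 2)
print(top_pairs)
```

Assuming corrs from the code above is still in the workspace, top_correlations(corrs, n = 5) should return the pairs listed in the bullets.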

b. Outliers and skew

We can quantify skewness in the predictors by computing the skewness statistic and the ratio of high-to-low values. These statistics confirm our observations from the density plots below that the following variables all have highly skewed distributions:

  • K
  • Ba
  • Ca.
# skew statistics
skewValues <- apply(Glass[ , -10], 2, skewness)

# high-to-low ratios; add 0.1 to min to prevent division by 0
hiloRatios <- apply(Glass[ , -10], 2, function(x) max(x) / min(x + 0.1))

cbind(Skew = skewValues, Hilo = hiloRatios) %>%
  kable(digits = 2, 
        col.names=c("Skew statistic", "High-to-Low ratio"), 
        caption = "Predictors with Skewed Distributions")
Predictors with Skewed Distributions

    Skew statistic   High-to-Low ratio
RI            1.60                0.95
Na            0.45                1.60
Mg           -1.14               44.90
Al            0.89                8.97
Si           -0.72                1.08
K             6.46               62.10
Ca            2.02                2.93
Ba            3.37               31.50
Fe            1.73                5.10
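
For readers who want to see what a skewness statistic measures, here is a minimal base R sketch of one common moment-based definition: the third central moment divided by the 1.5 power of the second central moment. The helper name sample_skewness and the toy vectors are assumptions for illustration. Library implementations may apply different small-sample corrections, so values can differ slightly from the table above.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# `sample_skewness` is a hypothetical helper; packages may use
# slightly different small-sample corrections.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  dev <- x - mean(x)
  m2 <- sum(dev^2) / n   # second central moment
  m3 <- sum(dev^3) / n   # third central moment
  m3 / m2^1.5
}

# toy vectors for illustration only
symmetric <- c(1, 2, 3, 4, 5)
right_skewed <- c(0, 0, 0, 0, 10)

skew_symmetric <- sample_skewness(symmetric)
skew_right <- sample_skewness(right_skewed)
print(c(symmetric = skew_symmetric, right_skewed = skew_right))
```

A symmetric sample gives a value near zero, while a long right tail gives a positive value, which is the pattern seen for K, Ba, and Ca.
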
ggplot(Glass, aes(x = K, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of K: highly skewed distribution")

ggplot(Glass, aes(x = Ba, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of Ba: highly skewed distribution")

ggplot(Glass, aes(x = Ca, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of Ca: highly skewed distribution")

From the boxplots below, outliers in these variable distributions are apparent:

  • K: outliers in Types 5 and 7
  • Ba: outliers in Types 2 and 5
  • Ca: outliers in Types 2, 6, and 7
ggplot(Glass) + 
  geom_boxplot(aes(x = Type, y = K, fill = Type)) + 
  labs(title = "Distribution of K by Type")

ggplot(Glass) + 
  geom_boxplot(aes(x = Type, y = Ba, fill = Type)) + 
  labs(title = "Distribution of Ba by Type")

ggplot(Glass) + 
  geom_boxplot(aes(x = Type, y = Ca, fill = Type)) + 
  labs(title = "Distribution of Ca by Type")
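
The outliers seen in the boxplots can also be counted per group. The sketch below uses base R's boxplot.stats, which flags points more than 1.5 times the hinge spread beyond the hinges. count_outliers is a hypothetical helper and the data are toy values, not the Glass data. The counts may differ marginally from what geom_boxplot draws, since the two compute the box edges in slightly different ways.

```r
# Count boxplot outliers per group using the 1.5 * IQR rule.
# `count_outliers` is a hypothetical helper built on base R's boxplot.stats.
count_outliers <- function(values, groups) {
  tapply(values, groups, function(v) length(boxplot.stats(v)$out))
}

# toy data for illustration only (not the Glass data)
values <- c(1, 2, 3, 4, 100, 5, 6, 7, 8, 9)
groups <- rep(c("g1", "g2"), each = 5)
outlier_counts <- count_outliers(values, groups)
print(outlier_counts)
```

On the real data, a call such as count_outliers(Glass$K, Glass$Type) would give the per-Type outlier counts for K.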

c. Transformations

Some transformations of the predictors that might improve a classification model include:

  • Box-Cox transformations of the highly skewed variables
  • Normalizing the variables by centering and scaling.

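We can visualize the distributions of the highly skewed variables (K, Ba, and Ca) before and after the Box-Cox transformation in the density plots below. Because Box-Cox requires strictly positive values and K and Ba contain zeros, we add 0.1 to each variable before transforming. In general, the Box-Cox transformation reduces the skewness of the overall variable distributions, although some of the distributions conditioned on Type may remain skewed.

# Box-Cox transformation (the 0.1 offset keeps all values strictly positive)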
K_bc <- BoxCoxTrans(Glass$K + 0.1)
K_trans <- predict(K_bc, Glass$K + 0.1)
Ba_bc <- BoxCoxTrans(Glass$Ba + 0.1)
Ba_trans <- predict(Ba_bc, Glass$Ba + 0.1)
Ca_bc <- BoxCoxTrans(Glass$Ca + 0.1)
Ca_trans <- predict(Ca_bc, Glass$Ca + 0.1)

# plot distributions before & after box-cox
ggplot(Glass, aes(x = K, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of K before Box-Cox")

ggplot(Glass, aes(x = K_trans, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of K after Box-Cox")

ggplot(Glass, aes(x = Ba, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of Ba before Box-Cox")

ggplot(Glass, aes(x = Ba_trans, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of Ba after Box-Cox")

ggplot(Glass, aes(x = Ca, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of Ca before Box-Cox")

ggplot(Glass, aes(x = Ca_trans, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of Ca after Box-Cox")

detach(Glass)
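
The Box-Cox family used above can be written out directly, which makes clear why the inputs must be strictly positive. This is a minimal sketch for a fixed lambda. box_cox is a hypothetical helper and is not caret's BoxCoxTrans, which also estimates lambda from the data.

```r
# Box-Cox transform for a given lambda; requires strictly positive x.
# `box_cox` is a hypothetical helper, not caret's BoxCoxTrans.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) {
    log(x)                     # limiting case as lambda approaches 0
  } else {
    (x^lambda - 1) / lambda
  }
}

# toy right-skewed values for illustration only
x <- c(0.1, 0.2, 0.5, 1, 2, 10)
log_x  <- box_cox(x, lambda = 0)     # same as log(x)
sqrt_x <- box_cox(x, lambda = 0.5)   # square-root-like transform
print(round(cbind(x, log_x, sqrt_x), 3))
```

A lambda of 1 only shifts the data, while smaller values compress the right tail more strongly.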

Exercise 3.2: Soybean dataset

The Soybean dataset includes 683 observations of 36 variables. The variables include:

  • 35 predictor variables (date through roots), all of which are categorical
  • 1 target variable (Class) which is categorical and specifies 19 distinct classes.
data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
attach(Soybean)

a. Frequency distributions

We start by reviewing the frequency distributions of the predictors using the summary function. It is apparent that:

  • Many of the distributions are highly skewed or imbalanced; for instance, ~94% of the observations for the mycelium variable fall into the same category.
  • Missing values are scattered throughout the dataset, affecting almost all variables to differing degrees. Missing values comprise ~18% (121 cases) of the data for several variables (hail, sever, seed.tmt, and lodging), which may indicate a common source of the problem.
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
## 

According to the text, predictors with degenerate, near-zero-variance distributions satisfy two conditions:

  • The frequency ratio of the most prevalent value to the second-most prevalent value is large (> 20)
  • The number of unique values as a percentage of the sample size is small (< 10%).

We can use the nearZeroVar function in the caret package to identify the predictors with degenerate distributions. Its default cutoffs (freqCut = 95/5 and uniqueCut = 10) closely match these thresholds. Based on this definition, the variables with near-zero variance include:

  • leaf.mild: 78% of the data in the most prevalent value
  • mycelium: 94% in the most prevalent value
  • sclerotia: 92% in the most prevalent value.
# get near-zero variance variables
nzv.cols <- nearZeroVar(Soybean)
nzv.out <- nearZeroVar(Soybean, saveMetrics=TRUE)[nzv.cols, ]
nzv.prop.high <- apply(Soybean, 2, function(x) max(table(x)) / length(x))[nzv.cols]

summary(Soybean[ , nzv.cols])
##  leaf.mild  mycelium   sclerotia 
##  0   :535   0   :639   0   :625  
##  1   : 20   1   :  6   1   : 20  
##  2   : 20   NA's: 38   NA's: 38  
##  NA's:108
df <- nzv.out %>% 
  mutate(prophigh = nzv.prop.high * 100) %>%
  select(freqRatio, percentUnique, prophigh)
rownames(df) <- rownames(nzv.out)
  
kable(df, 
      digits = 2, 
      col.names = c("Frequency ratio", "Pct. unique values", "Pct. most prevalent value"), 
      caption = "Near-Zero Variance Predictors")
Near-Zero Variance Predictors

            Frequency ratio   Pct. unique values   Pct. most prevalent value
leaf.mild             26.75                 0.44                       78.33
mycelium             106.50                 0.29                       93.56
sclerotia             31.25                 0.29                       91.51
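
The two conditions can be checked by hand for a single predictor. The sketch below rebuilds the mycelium counts from the summary output above (639 zeros, 6 ones, 38 missing) and reproduces its row of the table. nzv_metrics is a hypothetical helper, and its cutoffs are assumptions taken from the rule of thumb stated earlier rather than from caret.

```r
# Frequency ratio and percent unique values for one predictor.
# `nzv_metrics` is a hypothetical helper written for illustration;
# the default cutoffs are assumptions from the rule of thumb above.
nzv_metrics <- function(x, freq_cut = 20, unique_cut = 10) {
  counts <- table(x)                      # table() drops NAs
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  list(freq_ratio = freq_ratio,
       pct_unique = pct_unique,
       near_zero_var = freq_ratio > freq_cut && pct_unique < unique_cut)
}

# rebuild the mycelium counts from the summary above: 639 / 6 / 38 NA
mycelium_counts <- factor(c(rep("0", 639), rep("1", 6), rep(NA, 38)))
metrics <- nzv_metrics(mycelium_counts)
print(metrics)
```

The frequency ratio is 639 / 6 = 106.5 and the percentage of unique values is 2 / 683, or about 0.29%, matching the mycelium row.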

b. Missing data

Measured by the number of complete cases, roughly 18% of the cases in the dataset are missing values for one or more variables. Measured by the number of missing values across all cases and variables, roughly 10% of the dataset is missing data.

# proportion of incomplete cases
1 - sum(complete.cases(Soybean)) / nrow(Soybean)
## [1] 0.1771596
# proportion of missing values across all rows & cols
sum(is.na(Soybean)) / ncol(Soybean) / nrow(Soybean)
## [1] 0.09504636

We can detect patterns in the missing data by viewing the distribution of NA values across variables and cases. From the output below, it is evident that:

  • Nearly all variables have missing data (34 out of 35 predictors)
  • Missing data is concentrated in certain groups of variables such as:
    • 121 missing values: hail, sever, seed.tmt, lodging
    • 106 missing values: fruiting.bodies, fruit.spots, seed.discolor, shriveling
    • 84 missing values: leaf.halo, leaf.marg, leaf.size, leaf.malf, fruit.pods
  • Missing data is found in only 121 cases (out of 683)
# exclude target variable
isna <- is.na(Soybean[ , -1])

# missing data by cols
nabyCol <- colSums(isna)
kable(nabyCol, 
      col.names="# Cases with missing data", 
      caption="Predictors with Missing Data")
Predictors with Missing Data

Predictor         # Cases with missing data
date                1
plant.stand        36
precip             38
temp               30
hail              121
crop.hist          16
area.dam            1
sever             121
seed.tmt          121
germ              112
plant.growth       16
leaves              0
leaf.halo          84
leaf.marg          84
leaf.size          84
leaf.shread       100
leaf.malf          84
leaf.mild         108
stem               16
lodging           121
stem.cankers       38
canker.lesion      38
fruiting.bodies   106
ext.decay          38
mycelium           38
int.discolor       38
sclerotia          38
fruit.pods         84
fruit.spots       106
seed               92
mold.growth        92
seed.discolor     106
seed.size          92
shriveling        106
roots              31
kable(addmargins(table(nabyCol)), 
      col.names=c("# Missing values", "# Predictors"),
      caption = "Predictors by Number of Missing Values")
Predictors by Number of Missing Values

# Missing values   # Predictors
               0              1
               1              2
              16              3
              30              1
              31              1
              36              1
              38              7
              84              5
              92              3
             100              1
             106              4
             108              1
             112              1
             121              4
             Sum             35
# missing data by rows
nabyRow <- rowSums(isna)
kable(addmargins(table(nabyRow)), 
      col.names=c("# Missing values", "# Cases"),
      caption = "Cases by Number of Missing Values")
Cases by Number of Missing Values

# Missing values   # Cases
               0       562
              11         9
              13        19
              19        55
              20         8
              24        14
              28        15
              30         1
             Sum       683
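
Equal missing-value counts suggest, but do not prove, that a group of predictors is missing in exactly the same cases. One way to check is to group predictors whose NA indicator columns are identical. The following is a minimal sketch under that assumption: group_by_na_pattern is a hypothetical helper and the data frame is toy data, not the Soybean data.

```r
# Group columns whose missingness patterns (NA indicators) are identical.
# `group_by_na_pattern` is a hypothetical helper for illustration.
group_by_na_pattern <- function(df) {
  na_mat <- is.na(df)
  # encode each column's missingness as a string of 0/1 flags
  signature <- apply(na_mat, 2,
                     function(col) paste(as.integer(col), collapse = ""))
  groups <- split(names(signature), signature)
  # keep only patterns that actually contain missing values
  has_na <- vapply(groups, function(vars) any(na_mat[, vars[1]]), logical(1))
  unname(groups[has_na])
}

# toy data for illustration only (not the Soybean data)
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c("x", NA, "y", NA),
                  c = c(1, 2, 3, 4),
                  d = c(NA, 2, 3, 4))
na_groups <- group_by_na_pattern(toy)
print(na_groups)
```

On the real data, group_by_na_pattern(Soybean[ , -1]) would show whether, for example, hail, sever, seed.tmt, and lodging are missing in the same 121 cases.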

Further, we can visualize the pattern of missing data in the image below, where the horizontal axis represents the variables and the vertical axis represents the cases in the dataset (note that the cases on the vertical axis are in reverse order compared to the dataframe). There is a clear pattern of missing values in the dataframe, with concentrations in selected cases and variables.

# image of missing data
par(mfrow = c(1,1))
image(t(isna), 
      xlab = "Variables", 
      ylab = "Cases (note reverse order)")
title("Image of Missing Values in Soybean Dataframe")

Finally, we can cross-tabulate by Class to see if there is a pattern of missing data related to the target variable. It is clear that the missing data is highly structured and dependent on the target:

  • 2-4-d-injury: all cases are missing some data
  • cyst-nematode: all cases are missing some data
  • diaporthe-pod-&-stem-blight: all cases are missing some data
  • herbicide-injury: all cases are missing some data
  • phytophthora-rot: 68 out of 88 cases are missing some data.
# this time include target variable
isna <- is.na(Soybean)
nabyCol <- colSums(isna)
nabyRow <- rowSums(isna)

dfwna <- Soybean %>% mutate(nas = nabyRow, anyna = nabyRow > 0)

# cross tab by any missing data
addmargins(xtabs(~ Class + anyna, dfwna)) %>% 
  kable(col.names = c("Complete Cases", "Incomplete Cases", "Total Cases"), 
        caption = "Class by Complete / Incomplete Cases")
Class by Complete / Incomplete Cases

Class                        Complete Cases   Incomplete Cases   Total Cases
2-4-d-injury                              0                 16            16
alternarialeaf-spot                      91                  0            91
anthracnose                              44                  0            44
bacterial-blight                         20                  0            20
bacterial-pustule                        20                  0            20
brown-spot                               92                  0            92
brown-stem-rot                           44                  0            44
charcoal-rot                             20                  0            20
cyst-nematode                             0                 14            14
diaporthe-pod-&-stem-blight               0                 15            15
diaporthe-stem-canker                    20                  0            20
downy-mildew                             20                  0            20
frog-eye-leaf-spot                       91                  0            91
herbicide-injury                          0                  8             8
phyllosticta-leaf-spot                   20                  0            20
phytophthora-rot                         20                 68            88
powdery-mildew                           20                  0            20
purple-seed-stain                        20                  0            20
rhizoctonia-root-rot                     20                  0            20
Sum                                     562                121           683

Conditioning on the incomplete cases, we can see that the missing values arise from a consistent set of predictors, which depends on the target Class:

  • 2-4-d-injury: incomplete cases due to missing data in 28 or 30 predictors
  • cyst-nematode: incomplete cases due to missing data in 24 predictors
  • diaporthe-pod-&-stem-blight: incomplete cases due to missing data in 11 or 13 predictors
  • herbicide-injury: incomplete cases due to missing data in 20 predictors
  • phytophthora-rot: incomplete cases due to missing data in 13 or 19 predictors.
# cross tab by number of missing data
addmargins(xtabs(~ Class + nas, dfwna, subset = (anyna == TRUE))) %>%
  kable(caption = "Classes with Incomplete Cases by Number of Missing Variables")
Classes with Incomplete Cases by Number of Missing Variables

Class / # missing values        11   13   19   20   24   28   30   Sum
2-4-d-injury                     0    0    0    0    0   15    1    16
alternarialeaf-spot              0    0    0    0    0    0    0     0
anthracnose                      0    0    0    0    0    0    0     0
bacterial-blight                 0    0    0    0    0    0    0     0
bacterial-pustule                0    0    0    0    0    0    0     0
brown-spot                       0    0    0    0    0    0    0     0
brown-stem-rot                   0    0    0    0    0    0    0     0
charcoal-rot                     0    0    0    0    0    0    0     0
cyst-nematode                    0    0    0    0   14    0    0    14
diaporthe-pod-&-stem-blight      9    6    0    0    0    0    0    15
diaporthe-stem-canker            0    0    0    0    0    0    0     0
downy-mildew                     0    0    0    0    0    0    0     0
frog-eye-leaf-spot               0    0    0    0    0    0    0     0
herbicide-injury                 0    0    0    8    0    0    0     8
phyllosticta-leaf-spot           0    0    0    0    0    0    0     0
phytophthora-rot                 0   13   55    0    0    0    0    68
powdery-mildew                   0    0    0    0    0    0    0     0
purple-seed-stain                0    0    0    0    0    0    0     0
rhizoctonia-root-rot             0    0    0    0    0    0    0     0
Sum                              9   19   55    8   14   15    1   121

c. Strategy for missing data

One strategy for handling the missing data is to remove the predictors that account for the bulk of the systematic missing data, such as the variables that have more than 80 incomplete cases:

  • Variables with 121 incomplete cases: hail, sever, seed.tmt, lodging
  • Variable with 112 incomplete cases: germ
  • Variable with 108 incomplete cases: leaf.mild
  • Variables with 106 incomplete cases: fruiting.bodies, fruit.spots, seed.discolor, shriveling
  • Variable with 100 incomplete cases: leaf.shread
  • Variables with 92 incomplete cases: seed, mold.growth, seed.size
  • Variables with 84 incomplete cases: leaf.halo, leaf.marg, leaf.size, leaf.malf, fruit.pods.

Dropping these 19 variables leaves 16 predictors in the dataset. The remaining missing data is isolated to certain cases, which can then be removed as incomplete cases.

# drop variables with 80 or more missing values
dropna <- Soybean[ , nabyCol < 80]
# dim of remaining dataframe
dim(dropna)
## [1] 683  17
# visualize missing data
image(t(is.na(dropna)), 
      xlab = "Variables", 
      ylab = "Cases (note reverse order)")
title("Image of Missing Values in Reduced Dataframe")

After dropping the remaining incomplete cases, we are left with a dataset of 630 complete observations of 16 predictor variables. Starting from the raw Soybean dataset (683 cases with 35 predictors), we have dropped 19 predictors and 53 incomplete cases. Finally, we review the summary statistics of the reduced dataset.

# drop remaining incomplete cases
final_df <- dropna[complete.cases(dropna), ]
dim(final_df)
## [1] 630  17
summary(final_df)
##                  Class     date    plant.stand precip  temp    crop.hist
##  brown-spot         : 92   0: 20   0:347       0: 74   0: 72   0: 59    
##  alternarialeaf-spot: 91   1: 68   1:283       1:110   1:374   1:156    
##  frog-eye-leaf-spot : 91   2: 86               2:446   2:184   2:208    
##  phytophthora-rot   : 88   3:110                               3:207    
##  anthracnose        : 44   4:124                                        
##  brown-stem-rot     : 44   5:140                                        
##  (Other)            :180   6: 82                                        
##  area.dam plant.growth leaves  stem    stem.cankers canker.lesion
##  0:113    0:426        0: 62   0:282   0:364        0:305        
##  1:215    1:204        1:568   1:348   1: 39        1: 83        
##  2:135                                 2: 36        2:177        
##  3:167                                 3:191        3: 65        
##                                                                  
##                                                                  
##                                                                  
##  ext.decay mycelium int.discolor sclerotia roots  
##  0:482     0:624    0:566        0:610     0:551  
##  1:135     1:  6    1: 44        1: 20     1: 78  
##  2: 13              2: 20                  2:  1  
##                                                   
##                                                   
##                                                   
## 
detach(Soybean)

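It is critical to compare these summary statistics with the initial ones, particularly with respect to the variable distributions. Most importantly, we should inspect the distribution of the target Class variable, and the predictor distributions conditioned on Class, to determine how much bias dropping the missing data has introduced into the final dataset. The Class counts already show one such effect: the 53 dropped cases are exactly the 2-4-d-injury (16), cyst-nematode (14), diaporthe-pod-&-stem-blight (15), and herbicide-injury (8) cases, so these four classes have no cases left in the final dataset, while all 88 phytophthora-rot cases are retained.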
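
A comparison of class counts before and after the row removal can be sketched as follows. class_retention is a hypothetical helper, and the label vectors are toy data rather than the Soybean classes.

```r
# Compare class counts before and after dropping rows.
# `class_retention` is a hypothetical helper for illustration.
class_retention <- function(before, after) {
  lv <- levels(factor(before))
  n_before <- as.integer(table(factor(before, levels = lv)))
  n_after  <- as.integer(table(factor(after, levels = lv)))
  data.frame(class = lv,
             before = n_before,
             after = n_after,
             pct_retained = 100 * n_after / n_before,
             stringsAsFactors = FALSE)
}

# toy labels for illustration only (not the Soybean data)
before <- c("a", "a", "a", "b", "b", "c")
after  <- c("a", "a", "a", "b")
retention <- class_retention(before, after)
print(retention)
```

With the objects from the code above, class_retention(Soybean$Class, final_df$Class) would list every class alongside the share of its cases that survived, making fully dropped classes easy to spot.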