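Glass dataset
The Glass dataset includes 214 observations of 10 variables. The variables include:
- 9 predictor variables (RI through Fe), all of which are numeric
- 1 target variable (Type), which is categorical and takes the values (1, 2, 3, 5, 6, 7).
library(mlbench)   # Glass and Soybean datasets
library(ggplot2)   # ggplot
library(GGally)    # ggpairs
library(corrplot)  # corrplot
library(caret)     # BoxCoxTrans, nearZeroVar
library(dplyr)     # %>%, mutate, select
library(knitr)     # kable
library(e1071)     # skewness (an assumption: the original does not show which package supplies it)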
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
attach(Glass)
To start, we can use the ggpairs function from the GGally package to visualize the distributions of the 9 predictor variables as well as their bivariate scatterplots and correlations. In this chart, the data points and the correlations are conditioned on the target Type variable.
# pairs plot excluding the target variable
# (pass the full data frame so that Type is available for the colour mapping)
ggpairs(Glass, aes(col = Type), columns = 1:9,
        title = "Pairs plot of predictor variables")
From these plots, we can see that:
- Na, Al, and Si have approximately symmetric distributions
- K and Ba have highly skewed distributions
- RI and Ca are highly correlated.
Close-up examples of these variables are shown below:
- Na: example of approximately symmetric distribution
- K: example of highly skewed distribution
- RI vs. Ca: highly correlated relationship.
ggplot(Glass, aes(x = Na, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Density of Na: example of approximately symmetric distribution")
ggplot(Glass, aes(x = K, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Density of K: example of highly skewed distribution")
ggplot(Glass, aes(x = Ca, y = RI, col = Type)) +
geom_point() +
geom_smooth(aes(x = Ca, y = RI), inherit.aes = FALSE, method = "lm", se = FALSE) +
labs(title = "RI vs. Ca: example of correlated variables")
We can take an alternative look at relationships between the predictors by viewing the correlation matrix, using the corrplot function from the corrplot package. It is apparent that the largest correlations in absolute value are:
- RI and Ca: positive (0.81)
- Al and Ba: positive (0.48)
- RI and Si: negative (-0.54)
- Mg and Ba: negative (-0.49)
- Mg and Al: negative (-0.48)
# correlation plot excluding the target variable
corrs <- cor(Glass[ , -10])
corrplot(corrs)
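The largest correlations can also be read off programmatically rather than by eye. The following is a minimal sketch, not part of the original analysis: `top_correlations` is a hypothetical helper name, and the usage on Glass assumes the mlbench package is installed.

```r
# Hypothetical helper: list the variable pairs with the largest
# absolute correlations in a numeric data frame.
top_correlations <- function(df, n = 5) {
  corrs <- cor(df)
  # keep each unordered pair once by taking the upper triangle only
  idx <- which(upper.tri(corrs), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(corrs)[idx[, "row"]],
    var2 = colnames(corrs)[idx[, "col"]],
    corr = corrs[idx],
    stringsAsFactors = FALSE
  )
  # sort by absolute value so strong negative correlations rank high too
  out <- out[order(abs(out$corr), decreasing = TRUE), ]
  rownames(out) <- NULL
  head(out, n)
}

# Usage on the Glass predictors (column 10 is the target Type)
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(top_correlations(Glass[ , -10], n = 5))
}
```

Sorting on the absolute value keeps positive and negative relationships in a single ranking, which matches how the list above is organized.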
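We can quantify skewness in the predictors by computing the skewness statistic and the ratio of the highest to the lowest value. These statistics confirm our observations from the density plots below that K, Ba, and Ca all have highly skewed distributions (skew statistics of 6.46, 3.37, and 2.02, respectively). Note that 0.1 is added to each minimum to avoid division by zero, which shrinks the ratio, noticeably so for small-valued variables such as RI, where it falls below 1.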
# skew statistics
skewValues <- apply(Glass[ , -10], 2, skewness)
# high-to-low ratios; add 0.1 to min to prevent division by 0
hiloRatios <- apply(Glass[ , -10], 2, function(x) max(x) / min(x + 0.1))
cbind(Skew = skewValues, Hilo = hiloRatios) %>%
kable(digits = 2,
col.names=c("Skew statistic", "High-to-Low ratio"),
caption = "Predictors with Skewed Distributions")
| | Skew statistic | High-to-Low ratio |
|---|---|---|
| RI | 1.60 | 0.95 |
| Na | 0.45 | 1.60 |
| Mg | -1.14 | 44.90 |
| Al | 0.89 | 8.97 |
| Si | -0.72 | 1.08 |
| K | 6.46 | 62.10 |
| Ca | 2.02 | 2.93 |
| Ba | 3.37 | 31.50 |
| Fe | 1.73 | 5.10 |
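For reference, the skewness statistic can be written out by hand. The sketch below assumes one common definition, the sum of cubed deviations divided by (n - 1) times the sample variance raised to the power 3/2. Packaged implementations, including whichever one produced the table above, may normalize slightly differently, so the values need not agree to the last digit. `sample_skewness` is a hypothetical name.

```r
# Sample skewness: sum((x - mean)^3) / ((n - 1) * v^(3/2)),
# where v is the usual sample variance (assumed definition).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  centered <- x - mean(x)
  v <- sum(centered^2) / (n - 1)
  sum(centered^3) / ((n - 1) * v^(3 / 2))
}

sample_skewness(c(1, 2, 3, 4, 5))    # symmetric data: 0
sample_skewness(c(1, 1, 1, 2, 10))   # long right tail: positive
```

A value near zero indicates a roughly symmetric distribution, a positive value a long right tail, and a negative value (as for Mg and Si above) a long left tail.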
ggplot(Glass, aes(x = K, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Density of K: highly skewed distribution")
ggplot(Glass, aes(x = Ba, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Density of Ba: highly skewed distribution")
ggplot(Glass, aes(x = Ca, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Density of Ca: highly skewed distribution")
From the boxplots below, outliers in these variable distributions are apparent:
- K: outliers in Types 5 and 7
- Ba: outliers in Types 2 and 5
- Ca: outliers in Types 2, 6, and 7
ggplot(Glass) +
geom_boxplot(aes(x = Type, y = K, fill = Type)) +
labs(title = "Distribution of K by Type")
ggplot(Glass) +
geom_boxplot(aes(x = Type, y = Ba, fill = Type)) +
labs(title = "Distribution of Ba by Type")
ggplot(Glass) +
geom_boxplot(aes(x = Type, y = Ca, fill = Type)) +
labs(title = "Distribution of Ca by Type")
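The outliers visible in the boxplots can also be counted per group. This is a rough sketch using base R's `boxplot.stats`, which flags points more than 1.5 times the hinge spread beyond the hinges; `count_outliers` is a hypothetical helper. ggplot2 computes its hinges from quantiles, so counts may differ slightly from what the plots show.

```r
# Hypothetical helper: number of boxplot outliers within each group.
count_outliers <- function(values, groups) {
  sapply(split(values, groups),
         function(v) length(grDevices::boxplot.stats(v)$out))
}

# Small illustration: only group "a" contains an extreme value
values <- c(1, 2, 3, 4, 100, 1, 2, 3, 4, 5)
groups <- rep(c("a", "b"), each = 5)
count_outliers(values, groups)

# Usage on the Glass data, assuming mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(count_outliers(Glass$K, Glass$Type))
}
```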
Some transformations of the predictors might improve a classification model; the Box-Cox transformation, applied below to reduce skewness, is one example.
We can visualize the distributions of the highly skewed variables (K, Ba, and Ca) before and after the Box-Cox transformation in the density plots below. Because Box-Cox requires strictly positive values and these variables contain zeros, 0.1 is added before transforming. Generally, the Box-Cox transformation reduces the skewness of the overall variable distributions, although some of the distributions conditioned on Type may remain skewed.
# box-cox transformation
K_bc <- BoxCoxTrans(Glass$K + 0.1)
K_trans <- predict(K_bc, Glass$K + 0.1)
Ba_bc <- BoxCoxTrans(Glass$Ba + 0.1)
Ba_trans <- predict(Ba_bc, Glass$Ba + 0.1)
Ca_bc <- BoxCoxTrans(Glass$Ca + 0.1)
Ca_trans <- predict(Ca_bc, Glass$Ca + 0.1)
# plot distributions before & after box-cox
ggplot(Glass, aes(x = K, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Distribution of K before Box-Cox")
ggplot(Glass, aes(x = K_trans, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Distribution of K after Box-Cox")
ggplot(Glass, aes(x = Ba, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Distribution of Ba before Box-Cox")
ggplot(Glass, aes(x = Ba_trans, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Distribution of Ba after Box-Cox")
ggplot(Glass, aes(x = Ca, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Distribution of Ca before Box-Cox")
ggplot(Glass, aes(x = Ca_trans, fill = Type, col = Type)) +
geom_density() +
geom_rug() +
labs(title = "Distribution of Ca after Box-Cox")
detach(Glass)
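To make the Box-Cox transformation used above concrete, here is a minimal sketch of the transform itself for a given lambda. It is an illustration only: `box_cox` is a hypothetical name, and unlike `BoxCoxTrans`, which estimates lambda from the data, this version takes lambda as an argument.

```r
# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the
# limiting case when lambda is 0. Requires strictly positive x.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

x <- c(0.1, 0.5, 1, 2, 10)
box_cox(x, 0)     # identical to log(x)
box_cox(x, 0.5)   # compresses the right tail, like a square root
box_cox(x, 1)     # x - 1: a shift only, shape unchanged
```

The positivity check is the reason a small constant is added to K, Ba, and Ca before transforming: all three contain zeros.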
Soybean dataset
The Soybean dataset includes 683 observations of 36 variables. The variables include:
- 35 predictor variables (date through roots), all of which are categorical
- 1 target variable (Class), which is categorical and specifies 19 distinct classes.
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
attach(Soybean)
We start by reviewing the frequency distributions of the predictors using the summary function. It is apparent that:
- Nearly all cases of the mycelium variable fall into the same category.
- Several variables have the same number of missing values (e.g. hail, sever, seed.tmt, and lodging), which may indicate a common source of the problem.
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots
## 0 :551
## 1 : 86
## 2 : 15
## NA's: 31
##
##
##
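From the text, degenerate distributions that have near-zero variance can be identified as satisfying two conditions:
- The fraction of unique values over the sample size is low (say, 10%)
- The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say, around 20).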
We can use the nearZeroVar function in the caret package to identify the predictors with degenerate distributions. Based on this definition, the variables with near-zero variance include:
- leaf.mild: 78% of the data in the most prevalent value
- mycelium: 94% in the most prevalent value
- sclerotia: 92% in the most prevalent value.
# get near-zero variance variables
nzv.cols <- nearZeroVar(Soybean)
nzv.out <- nearZeroVar(Soybean, saveMetrics=TRUE)[nzv.cols, ]
nzv.prop.high <- apply(Soybean, 2, function(x) max(table(x)) / length(x))[nzv.cols]
summary(Soybean[ , nzv.cols])
## leaf.mild mycelium sclerotia
## 0 :535 0 :639 0 :625
## 1 : 20 1 : 6 1 : 20
## 2 : 20 NA's: 38 NA's: 38
## NA's:108
df <- nzv.out %>%
mutate(prophigh = nzv.prop.high * 100) %>%
select(freqRatio, percentUnique, prophigh)
rownames(df) <- rownames(nzv.out)
kable(df,
digits = 2,
col.names = c("Frequency ratio", "Pct. unique values", "Pct. most prevalent value"),
caption = "Near-Zero Variance Predictors")
| | Frequency ratio | Pct. unique values | Pct. most prevalent value |
|---|---|---|---|
| leaf.mild | 26.75 | 0.44 | 78.33 |
| mycelium | 106.50 | 0.29 | 93.56 |
| sclerotia | 31.25 | 0.29 | 91.51 |
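The two metrics in the table can be reproduced by hand, which makes the definitions concrete. The sketch below is an illustration rather than caret's implementation: `nzv_metrics` is a hypothetical helper, and the example vector is rebuilt from the leaf.mild counts shown in the summary above.

```r
# Hypothetical helper: frequency ratio and percent unique values.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)   # table() drops NA by default
  counts <- counts[counts > 0]                  # ignore unused factor levels
  c(
    # most prevalent count over second most prevalent count
    freqRatio     = as.numeric(counts[1] / counts[2]),
    # distinct non-missing values as a percentage of all cases
    percentUnique = 100 * length(counts) / length(x)
  )
}

# leaf.mild rebuilt from its summary: 535 / 20 / 20 and 108 NA's
leaf_mild <- factor(c(rep("0", 535), rep("1", 20), rep("2", 20), rep(NA, 108)))
round(nzv_metrics(leaf_mild), 2)   # freqRatio 26.75, percentUnique 0.44
```

Both values match the leaf.mild row of the table. With `nearZeroVar` defaults (`freqCut = 95/5`, `uniqueCut = 10`), a predictor is flagged when its frequency ratio exceeds 19 and its percentage of unique values is below 10.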
Measured by the number of complete cases, roughly 18% of the cases in the dataset are missing values for one or more variables. Measured by the number of missing values across all cases and variables, roughly 10% of the dataset is missing data.
# proportion of incomplete cases
1 - sum(complete.cases(Soybean)) / nrow(Soybean)
## [1] 0.1771596
# proportion of missing values across all rows & cols
sum(is.na(Soybean)) / ncol(Soybean) / nrow(Soybean)
## [1] 0.09504636
We can detect patterns in the missing data by viewing the distribution of NA values across variables and cases. From the output below, it is evident that:
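- hail, sever, seed.tmt, and lodging are each missing in 121 cases
- fruiting.bodies, fruit.spots, seed.discolor, and shriveling are each missing in 106 cases
- leaf.halo, leaf.marg, leaf.size, leaf.malf, and fruit.pods are each missing in 84 cases.
# exclude target variable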
isna <- is.na(Soybean[ , -1])
# missing data by cols
nabyCol <- colSums(isna)
kable(nabyCol,
col.names="# Cases with missing data",
caption="Predictors with Missing Data")
| | # Cases with missing data |
|---|---|
| date | 1 |
| plant.stand | 36 |
| precip | 38 |
| temp | 30 |
| hail | 121 |
| crop.hist | 16 |
| area.dam | 1 |
| sever | 121 |
| seed.tmt | 121 |
| germ | 112 |
| plant.growth | 16 |
| leaves | 0 |
| leaf.halo | 84 |
| leaf.marg | 84 |
| leaf.size | 84 |
| leaf.shread | 100 |
| leaf.malf | 84 |
| leaf.mild | 108 |
| stem | 16 |
| lodging | 121 |
| stem.cankers | 38 |
| canker.lesion | 38 |
| fruiting.bodies | 106 |
| ext.decay | 38 |
| mycelium | 38 |
| int.discolor | 38 |
| sclerotia | 38 |
| fruit.pods | 84 |
| fruit.spots | 106 |
| seed | 92 |
| mold.growth | 92 |
| seed.discolor | 106 |
| seed.size | 92 |
| shriveling | 106 |
| roots | 31 |
kable(addmargins(table(nabyCol)),
col.names=c("# Missing values", "# Predictors"),
caption = "Predictors by Number of Missing Values")
| # Missing values | # Predictors |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 16 | 3 |
| 30 | 1 |
| 31 | 1 |
| 36 | 1 |
| 38 | 7 |
| 84 | 5 |
| 92 | 3 |
| 100 | 1 |
| 106 | 4 |
| 108 | 1 |
| 112 | 1 |
| 121 | 4 |
| Sum | 35 |
# missing data by rows
nabyRow <- rowSums(isna)
kable(addmargins(table(nabyRow)),
col.names=c("# Missing values", "# Cases"),
caption = "Cases by Number of Missing Values")
| # Missing values | # Cases |
|---|---|
| 0 | 562 |
| 11 | 9 |
| 13 | 19 |
| 19 | 55 |
| 20 | 8 |
| 24 | 14 |
| 28 | 15 |
| 30 | 1 |
| Sum | 683 |
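Equal missing-value counts suggest, but do not prove, that variables are missing in the same cases. One way to check is to group variables by their exact pattern of missingness across rows. This is a sketch under that idea, with `missing_pattern_groups` as a hypothetical helper name, and the Soybean usage assumes mlbench is installed.

```r
# Hypothetical helper: group columns that are missing in exactly the same rows.
missing_pattern_groups <- function(df) {
  isna <- is.na(df)
  has_na <- colSums(isna) > 0
  if (!any(has_na)) return(list())
  # one signature string per column, e.g. "0110..." over the rows
  signature <- apply(isna[, has_na, drop = FALSE], 2,
                     function(col) paste(as.integer(col), collapse = ""))
  unname(split(names(signature), signature))
}

# Small illustration: a and b share a pattern, c has its own, d is complete
toy <- data.frame(a = c(NA, 2, 3), b = c(NA, 5, 6), c = c(1, NA, 3), d = 1:3)
missing_pattern_groups(toy)

# Usage on the Soybean predictors (column 1 is the target Class)
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(missing_pattern_groups(Soybean[ , -1]))
}
```

Variables that land in the same group are candidates for a shared cause of missingness, such as a measurement that was never taken for certain cases.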
Further, we can visualize the pattern of missing data in the image below, where the horizontal axis represents the variables and the vertical axis represents the cases in the dataset (note that the cases on the vertical axis are in reverse order compared to the dataframe). There is a clear pattern of missing values in the dataframe, with concentrations in selected cases and variables.
# image of missing data
par(mfrow = c(1,1))
image(t(isna),
xlab = "Variables",
ylab = "Cases (note reverse order)")
title("Image of Missing Values in Soybean Dataframe")
Finally, we can cross-tabulate by Class to see if there is a pattern of missing data related to the target variable. It is clear that the missing data is highly structured and dependent on the target:
- 2-4-d-injury: all cases are missing some data
- cyst-nematode: all cases are missing some data
- diaporthe-pod-&-stem-blight: all cases are missing some data
- herbicide-injury: all cases are missing some data
- phytophthora-rot: 68 out of 88 cases are missing some data.
# this time include target variable
isna <- is.na(Soybean)
nabyCol <- colSums(isna)
nabyRow <- rowSums(isna)
dfwna <- Soybean %>% mutate(nas = nabyRow, anyna = nabyRow > 0)
# cross tab by any missing data
addmargins(xtabs(~ Class + anyna, dfwna)) %>%
kable(col.names = c("Complete Cases", "Incomplete Cases", "Total Cases"),
caption = "Class by Complete / Incomplete Cases")
| | Complete Cases | Incomplete Cases | Total Cases |
|---|---|---|---|
| 2-4-d-injury | 0 | 16 | 16 |
| alternarialeaf-spot | 91 | 0 | 91 |
| anthracnose | 44 | 0 | 44 |
| bacterial-blight | 20 | 0 | 20 |
| bacterial-pustule | 20 | 0 | 20 |
| brown-spot | 92 | 0 | 92 |
| brown-stem-rot | 44 | 0 | 44 |
| charcoal-rot | 20 | 0 | 20 |
| cyst-nematode | 0 | 14 | 14 |
| diaporthe-pod-&-stem-blight | 0 | 15 | 15 |
| diaporthe-stem-canker | 20 | 0 | 20 |
| downy-mildew | 20 | 0 | 20 |
| frog-eye-leaf-spot | 91 | 0 | 91 |
| herbicide-injury | 0 | 8 | 8 |
| phyllosticta-leaf-spot | 20 | 0 | 20 |
| phytophthora-rot | 20 | 68 | 88 |
| powdery-mildew | 20 | 0 | 20 |
| purple-seed-stain | 20 | 0 | 20 |
| rhizoctonia-root-rot | 20 | 0 | 20 |
| Sum | 562 | 121 | 683 |
Conditioning on the incomplete cases, we can see that the missing values arise from a consistent set of predictors, which depends on the target Class:
- 2-4-d-injury: incomplete cases due to missing data in 28 or 30 predictors
- cyst-nematode: incomplete cases due to missing data in 24 predictors
- diaporthe-pod-&-stem-blight: incomplete cases due to missing data in 11 or 13 predictors
- herbicide-injury: incomplete cases due to missing data in 20 predictors
- phytophthora-rot: incomplete cases due to missing data in 13 or 19 predictors.
# cross tab by number of missing data
addmargins(xtabs(~ Class + nas, dfwna, subset = (anyna == TRUE))) %>%
kable(caption = "Classes with Incomplete Cases by Number of Missing Variables")
| | 11 | 13 | 19 | 20 | 24 | 28 | 30 | Sum |
|---|---|---|---|---|---|---|---|---|
| 2-4-d-injury | 0 | 0 | 0 | 0 | 0 | 15 | 1 | 16 |
| alternarialeaf-spot | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| anthracnose | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| bacterial-blight | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| bacterial-pustule | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| brown-spot | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| brown-stem-rot | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| charcoal-rot | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cyst-nematode | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 14 |
| diaporthe-pod-&-stem-blight | 9 | 6 | 0 | 0 | 0 | 0 | 0 | 15 |
| diaporthe-stem-canker | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| downy-mildew | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| frog-eye-leaf-spot | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| herbicide-injury | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 8 |
| phyllosticta-leaf-spot | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| phytophthora-rot | 0 | 13 | 55 | 0 | 0 | 0 | 0 | 68 |
| powdery-mildew | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| purple-seed-stain | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| rhizoctonia-root-rot | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sum | 9 | 19 | 55 | 8 | 14 | 15 | 1 | 121 |
One strategy for handling the missing data is to remove the predictors that account for the bulk of the systematic missing data, such as variables that have more than 80 incomplete cases:
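- hail, sever, seed.tmt, lodging (121 missing values each)
- fruiting.bodies, fruit.spots, seed.discolor, shriveling (106 each)
- leaf.halo, leaf.marg, leaf.size, leaf.malf, fruit.pods (84 each).
Dropping these variables, along with the six others above the threshold (germ, leaf.mild, leaf.shread, seed, mold.growth, and seed.size), leaves 16 predictors in the dataset, with the remaining missing data isolated to certain cases, which can be removed as incomplete cases.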
# drop missing data
dropna <- Soybean[ , nabyCol < 80]
# dim of remaining dataframe
dim(dropna)
## [1] 683 17
# visualize missing data
image(t(is.na(dropna)),
xlab = "Variables",
ylab = "Cases (note reverse order)")
title("Image of Missing Values in Reduced Dataframe")
After dropping the remaining incomplete cases, we are left with a dataset with 630 complete observations of 16 predictor variables. Starting with the raw Soybean dataset (683 cases with 35 predictors), we’ve dropped 19 predictors and 53 incomplete cases. Finally we review the summary statistics of the reduced dataset.
# drop remaining incomplete cases
final_df <- dropna[complete.cases(dropna), ]
dim(final_df)
## [1] 630 17
summary(final_df)
## Class date plant.stand precip temp crop.hist
## brown-spot : 92 0: 20 0:347 0: 74 0: 72 0: 59
## alternarialeaf-spot: 91 1: 68 1:283 1:110 1:374 1:156
## frog-eye-leaf-spot : 91 2: 86 2:446 2:184 2:208
## phytophthora-rot : 88 3:110 3:207
## anthracnose : 44 4:124
## brown-stem-rot : 44 5:140
## (Other) :180 6: 82
## area.dam plant.growth leaves stem stem.cankers canker.lesion
## 0:113 0:426 0: 62 0:282 0:364 0:305
## 1:215 1:204 1:568 1:348 1: 39 1: 83
## 2:135 2: 36 2:177
## 3:167 3:191 3: 65
##
##
##
## ext.decay mycelium int.discolor sclerotia roots
## 0:482 0:624 0:566 0:610 0:551
## 1:135 1: 6 1: 44 1: 20 1: 78
## 2: 13 2: 20 2: 1
##
##
##
##
detach(Soybean)
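It is critical to review these summary statistics and compare them with the initial ones, particularly with respect to variable distributions. Importantly, we would want to inspect the distribution of the target Class variable, and the predictor distributions conditioned on Class, to determine the degree to which dropping the missing data has introduced bias into the final dataset. Here the 53 dropped cases are exactly the four classes in which every case is incomplete (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury), so those classes no longer appear in the reduced dataset.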
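The comparison of class distributions before and after dropping data can be sketched as follows. `class_retention` is a hypothetical helper, not part of the original analysis; it assumes two vectors of class labels and reports, per class, how many cases survive.

```r
# Hypothetical helper: per-class counts before and after filtering,
# with the share of cases retained.
class_retention <- function(before, after) {
  # align both on the full set of original levels so that classes
  # that vanish entirely show up with a count of 0
  lv <- levels(factor(before))
  n_before <- as.vector(table(factor(before, levels = lv)))
  n_after  <- as.vector(table(factor(after, levels = lv)))
  data.frame(class = lv, before = n_before, after = n_after,
             retained = n_after / n_before)
}

# Small illustration: class "c" disappears after filtering
class_retention(c("a", "a", "b", "b", "c"), c("a", "a", "b"))

# Usage with the objects built above, if they exist in the session
if (exists("Soybean") && exists("final_df")) {
  print(class_retention(Soybean$Class, final_df$Class))
}
```

A retained share of 0 for any class means the reduced dataset can no longer be used to learn or evaluate that class, which is the kind of bias this final check is meant to surface.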