Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.
# Required R packages
library(mlbench)
library(tidyverse)
library(GGally)
library(caret)
library(VIM)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
par(mfrow = c(3,3))
for (i in 1:9){
rcompanion::plotNormalDensity(
Glass[,i], main = sprintf("Density of %s", names(Glass)[i]),
xlab = sprintf("skewness = %1.2f", psych::describe(Glass)[i,11]),
col2 = "steelblue2", col3 = "royalblue4")
}
The plots above show, for each predictor, a density estimate with a superimposed normal curve that has the same mean and standard deviation, which allows a quick comparison against a normal distribution. None of the variables is truly normally distributed. Na, Al, and Si are nearly normal, with small deviations in the tails. The refractive index, Mg, and K show evidence of bimodality, while K, Ca, Ba, and Fe are positively skewed.
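The skewness values printed under each panel come from psych::describe. As a rough cross-check, a moment-based skewness can be computed in base R. This is a minimal sketch: the helper name `moment_skewness` and the toy vectors are hypothetical, and psych may apply a slightly different sample-size correction, so small differences in the second decimal are to be expected.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# NOTE: `moment_skewness` is an illustrative helper, not part of any package.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^(3 / 2)
}

symmetric  <- c(-2, -1, 0, 1, 2)        # symmetric around 0
right_tail <- c(0, 0, 0, 0, 1, 2, 10)   # mostly zeros with a long right tail

moment_skewness(symmetric)    # zero for symmetric data
moment_skewness(right_tail)   # positive for a right-skewed sample
```

A positive value indicates a longer right tail and a negative value a longer left tail.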
ggpairs(Glass[1:9], title = "Correlogram with the variables", progress = FALSE,
lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1)))
From the correlogram, the refractive index and Ca are strongly positively correlated. Some other pairs show moderate correlations, but no other relationship seems noteworthy.
par(mfrow = c(5,2))
for (i in 1:9){
boxplot(
Glass[i], main = sprintf("%s", names(Glass)[i]), col = "steelblue2", horizontal = TRUE,
xlab = sprintf("skewness = %1.2f # of outliers = %d", psych::describe(Glass)[i,11],
length(boxplot(Glass[i], plot = FALSE)$out)))
}
The boxplots reveal a number of outliers in most variables, which are a likely cause of the skewness discussed previously. Interestingly, Mg does not appear to have any outliers, yet its distribution is slightly negatively skewed. Ba and Ca appear to have more outliers than the other predictors, which may influence modeling.
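The outlier counts in the axis labels rely on the standard boxplot rule: points more than 1.5 times the hinge spread beyond the hinges are flagged. A minimal base R sketch on made-up values (the vector `x` is illustrative and not taken from Glass):

```r
# boxplot.stats() applies the same rule that boxplot() uses to flag outliers.
x <- c(1, 2, 3, 4, 5, 100)   # toy data with one extreme value

stats  <- boxplot.stats(x, coef = 1.5)
hinges <- stats$stats[c(2, 4)]   # lower and upper hinge
spread <- diff(hinges)           # hinge spread (approximately the IQR)
fences <- c(hinges[1] - 1.5 * spread, hinges[2] + 1.5 * spread)

fences              # values outside this interval are flagged
stats$out           # the flagged values
length(stats$out)   # the count shown in the plot labels
```

Because the rule depends only on the hinges, a variable with many identical values and a narrow box, such as one that is mostly zero, can report many outliers even when those points are legitimate measurements.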
Because quite a few variables are skewed, a Box-Cox transformation is one important method that may improve the model. There are no missing values, so imputation is not necessary. Lastly, because some predictors are correlated, data reduction is considered to check whether a smaller set of derived predictors can capture most of the information in the original variables. As a result, a series of transformations is applied to the predictors, namely Box-Cox transformation, centering, scaling, and PCA.
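The call that produced the summary printed next is not shown in this write-up. A sketch that is consistent with that output and with the later use of `glass.t`, assuming caret's `preProcess` with the Box-Cox, centering, scaling, and PCA methods and its default 95% variance threshold:

```r
library(mlbench)
library(caret)

data(Glass)

# Assumed reconstruction: estimate the transformations from the data.
# The factor column Type is not numeric, so preProcess ignores it.
glass.t <- preProcess(Glass, method = c("BoxCox", "center", "scale", "pca"))
glass.t
```

Box-Cox is only estimated for strictly positive predictors, which would explain why just five of the nine predictors receive a lambda.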
## Created from 214 samples and 10 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - centered (9)
## - ignored (1)
## - principal component signal extraction (9)
## - scaled (9)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance
The pre-processing transformation above can be applied to all the variables. Refractive index, Na, Al, Si, and Ca were Box-Cox transformed; all nine predictors were then centered, scaled, and passed through PCA. After applying the transformation, the component densities are nearly normal and not heavily skewed, and the components are uncorrelated with one another.
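The two properties relied on here, uncorrelated components and a variance-based cutoff for how many to keep, can be illustrated with base R's `prcomp`. This is a sketch on simulated data, not the Glass analysis itself, and all variable names are illustrative.

```r
set.seed(1)

# Simulate three correlated predictors plus one independent one.
n <- 200
z <- rnorm(n)
toy <- data.frame(
  a = z + rnorm(n, sd = 0.3),
  b = z + rnorm(n, sd = 0.3),
  c = -z + rnorm(n, sd = 0.3),
  d = rnorm(n)
)

# Center and scale before extracting the principal components.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Off-diagonal correlations between component scores are numerically zero.
score_cor <- cor(pca$x)
max(abs(score_cor[upper.tri(score_cor)]))

# Smallest number of components whose cumulative variance reaches 95%.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp  <- which(cum_var >= 0.95)[1]
n_comp
```

The same cumulative-variance logic is what a 95% threshold in a PCA pre-processing step expresses: components are kept in order until the threshold is reached.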
transformed = predict(glass.t, Glass)
par(mfrow = c(3,3))
for (i in 2:8){
rcompanion::plotNormalDensity(
transformed[,i], main = sprintf("Density of %s", names(transformed)[i]),
xlab = sprintf("skewness = %1.2f", psych::describe(transformed)[i,11]),
col2 = "steelblue2", col3 = "royalblue4")
}
ggpairs(transformed[2:8], title = "Correlogram with the PCA variables", progress = FALSE,
lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1)))
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
There are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value "dna" means "does not apply". The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.
The data frame has 683 observations on 36 variables: 35 categorical attributes, all encoded numerically, and one nominal variable denoting the class.
summarytools::dfSummary(Soybean, plain.ascii = TRUE, style = "grid", graph.col = FALSE, footnote = NA)
## Data Frame Summary
## Soybean
## Dimensions: 683 x 36
## Duplicates: 52
##
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
## +====+===================+=================================+====================+==========+==========+
## | 1 | Class | 1. 2-4-d-injury | 16 ( 2.3%) | 683 | 0 |
## | | [factor] | 2. alternarialeaf-spot | 91 (13.3%) | (100%) | (0%) |
## | | | 3. anthracnose | 44 ( 6.4%) | | |
## | | | 4. bacterial-blight | 20 ( 2.9%) | | |
## | | | 5. bacterial-pustule | 20 ( 2.9%) | | |
## | | | 6. brown-spot | 92 (13.5%) | | |
## | | | 7. brown-stem-rot | 44 ( 6.4%) | | |
## | | | 8. charcoal-rot | 20 ( 2.9%) | | |
## | | | 9. cyst-nematode | 14 ( 2.0%) | | |
## | | | 10. diaporthe-pod-&-stem-blig | 15 ( 2.2%) | | |
## | | | [ 9 others ] | 307 (45.0%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 2 | date | 1. 0 | 26 ( 3.8%) | 682 | 1 |
## | | [factor] | 2. 1 | 75 (11.0%) | (99.85%) | (0.15%) |
## | | | 3. 2 | 93 (13.6%) | | |
## | | | 4. 3 | 118 (17.3%) | | |
## | | | 5. 4 | 131 (19.2%) | | |
## | | | 6. 5 | 149 (21.9%) | | |
## | | | 7. 6 | 90 (13.2%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 3 | plant.stand | 1. 0 | 354 (54.7%) | 647 | 36 |
## | | [ordered, factor] | 2. 1 | 293 (45.3%) | (94.73%) | (5.27%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 4 | precip | 1. 0 | 74 (11.5%) | 645 | 38 |
## | | [ordered, factor] | 2. 1 | 112 (17.4%) | (94.44%) | (5.56%) |
## | | | 3. 2 | 459 (71.2%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 5 | temp | 1. 0 | 80 (12.2%) | 653 | 30 |
## | | [ordered, factor] | 2. 1 | 374 (57.3%) | (95.61%) | (4.39%) |
## | | | 3. 2 | 199 (30.5%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 6 | hail | 1. 0 | 435 (77.4%) | 562 | 121 |
## | | [factor] | 2. 1 | 127 (22.6%) | (82.28%) | (17.72%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 7 | crop.hist | 1. 0 | 65 ( 9.8%) | 667 | 16 |
## | | [factor] | 2. 1 | 165 (24.7%) | (97.66%) | (2.34%) |
## | | | 3. 2 | 219 (32.8%) | | |
## | | | 4. 3 | 218 (32.7%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 8 | area.dam | 1. 0 | 123 (18.0%) | 682 | 1 |
## | | [factor] | 2. 1 | 227 (33.3%) | (99.85%) | (0.15%) |
## | | | 3. 2 | 145 (21.3%) | | |
## | | | 4. 3 | 187 (27.4%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 9 | sever | 1. 0 | 195 (34.7%) | 562 | 121 |
## | | [factor] | 2. 1 | 322 (57.3%) | (82.28%) | (17.72%) |
## | | | 3. 2 | 45 ( 8.0%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 10 | seed.tmt | 1. 0 | 305 (54.3%) | 562 | 121 |
## | | [factor] | 2. 1 | 222 (39.5%) | (82.28%) | (17.72%) |
## | | | 3. 2 | 35 ( 6.2%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 11 | germ | 1. 0 | 165 (28.9%) | 571 | 112 |
## | | [ordered, factor] | 2. 1 | 213 (37.3%) | (83.6%) | (16.4%) |
## | | | 3. 2 | 193 (33.8%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 12 | plant.growth | 1. 0 | 441 (66.1%) | 667 | 16 |
## | | [factor] | 2. 1 | 226 (33.9%) | (97.66%) | (2.34%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 13 | leaves | 1. 0 | 77 (11.3%) | 683 | 0 |
## | | [factor] | 2. 1 | 606 (88.7%) | (100%) | (0%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 14 | leaf.halo | 1. 0 | 221 (36.9%) | 599 | 84 |
## | | [factor] | 2. 1 | 36 ( 6.0%) | (87.7%) | (12.3%) |
## | | | 3. 2 | 342 (57.1%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 15 | leaf.marg | 1. 0 | 357 (59.6%) | 599 | 84 |
## | | [factor] | 2. 1 | 21 ( 3.5%) | (87.7%) | (12.3%) |
## | | | 3. 2 | 221 (36.9%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 16 | leaf.size | 1. 0 | 51 ( 8.5%) | 599 | 84 |
## | | [ordered, factor] | 2. 1 | 327 (54.6%) | (87.7%) | (12.3%) |
## | | | 3. 2 | 221 (36.9%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 17 | leaf.shread | 1. 0 | 487 (83.5%) | 583 | 100 |
## | | [factor] | 2. 1 | 96 (16.5%) | (85.36%) | (14.64%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 18 | leaf.malf | 1. 0 | 554 (92.5%) | 599 | 84 |
## | | [factor] | 2. 1 | 45 ( 7.5%) | (87.7%) | (12.3%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 19 | leaf.mild | 1. 0 | 535 (93.0%) | 575 | 108 |
## | | [factor] | 2. 1 | 20 ( 3.5%) | (84.19%) | (15.81%) |
## | | | 3. 2 | 20 ( 3.5%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 20 | stem | 1. 0 | 296 (44.4%) | 667 | 16 |
## | | [factor] | 2. 1 | 371 (55.6%) | (97.66%) | (2.34%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 21 | lodging | 1. 0 | 520 (92.5%) | 562 | 121 |
## | | [factor] | 2. 1 | 42 ( 7.5%) | (82.28%) | (17.72%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 22 | stem.cankers | 1. 0 | 379 (58.8%) | 645 | 38 |
## | | [factor] | 2. 1 | 39 ( 6.0%) | (94.44%) | (5.56%) |
## | | | 3. 2 | 36 ( 5.6%) | | |
## | | | 4. 3 | 191 (29.6%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 23 | canker.lesion | 1. 0 | 320 (49.6%) | 645 | 38 |
## | | [factor] | 2. 1 | 83 (12.9%) | (94.44%) | (5.56%) |
## | | | 3. 2 | 177 (27.4%) | | |
## | | | 4. 3 | 65 (10.1%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 24 | fruiting.bodies | 1. 0 | 473 (82.0%) | 577 | 106 |
## | | [factor] | 2. 1 | 104 (18.0%) | (84.48%) | (15.52%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 25 | ext.decay | 1. 0 | 497 (77.0%) | 645 | 38 |
## | | [factor] | 2. 1 | 135 (20.9%) | (94.44%) | (5.56%) |
## | | | 3. 2 | 13 ( 2.0%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 26 | mycelium | 1. 0 | 639 (99.1%) | 645 | 38 |
## | | [factor] | 2. 1 | 6 ( 0.9%) | (94.44%) | (5.56%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 27 | int.discolor | 1. 0 | 581 (90.1%) | 645 | 38 |
## | | [factor] | 2. 1 | 44 ( 6.8%) | (94.44%) | (5.56%) |
## | | | 3. 2 | 20 ( 3.1%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 28 | sclerotia | 1. 0 | 625 (96.9%) | 645 | 38 |
## | | [factor] | 2. 1 | 20 ( 3.1%) | (94.44%) | (5.56%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 29 | fruit.pods | 1. 0 | 407 (68.0%) | 599 | 84 |
## | | [factor] | 2. 1 | 130 (21.7%) | (87.7%) | (12.3%) |
## | | | 3. 2 | 14 ( 2.3%) | | |
## | | | 4. 3 | 48 ( 8.0%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 30 | fruit.spots | 1. 0 | 345 (59.8%) | 577 | 106 |
## | | [factor] | 2. 1 | 75 (13.0%) | (84.48%) | (15.52%) |
## | | | 3. 2 | 57 ( 9.9%) | | |
## | | | 4. 4 | 100 (17.3%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 31 | seed | 1. 0 | 476 (80.5%) | 591 | 92 |
## | | [factor] | 2. 1 | 115 (19.5%) | (86.53%) | (13.47%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 32 | mold.growth | 1. 0 | 524 (88.7%) | 591 | 92 |
## | | [factor] | 2. 1 | 67 (11.3%) | (86.53%) | (13.47%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 33 | seed.discolor | 1. 0 | 513 (88.9%) | 577 | 106 |
## | | [factor] | 2. 1 | 64 (11.1%) | (84.48%) | (15.52%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 34 | seed.size | 1. 0 | 532 (90.0%) | 591 | 92 |
## | | [factor] | 2. 1 | 59 (10.0%) | (86.53%) | (13.47%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 35 | shriveling | 1. 0 | 539 (93.4%) | 577 | 106 |
## | | [factor] | 2. 1 | 38 ( 6.6%) | (84.48%) | (15.52%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 36 | roots | 1. 0 | 551 (84.5%) | 652 | 31 |
## | | [factor] | 2. 1 | 86 (13.2%) | (95.46%) | (4.54%) |
## | | | 3. 2 | 15 ( 2.3%) | | |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
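A random variable X is degenerate if, for some constant c, P(X = c) = 1. Near-zero variance predictors come close to this: they take a single value for the vast majority of the samples. The rule of thumb for detecting near-zero variance predictors is:
1. The fraction of unique values over the sample size is low (say 10%).
2. The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).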
If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model. From the table above, a few variables may be degenerate. Every predictor has at most seven distinct values across 683 samples, so criterion #1 is met throughout; the candidates singled out here are those with a rarely observed level: leaf.malf, leaf.mild, lodging, mycelium, sclerotia, int.discolor, mold.growth, seed.discolor, seed.size, and shriveling. Let’s further determine which of these can be removed as predictors.
df = distinct(Soybean)
variables = c("leaf.malf", "lodging", "mycelium", "sclerotia", "mold.growth", "seed.discolor", "seed.size",
"shriveling", "leaf.mild", "int.discolor")
counts = data.frame()
for (i in variables) {
counts = rbind(counts, as.data.frame(table(df[i])))
}
ratio = c()
for (i in seq(1, 16, by = 2)) {
ratio[i] = counts$Freq[i]/counts$Freq[i+1]
}
for (i in c(17,20)) {
ratio[i] = counts$Freq[i]/counts$Freq[i+1]
ratio[i+1] = counts$Freq[i]/counts$Freq[i+2]
ratio[22] = NA
}
decision = c()
for (i in 1:22) {
if (is.na(ratio[i])){
decision[i] = ""
} else if (ratio[i] > 20) {
decision[i] = "Remove"
} else {
decision[i] = "Keep"
}
}
variables = c("leaf.malf","", "lodging","","mycelium","", "sclerotia","", "mold.growth", "","seed.discolor",
"", "seed.size","", "shriveling","", "leaf.mild","","", "int.discolor","","")
options(knitr.kable.NA = '')
cbind(variables, rename(counts, factors = Var1, freq = Freq), ratio, decision) %>%
knitr::kable(digits = 2L, caption = "Near-zero Variance Predictors")
| variables | factors | freq | ratio | decision |
|---|---|---|---|---|
| leaf.malf | 0 | 523 | 11.62 | Keep |
| | 1 | 45 | | |
| lodging | 0 | 490 | 11.67 | Keep |
| | 1 | 42 | | |
| mycelium | 0 | 591 | 98.50 | Remove |
| | 1 | 6 | | |
| sclerotia | 0 | 577 | 28.85 | Remove |
| | 1 | 20 | | |
| mold.growth | 0 | 490 | 7.42 | Keep |
| | 1 | 66 | | |
| seed.discolor | 0 | 483 | 7.67 | Keep |
| | 1 | 63 | | |
| seed.size | 0 | 502 | 9.30 | Keep |
| | 1 | 54 | | |
| shriveling | 0 | 509 | 13.76 | Keep |
| | 1 | 37 | | |
| leaf.mild | 0 | 504 | 25.20 | Remove |
| | 1 | 20 | 25.20 | Remove |
| | 2 | 20 | | |
| int.discolor | 0 | 535 | 12.74 | Keep |
| | 1 | 42 | 26.75 | Remove |
| | 2 | 20 | | |
The investigation above indicates that mycelium, sclerotia, and leaf.mild are strongly imbalanced, so it is advantageous to remove these variables from the model. Note that int.discolor yields both a Keep and a Remove: the ratio of its most prevalent level to the second most prevalent is 12.74, below the cutoff of 20, and only the comparison with the rarest level exceeds it. Because the rule of thumb is stated in terms of the second most prevalent value, the variable is kept unless something else indicates that it is harming the model.
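As a cross-check on the manual table, both criteria can be computed per column with a small helper. This is a minimal base R sketch on made-up data: `nzv_metrics` and the toy columns are hypothetical, and the cutoffs of 20 and 10% follow the rule of thumb rather than any tuned values. caret also ships a `nearZeroVar` function for the same purpose.

```r
# Frequency ratio (most common / second most common) and percent of unique
# values for one column, ignoring missing values.
# NOTE: `nzv_metrics` is an illustrative helper, not a package function.
nzv_metrics <- function(x, freq_cut = 20, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(as.character(x)), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  data.frame(
    freq_ratio    = freq_ratio,
    pct_unique    = pct_unique,
    near_zero_var = freq_ratio > freq_cut && pct_unique < unique_cut
  )
}

toy <- data.frame(
  rare     = factor(c(rep(0, 99), 1)),         # 99:1, heavily imbalanced
  balanced = factor(rep(c(0, 1), each = 50))   # 50:50
)

nzv_table <- do.call(rbind, lapply(toy, nzv_metrics))
nzv_table
```

Only the comparison between the two most frequent levels enters `freq_ratio`, which matches the way the rule of thumb is worded.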
Soybean[which(!complete.cases(Soybean)),] %>%
group_by(Class) %>% summarise(Count = n()) %>%
mutate(Proportion = (Count/nrow(Soybean))) %>%
arrange(desc(Count)) %>%
knitr::kable(digits = 3L, caption = "Proportion of Incomplete Cases by Class")
| Class | Count | Proportion |
|---|---|---|
| phytophthora-rot | 68 | 0.100 |
| 2-4-d-injury | 16 | 0.023 |
| diaporthe-pod-&-stem-blight | 15 | 0.022 |
| cyst-nematode | 14 | 0.020 |
| herbicide-injury | 8 | 0.012 |
Within the Class variable, incomplete phytophthora-rot cases alone account for nearly 10% of all samples (68 of the 121 incomplete rows), and the remaining incomplete rows fall in only four other classes. Next, we look at the proportion of missing data within each variable.
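The table divides each count by the total number of rows. A complementary view divides by the size of each class, which shows whether a class is entirely incomplete. A base R sketch on a made-up data frame (`toy` and its columns are illustrative):

```r
toy <- data.frame(
  Class = c("a", "a", "a", "b", "b", "c"),
  x1    = c(1, NA, 3, NA, NA, 6),
  x2    = c(1, 2, NA, 4, NA, 6)
)

# TRUE for rows that contain at least one missing value.
incomplete <- !complete.cases(toy)

# Share of each class's rows that are incomplete.
share_incomplete <- tapply(incomplete, toy$Class, mean)
share_incomplete
```

A share of 1 means every sample of that class has at least one missing value, so dropping incomplete rows would remove the class altogether.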
na.counts = as.data.frame(((sapply(Soybean, function(x) sum(is.na(x))))/nrow(Soybean))*100)
names(na.counts) = "counts"
na.counts = cbind(variables = rownames(na.counts), data.frame(na.counts, row.names = NULL))
na.counts %>% arrange(counts) %>% mutate(name = factor(variables, levels = variables)) %>%
ggplot(aes(x = name, y = counts)) + geom_segment( aes(xend = name, yend = 0)) +
geom_point(size = 4, color = "steelblue2") + coord_flip() + theme_bw() +
labs(title = "Proportion of Missing Data", x = "Variables", y = "% of Missing data") +
scale_y_continuous(labels = scales::percent_format(scale = 1))
aggr(Soybean, col = c('steelblue2','royalblue4'), numbers = FALSE, sortVars = TRUE,
oma = c(6,4,3,2), labels = names(Soybean), cex.axis = 0.8, gap = 3, axes = TRUE, bars = FALSE,
combined = TRUE, Prop = TRUE, ylab = c("Combination of Missing Data"))
##
## Variables sorted by number of missings:
## Variable Count
## hail 121
## sever 121
## seed.tmt 121
## lodging 121
## germ 112
## leaf.mild 108
## fruiting.bodies 106
## fruit.spots 106
## seed.discolor 106
## shriveling 106
## leaf.shread 100
## seed 92
## mold.growth 92
## seed.size 92
## leaf.halo 84
## leaf.marg 84
## leaf.size 84
## leaf.malf 84
## fruit.pods 84
## precip 38
## stem.cankers 38
## canker.lesion 38
## ext.decay 38
## mycelium 38
## int.discolor 38
## sclerotia 38
## plant.stand 36
## roots 31
## temp 30
## crop.hist 16
## plant.growth 16
## stem 16
## date 1
## area.dam 1
## Class 0
## leaves 0
The graphs above help to show how much data is missing from the Soybean data. The first plot highlights that lodging, hail, sever, and seed.tmt are each missing in nearly 18% of the samples. The second plot shows the patterns of missing data across variables: 82% of the samples are complete, and Class and leaves have no missing values at all. There are quite a few missingness patterns, but their overall proportion is not extreme. For example, the pattern in which the first set of variables, from hail to fruit.pods, is missing while all other variables are observed accounts for about 8% of the observations; note that this is the share of a row pattern, not the missingness within any one variable. The number of patterns matters because, for some imputation methods, such as certain types of multiple imputation, having fewer missingness patterns is helpful, as it requires fitting fewer models.
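The pattern view can be reproduced in miniature without VIM by collapsing each row's missingness indicators into a string and tabulating the result. A sketch on made-up data; the data frame and its values are illustrative:

```r
toy <- data.frame(
  hail  = c(NA, 1, 0, NA, 1, 0),
  sever = c(NA, 2, 1, NA, 0, 1),
  roots = c(0, 0, NA, 1, 1, 0)
)

# One string per row: "1" marks a missing cell, "0" an observed one.
pattern <- apply(is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))

# Share of rows per pattern, most common first; "000" is the complete-case share.
pattern_share <- sort(table(pattern) / nrow(toy), decreasing = TRUE)
pattern_share

# Per-variable missingness is a different quantity.
var_share <- colMeans(is.na(toy))
var_share
```

The two summaries answer different questions: `pattern_share` counts rows by which cells are missing together, while `var_share` counts missing cells column by column.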
From the near-zero variance analysis above, mycelium, sclerotia, and leaf.mild are strongly imbalanced, and it was deemed advantageous to remove these variables from the model. If the data set were large enough, rows with missing values could simply be deleted. However, because the missing proportions are not too extreme for most of the variables, the remaining predictors are imputed by k-nearest neighbors instead. The distance used to define the nearest neighbors is based on the Gower distance (Gower 1971), which can handle binary, categorical, ordered, continuous, and semi-continuous variables. As a result, the data set is now complete.
Soybean.complete = Soybean %>% select(-c(mycelium, sclerotia, leaf.mild)) %>% kNN()
aggr(Soybean.complete, col = c('steelblue2','royalblue4'), numbers = FALSE, sortVars = FALSE,
oma = c(8,4,3,2), labels = names(Soybean.complete), cex.axis = 0.8, gap = 3, axes = TRUE,
bars = FALSE, combined = TRUE, Prop = TRUE, ylab = c("Combination of Missing Data"))
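To make the distance behind the imputation concrete, here is a minimal sketch of a Gower-style dissimilarity between two rows with mixed variable types. It is a simplified illustration, not VIM's implementation: numeric columns contribute the absolute difference scaled by the column range, factor columns contribute 0 or 1, ordered factors are treated like plain factors, and cells missing in either row are skipped. The function name `gower_pair` and the toy data are hypothetical.

```r
# Gower-style dissimilarity between rows i and j of a data frame.
# NOTE: `gower_pair` is an illustrative helper, not a package function.
gower_pair <- function(df, i, j) {
  parts <- vapply(names(df), function(v) {
    a <- df[[v]][i]
    b <- df[[v]][j]
    if (is.na(a) || is.na(b)) return(NA_real_)        # skip unavailable comparisons
    if (is.numeric(df[[v]])) {
      rng <- diff(range(df[[v]], na.rm = TRUE))
      if (rng == 0) 0 else abs(a - b) / rng            # range-scaled difference
    } else {
      as.numeric(as.character(a) != as.character(b))   # simple mismatch
    }
  }, numeric(1))
  mean(parts, na.rm = TRUE)                            # average over usable columns
}

toy <- data.frame(
  temp  = c(10, 20, 30),
  hail  = factor(c("0", "1", "0")),
  roots = factor(c("0", NA, "0"))
)

d13 <- gower_pair(toy, 1, 3)   # temp differs by the full range; hail and roots match
d12 <- gower_pair(toy, 1, 2)   # roots is skipped because it is missing in row 2
c(d13, d12)
```

Because each column's contribution lies between 0 and 1, mixed variable types can be averaged on a common scale, and rows with missing cells can still be compared on the columns they share.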