3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
First, let's plot the frequency distributions for our target (Class) and predictor variables:
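Below is a minimal sketch of how such plots can be produced, assuming the data are loaded as the `Soybean` data frame from the `mlbench` package (the package and plotting approach here are an assumption, not the original code):

```r
library(mlbench)
data(Soybean)  # 683 rows: Class plus 35 categorical predictors

# Frequency distribution of the outcome classes
barplot(sort(table(Soybean$Class), decreasing = TRUE),
        las = 2, cex.names = 0.6, main = "Class frequencies")

# One bar plot per categorical predictor (spread over several pages)
par(mfrow = c(2, 3))
for (p in setdiff(names(Soybean), "Class")) {
  barplot(table(Soybean[[p]]), main = p)
}
```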
As we can see above, the frequencies across the levels of the categorical variables are unbalanced in almost half of the predictors. Some predictors, such as "date", "germ" and "crop.hist", show what would resemble a normal distribution if there were more factor levels.
Other predictors, such as "sclerotia", have only two levels, and these levels are very unbalanced. These are examples of degenerate distributions: constant or almost-constant predictors whose frequency distributions have zero or near-zero variance, respectively. For example, consider the "sclerotia" predictor.
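Its level counts can be reproduced with a one-liner (again assuming the `Soybean` data frame from `mlbench` is loaded):

```r
table(Soybean$sclerotia)  # level counts; NAs are dropped by default
```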
##
## 0 1
## 625 20
One of the two levels has a very small frequency of observations: 96.9% of the observations (625 of the 645 non-missing values) belong to a single level.
We can use the nearZeroVar function from the caret package to identify predictors that "have both 1) few unique values relative to the number of samples and 2) large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors)".
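A sketch of that call, which produces the metrics shown below (assuming `Soybean` is loaded and `caret` is installed):

```r
library(caret)

# saveMetrics = TRUE returns the full metrics table instead of indices
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]  # keep only the rows flagged as nzv
```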
## freqRatio percentUnique zeroVar nzv
## leaf.mild 26.75 0.4392387 FALSE TRUE
## mycelium 106.50 0.2928258 FALSE TRUE
## sclerotia 31.25 0.2928258 FALSE TRUE
Three of the predictors, "leaf.mild", "mycelium" and "sclerotia", qualify as having near-zero variance (nzv). If we want to simplify the model by reducing the number of predictor variables, these nzv predictors are natural candidates for removal, as they would not add significantly to the specificity of the model.
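One way to drop them, sketched here (with `saveMetrics` left at its default, `nearZeroVar` returns the column indices of the offending predictors):

```r
# Remove the near-zero variance predictors in one step
Soybean_reduced <- Soybean[, -nearZeroVar(Soybean)]
```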
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
With 35 predictors, we have a wealth of information for building a predictive model, but using all of them would lead to an overly complicated model. We therefore need to select the features (predictor variables) with the most predictive potential. How do we choose which variables to keep?
First, we can mark for elimination the variables with zero or near-zero variance that we identified earlier.
Second, we can eliminate variables that show collinearity. The correlation matrix plot above shows that some variables are strongly correlated with each other; for example, "leaf.marg" is strongly negatively correlated with "leaf.halo".
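A rough sketch of how such a matrix can be computed follows. Treating the factor levels as their integer codes is an assumption (a common shortcut for mostly ordinal predictors), and `corrplot` is just one of several packages that can draw it:

```r
library(corrplot)

# Coerce each factor to its integer codes; drop the Class column
num <- data.frame(lapply(Soybean[, -1], as.numeric))

# Pairwise-complete correlations tolerate the missing values
corrplot(cor(num, use = "pairwise.complete.obs"), tl.cex = 0.5)
```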
After we have narrowed down our set of predictors, we need to deal with the missing observations.
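The per-predictor percentages shown below can be computed directly (a sketch, assuming `Soybean` is loaded):

```r
# Percentage of missing values in each column
colMeans(is.na(Soybean)) * 100
```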
## Class date plant.stand precip
## 0.0000000 0.1464129 5.2708638 5.5636896
## temp hail crop.hist area.dam
## 4.3923865 17.7159590 2.3426061 0.1464129
## sever seed.tmt germ plant.growth
## 17.7159590 17.7159590 16.3982430 2.3426061
## leaves leaf.halo leaf.marg leaf.size
## 0.0000000 12.2986823 12.2986823 12.2986823
## leaf.shread leaf.malf leaf.mild stem
## 14.6412884 12.2986823 15.8125915 2.3426061
## lodging stem.cankers canker.lesion fruiting.bodies
## 17.7159590 5.5636896 5.5636896 15.5197657
## ext.decay mycelium int.discolor sclerotia
## 5.5636896 5.5636896 5.5636896 5.5636896
## fruit.pods fruit.spots seed mold.growth
## 12.2986823 15.5197657 13.4699854 13.4699854
## seed.discolor seed.size shriveling roots
## 15.5197657 13.4699854 15.5197657 4.5387994
We can see above that some predictors (e.g., "hail", "lodging", and others) have more than 15% of their observations missing. We could target these variables for elimination. The remaining predictors can have their missing values filled in using one of the imputation techniques.
In summary, the strategy is first to reduce the number of predictors (removing the zero/near-zero variance and strongly collinear variables) and then to deal with the missing data, either by eliminating variables with more than 15% of observations missing or by imputing the rest. The 15% cutoff is arbitrary, but I believe it is a viable approach given the large number of predictor variables.
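A sketch of that strategy, using a hypothetical mode-imputation helper (`impute_mode` is illustrative only; caret's `preProcess` offers k-NN or bagged-tree imputation, but those expect numeric data):

```r
# Drop predictors with more than 15% missing values
miss_pct <- colMeans(is.na(Soybean)) * 100
soy_trim <- Soybean[, miss_pct <= 15]

# Fill the remaining NAs with each factor's most frequent level
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
soy_imputed <- data.frame(lapply(soy_trim, impute_mode))
```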
Reference: *Near-zero variance predictors. Should we remove them?* https://www.r-bloggers.com/2014/03/near-zero-variance-predictors-should-we-remove-them/