
Exercise 3.2: Soybean Data

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

First, let's plot the frequency distribution for our target (Class) and predictor variables:
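The plots come from a chunk along these lines; this is a minimal sketch, assuming the Soybean data shipped with the mlbench package and using tidyr (>= 1.2, for the single-function values_transform) and ggplot2 for reshaping and faceted bar charts, rather than whatever plotting helper was used originally:

library(mlbench)
library(tidyr)
library(ggplot2)

data(Soybean)

# Reshape to long format so every variable can be faceted into one figure;
# values_transform coerces the differing factor levels to a common type
soy_long <- pivot_longer(Soybean, cols = everything(),
                         names_to = "variable", values_to = "level",
                         values_transform = as.character)

# One bar chart of level frequencies per variable (NA appears as its own bar)
ggplot(soy_long, aes(x = level)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free")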

As the plots show, the frequencies across the levels of the categorical variables are unbalanced in almost half of the predictors. Some predictors, such as "date", "germ" and "crop.hist", have observations that would resemble a normal distribution if there were more factor levels.

Other predictors, such as "sclerotia", have only two levels, and those levels are highly unbalanced. These are examples of degenerate distributions: constant or almost-constant predictors whose frequency distributions have zero or near-zero variance, respectively. For example, consider the "sclerotia" predictor.
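A one-line sketch of how its level frequencies can be tabulated (assuming the Soybean data loaded above):

# Counts per level; missing values are excluded by default
table(Soybean$sclerotia)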

## 
##   0   1 
## 625  20

Only 20 of the 645 non-missing observations belong to the level "1"; the remaining 96.9% belong to the level "0".

We can use the nearZeroVar function from the caret package to identify predictors that "have both 1) few unique values relative to the number of samples and 2) large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors)".
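A sketch of the call that produces the metrics below, filtered to the flagged predictors:

library(caret)

# saveMetrics = TRUE returns the diagnostics for every predictor
# instead of just the indices of the offending columns
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]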

##           freqRatio percentUnique zeroVar  nzv
## leaf.mild     26.75     0.4392387   FALSE TRUE
## mycelium     106.50     0.2928258   FALSE TRUE
## sclerotia     31.25     0.2928258   FALSE TRUE

Three of the predictors, "leaf.mild", "mycelium" and "sclerotia", qualify as having near-zero variance (nzv). If we wanted to simplify the model by reducing the number of predictor variables, these nzv predictors are natural candidates for removal, as they would not add significantly to the specificity of the model.
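Dropping them can be sketched with the index form of the same function:

# Without saveMetrics, nearZeroVar returns the column indices to remove
soy_filtered <- Soybean[ , -nearZeroVar(Soybean)]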


(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

With 35 predictors, we have a wealth of information for generating a predictive model, but using all of them would lead to an overly complicated one. We therefore need to select the features (predictor variables) with the most predictive potential. How do we choose which variables to keep?

First, we can mark for elimination the variables with zero or near-zero variance that we identified earlier.

We can also eliminate variables that show collinearity. We can see in the correlation matrix plot above that some variables are strongly correlated with each other; for example, "leaf.marg" is strongly negatively correlated with "leaf.halo".
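One way such a correlation matrix can be computed is sketched below. The predictors are factors, so a rough approach (it is only an assumption that the plot above was built this way) is to treat the integer level codes as ordinal scores and correlate them pairwise; corrplot is likewise an assumed choice for the visualization:

library(corrplot)

# Rough ordinal encoding: replace each factor with its integer level codes
soy_num <- sapply(Soybean[ , -1], as.numeric)  # drop the Class outcome

# Pairwise-complete correlations tolerate the many missing values
corr_mat <- cor(soy_num, use = "pairwise.complete.obs")
corrplot(corr_mat, tl.cex = 0.6)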

After we have simplified our set of predictors, we need to deal with the missing observations.
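The per-predictor percentages of missing values shown below can be computed with a one-liner:

# Percentage of missing observations in each column
colMeans(is.na(Soybean)) * 100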

##           Class            date     plant.stand          precip 
##       0.0000000       0.1464129       5.2708638       5.5636896 
##            temp            hail       crop.hist        area.dam 
##       4.3923865      17.7159590       2.3426061       0.1464129 
##           sever        seed.tmt            germ    plant.growth 
##      17.7159590      17.7159590      16.3982430       2.3426061 
##          leaves       leaf.halo       leaf.marg       leaf.size 
##       0.0000000      12.2986823      12.2986823      12.2986823 
##     leaf.shread       leaf.malf       leaf.mild            stem 
##      14.6412884      12.2986823      15.8125915       2.3426061 
##         lodging    stem.cankers   canker.lesion fruiting.bodies 
##      17.7159590       5.5636896       5.5636896      15.5197657 
##       ext.decay        mycelium    int.discolor       sclerotia 
##       5.5636896       5.5636896       5.5636896       5.5636896 
##      fruit.pods     fruit.spots            seed     mold.growth 
##      12.2986823      15.5197657      13.4699854      13.4699854 
##   seed.discolor       seed.size      shriveling           roots 
##      15.5197657      13.4699854      15.5197657       4.5387994

We can see above that some predictors (e.g., "hail", "lodging", and others) have more than 15% of their observations missing. We could target these variables for elimination. The remaining predictors can have their missing values filled in using one of the imputation techniques.

In summary, the strategy is first to reduce the number of variables and then to deal with the missing data, either by eliminating the variables with more than 15% of observations missing or by imputing the rest. The 15% cutoff is arbitrary, but I believe it is a viable approach given the large number of predictor variables.
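A sketch of that strategy, with the mice package as an assumed imputation tool (its defaults fit logistic/polytomous regression models for factor columns):

library(mice)

# Drop predictors with more than 15% of observations missing
missing_pct <- colMeans(is.na(Soybean)) * 100
soy_reduced <- Soybean[ , missing_pct <= 15]

# Impute the rest; mice chooses a method per column type by default
imputed <- mice(soy_reduced, m = 5, seed = 42, printFlag = FALSE)
soy_complete <- complete(imputed)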

Reference: "Near-zero variance predictors. Should we remove them?" R-bloggers, https://www.r-bloggers.com/2014/03/near-zero-variance-predictors-should-we-remove-them/