Question 3.1:

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Part A:

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Part B:

Do there appear to be any outliers in the data? Are any predictors skewed?

First we check the missing-value plot from Part A and notice that there are no missing values in the data, which is great! We then check the histogram plot and notice a few things. No predictor is strictly normally distributed, but a few have near-normal distributions: Al, Ca, Na, RI, and Si. Let’s look at these closer.

  • Al, Ca, RI: These predictors are slightly right skewed with some outliers.
  • Na: Very slight right skewness with a few outliers.

Next we have the Ba, Fe, and K predictors. They are all extremely right skewed; most of their values equal 0.

Predictor Mg is bimodal, with peaks at 0 and 3.5. Finally, predictor Si is slightly left skewed with some outliers, but other than that it is fairly normal.
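For reference, a minimal sketch of the checks described above; the exact plotting code is an assumption (base R graphics here), not necessarily what produced the original figures:

library(mlbench)
data(Glass)

# Confirm there are no missing values
colSums(is.na(Glass))

# Histogram of each of the nine predictors
par(mfrow = c(3, 3))
for (p in names(Glass)[1:9]) {
  hist(Glass[[p]], main = p, xlab = p)
}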

Finally, we look at the correlation plot and notice that some predictor pairs have extremely high correlation coefficients. The pair most likely to cause issues is Ca and RI, with a correlation coefficient of 0.81. Some other pairs have moderately high positive correlations: Ba & Na (0.33), Al & K (0.33), and Ba & Al (0.48). Other potentially problematic correlations between predictors are Si & RI (-0.54), Ba & Mg (-0.49), Al & Mg (-0.48), Ca & Mg (-0.44), Al & RI (-0.41), and Ca & K (-0.32). The rest of the variables have little to no correlation.
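A sketch of how these coefficients can be computed; the corrplot package is an assumption for the plot itself:

cor_mat <- cor(Glass[, 1:9])
round(cor_mat["Ca", "RI"], 2)   # 0.81

library(corrplot)               # assumed plotting package
corrplot(cor_mat)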

Part C:

Are there any relevant transformations of one or more predictors that might improve the classification model?

Removing the Ca variable would help the model, first because it has such a high correlation with RI (0.81), and second because it has some of the highest correlations with the other predictors.
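caret's findCorrelation implements exactly this kind of filter; a minimal sketch, where the 0.75 cutoff is an arbitrary choice:

library(caret)
cor_mat <- cor(Glass[, 1:9])

# Flags variables whose removal reduces pairwise correlations;
# given the 0.81 correlation with RI, Ca should be among them
findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)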

The Ba, Fe, and K variables could benefit from a log transformation, substituting the 0 values with an extremely small number like 0.00001 to avoid the error of taking the log of 0. A Box-Cox transformation is another option; note that it also requires strictly positive values, so the same offset would be needed.
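A sketch of both options, using caret's BoxCoxTrans for the second; the 0.00001 offset mirrors the substitution described above:

offset <- 0.00001

# Option 1: log transformation with a small offset for the zeros
log_K <- log(Glass$K + offset)

# Option 2: Box-Cox, which likewise needs strictly positive values
library(caret)
bc   <- BoxCoxTrans(Glass$K + offset)
bc_K <- predict(bc, Glass$K + offset)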

Depending on our model’s sensitivity to outliers, we will have to deal with them. First we would make sure that the outlying values make sense: are they physically possible, or are they data entry errors? If a value is a data entry error, we would want to remove that row or replace the value with a missing value, and then either remove or impute the missing data. If it’s not a data entry error, we could use a spatial sign transformation: after centering and scaling the variables, the spatial sign projects each sample onto the unit sphere, so every point lies the same distance from the center of the predictor distribution. This should mitigate the effect of outliers on the model.
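A sketch using caret's preProcess, which chains the centering, scaling, and spatial sign steps:

library(caret)
pp <- preProcess(Glass[, 1:9],
                 method = c("center", "scale", "spatialSign"))
glass_ss <- predict(pp, Glass[, 1:9])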

For the Mg predictor, we could bin it to deal with the bimodal distribution, for example splitting at the mean: Mg <= 2.685 versus Mg > 2.685.
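A sketch of that split; the "low"/"high" labels are arbitrary:

mg_cut <- mean(Glass$Mg)   # about 2.685
Glass$MgBin <- factor(ifelse(Glass$Mg <= mg_cut, "low", "high"))
table(Glass$MgBin)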

Question 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
str(Soybean)
'data.frame':   683 obs. of  36 variables:
 $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
 $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
 $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
 $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
 $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
 $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
 $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
 $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
 $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
 $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
 $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
 $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
 $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
 $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Part A:

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Based on the chart above, we see that only three predictors have a ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value greater than 20: mycelium, sclerotia, and leaf.mild. mycelium has the worst ratio at 106, while sclerotia sits at 31 and leaf.mild at 27. If the model is susceptible to these kinds of degenerate distributions, it may be worth removing these three predictors.
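These ratios can be reproduced with caret's nearZeroVar, which reports the frequency ratio directly; a minimal sketch:

library(caret)
library(mlbench)
data(Soybean)

nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$freqRatio > 20, ]   # mycelium, sclerotia, leaf.mild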

Part B:

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

We can see that each feature has a number of missing values. Most predictors have more than 5% missing data, but only 10 have more than about 15% missing. Interestingly, only 5 of the 19 classes account for the missing values, and roughly 50% of the missing data comes from phytophthora-rot alone. This suggests that the missingness is related to class.
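A sketch of the tallies behind these observations:

# Share of missing values per predictor
sort(colMeans(is.na(Soybean[, -1])), decreasing = TRUE)

# Total missing values contributed by each class
tapply(rowSums(is.na(Soybean[, -1])), Soybean$Class, sum)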

Part C:

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The missing data do not appear to be missing at random; I would classify them as MNAR (Missing Not at Random). With this in mind, and given the evidence that most of the missing data come from the phytophthora-rot class, I would attempt a model where I drop the phytophthora-rot class and then impute the remaining missing values for the other 4 classes with missingness: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury.
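A minimal sketch of that strategy, using simple most-frequent-level (mode) imputation as a stand-in for whatever imputation method is ultimately chosen; a model-based imputer would be a reasonable alternative:

library(mlbench)
data(Soybean)

# Drop the phytophthora-rot class
soy <- droplevels(subset(Soybean, Class != "phytophthora-rot"))

# Impute remaining NAs with each column's most frequent level
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
soy[, -1] <- lapply(soy[, -1], impute_mode)

sum(is.na(soy))   # should now be 0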