CUNY 624 Homework Assignment 4

Joel Park

3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

From ?Glass, this is “a data frame with 214 observation containing examples of the chemical analysis of 7 different types of glass. The problem is to forecast the type of class on basis of the chemical analysis. The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence (if it is correctly identified!).”

The 10 variables are:

  1. RI: refractive index
  2. Na: Sodium
  3. Mg: Magnesium
  4. Al: Aluminum
  5. Si: Silicon
  6. K : Potassium
  7. Ca: Calcium
  8. Ba: Barium
  9. Fe: Iron
  10. Type: Type of Glass (Class Attribute)

We assume that the dependent variable, Type, is categorical and that its numeric labels carry no ordinal meaning.
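As a quick sanity check (a small sketch, not part of the original output), we can confirm that R stores Type as an unordered factor:

# Type is a plain (unordered) factor; its levels are labels, not quantities
is.ordered(Glass$Type)
levels(Glass$Type)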

A. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

With exploratory data analysis, we will approach this in two ways: (1) descriptively and (2) visually.

Let’s look at the head, columns, and summary of the data.

head(Glass)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
colnames(Glass)
##  [1] "RI"   "Na"   "Mg"   "Al"   "Si"   "K"    "Ca"   "Ba"   "Fe"   "Type"
summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

From above, we note that all of the predictor (independent) variables are numeric, while the dependent variable, Type, is categorical (a factor whose levels are character labels).

The psych package has a function, describe(), that reports skewness, kurtosis, and other summary statistics for each individual predictor variable.

library(psych)
describe(Glass)
##       vars   n  mean   sd median trimmed  mad   min   max range  skew
## RI       1 214  1.52 0.00   1.52    1.52 0.00  1.51  1.53  0.02  1.60
## Na       2 214 13.41 0.82  13.30   13.38 0.64 10.73 17.38  6.65  0.45
## Mg       3 214  2.68 1.44   3.48    2.87 0.30  0.00  4.49  4.49 -1.14
## Al       4 214  1.44 0.50   1.36    1.41 0.31  0.29  3.50  3.21  0.89
## Si       5 214 72.65 0.77  72.79   72.71 0.57 69.81 75.41  5.60 -0.72
## K        6 214  0.50 0.65   0.56    0.43 0.17  0.00  6.21  6.21  6.46
## Ca       7 214  8.96 1.42   8.60    8.74 0.66  5.43 16.19 10.76  2.02
## Ba       8 214  0.18 0.50   0.00    0.03 0.00  0.00  3.15  3.15  3.37
## Fe       9 214  0.06 0.10   0.00    0.04 0.00  0.00  0.51  0.51  1.73
## Type*   10 214  2.54 1.71   2.00    2.31 1.48  1.00  6.00  5.00  1.04
##       kurtosis   se
## RI        4.72 0.00
## Na        2.90 0.06
## Mg       -0.45 0.10
## Al        1.94 0.03
## Si        2.82 0.05
## K        52.87 0.04
## Ca        6.41 0.10
## Ba       12.08 0.03
## Fe        2.52 0.01
## Type*    -0.29 0.12
anyNA(Glass)
## [1] FALSE

Judging from the information above, there are no missing data, so no imputation will be needed for this exercise.

Though the descriptive statistics provide a significant amount of information, they can be difficult to interpret with so many variables. The visualization component truly helps with “eyeballing” the data and determining which models or assumptions are appropriate.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(reshape2)

ggplot(melt(Glass, id.vars=c('Type')), aes(x=value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free") 

ggplot(melt(Glass, id.vars=c('Type')), aes(x = variable, y = value)) + geom_boxplot() + facet_wrap(~variable, scale="free") + coord_flip()

As you can see, many of the predictors (in fact, likely all of them) do not follow a normal distribution. As normality is an assumption of many models, we must take the skewed predictor values into consideration.

For example, Si is left-skewed (negative skew) whereas Ca is right-skewed (positive skew).

Let’s take a closer look at normality via the QQ plot.

# Reference: http://www.sthda.com/english/wiki/ggplot2-qq-plot-quantile-quantile-graph-quick-start-guide-r-software-and-data-visualization
Glass_columns <- colnames(Glass)

ggplot(melt(Glass, id.vars=c('Type')), aes(sample=value)) + stat_qq() + facet_wrap(~variable, scale="free") 

As the Q-Q plots show, we again confirm that these variables (some more than others) deviate from normality: their points do not lie along a straight line.
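As a complementary, more formal check (a sketch, not part of the original analysis), we could run a Shapiro-Wilk test on each predictor, where a small p-value is evidence against normality:

# Shapiro-Wilk normality test, applied column-wise to the nine predictors
apply(Glass[, -10], 2, function(x) shapiro.test(x)$p.value)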

Are there any correlations in the data?

correlations <- cor(Glass[,-10])
correlations
##               RI          Na           Mg          Al          Si
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073
##               K         Ca            Ba           Fe
## RI -0.289832711  0.8104027 -0.0003860189  0.143009609
## Na -0.266086504 -0.2754425  0.3266028795 -0.241346411
## Mg  0.005395667 -0.4437500 -0.4922621178  0.083059529
## Al  0.325958446 -0.2595920  0.4794039017 -0.074402151
## Si -0.193330854 -0.2087322 -0.1021513105 -0.094200731
## K   1.000000000 -0.3178362 -0.0426180594 -0.007719049
## Ca -0.317836155  1.0000000 -0.1128409671  0.124968219
## Ba -0.042618059 -0.1128410  1.0000000000 -0.058691755
## Fe -0.007719049  0.1249682 -0.0586917554  1.000000000

On the diagonal of this matrix, all of the values equal 1 (each variable is perfectly correlated with itself). Values closer to 0 indicate that little or no correlation exists. Let’s visualize this.

library(corrplot)
corrplot(correlations, order = "hclust")

From the plot, the strongest correlation appears to be between Ca and RI. Let’s see what their correlation value is:

cor(Glass[,c('Ca','RI')])
##           Ca        RI
## Ca 1.0000000 0.8104027
## RI 0.8104027 1.0000000

The correlation here is 0.81, which is fairly high.
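To see this relationship directly (a quick sketch), we can plot Ca against RI with a fitted line:

# Scatterplot of the Ca-RI pair with a linear fit overlaid
ggplot(Glass, aes(x = Ca, y = RI)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")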

Suppose we want to use a cutoff of 0.75 (similar to the assignment in the textbook); let’s find out which predictor variables should potentially be dropped due to high correlation. (High correlation can lead to collinearity, which can ultimately lead to an unstable model.)

library(caret)
## Loading required package: lattice
highCorr <- findCorrelation(correlations, cutoff = .75)
print(paste0("Number of Predictor Variables with Pearson Correlation > 0.75: ",length(highCorr)))
## [1] "Number of Predictor Variables with Pearson Correlation > 0.75: 1"
print(paste0("Caret recommends dropping predictor variable: ", Glass_columns[highCorr[1]]))
## [1] "Caret recommends dropping predictor variable: Ca"

The above information has provided general information regarding the structure of our data.

B. Do there appear to be any outliers in the data? Are any predictors skewed?

As noted in section A, the box plots do a great job of showing the outliers for each predictor variable. By convention, any value more than 1.5 times the interquartile range below the first quartile or above the third quartile shows up as a “dot” on the boxplot. There are many ways to handle outliers (including deleting them); that said, domain knowledge as to why these outliers exist is important, and these values should be weighed carefully before deciding to either transform or delete them.
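To make the rule concrete, here is a small sketch that computes the fences by hand for RI; the values it returns are exactly the “dots” on RI’s boxplot:

# 1.5 * IQR rule illustrated for one predictor (RI)
q <- quantile(Glass$RI, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])
fence_low  <- q[1] - 1.5 * iqr
fence_high <- q[2] + 1.5 * iqr
Glass$RI[Glass$RI < fence_low | Glass$RI > fence_high]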

Another method of showing outliers:

# Reference: https://stackoverflow.com/questions/44089894/identifying-the-outliers-in-a-data-set-in-r
for(i in 1:9) {
  print(paste0("Predictor Variable (outlier values): ", colnames(Glass[i])))
  print(paste0(boxplot(Glass[i],plot=FALSE)$out))
}
## [1] "Predictor Variable (outlier values): RI"
##  [1] "1.52667" "1.5232"  "1.51215" "1.52725" "1.5241"  "1.52475" "1.53125"
##  [8] "1.53393" "1.52664" "1.52739" "1.52777" "1.52614" "1.52369" "1.51115"
## [15] "1.51131" "1.52315" "1.52365"
## [1] "Predictor Variable (outlier values): Na"
## [1] "11.45" "10.73" "11.23" "11.02" "11.03" "17.38" "15.79"
## [1] "Predictor Variable (outlier values): Mg"
## character(0)
## [1] "Predictor Variable (outlier values): Al"
##  [1] "0.29" "0.47" "0.47" "0.51" "3.5"  "3.04" "3.02" "0.34" "2.38" "2.79"
## [11] "2.68" "2.54" "2.34" "2.66" "2.51" "2.42" "2.74" "2.88"
## [1] "Predictor Variable (outlier values): Si"
##  [1] "70.57" "69.81" "70.16" "74.45" "69.89" "70.48" "70.7"  "74.55"
##  [9] "75.41" "70.26" "70.43" "75.18"
## [1] "Predictor Variable (outlier values): K"
## [1] "1.68" "6.21" "6.21" "1.76" "1.46" "2.7"  "1.41"
## [1] "Predictor Variable (outlier values): Ca"
##  [1] "11.64" "10.79" "13.24" "13.3"  "16.19" "11.52" "10.99" "14.68"
##  [9] "14.96" "14.4"  "11.14" "13.44" "5.87"  "11.41" "11.62" "11.53"
## [17] "11.32" "12.24" "12.5"  "11.27" "10.88" "11.22" "6.65"  "5.43" 
## [25] "5.79"  "6.47" 
## [1] "Predictor Variable (outlier values): Ba"
##  [1] "0.09" "0.11" "0.69" "0.14" "0.11" "3.15" "0.27" "0.09" "0.06" "0.15"
## [11] "2.2"  "0.24" "1.19" "1.63" "1.68" "0.76" "0.64" "0.4"  "1.59" "1.57"
## [21] "0.61" "0.81" "0.66" "0.64" "0.53" "0.63" "0.56" "1.71" "0.67" "1.55"
## [31] "1.38" "2.88" "0.54" "1.06" "1.59" "1.64" "1.57" "1.67"
## [1] "Predictor Variable (outlier values): Fe"
##  [1] "0.26" "0.3"  "0.31" "0.32" "0.34" "0.28" "0.29" "0.28" "0.35" "0.37"
## [11] "0.51" "0.28"

These are the values that fall outside the boxplot fences for each predictor variable. So the answer is yes: there do appear to be outliers in all predictor variables, with the exception of Mg.

Let us take a look at the skewness of the predictor variables. (We have already touched on this in section A.)

# Reference: Textbook Chapter 3
library(e1071)
apply(Glass[,-10], 2, skewness)
##         RI         Na         Mg         Al         Si          K 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889 
##         Ca         Ba         Fe 
##  2.0184463  3.3686800  1.7298107

According to https://brownmath.com/stat/shape.htm#SkewnessInterpret,

“Classic rule of thumb for skewness: If skewness is less than −1 or greater than +1, the distribution is highly skewed. If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed. If skewness is between −½ and +½, the distribution is approximately symmetric.”

By this rule, RI, Mg, K, Ca, Ba, and Fe are highly skewed (|skewness| > 1), Al and Si are moderately skewed, and only Na is approximately symmetric. Not surprisingly, most of these predictors are candidates for transformation in hopes of achieving something closer to a normal distribution.
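We can apply this rule of thumb programmatically (a sketch) to label each predictor:

# Bucket each predictor's |skewness| using the brownmath.com cutoffs
sk <- apply(Glass[, -10], 2, skewness)
data.frame(skewness = round(sk, 2),
           category = cut(abs(sk), breaks = c(0, 0.5, 1, Inf),
                          labels = c("approx. symmetric", "moderately skewed",
                                     "highly skewed")))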

C. Are there any relevant transformations of one or more predictors that might improve the classification model?

As noted in section B, several predictor variables will likely require a transformation. A popular way to transform predictor variables is the Box-Cox transformation.

transformed_glass <- apply(Glass[,-10], 2, BoxCoxTrans)
transformed_glass
## $RI
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.511   1.517   1.518   1.518   1.519   1.534 
## 
## Largest/Smallest: 1.02 
## Sample Skewness: 1.6 
## 
## Estimated Lambda: -2 
## 
## 
## $Na
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.73   12.91   13.30   13.41   13.82   17.38 
## 
## Largest/Smallest: 1.62 
## Sample Skewness: 0.448 
## 
## Estimated Lambda: -0.1 
## With fudge factor, Lambda = 0 will be used for transformations
## 
## 
## $Mg
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.115   3.480   2.685   3.600   4.490 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $Al
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   1.190   1.360   1.445   1.630   3.500 
## 
## Largest/Smallest: 12.1 
## Sample Skewness: 0.895 
## 
## Estimated Lambda: 0.5 
## 
## 
## $Si
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   69.81   72.28   72.79   72.65   73.09   75.41 
## 
## Largest/Smallest: 1.08 
## Sample Skewness: -0.72 
## 
## Estimated Lambda: 2 
## 
## 
## $K
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $Ca
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.430   8.240   8.600   8.957   9.172  16.190 
## 
## Largest/Smallest: 2.98 
## Sample Skewness: 2.02 
## 
## Estimated Lambda: -1.1 
## 
## 
## $Ba
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.175   0.000   3.150 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $Fe
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000 
## 
## Lambda could not be estimated; no transformation is applied

As you can note above, transformations could be estimated for some of the variables. However, the Box-Cox transformation could not be applied to Mg, K, Ba, or Fe: it requires strictly positive values, so it cannot be performed on variables that contain zero or negative values.
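For the variables where a lambda was estimated, the transformation can be applied with predict(); a sketch for RI (estimated lambda = -2):

# Apply the fitted Box-Cox transformation to RI and re-check its skewness
RI_bc <- predict(transformed_glass$RI, Glass$RI)
skewness(RI_bc)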

For predictor variables such as Fe or Ba, we should consider that their distributions may never become normal, even after a Box-Cox-style transformation. For these particular cases, we should consider other types of distributions, such as the Poisson or the negative binomial, which may ultimately be a better fit.

Reference: https://www.researchgate.net/post/How_to_transform_count_data_with_0s_to_get_a_normal_distribution
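One possible workaround (a sketch, not required by the assignment) is the Yeo-Johnson transformation, available through caret::preProcess, which, unlike Box-Cox, is defined for zero and negative values:

# Yeo-Johnson handles the zero-heavy predictors that Box-Cox could not
yj <- preProcess(Glass[, -10], method = "YeoJohnson")
Glass_yj <- predict(yj, Glass[, -10])
apply(Glass_yj, 2, skewness)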

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

According to ?Soybean, “there are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value “dna” means does not apply. The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.”

“A data frame with 683 observations on 36 variables. There are 35 categorical attributes, all numerical, and a nominal denoting the class.”

A. Investigate the frequency distributions in the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Let’s take a look at the head of the dataset.

head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist
## 1 diaporthe-stem-canker    6           0      2    1    0         1
## 2 diaporthe-stem-canker    4           0      2    1    0         2
## 3 diaporthe-stem-canker    3           0      2    1    0         1
## 4 diaporthe-stem-canker    3           0      2    1    0         1
## 5 diaporthe-stem-canker    6           0      2    1    0         2
## 6 diaporthe-stem-canker    5           0      2    1    0         3
##   area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1        1     1        0    0            1      1         0         2
## 2        0     2        1    1            1      1         0         2
## 3        0     2        1    2            1      1         0         2
## 4        0     2        0    1            1      1         0         2
## 5        0     1        0    2            1      1         0         2
## 6        0     1        0    1            1      1         0         2
##   leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1         2           0         0         0    1       1            3
## 2         2           0         0         0    1       0            3
## 3         2           0         0         0    1       0            3
## 4         2           0         0         0    1       0            3
## 5         2           0         0         0    1       0            3
## 6         2           0         0         0    1       0            3
##   canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1             1               1         1        0            0         0
## 2             1               1         1        0            0         0
## 3             0               1         1        0            0         0
## 4             0               1         1        0            0         0
## 5             1               1         1        0            0         0
## 6             0               1         1        0            0         0
##   fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1          0           4    0           0             0         0
## 2          0           4    0           0             0         0
## 3          0           4    0           0             0         0
## 4          0           4    0           0             0         0
## 5          0           4    0           0             0         0
## 6          0           4    0           0             0         0
##   shriveling roots
## 1          0     0
## 2          0     0
## 3          0     0
## 4          0     0
## 5          0     0
## 6          0     0

Now, let’s perform a visualization.

ggplot(melt(Soybean, id.vars=c('Class')), aes(x=value)) + geom_histogram(stat="count") + facet_wrap(~variable, scale="free") 
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: Ignoring unknown parameters: binwidth, bins, pad

We must remember that these frequency distribution visualizations can be misleading. The values within the predictor variables are categorical, not ordinal; in some of the predictors, a value of 2 does not mean it is of greater magnitude than a value of 1. These numbers are simply placeholders, so we must be careful not to treat them as ordinal values.
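A safer way to investigate the frequency distributions (a sketch) is with plain frequency tables, which summarize the levels without implying any ordering; for example, for two of the predictors:

# Level counts, including NAs, for a couple of the categorical predictors
lapply(Soybean[, c("temp", "precip")], table, useNA = "ifany")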

To answer the question, I refer to section 3.5 “Removing Predictors.”

“There are potential advantages to removing predictors prior to modeling. First, few predictors means decreased computational time and complexity. Second, if two predictors are highly correlated, this implies that they are measuring the same underlying information. Removing one should not compromise the performance of the model and might lead to a more parsimonious and interpretable model. Third, some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables.”

What is a degenerate distribution? Let’s refer to: https://www.statisticshowto.datasciencecentral.com/degenerate-distribution/.

A degenerate distribution is one in which a single constant c occurs with probability 1. An example would be a weighted die: if the die were weighted so that it always landed on 6, such that \(P(X = 6) = 1\), its distribution would be degenerate.

A predictor whose values follow a degenerate distribution is a zero-variance predictor. If the distribution is nearly degenerate, i.e., the weighted die is thrown 100 times and lands on 6 99% of the time and on 5 the other 1% of the time, we have what is called a near-zero-variance predictor. Zero- and near-zero-variance predictors can detract from a model, as they typically add no value to it.
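A toy simulation (a sketch) makes the distinction concrete:

# Degenerate "die": one value with probability 1, hence zero variance
degenerate <- rep(6, 100)
var(degenerate)
# Near-degenerate "die": one value dominates, hence near zero variance
set.seed(1)
near_degenerate <- sample(c(6, 5), 100, replace = TRUE, prob = c(0.99, 0.01))
table(near_degenerate)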

A good rule of thumb for detecting near-zero variance predictors is:

  1. The fraction of unique values over the sample size is low (say 10%).
  2. The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

We can filter out near-zero variance predictors using the caret package and the nearZeroVar() function.

nearZeroVar(Soybean)
## [1] 19 26 28

This returns a vector of the column indices that should be removed. In this case, the degenerate (near-zero-variance) distributions are columns 19, 26, and 28, i.e., leaf.mild, mycelium, and sclerotia.
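To inspect the metrics behind this decision (a sketch), nearZeroVar() can also return the frequency ratio and percent-unique values for every column:

# saveMetrics = TRUE exposes freqRatio and percentUnique; filter to the flagged columns
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]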

B. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Let’s determine which predictor variables contain the missing data points.

Soybean_na <- apply(Soybean, 2, function(x){sum(is.na(x))})
Soybean_na
##           Class            date     plant.stand          precip 
##               0               1              36              38 
##            temp            hail       crop.hist        area.dam 
##              30             121              16               1 
##           sever        seed.tmt            germ    plant.growth 
##             121             121             112              16 
##          leaves       leaf.halo       leaf.marg       leaf.size 
##               0              84              84              84 
##     leaf.shread       leaf.malf       leaf.mild            stem 
##             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies 
##             121              38              38             106 
##       ext.decay        mycelium    int.discolor       sclerotia 
##              38              38              38              38 
##      fruit.pods     fruit.spots            seed     mold.growth 
##              84             106              92              92 
##   seed.discolor       seed.size      shriveling           roots 
##             106              92             106              31
Soybean_na[order(Soybean_na, decreasing=TRUE)]
##            hail           sever        seed.tmt         lodging 
##             121             121             121             121 
##            germ       leaf.mild fruiting.bodies     fruit.spots 
##             112             108             106             106 
##   seed.discolor      shriveling     leaf.shread            seed 
##             106             106             100              92 
##     mold.growth       seed.size       leaf.halo       leaf.marg 
##              92              92              84              84 
##       leaf.size       leaf.malf      fruit.pods          precip 
##              84              84              84              38 
##    stem.cankers   canker.lesion       ext.decay        mycelium 
##              38              38              38              38 
##    int.discolor       sclerotia     plant.stand           roots 
##              38              38              36              31 
##            temp       crop.hist    plant.growth            stem 
##              30              16              16              16 
##            date        area.dam           Class          leaves 
##               1               1               0               0

Looking at the totals, there is quite a bit of missing data spread throughout the dataset. Is there a particular pattern as to why these data are missing? The documentation for the dataset stated that four of the classes were controversial, with some considering them unjustified given how sparse their data are. Perhaps the missing data are related to those four classes.

Num_of_NAs <- apply(Soybean, 1, function(x){sum(is.na(x))})
Class_Soybean <- Soybean$Class
Soybean_na_df <- data.frame(Class_Soybean, Num_of_NAs)
head(Soybean_na_df,10)
##            Class_Soybean Num_of_NAs
## 1  diaporthe-stem-canker          0
## 2  diaporthe-stem-canker          0
## 3  diaporthe-stem-canker          0
## 4  diaporthe-stem-canker          0
## 5  diaporthe-stem-canker          0
## 6  diaporthe-stem-canker          0
## 7  diaporthe-stem-canker          0
## 8  diaporthe-stem-canker          0
## 9  diaporthe-stem-canker          0
## 10 diaporthe-stem-canker          0

Now that we have created a new data frame recording each row’s class and its number of missing values, let’s find the sum of all the NAs for each class type.

# Reference: https://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group
# Order the aggregated results by total NA count
results <- aggregate(Soybean_na_df$Num_of_NAs, by=list(Class_Soybean=Soybean_na_df$Class_Soybean), FUN=sum)
results[order(results[,"x"]),]
##                  Class_Soybean    x
## 2          alternarialeaf-spot    0
## 3                  anthracnose    0
## 4             bacterial-blight    0
## 5            bacterial-pustule    0
## 6                   brown-spot    0
## 7               brown-stem-rot    0
## 8                 charcoal-rot    0
## 11       diaporthe-stem-canker    0
## 12                downy-mildew    0
## 13          frog-eye-leaf-spot    0
## 15      phyllosticta-leaf-spot    0
## 17              powdery-mildew    0
## 18           purple-seed-stain    0
## 19        rhizoctonia-root-rot    0
## 14            herbicide-injury  160
## 10 diaporthe-pod-&-stem-blight  177
## 9                cyst-nematode  336
## 1                 2-4-d-injury  450
## 16            phytophthora-rot 1214

Interestingly, all of the missing data comes from five classes: herbicide-injury, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, and phytophthora-rot. Given that the documentation stated that a handful of classes were sparsely represented, the missing data most likely correspond to those “Classes”.
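A complementary view (a sketch) is the fraction of incomplete rows within each class, which should tell the same story:

# Proportion of rows with at least one NA, per class
sort(tapply(!complete.cases(Soybean), Soybean$Class, mean), decreasing = TRUE)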

C. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

According to the textbook, “Applied Predictive Modeling”, we can either remove the rows with missing values (if the dataset is large and the NAs are sparse) or impute the missing data.
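Before choosing, it is worth checking how much data listwise deletion would keep (a sketch):

# Rows that survive if we simply drop every incomplete observation
sum(complete.cases(Soybean))
nrow(Soybean)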

Let’s see if we can determine how the missing values are distributed throughout the Soybean dataset. We’ll start by inspecting the class with the most missing data points, phytophthora-rot.

Reference: https://towardsdatascience.com/the-use-of-knn-for-missing-values-cf33d935c637

head(Soybean[Soybean$Class=='phytophthora-rot',],20)
##               Class date plant.stand precip temp hail crop.hist area.dam
## 31 phytophthora-rot    0           1      2    1    1         1        1
## 32 phytophthora-rot    1           1      2    1 <NA>         3        1
## 33 phytophthora-rot    2           1      2    2 <NA>         2        1
## 34 phytophthora-rot    1           1      2    0    0         2        1
## 35 phytophthora-rot    2           1      2    2 <NA>         2        1
## 36 phytophthora-rot    3           1      2    1 <NA>         2        1
## 37 phytophthora-rot    0           1      1    1    0         1        1
## 38 phytophthora-rot    3           1      2    0    0         2        1
## 39 phytophthora-rot    2           1      1    1 <NA>         0        1
## 40 phytophthora-rot    2           1      2    0    0         1        1
## 41 phytophthora-rot    2           1      2    1 <NA>         1        1
## 42 phytophthora-rot    1           1      2    1 <NA>         1        1
## 43 phytophthora-rot    0           1      2    1    0         3        1
## 44 phytophthora-rot    0           1      1    1    1         2        1
## 45 phytophthora-rot    3           1      2    0    0         1        1
## 46 phytophthora-rot    2           1      2    2 <NA>         3        1
## 47 phytophthora-rot    0           1      2    1    0         2        1
## 48 phytophthora-rot    2           1      1    2 <NA>         2        1
## 49 phytophthora-rot    2           1      2    1    1         1        1
## 50 phytophthora-rot    0           1      2    1    0         3        1
##    sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 31     1        0    0            1      1         0         2         2
## 32  <NA>     <NA> <NA>            1      1         0         2         2
## 33  <NA>     <NA> <NA>            1      1      <NA>      <NA>      <NA>
## 34     2        1    1            1      1         0         2         2
## 35  <NA>     <NA> <NA>            1      1      <NA>      <NA>      <NA>
## 36  <NA>     <NA> <NA>            1      1      <NA>      <NA>      <NA>
## 37     1        0    0            1      1         0         2         2
## 38     2        1    1            1      1         0         2         2
## 39  <NA>     <NA> <NA>            1      1         0         2         2
## 40     2        0    1            1      1         0         2         2
## 41  <NA>     <NA> <NA>            1      1         0         2         2
## 42  <NA>     <NA> <NA>            1      1      <NA>      <NA>      <NA>
## 43     1        0    0            1      1         0         2         2
## 44     2        1    0            1      1         0         2         2
## 45     2        1    0            1      1         0         2         2
## 46  <NA>     <NA> <NA>            1      1      <NA>      <NA>      <NA>
## 47     1        0    1            1      1         0         2         2
## 48  <NA>     <NA> <NA>            1      1         0         2         2
## 49     2        0    2            1      1         0         2         2
## 50     1        0    2            1      1         0         2         2
##    leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 31           0         0         0    1       0            1             2
## 32           0         0         0    1    <NA>            2             2
## 33        <NA>      <NA>      <NA>    1    <NA>            3             2
## 34           0         0         0    1       0            2             2
## 35        <NA>      <NA>      <NA>    1    <NA>            2             2
## 36        <NA>      <NA>      <NA>    1    <NA>            3             2
## 37           0         0         0    1       0            1             2
## 38           0         0         0    1       0            2             2
## 39           0         0         0    1    <NA>            2             2
## 40           0         0         0    1       0            1             2
## 41           0         0         0    1    <NA>            2             2
## 42        <NA>      <NA>      <NA>    1    <NA>            1             2
## 43           0         0         0    1       0            1             2
## 44           0         0         0    1       1            2             2
## 45           0         0         0    1       0            2             2
## 46        <NA>      <NA>      <NA>    1    <NA>            3             2
## 47           0         0         0    1       0            1             2
## 48           0         0         0    1    <NA>            2             2
## 49           0         0         0    1       0            1             2
## 50           0         0         0    1       0            1             2
##    fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 31               0         1        0            0         0          3
## 32            <NA>         0        0            0         0       <NA>
## 33            <NA>         0        0            0         0       <NA>
## 34               0         0        0            0         0          3
## 35            <NA>         0        0            0         0       <NA>
## 36            <NA>         0        0            0         0       <NA>
## 37               0         0        0            0         0          3
## 38               0         0        0            0         0          3
## 39            <NA>         0        0            0         0       <NA>
## 40               0         0        0            0         0          3
## 41            <NA>         0        0            0         0       <NA>
## 42            <NA>         0        0            0         0       <NA>
## 43               0         0        0            0         0          3
## 44               0         1        0            0         0          3
## 45               0         0        0            0         0          3
## 46            <NA>         0        0            0         0       <NA>
## 47               0         0        0            0         0          3
## 48            <NA>         0        0            0         0       <NA>
## 49               0         1        0            0         0          3
## 50               0         0        0            0         0          3
##    fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 31           4    0           0             0         0          0     0
## 32        <NA> <NA>        <NA>          <NA>      <NA>       <NA>     1
## 33        <NA> <NA>        <NA>          <NA>      <NA>       <NA>     1
## 34           4    0           0             0         0          0     0
## 35        <NA> <NA>        <NA>          <NA>      <NA>       <NA>     1
## 36        <NA> <NA>        <NA>          <NA>      <NA>       <NA>     1
## 37           4    0           0             0         0          0     0
## 38           4    0           0             0         0          0     0
## 39        <NA> <NA>        <NA>          <NA>      <NA>       <NA>     1
## 40           4    0           0             0         0          0     0
## 41        <NA> <NA>        <NA>          <NA>      <NA>       <NA>     1
## 42        <NA> <NA>        <NA>          <NA>      <NA>       <NA>     1
## 43           4    0           0             0         0          0     0
## 44           4    0           0             0         0          0     0
## 45           4    0           0             0         0          0     0
## 46        <NA> <NA>        <NA>          <NA>      <NA>       <NA>     1
## 47           4    0           0             0         0          0     0
## 48        <NA> <NA>        <NA>          <NA>      <NA>       <NA>     1
## 49           4    0           0             0         0          0     0
## 50           4    0           0             0         0          0     0

We try to avoid disposing of data points, regardless of the number of missing values, because the more data we throw away, the more potentially valuable information we lose. Therefore, we should attempt to impute the missing values instead. As we noted in the sections above, all of these predictors are categorical. In the textbook, the authors utilized K-nearest neighbors as a means to impute data. (https://www.researchgate.net/post/What_is_the_proper_imputation_method_for_categorical_missing_value is another website with information on imputing categorical missing data, but we will focus on KNN.)

KNN imputation treats each observation as a point in a multidimensional feature space, where similar observations lie close together. If an observation is missing some of its predictor values, we find its k nearest neighbors (k being a parameter we choose) and impute each missing value from those neighbors, e.g., by taking the mode for a categorical variable.

# Reference: https://stats.stackexchange.com/questions/61110/knn-imputation-r-packages
library(DMwR)
## Loading required package: grid
imputed_data <- knnImputation(Soybean,k=5)
head(imputed_data,20)
##                    Class date plant.stand precip temp hail crop.hist
## 1  diaporthe-stem-canker    6           0      2    1    0         1
## 2  diaporthe-stem-canker    4           0      2    1    0         2
## 3  diaporthe-stem-canker    3           0      2    1    0         1
## 4  diaporthe-stem-canker    3           0      2    1    0         1
## 5  diaporthe-stem-canker    6           0      2    1    0         2
## 6  diaporthe-stem-canker    5           0      2    1    0         3
## 7  diaporthe-stem-canker    5           0      2    1    0         2
## 8  diaporthe-stem-canker    4           0      2    1    1         1
## 9  diaporthe-stem-canker    6           0      2    1    0         3
## 10 diaporthe-stem-canker    4           0      2    1    0         2
## 11          charcoal-rot    6           0      0    2    0         1
## 12          charcoal-rot    4           0      0    1    1         1
## 13          charcoal-rot    3           0      0    1    0         1
## 14          charcoal-rot    6           0      0    1    1         3
## 15          charcoal-rot    6           0      0    2    0         1
## 16          charcoal-rot    5           0      0    2    1         3
## 17          charcoal-rot    6           0      0    2    1         0
## 18          charcoal-rot    4           0      0    1    0         2
## 19          charcoal-rot    3           0      0    2    0         2
## 20          charcoal-rot    5           0      0    2    1         2
##    area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1         1     1        0    0            1      1         0         2
## 2         0     2        1    1            1      1         0         2
## 3         0     2        1    2            1      1         0         2
## 4         0     2        0    1            1      1         0         2
## 5         0     1        0    2            1      1         0         2
## 6         0     1        0    1            1      1         0         2
## 7         0     1        1    0            1      1         0         2
## 8         0     1        0    2            1      1         0         2
## 9         0     1        1    1            1      1         0         2
## 10        0     2        0    2            1      1         0         2
## 11        3     1        1    0            1      1         0         2
## 12        3     1        1    1            1      1         0         2
## 13        2     1        0    0            1      1         0         2
## 14        3     1        1    0            1      1         0         2
## 15        3     1        1    1            1      1         0         2
## 16        3     1        1    2            1      1         0         2
## 17        2     1        0    0            1      1         0         2
## 18        2     1        0    1            1      1         0         2
## 19        2     1        0    2            1      1         0         2
## 20        2     1        0    2            1      1         0         2
##    leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1          2           0         0         0    1       1            3
## 2          2           0         0         0    1       0            3
## 3          2           0         0         0    1       0            3
## 4          2           0         0         0    1       0            3
## 5          2           0         0         0    1       0            3
## 6          2           0         0         0    1       0            3
## 7          2           0         0         0    1       1            3
## 8          2           0         0         0    1       0            3
## 9          2           0         0         0    1       0            3
## 10         2           0         0         0    1       0            3
## 11         2           0         0         0    1       0            0
## 12         2           0         0         0    1       1            0
## 13         2           0         0         0    1       0            0
## 14         2           0         0         0    1       0            0
## 15         2           0         0         0    1       0            0
## 16         2           0         0         0    1       0            0
## 17         2           0         0         0    1       1            0
## 18         2           0         0         0    1       0            0
## 19         2           0         0         0    1       0            0
## 20         2           0         0         0    1       0            0
##    canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1              1               1         1        0            0         0
## 2              1               1         1        0            0         0
## 3              0               1         1        0            0         0
## 4              0               1         1        0            0         0
## 5              1               1         1        0            0         0
## 6              0               1         1        0            0         0
## 7              1               1         1        0            0         0
## 8              1               1         1        0            0         0
## 9              1               1         1        0            0         0
## 10             1               1         1        0            0         0
## 11             3               0         0        0            2         1
## 12             3               0         0        0            2         1
## 13             3               0         0        0            2         1
## 14             3               0         0        0            2         1
## 15             3               0         0        0            2         1
## 16             3               0         0        0            2         1
## 17             3               0         0        0            2         1
## 18             3               0         0        0            2         1
## 19             3               0         0        0            2         1
## 20             3               0         0        0            2         1
##    fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1           0           4    0           0             0         0
## 2           0           4    0           0             0         0
## 3           0           4    0           0             0         0
## 4           0           4    0           0             0         0
## 5           0           4    0           0             0         0
## 6           0           4    0           0             0         0
## 7           0           4    0           0             0         0
## 8           0           4    0           0             0         0
## 9           0           4    0           0             0         0
## 10          0           4    0           0             0         0
## 11          0           4    0           0             0         0
## 12          0           4    0           0             0         0
## 13          0           4    0           0             0         0
## 14          0           4    0           0             0         0
## 15          0           4    0           0             0         0
## 16          0           4    0           0             0         0
## 17          0           4    0           0             0         0
## 18          0           4    0           0             0         0
## 19          0           4    0           0             0         0
## 20          0           4    0           0             0         0
##    shriveling roots
## 1           0     0
## 2           0     0
## 3           0     0
## 4           0     0
## 5           0     0
## 6           0     0
## 7           0     0
## 8           0     0
## 9           0     0
## 10          0     0
## 11          0     0
## 12          0     0
## 13          0     0
## 14          0     0
## 15          0     0
## 16          0     0
## 17          0     0
## 18          0     0
## 19          0     0
## 20          0     0
anyNA(imputed_data)
## [1] FALSE

We have successfully imputed all of the data utilizing K Nearest Neighbors.