(3.1)

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
## Warning: package 'mlbench' was built under R version 3.5.3
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
X <- Glass[, 1:9]  # the nine predictors; column 10 is the class label
par(mfrow = c(3, 3))
for (i in seq_along(X)) {
  hist(X[, i], xlab = names(X)[i], main = names(X)[i])
}

Based on the histograms above, RI, Na, Al, and Si appear approximately normal in their distributions. The remaining predictors (Mg, K, Ca, Ba, and Fe) do not appear approximately normal.
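The visual impression from the histograms can be backed by a number. The following is a minimal sketch, assuming the usual sample skewness statistic (third central moment divided by the variance raised to the power 3/2). The helper name `skewness_stat` is hypothetical, and the two vectors are made-up illustrations rather than columns of Glass.

```r
# Sample skewness: near 0 for symmetric data, positive for a long right tail.
skewness_stat <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  v <- sum((x - mean(x))^2) / (n - 1)           # sample variance
  sum((x - mean(x))^3) / ((n - 1) * v^(3 / 2))  # scaled third central moment
}

symmetric  <- c(-2, -1, 0, 1, 2)                 # mirror image around 0
right_tail <- c(0, 0, 0, 0, 0.1, 0.3, 3)         # mostly zeros plus a long right tail

skew_sym   <- skewness_stat(symmetric)
skew_right <- skewness_stat(right_tail)
```

With mlbench loaded, `sapply(Glass[, 1:9], skewness_stat)` would give one value per predictor, which could then be compared with the histograms.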

library(corrplot)
## corrplot 0.84 loaded
y <- cor(Glass[1:9])
corrplot(y,  method="number")

Predictors RI and Ca are strongly correlated with each other, which means that they carry largely redundant information. For that reason, it is advisable to keep only one of these two variables. The remaining pairs of predictors are weakly to moderately correlated.
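Reading pairs off a correlation plot gets tedious with many predictors, so the same check can be done programmatically. This is a sketch under assumptions: the helper `high_cor_pairs` is hypothetical, the 0.75 cutoff is an arbitrary illustrative choice, and the data frame `toy` is simulated rather than taken from Glass.

```r
# Return every predictor pair whose absolute correlation exceeds `cutoff`.
high_cor_pairs <- function(df, cutoff = 0.75) {
  cm  <- cor(df)
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Simulated data: b is nearly a copy of a, c is unrelated noise.
set.seed(1)
a   <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))
pairs_found <- high_cor_pairs(toy)
```

Applied to the glass predictors, `high_cor_pairs(Glass[, 1:9])` should list the pairs that stand out in the correlation plot.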

  2. Do there appear to be any outliers in the data? Are any predictors skewed?
par(mfrow = c(3, 3))
for (i in 1:9) {  # one horizontal boxplot per predictor
  boxplot(Glass[, i], xlab = colnames(Glass)[i], horizontal = TRUE)
}

The box spans the 25th to 75th percentiles, and the line inside it is the median (50th percentile). Points beyond the whiskers, which by default extend at most 1.5 times the interquartile range from the box, are drawn as potential outliers. Every predictor variable except Mg shows such points. Predictors K and Ba show the most extreme outliers.
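The points drawn beyond the whiskers can also be counted rather than eyeballed. A minimal sketch, assuming base R's `boxplot.stats()`, which applies the same 1.5 times IQR rule that `boxplot()` uses by default. The helper `count_outliers` is hypothetical and `toy` is a made-up vector.

```r
# Count the points that boxplot() would draw beyond the whiskers.
count_outliers <- function(x) length(boxplot.stats(x)$out)

toy     <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)  # 50 sits far from the rest
n_out   <- count_outliers(toy)
flagged <- boxplot.stats(toy)$out             # the flagged values themselves
```

With mlbench loaded, `sapply(Glass[, 1:9], count_outliers)` would give the number of flagged points per predictor.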

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

We saw that RI and Ca are highly correlated, so we can remove one of these two variables. We also saw that some of the predictors are skewed. Applying a Box-Cox transformation to predictors K, Ba, Mg, and Fe could give more symmetric distributions. Because Box-Cox requires strictly positive values and predictors such as Ba and Fe contain zeros, a small positive offset or the Yeo-Johnson transformation would be needed in practice. A data transformation that can help minimize the problem of outliers is the spatial sign transformation. After centering and scaling, it projects every sample onto the unit sphere, so all samples end up the same distance from the center.
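The spatial sign idea is short enough to write out by hand. The following is a sketch rather than a reference implementation: `spatial_sign` is a hypothetical helper, and `toy` is made-up data with one extreme row. The caret package provides its own `spatialSign()` function, which would normally be used instead.

```r
# Spatial sign: center and scale each column, then divide each row by its
# Euclidean norm so that every sample lies on the unit sphere.
spatial_sign <- function(x) {
  z     <- scale(as.matrix(x))
  norms <- sqrt(rowSums(z^2))
  norms[norms == 0] <- 1   # a row exactly at the center is left unchanged
  z / norms                # recycling divides row i by norms[i]
}

toy <- data.frame(u = c(1, 2, 3, 4, 100), v = c(2, 1, 4, 3, -50))  # last row is extreme
ss  <- spatial_sign(toy)
row_lengths <- sqrt(rowSums(ss^2))
```

Every entry of `row_lengths` equals 1 up to floating-point error, so the extreme row can no longer dominate distance-based calculations.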


(3.2)

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
## 
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

A degenerate distribution occurs when a predictor has a single unique value (zero variance) or only a handful of unique values, some of which occur with very low frequency (near-zero variance). Below, the nearZeroVar() function from the caret library is used to examine the uniqueness of the data. The resulting table shows whether each variable has zero or near-zero variance. The variables mycelium, sclerotia, and leaf.mild have near-zero variance ("nzv"). None of the variables have zero variance. The table is followed by plots of the three near-zero-variance predictors.

library(caret)
## Warning: package 'caret' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: ggplot2
X <- Soybean[,2:36]
nearZeroVar(X, names = TRUE, saveMetrics=T)
##                  freqRatio percentUnique zeroVar   nzv
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE
par(mfrow = c(2,2))
plot(Soybean$mycelium, main='mycelium')
plot(Soybean$sclerotia, main='sclerotia')
plot(Soybean$leaf.mild, main='leaf.mild' )
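The two metrics in the nearZeroVar() table can be reproduced by hand from the frequency counts, which makes the flag easier to interpret. This is a sketch: `nzv_metrics` and `is_nzv` are hypothetical helpers, the counts are read off the summary() output above, and the thresholds are assumed to match caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`).

```r
# Two metrics behind the near-zero-variance flag, from a frequency table.
# Assumes the predictor has at least two distinct observed values.
nzv_metrics <- function(counts, n_total) {
  sorted <- sort(counts, decreasing = TRUE)
  c(freqRatio     = sorted[[1]] / sorted[[2]],       # most common / second most common
    percentUnique = 100 * length(counts) / n_total)  # distinct values per 100 samples
}

# Counts taken from summary(Soybean); n_total is the 683 rows of the data set.
mycelium  <- nzv_metrics(c("0" = 639, "1" = 6), n_total = 683)
leaf_mild <- nzv_metrics(c("0" = 535, "1" = 20, "2" = 20), n_total = 683)

# Flag when one value dominates and there are few distinct values.
is_nzv <- function(m) m[["freqRatio"]] > 95 / 5 && m[["percentUnique"]] < 10
```

The results agree with the table: mycelium has a frequency ratio of 106.5 and leaf.mild of 26.75, both above the assumed cutoff of 19.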

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Below is a table that shows the count of missing values for each variable.

library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.2
sorted <- order(-colSums(is.na(Soybean)))
kable(colSums(is.na(Soybean))[sorted])
Variable Missing
hail 121
sever 121
seed.tmt 121
lodging 121
germ 112
leaf.mild 108
fruiting.bodies 106
fruit.spots 106
seed.discolor 106
shriveling 106
leaf.shread 100
seed 92
mold.growth 92
seed.size 92
leaf.halo 84
leaf.marg 84
leaf.size 84
leaf.malf 84
fruit.pods 84
precip 38
stem.cankers 38
canker.lesion 38
ext.decay 38
mycelium 38
int.discolor 38
sclerotia 38
plant.stand 36
roots 31
temp 30
crop.hist 16
plant.growth 16
stem 16
date 1
area.dam 1
Class 0
leaves 0

Below is a table that lists the classes with missing data, along with the total number of missing values in each. Of the 19 classes, only 5 have missing values. The class phytophthora-rot has the largest number of missing values. This shows that the pattern of missing values is related to the class.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Total number of missing cells per class, keeping only classes that have any
kable(Soybean %>%
  mutate(nul = rowSums(is.na(Soybean))) %>%
  group_by(Class) %>%
  summarize(missing = sum(nul)) %>%
  filter(missing != 0))
Class missing
2-4-d-injury 450
cyst-nematode 336
diaporthe-pod-&-stem-blight 177
herbicide-injury 160
phytophthora-rot 1214
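Totals like these depend on how many samples each class has, so a per-class proportion can make the comparison fairer. The following is a minimal base R sketch on made-up data: the data frame `toy` is a hypothetical stand-in for Soybean, and its values are purely illustrative.

```r
# Made-up stand-in for Soybean: a class label plus two predictors with NAs.
toy <- data.frame(
  Class = c("a", "a", "a", "b", "b", "c"),
  p1    = c(1, NA, 2, NA, NA, 3),
  p2    = c(NA, NA, 1, 1, NA, 2)
)

incomplete      <- !complete.cases(toy)                 # TRUE when a row has any NA
prop_incomplete <- tapply(incomplete, toy$Class, mean)  # share of such rows per class
```

Running the same two lines on Soybean (with `Soybean$Class` as the grouping variable) would show, for each class, what share of its samples has at least one missing predictor.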
  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

From the previous exercise, we know that leaf.mild, mycelium, and sclerotia have degenerate distributions, so we can remove them from the model. The table below shows the fraction of missing data for each variable, which ranges from 0 to about 0.18. Class and leaves have no missing values, so nothing needs to be done for them. For variables that take only the values 0 or 1 and have a small to moderate fraction of missing values, we can randomly impute 0 or 1 for the missing entries. These variables are plant.stand, hail, plant.growth, leaf.shread, leaf.malf, lodging, stem, fruiting.bodies, seed, mold.growth, seed.discolor, seed.size, and shriveling. For the remaining variables, we can use k-nearest neighbors to impute values.

# Fraction of rows with a missing value, per variable (missing count / 683)
fracMissing <- sort(colMeans(is.na(Soybean)))
# Take the names from the sorted vector so each value keeps its own label
df <- data.frame(names(fracMissing), fracMissing)
row.names(df) <- NULL
colnames(df) <- c("Variable", "Fraction Missing")
kable(df)
Variable Fraction Missing
Class 0.0000000
leaves 0.0000000
date 0.0014641
area.dam 0.0014641
crop.hist 0.0234261
plant.growth 0.0234261
stem 0.0234261
temp 0.0439239
roots 0.0453880
plant.stand 0.0527086
precip 0.0556369
stem.cankers 0.0556369
canker.lesion 0.0556369
ext.decay 0.0556369
mycelium 0.0556369
int.discolor 0.0556369
sclerotia 0.0556369
leaf.halo 0.1229868
leaf.marg 0.1229868
leaf.size 0.1229868
leaf.malf 0.1229868
fruit.pods 0.1229868
seed 0.1346999
mold.growth 0.1346999
seed.size 0.1346999
leaf.shread 0.1464129
fruiting.bodies 0.1551977
fruit.spots 0.1551977
seed.discolor 0.1551977
shriveling 0.1551977
leaf.mild 0.1581259
germ 0.1639824
hail 0.1771596
sever 0.1771596
seed.tmt 0.1771596
lodging 0.1771596
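The random imputation proposed for the binary predictors can be sketched in a few lines of base R. This is an illustration under assumptions, not a definitive implementation: `impute_random` is a hypothetical helper, `hail_like` is a made-up factor, and drawing values in proportion to the observed frequencies is one possible choice (a plain coin flip between 0 and 1 is another).

```r
# Replace NAs in a factor by sampling from its observed levels, weighted by
# how often each level occurs among the non-missing values.
impute_random <- function(f) {
  miss <- is.na(f)
  if (!any(miss)) return(f)
  freq <- table(f)  # table() ignores NAs by default
  f[miss] <- sample(names(freq), sum(miss), replace = TRUE,
                    prob = as.numeric(freq) / sum(freq))
  f
}

set.seed(42)
hail_like <- factor(c("0", "0", "0", "1", NA, "0", NA, "1"))  # made-up binary predictor
filled    <- impute_random(hail_like)
```

The observed entries are left untouched and only the NA positions are filled. For the multi-level predictors, the k-nearest-neighbor approach mentioned in the text would need a distance measure suited to categorical data, which this sketch does not attempt.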