Data 624 Homework 4

(3.1)

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)

## Warning: package 'mlbench' was built under R version 3.5.3

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

X <- Glass[,1:9]
par(mfrow = c(3, 3))
for (i in 1:ncol(X)) {
  hist(X[ ,i], xlab = names(X[i]), main = names(X[i]))  
}

Based on the histograms above, RI, Na, Al, and Si appear to be approximately normally distributed. The remaining predictors (Mg, K, Ca, Ba, and Fe) do not.
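The visual impression of symmetry can be cross-checked with a sample skewness statistic. The sketch below uses one common moment-based definition in base R; the function name `skewness` and the toy vectors are made up for illustration and are not part of the original analysis.

```r
# Sample skewness: third central moment divided by the 1.5 power of the
# second central moment. Values near 0 suggest a roughly symmetric distribution.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy vectors (hypothetical data, for illustration only)
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(0, 0, 0, 0, 1, 10)

skewness(symmetric)     # zero for perfectly symmetric data
skewness(right_skewed)  # positive for a long right tail

# Applied to the predictors in this exercise (requires mlbench and data(Glass)):
# sapply(Glass[, 1:9], skewness)
```

Predictors with a skewness far from zero are the natural candidates for a transformation.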

library(corrplot)

## corrplot 0.84 loaded

y <- cor(Glass[1:9])
corrplot(y,  method="number")

Predictors RI and Ca are strongly correlated with each other, which suggests that they carry largely redundant information. Because of this redundancy, it may be preferable to keep only one of them. The remaining pairs of predictors are weakly to moderately correlated.
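Reading pairs off a correlation plot can be backed up by listing every pair whose absolute correlation exceeds a cutoff. This is a minimal base R sketch; the helper name `high_cor_pairs`, the 0.75 cutoff, and the toy data are assumptions chosen for illustration.

```r
# List predictor pairs whose absolute correlation exceeds a cutoff.
high_cor_pairs <- function(df, cutoff = 0.75) {
  r <- cor(df, use = "pairwise.complete.obs")
  r[lower.tri(r, diag = TRUE)] <- NA        # keep each pair only once
  idx <- which(abs(r) > cutoff, arr.ind = TRUE)
  data.frame(var1 = rownames(r)[idx[, "row"]],
             var2 = colnames(r)[idx[, "col"]],
             r    = r[idx],
             stringsAsFactors = FALSE)
}

# Toy data (hypothetical): b is a noisy copy of a, c is unrelated
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))
pairs_found <- high_cor_pairs(toy)
pairs_found

# Applied to this exercise: high_cor_pairs(Glass[, 1:9])
```

The cutoff is a judgment call; a lower value flags more pairs for review.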

Do there appear to be any outliers in the data? Are any predictors skewed?

par(mfrow = c(3, 3))
for (i in 1:ncol(Glass[1:9])){
  boxplot(Glass[,i], xlab=colnames(Glass[1:9])[i], horizontal=T)
}

The box spans the 25th to the 75th percentile, and the line inside it marks the median (50th percentile). Values beyond the whiskers are considered outliers. Every predictor except Mg has outliers, and K and Ba show the most extreme ones.
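The number of points beyond the whiskers can also be counted rather than read off the plots. The sketch below relies on `boxplot.stats()` from base R, which returns those points in its `out` component; the helper name `count_outliers` and the toy data frame are made up for illustration.

```r
# Count the points that fall beyond the boxplot whiskers
# (more than coef * IQR away from the box, as in boxplot()).
count_outliers <- function(x, coef = 1.5) {
  length(boxplot.stats(x, coef = coef)$out)
}

# Toy data (hypothetical): one column with a single extreme value
toy <- data.frame(
  clean  = c(1, 2, 3, 4, 5, 6),
  spiked = c(1, 2, 3, 4, 5, 100)
)
counts <- sapply(toy, count_outliers)
counts

# Applied to this exercise: sapply(Glass[, 1:9], count_outliers)
```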

Are there any relevant transformations of one or more predictors that might improve the classification model?

We saw that RI and Ca are highly correlated, so one of the two could be removed. We also saw that some of the predictors are skewed. A Box-Cox transformation could make K, Ba, Mg, and Fe more symmetric, but Box-Cox requires strictly positive values and these predictors contain zeros, so either a constant offset must be added first or a transformation that accepts zeros (such as Yeo-Johnson) used instead. To reduce the influence of outliers, the spatial sign transformation can be applied to the centered and scaled predictors. It projects every sample onto the unit sphere, so that all samples are the same distance from the center.
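The spatial sign transformation can be sketched in a few lines of base R: center and scale each predictor, then divide each sample by its Euclidean norm. The function name `spatial_sign` and the toy data are hypothetical; caret also provides a `spatialSign()` function for the same purpose.

```r
# Spatial sign: project each centered and scaled sample onto the unit sphere.
spatial_sign <- function(df) {
  z <- scale(as.matrix(df))          # center and scale each column
  norms <- sqrt(rowSums(z^2))        # Euclidean norm of each row
  norms[norms == 0] <- 1             # a sample exactly at the center stays at zero
  z / norms                          # divides row i by norms[i]
}

# Toy data (hypothetical): the last sample is an extreme outlier
toy <- data.frame(x = c(1, 2, 3, 4, 50), y = c(2, 1, 4, 3, -40))
ss <- spatial_sign(toy)
row_norms <- sqrt(rowSums(ss^2))
row_norms   # every sample now lies on the unit sphere
```

Because the transformation mixes all predictors within a sample, it should be applied to the predictors as a group, after any predictors have been removed.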

(3.2)

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
##

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

A degenerate distribution occurs when a predictor has a single unique value (zero variance), or has only a handful of unique values with one of them dominating and the others occurring at very low frequency (near-zero variance). Below, the nearZeroVar() function from the caret library is used to check for this. Its output shows, for each variable, whether it has zero or near-zero variance. The variables mycelium, sclerotia, and leaf.mild have near-zero variance (nzv is TRUE); none of the variables has zero variance. The near-zero variance predictors are plotted after the table.

library(caret)

## Warning: package 'caret' was built under R version 3.5.3

## Loading required package: lattice

## Loading required package: ggplot2

X <- Soybean[,2:36]
nearZeroVar(X, names = TRUE, saveMetrics=T)

##                  freqRatio percentUnique zeroVar   nzv
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE

par(mfrow = c(2,2))
plot(Soybean$mycelium, main='mycelium')
plot(Soybean$sclerotia, main='sclerotia')
plot(Soybean$leaf.mild, main='leaf.mild' )

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Below is a table showing the number of missing values for each variable.

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 3.5.2

sorted <- order(-colSums(is.na(Soybean)))
kable(colSums(is.na(Soybean))[sorted])

Variable	Missing values
hail	121
sever	121
seed.tmt	121
lodging	121
germ	112
leaf.mild	108
fruiting.bodies	106
fruit.spots	106
seed.discolor	106
shriveling	106
leaf.shread	100
seed	92
mold.growth	92
seed.size	92
leaf.halo	84
leaf.marg	84
leaf.size	84
leaf.malf	84
fruit.pods	84
precip	38
stem.cankers	38
canker.lesion	38
ext.decay	38
mycelium	38
int.discolor	38
sclerotia	38
plant.stand	36
roots	31
temp	30
crop.hist	16
plant.growth	16
stem	16
date	1
area.dam	1
Class	0
leaves	0

The table below lists the classes that have missing data. Of the 19 classes, only 5 have missing values, and phytophthora-rot has the most. This shows that the pattern of missing values is related to class.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

kable(Soybean %>% mutate(nul=rowSums(is.na(Soybean))) %>% group_by(Class) %>% summarize(missing=sum(nul)) %>% filter(missing!=0))

Class	missing
2-4-d-injury	450
cyst-nematode	336
diaporthe-pod-&-stem-blight	177
herbicide-injury	160
phytophthora-rot	1214
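Total counts of missing cells depend on how many samples each class has, so it can help to look at the share of incomplete rows per class as well. This is a small base R sketch; the helper name `incomplete_by_class` and the toy data frame are made up for illustration.

```r
# Share of rows with at least one missing value, by class
incomplete_by_class <- function(df, class_col = "Class") {
  incomplete <- !complete.cases(df)
  tapply(incomplete, df[[class_col]], mean)
}

# Toy data (hypothetical): every row of class "a" has a missing value
toy <- data.frame(
  Class = c("a", "a", "b", "b", "b"),
  p1    = c(1, NA, 0, 1, 1),
  p2    = c(NA, NA, 1, 0, 1)
)
shares <- incomplete_by_class(toy)
shares

# Applied to this exercise: incomplete_by_class(Soybean)
```

A share close to 1 for a class would indicate that nearly all of its samples are incomplete, which matters for any strategy that drops rows.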

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

From the previous exercise, we know that leaf.mild, mycelium, and sclerotia have degenerate distributions, so we can remove them from the model. The table below shows the fraction of missing values for each variable, which ranges from 0 to about 0.18. Class and leaves have no missing values, so nothing needs to be done for them. For binary (0/1) variables with a small to moderate fraction of missing values, we can randomly impute 0 or 1 for the missing values. These variables are plant.stand, hail, plant.growth, leaf.shread, leaf.malf, lodging, stem, fruiting.bodies, seed, mold.growth, seed.discolor, seed.size, and shriveling. For the remaining variables, we can impute values using k-nearest neighbors.

# Fraction of rows with a missing value, per variable
fracMissing <- colMeans(is.na(Soybean))
sorted <- order(fracMissing)
# Reorder the names together with the values so each row stays matched
df <- data.frame(names(fracMissing)[sorted], fracMissing[sorted])
row.names(df) <- NULL
colnames(df) <- c("Variable", "Fraction Missing")
kable(df)

Variable	Fraction Missing
Class	0.0000000
leaves	0.0000000
date	0.0014641
area.dam	0.0014641
crop.hist	0.0234261
plant.growth	0.0234261
stem	0.0234261
temp	0.0439239
roots	0.0453880
plant.stand	0.0527086
precip	0.0556369
stem.cankers	0.0556369
canker.lesion	0.0556369
ext.decay	0.0556369
mycelium	0.0556369
int.discolor	0.0556369
sclerotia	0.0556369
leaf.halo	0.1229868
leaf.marg	0.1229868
leaf.size	0.1229868
leaf.malf	0.1229868
fruit.pods	0.1229868
seed	0.1346999
mold.growth	0.1346999
seed.size	0.1346999
leaf.shread	0.1464129
fruiting.bodies	0.1551977
fruit.spots	0.1551977
seed.discolor	0.1551977
shriveling	0.1551977
leaf.mild	0.1581259
germ	0.1639824
hail	0.1771596
sever	0.1771596
seed.tmt	0.1771596
lodging	0.1771596
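The random imputation proposed for the binary predictors can be sketched in base R as follows. Note the assumption: instead of drawing 0 or 1 with equal probability, this variant draws each missing value in proportion to the observed frequencies, so that the imputed column keeps roughly the same distribution. The function name `impute_random` and the toy factor are hypothetical.

```r
# Fill missing values of a factor by sampling from its observed levels,
# weighted by how often each level occurs among the non-missing values.
impute_random <- function(x) {
  stopifnot(is.factor(x))
  miss <- is.na(x)
  if (!any(miss) || all(miss)) return(x)     # nothing to do, or nothing to learn from
  probs <- table(x[!miss]) / sum(!miss)      # observed proportions per level
  x[miss] <- sample(names(probs), sum(miss), replace = TRUE,
                    prob = as.numeric(probs))
  x
}

# Toy factor (hypothetical) with two missing values
set.seed(42)
x <- factor(c("0", "0", "1", NA, "0", NA), levels = c("0", "1"))
filled <- impute_random(x)
filled

# Applied to one predictor in this exercise:
# Soybean$hail <- impute_random(Soybean$hail)
```

Since missingness here is concentrated in a few classes, the sampling could also be done within each class; that refinement is not shown.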

S. Tinapunan

March 7, 2021