DATA 624: Home Work 04

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe

The data can be accessed via:

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First of all, let’s do some data basic exploration

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
## [1] FALSE

No null found in the data frame

## [1] FALSE

No missing values found in the data frame

In order to see predictor variables distribution we need to pivot the data.

## # A tibble: 6 x 3
##   Type  Predictor Value
##   <fct> <fct>     <dbl>
## 1 1     RI         1.52
## 2 1     Na        13.6 
## 3 1     Mg         4.49
## 4 1     Al         1.1 
## 5 1     Si        71.8 
## 6 1     K          0.06

We are using histogram to see the distribution of predictor variables.

We can see high concentrations of silica (Si), sodium (Na) and lime (Ca) in the histogram which are the main raw matreials of glass.

Source: https://www.cmog.org/article/chemistry-glass

Let’s look at the predictor variables relationship using corrplot

We can see from above that most of the predictors are negatively correlated. Here, data represent chemical concentration on a percentage scale. As a result we can see when one element increases other element decreases.

In addtion, we observed that most of the correlations are not very strong except correlation between calcium and the refractive index which are strongly positively correlated.

(b)Do there appear to be any outliers in the data? Are any predictors skewed?

Let’s look at the basic statistics of the predictor variables.

Predictor Min 1st Qu. Median Mean 3rd Qu. Max
Al 0.29000 1.190000 1.36000 1.4449065 1.630000 3.50000
Ba 0.00000 0.000000 0.00000 0.1750467 0.000000 3.15000
Ca 5.43000 8.240000 8.60000 8.9569626 9.172500 16.19000
Fe 0.00000 0.000000 0.00000 0.0570093 0.100000 0.51000
K 0.00000 0.122500 0.55500 0.4970561 0.610000 6.21000
Mg 0.00000 2.115000 3.48000 2.6845327 3.600000 4.49000
Na 10.73000 12.907500 13.30000 13.4078505 13.825000 17.38000
RI 1.51115 1.516522 1.51768 1.5183654 1.519157 1.53393
Si 69.81000 72.280000 72.79000 72.6509346 73.087500 75.41000

I want to see how the predictors are distributed by the type of glass. I will use a scatter plot to do this but will be excluding scilica because of the difference in scale.

## Warning: Removed 415 rows containing missing values (geom_point).

It looks like glass type 1, 2 and 3 are very similar in chemical composition. There are a couple of observations that appear to be outliers. For example there are a couple of potasium (K) observations in the type 5 glass that are unusually high. There is a barium (Ba) observation in type 2 glass that apears to be an outlier along with some calcium (Ca) observations in type 2 glass.

Magnesium is bimodal and left skewed. Iron, potasium and barium are right skewed. The other predictors are somewhat normal.

(c)Are there any relevant transformations of one or more predictors that might improve the classification model?

  • Yes, transformations like a log or a Box Cox could help improve the classification model.
  • Removing skew is removing outliers that improves a model’s performance.
  • Also, centering and scaling can be important for all variables with any model.
  • Checking if there are any missing values in any columns that can cause a delay or miscalculate or need to addressed by imputation/removal/or other means. We can confirm there are no missing value in this dataset.

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

(a)Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

First of all, let’s do some data basic exploration

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
## 
##                   Class date plant.stand precip temp hail crop.hist
## 1 diaporthe-stem-canker    6           0      2    1    0         1
## 2 diaporthe-stem-canker    4           0      2    1    0         2
## 3 diaporthe-stem-canker    3           0      2    1    0         1
## 4 diaporthe-stem-canker    3           0      2    1    0         1
## 5 diaporthe-stem-canker    6           0      2    1    0         2
## 6 diaporthe-stem-canker    5           0      2    1    0         3
##   area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1        1     1        0    0            1      1         0         2
## 2        0     2        1    1            1      1         0         2
## 3        0     2        1    2            1      1         0         2
## 4        0     2        0    1            1      1         0         2
## 5        0     1        0    2            1      1         0         2
## 6        0     1        0    1            1      1         0         2
##   leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1         2           0         0         0    1       1            3
## 2         2           0         0         0    1       0            3
## 3         2           0         0         0    1       0            3
## 4         2           0         0         0    1       0            3
## 5         2           0         0         0    1       0            3
## 6         2           0         0         0    1       0            3
##   canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1             1               1         1        0            0         0
## 2             1               1         1        0            0         0
## 3             0               1         1        0            0         0
## 4             0               1         1        0            0         0
## 5             1               1         1        0            0         0
## 6             0               1         1        0            0         0
##   fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1          0           4    0           0             0         0
## 2          0           4    0           0             0         0
## 3          0           4    0           0             0         0
## 4          0           4    0           0             0         0
## 5          0           4    0           0             0         0
## 6          0           4    0           0             0         0
##   shriveling roots
## 1          0     0
## 2          0     0
## 3          0     0
## 4          0     0
## 5          0     0
## 6          0     0
## [1] FALSE

No null found in the data frame

## [1] TRUE

There are missing values found in the data frame

## Using Class as id variables
## Warning: Removed 2337 rows containing non-finite values (stat_count).

I am assuming the degenerate distibuted variables discussed earlier in the chapter refers to section 3.5 on removing predictors which states:

• The fraction of unique values over the sample size is low (say 10 %).
• The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

There’s a lot of missing variables. The authors recommended removing variables with near zero variance. The caret package has a function for that. Here’s the output from that function:

freqRatio percentUnique zeroVar nzv
Class 1.010989 2.7818448 FALSE FALSE
date 1.137405 1.0248902 FALSE FALSE
plant.stand 1.208191 0.2928258 FALSE FALSE
precip 4.098214 0.4392387 FALSE FALSE
temp 1.879397 0.4392387 FALSE FALSE
hail 3.425197 0.2928258 FALSE FALSE
crop.hist 1.004587 0.5856515 FALSE FALSE
area.dam 1.213904 0.5856515 FALSE FALSE
sever 1.651282 0.4392387 FALSE FALSE
seed.tmt 1.373874 0.4392387 FALSE FALSE
germ 1.103627 0.4392387 FALSE FALSE
plant.growth 1.951327 0.2928258 FALSE FALSE
leaves 7.870130 0.2928258 FALSE FALSE
leaf.halo 1.547511 0.4392387 FALSE FALSE
leaf.marg 1.615385 0.4392387 FALSE FALSE
leaf.size 1.479638 0.4392387 FALSE FALSE
leaf.shread 5.072917 0.2928258 FALSE FALSE
leaf.malf 12.311111 0.2928258 FALSE FALSE
leaf.mild 26.750000 0.4392387 FALSE TRUE
stem 1.253378 0.2928258 FALSE FALSE
lodging 12.380952 0.2928258 FALSE FALSE
stem.cankers 1.984293 0.5856515 FALSE FALSE
canker.lesion 1.807910 0.5856515 FALSE FALSE
fruiting.bodies 4.548077 0.2928258 FALSE FALSE
ext.decay 3.681481 0.4392387 FALSE FALSE
mycelium 106.500000 0.2928258 FALSE TRUE
int.discolor 13.204546 0.4392387 FALSE FALSE
sclerotia 31.250000 0.2928258 FALSE TRUE
fruit.pods 3.130769 0.5856515 FALSE FALSE
fruit.spots 3.450000 0.5856515 FALSE FALSE
seed 4.139130 0.2928258 FALSE FALSE
mold.growth 7.820895 0.2928258 FALSE FALSE
seed.discolor 8.015625 0.2928258 FALSE FALSE
seed.size 9.016949 0.2928258 FALSE FALSE
shriveling 14.184211 0.2928258 FALSE FALSE
roots 6.406977 0.4392387 FALSE FALSE

There are three variables (leaf.mild, mycelium, sclerotia) that have a near zero variance, and should probably be removed.

(b)Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

There are blocks of observations that are missing. Since the data are arranged by the classes this suggests that the patterns of missing data are related to the classes which can also be varified by below

## # A tibble: 19 x 2
##    Class                       na_count
##    <fct>                          <int>
##  1 phytophthora-rot                1214
##  2 2-4-d-injury                     450
##  3 cyst-nematode                    336
##  4 diaporthe-pod-&-stem-blight      177
##  5 herbicide-injury                 160
##  6 alternarialeaf-spot                0
##  7 anthracnose                        0
##  8 bacterial-blight                   0
##  9 bacterial-pustule                  0
## 10 brown-spot                         0
## 11 brown-stem-rot                     0
## 12 charcoal-rot                       0
## 13 diaporthe-stem-canker              0
## 14 downy-mildew                       0
## 15 frog-eye-leaf-spot                 0
## 16 phyllosticta-leaf-spot             0
## 17 powdery-mildew                     0
## 18 purple-seed-stain                  0
## 19 rhizoctonia-root-rot               0

(c)Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I will be eliminating the three near zero variance predictiors. For all other predictors I will be imputing values. I don’t have any domain expertise that would inform the imputations, so I will be using decision trees to (hopefully) produce good imputations. It has been my experience that decision trees preform really well. The dlookr package integrates well with the tidyverse.

## Warning: package 'dlookr' was built under R version 3.6.3
## Loading required package: mice
## Warning: package 'mice' was built under R version 3.6.3
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## This version of Shiny is designed to work with 'htmlwidgets' >= 1.5.
##     Please upgrade via install.packages('htmlwidgets').
## 
## Attaching package: 'dlookr'
## The following object is masked from 'package:base':
## 
##     transform
Soybean_complete <- Soybean %>%
  # Impute missing values using rpart
  mutate(
    date = imputate_na(Soybean, date, Class, method = "rpart", no_attrs = TRUE),
    plant.stand = imputate_na(Soybean, plant.stand, Class, method = "rpart", no_attrs = TRUE),
    precip = imputate_na(Soybean, precip, Class, method = "rpart", no_attrs = TRUE),
    temp = imputate_na(Soybean, temp, Class, method = "rpart", no_attrs = TRUE),
    hail = imputate_na(Soybean, hail, Class, method = "rpart", no_attrs = TRUE),
    crop.hist = imputate_na(Soybean, crop.hist, Class, method = "rpart", no_attrs = TRUE),
    area.dam = imputate_na(Soybean, area.dam, Class, method = "rpart", no_attrs = TRUE),
    sever = imputate_na(Soybean, sever, Class, method = "rpart", no_attrs = TRUE),
    seed.tmt = imputate_na(Soybean, seed.tmt, Class, method = "rpart", no_attrs = TRUE),
    germ = imputate_na(Soybean, germ, Class, method = "rpart", no_attrs = TRUE),
    plant.growth = imputate_na(Soybean, plant.growth, Class, method = "rpart", no_attrs = TRUE),
    leaf.halo = imputate_na(Soybean, leaf.halo, Class, method = "rpart", no_attrs = TRUE),
    leaf.marg = imputate_na(Soybean, leaf.marg, Class, method = "rpart", no_attrs = TRUE),
    leaf.size = imputate_na(Soybean, leaf.size, Class, method = "rpart", no_attrs = TRUE),
    leaf.shread = imputate_na(Soybean, leaf.shread, Class, method = "rpart", no_attrs = TRUE),
    leaf.malf = imputate_na(Soybean, leaf.malf, Class, method = "rpart", no_attrs = TRUE),
    stem = imputate_na(Soybean, stem, Class, method = "rpart", no_attrs = TRUE),
    lodging = imputate_na(Soybean, lodging, Class, method = "rpart", no_attrs = TRUE),
    stem.cankers = imputate_na(Soybean, stem.cankers, Class, method = "rpart", no_attrs = TRUE),
    canker.lesion = imputate_na(Soybean, canker.lesion, Class, method = "rpart", no_attrs = TRUE),
    fruiting.bodies = imputate_na(Soybean, fruiting.bodies, Class, method = "rpart", no_attrs = TRUE),
    ext.decay = imputate_na(Soybean, ext.decay, Class, method = "rpart", no_attrs = TRUE),
    int.discolor = imputate_na(Soybean, int.discolor, Class, method = "rpart", no_attrs = TRUE),
    fruit.pods = imputate_na(Soybean, fruit.pods, Class, method = "rpart", no_attrs = TRUE),
    seed = imputate_na(Soybean, seed, Class, method = "rpart", no_attrs = TRUE),
    mold.growth = imputate_na(Soybean, mold.growth, Class, method = "rpart", no_attrs = TRUE),
    seed.discolor = imputate_na(Soybean, seed.discolor, Class, method = "rpart", no_attrs = TRUE),
    seed.size = imputate_na(Soybean, seed.size, Class, method = "rpart", no_attrs = TRUE),
    shriveling = imputate_na(Soybean, shriveling, Class, method = "rpart", no_attrs = TRUE),
    fruit.spots = imputate_na(Soybean, fruit.spots, Class, method = "rpart", no_attrs = TRUE),
    roots = imputate_na(Soybean, roots, Class, method = "rpart", no_attrs = TRUE)) %>%
  # Drop the near zero variance predictors
  select(-leaf.mild, -mycelium, -sclerotia) 
## Warning: Unquoting language objects with `!!!` is deprecated as of rlang 0.4.0.
## Please use `!!` instead.
## 
##   # Bad:
##   dplyr::select(data, !!!enquo(x))
## 
##   # Good:
##   dplyr::select(data, !!enquo(x))    # Unquote single quosure
##   dplyr::select(data, !!!enquos(x))  # Splice list of quosures
## 
## This warning is displayed once per session.

We can prove that it worked by presenting the following

Forhad Akbar

9/25/2020