DATA 624 - Homework 4

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your RPubs link along with your .pdf for your run code.

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

 knitr::opts_chunk$set(warning = FALSE, message = FALSE)  

# Install the package once if it is not already available
#install.packages("mlbench")
library(mlbench)
data(Glass) # load data set
str(Glass) # view data set structure

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

To visualize distributions, I can plot a density graph of all predictors:

# load package
library(tidyr); library(ggplot2)
# density plot 
Glass %>%
  purrr::keep(is.numeric) %>%   
  gather() %>%                  
  ggplot(aes(value)) + 
  facet_wrap(~ key, scales = "free") +
  geom_density()

To understand the relationships among the predictors, I can draw pairwise scatterplots and compute correlations.

# Install the package once if it is not already available
#install.packages("GGally")
Glass %>%
  purrr::keep(is.numeric) %>%
  GGally::ggpairs()

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Yes, most of the variables are skewed (some right-skewed, some left-skewed) and have outliers.
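
The visual impression can be backed with a number. Below is a minimal sketch, assuming the moment-based definition of sample skewness (one of several common variants). The helper name `skew` is my own and does not come from any package. The Glass part only runs when mlbench is installed.

```r
# Sample skewness: third central moment divided by the 1.5 power of the
# second central moment. Values near 0 suggest symmetry; positive values
# indicate right skew, negative values left skew.
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy checks on hand-made vectors
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(0, 0, 0, 0, 10)
skew(symmetric)     # 0 for a symmetric sample
skew(right_skewed)  # 1.5, i.e. right-skewed

# Apply to the nine Glass predictors (column 10 is the outcome Type)
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  glass_skew <- sapply(Glass[, -10], skew)
  print(round(sort(glass_skew, decreasing = TRUE), 2))
}
```

Predictors whose skewness is far from zero are the natural candidates for the transformations discussed in part (c).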

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

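Yes. To reduce skewness I can apply a Box-Cox transformation, which requires strictly positive values, so predictors that contain zeros (such as Ba and Fe) are left untransformed by it. To limit the influence of outliers, I can apply a spatial sign transformation.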

# Install the package once if it is not already available
#install.packages("caret")
# load package
library(caret) 
# run Box-Cox and Spatial Sign
preProc <- preProcess(Glass[,-10], method = c("BoxCox", "spatialSign"))
# apply transformation to the data set
GlassT <- predict(preProc, Glass[,-10])
# density plot
GlassT %>%
  purrr::keep(is.numeric) %>%   
  gather() %>%                  
  ggplot(aes(value)) + 
  facet_wrap(~ key, scales = "free") +
  geom_density()
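
To make the two transformations concrete, here is a hand-written sketch of what each one does. The function names `box_cox` and `spatial_sign` and the toy matrix are my own illustrations, not caret's internals. The sketch centers and scales before taking the spatial sign, as the book recommends.

```r
# Box-Cox transform for a given lambda; x must be strictly positive.
# lambda = 0 is defined as the natural log.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Spatial sign: center and scale each column, then divide every row by its
# Euclidean norm so that each sample lies on the unit sphere.
spatial_sign <- function(m) {
  z <- scale(as.matrix(m))
  norms <- sqrt(rowSums(z^2))
  z / norms
}

# Toy data with one extreme value in column a
toy <- cbind(a = c(1, 2, 3, 4, 50), b = c(2, 1, 4, 3, 5))

box_cox(toy[, "a"], lambda = 0)   # same as log(toy[, "a"])
ss <- spatial_sign(toy)
rowSums(ss^2)                     # every row has squared norm 1
```

After the spatial sign step the extreme sample sits at the same distance from the center as every other sample, which is why the transformation dampens outliers. It also ties the predictors together, so they should be transformed as a group.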

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Variable “Class” is the outcome; the others are predictors. I can inspect degenerate distributions numerically, visually, and with caret's nearZeroVar() function.

Hmisc::describe(Soybean)

## Soybean 
## 
##  36  Variables      683  Observations
## --------------------------------------------------------------------------------
## Class 
##        n  missing distinct 
##      683        0       19 
## 
## lowest : 2-4-d-injury           alternarialeaf-spot    anthracnose            bacterial-blight       bacterial-pustule     
## highest: phyllosticta-leaf-spot phytophthora-rot       powdery-mildew         purple-seed-stain      rhizoctonia-root-rot  
## --------------------------------------------------------------------------------
## date 
##        n  missing distinct 
##      682        1        7 
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency     26    75    93   118   131   149    90
## Proportion 0.038 0.110 0.136 0.173 0.192 0.218 0.132
## --------------------------------------------------------------------------------
## plant.stand 
##        n  missing distinct 
##      647       36        2 
##                       
## Value          0     1
## Frequency    354   293
## Proportion 0.547 0.453
## --------------------------------------------------------------------------------
## precip 
##        n  missing distinct 
##      645       38        3 
##                             
## Value          0     1     2
## Frequency     74   112   459
## Proportion 0.115 0.174 0.712
## --------------------------------------------------------------------------------
## temp 
##        n  missing distinct 
##      653       30        3 
##                             
## Value          0     1     2
## Frequency     80   374   199
## Proportion 0.123 0.573 0.305
## --------------------------------------------------------------------------------
## hail 
##        n  missing distinct 
##      562      121        2 
##                       
## Value          0     1
## Frequency    435   127
## Proportion 0.774 0.226
## --------------------------------------------------------------------------------
## crop.hist 
##        n  missing distinct 
##      667       16        4 
##                                   
## Value          0     1     2     3
## Frequency     65   165   219   218
## Proportion 0.097 0.247 0.328 0.327
## --------------------------------------------------------------------------------
## area.dam 
##        n  missing distinct 
##      682        1        4 
##                                   
## Value          0     1     2     3
## Frequency    123   227   145   187
## Proportion 0.180 0.333 0.213 0.274
## --------------------------------------------------------------------------------
## sever 
##        n  missing distinct 
##      562      121        3 
##                             
## Value          0     1     2
## Frequency    195   322    45
## Proportion 0.347 0.573 0.080
## --------------------------------------------------------------------------------
## seed.tmt 
##        n  missing distinct 
##      562      121        3 
##                             
## Value          0     1     2
## Frequency    305   222    35
## Proportion 0.543 0.395 0.062
## --------------------------------------------------------------------------------
## germ 
##        n  missing distinct 
##      571      112        3 
##                             
## Value          0     1     2
## Frequency    165   213   193
## Proportion 0.289 0.373 0.338
## --------------------------------------------------------------------------------
## plant.growth 
##        n  missing distinct 
##      667       16        2 
##                       
## Value          0     1
## Frequency    441   226
## Proportion 0.661 0.339
## --------------------------------------------------------------------------------
## leaves 
##        n  missing distinct 
##      683        0        2 
##                       
## Value          0     1
## Frequency     77   606
## Proportion 0.113 0.887
## --------------------------------------------------------------------------------
## leaf.halo 
##        n  missing distinct 
##      599       84        3 
##                             
## Value          0     1     2
## Frequency    221    36   342
## Proportion 0.369 0.060 0.571
## --------------------------------------------------------------------------------
## leaf.marg 
##        n  missing distinct 
##      599       84        3 
##                             
## Value          0     1     2
## Frequency    357    21   221
## Proportion 0.596 0.035 0.369
## --------------------------------------------------------------------------------
## leaf.size 
##        n  missing distinct 
##      599       84        3 
##                             
## Value          0     1     2
## Frequency     51   327   221
## Proportion 0.085 0.546 0.369
## --------------------------------------------------------------------------------
## leaf.shread 
##        n  missing distinct 
##      583      100        2 
##                       
## Value          0     1
## Frequency    487    96
## Proportion 0.835 0.165
## --------------------------------------------------------------------------------
## leaf.malf 
##        n  missing distinct 
##      599       84        2 
##                       
## Value          0     1
## Frequency    554    45
## Proportion 0.925 0.075
## --------------------------------------------------------------------------------
## leaf.mild 
##        n  missing distinct 
##      575      108        3 
##                             
## Value          0     1     2
## Frequency    535    20    20
## Proportion 0.930 0.035 0.035
## --------------------------------------------------------------------------------
## stem 
##        n  missing distinct 
##      667       16        2 
##                       
## Value          0     1
## Frequency    296   371
## Proportion 0.444 0.556
## --------------------------------------------------------------------------------
## lodging 
##        n  missing distinct 
##      562      121        2 
##                       
## Value          0     1
## Frequency    520    42
## Proportion 0.925 0.075
## --------------------------------------------------------------------------------
## stem.cankers 
##        n  missing distinct 
##      645       38        4 
##                                   
## Value          0     1     2     3
## Frequency    379    39    36   191
## Proportion 0.588 0.060 0.056 0.296
## --------------------------------------------------------------------------------
## canker.lesion 
##        n  missing distinct 
##      645       38        4 
##                                   
## Value          0     1     2     3
## Frequency    320    83   177    65
## Proportion 0.496 0.129 0.274 0.101
## --------------------------------------------------------------------------------
## fruiting.bodies 
##        n  missing distinct 
##      577      106        2 
##                     
## Value         0    1
## Frequency   473  104
## Proportion 0.82 0.18
## --------------------------------------------------------------------------------
## ext.decay 
##        n  missing distinct 
##      645       38        3 
##                             
## Value          0     1     2
## Frequency    497   135    13
## Proportion 0.771 0.209 0.020
## --------------------------------------------------------------------------------
## mycelium 
##        n  missing distinct 
##      645       38        2 
##                       
## Value          0     1
## Frequency    639     6
## Proportion 0.991 0.009
## --------------------------------------------------------------------------------
## int.discolor 
##        n  missing distinct 
##      645       38        3 
##                             
## Value          0     1     2
## Frequency    581    44    20
## Proportion 0.901 0.068 0.031
## --------------------------------------------------------------------------------
## sclerotia 
##        n  missing distinct 
##      645       38        2 
##                       
## Value          0     1
## Frequency    625    20
## Proportion 0.969 0.031
## --------------------------------------------------------------------------------
## fruit.pods 
##        n  missing distinct 
##      599       84        4 
##                                   
## Value          0     1     2     3
## Frequency    407   130    14    48
## Proportion 0.679 0.217 0.023 0.080
## --------------------------------------------------------------------------------
## fruit.spots 
##        n  missing distinct 
##      577      106        4 
##                                   
## Value          0     1     2     4
## Frequency    345    75    57   100
## Proportion 0.598 0.130 0.099 0.173
## --------------------------------------------------------------------------------
## seed 
##        n  missing distinct 
##      591       92        2 
##                       
## Value          0     1
## Frequency    476   115
## Proportion 0.805 0.195
## --------------------------------------------------------------------------------
## mold.growth 
##        n  missing distinct 
##      591       92        2 
##                       
## Value          0     1
## Frequency    524    67
## Proportion 0.887 0.113
## --------------------------------------------------------------------------------
## seed.discolor 
##        n  missing distinct 
##      577      106        2 
##                       
## Value          0     1
## Frequency    513    64
## Proportion 0.889 0.111
## --------------------------------------------------------------------------------
## seed.size 
##        n  missing distinct 
##      591       92        2 
##                   
## Value        0   1
## Frequency  532  59
## Proportion 0.9 0.1
## --------------------------------------------------------------------------------
## shriveling 
##        n  missing distinct 
##      577      106        2 
##                       
## Value          0     1
## Frequency    539    38
## Proportion 0.934 0.066
## --------------------------------------------------------------------------------
## roots 
##        n  missing distinct 
##      652       31        3 
##                             
## Value          0     1     2
## Frequency    551    86    15
## Proportion 0.845 0.132 0.023
## --------------------------------------------------------------------------------

# visual frequencies
library(gridExtra)
library(purrr)
marrangeGrob(
  map(
    names(Soybean), 
    ~ ggplot(Soybean, aes_string(.x)) + 
      geom_bar()
  ),
  ncol = 6,
  nrow = 6,
  top = "Soybean Distributions"
)

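nearZeroVar() computes two metrics for each predictor: the percentage of unique values (percentUnique) and the ratio of the frequency of the most common value to that of the second most common (freqRatio). When the first is low (under 10%) and the second is large (above roughly 20; caret's default cutoff is 95/5 = 19), the predictor is flagged as having near-zero variance. As the output below shows, three predictors are flagged: leaf.mild, mycelium, and sclerotia.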

nzv <- nearZeroVar(Soybean[,-1], saveMetrics= TRUE)
library(dplyr)
filter(nzv, zeroVar == TRUE | nzv == TRUE)

##           freqRatio percentUnique zeroVar  nzv
## leaf.mild     26.75     0.4392387   FALSE TRUE
## mycelium     106.50     0.2928258   FALSE TRUE
## sclerotia     31.25     0.2928258   FALSE TRUE
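
The two metrics can be reproduced by hand. This sketch uses only the level counts reported by Hmisc::describe() above and the 683 rows of the data set. The object names are my own.

```r
# Level counts copied from the Hmisc::describe() output above
counts <- list(
  leaf.mild = c(535, 20, 20),
  mycelium  = c(639, 6),
  sclerotia = c(625, 20)
)
n_obs <- 683  # rows in Soybean

# freqRatio: most common count over second most common count
# percentUnique: number of distinct values as a percentage of the rows
nzv_metrics <- t(sapply(counts, function(x) {
  s <- sort(x, decreasing = TRUE)
  c(freqRatio = s[1] / s[2],
    percentUnique = 100 * length(x) / n_obs)
}))
nzv_metrics
```

The results agree with the nearZeroVar() output: frequency ratios of 26.75, 106.5 and 31.25, each with well under 10% unique values.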

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Install the package once if it is not already available
#install.packages("naniar")
# load package
library(naniar)
# number, proportion, and percentage of missing and complete values
cbind(c("Number of Missing Values", 
        "Number of Complete Values",
        "Proportion of Missing Values",
        "Proportion of Complete Values",
        "Percentage of Missing Values",
        "Percentage of Complete Values"),
      rbind(n_miss(Soybean),
            n_complete(Soybean),
            round(prop_miss(Soybean),4),
            round(prop_complete(Soybean),4),
            round(pct_miss(Soybean),2),
            round(pct_complete(Soybean),2)))

##      [,1]                            [,2]   
## [1,] "Number of Missing Values"      "2337" 
## [2,] "Number of Complete Values"     "22251"
## [3,] "Proportion of Missing Values"  "0.095"
## [4,] "Proportion of Complete Values" "0.905"
## [5,] "Percentage of Missing Values"  "9.5"  
## [6,] "Percentage of Complete Values" "90.5"
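
A quick arithmetic check of these totals, as a sketch using only numbers already reported in this document: 683 rows, 36 columns, 2337 missing values, and the 121 missing entries that describe() reports for hail. The last figure is a side observation. 121 of 683 is close to the 18% quoted in the question, which suggests, though I have not confirmed it, that the question counts incomplete observations rather than individual cells.

```r
n_obs     <- 683    # rows in Soybean
n_vars    <- 36     # columns, including Class
n_missing <- 2337   # total missing cells reported by n_miss()

n_cells <- n_obs * n_vars
pct_cells_missing <- 100 * n_missing / n_cells

# Share of observations missing hail (121 missing per describe())
pct_rows_missing_hail <- 100 * 121 / n_obs

round(c(cells = pct_cells_missing, hail_rows = pct_rows_missing_hail), 1)
```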

In fact, about 9.5% of the individual values are missing, roughly half of the 18% stated in the question. Some predictors, however, have far more missing values than others:

# frequency of missingness by variable
library(naniar)
gg_miss_var(Soybean, show_pct = TRUE)

For brevity, we subset just the first five predictors (date through hail) and plot their missingness against the outcome:

# Install the package once if it is not already available
#install.packages("finalfit")
# load package
library(finalfit)
outcome <- "Class"
predictors1 <- colnames(Soybean[,2:6])
Soybean %>% 
  missing_pairs(outcome, predictors1)

Missing data are shown in grey.

levels(Soybean$Class)

##  [1] "2-4-d-injury"                "alternarialeaf-spot"        
##  [3] "anthracnose"                 "bacterial-blight"           
##  [5] "bacterial-pustule"           "brown-spot"                 
##  [7] "brown-stem-rot"              "charcoal-rot"               
##  [9] "cyst-nematode"               "diaporthe-pod-&-stem-blight"
## [11] "diaporthe-stem-canker"       "downy-mildew"               
## [13] "frog-eye-leaf-spot"          "herbicide-injury"           
## [15] "phyllosticta-leaf-spot"      "phytophthora-rot"           
## [17] "powdery-mildew"              "purple-seed-stain"          
## [19] "rhizoctonia-root-rot"

To see which combinations of class and predictor have the most missing values:

## number of missing data in each variable 
# order by number of missing data
library(naniar)
Soybean %>%
  group_by(Class) %>%
  miss_var_summary() %>%
  arrange(desc(n_miss)) %>%
  print(n=50)

## # A tibble: 665 × 4
## # Groups:   Class [19]
##    Class                       variable        n_miss pct_miss
##    <fct>                       <chr>            <int>    <num>
##  1 phytophthora-rot            hail                68     77.3
##  2 phytophthora-rot            sever               68     77.3
##  3 phytophthora-rot            seed.tmt            68     77.3
##  4 phytophthora-rot            germ                68     77.3
##  5 phytophthora-rot            lodging             68     77.3
##  6 phytophthora-rot            fruiting.bodies     68     77.3
##  7 phytophthora-rot            fruit.pods          68     77.3
##  8 phytophthora-rot            fruit.spots         68     77.3
##  9 phytophthora-rot            seed                68     77.3
## 10 phytophthora-rot            mold.growth         68     77.3
## 11 phytophthora-rot            seed.discolor       68     77.3
## 12 phytophthora-rot            seed.size           68     77.3
## 13 phytophthora-rot            shriveling          68     77.3
## 14 phytophthora-rot            leaf.halo           55     62.5
## 15 phytophthora-rot            leaf.marg           55     62.5
## 16 phytophthora-rot            leaf.size           55     62.5
## 17 phytophthora-rot            leaf.shread         55     62.5
## 18 phytophthora-rot            leaf.malf           55     62.5
## 19 phytophthora-rot            leaf.mild           55     62.5
## 20 2-4-d-injury                plant.stand         16    100  
## 21 2-4-d-injury                precip              16    100  
## 22 2-4-d-injury                temp                16    100  
## 23 2-4-d-injury                hail                16    100  
## 24 2-4-d-injury                crop.hist           16    100  
## 25 2-4-d-injury                sever               16    100  
## 26 2-4-d-injury                seed.tmt            16    100  
## 27 2-4-d-injury                germ                16    100  
## 28 2-4-d-injury                plant.growth        16    100  
## 29 2-4-d-injury                leaf.shread         16    100  
## 30 2-4-d-injury                leaf.mild           16    100  
## 31 2-4-d-injury                stem                16    100  
## 32 2-4-d-injury                lodging             16    100  
## 33 2-4-d-injury                stem.cankers        16    100  
## 34 2-4-d-injury                canker.lesion       16    100  
## 35 2-4-d-injury                fruiting.bodies     16    100  
## 36 2-4-d-injury                ext.decay           16    100  
## 37 2-4-d-injury                mycelium            16    100  
## 38 2-4-d-injury                int.discolor        16    100  
## 39 2-4-d-injury                sclerotia           16    100  
## 40 2-4-d-injury                fruit.pods          16    100  
## 41 2-4-d-injury                fruit.spots         16    100  
## 42 2-4-d-injury                seed                16    100  
## 43 2-4-d-injury                mold.growth         16    100  
## 44 2-4-d-injury                seed.discolor       16    100  
## 45 2-4-d-injury                seed.size           16    100  
## 46 2-4-d-injury                shriveling          16    100  
## 47 2-4-d-injury                roots               16    100  
## 48 diaporthe-pod-&-stem-blight hail                15    100  
## 49 diaporthe-pod-&-stem-blight sever               15    100  
## 50 diaporthe-pod-&-stem-blight seed.tmt            15    100  
## # ℹ 615 more rows

By count, the most missing values occur in class “phytophthora-rot”: 68 of its observations (77.3%) are missing for each of 13 predictors.

Ordered by the percentage of missing data instead, the ranking (and the classes at the top) is different:

## percentage of missing data in each variable 
# order by percentage of missing data
Soybean %>%
  group_by(Class) %>%
  miss_var_summary() %>%
  arrange(desc(pct_miss)) %>%
  print(n=50)

## # A tibble: 665 × 4
## # Groups:   Class [19]
##    Class                       variable        n_miss pct_miss
##    <fct>                       <chr>            <int>    <num>
##  1 diaporthe-pod-&-stem-blight hail                15      100
##  2 diaporthe-pod-&-stem-blight sever               15      100
##  3 diaporthe-pod-&-stem-blight seed.tmt            15      100
##  4 diaporthe-pod-&-stem-blight leaf.halo           15      100
##  5 diaporthe-pod-&-stem-blight leaf.marg           15      100
##  6 diaporthe-pod-&-stem-blight leaf.size           15      100
##  7 diaporthe-pod-&-stem-blight leaf.shread         15      100
##  8 diaporthe-pod-&-stem-blight leaf.malf           15      100
##  9 diaporthe-pod-&-stem-blight leaf.mild           15      100
## 10 diaporthe-pod-&-stem-blight lodging             15      100
## 11 diaporthe-pod-&-stem-blight roots               15      100
## 12 cyst-nematode               plant.stand         14      100
## 13 cyst-nematode               precip              14      100
## 14 cyst-nematode               temp                14      100
## 15 cyst-nematode               hail                14      100
## 16 cyst-nematode               sever               14      100
## 17 cyst-nematode               seed.tmt            14      100
## 18 cyst-nematode               germ                14      100
## 19 cyst-nematode               leaf.halo           14      100
## 20 cyst-nematode               leaf.marg           14      100
## 21 cyst-nematode               leaf.size           14      100
## 22 cyst-nematode               leaf.shread         14      100
## 23 cyst-nematode               leaf.malf           14      100
## 24 cyst-nematode               leaf.mild           14      100
## 25 cyst-nematode               lodging             14      100
## 26 cyst-nematode               stem.cankers        14      100
## 27 cyst-nematode               canker.lesion       14      100
## 28 cyst-nematode               fruiting.bodies     14      100
## 29 cyst-nematode               ext.decay           14      100
## 30 cyst-nematode               mycelium            14      100
## 31 cyst-nematode               int.discolor        14      100
## 32 cyst-nematode               sclerotia           14      100
## 33 cyst-nematode               fruit.spots         14      100
## 34 cyst-nematode               seed.discolor       14      100
## 35 cyst-nematode               shriveling          14      100
## 36 2-4-d-injury                plant.stand         16      100
## 37 2-4-d-injury                precip              16      100
## 38 2-4-d-injury                temp                16      100
## 39 2-4-d-injury                hail                16      100
## 40 2-4-d-injury                crop.hist           16      100
## 41 2-4-d-injury                sever               16      100
## 42 2-4-d-injury                seed.tmt            16      100
## 43 2-4-d-injury                germ                16      100
## 44 2-4-d-injury                plant.growth        16      100
## 45 2-4-d-injury                leaf.shread         16      100
## 46 2-4-d-injury                leaf.mild           16      100
## 47 2-4-d-injury                stem                16      100
## 48 2-4-d-injury                lodging             16      100
## 49 2-4-d-injury                stem.cankers        16      100
## 50 2-4-d-injury                canker.lesion       16      100
## # ℹ 615 more rows
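
A more compact view of the same question is to count, per class, the rows that have at least one missing predictor. The sketch below uses base R only. The helper `incomplete_by_class` and the toy data frame are hypothetical illustrations, and the Soybean part runs only when mlbench is installed.

```r
# Count rows with at least one missing predictor, split by class
incomplete_by_class <- function(df, class_col) {
  predictors <- df[, setdiff(names(df), class_col), drop = FALSE]
  has_na <- !complete.cases(predictors)
  tapply(has_na, df[[class_col]], sum)
}

# Toy data: both rows of class "a" are incomplete, class "b" is complete
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "b")),
  x = c(1, NA, 3, 4, 5),
  y = c(NA, NA, 1, 2, 3)
)
toy_counts <- incomplete_by_class(toy, "Class")
toy_counts

# Same summary for Soybean, showing only classes with incomplete rows
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  soy_counts <- incomplete_by_class(Soybean, "Class")
  print(soy_counts[soy_counts > 0])
}
```

If only a handful of classes appear in the last printout, the missingness is concentrated in those classes rather than spread evenly across the data.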

Finally, we can model missingness with a decision tree. Because the response is the proportion of missing values in each row, rpart fits a regression tree here:

# Install the package once if it is not already available
#install.packages("rpart.plot")
# load packages
library(rpart)
library(rpart.plot)
# model the proportion of missing values per row
Soybean %>%
  add_prop_miss() %>%
  rpart(prop_miss_all ~ ., data = .) %>%
  prp(type = 4, extra = 101, prefix = "Prop. Miss = ", roundint = FALSE)

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

There is some evidence that the values are not missing completely at random: missingness depends on the class. Missing values on many predictors tend to occur in classes 1 (“2-4-d-injury”), 9 (“cyst-nematode”), 10 (“diaporthe-pod-&-stem-blight”), 14 (“herbicide-injury”) and 16 (“phytophthora-rot”). Rather than eliminating predictors, a better idea is to impute the missing values. We will use a random forest algorithm (the missForest package) for the imputation:

# Install the package once if it is not already available
#install.packages("missForest")
# load packages
library(missForest)
library(dplyr)
# impute the predictors only (the outcome Class is excluded)
Soybean.imp <- missForest(select(Soybean, -Class), 
                          ntree = 1000, verbose = TRUE)

##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.1121311 
##     difference(s): 0.02183644 
##     time: 17.75 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.1108298 
##     difference(s): 0.005145367 
##     time: 17.92 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.1106335 
##     difference(s): 0.002384438 
##     time: 16.26 seconds
## 
##   missForest iteration 4 in progress...done!
##     estimated error(s): 0.1111546 
##     difference(s): 0.001087639 
##     time: 16.24 seconds
## 
##   missForest iteration 5 in progress...done!
##     estimated error(s): 0.110835 
##     difference(s): 0.0007529805 
##     time: 18.22 seconds
## 
##   missForest iteration 6 in progress...done!
##     estimated error(s): 0.1104027 
##     difference(s): 0.0007529805 
##     time: 18.2 seconds

# extract the imputed predictors and re-attach the outcome under its own name
SoybeanImpute <- data.frame(Soybean.imp$ximp)
SoybeanNew <- cbind(Class = Soybean$Class, SoybeanImpute)
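
For comparison only, a much simpler baseline than random forest imputation is mode imputation, where each missing value is replaced by the most frequent level of its column. This is not the method used in this homework. The helper `impute_mode` and the toy data frame below are hypothetical, and the sketch ignores relationships between predictors, which is exactly what missForest exploits.

```r
# Replace NAs in a factor with that factor's most frequent observed level
impute_mode <- function(f) {
  mode_level <- names(which.max(table(f)))  # table() drops NA by default
  f[is.na(f)] <- mode_level
  f
}

# Toy data with a plain factor and an ordered factor
toy <- data.frame(
  hail = factor(c("0", "0", "1", NA, "0")),
  germ = factor(c("2", NA, "2", "1", NA),
                levels = c("0", "1", "2"), ordered = TRUE)
)

toy_imputed <- as.data.frame(lapply(toy, impute_mode))
toy_imputed
anyNA(toy_imputed)  # FALSE once every column is filled
```

The same `anyNA()` check is a reasonable sanity test for any imputed data set, including the one produced by missForest.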

Bikash Bhowmik

02 Mar 2025