Data 624 R Programming

Exercise 3.1.

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

Load the dataset

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Refractive Index:

library(ggplot2)
ggplot(data=Glass, aes(x=RI))+
  geom_histogram(bins = 18, col='blue', fill='lightgreen')+
  labs(x='Refractive Index', 
       y= 'Count',
       title = 'Histogram of refractive Indices')

The distribution of the refractive index is roughly bell-shaped but slightly right-skewed.

Sodium:

ggplot(data=Glass, aes(x=Na))+
  geom_histogram(bins = 20, col='black', fill='lightpink')+
  labs(x='Percentage of Na', 
       y= 'Count',
       title = 'Histogram of percentage of sodium')

The distribution of the percentage of sodium is slightly right-skewed.

Magnesium:

ggplot(data=Glass, aes(x=Mg))+
  geom_histogram(bins = 18, col='blue', fill='yellow')+
  labs(x='Percentage of Magnesium', 
       y= 'Count',
       title = 'Histogram of percentage of magnesium')

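The distribution of magnesium is neither normal nor uniform: the histogram shows a group of samples with little or no magnesium and a separate peak at higher percentages.

Aluminium:

ggplot(data=Glass, aes(x=Al))+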
  geom_histogram(bins = 20, col='black', fill='lightpink')+
  labs(x='Percentage of Al', 
       y= 'Count',
       title = 'Histogram of percentage of Aluminium')

The distribution of aluminium is approximately normal but slightly right-skewed.

Silicon:

ggplot(data=Glass, aes(x=Si))+
  geom_histogram(bins = 20, col='blue', fill='lightgreen')+
  labs(x='Percentage of Silicon', 
       y= 'Count',
       title = 'Histogram of percentage of silicon')

The distribution of Si is slightly left-skewed but otherwise close to normal.

Histograms of K, Ca, Ba and Fe:

variables <- c('K', 'Ca', 'Ba', 'Fe')
# Build one histogram per variable; .data[[var]] replaces the deprecated aes_string()
histograms <- lapply(variables, function(var) {
  ggplot(Glass, aes(x = .data[[var]])) +
    geom_histogram(binwidth = 0.1, fill = "lightgreen", color = "blue") +
    ggtitle(paste("Histogram of", var)) +
    xlab(var) +
    ylab("Frequency")
})

gridExtra::grid.arrange(grobs = histograms, ncol = 2)

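The distribution of Ca is approximately normal, but the distributions of K, Ba, and Fe are neither normal nor uniform. K is strongly right-skewed, and Ba and Fe are concentrated at zero with a long right tail. The distribution of Fe resembles a decaying exponential.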

Correlation plot:

library(corrplot)

## corrplot 0.92 loaded

variables <- c('RI', 'Na','Mg','Al','Si', 'K', 'Ca', 'Ba', 'Fe')
cor_matrix <- cor(Glass[, variables])

# Create the correlation plot, then print the correlation matrix itself
corrplot(cor_matrix, method = "shade")

print(cor_matrix)

##               RI          Na           Mg          Al          Si            K
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220 -0.289832711
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881 -0.266086504
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672  0.005395667
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372  0.325958446
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000 -0.193330854
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085  1.000000000
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215 -0.317836155
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131 -0.042618059
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073 -0.007719049
##            Ca            Ba           Fe
## RI  0.8104027 -0.0003860189  0.143009609
## Na -0.2754425  0.3266028795 -0.241346411
## Mg -0.4437500 -0.4922621178  0.083059529
## Al -0.2595920  0.4794039017 -0.074402151
## Si -0.2087322 -0.1021513105 -0.094200731
## K  -0.3178362 -0.0426180594 -0.007719049
## Ca  1.0000000 -0.1128409671  0.124968219
## Ba -0.1128410  1.0000000000 -0.058691755
## Fe  0.1249682 -0.0586917554  1.000000000

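The refractive index has a strong positive correlation with the percentage of calcium (0.81) and negative correlations with the percentages of Si (-0.54), Al (-0.41), and K (-0.29). Among the elements, the largest correlations are Mg with Ba (-0.49), Mg with Al (-0.48), and Al with Ba (0.48).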
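Reading the largest correlations off a 9 x 9 matrix by eye is error-prone, so the pairs can also be ranked programmatically. The following is a minimal sketch in base R; the helper name `top_correlations` and its argument `n` are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: list predictor pairs ordered by absolute correlation.
top_correlations <- function(data, n = 5) {
  cm <- cor(data)
  # Keep each pair once by taking the upper triangle only
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    corr = cm[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first, regardless of sign
  pairs <- pairs[order(-abs(pairs$corr)), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Usage on the Glass predictors (columns 1 to 9), if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(top_correlations(Glass[, 1:9]))
}
```

Sorting by absolute value keeps strong negative relationships, such as RI with Si, next to strong positive ones.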

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

variables <- c('RI', 'Na','Mg','Al','Si', 'K', 'Ca', 'Ba', 'Fe')
# One horizontal box plot per predictor; .data[[var]] replaces the deprecated aes_string()
boxplots <- lapply(variables, function(var) {
  ggplot(Glass, aes(x = .data[[var]])) +
    geom_boxplot(fill = "lightgreen", color = "blue") +
    ggtitle(paste("Box plot of", var)) +
    xlab(var)
})
gridExtra::grid.arrange(grobs = boxplots, ncol = 3)

The box plots show outliers in every predictor except Mg, that is, in RI, Na, Al, Si, K, Ca, Ba, and Fe. These outliers need to be handled before applying any analysis technique that is sensitive to them. The distributions of RI, Na, Al, and Ca are slightly right-skewed, and the distribution of Si is left-skewed. The distributions of Mg, K, Ba, and Fe are clearly not normal. The distribution of Fe resembles a decaying exponential, with counts falling off as the percentage increases.
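The visual impressions above can be backed with numbers. The sketch below computes a sample skewness and counts the points beyond the box plot whiskers for each predictor. The helper names `sample_skewness` and `count_outliers` are assumptions for illustration. The skewness formula is one common definition (packages such as e1071 offer variants), and `boxplot.stats()` computes its hinges slightly differently from `geom_boxplot()`, so the counts may not match the plots exactly.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
# Positive values indicate a right skew, negative values a left skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Number of points beyond the whiskers under the 1.5 * IQR rule
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Usage on the Glass predictors (columns 1 to 9), if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  summary_table <- data.frame(
    skewness = sapply(Glass[, 1:9], sample_skewness),
    outliers = sapply(Glass[, 1:9], count_outliers)
  )
  print(round(summary_table, 2))
}
```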

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

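Answer. Yes. To minimize the effect of outliers, the spatial sign transformation can be applied to the variables RI, Na, Al, Si, K, Ca, Ba, and Fe to improve the classification model. The predictors should be centered and scaled first, because the transformation divides each sample by its norm across all predictors. After the outliers are handled, other transformations can be used to reduce the remaining noise and skewness in the data.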
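As an illustration of the spatial sign transformation, here is a minimal base R sketch: each column is centered and scaled, and each row is then divided by its Euclidean norm so that every sample lies on the unit sphere. The function name `spatial_sign` is an assumption for this example. In practice the caret package offers the same steps through `preProcess()` with `method = c("center", "scale", "spatialSign")`.

```r
# Hypothetical helper: center, scale, then project each sample onto the unit sphere.
spatial_sign <- function(data) {
  z <- scale(as.matrix(data))     # center and scale each column
  norms <- sqrt(rowSums(z^2))     # Euclidean length of each row
  norms[norms == 0] <- 1          # leave all-zero rows unchanged
  z / norms                       # divides row i by norms[i]
}

# Usage on the Glass predictors (columns 1 to 9), if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  glass_ss <- spatial_sign(Glass[, 1:9])
  # Every transformed sample now has length 1
  print(range(sqrt(rowSums(glass_ss^2))))
}
```

Because every sample ends up at the same distance from the origin, extreme values can no longer dominate distance-based or linear models. The predictors are transformed as a group, so removing a predictor afterwards changes the meaning of the remaining values.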

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

ggplot(data = Soybean, mapping = aes(x=temp))+
  geom_bar(fill = 'lightgreen', col='black')+
  labs(title = 'Frequency plot of temp', 
       y='Frequency')

ggplot(data = Soybean, mapping = aes(x=leaf.size))+
  geom_bar(fill = 'yellow', col='black')+
  labs(title = 'Frequency plot of leaf size', 
       y='Frequency', 
       x='Leaf size')
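Plotting every predictor one at a time is tedious, so degeneracy can also be screened numerically. The sketch below computes the two quantities commonly used for near-zero variance checks: the ratio of the most frequent level to the second most frequent, and the percentage of unique values. The helper name `degeneracy_metrics` and the cutoffs of 20 and 10 are assumptions for illustration. caret's `nearZeroVar()` performs a comparable check.

```r
# Hypothetical helper: frequency ratio and percent unique values for one predictor.
degeneracy_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)   # table() drops NA by default
  freq_ratio <- if (length(counts) > 1 && counts[[2]] > 0) {
    counts[[1]] / counts[[2]]
  } else {
    Inf                                         # only one level is observed
  }
  c(freq_ratio = freq_ratio,
    pct_unique = 100 * sum(counts > 0) / length(x))
}

# Usage on the Soybean predictors, if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  metrics <- t(sapply(Soybean[, -1], degeneracy_metrics))
  # Flag predictors dominated by a single level (assumed rule-of-thumb cutoffs)
  flagged <- metrics[metrics[, "freq_ratio"] > 20 &
                       metrics[, "pct_unique"] < 10, , drop = FALSE]
  print(round(flagged, 2))
}
```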

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Answer.

data(Soybean)

missing_values <- sapply(Soybean, function(x) sum(is.na(x)))

print(missing_values)

##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

# Proportion of missing values in each predictor, grouped by the outcome class.
# Grouping by the predictor itself would drop the NA rows and always give 0.
predictors <- setdiff(names(Soybean), "Class")

missing_by_class <- sapply(predictors, function(var) {
  tapply(is.na(Soybean[[var]]), Soybean$Class, mean)
})

print(round(missing_by_class, 2))

Yes. Some predictors are missing far more often than others: hail, sever, seed.tmt, and lodging each have 121 missing values, germ has 112, and leaf.mild has 108, whereas Class and leaves have none and date and area.dam have only one each. For some predictors, the missing values also depend on the outcome class. In missing_by_class, each row is a disease class and each column is a predictor, so the classes responsible for the missing data are the rows with large proportions.
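One way to check whether missingness tracks the outcome class is to count, for each class, how many rows have at least one missing predictor. The following is a minimal sketch; the helper name `incomplete_by_class` and its `class_col` argument are assumptions for illustration.

```r
# Hypothetical helper: number and share of incomplete rows within each class.
incomplete_by_class <- function(data, class_col = "Class") {
  cls <- factor(data[[class_col]])
  predictors <- data[, setdiff(names(data), class_col), drop = FALSE]
  incomplete <- !complete.cases(predictors)   # TRUE if any predictor is NA
  out <- data.frame(
    n = as.vector(table(cls)),
    n_incomplete = as.vector(tapply(incomplete, cls, sum)),
    row.names = levels(cls)
  )
  out$prop_incomplete <- out$n_incomplete / out$n
  # Classes with the largest share of incomplete rows first
  out[order(-out$prop_incomplete), ]
}

# Usage on the Soybean data, if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(incomplete_by_class(Soybean))
}
```

If the incomplete rows are concentrated in a few classes, the missingness itself carries information about the outcome, and that should inform the imputation strategy.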

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Answer.

Missing values in a categorical predictor can be imputed with the most frequent level (the mode) of that predictor. The following code chunk imputes the missing values of the ‘temp’ variable with its mode.

Mode <- function(x) {
  # Most frequent non-missing value of x
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Assign the mode directly so that temp stays an ordered factor
# (ifelse() would drop the factor attributes and return integer codes)
Soybean$temp[is.na(Soybean$temp)] <- Mode(Soybean$temp)

sum(is.na(Soybean$temp))

## [1] 0

There are now no missing values in the ‘temp’ variable. Missing values in the other variables can be handled in the same way.
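The same idea can be extended to every predictor at once. The sketch below loops over the columns and fills missing entries with each column's mode while preserving the factor type. The helper names `mode_value` and `impute_mode` and the `skip` argument are assumptions for illustration, not part of the original solution.

```r
# Most frequent non-missing value of x (ties resolved by first occurrence)
mode_value <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: mode-impute every column except those named in `skip`.
impute_mode <- function(data, skip = "Class") {
  for (col in setdiff(names(data), skip)) {
    missing <- is.na(data[[col]])
    # Skip complete columns and columns with no observed value to borrow
    if (any(missing) && !all(missing)) {
      data[[col]][missing] <- mode_value(data[[col]])
    }
  }
  data
}

# Usage on the Soybean data, if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  soybean_imputed <- impute_mode(Soybean)
  print(sum(is.na(soybean_imputed)))
}
```

Mode imputation is simple, but it ignores any relationship between the missingness and the outcome class, so predictors with a large share of missing values may be better candidates for removal or for a model-based imputation.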