Chapter 3 - Data Pre-Processing



3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

> library(mlbench)
> data(Glass)
> str(Glass)
'data.frame': 214 obs. of 10 variables:
$ RI : num 1.52 1.52 1.52 1.52 1.52 ...
$ Na : num 13.6 13.9 13.5 13.2 13.3 ...
$ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
$ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
$ Si : num 71.8 72.7 73 72.6 73.1 ...
$ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
$ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
$ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
$ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
$ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
  2. Do there appear to be any outliers in the data? Are any predictors skewed?
  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

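library(corrplot)  # provides corrplot()

# Keep only the nine predictors (drop the Type outcome)
glass <- subset(Glass, select = -Type)
predictors <- colnames(glass)

# One histogram per predictor on a 3 x 3 grid
par(mfrow = c(3, 3))
for (i in seq_along(predictors)) {
  hist(glass[, i], main = predictors[i], xlab = predictors[i])
}
par(mfrow = c(1, 1))  # reset the plotting grid before the next figure

# Pairwise correlations between the predictors
corrplot(cor(glass), method = 'square')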
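
The correlation plot can be complemented by listing the strongest pairs numerically. The following is a minimal sketch in base R; the helper name `top_correlations` and the cutoff of 0.5 are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: list predictor pairs whose absolute correlation
# exceeds a cutoff, strongest first.
top_correlations <- function(df, cutoff = 0.5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Look only above the diagonal so each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), , drop = FALSE]
}

# Applied to the nine Glass predictors (requires the mlbench package)
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(top_correlations(Glass[, 1:9]))
}
```

Pairs that appear here are candidates for closer inspection with a scatter plot.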


(b)
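
One way to put numbers on the questions about skewness and outliers is sketched here in base R. The helper names are hypothetical, the skewness is the plain moment-based version (other definitions differ by small-sample corrections), and the 1.5 x IQR fence is the usual boxplot convention rather than a rule taken from the exercise.

```r
# Moment-based sample skewness: third central moment over variance^1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

# Count values lying more than 1.5 * IQR beyond the quartiles (boxplot rule)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# One row per predictor with both summaries
summarise_predictors <- function(df) {
  data.frame(
    skewness = sapply(df, skewness),
    outliers = sapply(df, count_outliers)
  )
}

# Applied to the nine Glass predictors (requires the mlbench package)
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(round(summarise_predictors(Glass[, 1:9]), 2))
}
```

Skewness values far from zero and large outlier counts should be read together with the histograms, since predictors with many exact zeros can trigger both.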


(c)
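
Several predictors in the `str(Glass)` output contain exact zeros (for example Ba and Fe), so a Box-Cox transformation, which needs strictly positive values, cannot be applied to them directly. A log(1 + x) transformation is one simple alternative; the Yeo-Johnson method offered by `caret::preProcess` is another. The sketch below, with the hypothetical helper `compare_log1p`, only compares skewness before and after log(1 + x) and makes no claim about which predictors benefit.

```r
# Moment-based sample skewness
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
}

# Compare skewness before and after log(1 + x); only non-negative
# numeric columns are considered, everything else is skipped
compare_log1p <- function(df) {
  ok <- sapply(df, function(x) is.numeric(x) && all(x >= 0, na.rm = TRUE))
  data.frame(
    before = sapply(df[ok], skew),
    after  = sapply(df[ok], function(x) skew(log1p(x)))
  )
}

# Applied to the nine Glass predictors (requires the mlbench package)
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(round(compare_log1p(Glass[, 1:9]), 2))
}
```

A predictor is worth transforming only if the `after` column is clearly closer to zero than the `before` column.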



3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

> library(mlbench)
> data(Soybean)
> ## See ?Soybean for details
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
## 

(a)

sb_freq <- Soybean
head(Soybean[,2:length(sb_freq)])
##   date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ
## 1    6           0      2    1    0         1        1     1        0    0
## 2    4           0      2    1    0         2        0     2        1    1
## 3    3           0      2    1    0         1        0     2        1    2
## 4    3           0      2    1    0         1        0     2        0    1
## 5    6           0      2    1    0         2        0     1        0    2
## 6    5           0      2    1    0         3        0     1        0    1
##   plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf
## 1            1      1         0         2         2           0         0
## 2            1      1         0         2         2           0         0
## 3            1      1         0         2         2           0         0
## 4            1      1         0         2         2           0         0
## 5            1      1         0         2         2           0         0
## 6            1      1         0         2         2           0         0
##   leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies
## 1         0    1       1            3             1               1
## 2         0    1       0            3             1               1
## 3         0    1       0            3             0               1
## 4         0    1       0            3             0               1
## 5         0    1       0            3             1               1
## 6         0    1       0            3             0               1
##   ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 1         1        0            0         0          0           4    0
## 2         1        0            0         0          0           4    0
## 3         1        0            0         0          0           4    0
## 4         1        0            0         0          0           4    0
## 5         1        0            0         0          0           4    0
## 6         1        0            0         0          0           4    0
##   mold.growth seed.discolor seed.size shriveling roots
## 1           0             0         0          0     0
## 2           0             0         0          0     0
## 3           0             0         0          0     0
## 4           0             0         0          0     0
## 5           0             0         0          0     0
## 6           0             0         0          0     0
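library(ggplot2)   # ggplot(), geom_bar(), facet_wrap()
library(reshape2)  # melt()

# Convert the factor codes to numbers so all predictors share one bar-chart layout
sb_freq[, 2:length(sb_freq)] <- lapply(sb_freq[, 2:length(sb_freq)], function(x) as.numeric(as.character(x)))

# Long format (one row per Class, predictor, value), then one bar chart per predictor
ggplot(data = melt(sb_freq, id.vars = "Class"), mapping = aes(x = value)) +
  geom_bar() +
  facet_wrap(~variable, scales = 'free_x')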
## Warning: Removed 2337 rows containing non-finite values (stat_count).
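
The bar charts can be backed by a numeric check for degenerate (near-zero variance) predictors. The sketch below is a base R approximation using two statistics: the ratio of the most common to the second most common value, and the percentage of unique values. The function name and the cutoffs of 95/5 and 10 are assumptions chosen to mirror the defaults of `caret::nearZeroVar`.

```r
# Hypothetical helper: flag predictors dominated by a single value
degenerate_check <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  stats <- lapply(df, function(x) {
    counts <- sort(table(x), decreasing = TRUE)  # table() ignores NAs
    counts <- counts[counts > 0]                 # drop unused factor levels
    freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
    pct_unique <- 100 * length(counts) / length(x)
    c(freq_ratio = freq_ratio, pct_unique = pct_unique)
  })
  out <- as.data.frame(do.call(rbind, stats))
  out$degenerate <- out$freq_ratio > freq_cut & out$pct_unique < unique_cut
  out
}

# Reuses the Soybean data frame loaded earlier; columns 2 to 36 are the predictors
if (exists("Soybean")) {
  print(degenerate_check(Soybean[, 2:36]))
}
```

From the counts printed by `summary(Soybean)`, mycelium (639 vs. 6) and sclerotia (625 vs. 20) have frequency ratios of roughly 106 and 31, so both would exceed the 95/5 cutoff.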


(b)

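library(dplyr)  # %>%, select(), group_by(), summarise(), arrange()

# Number of missing values in each row; column totals follow
Soybean$na_count <- apply(Soybean, 1, function(x) sum(is.na(x)))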
colSums(is.na(Soybean))
##           Class            date     plant.stand          precip 
##               0               1              36              38 
##            temp            hail       crop.hist        area.dam 
##              30             121              16               1 
##           sever        seed.tmt            germ    plant.growth 
##             121             121             112              16 
##          leaves       leaf.halo       leaf.marg       leaf.size 
##               0              84              84              84 
##     leaf.shread       leaf.malf       leaf.mild            stem 
##             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies 
##             121              38              38             106 
##       ext.decay        mycelium    int.discolor       sclerotia 
##              38              38              38              38 
##      fruit.pods     fruit.spots            seed     mold.growth 
##              84             106              92              92 
##   seed.discolor       seed.size      shriveling           roots 
##             106              92             106              31 
##        na_count 
##               0
(Soybean %>%
  select(Class, na_count) %>%
  group_by(Class) %>%
  summarise(na_count = sum(na_count)) %>%
  arrange(desc(na_count)))
## # A tibble: 19 x 2
##    Class                       na_count
##    <fct>                          <int>
##  1 phytophthora-rot                1214
##  2 2-4-d-injury                     450
##  3 cyst-nematode                    336
##  4 diaporthe-pod-&-stem-blight      177
##  5 herbicide-injury                 160
##  6 alternarialeaf-spot                0
##  7 anthracnose                        0
##  8 bacterial-blight                   0
##  9 bacterial-pustule                  0
## 10 brown-spot                         0
## 11 brown-stem-rot                     0
## 12 charcoal-rot                       0
## 13 diaporthe-stem-canker              0
## 14 downy-mildew                       0
## 15 frog-eye-leaf-spot                 0
## 16 phyllosticta-leaf-spot             0
## 17 powdery-mildew                     0
## 18 purple-seed-stain                  0
## 19 rhizoctonia-root-rot               0
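
Raw totals of missing values depend on how many samples each class has, so a complementary view is the share of incomplete rows within each class, alongside the share of missing values per predictor. The following base R sketch uses two hypothetical helpers, `missing_by_column` and `missing_by_group`; neither appears in the original analysis.

```r
# Share of missing values per column, largest first
missing_by_column <- function(df) {
  sort(colMeans(is.na(df)), decreasing = TRUE)
}

# For each level of `group`: number of rows and percentage of rows
# that contain at least one NA
missing_by_group <- function(df, group) {
  g <- factor(df[[group]])          # drops unused levels
  incomplete <- !complete.cases(df)
  data.frame(
    n = as.vector(table(g)),
    pct_incomplete = 100 * as.vector(tapply(incomplete, g, mean)),
    row.names = levels(g)
  )
}

# Reuses the Soybean data frame loaded earlier
if (exists("Soybean")) {
  print(round(missing_by_column(Soybean), 3))
  print(missing_by_group(Soybean, "Class"))
}
```

A class with `pct_incomplete` at or near 100 has missing values in essentially every sample, which would indicate that the missingness is tied to the class rather than spread at random.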

(c)

The mice package implements Multivariate Imputation by Chained Equations (MICE).

library(mice)
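# Drop the helper column na_count (37) and eight predictors by position:
# shriveling (35), seed.size (34), seed.discolor (33), sclerotia (28),
# mycelium (26), lodging (21), leaf.mild (19), leaf.malf (18)
sb <- Soybean[, -c(37, 35, 34, 33, 28, 26, 21, 19, 18)]
# Convert the remaining factor codes to numbers so cor() can be applied
sb[, 2:length(sb)] <- lapply(sb[, 2:length(sb)], function(x) as.numeric(as.character(x)))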
str(sb[,-1])
## 'data.frame':    683 obs. of  27 variables:
##  $ date           : num  6 4 3 3 6 5 5 4 6 4 ...
##  $ plant.stand    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ precip         : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ temp           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ hail           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ crop.hist      : num  1 2 1 1 2 3 2 1 3 2 ...
##  $ area.dam       : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ sever          : num  1 2 2 2 1 1 1 1 1 2 ...
##  $ seed.tmt       : num  0 1 1 0 0 0 1 0 1 0 ...
##  $ germ           : num  0 1 2 1 2 1 0 2 1 2 ...
##  $ plant.growth   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ leaves         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.halo      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ leaf.marg      : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.size      : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.shread    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ stem           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ stem.cankers   : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ canker.lesion  : num  1 1 0 0 1 0 1 1 1 1 ...
##  $ fruiting.bodies: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ ext.decay      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fruit.pods     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fruit.spots    : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mold.growth    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ roots          : num  0 0 0 0 0 0 0 0 0 0 ...
summary(sb[,-1])
##       date        plant.stand         precip           temp      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :4.000   Median :0.0000   Median :2.000   Median :1.000  
##  Mean   :3.554   Mean   :0.4529   Mean   :1.597   Mean   :1.182  
##  3rd Qu.:5.000   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :6.000   Max.   :1.0000   Max.   :2.000   Max.   :2.000  
##  NA's   :1       NA's   :36       NA's   :38      NA's   :30     
##       hail         crop.hist        area.dam         sever       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.000   Median :2.000   Median :1.000   Median :1.0000  
##  Mean   :0.226   Mean   :1.885   Mean   :1.581   Mean   :0.7331  
##  3rd Qu.:0.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :3.000   Max.   :3.000   Max.   :2.0000  
##  NA's   :121     NA's   :16      NA's   :1       NA's   :121     
##     seed.tmt           germ        plant.growth        leaves      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :1.000   Median :0.0000   Median :1.0000  
##  Mean   :0.5196   Mean   :1.049   Mean   :0.3388   Mean   :0.8873  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :2.0000   Max.   :2.000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :121      NA's   :112     NA's   :16                       
##    leaf.halo       leaf.marg       leaf.size      leaf.shread    
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2.000   Median :0.000   Median :1.000   Median :0.0000  
##  Mean   :1.202   Mean   :0.773   Mean   :1.284   Mean   :0.1647  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :1.0000  
##  NA's   :84      NA's   :84      NA's   :84      NA's   :100     
##       stem         stem.cankers  canker.lesion    fruiting.bodies 
##  Min.   :0.0000   Min.   :0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.00   Median :1.0000   Median :0.0000  
##  Mean   :0.5562   Mean   :1.06   Mean   :0.9798   Mean   :0.1802  
##  3rd Qu.:1.0000   3rd Qu.:3.00   3rd Qu.:2.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.00   Max.   :3.0000   Max.   :1.0000  
##  NA's   :16       NA's   :38     NA's   :38       NA's   :106     
##    ext.decay       int.discolor      fruit.pods      fruit.spots   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.000  
##  Mean   :0.2496   Mean   :0.1302   Mean   :0.5042   Mean   :1.021  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:2.000  
##  Max.   :2.0000   Max.   :2.0000   Max.   :3.0000   Max.   :4.000  
##  NA's   :38       NA's   :38       NA's   :84       NA's   :106    
##       seed         mold.growth         roots       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1946   Mean   :0.1134   Mean   :0.1779  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :2.0000  
##  NA's   :92       NA's   :92       NA's   :31
correlated_sb <- cor(sb[,-1], use="pairwise.complete.obs")
corrplot(correlated_sb, method = "square", order = "hclust")
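
For the imputation itself, two options are sketched here under stated assumptions. `impute_mode` is a simple baseline that fills each column with its most frequent value; it is a stand-in for comparison and not the chained-equations method. `impute_with_mice` is a hypothetical wrapper around `mice::mice()` and `mice::complete()`; the choices `m = 5` and `seed = 123` are illustrative, and the wrapper is defined but not run here.

```r
# Baseline: replace NAs in each column with that column's most frequent value
impute_mode <- function(df) {
  df[] <- lapply(df, function(x) {
    if (anyNA(x) && !all(is.na(x))) {
      ux <- unique(x[!is.na(x)])
      mode_value <- ux[which.max(tabulate(match(x, ux)))]
      x[is.na(x)] <- mode_value
    }
    x
  })
  df
}

# Hypothetical wrapper: multiple imputation by chained equations.
# Requires the mice package; returns the first of the m completed data sets.
impute_with_mice <- function(df, m = 5, seed = 123) {
  imp <- mice::mice(df, m = m, seed = seed, printFlag = FALSE)
  mice::complete(imp, 1)
}

# Small demonstration of the baseline on a toy data frame
toy <- data.frame(f = factor(c("0", "0", "1", NA)), n = c(2, 2, NA, 3))
print(impute_mode(toy))
```

Mode imputation ignores relationships between predictors, which is the main reason to prefer a model-based method such as mice when the missingness is concentrated in a few classes.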

What is seen and what can be done: