Preprocessing Issue (Overfitting)

SOYBEAN DATA KJ 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

3.2A

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

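library(mlbench)  # provides the Soybean data
library(dplyr)    # %>%, mutate, group_by, select, arrange
library(ggplot2)
library(VIM)      # aggr, kNN
data("Soybean")
# summary(Soybean)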

df.soy = Soybean %>% data.frame(stringsAsFactors = FALSE)
dim(df.soy)
## [1] 683  36
head(df.soy, 2)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
df.soy.new = apply(df.soy, 2, function(x) as.numeric(as.character(x))) %>% 
      data.frame()
## Warning in FUN(newX[, i], ...): NAs introduced by coercion
df.soy.long<- stack(df.soy.new)
dim(df.soy.long)
## [1] 24588     2
head(df.soy.long)
##   values   ind
## 1     NA Class
## 2     NA Class
## 3     NA Class
## 4     NA Class
## 5     NA Class
## 6     NA Class
tail(df.soy.long)
##       values   ind
## 24583     NA roots
## 24584     NA roots
## 24585      1 roots
## 24586      1 roots
## 24587      1 roots
## 24588      1 roots

To visualize the categorical predictors, we first convert every column to numeric; inside the plot call the values are converted back to factors. The predictors take integer codes between 0 and 6, although most of them use only two to four of these levels. The Class column holds disease names rather than numeric codes, so the conversion turns it entirely into NA, which is what the coercion warning refers to.

The stack function reshapes the data into long format. The result has 24588 observations (683 rows times 36 columns) and two variables, values and ind. The ind variable holds the name of the original column and is used to facet the plot.
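As a minimal sketch of what stack does, the toy data frame below (hypothetical values, standing in for df.soy.new) is reshaped the same way:

```r
# Toy data frame standing in for df.soy.new (values are made up)
toy <- data.frame(hail = c(0, 1, NA), temp = c(1, 2, 0))

# stack() puts every value into one column and records the
# source column name in the factor `ind`
toy.long <- stack(toy)

dim(toy.long)         # 3 rows x 2 columns become 6 rows and 2 variables
levels(toy.long$ind)  # one level per original column
```

Missing values are kept as NA rows in the long data, which is why they show up as their own bar in the faceted plot.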

ggplot(data = df.soy.long,
       aes(x = as.factor(values),
           fill = as.factor(values))) +
  geom_bar(stat = 'count', width = 1) +
  facet_wrap(~ind)

library(caret)
## Loading required package: lattice
nearZeroVar(df.soy, names=TRUE, saveMetrics = T)
##                  freqRatio percentUnique zeroVar   nzv
## Class             1.010989     2.7818448   FALSE FALSE
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE

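The nearZeroVar function flags predictors with near-zero variance, that is, predictors dominated by a single value. Three predictors are flagged (nzv = TRUE): leaf.mild, mycelium, and sclerotia. No predictor has exactly zero variance.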
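To make the two metrics concrete, here is a rough sketch that recomputes them by hand for a single numeric vector. The helper nzv_metrics is hypothetical and not part of caret; the cutoffs 95/5 and 10 are, as far as I know, caret's defaults for freqCut and uniqueCut.

```r
# Hypothetical helper (not a caret function): recompute the two
# metrics reported by nearZeroVar for one numeric vector
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  # ratio of the most common value to the second most common value
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  # distinct non-missing values as a percentage of all rows
  percent_unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq_ratio, percentUnique = percent_unique)
}

# Toy predictor: 95 zeros, 4 ones, 1 missing value
x <- c(rep(0, 95), rep(1, 4), NA)
m <- nzv_metrics(x)
m

# Flag the predictor when one value dominates and few distinct values exist
flagged <- m[["freqRatio"]] > 95 / 5 && m[["percentUnique"]] <= 10
flagged
```

In the table above, mycelium has a frequency ratio of 106.5, far beyond the cutoff of 19, which is why it is flagged.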

3.2B

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

ggr_plot <- aggr(df.soy.new,
                 col=c('lightblue','yellow'), numbers=TRUE, sortVars=TRUE, 
                 labels=names(df.soy), cex.axis=.7, gap=3, 
                 ylab=c ("Histogram,Missing data","Pattern, Missing Data" ))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##            Class 1.000000000
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##           leaves 0.000000000

aggr is a function from the VIM package that counts and plots missing values. The left panel is a bar chart of the proportion of missing values in each variable. The right panel shows which combinations of variables are missing together and how often each combination occurs. Together the two panels give an overview of how much is missing per variable and of the patterns of missingness. Class appears as 100% missing only because the numeric conversion above turned the disease labels into NA; the original Class column has no missing values. Among the predictors, hail, sever, seed.tmt, and lodging have the most missing values, about 17.7% each.
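The per-variable proportions can also be checked without a plotting package. The sketch below uses a toy data frame with made-up values; on the real data, passing the original df.soy rather than the coerced df.soy.new should avoid the spurious 100% for Class.

```r
# Toy data frame with made-up values
toy <- data.frame(
  hail   = c(0, NA, 1, NA),
  leaves = c(1, 1, 0, 1)
)

# is.na() gives a logical matrix; column means are the
# proportion of missing values in each variable
miss.prop <- sort(colMeans(is.na(toy)), decreasing = TRUE)
miss.prop
```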

### NOTE: n() counts the rows in each class, so missingcnt.total is the class size, not a count of missing values
missing.total.long<-
  df.soy %>% 
    mutate (total=n()) %>% 
  group_by(Class) %>% 
    mutate (missingcnt.total=n() )%>% 
    select (Class,missingcnt.total) %>% 
    unique() %>% 
    arrange(-missingcnt.total)

head(missing.total.long, 8)
## # A tibble: 8 x 2
## # Groups:   Class [8]
##   Class                 missingcnt.total
##   <fct>                            <int>
## 1 brown-spot                          92
## 2 alternarialeaf-spot                 91
## 3 frog-eye-leaf-spot                  91
## 4 phytophthora-rot                    88
## 5 brown-stem-rot                      44
## 6 anthracnose                         44
## 7 diaporthe-stem-canker               20
## 8 charcoal-rot                        20

The code above groups the data by class, counts the rows in each group, keeps one row per class, and sorts the result in descending order. Note that n() counts every observation in a class, whether or not it has missing values, so missingcnt.total is the class size rather than a count of missing values.

Brown-spot, alternarialeaf-spot, frog-eye-leaf-spot, and phytophthora-rot are the largest classes, with about 90 observations each. Brown-stem-rot and anthracnose follow with 44 each, and diaporthe-stem-canker and charcoal-rot with 20 each. Because these figures are class sizes, they do not by themselves show which classes contain missing values.
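To relate missingness to class, the rows with at least one missing predictor need to be counted within each class. A minimal sketch in base R follows, using a toy data frame with made-up values; the column names n.incomplete and prop.incomplete are hypothetical. The same idea should carry over to df.soy by passing every column except Class to complete.cases.

```r
# Toy data frame with made-up values
toy <- data.frame(
  Class = c("a", "a", "b", "b", "b"),
  hail  = c(0, 1, NA, NA, 1),
  temp  = c(1, 2, NA, 0, 1)
)

# TRUE for rows that have at least one missing predictor
toy$incomplete <- !complete.cases(toy[, c("hail", "temp")])

# Number and share of incomplete rows within each class
n.incomplete    <- tapply(toy$incomplete, toy$Class, sum)
prop.incomplete <- tapply(toy$incomplete, toy$Class, mean)

summary.by.class <- data.frame(
  Class           = names(n.incomplete),
  n.incomplete    = as.vector(n.incomplete),
  prop.incomplete = as.vector(prop.incomplete)
)
summary.by.class
```

A class whose share is close to 1 has missing values in nearly all of its rows, which would suggest that the missingness is tied to the class.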

cntplot<-  ## no print after the assignment
ggplot(data = missing.total.long, 
       aes(x=reorder(Class, missingcnt.total), 
           y =missingcnt.total,
           fill = Class)) +
  geom_bar(stat='identity') +
  theme (   axis.title.x = element_text(size=10),
    axis.text.x = element_text(size=8, angle=45, hjust=1, vjust=1)
      ) +  
  ggtitle ('Sum of Missing Numbers, by Class')
missing.proport<-
df.soy %>% mutate (total=n()) %>% 
  group_by(Class) %>% 
  mutate (missing.cnt=n(), Proportion=missing.cnt/total) %>% 
  select (Class, Proportion) %>% 
  unique() %>% 
  arrange(-Proportion)

The bar chart of the per-class totals is saved to cntplot without being printed, so that it can be drawn next to the proportion plot later. The missing.proport table divides each class total by the overall number of rows.

propplot<-
ggplot(data = missing.proport,
       aes(x=reorder(Class, Proportion), 
           y =Proportion,
           fill = Class)) +
  geom_bar(stat='identity') +
  theme (   axis.title.x = element_text(size=10),
             axis.text.x = element_text(size=8, angle=45, hjust=1, vjust=1) ) +
  ggtitle ('Proportion of Missing, by Class')

Each class's share of the overall total is likewise stored as a plot object, propplot, to be displayed later.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# par(mfrow) only affects base graphics, so the side-by-side layout is set with ncol
grid.arrange(cntplot, propplot, ncol = 2, top = grid::textGrob("Histogram of Class"))

The grid.arrange function draws the per-class totals alongside each class's proportion of the overall total.

3.2C

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

soy_complete <- kNN(df.soy)
head(soy_complete)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0
##   Class_imp date_imp plant.stand_imp precip_imp temp_imp hail_imp crop.hist_imp
## 1     FALSE    FALSE           FALSE      FALSE    FALSE    FALSE         FALSE
## 2     FALSE    FALSE           FALSE      FALSE    FALSE    FALSE         FALSE
## 3     FALSE    FALSE           FALSE      FALSE    FALSE    FALSE         FALSE
## 4     FALSE    FALSE           FALSE      FALSE    FALSE    FALSE         FALSE
## 5     FALSE    FALSE           FALSE      FALSE    FALSE    FALSE         FALSE
## 6     FALSE    FALSE           FALSE      FALSE    FALSE    FALSE         FALSE
##   area.dam_imp sever_imp seed.tmt_imp germ_imp plant.growth_imp leaves_imp
## 1        FALSE     FALSE        FALSE    FALSE            FALSE      FALSE
## 2        FALSE     FALSE        FALSE    FALSE            FALSE      FALSE
## 3        FALSE     FALSE        FALSE    FALSE            FALSE      FALSE
## 4        FALSE     FALSE        FALSE    FALSE            FALSE      FALSE
## 5        FALSE     FALSE        FALSE    FALSE            FALSE      FALSE
## 6        FALSE     FALSE        FALSE    FALSE            FALSE      FALSE
##   leaf.halo_imp leaf.marg_imp leaf.size_imp leaf.shread_imp leaf.malf_imp
## 1         FALSE         FALSE         FALSE           FALSE         FALSE
## 2         FALSE         FALSE         FALSE           FALSE         FALSE
## 3         FALSE         FALSE         FALSE           FALSE         FALSE
## 4         FALSE         FALSE         FALSE           FALSE         FALSE
## 5         FALSE         FALSE         FALSE           FALSE         FALSE
## 6         FALSE         FALSE         FALSE           FALSE         FALSE
##   leaf.mild_imp stem_imp lodging_imp stem.cankers_imp canker.lesion_imp
## 1         FALSE    FALSE       FALSE            FALSE             FALSE
## 2         FALSE    FALSE       FALSE            FALSE             FALSE
## 3         FALSE    FALSE       FALSE            FALSE             FALSE
## 4         FALSE    FALSE       FALSE            FALSE             FALSE
## 5         FALSE    FALSE       FALSE            FALSE             FALSE
## 6         FALSE    FALSE       FALSE            FALSE             FALSE
##   fruiting.bodies_imp ext.decay_imp mycelium_imp int.discolor_imp sclerotia_imp
## 1               FALSE         FALSE        FALSE            FALSE         FALSE
## 2               FALSE         FALSE        FALSE            FALSE         FALSE
## 3               FALSE         FALSE        FALSE            FALSE         FALSE
## 4               FALSE         FALSE        FALSE            FALSE         FALSE
## 5               FALSE         FALSE        FALSE            FALSE         FALSE
## 6               FALSE         FALSE        FALSE            FALSE         FALSE
##   fruit.pods_imp fruit.spots_imp seed_imp mold.growth_imp seed.discolor_imp
## 1          FALSE           FALSE    FALSE           FALSE             FALSE
## 2          FALSE           FALSE    FALSE           FALSE             FALSE
## 3          FALSE           FALSE    FALSE           FALSE             FALSE
## 4          FALSE           FALSE    FALSE           FALSE             FALSE
## 5          FALSE           FALSE    FALSE           FALSE             FALSE
## 6          FALSE           FALSE    FALSE           FALSE             FALSE
##   seed.size_imp shriveling_imp roots_imp
## 1         FALSE          FALSE     FALSE
## 2         FALSE          FALSE     FALSE
## 3         FALSE          FALSE     FALSE
## 4         FALSE          FALSE     FALSE
## 5         FALSE          FALSE     FALSE
## 6         FALSE          FALSE     FALSE
dim(soy_complete)
## [1] 683  72

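I used k-nearest-neighbor (kNN) imputation from the VIM package to handle the missing data. The result has 72 columns: the 36 original variables plus 36 logical _imp columns that flag which values were imputed. After the imputation, the soybean data set no longer contains missing values.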
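The first six rows alone cannot confirm that nothing is missing, so a quick check helps. The sketch below works on a toy data frame with made-up values that mimics the layout of the kNN result (data columns followed by _imp flags); the same calls should apply to soy_complete.

```r
# Toy stand-in for the kNN() result: data columns plus logical *_imp flags
toy.complete <- data.frame(
  hail     = c(0, 1, 0),
  temp     = c(1, 2, 1),
  hail_imp = c(FALSE, TRUE, FALSE),
  temp_imp = c(FALSE, FALSE, TRUE)
)

# FALSE means no missing values remain anywhere in the data frame
anyNA(toy.complete)

# Count the imputed values per variable from the flag columns
is.flag <- grepl("_imp$", names(toy.complete))
n.imputed <- colSums(toy.complete[, is.flag])
n.imputed

# Keep only the data columns for modeling
toy.values <- toy.complete[, !is.flag]
names(toy.values)
```

If the flag columns are not needed at all, kNN also accepts imp_var = FALSE, which, to my knowledge, returns only the imputed data columns.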