3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# Plot correlation heatmap
corr <- Glass %>% subset(select=-c(Type)) %>% cor(use='pairwise.complete.obs')
corrplot.mixed(corr, upper='square', lower.col = "black")

#ggcorrplot(corr,  ggtheme=ggplot2::theme_grey)
# Create multi histograms
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

plots <- list()
i <- 0
for(var in names(Glass)[-10]){
  i <- i+1
  p <- ggplot(Glass, aes_string(x=var)) + 
    geom_histogram(position='identity') +
    xlab(var) +
    ylab('COUNT') +
    theme(legend.position='none', 
          axis.title=element_text(size=8),
          axis.text=element_text(size=6))
  plots[[i]] <- p
}
multiplot(plotlist=plots, cols=3)
## Loading required package: grid
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

par(mfrow=c(3,3))
for(var in names(Glass)[-10]){
  boxplot(Glass[var], main=paste('Boxplot of', var), horizontal = T)
}

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

From the the boxplot, it appears that all of the predictors, except Mg, have outliers.

From the histogram, it seems that the variables Ba, Mg, K, Ca, and Fe are heavily skewed.

The skewness value can be calculated to confirm:

Glass[-10] %>% apply(2, skewness) %>% sort(decreasing=T)
##          K         Ba         Ca         Fe         RI         Al 
##  6.4600889  3.3686800  2.0184463  1.7298107  1.6027151  0.8946104 
##         Na         Si         Mg 
##  0.4478343 -0.7202392 -1.1364523

Here, the absolute value of skewness of K, Ba, Ca, and Mg are significantly greater than zero.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Below, Box-Cox transform is done on all of the predictors. Then they are centered and scaled.

trans <- preProcess(Glass[-10], method=c('BoxCox', 'center', 'scale'))
transformed <- predict(trans, Glass[-10])

par(mfrow=c(3,3))
for(var in names(transformed)[-10]){
  boxplot(transformed[var], main=paste('Boxplot of', var), horizontal = T)
}

transformed %>% apply(2, skewness) %>% sort(decreasing=T)
##           K          Ba          Fe          RI          Al          Na 
##  6.46008890  3.36867997  1.72981071  1.56566039  0.09105899  0.03384644 
##          Ca          Si          Mg 
## -0.19395573 -0.65090568 -1.13645228

The centering and scaling did the job of bringing the mean to 0 and standard deviation to 1.

It appears that the Box-Cox transformation has improved the skewness of Ca, Al, and Na. It was not effective in reducing the skewness for other predictors having heavier skewness.

For some predictors having high count of zero value, such as Ba and Fe, the skewness may be due to these zeros. It might be beneficial to include an engineered binary feature that identifies if the predictor is zero or non-zero, and apply Box-Cox transform to only the non-zero values of these predictors.

After performaning the filtered transformation, the skewness of Ba and Fe are significantly reduced:

reduce_skew <- function(vec){
  trans <- vec[vec!=0] %>% BoxCoxTrans() %>% predict(vec[vec!=0])
  return(skewness(trans))
}

paste('The skewness of Ba is now:', reduce_skew(Glass$Ba))
## [1] "The skewness of Ba is now: -0.0544828448359267"
paste('The skewness of Fe is now:', reduce_skew(Glass$Fe))
## [1] "The skewness of Fe is now: 0.0729367099534234"

Lastly, some of the predictors have high correlation. For example, RI and Ca has a correlation of 0.81. It may be beneficial to remove one of the highly correlated predictors to improve stability of some linear models.

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The variable with degenrate distributions is a variable with “zero-variance” issue, that satisfies both following conditions:

The nearZeroVar function can be used to find the degenrate variables:

paste('The degenerate variables are:', paste(names(Soybean[,nearZeroVar(Soybean)]), collapse = ', '))
## [1] "The degenerate variables are: leaf.mild, mycelium, sclerotia"
summary(Soybean[19])
##  leaf.mild 
##  0   :535  
##  1   : 20  
##  2   : 20  
##  NA's:108

For the leaf.mild variable, the factin of unique value over the sample size is 3/683=0.4% < 10%, and the ratio of the most prevalent value to the 2nd most prevalent value is 535/20=26.75 > 20.

summary(Soybean[26])
##  mycelium  
##  0   :639  
##  1   :  6  
##  NA's: 38

For the mycelium variable, the factin of unique value over the sample size is 2/683=0.3% < 10%, and the ratio of the most prevalent value to the 2nd most prevalent value is 639/6=106.5 > 20.

summary(Soybean[28])
##  sclerotia 
##  0   :625  
##  1   : 20  
##  NA's: 38

For the sclerotia variable, the factin of unique value over the sample size is 2/683=0.3% < 10%, and the ratio of the most prevalent value to the 2nd most prevalent value is 625/20=31.25 > 20.

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The count of missing values in each variables are found below:

nas <- Soybean[-1] %>% apply(2, is.na) %>% apply(2, sum, na.rm=T)
nas <- sort(nas, decreasing=T)
nas
##            hail           sever        seed.tmt         lodging 
##             121             121             121             121 
##            germ       leaf.mild fruiting.bodies     fruit.spots 
##             112             108             106             106 
##   seed.discolor      shriveling     leaf.shread            seed 
##             106             106             100              92 
##     mold.growth       seed.size       leaf.halo       leaf.marg 
##              92              92              84              84 
##       leaf.size       leaf.malf      fruit.pods          precip 
##              84              84              84              38 
##    stem.cankers   canker.lesion       ext.decay        mycelium 
##              38              38              38              38 
##    int.discolor       sclerotia     plant.stand           roots 
##              38              38              36              31 
##            temp       crop.hist    plant.growth            stem 
##              30              16              16              16 
##            date        area.dam          leaves 
##               1               1               0

Below, a table is constructed to show the relationship of the missing data to the classes. The table is constructed as following:

  1. Select the predictor variable
  2. Find the row indices where the predictor has missing values
  3. Select these rows
  4. Count the number of occurrence in each class of the target variable
  5. Repeat 1~4 for each predictor variable
t_list <- list()
i <- 0
for (var in names(Soybean[-1])) {
  i <- i +1
  row_id <- which(is.na(Soybean[,var]))
  temp <- Soybean[row_id,'Class']
  t_list[[i]] <- as.matrix(table(temp))
}
df <- data.frame(do.call(cbind, t_list))
names(df) <- names(Soybean[-1])
df <- df[names(nas)]
df <- t(df)
kable(df) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width='100%', height = "500px")
2-4-d-injury alternarialeaf-spot anthracnose bacterial-blight bacterial-pustule brown-spot brown-stem-rot charcoal-rot cyst-nematode diaporthe-pod-&-stem-blight diaporthe-stem-canker downy-mildew frog-eye-leaf-spot herbicide-injury phyllosticta-leaf-spot phytophthora-rot powdery-mildew purple-seed-stain rhizoctonia-root-rot
hail 16 0 0 0 0 0 0 0 14 15 0 0 0 8 0 68 0 0 0
sever 16 0 0 0 0 0 0 0 14 15 0 0 0 8 0 68 0 0 0
seed.tmt 16 0 0 0 0 0 0 0 14 15 0 0 0 8 0 68 0 0 0
lodging 16 0 0 0 0 0 0 0 14 15 0 0 0 8 0 68 0 0 0
germ 16 0 0 0 0 0 0 0 14 6 0 0 0 8 0 68 0 0 0
leaf.mild 16 0 0 0 0 0 0 0 14 15 0 0 0 8 0 55 0 0 0
fruiting.bodies 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 68 0 0 0
fruit.spots 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 68 0 0 0
seed.discolor 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 68 0 0 0
shriveling 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 68 0 0 0
leaf.shread 16 0 0 0 0 0 0 0 14 15 0 0 0 0 0 55 0 0 0
seed 16 0 0 0 0 0 0 0 0 0 0 0 0 8 0 68 0 0 0
mold.growth 16 0 0 0 0 0 0 0 0 0 0 0 0 8 0 68 0 0 0
seed.size 16 0 0 0 0 0 0 0 0 0 0 0 0 8 0 68 0 0 0
leaf.halo 0 0 0 0 0 0 0 0 14 15 0 0 0 0 0 55 0 0 0
leaf.marg 0 0 0 0 0 0 0 0 14 15 0 0 0 0 0 55 0 0 0
leaf.size 0 0 0 0 0 0 0 0 14 15 0 0 0 0 0 55 0 0 0
leaf.malf 0 0 0 0 0 0 0 0 14 15 0 0 0 0 0 55 0 0 0
fruit.pods 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 68 0 0 0
precip 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 0 0 0 0
stem.cankers 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 0 0 0 0
canker.lesion 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 0 0 0 0
ext.decay 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 0 0 0 0
mycelium 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 0 0 0 0
int.discolor 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 0 0 0 0
sclerotia 16 0 0 0 0 0 0 0 14 0 0 0 0 8 0 0 0 0 0
plant.stand 16 0 0 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0
roots 16 0 0 0 0 0 0 0 0 15 0 0 0 0 0 0 0 0 0
temp 16 0 0 0 0 0 0 0 14 0 0 0 0 0 0 0 0 0 0
crop.hist 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
plant.growth 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stem 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
date 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
area.dam 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
leaves 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Here, the columns are the classes of the target variable, and the rows are the predictors. The numbers are the count of missing values for the predictors.

From this table, it seems that some predictors have same rows with missing values, and the same distribution of classes. Furthere, these predictors’ missing values are biased toward the class phytophthorarot. For example, for the predictor hail, out of the 121 missing values, 68 (56%) of them are phytophthorarot. This indicates “informative missingness”, which can induce significant bias in the model.

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Based on the table above, I will eliminate the rows with missing values that have high bias toward the class phytophthorarot. This will remove roughly 68 rows.

# Mark the rows that has missing values and has the class being "phytophthora-rot"
eliminate <- (!complete.cases(Soybean)) & ifelse(Soybean$Class=='phytophthora-rot', 1, 0)

# Eliminate those rows
Soybean.a <- Soybean[!eliminate,]

paste('Eliminated', sum(eliminate), 'rows.')
## [1] "Eliminated 68 rows."
paste(dim(Soybean.a)[1], 'rows remaining.')
## [1] "615 rows remaining."
paste(sum(!complete.cases(Soybean.a)), 'rows still contain missing values.')
## [1] "53 rows still contain missing values."

I cannot use KNN to impute since all features are categorical, and caret forbids KNN imputation with factor features. Since the rows with missing values are just 53/615 = 8.6% of the data, I opt to fill the missing values with the mode of the feature; i.e. uthe factor with the highest frequency. Below, a function is created to go through the features to fill the missing values with the most frequent factor.

fill_na <- function(df){
  
  for (i in 2:dim(df)[2]){
    paste('Filling', sum(is.na(df[,i])), 'missing values for feature: ', names(df)[i], '.') %>% print()
    find.mode <- df[,i] %>% table() %>% sort(decreasing = T) %>% prop.table() %>% round(4)
    mode.name <- find.mode %>%  names() %>% .[1]
    paste('The most frequent factor of this feature is:', mode.name, ', which is', find.mode[mode.name]*100, '% of the class.') %>% print()
    df[is.na(df[,i]), i] <- mode.name
    paste('------------------------------------------------') %>% print()
  }
  return(df)
}

Soybean.b <- fill_na(Soybean.a)
## [1] "Filling 1 missing values for feature:  date ."
## [1] "The most frequent factor of this feature is: 5 , which is 24.27 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 36 missing values for feature:  plant.stand ."
## [1] "The most frequent factor of this feature is: 0 , which is 61.14 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  precip ."
## [1] "The most frequent factor of this feature is: 2 , which is 72.27 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 30 missing values for feature:  temp ."
## [1] "The most frequent factor of this feature is: 1 , which is 57.09 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  hail ."
## [1] "The most frequent factor of this feature is: 0 , which is 77.4 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 16 missing values for feature:  crop.hist ."
## [1] "The most frequent factor of this feature is: 3 , which is 32.39 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 1 missing values for feature:  area.dam ."
## [1] "The most frequent factor of this feature is: 3 , which is 30.46 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  sever ."
## [1] "The most frequent factor of this feature is: 1 , which is 57.3 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  seed.tmt ."
## [1] "The most frequent factor of this feature is: 0 , which is 54.27 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 44 missing values for feature:  germ ."
## [1] "The most frequent factor of this feature is: 1 , which is 37.3 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 16 missing values for feature:  plant.growth ."
## [1] "The most frequent factor of this feature is: 0 , which is 73.62 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 0 missing values for feature:  leaves ."
## [1] "The most frequent factor of this feature is: 1 , which is 87.48 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 29 missing values for feature:  leaf.halo ."
## [1] "The most frequent factor of this feature is: 2 , which is 58.36 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 29 missing values for feature:  leaf.marg ."
## [1] "The most frequent factor of this feature is: 0 , which is 60.92 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 29 missing values for feature:  leaf.size ."
## [1] "The most frequent factor of this feature is: 1 , which is 55.8 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 45 missing values for feature:  leaf.shread ."
## [1] "The most frequent factor of this feature is: 0 , which is 83.16 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 29 missing values for feature:  leaf.malf ."
## [1] "The most frequent factor of this feature is: 0 , which is 92.32 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  leaf.mild ."
## [1] "The most frequent factor of this feature is: 0 , which is 92.88 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 16 missing values for feature:  stem ."
## [1] "The most frequent factor of this feature is: 1 , which is 50.58 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  lodging ."
## [1] "The most frequent factor of this feature is: 0 , which is 92.53 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  stem.cankers ."
## [1] "The most frequent factor of this feature is: 0 , which is 64.64 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  canker.lesion ."
## [1] "The most frequent factor of this feature is: 0 , which is 55.46 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  fruiting.bodies ."
## [1] "The most frequent factor of this feature is: 0 , which is 81.98 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  ext.decay ."
## [1] "The most frequent factor of this feature is: 0 , which is 76.6 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  mycelium ."
## [1] "The most frequent factor of this feature is: 0 , which is 98.96 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  int.discolor ."
## [1] "The most frequent factor of this feature is: 0 , which is 88.91 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  sclerotia ."
## [1] "The most frequent factor of this feature is: 0 , which is 96.53 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 16 missing values for feature:  fruit.pods ."
## [1] "The most frequent factor of this feature is: 0 , which is 67.95 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  fruit.spots ."
## [1] "The most frequent factor of this feature is: 0 , which is 59.79 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 24 missing values for feature:  seed ."
## [1] "The most frequent factor of this feature is: 0 , which is 80.54 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 24 missing values for feature:  mold.growth ."
## [1] "The most frequent factor of this feature is: 0 , which is 88.66 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  seed.discolor ."
## [1] "The most frequent factor of this feature is: 0 , which is 88.91 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 24 missing values for feature:  seed.size ."
## [1] "The most frequent factor of this feature is: 0 , which is 90.02 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  shriveling ."
## [1] "The most frequent factor of this feature is: 0 , which is 93.41 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 31 missing values for feature:  roots ."
## [1] "The most frequent factor of this feature is: 0 , which is 94.35 % of the class."
## [1] "------------------------------------------------"

Now all missing values are filled.

paste('There are now', dim(Soybean.b)[1], 'rows.', sum(!complete.cases(Soybean.b)), 'rows have missing values.')
## [1] "There are now 615 rows. 0 rows have missing values."