Chapter 3, Applied Predictive Modeling

problems 3.1 & 3.2

3.1)

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
    a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First, let's examine the overall distribution of the glass type variable.

library(ggplot2)

ggplot(Glass, aes(x = Type)) +
  geom_bar(fill = "deepskyblue3", colour = "black") +
  ggtitle("Distribution by Type of Glass") +
  xlab("Glass Type") +
  ylab("Count") +
  coord_flip() +
  labs(caption = "UCI Machine Learning Repository")
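
A quick frequency table confirms the counts per glass type:

table(Glass$Type)

## 
##  1  2  3  5  6  7 
## 70 76 17 13  9 29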

We can see that there are more cases of glass types 1 and 2 than of any other type in the data set. How about the distributions of the predictors?

library(DataExplorer)

Glass2<-subset(Glass, select=c(-Type))

plot_histogram(Glass2)

Na, Al, and Si have fairly close-to-normal distributions. Mg is left-skewed, with possible outliers, while RI, K, Ca, Ba, and Fe are right-skewed.
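
To attach numbers to these visual impressions, we can compute the sample skewness of each predictor. A minimal sketch, assuming the e1071 package is installed (it is also a dependency of caret):

library(e1071)

# positive values indicate right skew, negative values left skew
sapply(Glass2, skewness)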

Are we missing data?

plot_missing(Glass)

The plot confirms that the Glass data contain no missing values. How do we visualize the relationships between the predictors?

# correlation matrix, rounded for display
correlation_matrix <- round(cor(Glass2), 2)

# keep only the upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
  correlation_matrix[lower.tri(correlation_matrix)] <- NA
  return(correlation_matrix)
}

upper_tri <- get_upper_tri(correlation_matrix)



library(reshape2)

# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


#add nice labels 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))

rm(Glass2)

The correlation heatmap does not indicate many instances where variables are highly correlated with each other. The main exception is Ca, which is highly correlated with RI (coefficient 0.81), followed by a mild correlation between Ba and Al. The strongest negative relationship is between Si and RI.
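
As a programmatic check on the heatmap reading, caret's findCorrelation flags predictors whose pairwise correlations exceed a cutoff. A minimal sketch using the correlation matrix computed above (the 0.75 cutoff is an illustrative choice):

library(caret)

# names of predictors recommended for removal at |r| > 0.75
findCorrelation(correlation_matrix, cutoff = 0.75, names = TRUE)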

    b) Do there appear to be any outliers in the data? Are any predictors skewed?

Recall the following plot from part a.

Glass2<-subset(Glass, select=c(-Type))

plot_histogram(Glass2)

We will restate our findings from part a. The following predictors have close-to-normal distributions:

  • Na

  • Al

  • Si

The following predictors are right-skewed:

  • Ba

  • Ca

  • K

  • RI

  • Fe

The following predictor is left-skewed:

  • Mg

The following predictors appear to have outliers:

  • Ba

  • Fe

  • K

  • Mg

We can certainly drill down more into whether outliers are actually present in the data. Simple visual tests are a good first step; see the boxplots sketched below.
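
A minimal base-R sketch, drawing one boxplot per predictor; points beyond the whiskers are the usual 1.5 x IQR outlier candidates:

par(mfrow = c(3, 3))
for (col in names(Glass)[1:9]) {
  boxplot(Glass[[col]], main = col, horizontal = TRUE)
}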

We can then use a handy script to identify outliers more formally. The original code is found here: https://www.r-bloggers.com/identify-describe-plot-and-remove-the-outliers-from-the-dataset/

outlierKD <- function(dt, var) {
     var_name <- eval(substitute(var), eval(dt))
     na1 <- sum(is.na(var_name))
     m1 <- mean(var_name, na.rm = TRUE)
     par(mfrow=c(2, 2), oma=c(0,0,3,0))
     boxplot(var_name, main="With outliers")
     hist(var_name, main="With outliers", xlab=NA, ylab=NA)
     # outliers by the boxplot (1.5 * IQR) rule
     outlier <- boxplot.stats(var_name)$out
     mo <- mean(outlier)
     var_name <- ifelse(var_name %in% outlier, NA, var_name)
     boxplot(var_name, main="Without outliers")
     hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
     title("Outlier Check", outer=TRUE)
     na2 <- sum(is.na(var_name))
     cat("Outliers identified:", na2 - na1, "\n")
     cat("Proportion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name))*100, 1), "\n")
     cat("Mean of the outliers:", round(mo, 2), "\n")
     m2 <- mean(var_name, na.rm = TRUE)
     cat("Mean without removing outliers:", round(m1, 2), "\n")
     cat("Mean if we remove outliers:", round(m2, 2), "\n")
     response <- readline(prompt="Do you want to remove outliers and to replace with NA? [yes/no]: ")
     if(response == "y" | response == "yes"){
          dt[as.character(substitute(var))] <- invisible(var_name)
          assign(as.character(as.list(match.call())$dt), dt, envir = .GlobalEnv)
          cat("Outliers successfully removed", "\n")
          return(invisible(dt))
     } else{
          cat("Nothing changed", "\n")
          return(invisible(var_name))
     }
}

outlierKD(Glass, RI)

## Outliers identified: 17 
## Proportion (%) of outliers: 8.6 
## Mean of the outliers: 1.52 
## Mean without removing outliers: 1.52 
## Mean if we remove outliers: 1.52 
## Do you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed

outlierKD(Glass, Na)

## Outliers identified: 7 
## Proportion (%) of outliers: 3.4 
## Mean of the outliers: 12.66 
## Mean without removing outliers: 13.41 
## Mean if we remove outliers: 13.43 
## Do you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed

outlierKD(Glass, Mg)

## Outliers identified: 0 
## Proportion (%) of outliers: 0 
## Mean of the outliers: NaN 
## Mean without removing outliers: 2.68 
## Mean if we remove outliers: 2.68 
## Do you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed

outlierKD(Glass, Al)

## Outliers identified: 18 
## Proportion (%) of outliers: 9.2 
## Mean of the outliers: 2.09 
## Mean without removing outliers: 1.44 
## Mean if we remove outliers: 1.39 
## Do you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed

outlierKD(Glass, K)

## Outliers identified: 7 
## Proportion (%) of outliers: 3.4 
## Mean of the outliers: 3.06 
## Mean without removing outliers: 0.5 
## Mean if we remove outliers: 0.41 
## Do you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed

outlierKD(Glass, Ca)

## Outliers identified: 26 
## Proportion (%) of outliers: 13.8 
## Mean of the outliers: 11.17 
## Mean without removing outliers: 8.96 
## Mean if we remove outliers: 8.65 
## Do you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed

outlierKD(Glass, Ba)

## Outliers identified: 38 
## Proportion (%) of outliers: 21.6 
## Mean of the outliers: 0.99 
## Mean without removing outliers: 0.18 
## Mean if we remove outliers: 0 
## Do you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed

outlierKD(Glass, Fe)

## Outliers identified: 12 
## Proportion (%) of outliers: 5.9 
## Mean of the outliers: 0.32 
## Mean without removing outliers: 0.06 
## Mean if we remove outliers: 0.04 
## Do you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed

The function reveals that there are outliers in almost every variable, with the exception of Mg.

    c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Since our predictors are all numeric, we can apply min-max normalization, rescaling each predictor to the interval [0, 1]. Putting the predictors on a common scale stabilizes their influence, which helps many classification models.

# min-max scale each numeric predictor to [0, 1]
glass_norm <- as.data.frame(apply(Glass[,1:9], 2,
                                  function(x) (x - min(x)) / (max(x) - min(x))))

glass_norm$Type<-Glass$Type

#head(glass_norm)

plot_histogram(glass_norm)
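
Equivalently, caret's preProcess can perform the same min-max rescaling with method = "range". A quick sketch (object names are illustrative):

library(caret)

# rescale every predictor to [0, 1], mirroring the manual version above
range_params <- preProcess(Glass[, 1:9], method = "range")
glass_norm2 <- predict(range_params, Glass[, 1:9])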

We can also try a Box-Cox transformation on the predictors, following this tutorial: https://setscholars.net/2019/05/25/how-to-preprocess-data-in-r-using-box-cox-transformation/

# load libraries
library(mlbench)
library(caret)

preprocessParams <- preProcess(Glass[,1:9], method=c("BoxCox"))

# summarize transform parameters
print(preprocessParams)
## Created from 214 samples and 5 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - ignored (0)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
# transform the dataset using the parameters
transformed <- predict(preprocessParams, Glass[,1:9])

# summarize the transformed dataset
#summary(transformed);

plot_histogram(transformed)

Summary Before Box-Cox

summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

Summary After Box-Cox

summary(transformed)
##        RI               Na              Mg              Al         
##  Min.   :0.2810   Min.   :2.373   Min.   :0.000   Min.   :-0.9230  
##  1st Qu.:0.2826   1st Qu.:2.558   1st Qu.:2.115   1st Qu.: 0.1817  
##  Median :0.2829   Median :2.588   Median :3.480   Median : 0.3324  
##  Mean   :0.2831   Mean   :2.594   Mean   :2.685   Mean   : 0.3685  
##  3rd Qu.:0.2833   3rd Qu.:2.626   3rd Qu.:3.600   3rd Qu.: 0.5534  
##  Max.   :0.2875   Max.   :2.855   Max.   :4.490   Max.   : 1.7417  
##        Si             K                Ca               Ba       
##  Min.   :2436   Min.   :0.0000   Min.   :0.7677   Min.   :0.000  
##  1st Qu.:2612   1st Qu.:0.1225   1st Qu.:0.8197   1st Qu.:0.000  
##  Median :2649   Median :0.5550   Median :0.8238   Median :0.000  
##  Mean   :2639   Mean   :0.4971   Mean   :0.8256   Mean   :0.175  
##  3rd Qu.:2670   3rd Qu.:0.6100   3rd Qu.:0.8297   3rd Qu.:0.000  
##  Max.   :2843   Max.   :6.2100   Max.   :0.8666   Max.   :3.150  
##        Fe         
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.05701  
##  3rd Qu.:0.10000  
##  Max.   :0.51000

We can see that the transformation changed RI, Na, Al, Si, and Ca but left the other predictors alone: Box-Cox requires strictly positive values, and Mg, K, Ba, and Fe all contain zeros, so they were skipped. The next step would be outlier treatment, but that is out of scope for this problem.
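
If we did want to transform the zero-containing predictors as well, one option is the Yeo-Johnson transformation, which caret's preProcess also supports and which is defined for zero and negative values. A minimal sketch (object names are illustrative):

# Yeo-Johnson handles zeros, unlike Box-Cox
yj_params <- preProcess(Glass[, 1:9], method = "YeoJohnson")
glass_yj <- predict(yj_params, Glass[, 1:9])

plot_histogram(glass_yj)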

3.2)

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data(Soybean)
#str(Soybean)
plot_missing(Soybean)

We can see that several variables contain missing entries. Perhaps we can deal with this later in the problem.

    a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

As we have seen in the literature, a distribution is degenerate when a predictor takes a single unique value (zero variance) or only a handful of values, some occurring at very low frequencies (near-zero variance).

Let's examine the class variable before looking at our predictors.

ggplot(Soybean, aes(x = Class)) +
  geom_bar(fill = "deepskyblue3", colour = "black") +
  ggtitle("Distribution by Type of Soybean") +
  xlab("Soybean Type") +
  ylab("Count") +
  coord_flip() +
  labs(caption = "UCI Machine Learning Repository")

We can see that certain classes, such as phytophthora-rot, appear much more frequently than others.
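
A sorted frequency table makes the imbalance explicit:

sort(table(Soybean$Class), decreasing = TRUE)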

Let's examine the predictors visually (plotting approach adapted from https://www.r-bloggers.com/smoothscatter-with-ggplot2/):

soy_predictors <- Soybean[,2:36]

par(mfrow = c(3, 6))

# plotting a factor with base R draws a bar chart of its level frequencies
for (i in 1:ncol(soy_predictors)) {
  plot(soy_predictors[, i], ylab = names(soy_predictors)[i])
}

library(caret)

nearZeroVar(soy_predictors, names = TRUE, saveMetrics=T)
##                  freqRatio percentUnique zeroVar   nzv
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE

Our tests reveal that we do not have any zero-variance predictors (no variable takes a single unique value). We do, however, see several variables with rare values, such as leaf.mild, mycelium, and sclerotia, which are flagged as near-zero variance. These are degenerate variables by definition.
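
If we wanted to act on this, nearZeroVar can return the offending column indices directly so we can drop them. A short sketch (object names are illustrative):

library(caret)

# indices of the near-zero-variance predictors flagged above
nzv_cols <- nearZeroVar(soy_predictors)
colnames(soy_predictors)[nzv_cols]  # leaf.mild, mycelium, sclerotia
soy_filtered <- soy_predictors[, -nzv_cols]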

    b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Let's recall our missing-data plot before investigating the relation to the classes:

plot_missing(Soybean)

As a whole, we can see that four predictors (hail, sever, seed.tmt, and lodging) share the highest proportion of missing data, and no predictor exceeds 20% missing.

library(tidyverse)

Soybean %>%
mutate(Total = n()) %>% 
filter(!complete.cases(.)) %>%
group_by(Class) %>%
mutate(Missing = n(), Proportion=Missing/Total) %>%
select(Class, Missing, Proportion) %>%
unique()
## # A tibble: 5 x 3
## # Groups:   Class [5]
##   Class                       Missing Proportion
##   <fct>                         <int>      <dbl>
## 1 phytophthora-rot                 68     0.0996
## 2 diaporthe-pod-&-stem-blight      15     0.0220
## 3 cyst-nematode                    14     0.0205
## 4 2-4-d-injury                     16     0.0234
## 5 herbicide-injury                  8     0.0117

If we partition the data by class and drill down into the percent missing per class, we can identify which class is missing the most data: phytophthora-rot accounts for the most missing values. We could also have done these calculations using the sqldf package.
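
To see which predictors drive the missingness within each class, here is a dplyr sketch that computes the share of NAs per predictor within each class and keeps only the classes that have any (assumes dplyr >= 1.0 for across() and if_any()):

library(dplyr)

Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  filter(if_any(-Class, ~ .x > 0))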

    c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

We can work on this problem and see how things change according to the approach we decide to take.

  • Method of imputation using the mice package

https://datascienceplus.com/imputing-missing-data-with-r-mice-package/

After reading the method definitions, it makes the most sense to use predictive mean matching (method = "pmm"): https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

library(mice)

# impute with predictive mean matching; fix the seed for reproducibility
the_soy <- mice(Soybean, method = "pmm", printFlag = FALSE, seed = 112)

# extract the completed (imputed) data set
the_soy <- complete(the_soy)

the_soy <- as.data.frame(the_soy)

plot_missing(the_soy)

We can see that we no longer have missing data, however it is vital that we check the new distributions of our data.

soy_predictors2 <- the_soy[,2:36]

par(mfrow = c(3, 6))

# re-plot each predictor's level frequencies after imputation
for (i in 1:ncol(soy_predictors2)) {
  plot(soy_predictors2[, i], ylab = names(soy_predictors2)[i])
}
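
For completeness, the elimination route mentioned in the prompt is also easy to sketch: drop the heavily missing predictors, or keep only complete cases (at the cost of losing every row from the five classes listed in part b). The 15% cutoff below is an illustrative assumption:

# drop predictors with more than 15% missing values (illustrative cutoff)
miss_rate <- colMeans(is.na(Soybean))
soy_reduced <- Soybean[, miss_rate <= 0.15]

# or keep only fully observed rows (drops the 121 incomplete cases)
soy_complete <- Soybean[complete.cases(Soybean), ]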