Assignment 4

Data 624 - Predictive Analytics

Chapter 3

library(corrplot)

## corrplot 0.84 loaded

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(Amelia)

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2021 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

library(plotly)

## Loading required package: ggplot2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(DataExplorer)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(caret)

## Loading required package: lattice

library(summarytools)

## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp

## For best results, restart R session and update pander using devtools:: or remotes::install_github('rapporter/pander')

library (e1071)

3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

corr <- Glass %>% subset(select=-c(Type)) %>% cor(use='pairwise.complete.obs')
corrplot.mixed(corr, upper='square', lower.col = "black")

glass <- subset(Glass, select = -Type)
predictors <- colnames(glass)

par(mfrow = c(3, 3))
for(i in 1:9) {
  hist(glass[,i], main = predictors[i])
}

par(mfrow=c(3,3))
for(var in names(Glass)[-10]){
  boxplot(Glass[var], main=paste('Boxplot of', var), horizontal = T)
}

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

* From the boxplots, it appears that all of the predictors, except Mg, have outliers.

* From the histograms, the variables Ba, Mg, K, Ca, and Fe appear heavily skewed.

The skewness value can be calculated to confirm:

describe(Glass)

##       vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## RI       1 214  1.52 0.00   1.52    1.52 0.00  1.51  1.53  0.02  1.60     4.72
## Na       2 214 13.41 0.82  13.30   13.38 0.64 10.73 17.38  6.65  0.45     2.90
## Mg       3 214  2.68 1.44   3.48    2.87 0.30  0.00  4.49  4.49 -1.14    -0.45
## Al       4 214  1.44 0.50   1.36    1.41 0.31  0.29  3.50  3.21  0.89     1.94
## Si       5 214 72.65 0.77  72.79   72.71 0.57 69.81 75.41  5.60 -0.72     2.82
## K        6 214  0.50 0.65   0.56    0.43 0.17  0.00  6.21  6.21  6.46    52.87
## Ca       7 214  8.96 1.42   8.60    8.74 0.66  5.43 16.19 10.76  2.02     6.41
## Ba       8 214  0.18 0.50   0.00    0.03 0.00  0.00  3.15  3.15  3.37    12.08
## Fe       9 214  0.06 0.10   0.00    0.04 0.00  0.00  0.51  0.51  1.73     2.52
## Type*   10 214  2.54 1.71   2.00    2.31 1.48  1.00  6.00  5.00  1.04    -0.29
##         se
## RI    0.00
## Na    0.06
## Mg    0.10
## Al    0.03
## Si    0.05
## K     0.04
## Ca    0.10
## Ba    0.03
## Fe    0.01
## Type* 0.12

Sorting the predictors by skewness makes the ranking explicit:

Glass[-10] %>% apply(2, skewness) %>% sort(decreasing=T)

##          K         Ba         Ca         Fe         RI         Al         Na 
##  6.4600889  3.3686800  2.0184463  1.7298107  1.6027151  0.8946104  0.4478343 
##         Si         Mg 
## -0.7202392 -1.1364523
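
As a cross-check on the values above, the skewness statistic can be computed by hand. The sketch below assumes the `type = 3` definition that `e1071::skewness` uses by default (third central moment divided by the cubed sample standard deviation). `skew_b1` is a hypothetical helper name, and the vectors are toy data, not the Glass predictors.

```r
# Hypothetical helper: sample skewness, assuming the e1071 default (type = 3).
skew_b1 <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)  # third central moment
  m3 / sd(x)^3                 # sd() uses the n - 1 denominator
}

# Toy vectors for illustration only (not the Glass data).
symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 1, 1, 2, 10)

skew_b1(symmetric)   # zero for a symmetric sample
skew_b1(right_tail)  # positive for a long right tail
```

A positive value indicates a long right tail (as for K and Ba), and a negative value a long left tail (as for Mg).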

p <- describe(Glass[,1:9])

ggplot(p,aes(x = row.names(p),y=skew))+
  geom_bar(stat='identity') +
  ggtitle("Glass - Skewness")

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

trans <- preProcess(Glass[-10], method=c('BoxCox', 'center', 'scale'))
transformed <- predict(trans, Glass[-10])

par(mfrow=c(3,3))
for(var in names(transformed)){
  boxplot(transformed[var], main=paste('Boxplot of', var), horizontal = T)
}

transformed %>% apply(2, skewness) %>% sort(decreasing=T)

##           K          Ba          Fe          RI          Al          Na 
##  6.46008890  3.36867997  1.72981071  1.56566039  0.09105899  0.03384644 
##          Ca          Si          Mg 
## -0.19395573 -0.65090568 -1.13645228

* Centering and scaling bring each predictor to mean 0 and standard deviation 1.

* The Box-Cox transformation reduced the skewness of Ca, Al, and Na to near zero and changed RI and Si only slightly. The skewness of K, Ba, Fe, and Mg is unchanged because these predictors contain zeros, and Box-Cox applies only to strictly positive values, so they were left untransformed.

* For predictors with many zero values, such as Ba and Fe, the skewness may be driven by those zeros. It might be beneficial to add an engineered binary feature that flags whether the predictor is zero or non-zero, and to apply the Box-Cox transformation only to the non-zero values.

* After applying the transformation to the non-zero values only, the skewness of Ba and Fe is significantly reduced:

reduce_skew <- function(vec){
  trans <- vec[vec!=0] %>% BoxCoxTrans() %>% predict(vec[vec!=0])
  return(skewness(trans))
}

paste('The skewness of Ba is now:', reduce_skew(Glass$Ba))

## [1] "The skewness of Ba is now: -0.0544828448359268"

paste('The skewness of Fe is now:', reduce_skew(Glass$Fe))

## [1] "The skewness of Fe is now: 0.0729367099534234"
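
The idea of pairing a zero/non-zero indicator with a transformation of the non-zero values can be sketched as follows. This is a minimal illustration, not the exact procedure used above: `box_cox`, `ba_present`, and `ba_bc` are hypothetical names, the vector is toy data, and `lambda = 0` (the log case) is an assumed value that would normally be estimated, for example with `BoxCoxTrans()`.

```r
# Box-Cox transform for strictly positive values; lambda = 0 is the log case.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Toy vector standing in for a zero-inflated predictor such as Ba (illustrative values).
ba <- c(0, 0, 0, 0.09, 0.27, 0.64, 1.59, 0, 3.15, 0)

features <- data.frame(
  ba_present = as.integer(ba > 0),  # engineered zero / non-zero indicator
  ba_bc      = 0                    # placeholder; zero rows keep this value
)

# Transform only the non-zero values; lambda = 0 is an assumption for this sketch.
features$ba_bc[ba > 0] <- box_cox(ba[ba > 0], lambda = 0)
features
```

The indicator column lets a model tell a true zero apart from a transformed value that happens to equal 0.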

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

* A predictor with a degenerate distribution has a “near-zero variance” issue: it satisfies both of the following conditions:

- The fraction of unique values over the sample size is low (say 10%).

- The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

nearZeroVar(Soybean)

## [1] 19 26 28

paste('The degenerate variables are:', paste(names(Soybean[,nearZeroVar(Soybean)]), collapse = ', '))

## [1] "The degenerate variables are: leaf.mild, mycelium, sclerotia"

summary(Soybean[19])

##  leaf.mild 
##  0   :535  
##  1   : 20  
##  2   : 20  
##  NA's:108

For the leaf.mild variable, the fraction of unique values over the sample size is 3/683=0.4% < 10%, and the ratio of the frequency of the most prevalent value to that of the 2nd most prevalent value is 535/20=26.75 > 20.

summary(Soybean[26])

##  mycelium  
##  0   :639  
##  1   :  6  
##  NA's: 38

For the mycelium variable, the fraction of unique values over the sample size is 2/683=0.3% < 10%, and the ratio of the frequency of the most prevalent value to that of the 2nd most prevalent value is 639/6=106.5 > 20.

summary(Soybean[28])

##  sclerotia 
##  0   :625  
##  1   : 20  
##  NA's: 38

For the sclerotia variable, the fraction of unique values over the sample size is 2/683=0.3% < 10%, and the ratio of the frequency of the most prevalent value to that of the 2nd most prevalent value is 625/20=31.25 > 20.
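
The two checks can be reproduced by hand. The sketch below rebuilds a vector with the leaf.mild counts reported above and computes both statistics; `nzv_stats` is a hypothetical helper, and its handling of missing values (dropped from the counts, kept in the sample size) is an assumption about how `nearZeroVar` behaves. The `nearZeroVar` defaults are `freqCut = 95/5` (19) and `uniqueCut = 10`.

```r
# Hypothetical helper: frequency ratio and percent of unique values for one predictor.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  c(
    freq_ratio = as.numeric(counts[1]) / as.numeric(counts[2]),
    pct_unique = 100 * length(counts) / length(x)
  )
}

# Vector rebuilt from the leaf.mild summary: 535 zeros, 20 ones, 20 twos, 108 NAs.
leaf_mild <- factor(c(rep("0", 535), rep("1", 20), rep("2", 20), rep(NA, 108)))

stats <- nzv_stats(leaf_mild)
stats

# Flag as near-zero variance when both conditions hold (caret defaults).
unname(stats["freq_ratio"] > 95 / 5 && stats["pct_unique"] < 10)
```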

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The counts of missing values in each variable are shown below:

nas <- Soybean[-1] %>% apply(2, is.na) %>% apply(2, sum, na.rm=T)
nas <- sort(nas, decreasing=T)
nas

##            hail           sever        seed.tmt         lodging            germ 
##             121             121             121             121             112 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##             108             106             106             106             106 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##             100              92              92              92              84 
##       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
##              84              84              84              84              38 
##    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
##              38              38              38              38              38 
##       sclerotia     plant.stand           roots            temp       crop.hist 
##              38              36              31              30              16 
##    plant.growth            stem            date        area.dam          leaves 
##              16              16               1               1               0

Below, a table is constructed to show the relationship of the missing data to the classes. The table is constructed as follows:

1. Select the predictor variable

2. Find the row indices where the predictor has missing values

3. Select these rows

4. Count the number of occurrences in each class of the target variable

5. Repeat steps 1-4 for each predictor variable

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

t_list <- list()
i <- 0
for (var in names(Soybean[-1])) {
  i <- i +1
  row_id <- which(is.na(Soybean[,var]))
  temp <- Soybean[row_id,'Class']
  t_list[[i]] <- as.matrix(table(temp))
}
df <- data.frame(do.call(cbind, t_list))
names(df) <- names(Soybean[-1])
df <- df[names(nas)]
df <- t(df)
kable(df) %>% 
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
 scroll_box(width='100%', height = "500px")

	2-4-d-injury	cyst-nematode	diaporthe-pod-&-stem-blight	herbicide-injury	phytophthora-rot
hail	16	14	15	8	68
sever	16	14	15	8	68
seed.tmt	16	14	15	8	68
lodging	16	14	15	8	68
germ	16	14	6	8	68
leaf.mild	16	14	15	8	55
fruiting.bodies	16	14	0	8	68
fruit.spots	16	14	0	8	68
seed.discolor	16	14	0	8	68
shriveling	16	14	0	8	68
leaf.shread	16	14	15	0	55
seed	16	0	0	8	68
mold.growth	16	0	0	8	68
seed.size	16	0	0	8	68
leaf.halo	0	14	15	0	55
leaf.marg	0	14	15	0	55
leaf.size	0	14	15	0	55
leaf.malf	0	14	15	0	55
fruit.pods	16	0	0	0	68
precip	16	14	0	8	0
stem.cankers	16	14	0	8	0
canker.lesion	16	14	0	8	0
ext.decay	16	14	0	8	0
mycelium	16	14	0	8	0
int.discolor	16	14	0	8	0
sclerotia	16	14	0	8	0
plant.stand	16	14	6	0	0
roots	16	0	15	0	0
temp	16	14	0	0	0
crop.hist	16	0	0	0	0
plant.growth	16	0	0	0	0
stem	16	0	0	0	0
date	1	0	0	0	0
area.dam	1	0	0	0	0
leaves	0	0	0	0	0

Here, the columns are the classes of the target variable, and the rows are the predictors. The numbers are the count of missing values for the predictors.

From this table, it appears that several predictors are missing in the same rows and so share the same distribution of classes. The missing values occur in only five of the classes and are concentrated in the class phytophthora-rot. For example, of the 121 missing values of the predictor hail, 68 (56%) belong to phytophthora-rot. This indicates “informative missingness”, which can induce significant bias in the model.
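
A compact way to summarize how missingness relates to the outcome is the share of incomplete rows per class. The sketch below uses a small toy data frame (hypothetical values, not the Soybean data) so that the result can be checked by eye; the same two lines could be applied to `Soybean` with `Class` as the grouping variable.

```r
# Toy data: a class label and two predictors with missing values (illustrative only).
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "b", "c")),
  x1    = c(1, NA, 2, 3, NA, 4),
  x2    = c(NA, NA, 1, 1, 1, 1)
)

# TRUE for rows with at least one missing predictor value.
incomplete <- !complete.cases(toy[-1])

# Proportion of incomplete rows within each class.
by_class <- tapply(incomplete, toy$Class, mean)
by_class
```

A class with a proportion near 1 is one where nearly every sample has missing predictors, which is the pattern that signals informative missingness.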

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Based on the table above, I will eliminate the rows with missing values that belong to the class phytophthora-rot, where the missingness is most concentrated. This removes 68 rows.

# Mark the rows that have missing values and belong to the class "phytophthora-rot"
eliminate <- !complete.cases(Soybean) & Soybean$Class == 'phytophthora-rot'

# Eliminate those rows
Soybean.a <- Soybean[!eliminate,]

paste('Eliminated', sum(eliminate), 'rows.')

## [1] "Eliminated 68 rows."

paste(sum(!complete.cases(Soybean.a)), 'rows still contain missing values.')

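## [1] "53 rows still contain missing values."

The remaining missing values are imputed with the most frequent level (the mode) of each predictor: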

fill_na <- function(df){
  for (i in 2:dim(df)[2]){
    paste('Filling', sum(is.na(df[,i])), 'missing values for feature: ', names(df)[i], '.') %>% print()
    find.mode <- df[,i] %>% table() %>% sort(decreasing = T) %>% prop.table() %>% round(4)
    mode.name <- find.mode %>%  names() %>% .[1]
    paste('The most frequent factor of this feature is:', mode.name, ', which is', find.mode[mode.name]*100, '% of the class.') %>% print()
    df[is.na(df[,i]), i] <- mode.name
    paste('------------------------------------------------') %>% print()
  }
  return(df)
}

Soybean.b <- fill_na(Soybean.a)

## [1] "Filling 1 missing values for feature:  date ."
## [1] "The most frequent factor of this feature is: 5 , which is 24.27 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 36 missing values for feature:  plant.stand ."
## [1] "The most frequent factor of this feature is: 0 , which is 61.14 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  precip ."
## [1] "The most frequent factor of this feature is: 2 , which is 72.27 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 30 missing values for feature:  temp ."
## [1] "The most frequent factor of this feature is: 1 , which is 57.09 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  hail ."
## [1] "The most frequent factor of this feature is: 0 , which is 77.4 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 16 missing values for feature:  crop.hist ."
## [1] "The most frequent factor of this feature is: 3 , which is 32.39 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 1 missing values for feature:  area.dam ."
## [1] "The most frequent factor of this feature is: 3 , which is 30.46 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  sever ."
## [1] "The most frequent factor of this feature is: 1 , which is 57.3 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  seed.tmt ."
## [1] "The most frequent factor of this feature is: 0 , which is 54.27 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 44 missing values for feature:  germ ."
## [1] "The most frequent factor of this feature is: 1 , which is 37.3 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 16 missing values for feature:  plant.growth ."
## [1] "The most frequent factor of this feature is: 0 , which is 73.62 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 0 missing values for feature:  leaves ."
## [1] "The most frequent factor of this feature is: 1 , which is 87.48 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 29 missing values for feature:  leaf.halo ."
## [1] "The most frequent factor of this feature is: 2 , which is 58.36 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 29 missing values for feature:  leaf.marg ."
## [1] "The most frequent factor of this feature is: 0 , which is 60.92 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 29 missing values for feature:  leaf.size ."
## [1] "The most frequent factor of this feature is: 1 , which is 55.8 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 45 missing values for feature:  leaf.shread ."
## [1] "The most frequent factor of this feature is: 0 , which is 83.16 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 29 missing values for feature:  leaf.malf ."
## [1] "The most frequent factor of this feature is: 0 , which is 92.32 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  leaf.mild ."
## [1] "The most frequent factor of this feature is: 0 , which is 92.88 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 16 missing values for feature:  stem ."
## [1] "The most frequent factor of this feature is: 1 , which is 50.58 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 53 missing values for feature:  lodging ."
## [1] "The most frequent factor of this feature is: 0 , which is 92.53 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  stem.cankers ."
## [1] "The most frequent factor of this feature is: 0 , which is 64.64 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  canker.lesion ."
## [1] "The most frequent factor of this feature is: 0 , which is 55.46 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  fruiting.bodies ."
## [1] "The most frequent factor of this feature is: 0 , which is 81.98 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  ext.decay ."
## [1] "The most frequent factor of this feature is: 0 , which is 76.6 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  mycelium ."
## [1] "The most frequent factor of this feature is: 0 , which is 98.96 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  int.discolor ."
## [1] "The most frequent factor of this feature is: 0 , which is 88.91 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  sclerotia ."
## [1] "The most frequent factor of this feature is: 0 , which is 96.53 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 16 missing values for feature:  fruit.pods ."
## [1] "The most frequent factor of this feature is: 0 , which is 67.95 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  fruit.spots ."
## [1] "The most frequent factor of this feature is: 0 , which is 59.79 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 24 missing values for feature:  seed ."
## [1] "The most frequent factor of this feature is: 0 , which is 80.54 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 24 missing values for feature:  mold.growth ."
## [1] "The most frequent factor of this feature is: 0 , which is 88.66 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  seed.discolor ."
## [1] "The most frequent factor of this feature is: 0 , which is 88.91 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 24 missing values for feature:  seed.size ."
## [1] "The most frequent factor of this feature is: 0 , which is 90.02 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 38 missing values for feature:  shriveling ."
## [1] "The most frequent factor of this feature is: 0 , which is 93.41 % of the class."
## [1] "------------------------------------------------"
## [1] "Filling 31 missing values for feature:  roots ."
## [1] "The most frequent factor of this feature is: 0 , which is 94.35 % of the class."
## [1] "------------------------------------------------"

Now all missing values are filled.

paste('There are now', dim(Soybean.b)[1], 'rows.', sum(!complete.cases(Soybean.b)), 'rows have missing values.')

## [1] "There are now 615 rows. 0 rows have missing values."
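
For reference, the same mode imputation can be written more compactly without the progress messages. This is a minimal sketch on a toy data frame; `impute_mode` is a hypothetical helper, the values are illustrative, and ties are resolved by taking the first level with the highest count.

```r
# Hypothetical helper: replace NA in a factor with its most frequent level.
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() ignores NA by default
  x[is.na(x)] <- mode_level
  x
}

# Toy data frame with two factor predictors (illustrative values only).
toy <- data.frame(
  hail = factor(c("0", "0", "1", NA)),
  germ = factor(c("2", NA, "2", "1"))
)

# Apply the helper to every column while keeping the data frame structure.
toy[] <- lapply(toy, impute_mode)
toy
```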

Soybean.b %>%
  arrange(Class) %>%
  missmap(main = "Missing vs Observed")

Jatin Jain

3/6/2021
