A. Loading Packages
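The loading chunk itself is not shown in the knitted output; judging from the startup messages below and the functions used later, it was presumably something close to this sketch:
library(mlbench)        # provides the Glass and Soybean data sets
library(ggplot2)
library(caret)          # attaches lattice
library(randomForest)
library(moments)        # provides skewness()
library(missForest)     # attaches foreach, itertools, iterators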
## Loading required package: lattice
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
B. Visualization Schemes and Functions
library(viridis)
## Warning: package 'viridis' was built under R version 3.5.3
## Loading required package: viridisLite
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 3.5.3
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see http://bit.ly/arialnarrow
library(RColorBrewer)
palette <- c('#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#000000', '#46f0f0', '#f032e6', '#bcf60c')
# Histogram for binned data: distribution of predictor i across all samples
draw_hist <- function(df, i) {
  predictors <- colnames(df)
  var <- df[, i]
  ggplot(df, aes(x = var)) +
    geom_histogram(bins = 12, alpha = 0.5, color = "black", fill = palette[i]) +
    ggtitle(paste("Glass Composition by:", predictors[i]))
}
# Violin plot for continuous data: distribution of predictor i by glass Type
draw_violinplot <- function(df, i) {
  predictors <- colnames(df)
  var <- df[, i]
  ggplot(df, aes(x = Type, y = var)) +
    geom_violin(alpha = 0.5, color = "gray") +
    geom_jitter(alpha = 0.5, size = 4, color = palette[i],
                position = position_jitter(width = 0.1), show.legend = FALSE) +
    ggtitle(paste("Glass Composition by:", predictors[i])) +
    coord_flip()
}
# Bar chart for categorical data: frequency of each level of predictor i by Class
draw_barchart <- function(df, i) {
  predictors <- colnames(df)
  var <- df[, i]
  tab <- as.data.frame(table(df$Class, var))
  ggplot(tab, aes(fill = var, x = Var1, y = Freq)) +
    geom_bar(position = "stack", stat = "identity") +
    scale_fill_viridis(discrete = TRUE) +
    ggtitle(paste("Frequency of disease by:", predictors[i])) +
    theme_ipsum() +
    xlab("") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples, each labeled as one of seven class categories (only six of which actually occur in the samples). There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
From the summary, we observe no missing values. The large gap between the maximum and the third quartile for Al, K, Ca, Ba, and Fe suggests the presence of outliers. Since negative values are impossible for these measurements, a normal distribution approximation is likely inappropriate.
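A quick check (not in the original output) confirms the absence of missing values:
sum(is.na(Glass))  # expected: 0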
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Below we construct histograms of each predictor (pooled across all glass types) as a side view, as well as violin plots that portray the distribution of element quantities by type as a segmented overhead view.
# Note: par(mfrow) has no effect on ggplot2 (grid-based) graphics,
# so the plots below print sequentially rather than in a grid.
loop.vector <- 1:(ncol(Glass) - 1)  # the nine predictor columns
for (i in loop.vector) {
  print(draw_hist(Glass, i))
  print(draw_violinplot(Glass, i))
}
This scatterplot matrix gives correlation coefficients in the upper panel and, in the lower panel, scatterplots with locally weighted regression (loess) lines in red.
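The chunk that generates this matrix is not shown; a minimal sketch that would reproduce it uses pairs() with a custom upper panel (the panel.cor helper below is illustrative, not from the original):
# Print the correlation coefficient in the upper panel
panel.cor <- function(x, y, digits = 2, cex.cor = 1.5, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.5, format(cor(x, y), digits = digits), cex = cex.cor)
}
pairs(Glass[, 1:9],
      upper.panel = panel.cor,     # correlation coefficients
      lower.panel = panel.smooth)  # scatterplots with red lowess lines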
From the matrix, we note the highest correlation is between Ca and RI, at +0.81. The largest negative correlation is between Si and RI, at -0.54.
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
For this exercise, I define outliers as observations whose z-score exceeds 4 in absolute value. I detect them by filtering for rows with a scaled value greater than 4 or less than -4 (i.e., more than four standard deviations from the mean).
# Row-filtering approach adapted from
# https://stackoverflow.com/questions/9856632/subset-rows-with-all-any-columns-larger-than-a-specific-value
Glass_scaled <- scale(Glass[1:9])
Glass_scaled[apply(Glass_scaled, MARGIN = 1, function(x) any(abs(x) > 4)), ]
## RI Na Mg Al Si K
## 107 4.24272557 -3.2792540 -1.861146776 1.3121035 -3.6678717 0.1271772
## 108 5.12521495 -1.3566564 -1.861146776 -0.8911147 -3.2159939 -0.5781368
## 111 2.72471212 -2.6669618 -1.861146776 -1.3517877 0.7217978 -0.7621317
## 112 2.97167743 -2.9241245 -1.861146776 -1.3918462 0.5539575 -0.7621317
## 164 -1.06208933 0.7373829 -0.003142461 4.1161995 -3.5645853 1.8137975
## 172 -1.71407775 -0.4749556 -1.861146776 3.1948537 -2.8028486 8.7596065
## 173 -1.69761339 -0.4994473 -1.861146776 3.1547952 -2.5188111 8.7596065
## 175 0.72923240 -0.6831350 -0.744957617 1.4523083 -0.6080139 0.4031696
## 185 -2.37594478 4.8642325 -1.861146776 -2.2130457 3.5621721 -0.7621317
## 208 -0.01824927 1.2027250 -1.861146776 0.7512843 0.2699200 1.3998089
## Ca Ba Fe
## 107 3.0516999 5.9831819 2.2885225
## 108 5.0824015 -0.3520514 1.8780079
## 111 4.0213775 -0.3520514 -0.5850791
## 112 4.2181237 -0.3520514 -0.5850791
## 164 -2.1691003 4.0725560 -0.5850791
## 172 -1.4031955 -0.3520514 -0.5850791
## 173 -1.4242755 -0.3520514 -0.5850791
## 175 0.5221063 0.1306331 4.6489809
## 185 -1.6210217 -0.3520514 -0.5850791
## 208 -1.7475013 5.4401619 -0.5850791
The flagged variables for each outlying row are:
- Outlier 1 (row 107): RI & Ba
- Outlier 2 (row 108): RI & Ca
- Outlier 3 (row 111): Ca
- Outlier 4 (row 112): Ca
- Outlier 5 (row 164): Al & Ba
- Outlier 6 (row 172): K
- Outlier 7 (row 173): K
- Outlier 8 (row 175): Fe
- Outlier 9 (row 185): Na
- Outlier 10 (row 208): Ba
From the visualizations above, we note the following distributional shapes:
- Rightward (positive) skew: RI, Na, Al, Ca, K (with zeroes removed), Ba (with zeroes removed), Fe (with zeroes removed)
- Leftward (negative) skew: Mg (with zeroes removed), Si
This can be confirmed using the skewness function from the moments package:
skewness(Glass[, 1:9])
## RI Na Mg Al Si K
## 1.6140150 0.4509917 -1.1444648 0.9009179 -0.7253173 6.5056358
## Ca Ba Fe
## 2.0326774 3.3924309 1.7420068
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
In constructing a multinomial logistic regression model (the response, Type, has more than two classes), the author would perform the following transformations:
First, we separate the predictors into two categories: (a) those elements (plus the refractive index, RI) that are present in all glass samples, and (b) those elements that are absent (quantity = 0) from some samples (i.e., Mg, K, Ba, Fe).
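A quick check (not shown in the original) confirms which predictors fall into category (b):
# Which predictors contain exact zeros? (expected: Mg, K, Ba, Fe)
names(which(sapply(Glass[, 1:9], function(x) any(x == 0))))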
Since zeroes make up the majority of the observations for Ba and Fe, we would drop the numerical values for these two, relying solely on the categorical has/has-not dummy variables. We would do the same for K, since any regression coefficient for the quantitative variable would likely be distorted by the extreme outliers. Only for Mg would we keep both the quantitative and the dummy variable.
Glass_model <- Glass
# Create has/has-not dummy variables for the elements with exact zeroes
Glass_model$has_Mg <- ifelse(Glass_model$Mg != 0, 1, 0)
Glass_model$has_K  <- ifelse(Glass_model$K  != 0, 1, 0)
Glass_model$has_Ba <- ifelse(Glass_model$Ba != 0, 1, 0)
Glass_model$has_Fe <- ifelse(Glass_model$Fe != 0, 1, 0)
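The chunk above creates the dummies but leaves the described column removal implicit; completing it would look like this sketch:
# Drop the quantitative K, Ba, and Fe columns, keeping only their dummies
# (Mg keeps both its quantitative value and its dummy)
Glass_model <- Glass_model[, !(names(Glass_model) %in% c("K", "Ba", "Fe"))]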
The soybean data can also be found at the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html). Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
dim(Soybean)
## [1] 683 36
A data frame with 683 observations on 36 variables: 35 categorical attributes (coded numerically) and a nominal variable denoting the class.
[,1] Class: the 19 classes
[,2] date: apr(0), may(1), june(2), july(3), aug(4), sept(5), oct(6)
[,3] plant.stand: normal(0), lt-normal(1)
[,4] precip: lt-norm(0), norm(1), gt-norm(2)
[,5] temp: lt-norm(0), norm(1), gt-norm(2)
[,6] hail: yes(0), no(1)
[,7] crop.hist: dif-lst-yr(0), s-l-y(1), s-l-2-y(2), s-l-7-y(3)
[,8] area.dam: scatter(0), low-area(1), upper-ar(2), whole-field(3)
[,9] sever: minor(0), pot-severe(1), severe(2)
[,10] seed.tmt: none(0), fungicide(1), other(2)
[,11] germ: 90-100%(0), 80-89%(1), lt-80%(2)
[,12] plant.growth: norm(0), abnorm(1)
[,13] leaves: norm(0), abnorm(1)
[,14] leaf.halo: absent(0), yellow-halos(1), no-yellow-halos(2)
[,15] leaf.marg: w-s-marg(0), no-w-s-marg(1), dna(2)
[,16] leaf.size: lt-1/8(0), gt-1/8(1), dna(2)
[,17] leaf.shread: absent(0), present(1)
[,18] leaf.malf: absent(0), present(1)
[,19] leaf.mild: absent(0), upper-surf(1), lower-surf(2)
[,20] stem: norm(0), abnorm(1)
[,21] lodging: yes(0), no(1)
[,22] stem.cankers: absent(0), below-soil(1), above-s(2), ab-sec-nde(3)
[,23] canker.lesion: dna(0), brown(1), dk-brown-blk(2), tan(3)
[,24] fruiting.bodies: absent(0), present(1)
[,25] ext.decay: absent(0), firm-and-dry(1), watery(2)
[,26] mycelium: absent(0), present(1)
[,27] int.discolor: none(0), brown(1), black(2)
[,28] sclerotia: absent(0), present(1)
[,29] fruit.pods: norm(0), diseased(1), few-present(2), dna(3)
[,30] fruit.spots: absent(0), col(1), br-w/blk-speck(2), distort(3), dna(4)
[,31] seed: norm(0), abnorm(1)
[,32] mold.growth: absent(0), present(1)
[,33] seed.discolor: absent(0), present(1)
[,34] seed.size: norm(0), lt-norm(1)
[,35] shriveling: absent(0), present(1)
[,36] roots: norm(0), rotted(1), galls-cysts(2)
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Below we plot a stacked bar chart for each feature. For some classes, there is a complete lack of data for certain features.
# As above, par(mfrow) has no effect on ggplot2 graphics; the charts print sequentially.
loop.vector <- 2:ncol(Soybean)  # every predictor column (column 1 is Class)
for (i in loop.vector) {
  print(draw_barchart(Soybean, i))
}
(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
We can see below that all of the missing values are concentrated in five disease classes in particular.
Soybean_nulls <- Soybean[rowSums(is.na(Soybean)) > 0,]
missing <- table(Soybean_nulls$Class)
as.list(missing[missing > 0])
## $`2-4-d-injury`
## [1] 16
##
## $`cyst-nematode`
## [1] 14
##
## $`diaporthe-pod-&-stem-blight`
## [1] 15
##
## $`herbicide-injury`
## [1] 8
##
## $`phytophthora-rot`
## [1] 68
In the case of these five diseases, complete information was not required, as certain features were definitive markers of the disease, for example:
- 2-4-d-injury: leaves == 1 (abnormal)
- cyst-nematode: roots == 2 (galls-cysts), seed == 1 (abnormal), etc.
The data collection process could have recorded values for all features; as it stands, the features left missing can be construed as secondary features for that particular disease classification.
Only in the “phytophthora-rot” class do a large number of features contain a mix of observed and missing values. There is partial data for hail, sever, seed.tmt, germ, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, fruiting.bodies, fruit.pods, fruit.spots, seed, mold.growth, seed.discolor, seed.size, and shriveling. For these missing values, some method of data imputation is required.
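An illustrative check (not in the original) reproduces this list of partially observed features:
pr <- Soybean[Soybean$Class == "phytophthora-rot", ]
na_counts <- colSums(is.na(pr))
# features observed for some, but not all, phytophthora-rot samples
names(na_counts)[na_counts > 0 & na_counts < nrow(pr)]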
As such, the author would propose a two-stage classification system, in which the first stage removes all columns with missing data and seeks to isolate the four diseases whose marker features are entirely missing; that is, the first stage uses only primary features as predictors. Inconclusive results are fed into the second stage, in which those four diseases no longer appear in the universe of possible predicted classes. Missing-value imputation is applied in this stage.
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
classes_withna <- c("2-4-d-injury", "cyst-nematode", "diaporthe-pod-&-stem-blight", "herbicide-injury")
Soybean_sub <- Soybean[!Soybean$Class %in% classes_withna,]
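As a quick sanity check (not in the original), we can measure how much missingness remains after dropping these four classes:
# fraction of NA cells; all remaining NAs come from the phytophthora-rot class
mean(is.na(Soybean_sub))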
Because of the high dimensionality of the data set and the prevalence of categorical features, the author would not apply k-nearest-neighbor, mean, or regression imputation. A preferred strategy is random forest imputation, which makes use of the predictors for which data are complete. For this, the author uses the missForest function from the package of the same name.
Soybean_sub.imputed <- missForest(Soybean_sub)$ximp
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
# Verify that no rows with missing values remain (should return zero rows)
Soybean_sub.imputed[rowSums(is.na(Soybean_sub.imputed)) > 0, ]
# Approach from https://stats.stackexchange.com/questions/226803/what-is-the-proper-way-to-use-rfimpute-imputation-by-random-forest-in-r
We can see that there are no longer any missing values in the “phytophthora-rot” class.
head(Soybean_sub.imputed[Soybean_sub.imputed$Class=="phytophthora-rot",])