Problem No. 3.1
Load the glass data
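The loading chunk isn’t echoed above; a minimal sketch, assuming the Glass data ships with the mlbench package and is lowercased to glass to match the code below:

library(mlbench)
library(tidyverse)

data(Glass)
glass <- Glass
head(glass)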
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
- Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
First, let’s examine the distributions of the independent variables:
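The plotting chunk isn’t echoed; one plausible way to draw the histograms, mirroring the transformation chunks later in this write-up:

glass %>%
  select(-Type) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = 'free') +
  geom_histogram(aes(y = ..density..), colour = 'black', fill = 'white') +
  geom_density(alpha = 0.2, fill = '#cf322e')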
There are quite a few different shapes, none of them close to normal. Al and Ca are the closest, but they have longer, thicker tails. Ba and Fe are mostly absent from the glass samples. Mg, RI, and K have multiple peaks, perhaps indicating that several different processes are used to make the glass.
Second, let’s examine the dependent variable Type:
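A simple bar chart of the counts works here (a sketch; the original chunk isn’t shown):

ggplot(glass, aes(Type)) +
  geom_bar()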
Over half of the samples fall into groups 1 and 2. The seventh group is the third largest, while the sixth is the least numerous.
Third, let’s examine how the independent variables relate to each other, first with a scatterplot matrix (GGally’s ggpairs) and then with a correlation chart from the PerformanceAnalytics package:
library(GGally)

# Scatterplots with a loess smoother in the lower triangle
custom_ggpair <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.3, size = 0.5) +
    geom_smooth(method = 'loess', color = 'red', size = 0.5, ...)
}
ggpairs(glass[, 1:9], diag = 'blank', upper = 'blank',
        lower = list(continuous = custom_ggpair))

For the most part, each variable appears to have some slight relationship with the others. Only RI’s relationships with Ca and Si look obviously linear. Note that the loess line almost obscures more than it adds here. The correlation chart reflects this:
It is often illuminating to cluster the variables and examine the result directly:
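The clustering chunk isn’t shown; a sketch using Hmisc’s varclus, which by default clusters on squared Spearman correlations:

library(Hmisc)
plot(varclus(as.matrix(glass[, 1:9])))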
Consistent with the above observations, Si, RI, and Ca form their own cluster of variables. Al, Mg, and Ba form another, and Fe, Na, and K yet another.
Finally, let’s directly examine how each independent variable relates to Type:
glass %>%
  gather(-Type, key = 'variable', value = 'value') %>%
  ggplot(aes(x = Type, y = value, fill = variable)) +
  geom_bar(stat = 'identity', position = position_dodge())

This chart takes a moment to interpret, but we see that the real source of variation between types is the presence of trace amounts of some chemicals. Each type appears to have about the same proportion of Si (pink) and Na (light blue). There is some variation in Ca (olive green), but not a lot.
- Do there appear to be any outliers in the data? Are any predictors skewed?
A boxplot will help us determine if there are any outliers and where:
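A sketch of that chunk (assumed, since the original isn’t echoed):

glass %>%
  gather(-Type, key = 'variable', value = 'value') %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot()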
RI and Fe show so little variance that no points fall outside the whiskers, so no outliers appear there. Mg also lacks outliers, but the rest of the variables have outlying measurements.
As we saw in the graphs from part a, almost all of the variables are skewed. Skewness can be more precisely measured:
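The numbers below are consistent with e1071’s skewness applied column-wise (an assumption about the original chunk):

library(e1071)
apply(glass[, 1:9], 2, skewness)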
## RI Na Mg Al Si K
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889
## Ca Ba Fe
## 2.0184463 3.3686800 1.7298107
K is extremely skewed. Mg and Si are left-skewed (negative).
- Are there any relevant transformations of one or more predictors that might improve the classification model?
Log transformations would be a first guess, though they won’t necessarily be productive for all of the variables:
glass %>%
  select(-Type) %>%
  # Ba, Fe, and K contain zeros, so log() yields -Inf for those rows
  mutate_all(log) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = 'free') +
  geom_histogram(aes(y = ..density..), colour = 'black', fill = 'white') +
  geom_density(alpha = 0.2, fill = '#cf322e')

A Box-Cox transformation is also an option, although one I mostly avoid:
library(caret)  # BoxCoxTrans() comes from caret

glass %>%
  select(-Type) %>%
  mutate_all(function(x) predict(BoxCoxTrans(x), x)) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = 'free') +
  geom_histogram(aes(y = ..density..), colour = 'black', fill = 'white') +
  geom_density(alpha = 0.2, fill = '#cf322e')

Problem No. 3.2
- Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Load the data, being careful with R’s protected words (the Class column is renamed to class_ so it doesn’t shadow the built-in class()):
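A sketch of the loading step, assuming the Soybean data ships with mlbench:

library(mlbench)

data(Soybean)
soybean <- Soybean %>% rename(class_ = Class)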
Find problematic variables:
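caret’s nearZeroVar is one way to produce the list below (an assumption about the original chunk):

library(caret)
names(soybean)[nearZeroVar(soybean)]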
## [1] "leaf.mild" "mycelium" "sclerotia"
- Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
A decision tree on a variable ‘is this observation missing data?’ provides a quick overview of the problem:
# Flag each row: 1 if any value is missing, 0 otherwise
soybean$is_missing <- apply(soybean, MARGIN=1, FUN=function(x) max(is.na(x)))

library(rpart)
( dt <- rpart(is_missing ~ ., soybean) )
## n= 683
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 683 99.563690 0.1771596
## 2) class_=alternarialeaf-spot,anthracnose,bacterial-blight,bacterial-pustule,brown-spot,brown-stem-rot,charcoal-rot,diaporthe-stem-canker,downy-mildew,frog-eye-leaf-spot,phyllosticta-leaf-spot,powdery-mildew,purple-seed-stain,rhizoctonia-root-rot 542 0.000000 0.0000000 *
## 3) class_=2-4-d-injury,cyst-nematode,diaporthe-pod-&-stem-blight,herbicide-injury,phytophthora-rot 141 17.163120 0.8581560
## 6) roots=0 23 2.608696 0.1304348 *
## 7) roots=1,2 118 0.000000 1.0000000 *
It seems almost all of the missing observations are where:
- the class is NOT one of the fourteen classes in node 2 above (equivalently, it is 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, or phytophthora-rot), and
- roots is 1 or 2 (node 7); the roots = 0 branch (node 6) is mostly complete.
- Develop a strategy for handling missing data, either by eliminating predictors or imputation.
We can use Hmisc to impute the mode for the categorical data:
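A minimal sketch; Hmisc’s impute() falls back to the most frequent level for factors:

library(Hmisc)

# impute() uses the mode for factors (median for numerics) by default
soybean_mode <- soybean %>% mutate_if(is.factor, impute)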
A more sophisticated imputation method, also from Hmisc, uses bootstrapping and additive regression:
# mycelium and roots are excluded (too few unique values);
# ext.decay + fruit.pods are also dropped from the formula
impute_arg <- aregImpute(~ date + plant.stand + precip + temp + hail +
                           crop.hist + area.dam + sever + seed.tmt + germ +
                           plant.growth + leaves + leaf.halo + leaf.marg +
                           leaf.size + leaf.shread + leaf.malf + leaf.mild +
                           stem + lodging + stem.cankers + canker.lesion +
                           fruiting.bodies + int.discolor + sclerotia +
                           fruit.spots + seed + mold.growth + seed.discolor +
                           seed.size + shriveling, data=soybean)
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
## Iteration 6
## Iteration 7
## Iteration 8
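To actually fill in the data frame, Hmisc’s impute.transcan() extracts one completed set of values from the aregImpute fit; a sketch:

imputed <- impute.transcan(impute_arg, imputation = 1, data = soybean,
                           list.out = TRUE, pr = FALSE, check = FALSE)
soybean_complete <- soybean
soybean_complete[names(imputed)] <- imputed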