Exercise 3.1 (K&J)

data(Glass)
head(Glass)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

After converting Type to numeric, we can view the relationship between each variable and our outcome (Type) via a correlation plot from the corrplot package in R.

# First, convert the Type factor to numeric and compute the correlation matrix
glass <- Glass %>% mutate(Type = as.numeric(Type))
corrGlass <- cor(glass)
corrplot(corrGlass)

We see reasonably strong correlations between our Type variable and the predictors for the percentages of Barium, Aluminum, Magnesium, and Sodium.
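
One quick way to check this numerically is to sort the absolute correlations against Type from the matrix we just computed:

# Absolute correlations with Type, largest first (Type itself appears with value 1)
sort(abs(corrGlass[, "Type"]), decreasing = TRUE)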

We can also use a pairs plot from the GGally::ggpairs function to visually inspect the relationships between our variables, and to get a sense of each variable’s distribution. I’m going to blank the upper correlation panels (upper = NULL), since we already saw the correlations in the corrplot above, to make the plot a bit less busy:

# Use GGally to plot pairwise relationships between variables
ggpairs(glass, upper=NULL)

Several of the predictor variables, including the minerals highlighted above, appear noticeably skewed in this visual, enough that a transformation seems warranted. Some of these skewed predictors also correlate strongly with our Type variable (see above).

There also seem to be a few outlier points for the potassium (K) predictor, but that variable doesn’t correlate strongly with our outcome.
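
To put numbers on that skewness, here is a minimal sketch using e1071::skewness (assuming the e1071 package is installed; it is not loaded elsewhere in this document):

# Sample skewness of each numeric predictor (e1071 assumed available)
apply(glass %>% dplyr::select(-Type), 2, e1071::skewness)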

Transformations

Let’s build a simple multinomial logistic regression model to classify Type. First we’ll fit it on the raw data, and then on the transformed data to see whether the transformation improves our predictions. We’ll also build a simple train-test split to evaluate the models.

# Create simple training and test datasets
set.seed(1234) # set the seed so the same split can be reproduced later
# Sample 75% of the rows (without replacement) for the training set
sample <- sample.int(n = nrow(glass), size = floor(.75 * nrow(glass)), replace = F)
train <- glass[sample, ]
test  <- glass[-sample, ]

# Fit a multinomial logistic regression (nnet::multinom) on the raw training data
raw_model <- multinom(Type ~ ., data = train)
## # weights:  66 (50 variable)
## initial  value 286.681515 
## iter  10 value 178.170738
## iter  20 value 124.734126
## iter  30 value 110.870929
## iter  40 value 108.254006
## iter  50 value 107.230338
## iter  60 value 106.506072
## iter  70 value 106.009768
## iter  80 value 105.606775
## iter  90 value 104.999637
## iter 100 value 104.646171
## final  value 104.646171 
## stopped after 100 iterations
summary(raw_model)
## Call:
## multinom(formula = Type ~ ., data = train)
## 
## Coefficients:
##   (Intercept)         RI         Na         Mg        Al          Si
## 2   37.053482  92.821330 -0.9635396 -4.1316486  2.367778 -1.83846752
## 3   22.351865 -28.265503  1.4737269 -0.3435969  3.514535 -0.10914109
## 4   10.602322   4.170507 -2.1031350 -5.1700146 17.084019 -0.08040162
## 5   -5.713220  -9.153357  7.5518862 -4.4728104 18.128640 -1.24670932
## 6   -9.365253  15.412955 -0.8106857 -7.2261805  8.391411  0.58624394
##              K           Ca          Ba            Fe
## 2   -2.0997985 -2.239068150   -6.188797    3.39903039
## 3   -0.5549199  0.567433213 -124.587680   -0.04227417
## 4    0.1533757  0.003615038    1.168617    3.52391279
## 5 -105.5298008  0.055856582  -39.962363 -178.91207635
## 6   -4.6170684 -4.266785787    3.796153  -48.44823954
## 
## Std. Errors:
##   (Intercept)         RI        Na        Mg       Al        Si          K
## 2  0.08951798 0.19062936 0.5999472 0.9534937 1.419177 0.1512433 2.15216055
## 3  0.04089282 0.06725711 0.7697231 1.1568512 1.745858 0.1938122 2.58297422
## 4  0.14675139 0.28448074 1.4783342 1.6083248 6.072239 0.2720139 3.19634456
## 5  0.06544382 0.10594474 6.5459361 1.8673403 3.974656 1.4830063 0.01323568
## 6  0.09224567 0.18329600 1.7983552 1.7356575 3.285026 0.5348122 3.52505426
##          Ca           Ba           Fe
## 2 0.5788142 5.080058e+00 2.342857e+00
## 3 0.6468024 5.122840e-05 3.621717e+00
## 4 0.9722797 1.167045e+01 8.204996e+00
## 5 2.5921075 8.004143e-01 1.940598e-05
## 6 1.9020923 1.187087e+01 5.177781e-01
## 
## Residual Deviance: 209.2923 
## AIC: 309.2923

Now we can apply some transformations to our predictors. We’ll try a spatial sign transformation, which helps reduce the effect of outliers by projecting the (centered and scaled) predictor vectors onto the surface of a unit hypersphere. Since all transformed samples lie the same distance from the origin, the leverage of outlying points is reduced.
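
Concretely, after centering and scaling, the transform divides each sample (row) by its Euclidean norm. A minimal sketch of the idea on a toy matrix, purely for illustration (the analysis below uses caret’s spatialSign):

# Spatial sign by hand: scale the columns, then divide each row by its length
toy <- scale(matrix(c(1, 2, 3, 40, 5, 6), nrow = 3))
toy_ss <- toy / sqrt(rowSums(toy^2))
rowSums(toy_ss^2)  # each transformed row now has unit length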

In addition, our Ca variable is collinear with some of the other predictors, so we can remove it; the class-label variance it would explain is likely captured by those remaining predictors.
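
One way to confirm which predictor to drop is caret::findCorrelation (a sketch, assuming caret is loaded, as it also provides spatialSign below), which suggests columns to remove from a correlation matrix:

# Suggest predictors to drop based on pairwise correlations above 0.75
findCorrelation(cor(glass %>% dplyr::select(-Type)), cutoff = 0.75, names = TRUE)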

# Scale the predictors, then apply the spatial sign transform (caret::spatialSign)
glass_predictors <- glass %>% dplyr::select(-c(Type))
glass_scaled <- data.frame(scale(glass_predictors))
glass_transformed <- as.data.frame(spatialSign(glass_scaled))

glass_transformed <- glass_transformed %>% cbind(Type=glass$Type)

# Remove Calcium from the dataset (it is collinear with other predictors)
glass_transformed <- glass_transformed %>% dplyr::select(-c(Ca))
# Draw a new 75% training sample (note: no seed is set here, so this split differs from the raw-data split above)
sample <- sample.int(n = nrow(glass_transformed), size = floor(.75 * nrow(glass_transformed)), replace = F)
train_transformed <- glass_transformed[sample, ]
test_transformed  <- glass_transformed[-sample, ]

# Fit the same multinomial model on the spatial-sign-transformed training data
transformed_model <- multinom(Type ~ ., data = train_transformed)
## # weights:  60 (45 variable)
## initial  value 286.681515 
## iter  10 value 100.295255
## iter  20 value 76.835096
## iter  30 value 73.899437
## iter  40 value 72.469328
## iter  50 value 71.016328
## iter  60 value 69.721084
## iter  70 value 69.270365
## iter  80 value 69.240896
## final  value 69.240505 
## converged
summary(transformed_model)
## Call:
## multinom(formula = Type ~ ., data = train_transformed)
## 
## Coefficients:
##   (Intercept)         RI          Na           Mg         Al          Si
## 2    3.783184  -1.954268    2.356933    -1.704283   5.436208  -0.8584532
## 3   -3.117764 -17.701978   -7.471643    -4.265298  -3.261058 -17.5584527
## 4 -604.011572 360.873824 -274.989393 -1176.784096 516.967306 600.8810708
## 5 -362.269873 -85.663312  214.272425  -172.166082 384.602045 142.5582186
## 6 -415.419668 817.191067 1048.889403   160.872299 752.162942 607.9154490
##             K          Ba           Fe
## 2    1.324561    9.448656   -0.3846641
## 3  -17.178601    6.843506    1.9930873
## 4  527.328319   47.618942 -202.6382935
## 5 -745.631232 -434.943485 -235.1255588
## 6  580.989344 1024.266165 -405.2680269
## 
## Std. Errors:
##    (Intercept)           RI           Na           Mg           Al           Si
## 2 9.758905e-01 1.970090e+00 1.632730e+00 1.765219e+00 1.678852e+00 1.737909e+00
## 3 2.328477e+00 5.121244e+00 3.954951e+00 4.941777e+00 3.216347e+00 5.511999e+00
## 4 7.152477e+02 1.229219e+02 1.015290e+01 3.776177e+02 2.876679e+00 1.194344e+02
## 5 1.228350e-08 1.077065e-09 3.739326e-09 7.659863e-09 5.318101e-09 4.729215e-10
## 6 3.204367e-10 5.985387e-11 8.283577e-11 1.588639e-10 2.299929e-10 8.915074e-11
##              K           Ba           Fe
## 2 3.076375e+00 5.383879e+00 6.097702e-01
## 3 6.117637e+00 1.092234e+01 1.316750e+00
## 4 5.609749e+00 7.143509e+01 1.178239e+02
## 5 3.135160e-09 1.448088e-09 2.406598e-09
## 6 5.982057e-11 3.858865e-11 2.884024e-11
## 
## Residual Deviance: 138.481 
## AIC: 228.481

With our predictor variables transformed via spatial sign, we see an improved (lower) AIC, indicating a better trade-off between fit and complexity (this makes sense, as we also removed a collinear predictor and therefore estimate fewer parameters).
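
As a quick sanity check on those numbers, multinom computes AIC as the residual deviance plus twice the number of estimated parameters (50 for the raw model, 45 after dropping Ca):

# AIC = residual deviance + 2 * (number of estimated parameters)
209.2923 + 2 * 50  # 309.2923, the raw model's AIC
138.4810 + 2 * 45  # 228.4810, the transformed model's AIC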

# Generate class predictions on the held-out test sets
raw_predict <- predict(raw_model, test)
transformed_predict <- predict(transformed_model, test_transformed)

# Calculate and print accuracy on raw data model and transformed data model
(raw_accuracy <- mean(raw_predict == test$Type))
## [1] 0.6111111
(transformed_accuracy <- mean(transformed_predict == test_transformed$Type))
## [1] 0.5740741

We see slightly lower but broadly comparable accuracy from a simpler model (one fewer predictor; lower AIC) on our transformed data. Note that the two models were evaluated on different random splits, since the seed was only set before the first split, so small accuracy differences should not be over-interpreted.
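
For a closer look than a single accuracy number, a simple cross-tabulation of predicted versus actual classes (a sketch using base R's table) shows where each model goes wrong:

# Per-class confusion tables for the raw and transformed models
table(predicted = raw_predict, actual = test$Type)
table(predicted = transformed_predict, actual = test_transformed$Type)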

Exercise 3.2 (K&J)

data(Soybean)
head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0

Categorical Predictor Frequency Distributions

The table function is a helpful way to show the relative frequencies between two categorical variables; here we cross-tabulate Class against precip.

table(Soybean$Class, Soybean$precip)
##                              
##                                0  1  2
##   2-4-d-injury                 0  0  0
##   alternarialeaf-spot          0  9 82
##   anthracnose                  0  0 44
##   bacterial-blight             0 10 10
##   bacterial-pustule            0 12  8
##   brown-spot                   0 10 82
##   brown-stem-rot              35  9  0
##   charcoal-rot                20  0  0
##   cyst-nematode                0  0  0
##   diaporthe-pod-&-stem-blight  0  2 13
##   diaporthe-stem-canker        0  0 20
##   downy-mildew                 0  0 20
##   frog-eye-leaf-spot           0 10 81
##   herbicide-injury             0  0  0
##   phyllosticta-leaf-spot       9 11  0
##   phytophthora-rot             0 30 58
##   powdery-mildew              10  9  1
##   purple-seed-stain            0  0 20
##   rhizoctonia-root-rot         0  0 20

However, we have 36 variables in our Soybean dataset, so such tables would become busy rather quickly. We can instead produce mosaic plots with the vcd package, which help us visualize class frequencies. Let’s do this first for some features related to the weather a soybean crop experienced, for example: temp, hail, and precip.

weather <- Soybean %>% dplyr::select(precip, hail, temp) %>% table()
mosaic(weather, shade = TRUE)

We can also plot some variables related to properties of the seed (e.g., seed, seed.discolor, seed.size). In this case, we see some larger class imbalance, specifically for the seed.discolor and seed.size variables.

Soybean %>% dplyr::select(seed, seed.discolor, seed.size) %>% 
  table() %>%
  mosaic(shade = TRUE)

Missing Values

To see which predictors have higher rates of missing values, we can use the freq.na function from the questionr package, which lists the number and percentage of missing values per feature.

freq.na(Soybean)
##                 missing  %
## hail                121 18
## sever               121 18
## seed.tmt            121 18
## lodging             121 18
## germ                112 16
## leaf.mild           108 16
## fruiting.bodies     106 16
## fruit.spots         106 16
## seed.discolor       106 16
## shriveling          106 16
## leaf.shread         100 15
## seed                 92 13
## mold.growth          92 13
## seed.size            92 13
## leaf.halo            84 12
## leaf.marg            84 12
## leaf.size            84 12
## leaf.malf            84 12
## fruit.pods           84 12
## precip               38  6
## stem.cankers         38  6
## canker.lesion        38  6
## ext.decay            38  6
## mycelium             38  6
## int.discolor         38  6
## sclerotia            38  6
## plant.stand          36  5
## roots                31  5
## temp                 30  4
## crop.hist            16  2
## plant.growth         16  2
## stem                 16  2
## date                  1  0
## area.dam              1  0
## Class                 0  0
## leaves                0  0

From this it looks like several variables are missing roughly a fifth of their values (hail, sever, seed.tmt, and lodging), with a number of others missing around 15%.

# Pick out just predictors with many missing vals
soybean_missing <- Soybean %>% dplyr::select("Class", "sever", "hail", "seed.tmt", "lodging")

p1 <- ggplot(soybean_missing, aes(x=Class, fill=sever)) +
     geom_bar(position="fill") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(title = "Severity")

p2 <- ggplot(soybean_missing, aes(x=Class, fill=seed.tmt)) +
     geom_bar(position="fill") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+ labs(title = "Seed Treatment")

p3 <- ggplot(soybean_missing, aes(x=Class, fill=hail)) +
     geom_bar(position="fill") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(title = "Received Hail?")

p4 <- ggplot(soybean_missing, aes(x=Class, fill=lodging)) +
     geom_bar(position="fill")  + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(title = "Lodging")
p1

There’s better balance for our seed treatment variable (p2, shown next), though some classes are missing predictor values entirely.

p2

For our lodging property (p4, shown after the hail plot), we see some stronger class imbalance, as well as some classes with entirely missing values.

p3

p4

Looking at the above plots, these predictors are missing entirely for four of the Soybean Class values (a quick check appears after the list):

  • 2-4-d-injury
  • cyst-nematode
  • diaporthe-pod-&-stem-blight
  • herbicide-injury
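
A minimal sketch of that check, counting the non-missing values of these four predictors within each class (assuming a recent dplyr with across()):

# Count non-missing values of the four predictors within each Class
Soybean %>%
  group_by(Class) %>%
  summarise(across(c(sever, hail, seed.tmt, lodging), ~ sum(!is.na(.))))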

Missing Values Strategy

To handle these missing values, we could use the mice package in R to impute them. Another method would be to sample replacement values from the observed distribution of each predictor, as sketched below.
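
A minimal sketch of that sampling idea, using hail purely as an illustration (the filled values are random draws and are not used in the analysis below):

# Fill NAs in hail by sampling from its observed values, preserving their frequencies
hail_filled <- Soybean$hail
missing_idx <- is.na(hail_filled)
hail_filled[missing_idx] <- sample(na.omit(Soybean$hail), sum(missing_idx), replace = TRUE)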

Eliminating predictors is a simpler approach, in which we simply drop the predictors with the highest rates of missing values:

soybean_eliminated <- Soybean %>% dplyr::select(-c("sever", "hail", "seed.tmt", "lodging"))

We can train a basic Multinomial Logistic Regression model with our “bad” predictors eliminated

elim_model <- multinom(Class ~ ., soybean_eliminated)
## Warning in multinom(Class ~ ., soybean_eliminated): groups '2-4-d-injury'
## 'cyst-nematode' 'diaporthe-pod-&-stem-blight' 'herbicide-injury' are empty
## # weights:  900 (826 variable)
## initial  value 1521.924213 
## iter  10 value 110.127151
## iter  20 value 42.728685
## iter  30 value 34.285720
## iter  40 value 32.739471
## iter  50 value 32.520566
## iter  60 value 32.478813
## iter  70 value 32.461286
## iter  80 value 32.452854
## iter  90 value 32.450728
## iter 100 value 32.449732
## final  value 32.449732 
## stopped after 100 iterations
# Print AIC which is a good measure of model complexity and goodness of fit
elim_model$AIC
## [1] 1604.899

Now we can try to train another multinomial logistic regression model on imputed data to compare. We’ll only impute our biggest-offender predictors from above to keep the computation manageable.

soybean_missing <- soybean_missing %>% dplyr::select(-c(Class))
# Impute using mice
soybean_missing_imputed <- mice(soybean_missing)
## 
##  iter imp variable
##   1   1  sever  hail  seed.tmt  lodging
##   1   2  sever  hail  seed.tmt  lodging
##   1   3  sever  hail  seed.tmt  lodging
##   1   4  sever  hail  seed.tmt  lodging
##   1   5  sever  hail  seed.tmt  lodging
##   2   1  sever  hail  seed.tmt  lodging
##   2   2  sever  hail  seed.tmt  lodging
##   2   3  sever  hail  seed.tmt  lodging
##   2   4  sever  hail  seed.tmt  lodging
##   2   5  sever  hail  seed.tmt  lodging
##   3   1  sever  hail  seed.tmt  lodging
##   3   2  sever  hail  seed.tmt  lodging
##   3   3  sever  hail  seed.tmt  lodging
##   3   4  sever  hail  seed.tmt  lodging
##   3   5  sever  hail  seed.tmt  lodging
##   4   1  sever  hail  seed.tmt  lodging
##   4   2  sever  hail  seed.tmt  lodging
##   4   3  sever  hail  seed.tmt  lodging
##   4   4  sever  hail  seed.tmt  lodging
##   4   5  sever  hail  seed.tmt  lodging
##   5   1  sever  hail  seed.tmt  lodging
##   5   2  sever  hail  seed.tmt  lodging
##   5   3  sever  hail  seed.tmt  lodging
##   5   4  sever  hail  seed.tmt  lodging
##   5   5  sever  hail  seed.tmt  lodging
# Combine the completed (imputed) columns with the rest of the data;
# complete() returns the filled-in dataset from the mice object
soybean_imputed <- cbind(soybean_eliminated, complete(soybean_missing_imputed))

# Now we can train a Multinom Logistic Regression model on our imputed data
impute_model <- multinom(Class ~ ., soybean_imputed)
## Warning in multinom(Class ~ ., soybean_imputed): groups '2-4-d-injury'
## 'cyst-nematode' 'diaporthe-pod-&-stem-blight' 'herbicide-injury' are empty
## # weights:  990 (910 variable)
## initial  value 1521.924213 
## iter  10 value 91.041479
## iter  20 value 34.468933
## iter  30 value 28.552390
## iter  40 value 27.564076
## iter  50 value 27.443958
## iter  60 value 27.408837
## iter  70 value 27.392397
## iter  80 value 27.386871
## iter  90 value 27.385205
## iter 100 value 27.384153
## final  value 27.384153 
## stopped after 100 iterations
# Print AIC which is a good measure of model complexity and goodness of fit
impute_model$AIC
## [1] 1762.768

We actually see a higher AIC for the model fit on imputed data than for the model with the offending predictors eliminated. This makes intuitive sense: AIC penalizes the number of estimated parameters, and the imputed model adds the four predictors back in (910 versus 826 estimated weights, per the fitting logs above). Eliminating those predictors keeps the model smaller, improving its AIC even though the imputed model reaches a slightly lower deviance.