data(Glass)
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
We can view the relationship between the predictors and our outcome variable (Type) via a correlation plot from the corrplot package in R.
# First, calculate the correlation between the variables, with Type coded as numeric
glass <- Glass %>% mutate(Type = as.numeric(Type))
corrGlass <- cor(glass)
corrplot(corrGlass)
We see reasonably strong correlations between our Type variable and the predictors for the percentages of Barium (Ba), Aluminum (Al), Magnesium (Mg), and Sodium (Na).
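To see the numbers behind the plot, we can pull the Type column out of the correlation matrix we just computed (a quick sketch using the corrGlass object from above):
# Correlation of each variable with the numeric-coded Type
round(corrGlass[, "Type"], 2)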
We can also use a corner plot from the GGally::ggpairs function to visually inspect the pairwise relationships between our variables, and to get a sense of each variable's distribution. I'm going to drop the correlations in the upper panel (upper = NULL), since the corrplot above already shows them, to make the plot a bit less busy:
# Use GGally to plot pairwise relationships between variables
ggpairs(glass, upper=NULL)
From this visual, several of the predictor variables appear to be skewed. Specifically, the following predictors seem skewed enough that a transformation is warranted: RI, Mg, K, Ca, Ba, and Fe. Some of these have strong correlations with our Type variable (see above). There also appear to be a few outlier points for the potassium (K) predictor, but that variable doesn't correlate strongly with our outcome.
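As a quick numeric check of that visual impression, we could compute the sample skewness of each predictor (a sketch assuming the e1071 package is available):
# Skewness per predictor; values far from 0 suggest a transformation is worthwhile
sapply(glass %>% dplyr::select(-Type), e1071::skewness)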
Let's build a simple multinomial logistic regression model to classify Type. First, we'll do it with the raw data, and then with the transformed data to see if that improves our predictions. We'll also build a simple train-test split to evaluate the models.
# Create simple training and test datasets
set.seed(1234) # Set seed so the same split can be reproduced later
# Select 75% of the rows as the training sample
sample <- sample.int(n = nrow(glass), size = floor(.75 * nrow(glass)), replace = F)
train <- glass[sample, ]
test <- glass[-sample, ]
raw_model <- multinom(Type ~., data=train)
## # weights: 66 (50 variable)
## initial value 286.681515
## iter 10 value 178.170738
## iter 20 value 124.734126
## iter 30 value 110.870929
## iter 40 value 108.254006
## iter 50 value 107.230338
## iter 60 value 106.506072
## iter 70 value 106.009768
## iter 80 value 105.606775
## iter 90 value 104.999637
## iter 100 value 104.646171
## final value 104.646171
## stopped after 100 iterations
summary(raw_model)
## Call:
## multinom(formula = Type ~ ., data = train)
##
## Coefficients:
## (Intercept) RI Na Mg Al Si
## 2 37.053482 92.821330 -0.9635396 -4.1316486 2.367778 -1.83846752
## 3 22.351865 -28.265503 1.4737269 -0.3435969 3.514535 -0.10914109
## 4 10.602322 4.170507 -2.1031350 -5.1700146 17.084019 -0.08040162
## 5 -5.713220 -9.153357 7.5518862 -4.4728104 18.128640 -1.24670932
## 6 -9.365253 15.412955 -0.8106857 -7.2261805 8.391411 0.58624394
## K Ca Ba Fe
## 2 -2.0997985 -2.239068150 -6.188797 3.39903039
## 3 -0.5549199 0.567433213 -124.587680 -0.04227417
## 4 0.1533757 0.003615038 1.168617 3.52391279
## 5 -105.5298008 0.055856582 -39.962363 -178.91207635
## 6 -4.6170684 -4.266785787 3.796153 -48.44823954
##
## Std. Errors:
## (Intercept) RI Na Mg Al Si K
## 2 0.08951798 0.19062936 0.5999472 0.9534937 1.419177 0.1512433 2.15216055
## 3 0.04089282 0.06725711 0.7697231 1.1568512 1.745858 0.1938122 2.58297422
## 4 0.14675139 0.28448074 1.4783342 1.6083248 6.072239 0.2720139 3.19634456
## 5 0.06544382 0.10594474 6.5459361 1.8673403 3.974656 1.4830063 0.01323568
## 6 0.09224567 0.18329600 1.7983552 1.7356575 3.285026 0.5348122 3.52505426
## Ca Ba Fe
## 2 0.5788142 5.080058e+00 2.342857e+00
## 3 0.6468024 5.122840e-05 3.621717e+00
## 4 0.9722797 1.167045e+01 8.204996e+00
## 5 2.5921075 8.004143e-01 1.940598e-05
## 6 1.9020923 1.187087e+01 5.177781e-01
##
## Residual Deviance: 209.2923
## AIC: 309.2923
Now we can apply some transformations to our predictors. We'll try a spatial sign transformation, which reduces the influence of outliers by projecting the (centered and scaled) predictor values onto the surface of a unit hypersphere. Since every transformed observation sits at the same distance from the center of that sphere, no single point can lie far away from the rest.
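For intuition, here is a minimal sketch of what the transform does row-wise (below we use caret::spatialSign rather than this helper):
# Minimal sketch: divide each centered-and-scaled row by its Euclidean norm,
# so every observation ends up on the unit sphere
spatial_sign_manual <- function(x) sweep(x, 1, sqrt(rowSums(x^2)), "/")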
In addition, our Ca variable is collinear with some of the other predictors, so we can remove it; whatever variance in the class labels Calcium explains is likely captured by those remaining predictors.
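We can check that collinearity directly from the correlation matrix computed earlier, and let caret suggest which predictors exceed a pairwise-correlation cutoff (a sketch; the 0.75 cutoff is an arbitrary choice):
# How strongly does Ca correlate with the other variables?
round(corrGlass["Ca", ], 2)
# Flag predictors involved in high pairwise correlations
caret::findCorrelation(cor(glass %>% dplyr::select(-Type)), cutoff = 0.75, names = TRUE)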
# First we scale our dataset then input it into the spatialSign transform
glass_predictors <- glass %>% dplyr::select(-c(Type))
# glass_predictors$Mg <- log(glass_predictors$Mg)
glass_scaled <- data.frame(scale(glass_predictors))
glass_transformed <- as.data.frame(spatialSign(glass_scaled))
glass_transformed <- glass_transformed %>% cbind(Type=glass$Type)
# Remove Calcium from the dataset
glass_transformed <- glass_transformed %>% dplyr::select(-c(Ca))
# Again select 75% of the rows as the training sample
sample <- sample.int(n = nrow(glass_transformed), size = floor(.75 * nrow(glass_transformed)), replace = F)
train_transformed <- glass_transformed[sample, ]
test_transformed <- glass_transformed[-sample, ]
transformed_model <- multinom(Type ~ ., data = train_transformed)
## # weights: 60 (45 variable)
## initial value 286.681515
## iter 10 value 100.295255
## iter 20 value 76.835096
## iter 30 value 73.899437
## iter 40 value 72.469328
## iter 50 value 71.016328
## iter 60 value 69.721084
## iter 70 value 69.270365
## iter 80 value 69.240896
## final value 69.240505
## converged
summary(transformed_model)
## Call:
## multinom(formula = Type ~ ., data = train_transformed)
##
## Coefficients:
## (Intercept) RI Na Mg Al Si
## 2 3.783184 -1.954268 2.356933 -1.704283 5.436208 -0.8584532
## 3 -3.117764 -17.701978 -7.471643 -4.265298 -3.261058 -17.5584527
## 4 -604.011572 360.873824 -274.989393 -1176.784096 516.967306 600.8810708
## 5 -362.269873 -85.663312 214.272425 -172.166082 384.602045 142.5582186
## 6 -415.419668 817.191067 1048.889403 160.872299 752.162942 607.9154490
## K Ba Fe
## 2 1.324561 9.448656 -0.3846641
## 3 -17.178601 6.843506 1.9930873
## 4 527.328319 47.618942 -202.6382935
## 5 -745.631232 -434.943485 -235.1255588
## 6 580.989344 1024.266165 -405.2680269
##
## Std. Errors:
## (Intercept) RI Na Mg Al Si
## 2 9.758905e-01 1.970090e+00 1.632730e+00 1.765219e+00 1.678852e+00 1.737909e+00
## 3 2.328477e+00 5.121244e+00 3.954951e+00 4.941777e+00 3.216347e+00 5.511999e+00
## 4 7.152477e+02 1.229219e+02 1.015290e+01 3.776177e+02 2.876679e+00 1.194344e+02
## 5 1.228350e-08 1.077065e-09 3.739326e-09 7.659863e-09 5.318101e-09 4.729215e-10
## 6 3.204367e-10 5.985387e-11 8.283577e-11 1.588639e-10 2.299929e-10 8.915074e-11
## K Ba Fe
## 2 3.076375e+00 5.383879e+00 6.097702e-01
## 3 6.117637e+00 1.092234e+01 1.316750e+00
## 4 5.609749e+00 7.143509e+01 1.178239e+02
## 5 3.135160e-09 1.448088e-09 2.406598e-09
## 6 5.982057e-11 3.858865e-11 2.884024e-11
##
## Residual Deviance: 138.481
## AIC: 228.481
With our predictor variables transformed via spatial sign, we see an improved (lower) AIC, indicating a better trade-off between fit and complexity (which makes sense, as we also removed an extraneous predictor).
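As a quick sanity check on those values: for a multinom fit, AIC is the residual deviance plus twice the number of estimated coefficients (50 for the raw model and 45 once Ca is dropped, per the summaries above).
# AIC = residual deviance + 2 * number of coefficients
209.2923 + 2 * 50 # raw model: 309.2923
138.4810 + 2 * 45 # transformed model: 228.481
Next, let's compare predictive accuracy on the held-out test sets.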
raw_predict <- predict(raw_model, test)
transformed_predict <- predict(transformed_model, test_transformed)
# Calculate and print accuracy on raw data model and transformed data model
(raw_accuracy <- mean(raw_predict == test$Type))
## [1] 0.6111111
(transformed_accuracy <- mean(transformed_predict == test_transformed$Type))
## [1] 0.5740741
We see comparable (in fact slightly lower) accuracy from a simpler model (fewer inputs; lower AIC) on our transformed data.
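If we wanted a more detailed per-class breakdown than overall accuracy, one option (using caret, which we already rely on for spatialSign) would be a confusion matrix; since Type was recoded to numeric earlier, we align it with the factor levels of the predictions:
# Per-class breakdown of the raw model's test-set predictions (sketch)
caret::confusionMatrix(raw_predict, factor(test$Type, levels = levels(raw_predict)))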
data(Soybean)
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
The table function can be a helpful way to show the joint frequencies of two categorical variables.
table(Soybean$Class, Soybean$precip)
##
## 0 1 2
## 2-4-d-injury 0 0 0
## alternarialeaf-spot 0 9 82
## anthracnose 0 0 44
## bacterial-blight 0 10 10
## bacterial-pustule 0 12 8
## brown-spot 0 10 82
## brown-stem-rot 35 9 0
## charcoal-rot 20 0 0
## cyst-nematode 0 0 0
## diaporthe-pod-&-stem-blight 0 2 13
## diaporthe-stem-canker 0 0 20
## downy-mildew 0 0 20
## frog-eye-leaf-spot 0 10 81
## herbicide-injury 0 0 0
## phyllosticta-leaf-spot 9 11 0
## phytophthora-rot 0 30 58
## powdery-mildew 10 9 1
## purple-seed-stain 0 0 20
## rhizoctonia-root-rot 0 0 20
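If we want relative frequencies rather than raw counts, we can wrap the same table in prop.table (margin = 1 normalizes within each class; classes with no recorded precip values show NaN):
# Share of each precip level within each Class
round(prop.table(table(Soybean$Class, Soybean$precip), margin = 1), 2)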
However, we have 36 variables in our Soybean dataset, so that table would become busy rather quickly. We can instead produce some mosaic plots with the vcd package, which help us visualize our class frequencies. Let's do this first for some features related to the weather a soybean experienced, for example: temp, hail, and precip.
weather <- Soybean %>% dplyr::select(precip, hail, temp) %>% table()
mosaic(weather, shade = TRUE)
We can also plot some variables related to properties of the seed (e.g., seed, seed.discolor, seed.size). In this case, we see some larger class imbalance, specifically for the seed.discolor and seed.size variables.
Soybean %>% dplyr::select(seed, seed.discolor, seed.size) %>%
table() %>%
mosaic(shade = TRUE)
To see which predictors have higher rates of missing values, we can use the freq.na function from the questionr package, which lists the number and percentage of missing values per feature:
freq.na(Soybean)
## missing %
## hail 121 18
## sever 121 18
## seed.tmt 121 18
## lodging 121 18
## germ 112 16
## leaf.mild 108 16
## fruiting.bodies 106 16
## fruit.spots 106 16
## seed.discolor 106 16
## shriveling 106 16
## leaf.shread 100 15
## seed 92 13
## mold.growth 92 13
## seed.size 92 13
## leaf.halo 84 12
## leaf.marg 84 12
## leaf.size 84 12
## leaf.malf 84 12
## fruit.pods 84 12
## precip 38 6
## stem.cankers 38 6
## canker.lesion 38 6
## ext.decay 38 6
## mycelium 38 6
## int.discolor 38 6
## sclerotia 38 6
## plant.stand 36 5
## roots 31 5
## temp 30 4
## crop.hist 16 2
## plant.growth 16 2
## stem 16 2
## date 1 0
## area.dam 1 0
## Class 0 0
## leaves 0 0
From this it looks like we have several variables that are missing roughly a fifth of their values (hail, sever, seed.tmt, lodging, among others).
# Pick out just predictors with many missing vals
soybean_missing <- Soybean %>% dplyr::select("Class", "sever", "hail", "seed.tmt", "lodging")
p1 <- ggplot(soybean_missing, aes(x=Class, fill=sever)) +
geom_bar(position="fill") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(title = "Severity")
p2 <- ggplot(soybean_missing, aes(x=Class, fill=seed.tmt)) +
geom_bar(position="fill") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+ labs(title = "Seed Treatment")
p3 <- ggplot(soybean_missing, aes(x=Class, fill=hail)) +
geom_bar(position="fill") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(title = "Received Hail?")
p4 <- ggplot(soybean_missing, aes(x=Class, fill=lodging)) +
geom_bar(position="fill") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(title = "Raised in Lodging?")
p1
There's better balance for our seed treatment variable, though some classes are missing predictor values entirely.
p2
For our lodging property, we see some stronger class imbalance, as well as some classes whose values are missing entirely.
p3
p4
Looking at the above plots, these predictors are missing values entirely for four of the Soybean classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury. To handle these missing predictors, we could use the mice package in R to impute values. Another method would be to sample replacement values from the observed distribution of each predictor.
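As a rough sketch of that sampling idea (the helper below is hypothetical and not used in what follows), each missing value could be drawn from the observed values of its own column:
# Hypothetical helper: fill NAs in a column by sampling its observed values
sample_impute <- function(x) {
  x[is.na(x)] <- sample(x[!is.na(x)], sum(is.na(x)), replace = TRUE)
  x
}
# e.g. Soybean %>% mutate(across(-Class, sample_impute))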
Eliminating predictors is a simpler approach, in which we just drop the predictors with the most missing values:
soybean_eliminated <- Soybean %>% dplyr::select(-c("sever", "hail", "seed.tmt", "lodging"))
We can train a basic multinomial logistic regression model with our "bad" predictors eliminated:
elim_model <- multinom(Class ~ ., soybean_eliminated)
## Warning in multinom(Class ~ ., soybean_eliminated): groups '2-4-d-injury'
## 'cyst-nematode' 'diaporthe-pod-&-stem-blight' 'herbicide-injury' are empty
## # weights: 900 (826 variable)
## initial value 1521.924213
## iter 10 value 110.127151
## iter 20 value 42.728685
## iter 30 value 34.285720
## iter 40 value 32.739471
## iter 50 value 32.520566
## iter 60 value 32.478813
## iter 70 value 32.461286
## iter 80 value 32.452854
## iter 90 value 32.450728
## iter 100 value 32.449732
## final value 32.449732
## stopped after 100 iterations
# Print AIC, which balances goodness of fit against model complexity
elim_model$AIC
## [1] 1604.899
Now we can train another multinomial logistic regression model on imputed data for comparison. We'll only impute the worst offenders from above, to keep the computation manageable.
soybean_missing <- soybean_missing %>% dplyr::select(-c(Class))
# Impute using mice
soybean_missing_imputed <- mice(soybean_missing)
##
## iter imp variable
## 1 1 sever hail seed.tmt lodging
## 1 2 sever hail seed.tmt lodging
## 1 3 sever hail seed.tmt lodging
## 1 4 sever hail seed.tmt lodging
## 1 5 sever hail seed.tmt lodging
## 2 1 sever hail seed.tmt lodging
## 2 2 sever hail seed.tmt lodging
## 2 3 sever hail seed.tmt lodging
## 2 4 sever hail seed.tmt lodging
## 2 5 sever hail seed.tmt lodging
## 3 1 sever hail seed.tmt lodging
## 3 2 sever hail seed.tmt lodging
## 3 3 sever hail seed.tmt lodging
## 3 4 sever hail seed.tmt lodging
## 3 5 sever hail seed.tmt lodging
## 4 1 sever hail seed.tmt lodging
## 4 2 sever hail seed.tmt lodging
## 4 3 sever hail seed.tmt lodging
## 4 4 sever hail seed.tmt lodging
## 4 5 sever hail seed.tmt lodging
## 5 1 sever hail seed.tmt lodging
## 5 2 sever hail seed.tmt lodging
## 5 3 sever hail seed.tmt lodging
## 5 4 sever hail seed.tmt lodging
## 5 5 sever hail seed.tmt lodging
# Combine the completed (imputed) columns from mice back with the rest of the data
soybean_imputed <- cbind(soybean_eliminated, mice::complete(soybean_missing_imputed))
# Now we can train a Multinom Logistic Regression model on our imputed data
impute_model <- multinom(Class ~ ., soybean_imputed)
## Warning in multinom(Class ~ ., soybean_imputed): groups '2-4-d-injury'
## 'cyst-nematode' 'diaporthe-pod-&-stem-blight' 'herbicide-injury' are empty
## # weights: 990 (910 variable)
## initial value 1521.924213
## iter 10 value 91.041479
## iter 20 value 34.468933
## iter 30 value 28.552390
## iter 40 value 27.564076
## iter 50 value 27.443958
## iter 60 value 27.408837
## iter 70 value 27.392397
## iter 80 value 27.386871
## iter 90 value 27.385205
## iter 100 value 27.384153
## final value 27.384153
## stopped after 100 iterations
# Print AIC, which balances goodness of fit against model complexity
impute_model$AIC
## [1] 1762.768
We actually see a higher AIC on our imputed data than on the data with the offending predictors eliminated. This makes intuitive sense: AIC penalizes the number of estimated parameters, and re-adding the four imputed predictors adds coefficients for every class. Removing those heavily missing predictors improved AIC in that sense.
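We can see this directly from the components multinom stores on each fit: the residual deviance (twice the final negative log-likelihood printed above, roughly 64.9 versus 54.8) shows the imputed-data model actually fits slightly better, so its higher AIC comes from the larger effective number of parameters (edf) in the penalty term.
# Raw goodness of fit: lower deviance = better fit, ignoring complexity
c(eliminated = elim_model$deviance, imputed = impute_model$deviance)
# Effective number of parameters entering the AIC penalty
c(eliminated = elim_model$edf, imputed = impute_model$edf)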