Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html

For this assignment, let us use ‘species’ as our outcome or the dependent variable.

Logistic Regression with Binary Outcome

library(skimr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(palmerpenguins)
require(foreign)
require(nnet)
require(ggplot2)
require(reshape2)
library(pROC)
library(kableExtra)
library(broom)
library(gmodels)

Exploration

Looking at the dataset set we see that there are 2 categorical variables and 4 quantitative variables. Year will be omitted for this exercise.
Within the dependent variable species we see that there are 3 species, therefore we cannot conduct binary logistic regression on this dataset without some reclassification of the dependent variable.

skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇
penguins %>% select(-species, -sex, -island, -year) %>% 
  gather() %>%
  ggplot(aes(x= value)) +
  geom_histogram(fill='pink') +
  facet_wrap(~key, scales = 'free')

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"),guide = FALSE) +
  facet_wrap(~species, ncol = 1) + coord_flip()

Dataset Split

In order to conduct binary logistic regression we will convert the dataset into 3 different datasets where the dependant variable (species) is True (1) if the observation is classified as the specific species or False (0), where it is not classified as the species in question. We will do this for all 3 species resulting in 3 datasets that we can run a binary logistical model on

penguins_adelie <- penguins %>% mutate(species, species_update = ifelse(species == "Adelie", 1, 0))
penguins_chinstrap <- penguins %>% mutate(species, species_update = ifelse(species == "Chinstrap", 1, 0))
penguins_gentoo <- penguins %>% mutate(species, species_update = ifelse(species == "Gentoo", 1, 0))

Binary Logistic Model 1: Adile vs All

adelie_m <- glm(species_update ~ body_mass_g + bill_length_mm + flipper_length_mm, data = penguins_adelie, family = "binomial")

summary(adelie_m)
## 
## Call:
## glm(formula = species_update ~ body_mass_g + bill_length_mm + 
##     flipper_length_mm, family = "binomial", data = penguins_adelie)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.66148  -0.06914  -0.00440   0.05122   2.92657  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       84.842092  13.328035   6.366 1.94e-10 ***
## body_mass_g        0.002934   0.001010   2.905  0.00367 ** 
## bill_length_mm    -1.132585   0.182064  -6.221 4.95e-10 ***
## flipper_length_mm -0.242384   0.059448  -4.077 4.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 469.424  on 341  degrees of freedom
## Residual deviance:  65.899  on 338  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 73.899
## 
## Number of Fisher Scoring iterations: 8
penguins_adelie$species_prob <- predict(adelie_m, penguins_adelie, type = "response")
penguins_adelie <- penguins_adelie %>% mutate(species_pred = ifelse(species_prob > .5, 1, 0))

Binary Logistic Model 2: Chinstrap vs All

chinstrap_m <- glm(species_update ~ bill_length_mm  + body_mass_g, data = penguins_chinstrap, family = "binomial")

summary(chinstrap_m)
## 
## Call:
## glm(formula = species_update ~ bill_length_mm + body_mass_g, 
##     family = "binomial", data = penguins_chinstrap)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.46107  -0.01209  -0.00110  -0.00003   2.26582  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -44.841232  11.302259  -3.967 7.26e-05 ***
## bill_length_mm   2.268799   0.570229   3.979 6.93e-05 ***
## body_mass_g     -0.014942   0.003811  -3.920 8.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 341.163  on 341  degrees of freedom
## Residual deviance:  27.755  on 339  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 33.755
## 
## Number of Fisher Scoring iterations: 10
penguins_chinstrap$species_prob <- predict(chinstrap_m, penguins_chinstrap, type = "response")
penguins_chinstrap <- penguins_chinstrap %>% mutate(species_pred = ifelse(species_prob > .5, 1, 0))

Binary Logistic Model 3: Gentoo vs All

gentoo_m <- glm(species_update ~ bill_length_mm + flipper_length_mm, data = penguins_gentoo, family = "binomial")
summary(gentoo_m)
## 
## Call:
## glm(formula = species_update ~ bill_length_mm + flipper_length_mm, 
##     family = "binomial", data = penguins_gentoo)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.31900  -0.02202  -0.00242   0.02091   2.59049  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -129.5325    25.2128  -5.138 2.78e-07 ***
## bill_length_mm      -0.2217     0.1142  -1.942   0.0522 .  
## flipper_length_mm    0.6746     0.1313   5.139 2.76e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 446.800  on 341  degrees of freedom
## Residual deviance:  44.757  on 339  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 50.757
## 
## Number of Fisher Scoring iterations: 9
penguins_gentoo$species_prob <- predict(gentoo_m, penguins_gentoo, type = "response")
penguins_gentoo <- penguins_gentoo %>% mutate(species_pred = ifelse(species_prob > .5, 1, 0))

Performance

+For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR (20)

calcPerf <- function(n, df){
  
ModelName <- n
t <- table(df %>% select(species_update, species_pred))
m_roc <- roc(df$species_update ~ df$species_prob)
AUC <- m_roc$auc

TN <- t[1,1]
FN <- t[2,1]
FP <- t[1,2]
TP <- t[2,2]
N <- TN+FN
P <- TP+FP

TPR <- TP/P
FPR <- FP/N
TNR <- TN/N
FNR <- FN/P

Accuracy <- (TP+TN)/(P+N)
df2 <- data.frame(ModelName, AUC, Accuracy, TPR, FPR, TNR, FNR)

return(df2)

}

df3 <- calcPerf("Adelie v All",penguins_adelie)
df3 <- rbind(df3, calcPerf("Chinstrap v All", penguins_chinstrap))
df3 <- rbind(df3, calcPerf("Gentoo v All", penguins_gentoo))

df3 %>% kable(format = "markdown", digits = 2) %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, font_size = 12, position = "left")
ModelName AUC Accuracy TPR FPR TNR FNR
Adelie v All 0.9935855 0.97 0.96 0.03 0.98 0.03
Chinstrap v All 0.9983362 0.99 0.97 0.01 0.99 0.04
Gentoo v All 0.9966960 0.98 0.95 0.03 0.99 0.02

Multinomial Logistic Regression

penguins$species <- relevel(penguins$species, ref = "Adelie")
#m2 <- multinom(species ~ factor(island) + factor(sex) + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, data = penguins)
m2 <- multinom(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, data = penguins)
## # weights:  18 (10 variable)
## initial  value 375.725403 
## iter  10 value 17.512333
## iter  20 value 3.687379
## iter  30 value 1.557667
## iter  40 value 0.012319
## iter  50 value 0.001433
## iter  60 value 0.001141
## iter  70 value 0.000879
## iter  80 value 0.000840
## iter  90 value 0.000831
## iter 100 value 0.000778
## final  value 0.000778 
## stopped after 100 iterations
summary(m2)
## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm + 
##     flipper_length_mm + body_mass_g, data = penguins)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap  -34.806703       56.16797     -80.32042         -2.546743
## Gentoo      -4.446499       38.97704     -87.02802         -1.281737
##           body_mass_g
## Chinstrap -0.12676681
## Gentoo     0.02163804
## 
## Std. Errors:
##            (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 3.330608e+00   53.187991536  3.848396e+01      11.539675662
## Gentoo    3.951616e-05    0.001697574  7.008909e-04       0.007754504
##           body_mass_g
## Chinstrap   0.2522008
## Gentoo      0.1856241
## 
## Residual Deviance: 0.001556389 
## AIC: 20.00156
tidy(m2)
## # A tibble: 10 x 6
##    y.level   term              estimate  std.error   statistic  p.value
##    <chr>     <chr>                <dbl>      <dbl>       <dbl>    <dbl>
##  1 Chinstrap (Intercept)       -34.8     3.33          -10.5   1.46e-25
##  2 Chinstrap bill_length_mm     56.2    53.2             1.06  2.91e- 1
##  3 Chinstrap bill_depth_mm     -80.3    38.5            -2.09  3.69e- 2
##  4 Chinstrap flipper_length_mm  -2.55   11.5            -0.221 8.25e- 1
##  5 Chinstrap body_mass_g        -0.127   0.252          -0.503 6.15e- 1
##  6 Gentoo    (Intercept)        -4.45    0.0000395 -112524.    0.      
##  7 Gentoo    bill_length_mm     39.0     0.00170     22960.    0.      
##  8 Gentoo    bill_depth_mm     -87.0     0.000701  -124168.    0.      
##  9 Gentoo    flipper_length_mm  -1.28    0.00775      -165.    0.      
## 10 Gentoo    body_mass_g         0.0216  0.186           0.117 9.07e- 1

Looking at the results of the multinomial regression, we see that in this situation we have only one model instead of 3. To make a more simple model the factor predictors (sex and island) were not needed as these resulted in coefficients were not significant and resulted in a larger AIC (approx. 75), the year predictor was also not used. For the remaining coefficients their p-values indicate that all are significant indicating that we reject the null hypothesis suggesting that there is a relationship between the predictor variables and species. The exception here is the Gentoo species for which only the body mass seems to be a significant factor to classify the species.

Fit Statistics

Some ways to interpret the multinomial regression model are:

CrossTable(penguins$species[c(-m2$na.action)], predict(m2))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  342 
## 
##  
##                                    | predict(m2) 
## penguins$species[c(-m2$na.action)] |    Adelie | Chinstrap |    Gentoo | Row Total | 
## -----------------------------------|-----------|-----------|-----------|-----------|
##                             Adelie |       151 |         0 |         0 |       151 | 
##                                    |   106.670 |    30.023 |    54.307 |           | 
##                                    |     1.000 |     0.000 |     0.000 |     0.442 | 
##                                    |     1.000 |     0.000 |     0.000 |           | 
##                                    |     0.442 |     0.000 |     0.000 |           | 
## -----------------------------------|-----------|-----------|-----------|-----------|
##                          Chinstrap |         0 |        68 |         0 |        68 | 
##                                    |    30.023 |   219.520 |    24.456 |           | 
##                                    |     0.000 |     1.000 |     0.000 |     0.199 | 
##                                    |     0.000 |     1.000 |     0.000 |           | 
##                                    |     0.000 |     0.199 |     0.000 |           | 
## -----------------------------------|-----------|-----------|-----------|-----------|
##                             Gentoo |         0 |         0 |       123 |       123 | 
##                                    |    54.307 |    24.456 |   140.237 |           | 
##                                    |     0.000 |     0.000 |     1.000 |     0.360 | 
##                                    |     0.000 |     0.000 |     1.000 |           | 
##                                    |     0.000 |     0.000 |     0.360 |           | 
## -----------------------------------|-----------|-----------|-----------|-----------|
##                       Column Total |       151 |        68 |       123 |       342 | 
##                                    |     0.442 |     0.199 |     0.360 |           | 
## -----------------------------------|-----------|-----------|-----------|-----------|
## 
## 
chisq.test(penguins$species[c(-m2$na.action)], predict(m2))
## 
##  Pearson's Chi-squared test
## 
## data:  penguins$species[c(-m2$na.action)] and predict(m2)
## X-squared = 684, df = 4, p-value < 2.2e-16