Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html
For this assignment, let us use ‘species’ as our outcome or the dependent variable.
library(skimr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(palmerpenguins)
require(foreign)
require(nnet)
require(ggplot2)
require(reshape2)
library(pROC)
library(kableExtra)
library(broom)
library(gmodels)
Looking at the dataset set we see that there are 2 categorical variables and 4 quantitative variables. Year will be omitted for this exercise.
Within the dependent variable species we see that there are 3 species, therefore we cannot conduct binary logistic regression on this dataset without some reclassification of the dependent variable.
skim(penguins)
| Name | penguins |
| Number of rows | 344 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
penguins %>% select(-species, -sex, -island, -year) %>%
gather() %>%
ggplot(aes(x= value)) +
geom_histogram(fill='pink') +
facet_wrap(~key, scales = 'free')
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(alpha = 0.8) +
scale_fill_manual(values = c("darkorange","purple","cyan4"),guide = FALSE) +
facet_wrap(~species, ncol = 1) + coord_flip()
In order to conduct binary logistic regression we will convert the dataset into 3 different datasets where the dependant variable (species) is True (1) if the observation is classified as the specific species or False (0), where it is not classified as the species in question. We will do this for all 3 species resulting in 3 datasets that we can run a binary logistical model on
penguins_adelie <- penguins %>% mutate(species, species_update = ifelse(species == "Adelie", 1, 0))
penguins_chinstrap <- penguins %>% mutate(species, species_update = ifelse(species == "Chinstrap", 1, 0))
penguins_gentoo <- penguins %>% mutate(species, species_update = ifelse(species == "Gentoo", 1, 0))
adelie_m <- glm(species_update ~ body_mass_g + bill_length_mm + flipper_length_mm, data = penguins_adelie, family = "binomial")
summary(adelie_m)
##
## Call:
## glm(formula = species_update ~ body_mass_g + bill_length_mm +
## flipper_length_mm, family = "binomial", data = penguins_adelie)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.66148 -0.06914 -0.00440 0.05122 2.92657
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 84.842092 13.328035 6.366 1.94e-10 ***
## body_mass_g 0.002934 0.001010 2.905 0.00367 **
## bill_length_mm -1.132585 0.182064 -6.221 4.95e-10 ***
## flipper_length_mm -0.242384 0.059448 -4.077 4.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 469.424 on 341 degrees of freedom
## Residual deviance: 65.899 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 73.899
##
## Number of Fisher Scoring iterations: 8
penguins_adelie$species_prob <- predict(adelie_m, penguins_adelie, type = "response")
penguins_adelie <- penguins_adelie %>% mutate(species_pred = ifelse(species_prob > .5, 1, 0))
chinstrap_m <- glm(species_update ~ bill_length_mm + body_mass_g, data = penguins_chinstrap, family = "binomial")
summary(chinstrap_m)
##
## Call:
## glm(formula = species_update ~ bill_length_mm + body_mass_g,
## family = "binomial", data = penguins_chinstrap)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.46107 -0.01209 -0.00110 -0.00003 2.26582
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -44.841232 11.302259 -3.967 7.26e-05 ***
## bill_length_mm 2.268799 0.570229 3.979 6.93e-05 ***
## body_mass_g -0.014942 0.003811 -3.920 8.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 341.163 on 341 degrees of freedom
## Residual deviance: 27.755 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 33.755
##
## Number of Fisher Scoring iterations: 10
penguins_chinstrap$species_prob <- predict(chinstrap_m, penguins_chinstrap, type = "response")
penguins_chinstrap <- penguins_chinstrap %>% mutate(species_pred = ifelse(species_prob > .5, 1, 0))
gentoo_m <- glm(species_update ~ bill_length_mm + flipper_length_mm, data = penguins_gentoo, family = "binomial")
summary(gentoo_m)
##
## Call:
## glm(formula = species_update ~ bill_length_mm + flipper_length_mm,
## family = "binomial", data = penguins_gentoo)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.31900 -0.02202 -0.00242 0.02091 2.59049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -129.5325 25.2128 -5.138 2.78e-07 ***
## bill_length_mm -0.2217 0.1142 -1.942 0.0522 .
## flipper_length_mm 0.6746 0.1313 5.139 2.76e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 446.800 on 341 degrees of freedom
## Residual deviance: 44.757 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 50.757
##
## Number of Fisher Scoring iterations: 9
penguins_gentoo$species_prob <- predict(gentoo_m, penguins_gentoo, type = "response")
penguins_gentoo <- penguins_gentoo %>% mutate(species_pred = ifelse(species_prob > .5, 1, 0))
+For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR (20)
calcPerf <- function(n, df){
ModelName <- n
t <- table(df %>% select(species_update, species_pred))
m_roc <- roc(df$species_update ~ df$species_prob)
AUC <- m_roc$auc
TN <- t[1,1]
FN <- t[2,1]
FP <- t[1,2]
TP <- t[2,2]
N <- TN+FN
P <- TP+FP
TPR <- TP/P
FPR <- FP/N
TNR <- TN/N
FNR <- FN/P
Accuracy <- (TP+TN)/(P+N)
df2 <- data.frame(ModelName, AUC, Accuracy, TPR, FPR, TNR, FNR)
return(df2)
}
df3 <- calcPerf("Adelie v All",penguins_adelie)
df3 <- rbind(df3, calcPerf("Chinstrap v All", penguins_chinstrap))
df3 <- rbind(df3, calcPerf("Gentoo v All", penguins_gentoo))
df3 %>% kable(format = "markdown", digits = 2) %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, font_size = 12, position = "left")
| ModelName | AUC | Accuracy | TPR | FPR | TNR | FNR |
|---|---|---|---|---|---|---|
| Adelie v All | 0.9935855 | 0.97 | 0.96 | 0.03 | 0.98 | 0.03 |
| Chinstrap v All | 0.9983362 | 0.99 | 0.97 | 0.01 | 0.99 | 0.04 |
| Gentoo v All | 0.9966960 | 0.98 | 0.95 | 0.03 | 0.99 | 0.02 |
penguins$species <- relevel(penguins$species, ref = "Adelie")
#m2 <- multinom(species ~ factor(island) + factor(sex) + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, data = penguins)
m2 <- multinom(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, data = penguins)
## # weights: 18 (10 variable)
## initial value 375.725403
## iter 10 value 17.512333
## iter 20 value 3.687379
## iter 30 value 1.557667
## iter 40 value 0.012319
## iter 50 value 0.001433
## iter 60 value 0.001141
## iter 70 value 0.000879
## iter 80 value 0.000840
## iter 90 value 0.000831
## iter 100 value 0.000778
## final value 0.000778
## stopped after 100 iterations
summary(m2)
## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm +
## flipper_length_mm + body_mass_g, data = penguins)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -34.806703 56.16797 -80.32042 -2.546743
## Gentoo -4.446499 38.97704 -87.02802 -1.281737
## body_mass_g
## Chinstrap -0.12676681
## Gentoo 0.02163804
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 3.330608e+00 53.187991536 3.848396e+01 11.539675662
## Gentoo 3.951616e-05 0.001697574 7.008909e-04 0.007754504
## body_mass_g
## Chinstrap 0.2522008
## Gentoo 0.1856241
##
## Residual Deviance: 0.001556389
## AIC: 20.00156
tidy(m2)
## # A tibble: 10 x 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chinstrap (Intercept) -34.8 3.33 -10.5 1.46e-25
## 2 Chinstrap bill_length_mm 56.2 53.2 1.06 2.91e- 1
## 3 Chinstrap bill_depth_mm -80.3 38.5 -2.09 3.69e- 2
## 4 Chinstrap flipper_length_mm -2.55 11.5 -0.221 8.25e- 1
## 5 Chinstrap body_mass_g -0.127 0.252 -0.503 6.15e- 1
## 6 Gentoo (Intercept) -4.45 0.0000395 -112524. 0.
## 7 Gentoo bill_length_mm 39.0 0.00170 22960. 0.
## 8 Gentoo bill_depth_mm -87.0 0.000701 -124168. 0.
## 9 Gentoo flipper_length_mm -1.28 0.00775 -165. 0.
## 10 Gentoo body_mass_g 0.0216 0.186 0.117 9.07e- 1
Looking at the results of the multinomial regression, we see that in this situation we have only one model instead of 3. To make a more simple model the factor predictors (sex and island) were not needed as these resulted in coefficients were not significant and resulted in a larger AIC (approx. 75), the year predictor was also not used. For the remaining coefficients their p-values indicate that all are significant indicating that we reject the null hypothesis suggesting that there is a relationship between the predictor variables and species. The exception here is the Gentoo species for which only the body mass seems to be a significant factor to classify the species.
Some ways to interpret the multinomial regression model are:
CrossTable(penguins$species[c(-m2$na.action)], predict(m2))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 342
##
##
## | predict(m2)
## penguins$species[c(-m2$na.action)] | Adelie | Chinstrap | Gentoo | Row Total |
## -----------------------------------|-----------|-----------|-----------|-----------|
## Adelie | 151 | 0 | 0 | 151 |
## | 106.670 | 30.023 | 54.307 | |
## | 1.000 | 0.000 | 0.000 | 0.442 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.442 | 0.000 | 0.000 | |
## -----------------------------------|-----------|-----------|-----------|-----------|
## Chinstrap | 0 | 68 | 0 | 68 |
## | 30.023 | 219.520 | 24.456 | |
## | 0.000 | 1.000 | 0.000 | 0.199 |
## | 0.000 | 1.000 | 0.000 | |
## | 0.000 | 0.199 | 0.000 | |
## -----------------------------------|-----------|-----------|-----------|-----------|
## Gentoo | 0 | 0 | 123 | 123 |
## | 54.307 | 24.456 | 140.237 | |
## | 0.000 | 0.000 | 1.000 | 0.360 |
## | 0.000 | 0.000 | 1.000 | |
## | 0.000 | 0.000 | 0.360 | |
## -----------------------------------|-----------|-----------|-----------|-----------|
## Column Total | 151 | 68 | 123 | 342 |
## | 0.442 | 0.199 | 0.360 | |
## -----------------------------------|-----------|-----------|-----------|-----------|
##
##
chisq.test(penguins$species[c(-m2$na.action)], predict(m2))
##
## Pearson's Chi-squared test
##
## data: penguins$species[c(-m2$na.action)] and predict(m2)
## X-squared = 684, df = 4, p-value < 2.2e-16