Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html
For this assignment, let us use ‘species’ as our outcome or the dependent variable.
\(1.\) Logistic Regression with a binary outcome. (40)
\(a.\) The penguin dataset has a ‘species’ column. Please check how many categories are in the species column, and conduct whatever data manipulation you need in order to build a logistic regression with a binary outcome. Please explain the reasoning behind your decisions as you manipulate the outcome/dependent variable (species).
\(b.\) Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
\(c.\) Provide variable interpretations in your model.
\(2.\) For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR (20)
\(3.\) Multinomial Logistic Regression. (40)
\(a.\) Please fit a multinomial logistic regression where your outcome variable is ‘species’.
\(b.\) Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.
\(c.\) Please be sure to interpret your variables in the model.
\(4.\) Extra credit: what would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide. (10)
library(palmerpenguins)
library(tibble)
library(DT)
library(dplyr)
library(explore)
library(ggplot2)
library(knitr)
library(caTools)
library(stats)
library(pROC)
library(nnet)

Solution
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex <fct> male, female, female, NA, female, male, female, mal…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
The penguins dataset comes with the palmerpenguins package. It has 344 observations and 8 variables, recording measurements of penguins living on different islands in different years. The image below illustrates flipper_length_mm, bill_length_mm, and bill_depth_mm.
knitr::include_graphics("https://raw.githubusercontent.com/SubhalaxmiRout002/DATA-622/main/Assignment%201/Screen%20Shot%202021-02-19%20at%205.54.09%20AM.png")

## [1] 344 8
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
The data contain 19 NAs in total. sex has the most (11); bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g contain the remaining 8 (2 each).
## # A tibble: 344 x 8
## # Groups: island, species [9]
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torge… 39.1 18.7 181 3750
## 2 Adelie Torge… 39.5 17.4 186 3800
## 3 Adelie Torge… 40.3 18 195 3250
## 4 Adelie Torge… NA NA NA NA
## 5 Adelie Torge… 36.7 19.3 193 3450
## 6 Adelie Torge… 39.3 20.6 190 3650
## 7 Adelie Torge… 38.9 17.8 181 3625
## 8 Adelie Torge… 39.2 19.6 195 4675
## 9 Adelie Torge… 34.1 18.1 193 3475
## 10 Adelie Torge… 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
ggplot(penguins) + aes(x = island, fill = species) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("indianred2", "forestgreen", "dodgerblue"),
                    guide = "none") +
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()

Chinstrap and Gentoo are each found on a single island (Dream and Biscoe, respectively), while Adelie appears on all three.
The histograms below show the distribution of bill_length_mm for each penguin species and reveal some interesting patterns.
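These plots and the models that follow use a data frame named penguine_data, whose creating chunk is not visible in this rendering. Presumably it is simply the penguins tibble — a minimal sketch, assuming that:

# Assumption: penguine_data is just the raw penguins tibble from palmerpenguins
penguine_data <- penguins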
hist(penguine_data$bill_length_mm[penguine_data$species == 'Gentoo'], col = 'forestgreen'
     , xlab = "bill_length_mm"
     , main = "Gentoos' distribution of bill_length_mm")

hist(penguine_data$bill_length_mm[penguine_data$species == 'Chinstrap'], col = 'indianred2'
     , xlab = "bill_length_mm"
     , main = "Chinstraps' histogram of bill_length_mm")

hist(penguine_data$bill_length_mm[penguine_data$species == 'Adelie'], col = 'dodgerblue'
     , xlab = "bill_length_mm"
     , main = "Adelies' histogram of bill_length_mm")

The histogram for Adelie looks roughly normally distributed, but the other two species are skewed.
A decision tree could also be drawn to see which measurements best split the species (the tree plot and its description are not reproduced in this rendering).
Now let’s take a closer look at these variables:
# Flipper length vs. bill length
flipper_bill <- ggplot(data = penguine_data,
                       aes(x = flipper_length_mm,
                           y = bill_length_mm)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 3,
             alpha = 0.8) +
  theme_minimal() +
  scale_color_manual(values = c("indianred2", "forestgreen", "dodgerblue")) +
  labs(title = "Flipper and bill length",
       x = "Flipper length (mm)",
       y = "Bill length (mm)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme(legend.position = c(0.85, 0.15),
        legend.background = element_rect(fill = "white", color = NA),
        plot.title = element_text(hjust = 0.5, size = 15, colour = "black"),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face = "italic"),
        plot.caption.position = "plot")

flipper_bill

The plot above does not show a completely clean separation of the three species. The earlier plots suggested that body_mass_g and flipper_length_mm distinguish Adelie and Chinstrap from Gentoo, so let's have a look at those two variables.
ggplot(data = penguine_data, aes(x = body_mass_g, y = flipper_length_mm, col = species)) +
  geom_point() +
  theme_bw() +
  labs(title = "Flipper Length and Body Mass of the Palmer Penguins") +
  theme(legend.position = c(0.85, 0.15),
        legend.background = element_rect(fill = "white", color = NA),
        plot.title = element_text(hjust = 0.5, size = 15, colour = "black"),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face = "italic"),
        plot.caption.position = "plot")

This plot clearly shows a positive relationship between flipper_length_mm and body_mass_g. Moreover, Adelie and Chinstrap cluster together, while Gentoo sits apart from that cluster.
Binary logistic regression requires the dependent variable to be binary.
# remove NAs
penguine_data_2 <- penguine_data[complete.cases(penguine_data), ]

# size of data
dim(penguine_data_2)

## [1] 333 8
Species, our dependent variable, has 3 categories: Adelie, Chinstrap, and Gentoo. Based on body_mass_g and flipper_length_mm, Adelie and Chinstrap fall in one cluster and Gentoo in another, so I combine Adelie and Chinstrap into a single class.
# convert to dummy
penguine_data_2$ade_chin <- ifelse(penguine_data_2$species == 'Adelie' | penguine_data_2$species == 'Chinstrap', 1, 0)
# Drop Specises column from data
penguine_data_2 <- penguine_data_2 %>% select(-c(species) )
# View data
DT::datatable(head(penguine_data_2, 10))
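The class counts shown below presumably come from tabulating the new indicator — a sketch:

# Count the two classes of the ade_chin indicator
table(penguine_data_2$ade_chin)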
## 0 1
## 119 214
The new class ade_chin (the combination of Adelie and Chinstrap) has 214 rows and Gentoo has 119 rows, so the data are imbalanced. In this case I will use under-sampling.
Under-sampling balances the dataset by reducing the size of the abundant class. This method is used when the quantity of data is sufficient: by keeping all samples from the rare class and randomly selecting an equal number from the abundant class, a balanced dataset can be obtained for modelling.
Take a random sample of 119 rows from the ade_chin class to restore balance, so the resulting dataset has 119 rows from each class, ade_chin and Gentoo.
Split the data into two sets, train and test, assigning 80% of the data to train and 20% to test.
set.seed(1)
smpl1 <- penguine_data_2 %>% filter(ade_chin == 1) %>% sample_n(size = 119)
smpl2 <- penguine_data_2 %>% filter(ade_chin == 0)
smpl_119 <- rbind(smpl1, smpl2)

# Data split
sample = sample.split(smpl_119$ade_chin, SplitRatio = 0.80)
penguine_train = subset(smpl_119, sample == TRUE)
penguine_test = subset(smpl_119, sample == FALSE)
dim(penguine_train)

## [1] 190 8

dim(penguine_test)

## [1] 48 8
For the first binomial model, I choose flipper_length_mm, body_mass_g, bill_length_mm, and bill_depth_mm. I do not select the other features (island, year, and sex) because I did not find them to have a relevant impact on the prediction.
bi_model1 <- glm(ade_chin ~ flipper_length_mm + body_mass_g + bill_length_mm + bill_depth_mm
, data = penguine_train, family=binomial(link="logit"))
summary(bi_model1)

##
## Call:
## glm(formula = ade_chin ~ flipper_length_mm + body_mass_g + bill_length_mm +
## bill_depth_mm, family = binomial(link = "logit"), data = penguine_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.223e-05 -2.100e-08 0.000e+00 2.100e-08 3.015e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.590e+02 8.088e+05 0.000 1
## flipper_length_mm -1.374e+00 2.518e+03 -0.001 1
## body_mass_g -1.578e-02 4.498e+01 0.000 1
## bill_length_mm 1.389e-01 9.538e+03 0.000 1
## bill_depth_mm 1.140e+01 1.893e+04 0.001 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.6340e+02 on 189 degrees of freedom
## Residual deviance: 3.6989e-09 on 185 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
The null deviance is 263.40, the residual deviance is essentially zero (3.6989e-09), and the AIC is 10. A residual deviance this close to zero, together with the enormous standard errors and p-values of 1, indicates the classes are perfectly separable on these predictors, so the coefficient estimates should be read with caution.
Equation:
log(p / (1 - p)) = 159.0 - 1.374 * flipper_length_mm - 0.01578 * body_mass_g + 0.1389 * bill_length_mm + 11.40 * bill_depth_mm
Residual deviance is calculated from the model containing all the features. By analogy with linear regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS): the larger the difference between null and residual deviance, the better the model.
Among candidate models, the one with the lowest AIC is relatively better. In practice, AIC is generally preferred over raw deviance for evaluating model fit because it penalizes the number of parameters.
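As an illustration of the deviance analogy (my addition, not part of the original solution), McFadden's pseudo-R² can be computed directly from the two deviances stored on the glm object:

# McFadden's pseudo-R^2 = 1 - residual deviance / null deviance
1 - bi_model1$deviance / bi_model1$null.deviance
# ~1 here, again reflecting the perfect separation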
Remove bill_length_mm because, as the histograms showed, it is not normally distributed for two of the species.
bi_model2 <- glm(ade_chin ~ flipper_length_mm + body_mass_g + bill_depth_mm
, data = penguine_train, family=binomial(link="logit"))
summary(bi_model2)

##
## Call:
## glm(formula = ade_chin ~ flipper_length_mm + body_mass_g + bill_depth_mm,
## family = binomial(link = "logit"), data = penguine_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.100e-05 -2.110e-08 0.000e+00 2.110e-08 2.678e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.688e+02 4.379e+05 0.000 1.000
## flipper_length_mm -1.374e+00 2.537e+03 -0.001 1.000
## body_mass_g -1.581e-02 4.347e+01 0.000 1.000
## bill_depth_mm 1.119e+01 1.081e+04 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.6340e+02 on 189 degrees of freedom
## Residual deviance: 3.6928e-09 on 186 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 25
The null deviance is 263.40, the residual deviance is 3.6928e-09, and the AIC is 8.
Equation:
log(p / (1 - p)) = 168.8 - 1.374 * flipper_length_mm - 0.01581 * body_mass_g + 11.19 * bill_depth_mm
Remove body_mass_g because it is positively correlated with flipper_length_mm.
bi_model3 <- glm(ade_chin ~ flipper_length_mm + bill_depth_mm
, data = penguine_train, family=binomial(link="logit"))
summary(bi_model3)

##
## Call:
## glm(formula = ade_chin ~ flipper_length_mm + bill_depth_mm, family = binomial(link = "logit"),
## data = penguine_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.615e-05 -2.100e-08 0.000e+00 2.100e-08 4.642e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 280.510 325904.300 0.001 0.999
## flipper_length_mm -2.434 1657.410 -0.001 0.999
## bill_depth_mm 13.360 10101.308 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.6340e+02 on 189 degrees of freedom
## Residual deviance: 4.8487e-09 on 187 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
The null deviance is 263.40, the residual deviance is 4.8487e-09, and the AIC is 6.
Equation:
log(p / (1 - p)) = 280.510 - 2.434 * flipper_length_mm + 13.360 * bill_depth_mm
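For interpretation, the coefficients can be exponentiated into odds ratios — a sketch I am adding here; given the perfect separation, the estimates are numerically unstable and the exact values should not be over-interpreted:

# Odds ratios: each extra mm of bill depth multiplies the odds of ade_chin by exp(13.36),
# while each extra mm of flipper length multiplies them by exp(-2.434)
exp(coef(bi_model3))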
Of these three models, I select model 3 because it has the lowest AIC.
Test the model using penguine_test. The confusion matrix contains True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN); it avoids "confusion" by laying out actual versus predicted values in a tabular format.
The area under the curve (AUC), also referred to as the index of accuracy (A), summarizes the performance of the ROC curve: the higher the area, the better the model. The ROC curve plots the True Positive Rate (y axis) against the False Positive Rate (x axis).
Accuracy is the overall predictive accuracy of the model: Accuracy = (TP + TN) / (TP + TN + FP + FN).
## [1] 1
The True Positive Rate indicates how many of the actual positives were correctly predicted: TPR = TP / (TP + FN). Also, TPR = 1 - FNR. It is also known as Sensitivity or Recall.
## [1] 1
The False Positive Rate indicates how many of the actual negatives were incorrectly predicted as positive: FPR = FP / (FP + TN). Also, FPR = 1 - TNR.
## [1] 0
The True Negative Rate indicates how many of the actual negatives were correctly predicted: TNR = TN / (TN + FP). It is also known as Specificity.
## [1] 1
The False Negative Rate indicates how many of the actual positives were incorrectly predicted as negative: FNR = FN / (FN + TP).
## [1] 0
Other metrics for model 3:
## [1] 0
## [1] 1
## [1] 1
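The chunk that produced these numbers is not visible in this rendering; here is a minimal sketch of how they could be computed (pred_prob, pred_class, and the 0.5 cutoff are my assumptions):

pred_prob  <- predict(bi_model3, newdata = penguine_test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# confusion matrix
cm <- table(Predicted = pred_class, Actual = penguine_test$ade_chin)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

(TP + TN) / sum(cm)   # Accuracy
TP / (TP + FN)        # TPR (Sensitivity / Recall)
FP / (FP + TN)        # FPR
TN / (TN + FP)        # TNR (Specificity)
FN / (FN + TP)        # FNR

# AUC via pROC
roc_obj <- pROC::roc(penguine_test$ade_chin, pred_prob)
pROC::auc(roc_obj)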
Multinomial regression is an extension of binomial logistic regression. It allows us to predict a categorical dependent variable that has more than two levels. As with any other regression model, the multinomial outcome can be predicted from one or more independent variables, which may be nominal, ordinal, or continuous.
Assign a number to each species:
penguine_data_3 <- penguine_data[complete.cases(penguine_data), ]
#summary(penguine_data_3)
penguine_data_3$species <- case_when(
  penguine_data_3$species == "Adelie" ~ 1,
  penguine_data_3$species == "Chinstrap" ~ 2,
  penguine_data_3$species == "Gentoo" ~ 3
)
table(penguine_data_3$species)

##
## 1 2 3
## 146 68 119
Applying dummy_cols() from the fastDummies package automates the dummy-variable process and is useful when there are many columns to generate dummies from, or many categories within a column. This way we get all-numeric data for the model. Split the data into two sets, train and test, in a 70%/30% ratio.
set.seed(22)
ind <- sample(2, nrow(penguine_data_3), replace = TRUE,
              prob = c(0.7, 0.3))
train <- penguine_data_3[ind == 1,]
test <- penguine_data_3[ind == 2,]
dim(train)

## [1] 245 8

dim(test)
## [1] 88 8
train <- fastDummies::dummy_cols(train) %>% select (-c(island, sex, year))
test <- fastDummies::dummy_cols(test) %>% select (-c(island, sex, year))
train$species <- as.factor(train$species)
test$species <- as.factor(test$species)
train$species <- relevel(train$species, ref = "1")
test$species <- relevel(test$species, ref = "1")

Model with all variables
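The fitting chunk itself is not shown in this rendering, but from the Call: line in the output it was presumably:

# Presumed from the Call: line below
mul_model1 <- multinom(species ~ ., data = train)
summary(mul_model1)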
## # weights: 33 (20 variable)
## initial value 269.160011
## iter 10 value 6.490882
## iter 20 value 1.000447
## iter 30 value 0.007333
## final value 0.000004
## converged
## Call:
## multinom(formula = species ~ ., data = train)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 2 -90.37271 11.193024 -9.19526 -1.300113 0.01892986
## 3 -14.66748 9.790331 -17.77214 -1.672496 0.05237551
## island_Biscoe island_Dream island_Torgersen sex_female sex_male
## 2 -52.31516 4.2585232 -42.31608 -18.490427 -71.88229
## 3 16.51737 -0.2050357 -30.97981 8.801406 -23.46888
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 2 19.715172 792.7148 5.39504 212.94684 4.5525507
## 3 3.087224 164.5498 127.63921 23.94005 0.9888599
## island_Biscoe island_Dream island_Torgersen sex_female sex_male
## 2 1.855764e-12 1.970569e+01 9.486383e-03 1.628462e+02 143.132554
## 3 3.087221e+00 9.552211e-05 2.621256e-13 7.019081e-12 3.087224
##
## Residual Deviance: 8.079054e-06
## AIC: 32.00001
From multinomial model 1 (mul_model1), we get two equations, with species 1 (Adelie) as the reference class:
Eqn 1:
ln[P(species = 2) / P(species = 1)] = -90.37271 + 11.193024 * bill_length_mm - 9.19526 * bill_depth_mm - 1.300113 * flipper_length_mm + 0.01892986 * body_mass_g - 52.31516 * island_Biscoe + 4.2585232 * island_Dream - 42.31608 * island_Torgersen - 18.490427 * sex_female - 71.88229 * sex_male
Eqn 2:
ln[P(species = 3) / P(species = 1)] = -14.66748 + 9.790331 * bill_length_mm - 17.77214 * bill_depth_mm - 1.672496 * flipper_length_mm + 0.05237551 * body_mass_g + 16.51737 * island_Biscoe - 0.2050357 * island_Dream - 30.97981 * island_Torgersen + 8.801406 * sex_female - 23.46888 * sex_male
To assess the statistical significance of each variable, perform a two-tailed z-test. Below we calculate the p-values; if a p-value is > 0.05, that variable is not statistically significant.
z <- summary(mul_model1)$coefficients / summary(mul_model1)$standard.errors
# Note the parentheses: the original 1 - pnorm(abs(z), 0, 1) * 2 evaluates the
# multiplication first and returns p - 1, which is why the values printed below are negative.
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p

## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 2 -0.9999954 -0.01126565 -0.9116920 -0.004871327 -0.003317657
## 3 -0.9999980 -0.04744429 -0.1107374 -0.055696368 -0.042240642
## island_Biscoe island_Dream island_Torgersen sex_female sex_male
## 2 -1.0000000 -0.1710951 -1 -0.09040179 -0.3844786
## 3 -0.9999999 -1.0000000 -1 -1.00000000 -1.0000000
We can see that bill_depth_mm, island_Biscoe, island_Dream, island_Torgersen, sex_female, and sex_male have high p-values.
In the next model we will exclude all of these variables.
Model without bill_depth_mm, island_Biscoe, island_Dream, island_Torgersen, sex_female, and sex_male
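Again the fitting chunk is not shown; from the Call: line below, it was presumably:

# Presumed from the Call: line below
mul_model2 <- multinom(species ~ bill_length_mm + flipper_length_mm + body_mass_g, data = train)
summary(mul_model2)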
## # weights: 15 (8 variable)
## initial value 269.160011
## iter 10 value 58.353361
## iter 20 value 17.264952
## iter 30 value 11.867321
## iter 40 value 6.840817
## iter 50 value 6.077581
## iter 60 value 6.052656
## iter 70 value 5.461953
## iter 80 value 5.375168
## iter 90 value 5.317128
## iter 100 value 5.309806
## final value 5.309806
## stopped after 100 iterations
## Call:
## multinom(formula = species ~ bill_length_mm + flipper_length_mm +
## body_mass_g, data = train)
##
## Coefficients:
## (Intercept) bill_length_mm flipper_length_mm body_mass_g
## 2 -174.1776 15.0936970 -1.0568874 -0.075407128
## 3 -203.1864 0.7861321 0.6996343 0.005863934
##
## Std. Errors:
## (Intercept) bill_length_mm flipper_length_mm body_mass_g
## 2 0.0008027276 0.01267965 0.08366188 0.004353293
## 3 0.0010528341 0.56610128 0.11811937 0.003882355
##
## Residual Deviance: 10.61961
## AIC: 26.61961
From multinomial model 2 (mul_model2), we can write the following equations:
ln[P(species = 2) / P(species = 1)] = -174.1776 + 15.0936970 * bill_length_mm - 1.0568874 * flipper_length_mm - 0.075407128 * body_mass_g
ln[P(species = 3) / P(species = 1)] = -203.1864 + 0.7861321 * bill_length_mm + 0.6996343 * flipper_length_mm + 0.005863934 * body_mass_g
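To interpret the coefficients, exponentiating them gives relative risk ratios versus the Adelie reference class (my addition; note from the output above that mul_model2 stopped after 100 iterations without fully converging, so these estimates are approximate):

# Relative risk ratios vs. the reference species (Adelie)
exp(coef(mul_model2))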
Of the two multinomial models, mul_model2 has the lower AIC (26.62 vs. 32.00), so I will go with model 2. Apply predictions to both the training and test datasets using model 2.
We will look at the confusion matrix and calculate accuracy for both the train and test sets.
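The prediction code is not shown; here is a minimal sketch consistent with the object names in the output below (pred_mul, pred_mul_test):

pred_mul <- predict(mul_model2, newdata = train)
table(pred_mul, train$species)
mean(pred_mul == train$species)        # training accuracy

pred_mul_test <- predict(mul_model2, newdata = test)
table(pred_mul_test, test$species)
mean(pred_mul_test == test$species)    # test accuracy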
##
## pred_mul 1 2 3
## 1 103 0 1
## 2 0 49 0
## 3 1 0 91
## [1] 0.9918367
On the training data the confusion matrix shows 99.18% accuracy. Let's check the test data.
##
## pred_mul_test 1 2 3
## 1 39 0 0
## 2 3 19 0
## 3 0 0 27
## [1] 0.9659091
The test data also shows good performance, at 96.59% accuracy.
Understanding the model's performance is as important as building it. Let's see how our model is doing.
Suppose we applied no ML algorithm at all and simply predicted a single class every time: the training data is 42.45% Adelie, 20.00% Chinstrap, and 37.55% Gentoo, so those proportions are the best accuracies a constant prediction could achieve.
##
## 1 2 3
## 0.4244898 0.2000000 0.3755102
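These proportions, and the row-normalized matrices below, presumably come from prop.table() — a sketch:

prop.table(table(train$species))                        # class proportions
prop.table(table(pred_mul, train$species), margin = 1)  # row-wise proportions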
Let's see our training model's performance:
##
## pred_mul 1 2 3
## 1 0.990384615 0.000000000 0.009615385
## 2 0.000000000 1.000000000 0.000000000
## 3 0.010869565 0.000000000 0.989130435
The matrix above shows the proportion of each species predicted correctly:
Adelie = 99.0%, Chinstrap = 100%, Gentoo = 98.9%
That is quite good. Let's see the test set performance.
##
## pred_mul_test 1 2 3
## 1 0.9285714 0.0000000 0.0000000
## 2 0.1578947 1.0000000 0.0000000
## 3 0.0000000 0.0000000 1.0000000
The matrix above shows the proportion of each species predicted correctly:
Adelie = 92.8%, Chinstrap = 100%, Gentoo = 100%
In the training set the Adelie rate was 99% versus 92.8% in the test set, but the Gentoo rate improved relative to the training set.
I personally enjoyed this assignment and learned many new things!
There is still room for improvement in the model, which I hope to revisit when time allows.