Logisitic and Multinomial Regression

Data Exploration

The penguin dataset is composed of 344 observations with 8 variables, 5 of which are numeric and 3 which are qualitative. The dataset is mostly complete with just a few observations with missing values that will need to be handled.

Data summary
Name data
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

The target variable of interest is the species of penguins, which are categorized into three groups: Adelie, Gentoo and Chinstrap penguins.

## [1] Adelie    Gentoo    Chinstrap
## Levels: Adelie Chinstrap Gentoo

Species Distribution on Islands

From this plot, we can make a few key observations:

  • Gentoo penguins are only found on Biscoe Island
  • Chinstrap pengiuns only found on Dream Island
  • Adelie penguins are found on all three islands
  • Torgersen Island only has Adelie penguins

These island observations are valuable information in differentiating penguin species.

Sex Distribution

However, the sex of the penguins does not offer much information as the proportion is about even across all species. We can also note a few missing observations labeled as NA.

Body Measurements

When looking at body measurements we see that Adelie and Chinstrap penguins largely overlap except for bill_length. This suggests that we might be able to use bill_depth, body_mass and flipper_length to differentiate the Gentoo penguins from the other species. However, the Adelie penguin stands out from the other others in bill_length

Missing Values & Variable Selection

We noted from the data summary above that 11 observations were missing for the sex variable. Given that sex provided no useful information for differentiation, we can safely drop it from our analysis. There is also no reason to believe that the year the observation was taken would have any impact on the morphology of the penguins. Therefore, we also drop year from our predictor variables. There are also two observations which are missing body measurements altogether, so these rows will be dropped altogether.

Logistic Regression

a. The penguin dataset has ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).

b. Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.

c. Provide variable interpretations in your model.

We start by reducing the number of classes to two in order to apply logistic regression with a binary outcome: Gentoo and non-Gentoo penguins.

Logistic 1

The first fitted model using all of the independent variables of interests is not statistically significant. We proceed to remove predictors one at a time until we are left with only significant predictors. As we can see from the exploratory data above, a number of variables allow for perfect separation ofthe classes. This makes logistic regression unstable as shown by the p values close to 1.

## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = species ~ ., family = binomial(link = "logit"), 
##     data = m1_data)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -4.047e-05  -2.100e-08  -2.100e-08   2.100e-08   4.329e-05  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -1.482e+02  3.346e+05   0.000    1.000
## bill_length_mm     7.144e-03  5.058e+03   0.000    1.000
## bill_depth_mm     -1.142e+01  9.097e+03  -0.001    0.999
## flipper_length_mm  1.239e+00  1.424e+03   0.001    0.999
## body_mass_g        1.805e-02  2.219e+01   0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.3415e+02  on 332  degrees of freedom
## Residual deviance: 7.3116e-09  on 328  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25

Logistic 2

There are a number of ways to reduce the model. The coefficients below are still unstable. The order of elimnination of variables may seem arbitrary but the elimination order presented in the two subsequent models ultimately yields the the most significant predictors.

## 
## Call:
## glm(formula = species ~ . - flipper_length_mm, family = binomial(link = "logit"), 
##     data = m1_data)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -9.843e-05  -2.100e-08  -2.100e-08   2.100e-08   8.759e-05  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -3.229e+01  5.579e+05   0.000    1.000
## bill_length_mm  3.122e+00  1.387e+04   0.000    1.000
## bill_depth_mm  -2.033e+01  1.057e+04  -0.002    0.998
## body_mass_g     5.044e-02  3.841e+01   0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.3415e+02  on 332  degrees of freedom
## Residual deviance: 1.8161e-08  on 329  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25

Logistic 3

The resulting model reduced to statistically significant predictors leaves only the bill_length_mm and bill_depth_mm variables. We can interpret the coefficients as follows:

  • A 1mm increase in bill length increases the log odds of the species being Gentoo by about 0.5, which corresponds to an odds ratio of 1.74 (74% greater odds to be Gentoo)
  • A 1mm increase in bill depth decreases the log odds of being a Gentoo penguin by about 4.47, which corresponds to an odds ratio of 0.0114 (98% lesser odds to be Gentoo)
## 
## Call:
## glm(formula = species ~ . - flipper_length_mm - body_mass_g, 
##     family = binomial(link = "logit"), data = m1_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.83346  -0.01519  -0.00119   0.01196   3.03982  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     48.1175    14.8730   3.235  0.00122 ** 
## bill_length_mm   0.5561     0.1344   4.138 3.50e-05 ***
## bill_depth_mm   -4.4750     1.0500  -4.262 2.03e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 434.154  on 332  degrees of freedom
## Residual deviance:  29.822  on 330  degrees of freedom
## AIC: 35.822
## 
## Number of Fisher Scoring iterations: 10
##    (Intercept) bill_length_mm  bill_depth_mm 
##   7.891412e+20   1.743836e+00   1.138983e-02

We can also interpret the marginal effects as follows:

  • A 1% increase in bill length on average increases the probability of a Gentoo penguin by 0.7%
  • A 1% increase in bill depth on average decreases the probability of a Gentoo penguin by 5.6%
##    (Intercept) bill_length_mm  bill_depth_mm 
##    0.604837619    0.006990031   -0.056251274

Logistic Classification Metrics

For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR (20)

From the classification report below we can pick out the following metrics:

  • AUC: 0.998
  • Accuracy: 0.982
  • TPR (Sensitivity): 0.9748
  • FPR (1 - TNR): 0.014
  • TNR (Specificity): 0.9860
  • FNR (1 - TPR): 0.0252
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 211   3
##          1   3 116
##                                           
##                Accuracy : 0.982           
##                  95% CI : (0.9612, 0.9934)
##     No Information Rate : 0.6426          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9608          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9748          
##             Specificity : 0.9860          
##          Pos Pred Value : 0.9748          
##          Neg Pred Value : 0.9860          
##              Prevalence : 0.3574          
##          Detection Rate : 0.3483          
##    Detection Prevalence : 0.3574          
##       Balanced Accuracy : 0.9804          
##                                           
##        'Positive' Class : 1               
## 

Multinomial Logistic Regression

a. Please fit it a multinomial logistic regression where your outcome variable is ‘species’.

b. Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.

c. Please be sure to interpret your variables in the model.

The multinomial logistic regression model uses the same starting data as the logistic regression above. We relevel the Adelie species of penguins as a reference level for multinomial regression. The function multinom_metrics is introduced to summarize the p-values of each model as well as their classification performance.

Multinomial 1

The first multinomial model contains all the predictors that were used in the initial logistic model. The model output differs from the usual glm above so we rely on the p values calculated by the multinom_metrics function to evaluate the statistical significance of the predictors. We note from the confusion matrix below that the model results in perfect classification of penguins species with 100% accuracy. However, we also note in the output below that the body_mass_g predictor has a very high p-value for both Chinstrap and Gentoo penguins. This predictor is dropped in subsequent models.

## Call:
## multinom(formula = species ~ ., data = model2_data)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap   -34.60273       58.94543     -84.81399         -2.643720
## Gentoo       -4.70502       43.75912     -91.60364         -1.639715
##            body_mass_g
## Chinstrap -0.132491128
## Gentoo     0.007448619
## 
## Std. Errors:
##            (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 4.4008386318    73.52088725  51.084116136       15.71457598
## Gentoo    0.0003319148     0.01602115   0.006439479        0.06591798
##           body_mass_g
## Chinstrap   0.3479725
## Gentoo      1.4739115
## 
## Residual Deviance: 0.0009955954 
## AIC: 20.001
## [1] "P-values:"
##            (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 3.774758e-15      0.4226972    0.09685792         0.8663995
## Gentoo    0.000000e+00      0.0000000    0.00000000         0.0000000
##           body_mass_g
## Chinstrap   0.7033875
## Gentoo      0.9959678
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       146         0      0
##   Chinstrap      0        68      0
##   Gentoo         0         0    119
## 
## Overall Statistics
##                                     
##                Accuracy : 1         
##                  95% CI : (0.989, 1)
##     No Information Rate : 0.4384    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           1.0000        1.0000
## Specificity                 1.0000           1.0000        1.0000
## Pos Pred Value              1.0000           1.0000        1.0000
## Neg Pred Value              1.0000           1.0000        1.0000
## Prevalence                  0.4384           0.2042        0.3574
## Detection Rate              0.4384           0.2042        0.3574
## Detection Prevalence        0.4384           0.2042        0.3574
## Balanced Accuracy           1.0000           1.0000        1.0000

Multinomial 2

By removing the body_mass_g predictor from the variables to include in the model, we end up with a model with only statististically significant predictors. The confusion matrix reveals that 3 penguins have been misclassified but the model accuracy remains very high at 99.1%. This is our preferred model.

## Call:
## multinom(formula = species ~ . - body_mass_g, data = model2_data)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap   -7.918499      2.7701294     -4.387811        -0.1732906
## Gentoo      -3.551966      0.5140629    -32.264687         2.5082206
## 
## Std. Errors:
##            (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 16.431601172     0.95068234    1.76771076         0.1055478
## Gentoo     0.001695331     0.07848069    0.02933815         0.3704786
## 
## Residual Deviance: 15.27589 
## AIC: 31.27589
## [1] "P-values:"
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap   0.6298722   3.570210e-03     0.0130574      1.006270e-01
## Gentoo      0.0000000   5.746648e-11     0.0000000      1.285905e-11
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       145         2      0
##   Chinstrap      1        66      0
##   Gentoo         0         0    119
## 
## Overall Statistics
##                                           
##                Accuracy : 0.991           
##                  95% CI : (0.9739, 0.9981)
##     No Information Rate : 0.4384          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9859          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9932           0.9706        1.0000
## Specificity                 0.9893           0.9962        1.0000
## Pos Pred Value              0.9864           0.9851        1.0000
## Neg Pred Value              0.9946           0.9925        1.0000
## Prevalence                  0.4384           0.2042        0.3574
## Detection Rate              0.4354           0.1982        0.3574
## Detection Prevalence        0.4414           0.2012        0.3574
## Balanced Accuracy           0.9912           0.9834        1.0000
##            (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 0.0003639481      15.960700  1.242790e-02         0.8408932
## Gentoo    0.0286682265       1.672071  9.719064e-15        12.2830539

The model logit coefficients are transformed to relative risk ratios (odds) and can be interpretted as follows. In comparison to Adelie penguins and keeping all other variables constant:

  • an increase of 1 unit in bill length makes it more 1.6 times more likely for a penguin to be Gentoo and 15.9 more so to be Chinstrap
  • an increase of 1 unti in bill depth makes it 0.012 time less likely for a penguin to be Chinstrap and even less so Gentoo (nearly 0)
  • an increase of 1 unit in flipper length makes it 0.84 times less likely for a penguin to be Chinstrap but 12.3 more likely to be Gentoo

Multinomial 3

For comparison, we also take a look at another model that uses the same predictors as the final logistic regression model above which is more parsimonious. We see the predictors remain significant but the accuracy has dropped slighly to 96.4%. Because of the reduction in accuracy on the training data, the previous model (Multinomial 2) is preferred.

## Call:
## multinom(formula = species ~ . - flipper_length_mm - body_mass_g, 
##     data = model2_data)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm
## Chinstrap   -25.35966       2.230280     -3.979108
## Gentoo       23.81971       2.716294     -8.310034
## 
## Std. Errors:
##           (Intercept) bill_length_mm bill_depth_mm
## Chinstrap    13.77359      0.6990241      1.497086
## Gentoo       20.24687      0.7167481      1.823403
## 
## Residual Deviance: 47.78413 
## AIC: 59.78413
## [1] "P-values:"
##           (Intercept) bill_length_mm bill_depth_mm
## Chinstrap  0.06559516   0.0014199603  7.862884e-03
## Gentoo     0.23940958   0.0001508011  5.178307e-06
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       143         3      0
##   Chinstrap      3        61      2
##   Gentoo         0         4    117
## 
## Overall Statistics
##                                           
##                Accuracy : 0.964           
##                  95% CI : (0.9379, 0.9812)
##     No Information Rate : 0.4384          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9435          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9795           0.8971        0.9832
## Specificity                 0.9840           0.9811        0.9813
## Pos Pred Value              0.9795           0.9242        0.9669
## Neg Pred Value              0.9840           0.9738        0.9906
## Prevalence                  0.4384           0.2042        0.3574
## Detection Rate              0.4294           0.1832        0.3514
## Detection Prevalence        0.4384           0.1982        0.3634
## Balanced Accuracy           0.9817           0.9391        0.9823

Multinomial Fit

Extra credit: what would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide. (10)

There are a number of methods that can be used to evaluate the fit of the model:

  • Using the deviance to compare two nested models with a chi-squared test
  • Hosmer-Lemeshow test
  • AIC
  • McFadden’s Pseudo R-squared
  • Fitting separate logit model for diagnosis