Objective

Let’s use the Palmer Penguins dataset for our assignment. To learn more about the dataset, please visit https://allisonhorst.github.io/palmerpenguins/articles/intro.html. For this assignment, we will use ‘species’ as our outcome, or dependent, variable.

1. Logistic Regression with a binary outcome.

  1. The penguin dataset has a ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with a binary outcome. Please explain the reasoning behind your decision as you manipulate the outcome/dependent variable (species).

In the penguin dataset’s species column there are three categories: Adelie, Chinstrap, and Gentoo. Because Adelie and Chinstrap share many similarities, we combine these two species into a single class and model it against Gentoo, giving the binary outcome needed for logistic regression (the recoding is sketched in the code below). The similarities include the penguins’ overall size and the islands they inhabit, whereas Gentoo penguins are larger and, in this dataset, are found only on Biscoe Island.

  2. Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.

  3. Provide variable interpretations in your model.

Year, sex, and island are not as important when comparing the three penguin species to one another. However, the measurements bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g do play an important role in distinguishing the three species, so these four variables are used as the predictors in the model.
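
A minimal sketch of the recoding and the model fit, assuming the data come from the palmerpenguins package; the data frame name new_df and the recoded column name Species are taken from the model output below, while the object name model_bin is illustrative:

``` r
library(palmerpenguins)
library(dplyr)

# Recode the three-level species factor into a binary outcome:
# Gentoo = 1, Adelie and Chinstrap combined = 0.
# (Handling of the few rows with missing measurements, e.g. removal or
# imputation, is omitted here.)
new_df <- penguins %>%
  mutate(Species = ifelse(species == "Gentoo", 1, 0)) %>%
  select(Species, bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g)

# Binary logistic regression on the four body-measurement predictors
model_bin <- glm(Species ~ bill_length_mm + bill_depth_mm +
                   flipper_length_mm + body_mass_g,
                 family = binomial(link = "logit"), data = new_df)
summary(model_bin)
```

The glm warnings in the output below (fitted probabilities numerically 0 or 1) indicate that these four measurements separate Gentoo from the other two species almost perfectly.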

## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge~           39.1          18.7              181        3750 male 
## 2 Adelie  Torge~           39.5          17.4              186        3800 fema~
## 3 Adelie  Torge~           40.3          18                195        3250 fema~
## 4 Adelie  Torge~           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge~           36.7          19.3              193        3450 fema~
## 6 Adelie  Torge~           39.3          20.6              190        3650 male 
## # ... with 1 more variable: year <int>
## $bill_length_mm
## [1] 2
## 
## $bill_depth_mm
## [1] 2
## 
## $flipper_length_mm
## [1] 2
## 
## $body_mass_g
## [1] 2
## 
## $Species
## [1] 0
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = Species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g, family = binomial(link = "logit"), data = new_df)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.177   0.000   0.000   0.000   1.177  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        6.876e+02  1.096e+05   0.006    0.995
## bill_length_mm    -8.303e-01  6.214e+02  -0.001    0.999
## bill_depth_mm      7.425e+01  8.173e+03   0.009    0.993
## flipper_length_mm -7.354e+00  8.435e+02  -0.009    0.993
## body_mass_g       -1.064e-01  1.188e+01  -0.009    0.993
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 449.7355  on 343  degrees of freedom
## Residual deviance:   2.7726  on 339  degrees of freedom
## AIC: 12.773
## 
## Number of Fisher Scoring iterations: 25

2. AUC, Accuracy, TPR, FPR, TNR, FNR

Using the values obtained from the confusion matrix, we compute Accuracy, TPR, FPR, TNR, and FNR; the AUC is obtained from the ROC curve (via pROC).
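
A sketch of how these metrics can be computed, assuming a train/test split was made (e.g. with caTools::sample.split) and the held-out observations are in test_df; the object and column names are illustrative:

``` r
library(caret)
library(pROC)

# Predicted probabilities and 0/1 class predictions at a 0.5 cutoff
prob <- predict(model_bin, newdata = test_df, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# Confusion matrix (note: caret treats the first factor level, 0, as the
# positive class by default, as in the output below)
cm <- confusionMatrix(factor(pred, levels = c(0, 1)),
                      factor(test_df$Species, levels = c(0, 1)))
print(cm)

# Metric definitions, here treating Species = 1 (Gentoo) as the positive class
tn <- cm$table[1, 1]; fn <- cm$table[1, 2]
fp <- cm$table[2, 1]; tp <- cm$table[2, 2]
accuracy <- (tp + tn) / (tp + tn + fp + fn)
tpr <- tp / (tp + fn)   # true positive rate (sensitivity)
fpr <- fp / (fp + tn)   # false positive rate
tnr <- tn / (tn + fp)   # true negative rate (specificity)
fnr <- fn / (fn + tp)   # false negative rate

# AUC from the ROC curve
auc(roc(test_df$Species, prob))
```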

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 37  0
##          1  0 66
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9648, 1)
##     No Information Rate : 0.6408     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.3592     
##          Detection Rate : 0.3592     
##    Detection Prevalence : 0.3592     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

The confusion matrix shows an accuracy of 1.00 on the held-out test data (95% CI: 0.9648–1), and the ROC analysis gives an AUC of 1.0. An AUC above 0.70 indicates a good model; here the two classes are separated almost perfectly, which is consistent with the earlier glm warnings about fitted probabilities of 0 or 1.

## [1] 1
## [1] 0.9736842
## [1] 0.02631579
## [1] 0

3. Multinomial Logistic Regression
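
The output below comes from nnet::multinom. A minimal sketch of the call as it appears in that output, with new_df the recoded data frame from question 1 (the formula Species ~ . uses every remaining column of new_df as a predictor):

``` r
library(nnet)

# Multinomial logistic regression. Because Species is the recoded binary
# outcome here, this reduces to a two-class fit; using the original
# three-level `species` factor instead would give a true three-class model.
model_multi <- multinom(Species ~ ., data = new_df)
summary(model_multi)
```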

## # weights:  7 (6 variable)
## initial  value 238.442630 
## iter  10 value 1.600055
## iter  20 value 1.397373
## final  value 1.387525 
## converged
## Call:
## multinom(formula = Species ~ ., data = new_df)
## 
## Coefficients:
##                          Values   Std. Err.
## (Intercept)        -0.770769168 0.005889471
## bill_length_mm      0.060549867 0.234842646
## bill_depth_mm      -3.822613355 0.078788290
## flipper_length_mm   0.293618899 1.069965860
## body_mass_g         0.003086539 0.053954582
## prob              -16.579969585 0.019606818
## 
## Residual Deviance: 2.775049 
## AIC: 14.77505

Above is a summary of our multinomial logistic regression.

##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##      -0.770769168       0.060549867      -3.822613355       0.293618899 
##       body_mass_g              prob 
##       0.003086539     -16.579969585
##   [,1]
## 1  0.0
## 2  0.0
## 3  0.0
## 4  0.5
## 5  0.0
## 6  0.0

Above are the coefficients of the model.

##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##       0.005889471       0.234842646       0.078788290       1.069965860 
##       body_mass_g              prob 
##       0.053954582       0.019606818
##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##     -130.87239623        0.25783165      -48.51753194        0.27441894 
##       body_mass_g              prob 
##        0.05720624     -845.62265095

Using the coefficients and standard errors, we can calculate the z-values shown above.

##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##         0.0000000         0.7965368         0.0000000         0.7837627 
##       body_mass_g              prob 
##         0.9543809         0.0000000

Above are the corresponding p-values.
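
As a check, the z-values and two-sided p-values above can be reproduced from the coefficient estimates and standard errors of the multinom fit; a sketch, assuming the fitted object is stored in model_multi:

``` r
# Wald z-statistics: coefficient estimates divided by their standard errors
s <- summary(model_multi)
z <- s$coefficients / s$standard.errors

# Two-sided p-values from the standard normal distribution
p <- 2 * (1 - pnorm(abs(z)))
p
```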

4. Extra Credit

What would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide.

One alternative is to run multiple logistic regression analyses, one for each pair of outcomes. One problem with this approach is that each analysis is potentially run on a different sample; another is that, without constraining the logistic models, the predicted probabilities of the outcome categories can sum to more than 1. A second alternative is to collapse the number of categories to two and then run a single logistic regression; this approach suffers from a loss of information and changes the original research question into a very different one.

For the model in question 3, we should evaluate the likelihood-ratio chi-square test (comparing the fitted model against an intercept-only model) and a pseudo-R-squared such as McFadden’s, since the ordinary R-squared does not apply to multinomial logistic regression.
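
A sketch of how both could be computed for the multinom fit, assuming the fitted model is model_multi and an intercept-only model is refit on the same data (object names are illustrative):

``` r
# Likelihood-ratio chi-square test against the intercept-only (null) model
null_model <- multinom(Species ~ 1, data = new_df)
lr_stat <- deviance(null_model) - deviance(model_multi)
df_diff <- model_multi$edf - null_model$edf
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null)
mcfadden_r2 <- 1 - deviance(model_multi) / deviance(null_model)

c(chi_square = lr_stat, df = df_diff,
  p_value = p_value, mcfadden_r2 = mcfadden_r2)
```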