library(DataExplorer)
library(palmerpenguins)
library(psych)
library(GGally)
library(tidyverse)
library(ggplot2)
library(reshape)
library(kableExtra)
library(MASS)
library(caret)
library(pROC)
library(nnet)
For this assignment we are going to use the palmerpenguins package. More information on that package can be found here:
https://allisonhorst.github.io/palmerpenguins/articles/intro.html
Let's take a look at the palmerpenguins data.
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A...
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge...
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34....
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18....
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, ...
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347...
## $ sex <fct> male, female, female, NA, female, male, female, m...
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2...
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
Species and Island
| species | island | n |
|---|---|---|
| Adelie | Biscoe | 44 |
| Adelie | Dream | 56 |
| Adelie | Torgersen | 52 |
| Chinstrap | Dream | 68 |
| Gentoo | Biscoe | 124 |
Here you can see that there are three species: Adelie, Chinstrap, and Gentoo. There are three islands: Biscoe, Dream, and Torgersen. Adelie has the most penguins, with a total of 152 spread across all three islands. Chinstrap has the fewest, with only 68, all on Dream. Gentoo has a total of 124, all on Biscoe.
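The per-species totals can be checked by summing the rows of the table. The sketch below re-enters the counts by hand in base R instead of querying the package; the name `island_counts` is made up for this illustration.

```r
# Counts copied by hand from the species/island table above
island_counts <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo"),
  island  = c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe"),
  n       = c(44, 56, 52, 68, 124)
)

# Total penguins per species, summed over islands
species_totals <- tapply(island_counts$n, island_counts$species, sum)

# Number of distinct islands on which each species appears
islands_per_species <- tapply(island_counts$island, island_counts$species,
                              function(x) length(unique(x)))

print(species_totals)
print(islands_per_species)
```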
Missing Data
The data contain missing values (NA), as the preview above shows, and the affected rows need to be removed before modeling.
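The removal code is not shown in this report, so the following is only a sketch of one common way to drop incomplete rows in base R. The data frame `toy` is a hypothetical stand-in built from the first five rows of the preview above; it is not the `penguins_tf` object used later.

```r
# Toy data: first five rows of four columns, copied from the preview above
toy <- data.frame(
  bill_length_mm    = c(39.1, 39.5, 40.3, NA, 36.7),
  bill_depth_mm     = c(18.7, 17.4, 18.0, NA, 19.3),
  flipper_length_mm = c(181L, 186L, 195L, NA, 193L),
  body_mass_g       = c(3750L, 3800L, 3250L, NA, 3450L)
)

# Count missing values per column
missing_per_column <- colSums(is.na(toy))

# Keep only rows with no missing value in any column
toy_complete <- na.omit(toy)

print(missing_per_column)
print(nrow(toy_complete))
```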
Part 1
Choosing the Independent Variables
The numeric measurements bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g will definitely be used as independent variables, and island is an initial candidate. But do we need year and sex as independent variables? Would they help distinguish the species, for example if one species was recorded far more often in a particular year, or has noticeably more males than females? A quick comparison can show us.
| species | sex | n |
|---|---|---|
| Adelie | female | 73 |
| Adelie | male | 73 |
| Chinstrap | female | 34 |
| Chinstrap | male | 34 |
| Gentoo | female | 58 |
| Gentoo | male | 61 |
Looking at the counts, males and females are almost evenly split within every species, so sex does not need to be included.
| species | year | n |
|---|---|---|
| Adelie | 2007 | 44 |
| Adelie | 2008 | 50 |
| Adelie | 2009 | 52 |
| Chinstrap | 2007 | 26 |
| Chinstrap | 2008 | 18 |
| Chinstrap | 2009 | 24 |
| Gentoo | 2007 | 33 |
| Gentoo | 2008 | 45 |
| Gentoo | 2009 | 41 |
The same can be said about year: the counts per year are similar within each species, so year is not needed as an independent variable. Island is also excluded, because we already know that Adelie is spread across all three islands while Chinstrap and Gentoo are each found on only one island.
Dependent Variable
Species will be the dependent variable, but it has three categories, so how should it be recoded into a binary outcome?
| species | n |
|---|---|
| Adelie | 146 |
| Chinstrap | 68 |
| Gentoo | 119 |
Because Adelie has the most observations of any species and is spread across all three islands, the outcome is defined as Adelie vs. other species, where 1 is Adelie and 0 is any other species.
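The recoding step itself is not shown, so here is a minimal base R sketch of the idea. The `species` vector is rebuilt from the counts in the table above, and `is_adelie` is a hypothetical name; the report's own `penguins_tf` object may have been created differently.

```r
# Species labels rebuilt from the complete-case counts in the table above
species <- factor(rep(c("Adelie", "Chinstrap", "Gentoo"),
                      times = c(146, 68, 119)))

# 1 = Adelie, 0 = any other species
is_adelie <- ifelse(species == "Adelie", 1, 0)

print(table(is_adelie))
print(mean(is_adelie))  # share of Adelie among the complete cases
```

With this coding, 146 of the 333 complete cases fall in class 1.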
Logistic regression with a binary outcome
# Binary logistic regression: species is coded 1 = Adelie, 0 = other species
mdl1 <- glm(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
            family = "binomial", data = penguins_tf)
summary(mdl1)
##
## Call:
## glm(formula = species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
## body_mass_g, family = "binomial", data = penguins_tf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.328 0.000 0.000 0.000 1.652
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 27.195927 28.156975 0.966 0.3341
## bill_length_mm -5.106876 2.730998 -1.870 0.0615 .
## bill_depth_mm 8.953805 5.014702 1.786 0.0742 .
## flipper_length_mm 0.052471 0.119287 0.440 0.6600
## body_mass_g 0.006281 0.003952 1.589 0.1120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 456.5751 on 332 degrees of freedom
## Residual deviance: 9.4492 on 328 degrees of freedom
## AIC: 19.449
##
## Number of Fisher Scoring iterations: 13
Variable Interpretations
bill_length_mm - The estimate is -5.107. Because it is negative, each additional millimeter of bill length lowers the log odds of being Adelie, holding the other variables constant. Adelie tend to have shorter bills than Chinstrap and Gentoo (p = 0.0615).
bill_depth_mm - The estimate is 8.954. Because it is positive, each additional millimeter of bill depth raises the log odds of being Adelie, so deeper bills point toward Adelie rather than the other species (p = 0.0742).
flipper_length_mm - The estimate is 0.052. The sign is positive, so once the other variables are held constant, longer flippers slightly raise the log odds of being Adelie, but the effect is small and far from significant (p = 0.66).
body_mass_g - The estimate is 0.0063. Each additional gram of body mass slightly raises the log odds of being Adelie, holding the other variables constant, and this effect is also not significant (p = 0.112).
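The sign of each coefficient is easier to read on the odds scale, and the z values and p-values in the summary follow directly from the estimates and standard errors. The sketch below re-enters those numbers by hand from the `summary(mdl1)` output; the vector names are made up for this illustration.

```r
# Estimates and standard errors copied from the summary(mdl1) output above
estimate <- c(bill_length_mm = -5.106876, bill_depth_mm = 8.953805,
              flipper_length_mm = 0.052471, body_mass_g = 0.006281)
std_error <- c(bill_length_mm = 2.730998, bill_depth_mm = 5.014702,
               flipper_length_mm = 0.119287, body_mass_g = 0.003952)

# Odds ratio for a one-unit increase in each predictor:
# below 1 lowers the odds of Adelie, above 1 raises them
odds_ratio <- exp(estimate)

# Wald z statistic and two-sided p-value, as reported in the summary table
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

print(round(cbind(estimate, odds_ratio, z_value, p_value), 4))
```

Only bill_length_mm has an odds ratio below 1; the other three predictors have odds ratios above 1.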
Part 2
AUC, Accuracy, TPR, FPR, TNR, FNR
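# Predicted probability of class 1 (Adelie) for each penguin
prob_pen <- predict(mdl1, type = "response")
# Classify as Adelie when the predicted probability exceeds 0.5
pred_pen <- ifelse(prob_pen > 0.5, 1, 0)
confusionMatrix(as.factor(pred_pen), as.factor(penguins_tf$species),
                positive = "1")
## Confusion Matrix and Statistics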
##
## Reference
## Prediction 0 1
## 0 186 2
## 1 1 144
##
## Accuracy : 0.991
## 95% CI : (0.9739, 0.9981)
## No Information Rate : 0.5616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9817
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9863
## Specificity : 0.9947
## Pos Pred Value : 0.9931
## Neg Pred Value : 0.9894
## Prevalence : 0.4384
## Detection Rate : 0.4324
## Detection Prevalence : 0.4354
## Balanced Accuracy : 0.9905
##
## 'Positive' Class : 1
##
- AUC: 1.00
- Accuracy: 0.991
- TPR (Sensitivity): 0.9863
- FPR (1 - TNR): 0.0053
- TNR (Specificity): 0.9947
- FNR (1 - TPR): 0.0137
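Each rate in this list can be recomputed from the four cells of the confusion matrix. The sketch below enters those counts by hand in base R, treating class 1 (Adelie) as the positive class; the variable names are made up for this illustration. AUC is left out because it needs the predicted probabilities, not just the counts.

```r
# Counts copied from the confusion matrix above (positive class = 1, Adelie)
tp <- 144  # predicted 1, actually 1
fn <- 2    # predicted 0, actually 1
fp <- 1    # predicted 1, actually 0
tn <- 186  # predicted 0, actually 0

metrics <- c(
  accuracy = (tp + tn) / (tp + tn + fp + fn),
  tpr      = tp / (tp + fn),  # sensitivity
  fnr      = fn / (tp + fn),  # 1 - TPR
  tnr      = tn / (tn + fp),  # specificity
  fpr      = fp / (tn + fp)   # 1 - TNR
)

print(round(metrics, 4))
```

FPR and TNR share the denominator of actual negatives (187), while FNR and TPR share the denominator of actual positives (146), which is why each pair sums to 1.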
Part 3
Multinomial Logistic Regression
Again, we remove the missing values.
Adelie will again be the reference level, and the independent variables are bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. The non-numeric variables are not included.
# Make Adelie the reference level for the multinomial model
penguins_tf2$species <- relevel(penguins_tf2$species, ref = "Adelie")
multipen <- multinom(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
                     data = penguins_tf2)
## # weights: 18 (10 variable)
## initial value 365.837892
## iter 10 value 16.321214
## iter 20 value 3.754897
## iter 30 value 1.631859
## iter 40 value 0.012427
## iter 50 value 0.001125
## iter 60 value 0.001108
## iter 70 value 0.001006
## iter 80 value 0.000906
## iter 90 value 0.000498
## iter 100 value 0.000498
## final value 0.000498
## stopped after 100 iterations
## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm +
## flipper_length_mm + body_mass_g, data = penguins_tf2)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -34.60273 58.94543 -84.81399 -2.643720
## Gentoo -4.70502 43.75912 -91.60364 -1.639715
## body_mass_g
## Chinstrap -0.132491128
## Gentoo 0.007448619
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 4.4008386318 73.52088725 51.084116136 15.71457598
## Gentoo 0.0003319148 0.01602115 0.006439479 0.06591798
## body_mass_g
## Chinstrap 0.3479725
## Gentoo 1.4739115
##
## Residual Deviance: 0.0009955954
## AIC: 20.001
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -7.862758 0.8017507 -1.660281 -0.1682336
## Gentoo -14175.384874 2731.3349932 -14225.318294 -24.8750797
## body_mass_g
## Chinstrap -0.38075171
## Gentoo 0.00505364
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 3.774758e-15 0.4226972 0.09685792 0.8663995
## Gentoo 0.000000e+00 0.0000000 0.00000000 0.0000000
## body_mass_g
## Chinstrap 0.7033875
## Gentoo 0.9959678
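The two unlabeled matrices after the AIC line appear to be Wald z statistics (coefficient divided by standard error) and the matching two-sided p-values. The code that produced them is not shown, so this reading is an assumption, but the Chinstrap row reproduces under it, as the base R sketch below suggests. The vector names are made up for this illustration.

```r
# Chinstrap-vs-Adelie coefficients and standard errors, copied from the output above
coefs <- c(intercept = -34.60273, bill_length_mm = 58.94543,
           bill_depth_mm = -84.81399, flipper_length_mm = -2.643720,
           body_mass_g = -0.132491128)
std_errors <- c(intercept = 4.4008386318, bill_length_mm = 73.52088725,
                bill_depth_mm = 51.084116136, flipper_length_mm = 15.71457598,
                body_mass_g = 0.3479725)

# Wald z statistic: coefficient divided by its standard error
z <- coefs / std_errors

# Two-sided p-value from the standard normal distribution
p <- 2 * (1 - pnorm(abs(z)))

print(round(z, 4))
print(signif(p, 4))
```

The fit stopped at the 100-iteration limit with a residual deviance near zero, so these standard errors, and the z and p values derived from them, should be read with caution.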
Variable Interpretations
For the coefficients of the Chinstrap class:
bill_length_mm: the coefficient of 58.94543 means that a one-millimeter increase in bill length increases the log odds of being Chinstrap vs. Adelie by 58.94543.
bill_depth_mm: the coefficient of -84.81399 means that a one-millimeter increase in bill depth decreases the log odds of being Chinstrap vs. Adelie by 84.81399.
flipper_length_mm: the coefficient of -2.643720 means that a one-millimeter increase in flipper length decreases the log odds of being Chinstrap vs. Adelie by 2.643720.
body_mass_g: the coefficient of -0.132491128 means that a one-gram increase in body mass decreases the log odds of being Chinstrap vs. Adelie by 0.132491128.
For the coefficients of the Gentoo class:
bill_length_mm: the coefficient of 43.75912 means that a one-millimeter increase in bill length increases the log odds of being Gentoo vs. Adelie by 43.75912.
bill_depth_mm: the coefficient of -91.60364 means that a one-millimeter increase in bill depth decreases the log odds of being Gentoo vs. Adelie by 91.60364.
flipper_length_mm: the coefficient of -1.639715 means that a one-millimeter increase in flipper length decreases the log odds of being Gentoo vs. Adelie by 1.639715.
body_mass_g: the coefficient of 0.007448619 means that a one-gram increase in body mass increases the log odds of being Gentoo vs. Adelie by 0.007448619.
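To see how these log odds turn into class probabilities, the sketch below applies the coefficients to the first penguin in the data preview (39.1 mm, 18.7 mm, 181 mm, 3750 g), using base R only. The reference class Adelie has log odds fixed at 0. The names `coef_mat`, `x`, and `probs` are made up for this illustration.

```r
# Coefficients copied from the multinom output above (reference class: Adelie)
coef_mat <- rbind(
  Chinstrap = c(-34.60273, 58.94543, -84.81399, -2.643720, -0.132491128),
  Gentoo    = c(-4.70502, 43.75912, -91.60364, -1.639715, 0.007448619)
)
colnames(coef_mat) <- c("(Intercept)", "bill_length_mm", "bill_depth_mm",
                        "flipper_length_mm", "body_mass_g")

# Measurements of the first penguin in the preview; the leading 1 is the intercept
x <- c(1, 39.1, 18.7, 181, 3750)

# Log odds of each non-reference class vs. Adelie
log_odds <- drop(coef_mat %*% x)

# Convert log odds to probabilities; Adelie's log odds are 0 by construction
scores <- c(Adelie = 0, log_odds)
probs <- exp(scores) / sum(exp(scores))

print(log_odds)
print(probs)
```

Both log odds come out strongly negative for this penguin, so almost all of the probability goes to Adelie, which matches its recorded species.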