Data 622 Homework 1

Maryluz Cruz

2021-02-19

library(DataExplorer)
library(palmerpenguins)
library(psych)
library(GGally)
library(tidyverse)
library(ggplot2)
library(reshape)
library(kableExtra)
library(MASS)
library(caret)
library(pROC)
library(nnet)

For this assignment we are going to use the palmerpenguins package. More information on the package can be found here:

https://allisonhorst.github.io/palmerpenguins/articles/intro.html

Let's take a look at the penguins data set from the palmerpenguins package.

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A...
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge...
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34....
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18....
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, ...
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347...
## $ sex               <fct> male, female, female, NA, female, male, female, m...
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2...
colnames(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

Species and Island

penguinsb<-penguins%>% 
  count(species, island)
kable(penguinsb)  
species    island      n
Adelie     Biscoe      44
Adelie     Dream       56
Adelie     Torgersen   52
Chinstrap  Dream       68
Gentoo     Biscoe      124

Here you can see that there are 3 species: Adelie, Chinstrap, and Gentoo. There are 3 islands: Biscoe, Dream, and Torgersen. Adelie has the most penguins, with a total of 152 (44 + 56 + 52), and they are spread across all three islands. Chinstrap has the fewest, with only 68, all on Dream. Gentoo has a total of 124, all on Biscoe.
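The per-species totals can be checked by summing the counts over islands. A minimal sketch in base R, with the counts typed in by hand from the table above (the names `counts`, `totals`, and `islands_per_species` are only for illustration):

```r
# Counts copied from the species/island table above
counts <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo"),
  island  = c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe"),
  n       = c(44, 56, 52, 68, 124)
)

# Total number of penguins per species, summed across islands
totals <- tapply(counts$n, counts$species, sum)
print(totals)

# Number of distinct islands on which each species was observed
islands_per_species <- tapply(counts$island, counts$species,
                              function(x) length(unique(x)))
print(islands_per_species)
```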

Missing Data

plot_missing(penguins)

The rows with missing values need to be removed, which leaves 333 of the 344 rows.

penguins_tf<-na.omit(penguins)
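As a small illustration of what `na.omit()` does, the sketch below uses a made-up three-row data frame (`toy` is hypothetical, not the penguins data). Any row containing at least one `NA` is dropped.

```r
# Hypothetical toy data; the second row has missing measurements
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3),
  sex            = factor(c("male", NA, "female"))
)

# Rows with at least one missing value
n_incomplete <- sum(!complete.cases(toy))

# na.omit() keeps only the complete rows
toy_clean <- na.omit(toy)

cat("rows before: ", nrow(toy), "\n")
cat("rows dropped:", n_incomplete, "\n")
cat("rows after:  ", nrow(toy_clean), "\n")
```

The same `complete.cases()` check could be run on `penguins` before calling `na.omit()` to see how many rows will be lost.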

Density plot

plot_density(penguins_tf)

Histogram

plot_histogram(penguins_tf)

Pairs.Panel

pairs.panels(penguins_tf)

GGPairs

ggpairs(penguins_tf)

Part 1

Choosing the Independent Variables

The independent variables that will definitely be considered are island, bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. But do we need year and sex as independent variables? Would sex or year help distinguish the species, for example if one species had noticeably more males than females or was observed mostly in one year? A quick comparison can show us.

penguinstable<-penguins_tf%>% 
  count(species, sex)
kable(penguinstable) 
species    sex     n
Adelie     female  73
Adelie     male    73
Chinstrap  female  34
Chinstrap  male    34
Gentoo     female  58
Gentoo     male    61

Looking at the numbers, females and males are almost evenly split within every species, so sex on its own does not help separate the species and does not need to be included.

penguinstable<-penguins_tf%>% 
  count(species,  year)
kable(penguinstable) 
species    year  n
Adelie     2007  44
Adelie     2008  50
Adelie     2009  52
Chinstrap  2007  26
Chinstrap  2008  18
Chinstrap  2009  24
Gentoo     2007  33
Gentoo     2008  45
Gentoo     2009  41

The same can be said about year: every species was observed in all three years in broadly similar numbers, so year is not needed as an independent variable. Island is also left out, because we already know that Adelie is spread across all three islands while Chinstrap and Gentoo are each found on only one island.

Dependent Variable

Species will be the dependent variable. Since it has 3 categories, how should it be recoded into a binary outcome?

penguins_tfc<-penguins_tf%>%
  group_by(species)%>%
  count()
kable(penguins_tfc)
species    n
Adelie     146
Chinstrap  68
Gentoo     119

Adelie has the most observations of all the species and is spread across all three islands, so the binary outcome is Adelie vs. other species, where 1 is Adelie and 0 is any other species.

penguins_tf$species <- ifelse(penguins_tf$species=="Adelie", 1, 0)
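To see the class balance this recoding produces, the sketch below rebuilds the species labels from the complete-case counts in the table above and applies the same `ifelse()` rule. The vector names `species` and `is_adelie` are illustrative only.

```r
# Species labels rebuilt from the complete-case counts in the table above
species <- factor(rep(c("Adelie", "Chinstrap", "Gentoo"),
                      times = c(146, 68, 119)))

# Same recoding as in the analysis: 1 = Adelie, 0 = any other species
is_adelie <- ifelse(species == "Adelie", 1, 0)

# Class balance of the binary outcome
print(table(is_adelie))

# Share of Adelie among all penguins
print(round(mean(is_adelie), 4))
```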

Logistic regression with a binary outcome

mdl1<-glm(species ~ bill_length_mm +  bill_depth_mm + flipper_length_mm  + body_mass_g  , family = "binomial", data = penguins_tf)
summary(mdl1)
## 
## Call:
## glm(formula = species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g, family = "binomial", data = penguins_tf)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.328   0.000   0.000   0.000   1.652  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)       27.195927  28.156975   0.966   0.3341  
## bill_length_mm    -5.106876   2.730998  -1.870   0.0615 .
## bill_depth_mm      8.953805   5.014702   1.786   0.0742 .
## flipper_length_mm  0.052471   0.119287   0.440   0.6600  
## body_mass_g        0.006281   0.003952   1.589   0.1120  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 456.5751  on 332  degrees of freedom
## Residual deviance:   9.4492  on 328  degrees of freedom
## AIC: 19.449
## 
## Number of Fisher Scoring iterations: 13

Variable Interpretations

  • bill_length_mm - This variable has an estimate of -5.107. Because it is negative, a longer bill lowers the log odds of being Adelie when the other measurements are held constant, which is consistent with Adelie having shorter bills than Chinstrap and Gentoo. It is significant only at the 0.1 level (p = 0.0615).

  • bill_depth_mm - This variable has an estimate of 8.954. Because it is positive, a deeper bill raises the log odds of being Adelie when the other measurements are held constant. It is also significant only at the 0.1 level (p = 0.0742).

  • flipper_length_mm - This variable has an estimate of 0.052. The sign is positive, so with the other measurements held constant a longer flipper slightly raises the log odds of being Adelie, but the effect is far from significant (p = 0.66).

  • body_mass_g - This variable has an estimate of 0.00628. The sign is positive, so with the other measurements held constant each additional gram slightly raises the log odds of being Adelie, but the effect is not significant (p = 0.112).
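The log-odds estimates can be easier to read as odds ratios or as a predicted probability. The following is a minimal sketch, not part of the original analysis: the coefficients are copied by hand from the `summary(mdl1)` output, and the example penguin is the first row shown by `glimpse(penguins)`. The names `beta`, `x`, `log_odds`, and `p_adelie` are illustrative.

```r
# Coefficient estimates copied from summary(mdl1) above
beta <- c(
  "(Intercept)"     = 27.195927,
  bill_length_mm    = -5.106876,
  bill_depth_mm     =  8.953805,
  flipper_length_mm =  0.052471,
  body_mass_g       =  0.006281
)

# exp(coefficient) is the multiplicative change in the odds of Adelie
# for a one-unit increase in that predictor, others held constant
odds_ratios <- exp(beta[-1])
print(signif(odds_ratios, 3))

# First row of the data shown by glimpse(): an Adelie penguin
# (the leading 1 multiplies the intercept)
x <- c(1, bill_length_mm = 39.1, bill_depth_mm = 18.7,
       flipper_length_mm = 181, body_mass_g = 3750)

# Linear predictor (log odds) and the implied probability of Adelie
log_odds <- sum(beta * x)
p_adelie <- plogis(log_odds)
print(log_odds)
print(p_adelie)
```

An odds ratio below 1 corresponds to a negative coefficient and one above 1 to a positive coefficient, which matches the signs discussed in the bullets.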

Part 2

AUC, Accuracy, TPR, FPR, TNR, FNR

prob_pen<- predict(mdl1, type="response")
pred_pen<- ifelse(prob_pen > .5,1,0)


confusionMatrix(as.factor(pred_pen), as.factor(penguins_tf$species),
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 186   2
##          1   1 144
##                                           
##                Accuracy : 0.991           
##                  95% CI : (0.9739, 0.9981)
##     No Information Rate : 0.5616          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9817          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9863          
##             Specificity : 0.9947          
##          Pos Pred Value : 0.9931          
##          Neg Pred Value : 0.9894          
##              Prevalence : 0.4384          
##          Detection Rate : 0.4324          
##    Detection Prevalence : 0.4354          
##       Balanced Accuracy : 0.9905          
##                                           
##        'Positive' Class : 1               
## 

  • AUC: 1.00
  • Accuracy: 0.991
  • TPR (Sensitivity): 0.9863
  • FPR (1 - TNR): 0.0053
  • TNR (Specificity): 0.9947
  • FNR (1 - TPR): 0.0137
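The rates can be recomputed directly from the four cells of the confusion matrix. The sketch below does this in base R with the counts typed in by hand. The code that produced the AUC is not shown in this report, so the second half is only one possible way to obtain it: a rank-based (Mann-Whitney) AUC, demonstrated on made-up scores. The helper `auc_rank` and the `toy_*` vectors are hypothetical.

```r
# Counts copied from the confusion matrix above (positive class = 1, Adelie)
TP <- 144  # predicted 1, reference 1
FN <- 2    # predicted 0, reference 1
FP <- 1    # predicted 1, reference 0
TN <- 186  # predicted 0, reference 0

accuracy <- (TP + TN) / (TP + TN + FP + FN)
TPR <- TP / (TP + FN)  # sensitivity
TNR <- TN / (TN + FP)  # specificity
FPR <- FP / (FP + TN)  # 1 - TNR
FNR <- FN / (FN + TP)  # 1 - TPR

print(round(c(Accuracy = accuracy, TPR = TPR, FPR = FPR,
              TNR = TNR, FNR = FNR), 4))

# Rank-based (Mann-Whitney) AUC: the probability that a randomly chosen
# positive case gets a higher score than a randomly chosen negative case
auc_rank <- function(score, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(rank(score)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores and labels, for illustration only
toy_score <- c(0.10, 0.40, 0.35, 0.80)
toy_label <- c(0, 0, 1, 1)
print(auc_rank(toy_score, toy_label))
```

In the analysis itself the same helper could be applied to `prob_pen` and `penguins_tf$species`.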

Part 3

Multinomial Logistic Regression

Again, we remove the rows with missing values.

penguins_tf2<- na.omit(penguins)

Adelie will again be the reference level. The independent variables are bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g; the non-numeric variables are not included.

penguins_tf2$species = relevel( penguins_tf2$species, ref = 'Adelie')
multipen = multinom(species ~ bill_length_mm +  bill_depth_mm + flipper_length_mm  + body_mass_g , data= penguins_tf2 )
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 16.321214
## iter  20 value 3.754897
## iter  30 value 1.631859
## iter  40 value 0.012427
## iter  50 value 0.001125
## iter  60 value 0.001108
## iter  70 value 0.001006
## iter  80 value 0.000906
## iter  90 value 0.000498
## iter 100 value 0.000498
## final  value 0.000498 
## stopped after 100 iterations
summary(multipen)
## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm + 
##     flipper_length_mm + body_mass_g, data = penguins_tf2)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap   -34.60273       58.94543     -84.81399         -2.643720
## Gentoo       -4.70502       43.75912     -91.60364         -1.639715
##            body_mass_g
## Chinstrap -0.132491128
## Gentoo     0.007448619
## 
## Std. Errors:
##            (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 4.4008386318    73.52088725  51.084116136       15.71457598
## Gentoo    0.0003319148     0.01602115   0.006439479        0.06591798
##           body_mass_g
## Chinstrap   0.3479725
## Gentoo      1.4739115
## 
## Residual Deviance: 0.0009955954 
## AIC: 20.001
(z <- summary(multipen)$coefficients / summary( multipen)$standard.errors )
##             (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap     -7.862758      0.8017507     -1.660281        -0.1682336
## Gentoo    -14175.384874   2731.3349932 -14225.318294       -24.8750797
##           body_mass_g
## Chinstrap -0.38075171
## Gentoo     0.00505364
(p <- (1 - pnorm(abs(z), 0, 1 )) *2 )
##            (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 3.774758e-15      0.4226972    0.09685792         0.8663995
## Gentoo    0.000000e+00      0.0000000    0.00000000         0.0000000
##           body_mass_g
## Chinstrap   0.7033875
## Gentoo      0.9959678
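The z and p matrices above are Wald statistics: each coefficient divided by its standard error, then compared with a standard normal distribution. The sketch below reproduces the Chinstrap row from values typed in by hand; the names `coef_chin`, `se_chin`, `z_chin`, and `p_chin` are illustrative. It uses `pnorm(..., lower.tail = FALSE)`, which is equivalent to `1 - pnorm(...)` but loses less precision for very large |z|.

```r
# Chinstrap row, copied from summary(multipen) above
coef_chin <- c("(Intercept)" = -34.60273, bill_length_mm = 58.94543,
               bill_depth_mm = -84.81399, flipper_length_mm = -2.643720,
               body_mass_g = -0.132491128)
se_chin <- c("(Intercept)" = 4.4008386318, bill_length_mm = 73.52088725,
             bill_depth_mm = 51.084116136, flipper_length_mm = 15.71457598,
             body_mass_g = 0.3479725)

# Wald z statistic: estimate divided by its standard error
z_chin <- coef_chin / se_chin

# Two-sided p-value from the standard normal distribution
p_chin <- 2 * pnorm(abs(z_chin), lower.tail = FALSE)

print(round(z_chin, 4))
print(signif(p_chin, 4))
```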

Variable Interpretations

Each coefficient is a change in log odds relative to Adelie, with the other predictors held constant. Note that the fit stopped at the 100-iteration limit, so the estimates may not have fully converged.

Coefficients for the Chinstrap class

  • The bill_length_mm coefficient of 58.94543 means a 1 mm increase in bill length increases the log odds of being Chinstrap vs. Adelie by 58.94543.

  • The bill_depth_mm coefficient of -84.81399 means a 1 mm increase in bill depth decreases the log odds of being Chinstrap vs. Adelie by 84.81399.

  • The flipper_length_mm coefficient of -2.643720 means a 1 mm increase in flipper length decreases the log odds of being Chinstrap vs. Adelie by 2.643720.

  • The body_mass_g coefficient of -0.132491128 means a 1 g increase in body mass decreases the log odds of being Chinstrap vs. Adelie by 0.132491128.

Coefficients for the Gentoo class

  • The bill_length_mm coefficient of 43.75912 means a 1 mm increase in bill length increases the log odds of being Gentoo vs. Adelie by 43.75912.

  • The bill_depth_mm coefficient of -91.60364 means a 1 mm increase in bill depth decreases the log odds of being Gentoo vs. Adelie by 91.60364.

  • The flipper_length_mm coefficient of -1.639715 means a 1 mm increase in flipper length decreases the log odds of being Gentoo vs. Adelie by 1.639715.

  • The body_mass_g coefficient of 0.007448619 means a 1 g increase in body mass increases the log odds of being Gentoo vs. Adelie by 0.007448619.
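The log odds in the bullets above can be turned into class probabilities with the softmax transformation, where the reference class (Adelie) has a linear predictor of 0. The following is a minimal sketch, not part of the original analysis: the coefficients are copied by hand from `summary(multipen)`, and the example penguin is the first row shown by `glimpse(penguins)`. The names `B`, `x`, `eta`, and `probs` are illustrative.

```r
# Coefficients copied from summary(multipen); Adelie is the reference class,
# so its linear predictor is fixed at 0
B <- rbind(
  Chinstrap = c(-34.60273, 58.94543, -84.81399, -2.643720, -0.132491128),
  Gentoo    = c( -4.70502, 43.75912, -91.60364, -1.639715,  0.007448619)
)
colnames(B) <- c("(Intercept)", "bill_length_mm", "bill_depth_mm",
                 "flipper_length_mm", "body_mass_g")

# First row of the data shown by glimpse(): an Adelie penguin
# (the leading 1 multiplies the intercept)
x <- c(1, 39.1, 18.7, 181, 3750)

# Log odds of each non-reference class versus Adelie
eta <- c(Adelie = 0, drop(B %*% x))

# Softmax turns the log odds into class probabilities that sum to 1
probs <- exp(eta) / sum(exp(eta))
print(eta)
print(probs)
```

Both log odds come out strongly negative for this penguin, so nearly all of the probability falls on Adelie.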