Objective

Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html For this assignment, let us use ‘species’ as our outcome or the dependent variable.

1. Logistic Regression with a binary outcome.

The penguin dataset has ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).

In the penguin dataset species column there are 3 categories, Gentoo, Adelie, and Chinstrap. Because there are many similarities beween Adelie and Chinstrap we will combine these two species to build a logistic regress with binary outcome against Gentoo. Many similarities include the size of the penguins and the islands it dwells in. Only the Gentoo lives separate in an another island opposed to the Adelie and Chinstrap.

Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
Provide variable interpretations in your model.

Year, sex, island are variables that aren’t as important in compared the three species of penguin to one another. However the measurements of bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g does play an important role in comparision between the three species.

# Loading required libraries

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr) 
library(caTools)

## Warning: package 'caTools' was built under R version 4.0.3

library(caret)

## Warning: package 'caret' was built under R version 4.0.2

## Loading required package: lattice

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.0.3

library(readr)
require(nnet)

## Loading required package: nnet

library(pROC)

## Warning: package 'pROC' was built under R version 4.0.2

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Load dataset

penguins <- palmerpenguins::penguins

head(penguins)

## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge~           39.1          18.7              181        3750 male 
## 2 Adelie  Torge~           39.5          17.4              186        3800 fema~
## 3 Adelie  Torge~           40.3          18                195        3250 fema~
## 4 Adelie  Torge~           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge~           36.7          19.3              193        3450 fema~
## 6 Adelie  Torge~           39.3          20.6              190        3650 male 
## # ... with 1 more variable: year <int>

penguins$Species <- ifelse(penguins$species == 'Gentoo', 0 , 1)

df1 <- penguins %>% select(-species, -year, -sex, -island)

#find the na of each variable
lapply(df1, function(x) { length(which(is.na(x))) })

## $bill_length_mm
## [1] 2
## 
## $bill_depth_mm
## [1] 2
## 
## $flipper_length_mm
## [1] 2
## 
## $body_mass_g
## [1] 2
## 
## $Species
## [1] 0

# substitute the average of the variable in place of the NA values
df1$bill_length_mm <- ifelse(is.na(df1$bill_length_mm), mean(df1$bill_length_mm, na.rm=TRUE), df1$bill_length_mm)


df1$bill_depth_mm <- ifelse(is.na(df1$bill_depth_mm), mean(df1$bill_depth_mm, na.rm=TRUE), df1$bill_depth_mm)

df1$flipper_length_mm <- ifelse(is.na(df1$flipper_length_mm), mean(df1$flipper_length_mm, na.rm=TRUE), df1$flipper_length_mm)


df1$body_mass_g <- ifelse(is.na(df1$body_mass_g), mean(df1$body_mass_g, na.rm=TRUE), df1$body_mass_g)



df1$Species <- factor(df1$Species)

#na omit

new_df <- na.omit(df1)

# logistic regression model

model <- glm(Species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, family=binomial(link='logit'), data=new_df)

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model)

## 
## Call:
## glm(formula = Species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g, family = binomial(link = "logit"), data = new_df)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.177   0.000   0.000   0.000   1.177  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        6.876e+02  1.096e+05   0.006    0.995
## bill_length_mm    -8.303e-01  6.214e+02  -0.001    0.999
## bill_depth_mm      7.425e+01  8.173e+03   0.009    0.993
## flipper_length_mm -7.354e+00  8.435e+02  -0.009    0.993
## body_mass_g       -1.064e-01  1.188e+01  -0.009    0.993
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 449.7355  on 343  degrees of freedom
## Residual deviance:   2.7726  on 339  degrees of freedom
## AIC: 12.773
## 
## Number of Fisher Scoring iterations: 25

# split train /test set

set.seed(100) 
sampleSplit <- sample.split(Y=new_df$Species, SplitRatio=0.7) 
trainSet <- subset(x=new_df, sampleSplit==TRUE) 
testSet <- subset(x=new_df, sampleSplit==FALSE)

prob <- predict(model, testSet, type='response') 
preds <- ifelse(prob > 0.5, 1, 0)

2 . AUC, Accuracy, TPR, FPR, TNR, FNR

Using the values obtained from the confusion matrix we will solve for AUC, Accuracy, TPR, FPR, TNR, FNR.

# confusion matrix to find out Accuracy, TPR, FPR, TNR, FNR
confusionMatrix(factor(preds), factor(testSet$Species))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 37  0
##          1  0 66
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9648, 1)
##     No Information Rate : 0.6408     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.3592     
##          Detection Rate : 0.3592     
##    Detection Prevalence : 0.3592     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
##

#ROC curve


prob=predict(model,type=c("response"))
new_df$prob=prob

g <- roc(Species ~ prob, data = new_df, plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE, print.auc=TRUE, show.thres=TRUE)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

Accuracy of 0.9903, 99%. We got an AUC value of 1.0. An AUC value of greater than .70 indicates a good model.

TP<-65
FP<-1
TN<-37
FN<-0

TPR = TP/(FN + TP)
TPR

## [1] 1

TNR = TN/(TN + FP)
TNR

## [1] 0.9736842

FPR = FP/(FP + TN)
FPR

## [1] 0.02631579

FNR = FN/(FN + TP)
FNR

## [1] 0

3 . Multinomial Logistic Regression

# Setting the reference
new_df$Species <- relevel(new_df$Species, ref = "1")

multinom_model <- multinom(Species ~ ., data = new_df)

## # weights:  7 (6 variable)
## initial  value 238.442630 
## iter  10 value 1.600055
## iter  20 value 1.397373
## final  value 1.387525 
## converged

# Checking the model

summary(multinom_model)

## Call:
## multinom(formula = Species ~ ., data = new_df)
## 
## Coefficients:
##                          Values   Std. Err.
## (Intercept)        -0.770769168 0.005889471
## bill_length_mm      0.060549867 0.234842646
## bill_depth_mm      -3.822613355 0.078788290
## flipper_length_mm   0.293618899 1.069965860
## body_mass_g         0.003086539 0.053954582
## prob              -16.579969585 0.019606818
## 
## Residual Deviance: 2.775049 
## AIC: 14.77505

Here’s a summary of our Multinomial Logistic Regression.

#the coefficients
coef(multinom_model)

##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##      -0.770769168       0.060549867      -3.822613355       0.293618899 
##       body_mass_g              prob 
##       0.003086539     -16.579969585

head(round(fitted(multinom_model), 2))

##   [,1]
## 1  0.0
## 2  0.0
## 3  0.0
## 4  0.5
## 5  0.0
## 6  0.0

Here’s the coefficients of the model.

# To get the z-values, you divide the coefficients by the standard errors.

summary(multinom_model)$standard.errors

##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##       0.005889471       0.234842646       0.078788290       1.069965860 
##       body_mass_g              prob 
##       0.053954582       0.019606818

zvalues <- summary(multinom_model)$coefficients / summary(multinom_model)$standard.errors
# Show z-values
zvalues

##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##     -130.87239623        0.25783165      -48.51753194        0.27441894 
##       body_mass_g              prob 
##        0.05720624     -845.62265095

Using the coefficients and standard errors we are able to calculate the z- values.

pnorm(abs(zvalues), lower.tail=FALSE)*2

##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##         0.0000000         0.7965368         0.0000000         0.7837627 
##       body_mass_g              prob 
##         0.9543809         0.0000000

Here are the p-values.

4. Extra Credit

What would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide.

Multiple logistic regression analyses, one for each pair of outcomes: One problem with this approach is that each analysis is potentially run on a different sample. The other problem is that without constraining the logistic models, we can end up with the probability of choosing all possible outcome categories greater than 1. Collapsing number of categories to two and then doing a logistic regression: This approach suffers from loss of information and changes the original research questions to very different ones.

We should evaluate chi-square and R squared for our model in question 3.

Data 622 HW1

Tony Mei

2/18/2021

Objective

1. Logistic Regression with a binary outcome.

2 . AUC, Accuracy, TPR, FPR, TNR, FNR

3 . Multinomial Logistic Regression

4. Extra Credit