Objective
Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html For this assignment, let us use ‘species’ as our outcome or the dependent variable.
1. Logistic Regression with a binary outcome.
- The penguin dataset has ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).
In the penguin dataset species column there are 3 categories, Gentoo, Adelie, and Chinstrap. Because there are many similarities beween Adelie and Chinstrap we will combine these two species to build a logistic regress with binary outcome against Gentoo. Many similarities include the size of the penguins and the islands it dwells in. Only the Gentoo lives separate in an another island opposed to the Adelie and Chinstrap.
Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
Provide variable interpretations in your model.
Year, sex, island are variables that aren’t as important in compared the three species of penguin to one another. However the measurements of bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g does play an important role in comparision between the three species.
## Warning: package 'dplyr' was built under R version 4.0.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'caTools' was built under R version 4.0.3
## Warning: package 'caret' was built under R version 4.0.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.3
## Loading required package: nnet
## Warning: package 'pROC' was built under R version 4.0.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge~ 39.1 18.7 181 3750 male
## 2 Adelie Torge~ 39.5 17.4 186 3800 fema~
## 3 Adelie Torge~ 40.3 18 195 3250 fema~
## 4 Adelie Torge~ NA NA NA NA <NA>
## 5 Adelie Torge~ 36.7 19.3 193 3450 fema~
## 6 Adelie Torge~ 39.3 20.6 190 3650 male
## # ... with 1 more variable: year <int>
penguins$Species <- ifelse(penguins$species == 'Gentoo', 0 , 1)
df1 <- penguins %>% select(-species, -year, -sex, -island)
#find the na of each variable
lapply(df1, function(x) { length(which(is.na(x))) })## $bill_length_mm
## [1] 2
##
## $bill_depth_mm
## [1] 2
##
## $flipper_length_mm
## [1] 2
##
## $body_mass_g
## [1] 2
##
## $Species
## [1] 0
# substitute the average of the variable in place of the NA values
df1$bill_length_mm <- ifelse(is.na(df1$bill_length_mm), mean(df1$bill_length_mm, na.rm=TRUE), df1$bill_length_mm)
df1$bill_depth_mm <- ifelse(is.na(df1$bill_depth_mm), mean(df1$bill_depth_mm, na.rm=TRUE), df1$bill_depth_mm)
df1$flipper_length_mm <- ifelse(is.na(df1$flipper_length_mm), mean(df1$flipper_length_mm, na.rm=TRUE), df1$flipper_length_mm)
df1$body_mass_g <- ifelse(is.na(df1$body_mass_g), mean(df1$body_mass_g, na.rm=TRUE), df1$body_mass_g)
df1$Species <- factor(df1$Species)
#na omit
new_df <- na.omit(df1)
# logistic regression model
model <- glm(Species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, family=binomial(link='logit'), data=new_df)## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = Species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
## body_mass_g, family = binomial(link = "logit"), data = new_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.177 0.000 0.000 0.000 1.177
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.876e+02 1.096e+05 0.006 0.995
## bill_length_mm -8.303e-01 6.214e+02 -0.001 0.999
## bill_depth_mm 7.425e+01 8.173e+03 0.009 0.993
## flipper_length_mm -7.354e+00 8.435e+02 -0.009 0.993
## body_mass_g -1.064e-01 1.188e+01 -0.009 0.993
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 449.7355 on 343 degrees of freedom
## Residual deviance: 2.7726 on 339 degrees of freedom
## AIC: 12.773
##
## Number of Fisher Scoring iterations: 25
2 . AUC, Accuracy, TPR, FPR, TNR, FNR
Using the values obtained from the confusion matrix we will solve for AUC, Accuracy, TPR, FPR, TNR, FNR.
# confusion matrix to find out Accuracy, TPR, FPR, TNR, FNR
confusionMatrix(factor(preds), factor(testSet$Species))## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 0
## 1 0 66
##
## Accuracy : 1
## 95% CI : (0.9648, 1)
## No Information Rate : 0.6408
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3592
## Detection Rate : 0.3592
## Detection Prevalence : 0.3592
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
#ROC curve
prob=predict(model,type=c("response"))
new_df$prob=prob
g <- roc(Species ~ prob, data = new_df, plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE, print.auc=TRUE, show.thres=TRUE)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Accuracy of 0.9903, 99%. We got an AUC value of 1.0. An AUC value of greater than .70 indicates a good model.
## [1] 1
## [1] 0.9736842
## [1] 0.02631579
## [1] 0
3 . Multinomial Logistic Regression
# Setting the reference
new_df$Species <- relevel(new_df$Species, ref = "1")
multinom_model <- multinom(Species ~ ., data = new_df)## # weights: 7 (6 variable)
## initial value 238.442630
## iter 10 value 1.600055
## iter 20 value 1.397373
## final value 1.387525
## converged
## Call:
## multinom(formula = Species ~ ., data = new_df)
##
## Coefficients:
## Values Std. Err.
## (Intercept) -0.770769168 0.005889471
## bill_length_mm 0.060549867 0.234842646
## bill_depth_mm -3.822613355 0.078788290
## flipper_length_mm 0.293618899 1.069965860
## body_mass_g 0.003086539 0.053954582
## prob -16.579969585 0.019606818
##
## Residual Deviance: 2.775049
## AIC: 14.77505
Here’s a summary of our Multinomial Logistic Regression.
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## -0.770769168 0.060549867 -3.822613355 0.293618899
## body_mass_g prob
## 0.003086539 -16.579969585
## [,1]
## 1 0.0
## 2 0.0
## 3 0.0
## 4 0.5
## 5 0.0
## 6 0.0
Here’s the coefficients of the model.
# To get the z-values, you divide the coefficients by the standard errors.
summary(multinom_model)$standard.errors## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## 0.005889471 0.234842646 0.078788290 1.069965860
## body_mass_g prob
## 0.053954582 0.019606818
zvalues <- summary(multinom_model)$coefficients / summary(multinom_model)$standard.errors
# Show z-values
zvalues## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## -130.87239623 0.25783165 -48.51753194 0.27441894
## body_mass_g prob
## 0.05720624 -845.62265095
Using the coefficients and standard errors we are able to calculate the z- values.
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## 0.0000000 0.7965368 0.0000000 0.7837627
## body_mass_g prob
## 0.9543809 0.0000000
Here are the p-values.
4. Extra Credit
What would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide.
Multiple logistic regression analyses, one for each pair of outcomes: One problem with this approach is that each analysis is potentially run on a different sample. The other problem is that without constraining the logistic models, we can end up with the probability of choosing all possible outcome categories greater than 1. Collapsing number of categories to two and then doing a logistic regression: This approach suffers from loss of information and changes the original research questions to very different ones.
We should evaluate chi-square and R squared for our model in question 3.