This week we continued to work on Logistic Regression. We practiced on a Titanic dataset.
library(stableGR)
## Warning: package 'stableGR' was built under R version 3.6.3
## Loading required package: mcmcse
## Warning: package 'mcmcse' was built under R version 3.6.3
## mcmcse: Monte Carlo Standard Errors for MCMC
## Version 1.4-1 created on 2020-01-29.
## copyright (c) 2012, James M. Flegal, University of California, Riverside
## John Hughes, University of Colorado, Denver
## Dootika Vats, University of Warwick
## Ning Dai, University of Minnesota
## For citation information, type citation("mcmcse").
## Type help("mcmcse-package") to get started.
data(titanic.complete)
titanic <- titanic.complete
The response variable that we modeled was ‘Survive’ which is a 1 for passengers who survived and 0 if they did not. For predictors we used ‘Sex’, ‘Age’, and ‘Pclass’ (1st, 2nd, or 3rd class).
head(titanic)
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 1 0 3 male 22 1 0 7.2500 S
## 2 1 1 female 38 1 0 71.2833 C
## 3 1 3 female 26 0 0 7.9250 S
## 4 1 1 female 35 1 0 53.1000 S
## 5 0 3 male 35 0 0 8.0500 S
## 7 0 1 male 54 0 0 51.8625 S
Below is a model without predictors. This gives the log odds of survival for the average passenger.
myresp<-cbind(titanic$Survived, 1-titanic$Survived)
m1<-glm(myresp~1, family=binomial)
summary(m1)
##
## Call:
## glm(formula = myresp ~ 1, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.018 -1.018 -1.018 1.345 1.345
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38677 0.07636 -5.065 4.08e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 960.9 on 711 degrees of freedom
## Residual deviance: 960.9 on 711 degrees of freedom
## AIC: 962.9
##
## Number of Fisher Scoring iterations: 4
Exponentiating the coefficient gives:
exp(-0.3868)
## [1] 0.6792269
The average odds of survival for a passenger on the Titanic are .67922.
Eventually, we made a model that used class, sex, and adult (adults considered anyone > 18 years old). Here is that model.
#creating variables
sex.binary<-ifelse(titanic$Sex=="female",1,0)
titanic<-cbind(titanic,sex.binary)
adult<-as.numeric(titanic$Age>18,length(titanic))
titanic<-cbind(titanic,adult)
m5<-glm(Survived~Pclass+adult+sex.binary,data = titanic, family=binomial)
summary(m5)
##
## Call:
## glm(formula = Survived ~ Pclass + adult + sex.binary, family = binomial,
## data = titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4749 -0.7035 -0.4088 0.7138 2.2467
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5070 0.2998 1.691 0.090805 .
## Pclass2 -0.9927 0.2597 -3.822 0.000132 ***
## Pclass3 -2.1628 0.2509 -8.621 < 2e-16 ***
## adult -0.7845 0.2495 -3.144 0.001667 **
## sex.binary 2.5076 0.2048 12.243 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 960.90 on 711 degrees of freedom
## Residual deviance: 662.13 on 707 degrees of freedom
## AIC: 672.13
##
## Number of Fisher Scoring iterations: 4
exp(m5$coefficients)
## (Intercept) Pclass2 Pclass3 adult sex.binary
## 1.6603256 0.3705827 0.1149978 0.4563587 12.2757124
Here is the regression equation:
\[log(\frac{p}{1-p})=0.5070-0.9927I(2nd Class)-2.1628 I(3rdClass)-0.7845I(Adult)+2.5076 I(Female) \]
By looking at the exponentiated coefficients above, we can understand the relationship between each predictor and the odds of survival. Note that as a person’s class decreases, their odds of survival decrease as well (holding all else constant). In particular, the odds that a 3rd class passenger survived were .11499 times the odds that a 1st class passenger survived. Also, a female (holding class and age constant) had odds of surviving 12.2757 times the odds of a male.
We also worked with a breast cancer dataset.
library(faraway)
## Warning: package 'faraway' was built under R version 3.6.3
data(wbca)
Below is a model predicting the log odds of a tumor being benign:
m1<-glm(Class~.,data = wbca,family = binomial)
summary(m1)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = wbca)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.48282 -0.01179 0.04739 0.09678 3.06425
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.16678 1.41491 7.892 2.97e-15 ***
## Adhes -0.39681 0.13384 -2.965 0.00303 **
## BNucl -0.41478 0.10230 -4.055 5.02e-05 ***
## Chrom -0.56456 0.18728 -3.014 0.00257 **
## Epith -0.06440 0.16595 -0.388 0.69795
## Mitos -0.65713 0.36764 -1.787 0.07387 .
## NNucl -0.28659 0.12620 -2.271 0.02315 *
## Thick -0.62675 0.15890 -3.944 8.01e-05 ***
## UShap -0.28011 0.25235 -1.110 0.26699
## USize 0.05718 0.23271 0.246 0.80589
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 881.388 on 680 degrees of freedom
## Residual deviance: 89.464 on 671 degrees of freedom
## AIC: 109.46
##
## Number of Fisher Scoring iterations: 8
This model uses all of the available predictors. We can use the drop1() function to perform drop-in-deviance tests and drop one predictor at a time.
drop1(m1,test = "Chisq")
## Single term deletions
##
## Model:
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick +
## UShap + USize
## Df Deviance AIC LRT Pr(>Chi)
## <none> 89.464 109.46
## Adhes 1 98.844 116.84 9.3798 0.002194 **
## BNucl 1 109.000 127.00 19.5363 9.871e-06 ***
## Chrom 1 99.841 117.84 10.3767 0.001276 **
## Epith 1 89.613 107.61 0.1487 0.699763
## Mitos 1 93.551 111.55 4.0868 0.043220 *
## NNucl 1 95.204 113.20 5.7401 0.016582 *
## Thick 1 110.239 128.24 20.7744 5.167e-06 ***
## UShap 1 90.627 108.63 1.1628 0.280887
## USize 1 89.523 107.52 0.0592 0.807802
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To start, we would drop the predictor that has the highest p-value on the right. Therefore, we would first drop USize. Continue that process until all of the predictors are significant. That would look like this:
m4<-glm(Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick,data = wbca,family = binomial)
drop1(m4,test = "Chisq")
## Single term deletions
##
## Model:
## Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick
## Df Deviance AIC LRT Pr(>Chi)
## <none> 91.884 105.88
## Adhes 1 105.473 117.47 13.588 0.0002276 ***
## BNucl 1 124.813 136.81 32.929 9.558e-09 ***
## Chrom 1 109.699 121.70 17.815 2.435e-05 ***
## Mitos 1 96.494 108.49 4.609 0.0317983 *
## NNucl 1 103.711 115.71 11.826 0.0005840 ***
## Thick 1 130.842 142.84 38.957 4.332e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here is a plot of residuals vs log odds of class:
plot(residuals(m4,type = "deviance")~predict.glm(m4,type="link"),ylab = "Deviance Residuals",xlab="Predicted log odds of benign tumor")
It appears that the model does a pretty good job at predicting when the chances of a tumor being benign are really high and when they are really low. The model does a poor job of predicting the log odds of a tumor being benign are near 0. In such instances, the probability of a tumor being benign is very close to the probability of it being malignant.
Lastly, here is a way to predict the log odds of a tumor being benign, with the following inputs:
n.dat<-data.frame(Adhes=1,BNucl=1,Chrom=3,Mitos=1,NNucl=1,Thick=4)
exp(predict(m4,newdata = n.dat,se=TRUE)$fit)
## 1
## 112.1973
That is, with the given input values, this tumor has a 112.1973 odds of being benign.