This week we continued to work on Logistic Regression. We practiced on a Titanic dataset.

library(stableGR)
## Warning: package 'stableGR' was built under R version 3.6.3
## Loading required package: mcmcse
## Warning: package 'mcmcse' was built under R version 3.6.3
## mcmcse: Monte Carlo Standard Errors for MCMC
## Version 1.4-1 created on 2020-01-29.
## copyright (c) 2012, James M. Flegal, University of California, Riverside
##                     John Hughes, University of Colorado, Denver
##                     Dootika Vats, University of Warwick
##                     Ning Dai, University of Minnesota
##  For citation information, type citation("mcmcse").
##  Type help("mcmcse-package") to get started.
data(titanic.complete)
titanic <- titanic.complete

The response variable that we modeled was ‘Survive’ which is a 1 for passengers who survived and 0 if they did not. For predictors we used ‘Sex’, ‘Age’, and ‘Pclass’ (1st, 2nd, or 3rd class).

head(titanic)
##   Survived Pclass    Sex Age SibSp Parch    Fare Embarked
## 1        0      3   male  22     1     0  7.2500        S
## 2        1      1 female  38     1     0 71.2833        C
## 3        1      3 female  26     0     0  7.9250        S
## 4        1      1 female  35     1     0 53.1000        S
## 5        0      3   male  35     0     0  8.0500        S
## 7        0      1   male  54     0     0 51.8625        S

Below is a model without predictors. This gives the log odds of survival for the average passenger.

myresp<-cbind(titanic$Survived, 1-titanic$Survived)
m1<-glm(myresp~1, family=binomial)
summary(m1)
## 
## Call:
## glm(formula = myresp ~ 1, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.018  -1.018  -1.018   1.345   1.345  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.38677    0.07636  -5.065 4.08e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.9  on 711  degrees of freedom
## Residual deviance: 960.9  on 711  degrees of freedom
## AIC: 962.9
## 
## Number of Fisher Scoring iterations: 4

Exponentiating the coefficient gives:

exp(-0.3868)
## [1] 0.6792269

The average odds of survival for a passenger on the Titanic are .67922.

Eventually, we made a model that used class, sex, and adult (adults considered anyone > 18 years old). Here is that model.

#creating variables
sex.binary<-ifelse(titanic$Sex=="female",1,0)
titanic<-cbind(titanic,sex.binary)
adult<-as.numeric(titanic$Age>18,length(titanic))
titanic<-cbind(titanic,adult)

m5<-glm(Survived~Pclass+adult+sex.binary,data = titanic, family=binomial)
summary(m5)
## 
## Call:
## glm(formula = Survived ~ Pclass + adult + sex.binary, family = binomial, 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4749  -0.7035  -0.4088   0.7138   2.2467  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.5070     0.2998   1.691 0.090805 .  
## Pclass2      -0.9927     0.2597  -3.822 0.000132 ***
## Pclass3      -2.1628     0.2509  -8.621  < 2e-16 ***
## adult        -0.7845     0.2495  -3.144 0.001667 ** 
## sex.binary    2.5076     0.2048  12.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.90  on 711  degrees of freedom
## Residual deviance: 662.13  on 707  degrees of freedom
## AIC: 672.13
## 
## Number of Fisher Scoring iterations: 4
exp(m5$coefficients)
## (Intercept)     Pclass2     Pclass3       adult  sex.binary 
##   1.6603256   0.3705827   0.1149978   0.4563587  12.2757124

Here is the regression equation:

\[log(\frac{p}{1-p})=0.5070-0.9927I(2nd Class)-2.1628 I(3rdClass)-0.7845I(Adult)+2.5076 I(Female) \]

By looking at the exponentiated coefficients above, we can understand the relationship between each predictor and the odds of survival. Note that as a person’s class decreases, their odds of survival decrease as well (holding all else constant). In particular, the odds that a 3rd class passenger survived were .11499 times the odds that a 1st class passenger survived. Also, a female (holding class and age constant) had odds of surviving 12.2757 times the odds of a male.

We also worked with a breast cancer dataset.

library(faraway)
## Warning: package 'faraway' was built under R version 3.6.3
data(wbca)

Below is a model predicting the log odds of a tumor being benign:

m1<-glm(Class~.,data = wbca,family = binomial)
summary(m1)
## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = wbca)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.48282  -0.01179   0.04739   0.09678   3.06425  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 11.16678    1.41491   7.892 2.97e-15 ***
## Adhes       -0.39681    0.13384  -2.965  0.00303 ** 
## BNucl       -0.41478    0.10230  -4.055 5.02e-05 ***
## Chrom       -0.56456    0.18728  -3.014  0.00257 ** 
## Epith       -0.06440    0.16595  -0.388  0.69795    
## Mitos       -0.65713    0.36764  -1.787  0.07387 .  
## NNucl       -0.28659    0.12620  -2.271  0.02315 *  
## Thick       -0.62675    0.15890  -3.944 8.01e-05 ***
## UShap       -0.28011    0.25235  -1.110  0.26699    
## USize        0.05718    0.23271   0.246  0.80589    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 881.388  on 680  degrees of freedom
## Residual deviance:  89.464  on 671  degrees of freedom
## AIC: 109.46
## 
## Number of Fisher Scoring iterations: 8

This model uses all of the available predictors. We can use the drop1() function to perform drop-in-deviance tests and drop one predictor at a time.

drop1(m1,test = "Chisq")
## Single term deletions
## 
## Model:
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + 
##     UShap + USize
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      89.464 109.46                      
## Adhes   1   98.844 116.84  9.3798  0.002194 ** 
## BNucl   1  109.000 127.00 19.5363 9.871e-06 ***
## Chrom   1   99.841 117.84 10.3767  0.001276 ** 
## Epith   1   89.613 107.61  0.1487  0.699763    
## Mitos   1   93.551 111.55  4.0868  0.043220 *  
## NNucl   1   95.204 113.20  5.7401  0.016582 *  
## Thick   1  110.239 128.24 20.7744 5.167e-06 ***
## UShap   1   90.627 108.63  1.1628  0.280887    
## USize   1   89.523 107.52  0.0592  0.807802    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To start, we would drop the predictor that has the highest p-value on the right. Therefore, we would first drop USize. Continue that process until all of the predictors are significant. That would look like this:

m4<-glm(Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick,data = wbca,family = binomial)
drop1(m4,test = "Chisq")
## Single term deletions
## 
## Model:
## Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick
##        Df Deviance    AIC    LRT  Pr(>Chi)    
## <none>      91.884 105.88                     
## Adhes   1  105.473 117.47 13.588 0.0002276 ***
## BNucl   1  124.813 136.81 32.929 9.558e-09 ***
## Chrom   1  109.699 121.70 17.815 2.435e-05 ***
## Mitos   1   96.494 108.49  4.609 0.0317983 *  
## NNucl   1  103.711 115.71 11.826 0.0005840 ***
## Thick   1  130.842 142.84 38.957 4.332e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here is a plot of residuals vs log odds of class:

plot(residuals(m4,type = "deviance")~predict.glm(m4,type="link"),ylab = "Deviance Residuals",xlab="Predicted log odds of benign tumor")

It appears that the model does a pretty good job at predicting when the chances of a tumor being benign are really high and when they are really low. The model does a poor job of predicting the log odds of a tumor being benign are near 0. In such instances, the probability of a tumor being benign is very close to the probability of it being malignant.

Lastly, here is a way to predict the log odds of a tumor being benign, with the following inputs:

n.dat<-data.frame(Adhes=1,BNucl=1,Chrom=3,Mitos=1,NNucl=1,Thick=4)
exp(predict(m4,newdata = n.dat,se=TRUE)$fit)
##        1 
## 112.1973

That is, with the given input values, this tumor has a 112.1973 odds of being benign.