Titanic

  1. Do some EDA. Since the data are 0s and 1s, calculate the proportion that survived in each class and for each sex.
f.titanic = subset(titanic, Sex == "female", select = c(Survived, Sex))
m.titanic = subset(titanic, Sex == "male", select = c(Survived, Sex))
sum(f.titanic$Survived == 1) / nrow(f.titanic) #proportion of females that survived
## [1] 0.7528958
sum(m.titanic$Survived == 1) / nrow(m.titanic) #proportion of males that survived
## [1] 0.205298
first.titanic = subset(titanic, Pclass == 1, select = c(Survived, Pclass))
second.titanic = subset(titanic, Pclass == 2, select = c(Survived, Pclass))
third.titanic = subset(titanic, Pclass == 3, select = c(Survived, Pclass))
sum(first.titanic$Survived == 1) / nrow(first.titanic) #proportion of first class passengers that survived
## [1] 0.6521739
sum(second.titanic$Survived == 1) / nrow(second.titanic) #proportion of second class passengers that survived
## [1] 0.4797688
sum(third.titanic$Survived == 1) / nrow(third.titanic) #proportion of third class passengers that survived
## [1] 0.2394366
  1. Create a logistic regression for survival using no predictors.
mod0 = glm(Survived ~ 1, data = titanic, family=binomial)
coef(mod0) #log odds
## (Intercept) 
##   -0.386773
exp(coef(mod0)) #odds
## (Intercept) 
##   0.6792453
  1. Write down the logistic regression model.

log(p/(1-p)) = -0.386773

  1. Interpret the resulting odds of survival.

The odds of survival (using no predictors) are 0.6792453.

  1. Find the probability of survival.
ilogit(exp(coef(mod0)))
## (Intercept) 
##   0.6635702

The probability of survival is 0.6635702.

  1. Create a 95 percent CI for the log odds of survival.
confint(mod0, level = 0.95)
## Waiting for profiling to be done...
##      2.5 %     97.5 % 
## -0.5372197 -0.2377555

We are 95% confident that the log odds of survival are between -0.5372197 and -0.2377555.

  1. Create and interpret a 95 percent CI for the odds of survival.
exp(confint(mod0, level=0.95))
## Waiting for profiling to be done...
##     2.5 %    97.5 % 
## 0.5843707 0.7883955

We are 95% confident that the odds of survival are between 0.5843707 and 0.7883955.

  1. Fit a logistic regression to compare odds of survival for children (0) and adults (1). Write the regression equation, interpret the coefficients.
titanic$Adult<-ifelse(titanic$Age>17.0, 1, 0)
mod1 = glm(Survived ~ Adult, data = titanic, family = binomial)
exp(coef(mod1))
## (Intercept)       Adult 
##   1.1730769   0.5201833
exp(-coef(mod1))
## (Intercept)       Adult 
##    0.852459    1.922399

p/(1-p) = 0.5201833*Adult + 1.1730769

The odds of a child surviving are 1.1730769. The odds of a child surviving are 1.922399 times the odds of an adult surviving.

  1. Fit a logistic regression to compare odds of survival for women (1) and men (0). Write the regression equation, interpret the coefficients.
mod2 = glm(Survived ~ Sex, data = titanic, family = binomial)
exp(coef(mod2))
## (Intercept)     Sexmale 
##  3.04687500  0.08478632
exp(-coef(mod2))
## (Intercept)     Sexmale 
##   0.3282051  11.7943548

p/(1-p) = 0.08478632*Sex(male) + 3.04687500

The odds of a woman surviving are 3.04687500. The odds of a woman surviving are 11.7943548 times the odds of a man surviving.

  1. Fit a logistic regression to compare odds of survival first class passengers (1), second class passengers (2) and third class passengers (3). Write the regression equation, interpret the coefficients.
mod3 = glm(Survived ~ Pclass, data = titanic, family = binomial)
exp(coef(mod3))
## (Intercept)     Pclass2     Pclass3 
##   1.8750000   0.4918519   0.1679012
exp(-coef(mod3))
## (Intercept)     Pclass2     Pclass3 
##   0.5333333   2.0331325   5.9558824

p/(1-p) = 0.4918519Pclass2 + 0.1679012Pclass3 + 1.875

The odds of a first class passenger surviving are 1.875. The odds of a first class passenger surviving are 2.0331325 times the odds of a second class passenger surviving, and 5.9558824 times the odds of a third class passenger surviving.

6.Fit a logistic regression with all three predictors. Write the regression equation, interpret the coefficients. Then use it to predict what your odds of survival would have been.

mod4 = glm(Survived ~ 0+ Adult + Sex + Pclass, data = titanic, family = binomial)
coef(mod4)
##      Adult  Sexfemale    Sexmale    Pclass2    Pclass3 
## -1.0690626  3.3126394  0.7878229 -1.0029586 -2.2045250
#my odds of survival: female, adult, second class
exp(-1.0690626 + 3.3126394*1 + 0.7878229*0 - 1.0029586*1 - 2.2045250*0) #my odds
## [1] 3.45775

Regression equation: log(p/(1-p)) = -1.0690626 + 3.3126394Sexfemale + 0.7878229Sexmale - 1.0029586Pclass2 - 2.2045250Pclass3

  1. Test for 2-way interaction terms. Use drop-in-deviance and forward selection.
interaction1 = glm(Survived ~ Adult*Sex, data = titanic, family = binomial)
interaction2 = glm(Survived ~ Adult*Pclass, data = titanic, family = binomial)
interaction3 = glm(Survived ~ Sex*Pclass, data = titanic, family = binomial)

anova(interaction1, mod1, test = "Chisq") #Adult*Sex interaction is better than only Adult
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Adult * Sex
## Model 2: Survived ~ Adult
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       708     735.19                          
## 2       710     950.87 -2  -215.69 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction1, mod2, test = "Chisq") #Adult*Sex interaction is better than only Sex
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Adult * Sex
## Model 2: Survived ~ Sex
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       708     735.19                         
## 2       710     749.57 -2  -14.383 0.000753 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction2, mod1, test = "Chisq") #Adult*PClass interaction is better than only Adult
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Adult * Pclass
## Model 2: Survived ~ Adult
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       706     831.70                          
## 2       710     950.87 -4  -119.17 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction2, mod3, test = "Chisq") #Adult*PClass interaction is better than only PClass
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Adult * Pclass
## Model 2: Survived ~ Pclass
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       706     831.70                          
## 2       709     868.11 -3   -36.41 6.132e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction3, mod2, test = "Chisq") #Sex*PClass interaction is better than only Sex
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Sex * Pclass
## Model 2: Survived ~ Sex
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       706     642.13                          
## 2       710     749.57 -4  -107.44 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction3, mod3, test = "Chisq") #Sex*PClass interaction is better than only PClass
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Sex * Pclass
## Model 2: Survived ~ Pclass
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       706     642.13                          
## 2       709     868.11 -3  -225.98 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction models have lower deviances and significant p-values in the drop in deviance tests, so all of the interaction models are better than the models with only one predictor.

Breast Cancer

  1. Fit a logistic regression with Class as the response and the other nine variables as predictors.
bmod1 = glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + UShap + USize, data = wbca, family = binomial)
exp(coef(bmod1))
##  (Intercept)        Adhes        BNucl        Chrom        Epith        Mitos 
## 7.074104e+04 6.724605e-01 6.604837e-01 5.686111e-01 9.376266e-01 5.183370e-01 
##        NNucl        Thick        UShap        USize 
## 7.508198e-01 5.343255e-01 7.556985e-01 1.058850e+00
  1. Using regression coefficients, describe the relationship between the odds of a tumor being malignant and the clump thickness. (Units are not listed in the data set. You can simply say “one unit of thickness.”)

An increase in one unit of clump thickness is associated with a multiplicative change of 0.5343255 in the odds of a tumor being benign, holding all other variables constant.

  1. Use backward elimination and the drop-in-deviance test to eliminate unnecessary predictors. (You may use an automatic function such as drop1)
drop1(bmod1, test = "Chisq") #drop USize
## Single term deletions
## 
## Model:
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + 
##     UShap + USize
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      89.464 109.46                      
## Adhes   1   98.844 116.84  9.3798  0.002194 ** 
## BNucl   1  109.000 127.00 19.5363 9.871e-06 ***
## Chrom   1   99.841 117.84 10.3767  0.001276 ** 
## Epith   1   89.613 107.61  0.1487  0.699763    
## Mitos   1   93.551 111.55  4.0868  0.043220 *  
## NNucl   1   95.204 113.20  5.7401  0.016582 *  
## Thick   1  110.239 128.24 20.7744 5.167e-06 ***
## UShap   1   90.627 108.63  1.1628  0.280887    
## USize   1   89.523 107.52  0.0592  0.807802    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bmod2 = glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + UShap, data = wbca, family = binomial) #model without USize
drop1(bmod2, test = "Chisq") #drop Epith
## Single term deletions
## 
## Model:
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + 
##     UShap
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      89.523 107.52                      
## Adhes   1   99.042 115.04  9.5185  0.002034 ** 
## BNucl   1  109.064 125.06 19.5405 9.849e-06 ***
## Chrom   1  100.153 116.15 10.6296  0.001113 ** 
## Epith   1   89.662 105.66  0.1384  0.709888    
## Mitos   1   93.552 109.55  4.0284  0.044740 *  
## NNucl   1   95.231 111.23  5.7076  0.016891 *  
## Thick   1  110.465 126.47 20.9421 4.734e-06 ***
## UShap   1   91.355 107.36  1.8313  0.175972    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bmod3 = glm(Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap, data = wbca, family = binomial) #model without USize and Epith
drop1(bmod3, test = "Chisq") #drop UShap
## Single term deletions
## 
## Model:
## Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      89.662 105.66                      
## Adhes   1  100.126 114.13 10.4647 0.0012168 ** 
## BNucl   1  109.762 123.76 20.1000 7.350e-06 ***
## Chrom   1  100.844 114.84 11.1825 0.0008257 ***
## Mitos   1   93.714 107.71  4.0520 0.0441183 *  
## NNucl   1   95.853 109.85  6.1913 0.0128379 *  
## Thick   1  110.632 124.63 20.9705 4.664e-06 ***
## UShap   1   91.884 105.88  2.2227 0.1359951    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bmod4 = glm(Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick, data = wbca, family = binomial) #model without USize, Epith, and UShap
drop1(bmod4, test = "Chisq") #all predictors are significant
## Single term deletions
## 
## Model:
## Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick
##        Df Deviance    AIC    LRT  Pr(>Chi)    
## <none>      91.884 105.88                     
## Adhes   1  105.473 117.47 13.588 0.0002276 ***
## BNucl   1  124.813 136.81 32.929 9.558e-09 ***
## Chrom   1  109.699 121.70 17.815 2.435e-05 ***
## Mitos   1   96.494 108.49  4.609 0.0317983 *  
## NNucl   1  103.711 115.71 11.826 0.0005840 ***
## Thick   1  130.842 142.84 38.957 4.332e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Write the resulting regression equation.
coef(bmod4)
## (Intercept)       Adhes       BNucl       Chrom       Mitos       NNucl 
##  11.4729593  -0.4453444  -0.4740987  -0.6331072  -0.6816490  -0.3624117 
##       Thick 
##  -0.7224688

log(p/(1-p)) = 11.4729593 - 0.4453444Adhes - 0.4740987BNucl - 0.6331072Chrom - 0.6816490Mitos - 0.3624117NNucl - 0.7224688Thick

  1. Test for lack of fit.
teststat = summary(bmod4)$deviance
pchisq(teststat, df = nrow(wbca) - length(coef(bmod4)), lower.tail = FALSE)
## [1] 1

The p-value is > 0.05, so there is no evidence of lack of fit.

  1. Check for outliers.
halfnorm(residuals(bmod4))

There appears to be 1 outlier, point 244 which does not follow the trend in the half norm plot.

  1. Plot the deviance residuals and describe what the plot tells you.
plot(residuals(bmod4, type = "deviance") ~ predict(bmod4, type = "link"),  xlab="log odds", ylab ="deviance residuals")

There is very little variability in deviance when log odds are very small or very large. When the log odds are closer to 0, residuals are larger and there is more error.

  1. Use your reduced model to predict the odds of a tumor being malignant for a new patient whose values are 1, 1, 3, 2, 1, 1, 4, 1, 1. (The order is Adhes, BNucl, Chrom, Epith, Mitos, NNucl, Thick, UShap, USize.)
newdat <- data.frame(Adhes=1, BNucl= 1,Chrom= 3, Epith=2,Mitos=1,
NNucl= 1,Thick=4,UShap=1,USize=1)
predict(bmod4, newdata = newdat, se=TRUE)
## $fit
##        1 
## 4.720259 
## 
## $se.fit
## [1] 0.5699009
## 
## $residual.scale
## [1] 1
exp(4.72025)
## [1] 112.1963

Using the reduced model, the odds of the tumor being benign are 112.1963.

  1. Create a 95 percent prediction interval to predict the probability of the tumor (from the previous problem) being malignant. Use your reduced model.
predict.out <- predict(bmod4, newdata = newdat, se=TRUE)
names(predict.out)
## [1] "fit"            "se.fit"         "residual.scale"
pt.est <- predict.out$fit
se.est <- predict.out$se.fit
logoddsCI<-pt.est+c(-1,1)*1.96*se.est
logoddsCI
## [1] 3.603253 5.837264
ilogit(logoddsCI)
## [1] 0.9734871 0.9970917

We are 95% confident that the probability of the tumor being benign is between 0.9734871 and 0.9970917.