Titanic
f.titanic = subset(titanic, Sex == "female", select = c(Survived, Sex))
m.titanic = subset(titanic, Sex == "male", select = c(Survived, Sex))
sum(f.titanic$Survived == 1) / nrow(f.titanic) #proportion of females that survived
## [1] 0.7528958
sum(m.titanic$Survived == 1) / nrow(m.titanic) #proportion of males that survived
## [1] 0.205298
first.titanic = subset(titanic, Pclass == 1, select = c(Survived, Pclass))
second.titanic = subset(titanic, Pclass == 2, select = c(Survived, Pclass))
third.titanic = subset(titanic, Pclass == 3, select = c(Survived, Pclass))
sum(first.titanic$Survived == 1) / nrow(first.titanic) #proportion of first class passengers that survived
## [1] 0.6521739
sum(second.titanic$Survived == 1) / nrow(second.titanic) #proportion of second class passengers that survived
## [1] 0.4797688
sum(third.titanic$Survived == 1) / nrow(third.titanic) #proportion of third class passengers that survived
## [1] 0.2394366
mod0 = glm(Survived ~ 1, data = titanic, family=binomial)
coef(mod0) #log odds
## (Intercept)
## -0.386773
exp(coef(mod0)) #odds
## (Intercept)
## 0.6792453
log(p/(1-p)) = -0.386773
The odds of survival (using no predictors) are 0.6792453.
ilogit(exp(coef(mod0)))
## (Intercept)
## 0.6635702
The probability of survival is 0.6635702.
confint(mod0, level = 0.95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## -0.5372197 -0.2377555
We are 95% confident that the log odds of survival are between -0.5372197 and -0.2377555.
exp(confint(mod0, level=0.95))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 0.5843707 0.7883955
We are 95% confident that the odds of survival are between 0.5843707 and 0.7883955.
titanic$Adult<-ifelse(titanic$Age>17.0, 1, 0)
mod1 = glm(Survived ~ Adult, data = titanic, family = binomial)
exp(coef(mod1))
## (Intercept) Adult
## 1.1730769 0.5201833
exp(-coef(mod1))
## (Intercept) Adult
## 0.852459 1.922399
p/(1-p) = 0.5201833*Adult + 1.1730769
The odds of a child surviving are 1.1730769. The odds of a child surviving are 1.922399 times the odds of an adult surviving.
mod2 = glm(Survived ~ Sex, data = titanic, family = binomial)
exp(coef(mod2))
## (Intercept) Sexmale
## 3.04687500 0.08478632
exp(-coef(mod2))
## (Intercept) Sexmale
## 0.3282051 11.7943548
p/(1-p) = 0.08478632*Sex(male) + 3.04687500
The odds of a woman surviving are 3.04687500. The odds of a woman surviving are 11.7943548 times the odds of a man surviving.
mod3 = glm(Survived ~ Pclass, data = titanic, family = binomial)
exp(coef(mod3))
## (Intercept) Pclass2 Pclass3
## 1.8750000 0.4918519 0.1679012
exp(-coef(mod3))
## (Intercept) Pclass2 Pclass3
## 0.5333333 2.0331325 5.9558824
p/(1-p) = 0.4918519Pclass2 + 0.1679012Pclass3 + 1.875
The odds of a first class passenger surviving are 1.875. The odds of a first class passenger surviving are 2.0331325 times the odds of a second class passenger surviving, and 5.9558824 times the odds of a third class passenger surviving.
6.Fit a logistic regression with all three predictors. Write the regression equation, interpret the coefficients. Then use it to predict what your odds of survival would have been.
mod4 = glm(Survived ~ 0+ Adult + Sex + Pclass, data = titanic, family = binomial)
coef(mod4)
## Adult Sexfemale Sexmale Pclass2 Pclass3
## -1.0690626 3.3126394 0.7878229 -1.0029586 -2.2045250
#my odds of survival: female, adult, second class
exp(-1.0690626 + 3.3126394*1 + 0.7878229*0 - 1.0029586*1 - 2.2045250*0) #my odds
## [1] 3.45775
Regression equation: log(p/(1-p)) = -1.0690626 + 3.3126394Sexfemale + 0.7878229Sexmale - 1.0029586Pclass2 - 2.2045250Pclass3
interaction1 = glm(Survived ~ Adult*Sex, data = titanic, family = binomial)
interaction2 = glm(Survived ~ Adult*Pclass, data = titanic, family = binomial)
interaction3 = glm(Survived ~ Sex*Pclass, data = titanic, family = binomial)
anova(interaction1, mod1, test = "Chisq") #Adult*Sex interaction is better than only Adult
## Analysis of Deviance Table
##
## Model 1: Survived ~ Adult * Sex
## Model 2: Survived ~ Adult
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 708 735.19
## 2 710 950.87 -2 -215.69 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction1, mod2, test = "Chisq") #Adult*Sex interaction is better than only Sex
## Analysis of Deviance Table
##
## Model 1: Survived ~ Adult * Sex
## Model 2: Survived ~ Sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 708 735.19
## 2 710 749.57 -2 -14.383 0.000753 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction2, mod1, test = "Chisq") #Adult*PClass interaction is better than only Adult
## Analysis of Deviance Table
##
## Model 1: Survived ~ Adult * Pclass
## Model 2: Survived ~ Adult
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 706 831.70
## 2 710 950.87 -4 -119.17 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction2, mod3, test = "Chisq") #Adult*PClass interaction is better than only PClass
## Analysis of Deviance Table
##
## Model 1: Survived ~ Adult * Pclass
## Model 2: Survived ~ Pclass
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 706 831.70
## 2 709 868.11 -3 -36.41 6.132e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction3, mod2, test = "Chisq") #Sex*PClass interaction is better than only Sex
## Analysis of Deviance Table
##
## Model 1: Survived ~ Sex * Pclass
## Model 2: Survived ~ Sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 706 642.13
## 2 710 749.57 -4 -107.44 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(interaction3, mod3, test = "Chisq") #Sex*PClass interaction is better than only PClass
## Analysis of Deviance Table
##
## Model 1: Survived ~ Sex * Pclass
## Model 2: Survived ~ Pclass
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 706 642.13
## 2 709 868.11 -3 -225.98 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction models have lower deviances and significant p-values in the drop in deviance tests, so all of the interaction models are better than the models with only one predictor.
Breast Cancer
bmod1 = glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + UShap + USize, data = wbca, family = binomial)
exp(coef(bmod1))
## (Intercept) Adhes BNucl Chrom Epith Mitos
## 7.074104e+04 6.724605e-01 6.604837e-01 5.686111e-01 9.376266e-01 5.183370e-01
## NNucl Thick UShap USize
## 7.508198e-01 5.343255e-01 7.556985e-01 1.058850e+00
An increase in one unit of clump thickness is associated with a multiplicative change of 0.5343255 in the odds of a tumor being benign, holding all other variables constant.
drop1(bmod1, test = "Chisq") #drop USize
## Single term deletions
##
## Model:
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick +
## UShap + USize
## Df Deviance AIC LRT Pr(>Chi)
## <none> 89.464 109.46
## Adhes 1 98.844 116.84 9.3798 0.002194 **
## BNucl 1 109.000 127.00 19.5363 9.871e-06 ***
## Chrom 1 99.841 117.84 10.3767 0.001276 **
## Epith 1 89.613 107.61 0.1487 0.699763
## Mitos 1 93.551 111.55 4.0868 0.043220 *
## NNucl 1 95.204 113.20 5.7401 0.016582 *
## Thick 1 110.239 128.24 20.7744 5.167e-06 ***
## UShap 1 90.627 108.63 1.1628 0.280887
## USize 1 89.523 107.52 0.0592 0.807802
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bmod2 = glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + UShap, data = wbca, family = binomial) #model without USize
drop1(bmod2, test = "Chisq") #drop Epith
## Single term deletions
##
## Model:
## Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick +
## UShap
## Df Deviance AIC LRT Pr(>Chi)
## <none> 89.523 107.52
## Adhes 1 99.042 115.04 9.5185 0.002034 **
## BNucl 1 109.064 125.06 19.5405 9.849e-06 ***
## Chrom 1 100.153 116.15 10.6296 0.001113 **
## Epith 1 89.662 105.66 0.1384 0.709888
## Mitos 1 93.552 109.55 4.0284 0.044740 *
## NNucl 1 95.231 111.23 5.7076 0.016891 *
## Thick 1 110.465 126.47 20.9421 4.734e-06 ***
## UShap 1 91.355 107.36 1.8313 0.175972
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bmod3 = glm(Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap, data = wbca, family = binomial) #model without USize and Epith
drop1(bmod3, test = "Chisq") #drop UShap
## Single term deletions
##
## Model:
## Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap
## Df Deviance AIC LRT Pr(>Chi)
## <none> 89.662 105.66
## Adhes 1 100.126 114.13 10.4647 0.0012168 **
## BNucl 1 109.762 123.76 20.1000 7.350e-06 ***
## Chrom 1 100.844 114.84 11.1825 0.0008257 ***
## Mitos 1 93.714 107.71 4.0520 0.0441183 *
## NNucl 1 95.853 109.85 6.1913 0.0128379 *
## Thick 1 110.632 124.63 20.9705 4.664e-06 ***
## UShap 1 91.884 105.88 2.2227 0.1359951
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bmod4 = glm(Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick, data = wbca, family = binomial) #model without USize, Epith, and UShap
drop1(bmod4, test = "Chisq") #all predictors are significant
## Single term deletions
##
## Model:
## Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick
## Df Deviance AIC LRT Pr(>Chi)
## <none> 91.884 105.88
## Adhes 1 105.473 117.47 13.588 0.0002276 ***
## BNucl 1 124.813 136.81 32.929 9.558e-09 ***
## Chrom 1 109.699 121.70 17.815 2.435e-05 ***
## Mitos 1 96.494 108.49 4.609 0.0317983 *
## NNucl 1 103.711 115.71 11.826 0.0005840 ***
## Thick 1 130.842 142.84 38.957 4.332e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(bmod4)
## (Intercept) Adhes BNucl Chrom Mitos NNucl
## 11.4729593 -0.4453444 -0.4740987 -0.6331072 -0.6816490 -0.3624117
## Thick
## -0.7224688
log(p/(1-p)) = 11.4729593 - 0.4453444Adhes - 0.4740987BNucl - 0.6331072Chrom - 0.6816490Mitos - 0.3624117NNucl - 0.7224688Thick
teststat = summary(bmod4)$deviance
pchisq(teststat, df = nrow(wbca) - length(coef(bmod4)), lower.tail = FALSE)
## [1] 1
The p-value is > 0.05, so there is no evidence of lack of fit.
halfnorm(residuals(bmod4))
There appears to be 1 outlier, point 244 which does not follow the trend in the half norm plot.
plot(residuals(bmod4, type = "deviance") ~ predict(bmod4, type = "link"), xlab="log odds", ylab ="deviance residuals")
There is very little variability in deviance when log odds are very small or very large. When the log odds are closer to 0, residuals are larger and there is more error.
newdat <- data.frame(Adhes=1, BNucl= 1,Chrom= 3, Epith=2,Mitos=1,
NNucl= 1,Thick=4,UShap=1,USize=1)
predict(bmod4, newdata = newdat, se=TRUE)
## $fit
## 1
## 4.720259
##
## $se.fit
## [1] 0.5699009
##
## $residual.scale
## [1] 1
exp(4.72025)
## [1] 112.1963
Using the reduced model, the odds of the tumor being benign are 112.1963.
predict.out <- predict(bmod4, newdata = newdat, se=TRUE)
names(predict.out)
## [1] "fit" "se.fit" "residual.scale"
pt.est <- predict.out$fit
se.est <- predict.out$se.fit
logoddsCI<-pt.est+c(-1,1)*1.96*se.est
logoddsCI
## [1] 3.603253 5.837264
ilogit(logoddsCI)
## [1] 0.9734871 0.9970917
We are 95% confident that the probability of the tumor being benign is between 0.9734871 and 0.9970917.