The following are notes from readings in ‘Extending the Linear Model with R’ by Julian Faraway for the course DATA621, ‘Business Analytics and Data Mining’, part of the Master of Science in Data Science program at CUNY SPS.
R Libraries Used:
library( faraway )
library( dplyr )
library( ggplot2 )
library( gridExtra )
library( tidyverse )
library( MASS )
library( brglm2 )
library( survival )
Binomial Data
The Challenger Disaster Example
We are interested in how the probability of failure in a given O-ring is related to the launch temperature and predicting the probability when the temperature is \(31^\circ\)F.
data( orings )
glimpse( orings )
## Rows: 23
## Columns: 2
## $ temp <dbl> 53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73,…
## $ damage <dbl> 5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, …
Visualize the data and fit a simple linear regression model:
p1 <- ggplot( data = orings, aes( x = temp, y = damage/6 ) ) +
geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
#geom_jitter(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = 'lm', col = 'red', se = F ) +
ylab( 'Prob of Damage' ) +
xlab( 'Temperature' ) +
ggtitle( 'Naive Approach: simple linear regression' ) +
theme_classic()
p1
## `geom_smooth()` using formula 'y ~ x'
However, there are several problems with the simple linear regression approach. Most obviously, it can predict probabilities \(\gt 1\) or \(\lt 0\), which are not possible values.
Binomial Regression Model
\[P(k \mbox{ successes out of } n) = \frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}\] so for group \(i\), with \(n_i\) trials and \(y_i\) successes, \[P(Y_i = y_i) = \binom{n_i}{y_i}p_i^{y_i}(1-p_i)^{n_i-y_i},\] i.e. the number of ways the successes can be arranged times the probability of each arrangement. To describe the relationship of the predictors \(x_{i1},\dots,x_{iq}\) to \(p_i\) (the success probability), we set up a linear predictor \(\eta_i = \beta_0 + \beta_1x_{i1} + \dots + \beta_qx_{iq}\).
Setting \(\eta_i = p_i\) is not appropriate, because we require \(0\leq p_i \leq 1\), so here are some other choices of link function:
- Logit: \(\eta = \log( p/(1-p) )\)
- Probit: \(\eta = \Phi^{-1}(p)\), where \(\Phi\) is the standard normal cumulative distribution function
- Complementary log-log: \(\eta = \log(-\log(1-p))\)
Link functions connect the linear predictor to the mean of the response in this wider class of models.
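To see how each inverse link maps \(\eta\) onto \((0,1)\), a quick sketch (the grid of \(\eta\) values is arbitrary; ilogit is from the faraway package):
eta <- seq( -4, 4, by = 2 )
p_logit <- ilogit( eta )             # logit: p = exp(eta)/(1 + exp(eta))
p_probit <- pnorm( eta )             # probit: p = Phi(eta)
p_cloglog <- 1 - exp( -exp( eta ) )  # complementary log-log
cbind( eta, p_logit, p_probit, p_cloglog )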
Here we will use the logit as the link function:
logitmod <- glm( cbind( damage, 6 - damage ) ~ temp, family = binomial, orings )
sum_logit <- summary( logitmod )
sum_logit$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.6629897 3.29626315 3.538246 4.027947e-04
## temp -0.2162337 0.05317703 -4.066298 4.776581e-05
Visualize the logit and probit fits to the data:
p2 <- ggplot( data = orings, aes( x = temp, y = damage/6 ) ) +
geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = 'glm', col = 'blue', method.args = list( family = binomial(link = "probit")), se = F ) +
geom_smooth( method = 'glm', col = 'red', method.args = list( family = 'binomial'), se = F ) +
ylab( 'Prob of Damage' ) +
xlab( 'Temperature' ) +
ggtitle( 'GLM: logistic regression' ) +
theme_classic()
p2
## `geom_smooth()` using formula 'y ~ x'
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## `geom_smooth()` using formula 'y ~ x'
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
The warnings arise because geom_smooth is given the proportion damage/6 rather than the success/failure counts; the fitted curves are still fine for visualization.
Look at the coefficients for the probit GLM:
probitmod <- glm( cbind( damage, 6 - damage ) ~ temp, family = binomial( link = "probit"), orings )
sum_probit <- summary( probitmod )
sum_probit$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.5914531 1.71054594 3.268812 1.08000e-03
## temp -0.1058039 0.02656052 -3.983501 6.79075e-05
Although the coefficients are quite different, the fits are similar, particularly in the range of the data.
Now predict the probability of damage at \(31^\circ\)F:
ilogit( sum_logit$coefficients[1,1] + sum_logit$coefficients[2,1]*31 )
## [1] 0.9930342
pnorm( sum_probit$coefficients[1,1] + sum_probit$coefficients[2,1]*31 )
## [1] 0.9895983
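The same predictions can be obtained directly from predict() with type = "response", which applies the inverse link automatically:
predict( logitmod, newdata = data.frame( temp = 31 ), type = "response" )
predict( probitmod, newdata = data.frame( temp = 31 ), type = "response" )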
Inference
sum_logit$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.6629897 3.29626315 3.538246 4.027947e-04
## temp -0.2162337 0.05317703 -4.066298 4.776581e-05
pchisq( deviance( logitmod ), df.residual( logitmod ), lower = F )
## [1] 0.7164099
The large p-value from this deviance-based goodness-of-fit test indicates no evidence of lack of fit.
Construct a profile likelihood-based confidence interval:
confint( logitmod )
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 5.575195 18.737598
## temp -0.332657 -0.120179
Interpreting Odds
Odds are sometimes a better scale than probability to represent chance. Odds arose as a way to express the payoffs for bets.
\[o = \frac{p}{1-p} \qquad p = \frac{o}{1 + o}\]
Odds also form the basis of a subjective assessment of probability.
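A quick numeric check of these conversions:
p <- 0.75
o <- p/( 1 - p ) # odds of 3, i.e. 3 to 1
o/( 1 + o )      # recovers p = 0.75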
If we have two covariates (\(x_1 \mbox{ & } x_2\)) then:
\[\log( \mbox{odds} ) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2\]
Example: incidence of respiratory disease in infants up to the age of 1 year.
data( babyfood )
xtabs( disease/( disease + nondisease ) ~ sex + food, data = babyfood )
## food
## sex Bottle Breast Suppl
## Boy 0.16812227 0.09514170 0.12925170
## Girl 0.12500000 0.06681034 0.12598425
Fit the model:
babyfood_mod1 <- glm( cbind( disease, nondisease ) ~ sex + food, family = binomial, data = babyfood )
sum_bfood_mod1 <- summary( babyfood_mod1 )
sum_bfood_mod1
##
## Call:
## glm(formula = cbind(disease, nondisease) ~ sex + food, family = binomial,
## data = babyfood)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## 0.1096 -0.5052 0.1922 -0.1342 0.5896 -0.2284
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6127 0.1124 -14.347 < 2e-16 ***
## sexGirl -0.3126 0.1410 -2.216 0.0267 *
## foodBreast -0.6693 0.1530 -4.374 1.22e-05 ***
## foodSuppl -0.1725 0.2056 -0.839 0.4013
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26.37529 on 5 degrees of freedom
## Residual deviance: 0.72192 on 2 degrees of freedom
## AIC: 40.24
##
## Number of Fisher Scoring iterations: 4
Is there a sex-food interaction? Look at the residual deviance: it is small for the given degrees of freedom, so we conclude that there is no evidence of an interaction effect.
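To make this explicit: with six cells and six parameters, the model with a sex:food interaction is saturated (zero deviance), so the residual deviance of the additive model is itself the likelihood-ratio statistic for the interaction:
pchisq( deviance( babyfood_mod1 ), df.residual( babyfood_mod1 ), lower = F )
The resulting p-value (about 0.70) confirms there is no evidence of an interaction.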
Test the main effects:
drop1( babyfood_mod1, test = 'Chi' )
## Single term deletions
##
## Model:
## cbind(disease, nondisease) ~ sex + food
## Df Deviance AIC LRT Pr(>Chi)
## <none> 0.7219 40.240
## sex 1 5.6990 43.217 4.9771 0.02569 *
## food 2 20.8992 56.417 20.1772 4.155e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The drop1 result shows that both predictors are significant.
exp( sum_bfood_mod1$coefficients[3,1] )
## [1] 0.5120696
Breast feeding reduces the odds of respiratory disease to 51% of the odds for bottle feeding.
Find confidence intervals:
exp( confint( babyfood_mod1 ) )
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.1591988 0.2474333
## sexGirl 0.5536209 0.9629225
## foodBreast 0.3781905 0.6895181
## foodSuppl 0.5555372 1.2464312
Prospective and Retrospective Sampling
- Prospective sampling (‘cohort study’): the predictors are fixed and then the outcome is observed, e.g. a subject group presenting with certain features is monitored over time.
- Retrospective sampling (‘case-control study’): the outcome is fixed and then the predictors are observed; a group of cases (e.g. with the disease) and a separate group of disease-free controls are examined.
Retrospective studies are cheaper, faster, and more convenient, but they are less reliable because they typically rely on historical records, which are prone to inaccuracies and incompleteness.
babyfood[c(1,3), ]
## disease nondisease sex food
## 1 77 381 Boy Bottle
## 3 47 447 Boy Breast
- given the infant is breast fed, the log-odds of having a respiratory disease are log( 47/447 )
- given the infant is bottle fed, the log-odds of having respiratory disease are log( 77/381 )
- log-odds ratio: the difference between these two represents the increased risk of respiratory disease incurred by bottle feeding relative to breast feeding
log( 77/381 ) - log( 47/447 )
## [1] 0.653417
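Equivalently, the log-odds ratio is the log of the cross-product ratio of the 2x2 table:
log( ( 77*447 )/( 381*47 ) ) # same value as above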
Choice of Link Function
data( bliss )
glimpse( bliss )
## Rows: 5
## Columns: 3
## $ dead <dbl> 2, 8, 15, 23, 27
## $ alive <dbl> 28, 22, 15, 7, 3
## $ conc <int> 0, 1, 2, 3, 4
Fit all three link functions:
blissm1 <- glm( cbind( dead, alive ) ~ conc, family = binomial, data = bliss )
blissm2 <- glm( cbind( dead, alive ) ~ conc, family = binomial( link = probit ), data = bliss )
blissm3 <- glm( cbind( dead, alive ) ~ conc, family = binomial( link = cloglog ), data = bliss )
Visualize the three fits:
par( mfrow=c(1,3) )
x <- seq( -2, 8, 0.2 )
p1 <- ilogit( blissm1$coefficients[1] + blissm1$coefficients[2]*x )
p2 <- pnorm( blissm2$coefficients[1] + blissm2$coefficients[2]*x )
p3 <- 1-exp( -exp( blissm3$coefficients[1] + blissm3$coefficients[2]*x ) )
plot( x, p1, type = "l", ylab = "Probability", xlab = "Dose" )
lines( x, p2, lty = 2 )
lines( x, p3, lty = 5 )
matplot( x, cbind( p2/p1, (1-p2)/(1-p1)), type = 'l', xlab = "Dose", ylab = "Ratio" )
matplot( x, cbind( p3/p1, (1-p3)/(1-p1)), type = 'l', xlab = "Dose", ylab = "Ratio" )
The fits are very close to each other in the region that spans the data (concentrations 0 to 4). However, they deviate in the tails, as the ratio plots show.
The default choice is the logit link. It has three advantages: it leads to simpler mathematics (owing to the intractability of \(\Phi\)); it is easier to interpret via odds; and it allows easier analysis of retrospectively sampled data.
Estimation Problems
data( hormone )
plot( estrogen ~ androgen, data = hormone, col = orientation )
Fit a binomial model to see if orientation can be predicted from the two hormone values.
When the response is binary, we can use it directly in the glm function:
hormone_mod <- glm( orientation ~ estrogen + androgen, data = hormone, family = binomial )
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary( hormone_mod )
##
## Call:
## glm(formula = orientation ~ estrogen + androgen, family = binomial,
## data = hormone)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.759e-05 -2.100e-08 -2.100e-08 2.100e-08 3.380e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -84.49 136095.03 -0.001 1.000
## estrogen -90.22 75910.98 -0.001 0.999
## androgen 100.91 92755.62 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3.5426e+01 on 25 degrees of freedom
## Residual deviance: 2.3229e-09 on 23 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
The residual deviance is very small, indicating an essentially perfect fit, yet the algorithm did not converge.
A look at the plot of the data reveals that it is linearly separable, so a perfect fit is possible:
plot( estrogen ~ androgen, data = hormone, col = orientation )
abline( -hormone_mod$coefficients[1]/hormone_mod$coefficients[2],
-hormone_mod$coefficients[3]/hormone_mod$coefficients[2],
col = 'magenta')
Use David Firth’s method, which always gives finite estimates.
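A minimal sketch using the brglm2 package loaded above, which provides Firth’s bias-reduced estimation as a fitting method for glm:
# bias-reduced fit: coefficients stay finite even under complete separation
hormone_brmod <- glm( orientation ~ estrogen + androgen, data = hormone,
family = binomial, method = "brglmFit" )
summary( hormone_brmod )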
Goodness of Fit
Deviance is a good measure of fit, but there are others: Pearson’s \(\chi^2\), which is analogous to the residual sum of squares in normal linear models.
# Pearson's chi-squared
sum( residuals( blissm1, type = 'pearson' )^2 )
## [1] 0.3672674
# deviance
deviance( blissm1 )
## [1] 0.3787483
We can see that there is little difference between the two statistics.
Nagelkerke’s (1991) statistic measures how well the model explains the data:
( 1 - exp( ( blissm1$deviance - blissm1$null.deviance )/150 ) )/( 1 - exp( -blissm1$null.deviance/150 ) )
## [1] 0.9953178
That indicates a very good fit.
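The same statistic wrapped in a small helper function (a sketch; n is the total number of binomial observations, here \(5 \times 30 = 150\)):
nagelkerke <- function( mod, n ) {
( 1 - exp( ( mod$deviance - mod$null.deviance )/n ) )/
( 1 - exp( -mod$null.deviance/n ) )
}
nagelkerke( blissm1, n = 150 ) # should reproduce the value above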
Prediction and Effective Doses
We wish to predict the outcome for given values of the covariates. Use the inverse of the link function to get the prediction on the probability scale:
blissm1_sum <- summary( blissm1 )
x0 <- c( 1, 2.5 )
eta0 <- sum( x0*coef( blissm1 ) )
ilogit( eta0 )
## [1] 0.6412854
There is a 64% chance of death at this dose. Now find the confidence interval:
bliss_pred <- predict( blissm1, newdata = data.frame( conc = 2.5 ), se = T )
c1 <- bliss_pred$fit
c2 <- bliss_pred$se.fit
ilogit( c( c1 - 1.96*c2, c1 + 1.96*c2 ) )
## 1 1
## 0.5342962 0.7358471
\[\mbox{Lethal Dose: } \widehat{ED}_{50} = \frac{-\hat{\beta}_0}{\hat{\beta}_1}\] \[\mbox{Effective Dose: } \hat{x}_p = \frac{\mbox{logit}(p)-\hat{\beta}_0}{\hat{\beta}_1}\]
dose.p( blissm1, p = c( 0.5, 0.9 ) )
## Dose SE
## p = 0.5: 2.00000 0.1784367
## p = 0.9: 3.89107 0.3449965
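As a check, the effective doses follow directly from the formulas above (logit here is the faraway package’s logit function):
-coef( blissm1 )[1]/coef( blissm1 )[2]                   # ED50 = -b0/b1, should give 2
( logit( 0.9 ) - coef( blissm1 )[1] )/coef( blissm1 )[2] # ED90, should match dose.p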
Overdispersion
data( troutegg )
ftable( xtabs( cbind( survive, total ) ~ location + period, data = troutegg ) )
## survive total
## location period
## 1 4 89 94
## 7 94 98
## 8 77 86
## 11 141 155
## 2 4 106 108
## 7 91 106
## 8 87 96
## 11 104 122
## 3 4 119 123
## 7 100 130
## 8 88 119
## 11 91 125
## 4 4 104 104
## 7 80 97
## 8 67 99
## 11 111 132
## 5 4 49 93
## 7 11 113
## 8 18 88
## 11 0 138
Fit a GLM with the two main effects:
eggmod <- glm( cbind( survive, total-survive ) ~ location + period, family = binomial, data = troutegg )
summary( eggmod )
##
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period,
## family = binomial, data = troutegg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8305 -0.3650 -0.0303 0.6191 3.2434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.6358 0.2813 16.479 < 2e-16 ***
## location2 -0.4168 0.2461 -1.694 0.0903 .
## location3 -1.2421 0.2194 -5.660 1.51e-08 ***
## location4 -0.9509 0.2288 -4.157 3.23e-05 ***
## location5 -4.6138 0.2502 -18.439 < 2e-16 ***
## period7 -2.1702 0.2384 -9.103 < 2e-16 ***
## period8 -2.3256 0.2429 -9.573 < 2e-16 ***
## period11 -2.4500 0.2341 -10.466 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1021.469 on 19 degrees of freedom
## Residual deviance: 64.495 on 12 degrees of freedom
## AIC: 157.03
##
## Number of Fisher Scoring iterations: 5
Check for outliers:
halfnorm( residuals( eggmod ) )
Plot the empirical logits:
elogits <- log( ( troutegg$survive + 0.5 )/( troutegg$total - troutegg$survive + 0.5 ) )
with( troutegg, interaction.plot( period, location, elogits ) )
Having eliminated the usual suspects, we now investigate overdispersion.
Estimate a dispersion parameter:
sigma2 <- sum( residuals( eggmod, type = 'pearson' )^2 )/12
summary( eggmod, dispersion = sigma2 )
##
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period,
## family = binomial, data = troutegg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8305 -0.3650 -0.0303 0.6191 3.2434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.6358 0.6495 7.138 9.49e-13 ***
## location2 -0.4168 0.5682 -0.734 0.4632
## location3 -1.2421 0.5066 -2.452 0.0142 *
## location4 -0.9509 0.5281 -1.800 0.0718 .
## location5 -4.6138 0.5777 -7.987 1.39e-15 ***
## period7 -2.1702 0.5504 -3.943 8.05e-05 ***
## period8 -2.3256 0.5609 -4.146 3.38e-05 ***
## period11 -2.4500 0.5405 -4.533 5.82e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 5.330322)
##
## Null deviance: 1021.469 on 19 degrees of freedom
## Residual deviance: 64.495 on 12 degrees of freedom
## AIC: 157.03
##
## Number of Fisher Scoring iterations: 5
Note the change in p-values for several predictors: estimating the dispersion inflates each standard error by a factor of \(\sqrt{\hat{\sigma}^2} \approx 2.3\).
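An alternative route to the same adjustment (a sketch, not from the original notes) is the quasibinomial family, which estimates the dispersion as part of the fit; its Pearson-based estimate should match sigma2 above:
# same model, but the dispersion is estimated rather than fixed at 1
eggmod_quasi <- glm( cbind( survive, total - survive ) ~ location + period,
family = quasibinomial, data = troutegg )
summary( eggmod_quasi )$dispersion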
Matched Case-Control Studies
Confounding variables are explicitly adjusted for in the experimental design.
In a matched case-control study we match each case with one or more controls that have the same or similar values of some set of potential confounding variables.
data( amlxray )
glimpse( amlxray )
## Rows: 238
## Columns: 11
## $ ID <fct> 7004, 7004, 7006, 7006, 7009, 7009, 7010, 7010, 7013, 7013, 70…
## $ disease <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0,…
## $ Sex <fct> F, F, M, M, F, F, M, M, M, M, F, M, F, F, F, M, F, F, F, F, M,…
## $ downs <fct> no, no, no, no, no, no, yes, no, no, no, no, no, no, no, no, n…
## $ age <int> 0, 0, 6, 6, 8, 8, 1, 1, 4, 4, 9, 9, 17, 17, 5, 5, 0, 0, 0, 7, …
## $ Mray <fct> no, no, no, no, no, no, no, no, yes, no, yes, no, no, no, no, …
## $ MupRay <fct> no, no, no, no, no, no, no, no, no, no, yes, no, no, no, no, n…
## $ MlowRay <fct> no, no, no, no, no, no, no, no, yes, no, no, no, no, no, no, n…
## $ Fray <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…
## $ Cray <fct> no, no, yes, yes, no, no, no, no, no, yes, no, no, yes, yes, n…
## $ CnRay <ord> 1, 1, 3, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,…
# Down syndrome is a known risk factor, so exclude those cases and their
# matched controls (each control sits in the row after its case, hence ii+1)
ii <- which( amlxray$downs=="yes" )
ramlxray <- amlxray[ -c( ii, ii+1 ), ]
xraymod <- clogit( disease ~ Sex + Mray + CnRay + strata( ID ), data = ramlxray )
summary( xraymod )
## Call:
## coxph(formula = Surv(rep(1, 224L), disease) ~ Sex + Mray + CnRay +
## strata(ID), data = ramlxray, method = "exact")
##
## n= 224, number of events= 104
##
## coef exp(coef) se(coef) z Pr(>|z|)
## SexM 0.1946 1.2149 0.3785 0.514 0.6071
## Mrayyes 0.1420 1.1526 0.5730 0.248 0.8042
## CnRay.L 1.8536 6.3825 0.6080 3.049 0.0023 **
## CnRay.Q -0.1960 0.8220 0.5764 -0.340 0.7338
## CnRay.C -0.5359 0.5852 0.5834 -0.919 0.3583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## SexM 1.2149 0.8231 0.5786 2.551
## Mrayyes 1.1526 0.8676 0.3749 3.543
## CnRay.L 6.3825 0.1567 1.9384 21.015
## CnRay.Q 0.8220 1.2166 0.2656 2.544
## CnRay.C 0.5852 1.7089 0.1865 1.836
##
## Concordance= 0.603 (se = 0.052 )
## Likelihood ratio test= 16.79 on 5 df, p=0.005
## Wald test = 11.74 on 5 df, p=0.04
## Score (logrank) test = 15.25 on 5 df, p=0.009
Drop the insignificant predictors (Sex, Mray), try the father’s X-ray exposure (Fray) instead, and treat CnRay as a linear effect via unclass:
xraymod2 <- clogit( disease ~ Fray + unclass(CnRay) + strata( ID ), ramlxray )
summary( xraymod2 )
## Call:
## coxph(formula = Surv(rep(1, 224L), disease) ~ Fray + unclass(CnRay) +
## strata(ID), data = ramlxray, method = "exact")
##
## n= 224, number of events= 104
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Frayyes 0.6704 1.9550 0.3441 1.948 0.051394 .
## unclass(CnRay) 0.8145 2.2580 0.2368 3.439 0.000584 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Frayyes 1.955 0.5115 0.996 3.838
## unclass(CnRay) 2.258 0.4429 1.419 3.592
##
## Concordance= 0.654 (se = 0.052 )
## Likelihood ratio test= 19.55 on 2 df, p=6e-05
## Wald test = 14.12 on 2 df, p=9e-04
## Score (logrank) test = 17.56 on 2 df, p=2e-04
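On the odds scale, each additional level of X-ray exposure (CnRay) multiplies the odds of disease by about \(e^{0.8145} \approx 2.26\), and the father’s X-ray exposure nearly doubles the odds, though the latter is of borderline significance.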