Extending the Linear Model with R

by Julian Faraway

notes by Bonnie Cooper

The following are notes from readings in ‘Extending the Linear Model with R by Julian Faraway for the course DATA621, ‘Business Analystics and Data Mining’ as part of the Masters of Science in Data Science program at CUNY SPS.

R Libraries Used:

library( faraway )
library( dplyr )
library( ggplot2 )
library( gridExtra )
library( tidyverse )
library( MASS )
library( brglm2 )
library( survival )

Binomial Data

The Challenger Disaster Example

We are interested in how the probability of failure in a given O-ring is related to the launch temperature and predicting the probability when the temperature is \(31^\circ\)F.

data( orings )
glimpse( orings )
## Rows: 23
## Columns: 2
## $ temp   <dbl> 53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73,…
## $ damage <dbl> 5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, …

Visualize the data and fit a simple linear regression model:

p1 <- ggplot( data = orings, aes( x = temp, y = damage/6 ) ) +
  geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
  #geom_jitter(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
  geom_smooth( method = 'lm', col = 'red', se = F ) +
  ylab( 'Prob of Damage' ) +
  xlab( 'Temperature' ) +
  ggtitle( 'Naive Approach: simple linear regression' ) +
  theme_classic()

p1
## `geom_smooth()` using formula 'y ~ x'

However, there are several problems to the simple linear regression approach. Most obviously that the linear regression method can predict probabilities \(\gt1 \mbox{ & } \lt 0\) and that is problematic, because those values are not possible.

Binomial Regression Model

\[P(\mbox{k out of n}) = \frac{n!}{k!(n-k)!}p^k(1-p)^{(n-k)} = \left( \begin{array}{c} n_i \\ y_i \end{array} \right)p_i^{y_i}(1-p_i)^{n_i-y_i}\] \[P(\mbox{event k happening out n possible ways}) = (\mbox{how many possibilities})\cdot P(\mbox{each of n possibilities})\] to describe the relationship of all the predictors (\(x_{i1}\dots x_{iq}\)) to p (the success probability), we set up a linear predictor \(\eta_i = \beta_0 + \beta_1x_{i1} + \dots + \beta_qx_{iq}\).

\(\eta_i = p_i\) is not appropriate, because it is required that \(0\leq p_i \leq 1\), so here are some other choices to use as linking functions:

  1. Logit: \(\eta = log( p/(1-p))\)
  2. Propit \(\eta = \Phi^{-1}(p)\) where \(\Phi\) is the inverse normal cumulative distribution function
  3. Complimentary log-log: \(\eta = log(-log(1-p))\)

Link Functions: link the linear predictor to the mean of the response in the wider class of models

Here we will use the logit as a linking function:

logitmod <- glm( cbind( damage, 6 - damage ) ~ temp, family = binomial, orings )
sum_logit <- summary( logitmod )
sum_logit$coefficients
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) 11.6629897 3.29626315  3.538246 4.027947e-04
## temp        -0.2162337 0.05317703 -4.066298 4.776581e-05

visualize the logit & probit fit to the data

p2 <- ggplot( data = orings, aes( x = temp, y = damage/6 ) ) +
  geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
  geom_smooth( method = 'glm', col = 'blue', method.args = list( family = binomial(link = "probit")), se = F ) +
  geom_smooth( method = 'glm', col = 'red', method.args = list( family = 'binomial'), se = F ) +
  ylab( 'Prob of Damage' ) +
  xlab( 'Temperature' ) +
  ggtitle( 'GLM: logistic regression' ) +
  theme_classic()
p2
## `geom_smooth()` using formula 'y ~ x'
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## `geom_smooth()` using formula 'y ~ x'
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

look at the coefficients for the probit glm:

probitmod <- glm( cbind( damage, 6 - damage ) ~ temp, family = binomial( link = "probit"), orings )
sum_probit <- summary( probitmod )
sum_probit$coefficients
##               Estimate Std. Error   z value    Pr(>|z|)
## (Intercept)  5.5914531 1.71054594  3.268812 1.08000e-03
## temp        -0.1058039 0.02656052 -3.983501 6.79075e-05

Although the coefficients are quite different, the fits are similar, particularly in the range of the data.

Now to predict the value at 31 degrees F

ilogit( sum_logit$coefficients[1,1] + sum_logit$coefficients[2,1]*31 )
## [1] 0.9930342
pnorm( sum_probit$coefficients[1,1] + sum_probit$coefficients[2,1]*31 )
## [1] 0.9895983

Inference

sum_logit$coefficients
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) 11.6629897 3.29626315  3.538246 4.027947e-04
## temp        -0.2162337 0.05317703 -4.066298 4.776581e-05
pchisq( deviance( logitmod ), df.residual( logitmod ), lower = F )
## [1] 0.7164099

construct a profile likelihood-based confidence interval

confint( logitmod )
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept)  5.575195 18.737598
## temp        -0.332657 -0.120179

Interpreting Odds

Odds are sometimes a better scale than probability to represent chance. Odds arose as a way to express the payoffs for bets.

\t\t \(o = \frac{p}{1-p}\) \(p = \frac{o}{1 + o}\)

Odss also form the basis of a subjective assessment of probability.
If we have two covariates (\(x_1 \mbox{ & } x_2\)) then:
\[log( odds ) = log(\frac{p}{1-p}) = \beta_0 + \beta_1x_1 + \beta_2x_2\] Incidence of respiratory disease in infants to the age of 1 year

data( babyfood )
xtabs( disease/(disease+nondisease ) ~ sex + food, data = babyfood )
##       food
## sex        Bottle     Breast      Suppl
##   Boy  0.16812227 0.09514170 0.12925170
##   Girl 0.12500000 0.06681034 0.12598425

fit the model:

babyfood_mod1 <- glm( cbind( disease, nondisease ) ~ sex + food, family = binomial, data = babyfood )
sum_bfood_mod1 <- summary( babyfood_mod1 )
sum_bfood_mod1
## 
## Call:
## glm(formula = cbind(disease, nondisease) ~ sex + food, family = binomial, 
##     data = babyfood)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
##  0.1096  -0.5052   0.1922  -0.1342   0.5896  -0.2284  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.6127     0.1124 -14.347  < 2e-16 ***
## sexGirl      -0.3126     0.1410  -2.216   0.0267 *  
## foodBreast   -0.6693     0.1530  -4.374 1.22e-05 ***
## foodSuppl    -0.1725     0.2056  -0.839   0.4013    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.37529  on 5  degrees of freedom
## Residual deviance:  0.72192  on 2  degrees of freedom
## AIC: 40.24
## 
## Number of Fisher Scoring iterations: 4

is there a sex-food interactions? ….look at the residual deviance. here it is small for the given degrees of freedon, therefore we can conclude that there is no evidence of an interaction effect.

test the main effects:

drop1( babyfood_mod1, test = 'Chi' )
## Single term deletions
## 
## Model:
## cbind(disease, nondisease) ~ sex + food
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      0.7219 40.240                      
## sex     1   5.6990 43.217  4.9771   0.02569 *  
## food    2  20.8992 56.417 20.1772 4.155e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

the drop1 function result shows that both predictors are significant

exp( sum_bfood_mod1$coefficients[3,1 ] )
## [1] 0.5120696

Breast feeding reduces the odds of respiratory disease to 51% of that for bottle feeding

find confidence intervals:

exp( confint( babyfood_mod1 ) )
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 0.1591988 0.2474333
## sexGirl     0.5536209 0.9629225
## foodBreast  0.3781905 0.6895181
## foodSuppl   0.5555372 1.2464312

Prospective and Retrospective Sampling

Prospective Sampling - ‘cohort study’ the predictors are fixed and then the outcome is observed. ex: have a subject group that present with certain features which are monitored accordingly. Retrospective sampling - ‘case-control study’ the outcome is fixed and then the predictors are observed. a group of test condition observations are observed (e.g. disease ) and a seperate group of control subjects are observed (disease-free). retrospective studies are cheaper and faster and more convenient. However they are less reliable because they typically use historical record which are known to inaccuracies and incompletenesses.

babyfood[c(1,3), ]
##   disease nondisease sex   food
## 1      77        381 Boy Bottle
## 3      47        447 Boy Breast
  • given the infant is breast fed, the log-odds of having a respiratory disease are log( 47/447 )
  • given the infant is bottle fed, the log-odds of having respiratory disease are log( 77/381 )
  • log-odds ratio: the difference between these two represents the increased risk of respiratory disease incurred by bottle feeding relative to breast feeding
log( 77/381 ) - log( 47/447 )
## [1] 0.653417

Estimation Problems

data( hormone )
plot( estrogen ~ androgen, data = hormone, col = orientation )

fit a binomial model to see if orientation can be predicted from the two hormone values.
When the response is binary, we can use it directly in the glm function:

hormone_mod <- glm( orientation ~ estrogen + androgen, data = hormone, family = binomial )
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary( hormone_mod )
## 
## Call:
## glm(formula = orientation ~ estrogen + androgen, family = binomial, 
##     data = hormone)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.759e-05  -2.100e-08  -2.100e-08   2.100e-08   3.380e-05  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -84.49  136095.03  -0.001    1.000
## estrogen       -90.22   75910.98  -0.001    0.999
## androgen       100.91   92755.62   0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3.5426e+01  on 25  degrees of freedom
## Residual deviance: 2.3229e-09  on 23  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25

The residual deviance is very small, indicating a good fit. However, the algorithm did not converge.

A look at the plot of the data reveals that it is linearly seperable so that a perfect fit is possible:

plot( estrogen ~ androgen, data = hormone, col = orientation )
abline( -hormone_mod$coefficients[1]/hormone_mod$coefficients[2], 
        -hormone_mod$coefficients[3]/hormone_mod$coefficients[2],
        col = 'magenta')

Use David Firths method which always gives finite estimates

Goodness of Fit

Deviance is a good measure, but there are others: Pearson’s \(\chi^2\) which is analogous to the residual sum of squares in normal linear models

#pearson's chi squared
sum( residuals( blissm1, type = 'pearson' )^2 )
## [1] 0.3672674
#deviance
deviance( blissm1 )
## [1] 0.3787483

We can see that there is little difference

Niglekerke’s 1991 statistic for how well the model explains the data

(1 - exp((blissm1$dev - blissm1$null)/150))/ (1-exp(-blissm1$null/150))
## [1] 0.9953178

That is a very good fit

Prediction and Effective Doses

We wish to predict the outcome for given values of the covariates. use the inverse of the link function to get the prediction on the probability scale:

blissm1_sum <- summary( blissm1 )
x0 <- c( 1, 2.5 )
eta0 <- sum( x0*coef( blissm1 ) )
ilogit( eta0 )
## [1] 0.6412854

There is a 64% chance of a death at this dose. now to find the confidence interval:

bliss_pred <- predict( blissm1, newdata = data.frame( conc = 2.5 ), se = T )
c1 <- bliss_pred$fit
c2 <- bliss_pred$se.fit
ilogit( c( c1 - 1.96*c2, c1 + 1.96*c2 ) )
##         1         1 
## 0.5342962 0.7358471

\[\mbox{Lethal Dose} = \hat{ED50} = \frac{-\hat{\beta}_0}{\hat{\beta}_1}\] \[\mbox{Effective Dose} = x_p = \frac{\mbox{logit}(p)-\beta_0}{\hat{\beta}_1}\]

dose.p( blissm1, p = c( 0.5, 0.9 ) )
##             Dose        SE
## p = 0.5: 2.00000 0.1784367
## p = 0.9: 3.89107 0.3449965

Overdispersion

data( troutegg )
ftable( xtabs( cbind( survive, total ) ~ location + period, data = troutegg ) )
##                  survive total
## location period               
## 1        4            89    94
##          7            94    98
##          8            77    86
##          11          141   155
## 2        4           106   108
##          7            91   106
##          8            87    96
##          11          104   122
## 3        4           119   123
##          7           100   130
##          8            88   119
##          11           91   125
## 4        4           104   104
##          7            80    97
##          8            67    99
##          11          111   132
## 5        4            49    93
##          7            11   113
##          8            18    88
##          11            0   138

fit a glm for the two main effects

eggmod <- glm( cbind( survive, total-survive ) ~ location + period, family = binomial, data = troutegg )
summary( eggmod )
## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = binomial, data = troutegg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8305  -0.3650  -0.0303   0.6191   3.2434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6358     0.2813  16.479  < 2e-16 ***
## location2    -0.4168     0.2461  -1.694   0.0903 .  
## location3    -1.2421     0.2194  -5.660 1.51e-08 ***
## location4    -0.9509     0.2288  -4.157 3.23e-05 ***
## location5    -4.6138     0.2502 -18.439  < 2e-16 ***
## period7      -2.1702     0.2384  -9.103  < 2e-16 ***
## period8      -2.3256     0.2429  -9.573  < 2e-16 ***
## period11     -2.4500     0.2341 -10.466  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
## 
## Number of Fisher Scoring iterations: 5

check for outliers:

halfnorm( residuals( eggmod ) )

plot the empiracal logits:

elogits <- log( ( troutegg$survive + 0.5 )/( troutegg$total - troutegg$survive + 0.5 ) )
with( troutegg, interaction.plot( period, location, elogits ) )

Having eliminated the usual suspects, we now investigate overdispersion.
estimate a dipersion parameter:

sigma2 <- sum( residuals( eggmod, type = 'pearson' )^2 )/12
summary( eggmod, dispersion = sigma2 )
## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = binomial, data = troutegg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8305  -0.3650  -0.0303   0.6191   3.2434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6358     0.6495   7.138 9.49e-13 ***
## location2    -0.4168     0.5682  -0.734   0.4632    
## location3    -1.2421     0.5066  -2.452   0.0142 *  
## location4    -0.9509     0.5281  -1.800   0.0718 .  
## location5    -4.6138     0.5777  -7.987 1.39e-15 ***
## period7      -2.1702     0.5504  -3.943 8.05e-05 ***
## period8      -2.3256     0.5609  -4.146 3.38e-05 ***
## period11     -2.4500     0.5405  -4.533 5.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 5.330322)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
## 
## Number of Fisher Scoring iterations: 5

Note the change in p-vals for several predictors

Matched Case-Control Studies

Confounding variables are explicitly adjusted for in the experiment design.
In a matched case-control study we match each case with one or more controls that have the same or similar values of some set of potential confounding variables.

data( amlxray )
glimpse( amlxray )
## Rows: 238
## Columns: 11
## $ ID      <fct> 7004, 7004, 7006, 7006, 7009, 7009, 7010, 7010, 7013, 7013, 70…
## $ disease <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0,…
## $ Sex     <fct> F, F, M, M, F, F, M, M, M, M, F, M, F, F, F, M, F, F, F, F, M,…
## $ downs   <fct> no, no, no, no, no, no, yes, no, no, no, no, no, no, no, no, n…
## $ age     <int> 0, 0, 6, 6, 8, 8, 1, 1, 4, 4, 9, 9, 17, 17, 5, 5, 0, 0, 0, 7, …
## $ Mray    <fct> no, no, no, no, no, no, no, no, yes, no, yes, no, no, no, no, …
## $ MupRay  <fct> no, no, no, no, no, no, no, no, no, no, yes, no, no, no, no, n…
## $ MlowRay <fct> no, no, no, no, no, no, no, no, yes, no, no, no, no, no, no, n…
## $ Fray    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…
## $ Cray    <fct> no, no, yes, yes, no, no, no, no, no, yes, no, no, yes, yes, n…
## $ CnRay   <ord> 1, 1, 3, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,…
ii <- which( amlxray$downs=="yes" )
ramlxray <- amlxray[ -c( ii, ii+1 ),]
xraymod <- clogit( disease ~ Sex + Mray + CnRay + strata( ID ), data = ramlxray )
summary( xraymod )
## Call:
## coxph(formula = Surv(rep(1, 224L), disease) ~ Sex + Mray + CnRay + 
##     strata(ID), data = ramlxray, method = "exact")
## 
##   n= 224, number of events= 104 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)   
## SexM     0.1946    1.2149   0.3785  0.514   0.6071   
## Mrayyes  0.1420    1.1526   0.5730  0.248   0.8042   
## CnRay.L  1.8536    6.3825   0.6080  3.049   0.0023 **
## CnRay.Q -0.1960    0.8220   0.5764 -0.340   0.7338   
## CnRay.C -0.5359    0.5852   0.5834 -0.919   0.3583   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## SexM       1.2149     0.8231    0.5786     2.551
## Mrayyes    1.1526     0.8676    0.3749     3.543
## CnRay.L    6.3825     0.1567    1.9384    21.015
## CnRay.Q    0.8220     1.2166    0.2656     2.544
## CnRay.C    0.5852     1.7089    0.1865     1.836
## 
## Concordance= 0.603  (se = 0.052 )
## Likelihood ratio test= 16.79  on 5 df,   p=0.005
## Wald test            = 11.74  on 5 df,   p=0.04
## Score (logrank) test = 15.25  on 5 df,   p=0.009

drop some insignificant factors

xraymod2 <- clogit( disease ~ Fray + unclass(CnRay) + strata( ID ), ramlxray )
summary( xraymod2 )
## Call:
## coxph(formula = Surv(rep(1, 224L), disease) ~ Fray + unclass(CnRay) + 
##     strata(ID), data = ramlxray, method = "exact")
## 
##   n= 224, number of events= 104 
## 
##                  coef exp(coef) se(coef)     z Pr(>|z|)    
## Frayyes        0.6704    1.9550   0.3441 1.948 0.051394 .  
## unclass(CnRay) 0.8145    2.2580   0.2368 3.439 0.000584 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## Frayyes            1.955     0.5115     0.996     3.838
## unclass(CnRay)     2.258     0.4429     1.419     3.592
## 
## Concordance= 0.654  (se = 0.052 )
## Likelihood ratio test= 19.55  on 2 df,   p=6e-05
## Wald test            = 14.12  on 2 df,   p=9e-04
## Score (logrank) test = 17.56  on 2 df,   p=2e-04