BIOS621 Session 2

Levi Waldron

Welcome and outline - session 2

Learning objectives - session 2

Multiple Linear Regression Model

Systematic component:

\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]

Multiple Linear Regression Model

Systematic plus random component:

\(y_i = E[y|x] + \epsilon_i\)

\(y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon_i\)

Assumption: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)

Generalized Linear Models

Components of GLM

Linear Regression as GLM

Logistic Regression as GLM

\[ g(P(x)) = logit(P(x)) = log\left( \frac{P(x)}{1-P(x)} \right) \]

\[ P(x) = g^{-1}\left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]

logit function

logit <- function(P) log(P/(1-P))
plot(logit, xlab="Probability", ylab="Log-odds",
     cex.lab=1.5, cex.axis=1.5)

Inverse logit function

invLogit <- function(x) 1/(1+exp(-x))

Additive vs. Multiplicative models

Motivating example: contraceptive use data

From http://data.princeton.edu/wws509/datasets/#cuse

##     age    education wantsMore    notUsing          using      
##  <25  :4   high:8    no :8     Min.   :  8.00   Min.   : 4.00  
##  25-29:4   low :8    yes:8     1st Qu.: 31.00   1st Qu.: 9.50  
##  30-39:4                       Median : 56.50   Median :29.00  
##  40-49:4                       Mean   : 68.75   Mean   :31.69  
##                                3rd Qu.: 85.75   3rd Qu.:49.00  
##                                Max.   :212.00   Max.   :80.00

Motivating example: contraceptive use data

No interactions:

fit1 <- glm(cbind(using, notUsing) ~ age + education + wantsMore, 
           data=cuse, family=binomial("logit"))
summary(fit1)
## 
## Call:
## glm(formula = cbind(using, notUsing) ~ age + education + wantsMore, 
##     family = binomial("logit"), data = cuse)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5148  -0.9376   0.2408   0.9822   1.7333  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.8082     0.1590  -5.083 3.71e-07 ***
## age25-29       0.3894     0.1759   2.214  0.02681 *  
## age30-39       0.9086     0.1646   5.519 3.40e-08 ***
## age40-49       1.1892     0.2144   5.546 2.92e-08 ***
## educationlow  -0.3250     0.1240  -2.620  0.00879 ** 
## wantsMoreyes  -0.8330     0.1175  -7.091 1.33e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.772  on 15  degrees of freedom
## Residual deviance:  29.917  on 10  degrees of freedom
## AIC: 113.43
## 
## Number of Fisher Scoring iterations: 4

Pearson residuals for logistic regression

Take the difference between observed and fitted values (on probability scale 0-1), and divide by the standard deviation of the observed value.

\[ r_i = \frac{y_i - \hat y_i}{ \sqrt{ Var(\hat y_i) }} \]

Summing the squared Pearson residuals produces the Pearson Chi-squared statistic:

\[ \chi ^2 = \sum_i r_i^2 \]

Deviance residuals for logistic regression

\[ d_i = s_i \sqrt{ -2 ( y_i \log \hat y_i + (1-y_i) \log (1 - \hat y_i) ) } \]

Where \(s_i = 1\) if \(y_i = 1\) and \(s_i = -1\) if \(y_i = 0\).

Model likelihood and deviance

\(\Delta (\textrm{D}) = -2 * \Delta (\textrm{log likelihood})\)

Likelihood Ratio Test

Likelihood Ratio Test (cont’d)

fit0 <- glm(cbind(using, notUsing) ~ -1, data=cuse, 
            family=binomial("logit"))
anova(fit0, fit1, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: cbind(using, notUsing) ~ -1
## Model 2: cbind(using, notUsing) ~ age + education + wantsMore
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        16     389.85                          
## 2        10      29.92  6   359.94 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Wald test for individual regression coefficients

Note: Wald test confidence intervals on coefficients can provide poor coverage in some cases, even with relatively large samples

Lab Exercises

  1. What is the mean fraction of women using birth control for each age group? Each education level? For women who do or don’t want more children?
    • Hint: look at the “data wrangling” cheat sheet functions mutate, group_by, and summarize
  2. Based on fit1, write on paper the model for expected probability of using birth control. Don’t forget the logit function.
  3. Based on fit1, what is the expected probability of an individual 25-29 years old, with high education, who wants more children, using birth control? Calculate it manually, and using predict(fit1)
  4. Based on fit1: Relative to women under 25 who want to have children, what is the predicted increase in odds that a woman 40-49 years old who does not want to have children will be taking birth control?
  5. Using a likelihood ratio test, is there evidence that a model with interactions improves on fit1 (no interactions)?
  6. Which, if any, variables have the strongest interactions?