Agenda

In this section, we will cover regression modeling in R: formula syntax, linear regression, generalized linear models, and survival analysis.

For the most part, I’ll just showcase how you do things without going into the theory; the rest of the course will cover that.

Usual Preparation

## Load tidyverse
library(tidyverse)
## Load CSV file
framingham <- read_csv("./framingham.csv")

Formula syntax

The formula syntax used for regression modeling is as follows.

Y ~ X1 + X2 + X3

Y is the outcome variable (dependent variable) to be explained, and the RHS lists the predictors (independent variables) used to explain it.

The intercept (constant) term is implicit. If you want to remove the constant term, use -1 explicitly.

Y ~ -1 + X1 + X2 + X3

If you want to include a product term (interaction term) between X1 and X2, use the following.

Y ~ X1 + X2 + X1:X2 + X3

or, equivalently, since X1*X2 expands to X1 + X2 + X1:X2:

Y ~ X1*X2 + X3

If you want to include the quadratic term for X1, use the following.

Y ~ X1 + I(X1^2) + X2 + X3

Because ^ has a special meaning in the formula syntax, the quadratic term is enclosed in I() to inhibit that interpretation. Another, potentially better, way is

Y ~ poly(X1, 2) + X2 + X3

There are subtle differences between these models: poly() uses orthogonal polynomials by default (add raw = TRUE for raw polynomials), so its individual coefficients differ from those of the I() version, although the fitted values are identical.
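
A minimal sketch illustrating this, using the framingham data loaded above (restricted to complete cases because poly() does not accept missing values): the coefficients differ between the two parameterizations, but the fitted values agree.

## Orthogonal vs raw polynomials: same fit, different coefficients
fh_cc   <- na.omit(framingham[, c("TOTCHOL", "AGE")])
lm_orth <- lm(TOTCHOL ~ poly(AGE, 2), data = fh_cc)
lm_raw  <- lm(TOTCHOL ~ poly(AGE, 2, raw = TRUE), data = fh_cc)
all.equal(fitted(lm_orth), fitted(lm_raw))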

Simple linear regression

Let’s check out simple linear regression, a linear model with a single predictor. Here we focus on modeling the average total cholesterol as a function of age.

Fitting

Linear models are fit with the lm() function.

## Fitting a linear model with least square
lm1 <- lm(formula = TOTCHOL ~ AGE,
          data    = framingham)

Result assessment

Printing the model gives the coefficient values.

lm1
## 
## Call:
## lm(formula = TOTCHOL ~ AGE, data = framingham)
## 
## Coefficients:
## (Intercept)          AGE  
##     172.912        1.284

To examine what other information this object contains, try names() to see the data elements.

names(lm1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"

More informative results can be obtained from the summary method. Note the observations dropped because of missingness (complete case analysis).

(lm1_summary <- summary(lm1))
## 
## Call:
## lm(formula = TOTCHOL ~ AGE, data = framingham)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -155.78  -29.52   -2.98   25.17  457.61 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 172.91217    3.81696   45.30   <2e-16 ***
## AGE           1.28386    0.07535   17.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.25 on 4380 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.06215,    Adjusted R-squared:  0.06194 
## F-statistic: 290.3 on 1 and 4380 DF,  p-value: < 2.2e-16

As before, this summary is itself a result object containing various pieces of information.

names(lm1_summary)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"

Funnily enough, the corresponding confidence intervals are not given by default; confint() provides them. More readable results can be obtained with the tableone package.

## confint() gives just the CI
confint(lm1)
##                  2.5 %     97.5 %
## (Intercept) 165.429005 180.395339
## AGE           1.136123   1.431589
## This may give a more familiar presentation. Don't exponentiate in lm!
tableone::ShowRegTable(lm1, exp = FALSE)
##             coef [confint]          p     
## (Intercept) 172.91 [165.43, 180.40] <0.001
## AGE           1.28 [1.14, 1.43]     <0.001

On-the-fly variable manipulation

The formula interface is very flexible, and you can do many transformations on the fly. Here AGE is mean-centered to make the intercept value more interpretable (the expected total cholesterol at the average age). However, constructing such variables up front in the dataset with mutate() may be better practice (see the sketch after the output below).

## Name this model lm1c (centered lm1) to avoid clobbering lm2, defined later
lm1c <- lm(formula = TOTCHOL ~ I(AGE - mean(AGE, na.rm = TRUE)),
           data    = framingham)
summary(lm1c)
## 
## Call:
## lm(formula = TOTCHOL ~ I(AGE - mean(AGE, na.rm = TRUE)), data = framingham)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -155.78  -29.52   -2.98   25.17  457.61 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      237.00970    0.65330  362.79   <2e-16 ***
## I(AGE - mean(AGE, na.rm = TRUE))   1.28386    0.07535   17.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.25 on 4380 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.06215,    Adjusted R-squared:  0.06194 
## F-statistic: 290.3 on 1 and 4380 DF,  p-value: < 2.2e-16
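
For completeness, here is a sketch of the up-front alternative; AGE_c and framingham_centered are names chosen for illustration.

## Construct the centered variable explicitly with mutate(), then fit
framingham_centered <- framingham %>%
    mutate(AGE_c = AGE - mean(AGE, na.rm = TRUE))
lm(TOTCHOL ~ AGE_c, data = framingham_centered)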

Graphical diagnostics (ggfortify)

The ggfortify package provides graphical model diagnostics automatically.

library(ggfortify)
autoplot(lm1, which = 1:6, ncol = 3, label.size = 3)

A more detailed treatment of regression diagnostics is available via the car package.

Graphical diagnostics (DIY)

Let’s try a simple comparison of model fit (prediction) to the raw data. We use the add_predictions() function in the modelr package to add the prediction (pred) to the dataset and plot them together using ggplot2.

## Load modelr package for add_predictions()
library(modelr)
##
framingham %>%
    ## Add prediction to dataset
    add_predictions(lm1) %>%
    ## Plot
    ggplot(mapping = aes(x = AGE, y = TOTCHOL)) +
    geom_point() +
    geom_line(mapping = aes(y = pred), size = 2, color = "blue") +
    theme_bw()
## Warning: Removed 52 rows containing missing values (geom_point).

The residual (error) vs fitted (prediction) plot from the 6-panel display above can be recreated as follows.

framingham %>%
    ## Add prediction to dataset
    add_predictions(lm1) %>%
    ## Also add errors to dataset
    add_residuals(lm1) %>%
    ## Plot
    ggplot(mapping = aes(x = pred, y = resid)) +
    geom_point() +
    geom_smooth(color = "blue", size = 2) +
    theme_bw()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 52 rows containing non-finite values (stat_smooth).
## Warning: Removed 52 rows containing missing values (geom_point).

Multiple linear regression

Now let’s examine more complex models with multiple variables (two!). Here we model the average total cholesterol as a function of both age and sex.

Fitting and result assessment

This is almost exactly the same except for the additional variable SEX.

lm2 <- lm(TOTCHOL ~ AGE + SEX,
          data = framingham)
summary(lm2)
## 
## Call:
## lm(formula = TOTCHOL ~ AGE + SEX, data = framingham)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -158.27  -29.68   -3.12   25.20  460.86 
## 
## Coefficients:
##             Estimate Std. Error t value   Pr(>|t|)    
## (Intercept) 175.6959     3.8604   45.51    < 2e-16 ***
## AGE           1.2796     0.0752   17.02    < 2e-16 ***
## SEX          -5.8152     1.3128   -4.43 0.00000967 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.15 on 4379 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.06634,    Adjusted R-squared:  0.06591 
## F-statistic: 155.6 on 2 and 4379 DF,  p-value: < 2.2e-16

If you want to include an interaction term, the formula is as follows.

lm3 <- lm(TOTCHOL ~ AGE + SEX + AGE:SEX,
          data = framingham)
summary(lm3)
## 
## Call:
## lm(formula = TOTCHOL ~ AGE + SEX + AGE:SEX, data = framingham)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175.46  -28.59   -3.24   24.44  462.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 132.6914     5.0388   26.33   <2e-16 ***
## AGE           2.1396     0.0993   21.55   <2e-16 ***
## SEX          90.0487     7.5157   11.98   <2e-16 ***
## AGE:SEX      -1.9218     0.1484  -12.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.36 on 4378 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.1008, Adjusted R-squared:  0.1002 
## F-statistic: 163.5 on 3 and 4378 DF,  p-value: < 2.2e-16

Prediction

Let’s examine how these two models differ by predicting for a fake dataset. An all-combinations dataset can be constructed with crossing(). Here we need to perform a wide-to-long conversion using gather().

## Create a fake data set to predict for.
newdata <- crossing(AGE = seq(from = 50, to = 70, by = 1),
                    SEX = c(0,1))
newdata
## # A tibble: 42 x 2
##      AGE   SEX
##    <dbl> <dbl>
##  1    50     0
##  2    50     1
##  3    51     0
##  4    51     1
##  5    52     0
##  6    52     1
##  7    53     0
##  8    53     1
##  9    54     0
## 10    54     1
## # ... with 32 more rows
## Predict for this new dataset using both models.
newdata_long <- newdata %>%
    ## Predict using Model lm2, prediction variable is named lm2
    add_predictions(model = lm2, var = "lm2") %>%
    ## Predict using Model lm3, prediction variable is named lm3
    add_predictions(model = lm3, var = "lm3") %>%
    ## lm2 and lm3 variables will form a single variable pred,
    ## They are distinguished by the model variable
    gather(key = model, value = pred,
           lm2, lm3)
newdata_long
## # A tibble: 84 x 4
##      AGE   SEX model     pred
##    <dbl> <dbl> <chr>    <dbl>
##  1    50     0   lm2 239.6751
##  2    50     1   lm2 233.8599
##  3    51     0   lm2 240.9547
##  4    51     1   lm2 235.1394
##  5    52     0   lm2 242.2343
##  6    52     1   lm2 236.4190
##  7    53     0   lm2 243.5139
##  8    53     1   lm2 237.6986
##  9    54     0   lm2 244.7935
## 10    54     1   lm2 238.9782
## # ... with 74 more rows
## Plot
ggplot(data = newdata_long,
       mapping = aes(x = AGE, y = pred, group = SEX, color = factor(SEX))) +
    geom_line() +
    facet_grid(. ~ model, labeller = label_both) +
    coord_cartesian(ylim = c(0,300)) +
    theme_bw()

Model comparison

It is clear from the visualization that Model 2 constrains the lines to the same slope in both sexes, whereas Model 3 allows different slopes. Which is the more plausible model for the data? When the models are nested (roughly speaking, dropping terms from the more complex model gives the simpler model), one option is the anova() function, which conducts a formal test comparing the fit of the two models (here a partial F-test).

anova(lm2, lm3)
## Analysis of Variance Table
## 
## Model 1: TOTCHOL ~ AGE + SEX
## Model 2: TOTCHOL ~ AGE + SEX + AGE:SEX
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   4379 8155059                                  
## 2   4378 7854337  1    300722 167.62 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A more general way is to use the Akaike (pronounced Aka-Ike) information criterion (AIC; the smaller, the better). Here I created a list of models and mapped the AIC() function over each element.

map(list(lm2 = lm2, lm3 = lm3), AIC)
## $lm2
## [1] 45435.17
## 
## $lm3
## [1] 45272.52
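
The Bayesian information criterion (BIC; also smaller-is-better, with a heavier penalty on model complexity) can be mapped over the same list.

## BIC for the same list of models
map(list(lm2 = lm2, lm3 = lm3), BIC)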

Generalized linear model

Generalized linear models (models for outcomes that are binary, counts, etc.) are fit with the glm() function. In addition to the model formula and dataset, we need to specify the "family".

Binary logistic regression model

For a binary outcome variable, binary logistic regression is commonly used. Note the results are on the coefficient scale (log odds ratios).

glm_logistic <- glm(formula = PREVCHD ~ AGE + SEX,
                    family  = binomial(link = "logit"),
                    data    = framingham)
## Summary on log odds ratio scale.
summary(glm_logistic)
## 
## Call:
## glm(formula = PREVCHD ~ AGE + SEX, family = binomial(link = "logit"), 
##     data = framingham)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8219  -0.3346  -0.2136  -0.1440   3.0617  
## 
## Coefficients:
##              Estimate Std. Error z value      Pr(>|z|)    
## (Intercept) -9.707187   0.582144 -16.675       < 2e-16 ***
## AGE          0.114307   0.009912  11.532       < 2e-16 ***
## SEX          0.908359   0.155909   5.826 0.00000000567 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1593.5  on 4433  degrees of freedom
## Residual deviance: 1400.6  on 4431  degrees of freedom
## AIC: 1406.6
## 
## Number of Fisher Scoring iterations: 7
## Exponentiation will obtain odds ratios.
tableone::ShowRegTable(glm_logistic)
##             exp(coef) [confint] p     
## (Intercept) 0.00 [0.00, 0.00]   <0.001
## AGE         1.12 [1.10, 1.14]   <0.001
## SEX         2.48 [1.83, 3.38]   <0.001
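
The same odds ratios can be computed by hand by exponentiating the coefficients and their confidence limits (profile-likelihood CIs; on older R versions this confint() method comes from the MASS package).

## Manual odds ratios with 95% CIs
exp(cbind(OR = coef(glm_logistic), confint(glm_logistic)))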

Poisson regression model

For a count outcome variable, Poisson regression is often used. This is not an ideal example (too many zeros; zero-inflated data), but I’ll take cigarettes per day as the outcome variable.

glm_poisson <-  glm(formula = CIGPDAY ~ AGE + SEX,
                    family  = poisson(link = "log"),
                    data    = framingham)
## Summary on log-mean ratio scale.
summary(glm_poisson)
## 
## Call:
## glm(formula = CIGPDAY ~ AGE + SEX, family = poisson(link = "log"), 
##     data = framingham)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -6.494  -3.670  -2.699   1.995  11.918  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.1843893  0.0301395  105.66   <2e-16 ***
## AGE         -0.0297036  0.0006043  -49.15   <2e-16 ***
## SEX          0.8444116  0.0105274   80.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 71142  on 4401  degrees of freedom
## Residual deviance: 61732  on 4399  degrees of freedom
##   (32 observations deleted due to missingness)
## AIC: 71440
## 
## Number of Fisher Scoring iterations: 6
## Exponentiation will obtain rate ratios.
tableone::ShowRegTable(glm_poisson)
##             exp(coef) [confint]  p     
## (Intercept) 24.15 [22.77, 25.62] <0.001
## AGE          0.97 [0.97, 0.97]   <0.001
## SEX          2.33 [2.28, 2.38]   <0.001
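
Given the concern about the outcome distribution, a rough overdispersion check is the ratio of the residual deviance to its degrees of freedom; values far above 1 (as here) suggest the Poisson variance assumption is violated.

## Rough overdispersion check (approximately 1 if the Poisson assumption holds)
glm_poisson$deviance / glm_poisson$df.residual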

Survival analysis

Survival analysis functions are included in the survival package.

library(survival)

Kaplan-Meier plot

Kaplan-Meier survival estimates are obtained as follows. A survival outcome consists of two columns: one specifying the observation time, and the other specifying whether the event was observed. Here we examine the time-to-death outcome over the 24 years of follow-up.

## Non-parametric survival estimates
kmfit1 <- survfit(Surv(time = TIMEDTH, event = DEATH) ~ SEX, data = framingham)
## Show survival estimates at these years
summary(kmfit1, times = c(0,6,12,18,24))
## Call: survfit(formula = Surv(time = TIMEDTH, event = DEATH) ~ SEX, 
##     data = framingham)
## 
##                 SEX=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   2490       0    1.000 0.00000        1.000        1.000
##     6   2396      94    0.962 0.00382        0.955        0.970
##    12   2237     159    0.898 0.00605        0.887        0.910
##    18   2019     218    0.811 0.00785        0.796        0.826
##    24   1783     236    0.716 0.00904        0.699        0.734
## 
##                 SEX=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   1944       0    1.000 0.00000        1.000        1.000
##     6   1809     135    0.931 0.00577        0.919        0.942
##    12   1633     176    0.840 0.00831        0.824        0.856
##    18   1391     242    0.716 0.01023        0.696        0.736
##    24   1101     290    0.566 0.01124        0.545        0.589
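
Median survival times by group can also be extracted directly; a quick sketch:

## Median survival time (probs = 0.5) by SEX
quantile(kmfit1, probs = 0.5)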

The survminer package is useful in producing informative plots (check out the vignettes).

library(survminer)
## Loading required package: ggpubr
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
ggsurvplot(kmfit1, risk.table = TRUE)

Cox regression model

Cox regression is fit with coxph(). The formula is the same except for the LHS, which is the same Surv() outcome used with survfit().

coxph1 <- coxph(Surv(time = TIMEDTH, event = DEATH) ~ AGE + SEX,
                data = framingham)
summary(coxph1)
## Call:
## coxph(formula = Surv(time = TIMEDTH, event = DEATH) ~ AGE + SEX, 
##     data = framingham)
## 
##   n= 4434, number of events= 1550 
## 
##         coef exp(coef) se(coef)     z Pr(>|z|)    
## AGE 0.092328  1.096724 0.003193 28.92   <2e-16 ***
## SEX 0.614190  1.848160 0.051161 12.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## AGE     1.097     0.9118     1.090     1.104
## SEX     1.848     0.5411     1.672     2.043
## 
## Concordance= 0.718  (se = 0.007 )
## Rsquare= 0.203   (max possible= 0.997 )
## Likelihood ratio test= 1006  on 2 df,   p=0
## Wald test            = 943.6  on 2 df,   p=0
## Score (logrank) test = 1029  on 2 df,   p=0
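
The proportional hazards assumption underlying the Cox model can be examined with cox.zph() in the survival package; a minimal sketch:

## Test proportional hazards via scaled Schoenfeld residuals
cox.zph(coxph1)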

Other models/estimation procedures