This section covers regression modeling in R: linear models, generalized linear models, and survival analysis.
For the most part, I’ll just showcase how to do things without going into the theoretical explanation; the rest of the course will take care of that.
## Load tidyverse
library(tidyverse)
## Load CSV file
framingham <- read_csv("./framingham.csv")
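For orientation, here is a quick peek at the columns used throughout this section (a sketch; it assumes these columns exist in your copy of the teaching dataset).
## Quick look at the variables used below
framingham %>%
    select(AGE, SEX, TOTCHOL, CIGPDAY, PREVCHD, TIMEDTH, DEATH) %>%
    glimpse()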
The formula syntax used for regression modeling is the following.
Y ~ X1 + X2 + X3
Y is the outcome variable (dependent variable) to be explained, and the right-hand side (RHS) lists the predictors (independent variables) used to explain it.
The intercept (constant) term is implicit. If you want to remove the constant term, use -1 explicitly.
Y ~ -1 + X1 + X2 + X3
If you want to include a product term (interaction term) between X1 and X2, use the following.
Y ~ X1 + X2 + X1:X2 + X3
or
Y ~ X1*X2 + X3
If you want to include the quadratic term for X1, use the following.
Y ~ X1 + I(X1^2) + X2 + X3
Because ^ has a special meaning in the formula syntax, the quadratic term is enclosed in I() to inhibit that interpretation. Another, potentially better, way is
Y ~ poly(X1, 2) + X2 + X3
There is a subtle difference between these models: poly() uses orthogonal polynomials by default, so its individual coefficients differ from those of the I(X1^2) version, although the fitted values are identical (use poly(X1, 2, raw = TRUE) to reproduce the raw coefficients).
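A quick way to confirm this, using the TOTCHOL and AGE variables introduced below (a sketch; note poly() does not accept missing values in its input).
## Orthogonal (poly) vs raw (I()) parameterization of the same quadratic model
fit_poly <- lm(TOTCHOL ~ poly(AGE, 2), data = framingham)
fit_raw <- lm(TOTCHOL ~ AGE + I(AGE^2), data = framingham)
## The coefficients differ, but the fitted values agree
all.equal(unname(fitted(fit_poly)), unname(fitted(fit_raw)))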
Let’s check out simple linear regression, a linear model with a single predictor. Here we focus on modeling the average total cholesterol as a function of age.
Linear models are fit with the lm() function.
## Fitting a linear model with least square
lm1 <- lm(formula = TOTCHOL ~ AGE,
          data = framingham)
Printing the model gives the coefficient values.
lm1
##
## Call:
## lm(formula = TOTCHOL ~ AGE, data = framingham)
##
## Coefficients:
## (Intercept) AGE
## 172.912 1.284
To see what other information this object contains, try names() to list its elements.
names(lm1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "na.action" "xlevels" "call" "terms"
## [13] "model"
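Individual elements are best pulled out with extractor functions rather than $.
## Extractor functions for the most commonly used elements
coef(lm1)
head(fitted(lm1))
head(residuals(lm1))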
More informative results can be obtained from the summary method. Note the observations dropped because of missingness (complete case analysis).
(lm1_summary <- summary(lm1))
##
## Call:
## lm(formula = TOTCHOL ~ AGE, data = framingham)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.78 -29.52 -2.98 25.17 457.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 172.91217 3.81696 45.30 <2e-16 ***
## AGE 1.28386 0.07535 17.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.25 on 4380 degrees of freedom
## (52 observations deleted due to missingness)
## Multiple R-squared: 0.06215, Adjusted R-squared: 0.06194
## F-statistic: 290.3 on 1 and 4380 DF, p-value: < 2.2e-16
As usual, this is also a result object with various types of information.
names(lm1_summary)
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled" "na.action"
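For example, the coefficient table (estimates, standard errors, t values, and p values) can be extracted as a numeric matrix.
## Coefficient matrix from the summary object
coef(lm1_summary)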
Funnily enough, the corresponding confidence intervals are not given by default; confint() provides them. A more readable presentation can be obtained with the tableone package.
## confint() gives just the CI
confint(lm1)
## 2.5 % 97.5 %
## (Intercept) 165.429005 180.395339
## AGE 1.136123 1.431589
## This may give a more familiar presentation. Don't exponentiate in lm!
tableone::ShowRegTable(lm1, exp = FALSE)
## coef [confint] p
## (Intercept) 172.91 [165.43, 180.40] <0.001
## AGE 1.28 [1.14, 1.43] <0.001
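If you prefer the results as a data frame, the broom package (assumed installed; it is not loaded above) collects the estimates and confidence intervals in one tibble.
## Coefficients and CIs as a tibble
broom::tidy(lm1, conf.int = TRUE)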
The formula interface is very flexible, and you can do many transformations on the fly. Here AGE is mean-centered to make the intercept value more interpretable. However, constructing such variables up front in the dataset using mutate() may be better practice (a sketch follows the output below).
lm1c <- lm(formula = TOTCHOL ~ I(AGE - mean(AGE, na.rm = TRUE)),
           data = framingham)
summary(lm1c)
##
## Call:
## lm(formula = TOTCHOL ~ I(AGE - mean(AGE, na.rm = TRUE)), data = framingham)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.78 -29.52 -2.98 25.17 457.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 237.00970 0.65330 362.79 <2e-16 ***
## I(AGE - mean(AGE, na.rm = TRUE)) 1.28386 0.07535 17.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.25 on 4380 degrees of freedom
## (52 observations deleted due to missingness)
## Multiple R-squared: 0.06215, Adjusted R-squared: 0.06194
## F-statistic: 290.3 on 1 and 4380 DF, p-value: < 2.2e-16
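For reference, here is a sketch of the up-front mutate() approach mentioned above (the names framingham_centered and AGE_c are my choices).
## Construct the centered variable first, then model it
framingham_centered <- framingham %>%
    mutate(AGE_c = AGE - mean(AGE, na.rm = TRUE))
lm(TOTCHOL ~ AGE_c, data = framingham_centered)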
The ggfortify package is useful for automatically providing graphical model diagnostics.
library(ggfortify)
autoplot(lm1, which = 1:6, ncol = 3, label.size = 3)
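For comparison, base R graphics produce the same six diagnostic panels.
## Base-graphics equivalent of the autoplot() panels
par(mfrow = c(2, 3))
plot(lm1, which = 1:6)
par(mfrow = c(1, 1))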
More detailed discussion using the car package can be found here.
Let’s try a simple comparison of model fit (prediction) to the raw data. We use the add_predictions() function in the modelr package to add the prediction (pred) to the dataset and plot them together using ggplot2.
## Load modelr package for add_predictions()
library(modelr)
##
framingham %>%
    ## Add prediction to dataset
    add_predictions(lm1) %>%
    ## Plot
    ggplot(mapping = aes(x = AGE, y = TOTCHOL)) +
    geom_point() +
    geom_line(mapping = aes(y = pred), size = 2, color = "blue") +
    theme_bw()
## Warning: Removed 52 rows containing missing values (geom_point).
The residual (error) vs fitted (prediction) plot from the six-panel diagnostics above can be recreated as follows.
framingham %>%
    ## Add prediction to dataset
    add_predictions(lm1) %>%
    ## Also add residuals to dataset
    add_residuals(lm1) %>%
    ## Plot
    ggplot(mapping = aes(x = pred, y = resid)) +
    geom_point() +
    geom_smooth(color = "blue", size = 2) +
    theme_bw()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 52 rows containing non-finite values (stat_smooth).
## Warning: Removed 52 rows containing missing values (geom_point).
Now let’s examine more complex models with multiple variables (two!). Here we model the average total cholesterol as a function of both age and sex.
This is almost exactly the same except for the additional variable SEX.
lm2 <- lm(TOTCHOL ~ AGE + SEX,
          data = framingham)
summary(lm2)
##
## Call:
## lm(formula = TOTCHOL ~ AGE + SEX, data = framingham)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158.27 -29.68 -3.12 25.20 460.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 175.6959 3.8604 45.51 < 2e-16 ***
## AGE 1.2796 0.0752 17.02 < 2e-16 ***
## SEX -5.8152 1.3128 -4.43 0.00000967 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.15 on 4379 degrees of freedom
## (52 observations deleted due to missingness)
## Multiple R-squared: 0.06634, Adjusted R-squared: 0.06591
## F-statistic: 155.6 on 2 and 4379 DF, p-value: < 2.2e-16
If you want to include an interaction term, the formula is the following.
lm3 <- lm(TOTCHOL ~ AGE + SEX + AGE:SEX,
          data = framingham)
summary(lm3)
##
## Call:
## lm(formula = TOTCHOL ~ AGE + SEX + AGE:SEX, data = framingham)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.46 -28.59 -3.24 24.44 462.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.6914 5.0388 26.33 <2e-16 ***
## AGE 2.1396 0.0993 21.55 <2e-16 ***
## SEX 90.0487 7.5157 11.98 <2e-16 ***
## AGE:SEX -1.9218 0.1484 -12.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.36 on 4378 degrees of freedom
## (52 observations deleted due to missingness)
## Multiple R-squared: 0.1008, Adjusted R-squared: 0.1002
## F-statistic: 163.5 on 3 and 4378 DF, p-value: < 2.2e-16
Let’s examine how these two models differ by predicting for a fake dataset. An all-combinations dataset can be constructed with crossing(). Here we need to perform a wide-to-long conversion using gather().
## Create a fake data set to predict for.
newdata <- crossing(AGE = seq(from = 50, to = 70, by = 1),
                    SEX = c(0, 1))
newdata
## # A tibble: 42 x 2
## AGE SEX
## <dbl> <dbl>
## 1 50 0
## 2 50 1
## 3 51 0
## 4 51 1
## 5 52 0
## 6 52 1
## 7 53 0
## 8 53 1
## 9 54 0
## 10 54 1
## # ... with 32 more rows
## Predict for this new dataset using both models.
newdata_long <- newdata %>%
    ## Predict using model lm2; the prediction variable is named lm2
    add_predictions(model = lm2, var = "lm2") %>%
    ## Predict using model lm3; the prediction variable is named lm3
    add_predictions(model = lm3, var = "lm3") %>%
    ## The lm2 and lm3 columns are stacked into a single pred column,
    ## distinguished by the model variable
    gather(key = model, value = pred,
           lm2, lm3)
newdata_long
## # A tibble: 84 x 4
## AGE SEX model pred
## <dbl> <dbl> <chr> <dbl>
## 1 50 0 lm2 239.6751
## 2 50 1 lm2 233.8599
## 3 51 0 lm2 240.9547
## 4 51 1 lm2 235.1394
## 5 52 0 lm2 242.2343
## 6 52 1 lm2 236.4190
## 7 53 0 lm2 243.5139
## 8 53 1 lm2 237.6986
## 9 54 0 lm2 244.7935
## 10 54 1 lm2 238.9782
## # ... with 74 more rows
## Plot
ggplot(data = newdata_long,
       mapping = aes(x = AGE, y = pred, group = SEX, color = factor(SEX))) +
    geom_line() +
    facet_grid(. ~ model, labeller = label_both) +
    coord_cartesian(ylim = c(0, 300)) +
    theme_bw()
It is clear from the visualization that model lm2 constrains the two lines to have the same slope in both sexes, whereas model lm3 allows different slopes. Which seems the more plausible model for the data? When the models are nested (roughly speaking, dropping terms from the more complex model gives the simpler one), one option is the anova() function, which conducts a formal test comparing the fit of the two models (here a partial F-test).
anova(lm2, lm3)
## Analysis of Variance Table
##
## Model 1: TOTCHOL ~ AGE + SEX
## Model 2: TOTCHOL ~ AGE + SEX + AGE:SEX
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4379 8155059
## 2 4378 7854337 1 300722 167.62 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A more general way is the Akaike (pronounced Aka-Ike) information criterion (AIC; the smaller, the better). Here I create a list of models and map the AIC function over each element.
map(list(lm2 = lm2, lm3 = lm3), AIC)
## $lm2
## [1] 45435.17
##
## $lm3
## [1] 45272.52
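AIC() also accepts multiple models directly and returns a small comparison table.
## Same comparison as a data frame
AIC(lm2, lm3)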
Generalized linear models (models for outcomes that are binary, counts, etc.) are fit with the glm() function. In addition to the model formula and dataset, we need to specify the “family”.
For a binary outcome variable, binary logistic regression is commonly used. Note the results are on the coefficient scale (log odds ratios).
glm_logistic <- glm(formula = PREVCHD ~ AGE + SEX,
                    family = binomial(link = "logit"),
                    data = framingham)
## Summary on log odds ratio scale.
summary(glm_logistic)
##
## Call:
## glm(formula = PREVCHD ~ AGE + SEX, family = binomial(link = "logit"),
## data = framingham)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8219 -0.3346 -0.2136 -0.1440 3.0617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.707187 0.582144 -16.675 < 2e-16 ***
## AGE 0.114307 0.009912 11.532 < 2e-16 ***
## SEX 0.908359 0.155909 5.826 0.00000000567 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1593.5 on 4433 degrees of freedom
## Residual deviance: 1400.6 on 4431 degrees of freedom
## AIC: 1406.6
##
## Number of Fisher Scoring iterations: 7
## Exponentiation gives odds ratios.
tableone::ShowRegTable(glm_logistic)
## exp(coef) [confint] p
## (Intercept) 0.00 [0.00, 0.00] <0.001
## AGE 1.12 [1.10, 1.14] <0.001
## SEX 2.48 [1.83, 3.38] <0.001
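The same odds ratios can be computed by hand; confint() on a glm object uses profile-likelihood intervals, so the numbers should match the table above.
## Odds ratios with profile-likelihood CIs
exp(cbind(OR = coef(glm_logistic), confint(glm_logistic)))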
For a count outcome variable, Poisson regression is often used. This is not a great example (too many zeros, i.e., zero-inflated data), but I’ll take cigarettes per day as the outcome variable.
glm_poisson <- glm(formula = CIGPDAY ~ AGE + SEX,
                   family = poisson(link = "log"),
                   data = framingham)
## Summary on log-mean ratio scale.
summary(glm_poisson)
##
## Call:
## glm(formula = CIGPDAY ~ AGE + SEX, family = poisson(link = "log"),
## data = framingham)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.494 -3.670 -2.699 1.995 11.918
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1843893 0.0301395 105.66 <2e-16 ***
## AGE -0.0297036 0.0006043 -49.15 <2e-16 ***
## SEX 0.8444116 0.0105274 80.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 71142 on 4401 degrees of freedom
## Residual deviance: 61732 on 4399 degrees of freedom
## (32 observations deleted due to missingness)
## AIC: 71440
##
## Number of Fisher Scoring iterations: 6
## Exponentiation gives rate (mean) ratios, not odds ratios.
tableone::ShowRegTable(glm_poisson)
## exp(coef) [confint] p
## (Intercept) 24.15 [22.77, 25.62] <0.001
## AGE 0.97 [0.97, 0.97] <0.001
## SEX 2.33 [2.28, 2.38] <0.001
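Given the zero inflation noted above, one quick robustness check (a sketch, not a full solution) is a quasi-Poisson fit, which estimates the dispersion parameter instead of fixing it at 1.
## Quasi-Poisson relaxes the Poisson variance assumption
glm_quasipoisson <- glm(formula = CIGPDAY ~ AGE + SEX,
                        family = quasipoisson(link = "log"),
                        data = framingham)
## A dispersion estimate well above 1 indicates overdispersion
summary(glm_quasipoisson)$dispersion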
Survival analysis functions are included in the survival package.
library(survival)
Kaplan-Meier survival estimates are obtained as follows. A survival outcome consists of two columns: one specifying the observation time, the other specifying whether the event was observed. Here we examine the time-to-death outcome over the 24 years of follow-up.
## Non-parametric survival estimates
kmfit1 <- survfit(Surv(time = TIMEDTH, event = DEATH) ~ SEX, data = framingham)
## Show survival estimates at these years
summary(kmfit1, times = c(0,6,12,18,24))
## Call: survfit(formula = Surv(time = TIMEDTH, event = DEATH) ~ SEX,
## data = framingham)
##
## SEX=0
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 2490 0 1.000 0.00000 1.000 1.000
## 6 2396 94 0.962 0.00382 0.955 0.970
## 12 2237 159 0.898 0.00605 0.887 0.910
## 18 2019 218 0.811 0.00785 0.796 0.826
## 24 1783 236 0.716 0.00904 0.699 0.734
##
## SEX=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 1944 0 1.000 0.00000 1.000 1.000
## 6 1809 135 0.931 0.00577 0.919 0.942
## 12 1633 176 0.840 0.00831 0.824 0.856
## 18 1391 242 0.716 0.01023 0.696 0.736
## 24 1101 290 0.566 0.01124 0.545 0.589
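A log-rank test provides a formal comparison of the two survival curves.
## Log-rank test for a difference in survival by sex
survdiff(Surv(time = TIMEDTH, event = DEATH) ~ SEX, data = framingham)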
The survminer package is useful in producing informative plots (check out the vignettes).
library(survminer)
## Loading required package: ggpubr
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
ggsurvplot(kmfit1, risk.table = TRUE)
Cox regression is fit with coxph(). The formula is the same as before except for the LHS, which is a Surv object identical to the one used for survfit().
coxph1 <- coxph(Surv(time = TIMEDTH, event = DEATH) ~ AGE + SEX,
                data = framingham)
summary(coxph1)
## Call:
## coxph(formula = Surv(time = TIMEDTH, event = DEATH) ~ AGE + SEX,
## data = framingham)
##
## n= 4434, number of events= 1550
##
## coef exp(coef) se(coef) z Pr(>|z|)
## AGE 0.092328 1.096724 0.003193 28.92 <2e-16 ***
## SEX 0.614190 1.848160 0.051161 12.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## AGE 1.097 0.9118 1.090 1.104
## SEX 1.848 0.5411 1.672 2.043
##
## Concordance= 0.718 (se = 0.007 )
## Rsquare= 0.203 (max possible= 0.997 )
## Likelihood ratio test= 1006 on 2 df, p=0
## Wald test = 943.6 on 2 df, p=0
## Score (logrank) test = 1029 on 2 df, p=0
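The proportional hazards assumption behind this model can be checked with cox.zph(), which tests for a time trend in the scaled Schoenfeld residuals.
## Test the proportional hazards assumption
cox_zph1 <- cox.zph(coxph1)
cox_zph1
## Plot the scaled Schoenfeld residuals against time
plot(cox_zph1)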