This quick tutorial is designed as a short introduction to regression models, to logistic regression models that estimate the log-odds of disease for different exposure groups, and to the use of multivariate regression to control for potential confounders. Regression models are basic statistical techniques used in almost every field. Logistic regressions are a subset used when the outcome is binary (0/1), like disease status or death. For 250B we want you to be able to read and interpret the output of a logistic regression, as you will see plenty of them in public health papers, and to understand how multivariate regressions can be used to control confounding.
For those not interested in R/Stata, you are welcome to ignore the coding intricacies in this document (and ignore the graphing code even if you are). Focus on the results from the models, how you interpret the logistic regression model output, what it means to add multiple variables in a model, and what an interaction term is and what it means.
This explainer goes more in depth, showing the R code used in fitting regressions and attempting to offer a conceptual understanding of the math behind regressions and how they can be used to control for confounding. You don’t need to know how to code or hand-calculate a regression for this class, but you’ll probably have to run one in Stata, R, or another statistical software at some point in your academic career. All the analysis done on this page is in R, but could have been done in Stata. (R is just used to make this nice-looking document.)
Disclaimer: This is a very simple overview that glosses over much nuance that you will learn more about in biostatistics courses.
In this explainer, we will fit and interpret some regression models using data from the CORIS heart study (full publication at: http://ije.oxfordjournals.org/content/22/3/428.full.pdf+html). Below is the abstract:
The Coronary Risk Factor Study (CORIS) examined the feasibility and effectiveness of a multifactorial community intervention programme to reduce coronary heart disease (CHD) risk factor levels. Three Afrikaner communities were surveyed before and after a 4-year intervention in two of the communities, the third serving as a control (C). Intervention was primarily by small mass media (low-intensity intervention, LII) or by small mass media plus interpersonal intervention to high-risk individuals (high-intensity intervention, HII). After allowing for change in C, significant net reductions in blood pressure, smoking, and risk score were obtained in LII and HII alike. Though the total cholesterol (TC) fell by 10-12%, there was no net reduction in favour of the intervention communities. However, LII and HII resulted in significant increases in high-density lipoprotein cholesterol (HDL-C) levels and HDL-C/TC ratios in comparison to C. Overall, the LII community fared almost as well as the HII community, and high-risk individuals did not show a greater change in risk factors than others. We conclude that community-based intervention works, and that in these particular communities a media-based health education programme was more cost-effective than one which adds a greater degree of interpersonal intervention.
Unfortunately, this data set does not include intervention arm, so we can only evaluate observational differences among study subjects, rather than contrast intervention arms. The data were found online as part of the All of Statistics textbook (which I highly recommend epi/biostats students download as a future reference; I use it a lot). I’ve attached it here. There is not great documentation of how the author got the data or what cleaning has been done, which I believe is why the variables in the data do not match what is in the publication.
First we’ll read in the data and examine the first 6 rows.
setwd("C:/Users/andre/Dropbox/250B redesign/R problemsets")
df<-read.delim("coris_heartdiseasedata.txt",header=T,sep=",")
head(df)
  row.names sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
1         1 160   12.00 5.73     23.11       1    49   25.30   97.20  52   1
2         2 144    0.01 4.41     28.61       0    55   28.87    2.06  63   1
3         3 118    0.08 3.48     32.28       1    52   29.14    3.81  46   0
4         4 170    7.50 6.41     38.03       1    51   31.99   24.26  58   1
5         5 134   13.60 3.50     27.78       1    60   25.99   57.34  49   1
6         6 132    6.20 6.47     36.21       1    62   30.77   14.14  45   0
Suppose we only care about a binary tobacco use = yes or tobacco use = no. Below, a binary tobacco exposure is created from the continuous tobacco usage variable. We’ll use this binary variable as an exposure or potential confounder later.
#Create binary "anytobacco variable"
df$anysmoke<-0
df$anysmoke[df$tobacco>0]<-1
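As a quick sanity check on the recode (not part of the original output, just a suggested check), you can tabulate the new variable against the condition used to create it; the off-diagonal cells should be zero.
#Check the recode: everyone with tobacco > 0 should have anysmoke = 1
table(df$anysmoke, df$tobacco > 0)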
Remember the 2x2 table format:
| . | Disease | No Disease | Total |
|---|---|---|---|
| Exposed | a | b | a+b |
| Unexposed | c | d | c+d |
| Total | a+c | b+d | a+b+c+d |
The table() command can be used with two variables to give us a 2x2 table.
mytable <- table(df$anysmoke,df$chd) # first variable (anysmoke) will be rows, second (chd) will be columns
mytable # print table
0 1
0 92 15
1 210 145
Note that because R sorts both the exposure and the outcome values from 0 to 1, the organization of this table is flipped from the layout we are used to: the unexposed/no-disease cell (d) is in the top left and the exposed/disease cell (a) is in the bottom right.
\[OR=\frac{145*92}{210*15}=4.235\]
Smoking multiplies the odds of coronary heart disease by 4.24 compared to a non-smoking reference group. We’ll come back to this number when we examine logistic regressions.
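The same odds ratio can be computed in R from the cells of the mytable object created above; a minimal sketch (remembering that, because of R’s 0/1 ordering, the exposed/diseased cell a sits in the bottom-right):
#Odds ratio from the 2x2 table: (a*d)/(b*c)
(mytable["1","1"] * mytable["0","0"]) / (mytable["1","0"] * mytable["0","1"])  #(145*92)/(210*15) = about 4.235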
Basic linear regressions fit a line through a cloud of points so as to minimize the average squared vertical distance between the points and the line. A more statistical introduction can be found here: http://onlinestatbook.com/2/regression/intro.html
For example, plot age on the X-axis and obesity on the Y-axis:
plot(df$age,df$obesity)
How does obesity change with age? You can maybe see that it, on average, increases a bit with age. Is there an easier way to visualize or describe this? A regression describes a cloud of points with an equation predicting a value of Y at each value of X. The model is fit so that the overall (summed) squared vertical distance of each point from the line is minimized.
Univariate (single predictor/exposure variable) regression models take the following form: \[outcome=intercept+slope\times exposure\] \[Y={\beta}_{0}+{\beta}_{1}X\]
Y is a continuous outcome, and X is often a continuous exposure, although it could be binary. The intercept is the value Y takes when X is 0, and the slope is the change in Y with each one-unit increase in X.
We will fit a regression to the cloud of points above and add the fitted line to the graph.
plot(df$age, df$obesity)
fit<-lm(df$obesity~df$age)
abline(fit,col="red",lwd=1.5)
fit is the R object holding the results of the regression model fitting. Let’s examine them:
summary(fit)
Call:
lm(formula = df$obesity ~ df$age)
Residuals:
Min 1Q Median 3Q Max
-11.6121 -2.6293 -0.4255 2.0844 22.7085
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.44082 0.58185 38.568 < 2e-16 ***
df$age 0.08416 0.01286 6.543 1.61e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.035 on 460 degrees of freedom
Multiple R-squared: 0.08513, Adjusted R-squared: 0.08315
F-statistic: 42.81 on 1 and 460 DF, p-value: 1.614e-10
The “Estimate” column is the \({\beta}\) estimate. So the fitted linear regression model for age’s association with obesity is: \[Obesity=22.44+0.084\times Age\] The “Pr(>|t|)” column is the p-value testing whether the estimate is different from the null value (0 for a linear regression). It combines the magnitude of the estimate with the size of the standard error through a Wald test.
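As an illustration (age 50 is just an arbitrary example, not from the original analysis), the fitted equation can be used to predict obesity directly from the stored coefficients:
coef(fit)  #intercept and slope
#Predicted obesity (BMI) at age 50: intercept + slope*50
coef(fit)[1] + coef(fit)[2]*50  #roughly 22.44 + 0.084*50 = 26.6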
Suppose our outcome is a death/no-death or disease/no-disease binary outcome. In this dataset, let’s fit a regression between age and risk of coronary heart disease. First, let’s fit a linear regression as above.
plot(df$age, df$chd, xlim=c(0,100))
fit2<-lm(df$chd~df$age)
abline(fit2,col="red",lwd=1.5)
summary(fit2)
Call:
lm(formula = df$chd ~ df$age)
Residuals:
Min 1Q Median 3Q Max
-0.6039 -0.3729 -0.1418 0.4690 0.9676
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.17434 0.06380 -2.733 0.00653 **
df$age 0.01216 0.00141 8.621 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4424 on 460 degrees of freedom
Multiple R-squared: 0.1391, Adjusted R-squared: 0.1372
F-statistic: 74.33 on 1 and 460 DF, p-value: < 2.2e-16
What’s the predicted cumulative risk of coronary heart disease for a 40 year old? \[-0.174+(0.012\times 40)=0.31\] That seems a plausible number. What’s the predicted cumulative risk of coronary heart disease for a 100 year old?
\[-0.174+(0.012\times 100)=1.026\] There is a predicted risk of coronary heart disease greater than 100%. Linear regressions, when applied to binary outcomes, can produce nonsensical predictions for low or high values of \(X\). Logistic regressions overcome this issue by modeling the log-odds of the outcome, which leads to predicted risks bounded between 0 and 1.
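Before moving on, those linear-model predictions can be reproduced in R from the unrounded coefficients of fit2 (so the values differ slightly from the rounded hand calculations above):
#Linear-model predicted risk at ages 40 and 100
coef(fit2)[1] + coef(fit2)[2]*c(40, 100)  #about 0.31 and 1.04 -- the second exceeds 1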
Logistic regressions change the standard linear regression equation to: \[log(\frac{Y}{1-Y})={\beta}_{0}+{\beta}_{1}X\]
\(Y\) will always lie between 0 and 1 (which is what we want, because risks, or the proportions of outcomes that are diseased, are always between 0 and 1). Use algebra to transform the above equation so the left side is \(Y=\) to see why. Notice below that, in a fitted logistic regression, the predicted risk of disease at each value of \(X\) stays between 0 and 1.
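Spelling out that algebra: exponentiating both sides and solving for \(Y\) gives \[\frac{Y}{1-Y}=e^{{\beta}_{0}+{\beta}_{1}X} \quad\Rightarrow\quad Y=\frac{e^{{\beta}_{0}+{\beta}_{1}X}}{1+e^{{\beta}_{0}+{\beta}_{1}X}}=\frac{1}{1+e^{-({\beta}_{0}+{\beta}_{1}X)}}\] Because \(e^{{\beta}_{0}+{\beta}_{1}X}\) is always positive, this fraction is always strictly between 0 and 1, no matter what value \(X\) takes.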
plot(df$age, df$chd, xlim=c(0,100))
#The below line is R code for fitting a logistic regression in R. The family="binomial" specifies that it is a logistic regression, with binomial outcome data.
logistic.fit<-glm(chd~age, family="binomial", data=df)
curve(predict(logistic.fit,data.frame(age=x),type="resp"),add=TRUE, col="red")
This is a plot of the predicted risk of the outcome (transformed from the log-odds). Notice that the tails approach 0 and 1 but never cross them. We don’t have the same issue of nonsensical predictions that we saw above with the linear regression. Let’s examine the R output of the model.
summary(logistic.fit)
Call:
glm(formula = chd ~ age, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4321 -0.9215 -0.5392 1.0952 2.2433
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.521710 0.416031 -8.465 < 2e-16 ***
age 0.064108 0.008532 7.513 5.76e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 525.56 on 460 degrees of freedom
AIC: 529.56
Number of Fisher Scoring iterations: 4
\[{\beta}_{0}= -3.522\] \[{\beta}_{1}= 0.064\] \[log(\frac{Y}{1-Y})={-3.522}+0.064\times X\] \[Odds=e^{(-3.522+0.064\times X)}\]
* What’s the odds of CHD for a 40 year old? \[e^{-3.522+0.064\times 40}=0.382128\]
* What’s the odds of CHD for a 41 year old? \[e^{-3.522+0.064\times 41}=0.407384\]
* What’s the odds ratio between a 41 year old and a 40 year old? \[\frac{0.407384}{0.382128}=1.066092\]
* What’s the exponentiated coefficient of age? \[e^{{\beta}_{1}}=e^{0.064}=1.066092\]
* Why are these the same? \[{OR}_{41v40}=\frac{e^{{\beta}_{0}+{\beta}_{1}\times 41}}{e^{{\beta}_{0}+{\beta}_{1}\times 40}}=\frac{e^{{\beta}_{0}}}{e^{{\beta}_{0}}}\times \frac{e^{{\beta}_{1}\times 41}}{e^{{\beta}_{1}\times 40}}=\frac{e^{{\beta}_{1}\times 40}\times e^{{\beta}_{1}}}{e^{{\beta}_{1}\times 40}}=e^{{\beta}_{1}}=1.066\]
So the exponentiated coefficient is the odds ratio between X and X+1, a one-unit increase in X. If X is a binary variable like “smoking”, the OR between smoking and non-smoking is just the exponentiated coefficient from the model.
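In R, a common reporting pattern is to exponentiate both the coefficients and their confidence limits to present ORs with 95% confidence intervals; a small sketch using the age model fit above (confint.default() gives Wald-based intervals):
#Odds ratios with Wald 95% confidence intervals for the age model
exp(cbind(OR = coef(logistic.fit), confint.default(logistic.fit)))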
Let’s examine this in a model with smoking as the exposure and CHD as the outcome.
logistic.fit.smoke<-glm(chd~anysmoke, family="binomial", data=df)
summary(logistic.fit.smoke)
Call:
glm(formula = chd ~ anysmoke, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0247 -1.0247 -0.5496 1.3382 1.9823
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8137 0.2785 -6.514 7.34e-11 ***
anysmoke 1.4434 0.2987 4.833 1.35e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 566.90 on 460 degrees of freedom
AIC: 570.9
Number of Fisher Scoring iterations: 4
#Exponentiate the coefficients to get the Odds ratio
exp(logistic.fit.smoke$coefficients)
(Intercept) anysmoke
0.1630435 4.2349206
In R and Stata it is very easy to add additional exposure/independent variables to the model, making a multivariate regression model.
logfit.multi<-glm(chd~age+anysmoke, family="binomial", data=df)
summary(logfit.multi)
Call:
glm(formula = chd ~ age + anysmoke, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4540 -0.9139 -0.4894 1.0615 2.4835
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.031792 0.480847 -8.385 < 2e-16 ***
age 0.058514 0.008894 6.579 4.75e-11 ***
anysmoke 0.916926 0.319456 2.870 0.0041 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 516.50 on 459 degrees of freedom
AIC: 522.5
Number of Fisher Scoring iterations: 4
#Exponentiate the coefficients to calculate the ORs
exp(logfit.multi$coefficients)
(Intercept) age anysmoke
0.0177425 1.0602600 2.5015892
OR of smokers compared to non-smokers: 2.5016
OR for each additional year of age: 1.0603
However, this new model is slightly harder to interpret. The odds ratios of both age and smoking are smaller than in the models with just age and just smoking. Why might this be? Let’s examine this new model graphically.
plot(df$age, df$chd, xlim=c(0,100))
curve(plogis(-4.031792 + 0.058514*x), 0, 110, add=TRUE, lwd=1, lty=1)
curve(plogis(-4.031792 +0.916926+ 0.058514*x), 0,110, add=TRUE, lwd=3, lty=3)
legend('bottomright', c('Nonsmokers','Smokers'), lty=c(1, 3), lwd=c(1,3), bty='n', cex=0.8)
Looking at the graph, the multivariate regression is equivalent to fitting two regression models, with the same slope for age, stratified by smoking status. The ratio between the odds predicted from each smoking-stratified line is therefore the odds ratio between smokers and non-smokers, which is constant regardless of age. Therefore there is no multiplicative interaction, and, since logistic regressions predict on the multiplicative scale, there is no effect modification in this model.
Note: It is certain that there is no effect modification in this multivariate regression model, because the model equation contains no interaction term; the same odds ratio for smoking is forced at every age. This doesn’t mean that there isn’t actual effect modification by smoking occurring in the data. We haven’t checked yet.
Why are the odds ratios calculated from the multivariate logistic regression model different from those calculated from the univariate (single exposure) models? CONFOUNDING!!! If there were zero confounding by smoking of the age-CHD relationship, adding smoking to the model would not change the coefficient of age. This is conceptually identical to checking for confounding by comparing stratified measures of association to the crude measure of association. So a common way of checking for confounding is to add a potential confounder to a regression model and examine whether, and by how much, the coefficient of the exposure of interest changes. Often this is done with many potential confounders, and the multivariate model with confounders included is called the adjusted model. This can be a much more complicated process, fraught with investigator decisions, than presented here; it is covered much more in biostatistics courses. But these adjusted models are commonly found in the published public health literature.
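For this dataset, a small sketch of that check, comparing the age coefficient in the crude (age-only) model with the adjusted (age + smoking) model fit above; the "more than about 10% change" cutoff mentioned in the comment is one common rule of thumb, not a rule used in the original analysis:
#Crude vs. adjusted log-odds coefficient for age
coef(logistic.fit)["age"]   #crude: about 0.064
coef(logfit.multi)["age"]   #adjusted for smoking: about 0.059
#Relative change (%) in the age coefficient after adjustment; changes of more than
#about 10% are often flagged as meaningful confounding
100*(coef(logistic.fit)["age"] - coef(logfit.multi)["age"]) / coef(logfit.multi)["age"]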
What is causing the confounding in this situation? We need to examine the relationship between age and smoking. We can do this graphically, or analytically by fitting a new logistic regression model with age as the exposure and smoking as the outcome, or by running a t-test between smoking groups to test whether mean age differs.
Graphically:
library(popbio)
logi.hist.plot(df$age,df$anysmoke,boxp=FALSE,type="hist",col="gray")
Older people are far likelier to smoke than younger study subjects.
Analytically:
smoke.age<-glm(anysmoke~age, family="binomial", data=df)
summary(smoke.age)
Call:
glm(formula = anysmoke ~ age, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3877 0.3451 0.4707 0.7084 1.3133
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.348952 0.332969 -4.051 5.09e-05 ***
age 0.064688 0.008538 7.577 3.54e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 500.07 on 461 degrees of freedom
Residual deviance: 433.29 on 460 degrees of freedom
AIC: 437.29
Number of Fisher Scoring iterations: 4
#Exponentiate the coefficients to calculate the ORs
exp(smoke.age$coefficients)
(Intercept) age
0.259512 1.066826
The graphical findings are supported analytically. Each one-year increase in age multiplies the odds of smoking by 1.067.
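The t-test mentioned above is a third option; a minimal sketch using the same variables (not run in the original analysis):
#Is mean age different between smokers and non-smokers?
t.test(age ~ anysmoke, data = df)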
The effect of age on CHD odds was artificially inflated in this data sample by confounding from smoking status. Alternatively, the effect of smoking on CHD odds was artificially inflated by confounding by age. Older people have higher CHD risk, and older people are more likely to smoke, which also increases CHD risk. Without accounting for one exposure, the other exposure is going to seem like it has a larger effect on CHD odds than it truly does.
But is this just confounding? Possibly. But it could also be that, not only are older people more likely to smoke, but greater age increases the negative effects of smoking, and smoking increases the negative effects of aging on heart health. That would be effect modification.
To incorporate potential effect modification in a regression equation, add a statistical interaction term, a multiplication of two exposures included as its own term in the model.
\[log(\frac{Y}{1-Y})={\beta}_{0}+{\beta}_{1}{X}_{1}+{\beta}_{2}{X}_{2}+{\beta}_{3}{X}_{1}{X}_{2}\] In this dataset: \[log({Odds}_{CHD})={\beta}_{0}+{\beta}_{1}\times Age+{\beta}_{2}\times Smoking+{\beta}_{3}\times Smoking\times Age\]
Below, the model is fitted in R. In the formula, the age*anysmoke term expands to age + anysmoke + (age × anysmoke), so both main effects and the interaction are included.
logfit.multi.int<-glm(chd~age*anysmoke, family="binomial", data=df)
summary(logfit.multi.int)
Call:
glm(formula = chd ~ age * anysmoke, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4358 -0.9232 -0.5006 1.0718 2.5838
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.46666 0.94923 -4.706 2.53e-06 ***
age 0.06853 0.02041 3.358 0.000785 ***
anysmoke 1.47319 1.06750 1.380 0.167574
age:anysmoke -0.01254 0.02269 -0.553 0.580382
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 516.19 on 458 degrees of freedom
AIC: 524.19
Number of Fisher Scoring iterations: 5
#Exponentiate the coefficients to calculate the ORs
exp(logfit.multi.int$coefficients)
(Intercept) age anysmoke age:anysmoke
0.01148565 1.07092750 4.36314431 0.98753739
The coefficient of age:anysmoke is negative. This means any effect modification between age and smoking is antagonistic: the odds ratio of CHD for smokers of a given age, compared to non-smokers at the reference age, is less than what one would expect by simply combining the individual odds ratio for smoking with the odds ratio for being that age.
Calculations of the odds ratios between smokers and non-smokers at different ages, based on the fitted model equation, support the assertion that there is effect modification.
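A brief sketch of those calculations, using the stored coefficients of logfit.multi.int (ages 20 and 60 are arbitrary example values):
b <- coef(logfit.multi.int)
#OR for smokers vs. non-smokers of the SAME age: exp(beta_anysmoke + beta_interaction*age)
exp(b["anysmoke"] + b["age:anysmoke"]*20)  #at age 20: about 3.4
exp(b["anysmoke"] + b["age:anysmoke"]*60)  #at age 60: about 2.1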
Smokers have a higher odds ratio of CHD compared to non-smokers at age 20 than at age 60. Another way of framing this is that smoking increases the odds of CHD compared to non-smoking, but this effect decreases as age increases. The magnitude of the effect of smoking is modified by age.
What’s the odds ratio between a smoking 60-year-old and a non-smoking 20-year-old? \[OR=\frac{e^{-4.46666+0.06853\times 60+1.47319-0.01254\times 60}}{e^{-4.46666+0.06853\times 20}}=\frac{e^{0.06853\times 60+1.47319-0.01254\times 60}}{e^{0.06853\times 20}}\] \[=e^{0.06853\times 40+1.47319-0.01254\times 60}=31.88\]
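The same number can be obtained from the model’s predicted log-odds; a sketch using predict() on the fitted interaction model:
#Log-odds for a non-smoking 20-year-old and a smoking 60-year-old
lp <- predict(logfit.multi.int,
              newdata = data.frame(age = c(20, 60), anysmoke = c(0, 1)),
              type = "link")
exp(lp[2] - lp[1])  #odds ratio, roughly 31.9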
The p-value of the interaction term can be used to assess the statistical evidence for effect modification. Here it is 0.58, well above the common 0.05 significance threshold, so the evidence for this interaction is weak and many analysts would leave it out of the model. It is common to include potential confounders with a univariate regression p-value < 0.2 in an adjusted model, and to include tested interaction terms with p-values < 0.2. But these are complicated decisions with no set template, so they are often made differently from investigator to investigator. This is another reason why setting an analysis plan, and pre-specifying decisions like this, before starting the analysis is important.
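Another common way to assess the interaction (not shown in the original analysis, but easy with the two models already fit) is a likelihood ratio test comparing the models with and without the interaction term:
#Likelihood ratio test: does adding the age x smoking interaction improve model fit?
anova(logfit.multi, logfit.multi.int, test = "Chisq")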
Effect modification is often clearer when examined graphically. Below, now that the interaction term has been added, the predicted risk curves for smokers and non-smokers are no longer a constant odds ratio apart: they converge as age increases, reflecting the shrinking smoking-associated increase in CHD odds among older subjects (in this fit the curves would not actually cross until an implausibly high age). This may clash with your intuition about the risks of smoking.
plot(df$age, df$chd, xlim=c(0,100))
curve(plogis(-4.46666 + 0.06853*x), 0, 110, add=TRUE, lwd=1, lty=1)
curve(plogis(-4.46666 + 0.06853*x + 1.47319 - 0.01254*x), 0, 110, add=TRUE, lwd=3, lty=3)
legend('bottomright', c('Nonsmokers','Smokers'), lty=c(1, 3), lwd=c(1,3), bty='n', cex=0.8)