This quick tutorial is designed as a short introduction to regression models, to logistic regression models that estimate the log-odds of disease for different exposure groups, and to the use of multivariate regression to control for potential confounders. Regression models are basic statistical techniques used in almost every field. Logistic regressions are a subset used when the outcome is binary (0/1), like disease status or death. For 250B we want you to be able to read and interpret the output of a logistic regression, as you will see plenty of them in public health papers, and to understand how multivariate regressions can be used to control confounding.
For those not interested in R/Stata, you are welcome to ignore the coding intricacies in this document (and ignore the graphing code even if you are). Focus on the results from the models, how you interpret the logistic regression model output, what it means to add multiple variables in a model, and what an interaction term is and what it means.
This explainer goes more in depth, showing the R code used in fitting regressions and attempting to offer a conceptual understanding of the math behind regressions and how they can be used to control for confounding. You don’t need to know how to code or hand-calculate a regression for this class, but you’ll probably have to run one in Stata, R, or another statistical software at some point in your academic career. All the analysis done on this page is in R, but could have been done in Stata. (R is just used to make this nice-looking document.)
Disclaimer: This is a very simple overview that glosses over much nuance that you will learn more about in biostatistics courses.
In this explainer, we will fit and interpret some regression models using data from the CORIS heart study (full publication at: http://ije.oxfordjournals.org/content/22/3/428.full.pdf+html). Below is the abstract:
The Coronary Risk Factor Study (CORIS) examined the feasibility and effectiveness of a multifactorial community intervention programme to reduce coronary heart disease (CHD) risk factor levels. Three Afrikaner communities were surveyed before and after a 4-year intervention in two of the communities, the third serving as a control (C). Intervention was primarily by small mass media (low-intensity intervention, LII) or by small mass media plus interpersonal intervention to high-risk individuals (high-intensity intervention, HII). After allowing for change in C, significant net reductions in blood pressure, smoking, and risk score were obtained in LII and HII alike. Though the total cholesterol (TC) fell by 10-12%, there was no net reduction in favour of the intervention communities. However, LII and HII resulted in significant increases in high-density lipoprotein cholesterol (HDL-C) levels and HDL-C/TC ratios in comparison to C. Overall, the LII community fared almost as well as the HII community, and high-risk individuals did not show a greater change in risk factors than others. We conclude that community-based intervention works, and that in these particular communities a media-based health education programme was more cost-effective than one which adds a greater degree of interpersonal intervention.
Unfortunately, this data set does not include intervention arm, so we can only evaluate observational differences among study subjects, rather than contrast intervention arms. The data were found online as part of the All of Statistics textbook (which I highly recommend epi/biostats students download as a future reference; I use it a lot). I’ve attached it here. There is not great documentation of how the author got the data or what cleaning has been done, which I believe is why the variables in the data do not match what is in the publication.
First we’ll read in the data and examine the first 6 rows.
setwd("C:/Users/andre/Dropbox/250B redesign/R problemsets")
df<-read.delim("coris_heartdiseasedata.txt",header=T,sep=",")
head(df)
  row.names sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
1         1 160   12.00 5.73     23.11       1    49   25.30   97.20  52   1
2         2 144    0.01 4.41     28.61       0    55   28.87    2.06  63   1
3         3 118    0.08 3.48     32.28       1    52   29.14    3.81  46   0
4         4 170    7.50 6.41     38.03       1    51   31.99   24.26  58   1
5         5 134   13.60 3.50     27.78       1    60   25.99   57.34  49   1
6         6 132    6.20 6.47     36.21       1    62   30.77   14.14  45   0
Suppose we only care about a binary tobacco use = yes or tobacco use = no. Below, a binary tobacco exposure is created from the continuous tobacco usage variable. We’ll use this binary variable as an exposure or potential confounder later.
#Create binary "anytobacco variable"
df$anysmoke<-0
df$anysmoke[df$tobacco>0]<-1
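As a quick sanity check on the recode (not part of the original output, just a suggested check), you can tabulate the new variable against the condition used to create it; the off-diagonal cells should be zero.
#Check the recode: everyone with tobacco > 0 should have anysmoke = 1
table(df$anysmoke, df$tobacco > 0)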
Remember the 2x2 table format:
| . | Disease | No Disease | Total |
|---|---|---|---|
| Exposed | a | b | a+b |
| Unexposed | c | d | c+d |
| Total | a+c | b+d | a+b+c+d |
The table() command can be used with two variables to give us a 2x2 table.
mytable <- table(df$anysmoke,df$chd) # first variable (anysmoke) will be rows, second (chd) will be columns
mytable # print table
0 1
0 92 15
1 210 145
Note that because R sorts both the exposure and the outcome values from 0 to 1, the organization of this table is flipped from the layout we are used to: the unexposed/no-disease cell (d) is in the top left and the exposed/disease cell (a) is in the bottom right.
\[OR=\frac{145*92}{210*15}=4.235\]
Smoking multiplies the odds of coronary heart disease by 4.24 compared to a non-smoking reference group. We’ll come back to this number when we examine logistic regressions.
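The same odds ratio can be computed in R from the cells of the mytable object created above; a minimal sketch (remembering that, because of R’s 0/1 ordering, the exposed/diseased cell a sits in the bottom-right):
#Odds ratio from the 2x2 table: (a*d)/(b*c)
(mytable["1","1"] * mytable["0","0"]) / (mytable["1","0"] * mytable["0","1"])  #(145*92)/(210*15) = about 4.235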
Basic linear regressions fit a line through a cloud of points so as to minimize the average squared vertical distance between the points and the line. A more statistical introduction can be found here: http://onlinestatbook.com/2/regression/intro.html
For example, plot age on the X-axis and obesity on the Y-axis:
plot(df$age,df$obesity)
How does obesity change with age? You can maybe see that it, on average, increases a bit with age. Is there an easier way to visualize or describe this? A regression describes a cloud of points with an equation predicting a value of Y at each value of X. The model is fit so that the overall (summed) squared vertical distance of each point from the line is minimized.
Univariate (single predictor/exposure variable) regression models take the following form: \[outcome=intercept+slope\times exposure\] \[Y={\beta}_{0}+{\beta}_{1}X\]
Y is a continuous outcome, and X is often a continuous exposure, although it could be binary. The intercept is the value Y takes when X is 0, and the slope is the change in Y with each one-unit increase in X.
We will fit a regression to the cloud of points above and add the fitted line to the graph.
plot(df$age, df$obesity)
fit<-lm(df$obesity~df$age)
abline(fit,col="red",lwd=1.5)
fit is the R object holding the results of the regression model fitting. Let’s examine them:
summary(fit)
Call:
lm(formula = df$obesity ~ df$age)
Residuals:
Min 1Q Median 3Q Max
-11.6121 -2.6293 -0.4255 2.0844 22.7085
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.44082 0.58185 38.568 < 2e-16 ***
df$age 0.08416 0.01286 6.543 1.61e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.035 on 460 degrees of freedom
Multiple R-squared: 0.08513, Adjusted R-squared: 0.08315
F-statistic: 42.81 on 1 and 460 DF, p-value: 1.614e-10
The “Estimate” column is the \({\beta}\) estimate. So the fitted linear regression model for age’s association with obesity is: \[Obesity=22.44+0.084\times Age\] The “Pr(>|t|)” column is the p-value testing whether the estimate is different from the null value (0 for a linear regression). It combines the magnitude of the estimate with the size of the standard error through a Wald test.
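As an illustration (age 50 is just an arbitrary example, not from the original analysis), the fitted equation can be used to predict obesity directly from the stored coefficients:
coef(fit)  #intercept and slope
#Predicted obesity (BMI) at age 50: intercept + slope*50
coef(fit)[1] + coef(fit)[2]*50  #roughly 22.44 + 0.084*50 = 26.6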
Suppose our outcome is a death/no-death or disease/no-disease binary outcome. In this dataset, let’s fit a regression between age and risk of coronary heart disease. First, let’s fit a linear regression as above.
plot(df$age, df$chd, xlim=c(0,100))
fit2<-lm(df$chd~df$age)
abline(fit2,col="red",lwd=1.5)
summary(fit2)
Call:
lm(formula = df$chd ~ df$age)
Residuals:
Min 1Q Median 3Q Max
-0.6039 -0.3729 -0.1418 0.4690 0.9676
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.17434 0.06380 -2.733 0.00653 **
df$age 0.01216 0.00141 8.621 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4424 on 460 degrees of freedom
Multiple R-squared: 0.1391, Adjusted R-squared: 0.1372
F-statistic: 74.33 on 1 and 460 DF, p-value: < 2.2e-16
What’s the predicted cumulative risk of coronary heart disease for a 40 year old? \[-0.174+(0.012\times 40)=0.31\] That seems a plausible number. What’s the predicted cumulative risk of coronary heart disease for a 100 year old?
\[-0.174+(0.012\times 100)=1.026\] There is a predicted risk of coronary heart disease greater than 100%. Linear regressions, when applied to binary outcomes, can produce nonsensical predictions for low or high values of \(X\). Logistic regressions overcome this issue by modeling the log-odds of the outcome, which leads to predicted risks bounded between 0 and 1.
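Before moving on, those linear-model predictions can be reproduced in R from the unrounded coefficients of fit2 (so the values differ slightly from the rounded hand calculations above):
#Linear-model predicted risk at ages 40 and 100
coef(fit2)[1] + coef(fit2)[2]*c(40, 100)  #about 0.31 and 1.04 -- the second exceeds 1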
Logistic regressions change the standard linear regression equation to: \[log(\frac{Y}{1-Y})={\beta}_{0}+{\beta}_{1}X\]
\(Y\) will always lie between 0 and 1 (which is what we want, because risks, or the proportions of outcomes that are diseased, are always between 0 and 1). Use algebra to transform the above equation so the left side is \(Y=\) to see why. Notice below that, in a fitted logistic regression, the predicted risk of disease at each value of \(X\) stays between 0 and 1.
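Spelling out that algebra: exponentiating both sides and solving for \(Y\) gives \[\frac{Y}{1-Y}=e^{{\beta}_{0}+{\beta}_{1}X} \quad\Rightarrow\quad Y=\frac{e^{{\beta}_{0}+{\beta}_{1}X}}{1+e^{{\beta}_{0}+{\beta}_{1}X}}=\frac{1}{1+e^{-({\beta}_{0}+{\beta}_{1}X)}}\] Because \(e^{{\beta}_{0}+{\beta}_{1}X}\) is always positive, this fraction is always strictly between 0 and 1, no matter what value \(X\) takes.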
plot(df$age, df$chd, xlim=c(0,100))
#The below line is R code for fitting a logistic regression in R. The family="binomial" specifies that it is a logistic regression, with binomial outcome data.
logistic.fit<-glm(chd~age, family="binomial", data=df)
curve(predict(logistic.fit,data.frame(age=x),type="resp"),add=TRUE, col="red")
This is a plot of the predicted risk of the outcome (transformed from the log-odds). Notice that the tails approach 0 and 1 but never cross them. We don’t have the same issue of nonsensical predictions that we saw above with the linear regression. Let’s examine the R output of the model.
summary(logistic.fit)
Call:
glm(formula = chd ~ age, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4321 -0.9215 -0.5392 1.0952 2.2433
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.521710 0.416031 -8.465 < 2e-16 ***
age 0.064108 0.008532 7.513 5.76e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 525.56 on 460 degrees of freedom
AIC: 529.56
Number of Fisher Scoring iterations: 4
\[{\beta}_{0}= -3.522\] \[{\beta}_{1}= 0.064\] \[log(\frac{Y}{1-Y})={-3.522}+0.064\times X\] \[Odds=e^{(-3.522+0.064\times X)}\]
* What’s the odds of CHD for a 40 year old? \[e^{-3.522+0.064\times 40}=0.382128\]
* What’s the odds of CHD for a 41 year old? \[e^{-3.522+0.064\times 41}=0.407384\]
* What’s the odds ratio between a 41 year old and a 40 year old? \[\frac{0.407384}{0.382128}=1.066092\]
* What’s the exponentiated coefficient of age? \[e^{{\beta}_{1}}=e^{0.064}=1.066092\]
* Why are these the same? \[{OR}_{41v40}=\frac{e^{{\beta}_{0}+{\beta}_{1}\times 41}}{e^{{\beta}_{0}+{\beta}_{1}\times 40}}=\frac{e^{{\beta}_{0}}}{e^{{\beta}_{0}}}\times \frac{e^{{\beta}_{1}\times 41}}{e^{{\beta}_{1}\times 40}}=\frac{e^{{\beta}_{1}\times 40}\times e^{{\beta}_{1}}}{e^{{\beta}_{1}\times 40}}=e^{{\beta}_{1}}=1.066\]
So the exponentiated coefficient is the odds ratio between X and X+1, a one-unit increase in X. If X is a binary variable like “smoking”, the OR between smoking and non-smoking is just the exponentiated coefficient from the model.
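In R, a common reporting pattern is to exponentiate both the coefficients and their confidence limits to present ORs with 95% confidence intervals; a small sketch using the age model fit above (confint.default() gives Wald-based intervals):
#Odds ratios with Wald 95% confidence intervals for the age model
exp(cbind(OR = coef(logistic.fit), confint.default(logistic.fit)))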
Let’s examine this in a model with smoking as the exposure and CHD as the outcome.
logistic.fit.smoke<-glm(chd~anysmoke, family="binomial", data=df)
summary(logistic.fit.smoke)
Call:
glm(formula = chd ~ anysmoke, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0247 -1.0247 -0.5496 1.3382 1.9823
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8137 0.2785 -6.514 7.34e-11 ***
anysmoke 1.4434 0.2987 4.833 1.35e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 566.90 on 460 degrees of freedom
AIC: 570.9
Number of Fisher Scoring iterations: 4
#Exponentiate the coefficients to get the Odds ratio
exp(logistic.fit.smoke$coefficients)
(Intercept) anysmoke
0.1630435 4.2349206
In R and Stata it is very easy to add additional exposure/independent variables to the model, making a multivariate regression model.
logfit.multi<-glm(chd~age+anysmoke, family="binomial", data=df)
summary(logfit.multi)
Call:
glm(formula = chd ~ age + anysmoke, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4540 -0.9139 -0.4894 1.0615 2.4835
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.031792 0.480847 -8.385 < 2e-16 ***
age 0.058514 0.008894 6.579 4.75e-11 ***
anysmoke 0.916926 0.319456 2.870 0.0041 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 516.50 on 459 degrees of freedom
AIC: 522.5
Number of Fisher Scoring iterations: 4
#Exponentiate the coefficients to calculate the ORs
exp(logfit.multi$coefficients)
(Intercept) age anysmoke
0.0177425 1.0602600 2.5015892
OR of smokers compared to non-smokers: 2.5016
OR for each additional year of age: 1.0603
However, this new model is slightly harder to interpret. The odds ratios of both age and smoking are smaller than in the models with just age and just smoking. Why might this be? Let’s examine this new model graphically.
plot(df$age, df$chd, xlim=c(0,100))
curve(plogis(-4.031792 + 0.058514*x), 0, 110, add=TRUE, lwd=1, lty=1)
curve(plogis(-4.031792 +0.916926+ 0.058514*x), 0,110, add=TRUE, lwd=3, lty=3)
legend('bottomright', c('Nonsmokers','Smokers'), lty=c(1, 3), lwd=c(1,3), bty='n', cex=0.8)
Looking at the graph, the multivariate regression is equivalent to fitting two regression models, with the same slope for age, stratified by smoking status. The ratio between the odds predicted from each smoking-stratified line is therefore the odds ratio between smokers and non-smokers, which is constant regardless of age. Therefore there is no multiplicative interaction, and, since logistic regressions predict on the multiplicative scale, there is no effect modification in this model.
Note: It is certain that there is no effect modification in this multivariate regression model, because the model equation contains no interaction term; the same odds ratio for smoking is forced at every age. This doesn’t mean that there isn’t actual effect modification by smoking occurring in the data. We haven’t checked yet.
Why are the odds ratios calculated from the multivariate logistic regression model different from those calculated from the univariate (single exposure) models? CONFOUNDING!!! If there were zero confounding by smoking of the age-CHD relationship, adding smoking to the model would not change the coefficient of age. This is conceptually identical to checking for confounding by comparing stratified measures of association to the crude measure of association. So a common way of checking for confounding is to add a potential confounder to a regression model and examine whether, and by how much, the coefficient of the exposure of interest changes. Often this is done with many potential confounders, and the multivariate model with confounders included is called the adjusted model. This can be a much more complicated process, fraught with investigator decisions, than presented here; it is covered much more in biostatistics courses. But these adjusted models are commonly found in the published public health literature.
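For this dataset, a small sketch of that check, comparing the age coefficient in the crude (age-only) model with the adjusted (age + smoking) model fit above; the "more than about 10% change" cutoff mentioned in the comment is one common rule of thumb, not a rule used in the original analysis:
#Crude vs. adjusted log-odds coefficient for age
coef(logistic.fit)["age"]   #crude: about 0.064
coef(logfit.multi)["age"]   #adjusted for smoking: about 0.059
#Relative change (%) in the age coefficient after adjustment; changes of more than
#about 10% are often flagged as meaningful confounding
100*(coef(logistic.fit)["age"] - coef(logfit.multi)["age"]) / coef(logfit.multi)["age"]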
What is causing the confounding in this situation? We need to examine the relationship between age and smoking. We can do this graphically, or analytically by fitting a new logistic regression model with age as the exposure and smoking as the outcome, or by running a t-test between smoking groups to test whether mean age differs.
Graphically:
library(popbio)
logi.hist.plot(df$age,df$anysmoke,boxp=FALSE,type="hist",col="gray")
Older people are far likelier to smoke than younger study subjects.
Analytically:
smoke.age<-glm(anysmoke~age, family="binomial", data=df)
summary(smoke.age)
Call:
glm(formula = anysmoke ~ age, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3877 0.3451 0.4707 0.7084 1.3133
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.348952 0.332969 -4.051 5.09e-05 ***
age 0.064688 0.008538 7.577 3.54e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 500.07 on 461 degrees of freedom
Residual deviance: 433.29 on 460 degrees of freedom
AIC: 437.29
Number of Fisher Scoring iterations: 4
#Exponentiate the coefficients to calculate the ORs
exp(smoke.age$coefficients)
(Intercept) age
0.259512 1.066826
The graphical findings are supported analytically. Each one-year increase in age multiplies the odds of smoking by 1.067.
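The t-test mentioned above is a third option; a minimal sketch using the same variables (not run in the original analysis):
#Is mean age different between smokers and non-smokers?
t.test(age ~ anysmoke, data = df)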
The effect of age on CHD odds was artificially inflated in this data sample by confounding from smoking status. Alternatively, the effect of smoking on CHD odds was artificially inflated by confounding by age. Older people have higher CHD risk, and older people are more likely to smoke, which also increases CHD risk. Without accounting for one exposure, the other exposure is going to seem like it has a larger effect on CHD odds than it truly does.
But is this just confounding? Possibly. But it could also be that, not only are older people more likely to smoke, but greater age increases the negative effects of smoking, and smoking increases the negative effects of aging on heart health. That would be effect modification.
To incorporate potential effect modification in a regression equation, add a statistical interaction term, a multiplication of two exposures included as its own term in the model.
\[log(\frac{Y}{1-Y})={\beta}_{0}+{\beta}_{1}{X}_{1}+{\beta}_{2}{X}_{2}+{\beta}_{3}{X}_{1}{X}_{2}\] In this dataset: \[log({Odds}_{CHD})={\beta}_{0}+{\beta}_{1}\times Age+{\beta}_{2}\times Smoking+{\beta}_{3}\times Smoking\times Age\]
Below, the model is fitted in R. In the formula, the age*anysmoke term expands to age + anysmoke + (age × anysmoke), so both main effects and the interaction are included.
logfit.multi.int<-glm(chd~age*anysmoke, family="binomial", data=df)
summary(logfit.multi.int)
Call:
glm(formula = chd ~ age * anysmoke, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4358 -0.9232 -0.5006 1.0718 2.5838
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.46666 0.94923 -4.706 2.53e-06 ***
age 0.06853 0.02041 3.358 0.000785 ***
anysmoke 1.47319 1.06750 1.380 0.167574
age:anysmoke -0.01254 0.02269 -0.553 0.580382
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 516.19 on 458 degrees of freedom
AIC: 524.19
Number of Fisher Scoring iterations: 5
#Exponentiate the coefficients to calculate the ORs
exp(logfit.multi.int$coefficients)
(Intercept) age anysmoke age:anysmoke
0.01148565 1.07092750 4.36314431 0.98753739
The coefficient of age:anysmoke is negative. This means any effect modification between age and smoking is antagonistic: the odds ratio of CHD for smokers of a given age, compared to non-smokers at the reference age, is less than what one would expect by simply combining the individual odds ratio for smoking with the odds ratio for being that age.
Calculations of the odds ratios between smokers and non-smokers at different ages, based on the fitted model equation, support the assertion that there is effect modification.
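A brief sketch of those calculations, using the stored coefficients of logfit.multi.int (ages 20 and 60 are arbitrary example values):
b <- coef(logfit.multi.int)
#OR for smokers vs. non-smokers of the SAME age: exp(beta_anysmoke + beta_interaction*age)
exp(b["anysmoke"] + b["age:anysmoke"]*20)  #at age 20: about 3.4
exp(b["anysmoke"] + b["age:anysmoke"]*60)  #at age 60: about 2.1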
Smokers have a higher odds ratio of CHD compared to non-smokers at age 20 than at age 60. Another way of framing this is that smoking increases the odds of CHD compared to non-smoking, but this effect decreases as age increases. The magnitude of the effect of smoking is modified by age.
What’s the odds ratio between a smoking 60-year-old and a non-smoking 20-year-old? \[OR=\frac{e^{-4.46666+0.06853\times 60+1.47319-0.01254\times 60}}{e^{-4.46666+0.06853\times 20}}=\frac{e^{0.06853\times 60+1.47319-0.01254\times 60}}{e^{0.06853\times 20}}\] \[=e^{0.06853\times 40+1.47319-0.01254\times 60}=31.88\]
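The same number can be obtained from the model’s predicted log-odds; a sketch using predict() on the fitted interaction model:
#Log-odds for a non-smoking 20-year-old and a smoking 60-year-old
lp <- predict(logfit.multi.int,
              newdata = data.frame(age = c(20, 60), anysmoke = c(0, 1)),
              type = "link")
exp(lp[2] - lp[1])  #odds ratio, roughly 31.9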
The p-value of the interaction term can be used to assess the statistical evidence for effect modification. Here it is 0.58, well above the common 0.05 significance threshold, so the evidence for this interaction is weak and many analysts would leave it out of the model. It is common to include potential confounders with a univariate regression p-value < 0.2 in an adjusted model, and to include tested interaction terms with p-values < 0.2. But these are complicated decisions with no set template, so they are often made differently from investigator to investigator. This is another reason why setting an analysis plan, and pre-specifying decisions like this, before starting the analysis is important.
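Another common way to assess the interaction (not shown in the original analysis, but easy with the two models already fit) is a likelihood ratio test comparing the models with and without the interaction term:
#Likelihood ratio test: does adding the age x smoking interaction improve model fit?
anova(logfit.multi, logfit.multi.int, test = "Chisq")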
Effect modification is often clearer when examined graphically. Below, now that the interaction term has been added, the predicted risk curves for smokers and non-smokers are no longer a constant odds ratio apart: they converge as age increases, reflecting the shrinking smoking-associated increase in CHD odds among older subjects (in this fit the curves would not actually cross until an implausibly high age). This may clash with your intuition about the risks of smoking.
plot(df$age, df$chd, xlim=c(0,100))
curve(plogis(-4.46666 + 0.06853*x), 0, 110, add=TRUE, lwd=1, lty=1)
curve(plogis(-4.46666 + 0.06853*x + 1.47319 - 0.01254*x), 0, 110, add=TRUE, lwd=3, lty=3)
legend('bottomright', c('Nonsmokers','Smokers'), lty=c(1, 3), lwd=c(1,3), bty='n', cex=0.8)