library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
#library(lmtest)
setwd("E:/Biostat and Study Design/204/Lectures/Data")
Diabetes.df <- openxlsx::read.xlsx('diabetes.xlsx')
In a previous lecture, we learned about the concept and interpretation of simple linear regression. To summarize, the goal of simple linear regression is to identify the line that best fits the data. Simple linear regression has the following formula:
\[\hat{y} = {b}_0 + b_1 x\]
where \(\hat{y}\) is the continuous dependent variable, \(b_0\) is the y-intercept, \(b_1\) is the slope/coefficient, and \(x\) is the independent variable. The regression line is chosen to satisfy the least-squares property: a straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.
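To see the least-squares property in action, here is a minimal sketch using simulated data (all object names are illustrative): the residual sum of squares (RSS) of the fitted line is smaller than that of any competing line.
set.seed(42)
x <- 1:20
y <- 3 + 2*x + rnorm(20, sd = 2) #simulate points scattered around the line y = 3 + 2x
fit <- lm(y ~ x) #least-squares fit
sum(resid(fit)^2) #RSS of the fitted line
sum((y - (2 + 2.1*x))^2) #RSS of an arbitrary competing line is larger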
In practice, researchers are often interested in investigating the relationship between one continuous dependent variable \((y)\) and multiple independent variables \((x_1,x_2,...,x_n)\). This type of problem is the subject matter of multiple-regression analysis. The general formula for multiple regression is:
\[y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_p x_p + \epsilon\]
where \(y\) is the dependent variable, \(b_0\) is the y-intercept, \(b_p\) is the change in \(y\) for every unit change of the independent variable \(x_p\), and \(\epsilon\) is an error term that is normally distributed with mean 0 and variance \(\sigma^2\).
The independent variables can be of any type (continuous, discrete, or categorical).
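For instance, a categorical predictor can be included by wrapping it in factor(), which R dummy-codes automatically. As a sketch, assuming the dataset contains the binary Outcome column (diabetes status):
lm(Glucose ~ Age + factor(Outcome), data = Diabetes.df) #Outcome enters the model as a dummy-coded categorical predictor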
Linear regression models must satisfy the following assumptions:
Linearity: There exists a linear relationship between each independent variable and the dependent variable.
Independence: The observations are independent of one another.
Homoscedasticity: The variance of the residuals is constant at every point in the linear model.
Normality: The residuals of the model are approximately normally distributed.
A confounder is a variable that correlates with both the dependent variable and the independent variable. A confounder's presence distorts the relationship between the variables being studied; consequently, the results do not reflect the true relationship between the variables of interest. Due to this undesirable effect, investigators attempt to account for the effect of confounding variables through so-called ‘adjustment for confounding variables’.
The definition and the variable type of exposure, outcome, and confounder vary according to the research question and the statistical model.
In order for a variable to be considered a confounder, it needs to satisfy the following assumptions:
It is associated with the exposure (independent variable).
It is independently associated with the outcome (dependent variable).
It does not lie on the causal pathway between the exposure and the outcome.
Example: Determine if there is a relationship between age and post-prandial glucose level. For this analysis, we are going to use the Diabetes Dataset. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.
We can express the hypothesized association between age and glucose level using the following formula:
\[y_{glucose} = b_0 + b_{age} x_{age} + \epsilon\]
\({H_0}: \beta_{age} = 0\)
\({H_1}: \beta_{age} \neq 0\)
We start by confirming a linear relationship between glucose level and age.
Diabetes.df %>% ggplot(aes(x = Age, y = Glucose)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_light() #scatter plot with a fitted regression line
## `geom_smooth()` using formula = 'y ~ x'
We proceed to build a simple linear regression model.
glucose.model <- lm(Glucose~Age,data = Diabetes.df) #build linear regression model
summary(glucose.model) #generate summary
##
## Call:
## lm(formula = Glucose ~ Age, data = Diabetes.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126.453 -20.849 -3.058 18.304 86.159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.08016 3.34095 29.06 < 2e-16 ***
## Age 0.71642 0.09476 7.56 1.15e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.86 on 766 degrees of freedom
## Multiple R-squared: 0.06944, Adjusted R-squared: 0.06822
## F-statistic: 57.16 on 1 and 766 DF, p-value: 1.15e-13
confint(glucose.model) #generate 95% confidence intervals
## 2.5 % 97.5 %
## (Intercept) 90.5216601 103.6386585
## Age 0.5304001 0.9024361
Interpretation: Since the p-value is ≤ 0.05, we reject the null hypothesis and conclude that age is a significant predictor of glucose level. For every one-year increase in age, the mean glucose level increases by 0.72 mg/dL (95% CI 0.53-0.90).
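Once fitted, the model can also be used for prediction. As a sketch, for a hypothetical 50-year-old patient:
predict(glucose.model, newdata = data.frame(Age = 50), interval = "confidence") #predicted mean glucose with a 95% CI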
We could potentially improve the performance of our model by adjusting for potential confounders. Identifying potential confounders depends on your research question. For this research question, we will consider body mass index (BMI) as a confounder, after confirming it meets the assumptions noted previously.
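As a quick sketch of that confirmation, we can check that BMI is associated with both the exposure (age) and the outcome (glucose level):
cor.test(Diabetes.df$Age, Diabetes.df$BMI) #confounder vs. exposure
cor.test(Diabetes.df$BMI, Diabetes.df$Glucose) #confounder vs. outcome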
\[y_{glucose} = b_0 + b_{age} x_{age} + b_{BMI} x_{BMI} + \epsilon\]
We can adjust our model for potential confounders by including the variables in the model equation.
glucose.model.confounder <- lm(Glucose~Age+BMI,data = Diabetes.df) #multivariable linear regression model
We confirm the assumptions of linear regression by examining diagnostic plots.
par(mfrow = c(1,2)) #set up plot matrix
plot(glucose.model.confounder,which=c(1,2)) #diagnostic plots 1 (Residuals vs Fitted) and 2 (Normal Q-Q)
The left plot (Residuals vs. Fitted) helps us assess whether a linear fit is appropriate; we look for signs of curvature in the red line. The right plot (Normal Q-Q) helps us assess whether the distribution of the residuals is normal; we look for the points to fall along the diagonal reference line.
par(mfrow = c(1,1)) #reset plot layout to a single panel
plot(glucose.model.confounder,which=c(3)) #diagnostic plot 3 (Scale-Location)
This figure (Scale-Location) shows whether the residuals are spread equally along the range of fitted values, which is how we check the assumption of equal variance (homoscedasticity); we look for the red line to be roughly horizontal (parallel to the x-axis). Evaluating these figures leads to the conclusion that our model satisfies the linear regression assumptions.
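If a visual check feels ambiguous, formal tests can supplement the plots. A sketch, using the lmtest package commented out at the top of this document (it must be installed):
shapiro.test(resid(glucose.model.confounder)) #Shapiro-Wilk test; H0: residuals are normally distributed
lmtest::bptest(glucose.model.confounder) #Breusch-Pagan test; H0: residual variance is constant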
summary(glucose.model.confounder) #Generate summary of the model
##
## Call:
## lm(formula = Glucose ~ Age + BMI, data = Diabetes.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.310 -19.102 -1.977 18.310 83.766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.29517 5.40191 13.013 < 2e-16 ***
## Age 0.69555 0.09257 7.514 1.61e-13 ***
## BMI 0.85891 0.13808 6.220 8.16e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.13 on 765 degrees of freedom
## Multiple R-squared: 0.1142, Adjusted R-squared: 0.1119
## F-statistic: 49.33 on 2 and 765 DF, p-value: < 2.2e-16
confint(glucose.model.confounder) #Generate confidence intervals
## 2.5 % 97.5 %
## (Intercept) 59.6908477 80.8994969
## Age 0.5138259 0.8772727
## BMI 0.5878444 1.1299717
Interpretation: Since the p-value is ≤ 0.05, we reject the null hypothesis and conclude that age remains a significant predictor of glucose level after adjusting for BMI. Holding BMI constant, for every one-year increase in age, the mean glucose level increases by 0.70 mg/dL (95% CI 0.51-0.88).
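As a supplementary check not shown above, the unadjusted and adjusted models can be compared with a partial F-test, since they are nested:
anova(glucose.model, glucose.model.confounder) #tests whether adding BMI significantly improves the fit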
\(R^2\), or the coefficient of determination, is a value between 0 and 1 that measures how well our regression line fits our data. \(R^2\) can be interpreted as the percent of variance in the dependent variable that can be explained by the model. Adjusted \(R^2\) also indicates how well the regression line fits the data, but it adjusts for the number of terms in the model: if you add useless variables to a model, adjusted \(R^2\) will decrease, and if you add useful variables, adjusted \(R^2\) will increase. Adjusted \(R^2\) will always be less than or equal to \(R^2\). The adjusted \(R^2\) for our model is 0.112, meaning that 11.2% of the variation in glucose level can be explained by our model.
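To illustrate, here is a sketch that extracts both statistics from the summary object and adds a pure-noise predictor (generated here purely for illustration) to show the adjusted \(R^2\) penalty:
summary(glucose.model.confounder)$r.squared #R-squared
summary(glucose.model.confounder)$adj.r.squared #adjusted R-squared
set.seed(1)
Diabetes.df$noise <- rnorm(nrow(Diabetes.df)) #a useless predictor
summary(lm(Glucose ~ Age + BMI + noise, data = Diabetes.df))$adj.r.squared #typically decreases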