Learning Objectives

Sources

Rpubs

Load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
#library(lmtest)

Load data

setwd("E:/Biostat and Study Design/204/Lectures/Data")
Diabetes.df <- openxlsx::read.xlsx('diabetes.xlsx')

Simple Linear Regression

In a previous lecture, we learned about the concept and interpretation of simple linear regression. To summarize, the goal of simple linear regression is to identify the line that best fits the data. Simple linear regression has the following formula:

\[\hat{y} = {b}_0 + b_1 x\]

where \(\hat{y}\) is the predicted value of the continuous dependent variable, \(b_0\) is the y-intercept, \(b_1\) is the slope (the regression coefficient), and \(x\) is the independent variable. The following are the requirements for simple linear regression:

A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.


Source : https://dustinstansbury.github.io/theclevermachine/cutting-your-losses
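
The least-squares property can be checked numerically. The sketch below is only an illustration: it uses R's built-in `cars` dataset (not the lecture data), and the helper `ssr()` is a name introduced here. It computes the slope and intercept from the closed-form least-squares formulas, compares them with `lm()`, and shows that nudging the slope away from the estimate makes the sum of squared residuals larger.

```r
# Illustrative data: the built-in `cars` dataset (speed vs. stopping distance)
x <- cars$speed
y <- cars$dist

# Closed-form least-squares estimates for a single predictor
b1 <- cov(x, y) / var(x)          # slope
b0 <- mean(y) - b1 * mean(x)      # intercept

# Sum of squared residuals for any candidate line (hypothetical helper)
ssr <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

fit <- lm(dist ~ speed, data = cars)

# The closed-form estimates match the coefficients reported by lm()
print(c(b0 = b0, b1 = b1))
print(coef(fit))

# Nudging the slope in either direction increases the sum of squared residuals
print(c(fitted  = ssr(b0, b1),
        steeper = ssr(b0, b1 + 0.1),
        flatter = ssr(b0, b1 - 0.1)))
```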

Multiple Linear Regression

In practice, researchers are often interested in investigating the relationship between one continuous dependent variable (\(y\)) and multiple independent variables \((x_1,x_2,...,x_p)\). This type of problem is the subject matter of multiple regression analysis. The general formula for multiple linear regression is:

\[\hat{y} = {b}_0 + b_1 x_1+b_2 x_2+...+b_p x_p+\epsilon\]

where \(\hat{y}\) is the dependent variable, \(b_0\) is the y-intercept, \(b_p\) is the change in \(y\) for every unit change of the independent variable \(x_p\) when the other independent variables are held constant, and \(\epsilon\) is an error term that is normally distributed with mean 0 and variance \(\sigma^2\).
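
To make the meaning of a coefficient concrete, here is a minimal sketch using R's built-in `mtcars` dataset (an assumption for illustration, not the lecture data). The two rows in `new_cars` are hypothetical: they differ by one unit in `wt` and share the same `hp`, so the difference between their predictions should equal the `wt` coefficient.

```r
# Illustrative data: built-in `mtcars`; mpg modeled from weight (wt) and horsepower (hp)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ by one unit of wt and share the same hp
new_cars <- data.frame(wt = c(3, 4), hp = c(110, 110))
pred <- predict(fit, newdata = new_cars)

# The difference between the two predictions equals the wt coefficient
print(unname(pred[2] - pred[1]))
print(unname(coef(fit)["wt"]))
```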

The independent variables can be of any type:

Linear regression models must satisfy the following assumptions:

Controlling for Confounding Variables

A confounder is a variable that correlates with both the dependent variable and the independent variable. The presence of a confounder distorts the relationship between the variables being studied. Consequently, the results do not reflect the true relationship between the variables of interest. Because of this undesirable effect, investigators attempt to account for confounding variables through so-called ‘adjustment for confounding variables’.

The definition and the variable type of exposure, outcome, and confounder vary according to the research question and the statistical model.
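
The distortion caused by a confounder can be seen in a small simulation. This is a sketch on simulated data, not the lecture dataset; the variable names, the seed, and the effect sizes are all assumptions chosen for illustration. The outcome is generated from the confounder only, so the true effect of the exposure is zero.

```r
set.seed(1)                               # arbitrary seed so the sketch is reproducible
n <- 1000

confounder <- rnorm(n)                    # drives both exposure and outcome
exposure   <- confounder + rnorm(n)       # exposure depends on the confounder
outcome    <- 2 * confounder + rnorm(n)   # outcome depends only on the confounder

crude    <- lm(outcome ~ exposure)                # confounder ignored
adjusted <- lm(outcome ~ exposure + confounder)   # confounder included

# The crude slope for exposure is far from 0; the adjusted slope is close to 0
print(coef(crude)["exposure"])
print(coef(adjusted)["exposure"])
```

In this setup the crude model attributes the confounder's effect to the exposure, while the adjusted model recovers a slope near the true value of zero.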

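In order for a variable to be considered a confounder, it needs to satisfy the following assumptions:

  • It is associated with the independent variable (exposure)
  • It is associated with the dependent variable (outcome)
  • It is not on the causal pathway between the exposure and the outcome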

Multiple Linear Regression in R

Example: Determine if there is a relationship between age and post-prandial glucose level. For this analysis, we are going to use the Diabetes Dataset. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.

We can express the hypothesized association between age and glucose level using the following formula:

\[\hat{y}_{glucose} = {b}_0 + b_{age} x_{age}+\epsilon\]

\({H_0}: \beta_{age} = 0\)

\({H_1}: \beta_{age} \neq\ 0\)

We start by confirming a linear relationship between glucose level and age.

Diabetes.df %>% ggplot(aes(x=Age, y=Glucose)) + 
    geom_point()+
    geom_smooth(method=lm) +
    theme_light() #scatter plot + line plot
## `geom_smooth()` using formula = 'y ~ x'

We proceed to build a simple linear regression model.

glucose.model <- lm(Glucose~Age,data = Diabetes.df) #build linear regression model
summary(glucose.model) #generate summary
## 
## Call:
## lm(formula = Glucose ~ Age, data = Diabetes.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -126.453  -20.849   -3.058   18.304   86.159 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.08016    3.34095   29.06  < 2e-16 ***
## Age          0.71642    0.09476    7.56 1.15e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.86 on 766 degrees of freedom
## Multiple R-squared:  0.06944,    Adjusted R-squared:  0.06822 
## F-statistic: 57.16 on 1 and 766 DF,  p-value: 1.15e-13
confint(glucose.model) #generate 95% CI intervals
##                  2.5 %      97.5 %
## (Intercept) 90.5216601 103.6386585
## Age          0.5304001   0.9024361

Interpretation: Since the p-value for age (1.15e-13) is below 0.05, we reject the null hypothesis and conclude that age is a predictor of glucose level. For every one-year increase in age, the mean glucose level increases by 0.72 mg/dL (95% CI 0.53-0.90).
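
As a check on where these numbers come from, the sketch below rebuilds the 95% confidence interval for the age slope from the estimate, standard error, and residual degrees of freedom printed in the summary() output above. It also computes the fitted mean glucose for a hypothetical 50-year-old; that age is chosen purely for illustration.

```r
# Values copied from the summary() output above
b0    <- 97.08016   # intercept
b_age <- 0.71642    # slope for Age
se    <- 0.09476    # standard error of the Age slope
df    <- 766        # residual degrees of freedom

# 95% CI: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df)
ci_age <- b_age + c(-1, 1) * t_crit * se
print(round(ci_age, 2))

# Fitted mean glucose for a hypothetical 50-year-old
pred_50 <- b0 + b_age * 50
print(pred_50)
```

The interval agrees with the confint() result, which uses the same t-based calculation on the unrounded estimates.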

Adjusting for Confounders

We could potentially improve our model by adjusting for potential confounders. Which variables count as potential confounders depends on your research question. For this research question, we will consider body mass index (BMI) as a confounder after confirming that it meets the assumptions noted previously.

  • BMI is associated with age
  • BMI has been suggested as a risk factor for elevated blood glucose
  • It is unlikely that BMI is on the causal pathway between age and glucose level.

\[\hat{y}_{glucose} = {b}_0 + b_{age} x_{age}+b_{BMI} x_{BMI}+\epsilon\]

We can adjust our model for potential confounders by including the variables in the model equation.

glucose.model.confounder  <- lm(Glucose~Age+BMI,data = Diabetes.df) #multivariable linear regression model

We check the assumptions of linear regression by examining diagnostic plots.

par(mfrow = c(1,2)) #set up plot matrix
plot(glucose.model.confounder,which=c(1,2)) #plot figures 1 and 2

The left plot helps us assess whether a linear fit is appropriate. We evaluate the plot by looking for signs of curvature of the red line. The right plot helps to assess if the distribution of residuals is normal.

par(mfrow = c(1,1)) #reset to a single plot per figure
plot(glucose.model.confounder,which=c(3)) #plot figure 3

The figure shows if residuals are spread equally along the ranges of fitted values. This is how you can check the assumption of equal variance (homoscedasticity). We look for the red line to be parallel to the x-axis. Evaluating the figures leads to the conclusion that our model satisfies the linear regression assumptions.
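
The quantities drawn in these diagnostic plots can also be extracted directly, which helps when a plot is hard to read. The sketch below uses a model on the built-in `mtcars` data as a stand-in (an assumption for illustration); the same calls should work on any `lm` object.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model on built-in data

fitted_vals <- fitted(fit)                # x-axis of plots 1 and 3
raw_resid   <- resid(fit)                 # y-axis of plot 1 (residuals vs fitted)
std_resid   <- rstandard(fit)             # standardized residuals used in plots 2 and 3
scale_loc   <- sqrt(abs(std_resid))       # y-axis of plot 3 (scale-location)

print(head(data.frame(fitted_vals, raw_resid, std_resid, scale_loc)))
```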

summary(glucose.model.confounder) #Generate summary of the model
## 
## Call:
## lm(formula = Glucose ~ Age + BMI, data = Diabetes.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -132.310  -19.102   -1.977   18.310   83.766 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 70.29517    5.40191  13.013  < 2e-16 ***
## Age          0.69555    0.09257   7.514 1.61e-13 ***
## BMI          0.85891    0.13808   6.220 8.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.13 on 765 degrees of freedom
## Multiple R-squared:  0.1142, Adjusted R-squared:  0.1119 
## F-statistic: 49.33 on 2 and 765 DF,  p-value: < 2.2e-16
confint(glucose.model.confounder) #Generate confidence intervals
##                  2.5 %     97.5 %
## (Intercept) 59.6908477 80.8994969
## Age          0.5138259  0.8772727
## BMI          0.5878444  1.1299717

Interpretation: Since the p-value for age (1.61e-13) is below 0.05, we reject the null hypothesis and conclude that age is a predictor of glucose level. Holding BMI constant, for every one-year increase in age, the mean glucose level increases by 0.70 mg/dL (95% CI 0.51-0.88).
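
To see how much the adjustment moved the age estimate, we can compare the age slope from the two summary() outputs. This is a simple arithmetic sketch; the variable names are introduced here for illustration.

```r
b_age_crude    <- 0.71642   # Age slope from the model with Age only
b_age_adjusted <- 0.69555   # Age slope from the model with Age and BMI

# Relative change in the Age slope after adding BMI, in percent
pct_change <- 100 * (b_age_crude - b_age_adjusted) / b_age_crude
print(round(pct_change, 1))
```

The age slope changes by roughly 3% after BMI is added to the model.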

\(R^2\), or the coefficient of determination, is a value between 0 and 1 that measures how well our regression line fits our data. \(R^2\) can be interpreted as the percent of variance in our dependent variable that can be explained by our model. Adjusted \(R^2\) also indicates how well our regression line fits our data, but it is adjusted for the number of terms in the model. If you add variables that contribute little to the model, adjusted \(R^2\) will decrease. If you add useful variables, adjusted \(R^2\) will increase. Adjusted \(R^2\) will always be less than or equal to \(R^2\). The adjusted \(R^2\) for our model is 0.112. This means that 11.2% of the variation in glucose level can be explained by our model.
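
The adjusted \(R^2\) can be reproduced from the values in the summary() output using the standard formula \(1 - (1 - R^2)\frac{n-1}{n-p-1}\), where \(n\) is the number of observations and \(p\) the number of predictors. In the sketch below, \(n\) is inferred from the residual degrees of freedom (765) plus the three estimated coefficients.

```r
r2 <- 0.1142        # Multiple R-squared from the summary() output
p  <- 2             # predictors: Age and BMI
n  <- 765 + p + 1   # residual df + number of estimated coefficients = 768

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```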