library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
#library(lmtest)
setwd("E:/Biostat and Study Design/204/Lectures/Data")
Diabetes.df <- openxlsx::read.xlsx('diabetes.xlsx')
In a previous lecture, we learned about the concept and interpretation of simple linear regression. To summarize, the goal of simple linear regression is to identify the line that best fits the data. Simple linear regression has the following formula:
\[\hat{y} = {b}_0 + b_1 x\]
where \(\hat{y}\) is the continuous dependent variable, \(b_0\) is the y-intercept, \(b_1\) is the slope/coefficient, and \(x\) is the independent variable. The regression line is chosen to satisfy the least-squares property: a straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.
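To see the least-squares property in action, here is a minimal sketch using simulated data (all object names are illustrative): the residual sum of squares (RSS) of the fitted line is smaller than that of any competing line.
set.seed(42)
x <- 1:20
y <- 3 + 2*x + rnorm(20, sd = 2) #simulate points scattered around the line y = 3 + 2x
fit <- lm(y ~ x) #least-squares fit
sum(resid(fit)^2) #RSS of the fitted line
sum((y - (2 + 2.1*x))^2) #RSS of an arbitrary competing line is larger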
In practice, researchers are often interested in investigating the relationship between one continuous dependent variable \((y)\) and multiple independent variables \((x_1,x_2,...,x_n)\). This type of problem is the subject matter of multiple-regression analysis. The general formula for multiple regression is:
\[y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_p x_p + \epsilon\]
where \(y\) is the dependent variable, \(b_0\) is the y-intercept, \(b_p\) is the change in \(y\) for every unit change of the independent variable \(x_p\), and \(\epsilon\) is an error term that is normally distributed with mean 0 and variance \(\sigma^2\).
The independent variables can be of any type (continuous, discrete, or categorical).
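For instance, a categorical predictor can be included by wrapping it in factor(), which R dummy-codes automatically. As a sketch, assuming the dataset contains the binary Outcome column (diabetes status):
lm(Glucose ~ Age + factor(Outcome), data = Diabetes.df) #Outcome enters the model as a dummy-coded categorical predictor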
Linear regression models must satisfy the following assumptions:
Linearity: There exists a linear relationship between each independent variable and the dependent variable.
Independence: The observations are independent of one another.
Homoscedasticity: The variance of the residuals is constant at every point in the linear model.
Normality: The residuals of the model are approximately normally distributed.
A confounder is a variable that correlates with both the dependent variable and the independent variable. A confounder's presence distorts the relationship between the variables being studied; consequently, the results do not reflect the true relationship between the variables of interest. Due to this undesirable effect, investigators attempt to account for the effect of confounding variables through so-called ‘adjustment for confounding variables’.
The definition and the variable type of exposure, outcome, and confounder vary according to the research question and the statistical model.
In order for a variable to be considered a confounder, it needs to satisfy the following assumptions:
It is associated with the exposure (independent variable).
It is independently associated with the outcome (dependent variable).
It does not lie on the causal pathway between the exposure and the outcome.
Example: Determine if there is a relationship between age and post-prandial glucose level. For this analysis, we are going to use the Diabetes Dataset. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.
We can express the hypothesized association between age and glucose level using the following formula:
\[y_{glucose} = b_0 + b_{age} x_{age} + \epsilon\]
\({H_0}: \beta_{age} = 0\)
\({H_1}: \beta_{age} \neq 0\)
We start by confirming a linear relationship between glucose level and age.
Diabetes.df %>% ggplot(aes(x = Age, y = Glucose)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_light() #scatter plot with a fitted regression line
## `geom_smooth()` using formula = 'y ~ x'
We proceed to build a simple linear regression model.
glucose.model <- lm(Glucose~Age,data = Diabetes.df) #build linear regression model
summary(glucose.model) #generate summary
##
## Call:
## lm(formula = Glucose ~ Age, data = Diabetes.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126.453 -20.849 -3.058 18.304 86.159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.08016 3.34095 29.06 < 2e-16 ***
## Age 0.71642 0.09476 7.56 1.15e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.86 on 766 degrees of freedom
## Multiple R-squared: 0.06944, Adjusted R-squared: 0.06822
## F-statistic: 57.16 on 1 and 766 DF, p-value: 1.15e-13
confint(glucose.model) #generate 95% confidence intervals
## 2.5 % 97.5 %
## (Intercept) 90.5216601 103.6386585
## Age 0.5304001 0.9024361
Interpretation: Since the p-value is ≤ 0.05, we reject the null hypothesis and conclude that age is a significant predictor of glucose level. For every one-year increase in age, the mean glucose level increases by 0.72 mg/dL (95% CI 0.53-0.90).
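Once fitted, the model can also be used for prediction. As a sketch, for a hypothetical 50-year-old patient:
predict(glucose.model, newdata = data.frame(Age = 50), interval = "confidence") #predicted mean glucose with a 95% CI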
We could potentially improve the performance of our model by adjusting for potential confounders. Identifying potential confounders depends on your research question. For this research question, we will consider body mass index (BMI) as a confounder, after confirming it meets the assumptions noted previously.
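As a quick sketch of that confirmation, we can check that BMI is associated with both the exposure (age) and the outcome (glucose level):
cor.test(Diabetes.df$Age, Diabetes.df$BMI) #confounder vs. exposure
cor.test(Diabetes.df$BMI, Diabetes.df$Glucose) #confounder vs. outcome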
\[y_{glucose} = b_0 + b_{age} x_{age} + b_{BMI} x_{BMI} + \epsilon\]
We can adjust our model for potential confounders by including the variables in the model equation.
glucose.model.confounder <- lm(Glucose~Age+BMI,data = Diabetes.df) #multivariable linear regression model
We confirm the assumptions of linear regression by examining diagnostic plots.
par(mfrow = c(1,2)) #set up plot matrix
plot(glucose.model.confounder,which=c(1,2)) #diagnostic plots 1 (Residuals vs Fitted) and 2 (Normal Q-Q)
The left plot (Residuals vs. Fitted) helps us assess whether a linear fit is appropriate; we look for signs of curvature in the red line. The right plot (Normal Q-Q) helps us assess whether the distribution of the residuals is normal; we look for the points to fall along the diagonal reference line.
par(mfrow = c(1,1)) #reset plot layout to a single panel
plot(glucose.model.confounder,which=c(3)) #diagnostic plot 3 (Scale-Location)
This figure (Scale-Location) shows whether the residuals are spread equally along the range of fitted values, which is how we check the assumption of equal variance (homoscedasticity); we look for the red line to be roughly horizontal (parallel to the x-axis). Evaluating these figures leads to the conclusion that our model satisfies the linear regression assumptions.
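If a visual check feels ambiguous, formal tests can supplement the plots. A sketch, using the lmtest package commented out at the top of this document (it must be installed):
shapiro.test(resid(glucose.model.confounder)) #Shapiro-Wilk test; H0: residuals are normally distributed
lmtest::bptest(glucose.model.confounder) #Breusch-Pagan test; H0: residual variance is constant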
summary(glucose.model.confounder) #Generate summary of the model
##
## Call:
## lm(formula = Glucose ~ Age + BMI, data = Diabetes.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.310 -19.102 -1.977 18.310 83.766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.29517 5.40191 13.013 < 2e-16 ***
## Age 0.69555 0.09257 7.514 1.61e-13 ***
## BMI 0.85891 0.13808 6.220 8.16e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.13 on 765 degrees of freedom
## Multiple R-squared: 0.1142, Adjusted R-squared: 0.1119
## F-statistic: 49.33 on 2 and 765 DF, p-value: < 2.2e-16
confint(glucose.model.confounder) #Generate confidence intervals
## 2.5 % 97.5 %
## (Intercept) 59.6908477 80.8994969
## Age 0.5138259 0.8772727
## BMI 0.5878444 1.1299717
Interpretation: Since the p-value is ≤ 0.05, we reject the null hypothesis and conclude that age remains a significant predictor of glucose level after adjusting for BMI. Holding BMI constant, for every one-year increase in age, the mean glucose level increases by 0.70 mg/dL (95% CI 0.51-0.88).
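As a supplementary check not shown above, the unadjusted and adjusted models can be compared with a partial F-test, since they are nested:
anova(glucose.model, glucose.model.confounder) #tests whether adding BMI significantly improves the fit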
\(R^2\), or the coefficient of determination, is a value between 0 and 1 that measures how well our regression line fits our data. \(R^2\) can be interpreted as the percent of variance in the dependent variable that can be explained by the model. Adjusted \(R^2\) also indicates how well the regression line fits the data, but it adjusts for the number of terms in the model: if you add useless variables to a model, adjusted \(R^2\) will decrease, and if you add useful variables, adjusted \(R^2\) will increase. Adjusted \(R^2\) will always be less than or equal to \(R^2\). The adjusted \(R^2\) for our model is 0.112, meaning that 11.2% of the variation in glucose level can be explained by our model.
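To illustrate, here is a sketch that extracts both statistics from the summary object and adds a pure-noise predictor (generated here purely for illustration) to show the adjusted \(R^2\) penalty:
summary(glucose.model.confounder)$r.squared #R-squared
summary(glucose.model.confounder)$adj.r.squared #adjusted R-squared
set.seed(1)
Diabetes.df$noise <- rnorm(nrow(Diabetes.df)) #a useless predictor
summary(lm(Glucose ~ Age + BMI + noise, data = Diabetes.df))$adj.r.squared #typically decreases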