R tutorial on linear regression model

Mark Bounthavong

01/28/2022; updated on 03/13/2023; updated on 04/26/2025

This tutorial is located on Rpubs. The entire R Markdown code is located on my GitHub page.

Introduction

We will use the diabetes.csv data set, which you can download at this location.

Linear regression models (also known as “Ordinary Least Squares” or OLS models) allow us to determine whether a change in the value of one variable is associated with a change in the value of another variable. In other words, if I make a 1-unit change in \(X\), how much does \(Y\) change? In fact, linear regression is similar to the algebraic equation for a simple line (\(Y = mX + b\), where \(m\) is the slope, \(X\) is the variable that is changing, and \(b\) is the \(Y\)-intercept). In biostatistics, we use linear regression models to test the association between two or more variables where the outcome is a continuous data type.

In many textbooks, the linear regression model is called the “Ordinary Least Squares” or OLS model because it minimizes the squared errors (i.e., the squared distances of the observations from the best-fit line).

The linear regression model is pretty robust even when its assumptions don’t hold. Regardless, it’s good practice to test these assumptions.

There are several assumptions that need to be satisfied in order for us to trust the results from a linear regression model. These include:

Linearity: the relationship between \(X\) and the mean of \(Y\) is linear.
Independence: the observations (and their errors) are independent of one another.
Homoscedasticity: the variance of the residuals is constant across the range of \(X\).
Normality: the residuals are approximately normally distributed.

However, the linear regression model is pretty robust to violations of these assumptions; hence, its popularity. Moreover, it is very easy to interpret as this article will demonstrate.

Simple linear regression

The structural form of a linear regression model:

\(\large Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon\)

Typical notations of the linear regression model include:

\(Y_i\) denotes the outcome (dependent variable) for subject \(i\).
\(X_{1i}\) denotes the predictor (independent variable) for subject \(i\).
\(\beta_0\) denotes the intercept, the value of \(Y\) when \(X_{1}\) is zero.
\(\beta_1\) denotes the coefficient for \(X_{1}\), the change in \(Y\) for a 1-unit change in \(X_{1}\).
\(\epsilon\) denotes the error term (residual).

Data

The UC Irvine Machine Learning Repository contains many datasets that are used to validate machine learning models. You can check out more about their work at their website. This tutorial will use the Pima Indians Diabetes Dataset, which you can download from this GitHub location. The data was originally posted on the UC Irvine Machine Learning Repository, but it can also be found on Kaggle. The Pima Indians Diabetes Dataset originated from the National Institute of Diabetes and Digestive and Kidney Diseases as part of a study to predict diabetes among adult females (21 years old and older) of Pima Indian heritage.

The following variables are included in the data: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, and Outcome.

Packages

You will need to install and load the ggplot2 and predict3d packages for this tutorial.

You will also need to install and load the following packages.

#### Load the libraries
library("ggplot2")
# install.packages("devtools")
library("devtools")
# install.packages("predict3d")
## Note: 02-11-2023: This has been removed from CRAN, so you need to use the following code to install predict3d:
## devtools::install_github("cardiomoon/predict3d") ## Make sure to have the devtools installed
library("predict3d")
# install.packages("psych")
library("psych")
# install.packages("magrittr")
# library("magrittr") ## allows for rounding using the %>%
library("dplyr")
# install.packages("gtsummary") ## Allows for publication ready tables
library("gtsummary")
# install.packages("DescTools")
library("DescTools") ## Needed for normality testing
# install.packages("nortest")
library("nortest") ## Needed for normality testing
# install.packages("lmtest")
library("lmtest") ## Need for heteroskedasticity testing
# install.packages("sandwich")
library("sandwich")  ## Needed for estimating Huber-White sandwich standard errors
# install.packages("broom")
library("broom")  ## Needed package to generate tidy tables
# install.packages("broom.helpers)
library("broom.helpers")  ## Needed to create regression coefficient tibbles

Load Data

Load the data from the GitHub site. You can use the knitr::kable function to generate a table.

#### Load Data
diabetes.data <- read.csv("https://raw.githubusercontent.com/mbounthavong/R-tutorials/refs/heads/main/Data/diabetes.csv")

knitr::kable(
  head(diabetes.data), caption = "Table 1. First six rows of the Pima Indians Diabetes Dataset"
)
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "format"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
Table 1. First six rows of the Pima Indians Diabetes Dataset
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
6 148 72 35 0 33.6 0.627 50 1
1 85 66 29 0 26.6 0.351 31 0
8 183 64 0 0 23.3 0.672 32 1
1 89 66 23 94 28.1 0.167 21 0
0 137 40 35 168 43.1 2.288 33 1
5 116 74 0 0 25.6 0.201 30 0

Visualize the data

Once the data have been loaded, take a look at the summary statistics. You can do this using the describeBy function, which is part of the psych package.

knitr::kable(
  describeBy(diabetes.data) %>% round(2) 
)
vars n mean sd median trimmed mad min max range skew kurtosis se
Pregnancies 1 768 3.85 3.37 3.00 3.46 2.97 0.00 17.00 17.00 0.90 0.14 0.12
Glucose 2 768 120.89 31.97 117.00 119.38 29.65 0.00 199.00 199.00 0.17 0.62 1.15
BloodPressure 3 768 69.11 19.36 72.00 71.36 11.86 0.00 122.00 122.00 -1.84 5.12 0.70
SkinThickness 4 768 20.54 15.95 23.00 19.94 17.79 0.00 99.00 99.00 0.11 -0.53 0.58
Insulin 5 768 79.80 115.24 30.50 56.75 45.22 0.00 846.00 846.00 2.26 7.13 4.16
BMI 6 768 31.99 7.88 32.00 31.96 6.82 0.00 67.10 67.10 -0.43 3.24 0.28
DiabetesPedigreeFunction 7 768 0.47 0.33 0.37 0.42 0.25 0.08 2.42 2.34 1.91 5.53 0.01
Age 8 768 33.24 11.76 29.00 31.54 10.38 21.00 81.00 60.00 1.13 0.62 0.42
Outcome 9 768 0.35 0.48 0.00 0.31 0.00 0.00 1.00 1.00 0.63 -1.60 0.02

There are 768 subjects with nine variables. We can see that the average number of pregnancies is 3.85 with a standard deviation (SD) of 3.37. We also see that the average age of the sample was 33.24 years (SD, 11.76), the average BMI was 31.99 (SD, 7.88), and the average glucose level was 120.89 (SD, 31.97).

Motivating example: Evaluate the association between Age and Glucose level

Let’s suppose our main research question is to determine whether Age is associated with Glucose level. We will set the glucose level as our dependent variable and age as our independent variable (or predictor of interest). Then, we update our linear regression model’s structural form:

We call this the “expected” value because we are predicting this using a model. All regression models take existing data and attempt to make predictions. However, if your assumptions are violated, then these predictions are erroneous. As George Box once wrote, “All models are wrong, but some are useful.”

\(\large E[Glucose_{i} | Age_{i}] = \beta_0 + \beta_1 Age_{i} + \epsilon,\)

where \(E[Glucose_{i} | Age_{i}]\) denotes the expected Glucose level for subject \(i\) given the Age of subject \(i\).

We can also represent this relationship in a directed acyclic graph (DAG) diagram.

I used DAGitty to generate the DAG diagram. DAGitty can be found here.

DAG diagram illustrating the causal relationship between Age and Glucose level.

Visualize the association between Age and Glucose level

Let’s look at how Age is related to Glucose level by plotting their relationship. We’ll do this using ggplot. As Age increases, the Glucose level also increases. There appears to be a positive relationship between Age and Glucose level.

### Plot the association between the subject's age and glucose level
ggplot(diabetes.data, aes(x = Age, y = Glucose)) +
  geom_point() +
  stat_smooth()

In general, when describing a regression model, \(Y\) is denoted as the outcome and \(X\) is denoted as the predictor of interest.

Constructing the linear regression model

Now, we can construct a linear regression model with Glucose level as the dependent variable and Age as the independent variable (or predictor of interest). We will use the lm() function with Glucose level as the \(Y\) variable and Age as the \(X\) variable. The formula for a regression model in R uses the ~ symbol. For example, if we want to regress Glucose level on Age, we use the notation Glucose ~ Age.

By using the lm() function, we can construct the linear regression model: lm(Glucose ~ Age, data = diabetes.data). We can create an object that will contain this linear model; I called this object linear.model1.

We generate the 95% confidence interval (CI) by using the confint() function.

The \(\beta_{1}\) coefficient appears in the linear regression output as the Age coefficient, which is 0.71642.

Here is how we put all of this together in R:

### Linear regression model (Y = Glucose, X = Age)
linear.model1 <- lm(Glucose ~ Age, data = diabetes.data)
summary(linear.model1)
## 
## Call:
## lm(formula = Glucose ~ Age, data = diabetes.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -126.453  -20.849   -3.058   18.304   86.159 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.08016    3.34095   29.06  < 2e-16 ***
## Age          0.71642    0.09476    7.56 1.15e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.86 on 766 degrees of freedom
## Multiple R-squared:  0.06944,    Adjusted R-squared:  0.06822 
## F-statistic: 57.16 on 1 and 766 DF,  p-value: 1.15e-13

The lm() function does not generate 95% CI, so you will need to use the confint() function.

### Generate the 95% CI
confint(linear.model1)
##                  2.5 %      97.5 %
## (Intercept) 90.5216601 103.6386585
## Age          0.5304001   0.9024361

Interpret the linear regression output

Make sure to have the library gtsummary loaded to create tables from the regression output.

We are interested in the coefficients. To make interpreting the output easier, we can create a table to visualize the critical elements.

#### Present the output in a table
model1 <- tbl_regression(linear.model1, intercept = TRUE)
as_gt(model1) %>% 
              gt::tab_header("Table 2. Linear regression model output (Glucose ~ Age)") %>% 
              gt::tab_options(table.align='left')
Table 2. Linear regression model output (Glucose ~ Age)
Characteristic Beta 95% CI p-value
(Intercept) 97 91, 104 <0.001
Age 0.72 0.53, 0.90 <0.001
Abbreviation: CI = Confidence Interval

The Intercept denotes the \(Y\) intercept when \(X\) is equal to zero. In this case, it would be where Glucose level would be on the linear plot when Age is equal to zero, which is 97.08-units of Glucose.

When possible, present the coefficient’s point estimate and the 95% CI. For example, a 1-year increase in Age is associated with a 0.72-unit increase in Glucose level (95% CI: 0.53, 0.90). The Age coefficient denotes the change in Glucose level for a one-unit increase in Age. In other words, a 1-year increase in Age is associated with a 0.72-unit increase in Glucose level. Since the 95% CI is between 0.53-units and 0.90-units of Glucose, it does not include zero, so this association is statistically significant. We can also look at the p-value of the Age coefficient to determine whether this is statistically significant (<0.0001). However, it is preferable to present the 95% CI when describing the association between \(X\) and \(Y\).

The Adjusted R squared denotes the proportion of the variance in the outcome that is explained by the linear regression model. In other words, the current linear regression model explains about 6.8% of the variance in Glucose level.
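As a quick sanity check of this interpretation, you can compare the predicted Glucose levels of two subjects whose Ages differ by one year; the difference should equal the Age coefficient. This is a minimal sketch that assumes linear.model1 from above is still in memory (the Ages of 30 and 31 are arbitrary).

#### Sanity check: a 1-year difference in Age corresponds to the Age coefficient
diff(predict(linear.model1, newdata = data.frame(Age = c(30, 31))))
coef(linear.model1)["Age"]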

Visualize predicted model

We can plot the linear form of the model against the actual data using the ggPredict() function.

ggPredict(linear.model1, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5)

Here is a version without the scatterplot.

ggPredict(linear.model1, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)

Adding a confounder

We generate a new variable called pregnancy.history that is a dichotomous variable (0 = no history of pregnancy and 1 = history of pregnancy).

Let’s add to our current model by including a confounder. We have a variable called Pregnancies, but this provides the number of past pregnancies. We want to create a new variable that has a dichotomous outcome: history of pregnancy or no history of pregnancy. To do this, we need to subset the data and create rules. Anyone with 1 or more pregnancies will be coded as 1. Anyone who does not have a history of pregnancy will be coded as 0.

#### Generate groups based on pregnancy history (Group 0 = 0, Group 1 = 1 or more pregnancies)
diabetes.data$pregnancy.history[diabetes.data$Pregnancies == 0] = 0
diabetes.data$pregnancy.history[diabetes.data$Pregnancies > 0] = 1

table(diabetes.data$pregnancy.history)
## 
##   0   1 
## 111 657

We see that there are 657 women who have a history of pregnancy and 111 women with no history of pregnancy.

A DAG diagram illustrating the relationship between Pregnancy History as a confounder on the Age to Glucose relationship can be drawn.

DAG diagram illustrating the causal relationship between Age and Glucose level and Pregnancy History as a confounder.

In our linear regression model, we have the following structural form:

\(\large E[Glucose_{i} | Age_{i}] = \beta_0 + \beta_1 Age_{i} + \epsilon,\)

where \(E[Glucose_{i} | Age_{i}]\) denotes the expected Glucose level for subject \(i\) given the Age of subject \(i\).

But we can include a confounder pregnancy.history:

\(\large E[Glucose_{i} | Age_{i}, PregnancyHistory_{i}] = \beta_0 + \beta_1 Age_{i} + \beta_2 Pregnancy History_{i} + \epsilon,\)

where \(E[Glucose_{i} | Age_{i}, PregnancyHistory_{i}]\) denotes the expected Glucose level for subject \(i\) given the Age of subject \(i\) controlling for Pregnancy History of subject \(i\).

All you need to do to add another variable to the regression model is to use the + symbol. For example, lm(Glucose ~ Age + pregnancy.history, data = diabetes.data).

Using the lm() function, we can add pregnancy.history to the linear regression model:

Here is how we put all of this together in R:

### Linear regression model (Y = Glucose, X1 = Age, X2 = Pregnancy History)
linear.model2 <- lm(Glucose ~ Age + pregnancy.history, data = diabetes.data)
summary(linear.model2)
## 
## Call:
## lm(formula = Glucose ~ Age + pregnancy.history, data = diabetes.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -125.715  -20.546   -2.991   17.316   87.734 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       102.00752    3.95100  25.818  < 2e-16 ***
## Age                 0.76050    0.09638   7.891 1.04e-14 ***
## pregnancy.history  -7.47264    3.22137  -2.320   0.0206 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.77 on 765 degrees of freedom
## Multiple R-squared:  0.07594,    Adjusted R-squared:  0.07352 
## F-statistic: 31.43 on 2 and 765 DF,  p-value: 7.592e-14
confint(linear.model2)
##                         2.5 %      97.5 %
## (Intercept)        94.2514334 109.7636012
## Age                 0.5712954   0.9497004
## pregnancy.history -13.7964201  -1.1488593

We can present the model output into a table.

#### Present the output in a table
model2 <- tbl_regression(linear.model2, intercept = TRUE)
as_gt(model2) %>% 
              gt::tab_header("Table 3. Linear regression model output with confounder (Glucose ~ Age + Pregnancy History)") %>% 
              gt::tab_options(table.align='left')
Table 3. Linear regression model output with confounder (Glucose ~ Age + Pregnancy History)
Characteristic Beta 95% CI p-value
(Intercept) 102 94, 110 <0.001
Age 0.76 0.57, 0.95 <0.001
pregnancy.history -7.5 -14, -1.1 0.021
Abbreviation: CI = Confidence Interval

You can see that the Age coefficient is slightly different from our first model. It is 0.76 with a 95% CI of 0.57, 0.95. Compare this to the previous model’s result, which was 0.72; 95% CI: 0.53, 0.90.

#### Merge the two linear regression model's outputs
model1 <- tbl_regression(linear.model1, intercept = TRUE)
model2 <- tbl_regression(linear.model2, intercept = TRUE)
table1 <- tbl_merge(tbls = list(model1, model2),
          tab_spanner = c("**Model 1**", "**Model 2**"))
as_gt(table1) %>% 
              gt::tab_header("Table 4. Comparison between linear regression models [Model 1 (crude) v. Model 2 (adjusted)]") %>% 
              gt::tab_options(table.align='left')
Table 4. Comparison between linear regression models [Model 1 (crude) v. Model 2 (adjusted)]
Characteristic        Model 1: Beta (95% CI); p-value     Model 2: Beta (95% CI); p-value
(Intercept)           97 (91, 104); <0.001                102 (94, 110); <0.001
Age                   0.72 (0.53, 0.90); <0.001           0.76 (0.57, 0.95); <0.001
pregnancy.history     (not included)                      -7.5 (-14, -1.1); 0.021
Abbreviation: CI = Confidence Interval

Model 1 is considered the crude model or the unadjusted model. Model 2 is the adjusted model because it adjusts for the Pregnancy History confounder. Notice that \(\beta_{1, unadjusted}\) is 0.72, which is lower than the \(\beta_{1, adjusted}\) result of 0.76.

Additionally, the Adjusted R squared is higher in Model 2 (7.35%) compared to Model 1 (6.82%). This means that Model 2 does a slightly better job of explaining the variance in Glucose level than Model 1.
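If you prefer to extract these values directly rather than reading them off the summary() printout, the adjusted R squared is stored in the summary object. A small sketch, assuming both model objects from above:

#### Extract the adjusted R-squared from each model for a direct comparison
summary(linear.model1)$adj.r.squared
summary(linear.model2)$adj.r.squared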

Let’s plot Model 2’s results.

ggPredict(linear.model2, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)

You can derive the difference between the groups with and without a history of pregnancy by subtracting 102 - 94.5, which is 7.5, the \(\beta_{2}\) for pregnancy.history.

We can see that the group that had a history of pregnancy has lower predicted Glucose levels than the group that did not have a history of pregnancy. This makes sense when you look at the pregnancy.history coefficient. It is -7.5, which means that having a history of pregnancy is associated with a 7.5-unit decrease in Glucose level (95% CI: -13.80, -1.15) compared to not having a history of pregnancy, controlling for Age. Therefore, across all ranges of Age, the group with a history of pregnancy will have predicted Glucose levels that are 7.5 units lower than the group without a history of pregnancy. You can visualize this on the plot; the two lines do not cross and remain parallel across all ranges of Age. And in both groups there is a positive association between Age and Glucose level.
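One way to see this constant gap is to generate predictions at the same Age for both pregnancy history groups; the difference between the two predictions should equal the pregnancy.history coefficient. This is a minimal sketch that assumes linear.model2 from above (the Age value of 30 is arbitrary).

#### Predicted Glucose at Age 30 for each pregnancy history group (difference = -7.5)
predict(linear.model2, newdata = data.frame(Age = c(30, 30), pregnancy.history = c(0, 1)))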

Evaluate residual plots

It is good practice to look at the residuals of the regression model to make sure that the assumptions hold.

Recall the assumptions for the linear regression model: linearity, independence, homoscedasticity (constant variance of the residuals), and normality of the residuals.

Homoscedasticity v. Heteroscedasticity.

We can test whether the residuals are uncorrelated with the \(X\) variables, that is, whether their variance is constant (homoscedasticity).

Let’s use the crude linear regression model from our previous example:

\(\large E[Glucose_{i} | Age_{i}] = \beta_0 + \beta_1 Age_{i} + \epsilon,\)

where \(E[Glucose_{i} | Age_{i}]\) denotes the expected Glucose level for subject \(i\) given the Age of subject \(i\).

linear.model1 <- lm(Glucose ~ Age, data = diabetes.data)
summary(linear.model1)
## 
## Call:
## lm(formula = Glucose ~ Age, data = diabetes.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -126.453  -20.849   -3.058   18.304   86.159 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.08016    3.34095   29.06  < 2e-16 ***
## Age          0.71642    0.09476    7.56 1.15e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.86 on 766 degrees of freedom
## Multiple R-squared:  0.06944,    Adjusted R-squared:  0.06822 
## F-statistic: 57.16 on 1 and 766 DF,  p-value: 1.15e-13
confint(linear.model1)
##                  2.5 %      97.5 %
## (Intercept) 90.5216601 103.6386585
## Age          0.5304001   0.9024361

Since the model generates predictions, we check to see if the residuals are correlated with the fitted or predicted values of Glucose. If there is an association, then we have heteroscedasticity, which is a violation of the linear regression model assumptions.

We can plot the residuals and see if they are associated with increasing values of the expected or predicted Glucose level. If there is no association, we should expect to see a uniform distribution across all ranges of the expected values of Glucose.

#### Plot the residuals against the predicted model (Is it homoscedastic?)
plot(linear.model1$res ~ linear.model1$fitted)

Reviewing the residual plot along the fitted values, there doesn’t appear to be any evidence of heteroskedasticity. Upon visual inspection, we can see that the residuals are uniform across the expected Glucose level range (also called the “fitted” values). We can verify this visual inspection by performing the Breusch-Pagan test of heteroskedasticity. We will need to install and load the lmtest package and use the bptest() function.

bptest(linear.model1)
## 
##  studentized Breusch-Pagan test
## 
## data:  linear.model1
## BP = 2.4585, df = 1, p-value = 0.1169

According to the BP-test results, the p-value is 0.1169, which means that we fail to reject the null hypothesis that the variance of the residuals is constant. In other words, there is no association between the residuals of the model and the predicted values generated from the model.

If, however, there was heteroskedasticity, we could address this by estimating robust standard errors. For linear regression models, the most common method is to use the Huber-White sandwich estimation method. To do this, we need to use the sandwich package and the coeftest() function. (Note: In practice, I default to the robust standard errors rather than let the lm() function estimate these for me.)

### Huber-White sandwich estimation
robust1 <- coeftest(linear.model1, vcov = vcovHC(linear.model1, type = "HC1"))
robust1
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 97.080159   3.265127  29.732 < 2.2e-16 ***
## Age          0.716418   0.095573   7.496  1.82e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(robust1)
##                  2.5 %      97.5 %
## (Intercept) 90.6704993 103.4898193
## Age          0.5288023   0.9040339

Recall that the 95% CI is calculated using the standard error (SE).
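As a reminder, the 95% CI for a coefficient is approximately:

\(\large 95\% \text{ CI} = \hat{\beta_1} \pm t_{0.975, df} \times SE(\hat{\beta_1}),\)

where \(t_{0.975, df}\) is the critical value from the t distribution with the model's residual degrees of freedom (roughly 1.96 in large samples). A larger SE therefore produces a wider CI, which is why the robust and non-robust intervals differ slightly.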

Notice that the standard errors are slightly different from the ones estimated in the previous model. In the previous model, the 95% CI was between 0.530 and 0.902. In the model with the robust standard errors, the 95% CI was between 0.529 and 0.904. The differences are trivial in this example, but they could be important when the 95% CI is close to the null value.

Comparison of standard errors between robust SE and non-robust SE.
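If you want to produce this comparison yourself, you can place the default and robust standard errors side by side. A quick sketch, assuming linear.model1 and the sandwich package are already loaded:

#### Side-by-side comparison of the default and robust (HC1) standard errors
cbind("Default SE" = summary(linear.model1)$coefficients[, "Std. Error"],
      "Robust SE"  = sqrt(diag(vcovHC(linear.model1, type = "HC1"))))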

We can also evaluate whether the residuals are normally distributed. We can generate a histogram and a Q-Q plot. The histogram has a slight left skew, and the Q-Q plot's tails deviate from the reference line.

#### Set up the matrix
par(mfrow = c(1, 2))

#### Histogram of the residuals
hist(linear.model1$res)

#### QQ-plot of the residuals against the QQ line
qqnorm(linear.model1$res); qqline(linear.model1$res, col = "2", lwd = 1, lty = 1)

You only need to use one of these tests. They will generally give the same results.

We can also test for the normality of the residuals. Common tests of normality include the Shapiro-Wilk test, the Jarque-Bera test, and the Kolmogorov-Smirnov (Lilliefors) test. I provide the code below. Despite the differences in their output, the conclusions are all the same: the residuals are not normally distributed. Despite this, the linear regression model is pretty robust to violations of this assumption. You can still make a concluding statement that there is an association between Age and Glucose level based on these findings.

#### Test normality using Shapiro-Wilk's test
shapiro.test(linear.model1$res)
## 
##  Shapiro-Wilk normality test
## 
## data:  linear.model1$res
## W = 0.97467, p-value = 2.838e-10
#### Test normality using Jarque Bera test
JarqueBeraTest(linear.model1$res, robust = FALSE) ### Does not use robust method
## 
##  Jarque Bera Test
## 
## data:  linear.model1$res
## X-squared = 25.565, df = 2, p-value = 2.81e-06
#### Test normality using the Kolmogorov-Smirnov test
lillie.test(linear.model1$res)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  linear.model1$res
## D = 0.065397, p-value = 2.941e-08

Conclusions

Linear regression models are useful for understanding the relationship between a predictor variable and an outcome variable when the outcome variable is continuous. Additionally, you can add confounders into the regression model to control for their effects. Once you control for confounders, you should compare the results with the crude model to see how the relationship between the predictor of interest and the outcome changes. Finally, after reviewing the results of the linear regression model, it is good practice to look at the residuals and verify that the assumptions of homoscedasticity and normality continue to hold.

Acknowledgements

I would like to acknowledge the UC Irvine Machine Learning Repository for providing the dataset used in this tutorial. You can find out more about their work here.

References

The gtsummary package is great at merging outcomes from regression models into publication quality tables. Daniel D. Sjoberg authored the gtsummary package with instructions on his website. You can also use external functions by converting the gtsummary table into an object using as_gt(); instructions can be found here.

I used DAGitty to create the DAG diagrams.

The sandwich package was used to estimate the Huber-White sandwich standard errors.

Bruno Rodrigues has a wonderful article on dealing with heteroskedasticity, which is located here.

Czar Yobero wrote a great article on how to test for heteroskedasticity, which is located on his RPubs page.

Work in progress

This is a work in progress. I will continue to make updates to this over time. If you have any comments or suggestions for improvements, please email me.