Big Data Summer Institute 2022
June 22, 2022
Soumik Purkayastha
[https://soumikp.github.io]
Simple linear regression (SLR) is a technique for modeling the relationship between two numerical variables \(x\) and \(y\).
SLR can be visualized using a scatterplot in the xy-plane.
By estimating a straight line that ‘best fits’ the data on a scatterplot, we obtain a linear model that can be used not only for prediction but also for inference (with some additional assumptions).
The Prevention of REnal and Vascular END-stage Disease (PREVEND) study measured various clinical data for participants.
Cognitive function was assessed with the Ruff Figural Fluency Test (RFFT). This will be our response of interest.
Scores range from 0 to 175; higher scores indicate better cognitive function.
We will work with a random sample of 500 participants.
Various demographic and cardiovascular risk factors that are potentially associated with RFFT score were also collected for each participant.
For a line to be considered a reasonable approximation to the relationship in a scatterplot, and for the methods of inference (discussed later) to apply, the following conditions should hold: the relationship between \(x\) and \(y\) is approximately linear, the observations are independent, the variability of the points about the line is roughly constant, and (for inference) the residuals are approximately normally distributed.
Our overarching goal is to fit a ‘good’ line to this data cloud. Suppose we knew the ‘best’ line that fits these data. Then, for any given value of the predictor \(x\), we would obtain a predicted value \(\hat{y}.\) In particular, for each observed data point \((x_i, y_i)\), with \(y_i\) being the observed response, we get a predicted response \(\hat{y}_i.\)
The least squares regression line is the line which minimizes the sum of the squared residuals for all the points in the plot.
In other words, the least squares line is the line with coefficients \(b_0\) and \(b_1\) such that the quantity \[e_1^2 + e_2^2 + \ldots + e_n^2,\] where \(e_i = y_i - \hat{y}_i\) is the residual for the \(i\)-th observation, is the smallest possible value it can take.
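As a small illustration of this idea (a sketch with simulated data, not the PREVEND sample), the sum of squared residuals at the lm() coefficients is no larger than at any nearby line:

set.seed(1)                                  # simulated data for illustration only
dat <- data.frame(x = 1:50)
dat$y <- 3 + 2 * dat$x + rnorm(50, sd = 5)
fit <- lm(y ~ x, data = dat)                 # least squares estimates b0 and b1
b <- coef(fit)
sse <- function(b0, b1) sum((dat$y - (b0 + b1 * dat$x))^2)
sse(b[1], b[2])                              # SSE at the least squares line
sse(b[1] + 1, b[2])                          # shifting the intercept increases the SSE
sse(b[1], b[2] + 0.1)                        # so does changing the slope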
For a general population of ordered pairs \((x, y)\), the population regression model is \[Y = \beta_0 + \beta_1X + \epsilon,\] where \(\epsilon \sim N(0, \sigma)\) is the error term.
Since \(\mathrm{E}(\epsilon) = 0\), the population regression model may be rewritten in terms of conditional behaviour of \(Y\) given \(X = x\), i.e., \[\mathrm{E}(Y|X=x) = \beta_0 + \beta_1x.\]
The terms \(\beta_0\) and \(\beta_1\) are population parameters, with \(b_0\) and \(b_1\) serving as their estimates obtained from the sample.
We will use R to (a) obtain the estimates \(b_0\) and \(b_1\), and (b) check the validity of the assumptions of linear regression.
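A minimal sketch of this workflow in R (assuming the PREVEND sample has already been loaded as prevend.samp, as in the lab section below):

m1 <- lm(RFFT ~ Age, data = prevend.samp)   # (a) fit the model: b0 and b1
coef(m1)                                     # estimated intercept and slope
summary(m1)                                  # standard errors, t-tests, R-squared
plot(resid(m1) ~ fitted(m1))                 # (b) quick residual plot for assumption checks
abline(h = 0, lty = 2)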
lm() for categorical predictors with two levels
Although the response variable in linear regression is necessarily numerical, the predictor may be either numerical or categorical. Simple linear regression only allows for categorical predictor variables with two levels. Examining categorical predictors with more than two levels requires multiple linear regression.
Fitting a simple linear regression model with a two-level categorical predictor is analogous to comparing the means of two groups, where the groups are defined by the categorical variable.
Here, we examine if there are any gender-based differences in RFFT scores in the PREVEND dataset. First, we compare the gender-stratified distribution of RFFT scores by means of violin plots.
We compare the group means as follows
prevend.samp %>%
rowwise() %>%
mutate(Gender = factor(ifelse(Gender == 1, "Male", "Female"))) %>%
group_by(Gender) %>% ## stratifying by gender
summarise(m = mean(RFFT), ## strata-specific means
q1 = quantile(RFFT, 0.25), ## strata-specific quantiles
q2 = quantile(RFFT, 0.5),
q3 = quantile(RFFT, 0.75))

## # A tibble: 2 × 5
## Gender m q1 q2 q3
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Male 69.2 48.5 66 87
## 2 Female 67.7 44 68 88
and fit a linear model of RFFT by gender
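The output below can be produced with a call of this form (a sketch; the recoded data frame is assumed to be stored as data, with Male set as the baseline level):

data <- prevend.samp %>%
  mutate(Gender = factor(ifelse(Gender == 1, "Male", "Female"),
                         levels = c("Male", "Female")))   # Male as baseline
lm(RFFT ~ Gender, data = data)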
##
## Call:
## lm(formula = RFFT ~ Gender, data = data)
##
## Coefficients:
## (Intercept) GenderFemale
## 69.238 -1.578
Note that the estimated intercept is the group mean for one category (called the baseline) - here it is the Gender = Male group.
The estimated slope is the difference between the two group means (Female minus Male); here \(67.7 - 69.2 \approx -1.6\), which matches the GenderFemale coefficient of \(-1.578\) up to rounding.
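A quick numerical check of this interpretation (assuming the recoded data frame data from the sketch above):

means <- tapply(data$RFFT, data$Gender, mean)   # gender-specific mean RFFT scores
means["Male"]                                   # matches the intercept (about 69.2)
means["Female"] - means["Male"]                 # matches the GenderFemale coefficient (about -1.58)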
The correlation coefficient \(r\) measures the strength of the linear relationship between two variables.
It is more common to use \(R^2\) to measure the strength of a linear fit.
\(R^2\) describes the amount of variation in the response that is explained by the least squares line.
\[R^2 = \frac{\text{variance of predicted y-values }}{\text{variance of observed y-values}} = \frac{\mathrm{V}(\hat{Y})}{\mathrm{V}({Y})}.\]
If a linear model perfectly captured the variability in the observed data, then \(\mathrm{V}(\hat{Y}) = \mathrm{V}(Y)\) and \(R^2\) would be 1.
\(R^2\) can also be calculated as follows \[\begin{align*} R^2 &= \frac{\text{variance of observed y-values} - \text{variance of residuals}}{\text{variance of observed y-values}}\\ &= \frac{\mathrm{V}({Y}) - \mathrm{V}({e})}{\mathrm{V}({Y})}. \end{align*}\]
The variability of the residuals about the fitted line represents the remaining variability after the model is fit. In other words, \(\mathrm{V}({e})\) is the variability unexplained by the model.
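Both formulas can be checked directly in R; a sketch using the regression of RFFT score on age from the lab below (m1 is an assumed name for the fitted model):

m1 <- lm(RFFT ~ Age, data = prevend.samp)
var(fitted(m1)) / var(prevend.samp$RFFT)                              # V(Y-hat) / V(Y)
(var(prevend.samp$RFFT) - var(resid(m1))) / var(prevend.samp$RFFT)    # (V(Y) - V(e)) / V(Y)
summary(m1)$r.squared                                                 # both agree with the reported R-squared (about 0.285)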
Inference in a regression context is usually about the slope parameter \(\beta_1\). The null hypothesis is most commonly a hypothesis of ‘no association’, which may be formulated mathematically as \[H_0: \beta_1 = 0 \text{ [denoting X and Y are NOT associated]}.\] The alternative hypothesis is given by \[H_A: \beta_1 \neq 0 \text{ [denoting X and Y are associated]}.\] We use the estimate \(b_1\) to construct the statistic for testing \(H_0\). The test statistic is given by \[t = \frac{b_1 - \beta_1^0}{\text{s.e.}(b_1)} = \frac{b_1}{\text{s.e.}(b_1)},\] where \(\beta_1^0\) is the value of \(\beta_1\) specified under the null hypothesis (here, zero).
The \(t-\)statistic defined above follows a \(t-\)distribution with degrees of freedom \(n − 2\), where \(n\) is the number of ordered pairs \((x_i, y_i)\) in the dataset.
In order to test whether \(H_0\) holds, we may also construct the \(95\%\) confidence interval associated with \(\beta_1\) and investigate whether the resultant interval contains zero. The \(95\%\) C.I. is given by \[b_1 \pm \left(t^* \times \text{s.e.}(b_1)\right),\] where \(t^*\) is the 97.5-th percentile of a \(t\)-distribution with \(n-2\) degrees of freedom (computed in R as qt(p = 0.975, df = n-2)).
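A sketch of this calculation for the age model from the lab, alongside the built-in confint() helper (m1 is again an assumed model name):

m1 <- lm(RFFT ~ Age, data = prevend.samp)
b1 <- coef(summary(m1))["Age", "Estimate"]       # slope estimate
se <- coef(summary(m1))["Age", "Std. Error"]     # its standard error
tstar <- qt(p = 0.975, df = nrow(prevend.samp) - 2)
b1 + c(-1, 1) * tstar * se                       # 95% CI computed by hand
confint(m1, level = 0.95)                        # built-in equivalent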
R for simple linear regression
Notes for review may be found here.
lm() to analyse data from the PREVEND study
This lab uses data from the Prevention of REnal and Vascular END-stage Disease (PREVEND) study, which took place between 2003 and 2006 in the Netherlands. Clinical and demographic data for a random sample of 500 individuals are stored in the prevend.samp dataset in the oibiostat package.
The lectures have brief active learning components in which you will review how to use R to analyse some real-life datasets.
In order to do so effectively, please make sure you run the following
code chunk
and then run
nhanes.samp.adult.500 <- read_csv("https://raw.githubusercontent.com/soumikp/bdsi_2022/main/data/nhanes.samp.adult.500.csv")
prevend.samp <- read_csv("https://raw.githubusercontent.com/soumikp/bdsi_2022/main/data/prevend.samp.csv")

If your installation was not successful, you’ll get an error message.
As adults age, cognitive function declines over time; this is largely due to various cerebrovascular and neurodegenerative changes. The Ruff Figural Fluency Test (RFFT) is one measure of cognitive function that provides information about cognitive abilities such as planning and the ability to switch between different tasks. Scores on the RFFT range from 0 to 175 points, where higher scores are indicative of better cognitive function.
The goal of this lab is to begin exploring the relationship between
age and RFFT score in the prevend.samp dataset.
Create a scatterplot of RFFT score versus age in prevend.samp.
prevend.samp %>%
rowwise() %>%
mutate(Gender = factor(ifelse(Gender == 1, "Male", "Female"))) %>%
ggplot(aes(x = Age, y = RFFT)) +
geom_point(aes(color = Gender, shape = Gender)) + ## bonus
theme_bw() +
ylab("RFFT score") +
xlab("Age (in years)") +
labs(title = "Scatterplot of RFFT scores and age (in years) for PREVEND data (n = 500).") +
theme(legend.position = "bottom") +
scale_color_aaas()

prevend.samp %>%
ggplot(aes(x = Age, y = RFFT)) +
geom_point(color = pal_aaas("default", alpha = 1)(9)[4]) +
geom_abline(slope = 2, intercept = -20,
linetype = "dashed", color = pal_aaas("default", alpha = 1)(9)[6], size = 1) +
theme_bw() +
ylab("RFFT score") +
xlab("Age (in years)") +
labs(title = "Scatterplot of RFFT scores and age (in years) for PREVEND data (n = 500).") +
theme(legend.position = "bottom")

#enter line coefficients
b0 = -20
b1 = 2
#calculate sse
y = prevend.samp$RFFT
x = prevend.samp$Age
sse = sum((y - (b0 + b1*x))^2)
sse

## [1] 1206875
Since we do not expect this arbitrarily chosen line (intercept \(-20\), slope \(2\)) to fit the data well, the SSE should be relatively high, indicating a poor model fit and a large amount of error (from large residuals) associated with the model. The SSE is 1,206,875.
Create a scatterplot of RFFT score versus age in prevend.samp, then add a line of best fit.
prevend.samp %>%
ggplot(aes(x = Age, y = RFFT)) +
geom_point(color = pal_aaas("default", alpha = 1)(9)[4]) +
theme_bw() +
stat_smooth(method = lm, se = TRUE,
color = pal_aaas("default", alpha = 1)(9)[6], linetype = "dashed", size = 1) +
ylab("RFFT score") +
xlab("Age (in years)") +
labs(title = "Scatterplot of RFFT scores and age (in years) for PREVEND data (n = 500) with line of best fit (grey band gives 95% CI)") +
theme(legend.position = "bottom")

## `geom_smooth()` using formula 'y ~ x'
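The regression output below can be reproduced with a call of this form (the object name m1 is only a placeholder):

m1 <- lm(RFFT ~ Age, data = prevend.samp)   # regress RFFT score on age
summary(m1)                                  # coefficient table, R-squared and F-statistic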
##
## Call:
## lm(formula = RFFT ~ Age, data = prevend.samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.879 -16.845 -1.095 15.524 58.564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.54972 5.01614 27.42 <2e-16 ***
## Age -1.26136 0.08953 -14.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.19 on 498 degrees of freedom
## Multiple R-squared: 0.285, Adjusted R-squared: 0.2836
## F-statistic: 198.5 on 1 and 498 DF, p-value: < 2.2e-16
In most practical settings, more than one explanatory variable is likely to be associated with a response.
Multiple linear regression is used to estimate the linear relationship between a response variable \(Y\) and several predictors \(x_1, x_2,\ldots, x_p\), where \(p\) is the number of predictors.
We extend the statistical model from one predictor to \(p\) predictors as follows \[\mathrm{E}(Y|x_1, x_2, \ldots, x_p) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p.\]
There are several applications of multiple regression. We will focus on two areas:
We will again make use of the prevend.samp dataset in this module, but this time we’ll consider a more complicated example concerning the effect of statin use on cognitive function (through RFFT scores).
Statins are a class of drugs widely used to lower cholesterol. If followed, recent guidelines for prescribing statins would lead to statin use in almost half of Americans between 40 and 75 years of age and nearly all men over 60.
A few small studies have suggested that statins may be associated with lower cognitive ability.
The PREVEND study collected data on statin use as well as other demographic factors.
The statistical model for multiple regression is based on \[\mathrm{E}(Y|x_1, x_2, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p,\] where \(p\) is the number of predictors.
The coefficient \(\beta_j\) of a predictor \(x_j\) is estimated by \(b_j\), say. It is the predicted mean change in the response corresponding to a one unit change in \(x_j\), when the values of all other predictors remain constant.
The practical interpretation is that a coefficient in multiple regression estimates the association between a response and that predictor, after adjusting for the other predictors in the model.
The assumptions of multiple regression are similar to those of simple linear regression.
It is not possible to make a scatterplot of a response against several simultaneous predictors.
More information is always nice: as variables are added, \(R^2\) always increases.
But at what cost? As variables are added, model complexity also increases.
We consider using adjusted \(R^2\) (\(R^2_{\text{adj}}\)) as a tool for model assessment.
\[R^2_{\text{adj}} = 1 - (1 - R^2)\times \frac{n-1}{n-1-p}.\]
It is often used to balance predictive ability with model complexity. \(R^2_{\text{adj}}\) incorporates a penalty for including predictors that do not contribute much towards explaining observed variation in the response variable.
Unlike \(R^2\), \(R^2_{\text{adj}}\) does not have an inherent interpretation.
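A sketch of this formula in R, using the RFFT model with statin use and age that appears in the lab below (variable names are assumptions):

fit <- lm(RFFT ~ Statin + Age, data = prevend.samp)
n <- nrow(prevend.samp)                      # number of observations
p <- 2                                       # number of predictors
r2 <- summary(fit)$r.squared
1 - (1 - r2) * (n - 1) / (n - 1 - p)         # adjusted R-squared by hand
summary(fit)$adj.r.squared                   # matches the built-in value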
Typically, the hypotheses of interest are \(H^k_{0}: \beta_k = 0\) (\(X_k\) and \(Y\) are NOT associated) with alternative \(H^k_A: \beta_k \neq 0\) (\(X_k\) and \(Y\) are associated).
The test statistic is again a t-statistic under the null, with degrees of freedom \(= n - p - 1\), given by \[t_k = \frac{b_k - \beta_k^0}{\text{s.e.}(b_k)},\] with a \(95\%\) confidence interval for \(\beta_k\) given by \[b_k \pm \left(t^* \times \text{s.e.}(b_k) \right),\] where \(t^*\) is the \(97.5-\)th percentile for a \(t-\)distribution with \(n-p-1\) degrees of freedom.
The \(F-\)statistic is used in an overall test of the model to assess whether the predictors in the model, considered as a group, are associated with the response.
\[H_0 :\beta_1 = \beta_2 =\ldots=\beta_p = 0 \text{ vs } H_A: H_0 \text{ is false,}\] where \(H_0\) is false if and only if at least one of the slope coefficients \(\beta_k\) is not 0.
There is sufficient evidence to reject \(H_0\) if the \(p-\)value of the \(F-\)statistic is smaller than or equal to \(\alpha.\)
Again, R does the hard work for us: the F-statistic and its associated p-value are displayed in the output.
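For example, the overall F-statistic and its p-value can be pulled out of the summary object (a sketch, reusing the statin-and-age model from above):

fit <- lm(RFFT ~ Statin + Age, data = prevend.samp)
f <- summary(fit)$fstatistic                                  # F value and its two degrees of freedom
f
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)    # p-value for the overall test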
The multiple regression model assumes that when one of the predictors \(x_j\) is changed by 1 unit and the values of the other variables remain constant, the predicted response changes by \(\beta_j\), regardless of the values of the other variables.
A statistical interaction occurs when this assumption is not true, such that the effect of one explanatory variable \(x_j\) on the response depends on the particular value(s) of one or more other explanatory variables.
As an example, we specifically examine interaction in a two-variable setting, where one of the predictors is categorical and the other is numerical.
Consider modeling total cholesterol level (mmol/L) as a function of age (in years) and diabetes status in a dataset of size \(n = 473\).
We fit the linear model \[\mathrm{E}(\text{Total cholesterol|Age, Diabetes status}) = \beta_0 + \beta_1 \text{Age} + \beta_2 \mathrm{I}(\text{has diabetes}).\] This model yields two fitted lines, one for individuals with diabetes and one for individuals without diabetes - we overlay these fitted lines on the scatterplot.
Next, we consider two separate models for the relationship between total cholesterol and age; one in diabetic individuals and one in non-diabetic individuals.
\[\mathrm{E}(\text{Total cholesterol|Age, Diabetes status}) = \beta_0 + \beta_1 \text{ Age } + \beta_2 \ \mathrm{I}(\text{has diabetes}) + \beta_3 \text{ Age } \times \ \mathrm{I}(\text{has diabetes}).\] The estimated model coefficients are given by
## (Intercept) Age DiabetesYes Age:DiabetesYes
## 4.695702513 0.009638183 1.718704342 -0.033451562
\[\widehat{TotChol} = 4.70 + 0.0096 \times \text{ Age} + 1.72 \times \ \mathrm{I}(\text{has diabetes}) - 0.033 \times \text{ Age }\times \mathrm{I}(\text{has diabetes}).\]
Hence the fitted model equation for diabetics is given by \[\widehat{TotChol} = 6.42 - 0.023 \times \text{Age},\] obtained by adding the diabetes terms to the intercept and slope (\(4.70 + 1.72 = 6.42\) and \(0.0096 - 0.033 \approx -0.023\)), and that for non-diabetics is given by \[\widehat{TotChol} = 4.70 + 0.0096 \times \text{Age}.\]
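A sketch of how the two fitted lines follow from the coefficient vector printed above (the coefficient values are taken from that output; no new data are used):

b <- c(4.695702513, 0.009638183, 1.718704342, -0.033451562)
names(b) <- c("(Intercept)", "Age", "DiabetesYes", "Age:DiabetesYes")
b["(Intercept)"]                       # intercept for non-diabetics
b["Age"]                               # slope for non-diabetics
b["(Intercept)"] + b["DiabetesYes"]    # intercept for diabetics (about 6.42)
b["Age"] + b["Age:DiabetesYes"]        # slope for diabetics (Age plus interaction coefficient)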
R for multiple linear regression
Notes for review may be found here.
Having fit a multiple regression model predicting RFFT score from statin use and age, we will check the assumptions for multiple linear regression.
library(oibiostat)
data(prevend.samp)
prevend.samp <- prevend.samp %>%
rowwise() %>%
mutate(Statin = factor(ifelse(Statin == 1, "Yes", "No"))) %>%
select(c(RFFT, Age, Statin))
#fit a multiple regression model
model1 = lm(RFFT ~ Statin + Age, data = prevend.samp)

Assess linearity with respect to age using a scatterplot with residual values on the \(y\)-axis and values of age on the \(x\)-axis. Is it necessary to assess linearity with respect to statin use?
plot(resid(model1) ~ prevend.samp$Age,
main = "Residuals vs Age in PREVEND (n = 500)",
xlab = "Age (years)", ylab = "Residual",
pch = 21, col = "cornflowerblue", bg = "slategray2",
cex = 0.60)
abline(h = 0, col = "red", lty = 2)
There are no apparent trends; the data
scatter evenly above and below the horizontal line. There does not seem
to be remaining nonlinearity with respect to age after the model is
fit.
It is not necessary to assess linearity with respect to statin use since statin use is measured as a categorical variable. A line drawn through two points (that is, the mean of the two groups defined by a binary variable) is necessarily linear.
Assess whether the residuals have approximately constant variance.
#assess constant variance of residuals
plot(resid(model1) ~ fitted(model1),
main = "Resid. vs Predicted RFFT in PREVEND (n = 500)",
xlab = "Predicted RFFT Score", ylab = "Residual",
pch = 21, col = "cornflowerblue", bg = "slategray2",
cex = 0.60)
abline(h = 0, col = "red", lty = 2)
The variance of the residuals is somewhat
smaller for lower predicted values of RFFT score, but this may simply be
an artifact from observing few individuals with relatively low predicted
scores. It seems reasonable to assume approximately constant
variance.
Assess whether the residuals are approximately normally distributed.
#assess normality of residuals
qqnorm(resid(model1),
pch = 21, col = "cornflowerblue", bg = "slategray2", cex = 0.75,
main = "Q-Q Plot of Residuals")
qqline(resid(model1), col = "red", lwd = 2)
The residuals are reasonably normally
distributed, with only slight departures from normality in the
tails.
How well does the model explain the variability in observed RFFT score?
Calculate \(R^2_{\text{adj}}\) for the multiple regression model predicting RFFT score from statin use and age.
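One way to do this, assuming model1 from the earlier chunk is still available:

summary(model1)$r.squared                    # multiple R-squared
summary(model1)$adj.r.squared                # adjusted R-squared
# or by hand, with n = 500 observations and p = 2 predictors:
1 - (1 - summary(model1)$r.squared) * (500 - 1) / (500 - 1 - 2)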
The following set of questions steps through taking a closer look at the association of RFFT score with age and statin use in prevend.samp, a sample of \(n = 500\) individuals from the PREVEND data.
Run the code in the template to fit a model for predicting RFFT score from age, statin use, and the interaction term between age and statin use.
#load the data
library(oibiostat)
data("prevend.samp")
#convert Statin to a factor
prevend.samp$Statin = factor(prevend.samp$Statin, levels = c(0, 1),
labels = c("NonUser", "User"))
#fit the model
model.RFFT.interact = lm(RFFT ~ Age*Statin, data = prevend.samp)
coef(model.RFFT.interact)

## (Intercept) Age StatinUser Age:StatinUser
## 140.2031114 -1.3149119 -13.9720216 0.2474466
##
## Call:
## lm(formula = RFFT ~ Age * Statin, data = prevend.samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.551 -16.963 -1.179 15.764 58.802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140.2031 5.6209 24.943 <2e-16 ***
## Age -1.3149 0.1040 -12.646 <2e-16 ***
## StatinUser -13.9720 15.0113 -0.931 0.352
## Age:StatinUser 0.2474 0.2468 1.003 0.317
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.21 on 496 degrees of freedom
## Multiple R-squared: 0.2866, Adjusted R-squared: 0.2823
## F-statistic: 66.42 on 3 and 496 DF, p-value: < 2.2e-16
There is no evidence of a statistically significant interaction between age and statin use; the \(p\)-value associated with the interaction term is 0.317.
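The same conclusion can be reached from a 95% confidence interval for the interaction coefficient, using the model object fit above:

confint(model.RFFT.interact, level = 0.95)
# the interval for Age:StatinUser contains zero, consistent with the p-value of 0.317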