Big Data Summer Institute 2022
June 22, 2022
Soumik Purkayastha
[https://soumikp.github.io]
Simple linear regression (SLR) is a technique for modeling the relationship between two numerical variables \(x\) and \(y\).
SLR can be visualized using a scatterplot in the xy-plane.
By estimating a straight line that ‘best fits’ the data on a scatterplot, we obtain a linear model that can be used not only for prediction but also for inference (with some additional assumptions).
The Prevention of REnal and Vascular END-stage Disease (PREVEND) study measured various clinical data for participants.
Cognitive function was assessed with the Ruff Figural Fluency Test (RFFT). This will be our response of interest.
Scores range from 0 to 175; higher scores indicate better cognitive function.
We will work with a random sample of 500 participants.
Various demographic and cardiovascular risk factors that are potentially associated with RFFT score were also collected for each participant.
For a line to be considered a reasonable approximation to the relationship in a scatterplot, and for the methods of inference (discussed later) to apply, the following conditions should hold: the relationship between \(x\) and \(y\) is approximately linear, the observations are independent, the variability of the points about the line is roughly constant, and (for inference) the residuals are approximately normally distributed.
Our overarching goal is to fit a ‘good’ line to this data cloud. Suppose we knew the ‘best’ line that fits these data. Then, for any given value of the predictor \(x\), we would obtain a predicted value \(\hat{y}.\) In particular, for each observed data point \((x_i, y_i)\), with \(y_i\) being the observed response, we get a predicted response \(\hat{y}_i.\)
The least squares regression line is the line which minimizes the sum of the squared residuals for all the points in the plot.
In other words, the least squares line is the line with coefficients \(b_0\) and \(b_1\) such that the quantity \[e_1^2 + e_2^2 + \ldots + e_n^2,\] where \(e_i = y_i - \hat{y}_i\) is the residual for the \(i\)-th observation, is the smallest possible value it can take.
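As a small illustration of this idea (a sketch with simulated data, not the PREVEND sample), the sum of squared residuals at the lm() coefficients is no larger than at any nearby line:

set.seed(1)                                  # simulated data for illustration only
dat <- data.frame(x = 1:50)
dat$y <- 3 + 2 * dat$x + rnorm(50, sd = 5)
fit <- lm(y ~ x, data = dat)                 # least squares estimates b0 and b1
b <- coef(fit)
sse <- function(b0, b1) sum((dat$y - (b0 + b1 * dat$x))^2)
sse(b[1], b[2])                              # SSE at the least squares line
sse(b[1] + 1, b[2])                          # shifting the intercept increases the SSE
sse(b[1], b[2] + 0.1)                        # so does changing the slope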
For a general population of ordered pairs \((x, y)\), the population regression model is \[Y = \beta_0 + \beta_1X + \epsilon,\] where \(\epsilon \sim N(0, \sigma)\) is the error term.
Since \(\mathrm{E}(\epsilon) = 0\), the population regression model may be rewritten in terms of conditional behaviour of \(Y\) given \(X = x\), i.e., \[\mathrm{E}(Y|X=x) = \beta_0 + \beta_1x.\]
The terms \(\beta_0\) and \(\beta_1\) are population parameters, with \(b_0\) and \(b_1\) serving as their estimates obtained from the sample.
We will use R to (a) obtain the estimates \(b_0\) and \(b_1\), and (b) check the validity of the assumptions of linear regression.
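A minimal sketch of this workflow in R (assuming the PREVEND sample has already been loaded as prevend.samp, as in the lab section below):

m1 <- lm(RFFT ~ Age, data = prevend.samp)   # (a) fit the model: b0 and b1
coef(m1)                                     # estimated intercept and slope
summary(m1)                                  # standard errors, t-tests, R-squared
plot(resid(m1) ~ fitted(m1))                 # (b) quick residual plot for assumption checks
abline(h = 0, lty = 2)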
lm() for categorical predictors with two levels
Although the response variable in linear regression is necessarily numerical, the predictor may be either numerical or categorical. Simple linear regression only allows for categorical predictor variables with two levels. Examining categorical predictors with more than two levels requires multiple linear regression.
Fitting a simple linear regression model with a two-level categorical predictor is analogous to comparing the means of two groups, where the groups are defined by the categorical variable.
Here, we examine if there are any gender-based differences in RFFT scores in the PREVEND dataset. First, we compare the gender-stratified distribution of RFFT scores by means of violin plots.
We compare the group means as follows
prevend.samp %>%
rowwise() %>%
mutate(Gender = factor(ifelse(Gender == 1, "Male", "Female"))) %>%
group_by(Gender) %>% ## stratifying by gender
summarise(m = mean(RFFT), ## strata-specific means
q1 = quantile(RFFT, 0.25), ## strata-specific quantiles
q2 = quantile(RFFT, 0.5),
q3 = quantile(RFFT, 0.75))

## # A tibble: 2 × 5
## Gender m q1 q2 q3
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Male 69.2 48.5 66 87
## 2 Female 67.7 44 68 88
and fit a linear model of RFFT by gender
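The output below can be produced with a call of this form (a sketch; the recoded data frame is assumed to be stored as data, with Male set as the baseline level):

data <- prevend.samp %>%
  mutate(Gender = factor(ifelse(Gender == 1, "Male", "Female"),
                         levels = c("Male", "Female")))   # Male as baseline
lm(RFFT ~ Gender, data = data)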
##
## Call:
## lm(formula = RFFT ~ Gender, data = data)
##
## Coefficients:
## (Intercept) GenderFemale
## 69.238 -1.578
Note that the estimated intercept is the group mean for one category (called the baseline) - here it is the Gender = Male group.
The estimated slope is the difference between the two group means (Female minus Male); here \(67.7 - 69.2 \approx -1.6\), which matches the GenderFemale coefficient of \(-1.578\) up to rounding.
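A quick numerical check of this interpretation (assuming the recoded data frame data from the sketch above):

means <- tapply(data$RFFT, data$Gender, mean)   # gender-specific mean RFFT scores
means["Male"]                                   # matches the intercept (about 69.2)
means["Female"] - means["Male"]                 # matches the GenderFemale coefficient (about -1.58)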
The correlation coefficient \(r\) measures the strength of the linear relationship between two variables.
It is more common to use \(R^2\) to measure the strength of a linear fit.
\(R^2\) describes the amount of variation in the response that is explained by the least squares line.
\[R^2 = \frac{\text{variance of predicted y-values }}{\text{variance of observed y-values}} = \frac{\mathrm{V}(\hat{Y})}{\mathrm{V}({Y})}.\]
If a linear model perfectly captured the variability in the observed data, then \(\mathrm{V}(\hat{Y}) = \mathrm{V}(Y)\) and \(R^2\) would be 1.
\(R^2\) can also be calculated as follows \[\begin{align*} R^2 &= \frac{\text{variance of observed y-values} - \text{variance of residuals}}{\text{variance of observed y-values}}\\ &= \frac{\mathrm{V}({Y}) - \mathrm{V}({e})}{\mathrm{V}({Y})}. \end{align*}\]
The variability of the residuals about the fitted line represents the remaining variability after the model is fit. In other words, \(\mathrm{V}({e})\) is the variability unexplained by the model.
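Both formulas can be checked directly in R; a sketch using the regression of RFFT score on age from the lab below (m1 is an assumed name for the fitted model):

m1 <- lm(RFFT ~ Age, data = prevend.samp)
var(fitted(m1)) / var(prevend.samp$RFFT)                              # V(Y-hat) / V(Y)
(var(prevend.samp$RFFT) - var(resid(m1))) / var(prevend.samp$RFFT)    # (V(Y) - V(e)) / V(Y)
summary(m1)$r.squared                                                 # both agree with the reported R-squared (about 0.285)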
Inference in a regression context is usually about the slope parameter \(\beta_1\). The null hypothesis is most commonly a hypothesis of ‘no association’, which may be formulated mathematically as \[H_0: \beta_1 = 0 \text{ [denoting X and Y are NOT associated]}.\] The alternative hypothesis is given by \[H_A: \beta_1 \neq 0 \text{ [denoting X and Y are associated]}.\] We use the estimate \(b_1\) to construct the statistic for testing \(H_0\). The test statistic is given by \[t = \frac{b_1 - \beta_1^0}{\text{s.e.}(b_1)} = \frac{b_1}{\text{s.e.}(b_1)},\] where \(\beta_1^0\) is the value of \(\beta_1\) specified under the null hypothesis (here, zero).
The \(t-\)statistic defined above follows a \(t-\)distribution with degrees of freedom \(n − 2\), where \(n\) is the number of ordered pairs \((x_i, y_i)\) in the dataset.
In order to test whether \(H_0\) holds, we may also construct the \(95\%\) confidence interval associated with \(\beta_1\) and investigate whether the resultant interval contains zero. The \(95\%\) C.I. is given by \[b_1 \pm \left(t^* \times \text{s.e.}(b_1)\right),\] where \(t^*\) is the 97.5-th percentile of a \(t\)-distribution with \(n-2\) degrees of freedom (computed in R as qt(p = 0.975, df = n-2)).
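A sketch of this calculation for the age model from the lab, alongside the built-in confint() helper (m1 is again an assumed model name):

m1 <- lm(RFFT ~ Age, data = prevend.samp)
b1 <- coef(summary(m1))["Age", "Estimate"]       # slope estimate
se <- coef(summary(m1))["Age", "Std. Error"]     # its standard error
tstar <- qt(p = 0.975, df = nrow(prevend.samp) - 2)
b1 + c(-1, 1) * tstar * se                       # 95% CI computed by hand
confint(m1, level = 0.95)                        # built-in equivalent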
R for simple linear regression
Notes for review may be found here.
lm() to analyse data from the PREVEND study
This lab uses data from the Prevention of REnal and Vascular END-stage Disease (PREVEND) study, which took place between 2003 and 2006 in the Netherlands. Clinical and demographic data for a random sample of 500 individuals are stored in the prevend.samp dataset in the oibiostat package.
The lectures have brief active learning components in which you will review how to use R to analyse some real-life datasets.
In order to do so effectively, please make sure you run the following
code chunk
and then run
nhanes.samp.adult.500 <- read_csv("https://raw.githubusercontent.com/soumikp/bdsi_2022/main/data/nhanes.samp.adult.500.csv")
prevend.samp <- read_csv("https://raw.githubusercontent.com/soumikp/bdsi_2022/main/data/prevend.samp.csv")

If your installation was not successful, you’ll get an error message.
As adults age, cognitive function declines over time; this is largely due to various cerebrovascular and neurodegenerative changes. The Ruff Figural Fluency Test (RFFT) is one measure of cognitive function that provides information about cognitive abilities such as planning and the ability to switch between different tasks. Scores on the RFFT range from 0 to 175 points, where higher scores are indicative of better cognitive function.
The goal of this lab is to begin exploring the relationship between
age and RFFT score in the prevend.samp dataset.
Create a scatterplot of RFFT score versus age in prevend.samp.
prevend.samp %>%
rowwise() %>%
mutate(Gender = factor(ifelse(Gender == 1, "Male", "Female"))) %>%
ggplot(aes(x = Age, y = RFFT)) +
geom_point(aes(color = Gender, shape = Gender)) + ## bonus
theme_bw() +
ylab("RFFT score") +
xlab("Age (in years)") +
labs(title = "Scatterplot of RFFT scores and age (in years) for PREVEND data (n = 500).") +
theme(legend.position = "bottom") +
scale_color_aaas()

prevend.samp %>%
ggplot(aes(x = Age, y = RFFT)) +
geom_point(color = pal_aaas("default", alpha = 1)(9)[4]) +
geom_abline(slope = 2, intercept = -20,
linetype = "dashed", color = pal_aaas("default", alpha = 1)(9)[6], size = 1) +
theme_bw() +
ylab("RFFT score") +
xlab("Age (in years)") +
labs(title = "Scatterplot of RFFT scores and age (in years) for PREVEND data (n = 500).") +
theme(legend.position = "bottom")

#enter line coefficients
b0 = -20
b1 = 2
#calculate sse
y = prevend.samp$RFFT
x = prevend.samp$Age
sse = sum((y - (b0 + b1*x))^2)
sse

## [1] 1206875
Since we do not expect this arbitrarily chosen line (intercept \(-20\), slope \(2\)) to fit the data well, the SSE should be relatively high, indicating a poor model fit and a large amount of error (from large residuals) associated with the model. The SSE is 1,206,875.
Create a scatterplot of RFFT score versus age in prevend.samp, then add a line of best fit.
prevend.samp %>%
ggplot(aes(x = Age, y = RFFT)) +
geom_point(color = pal_aaas("default", alpha = 1)(9)[4]) +
theme_bw() +
stat_smooth(method = lm, se = TRUE,
color = pal_aaas("default", alpha = 1)(9)[6], linetype = "dashed", size = 1) +
ylab("RFFT score") +
xlab("Age (in years)") +
labs(title = "Scatterplot of RFFT scores and age (in years) for PREVEND data (n = 500) with line of best fit (grey band gives 95% CI)") +
theme(legend.position = "bottom")

## `geom_smooth()` using formula 'y ~ x'
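The regression output below can be reproduced with a call of this form (the object name m1 is only a placeholder):

m1 <- lm(RFFT ~ Age, data = prevend.samp)   # regress RFFT score on age
summary(m1)                                  # coefficient table, R-squared and F-statistic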
##
## Call:
## lm(formula = RFFT ~ Age, data = prevend.samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.879 -16.845 -1.095 15.524 58.564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.54972 5.01614 27.42 <2e-16 ***
## Age -1.26136 0.08953 -14.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.19 on 498 degrees of freedom
## Multiple R-squared: 0.285, Adjusted R-squared: 0.2836
## F-statistic: 198.5 on 1 and 498 DF, p-value: < 2.2e-16
In most practical settings, more than one explanatory variable is likely to be associated with a response.
Multiple linear regression is used to estimate the linear relationship between a response variable \(Y\) and several predictors \(x_1, x_2,\ldots, x_p\), where \(p\) is the number of predictors.
We extend the statistical model from one predictor to \(p\) predictors as follows \[\mathrm{E}(Y|x_1, x_2, \ldots, x_p) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p.\]
There are several applications of multiple regression. We will focus on two areas:
We will again make use of the prevend.samp dataset in this module, but this time we’ll consider a more complicated example concerning the effect of statin use on cognitive function (through RFFT scores).
Statins are a class of drugs widely used to lower cholesterol. If followed, recent guidelines for prescribing statins would lead to statin use in almost half of Americans between 40 and 75 years of age and nearly all men over 60.
A few small studies have suggested that statins may be associated with lower cognitive ability.
The PREVEND study collected data on statin use as well as other demographic factors.
The statistical model for multiple regression is based on \[\mathrm{E}(Y|x_1, x_2, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p,\] where \(p\) is the number of predictors.
The coefficient \(\beta_j\) of a predictor \(x_j\) is estimated by \(b_j\), say. It is the predicted mean change in the response corresponding to a one unit change in \(x_j\), when the values of all other predictors remain constant.
The practical interpretation is that a coefficient in multiple regression estimates the association between a response and that predictor, after adjusting for the other predictors in the model.
The assumptions of multiple regression are similar to those of simple linear regression.
It is not possible to make a scatterplot of a response against several simultaneous predictors.
More information is always nice: as variables are added, \(R^2\) always increases.
But at what cost? As variables are added, model complexity also increases.
We consider using adjusted \(R^2\) (\(R^2_{\text{adj}}\)) as a tool for model assessment.
\[R^2_{\text{adj}} = 1 - (1 - R^2)\times \frac{n-1}{n-1-p}.\]
It is often used to balance predictive ability with model complexity. \(R^2_{\text{adj}}\) incorporates a penalty for including predictors that do not contribute much towards explaining observed variation in the response variable.
Unlike \(R^2\), \(R^2_{\text{adj}}\) does not have an inherent interpretation.
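A sketch of this formula in R, using the RFFT model with statin use and age that appears in the lab below (variable names are assumptions):

fit <- lm(RFFT ~ Statin + Age, data = prevend.samp)
n <- nrow(prevend.samp)                      # number of observations
p <- 2                                       # number of predictors
r2 <- summary(fit)$r.squared
1 - (1 - r2) * (n - 1) / (n - 1 - p)         # adjusted R-squared by hand
summary(fit)$adj.r.squared                   # matches the built-in value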
Typically, the hypotheses of interest are \(H^k_{0}: \beta_k = 0\) (\(X_k\) and \(Y\) are NOT associated) with alternative \(H^k_A: \beta_k \neq 0\) (\(X_k\) and \(Y\) are associated).
The test statistic is again a t-statistic under the null, with degrees of freedom \(= n - p - 1\), given by \[t_k = \frac{b_k - \beta_k^0}{\text{s.e.}(b_k)},\] with a \(95\%\) confidence interval for \(\beta_k\) given by \[b_k \pm \left(t^* \times \text{s.e.}(b_k) \right),\] where \(t^*\) is the \(97.5-\)th percentile for a \(t-\)distribution with \(n-p-1\) degrees of freedom.
The \(F-\)statistic is used in an overall test of the model to assess whether the predictors in the model, considered as a group, are associated with the response.
\[H_0 :\beta_1 = \beta_2 =\ldots=\beta_p = 0 \text{ vs } H_A: H_0 \text{ is false,}\] where \(H_0\) is false if and only if at least one of the slope coefficients \(\beta_k\) is not 0.
There is sufficient evidence to reject \(H_0\) if the \(p-\)value of the \(F-\)statistic is smaller than or equal to \(\alpha.\)
Again, R does the hard work for us: the F-statistic and its associated p-value are displayed in the output.
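For example, the overall F-statistic and its p-value can be pulled out of the summary object (a sketch, reusing the statin-and-age model from above):

fit <- lm(RFFT ~ Statin + Age, data = prevend.samp)
f <- summary(fit)$fstatistic                                  # F value and its two degrees of freedom
f
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)    # p-value for the overall test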
The multiple regression model assumes that when one of the predictors \(x_j\) is changed by 1 unit and the values of the other variables remain constant, the predicted response changes by \(\beta_j\), regardless of the values of the other variables.
A statistical interaction occurs when this assumption is not true, such that the effect of one explanatory variable \(x_j\) on the response depends on the particular value(s) of one or more other explanatory variables.
As an example, we specifically examine interaction in a two-variable setting, where one of the predictors is categorical and the other is numerical.
Consider modeling total cholesterol level (mmol/L) as a function of age (in years) and diabetes status in a dataset of size \(n = 473\).
We fit the linear model \[\mathrm{E}(\text{Total cholesterol|Age, Diabetes status}) = \beta_0 + \beta_1 \text{Age} + \beta_2 \mathrm{I}(\text{has diabetes}).\] This model yields two fitted lines, one for individuals with diabetes and one for individuals without diabetes - we overlay these fitted lines on the scatterplot.
Next, we consider two separate models for the relationship between total cholesterol and age; one in diabetic individuals and one in non-diabetic individuals.
\[\mathrm{E}(\text{Total cholesterol|Age, Diabetes status}) = \beta_0 + \beta_1 \text{ Age } + \beta_2 \ \mathrm{I}(\text{has diabetes}) + \beta_3 \text{ Age } \times \ \mathrm{I}(\text{has diabetes}).\] The estimated model coefficients are given by
## (Intercept) Age DiabetesYes Age:DiabetesYes
## 4.695702513 0.009638183 1.718704342 -0.033451562
\[\widehat{TotChol} = 4.70 + 0.0096 \times \text{ Age} + 1.72 \times \ \mathrm{I}(\text{has diabetes}) - 0.033 \times \text{ Age }\times \mathrm{I}(\text{has diabetes}).\]
Hence the fitted model equation for diabetics is given by \[\widehat{TotChol} = 6.42 - 0.023 \times \text{Age},\] obtained by adding the diabetes terms to the intercept and slope (\(4.70 + 1.72 = 6.42\) and \(0.0096 - 0.033 \approx -0.023\)), and that for non-diabetics is given by \[\widehat{TotChol} = 4.70 + 0.0096 \times \text{Age}.\]
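A sketch of how the two fitted lines follow from the coefficient vector printed above (the coefficient values are taken from that output; no new data are used):

b <- c(4.695702513, 0.009638183, 1.718704342, -0.033451562)
names(b) <- c("(Intercept)", "Age", "DiabetesYes", "Age:DiabetesYes")
b["(Intercept)"]                       # intercept for non-diabetics
b["Age"]                               # slope for non-diabetics
b["(Intercept)"] + b["DiabetesYes"]    # intercept for diabetics (about 6.42)
b["Age"] + b["Age:DiabetesYes"]        # slope for diabetics (Age plus interaction coefficient)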
R for multiple linear regression
Notes for review may be found here.
Having fit a multiple regression model predicting RFFT score from statin use and age, we will check the assumptions for multiple linear regression.
library(oibiostat)
data(prevend.samp)
prevend.samp <- prevend.samp %>%
rowwise() %>%
mutate(Statin = factor(ifelse(Statin == 1, "Yes", "No"))) %>%
select(c(RFFT, Age, Statin))
#fit a multiple regression model
model1 = lm(RFFT ~ Statin + Age, data = prevend.samp)

Assess linearity with respect to age using a scatterplot with residual values on the \(y\)-axis and values of age on the \(x\)-axis. Is it necessary to assess linearity with respect to statin use?
plot(resid(model1) ~ prevend.samp$Age,
main = "Residuals vs Age in PREVEND (n = 500)",
xlab = "Age (years)", ylab = "Residual",
pch = 21, col = "cornflowerblue", bg = "slategray2",
cex = 0.60)
abline(h = 0, col = "red", lty = 2)
There are no apparent trends; the data
scatter evenly above and below the horizontal line. There does not seem
to be remaining nonlinearity with respect to age after the model is
fit.
It is not necessary to assess linearity with respect to statin use since statin use is measured as a categorical variable. A line drawn through two points (that is, the mean of the two groups defined by a binary variable) is necessarily linear.
Assess whether the residuals have approximately constant variance.
#assess constant variance of residuals
plot(resid(model1) ~ fitted(model1),
main = "Resid. vs Predicted RFFT in PREVEND (n = 500)",
xlab = "Predicted RFFT Score", ylab = "Residual",
pch = 21, col = "cornflowerblue", bg = "slategray2",
cex = 0.60)
abline(h = 0, col = "red", lty = 2)
The variance of the residuals is somewhat
smaller for lower predicted values of RFFT score, but this may simply be
an artifact from observing few individuals with relatively low predicted
scores. It seems reasonable to assume approximately constant
variance.
Assess whether the residuals are approximately normally distributed.
#assess normality of residuals
qqnorm(resid(model1),
pch = 21, col = "cornflowerblue", bg = "slategray2", cex = 0.75,
main = "Q-Q Plot of Residuals")
qqline(resid(model1), col = "red", lwd = 2)
The residuals are reasonably normally
distributed, with only slight departures from normality in the
tails.
How well does the model explain the variability in observed RFFT score?
Calculate \(R^2_{\text{adj}}\) for the multiple regression model predicting RFFT score from statin use and age.
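One way to do this, assuming model1 from the earlier chunk is still available:

summary(model1)$r.squared                    # multiple R-squared
summary(model1)$adj.r.squared                # adjusted R-squared
# or by hand, with n = 500 observations and p = 2 predictors:
1 - (1 - summary(model1)$r.squared) * (500 - 1) / (500 - 1 - 2)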
The following set of questions steps through taking a closer look at the association of RFFT score with age and statin use in prevend.samp, a sample of \(n = 500\) individuals from the PREVEND data.
Run the code in the template to fit a model for predicting RFFT score from age, statin use, and the interaction term between age and statin use.
#load the data
library(oibiostat)
data("prevend.samp")
#convert Statin to a factor
prevend.samp$Statin = factor(prevend.samp$Statin, levels = c(0, 1),
labels = c("NonUser", "User"))
#fit the model
model.RFFT.interact = lm(RFFT ~ Age*Statin, data = prevend.samp)
coef(model.RFFT.interact)

## (Intercept) Age StatinUser Age:StatinUser
## 140.2031114 -1.3149119 -13.9720216 0.2474466
##
## Call:
## lm(formula = RFFT ~ Age * Statin, data = prevend.samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.551 -16.963 -1.179 15.764 58.802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140.2031 5.6209 24.943 <2e-16 ***
## Age -1.3149 0.1040 -12.646 <2e-16 ***
## StatinUser -13.9720 15.0113 -0.931 0.352
## Age:StatinUser 0.2474 0.2468 1.003 0.317
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.21 on 496 degrees of freedom
## Multiple R-squared: 0.2866, Adjusted R-squared: 0.2823
## F-statistic: 66.42 on 3 and 496 DF, p-value: < 2.2e-16
There is no evidence of a statistically significant interaction between age and statin use; the \(p\)-value associated with the interaction term is 0.317.
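The same conclusion can be reached from a 95% confidence interval for the interaction coefficient, using the model object fit above:

confint(model.RFFT.interact, level = 0.95)
# the interval for Age:StatinUser contains zero, consistent with the p-value of 0.317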