Content you should have understood before watching this video:
- Number 2, ‘Variables’
- Number 3, ‘Variation in data’
- Number 4, ‘Basic statistical metrics’
- Number 5, ‘Standard deviation and standard error’
- Number 6, ‘Populations, samples, hypotheses’
- Number 7, ‘Distributions’
- Number 8, ‘Quantiles and probabilities’
- Number 12, ‘Error types’
- Number 15, ‘The t-test’
- Number 16, ‘Correlation Analysis’
Regression Analysis
The origin of the word
The original publication that coined the term ‘regression’ dealt with the body sizes of parents and their children.
Regression Analysis
- Measures the dependency of one variable on another
- A linear regression is a predictive model: e.g. it may allow us to predict diabetes risk from the body mass index of a person
- We assume causality, i.e. one variable drives the other one - unlike in correlation analysis!
- The two coefficients of a simple linear regression are the slope and the intercept
- In a linear regression, we test for the significance of the slope (only rarely for the significance of the intercept)
Regression Analysis
\(\hat{y} = ax + b\)
\(\hat{y}\) are the estimated \(y\) values; the residuals \({\color{red}{d}}\) are the differences between estimated and observed values:
\(\hat{y} - y = {\color{red}{d}}\)
Assuming that the line should go through the means of x and y, our task is to find \(a\) that minimises the sum of the squared residuals \({\color{red}{d}}\)
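This minimisation has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from forcing the line through the two means. A minimal sketch in R, using made-up example vectors x and y (not the data used in the later slides):
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)                      #hypothetical example data
y = c(1.2, 1.9, 3.3, 3.8, 5.4, 5.9, 7.1, 8.3, 8.8, 10.2)
a = cov(x, y) / var(x)       #slope that minimises the sum of squared residuals
b = mean(y) - a * mean(x)    #intercept: the line passes through (mean(x), mean(y))
a; b
coef(lm(y ~ x))              #lm() returns the same two coefficients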
Regression Analysis
Assumptions
- \(x\) is measured without error (for the type of regression presented here)
- The residuals are normally distributed
- Constant variance in y (no heterogeneity of variance, or ‘heteroscedasticity’)
We know how to test for normality, but we’ll have to learn how to assess heteroscedasticity
Regression in R
In R, we only need to type:
m1 = lm(y ~ x)
m1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)           x
    0.04133     1.02517
and we are done!
Regression in R
How do we interpret these numbers?
- The intercept is what we called \(b\) and \(a\) is the slope
- If we want to predict a \(y\)-value for a given \(x\), we need to put the intercept and the slope into the formula:
\(y\)-value = \(x\) * slope + intercept
The slope (often called a) and the intercept (often called b) are the coefficients
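For the toy model m1 fitted above, the same calculation can be done with the stored coefficients; the x-value 2.5 is just an illustrative choice:
coef(m1)                          #intercept (b) and slope (a) of model m1
coef(m1)[1] + coef(m1)[2] * 2.5   #predicted y-value at x = 2.5: intercept + slope * x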
Regression in R
Plotting this is easy too, just as in correlation analysis:
plot(y ~ x)
abline(m1) #plots the regression line, based on model 'm1'
But it is important to know what is going on behind the curtain: R automatically finds the best fit using the least-squares method (minimising the sum of squared residuals by pivoting a line through the mean of x and y).
Regression in R
Using summary(m1) (‘m1’ being your model output) additionally tests the intercept and slope for significance:
- Intercept: a t-statistic tests the null hypothesis ‘the intercept is zero’. In plain English: when x is zero, y is zero too.
- Slope: a t-statistic tests the null hypothesis ‘the slope is zero’.
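Both tests follow the same recipe: the estimate is divided by its standard error to give a t value, which is compared against a t-distribution with the residual degrees of freedom. A sketch of that arithmetic for m1 (the column names are those of the summary() coefficient table):
est  = summary(m1)$coefficients[, "Estimate"]
se   = summary(m1)$coefficients[, "Std. Error"]
tval = est / se                            #t value = estimate / standard error
2 * pt(-abs(tval), df = df.residual(m1))   #two-sided p-values, as reported by summary()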
Regression in R
summary(m1) #m1 as defined in previous slide
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-2.5034 -0.4339  0.2452  0.7240  1.6150

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.04133    0.85395   0.048    0.963
x            1.02517    0.13763   7.449 7.27e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.25 on 8 degrees of freedom
Multiple R-squared: 0.874, Adjusted R-squared: 0.8582
F-statistic: 55.49 on 1 and 8 DF, p-value: 7.272e-05
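The individual numbers in this output can also be extracted programmatically, which is handy for reporting; a brief sketch based on m1:
s1 = summary(m1)
s1$coefficients      #estimates, standard errors, t values and p-values
s1$r.squared         #multiple R-squared
s1$adj.r.squared     #adjusted R-squared
s1$sigma             #residual standard error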
Regression Analysis
Example where the significance of the intercept matters
Photospectrometry: at concentration zero, is the absorption significantly different from zero?
Regression in R
R-squared value: the proportion of explained variance relative to the total variance; it equals the square of the correlation coefficient \(R\).
\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{(n-1)s_xs_y}\)
Remember?
\(R^2\) is the proportion of variance in \(y\) explained by \(x\)
How can we visualise this?
Regression in R
\(R^2 = 1 - \frac{{\color{blue}{SS_{residual}}}}{{\color{red}{SS_{total}}}} = 1 - \frac{{\color{blue}{\sum{(y_i - \hat{y}_i)^2}}}}{{\color{red}{\sum{(y_i - \bar{y})^2}}}}\)
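This decomposition is easy to verify in R; the sketch below recomputes \(R^2\) from the residual and total sums of squares, assuming x and y are the vectors used to fit m1 = lm(y ~ x) above, and compares it with the squared correlation coefficient:
ss_res = sum(resid(m1)^2)       #residual sum of squares, sum((y - fitted(m1))^2)
ss_tot = sum((y - mean(y))^2)   #total sum of squares around the mean of y
1 - ss_res / ss_tot             #R-squared
cor(x, y)^2                     #identical to the squared correlation coefficient
summary(m1)$r.squared           #and to the value reported by summary()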
Visual representation of \(R^2\)
Regression in R
Example using the ‘swiss’ data set
Does the percentage of households involved in agriculture (‘agriculture’) influence the fertility index (‘fertility’)? Can we establish a predictive model that forecasts ‘fertility’ based on ‘agriculture’?
First, always plot the data.
Indeed, the higher ‘agriculture’, the higher the variable ‘fertility’:
#using the inbuilt 'swiss' data set
plot(swiss$Fertility ~ swiss$Agriculture)
Regression in R
Let us run and interpret a linear regression on the data:
m1 = lm(Fertility ~ Agriculture, data = swiss)
summary(m1)
Call:
lm(formula = Fertility ~ Agriculture, data = swiss)
Residuals:
     Min       1Q   Median       3Q      Max
-25.5374  -7.8685  -0.6362   9.0464  24.4858

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.30438    4.25126  14.185   <2e-16 ***
Agriculture  0.19420    0.07671   2.532   0.0149 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.82 on 45 degrees of freedom
Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
Regression in R
Example using the ‘swiss’ data set
- How do we interpret the output?
- What assumptions do we need to verify again…?
To check for normality of the residuals, we look at a qq-plot
To check the assumption of homogeneity of variance, we plot the residuals against the fitted values (both checks are sketched below)
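Both checks can also be done by hand on the residuals of the swiss model m1, as sketched here; the next slides use R’s built-in diagnostic plots for the same purpose:
r = resid(m1)            #residuals of m1 = lm(Fertility ~ Agriculture, data = swiss)
qqnorm(r); qqline(r)     #normality check: points should follow the straight line
plot(fitted(m1), r)      #homoscedasticity check: we want no pattern (no trumpet shape)
abline(h = 0, lty = 2)   #dashed reference line at zero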
Testing the model assumptions - homoscedasticity
par(mfrow = c(2, 2)) #arrange the four diagnostic plots in a 2 x 2 grid
plot(m1) #note this gives you 4 plots; we will consider plots 1, 2, and 4 only
This plot checks for homogeneity of variance of the residuals; we do not want to see a pattern, e.g. larger residuals at higher fitted values (a trumpet shape).
Testing the model assumptions - normality
In this plot we check the normality of the residuals with a qq-plot, which plots the quantiles of your residual distribution against the quantiles of a standard normal distribution.
Testing the model assumptions - outliers
This plot (residuals vs. leverage) flags very influential data points: points outside the 0.5 Cook’s distance line, or worse, outside the 1.0 line, have a strong influence on the regression line.
Regression in R: how to report the data
The higher the percentage of families involved in agriculture, the larger the families (linear regression, R-squared = 0.12, p < 0.05).
Regression in R: use the model to predict values
Intercept: 60.3
Slope: 0.19
What fertility would we predict if the percentage of people working in agriculture is 50%?
\(Fertility = 0.19 \cdot 50 + 60.3 = 69.8\)
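The same prediction can be obtained directly from the fitted model with predict(); the small difference from the hand calculation is due to rounding the coefficients above:
predict(m1, newdata = data.frame(Agriculture = 50))      #predicted Fertility at Agriculture = 50
coef(m1)["(Intercept)"] + coef(m1)["Agriculture"] * 50   #the same, by hand with full precision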
Learning tool: Linear regression app
Model diagnostics (testing model assumptions)
https://gallery.shinyapps.io/slr_diag/
Regression
https://gallery.shinyapps.io/simple_regression/
In a nutshell
- Regression might feel similar to correlation, but it’s actually quite different, because y depends on x
- The slope, the intercept, their p-values, and the R-squared value are the key quantities
- Regression can be used to predict, or to test for significance (intercept, slope)