Content you should have understood before watching this video:
- Number 2, ‘Variables’
- Number 3, ‘Variation in data’
- Number 4, ‘Basic statistical metrics’
- Number 5, ‘Standard deviation and standard error’
- Number 6, ‘Populations, samples, hypotheses’
- Number 7, ‘Distributions’
- Number 8, ‘Quantiles and probabilities’
- Number 12, ‘Error types’
- Number 15, ‘The t-test’
- Number 16, ‘Correlation Analysis’
Regression Analysis
The origin of the word
The original publication that coined the term ‘regression’ dealt with the body sizes of parents and their children.
Regression Analysis
- Measures the dependency of one variable on another
- A linear regression is a predictive model: e.g. it may allow us to predict diabetes risk from the body mass index of a person
- We assume causality, i.e. one variable drives the other one - unlike in correlation analysis!
- The two coefficients of a simple linear regression are the slope and the intercept
- In a linear regression, we test for the significance of the slope (only rarely for the significance of the intercept)
Regression Analysis
\(\hat{y} = ax + b\)
\(\hat{y}\) are the estimated \(y\) values; the residuals \({\color{red}{d}}\) are the differences between estimated and observed values:
\(\hat{y} - y = {\color{red}{d}}\)
Assuming that the line should go through the means of x and y, our task is to find \(a\) that minimises the sum of the squared residuals \({\color{red}{d}}\)
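This minimisation has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from forcing the line through the two means. A minimal sketch in R, using made-up example vectors x and y (not the data used in the later slides):
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)                      #hypothetical example data
y = c(1.2, 1.9, 3.3, 3.8, 5.4, 5.9, 7.1, 8.3, 8.8, 10.2)
a = cov(x, y) / var(x)       #slope that minimises the sum of squared residuals
b = mean(y) - a * mean(x)    #intercept: the line passes through (mean(x), mean(y))
a; b
coef(lm(y ~ x))              #lm() returns the same two coefficients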
Regression Analysis
Assumptions
- \(x\) is measured without error (for the type of regression presented here)
- The residuals are normally distributed
- Constant variance in y (no heterogeneity of variance, or ‘heteroscedasticity’)
We know how to test for normality, but we’ll have to learn how to assess heteroscedasticity
Regression in R
In R, we only need to type:
m1 = lm(y ~ x)
m1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)           x
    0.04133     1.02517
and we are done!
Regression in R
How do we interpret these numbers?
- The intercept is what we called \(b\) and \(a\) is the slope
- If we want to predict a \(y\)-value for a given \(x\), we need to put the intercept and the slope into the formula:
\(y\)-value = \(x\) * slope + intercept
The slope (often called a) and the intercept (often called b) are the coefficients
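For the toy model m1 fitted above, the same calculation can be done with the stored coefficients; the x-value 2.5 is just an illustrative choice:
coef(m1)                          #intercept (b) and slope (a) of model m1
coef(m1)[1] + coef(m1)[2] * 2.5   #predicted y-value at x = 2.5: intercept + slope * x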
Regression in R
Plotting this is easy too, just as in correlation analysis:
plot(y ~ x)
abline(m1) #plots the regression line, based on model 'm1'
But it is important to know what is going on behind the curtain: R automatically finds the best fit using the least-squares method (minimising the sum of squared residuals by pivoting a line through the mean of x and y).
Regression in R
Using summary(m1) (‘m1’ being your model output) additionally tests the intercept and slope for significance:
- Intercept: a t-statistic tests the null hypothesis ‘the intercept is zero’. In plain English: when x is zero, y is zero too.
- Slope: a t-statistic tests the null hypothesis ‘the slope is zero’.
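Both tests follow the same recipe: the estimate is divided by its standard error to give a t value, which is compared against a t-distribution with the residual degrees of freedom. A sketch of that arithmetic for m1 (the column names are those of the summary() coefficient table):
est  = summary(m1)$coefficients[, "Estimate"]
se   = summary(m1)$coefficients[, "Std. Error"]
tval = est / se                            #t value = estimate / standard error
2 * pt(-abs(tval), df = df.residual(m1))   #two-sided p-values, as reported by summary()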
Regression in R
summary(m1) #m1 as defined in previous slide
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-2.5034 -0.4339  0.2452  0.7240  1.6150

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.04133    0.85395   0.048    0.963
x            1.02517    0.13763   7.449 7.27e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.25 on 8 degrees of freedom
Multiple R-squared: 0.874, Adjusted R-squared: 0.8582
F-statistic: 55.49 on 1 and 8 DF, p-value: 7.272e-05
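The individual numbers in this output can also be extracted programmatically, which is handy for reporting; a brief sketch based on m1:
s1 = summary(m1)
s1$coefficients      #estimates, standard errors, t values and p-values
s1$r.squared         #multiple R-squared
s1$adj.r.squared     #adjusted R-squared
s1$sigma             #residual standard error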
Regression Analysis
Example where the significance of the intercept matters
Photospectrometry: at concentration zero, is the absorption significantly different from zero?
Regression in R
R-squared value: the proportion of explained variance relative to the total variance; it equals the square of the correlation coefficient \(R\).
\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{(n-1)s_xs_y}\)
Remember?
\(R^2\) is the proportion of variance in \(y\) explained by \(x\)
How can we visualise this?
Regression in R
\(R^2 = 1 - \frac{{\color{blue}{SS_{residual}}}}{{\color{red}{SS_{total}}}} = 1 - \frac{{\color{blue}{\sum{(y_i - \hat{y}_i)^2}}}}{{\color{red}{\sum{(y_i - \bar{y})^2}}}}\)
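This decomposition is easy to verify in R; the sketch below recomputes \(R^2\) from the residual and total sums of squares, assuming x and y are the vectors used to fit m1 = lm(y ~ x) above, and compares it with the squared correlation coefficient:
ss_res = sum(resid(m1)^2)       #residual sum of squares, sum((y - fitted(m1))^2)
ss_tot = sum((y - mean(y))^2)   #total sum of squares around the mean of y
1 - ss_res / ss_tot             #R-squared
cor(x, y)^2                     #identical to the squared correlation coefficient
summary(m1)$r.squared           #and to the value reported by summary()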
Visual representation of \(R^2\)
Regression in R
Example using the ‘swiss’ data set
Does the percentage of households involved in agriculture (‘agriculture’) influence the fertility index (‘fertility’)? Can we establish a predictive model that forecasts ‘fertility’ based on ‘agriculture’?
First, always plot the data.
Indeed, the higher ‘agriculture’, the higher the variable ‘fertility’:
#using the inbuilt 'swiss' data set
plot(swiss$Fertility ~ swiss$Agriculture)
Regression in R
Let us run and interpret a linear regression on the data:
m1 = lm(Fertility ~ Agriculture, data = swiss)
summary(m1)
Call:
lm(formula = Fertility ~ Agriculture, data = swiss)
Residuals:
     Min       1Q   Median       3Q      Max
-25.5374  -7.8685  -0.6362   9.0464  24.4858

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.30438    4.25126  14.185   <2e-16 ***
Agriculture  0.19420    0.07671   2.532   0.0149 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.82 on 45 degrees of freedom
Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
Regression in R
Example using the ‘swiss’ data set
- How do we interpret the output?
- What assumptions do we need to verify again…?
To check for normality of the residuals, we look at a qq-plot
To check the assumption of homogeneity of variance, we plot the residuals against the fitted values (both checks are sketched below)
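Both checks can also be done by hand on the residuals of the swiss model m1, as sketched here; the next slides use R’s built-in diagnostic plots for the same purpose:
r = resid(m1)            #residuals of m1 = lm(Fertility ~ Agriculture, data = swiss)
qqnorm(r); qqline(r)     #normality check: points should follow the straight line
plot(fitted(m1), r)      #homoscedasticity check: we want no pattern (no trumpet shape)
abline(h = 0, lty = 2)   #dashed reference line at zero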
Testing the model assumptions - homoscedasticity
par(mfrow = c(2, 2)) #arrange the four diagnostic plots in a 2 x 2 grid
plot(m1) #note this gives you 4 plots; we will consider plots 1, 2, and 4 only
This plot checks for homogeneity of variance of the residuals; we do not want to see a pattern, e.g. larger residuals at higher fitted values (a trumpet shape).
Testing the model assumptions - normality
In this plot we check the normality of the residuals with a qq-plot, which plots the quantiles of your residual distribution against the quantiles of a standard normal distribution.
Testing the model assumptions - outliers
This plot (residuals vs. leverage) flags very influential data points: points outside the 0.5 Cook’s distance line, or worse, outside the 1.0 line, have a strong influence on the regression line.
Regression in R: how to report the data
The higher the percentage of families involved in agriculture, the larger the families (linear regression, R-squared = 0.12, p < 0.05).
Regression in R: use the model to predict values
Intercept: 60.3
Slope: 0.19
What fertility would we predict if the percentage of people working in agriculture is 50%?
\(Fertility = 0.19 \cdot 50 + 60.3 = 69.8\)
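The same prediction can be obtained directly from the fitted model with predict(); the small difference from the hand calculation is due to rounding the coefficients above:
predict(m1, newdata = data.frame(Agriculture = 50))      #predicted Fertility at Agriculture = 50
coef(m1)["(Intercept)"] + coef(m1)["Agriculture"] * 50   #the same, by hand with full precision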
Learning tool: Linear regression app
Model diagnostics (testing model assumptions)
https://gallery.shinyapps.io/slr_diag/
Regression
https://gallery.shinyapps.io/simple_regression/
In a nutshell
- Regression might feel similar to correlation, but it’s actually quite different, because y depends on x
- The slope, the intercept, their p-values, and the R-squared value are the key quantities
- Regression can be used to predict, or to test for significance (intercept, slope)