Department of Environmental Science, AUT

Regression: Prerequisites

Content you should have understood before watching this video:

  • Number 2, ‘Variables’
  • Number 3, ‘Variation in data’
  • Number 4, ‘Basic statistical metrics’
  • Number 5, ‘Standard deviation and standard error’
  • Number 6, ‘Populations, samples, hypotheses’
  • Number 7, ‘Distributions’
  • Number 8, ‘Quantiles and probabilities’
  • Number 12, ‘Error types’
  • Number 15, ‘The t-test’
  • Number 16, ‘Correlation Analysis’

Regression Analysis

The origin of the word

The term ‘regression’ was coined in Francis Galton’s original publication, which dealt with the body sizes of parents and their children.

Regression Analysis

Regression
  • Measures the dependency of one variable on another
  • A linear regression is a predictive model: e.g. it may allow us to predict diabetes risk from the body mass index of a person
  • We assume causality, i.e. one variable drives the other one - unlike in correlation analysis!
  • The two coefficients of a simple linear regression are the slope and the intercept
  • In a linear regression, we test for the significance of the slope (only rarely for the significance of the intercept)

Regression Analysis

\(\hat{y} = ax + b\)

\(\hat{y}\) are the estimated \(y\)-values; the residuals \({\color{red}{d}}\) are the differences between estimated and observed values:

\(\hat{y} - y = {\color{red}{d}}\)

Assuming that the line should go through the mean of x and y, our task is to find \(a\) that minimises the sum of the squared residuals \({\color{red}{d}}\)
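For a line through \((\bar{x}, \bar{y})\), this least-squares problem has a closed-form solution: \(a = \frac{cov(x,y)}{s_x^2}\) and \(b = \bar{y} - a\bar{x}\). A minimal sketch in R, using made-up example data:

x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #hypothetical predictor values
y = c(1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2, 8.9, 10.1) #hypothetical responses

a = cov(x, y) / var(x)    #slope: covariance of x and y over variance of x
b = mean(y) - a * mean(x) #intercept: forces the line through the means
c(intercept = b, slope = a)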

Regression Analysis

Assumptions

Regression
  • \(x\) is measured without error (for the type of regression presented here)
  • The residuals are normally distributed
  • Constant variance in y (no heterogeneity of variance, or ‘heteroscedasticity’)

We know how to test for normality, but we’ll have to learn how to assess heteroscedasticity

Regression in R

Regression

In R, we only need to type:

m1 = lm(y ~ x)
m1
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
    0.04133      1.02517  

and we are done!

Regression in R

Regression

How do we interpret these numbers?

  • The intercept is what we called \(b\), and \(a\) is the slope
  • If we want to predict a \(y\)-value for a given \(x\), we need to put the intercept and the slope into the formula:

\(\hat{y} = \text{slope} \cdot x + \text{intercept}\)

The slope (often called a) and the intercept (often called b) are the coefficients
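In R, we can extract the coefficients from the model object and predict by hand (a sketch, using the model 'm1' from above and an arbitrary \(x\)-value of 3):

coef(m1)        #named vector: (Intercept) and x
b = coef(m1)[1] #intercept
a = coef(m1)[2] #slope
a * 3 + b       #predicted y-value for x = 3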

Regression in R

Plotting this is easy too, just as in correlation analysis:

plot(y ~ x)
abline(m1) #plots the regression line, based on model 'm1'

But it is important to know what is going on behind the curtain: R automatically finds the best fit using the least squares method (minimising the sum of the squared residuals by pivoting a line through the mean of x and y).
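We can check that the fitted line indeed passes through the mean of x and y (a small sanity check, assuming 'm1', 'x' and 'y' from above):

predict(m1, newdata = data.frame(x = mean(x))) #fitted value at mean(x)...
mean(y)                                        #...should equal mean(y)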

Regression in R

Regression

Using summary(m1) (‘m1’ being your fitted model) additionally tests the intercept and slope for significance:

  • intercept null hypothesis, tested with a t-statistic: ‘the intercept is not different from zero’. In plain English: when x is zero, y is zero too.
  • slope null hypothesis, tested with a t-statistic: ‘the slope is not different from zero’.

Regression in R

Regression
summary(m1) #m1 as defined in previous slide
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5034 -0.4339  0.2452  0.7240  1.6150 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.04133    0.85395   0.048    0.963    
x            1.02517    0.13763   7.449 7.27e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.25 on 8 degrees of freedom
Multiple R-squared:  0.874, Adjusted R-squared:  0.8582 
F-statistic: 55.49 on 1 and 8 DF,  p-value: 7.272e-05
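Note that each t-value is simply the estimate divided by its standard error; for the slope:

\(t = \frac{1.02517}{0.13763} \approx 7.449\)

and with a single predictor, the F-statistic is the square of the slope’s t-value: \(7.449^2 \approx 55.49\).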

Regression Analysis

Example where the significance of the intercept matters

Regression

Spectrophotometry: at concentration zero, is the absorption significantly different from zero?
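A sketch of what such a calibration check could look like in R, with hypothetical concentration and absorbance readings:

conc = c(0, 5, 10, 15, 20, 25) #known concentrations
absorbance = c(0.02, 0.11, 0.19, 0.30, 0.41, 0.50) #hypothetical readings

cal = lm(absorbance ~ conc)
summary(cal) #a significant intercept would mean the absorption at
             #concentration zero differs from zero, e.g. a calibration offset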

Regression in R

Regression

R-squared value: the proportion of explained to total variance; it corresponds to the square of the correlation coefficient \(R\).

\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{(n-1)s_xs_y}\)

Remember?

\(R^2\) is the proportion of variance in \(y\) explained by \(x\)

How can we visualise this?

Regression in R

Regression

\(R^2 = 1 - \frac{{\color{blue}{SS_{residual}}}}{{\color{red}{SS_{total}}}} = 1 - \frac{{\color{blue}{\sum{(y_i - \hat{y}_i)^2}}}}{{\color{red}{\sum{(y_i - \bar{y})^2}}}}\)

[Figure: visual representation of the sum of squares (residuals) vs. the sum of squares (total)]
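We can compute \(R^2\) from these sums of squares directly (a sketch, using 'm1', 'x' and 'y' from the earlier example):

ss_residual = sum(residuals(m1)^2) #sum of squared residuals
ss_total = sum((y - mean(y))^2)    #total sum of squares
1 - ss_residual / ss_total         #should match 'Multiple R-squared' in summary(m1)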

Regression in R

Example using the ‘swiss’ data set

Regression

Does the percentage of households involved in agriculture (‘agriculture’) influence the fertility index (‘fertility’)? Can we establish a predictive model that forecasts ‘fertility’ based on ‘agriculture’?

First, always plot the data.

Indeed, the higher ‘agriculture’, the higher the variable ‘fertility’:

#using the inbuilt 'swiss' data set
plot(swiss$Fertility ~ swiss$Agriculture)

Regression in R

Regression

Let us run and interpret a linear regression on the data:

m1 = lm(Fertility ~ Agriculture, data = swiss)
summary(m1)
Call:
lm(formula = Fertility ~ Agriculture, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.5374  -7.8685  -0.6362   9.0464  24.4858 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 60.30438    4.25126  14.185   <2e-16 ***
Agriculture  0.19420    0.07671   2.532   0.0149 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.82 on 45 degrees of freedom
Multiple R-squared:  0.1247,    Adjusted R-squared:  0.1052 
F-statistic: 6.409 on 1 and 45 DF,  p-value: 0.01492

Regression in R

Example using the ‘swiss’ data set

Regression
  • How do we interpret the output?
  • What assumptions do we need to verify again…?

To check for normality of the residuals, we look at a qq-plot

To check the assumption of homogeneity of variance, we plot the residuals against the fitted values

Testing the model assumptions - homoscedasticity

Regression
plot(m1) #note this gives you 4 plots, best used after par(mfrow = c(2, 2))
#we will consider plots 1, 2, and 4 only

This plot checks for homogeneity of variance of the residuals; we do not want to see a pattern, e.g. larger residuals at higher fitted values (a trumpet shape).
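To display this panel on its own, use the ‘which’ argument of plot():

plot(m1, which = 1) #residuals vs. fitted values only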

Testing the model assumptions - normality

Regression

In this plot we check the residuals for normality using a qq-plot, which plots the quantiles of your residual distribution against the quantiles of a standard normal distribution.
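Again, this panel can be drawn on its own, or built by hand from the residuals:

plot(m1, which = 2)   #qq-plot of the standardised residuals
qqnorm(residuals(m1)) #qq-plot of the raw residuals...
qqline(residuals(m1)) #...with a reference line added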

Testing the model assumptions - outliers

Regression

This plot identifies very influential data points (outside the 0.5 contour, or worse, outside the 1.0 contour of Cook’s distance) that have a strong effect on the regression line.
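In R’s default set of four diagnostic plots this is the residuals-vs-leverage panel (which = 5), with the Cook’s distance contours drawn at 0.5 and 1:

plot(m1, which = 5) #residuals vs. leverage, with Cook's distance contours
plot(m1, which = 4) #alternative view: Cook's distance per observation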

Regression in R: how to report the data

The higher the percentage of families involved in agriculture, the larger the families (linear regression, R-squared = 0.12, p < 0.05).

Regression in R: use the model to predict values

Intercept: 60.3

Slope: 0.19

What fertility would we predict if the percentage of people working in agriculture is 50%?

\(Fertility = 0.19 \cdot 50 + 60.3 = 69.8\)
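Rather than calculating by hand, we can let R do this with predict(), which uses the unrounded coefficients (here with the swiss model ‘m1’ from above):

predict(m1, newdata = data.frame(Agriculture = 50))
#returns about 70.0; the small difference from 69.8 comes from rounding
#the coefficients above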

Learning tool: Linear regression app

In a nutshell

Regression
  • Regression might feel similar to correlation, but it’s actually quite different, because y depends on x
  • The slope, the intercept, their p-values, and the R-squared value are the key outputs
  • Regression can be used to predict, or to test for significance (intercept, slope)