October 10, 2022

What will we do today

Week 11
  • Recap week 10: Example questions
  • Regression
  • Regression app in R
  • Lab report 7
  • Today’s lab / short answer example questions

Assignment 2

Week 11
  • Will be available from Friday 14th in the morning
  • Due Friday 21st 4pm
  • If you cannot meet the deadline, you need to apply for special consideration (SCA) on Canvas (with a good reason and proof of it)
  • It must be your own work. Do not plagiarise; we now have dedicated software to detect it

Example Question (1)

Week 11

A Pearson correlation coefficient of 0.52 means

  1. that the x variable causes the y variable to increase
  2. that the x and y are positively correlated
  3. that x and y are just not significantly correlated
  4. that x and y are highly significantly correlated

Example Question (2)

Week 11

The correlation coefficient is

  1. the same as the covariance
  2. a standardised version of the covariance
  3. the square root of the covariance
  4. none of the above

Example Question (3)

Week 11

When a data set looks non-normal but I still would like to check for correlations between variables

  1. I should use a Wilcoxon test
  2. I should use a Pearson correlation test
  3. I should use a Kendall or Spearman correlation test
  4. I can use a Wilcoxon OR a Kendall test

Example Question (4)

Week 11

A correlation test tests

  1. whether a histogram shows a normal distribution
  2. whether two groups are significantly different
  3. whether two variables are correlated
  4. none of the above

Example Question (5)

Week 11

In a correlation analysis, if my p-value falls below 5% but in reality there is no correlation between the two variables,

  1. I probably did not test whether the variables were normal
  2. I might erroneously have used continuous variables
  3. I commit a type II error
  4. I commit a type I error

Example Question (6)

Week 11

In a correlation analysis we normally have

  1. two categorical variables
  2. one categorical variable and one continuous variable
  3. two continuous variables
  4. any two variables

Example Question (7)

Week 11

Assume that your waiting time for a bus is uniformly distributed between 0 and 10 minutes. What is the probability of waiting 5 minutes or longer?

  1. This cannot be calculated from the above
  2. About 5 %
  3. About 50 %
  4. About 10 %

Regression Analysis

The origin of the word

The original publication that coined the term ‘regression’ dealt with the body sizes of parents and their children.

Francis Galton observed that tall parents tended to have offspring who were not as tall: the metric ‘regressed’ towards the mean.

Regression Analysis

Week 11
  • Measures the dependency of one variable on another
  • E.g. it tests the hypothesis ‘The body mass index of a person can be used to predict the risk for diabetes’
  • A linear regression is a predictive model: e.g. it may allow us to predict diabetes risk from the body mass index of a person
  • We assume causality, i.e. one variable drives the other one - unlike in correlation analysis!
  • The two coefficients of a simple linear regression are the slope and the intercept
  • In a linear regression, we test for the significance of the slope (only rarely for the significance of the intercept)

Regression Analysis

\(\hat{y} = ax + b\)

\(\hat{y}\) are the estimated y values:

\(y - \hat{y} = {\color{red}{d}}\)

Assuming that the line should go through the mean of x and y, our task is to find \(a\) that minimises the sum of the squared residuals \({\color{red}{d}}\)

Regression Analysis

\(y - \hat{y} = {\color{red}{d}}\)

Now replace \(\hat{y}\) by \(ax + b\) to obtain

\({\color{red}{d}} = y - ax - b\)

We therefore have to minimise \(\sum{(y - ax - b)^2}\). We do this by setting the first derivative to zero (remember calculus…?)

Regression Analysis

Week 11

\(\frac{d\sum{(y - ax - b)^2}}{da} = -2\sum{x(y - ax - b)} = 0\)

We find that (without proof; see Crawley 2014, The R Book):

\(a = \frac{\sum{(x - \bar{x})(y - \bar{y})}}{\sum{(x - \bar{x})^2}}\)

Because the regression line by definition goes through the mean of x and y, we can work out the intercept:

\(\bar{y} = a\bar{x} + b\)

\(b = \bar{y} - a \bar{x}\)

Done! Confused? Try the interactive app: https://gallery.shinyapps.io/simple_regression/
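The two formulas translate directly into R. A minimal sketch with made-up example data (any x and y vectors of equal length would do):

#slope and intercept by hand, using hypothetical example data
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)
a = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) #slope
b = mean(y) - a * mean(x) #intercept
c(slope = a, intercept = b)
coef(lm(y ~ x)) #R's built-in fit returns the same values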

Regression Analysis

Assumptions

Week 11
  • \(x\) is measured without error (for the type of regression presented here)
  • The residuals are normally distributed
  • Constant variance in y (no heterogeneity of variance, or ‘heteroscedasticity’)

We know how to test for normality, but we’ll have to learn how to assess heteroscedasticity

Regression in R

Week 11

In R, we only need to type:

m1 = lm(y ~ x) #fit a linear model of y as a function of x
m1
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
    0.04133      1.02517  

and we are done!
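The x and y themselves are not shown on the slide; hypothetical toy data such as the following (arbitrary seed) would give a similar fit, with a slope near 1 and an intercept near 0:

set.seed(42) #arbitrary seed, purely for reproducibility
x = 1:10 #hypothetical predictor values
y = x + rnorm(10) #response: x plus normal noise
m1 = lm(y ~ x)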

Regression in R

Week 11

How do we interpret these numbers?

  • The intercept is what we called \(b\), and the slope is what we called \(a\)
  • If we want to predict a \(y\)-value for a given \(x\), we put the intercept and the slope into the formula (see the sketch below):

\(y\)-value = \(x\) * slope + intercept

The slope (sometimes called \(a\)) and the intercept (sometimes called \(b\)) are the coefficients.
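A minimal sketch of this calculation in R, using the model m1 from the previous slide (the x-value 7 is a made-up example):

coef(m1) #the two coefficients: intercept and slope
new_x = 7 #hypothetical x-value
coef(m1)[2] * new_x + coef(m1)[1] #slope * x + intercept
predict(m1, newdata = data.frame(x = new_x)) #the same, using predict()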

Regression in R

Plotting this is easy too, just as in correlation analysis:

plot(y ~ x)
abline(m1) #plots the regression line, based on model 'm1'

But it is important to know what is going on behind the curtain: R automatically finds the best fit using the least squares method, minimising the squared residuals by pivoting a line through the mean of x and y (as we have just done manually).

Regression in R

Week 11

Using summary(m1) (‘m1’ being your fitted model) additionally tests the intercept and slope for significance:

  • intercept null hypothesis, using a t-statistic: ‘the intercept is not significantly different from zero’. In plain English: when x is zero, y is zero too.
  • slope null hypothesis, using a t-statistic: ‘the slope is not significantly different from zero’.

Regression in R

Week 11
summary(m1) #m1 as defined in previous slide
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5034 -0.4339  0.2452  0.7240  1.6150 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.04133    0.85395   0.048    0.963    
x            1.02517    0.13763   7.449 7.27e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.25 on 8 degrees of freedom
Multiple R-squared:  0.874, Adjusted R-squared:  0.8582 
F-statistic: 55.49 on 1 and 8 DF,  p-value: 7.272e-05

Regression in R

Week 11
  • First, we see the call (the model formula we fitted)
  • Below, we see the statistics (min, median, max, etc.) of the residuals
  • Then come the coefficients, i.e. the intercept and the slope, including their estimate, standard error, t-value and corresponding p-value.
  • Note that the intercept is rarely important! The null hypothesis is that it equals zero - rarely of interest!
  • Last come the residual standard error, degrees of freedom, R-squared values (look at the adjusted R-squared value, see following slides) and an F-statistic (the last one is not important at this stage)
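If you need any of these numbers programmatically rather than printed, the parts of the summary can be extracted directly; a sketch using the model m1 from above:

summary(m1)$coefficients #the coefficient table as a matrix
summary(m1)$adj.r.squared #the adjusted R-squared on its own
summary(m1)$sigma #the residual standard error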

Regression Analysis

Example where the significance of the intercept matters

Week 11

Spectrophotometry: at concentration zero, is the absorption significantly different from zero?

Regression in R

Week 11

R-squared value: the proportion of explained vs. total variance; it corresponds to the square of the correlation coefficient \(R\).

\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{(n-1)s_xs_y}\)

Remember?

\(R^2\) is the proportion of variance in \(y\) explained by \(x\)
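For a simple linear regression this is easy to verify in R (assuming the x, y and m1 from the earlier slides):

cor(x, y)^2 #squared Pearson correlation coefficient
summary(m1)$r.squared #multiple R-squared: the same number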

How can we visualise this?

Regression in R

Week 11

\(R^2 = 1 - \frac{{\color{blue}{SS_{residual}}}}{{\color{red}{SS_{total}}}} = 1 - \frac{{\color{blue}{\sum{(y_i - \hat{y}_i)^2}}}}{{\color{red}{\sum{(y_i - \bar{y})^2}}}}\)

[Figure: visual representation of the sum of squares of the residuals vs. the total sum of squares]
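The formula can also be checked directly; a sketch, again assuming the y and m1 from the earlier slides:

ss_res = sum((y - fitted(m1))^2) #sum of the squared residuals
ss_tot = sum((y - mean(y))^2) #total sum of squares
1 - ss_res / ss_tot #equals summary(m1)$r.squared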

Regression in R

Example using the ‘swiss’ data set

Week 11

Does the percentage of households involved in agriculture (‘Agriculture’) influence the fertility index (‘Fertility’)? Can we establish a predictive model that forecasts ‘Fertility’ based on ‘Agriculture’?

First, always plot the data.

Indeed, the higher ‘Agriculture’, the higher ‘Fertility’:

#using the inbuilt 'swiss' data set
plot(swiss$Fertility ~ swiss$Agriculture)

Regression in R

Week 11

Let us run and interpret a linear regression on the data:

m1 = lm(Fertility ~ Agriculture, data = swiss)
summary(m1)
Call:
lm(formula = Fertility ~ Agriculture, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.5374  -7.8685  -0.6362   9.0464  24.4858 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 60.30438    4.25126  14.185   <2e-16 ***
Agriculture  0.19420    0.07671   2.532   0.0149 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.82 on 45 degrees of freedom
Multiple R-squared:  0.1247,    Adjusted R-squared:  0.1052 
F-statistic: 6.409 on 1 and 45 DF,  p-value: 0.01492

Regression in R

Example using the ‘swiss’ data set

Week 11
  • How do we interpret the output?
  • What assumptions do we need to verify again…?

To check for normality of the residuals, we extract those from the model (m1$residuals) and use our known tools (visual, Shapiro-Wilk test)

To check the assumption of homogeneity of variance, we plot the residuals against the fitted values

Regression in R - Assumptions

Week 11

So we need to check whether the residuals follow a normal distribution and whether we see heterogeneity of variance

par(mfrow = c(1,2)) #two diagnostic plots side by side
qqnorm(m1$residuals) #QQ-plot of the residuals
qqline(m1$residuals) #reference line for a normal distribution
plot(m1$residuals ~ m1$fitted.values) #residuals vs. fitted values

This looks reasonable: most data points sit on the line. There is no obvious heterogeneity of variance. We can trust the model output.
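As mentioned above, the visual check of normality can be backed up with a Shapiro-Wilk test on the extracted residuals:

shapiro.test(m1$residuals) #null hypothesis: the residuals are normally distributed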

Assumptions, the quick way

Week 11
plot(m1) #note this gives you 4 plots, best used after par(mfrow = c(2, 2))
#we will consider plots 1, 2, and 4 only

This plot checks for homogeneity of variance of the residuals; we do not want to see a pattern, e.g. larger residuals at higher fitted values (a trumpet shape).

Assumptions, the quicker way

Week 11

In this plot we check for the normality of the residuals in a qq-plot, just like we did earlier.

Assumptions, the quicker way

Week 11

This plot tells us about very influential outliers. Values outside the 0.5 Cook’s distance line (or, worse, outside the 1 line) have a strong influence on the regression line and may need looking at.
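To draw only this diagnostic (the residuals-vs-leverage plot), plot() for lm models takes a which argument:

plot(m1, which = 5) #residuals vs. leverage, with Cook's distance contours at 0.5 and 1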

Regression in R: how to report the data

plot(swiss$Fertility ~ swiss$Agriculture)
abline(m1)
text(70, 50, 'R-squared = 0.12')

The higher the percentage of households involved in agriculture, the higher the fertility index (linear regression, R-squared = 0.12, p < 0.05).

Regression in R: use the model to predict values

Intercept: 60.3

Slope: 0.19

What fertility would we predict if the percentage of people working in agriculture is 50%?

\(Fertility = 0.19 \cdot 50 + 60.3 = 69.8\)
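In R, predict() does this for us, using the model m1 fitted on the swiss data. Note that it uses the unrounded coefficients, so the result is about 70.0 rather than the 69.8 obtained from the rounded values above:

predict(m1, newdata = data.frame(Agriculture = 50))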

Learning tool: Linear regression app

What will we have learnt by the end of this week?

Week 11
  • How to perform a linear regression
  • How to interpret the output of a linear regression
  • How to check for model assumptions
  • How to report a linear regression analysis

Glossary

Week 11
  • regression, to regress y on x
  • intercept
  • slope
  • coefficients
  • R-squared value
  • heterogeneity of variance
  • residuals, residual distribution
  • fitted values