Regression

Eamonn Mallon

2025-02-14

What are you going to learn today

What is a regression
The difference between it and correlation
You did regressions routinely in school
How does stats replicate this algorithm
Getting R to do it
Interpreting your results

What is regression

A statistical method you use when both the response variable and the explanatory variable are continuous variables
Continuous variables = real numbers with decimal places – things like heights, weights, volumes, or temperatures
Remember in ANOVA, explanatory variables are called factors which have levels
- sex could be a factor and has two levels (male and female)
Should I do ANOVA or regression?

Should I do ANOVA or regression?

Can you draw a boxplot? Use ANOVA
Can you draw a scatterplot? Use regression

Difference between correlation and regression

Lots of books have very complicated explanations of this
Usually
- correlation is if both x and y are random
- regression is if you pick x (experiment)
But this is more of a guideline than a rule
I use regressions when cause and effect is important and you need the equation of the line

Equation of the line

\[ y = a + bx \]
y response variable
x explanatory variable
Two parameters: a, the intercept, and b, the slope of the line.

Think back to school

a you can read off the graph (let's say 2)
\[ b = \frac{\text{change in } y}{\text{change in } x} = \frac{9 - 4}{7.5 - 2.5} = 1 \]
\[ y = 2 + x \]

R can do that for you

lm(y~x)


Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          2            1

\[ y = 2 + x \]
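
To try this yourself, here is a minimal sketch using made-up data (not the points on the graph above). The x and y values are invented so that they lie exactly on y = 2 + x, which means lm() should hand back an intercept of 2 and a slope of 1.

```r
# Hypothetical data: ten points that lie exactly on the line y = 2 + x
x <- 1:10
y <- 2 + x

fit <- lm(y ~ x)   # fit a straight line to y as a function of x
coef(fit)          # named vector: (Intercept) first, then the slope for x
```

Real data never sit perfectly on a line, so the interesting question is how R chooses a and b when the points are scattered.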

How did R (and stats) just do that?

Short answer: the way you did in school
The essence of regression analysis is using sample data to estimate parameter values (a and b).
Stats finds slope (b) and intercept (a) by minimising SSE
SSE (think back to ANOVA) is the sum of squares of the error: here, the variation that can’t be explained by the line
- Square all the residuals (do you remember why?)
- Sum them all up

What are residuals in a regression

Residuals (d) are the difference between the actual value of y and the predicted value of y (\(\hat{y}\))
\[ d = y - \hat{y} \]
But \(\hat{y}\) must be on the line \(a + bx\)
\[ d = y - \hat{y} \\ = y - (a +bx) \\ = y - a -bx\]
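
The residual formula can be checked by hand in R. This is only a sketch: the x and y values are invented for illustration, and the names y_hat and d are my own labels for \(\hat{y}\) and the residuals.

```r
# Hypothetical data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)
a <- coef(fit)[["(Intercept)"]]   # intercept
b <- coef(fit)[["x"]]             # slope

y_hat <- a + b * x                # predicted values, which sit on the line
d <- y - y_hat                    # residuals: d = y - a - bx

# The hand-calculated residuals match the ones R stores in the model
all.equal(d, unname(residuals(fit)))
sum(d^2)                          # the SSE for this line
```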

Minimising SSE

change the value of the slope (b)
work out the new intercept \(a = \bar y - b\bar x\) (the line has to go through the mean values of x and y)
predict the fitted values of growth for each level of tannin (\(a + bx\))
work out the residuals (\(y - a -bx\), previous slide)
square them and add them up (\(\sum (y - a- bx)^2\))
associate this value of SSE[i] with the current estimate of the slope b[i]

Minimising SSE

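# Try a range of candidate slopes and record the SSE for each one
b <- seq(-1.43, -1, 0.002)
sse <- numeric(length(b))
for (i in seq_along(b)) {
  # intercept that makes the line pass through the means of x and y
  a <- mean(reg.data$growth) - b[i] * mean(reg.data$tannin)
  residual <- reg.data$growth - a - b[i] * reg.data$tannin
  sse[i] <- sum(residual^2)
}

# Plot SSE against slope and mark the minimum
plot(b, sse, type = "l", ylim = c(19, 24))
arrows(-1.216, 20.07225, -1.216, 19, col = "red")
abline(h = 20.07225, col = "green", lty = 2)
lines(b, sse)  # redraw the curve on top of the guide line

# The slope with the smallest SSE
print(b[which.min(sse)])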

Minimising SSE

[1] -1.216

So we have the slope (b), how do we get the intercept (a)

\[ y = a + bx\\a = y - bx \]
The line has to go through the mean of y (6.9) and x (4)
\[ a = \bar y - b\bar x \]
We know everything on the right-hand side, so we can calculate a
\[ a = 6.9-(-1.2 \times 4)\\ = 6.9 + 4.8\\= 11.7\]

Therefore we can write the equation of the line from the parameters we have calculated (a and b).
\[ y= 11.7 -1.2x \]
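
Trying lots of slopes is a teaching device. Least squares also has a standard shortcut: the best slope is \(b = \sum (x - \bar x)(y - \bar y) / \sum (x - \bar x)^2\), and the intercept then follows from the means as above. The sketch below checks this against lm(). The data are invented for illustration and are not the tannin data.

```r
# Hypothetical data, invented for illustration only
x <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)
y <- c(11.5, 10.8, 9.1, 8.4, 6.9, 5.7, 4.8, 3.2, 2.5)

# Least-squares slope: co-variation of x and y divided by variation in x
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the line has to pass through (mean of x, mean of y)
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)
coef(lm(y ~ x))   # should agree with the hand calculation
```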

So is it significant?

We will not go deep into this (in first year) except to say that you work out the significance using an ANOVA table
Rather I will spend the time teaching you the R code required and its interpretation
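
If you are curious what that ANOVA table looks like, R will print it with anova(). A minimal sketch, using invented data rather than the tannin data; f_by_hand is just my own name for the hand-calculated F.

```r
# Hypothetical data, invented for illustration only
x <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)
y <- c(11.5, 10.8, 9.1, 8.4, 6.9, 5.7, 4.8, 3.2, 2.5)

fit <- lm(y ~ x)
tab <- anova(fit)   # ANOVA table: one row for x, one for the residuals
tab

# F is the mean square for x divided by the residual mean square
f_by_hand <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
f_by_hand
tab[["F value"]][1]   # the same number, as reported by R
```

The same F value and its p-value appear on the last line of summary() output.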

Regression in R

model <- lm(reg.data$growth~reg.data$tannin)
summary(model)


Call:
lm(formula = reg.data$growth ~ reg.data$tannin)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4556 -0.8889 -0.2389  0.9778  2.8944 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      11.7556     1.0408  11.295 9.54e-06 ***
reg.data$tannin  -1.2167     0.2186  -5.565 0.000846 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared:  0.8157,    Adjusted R-squared:  0.7893 
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.0008461

\[ y = 11.76 - 1.22x \] Tannin levels affect growth (Regression: adjusted \(R^2\) = 0.79, \(F_{1,7}\) = 30.97, p = \(0.0008\))

R squared ?

This measures the goodness of fit of the line
Ranges from 0 (no fit) to 1 (perfect fit)
It's the square of r (the correlation coefficient)
\[ R^2 = SSR/SSY \]
SSR is the variation explained by the regression line; SSY is the total variation in y
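
Both descriptions of R squared can be checked in a few lines. This is a sketch with invented data; ssy, sse and ssr are my own variable names for the sums of squares.

```r
# Hypothetical data, invented for illustration only
x <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)
y <- c(11.5, 10.8, 9.1, 8.4, 6.9, 5.7, 4.8, 3.2, 2.5)

fit <- lm(y ~ x)

ssy <- sum((y - mean(y))^2)    # SSY: total variation in y
sse <- sum(residuals(fit)^2)   # SSE: variation the line cannot explain
ssr <- ssy - sse               # SSR: variation explained by the line

ssr / ssy                      # R squared by hand
summary(fit)$r.squared         # "Multiple R-squared" in the summary output
cor(x, y)^2                    # square of the correlation coefficient
```

All three numbers should be identical. The "Adjusted R-squared" in the summary output is a slightly smaller value that corrects for the number of parameters in the model.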

What you learned today

bs1040marks <- read.csv("~/Dropbox/Teaching/first_year_stats/lectures/5.regressions/bs1040marks.csv")
bs1040_model<-lm(real~mock, data = bs1040marks)
summary(bs1040_model)


Call:
lm(formula = real ~ mock, data = bs1040marks)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.164  -7.702   0.017   7.620  39.091 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.8859     3.8308   8.846 4.56e-16 ***
mock          1.4186     0.3494   4.060 7.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.87 on 202 degrees of freedom
  (102 observations deleted due to missingness)
Multiple R-squared:  0.07545,   Adjusted R-squared:  0.07087 
F-statistic: 16.48 on 1 and 202 DF,  p-value: 7.012e-05

What you learned today

library(ggplot2)
ggplot(bs1040marks, aes(x = mock, y = real)) +
    geom_point(color = "blue", alpha = 0.6) +  # Scatter plot points
    geom_smooth(method = "lm", color = "red", se = TRUE) +  # Regression line
    theme_minimal() +
    labs(title = "Mock scores predict BS1040 scores",
         x = "Mock scores",
         y = "Exam scores")