What is regression?

  • A statistical method you use when both the response variable and the explanatory variable are continuous variables
  • Continuous variables = real numbers with decimal places – things like heights, weights, volumes, or temperatures
  • Remember that in ANOVA the explanatory variables are categorical: they are called factors, and each factor has levels
    • e.g. sex could be a factor with two levels (male and female)
  • Should I do ANOVA or regression? Categorical explanatory variable: ANOVA. Continuous explanatory variable: regression

How did R (and stats) just do that?

  • Short answer: the way you did in school
  • The essence of regression analysis is using sample data to estimate parameter values (a and b).
  • Stats finds the slope (b) and intercept (a) by minimising SSE
  • SSE (think back to ANOVA) is the error sum of squares: here, the variation in y that can’t be explained by the line
    • Square all the residuals (do you remember why?)
    • Sum them all up
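
The two steps above (square the residuals, then sum them) can be sketched in R. This is only an illustration under stated assumptions: the `tannin` and `growth` values are stand-in data assumed for the example, and `sse()` is a hypothetical helper, not a built-in function.

```r
# Stand-in data (an assumption: the lecture's own data set is not listed here)
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)

# Hypothetical helper: SSE for a candidate line y = a + b*x
sse <- function(a, b, x, y) {
  residuals <- y - (a + b * x)  # vertical distance from each point to the line
  sum(residuals^2)              # square them all, then sum them up
}

# Least-squares estimates of a and b from lm()
fit <- lm(growth ~ tannin)
a_hat <- coef(fit)[["(Intercept)"]]
b_hat <- coef(fit)[["tannin"]]

sse_best <- sse(a_hat, b_hat, tannin, growth)

# Moving either parameter away from its estimate makes SSE larger
sse_steeper <- sse(a_hat, b_hat - 0.2, tannin, growth)
sse_higher  <- sse(a_hat + 0.5, b_hat, tannin, growth)

print(c(best = sse_best, steeper = sse_steeper, higher = sse_higher))
```

Any other pair of values for a and b gives a larger SSE than the fitted line, which is what "minimising SSE" means.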

What are residuals in a regression?

  • A residual (d) is the difference between the actual value of y and the predicted value of y (\(\hat{y}\))
  • \[ d = y - \hat{y} \]
  • But \(\hat{y}\) must lie on the line, so \(\hat{y} = a + bx\)
  • \[ \begin{aligned} d &= y - \hat{y} \\ &= y - (a + bx) \\ &= y - a - bx \end{aligned} \]
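
The residual formula can be checked directly in R. This is a minimal sketch: the `tannin` and `growth` values are stand-in data assumed for the example, not necessarily the lecture's data.

```r
# Stand-in data (an assumption for this example)
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)

fit <- lm(growth ~ tannin)
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["tannin"]]

y_hat <- a + b * tannin  # predicted values: points on the line
d <- growth - y_hat      # d = y - a - bx

# The hand-calculated residuals match the ones R stores in the model
print(all.equal(unname(residuals(fit)), d))

# Positive and negative residuals cancel, so the raw sum is (numerically) zero;
# squaring removes the signs so the misses cannot cancel out
print(sum(d))
print(sum(d^2))
```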

So we have the slope (b), how do we get the intercept (a)?

  • \[ \begin{aligned} y &= a + bx \\ a &= y - bx \end{aligned} \]

  • The line has to go through the mean of y (6.9) and the mean of x (4)

  • \[ a = \bar y - b\bar x \]

  • We know everything on the right-hand side, so we can calculate a

  • \[ \begin{aligned} a &= 6.9 - (-1.2 \times 4) \\ &= 6.9 + 4.8 \\ &= 11.7 \end{aligned} \]

  • Therefore we can write the equation of the line from the parameters we have calculated (a and b).

  • \[ y= 11.7 -1.2x \]
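
The same calculation can be sketched in R without `lm()`. The data here are stand-in values assumed for the example (chosen so that the mean of x is 4 and the mean of y is about 6.9), and the slope uses the standard least-squares formula, the sum of products of deviations divided by the sum of squared deviations of x.

```r
# Stand-in data (an assumption for this example)
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)

x_bar <- mean(tannin)  # mean of x
y_bar <- mean(growth)  # mean of y

# Slope: standard least-squares formula
b <- sum((tannin - x_bar) * (growth - y_bar)) / sum((tannin - x_bar)^2)

# Intercept: the line must pass through (x_bar, y_bar)
a <- y_bar - b * x_bar

print(round(c(a = a, b = b), 4))

# Compare with the estimates from lm()
print(coef(lm(growth ~ tannin)))
```

Working with unrounded means and slope gives a slightly different intercept from the hand calculation, which rounds at each step.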

Regression in R

model <- lm(reg.data$growth~reg.data$tannin)
summary(model)

Call:
lm(formula = reg.data$growth ~ reg.data$tannin)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4556 -0.8889 -0.2389  0.9778  2.8944 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      11.7556     1.0408  11.295 9.54e-06 ***
reg.data$tannin  -1.2167     0.2186  -5.565 0.000846 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared:  0.8157,    Adjusted R-squared:  0.7893 
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.0008461

\[ y = 11.76 - 1.22x \] Tannin levels affect growth (regression: adjusted \(R^2\) = 0.79, \(F_{1,7}\) = 30.97, p = \(0.0008\))
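
The numbers in a write-up like this can be pulled out of the model object rather than copied from the printout. A minimal sketch, assuming a data frame with `tannin` and `growth` columns; the values below are stand-in data for the example, and the model is written with the `data =` argument as an alternative to the `$` notation.

```r
# Stand-in data frame (an assumption for this example)
reg.data <- data.frame(tannin = 0:8,
                       growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3))

model <- lm(growth ~ tannin, data = reg.data)
s <- summary(model)

coefs  <- coef(model)       # intercept (a) and slope (b)
r2     <- s$r.squared       # "Multiple R-squared" in the printout
adj_r2 <- s$adj.r.squared   # "Adjusted R-squared" in the printout
f      <- s$fstatistic      # named vector: value, numdf, dendf

# p-value for the F-statistic (upper tail of the F distribution)
p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(round(coefs, 2))
print(round(c(r2 = r2, adj_r2 = adj_r2, F = f[["value"]], p = p), 4))
```

Whichever \(R^2\) you report, say which one it is: the multiple and adjusted values differ noticeably in a sample this small.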

What you learned today

bs1040marks <- read.csv("~/Dropbox/Teaching/first_year_stats/lectures/5.regressions/bs1040marks.csv")
bs1040_model <- lm(real ~ mock, data = bs1040marks)
summary(bs1040_model)

Call:
lm(formula = real ~ mock, data = bs1040marks)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.164  -7.702   0.017   7.620  39.091 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.8859     3.8308   8.846 4.56e-16 ***
mock          1.4186     0.3494   4.060 7.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.87 on 202 degrees of freedom
  (102 observations deleted due to missingness)
Multiple R-squared:  0.07545,   Adjusted R-squared:  0.07087 
F-statistic: 16.48 on 1 and 202 DF,  p-value: 7.012e-05
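
The fitted coefficients can be turned into predictions with the line equation. A small sketch using the estimates printed above; the mock scores are hypothetical values assumed for illustration, not taken from the data.

```r
# Coefficients copied from the summary output above
intercept <- 33.8859
slope     <- 1.4186

# Hypothetical mock scores (assumed values for illustration)
mock_scores <- c(5, 10, 15)

# Predicted exam score: y = a + b*x
predicted_real <- intercept + slope * mock_scores
print(round(predicted_real, 1))

# With the fitted model object available, predict() does the same job:
# predict(bs1040_model, newdata = data.frame(mock = mock_scores))
```

Each extra mock mark raises the predicted exam score by about 1.42 marks, but with \(R^2\) around 0.075 individual students scatter widely around the line.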

What you learned today

library(ggplot2)
ggplot(bs1040marks, aes(x = mock, y = real)) +
    geom_point(color = "blue", alpha = 0.6) +  # Scatter plot points
    geom_smooth(method = "lm", color = "red", se = TRUE) +  # Regression line
    theme_minimal() +
    labs(title = "Mock scores predict BS1040 scores",
         x = "Mock scores",
         y = "Exam scores")