Lecture 13: How statistics draws a line

Eamonn Mallon
14/10/2020

How is this line calculated in stats?

[Figure: the data with the fitted regression line]

  • The essence of regression analysis is using sample data to estimate the parameter values (a and b).
  • Statistics finds the slope (b) and intercept (a) by minimising SSE
  • SSE (think back to ANOVA) is the sum of squares of the error: the variation that can't be explained by the line
    • Square all the residuals (do you remember why?)
    • Sum them all up

What are residuals in a regression?

  • Residuals (d) are the difference between the actual value of y and the predicted value of y (\( \hat{y} \)); a short R sketch of this calculation follows below
  • \[ d = y - \hat{y} \]
  • But \( \hat{y} \) must lie on the line \( a + bx \)
  • \[ d = y - \hat{y} = y - (a + bx) = y - a - bx \]
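A minimal sketch of this calculation in R, assuming the reg.data data frame (with growth and tannin columns) used later in this lecture; a.try and b.try are arbitrary trial values standing in for the parameters:

a.try <- 12                                # an arbitrary trial intercept
b.try <- -1.3                              # an arbitrary trial slope
y.hat <- a.try + b.try * reg.data$tannin   # predicted growth on this trial line
d <- reg.data$growth - y.hat               # residuals: observed minus predicted growth
sum(d^2)                                   # sum of squared residuals, the SSE for this line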

Minimising SSE

  • Change the value of the slope (b)
  • Work out the new intercept \( a = \bar y - b\bar x \) (the line has to go through the mean values of x and y)
  • Predict the fitted values of growth for each level of tannin (\( a + bx \))
  • Work out the residuals (\( y - a - bx \), previous slide)
  • Square them and add them up (\( \sum (y - a - bx)^2 \))
  • Associate this value, SSE[i], with the current estimate of the slope, b[i]

Minimising SSE

# Candidate slope values, and a vector to hold the SSE for each one
b <- seq(-1.43, -1, 0.002)
sse <- numeric(length(b))
for (i in seq_along(b)) {
  # Force the line through the means of x and y to get the intercept
  a <- mean(reg.data$growth) - b[i] * mean(reg.data$tannin)
  # Residuals for this candidate line, then their sum of squares
  residual <- reg.data$growth - a - b[i] * reg.data$tannin
  sse[i] <- sum(residual^2)
}
# Plot SSE against the candidate slopes and mark the minimum
plot(b, sse, type = "l", ylim = c(19, 24))
arrows(-1.216, 20.07225, -1.216, 19, col = "red")
abline(h = 20.07225, col = "green", lty = 2)
lines(b, sse)

# The slope giving the smallest SSE
print(b[which.min(sse)])

Minimising SSE

[Figure: SSE plotted against the candidate slopes b; the minimum SSE (about 20.07) occurs at b = -1.216]

[1] -1.216

So we have the slope (b); how do we get the intercept (a)?

\[ y = a + bx \]
\[ a = y - bx \]

  • The line has to go through the means of y (6.9) and x (4)

\[ a = \bar y - b\bar x \]

  • We know everything on the right-hand side, so we can calculate a (a = 6.9 - (-1.2)(4) = 11.7)
  • Therefore we can write the equation of the line from the parameters we have calculated (a and b).

\[ y= 11.7 -1.2x \]
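A sketch of the same calculation in R, again assuming the reg.data data frame and the slope found above:

b <- -1.216                                             # slope found by minimising SSE
a <- mean(reg.data$growth) - b * mean(reg.data$tannin)  # intercept through the means
a                                                       # roughly 11.75, matching the lm() output below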

So is it significant?

  • We will not go into this deeply (in first year), except to say that you work out the significance using an ANOVA table (sketched below)
  • Rather, I will spend the time teaching you the R code required and its interpretation
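As a quick sketch of what that looks like (the model object itself is fitted with lm() on the next slide), R prints this ANOVA table for a fitted regression with anova():

# anova() on the fitted model gives the regression ANOVA table,
# with the same F statistic and p-value reported by summary(model)
anova(model)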

Regression in R

# Fit the linear regression of growth on tannin and summarise it
model <- lm(reg.data$growth ~ reg.data$tannin)
summary(model)

Call:
lm(formula = reg.data$growth ~ reg.data$tannin)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4556 -0.8889 -0.2389  0.9778  2.8944 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      11.7556     1.0408  11.295 9.54e-06 ***
reg.data$tannin  -1.2167     0.2186  -5.565 0.000846 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared:  0.8157,    Adjusted R-squared:  0.7893 
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.0008461

\[ y = 11.75 - 1.2x \]

Tannin levels affect growth (Regression: \( R^2 \) = 0.79, \( F_{1,7} \) = 30.97, p = \( 0.0009 \))
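As a sketch, the numbers quoted in that sentence can be pulled straight out of the fitted model; these are standard components of the object returned by summary():

summary(model)$adj.r.squared    # adjusted R squared (0.79 here)
summary(model)$fstatistic       # F value with its numerator and denominator df
summary(model)$coefficients     # estimates, standard errors, t values and p values

# p-value of the overall F test (not stored directly in the summary)
f <- summary(model)$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE)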

R squared?

  • This measures the goodness of fit of the line
  • It ranges from 1 (perfect fit) to 0 (no fit at all)
  • It's the square of r (the correlation coefficient)
  • \[ R^2 = SSR/SSY \]
  • SSR is the variation explained by the regression line; SSY is the total variation in y (sketched in R below)
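A sketch of this ratio computed by hand, assuming reg.data and the model fitted earlier; it should match the Multiple R-squared value in summary(model):

y <- reg.data$growth
ssy <- sum((y - mean(y))^2)                 # SSY: total variation in y
ssr <- sum((fitted(model) - mean(y))^2)     # SSR: variation explained by the line
ssr / ssy                                   # R squared, about 0.82 here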
