Lecture 13: How statistics draws a line

Eamonn Mallon
14/10/2020

How is this line calculated in stats?

[Figure: the data with the fitted regression line]

  • The essence of regression analysis is using sample data to estimate the parameter values (a and b).
  • Statistics finds the slope (b) and intercept (a) by minimising SSE
  • SSE (think back to ANOVA) is the sum of squares of the error: the variation that can't be explained by the line
    • Square all the residuals (do you remember why?)
    • Sum them all up

What are residuals in a regression?

  • Residuals (d) are the difference between the actual value of y and the predicted value of y (\( \hat{y} \)); a short R sketch of this calculation follows below
  • \[ d = y - \hat{y} \]
  • But \( \hat{y} \) must lie on the line \( a + bx \)
  • \[ d = y - \hat{y} = y - (a + bx) = y - a - bx \]
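A minimal sketch of this calculation in R, assuming the reg.data data frame (with growth and tannin columns) used later in this lecture; a.try and b.try are arbitrary trial values standing in for the parameters:

a.try <- 12                                # an arbitrary trial intercept
b.try <- -1.3                              # an arbitrary trial slope
y.hat <- a.try + b.try * reg.data$tannin   # predicted growth on this trial line
d <- reg.data$growth - y.hat               # residuals: observed minus predicted growth
sum(d^2)                                   # sum of squared residuals, the SSE for this line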

Minimising SSE

  • Change the value of the slope (b)
  • Work out the new intercept \( a = \bar y - b\bar x \) (the line has to go through the mean values of x and y)
  • Predict the fitted values of growth for each level of tannin (\( a + bx \))
  • Work out the residuals (\( y - a - bx \), previous slide)
  • Square them and add them up (\( \sum (y - a - bx)^2 \))
  • Associate this value, SSE[i], with the current estimate of the slope, b[i]

Minimising SSE

# Candidate slope values, and a vector to hold the SSE for each one
b <- seq(-1.43, -1, 0.002)
sse <- numeric(length(b))
for (i in seq_along(b)) {
  # Force the line through the means of x and y to get the intercept
  a <- mean(reg.data$growth) - b[i] * mean(reg.data$tannin)
  # Residuals for this candidate line, then their sum of squares
  residual <- reg.data$growth - a - b[i] * reg.data$tannin
  sse[i] <- sum(residual^2)
}
# Plot SSE against the candidate slopes and mark the minimum
plot(b, sse, type = "l", ylim = c(19, 24))
arrows(-1.216, 20.07225, -1.216, 19, col = "red")
abline(h = 20.07225, col = "green", lty = 2)
lines(b, sse)

# The slope giving the smallest SSE
print(b[which.min(sse)])

Minimising SSE

[Figure: SSE plotted against the candidate slopes b; the minimum SSE (about 20.07) occurs at b = -1.216]

[1] -1.216

So we have the slope (b); how do we get the intercept (a)?

\[ y = a + bx \]
\[ a = y - bx \]

  • The line has to go through the means of y (6.9) and x (4)

\[ a = \bar y - b\bar x \]

  • We know everything on the right-hand side, so we can calculate a (a = 6.9 - (-1.2)(4) = 11.7)
  • Therefore we can write the equation of the line from the parameters we have calculated (a and b).

\[ y= 11.7 -1.2x \]
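A sketch of the same calculation in R, again assuming the reg.data data frame and the slope found above:

b <- -1.216                                             # slope found by minimising SSE
a <- mean(reg.data$growth) - b * mean(reg.data$tannin)  # intercept through the means
a                                                       # roughly 11.75, matching the lm() output below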

So is it significant?

  • We will not go into this deeply (in first year), except to say that you work out the significance using an ANOVA table (sketched below)
  • Rather, I will spend the time teaching you the R code required and its interpretation
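As a quick sketch of what that looks like (the model object itself is fitted with lm() on the next slide), R prints this ANOVA table for a fitted regression with anova():

# anova() on the fitted model gives the regression ANOVA table,
# with the same F statistic and p-value reported by summary(model)
anova(model)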

Regression in R

# Fit the linear regression of growth on tannin and summarise it
model <- lm(reg.data$growth ~ reg.data$tannin)
summary(model)

Call:
lm(formula = reg.data$growth ~ reg.data$tannin)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4556 -0.8889 -0.2389  0.9778  2.8944 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      11.7556     1.0408  11.295 9.54e-06 ***
reg.data$tannin  -1.2167     0.2186  -5.565 0.000846 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared:  0.8157,    Adjusted R-squared:  0.7893 
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.0008461

\[ y = 11.75 - 1.2x \]

Tannin levels affect growth (Regression: \( R^2 \) = 0.79, \( F_{1,7} \) = 30.97, p = \( 0.0009 \))
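As a sketch, the numbers quoted in that sentence can be pulled straight out of the fitted model; these are standard components of the object returned by summary():

summary(model)$adj.r.squared    # adjusted R squared (0.79 here)
summary(model)$fstatistic       # F value with its numerator and denominator df
summary(model)$coefficients     # estimates, standard errors, t values and p values

# p-value of the overall F test (not stored directly in the summary)
f <- summary(model)$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE)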

R squared?

  • This measures the goodness of fit of the line
  • It ranges from 1 (perfect fit) to 0 (no fit at all)
  • It's the square of r (the correlation coefficient)
  • \[ R^2 = SSR/SSY \]
  • SSR is the variation explained by the regression line; SSY is the total variation in y (sketched in R below)
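A sketch of this ratio computed by hand, assuming reg.data and the model fitted earlier; it should match the Multiple R-squared value in summary(model):

y <- reg.data$growth
ssy <- sum((y - mean(y))^2)                 # SSY: total variation in y
ssr <- sum((fitted(model) - mean(y))^2)     # SSR: variation explained by the line
ssr / ssy                                   # R squared, about 0.82 here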
