MT5762 Lecture 13

C. Donovan

A simple linear model and likelihood

The simple linear regression model has the form:

\[ \begin{align*} \textrm{response variable} &= \textrm{intercept} + \textrm{slope} \times \textrm{explanatory variable}\ (+ \textrm{noise})\\ y &= \beta_0+\beta_1 x\\ \end{align*} \]

So there is a simple model for the signal/underlying non-random component. In practice our observations will deviate from this.

Two-parameter model

Hence there are two numbers (parameters) that describe a linear model:

  • The slope (\( \beta_1 \)): describes how steep the relationship is (aka the gradient). The \( y \)-value changes by \( \beta_1 \) for every one-unit increase in \( x \).
  • The intercept (\( \beta_0 \)): the \( y \)-value when \( x=0 \), i.e. where the line cuts the \( y \)-axis.

Determining the best line

Where should the line lie?

We want to establish a criterion for “best” fit to the data

The least-squares line

For example - where should the line go?

[examples and drawing ensues]

We may not be able to immediately agree, but we can agree on what is a badly fitting line.

The least-squares line

So, we've settled on a criterion - Ordinary Least Squares (OLS). We look to minimise the discrepancies between what we observed and what the model predicts, i.e. aim for small \( (y_i-\hat{y}_i) \):

\[ \min \left(\sum_i (y_i-\hat{y}_i)^2 \right) \]

This idea is used all over the place. We are minimising the sum-of-squares (SS)
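To make this concrete, here is a minimal sketch in R that computes the SS for one candidate line; the data and the candidate intercept/slope values are invented purely for illustration.

  x <- c(1, 2, 3, 4, 5)
  y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

  b0 <- 0; b1 <- 2        # a candidate intercept and slope
  yHat <- b0 + b1*x       # what this candidate line predicts
  sum((y - yHat)^2)       # the sum-of-squares (SS) for this candidate

A different choice of b0 and b1 gives a different SS; the least-squares line is the choice that makes this number as small as possible.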

The least-squares line

How do we change SS?

  • \( y_i \) are clearly fixed, so we change \( \hat{y} \)
  • How do we change \( \hat{y} \)?

\( \hat{y} \) is generally a function of our bunch of \( x \) and some parameters \( \boldsymbol{\theta} \):

\[ \hat{y} = f(x_1, \ldots; \boldsymbol{\theta}) \]

  • \( x \) are also fixed - we change \( \hat{y} \) (and hence SS) by choosing \( f \) and altering its \( \boldsymbol{\theta} \)

The least-squares line

How do we change SS?

  • So for a simple \( y = \beta_0 + \beta_1 x \), we move \( \beta_0 \) and \( \beta_1 \)
  • Some changes make SS go up (therefore bad), some make SS go down (therefore good)
  • These are general ideas:
    • We have a parameter space (here defined by \( \beta_0 \) and \( \beta_1 \))
    • We have an objective function, which defines values of parameters to be better or worse than others
    • Here our “objective” is to minimise SS (the objective function - sometimes called a cost function)

I can't overstate the importance of this general process: it underlies linear models, generalised linear models, non-linear models, neural networks, in fact any model fitted to data. A small sketch of the idea follows.
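As a rough illustration (again with invented toy data), we can evaluate the SS objective over a grid of candidate \( (\beta_0, \beta_1) \) values, or hand the objective function to a general-purpose optimiser:

  x <- c(1, 2, 3, 4, 5)
  y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

  ss <- function(beta) sum((y - (beta[1] + beta[2]*x))^2)   # the objective function

  # crude grid search over the parameter space
  grid <- expand.grid(beta0 = seq(-2, 2, by = 0.5),
                      beta1 = seq(0, 4, by = 0.5))
  grid$SS <- apply(grid[, c("beta0", "beta1")], 1, ss)
  grid[which.min(grid$SS), ]      # the best candidate on this (coarse) grid

  # a general-purpose optimiser does the same job without a grid
  optim(c(0, 0), ss)$par

The exact numbers don't matter here - the point is the general recipe: a parameter space, an objective function, and a search for the parameter values that optimise it.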

The least-squares line

The intercept and slope estimates (\( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) respectively) can be found, by hand, using:

\[ \hat{\beta}_1=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} \]

\[ \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x} \]

of course, you'd use a computer
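For the curious, a quick sketch of these formulas in R (toy data invented for illustration), checked against lm():

  x <- c(1, 2, 3, 4, 5)
  y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

  beta1Hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  beta0Hat <- mean(y) - beta1Hat * mean(x)

  c(beta0Hat, beta1Hat)
  coef(lm(y ~ x))    # should agree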

The least-squares line

We'll fit a wee regression here. Using the loan default data again, we'll model the debt-to-income ratio DEBTINC as a function of the value of the property VALUE

  library(readr)   # provides read_csv
  loanData <- read_csv("data/hmeq.csv")

  loanLM <- lm(DEBTINC ~ VALUE, data = loanData, x = T, y = T)   # x, y = T store the design matrix and response for later use

  plot(DEBTINC ~ VALUE, data = loanData)

[scatterplot of DEBTINC against VALUE]

  summary(loanLM)   

Call:
lm(formula = DEBTINC ~ VALUE, data = loanData, x = T, y = T)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.335  -4.669   0.954   5.167 168.918 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.175e+01  2.585e-01 122.823   <2e-16 ***
VALUE       1.997e-05  2.193e-06   9.103   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.491 on 4660 degrees of freedom
  (1298 observations deleted due to missingness)
Multiple R-squared:  0.01747,   Adjusted R-squared:  0.01726 
F-statistic: 82.86 on 1 and 4660 DF,  p-value: < 2.2e-16

Easy aye?

Testing for no linear relationship

You'll see we get \( p \)-values in the output

These imply tests, although none were explicit in our model specification

We get them for “free” and they are often relevant

Testing for no linear relationship

Is the debt-to-income ratio really related to property value?

  summary(loanLM)

Call:
lm(formula = DEBTINC ~ VALUE, data = loanData, x = T, y = T)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.335  -4.669   0.954   5.167 168.918 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.175e+01  2.585e-01 122.823   <2e-16 ***
VALUE       1.997e-05  2.193e-06   9.103   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.491 on 4660 degrees of freedom
  (1298 observations deleted due to missingness)
Multiple R-squared:  0.01747,   Adjusted R-squared:  0.01726 
F-statistic: 82.86 on 1 and 4660 DF,  p-value: < 2.2e-16

Is \(Y\) significantly related to \(X\)?

  • Each fitted line is an estimate of the true/population regression line
  • The null hypothesis is that there is no linear relationship between \( x \) and \( y \) i.e. true slope is zero: \( H_0:\beta_1=0 \).
  • The alternative hypothesis is that there is a linear relationship between \( x \) and \( y \) i.e. true slope is not zero: \( H_1:\beta_1 \neq 0 \).

If \( H_0 \) were true, we'd expect \( t \)-test statistics between -2 and 2 a lot of the time. Here we have 9.1, so the \( p \)-value is correspondingly very small.

We conclude a significant linear relationship between \( x \) and \( y \)
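As a quick check of where the VALUE row of the summary table comes from (using the loanLM fit from above): the \( t \) statistic is the estimate divided by its standard error, and the \( p \)-value is the two-sided tail probability from a \( t \) distribution on the residual degrees of freedom.

  est <- coef(summary(loanLM))["VALUE", "Estimate"]
  se  <- coef(summary(loanLM))["VALUE", "Std. Error"]

  tStat <- est/se                                  # about 9.1
  2 * pt(-abs(tStat), df = df.residual(loanLM))    # two-sided p-value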

What can we conclude?

We can get confidence intervals for the parameters quite easily too:

  confint(loanLM)
                   2.5 %       97.5 %
(Intercept) 3.124005e+01 3.225352e+01
VALUE       1.566567e-05 2.426589e-05

So, for every 100,000 increase in the value of the property, the average debt-to-income ratio increases by about 1.57 to 2.43 (I'm playing with the units for interpretability).

Note the intercept is pretty meaningless (why?)
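The unit-juggling above is just a rescaling of the VALUE interval; a one-line sketch, again assuming loanLM from the fit above:

  confint(loanLM)["VALUE", ] * 1e5   # change in mean DEBTINC per 100,000 of VALUE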

R-squared (and correlation)

You'll observe in the output:

  summary(loanLM)

Call:
lm(formula = DEBTINC ~ VALUE, data = loanData, x = T, y = T)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.335  -4.669   0.954   5.167 168.918 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.175e+01  2.585e-01 122.823   <2e-16 ***
VALUE       1.997e-05  2.193e-06   9.103   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.491 on 4660 degrees of freedom
  (1298 observations deleted due to missingness)
Multiple R-squared:  0.01747,   Adjusted R-squared:  0.01726 
F-statistic: 82.86 on 1 and 4660 DF,  p-value: < 2.2e-16

A “multiple R-squared” of 0.01747. These are commonplace and require discussion.

First the (Pearson) Correlation

This is a measure of linear association between two numeric variables

It takes on value:

  • 1 when they have a perfect positive association
  • -1 when they have a perfect negative association
  • 0 when they have no linear association
  • Values in between for intermediate degrees of linear association

(Pearson) Correlation - examples

  library(mvtnorm)

  sigma_0.8 <- matrix(c(1, 0.8, 0.8, 1), ncol = 2)
  sigma_0.8neg <- matrix(c(1, -0.8, -0.8, 1), ncol = 2)

  y_0.8 <- rmvnorm(100, sigma = sigma_0.8)
  y_0 <- rmvnorm(100, mean = c(0,0))
  y_0.8neg <- rmvnorm(100, sigma = sigma_0.8neg)
  cor(y_0.8)
         [,1]     [,2]
[1,] 1.000000 0.850337
[2,] 0.850337 1.000000
  cor(y_0)
           [,1]       [,2]
[1,]  1.0000000 -0.2352583
[2,] -0.2352583  1.0000000
  cor(y_0.8neg)
           [,1]       [,2]
[1,]  1.0000000 -0.8351714
[2,] -0.8351714  1.0000000

(Pearson) Correlation - examples

We can test for the significance of these things - the outputs are quite self-explanatory given what we have seen to date.

  cor.test(y_0.8neg[,1], y_0.8neg[,2])

    Pearson's product-moment correlation

data:  y_0.8neg[, 1] and y_0.8neg[, 2]
t = -15.033, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8862127 -0.7640981
sample estimates:
       cor 
-0.8351714 

\(R^2\) - the ubiquitous R-square

  • Can be expressed in many different ways giving different interpretations
  • Two are given here:

\[ R^2=1-\frac{(n-1)^{-1}\sum_{i=1}^n(y_i-\hat{y}_i)^2}{(n-1)^{-1}\sum_{i=1}^n(y_i-\bar{y})^2} = 1-\frac{{\mathrm SSE}/(n-1)}{{\mathrm SST}/(n-1)} \]

  • SSE is the sum of squared errors for your model, SST is the total sum of squares for your data.
  • This formulation shows the interpretation as “the proportion of variance explained by the model” (a hand computation is sketched below)
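A quick sketch of this by hand for the regression fitted earlier (loanLM was fitted with y = TRUE, so the NA-free response is stored in loanLM$y):

  yObs <- loanLM$y
  SSE  <- sum(residuals(loanLM)^2)
  SST  <- sum((yObs - mean(yObs))^2)

  1 - SSE/SST                 # should match the Multiple R-squared of 0.01747
  summary(loanLM)$r.squared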

The vanilla-flavoured \(R^2\)

  • An alternative formulation (which justifies the name \( R^2 \)) is
    \[ R^2=[Corr(\hat{\mathbf{y}}, \mathbf{y})]^2 \]
  • where \( \mathbf{y} \) and \( \hat{\mathbf{y}} \) are the vectors of observed and predicted response values.
  • So it is the square of the correlation between the observed response values and those predicted under the model (checked numerically below)
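A one-line check of this formulation, assuming loanLM from earlier:

  cor(fitted(loanLM), loanLM$y)^2    # should match summary(loanLM)$r.squared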

The vanilla-flavoured \(R^2\)

  • Bounded by 0 and 1,
  • \( R^2=1 \) is a perfect agreement between the model predictions and the observed data
  • The value 1 would indicate that all of the variability in the response is accounted for by the model
  • A value of zero indicates the worst possible fit: no linear relationship between the observed data and the model predictions.

Linear regression model assumptions/checking

Linear regression model assumptions and checking

When we fit a linear model we assume the following:

  • Linearity the relationship is linear
  • Normality the errors are Normally distributed with mean zero
  • Constant spread the errors have the same standard deviation for all values of \( x \)
  • Independent observations the errors are independent

We are only checking two things: the model for signal and the model for noise

Linear regression model assumptions and checking

Reposed as the two things:

  • Signal was the straight line a good model for the signal?
  • Noise does the stuff left over look like independent draws from a single Normal distribution?

The residuals (\( y_i-\hat{y}_i \)) give us clues about the nature of the noise - so we look at these mainly

Using residuals to check linear model assumptions

The residuals from the fitted model are the estimates of the unobserved errors and make a useful tool for checking many of our assumptions

  • Linearity Scatter plot of \( y \) versus \( x \). We want to see a linear trend with constant scatter about the trend. Check for outliers.

Outliers are “weird” observations:

  • a very large residual
  • inconsistent with the model (signal or noise)

They may be genuine - our model may be wrong. They may be mistakes.

Is the noise sufficiently Normal?

As for previous tests/models:

  • Normal shape use Normal quantile-quantile (normal-QQ) plot of the residuals and/or a test for Normality:
    • \( H_0: \) The data come from a Normal distribution
    • \( H_1: \) The data don't come from a Normal distribution
    • small \( p \)-values provide evidence against the null hypothesis - conclude the data isn't Normally distributed when the \( p \)-value is small

Many types of data are obviously a priori non-Normal. See the GLMs in MT5761 for more tools.

Is the noise sufficiently Normal?

NB You may encounter skewness and kurtosis measures.

These two parameters can be used to describe aspects of shape relative to Normality:

  • skewness \( \gamma_1 \): a measure of asymmetry, \( \gamma_1=0 \) for the Normal distribution.
  • kurtosis \( \gamma_2 \): a measure of how peaked the density of the distribution is, \( \gamma_2=3 \) for the Normal distribution.

You should be conversant with the terms left/right skew and lepto-/platykurtosis (too pointy or too flat for a Normal)
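A rough sketch of the simple moment-based versions of these for the loanLM residuals (several slightly different definitions are in use, e.g. in the moments package; this is only illustrative):

  r <- residuals(loanLM)
  z <- (r - mean(r))/sd(r)   # standardised residuals

  mean(z^3)   # skewness: about 0 for a Normal; positive suggests right skew
  mean(z^4)   # kurtosis: about 3 for a Normal; larger means heavier tails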

Using residuals to check linear model assumptions

Similar to the simpler tests/models, we can plot the residuals in various ways to see if independent draws from a single Normal distribution seems plausible

  • Constant spread Scatter plot of residuals versus \( x \) or residuals versus the fitted values (\( \hat{y} \)). We want to see a patternless horizontal band (analogous to looking for homoscedasticity in ANOVA or \( t \)-tests)
  • Independent observations Scatter plot of residuals in serial/some logical order (e.g. order/time of collection?)
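For the independence check, a minimal sketch is simply plotting the residuals in row order; note this assumes the row order approximates the order of collection, which may not hold for the loan data - it is the idea that matters.

  plot(residuals(loanLM), xlab = "Observation order", ylab = "Residual")
  abline(h = 0, lty = 2)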

Assessing the model

First is the shape of the noise Normal?

    qqnorm(residuals(loanLM))
    qqline(residuals(loanLM))

[Normal Q-Q plot of the residuals]

  shapiro.test(residuals(loanLM))

    Shapiro-Wilk normality test

data:  residuals(loanLM)
W = 0.82582, p-value < 2.2e-16

So we reject \( H_0 \), the data are pretty non-Normal. There is marked skewness.

Assessing the model

Is it a single Normal distribution providing the noise?

  plot(loanLM$x[,2], resid(loanLM))

[scatterplot of residuals against VALUE]

  • There is some indication of higher variance with lower \( x \), but nothing shocking.
  • Additionally there isn't any particular pattern, so maybe a straight line was an OK model for the signal

There will be more detailed diagnostics for linear models next week

Assessing the model

You can get several plots with the following for an lm object:

  par(mfrow = c(2,2)) # create a grid for 4 plots
  plot(loanLM)

[2 x 2 grid of the default lm diagnostic plots]

Matrix form for this model

We're often working with models of the form:

\[ y_i = \beta_0 + \beta_1 \times x_i + ... \]

This is long-winded. Matrices help.

Matrix form for this model

This compact expression is really useful:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} \]

Which for our fitted regression would be:

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \end{bmatrix} \]

Which is simply \( \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \) for each observation, but the same notation covers much more complex models in the same space.

Matrix form for this model

  set.seed(345)
  x <- runif(8)
  y <- 2 + x*2 + rnorm(8)

  smallLM <- lm(y ~ x)
  plot(x, y)
  abline(coef(smallLM))

[scatterplot of the simulated x and y with the fitted line]

Matrix form for this model

  XMat <- cbind(rep(1, 8), x)   # a column of 1s (for the intercept) alongside x
  modelEst <- as.vector(coef(smallLM))

  XMat
               x
[1,] 1 0.2162537
[2,] 1 0.2747640
[3,] 1 0.3899251
[4,] 1 0.6557397
[5,] 1 0.4358664
[6,] 1 0.8034841
[7,] 1 0.3856799
[8,] 1 0.8333017
  modelEst
[1] 2.120458 2.763582
  cbind(XMat %*% modelEst, fitted(smallLM))
      [,1]     [,2]
1 2.718093 2.718093
2 2.879791 2.879791
3 3.198048 3.198048
4 3.932649 3.932649
5 3.325011 3.325011
6 4.340952 4.340952
7 3.186316 3.186316
8 4.423356 4.423356
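For completeness, the estimates themselves can also be obtained from the standard closed-form matrix solution \( (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \) (not derived here); a quick check using XMat and y from above:

  solve(t(XMat) %*% XMat) %*% t(XMat) %*% y
  coef(smallLM)    # should agree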

Fitting using likelihood - hand-waving

[does stuff on the board - attending lectures is useful]
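In lieu of the board work, here is a minimal sketch of the likelihood idea, assuming Normal noise and reusing the small simulated x and y from above: choose \( \beta_0 \), \( \beta_1 \) (and a noise standard deviation) to maximise the Normal log-likelihood of the observed responses.

  # negative log-likelihood under y_i ~ Normal(beta0 + beta1*x_i, sigma)
  negLogLik <- function(theta) {
    mu    <- theta[1] + theta[2]*x
    sigma <- exp(theta[3])             # keeps the noise SD positive
    -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
  }

  mleFit <- optim(c(0, 0, 0), negLogLik)
  mleFit$par[1:2]    # should be close to the least-squares estimates
  coef(smallLM)

For Normal noise, the maximum-likelihood estimates of the intercept and slope coincide with the least-squares ones.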

Recap and look-forwards

We've covered:

  • The basic linear regression model
  • The idea of optimising parameters for \( f \) based on an objective function
  • The Ordinary Least Squares (OLS) objective
  • Fitting and interpreting the regression, including tests
  • R-square and correlation
  • Checking assumptions
  • Really basic matrix algebra
  • A hand-wave at likelihood

Next:

  • Linear models
  • Using these to build complex models