C. Donovan
The simple linear regression model has the form:
\[ \begin{align*} \textrm{response variable} =& \textrm{intercept} + \textrm{slope} \times \textrm{explanatory variable} (+ \textrm{noise})\\ y=&\beta_0+\beta_1 x\\ \end{align*} \]
So there is a simple model for the signal/underlying non-random component. In practice our observations will deviate from this.
Hence there are two numbers (parameters) that describe a linear model: the intercept \( \beta_0 \) and the slope \( \beta_1 \).
Where should the line lie?
We want to establish a criterion for “best” fit to the data
For example - where should the line go?
[examples and drawing ensue]
We may not be able to immediately agree, but we can agree on what is a badly fitting line.
So, we've settled on a criterion - Ordinary Least Squares (OLS). We look to minimise discrepancies between what we observed and what the model predicts, i.e. we aim for small \( (y_i-\hat{y}_i) \):
\[ \min \left(\sum_i (y_i-\hat{y}_i)^2 \right) \]
This idea is used all over the place. We are minimising the sum-of-squares (SS)
How do we change SS?
\( \hat{y} \) is generally a function of our bunch of \( x \) values and some parameters \( \boldsymbol{\theta} \):
\[ \hat{y} = f(x_1, \ldots, \boldsymbol{\theta}) \]
How do we change SS? By changing the parameters \( \boldsymbol{\theta} \): fitting the model means finding the parameter values that make the SS as small as possible.
I can't overstate the importance of this general process: linear models, generalised linear models, non-linear models, neural networks - in fact just about any model fitted to data is estimated by minimising some measure of discrepancy like this.
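To see the process in one go, here is a minimal sketch (not from the lecture; the simulated data and the name ssFun are mine): write the SS as a function of a candidate intercept and slope, hand it to a general-purpose numerical optimiser, and it lands essentially where R's lm() (used below) does.
set.seed(1)
xSim <- runif(50)
ySim <- 2 + 3 * xSim + rnorm(50)
# sum-of-squares for a candidate (intercept, slope) pair
ssFun <- function(theta) sum((ySim - (theta[1] + theta[2] * xSim))^2)
optim(c(0, 0), ssFun)$par   # numerical minimisation of the SS
coef(lm(ySim ~ xSim))       # essentially the same estimates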
The intercept and slope estimates (\( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) respectively) can be found, by hand, using:
\[ \hat{\beta}_1=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} \]
\[ \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x} \]
Of course, you'd use a computer.
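As a quick check of those formulas, here is a sketch on a made-up toy dataset (xToy and yToy are mine); the by-hand estimates agree with lm().
xToy <- c(1, 2, 3, 4, 5)
yToy <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# slope and intercept from the formulas above
b1 <- sum((xToy - mean(xToy)) * (yToy - mean(yToy))) / sum((xToy - mean(xToy))^2)
b0 <- mean(yToy) - b1 * mean(xToy)
c(b0, b1)
coef(lm(yToy ~ xToy))   # same estimates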
We'll fit a wee regression here. Using the loan default data again, we'll model the debt-to-income ratio DEBTINC as a function of the value of the property VALUE.
library(readr)   # for read_csv
loanData <- read_csv("data/hmeq.csv")
# x = T and y = T store the design matrix and response in the fitted object (used later)
loanLM <- lm(DEBTINC ~ VALUE, data = loanData, x = T, y = T)
plot(DEBTINC ~ VALUE, data = loanData)
summary(loanLM)
Call:
lm(formula = DEBTINC ~ VALUE, data = loanData, x = T, y = T)
Residuals:
Min 1Q Median 3Q Max
-33.335 -4.669 0.954 5.167 168.918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.175e+01 2.585e-01 122.823 <2e-16 ***
VALUE 1.997e-05 2.193e-06 9.103 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.491 on 4660 degrees of freedom
(1298 observations deleted due to missingness)
Multiple R-squared: 0.01747, Adjusted R-squared: 0.01726
F-statistic: 82.86 on 1 and 4660 DF, p-value: < 2.2e-16
Easy aye?
You'll see we get \( p \)-values in the output
These imply tests, although none were explicit in our model specification
We get them for “free” and they are often relevant
Is the debt-to-income ratio really related to property value?
If \( H_0 \) were true, we'd expect \( t \)-test statistics between -2 and 2 a lot of the time. Here we have 9.1, so the \( p \)-value is correspondingly very small.
We conclude there is a significant linear relationship between \( x \) and \( y \)
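Incidentally, if you want the estimates, standard errors, \( t \) values and \( p \)-values as numbers rather than reading them off the printed summary, the coefficient table can be extracted directly:
coef(summary(loanLM))   # matrix of estimates, SEs, t values and p-values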
We can get confidence intervals for the parameters quite easily too:
confint(loanLM)
2.5 % 97.5 %
(Intercept) 3.124005e+01 3.225352e+01
VALUE 1.566567e-05 2.426589e-05
So, for every 100,000 increase in value of the property, average debt-to-income ratio increases by about 1.57 to 2.43 (I'm playing with the units for interpretability).
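The "playing with units" is just rescaling the slope interval by 100,000; a quick check:
confint(loanLM)["VALUE", ] * 1e5   # CI for the change in DEBTINC per 100,000 of VALUE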
Note the intercept is pretty meaningless (why?)
You'll observe in the summary output above:
A “multiple R-squared” of 0.01747. These are commonplace and require discussion.
For simple regression this is the square of the (Pearson) correlation coefficient \( r \), a measure of linear association between two numeric variables. \( r \) takes values between \( -1 \) and \( +1 \): values near \( -1 \) or \( +1 \) indicate a strong (negative or positive) linear relationship, while values near 0 indicate little linear relationship. For example:
library(mvtnorm)   # for simulating from a multivariate Normal
# 2x2 correlation matrices: correlation 0.8 and -0.8
sigma_0.8 <- matrix(c(1, 0.8, 0.8, 1), ncol = 2)
sigma_0.8neg <- matrix(c(1, -0.8, -0.8, 1), ncol = 2)
# 100 bivariate Normal draws each: strong positive, zero, strong negative correlation
y_0.8 <- rmvnorm(100, sigma = sigma_0.8)
y_0 <- rmvnorm(100, mean = c(0,0))
y_0.8neg <- rmvnorm(100, sigma = sigma_0.8neg)
cor(y_0.8)
[,1] [,2]
[1,] 1.000000 0.850337
[2,] 0.850337 1.000000
cor(y_0)
[,1] [,2]
[1,] 1.0000000 -0.2352583
[2,] -0.2352583 1.0000000
cor(y_0.8neg)
[,1] [,2]
[1,] 1.0000000 -0.8351714
[2,] -0.8351714 1.0000000
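The slides presumably show scatterplots of these simulated datasets; a sketch to reproduce them (the panel titles are mine):
par(mfrow = c(1, 3))   # three panels side by side
plot(y_0.8, xlab = "x", ylab = "y", main = "correlation about 0.8")
plot(y_0, xlab = "x", ylab = "y", main = "correlation about 0")
plot(y_0.8neg, xlab = "x", ylab = "y", main = "correlation about -0.8")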
We can test for the significance of these things - the outputs are quite self-explanatory given what we have seen to date.
cor.test(y_0.8neg[,1], y_0.8neg[,2])
Pearson's product-moment correlation
data: y_0.8neg[, 1] and y_0.8neg[, 2]
t = -15.033, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.8862127 -0.7640981
sample estimates:
cor
-0.8351714
\[ R^2=1-\frac{(n-1)^{-1}\sum_{i=1}^n(y_i-\hat{y}_i)^2}{(n-1)^{-1}\sum_{i=1}^n(y_i-\bar{y})^2} = 1-\frac{{\mathrm SSE}/(n-1)}{{\mathrm SST}/(n-1)} \]
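As a sanity check, here is a sketch computing this directly for the loan model (it relies on having fitted with y = T, so the response actually used in the fit is stored in the object):
yObs <- loanLM$y   # the DEBTINC values used in the fit (after dropping NAs)
1 - sum(resid(loanLM)^2) / sum((yObs - mean(yObs))^2)   # R^2 by the formula above
summary(loanLM)$r.squared                               # matches the printed value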
When we fit a linear model we assume the following: the relationship between \( x \) and \( y \) is linear, and the noise terms are independent, have constant variance and are Normally distributed.
We are only checking two things: the model for the signal and the model for the noise.
Reposed as two questions: is a straight line a sensible description of the signal? Does the noise look like independent draws from a single Normal distribution?
The residuals (\( y_i-\hat{y}_i \)) give us clues about the nature of the noise - so we look mainly at these.
The residuals from the fitted model are the estimates of the unobserved errors and are a useful tool for checking many of our assumptions.
Outliers are “weird” observations:
They may be genuine - our model may be wrong. They may be mistakes.
As for previous tests/models, many types of data are obviously a priori non-Normal. See the GLMs in MT5761 for more tools.
NB You may encounter skewness and kurtosis measures.
These two parameters can be used to describe aspects of shape relative to Normality:
You should be conversant with the terms left/right skew and lepto-/platy-kurtosis (too pointy or too flat relative to a Normal)
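Base R doesn't compute these directly; here is a sketch using the moment definitions (the function names are mine, and packages such as moments or e1071 provide ready-made versions).
sampleSkewness <- function(z) {
  m <- mean(z)
  mean((z - m)^3) / mean((z - m)^2)^(3/2)
}
sampleExcessKurtosis <- function(z) {
  m <- mean(z)
  mean((z - m)^4) / mean((z - m)^2)^2 - 3
}
sampleSkewness(residuals(loanLM))        # positive values suggest right skew
sampleExcessKurtosis(residuals(loanLM))  # positive values suggest too pointy (leptokurtic)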
Similar to the simpler tests/models, we can plot the residuals in various ways to see if independent draws from a single Normal distribution seem plausible.
First: is the shape of the noise Normal?
qqnorm(residuals(loanLM))   # residual quantiles against Normal quantiles
qqline(residuals(loanLM))   # reference line through the quartiles
shapiro.test(residuals(loanLM))   # formal test of Normality
Shapiro-Wilk normality test
data: residuals(loanLM)
W = 0.82582, p-value < 2.2e-16
So we reject \( H_0 \): the residuals are pretty non-Normal, with marked skewness.
Is it a single Normal distribution providing the noise?
plot(loanLM$x[,2], resid(loanLM))   # residuals against VALUE (x = T stored the design matrix)
There will be more detailed diagnostics for linear models next week.
You can get several plots with the following for an lm object:
par(mfrow = c(2,2)) # create a grid for 4 plots
plot(loanLM)
We're often working with models of the form:
\[ y_i = \beta_0 + \beta_1 \times x_i + ... \]
This is long-winded. Matrices help.
This compact expression is really useful:
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} \]
Which for our fitted regression would be:
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \end{bmatrix} \]
Which is simply \( \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \) for each \( i \), but the same compact notation can cover much more complicated models in the same space.
set.seed(345)
# a small simulated dataset and its fitted regression
x <- runif(8)
y <- 2 + x*2 + rnorm(8)
smallLM <- lm(y ~ x)
plot(x, y)
abline(coef(smallLM))
# design matrix: a column of 1s (intercept) alongside the x values
XMat <- cbind(rep(1, 8), x)
modelEst <- as.vector(coef(smallLM))
XMat
x
[1,] 1 0.2162537
[2,] 1 0.2747640
[3,] 1 0.3899251
[4,] 1 0.6557397
[5,] 1 0.4358664
[6,] 1 0.8034841
[7,] 1 0.3856799
[8,] 1 0.8333017
modelEst
[1] 2.120458 2.763582
cbind(XMat %*% modelEst, fitted(smallLM))
[,1] [,2]
1 2.718093 2.718093
2 2.879791 2.879791
3 3.198048 3.198048
4 3.932649 3.932649
5 3.325011 3.325011
6 4.340952 4.340952
7 3.186316 3.186316
8 4.423356 4.423356
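Not in the notes, but it ties the OLS criterion and the matrix form together: the least-squares estimates solve the normal equations \( \mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y} \), which we can verify here as a sketch.
solve(t(XMat) %*% XMat, t(XMat) %*% y)   # solve the normal equations directly
coef(smallLM)                            # same numbers as lm()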
[does stuff on the board - attending lectures is useful]
We've covered: the simple linear regression model, fitting by ordinary least squares, interpreting the estimates, tests and confidence intervals for the parameters, \( R^2 \) and correlation, checking assumptions via the residuals, and the matrix form of the model.
Next: linear models with more explanatory variables and more detailed diagnostics.