Linear regression works

This short document will demonstrate that linear regression indeed works. You will also find out that fitting a linear model doesn't neccessarily mean fitting a straight line through your data. Sometimes models can be nice and curvy, too. If you examine the key objects, you should be able to descipher how to organize your data and how to call proper functions to perform linear regression analysis.

Trivial example for babies

First, let's show a cheap example. We simulate some data, add some noise and try to fit a straight line through it. If you've read any text books on regression, model specification of \( y = \beta_0 + \beta_1 x + \epsilon \) should look familiar. This is the formula we use to simulate data and estimate model parameters. How do we decide which model to use? One way is to look at the data and consider which functions would fit through it. Using fore knowledge of the phenomenon, model diagnostics and visually inspecting the model can guide us in choosing an appropriate model.

set.seed(357)

# values on x axis go from
data.from <- 30
# to
data.to <- 100
n <- 200  # number of data points

# line will pass the y axis at (x = intercept)
intercept <- 30
# and the rate of increase for every x will be
beta1 <- 0.7
# with noise
epsilon <- rnorm(n, mean = 0, sd = 5)

x <- seq(from = data.from, to = data.to, length.out = n)
y <- intercept + beta1 * x + epsilon  # final model with some noise

plot(x, y)

xy <- data.frame(y, x)

mdl <- lm(y ~ x, data = xy)
lines(x, fitted(mdl), lwd = 1)

plot of chunk simple_sim

# abline(mdl, lwd = 2) # for straight lines, abline() is fine
summary(mdl)

## 
## Call:
## lm(formula = y ~ x, data = xy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11.52  -2.73  -0.15   2.80  13.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.2260     1.0997    27.5   <2e-16 ***
## x             0.6953     0.0161    43.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.64 on 198 degrees of freedom
## Multiple R-squared:  0.903,  Adjusted R-squared:  0.903 
## F-statistic: 1.85e+03 on 1 and 198 DF,  p-value: <2e-16


confint(mdl)

##               2.5 %  97.5 %
## (Intercept) 28.0573 32.3947
## x            0.6634  0.7271

First, we set the intercept (\( \beta_0 \)) to 30 and \( \beta_1 \) to 0.7. This means that the \( y=30 \) at \( x=0 \), and the slope of the line will be 0.7, so an increase of 1 on \( x \) will mean a change of 0.7 in \( y \). Line going through the data points represents predicted mean values of this model.

Using linear regression (function lm), we “guessed” the simulated coefficients quite accurately. According to the parameter confidence interval, intercept lies somewhere between 26 and 32 while \( \beta_1 \) lies somewhere between 0.67 and 0.75. True estimate is bound to change a bit, we are simulating random data after all.

This is a simulation in a controlled environment and the “guessing”“ of parameters is rather obvious. But imagine having acces to only the data and based on that information alone, you can find a model that probably produced the data at hand. Based on this we can conclud ethat if the assumptions are met, you can count on linear regression to find the correct parameters. We've just showed that this indeed is the case.

It's important to check the diagnostics to verify that the assumptions of the model hold. This step is usually taken before any interpretation. The upper left part of the below figure should display points like "sky at night”, meaning points should be distributed randomly with no obvious patterns. The red line should be horizontal and straight and as close to \( y=0 \) as possible. Even though we simulated the data, albeit with some noise, the values in QQ plot don't lie exactly on the line, which is to be expected. More noise will add more possibility for discrepancy between the theoretical and actual simulated values. It would be fruitless to expect a “perfect” fit of simulated data. At the end of the day, they are still random. For fun, try changing the amount of noise (\( \epsilon \), epsilon) and see how confidence interval of model parameters change. The result will also depend on the number of points. What happens if your \( n \) is small? Experiment!

par(mfrow = c(2, 2))
plot(mdl)

plot of chunk diag_trivialmodel

Less trivial model

To make things more interesting, we simulate data for a more complex model and try to fit three models. One of the models will be correct. We show that linear regression can fit non-straight lines, too. In statistical terms, our model will be \( y = \beta_0 + \beta_1 x + \beta_2 z^2 + \epsilon \). What we do is actually bend the straight line a bit so that our naive model of a straight line won't fit well.

set.seed(357)  #try other seeds, like 10, 123, 400
data.from <- 0
data.to <- 10
n <- 200  # number of data.points

# range of values for z variable
z.from <- 1
z.to <- 10

# generate 2 explaining covariates
x <- seq(from = data.from, to = data.to, length.out = n)
z <- seq(from = z.from, to = z.to, length.out = n)

intercept <- 30
beta1 <- 10
beta2 <- 3
epsilon <- rnorm(length(x), mean = 0, sd = 200)

y <- intercept + (beta1 * x) + (beta2 * z^3) + epsilon

Plot the data and prepare variables to be used for predicting.

xy <- data.frame(x, y, z)
par(mfrow = c(1, 1))
plot(xy[, c("x", "y")])

plot of chunk unnamed-chunk-2


# This is used to predict the fitted values, see below.
p.x <- seq(from = data.from, to = data.to, length.out = n)
p.z <- seq(from = z.from, to = z.to, length.out = n)

Model 1

mdl1 <- lm(y ~ x, data = xy)
summary(mdl1)

## 
## Call:
## lm(formula = y ~ x, data = xy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -660.1 -259.3  -49.6  239.3  966.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -520.84      48.75   -10.7   <2e-16 ***
## x             286.83       8.43    34.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 346 on 198 degrees of freedom
## Multiple R-squared:  0.854,  Adjusted R-squared:  0.853 
## F-statistic: 1.16e+03 on 1 and 198 DF,  p-value: <2e-16

par(mfrow = c(2, 2))  # have a window ready to accept 4 plots
plot(mdl1)

plot of chunk mdl1


# Predict fitted values and plot them
par(mfrow = c(1, 1))  # set window back to 1x1
plot(xy[, c("x", "y")])

# Based on our model and the independent variable x (p.x), we predict
# values of y.  Prediction data need to be organized exactly like the data
# in our original data.frame `xy`
fit.mdl1 <- predict(mdl1, newdata = data.frame(x = p.x))
fit.mdl1 <- data.frame(p.x, y = fit.mdl1)  # we overwrite the calculated y values by appending them to the independent variable x

lines(fit.mdl1, lwd = 2, col = "red")

plot of chunk mdl1

Notice the pattern in residuals in the upper left par of the diagnostic figure. This suggests that our model may not fit the data well and warrants another more curvy approach. This is a good school example how visually inspecting the model fit can reveal the connection between the predicted mean values of the model and observations. This is visually harder to do for models in many dimensions, in which case we rely on plotting residuals against the fit.

Model 2

mdl2 <- lm(y ~ I(x^2), data = xy)
summary(mdl2)

## 
## Call:
## lm(formula = y ~ I(x^2), data = xy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -462.8 -126.1   -2.3  144.4  509.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -65.624     21.293   -3.08   0.0023 ** 
## I(x^2)        29.295      0.474   61.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 201 on 198 degrees of freedom
## Multiple R-squared:  0.951,  Adjusted R-squared:  0.95 
## F-statistic: 3.81e+03 on 1 and 198 DF,  p-value: <2e-16


par(mfrow = c(2, 2))
plot(mdl2)

plot of chunk mdl2


par(mfrow = c(1, 1))
plot(xy[, c("x", "y")])

fit.mdl2 <- predict(mdl2, newdata = data.frame(x = p.x))
fit.mdl2 <- data.frame(x = p.x, y = fit.mdl2)
lines(fit.mdl2, lwd = 2, col = "green")

plot of chunk mdl2

Pattern of residuals has improved, but it's still a slight pattern, especially for low and high \( x \) values. For demonstrational purposes, we try on a true model.

Model 3

mdl3 <- lm(y ~ x + I(z^3), data = xy)
summary(mdl3)

## 
## Call:
## lm(formula = y ~ x + I(z^3), data = xy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -461.0 -113.9    0.2  112.7  529.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.161     35.379    0.23    0.818    
## x             21.314     12.796    1.67    0.097 .  
## I(z^3)         2.864      0.129   22.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185 on 197 degrees of freedom
## Multiple R-squared:  0.958,  Adjusted R-squared:  0.958 
## F-statistic: 2.26e+03 on 2 and 197 DF,  p-value: <2e-16


par(mfrow = c(2, 2))
plot(mdl3)

plot of chunk mdl3


par(mfrow = c(1, 1))
plot(xy[, c("x", "y")])
fit.mdl3 <- predict(mdl3, newdata = data.frame(x = p.x, z = p.z))
fit.mdl3 <- data.frame(x = p.x, z = p.z, y = fit.mdl3)
lines(fit.mdl3[, c("x", "y")], lwd = 2, col = "blue")

plot of chunk mdl3

Notice that the pattern in residuals has practically disappeared. QQ plot looks quite good as well. This is how the diagnostics look in a text book example. Real world problems have a nack for not fitting well, plagued by small \( n \) and often times poor experimental design. This is where mixed effects models come into play (a topic for another post).

Comparing three models, we note that by each model, fit improved as evident by the lowering of AIC values. Model 3 has the best fit as it is the true model we used to generate the data. This can also be seen in the below figure where we plot the fitted values for each model.

AIC(mdl1, mdl2, mdl3)

##      df  AIC
## mdl1  3 2910
## mdl2  3 2693
## mdl3  4 2662

For easier inspection, we plot all thee fitted values on one graph.

plot(xy[, c("x", "y")])
true.mdl <- data.frame(p.x = x, y = intercept + (beta1 * x) + (beta2 * z^3))
lines(true.mdl, col = "black", lwd = 2)  # true line with no noise
lines(fit.mdl1, lwd = 1, col = "red")
lines(fit.mdl2, lwd = 1, col = "green")
lines(fit.mdl3[, c("x", "y")], lwd = 1, col = "blue")
legend("topleft", legend = c("True model", "Model 1", "Model 2", "Model 3"), 
    col = c("black", "red", "green", "blue"), lty = "solid", box.col = NA)

plot of chunk plotall

We notice that the worst fit is by model 1 (red line). It doesn't follow the curve at all. Second best model would be model 2 (green line) and the best fit is exhibited by model 3 (blue line). Blue and black lines overlap a lot, indicating our model predicts the true model quite well.

To help us show how accurate our estimates are, intervals can be used. There are two types of intervals (see my previous post), and we plot them both for our best model. For comparison, we also add the two less optimal models. If the true model falls within the confidence interval, in our case 95% CI, we can consider our model to be useful. Prediction interval will be used if we were interested in where future observations are bound to fall.

na.x <- data.frame(x = p.x, z = p.z)
int.conf <- predict(mdl3, newdata = na.x, interval = "confidence", level = 0.95)
int.conf <- data.frame(na.x, int.conf)

int.pred <- predict(mdl3, newdata = na.x, interval = "prediction", level = 0.95)
int.pred <- data.frame(na.x, int.pred)

library(ggplot2)
suppressMessages(library(gridExtra))

fig1 <- 
  ggplot(int.conf, aes(x = x, y = fit)) +
  theme_bw() +
  ggtitle("Confidence interval") +
  geom_point(data = xy, aes(x = x, y = y)) +
  geom_line(data = true.mdl, aes(x = p.x, y = y), colour = "black", linetype = "dashed") +
  geom_smooth(data = int.conf, aes(ymin = lwr, ymax = upr), stat = "identity") +
  geom_line(data = fit.mdl1, aes(x = p.x, y = y), colour = "red") +
  geom_line(data = fit.mdl2, aes(x = p.x, y = y), colour = "green")

fig2 <- ggplot(int.pred, aes(x = x, y = fit)) +
  theme_bw() +
  ggtitle("Prediction interval") +
  geom_point(data = xy, aes(x = x, y = y)) +
  geom_smooth(data = int.pred, aes(ymin = lwr, ymax = upr), stat = "identity")

grid.arrange(fig1, fig2, ncol = 2)

plot of chunk unnamed-chunk-3

Using this simple simulation example, examine how number of observations, error term and model specification influence fit and its error. How would you present such results?