2025-11-09

Linear Regression

  • Linear regression is a mathematical model that describes the relationship between two variables and can be used to predict one from the other.

  • The independent variable, x, is called the “predictor” variable.

  • The dependent variable, y, is called the “response” variable.

  • Remember: correlation does not imply causation!

Summary of Data

In this interactive 3-D graph, we can see a general trend of circumference increasing with age. The data are grouped by individual tree. How can we better quantify this trend?

If we had no information about the x variable, the best prediction of a new tree’s circumference would simply be the average of the previous measurements:
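
In R, this baseline is simply the sample mean. A minimal sketch, assuming (as in the model fitted below) that Orange1 contains the Tree #1 measurements:

mean(Orange1$circumference)
## [1] 99.57143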

Here is a proposed linear model for the data from Tree #1. To optimize the fit, the “Sum of Squares” should be minimized: this is the sum of the squared vertical distances from each data point to the proposed “line of best fit”.
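
The sum of squares for any candidate line can be computed directly. A quick sketch, with an intercept and slope picked purely for illustration:

b0 <- 20; b1 <- 0.09                     # hypothetical, eyeballed coefficients
y_hat <- b0 + b1 * Orange1$age           # predicted circumference at each age
sum((Orange1$circumference - y_hat)^2)   # sum of squares for this proposed line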

Equations

The general equation for linear regression can be expressed as:

\[ \displaystyle \widehat{y}_i=\beta_0+\beta_1x_i \] while the formula for the Sum of Squares is: \[ \displaystyle SSE=\sum_{i=1}^{n} (y_i-\widehat{y}_{i})^2 \]

where \(y_i\) is the observed value, \(\widehat{y}_i\) is the predicted value, and \((y_i-\widehat{y}_{i})\) is the difference between them;

\(\beta_0\) is the y-intercept of the linear model, and \(\beta_1\) is the slope.
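
Minimizing the SSE with respect to \(\beta_0\) and \(\beta_1\) (the ordinary least squares solution) gives the closed-form estimates:

\[ \displaystyle \widehat{\beta}_1=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \qquad\qquad \widehat{\beta}_0=\bar{y}-\widehat{\beta}_1\bar{x} \]

where \(\bar{x}\) and \(\bar{y}\) are the sample means of the predictor and response. These are the values R computes below.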

Formula for Linear Regression

The coefficients of the linear regression can be found in the R output below:

model <- lm(circumference ~ age, data = Orange1)  # fit circumference on age for Tree #1
print(model$coefficients)
## (Intercept)         age 
## 24.43784664  0.08147716

The equation for this line would be \(\widehat{y} = 24.44 + 0.08x\).
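
As an illustration (the age of 1000 days is arbitrary), the fitted model can then be used to predict the circumference of a new tree:

predict(model, newdata = data.frame(age = 1000))   # 24.44 + 0.08 * 1000, about 105.9 mm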

Residuals

The residuals for each data point (the difference between the observed value \(y_i\) and the predicted value \(\widehat{y}_i\)) are shown below:

print(model$residuals)
##         1         2         3         4         5         6         7 
## -4.052152 -5.872793  8.461319  8.759084 -4.736232  5.775489 -8.334715
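
As a check, the first residual can be reproduced by hand: assuming Orange1 holds R’s built-in Orange data for Tree #1, the first observation is age 118 days with circumference 30 mm, so

\[ y_1-\widehat{y}_1 = 30 - (24.4378 + 0.08148 \times 118) \approx 30 - 34.05 = -4.05 \]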

Squaring the residuals would give: \[ \begin{align} SSE&=\sum_{i=1}^{n} (y_i-\widehat{y}_{i})^2 \\ &= (-4.05)^2 + (-5.87)^2 + 8.46^2 + 8.76^2 + (-4.74)^2 + 5.78^2 + (-8.33)^2 \\ &\approx 324.4807218 \end{align} \]
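
Equivalently, the SSE can be computed in one line from the stored residuals:

sum(model$residuals^2)
## [1] 324.4807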

Heteroscedasticity

Before applying a linear regression, the data should be checked for heteroscedasticity, or heterogeneity of the variance. A basic assumption of linear regression is that the variance of the errors is constant across x-values (i.e., independent of x) and that the errors are normally distributed. Sometimes, however, larger x-values have larger variance, as seen below in data recording the stopping distance needed at different speeds of cars in the 1920s:
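
A minimal diagnostic sketch, assuming the example uses R’s built-in cars dataset (stopping distances at various speeds, recorded in the 1920s): plotting the residuals against the predictor shows whether the spread grows with x.

cars_model <- lm(dist ~ speed, data = cars)    # stopping distance vs. speed
plot(cars$speed, resid(cars_model),
     xlab = "Speed (mph)", ylab = "Residual (ft)")
abline(h = 0, lty = 2)                         # residuals should scatter evenly around zero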

Confidence Interval

The confidence interval around a linear model can also be calculated. The grey band indicates the 95% confidence interval for the regression of circumference on age for Orange Tree #1.
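
A sketch of how such a plot is typically produced, assuming the ggplot2 package is available and that Orange1 holds the Tree #1 data as above; geom_smooth() draws the fitted line with a grey 95% confidence band by default:

library(ggplot2)
ggplot(Orange1, aes(x = age, y = circumference)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)   # grey band = 95% confidence interval

The same interval can also be computed numerically with predict(model, interval = "confidence").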