What is Linear Regression?

Think of linear regression as drawing the best possible straight line through a cloud of data points. We’re trying to understand how two things relate to each other:

  • Y (dependent variable): What we’re trying to predict or explain
  • X (independent variable): What we think affects Y

The whole point? Find the line that fits the data as well as possible.

The Math Behind It

Here’s what the model looks like mathematically:

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

Let me break this down:

  • \(Y_i\) is the observed response for the i-th data point
  • \(\beta_0\) is where the line crosses the y-axis (the intercept)
  • \(\beta_1\) tells us how steep the line is (the slope)
  • \(X_i\) is the predictor value for that same observation
  • \(\epsilon_i\) is the error term: the random variation we can’t explain, assumed to follow a normal distribution with mean 0 and constant variance
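If the notation feels abstract, a quick simulation makes it concrete. The values below (\(\beta_0 = 2\), \(\beta_1 = 3\), error standard deviation of 2) are made up purely for illustration:

# Simulate 50 points from the model with made-up parameters:
# beta0 = 2 (intercept), beta1 = 3 (slope), errors ~ N(0, sd = 2)
set.seed(42)
x   <- runif(50, min = 0, max = 10)   # predictor X_i
eps <- rnorm(50, mean = 0, sd = 2)    # error term epsilon_i
y   <- 2 + 3 * x + eps                # Y_i = beta0 + beta1*X_i + eps_i

Each y value is the straight-line part plus a random nudge, and that nudge is exactly what \(\epsilon_i\) represents.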

How Do We Find the Best Line?

We use a method called “least squares”, which means we pick the line that minimizes the sum of squared prediction errors. Here’s the formula:

\[SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]
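In code, the quantity we’re minimizing is a one-liner. Here’s a sketch on tiny made-up data, just to show that a line close to the points has a much smaller SSE than a bad one:

# SSE for a candidate line with intercept b0 and slope b1
sse <- function(b0, b1, x, y) sum((y - b0 - b1 * x)^2)

# Tiny made-up data that roughly follows y = 2x
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)

sse(0, 2, x, y)  # close to the pattern: SSE = 0.1
sse(5, 0, x, y)  # flat line at y = 5: SSE = 18.9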

And here’s how we calculate the slope and intercept:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\]
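These formulas translate directly into R. Here’s a sketch on small made-up data, checking the hand-computed estimates against what lm() returns:

# Made-up data
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.3, 9.8)

# Slope and intercept straight from the formulas
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

c(intercept = b0_hat, slope = b1_hat)
coef(lm(y ~ x))  # should give the same two numbers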

Let’s Look at Real Data

I’m using R’s built-in cars dataset, recorded in the 1920s. It shows how fast cars were going (speed, in mph) and how far they traveled before stopping (dist, in feet):

# Load up the cars dataset
data(cars)
head(cars, 5)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
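It’s worth noting the observed ranges up front, since the fitted line shouldn’t be trusted outside them (more on that below):

# Observed ranges of both variables
range(cars$speed)  # 4 to 25 mph
range(cars$dist)   # 2 to 120 ft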

Running the Regression

Now let’s actually fit the model and see what we get:

# Run the regression - predicting distance from speed
model <- lm(dist ~ speed, data = cars)

# Show me the results
round(summary(model)$coefficients, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584 -2.6011   0.0123
## speed         3.9324     0.4155  9.4640   0.0000
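Beyond the point estimates, confidence intervals give a sense of the uncertainty. From the standard errors above, the 95% interval for the slope works out to roughly 3.10 to 4.77 feet per mph:

# 95% confidence intervals for both coefficients
round(confint(model), 3)
##               2.5 %  97.5 %
## (Intercept) -31.168  -3.990
## speed         3.097   4.768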

Visualizing in 3D

Here’s a cool 3D view of what’s happening. You can see the actual data points and how our regression line fits through them:
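A related 3D picture you can draw yourself in base R is the least-squares error surface: SSE plotted over a grid of candidate intercepts and slopes, with the fitted coefficients sitting at the bottom of the bowl. This is a sketch under that assumption, and the grid ranges are arbitrary choices:

# SSE as a surface over candidate (intercept, slope) pairs
b0_grid <- seq(-40, 10, length.out = 60)
b1_grid <- seq(1, 7, length.out = 60)
sse_surface <- outer(b0_grid, b1_grid,
                     Vectorize(function(b0, b1)
                       sum((cars$dist - b0 - b1 * cars$speed)^2)))

# Bowl-shaped surface; the minimum sits at the fitted coefficients
persp(b0_grid, b1_grid, sse_surface,
      theta = 40, phi = 25,
      xlab = "Intercept", ylab = "Slope", zlab = "SSE")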

The Classic Scatter Plot

Here’s the traditional 2D view with our regression line cutting through the data:
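In base R, this view takes just two lines (a minimal sketch; styling choices like color are my own):

# Scatter plot of the data with the fitted regression line
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     pch = 19)
abline(model, col = "blue", lwd = 2)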

Note: The regression line predicts negative stopping distances at very low speeds, so the model should only be trusted within the observed speed range (4-25 mph).

Checking Our Assumptions

This residual plot helps us see if there are any patterns we missed. Ideally, these points should be randomly scattered around zero:
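A minimal sketch of that plot in base R:

# Residuals vs. fitted values; ideally a random band around zero
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals", pch = 19)
abline(h = 0, lty = 2)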

There’s a slight curved pattern here, suggesting the relationship might not be perfectly linear, but it’s close enough for our purposes.

What Does This All Mean?

# Pull out the coefficients
coef(model)
## (Intercept)       speed 
##  -17.579095    3.932409

So our equation is: \(\widehat{\text{dist}} = -17.58 + 3.93 \times \text{speed}\)
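Plugging a speed into this equation is exactly what predict() does. At 20 mph, \(-17.58 + 3.93 \times 20 \approx 61\) feet:

# Predicted stopping distance for a car going 20 mph
predict(model, newdata = data.frame(speed = 20))
##        1 
## 61.06908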

What this tells us:

  • Every time speed goes up by 1 mph, the car needs about 3.93 more feet to stop, on average
  • Our R² is 0.651, which means speed explains about 65% of the variation in stopping distances
  • Both coefficients are statistically significant (p-values well below 0.05), so this relationship is very unlikely to be random chance
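All of these numbers can be pulled straight from the summary object:

# R-squared, straight from the summary
round(summary(model)$r.squared, 3)
## [1] 0.651

# Is the slope's p-value below 0.05? (It rounds to 0.0000 above)
summary(model)$coefficients["speed", "Pr(>|t|)"] < 0.05
## [1] TRUE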

Wrapping Up

Main points to remember:

  • Linear regression helps us understand and predict relationships between variables
  • We find the best line by minimizing squared errors (least squares method)
  • Always check your residuals to make sure your model makes sense
  • In our example, there’s clearly a strong connection between how fast you’re going and how long it takes to stop

Where is this used? Pretty much everywhere - economics, biology, engineering, social sciences, you name it!