What is Linear Regression?

Think of linear regression as drawing the best possible straight line through a cloud of data points. We’re trying to understand how two things relate to each other:

  • Y (dependent variable): What we’re trying to predict or explain
  • X (independent variable): What we think affects Y

The whole point? Find the line that fits the data as well as possible.

The Math Behind It

Here’s what the model looks like mathematically:

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

Let me break this down:

  • \(Y_i\) is the observed response for the i-th data point
  • \(\beta_0\) is where the line crosses the y-axis (the intercept)
  • \(\beta_1\) tells us how steep the line is (the slope)
  • \(X_i\) is the predictor value for that same observation
  • \(\epsilon_i\) is the error term: the random variation we can’t explain, assumed to follow a normal distribution with mean 0 and constant variance
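If the notation feels abstract, a quick simulation makes it concrete. The values below (\(\beta_0 = 2\), \(\beta_1 = 3\), error standard deviation of 2) are made up purely for illustration:

# Simulate 50 points from the model with made-up parameters:
# beta0 = 2 (intercept), beta1 = 3 (slope), errors ~ N(0, sd = 2)
set.seed(42)
x   <- runif(50, min = 0, max = 10)   # predictor X_i
eps <- rnorm(50, mean = 0, sd = 2)    # error term epsilon_i
y   <- 2 + 3 * x + eps                # Y_i = beta0 + beta1*X_i + eps_i

Each y value is the straight-line part plus a random nudge, and that nudge is exactly what \(\epsilon_i\) represents.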

How Do We Find the Best Line?

We use a method called “least squares”, which means we pick the line that minimizes the sum of squared prediction errors. Here’s the formula:

\[SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]
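In code, the quantity we’re minimizing is a one-liner. Here’s a sketch on tiny made-up data, just to show that a line close to the points has a much smaller SSE than a bad one:

# SSE for a candidate line with intercept b0 and slope b1
sse <- function(b0, b1, x, y) sum((y - b0 - b1 * x)^2)

# Tiny made-up data that roughly follows y = 2x
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)

sse(0, 2, x, y)  # close to the pattern: SSE = 0.1
sse(5, 0, x, y)  # flat line at y = 5: SSE = 18.9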

And here’s how we calculate the slope and intercept:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\]
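These formulas translate directly into R. Here’s a sketch on small made-up data, checking the hand-computed estimates against what lm() returns:

# Made-up data
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.3, 9.8)

# Slope and intercept straight from the formulas
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

c(intercept = b0_hat, slope = b1_hat)
coef(lm(y ~ x))  # should give the same two numbers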

Let’s Look at Real Data

I’m using R’s built-in cars dataset, recorded in the 1920s. It shows how fast cars were going (speed, in mph) and how far they traveled before stopping (dist, in feet):

# Load up the cars dataset
data(cars)
head(cars, 5)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
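It’s worth noting the observed ranges up front, since the fitted line shouldn’t be trusted outside them (more on that below):

# Observed ranges of both variables
range(cars$speed)  # 4 to 25 mph
range(cars$dist)   # 2 to 120 ft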

Running the Regression

Now let’s actually fit the model and see what we get:

# Run the regression - predicting distance from speed
model <- lm(dist ~ speed, data = cars)

# Show me the results
round(summary(model)$coefficients, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584 -2.6011   0.0123
## speed         3.9324     0.4155  9.4640   0.0000
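Beyond the point estimates, confidence intervals give a sense of the uncertainty. From the standard errors above, the 95% interval for the slope works out to roughly 3.10 to 4.77 feet per mph:

# 95% confidence intervals for both coefficients
round(confint(model), 3)
##               2.5 %  97.5 %
## (Intercept) -31.168  -3.990
## speed         3.097   4.768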

Visualizing in 3D

Here’s a cool 3D view of what’s happening. You can see the actual data points and how our regression line fits through them:
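A related 3D picture you can draw yourself in base R is the least-squares error surface: SSE plotted over a grid of candidate intercepts and slopes, with the fitted coefficients sitting at the bottom of the bowl. This is a sketch under that assumption, and the grid ranges are arbitrary choices:

# SSE as a surface over candidate (intercept, slope) pairs
b0_grid <- seq(-40, 10, length.out = 60)
b1_grid <- seq(1, 7, length.out = 60)
sse_surface <- outer(b0_grid, b1_grid,
                     Vectorize(function(b0, b1)
                       sum((cars$dist - b0 - b1 * cars$speed)^2)))

# Bowl-shaped surface; the minimum sits at the fitted coefficients
persp(b0_grid, b1_grid, sse_surface,
      theta = 40, phi = 25,
      xlab = "Intercept", ylab = "Slope", zlab = "SSE")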

The Classic Scatter Plot

Here’s the traditional 2D view with our regression line cutting through the data:
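In base R, this view takes just two lines (a minimal sketch; styling choices like color are my own):

# Scatter plot of the data with the fitted regression line
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     pch = 19)
abline(model, col = "blue", lwd = 2)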

Note: The regression line predicts negative stopping distances at very low speeds, so the model should only be trusted within the observed speed range (4-25 mph).

Checking Our Assumptions

This residual plot helps us see if there are any patterns we missed. Ideally, these points should be randomly scattered around zero:
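A minimal sketch of that plot in base R:

# Residuals vs. fitted values; ideally a random band around zero
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals", pch = 19)
abline(h = 0, lty = 2)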

There’s a slight curved pattern here, suggesting the relationship might not be perfectly linear, but it’s close enough for our purposes.

What Does This All Mean?

# Pull out the coefficients
coef(model)
## (Intercept)       speed 
##  -17.579095    3.932409

So our equation is: \(\widehat{\text{dist}} = -17.58 + 3.93 \times \text{speed}\)
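Plugging a speed into this equation is exactly what predict() does. At 20 mph, \(-17.58 + 3.93 \times 20 \approx 61\) feet:

# Predicted stopping distance for a car going 20 mph
predict(model, newdata = data.frame(speed = 20))
##        1 
## 61.06908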

What this tells us:

  • Every time speed goes up by 1 mph, the car needs about 3.93 more feet to stop, on average
  • Our R² is 0.651, which means speed explains about 65% of the variation in stopping distances
  • Both coefficients are statistically significant (p-values well below 0.05), so this relationship is very unlikely to be random chance
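All of these numbers can be pulled straight from the summary object:

# R-squared, straight from the summary
round(summary(model)$r.squared, 3)
## [1] 0.651

# Is the slope's p-value below 0.05? (It rounds to 0.0000 above)
summary(model)$coefficients["speed", "Pr(>|t|)"] < 0.05
## [1] TRUE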

Wrapping Up

Main points to remember:

  • Linear regression helps us understand and predict relationships between variables
  • We find the best line by minimizing squared errors (least squares method)
  • Always check your residuals to make sure your model makes sense
  • In our example, there’s clearly a strong connection between how fast you’re going and how long it takes to stop

Where is this used? Pretty much everywhere - economics, biology, engineering, social sciences, you name it!