Simple Linear Regression

11/15/2025

What is Simple Linear Regression?

Simple Linear Regression is a statistical method that models the relationship between:

One independent variable (X) - the predictor
One dependent variable (Y) - the response

Goal: Find the best-fitting straight line through the data points

Applications:

Predicting sales based on advertising spend
Estimating house prices based on square footage
Forecasting temperature based on altitude

The Linear Model

The mathematical equation for simple linear regression is:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

Where:

\(Y\) = dependent variable (what we’re predicting)
\(X\) = independent variable (predictor)
\(\beta_0\) = y-intercept (expected value of Y when X = 0)
\(\beta_1\) = slope (expected change in Y for a one-unit increase in X)
\(\epsilon\) = error term (random variation in Y not explained by X)
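
To make the roles of the terms concrete, the model can be simulated with known parameters and then fitted. This is only a sketch: the parameter values (intercept 2, slope 3, error standard deviation 1), the sample size, and the seed are illustrative assumptions, not values from the example later in this deck.

```r
# Simulate data from Y = beta0 + beta1 * X + epsilon with known parameters.
# All numeric values here are illustrative assumptions.
set.seed(42)
n     <- 200
beta0 <- 2   # true intercept
beta1 <- 3   # true slope

x       <- runif(n, min = 0, max = 10)   # predictor
epsilon <- rnorm(n, mean = 0, sd = 1)    # random error term
y       <- beta0 + beta1 * x + epsilon   # response

# Fit a straight line; the estimates should land close to 2 and 3
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimates will not equal the true values exactly, because the error term adds noise to every observation.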

Estimating Parameters

We estimate \(\beta_0\) and \(\beta_1\) using the Least Squares Method, which minimizes the sum of squared residuals:

\[\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

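Here \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) is the fitted value for observation \(i\), and \(\bar{x}\) and \(\bar{y}\) are the sample means. Minimizing SSE gives closed-form estimates: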

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\]
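
These two formulas can be evaluated directly in R. The sketch below applies them to the built-in cars dataset (introduced in the next section) and cross-checks the result against lm(). The variable names b0 and b1 are chosen here for illustration.

```r
# Least squares estimates computed from the closed-form formulas
x <- cars$speed
y <- cars$dist

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

c(intercept = b0, slope = b1)

# Cross-check: lm() should return the same two numbers
coef(lm(dist ~ speed, data = cars))
```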

Example Dataset: Car Speed and Stopping Distance

Let’s analyze the relationship between car speed and stopping distance using R’s built-in cars dataset, which contains 50 observations of speed (in mph) and dist (stopping distance in feet).
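
The listing that follows shows the first ten rows of the data. A call along these lines would print it (the dim() call is an extra check added here):

```r
# Size of the dataset, then its first ten rows
dim(cars)
head(cars, 10)
```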

##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17

Question: Can we predict stopping distance based on speed?

Scatter Plot with Regression Line

Observation: There’s a positive, roughly linear relationship between speed and stopping distance: faster cars tend to need more distance to stop.
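
A scatter plot with the fitted line overlaid can be drawn in base R roughly as follows. The axis labels, title, and colors are choices made here for illustration.

```r
# Fit the line first so it can be overlaid on the scatter plot
model <- lm(dist ~ speed, data = cars)

# Scatter plot of the raw data
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed",
     pch  = 19)

# Overlay the least squares line
abline(model, col = "blue", lwd = 2)
```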

R Code: Fitting the Model

# Fit the linear regression model
model <- lm(dist ~ speed, data = cars)

# Display the model summary
summary(model)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
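
Individual numbers from this summary can also be pulled out programmatically. The sketch below extracts the coefficient table, confirms that the t value for speed is simply the estimate divided by its standard error, and requests 95% confidence intervals for both coefficients.

```r
model <- lm(dist ~ speed, data = cars)

# Coefficient table as a numeric matrix
coef_table <- coef(summary(model))
coef_table

# The t value is the estimate divided by its standard error
t_speed <- coef_table["speed", "Estimate"] / coef_table["speed", "Std. Error"]
t_speed

# 95% confidence intervals for the intercept and slope
confint(model, level = 0.95)
```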

Model Results

Our fitted regression equation is:

\[\hat{Y} = -17.58 + 3.93 \times X\]

Interpretation:

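Intercept (\(\hat{\beta}_0 = -17.58\)): Predicted stopping distance when speed is 0 mph. A negative distance is not physically meaningful; a speed of 0 lies outside the observed range (4 to 25 mph), so the intercept should not be read literally
Slope (\(\hat{\beta}_1 = 3.93\)): For every 1 mph increase in speed, stopping distance increases by about 3.93 feet on average
R² = 0.651: About 65.1% of the variation in stopping distance is explained by speed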
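
The R² value can be reproduced from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, with variable names chosen here:

```r
model <- lm(dist ~ speed, data = cars)

sse <- sum(residuals(model)^2)                 # unexplained variation
sst <- sum((cars$dist - mean(cars$dist))^2)    # total variation in dist

r_squared <- 1 - sse / sst
r_squared

# With a single predictor, R-squared equals the squared correlation
cor(cars$speed, cars$dist)^2
```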

Residual Analysis
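
A residuals-versus-fitted plot for the cars model can be sketched as follows. The plot styling is an illustrative choice. The last lines recompute the residual standard error reported in the model summary.

```r
model <- lm(dist ~ speed, data = cars)
res   <- residuals(model)

# Residuals against fitted values; a patternless band around zero
# is what the linearity and constant-variance assumptions predict
plot(fitted(model), res,
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs. fitted",
     pch  = 19)
abline(h = 0, lty = 2)

# Residual standard error: sqrt(SSE / residual degrees of freedom)
rse <- sqrt(sum(res^2) / df.residual(model))
rse
```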

Interactive 3D Visualization

Making Predictions

Using our model, we can predict stopping distances for new speeds:

# Predict stopping distance for speeds 15, 20, 25 mph
new_speeds <- data.frame(speed = c(15, 20, 25))
predictions <- predict(model, new_speeds, interval = "prediction")
cbind(new_speeds, predictions)

##   speed      fit      lwr       upr
## 1    15 41.40704 10.17482  72.63925
## 2    20 61.06908 29.60309  92.53507
## 3    25 80.73112 48.48730 112.97495

Example: A car traveling at 20 mph is predicted to have a stopping distance of about 61 feet, with a 95% prediction interval of roughly 30 to 93 feet.
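
The interval that predict() reports can be reproduced from the standard prediction interval formula for simple linear regression, \(\hat{y}_0 \pm t_{0.975,\,n-2} \cdot s \sqrt{1 + 1/n + (x_0 - \bar{x})^2 / \sum_i (x_i - \bar{x})^2}\). The sketch below does this for a speed of 20 mph. The variable names are chosen here for illustration.

```r
model <- lm(dist ~ speed, data = cars)

x0   <- 20                           # new speed (mph)
n    <- nrow(cars)
s    <- sigma(model)                 # residual standard error
xbar <- mean(cars$speed)
sxx  <- sum((cars$speed - xbar)^2)

# Point prediction: intercept + slope * x0
fit <- sum(coef(model) * c(1, x0))

# Standard error of a single new observation at x0
se_pred <- s * sqrt(1 + 1 / n + (x0 - xbar)^2 / sxx)
t_crit  <- qt(0.975, df = n - 2)     # two-sided 95% critical value

manual <- c(fit = fit,
            lwr = fit - t_crit * se_pred,
            upr = fit + t_crit * se_pred)
manual   # should agree with the speed = 20 row from predict()
```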

Key Assumptions of Linear Regression

For valid inference, we need:

Linearity: The mean of Y is a linear function of X
Independence: Observations are independent of one another
Homoscedasticity: Errors have constant variance at every value of X
Normality: Errors are normally distributed

Check these assumptions using:

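Residual plots (residuals vs. fitted values, for linearity and constant variance)
Q-Q plots (for normality)
Statistical tests (Shapiro-Wilk for normality, Breusch-Pagan for constant variance)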
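
The Q-Q plot and the two tests could be run on the cars model roughly as follows. shapiro.test() ships with base R. The Breusch-Pagan test is assumed here to come from the add-on lmtest package (function bptest()), so that step is skipped when the package is not installed.

```r
model <- lm(dist ~ speed, data = cars)
res   <- residuals(model)

# Normal Q-Q plot: points near the line suggest approximate normality
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of normality on the residuals
sw <- shapiro.test(res)
sw

# Breusch-Pagan test for non-constant variance.
# Assumption: the lmtest package is available; otherwise this is skipped.
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(model))
}
```

For both tests, a small p-value is evidence against the corresponding assumption. A large p-value does not prove that the assumption holds.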

Summary

Simple Linear Regression is a powerful tool for:

Understanding relationships between variables
Making predictions
Quantifying the strength of associations

Remember:

Correlation ≠ Causation
Always check model assumptions
Be cautious when extrapolating beyond the range of the observed data
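
One simple guard against accidental extrapolation is to compare new predictor values with the range seen in the data. In this sketch, the 40 mph entry is a hypothetical value added to show an out-of-range case.

```r
# Observed range of the predictor in the cars data
speed_range <- range(cars$speed)
speed_range

# New speeds to predict for; 40 mph is a hypothetical out-of-range value
new_speeds <- c(15, 20, 25, 40)

# TRUE where a prediction would extrapolate beyond the observed speeds
outside <- new_speeds < speed_range[1] | new_speeds > speed_range[2]
data.frame(speed = new_speeds, extrapolating = outside)
```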

Next Steps: Multiple regression, polynomial regression, logistic regression