Simple Linear Regression

What is Simple Linear Regression?

Simple linear regression is a tool for describing the relationship between two variables. It models the relationship between:

A response variable \(Y\) (what we want to predict)
A predictor variable \(X\) (what we use to predict)

The goal is to find the best-fitting straight line through the data. Linear regression is a supervised learning method: it needs labeled training data (observed pairs of \(X\) and \(Y\)) and is fit by minimizing prediction error.

The Model

The simple linear regression model is:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

Where:

\(\beta_0\) = intercept (value of \(Y\) when \(X = 0\))
\(\beta_1\) = slope (change in \(Y\) for a one-unit increase in \(X\))
\(\epsilon \sim N(0, \sigma^2)\) = random error term
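To make the roles of the three terms concrete, here is a minimal sketch that simulates data from the model. The sample size and the "true" values of \(\beta_0\), \(\beta_1\), and \(\sigma\) are hypothetical, chosen only for illustration.

```r
set.seed(1)                      # make the simulated data reproducible

n     <- 200                     # hypothetical sample size
beta0 <- 3                       # hypothetical true intercept
beta1 <- 2                       # hypothetical true slope
sigma <- 1                       # hypothetical error standard deviation

x   <- runif(n, min = 0, max = 10)       # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)    # random errors, N(0, sigma^2)
y   <- beta0 + beta1 * x + eps           # response generated by the model

# Fitting a line to the simulated data should give estimates
# close to the hypothetical values 3 and 2
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Because of the error term, the estimates will be close to, but not exactly equal to, the values used to generate the data.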

Finding the Best Fit

In simple linear regression we look for the straight line that fits the data best. To do that, we estimate \(\beta_0\) and \(\beta_1\) using Ordinary Least Squares (OLS), which minimizes the sum of squared residuals:

\[\min \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]

The formulas for the slope and intercept are:

\[\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\]
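The two formulas can be applied directly in a few lines of R. The sketch below uses R's built-in mtcars data, taking `wt` as \(x\) and `mpg` as \(y\). The variable names are illustrative, and `lm()` is used only as a cross-check.

```r
# Closed-form OLS estimates computed directly from the formulas
x <- mtcars$wt
y <- mtcars$mpg

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared x-deviations
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# Cross-check against R's built-in fitting function
coef(lm(y ~ x))
```

The hand-computed values and the `lm()` coefficients should agree up to floating-point rounding.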

Example: Car Weight vs MPG

We’ll use the built-in mtcars dataset to predict MPG from weight.

# Load ggplot2 for plotting
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue", size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Car Weight vs Fuel Efficiency",
       x = "Weight (1000 lbs)", y = "Miles Per Gallon") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Fitting the Model in R — Code Slide

# Fit the model
model <- lm(mpg ~ wt, data = mtcars)

# View results
summary(model)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Residuals Plot

Residuals are the differences between the observed values and the values predicted by the model: \(e_i = y_i - \hat{y}_i\). They should be randomly scattered around zero; no visible pattern suggests a good fit. Checking the residuals is important to confirm that the model is appropriate for the data.
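One way to draw such a plot is sketched below with base R graphics. The model is refit here so the snippet is self-contained. The plot styling and the names `fit_vals` and `res` are illustrative choices, not the original slide's code.

```r
# Refit the model so this snippet runs on its own
model <- lm(mpg ~ wt, data = mtcars)

fit_vals <- fitted(model)   # predicted MPG for each car
res      <- resid(model)    # observed MPG minus predicted MPG

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fit_vals, res,
     xlab = "Fitted MPG", ylab = "Residual",
     main = "Residuals vs Fitted Values",
     pch = 19, col = "blue")
abline(h = 0, col = "red", lty = 2)   # reference line at zero
```

Points spread evenly above and below the dashed line, with no curve or widening band, are what a well-behaved fit looks like.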

3D Plot: Weight, Horsepower & MPG

library(plotly)

plot_ly(mtcars, x = ~wt, y = ~hp, z = ~mpg,
        type = "scatter3d", mode = "markers",
        marker = list(color = ~mpg, colorscale = "Viridis", size = 5)) %>%
  layout(title = "Weight, Horsepower & MPG",
         scene = list(xaxis = list(title = "Weight"),
                      yaxis = list(title = "Horsepower"),
                      zaxis = list(title = "MPG")))

Model Interpretation

From the output of summary(model):

Intercept (\(\hat{\beta}_0\)) ≈ 37.3: predicted MPG when weight = 0 (an extrapolation, since no car in the data weighs close to 0)
Slope (\(\hat{\beta}_1\)) ≈ −5.34: each additional 1,000 lbs of weight is associated with about 5.3 fewer MPG
R² ≈ 0.75: weight explains about 75% of the variation in MPG
p-value < 0.001: the relationship is statistically significant
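The same quantities can be pulled out of the fitted object rather than read off the printed summary. The sketch below also predicts MPG for a hypothetical car weighing 3,000 lbs (`wt = 3`), a value chosen only for illustration.

```r
model       <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(model)

coef(model)                                   # intercept and slope
fit_summary$r.squared                         # share of variation explained
fit_summary$coefficients["wt", "Pr(>|t|)"]    # p-value for the slope
confint(model, "wt", level = 0.95)            # 95% confidence interval for the slope

# Predicted MPG for a hypothetical 3,000 lb car
new_car <- data.frame(wt = 3)
pred    <- predict(model, newdata = new_car)
pred
```

From the coefficients above, the prediction works out to roughly 37.29 − 5.34 × 3, or about 21.25 MPG.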

Conclusion

Simple linear regression finds the best line through data
OLS minimizes the total squared error
The slope tells us direction and magnitude of the relationship
R² tells us how well the model fits
Always check residuals to validate assumptions