Simple Linear Regression

What is Simple Linear Regression?

Simple linear regression is one of the most fundamental tools in statistics. The basic idea is pretty straightforward — you have two variables, and you want to know if one can help predict the other.

More specifically, we are trying to model a linear relationship between:

a response variable \(Y\) (what we are trying to predict)
a predictor variable \(X\) (what we use to make that prediction)

We will be using the mtcars dataset throughout this presentation. The question we are asking is: does a car’s horsepower tell us anything about its fuel efficiency?

The Model

The simple linear regression model is:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

\(\beta_0\) is the intercept — the expected value of \(Y\) when \(X = 0\)
\(\beta_1\) is the slope — how much \(Y\) changes for each one-unit increase in \(X\)
\(\varepsilon_i\) is the error term, assumed independent across observations with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\)

Once we estimate the coefficients, the fitted values are:

\[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\]

The difference \(e_i = Y_i - \hat{Y}_i\) between the observed and fitted value is the residual for observation \(i\).
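To make the fitted-value and residual definitions concrete, here is a small sketch that computes both by hand for the mpg ~ hp model used later in this presentation and checks them against R's built-in accessors. The variable names (b, y_hat, res) are illustrative choices, not part of the original slides.

```r
# Sketch: fitted values and residuals computed directly from the definitions
data(mtcars)
fit <- lm(mpg ~ hp, data = mtcars)

b <- coef(fit)  # named vector with entries "(Intercept)" and "hp"

# Fitted values: beta0_hat + beta1_hat * X_i
y_hat <- b[["(Intercept)"]] + b[["hp"]] * mtcars$hp

# Residuals: observed minus fitted
res <- mtcars$mpg - y_hat

# Both should agree with what lm() stores (names dropped for the comparison)
all.equal(unname(fitted(fit)), y_hat)
all.equal(unname(resid(fit)), res)

# With an intercept in the model, OLS residuals sum to zero up to rounding
sum(res)
```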

Estimating the Coefficients

The coefficients are estimated using Ordinary Least Squares (OLS), which finds the line that minimizes the residual sum of squares (RSS):

\[\text{RSS} = \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]

Taking derivatives and setting them to zero gives the closed-form solutions:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\]

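Under the Gauss-Markov assumptions (errors with mean zero, constant variance, and no correlation between observations), these are the best linear unbiased estimators (BLUE): no other linear unbiased estimator has a smaller variance. Normality of the errors is not needed for this result.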
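The closed-form expressions above can be checked numerically. The sketch below applies them to hp and mpg from mtcars and compares the result with lm(); the names x, y, b0, and b1 are just illustrative.

```r
# Sketch: OLS slope and intercept from the closed-form formulas
data(mtcars)
x <- mtcars$hp
y <- mtcars$mpg

# Slope: sum of cross-deviations divided by sum of squared deviations of X
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the line through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# lm() should report the same two numbers
coef(lm(mpg ~ hp, data = mtcars))
```

The two results should match the estimates in the summary output shown later (about 30.099 and -0.068).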

The Dataset

mtcars dataset — 32 cars from the 1974 Motor Trend magazine
Variable	Description	Min	Mean	Max
mpg	Miles per gallon	10.400	20.09	33.900
hp	Gross horsepower	52.000	146.69	335.000
wt	Weight (1000 lbs)	1.513	3.22	5.424

The main relationship we are modeling is mpg ~ hp. Intuitively, we would expect more powerful cars to use more fuel, so the slope should come out negative.
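The summary table can be reproduced with a few lines of base R. This is only one way to do it; the slides may have built the table differently, and the names vars and summary_tbl are made up for this sketch.

```r
# Sketch: min, mean, and max for the three variables in the table
data(mtcars)
vars <- c("mpg", "hp", "wt")

# sapply() returns statistics in rows, so transpose to get one row per variable
summary_tbl <- t(sapply(mtcars[, vars], function(v) {
  c(Min = min(v), Mean = mean(v), Max = max(v))
}))

round(summary_tbl, 3)
```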

Scatter Plot

The downward trend is pretty clear. The shaded band is the 95% confidence interval for the regression line.
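A plot along these lines can be sketched in base R graphics. The original figure may have been drawn with a different plotting package; the grid of 100 horsepower values and the grey shading are assumptions made for illustration.

```r
# Sketch: scatter plot of mpg vs. hp with the fitted line and a 95% confidence band
data(mtcars)
fit <- lm(mpg ~ hp, data = mtcars)

# Evenly spaced hp values across the observed range
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))

# Matrix with columns "fit", "lwr", "upr": confidence interval for the mean response
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

plot(mpg ~ hp, data = mtcars, pch = 19,
     xlab = "Gross horsepower", ylab = "Miles per gallon")

# Shaded band between the lower and upper confidence limits
polygon(c(grid$hp, rev(grid$hp)),
        c(band[, "lwr"], rev(band[, "upr"])),
        col = adjustcolor("grey", alpha.f = 0.4), border = NA)

# Fitted regression line on top
lines(grid$hp, band[, "fit"], lwd = 2)
```

Note that interval = "confidence" describes uncertainty about the line itself; interval = "prediction" would give the wider band for individual cars.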

Residuals vs. Fitted

No obvious pattern here, which is what we want — it suggests the linear model is a reasonable fit.
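For reference, a residuals-vs-fitted plot can be drawn as in the sketch below. The lowess smoother is an addition here to help judge curvature by eye; R's built-in plot(fit, which = 1) produces a similar diagnostic.

```r
# Sketch: residuals vs. fitted values for the mpg ~ hp model
data(mtcars)
fit <- lm(mpg ~ hp, data = mtcars)

plot(fitted(fit), resid(fit), pch = 19,
     xlab = "Fitted mpg", ylab = "Residual")

# Reference line at zero: residuals should scatter evenly around it
abline(h = 0, lty = 2)

# Smoothed trend through the residuals, as a visual aid only
lines(lowess(fitted(fit), resid(fit)), col = "red")
```

If the smoothed curve bends clearly away from zero, that would be a hint to try a transformation or an extra term.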

The Code

data(mtcars)

fit <- lm(mpg ~ hp, data = mtcars)

summary(fit)

## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Reading the Output

Based on the model summary:

\[\hat{\text{mpg}} = 30.099 - 0.068 \times \text{hp}\]

Every additional unit of horsepower is associated with about a 0.068 drop in mpg, or roughly 6.8 mpg per 100 hp
The intercept of 30.1 is the predicted mpg for a car with 0 horsepower — not physically meaningful on its own, but necessary for the line
\(R^2 \approx 0.60\) means horsepower alone explains about 60% of the variability in fuel efficiency
The p-value on hp is about 1.8e-07, far below the usual 0.05 threshold, so the relationship is statistically significant

Not a perfect model, but for a single predictor it explains a solid chunk of the variation.
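The numbers discussed above can also be pulled out of the fitted model programmatically instead of being read off the printed summary. A brief sketch follows; the 150 hp car is a hypothetical example, not one from the slides.

```r
# Sketch: extracting the key quantities from the fitted model
data(mtcars)
fit <- lm(mpg ~ hp, data = mtcars)
s <- summary(fit)

# Coefficient table: estimate, standard error, t value, p-value
coef(s)

# Proportion of variance in mpg explained by hp
s$r.squared

# 95% confidence interval for the slope
confint(fit, "hp", level = 0.95)

# Predicted mpg for a hypothetical car with 150 horsepower
pred <- predict(fit, newdata = data.frame(hp = 150))
pred
```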

3D View: Adding Weight as a Second Predictor

Takeaways

Simple linear regression is a solid starting point for understanding relationships in data. In this case:

Horsepower is a strong, statistically significant predictor of fuel efficiency
The relationship is negative: higher horsepower goes with lower mpg, which makes intuitive sense
Adding weight as a second predictor (shown in the 3D plot) pushes \(R^2\) up to about 0.83, which is a meaningful improvement
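The comparison between the one-predictor and two-predictor models can be sketched as follows. The names fit1 and fit2 are illustrative. Since \(R^2\) can never decrease when a predictor is added, the sketch also reports adjusted \(R^2\) and an F-test for the added term.

```r
# Sketch: comparing mpg ~ hp with mpg ~ hp + wt
data(mtcars)
fit1 <- lm(mpg ~ hp, data = mtcars)
fit2 <- lm(mpg ~ hp + wt, data = mtcars)

# Plain R-squared for both models
c(hp_only = summary(fit1)$r.squared,
  hp_and_wt = summary(fit2)$r.squared)

# Adjusted R-squared penalizes the extra parameter
c(hp_only = summary(fit1)$adj.r.squared,
  hp_and_wt = summary(fit2)$adj.r.squared)

# F-test for whether adding wt significantly reduces the residual sum of squares
anova(fit1, fit2)
```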

One thing to keep in mind is that correlation does not imply causation. The model tells us there is a strong linear association, but not necessarily that high horsepower causes lower mpg — both could be driven by other characteristics of the car.