Introduction to Simple Linear Regression

Simple linear regression is a statistical model that describes the relationship between two variables with a straight line:

  • Independent variable (X)
  • Dependent variable (Y)

We will fit the straight line that best matches the data points and use it to predict Y from X.

With this regression line, we can predict a car’s fuel efficiency from its weight.

Data Set Choice

We will perform simple linear regression on the built-in R data set “mtcars” to show the relationship between:

  • (Y): Miles per Gallon (mpg)
  • (X): Weight in Thousands of Pounds (wt)
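
As a minimal sketch of how the two variables might be pulled out of the data set: `mtcars` ships with base R, so nothing needs to be downloaded. The name `df` matches the data frame used in the code later in these slides, but the column subsetting shown here is an assumption, not necessarily how the original `df` was built.

```r
# mtcars is part of the datasets package that loads with base R
data(mtcars)

# Assumption: keep only the two columns used in this analysis
df <- mtcars[, c("mpg", "wt")]

head(df)  # first six cars
str(df)   # both columns are numeric
```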

Regression Model

The simple linear regression formula is:

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

Variable Definitions for Our Data:

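\(Y\): miles per gallon (mpg)
\(X\): weight in thousands of pounds (wt)
\(\beta_0\): intercept (expected mpg when wt = 0)
\(\beta_1\): slope (change in mpg per 1,000 lb increase in weight)
\(\varepsilon\): random error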

Estimating Regression Line

We will estimate the parameters using Ordinary Least Squares (OLS), which minimizes the sum of squared errors (SSE):

\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]
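
For a single predictor, this minimization has a well-known closed-form solution. Setting the partial derivatives of SSE with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to zero gives the following, where \(\bar{x}\) and \(\bar{y}\) denote the sample means (notation introduced here for illustration):

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]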

The estimated coefficients are:

## (Intercept)          wt 
##   37.285126   -5.344472
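
These estimates can be reproduced by hand as a check. The sketch below applies the standard closed-form OLS expressions for one predictor and compares the result with `lm()`. The variable names (`x`, `y`, `b0`, `b1`, `fit_check`) and the 3,000 lb example car are illustrative assumptions.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-deviations divided by sum of squared deviations in x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the fitted line through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# Should agree with lm() up to floating-point error
fit_check <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(coef(fit_check)), c(b0, b1))

# Predicted mpg for a hypothetical 3,000 lb car (wt = 3)
b0 + b1 * 3
```

With the coefficients shown above, the last line works out to roughly 37.29 - 5.34 × 3, or about 21.3 mpg.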

Summary of Data

Number of Cars

## [1] 32

MPG Average

## [1] 20.09062

MPG Standard Deviation

## [1] 6.026948
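
The three summary values above can be obtained with base R alone. This is a sketch that assumes they were computed directly from `mtcars`; the original slides do not show the calls.

```r
nrow(mtcars)      # number of cars
mean(mtcars$mpg)  # average mpg
sd(mtcars$mpg)    # sample standard deviation of mpg (n - 1 denominator)
```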

Scatter Plot with Regression Line (ggplot)

## `geom_smooth()` using formula = 'y ~ x'
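
The code for this plot is not shown in the slides. One plausible way to draw it is sketched below, assuming the ggplot2 package is installed. The title, axis labels, and the choice of `se = FALSE` are assumptions rather than the original settings. Calling `geom_smooth()` without an explicit `formula` is what produces the message above.

```r
library(ggplot2)

# method = "lm" overlays the least-squares line on the scatter plot;
# se = FALSE hides the confidence band (an assumption about the original)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "MPG vs weight",
       x = "Weight (1,000 lb)",
       y = "Miles per gallon")

print(p)
```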

Residual Plot (ggplot)

Interactive 3D View (Plotly)

R Code Example

Code used to create the regression model and residual plot:

# Load plotting library
library(ggplot2)

# Assumes df is a copy of the built-in mtcars data
df <- mtcars

# Fit linear regression model
fit <- lm(mpg ~ wt, data = df)

# Find predicted values and residuals
df$mpghat <- predict(fit, newdata = df)
df$residuals <- df$mpg - df$mpghat

# Create residual plot (residuals on y-axis, fitted values on x-axis)
ggplot(df, aes(x = mpghat, y = residuals)) +
  geom_point(size = 2) +
  geom_hline(yintercept = 0) +
  labs(title = "Residuals vs fitted values")

Model Summary

## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
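
Several numbers in this summary are simple functions of one another, which can help when reading the output. The sketch below recomputes them from the fitted model using standard definitions. The object names (`fit_s`, `s`, `coefs`, `sse`, `sst`) are illustrative assumptions.

```r
fit_s <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit_s)
coefs <- s$coefficients

# t value = estimate / standard error
t_wt <- coefs["wt", "Estimate"] / coefs["wt", "Std. Error"]
t_wt

# R-squared = 1 - SSE / SST
sse <- sum(resid(fit_s)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2 <- 1 - sse / sst
r2

# Residual standard error = sqrt(SSE / (n - 2))
rse <- sqrt(sse / df.residual(fit_s))
rse

# With a single predictor, the F-statistic equals the squared slope t value
t_wt^2
```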

Conclusion

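Our analysis indicates that:

  • Heavier cars tend to be less fuel efficient: each additional 1,000 lb of weight is associated with about 5.34 fewer miles per gallon
  • There is a statistically significant negative linear relationship between weight and fuel efficiency (p = 1.29e-10), with weight explaining about 75% of the variation in mpg

Simple linear regression is therefore a useful tool for describing the relationship between two variables and for testing whether that relationship is statistically significant.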