Simple linear regression models the relationship between two continuous variables with a straight line.
We predict a dependent variable \(y\) from an independent variable \(x\).
The simple linear regression model is given by:
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where:
- \(\beta_0\) = intercept
- \(\beta_1\) = slope
- \(\epsilon_i\) = random error
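To make the roles of the three terms concrete, the sketch below simulates data from the model and then recovers the parameters with R's `lm()`. The parameter values (\(\beta_0 = 2\), \(\beta_1 = 0.5\)), the sample size, and the normally distributed errors are assumptions chosen for illustration only.

```r
# Simulate n observations from y_i = beta_0 + beta_1 * x_i + epsilon_i.
# All numeric values below are made up for illustration.
set.seed(1)
n      <- 200
beta_0 <- 2     # assumed intercept
beta_1 <- 0.5   # assumed slope

x   <- runif(n, min = 0, max = 10)   # independent variable
eps <- rnorm(n, mean = 0, sd = 1)    # random error (assumed normal)
y   <- beta_0 + beta_1 * x + eps     # dependent variable

# Fit the model to the simulated data.
sim_fit <- lm(y ~ x)
coef(sim_fit)  # estimates should land near the assumed beta_0 and beta_1
```

Because of the random error, the estimates will be close to, but not exactly equal to, the values used to generate the data.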
The least squares estimates minimize the sum of squared errors:
\[ S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]
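As a rough illustration of what "minimize" means here, the sketch below writes \(S\) as an R function and lets a general-purpose optimizer search for the minimum. It uses R's built-in `mtcars` data with `wt` as \(x\) and `mpg` as \(y\); the helper name `sse`, the two candidate lines, and the starting point are hypothetical choices made for this example.

```r
# Sum of squared errors S(beta_0, beta_1) for a candidate line.
# beta is a length-2 vector: c(intercept, slope).
sse <- function(beta, x, y) {
  sum((y - beta[1] - beta[2] * x)^2)
}

x <- mtcars$wt
y <- mtcars$mpg

# Compare two arbitrary candidate lines: the smaller S fits better.
sse(c(30, -3), x, y)
sse(c(37, -5), x, y)

# Numerical minimization from an arbitrary starting point.
opt <- optim(par = c(0, 0), fn = sse, x = x, y = y, method = "BFGS")
opt$par    # approximate minimizer (beta_0, beta_1)
opt$value  # minimized sum of squared errors
```

A numerical search is not needed in practice, because this minimization problem has an exact solution.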
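Setting the partial derivatives of \(S\) with respect to \(\beta_0\) and \(\beta_1\) to zero gives the closed-form solution: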
\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
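These formulas can be evaluated directly. The sketch below applies them to R's built-in `mtcars` data, taking `wt` as \(x\) and `mpg` as \(y\), and cross-checks the result against `lm()`. The variable names (`x_bar`, `beta_1_hat`, and so on) are chosen for this example.

```r
x <- mtcars$wt
y <- mtcars$mpg

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared x-deviations.
beta_1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar).
beta_0_hat <- y_bar - beta_1_hat * x_bar

c(intercept = beta_0_hat, slope = beta_1_hat)

# Cross-check against R's built-in fitting routine.
all.equal(unname(coef(lm(y ~ x))), c(beta_0_hat, beta_1_hat))
```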
We’ll use the built-in mtcars dataset to predict mpg (miles per gallon) from wt (weight of the car, in 1000 lbs).
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
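Reading the output: the slope of -5.3445 means each additional 1000 lbs of weight is associated with about 5.34 fewer miles per gallon, and the R-squared of 0.7528 means weight accounts for roughly 75% of the variation in mpg. The sketch below shows one way to pull these quantities out of the fitted object and to predict mpg for a new car. The 3,000 lb car (`wt = 3`) is a hypothetical input; from the printed coefficients its predicted mpg is 37.2851 - 5.3445 × 3 ≈ 21.25.

```r
model <- lm(mpg ~ wt, data = mtcars)

coef(model)                      # intercept and slope estimates
confint(model, level = 0.95)     # 95% confidence intervals for both coefficients
r2 <- summary(model)$r.squared   # share of variance in mpg explained by wt
r2

# Predicted mpg for a hypothetical car weighing 3,000 lbs (wt = 3),
# with a 95% prediction interval (columns: fit, lwr, upr).
new_car <- data.frame(wt = 3)
pred <- predict(model, newdata = new_car, interval = "prediction")
pred
```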
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Linear Regression: MPG vs Weight",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon")
## `geom_smooth()` using formula = 'y ~ x'
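A residual is the difference between an observed value and the value predicted by the model. Plotting the residuals against weight helps check the fit: if the linear model is adequate, the points should scatter randomly around zero with no clear pattern.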
mtcars$residuals <- resid(model)
ggplot(mtcars, aes(x = wt, y = residuals)) +
  geom_point(color = "darkgreen") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Weight",
       x = "Weight (1000 lbs)",
       y = "Residuals")
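The residuals can also be checked numerically. The sketch below recomputes them by hand and verifies two standard properties of a least squares fit with an intercept: the residuals sum to zero and are uncorrelated with the predictor. It also reproduces the residual standard error (3.046) reported by `summary(model)`. The names `manual_resid` and `rse` are chosen for this example.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Residual = observed value minus fitted (predicted) value.
manual_resid <- mtcars$mpg - fitted(model)
all.equal(unname(manual_resid), unname(resid(model)))

# With an intercept in the model, least squares residuals sum to zero
# and are uncorrelated with the predictor (up to floating-point error).
sum(resid(model))
cor(mtcars$wt, resid(model))

# Residual standard error: sqrt(SSE / (n - 2)).
rse <- sqrt(sum(resid(model)^2) / df.residual(model))
rse
```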
library(plotly)
plot_ly(
  data = mtcars,
  x = ~wt, y = ~mpg, z = ~hp,
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 5, color = ~hp, colorscale = "Viridis")
) %>%
  layout(title = "3D Plot: MPG vs Weight vs Horsepower")