2025-03-26

Introduction

Simple linear regression is a statistical technique that models the relationship between a single explanatory variable and a continuous dependent variable by fitting a straight line to the data.

Regression Model Equation

The simple linear regression model is represented by the equation:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

Where:
- \(y\): Response variable
- \(x\): Predictor variable
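- \(\beta_0\): Intercept (expected value of \(y\) when \(x = 0\))
- \(\beta_1\): Slope (expected change in \(y\) for a one-unit increase in \(x\))
- \(\varepsilon\): Random error (variation in \(y\) not explained by \(x\))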
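
For reference, the least-squares estimates of the two coefficients have a closed form. Writing \(\bar{x}\) and \(\bar{y}\) for the sample means and \(n\) for the number of observations (notation introduced here for illustration), a standard statement is:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

The slope estimate is the sample covariance of \(x\) and \(y\) divided by the sample variance of \(x\), and the intercept estimate makes the fitted line pass through \((\bar{x}, \bar{y})\).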

Assumptions of Linear Regression

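  1. Linearity: the mean of the response is a straight-line function of the predictor
  2. Independence of errors: the error for one observation carries no information about another
  3. Homoscedasticity (equal variance): the spread of the errors is the same at every value of the predictor
  4. Normality of residuals: the errors follow a normal distribution, which is checked through the residuals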
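
Two of these assumptions can be checked numerically once a model is fitted. The following is a minimal sketch, assuming the built-in mtcars data with mpg as the response and wt as the predictor. The object names (`fit`, `aux`, `bp_stat`) are illustrative, and the homoscedasticity check is a hand-computed studentized Breusch-Pagan statistic rather than a call to a dedicated testing package.

```r
# Illustrative fit; the choice of mpg ~ wt is an assumption for this sketch
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Normality: Shapiro-Wilk test on the residuals (null: residuals are normal)
sw <- shapiro.test(res)

# Homoscedasticity: studentized Breusch-Pagan statistic computed by hand.
# Regress the squared residuals on the predictor; under constant variance,
# n * R^2 is approximately chi-squared with 1 degree of freedom.
aux <- lm(I(res^2) ~ wt, data = mtcars)
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p))
```

Large p-values give no evidence against the corresponding assumption. Linearity is usually judged from a residual plot, and independence mostly follows from how the data were collected rather than from a test.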

Example Dataset: mtcars

We will explore the relationship between car weight (wt) and miles per gallon (mpg) using the built-in mtcars dataset in R.

ggplot: Scatterplot with Regression Line

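library(ggplot2)
# Scatterplot of mpg against weight with the least-squares line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "darkblue") +
  # se = TRUE draws the 95% confidence band around the fitted line
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red") +
  labs(title = "Miles per Gallon vs Weight of Car",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon")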

ggplot: Residual Plot

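# Fit the simple linear regression and extract its residuals
model <- lm(mpg ~ wt, data = mtcars)
res <- residuals(model)
# Residuals should scatter around zero with no visible pattern
ggplot(data.frame(wt = mtcars$wt, res = res), aes(x = wt, y = res)) +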
  geom_point(color = "purple") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Residuals vs Weight",
       x = "Weight (1000 lbs)",
       y = "Residuals")
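
The residual plot speaks to linearity and equal variance. For the normality assumption, a normal Q-Q plot is a common companion. A minimal sketch, assuming the same mpg ~ wt fit (`diag_df` and `qq_plot` are illustrative names):

```r
library(ggplot2)

fit <- lm(mpg ~ wt, data = mtcars)
# Standardized residuals are on the same scale as the N(0, 1) reference
diag_df <- data.frame(std_res = rstandard(fit))

qq_plot <- ggplot(diag_df, aes(sample = std_res)) +
  stat_qq(color = "purple") +            # sample vs theoretical quantiles
  stat_qq_line(linetype = "dashed") +    # reference line through the quartiles
  labs(title = "Normal Q-Q Plot of Standardized Residuals",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals")
print(qq_plot)
```

Points that follow the dashed line closely are consistent with normally distributed errors. Systematic departures in the tails suggest skewness or heavy tails.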

plotly: Interactive Scatterplot

library(plotly)
plot_ly(data = mtcars, x = ~wt, y = ~mpg,
        type = 'scatter', mode = 'markers',
        marker = list(color = 'green', size = 10)) %>%
  layout(title = "Interactive MPG vs Weight",
         xaxis = list(title = "Weight (1000 lbs)"),
         yaxis = list(title = "Miles per Gallon"))
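
The interactive scatterplot above shows the points only. One way to get an interactive version that also carries the fitted line is to build the plot in ggplot2 and convert it with `ggplotly()`. This is a sketch of that alternative, not the approach used above, and the object names `p` and `interactive_plot` are illustrative:

```r
library(ggplot2)
library(plotly)

# Static plot: points plus the least-squares line (no confidence band)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(title = "Interactive MPG vs Weight with Fitted Line",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon")

# ggplotly() converts the ggplot object into an interactive plotly widget
interactive_plot <- ggplotly(p)
interactive_plot
```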

Model Summary in R

summary(model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
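
The printed summary is convenient to read, but the same quantities can be pulled out as numbers for reporting or further computation. A minimal sketch, refitting the model so the block is self-contained (`fit`, `s`, `coef_table`, `ci`, `r2`, and `sigma_hat` are illustrative names):

```r
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

coef_table <- coef(s)              # matrix: Estimate, Std. Error, t value, Pr(>|t|)
ci <- confint(fit, level = 0.95)   # 95% confidence intervals for both coefficients
r2 <- s$r.squared                  # multiple R-squared
sigma_hat <- s$sigma               # residual standard error

print(round(coef_table, 4))
print(round(ci, 3))
print(round(c(r_squared = r2, sigma = sigma_hat), 4))
```

The confidence interval for wt gives a range of plausible slopes, which is often more informative than the p-value alone.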

Interpreting the Output

\[ \widehat{\text{mpg}} = 37.2851 - 5.3445 \cdot \text{wt} \]

  • The slope is negative: each additional 1000 lbs of weight is associated with an average decrease of about 5.34 mpg.
  • \(R^2 = 0.7528\): About 75% of the variation in mpg is explained by weight.
  • The p-value for wt is very small (\(1.29 \times 10^{-10}\), well below 0.001), so weight is a statistically significant predictor of mpg.
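
The fitted equation can also be used to predict mpg for new cars. A minimal sketch using `predict()`, where the weights in `new_cars` are hypothetical values chosen inside the observed range of wt:

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Hypothetical weights in units of 1000 lbs
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# Point predictions with 95% prediction intervals for individual cars
pred <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
print(cbind(new_cars, round(pred, 2)))
```

For a car weighing 3000 lbs the point prediction is about 21.25 mpg, which matches \(37.2851 - 5.3445 \cdot 3\). Prediction intervals are wider than confidence intervals for the mean because they also account for the variability of individual cars around the line.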

Conclusion

  • Simple linear regression is effective for modeling approximately linear relationships between two continuous variables.
  • Visualization with ggplot2 and plotly helps reveal patterns in the data and check model assumptions.
  • The technique is widely used in fields such as engineering, economics, and data science.

References

  • R Documentation
  • “An Introduction to Statistical Learning” by James, Witten, Hastie, Tibshirani
  • mtcars dataset