2025-03-17

What is Simple Linear Regression?

  • A statistical method for modelling the relationship between two variables, so that one variable can be predicted from the other
  • For example, linear regression can help us predict a planet’s temperature from its distance from the sun
  • Variables:
    • Dependent Variable: What we’re trying to predict (planet’s temperature)
    • Independent Variable: What we use for prediction (distance from the sun)

Applications and Formula

  • Real-world applications:
    • Economics: Predicting sales based on advertising spend
    • Health: Relationship between exercise and heart rate
    • Finance: Forecasting stock prices based on market indices
    • Education: Relationship between study time and test scores
  • Mathematical Formula: \(Y = \beta_0 + \beta_1X + \epsilon\)
    • \(Y\) = Dependent variable (e.g., planet temperature)
    • \(X\) = Independent variable (e.g., distance from the sun)
    • \(\beta_0\) = Y-intercept (temperature when distance = 0)
    • \(\beta_1\) = Slope (change in temperature per unit increase in distance)
    • \(\epsilon\) = Error term (natural variation not explained by the model)
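
To make the role of each term concrete, the following sketch simulates data from the formula and checks that `lm()` recovers the parameters approximately. The values of `beta0_true` and `beta1_true`, the noise level, and the sample size are illustrative assumptions, not taken from any dataset.

```r
# Simulate data from Y = beta_0 + beta_1 * X + epsilon.
# All numeric values below are illustrative assumptions.
set.seed(42)
n <- 200
beta0_true <- 2.0
beta1_true <- 0.5

x <- runif(n, min = 0, max = 10)        # independent variable
epsilon <- rnorm(n, mean = 0, sd = 1)   # error term
y <- beta0_true + beta1_true * x + epsilon

# Fit the model and inspect the estimated intercept and slope
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimates should land near, but not exactly on, the true values; the gap comes from the error term.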

Using Motor Trend Cars Dataset

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Summary of the Dataset

##       mpg              wt              hp             qsec      
##  Min.   :10.40   Min.   :1.513   Min.   : 52.0   Min.   :14.50  
##  1st Qu.:15.43   1st Qu.:2.581   1st Qu.: 96.5   1st Qu.:16.89  
##  Median :19.20   Median :3.325   Median :123.0   Median :17.71  
##  Mean   :20.09   Mean   :3.217   Mean   :146.7   Mean   :17.85  
##  3rd Qu.:22.80   3rd Qu.:3.610   3rd Qu.:180.0   3rd Qu.:18.90  
##  Max.   :33.90   Max.   :5.424   Max.   :335.0   Max.   :22.90
  • Dataset: Motor Trend Car Road Tests (mtcars)
  • Variables:
    • mpg: Miles per gallon (fuel efficiency)
    • wt: Weight (1000 lbs)
    • hp: Horsepower
    • qsec: Quarter-mile time in seconds (acceleration performance)
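
The code that produced the summary table is not shown on the slide. A call along the following lines should reproduce it; the column selection is an assumption inferred from the output.

```r
# Summarise only the four columns discussed on this slide
data(mtcars)
vars <- c("mpg", "wt", "hp", "qsec")
car_summary <- summary(mtcars[, vars])
car_summary
```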

The Concept Visualized

## `geom_smooth()` using formula = 'y ~ x'

  • Observation: Heavier cars tend to have lower fuel efficiency (negative relationship).
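
The `geom_smooth()` message above suggests the figure was drawn with ggplot2, but the plotting code itself is not shown. The following is a minimal sketch of a comparable plot, assuming ggplot2 is installed; the axis labels and the object name `weight_plot` are illustrative choices.

```r
library(ggplot2)  # assumed to be installed

data(mtcars)
weight_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # method = "lm" draws the least-squares line with a confidence band
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
print(weight_plot)
```

Passing `formula = y ~ x` explicitly only silences the informational message; the fitted line is the same.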

The Mathematical Model

The linear equation for our car example:

\[MPG = \beta_0 + \beta_1 \times Weight + \epsilon\]

Based on our data:

\[MPG = 37.29 - 5.34 \times Weight\]

  • Interpretation:
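    • For each additional 1000 lbs of weight, a car’s predicted fuel efficiency decreases by approximately 5.34 MPG
    • The intercept, 37.29 MPG, is the predicted fuel efficiency of a “weightless” car; this is an extrapolation well outside the observed weights (1,513 to 5,424 lbs), so it is purely theoretical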
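
As a worked example of reading the fitted equation, take a car weighing 3,000 lbs, i.e. Weight = 3 (a value chosen here for illustration, within the observed range):

\[MPG = 37.29 - 5.34 \times 3 = 37.29 - 16.02 = 21.27\]

The model therefore predicts roughly 21.3 MPG for such a car.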

Ordinary Least Squares Method

We find the “best” line by minimizing the sum of squared residuals:

\[\min \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2\]

  • Residuals: Differences between actual and predicted MPG (the vertical distance from each point to the line)
  • Goal: Minimize these differences to find the best-fitting line
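
With a single predictor, this minimisation has a well-known closed-form solution: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept follows from the two means. The sketch below computes both by hand and compares them with `lm()`; the variable names are illustrative.

```r
data(mtcars)
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form OLS estimates for one predictor
beta1_hat <- cov(x, y) / var(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Sum of squared residuals at these estimates (the quantity being minimised)
ssr <- sum((y - (beta0_hat + beta1_hat * x))^2)

# The hand-computed values should agree with lm()
c(intercept = beta0_hat, slope = beta1_hat)
coef(lm(mpg ~ wt, data = mtcars))
```

Moving the slope or intercept away from these values in either direction increases `ssr`, which is what "best" means here.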

Implementing Linear Regression

car_model <- lm(mpg ~ wt, data = mtcars)
summary(car_model)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.285126   1.877627 19.857575 8.241799e-19
## wt          -5.344472   0.559101 -9.559044 1.293959e-10
cat("R-squared:", summary(car_model)$r.squared, "\n")
## R-squared: 0.7528328
  • The coefficients show the estimated intercept (37.29) and slope (-5.34)
  • The p-value for wt (about 1.3e-10) indicates that the relationship is statistically significant
  • R-squared (about 0.753) is the proportion of the variation in fuel efficiency explained by weight
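
Beyond the coefficient table, the fitted model can be used to quantify uncertainty and to make predictions. The sketch below refits the same model and uses a hypothetical 3,000 lb car (`new_car`) as the input; that weight is an illustrative choice.

```r
data(mtcars)
car_model <- lm(mpg ~ wt, data = mtcars)

# 95% confidence intervals for the intercept and slope
confint(car_model, level = 0.95)

# Predicted mpg for a hypothetical 3,000 lb car (wt is in units of 1000 lbs)
new_car <- data.frame(wt = 3)
pred <- predict(car_model, newdata = new_car, interval = "prediction")
pred  # columns: fit (point prediction), lwr and upr (95% prediction interval)
```

A prediction interval is wider than a confidence interval for the mean because it also accounts for the scatter of individual cars around the line.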

Efficiency vs. Performance

## `geom_smooth()` using formula = 'y ~ x'

  • Observation: Fuel-efficient cars (green) tend to have lower power-to-weight ratios
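
The plot behind this observation is not reproduced here. As a rough numerical check, the sketch below assumes that the power-to-weight ratio is `hp / wt` (horsepower per 1000 lbs) and that "fuel-efficient" means above-median `mpg`; both definitions are assumptions made for illustration and may differ from those used in the figure.

```r
data(mtcars)

# Assumed definition: horsepower per 1000 lbs
power_to_weight <- mtcars$hp / mtcars$wt

# Assumed definition: efficient = above-median mpg
group <- ifelse(mtcars$mpg > median(mtcars$mpg), "efficient", "less efficient")

# Mean power-to-weight ratio within each group
group_means <- tapply(power_to_weight, group, mean)
group_means

# Overall linear association between mpg and power-to-weight
ratio_cor <- cor(mtcars$mpg, power_to_weight)
ratio_cor
```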

3D Visualization with plotly

  • Observation: The interactive 3D plot reveals the complex relationship between these three variables

Limitations and Considerations

  • Linearity: Assumes a straight-line relationship between variables
  • Independence: Assumes each observation is independent
  • Homoscedasticity: Residuals should have equal variance across all fitted values
  • Normality: Residuals should be normally distributed
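
Some of these assumptions can be checked informally from the residuals. The following is a minimal sketch using base R only; it refits the model so that it runs on its own. Independence generally has to be judged from how the data were collected, not from a plot.

```r
data(mtcars)
car_model <- lm(mpg ~ wt, data = mtcars)
res <- residuals(car_model)
fit <- fitted(car_model)

# Linearity and homoscedasticity: look for curvature or a funnel shape
plot(fit, res, xlab = "Fitted mpg", ylab = "Residual")
abline(h = 0, lty = 2)

# Normality: points should follow the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test; a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
sw
```

With only 32 observations, these checks are indicative rather than conclusive.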

Conclusion

  • Simple linear regression provides a foundation for understanding relationships between variables
  • For our cars example:
    • Weight is a significant predictor of fuel efficiency
    • Each additional 1000 lbs of weight is associated with a decrease in efficiency of approximately 5.34 MPG
    • The model explains 75.3% of the variation in fuel efficiency

Key takeaways:

  • Linear regression is a powerful tool for prediction and analysis
  • The approach can be extended to multiple variables and more complex relationships
  • Always consider the underlying assumptions and limitations of your model