2025-06-01

What is Simple Linear Regression?

Simple linear regression is a statistical method used to model the relationship between:

  • One explanatory variable (independent variable, X)
  • One response variable (dependent variable, Y)

The relationship is modeled using a straight line:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

Where:

  • \(\beta_0\) is the y-intercept
  • \(\beta_1\) is the slope
  • \(\epsilon\) is the error term

Mathematical Foundation

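The least squares method finds the best-fitting line by minimizing the sum of squared residuals, where each residual is the difference between an observed value \(y_i\) and its fitted value \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\):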

\[\text{Minimize: } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

The estimated coefficients are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\]
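As a quick check of the two formulas above, the estimates can be computed by hand and compared with R's built-in fit. This is a minimal sketch that assumes R's built-in mtcars data (the dataset this document uses for its example), with weight as X and miles per gallon as Y; the variable names are illustrative only.

```r
# Closed-form least squares estimates, computed directly from the formulas
x <- mtcars$wt   # explanatory variable: weight (1000 lbs)
y <- mtcars$mpg  # response variable: miles per gallon

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# Cross-check against lm(), which should agree up to floating-point error
coef(lm(mpg ~ wt, data = mtcars))
```

The two results should match, since lm() solves the same least squares problem for this model.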

Example Dataset: Car Weight vs Fuel Efficiency

Let’s examine the relationship between car weight and fuel efficiency using the mtcars dataset.

Car                  Weight (1000 lbs)  Miles per Gallon  Cylinders  Horsepower
Mazda RX4                        2.620              21.0          6         110
Mazda RX4 Wag                    2.875              21.0          6         110
Datsun 710                       2.320              22.8          4          93
Hornet 4 Drive                   3.215              21.4          6         110
Hornet Sportabout                3.440              18.7          8         175
Valiant                          3.460              18.1          6         105
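A preview like the one above can be produced along the following lines. This is a sketch, not necessarily the code behind the original table; the `cars_preview` name and the relabelled column headers are assumptions made for illustration.

```r
# Keep the four columns shown in the table (car names are the row names)
cars_preview <- mtcars[, c("wt", "mpg", "cyl", "hp")]

# Replace the short mtcars column names with readable labels
names(cars_preview) <- c("Weight (1000 lbs)", "Miles per Gallon",
                         "Cylinders", "Horsepower")

# Show the first six rows
head(cars_preview)
```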

Research Question: How does car weight affect fuel efficiency?

Exploratory Data Analysis
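One possible way to explore the data before fitting a model is to look at summary statistics, the correlation, and a scatter plot of the two variables. The sketch below uses only base R; the plot title, axis labels, and styling are assumptions rather than the original figure.

```r
# Summary statistics for the two variables of interest
summary(mtcars[, c("wt", "mpg")])

# Pearson correlation between weight and fuel efficiency
r <- cor(mtcars$wt, mtcars$mpg)
r

# Scatter plot of the raw data
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon",
     main = "Car Weight vs Fuel Efficiency",
     pch = 19)
```

A strongly negative correlation and a roughly straight downward trend in the plot would suggest that a straight-line model is a reasonable starting point.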

Fitting the Linear Model

# Fit simple linear regression model
model <- lm(mpg ~ wt, data = mtcars)

# Display model summary
summary(model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
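The numbers in the summary can also be pulled out programmatically, which is convenient for reporting. The following sketch refits the same model so it runs on its own, then extracts the coefficient table, R-squared, the residual standard error, and 95% confidence intervals; the object names `s` and `ci` are illustrative.

```r
# Refit the model so this block is self-contained
model <- lm(mpg ~ wt, data = mtcars)
s <- summary(model)

# Coefficient table: estimates, standard errors, t values, p-values
s$coefficients

# Proportion of variance in mpg explained by wt
s$r.squared

# Residual standard error
s$sigma

# 95% confidence intervals for the intercept and slope
ci <- confint(model, level = 0.95)
ci
```

If the interval for `wt` lies entirely below zero, that is consistent with the negative slope being statistically distinguishable from zero at the 5% level.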

Model Visualization
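A simple way to visualize the fitted model is to overlay the regression line on the scatter plot. This is a base R sketch under the assumption that a static plot is sufficient; the colors and labels are arbitrary choices, not taken from the original figure.

```r
# Refit the model so this block is self-contained
model <- lm(mpg ~ wt, data = mtcars)

# Scatter plot of the observed data
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon",
     main = "Fitted Regression Line",
     pch = 19)

# Add the fitted least squares line
abline(model, col = "red", lwd = 2)
```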

Interactive 3D Visualization

Model Diagnostics

# Create residual plots for model validation
par(mfrow = c(2, 2))
plot(model)

Key Results and Interpretation

Model Equation: \(\widehat{\text{MPG}} = 37.29 - 5.34 \times \text{Weight}\)

Key Findings:

  • Slope: For every additional 1000 lbs of weight, predicted fuel efficiency decreases by about 5.34 MPG on average
  • R-squared: 75.3% of the variation in MPG is explained by weight
  • P-value: < 0.001 (the relationship is highly statistically significant)

Practical Implications:

  • Lighter cars tend to be more fuel-efficient
  • Weight is a strong predictor of fuel economy
  • The model explains most of the variation in MPG
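To see how the fitted equation is used in practice, the sketch below predicts fuel efficiency for a hypothetical car weighing 3000 lbs (wt = 3), a value inside the observed weight range. The `new_car` data frame is an illustrative assumption, not part of the dataset.

```r
# Refit the model so this block is self-contained
model <- lm(mpg ~ wt, data = mtcars)

# Hypothetical car: 3000 lbs, expressed in the model's units (1000 lbs)
new_car <- data.frame(wt = 3)

# Point prediction with a 95% prediction interval
pred <- predict(model, newdata = new_car,
                interval = "prediction", level = 0.95)
pred
```

The point prediction corresponds to 37.29 - 5.34 × 3, roughly 21.3 MPG, and the prediction interval indicates how much an individual car of that weight might plausibly deviate from it.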

Model Assumptions and Limitations

The linear regression model assumes the following conditions:

  1. Linearity: The relationship between X and Y is linear.
  2. Independence: The observations (and their errors) are independent of one another.
  3. Homoscedasticity: The residuals have constant variance across all values of X.
  4. Normality: The residuals follow a normal distribution.
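Some of these assumptions can be checked with base R. The sketch below draws the residuals-versus-fitted plot (useful for judging linearity and constant variance) and runs a Shapiro-Wilk test on the residuals (for normality). It is one possible set of checks, not an exhaustive procedure; independence usually has to be judged from how the data were collected.

```r
# Refit the model so this block is self-contained
model <- lm(mpg ~ wt, data = mtcars)
res <- residuals(model)

# Linearity and homoscedasticity: look for curvature or a funnel shape
plot(model, which = 1)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests departure from normality)
sw <- shapiro.test(res)
sw
```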

The model also has the following limitations:

  • It identifies associations between variables but does not establish causation.
  • It does not capture relationships that are nonlinear.
  • It is sensitive to outliers, which can pull the fitted line toward them.
  • Its predictions are unreliable outside the range of the observed data (extrapolation).
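The risk of predicting beyond the observed data can be illustrated with a short sketch. The 8000 lb car below is a hypothetical value chosen to lie well outside the weights in mtcars; it is not part of the dataset.

```r
# Refit the model so this block is self-contained
model <- lm(mpg ~ wt, data = mtcars)

# Observed weight range in the data (in 1000 lbs)
range(mtcars$wt)

# Hypothetical car far heavier than anything observed
heavy <- data.frame(wt = 8)

# The straight line keeps falling and yields a negative, impossible MPG
extrap <- predict(model, newdata = heavy)
extrap
```

A negative predicted MPG is physically meaningless, which shows why the fitted line should only be trusted within the range of weights used to estimate it.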