Introduction to Simple Linear Regression

Simple Linear Regression is a statistical method for modeling the relationship between:

  • One independent variable (predictor, X)
  • One dependent variable (response, Y)

The goal is to find the straight line that best fits the data points, where "best" means the smallest sum of squared vertical distances between the points and the line.

Applications:

  • Predicting sales based on advertising spend
  • Estimating house prices from square footage
  • Analyzing temperature trends over time

The Mathematical Model

The simple linear regression model is expressed as:

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

Where:

  • \(Y_i\) = observed value of the dependent variable
  • \(X_i\) = value of the independent variable
  • \(\beta_0\) = y-intercept (constant term)
  • \(\beta_1\) = slope (regression coefficient)
  • \(\epsilon_i\) = random error term, where \(\epsilon_i \sim N(0, \sigma^2)\)
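
To make the roles of the terms concrete, the model can be simulated and then refitted. This is only a sketch: the sample size, the "true" coefficients, the error standard deviation, and all variable names are illustrative assumptions, not values from this presentation.

```r
# Simulate data from Y_i = beta_0 + beta_1 * X_i + epsilon_i.
# Every numeric value below is an assumption chosen for illustration.
set.seed(42)      # make the random draws reproducible
n      <- 100     # number of observations
beta_0 <- 2       # assumed true intercept
beta_1 <- 0.5     # assumed true slope
sigma  <- 1       # assumed error standard deviation

x       <- runif(n, min = 0, max = 10)     # predictor values
epsilon <- rnorm(n, mean = 0, sd = sigma)  # errors drawn from N(0, sigma^2)
y       <- beta_0 + beta_1 * x + epsilon   # response built from the model

# Refit the model; the estimates should land near beta_0 and beta_1
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Because the errors are random, the fitted coefficients will be close to, but not exactly equal to, the assumed true values.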

Estimating the Parameters

The least squares method chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the sum of squared residuals (SSE):

\[SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]
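
As a sketch of the intermediate step, the minimum can be located by setting the partial derivatives of SSE with respect to each coefficient to zero. This gives the two normal equations:

\[\frac{\partial SSE}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0\]

\[\frac{\partial SSE}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0\]

Solving these two equations simultaneously yields closed-form expressions for both coefficients.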

The estimates are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]
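
The two formulas can be checked numerically. The sketch below applies them to R's built-in cars dataset (the example used in the rest of this presentation) and compares the result with lm(). The variable names are chosen here for illustration.

```r
# Least squares estimates computed directly from the formulas
x <- cars$speed   # predictor: speed (mph)
y <- cars$dist    # response: stopping distance (ft)

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta_1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the line through the point (x_bar, y_bar)
beta_0_hat <- y_bar - beta_1_hat * x_bar

c(intercept = beta_0_hat, slope = beta_1_hat)

# Cross-check against R's built-in fitting function
coef(lm(dist ~ speed, data = cars))
```

Both approaches should agree up to floating-point rounding.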

Example Dataset: Car Speed vs Stopping Distance

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
## 7    10   18
## 8    10   26

This is R's built-in cars dataset (50 observations; the first 8 rows are shown above). It contains:

  • speed: Speed of cars (mph)
  • dist: Stopping distance (ft)

Question: Can we predict stopping distance based on speed?

Visualization: Scatter Plot with Regression Line
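
The original figure is not reproduced here. A plot of this kind could be drawn with base R graphics roughly as follows; the labels, colors, and the name fit_line are assumptions, and the original slide may have used a different plotting package.

```r
# Fit the line first so it can be overlaid on the scatter plot
fit_line <- lm(dist ~ speed, data = cars)

# Scatter plot of stopping distance against speed
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed",
     pch  = 19)

# Overlay the fitted least squares line
abline(fit_line, col = "blue", lwd = 2)
```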

Fitting the Model in R

# Fit the linear regression model
model <- lm(dist ~ speed, data = cars)

# Display the summary
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Model Results and Interpretation

The fitted regression equation is:

\[\hat{Y} = -17.58 + 3.93 \times X\]

Interpretation:

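  • Intercept (\(\hat{\beta}_0\) = -17.58): The model's predicted stopping distance at 0 mph. A negative distance has no physical meaning; 0 mph lies outside the observed speeds (4 to 25 mph), so the intercept anchors the line rather than describing real behavior
  • Slope (\(\hat{\beta}_1\) = 3.93): Each additional 1 mph of speed is associated with an average increase of about 3.93 feet in stopping distance
  • R² = 0.651: About 65.1% of the variation in stopping distance is explained by speed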
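
The R² value can be recomputed from its definition as a sanity check. This sketch refits the model so that it runs on its own; the names sse, sst, and r_squared are chosen here for illustration.

```r
model <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the model (sum of squared residuals)
sse <- sum(residuals(model)^2)

# Total variation of the response around its mean
sst <- sum((cars$dist - mean(cars$dist))^2)

# Proportion of variation explained by the model
r_squared <- 1 - sse / sst
r_squared

# In simple linear regression, R^2 equals the squared correlation of X and Y
cor(cars$speed, cars$dist)^2
```

Both quantities should match the "Multiple R-squared" reported by summary(model).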

Residual Analysis

In a well-fitting model, the residuals scatter randomly around zero, with no visible pattern and a roughly constant spread across the fitted values.
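
A residuals-versus-fitted plot is one common way to check this visually. The sketch below uses base R graphics and refits the model so that it is self-contained; the axis labels and title are assumptions.

```r
model <- lm(dist ~ speed, data = cars)

# Residuals on the vertical axis, fitted values on the horizontal axis
plot(fitted(model), residuals(model),
     xlab = "Fitted stopping distance (ft)",
     ylab = "Residual (ft)",
     main = "Residuals vs. fitted values",
     pch  = 19)

# Reference line at zero: points should scatter evenly above and below it
abline(h = 0, lty = 2)
```

A funnel shape (spread growing with the fitted value) would suggest non-constant variance, and a curved band would suggest a nonlinear relationship.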

Prediction Example

Using our model, we can make predictions:

# Predict stopping distance for a car going 22 mph
new_speed <- data.frame(speed = 22)
pred_result <- predict(model, new_speed, 
                       interval = "prediction", 
                       level = 0.95)
pred_result
##       fit      lwr      upr
## 1 68.9339 37.22044 100.6474

For a car traveling at 22 mph, the predicted stopping distance is 68.93 feet with a 95% prediction interval of [37.22, 100.65] feet.
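
The interval returned by predict() can be reproduced from the standard prediction-interval formula for simple linear regression: the point prediction plus or minus a t quantile times the standard error of a new observation. The sketch below is self-contained; the names x0, s, sxx, se_pred, and manual are chosen here for illustration.

```r
model <- lm(dist ~ speed, data = cars)

x   <- cars$speed
n   <- length(x)
x0  <- 22                         # speed at which to predict (mph)
s   <- summary(model)$sigma       # residual standard error
sxx <- sum((x - mean(x))^2)       # sum of squared deviations of x

# Point prediction: beta_0_hat + beta_1_hat * x0
fit <- sum(coef(model) * c(1, x0))

# Standard error for a single new observation at x0
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(x))^2 / sxx)

# Two-sided 95% critical value with n - 2 degrees of freedom
t_crit <- qt(0.975, df = n - 2)

manual <- c(fit = fit,
            lwr = fit - t_crit * se_pred,
            upr = fit + t_crit * se_pred)
manual
```

The leading 1 under the square root accounts for the scatter of an individual observation around the line, which is why a prediction interval is wider than a confidence interval for the mean response at the same speed.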

Key Assumptions of Linear Regression

  1. Linearity: The relationship between X and Y is linear
  2. Independence: Observations are independent of each other
  3. Homoscedasticity: Constant variance of errors (\(\sigma^2\) is constant)
  4. Normality: Errors are normally distributed: \(\epsilon_i \sim N(0, \sigma^2)\)

Why these matter:

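  • A nonlinear relationship biases the fitted line, while non-independent, heteroscedastic, or strongly non-normal errors make standard errors, confidence intervals, and p-values unreliable
  • Check the assumptions with diagnostic plots (for example, residuals vs. fitted values and a normal Q-Q plot) before relying on the inference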
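
One way to produce such plots is R's built-in plot method for lm objects, which by default draws four diagnostic panels. The sketch below refits the model so that it runs on its own and arranges the panels in a 2 x 2 grid; the layout is an assumption.

```r
model <- lm(dist ~ speed, data = cars)

# Arrange the four default diagnostic panels in a 2 x 2 grid,
# saving the previous graphics settings so they can be restored
op <- par(mfrow = c(2, 2))
plot(model)
par(op)   # restore the previous layout
```

The panels cover residuals against fitted values (linearity and constant variance), a normal Q-Q plot (normality), a scale-location plot (constant variance), and residuals against leverage (influential points). Independence usually has to be judged from how the data were collected rather than from a plot.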

Conclusion

Simple Linear Regression is a powerful tool for:

  • Understanding relationships between variables
  • Making predictions
  • Quantifying the strength of associations