Introduction

Simple linear regression models the relationship between two variables by fitting a linear equation to observed data. It is useful in many disciplines, for example predicting petal length from sepal length, or stock returns from interest rates.

The Linear Model

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

  • \(Y\): response variable
  • \(X\): explanatory variable
  • \(\beta_0\): intercept
  • \(\beta_1\): slope
  • \(\varepsilon\): error term

Least Squares Estimation

Choose \(\beta_0\) and \(\beta_1\) to minimize the sum of squared residuals:

\[ \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]

This gives us the line of best fit.
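
The minimization has a closed-form solution. As a brief sketch of the standard derivation, setting the partial derivatives of the sum of squares with respect to \(\beta_0\) and \(\beta_1\) to zero gives the normal equations:

\[ \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0, \qquad \sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \]

Solving them, with \(\bar{x}\) and \(\bar{y}\) denoting the sample means (notation introduced here for illustration), yields the estimates:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]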

Example: iris Dataset

data(iris)
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(model)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

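We model petal length as a function of sepal length in iris flowers. From the output above, the fitted line is approximately \(\hat{y} = -7.10 + 1.86\,x\), where \(x\) is sepal length and \(\hat{y}\) is predicted petal length.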
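
As a sanity check, the coefficients reported by lm() can be reproduced from the standard closed-form least squares solution. The following is a minimal sketch; the names x, y, b0, b1 and fit are chosen here for illustration and are not part of the original slides.

```r
# Closed-form least squares estimates for simple linear regression:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
x <- iris$Sepal.Length
y <- iris$Petal.Length

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Fit the same model with lm() and compare the two sets of estimates
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
all.equal(unname(coef(fit)), c(b0, b1))
```

If the two approaches agree, all.equal() returns TRUE, up to floating-point tolerance.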

ggplot: Scatter Plot with Regression Line
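
The plot for this slide is not reproduced here. One way to draw a comparable figure, assuming the ggplot2 package is installed, is sketched below; the object name p, the axis labels and the styling are illustrative choices, not necessarily those of the original figure.

```r
library(ggplot2)

# Scatter plot of the raw data with the fitted least squares line overlaid.
# geom_smooth(method = "lm") fits Petal.Length ~ Sepal.Length internally.
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Sepal length (cm)", y = "Petal length (cm)")

p
```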

ggplot: Residual Plot

The curved pattern in the residuals suggests that the relationship between the predictor and the response may not be perfectly linear, which can limit the accuracy of a linear regression model.
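
A residuals vs. fitted plot of this kind can be sketched as follows, again assuming ggplot2 is available. The data frame diag_df and the loess smoother are illustrative additions meant to make any curvature easier to see; they are not necessarily what the original figure used.

```r
library(ggplot2)

model <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Collect fitted values and residuals for plotting
diag_df <- data.frame(fitted = fitted(model), residual = resid(model))

# Residuals should scatter evenly around zero if the linear form is adequate;
# a systematic bend in the smoothed curve hints at non-linearity.
r <- ggplot(diag_df, aes(x = fitted, y = residual)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "Fitted petal length (cm)", y = "Residual (cm)")

r
```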

plotly: 3D Visualization

Model Interpretation

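  • Slope \(\beta_1\): expected change in petal length per unit increase in sepal length (about 1.86 in the iris fit)
  • Intercept \(\beta_0\): expected petal length when sepal length is zero (not always meaningful in context; here it is negative, about -7.10)
  • \(R^2\): proportion of variance in petal length explained by sepal length (0.76 in the iris fit)
  • p-value: tests the null hypothesis \(\beta_1 = 0\); a small value (here below 2e-16) indicates that sepal length is a statistically significant predictor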
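
The quantities listed above can be pulled out of the fitted model directly rather than read off the printed summary. This is a minimal sketch using base R accessors; the variable names s, slope, intercept, r_squared and p_slope are chosen here for illustration.

```r
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
s <- summary(model)

# Point estimates of the intercept and slope
intercept <- coef(model)[["(Intercept)"]]
slope     <- coef(model)[["Sepal.Length"]]

# Proportion of variance in petal length explained by sepal length
r_squared <- s$r.squared

# p-value for the test of slope = 0, taken from the coefficient table
p_slope <- s$coefficients["Sepal.Length", "Pr(>|t|)"]

round(c(intercept = intercept, slope = slope, r_squared = r_squared), 4)
```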

Conclusion

Simple linear regression is a useful tool for modeling and interpreting relationships between variables. While it can capture general trends, diagnostic plots like residuals vs. fitted values help reveal when the relationship may not be truly linear, as seen in the iris dataset example.