Introduction to Simple Linear Regression

Simple Linear Regression is a statistical method used to model the relationship between two variables:

  • Response variable (Y): The variable we want to predict
  • Predictor variable (X): The variable we use to make predictions

In simple linear regression, we try to predict Y based on X using a straight line.

Applications: predicting crop yield based on rainfall, predicting website traffic based on ad spending, estimating salary based on years of experience, and many more.

Mathematical Model

The simple linear regression model is expressed as:

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

where:

  • \(Y_i\) = observed response value for observation \(i\)
  • \(X_i\) = predictor value for observation \(i\)
  • \(\beta_0\) = y-intercept (population parameter)
  • \(\beta_1\) = slope (population parameter)
  • \(\epsilon_i\) = random error term

The fitted line, obtained by replacing the unknown parameters with their estimates, is: \[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\]

Least Squares Estimation

We estimate parameters by minimizing the Sum of Squared Errors (SSE):

\[SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]

The least squares estimators are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\]

where \(\bar{X}\) and \(\bar{Y}\) are the sample means.
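The closed-form estimators above can be checked numerically. The following is a minimal sketch in R, assuming R's built-in iris data with petal width as X and petal length as Y; the variable names (x, y, b0, b1, sse, fit) are illustrative only.

```r
# Predictor (X) and response (Y) taken from the built-in iris data
x <- iris$Petal.Width
y <- iris$Petal.Length

# Slope: sum of cross-deviations divided by sum of squared X deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: makes the fitted line pass through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

# Sum of squared errors at the least squares solution
sse <- sum((y - b0 - b1 * x)^2)

# Compare with lm(); both sets of estimates should agree
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1, sse = sse))
print(coef(fit))
```

Any other choice of intercept and slope should give a larger SSE than the value computed here, which is what "least squares" means.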

Iris Flower Measurements Dataset

The iris dataset contains measurements from 150 iris flowers, 50 from each of 3 species, collected by a botanist. We’ll use this classic dataset to model petal length as a linear function of petal width.

Statistic                               Value
Mean Petal Width (cm)                   1.199
Mean Petal Length (cm)                  3.758
Standard Deviation Petal Width (cm)     0.762
Standard Deviation Petal Length (cm)    1.765
Correlation (Width vs. Length)          0.963
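The summary statistics above can be reproduced with a few base R calls. This is a small sketch assuming the built-in iris data; the name stats and its element labels are illustrative.

```r
# Petal measurements from the built-in iris data
x <- iris$Petal.Width
y <- iris$Petal.Length

# Means, sample standard deviations, and Pearson correlation
stats <- c(
  mean_width  = mean(x),
  mean_length = mean(y),
  sd_width    = sd(x),
  sd_length   = sd(y),
  correlation = cor(x, y)
)

print(round(stats, 3))
```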

Scatter Plot of Linear Regression
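A scatter plot with the fitted line overlaid could be drawn along the following lines. This is a sketch rather than the code behind the original figure: it assumes ggplot2 is installed, and the axis labels and title are illustrative choices.

```r
library(ggplot2)

# Scatter plot of petal length against petal width,
# with the least squares line added by geom_smooth(method = "lm")
p <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Petal width (cm)",
    y = "Petal length (cm)",
    title = "Petal length vs. petal width with fitted line"
  )

print(p)
```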

Interactive 3D Visualization

Fitting the Model

# Load ggplot2 for plotting
library(ggplot2)

# Use the built-in iris dataset, keeping only the two petal measurements
study_data <- iris[, c("Petal.Width", "Petal.Length")]
colnames(study_data) <- c("Width", "Length")

# Fit the simple linear regression of Length on Width
model <- lm(Length ~ Width, data = study_data)

# Predict petal length for three new petal widths (in cm)
new_data <- data.frame(Width = c(0.5, 1.0, 2.0))
predictions <- predict(model, newdata = new_data)
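Point predictions say nothing about their uncertainty. As a hedged extension of the step above, predict() can also return 95% prediction intervals. The sketch below refits the same model on the original iris column names so that it runs on its own; pred_int is an illustrative name.

```r
# Refit the model using the original iris column names
model <- lm(Petal.Length ~ Petal.Width, data = iris)

# New petal widths (cm) at which to predict petal length
new_data <- data.frame(Petal.Width = c(0.5, 1.0, 2.0))

# interval = "prediction" returns the point prediction (fit)
# plus lower (lwr) and upper (upr) bounds for a new observation
pred_int <- predict(model, newdata = new_data,
                    interval = "prediction", level = 0.95)

print(cbind(new_data, round(pred_int, 3)))
```

Each row gives the predicted petal length together with a range expected to contain a new flower's petal length about 95% of the time, provided the model assumptions hold.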

Model Results and Interpretation

             Estimate  Std. Error  t-value   p-value
(Intercept)    1.0836      0.0730  14.8500  < 0.0001
Width          2.2299      0.0514  43.3872  < 0.0001

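Model: \(\widehat{\text{Length}} = 1.08 + 2.23 \times \text{Width}\)

  • Intercept (\(\hat{\beta}_0 \approx 1.08\)): the expected petal length (in cm) when petal width is 0 cm. No flower in the data has a width of 0, so this value is an extrapolation and mainly anchors the line.
  • Slope (\(\hat{\beta}_1 \approx 2.23\)): each 1 cm increase in petal width is associated with an average increase of about 2.23 cm in petal length.
  • R² = 0.9271: the model explains 92.71% of the variance in petal length.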
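The coefficient table and R² reported here can be pulled from the fitted model with summary(). The sketch below refits the model on the original iris column names so it is self-contained; s is an illustrative name.

```r
# Fit the model and summarise it
model <- lm(Petal.Length ~ Petal.Width, data = iris)
s <- summary(model)

# Coefficient matrix: estimate, standard error, t value, p-value
print(coef(s))

# Proportion of variance in petal length explained by the model
print(round(s$r.squared, 4))
```

R prints very small p-values in scientific notation, which is why a rounded table may display them as 0.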

Residual Analysis
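Residuals (observed minus fitted values) are commonly inspected to check the linearity, constant-variance, and normality assumptions. The following is a minimal sketch of two standard diagnostic plots in base R, not necessarily the ones shown in the original presentation; the object names are illustrative.

```r
# Fit the model and extract residuals and fitted values
model <- lm(Petal.Length ~ Petal.Width, data = iris)
res <- resid(model)
fit_vals <- fitted(model)

# Residuals vs. fitted: look for curvature or a funnel shape
plot(fit_vals, res,
     xlab = "Fitted petal length (cm)",
     ylab = "Residual (cm)",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line suggest roughly normal errors
qqnorm(res)
qqline(res)
```

With an intercept in the model, the residuals sum to zero by construction, so the plots are read for patterns around the zero line rather than for overall level.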

Conclusion

  • Simple linear regression is a powerful tool for understanding the relationship between two variables.
  • Our model explains over 92% of the variance in petal length.
  • Simple linear regression is the foundation for more complex regression techniques.