What’s Linear Regression?!

Simple Linear Regression models the linear relationship between:

  • A predictor variable \(X\) (independent)
  • A response variable \(Y\) (dependent)

It is one of the most foundational tools in statistics and machine learning

Sort of like fitting the best fitting line through a scatter plot of data points

Applications in Computer Science:

  • Predicting execution time from input size
  • Estimating memory usage from data volume
  • Modeling latency vs. server load

The Model Equation

The simple linear regression model is defined as:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

Where:

  • \(Y\) = response variable (what we predict)
  • \(X\) = predictor variable (what we observe)
  • \(\beta_0\) = intercept (value of \(Y\) when \(X = 0\))
  • \(\beta_1\) = slope (change in \(Y\) per unit increase in \(X\))
  • \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\) = random error term

Estimating the Coefficients

The coefficients \(\beta_0\) and \(\beta_1\) are estimated using Ordinary Least Squares (OLS), minimizing the sum of squared residuals:

\[\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2\]

The closed-form solutions are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

Our Dataset: Algorithm Runtime vs. Input Size

We simulate data representing the runtime (ms) of a sorting algorithm across varying input sizes

Input Size (n) Runtime (ms)
4583 73.18860
4692 76.81870
1502 31.61083
4169 58.46359
3245 53.51464
2644 26.72393

A classic CS problem: how does runtime scale with input?

Scatter Plot with Regression Line

Residual Diagnostics

Residuals = actual \(y_i\) minus predicted \(\hat{y}_i\). A good model has residuals randomly scattered around zero

3D Visualization: Residual Surface

The R Code Behind the Model

# Fit the linear model
model <- lm(runtime_ms ~ input_size, data = df)

# View coefficients and model summary
summary(model)

# Plot with ggplot2
ggplot(df, aes(x = input_size, y = runtime_ms)) +
  geom_point(color = "#3366CC", alpha = 0.7) +
  geom_smooth(method = "lm", color = "#8C1D40", se = TRUE) +
  labs(title = "Runtime vs Input Size",
       x = "Input Size (n)", y = "Runtime (ms)") +
  theme_minimal()

Model Summary & Interpretation

The fitted model is:

\[\hat{Y} = -0.5043 + 0.0155 \cdot X\]