2026-04-06

Introduction

Simple linear regression is one of the most widely used statistical tools.

  • It models the linear relationship between a predictor variable \(X\) and a response variable \(Y\)
  • Used across many fields: biology, economics, engineering, gaming analytics, and more
  • In this presentation, we use a fun example: World of Warcraft Classic player data
  • Leveling is not uniform: levels 1–20 are fast, 20–44 are moderate, and 44–60 are a slow grind

Can we predict a player’s character level from the number of hours they’ve played?

The Data

We simulate data for 100 WoW Classic players, with a realistic leveling curve — fast early, slow late:

set.seed(42)
n <- 100
hours <- runif(n, 1, 200)

# WoW Classic leveling curve: fast 1-20, medium 20-44, slow 44-60
hours_to_level <- function(h) {
  ifelse(h <= 20,  h * 1.0,
  ifelse(h <= 100, 20 + (h - 20) * 0.3,
                   44 + (h - 100) * 0.1))
}

level <- hours_to_level(hours) + rnorm(n, mean = 0, sd = 2)
level <- pmin(pmax(round(level), 1), 60)
wow   <- data.frame(hours = hours, level = level, log_hours = log(hours))
head(wow, 5)
##       hours level log_hours
## 1 183.04640    53  5.209740
## 2 187.47801    51  5.233662
## 3  57.94177    35  4.059438
## 4 166.25908    52  5.113547
## 5 128.70736    47  4.857541
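
As a quick check of the simulated curve, the noiseless function can be evaluated at its two breakpoints and at the top of the sampled range. This is a small sketch that repeats hours_to_level from above; the inputs (20, 100, 200) and the names checkpoints and expected are illustrative only.

```r
# Same piecewise curve as above, repeated so this block runs on its own
hours_to_level <- function(h) {
  ifelse(h <= 20,  h * 1.0,
  ifelse(h <= 100, 20 + (h - 20) * 0.3,
                   44 + (h - 100) * 0.1))
}

# Noiseless level at the two breakpoints and at the 200-hour maximum
checkpoints <- c(20, 100, 200)
expected <- hours_to_level(checkpoints)
print(data.frame(hours = checkpoints, expected_level = expected))
```

The pieces join at levels 20 and 44, and the noiseless curve reaches only level 54 at 200 hours, so the cap of 60 is rarely, if ever, binding in this simulation.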

Why Not a Straight Line?

A scatter plot of level against raw hours bends: levels climb quickly at first and then flatten, so a single straight line fits poorly.
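
One rough numeric check of the curvature, sketched here under the assumption that the data are simulated exactly as in the Data section: fit a straight line to raw hours and average its residuals within each leveling phase. The names raw_fit, phase, and phase_means are illustrative. If a straight line were adequate, each phase mean would sit near zero; consistent over- or under-prediction within a phase points to curvature.

```r
# Recreate the simulated data from the Data section so this block runs on its own
set.seed(42)
n <- 100
hours <- runif(n, 1, 200)
hours_to_level <- function(h) {
  ifelse(h <= 20,  h * 1.0,
  ifelse(h <= 100, 20 + (h - 20) * 0.3,
                   44 + (h - 100) * 0.1))
}
level <- hours_to_level(hours) + rnorm(n, mean = 0, sd = 2)
level <- pmin(pmax(round(level), 1), 60)
wow   <- data.frame(hours = hours, level = level, log_hours = log(hours))

# Straight-line fit on raw hours
raw_fit <- lm(level ~ hours, data = wow)

# Mean residual within each leveling phase (breaks match the simulated curve)
phase <- cut(wow$hours, breaks = c(0, 20, 100, 200))
phase_means <- tapply(resid(raw_fit), phase, mean)
print(round(phase_means, 2))
```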

The Log-Transformed Model

Regressing level on \(\log(\text{Hours})\), the natural log of hours, largely straightens the curve. The model is still simple linear regression because it is linear in the parameters:

\[ Y_i = \beta_0 + \beta_1 \log(X_i) + \varepsilon_i, \quad \varepsilon_i \overset{iid}{\sim} N(0,\sigma^2) \]

Symbol               Meaning
\(Y_i\)              Character level of player \(i\)
\(\log(X_i)\)        Log of hours played by player \(i\)
\(\beta_0\)          Intercept
\(\beta_1\)          Change in level per unit increase in \(\log(\text{Hours})\)
\(\varepsilon_i\)    Random error term

OLS Estimation

The OLS estimators minimize the Residual Sum of Squares:

\[ \text{RSS} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - b_0 - b_1 \log(X_i))^2 \]

The closed-form solutions are:

\[ b_1 = \frac{\sum_{i=1}^n (\log X_i - \overline{\log X})(Y_i - \bar{Y})}{\sum_{i=1}^n (\log X_i - \overline{\log X})^2}, \qquad b_0 = \bar{Y} - b_1\,\overline{\log X} \]

The fitted value for player \(i\) is: \[\hat{Y}_i = b_0 + b_1 \log(X_i)\]
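
The closed-form estimators can be checked directly against lm(). The sketch below assumes the data are simulated as in the Data section; x, y, b0, and b1 are local names chosen to mirror the formulas, with x standing for \(\log(X_i)\).

```r
# Recreate the simulated data from the Data section so this block runs on its own
set.seed(42)
n <- 100
hours <- runif(n, 1, 200)
hours_to_level <- function(h) {
  ifelse(h <= 20,  h * 1.0,
  ifelse(h <= 100, 20 + (h - 20) * 0.3,
                   44 + (h - 100) * 0.1))
}
level <- hours_to_level(hours) + rnorm(n, mean = 0, sd = 2)
level <- pmin(pmax(round(level), 1), 60)
wow   <- data.frame(hours = hours, level = level, log_hours = log(hours))

x <- wow$log_hours   # log(X_i)
y <- wow$level       # Y_i

# Closed-form OLS estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
print(c(b0 = b0, b1 = b1))

# Cross-check: lm() should return the same two numbers
print(coef(lm(level ~ log_hours, data = wow)))
```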

Fitting the Log-Transformed Regression

R Code for the Model

library(ggplot2)

# Scatter of raw hours vs. level with a dashed straight-line (lm) fit overlaid
ggplot(wow, aes(x = hours, y = level)) +
  geom_point(color = "#8C1D40", alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "#FFC627",
              linetype = "dashed", linewidth = 1) +
  labs(title = "Raw Hours vs. Level",
       x = "Hours Played", y = "Character Level") +
  theme_minimal(base_size = 13)
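
The object model passed to confint() later on is not defined in any code shown here. A minimal sketch of the fit, assuming it was created with lm() on the log_hours column (the row names in the confint() output are consistent with that formula):

```r
# Recreate the simulated data from the Data section so this block runs on its own
set.seed(42)
n <- 100
hours <- runif(n, 1, 200)
hours_to_level <- function(h) {
  ifelse(h <= 20,  h * 1.0,
  ifelse(h <= 100, 20 + (h - 20) * 0.3,
                   44 + (h - 100) * 0.1))
}
level <- hours_to_level(hours) + rnorm(n, mean = 0, sd = 2)
level <- pmin(pmax(round(level), 1), 60)
wow   <- data.frame(hours = hours, level = level, log_hours = log(hours))

# Assumed fit: character level regressed on log(hours)
model <- lm(level ~ log_hours, data = wow)
print(summary(model))
```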

Model Diagnostics: Residual Plots

The residuals are roughly centered at zero with no strong curved pattern, though their spread grows somewhat at higher fitted values.
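
A residuals-versus-fitted plot of this kind can be sketched in base R. This assumes model is the lm() fit of level on log_hours (the fitting call is not shown in the original code); the axis labels, title, and the median split used to compare spread are illustrative choices.

```r
# Recreate the simulated data from the Data section so this block runs on its own
set.seed(42)
n <- 100
hours <- runif(n, 1, 200)
hours_to_level <- function(h) {
  ifelse(h <= 20,  h * 1.0,
  ifelse(h <= 100, 20 + (h - 20) * 0.3,
                   44 + (h - 100) * 0.1))
}
level <- hours_to_level(hours) + rnorm(n, mean = 0, sd = 2)
level <- pmin(pmax(round(level), 1), 60)
wow   <- data.frame(hours = hours, level = level, log_hours = log(hours))

model <- lm(level ~ log_hours, data = wow)  # assumed fit

# Residuals against fitted values, with a dashed reference line at zero
plot(fitted(model), resid(model),
     xlab = "Fitted level", ylab = "Residual",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Rough numeric companion: residual SD below vs. above the median fitted value
upper  <- fitted(model) > median(fitted(model))
spread <- tapply(resid(model), upper, sd)
print(round(spread, 2))
```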

Plotly: Hours Played vs Character Level

Hover over any point to see a player’s hours played and character level:

Inference on the Slope

Under the model assumptions, the t-statistic for testing \(H_0: \beta_1 = 0\) is:

\[ t = \frac{b_1}{SE(b_1)} \sim t_{n-2} \quad \text{under } H_0 \]

A 95% confidence interval for \(\beta_1\) is:

\[ b_1 \pm t_{n-2,\,0.025} \cdot SE(b_1) \]

confint(model, level = 0.95)
##                 2.5 %    97.5 %
## (Intercept) -17.39846 -11.06260
## log_hours    11.63095  13.05452
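
The interval can also be rebuilt by hand, which shows where \(SE(b_1)\) comes from: \(SE(b_1) = s / \sqrt{\sum_i (\log X_i - \overline{\log X})^2}\), with \(s^2 = \text{RSS}/(n-2)\). The sketch below assumes model is the lm() fit of level on log_hours; s, se_b1, t_stat, t_crit, ci, and p_value are illustrative names.

```r
# Recreate the simulated data from the Data section so this block runs on its own
set.seed(42)
n <- 100
hours <- runif(n, 1, 200)
hours_to_level <- function(h) {
  ifelse(h <= 20,  h * 1.0,
  ifelse(h <= 100, 20 + (h - 20) * 0.3,
                   44 + (h - 100) * 0.1))
}
level <- hours_to_level(hours) + rnorm(n, mean = 0, sd = 2)
level <- pmin(pmax(round(level), 1), 60)
wow   <- data.frame(hours = hours, level = level, log_hours = log(hours))

model <- lm(level ~ log_hours, data = wow)  # assumed fit
x  <- wow$log_hours
b1 <- coef(model)[["log_hours"]]

s      <- sqrt(sum(resid(model)^2) / (n - 2))   # residual standard error
se_b1  <- s / sqrt(sum((x - mean(x))^2))        # standard error of the slope
t_stat <- b1 / se_b1                            # test statistic for H0: beta_1 = 0
t_crit <- qt(0.975, df = n - 2)                 # two-sided 95% critical value
ci     <- b1 + c(-1, 1) * t_crit * se_b1        # 95% confidence interval
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(se_b1 = se_b1, t_stat = t_stat, p_value = p_value))
print(ci)  # should agree with the log_hours row of confint(model)
```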

The 95% interval for the slope, roughly (11.63, 13.05), lies well above zero, so \(H_0: \beta_1 = 0\) is rejected at the 5% level: \(\log(\text{hours played})\) is a statistically significant predictor of character level.

Summary

  • WoW Classic leveling follows a diminishing-returns curve — a log transformation linearizes it
  • The log-transformed model is still simple linear regression, as it is linear in the parameters
  • The fitted model:

\[ \widehat{\text{Level}} = -14.23 + 12.34 \times \log(\text{Hours}) \]

  • The scatter plot and residual plots indicate that the log model fits better than a straight line in raw hours
  • The slope is highly statistically significant (\(p < 0.001\))
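
To answer the opening question, the fitted equation can be applied to new playtimes. This sketch uses the rounded coefficients reported above; predict_level is a hypothetical helper and the playtimes (10, 50, 200 hours) are illustrative.

```r
# Hypothetical helper: applies the fitted equation with the rounded coefficients
predict_level <- function(hours, b0 = -14.23, b1 = 12.34) {
  b0 + b1 * log(hours)  # natural log, matching log_hours in the model
}

predicted <- predict_level(c(10, 50, 200))
print(round(predicted, 1))
# [1] 14.2 34.0 51.2
```

Because \(\log(1) = 0\), the same equation gives \(-14.23\) at one hour played, so its predictions should not be trusted at very low playtimes.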