2025-10-16

What you’ll learn

  • How waiting time until eruption predicts eruption duration
  • The OLS model, key diagnostics, and interpretation
  • A quick look at model uncertainty and significance

Data at a glance

Eruption Duration (min)   Waiting Time (min)
3.600                     79
1.800                     54
3.333                     74
2.283                     62
4.533                     85
2.883                     55

Model and results

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   -1.874     0.160        -11.702   < 0.001
waiting        0.076     0.002         34.089   < 0.001
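
A coefficient table like the one above could be produced with a fit along these lines. This is a minimal sketch, assuming the slides use R's built-in `faithful` data set (columns `eruptions` and `waiting`); the names `df` and `fit` are illustrative.

```r
# Assumption: the data behind the slides is R's built-in `faithful` data set.
df <- faithful

# Ordinary least squares: eruption duration regressed on waiting time
fit <- lm(eruptions ~ waiting, data = df)

# Coefficient table: estimate, standard error, t value, p-value (rounded)
est <- round(summary(fit)$coefficients, 3)
print(est)
```

Very small p-values round to 0 at three decimals, so they are better reported as "< 0.001".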

Math: OLS estimator

Let \(x_i\) be waiting time and \(y_i\) be eruption duration.

\[ \hat{\beta}_{1} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})} {\sum_{i=1}^{n}(x_i-\bar{x})^2}, \quad \hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x} \]
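
To make the closed-form estimator concrete, the sketch below computes both coefficients directly from the formulas and compares them with `lm()`. It assumes the data is R's built-in `faithful` data set; `b0`, `b1`, and `fit` are illustrative names.

```r
# Assumption: the slides' data is R's built-in `faithful` data set.
x <- faithful$waiting    # x_i: waiting time (min)
y <- faithful$eruptions  # y_i: eruption duration (min)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: makes the fitted line pass through (x-bar, y-bar)
b0 <- mean(y) - b1 * mean(x)

# The hand-computed values should agree with lm() up to floating-point error
fit <- lm(eruptions ~ waiting, data = faithful)
print(all.equal(unname(coef(fit)), c(b0, b1)))
```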

Math: Inference for slope

We test the null hypothesis of no linear relationship against the two-sided alternative:

\[ H_{0} : \beta_{1} = 0 \quad \text{vs} \quad H_{a} : \beta_{1} \neq 0 \]

Under \(H_{0}\), the test statistic follows a t distribution with \(n-2\) degrees of freedom:

\[ t = \frac{\hat{\beta}_{1}}{\operatorname{SE}(\hat{\beta}_{1})} \sim t_{n-2} \]

The 95% confidence interval for the slope is:

\[ \hat{\beta}_{1} \; \pm \; t_{0.975,\,n-2} \cdot \operatorname{SE}(\hat{\beta}_{1}) \]
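
The test statistic and interval can be checked by hand against R's own output. This is a sketch under the assumption that the model is fit to the built-in `faithful` data set; all variable names are illustrative.

```r
# Assumption: the model is fit to R's built-in `faithful` data set.
fit <- lm(eruptions ~ waiting, data = faithful)
coefs <- summary(fit)$coefficients

b1 <- coefs["waiting", "Estimate"]
se_b1 <- coefs["waiting", "Std. Error"]
df_resid <- df.residual(fit)  # n - 2 for simple linear regression

# t statistic for H0: beta_1 = 0
t_stat <- b1 / se_b1

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# 95% confidence interval: estimate +/- t quantile * standard error
ci <- b1 + c(-1, 1) * qt(0.975, df = df_resid) * se_b1

# Should match confint() up to floating-point error
print(all.equal(ci, unname(confint(fit)["waiting", ])))
```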

ggplot #1: Duration vs Waiting (with fit)

ggplot #2: Residuals vs Fitted

plotly: Interactive Duration vs Waiting

Code to reproduce Plot #1

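library(ggplot2)
# Assumption: df holds R's built-in Old Faithful data (columns: eruptions, waiting)
df <- faithful
# Scatter plot with the fitted OLS line and its 95% confidence band
p_scatter <- ggplot(df, aes(x = waiting, y = eruptions)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Old Faithful: Eruption Duration vs Waiting Time",
       x = "Waiting Time (minutes)",
       y = "Eruption Duration (minutes)")
p_scatter  # print the plot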
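
Plot #2 (Residuals vs Fitted) could be reproduced along similar lines. The slides do not show its code, so this is only a sketch, assuming the model is fit to R's built-in `faithful` data set; `resid_df` and `p_resid` are hypothetical names.

```r
library(ggplot2)

# Assumption: the model is fit to R's built-in `faithful` data set.
fit <- lm(eruptions ~ waiting, data = faithful)

# Collect fitted values and residuals in one data frame for plotting
resid_df <- data.frame(fitted = fitted(fit), resid = resid(fit))

# Residuals should scatter around the dashed zero line without a clear pattern
p_resid <- ggplot(resid_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Old Faithful: Residuals vs Fitted",
       x = "Fitted Eruption Duration (minutes)",
       y = "Residual (minutes)")
```

Printing `p_resid` draws the plot; curvature or clustering in the residuals would suggest that a straight line does not fully describe the relationship.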

Takeaways

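  • Longer waiting times are associated with longer eruptions: each additional minute of waiting corresponds to about 0.076 more minutes of eruption
  • The slope is significantly different from 0 (t = 34.089, p < 0.001)
  • The residuals-vs-fitted plot indicates whether the linear model is appropriate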