Why F1 is a statistics playground 🏎️

In Formula 1, tiny improvements matter:

  • Pit strategy depends on tire degradation
  • Engineers trade off speed vs consistency
  • Teams model uncertainty to make better decisions

Today: we build a simple, reproducible analysis using regression + visualization.

The question (and the data we’ll use)

We want to understand:

  • How does average speed affect lap time?
  • How does tire wear affect lap time?

We’ll create a small F1-style dataset with:

  • Speed (km/h)
  • Tire wear (%)
  • Lap time (seconds)

Note: This dataset is simulated to be fully reproducible and easy to present.

Build a reproducible F1-style dataset

Quick sanity check:

n = 160 laps
speed ≈ 220 km/h
wear ∈ [0, 100]%
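A minimal sketch of how a dataset like this could be simulated. The seed, the spread of speeds, the noise level, and the "true" coefficients below are all assumptions chosen to roughly match the sanity check above; they are not the values behind the results reported later in the deck.

```r
set.seed(42)  # assumed seed; any fixed value makes the draw repeatable

n <- 160
speed     <- rnorm(n, mean = 220, sd = 9)      # km/h; the spread is an assumption
tire_wear <- runif(n, min = 0, max = 100)      # percent
noise     <- rnorm(n, mean = 0, sd = 1.7)      # seconds; assumed noise level

# Assumed relationship: faster laps are shorter, worn tires make laps longer
lap_time <- 126 - 0.18 * speed + 0.055 * tire_wear + noise

f1 <- data.frame(speed = speed, tire_wear = tire_wear, lap_time = lap_time)

# Quick sanity check on size and ranges
cat("n =", nrow(f1), "laps\n")
cat("mean speed =", round(mean(f1$speed), 1), "km/h\n")
cat("wear range =", round(range(f1$tire_wear), 1), "%\n")
```

Fixing the seed is what makes the analysis reproducible: rerunning the script regenerates exactly the same laps.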

The model (LaTeX)

We use multiple linear regression:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \]

Where:

  • \(Y\): lap time (seconds)
  • \(X_1\): average speed (km/h)
  • \(X_2\): tire wear (%)

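Interpretation: \(\beta_1\) is the expected change in lap time (seconds) per extra 1 km/h of average speed with tire wear held fixed; \(\beta_2\) is the expected change per extra percentage point of tire wear with speed held fixed.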

Assumptions (LaTeX)

To trust inference (CIs / p-values), common assumptions are:

\[ E(\varepsilon)=0,\qquad \operatorname{Var}(\varepsilon)=\sigma^2 \]

Often also:

\[ \varepsilon \sim \mathcal{N}(0,\sigma^2) \]

Practical check: residuals plotted against fitted values should look like “random noise”, scattered evenly around zero with no curve or funnel shape.
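The practical check can be sketched in base R. The data frame here is a stand-in simulated with assumed parameters (seed, coefficients, noise level), not the deck's own data; only the model formula is taken from the slides.

```r
# Stand-in data; every numeric setting below is an assumption
set.seed(1)
n <- 160
f1 <- data.frame(speed = rnorm(n, 220, 9), tire_wear = runif(n, 0, 100))
f1$lap_time <- 126 - 0.18 * f1$speed + 0.055 * f1$tire_wear + rnorm(n, 0, 1.7)

model <- lm(lap_time ~ speed + tire_wear, data = f1)

# Residuals vs fitted: look for an even band around the zero line
plot(fitted(model), resid(model),
     xlab = "Fitted lap time (s)", ylab = "Residual (s)",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line support the normality assumption
qqnorm(resid(model))
qqline(resid(model))
```

A curve in the first plot would suggest a missing nonlinear term; a funnel shape would suggest the constant-variance assumption fails.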

Required: a slide showing R code

model <- lm(lap_time ~ speed + tire_wear, data = f1)
summary(model)
## 
## Call:
## lm(formula = lap_time ~ speed + tire_wear, data = f1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7228 -1.2246 -0.0721  1.2609  4.4005 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 126.420910   3.350140   37.74   <2e-16 ***
## speed        -0.184921   0.015145  -12.21   <2e-16 ***
## tire_wear     0.054358   0.004705   11.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.714 on 157 degrees of freedom
## Multiple R-squared:  0.6569, Adjusted R-squared:  0.6525 
## F-statistic: 150.3 on 2 and 157 DF,  p-value: < 2.2e-16

On a real team, you’d add more predictors (track temp, fuel load, traffic, etc.).

ggplot #1 — Lap time vs speed
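A minimal sketch of how a plot like this could be drawn, assuming the ggplot2 package is installed. The data frame is a stand-in simulated with assumed parameters, and the styling choices (transparency, theme) are illustrative rather than the deck's own.

```r
library(ggplot2)

# Stand-in data; every numeric setting below is an assumption
set.seed(1)
n <- 160
f1 <- data.frame(speed = rnorm(n, 220, 9), tire_wear = runif(n, 0, 100))
f1$lap_time <- 126 - 0.18 * f1$speed + 0.055 * f1$tire_wear + rnorm(n, 0, 1.7)

# Scatter of laps with a straight-line fit and its confidence band
p <- ggplot(f1, aes(x = speed, y = lap_time)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Average speed (km/h)", y = "Lap time (s)",
       title = "Lap time vs speed") +
  theme_minimal()

print(p)
```

Swapping `speed` for `tire_wear` in `aes()` and the axis label gives the matching view for tire wear.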

ggplot #2 — Lap time vs tire wear

Diagnostics — residuals vs fitted

Plotly (3D) — interactive performance view
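A minimal sketch of an interactive 3D scatter, assuming the plotly package is installed. The data frame is a stand-in simulated with assumed parameters, and the marker size and axis titles are illustrative choices.

```r
library(plotly)

# Stand-in data; every numeric setting below is an assumption
set.seed(1)
n <- 160
f1 <- data.frame(speed = rnorm(n, 220, 9), tire_wear = runif(n, 0, 100))
f1$lap_time <- 126 - 0.18 * f1$speed + 0.055 * f1$tire_wear + rnorm(n, 0, 1.7)

# One point per lap: speed and wear on the floor, lap time on the vertical axis
fig <- plot_ly(
  data = f1,
  x = ~speed, y = ~tire_wear, z = ~lap_time,
  type = "scatter3d", mode = "markers",
  marker = list(size = 3)
)

fig <- layout(fig, scene = list(
  xaxis = list(title = "Speed (km/h)"),
  yaxis = list(title = "Tire wear (%)"),
  zaxis = list(title = "Lap time (s)")
))

fig
```

Rotating the cloud shows lap time falling along the speed axis and rising along the wear axis at the same time, which the two 2D plots can only show separately.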

What this means (in plain English)

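  • The speed coefficient is negative (about -0.18):
    • each extra 1 km/h of average speed goes with a lap about 0.18 s shorter, at the same tire wear
  • The wear coefficient is positive (about 0.054):
    • each extra percentage point of wear goes with a lap about 0.054 s longer, at the same speed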

This matches race intuition: push early, then manage tires to avoid late-stint drop-off.
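To put numbers on that intuition, here is a short worked calculation using the estimates from the summary(model) output shown earlier. The 10-unit changes and the example lap (220 km/h, 50% wear) are hypothetical values picked for illustration.

```r
# Estimates copied from the summary(model) output
b0      <- 126.420910
b_speed <- -0.184921
b_wear  <-  0.054358

# Effect of +10 km/h average speed, tire wear held fixed
delta_speed <- b_speed * 10   # -1.84921 seconds

# Effect of +10 percentage points of tire wear, speed held fixed
delta_wear <- b_wear * 10     # 0.54358 seconds

# Predicted lap time for a hypothetical lap at 220 km/h with 50% wear
pred <- b0 + b_speed * 220 + b_wear * 50   # about 88.46 seconds

cat("+10 km/h:  ", round(delta_speed, 2), "s\n")
cat("+10% wear: ", round(delta_wear, 2), "s\n")
cat("Predicted lap:", round(pred, 2), "s\n")
```

On these estimates, about 34 percentage points of extra wear costs as much lap time as 10 km/h of speed gains (1.849 / 0.0544 ≈ 34).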

Takeaways

  • Regression turns racing intuition into numbers
  • ggplot makes relationships & diagnostics easy to communicate
  • plotly adds an interactive “wow factor” for presentations

Next steps (real world): add track temperature, fuel load, stint length, and driver effects.