Sigrid Keydana, Trivadis
2017/11/22
Trivadis
My background
My passion
Where to find me
It's 2000, just before the Olympics. This is the data we have:
Linear regression says 42.33.
Whatever we say, it's pretty likely to be wrong…
We'd better not commit to a point estimate…
Prediction intervals to the rescue!
Let's take the example of linear regression.
(fit <- lm(seconds ~ year, male400_1996))
Call:
lm(formula = seconds ~ year, data = male400_1996)

Coefficients:
(Intercept)         year
  207.46609     -0.08257
Here's the point prediction:
# this would yield the same result
# fit$coefficients[1] + fit$coefficients[2] * 2000
fit %>% predict(newdata = data.frame(year = c(2000)))
       1
42.33243
Confidence intervals:
fit %>% predict(newdata = data.frame(year = c(2000)), interval = "confidence")
       fit     lwr      upr
1 42.33243 41.2646 43.40026
Prediction intervals:
fit %>% predict(newdata = data.frame(year = c(2000)), interval = "prediction")
       fit      lwr      upr
1 42.33243 39.48191 45.18295
Quite a difference! So which one do we take?
\( \hat \beta_0 = \bar y - \hat \beta_1 \bar x \) \( \hat \beta_1 = \mathrm{cor}(y, x) \frac{\mathrm{sd}(y)}{\mathrm{sd}(x)} \)
\( y_i = \beta_0 + \beta_1 x_i + \epsilon_i \)
\( \sigma^2_{\hat \beta_0} = \hat \sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2 }\right) \) \( \sigma^2_{\hat \beta_1} = \hat \sigma^2 / \sum_{i=1}^n (x_i - \bar x)^2 \) with \( \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2 \)
Let's do this manually for the parameters. A 95% confidence interval for the intercept would look like this:
intercept_est <- summary(fit)$coefficients[1,1]
intercept_se <- summary(fit)$coefficients[1,2]
(conf_interval <- intercept_est + c(-1, 1) * qt(.975, df = fit$df) * intercept_se)
[1] 174.3053 240.6269
Same procedure for the slope:
slope_est <- summary(fit)$coefficients[2,1]
slope_se <- summary(fit)$coefficients[2,2]
(conf_interval <- slope_est + c(-1, 1) * qt(.975, df = fit$df) * slope_se)
[1] -0.09960579 -0.06552787
“with 95% confidence, we estimate that each passing year results in a decrease in the men's 400m Olympic winning times of 0.07 to 0.1 seconds”
… which reflects our uncertainty about the slope, not the points on the line.
We need the standard error for a point \( x_0 \) on the regression line: \( \hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}} \)
This reflects the amount of uncertainty due to our estimates being based on sample variation. Uncertainty is smallest near the mean of the predictor (\( \bar x \)).
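As a sanity check, here's a minimal sketch plugging \( x_0 = 2000 \) into this formula (reusing fit and male400_1996 from above):
x <- male400_1996$year
sigma_hat <- summary(fit)$sigma   # residual standard error, our estimate of sigma
se_fit <- sigma_hat * sqrt(1 / length(x) + (2000 - mean(x))^2 / sum((x - mean(x))^2))
point <- predict(fit, newdata = data.frame(year = 2000))
point + c(-1, 1) * qt(.975, df = fit$df) * se_fit
# reproduces the interval = "confidence" result: 41.2646 43.40026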
For sure the predictor \( x \) (the year we're in) cannot be held 100% responsible for the outcome \( y \) (the 400m winning time).
The standard error for an actual prediction is: \( \hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}} \)
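The same sketch works for the prediction interval; the only change is the extra 1 under the square root, for the irreducible observation noise:
se_pred <- sigma_hat * sqrt(1 + 1 / length(x) + (2000 - mean(x))^2 / sum((x - mean(x))^2))
point + c(-1, 1) * qt(.975, df = fit$df) * se_pred
# reproduces the interval = "prediction" result: 39.48191 45.18295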
What if we were using, say, ARIMA for forecasting that time series?
What if we used some custom method that does not come complete with standard errors / confidence intervals and all?
library(forecast)
arima_fit <- auto.arima(male400_1996$seconds)
arima_fit$coef
       ma1      drift
-0.7998509 -0.3807447
We get standard errors for the parameter estimates:
sqrt(diag(vcov(arima_fit)))
      ma1       ma2       ma3
0.2524413 0.3333542 0.3169874
… which means we can get prediction intervals here, too:
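For instance, a minimal sketch using forecast() from the forecast package, which returns the point forecast together with prediction intervals at the requested level:
forecast(arima_fit, h = 1, level = 95)   # one-step-ahead forecast with a 95% interval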
We'll probably need to pull ourselves up by our own bootstraps…
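Here's a minimal sketch of that idea for our linear model (the recipe carries over to methods without analytical standard errors): resample the data, refit, predict, and add a resampled residual to account for observation noise.
set.seed(777)                      # for reproducibility
boot_preds <- replicate(2000, {
  idx <- sample(nrow(male400_1996), replace = TRUE)
  boot_fit <- lm(seconds ~ year, data = male400_1996[idx, ])
  predict(boot_fit, newdata = data.frame(year = 2000)) + sample(residuals(boot_fit), 1)
})
quantile(boot_preds, c(0.025, 0.975))   # an approximate 95% prediction interval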
Wouldn't it be nice if there was a unified, intuitive approach?
Well … there is.
In Bayesian statistics:
\[ P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} \]
R packages for Bayesian modeling (using the Stan backend for MCMC sampling) include rstanarm, brms, and rethinking; we'll use rethinking here.
What do we know?
library(rethinking)
model <- map2stan(
  alist(
    seconds ~ dnorm(mu, sigma),   # likelihood
    mu <- a + b * year,           # linear model
    a ~ dnorm(46, 30),            # prior for the intercept
    b ~ dnorm(0, 10),             # prior for the slope
    sigma ~ dcauchy(0, 10)        # prior for the residual standard deviation
  ),
  data = male400_1996,
  iter = 6000,
  chains = 4,
  verbose = FALSE
)
precis(model)
        Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
a     156.75  21.70     123.19     190.36  1695    1
b      -0.06   0.01      -0.07      -0.04  1693    1
sigma   1.62   0.35       1.06       2.07  1076    1
With all the sampling that's been going on, now we don't just have 1 line, but many!
We extract parameter values from the samples and construct one line per sample:
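A minimal sketch of that idea, using extract.samples and col.alpha from rethinking (model and male400_1996 as above):
post <- extract.samples(model)    # posterior samples of a, b, sigma
plot(seconds ~ year, data = male400_1996)
for (i in 1:100) abline(a = post$a[i], b = post$b[i], col = col.alpha("black", 0.1))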
We can immediately see the uncertainty in our model!
In the Bayesian framework, the equivalent of a confidence interval is a credible interval. Equivalent? Well… except that a credible interval is a direct probability statement about the parameter, which is what many people mistakenly take a confidence interval to be.
Credible intervals are easily constructed using existing samples from the parameters' posterior.
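For instance, a minimal sketch of an 89% highest posterior density interval for the slope, using HPDI from rethinking:
post <- extract.samples(model)
HPDI(post$b, prob = 0.89)   # 89% credible interval for the slope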
Of course, here again we're not really interested in predicting averages, but individual examples.
We want prediction intervals:
Prediction intervals are computed from the complete posterior predictive distribution.
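A minimal sketch with rethinking: sim() draws from the posterior predictive for new data, and PI() summarizes those draws as an interval.
pred_samples <- sim(model, data = list(year = 2000))   # posterior predictive draws for the year 2000
PI(pred_samples, prob = 0.89)   # 89% prediction interval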
… didn't you say we'd look at complete distributions for our predictions?
Yes. That's the posterior predictive distribution:
\[ p(\tilde{y}|\tilde{x}, X, y) = \int p(\tilde{y}|\tilde{x},\theta) \, p(\theta|X,y) \operatorname{d}\!\theta \]
The posterior predictive is a weighted average of predictions, computed over all possible parameter values and weighted by their posterior probability.
We actually have a distribution of predictions for every point in time.
… what if I'm using deep learning?
“… We'll see that what we did above, averaging forward passes through the network, is equivalent to Monte Carlo integration over a Gaussian process posterior approximation.”
Yarin Gal, What my deep model doesn't know
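A minimal sketch of that idea (Monte Carlo dropout), assuming a hypothetical keras model dl_model built with dropout layers and a new input x_new; calling the model with training = TRUE keeps dropout active at prediction time:
library(keras)
# 100 stochastic forward passes through the (hypothetical) network
mc_preds <- sapply(1:100, function(i) as.numeric(dl_model(x_new, training = TRUE)))
mean(mc_preds)                          # point prediction: the Monte Carlo average
quantile(mc_preds, c(0.025, 0.975))     # approximate 95% uncertainty bounds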
… how wrong we'd have been with our point prediction…
In reality it's 2017 so we're so much wiser now…
So what was the men's 400m Olympic winning time in 2000?
Thank you!!