We are interested in a future observation \(Y_{h(new)}\) at \(X=X_h\). This observation is separate from the sample data that we used to fit the linear regression model. Previously, all we could do was fit a regression line and construct confidence intervals for how well it fits. Now we want to predict where a new observation from outside the sample would lie (i.e., \(X_h\) is a new predictor value that did not come from the sample data, and we want to predict the corresponding response \(Y_{h(new)}\)).
The point prediction lies on the fitted line:
\(\hat{Y}_{h(new)} = \hat{\beta}_0 + \hat{\beta}_1 X_h\), where we predict using the fitted line \(\hat{Y}=\hat{\beta}_0+\hat{\beta}_1X\). (There is no \(\varepsilon\) term, because \(\hat{Y}\) is a point on the fitted line, not an observed response, so it carries no error term.)
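As a minimal sketch (the data `x`, `y` and the new level `x_h` below are invented purely for illustration), the point prediction can be computed directly from the least-squares formulas:

```python
import numpy as np

# Illustrative sample data (invented for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Least-squares estimates of the intercept and slope
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Point prediction at a new level X_h outside the sample
x_h = 4.5
y_h_hat = beta0_hat + beta1_hat * x_h   # \hat{Y}_h = \hat{beta}_0 + \hat{beta}_1 X_h
print(y_h_hat)
```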
The prediction interval for the new observation satisfies: \(P(Y_{h(new)} \in [lb, ub])=0.95\) (where \(\alpha=0.05\)).
Here 'lb' and 'ub' are the lower and upper bounds of the interval, which are random (they are computed from the sample data). It is important to note that we are implicitly assuming that this new future observation follows the model we previously described, that is: \(Y_{h(new)}=\beta_0+\beta_1X_h+\varepsilon_h\)
Let us first construct a confidence interval for the mean response at \(X_h\), i.e., for the new observation's point on the fitted line. A preliminary equation will make the confidence interval easier to understand:
\(Var(\hat{Y}_i)=\sigma^2(\frac{1}{n}+\frac{(X_i-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2})\)
The confidence interval is therefore:
\(1-\alpha\) C.I. for the mean response at \(X_h\): \((\hat{\beta}_0+\hat{\beta}_1X_h) \pm t_{n-2,(1-\frac{\alpha}{2})} \hat{\sigma}\sqrt{\frac{1}{n}+\frac{(X_h-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2}}\)
(here \(\hat{Y}_h = \hat{\beta}_0+\hat{\beta}_1X_h\))
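Continuing the sketch above (same assumed `x`, `y`, and `x_h`), this interval can be computed as:

```python
from scipy import stats

n = len(x)

# Estimate sigma from the residuals: \hat{sigma}^2 = SSE / (n - 2)
resid = y - (beta0_hat + beta1_hat * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{n-2, 1-alpha/2}

# Standard error of the mean response at X_h
se_mean = sigma_hat * np.sqrt(1 / n + (x_h - x_bar) ** 2 / np.sum((x - x_bar) ** 2))

ci = (y_h_hat - t_crit * se_mean, y_h_hat + t_crit * se_mean)
print(ci)
```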
Given the following:
\(Y_{h(new)}=\beta_0+\beta_1X_h+\varepsilon_h\) and \(\hat{Y}_h=\hat{\beta}_0+\hat{\beta}_1X_h\)
Then we can state that: \(E(Y_{h(new)} - \hat{Y}_h)=0\)
Previously it was mentioned that the new observation follows the model, and therefore \(E(\varepsilon_h)=0\).
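Written out, this uses \(E(\varepsilon_h)=0\) together with the unbiasedness of the least-squares estimators:

\[
E(Y_{h(new)}-\hat{Y}_h) = \big(\beta_0+\beta_1X_h+E(\varepsilon_h)\big) - \big(E(\hat{\beta}_0)+E(\hat{\beta}_1)X_h\big) = (\beta_0+\beta_1X_h)-(\beta_0+\beta_1X_h) = 0.
\]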
A more involved equation is:
\(Var(Y_{h(new)}-\hat{Y}_h) = \sigma^2\left(1+\frac{1}{n}+\frac{(X_h-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2}\right)\). The covariance step is spelled out below.
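The key step: \(Y_{h(new)}\) is independent of the sample data used to fit the line, so \(Cov(Y_{h(new)},\hat{Y}_h)=0\) and the variances simply add:

\[
Var(Y_{h(new)}-\hat{Y}_h) = Var(Y_{h(new)}) + Var(\hat{Y}_h) = \sigma^2 + \sigma^2\left(\frac{1}{n}+\frac{(X_h-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2}\right).
\]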
The first interval was a confidence interval for the mean response \(\beta_0+\beta_1X_h\) (estimated by \(\hat{Y}_h\)), while this one will be a prediction interval for the single new observation \(Y_{h(new)}\).
\(Y_{h(new)}-\hat{Y}_h \sim N(0, \sigma^2(1+\frac{1}{n}+\frac{(X_h-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2}))\)
Now recall that, once \(\sigma\) is replaced by its estimate \(\hat{\sigma}\), the standardized quantity follows a t-distribution:
\(\frac{Y_{h(new)}-\hat{Y}_h}{\hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(X_h-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2}}} \sim t_{n-2}\)
Skipping some steps…\(\Rightarrow P(\hat{Y}_h-t_{n-2,(1-\frac{\alpha}{2})} \hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(X_h-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2}} \leq Y_{h(new)} \leq \hat{Y}_h+t_{n-2,(1-\frac{\alpha}{2})} \hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(X_h-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2}})=1-\alpha\)
(Note the extra \(1\) under the square root compared with the confidence interval for the mean response; this is what makes the prediction interval wider.)
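Continuing the same sketch, the prediction interval differs from the mean-response interval only by that extra \(1\) under the square root:

```python
# Standard error for predicting a single new observation at X_h
# (note the extra "1 +" compared with se_mean above)
se_pred = sigma_hat * np.sqrt(1 + 1 / n + (x_h - x_bar) ** 2 / np.sum((x - x_bar) ** 2))

pi = (y_h_hat - t_crit * se_pred, y_h_hat + t_crit * se_pred)
print(pi)   # always wider than the confidence interval for the mean response
```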
When we want estimates at several levels of \(X\), say \(X_{h1}, X_{h2},\cdots,X_{hg}\), we need simultaneous confidence intervals for the \(g\) mean responses. We can also construct simultaneous prediction intervals (P.I.), one for each of the \(g\) levels of \(X\).
There are two approaches, Bonferroni and Working-Hotelling; however, I will not go into detail on how to carry out these two methods.
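For a flavor of the Bonferroni approach only (the standard adjustment, not derived here): each of the \(g\) intervals is built at level \(\alpha/g\), so the family-wise coverage is at least \(1-\alpha\). Continuing the sketch, with hypothetical levels `x_levels`:

```python
g = 3
x_levels = np.array([2.5, 4.0, 5.5])   # hypothetical X_{h1}, ..., X_{hg}

# Bonferroni: each interval uses alpha/g, giving joint coverage >= 1 - alpha
t_bonf = stats.t.ppf(1 - alpha / (2 * g), df=n - 2)
for xh in x_levels:
    yh = beta0_hat + beta1_hat * xh
    se = sigma_hat * np.sqrt(1 + 1 / n + (xh - x_bar) ** 2 / np.sum((x - x_bar) ** 2))
    print(xh, (yh - t_bonf * se, yh + t_bonf * se))
```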