In the previous lecture, we discussed creating confidence intervals using the \(CLT\).
All the intervals we discussed took the form of an estimate plus or minus a quantile from the standard normal distribution times the standard error of the estimator.
In this lecture we’re going to discuss some methods for small samples. Notably, we’re going to talk about Student’s (Gosset’s) \(t\) distribution and \(t\) confidence intervals.
These intervals are going to be of the form
\[Est \pm TQ \times SE_{Est}\]
So the only change is that we’ve replaced the \(z\) quantile with a \(t\) quantile. The \(t\) distribution has heavier tails than the normal distribution, so these intervals are going to be a little bit wider. These are some of the handiest intervals in all of statistics. If you ever want a rule for choosing between a \(t\) interval and a \(z\) interval in cases where both are available, simply use the \(t\) interval, because as you collect more data the \(t\) interval becomes more and more like the \(z\) interval anyway.
We’re just going to cover the single-group and two-group versions of the \(t\) interval.
The \(t\) distribution was invented by William Gosset (under the pseudonym “Student”) in 1908. He worked for the Guinness brewery, which did not want him to publish under his real name.
The \(t\) distribution has thicker tails than the normal. It is indexed by its degrees of freedom and gets more like a standard normal as \(df\) gets larger.
The reason for the \(t\) distribution is as follows. It assumes that the underlying data are \(iid\) Gaussian with the result that
\[\frac{\bar{X}-\mu}{S/\sqrt{n}}\]
is not Gaussian distributed; instead, it follows Gosset’s \(t\) distribution with \(n-1\) degrees of freedom. If we replaced \(S\) by \(\sigma\), the statistic would be exactly standard normal. However, when we replace \(\sigma\) by \(S\), it no longer follows a standard normal distribution; it follows a \(t\) distribution. As \(n\) increases, this distinction becomes irrelevant. For small sample sizes, however, the difference can be quite large, so if you use standard normal quantiles with small samples you can get, for example, confidence intervals that are too narrow.
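The claim that normal quantiles give intervals that are too narrow for small samples can be checked with a small simulation. The sketch below is illustrative only: the sample size of 5, the number of simulations, the seed, and the helper name `coverage` are all assumptions made for this example. It draws iid Gaussian samples and counts how often the interval \(\bar{X} \pm q \, S/\sqrt{n}\) contains the true mean, once with the normal quantile 1.960 and once with the \(t\) quantile for 4 degrees of freedom, 2.776.

```python
import random
import statistics
from math import sqrt


def coverage(quantile, n=5, mu=0.0, sigma=1.0, n_sims=20000, seed=1):
    """Fraction of simulated intervals xbar +/- quantile * s / sqrt(n) that contain mu."""
    rng = random.Random(seed)  # fixed seed so both calls see the same samples
    hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
        half_width = quantile * s / sqrt(n)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / n_sims


Z_975 = 1.960      # 0.975 quantile of the standard normal
T_975_DF4 = 2.776  # 0.975 quantile of the t distribution with 4 degrees of freedom

z_cov = coverage(Z_975)
t_cov = coverage(T_975_DF4)
print(f"z interval coverage: {z_cov:.3f}")
print(f"t interval coverage: {t_cov:.3f}")
```

With \(n = 5\), the nominal 95% interval built from the normal quantile should cover the true mean noticeably less often than 95% of the time (about 88% in theory), while the \(t\) version should come out close to 95%.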
The interval is \(\bar{X} \pm t_{n-1} S/\sqrt{n}\), where \(t_{n-1}\) is the relevant quantile of the \(t\) distribution with \(n-1\) degrees of freedom.
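As a rough sketch of how the interval above can be computed, the following uses only the Python standard library. The function names (`t_pdf`, `t_cdf`, `t_quantile`, `t_interval`) and the data values are hypothetical, made up for this example. The \(t\) quantile is found numerically (Simpson's rule for the CDF, then bisection); in practice one would normally take it from a statistics library, for example `scipy.stats.t.ppf`.

```python
import statistics
from math import exp, lgamma, pi, sqrt


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """CDF at x >= 0: 0.5 plus Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Quantile for 0.5 <= p < 1, found by bracketing and then bisection."""
    if not 0.5 <= p < 1:
        raise ValueError("this sketch only handles 0.5 <= p < 1")
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # widen the bracket until it contains the quantile
        lo, hi = hi, hi * 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_interval(data, level=0.95):
    """Two-sided t confidence interval for the mean: xbar +/- t_{n-1} * s / sqrt(n)."""
    n = len(data)
    xbar = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    q = t_quantile(1 - (1 - level) / 2, n - 1)
    half_width = q * s / sqrt(n)
    return xbar - half_width, xbar + half_width


data = [4.9, 5.6, 5.1, 6.0, 5.4]  # hypothetical measurements, for illustration only
lower, upper = t_interval(data)
print(f"95% t interval: ({lower:.3f}, {upper:.3f})")
```

The interval is symmetric about the sample mean, and its half-width is the \(t\) quantile times the estimated standard error \(S/\sqrt{n}\).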
The \(t\) interval technically assumes that the data are \(iid\) normal, though it is robust to departures from this assumption.
It works well whenever the distribution of the data is roughly symmetric and mound-shaped.
When you have paired observations, for example when you measure the same unit twice (once initially and again a few days later), you can use the \(t\) interval to analyze the data by taking differences, or differences on the \(log\) scale.
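Written out, the paired case reduces to the one-sample interval applied to the differences. The notation here is an assumption made for illustration: \(X_{i1}\) and \(X_{i2}\) are the first and second measurements on unit \(i\), and \(S_D\) is the sample standard deviation of the differences.

\[D_i = X_{i2} - X_{i1}, \qquad \bar{D} \pm t_{n-1} \frac{S_D}{\sqrt{n}}\]

On the \(log\) scale the differences become \(D_i = \log X_{i2} - \log X_{i1}\), so the resulting interval is for the mean log ratio of the two measurements.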
For large degrees of freedom, \(t\) quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval that the \(CLT\) yielded. Because of this, instead of picking between the \(t\) interval and the normal interval, I always say, just use the \(t\) interval.
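To give a sense of how quickly this convergence happens, here are standard table values of the 0.975 quantile (the one used for a 95% interval) at a few degrees of freedom, alongside the normal quantile:

\[t_{0.975,\,5} \approx 2.571, \quad t_{0.975,\,10} \approx 2.228, \quad t_{0.975,\,30} \approx 2.042, \quad t_{0.975,\,100} \approx 1.984, \quad z_{0.975} \approx 1.960\]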
For skewed distributions, the spirit of the \(t\) interval assumptions is violated.
Also, for skewed distributions, it doesn’t make a lot of sense to center the interval at the mean.
In this case, consider working on the \(log\) scale or using a different summary, such as the median. \(t\) intervals are generally a poor fit for skewed distributions, precisely because an interval centered at the mean is not a natural summary there.
For highly discrete data, like binary or Poisson data, other intervals are available, and it is probably preferable to use them instead of the \(t\) interval.