Expected values characterize a distribution. The mean characterizes the center of a density or mass function. The variance characterizes how spread out a density is. The skewness considers how much a density is pulled toward high or low values.
The expected value or (population) mean of a random variable is the center of its distribution. For a discrete random variable \(X\) with PMF \(p(x)\), the expected value is \[E[X] = \sum xp(x),\] where the sum is taken over the possible values of \(x\). Specifically, \(E[X]\) represents the center of mass of a collection of locations (the values \(x\)) and weights (the probabilities \(p(x)\)).
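As a small illustration, here is a minimal sketch in R using a made-up PMF; the expected value is just the probability-weighted sum of the values:

```r
# Expected value of a hypothetical PMF: locations and their weights
x <- c(0, 1, 2)
p <- c(0.5, 0.3, 0.2)
sum(x * p)   # E[X] = 0.7
```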
It is important to contrast the population mean (the estimand) with the sample mean (the estimator): the sample mean estimates the population mean. Just as the population mean is the center of mass of the population distribution, the sample mean is the center of mass of the data. The equation for the sample mean is identical, with \(p(x_i) = 1/n\). Thus, \[\bar{x} = \dfrac{1}{n}\sum_{i=1}^n x_i.\]
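For example, with a few made-up data points, the sample mean agrees with the weighted-sum formula using equal weights \(1/n\):

```r
x <- c(2, 5, 9, 12)        # made-up data
sum(x * (1 / length(x)))   # center of mass with equal weights: 7
mean(x)                    # the built-in sample mean: same answer
```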
For a continuous random variable \(X\) with density \(f\), the expected value is \[E[X] = \int_{-\infty}^{\infty} x f(x)\,dx,\] which is again exactly the center of mass of the density.
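As a quick check in R, numerical integration recovers the center of mass of a Uniform(0, 1) density (the choice of density here is purely illustrative):

```r
# Center of mass of the Uniform(0, 1) density by numerical integration
integrate(function(x) x * dunif(x), lower = 0, upper = 1)  # 0.5
```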
It is important to note that the sample mean is unbiased: the population mean of its distribution is the mean that it is trying to estimate. Moreover, the more data that goes into the sample mean, the more concentrated its density / mass function is around the population mean.
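A minimal simulation sketch in R (assuming, for illustration, a normal population with mean 5 and SD 2) shows both properties: sample means are centered at the population mean for any \(n\), and they concentrate as \(n\) grows:

```r
set.seed(42)
mu <- 5   # assumed population mean for this illustration
means10  <- replicate(10000, mean(rnorm(10,  mu, 2)))
means100 <- replicate(10000, mean(rnorm(100, mu, 2)))
mean(means10); mean(means100)   # both near 5: unbiased for any n
sd(means10);   sd(means100)     # much smaller for n = 100: more concentrated
```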
Recall that the mean of a distribution is a measure of its center. The variance, on the other hand, is a measure of spread.
The distribution of sample means is centered at the same spot as the original distribution, but is less spread out. Thus, it is less likely for a sample mean to be far away from the population mean than it is for an individual observation.
If \(X\) is a random variable with mean \(\mu\), the variance of \(X\), written \(Var(X)\), is defined as \[Var(X) = E[(X - \mu)^2] = E[X^2] - E[X]^2.\]
The variance is the expected (squared) distance from the mean. Densities with a higher variance are more spread out than densities with a lower variance. The square root of the variance is called the standard deviation. The main benefit of working with standard deviations is that they have the same units as the data, whereas the variance has the units squared.
As an example, let us find the mean and variance resulting from the toss of a fair die. First, it is not difficult to calculate \(E[X] = 3.5\). Next, \[E[X^2] = 1^2(1/6) + 2^2(1/6) + 3^2(1/6) + 4^2(1/6) + 5^2(1/6) + 6^2(1/6) \approx 15.17.\] Thus the variance is \(Var(X) = E[X^2] - E[X]^2 \approx 15.17 - 3.5^2 \approx 2.92\).
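The same computation in R, treating the die as six equally weighted locations:

```r
x <- 1:6
p <- rep(1/6, 6)
mu  <- sum(x * p)     # E[X] = 3.5
ex2 <- sum(x^2 * p)   # E[X^2] = 15.1667
ex2 - mu^2            # variance = 2.9167
```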
The sample variance is the estimator of the population variance. Recall that the population variance is the expected squared deviation around the population mean. The sample variance is (roughly) the average squared deviation of observations around the sample mean. It is given by \[S^2 = \dfrac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}.\] The sample standard deviation is the positive square root of the sample variance.
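A short sketch in R, using simulated data, confirms that the formula matches the built-in `var` function:

```r
set.seed(9)
x <- rnorm(20)                           # a made-up sample
sum((x - mean(x))^2) / (length(x) - 1)   # sample variance by hand
var(x)                                   # matches the built-in
sd(x)                                    # its positive square root
```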
At last, we get to a perhaps surprising (and very useful) fact: how to estimate the variability of the mean of a sample when we only get to observe one realization. Recall that the average of a random sample from a population is itself a random variable with its own distribution, which in simulation settings we can explore by computing repeated sample averages. We know that this distribution is centered at the population mean, \(E[\bar{X}] = \mu\). We also know the variance of the distribution of sample means.
The variance of the sample mean is: \(Var(\bar{X}) = \sigma^2/n\) where \(\sigma^2\) is the variance of the population being sampled from.
This is very useful, since it means we do not need repeated sample means to estimate the variance of the mean directly from the data. We already have a good estimate of \(\sigma^2\) via the sample variance, so we can get a good estimate of the variability of the mean even though we only get to observe a single mean.
The square root of the variance of the mean is the standard deviation of the mean. We call the standard deviation of a statistic its standard error.
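A minimal simulation sketch in R (with illustrative values \(n = 25\) and \(\sigma = 3\)) confirms the formula and shows how a single sample yields an estimated standard error:

```r
set.seed(1)
n <- 25; sigma <- 3
sigma^2 / n                                       # theoretical Var of the mean: 0.36
var(replicate(10000, mean(rnorm(n, 0, sigma))))   # simulation agrees
x <- rnorm(n, 0, sigma)                           # in practice we see one sample
sd(x) / sqrt(n)                                   # estimated standard error of the mean
```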
Asymptotics is the term mathematicians use for the limiting behavior of something; in this case, the limit refers to the sample size growing large. Asymptotics are incredibly useful for simple statistical inference and approximations. The ideas of asymptotics also form the basis for the frequency interpretation of probability, which considers the long-run proportion of times an event occurs.
The Law of Large Numbers says that the observed sample average from a large sample tends toward the true population average as the number of observations increases. Essentially, if you could collect an infinite amount of data, you would estimate the population mean perfectly. Suppose \(\bar{X}_n\) is the average of the results of \(n\) coin flips (i.e., the sample proportion of heads). The Law of Large Numbers states that as we flip a coin over and over, \(\bar{X}_n\) eventually converges to the true probability of a head.
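A short R sketch of the running average of simulated coin flips shows the convergence:

```r
set.seed(7)
n <- 10000
flips <- rbinom(n, 1, 0.5)              # fair coin: heads = 1, tails = 0
running_mean <- cumsum(flips) / (1:n)   # sample proportion after each flip
running_mean[c(10, 100, 1000, 10000)]   # settles toward 0.5
```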
The Central Limit Theorem (CLT) is arguably the most fundamental theorem in statistics. The CLT states that the distribution of averages of independent, identically distributed variables becomes normal as the sample size increases. Thus, for large samples, we have a good sense of the distribution of the average even though (a) we only observed one average and (b) we don't know the population distribution. So we can treat our sample mean \(\bar{X}_n\) as approximately normally distributed with mean \(\mu\) and variance \(\sigma^2/n\) (or standard deviation \(\sigma/\sqrt{n}\)).
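A minimal R sketch (using, for illustration, averages of 40 exponentials, a decidedly skewed population with \(\mu = \sigma = 1\)) shows that the distribution of the averages has the predicted mean and standard deviation:

```r
set.seed(3)
# Averages of n = 40 exponential variables (population mean 1, SD 1)
means <- replicate(10000, mean(rexp(40, rate = 1)))
mean(means)   # ~1, the population mean
sd(means)     # ~1/sqrt(40) = 0.158, i.e. sigma/sqrt(n)
# hist(means) would look approximately normal despite the skewed population
```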
Confidence intervals are a nice way of quantifying the uncertainty in our estimates. The width of the interval reflects the randomness that prevents us from getting a perfect estimate.
According to the CLT, the sample mean \(\bar{X}\) is approximately normal with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\). Only about 2.5% of the normal distribution lies more than 2 standard deviations above the mean, and likewise below. Thus, the probability that \(\bar{X}\) is larger than \(\mu + 2 \sigma/\sqrt{n}\) or smaller than \(\mu - 2 \sigma/\sqrt{n}\) is about 5%. So we call \(\bar{X} \pm 2 \sigma/\sqrt{n}\) the 95% confidence interval for \(\mu\): if we repeatedly took random samples and formed this interval each time, about 95% of the intervals would contain \(\mu\). Note that the 97.5th quantile of the standard normal is 1.96, so the 2 above is a convenient rounding. If instead of a 95% interval you wanted a 90% interval, replace the 2 with the 95th quantile, 1.645.
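A sketch in R, using a made-up sample, constructs these intervals directly from the quantiles:

```r
set.seed(11)
x <- rnorm(50, mean = 10, sd = 2)   # hypothetical sample
n <- length(x)
mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(n)   # 95% interval
mean(x) + c(-1, 1) * qnorm(0.95)  * sd(x) / sqrt(n)   # 90% interval
```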
In the event that each observation takes the value 0 or 1 with common success probability \(p\), then \(\sigma^2 = p(1 - p)\). The interval then takes the form \[\hat{p} \pm 2\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}.\] Note again that the 2 can be replaced by the quantile corresponding to the desired confidence level.
As an aside, \(p(1-p)\) is maximized at \(p = 1/2\), where it equals \(1/4\). So a quick, conservative estimate of the 95% confidence interval is \(\hat{p} \pm \dfrac{1}{\sqrt{n}}\).
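For example, with hypothetical data of 56 successes in 100 trials, the two versions of the interval can be computed as:

```r
p_hat <- 56 / 100; n <- 100   # hypothetical: 56 successes in 100 trials
p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)  # full interval
p_hat + c(-1, 1) / sqrt(n)    # quick conservative back-of-envelope version
```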
We have seen confidence intervals of the form \[Est \pm Z \times SE_{Est}.\] For small samples, we use the \(t\)-distribution and \(t\) confidence intervals. These intervals are of the form \[Est \pm t \times SE_{Est}.\] Thus, the only change is that we replace the \(Z\) quantile with a \(t\) quantile. If you are unsure whether you should use a normal interval or a \(t\) interval, use the \(t\) interval.
The \(t\) distribution was introduced by William Gosset (under the pseudonym "Student") in 1908; Fisher provided further mathematical details about the distribution later. It has thicker tails than the normal, is indexed by a degrees-of-freedom parameter, and becomes more like a standard normal as the degrees of freedom get larger. It assumes that the underlying data are Gaussian, with the result that \(\dfrac{\bar{X} - \mu}{S/\sqrt{n}}\) follows the \(t\) distribution with \(n - 1\) degrees of freedom. Note that if we replaced \(S\) by \(\sigma\), the statistic would be standard normal; the \(t\)-distribution accounts for the additional uncertainty that comes from estimating \(\sigma\). The confidence interval is \(\bar{X} \pm t_{n-1} S/\sqrt{n}\), where \(t_{n-1}\) is the relevant quantile from the \(t\)-distribution.
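A minimal R sketch with a small simulated Gaussian sample; the hand-built interval matches the built-in `t.test`:

```r
set.seed(5)
x <- rnorm(16, mean = 3)   # small hypothetical Gaussian sample
n <- length(x)
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)  # 95% t interval
t.test(x)$conf.int         # the same interval from the built-in function
```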
Hypothesis testing is concerned with making decisions using data. To make decisions using data, we need to characterize the kinds of conclusions we can make. Classical hypothesis testing is concerned with choosing between two hypotheses. The null hypothesis, labeled \(H_0\), represents the status quo and always includes an equality; it is what we assume by default. The alternative hypothesis, \(H_a\), is what we require evidence to conclude.
The alternative hypothesis typically takes the form that the true mean is \(<\), \(>\) or \(\not =\) the hypothesized mean, while the null sharply specifies the mean. There are four possible outcomes of our statistical decision process:
| Truth | Decide | Result |
|---|---|---|
| \(H_0\) | \(H_0\) | Correctly accept null |
| \(H_0\) | \(H_a\) | Type I error |
| \(H_a\) | \(H_0\) | Type II error |
| \(H_a\) | \(H_a\) | Correctly reject null |
In a hypothesis test, we attempt to force the probability of a type I error to be small.
Consider a court of law and a criminal case. The null hypothesis is that the defendant is innocent. The rules require a standard of evidence to reject the null hypothesis (and for the jury to convict). That standard is to convict only if the defendant appears guilty "beyond reasonable doubt". In statistics, we can be mathematically specific about our standard of evidence.
Note the consequences of setting a standard. If we set a low standard, such as convicting whenever there is circumstantial evidence or better, then we would increase the percentage of innocent people convicted (type I errors); however, we would also increase the percentage of guilty people convicted (correctly rejecting the null). If we set a high standard, such as convicting only if the jury has "no doubts whatsoever", then we would increase the percentage of innocent people let free (correctly accepting the null), but we would also increase the percentage of guilty people let free (type II errors).
When we conduct a hypothesis test, we need to get a sense of how unusual our observed estimate would be if the hypothesized value were in fact the true value of the parameter.
A reasonable strategy would be to reject the null hypothesis if the sample value were significantly different from the hypothesized value. We let \(\alpha\) be the probability of a type I error; often we set \(\alpha = 0.05\). So \(\alpha\) is the probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct.
If we assume a one-tailed test (for example, testing whether our sample mean is significantly larger than the hypothesized value \(\mu_0\)), we want to see if the \(z\)-score for our observation is large enough. The \(z\)-score is \[z = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}},\] which measures how many standard errors the sample mean is above the hypothesized mean. If this number is larger than 1.645, a result this extreme would occur less than 5% of the time under \(H_0\), so we can reject \(H_0\) at the \(\alpha = 0.05\) level.
To reiterate, the \(z\)-score is how many standard errors the sample estimate is above the hypothesized value. Setting the rule "we reject if our \(z\)-score is larger than 1.645" controls the type I error rate at 5%. The general rule is to reject \(H_0\) if the calculated \(z\)-score is larger than the \(z\) quantile that gives the appropriate type I error rate, \(\alpha\).
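A small worked sketch in R, with made-up numbers for the sample mean, hypothesized mean, standard deviation, and sample size:

```r
# Hypothetical one-sided test of H0: mu = 30 versus Ha: mu > 30
x_bar <- 32; mu0 <- 30; s <- 10; n <- 100
z <- (x_bar - mu0) / (s / sqrt(n))   # z = 2 standard errors above mu0
z > qnorm(0.95)                      # TRUE: reject H0 at alpha = 0.05
```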
Because the type I error rate was controlled to be small, if we reject we know that one of the following occurred: either \(H_0\) is false, or \(H_0\) is true and we observed a relatively rare event (one whose probability under \(H_0\) is at most \(\alpha\)).
Note that if we are using a two-tailed test, we need to adjust the cutoff so that half of \(\alpha\) sits in each tail of the distribution (for example, 1.96 rather than 1.645 when \(\alpha = 0.05\)). Also note that if the sample size is small, the \(t\)-distribution is a better reference than the normal.
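For reference, the relevant quantiles in R (the \(t\) example assumes, for illustration, 15 degrees of freedom):

```r
qnorm(0.95)          # 1.645: one-tailed cutoff at alpha = 0.05
qnorm(0.975)         # 1.960: two-tailed cutoff, alpha/2 = 0.025 per tail
qt(0.975, df = 15)   # 2.131: the analogous small-sample t cutoff
```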
\(P\)-values are the most common measure of statistical significance. The basic idea of a \(P\)-value is to assume that the null hypothesis is true and calculate how unusual it would be to see data (in the form of a test statistic) as extreme as was seen in favor of the alternative hypothesis.
Formally, a \(P\)-value is the probability of observing a test statistic as or more extreme in favor of the alternative than was actually obtained, where the probability is calculated assuming that the null hypothesis is true.
In order to do this, we require a few steps: specify the distribution of a test statistic under the assumption that the null hypothesis is true, calculate that statistic from the observed data, and then compute the probability, under that null distribution, of obtaining a statistic as or more extreme than the one observed.
The way to interpret \(P\)-values is as follows. If the P-value is small, then either \(H_0\) is true and we have observed a rare event or \(H_0\) is false (or possibly the null model is incorrect).
As an example, suppose that you get a \(t\) statistic of 2.5 for 15 degrees of freedom testing \(H_0: \mu = \mu_0\) versus \(H_a: \mu > \mu_0\). What’s the probability of getting a \(t\) statistic as large as 2.5?
```r
pt(2.5, 15, lower.tail = FALSE)
## [1] 0.0122529
```
Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under \(H_0\) is 0.0123. So, assuming our model is correct, either we observed data that was pretty unlikely under the null, or the null hypothesis is false.
If we selected \(\alpha = 0.05\) for a hypothesis test, we would reject \(H_0\) if our observed \(P\)-value was smaller than 0.05. The smallest value of \(\alpha\) at which you would still reject the null hypothesis is called the attained significance level. This is mathematically equivalent to, but philosophically a little different from, the \(P\)-value.
Whereas the \(P\)-value is interpreted in terms of how probabilistically extreme our test statistic is under the null, the attained significance level merely conveys the smallest level of \(\alpha\) at which one could reject. This equivalence makes \(P\)-values very convenient: the reader of the results can perform the test at whatever \(\alpha\) they desire.
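Continuing the earlier worked example, the \(P\)-value of 0.0123 tells the reader immediately which \(\alpha\) levels lead to rejection:

```r
p <- pt(2.5, 15, lower.tail = FALSE)   # 0.0123 from the earlier example
p < 0.05   # TRUE: reject at alpha = 0.05
p < 0.01   # FALSE: cannot reject at alpha = 0.01
```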