Statistics concerns itself mainly with conclusions and predictions resulting from chance outcomes that occur in carefully planned experiments or investigations.
In the finite case, these chance outcomes constitute a subset, or sample, of measurements or observations from a larger set of values called the population. In the continuous case they are usually identically distributed random variables, whose distribution we refer to as the population distribution, or the infinite population sampled.
Not all samples lend themselves to valid generalizations about the populations from which they came. Most methods of inference are based on the assumption that we are dealing with random samples.
If \(X_1, X_2, \ldots, X_n\) are independent and identically distributed random variables, we say they constitute a random sample from the infinite population given by their common distribution.
If \(f(x_1, x_2, \ldots, x_n)\) is the value of the joint distribution, we can write:
\[f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i)\]
where \(f(x_i)\) is the value of the population distribution at \(x_i\).
Statistical inferences are usually based on statistics — random variables that are functions of \(X_1, X_2, \ldots, X_n\).
If \(X_1, X_2, \ldots, X_n\) constitute a random sample, then:
\[\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{(sample mean)}\]
\[S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1} \quad \text{(sample variance)}\]
For observed sample data, we calculate \(\bar{x}\) and \(s^2\); these are values of the corresponding random variables \(\bar{X}\) and \(S^2\).
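As a small illustration of the two definitions above, the following Python sketch computes \(\bar{x}\) and \(s^2\) for a made-up sample. The data values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import statistics

# Hypothetical observed sample (illustrative values, not from the notes).
sample = [4.0, 7.0, 5.0, 8.0, 6.0]
n = len(sample)

# Sample mean: x_bar = (1/n) * sum of the observations.
x_bar = sum(sample) / n

# Sample variance with the n - 1 divisor, matching the definition of S^2.
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# The standard library uses the same n - 1 convention for `variance`.
assert abs(x_bar - statistics.mean(sample)) < 1e-12
assert abs(s2 - statistics.variance(sample)) < 1e-12

print(x_bar, s2)  # 6.0 2.5
```

Note that `statistics.pvariance` divides by \(n\) instead of \(n - 1\), so it does not match the definition of \(S^2\) used here.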
Since statistics are random variables, their values vary from sample to sample. Their distributions are called sampling distributions.
If \(X_1, X_2, \ldots, X_n\) constitute a random sample from an infinite population with mean \(\mu\) and variance \(\sigma^2\), then:
\[E(\bar{X}) = \mu \qquad \text{and} \qquad \text{Var}(\bar{X}) = \frac{\sigma^2}{n}\]
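Proof: The mean follows from the linearity of expectation; the variance uses the independence of the \(X_i\), so that the variance of the sum is the sum of the variances.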
\[E(\bar{X}) = E\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n} \cdot n\mu = \mu\]
\[\text{Var}(\bar{X}) = \text{Var}\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\]
The standard error of the mean is \(\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}\). As \(n\) increases, \(\sigma_{\bar{X}}\) decreases, so the means of larger samples tend to lie closer to \(\mu\).
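The result \(\text{Var}(\bar{X}) = \sigma^2/n\) can be checked numerically. The sketch below is only a simulation under assumed settings: the uniform population on \((0, 1)\), the sample size, the number of repetitions, and the seed are all illustrative choices, not part of the notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 25          # sample size (illustrative choice)
reps = 20000    # number of simulated samples (illustrative choice)

# Population: uniform on (0, 1), so mu = 1/2 and sigma^2 = 1/12.
mu = 0.5
sigma2 = 1.0 / 12.0

# Draw `reps` samples of size n and record each sample mean.
means = [
    sum(random.random() for _ in range(n)) / n
    for _ in range(reps)
]

mean_of_means = statistics.mean(means)
var_of_means = statistics.pvariance(means)

print("average of sample means:", mean_of_means, "theory:", mu)
print("variance of sample means:", var_of_means, "theory:", sigma2 / n)
```

The simulated values should land near \(\mu\) and \(\sigma^2/n\), with some Monte Carlo noise that shrinks as `reps` grows.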
For any positive constant \(c\), the probability that \(\bar{X}\) falls between \(\mu - c\) and \(\mu + c\) is at least:
\[1 - \frac{\sigma^2}{nc^2}\]
As \(n \to \infty\), this probability approaches 1.
Proof:
From Chebyshev’s theorem, for any random variable with mean \(\mu\) and standard deviation \(\sigma\), and any \(k > 0\):
\[P(|X - \mu| < k\sigma) \geq 1 - \frac{1}{k^2}\]
Applying this to \(\bar{X}\) (which has standard deviation \(\sigma/\sqrt{n}\)), set \(k\sigma_{\bar{X}} = c\), so \(k = \dfrac{c\sqrt{n}}{\sigma}\):
\[P(|\bar{X} - \mu| < c) \geq 1 - \frac{\sigma^2}{nc^2}\]
This result is known as the Law of Large Numbers. \(\blacksquare\)
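To see how the Chebyshev bound behaves as \(n\) grows, here is a minimal Python sketch. The function name `chebyshev_lower_bound` and the values \(\sigma = 15\), \(c = 5\) are assumptions made for illustration.

```python
def chebyshev_lower_bound(sigma, n, c):
    """Lower bound 1 - sigma^2 / (n c^2) on P(|X_bar - mu| < c)."""
    return 1.0 - sigma ** 2 / (n * c ** 2)

# Illustrative values (assumed): sigma = 15, tolerance c = 5.
sigma, c = 15.0, 5.0
bounds = {n: chebyshev_lower_bound(sigma, n, c) for n in (9, 36, 144, 576)}

for n, bound in bounds.items():
    print(n, bound)
```

The bound is uninformative for small \(n\) (it equals 0 at \(n = 9\) here) and climbs toward 1 as \(n\) increases. It is a guaranteed lower bound, not an estimate of the actual probability, which is typically much higher.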
If \(X_1, X_2, \ldots, X_n\) constitute a random sample from an infinite population with mean \(\mu\), variance \(\sigma^2\), and moment-generating function \(M_X(t)\), then the limiting distribution of:
\[Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\]
as \(n \to \infty\) is the standard normal distribution.
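Practical rule: The CLT approximation is commonly used when \(n \geq 30\), regardless of the shape of the population. This is a rule of thumb rather than a guarantee; heavily skewed populations may need larger samples.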
Important note: The CLT does not say the distribution of \(\bar{X}\) becomes normal (since \(\text{Var}(\bar{X}) \to 0\)). It justifies approximating \(\bar{X}\) with a normal having mean \(\mu\) and variance \(\sigma^2/n\) when \(n\) is large.
A vending machine dispenses drinks with mean \(\mu = 200\,\text{ml}\) and standard deviation \(\sigma = 15\,\text{ml}\). Find \(P(\bar{X} \geq 204)\) for \(n = 36\).
Solution:
\[\sigma_{\bar{X}} = \frac{15}{\sqrt{36}} = 2.5\]
\[P(\bar{X} \geq 204) = P\!\left(Z \geq \frac{204 - 200}{2.5}\right) = P(Z \geq 1.6)\]
\[= 1 - P(Z \leq 1.6) = 1 - 0.9452 = \boxed{0.0548}\]
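The same calculation can be reproduced in Python without tables. This is a sketch using `statistics.NormalDist` from the standard library (Python 3.8 or later); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 200.0, 15.0, 36

# Standard error of the mean: sigma / sqrt(n).
se = sigma / sqrt(n)

# Standardize the threshold and use the standard normal cdf.
z = (204.0 - mu) / se
p = 1.0 - NormalDist().cdf(z)

print(se, z, round(p, 4))
```

The result should agree with the table-based answer to four decimal places.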
A random sample of size \(n = 72\) is taken from:
\[f(x) = \frac{1}{16}x\,e^{-x/4}, \quad x > 0\]
Use the CLT to find \(P(\bar{X} > 9)\).
Solution:
Identifying this as a Gamma distribution with \(\alpha = 2\), \(\beta = 4\):
\[E(X) = \alpha\beta = 8, \qquad \text{Var}(X) = \alpha\beta^2 = 32\]
\[P(\bar{X} > 9) = P\!\left(Z > \frac{9 - 8}{\sqrt{32/72}}\right) = P(Z > 1.5) = 1 - 0.9332 = \boxed{0.0668}\]
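One way to judge how good the CLT approximation is here is to compare it with a simulation. The sketch below is an illustration under assumed settings (the seed and number of repetitions are arbitrary); it relies on `random.gammavariate(alpha, beta)`, which treats `beta` as a scale parameter and so matches the density in this example.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

alpha, beta, n = 2.0, 4.0, 72
reps = 10000  # number of simulated samples (illustrative choice)

# CLT approximation from the worked solution.
mean = alpha * beta
var = alpha * beta ** 2
z = (9.0 - mean) / sqrt(var / n)
p_clt = 1.0 - NormalDist().cdf(z)

# Monte Carlo estimate of P(X_bar > 9) from simulated Gamma samples.
hits = 0
for _ in range(reps):
    x_bar = sum(random.gammavariate(alpha, beta) for _ in range(n)) / n
    if x_bar > 9.0:
        hits += 1
p_sim = hits / reps

print("CLT approximation:", round(p_clt, 4))
print("simulated estimate:", p_sim)
```

Because the Gamma population is right-skewed, the simulated tail probability may come out slightly above the CLT value; the two should still be close for \(n = 72\).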
If \(\bar{X}\) is the mean of a random sample of size \(n\) from a normal population with mean \(\mu\) and variance \(\sigma^2\), then the exact sampling distribution is:
\[\bar{X} \sim N\!\left(\mu,\; \frac{\sigma^2}{n}\right)\]
(This holds for any \(n\), without the CLT approximation.)
A random sample of size \(n = 100\) is taken from a population with \(\mu = 75\) and \(\sigma^2 = 256\).
A random sample of size \(n = 81\) is taken from a population with \(\mu = 128\) and \(\sigma = 6.3\). Find \(P(\bar{X} \notin (126.6,\; 129.4))\) using:
A random sample of size 64 from a normal population with \(\mu = 51.4\), \(\sigma = 6.8\). Find:
\(n = 225\) from an exponential population with \(\theta = 4\). Find \(P(\bar{X} > 4.5)\) using the CLT.
\(n = 200\) from a uniform population with \(\alpha = 24\), \(\beta = 48\). Find \(P(\bar{X} < 35)\). [0.0207]
\(n = 100\) from a normal population with \(\sigma = 25\). Find \(P(|\bar{X} - \mu| \geq 3)\). [0.2302]
Let \(\bar{X}\) be the mean of a random sample of size 100 from a distribution with \(\sigma^2 = 50\). Find approximately \(P(49 < \bar{X} < 51)\).
Let \(f(x) = \frac{1}{x^2}\) for \(x \geq 1\). For \(n = 72\), find approximately the probability that more than 50 items are less than 3. [0.267]
\(\bar{X}\) is the mean of a random sample of size 128 from a Gamma\((\alpha=2, \beta=4)\) distribution. Find approximately \(P(7 < \bar{X} < 9)\). [0.954]
Find the approximate probability that the mean of a sample of size 15 from \(f(x) = 3x^2\), \(0 < x < 1\), lies between \(\frac{3}{5}\) and \(\frac{4}{5}\). [0.840]
A random variable \(X\) has the Chi-square distribution with \(r\) degrees of freedom if its pdf is:
\[f(x) = \frac{1}{2^{r/2}\,\Gamma(r/2)}\,x^{r/2 - 1}\,e^{-x/2}, \quad x > 0\]
We write \(X \sim \chi^2_r\) (or \(X\) is \(\chi^2_r\)).
\[M_X(t) = (1 - 2t)^{-r/2}, \quad t < \frac{1}{2}\]
Proof sketch: Substituting \(y = x(1-2t)/2\) into the integral and using the Gamma function identity \(\int_0^\infty y^{r/2-1}e^{-y}\,dy = \Gamma(r/2)\) gives the result above.
Differentiating \(M_X(t) = (1-2t)^{-r/2}\):
\[E(X) = r \qquad \text{and} \qquad \text{Var}(X) = 2r\]
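The differentiation step can be written out explicitly. The following is a routine calculation from the MGF stated above, with both derivatives evaluated at \(t = 0\):

\[M_X'(t) = r(1-2t)^{-r/2-1}, \qquad M_X''(t) = r(r+2)(1-2t)^{-r/2-2}\]
\[E(X) = M_X'(0) = r, \qquad E(X^2) = M_X''(0) = r(r+2), \qquad \text{Var}(X) = r(r+2) - r^2 = 2r\]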
If \(Z \sim N(0,1)\), then \(Z^2 \sim \chi^2_1\).
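Proof: For \(t < \frac{1}{2}\), \[M_{Z^2}(t) = E\!\left(e^{tZ^2}\right) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-(1-2t)z^2/2}\,dz = (1-2t)^{-1/2},\] which is the MGF of \(\chi^2_1\). \(\blacksquare\)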
If \(X_1, X_2, \ldots, X_n\) are independent \(N(0,1)\) random variables, then:
\[Y = \sum_{i=1}^{n} X_i^2 \sim \chi^2_n\]
Proof: By Theorem 5 each \(X_i^2 \sim \chi^2_1\), and since they are independent:
\[M_Y(t) = \prod_{i=1}^n (1-2t)^{-1/2} = (1-2t)^{-n/2}\]
which is the MGF of \(\chi^2_n\). \(\blacksquare\)
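A quick simulation can make this result concrete. The sketch below sums squared standard normal draws and compares the empirical mean and variance with the chi-square values \(n\) and \(2n\). The choices of \(n\), the number of repetitions, and the seed are illustrative assumptions.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

n = 5          # number of squared standard normals (illustrative choice)
reps = 50000   # number of simulated values of Y (illustrative choice)

# Each simulated Y is a sum of n squared independent N(0, 1) draws.
ys = [
    sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
    for _ in range(reps)
]

y_mean = statistics.mean(ys)
y_var = statistics.pvariance(ys)

print("mean:", y_mean, "theory:", n)
print("variance:", y_var, "theory:", 2 * n)
```

Matching the first two moments does not prove the distributional claim, but it is a useful sanity check on the theorem.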
If \(X_1, \ldots, X_n\) are independent with \(X_i \sim \chi^2_{r_i}\), then:
\[Y = \sum_{i=1}^n X_i \sim \chi^2_{r_1 + r_2 + \cdots + r_n}\]
If \(X_1\) and \(X_2\) are independent, \(X_1 \sim \chi^2_{r_1}\), and \(X_1 + X_2 \sim \chi^2_{r_1+r}\), then \(X_2 \sim \chi^2_r\).
If \(\bar{X}\) and \(S^2\) are the mean and variance of a random sample of size \(n\) from \(N(\mu, \sigma^2)\), then:
(a) \(\bar{X}\) and \(S^2\) are independent.
(b) \(\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\)
Proof of (b): Using the identity:
\[\sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2} + \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2\]
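The left side is \(\chi^2_n\) (by Theorem 6, since each \((X_i - \mu)/\sigma\) is \(N(0,1)\)). The second term on the right is \(\chi^2_1\) (by Theorems 4 and 5). By part (a), the two terms on the right are independent, so Theorem 8 gives \(\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\). \(\blacksquare\)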
A manufacturing process is “in control” if \(\sigma \leq 0.60\) thousandths of an inch. For \(n = 20\), the process is declared “out of control” if \(\dfrac{(n-1)S^2}{\sigma^2} \geq \chi^2_{0.01,\,19} = 36.191\).
With \(S = 0.84\), \(\sigma = 0.60\):
\[\frac{(n-1)S^2}{\sigma^2} = \frac{19 \times (0.84)^2}{(0.60)^2} = 37.24 > 36.191\]
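Conclusion: Since the statistic exceeds the critical value, we reject \(H_0\) (the process is in control, \(\sigma = 0.60\)) at the 0.01 level; the process is declared out of control.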
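The decision rule in this example reduces to one line of arithmetic. The Python sketch below recomputes the statistic and applies the rule; the critical value is copied from the example rather than computed, and the variable names are illustrative.

```python
n = 20
s = 0.84           # observed sample standard deviation
sigma0 = 0.60      # in-control standard deviation
critical = 36.191  # chi-square critical value quoted in the example

# Test statistic (n - 1) S^2 / sigma^2.
statistic = (n - 1) * s ** 2 / sigma0 ** 2

# Declare "out of control" when the statistic reaches the critical value.
out_of_control = statistic >= critical

print(round(statistic, 2), out_of_control)
```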
Although \(Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)\) is elegant, in practice \(\sigma\) is usually unknown and must be replaced by the sample standard deviation \(S\).
If \(Y \sim \chi^2_r\) and \(Z \sim N(0,1)\) are independent, then:
\[T = \frac{Z}{\sqrt{Y/r}}\]
has the t-distribution with \(r\) degrees of freedom, with pdf:
\[f(t) = \frac{\Gamma\!\left(\frac{r+1}{2}\right)}{\sqrt{r\pi}\;\Gamma\!\left(\frac{r}{2}\right)} \left(1 + \frac{t^2}{r}\right)^{-(r+1)/2}, \quad -\infty < t < \infty\]
The distribution was originally introduced by W.S. Gosset, who published under the pen-name “Student” because his employer, a brewery, did not permit employee publications. It is therefore also known as Student’s t-distribution.
If \(\bar{X}\) and \(S^2\) are the mean and variance of a random sample of size \(n\) from \(N(\mu, \sigma^2)\), then:
\[T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}\]
Proof: By Theorem 4, \(Z = \dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)\); by Theorem 9, \(Y = \dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\), and \(Z\) and \(Y\) are independent. Substituting into Theorem 10:
\[T = \frac{Z}{\sqrt{Y/(n-1)}} = \frac{(\bar{X}-\mu)/(\sigma/\sqrt{n})}{\sqrt{(n-1)S^2/[\sigma^2(n-1)]}} = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1} \quad \blacksquare\]
In 16 one-hour test runs, an engine averaged \(\bar{x} = 16.4\) gallons with \(s = 2.1\) gallons. Test the claim that \(\mu = 12.0\) gallons per hour.
Solution:
\[t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{16.4 - 12.0}{2.1/\sqrt{16}} = \frac{4.4}{0.525} = 8.38\]
From t-tables: \(t_{0.005,\,15} = 2.947\).
Since \(8.38 > 2.947\), a value this large would be very unlikely if \(\mu = 12.0\), so we reject the claim. The data suggest that the true average consumption exceeds 12.0 gallons per hour.
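The test statistic in this example can be recomputed directly. In the sketch below the critical value is taken from the t-table as quoted, not computed, and the variable names are illustrative.

```python
from math import sqrt

n = 16
x_bar = 16.4    # observed sample mean (gallons)
s = 2.1         # observed sample standard deviation (gallons)
mu0 = 12.0      # claimed mean (gallons per hour)
t_crit = 2.947  # t_{0.005, 15} as quoted from the table

# One-sample t statistic: (x_bar - mu0) / (s / sqrt(n)).
t = (x_bar - mu0) / (s / sqrt(n))

# Compare the statistic with the tabulated critical value.
reject = abs(t) > t_crit

print(round(t, 2), reject)
```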
The F-distribution is named after Sir Ronald A. Fisher, one of the most prominent statisticians of the 20th century. It was originally studied as the sampling distribution of the ratio of two independent chi-square random variables, each divided by its degrees of freedom.
If \(U \sim \chi^2_{n_1}\) and \(V \sim \chi^2_{n_2}\) are independent, then:
\[F = \frac{U/n_1}{V/n_2}\]
has an F-distribution with \(n_1\) and \(n_2\) degrees of freedom, with pdf:
\[g(f) = \frac{\Gamma\!\left(\frac{n_1+n_2}{2}\right)}{\Gamma\!\left(\frac{n_1}{2}\right)\Gamma\!\left(\frac{n_2}{2}\right)} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{f^{n_1/2 - 1}}{\left(1 + \dfrac{n_1}{n_2}f\right)^{(n_1+n_2)/2}}, \quad f > 0\]
Proof: Apply the change-of-variable \(f = \dfrac{un_2}{vn_1}\) to the joint density of \(U\) and \(V\), then integrate out \(v\) using the substitution \(w = v\!\left(1 + \dfrac{n_1}{n_2}f\right)/2\). \(\blacksquare\)
Let \(X \sim F(m, n)\). Then:
\[E(X) = \frac{n}{n-2}, \quad n > 2\]
\[\text{Var}(X) = \frac{2n^2(m+n-2)}{m(n-2)^2(n-4)}, \quad n > 4\]
Proof sketch:
Write \(X = \dfrac{U/m}{V/n}\) where \(U \sim \chi^2_m\) and \(V \sim \chi^2_n\) are independent.
\[E(X) = \frac{n}{m} E(U) \cdot E\!\left(\frac{1}{V}\right) = \frac{n}{m} \cdot m \cdot \frac{1}{n-2} = \frac{n}{n-2}\]
since \(E(1/V) = 1/(n-2)\) for \(V \sim \chi^2_n\) (obtained by substitution in the integral). Similarly, \(E(1/V^2) = 1/[(n-2)(n-4)]\) for \(n > 4\), which yields the variance formula. \(\blacksquare\)
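The mean formula can be checked by simulating the ratio in the proof sketch. The Python sketch below is a minimal illustration: the degrees of freedom, seed, and number of repetitions are assumed values, and the helper `chi_square` is a hypothetical name. It uses the fact that the \(\chi^2_r\) density is a Gamma density with shape \(r/2\) and scale 2.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

m, n = 4, 12     # degrees of freedom (illustrative choice, n > 4)
reps = 100000    # number of simulated values (illustrative choice)

def chi_square(r):
    # A chi-square with r degrees of freedom is Gamma(shape=r/2, scale=2).
    return random.gammavariate(r / 2.0, 2.0)

# X = (U/m) / (V/n) with U and V independent chi-square variables.
xs = [(chi_square(m) / m) / (chi_square(n) / n) for _ in range(reps)]

x_mean = statistics.mean(xs)
x_var = statistics.pvariance(xs)

mean_theory = n / (n - 2)
var_theory = 2 * n ** 2 * (m + n - 2) / (m * (n - 2) ** 2 * (n - 4))

print("mean:", x_mean, "theory:", mean_theory)
print("variance:", x_var, "theory:", var_theory)
```

The F-distribution has a heavy right tail, so the simulated variance usually converges more slowly than the simulated mean.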
The F-distribution arises naturally when comparing variances \(\sigma_1^2\) and \(\sigma_2^2\) of two normal populations.
If \(S_1^2\) and \(S_2^2\) are the variances of independent random samples of sizes \(n_1\) and \(n_2\) from \(N(\mu_1, \sigma_1^2)\) and \(N(\mu_2, \sigma_2^2)\) respectively, then:
\[F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F(n_1-1,\; n_2-1)\]
Proof: By Theorem 9, \(\dfrac{(n_1-1)S_1^2}{\sigma_1^2} \sim \chi^2_{n_1-1}\) and \(\dfrac{(n_2-1)S_2^2}{\sigma_2^2} \sim \chi^2_{n_2-1}\), independently. Substituting into Theorem 12 gives the result. \(\blacksquare\)
The F-table gives critical values \(f_\alpha(n_1, n_2)\) such that \(P(F > f_\alpha) = \alpha\) for specified degrees of freedom.
Use Theorem 9 to show that for random samples of size \(n\) from \(N(\mu, \sigma^2)\), the sampling distribution of \(S^2\) has mean \(\sigma^2\) and variance \(\dfrac{2\sigma^4}{n-1}\).
Show that if \(X_1, X_2, \ldots, X_n\) are independent \(\chi^2_1\) and \(Y_n = X_1 + X_2 + \cdots + X_n\), then the limiting distribution of: \[Z_n = \frac{Y_n - n}{\sqrt{2n}}\] is \(N(0,1)\) as \(n \to \infty\).
Using Exercise 2, show that if \(X \sim \chi^2_n\) with large \(n\), then: \[\frac{X - n}{\sqrt{2n}} \approx N(0,1)\]
Use Exercise 3 to find the approximate probability that a \(\chi^2_{50}\) random variable exceeds 68.0.
Show that for \(n > 2\), the variance of the \(t\)-distribution with \(n\) degrees of freedom is \(\dfrac{n}{n-2}\). (Hint: make the substitution \(1 + \dfrac{t^2}{n} = \dfrac{1}{u}\).)
Verify that if \(T \sim t_n\), then \(T^2 \sim F(1, n)\).
Verify that if \(X \sim F(n_1, n_2)\) and \(n_2 \to \infty\), the distribution of \(Y = n_1 X\) approaches \(\chi^2_{n_1}\).
If \(X \sim F(n_1, n_2)\), show that \(Y = \dfrac{1}{X} \sim F(n_2, n_1)\).
Verify that if \(Y\) has a Beta distribution with \(\alpha = \dfrac{n_1}{2}\) and \(\beta = \dfrac{n_2}{2}\), then: \[X = \frac{n_2 Y}{n_1(1-Y)} \sim F(n_1, n_2)\]
End of STA227 Weeks 7–9 Notes