Statistics is a discipline that seeks to describe characteristics of populations that vary from one member of a population to the next.
\[\mu \approx \bar{x}\]
\[\sigma^2 \approx s^2\]
\[\sigma \approx s\]
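A minimal sketch of these three estimates in R. The population below is simulated, and its size, mean, and standard deviation are assumptions chosen only for illustration.

```r
# Hypothetical population: 100,000 values with mu = 64 and sigma = 3 (assumed).
set.seed(1)
population <- rnorm(100000, mean = 64, sd = 3)

# A simple random sample of 50 members of that population.
heights <- sample(population, size = 50)

x_bar <- mean(heights)  # sample mean, estimates mu
s2    <- var(heights)   # sample variance (n - 1 denominator), estimates sigma^2
s     <- sd(heights)    # sample standard deviation, estimates sigma

c(x_bar = x_bar, s2 = s2, s = s)
```

The estimates should land near 64, 9, and 3, but they will not match exactly, which is why the relations are written with \(\approx\).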
Accuracy: bias
Random samples minimize bias.
To describe a variable, we must describe the distribution of variates.
Statistics of location tell us about a specific place in a distribution.
Statistics of dispersion tell us about the spread of values in a distribution.
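As a rough illustration, base R has functions for both kinds of statistic. The ten heights below are made-up values, not data from this deck.

```r
# Small made-up sample of heights in inches (illustrative values only).
heights <- c(59, 61, 62, 63, 64, 64, 65, 66, 68, 71)

# Statistics of location: where in the distribution are we?
mean(heights)
median(heights)
quantile(heights, 0.25)

# Statistics of dispersion: how spread out are the values?
range(heights)
var(heights)
sd(heights)
IQR(heights)
```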
We can use what we learn about distributions to calculate probabilities.
Statisticians define a random trial as a process with two or more possible outcomes whose result cannot be predicted in advance.
The sample space is a list of all possible outcomes.
An event is one possible subset of the sample space.
Probability is the proportion of times an event is likely to occur in a series of random trials.
Mutually exclusive events cannot occur simultaneously.
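These definitions can be tried out with a simulated die. This is only a sketch: the fair six-sided die, the two events, and the number of trials are all assumptions made for the example.

```r
set.seed(42)

# Sample space for one roll of a fair six-sided die (assumed example).
sample_space <- 1:6

# 10,000 random trials.
rolls <- sample(sample_space, size = 10000, replace = TRUE)

# Event A: the roll is even. Event B: the roll is a 1.
# A and B are mutually exclusive: no single roll can satisfy both.
p_even <- mean(rolls %% 2 == 0)
p_one  <- mean(rolls == 1)

# Proportion of trials in which either event occurred.
p_either <- mean(rolls %% 2 == 0 | rolls == 1)

c(p_even = p_even, p_one = p_one, p_either = p_either)
```

Because the two events never occur together, the proportion for "either" equals the sum of the two separate proportions.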
…is used to model probabilities for "normally" distributed variables.
\[f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}\]
Women's heights are approximately normally distributed with \(\bar{x} \approx 64\) inches and \(s \approx 3\) inches, so \(f(x; \mu=64, \sigma=3)\) models this distribution.
\[f(x) = \frac{1}{\sqrt{2 \pi \cdot 3^2}}e^{\frac{-(x-64)^2}{2 \cdot 3^2}}\]
We could also say this distribution is approximately \(N(64, 3)\), written here as \(N(\mu, \sigma)\).
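To check that the density formula is what R's `dnorm()` computes, one can write it out by hand and compare. The helper name `normal_density` and the evaluation points are hypothetical, chosen for this sketch.

```r
# Normal density written out term by term from the formula for f(x).
# `normal_density` is a hypothetical helper name used only in this sketch.
normal_density <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# A few heights (inches) at which to evaluate the density.
x <- c(58, 61, 64, 67, 70)

by_hand  <- normal_density(x, mu = 64, sigma = 3)
built_in <- dnorm(x, mean = 64, sd = 3)

# TRUE when the two agree up to floating-point tolerance.
all.equal(by_hand, built_in)
```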
We expect 68% of the women to fall within one standard deviation from the mean.
\[\int_{\mu-\sigma}^{\mu+\sigma} \! f(x) \, \mathrm{d}x = 0.68\]
We expect 95% of the women to fall within two standard deviations from the mean.
\[\int_{\mu-2\sigma}^{\mu+2\sigma} \! f(x) \, \mathrm{d}x = 0.95\]
We expect 99.7% of the women to fall within three standard deviations from the mean.
\[\int_{\mu-3\sigma}^{\mu+3\sigma} \! f(x) \, \mathrm{d}x = 0.997\]
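These three integrals can be checked with `pnorm()`, which returns the area under the normal curve to the left of a value. A short sketch using the height example's \(\mu = 64\) and \(\sigma = 3\):

```r
mu <- 64
sigma <- 3

# Area between mu - k*sigma and mu + k*sigma for k = 1, 2, 3:
# the area left of the upper limit minus the area left of the lower limit.
k <- 1:3
within_k <- pnorm(mu + k * sigma, mu, sigma) - pnorm(mu - k * sigma, mu, sigma)

round(within_k, 4)
```

The results are about 0.6827, 0.9545, and 0.9973, so 68%, 95%, and 99.7% are rounded values.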
How likely is it that the next woman to walk into our room is shorter than 71 inches?
\[\int_{-\infty}^{71} \! f(x) \, \mathrm{d}x = ?\]
We can ask a computer to calculate that integral for us:
pnorm(71, 64, 3)
## [1] 0.9901847
Every woman has some height, so the integral over the entire sample space is 1.
\[\int_{-\infty}^{\infty} \! f(x) \, \mathrm{d}x = 1\]
How likely is it that the next woman to walk into our room is taller than 71 inches?
\[\int_{71}^{\infty} \! f(x) \, \mathrm{d}x = ?\]
Because the integral of the entire sample space is 1, we can get this probability by subtracting from 1:
1-pnorm(71, 64, 3)
## [1] 0.009815329
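As an alternative sketch, `pnorm()` also takes a `lower.tail` argument, so the upper tail can be requested directly instead of subtracting from 1. The variable names here are illustrative.

```r
# Area to the left of 71: P(height < 71).
p_shorter <- pnorm(71, mean = 64, sd = 3)

# Area to the right of 71: P(height > 71), requested directly.
p_taller <- pnorm(71, mean = 64, sd = 3, lower.tail = FALSE)

# The two tails together cover the whole sample space.
p_shorter + p_taller
```

Asking for the upper tail directly gives the same answer here and avoids rounding error when the tail probability is extremely small.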
Estimates are always uncertain, but the larger your sample, the greater your certainty.
Repeated sampling of a population, normal or not, results in a sampling distribution of means that is approximately normal, provided the samples are reasonably large.
The \(\sigma\) of this distribution is called the standard error of the mean,
\[\sigma_{\bar{x}} \approx s_{\bar{x}} = \frac{s}{\sqrt{n}}\]
and measures the precision of \(\bar{x}\).
A larger \(n\) yields a smaller \(s_{\bar{x}}\).
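A small simulation can make the standard error concrete. This sketch assumes a deliberately non-normal population (exponential with mean 1 and standard deviation 1). The sample size and number of repetitions are arbitrary choices.

```r
set.seed(123)

n <- 40            # size of each sample (assumed)
n_samples <- 5000  # number of repeated samples (assumed)

# Means of repeated samples from an exponential population (mean 1, sd 1).
sample_means <- replicate(n_samples, mean(rexp(n, rate = 1)))

# Spread of the sampling distribution of means...
sd_of_means <- sd(sample_means)

# ...compared with sigma / sqrt(n), where sigma = 1 for this population.
theoretical_se <- 1 / sqrt(n)

# Standard error estimated from one sample alone: s / sqrt(n).
one_sample <- rexp(n, rate = 1)
estimated_se <- sd(one_sample) / sqrt(n)

c(sd_of_means = sd_of_means,
  theoretical_se = theoretical_se,
  estimated_se = estimated_se)
```

Calling `hist(sample_means)` afterwards shows a roughly bell-shaped histogram even though the population itself is strongly skewed.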
If you pull 20 samples and calculate the interval
\[\bar{x}-2 \sigma_{\bar{x}} ~ \mathrm{to} ~ \bar{x}+2 \sigma_{\bar{x}}\]
…for each, 19 of those intervals will contain the true population mean…usually.
This is called the 95% confidence interval: about 95% of intervals built this way contain the true mean, which is what we mean by being 95% confident.
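The "19 out of 20, usually" claim can be explored by simulation. This is a sketch under assumed values: a normal population with \(\mu = 64\) and \(\sigma = 3\), samples of size 50, and a hypothetical helper named `count_covering`.

```r
set.seed(2024)

# Draw `n_samples` samples, build x_bar +/- 2 standard errors for each,
# and count how many intervals contain the true mean.
# `count_covering` is a hypothetical helper written for this sketch.
count_covering <- function(n_samples, n = 50, mu = 64, sigma = 3) {
  covers <- replicate(n_samples, {
    x  <- rnorm(n, mean = mu, sd = sigma)
    se <- sd(x) / sqrt(n)  # standard error estimated from the sample
    (mean(x) - 2 * se <= mu) && (mu <= mean(x) + 2 * se)
  })
  sum(covers)
}

# Twenty samples: often 19 intervals cover the mean, but not every time.
hits_20 <- count_covering(20)

# Many samples: the covering proportion settles near 0.95.
coverage_many <- count_covering(10000) / 10000

c(hits_20 = hits_20, coverage_many = coverage_many)
```

Rerunning with a different seed changes the count for 20 samples, which is the reason for the "usually".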
What about categorical data?
In summary…