Statistics is a discipline that seeks to describe characteristics of populations that vary from one member of a population to the next.

Some definitions

  • A population is the collection of things about which we want to learn.
  • A sample is a subset of the population, ideally a representative one.
  • The sampling unit is the individual being measured.

Some definitions

  • Characteristics that vary from person to person, place to place, time to time, etc. are called variables.
  • A variate is a single measurement of one of these characteristics.
  • A distribution is the collection of a variable's variates.

Some definitions

  • A parameter is a number that describes a characteristic of a population.
    • Usually unknowable
  • An estimate is a statistic, derived from a sample, that estimates a parameter.

\[\mu \approx \bar{x}\]

\[\sigma^2 \approx s^2\]

\[\sigma \approx s\]

What makes a sample representative?

  • Precision: how little random error the estimate has
  • Accuracy: how little bias the estimate has

  • Large samples reduce random error.
  • Random samples reduce bias.
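
The difference between error and bias can be sketched with a simulation. This is a minimal illustration, assuming a hypothetical population of 10,000 heights; the population, the sample sizes, and the "tallest 1,000" sample are all made up for the example.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 heights in inches
population <- rnorm(10000, mean = 64, sd = 3)

# Random samples of two sizes
small_random <- sample(population, 10)
large_random <- sample(population, 1000)

# A non-random sample: only the 1,000 tallest members
biased <- sort(population, decreasing = TRUE)[1:1000]

# Absolute distance of each sample mean from the population mean
errors <- c(
  small_random = abs(mean(small_random) - mean(population)),
  large_random = abs(mean(large_random) - mean(population)),
  biased       = abs(mean(biased) - mean(population))
)
print(round(errors, 2))
```

The biased sample is large, yet its mean stays far from the population mean: sample size does not repair bias. A single small random sample may land close by chance, so the benefit of a large random sample shows up reliably only across repeated draws.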

Data can be

  • Categorical: Discrete, unordered categories
    • e.g. sex
  • Ordinal: Discrete, ordered categories with arbitrary intervals
    • e.g. pain rating (mild, moderate, severe)
  • Interval: Numeric with meaningful intervals, but arbitrary zero
    • e.g. temperature in °C or °F
  • Ratio: Numeric with meaningful intervals, and meaningful zero
    • e.g. height
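
One way these four data types might be represented in R is sketched here. The variable names and values are hypothetical; the point is only that factors carry categories, ordered factors carry rank, and plain numbers carry intervals and ratios.

```r
# Categorical: unordered factor
sex <- factor(c("female", "male", "female"))

# Ordinal: ordered factor; the levels are ranked but not evenly spaced
pain <- factor(c("mild", "severe", "moderate"),
               levels = c("mild", "moderate", "severe"),
               ordered = TRUE)

# Interval: differences are meaningful, but 0 degrees C is not "no temperature"
temp_c <- c(10, 20, 30)

# Ratio: zero means "none", so ratios are meaningful
height_cm <- c(150, 165, 180)

print(pain < "severe")               # order comparisons work for ordered factors
print(diff(temp_c))                  # differences are meaningful for interval data
print(height_cm[3] / height_cm[1])   # ratios are meaningful for ratio data
```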



To describe a variable, we must describe the distribution of variates.

Describing distributions visually

  • Plotting all the variates, \(x_i\), on a line doesn't help us much, so let's spread them out a little.
  • Histograms show the number of variates within each bar's range.
  • Box & whisker plots show the median, the quartiles, whiskers that extend up to 1.5 \(\times\) the interquartile range beyond the quartiles, and outliers.
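
The numbers behind a histogram and a box & whisker plot can be computed without drawing anything. This is a sketch on hypothetical variates; the bar width of 4 is an arbitrary choice, and `boxplot.stats` uses hinges, which can differ slightly from other quartile definitions.

```r
# Hypothetical variates; the last value is deliberately extreme
x <- c(61, 62, 63, 63, 64, 64, 64, 65, 65, 66, 67, 75)

# Histogram: count the variates that fall in each bar's range
h <- hist(x, breaks = seq(60, 76, by = 4), plot = FALSE)
print(h$counts)

# Box & whisker plot: whisker ends, hinges, median, and any outliers
b <- boxplot.stats(x)
print(b$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$out)    # points more than 1.5 x IQR beyond the hinges
```

Calling `hist(x)` or `boxplot(x)` without `plot = FALSE` draws the corresponding figure.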


Describing distributions quantitatively

Statistics of location tell us about a specific place in a distribution.

Statistics of dispersion tell us about the spread of values in a distribution.

Statistics of location

  • Median
    • The middle value when variates are arranged in order
    • Robust towards outliers
  • Mode
    • The most common value (or range of values) of a distribution
    • Good for spotting multi-modal distributions
  • Mean
    • Pool the resource, then distribute it equally \[\mu \approx \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
    • The balance point of a distribution
    • AKA "average"
    • Sensitive to outliers
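
The three statistics of location can be computed on a small, hypothetical set of variates. Base R's `mode()` reports a storage type rather than the statistical mode, so the sketch tabulates the values instead; that helper logic is an illustration, not a standard function.

```r
x <- c(2, 3, 3, 4, 5, 21)  # hypothetical variates; 21 is an outlier

# Mean: pool everything, then divide equally (the formula written out)
x_bar <- sum(x) / length(x)
stopifnot(isTRUE(all.equal(x_bar, mean(x))))

# Median: the middle value of the ordered variates
med <- median(x)

# Mode: tabulate the values and take the most frequent one
counts <- table(x)
mode_value <- as.numeric(names(counts)[which.max(counts)])

print(c(mean = x_bar, median = med, mode = mode_value))
```

The outlier pulls the mean well above the median and the mode, which is the sensitivity noted above.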

Statistics of dispersion

  • Range
    • The difference between the maximum and minimum variates
  • Variance
    • The average squared deviation from the mean \[\sigma^2 \approx s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]
  • Standard Deviation
    • Variance, but transformed back into the measured units \[\sigma \approx s = \sqrt{s^2}\]
  • Quantiles
    • Values below which a certain proportion of variates fall
    • AKA percentiles, when the proportion is given as a percentage
    • Quartiles = 25th, 50th, and 75th percentiles
  • Skewness
    • A measure of how much the tails are pulled one way or another
  • Kurtosis
    • A measure of how flat or peaked a distribution is
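
These statistics can be sketched on the same kind of small, hypothetical data set. Skewness and kurtosis have several competing definitions; the moment-based versions used here are an assumption of this example, not necessarily the ones a given package reports.

```r
x <- c(2, 3, 3, 4, 5, 21)  # hypothetical variates; 21 is an outlier
n <- length(x)

rng <- max(x) - min(x)                           # range
s2  <- sum((x - mean(x))^2) / (n - 1)            # variance, as in the formula
s   <- sqrt(s2)                                  # standard deviation
q   <- quantile(x, probs = c(0.25, 0.50, 0.75))  # quartiles (R's default method)

# Moment-based skewness and kurtosis (one common definition among several)
z    <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
skew <- mean(z^3)  # positive when the right tail is longer
kurt <- mean(z^4)  # about 3 for a normal distribution

print(c(range = rng, variance = s2, sd = s, skewness = skew, kurtosis = kurt))
print(q)
```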



We can use what we learn about distributions to calculate probabilities.

Probability

Statisticians define a random trial as a process with two or more possible outcomes whose occurrence cannot be predicted with certainty.

The sample space is a list of all possible outcomes.

An event is a subset of the sample space.

Probability is the proportion of times an event is expected to occur in a long series of random trials.
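
These definitions can be tried out with a simulated die. This is a sketch under the assumption of a fair six-sided die; the event and the number of trials are arbitrary choices.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

sample_space <- 1:6        # all possible outcomes of one roll
event <- c(2, 4, 6)        # the event "roll an even number"

# Probability when every outcome is equally likely
p_theory <- length(event) / length(sample_space)

# Proportion of trials in which the event occurred, over many random trials
rolls <- sample(sample_space, size = 100000, replace = TRUE)
p_observed <- mean(rolls %in% event)

print(c(theory = p_theory, observed = p_observed))
```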

Exclusivity

Mutually exclusive events cannot occur simultaneously.

  • The possible outcomes of a single die roll are mutually exclusive.
  • Hair color and eye color are not mutually exclusive: a person has both.
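
If events are written as sets of outcomes, exclusivity can be checked by looking for shared outcomes. The helper `mutually_exclusive` is a hypothetical name introduced for this sketch.

```r
# Events for one roll of a die, written as sets of outcomes
rolled_one  <- 1
rolled_two  <- 2
rolled_even <- c(2, 4, 6)
rolled_high <- c(4, 5, 6)

# Two events are mutually exclusive when they share no outcomes
mutually_exclusive <- function(a, b) length(intersect(a, b)) == 0

print(mutually_exclusive(rolled_one, rolled_two))    # no shared outcome
print(mutually_exclusive(rolled_even, rolled_high))  # 4 and 6 are in both
```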

The Normal Distribution

…is used to model probabilities for "normally" distributed variables.

\[f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}\]

The Normal Distribution

Women's heights are approximately normally distributed with \(\bar{x} \approx 64\) inches and \(s \approx 3\) inches, therefore \(f(x, \mu=64, \sigma=3)\) models this distribution.

\[f(x) = \frac{1}{\sqrt{2 \pi \cdot 3^2}}e^{\frac{-(x-64)^2}{2 \cdot 3^2}}\]

We could also say this distribution is approximately \(N(\mu = 64, \sigma = 3)\).
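
The density formula can be typed in directly and compared with R's built-in `dnorm`. The function name `normal_pdf` and the heights chosen are assumptions of this sketch.

```r
mu <- 64
sigma <- 3

# The normal density, written exactly as in the formula
normal_pdf <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

heights  <- c(58, 61, 64, 67, 70)
by_hand  <- normal_pdf(heights, mu, sigma)
built_in <- dnorm(heights, mean = mu, sd = sigma)

print(round(by_hand, 4))
print(all.equal(by_hand, built_in))
```

Note that these values are densities (heights of the curve), not probabilities; probabilities come from areas under the curve.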

The Normal Distribution

We expect about 68% of women's heights to fall within one standard deviation of the mean.

\[\int_{\mu-\sigma}^{\mu+\sigma} \! f(x) \, \mathrm{d}x \approx 0.68\]

The Normal Distribution

We expect about 95% of women's heights to fall within two standard deviations of the mean.

\[\int_{\mu-2\sigma}^{\mu+2\sigma} \! f(x) \, \mathrm{d}x \approx 0.95\]

The Normal Distribution

We expect about 99.7% of women's heights to fall within three standard deviations of the mean.

\[\int_{\mu-3\sigma}^{\mu+3\sigma} \! f(x) \, \mathrm{d}x \approx 0.997\]
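
The three areas can be checked with `pnorm`, which returns the area under the curve to the left of a value. This sketch reuses the heights example (\(\mu = 64\), \(\sigma = 3\)); the variable names are arbitrary.

```r
mu <- 64
sigma <- 3
k <- 1:3  # number of standard deviations on either side of the mean

# Area between mu - k*sigma and mu + k*sigma
within_k_sd <- pnorm(mu + k * sigma, mu, sigma) - pnorm(mu - k * sigma, mu, sigma)
print(round(within_k_sd, 4))
```

The results do not depend on the particular \(\mu\) and \(\sigma\): any normal distribution gives the same three proportions.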

The Normal Distribution

How likely is it that the next woman to walk into our room is shorter than 71 inches?

\[\int_{-\infty}^{71} \! f(x) \, \mathrm{d}x = ?\]

We can ask a computer to calculate that integral for us:

# Area under N(mean = 64, sd = 3) to the left of 71
pnorm(71, mean = 64, sd = 3)
## [1] 0.9901847
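
To see that `pnorm` really is this integral, the density can also be integrated numerically with `integrate`. This is only a cross-check; the wrapper function `f` is introduced for the sketch.

```r
# The density for women's heights, as a function of x only
f <- function(x) dnorm(x, mean = 64, sd = 3)

# Numerical integration from -Inf to 71
area <- integrate(f, lower = -Inf, upper = 71)

print(area$value)
print(pnorm(71, mean = 64, sd = 3))
```

The two values should agree to within the numerical tolerance that `integrate` reports.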

The Normal Distribution

Every woman has a height, i.e. the integral over the entire sample space is 1.

\[\int_{-\infty}^{\infty} \! f(x) \, \mathrm{d}x = 1\]

The Normal Distribution

How likely is it that the next woman to walk into our room is taller than 71 inches?

\[\int_{71}^{\infty} \! f(x) \, \mathrm{d}x = ?\]

Because the integral of the entire sample space is 1, we can get this probability by subtracting from 1:

# Upper tail: 1 minus the area to the left of 71
1 - pnorm(71, mean = 64, sd = 3)
## [1] 0.009815329
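
`pnorm` can also return the upper tail directly through its `lower.tail` argument, which avoids the subtraction. A short sketch with the same numbers:

```r
p_shorter <- pnorm(71, mean = 64, sd = 3)                      # area to the left of 71
p_taller  <- pnorm(71, mean = 64, sd = 3, lower.tail = FALSE)  # area to the right of 71

print(p_taller)
print(p_shorter + p_taller)  # the two tails together cover the whole sample space
```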



Estimates are always uncertain, but the larger your sample, the greater your certainty.

Central Limit Theorem

Repeated sampling of a population, normal or not, results in a sampling distribution of means that is approximately normal, provided each sample is reasonably large.

Estimating uncertainty

The standard deviation of the sampling distribution of means is called the standard error of the mean,

\[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx s_{\bar{x}} = \frac{s}{\sqrt{n}}\]

and measures the precision of \(\bar{x}\).

Larger \(n\) yield smaller \(s_{\bar{x}}\).
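
A simulation can illustrate both the sampling distribution of means and the standard error. This sketch assumes an exponential population (strongly skewed, with mean 1 and standard deviation 1); the sample size and number of repetitions are arbitrary.

```r
set.seed(123)  # fixed seed so the sketch is reproducible
n <- 50

# Draw 10,000 samples of size n from a skewed population and keep each mean
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

# Spread of the sample means, compared with sigma / sqrt(n)
print(sd(sample_means))
print(1 / sqrt(n))

# In practice there is only one sample, so the standard error is estimated as s / sqrt(n)
one_sample <- rexp(n, rate = 1)
se_hat <- sd(one_sample) / sqrt(n)
print(se_hat)
```

Plotting `hist(sample_means)` shows a roughly bell-shaped distribution even though the population itself is far from normal.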

95% Confidence Intervals

If you pull 20 samples and calculate the interval

\[\bar{x}-2 \sigma_{\bar{x}} ~ \mathrm{to} ~ \bar{x}+2 \sigma_{\bar{x}}\]

…for each, about 19 of those 20 intervals will contain the true population mean, on average.

This is called the 95% Confidence Interval, because about 95% of intervals constructed this way contain the true mean.
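
The "19 out of 20" idea can be checked by simulation. This sketch assumes a normal population with \(\mu = 64\) and \(\sigma = 3\) and samples of size 30, and it uses the estimate \(s_{\bar{x}}\) in place of \(\sigma_{\bar{x}}\), as one would with real data.

```r
set.seed(2024)  # fixed seed so the sketch is reproducible
mu <- 64
sigma <- 3
n <- 30

# For each of 10,000 samples, record whether the interval contains the true mean
covers <- replicate(10000, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n)
  lower <- mean(x) - 2 * se
  upper <- mean(x) + 2 * se
  lower <= mu && mu <= upper
})

coverage <- mean(covers)
print(coverage)
```

The coverage should come out close to 0.95, though not exactly, because the multiplier 2 is a rounded value and the standard error is itself estimated.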



What about categorical data?

Categorical data visually

  • Bar Graph
    • Bars should always be measured from zero!
    • Color can be used to emphasize ordered groups
  • Pie Chart
    • Terrible!
    • Don't ever use one!

Categorical data quantitatively

  • Frequency: \(f_j\)
    • The number of individuals, \(f\), in group \(j\)
  • Proportion: \(p_j = \frac{f_j}{n}\)
    • The frequency of individuals in a group relative to the number of individuals in all groups, \(n\)
    • Percent: \(p_j \times 100\%\)
  • Odds: \(\frac{p_j}{1-p_j} = \frac{f_j}{n-f_j}\)
    • The probability of success (being in group \(j\)) relative to the probability of failure (not being in it)


In summary…

Summary

  • Samples are used to estimate population parameters.
  • To describe a variable, we describe the distribution of values that it exhibits.
  • We can do this visually and quantitatively.
  • Statistics of location (mean, median, mode) tell us the "center" of a continuous distribution.
  • Statistics of dispersion (variance, standard deviation, etc.) describe the spread of values around the center.

Summary

  • Standard Errors and Confidence Intervals show the uncertainty of our estimates.
  • Probability is the proportion of times an event is expected to occur in a long series of random trials.
  • Probabilities can be calculated from probability density functions if we know certain characteristics of the distribution (e.g. its mean and standard deviation).
  • Categorical distributions are described using frequencies, proportions, and odds.