Fundamentals of Data Analysis

Statistics is a discipline that seeks to describe characteristics of populations that vary from one member of a population to the next.

Some definitions

A population is the collection of things about which we want to learn.
A sample is a subset of the population; ideally, it is representative of that population.
The sampling unit is the individual being measured.

Some definitions

Characteristics that vary from person to person, place to place, time to time, etc. are called variables.
A variate is a single measurement of one of these characteristics.
A distribution is the collection of a variable's variates.

Some definitions

A parameter is a quantity that describes a characteristic of a population.
- Usually unknowable
An estimate is a statistic, derived from a sample, that estimates a parameter.

\[\mu \approx \bar{x}\]

\[\sigma^2 \approx s^2\]

\[\sigma \approx s\]

What makes a sample representative?

Precision: error
Accuracy: bias
Large samples reduce random error, which improves precision.
Random samples reduce bias, which improves accuracy.
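
A small simulation can make the link between sample size and error concrete. The sketch below assumes a hypothetical population of 100,000 heights; the mean of 64, the standard deviation of 3, and the names `population`, `small_means`, and `large_means` are illustrative choices, not part of the original material. It draws many random samples at two sizes and compares how widely the sample means scatter.

```r
set.seed(42)  # make the simulation reproducible

# Hypothetical population: 100,000 heights (illustrative values only)
population <- rnorm(100000, mean = 64, sd = 3)

# Means of 1,000 random samples of size 10 and of size 1,000
small_means <- replicate(1000, mean(sample(population, 10)))
large_means <- replicate(1000, mean(sample(population, 1000)))

# The spread of the sample means reflects the error of the estimate
sd(small_means)  # wider spread: less precise
sd(large_means)  # narrower spread: more precise
```

Because every sample is drawn at random, both sets of means center on the population mean; only their spread differs.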

Data can be

Categorical: Discrete, unordered categories
- e.g. sex
Ordinal: Discrete, ordered categories with arbitrary intervals
- e.g. pain
Interval: Numeric with meaningful intervals, but arbitrary zero
- e.g. temperature
Ratio: Numeric with meaningful intervals, and meaningful zero
- e.g. height
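
As a rough illustration, the four data types map onto R objects as sketched below. The variable names and values are hypothetical, and storing interval and ratio data as plain numeric vectors is a convention rather than something R enforces.

```r
# Categorical: unordered factor
sex <- factor(c("female", "male", "female"))

# Ordinal: ordered factor; the order is meaningful, the spacing is not
pain <- factor(c("mild", "severe", "moderate"),
               levels = c("mild", "moderate", "severe"),
               ordered = TRUE)
pain[1] < pain[2]  # comparisons are allowed for ordered factors

# Interval: differences are meaningful, ratios are not (0 degrees C is arbitrary)
temperature_c <- c(10, 20, 30)
diff(temperature_c)

# Ratio: a true zero makes ratios meaningful
height_cm <- c(150, 165, 180)
height_cm[3] / height_cm[1]
```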

To describe a variable, we must describe the distribution of variates.

Describing distributions visually

Plotting all the variates, \(x_i\), on a line doesn't help us much, so let's spread them out a little.
Histograms show the number of variates within each bar's range.
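Box & whisker plots show the median, the quartiles (the box), whiskers reaching to the most extreme variates within 1.5 \(\times\) the interquartile range of the box, and any outliers beyond the whiskers.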
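
The three views described above can be sketched in base R as follows. The data are simulated and purely illustrative; `x` and `h` are hypothetical names.

```r
set.seed(1)
x <- rnorm(200, mean = 64, sd = 3)  # hypothetical variates

# All variates on a line, jittered vertically to spread them out
stripchart(x, method = "jitter", pch = 16)

# Histogram: counts of variates within each bar's range
h <- hist(x, main = "Histogram", xlab = "x")
h$counts

# Box & whisker plot; boxplot.stats() returns the five numbers it draws
# (lower whisker, lower hinge, median, upper hinge, upper whisker)
boxplot(x, horizontal = TRUE)
boxplot.stats(x)$stats
```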

Describing distributions quantitatively

Statistics of location tell us about a specific place in a distribution.

Statistics of dispersion tell us about the spread of values in a distribution.

Statistics of location

Median
- The middle value when variates are arranged in order
- Robust towards outliers
Mode
Mean

Statistics of location

Median
Mode
- The most common value (or range of values) of a distribution
- Good for spotting multi-modal distributions
Mean

Statistics of location

Median
Mode
Mean
- Pool the resource, then distribute it equally \[\mu \approx \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
- The balance point of a distribution
- AKA "average"
- Sensitive to outliers
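
A short sketch of the three statistics of location on a small, made-up data set with one outlier. Base R's `mode()` reports an object's storage type, not the statistical mode, so the mode is computed here by tabulating values; `stat_mode` is a hypothetical name.

```r
x <- c(2, 3, 3, 5, 7, 40)  # hypothetical variates; 40 is an outlier

median(x)  # middle value: (3 + 5) / 2 = 4
mean(x)    # sum(x) / length(x) = 60 / 6 = 10, pulled up by the outlier

# Tabulate the values and pick the most frequent one
counts <- table(x)
stat_mode <- as.numeric(names(counts)[which.max(counts)])
stat_mode  # 3, the most common value
```

The outlier moves the mean well above most of the variates, while the median and mode stay put.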

Statistics of dispersion

Range
- The difference between the maximum and minimum variates
Variance
Standard Deviation
Quantiles
Skewness
Kurtosis

Statistics of dispersion

Range
Variance
- The average squared deviation from the mean \[\sigma^2 \approx s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]
Standard Deviation
Quantiles
Skewness
Kurtosis

Statistics of dispersion

Range
Variance
Standard Deviation
- Variance, but transformed back into the measured units \[~\sigma \approx s = \sqrt{s^2}\]
Quantiles
Skewness
Kurtosis
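
The range, variance, and standard deviation formulas can be checked by hand against R's built-ins. The data below are made up for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical variates, mean = 5

# Range: difference between the maximum and minimum variates
diff(range(x))  # 9 - 2 = 7

# Variance by hand: sum of squared deviations divided by n - 1
n <- length(x)
s2 <- sum((x - mean(x))^2) / (n - 1)  # 32 / 7
s2
var(x)  # built-in, uses the same n - 1 denominator

# Standard deviation: square root of the variance, in the measured units
sqrt(s2)
sd(x)
```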

Statistics of dispersion

Range
Variance
Standard Deviation
Quantiles
- Values below which a certain proportion of variates fall
- Percentiles are quantiles expressed as percentages
- Quartiles = 25th, 50th, and 75th percentiles
Skewness
Kurtosis
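
In R, quantiles come from `quantile()`. Several definitions of a sample quantile exist (selected with the `type` argument); this sketch uses R's default on a toy vector.

```r
x <- 1:9  # toy data

# Quartiles: the 25th, 50th, and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q  # 3, 5, 7

# Interquartile range: distance between the 75th and 25th percentiles
IQR(x)  # 7 - 3 = 4
```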

Statistics of dispersion

Range
Variance
Standard Deviation
Quantiles
Skewness
- A measure of how much the tails are pulled one way or another
Kurtosis

Statistics of dispersion

Range
Variance
Standard Deviation
Quantiles
Skewness
Kurtosis
- A measure of how flat or peaked a distribution is
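
Base R has no skewness or kurtosis function, so the sketch below assumes the moment-based definitions (third and fourth central moments scaled by powers of the second). Other definitions exist and give slightly different values; the helper names are hypothetical.

```r
# Moment-based definitions (one common choice; an assumption here)
central_moment <- function(x, k) mean((x - mean(x))^k)
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)

skewness(symmetric)     # 0: neither tail dominates
skewness(right_skewed)  # positive: the long tail is on the right
kurtosis(symmetric)     # 1.7; a normal distribution has kurtosis 3
```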

We can use what we learn about distributions to calculate probabilities.

Probability

Statisticians define a random trial as a process with 2 or more possible outcomes, where the outcome of any one trial cannot be predicted with certainty.

The sample space is a list of all possible outcomes.

An event is one possible subset of the sample space.

Probability is the proportion of times an event is likely to occur in a series of random trials.
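
These definitions can be tried out with a simulated die. In this sketch the random trial is one roll, the sample space is the six faces, and the event is "roll a six"; all names are hypothetical.

```r
set.seed(123)

sample_space <- 1:6  # all possible outcomes of one roll
event <- 6           # the event of interest: a subset of the sample space

# A series of 10,000 random trials
rolls <- sample(sample_space, size = 10000, replace = TRUE)

# Proportion of trials in which the event occurred
p_six <- mean(rolls %in% event)
p_six
1 / 6  # the value the proportion approaches as trials accumulate
```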

Exclusivity

Mutually exclusive events cannot occur simultaneously.

The outcomes of a single die roll are mutually exclusive.
Hair and eye color are not mutually exclusive.

The Normal Distribution

…is used to model probabilities for "normally" distributed variables.

\[f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}\]

The Normal Distribution

Women's heights are approximately normally distributed with \(\bar{x} \approx 64\) inches and \(s \approx 3\) inches; therefore, \(f(x, \mu=64, \sigma=3)\) models this distribution.

\[f(x) = \frac{1}{\sqrt{2 \pi \cdot 3^2}}e^{\frac{-(x-64)^2}{2 \cdot 3^2}}\]

We could also say this distribution is approximately \(N(64, 3)\), written here as \(N(\mu, \sigma)\).
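
As a check on the density formula, the sketch below writes it out by hand and compares it with R's built-in `dnorm()`. The function name `normal_pdf` and the chosen heights are hypothetical.

```r
# Normal density written out from the formula
normal_pdf <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

heights <- c(58, 61, 64, 67, 70)  # inches

normal_pdf(heights, mu = 64, sigma = 3)
dnorm(heights, mean = 64, sd = 3)  # built-in; should match the line above

# Plot the density curve over +/- 4 standard deviations
curve(dnorm(x, mean = 64, sd = 3), from = 52, to = 76,
      xlab = "Height (inches)", ylab = "Density")
```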

The Normal Distribution

We expect 68% of the women to fall within one standard deviation from the mean.

\[\int_{\mu-\sigma}^{\mu+\sigma} \! f(x) \, \mathrm{d}x \approx 0.68\]

The Normal Distribution

We expect 95% of the women to fall within two standard deviations from the mean.

\[\int_{\mu-2\sigma}^{\mu+2\sigma} \! f(x) \, \mathrm{d}x \approx 0.95\]

The Normal Distribution

We expect 99.7% of the women to fall within three standard deviations from the mean.

\[\int_{\mu-3\sigma}^{\mu+3\sigma} \! f(x) \, \mathrm{d}x \approx 0.997\]
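
The three integrals above can be evaluated numerically. This sketch uses `integrate()` on `dnorm()` with the height parameters from the example; `mu`, `sigma`, and `within` are hypothetical names.

```r
mu <- 64
sigma <- 3

# Area under the density within 1, 2, and 3 standard deviations of the mean
within <- sapply(1:3, function(k) {
  integrate(dnorm, lower = mu - k * sigma, upper = mu + k * sigma,
            mean = mu, sd = sigma)$value
})
round(within, 3)  # 0.683 0.954 0.997
```

The same areas can be obtained without numerical integration as `pnorm(k) - pnorm(-k)` for `k` in 1 to 3.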

The Normal Distribution

How likely is it that the next woman to walk into our room is shorter than 71 inches?

\[\int_{-\infty}^{71} \! f(x) \, \mathrm{d}x = ?\]

We can ask a computer to calculate that integral for us:

# Cumulative probability P(X < 71) for a normal with mean 64 and sd 3
pnorm(71, mean = 64, sd = 3)

## [1] 0.9901847

The Normal Distribution

100% of women have a height, i.e. the integral over the entire sample space is 1.

\[\int_{-\infty}^{\infty} \! f(x) \, \mathrm{d}x = 1\]

The Normal Distribution

How likely is it that the next woman to walk into our room is taller than 71 inches?

\[\int_{71}^{\infty} \! f(x) \, \mathrm{d}x = ?\]

Because the integral of the entire sample space is 1, we can get this probability by subtracting from 1:

# Upper-tail probability P(X > 71) as the complement of the lower tail
1 - pnorm(71, mean = 64, sd = 3)

## [1] 0.009815329
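
For comparison, here are two other ways to reach the same upper-tail probability: asking `pnorm()` for the upper tail directly, and integrating the density from 71 to infinity. Both are sketches; `upper_tail` and `by_integral` are hypothetical names.

```r
# Ask pnorm() for the upper tail instead of subtracting from 1
upper_tail <- pnorm(71, mean = 64, sd = 3, lower.tail = FALSE)
upper_tail

# Numerically integrate the density from 71 to infinity
by_integral <- integrate(dnorm, lower = 71, upper = Inf,
                         mean = 64, sd = 3)$value
by_integral
```

The `lower.tail = FALSE` form is usually preferable for very small tail probabilities, because subtracting a number close to 1 from 1 loses precision.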

Estimates are always uncertain, but the larger your sample, the greater your certainty.

Central Limit Theorem

Repeated sampling of a population, normal or not, results in a sampling distribution of means that is approximately normal, provided the samples are reasonably large and the population has finite variance.
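
A simulation sketch of this idea, assuming a strongly skewed (exponential) population with mean 1 and standard deviation 1. The sample size of 50 and the 5,000 repetitions are arbitrary illustrative choices.

```r
set.seed(2024)

n <- 50  # size of each sample

# Draw 5,000 samples from a skewed population and keep each sample's mean
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

# The histogram of means is roughly bell-shaped despite the skewed population
hist(sample_means, breaks = 40, main = "Sampling distribution of the mean")

mean(sample_means)  # close to the population mean, 1
sd(sample_means)    # close to 1 / sqrt(50)
```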

Estimating uncertainty

Repeated sampling of a population, normal or not, results in a sampling distribution of means that is approximately normal.

The \(\sigma\) of this distribution is called the standard error of the mean, and it measures the precision of \(\bar{x}\):

\[\sigma_{\bar{x}} \approx s_{\bar{x}} = \frac{s}{\sqrt{n}}\]

A larger \(n\) yields a smaller \(s_{\bar{x}}\).
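
The standard error formula is a one-liner in R. The heights below are made up for illustration.

```r
x <- c(61, 66, 63, 68, 64, 62, 65, 67, 60, 64)  # hypothetical heights (inches)
n <- length(x)

# Standard error of the mean: s / sqrt(n)
se <- sd(x) / sqrt(n)
se

# With the same s, quadrupling n halves the standard error
sd(x) / sqrt(4 * n)
```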

95% Confidence Intervals

If you pull 20 samples and calculate the interval

\[\bar{x}-2 \sigma_{\bar{x}} ~ \mathrm{to} ~ \bar{x}+2 \sigma_{\bar{x}}\]

…for each, about 19 of those intervals will contain the true population mean, on average.

This is called the 95% Confidence Interval, because about 95% of intervals constructed this way contain the true mean.
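
The coverage claim can be checked by simulation. This sketch assumes a normal population with the height parameters used earlier and a sample size of 30, and it estimates the standard error from each sample as \(s/\sqrt{n}\); all names are hypothetical.

```r
set.seed(7)

mu <- 64     # true population mean (known only because we simulate)
sigma <- 3
n <- 30

# For each of 2,000 samples, build the interval and record whether it
# contains the true mean
covers <- replicate(2000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n)
  lower <- mean(x) - 2 * se
  upper <- mean(x) + 2 * se
  lower <= mu && mu <= upper
})

mean(covers)  # proportion of intervals containing mu, near 0.95
```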

What about categorical data?

Categorical data visually

Bar Graph
- Always relative to zero!
- Color can be used to emphasize ordered groups

Categorical data visually

Bar Graph
Pie Chart
- Terrible!
- Don't ever use one!

Categorical data quantitatively

Frequency: \(~f_j\)
- The number of individuals, \(~f\), in group \(~j\)

Categorical data quantitatively

Frequency: \(~f_j\)
Proportion: \(~p = \frac{f_j}{n}\)
- The frequency of individuals in a group relative to the number of individuals in all groups, \(~n\)
- Percent: \(~p \times 100\%\)

Categorical data quantitatively

Frequency: \(~f_j\)
Proportion: \(~p\)
Odds: \(~\frac{p}{1-p} = \frac{f_j}{n-f_j}\)
- The probability of success relative to the probability of failure
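
A sketch of frequency, proportion, and odds for a made-up categorical variable. The eye colors and counts are hypothetical.

```r
# Hypothetical eye colors for 10 individuals
eye_color <- c(rep("brown", 5), rep("blue", 3), rep("green", 2))

f <- table(eye_color)  # frequency of each group, f_j
n <- sum(f)            # number of individuals in all groups

p <- f / n             # proportion in each group
odds <- p / (1 - p)    # odds; equivalent to f / (n - f)

f
p
odds

# Bar graph of the frequencies; bars start at zero
barplot(f, ylab = "Frequency")
```

Here brown has a proportion of 0.5 and therefore odds of 1: an individual is as likely to be in that group as not.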

In summary…

Summary

Samples are used to estimate population parameters.
To describe variable characteristics, we describe the distribution of values that variable exhibits.
We can do this visually and quantitatively.
Statistics of location (mean, median, mode) tell us the "center" of a continuous distribution.
Statistics of dispersion (variance, standard deviation, etc.) describe the spread of values around the center.

Summary

Standard Errors and Confidence Intervals show the uncertainty of our estimates.
Probability is the proportion of times an event is likely to occur in a series of random trials.
Probabilities can be calculated from probability density functions if we know certain characteristics of the distribution (e.g. the mean and standard deviation).
Categorical distributions are described using frequencies, proportions, and odds.