February 18, 2016

Review: session 1 learning objectives

  • Perform basic data manipulation/exploration in R and dplyr
    • load data from a csv file
    • generate random numbers using sample()
    • understand use of set.seed()
    • generate histograms
  • Clone and contribute to the class GitHub repo
  • Complete assignments using R Markdown
  • Define random variables and distinguish them from non-random ones
  • Recognize some important probability distributions from their probability density plots:
    • Normal, Poisson, Negative Binomial, Binomial

The Normal distribution

https://i.ytimg.com/vi/3yQF7np9Eiw/maxresdefault.jpg

Mean: \(\mu = \frac{1}{N}\sum_{i=1}^{N}x_i\)
Variance: \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\)
N = # in population
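
As a quick illustration in R (the values of x below are made up for demonstration), these population quantities can be computed directly; note that base R's var() divides by N - 1, so the population variance is written out explicitly:

    # a small "population" of N values (illustrative numbers)
    x <- c(2, 4, 4, 4, 5, 5, 7, 9)
    N <- length(x)

    mu     <- sum(x) / N            # population mean
    sigma2 <- sum((x - mu)^2) / N   # population variance (divide by N, not N - 1)

    mu      # 5
    sigma2  # 4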

Session 2 learning objectives: foundations of hypothesis testing

  • Identify the difference between populations and samples
    • give sampling strategies
  • Identify properties of a Normal distribution
  • Define the Central Limit Theorem and give examples of its application

  • Book sections:
    • Chapter 1 - Inference, up to and including "Central Limit Theorem and t-distribution"

Population vs sample

Random Sample
  • The numbered population is the sampling frame
  • Sampling probability of every individual in the population must be non-zero
  • We must know what the sampling probability is for every individual in the population

Population vs sample (cont'd)

  • The population distribution is specified by parameters
  • Individuals in the population are sometimes referred to as records, on each of which you make an observation
  • A sample is summarized by a statistic
  • The statistic is used to make inference about the population
http://mips.stanford.edu/courses/stats_data_analsys/lesson_1/pop.gif

Sampling strategies

  • Simple Random Sampling (SRS): each record is sampled with equal probability (see the R sketch after this list)
  • Stratified Random Sampling: records are sampled in planned numbers from pre-defined strata (usually better than Simple Random Sampling)
  • Cluster Random Sampling: clusters are randomly sampled from the population, then individuals randomly sampled from the clusters
  • Complex Sampling: a general term for designs with unequal sampling probability
  • Convenience Sampling: whatever records are easiest to observe
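
A minimal R sketch of the first two strategies, using base R's sample() and dplyr; the data frame, strata, and sample sizes below are illustrative, not from the course data:

    library(dplyr)
    set.seed(1)

    # illustrative sampling frame: 1,000 records split across three strata
    sampling_frame <- data.frame(id = 1:1000,
                                 stratum = sample(c("A", "B", "C"), 1000, replace = TRUE))

    # Simple Random Sampling: 60 records, each with equal probability
    srs <- sampling_frame[sample(nrow(sampling_frame), 60), ]

    # Stratified Random Sampling: a planned number (20) from each stratum
    strat <- sampling_frame %>%
      group_by(stratum) %>%
      sample_n(20) %>%
      ungroup()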

Population size doesn't matter!

  • A remarkable attribute of sampling:
    • as long as the population is large relative to the sample and you can obtain a random sample, only the size of the sample matters, not the size of the population
    • we normally treat the population as infinite
http://image.slidesharecdn.com/italy-powerpoint-1210006205187714-8/95/italy-powerpoint-6-728.jpg?cb=120998

Selection bias

Random Sample 2

  • Selection bias may be introduced by differences between:
    • the theoretical and empirical target populations
    • the assumed and actual sampling probabilities
  • Selection bias is widespread, and it is often not clear how to account for it

How inferential statistics works

http://www.biochemia-medica.com/system/files/Marusteri_M._Statistical_test_selection_when_comparing_groups_Fig._1.jpg

Sampling distributions and the Central Limit Theorem

What is a “sampling distribution?”

It is the distribution of a statistic for many samples taken from one population.

  1. Take a sample from a population
  2. Calculate the sample statistic (e.g. mean)
  3. Repeat.
  • The values from (2) form a sampling distribution.

  • Question: how is this different from a population distribution?

Example: population and sampling distributions

We observe 100 counts from a Poisson distribution (\(\lambda = 2\)):
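
A minimal R sketch of how such a figure could be produced (the seed and plot labels are arbitrary choices):

    set.seed(42)                       # arbitrary seed for reproducibility
    counts <- rpois(100, lambda = 2)   # 100 counts from a Poisson(2) population
    hist(counts, breaks = seq(-0.5, max(counts) + 0.5, by = 1),
         main = "100 Poisson(2) counts", xlab = "count")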

Question: is this a population or a sampling distribution?

Example: population and sampling distributions (cont'd)

  • We calculate the mean of those 100 counts, and do the same for 1,000 more samples of 100:
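
A minimal R sketch that follows steps 1-3 above with these numbers (the seed is arbitrary):

    set.seed(42)
    # the original sample plus 1,000 more: 1,001 samples of 100 Poisson(2) counts,
    # keeping each sample's mean
    sample_means <- replicate(1001, mean(rpois(100, lambda = 2)))
    hist(sample_means, main = "Sampling distribution of the mean",
         xlab = "sample mean")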

Question: is this a population or a sampling distribution?

Central Limit Theorem

The "CLT" relates the sampling distribution (of means) to the population distribution.

  1. The mean of the population (\(\mu\)) and the mean of the sampling distribution of \(\bar{X}\) are identical
  2. Standard deviation of the population (\(\sigma\)) is related to the standard deviation of the distribution of sample means (Standard Error or SE) by: \[ SE = \sigma / \sqrt{n} \]
  3. For large n, the shape of the sampling distribution of means becomes normal

CLT 1: equal means

Recall the Poisson-distributed population and samples of n = 30:

  • Distributions are different, but means are the same

CLT 2: Standard Error

Standard deviation of the sampling distribution is \(SE = \sigma / \sqrt{n}\):
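
A minimal R check of this relationship for the Poisson(2) example with n = 30 (the seed and number of replicates are arbitrary; for a Poisson(2) population, \(\sigma = \sqrt{2}\)):

    set.seed(42)
    n <- 30
    sample_means <- replicate(10000, mean(rpois(n, lambda = 2)))

    sd(sample_means)    # empirical SD of the sampling distribution
    sqrt(2) / sqrt(n)   # CLT prediction: sigma / sqrt(n)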

CLT 3: large samples

  • The distribution of means of large samples is normal.
    • for large enough n, the population distribution doesn't matter. How large?
    • n < 30: population is normal or close to it
    • n >= 30: skew and outliers are OK
    • n > 500: even extreme population distributions

CLT 3: large samples (cont'd)

  • Example: an extremely skewed (log-normal) distribution:
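
A minimal R sketch of this example; the log-normal parameters and the sample size of 500 are illustrative choices:

    set.seed(42)
    population_draws <- rlnorm(10000, meanlog = 0, sdlog = 2)   # extremely right-skewed

    # means of 1,000 large samples (n = 500) are approximately normal
    sample_means <- replicate(1000, mean(rlnorm(500, meanlog = 0, sdlog = 2)))

    par(mfrow = c(1, 2))
    hist(population_draws, main = "Log-normal population", xlab = "x")
    hist(sample_means, main = "Means of samples, n = 500", xlab = "sample mean")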

t-distribution

What is the use of the t distribution?

  • Recall from the CLT that the sampling distribution of the mean is normal, with standard deviation \(SE = \sigma / \sqrt{n}\):
    • For a normally distributed population, this holds for any sample size n
    • For non-normally distributed populations, it holds for large n
  • But this formula assumes we know \(\sigma\)
    • if we instead estimate the standard deviation from the sample (\(s\)), the sampling distribution is not normal
    • it has wider tails than the normal distribution (see the simulation sketch after this list)
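
A minimal simulation of this effect in R, assuming a Normal(0, 1) population and a deliberately small sample size (n = 5 is an illustrative choice):

    set.seed(42)
    n <- 5   # deliberately small sample size

    # standardize sample means using the estimated standard deviation s
    t_stats <- replicate(10000, {
      x <- rnorm(n)                  # sample from a Normal(0, 1) population
      mean(x) / (sd(x) / sqrt(n))    # (x-bar - mu) / (s / sqrt(n)), with mu = 0
    })

    sd(t_stats)                          # noticeably larger than 1
    quantile(t_stats, c(0.025, 0.975))   # wider than the normal's +/- 1.96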

What the t distribution looks like
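
A minimal R sketch of such a comparison, overlaying the standard normal density and a t density with few degrees of freedom (df = 4 is an arbitrary choice):

    curve(dnorm(x), from = -4, to = 4, ylab = "density",
          main = "Standard normal vs t (df = 4)")
    curve(dt(x, df = 4), from = -4, to = 4, add = TRUE, lty = 2)
    legend("topright", legend = c("Normal", "t, df = 4"), lty = c(1, 2))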

Question: Why would such an apparently small difference from the normal distribution matter?

When to use the t distribution

  • Calculate the t-statistic for a sample: \(t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}\)
  • if the population distribution is normal, or you have a large sample size,
  • and you estimate the standard deviation from the sample,
  • THEN: the t-statistic is distributed as \(t_{df=n-1}\)
  • this leads to the one-sample t-test (see the sketch after this list)
  • note: the difference in the means of two samples is also t-distributed when \(s\) is estimated
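
A minimal R sketch of the one-sample t-test, comparing the hand-computed t-statistic with t.test(); the data and the null value \(\mu_0 = 2\) are illustrative:

    set.seed(42)
    x <- rpois(30, lambda = 2)   # a sample of n = 30 counts
    mu0 <- 2                     # null-hypothesis value of the population mean
    n <- length(x)

    t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
    t_stat                             # hand-computed t-statistic
    2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value from the t distribution

    t.test(x, mu = mu0)                # built-in one-sample t-test: same t and p-value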

Why Hypothesis Testing?

Lab exercises

Links