February 18, 2016

Review: session 1 learning objectives

  • Perform basic data manipulation/exploration in R and dplyr
    • load data from a csv file
    • generate random numbers using sample()
    • understand use of set.seed()
    • generate histograms
  • Clone and contribute to the class GitHub repo
  • Complete assignments using R Markdown
  • Define random variables and distinguish them from non-random ones
  • Recognize some important probability distributions from their probability density plots:
    • Normal, Poisson, Negative Binomial, Binomial

The Normal distribution

https://i.ytimg.com/vi/3yQF7np9Eiw/maxresdefault.jpg

Mean: \(\mu = \frac{1}{N}\sum_{i=1}^{N}x_i\)
Variance: \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\)
N = # in population
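
As a quick illustration in R (the values of x below are made up for demonstration), these population quantities can be computed directly; note that base R's var() divides by N - 1, so the population variance is written out explicitly:

    # a small "population" of N values (illustrative numbers)
    x <- c(2, 4, 4, 4, 5, 5, 7, 9)
    N <- length(x)

    mu     <- sum(x) / N            # population mean
    sigma2 <- sum((x - mu)^2) / N   # population variance (divide by N, not N - 1)

    mu      # 5
    sigma2  # 4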

Session 2 learning objectives: foundations of hypothesis testing

  • Identify the difference between populations and samples
    • give sampling strategies
  • Identify properties of a Normal distribution
  • Define the Central Limit Theorem and give examples of its application

  • Book sections:
    • Chapter 1 - Inference, up to and including "Central Limit Theorem and t-distribution"

Population vs sample

Random Sample
  • The numbered population is the sampling frame
  • Sampling probability of every individual in the population must be non-zero
  • We must know what the sampling probability is for every individual in the population

Population vs sample (cont'd)

  • The population distribution is specified by parameters
  • Individuals in the population are sometimes referred to as records, on each of which you make an observation
  • A sample is summarized by a statistic
  • The statistic is used to make inference about the population
http://mips.stanford.edu/courses/stats_data_analsys/lesson_1/pop.gif

Sampling strategies

  • Simple Random Sampling (SRS): each record is sampled with equal probability (see the R sketch after this list)
  • Stratified Random Sampling: records are sampled in planned numbers from pre-defined strata (usually better than Simple Random Sampling)
  • Cluster Random Sampling: clusters are randomly sampled from the population, then individuals randomly sampled from the clusters
  • Complex Sampling: a general term for designs with unequal sampling probability
  • Convenience Sampling: whatever records are easiest to observe
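
A minimal R sketch of the first two strategies, using base R's sample() and dplyr; the data frame, strata, and sample sizes below are illustrative, not from the course data:

    library(dplyr)
    set.seed(1)

    # illustrative sampling frame: 1,000 records split across three strata
    sampling_frame <- data.frame(id = 1:1000,
                                 stratum = sample(c("A", "B", "C"), 1000, replace = TRUE))

    # Simple Random Sampling: 60 records, each with equal probability
    srs <- sampling_frame[sample(nrow(sampling_frame), 60), ]

    # Stratified Random Sampling: a planned number (20) from each stratum
    strat <- sampling_frame %>%
      group_by(stratum) %>%
      sample_n(20) %>%
      ungroup()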

Population size doesn't matter!

  • A remarkable attribute of sampling:
    • as long as the population is large relative to the sample and you can obtain a random sample, only the size of the sample matters, not the size of the population
    • we normally treat the population as infinite
http://image.slidesharecdn.com/italy-powerpoint-1210006205187714-8/95/italy-powerpoint-6-728.jpg?cb=120998

Selection bias

Random Sample 2

  • Selection bias may be introduced by differences between:
    • the theoretical and empirical target populations
    • the assumed and actual sampling probabilities
  • Selection bias is widespread, and it is often not clear how to account for it

How inferential statistics works

http://www.biochemia-medica.com/system/files/Marusteri_M._Statistical_test_selection_when_comparing_groups_Fig._1.jpg

Sampling distributions and the Central Limit Theorem

What is a “sampling distribution?”

It is the distribution of a statistic for many samples taken from one population.

  1. Take a sample from a population
  2. Calculate the sample statistic (e.g. mean)
  3. Repeat.
  • The values from (2) form a sampling distribution.

  • Question: how is this different from a population distribution?

Example: population and sampling distributions

We observe 100 counts from a Poisson distribution (\(\lambda = 2\)):
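
A minimal R sketch of how such a figure could be produced (the seed and plot labels are arbitrary choices):

    set.seed(42)                       # arbitrary seed for reproducibility
    counts <- rpois(100, lambda = 2)   # 100 counts from a Poisson(2) population
    hist(counts, breaks = seq(-0.5, max(counts) + 0.5, by = 1),
         main = "100 Poisson(2) counts", xlab = "count")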

Question: is this a population or a sampling distribution?

Example: population and sampling distributions (cont'd)

  • We calculate the mean of those 100 counts, and do the same for 1,000 more samples of 100:
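
A minimal R sketch that follows steps 1-3 above with these numbers (the seed is arbitrary):

    set.seed(42)
    # the original sample plus 1,000 more: 1,001 samples of 100 Poisson(2) counts,
    # keeping each sample's mean
    sample_means <- replicate(1001, mean(rpois(100, lambda = 2)))
    hist(sample_means, main = "Sampling distribution of the mean",
         xlab = "sample mean")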

Question: is this a population or a sampling distribution?

Central Limit Theorem

The "CLT" relates the sampling distribution (of means) to the population distribution.

  1. The mean of the population (\(\mu\)) and the mean of the sampling distribution of \(\bar{X}\) are identical
  2. Standard deviation of the population (\(\sigma\)) is related to the standard deviation of the distribution of sample means (Standard Error or SE) by: \[ SE = \sigma / \sqrt{n} \]
  3. For large n, the shape of the sampling distribution of means becomes normal

CLT 1: equal means

Recall the Poisson-distributed population and samples of n = 30:

  • Distributions are different, but means are the same

CLT 2: Standard Error

Standard deviation of the sampling distribution is \(SE = \sigma / \sqrt{n}\):
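
A minimal R check of this relationship for the Poisson(2) example with n = 30 (the seed and number of replicates are arbitrary; for a Poisson(2) population, \(\sigma = \sqrt{2}\)):

    set.seed(42)
    n <- 30
    sample_means <- replicate(10000, mean(rpois(n, lambda = 2)))

    sd(sample_means)    # empirical SD of the sampling distribution
    sqrt(2) / sqrt(n)   # CLT prediction: sigma / sqrt(n)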

CLT 3: large samples

  • The distribution of means of large samples is normal.
    • for large enough n, the population distribution doesn't matter. How large?
    • n < 30: population is normal or close to it
    • n >= 30: skew and outliers are OK
    • n > 500: even extreme population distributions

CLT 3: large samples (cont'd)

  • Example: an extremely skewed (log-normal) distribution:
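
A minimal R sketch of this example; the log-normal parameters and the sample size of 500 are illustrative choices:

    set.seed(42)
    population_draws <- rlnorm(10000, meanlog = 0, sdlog = 2)   # extremely right-skewed

    # means of 1,000 large samples (n = 500) are approximately normal
    sample_means <- replicate(1000, mean(rlnorm(500, meanlog = 0, sdlog = 2)))

    par(mfrow = c(1, 2))
    hist(population_draws, main = "Log-normal population", xlab = "x")
    hist(sample_means, main = "Means of samples, n = 500", xlab = "sample mean")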

t-distribution

What is the use of the t distribution?

  • Recall from the CLT that the sampling distribution of the mean is normal, with standard deviation \(SE = \sigma / \sqrt{n}\):
    • For a normally distributed population, this holds for any sample size n
    • For non-normally distributed populations, it holds for large n
  • But this formula assumes we know \(\sigma\)
    • if we instead estimate the standard deviation from the sample (\(s\)), the sampling distribution is not normal
    • it has wider tails than the normal distribution (see the simulation sketch after this list)
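
A minimal simulation of this effect in R, assuming a Normal(0, 1) population and a deliberately small sample size (n = 5 is an illustrative choice):

    set.seed(42)
    n <- 5   # deliberately small sample size

    # standardize sample means using the estimated standard deviation s
    t_stats <- replicate(10000, {
      x <- rnorm(n)                  # sample from a Normal(0, 1) population
      mean(x) / (sd(x) / sqrt(n))    # (x-bar - mu) / (s / sqrt(n)), with mu = 0
    })

    sd(t_stats)                          # noticeably larger than 1
    quantile(t_stats, c(0.025, 0.975))   # wider than the normal's +/- 1.96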

What the t distribution looks like
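
A minimal R sketch of such a comparison, overlaying the standard normal density and a t density with few degrees of freedom (df = 4 is an arbitrary choice):

    curve(dnorm(x), from = -4, to = 4, ylab = "density",
          main = "Standard normal vs t (df = 4)")
    curve(dt(x, df = 4), from = -4, to = 4, add = TRUE, lty = 2)
    legend("topright", legend = c("Normal", "t, df = 4"), lty = c(1, 2))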

Question: Why would such an apparently small difference from the normal distribution matter?

When to use the t distribution

  • Calculate the t-statistic for a sample: \(t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}\)
  • if the population distribution is normal, or you have a large sample size,
  • and you estimate the standard deviation from the sample,
  • THEN: the t-statistic is distributed as \(t_{df=n-1}\)
  • this leads to the one-sample t-test (see the sketch after this list)
  • note: the difference in the means of two samples is also t-distributed when \(s\) is estimated
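
A minimal R sketch of the one-sample t-test, comparing the hand-computed t-statistic with t.test(); the data and the null value \(\mu_0 = 2\) are illustrative:

    set.seed(42)
    x <- rpois(30, lambda = 2)   # a sample of n = 30 counts
    mu0 <- 2                     # null-hypothesis value of the population mean
    n <- length(x)

    t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
    t_stat                             # hand-computed t-statistic
    2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value from the t distribution

    t.test(x, mu = mu0)                # built-in one-sample t-test: same t and p-value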

Why Hypothesis Testing?

Lab exercises

Links