Lecture 2: Basic Concepts: Inference

Joel Correa da Rosa
December 21st 2016

Inference

In the “central dogma” of statistics, conclusions are drawn by studying samples that are supposed to represent a universe.

Population vs. Sample

Population (Universe) : Collection of all entities of interest (not necessarily people).

Sample : a subset of the population.

To make inferences, the sample should ideally represent the population.

Sampling is necessary because:

  • it reduces costs
  • there are feasibility issues
  • some experiments are destructive

Sampling Methods

  • Simple random sampling
  • Systematic sampling
  • Stratified sampling
  • Cluster sampling
  • Convenience sampling
  • Quota sampling
  • Snowball sampling

The benefits of stratified random sampling are:

a) the cost per observation in the survey may be reduced;

b) estimates of the population parameters may be wanted for each subpopulation;

c) accuracy may be increased at a given cost.
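
Three of the sampling methods listed above can be sketched in R as follows. The sampling frame, the stratum labels, the sample sizes and the seed are all hypothetical and chosen only for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical sampling frame: 100 units split into two strata
frame <- data.frame(id = 1:100,
                    stratum = rep(c("A", "B"), times = c(70, 30)))

# Simple random sampling: 10 units drawn without replacement
srs <- frame[sample(nrow(frame), 10), ]

# Systematic sampling: random start, then every k-th unit
k <- 10
start <- sample(k, 1)
sys <- frame[seq(start, nrow(frame), by = k), ]

# Stratified sampling: a simple random sample of 5 units within each stratum
strat <- do.call(rbind, lapply(split(frame, frame$stratum),
                               function(s) s[sample(nrow(s), 5), ]))
table(strat$stratum)
```

Stratifying guarantees that each subpopulation contributes a fixed number of units, which is what makes per-stratum estimates possible.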

Random sample and randomization

Random sampling and randomization are different concepts. Random sampling is related to external validity (generalizability), whereas randomization (e.g. random assignment to treatments) is related to design and internal validity.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3271469/

Parameter

A (numeric) characteristic of the population, fixed and usually unknown.

Examples :

  • \( \mu \) : population mean
  • \( \sigma^2 \) : population variance
  • \( p \) : population prevalence
  • \( Q(p) \): p-th quantile
  • \( \rho(X,Y) \): correlation between two features

Sample statistic

A numerical characteristic of the sample that is used to estimate an unknown parameter of the population.

Examples :

  • \( \bar{X} = \frac{\sum_{i=1}^n X_i}{n} \)
  • \( S^2 = \frac{\sum_{i=1}^n (X_i-\bar{X})^2}{n-1} \)
  • \( \hat{p} = \frac{\sum_{i=1}^n X_i}{n} \), \( X_i \in \{0,1\} \)
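
As a quick check of these formulas, the sketch below computes each statistic by hand and compares it with R's built-in functions. The data vectors are hypothetical.

```r
# Hypothetical continuous measurements
x <- c(12.1, 9.8, 11.4, 10.3, 13.0)
n <- length(x)

xbar <- sum(x) / n                     # sample mean
s2   <- sum((x - xbar)^2) / (n - 1)    # sample variance (divisor n - 1)

# Hypothetical 0/1 indicators (1 = condition present)
y    <- c(1, 0, 0, 1, 1, 0, 1, 0)
phat <- sum(y) / length(y)             # sample proportion

# The hand-computed values agree with the built-in functions
all.equal(xbar, mean(x))
all.equal(s2, var(x))
phat
```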

The foundation of inference lies in studying how sample statistics vary by chance from sample to sample.

Sampling Distribution

The value of a sample statistic changes from sample to sample. The frequency distribution of the possible outcomes of the sample statistic is called the sampling distribution. Let's see a toy example.

# population 
pop<-c(4,8,10,15,25,46)
# population mean
mean(pop)
[1] 18
# population variance (var() divides by N-1, so rescale by (N-1)/N = 5/6)
var(pop)*(5/6)
[1] 200.3333
# population standard deviation
sqrt(var(pop)*(5/6))
[1] 14.15392

Samples of n=3

# All possible samples of size n=3
combn(pop,3)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,]    4    4    4    4    4    4    4    4    4     4     8     8     8
[2,]    8    8    8    8   10   10   10   15   15    25    10    10    10
[3,]   10   15   25   46   15   25   46   25   46    46    15    25    46
     [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,]     8     8     8    10    10    10    15
[2,]    15    15    25    15    15    25    25
[3,]    25    46    46    25    46    46    46

Sampling distribution of the sample mean (when n=3)

xbars<-apply(combn(pop,3),2,mean)
xbars
 [1]  7.333333  9.000000 12.333333 19.333333  9.666667 13.000000 20.000000
 [8] 14.666667 21.666667 25.000000 11.000000 14.333333 21.333333 16.000000
[15] 23.000000 26.333333 16.666667 23.666667 27.000000 28.666667

hist(xbars)

Mean of all possible sample means

What is the mean of \( \bar{X} \) when \( n=3 \)?

mean(xbars)
[1] 18

What is the standard deviation of \( \bar{X} \) when \( n=3 \)?

sd(xbars)
[1] 6.494262

Population is infinite or sampling with replacement

In random sampling

mean(\( \bar X \))=\( \mu \)

sd(\( \bar X \))=\( \frac{\sigma}{\sqrt{n}} \)
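
A small simulation can illustrate these two results. This is a sketch only: the normal population, the values of \( \mu \), \( \sigma \) and \( n \), the number of replicates and the seed are all assumptions made for illustration.

```r
set.seed(2016)  # arbitrary seed

mu    <- 10     # assumed population mean
sigma <- 2.5    # assumed population standard deviation
n     <- 10     # sample size

# Draw 20000 independent samples of size n and keep each sample mean
xbars.sim <- replicate(20000, mean(rnorm(n, mu, sigma)))

mean(xbars.sim)     # should be close to mu
sd(xbars.sim)       # should be close to sigma / sqrt(n)
sigma / sqrt(n)
```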

Population is finite and sampling without replacement

In random sampling

mean(\( \bar X \))=\( \mu \)

sd(\( \bar X \))=\( \sqrt{\frac{N-n}{N-1}}\frac{\sigma}{\sqrt{n}} \)
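
The toy population can be used to check the finite-population result \( \sqrt{\frac{N-n}{N-1}}\frac{\sigma}{\sqrt{n}} \), where the correction factor is smaller than 1. The sketch below enumerates all 20 samples of size 3 and compares the exact standard deviation of the sample means with the formula. Note that `sd()` divides by the number of values minus one (here 19), which is why `sd(xbars)` returns the slightly larger 6.494262; the exact value uses the divisor 20.

```r
pop <- c(4, 8, 10, 15, 25, 46)
N <- length(pop)
n <- 3

# Population standard deviation (divisor N)
sigma <- sqrt(mean((pop - mean(pop))^2))

# Means of all choose(6, 3) = 20 equally likely samples
xbars <- colMeans(combn(pop, n))

# Exact sd of the sampling distribution (divisor 20, not 19)
sd.exact <- sqrt(mean((xbars - mean(xbars))^2))

# Formula with the finite population correction
sd.formula <- sqrt((N - n) / (N - 1)) * sigma / sqrt(n)

c(exact = sd.exact, formula = sd.formula)
```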

An inference exercise

The population mean is unknown but lies within the range \( (7,15) \). The population standard deviation is 2.5. The population is infinite and the measurements are normally distributed.

Let's draw a sample of size \( n=10 \) and discuss some inferences about the mean.

# true mean
set.seed(123)
true.mean <- runif(1,7,15)

# draw a sample n=10
my.sample<-rnorm(10,true.mean,2.5)
my.sample
 [1] 11.302006 12.276137  5.076731 12.399360  9.028205  9.007515  9.758327
 [8] 12.502007  4.982444 13.526081

# summary of the sample
mean(my.sample)
[1] 9.985881
sd(my.sample)
[1] 3.031804

What is your inference?

Is the true mean greater than 8? (hypothesis test)
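
One possible way to approach this question is sketched here. Because the population standard deviation is stated to be 2.5 and the measurements are normal, a one-sided z-test of \( H_0: \mu \leq 8 \) applies; a t-test that ignores the known \( \sigma \) is shown for comparison. The sample values are copied from the output above; the 95% confidence level is a conventional choice, not something fixed by the exercise.

```r
# Sample values copied from the printed output of my.sample
my.sample <- c(11.302006, 12.276137, 5.076731, 12.399360, 9.028205,
               9.007515, 9.758327, 12.502007, 4.982444, 13.526081)

sigma <- 2.5              # known population sd (given in the exercise)
mu0   <- 8                # value tested under the null hypothesis
n     <- length(my.sample)

# One-sided z-test: H0: mu <= 8 versus H1: mu > 8
z   <- (mean(my.sample) - mu0) / (sigma / sqrt(n))
p.z <- pnorm(z, lower.tail = FALSE)

# 95% confidence interval for the mean, using the known sigma
ci.z <- mean(my.sample) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)

# One-sided t-test, which estimates the sd from the sample instead
tt <- t.test(my.sample, mu = mu0, alternative = "greater")

z; p.z; ci.z
tt$statistic; tt$p.value
```

A small p-value is evidence against \( H_0 \), i.e. in favor of a true mean greater than 8.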

Foundations of Probability Theory

Aristotle: “The probable is what usually happens”

Cicero: “Probability is the very guide of life”

Democritus: “Everything existing in the universe is the fruit of chance”

Randomness

The cause of uncertainty is linked to randomness.

When sampling from a population, randomness is present.

In our everyday lives, almost everything is a random experiment.

Random Experiment

An experiment whose outcome cannot be predicted with certainty before the experiment is run.

If the same results are obtained when an experiment is repeated under the same conditions, the experiment is deterministic.

Probability and Statistics are the branches of mathematics that have been developed to deal with random experiments.

How to handle a random experiment?

  • Build the sample space.
  • Determine the events of interest.
  • Assign probabilities.

Sample Space

The set of all possible outcomes of the Random Experiment.

Example #1 : Consider the random experiment that will sample 5 HIV+ subjects and verify how many subjects have adverse events after vaccination.

The sample space is: \( \Omega = \{0,1,2,3,4,5\} \)

Example #2 : Consider the random experiment that will sample 10 psoriatic subjects and measure the average IL-17 gene expression in log2 scale in their skin tissue.

The sample space is: \( \Omega= (-\infty,+\infty) \)

Event

An event is a subset of the sample space.

Example #1 : Consider the event: “More than 3 subjects have adverse events after vaccination”. This is the subset \( A = \{4,5\} \) of the sample space.

Example #2 : Define the event: “The average log2 expression for IL-17 in 10 psoriatic patients is greater than 7.” \( A = (7,+\infty) \).

Set Operations

Set operations are important for defining the events of interest and for better understanding the probability rules. Consider two events (subsets of the sample space) \( A \) and \( B \).

Set operation #1 : Union (\( A \cup B \))

Set operation #2 : Intersection (\( A \cap B \))

Set operation #3 : Complement (\( A^c \))
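
For a finite sample space such as the one in Example #1, these operations can be sketched with base R set functions. Event \( B \) ("an even number of subjects have adverse events") is hypothetical and introduced only for illustration.

```r
omega <- 0:5                    # sample space of Example #1

A <- omega[omega > 3]           # more than 3 subjects: 4 5
B <- omega[omega %% 2 == 0]     # hypothetical event, even count: 0 2 4

sort(union(A, B))               # union: 0 2 4 5
intersect(A, B)                 # intersection: 4
setdiff(omega, A)               # complement of A: 0 1 2 3
```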

Probability

Probability is a number assigned to each event that is intended to measure the chance of its occurrence in a random experiment.

The probability distribution is a function that maps events to real numbers.

Probability Axioms

The theory of probability is founded on 3 axioms.

  1. \( 0 \leq P(A) \leq 1~~;A\subseteq \Omega \) (The probability of any event lies between 0 and 1)

  2. \( P(\Omega)=1 \) (The probability of the whole sample space is 1)

  3. \( P(A \cup B)=P(A)+P(B)~~;A \cap B=\emptyset \) (The probability of the union of disjoint events is the sum of their probabilities)

Addition Rule of Probabilities

\( P(A \cup B) = P(A) + P(B) - P(A\cap B) \)

As a consequence of the axioms, if \( A \cap B \) is the empty set, i.e. the events are disjoint, then \( P(A \cap B)=0 \) and the addition rule reduces to axiom 3.
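
The addition rule can be checked numerically on the sample space of Example #1. The sketch assumes a binomial model with 5 subjects and probability 0.5 for the outcomes; event \( B \) and the helper function `P` are hypothetical names used only here.

```r
omega <- 0:5
p <- dbinom(omega, size = 5, prob = 0.5)   # assumed probability of each outcome

# Hypothetical helper: probability of an event given as a set of outcomes
P <- function(event) sum(p[omega %in% event])

A <- c(4, 5)       # more than 3 subjects with adverse events
B <- c(0, 2, 4)    # hypothetical event: an even number of subjects

P(omega)                              # axiom 2: equals 1
P(union(A, B))                        # left-hand side of the addition rule
P(A) + P(B) - P(intersect(A, B))      # right-hand side, same value
```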

How to assign probabilities to events?

Two schools of thought:

  • Frequentist (Classical)

  • Bayesian

Probability Distribution (Example #1)

Assuming

1) Occurrence of an adverse event in a subject does not depend on occurrence in other subjects;

2) Probability of an adverse event is 50%

In 5 HIV+ subjects, the number of subjects with adverse events then follows a binomial distribution with \( n=5 \) and \( p=0.5 \).

# Binomial Distribution
barplot(dbinom(0:5,5,0.5))
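
Under the same two assumptions, the probability of the event \( A = \{4,5\} \) ("more than 3 subjects have adverse events") can be read off the binomial distribution. A minimal sketch:

```r
# Probabilities of 0, 1, ..., 5 subjects with adverse events
probs <- dbinom(0:5, size = 5, prob = 0.5)
sum(probs)                                          # 1

# P(A) for A = {4, 5}, computed in two equivalent ways
p.A  <- sum(dbinom(4:5, size = 5, prob = 0.5))      # 0.1875
p.A2 <- pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)
c(p.A, p.A2)
```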

Probability Distribution (Example #2)

Assuming:

1) log2 expressions of IL-17 are normally distributed;

2) Average log2 expression in the population is 7;

3) Standard deviation for the log2 expressions in the population is 2.5

# Normal Distribution
# curve() evaluates the expression over its own grid of x values in [0, 16],
# so no separate x vector is needed
curve(dnorm(x,7,2.5),0,16)
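
The event in Example #2 concerns the average of 10 patients, not a single measurement. Under the three assumptions above, and additionally assuming the 10 measurements are independent, that average is normal with mean 7 and standard deviation \( 2.5/\sqrt{10} \). The sketch computes the probability of \( A = (7,+\infty) \); the threshold 8 in the last line is a hypothetical value added for comparison.

```r
mu    <- 7       # population mean of the log2 expression
sigma <- 2.5     # population standard deviation
n     <- 10      # number of patients averaged

se <- sigma / sqrt(n)   # standard deviation of the sample mean

# P(average > 7): the event A = (7, +Inf) of Example #2
p.A <- pnorm(7, mean = mu, sd = se, lower.tail = FALSE)   # 0.5

# P(average > 8), a hypothetical threshold for comparison
p.8 <- pnorm(8, mean = mu, sd = se, lower.tail = FALSE)

c(p.A, p.8)
```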

Conditional Probability

If the occurrence of event \( B \) modifies the sample space, the probability of occurrence of event \( A \) is updated according to the following law (defined when \( P(B) > 0 \)):

\( P(A|B) = \frac{P(A \cap B)}{P(B)} \)

If \( A \) is independent of \( B \), then \( P(A|B)=P(A) \).
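
A minimal sketch of this definition, reusing the binomial model of Example #1 (5 subjects, probability 0.5). Event \( B \) ("at least 3 subjects have adverse events") and the helper function `P` are hypothetical and introduced only for illustration.

```r
omega <- 0:5
p <- dbinom(omega, size = 5, prob = 0.5)   # assumed probability of each outcome

# Hypothetical helper: probability of an event given as a set of outcomes
P <- function(event) sum(p[omega %in% event])

A <- c(4, 5)       # more than 3 subjects with adverse events
B <- c(3, 4, 5)    # hypothetical conditioning event: at least 3 subjects

P.A.given.B <- P(intersect(A, B)) / P(B)   # 0.375
P(A)                                       # 0.1875
```

Since \( P(A|B) \neq P(A) \) here, the two events are not independent: knowing that at least 3 subjects had adverse events changes the chance that more than 3 did.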

Diagnostic Test and Conditional Probability

Assuming that a diagnostic test is a composition of two random experiments:

a) Observe the result of a diagnostic test (\( \Omega_1 =\{T+,T-\} \))

b) Observe the result of a gold standard test (\( \Omega_2 =\{D+,D-\} \))

the sample space for the random experiment that results from this composition has four elements :

\( \Omega = \Omega_1 \times \Omega_2 = \{(T+,D+),(T+,D-),(T-,D+),(T-,D-)\} \)

Note: \( (T+,D+) = T+ \cap D+ \). Each element of the sample space is an intersection of two events.

Based on a frequentist approach, the probabilities of events in this random experiment can be estimated from a 2x2 table with the frequency of occurrences in each cell (e.g. \( a \) is the number of occurrences of \( T+ \cap D+ \)).

      D+   D-
T+    a    b
T-    c    d

\( n = a+b+c+d \): number of experiment runs

sensitivity : \( P(T+|D+) = \frac{a}{a+c} \)

specificity : \( P(T-|D-) = \frac{d}{b+d} \)

false positive rate : \( P(T+|D-) = \frac{b}{b+d} \)

false negative rate : \( P(T-|D+) = \frac{c}{a+c} \)
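
These four conditional probabilities can be estimated from a 2x2 table of counts, with rows for the test result and columns for the gold standard. The counts in this sketch are hypothetical.

```r
# Hypothetical counts: rows = test result, columns = gold standard
tab <- matrix(c(90,  30,
                10, 870),
              nrow = 2, byrow = TRUE,
              dimnames = list(test = c("T+", "T-"),
                              disease = c("D+", "D-")))

sensitivity <- tab["T+", "D+"] / sum(tab[, "D+"])   # P(T+ | D+)
specificity <- tab["T-", "D-"] / sum(tab[, "D-"])   # P(T- | D-)
fpr         <- tab["T+", "D-"] / sum(tab[, "D-"])   # P(T+ | D-)
fnr         <- tab["T-", "D+"] / sum(tab[, "D+"])   # P(T- | D+)

c(sensitivity = sensitivity, specificity = specificity, fpr = fpr, fnr = fnr)
```

Within each column the two rates are complementary: sensitivity plus false negative rate equals 1, and specificity plus false positive rate equals 1.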

Bayes Theorem

The Bayes Theorem is a consequence of the conditional probability definition.

Consider two events \( A \) and \( B \)

\( P(B|A)=\frac{P(A|B)P(B)}{P(A)} \)
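
Applied to a diagnostic test with \( A = T+ \) and \( B = D+ \), the theorem gives the probability of disease given a positive test. The sketch below uses hypothetical values for sensitivity, specificity and prevalence, and obtains \( P(T+) \) from the law of total probability, \( P(T+) = P(T+|D+)P(D+) + P(T+|D-)P(D-) \).

```r
# Hypothetical inputs
sens <- 0.90    # P(T+ | D+)
spec <- 0.95    # P(T- | D-)
prev <- 0.01    # P(D+), prevalence of the disease

# P(T+) by the law of total probability
p.pos <- sens * prev + (1 - spec) * (1 - prev)

# Bayes Theorem: P(D+ | T+) = P(T+ | D+) P(D+) / P(T+)
p.d.given.pos <- sens * prev / p.pos
p.d.given.pos
```

With a low prevalence, most positive results come from the much larger disease-free group, so \( P(D+|T+) \) stays well below the sensitivity.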