Joel Correa da Rosa
December 21st 2016
In the “central dogma” of statistics, conclusions are drawn by studying samples that are supposed to represent a universe.
Population (Universe) : Collection of all entities of interest (not necessarily people).
Sample : a subset of the population.
To make inference, the sample ideally should represent the population.
Sampling is necessary because:
The benefits of stratified random sampling are: a) the cost per observation in the survey may be reduced; b) estimates of the population parameters can be obtained for each subpopulation; c) accuracy may be increased at a given cost.
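A minimal sketch of stratified random sampling in base R is given below. The population, its three strata (`A`, `B`, `C`), their sizes, and the 10% sampling fraction are all hypothetical and chosen only for illustration.

```r
set.seed(1)

# Hypothetical population: 1000 units split into three strata (assumed sizes)
population <- data.frame(
  id      = 1:1000,
  stratum = rep(c("A", "B", "C"), times = c(500, 300, 200))
)

# Draw a simple random sample of 10% of the units within each stratum
ids.by.stratum <- split(population$id, population$stratum)
sampled.ids <- unlist(lapply(ids.by.stratum, function(ids) {
  sample(ids, size = round(0.1 * length(ids)))
}))

strat.sample <- population[population$id %in% sampled.ids, ]

# Number of sampled units per stratum
table(strat.sample$stratum)
```

Because the sample is drawn separately within each stratum, every subpopulation is guaranteed to be represented, which is what makes per-stratum estimates possible.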
The concepts of random sampling and randomization are different. Random sampling is related to external validity (generalizability), whereas randomization (e.g. random assignment to treatments) is related to study design and internal validity.
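The distinction can be sketched in a few lines of R. The population of 500 subject IDs, the sample size of 20, and the two arm labels are hypothetical.

```r
set.seed(2016)

# Hypothetical population of 500 subject IDs
pop.ids <- 1:500

# Random sampling: decide WHICH subjects enter the study (external validity)
study.ids <- sample(pop.ids, 20)

# Randomization: decide WHICH TREATMENT each sampled subject receives
# (internal validity); 10 subjects per arm, in random order
arm <- sample(rep(c("treatment", "control"), each = 10))

assignment <- data.frame(id = study.ids, arm = arm)
table(assignment$arm)
```

The two calls to `sample()` answer different questions: the first selects units from the population, the second allocates the selected units to treatments.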
A (numeric) characteristic of the population, fixed and usually unknown.
Examples :
A numerical characteristic of the sample that is used to estimate the unknown parameters in the population
Examples :
The foundation of inference is the study of how sample statistics behave under chance.
A sample statistic changes from one sample to another. The frequency distribution of the possible values of a sample statistic is called its sampling distribution. Let's see a toy example.
# population
pop<-c(4,8,10,15,25,46)
# population mean
mean(pop)
[1] 18
# population variance
var(pop)*(5/6)
[1] 200.3333
# population standard deviation
sqrt(var(pop)*(5/6))
[1] 14.15392
# All possible samples of size n=3
combn(pop,3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] 4 4 4 4 4 4 4 4 4 4 8 8 8
[2,] 8 8 8 8 10 10 10 15 15 25 10 10 10
[3,] 10 15 25 46 15 25 46 25 46 46 15 25 46
[,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] 8 8 8 10 10 10 15
[2,] 15 15 25 15 15 25 25
[3,] 25 46 46 25 46 46 46
xbars<-apply(combn(pop,3),2,mean)
xbars
[1] 7.333333 9.000000 12.333333 19.333333 9.666667 13.000000 20.000000
[8] 14.666667 21.666667 25.000000 11.000000 14.333333 21.333333 16.000000
[15] 23.000000 26.333333 16.666667 23.666667 27.000000 28.666667
hist(xbars)
What is the mean of \( \bar{X} \) when \( n=3 \)?
mean(xbars)
[1] 18
What is the standard deviation of \( \bar{X} \) when \( n=3 \)?
sd(xbars)
[1] 6.494262
In random sampling with replacement (or from an infinite population)
mean(\( \bar X \))=\( \mu \)
sd(\( \bar X \))=\( \frac{\sigma}{\sqrt{n}} \)
In random sampling without replacement from a finite population of size \( N \)
mean(\( \bar X \))=\( \mu \)
sd(\( \bar X \))=\( \sqrt{\frac{N-n}{N-1}}\frac{\sigma}{\sqrt{n}} \)
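As a quick check on the toy population above, the sketch below compares the exact spread of the 20 sample means with \( \frac{\sigma}{\sqrt{n}} \) and with the finite-population value \( \sqrt{\frac{N-n}{N-1}}\frac{\sigma}{\sqrt{n}} \). The names `sd.exact`, `sd.infinite` and `sd.fpc` are illustrative. Note that `sd(xbars)` divides by 19 rather than 20, so it slightly overstates the standard deviation of the sampling distribution.

```r
pop <- c(4, 8, 10, 15, 25, 46)
N <- length(pop)
n <- 3

# Population standard deviation (denominator N)
sigma <- sqrt(mean((pop - mean(pop))^2))

# Means of all choose(6, 3) = 20 samples drawn without replacement
xbars <- colMeans(combn(pop, n))

# Exact standard deviation of the sampling distribution (denominator 20)
sd.exact <- sqrt(mean((xbars - mean(xbars))^2))

# Formula ignoring the finite population
sd.infinite <- sigma / sqrt(n)

# Formula with the finite population correction
sd.fpc <- sqrt((N - n) / (N - 1)) * sigma / sqrt(n)

c(exact = sd.exact, infinite = sd.infinite, fpc = sd.fpc)
```

The exact value (about 6.33) agrees with the corrected formula, while \( \frac{\sigma}{\sqrt{n}} \) alone (about 8.17) is too large for sampling without replacement from such a small population.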
The population mean is unknown but lies within the range \( (7,15) \). The population standard deviation is 2.5. The population is infinite and the measurements are normally distributed.
Let's draw a sample \( n=10 \) and discuss some inferences about the mean.
# true mean
set.seed(123)
true.mean <- runif(1,7,15)
# draw a sample n=10
my.sample<-rnorm(10,true.mean,2.5)
my.sample
[1] 11.302006 12.276137 5.076731 12.399360 9.028205 9.007515 9.758327
[8] 12.502007 4.982444 13.526081
# summary of the sample
mean(my.sample)
[1] 9.985881
sd(my.sample)
[1] 3.031804
Is the true mean greater than 8 ? (hypothesis test)
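One way to approach this question, sketched here under the stated normality assumption, is a one-sided one-sample t test of \( H_0: \mu = 8 \) against \( H_1: \mu > 8 \). Since the population standard deviation is given as 2.5, a z test is also shown. The object names `tt`, `z` and `p.z` are illustrative.

```r
# Recreate the sample drawn above
set.seed(123)
true.mean <- runif(1, 7, 15)
my.sample <- rnorm(10, true.mean, 2.5)

# One-sided t test: sigma treated as unknown and estimated by sd(my.sample)
tt <- t.test(my.sample, mu = 8, alternative = "greater")
tt$statistic
tt$p.value

# One-sided z test: uses the known population standard deviation 2.5
z   <- (mean(my.sample) - 8) / (2.5 / sqrt(10))
p.z <- pnorm(z, lower.tail = FALSE)
p.z
```

A small p-value would count as evidence against \( \mu = 8 \) in favour of \( \mu > 8 \); the choice of significance level is left to the analyst.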
Aristotle: “The probable is what usually happens”
Cicero: “Probability is the very guide of life”
Democritus: “Everything existing in the universe is the fruit of chance”
The cause of uncertainty is linked to randomness.
When sampling from a population, randomness is present.
In our everyday lives, almost everything is a random experiment.
An experiment whose outcome cannot be predicted with certainty before the experiment is run.
If the same results are obtained when an experiment is repeated under the same conditions, the experiment is deterministic.
Probability and Statistics are the branches of mathematics that have been developed to deal with random experiments.
The set of all possible outcomes of the Random Experiment.
Example #1 : Consider the random experiment that will sample 5 HIV+ subjects and verify how many subjects have adverse events after vaccination.
The sample space is: \( \Omega = \{0,1,2,3,4,5\} \)
Example #2 : Consider the random experiment that will sample 10 psoriatic subjects and measure the average IL-17 gene expression in log2 scale in their skin tissue.
The sample space is: \( \Omega= (-\infty,+\infty) \)
An event is a subset of the sample space.
Example #1. Let's consider the following event: “More than 3 subjects have adverse events after vaccination”. This is a subset of the sample space \( A = \{4,5 \} \).
Example #2 : Define the event: “The average log2 expression for IL-17 in 10 psoriatic patients is greater than 7.” \( A = (7,+\infty) \).
Set operations are important for defining the events of interest and for better understanding the probability rules. Consider two events (subsets of the sample space) \( A \) and \( B \).
Set operation # 1: Union (\( A \cup B \))
Set operation # 2: Intersection (\( A \cap B \))
Set operation # 3: Complement (\( A^c \))
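The three operations map directly onto base R functions. The sketch uses the sample space of Example #1; event `B` ("an even number of subjects have adverse events") is a hypothetical event added for illustration.

```r
omega <- 0:5          # sample space of Example #1
A <- c(4, 5)          # more than 3 subjects have adverse events
B <- c(0, 2, 4)       # hypothetical: an even number of subjects

A.or.B  <- union(A, B)        # union: outcomes in A or B
A.and.B <- intersect(A, B)    # intersection: outcomes in both A and B
not.A   <- setdiff(omega, A)  # complement of A within omega

A.or.B
A.and.B
not.A
```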
Probability is a number assigned to each event that is intended to measure the chance of its occurrence in a random experiment.
The probability distribution is a function that maps events to real numbers.
The theory of probability is founded on 3 axioms.
\( 0 \leq P(A) \leq 1~~;A\subseteq \Omega \) (A probability always lies between 0 and 1)
\( P(\Omega)=1 \) (The probability of the whole sample space is 1)
\( P(A \cup B)=P(A)+P(B)~~;A \cap B=\emptyset \) (The probability of a union of disjoint events is the sum of their probabilities)
\( P(A \cup B) = P(A) + P(B) - P(A\cap B) \)
As a consequence of the axioms, if \( A \cap B \) is an empty set, i.e. there is no intersection, \( P(A \cap B)=0 \).
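The addition rule can be checked numerically on the sample space of Example #1. The sketch assumes, purely for illustration, that the number of adverse events is binomial with 5 trials and success probability 0.5; event `B` and the helper `prob()` are hypothetical.

```r
omega <- 0:5
# Assumed probabilities of each outcome (binomial, 5 trials, p = 0.5)
p <- dbinom(omega, size = 5, prob = 0.5)

# Probability of an event = sum of the probabilities of its outcomes
prob <- function(event) sum(p[omega %in% event])

A <- c(4, 5)      # more than 3 subjects have adverse events
B <- c(0, 2, 4)   # hypothetical: an even number of subjects

lhs <- prob(union(A, B))
rhs <- prob(A) + prob(B) - prob(intersect(A, B))

all.equal(lhs, rhs)
```

Subtracting \( P(A \cap B) \) removes the outcome 4, which would otherwise be counted twice.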
Two schools of thought:
Frequentist (Classical)
Bayesian
Assuming:
1) The occurrence of an adverse event in a subject does not depend on its occurrence in other subjects;
2) The probability of an adverse event is 50%
In 5 HIV+ subjects, does the number of subjects with adverse events follow the binomial distribution?
# Binomial Distribution
barplot(dbinom(0:5,5,0.5))
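Under the same two assumptions, the probability of the event "more than 3 subjects have adverse events" from Example #1 can be computed in two equivalent ways. The object names are illustrative.

```r
# P(X > 3) as an upper tail of the binomial distribution function
p.tail <- pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)

# The same probability as the sum of P(X = 4) and P(X = 5)
p.sum <- sum(dbinom(4:5, size = 5, prob = 0.5))

c(p.tail, p.sum)   # both equal 6/32 = 0.1875
```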
Assuming:
1) log2 expressions of IL-17 are normally distributed;
2) Average log2 expression in the population is 7;
3) Standard deviation for the log2 expressions in the population is 2.5
# Normal Distribution
x<-seq(0,16,0.01)
curve(dnorm(x,7,2.5),0,16)
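Under these three assumptions, probabilities of events such as the one in Example #2 follow from `pnorm()`. The sketch below also uses the fact that the mean of \( n=10 \) independent measurements has standard deviation \( \frac{\sigma}{\sqrt{n}} \); the threshold of 10 for a single measurement is a hypothetical value.

```r
mu    <- 7
sigma <- 2.5
n     <- 10

# P(a single log2 expression exceeds 10), hypothetical threshold
p.single <- pnorm(10, mean = mu, sd = sigma, lower.tail = FALSE)

# P(the average of 10 patients exceeds 7), the event of Example #2
p.avg <- pnorm(7, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)

c(single = p.single, average = p.avg)
```

Because the assumed population mean is exactly 7, the event of Example #2 has probability 0.5 whatever the sample size.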
If the occurrence of event \( B \) modifies the sample space, the probability of occurrence of event \( A \) is updated according to the following law:
\( P(A|B) = \frac{P(A \cap B)}{P(B)} \)
If \( A \) is independent of \( B \), \( P(A|B)=P(A) \)
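A small numerical illustration, again assuming a binomial model with 5 subjects and probability 0.5: let \( A \) be "more than 3 subjects have adverse events" and let \( B \) be the hypothetical event "at least one subject has an adverse event".

```r
omega <- 0:5
p <- dbinom(omega, size = 5, prob = 0.5)   # assumed outcome probabilities

A <- 4:5   # more than 3 subjects have adverse events
B <- 1:5   # hypothetical: at least one subject has an adverse event

p.A         <- sum(p[omega %in% A])
p.B         <- sum(p[omega %in% B])
p.A.and.B   <- sum(p[omega %in% intersect(A, B)])

# Conditional probability P(A | B) = P(A and B) / P(B)
p.A.given.B <- p.A.and.B / p.B

c(unconditional = p.A, conditional = p.A.given.B)
```

Here \( P(A|B) = 6/31 \) differs from \( P(A) = 6/32 \), so \( A \) and \( B \) are not independent.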
Assuming that a diagnostic test is a composition of two random experiments:
a) Observe the result of a diagnostic test (\( \Omega_1 =\{T+,T-\} \))
b) Observe the result of a gold standard test (\( \Omega_2 =\{D+,D-\} \))
the sample space for the random experiment that results from this composition has four elements :
\( \Omega = \Omega_1 \times \Omega_2 = \{(T+,D+),(T+,D-),(T-,D+),(T-,D-)\} \)
Note: \( (T+,D+) = T+ \cap D+ \). Each element in the sample space is an intersection of two events.
Based on a frequentist approach, the probabilities for events in this random experiment can be calculated from a 2x2 table with the frequencies of occurrence in each cell (e.g. \( a \) is the number of occurrences of \( T+ \cap D+ \)).
| | D+ | D- |
|---|---|---|
| T+ | a | b |
| T- | c | d |
\( n = a+b+c+d \): number of experiment runs
sensitivity : \( P(T+|D+) \)
specificity : \( P(T-|D-) \)
false positive rate : \( P(T+|D-) \)
false negative rate : \( P(T-|D+) \)
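In terms of the cell counts of the 2x2 table, these four conditional probabilities are estimated by column proportions. The counts below are hypothetical and serve only to make the sketch runnable.

```r
# Hypothetical cell counts (rows: test result, columns: gold standard)
a <- 45    # T+ and D+
b <- 10    # T+ and D-
c <- 5     # T- and D+
d <- 140   # T- and D-

sensitivity <- a / (a + c)   # P(T+ | D+)
specificity <- d / (b + d)   # P(T- | D-)
false.pos   <- b / (b + d)   # P(T+ | D-) = 1 - specificity
false.neg   <- c / (a + c)   # P(T- | D+) = 1 - sensitivity

round(c(sensitivity = sensitivity, specificity = specificity,
        false.pos = false.pos, false.neg = false.neg), 3)
```

Each quantity conditions on a column of the table (the gold standard result), so the denominators are the column totals \( a+c \) and \( b+d \).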
Bayes' theorem is a consequence of the definition of conditional probability.
Consider two events \( A \) and \( B \)
\( P(B|A)=\frac{P(A|B)P(B)}{P(A)} \)
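Applied to the diagnostic test setting with \( A = T+ \) and \( B = D+ \), the theorem gives the probability of disease given a positive test. The sensitivity, specificity and prevalence values below are hypothetical, and \( P(T+) \) is obtained by summing over the two disjoint cases \( D+ \) and \( D- \).

```r
# Hypothetical inputs
sensitivity <- 0.90   # P(T+ | D+)
specificity <- 0.95   # P(T- | D-)
prevalence  <- 0.01   # P(D+)

# P(T+) = P(T+ | D+) P(D+) + P(T+ | D-) P(D-)
p.test.pos <- sensitivity * prevalence +
  (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(D+ | T+) = P(T+ | D+) P(D+) / P(T+)
p.disease.given.pos <- sensitivity * prevalence / p.test.pos

p.disease.given.pos
```

With a low assumed prevalence, the probability of disease after a positive test stays far below the sensitivity, because most positive results come from the much larger disease-free group.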