We ended with an example of the probability of non-mutually exclusive sets
Non-mutually exclusive sets
Mutually exclusive sets: No elements in common
Non-mutually exclusive sets: Sets with elements in common
The probability for the union of mutually exclusive sets was the sum of their probabilities:
P(A \(\cup\) B) = P(A) + P(B)
With non-mutually exclusive sets this results in double counting of the shared elements, so… what can happen?
S = {1,2,3,4,5,6,7,8,9,10}
A = {1,2,3,4,5,6}
B = {5,6,7,8,9,10}
With this example, P(A) + P(B) = 6/10 + 6/10 = 12/10, so simple addition could lead to P > 1
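A minimal R sketch of the double counting, using the sets above (the object names are just for illustration):

S <- 1:10
A <- 1:6
B <- 5:10

# Naive addition counts the shared elements {5, 6} twice
(length(A) + length(B)) / length(S)   # 1.2, an impossible probability

# The union only contains 10 distinct elements, so the true probability is 1
length(union(A, B)) / length(S)       # 1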
Another example
S = {C,D,E,F,G,H,W,Z} (8 possibilities)
A = {C,D,E,W,Z} (5 favorable)
B = {D,E,F,G,H} (5 favorable)
P(A) = 5/8 and P(B) = 5/8
Simple addition gives P(A) + P(B) = 10/8, which violates the rule that probabilities range from 0 to 1
{D,E} got counted twice
{D,E} is A \(\cap\) B
Solution
So for non-mutually exclusive sets, our formula is:
P(A \(\cup\) B) = P(A) + P(B) - P(A \(\cap\) B)
Note that this formula also works for mutually exclusive sets, because their intersection is the empty set, which has P = 0
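A minimal R sketch of the corrected formula, reusing the sets from the example above:

S <- c("C","D","E","F","G","H","W","Z")
A <- c("C","D","E","W","Z")
B <- c("D","E","F","G","H")

p.A  <- length(A) / length(S)                   # 5/8
p.B  <- length(B) / length(S)                   # 5/8
p.AB <- length(intersect(A, B)) / length(S)     # {D,E} -> 2/8

# P(A union B) = P(A) + P(B) - P(A intersect B) = 8/8 = 1
p.A + p.B - p.AB

# Matches counting the union directly
length(union(A, B)) / length(S)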
Independence
For now, problems will assume independence unless explicitly specified
Non-independent (conditional) events are covered by Bayes Rule
You don’t have to worry about Bayes Rule for now except to understand that it exists and applies to non-independent events aka conditional events
Independent events are unrelated
If learning the probability of one event (A) doesn’t affect the probability of the other event (B), they are independent
P(A \(\cap\) B) = P(A)P(B) for independent events
The probability rules for non-independent events, called conditional probability, are different
Example: If I draw a number from a hat with the numbers 1 to 5, then flip a coin, the outcome of the draw doesn’t affect the outcome of the coin flip.
H = {1,2,3,4,5}
C = {head,tail}
What is P(1 \(\cap\) tail)?
Possible Test Question
What if on a short answer test question, I ask: “Event A and Event B are not independent. How would you determine the conditional probability of event A given event B?” What would you answer?
Why does the hat draw-coin flip example work?
If we create a set O of all possible outcomes, it looks like this:
O = {1+head,2+head,3+head,4+head,5+head,1+tail,2+tail,3+tail,4+tail,5+tail}
(1 \(\cap\) tail) or {1+tail} is one event
P(1 \(\cap\) tail) = 1 favorable / 10 possible = 1/10
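The same count can be reproduced in R by enumerating the sample space; a minimal sketch (the object names are just for illustration):

H <- 1:5
C <- c("head", "tail")

# All 10 equally likely draw-and-flip outcomes
O <- expand.grid(draw = H, coin = C)
nrow(O)                                 # 10

# P(1 and tail): 1 favorable outcome out of 10
mean(O$draw == 1 & O$coin == "tail")    # 0.1

# Matches the multiplication rule for independent events
(1/5) * (1/2)                           # 0.1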
Back to the test question
Answers:
I would apply Bayes Rule.
OR
I would construct a sample space and determine the probabilities with set theory. (Note: This is what Bayes Rule actually does.)
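For reference, the definition behind that answer is P(A | B) = P(A \(\cap\) B) / P(B). Here is a minimal R sketch with made-up, non-independent events (the sets are hypothetical, chosen only to illustrate the calculation):

S <- 1:10
A <- c(2, 4, 6, 8, 10)   # hypothetical event: draw an even number
B <- 6:10                # hypothetical event: draw a number greater than 5

p.B  <- length(B) / length(S)                  # 0.5
p.AB <- length(intersect(A, B)) / length(S)    # {6, 8, 10} -> 0.3

# Conditional probability: P(A | B) = P(A and B) / P(B)
p.AB / p.B               # 0.6

# Not equal to P(A) = 0.5, so A and B are not independent here
length(A) / length(S)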
Why do we use data?
Purpose: analyzing data for causal inference (to begin to make statements about cause and effect - inferring causes)
Complex and uncertain data requires that we make…
Assumptions about the data
Because the world is complex, to make sense of unknowns we make assumptions about data
The assumptions are useful approximations even when not precisely true
We still need to check that the real data does not seriously violate the assumptions
Data Assumptions: Random, Independent, and Identically Distributed
Randomness and independence matter as assumptions about data
Specifically, these are assumptions about the Data Generating Process or DGP
The Data Generating Process: the way the world produces the data
The Data Generating Process
The source of the data matters - the DGP matters
Previously stated: Data comes from a random world
So the DGP is random
Independence and Distribution
Events in the data are independent and identically distributed - the IID assumption
Independence is statistical independence - the outcome of one event does not affect our belief about the probability of another event
The hat draw does not affect the coin toss
X does not affect Y
If X does affect Y, we may begin to infer some direct or indirect causal relationship in some direction, possibly running through one or more additional variables, but not necessarily that X causes Y. This is commonly shortened to the not-quite-accurate summary “correlation does not imply causation.”
Identically distributed: drawn from the same probability distribution
So…
Introduction to distributions
R has functions for at least 20 distributions
The most important is the normal distribution
This is because of the central limit theorem
We will look at these in the most detail: normal, binomial, uniform, Poisson
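For each distribution, R uses a consistent naming scheme: d for the density, p for the cumulative probability, q for the quantile, and r for random draws. For example, with the normal distribution:

dnorm(0)       # density at x = 0
pnorm(1.96)    # P(X <= 1.96), about 0.975
qnorm(0.975)   # the value with 97.5% of the distribution below it, about 1.96
rnorm(5)       # 5 random draws from a standard normal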
Distribution examples
The following are histograms
They represent the frequency, or simply the count, of observations for each value
For example, if the value 4 shows 500, it means that 4 came up 500 times in the data
The graphs were produced by generating random numbers based on the particular distribution with an R function
Uniform distribution
All outcomes are equally likely
Uniform distribution: with code
rand.unif <- runif(100000, min = 0, max = 10)
hist(rand.unif, breaks = 20, freq = TRUE,
     main = "uniform distribution of 100,000 random draws", xlab = 'x', col = "red")
Normal Distribution
symmetrical around its mean with most values near the central peak
width is a function of the standard deviation
Other names: Gaussian distribution, bell curve
Normal Distribution: with code
rand.norm <- rnorm(100000)
hist(rand.norm, breaks = 200, freq = TRUE,
     main = "normal distribution, sd = 1, 100,000 random draws", xlab = 'x', col = "red")
Binomial Distribution
binary
success/failure
yes/no
distribution for a number of Bernoulli trials
Binomial example: with code
n = 1 makes this a Bernoulli distribution
rand.binom <- rbinom(100000, 1, .5)
hist(rand.binom, breaks = 200, freq = TRUE,
     main = "binomial distribution, p = .5, 1 trial, 100,000 draws", xlab = 'x', col = "red")
Binomial example: with code
trials = 25
rand.binom2 <- rbinom(100000, 25, .5)
hist(rand.binom2, breaks = 200, freq = TRUE,
     main = "binomial distribution, p = .5, 25 trials, 100,000 draws", xlab = 'x', col = "red")
Preview of the Central Limit Theorem
What happens if we do the same thing as above, but with 1,000 trials instead of 25, and plot the counts?
Preview of the Central Limit Theorem: code
rand.binom3 <- rbinom(100000, 1000, .5)
hist(rand.binom3, breaks = 200, freq = TRUE,
     main = "Histogram of binomial distribution, p = .5, 1000 trials, 100,000 draws", xlab = 'x', col = "red")
Preview of the Central Limit Theorem
For sufficiently large sample sizes, the distribution of sample means approximates a normal distribution
This means with a large enough number of trials, we can apply the normal distribution to know things about measures of central tendency, measures of dispersion, and probabilities
A common rule of thumb: “sufficiently large” means sample sizes above 30
This is just a preview
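A minimal R sketch of the idea: take repeated samples from a non-normal (uniform) distribution, and the histogram of the sample means is already roughly normal:

# 1,000 sample means, each from a sample of 30 uniform draws
sample.means <- replicate(1000, mean(runif(30, min = 0, max = 10)))
hist(sample.means, breaks = 30, freq = TRUE,
     main = "sample means of uniform draws, n = 30", xlab = 'x', col = "red")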
68-95-99.7 Rule
One of the rules for normal distributions is:
The 68-95-99.7 rule
68% of the data is within 1 standard deviation of the mean
95% of the data is within 2 standard deviations of the mean
99.7% of the data is within 3 standard deviations of the mean
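A quick empirical check of the rule in R, using simulated standard normal draws:

z <- rnorm(100000)
mean(abs(z) < 1)   # about 0.68
mean(abs(z) < 2)   # about 0.95
mean(abs(z) < 3)   # about 0.997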
Preview of the Law of Large Numbers
The law of large numbers tells us that if we repeat an experiment a large number of times, the average of the results will be close to the expected value
This allows us to treat the actual mean of a large sample as an estimate of the expected mean of the population
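A minimal R sketch: the running average of repeated fair coin flips drifts toward the expected value of 0.5 as the number of flips grows:

flips <- rbinom(10000, 1, .5)              # 10,000 fair coin flips (1 = head)
running.mean <- cumsum(flips) / seq_along(flips)
running.mean[c(10, 100, 1000, 10000)]      # moves closer and closer to 0.5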
Poisson distribution
Count of number of events in a fixed time/space
Known constant mean rate (we know how often events occur on average)
Independent of time since last event
Poisson distribution: with code
rand.poiss <- rpois(100000, 1)
hist(rand.poiss, breaks = 200, freq = TRUE,
     main = "poisson distribution, lambda = 1, 100,000 draws", xlab = 'x', col = "red")
Why we can’t use standard OLS regression for other DGP
We base the likelihood of something being significant on the proximity to the mean
As things get further from the mean in a normal distribution, they become less likely
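For example, the normal density drops off quickly as you move away from the mean, which is what makes distance from the mean informative:

dnorm(0)   # density at the mean, about 0.40
dnorm(1)   # 1 sd away, about 0.24
dnorm(2)   # 2 sd away, about 0.05
dnorm(3)   # 3 sd away, about 0.004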