Probability theory grew out of the study of games of chance, and it has been around for centuries. Famous mathematicians like Cardano, Fermat, and Pascal spent an incredible amount of time working on it.
Probability theory is not only useful in casinos and bets; it’s also indispensable in any situation that depends on data affected by chance in some way.
Knowledge of probability is essential to data science.
Probability can be as straightforward as rolling two dice and getting a sum of 7. There’s a 1/6 chance of this happening. But what about elections? Election forecaster Nate Silver gave Obama a 94% chance of winning in 2008, then 90% in 2012. Obama won both times, in line with the forecasts. However, in 2016, Silver gave Hillary Clinton a 71% chance of winning, and she lost. These forecasts raise essential questions that are tackled in this section: How are these probabilities calculated? What is being used to drive these forecasts?
We’ll cover election forecasting in the next module. We’ll also cover statistical inference which builds upon probability theory.
In this module, we will analyze the circumstances surrounding the financial crisis of 2007 to 2008. Part of what happened was the underestimation of the risk of securities that financial companies sold. Specifically, the risk of mortgage-backed securities and collateralized debt obligations (CDOs) was grossly underestimated.
The risk was assumed to be low, meaning the financial companies believed that homeowners would make their monthly payments.
Because many homeowners defaulted in 2007 and 2008, the prices of these securities crashed. The banks lost so much money that they needed government bailouts to avoid closing down completely.
To understand this very complicated situation, we’ll first learn the basics of probability, covered by these topics:
* random variables
* independence
* Monte Carlo simulation
* expected values
* standard errors
* margins of error
* central limit theorem
The probability of categorical data is called discrete probability.
We will discuss the mathematical definition of probability to get precise answers to specific questions.
A more tangible way to think about the probability of an event is as a proportion of times the event occurs when we repeat the experiment over and over independently and under the same conditions.
Computers provide a way to actually perform these simple random experiments. Before computers, we would have needed a physical setup, like colored beads in a vase, and picked from it at random.
Random number generators permit us to mimic the process of picking at random.
- An example in R is the sample() function.
beads <- rep( c("red", "blue"), times = c(2,3))
beads
## [1] "red" "red" "blue" "blue" "blue"
If you type sample( beads, 1), you will get one random sample.
sample( beads, 1)
## [1] "red"
We want to repeat this over and over.
- Since we cannot do this forever, we repeat the experiment a large enough number of times that the results are practically equivalent to doing it forever.
- This is a Monte Carlo simulation.
What is not covered here is a rigorous definition of “practically equivalent.” Instead, we will take a more practical approach to deciding how many repetitions are large enough.
The first example of a Monte Carlo simulation will use the replicate() function.
We’ll reenact drawing from the vase with 2 red and 3 blue beads and see what proportions we get.
B <- 10000
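events <- replicate(B, sample(beads, 1))
tab <- table(events)
tab
## events
## blue red
## (counts vary from run to run; they sum to 10000, with roughly 6000 blue and 4000 red)
prop.table(tab)
## events
## blue red
## (proportions vary from run to run; they should be close to 0.6 and 0.4)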
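If you take a bead out of the vase and draw again without putting it back, you are sampling without replacement. If you put the bead back into the vase before the next draw (so the number of beads stays the same), you are sampling with replacement.
- We want to make sure to do it with replacement. By default, sample() draws without replacement, so sample(beads, 5) just returns the five beads in a random order.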
sample(beads, 5)
## [1] "blue" "blue" "red" "red" "blue"
events <- sample(beads, B, replace = TRUE)
prop.table(table(events))
## events
## blue red
## 0.6057 0.3943
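How large is large enough? One rough way to get a feel for it is to watch the estimate settle down as B grows. The sketch below is not part of the original notes: the seed, the helper name estimate_blue, and the particular values of B are assumptions chosen only for illustration.

```r
# Sketch: estimate Pr(blue) for increasing numbers of repetitions B.
set.seed(1)  # assumed seed, used only to make the run reproducible
beads <- rep(c("red", "blue"), times = c(2, 3))

# Hypothetical helper: proportion of blue beads in B draws with replacement
estimate_blue <- function(B) {
  draws <- sample(beads, B, replace = TRUE)
  mean(draws == "blue")
}

B_values <- c(10, 100, 1000, 10000)
estimates <- sapply(B_values, estimate_blue)

# The estimates should drift toward the true value 0.6 as B increases
data.frame(B = B_values, estimate = estimates)
```

With small B the estimate can be far from 0.6; with B = 10000 it is usually within a percentage point or two.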
Defining a distribution for categorical outcomes is pretty straightforward.
1) Assign a probability to each category.
- For the beads in the vase, the proportion of each bead color defines the distribution.
In the next example, we’ll be using the four polling proportions.
Remember, categorical data makes it easy to define probability distributions.
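A distribution can also be handed to sample() directly through its prob argument, instead of building a vector with one entry per bead. This is a minimal sketch that assumes the bead proportions used so far (0.4 red, 0.6 blue); the seed and the variable names are illustrative only.

```r
# Sketch: sample from a categorical distribution defined by probabilities.
set.seed(1)  # assumed seed, only for reproducibility
categories <- c("red", "blue")
probs <- c(0.4, 0.6)  # one probability per category, summing to 1

B <- 10000
events <- sample(categories, B, replace = TRUE, prob = probs)

# The observed proportions should be close to the assigned probabilities
prop.table(table(events))
```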
Two events are independent of each other if the outcome of one does not affect the other.
A classic example of this is coin tossing.
- Every time we toss a fair coin, the probability of seeing heads is 1/2 regardless of what previous tosses have revealed. Pr(heads) = 0.5
- In our beads and vase example, draws made with replacement are independent: the probability of picking a red bead is 40% on every draw.
Events are not independent when the outcome of one affects the other, as happens when we draw without replacement.
If you take a blue bead and you don’t put it back, the likelihood of choosing a blue bead again will change.
If we use the sample() function to draw all five beads without replacement and assign the result to x, then looking at the second through fifth draws tells us, without any guessing, what the first bead was: it is the only one missing.
x <- sample(beads, 5)
x[2:5]
## [1] "blue" "red" "blue" "red"
When events are not independent, conditional probabilities are useful and necessary to make correct calculations.
Example: Probability of choosing a King if one king was previously chosen without replacement.
Pr(Card 2 is a King | Card 1 is a King) = 3/51
The vertical bar | means “given that” or “conditional on.”
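The 3/51 can be checked by brute-force counting. The sketch below enumerates every ordered pair of distinct cards and uses the standard identity Pr(B | A) = Pr(A and B) / Pr(A); the deck encoding and variable names are assumptions made for this illustration.

```r
# Sketch: exact conditional probability by enumerating two-card deals.
# Only ranks matter here, so the deck is 13 ranks repeated for 4 suits.
ranks <- rep(c("A", 2:10, "J", "Q", "K"), times = 4)

# All ordered pairs of card positions, dropping pairs that reuse a card
# (dealing without replacement)
idx <- expand.grid(first = 1:52, second = 1:52)
idx <- idx[idx$first != idx$second, ]

first_king  <- ranks[idx$first] == "K"
second_king <- ranks[idx$second] == "K"

# Pr(card 2 is a King | card 1 is a King)
cond <- sum(first_king & second_king) / sum(first_king)
c(enumerated = cond, formula = 3 / 51)
```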
When two events A and B are independent:
Pr(A | B) = Pr(A)
The probability of A given B is equal to the probability of A. It doesn’t matter what B is, the probability of A is unchanged.
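A quick simulation makes the difference visible. This is only a sketch: the seed, the value of B, and the helper name cond_red are assumptions. It estimates Pr(second bead is red | first bead is red) with and without replacement, using the vase of 2 red and 3 blue beads.

```r
# Sketch: Pr(second red | first red), with and without replacement.
set.seed(1)  # assumed seed, only for reproducibility
beads <- rep(c("red", "blue"), times = c(2, 3))
B <- 20000

# Hypothetical helper: draw B pairs and condition on the first bead being red
cond_red <- function(replace) {
  pairs <- replicate(B, sample(beads, 2, replace = replace))  # 2 x B matrix
  second_given_first_red <- pairs[2, pairs[1, ] == "red"]
  mean(second_given_first_red == "red")
}

with_repl    <- cond_red(TRUE)   # independent draws: near Pr(red) = 2/5
without_repl <- cond_red(FALSE)  # dependent draws: near 1/4
c(with_replacement = with_repl, without_replacement = without_repl)
```

With replacement the conditional probability stays near 0.4, matching Pr(A | B) = Pr(A). Without replacement it drops toward 1/4, because only one of the four remaining beads is red.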
If we want to know the probability of A and B occurring, we use the multiplication rule.
Pr(A and B) = Pr(A) * Pr(B | A)
- The probability of A and B is equal to the probability of A multiplied by the probability of B given that A already happened.
Example: In Blackjack, the best hand is two cards that add up to 21: an ace plus a face card or a 10. The cards are dealt without replacement.
- The probability of getting an ace as the first card is 4/52 = 1/13.
- The probability of getting a face card or a 10 after getting an ace is 16/51, since one card has already been taken out of the 52.
- The probability of both events happening is 1/13 * 16/51, or approximately 2.4%.
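The arithmetic can be written out directly with the multiplication rule. The variable names below are assumptions for this sketch; the counts come from a standard 52-card deck, where 16 cards (10, Jack, Queen, King in four suits) are worth ten.

```r
# Sketch: Pr(ace first, then a ten-valued card) via the multiplication rule.
p_ace           <- 4 / 52    # Pr(A): first card is an ace
p_ten_given_ace <- 16 / 51   # Pr(B | A): 16 ten-valued cards among the 51 left

p_ace_then_ten <- p_ace * p_ten_given_ace
p_ace_then_ten   # about 0.024
```

This covers only the order "ace first, ten-valued card second". The reverse order has the same probability, so a 21 in either order is twice as likely.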
The multiplication rule also applies to more than two events.
Pr(A and B and C) = Pr(A) * Pr(B | A) * Pr(C | A and B)
So the probability of A and B and C is equal to the probability of A, times the probability of B given that A happened, times the probability of C given that A and B happened.
If the events are independent, the calculation is very simple: just multiply their probabilities. But if you do that when the events are not independent, you can get very incorrect numbers.
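To see how far off the answer can be, here is an illustrative case that is not in the original notes: the probability of being dealt three Kings in a row without replacement, computed once with the correct conditional probabilities and once while wrongly treating the draws as independent. The variable names are assumptions.

```r
# Sketch: three Kings in a row from a 52-card deck, without replacement.

# Correct: Pr(A) * Pr(B | A) * Pr(C | A and B)
p_dependent <- (4 / 52) * (3 / 51) * (2 / 50)

# Incorrect here: pretends each draw has the same 4/52 chance
p_independent_wrong <- (4 / 52)^3

c(correct = p_dependent,
  wrongly_independent = p_independent_wrong,
  ratio = p_independent_wrong / p_dependent)
```

In this case the independence shortcut overstates the probability by a factor of about 2.5.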
One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls.
What is the probability that the ball will be cyan?
cyan <- 3
magenta <- 5
yellow <- 7
p1 <- cyan / (cyan + magenta + yellow)
p1
## [1] 0.2
One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls.
What is the probability that the ball will not be cyan?
p2 <- 1 - p1
p2
## [1] 0.8
Instead of taking just one draw, consider taking two draws. You take the second draw without returning the first draw to the box. We call this sampling without replacement.
What is the probability that the first draw is cyan and that the second draw is not cyan?
cyan <- 3
magenta <- 5
yellow <- 7
# The variable `p1` is the probability of choosing a cyan ball from the box on the first draw.
p1 <- cyan / (cyan + magenta + yellow)
# Assign a variable `p2` as the probability of not choosing a cyan ball on the second draw without replacement.
p2 <- 1 - (cyan - 1) / (cyan + magenta + yellow - 1)
# Calculate the probability that the first draw is cyan and the second draw is not cyan.
p1 * p2
## [1] 0.1714286
Now repeat the experiment, but this time, after taking the first draw and recording the color, return it back to the box and shake the box. We call this sampling with replacement.
What is the probability that the first draw is cyan and that the second draw is not cyan?
cyan <- 3
magenta <- 5
yellow <- 7
# The variable `p1` is the probability of choosing a cyan ball from the box on the first draw.
p1 <- cyan / (cyan + magenta + yellow)
# Assign a variable `p2` as the probability of not choosing a cyan ball on the second draw with replacement.
# With replacement, the box still holds all 15 balls, 3 of them cyan.
p2 <- 1 - cyan / (cyan + magenta + yellow)
# Calculate the probability that the first draw is cyan and the second draw is not cyan.
p1 * p2
## [1] 0.16
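Both two-draw answers can be sanity-checked with a Monte Carlo simulation. This is a sketch under stated assumptions: the seed, B, and the helper name cyan_then_not are made up for illustration. The exact values it compares against are 3/15 * 12/14 without replacement and 3/15 * 12/15 with replacement.

```r
# Sketch: simulate "first draw cyan, second draw not cyan" both ways.
set.seed(1)  # assumed seed, only for reproducibility
box <- rep(c("cyan", "magenta", "yellow"), times = c(3, 5, 7))
B <- 10000

# Hypothetical helper: TRUE when draw 1 is cyan and draw 2 is not
cyan_then_not <- function(replace) {
  draws <- sample(box, 2, replace = replace)
  draws[1] == "cyan" && draws[2] != "cyan"
}

est_without <- mean(replicate(B, cyan_then_not(replace = FALSE)))
est_with    <- mean(replicate(B, cyan_then_not(replace = TRUE)))

c(without_replacement = est_without, exact = 3 / 15 * 12 / 14)
c(with_replacement = est_with, exact = 3 / 15 * 12 / 15)
```

The estimates will not match the exact values to every digit, but they should land close to 0.171 and 0.16 respectively.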