Probability grew out of games of chance, and it’s been studied for centuries. Famous mathematicians like Cardano, Fermat, and Pascal spent an incredible amount of time trying to figure it out.
Probability theory is not only useful in casinos and bets; it’s also indispensable in any situation that depends on data affected by chance in some way.
Knowledge of probability is essential to data science.
Probability can be as straightforward as rolling a 7 with two dice: there’s a 1/6 chance of this happening. But what about elections? Election forecaster Nate Silver gave Obama a 94% chance of winning in 2008, then 90% in 2012. Obama won both elections, as forecast. However, for 2016, Silver gave Hillary Clinton a 71% chance of winning, and she lost. This section tackles essential questions such as: how are these probabilities calculated, and what drives these forecasts?
We’ll cover election forecasting in the next module. We’ll also cover statistical inference which builds upon probability theory.
In this module, we will analyze the circumstances surrounding the financial crisis of 2007 to 2008. Part of what happened was the underestimation of the risk of securities that financial companies sold. Specifically, the risks of mortgage-backed securities and collateralized debt obligations (CDOs) were grossly underestimated.
The risk was assumed to be low, meaning the financial companies believed that homeowners would make their monthly payments.
Because many homeowners defaulted in 2007 and 2008, the prices of these securities crashed. The banks lost so much money that they needed government bailouts to avoid closing down completely.
To understand this very complicated situation, we’ll first learn the basics of probability, covered by these topics:
* random variables
* independence
* Monte Carlo simulation
* expected values
* standard errors
* margins of error
* central limit theorem
The probability of categorical data is called discrete probability.
We will discuss the mathematical definition of probability to get precise answers to specific questions.
A more tangible way to think about the probability of an event is as a proportion of times the event occurs when we repeat the experiment over and over independently and under the same conditions.
Computers provide a way to actually perform these simple random experiments. Before computers, we would have needed a physical setup, such as colored beads in a vase, and picked at random.
Random number generators permit us to mimic the process of picking at random.
- An example in R is the sample() function.
beads <- rep( c("red", "blue"), times = c(2,3))
beads
## [1] "red" "red" "blue" "blue" "blue"
If you type sample(beads, 1), you will get one bead drawn at random.
sample( beads, 1)
## [1] "blue"
We want to repeat this over and over.
- Since we cannot do this forever, we repeat the experiment a large enough number of times that the results are practically equivalent to doing it forever.
- This is a Monte Carlo simulation.
What is not covered here is a rigorous definition of “practically equivalent.” We will take a more practical approach to deciding how many repetitions are large enough.
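One practical way to judge whether the number of repetitions is large enough is to watch the estimate settle down as B grows. The sketch below is only an illustration: the seed and the particular values of B are arbitrary choices, and the vase is redefined so the block runs on its own.

```r
# Illustrative setup mirroring the vase: 2 red and 3 blue beads.
beads <- rep(c("red", "blue"), times = c(2, 3))
set.seed(1)  # arbitrary seed, used only to make the run repeatable

# Estimate Pr(blue) for increasing numbers of repetitions B.
sizes <- c(10, 100, 1000, 10000)
estimates <- sapply(sizes, function(B) {
  draws <- replicate(B, sample(beads, 1))
  mean(draws == "blue")
})

# The estimates should drift toward the true value of 0.6 as B grows.
data.frame(B = sizes, estimate = estimates)
```

If the estimate still changes noticeably when B is increased, B is probably not large enough yet.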
The first example of a Monte Carlo simulation uses the replicate() function.
We’ll simulate drawing one bead from the vase with 2 red and 3 blue beads, over and over, and see what proportions we get.
B <- 10000
events <- replicate(B, sample(beads, 1))
tab <- table(events)
tab
## (the counts vary from run to run; they sum to B = 10000, with roughly 6000 blue and 4000 red)
prop.table(tab)
## (the proportions are close to 0.6 for blue and 0.4 for red)
If you take a bead out of the vase and draw again without returning it, you are sampling without replacement. If you put the bead back into the vase before drawing again (so the number of beads stays the same), you are sampling with replacement. By default, sample() draws without replacement.
- For this simulation, we want to sample with replacement.
sample(beads, 5)
## [1] "red" "blue" "blue" "red" "blue"
events <- sample(beads, B, replace = TRUE)
prop.table(table(events))
## events
## blue red
## 0.5918 0.4082
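The difference between the two sampling modes shows up as soon as we ask for more beads than the vase holds. This is a small sketch; the name `too_many` is just an illustrative placeholder, and the error is caught so the block runs to the end.

```r
beads <- rep(c("red", "blue"), times = c(2, 3))

# Without replacement we cannot draw more beads than the vase holds,
# so sample() raises an error, which we catch here.
too_many <- tryCatch(sample(beads, 6), error = function(e) "error")
too_many

# With replacement, any number of draws is allowed.
six_draws <- sample(beads, 6, replace = TRUE)
length(six_draws)
```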
Defining a distribution for categorical outcomes is pretty straightforward.
1) Assign a probability to each category.
- For the beads in the vase, the proportion of each bead color defines the distribution.
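Once each category has a probability, we can sample from the distribution directly instead of listing every bead. The following is a minimal sketch using the `prob` argument of sample(); the names `colors` and `p` are illustrative, and the seed is arbitrary.

```r
# Assumed distribution: the bead proportions from the vase example.
colors <- c("red", "blue")
p <- c(red = 0.4, blue = 0.6)

# A valid distribution must sum to 1.
stopifnot(isTRUE(all.equal(sum(p), 1)))

set.seed(1)  # arbitrary seed for repeatability
draws <- sample(colors, 10000, replace = TRUE, prob = p)
prop.table(table(draws))
```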
In the next example, we’ll be using the four polling proportions.
Remember, categorical data makes it easy to define probability distributions.
Two events are independent if the outcome of one does not affect the other.
A classic example of this is coin tossing.
- Every time we toss a fair coin, the probability of seeing heads is 1/2 regardless of what previous tosses have revealed: Pr(heads) = 0.5.
- In our beads-and-vase example, draws made with replacement are independent. The probability of picking a red bead is 40% on every draw.
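Independence can be checked informally by simulation: the proportion of heads should be about the same whether or not the previous toss was heads. This is a rough sketch with an arbitrary seed and illustrative variable names, not a formal test of independence.

```r
set.seed(1)  # arbitrary seed for repeatability
tosses <- sample(c("heads", "tails"), 100000, replace = TRUE)

# Pair each toss with the one before it.
previous <- head(tosses, -1)
current <- tail(tosses, -1)

# Overall proportion of heads, and proportion of heads right after a head.
p_heads <- mean(current == "heads")
p_heads_after_heads <- mean(current[previous == "heads"] == "heads")
c(p_heads = p_heads, p_heads_after_heads = p_heads_after_heads)
```

Both numbers should be close to 0.5.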
Events are not independent when the outcome of one affects the other, as with draws made without replacement.
If you take a blue bead and you don’t put it back, the likelihood of choosing a blue bead again changes.
If we use sample() to draw all five beads without replacement and store them in x, then seeing the last four draws tells us, without guessing, what the first draw was. Below, the last four draws contain two red and two blue beads, so the first bead must have been blue.
x <- sample(beads, 5)
x[2:5]
## [1] "red" "red" "blue" "blue"
When events are not independent, conditional probabilities are useful and necessary to make correct calculations.
Example: Probability of choosing a King if one king was previously chosen without replacement.
Pr(Card 2 is a King | Card 1 is a King) = 3/51
The vertical bar | means “given that” or “conditional on.”
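The 3/51 figure can be checked with a quick simulation. The sketch below reduces the deck to two labels, "King" and "Other", which is a simplification made only for this illustration; the seed and B are arbitrary.

```r
# Deck reduced to ranks only: 4 kings among 52 cards.
ranks <- rep(c("King", "Other"), times = c(4, 48))

set.seed(1)  # arbitrary seed for repeatability
B <- 50000
draws <- replicate(B, sample(ranks, 2))  # 2 x B matrix, drawn without replacement

first_king <- draws[1, ] == "King"
second_king <- draws[2, ] == "King"

# Among hands whose first card is a king, how often is the second one a king too?
estimate <- mean(second_king[first_king])
c(estimate = estimate, exact = 3 / 51)
```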
When two events A and B are independent:
Pr(A | B) = Pr(A)
The probability of A given B is equal to the probability of A. It doesn’t matter what B is, the probability of A is unchanged.
If we want to know the probability of A and B occurring, we use the multiplication rule.
Pr(A and B) = Pr(A) * Pr(B | A)
- The probability of A and B is equal to the probability of A multiplied by the probability of B given that A already happened.
Example: In Blackjack, we are dealt 2 cards, without replacement, and we want them to add up to 21: an ace and a card worth 10 (a ten or a face card).
- The probability of getting an ace as the first card is 4/52 = 1/13.
- The probability of then getting a card worth 10 (a ten or a face card) is 16/51, since one card has already been taken out of the deck.
- The probability of both events happening is 1/13 * 16/51, or approximately 2%.
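The arithmetic behind the 2% figure can be written out with the multiplication rule. This sketch assumes a standard 52-card deck with 4 aces and 16 cards worth 10 (tens, jacks, queens, kings), so that 51 cards remain after the ace is dealt; the variable names are illustrative.

```r
# Pr(ace on the first card)
p_ace_first <- 4 / 52

# Pr(card worth 10 on the second card | ace on the first card)
p_ten_value_second <- 16 / 51

# Multiplication rule: Pr(A and B) = Pr(A) * Pr(B | A)
p_ace_then_ten <- p_ace_first * p_ten_value_second
p_ace_then_ten  # about 0.024
```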
The multiplication rule also applies to more than two events.
Pr(A and B and C) = Pr(A) * Pr(B | A) * Pr(C | A and B)
So the probability of A and B and C is equal to the probability of A, times the probability of B given that A happened, times the probability of C given that A and B happened.
If the events are independent, the calculation is very simple: just multiply their probabilities. But if the events are not independent and we treat them as if they were, we get very incorrect numbers.
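As an illustration with three events, consider drawing three kings in a row without replacement. The example is chosen here only for illustration; it compares the correct product of conditional probabilities with the number obtained by wrongly treating the draws as independent.

```r
# Correct: each factor conditions on the kings already removed from the deck.
p_dependent <- (4 / 52) * (3 / 51) * (2 / 50)

# Incorrect for this experiment: pretends every draw still has 4 kings in 52 cards.
p_if_independent <- (4 / 52)^3

c(dependent = p_dependent, wrongly_independent = p_if_independent)
```

Here the mistaken calculation overstates the probability by a factor of about 2.5.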
One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls.
What is the probability that the ball will be cyan?
cyan <- 3
magenta <- 5
yellow <- 7
p1 = cyan/(cyan + magenta + yellow)
p1
## [1] 0.2
One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls.
What is the probability that the ball will not be cyan?
p2 = 1 - p1
p2
## [1] 0.8
Instead of taking just one draw, consider taking two draws. You take the second draw without returning the first draw to the box. We call this sampling without replacement.
What is the probability that the first draw is cyan and that the second draw is not cyan?
cyan <- 3
magenta <- 5
yellow <- 7
# The variable `p1` is the probability of choosing a cyan ball from the box on the first draw.
p1 <- cyan / (cyan + magenta + yellow)
# Assign a variable `p2` as the probability of not choosing a cyan ball on the second draw without replacement.
p2 <- 1 - (cyan - 1) / (cyan + magenta + yellow - 1)
# Calculate the probability that the first draw is cyan and the second draw is not cyan.
p1 * p2
## [1] 0.1714286
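The without-replacement answer can be double-checked with a Monte Carlo simulation. This is a sketch with an arbitrary seed and number of repetitions; `box` and `hits` are illustrative names.

```r
# The box from the exercise: 3 cyan, 5 magenta, and 7 yellow balls.
box <- rep(c("cyan", "magenta", "yellow"), times = c(3, 5, 7))

set.seed(1)  # arbitrary seed for repeatability
B <- 50000
hits <- replicate(B, {
  draw <- sample(box, 2)  # two draws without replacement
  draw[1] == "cyan" && draw[2] != "cyan"
})

# Should be close to the exact value (3/15) * (12/14), about 0.171.
mean(hits)
```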
Now repeat the experiment, but this time, after taking the first draw and recording the color, return the ball to the box and shake the box. We call this sampling with replacement.
What is the probability that the first draw is cyan and that the second draw is not cyan?
cyan <- 3
magenta <- 5
yellow <- 7
# The variable `p1` is the probability of choosing a cyan ball from the box on the first draw.
p1 <- cyan / (cyan + magenta + yellow)
# Assign a variable `p2` as the probability of not choosing a cyan ball on the second draw with replacement.
# The first ball goes back into the box, so all 15 balls are available again.
p2 <- 1 - cyan / (cyan + magenta + yellow)
# Calculate the probability that the first draw is cyan and the second draw is not cyan.
p1 * p2
## [1] 0.16
Probability computations are not always straightforward.
For example, what is the probability of drawing 5 cards (without replacement) that are all of the same suit, which is a flush in poker?
Discrete probability teaches us how to make these computations using mathematics.
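For the same-suit question, one way to get the answer is to count hands with choose(). This sketch assumes a standard 52-card deck and counts every 5-card hand from a single suit, so straight flushes are included; the variable names are illustrative.

```r
# Hands of 5 cards that all share one suit: pick the suit, then 5 of its 13 cards.
n_same_suit_hands <- 4 * choose(13, 5)

# All possible 5-card hands, ignoring order.
n_all_hands <- choose(52, 5)

p_same_suit <- n_same_suit_hands / n_all_hands
p_same_suit  # about 0.002
```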
We’re going to construct a deck of cards using R.
For this, we need the expand.grid() and paste() functions.
number <- "Three"
suit <- "Hearts"
paste(number, suit)
## [1] "Three Hearts"
paste(letters[1:5], as.character(1:5))
## [1] "a 1" "b 2" "c 3" "d 4" "e 5"
expand.grid(pants = c("blue", "black"), shirt = c("white", "grey", "plaid"))
## pants shirt
## 1 blue white
## 2 black white
## 3 blue grey
## 4 black grey
## 5 blue plaid
## 6 black plaid
Generate a Deck of cards
suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King")
deck <- expand.grid( number = numbers, suit = suits)
deck <- paste(deck$number, deck$suit)
deck
## [1] "Ace Diamonds" "Deuce Diamonds" "Three Diamonds" "Four Diamonds"
## [5] "Five Diamonds" "Six Diamonds" "Seven Diamonds" "Eight Diamonds"
## [9] "Nine Diamonds" "Ten Diamonds" "Jack Diamonds" "Queen Diamonds"
## [13] "King Diamonds" "Ace Clubs" "Deuce Clubs" "Three Clubs"
## [17] "Four Clubs" "Five Clubs" "Six Clubs" "Seven Clubs"
## [21] "Eight Clubs" "Nine Clubs" "Ten Clubs" "Jack Clubs"
## [25] "Queen Clubs" "King Clubs" "Ace Hearts" "Deuce Hearts"
## [29] "Three Hearts" "Four Hearts" "Five Hearts" "Six Hearts"
## [33] "Seven Hearts" "Eight Hearts" "Nine Hearts" "Ten Hearts"
## [37] "Jack Hearts" "Queen Hearts" "King Hearts" "Ace Spades"
## [41] "Deuce Spades" "Three Spades" "Four Spades" "Five Spades"
## [45] "Six Spades" "Seven Spades" "Eight Spades" "Nine Spades"
## [49] "Ten Spades" "Jack Spades" "Queen Spades" "King Spades"
Use the Deck of Cards we constructed for the next questions.
1) Check that the probability of a king as the first card is 1 in 13.
- Compute the proportion of possible outcomes that satisfy our condition.
Instruction
Create a vector that contains the four possibilities of getting a King.
Then use the mean() function to compute the proportion of cards in the deck that are kings.
kings <- paste("King", suits)
mean(deck %in% kings)
## [1] 0.07692308
# permutations() and combinations() come from the gtools package
library(gtools)
permutations(5, 2)
## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 1 4
## [4,] 1 5
## [5,] 2 1
## [6,] 2 3
## [7,] 2 4
## [8,] 2 5
## [9,] 3 1
## [10,] 3 2
## [11,] 3 4
## [12,] 3 5
## [13,] 4 1
## [14,] 4 2
## [15,] 4 3
## [16,] 4 5
## [17,] 5 1
## [18,] 5 2
## [19,] 5 3
## [20,] 5 4
all_phone_numbers <- permutations(10, 7, v = 0:9)
n <- nrow(all_phone_numbers)
index <- sample(n, 5)
all_phone_numbers[index, ]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 2 9 5 6 0 8 3
## [2,] 9 1 7 5 6 3 8
## [3,] 4 6 1 8 7 9 5
## [4,] 7 9 4 8 1 5 6
## [5,] 0 9 1 6 2 3 8
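The number of rows in these results can be checked by hand: when no element is repeated, there are n choices for the first position, n - 1 for the second, and so on. This is a minimal base R sketch with illustrative variable names.

```r
# Ordered pairs from 5 items without repeats: 5 * 4
n_ordered_pairs <- prod(5:4)

# 7-digit sequences from the digits 0-9 without repeats: 10 * 9 * ... * 4
n_phone_numbers <- prod(10:4)

c(pairs = n_ordered_pairs, phone_numbers = n_phone_numbers)
```

Note that this counts only sequences in which no digit repeats, which is what the permutations() call above generates.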
hands <- permutations(52, 2, v = deck)
first_card <- hands[ ,1]
second_card <- hands[ ,2]
sum(first_card %in% kings)
## [1] 204
sum(first_card %in% kings & second_card %in% kings) /
sum(first_card %in% kings)
## [1] 0.05882353
Answer: 0.058… = 3/51
The code below will give the same answer as above. We’re computing the proportions instead of the totals.
mean(first_card %in% kings & second_card %in% kings) /
mean(first_card %in% kings)
## [1] 0.05882353
When order doesn’t matter, as in blackjack, it makes no difference whether you get an ace first and a 10 second or the other way around: the hand still equals 21.
For this, we need to use combinations() instead of permutations().
Look at the difference between the permutations() function and the combinations() function.
permutations(3,2)
## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 2 1
## [4,] 2 3
## [5,] 3 1
## [6,] 3 2
combinations(3,2)
## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 2 3
Since order doesn’t matter for combinations(), notice that 2,1 doesn’t appear because 1,2 already appeared. Similarly, 3,1 and 3,2 don’t appear.
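The two counts are related: each unordered pair corresponds to 2! = 2 ordered pairs. The base R sketch below checks this for 2 cards out of 52, using prod() and choose() rather than enumerating the hands; the variable names are illustrative.

```r
n <- 52
r <- 2

# Ordered selections without repeats: 52 * 51
n_permutations <- prod((n - r + 1):n)

# Unordered selections: "52 choose 2"
n_combinations <- choose(n, r)

c(permutations = n_permutations, combinations = n_combinations)

# Every combination appears factorial(r) times among the permutations.
n_permutations / factorial(r) == n_combinations
```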
# vector of all aces
aces <- paste("Ace", suits)
# vector of all cards worth 10 (face cards and tens)
facecard <- c("King", "Queen", "Jack", "Ten")
facecard <- expand.grid(number = facecard, suit = suits)
facecard <- paste(facecard$number, facecard$suit)
# all combinations of 2 cards out of 52
hands <- combinations(52, 2, v = deck)
# proportion of hands with an ace in the first column and a card worth 10 in the second
mean(hands[,1] %in% aces & hands[,2] %in% facecard)
## [1] 0.04826546
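Here is the code that considers both possibilities (ace first or card worth 10 first). The answer is the same as above because combinations() sorts the values in v by default, and “Ace” sorts alphabetically ahead of the other card names, so the ace always lands in the first column: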
mean((hands[,1] %in% aces & hands[,2] %in% facecard) |
(hands[,2] %in% aces & hands[,1] %in% facecard))
## [1] 0.04826546
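The same number can be obtained by counting. This sketch assumes 4 aces and 16 cards worth 10 in a standard 52-card deck, and uses illustrative variable names.

```r
# Hands made of one ace and one card worth 10.
n_natural_21 <- 4 * 16

# All unordered 2-card hands.
n_hands <- choose(52, 2)

p_natural_21 <- n_natural_21 / n_hands
p_natural_21  # about 0.048
```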
Instead of using combinations() to deduce the exact probability of a natural 21 in blackjack, let’s use a Monte Carlo simulation to estimate this probability.
- In this case, we draw two cards over and over, and keep track of how many 21s we get.
- Use the function sample() to draw cards without replacement.
hand <- sample(deck, 2)
hand
## [1] "Eight Clubs" "King Diamonds"
B <- 10000
results <- replicate(B, {
  hand <- sample(deck, 2)
  (hand[1] %in% aces & hand[2] %in% facecard) |
    (hand[2] %in% aces & hand[1] %in% facecard)
})
mean(results)
## [1] 0.0497
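To see how close a Monte Carlo estimate gets to the exact answer, we can repeat the simulation for several values of B. The sketch below replaces the full deck with three labels ("ace", "ten_value", "other"), a simplification made only for this illustration; the seed and the values of B are arbitrary.

```r
# Simplified deck: 4 aces, 16 cards worth 10, and 32 other cards.
cards <- rep(c("ace", "ten_value", "other"), times = c(4, 16, 32))
exact <- 4 * 16 / choose(52, 2)

set.seed(1)  # arbitrary seed for repeatability
sizes <- c(100, 1000, 10000)
estimates <- sapply(sizes, function(B) {
  mean(replicate(B, {
    hand <- sample(cards, 2)  # two cards without replacement
    # A natural 21 needs one ace and one card worth 10, in either order.
    all(c("ace", "ten_value") %in% hand)
  }))
})

data.frame(B = sizes, estimate = estimates, exact = exact)
```

Larger values of B should generally give estimates closer to the exact value, although any single run can still miss by a small amount.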