Section 1: Discrete Probability


1.1 Probability

Course Notes

Probability theory is the mathematical foundation of statistical inference which is indispensable for analyzing data affected by chance, and thus essential for data scientists.

Probability is defined as the relative possibility that an event will occur, as expressed by the ratio of the number of actual occurrences to the total number of possible occurrences. Alternatively, *the proportion of times an event occurs when we repeat the experiment an infinite number of times, independently, under the same conditions.

We will introduce important concepts such as random variables, independence, Monte Carlo simulations, expected values, standard errors, and the Central Limit Theorem.

Section 1 introduces you to Discrete Probability and is divided into three parts:

  1. Introduction to Discrete Probability
  2. Combinations and Permutations
  3. Addition Rule and Monty Hall

Introduction to Discrete Probability

The probability of an event is the proportion of times the event occurs when we repeat the experiment independently under the same conditions.

Pr(A)= probability of event A

An event is defined as an outcome that can occur when when something happens by chance.

We can determine probabilities related to discrete variables (picking a red bead, choosing 48 Democrats and 52 Republicans from 100 likely voters) and continuous variables (height over 6 feet).

Monte Carlo Simulations

Monte Carlo simulations model the probability of different outcomes by repeating a random process a large enough number of times that the results are similar to what would be observed if the process were repeated forever.

You can simulate random selections in R using the sample() function. Leveraging the rep() function to create a virtual collection of ‘beads’ then:

# As an example, let's create a vector beads and call the rep(). rep() replicates the values in x by the factors in y:
beads <- rep(c("red", "blue"), times = c(2,3))   
beads
## [1] "red"  "red"  "blue" "blue" "blue"
sample(beads, 1)    # sample 1 bead at random
## [1] "red"

Now, we want to repeat this experiment over and over - repeating it a large enough number of times to make the results practically equivalent to doing it over and over forever. This is an example of a Monte Carlo simulation.

In our experiment, we are place two red beads and two blue beads into the jar. This would approximately result in a 40% chance of selecting a red bead at random and a 60% chance of selecting a blue bead at random. So what if we applied this to a Monte Carlo simulation to see how closely these probabilities would be in reality. We can simulate 10000 random bead picks like this:

B <- 10000    # number of times to draw 1 bead
events <- replicate(B, sample(beads, 1))    # draw 1 bead, B times
tab <- table(events)    # make a table of outcome counts
tab    # view count table
## events
## blue  red 
## 5915 4085
prop.table(tab)    # view table of outcome proportions
## events
##   blue    red 
## 0.5915 0.4085

The results above suggest our approximation is close. Let’s try again, but this time with 1 million draws:

B <- 1000000    # number of times to draw 1 bead
events <- replicate(B, sample(beads, 1))    # draw 1 bead, B times
tab <- table(events)    # make a table of outcome counts
tab    # view count table
## events
##   blue    red 
## 599680 400320
prop.table(tab)    # view table of outcome proportions
## events
##    blue     red 
## 0.59968 0.40032

We can see that our estimated probabilities match very well with actual probabilities when carried out many, many times. That is, there is a 40% probability of picking a red bead at random, and a 60% probability of picking a blue bead at random.

By default, the sample method does not replace the bead back into the urn prior to the next random selection. This can be modified by passing in an argument to the sample() function:

B <- 10000    # number of times to draw 1 bead
events <- sample(beads, B, replace = TRUE)    # draw 1 bead, B times
prop.table(tab)    # view table of outcome proportions
## events
##    blue     red 
## 0.59968 0.40032

Setting the Random Seed

Throughout this course, we use random number generators. This implies that many of the results presented can actually change by chance, which then suggests that a frozen version of the course may show a different result than what you obtain when you try to code as shown in the course. This is actually fine since the results are random and change from time to time.

However, if you want to to ensure that results are exactly the same every time you run them, you can set R’s random number generation seed to a specific number. Below we set it to ‘1986’. We want to avoid using the same seed every time. A popular way to pick the seed is the year MINUS month MINUS day. For example, we picked 1986 on December 20, 2018: 2018 − 12 − 20 = 1986.

set.seed(1986) 

Using the Mean Function for Probability

An important application of the mean() function

In R, applying the mean() function to a logical vector returns the proportion of elements that are TRUE. It is very common to use the mean() function in this way to calculate probabilities and we will do so throughout the course. To find the probability of drawing a blue bead at random, you can run:

mean(beads == "blue")
## [1] 0.6
mean(beads == "red")
## [1] 0.4

Probability Distributions

  • The probability distribution for a variable describes the probability of observing each possible outcome.
  • For discrete categorical variables, the probability distribution is defined by the proportions for each group.

Independence

We say that two events are independent if the outcome of one does not affect the other. Coin tosses are a classic example. Selecting a King from a deck of cards, however, has an impact on the probability of selecting a second king (decreases). When events are NOT independent, this gives rise to Conditional Probabilities. Conditional probabilities compute the probability that an event occurs given information about dependent events.

In our card example- the probability of drawing a King from a deck of cards is 1 in 13. This is because there are only 13 types of cards in the deck, Ace through Kings. However, if we draw a King on the first draw, what is then the probability of drawing a King on the second draw? I has now reduced to 3 out of 51 - since there are only 3 Kings remaining in the deck and we have already removed one card from the 52 card complete deck. In probability, the following notation is used:

Note that, when two events, say A and B, are independent, we have the following equation. The probability of A given B is equal to the probability of A. It doesn’t matter what B is. The probability A is unchanged:

If we want to know the probability of two events, say A and B, occurring, we can use the multiplication rule. So the probability of A and B is equal to the probability of A multiplied by the probability of B given that A already happened:

1.2 Combinations and Permutations

Probabilities can get increasing complex however. For example, what is the probability that if I draw 5 cards without replacement, I get all cards of the same suit, what is called a flush in poker?

To use R to simulate, we first need to create a deck of cards. We will leverage expand.grid() and paste().

Create two vectors- one which contains all our suits, another that contains all possible card possibilities:

suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
suits  
## [1] "Diamonds" "Clubs"    "Hearts"   "Spades"
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King")
numbers  
##  [1] "Ace"   "Deuce" "Three" "Four"  "Five"  "Six"   "Seven" "Eight" "Nine" 
## [10] "Ten"   "Jack"  "Queen" "King"

Now we can leverage the expand.grid(), which renders a combinations of 2 vectors or lists with named references:

deck <- expand.grid(number = numbers, suit = suits)
deck  
##    number     suit
## 1     Ace Diamonds
## 2   Deuce Diamonds
## 3   Three Diamonds
## 4    Four Diamonds
## 5    Five Diamonds
## 6     Six Diamonds
## 7   Seven Diamonds
## 8   Eight Diamonds
## 9    Nine Diamonds
## 10    Ten Diamonds
## 11   Jack Diamonds
## 12  Queen Diamonds
## 13   King Diamonds
## 14    Ace    Clubs
## 15  Deuce    Clubs
## 16  Three    Clubs
## 17   Four    Clubs
## 18   Five    Clubs
## 19    Six    Clubs
## 20  Seven    Clubs
## 21  Eight    Clubs
## 22   Nine    Clubs
## 23    Ten    Clubs
## 24   Jack    Clubs
## 25  Queen    Clubs
## 26   King    Clubs
## 27    Ace   Hearts
## 28  Deuce   Hearts
## 29  Three   Hearts
## 30   Four   Hearts
## 31   Five   Hearts
## 32    Six   Hearts
## 33  Seven   Hearts
## 34  Eight   Hearts
## 35   Nine   Hearts
## 36    Ten   Hearts
## 37   Jack   Hearts
## 38  Queen   Hearts
## 39   King   Hearts
## 40    Ace   Spades
## 41  Deuce   Spades
## 42  Three   Spades
## 43   Four   Spades
## 44   Five   Spades
## 45    Six   Spades
## 46  Seven   Spades
## 47  Eight   Spades
## 48   Nine   Spades
## 49    Ten   Spades
## 50   Jack   Spades
## 51  Queen   Spades
## 52   King   Spades

Now we apply paste(), which joins two strings and inserts a space in between and we finally have a complete deck.

deck <- paste(deck$number, deck$suit)
deck  
##  [1] "Ace Diamonds"   "Deuce Diamonds" "Three Diamonds" "Four Diamonds" 
##  [5] "Five Diamonds"  "Six Diamonds"   "Seven Diamonds" "Eight Diamonds"
##  [9] "Nine Diamonds"  "Ten Diamonds"   "Jack Diamonds"  "Queen Diamonds"
## [13] "King Diamonds"  "Ace Clubs"      "Deuce Clubs"    "Three Clubs"   
## [17] "Four Clubs"     "Five Clubs"     "Six Clubs"      "Seven Clubs"   
## [21] "Eight Clubs"    "Nine Clubs"     "Ten Clubs"      "Jack Clubs"    
## [25] "Queen Clubs"    "King Clubs"     "Ace Hearts"     "Deuce Hearts"  
## [29] "Three Hearts"   "Four Hearts"    "Five Hearts"    "Six Hearts"    
## [33] "Seven Hearts"   "Eight Hearts"   "Nine Hearts"    "Ten Hearts"    
## [37] "Jack Hearts"    "Queen Hearts"   "King Hearts"    "Ace Spades"    
## [41] "Deuce Spades"   "Three Spades"   "Four Spades"    "Five Spades"   
## [45] "Six Spades"     "Seven Spades"   "Eight Spades"   "Nine Spades"   
## [49] "Ten Spades"     "Jack Spades"    "Queen Spades"   "King Spades"

Now to calculate the probability of drawing a King from the deck, recall the important application of the mean() function. In R, applying the mean() function to a logical vector returns the proportion of elements that are TRUE. It is very common to use the mean() function in this way to calculate probabilities and we will do so throughout the course.

So create a utility vector kings, then determine the mean of the %in% of kings in the deck:

# probability of drawing a king
kings <- paste("King", suits)
mean(deck %in% kings)
## [1] 0.07692308

The result is 0.77, which is equivalent to 1/13 as we initially supposed.

Now, how about the conditional probability of the second card being a king given that the first was a king?

Earlier we deduced that if 1 king is already out, then there’s 51 left, So the probability is 3 in 51. But let’s verify that with R. This introduces combinations() and permutations() tools available in the gtools package.

permutations(n,r) from the gtools package lists the different ways that r items can be selected from a set of n options when order matters.

library(gtools)
permutations(5,2)    # ways to choose 2 numbers in order from 1:5
##       [,1] [,2]
##  [1,]    1    2
##  [2,]    1    3
##  [3,]    1    4
##  [4,]    1    5
##  [5,]    2    1
##  [6,]    2    3
##  [7,]    2    4
##  [8,]    2    5
##  [9,]    3    1
## [10,]    3    2
## [11,]    3    4
## [12,]    3    5
## [13,]    4    1
## [14,]    4    2
## [15,]    4    3
## [16,]    4    5
## [17,]    5    1
## [18,]    5    2
## [19,]    5    3
## [20,]    5    4

combinations(n,r) from the gtools package lists the different ways that r items can be selected from a set of n options when order does not matter.

combinations(3,2)    # order does not matter
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## [3,]    2    3
# combinations(n, r, v=1:n, set=TRUE, repeats.allowed=FALSE)
# permutations(n, r, v=1:n, set=TRUE, repeats.allowed=FALSE)
# 
# n - Size of the source vector
# r - Size of the target vectors
# v - Source vector. Defaults to 1:n but you can pass in another vector
# repeats.allowed   - Logical flag indicating whether the constructed vectors may include duplicated values. Defaults to FALSE.

testv <- c("a","b","c","d","e","f","g","h","i","j")
testp <- permutations(4,2,testv) #will take 4 from source vector and create every possible permutation of 2 items, returns a matrix 
testc <- combinations(4,2,testv) #will take 4 from source vector and create any possible combination of 2 items regardless of order, returns a matrix 
testp
##       [,1] [,2]
##  [1,] "a"  "b" 
##  [2,] "a"  "c" 
##  [3,] "a"  "d" 
##  [4,] "b"  "a" 
##  [5,] "b"  "c" 
##  [6,] "b"  "d" 
##  [7,] "c"  "a" 
##  [8,] "c"  "b" 
##  [9,] "c"  "d" 
## [10,] "d"  "a" 
## [11,] "d"  "b" 
## [12,] "d"  "c"
testc
##      [,1] [,2]
## [1,] "a"  "b" 
## [2,] "a"  "c" 
## [3,] "a"  "d" 
## [4,] "b"  "c" 
## [5,] "b"  "d" 
## [6,] "c"  "d"

With these new tools then, how can we determine the probability of drawing a second King?

We’ll take our deck that we created above and use it to determine every possible 2-card combination from a 52 card deck and determine what percentage of those combinations is two kings:

hands <- permutations(52,2, v = deck) # will take 52 from source vector and create every possible permutation of 2 cards, 2652 combinations

first_card <- hands[,1] # we create a utility vector equal to the first column of the hands vector and call it first_card
second_card <- hands[,2] # we create a utility vector equal to the second column of the hands vector and call it second_card
sum(first_card %in% kings) # now we will determine how many of the 2652 combinations have a king as the FIRST CARD, in this case 204
## [1] 204

The Multiplication Rule yields the probability of two events happening, given that “|” one has already occurred.

  • When events are INDEPENDENT: P (A and B and N) = P(A) * P(B) * P(N)
  • When events are DEPENDENT: P (A and B and N) = P(A) * P(B|A) * P(N|B)

Because these cards are dependent, the probability of ‘drawing a king’ and ‘drawing a second king’ is equal to the ‘probability of drawing a king’ TIMES the ‘probability of drawing a second king’ DIVIDED by the ‘probability of drawing a king’:

sum(first_card %in% kings & second_card %in% kings) / sum(first_card %in% kings)
## [1] 0.05882353