Probability theory is the mathematical foundation of statistical inference which is indispensable for analyzing data affected by chance, and thus essential for data scientists.
Probability is defined as the relative possibility that an event will occur, as expressed by the ratio of the number of actual occurrences to the total number of possible occurrences. Alternatively, *the proportion of times an event occurs when we repeat the experiment an infinite number of times, independently, under the same conditions.
We will introduce important concepts such as random variables, independence, Monte Carlo simulations, expected values, standard errors, and the Central Limit Theorem.
Section 1 introduces you to Discrete Probability and is divided into three parts:
The probability of an event is the proportion of times the event occurs when we repeat the experiment independently under the same conditions.
Pr(A)= probability of event A
An event is defined as an outcome that can occur when when something happens by chance.
We can determine probabilities related to discrete variables (picking a red bead, choosing 48 Democrats and 52 Republicans from 100 likely voters) and continuous variables (height over 6 feet).
Monte Carlo simulations model the probability of different outcomes by repeating a random process a large enough number of times that the results are similar to what would be observed if the process were repeated forever.
You can simulate random selections in R using the sample() function. Leveraging the rep() function to create a virtual collection of ‘beads’ then:
# As an example, let's create a vector beads and call the rep(). rep() replicates the values in x by the factors in y:
beads <- rep(c("red", "blue"), times = c(2,3))
beads
## [1] "red" "red" "blue" "blue" "blue"
sample(beads, 1) # sample 1 bead at random
## [1] "red"
Now, we want to repeat this experiment over and over - repeating it a large enough number of times to make the results practically equivalent to doing it over and over forever. This is an example of a Monte Carlo simulation.
In our experiment, we are place two red beads and two blue beads into the jar. This would approximately result in a 40% chance of selecting a red bead at random and a 60% chance of selecting a blue bead at random. So what if we applied this to a Monte Carlo simulation to see how closely these probabilities would be in reality. We can simulate 10000 random bead picks like this:
B <- 10000 # number of times to draw 1 bead
events <- replicate(B, sample(beads, 1)) # draw 1 bead, B times
tab <- table(events) # make a table of outcome counts
tab # view count table
## events
## blue red
## 5915 4085
prop.table(tab) # view table of outcome proportions
## events
## blue red
## 0.5915 0.4085
The results above suggest our approximation is close. Let’s try again, but this time with 1 million draws:
B <- 1000000 # number of times to draw 1 bead
events <- replicate(B, sample(beads, 1)) # draw 1 bead, B times
tab <- table(events) # make a table of outcome counts
tab # view count table
## events
## blue red
## 599680 400320
prop.table(tab) # view table of outcome proportions
## events
## blue red
## 0.59968 0.40032
We can see that our estimated probabilities match very well with actual probabilities when carried out many, many times. That is, there is a 40% probability of picking a red bead at random, and a 60% probability of picking a blue bead at random.
By default, the sample method does not replace the bead back into the urn prior to the next random selection. This can be modified by passing in an argument to the sample() function:
B <- 10000 # number of times to draw 1 bead
events <- sample(beads, B, replace = TRUE) # draw 1 bead, B times
prop.table(tab) # view table of outcome proportions
## events
## blue red
## 0.59968 0.40032
Throughout this course, we use random number generators. This implies that many of the results presented can actually change by chance, which then suggests that a frozen version of the course may show a different result than what you obtain when you try to code as shown in the course. This is actually fine since the results are random and change from time to time.
However, if you want to to ensure that results are exactly the same every time you run them, you can set R’s random number generation seed to a specific number. Below we set it to ‘1986’. We want to avoid using the same seed every time. A popular way to pick the seed is the year MINUS month MINUS day. For example, we picked 1986 on December 20, 2018: 2018 − 12 − 20 = 1986.
set.seed(1986)
An important application of the mean() function
In R, applying the mean() function to a logical vector returns the proportion of elements that are TRUE. It is very common to use the mean() function in this way to calculate probabilities and we will do so throughout the course. To find the probability of drawing a blue bead at random, you can run:
mean(beads == "blue")
## [1] 0.6
mean(beads == "red")
## [1] 0.4
We say that two events are independent if the outcome of one does not affect the other. Coin tosses are a classic example. Selecting a King from a deck of cards, however, has an impact on the probability of selecting a second king (decreases). When events are NOT independent, this gives rise to Conditional Probabilities. Conditional probabilities compute the probability that an event occurs given information about dependent events.
In our card example- the probability of drawing a King from a deck of cards is 1 in 13. This is because there are only 13 types of cards in the deck, Ace through Kings. However, if we draw a King on the first draw, what is then the probability of drawing a King on the second draw? I has now reduced to 3 out of 51 - since there are only 3 Kings remaining in the deck and we have already removed one card from the 52 card complete deck. In probability, the following notation is used:
Note that, when two events, say A and B, are independent, we have the following equation. The probability of A given B is equal to the probability of A. It doesn’t matter what B is. The probability A is unchanged:
If we want to know the probability of two events, say A and B, occurring, we can use the multiplication rule. So the probability of A and B is equal to the probability of A multiplied by the probability of B given that A already happened:
Probabilities can get increasing complex however. For example, what is the probability that if I draw 5 cards without replacement, I get all cards of the same suit, what is called a flush in poker?
To use R to simulate, we first need to create a deck of cards. We will leverage expand.grid() and paste().
Create two vectors- one which contains all our suits, another that contains all possible card possibilities:
suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
suits
## [1] "Diamonds" "Clubs" "Hearts" "Spades"
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King")
numbers
## [1] "Ace" "Deuce" "Three" "Four" "Five" "Six" "Seven" "Eight" "Nine"
## [10] "Ten" "Jack" "Queen" "King"
Now we can leverage the expand.grid(), which renders a combinations of 2 vectors or lists with named references:
deck <- expand.grid(number = numbers, suit = suits)
deck
## number suit
## 1 Ace Diamonds
## 2 Deuce Diamonds
## 3 Three Diamonds
## 4 Four Diamonds
## 5 Five Diamonds
## 6 Six Diamonds
## 7 Seven Diamonds
## 8 Eight Diamonds
## 9 Nine Diamonds
## 10 Ten Diamonds
## 11 Jack Diamonds
## 12 Queen Diamonds
## 13 King Diamonds
## 14 Ace Clubs
## 15 Deuce Clubs
## 16 Three Clubs
## 17 Four Clubs
## 18 Five Clubs
## 19 Six Clubs
## 20 Seven Clubs
## 21 Eight Clubs
## 22 Nine Clubs
## 23 Ten Clubs
## 24 Jack Clubs
## 25 Queen Clubs
## 26 King Clubs
## 27 Ace Hearts
## 28 Deuce Hearts
## 29 Three Hearts
## 30 Four Hearts
## 31 Five Hearts
## 32 Six Hearts
## 33 Seven Hearts
## 34 Eight Hearts
## 35 Nine Hearts
## 36 Ten Hearts
## 37 Jack Hearts
## 38 Queen Hearts
## 39 King Hearts
## 40 Ace Spades
## 41 Deuce Spades
## 42 Three Spades
## 43 Four Spades
## 44 Five Spades
## 45 Six Spades
## 46 Seven Spades
## 47 Eight Spades
## 48 Nine Spades
## 49 Ten Spades
## 50 Jack Spades
## 51 Queen Spades
## 52 King Spades
Now we apply paste(), which joins two strings and inserts a space in between and we finally have a complete deck.
deck <- paste(deck$number, deck$suit)
deck
## [1] "Ace Diamonds" "Deuce Diamonds" "Three Diamonds" "Four Diamonds"
## [5] "Five Diamonds" "Six Diamonds" "Seven Diamonds" "Eight Diamonds"
## [9] "Nine Diamonds" "Ten Diamonds" "Jack Diamonds" "Queen Diamonds"
## [13] "King Diamonds" "Ace Clubs" "Deuce Clubs" "Three Clubs"
## [17] "Four Clubs" "Five Clubs" "Six Clubs" "Seven Clubs"
## [21] "Eight Clubs" "Nine Clubs" "Ten Clubs" "Jack Clubs"
## [25] "Queen Clubs" "King Clubs" "Ace Hearts" "Deuce Hearts"
## [29] "Three Hearts" "Four Hearts" "Five Hearts" "Six Hearts"
## [33] "Seven Hearts" "Eight Hearts" "Nine Hearts" "Ten Hearts"
## [37] "Jack Hearts" "Queen Hearts" "King Hearts" "Ace Spades"
## [41] "Deuce Spades" "Three Spades" "Four Spades" "Five Spades"
## [45] "Six Spades" "Seven Spades" "Eight Spades" "Nine Spades"
## [49] "Ten Spades" "Jack Spades" "Queen Spades" "King Spades"
Now to calculate the probability of drawing a King from the deck, recall the important application of the mean() function. In R, applying the mean() function to a logical vector returns the proportion of elements that are TRUE. It is very common to use the mean() function in this way to calculate probabilities and we will do so throughout the course.
So create a utility vector kings, then determine the mean of the %in% of kings in the deck:
# probability of drawing a king
kings <- paste("King", suits)
mean(deck %in% kings)
## [1] 0.07692308
The result is 0.77, which is equivalent to 1/13 as we initially supposed.
Now, how about the conditional probability of the second card being a king given that the first was a king?
Earlier we deduced that if 1 king is already out, then there’s 51 left, So the probability is 3 in 51. But let’s verify that with R. This introduces combinations() and permutations() tools available in the gtools package.
permutations(n,r) from the gtools package lists the different ways that r items can be selected from a set of n options when order matters.
library(gtools)
permutations(5,2) # ways to choose 2 numbers in order from 1:5
## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 1 4
## [4,] 1 5
## [5,] 2 1
## [6,] 2 3
## [7,] 2 4
## [8,] 2 5
## [9,] 3 1
## [10,] 3 2
## [11,] 3 4
## [12,] 3 5
## [13,] 4 1
## [14,] 4 2
## [15,] 4 3
## [16,] 4 5
## [17,] 5 1
## [18,] 5 2
## [19,] 5 3
## [20,] 5 4
combinations(n,r) from the gtools package lists the different ways that r items can be selected from a set of n options when order does not matter.
combinations(3,2) # order does not matter
## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 2 3
# combinations(n, r, v=1:n, set=TRUE, repeats.allowed=FALSE)
# permutations(n, r, v=1:n, set=TRUE, repeats.allowed=FALSE)
#
# n - Size of the source vector
# r - Size of the target vectors
# v - Source vector. Defaults to 1:n but you can pass in another vector
# repeats.allowed - Logical flag indicating whether the constructed vectors may include duplicated values. Defaults to FALSE.
testv <- c("a","b","c","d","e","f","g","h","i","j")
testp <- permutations(4,2,testv) #will take 4 from source vector and create every possible permutation of 2 items, returns a matrix
testc <- combinations(4,2,testv) #will take 4 from source vector and create any possible combination of 2 items regardless of order, returns a matrix
testp
## [,1] [,2]
## [1,] "a" "b"
## [2,] "a" "c"
## [3,] "a" "d"
## [4,] "b" "a"
## [5,] "b" "c"
## [6,] "b" "d"
## [7,] "c" "a"
## [8,] "c" "b"
## [9,] "c" "d"
## [10,] "d" "a"
## [11,] "d" "b"
## [12,] "d" "c"
testc
## [,1] [,2]
## [1,] "a" "b"
## [2,] "a" "c"
## [3,] "a" "d"
## [4,] "b" "c"
## [5,] "b" "d"
## [6,] "c" "d"
With these new tools then, how can we determine the probability of drawing a second King?
We’ll take our deck that we created above and use it to determine every possible 2-card combination from a 52 card deck and determine what percentage of those combinations is two kings:
hands <- permutations(52,2, v = deck) # will take 52 from source vector and create every possible permutation of 2 cards, 2652 combinations
first_card <- hands[,1] # we create a utility vector equal to the first column of the hands vector and call it first_card
second_card <- hands[,2] # we create a utility vector equal to the second column of the hands vector and call it second_card
sum(first_card %in% kings) # now we will determine how many of the 2652 combinations have a king as the FIRST CARD, in this case 204
## [1] 204
The Multiplication Rule yields the probability of two events happening, given that “|” one has already occurred.
Because these cards are dependent, the probability of ‘drawing a king’ and ‘drawing a second king’ is equal to the ‘probability of drawing a king’ TIMES the ‘probability of drawing a second king’ DIVIDED by the ‘probability of drawing a king’:
sum(first_card %in% kings & second_card %in% kings) / sum(first_card %in% kings)
## [1] 0.05882353