Data Science: Probability

  • Course Instructor: Rafael Irizarry

Abstract

This is the third in a series of courses in a Professional Certificate in Data Science program. The courses in the Professional Certificate program are designed to prepare you to do data analysis in R, from simple computations to machine learning. The courses are designed to be taken in order. A prerequisite for this course is courses 1 and 2 of the series or equivalent knowledge of basic R coding and data visualization. If you need to learn or refresh some basic R, check out Data Science: R Basics, the first course in this series.

The textbook for the Data Science course series is freely available online. This course corresponds to the Probability section of textbook, starting here.

This course assumes you are comfortable with basic math, algebra, and logical operations. HarvardX has partnered with DataCamp for all assignments in R that allow you to program directly in a browser-based interface. You will not need to download any special software.

Using a combination of a guided introduction through short video lectures and more independent in-depth exploration, you will get to practice your new R skills on real-life applications.

Probability theory is the mathematical foundation of statistical inference which is indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn important concepts in probability theory. The motivation for this course is the circumstances surrounding the financial crisis of 2007-2008. Part of what caused this financial crisis was that the risk of certain securities sold by financial institutions was underestimated. To begin to understand this very complicated event, we need to understand the basics of probability. We will introduce important concepts such as random variables, independence, Monte Carlo simulations, expected values, standard errors, and the Central Limit Theorem. These statistical concepts are fundamental to conducting statistical tests on data and understanding whether the data you are analyzing are likely occurring due to an experimental method or to chance. Statistical inference, covered in the next course in this series, builds upon probability theory.

Learning Objectives

  • Important concepts in probability theory including random variables and independence
  • How to perform a Monte Carlo simulation
  • The meaning of expected values and standard errors and how to compute them in R
  • The importance of the Central Limit Theorem

Course Overview

Section 1: Discrete Probability
You will learn about basic principles of probability related to categorical data using card games as examples.

Section 2: Continuous Probability
You will learn about basic principles of probability related to numeric and continuous data.

Section 3: Random Variables, Sampling Models, and The Central Limit Theorem
You will learn about random variables (numeric outcomes resulting from random processes), how to model data generation procedures as draws from an urn, and the Central Limit Theorem, which applies to large sample sizes.

Section 4: The Big Short
You will learn how interest rates are determined and how some bad assumptions led to the financial crisis of 2007-2008.

Section 1 Overview

Section 1 introduces you to Discrete Probability. Section 1 is divided into three parts:

  1. Introduction to Discrete Probability
  2. Combinations and Permutations
  3. Addition Rule and Monty Hall

After completing Section 1, you will be able to:

  • Apply basic probability theory to categorical data.
  • Perform a Monte Carlo simulation to approximate the results of repeating an experiment over and over, including simulating the outcomes in the Monty Hall problem.
  • Distinguish between: sampling with and without replacement, events that are and are not independent, and combinations and permutations.
  • Apply the multiplication and addition rules, as appropriate, to calculate the probably of multiple events occurring.
  • Use sapply instead of a for loop to perform element-wise operations on a function.

The textbook for this section is available here

1.1 Introduction to Discrete Probability

Discrete Probability

Example 1: If I have two red beads and three blue beads inside an urn and I pick one at random, what is the probability of picking a red one?

## [1] 0.4

Monte Carlo Simulation

Random number generators permit us to mimic the process of picking at random. An example is the sample() function in R.

## [1] "red"  "red"  "blue" "blue" "blue"
## [1] "blue"

An example of Monte Carlo Simulation is that repeat the experiment a large enough number of times to make the results practically equivalent to doing it over and over forever. To perform our first Monte Carlo simulation, we use the replicate() function, which permits us to repeat the same task any number of times. Here, we repeat the random event B = 10,000 times

## events
## blue  red 
## 5902 4098
## events
##   blue    red 
## 0.5902 0.4098

The function sample has an argument that permits us to pick more than one element from the urn. However, by default, this selection occurs without replacement.

## [1] "blue" "red"  "red"  "blue" "blue"
## [1] "red"  "blue" "red"  "blue" "blue"
## [1] "blue" "blue" "red"  "blue" "red"
## events
##   blue    red 
## 0.5982 0.4018

Probability Distributions

If you’re are randomly calling likely voters from a population that has 44% Democrat, 44% Republican, 10% undecided, and 2% green, these proportions define the probability for each group. For this example, the probability distribution is simply these four proportions.

Independence

Every time we toss a fair coin, the probability of seeing heads is 1/2 regardless of what previous tosses have revealed.

Non-Independent Events

The first outcome affects the second. If we deal a King for the first card, and don’t replace it into the deck, the probabilities of a second card being a King is less because there are only three Kings left: the probability is 3 out of 51. more…

Assessment 1: Introduction to Discrete Probability

  1. Probability of cyan
    One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls. What is the probability that the ball will be cyan?
## [1] 0.2
  1. Probability of not cyan
    One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls. What is the probability that the ball will not be cyan?
## [1] 0.8
## [1] 0.8
  1. Sampling without replacement
    Instead of taking just one draw, consider taking two draws. You take the second draw without returning the first draw to the box. We call this sampling without replacement. What is the probability that the first draw is cyan and that the second draw is not cyan?
## [1] 0.1714286
  1. Sampling with replacement
    Now repeat the experiment, but this time, after taking the first draw and recording the color, return it back to the box and shake the box. We call this sampling with replacement. What is the probability that the first draw is cyan and that the second draw is not cyan?
## [1] 0.16

1.2 Combinations and Permutations

let’s start by constructing a deck of cards using R. For this, we will use the function expand.grid() and the function paste()

Here is how we generate a deck of cards:

With the deck constructed, we can now double check that the probability of a King in the first card is 1/13. We simply compute the proportion of possible outcomes that satisfy our condition:

## [1] 0.07692308

which is 1/13.

Now, how about the conditional probability of the second card being a King given that the first was a King? Earlier, we deduced that if one King is already out of the deck and there are 51 left, then this probability is 3/51. Let’s confirm by listing out all possible outcomes.

To do this, we can use the permutations() function from the gtools package. For any list of size n, this function computes all the different combinations we can get when we select r items. Here are all the ways we can choose two numbers from a list consisting of 1,2,3:

##       [,1] [,2]
##  [1,]    1    2
##  [2,]    1    3
##  [3,]    1    4
##  [4,]    1    5
##  [5,]    2    1
##  [6,]    2    3
##  [7,]    2    4
##  [8,]    2    5
##  [9,]    3    1
## [10,]    3    2
## [11,]    3    4
## [12,]    3    5
## [13,]    4    1
## [14,]    4    2
## [15,]    4    3
## [16,]    4    5
## [17,]    5    1
## [18,]    5    2
## [19,]    5    3
## [20,]    5    4

Optionally for this function permutations, we can add a vector. So for example, if you want to see 5 random 7-digit phone numbers out of all possible phone numbers, you could type code like this.

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    5    3    1    2    8    6    9
## [2,]    7    3    5    8    1    9    6
## [3,]    3    4    0    5    7    8    9
## [4,]    0    9    8    4    5    7    1
## [5,]    2    6    1    5    3    9    8

To compute all possible ways that we can choose 2 cards when the order matters, we simply type the following piece of code.

To compute all possible ways that we can choose 2 cards when the order matters, we simply type the following piece of code. Here we use permutations.

There’s 52 cards, we’re going to choose 2, and we’re going to select them out of the vector that includes our card names, which we called deck earlier. This is going to be a matrix with 2 dimensions, 2 columns, and in this case, it’s going to have 2,652 rows. Those are all the permutations.

Define the first card and the second card by grabbing the first and second columns using this simple piece of code.

## [1] 204
## [1] 0.05882353
## [1] 0.05882353

the difference between the permutations functions, which lists all permutations, and the combination function, where order does not matter.

##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## [3,]    2    1
## [4,]    2    3
## [5,]    3    1
## [6,]    3    2
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## [3,]    2    3

So to compute the probability of a Natural 21 in Blackjack, we can do this:

## [1] 0.04826546

In the last line, we assume the Ace comes first. This is only because we know the way combination enumerates possibilities and it will list this case first. But to be safe, we could have written this and produced the same answer:

## [1] 0.04826546

Monte Carlo example

Instead of using combinations to deduce the exact probability of a Natural 21, we can use a Monte Carlo to estimate this probability.

## [1] 0.04826546

In this case, we draw two cards over and over and keep track of how many 21s we get. We can use the function sample to draw two cards without replacements:

## [1] "Seven Diamonds" "Deuce Diamonds"

And then check if one card is an Ace and the other a face card or a 10. Going forward, we include 10 when we say face card. Now we need to check both possibilities:

## [1] FALSE

If we repeat this 10,000 times, we get a very good approximation of the probability of a Natural 21.

Let’s start by writing a function that draws a hand and returns TRUE if we get a 21. The function does not need any arguments because it uses objects defined in the global environment.

Here we do have to check both possibilities: Ace first or Ace second because we are not using the combinations function. The function returns TRUE if we get a 21 and FALSE otherwise:

## [1] FALSE

Now we can play this game, say, 10,000 times:

## [1] 0.0468

The Birthday Problem

If we assume this is a randomly selected group of 50 people, what is the chance that at least two people have the same birthday?
ere we use a Monte Carlo simulation. For simplicity, we assume nobody was born on February 29. This actually doesn’t change the answer much.

To check if in this particular set of 50 people we have at least two with the same birthday, we can use the function duplicated, which returns TRUE whenever an element of a vector is a duplicate. Here is an example:

## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
## [1] TRUE

In this case, we see that it did happen. At least two people had the same birthday.

To estimate the probability of a shared birthday in the group, we repeat this experiment by sampling sets of 50 birthdays over and over:

## [1] 0.9722

Were you expecting the probability to be this high?

People tend to underestimate these probabilities. To get an intuition as to why it is so high, think about what happens when the group size is close to 365. At this stage, we run out of days and the probability is one.

Say we want to use this knowledge to bet with friends about two people having the same birthday in a group of people. When are the chances larger than 50%? Larger than 75%?

Let’s create a look-up table. We can quickly create a function to compute this for any group size:

We can now make a plot of the estimated probabilities of two people having the same birthday in a group of size n :

Now let’s compute the exact probabilities rather than use Monte Carlo approximations. To make the math simpler, instead of computing the probability of it happening, we will compute the probability of it not happening. For this, we use the multiplication rule.

Let’s start with the first person. The probability that person 1 has a unique birthday is 1. The probability that person 2 has a unique birthday, given that person 1 already took one, is 364/365. Then, given that the first two people have unique birthdays, person 3 is left with 363 days to choose from. We continue this way and find the chances of all 50 people having a unique birthday is:

1 ? 364/365 ? 363/365 . 365???n+1/365

We can write a function that does this for any number:

How many Monte Carlo experiments are enough?

We know that the larger B, the better the approximation. But how big do we need it to be? This is actually a challenging question and answering it often requires advanced theoretical statistics training.

One practical approach we will describe here is to check for the stability of the estimate. The following is an example with the birthday problem for a group of 22 people.

In this plot, we can see that the values start to stabilize (that is, they vary less than .01) around 1000. Note that the exact probability, which we know in this case, is 0.569.

Assessment 2: Combinations and Permutations

  1. Independence
    Imagine you draw two balls from a box containing colored balls. You either replace the first ball before you draw the second or you leave the first ball out of the box when you draw the second ball.

Under which situation are the two draws independent of one another?

Remember that two events A and B are independent if Pr(A and B)=Pr(A)P(B).

A. You don’t replace the first ball before drawing the next.

B. You do replace the first ball before drawing the next.

C. Neither situation describes independent events.

D. Both situations describe independent events.

  1. Sampling with replacement
    Say you’ve drawn 5 balls from the a box that has 3 cyan balls, 5 magenta balls, and 7 yellow balls, with replacement, and all have been yellow.

What is the probability that the next one is yellow?

## [1] 0.4666667
  1. Rolling a die
    If you roll a 6-sided die once, what is the probability of not seeing a 6? If you roll a 6-sided die six times, what is the probability of not seeing a 6 on any roll?
## [1] 0.334898
  1. Probability the Celtics win a game
    Two teams, say the Celtics and the Cavs, are playing a seven game series. The Cavs are a better team and have a 60% chance of winning each game.

What is the probability that the Celtics win at least one game? Remember that the Celtics must win one of the first four games, or the series will be over!

  • Calculate the probability that the Cavs will win the first four games of the series.

  • Calculate the probability that the Celtics win at least one game in the first four games of the series.

## [1] 0.8704
  1. Monte Carlo simulation for Celtics winning a game
    Create a Monte Carlo simulation to confirm your answer to the previous problem by estimating how frequently the Celtics win at least 1 of 4 games. Use B <- 10000 simulations.

The provided sample code simulates a single series of four random games, simulated_games

  • Use the replicate function for B <- 10000 simulations of a four game series.
  • Within each simulation, replicate the sample code to simulate a four-game series named simulated_games.
  • Then, use the any function to indicate whether the four-game series contains at least one win for the Celtics.
  • Use the mean function to find the proportion of simulations that contain at least one win for the Celtics out of four games.
## [1] 0.8757

##1.3 Addition rule
The addition rule tells us that:

Pr(A or B) = Pr(A) + Pr(B) ??? Pr(A and B)

This rule is intuitive: think of a Venn diagram. If we simply add the probabilities, we count the intersection twice so we need to substract one instance.

The Monty Hall Problem

In the 1970s, there was a game show called “Let’s Make a Deal” and Monty Hall was the host. At some point in the game, contestants were asked to pick one of three doors. Behind one door there was a prize. The other doors had a goat behind them to show the contestant they had lost. After the contestant picked a door, before revealing whether the chosen door contained a prize, Monty Hall would open one of the two remaining doors and show the contestant there was no prize behind that door. Then he would ask “Do you want to switch doors?” What would you do?

We can use probability to show that if you stick with the original door choice, your chances of winning a prize remain 1 in 3. However, if you switch to the other door, your chances of winning double to 2 in 3! This seems counterintuitive. Many people incorrectly think both chances are 1 in 2 since you are choosing between 2 options. You can watch a detailed mathematical explanation of why this is here or read one here. Below we use a Monte Carlo simulation to see which strategy is better. Note that this code is written longer than it should be for pedagogical purposes.

Let’s start with the stick strategy:

## [1] 0.3388

As we write the code, we note that the lines starting with my_pick and show have no influence on the last logical operation since we stick to our original choice anyway. From this we should realize that the chance is 1 in 3, what we began with.

Now let’s repeat the exercise, but consider the switch strategy:

## [1] 0.6708

The Monte Carlo estimate confirms the 2/3 calculation. This helps us gain some insight by showing that we are removing a door, show, that is definitely not a winner from our choices. We also see that unless we get it right when we first pick, you win: 1 - 1/3 = 2/3.

Assessment 3: The Addition Rule and Monty Hall

  1. The Cavs and the Warriors
    Two teams, say the Cavs and the Warriors, are playing a seven game championship series. The first to win four games wins the series. The teams are equally good, so they each have a 50-50 chance of winning each game.

If the Cavs lose the first game, what is the probability that they win the series?

## [1] 0.34375
  1. The Cavs and the Warriors - Monte Carlo
    Confirm the results of the previous question with a Monte Carlo simulation to estimate the probability of the Cavs winning the series after losing the first game.
## [1] 0.3371
  1. A and B play a series - part 1
    Two teams, A and B, are playing a seven series game series. Team A is better than team B and has a p>0.5 chance of winning each game.

  1. A and B play a series - part 2
    Repeat the previous exercise, but now keep the probability that team A wins fixed at p <- 0.75 and compute the probability for different series lengths. For example, wins in best of 1 game, 3 games, 5 games, and so on through a series that lasts 25 games.

Section 2 Overview

Section 2 introduces you to Continuous Probability.

After completing Section 2, you will:

  • understand the differences between calculating probabilities for discrete and continuous data.
  • be able to use cumulative distribution functions to assign probabilities to intervals when dealing with continuous data.
  • be able to use R to generate normally distributed outcomes for use in Monte Carlo simulations.
  • know some of the useful theoretical continuous distributions in addition to the normal distribution, such as the student-t, chi-squared, exponential, gamma, beta, and beta-binomial distributions.

The textbook for this section is available here

Continuous Probability

Keep in mind that we have not yet introduced probability in the context of CDFs. Let’s do this by asking the following: if I pick one of the male students at random, what is the chance that he is taller than 70.5 inches? Because every student has the same chance of being picked, the answer to this is equivalent to the proportion of students that are taller than 70.5 inches. Using the CDF we obtain an answer by typing:

## [1] 0.3768473

Once a CDF is defined, we can use this to compute the probability of any subset. For instance, the probability of a student being between height a and height b is:

F(b)-F(a)

Theoretical Distribution

obtained with the function pnorm. Using the normal distribution:

The cumulative distribution for the normal distribution is defined by a mathematical formula, which in R can be obtained with the function pnorm. Using the normal distribution:

## [1] 0.371369
## [1] 0.114532
## [1] 0.1194581
## [1] 0.1219212

Note how close we get with the normal approximation:

## [1] 0.1031077
## [1] 0.1097121
## [1] 0.1081743

However, the approximation is not as useful for other intervals. For instance, notice how the approximation breaks down when we try to estimate:

## [1] 0.02216749
## [1] 0.08359562

In general, we call this situation discretization. Although the true height distribution is continuous, the reported heights tend to be more common at discrete values, in this case, due to rounding. As long as we are aware of how to deal with this reality, the normal approximation can still be a very useful tool.

Probability Density

The probability density at x is defined as the function, we’re going to call it little f of x, such that the probability distribution big F of a, which is the probability of x being less than or equal to a, is the integral of all values up to a of little f of x dx.

dnorm() is the probability density function for the normal distribution:

## [1] 0.03206008
## [1] 0.01990735

Monte Carlo simulations for continuous variables

R provides functions to generate normally distributed outcomes. Specifically, the rnorm function takes three arguments: size, average (defaults to 0), and standard deviation (defaults to 1) and produces random numbers. Here is an example of how we could generate data that looks like our reported heights:

This is one of the most useful functions in R as it will permit us to generate data that mimics natural events and answers questions related to what could happen by chance by running Monte Carlo simulations.

If, for example, we pick 800 males at random, what is the distribution of the tallest person? How rare is a seven footer in a group of 800 males? The following Monte Carlo simulation helps us answer that question:

## [1] 0.0214

Other Continuous Distributions

Other continuous distributions that we may encounter are the student-t, chi-squared, exponential, gamma, beta, and beta-binomial. R provides functions to compute the density, the quantiles, the cumulative distribution functions and to generate Monte Carlo simulations. R uses a convention that lets us remember the names, namely using the letters d, q, p and r in front of a shorthand for the distribution. We have already seen the functions dnorm, pnorm and rnorm for the normal distribution. The functions qnorm gives us the quantiles. We can therefore draw a distribution like this:

For example, for the student-t, described later in Section 17.10, the shorthand t is used so the functions are dt for the density, qt for the quantiles, pt for the cumulative distribution function, and rt for Monte Carlo simulation.

Assessment 4: Continuous Probability

  1. Distribution of female heights - 1
    Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is 5 feet or shorter?
  • Use pnorm to define the probability that a height will take a value less than 5 feet given the stated distribution.
## [1] 0.09121122
  1. Distribution of female heights - 2
    Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is 6 feet or taller?
  • Use pnorm to define the probability that a height will take a value of 6 feet or taller.
## [1] 0.003830381
  1. Distribution of female heights - 3
    Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is between 61 and 67 inches?
  • Use pnorm to define the probability that a randomly chosen woman will be shorter than 67 inches.
  • Subtract the probability that a randomly chosen will be shorter than 61 inches.
## [1] 0.6826895
  1. Distribution of female heights - 4
    Repeat the previous exercise, but convert everything to centimeters. That is, multiply every height, including the standard deviation, by 2.54. What is the answer now?
  • Convert the average height and standard deviation to centimeters by multiplying each value by 2.54.
  • Repeat the previous calculation using pnorm to define the probability that a randomly chosen woman will have a height between 61 and 67 inches, converted to centimeters by multiplying each value by 2.54.
## [1] 0.6826895
  1. Probability of 1 SD from average
    Compute the probability that the height of a randomly chosen female is within 1 SD from the average height.
  • Calculate the values for heights one standard deviation taller and shorter than the average.
  • Calculate the probability that a randomly chosen woman will be within 1 SD from the average height.
## [1] 0.6826895
  1. Distribution of male heights
    Imagine the distribution of male adults is approximately normal with an expected value of 69 inches and a standard deviation of 3 inches. How tall is a male in the 99th percentile?
  • Determine the height of a man in the 99th percentile, given an average height of 69 inches and a standard deviation of 3 inches.
## [1] 75.97904
  1. Distribution of IQ scores
    The distribution of IQ scores is approximately normally distributed. The expected value is 100 and the standard deviation is 15. Suppose you want to know the distribution of the person with the highest IQ in your school district, where 10,000 people are born each year.

Generate 10,000 IQ scores 1,000 times using a Monte Carlo simulation. Make a histogram of the highest IQ scores.

  • Use the function rnorm to generate a random distribution of 10,000 values with a given average and standard deviation.
  • Use the function max to return the largest value from a supplied vector.
  • Repeat the previous steps a total of 1,000 times. Store the vector of the top 1,000 IQ scores as highestIQ.
  • Plot the histogram of values using the function hist.

Section 3 Overview

Section 3 introduces you to Random Variables, Sampling Models, and the Central Limit Theorem.

Section 3 is divided into two parts:

  • Random Variables and Sampling Models
  • The Central Limit Theorem.

After completing Section 3, you will:

  • understand what random variables are, how to generate them, and the correct mathematical notation to use with them.
  • be able to use sampling models to estimate characteristics of a larger population.
  • be able to explain the difference between a distribution and a probability distribution.
  • understand the Central Limit Theorem and the law of large numbers.

The textbook for this section is available here

Random variables

Random variables are numeric outcomes resulting from random processes. We can easily generate random variables using some of the simple examples we have shown. For example, define X to be 1 if a bead is blue and red otherwise:

## [1] 1
## [1] 1
## [1] 1

Sampling Models

For example, we can model the process of polling likely voters as drawing 0’s- Republicans- and 1’s- Democrats- from an urn containing the 0 and 1 code for all likely voters.

In epidemiological studies, we often assume that the subjects in our study are a random sample from the population of interest. The data related to a specific outcome can be modeled as a random sample

Similarly, in experimental research, we often assume that the individual organisms we are studying- for example, worms, flies, or mice- are a random sample from a larger population.

Casino games offer a plethora of examples of real-world situations in which sampling models are used to answer specific questions.

Roulette wheel example

## [1] 38

Running a Monte Carlo Simulation with the above example:

## [1] 0.0492
## [1] 0.0438

In the histogram above, we see that the distribution appears to be approximately normal. A qq-plot will confirm that the normal approximation is close to perfect. If, in fact, the distribution is normal, then all we need to define the distribution is the average and the standard deviation. Because we have the original values from which the distribution is created, we can easily compute these:

## [1] 52.842
## [1] 31.50489

Distributions versus Probability Distributions

Previously we described how any list of numbers, let’s call it x1 through xn,has a distribution. The definition is quite straightforward. We define capital F of a as a function that answers the question, what proportion of the list is less than or equal to a. Because they are useful summaries, when the distribution is approximately normal, we define the average and the standard deviation. These are defined with a straightforward operation of the list. In r we simply compute the average and standard deviation this way,

A random variable x has a distribution function. To define this, we do not need a list of numbers. It’s a theoretical concept. In this case, to define the distribution, we define capital F of a as a function that answers the question, what is the probability that x is less than or equal to a. There is no list of numbers. However, if x is defined by drawing from an urn with numbers in it, then there is a list, the list of numbers inside the urn. In this case, the distribution of that list is the probability distribution of x and the average and standard deviation of that list are the expected value and standard errors of the random variable. Another way to think about it that does not involve an urn is to run a Monte Carlo simulation and generate a very large list of outcomes of x. These outcomes are a list of numbers. The distribution of this list will be a very good approximation of the probability distribution of x. The longer the list, the better the approximation. The average and standard deviation of this list will approximate the expected value and standard error of the random variable.

Central Limit Theorem (CLT)

The Central Limit Theorem-or the CLT for short tells us that when the number of independent draws-also called sample size-is large, the probability distribution of the sum of these draws is approximately normal.

Assessment 5: Random Variables and Sampling Models

  1. American Roulette probabilities
    An American roulette wheel has 18 red, 18 black, and 2 green pockets. Each red and black pocket is associated with a number from 1 to 36. The two remaining green slots feature “0” and “00”. Players place bets on which pocket they think a ball will land in after the wheel is spun. Players can bet on a specific number (0, 00, 1-36) or color (red, black, or green).

What are the chances that the ball lands in a green pocket?

## [1] 0.05263158
  1. American Roulette payout
    In American roulette, the payout for winning on green is $17. This means that if you bet $1 and it lands on green, you get $17 as a prize.

Create a model to predict your winnings from betting on green one time.

  • Use the sample function return a random value from a specified range of values.
  • Use the prob = argument in the sample function to specify a vector of probabilities for returning each of the values contained in the vector of values being sampled.
  • Take a single sample (n = 1).
## [1] -1
  1. American Roulette expected value
    In American roulette, the payout for winning on green is $17. This means that if you bet $1 and it lands on green, you get $17 as a prize.In the previous exercise, you created a model to predict your winnings from betting on green.

Now, compute the expected value of X, the random variable you generated previously.

  • Using the chances of winning $17 (p_green) and the chances of losing $1 (p_not_green), calculate the expected outcome of a bet that the ball will land in a green pocket.
## [1] -0.05263158
  1. American Roulette standard error
    The standard error of a random variable X tells us the difference between a random variable and its expected value. You calculated a random variable X in exercise 2 and the expected value of that random variable in exercise 3.

Now, compute the standard error of that random variable, which represents a single outcome after one spin of the roulette wheel.

  • Compute the standard error of the random variable you generated in exercise 2, or the outcome of any one spin of the roulette wheel.
  • Recall that the payout for winning on green is $17 for a $1 bet.
## [1] 4.019344
  1. American Roulette sum of winnings
    You modeled the outcome of a single spin of the roulette wheel, X, in exercise 2.

Now create a random variable S that sums your winnings after betting on green 1,000 times.

  • Use set.seed to make sure the result of your random operation matches the expected answer for this problem.
  • Specify the number of times you want to sample from the possible outcomes.
  • Use the sample function to return a random value from a vector of possible values.
  • Be sure to assign a probability to each outcome and to indicate that you are sampling with replacement.
## [1] -10
  1. American Roulette winnings expected value
    In the previous exercise, you generated a vector of random outcomes, S, after betting on green 1,000 times.

What is the expected value of S?

  • Using the chances of winning $17 (p_green) and the chances of losing $1 (p_not_green), calculate the expected outcome of a bet that the ball will land in a green pocket over 1,000 bets.
## [1] -52.63158
  1. American Roulette winnings expected value
    You generated the expected value of S, the outcomes of 1,000 bets that the ball lands in the green pocket, in the previous exercise.

What is the standard error of S?

  • Compute the standard error of the random variable you generated in exercise 5, or the outcomes of 1,000 spins of the roulette wheel.
## [1] 127.1028

The Central Limit Theorem Continued

Property 1: The first, is that the expected value of a sum of random variables is the sum of the expected values of the individual random variables.

Prperty 2: is that the expected value of random variables times a non-random constant is the expected value times that non-random constant.

Property 3: is that the square of the standard error of the sum of independent random variables is the sum of the square of the standard error of each random variable.

Property 4: is that the standard error of random variables times a non-random constant is the standard error times a non-random constant.

Law of Large Numbers

An important implication of the final result is that the standard error of the average becomes smaller and smaller as n grows larger. When n is very large, then the standard error is practically 0 and the average of the draws converges to the average of the urn. This is known in statistical textbooks as the law of large numbers or the law of averages.

How Large is Large in CLT?

In the lottery, the chance of winning are less than 1 in a million. Thousands of people play, so the number of draws is very large. So the central limit should work. Yet, the number of winners, the sum of the draws, range between 0, and in very extreme cases, four. This sum is certainly not well approximated by the normal distribution. So the central limit theorem doesn’t apply, even with a very large sample size. This is generally true when the probability of success is very low. In these cases, the Poisson distribution is more appropriate.

Assessment 6: The Central Limit Theorem

  1. American Roulette probability of winning money
    The exercises in the previous chapter explored winnings in American roulette. In this chapter of exercises, we will continue with the roulette example and add in the Central Limit Theorem.

In the previous chapter of exercises, you created a random variable S that is the sum of your winnings after betting on green a number of times in American Roulette.

What is the probability that you end up winning money if you bet on green 100 times?

  • Execute the sample code to determine the expected value avg and standard error se as you have done in previous exercises.
  • Use the pnorm function to determine the probability of winning money.
## [1] 0.4479091
  1. American Roulette Monte Carlo simulation
    Create a Monte Carlo simulation that generates 10,000 outcomes of S, the sum of 100 bets.

Compute the average and standard deviation of the resulting list and compare them to the expected value (-5.263158) and standard error (40.19344) for S that you calculated previously.

  • Use the replicate function to replicate the sample code for B <- 10000 simulations.
  • Within replicate, use the sample function to simulate n <- 100 outcomes of either a win (17)oraloss(-1) for the bet. Use the order c(17, -1) and corresponding probabilities. Then, use the sum function to add up the winnings over all iterations of the model. Make sure to include sum or DataCamp may crash with a “Session Expired” error.
  • Use the mean function to compute the average winnings.
  • Use the sd function to compute the standard deviation of the winnings.
## [1] -5.9086
## [1] 40.30608
  1. American Roulette Monte Carlo vs CLT
    In this chapter, you calculated the probability of winning money in American roulette using the CLT.

Now, calculate the probability of winning money from the Monte Carlo simulation. The Monte Carlo simulation from the previous exercise has already been pre-run for you, resulting in the variable S that contains a list of 10,000 simulated outcomes.

  • Use the mean function to calculate the probability of winning money from the Monte Carlo simulation, S.
## [1] 0.4232
  1. American Roulette Monte Carlo vs CLT comparison
    The Monte Carlo result and the CLT approximation for the probability of losing money after 100 bets are close, but not that close. What could account for this?

A. 10,000 simulations is not enough. If we do more, the estimates will match.

B. The CLT does not work as well when the probability of success is small.

C. The difference is within rounding error.

D. The CLT only works for the averages.

  1. American Roulette average winnings per bet
    Now create a random variable Y that contains your average winnings per bet after betting on green 10,000 times.
  • Run a single Monte Carlo simulation of 10,000 bets using the following steps. (You do not need to replicate the sample code.)
  • Specify n as the number of times you want to sample from the possible outcomes.
  • Use the sample function to return n values from a vector of possible values: winning $17 or losing $1. Be sure to assign a probability to each outcome and indicate that you are sampling with replacement.
  • Calculate the average result per bet placed using the mean function.
## [1] 0.008
  1. American Roulette per bet expected value
    What is the expected value of Y, the average outcome per bet after betting on green 10,000 times?
  • Using the chances of winning $17 (p_green) and the chances of losing $1 (p_not_green), calculate the expected outcome of a bet that the ball will land in a green pocket.
  • Print this value to the console (do not assign it to a variable).
## [1] -0.05263158
  1. American Roulette per bet standard error
    What is the standard error of Y, the average result of 10,000 spins?
  • Compute the standard error of Y, the average result of 10,000 independent spins.
## [1] 0.04019344
  1. American Roulette winnings per game are positive
    What is the probability that your winnings are positive after betting on green 10,000 times?
  • Execute the code that we wrote in previous exercises to determine the average and standard error.
  • Use the pnorm function to determine the probability of winning more than $0.
## [1] 0.0951898
  1. American Roulette Monte Carlo again
    Create a Monte Carlo simulation that generates 10,000 outcomes of S, the average outcome from 10,000 bets on green.

Compute the average and standard deviation of the resulting list to confirm the results from previous exercises using the Central Limit Theorem.

  • Use the replicate function to model 10,000 iterations of a series of 10,000 bets.
  • Each iteration inside replicate should simulate 10,000 bets and determine the average outcome of those 10,000 bets. If you forget to take the mean, DataCamp will crash with a “Session Expired” error.
  • Find the average of the 10,000 average outcomes.
  • Compute the standard deviation of the 10,000 simulations.
## [1] -0.05223142
## [1] 0.03996168
  1. American Roulette comparison
    In a previous exercise, you found the probability of winning more than $0 after betting on green 10,000 times using the Central Limit Theorem. Then, you used a Monte Carlo simulation to model the average result of betting on green 10,000 times over 10,000 simulated series of bets.

What is the probability of winning more than $0 as estimated by your Monte Carlo simulation? The code to generate the vector S that contains the the average outcomes of 10,000 bets modeled 10,000 times has already been run for you.

  • Calculate the probability of winning more than $0 in the Monte Carlo simulation from the previous exercise.
  • You do not need to run another simulation: the results of the simulation are in your workspace as the vector S.
## [1] 0.0977
  1. American Roulette comparison analysis
    The Monte Carlo result and the CLT approximation are now much closer than when we calculated the probability of winning for 100 bets on green. What could account for this difference?

A. We are now computing averages instead of sums.

B. 10,000 Monte Carlo simulations was not enough to provide a good estimate.

C. The CLT works better when the sample size is larger.

D. It is not closer. The difference is within rounding error.

Section 4 Overview

Section 4 introduces you to the Big Short.

After completing Section 4, you will:

  • understand the relationship between sampling models and interest rates as determined by banks.
  • understand how interest rates can be set to minimize the chances of the bank losing money.
  • understand how inappropriate assumptions of independence contributed to the financial meltdown of 2007.

The textbook for this section is available here

Interest Rates Explained

Suppose your bank will give out 1,000 loans for 180,000 this year. Also suppose that your bank loses, after adding up all the costs, $200,000 per foreclosure.For simplicity, we assume that that includes all operational costs. A sampling model for this scenario is coded like this.

## [1] -5e+06

We either default and lose money, or not default and not lose money. If we run the simulation we see that we lose $2.8 millions. Note that the total loss defined by the final sum is a random variable. Every time you run the code you get a different answer. This is because it’s a probability of defaulting. It’s not going to happen for sure. We can easily construct a Monte Carlo simulation to get an idea of the distribution of this random variable. Here’s the distribution.

We don’t really need a Monte Carlo simulation though. Using what we’ve learned, the CLT tells us that because our losses are a sum of independent draws, its distribution is approximately normal with expected value and standard deviation given by the following formula.

## [1] -4e+06
## [1] 885437.7
## [1] 6249.181
## [1] 0.03471767
## [1] 2124.198
## [1] 2124198

we get that the x has to be about 6,249, which is an interest rate of about 3%, which is still pretty good. Note also that by choosing this interest rate, we now have an expected profit per loan of about $2,124, which is a total expected profit of about $2 million. We can run a Monte Carlo simulation and check our theoretical approximation. We do that, and we indeed get that value again. And again, the probability of profit being less than zero according to the Monte Carlo simulation is about 1%.

## [1] 2122535
## [1] 0.01237

###The Big Short
One of our employees points out that since the bank is making about $2,000 per loan, that you should give out more loans. Why just n? You explain that finding those n clients was hard. You need a group that is predictable, and that keeps the chances of defaults low. He then points out that even if the probability of default is higher, as long as your expected value is positive, you can minimize your chances of losing money by increasing n, the number of loans, and relying on the law of large numbers. He claims that even if the default rate is twice as high, say 4%, if we set the rate just a bit higher so that this happens, you will get a positive expected value. So if we set the interest rate at 5%, we are guaranteed a positive expected value of $640 per loan. And we can minimize our chances of losing money by simply increasing the number of loans, since the probability of S being less than 0 is equal to the probability of Z being less than negative expected value of S divided by standard error S, with Z a standard normal random variable, as we saw earlier. And if we do the math, we will see that we can pick an n so that this probability is 1 in 100.

## [1] 640
## [1] 22163
## [1] 14184320
## [1] 14090646
## [1] 14109853
## [1] 0.3479
## [1] 0.2422

Assessment7: The Big Short

  1. Bank earnings
    Say you manage a bank that gives out 10,000 loans. The default rate is 0.03 and you lose $200,000 in each foreclosure.

Create a random variable S that contains the earnings of your bank. Calculate the total amount of money lost in this scenario.

  • Using the sample function, generate a vector called defaults that contains n samples from a vector of c(0,1), where 0 indicates a payment and 1 indicates a default
  • Multiply the total number of defaults by the loss per foreclosure.
## [1] -6.3e+07
  1. Bank earnings Monte Carlo
    Run a Monte Carlo simulation with 10,000 outcomes for S, the sum of losses over 10,000 loans. Make a histogram of the results.
  • Within a replicate loop with 10,000 iterations, use sample to generate a list of 10,000 loan outcomes: payment (0) or default (1). Use the outcome order c(0,1) and probability of default p_default.
  • Still within the loop, use the function sum to count the number of foreclosures multiplied by loss_per_foreclosure to return the sum of all losses across the 10,000 loans. If you do not take the sum inside the replicate loop, DataCamp may crash with a “Session Expired” error.
  • Plot the histogram of values using the function hist.

  1. Bank earnings expected value
    What is the expected value of S, the sum of losses over 10,000 loans? For now, assume a bank makes no money if the loan is paid.
  • Using the chances of default (p_default), calculate the expected losses over 10,000 loans.
## [1] -6e+07
  1. Bank earnings standard error
    What is the standard error of S?
  • Compute the standard error of the random variable S you generated in the previous exercise, the summed outcomes of 10,000 loans.
## [1] 3411744
  1. Bank earnings interest rate - 1
    So far, we’ve been assuming that we make no money when people pay their loans and we lose a lot of money when people default on their loans. Assume we give out loans for $180,000. How much money do we need to make when people pay their loans so that our net loss is $0?

In other words, what interest rate do we need to charge in order to not lose money?

  • If the amount of money lost or gained equals 0, the probability of default times the total loss per default equals the amount earned per probability of the loan being paid.
  • Divide the total amount needed per loan by the loan amount to determine the interest rate.
## [1] 0.03436426
  1. Bank earnings interest rate - 2
    With the interest rate calculated in the last example, we still lose money 50% of the time. What should the interest rate be so that the chance of losing money is 1 in 20?

In math notation, what should the interest rate be so that Pr(S<0)=0.05?

Remember that we can add a constant to both sides of the equation to get:
Bank_earnings_interest_rate

  • Use the qnorm function to compute a continuous variable at given quantile of the distribution to solve for z.
  • In this equation, l, p, and n are known values. Once you’ve solved for z, solve for x.
  • Divide x by the loan amount to calculate the rate.
## [1] 0.03768738
  1. Bank earnings - minimize money loss
    The bank wants to minimize the probability of losing money. Which of the following achieves their goal without making interest rates go up?

A. A smaller pool of loans

B. A larger probability of default

C. A reduced default rate

D. A larger cost per loan default