PH125.3x: Data Science: Probability
- School: EDX, HarvardX
- Course Instructor: Rafael Irizarry
Abstract
This is the third in a series of courses in a Professional Certificate in Data Science program. The courses in the Professional Certificate program are designed to prepare you to do data analysis in R, from simple computations to machine learning. The courses are designed to be taken in order. The prerequisites for this course are courses 1 and 2 of the series, or equivalent knowledge of basic R coding and data visualization. If you need to learn or refresh some basic R, check out Data Science: R Basics, the first course in this series.
The textbook for the Data Science course series is freely available online. This course corresponds to the Probability section of the textbook.
This course assumes you are comfortable with basic math, algebra, and logical operations. HarvardX has partnered with DataCamp for all assignments in R that allow you to program directly in a browser-based interface. You will not need to download any special software.
Using a combination of a guided introduction through short video lectures and more independent in-depth exploration, you will get to practice your new R skills on real-life applications.
Probability theory is the mathematical foundation of statistical inference which is indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn important concepts in probability theory. The motivation for this course is the circumstances surrounding the financial crisis of 2007-2008. Part of what caused this financial crisis was that the risk of certain securities sold by financial institutions was underestimated. To begin to understand this very complicated event, we need to understand the basics of probability. We will introduce important concepts such as random variables, independence, Monte Carlo simulations, expected values, standard errors, and the Central Limit Theorem. These statistical concepts are fundamental to conducting statistical tests on data and understanding whether the data you are analyzing are likely occurring due to an experimental method or to chance. Statistical inference, covered in the next course in this series, builds upon probability theory.
Learning Objectives:
- Important concepts in probability theory including random variables and independence
- How to perform a Monte Carlo simulation
- The meaning of expected values and standard errors and how to compute them in R
- The importance of the Central Limit Theorem
Course Outline:
Section 1: Discrete Probability You will learn about basic principles of probability related to categorical data using card games as examples.
Section 2: Continuous Probability You will learn about basic principles of probability related to numeric and continuous data.
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem You will learn about random variables (numeric outcomes resulting from random processes), how to model data generation procedures as draws from an urn, and the Central Limit Theorem, which applies to large sample sizes.
Section 4: The Big Short You will learn how interest rates are determined and how some bad assumptions led to the financial crisis of 2007-2008.
Section 1: Discrete Probability
Section 1 introduces you to Discrete Probability. Section 1 is divided into three parts:
- Introduction to Discrete Probability
- Combinations and Permutations
- Addition Rule and Monty Hall
After completing Section 1, you will be able to:
- Apply basic probability theory to categorical data.
- Perform a Monte Carlo simulation to approximate the results of repeating an experiment over and over, including simulating the outcomes in the Monty Hall problem.
- Distinguish between: sampling with and without replacement, events that are and are not independent, and combinations and permutations.
- Apply the multiplication and addition rules, as appropriate, to calculate the probability of multiple events occurring.
- Use sapply instead of a for loop to apply a function element-wise to a vector.
1.1 Introduction to Discrete Probability
When data are categorical, this subset of probability is referred to as discrete probability. Card games are a good illustration of this kind of probability.
1.1.1 Discrete Probability
Ex 1: two red beads and three blue beads:
red <- 2
blue <- 3
1.1.2 Monte Carlo Simulation
To pick at random in R, you can use the sample() function:
beads <- rep(c("red", "blue"), times = c(2,3))
beads## [1] "red" "red" "blue" "blue" "blue"
#using the sample function for picking one at random
sample(beads,1)## [1] "blue"
A Monte Carlo simulation is the equivalent of repeating the experiment over and over, enough times that the results are practically the same as repeating it forever. To perform our first Monte Carlo simulation we use the replicate() function.
B <- 10000
events <- replicate(B, sample(beads, 1))
#looking at the number of incidents of an event
tab <- table(events)
tab## events
## blue red
## 5964 4036
# to look at the proportions:
prop.table(tab)## events
## blue red
## 0.5964 0.4036
Doing the same experiment with replacement
events <- sample(beads, B, replace = TRUE)
prop.table(table(events))## events
## blue red
## 0.604 0.396
1.1.3 Probability Distributions
Every time we toss a fair coin, the probability of seeing heads is 1/2 regardless of what previous tosses have revealed.
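As a quick illustration (my own sketch, not course code), we can simulate independent fair-coin tosses with sample() and see the proportion of heads come out near 1/2:
# 10,000 independent tosses of a fair coin
tosses <- sample(c("heads", "tails"), 10000, replace = TRUE)
mean(tosses == "heads") # close to 0.5; each toss ignores all previous tosses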
1.1.4 Independence
# how to get blackjack:
# first we calculate the probability of getting an ace
ace <- 4/52 # four chances out of 52
# then we calculate the chance of getting a card worth 10 (a ten or face card: 16 cards) out of the remaining 51 cards
ten <- 4*4/(52-1)
# Then we can calculate the probability of black jack as :
blackjack <- ace * ten
blackjack ## [1] 0.02413273
A famously wrong use of the multiplication rule in a court case:
prob_mustache <- 0.2
prob_beard <- 0.1
prob_both <- prob_mustache * prob_beard
prob_both## [1] 0.02
This is, however, very wrong, because having a beard and having a mustache are not independent events (the conditional probability of a mustache given a beard is about 95%).
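A sketch of the corrected calculation using the general multiplication rule, Pr(A and B) = Pr(B) * Pr(A | B), with the 95% conditional probability quoted above:
prob_beard <- 0.1
prob_mustache_given_beard <- 0.95 # conditional probability quoted above
prob_both <- prob_beard * prob_mustache_given_beard
prob_both # 0.095, nearly five times the incorrect "independent" answer of 0.02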
Assessment: Introduction to Discrete Probability
Exercise 1: Probability of cyan
One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls.
What is the probability that the ball will be cyan?
cyan <- 3
magenta <- 5
yellow <- 7
balls <- cyan + magenta + yellow
p_cyan <- cyan / balls
p_cyan## [1] 0.2
Exercise 2: Probability of not cyan
One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls.
What is the probability that the ball will not be cyan?
p_not_cyan <- 1- p_cyan
p_not_cyan## [1] 0.8
Exercise 3: Sampling without replacement
Instead of taking just one draw, consider taking two draws. You take the second draw without returning the first draw to the box. We call this sampling without replacement.
What is the probability that the first draw is cyan and that the second draw is not cyan?
without <- 3/15 * (1-2/14)
without## [1] 0.1714286
Exercise 4: Sampling with replacement
Now repeat the experiment, but this time, after taking the first draw and recording the color, return it back to the box and shake the box. We call this sampling with replacement.
What is the probability that the first draw is cyan and that the second draw is not cyan?
with <- 3/15 * (1-3/15)
with## [1] 0.16
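A Monte Carlo check of this answer (my own sketch, using the same box of 15 balls):
set.seed(1)
box <- rep(c("cyan", "other"), times = c(3, 12))
results <- replicate(10000, {
first <- sample(box, 1)
second <- sample(box, 1) # with replacement: the second draw sees the full box
first == "cyan" && second != "cyan"
})
mean(results) # should be close to 3/15 * 12/15 = 0.16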
So, for example, if you want to see 5 random 7-digit phone numbers out of all possible phone numbers, you can use permutations(). We define a vector of digits that goes from 0 to 9 rather than 1 through 10; the four lines of code shown later in this section generate all phone numbers, pick 5 at random, and display them.
#install.packages("permutations")
library(permutations)##
## Attaching package: 'permutations'
## The following object is masked from 'package:stats':
##
## cycle
1.2 Combinations and Permutations
Let's start by creating a deck of cards in R using two functions: expand.grid() and paste().
# how to generate a deck of cards
suits <- c("Diamonds", "Clubs", "Hearts","Spades")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten",
"Jack", "Queen", "King" )
deck <- expand.grid(number=numbers, suit=suits)
deck <- paste(deck$number, deck$suit)
Did we construct the right deck? We create a vector that contains the four ways we can get a king and check that the probability of drawing a king is 1 in 13:
king <- paste("King", suits)
mean(deck %in% king)## [1] 0.07692308
print("checking: ")## [1] "checking: "
1/13## [1] 0.07692308
What is the chance of drawing two kings in a row? (4/52) * (3/51). For this we can use the combinations() and permutations() functions from the gtools package. permutations() computes, for any list of size n, all the different ways we can select r items.
# here all the ways we can choose 2 numbers from the list 1, 2, 3, 4, 5. Notice that the order matters.
#So 3, 1 is different from 1, 3, and both appear in our permutations.
library(gtools)
permutations(5,2)## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 1 4
## [4,] 1 5
## [5,] 2 1
## [6,] 2 3
## [7,] 2 4
## [8,] 2 5
## [9,] 3 1
## [10,] 3 2
## [11,] 3 4
## [12,] 3 5
## [13,] 4 1
## [14,] 4 2
## [15,] 4 3
## [16,] 4 5
## [17,] 5 1
## [18,] 5 2
## [19,] 5 3
## [20,] 5 4
# 1,1 and 2,2 etc. are not in the list because once we pick a number we can't pick it again
all_phone_numbers <- permutations(10, 7, v=0:9)
n <- nrow(all_phone_numbers)
index <- sample(n, 5)
all_phone_numbers[index,]## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 4 9 7 5 0 3 8
## [2,] 3 7 9 5 8 4 2
## [3,] 3 8 7 2 9 0 6
## [4,] 2 4 3 6 5 7 1
## [5,] 8 0 7 3 1 6 9
To do this for card selection:
hands <- permutations(52, 2, v = deck)
52 * 51## [1] 2652
Check how many hands have a king as the first card:
firstCard <- hands[,1]
secondCard<-hands[,2]
sum(firstCard %in% king)## [1] 204
In what fraction of the hands with a king as the first card is the second card also a king? This is the conditional probability Pr(second is king | first is king):
sum(firstCard %in% king & secondCard %in% king) / sum(firstCard %in% king)## [1] 0.05882353
print("checking: ")## [1] "checking: "
3/51## [1] 0.05882353
The difference between permutations (order matters) and combinations (order doesn't matter):
permutations(3,2)## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 2 1
## [4,] 2 3
## [5,] 3 1
## [6,] 3 2
combinations(3,2)## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 2 3
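As a sanity check (my own addition), these counts match the standard counting formulas:
nrow(permutations(3, 2)) # 6 = 3!/(3-2)! ordered pairs
nrow(combinations(3, 2)) # 3 = choose(3, 2) unordered pairs
choose(3, 2) # R's built-in binomial coefficient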
So to compute the probability of a natural 21 in blackjack, we can do this. We can define a vector that includes all the aces, a vector that includes all the face cards, then we generate all the combinations of picking 2 cards out of 52, and then we simply count. How often do we get aces and a face card?
aces <- paste("Ace", suits)
aces## [1] "Ace Diamonds" "Ace Clubs" "Ace Hearts" "Ace Spades"
facecard <- c("King", "Queen", "Jack", "Ten")
facecard <- expand.grid(number=facecard, suit=suits)
#facecard
facecard <- paste(facecard$number, facecard$suit)
#facecard
#Generate all hands for a deck:
hands <- combinations(52, 2, v=deck)
mean(hands[,1] %in% aces & hands[,2] %in% facecard)## [1] 0.04826546
The code below takes into account that the ace and the face card can come in either order:
mean((hands[,1] %in% aces & hands[,2] %in% facecard) |
(hands[,2] %in% aces & hands[,1] %in% facecard))## [1] 0.04826546
Use sample() to draw a hand of two cards:
hand <- sample(deck, 2)
hand## [1] "King Spades" "Nine Hearts"
Doing a Monte Carlo simulation to check, testing the sampled hand each time:
B <- 10000
results <- replicate(B, {
hand <- sample(deck, 2)
(hand[1] %in% aces & hand[2] %in% facecard) |
(hand[2] %in% aces & hand[1] %in% facecard)
})
mean(results) # approximately 0.048
1.2.2 The Birthday problem
In a class with 50 students, what is the probability that at least two have the same birthday (February 29 is excluded)?
n <- 50
bdays <- sample(1:365, n, replace=TRUE)
# to check we can use duplicated
duplicated(c(1,2,3,1,4,3,5))## [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
So to check for duplicated birthdays:
any(duplicated(bdays))## [1] TRUE
B <- 10000
results <- replicate(B,{
bdays <- sample(1:365, n, replace = TRUE)
any(duplicated(bdays))
})
mean(results)## [1] 0.9688
1.2.3 Sapply
For the birthday problem we can try to find out how big the group of people must be for the chance of two sharing a birthday to exceed 50%, 75%, and so on.
compute_prob <- function(n, B=10000) {
same_day <- replicate(B,{
bdays <- sample(1:365, n, replace = TRUE)
any(duplicated(bdays))
})
mean(same_day)
}
n <- seq(1,60)
prob <- sapply(n, compute_prob)
plot(n, prob)
The exact probability can be computed with the multiplication rule:
P(person 1 has a unique birthday) = 1
P(person 2 has a unique birthday | person 1 has a unique birthday) = 364/365
P(person 3 has a unique birthday | persons 1 and 2 have unique birthdays) = 363/365
exact_prob <- function(n){
prob_unique <- seq(365, 365-n+1) / 365
1 - prod(prob_unique) # prod() multiplies the conditional probabilities together
}
eprob <- sapply(n, exact_prob)
plot(n, prob)
lines(n, eprob, col="red")
1.2.4 How many Monte Carlo experiments are enough?
We use the birthday experiment with different Monte Carlo sample sizes:
B <- 10^seq(1,5, len = 100)
compute_prob <- function(B, n=22){
same_day <- replicate(B, {
bdays <- sample(1:365, n, replace=TRUE)
any(duplicated(bdays))
})
mean(same_day)
}
prob <- sapply(B, compute_prob)
plot(B, prob)
Exercise 1. Independence
Imagine you draw two balls from a box containing colored balls. You either replace the first ball before you draw the second or you leave the first ball out of the box when you draw the second ball.
Under which situation are the two draws independent of one another?
Remember that two events A and B are independent if Pr(A and B)=Pr(A)P(B).
Answer: You do replace the first ball before drawing the next.
Exercise 2. Sampling with replacement
Say you've drawn 5 balls from a box that has 3 cyan balls, 5 magenta balls, and 7 yellow balls, with replacement, and all have been yellow.
What is the probability that the next one is yellow?
cyan <- 3
magenta <- 5
yellow <- 7
# Assign the variable 'p_yellow' as the probability that a yellow ball is drawn from the box.
p_yellow <- yellow/(cyan+magenta+yellow)
# Using the variable 'p_yellow', calculate the probability of drawing a yellow ball on the sixth draw. Print this value to the console.
p_yellow## [1] 0.4666667
Exercise 3. Rolling a die
If you roll a 6-sided die once, what is the probability of not seeing a 6? If you roll a 6-sided die six times, what is the probability of not seeing a 6 on any roll?
# Assign the variable 'p_no6' as the probability of not seeing a 6 on a single roll.
p_no6 <- 5/6
# Calculate the probability of not seeing a 6 on six rolls.
p_no6^6## [1] 0.334898
Exercise 4. Probability the Celtics win a game
Two teams, say the Celtics and the Cavs, are playing a seven game series. The Cavs are a better team and have a 60% chance of winning each game.
What is the probability that the Celtics win at least one game? Remember that the Celtics must win one of the first four games, or the series will be over!
# Assign the variable `p_cavs_win4` as the probability that the Cavs will win the first four games of the series.
p_cavs_win4 <- 0.6^4
# Using the variable `p_cavs_win4`, calculate the probability that the Celtics win at least one game in the first four games of the series.
(1-p_cavs_win4)## [1] 0.8704
Exercise 5. Monte Carlo simulation for Celtics winning a game
Create a Monte Carlo simulation to confirm your answer to the previous problem by estimating how frequently the Celtics win at least 1 of 4 games. Use B <- 10000 simulations.
The provided sample code simulates a single series of four random games, simulated_games.
# This line of sample code simulates four random games where the Celtics either lose or win. Each game is independent of other games.
simulated_games <- sample(c("lose","win"), 4, replace = TRUE, prob = c(0.6, 0.4))
# The variable 'B' specifies the number of times we want the simulation to run. Let's run the Monte Carlo simulation 10,000 times.
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
set.seed(1)
# Create an object called `celtic_wins` that first replicates the sample code generating the variable called `simulated_games` for `B` iterations and then tallies the number of simulated series that contain at least one win for the Celtics.
celtic_wins <- replicate(B, {
simulated_games <- sample(c("lose","win"), 4, replace = TRUE, prob = c(0.6, 0.4))
any(simulated_games=="win")
})
# Calculate the frequency out of B iterations that the Celtics won at least one game. Print your answer to the console.
mean(celtic_wins)## [1] 0.8757
1.3 Addition Rule and Monty Hall
1.3.1 The Addition Rule
The addition rule tells us that the probability of A or B is the probability of A plus the probability of B minus the probability of A and B.
- The probability of a natural 21 (an ace and a face card, in either order): an ace followed by a face card is 1/13 * 16/51; a face card followed by an ace is 16/52 * 4/51.
p1 <- 1/13 * 16/51
p2 <- 16/52 * 4/51
p <- p1 + p2
p## [1] 0.04826546
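Another small example (my own sketch) where the addition rule's overlap term matters, using the deck constructed earlier: the probability of drawing a king or a heart is 4/52 + 13/52 - 1/52, because the King of Hearts would otherwise be counted twice.
kings <- paste("King", suits)
hearts <- paste(numbers, "Hearts")
mean(deck %in% kings) + mean(deck %in% hearts) - mean(deck %in% intersect(kings, hearts))
# 4/52 + 13/52 - 1/52 = 16/52, about 0.308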
1.3.2 The Monty Hall problem
So let’s start by creating a simulation that imitates the strategy of sticking to the same door.
B <- 10000
stick <- replicate(B, {
doors <- as.character(1:3)
prize <- sample(c("car", "goat", "goat"))
prize_door <- doors[prize == "car"]
my_pick <- sample(doors, 1)
show <- sample(doors[!doors %in% c(my_pick, prize_door)],1)
stick <- my_pick
stick == prize_door
})
mean(stick)## [1] 0.3231
The same experiment but now we use the switch strategy (always switching)
B <- 10000
switch <- replicate(B, {
doors <- as.character(1:3)
prize <- sample(c("car", "goat", "goat"))
prize_door <- doors[prize == "car"]
my_pick <- sample(doors, 1)
show <- sample(doors[!doors %in% c(my_pick, prize_door)],1)
stick <- my_pick
switch <- doors[!doors%in%c(my_pick, show)]
switch == prize_door
})
mean(switch)## [1] 0.6718
Exercise 1. The Cavs and the Warriors
Two teams, say the Cavs and the Warriors, are playing a seven game championship series. The first to win four games wins the series. The teams are equally good, so they each have a 50-50 chance of winning each game.
If the Cavs lose the first game, what is the probability that they win the series?
# Assign a variable 'n' as the number of remaining games.
n <- 6
# Assign a variable 'l' to a list of possible game outcomes, where 0 indicates a loss and 1 indicates a win for the Cavs.
l <- list(0:1)
# Create a data frame named 'possibilities' that contains all possible outcomes for the remaining games.
possibilities <- expand.grid(rep(l, n))
# Create a vector named 'results' that indicates whether each row in the data frame 'possibilities' contains enough wins for the Cavs to win the series.
results <- rowSums(possibilities)>=4
# Calculate the proportion of 'results' in which the Cavs win the series. Print the outcome to the console.
mean(results)## [1] 0.34375
Exercise 2. The Cavs and the Warriors - Monte Carlo
Confirm the results of the previous question with a Monte Carlo simulation to estimate the probability of the Cavs winning the series after losing the first game.
# The variable `B` specifies the number of times we want the simulation to run. Let's run the Monte Carlo simulation 10,000 times.
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
set.seed(1)
# Create an object called `results` that replicates the sample code for `B` iterations and tallies the number of simulated series that contain at least four wins for the Cavs.
results <- replicate(B, {
cavs_wins <- sample(c(0,1), 6, replace = TRUE)
sum(cavs_wins)>=4
})
# Calculate the frequency out of `B` iterations that the Cavs won at least four games in the remainder of the series. Print your answer to the console.
mean(results)## [1] 0.3453
Exercise 3. A and B play a series - part 1
Two teams, A and B, are playing a seven-game series. Team A is better than team B and has a p > 0.5 chance of winning each game.
# Let's assign the variable 'p' as the vector of probabilities that team A will win.
p <- seq(0.5, 0.95, 0.025)
# Given a value 'p', the probability of winning the series for the underdog team B can be computed with the following function based on a Monte Carlo simulation:
prob_win <- function(p){
B <- 10000
result <- replicate(B, {
b_win <- sample(c(1,0), 7, replace = TRUE, prob = c(1-p, p))
sum(b_win)>=4
})
mean(result)
}
# Apply the 'prob_win' function across the vector of probabilities that team A will win to determine the probability that team B will win. Call this object 'Pr'.
Pr <- sapply(p, prob_win)
# Plot the probability 'p' on the x-axis and 'Pr' on the y-axis.
plot(p, Pr)
Exercise 4. A and B play a series - part 2
Repeat the previous exercise, but now keep the probability that team A wins fixed at p <- 0.75 and compute the probability for different series lengths. For example, wins in best of 1 game, 3 games, 5 games, and so on through a series that lasts 25 games.
# Given a value 'p', the probability of winning the series for the underdog team $B$ can be computed with the following function based on a Monte Carlo simulation:
prob_win <- function(N, p=0.75){
B <- 10000
result <- replicate(B, {
b_win <- sample(c(1,0), N, replace = TRUE, prob = c(1-p, p))
sum(b_win)>=(N+1)/2
})
mean(result)
}
# Assign the variable 'N' as the vector of series lengths. Use only odd numbers ranging from 1 to 25 games.
N <- seq(1, 25, 2)
# Apply the 'prob_win' function across the vector of series lengths to determine the probability that team B will win. Call this object `Pr`.
Pr<- sapply(N, prob_win)
# Plot the number of games in the series 'N' on the x-axis and 'Pr' on the y-axis.
plot(N, Pr)
Section 2: Continuous Probability
Section 2 introduces you to Continuous Probability.
After completing Section 2, you will:
- understand the differences between calculating probabilities for discrete and continuous data.
- be able to use cumulative distribution functions to assign probabilities to intervals when dealing with continuous data.
- be able to use R to generate normally distributed outcomes for use in Monte Carlo simulations.
- know some of the useful theoretical continuous distributions in addition to the normal distribution, such as the student-t, chi-squared, exponential, gamma, beta, and beta-binomial distributions.
2.1 Continuous Probability
2.1.1 Continuous probability
library(tidyverse)## -- Attaching packages ------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.1
## v tibble 2.0.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x stringr::fixed() masks permutations::fixed()
## x dplyr::lag() masks stats::lag()
## x stringr::word() masks permutations::word()
library(dslabs)
data(heights)
x <- heights %>% filter(sex=="Male") %>% .$height
# now we can define the empirical cumulative distribution function (eCDF)
F <- function(a) mean(x<=a)
Keep in mind that we have not yet introduced probability. Let's do this by asking the following: if I pick one of the male students at random, what is the chance that he is taller than 70.5 inches? Because every student has the same chance of being picked, the answer is equivalent to the proportion of students that are taller than 70.5 inches. Using the CDF we obtain the answer by typing:
1 - F(70.5)## [1] 0.3633005
2.1.2 Theoretical Distribution
The cumulative distribution for the normal distribution is defined by a mathematical formula, which in R can be obtained with the function pnorm. Using the normal distribution:
1 - pnorm(70.5, mean(x), sd(x))## [1] 0.371369
# This is the actual data:
mean(x <= 68.5) - mean(x <= 67.5)## [1] 0.114532
mean(x <= 69.5) - mean(x <= 68.5)## [1] 0.1194581
mean(x <= 70.5) - mean(x <= 69.5)## [1] 0.1219212
# and this is the approximation:
pnorm(68.5, mean(x), sd(x)) - pnorm(67.5, mean(x), sd(x))## [1] 0.1031077
pnorm(69.5, mean(x), sd(x)) - pnorm(68.5, mean(x), sd(x))## [1] 0.1097121
pnorm(70.5, mean(x), sd(x)) - pnorm(69.5, mean(x), sd(x))## [1] 0.1081743
2.1.3 Probability Density
The probability density at x is the function f(x) such that the probability distribution F(a), the probability of X being less than or equal to a, is the integral of f over all values up to a: \(F(a) = \mbox{Pr}(X \le a) = \int_{-\infty}^{a} f(x) \, dx\).
dnorm() is the probability density function for the normal distribution:
# the probability that a person is taller than 76 inches
avg <- mean(x)
s <- sd(x)
1 - pnorm(76, avg, s)## [1] 0.03206008
dnorm(76, mean(x), sd(x))## [1] 0.01990735
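As a check (my own sketch), integrating the density from 76 to infinity recovers the pnorm answer:
# the area under the density curve above 76 equals 1 - pnorm(76, avg, s)
integrate(function(h) dnorm(h, avg, s), lower = 76, upper = Inf)$value
# approximately 0.032, matching 1 - pnorm(76, avg, s)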
2.1.4 Monte Carlo Simulations
x <- heights %>% filter(sex=="Male") %>% .$height
n <- length(x)
avg <- mean(x)
s <- sd(x)
simulated_heights <- rnorm(n, avg, s)
ds_theme_set()
data.frame(simulated_heights=simulated_heights) %>% ggplot(aes(simulated_heights)) +
geom_histogram(color = "black", binwidth = 2)
Specifically, we could ask: how rare is it that the tallest person is a seven-footer? We can use the following Monte Carlo simulation to answer this question.
B <- 10000
tallest <- replicate(B, {
simulated_data <- rnorm(800, avg, s)
max(simulated_data)
})
mean(tallest)## [1] 80.80211
mean(tallest >= 7*12)## [1] 0.0214
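We can confirm this without simulation (my own addition): the tallest of 800 reaches 7 feet exactly when not all 800 heights fall below 84 inches:
# Pr(max of 800 >= 84) = 1 - Pr(all 800 < 84), using independence
1 - pnorm(7*12, avg, s)^800 # close to the Monte Carlo estimate of about 0.02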
2.1.5 Other Continuous Distributions
Other continuous distributions: student-t, chi-squared, exponential, gamma, beta.
R uses a convention that lets us remember the names of these functions, namely, using the letters: * d for density, * q for quantile, * p for the cumulative distribution function (probability), * and r for random generation. We put these letters in front of a shorthand for the distribution.
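For the normal distribution the four functions look like this (a quick sketch of the convention):
dnorm(0) # density of the standard normal at 0 (about 0.399)
pnorm(0) # cumulative probability Pr(X <= 0) = 0.5
qnorm(0.975) # quantile: the value with 97.5% of the mass below it (about 1.96)
rnorm(3) # three random draws from the standard normal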
# Example: we can use the dnorm function to generate this plot. This is the density function for the normal distribution.
x <- seq(-4, 4, length.out = 100)
data.frame(x, f = dnorm(x)) %>% ggplot(aes(x,f)) +
geom_line()
Exercise 1. Distribution of female heights - 1
Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is 5 feet or shorter?
# Assign a variable 'female_avg' as the average female height.
female_avg <- 64
# Assign a variable 'female_sd' as the standard deviation for female heights.
female_sd <- 3
# Using variables 'female_avg' and 'female_sd', calculate the probability that a randomly selected female is shorter than 5 feet. Print this value to the console.
pnorm(5*12, female_avg,female_sd)## [1] 0.09121122
Exercise 2. Distribution of female heights - 2
Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is 6 feet or taller?
# Assign a variable 'female_avg' as the average female height.
female_avg <- 64
# Assign a variable 'female_sd' as the standard deviation for female heights.
female_sd <- 3
# Using variables 'female_avg' and 'female_sd', calculate the probability that a randomly selected female is 6 feet or taller. Print this value to the console.
1-pnorm(6*12, female_avg,female_sd)## [1] 0.003830381
Exercise 3. Distribution of female heights - 3
Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is between 61 and 67 inches?
# Assign a variable 'female_avg' as the average female height.
female_avg <- 64
# Assign a variable 'female_sd' as the standard deviation for female heights.
female_sd <- 3
# Using variables 'female_avg' and 'female_sd', calculate the probability that a randomly selected female is between the desired height range. Print this value to the console.
pnorm(67, female_avg,female_sd) - pnorm(61, female_avg,female_sd)## [1] 0.6826895
Exercise 4. Distribution of female heights - 4
Repeat the previous exercise, but convert everything to centimeters. That is, multiply every height, including the standard deviation, by 2.54. What is the answer now?
# Assign a variable 'female_avg' as the average female height. Convert this value to centimeters.
female_avg <- 64*2.54
# Assign a variable 'female_sd' as the standard deviation for female heights. Convert this value to centimeters.
female_sd <- 3*2.54
# Using variables 'female_avg' and 'female_sd', calculate the probability that a randomly selected female is between the desired height range. Print this value to the console.
pnorm(67*2.54, female_avg,female_sd) - pnorm(61*2.54, female_avg,female_sd)## [1] 0.6826895
Exercise 5. Probability of 1 SD from average
Compute the probability that the height of a randomly chosen female is within 1 SD from the average height.
# Assign a variable 'female_avg' as the average female height.
female_avg <- 64
# Assign a variable 'female_sd' as the standard deviation for female heights.
female_sd <- 3
# To a variable named 'taller', assign the value of a height that is one SD taller than average.
taller <- female_avg + female_sd
# To a variable named 'shorter', assign the value of a height that is one SD shorter than average.
shorter <- female_avg - female_sd
# Calculate the probability that a randomly selected female is between the desired height range. Print this value to the console.
pnorm(taller, female_avg,female_sd) - pnorm(shorter, female_avg,female_sd)## [1] 0.6826895
Exercise 6. Distribution of male heights
Imagine the distribution of adult male heights is approximately normal with an expected value of 69 inches and a standard deviation of 3 inches. How tall is a male in the 99th percentile?
# Assign a variable 'male_avg' as the average male height.
male_avg <- 69
# Assign a variable 'male_sd' as the standard deviation for male heights.
male_sd <- 3
# Determine the height of a man in the 99th percentile of the distribution.
qnorm(0.99, male_avg, male_sd)## [1] 75.97904
Exercise 7. Distribution of IQ scores
The distribution of IQ scores is approximately normally distributed. The expected value is 100 and the standard deviation is 15. Suppose you want to know the distribution of the person with the highest IQ in your school district, where 10,000 people are born each year.
Generate 10,000 IQ scores 1,000 times using a Monte Carlo simulation. Make a histogram of the highest IQ scores.
# The variable `B` specifies the number of times we want the simulation to run.
B <- 1000
# Use the `set.seed` function to make sure your answer matches the expected result after random number generation.
set.seed(1)
# Create an object called `highestIQ` that contains the highest IQ score from each random distribution of 10,000 people.
highestIQ <- replicate(B, {
sim <- rnorm(10000,100,15)
max(sim)
})
# Make a histogram of the highest IQ scores.
hist(highestIQ)
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem
Section 3 introduces you to Random Variables, Sampling Models, and the Central Limit Theorem.
Section 3 is divided into two parts:
- Random Variables and Sampling Models
- The Central Limit Theorem.
After completing Section 3, you will:
- understand what random variables are, how to generate them, and the correct mathematical notation to use with them.
- be able to use sampling models to estimate characteristics of a larger population.
- be able to explain the difference between a distribution and a probability distribution.
- understand the Central Limit Theorem and the law of large numbers.
3.1 Random Variables and Sampling Models
3.1.1 Random Variables
Random variables are numeric outcomes resulting from a random process.
Below is the beads example where we use a random generator:
beads <- rep(c("red", "blue"), times = c(2,3))
X <- ifelse (sample(beads, 1) == "blue", 1, 0)
X## [1] 1
3.1.2 Sampling Models
For example, we can model the process of polling likely voters as drawing 0’s– Republicans– and 1’s– Democrats– from an urn containing the 0 and 1 code for all likely voters.
In epidemiological studies, we often assume that the subjects in our study are a random sample from the population of interest. The data related to a specific outcome can be modeled as a random sample
Similarly, in experimental research, we often assume that the individual organisms we are studying– for example, worms, flies, or mice– are a random sample from a larger population.
Casino games offer a plethora of examples of real-world situations in which sampling models are used to answer specific questions.
Roulette wheel example
color <- rep(c("Black", "Red", "Green"), c(18,18,2))
n <- 1000
X <- sample(ifelse(color =="Red", -1, 1), n, replace = TRUE)
X[1:10]## [1] -1 1 -1 -1 1 -1 1 -1 1 -1
X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
S <- sum(X)
S## [1] 56
Running a Monte Carlo Simulation with the above example:
a <- 0
n <- 1000
B <- 10000
roulette_winnings <- function(n){
X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
sum(X)
}
S <- replicate(B, roulette_winnings(n))
mean(S <= a)## [1] 0.0462
mean(S)## [1] 52.8384
sd(S)## [1] 31.65269
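We can compare these Monte Carlo summaries with the theoretical values implied by the sampling model (my own check):
n * (1 * 10/19 + -1 * 9/19) # expected value of S: 1000/19, about 52.6
sqrt(n) * abs(1 - -1) * sqrt(10/19 * 9/19) # standard error of S: about 31.6
pnorm(0, 1000/19, sqrt(1000) * 2 * sqrt(90/361)) # CLT estimate of Pr(S <= 0), about 0.048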
s <- seq(min(S), max(S), length = 100)
normal_density <- data.frame(S = s, f=dnorm(s, mean(S), sd(S)))
data.frame(S=S) %>% ggplot(aes(S, ..density..)) +
geom_histogram(color = "black", binwidth = 10) +
ylab("Probability") +
geom_line(data = normal_density, mapping=aes(s,f), color="blue")
3.1.3 Distributions versus Probability Distributions
Previously we described how any list of numbers, let’s call it x1 through xn, has a distribution. The definition is quite straightforward. We define capital F of a as a function that answers the question, what proportion of the list is less than or equal to a. Because they are useful summaries, when the distribution is approximately normal, we define the average and the standard deviation.
A random variable x has a distribution function. To define this, we do not need a list of numbers; it is a theoretical concept. In this case, we define the distribution F(a) as a function that answers the question: what is the probability that x is less than or equal to a? There is no list of numbers. However, if x is defined by drawing from an urn with numbers in it, then there is a list (the numbers in the urn), and the distribution of that list is the probability distribution of x.
3.1.4 Notation for Random Variables
In statistical textbooks, upper case letters (such as X) denote random variables, and lower case letters (such as x) denote observed values, as in the notation Pr(X = x).
3.1.5 Central Limit Theorem
The Central Limit Theorem, or CLT for short, tells us that when the number of independent draws (the sample size) is large, the probability distribution of the sum of these draws is approximately normal.
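A minimal illustration (my own sketch): sums of draws from a decidedly non-normal urn still look approximately normal once the number of draws is large:
# 10,000 sums, each of 1,000 draws from an urn with 0s and 1s, Pr(1) = 0.1
sums <- replicate(10000, sum(sample(c(0, 1), 1000, replace = TRUE, prob = c(0.9, 0.1))))
hist(sums) # roughly bell-shaped around 100, as the CLT predicts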
Exercise 1. American Roulette probabilities
An American roulette wheel has 18 red, 18 black, and 2 green pockets. Each red and black pocket is associated with a number from 1 to 36. The two remaining green slots feature “0” and “00”. Players place bets on which pocket they think a ball will land in after the wheel is spun. Players can bet on a specific number (0, 00, 1-36) or color (red, black, or green).
What are the chances that the ball lands in a green pocket?
# The variables `green`, `black`, and `red` contain the number of pockets for each color
green <- 2
black <- 18
red <- 18
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green/(green+black+red)
# Print the variable `p_green` to the console
p_green## [1] 0.05263158
Exercise 2. American Roulette payout
In American roulette, the payout for winning on green is $17. This means that if you bet $1 and it lands on green, you get $17 as a prize.
Create a model to predict your winnings from betting on green.
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
set.seed(1)
# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green
Exercise 3. American Roulette expected value
In American roulette, the payout for winning on green is $17. This means that if you bet $1 and it lands on green, you get $17 as a prize.In the previous exercise, you created a model to predict your winnings from betting on green.
Now, compute the expected value of X, the random variable you generated previously.
# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green
# Calculate the expected outcome if you win $17 if the ball lands on green and you lose $1 if the ball doesn't land on green
p_green * 17 + p_not_green * -1## [1] -0.05263158
Exercise 4. American Roulette standard error
The standard error of a random variable X tells us the typical size of the difference between the random variable and its expected value. You calculated a random variable X in exercise 2 and the expected value of that random variable in exercise 3.
Now, compute the standard error of that random variable, which represents a single outcome after one spin of the roulette wheel.
# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green
# Compute the standard error of the random variable
abs((17 - -1))*sqrt(p_green*p_not_green)## [1] 4.019344
Exercise 5. American Roulette sum of winnings
You modeled the outcome of a single spin of the roulette wheel, X, in exercise 2.
Now create a random variable S that sums your winnings after betting on green 1,000 times.
# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define the number of bets using the variable 'n'
n <- 1000
# Create a vector called 'X' that contains the outcomes of 1000 samples
X <- sample(c(17,-1), size = n, replace=TRUE, prob=c(p_green, p_not_green))
# Assign the sum of all 1000 outcomes to the variable 'S'
S <- sum(X)
# Print the value of 'S' to the console
S## [1] -10
Exercise 6. American Roulette winnings expected value
In the previous exercise, you generated a vector of random outcomes, S, after betting on green 1,000 times.
What is the expected value of S?
# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green
# Define the number of bets using the variable 'n'
n <- 1000
# Calculate the expected outcome of 1,000 spins if you win $17 when the ball lands on green and you lose $1 when the ball doesn't land on green
n * (p_green * 17 + p_not_green * -1)## [1] -52.63158
Exercise 7. American Roulette winnings expected value
You generated the expected value of S, the outcomes of 1,000 bets that the ball lands in the green pocket, in the previous exercise.
What is the standard error of S?
# The variables 'green', 'black', and 'red' contain the number of pockets for each color
green <- 2
black <- 18
red <- 18
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- green / (green+black+red)
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green
# Define the number of bets using the variable 'n'
n <- 1000
# Compute the standard error of the sum of 1,000 outcomes
sqrt(n) * abs((17 + 1))*sqrt(p_green*p_not_green)## [1] 127.1028
3.2 The Central Limit Theorem Continued
3.2.1 Averages and Proportions
Property 1: the expected value of the sum of random variables is the sum of the expected values of the individual random variables.
Property 2: the expected value of a random variable times a non-random constant is the expected value times that non-random constant.
Property 3: the square of the standard error of the sum of independent random variables is the sum of the squares of the standard errors of the individual random variables.
Property 4: the standard error of a random variable times a non-random constant is the standard error times the absolute value of that non-random constant.
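In formulas (standard results, stated in my own notation, with a a non-random constant and the \(X_i\) independent for property 3):
\(E[X_1 + \cdots + X_n] = E[X_1] + \cdots + E[X_n]\)
\(E[aX] = a\,E[X]\)
\(SE[X_1 + \cdots + X_n]^2 = SE[X_1]^2 + \cdots + SE[X_n]^2\)
\(SE[aX] = |a|\,SE[X]\)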
3.2.2 Law of Large Numbers
A similar argument would be to say that red is due on the roulette wheel after seeing black come up five times in a row. These events are independent: the chance of a coin landing heads is 50% regardless of the previous five tosses, and the same holds for the roulette outcome. The law of averages applies only when the number of draws is very, very large, not in small samples. After a million tosses, you will definitely see about 50% heads, regardless of what the first five were.
3.2.3 How Large is Large in CLT?
In the lottery example, the number of winners (the sum of the draws) ranges between 0 and, in very extreme cases, 4. This sum is certainly not well approximated by the normal distribution, so the Central Limit Theorem doesn't apply, even with a very large sample size.
This is generally true when the probability of success is very low. In these cases, the Poisson distribution is more appropriate. We do not cover the theory here, but you can learn about the Poisson distribution in any probability textbook and even Wikipedia.
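A sketch (my own addition, with made-up lottery numbers) comparing the exact binomial probabilities to the Poisson approximation when p is tiny:
n_tickets <- 10^7 # hypothetical: ten million tickets sold
p_win <- 5/10^7 # tiny success probability; about five winners expected
dbinom(0:4, n_tickets, p_win) # exact probabilities of 0 to 4 winners
dpois(0:4, lambda = n_tickets * p_win) # Poisson approximation, nearly identical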
Exercise 1. American Roulette probability of winning money
The exercises in the previous chapter explored winnings in American roulette. In this chapter of exercises, we will continue with the roulette example and add in the Central Limit Theorem.
In the previous chapter of exercises, you created a random variable S that is the sum of your winnings after betting on green a number of times in American Roulette.
What is the probability that you end up winning money if you bet on green 100 times?
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- 2 / 38
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green
# Define the number of bets using the variable 'n'
n <- 100
# Calculate 'avg', the expected outcome of 100 spins if you win $17 when the ball lands on green and you lose $1 when the ball doesn't land on green
avg <- n * (17*p_green + -1*p_not_green)
# Compute 'se', the standard error of the sum of 100 outcomes
se <- sqrt(n) * (17 - -1)*sqrt(p_green*p_not_green)
# Using the expected value 'avg' and standard error 'se', compute the probability that you win money betting on green 100 times.
1-pnorm(0,avg,se)## [1] 0.4479091
Exercise 2. American Roulette Monte Carlo simulation
Create a Monte Carlo simulation that generates 10,000 outcomes of S, the sum of 100 bets.
Compute the average and standard deviation of the resulting list and compare them to the expected value (-5.263158) and standard error (40.19344) for S that you calculated previously.
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- 2 / 38
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1-p_green
# Define the number of bets using the variable 'n'
n <- 100
# The variable `B` specifies the number of times we want the simulation to run. Let's run the Monte Carlo simulation 10,000 times.
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
set.seed(1)
# Create an object called `S` that replicates the sample code for `B` iterations and sums the outcomes.
S <- replicate(B,{
X <- sample(c(17,-1), size = n, replace = TRUE, prob = c(p_green, p_not_green))
sum(X)
})
# Compute the average value for 'S'
mean(S)## [1] -5.9086
# Calculate the standard deviation of 'S'
sd(S)## [1] 40.30608
Exercise 3. American Roulette Monte Carlo vs CLT
In this chapter, you calculated the probability of winning money in American roulette using the CLT.
Now, calculate the probability of winning money from the Monte Carlo simulation. The Monte Carlo simulation from the previous exercise has already been pre-run for you, resulting in the variable S that contains a list of 10,000 simulated outcomes.
# Calculate the proportion of outcomes in the vector `S` that exceed $0
mean(S>0)## [1] 0.4232
Exercise 5. American Roulette average winnings per bet
Now create a random variable Y that contains your average winnings per bet after betting on green 10,000 times.
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
set.seed(1)
# Define the number of bets using the variable 'n'
n <- 10000
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- 2 / 38
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1 - p_green
# Create a vector called `X` that contains the outcomes of `n` bets
X <- sample(c(17,-1), size = n, replace = TRUE, prob = c(p_green, p_not_green))
# Define a variable `Y` that contains the mean outcome per bet. Print this mean to the console.
Y <- mean(X);Y## [1] 0.008
Exercise 6. American Roulette per bet expected value
What is the expected value of Y, the average outcome per bet after betting on green 10,000 times?
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- 2 / 38
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1 - p_green
# Calculate the expected outcome of `Y`, the mean outcome per bet in 10,000 bets
Y <- p_green * 17 + p_not_green * -1;Y## [1] -0.05263158
Exercise 7. American Roulette per bet standard error
What is the standard error of Y, the average result of 10,000 spins?
# Define the number of bets using the variable 'n'
n <- 10000
# Assign a variable `p_green` as the probability of the ball landing in a green pocket
p_green <- 2 / 38
# Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
p_not_green <- 1 - p_green
# Compute the standard error of 'Y', the mean outcome per bet from 10,000 bets.
abs((17 - -1))*sqrt(p_green*p_not_green) / sqrt(n)## [1] 0.04019344
Exercise 8. American Roulette winnings per game are positive
What is the probability that your winnings are positive after betting on green 10,000 times?
# We defined the average using the following code
avg <- 17*p_green + -1*p_not_green
# We defined standard error using this equation
se <- 1/sqrt(n) * (17 - -1)*sqrt(p_green*p_not_green)
# Given this average and standard error, determine the probability of winning more than $0. Print the result to the console.
1 - pnorm(0, avg, se)## [1] 0.0951898
Exercise 9. American Roulette Monte Carlo again
Create a Monte Carlo simulation that generates 10,000 outcomes of S, the average outcome from 10,000 bets on green.
Compute the average and standard deviation of the resulting list to confirm the results from previous exercises using the Central Limit Theorem.
# The variable `n` specifies the number of independent bets on green
n <- 10000
# The variable `B` specifies the number of times we want the simulation to run
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random number generation
set.seed(1)
# Generate a vector `S` that contains the average outcomes of 10,000 bets modeled 10,000 times
S <- replicate(B,{
X <- sample(c(17,-1), size = n, replace = TRUE, prob = c(p_green, p_not_green))
mean(X)
})
# Compute the average of `S`
mean(S)## [1] -0.05223142
# Compute the standard deviation of `S`
sd(S)## [1] 0.03996168
Exercise 10. American Roulette comparison
In a previous exercise, you found the probability of winning more than $0 after betting on green 10,000 times using the Central Limit Theorem. Then, you used a Monte Carlo simulation to model the average result of betting on green 10,000 times over 10,000 simulated series of bets.
What is the probability of winning more than $0 as estimated by your Monte Carlo simulation? The code to generate the vector S that contains the average outcomes of 10,000 bets modeled 10,000 times has already been run for you.
# Compute the proportion of outcomes in the vector 'S' where you won more than $0
mean(S>0)## [1] 0.0977
Section 4: The Big Short
Section 4 introduces you to the Big Short.
After completing Section 4, you will:
- understand the relationship between sampling models and interest rates as determined by banks.
- understand how interest rates can be set to minimize the chances of the bank losing money.
- understand how inappropriate assumptions of independence contributed to the financial meltdown of 2007.
4.1 The Big Short
4.1.1 Interest rates Explained
Suppose your bank will give out 1,000 loans for $180,000 each this year. Also suppose that your bank loses, after adding up all the costs, $200,000 per foreclosure. For simplicity, we assume that this includes all operational costs. A sampling model for this scenario is coded like this.
n <- 1000
loss_per_foreclosure <- -200000
p <- 0.02
defaults <- sample( c(0,1), n, prob=c(1-p, p), replace = TRUE)
sum(defaults * loss_per_foreclosure)## [1] -5e+06
B <- 10000
losses <- replicate(B, {
defaults <- sample( c(0,1), n, prob=c(1-p, p), replace = TRUE)
sum(defaults * loss_per_foreclosure)
})
data.frame(losses_in_millions = losses/10^6) %>% ggplot(aes(losses_in_millions)) +
geom_histogram(binwidth = 0.6, col = 'green')
We don't really need a Monte Carlo simulation, though. Using what we have learned, the CLT tells us that because our losses are a sum of independent draws, their distribution is approximately normal, with expected value and standard error given by:
n*(p*loss_per_foreclosure + (1-p)*0)## [1] -4e+06
sqrt(n)*abs(loss_per_foreclosure)*sqrt(p*(1-p))## [1] 885437.7
Suppose we want our chance of losing money to be 1 in 100. Setting Pr(S < 0) = 0.01 and using the CLT, we can solve for x, the amount we need to collect from each loan that does not foreclose:
l <- loss_per_foreclosure
z <- qnorm(0.01)
x <- -l*( n*p - z*sqrt(n*p*(1-p)))/ ( n*(1-p) + z*sqrt(n*p*(1-p)))
x## [1] 6249.181
Our interest rate now goes up to about 0.035 (x/180000). This is still a very competitive interest rate. By choosing this interest rate, we now have an expected profit per loan of:
loss_per_foreclosure*p + x*(1-p)## [1] 2124.198
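As a check (my own sketch), this rate indeed brings the chance of losing money down to about 1 in 100:
mu <- loss_per_foreclosure*p + x*(1-p) # expected profit per loan, about 2124
sigma <- abs(x - loss_per_foreclosure)*sqrt(p*(1-p)) # standard error per loan
pnorm(0, n*mu, sqrt(n)*sigma) # Pr(S < 0), approximately 0.01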
4.1.2 The Big Short
One of your employees points out that since the bank is making 2124 dollars per loan, the bank should give out more loans! Why just \(n\)? You explain that finding those n clients was hard. You need a group that is predictable and that keeps the chances of defaults low. He then points out that even if the probability of default is higher, as long as our expected value is positive, you can minimize your chances of losses by increasing \(n\) and relying on the law of large numbers.
He claims that even if the default rate is twice as high, say 4%, the bank still profits as long as we set the interest rate just a bit higher than this break-even rate:
p <- 0.04
r <- (- loss_per_foreclosure*p/(1-p)) / 180000
r## [1] 0.0462963
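Continuing the argument (a sketch with an assumed 5% rate, following the law-of-large-numbers logic above): with x = 0.05 * 180000 the expected profit per loan is still positive, and we can find how many loans make the chance of losing money about 1 in 100:
p <- 0.04
x <- 0.05 * 180000 # assumed 5% interest rate on a $180,000 loan
mu <- loss_per_foreclosure*p + x*(1-p) # about $640 expected profit per loan
sigma <- abs(x - loss_per_foreclosure)*sqrt(p*(1-p))
z <- qnorm(0.01)
n <- ceiling(z^2 * sigma^2 / mu^2) # about 22,163 loans
pnorm(0, n*mu, sqrt(n)*sigma) # approximately 0.01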
Exercise 1. Bank earnings
Say you manage a bank that gives out 10,000 loans. The default rate is 0.03 and you lose $200,000 in each foreclosure.
Create a random variable S that contains the earnings of your bank. Calculate the total amount of money lost in this scenario.
# Assign the number of loans to the variable `n`
n <- 10000
# Assign the loss per foreclosure to the variable `loss_per_foreclosure`
loss_per_foreclosure <- -200000
# Assign the probability of default to the variable `p_default`
p_default <- 0.03
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Generate a vector called `defaults` that contains the default outcomes of `n` loans
defaults <- sample( c(0,1), n, replace = TRUE, prob=c(1-p_default, p_default))
# Generate `S`, the total amount of money lost across all foreclosures. Print the value to the console.
S <- sum(defaults * loss_per_foreclosure)
S## [1] -6.3e+07
Exercise 2. Bank earnings Monte Carlo
Run a Monte Carlo simulation with 10,000 outcomes for S, the sum of losses over 10,000 loans. Make a histogram of the results.
# Assign the number of loans to the variable `n`
n <- 10000
# Assign the loss per foreclosure to the variable `loss_per_foreclosure`
loss_per_foreclosure <- -200000
# Assign the probability of default to the variable `p_default`
p_default <- 0.03
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# The variable `B` specifies the number of times we want the simulation to run
B <- 10000
# Generate a list of summed losses 'S'. Replicate the code from the previous exercise over 'B' iterations to generate a list of summed losses for 'n' loans
S <- replicate(B, {
defaults <- sample( c(0,1), n, prob=c(1-p_default, p_default), replace = TRUE)
sum(defaults * loss_per_foreclosure)
})
# Plot a histogram of 'S'
hist(S)
Exercise 3. Bank earnings expected value
What is the expected value of S, the sum of losses over 10,000 loans? For now, assume a bank makes no money if the loan is paid.
# Assign the number of loans to the variable `n`
n <- 10000
# Assign the loss per foreclosure to the variable `loss_per_foreclosure`
loss_per_foreclosure <- -200000
# Assign the probability of default to the variable `p_default`
p_default <- 0.03
# Calculate the expected loss due to default out of 10,000 loans
n*(p_default*loss_per_foreclosure + (1-p_default)*0)## [1] -6e+07
Exercise 4. Bank earnings standard error
What is the standard error of S?
# Assign the number of loans to the variable `n`
n <- 10000
# Assign the loss per foreclosure to the variable `loss_per_foreclosure`
loss_per_foreclosure <- -200000
# Assign the probability of default to the variable `p_default`
p_default <- 0.03
# Compute the standard error of the sum of 10,000 loans
sqrt(n) * abs(loss_per_foreclosure) * sqrt(p_default*(1 - p_default))## [1] 3411744
Exercise 5. Bank earnings interest rate - 1
So far, we’ve been assuming that we make no money when people pay their loans and we lose a lot of money when people default on their loans. Assume we give out loans for $180,000. How much money do we need to make when people pay their loans so that our net loss is $0?
In other words, what interest rate do we need to charge in order to not lose money?
# Assign the loss per foreclosure to the variable `loss_per_foreclosure`
loss_per_foreclosure <- -200000
# Assign the probability of default to the variable `p_default`
p_default <- 0.03
# Assign a variable `x` as the total amount necessary to have an expected outcome of $0
x <- -(loss_per_foreclosure*p_default) / (1 - p_default)
# Convert `x` to an interest rate, given that the loan amount is $180,000. Print this value to the console.
x / 180000## [1] 0.03436426
Exercise 6. Bank earnings interest rate - 2
With the interest rate calculated in the last example, we still lose money 50% of the time. What should the interest rate be so that the chance of losing money is 1 in 20?
In math notation, what should the interest rate be so that Pr(S<0)=0.05?
Remember that we can subtract the expected value and divide by the standard error on both sides of the equation to get: \(\mbox{Pr}\left(\frac{S - E[S]}{SE[S]} < \frac{-E[S]}{SE[S]}\right)\), which is
\(\mbox{Pr}\left(Z < \frac{-[lp + x(1-p)]n}{(x-l)\sqrt{np(1-p)}}\right) = 0.05\)
Let z = qnorm(0.05) give us the value of z for which \(\mbox{Pr}(Z \le z) = 0.05\).
# Assign the number of loans to the variable `n`
n <- 10000
# Assign the loss per foreclosure to the variable `loss_per_foreclosure`
loss_per_foreclosure <- -200000
# Assign the probability of default to the variable `p_default`
p_default <- 0.03
# Generate a variable `z` using the `qnorm` function
z <- qnorm(0.05)
# Generate a variable `x` using `z`, `p_default`, `loss_per_foreclosure`, and `n`
x <- -loss_per_foreclosure*( n*p_default - z*sqrt(n*p_default*(1 - p_default)))/ ( n*(1 - p_default) + z*sqrt(n*p_default*(1 - p_default)))
# Convert `x` to an interest rate, given that the loan amount is $180,000. Print this value to the console.
x / 180000## [1] 0.03768738