Discrete random variables

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Question 1

Consider throwing 2 six-sided dice. Using the expand_grid(), mutate(), group_by(), and summarise() functions, compute the probability distribution of the discrete random variable X=sum of the two faces.

experiment <- expand.grid( die1=1:6,
                           die2=1:6)
experiment <- experiment %>%
  mutate(prob = rep( 1/n(), n() )
    )
pd1 <- experiment %>%
    mutate(
        value = (die1 + die2)
    ) %>%
    group_by(value) %>%
    summarise(prob = sum(prob))
pd1

## # A tibble: 11 × 2
##    value   prob
##    <int>  <dbl>
##  1     2 0.0278
##  2     3 0.0556
##  3     4 0.0833
##  4     5 0.111 
##  5     6 0.139 
##  6     7 0.167 
##  7     8 0.139 
##  8     9 0.111 
##  9    10 0.0833
## 10    11 0.0556
## 11    12 0.0278

Check if the two requirements of probability functions are met:

Are all probabilities between 0 and 1?

all(pd1$prob >= 0)

## [1] TRUE

all(pd1$prob <= 1)

## [1] TRUE

Do the probabilities sum to 1?

sum(pd1$prob)

## [1] 1

Question 2

Plot the probability distribution of the random variable X = “sum of the faces of two dice” which you derived in the previous question.

ggplot(data = pd1) +
    geom_segment(aes(x = value, xend = value, y = 0, yend = prob)) +
    geom_point(aes(x = value, y = prob))

Comment on what the distribution tells us.

Answer: It tells us that the most likely outcome for the experiment is the value 7 and values become less likely outcomes the more we move towards each extreme.

Compute the centre and spread of the distribution and interpret them in context.

Centre: 7

Spread (standard deviation):

pd1 %>%
    summarise(
        mu = sum(value * prob),
        sigma = sqrt( sum((value - mu)^2 * prob) )
    )

## # A tibble: 1 × 2
##      mu sigma
##   <dbl> <dbl>
## 1     7  2.42

Question 3

Create a graph showing the distribution of a Binomial random variable X when n=10 and p=0.5.

n1 <- 10
p1 <- 0.5
tibble(
    x = 0:n1,
    prob = dbinom(x = x, size = n1, prob = p1)
) %>%
    ggplot() +
    geom_segment(aes(x = x, xend = x, y = 0, yend = prob)) +
    geom_point(aes(x = x, y = prob))

Question 4

Create a graph showing the distribution of a Geometric random variable X when p=0.5.

tibble(
    x = 0:10, 
    prob = dgeom(x = x, prob = p1)
) %>%
    ggplot() +
    geom_segment(aes(x = x, xend = x, y = 0, yend = prob)) +
    geom_point(aes(x = x, y = prob))

Question 5

Consider a random variable X having the following distribution. Compute the expected value and standard deviation of X.

Value - Probability

0 - 1−p

1 - p

Expected value: (0 x (1 - p)) + (1 x p)

Variance: (0 − p)^2 x (1 − p) + (1 − p)^2 x p = (1 − p)(p^2 + p(1 − p)) = (1 − p) x p

Standard deviation: sqrt((1 − p) x p)

Question 6

According to a briefing paper to the UK House of Commons, nearly 29% of the population, or 19 million, live below the poverty level.1 Suppose these figures hold true for the region in which you live. You plan to randomly sample 25 individuals from your region.

What is the probability that your sample will include at least two people with incomes below the poverty level?

Hint: P(X≥2)=1−P(X=0)−P(X=1)

p2 <- 0.29
n2 <- 25

1 - dbinom(0, n2, p2) - dbinom(1, n2, p2)

## [1] 0.9978565

What are the expected value and standard deviation of the number of people in your sample with incomes below the poverty level?

Expected value

n2 * p2

## [1] 7.25

Standard deviation

sqrt(n2 * p2 * (1 - p2))

## [1] 2.26881

Question 7

A recent survey has found that about 16% of residents have no home insurance. You are to randomly sample 20 residents for a survey. What is the probability that your sample will include at least three people who do not have home insurance?

p3 <- 0.16
n3 <- 20

1 - dbinom(0, n3, p3) - dbinom(1, n3, p3) - dbinom(2, n3, p3)

## [1] 0.6420017

What are the expected value and standard deviation of the number of people in your sample without home insurance?

Expected value

n3 * p3

## [1] 3.2

Standard deviation

sqrt(n3 * p3 * (1 - p3))

## [1] 1.639512

Question 8

Suppose you are rolling a pair of dice and waiting for a sum of 7 to occur. What is the probability that you get a sum of 7 for the first time on your first roll?

p3 <- 6/36
dgeom(0, p3)

## [1] 0.1666667

What is the probability that you get a sum of 7 for the first time on your second roll? 1 - 6/36 = 0.8333333

dgeom(1, p3)

## [1] 0.1388889

What is the probability that it takes more than 10 rolls to get a sum of 7?

Hint: Getting a success for the first time on the 3rd trial means having 2 failures first, so we would use dgeom(2, p).

1 - sum(dgeom(0:10, p3))

## [1] 0.134588

Question 9

A light bulb company advertises a fault rate of 1.1% Suppose you buy 20 light bulbs all of the same model by randomly picking one from 20 different stores in your area. You find that among the bulbs that you bought, the number of faulty ones is 3 out of 20. Visualise the probability distribution of the number of faulty light bulbs out of a sample of 20, if the probability of a faulty light bulb is truly 1.1%

p4 <- 1.1 / 100
n4 <- 20
tibble(
    value = 0:n4,
    prob = dbinom(value, n4, p4)
) %>%
    ggplot() +
    geom_segment(aes(x = value, xend = value, y = 0, yend = prob)) +
    geom_point(aes(x = value, y = prob))

What’s the chance of observing 3 faulty light bulbs out of 20, if the true fault rate is 1.1%?

dbinom(3, n4, p4)

## [1] 0.001257245

Based on your answer to part b, do you think there is enough evidence to doubt the fault rate advertised by the company? In other words, under the advertised fault rate of 1.1%, is this chance of getting 3 faulty bulbs out of 20 high enough to support the company’s claim?

Answer: Getting 3 faulty bulbs out of 20 would be extremely unlikely if the true fault rate was really 1.1%, therefore it might in fact be higher than the rate advertised by the company.