Last updated: 2023-03-25

1. Probability Distributions

Bernoulli trials

A Bernoulli trial is a random experiment with two possible outcomes: success and failure.

You will simulate the shots of basketball players with various skill levels. Let’s assume that the value 1 represents a goal and the value 0 a miss. The outcomes of the shots can be stored in a vector. For example, the vector c(1, 1, 0) means that the player scored twice and then missed once.

The Bernoulli distribution is a particular case of the binomial distribution where a single trial is conducted (so the size argument equals 1).
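As a quick check of this relationship, the sketch below evaluates the binomial probability mass function with size = 1. The probability 0.3 is an arbitrary value chosen for illustration.

```r
# Hypothetical success probability, chosen only for illustration
p <- 0.3

# With size = 1, dbinom() returns P(X = 0) = 1 - p and P(X = 1) = p
bernoulli_pmf <- dbinom(x = c(0, 1), size = 1, prob = p)
print(bernoulli_pmf)
```

The two values are the probabilities of a miss and of a goal, so they sum to 1.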

Recall that bar plots are great for visualizing discrete distributions.

Generate the random results of 10 shots and draw their frequency chart for the player with a goal probability of 0.5.

> set.seed(123)
> # Generate the outcomes of basketball shots
> shots <- rbinom(n = 10, size = 1, prob = 0.5)
> print(shots)
 [1] 0 1 0 1 1 0 1 1 1 0
> # Draw the frequency chart of the results
> barplot(table(shots))

Generate the random results of 10 shots and draw their frequency chart for the player with a goal probability of 0.3.

> # Generate the outcomes of basketball shots
> shots <- rbinom(n = 10, size = 1, prob = 0.3)
> print(shots)
 [1] 1 0 0 0 0 1 0 0 0 1
> # Draw the frequency chart of the results
> barplot(table(shots))

Generate the random results of 10 shots and draw their frequency chart for the player with a goal probability of 0.9.

> # Generate the outcomes of basketball shots
> shots <- rbinom(n = 10, size = 1, prob = 0.90)
> print(shots)
 [1] 1 1 1 0 1 1 1 1 1 1
> # Draw the frequency chart of the results
> barplot(table(shots))
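With only 10 shots, the share of goals can differ noticeably from the goal probability. The sketch below repeats the simulation with many more shots per player; the sample size of 10,000 is an assumption made for this illustration, not part of the exercise.

```r
set.seed(123)

# Goal probabilities of the three players from the examples above
probs <- c(0.3, 0.5, 0.9)

# Simulate 10,000 shots per player and compute the share of goals.
# The sample size 10,000 is an assumption made for this sketch.
success_rates <- sapply(probs, function(p) {
  mean(rbinom(n = 10000, size = 1, prob = p))
})
names(success_rates) <- probs
print(success_rates)
```

Because the outcomes are coded as 0 and 1, mean() directly gives the proportion of goals.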

Binomial distribution

In the previous exercise, you modeled Bernoulli trials. The binomial distribution describes the number of successful outcomes in a fixed number of independent Bernoulli trials that share the same probability of success.

The notation of the binomial distribution is \(B(n, p)\), where \(n\) is the number of experiments and \(p\) is the probability of a success.
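The probability mass function of the binomial distribution is \(P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}\). As a minimal sketch, the code below evaluates this formula by hand and compares it with dbinom(). The values of n, k, and p are illustrative assumptions.

```r
# Illustrative values: 3 successes in 10 trials with success probability 0.5
n <- 10
k <- 3
p <- 0.5

# P(X = k) = choose(n, k) * p^k * (1 - p)^(n - k)
pmf_by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)
pmf_builtin <- dbinom(x = k, size = n, prob = p)

print(pmf_by_hand)  # 120 / 1024 = 0.1171875
print(all.equal(pmf_by_hand, pmf_builtin))
```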

For this exercise, consider 10 consecutive fair coin flips. You’ve bet on tails, so a coin flip that lands on tails counts as a success.

Recall that:

dbinom(x = k, size = n, prob = p) calculates \(P(X=k)\) for \(X \sim B(n, p)\),

pbinom(q = k, size = n, prob = p) calculates \(P(X\leq k)\) for \(X \sim B(n, p)\) .

Remember that for discrete distributions that take on whole numbers: \(P(X \geq k) = 1 - P(X \leq k - 1)\).
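The complement rule can be sketched in three equivalent ways: subtracting the cumulative probability from 1, asking pbinom() for the upper tail, or summing the probability mass function. The threshold k = 3 is an arbitrary value chosen for illustration.

```r
n <- 10
p <- 0.5
k <- 3  # illustrative threshold: P(X >= 3)

# Three equivalent ways to compute P(X >= k)
via_complement <- 1 - pbinom(q = k - 1, size = n, prob = p)
via_upper_tail <- pbinom(q = k - 1, size = n, prob = p, lower.tail = FALSE)
via_pmf_sum    <- sum(dbinom(x = k:n, size = n, prob = p))

print(c(via_complement, via_upper_tail, via_pmf_sum))
```

Note that lower.tail = FALSE returns \(P(X > q)\), which is why q is set to k - 1 rather than k.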

Assign the probability of getting exactly 6 tails to six_tails and print the result.

> six_tails <- dbinom(x = 6, size = 10, prob = 0.5)
> print(six_tails)
[1] 0.2050781

Assign the probability of getting 7 or fewer tails to seven_or_less and print the result.

> seven_or_less <- pbinom(q = 7, size = 10, prob = 0.5)
> print(seven_or_less)
[1] 0.9453125

Assign the probability of getting 5 or more tails to five_or_more and print the result.

> five_or_more <- pbinom(q = 4, size = 10, prob = 0.5, lower.tail = FALSE)
> print(five_or_more)
[1] 0.6230469

Uniform distribution

Questions related to the continuous uniform distribution come up during interviews because the calculations associated with this distribution are relatively straightforward.

A random variable is usually denoted as \(X\) and a continuous uniform distribution on a range \([a, b]\) is denoted as \(U(a, b)\).

Recall that punif(q = k, min = a, max = b) calculates \(P(X \leq k)\) for \(X \sim U(a, b)\) .
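For \(a \leq k \leq b\), the cumulative probability of \(U(a, b)\) has the closed form \((k - a) / (b - a)\). The sketch below compares that formula with punif(). The helper uniform_cdf is a hypothetical name introduced for this illustration and is not part of base R.

```r
# Closed-form CDF of U(a, b); uniform_cdf is a hypothetical helper,
# not part of base R
uniform_cdf <- function(k, a, b) {
  # Clamp to [0, 1] so values outside [a, b] are handled correctly
  pmin(pmax((k - a) / (b - a), 0), 1)
}

k_values <- c(0, 1, 5.5, 10, 12)
by_hand <- uniform_cdf(k_values, a = 1, b = 10)
builtin <- punif(q = k_values, min = 1, max = 10)

print(by_hand)
print(all.equal(by_hand, builtin))
```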

You can derive the probability that a random variable falls into a range as a difference of the two cumulative probabilities:

\(P(j < X < k) = P(X \leq k) - P(X \leq j)\)

Question. What is the probability that a random variable from the continuous uniform distribution on the range \([1, 10]\) falls into the range \([4, 7]\)?

\[\frac{7-4}{10-1} = \frac{1}{3}\]

Calculate \(P(X \leq 7)\) if \(X \sim U(1, 10)\) and save it as seven_or_lower.

> seven_or_lower <- punif(q = 7, min = 1, max = 10)
> print(seven_or_lower)
[1] 0.6666667

Calculate \(P(X \leq 4)\) if \(X \sim U(1, 10)\) and save it as four_or_lower.

> four_or_lower <- punif(q = 4, min = 1, max = 10)
> print(four_or_lower)
[1] 0.3333333

Calculate \(P(4 < X < 7)\) if \(X \sim U(1, 10)\) and save it as between_four_and_seven.

> between_four_and_seven <- seven_or_lower - four_or_lower
> print(between_four_and_seven)
[1] 0.3333333
> # Alternatively:
> diff(punif(q = c(4, 7), min = 1, max = 10))
[1] 0.3333333
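The exact result can also be checked by simulation. This is a sketch under the assumption that 100,000 random draws are enough for a close estimate; the share of draws inside the range should land near 1/3.

```r
set.seed(123)

# Draw from U(1, 10); the sample size is an assumption of this sketch
draws <- runif(n = 100000, min = 1, max = 10)

# Share of draws that fall strictly between 4 and 7
simulated <- mean(draws > 4 & draws < 7)
exact <- punif(q = 7, min = 1, max = 10) - punif(q = 4, min = 1, max = 10)

print(c(simulated = simulated, exact = exact))
```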

Shape of normal distribution

All normal distributions are symmetric and have a bell-shaped density curve with a single peak.

The normal distribution takes two parameters: the mean (\(\mu\)) and the variance (\(\sigma^2\)). The notation of the normal distribution is \(N(\mu, \sigma^2)\). The mean indicates where the peak of the density curve occurs, and the variance indicates the spread of the bell curve.

The standard deviation (\(\sigma\)) is the square root of variance (\(\sigma^2\)). The rnorm() function takes the standard deviation (sd) as an argument.
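Because the notation uses the variance while rnorm() expects the standard deviation, it is easy to pass the wrong value. The sketch below contrasts the two for \(N(0, 3)\); the sample size is an assumption made for this illustration.

```r
set.seed(123)

# Target distribution N(0, 3): the variance is 3, so sd must be sqrt(3)
correct <- rnorm(n = 100000, mean = 0, sd = sqrt(3))

# Common mistake: passing the variance where sd is expected
mistaken <- rnorm(n = 100000, mean = 0, sd = 3)

print(var(correct))   # close to 3
print(var(mistaken))  # close to 9
```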

We will review descriptive statistics in the next chapter.

In this exercise, you will generate samples from three different normal distributions and visualize their distributions.

Question

How do the mean and the standard deviation impact the density curve of the normal distribution?

Answer

A higher mean shifts the graph to the right. A higher standard deviation flattens the curve.
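This answer can be illustrated with dnorm(), which evaluates the density curve. The peak sits at the mean and has height \(1 / (\sigma \sqrt{2\pi})\). The three distributions below, \(N(0, 1)\), \(N(0, 4)\), and \(N(3, 1)\), are illustrative choices.

```r
# The density at the mean is the peak height: 1 / (sd * sqrt(2 * pi))
peak_N01 <- dnorm(x = 0, mean = 0, sd = 1)
peak_N04 <- dnorm(x = 0, mean = 0, sd = sqrt(4))
peak_N31 <- dnorm(x = 3, mean = 3, sd = 1)

print(c(N01 = peak_N01, N04 = peak_N04, N31 = peak_N31))
```

Doubling the standard deviation halves the peak height, while changing the mean moves the peak without changing its height.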

Problem

  • Set the sample size, n, to \(50000\).

  • Generate samples of size n from the following three distributions: \(N(0, 1)\), \(N(0, 3)\), and \(N(2, 1)\).

  • Run the code and take a look at the generated plot. Which curve belongs to which distribution?

Solution

> library(tidyr)    # gather() and the %>% pipe
> library(ggplot2)  # ggplot() and geom_density()
> set.seed(123)
> 
> # Set the sample size
> n <- 50000
> 
> # Generate random samples from three distributions
> sample_N01 <- rnorm(n, mean = 0, sd = sqrt(1))
> sample_N03 <- rnorm(n, mean = 0, sd = sqrt(3))
> sample_N21 <- rnorm(n, mean = 2, sd = sqrt(1))
> 
> # Create a data frame
> data <- data.frame(sample_N01, sample_N03, sample_N21)
> head(data, 3)
  sample_N01 sample_N03 sample_N21
1 -0.5604756  0.4496226   2.264993
2 -0.2301775  1.5891752   3.830747
3  1.5587083 -1.2510921   1.940622
> # Gather the data
> data %>% 
+   gather(key = distribution, value = value) %>% 
+   head(3)
  distribution      value
1   sample_N01 -0.5604756
2   sample_N01 -0.2301775
3   sample_N01  1.5587083
> # Alternatively, use the melt() function from reshape2
> data %>% 
+   reshape2::melt(variable.name = "distribution", value.name = "value") %>% 
+   head(3)
No id variables; using all as measure variables
  distribution      value
1   sample_N01 -0.5604756
2   sample_N01 -0.2301775
3   sample_N01  1.5587083
> # Visualize the distributions
> data %>% gather(key = distribution, value = value) %>% 
+   ggplot(aes(x = value, fill = distribution)) +
+   geom_density(alpha = 0.3)

Well done! The plot supports your answer on how the mean and the standard deviation impact the density curve. You will tackle the questions on the shape of the normal distribution with ease in the interview. In the next exercise, you will work with true and sample probabilities of the normal distribution.
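Besides reading the plot, you can match curves to distributions numerically. The sketch below regenerates the three samples in base R and prints the sample mean and standard deviation of each column.

```r
set.seed(123)
n <- 50000

# Same three samples as in the solution above
data <- data.frame(
  sample_N01 = rnorm(n, mean = 0, sd = sqrt(1)),
  sample_N03 = rnorm(n, mean = 0, sd = sqrt(3)),
  sample_N21 = rnorm(n, mean = 2, sd = sqrt(1))
)

# The mean locates the peak; the standard deviation describes the spread
sample_means <- sapply(data, mean)
sample_sds <- sapply(data, sd)
print(sample_means)
print(sample_sds)
```

The column with the largest standard deviation corresponds to the flattest curve, and the column with the largest mean corresponds to the curve shifted to the right.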

Sample from normal distribution

The normal distribution is a frequent topic during interviews due to the vast applications of this distribution.

A random sample is a set of observed items drawn from the whole population. You can make inferences about the population based on a random sample taken from it. For example, you can calculate the sample probability, which is an estimate of the population’s true probability.

To compute the sample probability, calculate the proportion of the observations in a sample that meet the given criteria.

To compute the true probability, use probability functions.

Recall that:

  • the standard normal distribution has \(\mu=0\) and \(\sigma^2=1\) (referred to as \(N(0, 1)\)),

  • pnorm(q = k) returns \(P(X \leq k)\) for \(X \sim N(0, 1)\) .
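pnorm() also accepts mean and sd arguments for normal distributions other than the standard one. The sketch below shows that this matches standardizing the value first with \(z = (k - \mu) / \sigma\). The parameters are illustrative assumptions.

```r
# Illustrative parameters: X ~ N(3, 4), i.e. mean 3 and sd 2
mu <- 3
sigma <- 2
k <- 5

# P(X <= k) directly, and via the standardized value z = (k - mu) / sigma
direct <- pnorm(q = k, mean = mu, sd = sigma)
standardized <- pnorm(q = (k - mu) / sigma)

print(c(direct = direct, standardized = standardized))
```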

Problem

  • Generate 1000 data points from the standard normal distribution.

  • Draw a histogram of the generated data points.

  • Compute the probability that an observation from the generated sample, data_points, is less than or equal to 2.

  • Compute the true probability that a random variable from \(N(0, 1)\) is less than or equal to 2.

Solution

> set.seed(123)
> 
> # Generate data points
> data_points <- rnorm(n = 1000)
> 
> # Inspect the distribution
> hist(data_points)

> # Compute the sample probability and print it
> sample_probability <- mean(data_points <= 2)
> print(sample_probability)
[1] 0.972
> # Compute the true probability and print it
> true_probability <- pnorm(q = 2)
> print(true_probability) 
[1] 0.9772499
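The gap between the sample probability and the true probability is expected. A sample proportion based on n observations has a standard error of \(\sqrt{p(1 - p)/n}\). The sketch below computes it for n = 1000 and then repeats the estimate with other sample sizes; those sizes are assumptions made for this illustration.

```r
# True probability P(X <= 2) for X ~ N(0, 1)
true_probability <- pnorm(q = 2)

# Standard error of a sample proportion based on n observations
n <- 1000
standard_error <- sqrt(true_probability * (1 - true_probability) / n)
print(standard_error)

# Rerun the experiment with increasing sample sizes
set.seed(123)
sample_sizes <- c(100, 10000, 1000000)
estimates <- sapply(sample_sizes, function(m) mean(rnorm(n = m) <= 2))
print(data.frame(sample_sizes, estimates))
```

Larger samples give estimates that are typically closer to the true probability.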

Law of large numbers

In this exercise, you will simulate rolling a standard die numbered 1 through 6.

Simulation of die rolling can be performed using the sample() function.

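Each die roll is independent of the others, so the rolls must be sampled with replacement: set the replace parameter of the sample() function to TRUE. With the default replace = FALSE, no value can be drawn twice, and a sample whose size is bigger than the number of possible values fails with an error.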
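The behavior of the replace parameter can be sketched as follows. The text returned by the error handler is a label written for this example, not R's own error message.

```r
# Without replacement, asking for more draws than faces raises an error;
# tryCatch() captures it so the script keeps running
result <- tryCatch(
  sample(1:6, size = 10),
  error = function(e) "error: sample larger than the population"
)
print(result)

# With replacement, any sample size works
rolls <- sample(1:6, size = 10, replace = TRUE)
print(length(rolls))
```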

The law of large numbers states that the average of the results obtained from trials will tend to become closer to the expected value as more trials are performed.

Question

What is the expected value for a die roll?

Answer

This is:

> sum(c(1:6) * rep(1/6, 6))
[1] 3.5

Problem

  • Generate a sample of 20 die rolls and assign the result to small_sample.

  • Calculate the mean of the small_sample.

  • Generate a sample of 1000 die rolls and assign the result to big_sample.

  • Compute the mean of the big_sample.

Solution

> set.seed(1)
> 
> # Create a sample of 20 die rolls
> small_sample <- sample(1:6, size = 20, replace = TRUE)
> 
> # Calculate the mean of the small sample
> mean(small_sample)
[1] 3.4
> # Create a sample of 1000 die rolls
> big_sample <- sample(1:6, size = 1000, replace = TRUE)
> 
> # Calculate the mean of the big sample
> mean(big_sample)
[1] 3.517
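To watch the mean settle as rolls accumulate, you can track a running mean. This is a minimal sketch; the number of rolls and the checkpoints are assumptions made for this illustration.

```r
set.seed(1)

# Simulate 10,000 die rolls; the number of rolls is an assumption of this sketch
rolls <- sample(1:6, size = 10000, replace = TRUE)

# Mean of the first 1, 2, ..., 10,000 rolls
running_mean <- cumsum(rolls) / seq_along(rolls)

# Inspect the running mean at a few checkpoints
checkpoints <- c(10, 100, 1000, 10000)
print(data.frame(rolls = checkpoints, running_mean = running_mean[checkpoints]))
```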

Compare the mean of the small sample with the mean of the big sample. Which of the means is closer to the expected value that you calculated in the first step? The law of large numbers states that, as the sample size increases, the sample mean tends to get closer to the expected value. It’s important to distinguish between the law of large numbers and the central limit theorem in interviews, so let’s practice the CLT.

Simulating central limit theorem

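The central limit theorem (CLT) states that the mean of a sufficiently large number of independent draws from the same distribution with finite variance is approximately normally distributed, whatever the shape of the original distribution. This implies that we can apply statistical methods that work for normal distributions to problems involving other types of distributions. Interviewers are eager to check your understanding of the CLT, especially if your future position involves A/B testing.

You will show the mechanics behind the CLT using die rolls as an example.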

In the last exercise, you generated 1000 die rolls by setting the size parameter: sample(1:6, size = 1000, replace = TRUE).

In step 1 of this exercise, you will generate 1 die roll in a loop with 1000 iterations, which is equivalent to the above.

To visualize:

  • discrete data, you can use barplot(table(x)),

  • continuous data, you can use hist(x).

The die_outputs and mean_die_outputs vectors have already been initialized.

> die_outputs <- c()
> mean_die_outputs <- c()

Problem

  • In a loop with 1000 iterations, generate 1 random number from the range 1 to 6. Assign the results to the die_outputs vector.

  • Draw a bar chart to visualize the number of occurrences of each outcome.

Solution

> # Simulate 1000 die roll outputs
> for (i in 1:1000) {
+     die_outputs[i] <- sample(1:6, size = 1)
+ }
> 
> # Print the first 10 elements of the vector
> print(die_outputs[1:10])
 [1] 4 4 4 4 1 6 6 5 2 5
> # Visualize the number of occurrences of each result
> barplot(table(die_outputs))
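Growing a vector inside a loop works for the exercise, but preallocating it is the more common idiom in R. The sketch below shows a preallocated loop next to the vectorized call from the previous exercise. The name die_outputs_vectorized is introduced here for illustration.

```r
set.seed(123)

# Preallocate the result vector instead of growing it inside the loop
die_outputs <- integer(1000)
for (i in seq_along(die_outputs)) {
  die_outputs[i] <- sample(1:6, size = 1)
}

# Vectorized equivalent: draw all 1000 rolls in a single call
die_outputs_vectorized <- sample(1:6, size = 1000, replace = TRUE)

# Both vectors hold 1000 values between 1 and 6
print(table(die_outputs))
print(table(die_outputs_vectorized))
```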

Problem

  • Generate 30 die rolls 1000 times. Calculate the mean of each set of outputs and assign the means to the mean_die_outputs vector.

  • Plot the histogram of means for each set of 30 die rolls.

Solution

> # Calculate 1000 means of 30 die roll outputs
> for (i in 1:1000) {
+     mean_die_outputs[i] <- mean(sample(1:6, size = 30, replace = TRUE))
+ }
> 
> # Print the first 10 elements of the vector
> print(mean_die_outputs[1:10])
 [1] 3.666667 3.200000 3.600000 3.233333 3.100000 3.533333 4.133333 3.500000
 [9] 3.200000 3.600000
> # Inspect the distribution of the results
> hist(mean_die_outputs)

Note that a single call to sample() with size = 30 returns one set of 30 die rolls:

> sample(1:6, size = 30, replace = TRUE)
 [1] 6 4 4 5 4 2 3 2 4 3 3 1 1 6 3 2 6 3 4 6 5 6 2 5 4 4 5 5 4 5

Fantastic work! The distribution of the individual die rolls is uniform, but the distribution of the sample means is bell-shaped! You can apply the probabilistic and statistical methods that work for normal distributions to the distribution of sample means!
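The CLT also predicts where the bell is centered and how wide it is: the sample means should be centered at the expected value of one roll, with a standard deviation of \(\sigma / \sqrt{30}\). The sketch below checks this, using replicate() as an alternative to the loop. The seed is an arbitrary choice.

```r
set.seed(123)

# 1000 means of 30 die rolls, written with replicate() instead of a loop
mean_die_outputs <- replicate(1000, mean(sample(1:6, size = 30, replace = TRUE)))

# Mean and variance of a single fair die roll
die_mean <- mean(1:6)                  # 3.5
die_var <- mean((1:6 - die_mean)^2)    # 35 / 12

# The CLT predicts sample means centered at die_mean with
# standard deviation sqrt(die_var / 30)
print(c(observed_mean = mean(mean_die_outputs), predicted_mean = die_mean))
print(c(observed_sd = sd(mean_die_outputs), predicted_sd = sqrt(die_var / 30)))
```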