Probability Distributions

assignment week 11


1 Introduction

Hey everyone! Welcome to Chapter 7, where we dive into Probability Distributions. Think of a distribution like a map that tells us every possible outcome of an experiment (like flipping coins, rolling dice, or surveying students) and how often each outcome is likely to happen.

In this chapter, we’re going to meet our new best friends: random variables. We’ll learn how to describe their behavior, visualize their shapes (Symmetric? Skewed?), and use them to make solid, evidence-based guesses about the real world.

2 Sampling Distribution

In this video, we learn the basic concepts of sampling distribution. However, before understanding sampling distribution, we need to know the difference between sample distribution and sampling distribution.

Imagine we have a population of 10,000 people, and the average height (mean) of the population is known to be 5’4”. A sample is a small portion of individuals from the population that we use to draw conclusions. The average height of one sample could be 5’3”, another sample could be 5’7”, and another could be exactly 5’4”. This means that the sample mean is not always the same as the population mean. This happens because each sample contains only a small part of the population, so its mean varies from sample to sample and does not always represent the population perfectly.

2.1 Difference: Sample Distribution vs. Sampling Distribution

  1. A sample distribution is the distribution of the individual values in one sample taken from a population.
  2. A sampling distribution is the distribution of a statistic (such as the sample mean) calculated from many samples of the same size taken from the same population.

2.2 How to Form a Sampling Distribution

  1. Determine the population to be studied. For example, height.
  2. Take a simple random sample of size n. Example: n = 5.
  3. Measure the height of each individual in the sample.
  4. Calculate the sample mean (X̄).
  5. Record the X̄ values in a histogram/frequency table.
  6. Repeat the process thousands of times: take a new sample → calculate X̄ → add to the histogram.

If done enough times, this collection of X̄ values forms the sampling distribution of the sample mean. Its shape will typically be close to normal when the population itself is roughly normal or when n is large enough (according to the Central Limit Theorem).
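The six steps above can be tried out on a computer. The sketch below is only an illustration: the population (10,000 heights drawn from a normal distribution with mean 160 cm and SD 7 cm), the seed, and the helper name `sample_means` are all assumptions made for this example, not part of the original material.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 heights in cm (mean 160 and SD 7 are assumed values)
population = [random.gauss(160, 7) for _ in range(10_000)]

def sample_means(population, n, repeats):
    """Draw `repeats` simple random samples of size n and return their means."""
    return [statistics.fmean(random.sample(population, n)) for _ in range(repeats)]

# Steps 2 to 6: take a sample of size 5, compute its mean, repeat thousands of times
means = sample_means(population, n=5, repeats=5_000)

print("population mean:      ", round(statistics.fmean(population), 2))
print("mean of sample means: ", round(statistics.fmean(means), 2))
print("SD of sample means:   ", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):      ", round(statistics.pstdev(population) / 5 ** 0.5, 2))
```

If the theory holds, the first two printed numbers should be close to each other, and so should the last two. Plotting `means` as a histogram would show the shape of the sampling distribution.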

2.3 Population Distribution vs. Sampling Distribution

  • The population distribution has a mean = μ and a standard deviation = σ.
  • If X is normally distributed with mean μ and standard deviation σ, this is denoted as: X ~ N(μ, σ)

2.4 Standardization (Population)

Z = (X − μ) / σ

In the sampling distribution:

  • Mean of all X̄ (sample means) = μ
  • Standard deviation of the sampling distribution = σ / √n. This is also called the standard error.

2.5 Standardization in Sampling Distribution

Z = (X̄ − μ) / (σ / √n)

2.6 Important Summary

  • Population distribution: describes the values of all individuals in the population.
  • Sample distribution: describes the values of the individuals in a single sample.
  • Sampling distribution: describes a statistic (e.g., X̄) calculated from many repeated samples.

2.7 Why is Sampling Distribution Important?

Because it is impossible to measure everyone on earth (roughly 8 billion people), we use samples. From the sampling distribution, we can estimate the population mean without having to measure everyone.

Sampling distribution also allows us to calculate the probability of a certain value occurring based on the sample mean.


2.8 Example Question 1

Given that the height of Canadians follows a normal distribution with:

  • μ = 160 cm
  • σ = 7 cm

Question: What is the probability that the average height of 10 people (n = 10) is less than 157 cm?

Steps:

  1. This is a sampling distribution problem because it uses the sample mean.
  2. Mean of the sampling distribution = 160.
  3. Standard error = σ / √n = 7 / √10 = 2.21.
  4. Calculate Z: Z = (157 − 160) / 2.21 = −1.36
  5. Find P(Z < −1.36) = 0.0869.

So the probability that the average height of 10 Canadians is < 157 cm is 0.0869 or 8.69%.
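The steps of this example can be double-checked with a few lines of Python. This is a minimal sketch using `NormalDist` from the standard library instead of a printed Z table; the variable names are chosen just for this illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 160, 7, 10           # values from Example Question 1
standard_error = sigma / sqrt(n)    # SD of the sampling distribution of the mean

z = (157 - mu) / standard_error     # standardize the sample mean of 157 cm
prob = NormalDist().cdf(z)          # area to the left of z under N(0, 1)

print(round(standard_error, 2), round(z, 2), round(prob, 4))
```

The printed probability may differ from 0.0869 in the third decimal place, because the Z table works with Z rounded to two decimals while the code keeps the unrounded value.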


2.9 Example 2

What is the proportion of all people (not the sample) who are taller than 170 cm?

Since the question asks about all people, the population distribution is used, not the sampling distribution.

  1. Calculate Z: Z = (170 − 160) / 7 = 1.43
  2. From the Z table, the area to the left of Z = 1.43 is 0.9236.
  3. The question asks for the right area → 1 − 0.9236 = 0.0764.

So the proportion of people who are taller than 170 cm is 0.0764 or 7.64%.
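For comparison with Example 1, here is a small sketch of the same calculation in Python. Note that it uses σ = 7 directly (the population distribution) and not the standard error, because the question is about individuals, not about a sample mean. The names `heights` and `prop_taller` are made up for this illustration.

```python
from statistics import NormalDist

heights = NormalDist(mu=160, sigma=7)   # population distribution from the example
prop_taller = 1 - heights.cdf(170)      # area to the right of 170 cm

print(round(prop_taller, 4))
```

The result should agree with 0.0764 up to the small rounding difference that comes from using Z = 1.43 in the table.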


3 Central Limit Theorem

Before discussing the Central Limit Theorem, we must first understand what a sampling distribution is.

A sampling distribution is a distribution formed from the results of repeated sampling from a population, then calculating certain statistics (e.g., mean, X̄) from each sample. The process is as follows:

  1. Take one sample from the population.
  2. Calculate the value of X̄.
  3. Repeat this process many times.
  4. Collect all X̄ values and plot them on a graph.

This forms the sampling distribution of the sample mean.


3.1 Central Limit Theorem (CLT)

The Central Limit Theorem states that:

If the sample size (n) is large enough, then the sampling distribution of the sample mean (X̄) will be approximately normal, even if the original population distribution is not normal.

This means: the population may be skewed, random, or asymmetrical, but with a sufficiently large sample, the distribution of X̄ will still be close to normal.


3.2 Example Explanation

When we take a random sample from a population:

  • Most samples come from parts of the population that have a high probability.
  • Less data comes from rare parts.
  • Some samples may contain data from uncommon parts—this is normal.

If we calculate X̄ from each sample and combine them, most X̄ will be close to the true population mean (µ). Some X̄ may be further away, but this is normal probabilistically.

When all X̄ values are combined, their distribution is approximately normal (bell-shaped). A histogram of these values is visual evidence of the CLT.
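One way to see this effect without a graph is to measure how lopsided the collection of X̄ values is. The sketch below is an illustration under assumptions: the population is taken to be exponential with mean 1 (a strongly right-skewed shape chosen only for this example), and `skewness` is a simple helper written here, where a value near 0 means roughly symmetric.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def skewness(values):
    """Average cubed z-score: about 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def mean_of_sample(n):
    # Assumed population: exponential with mean 1 (strongly right-skewed)
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

results = {}
for n in (2, 30, 100):
    means = [mean_of_sample(n) for _ in range(5_000)]
    results[n] = skewness(means)
    print(n, round(statistics.fmean(means), 3), round(results[n], 3))
```

The skewness of the X̄ values should shrink toward 0 as n grows, while their average stays near the population mean of 1.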


3.3 Adequate Sample Size

General rule:

The Central Limit Theorem is safe to use when the sample size n ≥ 30.

  • If n < 30 → the sampling distribution is not necessarily normal.
  • If n ≥ 30 → the sampling distribution is usually close to normal.

However, if the initial population is already normal, then even a small sample size will still produce a normal sampling distribution.


3.4 The Importance of CLT

If we know that the sampling distribution is normal, then we can:

  • use the normal distribution formula,
  • perform statistical inference,
  • make more accurate parameter estimates.

3.5 Example Question from the Video

The answers that result in a sampling distribution that is almost normal are:

C, D, E, and F

Explanation:

  • A & B are incorrect → sample size < 30.
  • C, D, F are correct → sample size ≥ 30 → CLT applies.
  • E is correct → the initial population is already normal → the sampling distribution is automatically normal.

3.6 Conclusion

  • Sampling distributions are formed from repeated sampling.
  • The CLT guarantees that sampling distributions approach normal if the sample size is large enough.
  • n ≥ 30 is a safe size.
  • If the initial population is already normal, even small samples are fine.
  • The CLT is important for statistical analysis and estimation.

4 Sample Proportion

In this video, we learn about sampling distribution of sample proportions. Before we get into that, we need to understand two basic concepts:

4.1 What is Sampling Distribution?

A sampling distribution is built by:

  1. Taking samples repeatedly from a population.
  2. Calculating statistics from each sample, such as the mean (X̄) or proportion (P-hat).
  3. Combining all of these statistical values into a graph. This graph is called a sampling distribution.

4.1.1 What is Proportion?

In statistics, proportion describes a part of a whole that has certain characteristics.

Examples of variables for which the proportion can be calculated:

  • height
  • weight
  • eye color
  • test scores

For example, suppose we want to know the proportion of people with green eyes. There are two ways:

  • Take a sample and then count how many people have green eyes.
  • Measure the entire population.

Proportion formula: proportion = number of successes / total number

Example:

  • Sample of 10 people, 2 have green eyes → P-hat = 2/10 = 0.2
  • Population of 5000 people, 900 have green eyes → P = 900/5000 = 0.18

Important notes:

  • Population proportion → denoted by P
  • Sample proportion → denoted by P-hat

4.2 Variation in P-hat Values

If we take repeated samples from the population:

  • Some samples produce P-hat = 0.21
  • Some produce P-hat = 0.19
  • Some produce P-hat = 0.17
  • And so on

These differences are normal because each sample contains different data. If all these P-hat values are combined in a graph, we obtain the sampling distribution of the proportion (P-hat).
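This sample-to-sample variation is easy to simulate. In the sketch below, P = 0.18 is taken from the green-eyes example, while the sample size of 100, the seed, and the helper name `p_hat` are assumptions made only for this illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

P = 0.18   # population proportion from the green-eyes example
N = 100    # assumed sample size for this illustration

def p_hat(n, p):
    """Proportion of successes in one random sample of size n."""
    successes = sum(random.random() < p for _ in range(n))
    return successes / n

# Take many samples and keep the P-hat from each one
p_hats = [p_hat(N, P) for _ in range(5_000)]

print(p_hats[:5])                           # a few individual P-hat values
print(round(statistics.fmean(p_hats), 3))   # their average
```

The individual values differ from sample to sample, but their average should sit very close to P.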


4.3 Statistics in the P-hat Distribution

Like distributions in general, the P-hat distribution also has:

  • Mean (μₚ̂)
  • Standard deviation (σₚ̂)

If the sampling distribution of proportions follows the Central Limit Theorem, then:

a. Mean of the P-hat distribution:

μₚ̂ = P

This means that the average of all P-hat values approaches the population proportion.

b. Standard deviation of the P-hat distribution:

σₚ̂ = sqrt( P × (1 – P) / N )

Explanation:

  • N = sample size
  • P = population proportion
  • Q = 1 – P, so the formula can also be written as sqrt( P × Q / N )

c. The P-hat distribution is normal if the CLT conditions are met, namely:

  • N × P ≥ 10
  • N × (1 – P) ≥ 10
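The three facts above (mean, standard deviation, and the two conditions) can be bundled into one small helper. This is only a sketch: the function name `p_hat_summary` is hypothetical, and the inputs reuse P = 0.18 from the green-eyes example with two assumed sample sizes.

```python
from math import sqrt

def p_hat_summary(p, n):
    """Return the mean and SD of P-hat, and whether both CLT conditions hold."""
    mean = p                                           # mean of P-hat equals P
    sd = sqrt(p * (1 - p) / n)                         # sqrt(P x (1 - P) / N)
    normal_ok = n * p >= 10 and n * (1 - p) >= 10      # both conditions must hold
    return mean, sd, normal_ok

print(p_hat_summary(0.18, 100))   # N x P = 18, so the conditions hold
print(p_hat_summary(0.18, 10))    # N x P = 1.8, so the conditions fail
```

With N = 10 the normal approximation should not be trusted, even though the mean and standard deviation formulas still apply.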

4.4 Standardization (Z-score) for Proportions

To calculate a specific probability of P-hat, we can use the formula:

Z = (P-hat – P) / sqrt( P × (1 – P) / N )

This formula is used to find the probability based on the Z-score table.


4.5 Differences between CLT for Mean and Proportion

Distribution type                               | CLT requirement
Sampling distribution of the mean (X̄)           | N ≥ 30
Sampling distribution of the proportion (P-hat) | N × P ≥ 10 and N × (1 – P) ≥ 10

Checking these conditions matters because they are what allow the P-hat distribution to be treated as approximately normal.


4.6 Relationship with Binomial Distribution

The sampling distribution of proportions is closely related to:

  • Binomial distribution
  • Probability formula

This is because a proportion essentially reflects “how many successes out of the total number of trials”: if X is the binomial count of successes in N trials, then P-hat = X / N.


4.7 Conclusion

  • P-hat is the sample proportion.

  • The sampling distribution of P-hat is formed when we take many samples from the population.

  • Mean of the P-hat distribution = P

  • Standard deviation = sqrt( P × (1 – P) / N )

  • To use the CLT for proportions, both of the following conditions must be met:

    • N × P ≥ 10

    • N × (1 – P) ≥ 10

  • If the conditions are met, we can use the Z-score to calculate the probability.


5 Review Sampling Distribution

Binomial Distribution, Probability, and Sampling Distribution of Proportion
This material summarizes the discussion on probability, binomial distribution, and sampling distribution of proportion (P-hat). The following explanation is arranged sequentially for easy understanding.

In the original video, the discussion focuses on how probability, binomial distribution, and sampling distribution are connected through example problems. This material is very helpful for understanding statistical concepts that are often confusing when studied separately.

Learning objectives:

  • Understand basic probability
  • Calculate probability with binomial distributions
  • Determine probability using the sampling distribution of proportions with the Central Limit Theorem (CLT)


5.1 Example 1: Probability of Drawing at Least Two Green Reviews

Suppose there are:

  • 200 green reviews
  • 300 blue reviews

Total = 500 reviews.

We take 3 reviews with replacement and want to find the probability of at least two green reviews.

5.1.1 Determining basic probability

  • Probability of getting a green review
    \[ P(G) = \frac{200}{500} = 0.4 \]

  • Probability of getting a blue review
    \[ P(B) = \frac{300}{500} = 0.6 \]

5.1.2 Sample space

Each sequence of reviews can be written as a combination of G and B. Examples:

  • GGB
  • BGB
  • BBB
  • and so on.

5.1.3 Calculating the probability for each sequence

Examples:

  • Probability of GBB
    \[ 0.4 \times 0.6 \times 0.6 = 0.144 \]

  • Probability of BBB
    \[ 0.6 \times 0.6 \times 0.6 = 0.216 \]

5.1.4 Probability of at least two green reviews

We need to add two probabilities:

  • Exactly 2 green
  • Exactly 3 green

From the calculation:

  • Probability of exactly 2 green = 3 × (0.4 × 0.4 × 0.6) = 0.288, covering the three sequences GGB, GBG, and BGG
  • Probability of exactly 3 green = 0.4 × 0.4 × 0.4 = 0.064

Therefore: \[ P(\text{≥2 green}) = 0.288 + 0.064 = 0.352 \]
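The sample-space approach can also be carried out by brute force: list all 8 sequences, compute each probability, and add up the ones with at least two G. The sketch below does exactly that; the dictionary `prob` and the loop layout are just one possible way to write it.

```python
from itertools import product

prob = {"G": 0.4, "B": 0.6}   # P(G) and P(B) from the example

total = 0.0
for seq in product("GB", repeat=3):            # all 8 sequences of length 3
    p_seq = prob[seq[0]] * prob[seq[1]] * prob[seq[2]]
    if seq.count("G") >= 2:                    # keep sequences with 2 or 3 greens
        total += p_seq
        print("".join(seq), round(p_seq, 3))

print("P(at least 2 green) =", round(total, 3))
```

Four sequences qualify (GGG, GGB, GBG, BGG), and their probabilities add up to the 0.352 found above.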


5.2 Example 2: Using the Binomial Distribution

Now we take 5 reviews, again with replacement, so the probability of green stays at 0.4 on every draw.

Finding the probability of at least 2 green reviews:

\[ n = 5,\; p = 0.4 \]

We need to add the probabilities of:

  • exactly 2 green
  • exactly 3 green
  • exactly 4 green
  • exactly 5 green

Each probability is calculated using the binomial formula:

\[ P(X=k) = {n \choose k} p^{k} (1-p)^{n-k} \]

After summing everything: \[ P(\text{≥2 green}) = 0.66304 \]
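The sum can be checked with a direct translation of the binomial formula. This is a minimal sketch; the helper name `binom_pmf` is chosen for this illustration and `math.comb` supplies the "n choose k" part.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, 0.4
at_least_two = sum(binom_pmf(k, n, p) for k in range(2, n + 1))

print(round(at_least_two, 5))   # 0.66304
```

A shortcut that gives the same answer is the complement: 1 minus the probabilities of 0 green and 1 green.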


5.3 Example 3: Sampling Distribution of Proportion

Suppose reviews (marbles) are drawn 100 times with replacement, with a probability of green: \[ P = 0.4 \]

We want to find the probability of at least 35 green reviews, that is, the probability that the sample proportion is at least:

\[ \hat{P} = \frac{35}{100} = 0.35 \]

5.3.1 Checking the Central Limit Theorem (CLT) conditions

The CLT for proportions applies if:

  • \(nP \ge 10\)
  • \(n(1-P) \ge 10\)

Check:

  • \(nP = 100 \times 0.4 = 40 \ge 10\)
  • \(n(1-P) = 100 \times 0.6 = 60 \ge 10\)

The conditions are met, meaning that the sampling distribution of the proportion can be treated as approximately normal.


5.4 Z-score for Proportion

Using the formula:

\[ Z = \frac{\hat{P} - P}{\sqrt{\frac{P(1-P)}{n}}} \]

Substitute into the formula:

  • \(\hat{P} = 0.35\)
  • \(P = 0.4\)
  • \(n = 100\)

Result: \[ Z = \frac{0.35 - 0.4}{\sqrt{\frac{0.4 \times 0.6}{100}}} = \frac{-0.05}{0.049} \approx -1.02 \]

For Z = -1.02, the area to the left is 0.1539.

Since we want to find the probability of at least 35 greens (right area):

\[ 1 - 0.1539 = 0.8461 \]

This means that, \[ P(\text{≥35 greens}) = 0.8461 = 84.61\% \]
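Here is a short sketch that repeats this calculation in Python. Instead of standardizing first, it builds the approximate normal distribution of P-hat directly (mean P, SD sqrt(P(1 − P)/n)) and asks for the area to the right of 0.35. The variable names are chosen just for this illustration.

```python
from math import sqrt
from statistics import NormalDist

P, n = 0.4, 100                           # values from Example 3
sd = sqrt(P * (1 - P) / n)                # standard deviation of P-hat
p_hat_dist = NormalDist(mu=P, sigma=sd)   # normal approximation for P-hat

z = (0.35 - P) / sd                       # same Z-score as in the formula
prob = 1 - p_hat_dist.cdf(0.35)           # area to the right of 0.35

print(round(z, 2), round(prob, 4))
```

The probability should match 0.8461 up to a small rounding difference, since the table lookup uses Z rounded to two decimals.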


5.5 Important Notes

  • Using the sampling distribution of proportions gives approximate probabilities, not exact values.
  • For exact probabilities, use:
    1. Sample space (if small)
    2. Binomial distribution (more common)
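To see how far the approximation is from the exact value in Example 3, the binomial formula can be summed for every count from 35 to 100. This is a sketch for comparison only; doing this sum by hand would be impractical, which is why the normal approximation is taught.

```python
from math import comb

n, p = 100, 0.4   # values from Example 3

# Exact P(X >= 35): add the binomial probability of each count from 35 to 100
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(35, n + 1))

print(round(exact, 4))
```

The exact probability comes out a bit higher than the 0.8461 from the normal approximation, which illustrates that the Z-score method gives an estimate rather than an exact value.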

5.6 Conclusion

  • The base probability is calculated by comparing the number of successful outcomes to the total number of outcomes.
  • The binomial distribution gives exact probabilities; it is most practical by hand when the number of trials (n) is not too large.
  • The sampling distribution of proportions is used if n is large and the CLT conditions are met.
  • With standardization (Z-score), we can easily estimate probabilities.