Essentials of Probability

Assignment Week 11


1 Introduction

Probability distributions describe how the values of a random variable are spread out and how likely each outcome is. They help us understand uncertainty, whether the data is discrete (countable events) or continuous (measurements). By identifying the correct distribution, we can analyze data more accurately, make better predictions, and support decision-making in various statistical applications.

2 Sampling Distribution

library(knitr)
include_url("https://www.youtube.com/embed/7S7j75d3GM4")

A sampling distribution is the distribution of a statistic (e.g., the mean) obtained by taking many simple random samples from a population. A sample distribution is derived from only one sample, while a sampling distribution is derived from many samples.

In the sampling distribution of the mean, the average of all the sample means \({\bar{X}}\) is always equal to the population mean 𝜇. However, its standard deviation is smaller than the population's, because each value is an average over a sample rather than an individual observation.

The standard deviation of the sampling distribution is called the standard error and is calculated as: \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\)

Standardization formula for sampling distribution: \(z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\)
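The two properties above (mean equal to 𝜇, standard deviation equal to \(\sigma / \sqrt{n}\)) can be checked with a small simulation. This is only a sketch: the normal population with 𝜇 = 160 and 𝜎 = 7, the sample size of 10, the 5000 repetitions, and the seed are illustrative assumptions.

```python
import random
import statistics

# Assumed illustrative population: normal with mu = 160 and sigma = 7.
MU, SIGMA, N = 160.0, 7.0, 10
NUM_SAMPLES = 5000  # hypothetical number of repeated samples

rng = random.Random(42)  # fixed seed so the run is repeatable

# Draw many samples of size N and record the mean of each one.
sample_means = [
    statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = SIGMA / N ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.2f} (population mean {MU})")
print(f"sd of sample means:   {sd_of_means:.2f} (sigma / sqrt(n) = {theoretical_se:.2f})")
```

The simulated values should land close to 160 and 2.21, with small differences from run to run if the seed is changed.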

Sampling distributions are useful because they let us estimate population parameters without measuring the entire population, and because they let us calculate the probability of observing a given sample statistic for a given sample size.

Example 1:

Height distribution of Canadians: 𝜇 = 160 cm, 𝜎 = 7 cm, 𝑛 = 10.

Standard error:

\(\sigma_{\bar{X}} = \frac{7}{\sqrt{10}} = 2.21\)

Probability that \({\bar{X}} < 157\):

\(z = \frac{157 - 160}{2.21} = -1.36\)

From the z table, the probability = 0.0869 (8.69%).
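The same calculation can be reproduced in a few lines, here as a sketch using Python's standard library. `NormalDist().cdf` stands in for the z table; the variable names are illustrative. The rounded value mirrors the table lookup, while the unrounded one shows the effect of rounding z to two decimals.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 160, 7, 10
threshold = 157

standard_error = sigma / sqrt(n)             # 7 / sqrt(10)
z_exact = (threshold - mu) / standard_error  # z without any rounding

# The worked example rounds the standard error to 2.21 and z to two decimals
# before looking the value up in the z table.
z_table = round((threshold - mu) / round(standard_error, 2), 2)

std_normal = NormalDist()  # standard normal: mean 0, sd 1
p_table = std_normal.cdf(z_table)  # matches the table lookup
p_exact = std_normal.cdf(z_exact)  # slightly larger, since z is not rounded

print(f"SE = {standard_error:.2f}, z = {z_table}, P = {p_table:.4f}")
print(f"without rounding: z = {z_exact:.4f}, P = {p_exact:.4f}")
```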

Example 2:

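Proportion of people taller than 170 cm. This question is about a single individual rather than a sample mean, so we standardize with 𝜎 instead of the standard error:

Standardization: \(z = \frac{170 - 160}{7} = 1.43\)

Area to the right: \(1 - 0.9236 = 0.0764\)

So the proportion is 7.64%.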
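As a quick check, the right-tail area can be computed directly from a normal distribution with 𝜇 = 160 and 𝜎 = 7. This sketch uses `statistics.NormalDist` in place of the z table; because it does not round z to 1.43, the result differs from the table value in the fourth decimal place.

```python
from statistics import NormalDist

# Individual heights, not sample means, so sigma is used directly.
heights = NormalDist(mu=160, sigma=7)

# Area to the right of 170 cm.
proportion_taller = 1 - heights.cdf(170)

print(f"proportion taller than 170 cm: {proportion_taller:.3f}")
```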

3 Central Limit Theorem

library(knitr)
include_url("https://www.youtube.com/embed/ivd8wEHnMCg")

A sampling distribution is created by repeatedly taking simple random samples from a population, calculating a statistic (such as the sample mean \({\bar{X}}\)) for each sample, and plotting these values to form a distribution.

The Central Limit Theorem (CLT) states that if the sample size 𝑛 is large enough, the sampling distribution of the sample mean will be approximately normal, regardless of the shape of the population distribution. This means that even a skewed population can produce an approximately normal sampling distribution when 𝑛 ≥ 30.

When taking repeated samples, most sample means \({\bar{X}}\) will fall close to the true population mean μ. Some will be farther away, but collectively, these \({\bar{X}}\) values form a sampling distribution that becomes normal when the sample size is sufficiently large.
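The effect of sample size can be illustrated with a simulation. The sketch below assumes a strongly right-skewed population (an exponential distribution with mean 1, chosen only for illustration) and compares the skewness of the sample means for 𝑛 = 2 and 𝑛 = 30. The `skewness` and `simulate_sample_means` helpers are hypothetical names, not part of any library.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def simulate_sample_means(n, reps=5000):
    """Means of `reps` samples of size n from an exponential population (mean 1)."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]


means_small = simulate_sample_means(2)
means_large = simulate_sample_means(30)

skew_small = skewness(means_small)
skew_large = skewness(means_large)

print(f"n = 2:  skewness of sample means = {skew_small:.2f}")
print(f"n = 30: skewness of sample means = {skew_large:.2f}")
```

The skewness for 𝑛 = 30 should be much closer to zero than for 𝑛 = 2, which is the CLT at work: the distribution of \({\bar{X}}\) becomes more symmetric and bell-shaped as 𝑛 grows, while both sets of sample means stay centered near the population mean.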

A general rule of thumb:

The CLT can generally be applied when 𝑛 ≥ 30.
If the population itself is already normally distributed, then the sampling distribution will be normal even for small samples.

Small sample sizes produce more variability, less precision, and a higher chance of obtaining unusual samples, so they do not guarantee a normal sampling distribution unless the population is normal.

The CLT is useful because a normal sampling distribution allows us to apply formulas and tools related to the normal distribution to analyze data.

When evaluating situations to determine whether the sampling distribution will be approximately normal:

Cases where 𝑛 < 30 and the population is not normal → approximate normality is not guaranteed.
Cases where 𝑛 ≥ 30, or the population is normal → approximately normal.
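These two cases amount to a simple decision rule, sketched below. The function name and the `min_n` parameter are illustrative, and the cutoff of 30 is the rule of thumb from this section rather than a strict mathematical boundary.

```python
def sampling_distribution_is_approx_normal(
    n: int, population_is_normal: bool, min_n: int = 30
) -> bool:
    """Rule of thumb: a normal population, or a sample size of at least min_n."""
    return population_is_normal or n >= min_n


# A few hypothetical situations.
print(sampling_distribution_is_approx_normal(10, population_is_normal=False))  # False
print(sampling_distribution_is_approx_normal(10, population_is_normal=True))   # True
print(sampling_distribution_is_approx_normal(45, population_is_normal=False))  # True
```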

4 Sample Proportion

library(knitr)
include_url("https://www.youtube.com/embed/q2e4mK0FTbw")

A sampling distribution is created by repeatedly taking samples from a population, calculating a statistic for each sample (such as \({\bar{X}}\) or \(\hat{p}\)), and then combining the results into a distribution.

For proportions, the statistic of interest is the proportion of successes, which represents the fraction of outcomes that meet a specific criterion. The criterion can be defined on any measurable variable, such as height, weight, eye color, or exam scores (for example, the fraction of people with a particular eye color).

The sample proportion and population proportion are calculated using the same idea:

Sample Proportion \(\hat{p} = \frac{\text{number of successes}}{\text{sample size}}\)

Population Proportion \(p = \frac{\text{number of successes in population}}{\text{population size}}\)

Since each sample contains different individuals, repeated samples can produce different values of \(\hat{p}\).

Plotting all these \(\hat{p}\) values forms the sampling distribution of the sample proportion.

This distribution has a mean and a standard deviation:

Mean of the sampling distribution: \(\mu_{\hat{p}} = p\)

Standard deviation (standard error): \(\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}\)

If the sampling distribution is approximately normal, we can use Z-scores and standardization formulas to compute probabilities.

Z-score formula for proportions: \(Z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}\)
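The standard error and Z-score formulas translate directly into code. In this sketch the function names and the numbers (p = 0.4, n = 100, observed \(\hat{p}\) = 0.45) are hypothetical, chosen only to show the arithmetic.

```python
from math import sqrt
from statistics import NormalDist


def proportion_standard_error(p, n):
    """Standard error of the sample proportion: sqrt(p(1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


def proportion_z_score(p_hat, p, n):
    """Z-score of an observed sample proportion p_hat."""
    return (p_hat - p) / proportion_standard_error(p, n)


# Hypothetical values for illustration only.
p, n, p_hat = 0.4, 100, 0.45

se = proportion_standard_error(p, n)
z = proportion_z_score(p_hat, p, n)
prob_above = 1 - NormalDist().cdf(z)  # P(sample proportion > 0.45)

print(f"SE = {se:.4f}, z = {z:.2f}, P(p_hat > {p_hat}) = {prob_above:.3f}")
```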

To apply the Central Limit Theorem (CLT) for sample proportions, two conditions must be satisfied:

Conditions for CLT to apply: \(np \ge 10\) and \(n(1 - p) \ge 10\)

When both conditions are met, the sampling distribution of \(\hat{p}\) is approximately normal, allowing us to use Z-tables and normal distribution methods. Sampling distributions for proportions are also related to the binomial distribution, since the sample proportion is a binomial count of successes divided by the sample size \(n\).
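The two conditions are easy to check before reaching for the normal approximation. The helper below is a minimal sketch; its name and the `threshold` parameter are illustrative, and the example values of n are hypothetical.

```python
def clt_conditions_met(n, p, threshold=10):
    """True when both n*p and n*(1 - p) reach the threshold (10 in this section)."""
    return n * p >= threshold and n * (1 - p) >= threshold


# With p = 0.4, a small sample fails the check and a larger one passes.
for n in (5, 20, 100):
    print(f"n = {n:3d}: np = {n * 0.4:.0f}, n(1-p) = {n * 0.6:.0f}, "
          f"conditions met: {clt_conditions_met(n, 0.4)}")
```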

5 Review Sampling Distribution

library(knitr)
include_url("https://www.youtube.com/embed/c0mFEL_SWzE")

1. Probability of Green & Blue

There are 200 green and 300 blue marbles → Probability of green:

\(p = \frac{200}{500} = 0.4\)

Probability of blue: \(q = 1 - p = 0.6\)
When drawing three times with replacement, the draws are independent, so the probability of a specific sequence (e.g., GBB) is the product of the individual probabilities: \(0.4 \times 0.6 \times 0.6 = 0.144\)
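The product rule for a sequence of independent draws can be written as a short function. This is a sketch; the `sequence_probability` name and the single-letter color codes are illustrative choices.

```python
from math import prod

# 200 green and 300 blue marbles, drawn with replacement.
P_GREEN = 200 / 500
COLOR_PROB = {"G": P_GREEN, "B": 1 - P_GREEN}


def sequence_probability(sequence):
    """Probability of an exact ordered sequence such as 'GBB'."""
    return prod(COLOR_PROB[color] for color in sequence)


print(f"P(GBB) = {sequence_probability('GBB'):.3f}")
```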

2. Probability of “At Least Two Green” (Manual Sample Space Method)

Enumerate all sequences with ≥2 greens and add their probabilities.

Example: \(P(GGB) + P(GBG) + P(BGG) + P(GGG) = 3 \times (0.4^{2} \times 0.6) + 0.4^{3} = 0.288 + 0.064\)

Final combined probability from the example: 0.352
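The manual enumeration can be mirrored in code by listing every three-draw sequence and keeping those with at least two greens. This is a sketch using `itertools.product`; the variable names are illustrative.

```python
from itertools import product
from math import prod

color_prob = {"G": 0.4, "B": 0.6}

# All 2^3 = 8 ordered outcomes of three draws with replacement.
sequences = ["".join(draws) for draws in product("GB", repeat=3)]

# Keep only the outcomes with two or more green marbles.
at_least_two_green = [s for s in sequences if s.count("G") >= 2]

# Add up the probability of each qualifying sequence.
total = sum(prod(color_prob[c] for c in s) for s in at_least_two_green)

print(at_least_two_green)
print(f"P(at least two green) = {total:.3f}")
```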

3. Using Binomial Formula for Larger n

Instead of listing sample space, use the binomial formula: \(P(X = k) = \binom{n}{k} p^{k} (1-p)^{\,n-k}\)

To find at least 2 successes when \(n = 5\): \(P(X \ge 2) = P(X=2) + P(X=3) + P(X=4) + P(X=5)\)

Example calculation for \(n = 5\), \(p = 0.4\) gives total probability:

\[ P(X \ge 2) = 0.66304 \]
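The binomial sum can be evaluated with `math.comb`. The function names below are illustrative; the sketch reproduces the \(n = 5\), \(p = 0.4\) result and also confirms the three-draw answer from the enumeration method.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k_min, n, p):
    """P(X >= k_min), summing the pmf from k_min up to n."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))


print(f"n = 5: P(X >= 2) = {prob_at_least(2, 5, 0.4):.5f}")
print(f"n = 3: P(X >= 2) = {prob_at_least(2, 3, 0.4):.3f}")
```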

4. Important Note

Using the CLT gives an approximation, not the exact binomial probability.

Exact results require the full binomial sum or sample space enumeration, which becomes impractical for large 𝑛.
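To see the size of the gap, the sketch below compares the exact binomial sum with the normal approximation for one hypothetical case: 𝑛 = 50 draws with 𝑝 = 0.4, asking for at least 25 greens (a sample proportion of at least 0.5). The function names are illustrative, and the approximation is applied without a continuity correction, which is an assumption of this example.

```python
from math import comb, sqrt
from statistics import NormalDist


def exact_prob_at_least(k_min, n, p):
    """Exact P(X >= k_min) from the full binomial sum."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def clt_prob_at_least(k_min, n, p):
    """Normal approximation to P(p_hat >= k_min / n), no continuity correction."""
    se = sqrt(p * (1 - p) / n)
    z = (k_min / n - p) / se
    return 1 - NormalDist().cdf(z)


# Hypothetical case: np = 20 and n(1 - p) = 30, so both CLT conditions hold.
n, p, k_min = 50, 0.4, 25

exact = exact_prob_at_least(k_min, n, p)
approx = clt_prob_at_least(k_min, n, p)

print(f"exact binomial:       {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

The two numbers are close but not identical, which is the point of the note: the CLT result is convenient for large 𝑛, while the binomial sum gives the exact value.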