Probability Distribution
Assignment ~ Week 11
1 About Probability
Probability serves as the theoretical foundation for statistics, providing the framework for evaluating and managing uncertainty in data-driven decision-making processes.
In the context of experiments or processes that display varying outcomes, the concept of a random variable is utilized to model those outputs. Subsequently, the probability distribution determines how the likelihood (probability) is divided or allocated among all possible values of that random variable.
Mastering the characteristics and visualization of a distribution is crucial, as this understanding underpins the inferential methods developed throughout this chapter.
This chapter will delve into several core topics:
- Continuous Random Variables: variables that can take any value within a continuous interval.
- Sampling Distributions: distributions of a statistic computed across many repeated samples.
- Central Limit Theorem (CLT): the fundamental principle that sample means tend toward a normal distribution, even if the original population distribution is non-normal.
- Sample Proportion Distributions: distribution patterns that are highly relevant and frequently used in the analysis of survey data.
2 Continuous Random Variables
This video serves as an essential introduction to the world of Continuous Random Variables by contrasting them with Discrete Random Variables. It establishes that understanding probability distributions is the core foundation of the inferential statistics used for decision-making. The main difference lies in how data is obtained (counting vs. measuring) and, consequently, how probabilities are calculated (a point probability vs. an area under a curve).
2.1 Discrete Variable
A Discrete Variable is defined as a variable that can take on a finite or countably infinite number of values. Its probability is described by a Probability Mass Function (PMF).
Formulas (Probability Mass Function - PMF): The probability of the variable \(X\) taking on a specific value \(x\) is: \[P(X = x)\] The sum of all possible probabilities must equal one: \[\sum_{\text{all } x} P(X = x) = 1\]
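As a quick numerical check of these two PMF properties, here is a minimal Python sketch (the fair-die example is illustrative, not from the assignment):

```python
# Minimal sketch: PMF of a fair six-sided die (illustrative example).
from fractions import Fraction

# P(X = x) for each countable outcome x
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

print(pmf[3])             # P(X = 3) = 1/6
print(sum(pmf.values()))  # sum over all x equals 1
```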
2.2 Continuous Variable
A Continuous Variable is defined as a variable that can take on any numerical value within a given range, resulting in an infinite and uncountable set of possibilities. Its probability is described by a Probability Density Function (PDF).
Formulas (Probability Density Function - PDF): The probability of \(X\) falling within an interval between \(a\) and \(b\) is calculated by finding the area under the PDF curve \(f(x)\): \[P(a \le X \le b) = \int_a^b f(x) dx\] The probability of the variable \(X\) taking on any single exact value is zero: \[P(X = x) = 0\]
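To see these PDF properties numerically, the sketch below (assuming NumPy and SciPy are available; the standard normal is just an illustrative choice of \(f(x)\)) integrates the density over an interval, over the whole real line (the normalization condition discussed next), and over a single point:

```python
# Minimal sketch: probability as area under a PDF (standard normal as f(x)).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)            # P(a <= X <= b) by integration
print(area)                               # ~0.6827
print(norm.cdf(b) - norm.cdf(a))          # same area via the CDF

total, _ = quad(norm.pdf, -np.inf, np.inf)
print(total)                              # 1.0: normalization condition

point, _ = quad(norm.pdf, 0.5, 0.5)       # interval of width zero
print(point)                              # 0.0: P(X = x) = 0
```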
2.3 Comparison: Probability Distribution Representation
The primary difference in distribution representation is visual and conceptual:
- Discrete RVs are displayed using Bar Charts with gaps, emphasizing the countable nature.
- Continuous RVs are displayed using a Histogram or a smooth Density Curve. The lack of gaps reflects the continuity. For a PDF, the total area under the entire curve must equal one.
Formula (Normalization Condition for PDF): The total probability over the entire range of \(X\) must be unity: \[\int_{-\infty}^{\infty} f(x) dx = 1\]
2.4 Comparison: Probability Formula Calculation
This comparison highlights the core mathematical distinction:
- For Discrete RVs, probability can be evaluated at a specific point (\(P(X=x)\)).
- For Continuous RVs, the probability at any single point is zero (\(P(X=x) = 0\)), so probability must be computed by integrating over an interval.
Formula (Core Calculation Distinction): For Continuous Variables, probability is always an area: \[P(a \le X \le b) = \int_a^b f(x) dx\] For Discrete Variables, probability is a summation of points: \[P(a \le X \le b) = \sum_{x=a}^b P(X=x)\]
3 Sampling Distribution
The core of this video is to differentiate between a data distribution derived from a single sample and a distribution derived from the statistic (mean) of many samples.
3.1 Population Distribution
This is the distribution of the individual values for every single member of the entire group we are studying (the target of the research). It describes the underlying reality we are trying to understand.
The population distribution describes the full set of values for every individual in the population.
Its parameters:
Population Mean \[ \mu = \frac{\sum X}{N} \]
Population Standard Deviation \[ \sigma = \sqrt{\frac{\sum (X-\mu)^2}{N}} \]
3.2 Sample Distribution (Single Sample)
This is the distribution of the individual values within one single, specific set of data that was randomly collected from the population. This distribution reflects the immediate data collected but has high variability.
A sample distribution is created from one sample taken from the population.
Its key statistics:
Sample Mean \[ \bar{x} = \frac{\sum x_i}{n} \]
Sample Standard Deviation \[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \]
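A short simulation can make the \(N\) vs. \(n-1\) distinction concrete. This sketch (synthetic data; NumPy assumed) computes the population parameters with denominator \(N\) (`ddof=0`) and the single-sample statistics with denominator \(n-1\) (`ddof=1`):

```python
# Minimal sketch: population parameters vs. one sample's statistics.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)  # the entire group

mu = population.mean()              # population mean  (divides by N)
sigma = population.std(ddof=0)      # population SD    (divides by N)

sample = rng.choice(population, size=30, replace=False)  # one random sample
x_bar = sample.mean()               # sample mean      (divides by n)
s = sample.std(ddof=1)              # sample SD        (divides by n - 1)

print(mu, sigma)
print(x_bar, s)   # close to, but not equal to, mu and sigma
```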
3.3 Sampling Distribution (of the Sample Mean)
This is the distribution formed by taking a statistic calculated from many repeated samples drawn from the same population. It acts as the crucial bridge for making inferences, allowing us to estimate the population mean.
A sampling distribution is formed by taking many samples and computing the sample mean for each.
Key Theoretical Results:
Mean of the Sampling Distribution \[ \mu_{\bar{x}} = \mu \]
Standard Error of the Mean \[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
This distribution becomes more normal as sample size increases (Central Limit Theorem).
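Both theoretical results can be verified by simulation. In this sketch (synthetic population; parameter choices are illustrative), many samples of size \(n\) are drawn and the mean of each is recorded:

```python
# Minimal sketch: simulating the sampling distribution of the mean.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50, scale=10, size=100_000)
n = 25

# Draw many samples; keep each sample's mean.
means = np.array([rng.choice(population, size=n).mean() for _ in range(5_000)])

print(means.mean())                         # ~ mu          (mu_xbar = mu)
print(means.std(ddof=0))                    # ~ sigma / sqrt(n)
print(population.std(ddof=0) / np.sqrt(n))  # theoretical standard error
```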
4 Central Limit Theorem (CLT)
The video explains the Central Limit Theorem (CLT) as one of the most important principles in inferential statistics. The CLT states that the Sampling Distribution of the Sample Mean (\(\bar{x}\)) will approach a Normal Distribution as the sample size becomes sufficiently large. This result is fundamental because it allows researchers to make probability-based conclusions about a population, even when the original population distribution is not normal.
The theorem ensures that sample means behave predictably, enabling the use of Z-scores, confidence intervals, and hypothesis testing.
4.1 Definition and Parameters of the CLT
The central rule of the CLT states that when the sample size \(n\) is sufficiently large, the Sampling Distribution of the Sample Mean becomes Normal, regardless of the shape of the original population distribution. This universality makes the CLT extremely powerful for practical statistical work.
Mean of the Sampling Distribution \[\mu_{\bar{x}} = \mu \quad \text{(Same as the Population Mean)}\]
Standard Error of the Mean \[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \quad \text{(Standard Deviation of the Sampling Distribution)}\]
Variance of the Sampling Distribution \[\text{Var}(\bar{X}) = \frac{\sigma^2}{n}\]
Distributional Convergence \[\bar{X} \xrightarrow[]{d} N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty\]
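This convergence can be watched directly by starting from a clearly non-normal population. In the sketch below (an exponential population is an illustrative choice; SciPy's `skew` is used only to quantify asymmetry), the distribution of sample means is far more symmetric than the population itself:

```python
# Minimal sketch: CLT with a skewed (exponential) population.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
n = 50

# 10,000 samples of size n, reduced to their sample means.
draws = rng.exponential(scale=1.0, size=(10_000, n))
means = draws.mean(axis=1)

print(skew(draws.ravel()))  # ~2: the population is strongly right-skewed
print(skew(means))          # much closer to 0: means are near-normal
print(means.mean())         # ~1: the population mean
print(means.std(ddof=0))    # ~1/sqrt(50): sigma/sqrt(n)
```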
4.2 Probability Mechanism Behind the CLT
The CLT works because repeated sampling causes extreme values in the data to cancel each other out. Large observations are offset by small ones, which forces the sample means to cluster around the true population mean (\(\mu\)). As the number of observations increases, this balancing effect intensifies, producing the classic bell-shaped distribution.
There is no single formula behind this conceptual mechanism, but the reasoning comes from aggregation:
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \]
As \(n \to \infty\):
- Extreme values lose influence
- Variance shrinks at the rate of \(1/n\)
- The distribution stabilizes around the true mean
This idea aligns with the Law of Large Numbers (LLN):
\[ \bar{X} \xrightarrow[]{p} \mu \]
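The LLN convergence is easy to sketch numerically with a running mean (the exponential population and seed are illustrative):

```python
# Minimal sketch: Law of Large Numbers - the running mean approaches mu.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=100_000)   # true mean mu = 1
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 1_000, 100_000):
    print(n, running_mean[n - 1])              # drifts toward 1.0
```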
4.3 Applications, Rules of Thumb, and Limitations
The CLT is applied in constructing confidence intervals, performing hypothesis tests, and standardizing statistics using Z-scores. A common guideline is the Rule of 30, which states that the sampling distribution becomes approximately normal when:
\[n \ge 30\]
If the population distribution is already Normal, then the sampling distribution is Normal for any sample size. Nevertheless, large samples are always recommended because they reduce the Standard Error and improve precision.
Z-score for Sample Mean \[Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\]
Z-score for Sample Proportion \[Z = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}}\] where \(q = 1 - p\).
Standard Error (Precision Measure) \[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\]
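Here is a minimal worked example of the Z-score for a sample mean (all numbers — \(\mu\), \(\sigma\), \(n\), \(\bar{x}\) — are made up for illustration; SciPy supplies the Normal CDF):

```python
# Minimal sketch: standardizing a sample mean (illustrative numbers).
import math
from scipy.stats import norm

mu, sigma, n = 100.0, 15.0, 36
x_bar = 104.0

se = sigma / math.sqrt(n)   # standard error = 2.5
z = (x_bar - mu) / se       # z = 1.6
print(z, 1 - norm.cdf(z))   # P(Xbar >= 104) ~ 0.0548
```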
5 Sample Proportion
This video introduces the concept of Sample Proportion and how the proportion from a sample relates to the true Population Proportion. It also explains how repeated sampling leads to the formation of a Sampling Distribution of the Sample Proportion, which is crucial for inferential statistics and for applying the Central Limit Theorem for Proportions.
5.1 Proportion
Proportion in statistics is defined as the fraction of favorable outcomes relative to the total number of observations.
A proportion always takes a value between 0 and 1, representing the relative frequency of a characteristic.
5.2 Sample Proportion (\(\hat{p}\)) vs. Population Proportion (\(p\))
Population Proportion (\(p\))
This is the parameter that describes the proportion in the entire population.
\[ p = \frac{\text{Number of Favorable Outcomes}}{N} \]
5.3 Sample Proportion (\(\hat{p}\))
This is a statistic based on one random sample.
\[ \hat{p} = \frac{\text{Number of Favorable Outcomes}}{n} \]
The value of \(\hat{p}\) varies from sample to sample, but it serves as an estimator of the true population proportion \(p\).
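This sample-to-sample variability is easy to simulate (the true \(p\) and the sample size \(n\) are illustrative; NumPy assumed):

```python
# Minimal sketch: p-hat varies across samples but estimates p.
import numpy as np

rng = np.random.default_rng(4)
p, n = 0.4, 50                  # true proportion, sample size

for _ in range(3):
    sample = rng.random(n) < p  # one random sample of success/failure
    print(sample.mean())        # p-hat: near, but not exactly, 0.4
```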
5.4 Sampling Distribution of the Sample Proportion
If we take repeated random samples of the same size and compute \(\hat{p}\) for each sample, the collection of all those \(\hat{p}\) values forms the Sampling Distribution of \(\hat{p}\).
Its parameters are:
Mean of the Sampling Distribution \[ \mu_{\hat{p}} = p \]
5.5 Standard Error of the Sampling Distribution
\[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]
As the sample size increases, the spread (standard error) becomes smaller, meaning \(\hat{p}\) becomes more accurate.
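The shrinking spread follows directly from the formula, as this small sketch shows (the values of \(p\) and \(n\) are illustrative):

```python
# Minimal sketch: the standard error of p-hat shrinks as n grows.
import math

p = 0.4
for n in (25, 100, 400, 1_600):
    se = math.sqrt(p * (1 - p) / n)
    print(n, round(se, 4))   # quadrupling n halves the standard error
```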
5.6 Central Limit Theorem (CLT) for Proportions
The CLT states that the Sampling Distribution of \(\hat{p}\) becomes approximately Normal if the following conditions are satisfied:
\[ n \cdot p \ge 10 \] \[ n \cdot (1-p) \ge 10 \]
When these conditions are met, we may use the Z-distribution for inference.
Z-Score for Sample Proportion
\[ Z = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}} \] where \(q = 1 - p\).
6 Review of Sampling Distribution
This video presents three different methods for solving the same probability problem, arranged from the simplest to the most advanced, depending on the sample size (n). Each method becomes more efficient as n increases, addressing the impracticality of earlier methods for larger sample sizes.
6.1 Basic Probability and the Sample Space Method
This method is used when the number of trials (n) is extremely small.
Scenario 1: Drawing 3 Marbles
Problem: Find the probability of drawing at least two green marbles out of 3 trials (n = 3).
Steps:
1. Determine the probabilities of success (\(P\)) and failure (\(Q\)). Example: \(P = 0.4\), \(Q = 0.6\)
2. List all possible favorable outcomes (Sample Space). Examples: GGB, GBG, BGG, GGG
3. Compute the probability of each sequence by multiplying the individual probabilities. Example: \(P(\text{GGB}) = 0.4 \times 0.4 \times 0.6 = 0.096\)
4. Add all probabilities that meet the requirement: \(P(\text{at least 2 Green}) = P(k=2) + P(k=3)\)
Conclusion Scenario 1:
This method becomes time-consuming and impractical as n increases.
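For such a tiny n, the enumeration can be checked by brute force. This sketch (using the video's \(P = 0.4\), \(Q = 0.6\)) walks the full sample space:

```python
# Minimal sketch: brute-force sample space for Scenario 1 (n = 3).
from itertools import product

P, Q = 0.4, 0.6                      # P(green), P(not green)

total = 0.0
for seq in product("GB", repeat=3):  # all 2**3 = 8 sequences
    if seq.count("G") >= 2:          # "at least two green"
        prob = 1.0
        for c in seq:
            prob *= P if c == "G" else Q
        total += prob

print(total)                         # 3 * 0.096 + 0.064 = 0.352
```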
6.2 Binomial Distribution (The Binomial Formula Method)
This method is used when the number of trials (n) is moderately small, making the Sample Space approach too complex.
Scenario 2: Drawing 5 Marbles
Problem: Find the probability of drawing at least two green marbles out of 5 trials (n = 5).
Function: The Binomial Formula calculates the probability of obtaining an exact number of successes (\(k\)) out of \(n\) trials.
Application:
Since the problem asks for at least two greens:
\[ P(\text{at least 2}) = P(k=2) + P(k=3) + P(k=4) + P(k=5) \]
Formula Used:
\[ P(k) = \binom{n}{k} p^k q^{n-k} \]
Conclusion Scenario 2:
The binomial method works well for n = 5, but becomes too lengthy and unrealistic as n increases further.
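The same answer can be reached with SciPy's binomial distribution (assuming the same \(p = 0.4\) as in Scenario 1):

```python
# Minimal sketch: Scenario 2 via the binomial formula, n = 5.
from scipy.stats import binom

n, p = 5, 0.4
# P(at least 2) = P(k=2) + ... + P(k=5)
prob = sum(binom.pmf(k, n, p) for k in range(2, n + 1))
print(prob)                    # 0.66304
print(1 - binom.cdf(1, n, p))  # same value via the complement rule
```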
6.3 Sampling Distribution of the Sample Proportion (Using CLT)
This method is used when the number of trials (n) is very large (e.g., \(n = 100\)), making both previous methods infeasible. It provides an approximate probability using the Central Limit Theorem (CLT).
Scenario 3: Drawing 100 Marbles
Problem: Find the probability of drawing at least 35 green marbles out of 100 trials (n = 100).
Concept Link: The video connects this problem to the Sampling Distribution of the Sample Proportion, discussed in previous lessons.
Steps
1. Check the CLT Conditions
The sampling distribution of \(\hat{p}\) is approximately Normal if:
\[ n \cdot p \ge 10 \] \[ n \cdot (1 - p) \ge 10 \]
Example in the video:
- \(100 \times 0.4 = 40\)
- \(100 \times 0.6 = 60\)
Both ≥ 10 → CLT applies.
2. Compute the Standard Error
\[ \sigma_{\hat{p}} = \sqrt{\frac{p \cdot q}{n}} \]
3. Standardize the Sample Proportion (Z-score)
Sample proportion:
\[ \hat{p} = \frac{35}{100} = 0.35 \]
Z-score:
\[ Z = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}} \]
4. Find the Area
Use the Z-value to find the corresponding area under the Normal Curve → this area represents the approximate probability.
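Steps 1–4 translate directly into code. The sketch below uses the video's numbers (\(p = 0.4\), \(n = 100\), \(\hat{p} = 0.35\)) and, since n is still small enough to enumerate, also compares the CLT approximation with the exact binomial answer:

```python
# Minimal sketch: Scenario 3, steps 1-4 (p = 0.4, n = 100).
import math
from scipy.stats import binom, norm

p, q, n = 0.4, 0.6, 100
p_hat = 35 / 100

# Step 1: CLT conditions (n*p = 40, n*q = 60, both >= 10)
assert n * p >= 10 and n * q >= 10

# Step 2: standard error
se = math.sqrt(p * q / n)       # ~0.049

# Step 3: Z-score
z = (p_hat - p) / se            # ~ -1.02

# Step 4: area under the Normal curve for P(p_hat >= 0.35)
print(1 - norm.cdf(z))          # ~0.846

# Exact binomial check: P(X >= 35) = 1 - P(X <= 34)
print(1 - binom.cdf(34, n, p))  # ~0.87
```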
Conclusion Scenario 3
This method provides approximate, but highly accurate probabilities. It is the only practical method for large sample sizes in introductory statistics.