
Fityanandra Athar Adyaksa (52250059)


Data Science student at

Enthusiastic about learning

December 07, 2025




Introduction


A probability distribution is a mathematical function that describes how the values of a random variable are spread or distributed. It provides a complete picture of all possible outcomes of an experiment and assigns a probability to each outcome (for discrete variables) or to intervals of outcomes (for continuous variables). In essence, a probability distribution tells us how likely each value or range of values is. For discrete random variables, it is typically represented by a probability mass function (PMF), while continuous random variables use a probability density function (PDF). Regardless of the type, the total probability across all outcomes must equal 1. Probability distributions are essential in statistics because they allow us to model uncertainty, make predictions, and perform inference about real-world phenomena based on data.

Imagine you are making soup. The recipe calls for salt to taste. While the exact amount is uncertain, you have an estimate based on experience—typically between half and one teaspoon. Probability distributions work in a similar way, but instead of relying on intuition, we use mathematics to describe the range of possible values and how likely each of those values is to occur.

In this material, we will explore key concepts ranging from continuous random variables, sampling distributions, the central limit theorem, to sample proportions. Each concept will be explained using a step-by-step approach: first with a formal definition, followed by a real-life analogy, complete with visualizations and computational examples.



Continuous Random Variables



Continuous random variables differ fundamentally from discrete ones. While discrete variables count occurrences (how many?), continuous variables measure quantities (how much?). This section explores the unique characteristics of continuous random variables and how we work with them.


1.1 What Makes Variables Continuous?

Continuous random variables can take any value within an interval. Think of measuring time, weight, or temperature—these can be infinitely precise. For example, when timing a race, you could record 10.5 seconds, 10.52 seconds, or 10.523 seconds.

Key Difference:

  • Discrete: “How many students passed?” (0, 1, 2, 3, …) — Countable
  • Continuous: “What percentage scored above 80%?” (0.751, 0.752, 0.753, …) — Measurable


1.2 Probability Density Functions (PDF)

Since continuous variables have infinite possible values, we use Probability Density Functions (PDF) instead of probability mass functions. The PDF shows relative likelihood, not direct probability.

Three Essential PDF Properties:

  1. \(f(x) \geq 0\) for all x (never negative)
  2. Total area under the curve = 1: \(\int_{-\infty}^{\infty} f(x)dx = 1\)
  3. Probability = area: \(P(a \leq X \leq b) = \int_{a}^{b} f(x)dx\)

Important Insight: The PDF value at a point isn’t a probability—it’s a density. Only areas under the curve represent probabilities.
# Visualizing PDF as Area Under Curve
x <- seq(-3, 3, length.out = 1000)
y <- dnorm(x)

plot(x, y, type = "l", lwd = 2, col = "blue",
     main = "Probability = Area Under PDF Curve",
     xlab = "X", ylab = "Density f(x)")

# Shade area for probability between -1 and 1
polygon(c(-1, x[x >= -1 & x <= 1], 1), 
        c(0, y[x >= -1 & x <= 1], 0), 
        col = rgb(0, 0, 1, 0.3))
text(0, 0.1, "Area = Probability", col = "darkblue")
text(0, 0.05, "P(-1 ≤ X ≤ 1) ≈ 0.6827", col = "red")


1.3 The Zero Probability Paradox

One of the most surprising aspects of continuous variables: \(P(X = a) = 0\) for any specific value \(a\). This doesn’t mean the value is impossible—it means the probability of exactly that value is infinitesimally small.

Practical Approach: We always work with intervals, not exact values. Instead of asking “What’s the probability of exactly 170 cm?” we ask “What’s the probability between 169.5 cm and 170.5 cm?”
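A quick R sketch makes this concrete (an illustrative example only, assuming heights follow a normal distribution with mean 170 cm and SD 7 cm; these numbers are not from a real dataset):
# Point probability vs. interval probability for a continuous variable
mu <- 170
sigma <- 7

# "Probability" of exactly 170 cm: a zero-width interval has zero area
pnorm(170, mu, sigma) - pnorm(170, mu, sigma)        # 0

# Probability of falling between 169.5 cm and 170.5 cm: a genuine area
pnorm(170.5, mu, sigma) - pnorm(169.5, mu, sigma)    # about 0.057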


1.4 Common Continuous Distributions

Normal Distribution (Bell Curve)

The most famous continuous distribution, appearing naturally in many phenomena:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

Parameters:

  • μ (mu) = mean (center)
  • σ (sigma) = standard deviation (spread)

Uniform Distribution

Equal probability density across an interval:

\[ f(x) = \frac{1}{b-a} \quad \text{for } a \leq x \leq b \]


Exponential Distribution

Models time between events:

\[ f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0 \]


# Comparing Three Continuous Distributions
par(mfrow = c(1, 3))

# Normal Distribution
curve(dnorm(x), -4, 4, main = "Normal Distribution",
      xlab = "X", ylab = "Density", col = "blue", lwd = 2)

# Uniform Distribution
curve(dunif(x, 0, 1), -0.5, 1.5, main = "Uniform Distribution",
      xlab = "X", ylab = "Density", col = "red", lwd = 2)

# Exponential Distribution
curve(dexp(x, rate = 1), 0, 5, main = "Exponential Distribution",
      xlab = "X", ylab = "Density", col = "green", lwd = 2)


1.5 Practical Example: Exam Scores

Suppose final exam scores follow a normal distribution with mean 75 and standard deviation 10.

Questions:

  1. What percentage of students scored below 60?
  2. What’s the probability of scoring between 70 and 80?
  3. What score is needed to be in the top 10%?

Calculations:

  1. \(P(X < 60) = P(Z < \frac{60-75}{10}) = P(Z < -1.5) ≈ 0.0668\) (6.68%)
  2. \(P(70 < X < 80) = P(\frac{70-75}{10} < Z < \frac{80-75}{10}) = P(-0.5 < Z < 0.5) ≈ 0.3829\) (38.29%)
  3. Top 10% cutoff: \(75 + 1.28 \times 10 ≈ 87.8\)
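
The same answers can be obtained directly with base R's normal-distribution functions (a small verification sketch using the mean and SD given above):
# Verifying the exam-score calculations with base R
mu <- 75
sigma <- 10

pnorm(60, mu, sigma)                         # P(X < 60)       ≈ 0.0668
pnorm(80, mu, sigma) - pnorm(70, mu, sigma)  # P(70 < X < 80)  ≈ 0.3829
qnorm(0.90, mu, sigma)                       # top-10% cutoff  ≈ 87.8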


1.6 Key Takeaways

  1. Continuous variables measure quantities with infinite precision
  2. PDFs describe probability density, not direct probability
  3. Probability is calculated as area under the PDF curve
  4. Normal distribution is fundamental in statistics
  5. Always work with intervals, not exact values




Sampling Distribution



2.1 The Foundation

A sampling distribution shows what happens when we take many random samples from the same population and calculate the same statistic (like the mean) from each sample.

Core Concept: It’s the distribution of a statistic across all possible samples of the same size.

Video Analogy: If you repeatedly take samples of 50 people and calculate each sample’s average height, the collection of those averages forms the sampling distribution.


2.2 The Sampling Distribution of the Sample Mean

For a population with mean μ and standard deviation σ, when we take samples of size n:

Center: The mean of all sample means equals the population mean:

\[ \mu_{\bar{x}} = \mu \]

Spread: The standard deviation of sample means (called Standard Error) is:

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]

Key Insight: Sample means vary less than individual observations. Larger samples give more consistent results.
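
A short simulation illustrates both facts; the population below (normal with μ = 100, σ = 15) is an arbitrary choice for demonstration:
# Simulating the sampling distribution of the mean
set.seed(42)
mu <- 100
sigma <- 15
n <- 25

sample_means <- replicate(10000, mean(rnorm(n, mu, sigma)))

mean(sample_means)   # close to the population mean (100)
sd(sample_means)     # close to sigma / sqrt(n) = 3
sigma / sqrt(n)      # theoretical standard error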


2.3 Standard Error

Standard Error (SE) quantifies how much sample statistics vary from sample to sample:

\[ SE = \frac{\sigma}{\sqrt{n}} \]

What it tells us:

  • SE decreases as sample size increases
  • Cutting SE in half requires quadrupling sample size
  • Smaller SE means more precise estimates
# Demonstrating Standard Error vs Sample Size
set.seed(123)
population_sd <- 15
sample_sizes <- c(5, 20, 50, 100)

standard_errors <- population_sd / sqrt(sample_sizes)

results <- data.frame(
  Sample_Size = sample_sizes,
  Standard_Error = round(standard_errors, 2),
  Relative_Precision = round(1/standard_errors, 2)
)

knitr::kable(results, 
             caption = "How Sample Size Affects Standard Error",
             col.names = c("Sample Size (n)", "Standard Error", "Relative Precision"))
How Sample Size Affects Standard Error

Sample Size (n) | Standard Error | Relative Precision
5               | 6.71           | 0.15
20              | 3.35           | 0.30
50              | 2.12           | 0.47
100             | 1.50           | 0.67


2.4 Shape of the Sampling Distribution

Normal Populations: If the population is normal, the sampling distribution is normal for any sample size.

Non-Normal Populations: For large samples (typically n ≥ 30), the sampling distribution becomes approximately normal (Central Limit Theorem).

Visual Proof:
# Shape Changes with Sample Size
set.seed(123)
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))

# Skewed population
skewed_pop <- rexp(10000, rate = 0.5)

for(n in c(5, 15, 30, 50)) {
  sample_means <- replicate(1000, mean(sample(skewed_pop, n)))
  
  hist(sample_means, main = paste("n =", n),
       xlab = "", ylab = "",
       col = "lightblue", breaks = 30,
       probability = TRUE)
  
  # Add normal curve
  x_norm <- seq(min(sample_means), max(sample_means), length = 100)
  y_norm <- dnorm(x_norm, mean = mean(sample_means), sd = sd(sample_means))
  lines(x_norm, y_norm, col = "red", lwd = 2)
}


2.5 Practical Application: Factory Quality Control

Scenario: A factory produces components with length μ = 50 mm, σ = 2 mm. Quality control samples 36 components each hour.

Question: What’s the probability that a sample mean is less than 49.5 mm?

Step-by-Step Solution:
  1. Calculate Standard Error: \[ SE = \frac{2}{\sqrt{36}} = \frac{2}{6} = 0.333 \]
  2. Calculate Z-score: \[ Z = \frac{49.5 - 50}{0.333} = -1.5 \]
  3. Find probability: \[ P(\bar{X} < 49.5) = P(Z < -1.5) = 0.0668 \]

Interpretation: There’s a 6.68% chance of observing such a low sample mean if the process is working correctly.
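
The same probability comes out of a one-line check in R, using the values from this scenario:
# P(sample mean < 49.5) when mu = 50, sigma = 2, n = 36
pnorm(49.5, mean = 50, sd = 2 / sqrt(36))   # ≈ 0.0668

# Equivalently, via the Z-score
pnorm(-1.5)                                 # ≈ 0.0668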


2.6 Why This Matters in Real Research

  1. Confidence Intervals:
    Sample mean ± Margin of Error (where Margin of Error uses SE)

  2. Hypothesis Testing:
    Determines if sample results are unusual under the null hypothesis

  3. Sample Size Planning:
    Helps decide how many observations to collect

Key Formula for Confidence Interval:

\[ \bar{x} \pm z^* \times \frac{\sigma}{\sqrt{n}} \] Where \(z^*\) depends on confidence level (1.96 for 95% confidence).
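
As a quick sketch of how this formula is used (the numbers x̄ = 52, σ = 8, n = 64 are made up for illustration):
# 95% confidence interval for a mean with known sigma (hypothetical values)
x_bar <- 52
sigma <- 8
n <- 64
z_star <- qnorm(0.975)   # ≈ 1.96

x_bar + c(-1, 1) * z_star * sigma / sqrt(n)   # ≈ (50.04, 53.96)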


2.7 Common Pitfalls to Avoid

  1. Confusing SD with SE:
    • SD: Variation in individual data points
    • SE: Variation in sample statistics
  2. Ignoring Sample Size Requirements:
    • For means: n ≥ 30 for CLT to apply
    • For proportions: np ≥ 10 and n(1-p) ≥ 10
  3. Forgetting Random Sampling Assumption:
    • Results only valid if samples are random


2.8 Essential Takeaways

  1. Sampling distribution describes how sample statistics vary
  2. Standard Error = σ/√n measures this variation
  3. Distribution shape becomes normal for large samples
  4. Applications include confidence intervals and hypothesis tests

Remember: The sampling distribution connects what we see in a sample to what exists in the population—it’s the foundation of statistical inference.




Central Limit Theorem



The Central Limit Theorem (CLT) states that regardless of the population’s distribution, the sampling distribution of the sample mean will approach a normal distribution as the sample size increases.

In Simple Terms: Take any population—skewed, uniform, exponential—take large enough samples, calculate their means, and those means will form a bell curve.

Formal Statement: For a population with mean μ and standard deviation σ, when we take random samples of size n (with n sufficiently large):

\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]


Why Is the CLT Revolutionary?

Before CLT: We assumed populations were normal to do statistical inference.

After CLT: We can work with any population shape if we have large enough samples.

The Magic Number: n ≥ 30 is often considered “large enough” for CLT to kick in, though very skewed distributions may need larger n.


3.1 Mathematical Foundation

The standardized sample mean follows a standard normal distribution for large n:

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1) \]

What Changes with Sample Size:

  • n = 1: Sampling distribution = Population distribution
  • n = 30: Approximately normal for most populations
  • n = 100: Very close to normal


3.2 Practical Example: Factory Production

Scenario: A factory produces screws with lengths following an exponential distribution (mean = 50mm, SD = 50mm). This distribution is highly skewed—most screws are short, but some are very long.

Problem: What’s the probability that a sample of 40 screws has average length > 60mm?

Without CLT: Complex calculation with exponential distribution.

With CLT: Simple normal approximation.
Solution:

\[ SE = \frac{50}{\sqrt{40}} = 7.906 \]

\[ Z = \frac{60 - 50}{7.906} = 1.265 \]

\[ P(\bar{X} > 60) = P(Z > 1.265) = 0.103 \quad (10.3\%) \]
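
This can be sanity-checked in R: the CLT approximation below uses only the mean and SD, while the simulation draws from the exponential population itself (a verification sketch, not part of the original example):
# CLT approximation vs. simulation for the screw-length example
set.seed(123)
mu <- 50
sigma <- 50
n <- 40

# Normal approximation from the CLT
1 - pnorm(60, mean = mu, sd = sigma / sqrt(n))       # ≈ 0.103

# Simulation from the exponential population (mean 50 => rate 1/50)
sim_means <- replicate(100000, mean(rexp(n, rate = 1/50)))
mean(sim_means > 60)   # close to the CLT value; slightly larger, since some skew remains at n = 40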


3.3 Sample Size Guidelines

# Sample Size Requirements for Different Distributions
library(knitr)

guidelines <- data.frame(
  Distribution_Type = c("Moderately Skewed", "Highly Skewed", "Extremely Skewed", "Proportions (p ≈ 0.5)", "Proportions (p ≈ 0.1)"),
  Minimum_n = c("30", "50", "100+", "30", "100"),
  Reason = c("CLT works well", "More observations needed", "Very slow convergence", "np and n(1-p) both > 15", "Need np > 10")
)

kable(guidelines, 
      caption = "Sample Size Guidelines for CLT Approximation",
      col.names = c("Distribution Type", "Minimum Sample Size", "Reason"))
Sample Size Guidelines for CLT Approximation

Distribution Type     | Minimum Sample Size | Reason
Moderately Skewed     | 30                  | CLT works well
Highly Skewed         | 50                  | More observations needed
Extremely Skewed      | 100+                | Very slow convergence
Proportions (p ≈ 0.5) | 30                  | np and n(1-p) both > 15
Proportions (p ≈ 0.1) | 100                 | Need np > 10


3.4 The De Moivre-Laplace Theorem

For proportions, CLT takes a special form called the De Moivre-Laplace Theorem:

\[ \hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \quad \text{for large n} \] with conditions: np ≥ 10 and n(1-p) ≥ 10.


3.5 Key Takeaways

  1. Universal Applicability: CLT works for any population distribution
  2. Sample Means Become Normal: For large enough samples
  3. Practical Threshold: n ≥ 30 often sufficient
  4. Foundation for Inference: Enables t-tests, confidence intervals, regression
  5. Not About Population: Population stays the same; sample means become normal

Final Insight: CLT is why the normal distribution is everywhere in statistics. It’s not that the world is normally distributed—it’s that averages of random variables tend to be normal, regardless of what we start with.




Sample Proportion



4.1 From Counts to Proportions

When we deal with categorical data (yes/no, success/failure), we use sample proportions instead of sample means. The sample proportion (p̂) measures the fraction of successes in a sample.

Formula:

\[ \hat{p} = \frac{X}{n} \]

where:

  • X = number of successes in the sample
  • n = sample size

Example Applications:

  • Political polls: proportion supporting a candidate
  • Quality control: proportion of defective items
  • Medicine: proportion of patients responding to treatment


4.2 Sampling Distribution of Sample Proportion

When we take many samples and calculate p̂ for each, these proportions form a sampling distribution with:

Center: The mean of all sample proportions equals the population proportion: \[ \mu_{\hat{p}} = p \]

Spread: The standard deviation (Standard Error) is: \[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]

Important: This formula uses p (population proportion), which we often don’t know. In practice, we use p̂ as an estimate.
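
A quick simulation checks both the center and the spread (p = 0.3 and n = 50 here are arbitrary illustrative values):
# Center and spread of the sampling distribution of p-hat
set.seed(1)
p <- 0.3
n <- 50

p_hats <- rbinom(10000, size = n, prob = p) / n

mean(p_hats)              # close to p = 0.3
sd(p_hats)                # close to the theoretical standard error
sqrt(p * (1 - p) / n)     # sqrt(p(1-p)/n) ≈ 0.0648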


4.3 Conditions for Normality

For the sampling distribution to be approximately normal, we need:

  1. Random Sampling: Samples must be random
  2. Independence: n ≤ 10% of population if sampling without replacement
  3. Success-Failure Condition:
    • np ≥ 10
    • n(1-p) ≥ 10

Rule of Thumb: At least 10 successes and 10 failures in the sample.

# Visualizing Conditions for Normality
set.seed(123)
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))

# Different scenarios
scenarios <- list(
  c(p = 0.5, n = 20, label = "Good: p=0.5, n=20"),
  c(p = 0.1, n = 30, label = "Poor: p=0.1, n=30"),
  c(p = 0.5, n = 50, label = "Good: p=0.5, n=50"),
  c(p = 0.1, n = 100, label = "Good: p=0.1, n=100")
)

for(scen in scenarios) {
  p <- as.numeric(scen["p"])
  n <- as.numeric(scen["n"])
  
  # Generate sampling distribution
  sample_props <- rbinom(10000, n, p) / n
  
  hist(sample_props, main = scen["label"],
       xlab = "", ylab = "",
       col = ifelse(p*n >= 10 & n*(1-p) >= 10, "lightgreen", "lightcoral"),
       breaks = 30, probability = TRUE)
  
  # Add normal curve if conditions met
  if(p*n >= 10 & n*(1-p) >= 10) {
    x_norm <- seq(min(sample_props), max(sample_props), length = 100)
    y_norm <- dnorm(x_norm, mean = p, sd = sqrt(p*(1-p)/n))
    lines(x_norm, y_norm, col = "blue", lwd = 2)
  }
  
  # Add success-failure counts
  text(mean(sample_props), max(hist(sample_props, plot=FALSE)$density)*0.8,
       paste("Successes:", round(p*n,1)), cex = 0.7)
  text(mean(sample_props), max(hist(sample_props, plot=FALSE)$density)*0.6,
       paste("Failures:", round(n*(1-p),1)), cex = 0.7)
}


4.4 The Normal Approximation

When conditions are met: \[ \hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \]

Standardized Form:

\[ Z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \sim N(0,1) \]

Important: This approximation works well when p is not too close to 0 or 1, and sample size is large enough.


Practical Example: Political Poll

Scenario: A candidate claims 60% support (p = 0.60). A poll of 500 voters shows 280 support (p̂ = 0.56).

Question: What’s the probability of getting 56% or less support if the true support is 60%?

Step-by-Step Solution:

  1. Check Conditions:
    • Random sample: assumed
    • Independence: 500 < 10% of voters
    • Success-Failure:
      • np = 500 × 0.60 = 300 ≥ 10
      • n(1-p) = 500 × 0.40 = 200 ≥ 10 ✓ Conditions met
  2. Calculate Standard Error:

    \[ SE = \sqrt{\frac{0.60 \times 0.40}{500}} = \sqrt{0.00048} = 0.0219 \]

  3. Calculate Z-score:

    \[ Z = \frac{0.56 - 0.60}{0.0219} = -1.826 \]

  4. Find Probability:

    \[ P(\hat{p} \leq 0.56) = P(Z \leq -1.826) = 0.034 \]

Interpretation: There’s only a 3.4% chance of getting 56% or less support if true support is 60%. This suggests the candidate’s claim might be too high.
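
The calculation can be reproduced in R; the exact binomial probability is shown for comparison and should be close to the normal approximation (a verification sketch using the numbers above):
# Normal approximation for the poll example
p <- 0.60
n <- 500
p_hat <- 0.56
se <- sqrt(p * (1 - p) / n)

pnorm(p_hat, mean = p, sd = se)        # ≈ 0.034

# Exact binomial probability of 280 or fewer supporters out of 500
pbinom(280, size = 500, prob = 0.60)   # similar answer, slightly larger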


Margin of Error and Confidence Intervals

For a 95% confidence interval: \[ \hat{p} \pm 1.96 \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Example: If p̂ = 0.56 from n = 500: \[ SE = \sqrt{\frac{0.56 \times 0.44}{500}} = 0.0222 \]
\[ 95\%\ CI = 0.56 \pm 1.96 \times 0.0222 = (0.516, 0.604) \]

We’re 95% confident the true proportion is between 51.6% and 60.4%.
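
The interval can be computed in a couple of lines (a sketch using the poll numbers above):
# 95% confidence interval for the poll proportion
p_hat <- 0.56
n <- 500
se <- sqrt(p_hat * (1 - p_hat) / n)

p_hat + c(-1, 1) * 1.96 * se   # ≈ (0.516, 0.604)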


Common Mistakes

  1. Using Wrong Standard Error Formula:
    • Wrong: \(SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) when p is known
    • Right: Use p when known, \(\hat{p}\) when estimating
  2. Ignoring Conditions:
    • Applying normal approximation when np < 10 or n(1-p) < 10
  3. Confusing Population and Sample:
    • p = population proportion (usually unknown)
    • p̂ = sample proportion (calculated from data)


4.5 Sample Size Determination

To achieve a desired margin of error (ME): \[ n = \left(\frac{z^*}{ME}\right)^2 p(1-p) \]

Conservative Approach: Use p = 0.5 (maximizes required sample size): \[ n = \left(\frac{1.96}{ME}\right)^2 \times 0.25 \]

# Sample Size Calculator for Proportions
calculate_sample_size <- function(ME, p = 0.5, confidence = 0.95) {
  z <- qnorm(1 - (1-confidence)/2)
  n <- (z^2 * p * (1-p)) / (ME^2)
  return(ceiling(n))
}

# Example calculations
ME_levels <- c(0.01, 0.03, 0.05, 0.10)
sample_sizes <- sapply(ME_levels, calculate_sample_size)

results <- data.frame(
  Margin_of_Error = paste0(ME_levels*100, "%"),
  Sample_Size_Needed = sample_sizes,
  Notes = c("Very precise", "Typical poll", "Moderate precision", "Rough estimate")
)

knitr::kable(results, 
             caption = "Sample Size Requirements for Different Margins of Error (95% Confidence)",
             col.names = c("Margin of Error", "Minimum Sample Size", "Typical Use"))
Sample Size Requirements for Different Margins of Error (95% Confidence)

Margin of Error | Minimum Sample Size | Typical Use
1%              | 9604                | Very precise
3%              | 1068                | Typical poll
5%              | 385                 | Moderate precision
10%             | 97                  | Rough estimate


4.6 Key Takeaways

  1. Sample proportion p̂ = X/n estimates population proportion p
  2. Sampling distribution is approximately normal when np ≥ 10 and n(1-p) ≥ 10
  3. Standard Error = √[p(1-p)/n]
  4. Confidence intervals help estimate population proportion
  5. Sample size planning ensures desired precision

Remember: For proportions, the “success-failure condition” is crucial. Always check np ≥ 10 and n(1-p) ≥ 10 before using normal approximation methods.




Review Sampling Distribution



5.1 Connecting All the Pieces

Sampling distributions form the foundation of statistical inference—they connect what we observe in our sample to what exists in the population. This review integrates everything we’ve learned about:

  1. Continuous Random Variables (how individual measurements behave)
  2. Sampling Distribution of the Mean (how sample averages behave)
  3. Central Limit Theorem (why sample averages become normal)
  4. Sample Proportion (how percentages/success rates behave)

The Big Picture: Every time we collect data and calculate a statistic (mean, proportion), we’re seeing one possible value from a sampling distribution.

5.2 The Complete Framework

For Sample Means (Continuous Data):

\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{when n is large} \]

For Sample Proportions (Categorical Data):

\[ \hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \quad \text{when np ≥ 10 and n(1-p) ≥ 10} \]

Key Similarity: Both follow normal distributions for large enough samples, even if the original population doesn’t.


5.3 The Four-Step Process for Any Sampling Problem

Step 1: Identify the Parameter
  • Mean (μ) or proportion (p)?
  • Known or unknown population parameters?

Step 2: Check Conditions
  • Random sampling?
  • Independence?
  • Sample size requirements met?

Step 3: Calculate Standard Error
  • For means: \(SE = \frac{\sigma}{\sqrt{n}}\)
  • For proportions: \(SE = \sqrt{\frac{p(1-p)}{n}}\)

Step 4: Apply Normal Distribution
  • Use z-scores: \(Z = \frac{\text{statistic} - \text{parameter}}{SE}\)
  • Find probabilities or create intervals

# Complete Visualization of Sampling Distribution Framework
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))

# 1. Population Distribution
pop_data <- rexp(10000, rate = 0.5)
hist(pop_data, main = "1. Population Distribution",
     xlab = "", ylab = "", col = "lightblue", breaks = 30)

# 2. Single Sample
single_sample <- sample(pop_data, 30)
hist(single_sample, main = "2. Single Sample (n=30)",
     xlab = "", ylab = "", col = "lightgreen", breaks = 15)

# 3. Sampling Distribution
sample_means <- replicate(1000, mean(sample(pop_data, 30)))
hist(sample_means, main = "3. Sampling Distribution of Means",
     xlab = "", ylab = "", col = "lightcoral", breaks = 30, probability = TRUE)

# Add normal curve
x_norm <- seq(min(sample_means), max(sample_means), length = 100)
y_norm <- dnorm(x_norm, mean = mean(sample_means), sd = sd(sample_means))
lines(x_norm, y_norm, col = "blue", lwd = 2)

# 4. Standardized Distribution
z_scores <- (sample_means - mean(pop_data)) / (sd(pop_data)/sqrt(30))
hist(z_scores, main = "4. Standardized (Z) Distribution",
     xlab = "", ylab = "", col = "lightyellow", breaks = 30, probability = TRUE)

# Add standard normal curve
curve(dnorm(x), add = TRUE, col = "red", lwd = 2)


Decision Tree: Which Formula to Use?

# Decision Tree Table
library(kableExtra)

decision_tree <- data.frame(
  Question = c(
    "What type of data do you have?",
    "Is the population standard deviation known?",
    "What sample size?",
    "For proportions: Are conditions met?"
  ),
  Answer_1 = c(
    "Continuous → Use mean formulas",
    "Yes → Use z-distribution",
    "n ≥ 30 → CLT applies",
    "np ≥ 10 and n(1-p) ≥ 10 → Normal approximation OK"
  ),
  Answer_2 = c(
    "Categorical → Use proportion formulas",
    "No → Use t-distribution (future topic)",
    "n < 30 → Check population normality",
    "Conditions not met → Use exact methods"
  )
)

kable(decision_tree, caption = "Decision Tree for Sampling Distribution Problems") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(1:4, extra_css = "border-bottom: 2px solid #ddd;")
Decision Tree for Sampling Distribution Problems

Question                                    | Answer 1                                           | Answer 2
What type of data do you have?              | Continuous → Use mean formulas                     | Categorical → Use proportion formulas
Is the population standard deviation known? | Yes → Use z-distribution                           | No → Use t-distribution (future topic)
What sample size?                           | n ≥ 30 → CLT applies                               | n < 30 → Check population normality
For proportions: Are conditions met?        | np ≥ 10 and n(1-p) ≥ 10 → Normal approximation OK  | Conditions not met → Use exact methods

Common Problem Types and Solutions


Type 1: Probability Calculations

Question: “What’s the probability that our sample mean is less than X?”

Solution:

  1. Calculate SE
  2. Calculate Z-score: \(Z = \frac{\bar{x} - \mu}{SE}\)
  3. Use normal table or software to find probability


Type 2: Confidence Intervals

Question: “What range likely contains the population parameter?”

Solution: \[ \text{Statistic} \pm (z^* \times SE) \] where z* = 1.96 for 95% confidence


Type 3: Sample Size Planning

Question: “How large a sample do I need?”

Solution:

  • For means: \(n = \left(\frac{z^* \sigma}{ME}\right)^2\)
  • For proportions: \(n = \left(\frac{z^*}{ME}\right)^2 p(1-p)\)
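
The calculator in Section 4.5 already covers proportions; for means, here is a minimal sketch (σ = 12 and ME = 2 are hypothetical values chosen for illustration):
# Required sample size for estimating a mean (hypothetical sigma and ME)
sigma <- 12
ME <- 2
z_star <- qnorm(0.975)   # 95% confidence

ceiling((z_star * sigma / ME)^2)   # ≈ 139 observations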


5.4 Real-World Integration Example

Scenario: A pharmaceutical company tests a new drug. They want to know:

  1. Average reduction in blood pressure (continuous)
  2. Proportion of patients with side effects (categorical)

Data Collected:

  • 100 patients sampled
  • Mean blood pressure reduction: 8.5 mmHg (σ = 4 mmHg from previous studies)
  • 15 patients reported side effects

Analysis Part 1 (Mean): \[ SE_{\text{mean}} = \frac{4}{\sqrt{100}} = 0.4 \]
\[ 95\%\ CI\ \text{for mean}: 8.5 \pm 1.96 \times 0.4 = (7.72, 9.28)\ \text{mmHg} \]

Analysis Part 2 (Proportion): \[ \hat{p} = 15/100 = 0.15 \]
\[ SE_{\text{prop}} = \sqrt{\frac{0.15 \times 0.85}{100}} = 0.0357 \]
\[ 95\%\ CI\ \text{for proportion}: 0.15 \pm 1.96 \times 0.0357 = (0.080, 0.220) \]

# Visualizing Both Confidence Intervals
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Mean CI
mean_ci <- c(7.72, 9.28)
plot(1, 8.5, xlim = c(0.5, 1.5), ylim = c(7, 10),
     main = "95% CI for Mean Reduction",
     xlab = "", ylab = "Blood Pressure Reduction (mmHg)",
     pch = 16, col = "blue", xaxt = "n")
segments(1, mean_ci[1], 1, mean_ci[2], col = "blue", lwd = 2)
segments(0.9, mean_ci[1], 1.1, mean_ci[1], col = "blue", lwd = 2)
segments(0.9, mean_ci[2], 1.1, mean_ci[2], col = "blue", lwd = 2)
text(1, 7.5, paste("CI: (", mean_ci[1], ",", mean_ci[2], ")"), cex = 0.8)

# Proportion CI
prop_ci <- c(0.080, 0.220)
plot(1, 0.15, xlim = c(0.5, 1.5), ylim = c(0, 0.25),
     main = "95% CI for Side Effect Proportion",
     xlab = "", ylab = "Proportion",
     pch = 16, col = "red", xaxt = "n")
segments(1, prop_ci[1], 1, prop_ci[2], col = "red", lwd = 2)
segments(0.9, prop_ci[1], 1.1, prop_ci[1], col = "red", lwd = 2)
segments(0.9, prop_ci[2], 1.1, prop_ci[2], col = "red", lwd = 2)
text(1, 0.05, paste("CI: (", prop_ci[1], ",", prop_ci[2], ")"), cex = 0.8)


The Most Important Takeaways

1. Sampling Variability is Natural

Every sample is different. The sampling distribution shows us how much variation to expect.

2. Standard Error Measures Precision

\[ SE = \frac{\text{Variability}}{\sqrt{\text{Sample Size}}} \] Larger samples give more precise estimates.

3. Normality Emerges from Aggregation

Even if individual measurements aren’t normal, sample statistics tend to be normal for large samples (CLT).

4. Formulas Depend on Data Type

  • Continuous data → Mean formulas
  • Categorical data → Proportion formulas

5. Conditions Matter

  • Random sampling
  • Independence
  • Sample size requirements


Common Exam Questions (and How to Answer)

Q1: “Why can we use normal distribution for sample means if the population isn’t normal?”

A: Central Limit Theorem—sample means become normal for large samples regardless of population shape.

Q2: “What’s the difference between standard deviation and standard error?”

A: SD measures variation in data; SE measures variation in sample statistics.

Q3: “When can’t we use the normal approximation for proportions?”

A: When np < 10 or n(1-p) < 10—not enough successes or failures.

Q4: “How does sample size affect the sampling distribution?”

A: Larger n → smaller SE → narrower distribution → more precise estimates.

5.5 Summary Table

# Comprehensive Summary Table
summary_all <- data.frame(
  Concept = c("Sampling Distribution", "Standard Error", "Conditions for Normality", 
              "Confidence Interval", "Sample Size Formula"),
  For_Means = c("Distribution of sample means", "σ/√n", "n ≥ 30 (or normal population)", 
                "x̄ ± z*(σ/√n)", "n = (z*σ/ME)²"),
  For_Proportions = c("Distribution of sample proportions", "√[p(1-p)/n]", 
                      "np ≥ 10 and n(1-p) ≥ 10", "p̂ ± z*√[p̂(1-p̂)/n]", 
                      "n = (z*/ME)² × p(1-p)"),
  Key_Insight = c("Shows how statistics vary across samples", 
                  "Measures precision of estimate",
                  "Ensures normal approximation is valid",
                  "Range likely containing parameter",
                  "Ensures desired precision")
)

kable(summary_all, caption = "Complete Summary of Sampling Distribution Concepts") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:4, width = "25%")
Complete Summary of Sampling Distribution Concepts

Concept                  | For Means                     | For Proportions                     | Key Insight
Sampling Distribution    | Distribution of sample means  | Distribution of sample proportions  | Shows how statistics vary across samples
Standard Error           | σ/√n                          | √[p(1-p)/n]                         | Measures precision of estimate
Conditions for Normality | n ≥ 30 (or normal population) | np ≥ 10 and n(1-p) ≥ 10             | Ensures normal approximation is valid
Confidence Interval      | x̄ ± z*(σ/√n)                  | p̂ ± z*√[p̂(1-p̂)/n]                   | Range likely containing parameter
Sample Size Formula      | n = (z*σ/ME)²                 | n = (z*/ME)² × p(1-p)               | Ensures desired precision

5.6 Looking Forward

Sampling distributions provide the foundation for:

  1. Hypothesis Testing: Is our sample result statistically significant?
  2. Regression Analysis: How reliable are our regression coefficients?
  3. Experimental Design: How to plan studies for maximum information?
  4. Bayesian Statistics: How to update beliefs with new data?

Final Thought: The beauty of sampling distributions is that they turn uncertainty into something measurable. Instead of saying “I don’t know,” we can say “I’m 95% confident the true value is in this range.” That’s the power of statistical inference.

Remember: Every statistical analysis you’ll ever do rests on understanding sampling distributions. Master this, and you’ve mastered the core of statistics.




References

Video Materials

  1. Continuous Random Variables - https://youtu.be/ZyUzRVa6hCM
  2. Sampling Distribution - https://youtu.be/7S7j75d3GM4
  3. Central Limit Theorem - https://youtu.be/ivd8wEHnMCg
  4. Sample Proportion - https://youtu.be/q2e4mK0FTbw
  5. Review Sampling Distribution - https://youtu.be/c0mFEL_SWzE

Textbooks

  1. Walpole, R. E., et al. (2016). Probability & Statistics for Engineers & Scientists (9th ed.). Pearson.
  2. Devore, J. L. (2015). Probability and Statistics for Engineering and the Sciences (9th ed.). Cengage Learning.
  3. Montgomery, D. C., & Runger, G. C. (2018). Applied Statistics and Probability for Engineers (7th ed.). Wiley.