
Fityanandra Athar Adyaksa (52250059)


Data Science student at

Enthusiastic about learning

December 07, 2025




Introduction


A probability distribution is a mathematical function that describes how the values of a random variable are spread or distributed. It provides a complete picture of all possible outcomes of an experiment and assigns a probability to each outcome (for discrete variables) or to intervals of outcomes (for continuous variables). In essence, a probability distribution tells us how likely each value or range of values is. For discrete random variables, it is typically represented by a probability mass function (PMF), while continuous random variables use a probability density function (PDF). Regardless of the type, the total probability across all outcomes must equal 1. Probability distributions are essential in statistics because they allow us to model uncertainty, make predictions, and perform inference about real-world phenomena based on data.

Imagine you are making soup. The recipe calls for salt to taste. While the exact amount is uncertain, you have an estimate based on experience—typically between half and one teaspoon. Probability distributions work in a similar way, but instead of relying on intuition, we use mathematics to describe the range of possible values and how likely each of those values is to occur.

In this material, we will explore key concepts ranging from continuous random variables, sampling distributions, the central limit theorem, to sample proportions. Each concept will be explained using a step-by-step approach: first with a formal definition, followed by a real-life analogy, complete with visualizations and computational examples.



Continuous Random Variables



Continuous random variables differ fundamentally from discrete ones. While discrete variables count occurrences (how many?), continuous variables measure quantities (how much?). This section explores the unique characteristics of continuous random variables and how we work with them.


1.1 What Makes Variables Continuous?

Continuous random variables can take any value within an interval. Think of measuring time, weight, or temperature—these can be infinitely precise. For example, when timing a race, you could record 10.5 seconds, 10.52 seconds, or 10.523 seconds.

Key Difference:

  • Discrete: “How many students passed?” (0, 1, 2, 3, …) — Countable
  • Continuous: “What percentage scored above 80%?” (0.751, 0.752, 0.753, …) — Measurable


1.2 Probability Density Functions (PDF)

Since continuous variables have infinite possible values, we use Probability Density Functions (PDF) instead of probability mass functions. The PDF shows relative likelihood, not direct probability.

Three Essential PDF Properties:

  1. \(f(x) \geq 0\) for all x (never negative)
  2. Total area under the curve = 1: \(\int_{-\infty}^{\infty} f(x)dx = 1\)
  3. Probability = area: \(P(a \leq X \leq b) = \int_{a}^{b} f(x)dx\)

Important Insight: The PDF value at a point isn’t a probability—it’s a density. Only areas under the curve represent probabilities.
# Visualizing PDF as Area Under Curve
x <- seq(-3, 3, length.out = 1000)
y <- dnorm(x)

plot(x, y, type = "l", lwd = 2, col = "blue",
     main = "Probability = Area Under PDF Curve",
     xlab = "X", ylab = "Density f(x)")

# Shade area for probability between -1 and 1
polygon(c(-1, x[x >= -1 & x <= 1], 1), 
        c(0, y[x >= -1 & x <= 1], 0), 
        col = rgb(0, 0, 1, 0.3))
text(0, 0.1, "Area = Probability", col = "darkblue")
text(0, 0.05, "P(-1 ≤ X ≤ 1) ≈ 0.6827", col = "red")


1.3 The Zero Probability Paradox

One of the most surprising aspects of continuous variables: \(P(X = a) = 0\) for any specific value \(a\). This doesn’t mean the value is impossible—it means the probability of exactly that value is infinitesimally small.

Practical Approach: We always work with intervals, not exact values. Instead of asking “What’s the probability of exactly 170 cm?” we ask “What’s the probability between 169.5 cm and 170.5 cm?”
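A quick R sketch makes this concrete (an illustrative example only, assuming heights follow a normal distribution with mean 170 cm and SD 7 cm; these numbers are not from a real dataset):
# Point probability vs. interval probability for a continuous variable
mu <- 170
sigma <- 7

# "Probability" of exactly 170 cm: a zero-width interval has zero area
pnorm(170, mu, sigma) - pnorm(170, mu, sigma)        # 0

# Probability of falling between 169.5 cm and 170.5 cm: a genuine area
pnorm(170.5, mu, sigma) - pnorm(169.5, mu, sigma)    # about 0.057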


1.4 Common Continuous Distributions

Normal Distribution (Bell Curve)

The most famous continuous distribution, appearing naturally in many phenomena:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

Parameters:

  • μ (mu) = mean (center)
  • σ (sigma) = standard deviation (spread)

Uniform Distribution

Equal probability density across an interval:

\[ f(x) = \frac{1}{b-a} \quad \text{for } a \leq x \leq b \]


Exponential Distribution

Models time between events:

\[ f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0 \]


# Comparing Three Continuous Distributions
par(mfrow = c(1, 3))

# Normal Distribution
curve(dnorm(x), -4, 4, main = "Normal Distribution",
      xlab = "X", ylab = "Density", col = "blue", lwd = 2)

# Uniform Distribution
curve(dunif(x, 0, 1), -0.5, 1.5, main = "Uniform Distribution",
      xlab = "X", ylab = "Density", col = "red", lwd = 2)

# Exponential Distribution
curve(dexp(x, rate = 1), 0, 5, main = "Exponential Distribution",
      xlab = "X", ylab = "Density", col = "green", lwd = 2)


1.5 Practical Example: Exam Scores

Suppose final exam scores follow a normal distribution with mean 75 and standard deviation 10.

Questions:

  1. What percentage of students scored below 60?
  2. What’s the probability of scoring between 70 and 80?
  3. What score is needed to be in the top 10%?

Calculations:

  1. \(P(X < 60) = P(Z < \frac{60-75}{10}) = P(Z < -1.5) ≈ 0.0668\) (6.68%)
  2. \(P(70 < X < 80) = P(\frac{70-75}{10} < Z < \frac{80-75}{10}) = P(-0.5 < Z < 0.5) ≈ 0.3829\) (38.29%)
  3. Top 10% cutoff: \(75 + 1.28 \times 10 ≈ 87.8\)
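
The same answers can be obtained directly with base R's normal-distribution functions (a small verification sketch using the mean and SD given above):
# Verifying the exam-score calculations with base R
mu <- 75
sigma <- 10

pnorm(60, mu, sigma)                         # P(X < 60)       ≈ 0.0668
pnorm(80, mu, sigma) - pnorm(70, mu, sigma)  # P(70 < X < 80)  ≈ 0.3829
qnorm(0.90, mu, sigma)                       # top-10% cutoff  ≈ 87.8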


1.6 Key Takeaways

  1. Continuous variables measure quantities with infinite precision
  2. PDFs describe probability density, not direct probability
  3. Probability is calculated as area under the PDF curve
  4. Normal distribution is fundamental in statistics
  5. Always work with intervals, not exact values




Sampling Distribution



2.1 The Foundation

A sampling distribution shows what happens when we take many random samples from the same population and calculate the same statistic (like the mean) from each sample.

Core Concept: It’s the distribution of a statistic across all possible samples of the same size.

Video Analogy: If you repeatedly take samples of 50 people and calculate each sample’s average height, the collection of those averages forms the sampling distribution.


2.2 The Sampling Distribution of the Sample Mean

For a population with mean μ and standard deviation σ, when we take samples of size n:

Center: The mean of all sample means equals the population mean:

\[ \mu_{\bar{x}} = \mu \]

Spread: The standard deviation of sample means (called Standard Error) is:

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]

Key Insight: Sample means vary less than individual observations. Larger samples give more consistent results.
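
A short simulation illustrates both facts; the population below (normal with μ = 100, σ = 15) is an arbitrary choice for demonstration:
# Simulating the sampling distribution of the mean
set.seed(42)
mu <- 100
sigma <- 15
n <- 25

sample_means <- replicate(10000, mean(rnorm(n, mu, sigma)))

mean(sample_means)   # close to the population mean (100)
sd(sample_means)     # close to sigma / sqrt(n) = 3
sigma / sqrt(n)      # theoretical standard error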


2.3 Standard Error

Standard Error (SE) quantifies how much sample statistics vary from sample to sample:

\[ SE = \frac{\sigma}{\sqrt{n}} \]

What it tells us:

  • SE decreases as sample size increases
  • Cutting SE in half requires quadrupling sample size
  • Smaller SE means more precise estimates
# Demonstrating Standard Error vs Sample Size
set.seed(123)
population_sd <- 15
sample_sizes <- c(5, 20, 50, 100)

standard_errors <- population_sd / sqrt(sample_sizes)

results <- data.frame(
  Sample_Size = sample_sizes,
  Standard_Error = round(standard_errors, 2),
  Relative_Precision = round(1/standard_errors, 2)
)

knitr::kable(results, 
             caption = "How Sample Size Affects Standard Error",
             col.names = c("Sample Size (n)", "Standard Error", "Relative Precision"))
How Sample Size Affects Standard Error

Sample Size (n) | Standard Error | Relative Precision
5               | 6.71           | 0.15
20              | 3.35           | 0.30
50              | 2.12           | 0.47
100             | 1.50           | 0.67


2.4 Shape of the Sampling Distribution

Normal Populations: If the population is normal, the sampling distribution is normal for any sample size.

Non-Normal Populations: For large samples (typically n ≥ 30), the sampling distribution becomes approximately normal (Central Limit Theorem).

Visual Proof:
# Shape Changes with Sample Size
set.seed(123)
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))

# Skewed population
skewed_pop <- rexp(10000, rate = 0.5)

for(n in c(5, 15, 30, 50)) {
  sample_means <- replicate(1000, mean(sample(skewed_pop, n)))
  
  hist(sample_means, main = paste("n =", n),
       xlab = "", ylab = "",
       col = "lightblue", breaks = 30,
       probability = TRUE)
  
  # Add normal curve
  x_norm <- seq(min(sample_means), max(sample_means), length = 100)
  y_norm <- dnorm(x_norm, mean = mean(sample_means), sd = sd(sample_means))
  lines(x_norm, y_norm, col = "red", lwd = 2)
}


2.5 Practical Application: Factory Quality Control

Scenario: A factory produces components with length μ = 50 mm, σ = 2 mm. Quality control samples 36 components each hour.

Question: What’s the probability that a sample mean is less than 49.5 mm?

Step-by-Step Solution:
  1. Calculate Standard Error: \[ SE = \frac{2}{\sqrt{36}} = \frac{2}{6} = 0.333 \]
  2. Calculate Z-score: \[ Z = \frac{49.5 - 50}{0.333} = -1.5 \]
  3. Find probability: \[ P(\bar{X} < 49.5) = P(Z < -1.5) = 0.0668 \]

Interpretation: There’s a 6.68% chance of observing such a low sample mean if the process is working correctly.
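
The same probability comes out of a one-line check in R, using the values from this scenario:
# P(sample mean < 49.5) when mu = 50, sigma = 2, n = 36
pnorm(49.5, mean = 50, sd = 2 / sqrt(36))   # ≈ 0.0668

# Equivalently, via the Z-score
pnorm(-1.5)                                 # ≈ 0.0668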


2.6 Why This Matters in Real Research

  1. Confidence Intervals:
    Sample mean ± Margin of Error (where Margin of Error uses SE)

  2. Hypothesis Testing:
    Determines if sample results are unusual under the null hypothesis

  3. Sample Size Planning:
    Helps decide how many observations to collect

Key Formula for Confidence Interval:

\[ \bar{x} \pm z^* \times \frac{\sigma}{\sqrt{n}} \] Where \(z^*\) depends on confidence level (1.96 for 95% confidence).
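
As a quick sketch of how this formula is used (the numbers x̄ = 52, σ = 8, n = 64 are made up for illustration):
# 95% confidence interval for a mean with known sigma (hypothetical values)
x_bar <- 52
sigma <- 8
n <- 64
z_star <- qnorm(0.975)   # ≈ 1.96

x_bar + c(-1, 1) * z_star * sigma / sqrt(n)   # ≈ (50.04, 53.96)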


2.7 Common Pitfalls to Avoid

  1. Confusing SD with SE:
    • SD: Variation in individual data points
    • SE: Variation in sample statistics
  2. Ignoring Sample Size Requirements:
    • For means: n ≥ 30 for CLT to apply
    • For proportions: np ≥ 10 and n(1-p) ≥ 10
  3. Forgetting Random Sampling Assumption:
    • Results only valid if samples are random


2.8 Essential Takeaways

  1. Sampling distribution describes how sample statistics vary
  2. Standard Error = σ/√n measures this variation
  3. Distribution shape becomes normal for large samples
  4. Applications include confidence intervals and hypothesis tests

Remember: The sampling distribution connects what we see in a sample to what exists in the population—it’s the foundation of statistical inference.




Central Limit Theorem



The Central Limit Theorem (CLT) states that regardless of the population’s distribution, the sampling distribution of the sample mean will approach a normal distribution as the sample size increases.

In Simple Terms: Take any population—skewed, uniform, exponential—take large enough samples, calculate their means, and those means will form a bell curve.

Formal Statement: For a population with mean μ and standard deviation σ, when we take random samples of size n (with n sufficiently large):

\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]


Why Is the CLT Revolutionary?

Before CLT: We assumed populations were normal to do statistical inference.

After CLT: We can work with any population shape if we have large enough samples.

The Magic Number: n ≥ 30 is often considered “large enough” for CLT to kick in, though very skewed distributions may need larger n.


3.1 Mathematical Foundation

The standardized sample mean follows a standard normal distribution for large n:

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1) \]

What Changes with Sample Size:

  • n = 1: Sampling distribution = Population distribution
  • n = 30: Approximately normal for most populations
  • n = 100: Very close to normal


3.2 Practical Example: Factory Production

Scenario: A factory produces screws with lengths following an exponential distribution (mean = 50mm, SD = 50mm). This distribution is highly skewed—most screws are short, but some are very long.

Problem: What’s the probability that a sample of 40 screws has average length > 60mm?

Without CLT: Complex calculation with exponential distribution.

With CLT: Simple normal approximation.
Solution:

\[ SE = \frac{50}{\sqrt{40}} = 7.906 \]

\[ Z = \frac{60 - 50}{7.906} = 1.265 \]

\[ P(\bar{X} > 60) = P(Z > 1.265) = 0.103 \quad (10.3\%) \]
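
This can be sanity-checked in R: the CLT approximation below uses only the mean and SD, while the simulation draws from the exponential population itself (a verification sketch, not part of the original example):
# CLT approximation vs. simulation for the screw-length example
set.seed(123)
mu <- 50
sigma <- 50
n <- 40

# Normal approximation from the CLT
1 - pnorm(60, mean = mu, sd = sigma / sqrt(n))       # ≈ 0.103

# Simulation from the exponential population (mean 50 => rate 1/50)
sim_means <- replicate(100000, mean(rexp(n, rate = 1/50)))
mean(sim_means > 60)   # close to the CLT value; slightly larger, since some skew remains at n = 40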


3.3 Sample Size Guidelines

# Sample Size Requirements for Different Distributions
library(knitr)

guidelines <- data.frame(
  Distribution_Type = c("Moderately Skewed", "Highly Skewed", "Extremely Skewed", "Proportions (p ≈ 0.5)", "Proportions (p ≈ 0.1)"),
  Minimum_n = c("30", "50", "100+", "30", "100"),
  Reason = c("CLT works well", "More observations needed", "Very slow convergence", "np and n(1-p) both > 15", "Need np > 10")
)

kable(guidelines, 
      caption = "Sample Size Guidelines for CLT Approximation",
      col.names = c("Distribution Type", "Minimum Sample Size", "Reason"))
Sample Size Guidelines for CLT Approximation

Distribution Type     | Minimum Sample Size | Reason
Moderately Skewed     | 30                  | CLT works well
Highly Skewed         | 50                  | More observations needed
Extremely Skewed      | 100+                | Very slow convergence
Proportions (p ≈ 0.5) | 30                  | np and n(1-p) both > 15
Proportions (p ≈ 0.1) | 100                 | Need np > 10


3.4 The De Moivre-Laplace Theorem

For proportions, CLT takes a special form called the De Moivre-Laplace Theorem:

\[ \hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \quad \text{for large n} \] with conditions: np ≥ 10 and n(1-p) ≥ 10.


3.5 Key Takeaways

  1. Universal Applicability: CLT works for any population distribution
  2. Sample Means Become Normal: For large enough samples
  3. Practical Threshold: n ≥ 30 often sufficient
  4. Foundation for Inference: Enables t-tests, confidence intervals, regression
  5. Not About Population: Population stays the same; sample means become normal

Final Insight: CLT is why the normal distribution is everywhere in statistics. It’s not that the world is normally distributed—it’s that averages of random variables tend to be normal, regardless of what we start with.




Sample Proportion



4.1 From Counts to Proportions

When we deal with categorical data (yes/no, success/failure), we use sample proportions instead of sample means. The sample proportion (p̂) measures the fraction of successes in a sample.

Formula:

\[ \hat{p} = \frac{X}{n} \]

where:

  • X = number of successes in the sample
  • n = sample size

Example Applications:

  • Political polls: proportion supporting a candidate
  • Quality control: proportion of defective items
  • Medicine: proportion of patients responding to treatment


4.2 Sampling Distribution of Sample Proportion

When we take many samples and calculate p̂ for each, these proportions form a sampling distribution with:

Center: The mean of all sample proportions equals the population proportion: \[ \mu_{\hat{p}} = p \]

Spread: The standard deviation (Standard Error) is: \[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]

Important: This formula uses p (population proportion), which we often don’t know. In practice, we use p̂ as an estimate.
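
A quick simulation checks both the center and the spread (p = 0.3 and n = 50 here are arbitrary illustrative values):
# Center and spread of the sampling distribution of p-hat
set.seed(1)
p <- 0.3
n <- 50

p_hats <- rbinom(10000, size = n, prob = p) / n

mean(p_hats)              # close to p = 0.3
sd(p_hats)                # close to the theoretical standard error
sqrt(p * (1 - p) / n)     # sqrt(p(1-p)/n) ≈ 0.0648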


4.3 Conditions for Normality

For the sampling distribution to be approximately normal, we need:

  1. Random Sampling: Samples must be random
  2. Independence: n ≤ 10% of population if sampling without replacement
  3. Success-Failure Condition:
    • np ≥ 10
    • n(1-p) ≥ 10

Rule of Thumb: At least 10 successes and 10 failures in the sample.

# Visualizing Conditions for Normality
set.seed(123)
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))

# Different scenarios
scenarios <- list(
  c(p = 0.5, n = 20, label = "Good: p=0.5, n=20"),
  c(p = 0.1, n = 30, label = "Poor: p=0.1, n=30"),
  c(p = 0.5, n = 50, label = "Good: p=0.5, n=50"),
  c(p = 0.1, n = 100, label = "Good: p=0.1, n=100")
)

for(scen in scenarios) {
  p <- as.numeric(scen["p"])
  n <- as.numeric(scen["n"])
  
  # Generate sampling distribution
  sample_props <- rbinom(10000, n, p) / n
  
  hist(sample_props, main = scen["label"],
       xlab = "", ylab = "",
       col = ifelse(p*n >= 10 & n*(1-p) >= 10, "lightgreen", "lightcoral"),
       breaks = 30, probability = TRUE)
  
  # Add normal curve if conditions met
  if(p*n >= 10 & n*(1-p) >= 10) {
    x_norm <- seq(min(sample_props), max(sample_props), length = 100)
    y_norm <- dnorm(x_norm, mean = p, sd = sqrt(p*(1-p)/n))
    lines(x_norm, y_norm, col = "blue", lwd = 2)
  }
  
  # Add success-failure counts
  text(mean(sample_props), max(hist(sample_props, plot=FALSE)$density)*0.8,
       paste("Successes:", round(p*n,1)), cex = 0.7)
  text(mean(sample_props), max(hist(sample_props, plot=FALSE)$density)*0.6,
       paste("Failures:", round(n*(1-p),1)), cex = 0.7)
}


4.4 The Normal Approximation

When conditions are met: \[ \hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \]

Standardized Form:

\[ Z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \sim N(0,1) \]

Important: This approximation works well when p is not too close to 0 or 1, and sample size is large enough.


Practical Example: Political Poll

Scenario: A candidate claims 60% support (p = 0.60). A poll of 500 voters shows 280 support (p̂ = 0.56).

Question: What’s the probability of getting 56% or less support if the true support is 60%?

Step-by-Step Solution:

  1. Check Conditions:
    • Random sample: assumed
    • Independence: 500 < 10% of voters
    • Success-Failure:
      • np = 500 × 0.60 = 300 ≥ 10
      • n(1-p) = 500 × 0.40 = 200 ≥ 10 ✓ Conditions met
  2. Calculate Standard Error:

    \[ SE = \sqrt{\frac{0.60 \times 0.40}{500}} = \sqrt{0.00048} = 0.0219 \]

  3. Calculate Z-score:

    \[ Z = \frac{0.56 - 0.60}{0.0219} = -1.826 \]

  4. Find Probability:

    \[ P(\hat{p} \leq 0.56) = P(Z \leq -1.826) = 0.034 \]

Interpretation: There’s only a 3.4% chance of getting 56% or less support if true support is 60%. This suggests the candidate’s claim might be too high.
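
The calculation can be reproduced in R; the exact binomial probability is shown for comparison and should be close to the normal approximation (a verification sketch using the numbers above):
# Normal approximation for the poll example
p <- 0.60
n <- 500
p_hat <- 0.56
se <- sqrt(p * (1 - p) / n)

pnorm(p_hat, mean = p, sd = se)        # ≈ 0.034

# Exact binomial probability of 280 or fewer supporters out of 500
pbinom(280, size = 500, prob = 0.60)   # similar answer, slightly larger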


Margin of Error and Confidence Intervals

For a 95% confidence interval: \[ \hat{p} \pm 1.96 \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Example: If p̂ = 0.56 from n = 500: \[ SE = \sqrt{\frac{0.56 \times 0.44}{500}} = 0.0222 \]
\[ 95\%\ CI = 0.56 \pm 1.96 \times 0.0222 = (0.516, 0.604) \]

We’re 95% confident the true proportion is between 51.6% and 60.4%.
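
The interval can be computed in a couple of lines (a sketch using the poll numbers above):
# 95% confidence interval for the poll proportion
p_hat <- 0.56
n <- 500
se <- sqrt(p_hat * (1 - p_hat) / n)

p_hat + c(-1, 1) * 1.96 * se   # ≈ (0.516, 0.604)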


Common Mistakes

  1. Using Wrong Standard Error Formula:
    • Wrong: \(SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) when p is known
    • Right: Use p when known, \(\hat{p}\) when estimating
  2. Ignoring Conditions:
    • Applying normal approximation when np < 10 or n(1-p) < 10
  3. Confusing Population and Sample:
    • p = population proportion (usually unknown)
    • p̂ = sample proportion (calculated from data)


4.5 Sample Size Determination

To achieve a desired margin of error (ME): \[ n = \left(\frac{z^*}{ME}\right)^2 p(1-p) \]

Conservative Approach: Use p = 0.5 (maximizes required sample size): \[ n = \left(\frac{1.96}{ME}\right)^2 \times 0.25 \]

# Sample Size Calculator for Proportions
calculate_sample_size <- function(ME, p = 0.5, confidence = 0.95) {
  z <- qnorm(1 - (1-confidence)/2)
  n <- (z^2 * p * (1-p)) / (ME^2)
  return(ceiling(n))
}

# Example calculations
ME_levels <- c(0.01, 0.03, 0.05, 0.10)
sample_sizes <- sapply(ME_levels, calculate_sample_size)

results <- data.frame(
  Margin_of_Error = paste0(ME_levels*100, "%"),
  Sample_Size_Needed = sample_sizes,
  Notes = c("Very precise", "Typical poll", "Moderate precision", "Rough estimate")
)

knitr::kable(results, 
             caption = "Sample Size Requirements for Different Margins of Error (95% Confidence)",
             col.names = c("Margin of Error", "Minimum Sample Size", "Typical Use"))
Sample Size Requirements for Different Margins of Error (95% Confidence)

Margin of Error | Minimum Sample Size | Typical Use
1%              | 9604                | Very precise
3%              | 1068                | Typical poll
5%              | 385                 | Moderate precision
10%             | 97                  | Rough estimate


4.6 Key Takeaways

  1. Sample proportion p̂ = X/n estimates population proportion p
  2. Sampling distribution is approximately normal when np ≥ 10 and n(1-p) ≥ 10
  3. Standard Error = √[p(1-p)/n]
  4. Confidence intervals help estimate population proportion
  5. Sample size planning ensures desired precision

Remember: For proportions, the “success-failure condition” is crucial. Always check np ≥ 10 and n(1-p) ≥ 10 before using normal approximation methods.




Review Sampling Distribution



5.1 Connecting All the Pieces

Sampling distributions form the foundation of statistical inference—they connect what we observe in our sample to what exists in the population. This review integrates everything we’ve learned about:

  1. Continuous Random Variables (how individual measurements behave)
  2. Sampling Distribution of the Mean (how sample averages behave)
  3. Central Limit Theorem (why sample averages become normal)
  4. Sample Proportion (how percentages/success rates behave)

The Big Picture: Every time we collect data and calculate a statistic (mean, proportion), we’re seeing one possible value from a sampling distribution.

5.2 The Complete Framework

For Sample Means (Continuous Data):

\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{when n is large} \]

For Sample Proportions (Categorical Data):

\[ \hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \quad \text{when np ≥ 10 and n(1-p) ≥ 10} \]

Key Similarity: Both follow normal distributions for large enough samples, even if the original population doesn’t.


5.3 The Four-Step Process for Any Sampling Problem

Step 1: Identify the Parameter
  • Mean (μ) or proportion (p)?
  • Known or unknown population parameters?

Step 2: Check Conditions
  • Random sampling?
  • Independence?
  • Sample size requirements met?

Step 3: Calculate Standard Error
  • For means: \(SE = \frac{\sigma}{\sqrt{n}}\)
  • For proportions: \(SE = \sqrt{\frac{p(1-p)}{n}}\)

Step 4: Apply Normal Distribution
  • Use z-scores: \(Z = \frac{\text{statistic} - \text{parameter}}{SE}\)
  • Find probabilities or create intervals

# Complete Visualization of Sampling Distribution Framework
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))

# 1. Population Distribution
pop_data <- rexp(10000, rate = 0.5)
hist(pop_data, main = "1. Population Distribution",
     xlab = "", ylab = "", col = "lightblue", breaks = 30)

# 2. Single Sample
single_sample <- sample(pop_data, 30)
hist(single_sample, main = "2. Single Sample (n=30)",
     xlab = "", ylab = "", col = "lightgreen", breaks = 15)

# 3. Sampling Distribution
sample_means <- replicate(1000, mean(sample(pop_data, 30)))
hist(sample_means, main = "3. Sampling Distribution of Means",
     xlab = "", ylab = "", col = "lightcoral", breaks = 30, probability = TRUE)

# Add normal curve
x_norm <- seq(min(sample_means), max(sample_means), length = 100)
y_norm <- dnorm(x_norm, mean = mean(sample_means), sd = sd(sample_means))
lines(x_norm, y_norm, col = "blue", lwd = 2)

# 4. Standardized Distribution
z_scores <- (sample_means - mean(pop_data)) / (sd(pop_data)/sqrt(30))
hist(z_scores, main = "4. Standardized (Z) Distribution",
     xlab = "", ylab = "", col = "lightyellow", breaks = 30, probability = TRUE)

# Add standard normal curve
curve(dnorm(x), add = TRUE, col = "red", lwd = 2)


Decision Tree: Which Formula to Use?

# Decision Tree Table
library(kableExtra)

decision_tree <- data.frame(
  Question = c(
    "What type of data do you have?",
    "Is the population standard deviation known?",
    "What sample size?",
    "For proportions: Are conditions met?"
  ),
  Answer_1 = c(
    "Continuous → Use mean formulas",
    "Yes → Use z-distribution",
    "n ≥ 30 → CLT applies",
    "np ≥ 10 and n(1-p) ≥ 10 → Normal approximation OK"
  ),
  Answer_2 = c(
    "Categorical → Use proportion formulas",
    "No → Use t-distribution (future topic)",
    "n < 30 → Check population normality",
    "Conditions not met → Use exact methods"
  )
)

kable(decision_tree, caption = "Decision Tree for Sampling Distribution Problems") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(1:4, extra_css = "border-bottom: 2px solid #ddd;")
Decision Tree for Sampling Distribution Problems

Question                                    | Answer 1                                           | Answer 2
What type of data do you have?              | Continuous → Use mean formulas                     | Categorical → Use proportion formulas
Is the population standard deviation known? | Yes → Use z-distribution                           | No → Use t-distribution (future topic)
What sample size?                           | n ≥ 30 → CLT applies                               | n < 30 → Check population normality
For proportions: Are conditions met?        | np ≥ 10 and n(1-p) ≥ 10 → Normal approximation OK  | Conditions not met → Use exact methods

Common Problem Types and Solutions


Type 1: Probability Calculations

Question: “What’s the probability that our sample mean is less than X?”

Solution:

  1. Calculate SE
  2. Calculate Z-score: \(Z = \frac{\bar{x} - \mu}{SE}\)
  3. Use normal table or software to find probability


Type 2: Confidence Intervals

Question: “What range likely contains the population parameter?”

Solution: \[ \text{Statistic} \pm (z^* \times SE) \] where z* = 1.96 for 95% confidence


Type 3: Sample Size Planning

Question: “How large a sample do I need?”

Solution:

  • For means: \(n = \left(\frac{z^* \sigma}{ME}\right)^2\)
  • For proportions: \(n = \left(\frac{z^*}{ME}\right)^2 p(1-p)\)
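
The calculator in Section 4.5 already covers proportions; for means, here is a minimal sketch (σ = 12 and ME = 2 are hypothetical values chosen for illustration):
# Required sample size for estimating a mean (hypothetical sigma and ME)
sigma <- 12
ME <- 2
z_star <- qnorm(0.975)   # 95% confidence

ceiling((z_star * sigma / ME)^2)   # ≈ 139 observations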


5.4 Real-World Integration Example

Scenario: A pharmaceutical company tests a new drug. They want to know:

  1. Average reduction in blood pressure (continuous)
  2. Proportion of patients with side effects (categorical)

Data Collected:

  • 100 patients sampled
  • Mean blood pressure reduction: 8.5 mmHg (σ = 4 mmHg from previous studies)
  • 15 patients reported side effects

Analysis Part 1 (Mean): \[ SE_{\text{mean}} = \frac{4}{\sqrt{100}} = 0.4 \]
\[ 95\%\ CI\ \text{for mean}: 8.5 \pm 1.96 \times 0.4 = (7.72, 9.28)\ \text{mmHg} \]

Analysis Part 2 (Proportion): \[ \hat{p} = 15/100 = 0.15 \]
\[ SE_{\text{prop}} = \sqrt{\frac{0.15 \times 0.85}{100}} = 0.0357 \]
\[ 95\%\ CI\ \text{for proportion}: 0.15 \pm 1.96 \times 0.0357 = (0.080, 0.220) \]

# Visualizing Both Confidence Intervals
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Mean CI
mean_ci <- c(7.72, 9.28)
plot(1, 8.5, xlim = c(0.5, 1.5), ylim = c(7, 10),
     main = "95% CI for Mean Reduction",
     xlab = "", ylab = "Blood Pressure Reduction (mmHg)",
     pch = 16, col = "blue", xaxt = "n")
segments(1, mean_ci[1], 1, mean_ci[2], col = "blue", lwd = 2)
segments(0.9, mean_ci[1], 1.1, mean_ci[1], col = "blue", lwd = 2)
segments(0.9, mean_ci[2], 1.1, mean_ci[2], col = "blue", lwd = 2)
text(1, 7.5, paste("CI: (", mean_ci[1], ",", mean_ci[2], ")"), cex = 0.8)

# Proportion CI
prop_ci <- c(0.080, 0.220)
plot(1, 0.15, xlim = c(0.5, 1.5), ylim = c(0, 0.25),
     main = "95% CI for Side Effect Proportion",
     xlab = "", ylab = "Proportion",
     pch = 16, col = "red", xaxt = "n")
segments(1, prop_ci[1], 1, prop_ci[2], col = "red", lwd = 2)
segments(0.9, prop_ci[1], 1.1, prop_ci[1], col = "red", lwd = 2)
segments(0.9, prop_ci[2], 1.1, prop_ci[2], col = "red", lwd = 2)
text(1, 0.05, paste("CI: (", prop_ci[1], ",", prop_ci[2], ")"), cex = 0.8)


The Most Important Takeaways

1. Sampling Variability is Natural

Every sample is different. The sampling distribution shows us how much variation to expect.

2. Standard Error Measures Precision

\[ SE = \frac{\text{Variability}}{\sqrt{\text{Sample Size}}} \] Larger samples give more precise estimates.

3. Normality Emerges from Aggregation

Even if individual measurements aren’t normal, sample statistics tend to be normal for large samples (CLT).

4. Formulas Depend on Data Type

  • Continuous data → Mean formulas
  • Categorical data → Proportion formulas

5. Conditions Matter

  • Random sampling
  • Independence
  • Sample size requirements


Common Exam Questions (and How to Answer)

Q1: “Why can we use normal distribution for sample means if the population isn’t normal?”

A: Central Limit Theorem—sample means become normal for large samples regardless of population shape.

Q2: “What’s the difference between standard deviation and standard error?”

A: SD measures variation in data; SE measures variation in sample statistics.

Q3: “When can’t we use the normal approximation for proportions?”

A: When np < 10 or n(1-p) < 10—not enough successes or failures.

Q4: “How does sample size affect the sampling distribution?”

A: Larger n → smaller SE → narrower distribution → more precise estimates.

5.5 Summary Table

# Comprehensive Summary Table
summary_all <- data.frame(
  Concept = c("Sampling Distribution", "Standard Error", "Conditions for Normality", 
              "Confidence Interval", "Sample Size Formula"),
  For_Means = c("Distribution of sample means", "σ/√n", "n ≥ 30 (or normal population)", 
                "x̄ ± z*(σ/√n)", "n = (z*σ/ME)²"),
  For_Proportions = c("Distribution of sample proportions", "√[p(1-p)/n]", 
                      "np ≥ 10 and n(1-p) ≥ 10", "p̂ ± z*√[p̂(1-p̂)/n]", 
                      "n = (z*/ME)² × p(1-p)"),
  Key_Insight = c("Shows how statistics vary across samples", 
                  "Measures precision of estimate",
                  "Ensures normal approximation is valid",
                  "Range likely containing parameter",
                  "Ensures desired precision")
)

kable(summary_all, caption = "Complete Summary of Sampling Distribution Concepts") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:4, width = "25%")
Complete Summary of Sampling Distribution Concepts

Concept                  | For Means                     | For Proportions                     | Key Insight
Sampling Distribution    | Distribution of sample means  | Distribution of sample proportions  | Shows how statistics vary across samples
Standard Error           | σ/√n                          | √[p(1-p)/n]                         | Measures precision of estimate
Conditions for Normality | n ≥ 30 (or normal population) | np ≥ 10 and n(1-p) ≥ 10             | Ensures normal approximation is valid
Confidence Interval      | x̄ ± z*(σ/√n)                  | p̂ ± z*√[p̂(1-p̂)/n]                   | Range likely containing parameter
Sample Size Formula      | n = (z*σ/ME)²                 | n = (z*/ME)² × p(1-p)               | Ensures desired precision

5.6 Looking Forward

Sampling distributions provide the foundation for:

  1. Hypothesis Testing: Is our sample result statistically significant?
  2. Regression Analysis: How reliable are our regression coefficients?
  3. Experimental Design: How to plan studies for maximum information?
  4. Bayesian Statistics: How to update beliefs with new data?

Final Thought: The beauty of sampling distributions is that they turn uncertainty into something measurable. Instead of saying “I don’t know,” we can say “I’m 95% confident the true value is in this range.” That’s the power of statistical inference.

Remember: Every statistical analysis you’ll ever do rests on understanding sampling distributions. Master this, and you’ve mastered the core of statistics.




References

Video Materials

  1. Continuous Random Variables - https://youtu.be/ZyUzRVa6hCM
  2. Sampling Distribution - https://youtu.be/7S7j75d3GM4
  3. Central Limit Theorem - https://youtu.be/ivd8wEHnMCg
  4. Sample Proportion - https://youtu.be/q2e4mK0FTbw
  5. Review Sampling Distribution - https://youtu.be/c0mFEL_SWzE

Textbooks

  1. Walpole, R. E., et al. (2016). Probability & Statistics for Engineers & Scientists (9th ed.). Pearson.
  2. Devore, J. L. (2015). Probability and Statistics for Engineering and the Sciences (9th ed.). Cengage Learning.
  3. Montgomery, D. C., & Runger, G. C. (2018). Applied Statistics and Probability for Engineers (7th ed.). Wiley.