1 Why the Normal Distribution?

Walk into any large lecture hall and measure every student’s height. Plot it. You will see a bell: most students cluster around an average, with fewer and fewer as you move toward the extremes. Roughly the same shape emerges when you measure exam scores, blood pressure readings, manufacturing tolerances, or (as a first approximation) daily stock returns.

This bell shape is the Normal Distribution, and it is arguably the most important probability distribution in all of statistics. Understanding it — and knowing how to use it to make confident claims from data — is a foundational skill for any data analyst.

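By the end of this tutorial you will be able to:

  • Describe the normal distribution and the roles of μ and σ.
  • Standardise values and apply the 68–95–99.7 rule.
  • Explain the sampling distribution of the mean and the Central Limit Theorem.
  • Construct and correctly interpret a confidence interval for a mean.
  • Identify what makes a confidence interval wider or narrower.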


2 The Normal Distribution

2.1 Shape and Parameters

The normal distribution is a continuous probability distribution defined by two parameters:

  • Mean (μ) — controls the location (where the centre of the bell sits).
  • Standard deviation (σ) — controls the spread (how wide or narrow the bell is).

Its probability density function is:

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
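
As a quick sanity check, the density formula can be typed out term by term and compared with base R's `dnorm()`. This is only a sketch: `normal_pdf` is a hypothetical helper name introduced here, and the values of μ and σ are arbitrary.

```r
# Hand-written normal density, following the formula above term by term.
# `normal_pdf` is a hypothetical helper, not a base R function.
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x_check <- c(-1, 0, 2.5, 4)

# Compare with dnorm() for an arbitrary choice of mu and sigma
manual  <- normal_pdf(x_check, mu = 4, sigma = 1.5)
builtin <- dnorm(x_check, mean = 4, sd = 1.5)
all.equal(manual, builtin)

# A density must integrate to 1 over the whole real line
total_area <- integrate(normal_pdf, lower = -Inf, upper = Inf,
                        mu = 4, sigma = 1.5)$value
total_area
```

In practice you would call `dnorm()` directly; writing the formula out once simply shows that there is nothing hidden inside it.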

library(ggplot2)  # plotting
library(dplyr)    # filter(), mutate(), bind_rows() used in later chunks

x <- seq(-8, 12, length.out = 500)

params <- data.frame(
  mu    = c(0,  0,  0,  4),
  sigma = c(1,  2,  0.5, 1.5),
  label = c("μ=0, σ=1", "μ=0, σ=2", "μ=0, σ=0.5", "μ=4, σ=1.5")
)

curves <- lapply(seq_len(nrow(params)), function(i) {
  data.frame(
    x     = x,
    y     = dnorm(x, params$mu[i], params$sigma[i]),
    label = params$label[i]
  )
}) |> do.call(what = rbind)

ggplot(curves, aes(x = x, y = y, colour = label)) +
  geom_line(linewidth = 1) +
  labs(x = "x", y = "Density", colour = NULL,
       title = "Normal Distributions — Effect of μ and σ") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")
Normal distributions with different means and standard deviations. The shape is always symmetric and bell-shaped; μ shifts the centre, σ controls the width.


2.2 The Standard Normal Distribution

The standard normal has μ = 0 and σ = 1. Any normal variable X ~ N(μ, σ²) can be converted to a standard normal Z using:

\[Z = \frac{X - \mu}{\sigma}\]

This standardisation allows us to use a single table (or function) to answer probability questions for any normal distribution.
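
A small sketch of what standardisation buys you: the probability computed directly on the original scale matches the probability computed on the Z scale. The numbers (mean 50, SD 8, value 60) are made up for illustration.

```r
# Illustrative values only (not taken from the tutorial's data)
mu_demo    <- 50
sigma_demo <- 8
x_demo     <- 60

# Standardise: how many SDs above the mean is x_demo?
z_demo <- (x_demo - mu_demo) / sigma_demo   # 1.25

# P(X <= 60) computed two ways
p_direct   <- pnorm(x_demo, mean = mu_demo, sd = sigma_demo)
p_standard <- pnorm(z_demo)                 # standard normal, mean 0 and sd 1

all.equal(p_direct, p_standard)

# Going back from a Z-score to the original scale
x_back <- mu_demo + z_demo * sigma_demo     # 60
```

R's `pnorm()` accepts `mean` and `sd` directly, so standardising by hand is rarely required; it is still useful for comparing values measured on different scales.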

x_std <- seq(-4, 4, length.out = 500)
df_std <- data.frame(x = x_std, y = dnorm(x_std))

sigmas <- c(-3, -2, -1, 1, 2, 3)

ggplot(df_std, aes(x, y)) +
  geom_line(linewidth = 1.2, colour = "steelblue") +
  geom_vline(xintercept = sigmas, linetype = "dashed", colour = "gray50", linewidth = 0.5) +
  annotate("text", x = sigmas, y = -0.012,
           label = paste0(sigmas, "σ"), size = 3.5, colour = "gray40") +
  annotate("text", x = 0, y = -0.012, label = "μ", size = 3.5, colour = "gray40") +
  geom_vline(xintercept = 0, colour = "gray40", linewidth = 0.5) +
  scale_y_continuous(limits = c(-0.02, 0.42)) +
  labs(x = "Standard Deviations from Mean (Z)", y = "Density",
       title = "Standard Normal Distribution  N(0, 1)") +
  theme_minimal(base_size = 13)
The standard normal distribution N(0,1). Vertical dashed lines mark ±1σ, ±2σ, and ±3σ from the mean.



3 The 68–95–99.7 Rule

One of the most useful properties of the normal distribution is that fixed proportions of data always fall within fixed distances of the mean, regardless of the specific μ and σ.

x_r <- seq(-4, 4, length.out = 500)
df_r <- data.frame(x = x_r, y = dnorm(x_r))

shade <- function(lo, hi, fill, alpha = 0.4) {
  xs <- seq(lo, hi, length.out = 200)
  data.frame(x = xs, y = dnorm(xs), fill = fill, alpha = alpha)
}

regions <- bind_rows(
  shade(-3, 3, "99.7%", 0.25),
  shade(-2, 2, "95%",   0.35),
  shade(-1, 1, "68%",   0.55)
)

ggplot(df_r, aes(x, y)) +
  geom_ribbon(data = regions |> filter(fill == "99.7%"),
              aes(ymin = 0, ymax = y), fill = "#aec6cf", alpha = 0.4) +
  geom_ribbon(data = regions |> filter(fill == "95%"),
              aes(ymin = 0, ymax = y), fill = "#6baed6", alpha = 0.4) +
  geom_ribbon(data = regions |> filter(fill == "68%"),
              aes(ymin = 0, ymax = y), fill = "#2171b5", alpha = 0.4) +
  geom_line(linewidth = 1.2, colour = "steelblue") +
  annotate("text", x = 0,    y = 0.20, label = "68%",   size = 5, fontface = "bold", colour = "white") +
  annotate("text", x = 1.55, y = 0.07, label = "95%",   size = 4, colour = "#2171b5") +
  annotate("text", x = 2.6,  y = 0.02, label = "99.7%", size = 3.5, colour = "#2171b5") +
  annotate("segment", x = -1, xend = 1, y = 0.24, yend = 0.24,
           arrow = arrow(ends = "both", length = unit(0.15, "cm"))) +
  annotate("text", x = 0, y = 0.255, label = "±1σ", size = 3.5) +
  annotate("segment", x = -2, xend = 2, y = 0.31, yend = 0.31,
           arrow = arrow(ends = "both", length = unit(0.15, "cm"))) +
  annotate("text", x = 0, y = 0.325, label = "±2σ", size = 3.5) +
  annotate("segment", x = -3, xend = 3, y = 0.38, yend = 0.38,
           arrow = arrow(ends = "both", length = unit(0.15, "cm"))) +
  annotate("text", x = 0, y = 0.395, label = "±3σ", size = 3.5) +
  labs(x = "Standard Deviations from Mean", y = "Density",
       title = "The 68–95–99.7 Empirical Rule") +
  theme_minimal(base_size = 13)
The empirical rule: shaded regions show the proportion of values within 1, 2, and 3 standard deviations of the mean.


The Empirical Rule — Summary

| Range | Proportion of data |
|---|---|
| Within ±1σ of the mean | 68.3% |
| Within ±2σ of the mean | 95.4% |
| Within ±3σ of the mean | 99.7% |

This rule applies to any normally distributed variable.
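
The percentages in the summary can be reproduced with `pnorm()`. The sketch below also repeats the calculation for an arbitrary, made-up μ and σ to show that the proportions do not depend on them.

```r
k <- 1:3  # number of standard deviations either side of the mean

# Area under the standard normal curve between -k and +k
prop_within <- pnorm(k) - pnorm(-k)
round(100 * prop_within, 1)
# 68.3 95.4 99.7

# Same calculation for an arbitrary normal distribution (illustrative values)
mu_any    <- 120
sigma_any <- 15
prop_any  <- pnorm(mu_any + k * sigma_any, mean = mu_any, sd = sigma_any) -
             pnorm(mu_any - k * sigma_any, mean = mu_any, sd = sigma_any)

all.equal(prop_within, prop_any)
```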

3.1 Worked Example — Exam Scores

MBA students’ exam scores are approximately normally distributed with a mean of 65 and a standard deviation of 10.

mu_exam <- 65
sd_exam <- 10
x_exam  <- seq(25, 105, length.out = 500)
df_exam <- data.frame(x = x_exam, y = dnorm(x_exam, mu_exam, sd_exam))

ggplot(df_exam, aes(x, y)) +
  geom_ribbon(data = df_exam |> filter(x >= mu_exam - 3*sd_exam, x <= mu_exam + 3*sd_exam),
              aes(ymin = 0, ymax = y), fill = "#aec6cf", alpha = 0.4) +
  geom_ribbon(data = df_exam |> filter(x >= mu_exam - 2*sd_exam, x <= mu_exam + 2*sd_exam),
              aes(ymin = 0, ymax = y), fill = "#6baed6", alpha = 0.4) +
  geom_ribbon(data = df_exam |> filter(x >= mu_exam - sd_exam,   x <= mu_exam + sd_exam),
              aes(ymin = 0, ymax = y), fill = "#2171b5", alpha = 0.4) +
  geom_line(linewidth = 1.2, colour = "steelblue") +
  geom_vline(xintercept = mu_exam, colour = "firebrick", linewidth = 1) +
  scale_x_continuous(breaks = c(35, 45, 55, 65, 75, 85, 95)) +
  annotate("text", x = 65, y = 0.043, label = "μ = 65", colour = "firebrick", size = 4) +
  labs(x = "Exam Score", y = "Density",
       title = "MBA Exam Score Distribution  N(65, 10²)") +
  theme_minimal(base_size = 13)
MBA exam score distribution. The shaded regions show the 68%, 95%, and 99.7% intervals.


Key questions you can now answer:

  • What proportion of students score between 55 and 75? → Within ±1σ → about 68%
  • What score separates the top 2.5% of students? → About 65 + 2×10 = 85 (the rule rounds z = 1.96 up to 2, so the exact cut-off is slightly lower, near 84.6)
  • A student scored 80. How unusual is that?
z <- (80 - 65) / 10
prob_above <- 1 - pnorm(z)

cat("Z-score for a score of 80:", round(z, 2), "\n")
## Z-score for a score of 80: 1.5
cat("Proportion of students scoring above 80:", scales::percent(prob_above, 0.1), "\n")
## Proportion of students scoring above 80: 6.7%
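
The first two answers in the list come from the empirical rule, so they are rounded. A short sketch of the exact calculations, using the same N(65, 10²) model, might look like this (the `_chk` variable names are introduced here for illustration):

```r
mu_chk <- 65
sd_chk <- 10

# Exact proportion scoring between 55 and 75 (the rule of thumb says 68%)
p_55_75 <- pnorm(75, mean = mu_chk, sd = sd_chk) -
           pnorm(55, mean = mu_chk, sd = sd_chk)
round(p_55_75, 4)

# Exact score that cuts off the top 2.5% (the rule of thumb says 85)
top_cutoff <- qnorm(0.975, mean = mu_chk, sd = sd_chk)
round(top_cutoff, 1)
# 84.6
```

`qnorm()` is the inverse of `pnorm()`: it takes a cumulative probability and returns the corresponding score.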

4 The Sampling Distribution of the Mean

4.1 From Individual Scores to Sample Means

When we collect data, we usually work with a sample rather than the full population. We compute a sample mean \(\bar{X}\) and use it to estimate the true population mean μ.

A crucial question: How much does \(\bar{X}\) vary across different samples?

The answer is the sampling distribution of the mean — the distribution of \(\bar{X}\) values you would get if you took many independent samples of size n from the population.

The Central Limit Theorem (CLT)

For large enough samples drawn independently from a population with finite variance σ², the sampling distribution of the mean is approximately normal, regardless of the shape of the original population, with:

\[\bar{X} \sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right)\]

The standard deviation of this distribution, \(\sigma/\sqrt{n}\), is called the Standard Error (SE).
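
To make the standard error concrete, the sketch below draws many samples from a population whose σ is known and compares the spread of the sample means with σ/√n. The exponential population, the sample size of 40, and the seed are illustrative assumptions, not part of the tutorial's own simulation.

```r
set.seed(2024)      # arbitrary seed, for reproducibility only

n_demo    <- 40     # observations per sample
reps_demo <- 5000   # number of repeated samples
sigma_exp <- 2      # an exponential with rate 0.5 has SD 1 / 0.5 = 2

# One sample mean per repetition
means_demo <- replicate(reps_demo, mean(rexp(n_demo, rate = 0.5)))

se_theory    <- sigma_exp / sqrt(n_demo)  # sigma / sqrt(n)
se_simulated <- sd(means_demo)            # observed spread of the sample means

round(c(theory = se_theory, simulated = se_simulated), 3)
```

The two numbers should agree closely, which is the sense in which the SE is "the standard deviation of the sample mean".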

4.2 Simulation: Watching the CLT in Action

Let’s draw repeated samples from a skewed, two-humped population (a mixture of an exponential and a normal component, nothing like a bell curve) and watch the distribution of sample means become approximately normal.

set.seed(8831)
pop_size   <- 100000
population <- c(rexp(pop_size * 0.6, rate = 0.5),
                rnorm(pop_size * 0.4, mean = 8, sd = 1))
pop_mean   <- mean(population)

pop_df <- data.frame(x = population[sample(pop_size, 5000)])

ggplot(pop_df, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50,
                 fill = "coral", colour = "white", alpha = 0.8) +
  geom_vline(xintercept = pop_mean, colour = "firebrick", linewidth = 1) +
  annotate("text", x = pop_mean + 0.3, y = 0.18,
           label = paste0("μ = ", round(pop_mean, 1)),
           colour = "firebrick", hjust = 0, size = 4) +
  labs(title = "Population Distribution (Right-Skewed)", x = "Value", y = "Density") +
  theme_minimal(base_size = 13)
The skewed population distribution. The red line marks the population mean.


sample_sizes <- c(1, 5, 30, 100)
n_reps       <- 3000

clt_data <- lapply(sample_sizes, function(n) {
  data.frame(
    sample_mean = replicate(n_reps, mean(sample(population, size = n))),
    n_label     = factor(paste0("n = ", n),
                         levels = paste0("n = ", sample_sizes))
  )
}) |> do.call(what = rbind)

ggplot(clt_data, aes(x = sample_mean)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "steelblue", colour = "white", alpha = 0.8) +
  geom_vline(xintercept = pop_mean, colour = "firebrick", linewidth = 1) +
  facet_wrap(~n_label, nrow = 1, scales = "free") +
  labs(title = "Sampling Distribution of the Mean — CLT in Action",
       x = "Sample Mean", y = "Density") +
  theme_minimal(base_size = 13)
Sampling distributions of the mean for increasing sample sizes. By n = 30 the distribution is already well-approximated by a bell curve.


Even though the population is far from normal (skewed, with two humps), by n = 30 the sampling distribution of the mean is already well approximated by a normal curve. By n = 100 it is very close to a bell shape.
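
One rough numerical check of "well approximated", offered as a sketch rather than a formal normality test: if the sample means really follow N(μ, σ²/n), about 95% of them should land within 1.96 standard errors of μ. The code rebuilds a population with the same recipe as above; the `_demo` names and `share_within` are introduced here for illustration.

```r
set.seed(8831)

# Mixture population: 60% exponential, 40% normal (same recipe as above)
pop_demo <- c(rexp(60000, rate = 0.5),
              rnorm(40000, mean = 8, sd = 1))
mu_demo  <- mean(pop_demo)
sd_demo  <- sd(pop_demo)

sizes <- c(1, 5, 30, 100)

# Share of sample means within 1.96 standard errors of the population mean
share_within <- vapply(sizes, function(n) {
  means <- replicate(3000, mean(sample(pop_demo, size = n)))
  mean(abs(means - mu_demo) <= 1.96 * sd_demo / sqrt(n))
}, numeric(1))
names(share_within) <- paste0("n = ", sizes)

round(share_within, 3)
```

Shares close to 0.95 suggest the normal approximation is doing its job at that sample size; a share that is clearly off signals that the bell curve is still a poor description.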


5 Confidence Intervals

5.1 What Is a Confidence Interval?

Suppose you take a sample, compute the mean, and want to say something about the true population mean μ. You cannot say “μ equals exactly this number” because your sample introduces uncertainty. Instead, you compute a confidence interval (CI): a range of plausible values for μ.

A 95% confidence interval is constructed so that, if you were to repeat your study many times, 95% of the intervals you compute would contain the true mean.

The Most Common Misconception

A 95% CI does not mean “there is a 95% probability that μ falls in this interval.”

After you compute a specific interval, μ either is or is not in it — there is no probability involved. The 95% refers to the long-run procedure, not the specific interval.

5.2 Formula

For a sample of size n drawn from N(μ, σ²) with σ known, a 95% confidence interval for μ is:

\[\bar{X} \pm z_{0.025} \cdot \frac{\sigma}{\sqrt{n}} = \bar{X} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}\]

When σ is unknown (the usual case), we replace it with the sample standard deviation s and use the t-distribution:

\[\bar{X} \pm t_{0.025,\, n-1} \cdot \frac{s}{\sqrt{n}}\]

set.seed(4271)
n      <- 50
mu     <- 100   # true mean (unknown in practice)
sigma  <- 15    # population SD

samp   <- rnorm(n, mean = mu, sd = sigma)

x_bar  <- mean(samp)
se     <- sd(samp) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)

ci_lo  <- x_bar - t_crit * se
ci_hi  <- x_bar + t_crit * se

cat("Sample mean:        ", round(x_bar, 2), "\n")
## Sample mean:         97.32
cat("Standard error:     ", round(se, 3),   "\n")
## Standard error:      2.304
cat("95% CI:            [", round(ci_lo, 2), ",", round(ci_hi, 2), "]\n")
## 95% CI:            [ 92.69 , 101.95 ]
cat("True mean (μ =", mu, ") is", ifelse(mu >= ci_lo & mu <= ci_hi, "INSIDE", "OUTSIDE"), "the CI.\n")
## True mean (μ = 100 ) is INSIDE the CI.
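
The manual t-interval can be cross-checked against base R's `t.test()`, which reports the same interval. The sketch below regenerates a sample with the same settings so that it runs on its own; the `_demo` names are introduced here, and the z-based interval is included only for comparison.

```r
set.seed(4271)
samp_demo <- rnorm(50, mean = 100, sd = 15)

n_s  <- length(samp_demo)
se_s <- sd(samp_demo) / sqrt(n_s)

# Manual t-based interval (sigma estimated by the sample SD)
manual_ci <- mean(samp_demo) + c(-1, 1) * qt(0.975, df = n_s - 1) * se_s

# Same interval from the built-in one-sample t-test
builtin_ci <- as.numeric(t.test(samp_demo, conf.level = 0.95)$conf.int)
all.equal(manual_ci, builtin_ci)

# z-based interval for comparison: slightly narrower, because 1.96 is
# smaller than the t critical value at 49 degrees of freedom
z_ci <- mean(samp_demo) + c(-1, 1) * qnorm(0.975) * se_s
c(t_width = diff(manual_ci), z_width = diff(z_ci))
```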

6 The Simulation: Visualising CI Coverage

6.1 100 Confidence Intervals

The most illuminating way to understand CIs is to simulate the process many times (100 in the plot below). Each time, we:

  1. Draw a random sample from a known population.
  2. Compute a 95% CI from that sample.
  3. Check whether the CI captures the true mean.

Over many repetitions, about 95% should capture the true mean and about 5% should miss.

set.seed(4271)
true_mu   <- 0
sigma_pop <- 1
n_obs     <- 30
n_ci      <- 100

ci_sim <- data.frame(
  sample_id = seq_len(n_ci),
  x_bar     = replicate(n_ci, mean(rnorm(n_obs, mean = true_mu, sd = sigma_pop)))
)

ci_sim <- ci_sim |>
  mutate(
    se     = sigma_pop / sqrt(n_obs),
    lo     = x_bar - 1.96 * se,
    hi     = x_bar + 1.96 * se,
    covers = lo <= true_mu & hi >= true_mu,
    colour = ifelse(covers, "Contains μ", "Misses μ")
  )

pct_covered <- scales::percent(mean(ci_sim$covers), accuracy = 0.1)

ggplot(ci_sim, aes(y = sample_id)) +
  geom_segment(aes(x = lo, xend = hi,
                   y = sample_id, yend = sample_id,
                   colour = colour),
               linewidth = 0.7) +
  geom_point(aes(x = x_bar, colour = colour), size = 1.5) +
  geom_vline(xintercept = true_mu, linetype = "dashed",
             colour = "gray20", linewidth = 0.8) +
  scale_colour_manual(values = c("Contains μ" = "#2ca02c", "Misses μ" = "#d62728")) +
  scale_y_continuous(breaks = seq(10, 100, 10)) +
  labs(
    x      = "Interval",
    y      = "Sample Number",
    colour = NULL,
    title  = "100 Simulated 95% Confidence Intervals",
    subtitle = paste0(pct_covered, " of intervals contain the true mean (μ = 0)")
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top")
100 simulated 95% confidence intervals. Green intervals contain the true mean (μ = 0); red intervals miss it. The dashed vertical line marks the true mean.
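
With only 100 intervals, the observed coverage can easily land a few points above or below 95%. A larger run gives a steadier estimate. The sketch below is a variation on the simulation above, not a copy of it: it uses 10,000 intervals and the t-based formula (σ treated as unknown), with an arbitrary seed.

```r
set.seed(101)          # arbitrary seed

n_sims     <- 10000    # number of simulated studies
n_obs_demo <- 30       # observations per study
mu_true    <- 0        # true mean, known only because we simulate

covers_demo <- replicate(n_sims, {
  s         <- rnorm(n_obs_demo, mean = mu_true, sd = 1)
  half_width <- qt(0.975, df = n_obs_demo - 1) * sd(s) / sqrt(n_obs_demo)
  abs(mean(s) - mu_true) <= half_width   # TRUE if the interval contains mu
})

coverage <- mean(covers_demo)
coverage
```

The result should sit close to 0.95, which is what "95% confidence" promises about the procedure.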


6.2 Animated Build-Up

Watch the intervals accumulate one by one. Notice how red intervals appear roughly once every 20 draws — consistent with the 5% miss rate.

Animated build-up of 100 confidence intervals. Green = contains true mean; red = misses.

7 What Affects CI Width?

The width of a confidence interval is determined by three factors:

\[\text{Width} = 2 \times z^* \times \frac{\sigma}{\sqrt{n}}\]

| Factor | Effect on CI width |
|---|---|
| Larger sample size n | Narrower: more data, less uncertainty |
| Larger σ (more variable population) | Wider: more spread in the data |
| Higher confidence level (e.g., 99% vs 95%) | Wider: greater certainty costs range |

sample_sizes_w <- c(5, 10, 20, 30, 50, 100, 200, 500)
sigma_w        <- 1

width_df <- data.frame(
  n     = sample_sizes_w,
  width = 2 * 1.96 * sigma_w / sqrt(sample_sizes_w)
)

ggplot(width_df, aes(x = n, y = width)) +
  geom_line(colour = "steelblue", linewidth = 1.2) +
  geom_point(colour = "steelblue", size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "gray70") +
  scale_x_continuous(breaks = sample_sizes_w) +
  labs(
    x     = "Sample Size (n)",
    y     = "CI Width",
    title = "95% Confidence Interval Width vs. Sample Size",
    subtitle = "σ = 1  |  Diminishing returns: halving the width requires 4× the data"
  ) +
  theme_minimal(base_size = 13)
CI width for different sample sizes (σ=1, 95% confidence). Doubling n does not halve the width — it shrinks by a factor of √2.
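
The plot varies only n. The other two factors, σ and the confidence level, can be explored with a small helper built on the width formula. `ci_width` is a hypothetical function name, and the grid of values is illustrative.

```r
# Width of a z-based confidence interval: 2 * z* * sigma / sqrt(n)
# (`ci_width` is a hypothetical helper, not from any package)
ci_width <- function(n, sigma, conf = 0.95) {
  z_star <- qnorm(1 - (1 - conf) / 2)  # e.g. 1.96 for conf = 0.95
  2 * z_star * sigma / sqrt(n)
}

# All combinations of three confidence levels and two population SDs
widths <- expand.grid(conf = c(0.90, 0.95, 0.99), sigma = c(1, 2))
widths$width <- ci_width(n = 30, sigma = widths$sigma, conf = widths$conf)
widths
```

Reading down the table, the width grows with the confidence level, and doubling σ doubles the width at every level.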


The √n Rule

To halve the width of a confidence interval you must quadruple the sample size. Collecting data is expensive — this rule helps you plan how much data is enough.
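
For planning, the margin-of-error formula (half the width, z* × σ/√n) can be solved for n. The sketch below assumes σ is known or can be guessed from earlier data; `required_n` is a hypothetical helper, and the inputs are made-up numbers.

```r
# Smallest n whose margin of error z* * sigma / sqrt(n) is at most `margin`
# (`required_n` is a hypothetical helper, not from any package)
required_n <- function(margin, sigma, conf = 0.95) {
  z_star <- qnorm(1 - (1 - conf) / 2)
  ceiling((z_star * sigma / margin)^2)   # round up to a whole observation
}

n_margin_2 <- required_n(margin = 2, sigma = 10)
n_margin_1 <- required_n(margin = 1, sigma = 10)

c(margin_2 = n_margin_2, margin_1 = n_margin_1)
# 97 385
```

Halving the margin from 2 to 1 raises the required sample from 97 to 385, roughly a fourfold increase, as the √n rule predicts.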


8 A Business Application: Customer Spending

8.1 Problem

A retail bank’s analytics team draws a random sample of 60 customers and records each customer’s monthly spend in order to estimate the mean monthly spend. The sample yields a mean of ₦47,300 with a standard deviation of ₦12,800. Construct and interpret a 95% confidence interval for the true mean monthly spend.

n_bank   <- 60
x_bank   <- 47300
s_bank   <- 12800
t_crit_b <- qt(0.975, df = n_bank - 1)
se_bank  <- s_bank / sqrt(n_bank)

ci_lo_b  <- x_bank - t_crit_b * se_bank
ci_hi_b  <- x_bank + t_crit_b * se_bank

cat("Sample mean:   ₦", format(round(x_bank), big.mark = ","), "\n")
## Sample mean:   ₦ 47,300
cat("Standard error: ₦", format(round(se_bank, 0), big.mark = ","), "\n")
## Standard error: ₦ 1,652
cat("t critical value (df =", n_bank - 1, "):", round(t_crit_b, 4), "\n")
## t critical value (df = 59 ): 2.001
cat("95% CI: [₦", format(round(ci_lo_b), big.mark = ","),
    ", ₦", format(round(ci_hi_b), big.mark = ","), "]\n")
## 95% CI: [₦ 43,993 , ₦ 50,607 ]
set.seed(4271)
sim_bank <- data.frame(
  mean_spend = rnorm(5000, mean = x_bank, sd = se_bank)
)

ggplot(sim_bank, aes(x = mean_spend)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50,
                 fill = "steelblue", colour = "white", alpha = 0.8) +
  geom_vline(xintercept = c(ci_lo_b, ci_hi_b),
             colour = "firebrick", linewidth = 1, linetype = "dashed") +
  geom_vline(xintercept = x_bank, colour = "steelblue", linewidth = 1.2) +
  annotate("rect", xmin = ci_lo_b, xmax = ci_hi_b,
           ymin = 0, ymax = Inf, fill = "orange", alpha = 0.15) +
  annotate("text", x = ci_lo_b - 200, y = 0.00025,
           label = paste0("₦", format(round(ci_lo_b), big.mark = ",")),
           hjust = 1, colour = "firebrick", size = 3.5) +
  annotate("text", x = ci_hi_b + 200, y = 0.00025,
           label = paste0("₦", format(round(ci_hi_b), big.mark = ",")),
           hjust = 0, colour = "firebrick", size = 3.5) +
  scale_x_continuous(labels = scales::label_comma(prefix = "₦")) +
  labs(
    x     = "Mean Monthly Spend",
    y     = "Density",
    title = "95% Confidence Interval for Mean Customer Monthly Spend",
    subtitle = "Shaded band = 95% CI  |  Blue line = sample mean"
  ) +
  theme_minimal(base_size = 13)
Simulated distribution of sample means from the retail bank data. The shaded band and dashed lines mark the 95% confidence interval.


Interpretation

We are 95% confident that the true mean monthly spend of customers lies between ₦43,993 and ₦50,607.

This does not mean 95% of individual customers spend in this range — it is a statement about where the population mean is likely to be.
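
To see how different the two statements are, compare the CI for the mean with a range meant to cover about 95% of individual customers. The second calculation is only a rough sketch: it assumes individual spend is approximately normal, which the problem does not state and which spending data often violate. The `_d` variable names are introduced here.

```r
x_bar_d <- 47300
s_d     <- 12800
n_d     <- 60

# 95% CI for the population MEAN: uses the standard error s / sqrt(n)
ci_mean <- x_bar_d + c(-1, 1) * qt(0.975, df = n_d - 1) * s_d / sqrt(n_d)

# Rough range for about 95% of INDIVIDUAL customers: uses s itself.
# ASSUMPTION: individual spend is roughly normal (not given in the problem).
indiv_range <- x_bar_d + c(-1, 1) * qnorm(0.975) * s_d

round(rbind(ci_for_mean = ci_mean, individual_range = indiv_range))
```

The individual range is several times wider than the confidence interval, because the CI shrinks with √n while the spread of individual customers does not.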


9 Summary: Key Concepts

Key concepts and formulas from this tutorial.
| Concept | Formula | Interpretation |
|---|---|---|
| Normal Distribution | X ~ N(μ, σ²) | Symmetric bell curve; defined by mean and SD |
| Standardisation (Z-score) | Z = (X − μ) / σ | Number of SDs above/below the mean |
| Standard Error | SE = σ / √n | SD of the sampling distribution of the mean |
| 95% CI (σ known) | x̄ ± 1.96 × SE | Range that covers μ 95% of the time (repeated sampling) |
| 95% CI (σ unknown) | x̄ ± t₀.₀₂₅ × (s / √n) | Use t-distribution when σ is estimated from data |
| CLT rule of thumb | n ≥ 30 usually sufficient | Sampling distribution of mean approaches normal |

10 Common Misconceptions

What a 95% CI Does NOT Mean

❌ “There is a 95% chance that μ is in the interval [a, b].”
❌ “95% of individual data values fall in this interval.”
❌ “If I compute one interval, μ has a 95% chance of being inside it.”

Correct interpretation: If we collected many samples and computed a CI from each, 95% of those intervals would contain the true parameter. Any single interval either does or does not contain μ — we just don’t know which.


11 Practice Problems

  1. (Standardisation) Customer waiting times at a bank branch are normally distributed with a mean of 8 minutes and a standard deviation of 2.5 minutes. What proportion of customers wait more than 12 minutes? What waiting time separates the slowest 10% of service from the rest?

  2. (CLT) A delivery company records parcel weights that are right-skewed with μ = 4.2 kg and σ = 1.8 kg. A driver loads 40 parcels. What is the approximate probability that the mean weight of the load exceeds 4.6 kg?

  3. (Confidence Interval) A sample of 45 Nigerian SMEs reports a mean revenue growth of 12.3% with a standard deviation of 4.7%. Construct a 95% CI for the true mean growth rate. Construct a 99% CI. How and why do they differ?

  4. (Sample size) A bank wants to estimate the mean loan default amount with a margin of error of ₦5,000 at 95% confidence. Historical data suggests σ ≈ ₦40,000. What sample size is required?

    Hint: Margin of error = z × σ/√n. Solve for n.

  5. (Interpretation) An analyst states: “The 95% CI for mean daily production is [820, 890] units, so we can be 95% sure that tomorrow’s production will be between 820 and 890 units.” Identify the error in this statement and provide the correct interpretation.


Tutorial developed for Data Analytics II, Lagos Business School.
Simulations use set.seed() throughout for full reproducibility.