Walk into any large lecture hall and measure every student’s height. Plot the results and you will see a bell: most students cluster around the average, with fewer and fewer toward the extremes. Roughly the same shape appears when you measure exam scores, blood pressure readings, manufacturing tolerances, or (more loosely, since extreme days occur more often than a bell curve predicts) daily stock returns.
This bell shape is the Normal Distribution, and it is arguably the most important probability distribution in all of statistics. Understanding it — and knowing how to use it to make confident claims from data — is a foundational skill for any data analyst.
By the end of this tutorial you will be able to:
The normal distribution is a continuous probability distribution defined by two parameters: the mean μ, which sets the centre, and the standard deviation σ, which sets the spread.
Its probability density function is:
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
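As a quick sanity check on the formula, the sketch below codes the density by hand and compares it with R’s built-in `dnorm()`. The helper name `normal_pdf` and the parameter values are made up for illustration.

```r
# Hand-coded normal density, following the formula above.
# normal_pdf is an illustrative helper, not part of base R.
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x_check <- c(-2, 0, 1.5, 4)

# Compare with the built-in density for N(1, 2^2) (illustrative parameters)
pdf_manual  <- normal_pdf(x_check, mu = 1, sigma = 2)
pdf_builtin <- dnorm(x_check, mean = 1, sd = 2)
all.equal(pdf_manual, pdf_builtin)

# A density must integrate to 1 over the whole real line
pdf_area <- integrate(normal_pdf, lower = -Inf, upper = Inf, mu = 1, sigma = 2)$value
pdf_area
```

In practice you would always call `dnorm()` directly; the hand-coded version is only there to show that the formula and the function agree.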
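library(ggplot2)  # plotting (used by every figure in this tutorial)
library(dplyr)    # bind_rows(), filter(), mutate() in later chunks
x <- seq(-8, 12, length.out = 500)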
params <- data.frame(
mu = c(0, 0, 0, 4),
sigma = c(1, 2, 0.5, 1.5),
label = c("μ=0, σ=1", "μ=0, σ=2", "μ=0, σ=0.5", "μ=4, σ=1.5")
)
curves <- lapply(seq_len(nrow(params)), function(i) {
data.frame(
x = x,
y = dnorm(x, params$mu[i], params$sigma[i]),
label = params$label[i]
)
}) |> do.call(what = rbind)
ggplot(curves, aes(x = x, y = y, colour = label)) +
geom_line(linewidth = 1) +
labs(x = "x", y = "Density", colour = NULL,
title = "Normal Distributions — Effect of μ and σ") +
theme_minimal(base_size = 13) +
theme(legend.position = "bottom")

Normal distributions with different means and standard deviations. The shape is always symmetric and bell-shaped; μ shifts the centre, σ controls the width.
The standard normal has μ = 0 and σ = 1. Any normal variable X ~ N(μ, σ²) can be converted to a standard normal Z using:
\[Z = \frac{X - \mu}{\sigma}\]
This standardisation allows us to use a single table (or function) to answer probability questions for any normal distribution.
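A small sketch of standardisation in R, using made-up values (μ = 50, σ = 8, x = 62): the probability computed on the original scale matches the one computed from the Z-score on the standard normal scale.

```r
# Illustrative values, not taken from any dataset in this tutorial
mu_demo    <- 50
sigma_demo <- 8
x_demo     <- 62

# Standardise: how many SDs is x above the mean?
z_demo <- (x_demo - mu_demo) / sigma_demo   # 1.5

# P(X <= 62) computed two ways
p_direct   <- pnorm(x_demo, mean = mu_demo, sd = sigma_demo)  # original scale
p_standard <- pnorm(z_demo)                                   # standard normal scale

all.equal(p_direct, p_standard)
```

Because the two calls agree, one function (`pnorm()` with its defaults) is enough for any normal distribution once the value has been converted to a Z-score.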
x_std <- seq(-4, 4, length.out = 500)
df_std <- data.frame(x = x_std, y = dnorm(x_std))
sigmas <- c(-3, -2, -1, 1, 2, 3)
ggplot(df_std, aes(x, y)) +
geom_line(linewidth = 1.2, colour = "steelblue") +
geom_vline(xintercept = sigmas, linetype = "dashed", colour = "gray50", linewidth = 0.5) +
annotate("text", x = sigmas, y = -0.012,
label = paste0(sigmas, "σ"), size = 3.5, colour = "gray40") +
annotate("text", x = 0, y = -0.012, label = "μ", size = 3.5, colour = "gray40") +
geom_vline(xintercept = 0, colour = "gray40", linewidth = 0.5) +
scale_y_continuous(limits = c(-0.02, 0.42)) +
labs(x = "Standard Deviations from Mean (Z)", y = "Density",
title = "Standard Normal Distribution N(0, 1)") +
theme_minimal(base_size = 13)

The standard normal distribution N(0,1). Vertical dashed lines mark ±1σ, ±2σ, and ±3σ from the mean.
One of the most useful properties of the normal distribution is that fixed proportions of the data always fall within a fixed number of standard deviations of the mean, regardless of the specific μ and σ.
x_r <- seq(-4, 4, length.out = 500)
df_r <- data.frame(x = x_r, y = dnorm(x_r))
shade <- function(lo, hi, fill, alpha = 0.4) {
xs <- seq(lo, hi, length.out = 200)
data.frame(x = xs, y = dnorm(xs), fill = fill, alpha = alpha)
}
regions <- bind_rows(
shade(-3, 3, "99.7%", 0.25),
shade(-2, 2, "95%", 0.35),
shade(-1, 1, "68%", 0.55)
)
ggplot(df_r, aes(x, y)) +
geom_ribbon(data = regions |> filter(fill == "99.7%"),
aes(ymin = 0, ymax = y), fill = "#aec6cf", alpha = 0.4) +
geom_ribbon(data = regions |> filter(fill == "95%"),
aes(ymin = 0, ymax = y), fill = "#6baed6", alpha = 0.4) +
geom_ribbon(data = regions |> filter(fill == "68%"),
aes(ymin = 0, ymax = y), fill = "#2171b5", alpha = 0.4) +
geom_line(linewidth = 1.2, colour = "steelblue") +
annotate("text", x = 0, y = 0.20, label = "68%", size = 5, fontface = "bold", colour = "white") +
annotate("text", x = 1.55, y = 0.07, label = "95%", size = 4, colour = "#2171b5") +
annotate("text", x = 2.6, y = 0.02, label = "99.7%", size = 3.5, colour = "#2171b5") +
annotate("segment", x = -1, xend = 1, y = 0.24, yend = 0.24,
arrow = arrow(ends = "both", length = unit(0.15, "cm"))) +
annotate("text", x = 0, y = 0.255, label = "±1σ", size = 3.5) +
annotate("segment", x = -2, xend = 2, y = 0.31, yend = 0.31,
arrow = arrow(ends = "both", length = unit(0.15, "cm"))) +
annotate("text", x = 0, y = 0.325, label = "±2σ", size = 3.5) +
annotate("segment", x = -3, xend = 3, y = 0.38, yend = 0.38,
arrow = arrow(ends = "both", length = unit(0.15, "cm"))) +
annotate("text", x = 0, y = 0.395, label = "±3σ", size = 3.5) +
labs(x = "Standard Deviations from Mean", y = "Density",
title = "The 68–95–99.7 Empirical Rule") +
theme_minimal(base_size = 13)

The empirical rule: shaded regions show the proportion of values within 1, 2, and 3 standard deviations of the mean.
The Empirical Rule — Summary
| Range | Proportion of data |
|---|---|
| Within ±1σ of the mean | 68.3% |
| Within ±2σ of the mean | 95.4% |
| Within ±3σ of the mean | 99.7% |

This rule applies to any normally distributed variable.
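The three percentages can be recovered directly from `pnorm()`. The sketch below also repeats the calculation for an arbitrary normal distribution (μ = 100, σ = 15, chosen only for illustration) to show that the proportions do not depend on the parameters.

```r
k <- 1:3  # number of standard deviations

# Area between -k and +k under the standard normal curve
within_k <- pnorm(k) - pnorm(-k)
round(100 * within_k, 1)   # 68.3 95.4 99.7

# Same calculation for an arbitrary normal distribution (illustrative values)
mu_any    <- 100
sigma_any <- 15
within_any <- pnorm(mu_any + k * sigma_any, mean = mu_any, sd = sigma_any) -
  pnorm(mu_any - k * sigma_any, mean = mu_any, sd = sigma_any)

all.equal(within_k, within_any)
```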
MBA students’ exam scores are approximately normally distributed with a mean of 65 and a standard deviation of 10.
mu_exam <- 65
sd_exam <- 10
x_exam <- seq(25, 105, length.out = 500)
df_exam <- data.frame(x = x_exam, y = dnorm(x_exam, mu_exam, sd_exam))
ggplot(df_exam, aes(x, y)) +
geom_ribbon(data = df_exam |> filter(x >= mu_exam - 3*sd_exam, x <= mu_exam + 3*sd_exam),
aes(ymin = 0, ymax = y), fill = "#aec6cf", alpha = 0.4) +
geom_ribbon(data = df_exam |> filter(x >= mu_exam - 2*sd_exam, x <= mu_exam + 2*sd_exam),
aes(ymin = 0, ymax = y), fill = "#6baed6", alpha = 0.4) +
geom_ribbon(data = df_exam |> filter(x >= mu_exam - sd_exam, x <= mu_exam + sd_exam),
aes(ymin = 0, ymax = y), fill = "#2171b5", alpha = 0.4) +
geom_line(linewidth = 1.2, colour = "steelblue") +
geom_vline(xintercept = mu_exam, colour = "firebrick", linewidth = 1) +
scale_x_continuous(breaks = c(35, 45, 55, 65, 75, 85, 95)) +
annotate("text", x = 65, y = 0.043, label = "μ = 65", colour = "firebrick", size = 4) +
labs(x = "Exam Score", y = "Density",
title = "MBA Exam Score Distribution N(65, 10²)") +
theme_minimal(base_size = 13)

MBA exam score distribution. The shaded regions show the 68%, 95%, and 99.7% intervals.
Key questions you can now answer:
## Z-score for a score of 80: 1.5
## Proportion of students scoring above 80: 6.7%
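One way to reproduce the two figures above is sketched here with `pnorm()`; the variable names `z_80` and `p_above_80` are illustrative.

```r
mu_exam <- 65
sd_exam <- 10

# Z-score: how many standard deviations is a score of 80 above the mean?
z_80 <- (80 - mu_exam) / sd_exam

# Upper-tail probability P(X > 80)
p_above_80 <- pnorm(80, mean = mu_exam, sd = sd_exam, lower.tail = FALSE)

cat("Z-score for a score of 80:", z_80, "\n")
cat("Proportion of students scoring above 80:",
    paste0(round(100 * p_above_80, 1), "%"), "\n")
```

Setting `lower.tail = FALSE` returns the area to the right of 80 directly, which avoids computing `1 - pnorm(...)` by hand.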
When we collect data, we usually work with a sample rather than the full population. We compute a sample mean \(\bar{X}\) and use it to estimate the true population mean μ.
A crucial question: How much does \(\bar{X}\) vary across different samples?
The answer is the sampling distribution of the mean — the distribution of \(\bar{X}\) values you would get if you took many independent samples of size n from the population.
The Central Limit Theorem (CLT)
For large enough samples, the sampling distribution of the mean is approximately normal, regardless of the shape of the original population (provided the population has a finite variance), with:
\[\bar{X} \sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right)\]
The standard deviation of this distribution, \(\sigma/\sqrt{n}\), is called the Standard Error (SE).
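To make the standard error concrete, the following sketch compares the theoretical value σ/√n with the standard deviation of many simulated sample means. The population parameters (mean 50, σ = 10), the sample size (n = 25), and the seed are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

sigma_se <- 10     # population SD (illustrative)
n_se     <- 25     # sample size (illustrative)

# Theoretical standard error of the mean
se_theory <- sigma_se / sqrt(n_se)   # 10 / 5 = 2

# Empirical check: SD of 10,000 simulated sample means
means_se <- replicate(10000, mean(rnorm(n_se, mean = 50, sd = sigma_se)))
se_sim   <- sd(means_se)

c(theory = se_theory, simulated = se_sim)
```

The simulated value should land close to 2, though not exactly on it, because it is itself an estimate based on a finite number of replications.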
Let’s draw repeated samples from a right-skewed population (a mixture distribution — nothing like a bell curve) and watch the distribution of sample means become normal.
set.seed(8831)
pop_size <- 100000
population <- c(rexp(pop_size * 0.6, rate = 0.5),
rnorm(pop_size * 0.4, mean = 8, sd = 1))
pop_mean <- mean(population)
pop_df <- data.frame(x = population[sample(pop_size, 5000)])
ggplot(pop_df, aes(x)) +
geom_histogram(aes(y = after_stat(density)), bins = 50,
fill = "coral", colour = "white", alpha = 0.8) +
geom_vline(xintercept = pop_mean, colour = "firebrick", linewidth = 1) +
annotate("text", x = pop_mean + 0.3, y = 0.18,
label = paste0("μ = ", round(pop_mean, 1)),
colour = "firebrick", hjust = 0, size = 4) +
labs(title = "Population Distribution (Right-Skewed)", x = "Value", y = "Density") +
theme_minimal(base_size = 13)

The skewed population distribution. The red line marks the population mean.
sample_sizes <- c(1, 5, 30, 100)
n_reps <- 3000
clt_data <- lapply(sample_sizes, function(n) {
data.frame(
sample_mean = replicate(n_reps, mean(sample(population, size = n))),
n_label = factor(paste0("n = ", n),
levels = paste0("n = ", sample_sizes))
)
}) |> do.call(what = rbind)
ggplot(clt_data, aes(x = sample_mean)) +
geom_histogram(aes(y = after_stat(density)), bins = 40,
fill = "steelblue", colour = "white", alpha = 0.8) +
geom_vline(xintercept = pop_mean, colour = "firebrick", linewidth = 1) +
facet_wrap(~n_label, nrow = 1, scales = "free") +
labs(title = "Sampling Distribution of the Mean — CLT in Action",
x = "Sample Mean", y = "Density") +
theme_minimal(base_size = 13)

Sampling distributions of the mean for increasing sample sizes. By n = 30 the distribution is already well-approximated by a bell curve.
Even though the population is right-skewed and nothing like a bell curve, by n = 30 the sampling distribution of the mean is already well approximated by a normal curve, and by n = 100 the match is closer still.
Suppose you take a sample, compute the mean, and want to say something about the true population mean μ. You cannot say “μ equals exactly this number” because your sample introduces uncertainty. Instead, you compute a confidence interval (CI): a range of plausible values for μ.
A 95% confidence interval is constructed so that, if you were to repeat your study many times, 95% of the intervals you compute would contain the true mean.
The Most Common Misconception
A 95% CI does not mean “there is a 95% probability that μ falls in this interval.”
After you compute a specific interval, μ either is or is not in it — there is no probability involved. The 95% refers to the long-run procedure, not the specific interval.
For a sample of size n drawn from N(μ, σ²), a 95% confidence interval for μ is:
\[\bar{X} \pm z_{0.025} \cdot \frac{\sigma}{\sqrt{n}} = \bar{X} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}\]
When σ is unknown (the usual case), we replace it with the sample standard deviation s and use the t-distribution:
\[\bar{X} \pm t_{0.025,\, n-1} \cdot \frac{s}{\sqrt{n}}\]
set.seed(4271)
n <- 50
mu <- 100 # true mean (unknown in practice)
sigma <- 15 # population SD
samp <- rnorm(n, mean = mu, sd = sigma)
x_bar <- mean(samp)
se <- sd(samp) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
ci_lo <- x_bar - t_crit * se
ci_hi <- x_bar + t_crit * se
cat("Sample mean: ", round(x_bar, 2), "\n")
## Sample mean: 97.32
cat("Standard error: ", round(se, 3), "\n")
## Standard error: 2.304
cat("95% CI: [", round(ci_lo, 2), ",", round(ci_hi, 2), "]\n")
## 95% CI: [ 92.69 , 101.95 ]
cat("True mean (μ =", mu, ") is", ifelse(mu >= ci_lo & mu <= ci_hi, "INSIDE", "OUTSIDE"), "the CI.\n")
## True mean (μ = 100 ) is INSIDE the CI.
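The manual t-based calculation can be cross-checked against R’s built-in `t.test()`, which reports the same interval. The sketch below wraps the formula in a small helper (`ci_mean`, a hypothetical name) and applies both approaches to a freshly simulated sample with arbitrary parameters.

```r
# t-based confidence interval for a mean.
# ci_mean is an illustrative helper, not a base R function.
ci_mean <- function(x, level = 0.95) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)                      # estimated standard error
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)  # two-sided critical value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

set.seed(123)                               # arbitrary seed
demo_x <- rnorm(40, mean = 20, sd = 5)      # illustrative sample

ci_manual <- ci_mean(demo_x)
ci_ttest  <- t.test(demo_x, conf.level = 0.95)$conf.int

# Both approaches give the same interval
all.equal(unname(ci_manual), as.numeric(ci_ttest))
```

For day-to-day work `t.test()` is usually the more convenient route; the helper simply exposes each piece of the formula.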
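The most illuminating way to understand CIs is to simulate the process hundreds of times. Each time, we draw a random sample, compute its mean, build a 95% confidence interval around that mean, and check whether the interval contains the true mean.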
Over many repetitions, about 95% should capture the true mean and about 5% should miss.
set.seed(4271)
true_mu <- 0
sigma_pop <- 1
n_obs <- 30
n_ci <- 100
ci_sim <- data.frame(
sample_id = seq_len(n_ci),
x_bar = replicate(n_ci, mean(rnorm(n_obs, mean = true_mu, sd = sigma_pop)))
)
ci_sim <- ci_sim |>
mutate(
se = sigma_pop / sqrt(n_obs),
lo = x_bar - 1.96 * se,
hi = x_bar + 1.96 * se,
covers = lo <= true_mu & hi >= true_mu,
colour = ifelse(covers, "Contains μ", "Misses μ")
)
pct_covered <- scales::percent(mean(ci_sim$covers), accuracy = 0.1)
ggplot(ci_sim, aes(y = sample_id)) +
geom_segment(aes(x = lo, xend = hi,
y = sample_id, yend = sample_id,
colour = colour),
linewidth = 0.7) +
geom_point(aes(x = x_bar, colour = colour), size = 1.5) +
geom_vline(xintercept = true_mu, linetype = "dashed",
colour = "gray20", linewidth = 0.8) +
scale_colour_manual(values = c("Contains μ" = "#2ca02c", "Misses μ" = "#d62728")) +
scale_y_continuous(breaks = seq(10, 100, 10)) +
labs(
x = "Interval",
y = "Sample Number",
colour = NULL,
title = "100 Simulated 95% Confidence Intervals",
subtitle = paste0(pct_covered, " of intervals contain the true mean (μ = 0)")
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top")

100 simulated 95% confidence intervals. Green intervals contain the true mean (μ = 0); red intervals miss it. The dashed vertical line marks the true mean.
Read the plot from sample 1 up to sample 100. Red intervals appear roughly once every 20 draws, consistent with the 5% miss rate.
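With only 100 intervals the observed coverage can easily come out at 93% or 97% by chance. A rough sketch with many more replications (10,000 here, an arbitrary choice, as is the seed) shows the long-run rate settling near 95%. It uses the same known-σ interval as the plot above.

```r
set.seed(2024)        # arbitrary seed

n_rep_cov <- 10000    # number of simulated studies (illustrative)
n_obs_cov <- 30       # observations per study
mu_cov    <- 0        # true mean
sigma_cov <- 1        # known population SD

# Half-width of the 95% interval when sigma is known
half_width <- 1.96 * sigma_cov / sqrt(n_obs_cov)

# For each simulated study, record whether the interval covers the true mean
covers_cov <- replicate(n_rep_cov, {
  x_bar_cov <- mean(rnorm(n_obs_cov, mean = mu_cov, sd = sigma_cov))
  abs(x_bar_cov - mu_cov) <= half_width
})

coverage <- mean(covers_cov)
coverage
```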
The width of a confidence interval is determined by three factors:
\[\text{Width} = 2 \times z^* \times \frac{\sigma}{\sqrt{n}}\]
| Factor | Effect on CI width |
|---|---|
| Larger sample size n | Narrower: more data, less uncertainty |
| Larger σ (more variable population) | Wider: more spread in the data |
| Higher confidence level (e.g., 99% vs 95%) | Wider: more confidence requires a wider range |
sample_sizes_w <- c(5, 10, 20, 30, 50, 100, 200, 500)
sigma_w <- 1
width_df <- data.frame(
n = sample_sizes_w,
width = 2 * 1.96 * sigma_w / sqrt(sample_sizes_w)
)
ggplot(width_df, aes(x = n, y = width)) +
geom_line(colour = "steelblue", linewidth = 1.2) +
geom_point(colour = "steelblue", size = 3) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "gray70") +
scale_x_continuous(breaks = sample_sizes_w) +
labs(
x = "Sample Size (n)",
y = "CI Width",
title = "95% Confidence Interval Width vs. Sample Size",
subtitle = "σ = 1 | Diminishing returns: halving the width requires 4× the data"
) +
theme_minimal(base_size = 13)

CI width for different sample sizes (σ=1, 95% confidence). Doubling n does not halve the width; it shrinks it by a factor of √2.
The √n Rule
To halve the width of a confidence interval you must quadruple the sample size. Collecting data is expensive — this rule helps you plan how much data is enough.
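The width formula is easy to explore numerically. The sketch below defines a small helper (`ci_width`, an illustrative name) and checks two things: quadrupling n halves the width, and moving from 95% to 99% confidence widens the interval.

```r
# Width of a z-based confidence interval: 2 * z* * sigma / sqrt(n).
# ci_width is an illustrative helper, not a base R function.
ci_width <- function(n, sigma = 1, level = 0.95) {
  z_star <- qnorm(1 - (1 - level) / 2)
  2 * z_star * sigma / sqrt(n)
}

w_100 <- ci_width(100)
w_400 <- ci_width(400)
w_100 / w_400     # 2: four times the data, half the width

# Effect of the confidence level at a fixed n
w_95 <- ci_width(100, level = 0.95)
w_99 <- ci_width(100, level = 0.99)
w_99 / w_95       # about 1.31: the 99% interval is roughly 31% wider
```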
A retail bank’s analytics team collects a random sample of 60 customer monthly transactions to estimate the mean monthly spend. The sample yields a mean of ₦47,300 with a standard deviation of ₦12,800. Construct and interpret a 95% confidence interval for the true mean monthly spend.
n_bank <- 60
x_bank <- 47300
s_bank <- 12800
t_crit_b <- qt(0.975, df = n_bank - 1)
se_bank <- s_bank / sqrt(n_bank)
ci_lo_b <- x_bank - t_crit_b * se_bank
ci_hi_b <- x_bank + t_crit_b * se_bank
cat("Sample mean: ₦", format(round(x_bank), big.mark = ","), "\n")
## Sample mean: ₦ 47,300
cat("Standard error: ₦", format(round(se_bank), big.mark = ","), "\n")
## Standard error: ₦ 1,652
cat("t critical value (df =", n_bank - 1, "):", round(t_crit_b, 3), "\n")
## t critical value (df = 59 ): 2.001
cat("95% CI: [₦", format(round(ci_lo_b), big.mark = ","),
", ₦", format(round(ci_hi_b), big.mark = ","), "]\n")
## 95% CI: [₦ 43,993 , ₦ 50,607 ]
set.seed(4271)
sim_bank <- data.frame(
mean_spend = rnorm(5000, mean = x_bank, sd = se_bank)
)
ggplot(sim_bank, aes(x = mean_spend)) +
geom_histogram(aes(y = after_stat(density)), bins = 50,
fill = "steelblue", colour = "white", alpha = 0.8) +
geom_vline(xintercept = c(ci_lo_b, ci_hi_b),
colour = "firebrick", linewidth = 1, linetype = "dashed") +
geom_vline(xintercept = x_bank, colour = "steelblue", linewidth = 1.2) +
annotate("rect", xmin = ci_lo_b, xmax = ci_hi_b,
ymin = 0, ymax = Inf, fill = "orange", alpha = 0.15) +
annotate("text", x = ci_lo_b - 200, y = 0.00025,
label = paste0("₦", format(round(ci_lo_b), big.mark = ",")),
hjust = 1, colour = "firebrick", size = 3.5) +
annotate("text", x = ci_hi_b + 200, y = 0.00025,
label = paste0("₦", format(round(ci_hi_b), big.mark = ",")),
hjust = 0, colour = "firebrick", size = 3.5) +
scale_x_continuous(labels = scales::label_comma(prefix = "₦")) +
labs(
x = "Mean Monthly Spend",
y = "Density",
title = "95% Confidence Interval for Mean Customer Monthly Spend",
subtitle = "Shaded band = 95% CI | Blue line = sample mean"
) +
theme_minimal(base_size = 13)

Simulated distribution of sample means from the retail bank data. The shaded band and dashed lines mark the 95% confidence interval.
Interpretation
We are 95% confident that the true mean monthly spend of customers lies between ₦43,993 and ₦50,607.
This does not mean 95% of individual customers spend in this range — it is a statement about where the population mean is likely to be.
| Concept | Formula | Interpretation |
|---|---|---|
| Normal Distribution | X ~ N(μ, σ²) | Symmetric bell curve; defined by mean and SD |
| Standardisation (Z-score) | Z = (X − μ) / σ | Number of SDs above/below the mean |
| Standard Error | SE = σ / √n | SD of the sampling distribution of the mean |
| 95% CI (σ known) | x̄ ± 1.96 × SE | Range that covers μ 95% of the time (repeated sampling) |
| 95% CI (σ unknown) | x̄ ± t₀.₀₂₅ × (s / √n) | Use t-distribution when σ is estimated from data |
| CLT rule of thumb | n ≥ 30 usually sufficient | Sampling distribution of mean approaches normal |
What a 95% CI Does NOT Mean
❌ “There is a 95% chance that μ is in the interval [a, b].”
❌ “95% of individual data values fall in this interval.”
❌ “If I compute one interval, μ has a 95% chance of being inside it.”
✅ Correct interpretation: If we collected many samples and computed a CI from each, 95% of those intervals would contain the true parameter. Any single interval either does or does not contain μ; we just don’t know which.
(Standardisation) Customer waiting times at a bank branch are normally distributed with a mean of 8 minutes and a standard deviation of 2.5 minutes. What proportion of customers wait more than 12 minutes? What waiting time separates the slowest 10% of service from the rest?
(CLT) A delivery company records parcel weights that are right-skewed with μ = 4.2 kg and σ = 1.8 kg. A driver loads 40 parcels. What is the probability that the mean weight of his load exceeds 4.6 kg?
(Confidence Interval) A sample of 45 Nigerian SMEs reports a mean revenue growth of 12.3% with a standard deviation of 4.7%. Construct a 95% CI for the true mean growth rate. Construct a 99% CI. How and why do they differ?
(Sample size) A bank wants to estimate the mean loan default amount with a margin of error of ₦5,000 at 95% confidence. Historical data suggests σ ≈ ₦40,000. What sample size is required?
Hint: Margin of error = z × σ/√n. Solve for n.
(Interpretation) An analyst states: “The 95% CI for mean daily production is [820, 890] units, so we can be 95% sure that tomorrow’s production will be between 820 and 890 units.” Identify the error in this statement and provide the correct interpretation.
Tutorial developed for Data Analytics II, Lagos Business School.
Simulations use set.seed() throughout for full reproducibility.