11/9/2025

Why Measure Spread?

  • The mean describes the center of a dataset
  • But it doesn’t tell us the complete story
  • To understand the dataset, it’s critical to understand variability in data
    • How spread out are the values?
    • How much do individual observations differ from the mean?
  • The answers to these questions indicate:
    • The reliability of the mean
    • The predictability of future measurements
    • Uncertainty in the data that must be accounted for

The Normal Distribution

  • Symmetric, bell-shaped curve where mean = median = mode
  • Defined by mean (μ) and standard deviation (σ)
  • Foundation for inferential statistics and hypothesis testing
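
For reference, the two parameters enter the curve through the standard normal density formula shown below. It is not stated on the original slide; it is the textbook form, and it is what R's `dnorm()` evaluates:

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\]

Here \(\mu\) shifts the center of the curve and \(\sigma\) controls its width.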

Variance: Mathematical Definition

Population Variance:

\[\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}\]

Sample Variance:

\[s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\]

  • \(x_i\) = individual observations
  • \(\mu\) (or \(\bar{x}\)) = population mean (or sample mean)
  • \(N\) (or \(n\)) = population size (or sample size)

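Definition: Average of squared deviations from the mean (the sample version divides by \(n-1\) rather than \(n\) to correct for estimating the mean from the same data)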
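
As a small illustration of the two formulas, the sketch below computes both variances by hand and compares the sample version with R's built-in `var()`, which uses the \(n-1\) denominator. The values in `x` are hypothetical, chosen only so the arithmetic is easy to follow.

```r
# Hypothetical data, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
deviations <- x - mean(x)      # each observation minus the mean
ss <- sum(deviations^2)        # sum of squared deviations

pop_var  <- ss / n             # population formula (divide by N)
samp_var <- ss / (n - 1)       # sample formula (divide by n - 1)

pop_var                        # 4
samp_var                       # about 4.571
all.equal(samp_var, var(x))    # TRUE: var() uses the n - 1 denominator
```

The two results differ only in the denominator, and the gap shrinks as the sample size grows.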

Standard Deviation: Mathematical Definition

Standard Deviation is the square root of variance:

\[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}\]

\[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\]

Why take the square root?

  • Variance is in squared units, whereas standard deviation returns to original units
  • SD is easier to understand and interpret
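
To make the units point concrete, here is a minimal sketch using made-up heights in centimeters (the vector `heights_cm` is an assumption for illustration, not data from these slides). The variance comes out in cm², while the standard deviation is back in cm.

```r
# Hypothetical heights in centimeters, for illustration only
heights_cm <- c(160, 165, 170, 175, 180)

v <- var(heights_cm)     # sample variance, in cm^2
s <- sd(heights_cm)      # sample standard deviation, in cm

v                        # 62.5
s                        # about 7.91
all.equal(s, sqrt(v))    # TRUE: sd() is the square root of var()
```

A typical deviation of about 7.9 cm from the mean can be read directly against the heights themselves, which is not true of 62.5 cm².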

Visualizing Different Spreads Code

## Load ggplot2 for plotting
library(ggplot2)

## Create the x axis
x <- seq(-10, 10, length.out = 1000)


##create a data frame that illustrates SD change in normal curve
df <- data.frame(x = rep(x, 3),
  y = c(dnorm(x, mean = 0, sd = 1),
        dnorm(x, mean = 0, sd = 2),
        dnorm(x, mean = 0, sd = 3)),
  Distribution = rep(c("SD = 1 (Low Variance)", 
                       "SD = 2 (Medium Variance)", 
                       "SD = 3 (High Variance)"), 
                     each = length(x)))


##plot each curve with ggplot
ggplot(df, aes(x = x, y = y, color = Distribution)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "Normal Curves with Different Standard Deviations",
       x = "Value",
       y = "Probability Density") +
  scale_color_manual(values = c("mediumaquamarine", "thistle3", "palevioletred")) +
  theme_minimal(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "bottom")

Visualizing Different Spreads Plot

Larger variance = wider, flatter curve
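
One way to check the "flatter" part numerically is to look at the peak of each curve, which is the density at the mean. This is a short sketch using the same three standard deviations as the plot; the variable names `sds` and `peaks` are introduced here only for the example.

```r
# The peak of a normal curve is its density at the mean
sds <- c(1, 2, 3)
peaks <- dnorm(0, mean = 0, sd = sds)

round(peaks, 3)    # 0.399 0.199 0.133
```

Because the area under each curve must equal 1, a curve that spreads wider has to be lower at its center.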

The Empirical Rule (68-95-99.7)

  • For normally distributed data, about 68% of values fall within 1 standard deviation of the mean
  • About 95% fall within 2 standard deviations of the mean
  • About 99.7% fall within 3 standard deviations of the mean
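
These percentages can be recovered from the normal cumulative distribution function. The sketch below uses `pnorm()` on the standard normal (mean 0, SD 1), which is enough because the rule is stated in units of standard deviations; `k` and `coverage` are names chosen for this example.

```r
# Probability of falling within k standard deviations of the mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)

round(100 * coverage, 1)    # 68.3 95.4 99.7
```

The rule's 68-95-99.7 figures are rounded versions of these values and apply only when the data are approximately normal.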

Spread is critical for understanding statistical inference.

  • Variance and Standard Deviation both measure spread
  • Standard Deviation is more interpretable
  • Normal curve provides framework for understanding spread
  • Empirical Rule: about 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean

| Concept | Formula | Units | Interpretation |
|---|---|---|---|
| Variance | \(s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\) | Squared | Average squared deviation |
| Std Dev | \(s = \sqrt{s^2}\) | Original | Typical deviation from mean |