Lecture 3: Measures of Dispersion

POLS3316, Instructor: Tom Hanna, Spring 2025, University of Houston

2026-01-31

Last class

Measures of Central Tendency

  - Mean
  - Median
  - Mode

These tell us about the center of the data
They help us start to visualize the data based on the typical or average value

Today

Measures of Dispersion (Variation or Spread)

  - Variance
  - Standard Deviation
  - Range
  - Quartiles

We start with the mean

Start with the mean
Our variable X: 1,7,21,13,19,5,9,17,11
Mean is 11.44
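
As a quick check on the mean above, here is a minimal Python sketch. Python is chosen for this illustration only, and the variable names `x` and `x_bar` are assumptions.

```python
from statistics import mean

# The nine observations of X from the slide
x = [1, 7, 21, 13, 19, 5, 9, 17, 11]

# Mean = sum of the values (103) divided by the number of values (9)
x_bar = mean(x)

print(round(x_bar, 2))  # 11.44
```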

Visualizing distances from the mean

Plot the data points

Same Mean, Different Spread

Now, consider these two sets of data points with the same mean but different spreads:

Drawing on the board time!

Creating a measure of dispersion: distance to mean

So, we could define a measure of dispersion or variation that is the total length of the colored lines.
Our formula in English would be “the sum of the differences between each observation and the mean”

Drawing on the board time!

Problem with sum of distances

The problem is that, because of how the mean is defined, the positive differences exactly cancel out the negative ones, so this measure of dispersion or variation would always be zero!
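
The cancellation can be checked on our variable X. This is a rough Python sketch, with the names `differences` and `total_difference` made up for the illustration. Because the mean of X is not a whole number, the computer's total may be off from zero by a tiny rounding error, so the check allows for that.

```python
import math

# The nine observations of X from earlier in the lecture
x = [1, 7, 21, 13, 19, 5, 9, 17, 11]
x_bar = sum(x) / len(x)

# Signed difference between each observation and the mean
differences = [xi - x_bar for xi in x]

# Positive and negative differences cancel, so the total is zero
# (up to tiny floating-point rounding error)
total_difference = sum(differences)

print(math.isclose(total_difference, 0, abs_tol=1e-9))
```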

Simple Data Example

Suppose we had a very simple data set with only two observations - 5 and 15. The mean is 10. One is 5 above the mean and one is 5 below the mean.

Distance from Mean Total

So, we want our new measure total_variation to equal the sum of the distances.

Math to the rescue!

Math comes to the rescue!

What is something we can do that turns a negative number into a positive number every time and leaves a positive number as a positive?
It’s also important that any effect it has on the actual size of the numbers is consistent between positive and negative numbers.

Math to the rescue! Code

We can square the distances

Results

Squaring 5 turned it into 25
Squaring -5, which is the same size but negative, also turned it into 25.
So, now we can add them to get a measure of total_squared_variation.
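
The two-observation example can be sketched in Python as follows. The names `total_variation` and `total_squared_variation` come from the slides. The other names are assumptions for this illustration.

```python
# The two observations from the simple example
data = [5, 15]
data_mean = sum(data) / len(data)  # 10.0

# Signed distances from the mean: one negative, one positive
distances = [value - data_mean for value in data]

# Adding the raw distances lets them cancel out
total_variation = sum(distances)

# Squaring makes every distance positive before adding
squared_distances = [d ** 2 for d in distances]
total_squared_variation = sum(squared_distances)

print(total_variation)          # 0.0
print(total_squared_variation)  # 50.0
```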

Are we done?

Suppose we had 1000 observations
Mean still 10
Each still 5 points away on average
What would our total variation be?

Given that the actual average distance is exactly the same for both groups, does that make sense? Is it useful?
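
To see the problem with a total, here is a small Python sketch. It assumes a hypothetical 1000-observation data set with half the values at 5 and half at 15, so the mean is still 10 and every observation is still 5 points away. The function name is borrowed from the earlier measure.

```python
def total_squared_variation(data):
    """Sum of squared differences between each observation and the mean."""
    data_mean = sum(data) / len(data)
    return sum((value - data_mean) ** 2 for value in data)


# The original two-observation example
small = [5, 15]

# Hypothetical larger data set: 500 observations at 5 and 500 at 15
large = [5] * 500 + [15] * 500

print(total_squared_variation(small))  # 50.0
print(total_squared_variation(large))  # 25000.0
```

The total grows with the number of observations even though the typical distance from the mean has not changed.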

Solution: Average Squared Difference - Variance

We want the average of the distances, or more precisely
the average of the squared differences.
So our measure, the variance, in its simplest form is:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Variance formula

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

  - s^2 = variance
  - n = number of observations
  - x_i = each observation
  - x̄ = mean of the observations
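
The formula can be followed step by step on our variable X. This is a minimal Python sketch, with variable names chosen for the illustration. It compares the hand calculation with `variance` from Python's standard `statistics` module, which also divides by n - 1.

```python
import math
from statistics import variance

# The nine observations of X from earlier in the lecture
x = [1, 7, 21, 13, 19, 5, 9, 17, 11]
n = len(x)
x_bar = sum(x) / n

# Sum of the squared differences between each observation and the mean
sum_of_squares = sum((xi - x_bar) ** 2 for xi in x)

# Divide by n - 1 to get the sample variance
s_squared = sum_of_squares / (n - 1)

print(round(s_squared, 2))                   # 44.78
print(math.isclose(s_squared, variance(x)))  # True
```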

Problem: Squares inflate the results

Squares inflate the numbers relative to the size of the mean.
In our simple example, each squared distance is 25, which is 2.5 times the mean of 10.
But the distances aren’t really that big
Average distance is still 5
We want to get back to the original unit of measure instead of the squared unit of measure…

Solution

How can we solve this?

To partially account for this we can take the square root of the variance
That gives us our next measure: standard deviation

Standard deviation

Standard deviation is the square root of the variance

\[ s = \sqrt{s^2} \]

Standard deviation: full formula

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
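
Applied to our variable X, the full formula might look like this in Python. It is a sketch with illustrative variable names, and `stdev` from the standard `statistics` module serves as a cross-check.

```python
import math
from statistics import stdev

# The nine observations of X from earlier in the lecture
x = [1, 7, 21, 13, 19, 5, 9, 17, 11]
n = len(x)
x_bar = sum(x) / n

# Sample variance: sum of squared differences divided by n - 1
s_squared = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Standard deviation: square root of the variance,
# which puts the result back in the original units of X
s = math.sqrt(s_squared)

print(round(s, 2))                 # 6.69
print(math.isclose(s, stdev(x)))   # True
```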

Sample vs Population

When we have data for the entire population, we can compute the true variance and standard deviation directly, so we divide by n.
When we only have a sample, dividing by n gives a result that is systematically too small, underestimating the population spread, because the distances are measured from the sample mean rather than the true population mean.
Dividing by n−1 (Bessel’s correction) adjusts for this and makes the sample variance an unbiased estimator of the population variance.

Population Variance

\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \]

Population Standard Deviation

\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2} \]

\[ \sigma = \sqrt{\sigma^2} \]
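
The population formulas above differ from the sample versions only in the divisor. As a rough illustration, the Python sketch below treats the same nine values of X first as a whole population and then as a sample. That is done only for comparison, not because X is known to be either. It uses the `pvariance`, `pstdev`, `variance`, and `stdev` functions from Python's standard `statistics` module.

```python
from statistics import pstdev, pvariance, stdev, variance

# The nine observations of X from earlier in the lecture
x = [1, 7, 21, 13, 19, 5, 9, 17, 11]
n = len(x)

# Population versions divide by n
population_variance = pvariance(x)
population_sd = pstdev(x)

# Sample versions divide by n - 1, so they come out slightly larger
sample_variance = variance(x)
sample_sd = stdev(x)

print(round(population_variance, 2), round(sample_variance, 2))
print(round(population_sd, 2), round(sample_sd, 2))
```

The gap between the two versions shrinks as n grows, since n and n - 1 become nearly the same.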

Quartiles

Quartiles divide the data into four equal parts
The first quartile (Q1) is the 25th percentile
The second quartile (Q2) is the 50th percentile (the median)
The third quartile (Q3) is the 75th percentile
The interquartile range (IQR) is the difference between Q3 and Q1 (IQR = Q3 - Q1)
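
A possible Python sketch for the quartiles of our variable X uses `quantiles` from the standard `statistics` module. Software packages use several slightly different conventions for locating Q1 and Q3. This sketch relies on that function's default method, which is an assumption and may not match the convention used in class or in other software.

```python
from statistics import median, quantiles

# The nine observations of X from earlier in the lecture
x = [1, 7, 21, 13, 19, 5, 9, 17, 11]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = quantiles(x, n=4)

# Interquartile range: the spread of the middle half of the data
iqr = q3 - q1

print(q1, q2, q3)
print(iqr)
print(q2 == median(x))  # True: the second quartile is the median
```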

Range

The range is the difference between the maximum and minimum values in the data set
Range = Max - Min
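
For our variable X, the range could be computed like this in Python. The name `data_range` is an assumption, picked to avoid clashing with Python's built-in `range`.

```python
# The nine observations of X from earlier in the lecture
x = [1, 7, 21, 13, 19, 5, 9, 17, 11]

# Range = Max - Min
data_range = max(x) - min(x)

print(data_range)  # 20
```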

Authorship and License

Author: Tom Hanna
Website: tomhanna.me
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.