Lecture 3: Measures of Dispersion
POLS3316, Instructor: Tom Hanna, Spring 2025, University of Houston
2026-01-31
Last class
Measures of Central Tendency
- Mean
- Median
- Mode
These tell us about the center of the data
They help us start to visualize the data based on the typical or average value
We start with the mean
- Our variable X: 1,7,21,13,19,5,9,17,11
- Mean is approximately 11.44 (103 / 9)
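As a quick check of the arithmetic, here is a minimal Python sketch that computes the mean of X and each observation's signed distance from it. The variable names (`X`, `mean_x`, `distances`) are illustrative choices, not taken from the lecture's own code.

```python
# Observations from the slide above
X = [1, 7, 21, 13, 19, 5, 9, 17, 11]

# Mean = sum of the observations divided by how many there are
mean_x = sum(X) / len(X)
print(round(mean_x, 2))  # 11.44

# Signed distance of each observation from the mean:
# negative = below the mean, positive = above it
distances = [x - mean_x for x in X]
for x, d in zip(X, distances):
    print(f"{x:>2}  {d:+.2f}")
```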
Visualizing distances from the mean
Same Mean, Different Spread
- Now, consider these two sets of data points with the same mean but different spreads:
Drawing on the board time!
Creating a measure of dispersion: distance to mean
- So, we could define a measure of dispersion or variation that is the total length of the colored lines.
- Our formula in English would be “the sum of the differences between each observation and the mean”
Drawing on the board time!
Problem with sum of distances
The problem is that because of the definition of mean, the positive lines will cancel out the negative and the dispersion or variation would always be zero!
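To see why the cancellation always happens, write \(x_i\) for each observation, \(n\) for the number of observations, and \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\) for the mean, so that \(\sum_{i=1}^{n} x_i = n\bar{x}\). A short derivation then gives:

\[
\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0
\]

This holds for any data set, no matter how spread out the observations are.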
Simple Data Example
Suppose we had a very simple data set with only two observations - 5 and 15. The mean is 10. One is 5 above the mean and one is 5 below the mean.
Distance from Mean Total
So, we want our new measure total_variation to equal the sum of the distances.
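A minimal Python sketch of this first attempt, using the two observations from the simple example. The name `total_variation` comes from the slide; `observations`, `mean_value`, and `differences` are illustrative names assumed here.

```python
observations = [5, 15]
mean_value = sum(observations) / len(observations)  # 10.0

# Signed difference of each observation from the mean
differences = [x - mean_value for x in observations]
print(differences)  # [-5.0, 5.0]

# Adding the signed differences: the -5 and the +5 cancel
total_variation = sum(differences)
print(total_variation)  # 0.0
```

The result is zero even though neither observation sits at the mean.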
Math to the rescue!
Math comes to the rescue!
- What is something we can do that turns a negative number into a positive number every time and leaves a positive number as a positive?
- It’s also important that any effect it has on the actual size of the numbers is consistent between positive and negative numbers.
Math to the rescue! Code
- We can square the distances
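One way this step might look in Python, again for the two-observation example. This is a sketch rather than the lecture's original code; `squared_distances` is an assumed name, while `total_squared_variation` is the name the lecture uses for the sum.

```python
observations = [5, 15]
mean_value = sum(observations) / len(observations)  # 10.0

# Square each signed difference so that every term is non-negative
squared_distances = [(x - mean_value) ** 2 for x in observations]
print(squared_distances)  # [25.0, 25.0]

# The squared terms no longer cancel when added
total_squared_variation = sum(squared_distances)
print(total_squared_variation)  # 50.0
```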
Results
- Squaring 5 turned it into 25
- Squaring -5, which is the same size but negative, also turned it into 25.
- So, now we can add them to get a measure of total_squared_variation.
Are we done?
- Suppose we had 1000 observations
- Mean still 10
- Each still 5 points away on average
- What would our total variation be?
Given that the actual average distance is exactly the same for both groups, does that make sense? Is it useful?
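To make the question concrete, the sketch below compares the two-observation data set with a 1000-observation one. The larger data set is an assumption made for illustration: 500 observations at 5 and 500 at 15, so the mean is still 10 and every observation is still 5 away.

```python
def total_squared_variation(values):
    """Sum of squared differences between each value and the mean."""
    mean_value = sum(values) / len(values)
    return sum((x - mean_value) ** 2 for x in values)

small = [5, 15]                  # 2 observations
large = [5] * 500 + [15] * 500   # 1000 observations (assumed split)

print(total_squared_variation(small))  # 50.0
print(total_squared_variation(large))  # 25000.0
```

The total grows with the number of observations even though the typical distance from the mean has not changed.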
Solution: Average Squared Difference - Variance
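We want the average of the squared distances, that is, the average of the squared differences from the mean.
So our measure of variance, in its simplest form, is given below. For a sample we divide by n−1 rather than n, for reasons covered under Sample vs Population.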
\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
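The formula can be translated into Python almost term by term. The sketch below applies it to the variable X from earlier in the lecture; `sample_variance` is a hypothetical helper written for this illustration, and `statistics.variance` from the standard library is used only as a cross-check.

```python
import statistics

def sample_variance(values):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean_value = sum(values) / n
    return sum((x - mean_value) ** 2 for x in values) / (n - 1)

X = [1, 7, 21, 13, 19, 5, 9, 17, 11]
s_squared = sample_variance(X)
print(round(s_squared, 2))  # 44.78

# Cross-check against the standard library, which also divides by n - 1
print(round(statistics.variance(X), 2))  # 44.78
```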
Problem: Squares inflate the results
- Squares inflate the numbers relative to the size of the mean.
- In the two-observation example, each squared distance of 25 is 2.5 times the mean of 10.
- But the distances aren’t really that big
- Average distance is still 5
- We want to get back to the original unit of measure instead of the squared unit of measure…
Solution
How can we solve this?
- To partially account for this we can take the square root of the variance
- That gives us our next measure: standard deviation
Standard deviation
- The standard deviation is the square root of the variance
\[
s = \sqrt{s^2}
\]
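A short sketch of this step in Python, assuming the same variable X used earlier in the lecture. The variance is in squared units; taking the square root returns the measure to the units of X.

```python
import math
import statistics

X = [1, 7, 21, 13, 19, 5, 9, 17, 11]

s_squared = statistics.variance(X)  # sample variance, in squared units
s = math.sqrt(s_squared)            # standard deviation, in the units of X
print(round(s, 2))  # 6.69

# statistics.stdev performs the same two steps in one call
print(round(statistics.stdev(X), 2))  # 6.69
```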
Sample vs Population
- When we have data for the entire population, we can compute the true variance and standard deviation directly, so we divide by n
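- When we only have a sample and still divide by n, the result is systematically too small: the differences are measured from the sample mean, which sits closer to the sample's own points than the population mean does, so the population spread is underestimated.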
- Dividing by n−1 (Bessel’s correction) adjusts for this and makes the sample variance an unbiased estimator of the population variance
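The effect of the correction can be checked exactly on a tiny case. The sketch below treats the two values 5 and 15 as an entire population, lists every possible sample of size 2 drawn with replacement, and averages the two versions of the sample variance across those samples. The setup (sampling with replacement, sample size 2) and the helper `variance` are assumptions chosen to keep the example small.

```python
from itertools import product

def variance(values, divisor_offset):
    """Sum of squared deviations from the mean of `values`,
    divided by n - divisor_offset (0 gives n, 1 gives n - 1)."""
    n = len(values)
    mean_value = sum(values) / n
    return sum((x - mean_value) ** 2 for x in values) / (n - divisor_offset)

population = [5, 15]
true_variance = variance(population, 0)  # whole population: divide by n

# Every possible sample of size 2, drawn with replacement
samples = list(product(population, repeat=2))

avg_divide_by_n = sum(variance(s, 0) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print(true_variance)            # 25.0
print(avg_divide_by_n)          # 12.5
print(avg_divide_by_n_minus_1)  # 25.0
```

Averaged over all possible samples, dividing by n gives half the true variance here, while dividing by n−1 recovers it.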
Population Variance
\[
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
\]
Population Standard Deviation
\[
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}
\]
or
\[
\sigma = \sqrt{\sigma^2}
\]
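In Python's standard library, the population versions are available as `statistics.pvariance` and `statistics.pstdev`. The sketch below applies them to the two-observation example, treating 5 and 15 as the whole population (an assumption made for illustration), and shows the sample version alongside for contrast.

```python
import statistics

population = [5.0, 15.0]

sigma_squared = statistics.pvariance(population)  # divides by n
sigma = statistics.pstdev(population)             # square root of the above
print(sigma_squared)  # 25.0
print(sigma)          # 5.0

# The sample formula divides by n - 1 and gives a larger value
print(statistics.variance(population))  # 50.0
```

For this data set the population standard deviation of 5 matches the distance of each observation from the mean.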
Quartiles
- Quartiles are the three cut points that divide the sorted data into four equal parts
- The first quartile (Q1) is the 25th percentile
- The second quartile (Q2) is the 50th percentile (the median)
- The third quartile (Q3) is the 75th percentile
- The interquartile range (IQR) is the difference between Q3 and Q1 (IQR = Q3 - Q1)
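The definitions above can be tried on the variable X from earlier in the lecture. This sketch uses `statistics.quantiles` (Python 3.8 or later). Software packages and textbooks use slightly different interpolation rules for quartiles, so the exact values depend on the method chosen; both methods offered by the standard library are shown.

```python
import statistics

X = [1, 7, 21, 13, 19, 5, 9, 17, 11]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(X, n=4)  # default method="exclusive"
iqr = q3 - q1
print(q1, q2, q3)  # 6.0 11.0 18.0
print(iqr)         # 12.0

# A different interpolation convention gives different quartiles
q1_inc, q2_inc, q3_inc = statistics.quantiles(X, n=4, method="inclusive")
print(q1_inc, q2_inc, q3_inc)  # 7.0 11.0 17.0
```

Under either method Q2 equals the median of X, which is 11.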
Range
- The range is the difference between the maximum and minimum values in the data set
- Range = Max - Min
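For the variable X used earlier in the lecture, the range can be computed in one line. The name `data_range` is an illustrative choice.

```python
X = [1, 7, 21, 13, 19, 5, 9, 17, 11]

# Range = largest observation minus smallest observation
data_range = max(X) - min(X)
print(data_range)  # 20
```

Because the range depends only on the two most extreme observations, a single unusual value can change it substantially.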