Lecture 3: Measures of Dispersion
Last class
Measures of Central Tendency
- Mean - Median - Mode
These tell us about the center of the data
They help us start to visualize the data based on the typical or average value
Today
Measures of Dispersion (Variation or Spread)
- Variance - Standard Deviation - Range - Quartiles
We start with the mean
- Start with the mean
- Our variable X: 1,7,21,13,19,5,9,17,11
- Mean is 11.44
Visualizing distances from the mean
- Plot the data points
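A minimal sketch of this picture in base R graphics. The names `x_bar` and `y`, the stacking of points at heights 1 to 9, and the red/blue colors are illustrative choices, not taken from the lecture.

```r
# Data from the lecture
x <- c(1, 7, 21, 13, 19, 5, 9, 17, 11)
x_bar <- mean(x)  # 103 / 9, about 11.44

# Give each point its own height so the lines do not overlap (illustrative choice)
y <- seq_along(x)

# Plot the points, mark the mean with a dashed line,
# and draw a colored line from each point to the mean
plot(x, y, pch = 19, xlim = c(0, 22), xlab = "Value", ylab = "Observation",
     main = "Data Points and Mean")
abline(v = x_bar, lty = 2)
segments(x0 = x, y0 = y, x1 = x_bar, y1 = y,
         col = ifelse(x < x_bar, "red", "blue"))
```

Red lines mark observations below the mean and blue lines mark observations above it.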
Same Mean, Different Spread
- Now, consider these two sets of data points with the same mean but different spreads:
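As a rough illustration, here are two small hypothetical data sets (`narrow` and `wide` are made-up values, not the ones drawn in class) that share a mean of 10 but sit at very different distances from it.

```r
# Two hypothetical data sets (not from the lecture) with the same mean
narrow <- c(9, 10, 11)
wide   <- c(1, 10, 19)

mean(narrow)  # 10
mean(wide)    # 10

# The distances from the mean are very different
narrow - mean(narrow)  # -1  0  1
wide - mean(wide)      # -9  0  9
```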
Drawing on the board time!
# Create a blank grid: x-axis from 0 to 20, y-axis from 0 to 15
plot(0, 0, type="n", xlim=c(0, 20), ylim=c(0, 15), xlab="Value", ylab="", main="Data Points and Mean", xaxt='n', yaxt='n') # type="n" draws no points, so the plot is blank
axis(1, at=seq(0, 20, by=1)) # Add x-axis with ticks from 0 to 20
axis(2, at=seq(0, 15, by=1), labels=rep("", 16)) # Add y-axis with ticks from 0 to 15 without labels
mtext(c("5", "10", "15"), side=2, at=c(5, 10, 15), line=0.5) # Label y-axis ticks at 5, 10, and 15
Creating a measure of dispersion: distance to mean
- So, we could define a measure of dispersion or variation that is the total length of the colored lines.
- Our formula in English would be “the sum of the differences between each observation and the mean”
Drawing on the board time!
# Create a blank grid: x-axis from 0 to 20, y-axis from 0 to 15
plot(0, 0, type="n", xlim=c(0, 20), ylim=c(0, 15), xlab="Value", ylab="", main="Data Points and Mean", xaxt='n', yaxt='n') # type="n" draws no points, so the plot is blank
axis(1, at=seq(0, 20, by=1)) # Add x-axis with ticks from 0 to 20
axis(2, at=seq(0, 15, by=1), labels=rep("", 16)) # Add y-axis with ticks from 0 to 15 without labels
mtext(c("5", "10", "15"), side=2, at=c(5, 10, 15), line=0.5) # Label y-axis ticks at 5, 10, and 15
Problem with sum of distances
The problem is that, because of how the mean is defined, the positive differences exactly cancel out the negative ones, so this measure of dispersion or variation would always be zero!
Simple Data Example
Suppose we had a very simple data set with only two observations - 5 and 15. The mean is 10. One is 5 above the mean and one is 5 below the mean.
Distance from Mean Total
So, we want our new measure total_variation to equal the sum of the distances.
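A short sketch of that sum for the two-observation example. The name `total_variation` comes from the slide; `distances` is an illustrative helper name.

```r
# The two-observation example from the lecture
x <- c(5, 15)

# Signed distance of each observation from the mean
distances <- x - mean(x)   # -5  5

# Adding them up: the negative and positive distances cancel
total_variation <- sum(distances)
total_variation            # 0
```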
Math to the rescue!
Math comes to the rescue!
- What is something we can do that turns a negative number into a positive number every time and leaves a positive number as a positive?
- It’s also important that any effect it has on the actual size of the numbers is consistent between positive and negative numbers.
Math to the rescue! Code
- We can square the distances
Results
- Squaring 5 turned it into 25
- Squaring -5, which is the same size but negative, also turned it into 25.
- So, now we can add them to get a measure of total_squared_variation.
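One way this step might look in R, again using the two-observation example. `total_squared_variation` is the name used on the slide; `squared_distances` is an illustrative helper name.

```r
# The two-observation example from the lecture
x <- c(5, 15)

# Squaring removes the sign but treats +5 and -5 the same way
squared_distances <- (x - mean(x))^2   # 25 25

# Now the sum no longer cancels to zero
total_squared_variation <- sum(squared_distances)
total_squared_variation                # 50
```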
Are we done?
- Suppose we had 1000 observations
- Mean still 10
- Each still 5 points away on average
- What would our total variation be?
Given that the actual average distance is exactly the same for both groups, does that make sense? Is it useful?
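To make the question concrete, here is a sketch that assumes the 1000 observations are 500 fives and 500 fifteens. That split is an assumption for illustration; the slide only says the mean is 10 and each point is 5 away.

```r
# Original two-observation data set
small <- c(5, 15)

# Assumed larger data set: 500 fives and 500 fifteens (mean is still 10)
large <- rep(c(5, 15), times = 500)

# Total squared variation grows with the number of observations
sum((small - mean(small))^2)  # 50
sum((large - mean(large))^2)  # 25000
```

The total is 500 times larger even though every point is still exactly 5 away from the mean, so a raw total cannot be compared across data sets of different sizes.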
Solution: Average Squared Difference - Variance
We want the average of the distances or, more precisely, the average of the squared differences.
So, in its simplest form, our measure of variance is:
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Variance formula
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
- s^2 = variance
- n = number of observations
- x_i = each observation
- x̄ = mean of the observations
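A minimal sketch of the formula applied to the variable X from earlier in the lecture, checked against R's built-in `var()`, which also divides by n − 1. The name `s2_by_hand` is illustrative.

```r
# Data from the lecture
x <- c(1, 7, 21, 13, 19, 5, 9, 17, 11)
n <- length(x)

# Sum of squared differences from the mean, divided by n - 1
s2_by_hand <- sum((x - mean(x))^2) / (n - 1)
s2_by_hand   # about 44.78

# R's built-in sample variance gives the same value
var(x)
```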
Problem: Squares inflate the results
- Squares inflate the numbers relative to the size of the mean.
- 25 is 2.5 times the mean.
- But the distances aren’t really that big
- Average distance is still 5
- We want to get back to the original unit of measure instead of the squared unit of measure…
Solution
How can we solve this?
- To partially account for this we can take the square root of the variance
- That gives us our next measure: standard deviation
Standard deviation
- Standard deviation is the square root of the variance
\[ s = \sqrt{s^2} \]
Standard deviation: full formula
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
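The full formula can be sketched in R on the lecture's variable X and compared with the built-in `sd()`. The name `s_by_hand` is illustrative.

```r
# Data from the lecture
x <- c(1, 7, 21, 13, 19, 5, 9, 17, 11)

# Square root of the sample variance, written out in full
s_by_hand <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
s_by_hand   # about 6.69, in the same units as x

# R's built-in sample standard deviation gives the same value
sd(x)
```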
Sample vs Population
- When we have data for the entire population, we can compute the true variance and standard deviation directly, so we divide by n
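- When we only have a sample, dividing by n gives a result that is systematically too small, because the distances are measured from the sample mean rather than the true population mean, so it underestimates the population spread.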
- Dividing by n−1 (Bessel’s correction) adjusts for this and makes the sample variance an unbiased estimator of the population variance
Population Variance
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \]
Population Standard Deviation
\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2} \]
or
\[ \sigma = \sqrt{\sigma^2} \]
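Base R's `var()` and `sd()` use the n − 1 (sample) versions, so the population versions have to be written out by hand. The sketch below treats the lecture's variable X as if it were the entire population, which is an assumption made only for illustration; `pop_var` and `pop_sd` are illustrative names.

```r
# Data from the lecture, treated here as a full population (assumption)
x <- c(1, 7, 21, 13, 19, 5, 9, 17, 11)
n <- length(x)

# Population variance and standard deviation: divide by n
pop_var <- sum((x - mean(x))^2) / n   # about 39.80
pop_sd  <- sqrt(pop_var)              # about 6.31

# The sample variance (divide by n - 1) is always a bit larger
samp_var <- var(x)                    # about 44.78

# The two differ only by the factor (n - 1) / n
all.equal(pop_var, samp_var * (n - 1) / n)  # TRUE
```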
Quartiles
- Quartiles divide the data into four equal parts
- The first quartile (Q1) is the 25th percentile
- The second quartile (Q2) is the 50th percentile (the median)
- The third quartile (Q3) is the 75th percentile
- The interquartile range (IQR) is the difference between Q3 and Q1 (IQR = Q3 - Q1)
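A quick sketch using the lecture's variable X. Note that `quantile()` supports several interpolation rules through its `type` argument; this uses R's default (type 7), so other software or textbook methods may give slightly different quartiles on some data sets.

```r
# Data from the lecture
x <- c(1, 7, 21, 13, 19, 5, 9, 17, 11)

# Q1, Q2 (median), and Q3 using R's default method (type = 7)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q          # 25%: 7, 50%: 11, 75%: 17

# Interquartile range: Q3 - Q1
IQR(x)     # 10
```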
Range
- The range is the difference between the maximum and minimum values in the data set
- Range = Max - Min
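In R, one detail to watch is that the built-in `range()` returns the minimum and maximum as a pair, not their difference. A small sketch with the lecture's variable X:

```r
# Data from the lecture
x <- c(1, 7, 21, 13, 19, 5, 9, 17, 11)

# range() returns the minimum and maximum, not a single number
range(x)          # 1 21

# The range as defined above: Max - Min
max(x) - min(x)   # 20
diff(range(x))    # 20, same thing
```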