Let’s talk about two popular measures of the spread of a set of numbers. Take a very simple collection that’s easy to follow: the integers from 1 to 5.

x = 1:5
## [1] 1 2 3 4 5

The natural starting point for talking about spread is to calculate the distance between each individual number and the center of the set. I’ll use the mean value as the center. Then I’ll subtract the mean from each number to get the “deviations from the mean.”

m = mean(x)
## [1] 3
xdev = x - m
## [1] -2 -1  0  1  2

Note that 5 and 1 are both two units away from the mean, but the deviation of 5 is positive and the deviation of 1 is negative.

The natural way to summarize the deviations is to calculate their average value, the mean of the deviations. Let’s do this and look at it.

md = mean(xdev)
## [1] 0

Why do we get zero? The positive and negative deviation values cancel each other out. I could prove mathematically that this will happen with any set of numbers. It’s a theoretical property of the mean. What we need to do is remove the signs from the deviations. We can use the absolute value function, abs(), in R to do this.

ad = abs(xdev)
## [1] 2 1 0 1 2
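For the curious, here is the short proof that the deviations always average to zero. It writes $n$ for how many numbers there are, $x_i$ for each number, and $\bar{x}$ for their mean; this notation is introduced here only for the derivation.

```latex
\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)
  = \frac{1}{n}\sum_{i=1}^{n} x_i \;-\; \frac{1}{n}\cdot n\,\bar{x}
  = \bar{x} - \bar{x}
  = 0
```

The first term is the definition of the mean, so the result holds for any set of numbers, not just 1 through 5.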

Now take the mean of the absolute values of the deviations to get something we can call the MAD (mean absolute deviation).

mad = mean(ad)
## [1] 1.2

That’s a reasonable measure of spread and we can use it in a sentence: on average, a number in the set is 1.2 units away from the mean of the set.
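The steps so far can be collected into a small helper. This is only a sketch: the name `mean_abs_dev` is made up for illustration and is not a built-in R function. One thing to watch for is that R's built-in `mad()` computes something different, a scaled median absolute deviation around the median, so it will not reproduce the 1.2 above.

```r
# Hypothetical helper (not part of base R):
# mean absolute deviation from the mean
mean_abs_dev = function(x) {
  xdev = x - mean(x)   # deviations from the mean
  mean(abs(xdev))      # average distance from the mean
}

mean_abs_dev(1:5)
## [1] 1.2

# R's built-in mad() is the median absolute deviation,
# multiplied by a constant of 1.4826 by default
mad(1:5)
## [1] 1.4826
```

If you want the measure described here, compute it yourself as above rather than calling `mad()`.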

Now let’s talk about the standard deviation. It differs in how we solve the cancelling-out problem: instead of taking the absolute values of the deviations, we square them.

xdevsq = xdev^2
## [1] 4 1 0 1 4

That solved the problem with the negative and positive deviations. The average of these squared deviations is known as the variance.

variance = mean(xdevsq)
## [1] 2

How would you use this number in a sentence that anyone could understand? We certainly would not say that the variance is a typical size for a deviation. The problem is that squaring inflated the deviations of 1 and 5 from 2 units to 4, so the variance is measured in squared units rather than the units of the original numbers.

To compensate for this, we take the square root of the variance to get us back to the scale of typical numbers in our collection. This is known as the standard deviation.

standardDeviation = sqrt(variance)
## [1] 1.414214

That’s a reasonable value, close to the mean absolute deviation (MAD) of 1.2. What does it mean? Most people say that it is a typical size of a deviation. This is vague, not as clear as the plain-language reading of the MAD.
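Written side by side, the two measures differ only in how they get rid of the signs. The notation is mine, not part of the walkthrough: $n$ is the count of numbers, $x_i$ is each number, and $\bar{x}$ is their mean.

```latex
\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}\right|
\qquad
\mathrm{SD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

For the integers 1 through 5 these give $6/5 = 1.2$ and $\sqrt{10/5} = \sqrt{2} \approx 1.414$, matching the values computed step by step.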

Why do statisticians do this? You’ll rarely hear any of them using the MAD. The standard deviation and the variance are preferred because much of statistical theory, especially the theory surrounding the normal distribution, is built on them.
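One practical note if you check these numbers against R's built-in functions: `var()` and `sd()` divide the sum of squared deviations by n - 1 rather than n, so they give slightly larger values than the hand calculation. A minimal sketch of the comparison follows; the variable names `n` and `popVariance` are mine.

```r
x = 1:5
n = length(x)

# Divide by n, as in the step-by-step calculation
popVariance = mean((x - mean(x))^2)
popVariance
## [1] 2

# The built-in var() divides by n - 1
var(x)
## [1] 2.5

sqrt(popVariance)
## [1] 1.414214
sd(x)
## [1] 1.581139

# Rescaling the built-in result recovers the divide-by-n version
var(x) * (n - 1) / n
## [1] 2
```

The gap between the two versions shrinks as the set of numbers gets larger, since (n - 1) / n approaches 1.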