Deviance, Variance, and Standard Deviation

Any value in a data set can be explained by some sort of statistical model, plus some error. We call this error the deviance.

Deviance

The deviance is the difference between the value of each observation and the statistical model. Let’s say as an example that my statistical model is the mean of a set of values. The difference between each value and the mean of all the values will be written like this:

\[x_{i}-\bar{x}\]

where \(x_{i}\) is the value of the observation for an individual \(i\) and \(\bar{x}\) is the mean of our sample.

Let’s look at our personality data, and in particular at the breath hold variable. What is the mean of breath hold?

mean(data$breathhold)

## [1] 53.99208

Cool. To calculate the deviance, it will be enough to subtract each observation from the mean in this way:

data$breathhold - mean(data$breathhold)

##  [1]  -3.992083333 -29.992083333 -28.082083333 -19.192083333   7.347916667
##  [6] -16.992083333 -15.992083333 -26.992083333 -23.992083333 -28.992083333
## [11]  42.007916667   0.007916667  13.007916667   6.007916667  67.007916667
## [16] -18.992083333  14.007916667  41.007916667 -16.992083333  14.007916667
## [21]  11.007916667  17.007916667  25.007916667   7.007916667  14.007916667
## [26] -17.992083333 -14.892083333  -1.432083333  53.007916667  -6.992083333
## [31]   9.007916667 -11.992083333 -24.492083333  30.007916667  17.007916667
## [36]  -5.992083333 -15.992083333  24.007916667  11.007916667 -20.992083333
## [41] -15.992083333  18.007916667 -10.992083333 -11.992083333 -16.462083333
## [46]  -3.212083333 -19.892083333 -10.992083333

This is pretty chaotic. I can summon the power of the tidyverse to print the deviance next to each of your names (I will only show the first 10 rows):

data %>% 
  mutate(deviance = data$breathhold - mean(data$breathhold)) %>% 
  select(name, deviance) %>% 
  head(10)

## # A tibble: 10 x 2
##    name        deviance
##    <chr>          <dbl>
##  1 "Cornelia"     -3.99
##  2 "Tilde"       -30.0 
##  3 "Rebecca T"   -28.1 
##  4 "Sarah KS"    -19.2 
##  5 "Maja F"        7.35
##  6 "Vlada"       -17.0 
##  7 "Marie"       -16.0 
##  8 "Laura"       -27.0 
##  9 "M\xe1rton"   -24.0 
## 10 "Jakob"       -29.0

Sum of Squared Deviances

The deviance is cool, but what if I wanted to find one value that can be used as a metric for how much deviance there is in a variable? One way of quantifying the total deviance in a variable is the Sum of Squared Deviances, which has this formula:

\[SS = \Sigma(x_{i}-\bar{x})(x_{i}-\bar{x}) = \Sigma(x_{i}-\bar{x})^2\]

where \(\Sigma\) is the sum of the deviance times itself, that is, raised to the power of two. Let’s calculate this for our variable:

SS <- sum((data$breathhold - mean(data$breathhold))^2)
SS

## [1] 23773.53

That’s a large number! Let’s plot it for reference:

ggplot(data, aes(name, breathhold)) +
  geom_point() +
  theme_bw() +
  theme(axis.text.x = element_text(hjust = 1, angle = 90)) +
  xlab("Student") +
  ylab("Breath hold (seconds)") +
  geom_hline(aes(yintercept = mean(breathhold)), color = "red") +
  geom_hline(aes(yintercept = mean(breathhold) + SS), color = "blue") +
  geom_hline(aes(yintercept = mean(breathhold) - SS), color = "blue")

The \(SS\) is so large that the actual points look like they are all falling right on the mean line. This is because \(SS\) is very much dependent on \(n\), so that higher \(n\) will result in higher \(SS\).

Variance

A solution to this problem is to look at the mean deviance, which we call the variance, and which looks like this:

\[s^2 = \frac{SS}{n-1} = \frac{\Sigma(x_{i}-\bar{x})^2}{n-1}\]

The reason why we take \(n-1\) has to do with degrees of freedom. Let’s try to compute the variance of our breath hold variable:

s2 <- sum((data$breathhold - mean(data$breathhold))^2)/(length(data$breathhold)-1)
s2

## [1] 505.8199

This seems like a more reasonable number! Let’s plot it – this should make the plot more informative:

ggplot(data, aes(name, breathhold)) +
  geom_point() +
  theme_bw() +
  theme(axis.text.x = element_text(hjust = 1, angle = 90)) +
  xlab("Student") +
  ylab("Breath hold (seconds)") +
  geom_hline(aes(yintercept = mean(breathhold)), color = "red") +
  geom_hline(aes(yintercept = mean(breathhold) + s2), color = "blue") +
  geom_hline(aes(yintercept = mean(breathhold) - s2), color = "blue")

Pretty good, but still not entirely useful. The problem with variance (\(s^2\)) is that it is expressed in units squared, which is not very informative: for instance, in our variable, the variance is 505.82 seconds squared! That is pretty hard to interpret.

Standard Deviation

Luckily it’s easy to bring this value back to the original unit of measurement of the variable: we just take the square root of \(s^2\)! The value we obtain is called the standard deviation and has this formula:

\[s = \sqrt{\frac{\Sigma(x_{i}-\bar{x})^2}{n-1}} = \sqrt{s^2}\]

which, in our breath hold variable, would be this:

s <- sqrt(sum((data$breathhold - mean(data$breathhold))^2)/(length(data$breathhold)-1))
s

## [1] 22.49044

This is much better because this value is on the original scale – i.e., seconds! So this means that the standard deviation of breath hold is 22 seconds around the mean. Which means that most people in our data set have a value of breath hold that is somewhere in the range of the mean of breath thold plus or minus 22 seconds (but note that this only applies insofar as the variable is approximately normally distributed). In this sense, the standard deviation can be interpreted as a measure of whether the sample mean is a good model of the sample.

Let’s plot this:

ggplot(data, aes(name, breathhold)) +
  geom_point() +
  theme_bw() +
  theme(axis.text.x = element_text(hjust = 1, angle = 90)) +
  xlab("Student") +
  ylab("Breath hold (seconds)") +
  geom_hline(aes(yintercept = mean(breathhold)), color = "red") +
  geom_hline(aes(yintercept = mean(breathhold) + s), color = "blue") +
  geom_hline(aes(yintercept = mean(breathhold) - s), color = "blue")

Much prettier and much easier to understand! Now it’s easy to see that most participants (around two-thirds) have a value that is between one standard deviation below the mean and one standard deviation above the mean (i.e., between the two lines), but the rest of the people (around one-third) has values beyond this range: that is, either much worse than most people or much better than most people.

Notice however that the variable is not really normally distributed, so interpreting the standard deviation is a bit trickier. The distribution is quite skewed to the right, meaning that a few people have really high values that are very much above the mean.

hist(data$breathhold, breaks = 50)

Deviance, Variance, and Standard Deviation

Fabio Trecca

Fall 2021

Deviance

Sum of Squared Deviances

Variance

Standard Deviation