Any value in a data set can be explained by some sort of statistical model, plus some error. We call this error the deviance.
The deviance is the difference between the value of each observation and the statistical model. Let’s say as an example that my statistical model is the mean of a set of values. The difference between each value and the mean of all the values will be written like this:
\[x_{i}-\bar{x}\]
where \(x_{i}\) is the value of the observation for an individual \(i\) and \(\bar{x}\) is the mean of our sample.
Let’s look at our personality data, and in particular at the breath hold variable. What is the mean of breath hold?
mean(data$breathhold)
## [1] 53.99208
Cool. To calculate the deviance, it will be enough to subtract each observation from the mean in this way:
data$breathhold - mean(data$breathhold)
## [1] -3.992083333 -29.992083333 -28.082083333 -19.192083333 7.347916667
## [6] -16.992083333 -15.992083333 -26.992083333 -23.992083333 -28.992083333
## [11] 42.007916667 0.007916667 13.007916667 6.007916667 67.007916667
## [16] -18.992083333 14.007916667 41.007916667 -16.992083333 14.007916667
## [21] 11.007916667 17.007916667 25.007916667 7.007916667 14.007916667
## [26] -17.992083333 -14.892083333 -1.432083333 53.007916667 -6.992083333
## [31] 9.007916667 -11.992083333 -24.492083333 30.007916667 17.007916667
## [36] -5.992083333 -15.992083333 24.007916667 11.007916667 -20.992083333
## [41] -15.992083333 18.007916667 -10.992083333 -11.992083333 -16.462083333
## [46] -3.212083333 -19.892083333 -10.992083333
This is pretty chaotic. I can summon the power of the tidyverse to print the deviance next to each of your names (I will only show the first 10 rows):
data %>%
mutate(deviance = data$breathhold - mean(data$breathhold)) %>%
select(name, deviance) %>%
head(10)
## # A tibble: 10 x 2
## name deviance
## <chr> <dbl>
## 1 "Cornelia" -3.99
## 2 "Tilde" -30.0
## 3 "Rebecca T" -28.1
## 4 "Sarah KS" -19.2
## 5 "Maja F" 7.35
## 6 "Vlada" -17.0
## 7 "Marie" -16.0
## 8 "Laura" -27.0
## 9 "M\xe1rton" -24.0
## 10 "Jakob" -29.0
The deviance is cool, but what if I wanted to find one value that can be used as a metric for how much deviance there is in a variable? One way of quantifying the total deviance in a variable is the Sum of Squared Deviances, which has this formula:
\[SS = \Sigma(x_{i}-\bar{x})(x_{i}-\bar{x}) = \Sigma(x_{i}-\bar{x})^2\]
where \(\Sigma\) is the sum of the deviance times itself, that is, raised to the power of two. Let’s calculate this for our variable:
SS <- sum((data$breathhold - mean(data$breathhold))^2)
SS
## [1] 23773.53
That’s a large number! Let’s plot it for reference:
ggplot(data, aes(name, breathhold)) +
geom_point() +
theme_bw() +
theme(axis.text.x = element_text(hjust = 1, angle = 90)) +
xlab("Student") +
ylab("Breath hold (seconds)") +
geom_hline(aes(yintercept = mean(breathhold)), color = "red") +
geom_hline(aes(yintercept = mean(breathhold) + SS), color = "blue") +
geom_hline(aes(yintercept = mean(breathhold) - SS), color = "blue")
The \(SS\) is so large that the actual points look like they are all falling right on the mean line. This is because \(SS\) is very much dependent on \(n\), so that higher \(n\) will result in higher \(SS\).
A solution to this problem is to look at the mean deviance, which we call the variance, and which looks like this:
\[s^2 = \frac{SS}{n-1} = \frac{\Sigma(x_{i}-\bar{x})^2}{n-1}\]
The reason why we take \(n-1\) has to do with degrees of freedom. Let’s try to compute the variance of our breath hold variable:
s2 <- sum((data$breathhold - mean(data$breathhold))^2)/(length(data$breathhold)-1)
s2
## [1] 505.8199
This seems like a more reasonable number! Let’s plot it – this should make the plot more informative:
ggplot(data, aes(name, breathhold)) +
geom_point() +
theme_bw() +
theme(axis.text.x = element_text(hjust = 1, angle = 90)) +
xlab("Student") +
ylab("Breath hold (seconds)") +
geom_hline(aes(yintercept = mean(breathhold)), color = "red") +
geom_hline(aes(yintercept = mean(breathhold) + s2), color = "blue") +
geom_hline(aes(yintercept = mean(breathhold) - s2), color = "blue")
Pretty good, but still not entirely useful. The problem with variance (\(s^2\)) is that it is expressed in units squared, which is not very informative: for instance, in our variable, the variance is 505.82 seconds squared! That is pretty hard to interpret.
Luckily it’s easy to bring this value back to the original unit of measurement of the variable: we just take the square root of \(s^2\)! The value we obtain is called the standard deviation and has this formula:
\[s = \sqrt{\frac{\Sigma(x_{i}-\bar{x})^2}{n-1}} = \sqrt{s^2}\]
which, in our breath hold variable, would be this:
s <- sqrt(sum((data$breathhold - mean(data$breathhold))^2)/(length(data$breathhold)-1))
s
## [1] 22.49044
This is much better because this value is on the original scale – i.e., seconds! So this means that the standard deviation of breath hold is 22 seconds around the mean. Which means that most people in our data set have a value of breath hold that is somewhere in the range of the mean of breath thold plus or minus 22 seconds (but note that this only applies insofar as the variable is approximately normally distributed). In this sense, the standard deviation can be interpreted as a measure of whether the sample mean is a good model of the sample.
Let’s plot this:
ggplot(data, aes(name, breathhold)) +
geom_point() +
theme_bw() +
theme(axis.text.x = element_text(hjust = 1, angle = 90)) +
xlab("Student") +
ylab("Breath hold (seconds)") +
geom_hline(aes(yintercept = mean(breathhold)), color = "red") +
geom_hline(aes(yintercept = mean(breathhold) + s), color = "blue") +
geom_hline(aes(yintercept = mean(breathhold) - s), color = "blue")
Much prettier and much easier to understand! Now it’s easy to see that most participants (around two-thirds) have a value that is between one standard deviation below the mean and one standard deviation above the mean (i.e., between the two lines), but the rest of the people (around one-third) has values beyond this range: that is, either much worse than most people or much better than most people.
Notice however that the variable is not really normally distributed, so interpreting the standard deviation is a bit trickier. The distribution is quite skewed to the right, meaning that a few people have really high values that are very much above the mean.
hist(data$breathhold, breaks = 50)