In the book Silent Risk by Nassim Nicholas Taleb (NNT), I read that the standard deviation was misnamed and should have been called the root mean squared deviation. Moreover, when we say “standard deviation”, our minds really imagine the “mean deviation”. I am interested in understanding the consequences of using different measures of variance.
Let us consider the case of the biased coin toss, where the probability of the outcome of heads is \(p\). We represent the outcome tails by the value \(0\) and the outcome heads by the value \(1\).
The expected value of an ensemble of coin tosses is \begin{equation} E[X] = 0 \cdot (1 - p) + 1 \cdot p = p. \end{equation} The standard deviation (SD) of the coin tosses via the root mean squared formula is \begin{equation} Sd(X) = \sqrt{E[X^2] - E[X]^2} = \sqrt{p - p^2} = \sqrt{p(1 - p)}. \end{equation} Meanwhile, the mean deviation (MD) is given by \begin{equation} Md(X) = |0 - p| \cdot (1 - p) + |1 - p| \cdot p = 2p(1 - p). \end{equation} Next, let us compare the measures of variance by plotting the respective formulae for various biased probabilities.
# vector of probability of getting heads
p <- (0 : 10) / 10
# Standard deviation
Sd <- sqrt(p * (1 - p))
# Mean deviation
Md <- 2 * p * (1 - p)
# Create data frame with all data
variancedf <- data.frame(prob = p, SD = Sd, MD = Md)
# melt data frame for plotting in ggplot
library(reshape2)
variancedflong <- melt(variancedf, id.vars = "prob")
# plot
library(ggplot2)
ggplot(variancedflong, aes(x = prob, y = value)) +
geom_line(aes(colour = variable), lwd = 1.5) +
xlab("Biased probability (of heads)") +
ylab("Measure of variance") +
ggtitle("Comparison of mean deviation and standard deviation for a biased coin")
I do not find any stark differences. Note: Absence of evidence is NOT evidence of absence.
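To put a number on the gap between the two curves, here is a minimal sketch. From the formulae above, the ratio of MD to SD simplifies to \(2\sqrt{p(1-p)}\), which is at most \(1\) and equals \(1\) only for a fair coin. The names `p_inner`, `sd_coin`, `md_coin` and `ratio` are introduced here purely for illustration.

```r
# probabilities strictly between 0 and 1 (the ratio is 0 / 0 at the endpoints)
p_inner <- (1 : 9) / 10
# standard deviation and mean deviation of a single biased coin toss
sd_coin <- sqrt(p_inner * (1 - p_inner))
md_coin <- 2 * p_inner * (1 - p_inner)
# ratio MD / SD; algebraically this equals 2 * sqrt(p * (1 - p))
ratio <- md_coin / sd_coin
print(data.frame(prob = p_inner, ratio = round(ratio, 3)))
```

The two measures coincide at \(p = 0.5\), and the MD drops further below the SD as the coin becomes more biased.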
Next, construct \(nosim = 100\) data sets of \(n = 15\) normally distributed random data points each and compare the mean deviation with the standard deviation of each data set.
# number of simulations
nosim <- 100
# data points per simulation
n <- 15
# generate matrix of random values from the normal distribution
randnums <- matrix(rnorm(nosim * n), nosim, n)
# compute the mean, MD and SD of each row
meandata <- apply(randnums, 1, mean)
MDdata <- apply(abs(randnums - meandata), 1, sum) / n
SDdata <- apply(randnums, 1, sd)
# maximum absolute value of each data matrix row
maxdata <- apply(abs(randnums), 1, max)
# create data frame of max value and SD and MD values
normdf <- data.frame(maxval = maxdata, SD = SDdata, MD = MDdata)
# melt data frame
normdflong <- melt(normdf, id.vars = "maxval")
ggplot(normdflong, aes(x = maxval, y = value)) +
geom_point(aes(colour = variable), size = 3, alpha = 0.6) +
xlab("Maximum absolute value of data set") +
ylab("Measure of variance") +
ggtitle("Mean deviation vs standard deviation for normally distributed data")
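For a normal distribution, the population mean deviation is \(\sqrt{2/\pi} \approx 0.80\) times the standard deviation. The sketch below checks how closely the simulated data sets follow that value. The seed and the names `sim`, `md_sim`, `sd_sim` and `ratio_sim` are assumptions made for this illustration and are not part of the analysis above.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
nosim <- 100
n <- 15
sim <- matrix(rnorm(nosim * n), nosim, n)
# mean deviation of each row: average absolute distance from the row mean
md_sim <- rowMeans(abs(sim - rowMeans(sim)))
# sample standard deviation of each row
sd_sim <- apply(sim, 1, sd)
ratio_sim <- md_sim / sd_sim
# compare the average simulated ratio with the theoretical sqrt(2 / pi)
print(c(simulated = mean(ratio_sim), theoretical = sqrt(2 / pi)))
```

With only 15 points per data set, the simulated ratio should land near the theoretical value but need not match it exactly, since both the MD and the SD are estimated from small samples.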