Consider first a real-valued random variable \(X\) with distribution function \(F\) defined by \(F(x) = \Pr(X \le x)\), \(x \in \mathbb R\). We can also define \(F\) at the points \(-\infty\) and \(+\infty\), where we let it take the values \(0\) and \(1\) respectively. The function \(F\) is right-continuous with left-hand limits, and monotone non-decreasing.
It maps the extended real line \([-\infty, +\infty]\) into the unit interval \([0, 1]\), but this mapping is not necessarily surjective or injective. Regarding (non-)surjectivity: the graph of \(F\) can have jumps, corresponding to positive probability at a point; regarding (non-)injectivity: the graph of \(F\) can have flat sections, corresponding to intervals of positive length but zero probability.
Still, we can usefully (as we will see) define an inverse to \(F\), called the quantile function of the distribution of \(X\), in the following way: we define \(Q:[0, 1]\to [-\infty, +\infty]\) by \(Q(p) = F^{-1}(p) = \inf\{x \in [-\infty, +\infty] : F(x) \ge p\}\).
It is not hard to check that the function \(Q\) is left-continuous with right-hand limits, and monotone non-decreasing. Its flat sections and its jumps correspond to the jumps and flat sections of \(F\), respectively. In fact, inverting \(Q\) by the mirror-image recipe \(Q^{-1}(x) = \sup\{p \in [0, 1] : Q(p) \le x\}\), we recover \(F\): thus \(F^{-1}=Q\) and \(Q^{-1}=F\).
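As a numerical aside, the defining formula is easy to implement directly. Here is a crude sketch (purely illustrative; the helper name Qnum and its grid bounds are my own ad-hoc choices) that approximates \(Q(p)\) for any vectorized distribution function by scanning a fine grid for the first point where the function reaches \(p\).

# Approximate Q(p) = inf{x : F(x) >= p} on a finite grid (a rough sketch only)
Qnum <- function(p, distfun, lower = -10, upper = 10, k = 1e5) {
  xs <- seq(lower, upper, length.out = k)
  xs[match(TRUE, distfun(xs) >= p)]  # smallest grid point with distfun(x) >= p
}
Qnum(0.5, pnorm)  # approximately qnorm(0.5) = 0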
par(mfrow = c(1, 2))
xpts <- seq(from = 0, to = 2, length = 401)       # grid on [0, 2], shifted to [-1, 1] below
ypts <- numeric(401)
ypts[1:100] <- xpts[1:100]^2                      # rising quadratic piece
ypts[101:200] <- 1/2                              # flat piece: an interval of zero probability
ypts[201:401] <- 1 - 2 * (1 - xpts[201:401]/2)^2  # rising piece on the right
plot(xpts[1:101]-1, c(ypts[1:100], 1/4),
     type = "l", xlim = c(-1, 1), ylim = c(0, 1),
     xlab = "x", ylab = "p")
lines(c(1/2, 1/2) - 1, c(1/4, 1/2), lty = 2)      # dashed segment marks the jump of F at x = -1/2
points(1/2 - 1, 1/4, pch = 1)                     # open circle: value not attained
points(1/2 - 1, 1/2, pch = 16)                    # filled circle: F(-1/2) = 1/2 (right-continuity)
points(1 - 1, 1/2, pch = 8)                       # star: end of the flat section of F
lines(xpts[101:401] - 1, ypts[101:401])
title(main = "Cumulative distribution function")
plot(y = xpts[1:101] - 1, x = ypts[1:101],        # the same curve with the axes swapped
     type = "l", xlim = c(0, 1), ylim = c(-1, 1),
     xlab = "p", ylab = "x")
lines(y = c(1/2, 1) - 1, x = c(1/2, 1/2), lty = 2)  # dashed segment marks the jump of Q at p = 1/2
points(y = 0.5 - 1, x = 1/4, pch = 11)              # start of the flat section of Q
points(y = 0.5 - 1, x = 1/2, pch = 16)              # filled circle: Q(1/2) = -1/2 (left-continuity)
points(y = 1 - 1, x = 1/2, pch = 1)                 # open circle: value not attained
lines(y = xpts[201:401] - 1, x = ypts[201:401])
title(main = "Quantile function")
Here is a useful fact about the quantile function.
Suppose \(U\) is uniformly distributed on the unit interval, and suppose \(Q\) is the quantile function corresponding to a distribution function \(F\). Define \(X = Q(U)\). Then the random variable \(X\) has the distribution function \(F\). The proof rests on the “switch relation” \(Q(p) \le x \iff p \le F(x)\), from which \(\Pr(Q(U) \le x) = \Pr(U \le F(x)) = F(x)\).
It would be nice if, conversely, whenever \(X\) has distribution function \(F\), \(F(X)\) were uniformly distributed on the interval \([0, 1]\); but this is true if and only if the distribution function \(F\) is continuous (in which case \(F\) maps the extended real line onto the whole interval \([0, 1]\)).
Let me illustrate this. First I will define the distribution function and quantile function of the previous example as R functions.
par(mfrow = c(1, 2))
F <- function(x) {
  # piecewise: quadratic on [-1, -1/2), a jump to 1/2 at -1/2,
  # flat at 1/2 on [-1/2, 0], quadratic again on (0, 1]
  if (x <= -1) 0 else {
    if (x < -1/2) (1 + x)^2 else {
      if (x <= 0) 1/2 else {
        if (x <= 1) 1 - 2 * (1 - (x+1)/2)^2 else 1
      }
    }
  }
}
Fv <- Vectorize(F)
xpts <- seq(from = -1, to = +1, length = 1000)
plot(xpts, Fv(xpts), main = "Distribution function F",
xlab = "x", ylab = "p", type = "l")
Q <- function(p) {
  # inverts each increasing piece of F; constant on (1/4, 1/2],
  # corresponding to the jump of F at -1/2
  if (p <= 1/4) sqrt(p) - 1 else {
    if (p <= 1/2) -1/2 else 2 * (1 - sqrt((1 - p)/2)) - 1
  }
}
Qv <- Vectorize(Q)
ppts <- seq(from = 0, to = 1, length = 1000)
plot(ppts, Qv(ppts), main = "Quantile function Q",
xlab = "p", ylab = "x", type = "l")
Now that I have these functions, I'll take a sample of size 50 from the distribution \(F\).
set.seed(12345)
U <- runif(50)
X <- Qv(U)
X
## [1] 0.252877381 0.501548785 0.308599000 0.522767491 -0.500000000
## [6] -0.592113024 -0.500000000 0.009267277 0.262036930 0.856730589
## [11] -0.814162880 -0.609649529 0.272930474 -0.966286701 -0.500000000
## [16] -0.500000000 -0.500000000 -0.500000000 -0.576959122 0.689061915
## [21] -0.500000000 -0.500000000 0.736999328 0.235123379 0.156842407
## [26] -0.500000000 0.223525454 0.045073683 -0.524114322 -0.500000000
## [31] 0.356582826 -0.922620225 -0.566742056 0.202295322 -0.500000000
## [36] -0.500000000 0.487740102 0.562175071 0.125270974 -0.633896689
## [41] 0.339989819 -0.500000000 0.618618236 0.326565853 -0.500000000
## [46] -0.500000000 -0.754652986 -0.791537884 -0.765364499 0.134601591
sum(X == -0.5) # there should be lots
## [1] 15
sum((X > -0.5) & (X < 0)) # there should be none
## [1] 0
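The sample also illustrates the one-directional nature of the earlier claim: since \(F\) has a jump of size \(1/4\) at \(-1/2\), the variable \(F(X)\) carries an atom and cannot be uniform. A quick check, as a sketch using the Fv and X objects defined above:

pU <- Fv(X)      # apply the distribution function to the sample itself
mean(pU == 0.5)  # roughly 1/4 of the sample: an atom at 1/2, so F(X) is not uniform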
Well, has this worked as expected? To investigate that, I will proceed to look at the empirical distribution of the data we have just generated.
If we have a random sample of size \(n\) from some distribution on the real line, then we define the empirical distribution to be the probability distribution which assigns probability mass \(1/n\) to each of the \(n\) observations. If several observations are equal to one another, the probability masses coming from those observations are added together. This probability distribution also has a distribution function, which we call the empirical distribution function and often denote by \(F_n\). It can be computed as follows: \(F_n(x) = \frac 1n\#\{i : X_i \le x\}\). Its inverse is called the empirical quantile function. It can be computed as follows. Define the order statistics \(X_{(1)}\le X_{(2)}\le \dots \le X_{(n)}\) to be the ordered sample, i.e., we arrange the values of \(X_1\), … , \(X_n\) from small to large. Then \(Q_n(p) = X_{(j)}\) if and only if \(\frac{j-1}n < p \le \frac j n\), while \(Q_n(0) = -\infty\).
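These formulas translate directly into R. As a sketch (the helper names Fn and Qn are my own; R's built-in ecdf computes \(F_n\), and quantile with type = 1 computes the inverse of the empirical distribution function):

Fn <- function(x, data) mean(data <= x)  # F_n(x) = #{i : X_i <= x} / n
Qn <- function(p, data) {                # Q_n(p) = X_(j) for (j-1)/n < p <= j/n
  sort(data)[ceiling(p * length(data))]  # (p = 0, giving -Inf, is not handled)
}
Fn(0, X)
ecdf(X)(0)                                 # same value
Qn(0.5, X)
quantile(X, 0.5, type = 1, names = FALSE)  # same value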
Now I will plot the empirical distribution function of our sample of size 50 together with the true distribution function; I will also draw the empirical quantile function and the true quantile function together.
par(mfrow = c(1, 2))
Xs <- sort(X)  # the order statistics
plot(c(-1, Xs, 1), c(0, (1:50)/50, 1),
     type = "s", xlim = c(-1, 1), ylim = c(0, 1),
     xlab = "x", ylab = "p", main = "Empirical distribution function")
lines(xpts, Fv(xpts), col = "blue")  # true F for comparison
plot(y = c(-1, Xs, Xs[50]), x = c(0, 0, (1:50)/50),
     type = "s", xlim = c(0, 1), ylim = c(-1, 1),
     xlab = "p", ylab = "x", main = "Empirical quantile function")
lines(y = xpts, x = Fv(xpts), col = "blue")  # true Q: the graph of F with axes swapped
Consider now two theoretical probability distributions with distribution functions \(F\) and \(G\), and with quantile functions \(Q\) and \(R\). Think of \(F\) as the distribution function of a random variable \(X\) and \(G\) as the distribution function of a random variable \(Y\). The QQ plot (quantile-quantile plot, or just quantile plot for short) of the quantiles of \(Y\) against the quantiles of \(X\) is a plot of the parametric curve \(y = R(p) = G^{-1}(p)\), \(x = Q(p) = F^{-1}(p)\) as \(p\) varies throughout the interval \([0, 1]\). If \(F\) and \(G\) are both continuous and strictly increasing then, by writing \(p = Q^{-1}(x) = F(x)\), we can express \(y\) as the following function of \(x\): \(y = G^{-1}(F(x))\).
Now, as we saw before, if \(X\) has continuous distribution function \(F\) then \(F(X)\) is uniformly distributed on the interval \([0, 1]\), and it follows in that case that \(G^{-1}(F(X))\) has the distribution function \(G\). Notice that \(G^{-1}\) and \(F\), and hence also their composition, are monotone increasing functions. Thus \(G^{-1}\circ F\) is a monotone increasing function which, when applied to the random variable \(X\), produces a random variable distributed like \(Y\). It is easy to see that there can only be one monotone increasing function which does this job. In fact, it obviously has to transform quantiles of the distribution of \(X\) to the same quantiles of the distribution of \(Y\).
All this tells us that the QQ plot, i.e., the curve \((x, y) = (F^{-1}(p), G^{-1}(p))\) traced out as \(p\) varies from \(0\) to \(1\), is a plot of the unique monotone increasing function which converts a random variable distributed like \(X\) into a random variable distributed like \(Y\).
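To make this concrete, here is a sketch of a theoretical QQ plot, reusing the Qv and ppts objects defined earlier: the quantiles of our example distribution plotted against standard normal quantiles.

ps <- ppts[2:999]  # drop p = 0 and p = 1, where the normal quantiles are infinite
plot(qnorm(ps), Qv(ps), type = "l",
     xlab = "N(0,1) quantiles", ylab = "quantiles of F",
     main = "Theoretical QQ plot")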
Here is an important example. Suppose \(X\sim N(0, 1)\), suppose \(\mu\) is arbitrary and \(\sigma > 0\). We know that \(Y = \mu + \sigma X \sim N(\mu, \sigma^2)\). The mapping \(x \mapsto y = \mu + \sigma x\) is a straight line with positive slope so it is a monotone increasing function. Therefore the QQ plot of the quantiles of \(N(\mu, \sigma^2)\) against the quantiles of \(N(0, 1)\) is exactly equal to this straight line \(x \mapsto y = \mu + \sigma x\).
If we want to find out whether a distribution is a normal distribution, all we need to do is plot its quantiles against the quantiles of a standard normal distribution. If the result is a straight line, then the distribution in question is normal, and the intercept and slope of the line tell us its mean and standard deviation.
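For instance (a minimal sketch; the values \(\mu = 2\), \(\sigma = 3\) are arbitrary choices), the quantiles of \(N(2, 3^2)\) plotted against those of \(N(0, 1)\) trace out exactly the line \(y = 2 + 3x\):

ps <- seq(from = 0.001, to = 0.999, length = 200)
plot(qnorm(ps), qnorm(ps, mean = 2, sd = 3), type = "l",
     xlab = "N(0,1) quantiles", ylab = "N(2, 9) quantiles")
abline(a = 2, b = 3, lty = 2, col = "blue")  # the dashed line y = 2 + 3x coincides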
Many parametric families of distributions have location and/or scale parameters. If \(X\) has some standard distribution then \(Y = \mu + \sigma X\) defines a family of probability distributions obtained by scaling (by the amount \(\sigma\)) and shifting (by the amount \(\mu\)). For example, the inverse of the rate parameter of an exponential distribution is a scale parameter: we get all exponential distributions from the standard exponential just by rescaling in this way. We can define useful location-scale families starting from the Cauchy distribution, or starting from a \(t\) distribution with a given number of degrees of freedom.
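A quick numerical check of the exponential case (a sketch; the rate 3 and the probabilities below are arbitrary choices): the rate-\(\lambda\) exponential quantile function is the standard exponential one divided by \(\lambda\).

ps <- c(0.1, 0.25, 0.5, 0.9)
all.equal(qexp(ps, rate = 3), qexp(ps) / 3)  # TRUE: 1/rate acts as a scale parameter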
The name “probability plot” is a bit misleading, since a probability plot is simply a plot of the empirical quantiles of a sample from one distribution against the theoretical quantiles of a given theoretical distribution. In other words: an (empirical vs. theoretical) QQ plot! The most famous probability plot is the normal probability plot. Now a little complication turns up. We saw that the \(1/n\) quantile of an empirical distribution based on a sample of size \(n\) is the smallest observation, and the \(n/n = 1\) quantile is the largest observation. But intuitively speaking, the smallest and largest observations ought to correspond to symmetrically placed small and large quantiles, of the form \(\varepsilon\) and \(1 - \varepsilon\).
This “discomfort” is due to the fact that the empirical distribution is discrete, and that we quite arbitrarily chose “right-continuous with left-hand limits” in our definition of the distribution function. In fact, in the old days, the people on the other side of the iron curtain took the convention opposite to ours on this side.
There is a simple fix. If we have a sample of size \(n\) then the \(1/(n+1)\), \(2/(n+1)\), … , \(n/(n+1)\) empirical quantiles are exactly the ordered observations \(X_{(1)}\), \(X_{(2)}\), … , \(X_{(n)}\). And the sequence of \(n\) equally spaced probabilities \(p = 1/(n+1)\), \(2/(n+1)\), … , \(n/(n+1)\) runs through the unit interval in a beautifully regular and symmetric way!
Actually, for small \(n\), some people prefer a slightly more complicated sequence. But the idea is always to have \(n\) equally spaced points, the largest as close to \(1\) as the smallest is to \(0\), with this small distance or “offset” smaller than \(1/n\). There are several sensible ways to choose the offset; an offset of \(1/(n+1)\) is not a bad one.
A plot of the ordered observations from a sample of size \(n\) against the points \(\Phi^{-1}(i/(n+1))\), \(i = 1,\dots, n\) is called a normal probability plot. If the plot is close to a straight line \(y = \mu + \sigma x\) then we conclude that the data came from something close to the \(N(\mu,\sigma^2)\) distribution.
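Here is how one might draw such a plot by hand, as a sketch using the sample X from before (the built-in qqnorm does essentially the same thing, with plotting positions supplied by ppoints):

n <- length(X)
plot(qnorm((1:n)/(n + 1)), sort(X),
     xlab = "N(0,1) quantiles", ylab = "ordered observations",
     main = "Normal probability plot")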
If the plot shows clear systematic deviation from a straight line then we know the normality assumption is bad. If we see a convex curve, or a concave curve, then the data come from a distribution which is skewed (one tail is heavier than the other). Convex corresponds to skewness to the right: the very large observations are much too large, the very small observations are not small enough. Concave means skewness to the left. Sometimes a simple transformation of the data can “fix” the problem: the logarithm is concave, so it pulls in a heavy right tail, while its inverse, the exponential function, is convex and does the opposite.
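To see this in action, here is a sketch with simulated lognormal data (right-skewed by construction; the seed and sample size are arbitrary): the raw data give a convex normal probability plot, and taking logarithms straightens it out.

par(mfrow = c(1, 2))
set.seed(54321)
Y <- exp(rnorm(100))  # lognormal sample: skewed to the right
qqnorm(Y, main = "Raw data: convex")
qqnorm(log(Y), main = "After log: straight")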
Generally speaking, it is important to take note of deviations from a normal distribution in either tail. In particular, if the data contain observations that are much too large in absolute value in either tail, one should be worried about possible outliers, and about their impact on sample means and sample variances (both of which are heavily influenced by extreme observations).
If we have two samples, possibly of different sizes, say \(X_1, \dots, X_n\) and \(Y_1, \dots, Y_m\), then it can be very revealing to plot the empirical quantiles of the one distribution against those of the other. Calling the two empirical quantile functions \(Q_n\) and \(R_m\), one plots \(y = R_m(p)\) against \(x = Q_n(p)\) as \(p\) varies from \(0\) to \(1\). In practice, we would just let \(p\) run through the “plotting points” \(p = 1/(n+1)\), \(2/(n+1)\), … , \(n/(n+1)\) or \(p = 1/(m+1)\), \(2/(m+1)\), … , \(m/(m+1)\), or both. If the resulting plot resembles a straight line, then the two distributions appear to be related in a location-scale way. Such a relationship might well be expected for scientific reasons; though possibly only after the observations have been transformed in some way, for instance by taking logarithms (of strictly positive data) or square roots. If we see a straight line with slope +1, then the difference between the two distributions is just a matter of location: one of the distributions is shifted (amount = intercept) relative to the other. If we see a straight line through the origin, then one of the distributions is a scale transformation of the other (amount = slope).
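R's qqplot function does exactly this for two samples (a sketch with simulated data differing only by a location shift; the sample sizes 40 and 60 are arbitrary, and qqplot interpolates the quantiles of the larger sample to match the smaller):

set.seed(999)
x <- rnorm(40)
y <- rnorm(60, mean = 1)  # same scale, shifted by 1
qqplot(x, y, xlab = "x quantiles", ylab = "y quantiles")
abline(a = 1, b = 1, lty = 2)  # slope 1, intercept 1: a pure location shift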
Such plots are easy to make and very quickly give us important information which can guide us in choosing sensible models for the data. They can also help us to describe, in a very concise and scientifically interpretable way, what the difference between the two distributions is.
Note: if you give the statistical program R a data set and ask it for the \(p\)th quantile of the data, for some probability \(p\), then R usually uses interpolation. In other words, if \((i-1)/n < p < i/n\) then R gives you a convex combination of the \((i-1)\)th and \(i\)th ordered observations. In particular, with an even sample size, there is no single “middle” observation, but there are two “middle observations”, and the median of the data is usually taken to be the average of those two. If you do help(qqplot), help(ppoints), and help(quantile) in R you will see that there are a lot of possible choices for how to make these plots, and it is possible to force R to do it “your way” if you don’t like the default. For largish data-sets these issues are not very important.
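A small illustration of the interpolation (a sketch; the toy data \(1, 2, 3, 4\) are arbitrary): with an even sample size the default quantile rule averages the two middle observations, and ppoints shows the plotting positions that qqnorm would use.

quantile(1:4, probs = 0.5, names = FALSE)  # 2.5: halfway between the two middle values
median(1:4)                                # the same
ppoints(5)                                 # default plotting positions for n = 5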