In the Stats I course for psychology freshmen at Bremen University (Germany), we teach two software packages, R and SPSS. Confusion sometimes arises when the two packages produce different results. This may be due to different implementations of a method or different default settings. One of these situations occurs when the QQ-plot is introduced. Below we see two QQ-plots, produced by SPSS and R, respectively. The data used in the plots was generated by:
set.seed(0)
x <- sample(0:9, 100, replace = TRUE)
SPSS
R
qqnorm(x, datax = TRUE) # probabilities come from ppoints(); for n > 10 these are (r - 1/2) / n
qqline(x, datax = TRUE)
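There are some obvious differences: the R plot shows tied values as separate points while the SPSS plot stacks them on a single point, the theoretical axis is scaled differently, and the reference lines are not the same.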
To get a better understanding of the difference we will build the R and SPSS flavored QQ-plot from scratch.
In order to calculate theoretical quantiles, we first need to find a way to assign a probability to each value of the original data. A lot of different functions exist for this purpose (see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012b). They usually build on the rank order of the data points to calculate the corresponding p-values. The qqnorm function uses two formulas for this purpose, depending on the number of observations \(n\). With \(r\) being the rank, it uses \(p = (r - 1/2) / n\) for \(n > 10\) and \(p = (r - 3/8) / (n + 1/4)\) for \(n \leq 10\) to determine the probability value \(p\) for each observation. In the following we will only implement the \(n > 10\) case.
n <- length(x) # number of observations
r <- order(order(x)) # order of values, i.e. ranks without averaged ties
p <- (r - 1/2) / n # assign probabilities to ranks, as ppoints() does for n > 10
y <- qnorm(p) # theoretical standard normal quantiles for p values
plot(x, y) # plot empirical against theoretical values
The command order(order(x)) returns a rank order for the observations. It differs from the ranks produced by the function rank in that no average ranks are calculated for ties. The following code shows the difference between the two.
v <- c(1,1,2,3,3)
order(order(v))
## [1] 1 2 3 4 5
rank(v)
## [1] 1.5 1.5 3.0 4.5 4.5
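As a small cross-check of the comparison above: order(order(v)) should give the same result as rank() with ties.method = "first", which numbers tied values in their order of appearance. The second vector w is made up for illustration and only serves to show that this also holds when the ties are not adjacent.

```r
v <- c(1, 1, 2, 3, 3)

r_order <- order(order(v))               # rank order without averaged ties
r_first <- rank(v, ties.method = "first") # ties numbered by order of appearance
all(r_order == r_first)
## [1] TRUE

# illustrative vector with non-adjacent ties
w <- c(3, 1, 3, 2, 1)
order(order(w))
## [1] 4 1 5 3 2
rank(w, ties.method = "first")
## [1] 4 1 5 3 2
```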
(Internally, qqnorm uses the function ppoints to generate the p-values. Type stats:::qqnorm.default into the console to have a look at the code.) Now, let us compare our plot to the plot generated by qqnorm above. They are identical.
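Instead of comparing the plots by eye, the claim can also be checked numerically. The following sketch compares ppoints() with the two formulas given earlier and then compares our hand-built theoretical quantiles with the values that qqnorm returns when plotting is switched off. The variable names are chosen for illustration only.

```r
set.seed(0)
x <- sample(0:9, 100, replace = TRUE)
n <- length(x)

# ppoints() against the two closed-form expressions
p_large <- (seq_len(n) - 1/2) / n             # formula for n > 10
p_small <- (seq_len(10) - 3/8) / (10 + 1/4)   # formula for n <= 10, here n = 10
same_large <- isTRUE(all.equal(ppoints(n), p_large))
same_small <- isTRUE(all.equal(ppoints(10), p_small))

# hand-built theoretical quantiles against qqnorm()
y_manual <- qnorm((order(order(x)) - 1/2) / n)
y_qqnorm <- qnorm(x, plot.it = FALSE)$x      # $x holds the theoretical quantiles
same_quantiles <- isTRUE(all.equal(y_manual, y_qqnorm))

c(same_large, same_small, same_quantiles)
```

With plot.it = FALSE and the default datax = FALSE, the theoretical quantiles are returned in the x component and the data in the y component.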
The last point to understand is how the QQ-line is drawn in R. Looking at the probs argument of qqline reveals that it uses the 1st and 3rd quartiles of the original data and of the theoretical distribution as reference points for the line. We will draw the line between the quartiles in red and overlay it with the line produced by qqline to see if our code is correct.
plot(x, y) # plot empirical against theoretical values
ps <- c(.25, .75) # reference probabilities
a <- quantile(x, ps) # empirical quantiles
b <- qnorm(ps) # theoretical quantiles
lines(a, b, lwd=4, col="red") # our QQ line
qqline(x, datax = TRUE) # R QQ line
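The red segment only spans the two quartile points. A minimal sketch of how the full line can be obtained from them: compute slope and intercept from the two reference points and hand them to abline. This mirrors what the code above does by hand and is not a copy of the qqline source. The plotting part is wrapped in interactive() so the calculation also runs in a script.

```r
set.seed(0)
x <- sample(0:9, 100, replace = TRUE)

ps <- c(0.25, 0.75)                     # reference probabilities
a <- quantile(x, ps, names = FALSE)     # empirical quartiles (x-axis, as datax = TRUE)
b <- qnorm(ps)                          # theoretical quartiles (y-axis)

slope <- diff(b) / diff(a)              # rise over run between the two points
intercept <- b[1] - slope * a[1]        # line passes through (a[1], b[1])

if (interactive()) {
  y <- qnorm((order(order(x)) - 1/2) / length(x))
  plot(x, y)
  abline(intercept, slope, lwd = 4, col = "red") # line from our two points
  qqline(x, datax = TRUE)                        # R QQ line for comparison
}
```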
The reason for the different lines in R and SPSS is that several approaches to fitting a line exist (see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012a). Each of them has its relative merits. The method used by R is more robust when we expect values to diverge from normality in the tails and we are primarily interested in the normality of the middle range of our data. In other words, how to fit an adequate QQ-line depends on the purpose of the plot. An explanation of the rationale behind the R approach can be found here, for example.
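The default SPSS approach differs from the one above in three aspects, marked a) to c) in the code comments: a) tied values receive averaged ranks, so they share one theoretical quantile and are plotted on top of each other; b) the standard normal quantiles are rescaled to the mean and standard deviation of the original data; c) the reference line is the identity line through the origin.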
n <- length(x) # number of observations
r <- rank(x) # a) averaged ranks for ties instead of the order index
p <- (r - 1/2) / n # same formula as above (SPSS labels it Rankit; its Blom formula is (r - 3/8) / (n + 1/4))
y <- qnorm(p) # theoretical standard normal quantiles for p values
y <- y * sd(x) + mean(x) # b) rescale standard normal quantiles to the mean and sd of the original data
plot(x, y) # plot empirical against theoretical values
Lastly, let us add the line. As the scaling of both axes is the same, the line goes through the origin with a slope of 1.
abline(0, 1) # c) line through the origin (intercept 0) with slope 1
The comparison to the SPSS output shows that they are (visually) identical.
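To make the effect of the averaged ranks visible without the plot, the steps above can be wrapped in a small helper. The name qq_spss_sketch is hypothetical and only used here. It is not the qqnorm_spss function from the ryouready package. The sketch counts how many distinct points each approach would draw.

```r
# hypothetical helper bundling steps a) and b) from above
qq_spss_sketch <- function(x) {
  n <- length(x)
  r <- rank(x)                      # a) averaged ranks for ties
  p <- (r - 1/2) / n                # same probability formula as in the text
  z <- qnorm(p)                     # standard normal quantiles
  data.frame(observed = x,
             expected = z * sd(x) + mean(x))  # b) rescaled quantiles
}

set.seed(0)
x <- sample(0:9, 100, replace = TRUE)
qq <- qq_spss_sketch(x)

# distinct expected values per observed value: 1 means the ties are stacked
per_value <- tapply(qq$expected, qq$observed, function(v) length(unique(v)))
all(per_value == 1)

# the R flavor, in contrast, gives every observation its own quantile
y_r <- qnorm((order(order(x)) - 1/2) / length(x))
length(unique(y_r))
```

Because all observations with the same value get the same averaged rank, the SPSS flavor draws only one point per distinct value, which is why that plot seems to contain fewer points.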
The whole point of this demonstration was to pinpoint and explain the differences between the two plots, so that they will no longer be a reason for confusion. Note, however, that SPSS offers a whole range of options for the plot. You can select the method used to assign probabilities, how to treat ties, etc. All the above referred to the default case (Blom’s method and averaging across ties). Personally I like the SPSS version, which is why I implemented the function qqnorm_spss in the ryouready package that accompanies the course. It is a preliminary version, though, and has not yet been thoroughly tested. You can find the code here. Suggestions, improvements etc. are welcome.
library(ryouready)
library(ggplot2)
qq <- qqnorm_spss(x)
plot(qq)
ggplot(qq)
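Regarding the SPSS options for assigning probabilities mentioned above, here is a rough sketch of four formulas as they are commonly defined (Blom, Rankit, Tukey, Van der Waerden). The helper plotting_positions and its method labels are made up for illustration and are not part of ryouready or SPSS. Please check the SPSS documentation for the exact definitions before relying on them.

```r
# hypothetical helper; formulas as commonly cited in the literature
plotting_positions <- function(r, n,
                               method = c("blom", "rankit", "tukey", "vdw")) {
  method <- match.arg(method)
  switch(method,
         blom   = (r - 3/8) / (n + 1/4),
         rankit = (r - 1/2) / n,
         tukey  = (r - 1/3) / (n + 1/3),
         vdw    = r / (n + 1))       # Van der Waerden
}

set.seed(0)
x <- sample(0:9, 100, replace = TRUE)
n <- length(x)
r <- rank(x)                          # averaged ranks for ties

methods <- c("blom", "rankit", "tukey", "vdw")
p_all <- sapply(methods, function(m) plotting_positions(r, n, m))

# largest spread between the four formulas for any single observation
max(apply(p_all, 1, function(row) diff(range(row))))
```

Each column of p_all can be passed through qnorm and rescaled as shown earlier to see how much the choice of formula moves the points.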
Castillo-Gutiérrez, S., Lozano-Aguilera, E. D., & Estudillo-Martínez, M. D. (2012a). A New Proposal to Adjust a Straight Line to a Normal Q-Q Plot. Journal of Mathematics and System Science, 2(5), 327–333.
Castillo-Gutiérrez, S., Lozano-Aguilera, E., & Estudillo-Martínez, M. D. (2012b). Selection of a Plotting Position for a Normal Q-Q Plot. R Script. Journal of Communication and Computer, 9(3), 243–250.