In the Stats I course for psychology freshmen at Bremen University (Germany), we teach two software packages, R and SPSS. Confusion sometimes arises when the two packages produce different results. This may be due to different implementations of a method or to different default settings. One such situation occurs when the QQ-plot is introduced. Below we see two QQ-plots, produced by SPSS and R, respectively. The data used in the plots were generated by:

set.seed(0)
x <- sample(0:9, 100, replace=TRUE)

SPSS

QQ-plot in SPSS using Blom’s method

R

qqnorm(x, datax=TRUE)   # for n > 10 the probabilities are (r - 1/2) / n
qqline(x, datax=TRUE)

There are some obvious differences:

  1. The R plot seems to contain more points than the SPSS plot. This is not actually the case: most points in the SPSS plot are simply plotted on top of each other. The reason is that SPSS uses a different approach to assign probabilities to the values. We will explore the difference below.
  2. The scaling of the y-axis differs. R uses quantiles from the standard normal distribution. SPSS by default rescales these values using the mean and standard deviation of the original data, which makes it possible to compare the original and theoretical values directly.
  3. The QQ-lines are not identical. R uses the 1st and 3rd quartiles of both distributions to draw the line. SPSS instead draws the line through the points where the values on both axes are identical.

QQ-plots from scratch

To get a better understanding of the difference we will build the R and SPSS flavored QQ-plot from scratch.

R type

In order to calculate theoretical quantiles, we first need to find a way to assign a probability to each value of the original data. A lot of different functions exist for this purpose (see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012b). They usually build on the rank order of the data points to calculate the corresponding probabilities. The qqnorm function uses two formulas for this purpose, depending on the number of observations \(n\). With \(r\) being the rank, for \(n > 10\) it will use the formula \(p = (r - 1/2) / n\), for \(n \leq 10\) the formula \(p = (r - 3/8) / (n + 1/4)\) to determine the probability value \(p\) for each observation. In the following we will only implement the \(n > 10\) case.

n <- length(x)          # number of observations
r <- order(order(x))    # order of values, i.e. ranks without averaged ties
p <- (r - 1/2) / n      # assign probabilities to ranks (qqnorm's formula for n > 10)
y <- qnorm(p)           # theoretical standard normal quantiles for p values
plot(x, y)              # plot empirical against theoretical values

The command order(order(x)) returns a rank order for the observations. Unlike the ranks produced by the function rank, no average ranks are calculated for ties. The following code shows the difference between the two.

v <- c(1,1,2,3,3)
order(order(v))
## [1] 1 2 3 4 5
rank(v)
## [1] 1.5 1.5 3.0 4.5 4.5
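As a side note, the same tie-breaking ranks can also be obtained from rank itself through its ties.method argument. A minimal sketch, reusing the vector v from above (r_order and r_first are names introduced here for illustration):

```r
v <- c(1, 1, 2, 3, 3)

# ranks with ties broken by order of appearance, computed in two ways
r_order <- order(order(v))
r_first <- rank(v, ties.method = "first")

# both approaches should agree element by element
all(r_order == r_first)
```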

(Internally, qqnorm uses the function ppoints to generate these probabilities. Type stats:::qqnorm.default into the console to have a look at the code.) Now, let us compare our plot to the plot generated by qqnorm above. They are identical.
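The visual comparison can also be backed up numerically. The sketch below assumes that qqnorm is called with plot.it = FALSE and the default datax = FALSE, in which case the x component of the returned list holds the theoretical quantiles. The name y_manual is introduced here for illustration only.

```r
set.seed(0)
x <- sample(0:9, 100, replace = TRUE)

n <- length(x)
r <- order(order(x))                 # ranks without averaged ties
y_manual <- qnorm((r - 1/2) / n)     # hand-computed theoretical quantiles

qq <- qqnorm(x, plot.it = FALSE)     # compute only, do not draw
all.equal(y_manual, qq$x)            # comparison up to numerical tolerance

# ppoints yields the same probabilities for n > 10
all.equal(ppoints(n), (seq_len(n) - 1/2) / n)
```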

The last point to understand is how the QQ-line is drawn in R. Looking at the probs argument of qqline reveals that it uses the 1st and 3rd quartile of the original data and theoretical distribution to determine the reference points for the line. We will draw the line between the quartiles in red and overlay it with the line produced by qqline to see if our code is correct.

plot(x, y)                      # plot empirical against theoretical values
ps <- c(.25, .75)               # reference probabilities
a <- quantile(x, ps)            # empirical quantiles
b <- qnorm(ps)                  # theoretical quantiles
lines(a, b, lwd=4, col="red")   # our QQ line
qqline(x, datax=T)              # R QQ line

The reason for the different lines in R and SPSS is that several approaches to fitting a line exist (see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012a). Each of them has its relative merits. The method used by R is more robust when we expect values to diverge from normality in the tails and are primarily interested in the normality of the middle range of our data. In other words, how to fit an adequate QQ-line depends on the purpose of the plot. An explanation of the rationale behind the R approach can e.g. be found here.
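The line through the two quartile points can also be written in slope-intercept form, which is convenient if it should extend across the whole plot. A minimal sketch for the datax = TRUE orientation used here (data on the x-axis, theoretical quantiles on the y-axis); the names slope and intercept are introduced for illustration:

```r
set.seed(0)
x <- sample(0:9, 100, replace = TRUE)

ps <- c(.25, .75)                     # reference probabilities
a <- quantile(x, ps, names = FALSE)   # empirical quartiles (x-axis)
b <- qnorm(ps)                        # theoretical quartiles (y-axis)

slope <- diff(b) / diff(a)            # rise over run between the two points
intercept <- b[1] - slope * a[1]

# by construction the line passes through both quartile points
all.equal(intercept + slope * a, b)

# after plot(x, y), abline(intercept, slope) would draw the line
# across the full plotting region
```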

SPSS type

The default SPSS approach differs from the one above in the following aspects:

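  1. SPSS uses ranks with averaged ties, not the plain order index as in R, to derive the corresponding probabilities. The code below otherwise reuses the formula \(p = (r - 1/2) / n\) from above. Strictly speaking, Blom's formula is \(p = (r - 3/8) / (n + 1/4)\), but for data like ours the two formulas give nearly the same values. I am not sure whether SPSS distinguishes between \(n \leq 10\) and \(n > 10\).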
  2. The theoretical quantiles are scaled to match the estimated mean and standard deviation of the original data.
  3. The QQ-line is the identity line, passing through all points where the values on the x- and y-axis are equal.

n <- length(x)                # number of observations
r <- rank(x)                  # a) ranks instead of the order index
p <- (r - 1/2) / n            # assign probabilities to ranks, same formula as in the R version
y <- qnorm(p)                 # theoretical standard normal quantiles for p values
y <- y * sd(x) + mean(x)      # b) rescale standard normal quantiles to mean and sd of the original data
plot(x, y)                    # plot empirical against theoretical values

Lastly, let us add the line. As the scaling of both axes is the same, the line goes through the origin with a slope of 1.

abline(0, 1)                  # c) identity line: intercept 0, slope 1

The comparison to the SPSS output shows that they are (visually) identical.
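A note on naming: the formula \((r - 1/2) / n\) used in the code above is often referred to as Hazen's or the Rankit formula, while Blom's formula is usually given as \((r - 3/8) / (n + 1/4)\), the one qqnorm uses for \(n \leq 10\). The sketch below compares the rescaled quantiles under both formulas for the data used here, so the size of the difference can be checked directly. All variable names are chosen for illustration only.

```r
set.seed(0)
x <- sample(0:9, 100, replace = TRUE)

n <- length(x)
r <- rank(x)                                # averaged ranks for ties

p_half <- (r - 1/2) / n                     # formula used in the code above
p_blom <- (r - 3/8) / (n + 1/4)             # Blom's formula

# rescale both sets of standard normal quantiles to the data scale
y_half <- qnorm(p_half) * sd(x) + mean(x)
y_blom <- qnorm(p_blom) * sd(x) + mean(x)

max(abs(y_half - y_blom))                   # largest discrepancy on the data scale
```

If the largest discrepancy is small relative to the range of the data, the two variants will be hard to tell apart in a plot.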

What else?

The whole point of this demonstration was to pinpoint and explain the differences between the two plots, so that they no longer cause confusion. Note, however, that SPSS offers a whole range of options for the plot. You can select the method used to assign probabilities to the values, how to treat ties, etc. Everything above refers to the default case (Blom’s method and averaging across ties). Personally, I like the SPSS version, which is why I implemented the function qqnorm_spss in the ryouready package that accompanies the course. It is a preliminary version, though, and has not yet been thoroughly tested. You can find the code here. Suggestions, improvements etc. are welcome.

library(ryouready)
library(ggplot2)
qq <- qqnorm_spss(x)
plot(qq)

ggplot(qq)

Literature

Castillo-Gutiérrez, S., Lozano-Aguilera, E. D., & Estudillo-Martínez, M. D. (2012a). A New Proposal to Adjust a Straight Line to a Normal Q-Q Plot. Journal of Mathematics and System Science, 2(5), 327–333.

Castillo-Gutiérrez, S., Lozano-Aguilera, E., & Estudillo-Martínez, M. D. (2012b). Selection of a Plotting Position for a Normal Q-Q Plot. R Script. Journal of Communication and Computer, 9(3), 243–250.