Michael Hunt
Cornwall College Newquay
09-02-2021

Adapted from an exercise by Jon Yearsley (School of Biology and Environmental Science, UCD)

Introduction

Q-Q (quantile-quantile) plots can play a useful role when trying to decide whether a dataset is normally distributed and, if it is not, how it differs from normality.

We will investigate the types of quantile-quantile plots you get from different types of distributions.

We will look at data distributed according to: the normal distribution, a right-skewed (exponential) distribution, a left-skewed (negative exponential) distribution, an under-dispersed (uniform) distribution, and an over-dispersed (Laplace) distribution.

What is a Q-Q plot?

Quantiles partition a dataset into equal subsets. For example, if we wished to partition a standard normal (mean = 0, standard deviation = 1) population into 4 equal subsets, the 3 quantiles (i.e. the three values of x) that would do this are -0.675, 0 and 0.675. In this way, 25% of the population would have a value greater than 0.675, 25% between 0 and 0.675, 25% between -0.675 and 0, and the final 25% would have a value less than -0.675. When we draw the distribution, the areas under the curve between these quantiles are equal.
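These quartiles are easy to check numerically. A minimal sketch in Python, assuming NumPy/SciPy are available (the choice of software is ours, for illustration only):

    import scipy.stats as stats

    # The three quantiles that split a standard normal population
    # into four equal parts (25% of the area under the curve in each).
    quartiles = stats.norm.ppf([0.25, 0.50, 0.75])
    print(quartiles)  # approximately [-0.6745  0.      0.6745]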

Normally distributed data

Below we show an example of 150 observations drawn from a normal distribution. The normal distribution is symmetric, so it has no skew, and its mean is equal to its median.

On a Q-Q plot, normally distributed data lie roughly on a straight line, perhaps looking a bit ragged at each end. The box plot is symmetric with few or no outliers.
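A sketch of how such a sample and its plots could be generated, in Python (assuming numpy, scipy and matplotlib; the seed and parameter values are arbitrary choices):

    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)            # arbitrary seed
    x = rng.normal(loc=0, scale=1, size=150)   # 150 normal observations

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    stats.probplot(x, dist="norm", plot=axes[0])  # Q-Q plot against the normal
    axes[1].boxplot(x)                            # box plot of the same sample
    plt.show()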

Right-skewed data

Right-skewed distributions are non-symmetric and have a long tail heading towards extreme values on the right-hand side of the distribution. The mean is greater than the median.

In the example we show an exponential distribution.

In the Q-Q plot, such distributions give a distinctive convex curvature. The box plot may show outliers towards large values.
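A sketch of one way to produce such an example (same assumed Python stack as above):

    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=1.0, size=150)   # right-skewed sample

    stats.probplot(x, dist="norm", plot=plt)   # points bend upwards: convex
    plt.show()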

Left-skewed data

Left-skewed distributions are non-symmetric and have a long tail heading towards extreme values on the left-hand side of the distribution. The mean is less than the median. The box plot may show outliers down towards small values.

In the example we show a negative exponential distribution (an exponential distribution reflected about zero, so the long tail points left).

In the Q-Q plot, such distributions give a distinctive concave curvature.
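One way to sketch this is simply to negate exponential draws (again assuming the same Python stack):

    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    x = -rng.exponential(scale=1.0, size=150)  # negation flips the tail left

    stats.probplot(x, dist="norm", plot=plt)   # points bend downwards: concave
    plt.show()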

Under-dispersed data

Under-dispersed data are data whose distribution is more concentrated around a central value than is the case for normally distributed data. There are fewer outliers and the tails of the distribution are lighter. As an example, here we show 150 points drawn from a uniform distribution.

Note the distinctive curvature of the Q-Q plot, which flattens at both ends because the extreme sample values are less extreme than the normal distribution would predict. The ‘box’ of the box plot is bigger than for a normal distribution, since the interquartile range covers a larger fraction of the data’s overall spread.
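A sketch of the uniform example (same assumed stack; the interval is an arbitrary choice):

    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    x = rng.uniform(low=-1, high=1, size=150)  # light tails, no extreme values

    stats.probplot(x, dist="norm", plot=plt)   # S-shape, flattening at the ends
    plt.show()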

Over-dispersed data

Over-dispersed data are data whose distribution is more widely spread around a central value than is the case for normally distributed data. There are more outliers and the tails of the distribution are fatter. As an example, here we show 150 points drawn from a Laplace distribution.

Note the distinctive curvature of the Q-Q plot - like the previous one but curving the other way, steepening at both ends because the extreme sample values are more extreme than the normal distribution would predict. The ‘box’ of the box plot is smaller than for a normal distribution, since the interquartile range covers a smaller fraction of the data’s overall spread.
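And a sketch of the Laplace example (NumPy provides a Laplace sampler directly):

    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    x = rng.laplace(loc=0, scale=1, size=150)  # heavy-tailed sample

    stats.probplot(x, dist="norm", plot=plt)   # steepens at both ends
    plt.show()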

Caution

With small data sets, the scatter in the data can make it difficult to tell from the histogram or the Q-Q plot whether the dataset is normally distributed. In that case you need to combine the plots with a formal normality test, such as the Kolmogorov-Smirnov or Shapiro-Wilk test. The null hypothesis of these tests is that the data ARE normally distributed, so the smaller the p-value when they are applied to a dataset, the less likely it is that the data have been drawn from a normal distribution.
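For example, the Shapiro-Wilk test is available in SciPy (a sketch; the sample here is just illustrative):

    import numpy as np
    import scipy.stats as stats

    rng = np.random.default_rng(5)
    x = rng.normal(size=25)                  # a small, genuinely normal sample

    stat, p = stats.shapiro(x)               # H0: the data ARE normal
    print(f"Shapiro-Wilk p-value: {p:.3f}")  # a small p is evidence against normality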

With large data sets, the Kolmogorov-Smirnov and Shapiro-Wilk tests become very sensitive to even small deviations from normality and might give a p-value that would lead you to suppose that a dataset was not normally distributed. Since no data set is ever truly normal, all we really need to know is whether the data are close enough to normal that the various tests (e.g. t-test, ANOVA, correlation, least-squares regression) that require it are going to work well enough. For these large data sets, histograms and Q-Q plots can be very useful indicators of approximate normality.

Who cares about normality anyway? The central limit theorem.

Lastly, for large enough data sets, we don’t actually need the data to be normally distributed for the tests that require normality to work! This is because what they require is not that the dataset itself be normal, but that the distribution of the means of many such data sets, the so-called sampling distribution of the mean, be normal. A very important mathematical result known as the Central Limit Theorem guarantees that this will be the case whatever the distribution of each of the data sets, as long as these datasets are large enough!

How large is large enough? There’s the rub! A common rule of thumb is that if the dataset has size N > 30 or so, then it is safe to use tests that require normality. Indeed, one does find that sampling distributions for data drawn from uniform or mildly skewed distributions such as the exponential are roughly normal once N exceeds 30 or so, but for more strongly skewed distributions a larger dataset can be needed - it depends on how far from normality the distribution is. The further from normal it is, the larger the dataset needs to be before the Central Limit Theorem applies to a good approximation. For a highly skewed dataset, for example one distributed according to something like a log-normal distribution, it can require N > 200 or so, or even more, before it is OK to use t-tests and the like.
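The Central Limit Theorem is easy to see in action. A sketch (same assumed Python stack): draw many samples of size N from a skewed distribution and Q-Q plot the sample means.

    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(6)
    N = 30                                   # size of each individual sample

    # 10,000 samples of size N from a right-skewed (exponential) distribution,
    # reduced to their means: this approximates the sampling distribution.
    means = rng.exponential(scale=1.0, size=(10_000, N)).mean(axis=1)

    stats.probplot(means, dist="norm", plot=plt)  # close to a straight line
    plt.show()

Re-running this with a smaller N, or with a more strongly skewed distribution such as the log-normal, shows the straight line breaking down.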