In class we saw how quantile-quantile plots can be used to visually assess if data are consistent with a normal distribution. In this write-up, I’ll describe how you could compute one “by hand”.
Rather than work with the large data set we saw in class, I’ll do this work with a toy data set of just six observations.
data <- c(282,289,313,318,347,355)
A quantile is a cutoff value such that a specified fraction of a batch of numbers is less than that value. One quantile that you are familiar with is the median: a number such that 1/2 of the batch is less than it. The quartiles are another example: they are the values such that 1/4, 1/2, and 3/4 of the batch are less than them.
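To make the definition concrete, here is a small sketch using base R on the same six numbers. The name `toy` is just for illustration. Note that `quantile()` uses its own default interpolation rule (type 7), which is one of several algorithms and will not always agree with other methods.

```r
# Toy data, redefined here so this block runs on its own.
toy <- c(282, 289, 313, 318, 347, 355)

# The median sits halfway between the 3rd and 4th sorted values.
med <- median(toy)
med

# Fraction of the batch that is less than the median.
mean(toy < med)

# Quartiles, using R's default quantile algorithm (type 7).
quartiles <- quantile(toy, probs = c(0.25, 0.5, 0.75))
quartiles
```

With an even number of observations, no single observation is the median, so R averages the two middle values.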
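There are actually a number of algorithms for determining the quantiles of a batch of numbers. Here I’ll implement a simple one. We start by sorting the observations from smallest to largest and calculating a number called the \(f\) value for the observation at rank \(i\), where \(n\) is the number of observations:
\[f_i = \frac{i - 0.5}{n}\] The \(f\) values always lie strictly between 0 and 1, and an observation with an \(f\) value of 0.5 (which exists only when \(n\) is odd) is the median.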
# tibble(), %>% and ggplot() used below come from the tidyverse.
library(tidyverse)
# If what is happening in this code block is unclear,
# try copying and pasting individual parts into the console to see what happens.
num_obs <- length(data)
f_val <- (1:num_obs - 0.5)/num_obs
data_with_f <- tibble(data, f_val)
data_with_f
## # A tibble: 6 x 2
## data f_val
## <dbl> <dbl>
## 1 282 0.0833
## 2 289 0.25
## 3 313 0.417
## 4 318 0.583
## 5 347 0.75
## 6 355 0.917
Now that we have the \(f\) values (the fraction of the batch associated with each observation), we can find the values of a normal distribution that divide the distribution into those same fractions. In this case, we want the values such that 0.0833, 0.25, 0.417, 0.583, 0.75, and 0.917 of the normal distribution is less than each value. You can do this with any normal distribution; R (and most other software) uses the standard normal. We can find these values using the function qnorm:
qnorm(f_val)
## [1] -1.3829941 -0.6744898 -0.2104284 0.2104284 0.6744898 1.3829941
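The sketch below illustrates two points about `qnorm`: it is the inverse of `pnorm`, and quantiles of any other normal distribution are just a shifted and rescaled copy of the standard normal ones. The values of `mu` and `sigma` are made up for illustration and are not estimated from the data.

```r
# A few fractions to look up.
f <- c(0.25, 0.5, 0.75)

# Standard normal quantiles for those fractions.
z <- qnorm(f)

# pnorm() goes the other way: it recovers the fractions from the quantiles.
pnorm(z)

# Hypothetical mean and standard deviation, chosen only for illustration.
mu <- 320
sigma <- 30

# Quantiles of a normal with that mean and sd ...
q_general <- qnorm(f, mean = mu, sd = sigma)

# ... are a linear rescaling of the standard normal quantiles.
all.equal(q_general, mu + sigma * z)
```

Because the relationship is linear, choosing a different normal distribution only relabels the axis of the plot; it does not change whether the points fall along a straight line.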
standard_norm_quants <- qnorm(f_val)
data_with_f_norm <- tibble(data_with_f, standard_norm_quants)
data_with_f_norm
## # A tibble: 6 x 3
## data f_val standard_norm_quants
## <dbl> <dbl> <dbl>
## 1 282 0.0833 -1.38
## 2 289 0.25 -0.674
## 3 313 0.417 -0.210
## 4 318 0.583 0.210
## 5 347 0.75 0.674
## 6 355 0.917 1.38
Now that we know the values of the standard normal that sit at the same quantiles as our data, we can make a plot:
data_with_f_norm %>%
ggplot(aes(x = standard_norm_quants, y = data)) +
geom_point()
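The steps so far can be collected into one small function. This is a minimal sketch in base R, and `qq_by_hand` is a hypothetical name, not a function from any package. It sorts the data first, because the \(f\) value formula assumes \(i\) is the rank of the observation; the toy data above just happened to be sorted already.

```r
# Hypothetical helper: returns the by-hand Q-Q table for a numeric vector.
qq_by_hand <- function(x) {
  x <- sort(x)                       # rank the observations, smallest first
  n <- length(x)
  f_val <- (seq_len(n) - 0.5) / n    # f value for each rank
  data.frame(
    data = x,
    f_val = f_val,
    standard_norm_quants = qnorm(f_val)  # matching standard normal quantiles
  )
}

# Same six numbers as the toy data, deliberately shuffled.
qq_tab <- qq_by_hand(c(347, 282, 355, 313, 289, 318))
qq_tab
```

The resulting data frame has the same three columns as the tibble built above, so it could be passed to the same ggplot call.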
We can compare the plot above to the “automatically” generated Q-Q plot:
data_with_f_norm %>%
ggplot(aes(sample = data)) +
geom_point(stat = "qq")
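The two plots may not match point for point, because software often uses a slightly different formula for the \(f\) values. As a sketch of this, base R's `qqnorm()` takes its fractions from `ppoints()`, which for ten or fewer observations uses \((i - 3/8)/(n + 1/4)\) instead of \((i - 0.5)/n\). I am assuming, without having checked the ggplot2 source here, that the "qq" stat behaves the same way.

```r
# Toy data, redefined here so this block runs on its own.
toy <- c(282, 289, 313, 318, 347, 355)
n <- length(toy)

# Fractions from the by-hand formula and from R's default plotting positions.
f_by_hand <- (1:n - 0.5) / n
f_ppoints <- ppoints(n)

# qqnorm() returns its theoretical quantiles in $x;
# plot.it = FALSE skips drawing the plot.
auto <- qqnorm(toy, plot.it = FALSE)

# Side-by-side comparison of the two sets of quantiles.
cbind(f_by_hand, f_ppoints,
      q_by_hand = qnorm(f_by_hand),
      q_auto = auto$x)
```

The differences are largest at the two ends and shrink as the number of observations grows, so the overall shape of the plot is unaffected.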
For more information, see this detailed discussion by Manny Gimond: https://mgimond.github.io/ES218/Week05a.html#Quantile_plots