Quantiles
Quantiles are cutoff points that divide a dataset into intervals with set probabilities. The qth quantile is the value at or below which a proportion q of the observations fall; for example, 25% of the observations are equal to or less than the 0.25 quantile.
Using the quantile function
Given a dataset data and a desired quantile q, you can find the qth quantile of data with:
# quantile(data,q)
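As a minimal sketch of this call, the example below uses a small made-up vector (the name x and its values are illustrative, not from the course data) and checks the result against the definition. It assumes R's default quantile algorithm, which interpolates between neighboring observations.

```r
# Hypothetical toy data, not from the heights dataset
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# 0.5 quantile (the median) using quantile()'s default settings
q50 <- quantile(x, 0.5)
q50

# Check the definition: proportion of observations at or below q50
prop_below <- mean(x <= q50)
prop_below
```

Note that q is passed as a proportion between 0 and 1, not as a percentage.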
Percentiles
Percentiles are the quantiles that divide a dataset into 100 intervals each with 1% probability. You can determine all percentiles of a dataset data like this:
# p <- seq(0.01, 0.99, 0.01)
# quantile(data, p)
Quartiles
Quartiles divide a dataset into 4 parts each with 25% probability. They are equal to the 25th, 50th and 75th percentiles. The 25th percentile is also known as the 1st quartile, the 50th percentile is also known as the median, and the 75th percentile is also known as the 3rd quartile.
The summary() function returns the minimum, quartiles, mean and maximum of a numeric vector.
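The relationship between quartiles and summary() can be sketched on a small made-up vector (x here is hypothetical). Under R's default quantile settings, the quartiles requested through quantile() should agree with the ones summary() reports.

```r
# Hypothetical toy data
x <- 1:9

# Quartiles requested directly as the 0.25, 0.50 and 0.75 quantiles
quartiles <- quantile(x, c(0.25, 0.50, 0.75))
quartiles

# summary() reports the same three values, plus the minimum, mean and maximum
s <- summary(x)
s
```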
Examples
Load the heights dataset from the dslabs package:
library(dslabs)
data(heights)
Use summary() on the heights$height variable to find the quartiles:
summary(heights$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   50.00   66.00   68.50   68.32   71.00   82.68
Find the percentiles of heights$height:
p <- seq(0.01, 0.99, 0.01)
percentiles <- quantile(heights$height, p)
Confirm that the 25th and 75th percentiles match the 1st and 3rd quartiles. Note that quantile() returns a named vector. You can access the 25th and 75th percentiles like this (adapt the code for other percentile values):
percentiles[names(percentiles) == "25%"]
## 25%
## 66
percentiles[names(percentiles) == "75%"]
## 75%
## 71
Definition of qnorm
The qnorm() function gives the theoretical value of a quantile with probability p of observing a value equal to or less than that quantile value given a normal distribution with mean mu and standard deviation sigma:
# qnorm(p, mu, sigma)
By default, mu=0 and sigma=1. Therefore, calling qnorm() with only the probabilities p, and no mean or standard deviation, gives quantiles for the standard normal distribution.
qnorm(p)
## [1] -2.32634787 -2.05374891 -1.88079361 -1.75068607 -1.64485363 -1.55477359
## [7] -1.47579103 -1.40507156 -1.34075503 -1.28155157 -1.22652812 -1.17498679
## [13] -1.12639113 -1.08031934 -1.03643339 -0.99445788 -0.95416525 -0.91536509
## [19] -0.87789630 -0.84162123 -0.80642125 -0.77219321 -0.73884685 -0.70630256
## [25] -0.67448975 -0.64334541 -0.61281299 -0.58284151 -0.55338472 -0.52440051
## [31] -0.49585035 -0.46769880 -0.43991317 -0.41246313 -0.38532047 -0.35845879
## [37] -0.33185335 -0.30548079 -0.27931903 -0.25334710 -0.22754498 -0.20189348
## [43] -0.17637416 -0.15096922 -0.12566135 -0.10043372 -0.07526986 -0.05015358
## [49] -0.02506891 0.00000000 0.02506891 0.05015358 0.07526986 0.10043372
## [55] 0.12566135 0.15096922 0.17637416 0.20189348 0.22754498 0.25334710
## [61] 0.27931903 0.30548079 0.33185335 0.35845879 0.38532047 0.41246313
## [67] 0.43991317 0.46769880 0.49585035 0.52440051 0.55338472 0.58284151
## [73] 0.61281299 0.64334541 0.67448975 0.70630256 0.73884685 0.77219321
## [79] 0.80642125 0.84162123 0.87789630 0.91536509 0.95416525 0.99445788
## [85] 1.03643339 1.08031934 1.12639113 1.17498679 1.22652812 1.28155157
## [91] 1.34075503 1.40507156 1.47579103 1.55477359 1.64485363 1.75068607
## [97] 1.88079361 2.05374891 2.32634787
Recall that quantiles are defined such that p is the probability of a random observation less than or equal to the quantile.
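To illustrate how the defaults relate to a general normal distribution, the sketch below compares qnorm() with and without a mean and standard deviation. The values 69 and 3 are illustrative choices, and the variable names are hypothetical.

```r
p <- c(0.025, 0.5, 0.975)

# Quantiles of the standard normal distribution (mean 0, sd 1)
z <- qnorm(p)

# Quantiles of a normal distribution with mean 69 and sd 3
# (illustrative values)
q <- qnorm(p, mean = 69, sd = 3)

# The second set is the first one rescaled: mu + sigma * z
rescaled <- 69 + 3 * z
all.equal(q, rescaled)
```

This is why quantiles of the standard normal distribution are enough to describe quantiles of any normal distribution.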
Relation to pnorm
The pnorm() function gives the probability that a value from a standard normal distribution will be less than or equal to a z-score value z. Consider:
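pnorm(-1.96) ≈ 0.025
The result of pnorm() is a probability: about 2.5% of a standard normal distribution lies at or below -1.96, so -1.96 is the 0.025 quantile. Note that:
qnorm(0.025) ≈ −1.96
qnorm() and pnorm() are inverse functions:
pnorm(qnorm(0.025)) = 0.025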
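The inverse relationship can be checked numerically in both directions. This is a small sketch with illustrative probabilities and z-scores; all.equal() is used because the round trip is exact only up to floating-point error.

```r
# Probabilities to round-trip through qnorm() and pnorm()
p <- c(0.025, 0.25, 0.5, 0.75, 0.975)

# Probability -> quantile -> probability
p_back <- pnorm(qnorm(p))
all.equal(p_back, p)

# Quantile -> probability -> quantile
z <- c(-1.96, 0, 1.96)
z_back <- qnorm(pnorm(z))
all.equal(z_back, z)

# The same holds for a non-standard normal distribution, as long as
# both calls use the same mean and sd (illustrative values)
p_back_general <- pnorm(qnorm(p, mean = 69, sd = 3), mean = 69, sd = 3)
all.equal(p_back_general, p)
```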
Theoretical quantiles
You can use qnorm() to determine the theoretical quantiles of a dataset: that is, the theoretical value of quantiles assuming that a dataset follows a normal distribution. Run the qnorm() function with the desired probabilities p, mean mu and standard deviation sigma.
Suppose male heights follow a normal distribution with a mean of 69 inches and standard deviation of 3 inches. The theoretical quantiles are:
p <- seq(0.01, 0.99, 0.01)
theoretical_quantiles <- qnorm(p, 69, 3)
Theoretical quantiles can be compared to sample quantiles determined with the quantile function in order to evaluate whether the sample follows a normal distribution.
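A rough sketch of that comparison, using simulated data in place of a real sample: the vector sim_heights, the seed, and the sample size are assumptions made for illustration. Because the data are drawn from the normal distribution described above, the sample quantiles should land close to the theoretical ones.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Simulated sample from a normal distribution with mean 69 and sd 3
sim_heights <- rnorm(1000, mean = 69, sd = 3)

p <- seq(0.05, 0.95, 0.05)
sample_quantiles <- quantile(sim_heights, p)
theoretical_quantiles <- qnorm(p, mean = 69, sd = 3)

# Largest gap between sample and theoretical quantiles, in inches
max_gap <- max(abs(sample_quantiles - theoretical_quantiles))
max_gap
```

A small gap relative to the standard deviation is consistent with normality; for data that are far from normal, the gaps tend to grow, especially in the tails.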
Key points
Quantile-quantile plots, or QQ-plots, are used to check whether distributions are well-approximated by a normal distribution.
Given a proportion p, the quantile q is the value such that the proportion of values in the data at or below q is p.
In a QQ-plot, the sample quantiles in the observed data are compared to the theoretical quantiles expected from the normal distribution. If the data are well-approximated by the normal distribution, then the points on the QQ-plot will fall near the identity line (sample = theoretical).
Calculate sample quantiles (observed quantiles) using the quantile() function.
Calculate theoretical quantiles with the qnorm() function. qnorm() will calculate quantiles for the standard normal distribution (μ=0, σ=1) by default, but it can calculate quantiles for any normal distribution given the mean and sd arguments. We will learn more about qnorm() in the probability course.
Note that we will learn alternate ways to make QQ-plots with less code later in the series.
Code
# define x and z
library(tidyverse)
library(dslabs)
data(heights)
index <- heights$sex=="Male"
x <- heights$height[index]
z <- scale(x)
# proportion of data below 69.5
mean(x <= 69.5)
## [1] 0.5147783
# calculate observed and theoretical quantiles
p <- seq(0.05, 0.95, 0.05)
observed_quantiles <- quantile(x, p)
theoretical_quantiles <- qnorm(p, mean = mean(x), sd = sd(x))
# make QQ-plot
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)
# make QQ-plot with scaled values
observed_quantiles <- quantile(z, p)
theoretical_quantiles <- qnorm(p)
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)
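Base R also provides qqnorm() and qqline(), which produce a comparable plot in fewer lines. The sketch below runs on simulated data (sim_heights is a hypothetical stand-in for x above) and is not necessarily the shorter approach the series introduces later. Note that qqline() draws a line through the first and third quartiles rather than the identity line.

```r
set.seed(1)  # arbitrary seed for reproducibility
sim_heights <- rnorm(500, mean = 69, sd = 3)

# qqnorm() plots the sample values against standard normal quantiles
# and invisibly returns the plotted coordinates
qq <- qqnorm(sim_heights)

# qqline() adds a reference line through the first and third quartiles
qqline(sim_heights)
```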
Key points
Percentiles are the quantiles obtained when defining p as 0.01, 0.02, …, 0.99. They summarize the values at which a certain percent of the observations are equal to or less than that value.
The 50th percentile is also known as the median.
The quartiles are the 25th, 50th and 75th percentiles.
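These points can be checked on a small made-up vector (x is hypothetical): the sketch computes all 99 percentiles and compares the 50th with median().

```r
# Hypothetical toy data
x <- c(3, 1, 4, 1, 5, 9, 2, 6)

# All 99 percentiles
p <- seq(0.01, 0.99, 0.01)
percentiles <- quantile(x, p)
length(percentiles)

# The 50th percentile matches median()
percentiles[50]
median(x)
```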
Key points
When data do not follow a normal distribution and cannot be succinctly summarized by only the mean and standard deviation, an alternative is to report a five-number summary: range (ignoring outliers) and the quartiles (25th, 50th, 75th percentile).
In a boxplot, the box is defined by the 25th and 75th percentiles and the median is a horizontal line through the box. The whiskers show the range excluding outliers, and outliers are plotted separately as individual points.
The interquartile range is the distance between the 25th and 75th percentiles.
Boxplots are particularly useful when comparing multiple distributions.
We discuss outliers in a later video.
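As a small illustration of the interquartile range, the sketch below computes it by hand from the quartiles and with the built-in IQR() function. The vector x is made-up data with one unusually large value.

```r
# Hypothetical toy data with one unusually large value
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 30)

# 25th, 50th and 75th percentiles
q <- quantile(x, c(0.25, 0.50, 0.75))
q

# Interquartile range: distance between the 75th and 25th percentiles
iqr_manual <- unname(q[3] - q[1])
iqr_manual

# IQR() computes the same quantity directly
IQR(x)
```

Calling boxplot(x) on the same vector draws the corresponding plot.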
Key points
If a distribution is not normal, it cannot be summarized with only the mean and standard deviation. Provide a histogram, smooth density or boxplot instead.
A plot can force us to see unexpected results that make us question the quality or implications of our data.