Definition of quantiles

Quantiles

Quantiles are cutoff points that divide a dataset into intervals with set probabilities. The 𝑞th quantile is the value at which 𝑞% of the observations are equal to or less than that value.

Using the quantile function

Given a dataset data and desired quantile q, expressed as a proportion between 0 and 1, you can find the qth quantile of data with:

# quantile(data,q)
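
As a minimal sketch of the call above, assuming a small made-up vector (scores is hypothetical, not course data), the following finds the 50% quantile and then checks it against the definition. Note that q is passed as the proportion 0.5, not as 50.

```r
# hypothetical example data (not from the course)
scores <- c(2, 4, 4, 5, 7, 9, 10, 12)

# q is given as a proportion: 0.5 requests the 50% quantile (the median)
q50 <- quantile(scores, 0.5)
q50

# check the definition: proportion of observations at or below the quantile
mean(scores <= q50)
```

With small samples the proportion will not always match q exactly, because quantile() interpolates between observations by default.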

Percentiles

Percentiles are the quantiles that divide a dataset into 100 intervals each with 1% probability. You can determine all percentiles of a dataset data like this:

# p <- seq(0.01, 0.99, 0.01)
# quantile(data, p)

Quartiles

Quartiles divide a dataset into 4 parts each with 25% probability. They are equal to the 25th, 50th and 75th percentiles. The 25th percentile is also known as the 1st quartile, the 50th percentile is also known as the median, and the 75th percentile is also known as the 3rd quartile.

The summary() function returns the minimum, quartiles and maximum of a vector.

Examples

Load the heights dataset from the dslabs package:

library(dslabs)
data(heights)

Use summary() on the heights$height variable to find the quartiles:

summary(heights$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   50.00   66.00   68.50   68.32   71.00   82.68

Find the percentiles of heights$height:

p <- seq(0.01, 0.99, 0.01)
percentiles <- quantile(heights$height, p)

Confirm that the 25th and 75th percentiles match the 1st and 3rd quartiles. Note that quantile() returns a named vector. You can access the 25th and 75th percentiles like this (adapt the code for other percentile values):

percentiles[names(percentiles) == "25%"]

## 25% 
##  66

percentiles[names(percentiles) == "75%"]

## 75% 
##  71
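
Because quantile() returns a named vector, the names can also be used directly as an index. The sketch below assumes a simulated vector sim_heights as a stand-in for heights$height (an assumption, so that it runs without the dslabs package); the values will therefore differ from the output above.

```r
# simulated stand-in for heights$height (an assumption, not the course data)
set.seed(1)
sim_heights <- rnorm(1000, mean = 68, sd = 4)

p <- seq(0.01, 0.99, 0.01)
percentiles <- quantile(sim_heights, p)

# index the named vector directly by name
percentiles[c("25%", "75%")]

# the same quartiles computed directly, for comparison with summary()
quantile(sim_heights, c(0.25, 0.75))
summary(sim_heights)
```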

Finding quantiles with qnorm

Definition of qnorm

The qnorm() function gives the theoretical value of a quantile: the value for which the probability of observing a value less than or equal to it is p, given a normal distribution with mean mu and standard deviation sigma:

# qnorm(p, mu, sigma)

By default, mu=0 and sigma=1 (in R these arguments are named mean and sd). Therefore, calling qnorm() with only the probabilities p gives quantiles for the standard normal distribution.

qnorm(p)

##  [1] -2.32634787 -2.05374891 -1.88079361 -1.75068607 -1.64485363 -1.55477359
##  [7] -1.47579103 -1.40507156 -1.34075503 -1.28155157 -1.22652812 -1.17498679
## [13] -1.12639113 -1.08031934 -1.03643339 -0.99445788 -0.95416525 -0.91536509
## [19] -0.87789630 -0.84162123 -0.80642125 -0.77219321 -0.73884685 -0.70630256
## [25] -0.67448975 -0.64334541 -0.61281299 -0.58284151 -0.55338472 -0.52440051
## [31] -0.49585035 -0.46769880 -0.43991317 -0.41246313 -0.38532047 -0.35845879
## [37] -0.33185335 -0.30548079 -0.27931903 -0.25334710 -0.22754498 -0.20189348
## [43] -0.17637416 -0.15096922 -0.12566135 -0.10043372 -0.07526986 -0.05015358
## [49] -0.02506891  0.00000000  0.02506891  0.05015358  0.07526986  0.10043372
## [55]  0.12566135  0.15096922  0.17637416  0.20189348  0.22754498  0.25334710
## [61]  0.27931903  0.30548079  0.33185335  0.35845879  0.38532047  0.41246313
## [67]  0.43991317  0.46769880  0.49585035  0.52440051  0.55338472  0.58284151
## [73]  0.61281299  0.64334541  0.67448975  0.70630256  0.73884685  0.77219321
## [79]  0.80642125  0.84162123  0.87789630  0.91536509  0.95416525  0.99445788
## [85]  1.03643339  1.08031934  1.12639113  1.17498679  1.22652812  1.28155157
## [91]  1.34075503  1.40507156  1.47579103  1.55477359  1.64485363  1.75068607
## [97]  1.88079361  2.05374891  2.32634787

Recall that quantiles are defined such that p is the probability that a random observation is less than or equal to the quantile.

Relation to pnorm

The pnorm() function gives the probability that a value from a standard normal distribution will be less than or equal to a z-score value z. Consider:

pnorm(-1.96) ≈ 0.025

The result of pnorm() is the probability associated with the quantile −1.96. Going the other way, note that:

qnorm(0.025) ≈ −1.96

qnorm() and pnorm() are inverse functions:

pnorm(qnorm(0.025)) = 0.025
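
The inverse relationship can be checked numerically. This is a small sketch using three arbitrarily chosen probabilities (the particular values are an assumption for illustration):

```r
# probabilities to convert to z-scores and back
p <- c(0.025, 0.5, 0.975)

# qnorm() maps probabilities to quantiles of the standard normal
z <- qnorm(p)
z

# pnorm() maps those quantiles back to the original probabilities
pnorm(z)
all.equal(pnorm(z), p)
```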

Theoretical quantiles

You can use qnorm() to determine the theoretical quantiles of a dataset: that is, the theoretical value of quantiles assuming that a dataset follows a normal distribution. Run the qnorm() function with the desired probabilities p, mean mu and standard deviation sigma.

Suppose male heights follow a normal distribution with a mean of 69 inches and standard deviation of 3 inches. The theoretical quantiles are:

p <- seq(0.01, 0.99, 0.01)
theoretical_quantiles <- qnorm(p, 69, 3)

Theoretical quantiles can be compared to sample quantiles determined with the quantile() function in order to evaluate whether the sample follows a normal distribution.
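
One way to sketch this comparison numerically is shown below. It assumes a simulated sample sim_heights drawn from a normal distribution with mean 69 and standard deviation 3 (an assumption standing in for real height data), so the sample and theoretical quantiles should be close by construction.

```r
# simulated sample (an assumption): 1,000 draws from a normal with mean 69, sd 3
set.seed(2020)
sim_heights <- rnorm(1000, mean = 69, sd = 3)

p <- seq(0.05, 0.95, 0.05)
sample_quantiles <- quantile(sim_heights, p)
theoretical_quantiles <- qnorm(p, mean = 69, sd = 3)

# side-by-side comparison; small differences suggest approximate normality
comparison <- data.frame(p, sample_quantiles, theoretical_quantiles,
                         difference = sample_quantiles - theoretical_quantiles)
comparison
```

With real data the mean and standard deviation are usually unknown, so they would be estimated from the sample with mean() and sd() before calling qnorm().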

Quantile-Quantile Plots

Key points

Quantile-quantile plots, or QQ-plots, are used to check whether distributions are well-approximated by a normal distribution.
Given a proportion p, the quantile q is the value such that the proportion of values in the data at or below q is p.
In a QQ-plot, the sample quantiles in the observed data are compared to the theoretical quantiles expected from the normal distribution. If the data are well-approximated by the normal distribution, then the points on the QQ-plot will fall near the identity line (sample = theoretical).
Calculate sample quantiles (observed quantiles) using the quantile() function.
Calculate theoretical quantiles with the qnorm() function. qnorm() will calculate quantiles for the standard normal distribution (μ=0, σ=1) by default, but it can calculate quantiles for any normal distribution given the mean and sd arguments. We will learn more about qnorm() in the probability course.
Note that we will learn alternate ways to make QQ-plots with less code later in the series.

Code

# load packages
library(tidyverse)

library(dslabs)
data(heights)

# x: male heights; z: male heights in standard units
index <- heights$sex == "Male"
x <- heights$height[index]
z <- scale(x)

# proportion of data at or below 69.5
mean(x <= 69.5)

## [1] 0.5147783

# calculate observed and theoretical quantiles
p <- seq(0.05, 0.95, 0.05)
observed_quantiles <- quantile(x, p)
theoretical_quantiles <- qnorm(p, mean = mean(x), sd = sd(x))

# make QQ-plot
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)

# make QQ-plot with scaled values
observed_quantiles <- quantile(z, p)
theoretical_quantiles <- qnorm(p)
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)
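
Base R also provides qqnorm(), which builds a normal QQ-plot in a single call. The sketch below assumes simulated data (x_sim and z_sim are stand-ins for x and z, so that it runs without dslabs). It is one possible shortcut, not necessarily the one the course introduces later. Unlike the code above, qqnorm() uses one point per observation rather than a chosen grid of probabilities p.

```r
# simulated stand-in for the male heights vector x (an assumption)
set.seed(1)
x_sim <- rnorm(500, mean = 69, sd = 3)
z_sim <- as.numeric(scale(x_sim))  # standardize; drop matrix attributes

# qqnorm() plots sample quantiles against standard normal quantiles
qq <- qqnorm(z_sim)
abline(0, 1)  # identity line, as in the code above
```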

Percentiles

Key points

Percentiles are the quantiles obtained when defining 𝑝 as 0.01, 0.02, …, 0.99. They summarize the values at which a certain percent of the observations are equal to or less than that value.
The 50th percentile is also known as the median.
The quartiles are the 25th, 50th and 75th percentiles.

Boxplots

Key points

When data do not follow a normal distribution and cannot be succinctly summarized by only the mean and standard deviation, an alternative is to report a five-number summary: the range (ignoring outliers) and the quartiles (25th, 50th and 75th percentiles).
In a boxplot, the box is defined by the 25th and 75th percentiles and the median is a horizontal line through the box. The whiskers show the range excluding outliers, and outliers are plotted separately as individual points.
The interquartile range is the distance between the 25th and 75th percentiles.
Boxplots are particularly useful when comparing multiple distributions.
We discuss outliers in a later video.
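
The quantities behind a boxplot can be sketched in base R as follows. The vector values is hypothetical, chosen to include one unusually large observation.

```r
# hypothetical data with one unusually large value
values <- c(60, 62, 63, 64, 64, 65, 66, 67, 68, 80)

# quartiles and interquartile range
quartiles <- quantile(values, c(0.25, 0.5, 0.75))
quartiles
IQR(values)

# statistics that boxplot() draws
stats <- boxplot.stats(values)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points plotted individually as outliers

boxplot(values)
```

Note that boxplot() draws the box at the hinges, which can differ slightly from the quartiles returned by quantile() in small samples.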

Distribution of Female Heights

Key points

If a distribution is not normal, it cannot be summarized with only the mean and standard deviation. Provide a histogram, smooth density or boxplot instead.
A plot can force us to see unexpected results that make us question the quality or implications of our data.

Data Science: Visualization - Section 1 - Quantiles, Percentiles, Boxplots

Feb 24, 2020
