Probability Distributions in R

Purpose: to create normal probability simulations using dnorm, pnorm, qnorm, and rnorm

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.2

Distribution Functions in R

1 dnorm: Probability Density Function (PDF) of the normal distribution
2 pnorm: Cumulative Density Function (CDF) of the normal distribution
3 qnorm: Quantile function of the normal distribution
4 rnorm: Random sampling from the normal distribution

1. Probability Density Function (dnorm)

The probability density function (PDF, in short: density) indicates the probability of observing a measurement with a specific value and thus the integral over the density is always 1. For a value \(x\), the normal density is defined as \[{\displaystyle f(x\mid \mu ,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\text{exp}\left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right)}\] where \(\mu\) is the mean, \(\sigma\) is the standard deviation, and \(\sigma^2\) is the variance.

Using the density, it is possible to determine the probabilities of each event. For example, you may wonder: What is the likelihood that a person has an IQ of exactly 140?. In this case, you would need to retrieve the density of the IQ distribution at value 140. The IQ distribution can be modeled with a mean of 100 and a standard deviation of 15. The corresponding density is:

#Data
sample.range <- 50:150
iq.mean <- 100
iq.sd <- 15
iq.dist <- dnorm(sample.range, mean = iq.mean, sd = iq.sd)
iq.df <- data.frame("IQ" = sample.range, "Density" = iq.dist)
ggplot(iq.df, aes(x = IQ, y = Density)) + geom_point() + ylab("Density (Probability)")

#Function to print percentage
pp <- function(x) {
        print(paste0(round(x * 100, 3), "%"))
}

#Likelihood IQ >= 140?
pp(sum(iq.df$Density[iq.df$IQ >= 140]))

## [1] "0.384%"

#Likelihood of 75 <= IQ <= 90?
pp(sum(iq.df$Density[iq.df$IQ >= 75 & iq.df$IQ <= 90]))

## [1] "21.868%"

2. Cumulative Density Function (pnorm)

The cumulative density (CDF) function is a monotonically increasing function as it integrates over densities via \[f(x | \mu, \sigma) = \displaystyle {\frac {1}{2}}\left[1+\operatorname {erf} \left({\frac {x-\mu }{\sigma {\sqrt {2}}}}\right)\right]\] where \(\text{erf}(x) = \frac {1}{\sqrt {\pi }}\int _{-x}^{x}e^{-t^{2}}\,dt\) is the error function.

#CDF Lower Tail (P[X <= x])
cdf_lower <- pnorm(sample.range, mean = iq.mean, sd = iq.sd, lower.tail = TRUE)
iq.df <- cbind(iq.df, "CDF_LowerTail" = cdf_lower)
ggplot(iq.df, aes(x = IQ, y = CDF_LowerTail)) + geom_point()

#Likelihood of 50 < IQ <= 90?
pp(iq.df$CDF_LowerTail[iq.df$IQ == 90])

## [1] "25.249%"

#CDF Upper Tail (P[X > x])
cdf_upper <- pnorm(sample.range, mean = iq.mean, sd = iq.sd, lower.tail = FALSE)
iq.df <- cbind(iq.df, "CDF_UpperTail" = cdf_upper)
ggplot(iq.df, aes(x = IQ, y = CDF_UpperTail)) + geom_point()

#Likelihood of IQ > 125
pp(iq.df$CDF_UpperTail[iq.df$IQ == 125])

## [1] "4.779%"

From the chart above, we can see the cumulative probability distribution of P[X <= x].

3. The Quantile Function (qnorm)

The quantile function is simply the inverse of the cumulative density function (iCDF). Thus, the quantile function maps from probabilities to values. Let’s take a look at the quantile function for \(P[X <= x]\):

#List of probabilities from 0 to 100%, by 0.1%
prob.range <- seq(from = 0, to = 1, by = 0.001)

#Minimum IQ required to reach certain percentile
iCDF_lower <- qnorm(prob.range, mean = iq.mean, sd = iq.sd, lower.tail = TRUE)
iCDF.df <- data.frame("Percentile" = prob.range, "Min_IQ" = iCDF_lower)
ggplot(iCDF.df, aes(x = Percentile, y = Min_IQ)) +geom_point()

#What is the 25th IQ percentile?
iCDF.df$Min_IQ[iCDF.df$Percentile == 0.25]

## [1] 89.88265

#What is the minimum IQ for reaching the TOP-10 or 90th percentile?
#Option 1 (Using iCDF table)
iCDF.df$Min_IQ[iCDF.df$Percentile == 0.75]

## [1] 110.1173

#Option 2 (using qnorm)
qnorm(0.9, mean = iq.mean, sd = iq.sd)

## [1] 119.2233

#Or we can see minimum IQ to reach 25th, 50th, 75th, 100th percentile at once
quantile(iCDF.df$Min_IQ)

##        0%       25%       50%       75%      100% 
##      -Inf  89.88265 100.00000 110.11735       Inf

#To see various custom quantiles (eg. 50th to 100th, separated by 10th)
quantile(iCDF.df$Min_IQ, probs = seq(from = 0.5, to = 1, by = 0.1))

##      50%      60%      70%      80%      90%     100% 
## 100.0000 103.8002 107.8660 112.6243 119.2233      Inf

4. Probability Density Function (rnorm)

When you want to draw random samples from the normal distribution, you can use rnorm. For example, we could use rnorm to simulate random samples from the IQ distribution.

#To fix random seed for reproducibility
set.seed(1)

#Law of large numbers: the more samples, the more mean reaches the expected value
samples <- c(100, 1000, 10000)

#Combine list of sample IQs (rbind) for each samples (sapply)
sample.df <- do.call(rbind, lapply(samples, function(x) {
        
        #To create a random IQ for each sample size group
        data.frame("SampleSize" = x,
                   "IQ" = rnorm(x, mean = iq.mean, sd = iq.sd))
}))

#Plot the results (histograms) for each SampleSize group
ggplot(data = sample.df, aes(x = IQ)) +
        
        #Create histogram
        geom_histogram() +
        
        #To separate histogram based on different groups (ie. SampleSize)
        facet_wrap(
                facets = sample.df$SampleSize, #The groups
                scales = "free_y") #To fix x-axis scale, but to allow y-axis to scale dynamically

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#We can also create a sample (with replacement, aka. independent variables when taking one sample doesn't affect the probability of others) from the Probability Density Table
sample_pdf <- sample(iq.df$IQ, size = 100, prob = iq.df$Density, replace = TRUE)
sample_pdf.df <- data.frame("IQ" = sample_pdf)
ggplot(sample_pdf.df, aes(x = IQ)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.