The implementation of statistical distributions in R

Seth J. Chandler

August 30, 2014

Introduction

The key to understanding statistical distributions in R is to understand a naming convention. Functions of a distribution object start with the letter d, p, q or r (for density, cumulative probability, quantile, and random variate) and then follow with some short form of the name of the distribution. Thus, the function to find the density of the normal distribution is “dnorm.” There’s also a convention about the order of the arguments. The first argument is the value at which the distribution object is evaluated, the next sequence of arguments are the parameters of the distribution (such as mean and standard deviation in the case of the normal distribution or shape and rate in the case of the gamma distribution), and then some optional arguments that might relate, for example, to conventional transformations of the distribution.

A quick example

Maybe the easiest way to see this is with an example. Here’s a table of the values of the density function of a normal distribution with mean 1 and standard deviation of 2. You can see how the value starts out small, gets bigger and then ends up smaller.1

dnorm(seq(-3,9,0.5),3,2)
##  [1] 0.002216 0.004547 0.008764 0.015870 0.026995 0.043139 0.064759
##  [8] 0.091325 0.120985 0.150569 0.176033 0.193334 0.199471 0.193334
## [15] 0.176033 0.150569 0.120985 0.091325 0.064759 0.043139 0.026995
## [22] 0.015870 0.008764 0.004547 0.002216

I can visualize the density function using the following code (after importing ggplot2). The idea is to create a mini-data.frame containing the x values to be plotted and to make sure that ggplot understands that the x-value of the plot is to be taken from the myx column of the data.frame. I then add a stat_function. The stat_function used here is dnorm and takes as its first argument the x value designated by aes. But we need some additional arguments to get dnorm to work, so we add them as args=c(3,2).2

By the way, I’m going to need the ggplot2 package a lot for this article, so let me give R access to its functionality.

require(ggplot2)
xdata<-ggplot(data.frame(myx=c(-3,9)),aes(x=myx))
xdata+stat_function(fun=dnorm,args=c(3,2))+xlab("x")

plot of chunk dnormplot

Cumulative Distributions

Suppose I wanted the cumulative distribution of the same distribution as used above, a normal distribution with mean of 1 and standard deviation 2. It just requires a tiny alteration of the code. I changed dnorm to pnorm. I’ve also added a title so you can see how to do that using ggplot.

xdata+stat_function(fun=pnorm,args=c(3,2))+xlab("x")+ggtitle("The cumulative distribution of N(3,2)")

plot of chunk pnormplot

Quantiles

The quantile function is the inverse of the cumulative density function. You give it a number between 0 and 1 and it tells you the x-value of the cumulative density function that would yield that number.

ggplot(data.frame(myq=c(0,1)),aes(x=myq))+stat_function(fun=qnorm,args=c(3,2))+xlab("x")+ggtitle("The quantile function of N(3,2)")

plot of chunk quantilenormplot

So, looking at this graph we can confirm that the 0.9 quantile (a/k/a 90th percentile) of our normal distribution is 5.5631.

Survival Function

The survival function value is just 1 minus the cumulative distribution function value. If you think of a cumulative distribution as being the probability that a draw from the distribution will be less than or equal to some abount, the survival function is the probability that a draw from the distribution will be greater than that amount.

R provides built in survival functions for distributions, but in a somewhat sneaky way. I am not sure what the logic is behind the name of the option, but all you do is add lower.tail=FALSE to your function and your CDF(cumulative distribution function) of a distribution turns into a survival function. Let’s plot the survival function of the normal distribution with which we have been working.

xdata+stat_function(fun=function (x) pnorm(x,3,2,lower.tail=FALSE))+xlab("x")+ggtitle("The survival function of N(3,2)")

plot of chunk survivalnormplot

Hazard Function

The hazard function is essentially the probability, given that you are alive now that you will die in the next “instant.” Again, R does not have hazard as a built in. There is, unfortunately, no hnorm. But we can make it quite easily.

xdata+stat_function(fun=function (x) dnorm(x,3,2)/(1-pnorm(x,3,2)))+xlab("x")+ggtitle("The hazard function of N(3,2)")

plot of chunk hazardnormplot

Random Variates

In understanding random functions, we often want to take draws from the distribution. R does have random variate as a built in concept. You stick r in front of the name of the distribution, and, voila. Here are 20 draws from the normal distribution we have been using all along.

rnorm(10,3,2)
##  [1]  4.2526  2.8250  3.3037 -0.5000  4.9332  0.7131  1.5866  3.6721
##  [9]  2.0136  4.7026

The log normal distribution

One of the most frequently used distributions in finance (and elsewhere) is the lognormal distribution. It is the same thing as the normal distribution if we replace x by \(e^x\). While the normal distribution has a domain of \((-\infty,\infty)\), the lognormal distribution has a domain of \((0,\infty)\); no negative values. We can create it simply by adding an option to the norm family. Here’s a plot of the “PDF” (probability density function) of the lognormal distribution with “mean log”3 3 and “standard deviation log” of 2.

lxdata<-ggplot(data.frame(myx=c(0.,15)),aes(x=myx))
lxdata+stat_function(fun=function (x) dlnorm(x,2,3))+xlab("x")+ggtitle("The probability density function of lognormal(3,2)")

plot of chunk unnamed-chunk-1

We can do the same things with our log normal distribution as we have with our normal distribution. Here, for example, is a vector of 10 draws from our log normal distribution

rlnorm(10,3,2)
##  [1]  21.959 258.150   5.671   7.982   2.822  77.036   6.879  18.782
##  [9]  32.224   1.869

The beta distribution

It is often convenient to have a distribution whose domain is just \((0,1)\). Although there are many such distributions, one of the most common and flexible is the beta distribution. R implements this distribution with functions such as dbeta, pbeta, etc. using the naming convention described earlier. The distribution takes two parameters. I’m now going to create a list of plots but hold off printing them out. The idea is to use expand.grid (similar to Outer in Mathematica) to create a data.frame that contains all possible combinations of 0.5, 1, 2, 4, which are instructive paramters in considering this distribution.

parameters<-expand.grid(shape1=2^c(-1,0,1,2),shape2=2^c(-1,0,1,2))
head(parameters)
##   shape1 shape2
## 1    0.5    0.5
## 2    1.0    0.5
## 3    2.0    0.5
## 4    4.0    0.5
## 5    0.5    1.0
## 6    1.0    1.0

I then create a function makeabeta that takes two arguments and makes a plot of the PDF of the beta distribution with those arguments as its “shape1” and “shape2” parameters.

makeabetafig<-function(shape1,shape2) {
        ggplot(data.frame(myx=c(0,1)),aes(x=myx))+
                stat_function(fun=dbeta,args=c(shape1,shape2))+xlab("x")+
                ggtitle(paste("PDF of","\n","Beta Distribution(",shape1,",",shape2,")"))+
                theme(axis.title=element_text(face="bold",size="8",color="brown"),
                      plot.title=element_text(size="8"),
                      axis.text=element_text(size="7"))
                      }

I then apply that function across the rows of the parameters data.frame and obtain a list of figures. I’ll print out the twelfth of the sixteen figures as an example.

betafigs<-apply(parameters,1, function (params) makeabetafig(params["shape1"],params["shape2"]))
print(length(betafigs))
## [1] 16
betafigs[[12]]

plot of chunk unnamed-chunk-5 # Advanced visualization

This material is intended as “extra credit” advanced material. What I would like to do is fit all 16 plots on a page. To do this, I need the gridExtra package and one of its functions grid.arrange. If you’re doing this at home, remember to install the gridExtra package before you require its usage.
require("gridExtra")
## Loading required package: gridExtra
## Loading required package: grid

What I would like to do is create a vector args.list that has two lists. The first list is our plots, betafigs. The second list is the dimensions of grid of graphics we want to create.

args.list<-c(betafigslist(nrow=4,ncol=4,heights=unit(2.25,"inches")))

Unfortunately, because I can’t get the graphics to fit all on one page and I have yet to figure out how page breaking works in R Markdown, I will create two lists, one for each half of betafigs and then run the code twice. A very ugly solution!

args.list1<-c(betafigs[1:8],list(nrow=2,ncol=4,heights=unit(2.25,"inches")))
args.list2<-c(betafigs[9:16],list(nrow=2,ncol=4,heights=unit(2.25,"inches")))

Now we will use do.call (which has a resemblance to Mathematica’s Apply function) to force grid.arrange, which basically takes a sequence of graphics and produces a nice grid, to take the sequence of items in args.list as its argument. In the abstract, do.call(f,list(a,b,c)) will return f(a,b,c).

do.call(grid.arrange,args.list1)

plot of chunk unnamed-chunk-9

do.call(grid.arrange,args.list2)

plot of chunk unnamed-chunk-10

Discrete Distributions

The distributions I have shown you thus far had a continuous outputs. The normal distribution could generate a random variate 3.45343 or 2.134 or any number in between. There are other distributions, however, that have discrete outputs. Think about the flip or a coin or the roll of a die. You can’t roll 2.134 on a die. One example of a discrete distribution that is well implemented in R is the binomial distribution. You can think of this distribution as the number of successes you will get in flipping a (possibly corrupt) coin a specified number of times given that the probability of a success on a single flip is some value, not necessarily 0.5. The “Bernoulli Distribution” is treated as a special case of the binomial distribution in which the number of flips is one.

Here are some random draws from two Binomial distributions. In the first one we take 10 draws from a single coin flip in which the probability of a success on a single draw is 0.3. In the second one we take 12 draws from five coin flips in which the probability of a success on a single draw is 0.6

print(rbinom(10,1,0.3))
##  [1] 1 1 0 0 1 0 0 0 0 0
print(rbinom(12,5,0.6))
##  [1] 2 4 3 2 2 1 4 3 3 3 2 5

I can compute the probability of getting a specified number of successes in draws from a five-flip binomial distribution with a single-draw probability of 0.6. Notice how I don’t have to write any loops. I just exploit R’s ability to deal with vector arguments.4

dbinom(0:5,5,0.6)
## [1] 0.01024 0.07680 0.23040 0.34560 0.25920 0.07776

  1. Do you see how this code works? I some vector, here a sequence from -3 to 9 by 0.5s. (-3, -2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9). I then exploit the fact that many R functions work on a vector of arguments as well as they do on a single number.

  2. I could also have written this code as ggplot(xdata)+stat_function(fun=function (x) dnorm(x,1,2))+xlab(“x”)

  3. footnote explaining what a mean log is.

  4. In Mathematica, some functions are automatically capable of dealing with vector arguments. You can make functions you create capable of dealing with vectors without using constructs such as Map by setting the attribute of the function equal to Listable.