Real world distributions

This is a playground for experimenting with approximating real-world distributions by mathematical distributions: things like household size, age, income, frequency of purchases, etc.

The distributions should eventually be mapped to functions via Apache Commons Math (http://commons.apache.org/proper/commons-math/userguide/distribution.html). Instead of (or next to) exposing the full math, the goal is to expose a few functions that are sufficient to easily generate realistic-looking data and interactions for simulations.

Mathematical properties of various distributions can be found in https://en.wikipedia.org/wiki/List_of_probability_distributions.

Applications

The gamma distribution has been used to model (https://en.wikipedia.org/wiki/Gamma_distribution):

* the size of insurance claims

The normal distribution is assumed for things like (https://en.wikipedia.org/wiki/Normal_distribution):

* In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and stock market indices are assumed normal

Clipped normal distributions

The distribution of many natural phenomena, e.g. length, can be approximated using the normal (Gaussian) distribution (see http://stats.stackexchange.com/questions/204471/is-there-an-explanation-for-why-there-are-so-many-natural-phenomena-that-follow).

In many cases (e.g. length again) there are physical boundaries: length can’t be < 0. So the normal distribution needs a little tweaking, such as clipping it to the valid range.
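
One way to do that tweaking is to redraw any value that falls outside the allowed range. The sketch below is a minimal illustration, not a definitive implementation: `clipped_normal` and its parameters are hypothetical names, and redrawing (rather than, say, clamping values to the boundary) is an assumed choice.

```r
# Hypothetical helper (not from the original text): draw n values from a
# normal distribution restricted to [lower, upper] by redrawing every value
# that falls outside the bounds (rejection sampling).
clipped_normal <- function(n, mean, sd, lower = 0, upper = Inf) {
  stopifnot(lower < upper, sd > 0)
  out <- rnorm(n, mean, sd)
  outside <- out < lower | out > upper
  while (any(outside)) {
    out[outside] <- rnorm(sum(outside), mean, sd)
    outside <- out < lower | out > upper
  }
  out
}

set.seed(1)
len <- clipped_normal(10000, mean = 1, sd = 1)  # lower bound 0, no upper bound
range(len)                                      # all values are >= 0
```

Redrawing keeps the bell shape inside the bounds but shifts the sample mean away from the `mean` argument when a bound cuts off a sizeable part of the distribution. Clamping with `pmin`/`pmax` would instead pile up values exactly on the boundary.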

Right tailed distributions

Many non-natural phenomena follow a sort of power law, i.e. the outcome is proportional to another variable raised to some fixed power. An example is city population versus rank. Zipf’s law is a special case in which the power is -1, i.e. the outcome is inversely proportional to the rank. See e.g. http://pages.stern.nyu.edu/~xgabaix/papers/pl-jep.pdf.
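
The rank-size relation can be sketched by sampling ranks with probability proportional to rank^(-s). This is an illustration only: `rzipf`, `n_ranks` and `s` are hypothetical names, and capping the number of ranks at 50 is an assumption made to keep the example small.

```r
# Hypothetical helper: sample ranks 1..n_ranks with probability
# proportional to rank^(-s); s = 1 corresponds to Zipf's law.
rzipf <- function(n, n_ranks, s = 1) {
  weights <- (1:n_ranks)^(-s)
  sample.int(n_ranks, size = n, replace = TRUE, prob = weights)
}

set.seed(1)
draws  <- rzipf(100000, n_ranks = 50)
counts <- tabulate(draws, nbins = 50)   # how often each rank was drawn

# On a log-log scale, count versus rank should lie close to a straight
# line whose slope is approximately -s.
fit <- lm(log(counts) ~ log(seq_along(counts)))
coef(fit)[2]
```

Fitting a line on the log-log scale is a quick visual-style check rather than a rigorous power-law test.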

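library(dplyr)
library(ggplot2)

target_mean <- 5   # desired mean contact frequency
n <- 10000
plotdata <- NULL
# rexp(n, rate = s) has mean 1/s, so multiplying by target_mean * s gives
# every sample the mean target_mean. The exponential is a pure scale family,
# so after this rescaling the three samples differ only by sampling noise.
for (s in c(0.5, 1, 1.5)) {
  df <- data.frame(frequency = rexp(n, rate = s) * target_mean * s) %>%
    mutate(rank = rank(frequency), rate = paste(s))
  plotdata <- rbind(plotdata, df)   # rbind(NULL, df) returns df
}
# Dashed vertical lines mark the sample mean for each rate.
ggplot(plotdata, aes(frequency, colour = rate, fill = rate)) +
  geom_histogram(alpha = 0.6, bins = 200) +
  ggtitle(paste("Contact frequency with mean", target_mean, "modelled by an Exponential distribution")) +
  xlim(0, 5 * target_mean) +
  geom_vline(data = group_by(plotdata, rate) %>% summarise(mean = mean(frequency)),
             aes(xintercept = mean, colour = rate), linetype = "dashed")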
## Warning: Removed 206 rows containing non-finite values (stat_bin).

Loglog

Right tailed continuous values

For continuous values like income, the Gamma distribution has nice properties. In the plot below, the blue curve seems a reasonable model for income.

An alternative might be the Lévy distribution (https://en.wikipedia.org/wiki/L%C3%A9vy_distribution), also available in Apache Commons Math, but its mean is infinite.

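library(dplyr)
library(ggplot2)

target_mean <- 40000   # desired mean income
n <- 10000
plotdata <- NULL
# rgamma(n, shape = s) has mean s (the rate defaults to 1), so scaling by
# target_mean / s gives every sample the same mean but a different shape.
for (s in c(1, 3, 20)) {
  df <- data.frame(income = rgamma(n, shape = s) * target_mean / s) %>%
    mutate(rank = rank(income), gamma = paste(s))
  plotdata <- rbind(plotdata, df)   # rbind(NULL, df) returns df
}
# Dashed vertical lines mark the sample mean for each shape.
ggplot(plotdata, aes(income, colour = gamma, fill = gamma)) +
  geom_density(alpha = 0.6) +
  ggtitle(paste("Income with mean", target_mean, "modelled by a Gamma distribution")) +
  xlim(0, 5 * target_mean) +
  geom_vline(data = group_by(plotdata, gamma) %>% summarise(mean = mean(income)),
             aes(xintercept = mean, colour = gamma), linetype = "dashed")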
## Warning: Removed 80 rows containing non-finite values (stat_density).

A log-log plot of this is largely linear, even though the Gamma distribution is not a power-law distribution.
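
How linear the log-log relation is can be checked numerically. The sketch below assumes the plot in question shows income against descending rank (the text does not say) and uses shape 3 as an arbitrary pick from the shapes tried above; all variable names are illustrative.

```r
set.seed(1)
n <- 10000
target_mean <- 40000
shape <- 3   # assumed: one of the shapes used in the plot
income <- rgamma(n, shape = shape) * target_mean / shape

# Rank 1 is the largest income.
ranked <- data.frame(rank = seq_len(n),
                     income = sort(income, decreasing = TRUE))

# Fit a straight line on the log-log scale; an R^2 close to 1 would
# support the "largely linear" reading.
fit <- lm(log(income) ~ log(rank), data = ranked)
r_squared <- summary(fit)$r.squared
r_squared
```

Passing `log = "xy"` to base `plot()` draws the same data on log-log axes for a visual check.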

Transformations

library(dplyr)
library(tidyr)
library(ggplot2)

n <- 10000
x <- runif(n)
# Each column maps the uniform sample x on [0, 1] into [0, 1] again;
# gather() then stacks the columns into long format.
df <- data.frame(n = 1:n,
                 x,
                 uniform = x,
                 squared = x^2,
                 sqrt = sqrt(x),
                 division = 2 - 2 / (1 + x),
                 gonio = 0.5 * (1 + sin((x * pi) - (pi / 2))),
                 tanh = 0.5 + tanh(2 * (x - 0.5) * pi) / 2,
                 circular = 1 - sqrt(1 - x^2)) %>%
  gather(transformation, value, -n, -x)

# Summary statistics per transformation.
format(group_by(df, transformation) %>%
         summarise(avg = mean(value),
                   sd = sd(value),
                   Q1 = quantile(value, 0.25),
                   median = median(value),
                   Q3 = quantile(value, 0.75),
                   min = min(value),
                   max = max(value)),
       scientific = FALSE, digits = 5)
##   transformation     avg      sd       Q1  median      Q3              min     max
## 1       circular 0.21529 0.22379 0.031022 0.13107 0.34516 0.00000000024622 0.98865
## 2       division 0.61309 0.28101 0.396342 0.66215 0.86090 0.00004438119243 0.99997
## 3          gonio 0.49984 0.35538 0.143294 0.49205 0.85990 0.00000000121506 1.00000
## 4           sqrt 0.66624 0.23679 0.497140 0.70352 0.86935 0.00471074183613 0.99997
## 5        squared 0.33415 0.29969 0.061082 0.24497 0.57119 0.00000000049244 0.99987
## 6           tanh 0.49910 0.41441 0.040024 0.48411 0.96136 0.00186448077994 0.99813
## 7        uniform 0.49993 0.29022 0.247149 0.49494 0.75577 0.00002219108865 0.99994
ggplot(df, aes(x, value, colour=transformation))+geom_line(size=1)+ggtitle("Transformations of the 0-1 interval")

ggplot(df, aes(value, colour=transformation))+geom_density(size=1,linetype="dashed")+ggtitle("Transformations of the 0-1 interval (density)")

Notes

nextDiscreteLongTailed: discrete fat-tailed distribution, e.g. user-generated content, rank-size distributions. Candidates: Zipf, discrete Exponential (1.5), Pareto (1, 1)?

Poisson: “How many emails will I receive today?”

nextLongTailed(mean): continuous long-tailed values, e.g. income (Gini, Lorenz). Candidate: Weibull (1.5, 1)?

nextBoundedNormal(min, max): bounded values, e.g. household size. Candidate: Beta (2, 5)?

Age: reversed Beta?

Normal: natural phenomena

Poisson distribution: counts arising from a large number of individually very improbable events (accidents), such as the number of car crashes in New York in a day
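
A few of these notes can be sketched in R to get a feel for the candidates. The function names below loosely mirror the notes but are hypothetical, and the implementations are illustrative guesses rather than a settled design. In particular, rescaling the Weibull to a requested mean and rounding the Beta draw for household size are assumptions.

```r
# Bounded values such as household size: Beta(2, 5) stretched to [min, max].
next_bounded <- function(n, min, max, shape1 = 2, shape2 = 5) {
  min + (max - min) * rbeta(n, shape1, shape2)
}

# Continuous long-tailed values such as income: Weibull rescaled so that
# its theoretical mean equals `mean`
# (mean of a Weibull = scale * gamma(1 + 1 / shape)).
next_long_tailed <- function(n, mean, shape = 1.5) {
  scale <- mean / gamma(1 + 1 / shape)
  rweibull(n, shape = shape, scale = scale)
}

# Event counts such as emails per day: Poisson with the given average rate.
next_count <- function(n, rate) {
  rpois(n, lambda = rate)
}

set.seed(1)
household <- round(next_bounded(10000, min = 1, max = 8))
income    <- next_long_tailed(10000, mean = 40000)
emails    <- next_count(10000, rate = 25)

summary(household)
summary(income)
summary(emails)
```

Keeping each generator behind a small function with only one or two arguments matches the goal of hiding the full math while still producing realistic-looking data.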

http://stats.stackexchange.com/questions/33776/real-life-examples-of-common-distributions

https://en.wikipedia.org/wiki/List_of_probability_distributions

http://commons.apache.org/proper/commons-math/userguide/distribution.html

https://www.quora.com/Give-some-real-world-examples-where-the-following-probability-distributions-show-themselves