R comes with a set of pseuodo-random number generators that allow you to simulate from well-known probability distributions like:Binomial, Poisson, Exponential, Uniform,Normal, Gamma, Weibull, Log-normal, Beta, Chi-Squared, logistic, Pareto ect.
In base R you may find a number of popular (for some of us) probability distributions. See below.
beta - beta(shape1, shape2, ncp)
binomial - binom(size, prob)
chi-squared - chisq(df, ncp)
exponential - exp(rate)
gamma - gamma(shape, scale)
logistic - logis(location, scale)
normal - norm(mean, sd)
Poisson - pois(lambda)
Student’s t - t(df, ncp)
uniform - unif(min, max)
Placing a prefix (one of the: “d”, “p”, “q”,“r”) for the distribution function will change it’s behavior in the following ways:
dxxx(x,) returns the density or the value on the y-axis of a probability distribution for a discrete value of x
pxxx(q,) returns the cumulative density function (CDF) or the area under the curve to the left of an x value on a probability distribution curve
qxxx(p,) returns the quantile value, i.e. the standardized z value for x
rxxx(n,) returns a random simulation of size n
rbinom: generate random Binomial variables with a given n (sample size) and p (probability of success)
dbinom: evaluate the Binomial probability density (with a given n,p) at a point x (or vector of points)
pbinom: evaluate the cumulative distribution function for a Binomial distribution
qbinom: returns the quatile value for a given probability
rbinom(20,2,3/4) # simulate 20 observations from B(2,1/3)
[1] 1 2 2 1 2 2 2 1 1 2 2 2 2 2 2 1 1 2 0 2
# let us observe the frequencies of 0,1 and 2 from different sample size
table(rbinom(100,2,3/4))/100
0 1 2
0.08 0.35 0.57
table(rbinom(100,2,3/4))/100
0 1 2
0.10 0.34 0.56
table(rbinom(1000,2,3/4))/1000
0 1 2
0.060 0.369 0.571
table(rbinom(1000,2,3/4))/1000
0 1 2
0.061 0.349 0.590
Cumulative Density Function (CDF) returns the probability P(X<x) for a given value of x from [0,n], for a Binomial distribution.
# CDF
pbinom(0,2,1/3) # P(X<0)
[1] 0.4444444
pbinom(1,2,1/3) # P(X<1)
[1] 0.8888889
pbinom(2,2,1/3) # P(X<2)
[1] 1
# PDF
dbinom(0,2,1/3) # calculate P(X=0)
[1] 0.4444444
dbinom(1,2,1/3) # calculate P(X=1)
[1] 0.4444444
dbinom(2,2,1/3) # calculate P(X=2)
[1] 0.1111111
Graphically we have this output:
library(ggplot2)
var.X <- rbinom(100, 10, 0.3)
ggplot() +
geom_histogram( aes(var.X), binwidth = 1, fill = "white", color = "black" )
The seed argument in set.seed is a single value, interpreted as an integer (as defined in help(set.seed()).
The seed in set.seed produces random values which are unique to that seed (and will be same irrespective of the computer you run and hence ensures reproducibility).
So the random values generated by set.seed(1) and set.seed(123) will not be the same but the random values generated by R in your computer using set.seed(1) and by R in my computer using the same seed are the same.
set.seed(512)
x <- seq(0,100,by=1) # create a sequence from 0 to 100 with step=1
y <- dbinom(x,100,0.6) # Calculate probabilities for X~B(100,0.6) # a vector of values
y
[1] 1.606938e-40 2.410407e-38 1.789727e-36 8.769664e-35 3.189965e-33
[6] 9.187099e-32 2.181936e-30 4.395043e-29 7.663856e-28 1.175125e-26
[11] 1.604045e-25 1.968601e-24 2.190068e-23 2.223762e-22 2.072864e-21
[16] 1.782663e-20 1.420559e-19 1.052885e-18 7.282455e-18 4.714432e-17
[21] 2.864017e-16 1.636581e-15 8.815222e-15 4.484265e-14 2.158053e-13
[26] 9.840720e-13 4.258004e-12 1.750513e-11 6.845755e-11 2.549454e-10
[31] 9.050560e-10 3.065512e-09 9.915016e-09 3.064641e-08 9.058719e-08
[36] 2.562323e-07 6.939626e-07 1.800552e-06 4.477688e-06 1.067756e-05
[41] 2.442492e-05 5.361569e-05 1.129759e-04 2.285792e-04 4.441709e-04
[46] 8.291190e-04 1.487007e-03 2.562714e-03 4.244495e-03 6.756543e-03
[51] 1.033751e-02 1.520222e-02 2.148776e-02 2.919091e-02 3.811036e-02
[56] 4.781118e-02 5.762955e-02 6.672895e-02 7.420719e-02 7.923819e-02
[61] 8.121914e-02 7.988768e-02 7.537790e-02 6.819905e-02 5.914136e-02
[66] 4.913282e-02 3.908293e-02 2.974969e-02 2.165603e-02 1.506506e-02
[71] 1.000750e-02 6.342785e-03 3.832099e-03 2.204769e-03 1.206664e-03
[76] 6.274654e-04 3.096047e-04 1.447502e-04 6.402414e-05 2.674426e-05
[81] 1.053055e-05 3.900205e-06 1.355559e-06 4.409650e-07 1.338644e-07
[86] 3.779700e-08 9.888749e-09 2.386939e-09 5.289241e-10 1.069734e-10
[91] 1.961179e-11 3.232713e-12 4.743655e-13 6.120845e-14 6.837114e-15
[96] 6.477266e-16 5.060364e-17 3.130122e-18 1.437301e-19 4.355457e-21
[101] 6.533186e-23
which(y==max(y)) # which is the value of x with the highest probability
[1] 61
# you may also use: dbinom(0:100,100,0.6) to calculate the probabilities
plot(x,y,main="Binomial density B(100,0.6)",col="red",xlab="Observations",ylab="Probability") # pdf
abline(v=which(y==max(y)),col="blue")
text(80,0.08,"Highest probability",col="green")
dbinom(60,100,0.6) # calculate P(X=60)
[1] 0.08121914
dbinom(61,100,0.6) # calculate P(X=61)
[1] 0.07988768
Try to change the number of simulations above and observe the behavior of the pdf.
# Calculate P(X=27) when X~B(100,0.25)
dbinom(27, size=100, prob=0.25)
[1] 0.08064075
#F(x)=P(X<x) pbinom(x,n,p)
pbinom(25,100,0.6)# will give as output P(X<=25)= when X~B(100,0.6)
[1] 1.255514e-12
pbinom(60,100,0.6)# will give as output P(X<=60)= when X~B(100,0.6)
[1] 0.5379247
# Find x such that F(x)= a given value
# P(X<x)=0.5, qbinom(F(x),n,p)
qbinom(0.5,100,0.6) # output x such as F(x)=0.5 when X~B(100,0.6)
[1] 60
A bank issues credit cards to customers.Based on the past data, the bank has found out that 75% of all accounts pay on time the bill. If a sample of 25 accounts is selected at random from the current database, construct the Binomial Probability Distribution of accounts paying on time.
We have p=0.75 (pays on time), n=25 individuals, We need to calculate P(X=x) when x=0,1,2,3,4,5,…,25
i=seq(0,25)
prob_binom=dbinom(i, size=25, prob=0.75)# creates a vector with calculated probabilities
prob_binom
[1] 8.881784e-16 6.661338e-14 2.398082e-12 5.515588e-11 9.100720e-10
[6] 1.146691e-08 1.146691e-07 9.337339e-07 6.302704e-06 3.571532e-05
[11] 1.714335e-04 7.013190e-04 2.454617e-03 7.363850e-03 1.893561e-02
[16] 4.165835e-02 7.810941e-02 1.240561e-01 1.654082e-01 1.828195e-01
[21] 1.645376e-01 1.175268e-01 6.410555e-02 2.508478e-02 6.271195e-03
[26] 7.525435e-04
Graphically we have:
plot(i,prob_binom,type="l",col="red",lwd=2,main="Density distribution of n clients paying on time",xlab="nr of clients paying on time",ylab="probability of paying on time")
abline(v=which(prob_binom==max(prob_binom)),col="blue")
text(10,0.15,"Highest probability",col="green")
which(prob_binom== max(prob_binom))# x-20 has the highest probability of paying on time. based on the information above the highest probability of individuals paying on time is 20 individuals with a probability = 0.1828195.
[1] 20
prob_binom[20] # P(X=20)
[1] 0.1828195
# just to check the probabilities based also in the graph below
prob_binom[18:21]
[1] 0.1240561 0.1654082 0.1828195 0.1645376
Exercise Try to change the number of individuals, what you observe?
Central Limit Theorem
par(mfrow=c(1,3))
hist(rbinom(15,15,0.6),main="N=15 simulation")
hist(rbinom(150,15,0.6),main="N=150 simulation")
hist(rbinom(1500,15,0.6),main="N=1500 simulation")
Suppose the probability a client will buy a product in a store is 0.35. What is the distribution of the probabilities of buying a product if in the store in one day arrive 15 clients? We have, X = 0,1,2,..,15 n = 15, and p = 0.35
j=seq(0,15)
Vect=dbinom(j,15,0.35)#
barplot(Vect,main="probability 15 clients buy",xlab="Nr of clients buying",ylab="Probability of buying")
plot(j,Vect,type="h",lwd=3,col="red",main="Probability that in 15 clients x will buy",xlab="Nr of clients buying",ylab="Probability of buying")
plot(j,Vect,type="l",lwd=3,col="red",main="Probability that in 15 clients x will buy",xlab="Nr of clients buying",ylab="Probability of buying")
How we simulate and plot, calculate probabilities and find quantiles from a Poisson distribution.
rpois(20,5)# simulate 20 values from a Poisson distribution a=5
[1] 4 4 3 4 5 9 8 6 6 1 3 1 7 4 3 6 3 2 4 4
# Simulation graphs -Histogram
hist(rpois(50,5),col="red",xlab="Values",ylab="Frequencies", main="Histogram of P(5) for 50 simulations")
hist(rpois(1000,5),col="red",xlab="Values",ylab="Frequencies", main="Histogram of P(5) for 10^3 simulations")
# probability density function for the simulations above
plot(density(rpois(1000,5)),col="blue",lwd=2,main="PDF of P(5) for n=10^3 simulations")
# CDF for P(5), F(x)=P(X<x)
ppois(2,5)# calculates P(X<=2) when X~P(5), including x=2
[1] 0.124652
# Calculates probability of X=x, P(X=x)
dpois(0,5)# P(X=0) when X~P(5)
[1] 0.006737947
dpois(1,5) # P(X=1) when X~P(5)
[1] 0.03368973
dpois(0,5)+dpois(1,5) # P(X=0)+P(X=1)=exp(-5)+exp(-5)*5^1
[1] 0.04042768
# Finds the value of x for wich we have F(x)=p , qpois(p,a)
qpois(0.03368973,5) # finds x such that P(X<x)=F(x)=0.5 when X~P(5)
[1] 1
Try to change the value of a=20 and a=0.5 what you observe for the same number of observations?
The random variable X has a Poisson distribution with lambda=15. Calculate the probabilities: P(X=20) ; P(X>20); P(10<X<15) ;
la=15 # declare object lambda
# Calculate P(X=20)
dpois(20,15)
[1] 0.04181031
# Calculate P(X>20)
1-ppois(20,15)
[1] 0.08297091
# Calculate P(10<X<15), Attention! for discrete distribution = is important
ppois(14,15)-ppois(9,15)
[1] 0.3958
If we increase the simulation number how will the histogram change? Poisson approximation to Normal distribution
par(mfrow=c(1,4))
hist(rpois(15,3),main="N=15 simulation")
hist(rpois(150,3),main="N=150 simulation")
hist(rpois(1500,3),main="N=1500 simulation")
hist(rpois(15000,3),main="N=15000 simulation")
We know that the parameter of a Poisson distribution is estimated as te mean of the sample. Let us check!
mean(rpois(10,3))
[1] 3.5
mean(rpois(100,3))
[1] 3.01
mean(rpois(1000,3))
[1] 3.118
Observe, the simulation have a mean value very close to the parameter a=3, especially this is observed with the increasing number of simulations. This confirms the fact that: Parameter estimation in probability distributions is closely related with the sample size, the larger the sample the more accurate the estimation. (Central Limit Theorem and Law of Large Numbers)
Graphically the Poisson distribution.
set.seed(34)
pois=rpois(150,3)# simulate 150 values from P(3)
table(pois)
pois
0 1 2 3 4 5 6 7 8 9
9 19 33 27 26 24 4 5 2 1
hist(pois, main="150 Simulation")
lines(table(pois),col="red",type="l")
barplot(table(pois)/150)
Suppose you get approximately 16 calls a day from your co-workers. Simulate a 365 day call and visualize it. How can you interpret the histogram below?
set.seed(123)
call <- rpois(n = 365, lambda = 16)
which(call==max(call))
[1] 202
call[202]
[1] 29
ggplot() +
geom_histogram(aes(call), binwidth = 1, fill = "white", color = "black" )
A service center has observed that during one day it can offer {0,1,2,3} services with probabilities respectively : 0.6; 0.2; 0.15; 0.05 a. Simulate number of services for a period of 10 days, 50 days, 100 days (plot it)
Sim.10<-sample(c(0,1,2,3), size=10, replace=TRUE, prob=c(.6,.2,.15,.05))
Sim.50<-sample(c(0,1,2,3), size=50, replace=TRUE, prob=c(.6,.2,.15,.05))
Sim.100<-sample(c(0,1,2,3), size=100, replace=TRUE, prob=c(.6,.2,.15,.05))
# Histograms
par(mfrow=c(1,3))
hist(Sim.10)
hist(Sim.50)
hist(Sim.100)
The probability density function of normal distribution is: f(x)=exp((-(x-mu)^2)/2s^2)/ssqrt(2*pi).
f.norm<-function(x){
exp(-(x-64)^2/2*9^2)/9*sqrt(2*pi)
}
# calculate P(X<40)
integrate(f.norm,-Inf,40)
0 with absolute error < 0
# Calculate P(X>70)
integrate(f.norm,70,Inf)
0 with absolute error < 0
# Calculate P(40<X<70)
integrate(f.norm,-Inf,70)
0.07757019 with absolute error < 8.2e-05
# Probability density function for N(0,1)
f.1<-function(x){
(1/sqrt(2*pi))*exp(-x^2/2)
}
# Density plot for a normal distribution
curve(f.1,xlim=c(-10,10),main="Normal Density plot", xlab="X values", ylab="Probabilities")
curve(f.1,xlim=c(-3,3),main="Normal Density plot", xlab="X values", ylab="Probabilities")
rnorm: generate random Normal variables with a given mean and standard deviation
pnorm: evaluate the cumulative distribution function for a Normal distribution
qnorm: return the quantile for a given value of x
rnorm(4) # simulate 4 observations from X~N(0,1)
[1] -1.0396804 -0.1843089 0.9672673 -0.1082801
rnorm(4,mean=3) # simulate 4 observations from X~N(3,1)
[1] 2.301579 2.724055 4.114649 3.550044
rnorm(4,mean=3,sd=3) # simulate 4 observations from X~N(3,9) (sd=3)
[1] 6.710027 3.417294 4.230825 1.324629
Below are some examples of simulating, finding probabilities and quantiles.
pnorm(1, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) # P(X<1)
[1] 0.8413447
qnorm(0.8413, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) # find x so P(X<x)=0.8413
[1] 0.9998151
#
pnorm(0, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) # P(X<0)
[1] 0.5
qnorm(0.5, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) # find x so P(X<x)=0.5
[1] 0
rnorm(10, mean = 0, sd = 1)# simulate 10 values from a normal dist mu=0, sd=1
[1] 0.6053707 -0.5063335 -1.4205655 0.1279930 1.9458512 0.8009143
[7] 1.1652534 0.3588557 -0.6085572 -0.2022409
rnorm(10, mean = 0, sd = 1)# simulate 10 values from a normal dist mu=0, sd=1
[1] -0.2732481 -0.4686998 0.7041673 -1.1973635 0.8663661 0.8641525
[7] -1.1986224 0.6394920 2.4302267 -0.5572155
rnorm(10, mean = 0, sd = 1)# simulate 10 values from a normal dist mu=0, sd=1
[1] 0.84490424 -0.78220185 1.11071142 0.24982472 1.65191539
[6] -1.45897073 -0.05129789 -0.52692518 -0.19726487 -0.62957874
# Every simulation is different
summary(rnorm(10, mean = 0, sd = 1))# summary desc statistics
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.186207 -0.700819 0.090291 -0.004632 0.567289 1.484031
summary(rnorm(100, mean = 0, sd = 1))# summary desc statistics
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.80977 -0.83166 -0.02167 -0.10858 0.63131 2.08672
summary(rnorm(1000, mean = 0, sd = 1))# summary desc statistics
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.69533 -0.65303 0.05172 0.03393 0.68304 3.39037
summary(rnorm(10000, mean = 0, sd = 1))# summary desc statistics
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.845320 -0.665178 -0.012122 -0.000968 0.687333 3.847768
# increasing the number of simulations will converge the values of mean and sd to theoretical values mu=0 and sd=1
Graphically: Histogram and QQ-Plot
y <- rnorm(200,mean=-2,sd=4) #simulate 200 observations from N(2,16)
hist(y,col="red",xlab="Observation",ylab="Frequence", main="Histogram N(2,16)")
qqnorm(y,col="red") # plot empirical vs theoretical N(2,16)
# add y=x for a better view of the fit
qqline(y,col="blue",lwd=2)
Exercise Try to increase the number of observations and observe QQ-plot.
Histogram and density plot for normal distribution.
set.seed(34)
N <- rnorm(1000, mean=50, sd=8)# simulate n=1000 from N(50,25)
hist(N, probability=TRUE,col="red",main="Normal distribution")
N.1 <- seq(min(N), max(N), length=100)
lines(N.1, dnorm(N.1, mean=50, sd=8))
Standardized values and histogram/ density plot
Z=(N-mean(N))/sd(N)# standardize simulations (N)
hist(Z, probability=TRUE,col="red",main="Standardized density plot")
N.2 <- seq(min(Z), max(Z), length=100)
lines(N.2, dnorm(N.2),col="black",lwd=2)
Normal distribution Simulation example
set.seed(512)
x=rnorm(100,0,1)# real observations
plot(density(x),main=" Simulated vs Real",lwd=2,xlim=c(-7,7),ylim=c(0,0.45))
x1=rnorm(100,0.5,1.5)# simulate 100 observations from N(0.5,1.5)
lines(density(x1),col="red",lwd=2)# plot x1
x2=rnorm(100,1,1)#simulate 100 observations from N(1,1)
lines(density(x2),col="blue",lwd=2)# plot x2
x3=rnorm(100,3,3)#simulate 100 observations from N(3,3)
lines(density(x3),col="green",lwd=2)# plot x3
legend(-6,0.4,c("reale","Sim 1","Sim 2","Sim 3"),fill=c("black","blue","red","green"),box.col="white")
# We can also plot a scatterplot to check the fitting
plot(x,x1,col="red",pch="R",ylim=c(-4,10),lwd=4,ylab="X2,X3",xlab="Real",main="Real vs Simulations")
points(x,x2,col="blue",pch="O")
points(x,x3,col="green",pch="X3")
legend(-3,8,c("real~1","real~2","real~3"),fill=c("blue","red","green"),box.col="white")
Exercise Try to graph the empirical rules for a normal distribution.
Simulations: Simulate 20 observations from exponential distribution with rate a=2
rexp(20,2)
[1] 0.236461562 1.185195289 0.306916572 0.853144883 0.517788602
[6] 0.356570007 0.223254350 0.112111269 0.126689595 0.128251989
[11] 0.595154632 0.295709382 0.244065052 0.410756204 0.275743932
[16] 0.341076302 0.039352908 0.556349813 0.005826933 0.752978405
# Histogram
hist(rexp(100,2), breaks = 100, col="red",xlab="vlerat X",main = "Histogram of exponential distribution")
text(1,6,"X~E(2)")
text(1,5,"100 observations")
#
hist(rexp(1000,5),col="red",xlab="Simulations",ylab="Frequency", main="Histogram of Exp(5) for 10^3 simulations")
plot(density(rexp(1000,5)),col="blue",lwd=2,main="Probability Density Function ")
Calculate probabilities
# Cumulative function F(x) for X~E(a)
pexp(2,5)# calculates F(2)=P(X<2) for X~E(5)
[1] 0.9999546
Exp=function(x,a=5){a*exp(-a*x)}# p.d.f for exponential distribution of unknown parameter a (by default a=5)
# P(X<2)<-integrate(Exp,0,2)
F.exp=integrate(Exp,0,2) # same as: pexp(x,a)=F(x) for X~E(a)
# qexp(p,a) will find x such as F(x)=P(X<x)=p for X~E(a)
qexp(0.9999546,5) # Finds x such as F(x)=0.98168; Look pexp() above.
[1] 2
Remember from the Poisson example above,lambda=15,Let suppose the time from two arrivals is exponential with intensity 15 arrival per hour (60minutes. Estimate the parameter of exponential distribution in minutes. 15 arrivals/60 minutes=0.25 arrivals/min =rate Find the probability that time between two arrivals is:
less than 10 minutes.
more than 5 minutes and less than 10 minutes
more than 30 minutes
library(ggplot2)
max.x<-qexp(0.999, rate=0.25)# finds the maximum value of x for which exist P(X<x)=0.999 very close to 1
x.value <- seq(0, max.x, length=100)# construct a sequence of x
qplot(x.value, dexp(x.value, rate=0.2), geom="line", ylab="f(x)", xlab="x",main="Density plot for exponential distribution")+
geom_vline(xintercept=c(5,10,30),col="red",lwd=1)
# we added the vertical lines to observe the probabilities
# calculate probabilities
pexp(10,0.25)
[1] 0.917915
pexp(10,0.25)- pexp(5,0.25)
[1] 0.2044198
1-pexp(30,0.25)
[1] 0.0005530844
Use runif() to sample from a continuous uniform distribution.
runif(n, min=0, max=1)# simulate n values from Uniform (0,1) punif(): To calculate the cdf qunif(): to find the quantile for a given value of probability
u <- runif(100000, min = 0, max = 1)
punif(0.5,0.1)# Calculate P(X<0.5)
[1] 0.4444444
qunif(0.4444444,0.1)# Find x such as P(X<x)=0.444
[1] 0.5
# plot to visualize
ggplot() +
geom_histogram(aes(u), binwidth = 0.05, boundary = 0, fill = "white", colour = "black")