Introduction to Probability Distributions

Many probability distributions are used across the disciplines of statistics. For this assignment we focus on the most frequently used ones; for a list of all distributions, see the link below.

https://en.wikipedia.org/wiki/List_of_probability_distributions

Setting up the working directory

setwd("C:/Users/nikhil.sharma/Desktop/Data Science Training/Day3")

Probability Distributions

1. Random Variable: the outcome of a statistical experiment

A probability distribution is a table or an equation that links each 
outcome of a statistical experiment with its probability of occurrence. 
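
As a minimal sketch of the table form, the distribution of a single fair die can be written as a data frame. The name `die` and its column names are illustrative only and are not used elsewhere in these notes.

```r
# Probability distribution of one fair six-sided die, written as a table.
# 'die' is an illustrative name chosen for this sketch.
die <- data.frame(outcome = 1:6, probability = rep(1/6, 6))
print(die)

# The probabilities of all possible outcomes must add up to 1.
total <- sum(die$probability)
```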


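2. Continuous Probability Distribution


For a continuous random variable, probabilities are assigned to ranges of values rather than to single values.
A cumulative probability refers to the probability that the value 
of a random variable falls within a specified range.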
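
A small sketch of a cumulative probability, assuming a random variable that is uniform on [0, 1]. The variable names are illustrative.

```r
# Cumulative probability P(X <= 0.3) for X uniform on [0, 1].
p_below <- punif(0.3, min = 0, max = 1)

# Probability that X falls in the range (0.2, 0.7]:
# the difference of two cumulative probabilities.
p_range <- punif(0.7, min = 0, max = 1) - punif(0.2, min = 0, max = 1)
```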

3. Uniform Probability Distribution


A distribution in which all values of the random variable (RV) occur with equal probability.

4. Discrete Probability Distribution

A distribution for which we can calculate the probability of each specific value of the random variable.

Discrete Distributions

Definition: A distribution whose variables can only take discrete values

Note: With a discrete probability distribution, each possible value of the discrete random variable
can be associated with a non-zero probability. Thus, a discrete probability distribution can 
always be presented in tabular form.
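
The tabular form mentioned in the note can be sketched with `dbinom` and `pbinom`. The example (number of heads in three fair coin flips) and the name `tab` are assumptions made for illustration.

```r
# Tabular form of a discrete distribution:
# number of heads in 3 flips of a fair coin.
heads <- 0:3
tab <- data.frame(
  heads       = heads,
  probability = dbinom(heads, size = 3, prob = 0.5),  # P(X = k)
  cumulative  = pbinom(heads, size = 3, prob = 0.5)   # P(X <= k)
)
print(tab)
```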

List of Distributions

1. Uniform distribution
2. Bernoulli distribution
3. Geometric distribution
4. Binomial distribution
5. Negative binomial distribution
6. Poisson distribution

Uniform Distributions

#list of functions that can be used

#dunif(x, min = 0, max = 1, log = FALSE) #gives the density function
#punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE) #gives the distribution function
#qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE) #gives the quantile function
#runif(n, min = 0, max = 1) #generates n random values between min and max
#creating a simulation for rolling a die


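trials <- 300
lo <- 1 #smallest face; named lo and hi so that base::min and base::max are not masked
hi <- 6 #largest face
x <- as.integer(runif(trials, lo, hi + 1)) #generating uniform draws on [1, 7) and truncating them to 1..6
#as.integer truncates, so every face gets an interval of equal width; the breaks are offset by .5 so each bar is centred on a face
hist(x, main = paste(trials, "rolls of a single die"), breaks = seq(lo - .5, hi + .5, 1))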
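
An alternative sketch of the same die simulation uses `sample`, which draws the integers directly instead of truncating uniform values. This is a swapped-in technique, not the method used above; the seed and the names `rolls` and `freq` are arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Draw 300 die faces directly; each face has probability 1/6.
rolls <- sample(1:6, size = 300, replace = TRUE)

# Relative frequency of each face; each should be near 1/6.
freq <- table(factor(rolls, levels = 1:6)) / length(rolls)
print(round(freq, 3))
```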

Bernoulli Distribution

#A distribution that can take only two possible values

trials <- 100
lo <- 0 #tails
hi <- 1 #heads

x <- as.integer(runif(trials, lo, hi + 1)) #uniform draws on [0, 2) truncated to 0 or 1
hist(x, main = paste(trials, "flips of a coin"), breaks = seq(lo - .5, hi + .5, 1))
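
A Bernoulli trial is a binomial draw with `size = 1`, so `rbinom` can generate the flips directly. The following is a minimal sketch with an arbitrary seed and illustrative names.

```r
set.seed(2)  # arbitrary seed for reproducibility

# 100 Bernoulli trials with success probability 0.5 (1 = head, 0 = tail).
flips <- rbinom(100, size = 1, prob = 0.5)

# Proportion of heads; it should be close to 0.5.
prop_heads <- mean(flips)

# Exact probability of a success when prob = 0.3.
p_success <- dbinom(1, size = 1, prob = 0.3)
```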

Geometric Distribution

Counts the number of failures before the first success.

#functions
#dgeom(x, prob, log = FALSE)
#pgeom(q, prob, lower.tail = TRUE, log.p = FALSE)
#qgeom(p, prob, lower.tail = TRUE, log.p = FALSE)
#rgeom(n, prob)

x <- rgeom(10000, prob = .5)
hist(x, main = paste(10000, "coin-flip runs: failures before the 1st head"))
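
With success probability p, the expected number of failures before the first success is (1 - p) / p. The sketch below compares that value with a simulated mean; the seed and variable names are arbitrary.

```r
p <- 0.5

# Theoretical mean number of failures before the first success.
theoretical_mean <- (1 - p) / p

set.seed(3)  # arbitrary seed for reproducibility
sim_mean <- mean(rgeom(10000, prob = p))

# Exact probability of exactly 2 failures before the first success:
# (1 - p)^2 * p
prob_two_failures <- dgeom(2, prob = p)
```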

Binomial Distribution

Counts the total number of successes in n trials, where the trials are independent Bernoulli trials with a fixed probability of success p. The random variable is X = number of successes.

#list of functions
#dbinom(x, size, prob, log = FALSE)
#pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
#qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
#rbinom(n, size, prob)

#simulation

x <- rbinom(10000,size = 100, prob = .1)
hist(x, main = paste(10000, "binomial draws (size = 100, prob = 0.1)"))

x1 <- rbinom(5000, prob = .4, size = 100)
hist(x1, breaks = 25)

x2 <- rbinom(5000, prob = .5, size = 10000)
hist(x2, breaks = 25)

x3 <- rbinom(5000, prob = .6, size = 1000000)
hist(x3, breaks = 25)
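
A short sketch of the exact binomial quantities behind the first simulation (size = 100, prob = 0.1). The mean is n * p and the variance is n * p * (1 - p); variable names are illustrative.

```r
n <- 100
p <- 0.1

mean_theory <- n * p            # expected number of successes
var_theory  <- n * p * (1 - p)  # variance of the number of successes

# P(X = 10) and P(X <= 10).
p_exactly_10 <- dbinom(10, size = n, prob = p)
p_at_most_10 <- pbinom(10, size = n, prob = p)

# pbinom is the running sum of dbinom.
p_at_most_10_sum <- sum(dbinom(0:10, size = n, prob = p))
```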

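Negative Binomial Distribution

Counts the number of trials until you observe r successes, where each trial has a fixed probability of success p.

The random variable is Y = number of trials until the r-th success. Note that R's nbinom functions count the number of failures before the r-th success (r is the size argument), which is Y - r.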

#list of functions for neg binom
#dnbinom(x, size, prob, mu, log = FALSE)
#pnbinom(q, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
#qnbinom(p, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
#rnbinom(n, size, prob, mu)

#simulation example
x <- rnbinom(100000, 1, .15)
hist(x, main = "Negative binomial distribution")

x1 <- rnbinom(500, mu = 4, size = 1)
hist(x1, breaks = 20)

x2 <- rnbinom(500, mu = 4, size = 10)
hist(x2, breaks = 20)

x3 <- rnbinom(500, mu = 4, size = 100)
hist(x3, breaks = 20)
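
Because `rnbinom` returns failures before the `size`-th success, the number of trials Y is obtained by adding `size` back. A minimal sketch, with r = 3 and p = 0.4 chosen arbitrarily:

```r
r <- 3    # number of successes to wait for
p <- 0.4  # probability of success on each trial

set.seed(4)  # arbitrary seed for reproducibility
failures <- rnbinom(10000, size = r, prob = p)

# Total number of trials until the r-th success.
trials_needed <- failures + r

# Theoretical mean number of trials is r / p.
mean_trials_theory <- r / p
c(simulated = mean(trials_needed), theory = mean_trials_theory)
```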

Poisson Distribution

Definition: a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event

Note: The Poisson distribution assumes that the mean and variance are the same. Sometimes your data show a variance that is greater than the mean. This situation is called overdispersion, and negative binomial regression is more flexible in that regard than Poisson regression (you could still use Poisson regression in that case, but the standard errors could be biased). The negative binomial distribution has one more parameter than the Poisson distribution, which adjusts the variance independently of the mean. In fact, the Poisson distribution is a limiting case of the negative binomial distribution.
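
Overdispersion can be sketched by drawing from both distributions with the same mean. In R's `mu`/`size` parameterisation the negative binomial variance is mu + mu^2 / size. The mean of 4 and size of 1 are arbitrary choices.

```r
set.seed(5)  # arbitrary seed for reproducibility
mu <- 4

pois_draws <- rpois(10000, lambda = mu)
nb_draws   <- rnbinom(10000, mu = mu, size = 1)

# Poisson: variance is close to the mean.
c(mean = mean(pois_draws), var = var(pois_draws))

# Negative binomial: same mean, much larger variance.
c(mean = mean(nb_draws), var = var(nb_draws))

# Theoretical negative binomial variance: mu + mu^2 / size.
nb_var_theory <- mu + mu^2 / 1
```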

#list of functions
#dpois(x, lambda, log = FALSE) #density
#ppois(q, lambda, lower.tail = TRUE, log.p = FALSE) #distribution
#qpois(p, lambda, lower.tail = TRUE, log.p = FALSE) #quantile
#rpois(n, lambda) #random generation


#lambda: mean number of events per interval
#n = number of random values to generate

#simulation

x1 <- rpois(10, 5)
x2 <- rpois(100, 6)
x3 <- rpois(1000, 9)
x4 <- rpois(10000, 2)
x5 <- rpois(100000, 10)
hist(x1, breaks = 10, main = "Poisson distribution")

hist(x2, breaks = 5, main = "Poisson distribution")

hist(x3, breaks = 8, main = "Poisson distribution")

hist(x4, breaks = 7, main = "Poisson distribution")

hist(x5, breaks = 9, main = "Poisson distribution")

Continuous Distributions

Definition: If a random variable is a continuous variable, its probability distribution is called 
a continuous probability distribution.

Differences between continuous and discrete distributions:

1. The probability that a continuous random variable will assume a particular value is zero.

2. As a result, a continuous probability distribution cannot be expressed in tabular form.

3. Instead, an equation or formula is used to describe a continuous probability distribution.
The equation used to describe a continuous probability distribution is called
a probability density function. Sometimes, it is referred to as a density function,
a PDF, or a pdf. 

The probability that a random variable assumes a value between a and b is equal to the 
area under the density function bounded by a and b.
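
The area statement can be checked numerically. This sketch assumes a standard normal density and the arbitrary bounds a = -1 and b = 2, and compares numerical integration with the difference of two cumulative probabilities.

```r
a <- -1
b <- 2

# Area under the standard normal density between a and b.
area <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the cumulative distribution function.
via_cdf <- pnorm(b) - pnorm(a)

c(area = area, via_cdf = via_cdf)
```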

List of continuous distributions:

1. Gamma
2. Inverse Gamma
3. Normal 
4. Log Normal
5. Inverse Gaussian
6. Chi Sq
7. Many more

Gamma Distribution

A two-parameter family of continuous distributions (alpha: shape parameter; beta: rate parameter, the reciprocal of the scale parameter)

Only defined for RV > 0

As alpha increases (with beta fixed), the pdf becomes flatter and more spread out along the x axis

As beta increases (with alpha fixed), the pdf becomes steeper because the decay rate increases

Wherever we want to model a variable that is always positive and can take continuous values, we can use the Gamma distribution

#functions
#dgamma(x, shape, rate = 1, scale = 1/rate, log = FALSE)
#pgamma(q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE,log.p = FALSE)
#qgamma(p, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)
#rgamma(n, shape, rate = 1, scale = 1/rate)


#Simulation

x1 <- rgamma(1000, 2, scale = 1/1)
x2 <- rgamma(1000, 4, scale = 1/2)
x3 <- rgamma(1000, 6, scale = 1/3)
Gx4 <- rgamma(1000, 8, scale = 1/4) #kept under a separate name for the comparison with the inverse gamma
hist(x1, main = "Gamma Distribution", probability = TRUE, col = "blue")
lines(density(x1), col = "red")

hist(x2, main = "Gamma Distribution", probability = TRUE, col = "blue")
lines(density(x2), col = "red")

#histogram and density curve must use the same sample (x3)
hist(x3, main = "Gamma Distribution", probability = TRUE, col = "blue")
lines(density(x3), col = "red")

hist(Gx4, main = "Gamma Distribution", probability = TRUE, col = "blue")
lines(density(Gx4), col = "red")
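
For a gamma distribution with a given shape and rate (rate = 1 / scale), the mean is shape / rate and the variance is shape / rate^2. The sketch below evaluates both for the four parameter pairs used in the simulation; variable names are illustrative.

```r
shape <- c(2, 4, 6, 8)
rate  <- c(1, 2, 3, 4)   # scale = 1 / rate, as in the simulation

gamma_mean <- shape / rate    # identical for all four pairs
gamma_var  <- shape / rate^2  # shrinks as shape and rate grow together

data.frame(shape, rate, mean = gamma_mean, variance = gamma_var)
```

All four simulated samples therefore share the same mean, while the spread narrows from the first sample to the last.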

Inverse Gamma Distribution

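The distribution of the reciprocal of a gamma-distributed random variable (also called the inverted gamma distribution)

alpha, the shape parameter, controls the height

beta, the scale parameter, controls the spread

Can be used to model uncertain quantities, for example as a prior distribution for an unknown variance in Bayesian inference

Used in survival modelling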

What are posterior probabilities and prior probabilities?

Ans: A posterior probability is the probability of assigning observations to groups given the data.

A prior probability is the probability that an observation will fall into a group before you collect the data.

Prior probabilities are the original probabilities of an outcome, which will be updated with new information to create posterior probabilities. Source: http://www.investopedia.com/terms/p/prior_probability.asp
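
The update from prior to posterior follows Bayes' rule: posterior is proportional to prior times likelihood. The sketch below uses two hypothetical groups, A and B, with made-up prior and likelihood values chosen purely for illustration.

```r
# Hypothetical prior probabilities of belonging to each group.
prior <- c(A = 0.7, B = 0.3)

# Hypothetical likelihood of the observed data under each group,
# P(data | group).
likelihood <- c(A = 0.2, B = 0.9)

# Bayes' rule: normalise prior * likelihood so it sums to 1.
posterior <- prior * likelihood / sum(prior * likelihood)
print(round(posterior, 4))
```

Although group A starts out more probable, the data favour group B strongly enough that B has the larger posterior probability.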

library(invgamma)
#functions

#rinvgamma(n, shape, rate = 1)
#dinvgamma(x, shape, rate = 1)

#simulation

x1 <- rinvgamma(1000, 2, scale = 1/1)
x2 <- rinvgamma(1000, 4, scale = 1/2)
x3 <- rinvgamma(1000, 6, scale = 1/3)
IGx4 <- rinvgamma(1000, 8, scale = 1/4) #kept under a separate name for the comparison with the gamma
hist(x1, main = "Inverse Gamma Distribution", probability = TRUE, col = "blue")
lines(density(x1), col = "red")

hist(x2, main = "Inverse Gamma Distribution", probability = TRUE, col = "blue")
lines(density(x2), col = "red")

#histogram and density curve must use the same sample (x3)
hist(x3, main = "Inverse Gamma Distribution", probability = TRUE, col = "blue")
lines(density(x3), col = "red")

hist(IGx4, main = "Inverse Gamma Distribution", probability = TRUE, col = "blue")
lines(density(IGx4), col = "red")

#difference between inverse gamma and gamma keeping shape and scale parameters same

hist(Gx4, main = "Gamma Distribution", probability = T, col = "blue")

hist(IGx4, main = "Inverse Gamma Distribution", probability = T, col = "blue")

Central Limit Theorem

The central limit theorem (CLT) states that, given a sufficiently large sample size from a population with finite variance, the distribution of the sample mean is approximately normal, centred on the population mean, with a standard deviation equal to the population standard deviation divided by the square root of the sample size.

1. Why is the CLT important?

The CLT is important because, under certain conditions, you can approximate the distribution of a sample mean with a Normal distribution even though the underlying distribution is not Normal. … We can then take 10 random samples from the distribution, and calculate the mean.
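
A minimal simulation sketch of the CLT, assuming an exponential population with rate 1 (mean 1, standard deviation 1), which is clearly not normal. The sample size of 50, the 5000 repetitions, and the seed are arbitrary choices.

```r
set.seed(6)  # arbitrary seed for reproducibility
n <- 50      # size of each sample

# Mean of each of 5000 independent samples from Exp(rate = 1).
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

# The CLT predicts mean 1 and standard deviation 1 / sqrt(n).
predicted_sd <- 1 / sqrt(n)
c(mean = mean(sample_means), sd = sd(sample_means), predicted_sd = predicted_sd)
```

Plotting `hist(sample_means, probability = TRUE)` should show a roughly bell-shaped histogram even though the exponential population is strongly skewed.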

Normal Distribution

The most common distribution, also known as the Gaussian distribution.

The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the “bell curve”.

A normal distribution is identified by its mean mu and standard deviation sigma: about 68% of values lie within mu +/- sigma, about 95% within mu +/- 2 sigma, and about 99.7% within mu +/- 3 sigma.

It is symmetric around the mean, which is also equal to the mode and the median.
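
The mu +/- sigma idea can be checked with `pnorm`. The sketch below computes the probability of falling within 1, 2, and 3 standard deviations of the mean for a standard normal distribution; the same proportions hold for any mean and standard deviation.

```r
k <- 1:3

# P(mu - k * sigma < X < mu + k * sigma) for a normal distribution.
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 4)
```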

#functions

#dnorm(x, mean = 0, sd = 1, log = FALSE)
#pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
#qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
#rnorm(n, mean = 0, sd = 1)



x <- rnorm(100000, 0, 3)
hist(x, probability = T, main = "Normal Distribution", col = "blue")
lines(density(x), col = "red")

Log Normal Distribution

This distribution is used wherever the log of the RV follows the Normal distribution.

The RV takes only positive values.

#functions

#dlnorm(x, meanlog = 0, sdlog = 1, log = FALSE)
#plnorm(q, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
#qlnorm(p, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
#rlnorm(n, meanlog = 0, sdlog = 1)


x <- rlnorm(1000, 3, 1)
hist(x, probability = T, col = "blue", main = "Log Normal Distribution")
lines(density(x), col = "red")
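
The defining property can be checked by taking logs of log-normal draws: the result should look normal with mean `meanlog` and standard deviation `sdlog`. A sketch using the same parameters as the simulation, with an arbitrary seed and illustrative names:

```r
set.seed(7)  # arbitrary seed for reproducibility
y <- rlnorm(10000, meanlog = 3, sdlog = 1)

# Taking logs recovers a normal sample with mean 3 and sd 1.
log_y <- log(y)
c(mean = mean(log_y), sd = sd(log_y))

# Theoretical mean of the log-normal itself: exp(meanlog + sdlog^2 / 2).
lognormal_mean_theory <- exp(3 + 1^2 / 2)
```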

Inverse Gaussian

AKA Wald Distribution

RV ranges over (0, infinity)

The distribution’s tail decreases slowly compared to the normal distribution. Therefore, it’s suitable for modeling phenomena where there is a greater likelihood of getting extremely large values compared to the normal distribution.

This distribution has a similar shape to the Weibull distribution, but has the advantage it’s easier to estimate probabilities (the Weibull, with three parameters, is more difficult).

Note: The Gamma distribution also has a similar shape. In fact, they can look exactly the same given the right parameters. However, it’s easier to produce extremely large values with the inverse Gaussian.

source: http://www.statisticshowto.com/statistics-basics/

#functions

#dinvgauss(x, mu, lambda=1)
#pinvgauss(q, mu, lambda=1)
#qinvgauss(p, mu, lambda=1)
#rinvgauss(n, mu, lambda=1)

#lambda is the shape parameter

library(statmod)
x <- rinvgauss(10000,4,3)
hist(x,probability = T, col = "blue", main = "Inverse Gaussian Distribution")
lines(density(x), col = "red")

Chi Square Distribution

The chi-square distribution has the following properties:

1. The mean of the distribution is equal to the number of degrees of freedom: μ = v.
2. The variance is equal to two times the number of degrees of freedom: σ^2 = 2 * v.
3. When the degrees of freedom are greater than or equal to 2, the maximum value for Y occurs when Χ^2 = v - 2.
4. As the degrees of freedom increase, the chi-square curve approaches a normal distribution.

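If a RV has a standard normal distribution, then its square has a chi-square distribution with 1 degree of freedom. The sum of v independent squared standard normal RVs has a chi-square distribution with v degrees of freedom.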
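
The link to the standard normal distribution can be sketched by summing squared normal draws. With v = 7 degrees of freedom (an arbitrary choice), the simulated mean and variance should be close to v and 2 * v.

```r
set.seed(8)  # arbitrary seed for reproducibility
v <- 7       # degrees of freedom

# Each row holds v independent standard normal draws.
z <- matrix(rnorm(10000 * v), ncol = v)

# Sum of v squared standard normals: chi-square with v degrees of freedom.
chi_sq <- rowSums(z^2)

c(mean = mean(chi_sq), variance = var(chi_sq))
```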

#functions

#dchisq(x, df, ncp = 0, log = FALSE)
#pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
#qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
#rchisq(n, df, ncp = 0)

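x <- rchisq(10000, df = 7) #central chi-square with 7 degrees of freedom; a non-zero ncp argument would give a noncentral distribution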
hist(x, probability = T, col = "blue", main = "Chi Sq Distribution")
lines(density(x), col = "red")

Tweedie Distribution

Tweedie distributions are a subset of the Exponential Dispersion Models (EDM: a set of probability distributions that represents a generalisation of the natural exponential family).

The Tweedie distribution is a special case of an exponential dispersion model. It can have a cluster of data items at zero (called a “point mass”), which is particularly useful for modeling claims in the insurance industry, in medical/genomic testing, or anywhere else there is a mixture of zeros and non-negative data points. Basically, if you see a histogram with a spike at zero, it’s a possible candidate to be fitted to a Tweedie model.

This family of distributions has the following characteristics:

1. A mean of E(Y) = μ.
2. A variance of Var(Y) = φ μ^p. The p in the variance function is an additional shape parameter for the distribution. “p” is sometimes written in terms of the shape parameter α: p = (α – 2) / (α – 1).

Some familiar distributions are special cases of the Tweedie distribution:

1. p = 0: Normal distribution
2. p = 1: Poisson distribution
3. 1 < p < 2: Compound Poisson/gamma distribution
4. p = 2: Gamma distribution
5. 2 < p < 3: Positive stable distributions
6. p = 3: Inverse Gaussian distribution / Wald distribution
7. p > 3: Positive stable distributions
8. p = ∞: Extreme stable distributions
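
The variance function Var(Y) = φ μ^p can be checked against two of the special cases using base R only. The helper `tweedie_var` is a hypothetical function written for this sketch; it is not part of the tweedie package.

```r
# Hypothetical helper: Tweedie variance function, Var(Y) = phi * mu^p.
tweedie_var <- function(mu, phi, p) phi * mu^p

# p = 1, phi = 1: Poisson, where the variance equals the mean.
pois_var <- tweedie_var(mu = 4, phi = 1, p = 1)

# p = 2: gamma with shape a and rate b has mean a / b and
# variance a / b^2, which matches phi = 1 / a.
a <- 2
b <- 0.5
gamma_var_tweedie <- tweedie_var(mu = a / b, phi = 1 / a, p = 2)
gamma_var_direct  <- a / b^2
```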

#function

#rtweedie(n, xi = NULL, mu, phi, power = NULL) #xi (or power) is the variance power p

library(tweedie)
x <- rtweedie(10000, xi = 2, mu = 4, phi = 5) #variance power p = 2 (the gamma case), mean 4, dispersion 5
hist(x, col = "blue", probability = T, main = "Tweedie Distribution")
lines(density(x), col = "red")