Many probability distributions are used widely across the disciplines of statistics. For the purposes of this assignment we focus on the most frequently used ones; for a list of all distributions, see the link below.
https://en.wikipedia.org/wiki/List_of_probability_distributions
setwd("C:/Users/nikhil.sharma/Desktop/Data Science Training/Day3")
Probability Distributions
1. Random Variable: Output of a statistical experiment
A probability distribution is a table or an equation that links each
outcome of a statistical experiment with its probability of occurrence.
2. Continuous Probability Distribution
A cumulative probability refers to the probability that the value
of a random variable falls within a specified range.
3. Uniform Probability Distribution
When all values of the RV occur with equal probability.
4. Discrete Probability Distribution
When we can calculate the probability for a specific value of the random variable.
Definition: A distribution whose variables can only take discrete values
Note: With a discrete probability distribution, each possible value of the discrete random variable
can be associated with a non-zero probability. Thus, a discrete probability distribution can
always be presented in tabular form.
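As a small illustration of the tabular form, the sketch below lists the distribution of a fair six-sided die. The object names (`die_pmf`, `p_at_most_2`) are illustrative and not part of the original notes.

```r
# Tabular form of a discrete distribution: a fair six-sided die
die_pmf <- data.frame(outcome = 1:6, probability = rep(1/6, 6))
print(die_pmf)

# The probabilities of all outcomes must add up to 1
total_probability <- sum(die_pmf$probability)

# Probability of rolling at most 2 (a cumulative probability)
p_at_most_2 <- sum(die_pmf$probability[die_pmf$outcome <= 2])
```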
List of Distributions
1. Uniform Distribution
2. Bernoulli Distribution
3. Binomial
4. Negative binomial
5. Poisson
#list of functions that can be used
#dunif(x, min = 0, max = 1, log = FALSE) #gives the density function
#punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE) #gives the distribution function
#qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE) #gives the quantile function
#runif(n, min = 0, max = 1) #generates random deviates with the given parameters
#creating a simulation for rolling a die
trials <- 300
min <- 1
max <- 6
x <- as.integer(runif(trials,min,max+1) ) #generating a discrete uniform sample on min..max
#as.integer truncates towards zero, so the range min..max+1 maps onto the integers min..max; breaks are offset by .5 to centre each bar on an integer
hist(x,main=paste( trials," rolls of a single die"),breaks=seq(min-.5,max+.5,1))
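The d/p/q/r prefixes in the function list above follow the same pattern for every distribution in base R. A minimal sketch for the continuous uniform distribution on [1, 6]; the interval and the object names are chosen only for illustration.

```r
# Density is constant inside the interval: 1 / (max - min)
d <- dunif(3, min = 1, max = 6)      # 0.2
# Distribution function: P(X <= 3.5)
p <- punif(3.5, min = 1, max = 6)    # 0.5
# Quantile function is the inverse of the distribution function
q <- qunif(0.5, min = 1, max = 6)    # 3.5
# Random generation: five draws from the interval
set.seed(1)
r <- runif(5, min = 1, max = 6)
```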
#A distribution that can take only two possible values
trials <- 100
min <- 0
max <- 1
x <- as.integer(runif(trials,min,max+1))
hist(x,main=paste( trials," flips of a coin"),breaks=seq(min-.5,max+.5))
Geometric distribution: the count of failures before the first success.
#functions
#dgeom(x, prob, log = FALSE)
#pgeom(q, prob, lower.tail = TRUE, log.p = FALSE)
#qgeom(p, prob, lower.tail = TRUE, log.p = FALSE)
#rgeom(n, prob)
x <- rgeom(10000, prob = .5)
hist(x,main=paste( 10000," flipping of coin to get 1st head"))
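To connect the histogram with the underlying probabilities, the sketch below evaluates the geometric mass function for the first few values and compares the simulated mean with the theoretical mean (1 - p) / p. The object names are illustrative.

```r
p_success <- 0.5
# P(X = k): k failures before the first success, for k = 0..3
pmf <- dgeom(0:3, prob = p_success)   # 0.5 0.25 0.125 0.0625
# Theoretical mean number of failures: (1 - p) / p
theoretical_mean <- (1 - p_success) / p_success
# Simulated mean from 10000 draws
set.seed(42)
simulated_mean <- mean(rgeom(10000, prob = p_success))
```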
Counts the total number of successes in a fixed number of trials, where each trial is a Bernoulli trial with a fixed probability of success p. The random variable is X = number of successes.
#list of functions
#dbinom(x, size, prob, log = FALSE)
#pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
#qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
#rbinom(n, size, prob)
#simulation
x <- rbinom(10000,size = 100, prob = .1)
hist(x,main=paste( 10000," binomial draws (size = 100, prob = 0.1)"))
x1 <- rbinom(5000, prob = .4, size = 100)
hist(x1, breaks = 25)
x2 <- rbinom(5000, prob = .5, size = 10000)
hist(x2, breaks = 25)
x3 <- rbinom(5000, prob = .6, size = 1000000)
hist(x3, breaks = 25)
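The simulations above can be checked against the theoretical mean n * p and variance n * p * (1 - p) of a binomial random variable. The following is a sketch with illustrative object names, reusing size = 100 and prob = 0.1 from the first simulation.

```r
n_trials <- 100
p_success <- 0.1
# Theoretical mean and variance of a binomial random variable
theoretical_mean <- n_trials * p_success                     # 10
theoretical_var <- n_trials * p_success * (1 - p_success)    # 9
# Simulated counterparts
set.seed(123)
x <- rbinom(10000, size = n_trials, prob = p_success)
simulated_mean <- mean(x)
simulated_var <- var(x)
# P(X <= 10), once by summing the mass function and once via pbinom
p_sum <- sum(dbinom(0:10, size = n_trials, prob = p_success))
p_cdf <- pbinom(10, size = n_trials, prob = p_success)
```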
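Counts the number of trials until you observe r successes, where each trial has a fixed probability of success p.
Random variable is Y = number of trials until the r-th success. Note that R's nbinom functions count the number of failures before the r-th success (size = r), that is, Y - r.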
#list of functions for neg binom
#dnbinom(x, size, prob, mu, log = FALSE)
#pnbinom(q, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
#qnbinom(p, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
#rnbinom(n, size, prob, mu)
#simulation example
x <- rnbinom(100000, 1, .15)
hist(x,main=paste("-ve binomial distribution"))
x1 <- rnbinom(500, mu = 4, size = 1)
hist(x1, breaks = 20)
x2 <- rnbinom(500, mu = 4, size = 10)
hist(x2, breaks = 20)
x3 <- rnbinom(500, mu = 4, size = 100)
hist(x3, breaks = 20)
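Because `rnbinom` returns failures rather than trials, the number of trials until the r-th success has to be recovered by adding r. A short sketch, with r = 3 and p = 0.15 picked only for illustration:

```r
r_successes <- 3
p_success <- 0.15
set.seed(7)
# rnbinom counts failures before the r-th success
failures <- rnbinom(10000, size = r_successes, prob = p_success)
# Number of trials until the r-th success = failures + successes
trials_needed <- failures + r_successes
# Theoretical mean number of trials: r / p
theoretical_mean_trials <- r_successes / p_success   # 20
simulated_mean_trials <- mean(trials_needed)
```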
Definition: a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event.
Note: The Poisson distribution assumes that the mean and variance are the same. Sometimes your data show extra variation, with a variance greater than the mean. This situation is called overdispersion, and negative binomial regression is more flexible in that regard than Poisson regression (you could still use Poisson regression in that case, but the standard errors could be biased). The negative binomial distribution has one more parameter than the Poisson distribution, which adjusts the variance independently of the mean. In fact, the Poisson distribution is a limiting case of the negative binomial distribution.
#list of functions
#dpois(x, lambda, log = FALSE) #density
#ppois(q, lambda, lower.tail = TRUE, log.p = FALSE) #distribution
#qpois(p, lambda, lower.tail = TRUE, log.p = FALSE) #quantile
#rpois(n, lambda) #random generation
#lambda: mean number of events per interval
#n = number of random values to generate
#simulation
x1 <- rpois(10, 5)
x2 <- rpois(100, 6)
x3 <- rpois(1000, 9)
x4 <- rpois(10000, 2)
x5 <- rpois(100000, 10)
hist(x1, breaks = 10, main = "Poisson distribution")
hist(x2, breaks = 5, main = "Poisson distribution")
hist(x3, breaks = 8, main = "Poisson distribution")
hist(x4, breaks = 7, main = "Poisson distribution")
hist(x5, breaks = 9, main = "Poisson distribution")
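The note on overdispersion can be made concrete by comparing a Poisson sample with a negative binomial sample that has the same mean. This is a sketch only; mu = 4 and size = 2 are arbitrary illustrative values, and it relies on the `mu` parameterisation of `rnbinom`, for which the variance is mu + mu^2 / size.

```r
set.seed(2024)
pois_sample <- rpois(100000, lambda = 4)
nb_sample <- rnbinom(100000, mu = 4, size = 2)
# Poisson: mean and variance are both close to lambda
pois_mean <- mean(pois_sample)
pois_var <- var(pois_sample)
# Negative binomial: same mean, but variance = mu + mu^2 / size = 4 + 16 / 2 = 12
nb_mean <- mean(nb_sample)
nb_var <- var(nb_sample)
```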
Definition: If a random variable is a continuous variable, its probability distribution is called
a continuous probability distribution.
Difference between Continuous and Discrete Distributions:
1. The probability that a continuous random variable will assume a particular value is zero.
2. As a result, a continuous probability distribution cannot be expressed in tabular form.
3. Instead, an equation or formula is used to describe a continuous probability distribution.
The equation used to describe a continuous probability distribution is called
a probability density function. Sometimes, it is referred to as a density function,
a PDF, or a pdf.
The probability that a random variable assumes a value between a and b is equal to the
area under the density function bounded by a and b.
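The area interpretation can be verified numerically. The sketch below uses the standard normal density as an example and integrates it between a = -1 and b = 1; the choice of density and bounds is illustrative.

```r
a <- -1
b <- 1
# Area under the standard normal density between a and b
area <- integrate(dnorm, lower = a, upper = b)$value
# The same probability from the distribution function
prob <- pnorm(b) - pnorm(a)
# A single exact value is an interval of zero width, so its probability is 0
point_prob <- pnorm(0) - pnorm(0)
```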
List of Continuous distributions:
1. Gamma
2. Inverse Gamma
3. Normal
4. Log Normal
5. Inverse Gaussian
6. Chi Sq
7. Many more
Two-parameter family of continuous distributions (alpha: shape parameter, beta: rate parameter, the inverse of the scale)
Only defined for RV > 0
As alpha increases, the pdf becomes a flatter curve spread further along the x axis
As beta increases, the pdf becomes steeper because the decay rate increases
Wherever we want to model a variable that is always positive and can take continuous values, we can use the Gamma distribution
#functions
#dgamma(x, shape, rate = 1, scale = 1/rate, log = FALSE)
#pgamma(q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE,log.p = FALSE)
#qgamma(p, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)
#rgamma(n, shape, rate = 1, scale = 1/rate)
#Simulation
x1 <- rgamma(1000, 2, scale = 1/1)
x2 <- rgamma(1000, 4, scale = 1/2)
x3 <- rgamma(1000, 6, scale = 1/3)
Gx4 <- rgamma(1000, 8, scale = 1/4)
hist(x1, main = "Gamma Distribution", probability = T, col = "blue")
lines(density(x1), col = "red")
hist(x2, main = "Gamma Distribution", probability = T, col = "blue")
lines(density(x2), col = "red")
hist(x3, main = "Gamma Distribution", probability = T, col = "blue")
lines(density(x3), col = "red")
hist(Gx4, main = "Gamma Distribution", probability = T, col = "blue")
lines(density(Gx4), col = "red")
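The effect of the two parameters can also be read off the moments: the mean is shape / rate and the variance is shape / rate^2. A sketch with shape = 4 and rate = 2 (the second simulation above), using illustrative object names:

```r
shape <- 4
rate <- 2          # equivalent to scale = 1 / rate = 0.5
set.seed(99)
x <- rgamma(100000, shape = shape, rate = rate)
# Theoretical moments of the gamma distribution
theoretical_mean <- shape / rate     # 2
theoretical_var <- shape / rate^2    # 1
simulated_mean <- mean(x)
simulated_var <- var(x)
# rate and scale are two ways of supplying the same parameter
same_density <- all.equal(dgamma(1.5, shape, rate = rate),
                          dgamma(1.5, shape, scale = 1 / rate))
```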
Inverse gamma distribution
alpha, the shape parameter, controls the height
beta, the scale parameter, controls the spread
Can be used to model uncertain quantities
Used in survival modelling
What are posterior probabilities and prior probabilities?
Ans: A posterior probability is the probability of assigning observations to groups given the data.
A prior probability is the probability that an observation will fall into a group before you collect the data.
Prior probabilities are the original probabilities of an outcome, which will be updated with new information to create posterior probabilities. Source: Prior Probability, http://www.investopedia.com/terms/p/prior_probability.asp#ixzz4mBPk3cqg
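A worked example may help with the update from prior to posterior. The sketch below applies Bayes' rule to a hypothetical two-group problem; the group names and all the numbers are made up for illustration.

```r
# Hypothetical prior probabilities of belonging to each group
prior <- c(group_a = 0.7, group_b = 0.3)
# Probability of the observed data under each group (the likelihood)
likelihood <- c(group_a = 0.2, group_b = 0.9)
# Bayes' rule: the posterior is proportional to prior * likelihood
unnormalised <- prior * likelihood
posterior <- unnormalised / sum(unnormalised)
print(round(posterior, 3))
```

In this made-up case the data are far more likely under group_b, so its probability rises from the prior value of 0.3 to a posterior value of about 0.66.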
library(invgamma)
#functions
#rinvgamma(n, shape, rate = 1)
#dinvgamma(x, shape, rate = 1)
#simulation
x1 <- rinvgamma(1000, 2, scale = 1/1)
x2 <- rinvgamma(1000, 4, scale = 1/2)
x3 <- rinvgamma(1000, 6, scale = 1/3)
IGx4 <- rinvgamma(1000, 8, scale = 1/4)
hist(x1, main = "Inverse Gamma Distribution", probability = T, col = "blue")
lines(density(x1), col = "red")
hist(x2, main = "Inverse Gamma Distribution", probability = T, col = "blue")
lines(density(x2), col = "red")
hist(x3, main = "Inverse Gamma Distribution", probability = T, col = "blue")
lines(density(x3), col = "red")
hist(IGx4, main = "Inverse Gamma Distribution", probability = T, col = "blue")
lines(density(IGx4), col = "red")
#difference between inverse gamma and gamma keeping shape and scale parameters same
hist(Gx4, main = "Gamma Distribution", probability = T, col = "blue")
hist(IGx4, main = "Inverse Gamma Distribution", probability = T, col = "blue")
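The link between the two distributions can be sketched in base R without any extra package, assuming the standard relationship that the reciprocal of a Gamma(shape, rate) variable is inverse gamma distributed with the same shape and with scale equal to that rate. The object names are illustrative.

```r
shape <- 8
rate <- 4
set.seed(11)
gamma_draws <- rgamma(100000, shape = shape, rate = rate)
# Reciprocals of gamma draws follow an inverse gamma distribution
inv_gamma_draws <- 1 / gamma_draws
# Theoretical means: shape / rate for the gamma,
# rate / (shape - 1) for its reciprocal (valid for shape > 1)
gamma_mean <- shape / rate              # 2
inv_gamma_mean <- rate / (shape - 1)    # about 0.571
simulated_gamma_mean <- mean(gamma_draws)
simulated_inv_gamma_mean <- mean(inv_gamma_draws)
```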
The central limit theorem (CLT) states that, given a sufficiently large sample size from a population with finite variance, the distribution of the sample mean is approximately Normal and centred on the mean of the population, whatever the shape of the population distribution.
The CLT is important because, under certain conditions, you can approximate a distribution with the Normal distribution even though the distribution itself is not Normal. … We can then take 10 random samples from the distribution, and calculate the mean.
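A small simulation can show the theorem at work. The sketch below draws repeated samples from a skewed Exponential(1) population and looks at the distribution of their means; the population, the sample size of 50 and the number of samples are illustrative choices, not taken from the original notes.

```r
set.seed(2017)
sample_size <- 50
n_samples <- 5000
# Exponential(rate = 1) is strongly skewed, with mean 1 and sd 1
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))
# CLT: the sample means are roughly Normal(mean = 1, sd = 1 / sqrt(sample_size))
mean_of_means <- mean(sample_means)
sd_of_means <- sd(sample_means)
expected_sd <- 1 / sqrt(sample_size)
hist(sample_means, probability = TRUE, main = "Sample means of Exponential(1)")
curve(dnorm(x, mean = 1, sd = expected_sd), add = TRUE, col = "red")
```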
The most common distribution, also known as the Gaussian distribution.
The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the “bell curve”.
A normal distribution is identified by its mean mu and standard deviation sigma; for example, about 68% of values fall within mu +/- sigma
It is symmetric around the mean, which is also equal to the mode and the median
#functions
#dnorm(x, mean = 0, sd = 1, log = FALSE)
#pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
#qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
#rnorm(n, mean = 0, sd = 1)
x <- rnorm(100000, 0, 3)
hist(x, probability = T, main = "Normal Distribution", col = "blue")
lines(density(x), col = "red")
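The mu +/- sigma idea can be checked with `pnorm`. The sketch below computes the probability of falling within 1, 2 and 3 standard deviations of the mean, using the same mean 0 and sd 3 as the simulation above; the helper `within_k_sd` is a hypothetical name introduced here.

```r
mu <- 0
sigma <- 3
# Probability of falling within k standard deviations of the mean
within_k_sd <- function(k) {
  pnorm(mu + k * sigma, mean = mu, sd = sigma) -
    pnorm(mu - k * sigma, mean = mu, sd = sigma)
}
coverage <- sapply(1:3, within_k_sd)   # about 0.683, 0.954, 0.997
print(round(coverage, 3))
```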
This distribution is used wherever the log of the RV follows the Normal distribution
The RV takes only positive values
#functions
#dlnorm(x, meanlog = 0, sdlog = 1, log = FALSE)
#plnorm(q, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
#qlnorm(p, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
#rlnorm(n, meanlog = 0, sdlog = 1)
x <- rlnorm(1000, 3, 1)
hist(x, probability = T, col = "blue", main = "Log Normal Distribution")
lines(density(x), col = "red")
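Taking logs of a log-normal sample should recover a normal sample with mean `meanlog` and standard deviation `sdlog`. A quick sketch of that check, with illustrative object names:

```r
set.seed(5)
x <- rlnorm(100000, meanlog = 3, sdlog = 1)
log_x <- log(x)
# The log of the sample should look Normal(mean = 3, sd = 1)
mean_log <- mean(log_x)
sd_log <- sd(log_x)
# The median of a log-normal distribution is exp(meanlog)
theoretical_median <- exp(3)
simulated_median <- median(x)
# All simulated values are strictly positive
all_positive <- all(x > 0)
```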
Also known as the Wald distribution
RV ranges over (0, infinity)
The distribution’s tail decreases slowly compared to the normal distribution. Therefore, it’s suitable for modeling phenomena where there is a greater likelihood of getting extremely large values compared to the normal distribution.
This distribution has a similar shape to the Weibull distribution, but has the advantage it’s easier to estimate probabilities (the Weibull, with three parameters, is more difficult).
Note: The Gamma distribution also has a similar shape. In fact, they can look exactly the same given the right parameters. However, it’s easier to produce extremely large values with the inverse Gaussian.
source: http://www.statisticshowto.com/statistics-basics/
#functions
#dinvgauss(x, mu, lambda=1)
#pinvgauss(q, mu, lambda=1)
#qinvgauss(p, mu, lambda=1)
#rinvgauss(n, mu, lambda=1)
#lambda is the shape parameter
library(statmod)
x <- rinvgauss(10000,4,3)
hist(x,probability = T, col = "blue", main = "Inverse Gaussian Distribution")
lines(density(x), col = "red")
The chi-square distribution has the following properties:
The mean of the distribution is equal to the number of degrees of freedom: μ = v.
The variance is equal to two times the number of degrees of freedom: σ^2 = 2 * v.
When the degrees of freedom are greater than or equal to 2, the maximum value of the density Y occurs at Χ^2 = v - 2.
As the degrees of freedom increase, the chi-square curve approaches a normal distribution.
If a RV has a standard Normal distribution, then its square has a chi-square distribution with 1 degree of freedom
#functions
#dchisq(x, df, ncp = 0, log = FALSE)
#pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
#qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
#rchisq(n, df, ncp = 0)
x <- rchisq(10000, 7, 10) #df = 7, ncp = 10: a non-central chi-square; drop the ncp argument for the central case
hist(x, probability = T, col = "blue", main = "Chi Sq Distribution")
lines(density(x), col = "red")
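The listed properties refer to the central chi-square distribution (ncp = 0). They can be checked by building chi-square draws from squared standard normal values, as in the sketch below; 7 degrees of freedom and the object names are illustrative.

```r
set.seed(31)
deg_free <- 7
n <- 20000
# Each row holds deg_free independent standard normal draws
z <- matrix(rnorm(n * deg_free), nrow = n, ncol = deg_free)
# The sum of squares in each row is chi-square with deg_free degrees of freedom
chi_sq <- rowSums(z^2)
simulated_mean <- mean(chi_sq)   # close to deg_free
simulated_var <- var(chi_sq)     # close to 2 * deg_free
# Location of the density maximum, found numerically; theory says deg_free - 2
mode_est <- optimize(dchisq, interval = c(0, 30), df = deg_free,
                     maximum = TRUE)$maximum
```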
Tweedie distributions are a subset of Exponential Dispersion Models (EDM: a set of probability distributions that generalises the natural exponential family)
The Tweedie distribution is a special case of an exponential dispersion model. It can have a cluster of data items at zero (called a “point mass”), which is particularly useful for modeling claims in the insurance industry, in medical/genomic testing, or anywhere else there is a mixture of zeros and non-negative data points. Basically, if you see a histogram with a spike at zero, it’s a possible candidate to be fitted to a Tweedie model.
This family of distributions has the following characteristics:
a mean of E(Y) = μ
a variance of Var(Y) = φ μ^p
The p in the variance function is an additional shape parameter for the distribution. “p” is sometimes written in terms of the shape parameter α: p = (α - 2) / (α - 1). Some familiar distributions are special cases of the Tweedie distribution:
p = 0: Normal distribution
p = 1: Poisson distribution
1 < p < 2: Compound Poisson/gamma distribution
p = 2: Gamma distribution
2 < p < 3: Positive stable distributions
p = 3: Inverse Gaussian distribution / Wald distribution
p > 3: Positive stable distributions
p = ∞: Extreme stable distributions
#function
#rtweedie(n, xi = NULL, mu, phi, power = NULL) #from package tweedie; power is a synonym for xi
library(tweedie)
x <- rtweedie(10000, 2,4, 5)
hist(x, col = "blue", probability = T, main = "Tweedie Distribution")
lines(density(x), col = "red")
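The special cases can be illustrated with the variance function alone, without the tweedie package. In the sketch below, `tweedie_variance` is a hypothetical helper written for this example; it only evaluates Var(Y) = φ μ^p and does not simulate anything.

```r
# Variance function of the Tweedie family: Var(Y) = phi * mu^p
tweedie_variance <- function(mu, phi, p) phi * mu^p

# p = 1 with phi = 1 reproduces the Poisson case: variance equals the mean
poisson_var <- tweedie_variance(mu = 4, phi = 1, p = 1)             # 4

# p = 2 reproduces the gamma case: for shape k the dispersion is phi = 1 / k
shape <- 5
scale <- 2
gamma_mean <- shape * scale                                         # 10
gamma_var <- tweedie_variance(gamma_mean, phi = 1 / shape, p = 2)   # 20

# p = 0 gives a variance that does not depend on the mean, as for the normal
normal_var <- tweedie_variance(mu = c(1, 10, 100), phi = 9, p = 0)  # 9 9 9
```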