Probability Distributions, Central Limit Theorem

Rasim Muzaffer Musal

Goals:

  • Learn some of the distributions that are widely used both in academia and industry.
  • Learn the difference between discrete and continuous distributions.
  • Learn why we need to know about these distributions in modeling.
  • Learn why we focus on the Normal distribution so much (Central Limit Theorem).

Discrete vs Continuous probability distributions:

  • This distinction occurs due to the scale of the random variable we are using.

  • Recall that the difference between discrete and continuous is the following:

    • A continuous random variable is one where, between any two arbitrary values on its scale, a middle point also exists on the scale. For a random variable with a discrete scale, such a middle point might not exist.

[Figures: an example of a continuous scale and an example of a discrete scale]

Why is this relevant?

  • You can always represent observations with numbers.

  • There is no reason why you can not do calculations with these numbers.

  • The problem is that you would be doing arithmetic on values whose order and distances are not defined, so depending on the type of discrete random variable the results could be completely meaningless.

  • Car Brands are nominal and can be represented by numbers. You might prefer one brand over another, but those preferences are based on other synthesized attributes (safety, power, etc.). See the R sketch below.
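
  • A quick R illustration (the brand coding below is hypothetical): arithmetic on nominal codes runs without error but means nothing, while counts remain meaningful.

# Hypothetical coding: 1 = BrandA, 2 = BrandB, 3 = BrandC
brands=c(1,3,2,1,3)
mean(brands)                     # returns 2, but an "average brand" is meaningless
table(factor(brands,labels=c("BrandA","BrandB","BrandC"))) # counts are meaningful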

Parametric Distributions: Discrete

  • Some examples (each is sampled in the R sketch below):
    • Bernoulli: \(\{p\}\)
    • Binomial: \(\{p,n\}\)
    • Discrete Uniform: \(\{min(X),max(X)\}\)
    • Geometric: \(\{p\}\)
    • Multinomial: \(\{n,p_{1},\ldots,p_{K},K\}\)
    • Poisson: \(\{\lambda\}\)
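
  • A minimal R sketch drawing from each of these; the parameter values are made up for illustration.

rbinom(n=5,size=1,prob=0.3)                # Bernoulli (Binomial with one trial)
rbinom(n=5,size=10,prob=0.3)               # Binomial
sample(x=1:6,size=5,replace=TRUE)          # Discrete Uniform on {1,...,6}
rgeom(n=5,prob=0.3)                        # Geometric (failures before first success)
rmultinom(n=1,size=10,prob=c(0.2,0.3,0.5)) # Multinomial with K=3
rpois(n=5,lambda=2)                        # Poisson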

Parametric Distributions: Continuous

  • Some examples (each is sampled in the R sketch below):
    • Normal/Gaussian: \(\{\mu,\sigma\}\)
    • t: \(\{\mu,\sigma,\nu\}\)
    • Exponential: \(\{\lambda\}\) (rate; the mean is \(1/\lambda\))
    • Beta: \(\{\alpha,\beta\}\)
    • Dirichlet: \(\{K,\alpha_{1},\ldots,\alpha_{K}\}\)
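
  • A matching R sketch with made-up parameters. Base R has no Dirichlet sampler, so one standard construction, normalizing independent Gamma draws, is used here.

rnorm(n=5,mean=50,sd=10)            # Normal
50+10*rt(n=5,df=4)                  # location-scale t: mu=50, sigma=10, nu=4
rexp(n=5,rate=0.2)                  # Exponential with rate 0.2 (mean 5)
rbeta(n=5,shape1=2,shape2=3)        # Beta
g=rgamma(n=3,shape=c(1,2,3),rate=1) # one Dirichlet(1,2,3) draw with K=3:
g/sum(g)                            # normalize independent Gamma draws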

Thinking about the distributions

  • The above distributions are just some examples.
  • The distributions are picked to represent an abstraction of reality.
  • The probability mass function (for discrete variables, p.m.f. \(P(X=x)\)) and the probability density function (for continuous variables, p.d.f. \(f(X=x)\)) are used for this purpose.
  • The cumulative distribution function is the probability of observing a particular value or any smaller value, \(F(x)=P(X \le x)\). Both are illustrated in the R sketch below.
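
  • In R these correspond to the d- and p-prefixed functions. A small sketch using a Binomial and a standard Normal as examples:

dbinom(x=3,size=10,prob=0.3)   # p.m.f.: P(X = 3)
pbinom(q=3,size=10,prob=0.3)   # c.d.f.: F(3) = P(X <= 3)
dnorm(x=0,mean=0,sd=1)         # p.d.f.: density f(X = 0), not a probability
pnorm(q=0,mean=0,sd=1)         # c.d.f.: P(X <= 0) = 0.5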

How do we pick our distribution

  • Knowing characteristics of the random variable you have allows you to make an initial choice.
  • Examples: If we are trying to model (learn about) whether someone survives Covid-19, we can think about binary outcomes as \(0/1\), where 1 is survival. This means we should pick the Bernoulli distribution for our target variable. That allows us to use logistic regression, which models \(E(Y)=P(Y=1)=p\). In turn we can answer the question: what are the factors, and to what extent do \(X_{1},\ldots,X_{K}\) affect \(p\), where K is the number of independent variables (also called covariates or predictors)? A minimal sketch follows below.
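
  • A minimal sketch of this idea on simulated data; the covariate x1 and the coefficients \(-0.5\) and \(1.2\) are assumptions made up for illustration.

set.seed(1)
n=200
x1=rnorm(n)                      # made-up covariate
p=plogis(-0.5+1.2*x1)            # assumed true P(Y=1) on the logistic scale
y=rbinom(n,size=1,prob=p)        # Bernoulli target variable
fit=glm(y~x1,family=binomial)    # logistic regression
summary(fit)$coefficients        # estimated effect of x1 on the log-odds of p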

How do we pick our distribution

  • Examples: If we are working at a car insurance company and would like to model not just whether someone gets into an accident but how much loss they will incur to the company, actuaries could use a Gamma-based distribution, which is defined on the positive real numbers. In fact, to make it realistic, they would have to create a mixture distribution of 0s (no claim) and Gamma-distributed losses.

How do we pick our distribution

  • Reality (processes) in many cases is more complex than the distributions we have mentioned. In these cases we can improve our understanding of real-world phenomena by increasing the complexity of the distribution we end up working with.

  • For instance, we can create a mixed distribution of inflated 0s and positive values, as demonstrated a few slides below. Interpretation becomes more challenging.

P.D.F. of a normal distribution

  • The normal distribution has the following p.d.f. \[ f(X=x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right) \]
  • Recall that this distribution has two parameters: the mean \(\mu\) and the standard deviation \(\sigma\). The lowercase letter \(x\) represents the value of the random variable, and the function maps it to a density value. We will plot this a couple of slides further down; a hand-coded check is sketched below.
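
  • As a sanity check, the p.d.f. can be coded directly from the formula and compared against R's built-in dnorm.

# Hand-coded normal p.d.f. from the formula above
f=function(x,mu,sigma){(1/(sigma*sqrt(2*pi)))*exp(-0.5*((x-mu)/sigma)^2)}
f(55,mu=50,sigma=10)             # density at x=55 from the formula
dnorm(55,mean=50,sd=10)          # built-in equivalent; the two values should agree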

Generating Random Numbers

  • Assume that mean \((\mu)\) is 50. Standard deviation \((\sigma)\) is 10. I am going to generate 5 values from this distribution.
rnorm(n=5,mean=50,sd=10)
[1] 45.14724 70.21451 61.49509 53.18533 53.18820
  • There are different algorithms (sets of rules) to generate random numbers. We will not give details here, but know that there is no real randomness in computer-generated r.v.s; the numbers merely exhibit the statistical properties of randomness. That is why they are called pseudo random numbers. A seeded example is sketched below.
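
  • Because the numbers are pseudo random, fixing the generator's seed with set.seed makes a simulation reproducible.

set.seed(42)
rnorm(n=5,mean=50,sd=10)
set.seed(42)                     # resetting the same seed ...
rnorm(n=5,mean=50,sd=10)         # ... reproduces the same 5 values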

Plotting Random Numbers

library(ggplot2)
## 50,000 random numbers are generated from a normal distribution
## with mean 50 and standard deviation 10.
data=as.data.frame(rnorm(n=50000,mean=50,sd=10))
names(data)='x1'
ggplot(data=data,aes(x=x1))+geom_density()

Calculating the density for 5 observations

library(cowplot)
library(ggpubr)
# Obtain the first 5 values from the data dataframe, then
# concatenate them with the indices 1 to 5 and their density values.
Y=round(as.data.frame(cbind(c(data[1,1],data[2,1],data[3,1],data[4,1],data[5,1]),
                            c(1:5),
                            c(dnorm(data[1,1],mean=50,sd=10),
                              dnorm(data[2,1],mean=50,sd=10),
                              dnorm(data[3,1],mean=50,sd=10),
                              dnorm(data[4,1],mean=50,sd=10),
                              dnorm(data[5,1],mean=50,sd=10)))),1)

p2=ggplot(data = data.frame(x = c(20, 80)), aes(x)) +
  ggtitle(bquote("Densities of X when" ~ mu ~ "= mean(data) and" ~ sigma ~ "= sd(data)"))+
  theme(plot.title = element_text(hjust = 0.5))+
  stat_function(fun = dnorm, n = 101, args = list(mean = mean(data$x1), sd = sd(data$x1))) + ylab("")+
  scale_x_continuous(breaks = Y$V1)+
  geom_point(data=Y,aes(size=4,x=V1,y=dnorm(V1,mean=mean(data$x1),sd=sd(data$x1)),color="red"))+
  theme(legend.position = "none")+
  geom_segment(data=Y,aes(x = V1, y = rep(0,length(V1)),
                          xend = V1, yend = dnorm(V1,mean=mean(data$x1),sd=sd(data$x1)),color="blue"))
p2

Evaluating Density

The Normal distribution

  • As you know, the normal distribution is symmetric. Few things are symmetrically distributed.

  • Why, then, do we talk about this distribution so much in intro classes?

  • Central Limit Theorem

Demo of different distributions: Gamma

## 50,000 random numbers are generated from a gamma distribution
## with shape 250 and rate 5 (mean = shape/rate = 50).
data=as.data.frame(rgamma(n=50000,shape=250,rate=5))
names(data)='x1'
ggplot(data=data,aes(x=x1))+geom_density()

Demo of different distributions: Mixed

## N random numbers that have a mixed distribution:
## the first K numbers are greater than 0,
## numbers K+1 to N are 0,
## p is the probability of a number being greater than 0.
N=50000
p=0.8
K=round(N*p,0)
data=as.data.frame(matrix(nrow=N,ncol=1))
data[1:K,1]=(rgamma(n=K,shape=250,rate=5))
data[(K+1):N,1]=0
names(data)='x1'
ggplot(data=data,aes(x=x1))+geom_density()

Central Limit Theorem (CLT)

  • The Central Limit Theorem tells us that regardless of how a random variable itself is distributed, averages (and sums, but ignore that for now) of samples from a population will be approximately normally distributed, with mean \(\mu\) and standard error (standard deviation of sample means) \(\sigma/\sqrt{n}\), if \(n\) is large enough. A lot of results in classical statistics depend on this theorem. A small check is sketched below.
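
  • A quick sketch of the claim, using a skewed Exponential population with rate 1 (so \(\mu=1\) and \(\sigma=1\)); the sample size and number of samples are illustrative choices.

n=30
means=replicate(10000,mean(rexp(n=n,rate=1))) # 10,000 sample means
mean(means)                      # should be close to mu = 1
sd(means)                        # should be close to sigma/sqrt(n) = 1/sqrt(30)
hist(means)                      # roughly bell shaped despite the skewed population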

If CLT is valid 1

This idea is used to make inferences on or from \(\mu\)

  • Confidence Intervals - If you know \(\mu\) and \(\sigma\), for a given \(n\), you can create confidence intervals with an Upper and a Lower bound to give an idea of where \(\bar{x}\) is going to be. \[ \text{Upper Bound} = \mu + t_{(1-\frac{\alpha}{2})}\frac{\sigma}{\sqrt{n}} \] \[ \text{Lower Bound} = \mu - t_{(1-\frac{\alpha}{2})}\frac{\sigma}{\sqrt{n}} \]

If CLT is valid 2

-If you do not know \(\mu\) and \(\sigma\), but instead know \(\bar{x}\) and \(s\) for a given \(n\), you can create confidence intervals with an Upper and a Lower bound to give an idea of where \(\mu\) is going to be.

\[ \text{Upper Bound} = \bar{x} + t_{(1-\frac{\alpha}{2})}\frac{s}{\sqrt{n}} \] \[ \text{Lower Bound} = \bar{x} - t_{(1-\frac{\alpha}{2})}\frac{s}{\sqrt{n}} \]
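
These bounds can be computed directly, or obtained from t.test, which applies the same formula. A sketch with a made-up sample:

x=rnorm(n=25,mean=50,sd=10)      # made-up sample
xbar=mean(x); s=sd(x); n=length(x)
xbar+c(-1,1)*qt(p=0.975,df=n-1)*s/sqrt(n) # Lower and Upper bounds by hand
t.test(x)$conf.int               # the same 95% interval from t.test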

If CLT is valid 3

  • Hypothesis tests can be formulated about \(\mu\) from \(\bar{x}\) and \(s\).

Demonstration of CLT

  • We will set up a 3 dimensional array.

  • First dimension will determine the distribution type for the random variable to be simulated.

  • Second dimension will assign the number of rows (sample size).

  • Third dimension will be the number of samples.

Demonstration 2

  • Populate the array with data

  • Create a matrix of sample means and standard deviations.

  • Create 95% confidence intervals around each sample mean.

  • Discuss whether C.L.T. holds using Boolean vectors.

Application 1

#Setting parameters
sample_size=10
num_samples=500
# Shape and Rate parameters mean is sh/rt
sh=2
rt=1
#Setting up the array described above
demo=array(dim=c(2,sample_size,num_samples))

#The first set of distributions to simulate. 
demo[1,,]=rgamma(n=sample_size*num_samples,shape=sh,rate=rt)
# Where summary statistics are going to be
summary_statistics=matrix(nrow=2,ncol=num_samples)
#Sample means
summary_statistics[1,]=colMeans(demo[1,,])
#Sample Standard deviation
summary_statistics[2,]=apply(demo[1,,],2,sd)

Application 2

#Upper Bound of the Confidence Intervals
Upper_Bound=summary_statistics[1,]+qt(p=0.975,df=sample_size-1)*summary_statistics[2,]/sqrt(sample_size)

#Lower Bound of the confidence intervals
Lower_Bound=summary_statistics[1,]-qt(p=0.975,df=sample_size-1)*summary_statistics[2,]/sqrt(sample_size)

Application 3

#Counting the booleans on the upper bounds being less than mu
sum(Upper_Bound<sh/rt)
[1] 34
#Counting the booleans on the lower bounds being greater than mu
sum(Lower_Bound>sh/rt)
[1] 3
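
  • Putting the two counts together gives the empirical coverage of the intervals: with the output above, \((500-34-3)/500 = 0.926\), a bit below the nominal 95%, which is not surprising for a skewed Gamma population with a sample size of only 10.

#Proportion of intervals that contain mu = sh/rt
sum(Lower_Bound<=sh/rt & Upper_Bound>=sh/rt)/num_samples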