Introduction to Distribution Theory

Probability theory

Bongani Ncube

2023-11-29


Probability concepts

The probability of an event \(E\) is the number of ways event \(E\) can occur divided by the total number of possible outcomes (assuming all outcomes are equally likely). We live in a world where decisions are made under conditions of uncertainty, so knowing the chance of a particular event occurring is important to aid decision making.

Probability laws

Here are three rules that come up all the time.

Non-mutually exclusive events

  • \(Pr(A \cup B) = Pr(A)+Pr(B) - Pr(A \cap B)\). This rule generalizes to \(Pr(A \cup B \cup C)=Pr(A)+Pr(B)+Pr(C)-Pr(A \cap B)-Pr(A \cap C)-Pr(B \cap C)+Pr(A \cap B \cap C)\).

Conditional probability

  • \(Pr(A|B) = \frac{Pr(A \cap B)}{Pr(B)}\)

  • If A and B are independent, \(Pr(A \cap B) = Pr(A)Pr(B)\), and \(Pr(A|B)=Pr(A)\). A quick simulation check of these rules is sketched below.
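These rules can be verified by simulation. The sketch below is not part of the original notes; the die-rolling events A and B are arbitrary illustrative choices, and each probability is estimated by a sample proportion.

set.seed(1)
rolls <- sample(1:6, size = 1e5, replace = TRUE)  # many rolls of a fair die
A <- rolls %% 2 == 0   # event A: the roll is even
B <- rolls >= 4        # event B: the roll is 4 or more

mean(A | B)                          # Pr(A or B), estimated
mean(A) + mean(B) - mean(A & B)      # addition rule: should match the line above
mean(A & B) / mean(B)                # conditional probability Pr(A | B) = Pr(A and B) / Pr(B)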

Discrete distributions

A discrete random variable \(X\) is described by its probability mass function \(f(x) = P(X = x)\). The set of \(x\) values for which \(f(x) > 0\) is called the support. If the distribution depends on unknown parameter(s) \(\theta\) we write it as \(f(x; \theta)\) (frequentist) or \(f(x | \theta)\) (Bayesian).

Bernoulli distribution

Context: A single trial with two outcomes, success/failure

\(X \sim \text{Bern}(p)\) with \(p\) the probability of a success

| \(x\) | \(P(X=x)\) |
|:---:|:---|
| 1 | \(p\) |
| 0 | \(1-p\) |

Example: \(X\) is a random variable indicating whether a newborn baby is female

[Figure: ten simulated Bernoulli trials with \(p=0.5\)]
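A minimal sketch of how such trials might be simulated in R (the seed is arbitrary): a Bernoulli variable is a binomial with size = 1, so rbinom() can be used.

set.seed(42)
x <- rbinom(n = 10, size = 1, prob = 0.5)  # ten Bernoulli(0.5) trials
x          # the ten outcomes (0 = failure, 1 = success)
mean(x)    # observed proportion of successes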

Summary: Bernoulli distribution

  • notation: \(X \sim \text{Bern}(p)\)
  • range: discrete, \(x = 0, 1\)
  • distribution: \(P(X=x) = p^x (1-p)^{1-x}\)
  • parameters: \(p\) is the probability of success
  • mean: \(E(X) = p\)
  • variance: \(Var(X) = p(1 - p)\)

Binomial distribution

If \(X\) is the count of successes in \(N\) identical and independent Bernoulli trials, each with success probability \(p\), then \(X\) is a random variable with a binomial distribution, \(X \sim \text{Bin}(N, p)\).

\[f(x; N, p) = \frac{N!}{x!(N-x)!} p^x (1-p)^{N-x} \hspace{1cm} x \in \{0, 1, \ldots, N\}, \hspace{2mm} p \in [0, 1]\]

Context: Total number of successes from a fixed number of independent Bernoulli trials, all with the same probability of success

\(X \sim \text{Bin}(N,p)\) with \(p\) the probability of success and \(N\) the number of trials

\[P(X=x) = {{N!}\over{x!(N-x)!}}p^x(1-p)^{N-x} = \binom{N}{x}p^x(1-p)^{N-x}\]

Example: \(X\) is the random variable counting the number of heads in a series of coin flips

Binomial distribution

\[P(X=x) = \binom{N}{x}p^x(1-p)^{N-x}\]

| \(x\) | \(P(X=x)\) |
|:---:|:---|
| 0 | \((1-p)^N\) |
| 1 | \(Np(1-p)^{N-1}\) |
| \(\vdots\) | \(\vdots\) |
| \(N\) | \(p^N\) |

Playing around with probabilities

Let’s say \(X \sim \text{Bin}(N=10,p=0.5)\) is a random variable counting the number of males. What is the probability of having at most 1 male?

  • \(P(X \leq 1) = P(X=0) + P(X=1)\)
  • How to compute this in R?
  • dbinom(x = 0, size = 10, prob = 0.5) + dbinom(x = 1, size = 10, prob = 0.5), as run in the snippet below
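Here is that calculation as a runnable snippet; the pbinom() call is an equivalent cumulative form not shown in the bullet above.

dbinom(x = 0, size = 10, prob = 0.5) + dbinom(x = 1, size = 10, prob = 0.5)
pbinom(q = 1, size = 10, prob = 0.5)   # P(X <= 1); both lines give 11/1024, about 0.011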

Question

  1. A blindfolded marksman finds that, on average, he hits the target 4 times out of 5. If he fires 4 shots, what is the probability of
     1. more than two hits?
     2. at least 3 misses?

Solution

Let \(X\) be the number of hits, so \(X \sim \text{Bin}(N=4,\ p=\tfrac{4}{5})\).

  1. More than two hits?

\[P(X>2)= P(X=3)+P(X=4)\]

or, using the complement,

\[P(X>2)=1-P(X \leq 2), \qquad P(X \leq 2)=P(X=0)+P(X=1)+P(X=2)\]

where \(P(X \leq 2)\) is equal to

\[\binom{4}{0}\left(\tfrac{4}{5}\right)^0\left(1-\tfrac{4}{5}\right)^{4-0}+\binom{4}{1}\left(\tfrac{4}{5}\right)^1\left(1-\tfrac{4}{5}\right)^{4-1}+\binom{4}{2}\left(\tfrac{4}{5}\right)^2\left(1-\tfrac{4}{5}\right)^{4-2}\]

choose(4,0)*(4/5)^0*(1-4/5)^4+
  choose(4,1)*(4/5)^1*(1-4/5)^(4-1)+
  choose(4,2)*(4/5)^2*(1-4/5)^(4-2)
## [1] 0.1808

so \(P(X>2) = 1 - 0.1808 = 0.8192\).
  2. At least 3 misses?

At least 3 misses out of 4 shots means at most 1 hit, so

\[P(\text{at least 3 misses}) = P(X \leq 1) = P(X=0)+P(X=1)\]

Fortunately, R has this pre-programmed as dbinom(x, size, prob):

dbinom(x = 0, size = 4, prob = 4/5) +
  dbinom(x = 1, size = 4, prob = 4/5)
## [1] 0.0272

Summary: Binomial distribution

  • notation: \(X \sim \text{Bin}(N,p)\)
  • range: discrete, \(0 \leq x \leq N\)
  • distribution: \(P(X=x) = \binom{N}{x}p^x (1-p)^{N-x}\)
  • parameters: \(p\) the probability of success, and \(N\) the number of trials
  • mean: \(Np\)
  • variance: \(Np(1-p)\)
  • in R: rbinom, dbinom (a quick simulation check follows below)
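A quick simulation check of these mean and variance formulas, with illustrative values \(N = 10\) and \(p = 0.3\) (chosen here, not taken from the notes):

set.seed(123)
x <- rbinom(n = 1e5, size = 10, prob = 0.3)
mean(x)   # should be close to N*p = 3
var(x)    # should be close to N*p*(1-p) = 2.1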

Poisson distribution

Context: Number of occurrences of an event over a given unit of space or time.

\(X \sim \text{Poisson}(\lambda)\) with \(\lambda\) the expected number of occurrences

\[P(X=x) = {{e^{-\lambda}\lambda^x}\over{x!}}\]

Example: \(X\) is the random variable number of birds counted on a colony during the breeding season

Poisson distribution

\[P(X=x) = {{e^{-\lambda}\lambda^x}\over{x!}}\]

| \(x\) | \(P(X=x)\) |
|:---:|:---|
| 0 | \(e^{-\lambda}\) |
| 1 | \(\lambda e^{-\lambda}\) |
| \(\vdots\) | \(\vdots\) |

Example: A small town’s police department issues 5 speeding tickets per month on average. Using a Poisson random variable, what is the likelihood that the police department issues 3 or fewer tickets in one month?

First, we note that here \(P(Y \le 3) = P(Y=0) + P(Y=1) + \cdots + P(Y=3)\). Applying the probability mass function for a Poisson distribution with \(\lambda = 5\), we find that

\[\begin{align*} P(Y \le 3) &= P(Y=0) + P(Y=1) + P(Y=2) + P(Y=3) \\ &= \frac{e^{-5}5^0}{0!} + \frac{e^{-5}5^1}{1!} + \frac{e^{-5}5^2}{2!} + \frac{e^{-5}5^3}{3!}\\ &\approx 0.265. \end{align*}\]

We can verify through R:

sum(dpois(0:3, lambda = 5))
## [1] 0.2650259
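Equivalently, the cumulative distribution function ppois() returns \(P(Y \le 3)\) directly:

ppois(q = 3, lambda = 5)
## [1] 0.2650259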

Therefore, there is about a 26.5% chance of 3 or fewer tickets being issued within one month.

[Figure: one hundred simulated Poisson counts with \(\lambda=1\)]
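A minimal sketch of such a simulation with rpois() (the seed is chosen arbitrarily here):

set.seed(2023)
y <- rpois(n = 100, lambda = 1)   # one hundred Poisson(1) draws
table(y)                          # frequency of each observed count
mean(y); var(y)                   # both should be roughly 1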

Summary: Poisson distribution

  • notation: \(X \sim \text{Poisson}(\lambda)\)
  • range: discrete, \(x \geq 0\)
  • distribution: \(P(X=x) = {{e^{-\lambda}\lambda^x}\over{x!}}\)
  • parameters: \(\lambda\) the rate or expected number per sample
  • mean: \(\lambda\)
  • variance: \(\lambda\)
  • in R: rpois, dpois

Continuous distributions

Normal (Gaussian) distribution

Context: Distribution of “adding lots of things together”. It derives from the Central Limit Theorem, which says that if you add a large number of independent samples from the same distribution, the distribution of the sum will be approximately normal.

\(X \sim \text{Normal}(\mu,\sigma^2)\) where \(\mu\) is the mean and \(\sigma^2\) the variance

\[f(x) = {{1}\over{\sqrt{2\pi\sigma^2}}}\exp\left( - {{(x-\mu)^2}\over{2\sigma^2}} \right)\]

Example: Practically everything.

Many quantities in nature are approximately normally distributed: think of exam scores at your university, the heights and weights of students, or the weights of newborn babies.

[Figure: the Normal probability density function]
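A minimal sketch for drawing this curve in base R with curve() and dnorm(), using the standard normal for illustration:

curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4,
      xlab = "x", ylab = "f(x)",
      main = "Normal probability density function")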

Summary: Normal distribution

  • notation: \(X \sim \text{N}(\mu,\sigma^2)\)
  • range: continuous, all real values
  • distribution: \(f(x) = {{1}\over{\sqrt{2\pi\sigma^2}}}\exp\left( - {{(x-\mu)^2}\over{2\sigma^2}} \right)\)
  • parameters: \(\mu\) the mean and \(\sigma\) the standard deviation
  • mean: \(\mu\)
  • variance: \(\sigma^2\)
  • in R: rnorm, dnorm

Example 1: The weight of a box of Fruity Tootie cereal is approximately normally distributed with an average weight of 15 ounces and a standard deviation of 0.5 ounces. What is the probability that the weight of a randomly selected box is more than 15.5 ounces?

Using a normal distribution,

\[\begin{align*} P(Y > 15.5) = \int_{15.5}^{\infty} \frac{e^{-(y-15)^2/ (2\cdot 0.5^2)}}{\sqrt{2\pi\cdot 0.5^2}}dy = 0.159 \end{align*}\]

However, the integral above is hard to work with by hand, hence the need to standardise and use tables.

  • standardising entails transforming Y into Z such that \[Z=\frac{Y-\mu}{\sigma}\]

Therefore we would calculate as follows:

\[P(Y > 15.5)=P(Z> \frac{15.5-15}{0.5})=P(Z> 1)\]

In high school or at varsity, the above would be looked up in statistical tables, using \[P(Z>1)=1-P(Z<1)=1-\Phi(1)\]

but it is easier to do in R using pnorm():

pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)
## [1] 0.1586553

We can use R with the original values as well:

pnorm(15.5, mean = 15, sd = 0.5, lower.tail = FALSE)
## [1] 0.1586553

There is about a 16% chance of a randomly selected box weighing more than 15.5 ounces.

Example 2: Suppose IQ scores are distributed \(X \sim N\left(100, 16^2\right)\). What is the probability that a randomly selected person’s IQ is less than 90?

In R:

pnorm(q = 90, mean = 100, sd = 16, lower.tail = TRUE)
## [1] 0.2659855

Example 3: What is the probability that a randomly selected person’s IQ is greater than 140?

In R:

pnorm(q = 140, mean = 100, sd = 16, lower.tail = FALSE)
## [1] 0.006209665

Example 4: What is the probability that a randomly selected person’s IQ is between 92 and 114?

In R:

pnorm(q = 114, mean = 100, sd = 16, lower.tail = TRUE) -
  pnorm(q = 92, mean = 100, sd = 16, lower.tail = TRUE)
## [1] 0.5006755

Why do we love the Normal distribution

  • It has nice properties, such as: if \(X \sim \text{N}(\mu,\sigma^2)\), then \(Z = \displaystyle{{{X - \mu}\over{\sigma}} \sim \text{N}(0,1)}\)

  • It is a limiting distribution (Central Limit Theorem)

  • It can be a good approximation for other distributions

Example: Approximating Binomial by Normal

By the central limit theorem (CLT), the binomial distribution \(X \sim \text{Bin}(n,p)\) approaches the normal distribution with mean \(\mu = n p\) and variance \(\sigma^2=np(1-p)\) as \(n \rightarrow \infty\). The approximation is useful when the expected numbers of successes and failures are both at least 5, i.e. \(np \geq 5\) and \(n(1-p) \geq 5\).

Exact binomial

pbinom(q = 460, size = 1000, prob = 0.50, lower.tail = TRUE)
## [1] 0.006222073

Normal approximation

pnorm(q = 460, mean = 0.50 * 1000, sd = sqrt(1000 * 0.50 * (1 - 0.50)), lower.tail = TRUE)
## [1] 0.005706018
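A continuity correction, evaluating the normal cdf at 460.5 rather than 460, usually moves the approximation closer to the exact binomial value (the commented value is approximate):

pnorm(q = 460.5, mean = 0.50 * 1000, sd = sqrt(1000 * 0.50 * (1 - 0.50)), lower.tail = TRUE)
# roughly 0.0062, closer to the exact binomial answer of 0.00622 above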

As another example, suppose

\(X \sim \text{Bin}(N=50,p=0.3)\)

Mean is \(Np = 50 \times 0.3 = 15\)

Variance is \(Np(1-p) = 50 \times 0.3 \times 0.7 = 10.5\)

Therefore, \(X\) can be approximated by \(Y \sim \text{N}(15,\ 10.5)\), i.e. a normal with standard deviation \(\sqrt{10.5}\).
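As a rough check of this approximation (the threshold 12 is an arbitrary illustrative choice), the exact and approximate cumulative probabilities can be compared:

pbinom(q = 12, size = 50, prob = 0.3)        # exact P(X <= 12)
pnorm(q = 12.5, mean = 15, sd = sqrt(10.5))  # normal approximation with continuity correction
# the two values should be reasonably close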

Review of some common random variables.
| Distribution Name | pmf / pdf | Parameters | Possible Y Values | Description |
|---|---|---|---|---|
| Binomial | \({n \choose y} p^y (1-p)^{n-y}\) | \(p,\ n\) | \(0, 1, \ldots , n\) | Number of successes after \(n\) trials. |
| Geometric | \((1-p)^yp\) | \(p\) | \(0, 1, \ldots, \infty\) | Number of failures until the first success. |
| Negative Binomial | \({y + r - 1\choose r-1} (1-p)^{y}p^r\) | \(p,\ r\) | \(0, 1, \ldots, \infty\) | Number of failures before \(r\) successes. |
| Hypergeometric | \({m \choose y}{N-m \choose n-y}\big/{N \choose n}\) | \(n,\ m,\ N\) | \(0, 1, \ldots , \min(m,n)\) | Number of successes after \(n\) trials without replacement. |
| Poisson | \({e^{-\lambda}\lambda^y}\big/{y!}\) | \(\lambda\) | \(0, 1, \ldots, \infty\) | Number of events in a fixed interval. |
| Exponential | \(\lambda e^{-\lambda y}\) | \(\lambda\) | \((0, \infty)\) | Wait time for one event in a Poisson process. |
| Gamma | \(\displaystyle\frac{\lambda^r}{\Gamma(r)} y^{r-1} e^{-\lambda y}\) | \(\lambda, \ r\) | \((0, \infty)\) | Wait time for \(r\) events in a Poisson process. |
| Normal | \(\displaystyle\frac{e^{-(y-\mu)^2/ (2 \sigma^2)}}{\sqrt{2\pi\sigma^2}}\) | \(\mu,\ \sigma\) | \((-\infty,\ \infty)\) | Used to model many naturally occurring phenomena. |
| Beta | \(\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} y^{\alpha-1} (1-y)^{\beta-1}\) | \(\alpha,\ \beta\) | \((0,\ 1)\) | Useful for modeling probabilities. |
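All of these distributions have d/p/q/r functions in base R. A few pmf/density calls are sketched below; the argument values are arbitrary, and some functions (e.g. dhyper) use a slightly different parameterisation from the table:

dgeom(x = 2, prob = 0.3)                 # geometric: failures before the first success
dnbinom(x = 3, size = 2, prob = 0.3)     # negative binomial: failures before 'size' successes
dhyper(x = 1, m = 5, n = 10, k = 4)      # hypergeometric: m successes and n failures in the population, k draws
dexp(x = 1.5, rate = 2)                  # exponential density
dgamma(x = 1.5, shape = 3, rate = 2)     # gamma density
dbeta(x = 0.4, shape1 = 2, shape2 = 5)   # beta density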