Introduction

We define statistical inference as the process of generating conclusions about a population from a noisy sample. Without statistical inference we simply live within our data. With statistical inference, we attempt to generate new knowledge. Knowledge and parsimony (using the simplest reasonable models to explain complex phenomena) go hand in hand. Probability models will serve as our parsimonious description of the world. Using probability models as the connection between our data and a population is an effective and widely used way to perform inference.

Probability

Probability forms the foundation for almost all treatments of statistical inference. In our treatment, probability is a law that assigns numbers to the long run occurrence of random phenomena after repeated unrelated realizations.

For our purposes, randomness is any process occurring without an apparent deterministic pattern. Thus we will treat many things as if they were random when, in fact, they are completely deterministic. We will treat probability as the long run proportion of times something occurs in repeated unrelated realizations, \[P(A) = \dfrac{\text{Number of occurrences of A}}{\text{Number of trials}}.\] An example is the proportion of times that you get a head when flipping a coin many times.
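The long run interpretation can be illustrated with a small simulation. The following is a minimal sketch rather than part of the original notes: it assumes a fair coin, uses Python's standard `random` module, and the function name `long_run_proportion` and the seed value are illustrative choices.

```python
import random


def long_run_proportion(n_flips, seed=0):
    """Proportion of heads in n_flips simulated flips of a fair coin."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# The proportion should drift toward 0.5 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, long_run_proportion(n))
```

With more flips the printed proportion should settle near 0.5, although any single run will differ slightly from that value.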

Kolmogorov’s Three Rules

Given a random experiment (like rolling a die) a probability measure is a population quantity that summarizes the randomness. The Russian mathematician Kolmogorov noted that to understand how probability should behave, only three rules were needed.

Consider an experiment with a random outcome. Probability takes a possible outcome from an experiment and:

1. assigns it a number between 0 and 1,
2. requires that the probability that something occurs is 1,
3. requires that the probability of the union of any two sets of outcomes that have nothing in common (mutually exclusive) is the sum of their respective probabilities.

From these simple rules all of the familiar rules of probability can be developed.

Two events are mutually exclusive if they cannot both occur simultaneously. For example, we cannot simultaneously get a 1 and a 2 on a die. Rule 3 says that since the events of getting a 1 and getting a 2 on a die are mutually exclusive, the probability of getting at least one of them (the union) is the sum of their probabilities. So if we know that the probability of getting a 1 is 1/6 and the probability of getting a 2 is 1/6, then the probability of getting a 1 or a 2 is 2/6.

There are some consequences of these rules.

As an example, if \(A\) is an event, define \(A^c\) as the complement (opposite) of the event. Since \(A\) and \(A^c\) are mutually exclusive and one of these events happens with certainty, we know that \(P(A) + P(A^c) = 1\). Thus, \[P(A^c) = 1-P(A).\]

Here are five consequences of Kolmogorov’s rules:

The probability that nothing occurs is 0. (\(P(\phi) = 0\))
The probability that something occurs is 1. (\(P(S) = 1\))
The probability of something is 1 minus the probability that the opposite occurs. (\(P(A^c) = 1-P(A)\))
The probability of at least one of two (or more) things that can not simultaneously occur (mutually exclusive) is the sum of their respective probabilities. (\(P(A) + P(B) = P(A \cup B)\) if \(A \cap B = \phi\))
For any two events the probability that at least one occurs is the sum of their probabilities minus the probability of their intersection. (\(P(A \cup B) = P(A) + P(B) - P(A \cap B)\))
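These consequences can be checked by direct counting on a single fair die. The sketch below is an illustration under that assumption (six equally likely faces); the helper `prob` and the particular events `A` and `B` are hypothetical choices, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # faces of one fair six-sided die


def prob(event):
    """Probability of an event (a set of faces) when all faces are equally likely."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


A = {1, 2}                # roll a 1 or a 2
B = {2, 4, 6}             # roll an even number
A_c = SAMPLE_SPACE - A    # complement of A

print(prob(set()) == 0)                                   # nothing occurs
print(prob(SAMPLE_SPACE) == 1)                            # something occurs
print(prob(A_c) == 1 - prob(A))                           # complement rule
print(prob({1} | {2}) == prob({1}) + prob({2}))           # mutually exclusive events
print(prob(A | B) == prob(A) + prob(B) - prob(A & B))     # general union rule
```

Each line should print `True`; in the last one, subtracting the intersection corrects for the face 2 being counted in both `A` and `B`.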

Random Variables

A random variable is a numerical outcome of an experiment. The random variables that we study are either discrete or continuous. Discrete random variables take on only a countable number of possibilities. Mass functions assign probabilities to the specific values they take. Continuous random variables can conceptually take any value on the real line or some subset of the real line, and we talk about the probability that they lie within some range. Densities characterize these probabilities.

Some examples that could be considered random variables include:

Gambling experiments like the tossing of a coin and the rolling of a die produce random variables. For the coin, we typically code a tail as a 0 and a head as a 1. (For the die, the number facing up would be the random variable.)
The number of web hits for a site each day is a random variable. This variable is a count, but is largely unbounded. Random variables like this are often modeled with the so called Poisson distribution.
Lengths and weights are continuous random variables. It is mathematically convenient to model these as if they were continuous (even if measurements were truncated liberally). In fact, even discrete random variables with lots of levels are often treated as continuous for convenience.

For all of these kinds of random variables, we need convenient mathematical functions to model the probabilities of collections of realizations. These functions, called mass functions and densities, take possible values of the random variables, and assign the associated probabilities. These entities describe the population of interest.

For example, consider the normal distribution. Saying that body mass indices follow a normal distribution is a statement about the population of interest. The goal is to use our data to determine properties of that normal distribution such as where it is centered and how spread out it is.

Probability Mass Functions

A probability mass function (pmf) evaluated at a value corresponds to the probability that a random variable takes that value. To be a valid pmf a function \(p\) must satisfy:

It must always be larger than or equal to 0. (\(p(x) \geq 0\))
Its values, summed over all possible values that the random variable can take, must add up to one. (\(p(x_1) + p(x_2) + \cdots = 1\))
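The two conditions translate directly into a check. This is a minimal sketch assuming the pmf is stored as a dictionary from values to probabilities; the function name `is_valid_pmf` and the tolerance are illustrative, not taken from the notes.

```python
import math


def is_valid_pmf(pmf, tol=1e-12):
    """Check both pmf conditions for a dict mapping value -> probability."""
    nonnegative = all(p >= 0 for p in pmf.values())
    # Compare with a tolerance because floating point sums are rarely exact.
    sums_to_one = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return nonnegative and sums_to_one


fair_die = {face: 1 / 6 for face in range(1, 7)}
print(is_valid_pmf(fair_die))            # satisfies both conditions
print(is_valid_pmf({0: 0.7, 1: 0.7}))    # fails: probabilities sum to 1.4
print(is_valid_pmf({0: -0.5, 1: 1.5}))   # fails: negative probability
```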

Probability Density Functions

A probability density function (pdf) is a function associated with a continuous random variable. Areas under a pdf correspond to probabilities for that random variable. Therefore, when one says that intelligence quotients (IQ) in a population follow a bell curve, they are saying that the probability of a randomly selected person from this population having an IQ between two values is given by the area under the bell curve between those two values.

Not every function can be a valid probability density function. Specifically, to be a valid pdf, a function must satisfy two properties:

It must be larger than or equal to zero everywhere.
The total area under it must be one.

Cumulative Distribution Functions

Certain areas of PDFs and PMFs are so useful that we give them names. The cumulative distribution function (CDF) of a random variable, \(X\), returns the probability that the random variable is less than or equal to the value \(x\). We write the distribution function \(F\) as \[F(x) = P(X \leq x).\] This definition holds for both discrete and continuous random variables.

The survival function of a random variable \(X\) is the probability that the random variable is greater than the value \(x\), \[S(x) = P(X > x).\]
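For a discrete random variable both functions are sums over the pmf. The sketch below assumes a fair six-sided die as the random variable; the names `die_pmf`, `cdf` and `survival` are illustrative only.

```python
from fractions import Fraction

# pmf of a fair six-sided die: each face has probability 1/6
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def cdf(x, pmf=die_pmf):
    """F(x) = P(X <= x): add up the pmf over all values not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))


def survival(x, pmf=die_pmf):
    """S(x) = P(X > x) = 1 - F(x)."""
    return 1 - cdf(x, pmf)


print(cdf(4))        # P(X <= 4) = 2/3
print(survival(4))   # P(X > 4)  = 1/3
```

Since the events \(X \leq x\) and \(X > x\) are complements, \(S(x) = 1 - F(x)\) for every \(x\).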

Conditional Probability

When we are given information about a random variable, it changes the probabilities associated with it. For example, the probability of getting a one when rolling a die is 1/6. However, if you also knew that the die roll was an odd number, then conditional on this new information, the probability changes to 1/3.

We now formalize the definition of conditional probability. Let \(B\) be an event with \(P(B)>0\). Then the conditional probability of an event \(A\) given that \(B\) has occurred is: \[P(A|B) = \dfrac{P(A \cap B)}{P(B)}.\] Now, if \(A\) and \(B\) are independent events, then \(P(A \cap B) = P(A)P(B)\), and so \(P(A|B) = P(A)\). In other words, if \(A\) and \(B\) are independent, the probability of \(A\) given \(B\) is the probability of \(A\) (since \(A\) has no dependence on \(B\)).
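The die example can be reproduced from the definition. This is a small sketch assuming a fair six-sided die; `prob` and `conditional` are hypothetical helper names.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # faces of one fair six-sided die


def prob(event):
    """Probability of an event (a set of faces) when all faces are equally likely."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


def conditional(a, b):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("conditioning event must have positive probability")
    return prob(a & b) / prob(b)


one = {1}
odd = {1, 3, 5}
print(prob(one))               # 1/6 before any information
print(conditional(one, odd))   # 1/3 once we know the roll is odd
```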

Bayes’ Theorem

Bayes’ rule allows us to reverse the conditioning set, provided that we know some marginal probabilities.

As an example, it is relatively easy for physicians to calculate the probability that a diagnostic test is positive for people with lung cancer and negative for people without. They could apply the test to several people who are already known to have the disease and, conversely, to people known not to have the disease. However, for people who have received a test result, the reverse probabilities are of more interest: “given a positive test, what is the probability of having the disease?” and “given a negative test, what is the probability of not having the disease?” Bayes’ rule allows us to switch the conditioning event, provided we have a little bit of extra information.

Formally Bayes’ theorem states: \[P(B|A) = \dfrac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B^c)P(B^c)}.\]

Example

Diagnostic tests are a good example of Bayes’ rule. Let + and − be the events that the result of a diagnostic test is positive or negative respectively. Let \(D\) and \(D^c\) be the event that the subject of the test has or does not have the disease respectively.

Sensitivity is the probability that the test is positive given that the subject actually has the disease, \(P(+ | D)\).

Specificity is the probability that the test is negative given that the subject does not have the disease, \(P(− | D^c)\).

Conceptually, the sensitivity and specificity are straightforward to estimate. Take people known to have and not have the disease and apply the diagnostic test to them.

The positive predictive value is the probability that the subject has the disease given that the test is positive, \(P(D | +)\).

The negative predictive value is the probability that the subject does not have the disease given that the test is negative, \(P(D^c| −)\).

Finally, we need one last thing, the prevalence of the disease - which is the marginal probability of disease, \(P(D)\).

QUESTION:

A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%. Suppose that a subject, from a population with a 0.1% prevalence of HIV, receives a positive test result. What is the positive predictive value?

Mathematically, we want \(P(D | +)\) given the sensitivity, \(P(+ | D) = .997\), the specificity, \(P(- | D^c) = .985\), and the prevalence, \(P(D) = .001\). Therefore, \[\begin{aligned} P(D | +) &= \dfrac{P(+ | D)P(D)}{P(+ | D)P(D) + P(+ | D^c)P(D^c)}\\ &= \dfrac{P(+ | D)P(D)}{P(+ | D)P(D) + \{1 - P(- | D^c)\}\{1 - P(D)\}}\\ &= \dfrac{.997 \times .001}{.997 \times .001 + .015 \times .999}\\ &= .062. \end{aligned}\]

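In this population a positive test result suggests only about a 6% probability that the subject has the disease (the positive predictive value is about 6% for this test). The value is low because the disease is rare: the false positives among the many people without the disease outnumber the true positives.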
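The same arithmetic can be written as a short function. This is a sketch of Bayes' rule applied to a diagnostic test; the function names and argument names are illustrative, and the inputs are the sensitivity, specificity and prevalence quoted in the question.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(D | +) from Bayes' rule."""
    true_positive = sensitivity * prevalence                # P(+ | D) P(D)
    false_positive = (1 - specificity) * (1 - prevalence)   # P(+ | D^c) P(D^c)
    return true_positive / (true_positive + false_positive)


def negative_predictive_value(sensitivity, specificity, prevalence):
    """P(D^c | -) from Bayes' rule."""
    true_negative = specificity * (1 - prevalence)          # P(- | D^c) P(D^c)
    false_negative = (1 - sensitivity) * prevalence         # P(- | D) P(D)
    return true_negative / (true_negative + false_negative)


ppv = positive_predictive_value(0.997, 0.985, 0.001)
print(round(ppv, 3))  # 0.062
```

Calling the function again with a larger prevalence shows how strongly the positive predictive value depends on how common the disease is in the population being tested.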

Independence

Statistical independence of events is the idea that the events are unrelated. Two events \(A\) and \(B\) are independent if \(P(A \cap B) = P(A)P(B)\) or equivalently if \(P(A | B) = P(A)\). Note that if \(A\) is independent of \(B\) we know that \(A^c\) is independent of \(B^c\).
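Independence can be checked by comparing \(P(A \cap B)\) with \(P(A)P(B)\). The sketch below assumes a fair six-sided die; the events `A` and `B` and the helper names are hypothetical choices made for illustration.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # faces of one fair six-sided die


def prob(event):
    """Probability of an event (a set of faces) when all faces are equally likely."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


def independent(a, b):
    """True when P(A and B) equals P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)


A = {1, 2}                 # P(A) = 1/3
B = {2, 4, 6}              # P(B) = 1/2
A_c = SAMPLE_SPACE - A
B_c = SAMPLE_SPACE - B

print(independent(A, B))       # True: P(A and B) = 1/6 = (1/3)(1/2)
print(independent(A_c, B_c))   # True: the complements are independent as well
print(independent({1}, {2}))   # False: mutually exclusive events are not independent
```

The last line is a reminder that mutually exclusive events with positive probability are never independent, since knowing that one occurred rules out the other.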

Common Distributions

Discrete Distributions

The probability distribution for a random variable describes the range and relative likelihood of possible values for a random variable. For a discrete random variable \(x\), the probability distribution is defined by a probability mass function, denoted by \(f(x)\). The probability mass function provides the probability for each value of the random variable.

Discrete Uniform Probability Distribution

When the probability distribution contains only a finite number of outcomes and the possible values of the probability mass function, \(f(x)\), are all equal, then the probability distribution is a discrete uniform probability distribution. The general form of the probability mass function for a discrete uniform probability distribution is \(f(x) = 1/n\), where \(n\) is the number of values the random variable may assume. A fair die is an example, with \(n = 6\) and \(f(x) = 1/6\).

The Bernoulli Distribution

The Bernoulli distribution arises as the result of an experiment with a binary outcome. Thus, Bernoulli random variables take only the values 1 (success) and 0 (failure), with probabilities \(p\) and \(1 - p\) respectively. The PMF for a Bernoulli random variable \(X\) is \[P(X = x) = p^x (1 - p)^{1-x}\] for \(x \in \{0, 1\}\). The mean of a Bernoulli random variable is \(p\) and the variance is \(p(1 - p)\).
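The stated mean and variance follow from summing over the two possible values. The following sketch checks this numerically for one illustrative success probability, \(p = 0.3\), which is an assumption made for the example only.

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p^x (1 - p)^(1 - x) for x in {0, 1}, and 0 otherwise."""
    if x not in (0, 1):
        return 0.0
    return p**x * (1 - p) ** (1 - x)


p = 0.3  # illustrative success probability
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))

print(mean, p)                # mean should match p
print(variance, p * (1 - p))  # variance should match p(1 - p), up to rounding
```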

Bernoulli random variables are used for modeling binary traits for a random sample. For example, in a random sample whether or not a participant has high blood pressure would be reasonably modeled as Bernoulli.

The Binomial Distribution

Binomial random variables are obtained as the sum of \(n\) independent Bernoulli trials that share the same success probability \(p\). So if a Bernoulli trial is the result of a coin flip, a binomial random variable is the total number of heads.

Mathematically, let \(X_1, \ldots , X_n\) be \(n\) independent Bernoulli trials with success probability \(p\); then \(X = \sum_{i=1}^n X_i\) is a binomial random variable. The binomial mass function is \[P(X = x) =\binom{n}{x}p^x(1-p)^{n-x}\] for \(x = 0, 1, \ldots, n\).
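The mass function can be evaluated with the standard library. This is a minimal sketch using `math.comb` for the binomial coefficient; the example of 10 flips of a fair coin is an illustrative assumption.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for the number of successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.5  # illustrative: 10 flips of a fair coin
print(binomial_pmf(5, n, p))                             # probability of exactly 5 heads
print(sum(binomial_pmf(x, n, p) for x in range(n + 1)))  # the pmf sums to 1
```

For these values the probability of exactly 5 heads is \(\binom{10}{5}/2^{10} = 252/1024 \approx 0.246\).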

The Poisson Distribution

The Poisson distribution is used to model counts. It is especially useful for modeling unbounded counts or counts per unit of time (rates). When \(n\) is large and \(p\) is small, the Poisson distribution with \(\lambda = np\) is an accurate approximation to the binomial distribution. The Poisson mass function is given by \[P(X = x; \lambda) = \dfrac{\lambda^x e^{-\lambda}}{x!},\] for \(x = 0, 1,\ldots\). The mean of this distribution is \(\lambda\), and the variance is also \(\lambda\). Because \(x\) ranges over all nonnegative integers, there is no upper limit on the counts the distribution can describe.
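The quality of the binomial approximation can be examined numerically. The sketch below compares the two mass functions for the illustrative values \(n = 500\) and \(p = 0.01\), so that \(\lambda = np = 5\); these numbers are assumptions chosen for the example.

```python
from math import comb, exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) = lam^x e^(-lam) / x! for x = 0, 1, 2, ..."""
    return lam**x * exp(-lam) / factorial(x)


def binomial_pmf(x, n, p):
    """P(X = x) for the number of successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 500, 0.01   # large n, small p (illustrative values)
lam = n * p        # matching Poisson rate

for x in range(0, 11):
    print(x, round(binomial_pmf(x, n, p), 4), round(poisson_pmf(x, lam), 4))

# Largest pointwise gap between the two mass functions over x = 0, ..., 20
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(21))
print(max_gap)
```

The two columns should agree closely; repeating the comparison with a larger \(p\) (for the same \(\lambda\)) should show the approximation getting worse.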

Continuous Probability Distributions

In this section we consider continuous random variables. Specifically, we discuss some of the more useful continuous probability distributions for analytics models: the uniform, the triangular, the normal, and the exponential.

A fundamental difference separates discrete and continuous random variables in terms of how probabilities are computed. For a discrete random variable, the probability mass function \(f(x)\) provides the probability that the random variable assumes a particular value. With continuous random variables, the counterpart of the probability mass function is the probability density function, also denoted by \(f(x)\). The difference is that the probability density function does not directly provide probabilities. However, the area under the graph of \(f(x)\) corresponding to a given interval does provide the probability that the continuous random variable \(x\) assumes a value in that interval. So when we compute probabilities for continuous random variables, we are computing the probability that the random variable assumes any value in an interval. Because the area under the graph of \(f(x)\) at any particular point is zero, one of the implications of the definition of probability for continuous random variables is that the probability of any particular value of the random variable is zero.

Uniform Probability Distribution

The uniform probability density function for a random variable \(x\) is defined by the following formula: \(f(x) = 1/(b-a)\) for \(a \leq x \leq b\), and \(f(x) = 0\) elsewhere, where \(a\) and \(b\) are the endpoints of the interval.

For a continuous random variable, we consider probability only in terms of the likelihood that a random variable assumes a value within a specified interval.

The calculation of the expected value and variance for a continuous random variable is analogous to that for a discrete random variable. However, because the computational procedure involves integral calculus, we do not show the calculations here. For the uniform continuous probability distribution, the formulas for the expected value and variance are as follows: \[E(x) = \dfrac{a+b}{2} \; \hspace{1in} \; Var(x) = \dfrac{(b-a)^2}{12}.\] In these formulas, \(a\) is the minimum value and \(b\) is the maximum value that the random variable may assume. The standard deviation can be found by taking the square root of the variance.
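These formulas, together with the area interpretation of probability, are easy to evaluate. The sketch below uses hypothetical endpoints \(a = 0\) and \(b = 12\); the function names are illustrative.

```python
import math


def uniform_summary(a, b):
    """Mean, variance and standard deviation of a uniform distribution on [a, b]."""
    mean = (a + b) / 2
    variance = (b - a) ** 2 / 12
    return mean, variance, math.sqrt(variance)


def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi): width of the overlap with [a, b] times the height 1/(b - a)."""
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (b - a)


a, b = 0, 12  # hypothetical endpoints
print(uniform_summary(a, b))     # mean 6.0, variance 12.0, sd sqrt(12)
print(uniform_prob(3, 6, a, b))  # 0.25
```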

Triangular Probability Distribution

The triangular probability distribution is useful when only subjective probability estimates are available. There are many situations for which we do not have sufficient data and only subjective estimates of possible values are available. In the triangular probability distribution, we need only to specify the minimum possible value \(a\), the maximum possible value \(b\), and the most likely value (or mode) of the distribution \(m\). If these values can be knowledgeably estimated for a continuous random variable by a subject-matter expert, then as an approximation of the actual probability density function, we can assume that the triangular distribution applies.

Consider a situation in which a project manager is attempting to estimate the time that will be required to complete an initial assessment of the capital project of constructing a new corporate headquarters. There is considerable uncertainty regarding the duration of these tasks, and generally little or no historical data are available to help estimate the probability distribution for the time required for this assessment process.

From expert opinions and our own experience, we estimate that the minimum required time for the initial assessment phase is 6 months and that the worst-case estimate is that this phase could require 24 months. The consensus is that the most likely amount of time required for the initial assessment phase of the project is 12 months. From these estimates, we can use a triangular distribution as an approximation for the probability density function for the time required for the initial assessment phase of constructing a new corporate headquarters.

The general form of the triangular probability density function is as follows: \[f(x) = \begin{cases}\dfrac{2(x-a)}{(b-a)(m-a)} & a \leq x \leq m\\ \dfrac{2(b-x)}{(b-a)(b-m)} & m < x \leq b \end{cases}.\]

We can calculate the probability that the time required is less than 12 months by finding the area under the graph of \(f(x)\) from 6 to 12. The geometry required to find this area for any given value is slightly more complex than that required to find the area for a uniform distribution, but the resulting formula for a triangular distribution is relatively simple: \[P(x \leq x_0) = \begin{cases}\dfrac{(x_0-a)^2}{(b-a)(m-a)} & a \leq x_0 \leq m\\ 1- \dfrac{(b-x_0)^2}{(b-a)(b-m)} & m < x_0 \leq b \end{cases}.\] This equation provides the cumulative probability of obtaining a value for a triangular random variable of less than or equal to some specific value denoted by \(x_0\).
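The cumulative formula can be applied to the project example with \(a = 6\), \(m = 12\) and \(b = 24\) months. The following is a minimal sketch; the function name `triangular_cdf` is illustrative, and returning 0 below \(a\) and 1 above \(b\) is an added convention so that the function is defined for every input.

```python
def triangular_cdf(x0, a, m, b):
    """P(x <= x0) for a triangular distribution with minimum a, mode m, maximum b."""
    if x0 <= a:
        return 0.0
    if x0 >= b:
        return 1.0
    if x0 <= m:
        return (x0 - a) ** 2 / ((b - a) * (m - a))
    return 1 - (b - x0) ** 2 / ((b - a) * (b - m))


a, m, b = 6, 12, 24  # months, from the initial assessment example
print(triangular_cdf(12, a, m, b))  # 36 / (18 * 6) = 1/3
print(triangular_cdf(18, a, m, b))  # 1 - 36 / (18 * 12) = 5/6
```

So under these estimates there is roughly a one in three chance that the initial assessment takes 12 months or less.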

The Normal Distribution

The normal distribution is one of the most recognized distributions in statistics. As we will see, sample means approximately follow normal distributions for large sample sizes. The normal distribution requires two numbers to characterize it. Specifically, a random variable is said to follow a normal distribution with mean \(\mu\) and variance \(\sigma^2\) if the associated density is \[f(x)= \dfrac{1}{\sigma\sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.\]

If \(X\) is a random variable with this density then \(E[X] = \mu\) and \(Var(X) = \sigma^2\). That is, the normal distribution is characterized by the mean and variance. When \(\mu = 0\) and \(\sigma = 1\), the resulting distribution is called the standard normal distribution. Standard normal random variables are often labeled \(Z\).

Taken another way, if we know that the population is normally distributed then to estimate everything about the population, we need only estimate the population mean and variance.

For the normal distribution, it is useful to remember reference probabilities and quantiles. The most relevant probabilities are given by:

Approximately 68%, 95% and 99.7% of the normal density lie within 1, 2 and 3 standard deviations from the mean, respectively.
-1.28, -1.645, -1.96 and -2.33 are the 10th, 5th, 2.5th and 1st percentiles of the standard normal distribution, respectively.
By symmetry, 1.28, 1.645, 1.96 and 2.33 are the 90th, 95th, 97.5th and 99th percentiles of the standard normal distribution, respectively.

Since the normal distribution is characterized by only the mean and variance, which are a shift and a scale, we can transform normal random variables to be standard normals and vice versa. This means that if \(X\) is normal with mean \(\mu\) and standard deviation \(\sigma\) and we scale it using \[Z = \dfrac{X - \mu}{\sigma},\] then \(Z\) is standard normal, with mean 0 and standard deviation 1. In the other direction, \(X = \mu + \sigma Z\).
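The reference probabilities, the quantiles and the standardization can all be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch only; the IQ scale with mean 100 and standard deviation 15 is a hypothetical example, not a value taken from the notes.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)  # standard normal


def prob_within(k, dist=Z):
    """Probability of falling within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)


def standardize(x, mu, sigma):
    """Z = (X - mu) / sigma."""
    return (x - mu) / sigma


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))   # roughly 0.68, 0.95 and 0.997

for q in (0.90, 0.95, 0.975, 0.99):
    print(q, round(Z.inv_cdf(q), 3))     # roughly 1.28, 1.645, 1.96 and 2.33

# Hypothetical scale: mean 100, standard deviation 15
iq = NormalDist(mu=100, sigma=15)
z = standardize(130, 100, 15)            # 130 is 2 standard deviations above the mean
print(iq.cdf(130), Z.cdf(z))             # the two probabilities agree
```

The last line shows why standardizing is useful: a probability for any normal random variable can be read off the standard normal distribution.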

