The Logic of Inference

And a Brief Review

Christopher Weber

2025-09-01

Introduction

  • So many techniques
  • Parametric, Non-Parametric, Semi-Parametric, Frequentist, Bayesian, etc.
  • So many assumptions
  • Gauss-Markov, IID, Normality, Homoskedasticity, Multicollinearity, Autocorrelation, Distributions, etc.
  • So many ways to get it wrong!
  • Type I, II, III errors, Overfitting, Underfitting, Omitted Variable Bias, etc.

Political Ideology

  • Repeated observation
  • Measurement error.
  • Political Ideology, repeated measures
  • \(y_{i,j}=\lambda_j x_{Ideology,i}+d_j\)
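  • A minimal sketch of this measurement model in R (the loadings, intercepts, and error scale below are illustrative assumptions, with measurement error added explicitly):

# Sketch: simulate repeated, error-prone measures of a latent ideology score
set.seed(42)
n <- 500
ideology <- rnorm(n)                # latent x_{Ideology,i}
lambda <- c(0.9, 0.7, 0.5)          # hypothetical loadings
d <- c(0.0, 0.2, -0.1)              # hypothetical intercepts
y <- sapply(1:3, function(j) lambda[j] * ideology + d[j] + rnorm(n, sd = 0.5))
colnames(y) <- paste0("item", 1:3)
round(cor(y), 2)                    # items correlate through the shared latent trait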


A Spatial Model

  • An individual \(i\)’s score on the \(j\)th variable is a linear expression of one’s underlying ideology and error.
  • Assume this individual prefers A to B, since A is closer. Trace the expected utility.

A Spatial Model

Comparisons

  • Candidate A: \(-(x_{A}-\theta)^2\)
  • Candidate B: \(-(x_{B}-\theta)^2\)
  • \[ p(Y=A)=[-(x_{A}-\theta)^2+e_1] - [-(x_{B}-\theta)^2+e_2] \] \[ \begin{align} p(Y=A) &= (x_B-\theta)^2 - (x_A-\theta)^2 + e_1 - e_2 \nonumber \\ &= x_B^2 - x_A^2 + 2(x_A - x_B)\theta + e_1 - e_2 \\ &= \alpha + \beta\theta + \epsilon \end{align} \]

A Spatial Model

Comparisons

  • The quadratic utility model reduces to a linear probability model or a logit model; a brief simulation sketch follows. \[ \begin{align} p(Y=A) &= (x_B-\theta)^2 - (x_A-\theta)^2 + e_1 - e_2 \nonumber \\ &= x_B^2 - x_A^2 + 2(x_A - x_B)\theta + e_1 - e_2 \\ &= \alpha + \beta\theta + \epsilon \end{align} \]
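  • As a rough illustration of that reduction (not from the slides; the candidate positions and noise scale are assumptions), simulate quadratic utilities and fit a logit on \(\theta\):

# Sketch: quadratic spatial utilities imply a choice model that is linear in theta
set.seed(1)
n <- 2000
theta <- rnorm(n)                     # voter ideal points
x_A <- -1; x_B <- 1                   # hypothetical candidate positions
u_A <- -(x_A - theta)^2 + rnorm(n)    # noisy utility for A
u_B <- -(x_B - theta)^2 + rnorm(n)    # noisy utility for B
vote_A <- as.numeric(u_A > u_B)       # choose A when its utility is higher
coef(glm(vote_A ~ theta, family = binomial))
# The slope is negative here because 2 * (x_A - x_B) < 0 (scaled by the error variance)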

The Directed Acyclic Graph

  • Suppose we wanted to find the joint probability of these five variables; for convenience, \(p(A, B, C, D, E)\)
  • Knowing nothing else, and supposing each variable can take five values, there are \(5^5=3125\) possible combinations.
  • But, if we know the causal structure, we can reduce the number of combinations.

The DAG

\[P(A,B,C,D,E) = P(A) \times P(B|A) \times P(C|B) \times P(D|C) \times P(E|D)\]

The DAG

  • A more complicated example, where \(A\) causes both \(B\) and \(C\), which both cause \(D\), which causes \(E\).

\[P(A,B,C,D,E) = P(A) \times P(B|A) \times P(C|A) \times P(D|B,C) \times P(E|D)\]

Probabilities

  • The joint probability is the product of the conditional probabilities.
  • The marginal probability is the sum of the joint probabilities over all other variables (a small numerical sketch follows below). \[p(x)=\sum_y p(x, y) \hspace{0.5in} \texttt{Marginal Probability}\] \[ p(E) = \sum_A \sum_B \sum_C \sum_D p(E|D)\,p(D|C)\,p(C|B)\,p(B|A)\,p(A) \] \[p(x)=\int p(x, y)\, dy \hspace{0.5in} \texttt{Marginal Probability}\] \[p(x|y)=p(x,y)/p(y)\hspace{0.5in} \texttt{Conditional Probability}\]
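  • The small numerical sketch, using hypothetical conditional probability tables for binary variables along the chain \(A \rightarrow B \rightarrow C \rightarrow D \rightarrow E\):

# Sketch: compute p(E) by summing the joint over A, B, C, D
p_A <- c(0.6, 0.4)                        # P(A = 0), P(A = 1)
cond <- matrix(c(0.8, 0.2,                # P(child = 0, 1 | parent = 0)
                 0.3, 0.7),               # P(child = 0, 1 | parent = 1)
               nrow = 2, byrow = TRUE)    # reused for B|A, C|B, D|C, E|D
p_E <- c(0, 0)
for (A in 1:2) for (B in 1:2) for (C in 1:2) for (D in 1:2) for (E in 1:2) {
  p_E[E] <- p_E[E] + p_A[A] * cond[A, B] * cond[B, C] * cond[C, D] * cond[D, E]
}
p_E        # a proper distribution: the two entries sum to 1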

Independence

  • Because there is no connected path from \(z\) to \(y\), \(z\) is independent of \(y\) given \(x\). \[z \perp y \mid x\]

Independence

  • \(u\) is a fork: it has an effect on two variables, \(x\) and \(y\)
  • This creates confounding

Conditional Independence: \(x \not\perp y\)

  • Conditioning on the fork removes the association: \(x \perp y \mid u\) (a brief simulation follows).
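  • A brief simulation of the fork (all coefficients here are illustrative):

# Sketch: u -> x and u -> y; x and y correlate marginally but not given u
set.seed(7)
n <- 5000
u <- rnorm(n)
x <- 0.8 * u + rnorm(n)
y <- 0.8 * u + rnorm(n)
coef(lm(y ~ x))["x"]        # marginal association, driven by the fork u
coef(lm(y ~ x + u))["x"]    # near zero once u is conditioned on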

An Example

Cancer and Smoking

An Old, Persistent Critique

\[smoking \perp cancer \mid gene \]

D-Connection and Separation

Colliders and Mediation

The Problem:

  • This is common in the social sciences and forms the basis for mediation analysis
  • Notice that \(z\) can affect \(m\) in two ways:
      1. Direct path: \(z \rightarrow m\)
      2. Indirect path: \(z \rightarrow y \rightarrow m\)

The Collider Issue:

  • \(y\) is a collider (both \(x\) and \(z\) point into it)
  • If we estimate \(m = \beta_1 z + \beta_2 y + \epsilon\), conditioning on the collider \(y\) opens an unintended backdoor path between \(z\) and \(m\)
  • This creates a spurious association between \(y\) and \(m\) – endogeneity (see the simulation sketch below)
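A minimal sketch of the collider issue, under an assumed structure in which \(y\) has no true effect on \(m\) and an additional path \(x \rightarrow m\) supplies the backdoor (all coefficients are illustrative):

# Sketch: conditioning on the collider y distorts the z coefficient and
# produces a spurious association between y and m
set.seed(99)
n <- 5000
z <- rnorm(n)
x <- rnorm(n)                        # not included in the regressions below
y <- 0.7 * z + 0.7 * x + rnorm(n)    # y is a collider of z and x
m <- 0.5 * z + 0.5 * x + rnorm(n)    # m has no true y effect
coef(lm(m ~ z))["z"]                 # roughly 0.5, as built in
coef(lm(m ~ z + y))[c("z", "y")]     # z is biased; y picks up a spurious "effect"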

Negative Advertising

  • Advertising and Turnout (Lau and Pomper 2004; Ansolabehere, Iyengar and Simon 1999)

Inference

  • The Frequentist and Bayesian traditions.
  • The Frequentist tradition of probability: Jerzy Neyman, Karl Pearson and Ronald Fisher, among others.
  • Probability refers to the frequency of an event occurring in many repeated trials.
  • Frequentism is actually quite intuitive when applied to circumstances in which we can observe repeated trials.
  • Rolling a die: in the long run, 1/6 of the trials will yield a 1.
  • We can also calculate probabilities for compound events. If each trial involves rolling two dice, should we bet that at least one of ten trials will yield two 1s?
  • The probability of observing two ones is 1/6 \(\times\) 1/6 \(=\) 1/36, which implies the rest of the sample space (i.e., not rolling two ones) is 35/36. Since the trials are independent, the probability that two ones are never observed in 10 trials is \((35/36)^{10} \approx 0.75\) – thus one should probably not bet on “snake eyes” (a quick check in code follows).
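  • A quick check of that arithmetic in R:

# Probability of never rolling two ones ("snake eyes") across 10 independent trials
p_none <- (35/36)^10
p_none        # ~0.754
1 - p_none    # ~0.246 chance of at least one snake eyes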

Inference

[1] "After flipping a coin  10000  times  4943  were heads! "

Writing a Function

library(ggplot2)
simulate_coin_toss <- function(n_draws, prob_heads) {
  # Set seed for reproducibility
  set.seed(123)
# Catch Errors
  if (prob_heads < 0 || prob_heads > 1) {
    stop("Probability of heads must be between 0 and 1.")
  }
  # Calculate probability of tails
  prob_tails <- 1 - prob_heads

  # Simulate draws
  outcomes <- sample(c("H", "T"),
                     size = n_draws,
                     replace = TRUE,
                     prob = c(prob_heads, prob_tails))

  # Create a data frame for plotting
  df <- data.frame(Outcome = outcomes)

  # Build and return the bar chart of outcomes
  p <- ggplot(df, aes(x = Outcome)) +
    geom_bar() +
    labs(title = "Coin Simulation", x = "Outcome", y = "Frequency") +
    theme_minimal() +
    scale_y_continuous(limits = c(0, n_draws))
  return(p)
}

Simulating A Coin

simulate_coin_toss(n_draws = 100, prob_heads = 0.25)

Repeated Trials

  • The Frequentist tradition views probability as an outcome observed “in the long run.”
  • What does this mean vis-à-vis the central limit theorem, the standard error, confidence intervals, p-values, and Type I and Type II errors?
  • The Bayesian tradition views probability in a more subjective sense.
  • Conditional Probability: \(p(A|B)\) is the probability of event A occurring given that event B has occurred. \[p(A|B) = \frac{p(B|A) \times p(A)}{p(B)}\]
  • \(p(A)\) is the prior probability of event A occurring.
  • \(p(B|A)\) is the likelihood of observing event B given that A has occurred.
  • \(p(B)\) is the marginal probability of event B occurring.
  • The derivation…
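  • One way to fill it in: both conditional probabilities can be written in terms of the same joint probability, \[p(A \mid B)\,p(B) = p(A, B) = p(B \mid A)\,p(A),\] and dividing through by \(p(B)\) gives \[p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}.\]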

The Monty Hall Experiment

  • Background: The Monty Hall Experiment comes from the classic game show, “Let’s Make a Deal.” Here’s how it goes: there are three doors. Behind one door is a substantial prize – a car – and behind the others are goats (undesirable prizes).

  • The contestant chooses one of the three doors but doesn’t open it, and the choice is announced. The host then opens a different door – never the contestant’s pick and never the one hiding the car – revealing a goat. The contestant is then offered the option of staying with the original choice or switching doors.

  • Should the contestant stick with the original choice or switch to the unopened door?

The Monty Hall Experiment

  • \[ P(Car(A)) = P(Car(B)) = P(Car(C))=1/3\]
  • The contestant’s initial choice is entirely arbitrary, as the car is equally likely to be behind any door.
  • Let’s say the contestant chooses door A.
  • Monty Hall then opens Door B, revealing a goat.
  • The contestant now may stay with door A, or switch to C.
  • What would you do?
  • And more importantly, why?

The Monty Hall Experiment

  • From Monty’s Perspective: (1) He will never choose the door with the car, and (2) he will never choose the door the contestant initially picked. \[ P(Open(B) | Car(A)) = 1/2\] \[ P(Open(B) | Car(B)) = 0\] \[ P(Open(B) | Car(C)) = 1\]
  • Let’s use Bayes Rule to calculate the probability that the car is behind doors A or C, given that Monty opened door B. \[p(Car(A)|Open(B))?\] \[p(Car(B)|Open(B))?\] \[p(Car(C)|Open(B))?\]

The Monty Hall Experiment

  • These are all inversions of the conditional probabilities we already know, and we can use Bayes Rule to solve. \[p(Car(A)|Open(B)) = {{(1/3 * 1/2)}\over{(1/3*1/2)+ (1/3*0) + (1/3*1)}} = 1/3\] \[p(Car(B)|Open(B)) = {{(1/3 * 0)}\over{(1/3*1/2)+ (1/3*0) + (1/3*1)}} = 0\] \[p(Car(C)|Open(B)) = {{(1/3 * 1)}\over{(1/3*1/2)+ (1/3*0) + (1/3*1)}} = 2/3\]
  • The contestant should switch to door C, as the probability of winning the car is 2/3, versus 1/3 if they stick with door A.

Simulating the Monty Hall Experiment

The Number of Trials: 10000 
If the player switches doors, they win: 0.6681 of the time 
If the player stays with original choice, they win: 0.3319 of the time 
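The code behind these numbers isn’t shown on the slide; one way to simulate the game (the seed and trial count are illustrative) is:

# Sketch: simulate the Monty Hall game; compare switching vs. staying
set.seed(123)
n_trials <- 10000
car  <- sample(1:3, n_trials, replace = TRUE)    # door hiding the car
pick <- sample(1:3, n_trials, replace = TRUE)    # contestant's first pick
# Monty opens a door that is neither the contestant's pick nor the car
opened <- mapply(function(car_i, pick_i) {
  options <- setdiff(1:3, c(car_i, pick_i))
  if (length(options) == 1) options else sample(options, 1)
}, car, pick)
switch_pick <- mapply(function(p, o) setdiff(1:3, c(p, o)), pick, opened)
mean(switch_pick == car)    # ~2/3 when switching
mean(pick == car)           # ~1/3 when staying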

Two Traditions

  • The Frequentist tradition views probability as an outcome observed “in the long run.”
  • The Bayesian tradition views probability in a more subjective sense. \[p(M|D) ={p(D|M) p(M)\over p(D)}\] \[p(M|D) \propto p(D|M) p(M)\]
  • \(p(M|D)\) is the posterior probability of model \(M\) given data \(D\).
  • \(p(D|M)\) is the likelihood of observing data \(D\) given model \(M\).
  • \(p(M)\) is the prior probability of model \(M\).

Bernoulli Trials

  • Suppose we have a coin, and we want to know the probability of heads, \(\theta\), where each flip yields \(y_i \in \{0, 1\}\).
  • Viewing this from the Frequentist perspective, we can flip the coin \(n\) times and observe \(k\) heads.
  • The parameter \(\theta\) is fixed but unknown, and we can estimate it as \(\widehat{\theta} = k/n\).
  • This estimate is the maximum likelihood estimate (MLE) of \(\theta\).
  • If \(\theta = 0.5\), the joint probability of observing (H, H) in two independent trials is \(0.5 \times 0.5 = 0.25\).
  • If instead \(\theta = 0.3\), the joint probability of (H, H) is \(0.3 \times 0.3 = 0.09\). For a single trial, \[p(y_i \mid \theta)=\theta^{y_i}(1-\theta)^{1-y_i}\]
  • And across \(n\) independent (Bernoulli) trials, with \(k=\sum_i y_i\) heads, \[p(y_1,\dots,y_n \mid \theta)=\prod_i\theta^{y_i}(1-\theta)^{1-y_i}=\theta^k(1-\theta)^{n-k}\]
  • The Binomial distribution collects all orderings with \(k\) heads (a quick numeric check follows): \[p(k \mid \theta, n)= {n \choose k} \theta^k(1-\theta)^{n-k}\]
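  • The quick numeric check (illustrative values):

# Binomial probability of k = 7 heads in n = 10 trials with theta = 0.7
choose(10, 7) * 0.7^7 * 0.3^3       # by the formula, ~0.267
dbinom(7, size = 10, prob = 0.7)    # built-in equivalent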

When \(\theta\) is Unknown

  • Given the data available, what is the most likely parameter or set of parameters to generate the data?
  • The principle of maximum likelihood estimation (MLE) is to find the parameter value (or set of values) that maximizes the likelihood of observing the data we actually have.
  • The likelihood function is the joint probability of observing the data, given the parameters.
  • Assume we observe 7 heads, 3 tails. What’s our best guess of theta?

Bernoulli Trials

library(ggplot2)
theta <- seq(0, 1, by = 0.01)
# The likelihood function, using bernoulli trials
likelihood <- theta^7 * (1 - theta)^3
df <- data.frame(theta, likelihood)

max_likelihood_theta <- df$theta[which.max(df$likelihood)]

ggplot(df, aes(x = theta, y = likelihood)) +
  geom_line() +
  geom_vline(xintercept = max_likelihood_theta, color = "red", linetype = "dashed") +
  labs(title = "Likelihood Function for 7 Heads and 3 Tails",
       x = "Theta (Probability of Heads)",
       y = "Likelihood") +
  theme_minimal()

cat("Possible values of theta:", paste(head(theta), collapse = ", "), "...\n")
cat("Assume we observe 7 heads, 3 tails. What's our best guess?\n\n")
cat("The maximum value of the distribution is at theta =", max_likelihood_theta, "\n")

Bernoulli Trials

Possible values of theta: 0, 0.01, 0.02, 0.03, 0.04, 0.05 ...
Assume we observe 7 heads, 3 tails. What's our best guess?
The maximum value of the distribution is at theta = 0.7 

A Quick Review

  • Correlated random variables. Assume we observe two variables, \(x\) and \(y\). They have means \(\mu_x\) and \(\mu_y\) and standard deviations, \(\sigma_x\) and \(\sigma_y\). The mean of \(x+y\) is \(\mu_x+\mu_y\).
  • The standard deviation of \(x+y\) is \(\sqrt{\sigma^2_x+\sigma^2_y+2\rho\sigma_x\sigma_y}\)
  • Multivariate normal distribution: \(z \sim N(\mu, \Sigma)\), with mean vector \(\mu\) and covariance matrix \(\Sigma\).
library(MASS)
library(dplyr)
library(ggplot2)
# A covariance matrix (a correlation matrix here, since the variances equal 1)
cov_mat <- matrix(c(1, 0.25, 0.25, 1), 2, 2)
## Simulate 1000 draws and plot
x <- mvrnorm(1000, mu = c(0, 0), Sigma = cov_mat) %>% data.frame()
colnames(x) <- c("X1", "X2")  # Rename columns for clarity
head(x)

ggplot(x, aes(x = X1, y = X2)) +
  geom_point() +  
  labs(title = "Scatter Plot of Bivariate Normal Data, rho = 0.25",
       x = "Variable X1",
       y = "Variable X2") +
  theme_minimal()  

paste0("The estimated correlation is " , round(cor(x)[2,1], 3))

A Quick Review

          X1          X2
1  0.9658946  0.08199969
2  1.4894525  0.69235941
3  0.4244329  0.49882277
4  0.4459264 -0.49196931
5 -0.4380806  0.20868850
6  1.4560852  0.61328398
[1] "The estimated correlation is 0.257"

A Quick Review

  • The Population and the Sample.
  • The Expected Value. The expected value is the average value over many draws – it’s useful to think of it from the perspective of frequentism. For a discrete distribution, then:

\[E(x)=\sum_{i=1}^{K} x_i \, p(x_i)\]

  • For continuous distributions, \[E(x)=\int_{-\infty}^{\infty}x f(x)\,dx\]
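  • A quick example in R, the expected value of a fair six-sided die:

# E(x) for a fair die: sum of each value times its probability
x <- 1:6
p <- rep(1/6, 6)
sum(x * p)                              # 3.5
mean(sample(x, 1e5, replace = TRUE))    # a long-run average lands close to 3.5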

A Quick Review

  • A CDF is written as F(x), or with capital Greek notation, e.g., \(\Phi(x)\) for the standard normal.

  • A PDF is written in lowercase, f(x), and we can find the probability of any region under a PDF by summation (for discrete data) or integration (for continuous data). That is, \[p(a<x<b)=\sum_{a<x_i<b} p(x_i)\] \[p(a<x<b)=\int_a^b f(x)\, dx\]

  • And, \(F(\infty)=1\), and \(F(-\infty)=0\), and \(p(-\infty < x < \infty)=\int_{-\infty}^{\infty}f(x)dx=1\).

Estimator Properties

  • If we were able to repeatedly draw samples and run our estimation procedure, we are left with a sampling distribution for the estimator, summarized by its expected value and variance. \[E(\hat {\theta})\] \[var(\hat{\theta})=E(\hat{\theta}-E(\hat{\theta}))^2\]
  • Bias is the difference between the expected value of an estimator and the true population parameter, i.e., \(E(\hat{\theta})-\theta\). Note how this is different from sampling error: sampling error refers to the difference for a single estimate produced by our estimator, whereas bias describes the average behavior of the estimator across many samples.
  • The Mean Squared Error (MSE), \(E(\hat{\theta}-\theta)^2\), decomposes into variance plus squared bias (a small simulation follows):

\[MSE(\hat{\theta}) = E(\hat{\theta}-E(\hat{\theta}))^2 +[E(\hat{\theta})-\theta]^2\]
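A small simulation of these quantities for two estimators of a normal mean (the shrunken estimator is purely illustrative, included to create bias):

# Sketch: approximate bias, variance, and MSE from repeated samples
set.seed(2025)
theta <- 2; n <- 30; reps <- 5000
est_mean   <- replicate(reps, mean(rnorm(n, mean = theta)))   # sample mean
est_shrunk <- 0.8 * est_mean                                  # deliberately biased
bias <- c(mean = mean(est_mean) - theta, shrunk = mean(est_shrunk) - theta)
vars <- c(mean = var(est_mean), shrunk = var(est_shrunk))
rbind(bias, variance = vars, mse = bias^2 + vars)             # MSE = variance + bias^2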

Finite Sample Properties

  • Ordinary Least Squares has desirable small sample properties. What does this mean? Some estimators are unbiased or have minimal variance even when a sample is small. \[MSE(\hat{\theta})=E(\hat{\theta}-\theta)^2\]
  • The Bias-Variance Tradeoff: in \(Y_i=\beta_0+\beta_1 x_{1,i}+\beta_2 x_{2,i}+\dots+\beta_k x_{k,i}+e_i\), omitting a relevant predictor induces bias, while carrying unnecessary predictors inflates the variance of the estimates.

Asymptotic Properties

  • We can also explore the properties of an estimator by varying the sample size; in particular, what happens to the estimator as the sample size approaches infinity?

  • As the sample size gets larger and larger, what happens to the characteristics of an estimator?

  • An estimator is asymptotically unbiased if

\[\lim_{n\to\infty} E(\hat{\theta})=\theta\]

  • An estimator is consistent if it converges in probability to \(\theta\) (a brief simulation follows):

\[\lim_{n\to\infty} P(|\hat{\theta}-\theta|<d)=1 \quad \text{for any } d>0\] \[\text{plim}_{n\to\infty}(\hat{\theta})=\theta\]
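The brief simulation of consistency: as \(n\) grows, repeated estimates of a mean (illustrative normal data with \(\mu = 1\)) concentrate around the true value.

# Sketch: the sampling distribution of the sample mean tightens as n increases
set.seed(11)
sizes <- c(10, 100, 1000, 10000)
sapply(sizes, function(n) {
  draws <- replicate(1000, mean(rnorm(n, mean = 1)))
  c(n = n, mean_est = mean(draws), sd_est = sd(draws))
})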