Binary Variable Regression

Christopher Weber

2025-09-08

Introduction to Binary Variable Regression

  • The Binomial density

\[ f\left(k\right) = \binom{n}{k} \theta^k\left(1-\theta\right)^{n-k} \]

  • Successes or failures; 1/0; binary variables

  • A Bernoulli variable is a binary variable, observed over a series of \(n\) trials, that takes on a value of 1 with probability \(\theta\) and 0 with probability \(1-\theta\)

  • The Binomial Distribution is the associated probability density function (PDF) for the number of successes in \(n\) trials

Binomial Density and R

rbinom(10, 1, 0.5) #  10 observations, 1 trial, 0.5 probability
 [1] 0 1 0 0 0 0 1 1 1 0
dbinom(5, 10, 0.5) # Success, Trials, Probability, the PDF
[1] 0.2460938
pbinom(5, size=10, prob=.50, lower.tail=FALSE) # Upper tail: P(X > 5), i.e., 1 - CDF at 5
[1] 0.3769531

The Binomial PDF

  • 100 trials, where \(\theta = 0.3\)
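A minimal sketch of how this density could be plotted in base R (the plot settings are illustrative, not from the original slides):

k <- 0:100
plot(k, dbinom(k, size = 100, prob = 0.3), type = "h",
     xlab = "Number of successes", ylab = "Probability")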

An Example: Public Policy

  • Imagine you conduct a poll of Arizona residents about water scarcity and management in the state
  • 1,243 respondents, 902 stated that water scarcity is a “somewhat” or “very serious” problem
  • Prior research has not explored this issue, so your “prior beliefs” are non-informative

The Posterior

  • \(P(D|\theta)\). This is the likelihood of observing the data

  • \(P(\theta)\). This is the prior distribution of the parameter

  • \(P(D)\). This is the probability of the data, the normalizing constant

  • \(P(\theta | D)\) is the posterior distribution
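  • Combining these via Bayes’ rule gives the posterior:

\[P(\theta | D) = \frac{P(D|\theta)P(\theta)}{P(D)}\]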

The Likelihood (Log)

  • \(L(\theta) = \prod_{i=1}^{n} \theta^{y_i} (1-\theta)^{1-y_i}\)
  • \(LL(\theta) = \sum_{i=1}^{n} y_i \log(\theta) + (1-y_i) \log(1-\theta)\)

The Log Likelihood

Possible values of theta: 0, 0.01, 0.02, 0.03, 0.04, 0.05 ...
Assume we observe 902 heads, 341 tails. What's our best guess?
The maximum value of the distribution is at theta = 0.73 
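A minimal sketch of this grid search in R (object names are illustrative):

theta <- seq(0, 1, by = 0.01)                  # candidate values of theta
ll <- 902 * log(theta) + 341 * log(1 - theta)  # log likelihood for 902 successes, 341 failures
theta[which.max(ll)]                           # 0.73, the grid value closest to 902/1243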

The Likelihood

Maximum likelihood estimate: 0.73 

The Likelihood

  • Here is the problem: the computer finds the maximum, but the raw likelihood involves products of exceedingly small probabilities across values of \(\theta\)
  • Underflow and overflow are remedied by taking the log of the likelihood function, \(LL(\theta)\)
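A small illustration of why, assuming the 1,243 responses are coded as 902 ones and 341 zeros:

y <- c(rep(1, 902), rep(0, 341))
theta <- 0.73
prod(theta^y * (1 - theta)^(1 - y))             # on the order of 1e-317: at the edge of underflow
sum(y * log(theta) + (1 - y) * log(1 - theta))  # about -730: numerically stable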

The Prior

  • A noninformative prior

  • Every value of \(\theta\) is equally probable; a uniform density

  • The uniform along the interval [0,1] means any value of \(\theta\) on this interval is as likely as any other value

grid_values <- seq(0, 1, by = 0.01)

# Define the prior, likelihood
prior <- rep(1, length(grid_values))  # All values equally probable

likelihood <- dbinom(902, 1243, grid_values)  # binomial

# Calculate the posterior
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)  # normalize by p(D)

data <- data.frame(grid_values = grid_values, posterior = posterior)

# Calculate the 95% credible interval
cumulative_posterior <- cumsum(posterior)
lower_bound <- grid_values[which(cumulative_posterior >= 0.025)[1]] # find bounds
upper_bound <- grid_values[which(cumulative_posterior >= 0.975)[1]]

Possible values of theta, 0 to 1: 0, 0.01, 0.02, 0.03, 0.04, 0.05 ...
Assume we observe 902 heads, 341 tails. What's our best guess?
The maximum value of the distribution is at theta = 0.73 
The 95% credible interval is from 0.7 to 0.75 

Interpretation

  • \(\theta\) isn’t one value, it’s a distribution
  • We can sample from that distribution
# Sample 100 values from the posterior
sample(grid_values, 100, prob = posterior, replace = TRUE) |> mean()
[1] 0.7257
  • Remember that, in practice, we should log the likelihood

The Log Likelihood

\[log(p(D|\theta))=\sum_{n=1}^N[y_n log\theta+(1-y_n)log(1-\theta)] \] \[log(p(D|\theta)) \equiv LL(\theta)\]

  • Multiplication can give exceedingly large (or small) numbers

  • The “unknown” parameter in these trials is \(\theta\), the probability of success, \(p(y=1)\); its estimate is \(\widehat{\theta}\)

The Log Likelihood

  • We could use either the Bernoulli or Binomial densities

  • \(\widehat {\theta}\) as some function of predictor variables

\[\widehat \theta_i=a+bx+\epsilon\]

  • The dependent variable is binary, scored 1 or 0, \(y_{obs} \in \{0,1\}\)
  • The independent variable is a vector of \(n\) real numbers, \(x \in \mathbb{R}^{n}\)
  • Imagine: \((a+bx+\epsilon) \rightarrow \theta \rightarrow y\)
  • But: What is \(\theta_i\) and how can we regress it on \(x\) to retrieve a regression coefficient?

The Linear Probability Model

  • What options are available?
  • How about OLS?
  • A linear regression, regressing \(y_{obs}\) on a set of covariates
  • The binary variable is just 0 or 1, so the fitted value can be interpreted as a probability
  • The linear probability model (Long 1997)

Problems

  • Remember \(y_{obs} \in \{0,1\}\)
  • Also recall that OLS minimizes the sum of squared residuals, the differences between \(\widehat{y}\) and \(y_{obs}\)
  • We don’t observe a probability, we observe a binary outcome
  • Nothing restricts \(\widehat{y}\) to lie between 0 and 1 (see the sketch below)
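A minimal sketch with simulated data (not the survey data used later in these notes), showing how OLS fitted values can escape the unit interval:

set.seed(123)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 2 * x))  # simulated binary outcome
lpm <- lm(y ~ x)                         # the linear probability model
range(fitted(lpm))                       # some fitted "probabilities" fall below 0 or above 1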

Problems

  • Nothing restricts \(\widehat{y}\) to lie between 0 and 1
  • The residuals are not normally distributed, nor is their variance constant
  • Recall from POL 682, \(e_i=y_i-\widehat{y}_i\)
  • But, we know that \(y_i\) may only take on two values, 0 or 1

Heteroskedasticity

  • But, we know that \(y_i\) may only take on two values, 0 or 1
  • When \(y=1\), the residual becomes: \(1-b_0-b_1x\)
  • But if \(y=0\), the residual becomes \(-b_0-b_1x\)
  • We will always have a different variance estimate when \(y=0\) relative to when \(y=1\) (Long 1997, p. 38)

Heteroskedasticity

\[var(y|x)=pr(y=1|x)(1-pr(y=1|x))=xb(1-xb)\]

Functional Form

  • Functional form. The linear model assumes a constant effect of \(x\rightarrow y\)

  • This may be unlikely, particularly when dealing with a probability. Maybe we expect the effect of \(x\) on \(y\) to be non-linear, where the effect of \(x\) on \(y\) is not constant

  • The linear model assumes a constant effect of \(x\rightarrow y\)
    \[ \frac{\partial y}{\partial x} = b \]

Alternative Specification


\[ x_{obs} \rightarrow y_{latent} \rightarrow y_{obs} \]

  • If we knew \(y_{latent}\), all would be good and we could estimate OLS

  • Instead, we observe a realization of \(y_{latent}\) in \(y_{obs}\)

Goals

  • We’d like to estimate \(y_{latent}=x_iB+e_i\)

  • We can’t, because \(y_{latent}\) is unobserved, so the errors are unknown

  • What residuals can we minimize?

  • Instead, let’s make an assumption about the error process; the errors are either

  • \(e_i \sim normal(0, 1)\) is the probit regression model

  • \(e_i \sim logistic(0, \pi^2/3)\) is the logit regression model

  • And the sigmoid function \(=\) the logistic CDF

Probit

The standard normal density:

\[\phi(t)={{1}\over{\sqrt{2 \pi}}}exp({-{t^2}\over {2}})\]

The probit model uses its CDF:

\[\Phi(e)=\int_{-\infty}^{e}{{1}\over{\sqrt{2 \pi}}}exp({-{t^2}\over {2}})dt\]

Logit

The logistic density:

\[\lambda(e)={{exp(e)}\over{(1+exp(e))^2}}\]

The logit model uses its CDF:

\[\Lambda(e)={{exp(e)}\over{1+exp(e)}}\]
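A quick comparison of the two CDFs in R; rescaling by the standard deviation of the standard logistic, \(\pi/\sqrt{3} \approx 1.81\), makes them nearly indistinguishable:

e <- seq(-4, 4, by = 0.1)
plot(e, pnorm(e), type = "l", xlab = "e", ylab = "CDF")  # probit link: standard normal CDF
lines(e, plogis(e), lty = 2)                             # logit link: logistic CDF
lines(e, plogis(e * pi / sqrt(3)), lty = 3)              # logistic rescaled to unit variance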

The Latent Variable Model

  • The choice – logit or probit – is somewhat arbitrary

  • Returning to the example

  • The mean of the error for the latent variable \(e\) is 0

  • The threshold \(\tau\) is 0

  • The variance of the error term is constant

The Latent Variable Model

If

\[p(y_{obs}=1|x)=p(y_{latent}>0|x)\]

then,

\[p(x\beta+\epsilon>0|x)=p(\epsilon>-x\beta|x)\]

Because the distribution for \(e\) must be symmetric

\[p(\epsilon>-x\beta|x)=p(\epsilon<x\beta|x)\]

Scale Indeterminacy

  • Because \(y_{latent}\) is unobserved, we need to make an assumption about the scale of the error term

  • This is the scale indeterminacy problem. What we choose will affect the parameter estimates

  • Logit coefficients will differ from probit coefficients, with the same data

…by a factor of roughly 1.81 (\(\pi/\sqrt{3}\), the standard deviation of the standard logistic)

\[1.81 \times \beta_{probit}=\beta_{logit}\]

load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive |> 
  mutate(
    support_protest = ifelse(violent >3, 1, 0)
  ) |>
  select(support_protest, VIOLENT)
logitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "logit")) 
probitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "probit")) 


Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.34186    0.04804  -7.115 1.12e-12 ***
VIOLENT     -0.55257    0.07058  -7.829 4.92e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4671.4  on 3599  degrees of freedom
Residual deviance: 4609.4  on 3598  degrees of freedom
AIC: 4613.4

Number of Fisher Scoring iterations: 4


Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "probit"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.21378    0.02992  -7.145 9.01e-13 ***
VIOLENT     -0.33902    0.04316  -7.855 3.99e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4671.4  on 3599  degrees of freedom
Residual deviance: 4609.4  on 3598  degrees of freedom
AIC: 4613.4

Number of Fisher Scoring iterations: 4

Identical Predictions

pred.data = expand.grid(VIOLENT = c(0, 1)) |> as.data.frame()

predictions_logit = predict(logitModel, pred.data,  type = "response") |> 
  data.frame() |>
  mutate(
    violent = c(0, 1)
  ) 

predictions_probit = predict(probitModel, pred.data,  type = "response") |> 
  data.frame() |>
  mutate(
    violent = c(0, 1)
  ) 


cat("The treatment effect for the logit model is:\n",
predictions_logit[1,1] - predictions_logit[2,1], "\n",
"The treatment effect for the probit model is:\n",
predictions_probit[1,1] - predictions_probit[2,1]

)
The treatment effect for the logit model is:
 0.1251605 
 The treatment effect for the probit model is:
 0.1251605

Nonlinear Regression

  • A different approach to the same problem, a binary dependent variable
  • The log likelihood

\[ log(p(D|\theta))=\sum_{n=1}^Nlogp(y_n|\theta)=\sum_{n=1}^N[y_n log\theta+(1-y_n)log(1-\theta)] \]

  • \(\theta\) is bounded between 0 and 1, but our regression model requires a continuous, unbounded dependent variable

  • First, convert the probability to an odds

\[{{\theta_i}\over{1-\theta_i}}\]

  • This frees the upper bound restriction.

Nonlinear Regression

  • Then the lower bounds

  • The transformation is an “odds” that may approach \(\infty\). But, we still have the zero restriction

  • Take the natural logarithm of the odds, resulting in the log odds, and the logit transformation

\[\eta_i=log\left[{{\theta_i}\over{1-\theta_i}}\right]\]

\[\eta \in (-\infty, \infty)\]
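A quick check of the two transformations in R:

theta <- c(0.01, 0.25, 0.50, 0.75, 0.99)
theta / (1 - theta)        # odds: bounded below by 0, unbounded above
log(theta / (1 - theta))   # log odds: unbounded in both directions (same as qlogis(theta))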

Nonlinear Regression

\[log{{\theta_i}\over{1-\theta_i}}=x\beta\]

The inverse of this function returns the probability scale.

\[p(y=1|x)={{exp(x\beta)}\over{1+exp(x\beta)}}\]

\[p(y=1|x\beta)=\theta_i={1\over{1+exp(-(x\beta))}}\]

Nonlinear Regression

  • We generated a function that maps the linear prediction onto the probability that \(y=1\)
  • We have linked the prediction formed by \(a+bx\) to a probability
  • Knowing this, we can simply substitute \(\theta_i\) into the likelihood function

\[p(D|\theta)=\prod_{n=1}^N\left[\frac{1}{1+exp(-(x\beta))}\right]^{y_n}\left[1-\frac{1}{1+exp(-(x\beta))}\right]^{1-{y_n}}\]
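As a sketch (using simulated data, not the model fit later in these notes), this likelihood can be maximized directly with optim() and compared with glm():

set.seed(42)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(0.5 + 1.2 * x))        # simulated binary outcome

negLL <- function(b) {
  theta <- plogis(b[1] + b[2] * x)                 # inverse logit of the linear predictor
  -sum(y * log(theta) + (1 - y) * log(1 - theta))  # negative log likelihood
}
optim(c(0, 0), negLL)$par                          # close to coef(glm(y ~ x, family = binomial))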

Vectors

  • Assume \(\textbf{a}=[3,2,1]\) and \(\textbf{b}=[1,1,1]\)
  • Or, more simply, let’s deal in \(\mathbb{R}^{2}\)
  • \(\textbf{a}=[9,2]\) and \(\textbf{b}=[1,1]\)
  • Addition and Subtraction
  • Subtract or add elementwise \[\textbf{a}+\textbf{b}=[9+1,2+1]=[10,3]\]

Multiplication

  • Inner, outer, cross \(\textbf{a}\)=[3,2,1] and \(\textbf{b}\)=[1,1,1]
  • The inner product \[\textbf{a}\cdot \textbf{b}=3*1+2*1+1*1=6\]
  • \(\sum_{i=1}^{k} a_ib_i\): Multiply the \(i\)th element in \(\textbf{a}\) with the \(i\)th element in \(\textbf{b}\). Then sum from 1 to \(k\), where \(k\) is the length of the vectors
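In R, the inner product can be computed elementwise or with the %*% operator:

a <- c(3, 2, 1)
b <- c(1, 1, 1)
sum(a * b)   # multiply elementwise, then sum: 6
a %*% b      # the same value, returned as a 1 x 1 matrix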

The Inner Product (Linear Regression)

  • Why? It’s a measure of similarity.
  • The correlation \[cor(\textbf{a},\textbf{b})={{\sum (a_i-\bar a)(b_i-\bar b)}\over{\sqrt{\sum (a_i-\bar a)^2}\sqrt{\sum (b_i-\bar b)^2}}} = cos(\theta)\]
  • Remember \(a \perp b \Rightarrow a \cdot b =0\)
  • \(cos(\theta)=0 \Rightarrow \theta=90^{\circ}\)
  • The inner product forms the basis for linear regression, where \(y_i\) is a linear composite of the predictors \(x_i\) and the coefficients \(\beta\). \[y_i=x_i\beta+e_i\]
  • The inner product of the ith row of \(X\) with the vector of coefficients \(\beta\)
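A quick check, with two illustrative vectors, that the correlation equals the cosine of the angle between the centered vectors:

a <- c(3, 2, 1)
b <- c(1, 2, 4)
ac <- a - mean(a); bc <- b - mean(b)         # center each vector
sum(ac * bc) / sqrt(sum(ac^2) * sum(bc^2))   # cosine of the angle between them
cor(a, b)                                    # identical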

Inner Product Rules

  • \[\textbf{Commutative: } \textbf{a} \cdot \textbf{b}=\textbf{b} \cdot \textbf{a}\]

\[\textbf{Associative (with a scalar): } d(\textbf{a} \cdot \textbf{b})=(d\textbf{a}) \cdot \textbf{b}=\textbf{a} \cdot (d\textbf{b})\]

\[\textbf{Distributive: } \textbf{c} \cdot (\textbf{a}+\textbf{b})=\textbf{c} \cdot \textbf{a}+\textbf{c} \cdot \textbf{b}\]

\[\textbf{Zero: } \textbf{a} \cdot \textbf{0}=0\]

\[\textbf{Unit: } \textbf{1} \cdot \textbf{a}=\sum a_i\]

The Cross Product

  • [27-41, 11-37, 34-21]
  • This is useful to invert a matrix, which is central to linear regression
  • \(\widehat{\beta} = (X^T X)^{-1} X^T Y\)
  • If we are unable to invert \(X^TX\), we cannot solve for \(\widehat{\beta}\)
  • There is no linear solution.
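A minimal sketch of this matrix solution in R, using simulated data, to compare with lm():

set.seed(1)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)
X <- cbind(1, x)                   # design matrix: intercept column plus x
solve(t(X) %*% X) %*% t(X) %*% y   # identical to coef(lm(y ~ x))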

The Outer Product

\[ \begin{bmatrix} 3 \\ 2\\ 1\\ \end{bmatrix} \begin{bmatrix} 1&4&7\\ \end{bmatrix} =\]

\[ \begin{bmatrix} 3&12&21\\ 2&8&14\\ 1&4&7\\ \end{bmatrix} \]
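The same outer product computed in R:

outer(c(3, 2, 1), c(1, 4, 7))   # a 3 x 3 matrix of all pairwise products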

Outer Product Example

  • \(\text{Cov} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})^T\)
  • Two variables, three observations
  • Two vectors, \(\mathbb{R}^{3}\) \[\begin{align} \mathbf{x} &= [1, 3, 5], \quad \bar{x} = 3\\ \mathbf{y} &= [2, 4, 8], \quad \bar{y} = 4.67 \end{align}\]

\[\begin{align} \mathbf{x}_{\text{centered}} &= [-2, 0, 2]\\ \mathbf{y}_{\text{centered}} &= [-2.67, -0.67, 3.33] \end{align}\]

\[\begin{align} (x_1-\bar{x})(y_1-\bar{y})^T &= -2 \times -2.67 = 5.33\\ (x_2-\bar{x})(y_2-\bar{y})^T &= 0 \times -0.67 = 0\\ (x_3-\bar{x})(y_3-\bar{y})^T &= 2 \times 3.33 = 6.67 \end{align}\]

\[\begin{equation} \text{Cov}(x,y) = \frac{1}{3}(5.33 + 0 + 6.67) = 4.0 \end{equation}\]
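The same covariance in R; note the formula above uses a \(1/n\) denominator, while R’s cov() divides by \(n-1\):

x <- c(1, 3, 5)
y <- c(2, 4, 8)
sum((x - mean(x)) * (y - mean(y))) / 3   # 4, matching the calculation above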

Matrices

  • An \(n \times 3\) data matrix \[ \begin{bmatrix} Vote&PID&Ideology \\\hline a_{11}&a_{12}&a_{13} \\ a_{21}&a_{22}&a_{23} \\ \vdots & \vdots & \vdots\\ a_{n1}&a_{n2}&a_{n3}\\ \end{bmatrix} \]

Matrices

  • Matrices are vectors arranged in rows and columns
  • Subtraction and addition are elementwise
  • The problem is there are three unknown quantities that we need to find, \(b_0, b_1, b_2\)

\[\begin{bmatrix} y_{1}= & b_0+ b_1 x_{Ideology,1}+b_2 x_{PID,1}\\ y_{2}= & b_0+ b_1 x_{Ideology,2}+b_2 x_{PID,2}\\ y_{3}= & b_0+ b_1 x_{Ideology,3}+b_2 x_{PID,3}\\ y_{4}= & b_0+ b_1 x_{Ideology,4}+b_2 x_{PID,4}\\ \vdots\\ y_{n}= & b_0+ b_1 x_{Ideology,n}+b_2 x_{PID,n}\\ \end{bmatrix}\]

Matrices

  • Conformability
  • Matrix multiplication is not elementwise
  • Order matters, where \(AB \neq BA\)
  • The inner dimensions must match \[ \begin{bmatrix} 1&3 \\ 2&4\\ \end{bmatrix} \begin{bmatrix} 3&5 \\ 2&4\\ \end{bmatrix}= \begin{bmatrix} 1*3+3*2&1*5+3*4 \\ 2*3+4*2&2*5+4*4\\ \end{bmatrix} \]
  • The result is a \(2 \times 2\) matrix
  • Chapter 2 in the course notes provides far more detail on vectors and matrices, and includes important details about matrix inversion, projection, etc.
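The same multiplication in R, using the %*% operator:

A <- matrix(c(1, 2, 3, 4), nrow = 2)   # rows (1, 3) and (2, 4)
B <- matrix(c(3, 2, 5, 4), nrow = 2)   # rows (3, 5) and (2, 4)
A %*% B                                # the 2 x 2 result shown above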

Data Analysis

  • Vectors and matrices
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
cat("The dimensions of the data are," ,
    dim(dataActive))
The dimensions of the data are, 3600 11
head(dataActive)
  VIOLENT violent age                 educ authoritarianism         mip
1       0       5  67            Post-grad              0.0       covid
2       1       3  63 High school graduate              0.5       covid
3       1       2  62         Some college              1.0     economy
4       0       5  54               4-year              0.0       covid
5       0       4  45               2-year              0.0 environment
6       1       2  51 High school graduate              1.0     economy
  moral_individualism    sdo             ideo5 post_call        pid3
1              0.3125 0.1250          Moderate         0    Democrat
2              0.5625 0.1250           Liberal         0 Independent
3              0.0625 0.3125 Very conservative         0  Republican
4              0.0000 0.0625           Liberal         0       Other
5              0.5000 0.1250           Liberal         0    Not sure
6              0.5000 0.5000      Conservative         0  Republican

The Linear Regression Model

  • Moral Individualism (Karpowitz and Patterson 2023)
  • Social Dominance Orientation (Sidanius and Pratto 1999)
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
myModel = lm(sdo ~ moral_individualism, 
             data = dataActive)
summary(myModel)

Call:
lm(formula = sdo ~ moral_individualism, data = dataActive)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38235 -0.15389  0.00182  0.13706  0.69956 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         0.218543   0.007058   30.96   <2e-16 ***
moral_individualism 0.163803   0.011895   13.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1864 on 3598 degrees of freedom
Multiple R-squared:  0.05006,   Adjusted R-squared:  0.0498 
F-statistic: 189.6 on 1 and 3598 DF,  p-value: < 2.2e-16

Interpretation

  • A one unit increase in moral individualism is associated with a 0.163803 increase in social dominance orientation, on average, holding all else constant
library(dplyr)
myModel = lm(sdo ~ moral_individualism, 
             data = dataActive)

coefficients = coef(myModel) # coefficient vector
dataActive |>
  select(sdo, moral_individualism) -> dat # data matrix
cat("The coefficient vector is length", length(coefficients), "\n")
The coefficient vector is length 2 
cat("The data matrix is", dim(dat)[1], "X", dim(dat)[2], "\n")
The data matrix is 3600 X 2 
  • Multiply these, \(\widehat{y} = X \beta\), to retrieve a prediction for each observation
  • Use the %*% operator in R
predictions = as.matrix(cbind(1, dat$moral_individualism)) %*% coefficients
head(predictions,3)
          [,1]
[1,] 0.2697310
[2,] 0.3106817
[3,] 0.2287802

Logistic Regression

load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive |> mutate( support_protest = ifelse(violent >3, 1, 0)) |>
  select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat, 
                 family = binomial(link = "logit")) 
logitModel

Call:  glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"), 
    data = dat)

Coefficients:
(Intercept)      VIOLENT  
    -0.3419      -0.5526  

Degrees of Freedom: 3599 Total (i.e. Null);  3598 Residual
Null Deviance:      4671 
Residual Deviance: 4609     AIC: 4613

Logistic Regression

  • The process of prediction is identical, but the outcome \(pr(y|x)\) is non-linear
  • Use \(plogis()\) to convert the linear prediction to a probability
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive |> mutate( support_protest = ifelse(violent >3, 1, 0)) |>
  select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat, 
                 family = binomial(link = "logit")) 
coefficients = coefficients(logitModel) # coefficient vector
predictions = as.matrix(cbind(1, dat$VIOLENT)) %*% coefficients |> 
  plogis()
head(predictions, 10)
           [,1]
 [1,] 0.4153587
 [2,] 0.2901982
 [3,] 0.2901982
 [4,] 0.4153587
 [5,] 0.4153587
 [6,] 0.2901982
 [7,] 0.2901982
 [8,] 0.4153587
 [9,] 0.4153587
[10,] 0.4153587