Binary Variable Regression

Christopher Weber

2025-09-08

Introduction to Binary Variable Regression

  • The Binomial density

\[ f\left(k\right) = \binom{n}{k} \theta^k\left(1-\theta\right)^{n-k} \]

  • Successes or failures; 1/0; binary variables

  • A Bernoulli variable is a binary variable, observed over a series of \(n\) trials, that takes on a value of 1 with probability \(\theta\) and 0 with probability \(1-\theta\)

  • The Binomial Distribution is the associated probability density function (PDF) for the number of successes in \(n\) trials

Binomial Density and R

rbinom(10, 1, 0.5) #  10 observations, 1 trial, 0.5 probability
 [1] 0 1 0 0 0 0 1 1 1 0
dbinom(5, 10, 0.5) # Success, Trials, Probability, the PDF
[1] 0.2460938
pbinom(5, size=10, prob=.50, lower.tail=FALSE) # Upper tail: P(X > 5), i.e., 1 - CDF at 5
[1] 0.3769531

The Binomial PDF

  • 100 trials, where \(\theta = 0.3\)
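A minimal sketch of how this density could be plotted in base R (the plot settings are illustrative, not from the original slides):

k <- 0:100
plot(k, dbinom(k, size = 100, prob = 0.3), type = "h",
     xlab = "Number of successes", ylab = "Probability")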

An Example: Public Policy

  • Imagine you conduct a poll of Arizona residents about water scarcity and management in the state
  • 1,243 respondents, 902 stated that water scarcity is a “somewhat” or “very serious” problem
  • Prior research has not explored this issue, so your “prior beliefs” are non-informative

The Posterior

  • \(P(D|\theta)\). This is the likelihood of observing the data

  • \(P(\theta)\). This is the prior distribution of the parameter

  • \(P(D)\). This is the probability of the data, the normalizing constant

  • \(P(\theta | D)\) is the posterior distribution
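  • Combining these via Bayes’ rule gives the posterior:

\[P(\theta | D) = \frac{P(D|\theta)P(\theta)}{P(D)}\]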

The Likelihood (Log)

  • \(L(\theta) = \prod_{i=1}^{n} \theta^{y_i} (1-\theta)^{1-y_i}\)
  • \(LL(\theta) = \sum_{i=1}^{n} y_i \log(\theta) + (1-y_i) \log(1-\theta)\)

The Log Likelihood

Possible values of theta: 0, 0.01, 0.02, 0.03, 0.04, 0.05 ...
Assume we observe 902 heads, 341 tails. What's our best guess?
The maximum value of the distribution is at theta = 0.73 
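A minimal sketch of this grid search in R (object names are illustrative):

theta <- seq(0, 1, by = 0.01)                  # candidate values of theta
ll <- 902 * log(theta) + 341 * log(1 - theta)  # log likelihood for 902 successes, 341 failures
theta[which.max(ll)]                           # 0.73, the grid value closest to 902/1243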

The Likelihood

Maximum likelihood estimate: 0.73 

The Likelihood

  • Here is the problem: the computer finds the maximum, but the raw likelihood involves products of exceedingly small probabilities across values of \(\theta\)
  • Underflow and overflow are remedied by taking the log of the likelihood function, \(LL(\theta)\)
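A small illustration of why, assuming the 1,243 responses are coded as 902 ones and 341 zeros:

y <- c(rep(1, 902), rep(0, 341))
theta <- 0.73
prod(theta^y * (1 - theta)^(1 - y))             # on the order of 1e-317: at the edge of underflow
sum(y * log(theta) + (1 - y) * log(1 - theta))  # about -730: numerically stable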

The Prior

  • A noninformative prior

  • Every value of \(\theta\) is equally probable; a uniform density

  • The uniform along the interval [0,1] means any value of \(\theta\) on this interval is as likely as any other value

grid_values <- seq(0, 1, by = 0.01)

# Define the prior, likelihood
prior <- rep(1, length(grid_values))  # All values equally probable

likelihood <- dbinom(902, 1243, grid_values)  # binomial

# Calculate the posterior
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)  # normalize by p(D)

data <- data.frame(grid_values = grid_values, posterior = posterior)

# Calculate the 95% credible interval
cumulative_posterior <- cumsum(posterior)
lower_bound <- grid_values[which(cumulative_posterior >= 0.025)[1]] # find bounds
upper_bound <- grid_values[which(cumulative_posterior >= 0.975)[1]]

Possible values of theta, 0 to 1: 0, 0.01, 0.02, 0.03, 0.04, 0.05 ...
Assume we observe 902 heads, 341 tails. What's our best guess?
The maximum value of the distribution is at theta = 0.73 
The 95% credible interval is from 0.7 to 0.75 

Interpretation

  • \(\theta\) isn’t one value, it’s a distribution
  • We can sample from that distribution
# Sample 100 values from the posterior
sample(grid_values, 100, prob = posterior, replace = TRUE) |> mean()
[1] 0.7257
  • Remember that, in practice, we should log the likelihood

The Log Likelihood

\[log(p(D|\theta))=\sum_{n=1}^N[y_n log\theta+(1-y_n)log(1-\theta)] \] \[log(p(D|\theta)) \equiv LL(\theta)\]

  • Multiplication can give exceedingly large (or small) numbers

  • The “unknown” parameter in these trials is \(\theta\), the probability of success, \(p(y=1)\); its estimate is \(\widehat{\theta}\)

The Log Likelihood

  • We could use either the Bernoulli or Binomial densities

  • \(\widehat {\theta}\) as some function of predictor variables

\[\widehat \theta_i=a+bx+\epsilon\]

  • The dependent variable is binary, scored 1 or 0, \(y_{obs} \in \{0,1\}\)
  • The independent variable is a vector of \(n\) real numbers, \(x \in \mathbb{R}^{n}\)
  • Imagine: \((a+bx+\epsilon) \rightarrow \theta \rightarrow y\)
  • But: What is \(\theta_i\) and how can we regress it on \(x\) to retrieve a regression coefficient?

The Linear Probability Model

  • What options are available?
  • How about OLS?
  • A linear regression, regressing \(y_{obs}\) on a set of covariates
  • The binary variable is just 0 or 1, so the fitted value can be interpreted as a probability
  • The linear probability model (Long 1997)

Problems

  • Remember \(y_{obs} \in \{0,1\}\)
  • Also recall that OLS minimizes the sum of squared residuals, the differences between \(\widehat{y}\) and \(y_{obs}\)
  • We don’t observe a probability, we observe a binary outcome
  • Nothing restricts \(\widehat{y}\) to lie between 0 and 1 (see the sketch below)
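A minimal sketch with simulated data (not the survey data used later in these notes), showing how OLS fitted values can escape the unit interval:

set.seed(123)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 2 * x))  # simulated binary outcome
lpm <- lm(y ~ x)                         # the linear probability model
range(fitted(lpm))                       # some fitted "probabilities" fall below 0 or above 1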

Problems

  • Nothing restricts \(\widehat{y}\) to lie between 0 and 1
  • The residuals are not normally distributed, nor is their variance constant
  • Recall from POL 682, \(e_i=y_i-\widehat{y}_i\)
  • But, we know that \(y_i\) may only take on two values, 0 or 1

Heteroskedasticity

  • But, we know that \(y_i\) may only take on two values, 0 or 1
  • When \(y=1\), the residual becomes: \(1-b_0-b_1x\)
  • But if \(y=0\), the residual becomes \(-b_0-b_1x\)
  • We will always have a different variance estimate when \(y=0\) relative to when \(y=1\) (Long 1997, p. 38)

Heteroskedasticity

\[var(y|x)=pr(y=1|x)(1-pr(y=1|x))=xb(1-xb)\]

Functional Form

  • Functional form. The linear model assumes a constant effect of \(x\rightarrow y\)

  • This may be unlikely, particularly when dealing with a probability. Maybe we expect the effect of \(x\) on \(y\) to be non-linear, where the effect of \(x\) on \(y\) is not constant

  • The linear model assumes a constant effect of \(x\rightarrow y\)
    \[ \frac{\partial y}{\partial x} = b \]

Alternative Specification


\[ x_{obs} \rightarrow y_{latent} \rightarrow y_{obs} \]

  • If we knew \(y_{latent}\), all would be good and we could estimate OLS

  • Instead, we observe a realization of \(y_{latent}\) in \(y_{obs}\)

Goals

  • We’d like to estimate \(y_{latent}=x_iB+e_i\)

  • We can’t, because \(y_{latent}\) is unobserved, so the errors are unknown

  • What residuals can we minimize?

  • Instead, let’s make an assumption about the error process; the errors are either

  • \(e_i \sim normal(0, 1)\) is the probit regression model

  • \(e_i \sim logistic(0, \pi^2/3)\) is the logit regression model

  • And the sigmoid function \(=\) the logistic CDF

Probit

The standard normal density:

\[\phi(t)={{1}\over{\sqrt{2 \pi}}}exp({-{t^2}\over {2}})\]

The probit model uses its CDF:

\[\Phi(e)=\int_{-\infty}^{e}{{1}\over{\sqrt{2 \pi}}}exp({-{t^2}\over {2}})dt\]

Logit

The logistic density:

\[\lambda(e)={{exp(e)}\over{(1+exp(e))^2}}\]

The logit model uses its CDF:

\[\Lambda(e)={{exp(e)}\over{1+exp(e)}}\]
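A quick comparison of the two CDFs in R; rescaling by the standard deviation of the standard logistic, \(\pi/\sqrt{3} \approx 1.81\), makes them nearly indistinguishable:

e <- seq(-4, 4, by = 0.1)
plot(e, pnorm(e), type = "l", xlab = "e", ylab = "CDF")  # probit link: standard normal CDF
lines(e, plogis(e), lty = 2)                             # logit link: logistic CDF
lines(e, plogis(e * pi / sqrt(3)), lty = 3)              # logistic rescaled to unit variance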

The Latent Variable Model

  • The choice – logit or probit – is somewhat arbitrary

  • Returning to the example

  • The mean of the error for the latent variable \(e\) is 0

  • The threshold \(\tau\) is 0

  • The variance of the error term is constant

The Latent Variable Model

If

\[p(y_{obs}=1|x)=p(y_{latent}>0|x)\]

then,

\[p(x\beta+\epsilon>0|x)=p(\epsilon>-x\beta|x)\]

Because the distribution for \(e\) must be symmetric

\[p(\epsilon>-x\beta|x)=p(\epsilon<x\beta|x)\]

Scale Indeterminacy

  • Because \(y_{latent}\) is unobserved, we need to make an assumption about the scale of the error term

  • This is the scale indeterminacy problem. What we choose will affect the parameter estimates

  • Logit coefficients will differ from probit coefficients, with the same data

…by a factor of roughly 1.81 (\(\pi/\sqrt{3}\), the standard deviation of the standard logistic)

\[1.81 \times \beta_{probit}=\beta_{logit}\]

load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive |> 
  mutate(
    support_protest = ifelse(violent >3, 1, 0)
  ) |>
  select(support_protest, VIOLENT)
logitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "logit")) 
probitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "probit")) 


Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.34186    0.04804  -7.115 1.12e-12 ***
VIOLENT     -0.55257    0.07058  -7.829 4.92e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4671.4  on 3599  degrees of freedom
Residual deviance: 4609.4  on 3598  degrees of freedom
AIC: 4613.4

Number of Fisher Scoring iterations: 4


Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "probit"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.21378    0.02992  -7.145 9.01e-13 ***
VIOLENT     -0.33902    0.04316  -7.855 3.99e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4671.4  on 3599  degrees of freedom
Residual deviance: 4609.4  on 3598  degrees of freedom
AIC: 4613.4

Number of Fisher Scoring iterations: 4

Identical Predictions

pred.data = expand.grid(VIOLENT = c(0, 1)) |> as.data.frame()

predictions_logit = predict(logitModel, pred.data,  type = "response") |> 
  data.frame() |>
  mutate(
    violent = c(0, 1)
  ) 

predictions_probit = predict(probitModel, pred.data,  type = "response") |> 
  data.frame() |>
  mutate(
    violent = c(0, 1)
  ) 


cat("The treatment effect for the logit model is:\n",
predictions_logit[1,1] - predictions_logit[2,1], "\n",
"The treatment effect for the probit model is:\n",
predictions_probit[1,1] - predictions_probit[2,1]

)
The treatment effect for the logit model is:
 0.1251605 
 The treatment effect for the probit model is:
 0.1251605

Nonlinear Regression

  • A different approach to the same problem, a binary dependent variable
  • The log likelihood

\[ log(p(D|\theta))=\sum_{n=1}^Nlogp(y_n|\theta)=\sum_{n=1}^N[y_n log\theta+(1-y_n)log(1-\theta)] \]

  • \(\theta\) is bounded between 0 and 1, but our regression model requires a continuous, unbounded dependent variable

  • First, convert the probability to an odds

\[{{\theta_i}\over{1-\theta_i}}\]

  • This frees the upper bound restriction.

Nonlinear Regression

  • Then the lower bounds

  • The transformation is an “odds” that may approach \(\infty\). But, we still have the zero restriction

  • Take the natural logarithm of the odds, resulting in the log odds, and the logit transformation

\[\eta_i=log\left[{{\theta_i}\over{1-\theta_i}}\right]\]

\[\eta \in (-\infty, \infty)\]
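A quick check of the two transformations in R:

theta <- c(0.01, 0.25, 0.50, 0.75, 0.99)
theta / (1 - theta)        # odds: bounded below by 0, unbounded above
log(theta / (1 - theta))   # log odds: unbounded in both directions (same as qlogis(theta))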

Nonlinear Regression

\[log{{\theta_i}\over{1-\theta_i}}=x\beta\]

The inverse of this function returns the probability scale.

\[p(y=1|x)={{exp(x\beta)}\over{1+exp(x\beta)}}\]

\[p(y=1|x\beta)=\theta_i={1\over{1+exp(-(x\beta))}}\]

Nonlinear Regression

  • We generated a function that maps the linear prediction onto the probability that \(y=1\)
  • We have linked the prediction formed by \(a+bx\) to a probability
  • Knowing this, we can simply substitute \(\theta_i\) into the likelihood function

\[p(D|\theta)=\prod_{n=1}^N\left[\frac{1}{1+exp(-(x\beta))}\right]^{y_n}\left[1-\frac{1}{1+exp(-(x\beta))}\right]^{1-{y_n}}\]
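As a sketch (using simulated data, not the model fit later in these notes), this likelihood can be maximized directly with optim() and compared with glm():

set.seed(42)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(0.5 + 1.2 * x))        # simulated binary outcome

negLL <- function(b) {
  theta <- plogis(b[1] + b[2] * x)                 # inverse logit of the linear predictor
  -sum(y * log(theta) + (1 - y) * log(1 - theta))  # negative log likelihood
}
optim(c(0, 0), negLL)$par                          # close to coef(glm(y ~ x, family = binomial))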

Vectors

  • Assume \(\textbf{a}=[3,2,1]\) and \(\textbf{b}=[1,1,1]\)
  • Or, more simply, let’s deal in \(\mathbb{R}^{2}\)
  • \(\textbf{a}=[9,2]\) and \(\textbf{b}=[1,1]\)
  • Addition and Subtraction
  • Subtract or add elementwise \[\textbf{a}+\textbf{b}=[9+1,2+1]=[10,3]\]

Multiplication

  • Inner, outer, cross \(\textbf{a}\)=[3,2,1] and \(\textbf{b}\)=[1,1,1]
  • The inner product \[\textbf{a}\cdot \textbf{b}=3*1+2*1+1*1=6\]
  • \(\sum_{i=1}^{k} a_ib_i\): Multiply the \(i\)th element in \(\textbf{a}\) with the \(i\)th element in \(\textbf{b}\). Then sum from 1 to \(k\), where \(k\) is the length of the vectors
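In R, the inner product can be computed elementwise or with the %*% operator:

a <- c(3, 2, 1)
b <- c(1, 1, 1)
sum(a * b)   # multiply elementwise, then sum: 6
a %*% b      # the same value, returned as a 1 x 1 matrix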

The Inner Product (Linear Regression)

  • Why? It’s a measure of similarity.
  • The correlation \[cor(\textbf{a},\textbf{b})={{\sum (a_i-\bar a)(b_i-\bar b)}\over{\sqrt{\sum (a_i-\bar a)^2}\sqrt{\sum (b_i-\bar b)^2}}} = cos(\theta)\]
  • Remember \(a \perp b \Rightarrow a \cdot b =0\)
  • \(cos(\theta)=0 \Rightarrow \theta=90^{\circ}\)
  • The inner product forms the basis for linear regression, where \(y_i\) is a linear composite of the predictors \(x_i\) and the coefficients \(\beta\). \[y_i=x_i\beta+e_i\]
  • The inner product of the ith row of \(X\) with the vector of coefficients \(\beta\)
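A quick check, with two illustrative vectors, that the correlation equals the cosine of the angle between the centered vectors:

a <- c(3, 2, 1)
b <- c(1, 2, 4)
ac <- a - mean(a); bc <- b - mean(b)         # center each vector
sum(ac * bc) / sqrt(sum(ac^2) * sum(bc^2))   # cosine of the angle between them
cor(a, b)                                    # identical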

Inner Product Rules

  • \[\textbf{Commutative: } \textbf{a} \cdot \textbf{b}=\textbf{b} \cdot \textbf{a}\]

\[\textbf{Associative (with a scalar): } d(\textbf{a} \cdot \textbf{b})=(d\textbf{a}) \cdot \textbf{b}=\textbf{a} \cdot (d\textbf{b})\]

\[\textbf{Distributive: } \textbf{c} \cdot (\textbf{a}+\textbf{b})=\textbf{c} \cdot \textbf{a}+\textbf{c} \cdot \textbf{b}\]

\[\textbf{Zero: } \textbf{a} \cdot \textbf{0}=0\]

\[\textbf{Unit: } \textbf{1} \cdot \textbf{a}=\sum a_i\]

The Cross Product

  • [27-41, 11-37, 34-21]
  • This is useful to invert a matrix, which is central to linear regression
  • \(\widehat{\beta} = (X^T X)^{-1} X^T Y\)
  • If we are unable to invert \(X^TX\), we cannot solve for \(\widehat{\beta}\)
  • There is no linear solution.
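A minimal sketch of this matrix solution in R, using simulated data, to compare with lm():

set.seed(1)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)
X <- cbind(1, x)                   # design matrix: intercept column plus x
solve(t(X) %*% X) %*% t(X) %*% y   # identical to coef(lm(y ~ x))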

The Outer Product

\[ \begin{bmatrix} 3 \\ 2\\ 1\\ \end{bmatrix} \begin{bmatrix} 1&4&7\\ \end{bmatrix} =\]

\[ \begin{bmatrix} 3&12&21\\ 2&8&14\\ 1&4&7\\ \end{bmatrix} \]
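The same outer product computed in R:

outer(c(3, 2, 1), c(1, 4, 7))   # a 3 x 3 matrix of all pairwise products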

Outer Product Example

  • \(\text{Cov} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})^T\)
  • Two variables, three observations
  • Two vectors, \(\mathbb{R}^{3}\) \[\begin{align} \mathbf{x} &= [1, 3, 5], \quad \bar{x} = 3\\ \mathbf{y} &= [2, 4, 8], \quad \bar{y} = 4.67 \end{align}\]

\[\begin{align} \mathbf{x}_{\text{centered}} &= [-2, 0, 2]\\ \mathbf{y}_{\text{centered}} &= [-2.67, -0.67, 3.33] \end{align}\]

\[\begin{align} (x_1-\bar{x})(y_1-\bar{y})^T &= -2 \times -2.67 = 5.33\\ (x_2-\bar{x})(y_2-\bar{y})^T &= 0 \times -0.67 = 0\\ (x_3-\bar{x})(y_3-\bar{y})^T &= 2 \times 3.33 = 6.67 \end{align}\]

\[\begin{equation} \text{Cov}(x,y) = \frac{1}{3}(5.33 + 0 + 6.67) = 4.0 \end{equation}\]
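The same covariance in R; note the formula above uses a \(1/n\) denominator, while R’s cov() divides by \(n-1\):

x <- c(1, 3, 5)
y <- c(2, 4, 8)
sum((x - mean(x)) * (y - mean(y))) / 3   # 4, matching the calculation above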

Matrices

  • An \(n \times 3\) data matrix \[ \begin{bmatrix} Vote&PID&Ideology \\\hline a_{11}&a_{12}&a_{13} \\ a_{21}&a_{22}&a_{23} \\ \vdots & \vdots & \vdots\\ a_{n1}&a_{n2}&a_{n3}\\ \end{bmatrix} \]

Matrices

  • Matrices are vectors arranged in rows and columns
  • Subtraction and addition are elementwise
  • The problem is there are three unknown quantities that we need to find, \(b_0, b_1, b_2\)

\[\begin{bmatrix} y_{1}= & b_0+ b_1 x_{Ideology,1}+b_2 x_{PID,1}\\ y_{2}= & b_0+ b_1 x_{Ideology,2}+b_2 x_{PID,2}\\ y_{3}= & b_0+ b_1 x_{Ideology,3}+b_2 x_{PID,3}\\ y_{4}= & b_0+ b_1 x_{Ideology,4}+b_2 x_{PID,4}\\ \vdots\\ y_{n}= & b_0+ b_1 x_{Ideology,n}+b_2 x_{PID,n}\\ \end{bmatrix}\]

Matrices

  • Conformability
  • Matrix multiplication is not elementwise
  • Order matters, where \(AB \neq BA\)
  • The inner dimensions must match \[ \begin{bmatrix} 1&3 \\ 2&4\\ \end{bmatrix} \begin{bmatrix} 3&5 \\ 2&4\\ \end{bmatrix}= \begin{bmatrix} 1*3+3*2&1*5+3*4 \\ 2*3+4*2&2*5+4*4\\ \end{bmatrix} \]
  • The result is a \(2 \times 2\) matrix
  • Chapter 2 in the course notes provides far more detail on vectors and matrices, and includes important details about matrix inversion, projection, etc.
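The same multiplication in R, using the %*% operator:

A <- matrix(c(1, 2, 3, 4), nrow = 2)   # rows (1, 3) and (2, 4)
B <- matrix(c(3, 2, 5, 4), nrow = 2)   # rows (3, 5) and (2, 4)
A %*% B                                # the 2 x 2 result shown above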

Data Analysis

  • Vectors and matrices
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
cat("The dimensions of the data are," ,
    dim(dataActive))
The dimensions of the data are, 3600 11
head(dataActive)
  VIOLENT violent age                 educ authoritarianism         mip
1       0       5  67            Post-grad              0.0       covid
2       1       3  63 High school graduate              0.5       covid
3       1       2  62         Some college              1.0     economy
4       0       5  54               4-year              0.0       covid
5       0       4  45               2-year              0.0 environment
6       1       2  51 High school graduate              1.0     economy
  moral_individualism    sdo             ideo5 post_call        pid3
1              0.3125 0.1250          Moderate         0    Democrat
2              0.5625 0.1250           Liberal         0 Independent
3              0.0625 0.3125 Very conservative         0  Republican
4              0.0000 0.0625           Liberal         0       Other
5              0.5000 0.1250           Liberal         0    Not sure
6              0.5000 0.5000      Conservative         0  Republican

The Linear Regression Model

  • Moral Individualism (Karpowitz and Patterson 2023)
  • Social Dominance Orientation (Sidanius and Pratto 1999)
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
myModel = lm(sdo ~ moral_individualism, 
             data = dataActive)
summary(myModel)

Call:
lm(formula = sdo ~ moral_individualism, data = dataActive)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38235 -0.15389  0.00182  0.13706  0.69956 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         0.218543   0.007058   30.96   <2e-16 ***
moral_individualism 0.163803   0.011895   13.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1864 on 3598 degrees of freedom
Multiple R-squared:  0.05006,   Adjusted R-squared:  0.0498 
F-statistic: 189.6 on 1 and 3598 DF,  p-value: < 2.2e-16

Interpretation

  • A one unit increase in moral individualism is associated with a 0.163803 increase in social dominance orientation, on average, holding all else constant
library(dplyr)
myModel = lm(sdo ~ moral_individualism, 
             data = dataActive)

coefficients = coef(myModel) # coefficient vector
dataActive |>
  select(sdo, moral_individualism) -> dat # data matrix
cat("The coefficient vector is length", length(coefficients), "\n")
The coefficient vector is length 2 
cat("The data matrix is", dim(dat)[1], "X", dim(dat)[2], "\n")
The data matrix is 3600 X 2 
  • Multiply these, \(\widehat{y} = X \beta\), to retrieve a prediction for each observation
  • Use the %*% operator in R
predictions = as.matrix(cbind(1, dat$moral_individualism)) %*% coefficients
head(predictions,3)
          [,1]
[1,] 0.2697310
[2,] 0.3106817
[3,] 0.2287802

Logistic Regression

load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive |> mutate( support_protest = ifelse(violent >3, 1, 0)) |>
  select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat, 
                 family = binomial(link = "logit")) 
logitModel

Call:  glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"), 
    data = dat)

Coefficients:
(Intercept)      VIOLENT  
    -0.3419      -0.5526  

Degrees of Freedom: 3599 Total (i.e. Null);  3598 Residual
Null Deviance:      4671 
Residual Deviance: 4609     AIC: 4613

Logistic Regression

  • The process of prediction is identical, but the outcome \(pr(y|x)\) is non-linear
  • Use \(plogis()\) to convert the linear prediction to a probability
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive |> mutate( support_protest = ifelse(violent >3, 1, 0)) |>
  select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat, 
                 family = binomial(link = "logit")) 
coefficients = coefficients(logitModel) # coefficient vector
predictions = as.matrix(cbind(1, dat$VIOLENT)) %*% coefficients |> 
  plogis()
head(predictions, 10)
           [,1]
 [1,] 0.4153587
 [2,] 0.2901982
 [3,] 0.2901982
 [4,] 0.4153587
 [5,] 0.4153587
 [6,] 0.2901982
 [7,] 0.2901982
 [8,] 0.4153587
 [9,] 0.4153587
[10,] 0.4153587