[1] 0 1 0 0 0 0 1 1 1 0
[1] 0.2460938
[1] 0.3769531
2025-09-08
\[ f\left(k\right) = \binom{n}{k} \theta^k\left(1-\theta\right)^{n-k} \] - Success or failures; 1/0; binary variables
A Bernoulli variable is a binary variable, observed over a series of \(n\) trials, that takes on a value of 1 with probability \(\theta\) and 0 with probability \(1-\theta\)
The Binomial Distribution is the associated probability density function (PDF) for the number of successes in \(n\) trials
\(P(D|\theta)\). This is the likelihood of observing the data
\(P(\theta)\). This is the prior of of the parameter
\(P(D)\). This is the probability of the data
\(P(\theta | D)\) is the posterior distribution
Possible values of theta: 0, 0.01, 0.02, 0.03, 0.04, 0.05 ...
Assume we observe 902 heads, 341 tails. What's our best guess?
The maximum value of the distribution is at theta = 0.73
Maximum likelihood estimate: 0.73
A noninformative prior
Every value of \(\theta\) is equally probable; a uniform density
The uniform along the interval [0,1] means any value of \(\theta\) on this interval is as likely as any other value
grid_values <- seq(0, 1, by = 0.01)
# Define the prior, likelihood
prior <- rep(1, length(grid_values)) # All values equally probable
likelihood <- dbinom(902, 1243, grid_values) # binomial
# Calculate the posterior
posterior <- likelihood * prior
posterior <- posterior / sum(posterior) # p(D)
data <- data.frame(grid_values = grid_values, posterior = posterior)
# Calculate the 95% credible interval
cumulative_posterior <- cumsum(posterior)
lower_bound <- grid_values[which(cumulative_posterior >= 0.025)[1]] # find bounds
upper_bound <- grid_values[which(cumulative_posterior >= 0.975)[1]]
Possible values of theta, 0 to 1: 0, 0.01, 0.02, 0.03, 0.04, 0.05 ...
Assume we observe 902 heads, 341 tails. What's our best guess?
The maximum value of the distribution is at theta = 0.73
The 95% credible interval is from 0.7 to 0.75
# Sample 100 values from the posterior
sample(grid_values, 100, prob = posterior, replace = TRUE) |> mean()
[1] 0.7257
\[log(p(D|\theta))=\sum_{n=1}^N[y_n log\theta+(1-y_n)log(1-\theta)] \] \[log(p(D|\theta)) \equiv LL(\theta)\]
Multiplication can give exceedingly large (or small) numbers
The “unknown” parameter in these trials is \(\theta\), \(\widehat{\theta}\), the probability of success, \(p(y=1)\)
We could use either the Bernoulli or Binomial densities
\(\widehat {\theta}\) as some function of predictor variables
\[\widehat \theta_i=a+b_x+\epsilon\]
\[var(y|x)=pr(y=1|x)(1-pr(y=1|x))=xb(1-xb)\]
Functional form. The linear model assumes a constant effect of \(x\rightarrow y\)
This may be unlikely, particularly when dealing with a probability. Maybe we expect the effect of \(x\) on \(y\) to be non-linear, where the effect of \(x\) on \(y\) is not constant
The linear model assumes a constant effect of \(x\rightarrow y\)
\[
\frac{\partial y}{\partial x} = b
\]
\[ x_{obs} \rightarrow y_{latent} \rightarrow y_{obs} \]
If we knew \(y_{latent}\), all would be good and we could estimate OLS
Instead, we observe a realization of \(y_{latent}\) in \(y_{obs}\)
We’d like to estimate \(y_{latent}=x_iB+e_i\)
We can’t, so the errors are unknown
What residuals can we minimize?
Instead, let’s make an assumption about the error process, either the errors
\(e_i \sim normal(0, 1)\) is the probit regression model
\(e_i \sim logistic(0, \pi^2/3)\) is the logit regression model
And, Sigmoid \(=\) logistic
\[probit={{1}\over{\sqrt{2 \pi}}}exp({-{t^2}\over {2}})\]
\[probit=\int_{-\infty}^{e}{{1}\over{\sqrt{2 \pi}}}exp({-{t^2}\over {2}})dt\]
\[logit(e)={{exp(e)}\over{1+exp(e)^2}}\]
\[logit(e)={{exp(e)}\over{1+exp(e)}}\]
The choice – logit or probit – is somewhat arbitrary
Returning to the example
\(\bullet\) The mean of the error for the latent variable \(e\) is 0
\(\bullet\) \(\tau\) is 0
\(\bullet\) The variance of the error term is constant
If
\[p(y_{obs}=1|x)=p(y_{latent}>0|x)\]
then,
\[p(x\beta+\epsilon|x>0|x)=p(\epsilon<-x\beta|x)\]
Because the distribution for \(e\) must be symmetric
\[p(x\beta+\epsilon>0|x>0|x)=p(\epsilon>-x\beta|x)=p(\epsilon<x\beta|x)\]
\(y_{latent}\) is unobserved, we need to make an assumption about the scale of the error term
This is the scale indeterminacy problem. What we choose, will affect the parameter estimates
Logit coefficients will differ from probit coefficients, with the same data
……..by a factor of 1.81
\[1.81 \times \beta_{probit}=\beta_{logit}\]
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive |>
mutate(
support_protest = ifelse(violent >3, 1, 0)
) |>
select(support_protest, VIOLENT)
logitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "logit"))
probitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "probit"))
Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"),
data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.34186 0.04804 -7.115 1.12e-12 ***
VIOLENT -0.55257 0.07058 -7.829 4.92e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4671.4 on 3599 degrees of freedom
Residual deviance: 4609.4 on 3598 degrees of freedom
AIC: 4613.4
Number of Fisher Scoring iterations: 4
Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "probit"),
data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.21378 0.02992 -7.145 9.01e-13 ***
VIOLENT -0.33902 0.04316 -7.855 3.99e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4671.4 on 3599 degrees of freedom
Residual deviance: 4609.4 on 3598 degrees of freedom
AIC: 4613.4
Number of Fisher Scoring iterations: 4
pred.data = expand.grid(VIOLENT = c(0, 1)) |> as.data.frame()
predictions_logit = predict(logitModel, pred.data, type = "response") |>
data.frame() |>
mutate(
violent = c(0, 1)
)
predictions_probit = predict(probitModel, pred.data, type = "response") |>
data.frame() |>
mutate(
violent = c(0, 1)
)
cat("The treatment effect for the logit model is:\n",
predictions_logit[1,1] - predictions_logit[2,1], "\n",
"The treatment effect for the probit model is:\n",
predictions_probit[1,1] - predictions_probit[2,1]
)
The treatment effect for the logit model is:
0.1251605
The treatment effect for the probit model is:
0.1251605
\[ log(p(D|\theta))=\sum_{n=1}^Nlogp(y_n|\theta)=\sum_{n=1}^N[y_n log\theta+(1-y_n)log(1-\theta)] \]
If \(\theta\) is bound between 0 and 1. But our regression model requires a continuous depenendent variable,
First, convert the probability to an odds
\[{{\theta_i}\over{1-\theta_i}}\]
Then the lower bounds
The transformation is an “odds” that may approach \(\infty\). But, we still have the zero restriction
Take the natural logarithm of the odds, resulting in the log odds, and the logit transformation
\[\eta_i=log[{{\theta_i)}\over{1-\theta_i}}]\]
\[\eta \in [-\infty, \infty]\]
\[log{{x\beta}\over{1-(x\beta)}}\]
The inverse of this function,is the probability scale.
\[p(y=1|x)={{exp(x\beta)}\over{1+exp(x\beta)}}\]
\[p(y=1|x\beta)=\theta_i={1\over{1+exp(-(x\beta))}}\]
\[p(D|\theta)=\prod_{n=1}^N[\frac{1}{1+exp(-(x\beta))}]^{y_n}[1-\frac{1}{1+exp(-(x\beta)}]^{1-{y_n}}\]
\[\textbf{Associative}=d(\textbf{a}+\textbf{b})=(d\textbf{a}) x \textbf{b}=\textbf{a} x d(\textbf{b})\]
\[\textbf{Distributive}=\textbf{c}(\textbf{a}+\textbf{b})=\textbf{c}\textbf{a}+\textbf{c}\textbf{b}\]
\[\textbf{Zero}=\textbf{a}0=0\]
\[\textbf{Unit}=1\textbf{a}=\sum a_i\]
\[ \begin{bmatrix} 3 \\ 2\\ 1\\ \end{bmatrix} \begin{bmatrix} 1&4&7\\ \end{bmatrix} =\]
\[ \begin{bmatrix} 3&12&21\\ 2&8&14\\ 1&4&2\\ \end{bmatrix} \]
\[\begin{align} \mathbf{x}_{\text{centered}} &= [-2, 0, 2]\\ \mathbf{y}_{\text{centered}} &= [-2.67, -0.67, 3.33] \end{align}\]
\[\begin{align} (x_1-\bar{x})(y_1-\bar{y})^T &= -2 \times -2.67 = 5.33\\ (x_2-\bar{x})(y_2-\bar{y})^T &= 0 \times -0.67 = 0\\ (x_3-\bar{x})(y_3-\bar{y})^T &= 2 \times 3.33 = 6.67 \end{align}\]
\[\begin{equation} \text{Cov}(x,y) = \frac{1}{3}(5.33 + 0 + 6.67) = 4.0 \end{equation}\]
\[\begin{bmatrix} y_{1}= & b_0+ b_1 x_{Ideology,1}+b_2 x_{PID,1}\\ y_{2}= & b_0+ b_1 x_{Ideology,2}+b_2 x_{PID,2}\\ y_{3}= & b_0+ b_1 x_{Ideology,3}+b_2 x_{PID,3}\\ y_{4}= & b_0+ b_1 x_{Ideology,4}+b_2 x_{PID,4}\\ \vdots\\ y_{n}= & b_0+ b_1 x_{Ideology,n}+b_2 x_{PID,n}\\ \end{bmatrix}\]
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
cat("The dimensions of the data are," ,
dim(dataActive))
The dimensions of the data are, 3600 11
VIOLENT violent age educ authoritarianism mip
1 0 5 67 Post-grad 0.0 covid
2 1 3 63 High school graduate 0.5 covid
3 1 2 62 Some college 1.0 economy
4 0 5 54 4-year 0.0 covid
5 0 4 45 2-year 0.0 environment
6 1 2 51 High school graduate 1.0 economy
moral_individualism sdo ideo5 post_call pid3
1 0.3125 0.1250 Moderate 0 Democrat
2 0.5625 0.1250 Liberal 0 Independent
3 0.0625 0.3125 Very conservative 0 Republican
4 0.0000 0.0625 Liberal 0 Other
5 0.5000 0.1250 Liberal 0 Not sure
6 0.5000 0.5000 Conservative 0 Republican
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
myModel = lm(sdo ~ moral_individualism,
data = dataActive)
summary(myModel)
Call:
lm(formula = sdo ~ moral_individualism, data = dataActive)
Residuals:
Min 1Q Median 3Q Max
-0.38235 -0.15389 0.00182 0.13706 0.69956
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.218543 0.007058 30.96 <2e-16 ***
moral_individualism 0.163803 0.011895 13.77 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1864 on 3598 degrees of freedom
Multiple R-squared: 0.05006, Adjusted R-squared: 0.0498
F-statistic: 189.6 on 1 and 3598 DF, p-value: < 2.2e-16
library(dplyr)
myModel = lm(sdo ~ moral_individualism,
data = dataActive)
coefficients = coef(myModel) # coefficient vector
dataActive |>
select(sdo, moral_individualism) -> dat # data matrix
cat("The coefficient vector is length", length(coefficients), "\n")
The coefficient vector is length 2
The data matrix is 3600 X 2
%*%
operator in Rload("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive |> mutate( support_protest = ifelse(violent >3, 1, 0)) |>
select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat,
family = binomial(link = "logit"))
logitModel
Call: glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"),
data = dat)
Coefficients:
(Intercept) VIOLENT
-0.3419 -0.5526
Degrees of Freedom: 3599 Total (i.e. Null); 3598 Residual
Null Deviance: 4671
Residual Deviance: 4609 AIC: 4613
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive |> mutate( support_protest = ifelse(violent >3, 1, 0)) |>
select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat,
family = binomial(link = "logit"))
coefficients = coefficients(logitModel) # coefficient vector
predictions = as.matrix(cbind(1, dat$VIOLENT)) %*% coefficients |>
plogis()
head(predictions, 10)
[,1]
[1,] 0.4153587
[2,] 0.2901982
[3,] 0.2901982
[4,] 0.4153587
[5,] 0.4153587
[6,] 0.2901982
[7,] 0.2901982
[8,] 0.4153587
[9,] 0.4153587
[10,] 0.4153587