Probit and Logit Regression
Lab Overview
- Linear Probability Models (LPM) - interpretation and limitations
- Logit and Probit regression
- Marginal effects of logit and probit models
The Apples data set contains household-level
data from a web survey that was conducted in order to elicit the demand
for a (fictional) ‘ecologically friendly’ apple. Each family
was (randomly) presented with a set of prices for regular apples and
eco-labeled apples. They were asked how many pounds of each kind of
apple they would buy. Note that the key explanatory variables in this
case (the price of eco and regular apples) were set experimentally,
without regard to other factors that might affect the dependent
variable. Each family (actually, family head) was presented with a
description of eco-labeled apples, along with prices of regular apples
(variable regprc) and prices of the hypothetical
eco-labeled apples (variable ecoprc). Because the price
pairs were randomly assigned to each family, they are unrelated to other
observed factors (such as family income) and unobserved factors (such as
desire for a clean environment).
GOAL: To estimate the impact of the price of eco apples and regular apples on the probability of purchase of eco apples.
The variable description is presented below:
| variable | label |
|---|---|
| id | respondent identifier |
| educ | years schooling |
| date | date: month/day/year |
| state | home state |
| regprc | price of regular apples in $ |
| ecoprc | price of ecolabeled apples in $ |
| inseason | =1 if interviewed in Nov. |
| hhsize | household size |
| male | =1 if male |
| faminc | family income, thousands $ |
| age | in years |
| reglbs | quantity regular apples, lbs |
| ecolbs | quantity ecolabeled apples, lbs |
| numlt5 | # in household younger than 5 |
| num5_17 | # in household 5 to 17 |
| num18_64 | # in household 18 to 64 |
| numgt64 | # in household older than 64 |
1. Getting Started
Download Lab 6’s materials from Moodle:
- Save the provided data set in your data folder in the BRM-Labs project folder.
- Save the provided R script in your code folder in the BRM-Labs project folder.
Open the provided Lab 6 R script.
Package installation
# Install new packages
#install.packages("lmtest")
#install.packages("margins")
- Set up your R environment.
# Clean work environment
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(margins)
library(stargazer)
library(lmtest)
library(ggthemes)
library(tidyverse)
- Load the data.
# Load data, convert to tibble format, discard original
load("data/apple.RData")
tb.apples <- as_tibble(data)
rm(data)
2. Getting to Know the Data
As the usual first step in your analysis, you should try to get to
know the key features of the data set. In this case, we can read the
variable description (available in the desc object) and
look at the summary statistics of the numerical variables using
stargazer().
# Summary statistics of numeric variables
stargazer( data.frame(tb.apples %>% select(-id)), type = "text"
, no.space = TRUE, header = FALSE
, title = "Summary Statistics")
Summary Statistics
==========================================
Statistic N Mean St. Dev. Min Max
------------------------------------------
educ 660 14.380 2.274 8 20
regprc 660 0.883 0.244 0.590 1.190
ecoprc 660 1.082 0.296 0.590 1.590
inseason 660 0.336 0.473 0 1
hhsize 660 2.941 1.526 1 9
male 660 0.262 0.440 0 1
faminc 660 53.410 35.740 5 250
age 660 44.520 15.210 19 88
reglbs 660 1.282 2.910 0.000 42.000
ecolbs 660 1.474 2.526 0.000 42.000
numlt5 660 0.286 0.643 0 4
num5_17 660 0.621 0.994 0 6
num18_64 660 1.805 1.005 0 7
numgt64 660 0.229 0.549 0 3
------------------------------------------
From the summary statistics table we can see that the survey has 660 respondents, and all observations of the numerical variables appear to be complete (no missing data).
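We can also confirm the absence of missing values directly; a quick sketch using only the already-loaded tidyverse:
# Count missing values in every column (all should be zero)
tb.apples %>%
  summarise(across(everything(), ~ sum(is.na(.))))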
In this exercise, we are interested in whether or not a family would buy the eco apples (a yes/no decision) and not in the quantity that they would buy. As we don’t have a variable indicating whether or not the family was willing to buy the eco apples (or regular apples), we will start by creating that variable.
# Create an indicator variable "buy_eco"
tb.apples <- tb.apples %>%
mutate(buy_eco = if_else(ecolbs > 0, 1, 0))
# Alternatively
tb.apples <- tb.apples %>%
mutate(buy_eco = as.double(ecolbs > 0))
# Check variable creation
head(tb.apples %>%
select(ecolbs, buy_eco))
Once our indicator for “buy/no buy” is created, we can ask ourselves a few questions about the data and the relationships among the variables and start investigating. For instance:
What is the fraction of families that would buy eco apples?
What household (or head of household) characteristics are associated with a greater likelihood of purchase?
What is the fraction of families that would buy eco apples?
# Percentage of families that would buy the eco apples
tb.apples %>%
summarise(mean_buy_eco = mean(buy_eco))
62.4% of the surveyed families would buy the eco apples.
What household (or head of household) characteristics are associated with a greater likelihood of purchase?
Below is one example of a table and a plot comparing men’s and women’s decisions to buy eco apples. A similar analysis can be conducted for the other variables in the data set.
# Create factor variables
tb.apples <- tb.apples %>% mutate(
male_fact = factor(male , levels = c(0,1), labels = c("Female", "Male")),
buy_eco_fact = factor(buy_eco, levels = c(0,1), labels = c("No", "Yes")))
# Summarize purchase intentions by group
tb.apples %>%
count(male_fact, buy_eco_fact) %>%
group_by(male_fact) %>%
mutate(freq = round(n / sum(n),2)) %>%
pivot_wider(id_cols = 'male_fact'
, names_from ='buy_eco_fact'
, values_from = 'freq')
From the table we can see that a higher share of women indicated they would buy the eco apples (64%) than men (58%). However, we do not know yet whether the difference is statistically significant; a quick check is sketched below.
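One quick way to check, not covered in the lab script, is a two-sample test of proportions; a minimal sketch using base R’s prop.test():
# Sketch: two-sample proportion test for the gender difference
tab <- table(tb.apples$male_fact, tb.apples$buy_eco_fact)
prop.test(tab[, c("Yes", "No")]) # columns reordered so "Yes" counts as the success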
The information from the table can also be represented in a colored bar graph as shown below.
# Plot purchase intentions by group
ggplot(tb.apples, aes(x = male_fact, fill = buy_eco_fact)) +
geom_bar(position = "fill") +
labs(x = 'Gender', y = 'Buy eco percentage', fill = 'Buy Eco') +
theme_bw() + scale_fill_grey()
You can use your knowledge of tables and plots to further explore interesting relationships in this data set.
3. Linear Probability Model (LPM)
The linear probability model takes the following equation:
\[P(y=1|\mathbf x)=E(y|\mathbf x)=\beta_0 + \beta_1 x_1 + ... + \beta_k x_k\] The above equation tells us that the probability of success, \(P(y=1|\mathbf x)\), is a linear function of the explanatory variables \(x_j\). In the LPM, \(\beta_j\) measures the change in the probability of success when \(x_j\) changes, holding other factors fixed.
3.1 Interpreting LPM coefficients
To estimate a linear probability model in R, we use the
usual lm function:
Note:
In the LPM specification, the dependent variable is buy_eco and not buy_eco_fact because regression models require a numerical dependent variable. The variable buy_eco is coded as 0 or 1, which allows the model to interpret it as a probability; that is, the expected value of buy_eco represents the probability of purchasing the eco-labeled apples and can take any value between 0 and 1.
Similarly, for explanatory variables, we use male (coded as 0 or 1) instead of male_fact, because factor variables in R are handled differently, especially if they are treated as ordered factors, which can lead to unintended interpretations in linear models. Using the numeric version (male) ensures that the coefficient can be interpreted as the difference in probability associated with being male, all else constant.
# Linear probability model (LPM)
model <- buy_eco ~ ecoprc + regprc + faminc + hhsize + male
out.ols <- lm(model, data = tb.apples)
stargazer(out.ols, type = 'text'
, no.space = TRUE, header = FALSE)
===============================================
Dependent variable:
---------------------------
buy_eco
-----------------------------------------------
ecoprc -0.862***
(0.110)
regprc 0.760***
(0.132)
faminc 0.001**
(0.001)
hhsize 0.018
(0.012)
male -0.092**
(0.042)
Constant 0.799***
(0.085)
-----------------------------------------------
Observations 660
R2 0.104
Adjusted R2 0.097
Residual Std. Error 0.461 (df = 654)
F Statistic 15.190*** (df = 5; 654)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation:
- If ecoprc increases by, say, 10 cents (0.10), then the probability of buying eco-labeled apples falls by about 0.086. The coefficient is statistically significant at the 1% level.
- If regprc increases by 10 cents, the probability of buying eco-labeled apples increases by about 0.076. The coefficient is statistically significant at the 1% level.
Note that we are considering a 10-cent increase and not a one-dollar increase, as this makes more sense in the context of apple prices (as the summary statistics above show, the mean price is about one dollar). This interpretation also assumes that the predicted probabilities are not close to the boundaries of zero and one. These 10-cent effects can be read straight off the estimated coefficients, as sketched below.
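A one-line computation using the fitted LPM from above:
# Implied change in purchase probability for a 10-cent (0.10) price change
coef(out.ols)[c("ecoprc", "regprc")] * 0.10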
3.2 Limitations of LPM
3.2.1 Probabilities outside the (0,1) range
One limitation of the linear probability model is that it can produce probabilities smaller than 0 and probabilities greater than 1, which makes no sense. We can check if there are predicted probabilities outside of the (0,1) interval by inspecting the minimum and maximum of our model’s predicted values:
# Check minimum and maximum of predicted probabilities
min(predict(out.ols))
[1] 0.275
max(predict(out.ols))
[1] 1.068
There aren’t any predictions below 0, but there are predictions greater than 1.
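We can count the out-of-range fitted values directly; a quick check using only base R and the fitted model from above:
# Count fitted values outside the [0, 1] interval
sum(predict(out.ols) < 0) # predictions below 0
sum(predict(out.ols) > 1) # predictions above 1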
Alternatively, we can use a plot to check if the predicted probabilities are all inside the (0,1) interval:
# Create column with predicted probabilities
tb.apples <- tb.apples %>%
mutate(lpm_pred = predict(out.ols))
# Plot distribution of predicted probabilities
ggplot(data = tb.apples, aes(x = lpm_pred)) + geom_histogram() +
geom_vline(xintercept = 0, col = 'Red', linetype = 2) +
geom_vline(xintercept = 1, col = 'Red', linetype = 2) +
scale_x_continuous(breaks = seq(-0.1,1.1,0.1)) +
labs(x = 'Probability of buying Eco Apple', y = 'Number of People') +
theme_bw()
3.2.2 Heteroskedasticity
Another limitation of the linear probability model is that its error term is heteroskedastic by construction: since \(y\) is binary, \(Var(y|\mathbf x) = p(\mathbf x)[1-p(\mathbf x)]\), which varies with the explanatory variables. We can use the Breusch-Pagan test to confirm the presence of heteroskedasticity:
# Breusch-Pagan Test
bptest(out.ols)
studentized Breusch-Pagan test
data: out.ols
BP = 42, df = 5, p-value = 6e-08
The Breusch-Pagan test p-value is very low, so we can reject the null hypothesis that the error term is homoskedastic.
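Given this result, one common remedy for the LPM is to report heteroskedasticity-robust standard errors. Below is a minimal sketch using coeftest() from lmtest (already loaded) together with vcovHC() from the sandwich package; sandwich is not part of this lab’s setup, so this assumes it is installed:
# Sketch: LPM with heteroskedasticity-robust (HC1) standard errors
# Assumes the sandwich package is installed (not part of the lab setup)
library(sandwich)
coeftest(out.ols, vcov. = vcovHC(out.ols, type = "HC1"))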
3.2.3 Linear relationship between the variables
Finally, OLS assumes a linear relationship between the variables. This means that each unit increase in x causes the same change in y. However, it is possible that, for instance, changing the price of apples by one cent from $0.01 to $0.02 does not have the same effect on consumers’ propensity to buy as changing the price by one cent from $0.99 to $1.00.
4. Probit and Logit
Probit and Logit are two alternatives to OLS that address some of the above-mentioned limitations. These models belong to a class of binary response models of the form:
\[P(y=1|\mathbf x)=G(\beta_0 + \beta_1 x_1 + ... + \beta_k x_k) = G(\beta_0 + \mathbf x \boldsymbol{\beta})\] where \(0<G(z)<1\).
In the logit model, \(G()\) is the logistic function. In the probit model, \(G()\) is the standard Normal cumulative distribution function.
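In R, these two choices of \(G()\) correspond to plogis() (the logistic CDF) and pnorm() (the standard Normal CDF). A short sketch, using only packages already loaded in this lab, comparing the two curves:
# Sketch: compare the logistic and standard Normal CDFs
tibble(z = seq(-4, 4, by = 0.1)) %>%
  mutate(logit = plogis(z), probit = pnorm(z)) %>%
  pivot_longer(-z, names_to = "model", values_to = "G") %>%
  ggplot(aes(x = z, y = G, linetype = model)) +
  geom_line() +
  labs(x = "z", y = "G(z)", linetype = "Model") +
  theme_bw()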
Because of the more complicated (non-linear) nature of the probit and logit models, the estimated betas cannot be directly interpreted or compared to the OLS coefficients; to find the partial effect of roughly continuous variables on the response probability we must rely on calculus (take the derivative).
4.1 Estimation
To estimate probit and logit models in R we must use the
glm() function. This function is used to fit generalized
linear models. When using glm() we must specify the link
function we want to use in the model using the options
family = binomial(link = 'probit') or
family = binomial(link = 'logit').
# Logit, Probit, and LPM
out.ols <- lm(model , data = tb.apples)
out.probit <- glm(model, data = tb.apples
, family = binomial(link = 'probit'))
out.logit <- glm(model, data = tb.apples
, family = binomial(link = 'logit'))
# Print regression table
stargazer(out.ols, out.probit, out.logit
, type = 'text', no.space = TRUE
, header = FALSE)
===============================================================
Dependent variable:
-------------------------------------------
buy_eco
OLS probit logistic
(1) (2) (3)
---------------------------------------------------------------
ecoprc -0.862*** -2.428*** -3.956***
(0.110) (0.322) (0.540)
regprc 0.760*** 2.138*** 3.471***
(0.132) (0.382) (0.634)
faminc 0.001** 0.003** 0.005**
(0.001) (0.002) (0.003)
hhsize 0.018 0.050 0.081
(0.012) (0.035) (0.058)
male -0.092** -0.265** -0.444**
(0.042) (0.119) (0.196)
Constant 0.799*** 0.843*** 1.371***
(0.085) (0.247) (0.411)
---------------------------------------------------------------
Observations 660 660 660
R2 0.104
Adjusted R2 0.097
Log Likelihood -401.100 -401.100
Akaike Inf. Crit. 814.100 814.200
Residual Std. Error 0.461 (df = 654)
F Statistic 15.190*** (df = 5; 654)
===============================================================
Note: *p<0.1; **p<0.05; ***p<0.01
While in OLS the partial effects are simply the betas, in probit and logit the partial effect of the response probability with respect to variable \(x_j\) is:
\[\frac{\partial P(y=1|\mathbf x)}{\partial x_j} = g(\beta_0 + \mathbf x \boldsymbol{\beta})\beta_j, \text{ where } g(z)\equiv \frac{dG}{dz}(z).\]
Note that, compared with the linear probability model, partial
effects in probit and logit are harder to summarize because the scale
factors depend on z (that is, on all of the explanatory variables). One
possibility is to plug in interesting values for the \(x_j\), such as means, medians, minimums,
maximums, and lower and upper quartiles. Luckily, R makes
it easy to do this. While the coefficients from the logit and probit
models are not directly interpretable in terms of marginal effects, the
indicators of statistical significance (e.g., p-values or stars in the
stargazer() output) can still be interpreted as usual.
4.2 PEA
One method, commonly used in econometric packages that routinely estimate probit and logit models, is to replace each explanatory variable with its sample average. This is called the partial effect at the average (PEA): we compute the partial effect at the average of all explanatory variables.
\[g(\hat{\beta_0} + \mathbf{\bar{x}} \hat{\beta}) = g(\hat{\beta_0} + \hat{\beta_1}\bar{x}_1 + \hat{\beta_2}\bar{x}_2 + ... + \hat{\beta_k}\bar{x}_k)\]
This is equation (17.14) in Introductory Econometrics by J. Wooldridge, where \(g(\cdot)\) is the standard Normal density in the probit case, and \(g(z)=\exp(z)/[1+\exp(z)]^2\) in the logit case. If we multiply that by \(\hat\beta_j\), we obtain the partial effect of \(x_j\) for an “average” person in the sample: the partial effect at the average.
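As a sanity check before turning to the margins package, this formula can be computed by hand for the probit model, where \(g(\cdot)\) is dnorm(). A sketch for ecoprc, using the model fitted above:
# Sketch: hand-computed probit PEA for ecoprc
b <- coef(out.probit)
xbar <- c(1, colMeans(tb.apples %>% select(ecoprc, regprc, faminc, hhsize, male)))
scale_factor <- dnorm(sum(b * xbar)) # g() evaluated at the sample averages
unname(scale_factor * b["ecoprc"]) # PEA for ecoprc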
To estimate the PEA (partial effects at the average) in R using the
margins package, we create a single-row data frame where
all numeric explanatory variables are set to their sample means, and
then pass it to the data argument of
margins():
# Create a single-row "average" data frame
avg_obs <- tb.apples %>%
summarise(across(where(is.numeric), mean))
# PEA Probit
pea_probit <- margins(out.probit, data = avg_obs)
summary(pea_probit)
# PEA Logit
pea_logit <- margins(out.logit, data = avg_obs)
summary(pea_logit)
Interpretation:
Interpretation is now similar to the LPM interpretation, but we must keep in mind that these marginal effects are evaluated “at the average” of all explanatory variables. For instance:
- For a respondent who is “average” in all explanatory variables, all else constant, an increase of 10 cents in ecoprc is associated with a decrease of 0.091 in the probability of purchasing the eco-labeled apples. This effect is statistically significant at the 1% level. In this case, this interpretation applies to both the logit and probit models.
There are at least two potential problems with using PEAs to summarize the partial effects of the explanatory variables:
- If some of the explanatory variables are discrete, their averages may represent no one in the sample. For example, the mean household size is not an integer:
# Mean household size
tb.apples %>%
summarise(mean_hhsize = mean(hhsize))
- If a continuous explanatory variable appears as a nonlinear function, what should we do? Do we average the variable or the function? (Most packages default to the latter.) Say we had the square of family income here: would we use the average of the squared values or the square of the average value? The quick computation below illustrates that the two differ.
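A one-step illustration with faminc:
# Square of the average vs. average of the square (they differ)
tb.apples %>%
  summarise(sq_of_avg = mean(faminc)^2,
            avg_of_sq = mean(faminc^2))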
4.3 APE
A different approach results from averaging the individual partial effects across the sample, leading to what is called the average partial effect (APE) or, sometimes, the average marginal effect (AME).
The expression for the APE is:
\[\frac{1}{n}\sum_{i=1}^{n} g(\hat{\beta_0} + \mathbf{x}_i\hat{\beta})\]
This is equation (17.15) in Introductory Econometrics. If we multiply it by \(\hat\beta_j\), we get the APE for \(x_j\).
Rather than computing the scale factor at the mean, we compute the scale factor at every single observation in our sample, and then we take the mean of those values.
Therefore, the APE is often more representative than the PEA. However, for very large samples or very complicated models, calculating the APE (and especially its standard errors) can take a long time even with a fast computer. This is an argument in favor of the PEA.
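For intuition, the probit APE for ecoprc can also be computed by hand (a sketch; recall that predict() on a glm returns the linear index \(\hat{\beta_0} + \mathbf{x}_i\hat{\beta}\) by default):
# Sketch: hand-computed probit APE for ecoprc
scale_i <- dnorm(predict(out.probit)) # g(x_i'beta) for every observation
unname(mean(scale_i) * coef(out.probit)["ecoprc"]) # average scale factor times the coefficient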
To estimate the APE — that is, the average of the marginal effects
computed for each observation in the sample — we use the
margins() function. This is the default behavior of
margins, and it works for both probit and logit models:
# APE Probit
ape_probit <- margins(out.probit)
summary(ape_probit)
# APE Logit
ape_logit <- margins(out.logit)
summary(ape_logit)
Interpretation:
Interpretation is now similar to the LPM interpretation.
- On average, all else constant, an increase of 10 cents in ecoprc is associated with a decrease of about 0.083 to 0.084 in the probability of purchasing the eco-labeled apples (the probit and logit estimates are nearly identical). This effect is statistically significant at the 1% level.
4.4 Discrete Variables
For discrete variables (e.g., hhsize),
using a partial derivative is not meaningful. Instead, we compute the
discrete change in predicted probability when the
variable changes from one level to another — for example, from 1 to 2 —
while holding all other variables fixed.
This is calculated using the expression:
\[G(\beta_0 + \beta_1 \cdot 2 + \dots + \beta_k x_k) - G(\beta_0 + \beta_1 \cdot 1 + \dots + \beta_k x_k)\]
For the Partial Effect at the Average (PEA), we evaluate this difference at the mean values of all other explanatory variables.
For the Average Partial Effect (APE), we calculate this difference for each observation in the dataset (setting x = 2 and x = 1), and then take the average of those differences.
For binary (dummy) variables, the
margins() function performs this calculation
automatically.
For non-binary discrete variables, like
hhsize, this comparison must be done
manually.
Partial Effect at the Average (PEA)
To estimate the PEA for a discrete variable such as
hhsize, we fix all other explanatory variables at their
sample means and compute the predicted probabilities for two values of
hhsize — for example, hhsize = 2 and
hhsize = 3. This is done using the predict()
function with the type = "response" option, which returns
the predicted probability (rather than the linear prediction). The
difference between these two predicted probabilities reflects the
discrete change in the probability of the outcome when
hhsize increases from 2 to 3, holding all other variables
at their average values. This provides an interpretable estimate of the
marginal effect of changing hhsize in an “average”
household.
Note: The default type = "link" in
predict() returns the linear prediction \(\hat{\beta}_0 + \mathbf{x}_i \hat{\beta}\),
which is not directly interpretable in terms of probabilities.
# Get means of explanatory variables
means <- tb.apples %>%
summarise(across(where(is.numeric), mean))
# Create copies with hhsize = 2 and 3
family_2 <- means %>% mutate(hhsize = 2)
family_3 <- means %>% mutate(hhsize = 3)
# Predicted probabilities
p2 <- predict(out.probit, newdata = family_2, type = "response")
p3 <- predict(out.probit, newdata = family_3, type = "response")
# PEA: discrete effect of increasing hhsize from 2 to 3 at average values
pea_effect <- p3 - p2
pea_effect
1
0.01895
Average Partial Effect (APE)
To estimate the APE, we apply the same idea but across all observations in the sample. For each observation, we compute the predicted probability first with hhsize = 2, then with hhsize = 3, and average the difference.
# Create two datasets with hhsize forced to 2 and 3
tmp_data_2 <- tb.apples %>% mutate(hhsize = 2)
tmp_data_3 <- tb.apples %>% mutate(hhsize = 3)
# Predicted probabilities
p2 <- predict(out.probit, newdata = tmp_data_2, type = "response")
p3 <- predict(out.probit, newdata = tmp_data_3, type = "response")
# APE: average effect of increasing hhsize from 2 to 3 across sample
ape_effect <- mean(p3 - p2)
ape_effect
[1] 0.01752
4.5 Predicted Values
Using the type = "response" option of predict(), we get the predicted probability for each family. The default type of predict() gives us \(\hat{\beta_0} + \mathbf{x}_i\hat{\beta}\) (the linear prediction) for each family, which is not directly interpretable as a probability.
# Predicted probabilities Probit
tb.apples <- tb.apples %>%
mutate(prob_probit = predict(out.probit, type = "response"))
# Mean of predicted probit probabilities
tb.apples %>%
summarize(mean_probs_probit = mean(prob_probit))
# Actual sample proportion
tb.apples %>% summarize(sample_proportion = sum(buy_eco)/n())
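As a final, optional check (not part of the lab’s original output), we can compute the percent correctly predicted, classifying a family as a buyer when its predicted probability exceeds 0.5:
# Sketch: percent correctly predicted with a 0.5 cutoff
tb.apples %>%
  mutate(pred_buy = as.double(prob_probit > 0.5)) %>%
  summarise(pct_correct = mean(pred_buy == buy_eco))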