FSAR | Lab #6 | Fall 2025

Probit and Logit Regression

Lab Overview

  • Linear Probability Models (LPM) - interpretation and limitations
  • Logit and Probit regression
  • Marginal effects of logit and probit models


The Apples data set contains household-level data from a web survey conducted to elicit the demand for a (fictional) ‘ecologically friendly’ apple. Each family (actually, the family head) was presented with a description of eco-labeled apples together with a randomly assigned pair of prices: the price of regular apples (variable regprc) and the price of the hypothetical eco-labeled apples (variable ecoprc). Respondents were then asked how many pounds of each kind of apple they would buy. Because the price pairs were randomly assigned to each family, the key explanatory variables are unrelated to other observed factors (such as family income) and unobserved factors (such as desire for a clean environment) that might affect the dependent variable.

GOAL: To estimate the impact of the price of eco apples and regular apples on the probability of purchase of eco apples.

The variable description is presented below:

variable    label
id          respondent identifier
educ        years of schooling
date        date: month/day/year
state       home state
regprc      price of regular apples, in $
ecoprc      price of eco-labeled apples, in $
inseason    =1 if interviewed in Nov.
hhsize      household size
male        =1 if male
faminc      family income, thousands of $
age         age in years
reglbs      quantity of regular apples, lbs
ecolbs      quantity of eco-labeled apples, lbs
numlt5      # in household younger than 5
num5_17     # in household 5 to 17
num18_64    # in household 18 to 64
numgt64     # in household older than 64




1. Getting Started

  1. Download Lab 6’s materials from Moodle:

    • Save the provided data set in the data folder of your BRM-Labs project folder.
    • Save the provided R script in the code folder of your BRM-Labs project folder.
  2. Open the provided Lab 6 R script.

  3. Package installation

# Install new packages 
#install.packages("lmtest")
#install.packages("margins")
  4. Set up your R environment.
# Clean work environment 
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(margins)
library(stargazer)
library(lmtest)
library(ggthemes)
library(tidyverse)
  5. Load the data.
# Load data, convert to tibble format, discard original
load("data/apple.RData")
tb.apples <- as_tibble(data)
rm(data)




2. Getting to Know the Data

As the usual first step in your analysis, you should try to get to know the key features of the data set. In this case, we can read the variable description (available in the desc object) and look at the summary statistics of the numerical variables using stargazer().

# Summary statistics of numeric variables
stargazer(  data.frame(tb.apples %>% select(-id)), type = "text"
          , no.space = TRUE, header = FALSE
          , title = "Summary Statistics")

Summary Statistics
==========================================
Statistic  N   Mean  St. Dev.  Min   Max  
------------------------------------------
educ      660 14.380  2.274     8     20  
regprc    660 0.883   0.244   0.590 1.190 
ecoprc    660 1.082   0.296   0.590 1.590 
inseason  660 0.336   0.473     0     1   
hhsize    660 2.941   1.526     1     9   
male      660 0.262   0.440     0     1   
faminc    660 53.410  35.740    5    250  
age       660 44.520  15.210   19     88  
reglbs    660 1.282   2.910   0.000 42.000
ecolbs    660 1.474   2.526   0.000 42.000
numlt5    660 0.286   0.643     0     4   
num5_17   660 0.621   0.994     0     6   
num18_64  660 1.805   1.005     0     7   
numgt64   660 0.229   0.549     0     3   
------------------------------------------

From the summary statistics table we can see that the survey has 660 respondents and that the numerical variables appear to be complete (no missing data).
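If you prefer to verify this directly rather than read it off the N column, a quick column-wise count of missing values can be computed, for example:

# Count missing values in every column
tb.apples %>% 
  summarise(across(everything(), ~ sum(is.na(.))))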

In this exercise, we are interested in whether or not a family would buy the eco apples (a yes/no decision), not in the quantity that they would buy. As we don’t have a variable indicating whether or not the family was willing to buy the eco apples (or regular apples), we will start by creating that variable.

# Create an indicator variable "buy_eco"
tb.apples <- tb.apples %>% 
  mutate(buy_eco = if_else(ecolbs > 0, 1, 0))

# Alternatively
tb.apples <- tb.apples %>% 
  mutate(buy_eco = as.double(ecolbs > 0))

# Check variable creation
head(tb.apples %>% 
       select(ecolbs, buy_eco))

Once our indicator for “buy/no buy” is created, we can ask ourselves a few questions about the data and the relationships among the variables and start investigating. For instance:

  • What is the fraction of families that would buy eco apples?

  • What household (or head of household) characteristics are associated with a greater likelihood of purchase?


What is the fraction of families that would buy eco apples?

# Percentage of families that would buy the eco apples
tb.apples %>% 
  summarise(mean_buy_eco = mean(buy_eco))

62.4% of the surveyed families would buy the eco apples.


What household (or head of household) characteristics are associated with a greater likelihood of purchase?

Below is one example of a table and a plot comparing men’s and women’s decisions to buy eco apples. A similar analysis can be conducted for the other variables in the data set.

# Create factor variables
tb.apples <- tb.apples %>% mutate(
  male_fact    = factor(male   , levels = c(0,1), labels = c("Female", "Male")),
  buy_eco_fact = factor(buy_eco, levels = c(0,1), labels = c("No", "Yes")))

# Summarize purchase intentions by group
tb.apples %>% 
  count(male_fact, buy_eco_fact) %>% 
  group_by(male_fact) %>%  
  mutate(freq = round(n / sum(n),2)) %>% 
  pivot_wider(id_cols = 'male_fact'
            , names_from ='buy_eco_fact'
            , values_from = 'freq')

From the table we can see that a higher share of women indicated they would buy the eco apples (64%) than men (58%). However, we do not know yet if the difference is statistically significant.
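One quick way to check is a two-sample test of equal proportions, for example with base R’s prop.test() (a minimal sketch; a regression with a male dummy, as in the next section, gets at the same question):

# Test whether the male/female difference in the share buying eco apples is significant
prop.test(table(tb.apples$male_fact, tb.apples$buy_eco_fact)[, c("Yes", "No")])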

The information from the table can also be represented in a colored bar graph as shown below.

# Plot purchase intentions by group
ggplot(tb.apples, aes(x = male_fact, fill = buy_eco_fact)) +
  geom_bar(position = "fill") +
  labs(x = 'Gender', y = 'Buy eco percentage', fill = 'Buy Eco') + 
  theme_bw() + scale_fill_grey()

You can use your knowledge of tables and plots to further explore interesting relationships in this data set.




3. Linear Probability Model (LPM)

The linear probability model takes the following equation:

\[P(y=1|\mathbf x)=E(y|\mathbf x)=\beta_0 + \beta_1 x_1 + ... + \beta_k x_k\] The above equation tells us that the probability of success, \(P(y=1|\mathbf x)\), is a linear function of the explanatory variables \(x_j\). In the LPM, \(\beta_j\) measures the change in the probability of success when \(x_j\) changes, holding other factors fixed.


3.1 Interpreting LPM coefficients


To estimate a linear probability model in R, we use the usual lm() function:

Note:

  • In the LPM specification, the dependent variable is buy_eco and not buy_eco_fact because regression models require a numerical dependent variable. The variable buy_eco is coded as 0 or 1, which allows the model to interpret it as a probability — that is, the expected value of buy_eco represents the probability of purchasing the eco-labeled apples and can take any value between 0 and 1.

  • Similarly, for explanatory variables, we use male (coded as 0 or 1) instead of male_fact, because factor variables in R are handled differently, especially if they are treated as ordered factors, which can lead to unintended interpretations in linear models. Using the numeric version (male) ensures that the coefficient can be interpreted as the difference in probability associated with being male, all else constant.

# Linear probability model (LPM)
model <- buy_eco ~  ecoprc + regprc + faminc + hhsize + male
out.ols <- lm(model, data = tb.apples)
stargazer(out.ols, type = 'text'
        , no.space = TRUE, header = FALSE)

===============================================
                        Dependent variable:    
                    ---------------------------
                              buy_eco          
-----------------------------------------------
ecoprc                       -0.862***         
                              (0.110)          
regprc                       0.760***          
                              (0.132)          
faminc                        0.001**          
                              (0.001)          
hhsize                         0.018           
                              (0.012)          
male                         -0.092**          
                              (0.042)          
Constant                     0.799***          
                              (0.085)          
-----------------------------------------------
Observations                    660            
R2                             0.104           
Adjusted R2                    0.097           
Residual Std. Error      0.461 (df = 654)      
F Statistic           15.190*** (df = 5; 654)  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Interpretation:

  • If ecoprc increases by, say, 10 cents (0.10), then the probability of buying eco-labeled apples falls by about .086. The coefficient is statistically significant at the 1% level.

  • If regprc increases by 10 cents, the probability of buying eco-labeled apples increases by about .076. The coefficient is statistically significant at the 1% level.

Note that we are considering a 10-cent increase rather than a one-dollar increase, as this makes more sense in the context of apple prices (as you can see from the summary statistics above, the mean price is about one dollar).

We are also assuming that the predicted probabilities are not close to the boundaries of zero and one.
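As a quick arithmetic check of these two statements, the implied change for a 10-cent price change is just the estimated coefficient scaled by 0.10:

# Implied change in purchase probability for a 10-cent (0.10) price change
coef(out.ols)[c("ecoprc", "regprc")] * 0.10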





3.2 Limitations of LPM


3.2.1 Probabilities outside the (0,1) range

One limitation of the linear probability model is that it can produce probabilities smaller than 0 and probabilities greater than 1, which makes no sense. We can check if there are predicted probabilities outside of the (0,1) interval by inspecting the minimum and maximum of our model’s predicted values:

# Check minimum and maximum of predicted probabilities
min(predict(out.ols))
[1] 0.275
max(predict(out.ols))
[1] 1.068

There aren’t any predictions below 0 but there are predictions greater than 1.
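We can also count how many fitted values fall outside the unit interval:

# Number of fitted probabilities below 0 and above 1
sum(predict(out.ols) < 0)
sum(predict(out.ols) > 1)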

Alternatively, we can use a plot to check if the predicted probabilities are all inside the (0,1) interval:

# Create column with predicted probabilities
tb.apples <- tb.apples %>% 
  mutate(lpm_pred =  predict(out.ols))

# Plot distribution of predicted probabilities
ggplot(data = tb.apples, aes(x = lpm_pred)) + geom_histogram() +
  geom_vline(xintercept = 0, col = 'Red', linetype = 2) +
  geom_vline(xintercept = 1, col = 'Red', linetype = 2) +
  scale_x_continuous(breaks = seq(-0.1,1.1,0.1)) +
  labs(x = 'Probability of buying Eco Apple', y = 'Number of People') +
  theme_bw()


3.2.2 Heteroskedasticity

Another limitation of the linear probability model is that its error term is heteroskedastic by construction: because the dependent variable is binary, \(Var(y|\mathbf x)=p(\mathbf x)[1-p(\mathbf x)]\), which varies with the explanatory variables. We can use the Breusch-Pagan test to confirm the presence of heteroskedasticity:

# Breusch-Pagan Test
bptest(out.ols)

    studentized Breusch-Pagan test

data:  out.ols
BP = 42, df = 5, p-value = 6e-08

The Breusch-Pagan test p-value is very low so we can reject the null that the error term is homoskedastic.
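A common remedy is to keep the LPM but report heteroskedasticity-robust standard errors. Below is a minimal sketch using coeftest() from the already loaded lmtest package; it assumes the sandwich package is installed, which is not part of this lab’s setup:

# Heteroskedasticity-robust (HC1) standard errors for the LPM
# NOTE: assumes the sandwich package is installed (not in this lab's package list)
library(sandwich)
coeftest(out.ols, vcov = vcovHC(out.ols, type = "HC1"))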


3.2.3. Linear relationship between the variables

Finally, OLS assumes a linear relationship between the variables: each one-unit increase in x is associated with the same change in y, regardless of the starting value of x. However, it is possible that, for instance, changing the price of apples by one cent from $0.01 to $0.02 does not have the same effect on consumers’ propensity to buy as changing the price by one cent from $0.99 to $1.00.





4. Probit and Logit

Probit and Logit are two alternatives to OLS that address some of the above mentioned limitations. These models are of a class of binary response models of the form:

\[P(y=1|\mathbf x)=G(\beta_0 + \beta_1 x_1 + ... + \beta_k x_k) = G(\beta_0 + \mathbf x \boldsymbol{\beta})\] where \(0<G(z)<1\).

In the logit model, \(G()\) is the logistic function. In the probit model, \(G()\) is the standard Normal cumulative distribution function (CDF).
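Both link functions map any real number into the (0,1) interval. As a quick visual check, here is a minimal sketch using base R’s plogis() (logistic CDF) and pnorm() (standard Normal CDF):

# Compare the two link functions G(z)
curve(plogis(x), from = -4, to = 4, xlab = "z", ylab = "G(z)")   # logit
curve(pnorm(x),  from = -4, to = 4, add = TRUE, lty = 2)         # probit
legend("topleft", legend = c("logistic (logit)", "Normal CDF (probit)"), lty = c(1, 2))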

Because of the more complicated (non-linear) nature of the probit and logit models, the estimated betas cannot be directly interpreted or compared to the OLS coefficients; to find the partial effect of roughly continuous variables on the response probability we must rely on calculus (take the derivative).


4.1 Estimation

To estimate probit and logit models in R we must use the glm() function. This function is used to fit generalized linear models. When using glm() we must specify the link function we want to use in the model using the options family = binomial(link = 'probit') or family = binomial(link = 'logit').

# Logit, Probit, and LPM
out.ols    <- lm(model , data = tb.apples)
out.probit <- glm(model, data = tb.apples
                , family = binomial(link = 'probit'))
out.logit  <- glm(model, data = tb.apples
                , family = binomial(link = 'logit'))

# Print regression table
stargazer(out.ols, out.probit, out.logit
        , type = 'text', no.space = TRUE
        , header = FALSE)

===============================================================
                                Dependent variable:            
                    -------------------------------------------
                                      buy_eco                  
                              OLS            probit   logistic 
                              (1)              (2)       (3)   
---------------------------------------------------------------
ecoprc                     -0.862***        -2.428*** -3.956***
                            (0.110)          (0.322)   (0.540) 
regprc                     0.760***         2.138***  3.471*** 
                            (0.132)          (0.382)   (0.634) 
faminc                      0.001**          0.003**   0.005** 
                            (0.001)          (0.002)   (0.003) 
hhsize                       0.018            0.050     0.081  
                            (0.012)          (0.035)   (0.058) 
male                       -0.092**         -0.265**  -0.444** 
                            (0.042)          (0.119)   (0.196) 
Constant                   0.799***         0.843***  1.371*** 
                            (0.085)          (0.247)   (0.411) 
---------------------------------------------------------------
Observations                  660              660       660   
R2                           0.104                             
Adjusted R2                  0.097                             
Log Likelihood                              -401.100  -401.100 
Akaike Inf. Crit.                            814.100   814.200 
Residual Std. Error    0.461 (df = 654)                        
F Statistic         15.190*** (df = 5; 654)                    
===============================================================
Note:                               *p<0.1; **p<0.05; ***p<0.01

While in OLS the partial effects are simply the betas, in probit and logit the partial effect of the response probability with respect to \(x_j\) is:

\[\frac{\partial P(y=1|\mathbf x)}{\partial x_j} = g(\beta_0 + \mathbf x \boldsymbol{\beta})\,\beta_j, \text{ where }g(z)\equiv \frac{dG}{dz}(z). \]

Note that, compared with the linear probability model, partial effects in probit and logit are harder to summarize because the scale factor depends on \(z\) (that is, on all of the explanatory variables). One possibility is to plug in interesting values for the \(x_j\), such as means, medians, minimums, maximums, and lower and upper quartiles. Luckily, R makes it easy to do this. While the coefficients from the logit and probit models are not directly interpretable in terms of marginal effects, the indicators of statistical significance (e.g., p-values or stars in the stargazer() output) can still be interpreted as usual.





4.2 PEA

One method, commonly used in econometric packages that routinely estimate probit and logit models, is to replace each explanatory variable with its sample average. This is called the partial effect at the average (PEA): we compute the partial effect at the average of all explanatory variables.

\[g(\hat{\beta_0} + \mathbf{\bar{x}} \hat{\beta}) = g(\hat{\beta_0} + \hat{\beta_1}\bar{x}_1 + \hat{\beta_2}\bar{x}_2 + ... + \hat{\beta_k}\bar{x}_k)\]

This is equation (17.14) in Introductory Econometrics - J. Wooldridge, where \(g(\cdot)\) is the standard Normal density in the probit case, and \(g(z)=\exp(z)/[1+\exp(z)]^2\) in the logit case. If we multiply that by \(\hat\beta_j\), we obtain the partial effect of \(x_j\) for an “average” person in the sample: the partial effect at the average.
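For intuition, the PEA can also be computed by hand before turning to a package. A minimal sketch for the probit model (the single-row data frame of means mirrors the avg_obs object created below):

# Hand-computed PEA for the probit model (equation 17.14 times the slope coefficients)
xbar <- tb.apples %>% 
  summarise(ecoprc = mean(ecoprc), regprc = mean(regprc),
            faminc = mean(faminc), hhsize = mean(hhsize), male = mean(male))
z_bar <- predict(out.probit, newdata = xbar, type = "link")  # beta_0 + xbar'beta
dnorm(z_bar) * coef(out.probit)[-1]                          # scale factor times slopes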

To estimate the PEA (partial effects at the average) in R using the margins package, we create a single-row data frame where all numeric explanatory variables are set to their sample means, and then pass it to the data argument of margins():

# Create a single-row "average" data frame
avg_obs <- tb.apples %>% 
  summarise(across(where(is.numeric), mean))

# PEA Probit
pea_probit <- margins(out.probit, data = avg_obs)
summary(pea_probit)
# PEA Logit
pea_logit <- margins(out.logit, data = avg_obs)
summary(pea_logit)

Interpretation:

Interpretation is now similar to the LPM interpretation but we must keep in mind that these marginal effects are “at the average” of all explanatory variables. For instance:

  • For a respondent who is “average” in all explanatory variables, all else constant, an increase of 10 cents in ecoprc is associated with a decrease of about 0.091 in the probability of purchasing the eco-labeled apples. This effect is statistically significant at the 1% level. The logit model yields a very similar partial effect at the average.


There are at least two potential problems with using PEAs to summarize the partial effects of the explanatory variables:

  1. If some of the explanatory variables are discrete, their averages represent no one in the sample. For example, the average household size computed below is not an integer, so it describes no actual household.
# Mean household size
tb.apples %>% 
  summarise(mean_hhsize = mean(hhsize))
  2. If a continuous explanatory variable appears as a nonlinear function, what should we do? Do we average the variable or the function? (Most packages default to the latter.) For instance, if the model included the square of family income, would we use the average of the squared values or the square of the average value?





4.3 APE

A different approach results from averaging the individual partial effects across the sample, leading to what is called the average partial effect (APE) or, sometimes, the average marginal effect (AME).

The expression for the APE is:

\[\frac{1}{n}\sum_{i=1}^{n} g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})\]

This is equation (17.15) in Introductory Econometrics. If we multiply it by \(\hat\beta_j\), we get the APE for \(x_j\).

Rather than computing the scale factor at the mean, we compute the scale factor at every single observation in our sample, and then take the mean of those scale factors.

Therefore, the APE is often more representative than the PEA. However, for very large samples, or very complicated models, calculating the APE (and especially its standard errors), can take a long time even with a fast computer. This is an argument in favor of the PEA.
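Before turning to a package, the APE scale factor can also be computed by hand. A minimal sketch for the probit model:

# Hand-computed APE scale factor (equation 17.15) and the APE for ecoprc
scale_probit <- mean(dnorm(predict(out.probit, type = "link")))
scale_probit * coef(out.probit)["ecoprc"]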

To estimate the APE — that is, the average of the marginal effects computed for each observation in the sample — we use the margins() function. This is the default behavior of margins, and it works for both probit and logit models:

# APE Probit
ape_probit <- margins(out.probit)
summary(ape_probit)
# APE Logit
ape_logit <- margins(out.logit)
summary(ape_logit)

Interpretation:

Interpretation is now similar to the LPM interpretation.

  • On average, all else constant, an increase of 10 cents in ecoprc is associated with a decrease of roughly 0.083-0.084 in the probability of purchasing the eco-labeled apples (the probit and logit estimates are nearly identical). This effect is statistically significant at the 1% level.





4.4 Discrete Variables

For discrete variables (e.g., hhsize), using a partial derivative is not meaningful. Instead, we compute the discrete change in predicted probability when the variable changes from one level to another — for example, from 1 to 2 — while holding all other variables fixed.

This is calculated using the expression:

\[G(\beta_0 + \beta_1 \cdot 2 + \dots + \beta_k x_k) - G(\beta_0 + \beta_1 \cdot 1 + \dots + \beta_k x_k)\]

  • For the Partial Effect at the Average (PEA), we evaluate this difference at the mean values of all other explanatory variables.

  • For the Average Partial Effect (APE), we calculate this difference for each observation in the dataset (setting x = 2 and x = 1), and then take the average of those differences.

For binary (dummy) variables, the margins() function performs this calculation automatically.

For non-binary discrete variables, like hhsize, this comparison must be done manually.

Partial Effect at the Average (PEA)

To estimate the PEA for a discrete variable such as hhsize, we fix all other explanatory variables at their sample means and compute the predicted probabilities for two values of hhsize — for example, hhsize = 2 and hhsize = 3. This is done using the predict() function with the type = "response" option, which returns the predicted probability (rather than the linear prediction). The difference between these two predicted probabilities reflects the discrete change in the probability of the outcome when hhsize increases from 2 to 3, holding all other variables at their average values. This provides an interpretable estimate of the marginal effect of changing hhsize in an “average” household.

Note: The default type = "link" in predict() returns the linear prediction \(\hat{\beta}_0 + \mathbf{x}_i \hat{\beta}\), which is not directly interpretable in terms of probabilities.
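To see the connection between the two prediction types, note that applying \(G(\cdot)\) (pnorm() for the probit model) to the linear prediction reproduces the response-scale prediction. A minimal sketch:

# "response" predictions are G() applied to the "link" predictions (probit: G = pnorm)
head(pnorm(predict(out.probit, type = "link")))
head(predict(out.probit, type = "response"))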

# Get means of explanatory variables
means <- tb.apples %>%
  summarise(across(where(is.numeric), mean)) 

# Create copies with hhsize = 2 and 3
family_2 <- means %>% mutate(hhsize = 2)
family_3 <- means %>% mutate(hhsize = 3)

# Predicted probabilities
p2 <- predict(out.probit, newdata = family_2, type = "response")
p3 <- predict(out.probit, newdata = family_3, type = "response")

# PEA: discrete effect of increasing hhsize from 2 to 3 at average values
pea_effect <- p3 - p2
pea_effect
      1 
0.01895 

Average Partial Effect (APE)

To estimate the APE, we apply the same idea but across all observations in the sample. For each observation, we compute the predicted probability first with hhsize = 2, then with hhsize = 3, and average the difference.

# Create two datasets with hhsize forced to 2 and 3
tmp_data_2 <- tb.apples %>% mutate(hhsize = 2)
tmp_data_3 <- tb.apples %>% mutate(hhsize = 3)

# Predicted probabilities
p2 <- predict(out.probit, newdata = tmp_data_2, type = "response")
p3 <- predict(out.probit, newdata = tmp_data_3, type = "response")

# APE: average effect of increasing hhsize from 2 to 3 across sample
ape_effect <- mean(p3 - p2)
ape_effect
[1] 0.01752
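For comparison, the derivative-based APE for hhsize reported by margins() in section 4.3 need not coincide exactly with this discrete change, since one is based on a derivative and the other on a one-unit jump. A quick way to pull out that row, assuming ape_probit is still in memory:

# Derivative-based APE for hhsize from margins(), for comparison with the discrete change
summary(ape_probit) %>% filter(factor == "hhsize")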





4.5 Predicted Values

Using the type = "response" option of predict(), we get the predicted probability for each family. The default type of predict() gives us \(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}\) (the linear prediction) for each family, which is not directly interpretable as a probability.

# Predicted probabilities Probit
tb.apples <- tb.apples %>% 
  mutate(prob_probit = predict(out.probit, type = "response"))

# Mean of predicted probit probabilities
tb.apples %>% 
  summarize(mean_probs_probit = mean(prob_probit))
# Actual sample proportion
tb.apples %>% summarize(sample_proportion =  sum(buy_eco)/n())
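One simple way to summarize how well the model’s predicted probabilities line up with actual choices is the percent correctly predicted, classifying a family as a buyer when its predicted probability exceeds 0.5. A minimal sketch (0.5 is a common but arbitrary cutoff):

# Percent correctly predicted by the probit model (cutoff of 0.5)
tb.apples %>% 
  summarise(pct_correct = mean((prob_probit > 0.5) == buy_eco))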

