LPM, Logit and Probit

1 Continue the exploratory data analysis from the ‘Learning Notebook’ and explore further any interesting relationships in the ‘apple’ data set. What other relationships can we find?

2 Interpret the coefficient on ‘male’ in the LPM model.

3 Consider the two LPM models presented below and note how the introduction of the control variables in model 2 has very little impact on the estimated coefficients \(\hat{\beta_1}\) and \(\hat{\beta_2}\). Explain why this occurs.

model1 <- buy_eco ~ ecoprc + regprc

model2 <- buy_eco ~ ecoprc + regprc + faminc + hhsize + male

4 Predict the (intended) purchase probabilities for the logit and probit models and plot their distributions.

5 Create two scatterplots: (1) predicted probabilities of the LPM vs Probit and (2) predicted probabilities of the LPM vs Logit. Add a 45 degree line to each plot using the function geom_abline(). Discuss your findings.


Get started by loading libraries, reading and preprocessing data.

library(dplyr)
library(ggplot2)
library(stargazer)

# Load data, convert to tibble format, discard original
load("data/apple.RData")
tb.apples <- as_tibble(data)
rm(data)

# Create an indicator variable "buy_eco"
tb.apples <- tb.apples %>% 
  mutate(buy_eco = if_else(ecolbs > 0, 1, 0))
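
As an optional sanity check (not part of the original solution), we can tabulate the new indicator:

# Tabulate the buy_eco indicator
tb.apples %>% count(buy_eco)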


  1. Continue the exploratory data analysis from the Learning Notebook and explore further any interesting relationships in the ‘apple’ data set. What other relationships can we find?
# Looking at propensity to buy per state
tb.apples %>% 
  group_by(state) %>%
  summarise(n = n(), buy_eco_mean = mean(buy_eco)) %>%
  filter(n > 10)
## # A tibble: 19 × 3
##    state     n buy_eco_mean
##    <chr> <int>        <dbl>
##  1 CA       66        0.545
##  2 FL       33        0.697
##  3 GA       14        0.5  
##  4 IL       30        0.567
##  5 IN       12        0.667
##  6 MA       14        0.5  
##  7 MD       20        0.65 
##  8 MI       25        0.84 
##  9 MN       24        0.667
## 10 MO       15        0.667
## 11 NC       15        0.667
## 12 NJ       20        0.6  
## 13 NY       47        0.702
## 14 OH       32        0.781
## 15 PA       22        0.5  
## 16 TN       15        0.733
## 17 TX       41        0.585
## 18 VA       24        0.625
## 19 WI       23        0.565
# Looking at propensity to buy per age group
tb.apples %>%
  mutate(age_bin = cut(age, breaks=seq(0, 100, by = 10))) %>%
  group_by(age_bin) %>%
  summarise(n = n(), buy_eco_mean = mean(buy_eco))
## # A tibble: 8 × 3
##   age_bin     n buy_eco_mean
##   <fct>   <int>        <dbl>
## 1 (10,20]     9        0.222
## 2 (20,30]   123        0.699
## 3 (30,40]   165        0.624
## 4 (40,50]   165        0.642
## 5 (50,60]    84        0.631
## 6 (60,70]    64        0.547
## 7 (70,80]    42        0.548
## 8 (80,90]     8        0.5
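
One further relationship worth a quick look is family income. The cut below is a sketch with illustrative break points, assuming faminc is recorded in thousands of dollars:

# Looking at propensity to buy per family income bin (illustrative breaks)
tb.apples %>%
  mutate(faminc_bin = cut(faminc, breaks = c(0, 25, 50, 75, 100, Inf))) %>%
  group_by(faminc_bin) %>%
  summarise(n = n(), buy_eco_mean = mean(buy_eco))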


  2. Interpret the coefficient on ‘male’ in the LPM model.
# Linear Probability Model (LPM)
model <- buy_eco ~  ecoprc + regprc + faminc + hhsize + male
out.ols <- lm(model, data = tb.apples)
stargazer(out.ols, type = 'text', no.space = TRUE, header = FALSE)
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               buy_eco          
## -----------------------------------------------
## ecoprc                       -0.862***         
##                               (0.110)          
## regprc                       0.760***          
##                               (0.132)          
## faminc                        0.001**          
##                               (0.001)          
## hhsize                         0.018           
##                               (0.012)          
## male                         -0.092**          
##                               (0.042)          
## Constant                     0.799***          
##                               (0.085)          
## -----------------------------------------------
## Observations                    660            
## R2                             0.104           
## Adjusted R2                    0.097           
## Residual Std. Error      0.461 (df = 654)      
## F Statistic           15.188*** (df = 5; 654)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01


Interpretation:

On average, holding the other regressors fixed, male respondents are about 9 percentage points less likely to (intend to) buy eco-labelled apples than female respondents. The effect is statistically significant at the 5% level.
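
As a quick, unadjusted comparison (an optional sketch, not part of the original solution), the raw purchase propensity by gender can be set against the regression-adjusted estimate above:

# Unadjusted mean purchase propensity by gender
tb.apples %>%
  group_by(male) %>%
  summarise(n = n(), buy_eco_mean = mean(buy_eco))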


  3. Consider the two LPM models presented below and note how the introduction of the control variables in model 2 has very little impact on the estimated coefficients \(\hat{\beta_1}\) and \(\hat{\beta_2}\). Explain why this occurs.
# Alternative Linear Probability Models
model1   <- buy_eco ~  ecoprc + regprc
model2   <- buy_eco ~  ecoprc + regprc + faminc + hhsize + male

out_ols1 <- lm(model1, data = tb.apples)
out_ols2 <- lm(model2, data = tb.apples)

stargazer(out_ols1, out_ols2, type = 'text', no.space = TRUE, header = FALSE)
## 
## ===================================================================
##                                   Dependent variable:              
##                     -----------------------------------------------
##                                         buy_eco                    
##                               (1)                     (2)          
## -------------------------------------------------------------------
## ecoprc                     -0.845***               -0.862***       
##                             (0.110)                 (0.110)        
## regprc                     0.735***                0.760***        
##                             (0.133)                 (0.132)        
## faminc                                              0.001**        
##                                                     (0.001)        
## hhsize                                               0.018         
##                                                     (0.012)        
## male                                               -0.092**        
##                                                     (0.042)        
## Constant                   0.890***                0.799***        
##                             (0.071)                 (0.085)        
## -------------------------------------------------------------------
## Observations                  660                     660          
## R2                           0.086                   0.104         
## Adjusted R2                  0.083                   0.097         
## Residual Std. Error    0.464 (df = 657)        0.461 (df = 654)    
## F Statistic         30.784*** (df = 2; 657) 15.188*** (df = 5; 654)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01


Interpretation:

Since the price pairs were randomly assigned to respondents (experimental design), they are uncorrelated with household characteristics such as faminc, hhsize, and male. Omitted-variable bias arises only when an omitted control is correlated with the included regressors, so adding these controls leaves the estimated price coefficients \(\hat{\beta_1}\) and \(\hat{\beta_2}\) essentially unchanged.
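
If random assignment holds, the price variables should be essentially uncorrelated with the household controls. The quick check below is a sketch (assuming all five variables are stored as numeric columns):

# Correlations between prices and household controls;
# values near zero are consistent with random assignment of the price pairs
tb.apples %>%
  select(ecoprc, regprc, faminc, hhsize, male) %>%
  cor() %>%
  round(3)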


  4. Predict the (intended) purchase probabilities for the logit and probit models and plot their distributions.
# Logit and Probit
out.logit  <- glm(model, data = tb.apples, family = binomial(link = 'logit'))
out.probit <- glm(model, data = tb.apples, family = binomial(link = 'probit'))

# Predicted probabilities 
tb.apples <- tb.apples %>%
  mutate(prob_logit = predict(out.logit, type = "response"))

tb.apples <- tb.apples %>%
  mutate(prob_probit = predict(out.probit, type = "response"))

# Plot the predicted probabilities
ggplot(tb.apples, aes(x = prob_logit)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
  labs(title = "Predicted Purchase Probabilities (Logit Model)",
       x = "Predicted Probability", y = "Count") +
  theme_minimal()

ggplot(tb.apples, aes(x = prob_probit)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
  labs(title = "Predicted Purchase Probabilities (Probit Model)",
       x = "Predicted Probability", y = "Count") +
  theme_minimal()
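
A brief numerical summary (optional sketch) makes the comparison of the two predicted-probability distributions explicit:

# Range and mean of the predicted probabilities from both models
tb.apples %>%
  summarise(across(c(prob_logit, prob_probit),
                   list(min = min, mean = mean, max = max)))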


  5. Create two scatterplots: (1) predicted probabilities of the LPM vs Probit and (2) predicted probabilities of the LPM vs Logit. Add a 45 degree line to each plot using the function geom_abline(). Discuss your findings.
# Predicted probabilities for the LPM
tb.apples <- tb.apples %>%
  mutate(lpm_pred = predict(out.ols))
## LPM vs Probit
ggplot(tb.apples, aes(x = lpm_pred, y = prob_probit)) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, color = "blue", linewidth = 0.6) +
  labs(title = "LPM vs Probit Predicted Probabilities",
       x = "LPM Predictions", y = "Probit Predictions") +
  theme_minimal()

## LPM vs Logit
ggplot(tb.apples, aes(x = lpm_pred, y = prob_logit)) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, color = "blue", linewidth = 0.6) +
  labs(title = "LPM vs Logit Predicted Probabilities",
       x = "LPM Predictions", y = "Logit Predictions") +
  theme_minimal()


Interpretation:

The LPM, Probit, and Logit models give very similar predicted probabilities for most observations. Deviations appear mainly in the tails (predictions close to 0 or 1): the LPM, being linear, can produce fitted values below 0 or above 1, whereas the S-shaped logit and probit link functions keep predictions inside [0, 1] and flatten out as they approach the boundaries.
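
To back this up numerically (a sketch, not part of the original solution), we can count how many LPM fitted values fall outside [0, 1] and correlate the three sets of predictions:

# Share of LPM fitted values outside the [0, 1] interval
tb.apples %>%
  summarise(share_outside = mean(lpm_pred < 0 | lpm_pred > 1))

# Correlations between the three sets of predictions
tb.apples %>%
  select(lpm_pred, prob_probit, prob_logit) %>%
  cor() %>%
  round(3)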