LPM, Logit and Probit
1 Continue the exploratory data analysis from the ‘Learning Notebook’ and explore further any interesting relationships in the ‘apple’ data set. What other relationships can we find?
2 Interpret the coefficient ‘male’ in the LPM model.
3 Consider the two LPM models presented below and note how the introduction of the control variables in model 2 has very little impact on the estimated coefficients \(\hat{\beta_1}\) and \(\hat{\beta_2}\). Explain why this occurs.
model1 <- buy_eco ~ ecoprc + regprc
model2 <- buy_eco ~ ecoprc + regprc + faminc + hhsize + male
4 Predict the (intended) purchase probabilities for the logit and probit models and plot their distributions.
5 Create two scatterplots: (1) predicted probabilities of the LPM vs Probit and (2) predicted probabilities of the LPM vs Logit. Add a 45-degree line to each plot using the function geom_abline(). Discuss your findings.
Get started by loading the required libraries, then reading and preprocessing the data.
library(dplyr)
library(ggplot2)
library(stargazer)
# Load data, convert to tibble format, discard original
load("data/apple.RData")
tb.apples <- as_tibble(data)
rm(data)
# Create an indicator variable "buy_eco"
tb.apples <- tb.apples %>%
mutate(buy_eco = if_else(ecolbs > 0, 1, 0))
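As a quick sanity check (a one-line sketch, not part of the exercise), look at the overall intended-purchase rate:
# Overall share of respondents who intend to buy eco-labelled apples
tb.apples %>%
  summarise(n = n(), buy_eco_rate = mean(buy_eco))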
# Looking at propensity to buy per state
tb.apples %>%
group_by(state) %>%
summarise(n = n(), buy_eco_mean = mean(buy_eco)) %>%
filter(n > 10)
## # A tibble: 19 × 3
## state n buy_eco_mean
## <chr> <int> <dbl>
## 1 CA 66 0.545
## 2 FL 33 0.697
## 3 GA 14 0.5
## 4 IL 30 0.567
## 5 IN 12 0.667
## 6 MA 14 0.5
## 7 MD 20 0.65
## 8 MI 25 0.84
## 9 MN 24 0.667
## 10 MO 15 0.667
## 11 NC 15 0.667
## 12 NJ 20 0.6
## 13 NY 47 0.702
## 14 OH 32 0.781
## 15 PA 22 0.5
## 16 TN 15 0.733
## 17 TX 41 0.585
## 18 VA 24 0.625
## 19 WI 23 0.565
# Looking at propensity to buy per age group
tb.apples %>%
mutate(age_bin = cut(age, breaks=seq(0, 100, by = 10))) %>%
group_by(age_bin) %>%
summarise(n = n(), buy_eco_mean = mean(buy_eco))
## # A tibble: 8 × 3
## age_bin n buy_eco_mean
## <fct> <int> <dbl>
## 1 (10,20] 9 0.222
## 2 (20,30] 123 0.699
## 3 (30,40] 165 0.624
## 4 (40,50] 165 0.642
## 5 (50,60] 84 0.631
## 6 (60,70] 64 0.547
## 7 (70,80] 42 0.548
## 8 (80,90] 8 0.5
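Another relationship worth a quick look is family income, reusing the same binning pattern (a sketch; the break points 0 to 150 are an assumption about the range of faminc):
# Looking at propensity to buy per family-income bin
tb.apples %>%
  mutate(faminc_bin = cut(faminc, breaks = seq(0, 150, by = 25))) %>%
  group_by(faminc_bin) %>%
  summarise(n = n(), buy_eco_mean = mean(buy_eco))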
# Linear Probability Model (LPM)
model <- buy_eco ~ ecoprc + regprc + faminc + hhsize + male
out.ols <- lm(model, data = tb.apples)
stargazer(out.ols, type = 'text', no.space = TRUE, header = FALSE)
##
## ===============================================
## Dependent variable:
## ---------------------------
## buy_eco
## -----------------------------------------------
## ecoprc -0.862***
## (0.110)
## regprc 0.760***
## (0.132)
## faminc 0.001**
## (0.001)
## hhsize 0.018
## (0.012)
## male -0.092**
## (0.042)
## Constant 0.799***
## (0.085)
## -----------------------------------------------
## Observations 660
## R2 0.104
## Adjusted R2 0.097
## Residual Std. Error 0.461 (df = 654)
## F Statistic 15.188*** (df = 5; 654)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation:
On average, holding prices, family income, and household size fixed, male household heads are about 9.2 percentage points less likely to intend to buy eco-labelled apples than female household heads. The effect is statistically significant at the 5% level.
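A quick way to eyeball this effect is to compare raw group means (a sketch; the raw difference is not the ceteris paribus effect the regression estimates):
# Intended-purchase rates by the male indicator
tb.apples %>%
  group_by(male) %>%
  summarise(n = n(), buy_eco_mean = mean(buy_eco))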
# Alternative Linear Probability Models
model1 <- buy_eco ~ ecoprc + regprc
model2 <- buy_eco ~ ecoprc + regprc + faminc + hhsize + male
out_ols1 <- lm(model1, data = tb.apples)
out_ols2 <- lm(model2, data = tb.apples)
stargazer(out_ols1, out_ols2, type = 'text', no.space = TRUE, header = FALSE)
##
## ===================================================================
## Dependent variable:
## -----------------------------------------------
## buy_eco
## (1) (2)
## -------------------------------------------------------------------
## ecoprc -0.845*** -0.862***
## (0.110) (0.110)
## regprc 0.735*** 0.760***
## (0.133) (0.132)
## faminc 0.001**
## (0.001)
## hhsize 0.018
## (0.012)
## male -0.092**
## (0.042)
## Constant 0.890*** 0.799***
## (0.071) (0.085)
## -------------------------------------------------------------------
## Observations 660 660
## R2 0.086 0.104
## Adjusted R2 0.083 0.097
## Residual Std. Error 0.464 (df = 657) 0.461 (df = 654)
## F Statistic 30.784*** (df = 2; 657) 15.188*** (df = 5; 654)
## ===================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation:
Since the prices ecoprc and regprc were randomly assigned by experimental design, they are (approximately) uncorrelated with the household characteristics faminc, hhsize, and male. Omitted-variable bias arises only when an added control is correlated with an included regressor, so adding the controls leaves the estimated price coefficients essentially unchanged; the controls mainly absorb residual variance (note the rise in R2 from 0.086 to 0.104).
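One way to check this in the data is to regress a price on the controls; under random assignment the coefficients should be close to zero (a sketch, not part of the original exercise):
# Balance check: controls should not predict the randomly assigned price
balance_check <- lm(ecoprc ~ faminc + hhsize + male, data = tb.apples)
summary(balance_check)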
# Logit and Probit
out.logit <- glm(model, data = tb.apples, family = binomial(link = 'logit'))
out.probit <- glm(model, data = tb.apples, family = binomial(link = 'probit'))
# Predicted probabilities (type = "response" returns probabilities on the 0-1 scale)
tb.apples <- tb.apples %>%
  mutate(prob_logit = predict(out.logit, type = "response"),
         prob_probit = predict(out.probit, type = "response"))
# Plot the predicted probabilities
ggplot(tb.apples, aes(x = prob_logit)) +
geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
labs(title = "Predicted Purchase Probabilities (Logit Model)",
x = "Predicted Probability", y = "Count") +
theme_minimal()
ggplot(tb.apples, aes(x = prob_probit)) +
geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
labs(title = "Predicted Purchase Probabilities (Probit Model)",
x = "Predicted Probability", y = "Count") +
theme_minimal()
# Predicted probabilities for the LPM
tb.apples <- tb.apples %>%
mutate(lpm_pred = predict(out.ols))
## LPM vs Probit
ggplot(tb.apples, aes(x = lpm_pred, y = prob_probit)) +
geom_point(alpha = 0.6) +
geom_abline(intercept = 0, slope = 1, color = "blue", linewidth = 0.6) +
labs(title = "LPM vs Probit Predicted Probabilities",
x = "LPM Predictions", y = "Probit Predictions") +
theme_minimal()
## LPM vs Logit
ggplot(tb.apples, aes(x = lpm_pred, y = prob_logit)) +
geom_point(alpha = 0.6) +
geom_abline(intercept = 0, slope = 1, color = "blue", linewidth = 0.6) +
labs(title = "LPM vs Logit Predicted Probabilities",
x = "LPM Predictions", y = "Logit Predictions") +
theme_minimal()
Interpretation:
The LPM, Probit, and Logit models give very similar predicted probabilities for most observations. Deviations appear mainly in the tails (near 0 or 1), where the LPM can produce fitted values outside the [0, 1] interval, while the Logit and Probit predictions flatten out as they approach the 0 and 1 bounds.
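To quantify the boundary problem, count how many LPM fitted values fall outside [0, 1] (a sketch; Logit and Probit can never produce such values):
# LPM predictions below 0 or above 1
tb.apples %>%
  summarise(below_0 = sum(lpm_pred < 0), above_1 = sum(lpm_pred > 1))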