# Load the packages
library(haven) # For reading .dta files
library(tidyverse) # For data manipulation (dplyr, ggplot2)
library(stargazer) # For regression tables
library(margins) # For marginal effects
library(ggplot2) # For plots
library(broom) # For tidy model output
library(lmtest) # For testing
library(sandwich) # For robust standard errors
library(gt) # For tables
library(jtools) # For tables
library(estimatr) # For robust standard errors in lpm
library(officer) # For exporting tables to Word
library(flextable) # For formatting tables in Word output
library(patchwork) # For combining plots

Estimating Logit and Probit Models
Application using R
Objective
We will learn how to use R to estimate logit and probit models. Before diving into the estimation process, we will first consider research questions that warrant the use of logit or probit models. Some examples include questions that can be answered with a simple yes/no, such as crop choice (e.g., 1 = corn and 0 = soybean) by farmers in response to a specific treatment (e.g., extreme heat). Refer to Lee (2025) for a discussion of an application that investigates this example. Or we could think of the factors that influence the adoption of a new technology. In the face of climate change, which is bringing extreme heat, unpredictable rainfall, and rising sea levels, agricultural adaptation is of paramount importance. Farmers and producers face two choices: adapt existing cropping practices, for example by using stress-tolerant varieties, or undertake alternative economic activities through new crops. Lee (2025) examined the choice of crop (corn versus soybean) when farmers face extreme heat in the US Corn Belt. Interestingly, Lee (2025) models this choice with a linear probability model, an example of a discrete choice decision modeled using a linear model; we showed earlier how this can lead to biased estimates. Paik et al. (2020), on the other hand, explore the factors that influence adoption of stress-tolerant crop varieties when farmers face external stress. We are going to use the data from the study by Paik et al. (2020) to demonstrate an application of logit and probit models in R.
The study setting is the Mekong River Delta (MRD) of Vietnam. The MRD, despite accounting for only 12 percent of Vietnam's total area, is responsible for 55 percent of planted rice and 57 percent of total rice production in the country. With Vietnam behind only India and Thailand in global rice production, the MRD is an extremely crucial region for global rice supply. The region is facing an increasingly unfavorable environment through an increased incidence of salinity inundation in rice production. Most of the MRD lies less than one meter above sea level, and the average sea level rose by 9-13 cm over the period 1980-2007. The increased salinity due to rising sea levels, accompanied by decreased flow in the Mekong River, threatens stable growth in rice production. In addition to damaging rice crops in the field, salinity increases the flushing time required to leach salt out of field soil, which can delay seeding and reduce rice productivity in the next cropping season. The use of salt-tolerant rice varieties (STRVs) can potentially reduce rice-yield losses under salinity exposure. Research institutions, including the International Rice Research Institute (IRRI) and the Cuu Long Delta Rice Research Institute (CLRRI) in the MRD, have recognized this need and generated STRVs adapted to the MRD through the Consortium for Unfavorable Rice Environments (CURE). CURE has evaluated varietal performance in multilocational trials and on farmers' fields and introduced promising varieties into the country's seed multiplication system. These varieties are henceforth referred to as CURE-related varieties.
Install and load packages
Before we start discussing the application, let's complete a few housekeeping tasks. This is essentially where we install and load all the tools and packages that we will need for the analysis.
We don't need to run the installation commands every time. Once a package is installed on our system, we simply load it and 'call' its functions whenever we need them.
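If any of these packages are missing on your machine, they can be installed once with install.packages(). A minimal sketch (the package list simply mirrors the library() calls above):
# One-time installation: install only the packages that are not yet on the system
pkgs <- c("haven", "tidyverse", "stargazer", "margins", "broom", "lmtest",
"sandwich", "gt", "jtools", "estimatr", "officer", "flextable", "patchwork")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)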
Now that we have the required libraries loaded in our R environment, we are going to set up our working directory. This is where we will store tables and figures and from where we will read the data for our analysis; it is the path where your data is stored.
Set working directory
# Set working directory
data_path <- "C:/Users/rzs0133/OneDrive - Auburn University/Desktop_Auburn Department Computer/Teaching/R_probit_logit_models/Replication Data_Paik et al 2020/Dataset in Stata format (.dta)"

Application 1: What influences adoption of salt-tolerant rice in Vietnam?
What are the factors that influence the adoption of CURE-related varieties in salinity-prone areas of the MRD?
Data
The first step is to get data that can help us answer the question. We are in luck! There are several repositories that host replication data for published studies. The replication data for Paik et al. (2020) is available on the Harvard Dataverse (Paik et al., n.d.). We will access the dataset and store it in our working directory. The data is available in Stata format, so we will use the haven package to read .dta files into R.
Data cleaning
We need to clean and preprocess the data. We follow the code provided in the replication package. The original cleaning code was written for Stata, so we run it in Stata and store the cleaned data in two datasets. The analysis is done separately for two seasons: Dong Xuan (Winter-Spring) and He Thu (Summer-Autumn). We load the two datasets using the haven package. Because we are using a dataset that was created by other researchers, we save a lot of time that would otherwise have gone into cleaning and pre-processing the data.
There is no unique way to teach this step. You will realize that it takes some of the longest time when you do your own research. You will get better at it with practice and as you become familiar with the data sets. A good way to start is to look at the replication files of other researchers who have used the data set you are working with. If it's your own data, then I am sorry, you are on your own! But don't worry, there are plenty of tools on the internet to help you out. Let's read the data files.
# Rice is planted in three seasons in Vietnam: Mua (monsoon), He Thu (summer-autumn), and Dong Xuan (winter-spring).
dong_xuan <- read_dta(file.path(data_path, "field_data_dong_xuan.dta"))
he_thu <- read_dta(file.path(data_path, "field_data_he_thu.dta"))

Now that we have loaded the data into R, we can dive into the analysis of the factors that influence adoption of salt-tolerant rice varieties. We are going to run the analysis for just one of the seasons. Paik et al. (2020) analyze both seasons, but we don't need to run both; you can try the other one on your own. I will use the dong_xuan data set in this lecture.
Model
All the algebra and math that we discussed in the previous lecture will be done with one or two lines of code. Isn't that cool? We are interested in estimating the following equation:
\[ \text{Adoption}_i = \beta_0 + \sum_{j=1}^{20} \beta_j X_{ij} + \epsilon_i \tag{1}\]
where \(\text{Adoption}_i\) is a binary indicator (1 = adopter, 0 = non-adopter) and the \(X_{ij}\) are the independent (or explanatory) variables, including \(Salinity\) (number of years in which the salinity level in April exceeded the salinity threshold) and \(Nei\) (ratio of neighbors who are adopters to total neighbors).
Let’s look at some summary statistics.
Summary statistics
# Define the variables that will be added to the model
vars <- c("Adoption","Age", "Male", "Eth", "Eduprim", "Edusec", "Eduhigh",
"Off", "Diverse", "Meeting", "Size", "Child", "Credit",
"Saving", "Dis1", "Dis2", "Irri1", "Irri2", "Irri3",
"Salinity", "Nei")
# simplest: base R summary of the selected variables
summary_stats1 <- dong_xuan %>%
select(all_of(vars)) %>%
summary()
# first generate summary stats
summary_stats <- dong_xuan %>%
select(all_of(vars)) %>%
summarise(across(everything(),
list(
Mean = ~mean(., na.rm = TRUE),
SD = ~sd(., na.rm = TRUE),
Min = ~min(., na.rm = TRUE),
Max = ~max(., na.rm = TRUE),
N = ~sum(!is.na(.))
),
.names = "{.col}_{.fn}")) %>%
pivot_longer(everything(),
names_to = c("Variable", "Statistic"),
names_sep = "_") %>%
pivot_wider(names_from = "Statistic",
values_from = "value") %>%
select(Variable, Mean, SD, Min, Max, N)
# Then create a well-formatted table
summary_stats %>%
mutate(across(Mean:Max, ~round(., 3))) %>%
gt() %>%
tab_header(
title = "Summary Statistics",
subtitle = "Dong Xuan Season, Field Level Data"
) %>%
fmt_number(columns = c(Mean, SD), decimals = 3) %>%
fmt_integer(columns = c(N, Min, Max))

Summary Statistics (Dong Xuan Season, Field Level Data)

| Variable | Mean | SD | Min | Max | N |
|---|---|---|---|---|---|
| Adoption | 0.346 | 0.476 | 0 | 1 | 809 |
| Age | 52.315 | 10.985 | 20 | 81 | 809 |
| Male | 0.889 | 0.315 | 0 | 1 | 809 |
| Eth | 0.155 | 0.362 | 0 | 1 | 809 |
| Eduprim | 0.420 | 0.494 | 0 | 1 | 809 |
| Edusec | 0.247 | 0.432 | 0 | 1 | 809 |
| Eduhigh | 0.117 | 0.322 | 0 | 1 | 809 |
| Off | 0.878 | 1.045 | 0 | 5 | 809 |
| Diverse | 1.844 | 1.155 | 0 | 12 | 809 |
| Meeting | 1.843 | 1.114 | 0 | 6 | 809 |
| Size | 1.627 | 2.379 | 0 | 38 | 809 |
| Child | 0.808 | 0.963 | 0 | 9 | 809 |
| Credit | 0.277 | 0.448 | 0 | 1 | 809 |
| Saving | 0.227 | 0.419 | 0 | 1 | 809 |
| Dis1 | 4.001 | 4.840 | 0 | 27 | 809 |
| Dis2 | 8.011 | 6.432 | 0 | 26 | 809 |
| Irri1 | 0.125 | 0.331 | 0 | 1 | 809 |
| Irri2 | 0.292 | 0.455 | 0 | 1 | 809 |
| Irri3 | 0.518 | 0.500 | 0 | 1 | 809 |
| Salinity | 12.222 | 4.578 | 0 | 15 | 809 |
| Nei | 0.464 | 0.380 | 0 | 1 | 809 |
Linear probability model
First, we estimate a linear model. We write the formula for the regression model expressed in Equation 1.
# Define variables
x_vars <- c("Age", "Male", "Eth", "Eduprim", "Edusec", "Eduhigh",
"Off", "Diverse", "Meeting", "Size", "Child", "Credit",
"Saving", "Dis1", "Dis2", "Irri1", "Irri2", "Irri3",
"Salinity", "Nei")
# Create formula
formula_regression <- paste("Adoption ~", paste(x_vars, collapse = " + "))
# this is just an elegant way of writing code.
# alternatively, you can also run:
# lpm_model1 <- lm(Adoption ~ Age + Male + Eth +
# Eduprim + Edusec + Eduhigh +
# Off + Diverse + Meeting + Size + Child +
# Credit + Saving + Dis1 + Dis2 + Irri1 +
# Irri2 + Irri3 + Salinity + Nei,
# data = dong_xuan)

Let's first estimate a linear probability model. This method models the choice of adoption within an OLS framework. We showed earlier why this might be prone to error, but it is still a good baseline model.
lpm_model1 <- lm(as.formula(formula_regression), data = dong_xuan)
# For a linear probability model with heteroskedasticity-corrected standard errors (as we showed in the earlier lecture, the errors are not constant):
lpm_model1robust <- lm_robust(as.formula(formula_regression),
data = dong_xuan,
se_type = "HC1")
export_summs(lpm_model1, lpm_model1robust,
robust = TRUE,
vcov.type = "HC1",
model.names = c("LPM", "LPM (robust SE)"),
note = "Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05")

|   | LPM | LPM (robust SE) |
|---|---|---|
| (Intercept) | -0.16 | -0.16 |
| (0.13) | (0.13) | |
| Age | 0.00 * | 0.00 * |
| (0.00) | (0.00) | |
| Male | -0.05 | -0.05 |
| (0.04) | (0.04) | |
| Eth | 0.00 | 0.00 |
| (0.05) | (0.05) | |
| Eduprim | 0.02 | 0.02 |
| (0.04) | (0.04) | |
| Edusec | 0.05 | 0.05 |
| (0.04) | (0.04) | |
| Eduhigh | 0.11 * | 0.11 * |
| (0.06) | (0.06) | |
| Off | 0.01 | 0.01 |
| (0.01) | (0.01) | |
| Diverse | 0.02 | 0.02 |
| (0.01) | (0.01) | |
| Meeting | -0.01 | -0.01 |
| (0.01) | (0.01) | |
| Size | -0.01 | -0.01 |
| (0.01) | (0.01) | |
| Child | 0.00 | 0.00 |
| (0.01) | (0.01) | |
| Credit | -0.02 | -0.02 |
| (0.03) | (0.03) | |
| Saving | 0.03 | 0.03 |
| (0.04) | (0.04) | |
| Dis1 | -0.00 | -0.00 |
| (0.00) | (0.00) | |
| Dis2 | 0.00 | 0.00 |
| (0.00) | (0.00) | |
| Irri1 | -0.03 | -0.03 |
| (0.08) | (0.08) | |
| Irri2 | -0.02 | -0.02 |
| (0.07) | (0.06) | |
| Irri3 | -0.12 | -0.12 |
| (0.06) | (0.06) | |
| Salinity | 0.00 | 0.00 |
| (0.00) | (0.00) | |
| Nei | 0.66 *** | 0.66 *** |
| (0.04) | (0.04) | |
| N | 809 | 809 |
| R2 | 0.30 | 0.30 |
| Standard errors in parentheses.*** p < 0.001; ** p< 0.01; * p < 0.05 | ||
Let's check what the predicted values look like. Remember that our dependent variable is binary (1 = adopt, 0 = don't adopt) (see Summary Statistics above).
Predictions from linear model
predicted_values <- predict(lpm_model1robust, se.fit = TRUE,
newdata = dong_xuan)
# Create data for plotting
plot_data <- data.frame(
Actual = dong_xuan$Adoption,
Predicted_lpm = predicted_values$fit,
SE_lpm = predicted_values$se.fit,
Observation = 1:nrow(dong_xuan)
)
# Plot
ggplot(plot_data, aes(y = Actual, x = Predicted_lpm)) +
geom_point(size = 2, alpha = 0.5, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "red", size = 1) +
labs(
title = "Predicted vs Actual Adoption",
x = "Predicted Probability",
y = "Actual Adoption (0 or 1)"
) +
theme_minimal()

In the above plot we compare the predicted probabilities from the linear probability model with the actual data on adoption and non-adoption. Some predicted probabilities are below zero. A negative probability has no meaning and highlights one of the pitfalls of modelling a binary outcome with a linear model.
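A quick way to check the extent of this problem is to count how many LPM predictions fall outside the [0, 1] interval; a small sketch that reuses the plot_data frame created above:
# How many fitted probabilities from the LPM lie below 0 or above 1?
sum(plot_data$Predicted_lpm < 0 | plot_data$Predicted_lpm > 1)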
Logit and Probit
Now, let's look at the code to run logit and probit models. They are fitted with glm, R's function for generalized linear models, which can capture the non-linearity we need.
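As a quick reminder of what glm estimates here (a sketch of the two link functions, where \(\Lambda\) and \(\Phi\) denote the logistic and standard normal CDFs and \(X_i'\beta\) is the linear index from Equation 1):
\[
P(\text{Adoption}_i = 1 \mid X_i) =
\begin{cases}
\Lambda(X_i'\beta) = \dfrac{\exp(X_i'\beta)}{1 + \exp(X_i'\beta)} & \text{(logit)}\\[6pt]
\Phi(X_i'\beta) & \text{(probit)}
\end{cases}
\]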
logit_model1 <- glm(as.formula(formula_regression),
family = binomial(link = "logit"),
data = dong_xuan)
export_summs(logit_model1,
robust = TRUE,
vcov.type = "HC1",
model.names = c("Logit"),
note = "Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05")

|   | Logit |
|---|---|
| (Intercept) | -3.83 *** |
| (0.87) | |
| Age | 0.02 * |
| (0.01) | |
| Male | -0.26 |
| (0.31) | |
| Eth | -0.02 |
| (0.28) | |
| Eduprim | 0.11 |
| (0.25) | |
| Edusec | 0.31 |
| (0.29) | |
| Eduhigh | 0.74 * |
| (0.36) | |
| Off | 0.02 |
| (0.09) | |
| Diverse | 0.13 |
| (0.08) | |
| Meeting | -0.09 |
| (0.09) | |
| Size | -0.04 |
| (0.06) | |
| Child | 0.02 |
| (0.10) | |
| Credit | -0.06 |
| (0.21) | |
| Saving | 0.26 |
| (0.24) | |
| Dis1 | -0.00 |
| (0.02) | |
| Dis2 | 0.02 |
| (0.02) | |
| Irri1 | -0.21 |
| (0.44) | |
| Irri2 | -0.14 |
| (0.39) | |
| Irri3 | -0.78 * |
| (0.38) | |
| Salinity | 0.03 |
| (0.02) | |
| Nei | 3.68 *** |
| (0.30) | |
| N | 809 |
| AIC | 813.28 |
| BIC | 911.89 |
| Pseudo R2 | 0.39 |
| Standard errors in parentheses.*** p < 0.001; ** p< 0.01; * p < 0.05 | |
Take note of the coefficients. First, they differ from what we saw in the LPM earlier. Second, the interpretation is also very different. For example, the estimated coefficient for \(Nei\) is 3.676, which implies that a one-unit increase in \(Nei\) multiplies the odds of adoption by \(e^{3.676}\). The coefficient does not give us the probability directly, only the direction of the relationship between \(Nei\) and \(Adoption\). To estimate the effect of each covariate on the likelihood of \(Adoption\) we need to compute marginal effects. There are several ways to calculate marginal effects: at the means, as the average of individual marginal effects, and at representative values.
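If you want the odds ratio itself rather than the log-odds coefficient, exponentiating the coefficient gives it directly; a quick check using the fitted model above:
# Multiplicative change in the odds of adoption per unit increase in Nei
exp(coef(logit_model1)["Nei"])   # approximately exp(3.676), i.e. about 39.5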
Average Marginal Effects (AME)
# Calculate marginal effects (Average Marginal Effects - AME)
logit_margins1 <- margins(logit_model1, type = "response")
export_summs(logit_model1, logit_margins1,
robust = TRUE,
vcov.type = "HC1",
model.names = c("Logit", "Average Marginal Effects"),
note = "Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05")

|   | Logit | Average Marginal Effects |
|---|---|---|
| (Intercept) | -3.83 *** | |
| (0.87) | ||
| Age | 0.02 * | 0.00 * |
| (0.01) | (0.00) | |
| Male | -0.26 | -0.04 |
| (0.31) | (0.05) | |
| Eth | -0.02 | -0.00 |
| (0.28) | (0.04) | |
| Eduprim | 0.11 | 0.02 |
| (0.25) | (0.04) | |
| Edusec | 0.31 | 0.05 |
| (0.29) | (0.04) | |
| Eduhigh | 0.74 * | 0.12 * |
| (0.36) | (0.05) | |
| Off | 0.02 | 0.00 |
| (0.09) | (0.01) | |
| Diverse | 0.13 | 0.02 |
| (0.08) | (0.01) | |
| Meeting | -0.09 | -0.01 |
| (0.09) | (0.01) | |
| Size | -0.04 | -0.01 |
| (0.06) | (0.01) | |
| Child | 0.02 | 0.00 |
| (0.10) | (0.02) | |
| Credit | -0.06 | -0.01 |
| (0.21) | (0.03) | |
| Saving | 0.26 | 0.04 |
| (0.24) | (0.04) | |
| Dis1 | -0.00 | -0.00 |
| (0.02) | (0.00) | |
| Dis2 | 0.02 | 0.00 |
| (0.02) | (0.00) | |
| Irri1 | -0.21 | -0.03 |
| (0.44) | (0.07) | |
| Irri2 | -0.14 | -0.02 |
| (0.39) | (0.06) | |
| Irri3 | -0.78 * | -0.12 * |
| (0.38) | (0.06) | |
| Salinity | 0.03 | 0.00 |
| (0.02) | (0.00) | |
| Nei | 3.68 *** | 0.57 *** |
| (0.30) | (0.03) | |
| N | 809 | 0 |
| AIC | 813.28 | 813.28 |
| BIC | 911.89 | 911.89 |
| Pseudo R2 | 0.39 | |
| Standard errors in parentheses.*** p < 0.001; ** p< 0.01; * p < 0.05 | ||
These are the average marginal effects of the different variables. Notice the difference from the raw coefficients: the marginal effects are much easier to interpret. If we look at the marginal effect of \(Nei\), 0.57, it can be interpreted as follows: a one-unit increase in \(Nei\) is associated with a 57 percentage point increase in the probability of \(Adoption\) (equivalently, a 0.1-unit increase raises it by about 5.7 percentage points). We repeat the same exercise for the probit model below.
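If you want the marginal effects together with their standard errors and confidence intervals as a plain data frame (for instance, to pull out \(Nei\) directly), calling summary() on the margins object returns one; a small sketch:
# AME, SE, z, p-value, and 95% confidence interval for each variable
ame_table <- summary(logit_margins1)
ame_table[ame_table$factor == "Nei", ]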
probit_model1 <- glm(as.formula(formula_regression),
family = binomial(link = "probit"),
data = dong_xuan)
# Calculate marginal effects (Average Marginal Effects - AME)
probit_margins1 <- margins(probit_model1, type = "response")
export_summs(probit_model1, probit_margins1,
robust = TRUE,
vcov.type = "HC1",
model.names = c("Probit", "Average Marginal Effects"),
note = "Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05")

|   | Probit | Average Marginal Effects |
|---|---|---|
| (Intercept) | -2.22 *** | |
| (0.49) | ||
| Age | 0.01 * | 0.00 * |
| (0.01) | (0.00) | |
| Male | -0.12 | -0.03 |
| (0.17) | (0.05) | |
| Eth | -0.01 | -0.00 |
| (0.17) | (0.04) | |
| Eduprim | 0.05 | 0.01 |
| (0.15) | (0.04) | |
| Edusec | 0.16 | 0.04 |
| (0.17) | (0.04) | |
| Eduhigh | 0.41 * | 0.11 * |
| (0.21) | (0.05) | |
| Off | 0.02 | 0.00 |
| (0.05) | (0.01) | |
| Diverse | 0.07 | 0.02 |
| (0.05) | (0.01) | |
| Meeting | -0.05 | -0.01 |
| (0.05) | (0.01) | |
| Size | -0.02 | -0.00 |
| (0.03) | (0.01) | |
| Child | 0.01 | 0.00 |
| (0.05) | (0.02) | |
| Credit | -0.04 | -0.01 |
| (0.12) | (0.03) | |
| Saving | 0.16 | 0.04 |
| (0.14) | (0.04) | |
| Dis1 | -0.00 | -0.00 |
| (0.01) | (0.00) | |
| Dis2 | 0.01 | 0.00 |
| (0.01) | (0.00) | |
| Irri1 | -0.12 | -0.03 |
| (0.25) | (0.07) | |
| Irri2 | -0.07 | -0.02 |
| (0.22) | (0.06) | |
| Irri3 | -0.47 * | -0.13 * |
| (0.22) | (0.06) | |
| Salinity | 0.01 | 0.00 |
| (0.01) | (0.00) | |
| Nei | 2.16 *** | 0.58 *** |
| (0.17) | (0.03) | |
| N | 809 | 0 |
| AIC | 812.06 | 812.06 |
| BIC | 910.67 | 910.67 |
| Pseudo R2 | 0.40 | |
| Standard errors in parentheses.*** p < 0.001; ** p< 0.01; * p < 0.05 | ||
The marginal effects are similar to those from the logit model, but the probit coefficients themselves are very different from the logit coefficients. For instance, the probit coefficient for \(Nei\) is 2.16, which implies that a one-unit increase in \(Nei\) shifts the latent index (z-score) of \(Adoption\) up by 2.16 standard deviations. The corresponding marginal effect is 0.58, very similar to the logit marginal effect of 0.57. So far we have computed the AME.
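The difference in coefficient scale is expected: logit coefficients are typically around 1.6-1.8 times the corresponding probit coefficients, because the logistic distribution has a larger standard deviation than the standard normal. A quick check with the fitted models:
# Ratio of the logit to the probit coefficient for Nei (roughly 3.68 / 2.16, i.e. about 1.7)
coef(logit_model1)["Nei"] / coef(probit_model1)["Nei"]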
We can compare all the AME together in one table (helpful when writing a paper or for presentation purposes).
# table with coefficients of linear probability model, marginal effects of logit and probit
export_summs(lpm_model1, lpm_model1robust, logit_margins1, probit_margins1,
robust = TRUE,
vcov.type = "HC1",
model.names = c("LPM", "LPM (robust SE)","AME (Logit)", "AME (Probit)"),
note = "Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05")

|   | LPM | LPM (robust SE) | AME (Logit) | AME (Probit) |
|---|---|---|---|---|
| (Intercept) | -0.16 | -0.16 | ||
| (0.13) | (0.13) | |||
| Age | 0.00 * | 0.00 * | 0.00 * | 0.00 * |
| (0.00) | (0.00) | (0.00) | (0.00) | |
| Male | -0.05 | -0.05 | -0.04 | -0.03 |
| (0.04) | (0.04) | (0.05) | (0.05) | |
| Eth | 0.00 | 0.00 | -0.00 | -0.00 |
| (0.05) | (0.05) | (0.04) | (0.04) | |
| Eduprim | 0.02 | 0.02 | 0.02 | 0.01 |
| (0.04) | (0.04) | (0.04) | (0.04) | |
| Edusec | 0.05 | 0.05 | 0.05 | 0.04 |
| (0.04) | (0.04) | (0.04) | (0.04) | |
| Eduhigh | 0.11 * | 0.11 * | 0.12 * | 0.11 * |
| (0.06) | (0.06) | (0.05) | (0.05) | |
| Off | 0.01 | 0.01 | 0.00 | 0.00 |
| (0.01) | (0.01) | (0.01) | (0.01) | |
| Diverse | 0.02 | 0.02 | 0.02 | 0.02 |
| (0.01) | (0.01) | (0.01) | (0.01) | |
| Meeting | -0.01 | -0.01 | -0.01 | -0.01 |
| (0.01) | (0.01) | (0.01) | (0.01) | |
| Size | -0.01 | -0.01 | -0.01 | -0.00 |
| (0.01) | (0.01) | (0.01) | (0.01) | |
| Child | 0.00 | 0.00 | 0.00 | 0.00 |
| (0.01) | (0.01) | (0.02) | (0.02) | |
| Credit | -0.02 | -0.02 | -0.01 | -0.01 |
| (0.03) | (0.03) | (0.03) | (0.03) | |
| Saving | 0.03 | 0.03 | 0.04 | 0.04 |
| (0.04) | (0.04) | (0.04) | (0.04) | |
| Dis1 | -0.00 | -0.00 | -0.00 | -0.00 |
| (0.00) | (0.00) | (0.00) | (0.00) | |
| Dis2 | 0.00 | 0.00 | 0.00 | 0.00 |
| (0.00) | (0.00) | (0.00) | (0.00) | |
| Irri1 | -0.03 | -0.03 | -0.03 | -0.03 |
| (0.08) | (0.08) | (0.07) | (0.07) | |
| Irri2 | -0.02 | -0.02 | -0.02 | -0.02 |
| (0.07) | (0.06) | (0.06) | (0.06) | |
| Irri3 | -0.12 | -0.12 | -0.12 * | -0.13 * |
| (0.06) | (0.06) | (0.06) | (0.06) | |
| Salinity | 0.00 | 0.00 | 0.00 | 0.00 |
| (0.00) | (0.00) | (0.00) | (0.00) | |
| Nei | 0.66 *** | 0.66 *** | 0.57 *** | 0.58 *** |
| (0.04) | (0.04) | (0.03) | (0.03) | |
| N | 809 | 809 | 0 | 0 |
| R2 | 0.30 | 0.30 | ||
| Standard errors in parentheses.*** p < 0.001; ** p< 0.01; * p < 0.05 | ||||
As you can see, this is a very long table. Maybe we just want to focus on the \(Nei\) variable. We can do that by adding coefs = c("Nei") to the export_summs call; the default is to include all coefficients.
export_summs(lpm_model1, lpm_model1robust, logit_margins1, probit_margins1,
robust = TRUE,
vcov.type = "HC1",
model.names = c("LPM", "LPM (robust SE)","AME (Logit)", "AME (Probit)"),
coefs = c("Nei"),
note = "Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05")

|   | LPM | LPM (robust SE) | AME (Logit) | AME (Probit) |
|---|---|---|---|---|
| Nei | 0.66 *** | 0.66 *** | 0.57 *** | 0.58 *** |
| (0.04) | (0.04) | (0.03) | (0.03) | |
| N | 809 | 809 | 0 | 0 |
| R2 | 0.30 | 0.30 | ||
| Standard errors in parentheses.*** p < 0.001; ** p< 0.01; * p < 0.05 | ||||
Marginal effect at means
We can also calculate the marginal effect at the means. This has several drawbacks because the point of means might not represent an 'actual' observation. For example, the mean of \(Male\) is 0.89, which implies that the representative individual is 89 percent male; this does not make much sense. So the preferred approach is usually the average marginal effect. You can calculate the marginal effect at means using the following procedure. First, create a list of the mean values of all covariates used in the model. Then use the at = means_list option to fix the covariates at their mean values while calculating the marginal effects.
# Calculate means for all variables (this will be needed to calculate marginal effect at means)
means_list <- list()
for (var in x_vars) {
means_list[[var]] <- mean(dong_xuan[[var]], na.rm = TRUE)
}
logit_atmeans <- margins(logit_model1, type = "response",
at = means_list)
probit_atmeans <- margins(probit_model1, type = "response",
at = means_list)

Finally, we present all the marginal effects in a single table for comparison.
# compare marginal effect at means with average marginal effects
export_summs(logit_atmeans, logit_margins1, probit_atmeans, probit_margins1, lpm_model1,
robust = TRUE,
vcov.type = "HC1",
model.names = c("Logit (at means)", "Logit (AME)", "Probit (at means)", "Probit (AME)", "LPM"),
coefs = c("Adoption rate of Neighbors" = "Nei"))

|   | Logit (at means) | Logit (AME) | Probit (at means) | Probit (AME) | LPM |
|---|---|---|---|---|---|
| Adoption rate of Neighbors | 0.74 *** | 0.57 *** | 0.74 *** | 0.58 *** | 0.66 *** |
| (0.06) | (0.03) | (0.05) | (0.03) | (0.04) | |
| N | 0 | 0 | 0 | 0 | 809 |
| R2 | 0.30 | ||||
| Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||
Marginal effect at representative values
Let’s assume we want to estimate the marginal effect of Nei for different levels of field size, Size. We can do that by:
# first create a sequence of values for Size
size_values <- seq(0,38, by = 2)
me_by_size <- margins(logit_model1,
at = list(Size = size_values),
variables = "Nei")
# convert it into a dataframe and then calculate the confidence intervals
me_df <- summary(me_by_size) %>%
select(Size, AME, SE) %>%
mutate(
CI_lower = AME - 1.96 * SE,
CI_upper = AME + 1.96 * SE
)
# Plot
ggplot(me_df, aes(x = Size, y = AME)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black", size = 0.4) +
geom_line(color = "blue", size = 1.2) +
geom_point(color = "blue", size = 3) +
geom_ribbon(aes(ymin = CI_lower, ymax = CI_upper),
fill = "blue", alpha = 0.2) +
labs(
title = "Marginal Effect of Neighbor Adoption (Nei) by Farm Size",
x = "Farm Size",
y = "Marginal Effect",
subtitle = "95% Confidence Intervals shown in light blue"
) +
theme_minimal()

Predicted probabilities
Plot the predicted probabilities of the logit (or probit) model to get an idea of what our predictions look like.
predicted.data <- data.frame(
prob_logit=logit_model1$fitted.values,
prob_probit=probit_model1$fitted.values,
prob_lpm=lpm_model1$fitted.values,
adoption=dong_xuan$Adoption)
predicted.data <- predicted.data[
order(predicted.data$prob_logit, decreasing=FALSE),]
predicted.data$rank <- 1:nrow(predicted.data)
# # Reshape the data to get all predicted probabilities in column. Easier for plotting
# predicted_long <- predicted.data %>%
# pivot_longer(cols= c(prob_logit, prob_probit, prob_lpm),
# names_to = "Model",
# values_to = "Predicted_prob") %>%
# mutate(Model = case_when(
# Model == "prob_logit" ~ "Logit",
# Model == "prob_probit" ~ "Probit",
# Model == "prob_lpm" ~ "LPM"
# ))
ggplot(data = predicted.data, aes(x = rank, y = prob_logit)) +
geom_point(aes(color = factor(adoption)), alpha = 0.2) +
xlab("Index") +
ylab("Predicted probability of adoption") +
scale_color_manual(values = c("red", "blue"),
labels = c("Non-Adopter", "Adopter"),
name = "Actual Status") +
labs(
caption = "Index refers to observations ranked in increasing order of predicted probability (Logit)"
) +
theme_minimal()

Application 2: Determinants of enrollment in college
We use a sample data set provided by the R package wooldridge. We will examine what determines enrollment in college using data from David Card (1995). The link to the paper can be found in the references section. The paper investigates the causal link between schooling and earnings using proximity to college as an exogenous determinant of schooling. The study uses data from the NLS Young Men Cohort. This data set is already available in R, so we don't need to download it separately. You can look at the wooldridge package to explore other available data sets. In the next few steps we will see how to load data from the wooldridge package. First, load the library.
library(wooldridge)

If you type the following, you will get a list of all the data sets in this package.

library(help = "wooldridge")

Subsequently, you can type the following to look at each data set more carefully. We show it for the 'card' data set.

?card

Load the data (lazily)

data('card')

Although Card (1993) investigated the impact of schooling on wages and used proximity as an instrumental variable, we are going to use the data to see how proximity influences enrollment (a binary choice variable).
Binary choice models
probit_card <- glm(enroll ~ nearc2 + nearc4 + fatheduc + motheduc + IQ,
data = card,
family = binomial(link = 'probit'))
logit_card <- glm(enroll ~ nearc2 + nearc4 + fatheduc + motheduc + IQ,
data = card,
family = binomial(link = 'logit'))
card_margins_logit <- margins(logit_card, type = "response")
card_margins_probit <- margins(probit_card, type = "response")
export_summs(probit_card, card_margins_probit, logit_card, card_margins_logit,
model.names = c("Probit", "AME(Probit)", "Logit", "AME(Logit)"),
to.file = "Word",
file.name = "card_results.docx") # may need the "officer" and "flextable" packages

|   | Probit | AME(Probit) | Logit | AME(Logit) |
|---|---|---|---|---|
| (Intercept) | -2.19 *** | -4.00 *** | ||
| (0.32) | (0.62) | |||
| nearc2 | 0.04 | 0.01 | 0.08 | 0.01 |
| (0.09) | (0.02) | (0.16) | (0.02) | |
| nearc4 | 0.21 * | 0.04 * | 0.42 * | 0.04 * |
| (0.10) | (0.02) | (0.20) | (0.02) | |
| fatheduc | 0.01 | 0.00 | 0.02 | 0.00 |
| (0.02) | (0.00) | (0.03) | (0.00) | |
| motheduc | 0.01 | 0.00 | 0.02 | 0.00 |
| (0.02) | (0.00) | (0.04) | (0.00) | |
| IQ | 0.01 | 0.00 | 0.01 | 0.00 |
| (0.00) | (0.00) | (0.01) | (0.00) | |
| N | 1619 | 0 | 1619 | 0 |
| AIC | 1107.08 | 1107.08 | 1106.78 | 1106.78 |
| BIC | 1139.42 | 1139.42 | 1139.12 | 1139.12 |
| Pseudo R2 | 0.02 | 0.02 | ||
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||||
Plot marginal effects
In addition to reporting the results in a tabular format, we can also plot the marginal effects. This often helps when presenting findings at a conference or in a blog post, and even in the paper itself.
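Before calling the built-in plot() method below, here is a minimal ggplot-based sketch that constructs a similar figure from the summary() of a margins object (shown for the logit AMEs; the columns factor, AME, lower, and upper come from the summary output):
# Plot the logit average marginal effects with 95% confidence intervals
ame_card <- summary(card_margins_logit)
ggplot(ame_card, aes(x = factor, y = AME, ymin = lower, ymax = upper)) +
geom_pointrange(color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed") +
coord_flip() +
labs(x = "Variable", y = "Average marginal effect (Logit)") +
theme_minimal()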
p1 <- plot(card_margins_logit)
p2 <- plot(card_margins_probit)

Summary
- Learn how to perform logit and probit analysis using the glm function.
- Compute marginal effects using the margins function.
- Make publication-ready tables and export them to a Word file (they can also be exported to Excel, CSV, or TeX format files).
- Compute predicted probabilities.
- Plot marginal effects (as an alternative to the tables).
Additional Sources for Data
- Data sets used in Jeffrey M. Wooldridge, "Introductory Econometrics: A Modern Approach," 7e. There is an R package that contains all the data sets used in the book (requires R version >= 3.5.0). It can be installed using install.packages("wooldridge"). Additional information on the wooldridge package can be accessed here.
- Harvard Dataverse is a free data repository open to all researchers from any discipline. You can access data that has been uploaded by other researchers or even upload your own data and code. This is the link to the website.
- Another built-in package in R that contains miscellaneous data sets is the datasets package.
- Data Sharing for Demographic Research (DSDR). This is a population health data archive and contains thousands of data sets from across the world.
- IPUMS. This provides census and survey data from around the world, integrated across time and space.