Due Friday, April 15th.
This assignment involves implementing data analysis for a geographic quasi-experiment. In particular, we will work with some reanalysis of this study we read and discussed in class: Christensen, D., Hartman, A. C., & Samii, C. (2021). Legibility and External Investment: An Institutional Natural Experiment in Liberia. International Organization, 75(4), 1087-1108.
We have provided a set of files for you to work with. You don’t need the other replication materials from the paper, and we would suggest you try doing this without consulting them.
Our starter code also makes use of a few packages, including fixest. You can use another regression package, like lfe, or base lm with sandwich or clubSandwich.
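The chunks below assume the usual packages are attached; a minimal setup sketch (the exact set loaded by the starter code may differ) is:
library(dplyr)    # data manipulation (%>%, filter, group_by, summarise)
library(ggplot2)  # plotting
library(fixest)   # feols() for fixed-effects regressions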
pd = readRDS("panel.rds")
head(pd)
This is panel data for the years 2001-2014. The authors study the effects of property right systems on external land investment (measured by fraction of forest cleared in 36x36 blocks), specifically around the time of the Global Food Crisis in 2007.
The institutional boundary is 40 miles inland from the southern coast, and in their main analysis the authors analyze areas within 40 km (25 miles) of the boundary. They also identify “Southern Counties” – Grand Gedeh, Grand Kru, Maryland, Sinoe, and River Gee – where this institutional boundary is less salient; these counties are excluded in certain analyses.
- cell_number_36x36: unique identifier for each block of 36x36 cells.
- prop_loss: outcome variable, the fraction of cells in the block with forest cleared (each individual cell’s loss is binary 0/1). Despite the naming, not every 36x36 block has 1296 cells; most blocks have 648, and some have as few as 4.
- is_county: indicator for “County” areas.
- is_county_post: indicator that takes the value 1 for County areas starting in 2008.
- avg_running_var: distance from the institutional border. Positive values indicate the cell is above the border (i.e., “Hinterland”), negative values indicate the cell is below the border (i.e., “County”).
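As a quick sanity check on this structure (a sketch using dplyr; output not shown), we can count blocks and years and confirm the range of the outcome:
pd %>%
  summarise(
    n_blocks = n_distinct(cell_number_36x36),  # number of 36x36 blocks
    n_years  = n_distinct(year),               # should be 14 (2001-2014)
    min_loss = min(prop_loss, na.rm = TRUE),   # prop_loss should lie in [0, 1]
    max_loss = max(prop_loss, na.rm = TRUE)
  )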
Here is the area covered by the data. Note that some areas have somewhat less dense data.
ggplot(
aes(x = avg_x, y = avg_y),
data = pd %>% filter(year == 2004)
) +
stat_bin_hex(bins = 100) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
)
Here we compute a series of points approximately specifying the border
and plot this border over the different counties in this data.
boundary = pd %>%
filter(year == 2004, is_county == 1) %>%
mutate(
x = round(avg_x, 2)
) %>%
filter(x > -11.15) %>%
group_by(x) %>%
summarise(
y = max(avg_y)
)
ggplot(
aes(x = avg_x, y = avg_y, color = county_mode),
data = pd %>% filter(year == 2004)
) +
geom_point(size = .0001) +
geom_line(
aes(x = x, y = y),
inherit.aes = FALSE,
data = boundary,
size = 1
) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none"
)
In this assignment, we are going to assess the results considered in the paper’s Table 1. These are results about effects on forest loss following the food crisis.
The basic model used here is this two-way fixed effects model:
formula_1 = prop_loss ~ is_county_post | cell_number_36x36 + year
model_iid = feols(
formula_1,
data = pd
)
pd['residuals'] <- c(residuals(model_iid))  # store residuals for the error maps below
summary(model_iid)
## OLS estimation, Dep. Var.: prop_loss
## Observations: 2,079,672
## Fixed-effects: cell_number_36x36: 148,548, year: 14
## Standard-errors: Clustered (cell_number_36x36)
## Estimate Std. Error t value Pr(>|t|)
## is_county_post 0.015102 0.000304 49.6596 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.040135 Adj. R2: 0.623956
## Within R2: 0.008772
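Note that, despite the name model_iid, feols clusters standard errors on the first fixed effect (here cell_number_36x36) by default when fixed effects are present, as the summary above shows. Purely iid standard errors can be requested explicitly, for example (output not shown):
summary(model_iid, vcov = "iid")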
Now consider different ways of doing statistical inference here.
The analysis above is valid only if we assume that observations are independent. In the paper the authors report “cluster-robust” standard errors, where the clusters are districts (district_mode).
A useful presentation of these kinds of dependency-robust variance-covariance estimators is: \[ \hat{V}[\mathbf{\hat{\beta}}] = (\mathbf{X}'\mathbf{X})^{-1} \hat{\mathbf{B}} (\mathbf{X}'\mathbf{X})^{-1} \] where the central matrix is often written \[ \mathbf{\hat{B}} = \sum_{g=1}^G \mathbf{X}'_g \mathbf{\hat{u}}_g \mathbf{\hat{u}}'_g \mathbf{X}_g \] where \(\hat{u}_i = y_i - \mathbf{x}'_i \mathbf{\hat{\beta}}\) is the \(i\)th unit’s residual, subscripts \(g\) pick out those elements from cluster \(g\), and this is a sum over the \(G\) clusters.
We can write this a bit more generally by thinking about the vector of all residuals \(\mathbf{\hat{u}}\) and the matrix of all pairs of residuals \(\mathbf{\hat{u}} \mathbf{\hat{u}}'\). Then this central matrix can be written as \[ \mathbf{\hat{B}} = \mathbf{X}'(\mathbf{\hat{u}} \mathbf{\hat{u}}' \odot \mathbf{S}^G) \mathbf{X} \] where \(\odot\) denotes element-wise multiplication. \(\mathbf{S}^G\) is a “selector” matrix that is 1 when the row and column elements are in the same cluster (i.e., \(S^G_{ij} = 1_{G(i) = G(j)}\)). That is, this matrix indicates which pairs of residuals get summed up and which do not. (This discussion follows Cameron, Gelbach & Miller (2011) and Eckles, Kizilcec & Bakshy (2016, SI \(\S\) 5).)
This representation can be useful for other sorts of variance-covariance estimators, such as those discussed below in Question 3.
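To make the selector-matrix representation concrete, here is a small sketch on simulated data (our own addition, using the sandwich package) checking that it reproduces a standard cluster-robust variance estimate for a plain OLS fit. With fixed effects, as in feols above, the same logic applies to the within-transformed regressors.
library(sandwich)

set.seed(1)
n <- 200
g <- rep(1:20, each = 10)                  # 20 clusters of 10 observations each
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(20)[g] + rnorm(n)   # outcome with a cluster-level error component
fit <- lm(y ~ x)

X <- model.matrix(fit)                     # n x 2 design matrix (intercept, x)
u <- residuals(fit)
S_G <- outer(g, g, "==") * 1               # selector matrix: 1 if i and j share a cluster
B <- t(X) %*% ((u %*% t(u)) * S_G) %*% X   # central matrix B-hat
XtX_inv <- solve(t(X) %*% X)
V_manual <- XtX_inv %*% B %*% XtX_inv      # sandwich form (X'X)^-1 B (X'X)^-1

# should match sandwich's clustered VCOV with small-sample adjustments turned off
V_pkg <- vcovCL(fit, cluster = g, type = "HC0", cadjust = FALSE)
all.equal(V_manual, V_pkg, check.attributes = FALSE)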
Do a similar cluster-robust analysis and comment on whether this level of clustering makes this analysis defensible. Are there other levels of clustering that you think are preferable?
First, we plot the distribution of model errors over the individual cells for a visual interpretation. As the plot below illustrates, the errors appear heteroskedastic and spatially correlated: we see patches of yellow, indicating higher residuals in those areas. This correlation of model errors across nearby locations motivates some type of cluster-robust analysis.
df_heat <- data.frame(
  avg_x = round(pd$avg_x, 2),   # rounded coordinates for tiling
  avg_y = round(pd$avg_y, 2),
  residuals = pd$residuals      # residuals from model_iid, stored above
)
ggplot(df_heat, aes(x = as.factor(avg_x), y = as.factor(avg_y), fill = residuals)) +
geom_tile() +
labs(x='avg_x', y= 'avg_y') +
scale_fill_viridis_c() +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank())
Following the authors’ approach, we begin by clustering on the district level and compute the standard errors as:
model_cluster_district = feols(
formula_1,
data = pd,
cluster = "district_mode"
)
## NOTE: 56 observations removed because of NA values (vcov: 56).
summary(model_cluster_district)
## OLS estimation, Dep. Var.: prop_loss
## Observations: 2,079,616
## Fixed-effects: cell_number_36x36: 148,544, year: 14
## Standard-errors: Clustered (district_mode)
## Estimate Std. Error t value Pr(>|t|)
## is_county_post 0.015102 0.005274 2.86316 0.0064016 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.040136 Adj. R2: 0.623956
## Within R2: 0.008771
Using this approach, we obtain a much larger standard error (0.0053) than in the estimation above, where errors were clustered by grid cell (0.0003), roughly 17 times larger.
Notably, this type of cluster-robust analysis relies on some assumptions. For example, it assumes that model errors for grid cells in different districts are uncorrelated. Intuitively, though, forest loss in adjacent cells, even if they are in different districts, might be highly correlated, which could violate the assumption needed for this analysis and still lead to an underestimate of the standard error. As a test of this assumption, we again plot the distribution of model errors, but this time over the individual districts (instead of cells) for a visual interpretation. The following plot suggests that the errors are still correlated across districts – note the proximity of districts with higher errors to each other – which calls into question this level of clustering.
df_heat_district = pd %>%
  group_by(district_mode) %>%
  summarise(
    avg_x = mean(avg_x),
    avg_y = mean(avg_y),
    avg_error = mean(residuals)
  )
ggplot(df_heat_district, aes(x = as.factor(avg_x), y = as.factor(avg_y), fill = avg_error)) +
geom_tile() +
labs(x='avg_x', y= 'avg_y') +
scale_fill_viridis_c() +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank())
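Because the district-level errors above still appear spatially correlated, one simple check we could add (our own addition, not part of the paper's analysis) is to cluster at a coarser level, such as county:
# sketch: re-estimate the clustered VCOV at the county level
# (county_mode is the county identifier plotted in the map above)
model_cluster_county = feols(
  formula_1,
  data = pd,
  cluster = "county_mode"
)
summary(model_cluster_county)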
Given the limitations of this approach, as outlined in Cameron and Miller, we could additionally learn a model for the within-district correlation of errors based on other covariates in the data, such as avg_x, avg_y, and year, and perhaps any_agriculture, any_forestry, and any_mining as well. Then, instead of estimating forest loss by ordinary least squares (OLS), we could use feasible generalized least squares (FGLS).
The 36 by 36 grid cells here are approximately 1 km on a side. Say we were to have measured these cells even more finely (perhaps at 6 by 6 resolution, i.e., blocks approximately 1/6 km on a side). If our statistical inference was valid before, does it remain valid?
Show this, either (a) formally by manipulating the expressions for cluster-robust sandwich variance-covariance estimators or (b) informally using a numerical example.
Yes: if our statistical inference was valid when we used the 36x36 block resolution and clustered on district, it remains valid if we measure cells more finely at, say, 6x6 resolution and still cluster at the district level. Imagine one extreme in which all of the 6x6 sub-blocks within an original block had the same amount of forest loss – then the model errors are even more correlated within cluster, so a new observation in the district does not yield an entirely “new” (i.e., independent) source of information; it is just another copy of an existing observation. We can simulate this case by replicating each of the original observations five times, as below. Then, we re-run our original analysis with clustering and obtain the same estimate for beta and the same standard error.
pd_rep_1 <- pd[rep(seq_len(nrow(pd)), each = 5), ]  # each original row now appears 5 times
model_cluster_rep = feols(
formula_1,
data = pd_rep_1,
cluster = "district_mode"
)
## NOTE: 280 observations removed because of NA values (vcov: 280).
summary(model_cluster_rep)
## OLS estimation, Dep. Var.: prop_loss
## Observations: 10,398,080
## Fixed-effects: cell_number_36x36: 148,544, year: 14
## Standard-errors: Clustered (district_mode)
## Estimate Std. Error t value Pr(>|t|)
## is_county_post 0.015102 0.005274 2.86317 0.0064014 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.040136 Adj. R2: 0.645758
## Within R2: 0.008771
Now imagine the alternative in which some of the 6x6 sub-blocks have different amounts of forest loss. We can simulate this case by again making five copies of each observation, but perturbing two of them up and down by the same amount, so that the copies still average to the same value as the original 36x36 block. Then, we re-run our original analysis with clustering, and again obtain the same estimate for beta and the same standard error.
pd_rep_2 <- pd[rep(seq_len(nrow(pd)), each = 3), ]  # three exact copies of each row
pd_up <- pd
pd_up["prop_loss"] <- pd["prop_loss"] + 1    # fourth copy perturbed up
pd_down <- pd
pd_down["prop_loss"] <- pd["prop_loss"] - 1  # fifth copy perturbed down, so the five copies average to the original
pd_noise <- rbind(pd_up, pd_down)
pd_rep_2 <- rbind(pd_rep_2, pd_noise)
model_cluster_rep_2 = feols(
formula_1,
data = pd_rep_2,
cluster = "district_mode"
)
## NOTE: 280 observations removed because of NA values (vcov: 280).
summary(model_cluster_rep_2)
## OLS estimation, Dep. Var.: prop_loss
## Observations: 10,398,080
## Fixed-effects: cell_number_36x36: 148,544, year: 14
## Standard-errors: Clustered (district_mode)
## Estimate Std. Error t value Pr(>|t|)
## is_county_post 0.015102 0.005274 2.86317 0.0064014 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.491539 Adj. R2: -0.002042
## Within R2: 5.899e-5
This result – that the standard error remains the same regardless of how finely the individual cells are measured – holds in general: replicating each observation \(k\) times multiplies the double sum of within-cluster terms in the numerator by \(k^2\) and the squared sum in the denominator by \(k^2\) as well, so the factors cancel in \[ \hat{V}_{\text{clu}}[\hat{\beta}] = \frac{\sum_i \sum_j x_i x_j \, E[u_i u_j] \, 1[i,j \text{ in same cluster}]}{\left( \sum_i x_i^2 \right)^2}. \]
The authors include an analysis using an estimate of the variance-covariance matrix that is robust to spatial autocorrelation, based on Conley (1999).
We can think about this as using a different selector matrix \(\mathbf{S}^{d}\) that includes products of residuals for observations that are within some distance of each other. More generally, this need not be a binary matrix, but can be a matrix with entries in \([0, 1]\) where the entries decrease in distance.
Do a similar analysis using a matrix \(\mathbf{S}^{d}\) based on spatial distance between cells. You may find that there are computational challenges in doing this. It is possible to do this analysis using only a sample of pairs of observations. See Eckles, Kizilcec & Bakshy (2016, SI \(\S\) 5) for details on downsampling for such estimators.
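As one possible shortcut (in place of the manual downsampled selector-matrix computation suggested above), recent versions of the fixest package provide a built-in Conley-type spatially robust estimator. A hedged sketch follows: the argument names reflect our reading of ?vcov_conley, avg_y/avg_x are treated as latitude/longitude in degrees, and the 25 km cutoff is purely illustrative rather than the paper's choice.
# sketch: Conley (1999)-style spatially robust SEs via fixest's built-in helper
# (if vcov_conley() is unavailable in your fixest version, fall back to the
#  downsampled selector-matrix approach described above)
summary(
  model_iid,
  vcov = vcov_conley(lat = "avg_y", lon = "avg_x", cutoff = 25)
)
Larger cutoffs allow correlation between more distant cells and will typically, though not mechanically, yield larger standard errors.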
A design-based perspective we could take here is to think about the boundary between the “County” and “Hinterland” areas as being haphazard, perhaps even random. This could motivate using (Fisherian) randomization inference, whereby we would draw many such boundaries, recompute our key statistic(s) using each, and compare our observed statistic(s) to those from this simulated null distribution.
Implement this and compare with prior results. There may be some choices about what the right distribution over possible random boundaries is, so say a little about why you choose the one you do.
You might consider using the package ri2 for some of this, though this is nonstandard enough that some custom coding will be required.
Following the approach outlined above, we proceed as follows:
# bounding box of the study area
x_min = min(pd["avg_x"])
x_max = max(pd["avg_x"])
y_min = min(pd["avg_y"])
y_max = max(pd["avg_y"])
# draw random candidate boundaries as lines y = a * x + b, with the intercept
# drawn from the observed range of latitudes and the slope constrained so the
# line passes through the study area
n_draws = 100
b_vals <- runif(n_draws, y_min, y_max)
a_max <- (y_min - b_vals) / x_max
a_min <- (y_max - b_vals) / x_min
a_vals <- runif(n_draws, a_min, a_max)
We define the test statistic as the absolute difference in average forest loss between the two regions separated by the candidate boundary, \(t = |propLoss_{r_2} - propLoss_{r_1}|\), where \(r_1\) and \(r_2\) are the regions on either side of the boundary (computed here for a post-crisis year).
x_min = min(pd["avg_x"])
x_max = max(pd["avg_x"])
y_min = min(pd["avg_y"])
y_max = max(pd["avg_y"])
n_draws = 100
sim_vals = c()
for (x in 1:n_draws) {
  # draw a random boundary line y = a * x + b
  b = runif(1, y_min, y_max)
  a_max <- (y_min - b)/x_max
  a_min <- (y_max - b)/x_min
  a <- runif(1, a_min, a_max)
  # split the post-crisis cross-section into the two regions
  r1_loss_df = pd %>% filter(year == 2009) %>%
    filter(avg_y > a*avg_x + b)
  r2_loss_df = pd %>% filter(year == 2009) %>%
    filter(avg_y <= a*avg_x + b)
  if (nrow(r1_loss_df) == 0 | nrow(r2_loss_df) == 0) {
    next
  }
  # average forest loss in each region
  r1_loss = mean(r1_loss_df$prop_loss, na.rm = TRUE)
  r2_loss = mean(r2_loss_df$prop_loss, na.rm = TRUE)
  t = abs(r2_loss - r1_loss)
  sim_vals <- c(sim_vals, t)
}
est = 0.015  # observed is_county_post estimate from the TWFE model above
ggplot(data.frame(sim_vals), aes(sim_vals)) +  # histogram of the simulated null distribution
  geom_histogram(bins = 10) +
  geom_vline(xintercept = est, color = "red", linetype = "dashed")  # observed estimate
To quantify this finding, we calculate the p-value for the observed statistic as \(\frac{\#\text{ simulated statistics above observed value}}{\#\text{ total simulations}}\) and obtain a p-value of essentially 1: nearly all of the simulated statistics exceed the observed estimate. This is a clear failure to reject the null hypothesis and calls into question the significance of the paper’s results.
p_val = sum(sim_vals > est) / length(sim_vals)
print(p_val)
## [1] 1
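A common convention in randomization inference (our note, not in the original write-up) is to include the observed statistic in the reference set, which keeps the estimated p-value strictly inside (0, 1):
# add-one adjustment: count the observed statistic as one of the draws
p_val_adj = (sum(sim_vals >= est) + 1) / (length(sim_vals) + 1)
print(p_val_adj)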