Due Friday, April 15th.

This assignment involves implementing data analysis for a geographic quasi-experiment. In particular, we will work with some reanalysis of this study we read and discussed in class: Christensen, D., Hartman, A. C., & Samii, C. (2021). Legibility and External Investment: An Institutional Natural Experiment in Liberia. International Organization, 75(4), 1087-1108.

We have provided a set of files for you to work with. You don’t need to use the other replication materials from the paper, and would suggest you try doing this without consulting them.

Packages

Our starter code also makes use of a few packages, including fixest. You can use another regression package, like lfe or base lm with sandwich or clubSandwich.

Data

pd = readRDS("panel.rds")
head(pd)

This is panel data for the years 2001-2014. The authors study the effects of property right systems on external land investment (measured by fraction of forest cleared in 36x36 blocks), specifically around the time of the Global Food Crisis in 2007.

The institutional boundary is 40 miles inland from the southern coast, and in their main analysis they analyze counties within 40 km (25 miles) of the boundary. They also identify “Southern Counties” – Grand Gedeh, Grand Kru, Maryland, Sinoe, and River Gee – where this institutional boundary is less salient, and is excluded in certain anaylsis.
- cell_number_36x36: unique identifier for each block of 36x36 cells.
- prop_loss: outcome variable, fraction of cells that have forest cleared (each indv. cell’s loss is binary 0/1). Despite the naming, each 36x36 block does not have 1296 cells. In fact, most blocks have 648 and as few as 4 cells. - is_county: indicator for “County” areas. - is_county_post: indicator that takes a 1 for County Area starting in 2008.
- avg_running_var: distance from the institutional border. Positive values indicates cell is above the border (i.e. “Hinterland”), negative values indicate cell is below the border (i.e. “County”).

Here is the area covered by the data. Note that some areas have somewhat less dense data.

ggplot(
  aes(x = avg_x, y = avg_y),
  data = pd %>% filter(year == 2004)
  ) +
  stat_bin_hex(bins = 100) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
    )

Here we compute a series of points approximately specifying the border and plot this border over the different counties in this data.

boundary = pd %>%
  filter(year == 2004, is_county == 1) %>%
  mutate(
    x = round(avg_x, 2)
  ) %>%
  filter(x > -11.15) %>%
  group_by(x) %>%
  summarise(
    y = max(avg_y)
  )


ggplot(
  aes(x = avg_x, y = avg_y, color = county_mode),
  data = pd %>% filter(year == 2004)
  ) +
  geom_point(size = .0001) +
  geom_line(
    aes(x = x, y = y),
    inherit.aes = FALSE,
    data = boundary,
    size = 1
    ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none"
    )

Analysis

In this assignment, we are going to assess the results considered in the paper’s Table 1. These are results about effects on forest loss following the food crisis.

The basic model used here is this two-way fixed effects model:

formula_1 = prop_loss ~ is_county_post | cell_number_36x36 + year 

model_iid = feols(
  formula_1,
  data = pd
)
pd['residuals'] <- c(residuals(model_iid))
summary(model_iid)
## OLS estimation, Dep. Var.: prop_loss
## Observations: 2,079,672 
## Fixed-effects: cell_number_36x36: 148,548,  year: 14
## Standard-errors: Clustered (cell_number_36x36) 
##                Estimate Std. Error t value  Pr(>|t|)    
## is_county_post 0.015102   0.000304 49.6596 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.040135     Adj. R2: 0.623956
##                  Within R2: 0.008772

Now consider different ways of doing statistical inference here.

Question 1

The analysis above is valid only if we assume that observations are independent. In the paper the authors report “cluster-robust” standard errors, where the clusters are districts (district_mode).


A useful presentation of these kind of dependency-robust variance-covariance estimators is: \[ \hat{V}[\mathbf{\hat{\beta}}] = (\mathbf{X}'\mathbf{X})^{-1} \hat{\mathbf{B}} (\mathbf{X}'\mathbf{X})^{-1} \] where the central matrix is often written \[ \mathbf{\hat{B}} = \sum_{g=1}^G \mathbf{X}'_g \mathbf{\hat{u}}_g \mathbf{\hat{u}}'_g \mathbf{X}_g \] where \(\hat{u}_i = y_i - \mathbf{x}'_i \mathbf{\hat{\beta}}\) is the \(i\)th unit’s residual, subscripts \(g\) pick out those elements from cluster \(g\), and this is a sum over the \(G\) clusters.

We can write this a bit more generally by thinking about the vector of all residuals \(\mathbf{\hat{u}}\) and the matrix of all pairs of residuals \(\mathbf{\hat{u}} \mathbf{\hat{u}}'\). Then this central matrix can be written as \[ \mathbf{\hat{B}} = \mathbf{X}'(\mathbf{\hat{u}} \mathbf{\hat{u}}' \odot \mathbf{S}^G) \mathbf{X} \] where \(\odot\) denotes element-wise multiplication. \(\mathbf{S}^G\) is a ``selector’’ matrix that is 1 when the row and column elements are in the same cluster (i.e., \(S^G_{ij} = 1_{G(i) = G(j)}\)). That is, this matrix indicates which pairs of residuals get summed up and which do not. (This discussion follows Cameron, Gelbach & Miller (2011) and Eckles, Kizilcec & Bakshy (2016, SI \(\S\) 5).)

This representation can be useful for other sorts of variance-covariance estimators, such as those discussed below in Question 3.


Do a similar cluster-robust analysis and comment on whether this level of clustering makes this analysis defensible. Are there other levels of clustering that you think are preferable?

Answer 1

First, we plot the distribution of model errors over the individual cells for a visual interpretation. As the plot below illustrates, the errors appear heteroskedastic and correlated over cells, as we see patches of yellow, which indicates higher residuals in these areas. This correlation of model errors by location motives some type of cluster-robust analysis.

df_heat <- data.frame(
  x= rep(round(pd["avg_x"],2)), 
  y= rep(round(pd["avg_y"],2)),
  error = model_iid["residuals"]
)

ggplot(df_heat, aes(x = as.factor(avg_x), y = as.factor(avg_y), fill = residuals)) +
  geom_tile() + 
  labs(x='avg_x', y= 'avg_y') +
  scale_fill_viridis_c() + 
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank())

Following the authors’ approach, we begin by clustering on the district level and compute the standard errors as:

model_cluster_district = feols(
  formula_1,
  data = pd,
  cluster = "district_mode"
)
## NOTE: 56 observations removed because of NA values (vcov: 56).
summary(model_cluster_district)
## OLS estimation, Dep. Var.: prop_loss
## Observations: 2,079,616 
## Fixed-effects: cell_number_36x36: 148,544,  year: 14
## Standard-errors: Clustered (district_mode) 
##                Estimate Std. Error t value  Pr(>|t|)    
## is_county_post 0.015102   0.005274 2.86316 0.0064016 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.040136     Adj. R2: 0.623956
##                  Within R2: 0.008771

Using this methodology, we obtain a greater a greater standard error (0.005) than in the original OLS estimation (0.0001). This standard error is about 50x larger than the one in the original OLS.

Notably, this type of cluster-robust analysis relies on some assumptions. For example, it assumes that model errors for grid cells in different districts are uncorrelated. Intuitively, though, the forest loss for adjacent districts, even if they are in different districts, might be highly correlated, which could violate the assumption necessary for this analysis and still lead to an underestimation of the standard error. As a test of this assumption, we again plot the distribution of model errors, but this time over the individual districts (instead of cells) for a visual interpretation. The following plot suggests that the errors are still correlated over districts – note the proximity of districts with higher errors to each other – which calls into question this level of clustering.

df_heat_district = pd %>% group_by(district_mode)  %>%
  summarise(
    avg_x = mean(avg_x),
    avg_y = mean(avg_y),
    avg_error = mean(residuals),
)

ggplot(df_heat_district, aes(x = as.factor(avg_x), y = as.factor(avg_y), fill = avg_error)) +
  geom_tile() + 
  labs(x='avg_x', y= 'avg_y') +
  scale_fill_viridis_c() + 
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank())

Given the limitations of this approaches, as outlined in Cameron and Miller, we could additionally learn a model for the correlation of errors within-districts based on other covariates in the data such as avg_x, avg_y, year, and perhaps any_agriculture, any_forestry, and any_mining. as well. Then, instead of using ordinary least squares (OLS) regression to estimate forest loss, use feasible generalized least squares (FGLS) regression to estimate it.

Question 2

The 36 by 36 grid cells here are approximately 1 km on a side. Say we were to have measured these cells even more finely (perhaps at 6 by 6 resolution, i.e. approximately). If our statistical inference was valid before, does it remain valid?

Show this, either (a) formally by manipulating the expressions for cluster-robust sandwich variance-covariance estimators or (b) informally using a numerical example.

Answer

Yes, if our statistical inference were valid when we used the 36x36 grid cell resolution and clustered on district, it does remain valid if we measure cells more finely at a level of, say 6x6 cell resolution, and cluster on the same district level. Imagine one extreme in which all of the 6x6 cells had the same amount of forest loss – then the model errors are even more correlated within cluster, so a new observation in the district does not yield an entirely “new” (i.e. independent) source of information – in fact, we see 5 other copies! We can simulate this case by creating 5 additional copies of each of the original cells as below. Then, we re-run our original analysis with clustering, and obtain the same estimates for beta and the standard error.

pd_rep_1 <- pd[rep(seq_len(nrow(pd)), each = 5), ]
model_cluster_rep = feols(
  formula_1,
  data = pd_rep_1,
  cluster = "district_mode"
)
## NOTE: 280 observations removed because of NA values (vcov: 280).
summary(model_cluster_rep)
## OLS estimation, Dep. Var.: prop_loss
## Observations: 10,398,080 
## Fixed-effects: cell_number_36x36: 148,544,  year: 14
## Standard-errors: Clustered (district_mode) 
##                Estimate Std. Error t value  Pr(>|t|)    
## is_county_post 0.015102   0.005274 2.86317 0.0064014 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.040136     Adj. R2: 0.645758
##                  Within R2: 0.008771

Now imagine the alternative in which some of the 6x6 cells had different amounts of forest loss. We can simulate this case by defining two of the 6x6 cells such that they have different values, but they average to the same value as the original 36x36 cell. Then, we re-run our original analysis with clustering, and again obtain the same estimates for beta and the standard error.

pd_rep_2 <- pd[rep(seq_len(nrow(pd)), each = 3), ]
pd_up <- pd
pd_up["prop_loss"] <- pd["prop_loss"] + 1
pd_down <- pd
pd_down["prop_loss"] <- pd["prop_loss"] + 1
pd_noise <- rbind(pd_up, pd_down)
pd_rep_2 <- rbind(pd_rep_2, pd_noise)

model_cluster_rep_2 = feols(
  formula_1,
  data = pd_rep_2,
  cluster = "district_mode"
)
## NOTE: 280 observations removed because of NA values (vcov: 280).
summary(model_cluster_rep_2)
## OLS estimation, Dep. Var.: prop_loss
## Observations: 10,398,080 
## Fixed-effects: cell_number_36x36: 148,544,  year: 14
## Standard-errors: Clustered (district_mode) 
##                Estimate Std. Error t value  Pr(>|t|)    
## is_county_post 0.015102   0.005274 2.86317 0.0064014 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.491539     Adj. R2: -0.002042
##                  Within R2:  5.899e-5

The results – that the standard error remains the same regardless of the size of the individual cell – hold in general, because the additional terms in the numerator of the selector matrix cancel out with those in the denominator: \(clu[\hat{\beta}]=\frac{\big( \sum_i \sum_j x_ix_j E[u_iu_j]1[i,j \text{ in same cluster}]\big)}{\big( \sum_i x_i^2\big)^2}\).

Question 3 (optional, bonus)

The authors include an analysis using an estimate of the variance-covariance matrix that is robust to spatial autocorrelation, based on Conley (1999).

We can think about this as using a different selector matrix \(\mathbf{S}^{d}\) that includes products of residuals for observations that are within some distance of each other. More generally, this need not be a binary matrix, but can be a matrix with entries in \([0, 1]\) where the entries decrease in distance.

Do a similar analysis using a matrix \(\mathbf{S}^{d}\) based on spatial distance between cells. You may find that there are computational challenges in doing this. It is possible to do this analysis using only a sample of pairs of observations. See Eckles, Kizilcec & Bakshy (2016, SI \(\S\) 5) for details on downsampling for such estimators.

Question 4

A design-based perspective we could take here is to think about the boundary between the “County” and “Hinterland” areas as being haphazard, perhaps even random. This could motivate using (Fisherian) randomization inference, whereby we would draw many such boundaries, recompute our key statistic(s) using each, and compare our observed statistic(s) to those from this simulated null distribution.

Implement this and compare with prior results. There may be some choices about what the right distribution over possible random boundaries is, so say a little about about why you choose the one you do.

You might consider using the package ri2 for some of this, though this is nonstandard enough that some custom coding will be required.

Answer 4

Following the approach outlined above, we proceed as follows:

  1. Draw many such possible boundaries. Here, we assume that all boundaries are linear, so follow the form \(y \sim ax + b\). We simulate possible boundaries by drawing \(a \sim Uniform(y_{min}, y_{max})\) and \(b \sim Uniform(b_{min}, b_{max})\). We choose the range for the intercept term \(b\) by considering all feasible values of \(y\), and we choose the range for the slope term \(a\) conditional on \(b\) by considering all slopes that results at least some division of land. Concretely, \(a_{max} = \frac{y_{min} - b}{x_{max}}\) and \(a_{min} = \frac{y_{max} - b}{x_{min}}\) because all values for \(x\) are negative.
x_min = min(pd["avg_x"])
x_max = max(pd["avg_x"])
y_min = min(pd["avg_y"])
y_max = max(pd["avg_y"])
n_draws = 100

b_vals <- runif(n_draws, y_min, y_max)
a_max <- (y_min - b_vals)/x_max
a_min <- (y_max - b_vals)/x_min
a_vals <- runif(n_draws, a_min, a_max)
  1. Recompute the key statistic under each of these different boundaries. Here, our key statistic \(t\) is the difference in cumulative forest loss between the regions separated by the boundary. We define cumulative forest loss in region \(i\) as the difference in forest loss after the intervention (2009) versus before the intervention (2007): \(forestLoss_i = propLoss_{2009}^{r_i} - propLoss_{2007}^{r_i}\). As such, our key statistic is \(t = forestLoss_2 - forestLoss_1 = (propLoss_{post}^{r_2} - propLoss_{pre}^{r_2}) - (propLoss_{2009}^{r_1} - propLoss_{2007}^{r_1})\), where \(r_1\) is the region below the boundary and \(r_2\) is the region above the boundary.

We define cumulative forest loss in region \(i\) as the absolute difference in cumulative forest loss as of 2014 in the regions separated by the boundary, so \(t = |propLoss_{r_2} - propLoss_{r_1}|\), where \(r_1\) is the region below the boundary and \(r_2\) is the region above the boundary.

x_min = min(pd["avg_x"])
x_max = max(pd["avg_x"])
y_min = min(pd["avg_y"])
y_max = max(pd["avg_y"])
n_draws = 100

sim_vals = c()
for (x in 1:n_draws) {
  b = runif(1, y_min, y_max)
  a_max <- (y_min - b)/x_max
  a_min <- (y_max - b)/x_min
  a <- runif(1, a_min, a_max)
  
  r1_loss_df = pd %>% filter(year == 2009) %>%
    filter(avg_y > a*avg_x+b) 

  r2_loss_df = pd %>% filter(year == 2009) %>%
    filter(avg_y <= a*avg_x+b) 

  if(nrow(r1_loss_df) == 0 | nrow(r2_loss_df) == 0 ) {
    next
  }

  r1_loss = r1_loss_df$prop_loss[[1]]
  r2_loss = r2_loss_df$prop_loss[[1]]

  t = abs(r2_loss - r1_loss)
  sim_vals <- c(sim_vals, t) 
}
  1. Compare the observed statistic from our OLS model to the statistic generated under this null distribution. The green line indicates the estimate from the paper, and the histogram indicates the null distribution. If the institutional boundary were significant, we would expect the probability of obtaining the observed value under our null distribution to be small (i.e. a value of less than 0.05 for a 95% confidence level), which we do not see in the histogram below. Many of our simulations lead to greater (absolute) differences in forest loss between the two regions.
est = 0.015
ggplot(data.frame(sim_vals), aes(sim_vals)) +    # Draw histogram
  geom_histogram(bins = 10) +
  geom_vline(xintercept = est, color="red", linetype="dashed")

To quantify this find, we calculate the p-value for the observe statistic as \(\frac{\text{# sim above observed value}}{\text{# total sim}}\), and obtain a p-value of 0.9899, which points to a failure to reject the null hypothesis and calls into question the significance of the results of the paper.

p_val = sum(sim_vals > est) / length(sim_vals)
print(p_val)
## [1] 1

References

  • Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2011). Robust inference with multiway clustering. Journal of Business & Economic Statistics, 29(2).
  • Eckles, D., Kizilcec, R.F., and Bakshy, E. (2016). Estimating peer effects in networks with peer encouragment designs. Proceedings of the National Academy of Sciences 113(7316).