Homework week six

Outline

This week’s comprehension homework is based on Acemoglu, Johnson and Robinson (2001), “The Colonial Origins of Comparative Development: An Empirical Investigation”. This task has been posted late, so it will be due on Tuesday, alongside the computational task.

Don’t spend too much time on it. Give the paper a read, and go over the parts you don’t understand a couple of times. If you have any conceptual issues, then please post them in the homework channel on Slack. Each question really only needs a couple of lines written.

Part 1: Comprehension

What is the central point of the paper?
What variables do the authors use to proxy for current institutions?
Why can’t we just look at the simple relationship between these variables and today’s development? Isn’t that the causal relationship?
What is the instrumental variable the authors describe?
What is their exclusion restriction? Do you find it plausible?

Part 2: Using Acemoglu, Johnson and Robinson’s data to update our model from last week

This task asks you to implement a very simple Instrumental Variables model to help estimate the parameters of the Solow model. As a base, use the code snippet below, which constructs the data we used to generate the model in class.

In last week’s homework, we were estimating the model (for each country \(i\))

\[ \frac{Y_{i}}{POP_{i}A_{i}} = \left(\frac{s_{i}}{n_{i} + g + \delta}\right)^{\frac{\alpha}{1-\alpha}} \]

where \(Y_{i}\) is country \(i\)’s GDP, \(POP_{i}\) is its population, \(A_{i}\) is the technology available to the country, \(s_{i}\) is its savings/investment rate, \(n_{i}\) is its population growth rate, \(g\) is the growth rate of population-augmenting technology (since we’ve divided GDP by population, not workforce), and \(\delta\) is the depreciation rate. We assume that \(g\) and \(\delta\) are the same in all countries.

Multiplying both sides by \(A_{i}\) and taking logs, we get

\[ \ln\left(\frac{Y_{i}}{POP_{i}}\right) = \ln(A_{i}) + \frac{\alpha}{1-\alpha}\ln(s_{i}) - \frac{\alpha}{1-\alpha}\ln(n_{i} + g + \delta) \]
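The intermediate steps behind this equation can be written out as follows. Multiplying the original model through by \(A_{i}\) gives output per person:

\[ \frac{Y_{i}}{POP_{i}} = A_{i}\left(\frac{s_{i}}{n_{i} + g + \delta}\right)^{\frac{\alpha}{1-\alpha}} \]

Taking logs and using \(\ln(ab) = \ln a + \ln b\), \(\ln(a^{c}) = c\ln a\) and \(\ln(a/b) = \ln a - \ln b\) then gives

\[ \ln\left(\frac{Y_{i}}{POP_{i}}\right) = \ln(A_{i}) + \frac{\alpha}{1-\alpha}\left[\ln(s_{i}) - \ln(n_{i} + g + \delta)\right] \]

and expanding the bracket produces the two separate terms in the log equation above.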

This looks very similar to a regression equation. If we assume that \(\ln(A_{i}) = \hat{A} + \epsilon_{i}\) and substitute this in, then we have our regression equation. The identifying restriction (that is, the assumption that lets a linear regression correctly estimate \(\hat{A}\) and \(\alpha\)) is that whatever is in \(\epsilon_{i}\) (you can think of this as being everything else) is not systematically correlated with \(s_{i}\) or \(n_{i}\).

In class, we estimated this model and ended up with an \(\alpha\) estimate of about 0.6, which seemed too high. This would be the case if the sorts of things that push up savings or decrease population growth also push up GDP for reasons other than their impacts on savings and population growth. These “other things” are called confounders. In particular, we are concerned that unobserved institutions are causing an increase in GDP, a decrease in population growth, and an increase in savings all at once, so that some of the relationship we observe between the variables is not the true causal relationship.
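A small simulation can make the confounding story concrete. The sketch below does not use the class data: every number in it (the true \(\alpha\) of 0.3, the strength of the made-up `inst` variable, the sample size) is an assumption chosen purely for illustration.

```r
# Illustrative simulation only: all parameter values are assumptions,
# not estimates from the Penn World Table data.
set.seed(42)
n_countries <- 5000
alpha_true  <- 0.3
slope_true  <- alpha_true / (1 - alpha_true)

# Hypothetical institutional quality; the naive regression never sees it
inst <- rnorm(n_countries)

# Better institutions raise savings, lower population growth, and raise ln(A)
ln_s   <- log(20) + 0.3 * inst + rnorm(n_countries, sd = 0.3)
ln_ngd <- log(8) - 0.05 * inst + rnorm(n_countries, sd = 0.1)
ln_A   <- 3 + 0.5 * inst + rnorm(n_countries, sd = 0.3)

# Log output per person from the log-linear Solow equation
ln_y <- ln_A + slope_true * ln_s - slope_true * ln_ngd

# Convert a coefficient on ln_s, b = alpha / (1 - alpha), back into alpha
implied_alpha <- function(b) unname(b / (1 + b))

naive      <- lm(ln_y ~ ln_s + ln_ngd)         # omits the confounder
controlled <- lm(ln_y ~ ln_s + ln_ngd + inst)  # includes it

alpha_naive      <- implied_alpha(coef(naive)["ln_s"])
alpha_controlled <- implied_alpha(coef(controlled)["ln_s"])

print(c(true = alpha_true, naive = alpha_naive, controlled = alpha_controlled))
```

In this made-up setting the naive regression should overstate \(\alpha\), because the coefficient on `ln_s` also picks up the effect of institutions on technology. Including the confounder should bring the implied \(\alpha\) back close to the assumed true value.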

Let’s propose that instead of saying that

\[ \ln(A_{i}) = \hat{A} + \epsilon_{i} \]

we say that

\[ \ln(A_{i}) = \bar{A} + \beta\,\mbox{aveexpr}_{i} + \eta_{i} \]

where aveexpr is a score of the average protection against expropriation risk from 1985–1995, and \(\beta\) is the coefficient on it, assumed to be the same for all countries. We’ll use this variable as a proxy for institutional quality.

Task 1

Re-run the code posted below. Make sure that you’re comfortable with what is happening in each line.
Load the data I have posted on Slack. This is based on the data provided alongside the paper. To load the data file, simply run load("ajr.RData"). You don’t need to assign the result to a new variable: load() restores the saved objects into your workspace under their original names.
Join this data onto your main datafile using the left_join() function. A note on this is below.
Run a linear regression as in the code below, but this time including our institution proxy.
Do the unrestricted estimates imply similar values of alpha?
Re-run the non-linear model, this time including aveexpr as in the model described above. What happens to our estimate of alpha?
What is the effect of increasing the savings rate in Australia by 1 percentage point? (Note: you should run the simulations from the model as in the code below, but with Australia’s values of \(s\), \(n\) and aveexpr.)
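The regression steps in this task can be sketched on simulated data. This is a minimal illustration, not the solution: the data frame `toy`, the parameter name `beta`, and all the numbers used to generate the data are assumptions, and the real exercise would use pwt.2 joined with the AJR data.

```r
# Illustrative data only: every number below is an assumption.
set.seed(1)
n_countries <- 2000
toy <- data.frame(aveexpr = runif(n_countries, 3, 10))  # assumed proxy scale
toy$ln_s   <- log(15) + 0.08 * toy$aveexpr + rnorm(n_countries, sd = 0.3)
toy$ln_ngd <- log(9)  - 0.02 * toy$aveexpr + rnorm(n_countries, sd = 0.1)
# Assumed truth: alpha = 0.3 and a coefficient of 0.4 on the proxy
toy$ln_y <- 1 + 0.4 * toy$aveexpr + (0.3 / 0.7) * (toy$ln_s - toy$ln_ngd) +
  rnorm(n_countries, sd = 0.5)

# Unrestricted linear model, now including the institutional proxy
mod_lin <- lm(ln_y ~ ln_s + ln_ngd + aveexpr, data = toy)

# Each slope implies its own alpha:
#   b_s = alpha / (1 - alpha)  and  b_ngd = -alpha / (1 - alpha)
b_s   <- unname(coef(mod_lin)["ln_s"])
b_ngd <- unname(coef(mod_lin)["ln_ngd"])
alpha_from_s   <- b_s / (1 + b_s)
alpha_from_ngd <- -b_ngd / (1 - b_ngd)

# Restricted non-linear model: a single alpha shared by both terms
mod_nls <- nls(ln_y ~ A + beta * aveexpr + (alpha / (1 - alpha)) * (ln_s - ln_ngd),
               data = toy,
               start = list(A = 0, beta = 0, alpha = 0.3))

print(c(alpha_from_s = alpha_from_s, alpha_from_ngd = alpha_from_ngd))
print(coef(mod_nls))
```

Comparing `alpha_from_s` with `alpha_from_ngd` is one way to judge whether the unrestricted estimates imply similar values of alpha. The restricted model forces them to agree.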

Note on left_join

Often we have two datasets that have a column in common, and we want to “join” the datasets. That is, we want to keep all the observations in the “left” dataset, and match these observations to the ones in the “right” dataset. To do this, we use left_join(), which lives in the dplyr library. For example, if dataset1 contains a column called country and a column called GDP, and dataset2 contains a column called country and a column called Tax rate, then we can join the two using

dataset3 <- left_join(dataset1, dataset2)

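where dataset3 will now contain three columns, one called country, one called GDP and one called Tax rate. Countries that appear in dataset1 but not in dataset2 are kept, with a missing value (NA) for Tax rate.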
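A tiny made-up example shows the behaviour. The country names and values here are invented, and the column is called `tax_rate` (without a space) only to keep the code simple.

```r
library(dplyr)

# Two made-up datasets sharing a `country` column (values are illustrative only)
dataset1 <- data.frame(country = c("Australia", "Brazil", "Canada"),
                       GDP     = c(100, 60, 95))
dataset2 <- data.frame(country  = c("Australia", "Canada", "Denmark"),
                       tax_rate = c(30, 28, 45))

# Keep every row of dataset1; attach tax_rate wherever the country matches
dataset3 <- left_join(dataset1, dataset2, by = "country")
print(dataset3)
```

Brazil has no match in dataset2, so it is kept with a missing tax_rate. Denmark appears only in dataset2, so it is dropped. Naming the key explicitly with `by = "country"` is optional here, but it avoids accidentally joining on other columns that happen to share a name.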

For a more detailed description, see https://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html

Task 2 (optional but encouraged)

Following the paper, use the log of settler mortality (logem4) to instrument for our institutional proxy. There is an illustration of how to use the ivreg() function in library(AER) to estimate an unrestricted model. Does instrumenting change our unrestricted estimates, and if so, how?
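The logic of instrumenting can be sketched with simulated data. This is an illustration under stated assumptions, not the paper’s analysis: the variable names mirror the homework, but the data frame `sim`, the confounder `u`, and every number are invented. The two stages are run by hand with lm(), and the AER call is only attempted if that package is installed.

```r
# Illustrative simulation only: every number below is an assumption.
set.seed(7)
n_countries <- 5000
sim <- data.frame(
  logem4 = rnorm(n_countries, mean = 4.5, sd = 1.2),  # stand-in for log settler mortality
  u      = rnorm(n_countries)                         # unobserved confounder
)
# Institutions are worse where settler mortality was higher, and also depend on u
sim$aveexpr <- 10 - 0.6 * sim$logem4 + sim$u + rnorm(n_countries)
# Assumed true effect of institutions on log income is 0.5; u also raises income
sim$ln_y <- 1 + 0.5 * sim$aveexpr + sim$u + rnorm(n_countries)

# Ordinary least squares: biased upwards because u is omitted
ols <- lm(ln_y ~ aveexpr, data = sim)

# Two-stage least squares by hand
stage1 <- lm(aveexpr ~ logem4, data = sim)       # first stage: proxy on instrument
sim$aveexpr_hat <- fitted(stage1)
stage2 <- lm(ln_y ~ aveexpr_hat, data = sim)     # second stage: outcome on fitted proxy

b_ols <- unname(coef(ols)["aveexpr"])
b_iv  <- unname(coef(stage2)["aveexpr_hat"])
print(c(assumed_truth = 0.5, ols = b_ols, iv = b_iv))

# Same point estimate, with appropriate standard errors, if AER is available
if (requireNamespace("AER", quietly = TRUE)) {
  iv_fit <- AER::ivreg(ln_y ~ aveexpr | logem4, data = sim)
  print(coef(iv_fit))
}
```

The standard errors from the hand-rolled second stage are not valid, which is one reason to prefer ivreg() in practice. In an ivreg() formula the instruments go after the `|`, and any exogenous regressors have to be repeated there, so a model with savings and population growth would look something like `ln_y ~ ln_s + ln_ngd + aveexpr | ln_s + ln_ngd + logem4`.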

The code used to construct the data

# Load the libraries
library(ggplot2); library(dplyr)

# Read the data
pwt <- read.csv("pwt71.csv")

# Filter out the observations outside the period I'm interested in
pwt.ss <- pwt %>% filter(year<=2010 & year>=1985)

# Generate our data for the regression
pwt.2 <- pwt.ss  %>% group_by(isocode) %>% # For each country
  summarise(s = mean(ki), # What was the average investment?
            y = last(y), # The last GDP per person relative to the US?
            n = 100*log(last(POP)/first(POP))/(n() - 1)) %>% # Population growth rate?
  filter(!is.na(s)) %>% # Get rid of missing rows of s
  mutate(ln_y = log(y), # Create new columns- log of y
         ln_s = log(s), # Log of s
         ln_ngd = log(n + 1.6 + 5)) # Log of n + g + delta

# Modelling! --------------------------------------------------------------

# Run the linear model (unrestricted)
mod1 <- lm(ln_y ~ ln_s + ln_ngd, data = pwt.2)
# Take a look at the parameter estimates
summary(mod1)

# Run the restricted parameter model
mod2 <- nls(ln_y ~ A + (alpha/(1-alpha))*ln_s - (alpha/(1-alpha))*ln_ngd, 
            data = pwt.2,
            start = list(A = 11, alpha = 0.3))
# Take a look at the parameters
summary(mod2)


# Simulate a new country called Straya -----------------------------------------------

# Exogenous variables
s_current <- 26
n_current <- 1.6

# Parameters of the model
A <- coef(mod2)["A"]
alpha <- coef(mod2)["alpha"]
se <- summary(mod2)$sigma # Residual standard error, as reported by summary(mod2)


# Simulate new country 100,000 times (benchmark/baseline/BAU)

straya_1 <- rnorm(100000, # Generate 100k new observations
                  mean = A + (alpha/(1-alpha))*log(s_current) - (alpha/(1-alpha))*log(n_current + 1.6 + 5),
                  sd = se)

# Plot a histogram
hist(exp(straya_1), xlim = c(0, 200), breaks = 100)

# Simulate with new savings rate
s_new <- 27

straya_2 <- rnorm(100000,
                  mean = A + (alpha/(1-alpha))*log(s_new) - (alpha/(1-alpha))*log(n_current + 1.6 + 5),
                  sd = se)
hist(exp(straya_2), xlim = c(0, 200), breaks = 100)

# What is the difference in median simulated log income between the scenarios?
# (The simulations are in logs, so this is approximately the proportional change)
median(straya_2) - median(straya_1)
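Because the two scenarios differ only in the savings term, the simulated difference in medians can be cross-checked analytically. The sketch below assumes \(\alpha = 0.6\), roughly the estimate mentioned in the text. The value from your own fitted model will differ.

```r
# Hypothetical alpha, taken from the "0.6 or so" estimate mentioned in class
alpha     <- 0.6
s_current <- 26
s_new     <- 27

# Only the savings term differs between the scenarios, so the gap in
# expected log income is (alpha / (1 - alpha)) * (log(s_new) - log(s_current))
log_gap <- (alpha / (1 - alpha)) * (log(s_new) - log(s_current))

# Convert the log gap into a percentage change in the income level
pct_change <- 100 * (exp(log_gap) - 1)
print(round(c(log_gap = log_gap, pct_change = pct_change), 3))
```

With 100,000 draws per scenario, the simulated difference in medians should land close to `log_gap`, up to simulation noise.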