This week’s homework is designed to help students get a better grasp of how to use R for basic data analysis, model building, and simulation.

Installation

If you have not already, please install both R and RStudio. R is available from http://cran.r-project.org/ and RStudio is available at https://www.rstudio.com/. You will need to install R before RStudio. If you have difficulties installing, ping me on the R Help Slack channel.

Part 1

Reading in and preparing some data

The data that we will use for the exercise comes from the Penn World Table http://www.rug.nl/research/ggdc/data/pwt/pwt-7.1. We will use it to estimate and simulate a basic Solow-Swan model.

Most of the useful features of R come in libraries: packages of code written by enthusiasts that are freely available. To install new libraries from within RStudio, go to Tools -> Install Packages, and then search for the library by name. You should install the following libraries: ggplot2, which we will use for plotting, and dplyr, a useful package for data summarisation.
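
If you prefer, you can instead run install.packages() from the R console to install both at once:

install.packages(c("ggplot2", "dplyr"))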

First you will need to open a new R script. To do this, go to File -> New File -> R Script. In an R Script, you can write many lines of code—instructions—that stay put once you execute them. This is a useful way of arranging your analysis.

Now load the libraries and the data by writing the following lines. To execute a line, simply press ctrl-enter while your cursor is on it. (If you want to execute multiple lines, or just a part of a line, you can highlight that code and press ctrl-enter.)

library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the Penn World Table data
pwt <- read.csv("pwt71.csv")

# You may want to take a look at the data; you can do this by clicking on the `pwt` object in the Environment, or just type
head(pwt)
##   isocode year      POP   XRAT Currency_Unit ppp tcgdp cgdp cgdp2 cda2 cc
## 1     AFG 1950 8150.368     NA                NA    NA   NA    NA   NA NA
## 2     AFG 1951 8284.473     NA                NA    NA   NA    NA   NA NA
## 3     AFG 1952 8425.333     NA                NA    NA   NA    NA   NA NA
## 4     AFG 1953 8573.217     NA                NA    NA   NA    NA   NA NA
## 5     AFG 1954 8728.408     NA                NA    NA   NA    NA   NA NA
## 6     AFG 1955 8891.209 0.0168       Afghani  NA    NA   NA    NA   NA NA
##   cg ci  p p2 pc pg pi openc cgnp  y y2 rgdpl rgdpl2 rgdpch kc kg ki openk
## 1 NA NA NA NA NA NA NA    NA   NA NA NA    NA     NA     NA NA NA NA    NA
## 2 NA NA NA NA NA NA NA    NA   NA NA NA    NA     NA     NA NA NA NA    NA
## 3 NA NA NA NA NA NA NA    NA   NA NA NA    NA     NA     NA NA NA NA    NA
## 4 NA NA NA NA NA NA NA    NA   NA NA NA    NA     NA     NA NA NA NA    NA
## 5 NA NA NA NA NA NA NA    NA   NA NA NA    NA     NA     NA NA NA NA    NA
## 6 NA NA NA NA NA NA NA    NA   NA NA NA    NA     NA     NA NA NA NA    NA
##   rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te rgdpl2th rgdptt
## 1      NA      NA        NA       NA       NA       NA     NA
## 2      NA      NA        NA       NA       NA       NA     NA
## 3      NA      NA        NA       NA       NA       NA     NA
## 4      NA      NA        NA       NA       NA       NA     NA
## 5      NA      NA        NA       NA       NA       NA     NA
## 6      NA      NA        NA       NA       NA       NA     NA

As you can see, there are many missing observations in the data; don’t worry too much about this yet.

Summarising the data for each country

What we want here:

  • An observation for each country of its average investment rate over 1985-2010
  • An observation for each country of its population growth rate over 1985-2010
  • An observation for each country of its income per capita relative to the United States in 2010.

The dataset contains multiple observations for each country. We want to reduce this to just the data that we need for the main exercise (estimating a basic Solow model). First, let’s “filter” the observations so that we only have the data from 1985-2010. We can do this using the following code:

pwt2 <- pwt %>% filter(year>=1985)

What happens with this syntax? First, we create a new object called pwt2. We take pwt, our dataset, and pass it into the filter function with the “pipe” %>%. The filter command keeps only those observations that have a year value equal to or greater than 1985.

We can “chain” together many functions using this syntax. In particular, we want to use two functions, group_by() and summarise(). A statement like mydata %>% group_by(some_variable) takes some data (here mydata), and “groups” it by a variable called some_variable. You can think of this as being like splitting the dataset up into chunks, each with the same value for some_variable. We can then perform analysis on each of those chunks, and it mashes them back together.

A line like

mydata2 <- mydata %>% group_by(mygroupingvariable) %>%
              summarise(Average_value = mean(Variable_2),
                        Ratio = last(Variable_3)/first(Variable_3))

takes mydata, splits it into groups, each with the same value for mygroupingvariable, and then calculates two summary statistics for each of those groups. The first is called Average_value, which is the mean of Variable_2, and the second is called Ratio, which is the ratio of the last observation of Variable_3 in each group to the first.

Sometimes you want to generate a new variable (column of data) that contains a particular value from your dataset. Say I want a variable that is equal to the value of Variable_8 in the row where Date is 2014-02-08 and Variable_6 is equal to 4. To do this, I use the mutate command:

mydata3 <- mydata2 %>% ungroup %>% # we use ungroup to undo groupings
              mutate(Specialval = Variable_8[Date=="2014-02-08" & Variable_6==4])

Exercise

  • Generate variables

Using the hints above, generate a dataset that contains, for each country, the average investment rate (as a percentage of GDP), the average annual population growth rate (expressed in raw numbers, like \(0.02\)) for the period 1985-2010, and the income per capita (in purchasing power terms) relative to the United States in 2010. (Note I’m looking for a single number for the growth rate: the average annual growth rate over the period. You may need to look up how to calculate a geometric growth rate from a start level and an end level.)

  • Plot the data in scatter plots

You may use either base graphics or ggplot for this. A sketch of one possible approach to both steps follows.
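
One possible approach is sketched below. The column names are assumptions about the PWT 7.1 variable definitions (for example, ki as the investment share of GDP in per cent, POP as population, and y as GDP per capita relative to the US); check the PWT documentation to confirm which columns you want.

solow_data <- pwt2 %>%
  arrange(isocode, year) %>%   # so that first()/last() refer to 1985 and 2010
  group_by(isocode) %>%
  summarise(
    avg_inv    = mean(ki, na.rm = TRUE),   # average investment rate, % of GDP
    pop_growth = (last(POP)/first(POP))^(1/(max(year) - min(year))) - 1,   # geometric average annual growth
    rel_income = y[year == 2010]           # income per capita relative to the US in 2010
  )

# Scatter plots of the summarised data
ggplot(solow_data, aes(x = avg_inv, y = rel_income)) + geom_point()
ggplot(solow_data, aes(x = pop_growth, y = rel_income)) + geom_point()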

Part 2: Modelling

Note the change below. You should first make sure that your y, s, n, g and delta are all on the same scale before taking logs. For instance, y has the US at 100, so s should be in whole percentage points, and n, g and delta should be too.

Estimate two basic Solow models of the following form:

\[ \ln(y_i) = \beta_0 + \beta_1\ln(s_i) + \beta_2 \ln(n_{i} + g + \delta) + \epsilon_i \]

This can be done using the linear model function

mymod <- lm(y ~ x1 + x2 + etc, data = mydata)
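
Applied to the data you built in Part 1, this might look something like the sketch below. The column names (avg_inv, pop_growth, rel_income) and the assumed value for g + delta are illustrative; the output further down uses variables called lnY, lnS and lnngd.

g_delta <- 5   # assumption: g + delta = 5 percentage points, to match the scale note above

solow_data <- solow_data %>%
  mutate(lnY   = log(rel_income),                  # US = 100
         lnS   = log(avg_inv),                     # investment rate in percentage points
         lnngd = log(100*pop_growth + g_delta))    # population growth in percentage points

mymod <- lm(lnY ~ lnS + lnngd, data = solow_data)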

After estimating the model, check the parameter estimates using

summary(mymod)

Do these parameters look sensible?

Also estimate the models

\[ \ln(y_i) = \frac{\alpha}{1-\alpha}\ln(s_i) - \frac{\alpha}{1-\alpha} \ln(n_{i} + g + \delta) + \epsilon_i \]

and (note: this is the specification that has changed)

\[ \ln(y_i) = \ln(A_{0}) + \frac{\alpha}{1-\alpha}\ln(s_i) - \frac{\alpha}{1-\alpha} \ln(n_{i} + g + \delta) + \epsilon_i \]

using the function nls(). We give nls() a formula as before, but this time the formula can be non-linear in the parameters. So if we wanted to estimate a restricted parameter \(\beta\), we could write

mymod2 <- nls(y ~ beta*x1 + (1/beta)*x2, data = mydata,
              start = list(beta = 1))  # give nls a starting value for each parameter

and check the estimate of \(\beta\) using

summary(mymod2)
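
For the Solow specifications above, the nls() calls might look like the following sketch. The formulas match the output shown below; the data object, variable names and starting values are assumptions carried over from the earlier sketches.

# Restricted model (no constant)
solow_nls <- nls(lnY ~ (alpha/(1 - alpha))*lnS - (alpha/(1 - alpha))*lnngd,
                 data = solow_data, start = list(alpha = 0.3))

# Model with a constant, ln(A0)
solow_nls_A0 <- nls(lnY ~ LnA0 + (alpha/(1 - alpha))*lnS - (alpha/(1 - alpha))*lnngd,
                    data = solow_data, start = list(alpha = 0.3, LnA0 = 1))

summary(solow_nls)
summary(solow_nls_A0)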

You should get results that look like this:

## 
## Call:
## lm(formula = lnY ~ lnS + lnngd, data = pwt2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57707 -0.74167  0.02233  0.64023  3.04949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.1422     1.6120   7.532 3.77e-12 ***
## lnS           0.9823     0.1909   5.147 7.87e-07 ***
## lnngd        -5.9296     0.6659  -8.905 1.30e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9918 on 156 degrees of freedom
## Multiple R-squared:  0.4475, Adjusted R-squared:  0.4404 
## F-statistic: 63.18 on 2 and 156 DF,  p-value: < 2.2e-16
## 
## Formula: lnY ~ (alpha/(1 - alpha)) * lnS - (alpha/(1 - alpha)) * lnngd
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## alpha 0.719205   0.007163   100.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 158 degrees of freedom
## 
## Number of iterations to convergence: 4 
## Achieved convergence tolerance: 4.408e-09
## 
## Formula: lnY ~ LnA0 + (alpha/(1 - alpha)) * lnS - (alpha/(1 - alpha)) * 
##     lnngd
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## alpha  0.60637    0.03035  19.978  < 2e-16 ***
## LnA0   1.21510    0.21129   5.751 4.51e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.127 on 157 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 5.851e-09

(Moment comparison) Generate some new countries and see if they look like real countries

  • Using the parameter estimates from the non-linear model along with your country data, use the mutate() function to generate a new column that contains the predicted GDP per capita relative to the US. (Note that the predicted GDP from the model will be the log of the proportion, a negative number for most countries; you will need to take the antilog of this value using exp().)
  • Plot the actual value for each country against the predicted (a sketch follows this list). Does the model fit the data well?
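
As a sketch of these two steps, assuming the model with the constant (solow_nls_A0) and the summarised dataset (solow_data) from the earlier sketches:

a_hat <- coef(solow_nls_A0)["alpha"]
lnA0  <- coef(solow_nls_A0)["LnA0"]

solow_data <- solow_data %>%
  mutate(predicted_lnY = lnA0 + (a_hat/(1 - a_hat))*lnS - (a_hat/(1 - a_hat))*lnngd,
         predicted_y   = exp(predicted_lnY))   # back on the same scale as rel_income

# Actual against predicted, with a 45-degree line for reference
ggplot(solow_data, aes(x = predicted_y, y = rel_income)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1)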

Often, we want to check how well a model is performing by using the model to generate some new random numbers, and compare these to the actuals. This is in the spirit of what Bayesians call posterior predictive checking, and what macroeconomists call moment checking. The model will be more “valid” if the new countries that it generates have a similar distribution of GDP per capita to what we observe in reality.

  • First generate a new column that contains the errors from your model—the difference between actual GDP per capita relative to the US and predicted.
  • Next, generate a histogram of the errors using `hist(mydata$myerrorcolumn)`.
  • Generate a plot of fitted values against errors using `plot(mydata$fitted, mydata$myerrorcolumn)`. Do the errors depend on the fitted value?
  • What is the standard deviation of the errors? You can find this using `sd(mydata$myerrorcolumn)`.
  • Let’s simulate the model using your estimated parameters \(\alpha\) and \(\sigma\) (where \(\sigma\) is the standard deviation you just calculated)! Create a column in your dataset containing some new “countries”. Each of these new countries will be based on an actual country’s investment and population growth rates (i.e. one new country per row, each corresponding to a real country). You can generate a new country using
mydata <- mydata %>%
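  # note: predicted_gdppp here should be the model's *log* prediction, and sdestimate
  # the standard deviation of the errors you calculated above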
  mutate(NewVersionOfCountry = exp(rnorm(n = n(), mean = predicted_gdppp, sd = sdestimate)))

Generate two histograms: one of the actual GDP per capita relative to the United States, and one of our “new countries”. Would you say the model is a good fit?
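
A minimal sketch, assuming the simulated column is called NewVersionOfCountry as above and the actual values are in rel_income:

par(mfrow = c(1, 2))   # place the two histograms side by side
hist(solow_data$rel_income,          main = "Actual",    xlab = "GDP per capita relative to the US")
hist(solow_data$NewVersionOfCountry, main = "Simulated", xlab = "GDP per capita relative to the US")
par(mfrow = c(1, 1))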