This week’s homework is designed to help students get a better grasp of how to use R for basic data analysis, model building and simulation.
If you have not already, please install both R and R Studio. R is available from http://cran.r-project.org/ and R Studio is available at https://www.rstudio.com/. You will need to install R before R Studio. If you have difficulties installing, ping me on the R Help Slack channel.
The data that we will use for the exercise comes from the Penn World Table http://www.rug.nl/research/ggdc/data/pwt/pwt-7.1. We will use it to estimate and simulate a basic Solow-Swan model.
Most of the useful features of R come in libraries: bits of code written by enthusiasts that are freely available. To install new libraries from within R Studio, go to Tools -> Install Packages, and then search for the library by name. You should install the following libraries: `ggplot2`, which we will use for plotting, and `dplyr`, a useful package for data summarisation.
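Alternatively, you can install both packages from the R console in one line:
install.packages(c("ggplot2", "dplyr"))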
First you will need to open a new R script. To do this, go to File -> New File -> R Script. In an R script, you can write many lines of code (instructions) that stay put once you execute them. This is a useful way of arranging your analysis.
Now load the libraries and the data by writing the following lines. To execute a line, simply press ctrl-enter while your cursor is on the line. (If you want to execute multiple lines or just a part of a line, you can highlight that code and press ctrl-enter.)
library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the Penn World Table data
pwt <- read.csv("pwt71.csv")
# You may want to take a look at the data; you can do this by clicking on the `pwt` object in the Environment, or just type
head(pwt)
## isocode year POP XRAT Currency_Unit ppp tcgdp cgdp cgdp2 cda2 cc
## 1 AFG 1950 8150.368 NA <NA> NA NA NA NA NA NA
## 2 AFG 1951 8284.473 NA <NA> NA NA NA NA NA NA
## 3 AFG 1952 8425.333 NA <NA> NA NA NA NA NA NA
## 4 AFG 1953 8573.217 NA <NA> NA NA NA NA NA NA
## 5 AFG 1954 8728.408 NA <NA> NA NA NA NA NA NA
## 6 AFG 1955 8891.209 0.0168 Afghani NA NA NA NA NA NA
## cg ci p p2 pc pg pi openc cgnp y y2 rgdpl rgdpl2 rgdpch kc kg ki openk
## 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te rgdpl2th rgdptt
## 1 NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
As you can see, there are many missing observations in the data; don’t worry too much about this yet.
The dataset contains multiple observations for each country. We want to reduce this to just the data that we need for the main exercise: estimating a basic Solow model. First, let’s “filter” the observations so that we only have the data from 1985 to 2010. We can do this using the following code:
pwt2 <- pwt %>% filter(year >= 1985)  # PWT 7.1 ends in 2010, so this keeps 1985-2010
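To check that the filter did what you expected, you can inspect the range of years that remain:
range(pwt2$year)  # should run from 1985 to the last year in the data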
What happens with this syntax? First, we’re creating a new object called `pwt2`. This object takes `pwt`, our dataset, and then passes it into the `filter()` function with the “pipe” `%>%`. The `filter()` command restricts the observations to only those that have a year value equal to or greater than 1985.
We can “chain” together many functions using this syntax. In particular, we want to use two functions, `group_by()` and `summarise()`. A statement like `mydata %>% group_by(some_variable)` takes some data (here `mydata`) and “groups” it by a variable called `some_variable`. You can think of this as being like splitting the dataset up into chunks, each with the same value for `some_variable`. We can then perform analysis on each of those chunks, and dplyr mashes them back together.
A line like
mydata2 <- mydata %>% group_by(mygroupingvariable) %>%
summarise(Average_value = mean(Variable_2),
Ratio = last(Variable_3)/first(Variable_3))
takes `mydata`, splits it into groups, each with the same value for `mygroupingvariable`, and then calculates two summary statistics for each of those groups. The first is called `Average_value`, which is the mean of `Variable_2`; the second is called `Ratio`, which is the ratio of the last observation of `Variable_3` in each group to the first.
Sometimes you want to generate a new variable (column of data) that contains a particular value from your dataset. Say I want a variable that is equal to the value of `Variable_8` when the date is `2014-02-08` and `Variable_6` is equal to 4. To do this, I use the `mutate()` command:
mydata3 <- mydata2 %>% ungroup() %>% # we use ungroup() to undo groupings
  mutate(Specialval = Variable_8[Date == "2014-02-08" & Variable_6 == 4])
Using the hints above, generate a dataset that contains, for each country: the average investment rate (as a percentage of GDP); the annual population growth rate over 1985-2010 (expressed as a raw number, like \(0.02\)); and income per capita (in purchasing power terms) relative to the United States in 2010. (Note that I’m looking for a single number: the average annual growth rate over the period. You may need to look up how to calculate a geometric growth rate from the first and last values of a series.) One possible approach is sketched below.
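Here is a minimal sketch of one way to do this. It assumes the PWT 7.1 column names `ki` (investment share of GDP, in per cent), `POP` (population) and `y` (PPP-converted GDP per capita relative to the US, US = 100), and that every country has an observation for 2010; `solow_data` is just a name I have chosen. Check the column names against your own data before running:
solow_data <- pwt2 %>%
  arrange(isocode, year) %>%  # so that first()/last() refer to 1985 and 2010
  group_by(isocode) %>%
  summarise(s = mean(ki, na.rm = TRUE),  # average investment rate, % of GDP
            n = (last(POP) / first(POP))^(1 / (max(year) - min(year))) - 1,  # geometric annual population growth
            y2010 = y[year == 2010])  # relative income per capita in 2010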
You may use either base graphics or `ggplot` for this.
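For example, a quick `ggplot` scatter of the hypothetical `solow_data` object built above, to see how investment rates line up with relative incomes:
ggplot(solow_data, aes(x = s, y = y2010)) +
  geom_point() +
  labs(x = "Average investment rate (% of GDP)",
       y = "GDP per capita relative to the US, 2010")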
Note the change below. You should first make sure that your `y`, `s`, `n`, `g` and `delta` are all on the same scale before taking logs. For instance, `y` has the US at 100, so `s` should be in whole percentage points, and `n`, `g` and `delta` should be too.
Estimate a basic Solow model of the following form:
\[ \ln(y_i) = \beta_0 + \beta_1\ln(s_i) + \beta_2 \ln(n_{i} + g + \delta) + \epsilon_i \]
This can be done using the linear model function `lm()`:
mymod <- lm(y ~ x1 + x2, data = mydata)  # add further regressors with +
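For instance, a minimal sketch of constructing the logged variables and fitting the model, again assuming the hypothetical `solow_data` from earlier. The value of \(g + \delta\) is not in the data; the conventional Mankiw-Romer-Weil choice of \(g + \delta = 0.05\) (5 on the percentage-point scale used here) is taken purely for illustration:
# Scale assumptions: y2010 has the US at 100, s is in whole percentage points,
# and n is a raw rate (like 0.02), so it is converted to percentage points.
solow_data <- solow_data %>%
  mutate(lnY = log(y2010),
         lnS = log(s),
         lnngd = log(100 * n + 5))  # n, g and delta all in percentage points
mymod <- lm(lnY ~ lnS + lnngd, data = solow_data)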
After estimating the model, check the parameter estimates using
summary(mymod)
Do these parameters look sensible?
Also estimate the models
\[ \ln(y_i) = \frac{\alpha}{1-\alpha}\ln(s_i) - \frac{\alpha}{1-\alpha} \ln(n_{i} + g + \delta) + \epsilon_i \]
and
\[ \ln(y_i) = \ln(A_{0}) + \frac{\alpha}{1-\alpha}\ln(s_i) - \frac{\alpha}{1-\alpha} \ln(n_{i} + g + \delta) + \epsilon_i \]
using the function `nls()`. In this function, we give it a formula as above, but this time we can write one that is non-linear in the parameters. So if we wanted to estimate a restricted parameter \(\beta\), we could write
mymod2 <- nls(y ~ beta*x1 + (1/beta)*x2, data = mydata, start = list(beta = 1))
and check the estimate of \(\beta\) using
summary(mymod2)
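As a concrete sketch for the two restricted models above, again using the hypothetical `solow_data` with its `lnY`, `lnS` and `lnngd` columns. The starting values are rough guesses (capital’s share is often thought to be around one third), and `nls()` can fail to converge from poor ones:
mod2 <- nls(lnY ~ (alpha / (1 - alpha)) * lnS - (alpha / (1 - alpha)) * lnngd,
            data = solow_data, start = list(alpha = 0.3))
mod3 <- nls(lnY ~ LnA0 + (alpha / (1 - alpha)) * lnS - (alpha / (1 - alpha)) * lnngd,
            data = solow_data, start = list(alpha = 0.3, LnA0 = 1))
summary(mod2)
summary(mod3)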
You should get results that look like this:
##
## Call:
## lm(formula = lnY ~ lnS + lnngd, data = pwt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57707 -0.74167 0.02233 0.64023 3.04949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.1422 1.6120 7.532 3.77e-12 ***
## lnS 0.9823 0.1909 5.147 7.87e-07 ***
## lnngd -5.9296 0.6659 -8.905 1.30e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9918 on 156 degrees of freedom
## Multiple R-squared: 0.4475, Adjusted R-squared: 0.4404
## F-statistic: 63.18 on 2 and 156 DF, p-value: < 2.2e-16
##
## Formula: lnY ~ (alpha/(1 - alpha)) * lnS - (alpha/(1 - alpha)) * lnngd
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## alpha 0.719205 0.007163 100.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 158 degrees of freedom
##
## Number of iterations to convergence: 4
## Achieved convergence tolerance: 4.408e-09
##
## Formula: lnY ~ LnA0 + (alpha/(1 - alpha)) * lnS - (alpha/(1 - alpha)) *
## lnngd
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## alpha 0.60637 0.03035 19.978 < 2e-16 ***
## LnA0 1.21510 0.21129 5.751 4.51e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.127 on 157 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 5.851e-09
Use the `mutate()` function to generate a new column that contains the predicted GDP per capita relative to the US. (Note that the predicted GDP from the model will be the log proportion, a negative number for most countries; you will need to take the antilog of this value using `exp()`.)
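A sketch of how this might look, using the hypothetical `mymod` and `solow_data` from above; `predict()` on an `lm` object returns fitted values on the scale of the left-hand-side variable, here logs:
solow_data <- solow_data %>%
  mutate(predicted_gdppp = predict(mymod, newdata = solow_data),  # predicted log GDP per capita
         predicted_level = exp(predicted_gdppp))                  # antilog back to levels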
Often, we want to check how well a model is performing by using the model to generate some new random numbers, and comparing these to the actuals. This is in the spirit of what Bayesians call posterior predictive checking, and what macroeconomists call moment matching. The model will be more “valid” if the new countries that it generates have a similar distribution of GDP per capita to what we observe in reality.
mydata <- mydata %>%
  mutate(NewVersionOfCountry = exp(rnorm(n = n(),                 # one draw per country
                                         mean = predicted_gdppp,  # predicted log GDP per capita
                                         sd = sdestimate)))       # residual standard error from summary(mymod)
Generate two histograms, one of the actual GDP per capita relative to the United States, and the second of our “new countries”. Would you say the model is a good fit?
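A sketch with `ggplot2`, assuming the actual relative incomes are in a column `y2010` and the simulated values in `NewVersionOfCountry`, as in the hypothetical code above:
ggplot(mydata, aes(x = y2010)) +
  geom_histogram() +
  ggtitle("Actual GDP per capita relative to the US")
ggplot(mydata, aes(x = NewVersionOfCountry)) +
  geom_histogram() +
  ggtitle("Simulated new countries")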