This document is a quick introduction to data wrangling, graphics, and typesetting in R. You can download the code by clicking the “Code” button to the upper right.
First thing is first; let’s install/load all of the packages that we will be using, and clear out the environment.
We will be using data from Peterson (2017): Export Diversity and Human Rights. You can download the replication archive by clicking here or download the data directly by running the following chunk.
Data Wrangling
Selecting Cases and Subsetting
As with most things in R there are multiple ways of accomplishing the same goal. Being able to switch between approaches can help you accomplish more complicated tasks. I generally stick to base R for subsetting. To get indices which satisfy logical statements you can use the which function
which(d$gdppc > 50000)
## [1] 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 2116 2117 2118
## [16] 2119 2120 2121 2122 2123 2124 2125 2126 3981 3982 3983 3984 3985 3986 4017
## [31] 4018 4019 4020 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046
## [46] 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816
## [61] 4817 4818 4820 4821 4823 4824 4825 4826 4827 4828 4830
which(d$gdppc < 2000 & d$polity2 == 10)
## [1] 4277 4278 4279 4280
which(d$gdpgrowth < -.5 | d$gdppc < 100)
## [1] 3040 3369 3717 3805 3967
We can combine this with indexing to subset down the data. We can also call columns in a variety of ways. Remember that you can create objects carying this information to modularize your code, which might be helpful in particular situations to keep everything clear.
inds <- which(d$gdppc > 50000)
cols <- c("ccode","year","physint","lnpop")
sub1 <- d[inds,cols]
sub1
## # A tibble: 71 x 4
## ccode year physint lnpop
## <dbl> <dbl> <dbl> <dbl>
## 1 212 1999 8 13.0
## 2 212 2000 8 13.0
## 3 212 2001 8 13.0
## 4 212 2002 8 13.0
## 5 212 2003 8 13.0
## 6 212 2004 8 13.0
## 7 212 2005 8 13.0
## 8 212 2006 8 13.1
## 9 212 2007 8 13.1
## 10 212 2008 8 13.1
## # ... with 61 more rows
Alternatively, one could use the subset function from base R to get the same result.
sub2 <- subset(d,d$gdppc > 50000,cols)
sub2
## # A tibble: 71 x 4
## ccode year physint lnpop
## <dbl> <dbl> <dbl> <dbl>
## 1 212 1999 8 13.0
## 2 212 2000 8 13.0
## 3 212 2001 8 13.0
## 4 212 2002 8 13.0
## 5 212 2003 8 13.0
## 6 212 2004 8 13.0
## 7 212 2005 8 13.0
## 8 212 2006 8 13.1
## 9 212 2007 8 13.1
## 10 212 2008 8 13.1
## # ... with 61 more rows
identical(sub1,sub2)
## [1] TRUE
Another great option is the dplyr package, which is part of the tidyverse alongside the packages haven and ggplot2. One of the best things about the tidyverse family of packages is that they are very well documented, including a variety of cheatsheets and books. One thing that makes the wrangling tools particularly powerful is that they leverage a pipe (%>%) from the magrittr package which says,in pseudo-code, that x %>% f(y) is the same as f(x,y). This can create nice work flow. For example, to get the same subset yet again:
d %>% filter(gdppc > 50000) %>% select(all_of(cols)) -> sub3
all(sub1 == sub3, na.rm=T)
## [1] TRUE
Neat. What we did was took our dataframe, filtered the rows that we wanted, and then selected the columns of interest.
If we want to sort data, there is a base R approach for vectors.
head(sort(d$gdpgrowth))
## [1] -0.6532034 -0.6011391 -0.5537450 -0.5498320 -0.5225903 -0.4939937
For dataframes you have to use order, which produces index numbers that can be used as before
d[order(d$gdpgrowth),c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 x 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 645 1991 -0.653
## 2 572 2003 -0.601
## 3 517 1994 -0.554
## 4 690 1991 -0.550
## 5 660 1989 -0.523
## 6 92 2007 -0.494
## 7 475 1986 -0.480
## 8 450 1990 -0.475
## 9 373 1993 -0.440
## 10 411 1990 -0.423
## # ... with 5,178 more rows
We can also switch the ordering around by setting decreasing = T
d[order(d$gdpgrowth,decreasing = T),c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 x 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 690 1992 1.86
## 2 572 2005 1.48
## 3 450 1997 1.39
## 4 411 1997 1.38
## 5 92 2009 1.13
## 6 411 2002 0.874
## 7 345 1996 0.854
## 8 411 1999 0.788
## 9 552 2010 0.752
## 10 411 1992 0.731
## # ... with 5,178 more rows
Or we could use the handy %>%. In this case we have to use the placeholder . for the input, which might be handy to know that you can do for more complicated functions.
order(d$gdpgrowth,decreasing = T) %>% d[.,c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 x 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 690 1992 1.86
## 2 572 2005 1.48
## 3 450 1997 1.39
## 4 411 1997 1.38
## 5 92 2009 1.13
## 6 411 2002 0.874
## 7 345 1996 0.854
## 8 411 1999 0.788
## 9 552 2010 0.752
## 10 411 1992 0.731
## # ... with 5,178 more rows
There is also a handy arrange function in dplyr for accomplishing the same sort of task but allowing to sort based on multiple columns.
d %>% select(c("ccode","year","gdpgrowth")) %>% arrange(ccode,gdpgrowth)
## # A tibble: 5,188 x 3
## ccode year gdpgrowth
## <dbl> <dbl> <dbl>
## 1 2 2009 -0.0359
## 2 2 1982 -0.0233
## 3 2 2008 -0.00629
## 4 2 1991 -0.00469
## 5 2 2001 0.00829
## 6 2 2002 0.0129
## 7 2 1990 0.0150
## 8 2 2007 0.0195
## 9 2 2003 0.0238
## 10 2 2010 0.0256
## # ... with 5,178 more rows
Converting between Long and Wide
Sometimes data comes in “Long” format, like the data we have been working with, where each row represents an observational unit (here country - year). In other cases data will come in “Wide” format like this or this
We might want to convert data back and forth depending on what we want to do. Let’s do this for the GDP per capita variable; see NYU Data Services for a handy guide to reshaping.
long <- d[,c("ccode","year","gdppc")]
long
## # A tibble: 5,188 x 3
## ccode year gdppc
## <dbl> <dbl> <dbl>
## 1 2 1981 25977.
## 2 2 1982 25131.
## 3 2 1983 26026.
## 4 2 1984 27751.
## 5 2 1985 28455.
## 6 2 1986 29014.
## 7 2 1987 29639.
## 8 2 1988 30506.
## 9 2 1989 31274.
## 10 2 1990 31432.
## # ... with 5,178 more rows
Long to Wide with tidyr (updated versions of gather and spread)
long %>% pivot_wider(names_from = year, values_from = gdppc) -> wide
wide
## # A tibble: 189 x 31
## ccode `1981` `1982` `1983` `1984` `1985` `1986` `1987` `1988` `1989` `1990`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 25977. 25131. 26026. 27751. 28455. 29014. 29639. 30506. 31274. 31432.
## 2 20 22718. 22267. 23001. 24238. 25387. 25987. 27535. 28487. 29005. 28589.
## 3 31 12648. 14004. 14880. 17393. 18428. 17593. 18885. 17582. 18414. 16909.
## 4 40 NA NA NA NA NA NA NA NA NA NA
## 5 41 NA NA NA NA NA NA NA NA NA NA
## 6 42 3600. 3638. 3719. 4064. 4100. 4273. 4673. 4826. 4364. 3923.
## 7 51 3904. 3811. 3862. 3744. 3329. 3539. 3875. 3958. 3962. 4267.
## 8 52 24382. 25326. 22556. 20483. 19106. 17596. 14946. 14170. 12743. 12770.
## 9 53 10365. 10358. 10463. 10725. 9966. 11390. 13180. 13719. 15294. 15270.
## 10 54 4407. 4683. 4862. 5109. 5008. 5477. 5973. 6233. 6168. 6562.
## # ... with 179 more rows, and 20 more variables: `1991` <dbl>, `1992` <dbl>,
## # `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
## # `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>,
## # `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>,
## # `2008` <dbl>, `2009` <dbl>, `2010` <dbl>
Back to long
wide %>% pivot_longer(cols=-ccode,names_to = "year", values_to = "gdppc") -> long_again
long_again
## # A tibble: 5,670 x 3
## ccode year gdppc
## <dbl> <chr> <dbl>
## 1 2 1981 25977.
## 2 2 1982 25131.
## 3 2 1983 26026.
## 4 2 1984 27751.
## 5 2 1985 28455.
## 6 2 1986 29014.
## 7 2 1987 29639.
## 8 2 1988 30506.
## 9 2 1989 31274.
## 10 2 1990 31432.
## # ... with 5,660 more rows
Merging Datasets
You may have noticed that the ccode variable isn’t particularly descriptive for which country it means. We can use this to illustrate merging datasets together. At the start we loaded in the countrycode package which contains additional information.
codes <- countrycode::codelist_panel
Let’s see what they have.
colnames(codes)
## [1] "country.name.en" "year" "ar5"
## [4] "continent" "country.name.de" "country.name.de.regex"
## [7] "country.name.en.regex" "cow.name" "cowc"
## [10] "cown" "ecb" "ecb.name"
## [13] "eu28" "eurocontrol_pru" "eurocontrol_statfor"
## [16] "eurostat" "eurostat.name" "fao"
## [19] "fao.name" "fips" "fips.name"
## [22] "gaul" "genc.name" "genc2c"
## [25] "genc3c" "genc3n" "gwc"
## [28] "gwn" "icao.name" "icao.region"
## [31] "imf" "ioc" "ioc.name"
## [34] "iso.name.en" "iso.name.fr" "iso2c"
## [37] "iso3c" "iso3n" "p4.name"
## [40] "p4c" "p4n" "region"
## [43] "un" "un.name.ar" "un.name.en"
## [46] "un.name.es" "un.name.fr" "un.name.ru"
## [49] "un.name.zh" "unpd" "unpd.name"
## [52] "vdem" "vdem.name" "wb"
## [55] "wb_api.name" "wb_api2c" "wb_api3c"
## [58] "wb.name" "wvs" "wvs.name"
The country codes we are currently using are cown. Let’s grab iso3c and region to add to the dataset. We also know that the dataset we are working with only has years from 1981 to 2010, so let’s practice our subsetting skillz
codes <- codes[codes$year %in% 1981:2010,c("cown","year","iso3c","country.name.en","region")]
One thing to pay attention to is losing or gaining observations durring a merge. For a great overview, check out this handy NYU Data Services guide.
nrow(d)
## [1] 5188
out1 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"))
nrow(out1)
## [1] 5125
out2 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.x=T)
nrow(out2)
## [1] 5188
out3 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.y=T)
nrow(out3)
## [1] 6959
out4 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all=T)
nrow(out4)
## [1] 7022
And, of course, we can do the same merges using dplyr with inner_join, left_join, right_join, and full_join respectively. Going forward we will keep out2 as the working dataset.
Aggregates and Summaries
There are a number of great things you can do with the apply family of functions, including easily going in parallel with the pbapply package. If you are interested in more details on this you should check out this tutorial and this taskview. We will focus on using dplyr to calculate summaries of interest.
One reason for this is that it is super easy to calculate summaries grouping on another variable. For example, if we wanted to think about regional variation in gdppc we could
out2 %>% group_by(region) %>%
summarize(mean=mean(gdppc,na.rm=T),sd=sd(gdppc,na.rm=T),sum=sum(gdppc,na.rm=T))
## # A tibble: 23 x 4
## region mean sd sum
## <chr> <dbl> <dbl> <dbl>
## 1 Australia and New Zealand 24990. 5898. 1499423.
## 2 Caribbean 9328. 5894. 3059582.
## 3 Central America 5276. 3314. 1107909.
## 4 Central Asia 4762. 2813. 452347.
## 5 Eastern Africa 1803. 2240. 735621.
## 6 Eastern Asia 11792. 10740. 1415087.
## 7 Eastern Europe 9855. 4712. 2305957.
## 8 Melanesia 4459. 558. 133778.
## 9 Micronesia NaN NA 0
## 10 Middle Africa 3184. 4070. 859647.
## # ... with 13 more rows
We can also use the mutate function to add this information to our dataframe. In base R this would take mergeing the output of aggregate, so it can certainly be done, but dplyr makes it somewhat more straightforward and scaleable.
out2 %>% group_by(region) %>%
mutate(mean_gdppc=mean(gdppc,na.rm=T),sd_gdppc=sd(gdppc,na.rm=T)) -> out2
out2
## # A tibble: 5,188 x 23
## # Groups: region [23]
## ccode year twoway inhhi comper polity2 physint conflictonlocat~ lnpop
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 1981 0.986 0.990 0.997 10 8 0 19.3
## 2 2 1982 0.988 0.989 0.998 10 8 0 19.3
## 3 2 1983 0.982 0.989 0.994 10 8 1 19.3
## 4 2 1984 0.987 0.987 1 10 8 0 19.3
## 5 2 1985 0.985 0.987 0.998 10 7 0 19.3
## 6 2 1986 0.980 0.987 0.994 10 7 0 19.3
## 7 2 1987 0.982 0.987 0.995 10 8 0 19.3
## 8 2 1988 0.977 0.988 0.989 10 7 0 19.3
## 9 2 1989 0.980 0.988 0.992 10 7 1 19.3
## 10 2 1990 0.983 0.988 0.995 10 8 0 19.4
## # ... with 5,178 more rows, and 14 more variables: lngdppc <dbl>, gdppc <dbl>,
## # expdep <dbl>, gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## # sdarab_manual <dbl>, sdpop_manual <dbl>, iso3c <chr>,
## # country.name.en <chr>, region <chr>, mean_gdppc <dbl>, sd_gdppc <dbl>
A base R version of the above might be
a1 <- aggregate(out2$gdppc,by=list(out2$region),mean,na.rm=T)
a1
## Group.1 x
## 1 Australia and New Zealand 24990.386
## 2 Caribbean 9327.994
## 3 Central America 5275.757
## 4 Central Asia 4761.549
## 5 Eastern Africa 1802.991
## 6 Eastern Asia 11792.392
## 7 Eastern Europe 9854.516
## 8 Melanesia 4459.267
## 9 Micronesia NaN
## 10 Middle Africa 3183.878
## 11 Northern Africa 3388.796
## 12 Northern America 32390.804
## 13 Northern Europe 24848.061
## 14 Polynesia NaN
## 15 South-Eastern Asia 12756.196
## 16 South America 6143.031
## 17 Southern Africa 5279.329
## 18 Southern Asia 3029.750
## 19 Southern Europe 14384.153
## 20 Western Africa 1175.762
## 21 Western Asia 15632.999
## 22 Western Europe 30016.166
colnames(a1) <- c("region","mean_gdppc")
a2 <- aggregate(out2$gdppc,by=list(out2$region),sd,na.rm=T)
colnames(a2) <- c("region","sd_gdppc")
t1 <- merge(out2,a1,by="region")
t2 <- merge(t1,a2,by="region")
tbl_df(t2)
## # A tibble: 5,063 x 25
## region ccode year twoway inhhi comper polity2 physint conflictonlocat~ lnpop
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austr~ 900 1982 0.915 0.955 0.958 10 8 0 16.5
## 2 Austr~ 900 1983 0.920 0.951 0.968 10 7 0 16.5
## 3 Austr~ 900 1984 0.917 0.952 0.963 10 8 0 16.6
## 4 Austr~ 900 1986 0.912 0.945 0.965 10 7 0 16.6
## 5 Austr~ 900 1987 0.921 0.954 0.966 10 8 0 16.6
## 6 Austr~ 900 1988 0.924 0.958 0.965 10 6 0 16.6
## 7 Austr~ 900 1985 0.903 0.945 0.955 10 8 0 16.6
## 8 Austr~ 900 1990 0.928 0.961 0.966 10 7 0 16.7
## 9 Austr~ 900 1991 0.914 0.958 0.955 10 7 0 16.7
## 10 Austr~ 900 1992 0.932 0.958 0.972 10 7 0 16.7
## # ... with 5,053 more rows, and 15 more variables: lngdppc <dbl>, gdppc <dbl>,
## # expdep <dbl>, gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## # sdarab_manual <dbl>, sdpop_manual <dbl>, iso3c <chr>,
## # country.name.en <chr>, mean_gdppc.x <dbl>, sd_gdppc.x <dbl>,
## # mean_gdppc.y <dbl>, sd_gdppc.y <dbl>
but the dplyr approach really is quite nice.
Graphics
We will focus on using ggplot2 for graphics in R, although base R has nice capabilities on its own. ggplot is all about the `grammar of graphics’ which follows a layered approach to describe and construct graphics in a structured manner. To begin, we will always initialize a plot:
p1 <- ggplot(out2[which(out2$region == "Northern America"),], aes(x=log(gdppc)))
To get different plots, we will add layers. For example, if we wanted a dot plot
p1 + geom_dotplot(binwidth=0.03)
or a histogram
p1 + geom_histogram(binwidth=0.03)
or a density plot
p1 + geom_density()
we can just add a different layer to the same underlying plot.
The order of the layers does not matter, and there are a bunch more customizations that we can add.
p1 + geom_histogram(color="black",fill="darkblue",binwidth = 0.03) +
xlab("Natural Log of Per Capita GDP") +
ylab("Frequency") +
ggtitle('North American GDPPC') +
theme_bw() -> g1
g1

You can also add multiple geometries to the same underderlying plot.
p2 <- ggplot(out2[which(out2$region == "Eastern Europe"),],aes(x=year,y=log(gdppc),color=iso3c))
p2 + geom_point(na.rm=T) +
geom_line(na.rm=T) +
labs(color="Country") +
scale_color_brewer(palette="Spectral") -> g2
g2

You can even add some smoothers if you want.
p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
geom_smooth(color ="gray", method = "lm", se = TRUE,na.rm=T, formula=y~x)

p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
geom_smooth(color ="gray", method = "loess", se = TRUE,formula=y~x, na.rm=T) -> g3
g3

Two last notes on plots – faceting and adding plots together into a larger image.
Faceting can be a nice way to break up a continuous variable by category.
p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Western Europe","Western Africa","Eastern Europe")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) +
facet_grid(region ~ .)

p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Western Europe","Western Africa","Eastern Europe")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) +
facet_grid(. ~ region)

Once we do all that, we might want to add multiple plots together into a larger multi-panel graphic. The gridExtra package is great for this.
grid.arrange(g1,g2,g3,textGrob("Spiffy!"),ncol=2,nrow=2)

You should take a look at the ggplot2 book linked to above, as well as other stuff like ggpubr, some examples can be found here.
Typesetting
In R it is relatively easy to create tables that are presentable in Word, LaTex, or HTML. A large number of packages – apsrtable, xtable, texreg, memisc, outreg and others – cater to LaTex users. If you’re using word to prepare your documents, stargazer is an option that can work nicely. One thing we might want to do from the above is greate a table of descriptive statistics.
vars <- c("polity2","physint","conflictonlocation","lnpop")
path <- "C:/Users/Schwarz/Dropbox/AQR"
stargazer(as.data.frame(out2[,vars]),type="html",
title="Descriptive Statistics",
covariate.labels = c("Democracy","Human Rights","Any Conflict?",
"Population (Logged)"),
out=paste0(path,"/summary1.doc"))
As you might have gathered from the code above, what we are going to do is use the ability of Word to render html by saving the output with a .doc extension.
Otherwise, the table looks like this:
Descriptive Statistics
|
|
|
Statistic
|
N
|
Mean
|
St. Dev.
|
Min
|
Pctl(25)
|
Pctl(75)
|
Max
|
|
|
|
Democracy
|
4,380
|
1.746
|
7.203
|
-10.000
|
-6.000
|
9.000
|
10.000
|
|
Human Rights
|
4,487
|
4.838
|
2.322
|
0.000
|
3.000
|
7.000
|
8.000
|
|
Any Conflict?
|
5,188
|
0.171
|
0.377
|
0
|
0
|
0
|
1
|
|
Population (Logged)
|
4,457
|
15.674
|
1.925
|
10.615
|
14.671
|
16.902
|
21.000
|
|
|
If you’re interested in what other summary statistics are available natively from stargazer, check out the link in omit.summary.stat in their documentation
?stargazer
Something a bit more interesting would be to take a look at Regression tables. First, let’s run some models. To replicate the Peterson (2017) results, we need to define a function for leading a variable.
tscslead <- function(x, cs,ts){
obs <- 1:length(x)
lagobs <- match(paste(cs, ts+1, sep="::"), paste(cs, ts, sep="::"))
lagx <- x[lagobs]
lagx
}
The next thing that we will do is create the dependent variable with a one period lead
out2$physint_lead <- tscslead(out2$physint, out2$ccode, out2$year)
And then we will replicate models 1-3
m1 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
+ lngdppc + gdpgrowth + lnpop + year + physint, data = out2)
m2 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
+ lngdppc + gdpgrowth + lnpop + year + physint, data = out2)
m3 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
+ lngdppc + gdpgrowth + lnpop + year + physint, data = out2)
stargazer(m1,m2,m3,type="html",out = paste0(path,"/regressions.doc"),
covariate.labels = c("Export Diversification","Liberalization","
Export Dependence","Democracy","Conflict","
Log Income","Growth","Log Population",
"Linear Time Trend","Lagged DV"),
title="Replication of Peterson (2017) Models 1-3",
dep.var.caption = "All States",
dep.var.labels = "Human Rights")
The output looks like this:
Replication of Peterson (2017) Models 1-3
|
|
|
|
All States
|
|
|
|
|
|
Human Rights
|
|
|
(1)
|
(2)
|
(3)
|
|
|
|
Export Diversification
|
0.276**
|
0.242**
|
0.226**
|
|
|
(0.114)
|
(0.113)
|
(0.114)
|
|
|
|
|
|
|
Liberalization
|
-10.638
|
|
-57.858***
|
|
|
(8.896)
|
|
(13.994)
|
|
|
|
|
|
|
Export Dependence
|
|
0.202**
|
0.707***
|
|
|
|
(0.101)
|
(0.162)
|
|
|
|
|
|
|
Democracy
|
0.026***
|
0.026***
|
0.027***
|
|
|
(0.004)
|
(0.004)
|
(0.004)
|
|
|
|
|
|
|
Conflict
|
-0.583***
|
-0.555***
|
-0.611***
|
|
|
(0.062)
|
(0.061)
|
(0.062)
|
|
|
|
|
|
|
Log Income
|
0.126***
|
0.118***
|
0.131***
|
|
|
(0.023)
|
(0.022)
|
(0.023)
|
|
|
|
|
|
|
Growth
|
-0.154
|
-0.207
|
-0.140
|
|
|
(0.193)
|
(0.191)
|
(0.192)
|
|
|
|
|
|
|
Log Population
|
-0.190***
|
-0.174***
|
-0.226***
|
|
|
(0.020)
|
(0.018)
|
(0.022)
|
|
|
|
|
|
|
Linear Time Trend
|
-0.014***
|
-0.014***
|
-0.016***
|
|
|
(0.002)
|
(0.002)
|
(0.002)
|
|
|
|
|
|
|
Lagged DV
|
0.675***
|
0.679***
|
0.666***
|
|
|
(0.013)
|
(0.012)
|
(0.013)
|
|
|
|
|
|
|
Constant
|
31.817***
|
31.360***
|
35.327***
|
|
|
(4.914)
|
(4.876)
|
(4.967)
|
|
|
|
|
|
|
|
|
Observations
|
3,479
|
3,562
|
3,479
|
|
R2
|
0.751
|
0.754
|
0.753
|
|
Adjusted R2
|
0.751
|
0.753
|
0.752
|
|
Residual Std. Error
|
1.118 (df = 3469)
|
1.121 (df = 3552)
|
1.115 (df = 3468)
|
|
F Statistic
|
1,164.520*** (df = 9; 3469)
|
1,208.221*** (df = 9; 3552)
|
1,055.422*** (df = 10; 3468)
|
|
|
|
Note:
|
p<0.1; p<0.05; p<0.01
|
Stargazer has a number of useful “style” options which format close to a number of major journals, which may be worth looking at or playing around with.
---
title: 'AQR Proseminar: Wrangling, Graphics, and Typesetting in R'
author: "Christopher Schwarz"
date: "3/23/2020"
pages:
  extra: true
output: 
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
    code_download: true
---

This document is a quick introduction to data wrangling, graphics, and typesetting in R.  You can download the code by clicking the "Code" button to the upper right.

First thing is first; let's install/load all of the packages that we will be using, and clear out the environment.

```{r message=FALSE, warning=FALSE}
packages <- c("haven","dplyr","ggplot2","countrycode","tidyr","gridExtra","grid",
              "stargazer")
for(i in packages){
  if(!require(i,character.only = T, quietly = T)){
    install.packages(i)
    library(i, character.only = T, quietly = T)
  }
}

rm(list=ls())
```


We will be using data from Peterson (2017): Export Diversity and Human Rights.  You can download the replication archive by clicking [here](https://sites.google.com/site/timothympetersonusc/JCR%20-%20export%20diversity%20-%20replication%20data.zip) or download the data directly by running the following chunk.

```{r}
d <- read_dta("https://bit.ly/3a9nUSL")
d
```


# Data Wrangling

## Selecting Cases and Subsetting

As with most things in R there are multiple ways of accomplishing the same goal.  Being able to switch between approaches can help you accomplish more complicated tasks.  I generally stick to base R for subsetting.  To get indices which satisfy logical statements you can use the `which` function 
```{r}
which(d$gdppc > 50000)
```

```{r}
which(d$gdppc < 2000 & d$polity2 == 10)
```

```{r}
which(d$gdpgrowth < -.5 | d$gdppc < 100)
```

We can combine this with indexing to subset down the data.  We can also call columns in a variety of ways.  Remember that you can create objects carying this information to modularize your code, which might be helpful in particular situations to keep everything clear.

```{r}
inds <- which(d$gdppc > 50000)
cols <- c("ccode","year","physint","lnpop")

sub1 <- d[inds,cols]
sub1
```

Alternatively, one could use the `subset` function from base R to get the same result.

```{r}
sub2 <- subset(d,d$gdppc > 50000,cols)
sub2
```
```{r}
identical(sub1,sub2)
```

Another great option is the `dplyr` package, which is part of the [tidyverse](https://www.tidyverse.org/) alongside the packages `haven` and `ggplot2`.  One of the best things about the tidyverse family of packages is that they are very well documented, including [a](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) [variety](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) [of](http://www.flutterbys.com.au/stats/downloads/slides/figure/factors.pdf) [cheatsheets](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_purrr.pdf) [and](https://ggplot2-book.org/) [books](https://r4ds.had.co.nz/).  One thing that makes the wrangling tools particularly powerful is that they leverage a pipe (`%>%`) from the `magrittr` package which says,in pseudo-code, that `x %>% f(y)` is the same as `f(x,y)`.  This can create nice work flow.  For example, to get the same subset yet again:
```{r}
d %>% filter(gdppc > 50000) %>% select(all_of(cols)) -> sub3

all(sub1 == sub3, na.rm=T)
```

Neat.  What we did was took our dataframe, `filter`ed the rows that we wanted, and then `select`ed the columns of interest.

If we want to `sort` data, there is a base R approach for vectors.

```{r}
head(sort(d$gdpgrowth))
```

For dataframes you have to use `order`, which produces index numbers that can be used as before
```{r}
d[order(d$gdpgrowth),c("ccode","year","gdpgrowth")]
```
We can also switch the ordering around by setting `decreasing = T`
```{r}
d[order(d$gdpgrowth,decreasing = T),c("ccode","year","gdpgrowth")]
```
Or we could use the handy `%>%`.  In this case we have to use the placeholder `.` for the input, which might be handy to know that you can do for more complicated functions.
```{r}
order(d$gdpgrowth,decreasing = T) %>% d[.,c("ccode","year","gdpgrowth")]
```

There is also a handy `arrange` function in `dplyr` for accomplishing the same sort of task but allowing to sort based on multiple columns.

```{r}
d %>% select(c("ccode","year","gdpgrowth")) %>% arrange(ccode,gdpgrowth)
```

## Converting between Long and Wide

Sometimes data comes in "Long" format, like the data we have been working with, where each row represents an observational unit (here country - year).  In other cases data will come in "Wide" format like [this](https://data.imf.org/regular.aspx?key=60991467) or [this](https://en.wikipedia.org/wiki/United_States_Electoral_College#Chronological_table)

We might want to convert data back and forth depending on what we want to do.  Let's do this for the GDP per capita variable; see NYU Data Services for a handy [guide to reshaping](https://guides.nyu.edu/quant/reshape).

```{r}
long <- d[,c("ccode","year","gdppc")]
long
```

Long to Wide with tidyr (updated versions of `gather` and `spread`)
```{r}
long %>% pivot_wider(names_from = year, values_from = gdppc) -> wide
wide
```
Back to long
```{r}
wide %>% pivot_longer(cols=-ccode,names_to = "year", values_to = "gdppc") -> long_again
long_again
```

## Merging Datasets

You may have noticed that the ccode variable isn't particularly descriptive for which country it means.  We can use this to illustrate merging datasets together.  At the start we loaded in the `countrycode` package which contains additional information.

```{r}
codes <- countrycode::codelist_panel
```

Let's see what they have.
```{r}
colnames(codes)
```

The country codes we are currently using are `cown`.  Let's grab `iso3c` and `region` to add to the dataset.  We also know that the dataset we are working with only has years from 1981 to 2010, so let's practice our subsetting skillz
```{r}
codes <- codes[codes$year %in% 1981:2010,c("cown","year","iso3c","country.name.en","region")]
```

One thing to pay attention to is losing or gaining observations durring a merge.  For a great overview, check out this handy NYU Data Services [guide](https://guides.nyu.edu/quant/merge).
```{r}
nrow(d)
```

```{r}
out1 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"))
nrow(out1)
```
```{r}
out2 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.x=T)
nrow(out2)
```
```{r}
out3 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.y=T)
nrow(out3)
```
```{r}
out4 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all=T)
nrow(out4)
```

And, of course, we can do the same merges using dplyr with `inner_join`, `left_join`, `right_join`, and `full_join` respectively.  Going forward we will keep `out2` as the working dataset.

## Aggregates and Summaries

There are a number of great things you can do with the `apply` family of functions, including easily going in parallel with the `pbapply` package.  If you are interested in more details on this you should check out [this tutorial](https://www.datacamp.com/community/tutorials/r-tutorial-apply-family) and [this taskview](https://www.r-pkg.org/ctv/HighPerformanceComputing).  We will focus on using `dplyr` to calculate summaries of interest.

One reason for this is that it is super easy to calculate summaries grouping on another variable.  For example, if we wanted to think about regional variation in `gdppc` we could

```{r}
out2 %>% group_by(region) %>% 
  summarize(mean=mean(gdppc,na.rm=T),sd=sd(gdppc,na.rm=T),sum=sum(gdppc,na.rm=T))
```

We can also use the `mutate` function to add this information to our dataframe.  In base R this would take `merge`ing the output of `aggregate`, so it can certainly be done, but `dplyr` makes it somewhat more straightforward and scaleable.
```{r}
out2 %>% group_by(region) %>% 
  mutate(mean_gdppc=mean(gdppc,na.rm=T),sd_gdppc=sd(gdppc,na.rm=T)) -> out2

out2
```

A base R version of the above might be
```{r}
a1 <- aggregate(out2$gdppc,by=list(out2$region),mean,na.rm=T)
a1

colnames(a1) <- c("region","mean_gdppc")
a2 <- aggregate(out2$gdppc,by=list(out2$region),sd,na.rm=T)
colnames(a2) <- c("region","sd_gdppc")

t1 <- merge(out2,a1,by="region")
t2 <- merge(t1,a2,by="region")
tbl_df(t2)
```
but the `dplyr` approach really is quite nice.

# Graphics

We will focus on using `ggplot2` for graphics in R, although base R has nice capabilities on its own.  `ggplot` is all about the `grammar of graphics' which follows a layered approach to describe and construct graphics in a structured manner.  To begin, we will always initialize a plot:

```{r}
p1 <- ggplot(out2[which(out2$region == "Northern America"),], aes(x=log(gdppc)))
```

To get different plots, we will add layers.  For example, if we wanted a dot plot
```{r}
p1 + geom_dotplot(binwidth=0.03)
```
or a histogram
```{r}
p1 + geom_histogram(binwidth=0.03)
```
or a density plot
```{r}
p1 + geom_density()
```
we can just add a different layer to the same underlying plot.

The order of the layers does not matter, and there are a bunch more customizations that we can add.

```{r}
p1 + geom_histogram(color="black",fill="darkblue",binwidth = 0.03) +
     xlab("Natural Log of Per Capita GDP") +
     ylab("Frequency") +
     ggtitle('North American GDPPC') +
     theme_bw()  -> g1
g1
```

You can also add multiple geometries to the same underderlying plot.

```{r}
p2 <- ggplot(out2[which(out2$region == "Eastern Europe"),],aes(x=year,y=log(gdppc),color=iso3c))
p2  + geom_point(na.rm=T) + 
      geom_line(na.rm=T) +
      labs(color="Country") +
      scale_color_brewer(palette="Spectral")  -> g2
g2
```

You can even add some smoothers if you want.

```{r}
p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
     geom_smooth(color ="gray", method = "lm", se = TRUE,na.rm=T, formula=y~x)

```

```{r warnings=FALSE}
p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
     geom_smooth(color ="gray", method = "loess", se = TRUE,formula=y~x, na.rm=T) -> g3
g3
```

Two last notes on plots -- faceting and adding plots together into a larger image.


Faceting can be a nice way to break up a continuous variable by category.
```{r}
p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Western Europe","Western Africa","Eastern Europe")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) + 
     facet_grid(region ~ .)
```
```{r}
p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Western Europe","Western Africa","Eastern Europe")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) + 
     facet_grid(. ~ region)
```

Once we do all that, we might want to add multiple plots together into a larger multi-panel graphic.  The `gridExtra` package is great for this.

```{r}
grid.arrange(g1,g2,g3,textGrob("Spiffy!"),ncol=2,nrow=2)
```

You should take a look at the `ggplot2` book linked to above, as well as other stuff like `ggpubr`, some examples can be found [here](http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/).

# Typesetting

In R it is relatively easy to create tables that are presentable in Word, LaTex, or HTML.  A large number of packages -- `apsrtable`, `xtable`, `texreg`, `memisc`, `outreg` and others -- cater to LaTex users.  If you're using word to prepare your documents, `stargazer` is an option that can work nicely.  One thing we might want to do from the above is greate a table of descriptive statistics.

```{r results=FALSE}
vars <- c("polity2","physint","conflictonlocation","lnpop")
path <- "C:/Users/Schwarz/Dropbox/AQR"

stargazer(as.data.frame(out2[,vars]),type="html",
          title="Descriptive Statistics",
          covariate.labels = c("Democracy","Human Rights","Any Conflict?",
                               "Population (Logged)"),
          out=paste0(path,"/summary1.doc"))

```

As you might have gathered from the code above, what we are going to do is use the ability of Word to render html by saving the output with a .doc extension.  

Otherwise, the table looks like this:

<center>
<table style="text-align:center"><caption><strong>Descriptive Statistics</strong></caption>
<tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Pctl(25)</td><td>Pctl(75)</td><td>Max</td></tr>
<tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Democracy</td><td>4,380</td><td>1.746</td><td>7.203</td><td>-10.000</td><td>-6.000</td><td>9.000</td><td>10.000</td></tr>
<tr><td style="text-align:left">Human Rights</td><td>4,487</td><td>4.838</td><td>2.322</td><td>0.000</td><td>3.000</td><td>7.000</td><td>8.000</td></tr>
<tr><td style="text-align:left">Any Conflict?</td><td>5,188</td><td>0.171</td><td>0.377</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><td style="text-align:left">Population (Logged)</td><td>4,457</td><td>15.674</td><td>1.925</td><td>10.615</td><td>14.671</td><td>16.902</td><td>21.000</td></tr>
<tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr></table>
</center>

If you're interested in what other summary statistics are available natively from `stargazer`, check out the link in `omit.summary.stat` in their documentation

```{r eval=FALSE}
?stargazer
```

Something a bit more interesting would be to take a look at Regression tables.  First, let's run some models.  To replicate the Peterson (2017) results, we need to define a function for leading a variable.

```{r}
tscslead <- function(x, cs,ts){ 
  obs <- 1:length(x) 
  lagobs <- match(paste(cs, ts+1, sep="::"), paste(cs, ts, sep="::")) 
  lagx <- x[lagobs]
  lagx
}
```

The next thing that we will do is create the dependent variable with a one period lead
```{r}
out2$physint_lead <- tscslead(out2$physint, out2$ccode, out2$year)
```

And then we will replicate models 1-3
```{r results=FALSE}
m1 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
              + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)

m2 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
              + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)

m3 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
              + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)

stargazer(m1,m2,m3,type="html",out = paste0(path,"/regressions.doc"),
          covariate.labels = c("Export Diversification","Liberalization","
                               Export Dependence","Democracy","Conflict","
                               Log Income","Growth","Log Population",
                               "Linear Time Trend","Lagged DV"),
          title="Replication of Peterson (2017) Models 1-3",
          dep.var.caption = "All States",
          dep.var.labels = "Human Rights")
```

The output looks like this:

<center>
<table style="text-align:center"><caption><strong>Replication of Peterson (2017) Models 1-3</strong></caption>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3">All States</td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td colspan="3">Human Rights</td></tr>
<tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Export Diversification</td><td>0.276<sup>**</sup></td><td>0.242<sup>**</sup></td><td>0.226<sup>**</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.114)</td><td>(0.113)</td><td>(0.114)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Liberalization</td><td>-10.638</td><td></td><td>-57.858<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(8.896)</td><td></td><td>(13.994)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Export Dependence</td><td></td><td>0.202<sup>**</sup></td><td>0.707<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td></td><td>(0.101)</td><td>(0.162)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Democracy</td><td>0.026<sup>***</sup></td><td>0.026<sup>***</sup></td><td>0.027<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.004)</td><td>(0.004)</td><td>(0.004)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Conflict</td><td>-0.583<sup>***</sup></td><td>-0.555<sup>***</sup></td><td>-0.611<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.062)</td><td>(0.061)</td><td>(0.062)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Income</td><td>0.126<sup>***</sup></td><td>0.118<sup>***</sup></td><td>0.131<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.023)</td><td>(0.022)</td><td>(0.023)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Growth</td><td>-0.154</td><td>-0.207</td><td>-0.140</td></tr>
<tr><td style="text-align:left"></td><td>(0.193)</td><td>(0.191)</td><td>(0.192)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Population</td><td>-0.190<sup>***</sup></td><td>-0.174<sup>***</sup></td><td>-0.226<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.020)</td><td>(0.018)</td><td>(0.022)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Linear Time Trend</td><td>-0.014<sup>***</sup></td><td>-0.014<sup>***</sup></td><td>-0.016<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.002)</td><td>(0.002)</td><td>(0.002)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Lagged DV</td><td>0.675<sup>***</sup></td><td>0.679<sup>***</sup></td><td>0.666<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.013)</td><td>(0.012)</td><td>(0.013)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Constant</td><td>31.817<sup>***</sup></td><td>31.360<sup>***</sup></td><td>35.327<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(4.914)</td><td>(4.876)</td><td>(4.967)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>3,479</td><td>3,562</td><td>3,479</td></tr>
<tr><td style="text-align:left">R<sup>2</sup></td><td>0.751</td><td>0.754</td><td>0.753</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.751</td><td>0.753</td><td>0.752</td></tr>
<tr><td style="text-align:left">Residual Std. Error</td><td>1.118 (df = 3469)</td><td>1.121 (df = 3552)</td><td>1.115 (df = 3468)</td></tr>
<tr><td style="text-align:left">F Statistic</td><td>1,164.520<sup>***</sup> (df = 9; 3469)</td><td>1,208.221<sup>***</sup> (df = 9; 3552)</td><td>1,055.422<sup>***</sup> (df = 10; 3468)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>
</center>

Stargazer has a number of useful "style" options which format close to a number of major journals, which may be worth looking at or playing around with.

# Exercises

The best way to learn R is to use R.  Let's play around with some of the above concepts and coding approaches to replicate a number of other aspects of the paper and 'probe robustness' to deviations in specification.

## Exercise 1 {.tabset}
### Description
Figure 1 of Peterson (2017) shows two histograms of Export Diversity (`twoway`), one for all states and the other excluding high states (those with a GDP Per Capita larger than 11101.2).  Reproduce this image as closely as possible; this will take digging into the documentation or Googling to accomplish particular customizations!  The final output should look something like this:

<center>
![Exercise 1 Desired Output](C:/Users/Schwarz/Dropbox/AQR/Exercise1.jpg)
</center>

### Answer
```{r results=FALSE}
sub <- out2[which(out2$gdppc < 11101.2),]

out2$all_states <- "All States"
sub$exc <- "Excluding high-income states"

out2 %>% ggplot(aes(x=twoway)) + 
  geom_histogram(binwidth=0.03,col="black") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5),
        strip.background = element_rect(fill="grey"),
        strip.text = element_text(size=13)) +
  xlab("Export diversity") +
  ylab("Frequency") +
  facet_grid(.~all_states)-> f1
sub %>% ggplot(aes(x=twoway)) + 
  geom_histogram(binwidth=0.03,col="black") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5),
        strip.background = element_rect(fill="grey"),
        strip.text = element_text(size=13)) +
  xlab("Export diversity") +
  ylab("Frequency") +
  facet_grid(.~exc) -> f2

jpeg(paste0(path,"/Exercise1.jpg"))

grid.arrange(f1,f2,ncol=2,nrow=1, 
             bottom = textGrob("Figure 1. Histogram of export diversity for all states (left) and excluding high-income states (right).", gp=gpar(fontsize=10)))
grid.rect(width = 1, height = 0.96, vjust=0.485 , gp = gpar(lwd = 2, col = "black", fill = NA))

dev.off()
```

## Exercise 2 {.tabset}

### Description
Models 4-6 replicate use the same specifications as above but excluding high income states.  Run these regressions and format the results to get the following:

<center>
<table style="text-align:center"><caption><strong>Replication of Peterson (2017) Models 4-6</strong></caption>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3">Excluding high income states</td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td colspan="3">Human Rights</td></tr>
<tr><td style="text-align:left"></td><td>(4)</td><td>(5)</td><td>(6)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Export Diversification</td><td>0.472<sup>***</sup></td><td>0.471<sup>***</sup></td><td>0.450<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.141)</td><td>(0.139)</td><td>(0.141)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Liberalization</td><td>-19.891<sup>*</sup></td><td></td><td>-44.903<sup>**</sup></td></tr>
<tr><td style="text-align:left"></td><td>(11.636)</td><td></td><td>(20.707)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Export Dependence</td><td></td><td>-0.077</td><td>0.333</td></tr>
<tr><td style="text-align:left"></td><td></td><td>(0.126)</td><td>(0.228)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Democracy</td><td>0.024<sup>***</sup></td><td>0.023<sup>***</sup></td><td>0.025<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.004)</td><td>(0.004)</td><td>(0.004)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Conflict</td><td>-0.598<sup>***</sup></td><td>-0.578<sup>***</sup></td><td>-0.597<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.072)</td><td>(0.071)</td><td>(0.072)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Income</td><td>-0.030</td><td>-0.038</td><td>-0.017</td></tr>
<tr><td style="text-align:left"></td><td>(0.034)</td><td>(0.034)</td><td>(0.035)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Growth</td><td>-0.066</td><td>-0.141</td><td>-0.060</td></tr>
<tr><td style="text-align:left"></td><td>(0.232)</td><td>(0.229)</td><td>(0.232)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Population</td><td>-0.290<sup>***</sup></td><td>-0.273<sup>***</sup></td><td>-0.307<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.027)</td><td>(0.024)</td><td>(0.029)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Linear Time Trend</td><td>-0.015<sup>***</sup></td><td>-0.014<sup>***</sup></td><td>-0.015<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.003)</td><td>(0.003)</td><td>(0.003)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Lagged DV</td><td>0.622<sup>***</sup></td><td>0.628<sup>***</sup></td><td>0.621<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.016)</td><td>(0.016)</td><td>(0.016)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Constant</td><td>36.093<sup>***</sup></td><td>34.344<sup>***</sup></td><td>37.349<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(6.419)</td><td>(6.352)</td><td>(6.475)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>2,488</td><td>2,550</td><td>2,488</td></tr>
<tr><td style="text-align:left">R<sup>2</sup></td><td>0.682</td><td>0.682</td><td>0.683</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.681</td><td>0.681</td><td>0.681</td></tr>
<tr><td style="text-align:left">Residual Std. Error</td><td>1.190 (df = 2478)</td><td>1.196 (df = 2540)</td><td>1.190 (df = 2477)</td></tr>
<tr><td style="text-align:left">F Statistic</td><td>591.627<sup>***</sup> (df = 9; 2478)</td><td>605.829<sup>***</sup> (df = 9; 2540)</td><td>532.921<sup>***</sup> (df = 10; 2477)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>
</center>

### Answer
```{r results=FALSE}
m5 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = sub)

m6 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = sub)

m7 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = sub)

stargazer(m5,m6,m7,type="html",out = paste0(path,"/regressions2.doc"),
          covariate.labels = c("Export Diversification","Liberalization","
                               Export Dependence","Democracy","Conflict","
                               Log Income","Growth","Log Population",
                               "Linear Time Trend","Lagged DV"),
          column.labels = c("(4)","(5)","(6)"),
          model.numbers = F,
          title="Replication of Peterson (2017) Models 4-6",
          dep.var.caption = "Excluding high income states",
          dep.var.labels = "Human Rights")
```

## Exercise 3 {.tabset}

### Description
One might wonder whether the results are robust to the inclusion of region or state fixed effects.  These can be entered into the regression equation by including, for example, `as.factor(ccode)` to the regression equation.  Fixed effects are generally omitted from regression tables; figure out how to exclude regressors from `stargazer` output and add an additional note to the table indicating that fixed effects were included.  You should also remove the R-squared, Residual Standard Error, and F Statistic from being reported.  The output should look something like this:

<center>
<table style="text-align:center"><caption><strong>Robustness of Peterson (2017) Models 1-3</strong></caption>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3">All States</td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td colspan="3">Human Rights</td></tr>
<tr><td style="text-align:left"></td><td>(1a)</td><td>(2a)</td><td>(3a)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Export Diversification</td><td>0.190</td><td>0.220</td><td>0.248</td></tr>
<tr><td style="text-align:left"></td><td>(0.259)</td><td>(0.257)</td><td>(0.260)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Liberalization</td><td>21.905</td><td></td><td>-9.210</td></tr>
<tr><td style="text-align:left"></td><td>(17.509)</td><td></td><td>(20.824)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Export Dependence</td><td></td><td>0.587<sup>***</sup></td><td>0.633<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td></td><td>(0.191)</td><td>(0.230)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Democracy</td><td>0.037<sup>***</sup></td><td>0.039<sup>***</sup></td><td>0.039<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.006)</td><td>(0.006)</td><td>(0.006)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Conflict</td><td>-0.520<sup>***</sup></td><td>-0.495<sup>***</sup></td><td>-0.508<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.077)</td><td>(0.075)</td><td>(0.077)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Income</td><td>0.029</td><td>0.086</td><td>0.077</td></tr>
<tr><td style="text-align:left"></td><td>(0.091)</td><td>(0.090)</td><td>(0.092)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Growth</td><td>0.016</td><td>-0.039</td><td>0.032</td></tr>
<tr><td style="text-align:left"></td><td>(0.188)</td><td>(0.186)</td><td>(0.188)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Population</td><td>-0.324</td><td>-0.244</td><td>-0.267</td></tr>
<tr><td style="text-align:left"></td><td>(0.237)</td><td>(0.234)</td><td>(0.238)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Linear Time Trend</td><td>-0.017<sup>***</sup></td><td>-0.022<sup>***</sup></td><td>-0.022<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.005)</td><td>(0.006)</td><td>(0.006)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Lagged DV</td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Constant</td><td>44.060<sup>***</sup></td><td>50.103<sup>***</sup></td><td>50.438<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(7.467)</td><td>(7.665)</td><td>(7.811)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>3,479</td><td>3,562</td><td>3,479</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.775</td><td>0.779</td><td>0.776</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
<tr><td style="text-align:left"></td><td colspan="3" style="text-align:right">All models include country fixed effects.</td></tr>
</table>
</center>

### Answer
```{r results=FALSE}
m7 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint + as.factor(ccode), data = out2)

m8 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint + as.factor(ccode), data = out2)

m9 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint + as.factor(ccode), data = out2)

stargazer(m7,m8,m9,type="html",out = paste0(path,"/regressions3.doc"),
          covariate.labels = c("Export Diversification","Liberalization","
                               Export Dependence","Democracy","Conflict","
                               Log Income","Growth","Log Population",
                               "Linear Time Trend","Lagged DV"),
          title="Robustness of Peterson (2017) Models 1-3",
          omit=11:length(m9$coefficients),
          notes = "All models include country fixed effects.",
          notes.append=TRUE,
          column.labels = c("(1a)","(2a)","(3a)"),
          model.numbers = F,
          dep.var.caption = "All States",
          dep.var.labels = "Human Rights",
          omit.stat = c("f","rsq","ser"))
```

# Final Words

That's all I have for you today.  Hope it was helpful!  Feel free to reach out to NYU Data Services if you encounter any issues with programming in R or would like to be directed to any additional resources.  Over the past two sessions we have only touched the surface of what R is able to do thanks to the vibrant community of scholars and practitioners who contribute to the project.  If you are looking to get further into R for statistical analysis, I highly recommend checking out what is available on the various [taskviews](https://cran.r-project.org/web/views/) or even these neat [swirl courses](https://swirlstats.com/) where you can learn R in R.




