This document is a quick introduction to data wrangling, graphics, and typesetting in R. You can download the code by clicking the “Code” button to the upper right.

First thing is first; let’s install/load all of the packages that we will be using, and clear out the environment.

packages <- c("haven","dplyr","ggplot2","countrycode","tidyr","gridExtra","grid",
              "stargazer")
for(i in packages){
  if(!require(i,character.only = T, quietly = T)){
    install.packages(i)
    library(i, character.only = T, quietly = T)
  }
}

rm(list=ls())

We will be using data from Peterson (2017): Export Diversity and Human Rights. You can download the replication archive by clicking here or download the data directly by running the following chunk.

d <- read_dta("https://bit.ly/3a9nUSL")
d
## # A tibble: 5,188 x 18
##    ccode  year twoway inhhi comper polity2 physint conflictonlocat~ lnpop
##    <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl>   <dbl>            <dbl> <dbl>
##  1     2  1981  0.986 0.990  0.997      10       8                0  19.3
##  2     2  1982  0.988 0.989  0.998      10       8                0  19.3
##  3     2  1983  0.982 0.989  0.994      10       8                1  19.3
##  4     2  1984  0.987 0.987  1          10       8                0  19.3
##  5     2  1985  0.985 0.987  0.998      10       7                0  19.3
##  6     2  1986  0.980 0.987  0.994      10       7                0  19.3
##  7     2  1987  0.982 0.987  0.995      10       8                0  19.3
##  8     2  1988  0.977 0.988  0.989      10       7                0  19.3
##  9     2  1989  0.980 0.988  0.992      10       7                1  19.3
## 10     2  1990  0.983 0.988  0.995      10       8                0  19.4
## # ... with 5,178 more rows, and 9 more variables: lngdppc <dbl>, gdppc <dbl>,
## #   expdep <dbl>, gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## #   sdarab_manual <dbl>, sdpop_manual <dbl>

Data Wrangling

Selecting Cases and Subsetting

As with most things in R there are multiple ways of accomplishing the same goal. Being able to switch between approaches can help you accomplish more complicated tasks. I generally stick to base R for subsetting. To get indices which satisfy logical statements you can use the which function

which(d$gdppc > 50000)
##  [1] 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 2116 2117 2118
## [16] 2119 2120 2121 2122 2123 2124 2125 2126 3981 3982 3983 3984 3985 3986 4017
## [31] 4018 4019 4020 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046
## [46] 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816
## [61] 4817 4818 4820 4821 4823 4824 4825 4826 4827 4828 4830
which(d$gdppc < 2000 & d$polity2 == 10)
## [1] 4277 4278 4279 4280
which(d$gdpgrowth < -.5 | d$gdppc < 100)
## [1] 3040 3369 3717 3805 3967

We can combine this with indexing to subset down the data. We can also call columns in a variety of ways. Remember that you can create objects carying this information to modularize your code, which might be helpful in particular situations to keep everything clear.

inds <- which(d$gdppc > 50000)
cols <- c("ccode","year","physint","lnpop")

sub1 <- d[inds,cols]
sub1
## # A tibble: 71 x 4
##    ccode  year physint lnpop
##    <dbl> <dbl>   <dbl> <dbl>
##  1   212  1999       8  13.0
##  2   212  2000       8  13.0
##  3   212  2001       8  13.0
##  4   212  2002       8  13.0
##  5   212  2003       8  13.0
##  6   212  2004       8  13.0
##  7   212  2005       8  13.0
##  8   212  2006       8  13.1
##  9   212  2007       8  13.1
## 10   212  2008       8  13.1
## # ... with 61 more rows

Alternatively, one could use the subset function from base R to get the same result.

sub2 <- subset(d,d$gdppc > 50000,cols)
sub2
## # A tibble: 71 x 4
##    ccode  year physint lnpop
##    <dbl> <dbl>   <dbl> <dbl>
##  1   212  1999       8  13.0
##  2   212  2000       8  13.0
##  3   212  2001       8  13.0
##  4   212  2002       8  13.0
##  5   212  2003       8  13.0
##  6   212  2004       8  13.0
##  7   212  2005       8  13.0
##  8   212  2006       8  13.1
##  9   212  2007       8  13.1
## 10   212  2008       8  13.1
## # ... with 61 more rows
identical(sub1,sub2)
## [1] TRUE

Another great option is the dplyr package, which is part of the tidyverse alongside the packages haven and ggplot2. One of the best things about the tidyverse family of packages is that they are very well documented, including a variety of cheatsheets and books. One thing that makes the wrangling tools particularly powerful is that they leverage a pipe (%>%) from the magrittr package which says,in pseudo-code, that x %>% f(y) is the same as f(x,y). This can create nice work flow. For example, to get the same subset yet again:

d %>% filter(gdppc > 50000) %>% select(all_of(cols)) -> sub3

all(sub1 == sub3, na.rm=T)
## [1] TRUE

Neat. What we did was took our dataframe, filtered the rows that we wanted, and then selected the columns of interest.

If we want to sort data, there is a base R approach for vectors.

head(sort(d$gdpgrowth))
## [1] -0.6532034 -0.6011391 -0.5537450 -0.5498320 -0.5225903 -0.4939937

For dataframes you have to use order, which produces index numbers that can be used as before

d[order(d$gdpgrowth),c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 x 3
##    ccode  year gdpgrowth
##    <dbl> <dbl>     <dbl>
##  1   645  1991    -0.653
##  2   572  2003    -0.601
##  3   517  1994    -0.554
##  4   690  1991    -0.550
##  5   660  1989    -0.523
##  6    92  2007    -0.494
##  7   475  1986    -0.480
##  8   450  1990    -0.475
##  9   373  1993    -0.440
## 10   411  1990    -0.423
## # ... with 5,178 more rows

We can also switch the ordering around by setting decreasing = T

d[order(d$gdpgrowth,decreasing = T),c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 x 3
##    ccode  year gdpgrowth
##    <dbl> <dbl>     <dbl>
##  1   690  1992     1.86 
##  2   572  2005     1.48 
##  3   450  1997     1.39 
##  4   411  1997     1.38 
##  5    92  2009     1.13 
##  6   411  2002     0.874
##  7   345  1996     0.854
##  8   411  1999     0.788
##  9   552  2010     0.752
## 10   411  1992     0.731
## # ... with 5,178 more rows

Or we could use the handy %>%. In this case we have to use the placeholder . for the input, which might be handy to know that you can do for more complicated functions.

order(d$gdpgrowth,decreasing = T) %>% d[.,c("ccode","year","gdpgrowth")]
## # A tibble: 5,188 x 3
##    ccode  year gdpgrowth
##    <dbl> <dbl>     <dbl>
##  1   690  1992     1.86 
##  2   572  2005     1.48 
##  3   450  1997     1.39 
##  4   411  1997     1.38 
##  5    92  2009     1.13 
##  6   411  2002     0.874
##  7   345  1996     0.854
##  8   411  1999     0.788
##  9   552  2010     0.752
## 10   411  1992     0.731
## # ... with 5,178 more rows

There is also a handy arrange function in dplyr for accomplishing the same sort of task but allowing to sort based on multiple columns.

d %>% select(c("ccode","year","gdpgrowth")) %>% arrange(ccode,gdpgrowth)
## # A tibble: 5,188 x 3
##    ccode  year gdpgrowth
##    <dbl> <dbl>     <dbl>
##  1     2  2009  -0.0359 
##  2     2  1982  -0.0233 
##  3     2  2008  -0.00629
##  4     2  1991  -0.00469
##  5     2  2001   0.00829
##  6     2  2002   0.0129 
##  7     2  1990   0.0150 
##  8     2  2007   0.0195 
##  9     2  2003   0.0238 
## 10     2  2010   0.0256 
## # ... with 5,178 more rows

Converting between Long and Wide

Sometimes data comes in “Long” format, like the data we have been working with, where each row represents an observational unit (here country - year). In other cases data will come in “Wide” format like this or this

We might want to convert data back and forth depending on what we want to do. Let’s do this for the GDP per capita variable; see NYU Data Services for a handy guide to reshaping.

long <- d[,c("ccode","year","gdppc")]
long
## # A tibble: 5,188 x 3
##    ccode  year  gdppc
##    <dbl> <dbl>  <dbl>
##  1     2  1981 25977.
##  2     2  1982 25131.
##  3     2  1983 26026.
##  4     2  1984 27751.
##  5     2  1985 28455.
##  6     2  1986 29014.
##  7     2  1987 29639.
##  8     2  1988 30506.
##  9     2  1989 31274.
## 10     2  1990 31432.
## # ... with 5,178 more rows

Long to Wide with tidyr (updated versions of gather and spread)

long %>% pivot_wider(names_from = year, values_from = gdppc) -> wide
wide
## # A tibble: 189 x 31
##    ccode `1981` `1982` `1983` `1984` `1985` `1986` `1987` `1988` `1989` `1990`
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     2 25977. 25131. 26026. 27751. 28455. 29014. 29639. 30506. 31274. 31432.
##  2    20 22718. 22267. 23001. 24238. 25387. 25987. 27535. 28487. 29005. 28589.
##  3    31 12648. 14004. 14880. 17393. 18428. 17593. 18885. 17582. 18414. 16909.
##  4    40    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
##  5    41    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
##  6    42  3600.  3638.  3719.  4064.  4100.  4273.  4673.  4826.  4364.  3923.
##  7    51  3904.  3811.  3862.  3744.  3329.  3539.  3875.  3958.  3962.  4267.
##  8    52 24382. 25326. 22556. 20483. 19106. 17596. 14946. 14170. 12743. 12770.
##  9    53 10365. 10358. 10463. 10725.  9966. 11390. 13180. 13719. 15294. 15270.
## 10    54  4407.  4683.  4862.  5109.  5008.  5477.  5973.  6233.  6168.  6562.
## # ... with 179 more rows, and 20 more variables: `1991` <dbl>, `1992` <dbl>,
## #   `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
## #   `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>,
## #   `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>,
## #   `2008` <dbl>, `2009` <dbl>, `2010` <dbl>

Back to long

wide %>% pivot_longer(cols=-ccode,names_to = "year", values_to = "gdppc") -> long_again
long_again
## # A tibble: 5,670 x 3
##    ccode year   gdppc
##    <dbl> <chr>  <dbl>
##  1     2 1981  25977.
##  2     2 1982  25131.
##  3     2 1983  26026.
##  4     2 1984  27751.
##  5     2 1985  28455.
##  6     2 1986  29014.
##  7     2 1987  29639.
##  8     2 1988  30506.
##  9     2 1989  31274.
## 10     2 1990  31432.
## # ... with 5,660 more rows

Merging Datasets

You may have noticed that the ccode variable isn’t particularly descriptive for which country it means. We can use this to illustrate merging datasets together. At the start we loaded in the countrycode package which contains additional information.

codes <- countrycode::codelist_panel

Let’s see what they have.

colnames(codes)
##  [1] "country.name.en"       "year"                  "ar5"                  
##  [4] "continent"             "country.name.de"       "country.name.de.regex"
##  [7] "country.name.en.regex" "cow.name"              "cowc"                 
## [10] "cown"                  "ecb"                   "ecb.name"             
## [13] "eu28"                  "eurocontrol_pru"       "eurocontrol_statfor"  
## [16] "eurostat"              "eurostat.name"         "fao"                  
## [19] "fao.name"              "fips"                  "fips.name"            
## [22] "gaul"                  "genc.name"             "genc2c"               
## [25] "genc3c"                "genc3n"                "gwc"                  
## [28] "gwn"                   "icao.name"             "icao.region"          
## [31] "imf"                   "ioc"                   "ioc.name"             
## [34] "iso.name.en"           "iso.name.fr"           "iso2c"                
## [37] "iso3c"                 "iso3n"                 "p4.name"              
## [40] "p4c"                   "p4n"                   "region"               
## [43] "un"                    "un.name.ar"            "un.name.en"           
## [46] "un.name.es"            "un.name.fr"            "un.name.ru"           
## [49] "un.name.zh"            "unpd"                  "unpd.name"            
## [52] "vdem"                  "vdem.name"             "wb"                   
## [55] "wb_api.name"           "wb_api2c"              "wb_api3c"             
## [58] "wb.name"               "wvs"                   "wvs.name"

The country codes we are currently using are cown. Let’s grab iso3c and region to add to the dataset. We also know that the dataset we are working with only has years from 1981 to 2010, so let’s practice our subsetting skillz

codes <- codes[codes$year %in% 1981:2010,c("cown","year","iso3c","country.name.en","region")]

One thing to pay attention to is losing or gaining observations durring a merge. For a great overview, check out this handy NYU Data Services guide.

nrow(d)
## [1] 5188
out1 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"))
nrow(out1)
## [1] 5125
out2 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.x=T)
nrow(out2)
## [1] 5188
out3 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.y=T)
nrow(out3)
## [1] 6959
out4 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all=T)
nrow(out4)
## [1] 7022

And, of course, we can do the same merges using dplyr with inner_join, left_join, right_join, and full_join respectively. Going forward we will keep out2 as the working dataset.

Aggregates and Summaries

There are a number of great things you can do with the apply family of functions, including easily going in parallel with the pbapply package. If you are interested in more details on this you should check out this tutorial and this taskview. We will focus on using dplyr to calculate summaries of interest.

One reason for this is that it is super easy to calculate summaries grouping on another variable. For example, if we wanted to think about regional variation in gdppc we could

out2 %>% group_by(region) %>% 
  summarize(mean=mean(gdppc,na.rm=T),sd=sd(gdppc,na.rm=T),sum=sum(gdppc,na.rm=T))
## # A tibble: 23 x 4
##    region                      mean     sd      sum
##    <chr>                      <dbl>  <dbl>    <dbl>
##  1 Australia and New Zealand 24990.  5898. 1499423.
##  2 Caribbean                  9328.  5894. 3059582.
##  3 Central America            5276.  3314. 1107909.
##  4 Central Asia               4762.  2813.  452347.
##  5 Eastern Africa             1803.  2240.  735621.
##  6 Eastern Asia              11792. 10740. 1415087.
##  7 Eastern Europe             9855.  4712. 2305957.
##  8 Melanesia                  4459.   558.  133778.
##  9 Micronesia                  NaN     NA        0 
## 10 Middle Africa              3184.  4070.  859647.
## # ... with 13 more rows

We can also use the mutate function to add this information to our dataframe. In base R this would take mergeing the output of aggregate, so it can certainly be done, but dplyr makes it somewhat more straightforward and scaleable.

out2 %>% group_by(region) %>% 
  mutate(mean_gdppc=mean(gdppc,na.rm=T),sd_gdppc=sd(gdppc,na.rm=T)) -> out2

out2
## # A tibble: 5,188 x 23
## # Groups:   region [23]
##    ccode  year twoway inhhi comper polity2 physint conflictonlocat~ lnpop
##    <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl>   <dbl>            <dbl> <dbl>
##  1     2  1981  0.986 0.990  0.997      10       8                0  19.3
##  2     2  1982  0.988 0.989  0.998      10       8                0  19.3
##  3     2  1983  0.982 0.989  0.994      10       8                1  19.3
##  4     2  1984  0.987 0.987  1          10       8                0  19.3
##  5     2  1985  0.985 0.987  0.998      10       7                0  19.3
##  6     2  1986  0.980 0.987  0.994      10       7                0  19.3
##  7     2  1987  0.982 0.987  0.995      10       8                0  19.3
##  8     2  1988  0.977 0.988  0.989      10       7                0  19.3
##  9     2  1989  0.980 0.988  0.992      10       7                1  19.3
## 10     2  1990  0.983 0.988  0.995      10       8                0  19.4
## # ... with 5,178 more rows, and 14 more variables: lngdppc <dbl>, gdppc <dbl>,
## #   expdep <dbl>, gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## #   sdarab_manual <dbl>, sdpop_manual <dbl>, iso3c <chr>,
## #   country.name.en <chr>, region <chr>, mean_gdppc <dbl>, sd_gdppc <dbl>

A base R version of the above might be

a1 <- aggregate(out2$gdppc,by=list(out2$region),mean,na.rm=T)
a1
##                      Group.1         x
## 1  Australia and New Zealand 24990.386
## 2                  Caribbean  9327.994
## 3            Central America  5275.757
## 4               Central Asia  4761.549
## 5             Eastern Africa  1802.991
## 6               Eastern Asia 11792.392
## 7             Eastern Europe  9854.516
## 8                  Melanesia  4459.267
## 9                 Micronesia       NaN
## 10             Middle Africa  3183.878
## 11           Northern Africa  3388.796
## 12          Northern America 32390.804
## 13           Northern Europe 24848.061
## 14                 Polynesia       NaN
## 15        South-Eastern Asia 12756.196
## 16             South America  6143.031
## 17           Southern Africa  5279.329
## 18             Southern Asia  3029.750
## 19           Southern Europe 14384.153
## 20            Western Africa  1175.762
## 21              Western Asia 15632.999
## 22            Western Europe 30016.166
colnames(a1) <- c("region","mean_gdppc")
a2 <- aggregate(out2$gdppc,by=list(out2$region),sd,na.rm=T)
colnames(a2) <- c("region","sd_gdppc")

t1 <- merge(out2,a1,by="region")
t2 <- merge(t1,a2,by="region")
tbl_df(t2)
## # A tibble: 5,063 x 25
##    region ccode  year twoway inhhi comper polity2 physint conflictonlocat~ lnpop
##    <chr>  <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl>   <dbl>            <dbl> <dbl>
##  1 Austr~   900  1982  0.915 0.955  0.958      10       8                0  16.5
##  2 Austr~   900  1983  0.920 0.951  0.968      10       7                0  16.5
##  3 Austr~   900  1984  0.917 0.952  0.963      10       8                0  16.6
##  4 Austr~   900  1986  0.912 0.945  0.965      10       7                0  16.6
##  5 Austr~   900  1987  0.921 0.954  0.966      10       8                0  16.6
##  6 Austr~   900  1988  0.924 0.958  0.965      10       6                0  16.6
##  7 Austr~   900  1985  0.903 0.945  0.955      10       8                0  16.6
##  8 Austr~   900  1990  0.928 0.961  0.966      10       7                0  16.7
##  9 Austr~   900  1991  0.914 0.958  0.955      10       7                0  16.7
## 10 Austr~   900  1992  0.932 0.958  0.972      10       7                0  16.7
## # ... with 5,053 more rows, and 15 more variables: lngdppc <dbl>, gdppc <dbl>,
## #   expdep <dbl>, gdpgrowth <dbl>, lib_HK <dbl>, meanarab <dbl>, meanpop <dbl>,
## #   sdarab_manual <dbl>, sdpop_manual <dbl>, iso3c <chr>,
## #   country.name.en <chr>, mean_gdppc.x <dbl>, sd_gdppc.x <dbl>,
## #   mean_gdppc.y <dbl>, sd_gdppc.y <dbl>

but the dplyr approach really is quite nice.

Graphics

We will focus on using ggplot2 for graphics in R, although base R has nice capabilities on its own. ggplot is all about the `grammar of graphics’ which follows a layered approach to describe and construct graphics in a structured manner. To begin, we will always initialize a plot:

p1 <- ggplot(out2[which(out2$region == "Northern America"),], aes(x=log(gdppc)))

To get different plots, we will add layers. For example, if we wanted a dot plot

p1 + geom_dotplot(binwidth=0.03)

or a histogram

p1 + geom_histogram(binwidth=0.03)

or a density plot

p1 + geom_density()

we can just add a different layer to the same underlying plot.

The order of the layers does not matter, and there are a bunch more customizations that we can add.

p1 + geom_histogram(color="black",fill="darkblue",binwidth = 0.03) +
     xlab("Natural Log of Per Capita GDP") +
     ylab("Frequency") +
     ggtitle('North American GDPPC') +
     theme_bw()  -> g1
g1

You can also add multiple geometries to the same underderlying plot.

p2 <- ggplot(out2[which(out2$region == "Eastern Europe"),],aes(x=year,y=log(gdppc),color=iso3c))
p2  + geom_point(na.rm=T) + 
      geom_line(na.rm=T) +
      labs(color="Country") +
      scale_color_brewer(palette="Spectral")  -> g2
g2

You can even add some smoothers if you want.

p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
     geom_smooth(color ="gray", method = "lm", se = TRUE,na.rm=T, formula=y~x)

p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
     geom_smooth(color ="gray", method = "loess", se = TRUE,formula=y~x, na.rm=T) -> g3
g3

Two last notes on plots – faceting and adding plots together into a larger image.

Faceting can be a nice way to break up a continuous variable by category.

p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Western Europe","Western Africa","Eastern Europe")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) + 
     facet_grid(region ~ .)

p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Western Europe","Western Africa","Eastern Europe")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) + 
     facet_grid(. ~ region)

Once we do all that, we might want to add multiple plots together into a larger multi-panel graphic. The gridExtra package is great for this.

grid.arrange(g1,g2,g3,textGrob("Spiffy!"),ncol=2,nrow=2)

You should take a look at the ggplot2 book linked to above, as well as other stuff like ggpubr, some examples can be found here.

Typesetting

In R it is relatively easy to create tables that are presentable in Word, LaTex, or HTML. A large number of packages – apsrtable, xtable, texreg, memisc, outreg and others – cater to LaTex users. If you’re using word to prepare your documents, stargazer is an option that can work nicely. One thing we might want to do from the above is greate a table of descriptive statistics.

vars <- c("polity2","physint","conflictonlocation","lnpop")
path <- "C:/Users/Schwarz/Dropbox/AQR"

stargazer(as.data.frame(out2[,vars]),type="html",
          title="Descriptive Statistics",
          covariate.labels = c("Democracy","Human Rights","Any Conflict?",
                               "Population (Logged)"),
          out=paste0(path,"/summary1.doc"))

As you might have gathered from the code above, what we are going to do is use the ability of Word to render html by saving the output with a .doc extension.

Otherwise, the table looks like this:

Descriptive Statistics
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
Democracy 4,380 1.746 7.203 -10.000 -6.000 9.000 10.000
Human Rights 4,487 4.838 2.322 0.000 3.000 7.000 8.000
Any Conflict? 5,188 0.171 0.377 0 0 0 1
Population (Logged) 4,457 15.674 1.925 10.615 14.671 16.902 21.000

If you’re interested in what other summary statistics are available natively from stargazer, check out the link in omit.summary.stat in their documentation

?stargazer

Something a bit more interesting would be to take a look at Regression tables. First, let’s run some models. To replicate the Peterson (2017) results, we need to define a function for leading a variable.

tscslead <- function(x, cs,ts){ 
  obs <- 1:length(x) 
  lagobs <- match(paste(cs, ts+1, sep="::"), paste(cs, ts, sep="::")) 
  lagx <- x[lagobs]
  lagx
}

The next thing that we will do is create the dependent variable with a one period lead

out2$physint_lead <- tscslead(out2$physint, out2$ccode, out2$year)

And then we will replicate models 1-3

m1 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
              + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)

m2 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
              + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)

m3 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
              + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)

stargazer(m1,m2,m3,type="html",out = paste0(path,"/regressions.doc"),
          covariate.labels = c("Export Diversification","Liberalization","
                               Export Dependence","Democracy","Conflict","
                               Log Income","Growth","Log Population",
                               "Linear Time Trend","Lagged DV"),
          title="Replication of Peterson (2017) Models 1-3",
          dep.var.caption = "All States",
          dep.var.labels = "Human Rights")

The output looks like this:

Replication of Peterson (2017) Models 1-3
All States
Human Rights
(1) (2) (3)
Export Diversification 0.276** 0.242** 0.226**
(0.114) (0.113) (0.114)
Liberalization -10.638 -57.858***
(8.896) (13.994)
Export Dependence 0.202** 0.707***
(0.101) (0.162)
Democracy 0.026*** 0.026*** 0.027***
(0.004) (0.004) (0.004)
Conflict -0.583*** -0.555*** -0.611***
(0.062) (0.061) (0.062)
Log Income 0.126*** 0.118*** 0.131***
(0.023) (0.022) (0.023)
Growth -0.154 -0.207 -0.140
(0.193) (0.191) (0.192)
Log Population -0.190*** -0.174*** -0.226***
(0.020) (0.018) (0.022)
Linear Time Trend -0.014*** -0.014*** -0.016***
(0.002) (0.002) (0.002)
Lagged DV 0.675*** 0.679*** 0.666***
(0.013) (0.012) (0.013)
Constant 31.817*** 31.360*** 35.327***
(4.914) (4.876) (4.967)
Observations 3,479 3,562 3,479
R2 0.751 0.754 0.753
Adjusted R2 0.751 0.753 0.752
Residual Std. Error 1.118 (df = 3469) 1.121 (df = 3552) 1.115 (df = 3468)
F Statistic 1,164.520*** (df = 9; 3469) 1,208.221*** (df = 9; 3552) 1,055.422*** (df = 10; 3468)
Note: p<0.1; p<0.05; p<0.01

Stargazer has a number of useful “style” options which format close to a number of major journals, which may be worth looking at or playing around with.

Exercises

The best way to learn R is to use R. Let’s play around with some of the above concepts and coding approaches to replicate a number of other aspects of the paper and ‘probe robustness’ to deviations in specification.

Exercise 1

Description

Figure 1 of Peterson (2017) shows two histograms of Export Diversity (twoway), one for all states and the other excluding high states (those with a GDP Per Capita larger than 11101.2). Reproduce this image as closely as possible; this will take digging into the documentation or Googling to accomplish particular customizations! The final output should look something like this:

Exercise 1 Desired Output

Answer

sub <- out2[which(out2$gdppc < 11101.2),]

out2$all_states <- "All States"
sub$exc <- "Excluding high-income states"

out2 %>% ggplot(aes(x=twoway)) + 
  geom_histogram(binwidth=0.03,col="black") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5),
        strip.background = element_rect(fill="grey"),
        strip.text = element_text(size=13)) +
  xlab("Export diversity") +
  ylab("Frequency") +
  facet_grid(.~all_states)-> f1
sub %>% ggplot(aes(x=twoway)) + 
  geom_histogram(binwidth=0.03,col="black") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5),
        strip.background = element_rect(fill="grey"),
        strip.text = element_text(size=13)) +
  xlab("Export diversity") +
  ylab("Frequency") +
  facet_grid(.~exc) -> f2

jpeg(paste0(path,"/Exercise1.jpg"))

grid.arrange(f1,f2,ncol=2,nrow=1, 
             bottom = textGrob("Figure 1. Histogram of export diversity for all states (left) and excluding high-income states (right).", gp=gpar(fontsize=10)))
grid.rect(width = 1, height = 0.96, vjust=0.485 , gp = gpar(lwd = 2, col = "black", fill = NA))

dev.off()

Exercise 2

Description

Models 4-6 replicate use the same specifications as above but excluding high income states. Run these regressions and format the results to get the following:

Replication of Peterson (2017) Models 4-6
Excluding high income states
Human Rights
(4) (5) (6)
Export Diversification 0.472*** 0.471*** 0.450***
(0.141) (0.139) (0.141)
Liberalization -19.891* -44.903**
(11.636) (20.707)
Export Dependence -0.077 0.333
(0.126) (0.228)
Democracy 0.024*** 0.023*** 0.025***
(0.004) (0.004) (0.004)
Conflict -0.598*** -0.578*** -0.597***
(0.072) (0.071) (0.072)
Log Income -0.030 -0.038 -0.017
(0.034) (0.034) (0.035)
Growth -0.066 -0.141 -0.060
(0.232) (0.229) (0.232)
Log Population -0.290*** -0.273*** -0.307***
(0.027) (0.024) (0.029)
Linear Time Trend -0.015*** -0.014*** -0.015***
(0.003) (0.003) (0.003)
Lagged DV 0.622*** 0.628*** 0.621***
(0.016) (0.016) (0.016)
Constant 36.093*** 34.344*** 37.349***
(6.419) (6.352) (6.475)
Observations 2,488 2,550 2,488
R2 0.682 0.682 0.683
Adjusted R2 0.681 0.681 0.681
Residual Std. Error 1.190 (df = 2478) 1.196 (df = 2540) 1.190 (df = 2477)
F Statistic 591.627*** (df = 9; 2478) 605.829*** (df = 9; 2540) 532.921*** (df = 10; 2477)
Note: p<0.1; p<0.05; p<0.01

Answer

m5 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = sub)

m6 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = sub)

m7 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = sub)

stargazer(m5,m6,m7,type="html",out = paste0(path,"/regressions2.doc"),
          covariate.labels = c("Export Diversification","Liberalization","
                               Export Dependence","Democracy","Conflict","
                               Log Income","Growth","Log Population",
                               "Linear Time Trend","Lagged DV"),
          column.labels = c("(4)","(5)","(6)"),
          model.numbers = F,
          title="Replication of Peterson (2017) Models 4-6",
          dep.var.caption = "Excluding high income states",
          dep.var.labels = "Human Rights")

Exercise 3

Description

One might wonder whether the results are robust to the inclusion of region or state fixed effects. These can be entered into the regression equation by including, for example, as.factor(ccode) to the regression equation. Fixed effects are generally omitted from regression tables; figure out how to exclude regressors from stargazer output and add an additional note to the table indicating that fixed effects were included. You should also remove the R-squared, Residual Standard Error, and F Statistic from being reported. The output should look something like this:

Robustness of Peterson (2017) Models 1-3
All States
Human Rights
(1a) (2a) (3a)
Export Diversification 0.190 0.220 0.248
(0.259) (0.257) (0.260)
Liberalization 21.905 -9.210
(17.509) (20.824)
Export Dependence 0.587*** 0.633***
(0.191) (0.230)
Democracy 0.037*** 0.039*** 0.039***
(0.006) (0.006) (0.006)
Conflict -0.520*** -0.495*** -0.508***
(0.077) (0.075) (0.077)
Log Income 0.029 0.086 0.077
(0.091) (0.090) (0.092)
Growth 0.016 -0.039 0.032
(0.188) (0.186) (0.188)
Log Population -0.324 -0.244 -0.267
(0.237) (0.234) (0.238)
Linear Time Trend -0.017*** -0.022*** -0.022***
(0.005) (0.006) (0.006)
Lagged DV
Constant 44.060*** 50.103*** 50.438***
(7.467) (7.665) (7.811)
Observations 3,479 3,562 3,479
Adjusted R2 0.775 0.779 0.776
Note: p<0.1; p<0.05; p<0.01
All models include country fixed effects.

Answer

m7 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint + as.factor(ccode), data = out2)

m8 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint + as.factor(ccode), data = out2)

m9 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint + as.factor(ccode), data = out2)

stargazer(m7,m8,m9,type="html",out = paste0(path,"/regressions3.doc"),
          covariate.labels = c("Export Diversification","Liberalization","
                               Export Dependence","Democracy","Conflict","
                               Log Income","Growth","Log Population",
                               "Linear Time Trend","Lagged DV"),
          title="Robustness of Peterson (2017) Models 1-3",
          omit=11:length(m9$coefficients),
          notes = "All models include country fixed effects.",
          notes.append=TRUE,
          column.labels = c("(1a)","(2a)","(3a)"),
          model.numbers = F,
          dep.var.caption = "All States",
          dep.var.labels = "Human Rights",
          omit.stat = c("f","rsq","ser"))

Final Words

That’s all I have for you today. Hope it was helpful! Feel free to reach out to NYU Data Services if you encounter any issues with programming in R or would like to be directed to any additional resources. Over the past two sessions we have only touched the surface of what R is able to do thanks to the vibrant community of scholars and practitioners who contribute to the project. If you are looking to get further into R for statistical analysis, I highly recommend checking out what is available on the various taskviews or even these neat swirl courses where you can learn R in R.

---
title: 'AQR Proseminar: Wrangling, Graphics, and Typesetting in R'
author: "Christopher Schwarz"
date: "3/23/2020"
pages:
  extra: true
output: 
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
    code_download: true
---

This document is a quick introduction to data wrangling, graphics, and typesetting in R.  You can download the code by clicking the "Code" button to the upper right.

First thing is first; let's install/load all of the packages that we will be using, and clear out the environment.

```{r message=FALSE, warning=FALSE}
packages <- c("haven","dplyr","ggplot2","countrycode","tidyr","gridExtra","grid",
              "stargazer")
for(i in packages){
  if(!require(i,character.only = T, quietly = T)){
    install.packages(i)
    library(i, character.only = T, quietly = T)
  }
}

rm(list=ls())
```


We will be using data from Peterson (2017): Export Diversity and Human Rights.  You can download the replication archive by clicking [here](https://sites.google.com/site/timothympetersonusc/JCR%20-%20export%20diversity%20-%20replication%20data.zip) or download the data directly by running the following chunk.

```{r}
d <- read_dta("https://bit.ly/3a9nUSL")
d
```


# Data Wrangling

## Selecting Cases and Subsetting

As with most things in R there are multiple ways of accomplishing the same goal.  Being able to switch between approaches can help you accomplish more complicated tasks.  I generally stick to base R for subsetting.  To get indices which satisfy logical statements you can use the `which` function 
```{r}
which(d$gdppc > 50000)
```

```{r}
which(d$gdppc < 2000 & d$polity2 == 10)
```

```{r}
which(d$gdpgrowth < -.5 | d$gdppc < 100)
```

We can combine this with indexing to subset down the data.  We can also call columns in a variety of ways.  Remember that you can create objects carying this information to modularize your code, which might be helpful in particular situations to keep everything clear.

```{r}
inds <- which(d$gdppc > 50000)
cols <- c("ccode","year","physint","lnpop")

sub1 <- d[inds,cols]
sub1
```

Alternatively, one could use the `subset` function from base R to get the same result.

```{r}
sub2 <- subset(d,d$gdppc > 50000,cols)
sub2
```
```{r}
identical(sub1,sub2)
```

Another great option is the `dplyr` package, which is part of the [tidyverse](https://www.tidyverse.org/) alongside the packages `haven` and `ggplot2`.  One of the best things about the tidyverse family of packages is that they are very well documented, including [a](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) [variety](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) [of](http://www.flutterbys.com.au/stats/downloads/slides/figure/factors.pdf) [cheatsheets](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_purrr.pdf) [and](https://ggplot2-book.org/) [books](https://r4ds.had.co.nz/).  One thing that makes the wrangling tools particularly powerful is that they leverage a pipe (`%>%`) from the `magrittr` package which says,in pseudo-code, that `x %>% f(y)` is the same as `f(x,y)`.  This can create nice work flow.  For example, to get the same subset yet again:
```{r}
d %>% filter(gdppc > 50000) %>% select(all_of(cols)) -> sub3

all(sub1 == sub3, na.rm=T)
```

Neat.  What we did was took our dataframe, `filter`ed the rows that we wanted, and then `select`ed the columns of interest.

If we want to `sort` data, there is a base R approach for vectors.

```{r}
head(sort(d$gdpgrowth))
```

For dataframes you have to use `order`, which produces index numbers that can be used as before
```{r}
d[order(d$gdpgrowth),c("ccode","year","gdpgrowth")]
```
We can also switch the ordering around by setting `decreasing = T`
```{r}
d[order(d$gdpgrowth,decreasing = T),c("ccode","year","gdpgrowth")]
```
Or we could use the handy `%>%`.  In this case we have to use the placeholder `.` for the input, which might be handy to know that you can do for more complicated functions.
```{r}
order(d$gdpgrowth,decreasing = T) %>% d[.,c("ccode","year","gdpgrowth")]
```

There is also a handy `arrange` function in `dplyr` for accomplishing the same sort of task but allowing to sort based on multiple columns.

```{r}
d %>% select(c("ccode","year","gdpgrowth")) %>% arrange(ccode,gdpgrowth)
```

## Converting between Long and Wide

Sometimes data comes in "Long" format, like the data we have been working with, where each row represents an observational unit (here country - year).  In other cases data will come in "Wide" format like [this](https://data.imf.org/regular.aspx?key=60991467) or [this](https://en.wikipedia.org/wiki/United_States_Electoral_College#Chronological_table)

We might want to convert data back and forth depending on what we want to do.  Let's do this for the GDP per capita variable; see NYU Data Services for a handy [guide to reshaping](https://guides.nyu.edu/quant/reshape).

```{r}
long <- d[,c("ccode","year","gdppc")]
long
```

Long to Wide with tidyr (updated versions of `gather` and `spread`)
```{r}
long %>% pivot_wider(names_from = year, values_from = gdppc) -> wide
wide
```
Back to long
```{r}
wide %>% pivot_longer(cols=-ccode,names_to = "year", values_to = "gdppc") -> long_again
long_again
```

## Merging Datasets

You may have noticed that the ccode variable isn't particularly descriptive for which country it means.  We can use this to illustrate merging datasets together.  At the start we loaded in the `countrycode` package which contains additional information.

```{r}
codes <- countrycode::codelist_panel
```

Let's see what they have.
```{r}
colnames(codes)
```

The country codes we are currently using are `cown`.  Let's grab `iso3c` and `region` to add to the dataset.  We also know that the dataset we are working with only has years from 1981 to 2010, so let's practice our subsetting skillz
```{r}
codes <- codes[codes$year %in% 1981:2010,c("cown","year","iso3c","country.name.en","region")]
```

One thing to pay attention to is losing or gaining observations durring a merge.  For a great overview, check out this handy NYU Data Services [guide](https://guides.nyu.edu/quant/merge).
```{r}
nrow(d)
```

```{r}
out1 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"))
nrow(out1)
```
```{r}
out2 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.x=T)
nrow(out2)
```
```{r}
out3 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all.y=T)
nrow(out3)
```
```{r}
out4 <- merge(d,codes,by.x=c("ccode","year"),by.y=c("cown","year"),all=T)
nrow(out4)
```

And, of course, we can do the same merges using dplyr with `inner_join`, `left_join`, `right_join`, and `full_join` respectively.  Going forward we will keep `out2` as the working dataset.

## Aggregates and Summaries

There are a number of great things you can do with the `apply` family of functions, including easily going in parallel with the `pbapply` package.  If you are interested in more details on this you should check out [this tutorial](https://www.datacamp.com/community/tutorials/r-tutorial-apply-family) and [this taskview](https://www.r-pkg.org/ctv/HighPerformanceComputing).  We will focus on using `dplyr` to calculate summaries of interest.

One reason for this is that it is super easy to calculate summaries grouping on another variable.  For example, if we wanted to think about regional variation in `gdppc` we could

```{r}
out2 %>% group_by(region) %>% 
  summarize(mean=mean(gdppc,na.rm=T),sd=sd(gdppc,na.rm=T),sum=sum(gdppc,na.rm=T))
```

We can also use the `mutate` function to add this information to our dataframe.  In base R this would take `merge`ing the output of `aggregate`, so it can certainly be done, but `dplyr` makes it somewhat more straightforward and scaleable.
```{r}
out2 %>% group_by(region) %>% 
  mutate(mean_gdppc=mean(gdppc,na.rm=T),sd_gdppc=sd(gdppc,na.rm=T)) -> out2

out2
```

A base R version of the above might be
```{r}
a1 <- aggregate(out2$gdppc,by=list(out2$region),mean,na.rm=T)
a1

colnames(a1) <- c("region","mean_gdppc")
a2 <- aggregate(out2$gdppc,by=list(out2$region),sd,na.rm=T)
colnames(a2) <- c("region","sd_gdppc")

t1 <- merge(out2,a1,by="region")
t2 <- merge(t1,a2,by="region")
tbl_df(t2)
```
but the `dplyr` approach really is quite nice.

# Graphics

We will focus on using `ggplot2` for graphics in R, although base R has nice capabilities on its own.  `ggplot` is all about the `grammar of graphics' which follows a layered approach to describe and construct graphics in a structured manner.  To begin, we will always initialize a plot:

```{r}
p1 <- ggplot(out2[which(out2$region == "Northern America"),], aes(x=log(gdppc)))
```

To get different plots, we will add layers.  For example, if we wanted a dot plot
```{r}
p1 + geom_dotplot(binwidth=0.03)
```
or a histogram
```{r}
p1 + geom_histogram(binwidth=0.03)
```
or a density plot
```{r}
p1 + geom_density()
```
we can just add a different layer to the same underlying plot.

The order of the layers does not matter, and there are a bunch more customizations that we can add.

```{r}
p1 + geom_histogram(color="black",fill="darkblue",binwidth = 0.03) +
     xlab("Natural Log of Per Capita GDP") +
     ylab("Frequency") +
     ggtitle('North American GDPPC') +
     theme_bw()  -> g1
g1
```

You can also add multiple geometries to the same underderlying plot.

```{r}
p2 <- ggplot(out2[which(out2$region == "Eastern Europe"),],aes(x=year,y=log(gdppc),color=iso3c))
p2  + geom_point(na.rm=T) + 
      geom_line(na.rm=T) +
      labs(color="Country") +
      scale_color_brewer(palette="Spectral")  -> g2
g2
```

You can even add some smoothers if you want.

```{r}
p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
     geom_smooth(color ="gray", method = "lm", se = TRUE,na.rm=T, formula=y~x)

```

```{r warnings=FALSE}
p3 <- ggplot(out3[which(out3$iso3c=="RUS"),],aes(x=year,y=gdppc))
p3 + geom_point(na.rm=T) +
     geom_smooth(color ="gray", method = "loess", se = TRUE,formula=y~x, na.rm=T) -> g3
g3
```

Two last notes on plots -- faceting and adding plots together into a larger image.


Faceting can be a nice way to break up a continuous variable by category.
```{r}
p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Western Europe","Western Africa","Eastern Europe")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) + 
     facet_grid(region ~ .)
```
```{r}
p4 <- ggplot(na.omit(out2[which(out2$region %in% c("Western Europe","Western Africa","Eastern Europe")),]),aes(x=log(gdppc)))
p4 + geom_histogram(binwidth = 0.1) + 
     facet_grid(. ~ region)
```

Once we do all that, we might want to add multiple plots together into a larger multi-panel graphic.  The `gridExtra` package is great for this.

```{r}
grid.arrange(g1,g2,g3,textGrob("Spiffy!"),ncol=2,nrow=2)
```

You should take a look at the `ggplot2` book linked to above, as well as other stuff like `ggpubr`, some examples can be found [here](http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/).

# Typesetting

In R it is relatively easy to create tables that are presentable in Word, LaTex, or HTML.  A large number of packages -- `apsrtable`, `xtable`, `texreg`, `memisc`, `outreg` and others -- cater to LaTex users.  If you're using word to prepare your documents, `stargazer` is an option that can work nicely.  One thing we might want to do from the above is greate a table of descriptive statistics.

```{r results=FALSE}
vars <- c("polity2","physint","conflictonlocation","lnpop")
path <- "C:/Users/Schwarz/Dropbox/AQR"

stargazer(as.data.frame(out2[,vars]),type="html",
          title="Descriptive Statistics",
          covariate.labels = c("Democracy","Human Rights","Any Conflict?",
                               "Population (Logged)"),
          out=paste0(path,"/summary1.doc"))

```

As you might have gathered from the code above, what we are going to do is use the ability of Word to render html by saving the output with a .doc extension.  

Otherwise, the table looks like this:

<center>
<table style="text-align:center"><caption><strong>Descriptive Statistics</strong></caption>
<tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Pctl(25)</td><td>Pctl(75)</td><td>Max</td></tr>
<tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Democracy</td><td>4,380</td><td>1.746</td><td>7.203</td><td>-10.000</td><td>-6.000</td><td>9.000</td><td>10.000</td></tr>
<tr><td style="text-align:left">Human Rights</td><td>4,487</td><td>4.838</td><td>2.322</td><td>0.000</td><td>3.000</td><td>7.000</td><td>8.000</td></tr>
<tr><td style="text-align:left">Any Conflict?</td><td>5,188</td><td>0.171</td><td>0.377</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><td style="text-align:left">Population (Logged)</td><td>4,457</td><td>15.674</td><td>1.925</td><td>10.615</td><td>14.671</td><td>16.902</td><td>21.000</td></tr>
<tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr></table>
</center>

If you're interested in what other summary statistics are available natively from `stargazer`, check out the link in `omit.summary.stat` in their documentation

```{r eval=FALSE}
?stargazer
```

Something a bit more interesting would be to take a look at Regression tables.  First, let's run some models.  To replicate the Peterson (2017) results, we need to define a function for leading a variable.

```{r}
tscslead <- function(x, cs,ts){ 
  obs <- 1:length(x) 
  lagobs <- match(paste(cs, ts+1, sep="::"), paste(cs, ts, sep="::")) 
  lagx <- x[lagobs]
  lagx
}
```

The next thing that we will do is create the dependent variable with a one period lead
```{r}
out2$physint_lead <- tscslead(out2$physint, out2$ccode, out2$year)
```

And then we will replicate models 1-3
```{r results=FALSE}
m1 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
              + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)

m2 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
              + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)

m3 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
              + lngdppc + gdpgrowth + lnpop + year + physint, data = out2)

stargazer(m1,m2,m3,type="html",out = paste0(path,"/regressions.doc"),
          covariate.labels = c("Export Diversification","Liberalization","
                               Export Dependence","Democracy","Conflict","
                               Log Income","Growth","Log Population",
                               "Linear Time Trend","Lagged DV"),
          title="Replication of Peterson (2017) Models 1-3",
          dep.var.caption = "All States",
          dep.var.labels = "Human Rights")
```

The output looks like this:

<center>
<table style="text-align:center"><caption><strong>Replication of Peterson (2017) Models 1-3</strong></caption>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3">All States</td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td colspan="3">Human Rights</td></tr>
<tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Export Diversification</td><td>0.276<sup>**</sup></td><td>0.242<sup>**</sup></td><td>0.226<sup>**</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.114)</td><td>(0.113)</td><td>(0.114)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Liberalization</td><td>-10.638</td><td></td><td>-57.858<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(8.896)</td><td></td><td>(13.994)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Export Dependence</td><td></td><td>0.202<sup>**</sup></td><td>0.707<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td></td><td>(0.101)</td><td>(0.162)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Democracy</td><td>0.026<sup>***</sup></td><td>0.026<sup>***</sup></td><td>0.027<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.004)</td><td>(0.004)</td><td>(0.004)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Conflict</td><td>-0.583<sup>***</sup></td><td>-0.555<sup>***</sup></td><td>-0.611<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.062)</td><td>(0.061)</td><td>(0.062)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Income</td><td>0.126<sup>***</sup></td><td>0.118<sup>***</sup></td><td>0.131<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.023)</td><td>(0.022)</td><td>(0.023)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Growth</td><td>-0.154</td><td>-0.207</td><td>-0.140</td></tr>
<tr><td style="text-align:left"></td><td>(0.193)</td><td>(0.191)</td><td>(0.192)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Population</td><td>-0.190<sup>***</sup></td><td>-0.174<sup>***</sup></td><td>-0.226<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.020)</td><td>(0.018)</td><td>(0.022)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Linear Time Trend</td><td>-0.014<sup>***</sup></td><td>-0.014<sup>***</sup></td><td>-0.016<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.002)</td><td>(0.002)</td><td>(0.002)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Lagged DV</td><td>0.675<sup>***</sup></td><td>0.679<sup>***</sup></td><td>0.666<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.013)</td><td>(0.012)</td><td>(0.013)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Constant</td><td>31.817<sup>***</sup></td><td>31.360<sup>***</sup></td><td>35.327<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(4.914)</td><td>(4.876)</td><td>(4.967)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>3,479</td><td>3,562</td><td>3,479</td></tr>
<tr><td style="text-align:left">R<sup>2</sup></td><td>0.751</td><td>0.754</td><td>0.753</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.751</td><td>0.753</td><td>0.752</td></tr>
<tr><td style="text-align:left">Residual Std. Error</td><td>1.118 (df = 3469)</td><td>1.121 (df = 3552)</td><td>1.115 (df = 3468)</td></tr>
<tr><td style="text-align:left">F Statistic</td><td>1,164.520<sup>***</sup> (df = 9; 3469)</td><td>1,208.221<sup>***</sup> (df = 9; 3552)</td><td>1,055.422<sup>***</sup> (df = 10; 3468)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>
</center>

Stargazer has a number of useful "style" options which format close to a number of major journals, which may be worth looking at or playing around with.

# Exercises

The best way to learn R is to use R.  Let's play around with some of the above concepts and coding approaches to replicate a number of other aspects of the paper and 'probe robustness' to deviations in specification.

## Exercise 1 {.tabset}
### Description
Figure 1 of Peterson (2017) shows two histograms of Export Diversity (`twoway`), one for all states and the other excluding high states (those with a GDP Per Capita larger than 11101.2).  Reproduce this image as closely as possible; this will take digging into the documentation or Googling to accomplish particular customizations!  The final output should look something like this:

<center>
![Exercise 1 Desired Output](C:/Users/Schwarz/Dropbox/AQR/Exercise1.jpg)
</center>

### Answer
```{r results=FALSE}
sub <- out2[which(out2$gdppc < 11101.2),]

out2$all_states <- "All States"
sub$exc <- "Excluding high-income states"

out2 %>% ggplot(aes(x=twoway)) + 
  geom_histogram(binwidth=0.03,col="black") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5),
        strip.background = element_rect(fill="grey"),
        strip.text = element_text(size=13)) +
  xlab("Export diversity") +
  ylab("Frequency") +
  facet_grid(.~all_states)-> f1
sub %>% ggplot(aes(x=twoway)) + 
  geom_histogram(binwidth=0.03,col="black") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5),
        strip.background = element_rect(fill="grey"),
        strip.text = element_text(size=13)) +
  xlab("Export diversity") +
  ylab("Frequency") +
  facet_grid(.~exc) -> f2

jpeg(paste0(path,"/Exercise1.jpg"))

grid.arrange(f1,f2,ncol=2,nrow=1, 
             bottom = textGrob("Figure 1. Histogram of export diversity for all states (left) and excluding high-income states (right).", gp=gpar(fontsize=10)))
grid.rect(width = 1, height = 0.96, vjust=0.485 , gp = gpar(lwd = 2, col = "black", fill = NA))

dev.off()
```

## Exercise 2 {.tabset}

### Description
Models 4-6 replicate use the same specifications as above but excluding high income states.  Run these regressions and format the results to get the following:

<center>
<table style="text-align:center"><caption><strong>Replication of Peterson (2017) Models 4-6</strong></caption>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3">Excluding high income states</td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td colspan="3">Human Rights</td></tr>
<tr><td style="text-align:left"></td><td>(4)</td><td>(5)</td><td>(6)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Export Diversification</td><td>0.472<sup>***</sup></td><td>0.471<sup>***</sup></td><td>0.450<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.141)</td><td>(0.139)</td><td>(0.141)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Liberalization</td><td>-19.891<sup>*</sup></td><td></td><td>-44.903<sup>**</sup></td></tr>
<tr><td style="text-align:left"></td><td>(11.636)</td><td></td><td>(20.707)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Export Dependence</td><td></td><td>-0.077</td><td>0.333</td></tr>
<tr><td style="text-align:left"></td><td></td><td>(0.126)</td><td>(0.228)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Democracy</td><td>0.024<sup>***</sup></td><td>0.023<sup>***</sup></td><td>0.025<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.004)</td><td>(0.004)</td><td>(0.004)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Conflict</td><td>-0.598<sup>***</sup></td><td>-0.578<sup>***</sup></td><td>-0.597<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.072)</td><td>(0.071)</td><td>(0.072)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Income</td><td>-0.030</td><td>-0.038</td><td>-0.017</td></tr>
<tr><td style="text-align:left"></td><td>(0.034)</td><td>(0.034)</td><td>(0.035)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Growth</td><td>-0.066</td><td>-0.141</td><td>-0.060</td></tr>
<tr><td style="text-align:left"></td><td>(0.232)</td><td>(0.229)</td><td>(0.232)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Population</td><td>-0.290<sup>***</sup></td><td>-0.273<sup>***</sup></td><td>-0.307<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.027)</td><td>(0.024)</td><td>(0.029)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Linear Time Trend</td><td>-0.015<sup>***</sup></td><td>-0.014<sup>***</sup></td><td>-0.015<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.003)</td><td>(0.003)</td><td>(0.003)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Lagged DV</td><td>0.622<sup>***</sup></td><td>0.628<sup>***</sup></td><td>0.621<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.016)</td><td>(0.016)</td><td>(0.016)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Constant</td><td>36.093<sup>***</sup></td><td>34.344<sup>***</sup></td><td>37.349<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(6.419)</td><td>(6.352)</td><td>(6.475)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>2,488</td><td>2,550</td><td>2,488</td></tr>
<tr><td style="text-align:left">R<sup>2</sup></td><td>0.682</td><td>0.682</td><td>0.683</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.681</td><td>0.681</td><td>0.681</td></tr>
<tr><td style="text-align:left">Residual Std. Error</td><td>1.190 (df = 2478)</td><td>1.196 (df = 2540)</td><td>1.190 (df = 2477)</td></tr>
<tr><td style="text-align:left">F Statistic</td><td>591.627<sup>***</sup> (df = 9; 2478)</td><td>605.829<sup>***</sup> (df = 9; 2540)</td><td>532.921<sup>***</sup> (df = 10; 2477)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>
</center>

### Answer
```{r results=FALSE}
m5 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = sub)

m6 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = sub)

m7 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint, data = sub)

stargazer(m5,m6,m7,type="html",out = paste0(path,"/regressions2.doc"),
          covariate.labels = c("Export Diversification","Liberalization","
                               Export Dependence","Democracy","Conflict","
                               Log Income","Growth","Log Population",
                               "Linear Time Trend","Lagged DV"),
          column.labels = c("(4)","(5)","(6)"),
          model.numbers = F,
          title="Replication of Peterson (2017) Models 4-6",
          dep.var.caption = "Excluding high income states",
          dep.var.labels = "Human Rights")
```

## Exercise 3 {.tabset}

### Description
One might wonder whether the results are robust to the inclusion of region or state fixed effects.  These can be entered into the regression equation by including, for example, `as.factor(ccode)` to the regression equation.  Fixed effects are generally omitted from regression tables; figure out how to exclude regressors from `stargazer` output and add an additional note to the table indicating that fixed effects were included.  You should also remove the R-squared, Residual Standard Error, and F Statistic from being reported.  The output should look something like this:

<center>
<table style="text-align:center"><caption><strong>Robustness of Peterson (2017) Models 1-3</strong></caption>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3">All States</td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td colspan="3">Human Rights</td></tr>
<tr><td style="text-align:left"></td><td>(1a)</td><td>(2a)</td><td>(3a)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Export Diversification</td><td>0.190</td><td>0.220</td><td>0.248</td></tr>
<tr><td style="text-align:left"></td><td>(0.259)</td><td>(0.257)</td><td>(0.260)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Liberalization</td><td>21.905</td><td></td><td>-9.210</td></tr>
<tr><td style="text-align:left"></td><td>(17.509)</td><td></td><td>(20.824)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Export Dependence</td><td></td><td>0.587<sup>***</sup></td><td>0.633<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td></td><td>(0.191)</td><td>(0.230)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Democracy</td><td>0.037<sup>***</sup></td><td>0.039<sup>***</sup></td><td>0.039<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.006)</td><td>(0.006)</td><td>(0.006)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Conflict</td><td>-0.520<sup>***</sup></td><td>-0.495<sup>***</sup></td><td>-0.508<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.077)</td><td>(0.075)</td><td>(0.077)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Income</td><td>0.029</td><td>0.086</td><td>0.077</td></tr>
<tr><td style="text-align:left"></td><td>(0.091)</td><td>(0.090)</td><td>(0.092)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Growth</td><td>0.016</td><td>-0.039</td><td>0.032</td></tr>
<tr><td style="text-align:left"></td><td>(0.188)</td><td>(0.186)</td><td>(0.188)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Log Population</td><td>-0.324</td><td>-0.244</td><td>-0.267</td></tr>
<tr><td style="text-align:left"></td><td>(0.237)</td><td>(0.234)</td><td>(0.238)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Linear Time Trend</td><td>-0.017<sup>***</sup></td><td>-0.022<sup>***</sup></td><td>-0.022<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.005)</td><td>(0.006)</td><td>(0.006)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Lagged DV</td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Constant</td><td>44.060<sup>***</sup></td><td>50.103<sup>***</sup></td><td>50.438<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(7.467)</td><td>(7.665)</td><td>(7.811)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>3,479</td><td>3,562</td><td>3,479</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.775</td><td>0.779</td><td>0.776</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
<tr><td style="text-align:left"></td><td colspan="3" style="text-align:right">All models include country fixed effects.</td></tr>
</table>
</center>

### Answer
```{r results=FALSE}
m7 <- lm(physint_lead ~ twoway +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint + as.factor(ccode), data = out2)

m8 <- lm(physint_lead ~ twoway + expdep + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint + as.factor(ccode), data = out2)

m9 <- lm(physint_lead ~ twoway + expdep +lib_HK + polity2 + conflictonlocation
         + lngdppc + gdpgrowth + lnpop + year + physint + as.factor(ccode), data = out2)

stargazer(m7,m8,m9,type="html",out = paste0(path,"/regressions3.doc"),
          covariate.labels = c("Export Diversification","Liberalization","
                               Export Dependence","Democracy","Conflict","
                               Log Income","Growth","Log Population",
                               "Linear Time Trend","Lagged DV"),
          title="Robustness of Peterson (2017) Models 1-3",
          omit=11:length(m9$coefficients),
          notes = "All models include country fixed effects.",
          notes.append=TRUE,
          column.labels = c("(1a)","(2a)","(3a)"),
          model.numbers = F,
          dep.var.caption = "All States",
          dep.var.labels = "Human Rights",
          omit.stat = c("f","rsq","ser"))
```

# Final Words

That's all I have for you today.  Hope it was helpful!  Feel free to reach out to NYU Data Services if you encounter any issues with programming in R or would like to be directed to any additional resources.  Over the past two sessions we have only touched the surface of what R is able to do thanks to the vibrant community of scholars and practitioners who contribute to the project.  If you are looking to get further into R for statistical analysis, I highly recommend checking out what is available on the various [taskviews](https://cran.r-project.org/web/views/) or even these neat [swirl courses](https://swirlstats.com/) where you can learn R in R.




