Good tables and figures are an essential component of every paper. In general, I prefer figures to tables, especially for presenting regression coefficients. I find that most economists use tables too often for my liking, both in papers and presentations. In this document, we’ll focus mostly on how to make plots in R with the ggplot2 package and how to integrate them into a document. We’ll use other data-wrangling tools from the tidyverse where necessary. We’ll do a little bit on tables at the end.

Other resources, some of which I’ve drawn upon for this workshop:

Why ggplot2?

ggplot2 is based on the grammar of graphics, a phrase coined by Lee Wilkinson. The “grammer” means there is an underlying logic to building a plot. This makes it easier to understand the code used to build a plot, and a lot less effort to memorize arbitrary commands.

ggplot2 also comes with nice defaults. This means it takes a lot less work get your graphic to a stage where you’re satisfied. It’s also just a lot more fun to be able to make nice graphs without too much effort. If you want some tips on taking a ggplot2 figure to the next level, check out this tutorial

Basics

Getting started

We’ll start with the gapminder dataset, which you can get by installing the gapminder package. Make sure you have the tidyverse packages too.

install.packages("tidyverse")
install.packages("gapminder")

Take a quick peek at the dataset

library(tidyverse)
library(gapminder)
gapminder
## # A tibble: 1,704 x 6
##        country continent  year lifeExp      pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan      Asia  1952  28.801  8425333  779.4453
##  2 Afghanistan      Asia  1957  30.332  9240934  820.8530
##  3 Afghanistan      Asia  1962  31.997 10267083  853.1007
##  4 Afghanistan      Asia  1967  34.020 11537966  836.1971
##  5 Afghanistan      Asia  1972  36.088 13079460  739.9811
##  6 Afghanistan      Asia  1977  38.438 14880372  786.1134
##  7 Afghanistan      Asia  1982  39.854 12881816  978.0114
##  8 Afghanistan      Asia  1987  40.822 13867957  852.3959
##  9 Afghanistan      Asia  1992  41.674 16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414
## # ... with 1,694 more rows

Creating a plot

Try creating your first plot with this code.

ggplot(gapminder)

Ha! Sorry. Try this instead.

ggplot(data = filter(gapminder, year == 1997), 
       mapping = aes(x = lifeExp, y = gdpPercap)) + 
  geom_point() 

ggplot() creates a coordinate system that you can add layers to. To create a graph, you need to specify your

  • data
  • mapping - this defines how variables in your dataset are mapped to visual properties (in this case, the x and y axes)

Points

  • Increase the size
ggplot(data = filter(gapminder, year == 1997), 
       mapping = aes(x = lifeExp, y = gdpPercap)) + 
  geom_point(size = 5) + 
  ggtitle("For dramatic effect")

  • Different colours by continent
ggplot(data = filter(gapminder, year == 1997), 
       mapping = aes(x = lifeExp, y = gdpPercap, colour = continent)) + 
  geom_point() 

Try this:

ggplot(data = filter(gapminder, year == 1997), 
       mapping = aes(x = lifeExp, y = gdpPercap)) + 
  geom_point(color = "red") 

and this:

ggplot(data = filter(gapminder, year == 1997), 
       mapping = aes(x = lifeExp, y = gdpPercap, color = "blue")) + 
  geom_point() 

Point aesthetics

Different geoms understand different aesthetics. Type ?geom_point in your console and scroll down to the Aesthetics section. The help document says that geom_point understands the shape aesthetic, so we can differentiate continents by shapes and colours. Shapes are helpful to accommodate black-and-white versions of your paper.

ggplot(data = filter(gapminder, year == 1997), 
       mapping = aes(x = lifeExp, y = gdpPercap, shape = continent, color = continent)) + 
  geom_point() 

Lines

Let’s look at average worldwide life expectancy over time. We’ll calculate mean life expectancy each year and then 95% confidence intervals.

lifexp = gapminder %>% 
  group_by(year) %>% 
  summarize(lifexp_mean = mean(lifeExp), lifexp_se = sd(lifeExp)/sqrt(n()))

lifexp = mutate(lifexp, 
                     lifexp_upperci = lifexp_mean + 1.96*lifexp_se, 
                     lifexp_lowerci = lifexp_mean - 1.96*lifexp_se)

lifexp
## # A tibble: 12 x 5
##     year lifexp_mean lifexp_se lifexp_upperci lifexp_lowerci
##    <int>       <dbl>     <dbl>          <dbl>          <dbl>
##  1  1952    49.05762 1.0259794       51.06854       47.04670
##  2  1957    51.50740 1.0264267       53.51920       49.49560
##  3  1962    53.60925 1.0151782       55.59900       51.61950
##  4  1967    55.67829 0.9834247       57.60580       53.75078
##  5  1972    57.64739 0.9551523       59.51948       55.77529
##  6  1977    59.57016 0.9421682       61.41681       57.72351
##  7  1982    61.53320 0.9038502       63.30474       59.76165
##  8  1987    63.21261 0.8858638       64.94891       61.47632
##  9  1992    64.16034 0.9421808       66.00701       62.31366
## 10  1997    65.01468 0.9700466       66.91597       63.11338
## 11  2002    65.69492 1.0304998       67.71470       63.67514
## 12  2007    67.00742 1.0131454       68.99319       65.02166
ggplot(lifexp, aes(year, lifexp_mean)) + 
  geom_line() 

You’ll often want to show the confidence intervals e.g. if you have regression coefficients from an event study.

ggplot(lifexp, aes(year, lifexp_mean)) + 
  geom_line() + 
  geom_ribbon(aes(ymin = lifexp_lowerci, ymax = lifexp_upperci), alpha = 1/10, fill = "blue")

You can also do errorbars.

ggplot(lifexp, aes(year, lifexp_mean)) + 
  geom_line() + 
  geom_errorbar(aes(ymax = lifexp_upperci, ymin = lifexp_lowerci), linetype = 2)

You can also plot the confidence intervals as lines. I am working on a geom that does this (and I’m really close!), but for now you have this workaround.

ggplot(lifexp, aes(year, lifexp_mean)) + 
  geom_line() + 
  geom_line(aes(year, lifexp_upperci), linetype = 2, color = "red") + 
  geom_line(aes(year, lifexp_lowerci), linetype = 2, color = "red")

Bars

Let’s use the nycflights13 package. Which New York airport had the most number of flights originating from it?

# install.packages(nycflights13)
library(nycflights13)

ggplot(flights) + geom_bar(aes(origin))

Notice that the counting has been done for you. If you want to display values that are already in the data, use geom_col.

origins = count(flights, origin)

origin_plot = ggplot(origins, aes(origin, n)) + geom_col()
origin_plot

# Same thing
# ggplot(origins, aes(origin, n)) + 
#   geom_bar(stat = "identity")

You might want to flip the bars.

origins %>% 
  mutate(origin = reorder(origin, n)) %>% 
  ggplot(aes(origin, n)) + geom_col() + coord_flip()

“3D” plots

ggplot does not have 3D plots, but has geoms for visualizing 3D surfaces in 2D. I often find 2D surfaces to be more interpretable than 3D surfaces anyway. I like heat maps, which you can draw with geom_raster.

delays_heatmap = flights %>% 
  group_by(month, day) %>% 
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE), n = n()) %>%  
  ggplot(aes(day, as.factor(month))) + geom_raster(aes(fill = dep_delay)) + 
  ggtitle("Average departure delay from JFK for each month-day of 2013") 

delays_heatmap

Themes

There are many options for deviating from the default grey theme. For papers, I like to use light backgrounds so that color printing is not an issue.

origin_plot + theme_minimal()

If you want to change the theme for all of your images, use theme_set() at the beginning of your code. Copy and paste the code below at the end of the code chunk named libs (where we loaded the packages) and knit this .rmd file.

theme_set(theme_classic())

There are themes that come with ggplot2 and there are also lots other options out there like the ggthemes package.

Tables

There are a number of helpful tools for making nice tables, but as you’ve seen with the gapminder dataset, if you’re just writing an informal report you can just print the object you want to show and it’ll come out pretty neatly. Just to recap:

gapminder
## # A tibble: 1,704 x 6
##        country continent  year lifeExp      pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan      Asia  1952  28.801  8425333  779.4453
##  2 Afghanistan      Asia  1957  30.332  9240934  820.8530
##  3 Afghanistan      Asia  1962  31.997 10267083  853.1007
##  4 Afghanistan      Asia  1967  34.020 11537966  836.1971
##  5 Afghanistan      Asia  1972  36.088 13079460  739.9811
##  6 Afghanistan      Asia  1977  38.438 14880372  786.1134
##  7 Afghanistan      Asia  1982  39.854 12881816  978.0114
##  8 Afghanistan      Asia  1987  40.822 13867957  852.3959
##  9 Afghanistan      Asia  1992  41.674 16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414
## # ... with 1,694 more rows

Another option is kable from the knitr package.

knitr::kable(head(gapminder)) # without head() will print out entire dataframe
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971
Afghanistan Asia 1972 36.088 13079460 739.9811
Afghanistan Asia 1977 38.438 14880372 786.1134

stargazer is probably your best option for making journal-quality tables. One of its main selling points is that can take regression output directly and convert that into your standard regression table. However, it has some idiosycrasies that can occasionally be inconvenient. Notice below how I have to specify results = 'asis' in the chunk options, and type = "html" in the function. The latter means that if I want to knit this document as a pdf instead then I have to change the stargazer options.

```{r, results = 'asis'}
toymodel = lm(lifeExp ~ gdpPercap, data = gapminder)

stargazer::stargazer(toymodel, type = "html", title = "Table with stargazer")
```
toymodel = lm(lifeExp ~ gdpPercap, data = gapminder)

stargazer::stargazer(toymodel, type = "html", title = "Table with stargazer")
Table with stargazer
Dependent variable:
lifeExp
gdpPercap 0.001***
(0.00003)
Constant 53.956***
(0.315)
Observations 1,704
R2 0.341
Adjusted R2 0.340
Residual Std. Error 10.491 (df = 1702)
F Statistic 879.577*** (df = 1; 1702)
Note: p<0.1; p<0.05; p<0.01

stargazer has a ton of options that you can tweak. The cheatsheet from Jake Russ (which you can see below) is super helpful.

This is also a good opportunity to show you the broom package (by David Robinson) for working with regression output. This is one of my favorite packages in R. The tidy function converts your regression output into a data frame, making it extremely easy to work with. In particular, it becomes straightforward to plot your coefficients.

toymodel_df = broom::tidy(toymodel)

ggplot(filter(toymodel_df, term == "gdpPercap"), 
       aes(term, estimate, ymin = estimate - 2*std.error, ymax = estimate + 2*std.error)) + 
  geom_point() + geom_errorbar() + 
  geom_hline(yintercept = 0, linetype = 2) + 
  ggtitle("gdpPercap estimate is statistically significant")

Of course, this isn’t a very helpful graph since there’s only one coefficient, but it’s easy to think of situations where it’d be super helpful e.g. event study with different estimates for each point in time.