Good tables and figures are an essential component of every paper. In general, I prefer figures to tables, especially for presenting regression coefficients. I find that most economists use tables too often for my liking, both in papers and presentations. In this document, we’ll focus mostly on how to make plots in R with the ggplot2 package and how to integrate them into a document. We’ll use other data-wrangling tools from the tidyverse where necessary. We’ll do a little bit on tables at the end.
Other resources, some of which I’ve drawn upon for this workshop:
ggplot2 is based on the grammar of graphics, a phrase coined by Lee Wilkinson. The “grammer” means there is an underlying logic to building a plot. This makes it easier to understand the code used to build a plot, and a lot less effort to memorize arbitrary commands.
ggplot2 also comes with nice defaults. This means it takes a lot less work get your graphic to a stage where you’re satisfied. It’s also just a lot more fun to be able to make nice graphs without too much effort. If you want some tips on taking a ggplot2 figure to the next level, check out this tutorial
ggplot() - main function where you specify the dataset and variables to plotgeoms - geometric objects
geom_point(), geom_bar(), geom_line()aes - aesthetics
qplot(). Stands for “quick plot”, but tricky to add complexityWe’ll start with the gapminder dataset, which you can get by installing the gapminder package. Make sure you have the tidyverse packages too.
install.packages("tidyverse")
install.packages("gapminder")
Take a quick peek at the dataset
library(tidyverse)
library(gapminder)
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
Try creating your first plot with this code.
ggplot(gapminder)
Ha! Sorry. Try this instead.
ggplot(data = filter(gapminder, year == 1997),
mapping = aes(x = lifeExp, y = gdpPercap)) +
geom_point()
ggplot() creates a coordinate system that you can add layers to. To create a graph, you need to specify your
datamapping - this defines how variables in your dataset are mapped to visual properties (in this case, the x and y axes)ggplot(data = filter(gapminder, year == 1997),
mapping = aes(x = lifeExp, y = gdpPercap)) +
geom_point(size = 5) +
ggtitle("For dramatic effect")
ggplot(data = filter(gapminder, year == 1997),
mapping = aes(x = lifeExp, y = gdpPercap, colour = continent)) +
geom_point()
Try this:
ggplot(data = filter(gapminder, year == 1997),
mapping = aes(x = lifeExp, y = gdpPercap)) +
geom_point(color = "red")
and this:
ggplot(data = filter(gapminder, year == 1997),
mapping = aes(x = lifeExp, y = gdpPercap, color = "blue")) +
geom_point()
Different geoms understand different aesthetics. Type ?geom_point in your console and scroll down to the Aesthetics section. The help document says that geom_point understands the shape aesthetic, so we can differentiate continents by shapes and colours. Shapes are helpful to accommodate black-and-white versions of your paper.
ggplot(data = filter(gapminder, year == 1997),
mapping = aes(x = lifeExp, y = gdpPercap, shape = continent, color = continent)) +
geom_point()
Let’s look at average worldwide life expectancy over time. We’ll calculate mean life expectancy each year and then 95% confidence intervals.
lifexp = gapminder %>%
group_by(year) %>%
summarize(lifexp_mean = mean(lifeExp), lifexp_se = sd(lifeExp)/sqrt(n()))
lifexp = mutate(lifexp,
lifexp_upperci = lifexp_mean + 1.96*lifexp_se,
lifexp_lowerci = lifexp_mean - 1.96*lifexp_se)
lifexp
## # A tibble: 12 x 5
## year lifexp_mean lifexp_se lifexp_upperci lifexp_lowerci
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1952 49.05762 1.0259794 51.06854 47.04670
## 2 1957 51.50740 1.0264267 53.51920 49.49560
## 3 1962 53.60925 1.0151782 55.59900 51.61950
## 4 1967 55.67829 0.9834247 57.60580 53.75078
## 5 1972 57.64739 0.9551523 59.51948 55.77529
## 6 1977 59.57016 0.9421682 61.41681 57.72351
## 7 1982 61.53320 0.9038502 63.30474 59.76165
## 8 1987 63.21261 0.8858638 64.94891 61.47632
## 9 1992 64.16034 0.9421808 66.00701 62.31366
## 10 1997 65.01468 0.9700466 66.91597 63.11338
## 11 2002 65.69492 1.0304998 67.71470 63.67514
## 12 2007 67.00742 1.0131454 68.99319 65.02166
ggplot(lifexp, aes(year, lifexp_mean)) +
geom_line()
You’ll often want to show the confidence intervals e.g. if you have regression coefficients from an event study.
ggplot(lifexp, aes(year, lifexp_mean)) +
geom_line() +
geom_ribbon(aes(ymin = lifexp_lowerci, ymax = lifexp_upperci), alpha = 1/10, fill = "blue")
You can also do errorbars.
ggplot(lifexp, aes(year, lifexp_mean)) +
geom_line() +
geom_errorbar(aes(ymax = lifexp_upperci, ymin = lifexp_lowerci), linetype = 2)
You can also plot the confidence intervals as lines. I am working on a geom that does this (and I’m really close!), but for now you have this workaround.
ggplot(lifexp, aes(year, lifexp_mean)) +
geom_line() +
geom_line(aes(year, lifexp_upperci), linetype = 2, color = "red") +
geom_line(aes(year, lifexp_lowerci), linetype = 2, color = "red")
Let’s use the nycflights13 package. Which New York airport had the most number of flights originating from it?
# install.packages(nycflights13)
library(nycflights13)
ggplot(flights) + geom_bar(aes(origin))
Notice that the counting has been done for you. If you want to display values that are already in the data, use geom_col.
origins = count(flights, origin)
origin_plot = ggplot(origins, aes(origin, n)) + geom_col()
origin_plot
# Same thing
# ggplot(origins, aes(origin, n)) +
# geom_bar(stat = "identity")
You might want to flip the bars.
origins %>%
mutate(origin = reorder(origin, n)) %>%
ggplot(aes(origin, n)) + geom_col() + coord_flip()
ggplot does not have 3D plots, but has geoms for visualizing 3D surfaces in 2D. I often find 2D surfaces to be more interpretable than 3D surfaces anyway. I like heat maps, which you can draw with geom_raster.
delays_heatmap = flights %>%
group_by(month, day) %>%
summarise(dep_delay = mean(dep_delay, na.rm = TRUE), n = n()) %>%
ggplot(aes(day, as.factor(month))) + geom_raster(aes(fill = dep_delay)) +
ggtitle("Average departure delay from JFK for each month-day of 2013")
delays_heatmap
There are many options for deviating from the default grey theme. For papers, I like to use light backgrounds so that color printing is not an issue.
origin_plot + theme_minimal()
If you want to change the theme for all of your images, use theme_set() at the beginning of your code. Copy and paste the code below at the end of the code chunk named libs (where we loaded the packages) and knit this .rmd file.
theme_set(theme_classic())
There are themes that come with ggplot2 and there are also lots other options out there like the ggthemes package.
There are a number of helpful tools for making nice tables, but as you’ve seen with the gapminder dataset, if you’re just writing an informal report you can just print the object you want to show and it’ll come out pretty neatly. Just to recap:
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
Another option is kable from the knitr package.
knitr::kable(head(gapminder)) # without head() will print out entire dataframe
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
stargazer is probably your best option for making journal-quality tables. One of its main selling points is that can take regression output directly and convert that into your standard regression table. However, it has some idiosycrasies that can occasionally be inconvenient. Notice below how I have to specify results = 'asis' in the chunk options, and type = "html" in the function. The latter means that if I want to knit this document as a pdf instead then I have to change the stargazer options.
```{r, results = 'asis'}
toymodel = lm(lifeExp ~ gdpPercap, data = gapminder)
stargazer::stargazer(toymodel, type = "html", title = "Table with stargazer")
```
toymodel = lm(lifeExp ~ gdpPercap, data = gapminder)
stargazer::stargazer(toymodel, type = "html", title = "Table with stargazer")
| Dependent variable: | |
| lifeExp | |
| gdpPercap | 0.001*** |
| (0.00003) | |
| Constant | 53.956*** |
| (0.315) | |
| Observations | 1,704 |
| R2 | 0.341 |
| Adjusted R2 | 0.340 |
| Residual Std. Error | 10.491 (df = 1702) |
| F Statistic | 879.577*** (df = 1; 1702) |
| Note: | p<0.1; p<0.05; p<0.01 |
stargazer has a ton of options that you can tweak. The cheatsheet from Jake Russ (which you can see below) is super helpful.
This is also a good opportunity to show you the broom package (by David Robinson) for working with regression output. This is one of my favorite packages in R. The tidy function converts your regression output into a data frame, making it extremely easy to work with. In particular, it becomes straightforward to plot your coefficients.
toymodel_df = broom::tidy(toymodel)
ggplot(filter(toymodel_df, term == "gdpPercap"),
aes(term, estimate, ymin = estimate - 2*std.error, ymax = estimate + 2*std.error)) +
geom_point() + geom_errorbar() +
geom_hline(yintercept = 0, linetype = 2) +
ggtitle("gdpPercap estimate is statistically significant")
Of course, this isn’t a very helpful graph since there’s only one coefficient, but it’s easy to think of situations where it’d be super helpful e.g. event study with different estimates for each point in time.