Teaching Econometrics with R

1/26/2020

Why R?

Can be used for all subfields of metrics, plus machine learning/data science
Easy to jump from R to Stata, Python, Excel, Matlab, etc.
Easier and better-suited-to-econometrics than Python
Free, but well-supported. RStudio very handy
Best-in-biz visualization with ggplot2

Resources

Google: RStudio Cheat Sheets
My website nickchk.com/econometrics.html
My videos nickchk.com/videos.html#Rstats
LOST - a Rosetta Stone of sorts lost-stats.github.io
ESPECIALLY: The swirl package <- extremely good. Get that muscle memory!

install.packages('swirl')
library(swirl)
swirl()

(note that syntax for getting packages!)

How does R work?

Nearly everything you do in R fits into one of three categories:

Create (or overwrite) an object
Manipulate objects using functions (which look like function(input, input, option = input))
Look at objects

Everything is an object - observations, variables, data, functions, regressions, regression tables, etc. etc.
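
A minimal sketch of those three categories in action. The wages vector and the object names are made up for illustration:

```r
# 1. Create (or overwrite) an object
wages <- c(12.4, 18, 22.6, 9)

# 2. Manipulate objects using functions: function(input, option = input)
avg.wage <- mean(wages)
rounded <- round(wages, digits = 0)

# 3. Look at objects by putting them on a line by themselves
avg.wage
rounded

# Functions are objects too
class(mean)
```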

Basics

Let’s look around RStudio
And get familiar with the help() system
And the basic syntax of functions

Data Object Types

Let’s create some objects with <- (note = works too… or -> if you’re feisty), and look at them by putting them on a line by themselves

There are a bunch of data object types. But the main ones to think about…

numeric.var <- 1
character.var <- 'one'
factor.var <- factor(1, labels = 'one')
logical.var <- TRUE

numeric.var

## [1] 1

as.character(numeric.var)

## [1] "1"
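
A quick sketch of the assignment operators and type conversion, in case it helps to see them side by side. The object names here are illustrative, and the factor example is a common gotcha worth showing students early:

```r
# Three ways to assign; all create the same object
x <- 5
y = 5
5 -> z
identical(x, y) && identical(y, z)

# class() reports the object type
class(1)                 # "numeric"
class('one')             # "character"
class(factor('one'))     # "factor"
class(TRUE)              # "logical"

# Converting between types
as.numeric('3.5') + 1    # 4.5
as.numeric(TRUE)         # 1

# Gotcha: as.numeric() on a factor returns the internal level codes,
# so go through as.character() first to recover the printed values
f <- factor(c('10', '20', '10'))
codes <- as.numeric(f)                  # 1 2 1
values <- as.numeric(as.character(f))   # 10 20 10
```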

Building Vectors

We naturally combine multiple observations of the same type into vectors. c() concatenates things together

vector <- c(1, 10, 2, 3)
1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

factor(c(1, 1, 2, 2, 1), labels = c('one', 'two'))

## [1] one one two two one
## Levels: one two

c(vector, c(6, 7))

## [1]  1 10  2  3  6  7

Looking at Vectors

You can access pieces of a vector using other vectors… numeric vectors to pick indices, or logical vectors to include/exclude!

vec <- 11:20
vec[c(1, 5, 9)]

## [1] 11 15 19

vec[rep(c(TRUE,FALSE),5)]

## [1] 11 13 15 17 19
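
In practice the logical vector usually comes from a condition rather than being typed out. A small sketch of that, with is.big as an illustrative name:

```r
vec <- 11:20

# A comparison returns a logical vector the same length as vec
is.big <- vec > 15
is.big

# Use it to keep only the elements where the condition is TRUE
big.values <- vec[is.big]         # 16 17 18 19 20

# Or skip the intermediate object
even.values <- vec[vec %% 2 == 0] # 12 14 16 18 20

# which() converts the logical vector back to numeric indices
big.index <- which(vec > 15)      # 6 7 8 9 10

# Negative indices drop elements instead of picking them
vec[-1]
```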

Lists and data.frames

A list is a very flexible R object that is basically a bunch of objects (such as vectors) stuck together in a bigger object. You can pull out the sub-objects with [[]] or $. Most of the time, strange complex objects output by functions (like regression objects) are just lists

my.list <- list(a = 1:10, b = 11:20, c = 'hello')
my.list[['a']]

##  [1]  1  2  3  4  5  6  7  8  9 10

my.list$c

## [1] "hello"
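
To see the "regression objects are just lists" point, here is a sketch using the built-in mtcars data (the choice of data and model is only for illustration):

```r
# Fit a simple regression and store the result
fit <- lm(mpg ~ hp, data = mtcars)

# It is a list under the hood
is.list(fit)

# names() shows the sub-objects you can pull out
names(fit)

# Pull pieces out with $ or [[ ]], same as any other list
fit$coefficients
first.residuals <- fit[['residuals']][1:3]
first.residuals
```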

Lists and data.frames

Most of the time you’ll be working with data frames (unless you’re working with ts), which are just lists that are made up of vectors of the same length

my.df <- data.frame(a = 1:10, b = 11:20, c = rep('hello',10))
my.df$d <- sample(c(TRUE, FALSE), 10, replace = TRUE)
head(my.df)

##   a  b     c     d
## 1 1 11 hello  TRUE
## 2 2 12 hello FALSE
## 3 3 13 hello FALSE
## 4 4 14 hello  TRUE
## 5 5 15 hello FALSE
## 6 6 16 hello  TRUE
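
Since a data frame is a list of equal-length vectors, the list tools and the vector-indexing tools both carry over. A minimal sketch, using a small made-up data frame called df:

```r
df <- data.frame(a = 1:10, b = 11:20)

# A data frame is a list; its length is the number of columns
is.list(df)
length(df)

# Pull a column out as a vector with [[ ]] or $
df[['a']]
df$b

# Index with [rows, columns]; a logical vector picks rows
df[df$a > 7, ]
b.when.a.big <- df[df$a > 7, 'b']   # 18 19 20

# Dimensions
nrow(df)
ncol(df)
```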

Functions

You will commonly want to pull variables out to run them through functions

Keep in mind that many R functions need you to explicitly handle NAs

mean(my.df$d)

## [1] 0.5

my.df$d[1] <- NA
mean(my.df$d)

## [1] NA

mean(my.df$d, na.rm = TRUE)

## [1] 0.4444444
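
The same pattern applies to most summary functions. A short sketch with a made-up vector x, including how to count missing values:

```r
x <- c(2, NA, 6)

# Without na.rm, one NA makes the whole result NA
total.raw <- sum(x)
total.raw

# With na.rm = TRUE the NA is dropped first
total <- sum(x, na.rm = TRUE)   # 8

# is.na() finds the missing values
missing <- is.na(x)
missing

# The mean of a logical vector is a share, so this is the share missing
share.missing <- mean(is.na(x))

# Comparing to NA with == just gives NA back; use is.na() instead
x == NA
```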

Getting Data

LOTS of built-in data sets for writing class examples
Do data( and see what pops up. Many in packages too - see the Ecdat package, and this list
read.csv() to get CSV files (or read_csv() in the readr package), and in the haven package, read_dta(), read_sas(), etc. for Stata, SAS, and SPSS files. See also the foreign package. Note all of these can take a URL instead of a file on the system
Lots of packages designed to get fresh data into R, for example World Bank data has a few APIs. See my list on nickchk.com/econometrics.html#Rdata.
Plug: my vtable (vt()) package for exploring the data
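
A self-contained sketch of reading a CSV with base R. It writes a tiny made-up file to a temporary location first so there is something to read; the file contents and names are assumptions for illustration only:

```r
# Write a small example CSV to a temporary file
tmp.file <- tempfile(fileext = '.csv')
write.csv(data.frame(id = 1:3, wage = c(10.5, 12, 15)),
          tmp.file, row.names = FALSE)

# Read it back in; a URL string would work in place of tmp.file
wages <- read.csv(tmp.file)

# Look at what we got
str(wages)
head(wages)
```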

Manipulating data.frames (and tibbles) with dplyr

I’m going to recommend the use of the tidyverse package, which is a sort of alternate basis for using R

Tends to think about data manipulation in the way that economists think about it - R made little sense to me until I used dplyr
The pipe %>% puts the left-hand thing as the first argument of the right-hand thing
Handles missing values better than base R
Consistent syntax
“tibble”s are just data.frames with a few extra bells and whistles
Comes with ggplot2 and functions for working with strings
Annoys some CS types

Manipulating data.frames (and tibbles) with dplyr

dplyr is a package that comes with the tidyverse for manipulating data
It is based on chaining together simple “verbs”:
mutate() to create new variables
arrange() to sort data
filter() to pick observations and select() to pick variables
group_by() to perform subsequent calculations within-group (and ungroup() when done)
summarize() to collapse the data to the group level
If you want to get fancy and automate, these all come with “scoped” versions to do multiple variables at once - _at, _if, _all
MANY other, lesser, commands (including join functions (i.e. merge) and in tidyr the pivot functions (reshape) - see the Data Wrangling swirl() or cheat sheet)
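
The join and pivot functions mentioned last are worth a quick look. This is a minimal sketch assuming dplyr and tidyr are installed; the data frames and their values are invented for illustration:

```r
library(dplyr)
library(tidyr)

# Two made-up data sets that share a 'region' variable
people <- data.frame(id = c(1, 2, 3), region = c('A', 'B', 'A'))
region.info <- data.frame(region = c('A', 'B'), price.index = c(100, 110))

# left_join() merges region-level info onto each person
merged <- people %>%
  left_join(region.info, by = 'region')
merged

# Made-up wide data: one wage column per year
wide <- data.frame(id = c(1, 2),
                   wage_2018 = c(10, 12),
                   wage_2019 = c(11, 13))

# pivot_longer() reshapes to one row per person-year
long <- wide %>%
  pivot_longer(cols = starts_with('wage_'),
               names_to = 'year',
               names_prefix = 'wage_',
               values_to = 'wage')
long
```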

Example

library(tidyverse)
library(atus)
data("atusact")

# Make sure to overwrite old data so it updates!
atusact <- atusact %>%
  # Get two-digit and four-digit activity codes
  mutate(two.digit.activity = floor(tiercode/10000),
         four.digit.activity = floor(tiercode/100)) %>%
  # Keep just personal care activities
  filter(two.digit.activity == 1)

Example

# Get mean and SD of time spent in each of those activities
atusact_summary <- atusact %>%
  # Group into the different four-digit activities
  group_by(four.digit.activity) %>%
  # And summarize
  summarize(mean.dur = mean(dur, na.rm = TRUE),
            sd.dur = sd(dur, na.rm = TRUE)) %>%
  # Sort by mean duration, longest first
  arrange(desc(mean.dur))
atusact_summary

## # A tibble: 6 x 3
##   four.digit.activity mean.dur sd.dur
##                 <dbl>    <dbl>  <dbl>
## 1                 101    507.   161. 
## 2                 103     80.2  173. 
## 3                 104     64.5  106. 
## 4                 102     51.7   33.8
## 5                 199     38.8   36.1
## 6                 105     29.5   15.9

Data Output

Summary stats are easy! If getting statistics within groups, use dplyr’s group_by() %>% summarize(). If getting a typical summary-stats table, use stargazer() (note: doesn’t like tibble()s - may have to do as.data.frame() first)

library(stargazer)
data(atusresp)
atusresp %>%
  select(hourly_wage, work_hrs_week, hh_size) %>%
  as.data.frame() %>%
  stargazer(type = 'html', digits = 1)


Statistic           N   Mean  St. Dev.  Min  Pctl(25)  Pctl(75)    Max
hourly_wage    56,243   15.7       9.8  0.0       9.2      19.0  100.0
work_hrs_week 107,020   40.1      13.5  0.0      36.0      45.0  160.0
hh_size       181,335    2.8       1.5    1         2         4     16

The Pipe

The pipe %>% comes from the magrittr package and is also loaded by the tidyverse
Whatever precedes it becomes the first argument of the function on the right
It makes it way easier to structure multiple steps of code - avoid nested functions
“make a change, ship it along” <- intuitive
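
A side-by-side sketch of nested versus piped code, assuming the magrittr package is installed. The vector x is made up:

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped: read top to bottom, each result becomes the next first argument
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Same answer either way
identical(nested, piped)
```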

Pipe Tricks

Use pull() to take a variable out of a data set in a pipe-consistent way ($ doesn’t play nice). Compare median(filter(atusact, four.digit.activity == 104)$dur) to:

median.of.104 <- atusact %>%
  filter(four.digit.activity == 104) %>%
  pull(dur) %>%
  median()

Some functions like lm() take data in a non-first argument. Use . to send data:

rmse <- atusact %>%
  lm(data = ., dur ~ factor(four.digit.activity)) %>%
  resid() %>%
  sd()

Regressions

Regression functions in R create regression objects

Inputs: a formula object, a data set, options
Outputs: a regression object, generally intended to be stored and then run through some other function
To-do: look at the outcome! summary(regression.object) often a good option. Even better is stargazer(regression.object)
To-do: make use of the regression! Perhaps run it through predict(), or get relevant statistics from it with $
Some of the relevant statistics are in the summary() not the regression object itself
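
A sketch of where the pieces live, using the built-in mtcars data (the model itself is just for illustration). Coefficients are in the regression object; things like R-squared and standard errors are in its summary():

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)

# In the regression object itself
coef(fit)

# Not in the regression object: this is NULL
is.null(fit$r.squared)

# In the summary object
fit.summary <- summary(fit)
fit.summary$r.squared

# The coefficient table is a matrix you can index by name
fit.summary$coefficients
hp.se <- fit.summary$coefficients['hp', 'Std. Error']

# Predict for new values of the regressors
new.car <- data.frame(hp = 100, wt = 3)
predict(fit, newdata = new.car)
```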

Regressions

my.reg <- lm(hourly_wage ~ hh_size + work_hrs_week, data = atusresp)
my.reg2 <- lm(hourly_wage ~ hh_size + work_hrs_week + ptft, data = atusresp)
reg.predictions <- predict(my.reg)
summary(my.reg)$r.squared

## [1] 0.03778727

Regressions

Significance stars are a common headache - note standards don’t match econ in default regression output

summary(my.reg)

## 
## Call:
## lm(formula = hourly_wage ~ hh_size + work_hrs_week, data = atusresp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.127  -5.856  -2.692   2.877  89.125 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.150826   0.158218   70.48   <2e-16 ***
## hh_size       -0.289364   0.027089  -10.68   <2e-16 ***
## work_hrs_week  0.146556   0.003346   43.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.562 on 53623 degrees of freedom
##   (127709 observations deleted due to missingness)
## Multiple R-squared:  0.03779,    Adjusted R-squared:  0.03775 
## F-statistic:  1053 on 2 and 53623 DF,  p-value: < 2.2e-16

Regression Tables

stargazer is a good option, and I have used it in the past. Print table to the screen with type='text', or to a turn-in-able file with type = 'html', out = 'filename.html'. All of its defaults are what economists would expect

stargazer(my.reg, my.reg2, type = 'html', keep = c('hh_size', 'ptft'))


                     Dependent variable: hourly_wage
                     (1)                            (2)
hh_size              -0.289***                      -0.260***
                     (0.027)                        (0.027)
ptftPT                                              -2.825***
                                                    (0.146)
Observations         53,626                         53,626
R²                   0.038                          0.044
Adjusted R²          0.038                          0.044
Residual Std. Error  9.562 (df = 53623)             9.529 (df = 53622)
F Statistic          1,052.920*** (df = 2; 53623)   832.299*** (df = 3; 53622)
Note: *p<0.1; **p<0.05; ***p<0.01

Regression Tables

Other options: huxtable a little nicer, but defaults aren’t economics defaults (notably stars). jtools is built on top of huxtable and is very cool:

handles weird regression types
summ() allows you to do robust SEs and VIFs easily, plus standardized coefs. - export_summs() prints tables to file (although no VIFs there).
effect_plot() for easy ggplot2 regression-scatterplot graphing (including regressions with nonlinear models / effects, controls, categorical predictors)
plot_coefs() for easy dot-and-CI plots of regression coefs.
Note summ() won’t do sig stars. export_summs() will, although you must set them to econ levels by hand.

Regression Tables

library(jtools)
export_summs(my.reg, my.reg2, robust = TRUE,
             stars = c(`***` = 0.01, `**` = 0.05, `*` = 0.1),
             coefs = 'hh_size')

          Model 1     Model 2
hh_size   -0.29 ***   -0.26 ***
          (0.03)      (0.03)
N         53626       53626
R2        0.04        0.04
Standard errors are heteroskedasticity robust. *** p < 0.01; ** p < 0.05; * p < 0.1.

Regression Graphs in jtools

data(mtcars)
carsreg <- lm(mpg ~ hp + I(hp^2) + cyl, data = mtcars)
effect_plot(carsreg, pred = hp, plot.points = TRUE, interval = TRUE)

plot_coefs(my.reg, my.reg2)

Regression Formulas

outcome ~ independent.variable + independent.variable
Factor variables will automatically be turned to dummies. Can do y ~ x + factor(z) to be sure
Interactions with *, y ~ x*z, or just-interaction-not-individual with :, y ~ x + x:z
Full interaction sets: y ~ (x1 + x2 + x3)^2 gives all two-way interactions, ^3 adds on the three-way
Functions can go straight in: log(y) ~ x (the natural log in R is log(), not ln())
Do calculations on variables first with I(): I(y == 1) ~ I(x^2)
Lots of variables? y~. regresses on everything in the data but y. In the tidyverse, combine this with “tidyselect” helpers like select(starts_with('gdp_'))
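
A sketch of these formula pieces on the built-in mtcars data. The particular models are only for illustration, not recommendations:

```r
# x*z expands to both main effects plus the interaction
m1 <- lm(mpg ~ hp * wt, data = mtcars)
names(coef(m1))

# x:z is just the interaction term
m2 <- lm(mpg ~ hp + hp:wt, data = mtcars)
names(coef(m2))

# Functions, I() calculations, and factor() dummies in one formula
m3 <- lm(log(mpg) ~ hp + I(hp^2) + factor(cyl), data = mtcars)
names(coef(m3))

# y ~ . regresses on every other variable in the data
m4 <- lm(mpg ~ ., data = mtcars)
length(coef(m4))
```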

Regression Commands

lm() is standard OLS
Time series: see the packages dynlm, forecast, tseries
Pretty much all micro (robust SEs, clustering, IV, fixed effects) can be done with the estimatr (lm_robust(), iv_robust()) or lfe (felm()) packages. Former has easier syntax but is less powerful and doesn’t work with stargazer() - use jtools (export_summs()) or huxtable (huxreg()) instead.
Joint F-tests with linearHypothesis() in car
For anything: google “R my-thing” or “rstats my-thing”
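
A rough sketch of robust and clustered standard errors with estimatr, plus a joint F-test with car, assuming both packages are installed. It uses mtcars purely for illustration; clustering on carb is an arbitrary choice to show the syntax, not a sensible research design:

```r
library(estimatr)
library(car)

# Heteroskedasticity-robust SEs (HC1, the Stata default for 'robust')
robust.reg <- lm_robust(mpg ~ hp + wt, data = mtcars, se_type = 'HC1')
summary(robust.reg)

# Cluster-robust SEs, using the Stata-style correction
cluster.reg <- lm_robust(mpg ~ hp + wt, data = mtcars,
                         clusters = carb, se_type = 'stata')
summary(cluster.reg)

# Joint F-test that both slopes are zero, on a plain lm() object
ols <- lm(mpg ~ hp + wt, data = mtcars)
joint.test <- linearHypothesis(ols, c('hp = 0', 'wt = 0'))
joint.test
```

The point estimates match lm(); only the standard errors change.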

Graphing

ggplot2 (ggplot()) is very much a top-tier graphing tool
Each graph consists of: data, an aes()thetic (~axes), and a geometry (what you draw)
Not actually too difficult to use; syntax for advanced styling can be difficult, but shouldn’t be too necessary for UGs. Stuff like legends and text styling can be made easier with the ggeasy package.

Graphing

data(mtcars)
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

Graphing

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(am))) + geom_point()

Graphing

mtcars <- mtcars %>% mutate(Transmission = factor(am, labels = c("Automatic", "Manual")))
ggplot(mtcars, aes(x = hp, y = mpg, shape = Transmission, color = Transmission)) + geom_point() + 
  theme_minimal() +
  geom_smooth(method = 'lm') + 
  labs(x = "Horsepower", y = "Miles per Gallon", title = "MPG vs. HP")

Graphing

data("economics")
unemp.plot <- ggplot(economics, aes(x = date, y = unemploy/pop)) + geom_line() + theme_bw() +
  labs(x = "Month", y = "Unemployed / Population")
unemp.plot

In Sum

There’s only so much I can cover. Explore!
The goal here is to get you comfortable with the basic structure and reduce the number of unknown unknowns
DO try a swirl() class - it will help if you’re starting out
Check out the materials on my website: nickchk.com/econometrics.html - plenty of things to try it out with

Extra! RStudio.cloud

RStudio.cloud is a cloud-based version of RStudio
Allows you to use R without installing anything
Also has classroom tools
Pro: No installation, can set things like packages and data up for students, no filesystem worries
Con: No great “due date” or grading system, a little slow to start up
Let’s go take a look

RMarkdown

RMarkdown is an extension of the Markdown formatting system, which is, like, super simple. []() for links, - for bulleted lists, 1. for numbered lists, etc.
Also, easily include code in your document - either as code to show up like this or code to run so the output is in your doc
This is a “reproducible document”
Output in LaTeX (without learning LaTeX), Word, etc.
Relevant here is output to SLIDES. Make slides that easily include your R code and output
Plus, automatic online publication to RPubs.
Let’s look at the code for this presentation
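
If you want something to poke at afterwards, here is a sketch that writes a tiny RMarkdown file from R. The file name and contents are made up for illustration, and the chunk fence is built with strrep() only so the backticks can sit inside R strings:

```r
# Three backticks, which open and close a code chunk in RMarkdown
fence <- strrep('`', 3)

rmd.lines <- c(
  '---',
  'title: "Example"',
  'output: html_document',
  '---',
  '',
  '## A heading',
  '',
  '- a bulleted item',
  '- a [link](https://rmarkdown.rstudio.com)',
  '',
  '1. a numbered item',
  '',
  'Code in a chunk gets run, and its output lands in the document:',
  '',
  paste0(fence, '{r}'),
  'summary(lm(mpg ~ hp, data = mtcars))',
  fence
)

# Save it to a temporary folder
rmd.file <- file.path(tempdir(), 'example.Rmd')
writeLines(rmd.lines, rmd.file)

# To build the document (needs the rmarkdown package and pandoc):
# rmarkdown::render(rmd.file)
```

Opening the file in RStudio and clicking Knit does the same thing as the render() call.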