1/26/2020

Why R?

  • Can be used for all subfields of metrics, plus machine learning/data science
  • Easy to jump from R to Stata, Python, Excel, Matlab, etc.
  • Easier and better-suited-to-econometrics than Python
  • Free, but well-supported. RStudio very handy
  • Best-in-biz visualization with ggplot2

Resources

How does R work?

Nearly everything you do in R fits into one of three categories:

  1. Create (or overwrite) an object
  2. Manipulate objects using functions (which look like function(input, input, option = input))
  3. Look at objects

Everything is an object - observations, variables, data, functions, regressions, regression tables, etc. etc.
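
A minimal sketch of those three steps in base R. The object names (wages, avg.wage) are made up for illustration:

```r
# 1. Create an object
wages <- c(12.5, 18, 9.75, 22)

# 2. Manipulate it with a function: function(input, option = input)
avg.wage <- round(mean(wages), digits = 1)

# 3. Look at it by putting it on a line by itself
avg.wage

# Functions are objects too
class(mean)
```

Running the second step again with different options overwrites avg.wage, which is step 1 all over again.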

Basics

  • Let’s look around RStudio
  • And get familiar with the help() system
  • And the basic syntax of functions

Data Object Types

Let’s create some objects with <- (note = works too… or -> if you’re feisty), and look at them by putting them on a line by themselves

There are a bunch of data object types. But the main ones to think about…

numeric.var <- 1
character.var <- 'one'
factor.var <- factor(1, labels = 'one')
logical.var <- TRUE

numeric.var
## [1] 1
as.character(numeric.var)
## [1] "1"
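
If you are not sure which type you have, class() will tell you. A small sketch using the same four objects, with base R functions only:

```r
numeric.var <- 1
character.var <- 'one'
factor.var <- factor(1, labels = 'one')
logical.var <- TRUE

# Ask each object what it is
class(numeric.var)
class(character.var)
class(factor.var)
class(logical.var)

# A factor converts to its label as text, or to its underlying integer code
as.character(factor.var)
as.numeric(factor.var)

# Logicals convert to 1/0, which is why mean() of a logical gives a share
as.numeric(logical.var)
```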

Building Vectors

We naturally combine multiple observations of the same type into vectors. c() concatenates things together

vector <- c(1, 10, 2, 3)
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
factor(c(1, 1, 2, 2, 1), labels = c('one', 'two'))
## [1] one one two two one
## Levels: one two
c(vector, c(6, 7))
## [1]  1 10  2  3  6  7

Looking at Vectors

You can access pieces of a vector using other vectors… numeric vectors to pick indices, or logical vectors to include/exclude!

vec <- 11:20
vec[c(1, 5, 9)]
## [1] 11 15 19
vec[rep(c(TRUE,FALSE),5)]
## [1] 11 13 15 17 19
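
In practice the logical vector usually comes from a comparison rather than being typed out. A sketch of that pattern, where the names big and even.and.big are just for illustration:

```r
vec <- 11:20

# A comparison returns a logical vector, one TRUE/FALSE per element
vec > 15

# Use it to keep only the elements that pass the test
big <- vec[vec > 15]
big

# Conditions combine with & (and) and | (or)
even.and.big <- vec[vec > 15 & vec %% 2 == 0]
even.and.big

# A negative index drops elements instead of keeping them
dropped.first.two <- vec[-c(1, 2)]
dropped.first.two
```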

Lists and data.frames

A list is a very flexible R object that is basically a bunch of objects (such as vectors) stuck together in a bigger object. You can pull out the sub-objects with [[]] or $. Most of the time, strange complex objects output by functions (like regression objects) are just lists

my.list <- list(a = 1:10, b = 11:20, c = 'hello')
my.list[['a']]
##  [1]  1  2  3  4  5  6  7  8  9 10
my.list$c
## [1] "hello"
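
To see the "regression objects are just lists" point in action, here is a sketch using the built-in mtcars data. The name car.reg is made up for the example:

```r
data(mtcars)
car.reg <- lm(mpg ~ hp, data = mtcars)

# A regression object is a list underneath
is.list(car.reg)

# names() shows the sub-objects available to pull out
names(car.reg)

# Pull pieces out with $ or [[ ]], just like my.list
car.reg$coefficients
car.reg[['residuals']][1:3]
```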

Lists and data.frames

Most of the time you’ll be working with data frames (unless you’re working with ts), which are just lists that are made up of vectors of the same length

my.df <- data.frame(a = 1:10, b = 11:20, c = rep('hello',10))
my.df$d <- sample(c(TRUE, FALSE), 10, replace = TRUE)
head(my.df)
##   a  b     c     d
## 1 1 11 hello  TRUE
## 2 2 12 hello FALSE
## 3 3 13 hello FALSE
## 4 4 14 hello  TRUE
## 5 5 15 hello FALSE
## 6 6 16 hello  TRUE
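
Because a data frame is a list of equal-length vectors, the list tools still work, and you also get [rows, columns] indexing. A base R sketch (my.df is rebuilt here without the random column so the results are the same every time):

```r
my.df <- data.frame(a = 1:10, b = 11:20, c = rep('hello', 10))

# Columns come out with $ or [[ ]], as with any list
my.df$a
my.df[['b']]

# Rows and columns together: df[rows, columns]
first.three <- my.df[1:3, c('a', 'b')]
first.three

# Logical vectors pick rows, just as they pick elements of a vector
high.a <- my.df[my.df$a > 8, ]
high.a

# Basic information about the data
nrow(my.df)
names(my.df)
str(my.df)
```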

Functions

You will commonly want to pull variables out to run them through functions

Keep in mind that many R functions need you to explicitly handle NAs

mean(my.df$d)
## [1] 0.5
my.df$d[1] <- NA
mean(my.df$d)
## [1] NA
mean(my.df$d, na.rm = TRUE)
## [1] 0.4444444
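
Before deciding how to handle NAs it helps to know how many there are. A sketch with a made-up vector x:

```r
x <- c(4, NA, 7, 1, NA)

# is.na() flags the missing values
is.na(x)

# Count them: TRUE counts as 1
n.missing <- sum(is.na(x))
n.missing

# Many summary functions follow the same na.rm pattern
mean(x, na.rm = TRUE)
max(x, na.rm = TRUE)

# Comparisons with NA return NA, so test with is.na(), not == NA
x == NA
```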

Getting Data

  • LOTS of built-in data sets for writing class examples
  • Do data( and see what pops up. Many in packages too - see the Ecdat package, and this list
  • read.csv() to get CSV files, or read_csv() in the readr package. For Stata, SPSS, and SAS files, use read_dta() etc. in the haven package. See also the foreign package. Note all of these can take a URL instead of a file on the system
  • Lots of packages designed to get fresh data into R, for example World Bank data has a few APIs. See my list on nickchk.com/econometrics.html#Rdata.
  • Plug: my vtable (vt()) package for exploring the data
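
A sketch of the built-in-data and read.csv() steps. The temporary file is only there so the example runs anywhere - in practice csv.path would be the path or URL of your own file:

```r
# Load a built-in data set
data(mtcars)

# Write it to a temporary CSV (a stand-in for a real file or URL)
csv.path <- tempfile(fileext = '.csv')
write.csv(mtcars, csv.path, row.names = FALSE)

# Read it back in
cars.from.csv <- read.csv(csv.path)

# Check what arrived
dim(cars.from.csv)
head(cars.from.csv)
```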

Manipulating data.frames (and tibbles) with dplyr

I’m going to recommend the use of the tidyverse package, which is a sort of alternate basis for using R

  • Tends to think about data manipulation in the way that economists think about it - R made little sense to me until I used dplyr
  • The pipe %>% puts the left-hand thing as the first argument of the right-hand thing
  • Handles missing values better than base R
  • Consistent syntax
  • tibbles are just data.frames with a few extra bells and whistles
  • Comes with ggplot2 and functions for working with strings
  • Annoys some CS types

Manipulating data.frames (and tibbles) with dplyr

  • dplyr is a package that comes with the tidyverse for manipulating data
  • It is based on chaining together simple “verbs”:
  • mutate() to create new variables
  • arrange() to sort data
  • filter() to pick observations and select() to pick variables
  • group_by() to perform subsequent calculations within-group (and ungroup() when done)
  • summarize() to collapse the data to the group level
  • If you want to get fancy and automate, these all come with “scoped” versions to do multiple variables at once - _at, _if, _all
  • MANY other, lesser, commands (including join functions (i.e. merge) and in tidyr the pivot functions (reshape) - see the Data Wrangling swirl() or cheat sheet)
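
The join and pivot functions are mentioned above but not shown, so here is a rough sketch of both. It assumes the dplyr and tidyr packages are installed, and the data (gdp.wide, regions) is made up for illustration:

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per country, one column per year
gdp.wide <- data.frame(country = c('A', 'B'),
                       gdp_2018 = c(100, 200),
                       gdp_2019 = c(110, 190))
regions <- data.frame(country = c('A', 'B'),
                      region = c('North', 'South'))

gdp.long <- gdp.wide %>%
  # Reshape wide to long: one row per country-year
  pivot_longer(cols = starts_with('gdp_'),
               names_to = 'year',
               names_prefix = 'gdp_',
               values_to = 'gdp') %>%
  # Year came from column names, so it starts out as text
  mutate(year = as.numeric(year)) %>%
  # Merge in the region of each country
  left_join(regions, by = 'country')

gdp.long
```

pivot_wider() goes the other direction, and inner_join(), full_join(), etc. change which unmatched rows are kept.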

Example

library(tidyverse)
library(atus)
data("atusact")

# Make sure to overwrite old data so it updates!
atusact <- atusact %>%
  # Get two-digit and four-digit activity codes
  mutate(two.digit.activity = floor(tiercode/10000),
         four.digit.activity = floor(tiercode/100)) %>%
  # Keep just personal care activities
  filter(two.digit.activity == 1)

Example

# Get mean and SD of time spent in each of those activities
atusact_summary <- atusact %>%
  # Group into the different four-digit activities
  group_by(four.digit.activity) %>%
  # And summarize
  summarize(mean.dur = mean(dur, na.rm = TRUE),
            sd.dur = sd(dur, na.rm = TRUE)) %>%
  # Sort from longest to shortest average duration
  arrange(-mean.dur)
atusact_summary  
## # A tibble: 6 x 3
##   four.digit.activity mean.dur sd.dur
##                 <dbl>    <dbl>  <dbl>
## 1                 101    507.   161. 
## 2                 103     80.2  173. 
## 3                 104     64.5  106. 
## 4                 102     51.7   33.8
## 5                 199     38.8   36.1
## 6                 105     29.5   15.9

Data Output

Summary stats are easy! If getting statistics within groups, use dplyr’s group_by() %>% summarize(). If getting a typical summary-stats table, use stargazer() (note: doesn’t like tibble()s - may have to do as.data.frame() first)

library(stargazer)
data(atusresp)
atusresp %>%
  select(hourly_wage, work_hrs_week, hh_size) %>%
  as.data.frame() %>%
  stargazer(type = 'html', digits = 1)
Statistic      N        Mean  St. Dev.  Min  Pctl(25)  Pctl(75)  Max
hourly_wage    56,243   15.7  9.8       0.0  9.2       19.0      100.0
work_hrs_week  107,020  40.1  13.5      0.0  36.0      45.0      160.0
hh_size        181,335  2.8   1.5       1    2         4         16

The Pipe

  • The pipe %>% is from the magrittr package and also is in tidyverse
  • Whatever precedes it becomes the first argument of the function on the right
  • It makes it way easier to structure multiple steps of code - avoid nested functions
  • “make a change, ship it along” <- intuitive
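
To see why the pipe helps, compare the same calculation written nested and piped. A small sketch, assuming the magrittr package is installed (loading tidyverse or dplyr would work too):

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested: read from the inside out
nested <- round(mean(sqrt(x)), digits = 1)

# Piped: read top to bottom, one step per line
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(digits = 1)

nested
piped
```

Both give the same answer; the piped version just lists the steps in the order they happen.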

Pipe Tricks

  • Use pull() to take a variable out of a data set in a pipe-consistent way ($ doesn’t play nice). Compare median(filter(atusact, four.digit.activity == 104)$dur) to:
median.of.104 <- atusact %>%
  filter(four.digit.activity == 104) %>%
  pull(dur) %>%
  median()
  • Some functions like lm() take data in a non-first argument. Use . to send data:
rmse <- atusact %>%
  lm(data = ., dur ~ factor(four.digit.activity)) %>%
  resid() %>%
  sd()

Regressions

Regression functions in R create regression objects

  • Inputs: a formula object, a data set, options
  • Outputs: a regression object, generally intended to be stored and then run through some other function
  • To-do: look at the outcome! summary(regression.object) often a good option. Even better is stargazer(regression.object)
  • To-do: make use of the regression! Perhaps run it through predict(), or get relevant statistics from it with $
  • Some of the relevant statistics are in the summary() not the regression object itself

Regressions

my.reg <- lm(hourly_wage ~ hh_size + work_hrs_week, data = atusresp)
my.reg2 <- lm(hourly_wage ~ hh_size + work_hrs_week + ptft, data = atusresp)
reg.predictions <- predict(my.reg)
summary(my.reg)$r.squared
## [1] 0.03778727
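
A sketch of the "store it, then use it" workflow on the built-in mtcars data, so it runs without the atus package. The names car.reg and new.cars, and the values in new.cars, are made up:

```r
data(mtcars)
car.reg <- lm(mpg ~ hp + wt, data = mtcars)

# Coefficients live in the regression object itself
coef(car.reg)

# Standard errors, t-stats and p-values live in the summary
reg.summary <- summary(car.reg)
reg.summary$coefficients
reg.summary$adj.r.squared

# predict() with no new data returns fitted values for the estimation sample
head(predict(car.reg))

# Or predict for hypothetical observations
new.cars <- data.frame(hp = c(100, 200), wt = c(2.5, 3.5))
new.predictions <- predict(car.reg, newdata = new.cars)
new.predictions
```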

Regressions

Significance stars are a common headache - the default regression output marks *** at 0.001, ** at 0.01, and * at 0.05, not the 0.01/0.05/0.1 levels usual in econ

summary(my.reg)
## 
## Call:
## lm(formula = hourly_wage ~ hh_size + work_hrs_week, data = atusresp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.127  -5.856  -2.692   2.877  89.125 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.150826   0.158218   70.48   <2e-16 ***
## hh_size       -0.289364   0.027089  -10.68   <2e-16 ***
## work_hrs_week  0.146556   0.003346   43.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.562 on 53623 degrees of freedom
##   (127709 observations deleted due to missingness)
## Multiple R-squared:  0.03779,    Adjusted R-squared:  0.03775 
## F-statistic:  1053 on 2 and 53623 DF,  p-value: < 2.2e-16

Regression Tables

stargazer is a good option, and I have used it in the past. Print table to the screen with type='text', or to a turn-in-able file with type = 'html', out = 'filename.html'. All of its defaults are what economists would expect

stargazer(my.reg, my.reg2, type = 'html', keep = c('hh_size', 'ptft'))
                     Dependent variable: hourly_wage
                     (1)                           (2)
hh_size              -0.289***                     -0.260***
                     (0.027)                       (0.027)
ptftPT                                             -2.825***
                                                   (0.146)
Observations         53,626                        53,626
R2                   0.038                         0.044
Adjusted R2          0.038                         0.044
Residual Std. Error  9.562 (df = 53623)            9.529 (df = 53622)
F Statistic          1,052.920*** (df = 2; 53623)  832.299*** (df = 3; 53622)
Note: *p<0.1; **p<0.05; ***p<0.01

Regression Tables

Other options: huxtable a little nicer, but defaults aren’t economics defaults (notably stars). jtools is built on top of huxtable and is very cool:

  • handles weird regression types
  • summ() allows you to do robust SEs and VIFs easily, plus standardized coefs.
  • export_summs() prints tables to file (although no VIFs there).
  • effect_plot() for easy ggplot2 regression-scatterplot graphing (including regressions with nonlinear models / effects, controls, categorical predictors)
  • plot_coefs() for easy dot-and-CI plots of regression coefs.
  • Note summ() won’t do sig stars. export_summs() will, although you must set them to econ levels by hand.

Regression Tables

library(jtools)
export_summs(my.reg, my.reg2, robust = TRUE,
             stars = c(`***` = 0.01, `**` = 0.05, `*` = 0.1),
             coefs = 'hh_size')
          Model 1     Model 2
hh_size   -0.29 ***   -0.26 ***
          (0.03)      (0.03)
N         53626       53626
R2        0.04        0.04
Standard errors are heteroskedasticity robust. *** p < 0.01; ** p < 0.05; * p < 0.1.

Regression Graphs in jtools

# effect_plot() and plot_coefs() come from jtools
data(mtcars)
carsreg <- lm(mpg ~ hp + I(hp^2) + cyl, data = mtcars)
effect_plot(carsreg, pred = hp, plot.points = TRUE, interval = TRUE)

plot_coefs(my.reg, my.reg2)

Regression Formulas

  • outcome ~ independent.variable + independent.variable
  • Factor variables will automatically be turned to dummies. Can do y ~ x + factor(z) to be sure
  • Interactions with *, y ~ x*z, or just-interaction-not-individual with :, y ~ x + x:z
  • Full interaction sets: y ~ (x1 + x2 + x3)^2 gives all two-way interactions, ^3 adds on the three-way
  • Functions can go straight in: log(y) ~ x (log() is the natural log; R has no ln())
  • Do calculations on variables first with I(): I(y == 1) ~ I(x^2)
  • Lots of variables? y~. regresses on everything in the data but y. In the tidyverse, combine this with “tidyselect” helpers to pick the columns first, like select(y, starts_with('gdp_'))
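
A quick way to check what a formula expands to is to look at the coefficient names it produces. A sketch on the built-in mtcars data (the object names are made up):

```r
data(mtcars)

# x*z expands to both main effects plus the interaction
full.interaction <- lm(mpg ~ hp*wt, data = mtcars)
names(coef(full.interaction))

# x + x:z gives the interaction without z's own main effect
only.interaction <- lm(mpg ~ hp + hp:wt, data = mtcars)
names(coef(only.interaction))

# A function, an I() calculation, and factor() dummies in one formula
mixed <- lm(log(mpg) ~ hp + I(hp^2) + factor(cyl), data = mtcars)
names(coef(mixed))

# y ~ . uses every other column in the data as a regressor
small.cars <- mtcars[, c('mpg', 'hp', 'wt', 'qsec')]
everything <- lm(mpg ~ ., data = small.cars)
names(coef(everything))
```

Note that factor(cyl) produces one dummy per level except the first, which becomes the reference category.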

Regression Commands

  • lm() is standard OLS
  • Time series: see the packages dynlm, forecast, tseries
  • Pretty much all micro (robust SEs, clustering, IV, fixed effects) can be done with the estimatr (lm_robust(), iv_robust()) or lfe (felm()) packages. Former has easier syntax but is less powerful and doesn’t work with stargazer() - use jtools (export_summs()) or huxtable (huxreg()) instead.
  • Joint F-tests with linearHypothesis() in car
  • For anything: google “R my-thing” or “rstats my-thing”
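
As one example from the list above, here is a sketch of a joint F-test on the built-in mtcars data. The base R anova() comparison is swapped in so the example runs without extra packages; the linearHypothesis() call only runs if car happens to be installed:

```r
data(mtcars)

# Unrestricted model, and a restricted one that drops hp and wt
full.reg <- lm(mpg ~ hp + wt + qsec, data = mtcars)
restricted.reg <- lm(mpg ~ qsec, data = mtcars)

# Base R: anova() on nested models gives the joint F-test of hp = wt = 0
f.test <- anova(restricted.reg, full.reg)
f.test

# The same hypothesis through car, if it is installed
if (requireNamespace('car', quietly = TRUE)) {
  print(car::linearHypothesis(full.reg, c('hp = 0', 'wt = 0')))
}
```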

Graphing

  • ggplot2 (ggplot()) is very much a top-tier graphing tool
  • Each graph consists of: data, an aes()thetic (~axes), and a geometry (what you draw)
  • Not actually too difficult to use; syntax for advanced styling can be difficult, but shouldn’t be too necessary for UGs. Stuff like legends and text styling can be made easier with the ggeasy package.

Graphing

data(mtcars)
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

Graphing

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(am))) + geom_point()

Graphing

mtcars <- mtcars %>% mutate(Transmission = factor(am, labels = c("Automatic", "Manual")))
ggplot(mtcars, aes(x = hp, y = mpg, shape = Transmission, color = Transmission)) + geom_point() + 
  theme_minimal() +
  geom_smooth(method = 'lm') + 
  labs(x = "Horsepower", y = "Miles per Gallon", title = "MPG vs. HP")

Graphing

data("economics")
unemp.plot <- ggplot(economics, aes(x = date, y = unemploy/pop)) + geom_line() + theme_bw() +
   labs(x = "Month", y = "Unemployed / Population")
unemp.plot

In Sum

  • There’s only so much I can cover. Explore!
  • The goal here is to get you comfortable with the basic structure and reduce the number of unknown unknowns
  • DO try a swirl() class - it will help if you’re starting out
  • Check out the materials on my website: nickchk.com/econometrics.html - plenty of things to try it out with

Extra! Rstudio.cloud

  • Rstudio.cloud is a cloud-based version of RStudio
  • Allows you to use R without installing anything
  • Also has classroom tools
  • Pro: No installation, can set things like packages and data up for students, no filesystem worries
  • Con: No great “due date” or grading system, a little slow to start up
  • Let’s go take a look

RMarkdown

  • RMarkdown is an extension of the Markdown formatting system, which is, like, super simple. []() for links, - for bulleted lists or 1. for numbered lists, etc.
  • Also, easily include code in your document - either as code to show up like this or code to run so the output is in your doc
  • This is a “reproducible document”
  • Output in LaTeX (without learning LaTeX), Word, etc.
  • Relevant here is output to SLIDES. Make slides that easily include your R code and output
  • Plus, automatic online publication to RPubs.
  • Let’s look at the code for this presentation