Data Analysis with R

1. SETUP and BASICS

Set up R, RStudio, and learn how to run code in this tutorial.

Install R
Install the RStudio IDE (Integrated Development Environment for R).

Advantages of R

Open source
A wide range of packages for:
- Data processing and cleaning
- Data visualization
- Interactive web-apps
- Typesetting, writing articles and slides
- The newest machine learning routines
Accomplishes the things you might be used to do doing in Stata (data processing, fitting standard models) and those you might be used to doing in Matlab (numerical programming).

Simple calculations:

# This is a comment 2 # print a number
2+3 # perform a simple calculation

## [1] 5

log(2) # natural log

## [1] 0.6931472

Numeric and string objects

x = 2 # store an object x # print this object
(x = 3) # store and print an object

## [1] 3

x = "Hello" # store a string object x

Vectors

Height = c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170) #store a vector

Height[2] # Print the second component

## [1] 177

Height[2:5] # Print the second, the 3rd, the 4th and 5th component

## [1] 177 177 177 178

(obs = 1:10) # Define a vector as a sequence (1 to 10)

##  [1]  1  2  3  4  5  6  7  8  9 10

Weight = c(88, 72, 85, 52, 71, 69, 61, 61, 51, 75)

BMI = Weight/((Height/100)^2) # Performs a simple calculation using vectors

BMI

##  [1] 31.17914 22.98190 27.13141 16.59804 22.40879 23.32342 22.40588 20.86112
##  [9] 16.09645 25.95156

We can also describe the vector with length(), mean() and var().

length(Height)

## [1] 10

mean(Height) # Compute the sample mean var(Height)

## [1] 173.3

Matrices

M = cbind(obs,Height,Weight,BMI) # Create a matrix 

typeof(M) # Give the type of the matrix

## [1] "double"

class(M) # Give the class of an object

## [1] "matrix" "array"

is.matrix(M) # Check if M is a matrix

## [1] TRUE

dim(M) # Dimensions of a matrix

## [1] 10  4

Simple plots

For simple plots, use plot.
For more advanced and attractive data visualizations, use ggplot.

plot(Height,Weight,ylab="Weight",xlab="Height")

Data frames and Tibbles

Tibbles are modernized versions of dataframes. Technically, they are just lists of vectors with names. Tibbles are souped up versions of R dataframes that are designed to work seamlessly with dplyr.

library(tibble) # Load the tidyverse tibble package 

mydat = as_tibble(M) # Creates a dataframe 

names(mydat) # Give the names of each variable

## [1] "obs"    "Height" "Weight" "BMI"

summary(mydat) # Descriptive Statistics

##       obs            Height          Weight           BMI       
##  Min.   : 1.00   Min.   :165.0   Min.   :51.00   Min.   :16.10  
##  1st Qu.: 3.25   1st Qu.:170.2   1st Qu.:61.00   1st Qu.:21.25  
##  Median : 5.50   Median :174.5   Median :70.00   Median :22.70  
##  Mean   : 5.50   Mean   :173.3   Mean   :68.50   Mean   :22.89  
##  3rd Qu.: 7.75   3rd Qu.:177.0   3rd Qu.:74.25   3rd Qu.:25.29  
##  Max.   :10.00   Max.   :178.0   Max.   :88.00   Max.   :31.18

Data frames and tibbles are R’s structures for storing tabular data.
Any dataset in R is usually in one of these structures.
Help syntax: You can open a help page for any object that comes with R or with an R package. To open the help page, type a ? before the object’s name and then run the command. This technique works for functions, packages, and more.

Reading and writing data

There are many routines for reading and writing files.
Tidyverse versions are in the readr package.

library(readr) #load the tidyverse readr package 

write_csv(mydat, "my_data.csv") 

mydat2=read_csv("my_data.csv")

mydat2

## # A tibble: 10 × 4
##      obs Height Weight   BMI
##    <dbl>  <dbl>  <dbl> <dbl>
##  1     1    168     88  31.2
##  2     2    177     72  23.0
##  3     3    177     85  27.1
##  4     4    177     52  16.6
##  5     5    178     71  22.4
##  6     6    172     69  23.3
##  7     7    165     61  22.4
##  8     8    171     61  20.9
##  9     9    178     51  16.1
## 10    10    170     75  26.0

2. WORKING WITH DATA

Core objects, tibbles, and the dplyr verbs you will use 90% of the time.

Example: Gapminder

Load both tidyverse, which contains dplyr and ggplot2, and gapminder

library(tidyverse)
library(gapminder)

Use the command ?gapminder to view the R help file for this dataset and read the documentation you find there

What information does this dataset contain?
How may rows and columns does it have?
What are the names of each of the columns, and what information does each contain?
What is the source of the dataset?

To display the gapminder dataset:

gapminder

## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

R shows us a helpful summary of the information contained in gapminder.

Note: This is because gapminder is not a dataframe; it’s a tibble, often abbreviated tbl.

2a. Dplyr: Filter Rows with `filter`

The dplyr package provides a number of powerful but easy-to-use tools for data manipulation in R.

Rather than displaying the whole dataset, only show the 142 rows for which the column year has a value of 2007:

gapminder %>% filter(year == 2007) #Subset rows based on logical conditions.

## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ℹ 132 more rows

The %>% symbol is called a pipe.
Pipes make the code very easy to understand.
The tibble gapminder is being piped into the function filter().
The argument year == 2007 tells filter() that it should find all the rows such that the logical condition year == 2007 is TRUE.
All that this command does is display a subset of gapminder.

If we wanted to store the result of running this command, we’d need to assign it to a variable:

gapminder2007 <- gapminder %>% filter(year == 2007)
gapminder2007

## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ℹ 132 more rows

We can also use filter to subset on two or more variables. For example, below we display data for the US in 2007:

gapminder %>% filter(year == 2007, country == 'United States')

## # A tibble: 1 × 6
##   country       continent  year lifeExp       pop gdpPercap
##   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
## 1 United States Americas   2007    78.2 301139947    42952.

2b. Dplyr: Sort data with `arrange`

Order rows ascending/descending; combine with desc().

Suppose we wanted to sort gapminder by gdpPercap. To do this we can use the arrange command along with the pipe %>% as follows:

gapminder %>% arrange(gdpPercap)

## # A tibble: 1,704 × 6
##    country          continent  year lifeExp      pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Congo, Dem. Rep. Africa     2002    45.0 55379852      241.
##  2 Congo, Dem. Rep. Africa     2007    46.5 64606759      278.
##  3 Lesotho          Africa     1952    42.1   748747      299.
##  4 Guinea-Bissau    Africa     1952    32.5   580653      300.
##  5 Congo, Dem. Rep. Africa     1997    42.6 47798986      312.
##  6 Eritrea          Africa     1952    35.9  1438760      329.
##  7 Myanmar          Asia       1952    36.3 20092996      331 
##  8 Lesotho          Africa     1957    45.0   813338      336.
##  9 Burundi          Africa     1952    39.0  2445618      339.
## 10 Eritrea          Africa     1957    38.0  1542611      344.
## # ℹ 1,694 more rows

The logic is very similar to filter:

We pipe the tibble gapminder into the function arrange().
The argument gdpPercap tells arrange() that we want to sort by GDP per capita.
Note that by default arrange() sorts in ascending order.

If we want to sort in descending order, we use the function desc() as follows:

gapminder %>% arrange(desc(gdpPercap))

## # A tibble: 1,704 × 6
##    country   continent  year lifeExp     pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Kuwait    Asia       1957    58.0  212846   113523.
##  2 Kuwait    Asia       1972    67.7  841934   109348.
##  3 Kuwait    Asia       1952    55.6  160000   108382.
##  4 Kuwait    Asia       1962    60.5  358266    95458.
##  5 Kuwait    Asia       1967    64.6  575003    80895.
##  6 Kuwait    Asia       1977    69.3 1140357    59265.
##  7 Norway    Europe     2007    80.2 4627926    49357.
##  8 Kuwait    Asia       2007    77.6 2505559    47307.
##  9 Singapore Asia       2007    80.0 4553009    47143.
## 10 Norway    Europe     2002    79.0 4535591    44684.
## # ℹ 1,694 more rows

2c. Chaining commands with pipes

The command gapminder %>% filter(year == 2007) “pipes” the tibble gapminder into the function filter(). What does this mean?

Take a look at the R help file for the dplyr function filter. We see that filter() takes something called .data as its first argument. Moving on to the “Arguments” section of the help file, we see that .data is “A tbl” i.e. a tibble.

To better understand what this means, try using filter without the pipe:

filter(gapminder, year == 2007, country == 'United States')

## # A tibble: 1 × 6
##   country       continent  year lifeExp       pop gdpPercap
##   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
## 1 United States Americas   2007    78.2 301139947    42952.

Notice that this gives us exactly the same result as

gapminder %>% filter(year == 2007, country == 'United States')

## # A tibble: 1 × 6
##   country       continent  year lifeExp       pop gdpPercap
##   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
## 1 United States Americas   2007    78.2 301139947    42952.

In other words The pipe is gives an alternative way of supplying the first argument to a function.

Now try this with a more familiar R function: mean.

The first argument of mean is a vector x. Use the pipe to compute the mean of some data:

x <- c(1, 5, 2, 7, 2)
x %>% mean

## [1] 3.4

The pipe supplies a function with its first argument.

If we want to specify additional arguments, we need to do so within the function call itself. For example, here’s how we could use the pipe to compute the mean after dropping missing observations:

y <- c(1, 5, NA, 7, 2)
y %>% mean(na.rm = TRUE)

## [1] 3.75

One important note about the pipe: it’s not a base R command. Instead it’s a command provided by the package Magrittr. This package is installed automatically along with dplyr (if we load the tidyverse package, which includes dplyr, the pipe is automatically available).

Pipes are most useful when we need to chain a series of commands together.
For example, suppose we wanted to display the 1952 data from gapminder sorted by gdpPercap in descending order.

Using the pipe,

gapminder %>% 
  filter(year == 1952) %>% 
  arrange(desc(gdpPercap))

## # A tibble: 142 × 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Kuwait         Asia       1952    55.6    160000   108382.
##  2 Switzerland    Europe     1952    69.6   4815000    14734.
##  3 United States  Americas   1952    68.4 157553000    13990.
##  4 Canada         Americas   1952    68.8  14785584    11367.
##  5 New Zealand    Oceania    1952    69.4   1994794    10557.
##  6 Norway         Europe     1952    72.7   3327728    10095.
##  7 Australia      Oceania    1952    69.1   8691212    10040.
##  8 United Kingdom Europe     1952    69.2  50430000     9980.
##  9 Bahrain        Asia       1952    50.9    120447     9867.
## 10 Denmark        Europe     1952    70.8   4334000     9692.
## # ℹ 132 more rows

The first step in the chain gapminder %>% filter(year == 1952) returns a tibble: the subset of gapminder for which year is 1952.
The next step %>% arrange(gdpPercap) pipes this new tibble into the function arrange(), giving us the desired result.
In stark contrast, let’s look at the code we would have to use if we wanted to accomplish the same task without using the pipe:

arrange(filter(gapminder, year == 1952), desc(gdpPercap))

## # A tibble: 142 × 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Kuwait         Asia       1952    55.6    160000   108382.
##  2 Switzerland    Europe     1952    69.6   4815000    14734.
##  3 United States  Americas   1952    68.4 157553000    13990.
##  4 Canada         Americas   1952    68.8  14785584    11367.
##  5 New Zealand    Oceania    1952    69.4   1994794    10557.
##  6 Norway         Europe     1952    72.7   3327728    10095.
##  7 Australia      Oceania    1952    69.1   8691212    10040.
##  8 United Kingdom Europe     1952    69.2  50430000     9980.
##  9 Bahrain        Asia       1952    50.9    120447     9867.
## 10 Denmark        Europe     1952    70.8   4334000     9692.
## # ℹ 132 more rows

This is hard to read and not good coding practice: the commands arrange and filter have to appear in the code in the opposite of the order in which they are actually being carried out. This is because parentheses are evaluated from inside to outside.

Pipes let us write our code in a way that accords with the actual order of the steps we want to carry out.

2d. Change/create variables with `mutate`

Suppose that, instead of raw population, we wanted to display population in millions. This requires us to pop by 1000000, which we can do using the function mutate() from dplyr as follows:

gapminder %>% 
  mutate(pop = pop / 1000000)

## # A tibble: 1,704 × 6
##    country     continent  year lifeExp   pop gdpPercap
##    <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8.43      779.
##  2 Afghanistan Asia       1957    30.3  9.24      821.
##  3 Afghanistan Asia       1962    32.0 10.3       853.
##  4 Afghanistan Asia       1967    34.0 11.5       836.
##  5 Afghanistan Asia       1972    36.1 13.1       740.
##  6 Afghanistan Asia       1977    38.4 14.9       786.
##  7 Afghanistan Asia       1982    39.9 12.9       978.
##  8 Afghanistan Asia       1987    40.8 13.9       852.
##  9 Afghanistan Asia       1992    41.7 16.3       649.
## 10 Afghanistan Asia       1997    41.8 22.2       635.
## # ℹ 1,694 more rows

Note the syntax here: within mutate() we have an assignment statement, namely pop = pop / 1000000. This tells R to calculate pop / 1000000 and assign the result to pop, in place of the original variable.

We can also use mutate() to create a new variable. The gapminder dataset doesn’t contain overall GDP, only GDP per capita. To calculate GDP, we need to multiply gdpPercap by pop.

gapminder %>% mutate(gdp = pop * gdpPercap)

## # A tibble: 1,704 × 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ℹ 1,694 more rows

3. DATA VISUALIZATION

3a. Scatterplot with `ggplot2`

we’ll use a package for statistical visualization called ggplot2.

ggplot2 is included in the tidyverse package, which you’ve already installed and loaded.

For more details on ggplot2 see the chapter entitled “Data Visualisation” in R for Data Science.

Start off by constructing a subset of the gapminder dataset that contains information from the year 2007 that we’ll use for our plots below.

gapminder_2007 <- gapminder %>% filter(year == 2007)

The first example will be a simple scatterplot using gapminder_2007. Each point will correspond to a single country in 2007. Its x-coordinate will be GDP per capita and its y-coordinate will be life expectancy.

ggplot(gapminder_2007) + geom_point(mapping = aes(x = gdpPercap, y = lifeExp))

We see that GDP per capita is a very strong predictor of life expectancy, although the relationship is non-linear.

Plotting on a log scale

In order to transform the x-axis, add a + scale_x_log10() to the end of our command from above:

ggplot(data = gapminder_2007) + 
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) + 
  scale_x_log10()

Notice how we split the code across multiple lines and ended each of the intermediate lines with the +.

This makes things much easier to read.

3b. Color and size aesthetics

To make a graph using ggplot we use the following template:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

replacing <DATA>, <GEOM_FUNCTION>, and <MAPPINGS> to specify what we want to plot and how it should appear.

The first part is: we replace <DATA> with the dataset we want to plot, for example gapminder_2007 in the example from above. The second part is: we replace <GEOM_FUNCTION> with the name of a function that specifies the kind of plot we want to make.

So far we’ve only seen one example: geom_point() which tells ggplot that we want to make a scatterplot.

Now looking at mapping = aes(<MAPPINGS>).

The abbreviation aes is short for aesthetic and the code mapping = aes(<MAPPINGS>) defines what is called an aesthetic mapping.

This just tells R how we want our plot to look.

The information we need to put in place of <MAPPINGS> depends on what kind of plot we’re making.

Thus far we’ve only examined geom_point() which produces a scatterplot. For this kind of plot, the minimum information we need to provide is the location of each point.For example, in our example above we wrote aes(x = gdpPercap, y = lifeExp) to tell R that gdpPercap gives the x-axis location of each point, and lifeExp gives the y-axis location.

When making a scatterplot with geom_point, we can also specify the size and color of each point using aes.

For example, let’s use the color of each point to indicate continent

ggplot(data = gapminder_2007) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  scale_x_log10()

Notice how ggplot automatically generates a helpful legend.

This plot makes it easy to see at a glance that the European countries in 2007 ten to have high GDP per capita and high life expectancy, while the African countries have the opposite.

In order to use the size of each point to encode information, e.g. population:

ggplot(data = gapminder_2007) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  scale_x_log10()

4. APPLIED EXAMPLE I

4a. DID Visualize

Question: In 1980, Kentucky raised its cap on weekly earnings that were covered by worker’s compensation. We want to know if this new policy caused workers to spend more time unemployed.

The policy was designed so that the cap increase did not affect low-earnings workers, but did affect high-earnings workers, so we use low-earnings workers as our control group and high-earnings workers as our treatment group.

Load the data and explore it.

The data is included in the wooldridge R package and is called injury

In order to use the dataset, you need to load it with library(wooldridge)

duration: Duration of unemployment benefits, measured in weeks

log_duration: Log duration

after_1980: Indicator variable marking if the observation happened before (0) or after (1) the policy change in 1980.

highearn: Indicator variable marking if the observation is a low (0) or high (1) earner.

library(tidyverse)  # ggplot(), %>%, mutate(), and friends
library(broom)  # Convert models to data frames
library(scales)  # Format numbers with functions like comma(), percent(), and dollar()
library(modelsummary)  # Create side-by-side regression tables
library(wooldridge)

injury <- injury %>%
  rename(duration = durat, log_duration = ldurat, after_1980 = afchnge)

ggplot(data = injury, aes(x = duration)) +
  # binwidth = 8 makes each column represent 2 months (8 weeks)
  # boundary = 0 make it so the 0-8 bar starts at 0 and isn't -4 to 4
  geom_histogram(binwidth = 8, color = "white", boundary = 0) +
  facet_wrap(vars(highearn))

The distribution is really skewed, with most people in both groups getting between 0-8 weeks of benefits.

If we use the log of duration, we can get a less skewed distribution that works better with regression models:

ggplot(data = injury, mapping = aes(x = log_duration)) +
  geom_histogram(binwidth = 0.5, color = "white", boundary = 0) +
  # Uncomment this line if you want to exponentiate the logged values on the
  # x-axis. Instead of showing 1, 2, 3, etc., it'll show e^1, e^2, e^3, etc. and
  # make the labels more human readable
  # scale_x_continuous(labels = trans_format("exp", format = round)) +
  facet_wrap(vars(highearn))

We should also check the distribution of unemployment before and after the policy change. Copy/paste one of the histogram chunks and change the faceting:

ggplot(data = injury, mapping = aes(x = log_duration)) +
  geom_histogram(binwidth = 0.5, color = "white", boundary = 0) +
  facet_wrap(vars(after_1980))

The distributions look sort of normal, but we can’t really easily see anything different between the before/after and treatment/control groups. We can plot the averages though. There are a few different ways we can do this.

You can use a stat_summary() layer to have ggplot calculate summary statistics like averages. Here we just calculate the mean:

ggplot(injury, aes(x = factor(highearn), y = log_duration)) +
  geom_point(size = 0.5, alpha = 0.2) +
  stat_summary(geom = "point", fun = "mean", size = 5, color = "red") +
  facet_wrap(vars(after_1980))

We can also calculate the mean and 95% confidence interval:

ggplot(injury, aes(x = factor(highearn), y = log_duration)) +
  stat_summary(geom = "pointrange", size = 1, color = "red",
               fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  facet_wrap(vars(after_1980))

It looks like high earners after 1980 had longer unemployment on average.

We can also use group_by() and summarize() to figure out group means before sending the data to ggplot (gives more control over the data that we’re plotting):

  plot_data <- injury %>%
  # Make these categories instead of 0/1 numbers so they look nicer in the plot
  mutate(highearn = factor(highearn, labels = c("Low earner", "High earner")),
         after_1980 = factor(after_1980, labels = c("Before 1980", "After 1980"))) %>%
  group_by(highearn, after_1980) %>%
  summarize(mean_duration = mean(log_duration),
            se_duration = sd(log_duration) / sqrt(n()),
            upper = mean_duration + (1.96 * se_duration),
            lower = mean_duration + (-1.96 * se_duration))

ggplot(plot_data, aes(x = highearn, y = mean_duration)) +
  geom_pointrange(aes(ymin = lower, ymax = upper),
                  color = "darkgreen", size = 1) +
  facet_wrap(vars(after_1980))

#Or, plotted in the more standard diff-in-diff format:
ggplot(plot_data, aes(x = after_1980, y = mean_duration, color = highearn)) +
  geom_pointrange(aes(ymin = lower, ymax = upper), size = 1) +
  # The group = highearn here makes it so the lines go across categories
  geom_line(aes(group = highearn))

4b. DID by hand

Use a combination of group_by() and summarize() :

 diffs <- injury %>%
  group_by(after_1980, highearn) %>%
  summarize(mean_duration = mean(log_duration),
            # Calculate average with regular duration too
            mean_duration_for_humans = mean(duration))
diffs

## # A tibble: 4 × 4
## # Groups:   after_1980 [2]
##   after_1980 highearn mean_duration mean_duration_for_humans
##        <int>    <int>         <dbl>                    <dbl>
## 1          0        0          1.20                     7.48
## 2          0        1          1.41                    11.8 
## 3          1        0          1.22                     8.61
## 4          1        1          1.63                    13.9

We can pull each of these numbers out of the table with filter() and pull():

before_treatment <- diffs %>%
  filter(after_1980 == 0, highearn == 1) %>%
  pull(mean_duration)

before_control <- diffs %>%
  filter(after_1980 == 0, highearn == 0) %>%
  pull(mean_duration)

after_treatment <- diffs %>%
  filter(after_1980 == 1, highearn == 1) %>%
  pull(mean_duration)

after_control <- diffs %>%
  filter(after_1980 == 1, highearn == 0) %>%
  pull(mean_duration)

Now we are ready to calculate the difference-in-difference estimate:

diff_treatment_before_after <- after_treatment - before_treatment
cat("diff_treatment_before_after =", diff_treatment_before_after, "\n")

## diff_treatment_before_after = 0.2119848

diff_control_before_after <- after_control - before_control
cat("diff_control_before_after =", diff_control_before_after, "\n")

## diff_control_before_after = 0.02363507

diff_diff <- diff_treatment_before_after - diff_control_before_after
cat("diff_diff =", diff_diff, "\n")

## diff_diff = 0.1883498

The diff-in-diff estimate is 0.19, which means that the program causes an increase in unemployment duration of 0.19 logged weeks. ie Receiving the treatment (i.e. being a high earner after the change in policy) causes a 19% increase in the length of unemployment.

ggplot(diffs, aes(x = as.factor(after_1980),
                  y = mean_duration,
                  color = as.factor(highearn))) +
  geom_point() +
  geom_line(aes(group = as.factor(highearn))) +
  # If you use these lines you'll get some extra annotation lines and
  # labels. The annotate() function lets you put stuff on a ggplot that's not
  # part of a dataset. Normally with geom_line, geom_point, etc., you have to
  # plot data that is in columns. With annotate() you can specify your own x and
  # y values.
  annotate(geom = "segment", x = "0", xend = "1",
           y = before_treatment, yend = after_treatment - diff_diff,
           linetype = "dashed", color = "grey50") +
  annotate(geom = "segment", x = "1", xend = "1",
           y = after_treatment, yend = after_treatment - diff_diff,
           linetype = "dotted", color = "blue") +
  annotate(geom = "label", x = "1", y = after_treatment - (diff_diff / 2),
           label = "Program effect", size = 3)

4c. DID regression

\[\begin{aligned} \log(\text{duration}) = &\alpha + \beta \ \text{highearn} + \gamma \ \text{after_1980} + \\ & \delta \ (\text{highearn} \times \text{after_1980}) + \epsilon \end{aligned}\]

The coefficient is the effect we care about in the end—that’s the diff-in-diff estimator.

model_small <- lm(log_duration ~ highearn + after_1980 + highearn * after_1980,
                  data = injury)
tidy(model_small)

## # A tibble: 4 × 5
##   term                estimate std.error statistic     p.value
##   <chr>                  <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)           1.20      0.0271    44.2   0          
## 2 highearn              0.215     0.0434     4.96  0.000000711
## 3 after_1980            0.0236    0.0397     0.595 0.552      
## 4 highearn:after_1980   0.188     0.0628     3.00  0.00271

The coefficient for highearn:after_1980 is the same as what we found by hand, as it should be! It is statistically significant, so we can be fairly confident that it is not 0.

4d. DID regression with controls

Now estimate an expanded version of the basic regression model with the following additional variables:

male married age hosp (1 = hospitalized) indust (1 = manuf, 2 = construc, 3 = other) injtype (1-8; categories for different types of injury) lprewage (log of wage prior to filing a claim)

Important: indust and injtype are in the dataset as numbers (1-3 and 1-8), but they’re actually categories. We have to tell R to treat them as categories (or factors), otherwise it’ll assume that you can have an injury type of 3.46 or something impossible.

#Convert industry and injury type to categories/factors
injury_fixed <- injury %>%
  mutate(indust = as.factor(indust),
         injtype = as.factor(injtype))


model_big <- lm(log_duration ~ highearn + after_1980 + highearn * after_1980 +
                  male + married + age + hosp + indust + injtype + lprewage,
                data = injury_fixed)
tidy(model_big)

## # A tibble: 18 × 5
##    term                estimate std.error statistic   p.value
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)         -2.42      0.309      -7.82  6.15e- 15
##  2 highearn            -0.327     0.0670     -4.89  1.05e-  6
##  3 after_1980           0.0585    0.0368      1.59  1.12e-  1
##  4 male                -0.140     0.0387     -3.61  3.07e-  4
##  5 married              0.0255    0.0335      0.762 4.46e-  1
##  6 age                  0.00779   0.00122     6.41  1.58e- 10
##  7 hosp                 1.10      0.0335     32.8   2.82e-219
##  8 indust2              0.258     0.0472      5.47  4.58e-  8
##  9 indust3              0.154     0.0337      4.57  4.91e-  6
## 10 injtype2             0.883     0.133       6.62  3.83e- 11
## 11 injtype3             0.662     0.0794      8.34  9.06e- 17
## 12 injtype4             0.603     0.0856      7.05  1.98e- 12
## 13 injtype5             0.631     0.0796      7.93  2.64e- 15
## 14 injtype6             0.610     0.0803      7.60  3.43e- 14
## 15 injtype7             1.13      0.161       6.99  2.96e- 12
## 16 injtype8             0.570     0.108       5.28  1.35e-  7
## 17 lprewage             0.462     0.0569      8.12  5.70e- 16
## 18 highearn:after_1980  0.162     0.0586      2.76  5.86e-  3

Extract just the diff-in-diff estimate

diff_diff_controls <- tidy(model_big) %>%
  filter(term == "highearn:after_1980") %>%
  pull(estimate)

Compare results: We can put the model coefficients side-by-side to compare the value for highearn:after_1980 as we change the model.

modelsummary(list("Simple" = model_small, "Full" = model_big))

	Simple	Full
(Intercept)	1.199	-2.416
	(0.027)	(0.309)
highearn	0.215	-0.327
	(0.043)	(0.067)
after_1980	0.024	0.058
	(0.040)	(0.037)
highearn × after_1980	0.188	0.162
	(0.063)	(0.059)
male		-0.140
		(0.039)
married		0.026
		(0.033)
age		0.008
		(0.001)
hosp		1.098
		(0.033)
indust2		0.258
		(0.047)
indust3		0.154
		(0.034)
injtype2		0.883
		(0.133)
injtype3		0.662
		(0.079)
injtype4		0.603
		(0.086)
injtype5		0.631
		(0.080)
injtype6		0.610
		(0.080)
injtype7		1.127
		(0.161)
injtype8		0.570
		(0.108)
lprewage		0.462
		(0.057)
Num.Obs.	7150	6822
R2	0.016	0.182
R2 Adj.	0.015	0.180
AIC	24031.1	21628.4
BIC	24065.5	21758.1
Log.Lik.	-12010.556	-10795.203
F	38.342	89.112
RMSE	1.30	1.18

After controlling for a host of demographic controls, the diff-in-diff estimate is smaller (0.17), indicating that the policy caused a 17% increase in the duration of weeks unemployed following a workplace injury. It is smaller because the other independent variables now explain some of the variation in log_duration.

5. APPLIED EXAMPLE II

Policy / treatment: small cash incentives (any) offered to adults in rural Malawi to encourage returning to pick up HIV test results.

Outcome: whether the participant got their HIV result (got).

We’ll estimate the causal effect of offering any incentive on picking up results. We’ll use simple linear probability models (OLS) so the code is easy to follow.

With randomized treatment, the cleanest estimate is a bare-bones regression of the outcome on the treatment (an Intent-to-Treat, ITT). Adding good pre-treatment covariates can improve precision. Adding bad covariates (post-treatment mediators or colliders) can bias the causal estimate.

Setup:

Data: thornton_hiv comes from an experiment in Malawi looking at whether cash incentives could encourage people to learn the results of their HIV tests.

# Install once if needed:
# install.packages(c("causaldata","dplyr","ggplot2","broom","knitr","dagitty"))
# If you want to knit to PDF but don't have LaTeX:
# install.packages("tinytex"); tinytex::install_tinytex()

library(causaldata)  # thornton_hiv dataset
library(dplyr)
library(ggplot2)
library(broom)
library(knitr)
library(dagitty)

theme_classic(theme_dark(base_size = 13))

## Error in base_size/2: non-numeric argument to binary operator

Load data:

data("thornton_hiv")
df <- thornton_hiv %>%
  transmute(
    got,              # 1 = picked up HIV result
    any,              # 1 = received any incentive (treatment indicator)
    tinc,             # amount of incentive (in Kwacha) -- varies among treated
    distvct,          # distance to testing center (km)
    age,              # age in years
    hiv2004           # actual HIV status (1 = positive), measured by the study
  )

glimpse(df)

## Rows: 4,820
## Columns: 6
## $ got     <dbl> 1, NA, 1, NA, 1, 1, NA, NA, NA, NA, 0, 1, NA, NA, 0, NA, 0, NA…
## $ any     <dbl> 1, NA, 1, NA, 1, 1, NA, NA, NA, NA, 1, 1, NA, NA, 1, NA, 1, NA…
## $ tinc    <dbl> 2.08032, NA, 1.89120, NA, 0.09456, 0.94560, NA, NA, NA, NA, 0.…
## $ distvct <dbl> 2.718921, 2.835039, 2.485713, 2.835039, 1.837131, 2.217743, 2.…
## $ age     <dbl> 22, 44, 19, 30, 53, 50, 18, NA, 24, 45, 21, 62, 20, 19, 47, NA…
## $ hiv2004 <dbl> 0, NA, 0, NA, 0, 0, NA, NA, NA, NA, 0, 0, NA, NA, 1, NA, 0, NA…

summary(select(df, got, any, tinc, distvct, age, hiv2004))

##       got              any              tinc           distvct     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0946   1st Qu.:1.030  
##  Median :1.0000   Median :1.0000   Median :0.9456   Median :1.675  
##  Mean   :0.6966   Mean   :0.7659   Mean   :0.9880   Mean   :2.003  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.8912   3rd Qu.:2.743  
##  Max.   :1.0000   Max.   :1.0000   Max.   :2.8368   Max.   :5.192  
##  NA's   :1926     NA's   :1919     NA's   :1919                    
##       age           hiv2004       
##  Min.   :11.00   Min.   :-1.0000  
##  1st Qu.:23.00   1st Qu.: 0.0000  
##  Median :32.00   Median : 0.0000  
##  Mean   :33.65   Mean   : 0.0591  
##  3rd Qu.:43.00   3rd Qu.: 0.0000  
##  Max.   :84.00   Max.   : 1.0000  
##  NA's   :441     NA's   :1926

*Note: transmute() is like mutate(), but with one key difference. mutate() adds new variables (or modifies existing ones) while keeping all the original variables, unless you drop them explicitly. transmute() only keeps the variables you explicitly list/create inside it.

Variables:

any is randomly assigned (treatment/control).
tinc is the amount of incentive among those offered any incentive. It lies on the pathway from “offer any incentive” → “strength of incentive” → “pickup result.”
distvct, age, and hiv2004 are pre-treatment characteristics. They may predict the outcome but are not caused by the incentive offer.

Visual Intuition

df %>%
  group_by(any) %>%
  summarise(mean_got = mean(got), n = n()) %>%
  kable(digits = 3, caption = "Mean pickup rate by treatment arm")

Mean pickup rate by treatment arm
any	mean_got	n
0	NA	679
1	NA	2222
NA	NA	1919

ggplot(df, aes(x = factor(any, labels=c("No incentive","Any incentive")), y = got)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = .15) +
  labs(x = NULL, y = "Share who picked up results", title = "Simple difference in means")

5a. No controls

No Controls (the clean ITT):

Because the offer is randomized, we can estimate the following:

\[ got_i = \alpha + \beta \text{any}_i + \varepsilon_i \]

Here, $\beta$ is the ITT: average change in pickup rate from offering any incentive.

m1 <- lm(got ~ any, data = df)
tidy(m1) %>% kable(digits = 3, caption = "Model 1: ITT without controls")

Model 1: ITT without controls
term	estimate	std.error	statistic	p.value
(Intercept)	0.339	0.017	19.973	0
any	0.451	0.019	23.469	0

Interpretation: $\hat{\beta}$ is the difference in average pickup rates between any incentive'' vs.no incentive.’’ With randomization, this is an unbiased estimate of the causal effect without needing controls.

5b. Good controls

Good controls (pre-treatment covariates):

We can add variables measured before treatment that predict the outome (to tighten standard errors):

\[ got_i = \alpha + \beta \text{any}_i + \gamma_1 \text{distvct}_i + \gamma_2 \text{age}_i + \gamma_3 \text{hiv2004}_i + \varepsilon_i \]

m2 <- lm(got ~ any + distvct + age + hiv2004, data = df)
tidy(m2) %>% kable(digits = 3, caption = "Model 2: ITT with pre-treatment controls (good controls)")

Model 2: ITT with pre-treatment controls (good controls)
term	estimate	std.error	statistic	p.value
(Intercept)	0.342	0.028	12.365	0.000
any	0.449	0.019	23.430	0.000
distvct	-0.029	0.006	-4.690	0.000
age	0.002	0.001	3.075	0.002
hiv2004	-0.043	0.031	-1.380	0.168

# Compare fit/precision roughly
glance(m1) %>%
  select(r.squared, adj.r.squared, sigma) %>% mutate(model = "m1") %>%
  bind_rows(glance(m2) %>% select(r.squared, adj.r.squared, sigma) %>% mutate(model="m2")) %>%
  kable(digits = 3, caption = "Fit/precision comparison (lower sigma = tighter residuals)")

Fit/precision comparison (lower sigma = tighter residuals)
r.squared	adj.r.squared	sigma	model
0.163	0.163	0.423	m1
0.171	0.170	0.421	m2

If treatment is randomized, $\hat{\beta}$ in Model 2 should be very close to Model 1.
Controls like distance and age can explain extra variation in got, often making the estimate more precise (Smaller SEs).

5c. Bad controls

Bad controls: conditioning on a mediator:

A common mistake is to “control for everything,” including variables that sit after treatment on the causal path. Here, tinc (the amount of incentive) is part of the treatment dose once you’ve decided to offer any incentive. If we control for tinc, we’re partially “blocking” the treatment effect we want to measure:

any $\Rightarrow$ tinc (dose) $\Rightarrow$ got

Regressing got on any and tinc estimates the ``effect of offering any incentive holding the amount fixed,’’ which is not the ITT and is not a policy-relevant contrast here.

m3 <- lm(got ~ any + tinc + distvct + age + hiv2004, data = df)
tidy(m3) %>% kable(digits = 3, caption = "Model 3: Over-controlling by adding a mediator (tinc) — don't do this for ITT")

Model 3: Over-controlling by adding a mediator (tinc) — don’t do this for ITT
term	estimate	std.error	statistic	p.value
(Intercept)	0.337	0.027	12.316	0.000
any	0.340	0.024	14.448	0.000
tinc	0.084	0.011	7.750	0.000
distvct	-0.029	0.006	-4.613	0.000
age	0.002	0.001	3.278	0.001
hiv2004	-0.042	0.031	-1.354	0.176

What went wrong?

tinc is post-treatment (or at least part of the treatment bundle). Including it soaks up some of the path from any $\Rightarrow$ got
Your $\hat{\beta}$ on any will usually move toward zero, not because the program is weaker, but because you told the empirical model to compare ``any incentive vs. no incentive at the same amount,’’ which is nonsensical for policy.

Rule of thumb:

Good controls: measured before treatment and help make treated vs. control more comparable (confounders/prognostics).
Bad controls: variables affected by treatment (mediators) or colliders (common effects of treatment and outcome). These can bias your causal effect.

5d. DAG

g <- dagitty("
dag {
  any [exposure]
  got [outcome]
  tinc
  distvct
  age
  hiv2004

  any -> tinc
  tinc -> got
  any -> got

  distvct -> got
  age -> got
  hiv2004 -> got
}
")
plot(g)

Controlling for distvct, age, hiv2004 is fine (pre-treatment).
Controlling for tinc is NOT fine if your estimand is the ITT of offering any incentive.
Avoid controlling for colliders (a variable caused by both treatment and some unobserved factor that also affects the outcome). Conditioning on a collider opens backdoor bias. When in doubt, sketch a DAG and ask: Is this variable caused by the treatment? If yes, don’t control for it for the ITT.)

More controls can:

Help precision when they are pre-treatment predictors of the outcome.
Hurt when they are post-treatment (mediators) or colliders, or when you “oversaturate” a small sample with many noisy controls (variance inflation).

Start from your causal question and estimand (here: ITT of offering any incentive). Include only covariates that are needed to satisfy identification (pre-treatment confounders) or to improve precision, and exclude mediators/colliders.

bind_rows(
  tidy(m1) %>% mutate(model="(1) ITT"),
  tidy(m2) %>% mutate(model="(2) ITT + good controls"),
  tidy(m3) %>% mutate(model="(3) ITT + BAD control (tinc)")
) %>%
  filter(term=="any") %>%
  select(model, estimate, std.error, statistic, p.value) %>%
  kable(digits=3, caption = "Estimate on `any` across specifications")

Estimate on `any` across specifications
model	estimate	std.error	statistic
(1) ITT	0.451	0.019	23.469
(2) ITT + good controls	0.449	0.019	23.430
(3) ITT + BAD control (tinc)	0.340	0.024	14.448

Expect: (3) typically shows a smaller any coefficient than (1)–(2), not because the program is weaker, but because we controlled away part of the effect.

Randomized treatment: the cleanest regression is outcome ~ treatment.
Good controls: pre-treatment covariates (for precision).
Bad controls: mediators/colliders (can bias estimates).
Kitchen-sink OLS is not a virtue in causal work—start from a DAG and your estimand.

Dataset: thornton_hiv from the causaldata package (CRAN manual lists variables and study context).

Thornton, R. (2008). The Demand for, and Impact of, Learning HIV Status. AER 98(5): 1829–1863.

The causaldata CRAN manual documents thornton_hiv and its variables (see the dataset listing and description). :contentReferenceoaicite:0
For background on the Malawi incentives experiment (Thornton 2008), see the original paper. :contentReferenceoaicite:1

6. APPLIED EXAMPLE III

Research Question: How does education affect female wages?

We want to estimate how education affects female wages. The problem: education may be endogenous (people with higher ability or motivation might get more education and earn higher wages). If so, simple OLS can be biased.

Idea: Use instrumental variables (IV) — variables related to education (the “push”) but not directly related to the wage except through education.

We’ll keep the econometrics light and focus on intuition.

The problem: education may be endogenous (people with higher ability or motivation might get more education and earn higher wages). If so, simple OLS can be biased.

Idea: Use instrumental variables (IV) — variables related to education but not directly related to the wage except through education.

We will use the classic Mroz dataset from the AER package (female labor supply/wages).

Key variables (names are from the dataset):

wage: hourly wage; lwage: log wage

educ: years of schooling (our main regressor of interest)

exper: years of labor market experience; expersq: experience squared

Candidate instruments for educ:

motheduc, fatheduc: mother’s and father’s education (often used as IVs for schooling)

huseduc: husband’s education (we will try it but also discuss why it may be invalid)

Strategy

Think causally: Draw the story. What pushes education? What else pushes wages directly?
Pick instruments with a credible story for relevance and exclusion.
Add sensible controls that are not caused by education (experience, demographics).
Run OLS and IV; compare estimates and first-stage strength.
Use diagnostics (Hausman, Sargan) as clues, not final proof.
Be transparent about limitations and alternative interpretations.

6a. Data: Mroz dataset

# Install once if needed:
# install.packages(c("AER","dplyr","ggplot2","broom"))
# install.packages("wooldridge")  # run once if not installed
library(AER)
library(wooldridge)
library(dplyr)     # <- provides %>% and across()
library(ggplot2)
library(broom)

# Load Mroz dataset from wooldridge package
data("mroz", package = "wooldridge")

df <- mroz %>%
  mutate(
    expersq = exper^2
  )

core_vars <- c("lwage","wage","educ","exper","expersq","motheduc","fatheduc","huseduc")
df_use <- df %>% filter(complete.cases(across(all_of(core_vars))))

dim(df_use)

## [1] 428  22

head(df_use %>% select(all_of(core_vars)))

##        lwage   wage educ exper expersq motheduc fatheduc huseduc
## 1 1.21015370 3.3540   12    14     196       12        7      12
## 2 0.32851210 1.3889   12     5      25        7        7       9
## 3 1.51413774 4.5455   12    15     225       12        7      12
## 4 0.09212332 1.0965   12     6      36        7        7      10
## 5 1.52427220 4.5918   14     7      49       12       14      12
## 6 1.55648005 4.7421   12    33    1089       14        7      11

A quick look at the data on wages and education:

ggplot(df_use, aes(x = educ, y = lwage)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Log wage vs. education (raw relationship)",
       x = "Years of education",
       y = "Log wage")

This positive slope mixes causal effect with selection (ability, family background, etc.).

What should we include as controls?

Include: Things that are not caused by education.

We will include experience and experience squared (earnings rise with experience but at a decreasing rate).

Be careful including variables affected by education (e.g., occupation, current hours, current job tenure). These can “soak up” part of education’s effect (so-called bad controls).

Intuition about instruments:

Relevance: The instrument must move education (first stage).

Exclusion: The instrument must not affect wages directly, only through education.

Parents’ education plausibly encourages more schooling (relevance).

The tricky part: Do parents’ education affect adult wages other than via the daughter’s education (e.g., through networks or family background)? We can’t prove this with data; we argue it and test implications.

6b. Baseline OLS (no instruments)

m_ols <- lm(lwage ~ educ + exper + expersq, data = df_use)
tidy(m_ols)

## # A tibble: 4 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -0.522     0.199        -2.63 8.90e- 3
## 2 educ         0.107     0.0141        7.60 1.94e-13
## 3 exper        0.0416    0.0132        3.15 1.72e- 3
## 4 expersq     -0.000811  0.000393     -2.06 3.97e- 2

The coefficient on educ is the average percentage change in wages associated with an extra year of schooling, holding exper fixed.

6c. IV Strategy 1: Parents’ Education as Instruments

Instruments: motheduc and fatheduc for educ.

Relevance: Parents’ education likely correlates with the child’s education (first stage).
Exclusion (debatable): Parents’ education should not affect the daughter’s wage except through the daughter’s education. This is an assumption; violations are possible (e.g., networks, values, non-pecuniary support). We can check implications (weak instruments tests, overidentification tests), but ultimately it’s a judgment call.

First stage intuition

ggplot(df_use, aes(x = (motheduc + fatheduc)/2, y = educ)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "First-stage intuition: parental education vs. daughter's education",
       x = "Average of mother & father education",
       y = "Daughter's education")

IV estimation

m_iv_parents <- ivreg(
  lwage ~ educ + exper + expersq |
    motheduc + fatheduc + exper + expersq,
  data = df_use
)

summary(m_iv_parents, diagnostics = TRUE)

## 
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | motheduc + fatheduc + 
##     exper + expersq, data = df_use)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0986 -0.3196  0.0551  0.3689  2.3493 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.0481003  0.4003281   0.120  0.90442   
## educ         0.0613966  0.0314367   1.953  0.05147 . 
## exper        0.0441704  0.0134325   3.288  0.00109 **
## expersq     -0.0008990  0.0004017  -2.238  0.02574 * 
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   2 423    55.400  <2e-16 ***
## Wu-Hausman         1 423     2.793  0.0954 .  
## Sargan             1  NA     0.378  0.5386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6747 on 424 degrees of freedom
## Multiple R-Squared: 0.1357,  Adjusted R-squared: 0.1296 
## Wald test: 8.141 on 3 and 424 DF,  p-value: 2.787e-05

In the IV formula y ~ x1 + x2 | z1 + z2, the right side before the bar lists regressors in the wage equation; the right side after the bar lists instruments and exogenous controls.
Here, educ is treated as endogenous; exper and expersq are included on both sides as exogenous controls (not instrumented).

Reading the diagnostics

Looking at the Diagnostic tests section in the above output:

1. First-stage strength (look for a decent F-statistic on the instruments in the first-stage diagnostics; a common rule of thumb is F > 10).

If first-stage F is small → instruments are weak → IV estimates can be noisy or misleading.

As a rule of thumb, if the F-statistic is greater than 10, instruments are considered “not weak.”

So 55.4 (which is the robust version of the F-statistic in the above output) indicates that that the instrument is strong (ie parental education strongly predicts women’s education, so the relevance condition is satisfied).

2. Wu-Hausman is a test for the endogeneity of educ. Tests whether OLS and IV estimates differ significantly.

look at the p-value (0.095). If <0.05, conclude endogeneity matters (OLS is biased). Here, borderline → some endogeneity but not overwhelming.
If Wu–Hausman rejects → OLS and IV differ → suggests endogeneity matters.

p-value = 0.095 → not significant at 5%, but borderline at 10%.

Interpretation: There’s some evidence (weak) that education is endogenous. OLS may be biased, but it’s not conclusive.

Note that the null hypothesis for the Wu-Hausman test is education is exogenous (i.e. OLS is fine) and the alternative is education is endogenous (i.e. need IV)

3. Sargan test (overidentification test): We have 2 instruments (mother’s and father’s education) for 1 endogenous regressor (educ). That leaves 1 overidentifying restriction you can test.

Null hypothesis: all instruments are valid (they affect wages only through education).

p-value = 0.54 → you fail to reject → no evidence that the instruments are invalid.

Interpretation: Parents’ education passes this statistical test. But remember: Sargan can’t “prove” validity, it only checks consistency across instruments.

Sargan test (overidentification): with 2 instruments for 1 endogenous variable, we can test whether instruments as a set look consistent with the exclusion restriction. look at the p-value (0.54). If <0.05, reject validity. Here, not rejected → instruments look valid. If Sargan rejects → at least one instrument may violate the exclusion restriction.

Overall conclusion: Instruments are strong (F = 55). There’s some suggestion of endogeneity (p ≈ 0.10). No sign the instruments are invalid (Sargan p = 0.54). So IV is reasonable, but interpretation still relies on believing the exclusion restriction (that parents’ education only matters for wages through the woman’s education).

6d. IV Strategy 2: Husband’s Ed as an Instrument

Try using husband’s education huseduc as an instrument for the woman’s education.

m_iv_hus <- ivreg(
  lwage ~ educ + exper + expersq |
    huseduc + exper + expersq,
  data = df_use
)

summary(m_iv_hus, diagnostics = TRUE)

## 
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | huseduc + exper + 
##     expersq, data = df_use)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.07677 -0.32148  0.03525  0.37605  2.36256 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.2980987  0.3099189  -0.962 0.336668    
## educ         0.0893851  0.0238705   3.745 0.000206 ***
## exper        0.0425893  0.0132451   3.215 0.001402 ** 
## expersq     -0.0008457  0.0003957  -2.137 0.033155 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   1 424   230.900  <2e-16 ***
## Wu-Hausman         1 423     0.892   0.346    
## Sargan             0  NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6677 on 424 degrees of freedom
## Multiple R-Squared: 0.1536,  Adjusted R-squared: 0.1476 
## Wald test: 11.69 on 3 and 424 DF,  p-value: 2.26e-07

…..a “strong-looking” instrument can still be invalid if the exclusion restriction is not credible.

OLS estimate: ~0.107 (about 10.7% higher wages per extra year of schooling).
IV with parents’ education: ~0.061 (much lower, imprecise).
IV with husband’s education: ~0.089 (between the two, closer to OLS, but still smaller).
So compared to OLS, the husband-IV estimate is lower than OLS but higher than the parents-IV estimate.

6e. Compare OLS vs. IV Estimates

tab_ols <- tidy(m_ols) %>% mutate(model = "OLS")
tab_iv1 <- tidy(m_iv_parents) %>% mutate(model = "IV: Parents' educ")
tab_iv2 <- tidy(m_iv_hus) %>% mutate(model = "IV: Husband's educ")

combined <- bind_rows(tab_ols, tab_iv1, tab_iv2) %>%
  filter(term == "educ") %>%
  select(model, term, estimate, std.error, statistic, p.value)



# Print
# results_table

# (Optional) Pretty table when knitting
 knitr::kable(combined, digits = 3, align = "lrrrrrl",
              caption = "Education Coefficient and IV Diagnostics (Side-by-Side)")

Education Coefficient and IV Diagnostics (Side-by-Side)
model	term	estimate	std.error	statistic	p.value
OLS	educ	0.107	0.014	7.598	0.000
IV: Parents’ educ	educ	0.061	0.031	1.953	0.051
IV: Husband’s educ	educ	0.089	0.024	3.745	0.000

Why “strong” instruments aren’t automatically “good” instruments.

Let’s unpack why the husband’s education (huseduc) instrument gives a higher return to schooling than OLS.

Why might the husband’s education IV push the coefficient upward relative to the parents-IV case?

Exclusion restriction problem: Husband’s education may be correlated with the wife’s wages for reasons other than her schooling:

Assortative mating: highly educated women tend to marry highly educated men. But beyond schooling, husband’s education also correlates with:

- Household income (which can relax constraints, allow wife to choose higher-paying but riskier jobs).

- Social networks, labor market connections, location choices.

- Even “quality” of match (ambition, preferences).

These channels directly affect the wife’s wages independent of her education. If so, the IV estimate picks up not just the causal effect of the woman’s own schooling, but also these extra channels that bias the estimate upward.

Assortative mating: high-earning women may marry high-earning men; husband’s education may proxy for social networks or local labor market conditions that directly affect the woman’s wage.

If so, huseduc affects wages not only through the woman’s schooling → invalid instrument.

ie a “strong-looking” instrument can still be invalid if the exclusion restriction is not credible.

A bit more intuition in “IV logic” terms

Remember IV estimates are LATEs (local average treatment effects): they reflect the causal effect of education for the “compliers” — people whose schooling is shifted by the instrument.
Parents’ education instruments → compliers are daughters whose education levels are influenced by their parents’ background. This group may be more marginal in the labor market, so the estimated returns to education could be smaller.
Husband’s education instrument → compliers are women whose education choices are “aligned” with the kind of husband they eventually marry. This is a very selected group — likely higher-ability or higher-income households — where the return to education looks higher.
So part of the difference is a different complier population, and part is the invalidity of the instrument (husband’s education directly affecting wages).

In summary

OLS mixes up the effect of education with ability and background.
IV with parents’ education tries to strip that out; it gives a lower return, but it might be closer to the “true” causal effect. Parents’ education is a common (imperfect) candidate: it often passes relevance; exclusion is arguable and must be discussed.
The “husband-as-IV” example illustrates that even if the instrument is relevant, it might fail exclusion (economically implausible) — treat with caution.

IV with husband’s education looks statistically strong (high F-statistic), but it’s more problematic than parental education: husband’s schooling affects wife’s wages through many other channels. That contamination pushes the coefficient up, making it look larger than the parents-IV estimate.

A strong first stage (big F-statistic) isn’t enough. If the exclusion restriction is not believable, the IV estimate can be biased — sometimes more biased than OLS.

6f. What to Include vs. Not Include

Include reasonable controls that are not caused by education and help isolate the effect:

Experience and experience squared (learning-by-doing patterns)
Age (sometimes), region, cohort if they are not themselves outcomes of education decisions in your setup.

Avoid “bad controls” — variables affected by education:

Current occupation, current firm/industry, job tenure if education helped you get that job.
These can “soak up” the very pathway through which education raises wages, biasing the education coefficient downward.

Multiple instruments:

If you have more instruments than endogenous regressors, you can test “overidentifying restrictions,” but remember tests are imperfect; it’s still about economic reasoning.

Consider: What if we (probably inappropriately) add a potential “bad control” like occupation (if available)? This is just to demonstrate how estimates can shift; it does not validate the model.

m_ols <- lm(lwage ~ educ + exper + expersq, data = df_use)
summary(m_ols)$coef["educ",]

##     Estimate   Std. Error      t value     Pr(>|t|) 
## 1.074896e-01 1.414648e-02 7.598332e+00 1.939931e-13

m_iv_f <- ivreg(lwage ~ educ + exper + expersq |
                  fatheduc + exper + expersq, data = df_use)
summary(m_iv_f, diagnostics = TRUE)

## 
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | fatheduc + exper + 
##     expersq, data = df_use)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.09170 -0.32776  0.05006  0.37365  2.35346 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.0611169  0.4364461  -0.140  0.88870   
## educ         0.0702263  0.0344427   2.039  0.04208 * 
## exper        0.0436716  0.0134001   3.259  0.00121 **
## expersq     -0.0008822  0.0004009  -2.200  0.02832 * 
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   1 424    87.741  <2e-16 ***
## Wu-Hausman         1 423     1.437   0.231    
## Sargan             0  NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6719 on 424 degrees of freedom
## Multiple R-Squared: 0.143,   Adjusted R-squared: 0.137 
## Wald test: 8.314 on 3 and 424 DF,  p-value: 2.201e-05

m_iv_parents <- ivreg(lwage ~ educ + exper + expersq |
                        motheduc + fatheduc + exper + expersq, data = df_use)
summary(m_iv_parents, diagnostics = TRUE)

## 
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | motheduc + fatheduc + 
##     exper + expersq, data = df_use)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0986 -0.3196  0.0551  0.3689  2.3493 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.0481003  0.4003281   0.120  0.90442   
## educ         0.0613966  0.0314367   1.953  0.05147 . 
## exper        0.0441704  0.0134325   3.288  0.00109 **
## expersq     -0.0008990  0.0004017  -2.238  0.02574 * 
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   2 423    55.400  <2e-16 ***
## Wu-Hausman         1 423     2.793  0.0954 .  
## Sargan             1  NA     0.378  0.5386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6747 on 424 degrees of freedom
## Multiple R-Squared: 0.1357,  Adjusted R-squared: 0.1296 
## Wald test: 8.141 on 3 and 424 DF,  p-value: 2.787e-05

m_iv_fh <- ivreg(lwage ~ educ + exper + expersq |
                   fatheduc + huseduc + exper + expersq, data = df_use)
summary(m_iv_fh, diagnostics = TRUE)

## 
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | fatheduc + huseduc + 
##     exper + expersq, data = df_use)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0805 -0.3195  0.0319  0.3732  2.3603 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.2385882  0.2908234  -0.820 0.412456    
## educ         0.0845739  0.0222416   3.803 0.000164 ***
## exper        0.0428611  0.0132513   3.234 0.001314 ** 
## expersq     -0.0008548  0.0003958  -2.160 0.031353 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   2 423   145.189  <2e-16 ***
## Wu-Hausman         1 423     1.805    0.18    
## Sargan             1  NA     0.306    0.58    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6685 on 424 degrees of freedom
## Multiple R-Squared: 0.1516,  Adjusted R-squared: 0.1456 
## Wald test: 11.82 on 3 and 424 DF,  p-value: 1.898e-07

m_iv_all <- ivreg(lwage ~ educ + exper + expersq |
                    motheduc + fatheduc + huseduc + exper + expersq, data = df_use)
summary(m_iv_all, diagnostics = TRUE)

## 
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | motheduc + fatheduc + 
##     huseduc + exper + expersq, data = df_use)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.08378 -0.32135  0.03538  0.36934  2.35829 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.1868572  0.2853959  -0.655 0.512997    
## educ         0.0803918  0.0217740   3.692 0.000251 ***
## exper        0.0430973  0.0132649   3.249 0.001250 ** 
## expersq     -0.0008628  0.0003962  -2.178 0.029976 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   3 422   104.294  <2e-16 ***
## Wu-Hausman         1 423     2.732  0.0991 .  
## Sargan             2  NA     1.115  0.5726    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6693 on 424 degrees of freedom
## Multiple R-Squared: 0.1495,  Adjusted R-squared: 0.1435 
## Wald test: 11.52 on 3 and 424 DF,  p-value: 2.817e-07

# Helper to extract the educ coefficient and IV diagnostics
extract_row <- function(model, label, note) {
  is_iv <- inherits(model, "ivreg")
  
  if (is_iv) {
    sm <- summary(model, diagnostics = TRUE)
    coefs <- sm$coefficients
    diags <- as.data.frame(sm$diagnostics)
    # Pull diagnostics safely (may be NA for just-identified models)
    weak_F   <- if ("Weak instruments" %in% rownames(diags)) diags["Weak instruments", "statistic"] else NA_real_
    wu_p     <- if ("Wu-Hausman"      %in% rownames(diags)) diags["Wu-Hausman",      "p-value"]   else NA_real_
    sargan_p <- if ("Sargan"          %in% rownames(diags)) diags["Sargan",          "p-value"]   else NA_real_
  } else {
    sm <- summary(model)
    coefs <- sm$coefficients
    weak_F <- NA_real_
    wu_p <- NA_real_
    sargan_p <- NA_real_
  }
  
  est <- coefs["educ", "Estimate"]
  se  <- coefs["educ", "Std. Error"]
  
  data.frame(
    Model = label,
    Estimate = est,
    `Std. Err.` = se,
    `Weak F` = weak_F,
    `Wu–Hausman p` = wu_p,
    `Sargan p` = sargan_p,
    Notes = note,
    check.names = FALSE
  )
}

# Build the table exactly like shown
results_table <- rbind(
  extract_row(m_ols, "OLS",
              "Raw association; possibly biased"),
  extract_row(m_iv_f, "IV: father only",
              "Just identified; weaker return than OLS"),
  extract_row(m_iv_parents, "IV: mother + father",
              "Strong instruments, consistent; lower than OLS"),
  extract_row(m_iv_fh, "IV: father + husband",
              "Strong instruments; estimate pulled upward by husband’s education"),
  extract_row(m_iv_all, "IV: mother + father + husband",
              "Very strong; estimate between parents-only and with husband")
)

# Round to match the style in the example
results_table$Estimate      <- round(results_table$Estimate, 3)
results_table$`Std. Err.`   <- round(results_table$`Std. Err.`, 3)
results_table$`Weak F`      <- ifelse(is.na(results_table$`Weak F`), NA, round(results_table$`Weak F`, 1))
results_table$`Wu–Hausman p`<- ifelse(is.na(results_table$`Wu–Hausman p`), NA, signif(results_table$`Wu–Hausman p`, 3))
results_table$`Sargan p`    <- ifelse(is.na(results_table$`Sargan p`), NA, signif(results_table$`Sargan p`, 3))

# Print
results_table

##                           Model Estimate Std. Err. Weak F Wu–Hausman p Sargan p
## 1                           OLS    0.107     0.014     NA           NA       NA
## 2               IV: father only    0.070     0.034   87.7       0.2310       NA
## 3           IV: mother + father    0.061     0.031   55.4       0.0954    0.539
## 4          IV: father + husband    0.085     0.022  145.2       0.1800    0.580
## 5 IV: mother + father + husband    0.080     0.022  104.3       0.0991    0.573
##                                                               Notes
## 1                                  Raw association; possibly biased
## 2                           Just identified; weaker return than OLS
## 3                    Strong instruments, consistent; lower than OLS
## 4 Strong instruments; estimate pulled upward by husband’s education
## 5       Very strong; estimate between parents-only and with husband

# (Optional) Pretty table when knitting
 knitr::kable(results_table, digits = 3, align = "lrrrrrl",
              caption = "Education Coefficient and IV Diagnostics (Side-by-Side)")

Education Coefficient and IV Diagnostics (Side-by-Side)
Model	Estimate	Std. Err.	Weak F	Wu–Hausman p	Sargan p	Notes
OLS	0.107	0.014	NA	NA	NA	Raw association; possibly biased
IV: father only	0.070	0.034	87.7	0.231	NA	Just identified; weaker return than OLS
IV: mother + father	0.061	0.031	55.4	0.095	0.539	Strong instruments, consistent; lower than OLS
IV: father + husband	0.085	0.022	145.2	0.180	0.580	Strong instruments; estimate pulled upward by husband’s education
IV: mother + father + husband	0.080	0.022	104.3	0.099	0.573	Very strong; estimate between parents-only and with husband

6g. Additional notes and tips

1. Identification with instruments

If you have 1 endogenous regressor (say, education) and exactly 1 instrument (say, mother’s education), then the model is just identified. You can estimate the causal effect, but you cannot test the instrument’s validity — because there’s no redundancy.
If you have 1 endogenous regressor (education) and 2 instruments (say, mother’s and father’s education), the model is overidentified. Both instruments are supposed to point to the same true effect. If they don’t, that tells you something is off (at least one instrument may be invalid).

2. Overidentifying restrictions test (Sargan/Hansen test)

If all instruments are valid, then it shouldn’t matter which one you use — the IV estimates should line up.
With >1 instrument, you can statistically check whether the extra instruments are “telling the same story.”
This is the Sargan test (or Hansen’s J-test, in GMM settings).
Null hypothesis: All instruments are valid (they only affect wages through education).
Alternative: At least one instrument is invalid (affects wages directly).
A low p-value (< 0.05) → reject validity → red flag about instruments.
A high p-value → no evidence against validity (but not proof they are valid — tests can miss subtle violations).

3. In this example:

Parents’ education IV (motheduc + fatheduc):

Overidentified: 2 instruments for 1 endogenous regressor.

Sargan p-value = 0.54.

Interpretation: The two instruments give consistent answers; no evidence they are invalid.

Husband’s education IV (huseduc only):

Just identified: 1 instrument for 1 endogenous regressor.

No overidentifying restriction possible → test is NA in the output.

Interpretation: You can’t check exclusion validity with the data; you have to argue it based on economic reasoning.

4. Key caution

Even if the test doesn’t reject, it doesn’t prove the instruments are valid — it just means they’re not contradicting each other in the sample. For example, if both instruments are invalid in the same way, the test won’t catch it. That’s why you always combine the test with economic reasoning

6h. Fake data

# install.packages(c("AER","broom","dplyr","knitr","kableExtra")) # run once if needed
library(AER)
library(broom)
library(dplyr)
library(knitr)
library(kableExtra)

set.seed(123)

N   <- 5000
beta <- 0.5
gamma <- 1.0     # makes z2 invalid (direct effect on y)
pi1 <- 1.0
pi2 <- 1.0

z1 <- rnorm(N)
z2 <- rnorm(N)
v  <- rnorm(N)
e  <- rnorm(N)

x <- pi1*z1 + pi2*z2 + v
y <- beta*x + gamma*z2 + e

dat <- data.frame(y, x, z1, z2)
head(dat)

##            y         x          z1         z2
## 1 -1.1899849 1.3160757 -0.56047565 -0.4941739
## 2  0.9135182 0.7306040 -0.23017749  1.1275935
## 3 -1.3386337 1.3387202  1.55870831 -1.1469495
## 4  2.9453846 0.9833752  0.07050839  1.4810186
## 5  2.1706215 1.2705690  0.12928774  0.9161912
## 6  3.3116676 3.1821819  1.71506499  0.3351310

m_ols   <- lm(y ~ x, data = dat)

m_iv_z1 <- ivreg(y ~ x | z1, data = dat)        # just-identified (valid)
m_iv_z2 <- ivreg(y ~ x | z2, data = dat)        # just-identified (invalid)
m_iv_both <- ivreg(y ~ x | z1 + z2, data = dat) # overidentified (will run Sargan)

extract_row <- function(model, label, note) {
  is_iv <- inherits(model, "ivreg")
  if (is_iv) {
    sm <- summary(model, diagnostics = TRUE)
    co <- sm$coefficients
    di <- as.data.frame(sm$diagnostics)
    weakF   <- if ("Weak instruments" %in% rownames(di)) di["Weak instruments","statistic"] else NA_real_
    wu_p    <- if ("Wu-Hausman"      %in% rownames(di)) di["Wu-Hausman","p-value"]          else NA_real_
    sargan_p<- if ("Sargan"          %in% rownames(di)) di["Sargan","p-value"]              else NA_real_
  } else {
    sm <- summary(model)
    co <- sm$coefficients
    weakF <- wu_p <- sargan_p <- NA_real_
  }
  data.frame(
    Model = label,
    Estimate = co["x","Estimate"],
    `Std. Err.` = co["x","Std. Error"],
    `Weak F` = weakF,
    `Wu–Hausman p` = wu_p,
    `Sargan p` = sargan_p,
    Notes = note,
    check.names = FALSE
  )
}

tab <- bind_rows(
  extract_row(m_ols,     "OLS",                    "Raw association; biased (x endogenous)"),
  extract_row(m_iv_z1,   "IV: z1 only (valid)",    "Strong & just-identified; credible"),
  extract_row(m_iv_z2,   "IV: z2 only (invalid)",  "Strong but invalid (exclusion fails)"),
  extract_row(m_iv_both, "IV: z1 + z2 (overid.)",  "Overidentified; Sargan should reject")
) %>%
  mutate(
    Estimate = round(Estimate, 3),
    `Std. Err.` = round(`Std. Err.`, 3),
    `Weak F` = ifelse(is.na(`Weak F`), "", round(`Weak F`, 1)),
    `Wu–Hausman p` = ifelse(is.na(`Wu–Hausman p`), "", signif(`Wu–Hausman p`, 3)),
    `Sargan p` = ifelse(is.na(`Sargan p`), "", signif(`Sargan p`, 3))
  )

kable(tab, caption = "Side-by-side results", align = "lrrrrrl", booktabs = TRUE) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover")) %>%
  column_spec(1, bold = TRUE) %>%
  add_header_above(c(" " = 1, "Effect of x on y" = 2, "Diagnostics" = 3, " " = 1))

Side-by-side results
	Effect of x on y		Diagnostics
Model	Estimate	Std. Err.	Weak F	Wu–Hausman p	Sargan p	Notes
OLS	0.828	0.011				Raw association; biased (x endogenous)
IV: z1 only (valid)	0.488	0.020	2449.9	2.03e-116		Strong & just-identified; credible
IV: z2 only (invalid)	1.510	0.025	2496.9	0		Strong but invalid (exclusion fails)
IV: z1 + z2 (overid.)	1.002	0.013	4983.4	5.71e-125	6.50999999984885e-315	Overidentified; Sargan should reject

7 APPLIED EXAMPLE IV

What is the effect on test scores of reducing class size in early grades?

Project STAR (Student Teacher Achievement RAtio): Tenessee class size reduction experiment

Comparison between three different class arrangements for kindergarten through third grade:

regular class size (22-25 students per class and a single teacher with no aides)
small class size (13-17 students per class and no aide)
regular-sized class plus a teacher’s aide

\[ Y_i = \beta_0 + \beta_1 \text{SmallClass}_i + \beta_2 \text{RegAide}_i + u_i \]

$Y_i$: test score

$\text{SmallClass}_i$: =1 if student i is in a small class,=0 otherwise

$\text{RegAide}_i$: =1 if student i is in a regular class with aide,=0 otherwise

$\beta_1$: the effect of a small class, relative to a regular class

$\beta_2$: the effect of a regular class with an aide, relative to a regular class

## extra for display: 
 if(!requireNamespace("lattice")) {
  if(interactive() || is.na(Sys.getenv("_R_CHECK_PACKAGE_NAME_", NA))) {
    stop("not all packages required for the example are installed")
  } else q() }

library(AER)
data("STAR")

## Stock and Watson, p. 488
fmk <- lm(I(readk + mathk) ~ stark, data = STAR)
fm1 <- lm(I(read1 + math1) ~ star1, data = STAR)
fm2 <- lm(I(read2 + math2) ~ star2, data = STAR)
fm3 <- lm(I(read3 + math3) ~ star3, data = STAR)

library(stargazer)

stargazer(fmk, fm1, fm2, fm3,
          type = "text",   # change to "latex" or "html" if needed
          title = "Effect of Small Classes on Test Scores",
          column.labels = c("Kindergarten", "Grade 1", "Grade 2", "Grade 3"),
          dep.var.labels = "Reading + Math Score",
          covariate.labels = c("Small class", "Regular+aide class"),
          omit.stat = c("f", "ser"))   # cleaner output

## 
## Effect of Small Classes on Test Scores
## ==========================================================================================
##                                              Dependent variable:                          
##                    -----------------------------------------------------------------------
##                    Reading + Math Score I(read1 + math1) I(read2 + math2) I(read3 + math3)
##                        Kindergarten         Grade 1          Grade 2          Grade 3     
##                            (1)                (2)              (3)              (4)       
## ------------------------------------------------------------------------------------------
## Small class             13.899***                                                         
##                          (2.409)                                                          
##                                                                                           
## Regular+aide class        0.314                                                           
##                          (2.310)                                                          
##                                                                                           
## star1small                                 29.781***                                      
##                                             (2.807)                                       
##                                                                                           
## star1regular+aide                          11.959***                                      
##                                             (2.686)                                       
##                                                                                           
## star2small                                                  19.394***                     
##                                                              (2.710)                      
##                                                                                           
## star2regular+aide                                             3.479                       
##                                                              (2.566)                      
##                                                                                           
## star3small                                                                   15.587***    
##                                                                               (2.395)     
##                                                                                           
## star3regular+aide                                                              -0.291     
##                                                                               (2.302)     
##                                                                                           
## Constant                918.043***        1,039.393***     1,157.807***     1,228.506***  
##                          (1.641)            (1.836)          (1.849)          (1.715)     
##                                                                                           
## ------------------------------------------------------------------------------------------
## Observations              5,786              6,379            6,049            5,967      
## R2                        0.007              0.017            0.009            0.010      
## Adjusted R2               0.007              0.017            0.009            0.010      
## ==========================================================================================
## Note:                                                          *p<0.1; **p<0.05; ***p<0.01

coeftest(fm3, vcov = sandwich)

## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)       1228.50636    1.67959 731.4322 < 2.2e-16 ***
## star3small          15.58660    2.39544   6.5068 8.303e-11 ***
## star3regular+aide   -0.29094    2.27214  -0.1280    0.8981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(I(read3 + math3) ~ star3, data = STAR)

These estimates suggest that reducing class size has an effect on test performance, but adding an aide to a regular-sized class has a much smaller effect, possibly zero.

Should we include additional regressors?

Augmenting the regressions with additional regressors can provide more efficient estimates of the causal effects.
Morever, if the treatment receivced is not random because of failures to follow the treatment protocol, then the estimates of the experimental effects based on regressions with additional regressors could differ from the difference estimates reported above.

For these reasons, we now also look at the results with additional regressors that measure teacher, school and students characteristics

## Stock and Watson, p. 489
addreg_A <- lm(I(readk + mathk) ~ stark + experiencek, data = STAR)


addreg_B <- lm(I(readk + mathk) ~ stark + experiencek + schoolidk, data = STAR)


addreg_C <- lm(I(readk + mathk) ~ stark + experiencek + gender + lunchk + ethnicity + schoolidk, data = STAR)


stargazer(addreg_A, addreg_B, addreg_C,
          type = "text",   # change to "latex" or "html" if needed
          title = "Effect of Small Classes on Test Scores",
          column.labels = c("A", "B", "C"),
          dep.var.labels = "Reading + Math Score",
          #covariate.labels = c("Small class", "Regular+aide class"),
          omit.stat = c("f", "ser"))   # cleaner output

## 
## Effect of Small Classes on Test Scores
## ===================================================
##                          Dependent variable:       
##                   ---------------------------------
##                         Reading + Math Score       
##                       A           B          C     
##                      (1)         (2)        (3)    
## ---------------------------------------------------
## starksmall        14.006***   15.933***  15.918*** 
##                    (2.395)     (2.167)    (2.092)  
##                                                    
## starkregular+aide   -0.601      1.215      1.808   
##                    (2.306)     (2.093)    (2.019)  
##                                                    
## experiencek        1.469***   0.743***    0.665*** 
##                    (0.167)     (0.168)    (0.162)  
##                                                    
## genderfemale                             12.106*** 
##                                           (1.667)  
##                                                    
## lunchkfree                               -34.636***
##                                           (2.020)  
##                                                    
## ethnicityafam                            -25.426***
##                                           (3.584)  
##                                                    
## ethnicityasian                             0.184   
##                                           (16.972) 
##                                                    
## ethnicityhispanic                          3.225   
##                                           (31.648) 
##                                                    
## ethnicityamindian                         -34.973  
##                                           (44.676) 
##                                                    
## ethnicityother                            -24.933  
##                                           (23.935) 
##                                                    
## schoolidk2                   -81.716***  -57.496***
##                               (12.077)    (11.702) 
##                                                    
## schoolidk3                     -7.175      -7.679  
##                               (10.446)    (10.075) 
##                                                    
## schoolidk4                   -44.735***  -51.680***
##                               (11.731)    (11.309) 
##                                                    
## schoolidk5                   -48.425***  -45.966***
##                               (11.664)    (11.246) 
##                                                    
## schoolidk6                   -38.441***  -33.491***
##                               (11.949)    (11.569) 
##                                                    
## schoolidk7                    22.672**   26.266*** 
##                               (10.036)    (9.669)  
##                                                    
## schoolidk8                   -34.505***  -35.777***
##                               (10.332)    (9.973)  
##                                                    
## schoolidk9                      4.032      2.870   
##                                (9.979)    (9.635)  
##                                                    
## schoolidk10                   38.558***  32.084*** 
##                               (12.137)    (11.695) 
##                                                    
## schoolidk11                   57.981***  59.025*** 
##                               (11.833)    (11.397) 
##                                                    
## schoolidk12                     1.828      11.171  
##                               (11.852)    (11.463) 
##                                                    
## schoolidk13                   36.435***  37.072*** 
##                               (11.733)    (11.312) 
##                                                    
## schoolidk14                   -32.964**    8.688   
##                               (13.788)    (13.628) 
##                                                    
## schoolidk15                  -51.949***    -9.525  
##                               (11.893)    (11.861) 
##                                                    
## schoolidk16                  -76.829***  -31.285***
##                               (10.241)    (10.360) 
##                                                    
## schoolidk17                   -18.900*    -13.323  
##                               (11.142)    (10.798) 
##                                                    
## schoolidk18                  -51.424***  -24.005** 
##                               (10.853)    (10.681) 
##                                                    
## schoolidk19                  -29.031***    15.643  
##                               (10.413)    (10.497) 
##                                                    
## schoolidk20                   -24.413**    2.657   
##                               (10.896)    (10.913) 
##                                                    
## schoolidk21                    16.309    32.475*** 
##                               (10.998)    (10.765) 
##                                                    
## schoolidk22                    -16.033   27.355*** 
##                                (9.851)    (9.975)  
##                                                    
## schoolidk23                   31.210***  52.800*** 
##                               (11.059)    (10.956) 
##                                                    
## schoolidk24                  -39.174***   -14.848  
##                               (11.474)    (11.440) 
##                                                    
## schoolidk25                  -33.916***   -11.968  
##                               (11.887)    (11.513) 
##                                                    
## schoolidk26                  -65.901***  -26.378** 
##                               (12.279)    (12.213) 
##                                                    
## schoolidk27                   20.344**   60.777*** 
##                                (9.786)    (9.920)  
##                                                    
## schoolidk28                  -43.699***    -1.908  
##                                (9.847)    (9.976)  
##                                                    
## schoolidk29                   -19.793*    23.665** 
##                               (12.007)    (12.027) 
##                                                    
## schoolidk30                   73.697***  114.765***
##                               (11.734)    (11.705) 
##                                                    
## schoolidk31                     5.703    51.576*** 
##                               (12.221)    (12.178) 
##                                                    
## schoolidk32                  -76.597***  -31.771***
##                               (10.258)    (10.357) 
##                                                    
## schoolidk33                  -74.895***  -29.645***
##                               (10.448)    (10.532) 
##                                                    
## schoolidk34                  -43.062***  -47.958***
##                               (11.216)    (10.814) 
##                                                    
## schoolidk35                  -46.238***  -41.571***
##                               (11.616)    (11.195) 
##                                                    
## schoolidk36                    -13.686   -21.777** 
##                               (11.490)    (11.080) 
##                                                    
## schoolidk37                    -12.351     -7.870  
##                               (10.548)    (10.170) 
##                                                    
## schoolidk38                    -10.610     -2.724  
##                               (12.669)    (12.242) 
##                                                    
## schoolidk39                   -23.309**   -13.302  
##                               (11.648)    (11.270) 
##                                                    
## schoolidk40                    17.470    30.033*** 
##                               (11.348)    (11.005) 
##                                                    
## schoolidk41                   51.231***  39.230*** 
##                               (11.369)    (10.978) 
##                                                    
## schoolidk42                    -2.847      -9.913  
##                               (12.137)    (11.749) 
##                                                    
## schoolidk43                    -16.442   -25.884** 
##                               (11.292)    (10.892) 
##                                                    
## schoolidk44                     6.826    43.162*** 
##                               (11.029)    (10.898) 
##                                                    
## schoolidk45                  -105.064*** -58.205***
##                               (11.611)    (11.605) 
##                                                    
## schoolidk46                    -17.011    -20.996* 
##                               (12.018)    (11.577) 
##                                                    
## schoolidk47                    -18.214   -22.803** 
##                               (11.771)    (11.459) 
##                                                    
## schoolidk48                    -4.983      0.816   
##                               (11.440)    (11.041) 
##                                                    
## schoolidk49                     9.272      12.195  
##                               (11.769)    (11.339) 
##                                                    
## schoolidk50                    -3.353      1.766   
##                               (11.102)    (10.751) 
##                                                    
## schoolidk51                    11.627      7.057   
##                                (9.766)    (9.423)  
##                                                    
## schoolidk52                   30.792**     17.722  
##                               (12.378)    (11.941) 
##                                                    
## schoolidk53                  -66.974***  -50.533***
##                               (12.014)    (11.621) 
##                                                    
## schoolidk54                    -0.463      -6.483  
##                               (12.007)    (11.568) 
##                                                    
## schoolidk55                  -35.635***  -37.789***
##                               (10.970)    (10.573) 
##                                                    
## schoolidk56                  -90.780***  -82.116***
##                               (10.408)    (10.050) 
##                                                    
## schoolidk57                  -43.141***  -45.715***
##                               (11.592)    (11.166) 
##                                                    
## schoolidk58                    19.694*     8.719   
##                               (10.450)    (10.081) 
##                                                    
## schoolidk59                    10.294      8.540   
##                               (11.837)    (11.460) 
##                                                    
## schoolidk60                  -40.156***  -32.861***
##                               (11.341)    (10.936) 
##                                                    
## schoolidk61                  -28.758***  -36.711***
##                               (11.150)    (10.750) 
##                                                    
## schoolidk62                  -36.164***  -33.598***
##                               (12.136)    (11.718) 
##                                                    
## schoolidk63                    18.475*    19.295** 
##                               (10.105)    (9.736)  
##                                                    
## schoolidk64                   -24.511**  -22.818** 
##                               (10.604)    (10.260) 
##                                                    
## schoolidk65                    17.476      16.041  
##                               (12.948)    (12.478) 
##                                                    
## schoolidk66                  -36.682***  -39.517***
##                               (10.559)    (10.212) 
##                                                    
## schoolidk67                    -13.550    -10.331  
##                               (13.157)    (12.717) 
##                                                    
## schoolidk68                   26.870***  38.880*** 
##                               (10.410)    (10.045) 
##                                                    
## schoolidk69                   24.625**    18.973*  
##                               (11.885)    (11.460) 
##                                                    
## schoolidk70                   -20.916**   -15.221  
##                               (10.509)    (10.155) 
##                                                    
## schoolidk71                  -39.109***  -39.601***
##                               (10.899)    (10.513) 
##                                                    
## schoolidk72                    11.188     19.282*  
##                               (10.189)    (9.851)  
##                                                    
## schoolidk73                    -0.435      0.436   
##                               (11.427)    (11.023) 
##                                                    
## schoolidk74                     4.574      6.433   
##                               (10.997)    (10.597) 
##                                                    
## schoolidk75                     7.822      -1.311  
##                               (10.762)    (10.380) 
##                                                    
## schoolidk76                    -15.980    -19.275* 
##                               (10.367)    (10.040) 
##                                                    
## schoolidk78                  -48.027***  -51.562***
##                               (11.675)    (11.270) 
##                                                    
## schoolidk79                    -15.241    -15.346  
##                               (11.520)    (11.112) 
##                                                    
## schoolidk80                    -4.414      -6.440  
##                               (11.944)    (11.511) 
##                                                    
## Constant          904.721*** 925.675***  934.031***
##                    (2.228)     (8.203)    (7.964)  
##                                                    
## ---------------------------------------------------
## Observations        5,766       5,766      5,748   
## R2                  0.020       0.234      0.291   
## Adjusted R2         0.020       0.223      0.280   
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01

The above output looks quite messy because of the inclusion of all the school fixed effects. We can make it look nicer by including some additional omit prompts in stargazer:

## Stock and Watson, p. 489
addreg_A <- lm(I(readk + mathk) ~ stark + experiencek, data = STAR)


addreg_B <- lm(I(readk + mathk) ~ stark + experiencek + schoolidk, data = STAR)


addreg_C <- lm(I(readk + mathk) ~ stark + experiencek + gender + lunchk + ethnicity + schoolidk, data = STAR)

stargazer(addreg_A, addreg_B, addreg_C,
          type = "text",
          title = "Effect of Small Classes on Test Scores",
          column.labels = c("A", "B", "C"),
          dep.var.labels = "Reading + Math Score",
          omit = "schoolidk",   # <--- hides school dummies
          omit.labels = "School FE",  # <--- replaces them with one line
          omit.stat = c("f", "ser"))

## 
## Effect of Small Classes on Test Scores
## ==================================================
##                         Dependent variable:       
##                   --------------------------------
##                         Reading + Math Score      
##                       A          B          C     
##                      (1)        (2)        (3)    
## --------------------------------------------------
## starksmall        14.006***  15.933***  15.918*** 
##                    (2.395)    (2.167)    (2.092)  
##                                                   
## starkregular+aide   -0.601     1.215      1.808   
##                    (2.306)    (2.093)    (2.019)  
##                                                   
## experiencek        1.469***   0.743***   0.665*** 
##                    (0.167)    (0.168)    (0.162)  
##                                                   
## genderfemale                            12.106*** 
##                                          (1.667)  
##                                                   
## lunchkfree                              -34.636***
##                                          (2.020)  
##                                                   
## ethnicityafam                           -25.426***
##                                          (3.584)  
##                                                   
## ethnicityasian                            0.184   
##                                          (16.972) 
##                                                   
## ethnicityhispanic                         3.225   
##                                          (31.648) 
##                                                   
## ethnicityamindian                        -34.973  
##                                          (44.676) 
##                                                   
## ethnicityother                           -24.933  
##                                          (23.935) 
##                                                   
## Constant          904.721*** 925.675*** 934.031***
##                    (2.228)    (8.203)    (7.964)  
##                                                   
## --------------------------------------------------
## School FE             No         No         No    
## --------------------------------------------------
## Observations        5,766      5,766      5,748   
## R2                  0.020      0.234      0.291   
## Adjusted R2         0.020      0.223      0.280   
## ==================================================
## Note:                  *p<0.1; **p<0.05; ***p<0.01

OR alternatively we can use the modelsummary package to make things look a little nicer:

library(modelsummary)

## Stock and Watson, p. 489
addreg_A <- lm(I(readk + mathk) ~ stark + experiencek, data = STAR)


addreg_B <- lm(I(readk + mathk) ~ stark + experiencek + schoolidk, data = STAR)


addreg_C <- lm(I(readk + mathk) ~ stark + experiencek + gender + lunchk + ethnicity + schoolidk, data = STAR)



models <- list("A" = addreg_A, "B" = addreg_A, "C" = addreg_A)

modelsummary(models,
             stars = TRUE,
             coef_omit = "schoolidk",      # regex, hides all school dummies
             coef_rename = c(starkSmall = "Small class",
                             starkRegularaide = "Regular+aide class",
                             experiencek = "Teacher experience"),
             gof_omit = "IC|Log|F|Adj")    # cleaner output

	A	B	C
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
(Intercept)	904.721***	904.721***	904.721***
	(2.228)	(2.228)	(2.228)
starksmall	14.006***	14.006***	14.006***
	(2.395)	(2.395)	(2.395)
starkregular+aide	-0.601	-0.601	-0.601
	(2.306)	(2.306)	(2.306)
Teacher experience	1.469***	1.469***	1.469***
	(0.167)	(0.167)	(0.167)
Num.Obs.	5766	5766	5766
R2	0.020	0.020	0.020
RMSE	73.06	73.06	73.06

Adding additional regressors does not change the estimated causal effects of the different treatments $\Rightarrow$ This suggests that the random assignment to the smaller lasses does not depend on unobserved variables.
Additional regressors increase the $\bar{R}^2$ and the standard error of the estimated class size effect decreases from 2.45 to 2.16.

8. APPLIED EXAMPLE V

Setup:

# install missing packages automatically
pkgs <- c("tidyverse", "WDI", "broom", "kableExtra", "fixest", "sandwich", "lmtest", "countrycode")
new <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(new)) install.packages(new, repos = "https://cloud.r-project.org")

library(tidyverse)
library(WDI)
library(broom)
library(kableExtra)
library(fixest)
library(sandwich)
library(lmtest)
library(countrycode)

theme_set(theme_minimal(base_size = 13))
set.seed(123)

Policy Setting: Many Sub-Saharan African countries eliminated primary school fees in the 1990s–2000s (Free Primary Education, FPE).

Idea When fees are removed, children who are at primary-school age at that time are more likely to enroll/stay in school.

Causal question: DiD (Policy to Outcome): Did FPE increase girls’ primary completion?

Outcome: Primary completion rate, female (%).

Treatment: FPE is in effect (post-reform years).

Design: Country and year fixed effects to compare changes within a country before vs after FPE, relative to countries not yet (or never) treated.

Load Data: Indicators and FPE Reform Years

We’ll pull a panel (1990–2018) from the World Bank API:

Primary completion rate, female (%): SE.PRM.CMPT.FE.ZS
Girls’ secondary enrollment (% gross): SE.SEC.ENRR.FE
Adolescent fertility rate (births per 1,000 women ages 15–19): SP.ADO.TFRT
GDP per capita (constant 2015 US$): NY.GDP.PCAP.KD
Urban population (%): SP.URB.TOTL.IN.ZS

We’ll also hand-code FPE adoption years for a few countries (commonly cited dates):

# Indicators
ind <- c(
  prm_comp_f = "SE.PRM.CMPT.FE.ZS",   # Outcome for DiD
  sec_enrol_f = "SE.SEC.ENRR.FE",     # Endogenous X for IV
  teen_fert = "SP.ADO.TFRT",          # Outcome for IV
  gdppc = "NY.GDP.PCAP.KD",
  urban = "SP.URB.TOTL.IN.ZS"
)

yrs <- 1990:2018

raw <- WDI(
  country = "all",
  indicator = ind,
  start = min(yrs), end = max(yrs),
  extra = TRUE, cache = NULL
)

# Keep only Sub-Saharan Africa "countries" (drop aggregates)
df <- raw %>%
  filter(region == "Sub-Saharan Africa", !is.na(iso3c)) %>%
  select(iso3c, country, year, prm_comp_f, sec_enrol_f, teen_fert, gdppc, urban)

# Hand-coded FPE reform years (illustrative teaching sample) --------------
fpe_years <- tribble(
  ~country,       ~iso3c, ~fpe_year,
  "Kenya",        "KEN",  2003,
  "Uganda",       "UGA",  1997,
  "Tanzania",     "TZA",  2001,
  "Malawi",       "MWI",  1994,
  "Zambia",       "ZMB",  2002,
  "Mozambique",   "MOZ",  2004,
  "Rwanda",       "RWA",  2003,
  "Ghana",        "GHA",  2005
)

# Keep panel for the countries above, to keep the sample simple and transparent
panel <- df %>%
  semi_join(fpe_years, by = c("iso3c")) %>%
  arrange(iso3c, year)

# Merge in reform year
panel <- panel %>%
  left_join(fpe_years, by = c("iso3c", "country"))

panel %>% 
  group_by(country) %>% 
  slice_head(n = 3) %>%
  ungroup() %>%
  kbl(digits = 1, caption = "Sample rows from our panel (first 3 years per country)") %>% 
  kable_paper()

Sample rows from our panel (first 3 years per country)
iso3c	country	year	prm_comp_f	sec_enrol_f	teen_fert	gdppc	urban	fpe_year
GHA	Ghana	1990	NA	27.4	111.0	855.9	36.4	2005
GHA	Ghana	1991	49.9	NA	110.0	878.6	37.2	2005
GHA	Ghana	1992	51.4	NA	106.8	890.3	37.9	2005
KEN	Kenya	1990	NA	NA	133.5	1329.3	16.7	2003
KEN	Kenya	1991	NA	NA	130.2	1305.5	17.0	2003
KEN	Kenya	1992	NA	NA	126.4	1255.9	17.3	2003
MWI	Malawi	1990	25.3	12.3	155.1	353.8	11.6	1994
MWI	Malawi	1991	27.5	12.7	156.4	373.0	11.9	1994
MWI	Malawi	1992	30.1	NA	155.5	335.7	12.2	1994
MOZ	Mozambique	1990	23.6	5.4	164.9	221.0	25.0	2004
MOZ	Mozambique	1991	24.0	5.5	164.1	227.5	25.5	2004
MOZ	Mozambique	1992	22.5	5.5	159.0	206.2	26.0	2004
RWA	Rwanda	1990	43.4	14.8	56.9	368.2	5.4	2003
RWA	Rwanda	1991	43.3	14.7	54.3	351.7	5.5	2003
RWA	Rwanda	1992	42.9	13.5	52.6	364.8	6.3	2003
TZA	Tanzania	1990	NA	NA	137.9	541.5	18.9	2001
TZA	Tanzania	1991	61.4	4.5	135.7	539.0	19.2	2001
TZA	Tanzania	1992	52.8	4.5	131.0	528.8	19.5	2001
UGA	Uganda	1990	NA	7.9	180.4	369.8	11.1	1997
UGA	Uganda	1991	NA	NA	179.8	377.8	11.5	1997
UGA	Uganda	1992	NA	7.4	184.0	377.6	11.8	1997
ZMB	Zambia	1990	NA	NA	149.2	878.6	39.4	2002
ZMB	Zambia	1991	NA	NA	150.8	856.8	39.0	2002
ZMB	Zambia	1992	NA	NA	150.2	821.9	38.5	2002

Note: The FPE years are a minimal teaching list. For research, you’d verify dates carefully and consider staggered or partial implementations.

Basic Cleaning and Treatment Variables

For DiD, we need an indicator for when FPE is active in a country-year.
For IV, we need a cohort-exposure measure. We’ll approximate exposure for the 15–19 age group in year t as the share of their primary-school years (ages 6–13) that occurred after the FPE year. This yields a 0–1 exposure:
If FPE happens long before they start primary: exposure ≈ 1
If FPE happens after they’ve already finished primary: exposure ≈ 0
In between: partial exposure (between 0 and 1)
This is a simple, intuitive exposure index. (A research paper would build this from microdata or detailed age-by-year records.)

# DiD treatment: post-FPE indicator
panel <- panel %>%
  mutate(post_fpe = if_else(!is.na(fpe_year) & year >= fpe_year, 1, 0))

# IV instrument: cohort exposure to FPE for ages 15-19 in year t
# Approximate the "middle" birth year for 15-19 as t-17 (midpoint).
# A child's primary years are ages 6..13 -> calendar years (b+6)..(b+13).
# Exposure share = fraction of those 8 years that occur at or after fpe_year.

expo_fun <- function(year, fpe_year){
  if(is.na(fpe_year)) return(0)  # treat as never exposed
  b <- year - 17                 # midpoint birth year for 15-19
  start <- b + 6
  end   <- b + 13
  # overlap of [start, end] with [fpe_year, +inf)
  exposed_years <- pmax(0, (end - pmax(start, fpe_year) + 1))
  share <- exposed_years / 8
  share <- pmin(pmax(share, 0), 1)
  return(share)
}

panel <- panel %>%
  rowwise() %>%
  mutate(exposure_15_19 = expo_fun(year, fpe_year)) %>%
  ungroup()

# Inspect a few rows to see exposure growing over time after reform
panel %>%
  filter(country %in% c("Kenya","Uganda")) %>%
  select(country, year, fpe_year, post_fpe, exposure_15_19) %>%
  arrange(country, year) %>%
  slice(1:10) %>%
  kbl(digits = 2, caption = "Illustration: post-FPE and exposure index over time") %>% 
  kable_paper()

Illustration: post-FPE and exposure index over time
country	year	fpe_year
Kenya	1990	2003
Kenya	1991	2003
Kenya	1992	2003
Kenya	1993	2003
Kenya	1994	2003
Kenya	1995	2003
Kenya	1996	2003
Kenya	1997	2003
Kenya	1998	2003
Kenya	1999	2003

9. APPLIED EXAMPLE VI

Can cigarette consumption be reduced by taxing cigarettes more heavily? For example, what would the after-tax sales price of cigarettes need to be to to achieve a 20 percent reduction in cigarette consumption?
Need to estimate the price elasticity for the demand of cigarettes needs to be estimated from data on price and sales.
An OLS regression of log quantity on log price cannot be used to estimate the effect of interest since there is simultaneous causality between demand and supply.

Load the data set and get an overview:

library(AER)
data("CigarettesSW")
summary(CigarettesSW)

##      state      year         cpi          population           packs       
##  AL     : 2   1985:48   Min.   :1.076   Min.   :  478447   Min.   : 49.27  
##  AR     : 2   1995:48   1st Qu.:1.076   1st Qu.: 1622606   1st Qu.: 92.45  
##  AZ     : 2             Median :1.300   Median : 3697472   Median :110.16  
##  CA     : 2             Mean   :1.300   Mean   : 5168866   Mean   :109.18  
##  CO     : 2             3rd Qu.:1.524   3rd Qu.: 5901500   3rd Qu.:123.52  
##  CT     : 2             Max.   :1.524   Max.   :31493524   Max.   :197.99  
##  (Other):84                                                                
##      income               tax            price             taxs       
##  Min.   :  6887097   Min.   :18.00   Min.   : 84.97   Min.   : 21.27  
##  1st Qu.: 25520384   1st Qu.:31.00   1st Qu.:102.71   1st Qu.: 34.77  
##  Median : 61661644   Median :37.00   Median :137.72   Median : 41.05  
##  Mean   : 99878736   Mean   :42.68   Mean   :143.45   Mean   : 48.33  
##  3rd Qu.:127313964   3rd Qu.:50.88   3rd Qu.:176.15   3rd Qu.: 59.48  
##  Max.   :771470144   Max.   :99.00   Max.   :240.85   Max.   :112.63  
##

price: price is the average real price per pack of cigarettes including all taxes
pack: number of packs per capita
income: state personal income (total, nominal)
tax: average state, federal and average local excise taxes for fiscal year
taxs: average excise taxes for fiscal year, including sales tax

Compute real price per pack of cigarettes, the sales tax and the correlation between sales tax and price.

# real prices:
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)

#  sales tax:
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)

# correlation between sales tax and price:
cor(CigarettesSW$salestax, CigarettesSW$price)

## [1] 0.6141228

9a. IV First Stage

Generate a subset for the year 1995:

c1995 <- subset(CigarettesSW, year == "1995")

Using the 1995 subset, perform the first stage regression:

\[ log( P_i^{\text{cigarettes}}) = \pi_0 + \pi_1 \text{SalesTax}_i + \nu_i \]

cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)

coeftest(cig_s1, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##              Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept) 4.6165463  0.0289177 159.6444 < 2.2e-16 ***
## salestax    0.0307289  0.0048354   6.3549 8.489e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the regression’s $R^2$ we see that 47 percent of the observed variation in $log( P^{\text{cigarettes}})$ can be explained by the instrument $\text{SalesTax}$:

# inspect the R^2 of the first stage regression
summary(cig_s1)$r.squared

## [1] 0.4709961

9b. IV Second Stage

Store the fitted values obtained by the first stage regression in a new variable:

# store the predicted values
lcigp_pred <- cig_s1$fitted.values

Next, we run the second stage regression which gives us the TSLS estimates we seek.

# run the stage 2 regression
cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred)
coeftest(cig_s2, vcov = vcovHC)

## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  9.71988    1.70304  5.7074 7.932e-07 ***
## lcigp_pred  -1.08359    0.35563 -3.0469  0.003822 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9c. ivreg()

The function ivreg() from the package AER carries out TSLS procedure automatically. It is used similarly as lm(). Instruments can be added to the usual specification of the regression formula using a vertical bar separating the model equation from the instruments. Thus, for the regression at hand the correct formula is:

log(packs) ~ log(rprice) | salestax

# perform TSLS using 'ivreg()'
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax, data = c1995)

coeftest(cig_ivreg, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  9.71988    1.52832  6.3598 8.346e-08 ***
## log(rprice) -1.08359    0.31892 -3.3977  0.001411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note the following:

The individual regressions for each stage of TSLS using lm() leads to the same coefficient estimates as when using ivreg(). However, the standard errors reported for the second-stage regression (e.g.,coeftest() or summary()) are invalid because they do not account for the use of predictions from the first-stage regression as regressors in the second-stage regression. In contrast, ivreg() performs the necessary adjustment automatically. The manual step-by-step estimation is only included here for the purpose of demonstrating the mechanics of the procedure.
Just like in multiple regression it is important to compute heteroskedasticity-robust standard errors as we have done above using vcovHC().

9d. TSLS

The TSLS estimate for $\beta_1$ shows that an increase in cigarette prices by 1 percent reduces cigarette consumption by 1.08 percentage points, which is fairly elastic.

However, even though we used IV estimation, there still might be bias in the coefficient estimates due to omitted variables.

For example: In this model, the TSLS estimator is inconsistent for the true $\beta_1$ if the instrument (the real sales tax per pack) correlates with the error term.

This is likely to be the case since there are economic factors, like state income, which impact the demand for cigarettes and correlate with the sales tax. States with high personal income tend to generate tax revenues by income taxes and less by sales taxes. Consequently, state income should be included in the regression model.

a multiple variables regression is needed in order to try to account for this

# add rincome to the dataset
CigarettesSW$rincome <- with(CigarettesSW, income / population / cpi)

c1995 <- subset(CigarettesSW, year == "1995")
  
# estimate the model
cig_ivreg2 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + 
                    salestax, data = c1995)

coeftest(cig_ivreg2, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.43066    1.25939  7.4883 1.935e-09 ***
## log(rprice)  -1.14338    0.37230 -3.0711  0.003611 ** 
## log(rincome)  0.21452    0.31175  0.6881  0.494917    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, add cigtax to the data set:

CigarettesSW$cigtax <- with(CigarettesSW, tax/cpi)

c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cig_ivreg3 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | 
                    log(rincome) + salestax + cigtax, data = c1995)

coeftest(cig_ivreg3, vcov = vcovHC, type = "HC1")

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.89496    0.95922 10.3157 1.947e-13 ***
## log(rprice)  -1.27742    0.24961 -5.1177 6.211e-06 ***
## log(rincome)  0.28040    0.25389  1.1044    0.2753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

10. WORKFLOW

10a. console/script/output

Four panes in R Studio:

console
script editor: save you work! eg. a do file
output
environment

Two important shortcuts when using the script editor:

Cmd/Ctrl + Enter Run part of the code

Cmd/Ctrl + Shift + S : Run the entire code in the script

Script Editor

Open the script editor up by clicking the File menu, selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you’ll see four panes, one of which is the script editor.

When done writing the script, save with extension .R

Example:

Gapminder.R

10b. markdown

Combine script and output in one file with markdown

Example:

Gapminder.Rmd

Gapminder.pdf

References

An Introduction to R: Complete introduction to base R.

R for Data Science: Introduction to data analysis using R. Good for comparing to Stata.

An Economist’s Guide to Visualizing Data: Guidelines for good visualizations (not R-specific).

Introduction to Econometrics with R (online textbook)

Nick Huntington-Klein’s Econometrics Tips and Resources: A very useful page full of helpful slides, lectures, links to other resources all in R.

Program Evaluation for Public Service (Andrew Young): Using R, combines research design, causal inference, and econometric tools to measure the effects of social programs.

Programming interactive R-apps using Shiny: Useful if you want to make your methods easy to use for people not familiar with R, or want to include interactive visualizations in web-pages.

R markdown: Integrate code and output into typeset documents and slides.

RStudio Cheat Sheets: Cheatsheets for numerous packages.

Data Analysis with R

1. SETUP and BASICS

2. WORKING WITH DATA

2a. Dplyr: Filter Rows with filter

2b. Dplyr: Sort data with arrange

2c. Chaining commands with pipes

2d. Change/create variables with mutate

3. DATA VISUALIZATION

3a. Scatterplot with ggplot2

3b. Color and size aesthetics

4. APPLIED EXAMPLE I

4a. DID Visualize

4b. DID by hand

4c. DID regression

4d. DID regression with controls

5. APPLIED EXAMPLE II

5a. No controls

5b. Good controls

5c. Bad controls

5d. DAG

6. APPLIED EXAMPLE III

6a. Data: Mroz dataset

6b. Baseline OLS (no instruments)

6c. IV Strategy 1: Parents’ Education as Instruments

6d. IV Strategy 2: Husband’s Ed as an Instrument

6e. Compare OLS vs. IV Estimates

6f. What to Include vs. Not Include

6g. Additional notes and tips

6h. Fake data

7 APPLIED EXAMPLE IV

8. APPLIED EXAMPLE V

9. APPLIED EXAMPLE VI

9a. IV First Stage

9b. IV Second Stage

9c. ivreg()

9d. TSLS

10. WORKFLOW

10a. console/script/output

10b. markdown

References

2a. Dplyr: Filter Rows with `filter`

2b. Dplyr: Sort data with `arrange`

2d. Change/create variables with `mutate`

3a. Scatterplot with `ggplot2`