Set up R, RStudio, and learn how to run code in this tutorial.
Install R
Install the RStudio IDE (Integrated Development Environment for R).
Advantages of R
Open source
A wide range of packages for:
Data processing and cleaning
Data visualization
Interactive web-apps
Typesetting, writing articles and slides
The newest machine learning routines
Accomplishes the things you might be used to do doing in Stata (data processing, fitting standard models) and those you might be used to doing in Matlab (numerical programming).
Simple calculations:
# This is a comment 2 # print a number
2+3 # perform a simple calculation
## [1] 5
log(2) # natural log
## [1] 0.6931472
Numeric and string objects
x = 2 # store an object x # print this object
(x = 3) # store and print an object
## [1] 3
x = "Hello" # store a string object x
Vectors
Height = c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170) #store a vector
Height[2] # Print the second component
## [1] 177
Height[2:5] # Print the second, the 3rd, the 4th and 5th component
## [1] 177 177 177 178
(obs = 1:10) # Define a vector as a sequence (1 to 10)
## [1] 1 2 3 4 5 6 7 8 9 10
Weight = c(88, 72, 85, 52, 71, 69, 61, 61, 51, 75)
BMI = Weight/((Height/100)^2) # Performs a simple calculation using vectors
BMI
## [1] 31.17914 22.98190 27.13141 16.59804 22.40879 23.32342 22.40588 20.86112
## [9] 16.09645 25.95156
We can also describe the vector with length(), mean() and var().
length(Height)
## [1] 10
mean(Height) # Compute the sample mean var(Height)
## [1] 173.3
Matrices
M = cbind(obs,Height,Weight,BMI) # Create a matrix
typeof(M) # Give the type of the matrix
## [1] "double"
class(M) # Give the class of an object
## [1] "matrix" "array"
is.matrix(M) # Check if M is a matrix
## [1] TRUE
dim(M) # Dimensions of a matrix
## [1] 10 4
Simple plots
For simple plots, use plot.
For more advanced and attractive data visualizations, use ggplot.
plot(Height,Weight,ylab="Weight",xlab="Height")
Data frames and Tibbles
Tibbles are modernized versions of dataframes. Technically, they are
just lists of vectors with names. Tibbles are souped up versions of R
dataframes that are designed to work seamlessly with
dplyr.
library(tibble) # Load the tidyverse tibble package
mydat = as_tibble(M) # Creates a dataframe
names(mydat) # Give the names of each variable
## [1] "obs" "Height" "Weight" "BMI"
summary(mydat) # Descriptive Statistics
## obs Height Weight BMI
## Min. : 1.00 Min. :165.0 Min. :51.00 Min. :16.10
## 1st Qu.: 3.25 1st Qu.:170.2 1st Qu.:61.00 1st Qu.:21.25
## Median : 5.50 Median :174.5 Median :70.00 Median :22.70
## Mean : 5.50 Mean :173.3 Mean :68.50 Mean :22.89
## 3rd Qu.: 7.75 3rd Qu.:177.0 3rd Qu.:74.25 3rd Qu.:25.29
## Max. :10.00 Max. :178.0 Max. :88.00 Max. :31.18
Data frames and tibbles are R’s structures for storing tabular data.
Any dataset in R is usually in one of these structures.
Help syntax: You can open a help page for any object that comes
with R or with an R package. To open the help page, type a
? before the object’s name and then run the command. This
technique works for functions, packages, and more.
Reading and writing data
There are many routines for reading and writing files.
Tidyverse versions are in the readr package.
library(readr) #load the tidyverse readr package
write_csv(mydat, "my_data.csv")
mydat2=read_csv("my_data.csv")
mydat2
## # A tibble: 10 × 4
## obs Height Weight BMI
## <dbl> <dbl> <dbl> <dbl>
## 1 1 168 88 31.2
## 2 2 177 72 23.0
## 3 3 177 85 27.1
## 4 4 177 52 16.6
## 5 5 178 71 22.4
## 6 6 172 69 23.3
## 7 7 165 61 22.4
## 8 8 171 61 20.9
## 9 9 178 51 16.1
## 10 10 170 75 26.0
Core objects, tibbles, and the dplyr verbs you will use 90% of the time.
Example: Gapminder
Load both tidyverse, which contains dplyr
and ggplot2, and gapminder
library(tidyverse)
library(gapminder)
Use the command ?gapminder to view the R help file for
this dataset and read the documentation you find there
To display the gapminder dataset:
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
R shows us a helpful summary of the information contained in
gapminder.
gapminder is not a
dataframe; it’s a tibble, often abbreviated tbl.filterdplyr package provides a number of powerful but
easy-to-use tools for data manipulation in R.
Rather than displaying the whole dataset, only show the 142 rows for
which the column year has a value of 2007:
gapminder %>% filter(year == 2007) #Subset rows based on logical conditions.
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # ℹ 132 more rows
The %>% symbol is called a
pipe.
Pipes make the code very easy to understand.
The tibble gapminder is being piped into the
function filter().
The argument year == 2007 tells
filter() that it should find all the rows such that the
logical condition year == 2007 is
TRUE.
All that this command does is display a subset of
gapminder.
If we wanted to store the result of running this command, we’d need to assign it to a variable:
gapminder2007 <- gapminder %>% filter(year == 2007)
gapminder2007
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # ℹ 132 more rows
We can also use filter to subset on two or more
variables. For example, below we display data for the US in 2007:
gapminder %>% filter(year == 2007, country == 'United States')
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952.
arrangedesc().
Suppose we wanted to sort gapminder by
gdpPercap. To do this we can use the arrange
command along with the pipe %>% as follows:
gapminder %>% arrange(gdpPercap)
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Congo, Dem. Rep. Africa 2002 45.0 55379852 241.
## 2 Congo, Dem. Rep. Africa 2007 46.5 64606759 278.
## 3 Lesotho Africa 1952 42.1 748747 299.
## 4 Guinea-Bissau Africa 1952 32.5 580653 300.
## 5 Congo, Dem. Rep. Africa 1997 42.6 47798986 312.
## 6 Eritrea Africa 1952 35.9 1438760 329.
## 7 Myanmar Asia 1952 36.3 20092996 331
## 8 Lesotho Africa 1957 45.0 813338 336.
## 9 Burundi Africa 1952 39.0 2445618 339.
## 10 Eritrea Africa 1957 38.0 1542611 344.
## # ℹ 1,694 more rows
The logic is very similar to filter:
We pipe the tibble gapminder into the function
arrange().
The argument gdpPercap tells arrange()
that we want to sort by GDP per capita.
Note that by default arrange() sorts in
ascending order.
If we want to sort in descending order, we use the function
desc() as follows:
gapminder %>% arrange(desc(gdpPercap))
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1957 58.0 212846 113523.
## 2 Kuwait Asia 1972 67.7 841934 109348.
## 3 Kuwait Asia 1952 55.6 160000 108382.
## 4 Kuwait Asia 1962 60.5 358266 95458.
## 5 Kuwait Asia 1967 64.6 575003 80895.
## 6 Kuwait Asia 1977 69.3 1140357 59265.
## 7 Norway Europe 2007 80.2 4627926 49357.
## 8 Kuwait Asia 2007 77.6 2505559 47307.
## 9 Singapore Asia 2007 80.0 4553009 47143.
## 10 Norway Europe 2002 79.0 4535591 44684.
## # ℹ 1,694 more rows
The command gapminder %>% filter(year == 2007)
“pipes” the tibble gapminder into the function
filter(). What does this mean?
dplyr function
filter. We see that filter() takes something
called .data as its first argument. Moving on to the
“Arguments” section of the help file, we see that .data is
“A tbl” i.e. a tibble.
To better understand what this means, try using filter
without the pipe:
filter(gapminder, year == 2007, country == 'United States')
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952.
Notice that this gives us exactly the same result as
gapminder %>% filter(year == 2007, country == 'United States')
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952.
In other words The pipe is gives an alternative way of
supplying the first argument to a function.
Now try this with a more familiar R function: mean.
The first argument of mean is a vector x.
Use the pipe to compute the mean of some data:
x <- c(1, 5, 2, 7, 2)
x %>% mean
## [1] 3.4
The pipe supplies a function with its first argument.
If we want to specify additional arguments, we need to do so within the function call itself. For example, here’s how we could use the pipe to compute the mean after dropping missing observations:
y <- c(1, 5, NA, 7, 2)
y %>% mean(na.rm = TRUE)
## [1] 3.75
One important note about the pipe: it’s not a base R
command. Instead it’s a command provided by the package
Magrittr. This package is installed automatically along
with dplyr (if we load the tidyverse package,
which includes dplyr, the pipe is automatically
available).
Pipes are most useful when we need to chain a series of commands together.
For example, suppose we wanted to display the 1952 data from
gapminder sorted by gdpPercap in descending
order.
Using the pipe,
gapminder %>%
filter(year == 1952) %>%
arrange(desc(gdpPercap))
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1952 55.6 160000 108382.
## 2 Switzerland Europe 1952 69.6 4815000 14734.
## 3 United States Americas 1952 68.4 157553000 13990.
## 4 Canada Americas 1952 68.8 14785584 11367.
## 5 New Zealand Oceania 1952 69.4 1994794 10557.
## 6 Norway Europe 1952 72.7 3327728 10095.
## 7 Australia Oceania 1952 69.1 8691212 10040.
## 8 United Kingdom Europe 1952 69.2 50430000 9980.
## 9 Bahrain Asia 1952 50.9 120447 9867.
## 10 Denmark Europe 1952 70.8 4334000 9692.
## # ℹ 132 more rows
The first step in the chain
gapminder %>% filter(year == 1952) returns a tibble: the
subset of gapminder for which year is
1952.
The next step %>% arrange(gdpPercap) pipes this
new tibble into the function arrange(), giving us
the desired result.
In stark contrast, let’s look at the code we would have to use if we wanted to accomplish the same task without using the pipe:
arrange(filter(gapminder, year == 1952), desc(gdpPercap))
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1952 55.6 160000 108382.
## 2 Switzerland Europe 1952 69.6 4815000 14734.
## 3 United States Americas 1952 68.4 157553000 13990.
## 4 Canada Americas 1952 68.8 14785584 11367.
## 5 New Zealand Oceania 1952 69.4 1994794 10557.
## 6 Norway Europe 1952 72.7 3327728 10095.
## 7 Australia Oceania 1952 69.1 8691212 10040.
## 8 United Kingdom Europe 1952 69.2 50430000 9980.
## 9 Bahrain Asia 1952 50.9 120447 9867.
## 10 Denmark Europe 1952 70.8 4334000 9692.
## # ℹ 132 more rows
This is hard to read and not good coding practice: the commands
arrange and filter have to appear in the code
in the opposite of the order in which they are actually being
carried out. This is because parentheses are evaluated from inside
to outside.
Pipes let us write our code in a way that accords with the actual order of the steps we want to carry out.
mutateSuppose that, instead of raw population, we wanted to display
population in millions. This requires us to pop by
1000000, which we can do using the function
mutate() from dplyr as follows:
gapminder %>%
mutate(pop = pop / 1000000)
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8.43 779.
## 2 Afghanistan Asia 1957 30.3 9.24 821.
## 3 Afghanistan Asia 1962 32.0 10.3 853.
## 4 Afghanistan Asia 1967 34.0 11.5 836.
## 5 Afghanistan Asia 1972 36.1 13.1 740.
## 6 Afghanistan Asia 1977 38.4 14.9 786.
## 7 Afghanistan Asia 1982 39.9 12.9 978.
## 8 Afghanistan Asia 1987 40.8 13.9 852.
## 9 Afghanistan Asia 1992 41.7 16.3 649.
## 10 Afghanistan Asia 1997 41.8 22.2 635.
## # ℹ 1,694 more rows
Note the syntax here: within mutate() we have an
assignment statement, namely pop = pop / 1000000. This
tells R to calculate pop / 1000000 and assign the result to
pop, in place of the original variable.
We can also use mutate() to create a new variable. The
gapminder dataset doesn’t contain overall GDP, only GDP per
capita. To calculate GDP, we need to multiply gdpPercap by
pop.
gapminder %>% mutate(gdp = pop * gdpPercap)
## # A tibble: 1,704 × 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # ℹ 1,694 more rows
ggplot2we’ll use a package for statistical visualization called
ggplot2.
ggplot2 is included in the tidyverse
package, which you’ve already installed and loaded.
For more details on ggplot2 see the chapter entitled
“Data Visualisation” in R for Data Science.
Start off by constructing a subset of the gapminder
dataset that contains information from the year 2007 that we’ll use for
our plots below.
gapminder_2007 <- gapminder %>% filter(year == 2007)
The first example will be a simple scatterplot using
gapminder_2007. Each point will correspond to a single
country in 2007. Its x-coordinate will be GDP per capita and its
y-coordinate will be life expectancy.
ggplot(gapminder_2007) + geom_point(mapping = aes(x = gdpPercap, y = lifeExp))
We see that GDP per capita is a very strong predictor of life expectancy, although the relationship is non-linear.
Plotting on a log scale
In order to transform the x-axis, add a
+ scale_x_log10() to the end of our command from above:
ggplot(data = gapminder_2007) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10()
Notice how we split the code across multiple lines and ended each of
the intermediate lines with the +.
This makes things much easier to read.
To make a graph using ggplot we use the following
template:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
replacing <DATA>,
<GEOM_FUNCTION>, and <MAPPINGS> to
specify what we want to plot and how it should appear.
The first part is: we replace <DATA> with the
dataset we want to plot, for example gapminder_2007 in the
example from above. The second part is: we replace
<GEOM_FUNCTION> with the name of a function that
specifies the kind of plot we want to make.
So far we’ve only seen one example: geom_point() which
tells ggplot that we want to make a scatterplot.
Now looking at mapping = aes(<MAPPINGS>).
The abbreviation aes is short for aesthetic and
the code mapping = aes(<MAPPINGS>) defines what is
called an aesthetic mapping.
This just tells R how we want our plot to look.
The information we need to put in place of
<MAPPINGS> depends on what kind of plot we’re
making.
Thus far we’ve only examined geom_point() which produces
a scatterplot. For this kind of plot, the minimum information we need to
provide is the location of each point.For example, in our example above
we wrote aes(x = gdpPercap, y = lifeExp) to tell R that
gdpPercap gives the x-axis location of each point, and
lifeExp gives the y-axis location.
When making a scatterplot with geom_point, we can also
specify the size and color of each point using aes.
For example, let’s use the color of each point to indicate
continent
ggplot(data = gapminder_2007) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
scale_x_log10()
Notice how ggplot automatically generates a helpful
legend.
This plot makes it easy to see at a glance that the European countries in 2007 ten to have high GDP per capita and high life expectancy, while the African countries have the opposite.
In order to use the size of each point to encode information, e.g. population:
ggplot(data = gapminder_2007) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
scale_x_log10()
Question: In 1980, Kentucky raised its cap on weekly earnings that were covered by worker’s compensation. We want to know if this new policy caused workers to spend more time unemployed.
The policy was designed so that the cap increase did not affect low-earnings workers, but did affect high-earnings workers, so we use low-earnings workers as our control group and high-earnings workers as our treatment group.
Load the data and explore it.
The data is included in the wooldridge R package and is called
injury
In order to use the dataset, you need to load it with
library(wooldridge)
duration: Duration of unemployment benefits, measured in
weeks
log_duration: Log duration
after_1980: Indicator variable marking if the
observation happened before (0) or after (1) the policy change in
1980.
highearn: Indicator variable marking if the observation
is a low (0) or high (1) earner.
library(tidyverse) # ggplot(), %>%, mutate(), and friends
library(broom) # Convert models to data frames
library(scales) # Format numbers with functions like comma(), percent(), and dollar()
library(modelsummary) # Create side-by-side regression tables
library(wooldridge)
injury <- injury %>%
rename(duration = durat, log_duration = ldurat, after_1980 = afchnge)
ggplot(data = injury, aes(x = duration)) +
# binwidth = 8 makes each column represent 2 months (8 weeks)
# boundary = 0 make it so the 0-8 bar starts at 0 and isn't -4 to 4
geom_histogram(binwidth = 8, color = "white", boundary = 0) +
facet_wrap(vars(highearn))
If we use the log of duration, we can get a less skewed distribution that works better with regression models:
ggplot(data = injury, mapping = aes(x = log_duration)) +
geom_histogram(binwidth = 0.5, color = "white", boundary = 0) +
# Uncomment this line if you want to exponentiate the logged values on the
# x-axis. Instead of showing 1, 2, 3, etc., it'll show e^1, e^2, e^3, etc. and
# make the labels more human readable
# scale_x_continuous(labels = trans_format("exp", format = round)) +
facet_wrap(vars(highearn))
We should also check the distribution of unemployment before and after the policy change. Copy/paste one of the histogram chunks and change the faceting:
ggplot(data = injury, mapping = aes(x = log_duration)) +
geom_histogram(binwidth = 0.5, color = "white", boundary = 0) +
facet_wrap(vars(after_1980))
You can use a stat_summary() layer to have ggplot calculate summary statistics like averages. Here we just calculate the mean:
ggplot(injury, aes(x = factor(highearn), y = log_duration)) +
geom_point(size = 0.5, alpha = 0.2) +
stat_summary(geom = "point", fun = "mean", size = 5, color = "red") +
facet_wrap(vars(after_1980))
We can also calculate the mean and 95% confidence interval:
ggplot(injury, aes(x = factor(highearn), y = log_duration)) +
stat_summary(geom = "pointrange", size = 1, color = "red",
fun.data = "mean_se", fun.args = list(mult = 1.96)) +
facet_wrap(vars(after_1980))
We can also use group_by() and summarize() to figure out group means before sending the data to ggplot (gives more control over the data that we’re plotting):
plot_data <- injury %>%
# Make these categories instead of 0/1 numbers so they look nicer in the plot
mutate(highearn = factor(highearn, labels = c("Low earner", "High earner")),
after_1980 = factor(after_1980, labels = c("Before 1980", "After 1980"))) %>%
group_by(highearn, after_1980) %>%
summarize(mean_duration = mean(log_duration),
se_duration = sd(log_duration) / sqrt(n()),
upper = mean_duration + (1.96 * se_duration),
lower = mean_duration + (-1.96 * se_duration))
ggplot(plot_data, aes(x = highearn, y = mean_duration)) +
geom_pointrange(aes(ymin = lower, ymax = upper),
color = "darkgreen", size = 1) +
facet_wrap(vars(after_1980))
#Or, plotted in the more standard diff-in-diff format:
ggplot(plot_data, aes(x = after_1980, y = mean_duration, color = highearn)) +
geom_pointrange(aes(ymin = lower, ymax = upper), size = 1) +
# The group = highearn here makes it so the lines go across categories
geom_line(aes(group = highearn))
Use a combination of group_by() and
summarize() :
diffs <- injury %>%
group_by(after_1980, highearn) %>%
summarize(mean_duration = mean(log_duration),
# Calculate average with regular duration too
mean_duration_for_humans = mean(duration))
diffs
## # A tibble: 4 × 4
## # Groups: after_1980 [2]
## after_1980 highearn mean_duration mean_duration_for_humans
## <int> <int> <dbl> <dbl>
## 1 0 0 1.20 7.48
## 2 0 1 1.41 11.8
## 3 1 0 1.22 8.61
## 4 1 1 1.63 13.9
We can pull each of these numbers out of the table with
filter() and pull():
before_treatment <- diffs %>%
filter(after_1980 == 0, highearn == 1) %>%
pull(mean_duration)
before_control <- diffs %>%
filter(after_1980 == 0, highearn == 0) %>%
pull(mean_duration)
after_treatment <- diffs %>%
filter(after_1980 == 1, highearn == 1) %>%
pull(mean_duration)
after_control <- diffs %>%
filter(after_1980 == 1, highearn == 0) %>%
pull(mean_duration)
Now we are ready to calculate the difference-in-difference estimate:
diff_treatment_before_after <- after_treatment - before_treatment
cat("diff_treatment_before_after =", diff_treatment_before_after, "\n")
## diff_treatment_before_after = 0.2119848
diff_control_before_after <- after_control - before_control
cat("diff_control_before_after =", diff_control_before_after, "\n")
## diff_control_before_after = 0.02363507
diff_diff <- diff_treatment_before_after - diff_control_before_after
cat("diff_diff =", diff_diff, "\n")
## diff_diff = 0.1883498
The diff-in-diff estimate is 0.19, which means that the program causes
an increase in unemployment duration of 0.19 logged weeks. ie Receiving
the treatment (i.e. being a high earner after the change in policy)
causes a 19% increase in the length of unemployment.
ggplot(diffs, aes(x = as.factor(after_1980),
y = mean_duration,
color = as.factor(highearn))) +
geom_point() +
geom_line(aes(group = as.factor(highearn))) +
# If you use these lines you'll get some extra annotation lines and
# labels. The annotate() function lets you put stuff on a ggplot that's not
# part of a dataset. Normally with geom_line, geom_point, etc., you have to
# plot data that is in columns. With annotate() you can specify your own x and
# y values.
annotate(geom = "segment", x = "0", xend = "1",
y = before_treatment, yend = after_treatment - diff_diff,
linetype = "dashed", color = "grey50") +
annotate(geom = "segment", x = "1", xend = "1",
y = after_treatment, yend = after_treatment - diff_diff,
linetype = "dotted", color = "blue") +
annotate(geom = "label", x = "1", y = after_treatment - (diff_diff / 2),
label = "Program effect", size = 3)
The coefficient is the effect we care about in the end—that’s the diff-in-diff estimator.
model_small <- lm(log_duration ~ highearn + after_1980 + highearn * after_1980,
data = injury)
tidy(model_small)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.20 0.0271 44.2 0
## 2 highearn 0.215 0.0434 4.96 0.000000711
## 3 after_1980 0.0236 0.0397 0.595 0.552
## 4 highearn:after_1980 0.188 0.0628 3.00 0.00271
The coefficient for highearn:after_1980 is the same as what we found by hand, as it should be! It is statistically significant, so we can be fairly confident that it is not 0.
Now estimate an expanded version of the basic regression model with the following additional variables:
male married age
hosp (1 = hospitalized) indust (1 = manuf, 2 =
construc, 3 = other) injtype (1-8; categories for different
types of injury) lprewage (log of wage prior to filing a
claim)
Important: indust and injtype are in the
dataset as numbers (1-3 and 1-8), but they’re actually categories. We
have to tell R to treat them as categories (or factors), otherwise it’ll
assume that you can have an injury type of 3.46 or something
impossible.
#Convert industry and injury type to categories/factors
injury_fixed <- injury %>%
mutate(indust = as.factor(indust),
injtype = as.factor(injtype))
model_big <- lm(log_duration ~ highearn + after_1980 + highearn * after_1980 +
male + married + age + hosp + indust + injtype + lprewage,
data = injury_fixed)
tidy(model_big)
## # A tibble: 18 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.42 0.309 -7.82 6.15e- 15
## 2 highearn -0.327 0.0670 -4.89 1.05e- 6
## 3 after_1980 0.0585 0.0368 1.59 1.12e- 1
## 4 male -0.140 0.0387 -3.61 3.07e- 4
## 5 married 0.0255 0.0335 0.762 4.46e- 1
## 6 age 0.00779 0.00122 6.41 1.58e- 10
## 7 hosp 1.10 0.0335 32.8 2.82e-219
## 8 indust2 0.258 0.0472 5.47 4.58e- 8
## 9 indust3 0.154 0.0337 4.57 4.91e- 6
## 10 injtype2 0.883 0.133 6.62 3.83e- 11
## 11 injtype3 0.662 0.0794 8.34 9.06e- 17
## 12 injtype4 0.603 0.0856 7.05 1.98e- 12
## 13 injtype5 0.631 0.0796 7.93 2.64e- 15
## 14 injtype6 0.610 0.0803 7.60 3.43e- 14
## 15 injtype7 1.13 0.161 6.99 2.96e- 12
## 16 injtype8 0.570 0.108 5.28 1.35e- 7
## 17 lprewage 0.462 0.0569 8.12 5.70e- 16
## 18 highearn:after_1980 0.162 0.0586 2.76 5.86e- 3
Extract just the diff-in-diff estimate
diff_diff_controls <- tidy(model_big) %>%
filter(term == "highearn:after_1980") %>%
pull(estimate)
Compare results: We can put the model coefficients side-by-side to compare the value for highearn:after_1980 as we change the model.
modelsummary(list("Simple" = model_small, "Full" = model_big))
| Simple | Full | |
|---|---|---|
| (Intercept) | 1.199 | -2.416 |
| (0.027) | (0.309) | |
| highearn | 0.215 | -0.327 |
| (0.043) | (0.067) | |
| after_1980 | 0.024 | 0.058 |
| (0.040) | (0.037) | |
| highearn × after_1980 | 0.188 | 0.162 |
| (0.063) | (0.059) | |
| male | -0.140 | |
| (0.039) | ||
| married | 0.026 | |
| (0.033) | ||
| age | 0.008 | |
| (0.001) | ||
| hosp | 1.098 | |
| (0.033) | ||
| indust2 | 0.258 | |
| (0.047) | ||
| indust3 | 0.154 | |
| (0.034) | ||
| injtype2 | 0.883 | |
| (0.133) | ||
| injtype3 | 0.662 | |
| (0.079) | ||
| injtype4 | 0.603 | |
| (0.086) | ||
| injtype5 | 0.631 | |
| (0.080) | ||
| injtype6 | 0.610 | |
| (0.080) | ||
| injtype7 | 1.127 | |
| (0.161) | ||
| injtype8 | 0.570 | |
| (0.108) | ||
| lprewage | 0.462 | |
| (0.057) | ||
| Num.Obs. | 7150 | 6822 |
| R2 | 0.016 | 0.182 |
| R2 Adj. | 0.015 | 0.180 |
| AIC | 24031.1 | 21628.4 |
| BIC | 24065.5 | 21758.1 |
| Log.Lik. | -12010.556 | -10795.203 |
| F | 38.342 | 89.112 |
| RMSE | 1.30 | 1.18 |
After controlling for a host of demographic controls, the diff-in-diff estimate is smaller (0.17), indicating that the policy caused a 17% increase in the duration of weeks unemployed following a workplace injury. It is smaller because the other independent variables now explain some of the variation in log_duration.
Policy / treatment: small cash incentives (any) offered to adults in rural Malawi to encourage returning to pick up HIV test results.
Outcome: whether the participant got their HIV result (got).
We’ll estimate the causal effect of offering any incentive on picking up results. We’ll use simple linear probability models (OLS) so the code is easy to follow.
With randomized treatment, the cleanest estimate is a bare-bones regression of the outcome on the treatment (an Intent-to-Treat, ITT). Adding good pre-treatment covariates can improve precision. Adding bad covariates (post-treatment mediators or colliders) can bias the causal estimate.
Setup:
Data: thornton_hiv comes from an experiment in Malawi
looking at whether cash incentives could encourage people to learn the
results of their HIV tests.
# Install once if needed:
# install.packages(c("causaldata","dplyr","ggplot2","broom","knitr","dagitty"))
# If you want to knit to PDF but don't have LaTeX:
# install.packages("tinytex"); tinytex::install_tinytex()
library(causaldata) # thornton_hiv dataset
library(dplyr)
library(ggplot2)
library(broom)
library(knitr)
library(dagitty)
theme_classic(theme_dark(base_size = 13))
## Error in base_size/2: non-numeric argument to binary operator
Load data:
data("thornton_hiv")
df <- thornton_hiv %>%
transmute(
got, # 1 = picked up HIV result
any, # 1 = received any incentive (treatment indicator)
tinc, # amount of incentive (in Kwacha) -- varies among treated
distvct, # distance to testing center (km)
age, # age in years
hiv2004 # actual HIV status (1 = positive), measured by the study
)
glimpse(df)
## Rows: 4,820
## Columns: 6
## $ got <dbl> 1, NA, 1, NA, 1, 1, NA, NA, NA, NA, 0, 1, NA, NA, 0, NA, 0, NA…
## $ any <dbl> 1, NA, 1, NA, 1, 1, NA, NA, NA, NA, 1, 1, NA, NA, 1, NA, 1, NA…
## $ tinc <dbl> 2.08032, NA, 1.89120, NA, 0.09456, 0.94560, NA, NA, NA, NA, 0.…
## $ distvct <dbl> 2.718921, 2.835039, 2.485713, 2.835039, 1.837131, 2.217743, 2.…
## $ age <dbl> 22, 44, 19, 30, 53, 50, 18, NA, 24, 45, 21, 62, 20, 19, 47, NA…
## $ hiv2004 <dbl> 0, NA, 0, NA, 0, 0, NA, NA, NA, NA, 0, 0, NA, NA, 1, NA, 0, NA…
summary(select(df, got, any, tinc, distvct, age, hiv2004))
## got any tinc distvct
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0946 1st Qu.:1.030
## Median :1.0000 Median :1.0000 Median :0.9456 Median :1.675
## Mean :0.6966 Mean :0.7659 Mean :0.9880 Mean :2.003
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.8912 3rd Qu.:2.743
## Max. :1.0000 Max. :1.0000 Max. :2.8368 Max. :5.192
## NA's :1926 NA's :1919 NA's :1919
## age hiv2004
## Min. :11.00 Min. :-1.0000
## 1st Qu.:23.00 1st Qu.: 0.0000
## Median :32.00 Median : 0.0000
## Mean :33.65 Mean : 0.0591
## 3rd Qu.:43.00 3rd Qu.: 0.0000
## Max. :84.00 Max. : 1.0000
## NA's :441 NA's :1926
*Note: transmute() is like mutate(), but
with one key difference. mutate() adds new variables (or
modifies existing ones) while keeping all the original variables, unless
you drop them explicitly. transmute() only keeps the
variables you explicitly list/create inside it.
Variables:
any is randomly assigned
(treatment/control).
tinc is the amount of incentive among those offered
any incentive. It lies on the pathway from “offer any
incentive” → “strength of incentive” → “pickup result.”
distvct, age, and hiv2004
are pre-treatment characteristics. They may predict the
outcome but are not caused by the incentive
offer.
Visual Intuition
df %>%
group_by(any) %>%
summarise(mean_got = mean(got), n = n()) %>%
kable(digits = 3, caption = "Mean pickup rate by treatment arm")
| any | mean_got | n |
|---|---|---|
| 0 | NA | 679 |
| 1 | NA | 2222 |
| NA | NA | 1919 |
ggplot(df, aes(x = factor(any, labels=c("No incentive","Any incentive")), y = got)) +
stat_summary(fun = mean, geom = "bar") +
stat_summary(fun.data = mean_se, geom = "errorbar", width = .15) +
labs(x = NULL, y = "Share who picked up results", title = "Simple difference in means")
No Controls (the clean ITT):
Because the offer is randomized, we can estimate the following:
\[ got_i = \alpha + \beta \text{any}_i + \varepsilon_i \]
Here, \(\beta\) is the ITT: average change in pickup rate from offering any incentive.
m1 <- lm(got ~ any, data = df)
tidy(m1) %>% kable(digits = 3, caption = "Model 1: ITT without controls")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.339 | 0.017 | 19.973 | 0 |
| any | 0.451 | 0.019 | 23.469 | 0 |
Interpretation: \(\hat{\beta}\) is the difference in average
pickup rates between any incentive'' vs.no incentive.’’
With randomization, this is an unbiased estimate of the causal effect
without needing controls.
Good controls (pre-treatment covariates):
We can add variables measured before treatment that predict the outome (to tighten standard errors):
\[ got_i = \alpha + \beta \text{any}_i + \gamma_1 \text{distvct}_i + \gamma_2 \text{age}_i + \gamma_3 \text{hiv2004}_i + \varepsilon_i \]
m2 <- lm(got ~ any + distvct + age + hiv2004, data = df)
tidy(m2) %>% kable(digits = 3, caption = "Model 2: ITT with pre-treatment controls (good controls)")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.342 | 0.028 | 12.365 | 0.000 |
| any | 0.449 | 0.019 | 23.430 | 0.000 |
| distvct | -0.029 | 0.006 | -4.690 | 0.000 |
| age | 0.002 | 0.001 | 3.075 | 0.002 |
| hiv2004 | -0.043 | 0.031 | -1.380 | 0.168 |
# Compare fit/precision roughly
glance(m1) %>%
select(r.squared, adj.r.squared, sigma) %>% mutate(model = "m1") %>%
bind_rows(glance(m2) %>% select(r.squared, adj.r.squared, sigma) %>% mutate(model="m2")) %>%
kable(digits = 3, caption = "Fit/precision comparison (lower sigma = tighter residuals)")
| r.squared | adj.r.squared | sigma | model |
|---|---|---|---|
| 0.163 | 0.163 | 0.423 | m1 |
| 0.171 | 0.170 | 0.421 | m2 |
If treatment is randomized, \(\hat{\beta}\) in Model 2 should be very close to Model 1.
Controls like distance and age can explain extra variation in
got, often making the estimate more precise (Smaller
SEs).
Bad controls: conditioning on a mediator:
A common mistake is to “control for everything,” including variables that sit after treatment on the causal path. Here, tinc (the amount of incentive) is part of the treatment dose once you’ve decided to offer any incentive. If we control for tinc, we’re partially “blocking” the treatment effect we want to measure:
any \(\Rightarrow\)
tinc (dose) \(\Rightarrow\) got
Regressing got on any and tinc
estimates the ``effect of offering any incentive holding the amount
fixed,’’ which is not the ITT and is not a policy-relevant contrast
here.
m3 <- lm(got ~ any + tinc + distvct + age + hiv2004, data = df)
tidy(m3) %>% kable(digits = 3, caption = "Model 3: Over-controlling by adding a mediator (tinc) — don't do this for ITT")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.337 | 0.027 | 12.316 | 0.000 |
| any | 0.340 | 0.024 | 14.448 | 0.000 |
| tinc | 0.084 | 0.011 | 7.750 | 0.000 |
| distvct | -0.029 | 0.006 | -4.613 | 0.000 |
| age | 0.002 | 0.001 | 3.278 | 0.001 |
| hiv2004 | -0.042 | 0.031 | -1.354 | 0.176 |
What went wrong?
tinc is post-treatment (or at least part of the
treatment bundle). Including it soaks up some of the path from
any \(\Rightarrow\)
got
Your \(\hat{\beta}\) on
any will usually move toward zero, not because the program
is weaker, but because you told the empirical model to compare ``any
incentive vs. no incentive at the same amount,’’ which is nonsensical
for policy.
Rule of thumb:
Good controls: measured before treatment and help make treated vs. control more comparable (confounders/prognostics).
Bad controls: variables affected by treatment (mediators) or colliders (common effects of treatment and outcome). These can bias your causal effect.
g <- dagitty("
dag {
any [exposure]
got [outcome]
tinc
distvct
age
hiv2004
any -> tinc
tinc -> got
any -> got
distvct -> got
age -> got
hiv2004 -> got
}
")
plot(g)
Controlling for distvct, age,
hiv2004 is fine (pre-treatment).
Controlling for tinc is NOT fine if your estimand is
the ITT of offering any incentive.
Avoid controlling for colliders (a variable caused by both treatment and some unobserved factor that also affects the outcome). Conditioning on a collider opens backdoor bias. When in doubt, sketch a DAG and ask: Is this variable caused by the treatment? If yes, don’t control for it for the ITT.)
More controls can:
Help precision when they are pre-treatment predictors of the outcome.
Hurt when they are post-treatment (mediators) or colliders, or when you “oversaturate” a small sample with many noisy controls (variance inflation).
Start from your causal question and estimand (here: ITT of offering any incentive). Include only covariates that are needed to satisfy identification (pre-treatment confounders) or to improve precision, and exclude mediators/colliders.
bind_rows(
tidy(m1) %>% mutate(model="(1) ITT"),
tidy(m2) %>% mutate(model="(2) ITT + good controls"),
tidy(m3) %>% mutate(model="(3) ITT + BAD control (tinc)")
) %>%
filter(term=="any") %>%
select(model, estimate, std.error, statistic, p.value) %>%
kable(digits=3, caption = "Estimate on `any` across specifications")
| model | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (1) ITT | 0.451 | 0.019 | 23.469 | 0 |
| (2) ITT + good controls | 0.449 | 0.019 | 23.430 | 0 |
| (3) ITT + BAD control (tinc) | 0.340 | 0.024 | 14.448 | 0 |
Expect: (3) typically shows a smaller any coefficient
than (1)–(2), not because the program is weaker, but because we
controlled away part of the effect.
Randomized treatment: the cleanest regression is outcome ~ treatment.
Good controls: pre-treatment covariates (for precision).
Bad controls: mediators/colliders (can bias estimates).
Kitchen-sink OLS is not a virtue in causal work—start from a DAG and your estimand.
Dataset: thornton_hiv from the causaldata package (CRAN manual lists variables and study context).
Thornton, R. (2008). The Demand for, and Impact of, Learning HIV Status. AER 98(5): 1829–1863.
The causaldata CRAN manual documents
thornton_hiv and its variables (see the dataset listing and
description). :contentReferenceoaicite:0
For background on the Malawi incentives experiment (Thornton 2008), see the original paper. :contentReferenceoaicite:1
Research Question: How does education affect female wages?
We want to estimate how education affects female wages. The problem: education may be endogenous (people with higher ability or motivation might get more education and earn higher wages). If so, simple OLS can be biased.
Idea: Use instrumental variables (IV) — variables related to education (the “push”) but not directly related to the wage except through education.
We’ll keep the econometrics light and focus on intuition.
The problem: education may be endogenous (people with higher ability or motivation might get more education and earn higher wages). If so, simple OLS can be biased.
Idea: Use instrumental variables (IV) — variables related to education but not directly related to the wage except through education.
We will use the classic Mroz dataset from the AER package (female labor supply/wages).
Key variables (names are from the dataset):
wage: hourly wage; lwage: log wage
educ: years of schooling (our main regressor of
interest)
exper: years of labor market experience; expersq:
experience squared
Candidate instruments for educ:
motheduc, fatheduc: mother’s and father’s
education (often used as IVs for schooling)
huseduc: husband’s education (we will try it but also
discuss why it may be invalid)
Strategy
Think causally: Draw the story. What pushes education? What else pushes wages directly?
Pick instruments with a credible story for relevance and exclusion.
Add sensible controls that are not caused by education (experience, demographics).
Run OLS and IV; compare estimates and first-stage strength.
Use diagnostics (Hausman, Sargan) as clues, not final proof.
Be transparent about limitations and alternative interpretations.
# Install once if needed:
# install.packages(c("AER","dplyr","ggplot2","broom"))
# install.packages("wooldridge") # run once if not installed
library(AER)
library(wooldridge)
library(dplyr) # <- provides %>% and across()
library(ggplot2)
library(broom)
# Load Mroz dataset from wooldridge package
data("mroz", package = "wooldridge")
df <- mroz %>%
mutate(
expersq = exper^2
)
core_vars <- c("lwage","wage","educ","exper","expersq","motheduc","fatheduc","huseduc")
df_use <- df %>% filter(complete.cases(across(all_of(core_vars))))
dim(df_use)
## [1] 428 22
head(df_use %>% select(all_of(core_vars)))
## lwage wage educ exper expersq motheduc fatheduc huseduc
## 1 1.21015370 3.3540 12 14 196 12 7 12
## 2 0.32851210 1.3889 12 5 25 7 7 9
## 3 1.51413774 4.5455 12 15 225 12 7 12
## 4 0.09212332 1.0965 12 6 36 7 7 10
## 5 1.52427220 4.5918 14 7 49 12 14 12
## 6 1.55648005 4.7421 12 33 1089 14 7 11
A quick look at the data on wages and education:
ggplot(df_use, aes(x = educ, y = lwage)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Log wage vs. education (raw relationship)",
x = "Years of education",
y = "Log wage")
This positive slope mixes causal effect with selection (ability, family
background, etc.).
What should we include as controls?
We will include experience and experience squared (earnings rise with experience but at a decreasing rate).
Intuition about instruments:
Relevance: The instrument must move education (first stage).
Exclusion: The instrument must not affect wages directly, only through education.
Parents’ education plausibly encourages more schooling (relevance).
The tricky part: Do parents’ education affect adult wages other than via the daughter’s education (e.g., through networks or family background)? We can’t prove this with data; we argue it and test implications.m_ols <- lm(lwage ~ educ + exper + expersq, data = df_use)
tidy(m_ols)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.522 0.199 -2.63 8.90e- 3
## 2 educ 0.107 0.0141 7.60 1.94e-13
## 3 exper 0.0416 0.0132 3.15 1.72e- 3
## 4 expersq -0.000811 0.000393 -2.06 3.97e- 2
The coefficient on educ is the average percentage change
in wages associated with an extra year of schooling, holding
exper fixed.
Instruments: motheduc and fatheduc for
educ.
Relevance: Parents’ education likely correlates with the child’s education (first stage).
Exclusion (debatable): Parents’ education should not affect the daughter’s wage except through the daughter’s education. This is an assumption; violations are possible (e.g., networks, values, non-pecuniary support). We can check implications (weak instruments tests, overidentification tests), but ultimately it’s a judgment call.
First stage intuition
ggplot(df_use, aes(x = (motheduc + fatheduc)/2, y = educ)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "First-stage intuition: parental education vs. daughter's education",
x = "Average of mother & father education",
y = "Daughter's education")
IV estimation
m_iv_parents <- ivreg(
lwage ~ educ + exper + expersq |
motheduc + fatheduc + exper + expersq,
data = df_use
)
summary(m_iv_parents, diagnostics = TRUE)
##
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | motheduc + fatheduc +
## exper + expersq, data = df_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0986 -0.3196 0.0551 0.3689 2.3493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0481003 0.4003281 0.120 0.90442
## educ 0.0613966 0.0314367 1.953 0.05147 .
## exper 0.0441704 0.0134325 3.288 0.00109 **
## expersq -0.0008990 0.0004017 -2.238 0.02574 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 2 423 55.400 <2e-16 ***
## Wu-Hausman 1 423 2.793 0.0954 .
## Sargan 1 NA 0.378 0.5386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6747 on 424 degrees of freedom
## Multiple R-Squared: 0.1357, Adjusted R-squared: 0.1296
## Wald test: 8.141 on 3 and 424 DF, p-value: 2.787e-05
In the IV formula y ~ x1 + x2 | z1 + z2, the right
side before the bar lists regressors in the wage equation; the right
side after the bar lists instruments and exogenous controls.
Here, educ is treated as endogenous;
exper and expersq are included on both sides
as exogenous controls (not instrumented).
Reading the diagnostics
Looking at the Diagnostic tests section in the above
output:
1. First-stage strength (look for a decent F-statistic on the instruments in the first-stage diagnostics; a common rule of thumb is F > 10).
If first-stage F is small → instruments are weak → IV estimates can be noisy or misleading.
As a rule of thumb, if the F-statistic is greater than 10, instruments are considered “not weak.”
So 55.4 (which is the robust version of the F-statistic in the above output) indicates that that the instrument is strong (ie parental education strongly predicts women’s education, so the relevance condition is satisfied).
2. Wu-Hausman is a test for the endogeneity of
educ. Tests whether OLS and IV estimates differ
significantly.
look at the p-value (0.095). If <0.05, conclude endogeneity matters (OLS is biased). Here, borderline → some endogeneity but not overwhelming.
If Wu–Hausman rejects → OLS and IV differ → suggests endogeneity matters.
p-value = 0.095 → not significant at 5%, but borderline at 10%.
Interpretation: There’s some evidence (weak) that education is endogenous. OLS may be biased, but it’s not conclusive.
Note that the null hypothesis for the Wu-Hausman test is
education is exogenous (i.e. OLS is fine) and the
alternative is
education is endogenous (i.e. need IV)
3. Sargan test (overidentification test): We have 2 instruments (mother’s and father’s education) for 1 endogenous regressor (educ). That leaves 1 overidentifying restriction you can test.
Null hypothesis: all instruments are valid (they affect wages only through education).
p-value = 0.54 → you fail to reject → no evidence that the instruments are invalid.
Interpretation: Parents’ education passes this statistical test. But remember: Sargan can’t “prove” validity, it only checks consistency across instruments.
Overall conclusion: Instruments are strong (F = 55). There’s some suggestion of endogeneity (p ≈ 0.10). No sign the instruments are invalid (Sargan p = 0.54). So IV is reasonable, but interpretation still relies on believing the exclusion restriction (that parents’ education only matters for wages through the woman’s education).
Try using husband’s education huseduc as an instrument
for the woman’s education.
m_iv_hus <- ivreg(
lwage ~ educ + exper + expersq |
huseduc + exper + expersq,
data = df_use
)
summary(m_iv_hus, diagnostics = TRUE)
##
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | huseduc + exper +
## expersq, data = df_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.07677 -0.32148 0.03525 0.37605 2.36256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2980987 0.3099189 -0.962 0.336668
## educ 0.0893851 0.0238705 3.745 0.000206 ***
## exper 0.0425893 0.0132451 3.215 0.001402 **
## expersq -0.0008457 0.0003957 -2.137 0.033155 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 1 424 230.900 <2e-16 ***
## Wu-Hausman 1 423 0.892 0.346
## Sargan 0 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6677 on 424 degrees of freedom
## Multiple R-Squared: 0.1536, Adjusted R-squared: 0.1476
## Wald test: 11.69 on 3 and 424 DF, p-value: 2.26e-07
…..a “strong-looking” instrument can still be invalid if the exclusion restriction is not credible.
OLS estimate: ~0.107 (about 10.7% higher wages per extra year of schooling).
IV with parents’ education: ~0.061 (much lower, imprecise).
IV with husband’s education: ~0.089 (between the two, closer to OLS, but still smaller).
So compared to OLS, the husband-IV estimate is lower than OLS but higher than the parents-IV estimate.
tab_ols <- tidy(m_ols) %>% mutate(model = "OLS")
tab_iv1 <- tidy(m_iv_parents) %>% mutate(model = "IV: Parents' educ")
tab_iv2 <- tidy(m_iv_hus) %>% mutate(model = "IV: Husband's educ")
combined <- bind_rows(tab_ols, tab_iv1, tab_iv2) %>%
filter(term == "educ") %>%
select(model, term, estimate, std.error, statistic, p.value)
# Print
# results_table
# (Optional) Pretty table when knitting
knitr::kable(combined, digits = 3, align = "lrrrrrl",
caption = "Education Coefficient and IV Diagnostics (Side-by-Side)")
| model | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| OLS | educ | 0.107 | 0.014 | 7.598 | 0.000 |
| IV: Parents’ educ | educ | 0.061 | 0.031 | 1.953 | 0.051 |
| IV: Husband’s educ | educ | 0.089 | 0.024 | 3.745 | 0.000 |
Why “strong” instruments aren’t automatically “good” instruments.
Let’s unpack why the husband’s education (huseduc) instrument gives a higher return to schooling than OLS.
Why might the husband’s education IV push the coefficient upward relative to the parents-IV case?
Exclusion restriction problem: Husband’s education may be correlated with the wife’s wages for reasons other than her schooling:
Assortative mating: highly educated women tend to marry highly educated men. But beyond schooling, husband’s education also correlates with:
- Household income (which can relax constraints, allow wife to choose higher-paying but riskier jobs).
- Social networks, labor market connections, location choices.
- Even “quality” of match (ambition, preferences).These channels directly affect the wife’s wages independent of her education. If so, the IV estimate picks up not just the causal effect of the woman’s own schooling, but also these extra channels that bias the estimate upward.
Assortative mating: high-earning women may marry high-earning men; husband’s education may proxy for social networks or local labor market conditions that directly affect the woman’s wage.
If so, huseduc affects wages not only through the woman’s schooling → invalid instrument.
ie a “strong-looking” instrument can still be invalid if the exclusion restriction is not credible.
A bit more intuition in “IV logic” terms
Remember IV estimates are LATEs (local average treatment effects): they reflect the causal effect of education for the “compliers” — people whose schooling is shifted by the instrument.
Parents’ education instruments → compliers are daughters whose education levels are influenced by their parents’ background. This group may be more marginal in the labor market, so the estimated returns to education could be smaller.
Husband’s education instrument → compliers are women whose education choices are “aligned” with the kind of husband they eventually marry. This is a very selected group — likely higher-ability or higher-income households — where the return to education looks higher.
So part of the difference is a different complier population, and part is the invalidity of the instrument (husband’s education directly affecting wages).
In summary
OLS mixes up the effect of education with ability and background.
IV with parents’ education tries to strip that out; it gives a lower return, but it might be closer to the “true” causal effect. Parents’ education is a common (imperfect) candidate: it often passes relevance; exclusion is arguable and must be discussed.
The “husband-as-IV” example illustrates that even if the instrument is relevant, it might fail exclusion (economically implausible) — treat with caution.
IV with husband’s education looks statistically strong (high F-statistic), but it’s more problematic than parental education: husband’s schooling affects wife’s wages through many other channels. That contamination pushes the coefficient up, making it look larger than the parents-IV estimate.
A strong first stage (big F-statistic) isn’t enough. If the exclusion restriction is not believable, the IV estimate can be biased — sometimes more biased than OLS.
Include reasonable controls that are not caused by education and help isolate the effect:
Experience and experience squared (learning-by-doing patterns)
Age (sometimes), region, cohort if they are not themselves outcomes of education decisions in your setup.
Avoid “bad controls” — variables affected by education:
Current occupation, current firm/industry, job tenure if education helped you get that job.
These can “soak up” the very pathway through which education raises wages, biasing the education coefficient downward.
Multiple instruments:
If you have more instruments than endogenous regressors, you can test “overidentifying restrictions,” but remember tests are imperfect; it’s still about economic reasoning.Consider: What if we (probably inappropriately) add a potential “bad control” like occupation (if available)? This is just to demonstrate how estimates can shift; it does not validate the model.
m_ols <- lm(lwage ~ educ + exper + expersq, data = df_use)
summary(m_ols)$coef["educ",]
## Estimate Std. Error t value Pr(>|t|)
## 1.074896e-01 1.414648e-02 7.598332e+00 1.939931e-13
m_iv_f <- ivreg(lwage ~ educ + exper + expersq |
fatheduc + exper + expersq, data = df_use)
summary(m_iv_f, diagnostics = TRUE)
##
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | fatheduc + exper +
## expersq, data = df_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.09170 -0.32776 0.05006 0.37365 2.35346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0611169 0.4364461 -0.140 0.88870
## educ 0.0702263 0.0344427 2.039 0.04208 *
## exper 0.0436716 0.0134001 3.259 0.00121 **
## expersq -0.0008822 0.0004009 -2.200 0.02832 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 1 424 87.741 <2e-16 ***
## Wu-Hausman 1 423 1.437 0.231
## Sargan 0 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6719 on 424 degrees of freedom
## Multiple R-Squared: 0.143, Adjusted R-squared: 0.137
## Wald test: 8.314 on 3 and 424 DF, p-value: 2.201e-05
m_iv_parents <- ivreg(lwage ~ educ + exper + expersq |
motheduc + fatheduc + exper + expersq, data = df_use)
summary(m_iv_parents, diagnostics = TRUE)
##
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | motheduc + fatheduc +
## exper + expersq, data = df_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0986 -0.3196 0.0551 0.3689 2.3493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0481003 0.4003281 0.120 0.90442
## educ 0.0613966 0.0314367 1.953 0.05147 .
## exper 0.0441704 0.0134325 3.288 0.00109 **
## expersq -0.0008990 0.0004017 -2.238 0.02574 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 2 423 55.400 <2e-16 ***
## Wu-Hausman 1 423 2.793 0.0954 .
## Sargan 1 NA 0.378 0.5386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6747 on 424 degrees of freedom
## Multiple R-Squared: 0.1357, Adjusted R-squared: 0.1296
## Wald test: 8.141 on 3 and 424 DF, p-value: 2.787e-05
m_iv_fh <- ivreg(lwage ~ educ + exper + expersq |
fatheduc + huseduc + exper + expersq, data = df_use)
summary(m_iv_fh, diagnostics = TRUE)
##
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | fatheduc + huseduc +
## exper + expersq, data = df_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0805 -0.3195 0.0319 0.3732 2.3603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2385882 0.2908234 -0.820 0.412456
## educ 0.0845739 0.0222416 3.803 0.000164 ***
## exper 0.0428611 0.0132513 3.234 0.001314 **
## expersq -0.0008548 0.0003958 -2.160 0.031353 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 2 423 145.189 <2e-16 ***
## Wu-Hausman 1 423 1.805 0.18
## Sargan 1 NA 0.306 0.58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6685 on 424 degrees of freedom
## Multiple R-Squared: 0.1516, Adjusted R-squared: 0.1456
## Wald test: 11.82 on 3 and 424 DF, p-value: 1.898e-07
m_iv_all <- ivreg(lwage ~ educ + exper + expersq |
motheduc + fatheduc + huseduc + exper + expersq, data = df_use)
summary(m_iv_all, diagnostics = TRUE)
##
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | motheduc + fatheduc +
## huseduc + exper + expersq, data = df_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.08378 -0.32135 0.03538 0.36934 2.35829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1868572 0.2853959 -0.655 0.512997
## educ 0.0803918 0.0217740 3.692 0.000251 ***
## exper 0.0430973 0.0132649 3.249 0.001250 **
## expersq -0.0008628 0.0003962 -2.178 0.029976 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 3 422 104.294 <2e-16 ***
## Wu-Hausman 1 423 2.732 0.0991 .
## Sargan 2 NA 1.115 0.5726
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6693 on 424 degrees of freedom
## Multiple R-Squared: 0.1495, Adjusted R-squared: 0.1435
## Wald test: 11.52 on 3 and 424 DF, p-value: 2.817e-07
# Helper to extract the educ coefficient and IV diagnostics
extract_row <- function(model, label, note) {
is_iv <- inherits(model, "ivreg")
if (is_iv) {
sm <- summary(model, diagnostics = TRUE)
coefs <- sm$coefficients
diags <- as.data.frame(sm$diagnostics)
# Pull diagnostics safely (may be NA for just-identified models)
weak_F <- if ("Weak instruments" %in% rownames(diags)) diags["Weak instruments", "statistic"] else NA_real_
wu_p <- if ("Wu-Hausman" %in% rownames(diags)) diags["Wu-Hausman", "p-value"] else NA_real_
sargan_p <- if ("Sargan" %in% rownames(diags)) diags["Sargan", "p-value"] else NA_real_
} else {
sm <- summary(model)
coefs <- sm$coefficients
weak_F <- NA_real_
wu_p <- NA_real_
sargan_p <- NA_real_
}
est <- coefs["educ", "Estimate"]
se <- coefs["educ", "Std. Error"]
data.frame(
Model = label,
Estimate = est,
`Std. Err.` = se,
`Weak F` = weak_F,
`Wu–Hausman p` = wu_p,
`Sargan p` = sargan_p,
Notes = note,
check.names = FALSE
)
}
# Build the table exactly like shown
results_table <- rbind(
extract_row(m_ols, "OLS",
"Raw association; possibly biased"),
extract_row(m_iv_f, "IV: father only",
"Just identified; weaker return than OLS"),
extract_row(m_iv_parents, "IV: mother + father",
"Strong instruments, consistent; lower than OLS"),
extract_row(m_iv_fh, "IV: father + husband",
"Strong instruments; estimate pulled upward by husband’s education"),
extract_row(m_iv_all, "IV: mother + father + husband",
"Very strong; estimate between parents-only and with husband")
)
# Round to match the style in the example
results_table$Estimate <- round(results_table$Estimate, 3)
results_table$`Std. Err.` <- round(results_table$`Std. Err.`, 3)
results_table$`Weak F` <- ifelse(is.na(results_table$`Weak F`), NA, round(results_table$`Weak F`, 1))
results_table$`Wu–Hausman p`<- ifelse(is.na(results_table$`Wu–Hausman p`), NA, signif(results_table$`Wu–Hausman p`, 3))
results_table$`Sargan p` <- ifelse(is.na(results_table$`Sargan p`), NA, signif(results_table$`Sargan p`, 3))
# Print
results_table
## Model Estimate Std. Err. Weak F Wu–Hausman p Sargan p
## 1 OLS 0.107 0.014 NA NA NA
## 2 IV: father only 0.070 0.034 87.7 0.2310 NA
## 3 IV: mother + father 0.061 0.031 55.4 0.0954 0.539
## 4 IV: father + husband 0.085 0.022 145.2 0.1800 0.580
## 5 IV: mother + father + husband 0.080 0.022 104.3 0.0991 0.573
## Notes
## 1 Raw association; possibly biased
## 2 Just identified; weaker return than OLS
## 3 Strong instruments, consistent; lower than OLS
## 4 Strong instruments; estimate pulled upward by husband’s education
## 5 Very strong; estimate between parents-only and with husband
# (Optional) Pretty table when knitting
knitr::kable(results_table, digits = 3, align = "lrrrrrl",
caption = "Education Coefficient and IV Diagnostics (Side-by-Side)")
| Model | Estimate | Std. Err. | Weak F | Wu–Hausman p | Sargan p | Notes |
|---|---|---|---|---|---|---|
| OLS | 0.107 | 0.014 | NA | NA | NA | Raw association; possibly biased |
| IV: father only | 0.070 | 0.034 | 87.7 | 0.231 | NA | Just identified; weaker return than OLS |
| IV: mother + father | 0.061 | 0.031 | 55.4 | 0.095 | 0.539 | Strong instruments, consistent; lower than OLS |
| IV: father + husband | 0.085 | 0.022 | 145.2 | 0.180 | 0.580 | Strong instruments; estimate pulled upward by husband’s education |
| IV: mother + father + husband | 0.080 | 0.022 | 104.3 | 0.099 | 0.573 | Very strong; estimate between parents-only and with husband |
1. Identification with instruments
If you have 1 endogenous regressor (say, education) and exactly 1 instrument (say, mother’s education), then the model is just identified. You can estimate the causal effect, but you cannot test the instrument’s validity — because there’s no redundancy.
If you have 1 endogenous regressor (education) and 2 instruments (say, mother’s and father’s education), the model is overidentified. Both instruments are supposed to point to the same true effect. If they don’t, that tells you something is off (at least one instrument may be invalid).
2. Overidentifying restrictions test (Sargan/Hansen test)
If all instruments are valid, then it shouldn’t matter which one you use — the IV estimates should line up.
With >1 instrument, you can statistically check whether the extra instruments are “telling the same story.”
This is the Sargan test (or Hansen’s J-test, in GMM settings).
Null hypothesis: All instruments are valid (they only affect wages through education).
Alternative: At least one instrument is invalid (affects wages directly).
A low p-value (< 0.05) → reject validity → red flag about instruments.
A high p-value → no evidence against validity (but not proof they are valid — tests can miss subtle violations).
3. In this example:
Parents’ education IV (motheduc + fatheduc):
Overidentified: 2 instruments for 1 endogenous regressor.
Sargan p-value = 0.54.
Interpretation: The two instruments give consistent answers; no evidence they are invalid.
Husband’s education IV (huseduc only):
Just identified: 1 instrument for 1 endogenous regressor.
No overidentifying restriction possible → test is NA in the output.
Interpretation: You can’t check exclusion validity with the data; you have to argue it based on economic reasoning.
4. Key caution
Even if the test doesn’t reject, it doesn’t prove the instruments are valid — it just means they’re not contradicting each other in the sample. For example, if both instruments are invalid in the same way, the test won’t catch it. That’s why you always combine the test with economic reasoning
# install.packages(c("AER","broom","dplyr","knitr","kableExtra")) # run once if needed
library(AER)
library(broom)
library(dplyr)
library(knitr)
library(kableExtra)
set.seed(123)
N <- 5000
beta <- 0.5
gamma <- 1.0 # makes z2 invalid (direct effect on y)
pi1 <- 1.0
pi2 <- 1.0
z1 <- rnorm(N)
z2 <- rnorm(N)
v <- rnorm(N)
e <- rnorm(N)
x <- pi1*z1 + pi2*z2 + v
y <- beta*x + gamma*z2 + e
dat <- data.frame(y, x, z1, z2)
head(dat)
## y x z1 z2
## 1 -1.1899849 1.3160757 -0.56047565 -0.4941739
## 2 0.9135182 0.7306040 -0.23017749 1.1275935
## 3 -1.3386337 1.3387202 1.55870831 -1.1469495
## 4 2.9453846 0.9833752 0.07050839 1.4810186
## 5 2.1706215 1.2705690 0.12928774 0.9161912
## 6 3.3116676 3.1821819 1.71506499 0.3351310
m_ols <- lm(y ~ x, data = dat)
m_iv_z1 <- ivreg(y ~ x | z1, data = dat) # just-identified (valid)
m_iv_z2 <- ivreg(y ~ x | z2, data = dat) # just-identified (invalid)
m_iv_both <- ivreg(y ~ x | z1 + z2, data = dat) # overidentified (will run Sargan)
extract_row <- function(model, label, note) {
is_iv <- inherits(model, "ivreg")
if (is_iv) {
sm <- summary(model, diagnostics = TRUE)
co <- sm$coefficients
di <- as.data.frame(sm$diagnostics)
weakF <- if ("Weak instruments" %in% rownames(di)) di["Weak instruments","statistic"] else NA_real_
wu_p <- if ("Wu-Hausman" %in% rownames(di)) di["Wu-Hausman","p-value"] else NA_real_
sargan_p<- if ("Sargan" %in% rownames(di)) di["Sargan","p-value"] else NA_real_
} else {
sm <- summary(model)
co <- sm$coefficients
weakF <- wu_p <- sargan_p <- NA_real_
}
data.frame(
Model = label,
Estimate = co["x","Estimate"],
`Std. Err.` = co["x","Std. Error"],
`Weak F` = weakF,
`Wu–Hausman p` = wu_p,
`Sargan p` = sargan_p,
Notes = note,
check.names = FALSE
)
}
tab <- bind_rows(
extract_row(m_ols, "OLS", "Raw association; biased (x endogenous)"),
extract_row(m_iv_z1, "IV: z1 only (valid)", "Strong & just-identified; credible"),
extract_row(m_iv_z2, "IV: z2 only (invalid)", "Strong but invalid (exclusion fails)"),
extract_row(m_iv_both, "IV: z1 + z2 (overid.)", "Overidentified; Sargan should reject")
) %>%
mutate(
Estimate = round(Estimate, 3),
`Std. Err.` = round(`Std. Err.`, 3),
`Weak F` = ifelse(is.na(`Weak F`), "", round(`Weak F`, 1)),
`Wu–Hausman p` = ifelse(is.na(`Wu–Hausman p`), "", signif(`Wu–Hausman p`, 3)),
`Sargan p` = ifelse(is.na(`Sargan p`), "", signif(`Sargan p`, 3))
)
kable(tab, caption = "Side-by-side results", align = "lrrrrrl", booktabs = TRUE) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover")) %>%
column_spec(1, bold = TRUE) %>%
add_header_above(c(" " = 1, "Effect of x on y" = 2, "Diagnostics" = 3, " " = 1))
| Model | Estimate | Std. Err. | Weak F | Wu–Hausman p | Sargan p | Notes |
|---|---|---|---|---|---|---|
| OLS | 0.828 | 0.011 | Raw association; biased (x endogenous) | |||
| IV: z1 only (valid) | 0.488 | 0.020 | 2449.9 | 2.03e-116 | Strong & just-identified; credible | |
| IV: z2 only (invalid) | 1.510 | 0.025 | 2496.9 | 0 | Strong but invalid (exclusion fails) | |
| IV: z1 + z2 (overid.) | 1.002 | 0.013 | 4983.4 | 5.71e-125 | 6.50999999984885e-315 | Overidentified; Sargan should reject |
What is the effect on test scores of reducing class size in early grades?
Project STAR (Student Teacher Achievement RAtio): Tenessee class size reduction experiment
Comparison between three different class arrangements for kindergarten through third grade:
regular class size (22-25 students per class and a single teacher with no aides)
small class size (13-17 students per class and no aide)
regular-sized class plus a teacher’s aide
\[ Y_i = \beta_0 + \beta_1 \text{SmallClass}_i + \beta_2 \text{RegAide}_i + u_i \]
\(Y_i\): test score
\(\text{SmallClass}_i\): =1 if student i is in a small class,=0 otherwise
\(\text{RegAide}_i\): =1 if student i is in a regular class with aide,=0 otherwise
\(\beta_1\): the effect of a small class, relative to a regular class
\(\beta_2\): the effect of a regular class with an aide, relative to a regular class
## extra for display:
if(!requireNamespace("lattice")) {
if(interactive() || is.na(Sys.getenv("_R_CHECK_PACKAGE_NAME_", NA))) {
stop("not all packages required for the example are installed")
} else q() }
library(AER)
data("STAR")
## Stock and Watson, p. 488
fmk <- lm(I(readk + mathk) ~ stark, data = STAR)
fm1 <- lm(I(read1 + math1) ~ star1, data = STAR)
fm2 <- lm(I(read2 + math2) ~ star2, data = STAR)
fm3 <- lm(I(read3 + math3) ~ star3, data = STAR)
library(stargazer)
stargazer(fmk, fm1, fm2, fm3,
type = "text", # change to "latex" or "html" if needed
title = "Effect of Small Classes on Test Scores",
column.labels = c("Kindergarten", "Grade 1", "Grade 2", "Grade 3"),
dep.var.labels = "Reading + Math Score",
covariate.labels = c("Small class", "Regular+aide class"),
omit.stat = c("f", "ser")) # cleaner output
##
## Effect of Small Classes on Test Scores
## ==========================================================================================
## Dependent variable:
## -----------------------------------------------------------------------
## Reading + Math Score I(read1 + math1) I(read2 + math2) I(read3 + math3)
## Kindergarten Grade 1 Grade 2 Grade 3
## (1) (2) (3) (4)
## ------------------------------------------------------------------------------------------
## Small class 13.899***
## (2.409)
##
## Regular+aide class 0.314
## (2.310)
##
## star1small 29.781***
## (2.807)
##
## star1regular+aide 11.959***
## (2.686)
##
## star2small 19.394***
## (2.710)
##
## star2regular+aide 3.479
## (2.566)
##
## star3small 15.587***
## (2.395)
##
## star3regular+aide -0.291
## (2.302)
##
## Constant 918.043*** 1,039.393*** 1,157.807*** 1,228.506***
## (1.641) (1.836) (1.849) (1.715)
##
## ------------------------------------------------------------------------------------------
## Observations 5,786 6,379 6,049 5,967
## R2 0.007 0.017 0.009 0.010
## Adjusted R2 0.007 0.017 0.009 0.010
## ==========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
coeftest(fm3, vcov = sandwich)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1228.50636 1.67959 731.4322 < 2.2e-16 ***
## star3small 15.58660 2.39544 6.5068 8.303e-11 ***
## star3regular+aide -0.29094 2.27214 -0.1280 0.8981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(I(read3 + math3) ~ star3, data = STAR)
These estimates suggest that reducing class size has an effect on test performance, but adding an aide to a regular-sized class has a much smaller effect, possibly zero.
Should we include additional regressors?
Augmenting the regressions with additional regressors can provide more efficient estimates of the causal effects.
Morever, if the treatment receivced is not random because of failures to follow the treatment protocol, then the estimates of the experimental effects based on regressions with additional regressors could differ from the difference estimates reported above.
For these reasons, we now also look at the results with additional regressors that measure teacher, school and students characteristics
## Stock and Watson, p. 489
addreg_A <- lm(I(readk + mathk) ~ stark + experiencek, data = STAR)
addreg_B <- lm(I(readk + mathk) ~ stark + experiencek + schoolidk, data = STAR)
addreg_C <- lm(I(readk + mathk) ~ stark + experiencek + gender + lunchk + ethnicity + schoolidk, data = STAR)
stargazer(addreg_A, addreg_B, addreg_C,
type = "text", # change to "latex" or "html" if needed
title = "Effect of Small Classes on Test Scores",
column.labels = c("A", "B", "C"),
dep.var.labels = "Reading + Math Score",
#covariate.labels = c("Small class", "Regular+aide class"),
omit.stat = c("f", "ser")) # cleaner output
##
## Effect of Small Classes on Test Scores
## ===================================================
## Dependent variable:
## ---------------------------------
## Reading + Math Score
## A B C
## (1) (2) (3)
## ---------------------------------------------------
## starksmall 14.006*** 15.933*** 15.918***
## (2.395) (2.167) (2.092)
##
## starkregular+aide -0.601 1.215 1.808
## (2.306) (2.093) (2.019)
##
## experiencek 1.469*** 0.743*** 0.665***
## (0.167) (0.168) (0.162)
##
## genderfemale 12.106***
## (1.667)
##
## lunchkfree -34.636***
## (2.020)
##
## ethnicityafam -25.426***
## (3.584)
##
## ethnicityasian 0.184
## (16.972)
##
## ethnicityhispanic 3.225
## (31.648)
##
## ethnicityamindian -34.973
## (44.676)
##
## ethnicityother -24.933
## (23.935)
##
## schoolidk2 -81.716*** -57.496***
## (12.077) (11.702)
##
## schoolidk3 -7.175 -7.679
## (10.446) (10.075)
##
## schoolidk4 -44.735*** -51.680***
## (11.731) (11.309)
##
## schoolidk5 -48.425*** -45.966***
## (11.664) (11.246)
##
## schoolidk6 -38.441*** -33.491***
## (11.949) (11.569)
##
## schoolidk7 22.672** 26.266***
## (10.036) (9.669)
##
## schoolidk8 -34.505*** -35.777***
## (10.332) (9.973)
##
## schoolidk9 4.032 2.870
## (9.979) (9.635)
##
## schoolidk10 38.558*** 32.084***
## (12.137) (11.695)
##
## schoolidk11 57.981*** 59.025***
## (11.833) (11.397)
##
## schoolidk12 1.828 11.171
## (11.852) (11.463)
##
## schoolidk13 36.435*** 37.072***
## (11.733) (11.312)
##
## schoolidk14 -32.964** 8.688
## (13.788) (13.628)
##
## schoolidk15 -51.949*** -9.525
## (11.893) (11.861)
##
## schoolidk16 -76.829*** -31.285***
## (10.241) (10.360)
##
## schoolidk17 -18.900* -13.323
## (11.142) (10.798)
##
## schoolidk18 -51.424*** -24.005**
## (10.853) (10.681)
##
## schoolidk19 -29.031*** 15.643
## (10.413) (10.497)
##
## schoolidk20 -24.413** 2.657
## (10.896) (10.913)
##
## schoolidk21 16.309 32.475***
## (10.998) (10.765)
##
## schoolidk22 -16.033 27.355***
## (9.851) (9.975)
##
## schoolidk23 31.210*** 52.800***
## (11.059) (10.956)
##
## schoolidk24 -39.174*** -14.848
## (11.474) (11.440)
##
## schoolidk25 -33.916*** -11.968
## (11.887) (11.513)
##
## schoolidk26 -65.901*** -26.378**
## (12.279) (12.213)
##
## schoolidk27 20.344** 60.777***
## (9.786) (9.920)
##
## schoolidk28 -43.699*** -1.908
## (9.847) (9.976)
##
## schoolidk29 -19.793* 23.665**
## (12.007) (12.027)
##
## schoolidk30 73.697*** 114.765***
## (11.734) (11.705)
##
## schoolidk31 5.703 51.576***
## (12.221) (12.178)
##
## schoolidk32 -76.597*** -31.771***
## (10.258) (10.357)
##
## schoolidk33 -74.895*** -29.645***
## (10.448) (10.532)
##
## schoolidk34 -43.062*** -47.958***
## (11.216) (10.814)
##
## schoolidk35 -46.238*** -41.571***
## (11.616) (11.195)
##
## schoolidk36 -13.686 -21.777**
## (11.490) (11.080)
##
## schoolidk37 -12.351 -7.870
## (10.548) (10.170)
##
## schoolidk38 -10.610 -2.724
## (12.669) (12.242)
##
## schoolidk39 -23.309** -13.302
## (11.648) (11.270)
##
## schoolidk40 17.470 30.033***
## (11.348) (11.005)
##
## schoolidk41 51.231*** 39.230***
## (11.369) (10.978)
##
## schoolidk42 -2.847 -9.913
## (12.137) (11.749)
##
## schoolidk43 -16.442 -25.884**
## (11.292) (10.892)
##
## schoolidk44 6.826 43.162***
## (11.029) (10.898)
##
## schoolidk45 -105.064*** -58.205***
## (11.611) (11.605)
##
## schoolidk46 -17.011 -20.996*
## (12.018) (11.577)
##
## schoolidk47 -18.214 -22.803**
## (11.771) (11.459)
##
## schoolidk48 -4.983 0.816
## (11.440) (11.041)
##
## schoolidk49 9.272 12.195
## (11.769) (11.339)
##
## schoolidk50 -3.353 1.766
## (11.102) (10.751)
##
## schoolidk51 11.627 7.057
## (9.766) (9.423)
##
## schoolidk52 30.792** 17.722
## (12.378) (11.941)
##
## schoolidk53 -66.974*** -50.533***
## (12.014) (11.621)
##
## schoolidk54 -0.463 -6.483
## (12.007) (11.568)
##
## schoolidk55 -35.635*** -37.789***
## (10.970) (10.573)
##
## schoolidk56 -90.780*** -82.116***
## (10.408) (10.050)
##
## schoolidk57 -43.141*** -45.715***
## (11.592) (11.166)
##
## schoolidk58 19.694* 8.719
## (10.450) (10.081)
##
## schoolidk59 10.294 8.540
## (11.837) (11.460)
##
## schoolidk60 -40.156*** -32.861***
## (11.341) (10.936)
##
## schoolidk61 -28.758*** -36.711***
## (11.150) (10.750)
##
## schoolidk62 -36.164*** -33.598***
## (12.136) (11.718)
##
## schoolidk63 18.475* 19.295**
## (10.105) (9.736)
##
## schoolidk64 -24.511** -22.818**
## (10.604) (10.260)
##
## schoolidk65 17.476 16.041
## (12.948) (12.478)
##
## schoolidk66 -36.682*** -39.517***
## (10.559) (10.212)
##
## schoolidk67 -13.550 -10.331
## (13.157) (12.717)
##
## schoolidk68 26.870*** 38.880***
## (10.410) (10.045)
##
## schoolidk69 24.625** 18.973*
## (11.885) (11.460)
##
## schoolidk70 -20.916** -15.221
## (10.509) (10.155)
##
## schoolidk71 -39.109*** -39.601***
## (10.899) (10.513)
##
## schoolidk72 11.188 19.282*
## (10.189) (9.851)
##
## schoolidk73 -0.435 0.436
## (11.427) (11.023)
##
## schoolidk74 4.574 6.433
## (10.997) (10.597)
##
## schoolidk75 7.822 -1.311
## (10.762) (10.380)
##
## schoolidk76 -15.980 -19.275*
## (10.367) (10.040)
##
## schoolidk78 -48.027*** -51.562***
## (11.675) (11.270)
##
## schoolidk79 -15.241 -15.346
## (11.520) (11.112)
##
## schoolidk80 -4.414 -6.440
## (11.944) (11.511)
##
## Constant 904.721*** 925.675*** 934.031***
## (2.228) (8.203) (7.964)
##
## ---------------------------------------------------
## Observations 5,766 5,766 5,748
## R2 0.020 0.234 0.291
## Adjusted R2 0.020 0.223 0.280
## ===================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The above output looks quite messy because of the inclusion of all the school fixed effects. We can make it look nicer by including some additional omit prompts in stargazer:
## Stock and Watson, p. 489
addreg_A <- lm(I(readk + mathk) ~ stark + experiencek, data = STAR)
addreg_B <- lm(I(readk + mathk) ~ stark + experiencek + schoolidk, data = STAR)
addreg_C <- lm(I(readk + mathk) ~ stark + experiencek + gender + lunchk + ethnicity + schoolidk, data = STAR)
stargazer(addreg_A, addreg_B, addreg_C,
type = "text",
title = "Effect of Small Classes on Test Scores",
column.labels = c("A", "B", "C"),
dep.var.labels = "Reading + Math Score",
omit = "schoolidk", # <--- hides school dummies
omit.labels = "School FE", # <--- replaces them with one line
omit.stat = c("f", "ser"))
##
## Effect of Small Classes on Test Scores
## ==================================================
## Dependent variable:
## --------------------------------
## Reading + Math Score
## A B C
## (1) (2) (3)
## --------------------------------------------------
## starksmall 14.006*** 15.933*** 15.918***
## (2.395) (2.167) (2.092)
##
## starkregular+aide -0.601 1.215 1.808
## (2.306) (2.093) (2.019)
##
## experiencek 1.469*** 0.743*** 0.665***
## (0.167) (0.168) (0.162)
##
## genderfemale 12.106***
## (1.667)
##
## lunchkfree -34.636***
## (2.020)
##
## ethnicityafam -25.426***
## (3.584)
##
## ethnicityasian 0.184
## (16.972)
##
## ethnicityhispanic 3.225
## (31.648)
##
## ethnicityamindian -34.973
## (44.676)
##
## ethnicityother -24.933
## (23.935)
##
## Constant 904.721*** 925.675*** 934.031***
## (2.228) (8.203) (7.964)
##
## --------------------------------------------------
## School FE No No No
## --------------------------------------------------
## Observations 5,766 5,766 5,748
## R2 0.020 0.234 0.291
## Adjusted R2 0.020 0.223 0.280
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
OR alternatively we can use the modelsummary package to make things look a little nicer:
library(modelsummary)
## Stock and Watson, p. 489
addreg_A <- lm(I(readk + mathk) ~ stark + experiencek, data = STAR)
addreg_B <- lm(I(readk + mathk) ~ stark + experiencek + schoolidk, data = STAR)
addreg_C <- lm(I(readk + mathk) ~ stark + experiencek + gender + lunchk + ethnicity + schoolidk, data = STAR)
models <- list("A" = addreg_A, "B" = addreg_A, "C" = addreg_A)
modelsummary(models,
stars = TRUE,
coef_omit = "schoolidk", # regex, hides all school dummies
coef_rename = c(starkSmall = "Small class",
starkRegularaide = "Regular+aide class",
experiencek = "Teacher experience"),
gof_omit = "IC|Log|F|Adj") # cleaner output
| A | B | C | |
|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |||
| (Intercept) | 904.721*** | 904.721*** | 904.721*** |
| (2.228) | (2.228) | (2.228) | |
| starksmall | 14.006*** | 14.006*** | 14.006*** |
| (2.395) | (2.395) | (2.395) | |
| starkregular+aide | -0.601 | -0.601 | -0.601 |
| (2.306) | (2.306) | (2.306) | |
| Teacher experience | 1.469*** | 1.469*** | 1.469*** |
| (0.167) | (0.167) | (0.167) | |
| Num.Obs. | 5766 | 5766 | 5766 |
| R2 | 0.020 | 0.020 | 0.020 |
| RMSE | 73.06 | 73.06 | 73.06 |
Adding additional regressors does not change the estimated causal effects of the different treatments \(\Rightarrow\) This suggests that the random assignment to the smaller lasses does not depend on unobserved variables.
Additional regressors increase the \(\bar{R}^2\) and the standard error of the estimated class size effect decreases from 2.45 to 2.16.
Setup:
# install missing packages automatically
pkgs <- c("tidyverse", "WDI", "broom", "kableExtra", "fixest", "sandwich", "lmtest", "countrycode")
new <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(new)) install.packages(new, repos = "https://cloud.r-project.org")
library(tidyverse)
library(WDI)
library(broom)
library(kableExtra)
library(fixest)
library(sandwich)
library(lmtest)
library(countrycode)
theme_set(theme_minimal(base_size = 13))
set.seed(123)
Policy Setting: Many Sub-Saharan African countries eliminated primary school fees in the 1990s–2000s (Free Primary Education, FPE).
Idea When fees are removed, children who are at primary-school age at that time are more likely to enroll/stay in school.
Causal question: DiD (Policy to Outcome): Did FPE increase girls’ primary completion?
Outcome: Primary completion rate, female (%).
Treatment: FPE is in effect (post-reform years).
Design: Country and year fixed effects to compare changes within a country before vs after FPE, relative to countries not yet (or never) treated.
Load Data: Indicators and FPE Reform Years
We’ll pull a panel (1990–2018) from the World Bank API:
Primary completion rate, female (%): SE.PRM.CMPT.FE.ZS
Girls’ secondary enrollment (% gross): SE.SEC.ENRR.FE
Adolescent fertility rate (births per 1,000 women ages 15–19): SP.ADO.TFRT
GDP per capita (constant 2015 US$): NY.GDP.PCAP.KD
Urban population (%): SP.URB.TOTL.IN.ZS
We’ll also hand-code FPE adoption years for a few countries (commonly cited dates):
# Indicators
ind <- c(
prm_comp_f = "SE.PRM.CMPT.FE.ZS", # Outcome for DiD
sec_enrol_f = "SE.SEC.ENRR.FE", # Endogenous X for IV
teen_fert = "SP.ADO.TFRT", # Outcome for IV
gdppc = "NY.GDP.PCAP.KD",
urban = "SP.URB.TOTL.IN.ZS"
)
yrs <- 1990:2018
raw <- WDI(
country = "all",
indicator = ind,
start = min(yrs), end = max(yrs),
extra = TRUE, cache = NULL
)
# Keep only Sub-Saharan Africa "countries" (drop aggregates)
df <- raw %>%
filter(region == "Sub-Saharan Africa", !is.na(iso3c)) %>%
select(iso3c, country, year, prm_comp_f, sec_enrol_f, teen_fert, gdppc, urban)
# Hand-coded FPE reform years (illustrative teaching sample) --------------
fpe_years <- tribble(
~country, ~iso3c, ~fpe_year,
"Kenya", "KEN", 2003,
"Uganda", "UGA", 1997,
"Tanzania", "TZA", 2001,
"Malawi", "MWI", 1994,
"Zambia", "ZMB", 2002,
"Mozambique", "MOZ", 2004,
"Rwanda", "RWA", 2003,
"Ghana", "GHA", 2005
)
# Keep panel for the countries above, to keep the sample simple and transparent
panel <- df %>%
semi_join(fpe_years, by = c("iso3c")) %>%
arrange(iso3c, year)
# Merge in reform year
panel <- panel %>%
left_join(fpe_years, by = c("iso3c", "country"))
panel %>%
group_by(country) %>%
slice_head(n = 3) %>%
ungroup() %>%
kbl(digits = 1, caption = "Sample rows from our panel (first 3 years per country)") %>%
kable_paper()
| iso3c | country | year | prm_comp_f | sec_enrol_f | teen_fert | gdppc | urban | fpe_year |
|---|---|---|---|---|---|---|---|---|
| GHA | Ghana | 1990 | NA | 27.4 | 111.0 | 855.9 | 36.4 | 2005 |
| GHA | Ghana | 1991 | 49.9 | NA | 110.0 | 878.6 | 37.2 | 2005 |
| GHA | Ghana | 1992 | 51.4 | NA | 106.8 | 890.3 | 37.9 | 2005 |
| KEN | Kenya | 1990 | NA | NA | 133.5 | 1329.3 | 16.7 | 2003 |
| KEN | Kenya | 1991 | NA | NA | 130.2 | 1305.5 | 17.0 | 2003 |
| KEN | Kenya | 1992 | NA | NA | 126.4 | 1255.9 | 17.3 | 2003 |
| MWI | Malawi | 1990 | 25.3 | 12.3 | 155.1 | 353.8 | 11.6 | 1994 |
| MWI | Malawi | 1991 | 27.5 | 12.7 | 156.4 | 373.0 | 11.9 | 1994 |
| MWI | Malawi | 1992 | 30.1 | NA | 155.5 | 335.7 | 12.2 | 1994 |
| MOZ | Mozambique | 1990 | 23.6 | 5.4 | 164.9 | 221.0 | 25.0 | 2004 |
| MOZ | Mozambique | 1991 | 24.0 | 5.5 | 164.1 | 227.5 | 25.5 | 2004 |
| MOZ | Mozambique | 1992 | 22.5 | 5.5 | 159.0 | 206.2 | 26.0 | 2004 |
| RWA | Rwanda | 1990 | 43.4 | 14.8 | 56.9 | 368.2 | 5.4 | 2003 |
| RWA | Rwanda | 1991 | 43.3 | 14.7 | 54.3 | 351.7 | 5.5 | 2003 |
| RWA | Rwanda | 1992 | 42.9 | 13.5 | 52.6 | 364.8 | 6.3 | 2003 |
| TZA | Tanzania | 1990 | NA | NA | 137.9 | 541.5 | 18.9 | 2001 |
| TZA | Tanzania | 1991 | 61.4 | 4.5 | 135.7 | 539.0 | 19.2 | 2001 |
| TZA | Tanzania | 1992 | 52.8 | 4.5 | 131.0 | 528.8 | 19.5 | 2001 |
| UGA | Uganda | 1990 | NA | 7.9 | 180.4 | 369.8 | 11.1 | 1997 |
| UGA | Uganda | 1991 | NA | NA | 179.8 | 377.8 | 11.5 | 1997 |
| UGA | Uganda | 1992 | NA | 7.4 | 184.0 | 377.6 | 11.8 | 1997 |
| ZMB | Zambia | 1990 | NA | NA | 149.2 | 878.6 | 39.4 | 2002 |
| ZMB | Zambia | 1991 | NA | NA | 150.8 | 856.8 | 39.0 | 2002 |
| ZMB | Zambia | 1992 | NA | NA | 150.2 | 821.9 | 38.5 | 2002 |
Note: The FPE years are a minimal teaching list. For research, you’d verify dates carefully and consider staggered or partial implementations.
Basic Cleaning and Treatment Variables
For DiD, we need an indicator for when FPE is active in a country-year.
For IV, we need a cohort-exposure measure. We’ll approximate exposure for the 15–19 age group in year t as the share of their primary-school years (ages 6–13) that occurred after the FPE year. This yields a 0–1 exposure:
If FPE happens long before they start primary: exposure ≈ 1
If FPE happens after they’ve already finished primary: exposure ≈ 0
In between: partial exposure (between 0 and 1)
This is a simple, intuitive exposure index. (A research paper would build this from microdata or detailed age-by-year records.)
# DiD treatment: post-FPE indicator
panel <- panel %>%
mutate(post_fpe = if_else(!is.na(fpe_year) & year >= fpe_year, 1, 0))
# IV instrument: cohort exposure to FPE for ages 15-19 in year t
# Approximate the "middle" birth year for 15-19 as t-17 (midpoint).
# A child's primary years are ages 6..13 -> calendar years (b+6)..(b+13).
# Exposure share = fraction of those 8 years that occur at or after fpe_year.
expo_fun <- function(year, fpe_year){
if(is.na(fpe_year)) return(0) # treat as never exposed
b <- year - 17 # midpoint birth year for 15-19
start <- b + 6
end <- b + 13
# overlap of [start, end] with [fpe_year, +inf)
exposed_years <- pmax(0, (end - pmax(start, fpe_year) + 1))
share <- exposed_years / 8
share <- pmin(pmax(share, 0), 1)
return(share)
}
panel <- panel %>%
rowwise() %>%
mutate(exposure_15_19 = expo_fun(year, fpe_year)) %>%
ungroup()
# Inspect a few rows to see exposure growing over time after reform
panel %>%
filter(country %in% c("Kenya","Uganda")) %>%
select(country, year, fpe_year, post_fpe, exposure_15_19) %>%
arrange(country, year) %>%
slice(1:10) %>%
kbl(digits = 2, caption = "Illustration: post-FPE and exposure index over time") %>%
kable_paper()
| country | year | fpe_year | post_fpe | exposure_15_19 |
|---|---|---|---|---|
| Kenya | 1990 | 2003 | 0 | 0 |
| Kenya | 1991 | 2003 | 0 | 0 |
| Kenya | 1992 | 2003 | 0 | 0 |
| Kenya | 1993 | 2003 | 0 | 0 |
| Kenya | 1994 | 2003 | 0 | 0 |
| Kenya | 1995 | 2003 | 0 | 0 |
| Kenya | 1996 | 2003 | 0 | 0 |
| Kenya | 1997 | 2003 | 0 | 0 |
| Kenya | 1998 | 2003 | 0 | 0 |
| Kenya | 1999 | 2003 | 0 | 0 |
Can cigarette consumption be reduced by taxing cigarettes more heavily? For example, what would the after-tax sales price of cigarettes need to be to to achieve a 20 percent reduction in cigarette consumption?
Need to estimate the price elasticity for the demand of cigarettes needs to be estimated from data on price and sales.
An OLS regression of log quantity on log price cannot be used to estimate the effect of interest since there is simultaneous causality between demand and supply.
Load the data set and get an overview:
library(AER)
data("CigarettesSW")
summary(CigarettesSW)
## state year cpi population packs
## AL : 2 1985:48 Min. :1.076 Min. : 478447 Min. : 49.27
## AR : 2 1995:48 1st Qu.:1.076 1st Qu.: 1622606 1st Qu.: 92.45
## AZ : 2 Median :1.300 Median : 3697472 Median :110.16
## CA : 2 Mean :1.300 Mean : 5168866 Mean :109.18
## CO : 2 3rd Qu.:1.524 3rd Qu.: 5901500 3rd Qu.:123.52
## CT : 2 Max. :1.524 Max. :31493524 Max. :197.99
## (Other):84
## income tax price taxs
## Min. : 6887097 Min. :18.00 Min. : 84.97 Min. : 21.27
## 1st Qu.: 25520384 1st Qu.:31.00 1st Qu.:102.71 1st Qu.: 34.77
## Median : 61661644 Median :37.00 Median :137.72 Median : 41.05
## Mean : 99878736 Mean :42.68 Mean :143.45 Mean : 48.33
## 3rd Qu.:127313964 3rd Qu.:50.88 3rd Qu.:176.15 3rd Qu.: 59.48
## Max. :771470144 Max. :99.00 Max. :240.85 Max. :112.63
##
price: price is the average real price per pack of cigarettes including all taxes
pack: number of packs per capita
income: state personal income (total, nominal)
tax: average state, federal and average local excise taxes for fiscal year
taxs: average excise taxes for fiscal year, including sales tax
Compute real price per pack of cigarettes, the sales tax and the correlation between sales tax and price.
# real prices:
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)
# sales tax:
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)
# correlation between sales tax and price:
cor(CigarettesSW$salestax, CigarettesSW$price)
## [1] 0.6141228
Generate a subset for the year 1995:
c1995 <- subset(CigarettesSW, year == "1995")
Using the 1995 subset, perform the first stage regression:
\[ log( P_i^{\text{cigarettes}}) = \pi_0 + \pi_1 \text{SalesTax}_i + \nu_i \]
cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)
coeftest(cig_s1, vcov = vcovHC, type = "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6165463 0.0289177 159.6444 < 2.2e-16 ***
## salestax 0.0307289 0.0048354 6.3549 8.489e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the regression’s \(R^2\) we see that 47 percent of the observed variation in \(log( P^{\text{cigarettes}})\) can be explained by the instrument \(\text{SalesTax}\):
# inspect the R^2 of the first stage regression
summary(cig_s1)$r.squared
## [1] 0.4709961
Store the fitted values obtained by the first stage regression in a new variable:
# store the predicted values
lcigp_pred <- cig_s1$fitted.values
Next, we run the second stage regression which gives us the TSLS estimates we seek.
# run the stage 2 regression
cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred)
coeftest(cig_s2, vcov = vcovHC)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.71988 1.70304 5.7074 7.932e-07 ***
## lcigp_pred -1.08359 0.35563 -3.0469 0.003822 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The function ivreg() from the package AER carries out
TSLS procedure automatically. It is used similarly as lm().
Instruments can be added to the usual specification of the regression
formula using a vertical bar separating the model equation from the
instruments. Thus, for the regression at hand the correct formula
is:
log(packs) ~ log(rprice) | salestax
# perform TSLS using 'ivreg()'
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax, data = c1995)
coeftest(cig_ivreg, vcov = vcovHC, type = "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.71988 1.52832 6.3598 8.346e-08 ***
## log(rprice) -1.08359 0.31892 -3.3977 0.001411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note the following:
The individual regressions for each stage of TSLS using
lm() leads to the same coefficient estimates as when using
ivreg(). However, the standard errors reported for the
second-stage regression (e.g.,coeftest() or
summary()) are invalid because they do not account for the
use of predictions from the first-stage regression as regressors in the
second-stage regression. In contrast, ivreg() performs the
necessary adjustment automatically. The manual step-by-step estimation
is only included here for the purpose of demonstrating the mechanics of
the procedure.
Just like in multiple regression it is important to compute
heteroskedasticity-robust standard errors as we have done above using
vcovHC().
The TSLS estimate for \(\beta_1\) shows that an increase in cigarette prices by 1 percent reduces cigarette consumption by 1.08 percentage points, which is fairly elastic.
However, even though we used IV estimation, there still might be bias in the coefficient estimates due to omitted variables.
For example: In this model, the TSLS estimator is inconsistent for the true \(\beta_1\) if the instrument (the real sales tax per pack) correlates with the error term.
This is likely to be the case since there are economic factors, like state income, which impact the demand for cigarettes and correlate with the sales tax. States with high personal income tend to generate tax revenues by income taxes and less by sales taxes. Consequently, state income should be included in the regression model.
a multiple variables regression is needed in order to try to account for this
# add rincome to the dataset
CigarettesSW$rincome <- with(CigarettesSW, income / population / cpi)
c1995 <- subset(CigarettesSW, year == "1995")
# estimate the model
cig_ivreg2 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) +
salestax, data = c1995)
coeftest(cig_ivreg2, vcov = vcovHC, type = "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.43066 1.25939 7.4883 1.935e-09 ***
## log(rprice) -1.14338 0.37230 -3.0711 0.003611 **
## log(rincome) 0.21452 0.31175 0.6881 0.494917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now, add cigtax to the data set:
CigarettesSW$cigtax <- with(CigarettesSW, tax/cpi)
c1995 <- subset(CigarettesSW, year == "1995")
# estimate the model
cig_ivreg3 <- ivreg(log(packs) ~ log(rprice) + log(rincome) |
log(rincome) + salestax + cigtax, data = c1995)
coeftest(cig_ivreg3, vcov = vcovHC, type = "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.89496 0.95922 10.3157 1.947e-13 ***
## log(rprice) -1.27742 0.24961 -5.1177 6.211e-06 ***
## log(rincome) 0.28040 0.25389 1.1044 0.2753
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Four panes in R Studio:
console
script editor: save you work! eg. a do file
output
environment
Two important shortcuts when using the script editor:
Cmd/Ctrl + Enter Run part of the code
Cmd/Ctrl + Shift + S : Run the entire code in the script
Script Editor
Open the script editor up by clicking the File menu, selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you’ll see four panes, one of which is the script editor.
When done writing the script, save with extension .R
Example:
Combine script and output in one file with markdown
Example:
RStudio Cheat Sheets: Cheatsheets for numerous packages.