Experimental vs. Observational Data

Author

This document runs through the basics of experimental vs. observational data with some examples in R. There is included code for your reference that employs simulated data. You will understand very little of it at first and THAT’S OK! If you don’t know what’s going on, think of the code as being a “black box” from which something (i.e., a graph or number) emerges. As we move throughout the quarter it will make more and more sense to you. One other precautionary note: all of these examples are based on simulated data. These are not real data, they’re just for illustrative purposes, please don’t base your substantive beliefs on these toy examples.

Code

# Load necessary packages, if you don't have them use install.packages("packagename")
library(ggplot2)
library(ggthemes)
library(distributional)
library(ggdist)
library(dplyr)
library(knitr)

Experimental Data

Random experiments are the gold standard for causal inference across fields. This simple method underlies most empirical work in the hard sciences from drug trials to lab tests and so on. The critical features of random experiments are: 1) random assignment and 2) random sampling. What are these?

So what does experimental data look like?

Let’s pretend we’re a group of political operatives working for Nikki Haley. We’re interested in the effect of a negative campaign ad on feelings/affect toward former President Trump among GOP primary voters. To do this, we run a simple online survey experiment with a sample of 1,000 GOP primary voters. We split our observations into two groups: treatment and control. In the treatment group, we’ll show the participants the campaign ad, in the control group we’ll show them no ad. We’ll then ask both groups how they feel about Trump using a feeling thermometer (a 0-100 scale where 0 is ice cold and 100 is red hot).

Here’s what the distribution of feeling thermometers looks like in the control group:

Code

# Set seed for replicability 

set.seed(1995)
# 
df <- data.frame(treat = rbinom(1000,1,.5)) #generate treatment assignment
df$trump_support <- df$treat*5 + rnorm(1000,75,5) #simulate the outcome variable 
index_nt <- sample(seq(1,1000,1),250)
# let's make it bimodal for never trumpers. 
df[index_nt,]$trump_support <- df[index_nt,]$trump_support - rnorm(250,50,5) 
df$treat <- ifelse(df$treat == 1, "Treatment","Control")
#
# simple histogram 
df |> 
  filter(treat == "Control") |> 
  ggplot(aes(x = trump_support)) + 
  geom_histogram(alpha=0.8, color = "grey90",fill = "firebrick2") + 
  xlab("Trump Support (0-100)") + 
  ylab("Count") + 
  xlim(0,100) + 
  theme_few() + 
  theme(axis.text = element_text(color = "black", face = "bold"), 
        axis.title = element_text(color = "black", face = "bold"))

What do we notice about Trump support here among our sample of GOP primary voters? Does this have anything to do with our experiment?

Now let’s compare the treatment and control groups. Treatment in blue, control in red. Here’s a density plot:

Code

# comparison histogram 
df |> 
  ggplot(aes(x = trump_support,y = treat,fill = treat)) + 
  ggdist::stat_slab(color = "black") + 
  xlab("Trump Support (0-100)") + 
  geom_vline(xintercept = 50,linetype = "dotted") + 
  ylab("") + 
  theme_few() + 
  theme(axis.text = element_text(color = "black", face = "bold"), 
        axis.title = element_text(color = "black", face = "bold"),
        legend.position = "none") + 
  xlim(0,100) + 
  scale_fill_manual(values = c("firebrick2","dodgerblue2"))

What do we notice about the two distributions here? Are they different, the same…? What effect did our campaign ad have in this sample?

If we wanted to calculate the effect of our treatment (i.e., how much did our ad cause an increase/decrease in Trump affect), we can take the simple difference in means between the FTs from the treatment and control groups. Nothing fancy. We’ll talk about why this works later, and how to quantify uncertainty.

But for now. . .

Code

df |> 
  group_by(treat) |>  #cluster obs into groups by treatment status  
  summarize(mean_support = mean(trump_support)) |> # take the mean within each group
  summarize(diff_in_means = diff(mean_support)) |> #take the difference in means 
  kable(format = "html", table.attr = "style='width:50%;'")

diff_in_means
5.424109

How would we interpret this difference in means? What was the effect of our ad?

This might seem like a simple example, but the power of random assignment + random sampling underlies vast swathes of empirical work. Understanding how experiments work, and how they can fail, is crucial for understanding + diagnosing more advanced methods.

One last thing: in experimental studies you’ll often see a table called a balance table. It will look something like this:

Code

df$white <- rbinom(1000,1,.7)
df$income <- rnorm(1000,45000,10000)
df$age <- runif(1000,18,80)
df$democrat <- rbinom(1000,1,.5)
#
modelsummary::datasummary_balance(~treat, data = df)

	Control (N=500)		Treatment (N=500)
	Mean	Std. Dev.	Mean	Std. Dev.	Diff. in Means	Std. Error
trump_support	62.3	22.0	67.7	22.1	5.4	1.4
white	0.7	0.5	0.7	0.5	0.0	0.0
income	45174.5	10218.1	44929.1	9393.8	-245.3	620.7
age	49.5	17.9	50.0	18.5	0.5	1.2
democrat	0.5	0.5	0.5	0.5	0.0	0.0

The balance table serves as a rudimentary check that the random assignment “worked” and that we didn’t accidentally condition treatment on some other variable we didn’t intend. Since the assignment process should be exogenous/orthogonal (i.e., unrelated) to observable factors, detecting serious imbalance could be a sign that something has gone awry. For example, if we have exclusively people over the age of 65 in our treatment group, that’s a sign the assignment process likely failed. Imbalance can also occur simply by chance, especially as the number of measured variables increases.

Observational Data

Observational data contrast with experiments in a variety of ways. An easy heuristic to differentiate experimental studies and observational studies: does the researcher control treatment assignment? In other words, does the researcher assign which observations are in the treatment and control group? Or does the “treatment” assignment occur naturally, and we’re working with what nature has given us? If we don’t control the assignment mechanism, the data are likely observational. Observational data are trickier to work with, but have advantages over experiments in terms of external validity.

So what do observational data look like? Let’s take an example of an observational study where we’re interested in the effect of campaign donations on legislative productivity. We think that the more donations a politican gets, the more productive they will be (stop here: does this make sense? Discuss!). Our sample is the whole US House of Representatives in, say, 2018. To operationalize productivity, we’ll use number of bills authored or coauthored.

Here’s the distribution:

Code

# let's pretend congressional contributions are distributed gamma 
c_dist <- distributional::dist_gamma(5,1)
leg_dist <- distributional::dist_exponential(.5)
#
prod <- data.frame(contributions = unlist(generate(c_dist,435))*10000)
prod$leg_out <- unlist(generate(leg_dist,435))*100
# 
prod |> 
  ggplot(aes(x = leg_out)) + 
  ggdist::stat_histinterval(slab_color = "grey95", breaks = 30,point_interval = "mean_qi",
                            slab_fill = "darkorange", outline_bars = TRUE) + 
  theme_few() +
  xlab("Legislative Productivity (# of Bills Authored)") + 
  ylab("Density") + 
    theme(axis.text = element_text(color = "black", face = "bold"), 
          axis.title = element_text(color = "black", face = "bold"))

How would we describe this distribution?

Now let’s make a simple scatter plot of bill coauthorship and campaign contributions

Code

prod |> 
  ggplot(aes(x = contributions, y = leg_out)) + 
  geom_point(shape = 1) + 
  theme_few() + 
  ylab("Legislative Productivity (# of Bills Authored)") + 
  xlab("Campaign Contributions ($$)") +
  theme(axis.text = element_text(color = "black", face = "bold"), 
        axis.title = element_text(color = "black", face = "bold"))

Can we get much from this simple scatterplot?

What if we draw some lines through the scatterplot?

Code

##
spline_counts <- c(1, 10, 50, 200)
prod_expanded <- bind_rows(
  lapply(spline_counts, function(count) {
    prod |> 
      mutate(spline_count = count)
  }),
  .id = "id"
)
#
# Create a plot for each spline count
ggplot(prod_expanded, aes(x = contributions, y = leg_out)) + 
  geom_point(shape = 1) + 
  geom_smooth(data = subset(prod_expanded, spline_count == spline_counts[1]), 
              method = "lm", se = FALSE, 
              formula = y ~ splines::bs(x, df = spline_counts[1]),color = "red") +
  
  geom_smooth(data = subset(prod_expanded, spline_count == spline_counts[2]), 
              method = "lm", se = FALSE, 
              formula = y ~ splines::bs(x, df = spline_counts[2]),color = "red") +
  
  geom_smooth(data = subset(prod_expanded, spline_count == spline_counts[3]), 
              method = "lm", se = FALSE, 
              formula = y ~ splines::bs(x, df = spline_counts[3]),color = "red") +
  
  geom_smooth(data = subset(prod_expanded, spline_count == spline_counts[4]), 
              method = "lm", se = FALSE, 
              formula = y ~ splines::bs(x, df = spline_counts[4]),color = "red") +
  facet_wrap(~ spline_count) +
  theme_few() + 
  ylab("Legislative Productivity (# of Bills Authored)") + 
  xlab("Campaign Contributions Received ($$)") +
  theme(axis.text = element_text(color = "black", face = "bold"), 
        axis.title = element_text(color = "black", face = "bold")) + 
  ylim(-5, 1600)

Here are a few examples of different lines/curves we might draw through these data. This is an example of curve-fitting. Which, of any of these curves look like a good fit for these data? What might be going on here?

Let’s modify the example a bit. Now we’re going to look at legislative productivity by seniority. First, let’s take a look at the number of terms served in the House in 2018:

Code

age_dist <- distributional::dist_exponential(.4)
prod$terms <- (unlist(generate(age_dist,435))*2)
prod$leg_out <- prod$leg_out + prod$terms*35
# 
prod |> 
  ggplot(aes(x = terms)) + 
  ggdist::stat_histinterval(slab_color = "grey95", breaks = 30,point_interval = "mean_qi",
                            slab_fill = "darkorange", outline_bars = TRUE) + 
  theme_few() +
  xlab("Number of Two-Year House Terms Served") + 
  ylab("Density") + 
    theme(axis.text = element_text(color = "black", face = "bold"), 
          axis.title = element_text(color = "black", face = "bold"))

How would we describe this distribution?

Now let’s look at some simple scatter plots of terms vs. productivity.

Code

prod |> 
  ggplot(aes(x = terms, y = leg_out)) + 
  geom_point(shape = 1) + 
  theme_few() + 
  ylab("Legislative Productivity (# of Bills Authored)") + 
  xlab("Number of Two-Year House Terms Served ($$)") +
  theme(axis.text = element_text(color = "black", face = "bold"), 
        axis.title = element_text(color = "black", face = "bold"))

Thoughts?

Now let’s once again do some rank curve fitting…

Code

##
spline_counts <- c(1, 10, 50, 200)
prod_expanded <- bind_rows(
  lapply(spline_counts, function(count) {
    prod |> 
      mutate(spline_count = count)
  }),
  .id = "id"
)
#
# Create a plot for each spline count
prod_expanded |> 
  ggplot(aes(x = terms, y = leg_out)) + 
  geom_point(shape = 1) + 
  geom_smooth(data = subset(prod_expanded, spline_count == spline_counts[1]), 
              method = "lm", se = FALSE, 
              formula = y ~ splines::bs(x, df = spline_counts[1]),color = "red") +
  
  geom_smooth(data = subset(prod_expanded, spline_count == spline_counts[2]), 
              method = "lm", se = FALSE, 
              formula = y ~ splines::bs(x, df = spline_counts[2]),color = "red") +
  
  geom_smooth(data = subset(prod_expanded, spline_count == spline_counts[3]), 
              method = "lm", se = FALSE, 
              formula = y ~ splines::bs(x, df = spline_counts[3]),color = "red") +
  
  geom_smooth(data = subset(prod_expanded, spline_count == spline_counts[4]), 
              method = "lm", se = FALSE, 
              formula = y ~ splines::bs(x, df = spline_counts[4]),color = "red") +
  facet_wrap(~ spline_count) +
  theme_few() + 
  ylab("Legislative Productivity (# of Bills Authored)") + 
  xlab("Campaign Contributions Received ($$)") +
  theme(axis.text = element_text(color = "black", face = "bold"), 
        axis.title = element_text(color = "black", face = "bold")) + 
  ylim(-5, 1600)

What would we say is the relationship between these two variables in these data? Can we say contributions are “causing” increases in productivity here? And which of these lines is “best”? We’ll talk more about modeling and overfitting in future weeks.

Real-World Examples

If you’re interested in seeing some different examples of experimental vs. observational papers in political science, here’s a sampler of (truly) miscellaneous work you might enjoy:

Experiments

“Wither Elites? The Role of Elite Credibility and Knowledge in Public Perceptions of Foreign Policy”

“Does Party Trump Ideology? Disentangling Party and Ideology in America.”

“Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment

Observational

“Modern Police Tactics, Police-Citizen Interactions and the Prospects for Reform”

“How words and money cultivate a personal vote: The effect of legislator credit claiming on constituent credit allocation”

“Are niche parties fundamentally different from mainstream parties? The causes and the electoral consequences of Western European parties’ policy shifts, 1976–1998”

“Political Appointments and Outcomes in Federal District Courts”