EEFE 530, Spring 2025

GENERAL NOTES

Objectives of Problem Set 2

This problem set has two objectives: (1) state a research question and draw its corresponding Directed Acyclic Graph (DAG); (2) describe and analyze data in R based on that DAG.

Submission process

When you are done with the problem set, publish it on RPubs using your temporary account.
Copy the RPubs link to your work and submit it on Canvas.
If two or more submissions are identical, the one whose RPubs link reached me last will be treated as the copy, so submit on time. Own your work. I may ask at random for your R script and .Rmd files to double-check. As standard practice, work in a script file before building the code chunks in your .Rmd file. Your .Rmd file and RPubs page MUST show the code used to produce every output you present in your answers.

Academic integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Practice Data

The primary practice data of this problem set is a housing dataset on Canvas, testdata20250121.RDS, with information on sale price and date of the house, longitude (x), latitude (y), state, FIPS county code, year in which the house was built, number of bedrooms, bathrooms, fireplaces, stories, square footage, and presence or absence of AC. You will have to add secondary data to solve certain questions of the problem set.

# Load packages (install pacman first if it is missing)
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
library(pacman)
p_load(tidyverse, lubridate, usmap, gridExtra, stringr, readxl, plot3D,
       cowplot, reshape2, scales, broom, data.table, ggplot2, stargazer,
       foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont,
       kableExtra, snakecase, janitor)

# Load housing data
housingdata <- readRDS("testdata20250121.RDS")

PROBLEM 1. Research question and DAG

As a student in your graduate program, you have free access to a comprehensive housing dataset with variations in housing attributes and prices across counties in the United States from 2008 to 2019. Using this dataset will save you significant time, complications, and challenges associated with collecting observational or experimental data in the field or acquiring secondary data at a cost.

You are tasked with studying the causal effect on housing prices (\(Y\)) of a variable (\(D\)) that you select based on a literature search, your interests, and data availability. Your choice of \(D\) must meet the following criteria:

You are able to provide 1-2 sentences explaining why \(D\) might influence \(Y\).
You are able to provide 1-2 sentences explaining why policymakers and society should care about the relationship between \(D\) and \(Y\).
\(D\) is either a continuous or a binary variable.
\(D\) must vary both over time and across locations, where locations can be census-designated regions, divisions, states, or counties, and time can be year, quarter, or month.
Data on \(D\) must be publicly available. For example, you can use US state-year-month-level Economic Policy Uncertainty (EPU) sourced from https://www.policyuncertainty.com/state_epu.html, or other US location-time series data on economic, agricultural, environmental, energy, or educational quality variables from sources like https://fred.stlouisfed.org/, https://www.bea.gov/, etc.

Feel free to choose any variable \(D\) as long as the data source is public, \(D\) varies over time within locations, and you can provide a plausible justification for why \(D\) may affect housing prices.
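Before committing to a candidate \(D\), it can help to confirm that it really varies along both dimensions. The following is a minimal sketch in base R, assuming a state-year panel; the data frame `d_panel`, its column names, and its values are made up for illustration and are not part of the course data.

```r
# Hypothetical state-year panel for a candidate D (made-up values)
d_panel <- data.frame(
  state = rep(c("PA", "NY", "OH"), each = 3),
  year  = rep(2008:2010, times = 3),
  d     = c(1.2, 1.5, 1.1, 0.9, 0.9, 1.4, 2.0, 1.8, 1.7)
)

# Variation over time within locations: SD of D across years, by state
sd_within_state <- tapply(d_panel$d, d_panel$state, sd)

# Variation across locations within periods: SD of D across states, by year
sd_within_year <- tapply(d_panel$d, d_panel$year, sd)

# A candidate D is usable only if both kinds of variation are present
varies_over_time        <- any(sd_within_state > 0)
varies_across_locations <- any(sd_within_year > 0)

print(sd_within_state)
print(sd_within_year)
```

A standard deviation of zero for every state would mean \(D\) is constant over time within locations, and a zero for every year would mean it does not differ across locations; either case fails the criteria above.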

(i) State your research question. What is the causal relationship of interest? Clearly define the treatment and outcome variables. Provide a concise motivation for this research.
(ii) Draw a basic Directed Acyclic Graph (DAG) corresponding to your research question. Construct the figure based on your literature review, theoretical considerations, and any simplifications you’ve made. A DAG is inherently a simplified representation of the causal pathways in your study, so include all key variables and their connections while omitting less important ones. Provide a brief description of your DAG.
(iii) List all causal paths from \(D\) to \(Y\).
(iv) Are there any confounders or colliders in your DAG?
(v) Based on your DAG, write your estimating equation that isolates your causal path of interest: \(D \rightarrow Y\).
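Listing paths and spotting confounders and colliders can also be done in code. The sketch below uses the `dagitty` package, which is an assumption on my part: it is not in the package list loaded above and would need to be installed separately. The graph is hypothetical: `X` stands for a generic confounder and `C` for a generic collider, and neither is meant as the answer for any particular research question.

```r
library(dagitty)  # assumption: installed separately, e.g. install.packages("dagitty")

# Hypothetical DAG: X causes both D and Y (confounder);
# C is caused by both D and Y (collider)
g <- dagitty("dag {
  D [exposure]
  Y [outcome]
  D -> Y
  X -> D
  X -> Y
  D -> C
  Y -> C
}")

# Every path between D and Y, with a flag for whether the path is open
all_paths <- paths(g, from = "D", to = "Y")
print(all_paths)

# Directed (causal) paths from D to Y only
causal_paths <- paths(g, from = "D", to = "Y", directed = TRUE)
print(causal_paths$paths)

# Minimal sets of variables to control for when estimating D -> Y
adj_sets <- adjustmentSets(g, exposure = "D", outcome = "Y")
print(adj_sets)
```

In this toy graph the path through `X` is an open backdoor path, so `X` belongs in the estimating equation, while the path through `C` is already closed and conditioning on `C` would open it.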

PROBLEM 2. Data description and analysis based on the DAG

(i) Construct the dataset needed to study your research question by merging your primary dataset, testdata20250121.RDS, with publicly available data on \(D\) and any other variables necessary to estimate the equation in part (v) of Problem 1. For reproducibility, document where you downloaded the new data from and all steps involved in the data merging process, including the code used.
(ii) Plot the means of your treatment variable by state and year. Create state-level maps showing the data year-by-year. You may use the usmap package or any other package that helps you produce high-quality maps. Interpret the figure.
(iii) Plot the means of your outcome variable by state and year. Create state-level maps showing the data year-by-year. You may use the usmap package or any other package that helps you produce high-quality maps. Interpret the figure.
(iv) Generate a table of summary statistics for all variables needed to estimate the equation in part (v) of Problem 1. You may use tools like stargazer, xtable, kable (from the knitr package), or any other package that helps you produce well-formatted tables of descriptive statistics.
(v) Estimate the equation in part (v) of Problem 1 and generate a formatted table summarizing the estimation results. You may use tools like stargazer, texreg, etable (from the fixest package), or any other package that helps you produce well-formatted estimation tables. Discuss your research findings.
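The workflow above (merge, aggregate to state-year means, summarize, estimate) can be sketched end to end in base R. Everything in this example is a stand-in: the simulated `houses` and `treat` data frames, the column names `state`, `sale_year`, `sqft`, `d`, and `log_price`, and the two-way fixed effects specification are all assumptions for illustration. Your own variable names come from testdata20250121.RDS and your own specification comes from your DAG.

```r
set.seed(1)

# Hypothetical house-level data standing in for testdata20250121.RDS
houses <- data.frame(
  state     = sample(c("PA", "NY", "OH"), 300, replace = TRUE),
  sale_year = sample(2008:2010, 300, replace = TRUE),
  sqft      = round(runif(300, 800, 3500))
)

# Hypothetical state-year treatment data standing in for the downloaded D
treat <- expand.grid(state = c("PA", "NY", "OH"), sale_year = 2008:2010,
                     stringsAsFactors = FALSE)
treat$d <- runif(nrow(treat), 50, 150)

# Merge D onto houses by state and year, keeping every house
merged <- merge(houses, treat, by = c("state", "sale_year"), all.x = TRUE)

# The row count must not change; a change signals duplicate keys in `treat`
stopifnot(nrow(merged) == nrow(houses))

# Simulated outcome so the example runs; real prices come from the data
merged$log_price <- 11 + 0.0004 * merged$sqft - 0.002 * merged$d +
  rnorm(nrow(merged), sd = 0.2)

# State-year means of treatment and outcome (the input for plots and maps)
state_year <- aggregate(cbind(d, log_price) ~ state + sale_year,
                        data = merged, FUN = mean)
print(head(state_year))

# Quick summary statistics for the estimation variables
print(summary(merged[, c("log_price", "d", "sqft")]))

# One possible specification: state and year fixed effects plus a control
fit <- lm(log_price ~ d + sqft + factor(state) + factor(sale_year),
          data = merged)
print(summary(fit)$coefficients["d", ])
```

A table like `state_year` is the natural input for the state-level maps, and the fitted model can be passed to whichever table package you choose. Because \(D\) here varies only at the state-year level, plain `lm` standard errors are likely too small; clustering them by state is worth considering.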

HAVE FUN AND KEEP FAITH IN THE FUN!