setwd("C:/Users/Lenovo/OneDrive - The Pennsylvania State University/Penn State/Coursework/Sem IV/EEFE 530/Problem Sets/Problem Set 2")
#Libraries
# Load packages
library(pacman)
p_load(tidyverse, lubridate, usmap, gridExtra, stringr, readxl, plot3D,
cowplot, reshape2, scales, broom, data.table, ggplot2, stargazer,
foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont,
kableExtra, snakecase, janitor)
library(dplyr)
#PROBLEM 1
My research question is whether median household income influences housing prices in the US.
Causal Relationship: Income levels within a county directly impact housing demand and affordability. Higher median household income suggests greater purchasing power, which can drive up demand for housing, leading to higher property values. Households with higher incomes are more likely to afford better housing, bid up prices, and invest in property improvements.
Conversely, lower median household income may indicate weaker economic conditions, reduced affordability, and lower demand for housing, which can result in stagnant or declining housing prices. The relationship between income levels and housing prices is well-established in economic literature, highlighting how local economic well-being influences property values.
Outcome Variable: Sale price of houses (sale_amount from the dataset) Treatment Variable: Median Household Income (continuous variable) at the county level for the years 2010, 2011, 2012 from Census Bureau website.
Justification: Counties with higher median household income are expected to have higher housing prices due to increased demand, better infrastructure, and stronger local economies. This study aims to analyze how variations in income levels correlate with property values over time and across different regions in PA, providing insights for housing policies and economic development strategies.
# Load necessary packages
library(dagitty)
## Warning: package 'dagitty' was built under R version 4.4.2
library(ggdag)
## Warning: package 'ggdag' was built under R version 4.4.2
##
## Attaching package: 'ggdag'
## The following object is masked from 'package:stats':
##
## filter
# Define the DAG
dag <- dagitty('
dag {
"Median Household Income" -> "Housing Prices"
"Median Household Income" -> "Economic Growth"
"Economic Growth" -> "Housing Prices"
"House features" -> "Housing Prices"
"House Age" -> "Housing Prices"
}
')
# Plot the DAG with red text
ggdag(dag) +
theme_minimal() +
ggtitle("Causal DAG for Median Household Income and Housing Prices") +
geom_dag_text(color = "red")
Since income directly relates to purchasing power, the variable of
income itself speaks to economic growth. Therefore, I will control for
“economic growth” or “purchasing power” through median income only. The
other things that may affect our housing prices are confounders such as
household charactertistics such as number of bedrooms, bathrooms, living
square feet, etc. and the age of the house. Usually, the older the
house, the lower their prices are.
Causal Paths from D (Median Household Income) to Y (Housing Prices) From the DAG structure, the causal paths are:
Direct Path: Median Household Income → Housing Prices
Higher median household income increases purchasing power, leading to higher demand and higher house prices.
Indirect Path via Economic Growth: Median Household Income → Economic Growth → Housing Prices
Higher median income boosts local economic growth (e.g., business activity, infrastructure investment), which in turn increases housing demand and prices.
Backdoor Paths (Confounders creating spurious associations): If unobserved confounders influence both income and housing prices, they create backdoor paths that may bias estimates:
Median Household Income ← Unobserved Factors → Housing Prices
Example: Local government policies, education quality, or natural amenities that affect both income levels and housing demand.
Within the dataset at hand, there are no colliders but generally there can be some business/economic growth policies helping both income and housing prices.
The confounders are household characteristics that we will control for.
We estimate the following log-linear regression model using panel data at the county level:
where i is for individual household sale, for county c, in time t. I use county level fixed effects and time fixed effects
housingdata <- readRDS("C:/Users/Lenovo/OneDrive - The Pennsylvania State University/Penn State/Coursework/Sem IV/EEFE 530/Problem Sets/Problem Set 1/testdata20250121.RDS")
## Median HH Income Data
data10 <- read_excel("est10all.xls")
## New names:
## • `90% CI Lower Bound` -> `90% CI Lower Bound...6`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...7`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...9`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...10`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...12`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...13`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...15`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...16`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...18`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...19`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...21`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...22`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...24`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...25`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...27`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...28`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...30`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...31`
data11 <- read_excel("est11all.xls")
## New names:
## • `90% CI Lower Bound` -> `90% CI Lower Bound...6`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...7`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...9`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...10`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...12`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...13`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...15`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...16`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...18`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...19`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...21`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...22`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...24`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...25`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...27`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...28`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...30`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...31`
data12 <- read_excel("est12all.xls")
## New names:
## • `90% CI Lower Bound` -> `90% CI Lower Bound...6`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...7`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...9`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...10`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...12`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...13`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...15`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...16`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...18`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...19`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...21`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...22`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...24`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...25`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...27`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...28`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...30`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...31`
#removed extra headers from the excel sheets before importing
CB Data links: census 2010: https://www.census.gov/data/datasets/2010/demo/saipe/2010-state-and-county.html census 2011: https://www.census.gov/data/datasets/2011/demo/saipe/2011-state-and-county.html census 2012: https://www.census.gov/data/datasets/2012/demo/saipe/2012-state-and-county.html
data10$year <- 2010
data11$year <- 2011
data12$year <- 2012
#Important Columns - State FIPS, County FIPS, Postal, Name, Median Household Income, year
#column names not equal in all datasets
setequal(names(data10), names(data11)) && setequal(names(data11), names(data12))
## [1] FALSE
#rename vars in data12
data12 <- data12 %>% dplyr::rename(
`State FIPS` = `State FIPS Code`,
`County FIPS` = `County FIPS Code`,
Postal = `Postal Code`
)
#select imp vars in all datasets
## data10
data10 <- data10 %>% select(c(`State FIPS`, `County FIPS`, Postal, Name, `Median Household Income`, year))
## data11
data11 <- data11 %>% select(c(`State FIPS`, `County FIPS`, Postal, Name, `Median Household Income`, year))
## data12
data12 <- data12 %>% select(c(`State FIPS`, `County FIPS`, Postal, Name, `Median Household Income`, year))
#column names are now equal
setequal(names(data10), names(data11)) && setequal(names(data11), names(data12))
## [1] TRUE
#combine the datasets
str(data10$`County FIPS`) #num
## num [1:3198] 0 0 1 3 5 7 9 11 13 15 ...
str(data11$`County FIPS`) #num
## num [1:3198] 0 0 1 3 5 7 9 11 13 15 ...
str(data12$`County FIPS`) #chr
## chr [1:3195] "000" "000" "001" "003" "005" "007" "009" "011" "013" "015" ...
data12$`County FIPS` <- as.numeric(data12$`County FIPS`) #converting to numeric as all else
df_combined <- bind_rows(data10, data11, data12)
# Check the result
head(df_combined)
## # A tibble: 6 × 6
## `State FIPS` `County FIPS` Postal Name Median Household Inco…¹ year
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 00 0 US United States 50046 2010
## 2 01 0 AL Alabama 40538 2010
## 3 01 1 AL Autauga County 53049 2010
## 4 01 3 AL Baldwin County 47618 2010
## 5 01 5 AL Barbour County 33074 2010
## 6 01 7 AL Bibb County 35472 2010
## # ℹ abbreviated name: ¹`Median Household Income`
#Drop state only vars
df_combined <- df_combined %>% filter(`County FIPS` > 0)
##### Create Copy
df_combined1 <- df_combined
#create combined var
df_combined$`State FIPS` <- as.numeric(df_combined$`State FIPS`)
#create combined fips county code to match with housing data
df_combined <- df_combined %>%
mutate(
`State FIPS` = as.character(as.numeric(`State FIPS`)), # Remove leading zero from State FIPS
`County FIPS` = sprintf("%03d", as.numeric(`County FIPS`)), # Format County FIPS to 3 digits
fips_county_code = paste0(`State FIPS`, `County FIPS`) # Combine into final FIPS code
)
df_combined$`State FIPS` <- as.numeric(df_combined$`State FIPS`)
#convert fips_county_code to numeric
str(df_combined$fips_county_code)
## chr [1:9429] "1001" "1003" "1005" "1007" "1009" "1011" "1013" "1015" ...
df_combined$fips_county_code <- as.numeric(df_combined$fips_county_code)
range(df_combined$fips_county_code) #1001 56045
## [1] 1001 56045
range(housingdata$fips_county_code) #1101 - 53071
## [1] 1101 53071
str(housingdata$fips_county_code) #integer
## int [1:2350548] 1101 1101 1101 1101 1101 1101 1101 1101 1101 1101 ...
housingdata$fips_county_code <- as.numeric(housingdata$fips_county_code)
#Break Sale Date
housingdata <- housingdata %>%
mutate(
sale_date = ymd(sale_date), # Convert to Date format
year = year(sale_date),
month = month(sale_date),
day = day(sale_date)
)
common_cols <- intersect(names(df_combined), names(housingdata))
print(common_cols)
## [1] "year" "fips_county_code"
# Merge by fips code and year
housing_merged <- housingdata %>% left_join(df_combined, by = c("fips_county_code", "year"))
# drop unnecessary vars
housing_merged <- housing_merged %>% select(-c(`State FIPS`, `County FIPS`, Name, Postal))
sum(is.na(housing_merged))
## [1] 1926733
#subset the data to the years where treatment is available
housing_subset <- housing_merged %>% filter(year == 2010:2012)
str(housing_subset$`Median Household Income`) #chr
## chr [1:140839] "42962" "41556" "42962" "41556" "42962" "41430" "41556" ...
housing_subset$`Median Household Income` <- as.numeric(housing_subset$`Median Household Income`)
state_income <- housing_subset %>%
group_by(abbr, year) %>%
summarize(median_income = mean(`Median Household Income`, na.rm = TRUE), .groups = "drop")
# Check the result
head(state_income)
## # A tibble: 6 × 3
## abbr year median_income
## <chr> <dbl> <dbl>
## 1 AL 2010 41556
## 2 AL 2011 42962
## 3 AL 2012 41430
## 4 CA 2010 59074.
## 5 CA 2011 58746.
## 6 CA 2012 60257.
library(usmap)
library(ggplot2)
library(dplyr)
# Ensure state_income contains a 'state' column with full names
state_income <- state_income %>%
mutate(state = state.name[match(abbr, state.abb)]) %>% # Convert abbreviations to full names
filter(!is.na(state)) # Remove any unmatched states (if any)
##
plot_income_map <- function(data, year) {
# Filter data for the given year
data_year <- data %>% filter(year == year)
# Plot US map
plot_usmap(data = data_year, values = "median_income", regions = "states") +
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Income") +
ggtitle(paste("Median Household Income by State -", year)) +
theme(legend.position = "right")
}
##
plot_income_map(state_income, 2010)
plot_income_map(state_income, 2011)
plot_income_map(state_income, 2012)
The darker shaded states have higher median income than the lighter
shaded ones. States like New York, Virginia, Colorado, Texas, and
California have higher median incomes for all the years and Michigan,
North Carolina, Alabama, Montana, and New Mexico have lower median
income. The gray states are the ones where there is no data.
str(housing_subset$sale_amount)
## num [1:140839] 399200 140272 145374 104440 348000 ...
# Aggregate total sales amount by state and year
state_sales <- housing_subset %>%
group_by(abbr, year) %>%
summarize(total_sales = sum(sale_amount, na.rm = TRUE), .groups = "drop")
# Convert state abbreviations to full state names
state_sales <- state_sales %>%
mutate(state = state.name[match(abbr, state.abb)]) %>%
filter(!is.na(state)) # Remove any unmatched states (if any)
# Check the result
head(state_sales)
## # A tibble: 6 × 4
## abbr year total_sales state
## <chr> <dbl> <dbl> <chr>
## 1 AL 2010 38252274 Alabama
## 2 AL 2011 30160740 Alabama
## 3 AL 2012 25785007 Alabama
## 4 CA 2010 8917926256. California
## 5 CA 2011 8567895426. California
## 6 CA 2012 10577247891. California
library(usmap)
library(ggplot2)
plot_sales_map <- function(data, year) {
# Filter data for the given year
data_year <- data %>% filter(year == year)
# Plot US map for sales amount
plot_usmap(data = data_year, values = "total_sales", regions = "states") +
scale_fill_gradient(low = "lightyellow", high = "darkred", name = "Total Sales") +
ggtitle(paste("Total Housing Sales Amount by State -", year)) +
theme(legend.position = "right")
}
plot_sales_map(state_sales, 2010)
plot_sales_map(state_sales, 2011)
plot_sales_map(state_sales, 2012)
According to the maps, the sales amount has been the highest in California, followed by Texas, Colorado, Georgia, and Florida.
# Select the required variables
summary_data <- housing_subset %>%
select(sale_amount, `Median Household Income`, bedrooms_all_buildings,
number_of_bathrooms, number_of_fireplaces, stories_number,
land_square_footage, AC_presence)
# summary stats table
stargazer(summary_data, type = "text", title = "Summary Statistics", digits = 2)
##
## Summary Statistics
## =============================================================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------------------------------------
## sale_amount 140,839 355,221.60 663,754.10 100.00 118,750,000.00
## Median Household Income 140,839 57,207.88 14,158.81 25,416 98,426
## bedrooms_all_buildings 140,839 3.44 1.09 1 170
## number_of_bathrooms 140,839 2.47 1.07 0.02 175.00
## number_of_fireplaces 140,839 1.18 6.28 1 993
## stories_number 140,839 1.36 0.49 1.00 16.00
## land_square_footage 140,839 37,391.86 1,362,346.00 1 304,353,720
## AC_presence 140,839 0.60 0.49 0 1
## -----------------------------------------------------------------------------
library(lfe)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
housing_subset <- housing_subset %>%
mutate(
log_sale_amount = log(sale_amount),
log_median_income = log(`Median Household Income`)
)
# Estimate the fixed effects regression
model_fe <- felm(log_sale_amount ~ log_median_income + bedrooms_all_buildings +
number_of_bathrooms + number_of_fireplaces + stories_number +
land_square_footage + AC_presence | fips_county_code + year,
data = housing_subset)
summary(model_fe)
##
## Call:
## felm(formula = log_sale_amount ~ log_median_income + bedrooms_all_buildings + number_of_bathrooms + number_of_fireplaces + stories_number + land_square_footage + AC_presence | fips_county_code + year, data = housing_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.564 -0.279 0.010 0.311 5.917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## log_median_income 6.018e-01 7.192e-02 8.367 < 2e-16 ***
## bedrooms_all_buildings 8.097e-02 1.546e-03 52.370 < 2e-16 ***
## number_of_bathrooms 2.417e-01 1.698e-03 142.380 < 2e-16 ***
## number_of_fireplaces -3.893e-04 2.416e-04 -1.611 0.10713
## stories_number 1.219e-01 3.496e-03 34.868 < 2e-16 ***
## land_square_footage 2.210e-09 1.101e-09 2.008 0.04463 *
## AC_presence 1.474e-02 4.632e-03 3.183 0.00146 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5597 on 140402 degrees of freedom
## Multiple R-squared(full model): 0.5442 Adjusted R-squared: 0.5428
## Multiple R-squared(proj model): 0.242 Adjusted R-squared: 0.2396
## F-statistic(full model):384.4 on 436 and 140402 DF, p-value: < 2.2e-16
## F-statistic(proj model): 6402 on 7 and 140402 DF, p-value: < 2.2e-16
#Generate formatted table
stargazer(model_fe, type = "text", title = "Fixed Effects Regression Results",
dep.var.labels = "Log Sale Amount", covariate.labels = c(
"Log Median Household Income", "Bedrooms", "Bathrooms",
"Fireplaces", "Stories", "Land Square Footage", "AC Presence"
), digits = 3)
##
## Fixed Effects Regression Results
## =======================================================
## Dependent variable:
## ---------------------------
## Log Sale Amount
## -------------------------------------------------------
## Log Median Household Income 0.602***
## (0.072)
##
## Bedrooms 0.081***
## (0.002)
##
## Bathrooms 0.242***
## (0.002)
##
## Fireplaces -0.0004
## (0.0002)
##
## Stories 0.122***
## (0.003)
##
## Land Square Footage 0.000**
## (0.000)
##
## AC Presence 0.015***
## (0.005)
##
## -------------------------------------------------------
## Observations 140,839
## R2 0.544
## Adjusted R2 0.543
## Residual Std. Error 0.560 (df = 140402)
## =======================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The results suggest that income has a positive significant effect on housing prices, roughly 60%. This is an expected outcome. Additionally, having more bedrooms, bathrooms, stories, and land footage area also influences the housing prices positively. This is a simple model which depicts that income has high and positive effect on housing prices.