Setting Working Directory

setwd("C:/Users/Lenovo/OneDrive - The Pennsylvania State University/Penn State/Coursework/Sem IV/EEFE 530/Problem Sets/Problem Set 2")

#Libraries

# Load packages
library(pacman)
p_load(tidyverse, lubridate, usmap, gridExtra, stringr, readxl, plot3D,  
       cowplot, reshape2, scales, broom, data.table, ggplot2, stargazer,  
       foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont, 
       kableExtra, snakecase, janitor)
library(dplyr)

#PROBLEM 1

Part i. Research Question

My research question is whether median household income influences housing prices in the US.

Causal Relationship: Income levels within a county directly impact housing demand and affordability. Higher median household income suggests greater purchasing power, which can drive up demand for housing, leading to higher property values. Households with higher incomes are more likely to afford better housing, bid up prices, and invest in property improvements.

Conversely, lower median household income may indicate weaker economic conditions, reduced affordability, and lower demand for housing, which can result in stagnant or declining housing prices. The relationship between income levels and housing prices is well-established in economic literature, highlighting how local economic well-being influences property values.

Outcome Variable: Sale price of houses (sale_amount from the dataset) Treatment Variable: Median Household Income (continuous variable) at the county level for the years 2010, 2011, 2012 from Census Bureau website.

Justification: Counties with higher median household income are expected to have higher housing prices due to increased demand, better infrastructure, and stronger local economies. This study aims to analyze how variations in income levels correlate with property values over time and across different regions in PA, providing insights for housing policies and economic development strategies.

Part ii. DAG diagram

# Load necessary packages
library(dagitty)
## Warning: package 'dagitty' was built under R version 4.4.2
library(ggdag)
## Warning: package 'ggdag' was built under R version 4.4.2
## 
## Attaching package: 'ggdag'
## The following object is masked from 'package:stats':
## 
##     filter
# Define the DAG
dag <- dagitty('
dag {
  "Median Household Income" -> "Housing Prices"
  "Median Household Income" -> "Economic Growth"
  "Economic Growth" -> "Housing Prices"
  "House features" -> "Housing Prices"
  "House Age" -> "Housing Prices"
}
')

#  Plot the DAG with red text
ggdag(dag) + 
  theme_minimal() + 
  ggtitle("Causal DAG for Median Household Income and Housing Prices") +
  geom_dag_text(color = "red")

Since income directly relates to purchasing power, the variable of income itself speaks to economic growth. Therefore, I will control for “economic growth” or “purchasing power” through median income only. The other things that may affect our housing prices are confounders such as household charactertistics such as number of bedrooms, bathrooms, living square feet, etc. and the age of the house. Usually, the older the house, the lower their prices are.

Part iii. Causal Paths

Causal Paths from D (Median Household Income) to Y (Housing Prices) From the DAG structure, the causal paths are:

Direct Path: Median Household Income → Housing Prices

Higher median household income increases purchasing power, leading to higher demand and higher house prices.

Indirect Path via Economic Growth: Median Household Income → Economic Growth → Housing Prices

Higher median income boosts local economic growth (e.g., business activity, infrastructure investment), which in turn increases housing demand and prices.

Backdoor Paths (Confounders creating spurious associations): If unobserved confounders influence both income and housing prices, they create backdoor paths that may bias estimates:

Median Household Income ← Unobserved Factors → Housing Prices

Example: Local government policies, education quality, or natural amenities that affect both income levels and housing demand.

Part iv. Confounderes and Colliders

Within the dataset at hand, there are no colliders but generally there can be some business/economic growth policies helping both income and housing prices.

The confounders are household characteristics that we will control for.

Part v. Estimating Equation

We estimate the following log-linear regression model using panel data at the county level:

log(house sale prices_c,t) = a + B1*log(median household income_c,t ) +

B2*Housing Characteristics_i,c,t +

County FE + Time FE + error_i,s,t

where i is for individual household sale, for county c, in time t. I use county level fixed effects and time fixed effects

PROBLEM 2

Part i. Construct Dataset

Import Datasets

housingdata <- readRDS("C:/Users/Lenovo/OneDrive - The Pennsylvania State University/Penn State/Coursework/Sem IV/EEFE 530/Problem Sets/Problem Set 1/testdata20250121.RDS")

## Median HH Income Data
data10 <- read_excel("est10all.xls")
## New names:
## • `90% CI Lower Bound` -> `90% CI Lower Bound...6`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...7`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...9`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...10`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...12`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...13`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...15`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...16`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...18`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...19`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...21`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...22`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...24`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...25`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...27`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...28`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...30`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...31`
data11 <- read_excel("est11all.xls")
## New names:
## • `90% CI Lower Bound` -> `90% CI Lower Bound...6`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...7`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...9`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...10`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...12`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...13`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...15`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...16`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...18`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...19`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...21`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...22`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...24`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...25`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...27`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...28`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...30`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...31`
data12 <- read_excel("est12all.xls")
## New names:
## • `90% CI Lower Bound` -> `90% CI Lower Bound...6`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...7`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...9`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...10`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...12`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...13`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...15`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...16`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...18`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...19`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...21`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...22`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...24`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...25`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...27`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...28`
## • `90% CI Lower Bound` -> `90% CI Lower Bound...30`
## • `90% CI Upper Bound` -> `90% CI Upper Bound...31`
#removed extra headers from the excel sheets before importing

CB Data links: census 2010: https://www.census.gov/data/datasets/2010/demo/saipe/2010-state-and-county.html census 2011: https://www.census.gov/data/datasets/2011/demo/saipe/2011-state-and-county.html census 2012: https://www.census.gov/data/datasets/2012/demo/saipe/2012-state-and-county.html

Data formatting - Income Data

data10$year <- 2010
data11$year <- 2011
data12$year <- 2012

#Important Columns - State FIPS, County FIPS, Postal, Name, Median Household Income, year

#column names not equal in all datasets
setequal(names(data10), names(data11)) && setequal(names(data11), names(data12))
## [1] FALSE
#rename vars in data12
data12 <- data12 %>% dplyr::rename(
  `State FIPS` = `State FIPS Code`,
  `County FIPS` = `County FIPS Code`,
  Postal = `Postal Code`
)

#select imp vars in all datasets
## data10
data10 <- data10 %>% select(c(`State FIPS`, `County FIPS`, Postal, Name, `Median Household Income`, year))

## data11
data11 <- data11 %>% select(c(`State FIPS`, `County FIPS`, Postal, Name, `Median Household Income`, year))

## data12
data12 <- data12 %>% select(c(`State FIPS`, `County FIPS`, Postal, Name, `Median Household Income`, year))

#column names are now equal
setequal(names(data10), names(data11)) && setequal(names(data11), names(data12))
## [1] TRUE
#combine the datasets
str(data10$`County FIPS`) #num
##  num [1:3198] 0 0 1 3 5 7 9 11 13 15 ...
str(data11$`County FIPS`) #num
##  num [1:3198] 0 0 1 3 5 7 9 11 13 15 ...
str(data12$`County FIPS`) #chr
##  chr [1:3195] "000" "000" "001" "003" "005" "007" "009" "011" "013" "015" ...
data12$`County FIPS` <- as.numeric(data12$`County FIPS`) #converting to numeric as all else

df_combined <- bind_rows(data10, data11, data12)

# Check the result
head(df_combined)
## # A tibble: 6 × 6
##   `State FIPS` `County FIPS` Postal Name           Median Household Inco…¹  year
##   <chr>                <dbl> <chr>  <chr>          <chr>                   <dbl>
## 1 00                       0 US     United States  50046                    2010
## 2 01                       0 AL     Alabama        40538                    2010
## 3 01                       1 AL     Autauga County 53049                    2010
## 4 01                       3 AL     Baldwin County 47618                    2010
## 5 01                       5 AL     Barbour County 33074                    2010
## 6 01                       7 AL     Bibb County    35472                    2010
## # ℹ abbreviated name: ¹​`Median Household Income`
#Drop state only vars
df_combined <- df_combined %>% filter(`County FIPS` > 0)

##### Create Copy
df_combined1 <- df_combined 

#create combined var
df_combined$`State FIPS` <- as.numeric(df_combined$`State FIPS`)

#create combined fips county code to match with housing data
df_combined <- df_combined %>%
  mutate(
    `State FIPS` = as.character(as.numeric(`State FIPS`)),  # Remove leading zero from State FIPS
    `County FIPS` = sprintf("%03d", as.numeric(`County FIPS`)),  # Format County FIPS to 3 digits
    fips_county_code = paste0(`State FIPS`, `County FIPS`)  # Combine into final FIPS code
  )

df_combined$`State FIPS` <- as.numeric(df_combined$`State FIPS`)

#convert fips_county_code to numeric
str(df_combined$fips_county_code) 
##  chr [1:9429] "1001" "1003" "1005" "1007" "1009" "1011" "1013" "1015" ...
df_combined$fips_county_code <- as.numeric(df_combined$fips_county_code)

range(df_combined$fips_county_code) #1001 56045
## [1]  1001 56045

Data Formatting - Household Data

range(housingdata$fips_county_code) #1101 - 53071
## [1]  1101 53071
str(housingdata$fips_county_code) #integer
##  int [1:2350548] 1101 1101 1101 1101 1101 1101 1101 1101 1101 1101 ...
housingdata$fips_county_code <- as.numeric(housingdata$fips_county_code)

#Break Sale Date
housingdata <- housingdata %>%
  mutate(
    sale_date = ymd(sale_date),  # Convert to Date format
    year = year(sale_date),
    month = month(sale_date),
    day = day(sale_date)
  )

Merge Datasets

common_cols <- intersect(names(df_combined), names(housingdata))
print(common_cols)
## [1] "year"             "fips_county_code"
# Merge by fips code and year
housing_merged <- housingdata %>% left_join(df_combined, by = c("fips_county_code", "year"))

# drop unnecessary vars
housing_merged <- housing_merged %>% select(-c(`State FIPS`, `County FIPS`, Name, Postal))

sum(is.na(housing_merged))
## [1] 1926733
#subset the data to the years where treatment is available
housing_subset <- housing_merged %>% filter(year == 2010:2012)

str(housing_subset$`Median Household Income`) #chr
##  chr [1:140839] "42962" "41556" "42962" "41556" "42962" "41430" "41556" ...
housing_subset$`Median Household Income` <- as.numeric(housing_subset$`Median Household Income`)

Part ii. State by year treatment plot

state_income <- housing_subset %>%
  group_by(abbr, year) %>%
  summarize(median_income = mean(`Median Household Income`, na.rm = TRUE), .groups = "drop")

# Check the result
head(state_income)
## # A tibble: 6 × 3
##   abbr   year median_income
##   <chr> <dbl>         <dbl>
## 1 AL     2010        41556 
## 2 AL     2011        42962 
## 3 AL     2012        41430 
## 4 CA     2010        59074.
## 5 CA     2011        58746.
## 6 CA     2012        60257.
library(usmap)
library(ggplot2)
library(dplyr)

# Ensure state_income contains a 'state' column with full names
state_income <- state_income %>%
  mutate(state = state.name[match(abbr, state.abb)]) %>%  # Convert abbreviations to full names
  filter(!is.na(state))  # Remove any unmatched states (if any)

##
plot_income_map <- function(data, year) {
  # Filter data for the given year
  data_year <- data %>% filter(year == year)
  
  # Plot US map
  plot_usmap(data = data_year, values = "median_income", regions = "states") +
    scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Income") +
    ggtitle(paste("Median Household Income by State -", year)) +
    theme(legend.position = "right")
}

##
plot_income_map(state_income, 2010)

plot_income_map(state_income, 2011)

plot_income_map(state_income, 2012)

The darker shaded states have higher median income than the lighter shaded ones. States like New York, Virginia, Colorado, Texas, and California have higher median incomes for all the years and Michigan, North Carolina, Alabama, Montana, and New Mexico have lower median income. The gray states are the ones where there is no data.

Part iii. State by year sales amount plot

str(housing_subset$sale_amount)
##  num [1:140839] 399200 140272 145374 104440 348000 ...
# Aggregate total sales amount by state and year
state_sales <- housing_subset %>%
  group_by(abbr, year) %>%
  summarize(total_sales = sum(sale_amount, na.rm = TRUE), .groups = "drop")

# Convert state abbreviations to full state names
state_sales <- state_sales %>%
  mutate(state = state.name[match(abbr, state.abb)]) %>%
  filter(!is.na(state))  # Remove any unmatched states (if any)

# Check the result
head(state_sales)
## # A tibble: 6 × 4
##   abbr   year  total_sales state     
##   <chr> <dbl>        <dbl> <chr>     
## 1 AL     2010    38252274  Alabama   
## 2 AL     2011    30160740  Alabama   
## 3 AL     2012    25785007  Alabama   
## 4 CA     2010  8917926256. California
## 5 CA     2011  8567895426. California
## 6 CA     2012 10577247891. California
library(usmap)
library(ggplot2)

plot_sales_map <- function(data, year) {
  # Filter data for the given year
  data_year <- data %>% filter(year == year)
  
  # Plot US map for sales amount
  plot_usmap(data = data_year, values = "total_sales", regions = "states") +
    scale_fill_gradient(low = "lightyellow", high = "darkred", name = "Total Sales") +
    ggtitle(paste("Total Housing Sales Amount by State -", year)) +
    theme(legend.position = "right")
}

plot_sales_map(state_sales, 2010)

plot_sales_map(state_sales, 2011)

plot_sales_map(state_sales, 2012)

According to the maps, the sales amount has been the highest in California, followed by Texas, Colorado, Georgia, and Florida.

Part iv. Summary Statistics

# Select the required variables
summary_data <- housing_subset %>%
  select(sale_amount, `Median Household Income`, bedrooms_all_buildings, 
         number_of_bathrooms, number_of_fireplaces, stories_number, 
         land_square_footage, AC_presence)

# summary stats table
stargazer(summary_data, type = "text", title = "Summary Statistics", digits = 2)
## 
## Summary Statistics
## =============================================================================
## Statistic                  N       Mean      St. Dev.    Min        Max      
## -----------------------------------------------------------------------------
## sale_amount             140,839 355,221.60  663,754.10  100.00 118,750,000.00
## Median Household Income 140,839 57,207.88   14,158.81   25,416     98,426    
## bedrooms_all_buildings  140,839    3.44        1.09       1         170      
## number_of_bathrooms     140,839    2.47        1.07      0.02      175.00    
## number_of_fireplaces    140,839    1.18        6.28       1         993      
## stories_number          140,839    1.36        0.49      1.00      16.00     
## land_square_footage     140,839 37,391.86  1,362,346.00   1     304,353,720  
## AC_presence             140,839    0.60        0.49       0          1       
## -----------------------------------------------------------------------------

Part v.

library(lfe)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
housing_subset <- housing_subset %>%
  mutate(
    log_sale_amount = log(sale_amount),
    log_median_income = log(`Median Household Income`)
  )

# Estimate the fixed effects regression
model_fe <- felm(log_sale_amount ~ log_median_income + bedrooms_all_buildings + 
                    number_of_bathrooms + number_of_fireplaces + stories_number + 
                    land_square_footage + AC_presence | fips_county_code + year, 
                  data = housing_subset)
summary(model_fe)
## 
## Call:
##    felm(formula = log_sale_amount ~ log_median_income + bedrooms_all_buildings +      number_of_bathrooms + number_of_fireplaces + stories_number +      land_square_footage + AC_presence | fips_county_code + year,      data = housing_subset) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.564  -0.279   0.010   0.311   5.917 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## log_median_income       6.018e-01  7.192e-02   8.367  < 2e-16 ***
## bedrooms_all_buildings  8.097e-02  1.546e-03  52.370  < 2e-16 ***
## number_of_bathrooms     2.417e-01  1.698e-03 142.380  < 2e-16 ***
## number_of_fireplaces   -3.893e-04  2.416e-04  -1.611  0.10713    
## stories_number          1.219e-01  3.496e-03  34.868  < 2e-16 ***
## land_square_footage     2.210e-09  1.101e-09   2.008  0.04463 *  
## AC_presence             1.474e-02  4.632e-03   3.183  0.00146 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5597 on 140402 degrees of freedom
## Multiple R-squared(full model): 0.5442   Adjusted R-squared: 0.5428 
## Multiple R-squared(proj model): 0.242   Adjusted R-squared: 0.2396 
## F-statistic(full model):384.4 on 436 and 140402 DF, p-value: < 2.2e-16 
## F-statistic(proj model):  6402 on 7 and 140402 DF, p-value: < 2.2e-16
#Generate formatted table
stargazer(model_fe, type = "text", title = "Fixed Effects Regression Results",
          dep.var.labels = "Log Sale Amount", covariate.labels = c(
            "Log Median Household Income", "Bedrooms", "Bathrooms", 
            "Fireplaces", "Stories", "Land Square Footage", "AC Presence"
          ), digits = 3)
## 
## Fixed Effects Regression Results
## =======================================================
##                                 Dependent variable:    
##                             ---------------------------
##                                   Log Sale Amount      
## -------------------------------------------------------
## Log Median Household Income          0.602***          
##                                       (0.072)          
##                                                        
## Bedrooms                             0.081***          
##                                       (0.002)          
##                                                        
## Bathrooms                            0.242***          
##                                       (0.002)          
##                                                        
## Fireplaces                            -0.0004          
##                                      (0.0002)          
##                                                        
## Stories                              0.122***          
##                                       (0.003)          
##                                                        
## Land Square Footage                   0.000**          
##                                       (0.000)          
##                                                        
## AC Presence                          0.015***          
##                                       (0.005)          
##                                                        
## -------------------------------------------------------
## Observations                          140,839          
## R2                                     0.544           
## Adjusted R2                            0.543           
## Residual Std. Error             0.560 (df = 140402)    
## =======================================================
## Note:                       *p<0.1; **p<0.05; ***p<0.01

The results suggest that income has a positive significant effect on housing prices, roughly 60%. This is an expected outcome. Additionally, having more bedrooms, bathrooms, stories, and land footage area also influences the housing prices positively. This is a simple model which depicts that income has high and positive effect on housing prices.