EEFE 530, Spring 2025

GENERAL NOTES
PROBLEM 1. Checking the raw data in the primary dataset
PROBLEM 2. Adding secondary data to the primary dataset
PROBLEM 3. Adding further secondary data
PROBLEM 4. Summarizing the data
PROBLEM 5. Running regressions

GENERAL NOTES

Objectives of Problem Set 1

This problem set has three purposes: (1) manage data in R; (2) generate descriptive tables, plots, and maps; (3) run basic regressions and visualize the estimates on a summary map.

Submission process

When you are done with the problem set, publish it on Rpubs using your temporary account.
Copy the RPubs link of your work and submit it on Canvas.
For any entirely equal submissions, whoever sent me their RPubs link last has copied the others. So, timely submissions are important. Own your work. I can randomly ask your R script and .Rmd files for double-checking purposes. As a standard practice, work in a script file before making your code chunks in the .Rmd file. Your .Rmd file and Rpubs submission page MUST show the code used to produce any of the outputs you present in your answers.

Academic integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Practice Data

The primary practice data of this problem set is a housing dataset in Canvas, “testdata20250121.RDS”, with information on sale price and date of the house, longitude (x), latitude (y), state, FIPS county code, year in which the house was built, number of bedrooms, bathrooms, fireplaces, stories, square footage, and presence or absence of AC. You will have to add secondary data to solve certain questions of the problem set.

# Load packages
library(pacman)
p_load(tidyverse, lubridate, usmap, gridExtra, stringr, readxl, plot3D,  
       cowplot, reshape2, scales, broom, data.table, ggplot2, stargazer,  
       foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont, 
       kableExtra, snakecase, janitor)

# Load housing data
housingdata <- readRDS("testdata20250121.RDS")

PROBLEM 1. Checking the raw data in the primary dataset

It is common to work with years, quarters, months, etc. rather than the dates. Extract the year, month, and day of sale from the date of sale (variable sale_date). Name the three new variables you generated year, month, and day. Show the variables sale_date, year, month, day for the first ten rows of the dataset.
The year of construction of the house might possibly determine the house price, assuming no retrofitting. Summarize the two variables year and year_built. You may use histograms or present the quantiles of the variables. Generate a categorical variable year_built_grp based on year_built. For example, year_built_grp may take value 1 for very old houses, 2 for quite old houses, 3 for quite new houses, and 4 for very new houses.
Make a 3D plot of log(sale_amount), year_built_grp, and another variable of your choice.
Are there any outliers that will be of concern? How will you address them? Examine briefly all variables of the primary dataset and discuss. Generate a new dataframe cleanhousingdata if you apply any changes to the original dataframe housingdata.

PROBLEM 2. Adding secondary data to the primary dataset

It is common to bring secondary data into your primary dataset for an endless list of reasons. Your treatment variable of interest and many other variables may be sourced from another dataset; many researchers have been using the primary dataset, and you want to be a little bit more original in your work by using other variables that hopefully they did not use. Some of the variables in your primary dataset may need to be refined using secondary data. Or, you may just need some demographic, census, geolocation variables to explore better your primary dataset.

You want to deflate the house prices in your primary dataset using House Price Index (HPI) from the Federal Housing Finance Agency (FHFA). Visit the webpage https://www.fhfa.gov/data/hpi/datasets?tab=quarterly-data and download the quarterly house price indexes U.S. and Census Divisions (Seasonally Adjusted and Not Adjusted). You may figure out that the data is at the division-year-quarter level (division-year-qtr), but it is also possible to get it at the national-year-quarter level if division is restricted to “USA” values. As you think about how to merge this HPI data with your primary dataset, you will need to generate the quarter variable qtr based on month and the variable division based on abbr in your primary dataset, since you have already generated the variable year. U.S. census-designated divisions are available on page 2 of this link https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf.

hpidata <- read_excel("hpi_po_us_and_census.xls")
summary(hpidata)

##    division              year           qtr       
##  Length:1350        Min.   :1991   Min.   :1.000  
##  Class :character   1st Qu.:1999   1st Qu.:1.000  
##  Mode  :character   Median :2007   Median :2.000  
##                     Mean   :2007   Mean   :2.489  
##                     3rd Qu.:2016   3rd Qu.:3.000  
##                     Max.   :2024   Max.   :4.000  
##  index_po_not_seasonally_adjusted index_po_seasonally_adjusted
##  Min.   : 94.17                   Min.   : 94.35              
##  1st Qu.:136.18                   1st Qu.:135.38              
##  Median :190.72                   Median :189.87              
##  Mean   :202.96                   Mean   :202.11              
##  3rd Qu.:233.36                   3rd Qu.:233.31              
##  Max.   :599.69                   Max.   :591.16

table(hpidata$division)

## 
## DV_ENC DV_ESC  DV_MA  DV_MT  DV_NE DV_PAC  DV_SA DV_WNC DV_WSC    USA 
##    135    135    135    135    135    135    135    135    135    135

First, create a subset of the HPI dataset named cleanhpidata that does not have “USA” as the division. Then, check where the index takes value 100 (base of the index).
Generate the variable hpi that re-scales index_po_seasonally_adjusted such that for each division it takes value 100 in the first quarter of the start year of house sales in the primary dataset. Then restrict the dataset cleanhpidata to the range of years available in your primary dataset.

Generate the quarter variable qtr and the variable division in your primary dataset cleanhousingdata. Then generate the merged dataset cleanhousinghpidata that has the house price index variable hpi defined at the division-year-quarter level. Also, generate the deflated house prices dlog_sale_amount that is “log(sale_amount * 100 / hpi)”.
You want to visualize the mean of dlog_sale_amount by state and year in a figure that shows a series of annual state-level maps. You have the longitude, latitude, FIPS county code, and the two-letter state abbreviation for the locations of the houses. Matching locations and place names over time is usually a nightmare. Fortunately, the US government has solved this issue with FIPS codes. FIPS stands for Federal Information Processing Standards (FIPS). You may read the following description of FIPS codes (https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html):

“FIPS state codes are unique within nation and FIPS county codes are unique within state. Since counties nest within states, a full county FIPS code identifies both the state and the nesting county. For example, there are 49 counties in the 50 states ending in the digits “001”. To make these county FIPS codes unique, the state FIPS codes are added to the front of each county (01001, 02001, 04001, etc), where the first two digits refer to the state the county is in and the last three digits refer specifically to the county.”

There are several R packages that allow making maps, including those we have talked about in the lecture. Some may require the longitude and latitude, but the easiest will just require the FIPS codes. Perhaps the easiest one to get started with is usmap. You have several resources to practice usmap:

https://cran.r-project.org/web/packages/usmap

https://cran.r-project.org/web/packages/usmap/usmap.pdf

https://cran.r-project.org/web/packages/usmap/vignettes/usmap1.html

https://cran.r-project.org/web/packages/usmap/vignettes/usmap2.html

https://cran.r-project.org/web/packages/usmap/vignettes/usmap3.html

R has a built-in data frame for states that includes FIPS codes: statepop. The usmap package also has a built-in data frame for this purpose.

#print(statepop,n=52)

fips(c("Pennsylvania", "Washington"))

## [1] "42" "53"

fips("Pennsylvania", "Centre")

## [1] "42027"

fips_info("42")

##   abbr fips         full
## 1   PA   42 Pennsylvania

fips_info("42027")

##           full abbr        county  fips
## 1 Pennsylvania   PA Centre County 42027

states_df <- usmap::us_map()
summary(states_df)

##      fips               abbr               full                      geom   
##  Length:51          Length:51          Length:51          MULTIPOLYGON :51  
##  Class :character   Class :character   Class :character   epsg:9311    : 0  
##  Mode  :character   Mode  :character   Mode  :character   +proj=laea...: 0

Add FIPS state code from statepop or states_df to cleanhousinghpidata using their shared identifier.
Filter the mean of dlog_sale_amount by state for the start year of house sales in the sample and store the map for that year. Next, filter the mean of dlog_sale_amount by state for the end year of house sales in the sample and store the map for that year. Show the figure that allows to visualize the two stored maps. Your output at this stage may look like the figure below.

You are not satisfied with these two descriptive maps as you don’t have a sense of the existing price variation in the middle years. Produce a figure that shows a series of all annual state-level maps in your sample. Your output at this stage may look like the figure below.

Rather than visualize these maps for the years of house sale, you want to visualize the mean of dlog_sale_amount by state and year of the construction of the house.

Produce a figure that shows a series of state-level maps by year_built_grp, which you generated in part (ii) of Problem 1, and discuss whether year_built_grp matters for the house prices based on that figure. Your output may look like this.

Read again the information provided on county FIPS codes (https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html). Does your primary dataset have all correct county FIPS codes? If it does not, can you explain why and fix the problem?
Produce a figure that shows county-level maps with the means of year_built_grp and dlog_sale_amount. Do counties with a high mean of year_built_grp match those with a high mean of dlog_sale_amount? Your output may look like this.

PROBLEM 3. Adding further secondary data

Your literature review pointed you to the need for adding crime data to the merged dataset cleanhousinghpidata. For the moment, you wanted to gather crime data at the national-year level. When you find more time, you will gather it at a finer scale for each year.

Visit the webpage https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/tables/table-1 and download the data: “Download Excel (Table 1)” Crime in the United States by Volume and Rate per 100,000 Inhabitants, 2000–2019. Without changing anything in the excel file downloaded, load it into R and name the data frame crimedata. Hint: you can load just the range of data cells and use col_names to rewrite the headers or short names of the variables.
Remove the footnotes or remarks from the data frame and restrict the data frame to the years that are available in cleanhousinghpidata.
Generate the dataset cleancrimedata that has the variables year and non-missing crime rate variables.
Merge cleancrimedata with cleanhousinghpidata. Save the dataset as cleanhousinghpicrimedata. Plot crime rates against year. Also plot dlog_sale_amount against year for each value of year-built_grp. Discuss all these plots.

PROBLEM 4. Summarizing the data

You finally reached your favorite stage, where you do things mechanically.

Using the full sample of cleanhousinghpicrimedata and keeping in mind the variables entering the hedonic price regressions that you will later estimate in PROBLEM 5, generate a table of summary statistics of sale_amount, dlog_sale_amount, crime rates, year_built, and all other housing characteristics that you will use. For variables that are log-transformed, present the summary statistics of the variable before and after the log-transformation. You may use the packages stargazer, xtable, kable, or any package that helps you produce well-formatted tables of descriptive statistics.
Generate a table of summary statistics of housing characteristics for each value of year-built_grp. Does the tendency differ from what you have seen in the pooled sample in part (i) above?
Generate a table of summary statistics of housing characteristics shown in part (i) for each census-designated division. Does the tendency differ from what you have seen in the pooled sample in part (i)?
Generate a table of summary statistics of housing characteristics shown in part (i) above for each census-designated region. U.S. census-designated regions are available on page 2 of this link https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html. Does the tendency differ from what you have seen in the pooled sample in part (i)?

PROBLEM 5. Running regressions

You are interested in the effect of crime rates on house prices in the United States. Write your estimating equation, with \(log\left(\frac{P_{it}}{HPI_{it}}\right)\) as the dependent variable. Specify your smaller set of crime rate variables of interest based on part (iv) of PROBLEM 3 and the additional housing characteristics in the equation.
Estimate at least three different versions of your equation using the full sample of cleanhousinghpicrimedata and generate a formatted table summarizing the estimation results. You may use the packages stargazer, texreg, etable, or any package that helps you produce well-formatted estimation tables. Discuss which version or model you prefer and interpret the effects of crime rates on house prices.
Re-estimate your preferred model for each census-designated region.
Based on the region-specific regressions in part (iii), store in a new variable the estimated effect of a key crime rate variable. Produce a map that shows the variation of the estimated effect of your key crime rate variable in the United States.

HAVE FUN AND KEEP FAITH IN THE FUN!