The goals of this problem set are to: (1) manage and explore data in R; (2) generate descriptive tables, plots, and maps; and (3) run basic regressions and visualize the estimates on a summary map.

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.
Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.
The primary dataset for this problem set is a housing dataset posted on Canvas, “testdata20250121.RDS”, with information on the sale price and sale date of each house, longitude (x), latitude (y), state, FIPS county code, year in which the house was built, number of bedrooms, bathrooms, fireplaces, stories, square footage, and presence or absence of AC. You will need to add secondary data to answer certain questions of the problem set.
# Load packages
library(pacman)
p_load(tidyverse, lubridate, usmap, gridExtra, stringr, readxl, plot3D,
cowplot, reshape2, scales, broom, data.table, ggplot2, stargazer,
foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont,
kableExtra, snakecase, janitor)
# Load housing data
housingdata <- readRDS("testdata20250121.RDS")
It is common to work with years, quarters, months, etc. rather than the dates themselves. Extract the year, month, and day of sale from the date of sale (variable sale_date). Name the three new variables you generate year, month, and day. Show the variables sale_date, year, month, and day for the first ten rows of the dataset.
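A minimal sketch of one way to do this, assuming sale_date is stored as (or can be parsed into) a Date; the helpers come from lubridate and dplyr, which are loaded above:
# Parse the sale date (if needed) and extract its components
housingdata <- housingdata %>%
  mutate(sale_date = as_date(sale_date),
         year  = year(sale_date),
         month = month(sale_date),
         day   = day(sale_date))
# First ten rows of the requested variables
housingdata %>% select(sale_date, year, month, day) %>% head(10)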
The year of construction of the house might determine the house price, assuming no retrofitting. Summarize the two variables year and year_built. You may use histograms or present the quantiles of the variables. Generate a categorical variable year_built_grp based on year_built. For example, year_built_grp may take value 1 for very old houses, 2 for quite old houses, 3 for quite new houses, and 4 for very new houses.
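A hedged sketch of the grouping step; the cut points below are purely illustrative and should be chosen after inspecting the distribution of year_built:
# Quantiles of the construction year, then an illustrative 4-group coding
quantile(housingdata$year_built, probs = seq(0, 1, 0.25), na.rm = TRUE)
housingdata <- housingdata %>%
  mutate(year_built_grp = cut(year_built,
                              breaks = c(-Inf, 1940, 1970, 2000, Inf),  # illustrative breaks
                              labels = 1:4))
table(housingdata$year_built_grp)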
Make a 3D plot of log(sale_amount), year_built_grp, and another variable of your choice.
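A sketch using plot3D (loaded above); sqft is an assumed name for the square footage variable, so substitute whichever third variable you actually pick:
# 3D scatter of group, an assumed third variable, and the log sale amount
with(housingdata,
     scatter3D(x = as.numeric(year_built_grp),
               y = sqft,                          # assumed column name
               z = log(sale_amount),
               pch = 19, cex = 0.3,
               xlab = "year_built_grp", ylab = "sqft", zlab = "log(sale_amount)"))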
Are there any outliers that will be of concern? How will you address them? Examine briefly all variables of the primary dataset and discuss. Generate a new dataframe cleanhousingdata if you apply any changes to the original dataframe housingdata.
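For example, a hedged cleaning sketch (the thresholds are illustrative, not prescribed):
# Quick scan of all variables, then an illustrative trim of implausible sale amounts
summary(housingdata)
cleanhousingdata <- housingdata %>%
  filter(sale_amount > 0,
         sale_amount < quantile(sale_amount, 0.999, na.rm = TRUE))  # illustrative cutoff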
It is common to bring secondary data into your primary dataset for an endless list of reasons. Your treatment variable of interest and many other variables may be sourced from another dataset; many researchers have been using the primary dataset, and you may want to be a bit more original in your work by using variables that they hopefully did not use. Some of the variables in your primary dataset may need to be refined using secondary data. Or you may simply need some demographic, census, or geolocation variables to better explore your primary dataset.
The house price index (HPI) data in “hpi_po_us_and_census.xls” is provided at the division-year-quarter level (division-year-qtr), but it is also possible to get it at the national-year-quarter level if division is restricted to “USA” values. As you think about how to merge this HPI data with your primary dataset, you will need to generate the quarter variable qtr based on month and the variable division based on abbr in your primary dataset, since you have already generated the variable year. U.S. census-designated divisions are available on page 2 of this link: https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf

hpidata <- read_excel("hpi_po_us_and_census.xls")
summary(hpidata)
## division year qtr
## Length:1350 Min. :1991 Min. :1.000
## Class :character 1st Qu.:1999 1st Qu.:1.000
## Mode :character Median :2007 Median :2.000
## Mean :2007 Mean :2.489
## 3rd Qu.:2016 3rd Qu.:3.000
## Max. :2024 Max. :4.000
## index_po_not_seasonally_adjusted index_po_seasonally_adjusted
## Min. : 94.17 Min. : 94.35
## 1st Qu.:136.18 1st Qu.:135.38
## Median :190.72 Median :189.87
## Mean :202.96 Mean :202.11
## 3rd Qu.:233.36 3rd Qu.:233.31
## Max. :599.69 Max. :591.16
table(hpidata$division)
##
## DV_ENC DV_ESC DV_MA DV_MT DV_NE DV_PAC DV_SA DV_WNC DV_WSC USA
## 135 135 135 135 135 135 135 135 135 135
Generate the dataset cleanhpidata that does not have “USA” as the division. Then check where the index takes value 100 (the base of the index). Generate a variable hpi that re-scales index_po_seasonally_adjusted such that, for each division, it takes value 100 in the first quarter of the start year of house sales in the primary dataset. Then restrict the dataset cleanhpidata to the range of years available in your primary dataset.
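A sketch of these steps, under the assumption that the start and end years of house sales are taken from cleanhousingdata:
# Sample range of the housing data
start_year <- min(cleanhousingdata$year, na.rm = TRUE)
end_year   <- max(cleanhousingdata$year, na.rm = TRUE)

# Drop the national series, re-base the index within each division, restrict years
cleanhpidata <- hpidata %>%
  filter(division != "USA") %>%
  group_by(division) %>%
  mutate(hpi = 100 * index_po_seasonally_adjusted /
           index_po_seasonally_adjusted[year == start_year & qtr == 1]) %>%
  ungroup() %>%
  filter(year >= start_year, year <= end_year)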
Generate the quarter variable qtr and the variable division in your primary dataset cleanhousingdata. Then generate the merged dataset cleanhousinghpidata, which has the house price index variable hpi defined at the division-year-quarter level. Also generate the deflated house prices dlog_sale_amount, defined as “log(sale_amount * 100 / hpi)”.
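A sketch of the merge, assuming division in cleanhousingdata has already been recoded from abbr to the same DV_* codes used in cleanhpidata:
# Quarter of sale from the month of sale
cleanhousingdata <- cleanhousingdata %>%
  mutate(qtr = ceiling(month / 3))

# Merge in the division-year-quarter HPI and deflate the sale amount
cleanhousinghpidata <- cleanhousingdata %>%
  left_join(cleanhpidata %>% select(division, year, qtr, hpi),
            by = c("division", "year", "qtr")) %>%
  mutate(dlog_sale_amount = log(sale_amount * 100 / hpi))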
You want to visualize the mean of dlog_sale_amount
by state and year in a figure that shows a series of annual state-level
maps. You have the longitude, latitude, FIPS county code, and the
two-letter state abbreviation for the locations of the houses. Matching
locations and place names over time is usually a nightmare. Fortunately,
the US government has solved this issue with FIPS codes. FIPS stands for
Federal Information Processing Standards. You
may read the following description of FIPS codes (https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html):
“FIPS state codes are unique within nation and FIPS county codes are unique within state. Since counties nest within states, a full county FIPS code identifies both the state and the nesting county. For example, there are 49 counties in the 50 states ending in the digits “001”. To make these county FIPS codes unique, the state FIPS codes are added to the front of each county (01001, 02001, 04001, etc), where the first two digits refer to the state the county is in and the last three digits refer specifically to the county.”
There are several R packages that allow making maps, including those we have talked about in the lecture. Some may require the longitude and latitude, but the easiest will just require the FIPS codes. Perhaps the easiest one to get started with is usmap. You have several resources to practice usmap:
https://cran.r-project.org/web/packages/usmap
https://cran.r-project.org/web/packages/usmap/usmap.pdf
https://cran.r-project.org/web/packages/usmap/vignettes/usmap1.html
https://cran.r-project.org/web/packages/usmap/vignettes/usmap2.html
https://cran.r-project.org/web/packages/usmap/vignettes/usmap3.html
The usmap package provides a built-in data frame for states that includes FIPS codes: statepop. It also provides the helper us_map(), which returns a data frame of state boundaries (used below as states_df).
#print(statepop,n=52)
fips(c("Pennsylvania", "Washington"))
## [1] "42" "53"
fips("Pennsylvania", "Centre")
## [1] "42027"
fips_info("42")
## abbr fips full
## 1 PA 42 Pennsylvania
fips_info("42027")
## full abbr county fips
## 1 Pennsylvania PA Centre County 42027
states_df <- usmap::us_map()
summary(states_df)
## fips abbr full geom
## Length:51 Length:51 Length:51 MULTIPOLYGON :51
## Class :character Class :character Class :character epsg:9311 : 0
## Mode :character Mode :character Mode :character +proj=laea...: 0
Merge statepop or states_df to cleanhousinghpidata using their shared identifier. Then filter the mean of dlog_sale_amount by state for the start year of house sales in the sample and store the map for that year. Next, filter the mean of dlog_sale_amount by state for the end year of house sales in the sample and store the map for that year. Show a figure that visualizes the two stored maps side by side; your output at this stage may look like the figure below.
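A sketch of the two state maps with usmap and cowplot (both loaded above); the grouping uses the state abbreviation column abbr from the primary data:
# Start and end years of house sales in the merged sample
start_yr <- min(cleanhousinghpidata$year, na.rm = TRUE)
end_yr   <- max(cleanhousinghpidata$year, na.rm = TRUE)

# Helper: state-level map of mean dlog_sale_amount for a given year
state_map_for <- function(yr) {
  d <- cleanhousinghpidata %>%
    filter(year == yr) %>%
    group_by(abbr) %>%
    summarise(mean_dlog = mean(dlog_sale_amount, na.rm = TRUE)) %>%
    rename(state = abbr)
  plot_usmap(data = d, values = "mean_dlog") +
    scale_fill_viridis_c(name = "mean dlog_sale_amount") +
    labs(title = paste("Year", yr))
}

# Two stored maps shown side by side
plot_grid(state_map_for(start_yr), state_map_for(end_yr), ncol = 2)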
Produce a figure that shows the mean of dlog_sale_amount by state and year of construction of the house. Use year_built_grp, which you generated in part (ii) of Problem 1, and discuss whether year_built_grp matters for the house prices based on that figure. Your output may look like this.

Read again the information provided on county FIPS codes (https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html). Does your primary dataset have all the correct county FIPS codes? If it does not, can you explain why and fix the problem?
Produce a figure that shows county-level maps with the means of year_built_grp and dlog_sale_amount. Do counties with a high mean of year_built_grp match those with a high mean of dlog_sale_amount? Your output may look like this.
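A sketch of the county-level maps, assuming the county FIPS codes have already been corrected into full five-digit codes stored in a column named fips:
# County means of the construction-year group and the deflated log price
county_means <- cleanhousinghpidata %>%
  group_by(fips) %>%                                   # assumed 5-digit county FIPS column
  summarise(mean_grp  = mean(as.numeric(year_built_grp), na.rm = TRUE),
            mean_dlog = mean(dlog_sale_amount, na.rm = TRUE))

p_grp  <- plot_usmap(regions = "counties", data = county_means, values = "mean_grp") +
  scale_fill_viridis_c(name = "mean year_built_grp")
p_dlog <- plot_usmap(regions = "counties", data = county_means, values = "mean_dlog") +
  scale_fill_viridis_c(name = "mean dlog_sale_amount")
plot_grid(p_grp, p_dlog, ncol = 2)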
Your literature review pointed you to the need for adding crime data to the merged dataset cleanhousinghpidata. For the moment, you want to gather crime data at the national-year level. When you find more time, you will gather it at a finer scale for each year. Visit the webpage https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/tables/table-1 and download the data: “Download Excel (Table 1)”, Crime in the United States by Volume and Rate per 100,000 Inhabitants, 2000–2019. Without changing anything in the downloaded Excel file, load it into R and name the data frame crimedata. Hint: you can load just the range of data cells and use col_names to rewrite the headers or short names of the variables.
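A hedged loading sketch; the file name, cell range, and column names below are all assumptions about the downloaded file's layout that you should adapt after opening it:
# All of these layout details are assumptions about the downloaded file
crimedata <- read_excel("table-1.xls",          # assumed file name
                        range = "A4:T24",        # assumed range of the data cells
                        col_names = c("year", "population",
                                      "violent_crime", "violent_crime_rate",
                                      "murder", "murder_rate",
                                      "rape", "rape_rate",
                                      "robbery", "robbery_rate",
                                      "assault", "assault_rate",
                                      "property_crime", "property_crime_rate",
                                      "burglary", "burglary_rate",
                                      "larceny", "larceny_rate",
                                      "mv_theft", "mv_theft_rate"))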
Remove the footnotes or remarks from the data frame and restrict the data frame to the years that are available in cleanhousinghpidata.
Generate the dataset cleancrimedata that has the variable year and the non-missing crime rate variables.
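A sketch of this step under the column-name assumptions used above (keeping year and any *_rate column that is not entirely missing):
# Restrict to the sample years (this also drops footnote rows without a valid year)
cleancrimedata <- crimedata %>%
  filter(year %in% unique(cleanhousinghpidata$year)) %>%
  select(year, ends_with("_rate")) %>%
  select(where(~ !all(is.na(.x))))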
Merge cleancrimedata with cleanhousinghpidata. Save the dataset as cleanhousinghpicrimedata. Plot crime rates against year. Also plot dlog_sale_amount against year for each value of year_built_grp. Discuss all these plots.
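A sketch of the merge and the two plots; violent_crime_rate is again an assumed column name:
# National crime rates merge on year only
cleanhousinghpicrimedata <- cleanhousinghpidata %>%
  left_join(cleancrimedata, by = "year")

# Crime rate over time
ggplot(cleancrimedata, aes(x = year, y = violent_crime_rate)) +
  geom_line()

# Mean deflated log price over time, by construction-year group
ggplot(cleanhousinghpicrimedata,
       aes(x = year, y = dlog_sale_amount, colour = factor(year_built_grp))) +
  stat_summary(fun = mean, geom = "line") +
  labs(colour = "year_built_grp")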
You finally reached your favorite stage, where you do things mechanically. Using the full sample of cleanhousinghpicrimedata, and keeping in mind the variables entering the hedonic price regressions that you will later estimate in PROBLEM 5, generate a table of summary statistics of sale_amount, dlog_sale_amount, crime rates, year_built, and all other housing characteristics that you will use. For variables that are log-transformed, present the summary statistics of the variable before and after the log-transformation. You may use the packages stargazer, xtable, kable, or any package that helps you produce well-formatted tables of descriptive statistics.
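For example, a hedged stargazer sketch; the variable subset and the housing-characteristic column names (bedrooms, bathrooms, sqft) are illustrative assumptions:
# stargazer summarizes a plain data frame directly when summary = TRUE
cleanhousinghpicrimedata %>%
  select(sale_amount, dlog_sale_amount, violent_crime_rate,   # illustrative subset
         year_built, bedrooms, bathrooms, sqft) %>%
  as.data.frame() %>%
  stargazer(type = "text", summary = TRUE,
            title = "Summary statistics, pooled sample")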
Generate a table of summary statistics of housing characteristics for each value of year_built_grp. Does the tendency differ from what you have seen in the pooled sample in part (i) above?
Generate a table of summary statistics of housing characteristics shown in part (i) for each census-designated division. Does the tendency differ from what you have seen in the pooled sample in part (i)?
Generate a table of summary statistics of housing characteristics shown in part (i) above for each census-designated region. U.S. census-designated regions are shown on page 2 of the census map referenced earlier (https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf). Does the tendency differ from what you have seen in the pooled sample in part (i)?
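One hedged way to build the by-group tables in parts (ii)-(iv) is to compute grouped statistics with dplyr and format them with kableExtra, as in this sketch (the variable list is illustrative):
# Grouped means and SDs; swap the grouping variable for division or region as needed
cleanhousinghpicrimedata %>%
  group_by(year_built_grp) %>%
  summarise(across(c(sale_amount, dlog_sale_amount, year_built),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE)))) %>%
  kbl(digits = 2, caption = "Summary statistics by year_built_grp") %>%
  kable_styling()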
You are interested in the effect of crime rates on house prices in the United States. Write your estimating equation, with \(\log\left(\frac{P_{it}}{HPI_{it}}\right)\) as the dependent variable. Specify your smaller set of crime rate variables of interest based on part (iv) of PROBLEM 3 and the additional housing characteristics in the equation.
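For instance, one hedged form such a hedonic specification might take (the specific crime rates and controls are yours to choose) is

\[
\log\left(\frac{P_{it}}{HPI_{it}}\right) = \alpha + \sum_{k}\beta_{k}\,\mathrm{CrimeRate}_{kt} + X_{it}'\gamma + \varepsilon_{it},
\]

where \(P_{it}\) is the sale price of house \(i\) in period \(t\), \(HPI_{it}\) is the house price index of the house's division in that period, \(\mathrm{CrimeRate}_{kt}\) are the selected national crime rates, and \(X_{it}\) collects the housing characteristics (e.g. year_built_grp, bedrooms, bathrooms, square footage).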
Estimate at least three different versions of your equation using the full sample of cleanhousinghpicrimedata and generate a formatted table summarizing the estimation results. You may use the packages stargazer, texreg, etable, or any package that helps you produce well-formatted estimation tables. Discuss which version or model you prefer and interpret the effects of crime rates on house prices.
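A sketch of three nested OLS specifications reported with stargazer; the crime rate and housing-characteristic column names are assumptions carried over from the sketches above:
# Three illustrative versions of the hedonic equation
m1 <- lm(dlog_sale_amount ~ violent_crime_rate, data = cleanhousinghpicrimedata)
m2 <- update(m1, . ~ . + property_crime_rate + factor(year_built_grp))
m3 <- update(m2, . ~ . + bedrooms + bathrooms + sqft)   # assumed housing controls
stargazer(m1, m2, m3, type = "text",
          title = "Deflated log sale price and crime rates")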
Re-estimate your preferred model for each census-designated region.
Based on the region-specific regressions in part (iii), store the estimated effect of a key crime rate variable in a new variable. Produce a map that shows how the estimated effect of your key crime rate variable varies across the United States.
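A hedged sketch of parts (iii) and (iv), assuming a region variable exists in cleanhousinghpicrimedata, that violent_crime_rate is the key crime rate variable, and that state_to_region is a small crosswalk you build (columns state and region) so states can be coloured by their region's estimate:
# Region-specific regressions; keep the coefficient on the key crime rate variable
region_effects <- cleanhousinghpicrimedata %>%
  group_by(region) %>%                                  # assumed region variable
  group_modify(~ tidy(lm(dlog_sale_amount ~ violent_crime_rate + bedrooms + bathrooms,
                         data = .x))) %>%
  filter(term == "violent_crime_rate") %>%
  select(region, estimate)

# Attach each state's region so the state map can show the region-level estimate
state_effects <- state_to_region %>%                    # assumed crosswalk: state, region
  left_join(region_effects, by = "region")

plot_usmap(data = state_effects, values = "estimate") +
  scale_fill_viridis_c(name = "Effect of violent_crime_rate")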
HAVE FUN AND KEEP FAITH IN THE FUN!