The US Housing Market Trends From 2001-2022
The United States Housing Market
The US housing market has seen many inclines and declines over the last two decades, whether it be the housing market crash of 2007-2010 or the current housing bubble, which some project could burst at any time.
I will be observing the trends surrounding the housing market crash of 2007-2010 to see if I can identify any potential markers for a repeat of the past in the coming years. To do so, I will use a data set of public real estate transaction records from 2001-2022, which was web scraped by Omnia Mahmoud Saeed.
If you would like to review the source data set yourself, you can find it at https://www.kaggle.com/datasets/omniamahmoudsaeed/real-estate-sales-2001-2022/data
About the Data Set
This data set contains 14 variables and over 1 million observations. The variables included are:
- Serial Number (of the sale)
- List Year
- Date Recorded (the date the sale was recorded)
- Town
- Address
- Assessed Value (value assigned to the property for tax purposes)
- Sale Amount (final sale price)
- Sales Ratio (ratio of the assessed value to the sale amount)
- Property Type (residential, commercial, etc.)
- Residential Type (specific type of residential property, e.g., Single Family)
- Non Use Code (for properties not used for their typical purpose, e.g., vacant land)
- Assessor Remarks (will likely not be used for my observations)
- OPM Remarks (will likely not be used for my observations)
- Location (longitude and latitude of location)
Thanks to the original author’s thorough web scraping, I am confident we will be able to find some interesting trends in the US housing market from 2001-2022!
Getting Started
For starters, I will need to load all of the libraries needed for my observations:

library(tidyverse)
library(plotly)
library(tidyr)
library(leaflet)
library(highcharter)
Now I will need to grab the data set from my directory and save it as a tibble named housing.
setwd("/home/andrewarsaw/DATA 110")
housing <- read_csv("Real_Estate_Sales_2001-2022_GL.csv")

Now that we have the data set saved in the global environment, I will start by cleaning the data. Thankfully, the original author has done a great job of creating a neat data set, so I will spend most of my time narrowing down the data I would like to observe. I will potentially create multiple tibbles so I can look at the data in several different ways.
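Before cleaning, a quick structural check can confirm the column types and show how sparsely some fields are populated (a minimal sketch using only functions from the packages loaded above):

# Quick look at column types and the total row count
glimpse(housing)

# Count missing values in each column to see which fields are sparsely populated
housing |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  arrange(desc(n_missing))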
Cleaning the Data
To start, I will need to adjust the variable names. I want to make them all lowercase and replace all blank spaces with underscores.
names(housing) <- tolower(names(housing))
names(housing) <- gsub(" ", "_", names(housing))
head(housing)

# A tibble: 6 × 14
serial_number list_year date_recorded town address assessed_value sale_amount
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2020177 2020 04/14/2021 Anso… 323 BE… 133000 248400
2 2020225 2020 05/26/2021 Anso… 152 JA… 110500 239900
3 2020348 2020 09/13/2021 Anso… 230 WA… 150500 325000
4 2020090 2020 12/14/2020 Anso… 57 PLA… 127400 202500
5 210288 2021 06/20/2022 Avon 12 BYR… 179990 362500
6 200500 2020 09/07/2021 Avon 245 NE… 217640 400000
# ℹ 7 more variables: sales_ratio <dbl>, property_type <chr>,
# residential_type <chr>, non_use_code <chr>, assessor_remarks <chr>,
# opm_remarks <chr>, location <chr>
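As an aside, the janitor package offers a one-step alternative to the renaming above (a minimal sketch; janitor is not loaded earlier, so this assumes it is installed and is shown purely for reference):

# Alternative cleanup: clean_names() lower-cases names and converts spaces to underscores
library(janitor)
housing <- clean_names(housing)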
For my first tibble, I will be grabbing the following variables:
- list year
- assessed value
- sale amount
- sales ratio
- property type
- residential type
I will remove any rows with a missing residential type and sort everything by list year. This tibble will be named house1.
house1 <- housing |>
select(list_year, assessed_value, sale_amount, sales_ratio,
property_type, residential_type) |>
group_by(list_year) |>
filter(!is.na(residential_type)) |>
  arrange(list_year)

I may create another tibble for further observations down the line, but for now this one should be good enough to work with.
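As a quick sanity check that the filter and ordering behaved as expected, one could count how many observations remain per list year (a small sketch, not part of the main analysis):

# Number of retained observations per list year
house1 |>
  count(list_year)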
Housing Market Correlation and Trends
Before drawing any conclusions, I would like to look at the relationship between assessed value and sale amount (and, later, the sales ratio) in a scatter plot to see whether any interesting trends appear.
p1 <- ggplot(house1, aes(x = assessed_value, y = sale_amount)) +
labs(title = "Assessed Value versus Sale Amount in US Housing Market (2006-2022) ",
x = "Assessed Value",
y = "Sale Amount",
caption = "Source: Webscraped by Omnia Mahmoud Saeed") +
theme_minimal()
p1 + geom_point()

# Turn off scientific notation on the axis labels
options(scipen = 999)

The outliers make it hard to read the scatter plot properly, so to get an accurate breakdown of the trends I will need to filter the data down some.
house2 <- house1 |>
filter(assessed_value <= 10000000) |>
  filter(sale_amount <= 20000000)

Now let's see how this filtered data appears.
p2 <- ggplot(house2, aes(x = assessed_value, y = sale_amount)) +
labs(title = "Assessed Value versus Sale Amount in US Housing Market (2006-2022) ",
x = "Assessed Value",
y = "Sale Amount",
caption = "Source: Webscraped by Omnia Mahmoud Saeed") +
theme_minimal()
p2 + geom_point()

Given this, we can see a positive trend: the higher the assessed value of a property, the higher the price (in US dollars) it sells for. For one final trend analysis, I would like to see how the sales ratio (the assessed value divided by the sale amount) changes over time.
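Before moving on, to put a rough number on the positive trend just noted, one could compute the correlation between assessed value and sale amount in the filtered data (a quick sketch; the exact value depends on the filtering above):

# Overall Pearson correlation between assessed value and sale amount
house2 |>
  ungroup() |>
  summarise(correlation = cor(assessed_value, sale_amount, use = "complete.obs"))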
For the sales ratio analysis, I will need to create a stratified random sample of the data, using slice_sample() within each list year and set.seed() to keep the sample reproducible (credit to ChatGPT for assisting in creating this random sample).
# Making the random sample reproducible
set.seed(511)
# Determine how many list years are in the data set (using the original data set here)
n_years <- housing |>
distinct(list_year) |>
nrow()
# Compute equal sample size per year
samples_per_year <- floor(900/n_years)
# Perform the stratified sampling of 900 total rows
sample_housing <- housing |>
group_by(list_year) |>
slice_sample(n = samples_per_year) |>
  ungroup()

Since I used the original data set for this chunk, I will need to clean the data again to make sure I am in the clear to move forward:
finalsample <- sample_housing |>
  select(list_year, assessed_value, sale_amount, sales_ratio)

Now that I have created a new tibble named finalsample, which pulls 880 observations from the original data set using a stratified random sample (so that each list year is represented equally in my statistical observations), we can proceed with further observations.
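A quick count confirms the stratification (a small check; assuming the data covers the 22 list years 2001-2022, floor(900/22) gives 40 rows per year, or 880 rows in total):

# 40 rows per list year, 880 rows overall
finalsample |>
  count(list_year)
nrow(finalsample)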
p3 <- ggplot(finalsample, aes(x = list_year, y = sales_ratio)) +
geom_smooth(method = "loess", se = TRUE, color = "#6a760c") +
labs(title = "US Housing Market Sales Ratio from 2006-2022",
x = "Year",
y = "Sales Ratio",
caption = "Source: Webscraped by Omnia Mahmoud Saeed") +
theme_minimal()
p3

`geom_smooth()` using formula = 'y ~ x'
This chart shows very interesting and unexpected results. As a reminder, the sales ratio is calculated by dividing the assessed value of a property by the final sale amount. The gray band shows the confidence interval; since the curve is fit to only 880 sampled data points, it is not surprising that the interval is so wide. Even so, there is a sharp decline in the sales ratio from roughly 2016 to 2022, falling from a peak of about 3.75 around 2017 to nearly zero by 2022.
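One way to sanity-check what the smoother suggests is to summarise the sales ratio by year directly (a small sketch; the exact values depend on the random sample):

# Median sales ratio per list year in the stratified sample
finalsample |>
  group_by(list_year) |>
  summarise(median_ratio = median(sales_ratio, na.rm = TRUE))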
That raises the question: what occurred in the housing market from 2017 to 2022 (and presumably up to the present) to create such a low sales ratio?
Housing Market Observations
Before making any conclusive remarks, I would like to explore the sample data a bit more to see if I can find any more details that will help guide me toward the correct conclusions.
highchart() |>
hc_title(text = "Assessed Value vs Sale Amount from 2001-2022") |>
hc_xAxis(categories = finalsample$list_year,
title = list(text = "Year"),
tickInterval = 1) |>
hc_yAxis(title = list(text = "Dollar Amount")) |>
hc_add_series(data = finalsample$assessed_value,
name = "Assessed Value",
type = "column",
color = "#2ca02c") |>
hc_add_series(data = finalsample$sale_amount,
name = "Sale Amount",
type = "line",
color = "#ff7f0e") |>
hc_chart(style = list(fontFamily = "AvantGarde",
fontWeight = "bold")) Disregarding the duplicate year outputs on the x-axis, we can immediately see a sharp increase of sale amount of houses occurring in 2015 (in comparison to the noticeably lower assessed value) . This is also in alignment with the previous chart, showcasing that something notable must have occurred in 2015 to cause such a sharp change, which must have also inadvertently caused the drastic decline in sales ratio from then until 2022.
Chart Assessment
After careful research, the sudden sharp spike in final sale prices for homes in 2015 relative to their assessed values appears to be a direct reflection of the recovery period after the housing market crash of 2008. According to Timothy Rye, “in a rising market, sales chasing can cause the study [or assessed value] to arrive at an inaccurately low value, and in a declining market sales chasing can cause [assessed values] to arrive at erroneously high values” (Rye). In other words, it is actually a good sign to see an assessed value sitting drastically below the sale value, and in turn a very low sales ratio. One easy way to see the concern with an assessed value exceeding the sale value is to look at the assessed value for 2008 in the previous chart: it is one of the very few times that actually happens, which on the whole is a good sign that the housing market is in reasonably good health. However, there have been other instances of it occurring, including 2011, 2017, and 2019, along with what appears to be a very tight sales ratio margin in the early 2020s. This repeated fluctuation, seemingly every other year in recent years, does not look like the sign of a stable market.
Another important factor to consider is that assessed value always lags behind its counterpart, the sale value. That is to say, the assessed value reflects a property's worth at an earlier point in time (the previous year, for instance), whereas the sale amount reflects the current price point. So, although the two are related, they do not move at exactly the same pace. For homeowners, it is actually more advantageous to purchase a home when assessed values are lower, as property taxes are determined from the assessed value of a property (Campisi). Looking back at the previous chart, this again speaks to the housing market crash of 2008, when owning a home appears to have been less worthwhile because property taxes would have spiked relative to other years. Regarding 2015 again, there were also multiple programs that helped stabilize the market at the time, including:
- Lowered FHA Mortgage Insurance Premiums, which went from roughly 1.35% to 0.85% in January 2015, saving FHA home buyers roughly $900/year (“FHA Launches”); a rough check of that figure follows this list
- Affirmatively Furthering Fair Housing (AFFH), a 2015 initiative that strengthened efforts to enforce fair housing and eliminate discriminatory practices (Exit Memo)
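That $900/year figure is consistent with a half-point premium cut on a typical FHA loan balance (a rough, hypothetical check; the $180,000 balance is an assumption, not a figure from the article):

# Hypothetical check: a 0.50 percentage point premium cut on a $180,000 loan balance
(0.0135 - 0.0085) * 180000  # = 900 dollars per year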
The only other point where the housing market appears to be in really good health is 2020, an interesting observation I would not have expected given that it falls during a global pandemic.
In conclusion, although there is clear evidence of the US housing market crash in my stratified random sample, it is not enough information to assume that history will repeat itself soon. That said, I still believe we must pay close attention to the health of the housing market, as it does not appear to have been especially stable in 2021-2022, assuming the trend has continued since then.
Before I close everything out, I wanted to throw in a map of the lowest- and highest-priced home in the data set for every year. This is not necessary for my analysis, but it is purely to give viewers something fun to look at (and I am curious to see what shows up!).
Lowest and Highest Priced Homes From 2001-2022
For this observation, I will need to create a few more tibbles before building the map. Since the original data set (housing) has a variable containing the longitude and latitude, I would like to create an interactive map from that information. To start, I will create a tibble that filters out rows with no location, then gather the lowest and highest sale per year, combine everything into one final tibble, and finally split the location variable into two new columns named long and lat for my leaflet chart.
house_map <- housing |>
select(list_year, sale_amount, location) |>
filter(!is.na(location))
map2 <- house_map |>
group_by(list_year) |>
slice_min(order_by = sale_amount, n = 1, with_ties = FALSE) |>
rename(min_sale = sale_amount, min_location = location) |>
select(list_year, min_sale, min_location) |>
left_join(
house_map |>
group_by(list_year) |>
slice_max(order_by = sale_amount, n = 1, with_ties = FALSE) |>
rename(max_sale = sale_amount, max_location = location) |>
select(list_year, max_sale, max_location),
by = "list_year"
) |>
arrange(list_year)
map_location <- map2 |>
select(list_year, min_location, max_location) |>
pivot_longer(cols = c(min_location, max_location),
names_to = "type",
names_pattern = "(min|max)_location",
values_to = "location")
map_sale <- map2 |>
select(list_year, min_sale, max_sale) |>
pivot_longer(cols = c(min_sale, max_sale),
names_to = "type",
names_pattern = "(min|max)_sale",
values_to = "sale")
finalmap <- left_join(map_location, map_sale, by = c("list_year", "type"))
# Credit to ChatGPT for assisting with code below this point
finalmap <- finalmap |>
mutate(
coords = str_match(location, "(?i)POINT\\s*\\(\\s*(-?\\d+\\.?\\d*)\\s+(-?\\d+\\.?\\d*)\\s*\\)"),
long = as.numeric(coords[, 2]),
lat = as.numeric(coords[, 3])) |>
select(-coords)
finalmap <- finalmap |>
  mutate(long = as.numeric(long), lat = as.numeric(lat))

Interactive Map of US Housing Market Data Set
Now that I have my final tibble to work with, named finalmap, we are ready to create the final chart.
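Before mapping, it is worth confirming that the regular expression produced usable coordinates for every row (a small check; ideally this returns zero rows):

# Rows where coordinate extraction failed
finalmap |>
  filter(is.na(long) | is.na(lat))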
leaflet(finalmap) |>
addProviderTiles(providers$Stadia.StamenTonerLite) |>
setView(lng = -72.5, lat = 41.6, zoom = 7) |>
addCircleMarkers(
lng = ~long,
lat = ~lat,
radius = 3,
color = ~ifelse(type == "min", "#9dc183", "#e74c3c"),
stroke = TRUE,
fillOpacity = 0.7,
popup = ~paste0(
"<b>Year:</b>", list_year, "<br>",
"<b>Sale Amount:</b> $", formatC(sale, format = "f", big.mark = ",", digits = 0)
)
  )

I must take into account that not all observations in the housing data set had location points recorded, which is why the map appears as skewed as it does. This leads me to believe the author is either located in this area or has a keen interest in it!
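As a side note, a quick count shows just how sparse the location field is in the full data set (a small check; the exact share depends on the version of the file):

# How many sales have coordinates recorded at all?
housing |>
  summarise(with_location = sum(!is.na(location)), total = n())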
Exploring this data set has been exciting. It is my first time working with such a massive data set, and it came with its own challenges (some of which I unfortunately was unable to solve in a timely manner). I look forward to looking back on this project and seeing how far I have come!
Bibliography
- Exit Memo: Department of Housing and Urban Development. Obama White House Archives, 2017, obamawhitehouse.archives.gov/administration/cabinet/exit-memos/department-housing-urban-development.
- “FHA Launches Historic Homebuyer ‘Care Package’ for 2015.” Inman, 19 Feb. 2015, www.inman.com/2015/02/19/fha-launches-historic-homebuyer-care-package-for-2015/.
- Campisi, Natalie. “Assessed Value vs. Market Value: What’s the Difference?” Forbes Advisor, 7 Mar. 2023, www.forbes.com/advisor/mortgages/assessed-value-vs-market-value/.
- Rye, Timothy A. “How Does the Sale of a Property Affect Its Assessed Value?” Larkin Hoffman, 30 July 2019, www.larkinhoffman.com/real-estate-construction-blog/how-does-the-sale-of-a-property-affect-its-assessed-value.