#install.packages('gganimate')
library(ggplot2)
This project will explore the Texas Housing data set, a collection of real estate and house sales data for the state of Texas from 2000 to 2015. The data set was collected and is maintained by Texas A&M University’s Texas Real Estate Research Center.
This investigation will examine the Texas housing market in the wake of the 2008 Global Recession and the 2000s US housing bubble. We will explore the dynamics of the housing market prior to the market crash of September 2008, and attempt to infer the similarities and differences between the Texas real estate market in the periods before and after the crash.
We will also observe differences in housing market behavior across the cities and regions present in our data set, along with an investigation into the regional effects of the 2008 Recession.
This exploration is especially relevant to those interested in real estate investment, and to banks and financial regulators seeking to predict potential bubbles, shocks, and inefficiencies in the housing market. It may also be useful to homeowners and investors evaluating the safety of their investments.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
txhousing = na.omit(txhousing)
In this investigation, we will explore the dynamics of the housing market across the period 2000 - 2015 and observe the following 4 parameters: the total number of sales, the total $ value (volume) of sales, the total number of listings, and the mean of the city-level median sale prices.
We start by creating a tsibble (time-series tibble) to store these values and observe their behavior over time:
library(tsibble)
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
### Grouping our data set by year and month to calculate aggregate values of sales, volume, listings, and price
combined = txhousing |> group_by(year, month) |> summarize(total_value = sum(volume),
total_sales = sum(sales),
total_listings = sum(listings),
median_price = mean(median))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
test_dates <- as.Date(paste(combined$year, combined$month, "01", sep = "-"), format = "%Y-%m-%d")
combined$date = test_dates
combined = as_tsibble(combined)
## Using `date` as index variable.
combined
## # A tsibble: 187 x 7 [1D]
## # Groups: year [16]
## year month total_value total_sales total_listings median_price date
## <int> <int> <dbl> <dbl> <dbl> <dbl> <date>
## 1 2000 1 1609824758 11314 75578 92278. 2000-01-01
## 2 2000 2 2216113462 15585 77071 91342. 2000-02-01
## 3 2000 3 2915202724 20081 76505 93194. 2000-03-01
## 4 2000 4 2721063916 18537 79361 94649. 2000-04-01
## 5 2000 5 3348802389 22222 81290 95038. 2000-05-01
## 6 2000 6 3556391797 22799 78462 101734. 2000-06-01
## 7 2000 7 3242934441 20966 83877 99757. 2000-07-01
## 8 2000 8 3367394146 22155 84907 97439. 2000-08-01
## 9 2000 9 2710203553 17761 84406 98730. 2000-09-01
## 10 2000 10 2597674395 17143 86657 98286. 2000-10-01
## # ℹ 177 more rows
# Plot using ggplot
start = as.Date('2006-12-01')
end = as.Date('2010-12-01')
ggplot(combined, aes(x = date, y = total_sales)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf,
           fill = "lightblue", alpha = 0.3) +
geom_line() +
labs(x = "Date", y = "Sales", title = "Number of Houses Sold in Texas from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#transition_states(test_dates)
In the graph above, the beginning of the 2008 Recession in September 2008 is marked with a red dashed line. We have also included a locally estimated scatterplot smoothing (LOESS) line to observe general trends across the seasons in our data set.
The most interesting observation in this plot is the sharp decline in the number of house sales roughly a year before the start of the 2008 Recession. This may be indicative of the peak of the housing bubble that popped in 2008. The number of sales declined steadily from this peak in early 2007 to early 2009, and then rose almost without interruption until 2015.
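As a quick numeric check on that reading of the chart, we can locate the pre-crash sales peak directly (a minimal sketch, assuming the combined series built above):
# Sketch: find the month with the highest total sales before September 2008
pre_crash = combined |> filter(date < as.Date("2008-09-01"))
pre_crash$date[which.max(pre_crash$total_sales)]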
# Plot using ggplot
ggplot(combined, aes(x = date, y = total_listings)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Listings", title = "Number of Home Listings in Texas from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Interestingly, and contrary to our expectations, the number of listings rose significantly in the months prior to the recession. In fact, listings peak right before the 2008 Recession and do not return to similar levels until much later.
This is a good illustration of the market over-capacity that often underpins real estate bubbles. We also observe that the number of listings drops significantly from 2010 to 2015, reaching fairly low levels towards the end of the data set.
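To put a rough number on that decline, a minimal sketch comparing average monthly listings in the run-up to the crash against the post-2010 period (assuming the same combined series):
# Average monthly listings: bubble run-up vs. post-2010
mean(combined$total_listings[combined$date >= as.Date("2006-01-01") & combined$date < as.Date("2008-09-01")])
mean(combined$total_listings[combined$date >= as.Date("2010-01-01")])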
Our previous graph showed only the number of houses sold, without regard to their $ value. We therefore plot the total $ volume of sales over time to see whether the pattern resembles that of sales counts.
# Plot using ggplot
ggplot(combined, aes(x = date, y = total_value)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Volume", title = "$ Value of Houses Sold in Texas from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As seen in the graph above, the $ value of house sales follows a very similar pattern to the number of sales, with a peak in early 2007 and a steady decline until early 2009. This is to be expected, as the total $ value of sales would typically track the number of sales roughly proportionally.
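We can check this proportionality directly with a one-line sketch; a correlation close to 1 would support treating the two series as near-interchangeable:
# Correlation between monthly sales counts and monthly $ volume
cor(combined$total_sales, combined$total_value)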
ggplot(combined, aes(x = date, y = median_price)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Median Price", title = "Median Home Prices in Texas from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This graph is more interesting: the median price of houses in Texas plateaus for around a year before the 2008 Recession and stays relatively flat for a few years after. This price stability is worth noting, as it lets us observe the dynamics of the market right as banks began to realize that a large share of mortgages were in fact subprime.
However, it could also simply mean that fewer houses were being sold, so prices did not adjust to reflect the fall in demand. It is also possible that fewer houses were being constructed at this time; a lack of newly built houses could likewise explain the lack of sales in Texas during this period.
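One way to quantify the plateau is to compare year-over-year changes in the annual average of the median price; a rough sketch:
# Annual average of the monthly median price, with year-over-year % change
yearly = combined |> as_tibble() |> group_by(year) |> summarize(avg_median = mean(median_price))
yearly |> mutate(yoy_pct = 100 * (avg_median / lag(avg_median) - 1))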
# Plot using ggplot
filtered_df <- txhousing %>%
filter(year %in% c(2006, 2007, 2008, 2009))
ggplot(filtered_df, aes(x = factor(year), y = median)) +
geom_boxplot() +
facet_wrap(~year, scales = "free_x") +
labs(title = "Boxplot of Median Price by Year",
       x = "Year",
y = "Median Price")
Interestingly, we observe that the distributions of median prices in the years directly preceding and following the 2008 Recession and housing-bubble crash remain very similar. This is interesting, given the sharp declines in sales and listings we observed over the same window.
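As a numeric companion to the boxplots, a sketch summarizing the quartiles of the city-level median prices for each of these years:
# Quartiles of city-level median prices around the crash
filtered_df |> group_by(year) |> summarize(q1 = quantile(median, 0.25),
                                           med = quantile(median, 0.5),
                                           q3 = quantile(median, 0.75))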
The documentation of this data set describes the inventory column as the “amount of time it would take to sell all current listings at current pace of sales.”
This gives us an interesting derived quantity, the “pace of sales,” which describes how quickly listed houses are being sold at a given point in time. The 2008 Recession and housing bubble caused a sharp drop in housing asset prices, making it likely that homes suddenly became worth much less very quickly. I suspect this might have resulted in a sharp increase in the pace of home sales during this period, as investors would want to buy as many discounted assets as possible in a given period of time.
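# Since inventory = listings / (monthly pace of sales), the implied pace is listings / inventory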
txhousing$sale_pace = txhousing$listings / txhousing$inventory
combined = txhousing |> group_by(year, month) |> summarize(total_value = sum(volume),
total_sales = sum(sales),
total_listings = sum(listings),
median_price = mean(median),
mean_pace = mean(sale_pace))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
test_dates <- as.Date(paste(combined$year, combined$month, "01", sep = "-"), format = "%Y-%m-%d")
combined$date = test_dates
combined = as_tsibble(combined)
## Using `date` as index variable.
ggplot(combined, aes(x = date, y = mean_pace)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Pace", title = "Change in Pace of Sales from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
Interestingly, we observe the opposite: the recession period shows a very steep fall in the pace of sales. This steep fall indicates that home purchases were very low in spite of a high number of listings.
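To pin down when the pace bottomed out, a quick sketch over the same series (guarding against the single non-finite value the smoother warned about above):
# Month with the slowest mean pace of sales
finite_pace = replace(combined$mean_pace, !is.finite(combined$mean_pace), NA)
combined$date[which.min(finite_pace)]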
An important aspect to consider is that our data may be strongly biased towards urban areas. It could be that larger cities were affected much more severely by the Recession and housing bubble than non-urban areas.
First, it is important to consider how much of our data is concentrated in large cities. To do so, we plot the number of listings by city across the entirety of our data set.
city_counts = txhousing |> group_by(city) |> summarize(total_sales = sum(sales),
total_listings = sum(listings),
total_value = sum(volume))
city_counts$city <- factor(city_counts$city, levels = city_counts$city[order(-city_counts$total_sales)])
ggplot(city_counts, aes(x = city, y = total_listings)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
ggtitle("Total Listings Across Cities")
We observe that our data is heavily centered around 4 cities in Texas:
Houston
Dallas
Austin
San Antonio
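To quantify this concentration, a short sketch computing the share of all listings accounted for by these four cities (assuming the city_counts table built above):
# Share of all listings in the top 4 cities
top4 = c("Houston", "Dallas", "Austin", "San Antonio")
sum(city_counts$total_listings[city_counts$city %in% top4]) / sum(city_counts$total_listings)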
In fact, Houston and Dallas account for a significantly higher share than Austin and San Antonio, implying that these two cities might be disproportionately impacted by real estate bubbles. We can observe this in the following chart as well:
city_data = txhousing |> group_by(city, year, month) |> summarize(total_sales = sum(sales),
total_listings = sum(listings),
total_value = sum(volume))
## `summarise()` has grouped output by 'city', 'year'. You can override using the
## `.groups` argument.
test_dates <- as.Date(paste(city_data$year, city_data$month, "01", sep = "-"), format = "%Y-%m-%d")
city_data$date = test_dates
top_4_cities <- city_data[city_data$city %in% c("Houston","Dallas","Austin","San Antonio"), ]
ggplot(top_4_cities, aes(x = date, y = total_listings, color = city)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Listings", title = "Number of Home Listings in the Top 4 Cities from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
theme(legend.position = "bottom")
Clearly, Houston and Dallas have a disproportionately higher number of listings than Austin or San Antonio. We can also observe that Austin and San Antonio do not exhibit a fall in listings during the recession period; their listings stay roughly constant throughout.
We can see the same pattern when we look at $ sales value over time:
ggplot(top_4_cities, aes(x = date, y = total_value, color = city)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Volume", title = "Change in $ Value of Sales in the Top 4 Cities from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
theme(legend.position = "bottom")
As seen above, Austin and San Antonio do not change as strongly as Dallas or Houston. This raises several questions about how differently the Recession impacted different cities. Is it possible that certain cities like Houston and Dallas were more affected by the housing bubble and Recession than others?
This is worth exploring because it lets us examine how a city’s location, population, and size can affect asset values.
Since we see such variation in the effects of the recession across cities, perhaps we can run a logistic regression to see whether we can predict if a home sale comes from our top-2 cities of Houston and Dallas, or from any other city in our data set.
We begin by creating a binary variable in our data set called city_binary, which takes a value of 1 if a house sale is registered in either Houston or Dallas and a value of 0 otherwise.
txhousing$city_binary <- 0
txhousing$city_binary[txhousing$city %in% c("Houston", "Dallas")] <- 1
We then restrict our data to the period from 2006 to 2010, when the housing bubble and the recession that followed were at their peak.
recession_period = txhousing[txhousing$year %in% 2006:2010, ]
recession_period |>
ggplot(mapping = aes(x = sales, y = city_binary)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Jittered Binary City Indicator (Houston/Dallas) vs. Sales") +
theme_minimal()
model <- glm(city_binary ~ sales + inventory + median + month , data = recession_period,
family = binomial(link = 'logit'))
model$coefficients
## (Intercept) sales inventory median month
## -9.421408e+00 7.391701e-04 -2.987745e-01 4.792902e-05 -3.224484e-02
summary(model)
##
## Call:
## glm(formula = city_binary ~ sales + inventory + median + month,
## family = binomial(link = "logit"), data = recession_period)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.28250 -0.14845 -0.08112 -0.05168 2.38126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.421e+00 1.672e+00 -5.636 1.75e-08 ***
## sales 7.392e-04 9.380e-05 7.881 3.26e-15 ***
## inventory -2.988e-01 1.294e-01 -2.309 0.021 *
## median 4.793e-05 8.293e-06 5.779 7.50e-09 ***
## month -3.224e-02 5.589e-02 -0.577 0.564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 372.57 on 879 degrees of freedom
## Residual deviance: 206.43 on 875 degrees of freedom
## AIC: 216.43
##
## Number of Fisher Scoring iterations: 8
At first glance, most of our coefficient values appear fairly small. Our intercept of -9.42 represents the log-odds of the outcome variable being in the positive class when all predictor variables are set to zero. The intercept also has an extremely small p-value of 1.75e-08, implying that it is highly significant.
Our coefficient for sales is 0.000739, a small but positive value, implying that the log-odds of belonging to either Houston or Dallas rise by 0.000739 for every marginal increase in the number of sales. The p-value for sales is also extremely small, at 3.26e-15, indicating that this coefficient is significant.
Our coefficient for inventory is -0.299, which is (relative to the other coefficients) larger and negative, implying that the log-odds of belonging to either Houston or Dallas fall by 0.299 for every additional month of inventory. The p-value for inventory is 0.021, below the 0.05 cutoff, indicating a statistically significant relationship between inventory and the city_binary variable.
Our coefficient for median is 0.0000479, a positive value and the smallest of the coefficients. This implies that the log-odds of belonging to either Houston or Dallas rise by a meagre 0.0000479 for every $1 increase in median price. The p-value is again extremely small at 7.5e-09, indicating a statistically significant relationship.
Our coefficient for month is -0.0322, implying that sales later in the year have slightly lower log-odds of belonging to either Houston or Dallas. However, the p-value for this coefficient is 0.564, well above the standard 0.05 cutoff, so we cannot conclude that there is any real relationship between month and the city indicator.
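Since log-odds are hard to interpret directly, it can help to exponentiate the coefficients into odds ratios; a minimal sketch using the fitted model:
# Convert log-odds coefficients to odds ratios
exp(coef(model))
# For example, exp(-0.299) is roughly 0.74: each additional month of inventory
# multiplies the odds that an observation comes from Houston or Dallas by about 0.74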
In this exploration, we observed how various housing-market variables in Texas changed over the period 2000 - 2015, and how they behaved during the 2008 financial crisis and housing bubble. We also observed that the effect of the housing bubble was strongest in large urban markets like Houston and Dallas. This relationship appeared in our logistic regression model, where we attempted to predict whether an observation came from Houston or Dallas based on sales, inventory, median sale price, and month, and found three statistically significant relationships between these columns and the city indicator.