#install.packages('gganimate')
library(ggplot2)
This project will explore the Texas Housing data set, a collection of real estate and house sales data for the state of Texas from 2000 to 2015. The data set was collected and is maintained by Texas A&M University’s Texas Real Estate Research Center.
This investigation will examine the Texas housing market in the wake of the 2008 Global Recession and the 2000s US housing bubble. We will explore the dynamics of the housing market prior to the market crash of September 2008, and attempt to infer the similarities and differences between the Texas real estate market in the periods before and after the crash.
We will also observe differences in housing market behavior across the cities and regions present in our data set, along with an investigation into the regional effects of the 2008 Recession.
This exploration is especially relevant to those interested in real estate investment, and to banks and financial regulators seeking to predict potential bubbles, shocks, and inefficiencies in the housing market. It may also be useful to homeowners and investors evaluating the safety of their investments.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
txhousing = na.omit(txhousing)
In this investigation, we will explore the dynamics of the housing market across the period 2000 - 2015 and observe the following 4 parameters: the total number of sales, the total $ value (volume) of sales, the total number of listings, and the mean of the city-level median sale prices.
We start by creating a tsibble (time-series tibble) to store these values and observe their behavior over time:
library(tsibble)
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
### Grouping our data set by year and month to calculate aggregate values of sales, volume, listings, and price
combined = txhousing |> group_by(year, month) |> summarize(total_value = sum(volume),
total_sales = sum(sales),
total_listings = sum(listings),
median_price = mean(median))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
test_dates <- as.Date(paste(combined$year, combined$month, "01", sep = "-"), format = "%Y-%m-%d")
combined$date = test_dates
combined = as_tsibble(combined)
## Using `date` as index variable.
combined
## # A tsibble: 187 x 7 [1D]
## # Groups: year [16]
## year month total_value total_sales total_listings median_price date
## <int> <int> <dbl> <dbl> <dbl> <dbl> <date>
## 1 2000 1 1609824758 11314 75578 92278. 2000-01-01
## 2 2000 2 2216113462 15585 77071 91342. 2000-02-01
## 3 2000 3 2915202724 20081 76505 93194. 2000-03-01
## 4 2000 4 2721063916 18537 79361 94649. 2000-04-01
## 5 2000 5 3348802389 22222 81290 95038. 2000-05-01
## 6 2000 6 3556391797 22799 78462 101734. 2000-06-01
## 7 2000 7 3242934441 20966 83877 99757. 2000-07-01
## 8 2000 8 3367394146 22155 84907 97439. 2000-08-01
## 9 2000 9 2710203553 17761 84406 98730. 2000-09-01
## 10 2000 10 2597674395 17143 86657 98286. 2000-10-01
## # ℹ 177 more rows
# Plot using ggplot
start = as.Date('2006-12-01')
end = as.Date('2010-12-01')
ggplot(combined, aes(x = date, y = total_sales)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf,
           fill = "lightblue", alpha = 0.3) +
geom_line() +
labs(x = "Date", y = "Sales", title = "Number of Houses Sold in Texas from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#transition_states(test_dates)
In the graph above, the beginning of the 2008 Recession in September 2008 is marked with a red dashed line. We have also included a locally estimated scatterplot smoothing (LOESS) line to observe general trends across the seasons in our data set.
The most interesting observation in this plot is the sharp decline in the number of house sales roughly a year before the start of the 2008 Recession. This may be indicative of the peak of the housing bubble that popped in 2008. The number of sales declined steadily from this peak in early 2007 to early 2009, and then rose almost without interruption until 2015.
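As a quick numeric check on that reading of the chart, we can locate the pre-crash sales peak directly (a minimal sketch, assuming the combined series built above):
# Sketch: find the month with the highest total sales before September 2008
pre_crash = combined |> filter(date < as.Date("2008-09-01"))
pre_crash$date[which.max(pre_crash$total_sales)]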
# Plot using ggplot
ggplot(combined, aes(x = date, y = total_listings)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Listings", title = "Number of Home Listings in Texas from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Interestingly, and contrary to our expectations, the number of listings rose significantly in the months prior to the recession. In fact, listings peak right before the 2008 Recession and do not return to similar levels until much later.
This is a good illustration of the market over-capacity that often underpins real estate bubbles. We also observe that the number of listings drops significantly from 2010 to 2015, reaching fairly low levels towards the end of the data set.
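To put a rough number on that decline, a minimal sketch comparing average monthly listings in the run-up to the crash against the post-2010 period (assuming the same combined series):
# Average monthly listings: bubble run-up vs. post-2010
mean(combined$total_listings[combined$date >= as.Date("2006-01-01") & combined$date < as.Date("2008-09-01")])
mean(combined$total_listings[combined$date >= as.Date("2010-01-01")])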
Our previous graph showed only the number of houses sold, without regard to their $ value. We therefore plot the total $ volume of sales over time to see whether the pattern resembles that of sales counts.
# Plot using ggplot
ggplot(combined, aes(x = date, y = total_value)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Volume", title = "$ Value of Houses Sold in Texas from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As seen in the graph above, the $ value of house sales follows a very similar pattern to the number of sales, with a peak in early 2007 and a steady decline until early 2009. This is to be expected, as the total $ value of sales would typically track the number of sales roughly proportionally.
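We can check this proportionality directly with a one-line sketch; a correlation close to 1 would support treating the two series as near-interchangeable:
# Correlation between monthly sales counts and monthly $ volume
cor(combined$total_sales, combined$total_value)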
ggplot(combined, aes(x = date, y = median_price)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Median Price", title = "Median Home Prices in Texas from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This graph is more interesting: the median price of houses in Texas plateaus for around a year before the 2008 Recession and stays relatively flat for a few years after. This price stability is worth noting, as it lets us observe the dynamics of the market right as banks began to realize that a large share of mortgages were in fact subprime.
However, it could also simply mean that fewer houses were being sold, so prices did not adjust to reflect the fall in demand. It is also possible that fewer houses were being constructed at this time; a lack of newly built houses could likewise explain the lack of sales in Texas during this period.
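One way to quantify the plateau is to compare year-over-year changes in the annual average of the median price; a rough sketch:
# Annual average of the monthly median price, with year-over-year % change
yearly = combined |> as_tibble() |> group_by(year) |> summarize(avg_median = mean(median_price))
yearly |> mutate(yoy_pct = 100 * (avg_median / lag(avg_median) - 1))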
# Plot using ggplot
filtered_df <- txhousing %>%
filter(year %in% c(2006, 2007, 2008, 2009))
ggplot(filtered_df, aes(x = factor(year), y = median)) +
geom_boxplot() +
facet_wrap(~year, scales = "free_x") +
labs(title = "Boxplot of Median Price by Year",
       x = "Year",
y = "Median Price")
Interestingly, we observe that the distributions of median prices in the years directly preceding and following the 2008 Recession and housing-bubble crash remain very similar. This is interesting, given the sharp declines in sales and listings we observed over the same window.
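As a numeric companion to the boxplots, a sketch summarizing the quartiles of the city-level median prices for each of these years:
# Quartiles of city-level median prices around the crash
filtered_df |> group_by(year) |> summarize(q1 = quantile(median, 0.25),
                                           med = quantile(median, 0.5),
                                           q3 = quantile(median, 0.75))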
The documentation of this data set describes the inventory column as the “amount of time it would take to sell all current listings at current pace of sales.”
This gives us an interesting derived quantity, the “pace of sales,” which describes how quickly listed houses are being sold at a given point in time. The 2008 Recession and housing bubble caused a sharp drop in housing asset prices, making it likely that homes suddenly became worth much less very quickly. I suspect this might have resulted in a sharp increase in the pace of home sales during this period, as investors would want to buy as many discounted assets as possible in a given period of time.
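# Since inventory = listings / (monthly pace of sales), the implied pace is listings / inventory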
txhousing$sale_pace = txhousing$listings / txhousing$inventory
combined = txhousing |> group_by(year, month) |> summarize(total_value = sum(volume),
total_sales = sum(sales),
total_listings = sum(listings),
median_price = mean(median),
mean_pace = mean(sale_pace))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
test_dates <- as.Date(paste(combined$year, combined$month, "01", sep = "-"), format = "%Y-%m-%d")
combined$date = test_dates
combined = as_tsibble(combined)
## Using `date` as index variable.
ggplot(combined, aes(x = date, y = mean_pace)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Pace", title = "Change in Pace of Sales from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
geom_smooth(span=0.2, color = 'blue', se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
Interestingly, we observe the opposite: the recession period shows a very steep fall in the pace of sales. This steep fall indicates that home purchases were very low in spite of a high number of listings.
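To pin down when the pace bottomed out, a quick sketch over the same series (guarding against the single non-finite value the smoother warned about above):
# Month with the slowest mean pace of sales
finite_pace = replace(combined$mean_pace, !is.finite(combined$mean_pace), NA)
combined$date[which.min(finite_pace)]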
An important aspect to consider is that our data may be strongly biased towards urban areas. It could be that larger cities were affected much more severely by the Recession and housing bubble than non-urban areas.
First, it is important to consider how much of our data is concentrated in large cities. To do so, we plot the number of listings by city across the entirety of our data set.
city_counts = txhousing |> group_by(city) |> summarize(total_sales = sum(sales),
total_listings = sum(listings),
total_value = sum(volume))
city_counts$city <- factor(city_counts$city, levels = city_counts$city[order(-city_counts$total_sales)])
ggplot(city_counts, aes(x = city, y = total_listings)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
ggtitle("Total Listings Across Cities")
We observe that our data is heavily centered around 4 cities in Texas:
Houston
Dallas
Austin
San Antonio
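To quantify this concentration, a short sketch computing the share of all listings accounted for by these four cities (assuming the city_counts table built above):
# Share of all listings in the top 4 cities
top4 = c("Houston", "Dallas", "Austin", "San Antonio")
sum(city_counts$total_listings[city_counts$city %in% top4]) / sum(city_counts$total_listings)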
In fact, Houston and Dallas account for a significantly higher share than Austin and San Antonio, implying that these two cities might be disproportionately impacted by real estate bubbles. We can observe this in the following chart as well:
city_data = txhousing |> group_by(city, year, month) |> summarize(total_sales = sum(sales),
total_listings = sum(listings),
total_value = sum(volume))
## `summarise()` has grouped output by 'city', 'year'. You can override using the
## `.groups` argument.
test_dates <- as.Date(paste(city_data$year, city_data$month, "01", sep = "-"), format = "%Y-%m-%d")
city_data$date = test_dates
top_4_cities <- city_data[city_data$city %in% c("Houston","Dallas","Austin","San Antonio"), ]
ggplot(top_4_cities, aes(x = date, y = total_listings, color = city)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Listings", title = "Number of Home Listings in the Top 4 Cities from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
theme(legend.position = "bottom")
Clearly, Houston and Dallas have a disproportionately higher number of listings than Austin or San Antonio. We can also observe that Austin and San Antonio do not exhibit a fall in listings during the recession period; their listings stay roughly constant throughout.
We can see the same pattern when we look at $ sales value over time:
ggplot(top_4_cities, aes(x = date, y = total_value, color = city)) +
  annotate("rect", xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = "lightblue", alpha = 0.3) +
geom_line() +
  labs(x = "Date", y = "Volume", title = "Change in $ Value of Sales in the Top 4 Cities from 2000 - 2015") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4)) +
geom_vline(xintercept = as.Date("2008-09-01"), linetype = "dashed", color = "red") +
theme(legend.position = "bottom")
As seen above, Austin and San Antonio do not change as strongly as Dallas or Houston. This raises several questions about how differently the Recession impacted different cities. Is it possible that certain cities like Houston and Dallas were more affected by the housing bubble and Recession than others?
This is worth exploring because it lets us examine how a city’s location, population, and size can affect asset values.
Since we see such variation in the effects of the recession across cities, perhaps we can run a logistic regression to see whether we can predict if a home sale comes from our top-2 cities of Houston and Dallas, or from any other city in our data set.
We begin by creating a binary variable in our data set called city_binary, which takes a value of 1 if a house sale is registered in either Houston or Dallas and a value of 0 otherwise.
txhousing$city_binary <- 0
txhousing$city_binary[txhousing$city %in% c("Houston", "Dallas")] <- 1
We then restrict our data to the period from 2006 to 2010, when the housing bubble and the recession that followed were at their peak.
recession_period = txhousing[txhousing$year %in% 2006:2010, ]
recession_period |>
ggplot(mapping = aes(x = sales, y = city_binary)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Jittered Binary City Indicator (Houston/Dallas) vs. Sales") +
theme_minimal()
model <- glm(city_binary ~ sales + inventory + median + month , data = recession_period,
family = binomial(link = 'logit'))
model$coefficients
## (Intercept) sales inventory median month
## -9.421408e+00 7.391701e-04 -2.987745e-01 4.792902e-05 -3.224484e-02
summary(model)
##
## Call:
## glm(formula = city_binary ~ sales + inventory + median + month,
## family = binomial(link = "logit"), data = recession_period)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.28250 -0.14845 -0.08112 -0.05168 2.38126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.421e+00 1.672e+00 -5.636 1.75e-08 ***
## sales 7.392e-04 9.380e-05 7.881 3.26e-15 ***
## inventory -2.988e-01 1.294e-01 -2.309 0.021 *
## median 4.793e-05 8.293e-06 5.779 7.50e-09 ***
## month -3.224e-02 5.589e-02 -0.577 0.564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 372.57 on 879 degrees of freedom
## Residual deviance: 206.43 on 875 degrees of freedom
## AIC: 216.43
##
## Number of Fisher Scoring iterations: 8
At first glance, most of our coefficient values appear fairly small. Our intercept of -9.42 represents the log-odds of the outcome variable being in the positive class when all predictor variables are set to zero. The intercept also has an extremely small p-value of 1.75e-08, implying that it is highly significant.
Our coefficient for sales is 0.000739, a small but positive value, implying that the log-odds of belonging to either Houston or Dallas rise by 0.000739 for every marginal increase in the number of sales. The p-value for sales is also extremely small, at 3.26e-15, indicating that this coefficient is significant.
Our coefficient for inventory is -0.299, which is (relative to the other coefficients) larger and negative, implying that the log-odds of belonging to either Houston or Dallas fall by 0.299 for every additional month of inventory. The p-value for inventory is 0.021, below the 0.05 cutoff, indicating a statistically significant relationship between inventory and the city_binary variable.
Our coefficient for median is 0.0000479, a positive value and the smallest of the coefficients. This implies that the log-odds of belonging to either Houston or Dallas rise by a meagre 0.0000479 for every $1 increase in median price. The p-value is again extremely small at 7.5e-09, indicating a statistically significant relationship.
Our coefficient for month is -0.0322, implying that sales later in the year have slightly lower log-odds of belonging to either Houston or Dallas. However, the p-value for this coefficient is 0.564, well above the standard 0.05 cutoff, so we cannot conclude that there is any real relationship between month and the city indicator.
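Since log-odds are hard to interpret directly, it can help to exponentiate the coefficients into odds ratios; a minimal sketch using the fitted model:
# Convert log-odds coefficients to odds ratios
exp(coef(model))
# For example, exp(-0.299) is roughly 0.74: each additional month of inventory
# multiplies the odds that an observation comes from Houston or Dallas by about 0.74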
In this exploration, we observed how various housing-market variables in Texas changed over the period 2000 - 2015, and how they behaved during the 2008 financial crisis and housing bubble. We also observed that the effect of the housing bubble was strongest in large urban markets like Houston and Dallas. This relationship appeared in our logistic regression model, where we attempted to predict whether an observation came from Houston or Dallas based on sales, inventory, median sale price, and month, and found three statistically significant relationships between these columns and the city indicator.