This project is based on Zillow’s monthly home sales count data for Cincinnati, OH, spanning from February 2008 to October 2024. The dataset includes 202 observations, each representing the estimated number of unique properties that sold during the month. Published as part of Zillow’s open housing research initiative (zillow.com/research/data), this dataset provides a reliable measure of regional housing activity and demand over time.
Home sales volume could be influenced by a mix of seasonal, economic, and local factors. Based on the dataset, we can see that the sales increase during spring and summer and slow in winter. Other factors such as interest rates, employment levels, inflation, and buyer confidence also may drive cyclical trends. Local supply, inventory levels, affordability, and policy shifts also contribute to variability. We can see some clear seasonal pattern and there is a reasonable lenghth of time span, it maybe easier for us to analyze and forecast. However, there could also be some disruptions during Covid19 where economics were in the downstream, which could also affect housing sales.
I will begin with the summary statistics and take a look at some key features of the dataset, such as number of observations, mean, median, standard deviation and so on.
Code
# Load librarieslibrary(tidyverse)library(lubridate)library(scales)library(ggthemes)library(knitr)# Load datadf <-read_csv("../dataset/zillow_sales.csv")# Convert date column to Date typedf2 <- df %>%mutate(date =ymd(date))# Summary statistics tablesummary_stats <- df2 %>%summarise(`Number of Observations`=n(),Mean =mean(zillow_sales),Median =median(zillow_sales),Mode =as.numeric(names(sort(table(zillow_sales), decreasing =TRUE)[1])),`Standard Deviation`=sd(zillow_sales),Minimum =min(zillow_sales),Maximum =max(zillow_sales),Range =max(zillow_sales) -min(zillow_sales) )kable(round(summary_stats, 2), caption ="Summary Statistics of Zillow Home Sales Count")
Summary Statistics of Zillow Home Sales Count
Number of Observations
Mean
Median
Mode
Standard Deviation
Minimum
Maximum
Range
202
2467.24
2385.5
1558
686.55
1056
4046
2990
Next, I will present several visualizations to better understand the zillow house sale dataset.
Code
# Time series line plotggplot(df2, aes(x = date, y = zillow_sales)) +geom_line(color ="steelblue", size =1) +labs(title ="Monthly Home Sales in Cincinnati, OH (Feb 2008 – Oct 2024)",x ="Year", y ="Home Sales Count") +theme_minimal() +scale_x_date(date_labels ="%Y", breaks =pretty_breaks(n =10))
The line chart above illustrates the monthly home sales count in Cincinnati, OH from February 2008 to October 2024. Several key trends are visible:
There is a clear seasonal pattern, with recurring peaks typically during spring and summer months and troughs during the winter.
The period following the 2008 financial crisis shows a noticeable dip in sales, followed by a gradual recovery and growth trend.
Some irregular spikes and dips are visible, potentially linked to the COVID-19 pandemic or interest rate changes.
Code
# Histogramggplot(df2, aes(x = zillow_sales)) +geom_histogram(binwidth =150, fill ="skyblue", color ="black") +labs(title ="Histogram of Monthly Home Sales",x ="Home Sales Count", y ="Frequency") +theme_minimal()
The histogram of the sales count look slightly right-skewed. Most of the sale counts for most months fall between approximately 1500 - 2800 sales.
Code
# Density plotggplot(df2, aes(x = zillow_sales)) +geom_density(fill ="lightblue", alpha =0.7) +labs(title ="Density Plot of Monthly Home Sales",x ="Home Sales Count", y ="Density") +theme_minimal()
The density plot confirms the patter from the histogram plot, indicating that the concentration of the sales count is around the 2000-2500 range.
Code
# Boxplotggplot(df2, aes(y = zillow_sales)) +geom_boxplot(fill ="salmon", color ="black", outlier.color ="red") +labs(title ="Boxplot of Monthly Home Sales",y ="Home Sales Count") +theme_minimal()
The median is around 2200-2300. The boxplot indicates no extreme outliers, which implies values fall within the normal expected range.
# Load required packageslibrary(zoo)# Create 5th, 9th, and 13th order centered moving averagesdf3 <- df2 %>%mutate(ma_5 = zoo::rollmean(zillow_sales, k =5, fill =NA, align ="center"),ma_9 = zoo::rollmean(zillow_sales, k =9, fill =NA, align ="center"),ma_13 = zoo::rollmean(zillow_sales, k =13, fill =NA, align ="center") )# Plot all moving averages with original seriesggplot(df3, aes(x = date)) +geom_line(aes(y = zillow_sales), color ="gray70", size =0.8, alpha =0.8, linetype ="dotted") +geom_line(aes(y = ma_5), color ="darkgreen", size =1, na.rm =TRUE) +geom_line(aes(y = ma_9), color ="steelblue", size =1, na.rm =TRUE) +geom_line(aes(y = ma_13), color ="darkorange", size =1, na.rm =TRUE) +labs(title ="Monthly Home Sales with 5th, 9th, and 13th Order Moving Averages",subtitle ="Gray dotted = Original | Green = 5-point | Blue = 9-point | Orange = 13-point",x ="Date", y ="Home Sales Count") +theme_minimal()
To explore the trend dynamics, I used the moving average of 5th, 9th, and 13th order to display the trend. While the 5th and 9th order of MA is responsive and capture the short term fluctuation, the 13th order provides a much smoother curve that highlights the longer0term trends and structural changes. We can see that housing sale came back up from 2011, after the 2008 financial crisis, then during the post-Covid period, due to the interest rates spiking, we see a downward trend in sales.
The remainder represents the short-term irregular variation after removing the smoothed trend captured by the 13th order moving average. We can see there are peaks and troughs, roughly once per year, which may indicate some strong seasonality. The mean is basically around zero, which confirmed that the moving average removed the long term level.
Code
library(forecast)library(tsibble)library(feasts)library(fabletools)# Step 1: Convert to tsibble and fill gapsts_data <-ts(df$zillow_sales, start =c(2008, 2), frequency =12)# STL decomposition using seasonal window = "periodic" (assumes stable seasonality)stl_result <-stl(ts_data, s.window ="periodic")# Plot STL decompositionautoplot(stl_result) +labs(title ="STL Decomposition of Zillow Monthly Home Sales") +theme_minimal()
Based on the STL decomposition graph, we can see that the seasonal component is visible and strong. Repeating consistently year over year. Sales are high during late Spring and Summer months while low during the winter months.
Looking at the three forcasts, both naive and drift naive shows a flat value prediction for the next 6 months, where the seasonal_naive predicts a downward and the upward trend. In this case, the seasonal naive captures the value from the same month last year, and due to the data has strong seasonality, it seams to be more appropriate to use the seasonal naive model to conduct the forcast.