BANA 7050 Assignment 2

Author

Ruowei Fischer

Section 1

This project is based on Zillow’s monthly home sales count data for Cincinnati, OH, spanning from February 2008 to October 2024. The dataset includes 202 observations, each representing the estimated number of unique properties that sold during the month. Published as part of Zillow’s open housing research initiative (zillow.com/research/data), this dataset provides a reliable measure of regional housing activity and demand over time.

Home sales volume could be influenced by a mix of seasonal, economic, and local factors. Based on the dataset, we can see that the sales increase during spring and summer and slow in winter. Other factors such as interest rates, employment levels, inflation, and buyer confidence also may drive cyclical trends. Local supply, inventory levels, affordability, and policy shifts also contribute to variability. We can see some clear seasonal pattern and there is a reasonable lenghth of time span, it maybe easier for us to analyze and forecast. However, there could also be some disruptions during Covid19 where economics were in the downstream, which could also affect housing sales.

Section 2

Exploratory Data Analysis

I will begin with the summary statistics and take a look at some key features of the dataset, such as number of observations, mean, median, standard deviation and so on.

Code
# Load libraries
library(tidyverse)
library(lubridate)
library(scales)
library(ggthemes)
library(knitr)

# Load data
df <- read_csv("../dataset/zillow_sales.csv")

# Convert date column to Date type
df2 <- df %>%
  mutate(date = ymd(date))

# Summary statistics table
summary_stats <- df2 %>%
  summarise(
    `Number of Observations` = n(),
    Mean = mean(zillow_sales),
    Median = median(zillow_sales),
    Mode = as.numeric(names(sort(table(zillow_sales), decreasing = TRUE)[1])),
    `Standard Deviation` = sd(zillow_sales),
    Minimum = min(zillow_sales),
    Maximum = max(zillow_sales),
    Range = max(zillow_sales) - min(zillow_sales)
  )

kable(round(summary_stats, 2), caption = "Summary Statistics of Zillow Home Sales Count")
Summary Statistics of Zillow Home Sales Count
Number of Observations Mean Median Mode Standard Deviation Minimum Maximum Range
202 2467.24 2385.5 1558 686.55 1056 4046 2990

Next, I will present several visualizations to better understand the zillow house sale dataset.

Code
# Time series line plot
ggplot(df2, aes(x = date, y = zillow_sales)) +
  geom_line(color = "steelblue", size = 1) +
  labs(title = "Monthly Home Sales in Cincinnati, OH (Feb 2008 – Oct 2024)",
       x = "Year", y = "Home Sales Count") +
  theme_minimal() +
  scale_x_date(date_labels = "%Y", breaks = pretty_breaks(n = 10))

The line chart above illustrates the monthly home sales count in Cincinnati, OH from February 2008 to October 2024. Several key trends are visible:

  • There is a clear seasonal pattern, with recurring peaks typically during spring and summer months and troughs during the winter.
  • The period following the 2008 financial crisis shows a noticeable dip in sales, followed by a gradual recovery and growth trend.
  • Some irregular spikes and dips are visible, potentially linked to the COVID-19 pandemic or interest rate changes.
Code
# Histogram
ggplot(df2, aes(x = zillow_sales)) +
  geom_histogram(binwidth = 150, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Monthly Home Sales",
       x = "Home Sales Count", y = "Frequency") +
  theme_minimal()

The histogram of the sales count look slightly right-skewed. Most of the sale counts for most months fall between approximately 1500 - 2800 sales.

Code
# Density plot
ggplot(df2, aes(x = zillow_sales)) +
  geom_density(fill = "lightblue", alpha = 0.7) +
  labs(title = "Density Plot of Monthly Home Sales",
       x = "Home Sales Count", y = "Density") +
  theme_minimal()

The density plot confirms the patter from the histogram plot, indicating that the concentration of the sales count is around the 2000-2500 range.

Code
# Boxplot
ggplot(df2, aes(y = zillow_sales)) +
  geom_boxplot(fill = "salmon", color = "black", outlier.color = "red") +
  labs(title = "Boxplot of Monthly Home Sales",
       y = "Home Sales Count") +
  theme_minimal()

The median is around 2200-2300. The boxplot indicates no extreme outliers, which implies values fall within the normal expected range.

Section 3

Trend, Remainder, and Seasonality Analysis

Code
# Load required packages
library(zoo)

# Create 5th, 9th, and 13th order centered moving averages
df3 <- df2 %>%
  mutate(
    ma_5 = zoo::rollmean(zillow_sales, k = 5, fill = NA, align = "center"),
    ma_9 = zoo::rollmean(zillow_sales, k = 9, fill = NA, align = "center"),
    ma_13 = zoo::rollmean(zillow_sales, k = 13, fill = NA, align = "center")
  )

# Plot all moving averages with original series
ggplot(df3, aes(x = date)) +
  geom_line(aes(y = zillow_sales), color = "gray70", size = 0.8, alpha = 0.8, linetype = "dotted") +
  geom_line(aes(y = ma_5), color = "darkgreen", size = 1, na.rm = TRUE) +
  geom_line(aes(y = ma_9), color = "steelblue", size = 1, na.rm = TRUE) +
  geom_line(aes(y = ma_13), color = "darkorange", size = 1, na.rm = TRUE) +
  labs(title = "Monthly Home Sales with 5th, 9th, and 13th Order Moving Averages",
       subtitle = "Gray dotted = Original | Green = 5-point | Blue = 9-point | Orange = 13-point",
       x = "Date", y = "Home Sales Count") +
  theme_minimal()

To explore the trend dynamics, I used the moving average of 5th, 9th, and 13th order to display the trend. While the 5th and 9th order of MA is responsive and capture the short term fluctuation, the 13th order provides a much smoother curve that highlights the longer0term trends and structural changes. We can see that housing sale came back up from 2011, after the 2008 financial crisis, then during the post-Covid period, due to the interest rates spiking, we see a downward trend in sales.

Code
df3 <- df3 %>%
  mutate(remainder_13 = zillow_sales - ma_13)

# Plot the remainder series
ggplot(df3, aes(x = date, y = remainder_13)) +
  geom_line(color = "firebrick", size = 1, na.rm = TRUE) +
  labs(title = "Remainder Series: Zillow Sales - 13th-Order Moving Average",
       x = "Date", y = "Residual (Sales - MA_13)") +
  theme_minimal()

The remainder represents the short-term irregular variation after removing the smoothed trend captured by the 13th order moving average. We can see there are peaks and troughs, roughly once per year, which may indicate some strong seasonality. The mean is basically around zero, which confirmed that the moving average removed the long term level.

Code
library(forecast)
library(tsibble)
library(feasts)
library(fabletools)

# Step 1: Convert to tsibble and fill gaps
ts_data <- ts(df$zillow_sales, start = c(2008, 2), frequency = 12)

# STL decomposition using seasonal window = "periodic" (assumes stable seasonality)
stl_result <- stl(ts_data, s.window = "periodic")

# Plot STL decomposition
autoplot(stl_result) +
  labs(title = "STL Decomposition of Zillow Monthly Home Sales") +
  theme_minimal()

Based on the STL decomposition graph, we can see that the seasonal component is visible and strong. Repeating consistently year over year. Sales are high during late Spring and Summer months while low during the winter months.

Section 4

Forcast using Naive

Code
library(tsibble)
library(fable)
library(feasts)
library(tidyverse)
library(lubridate)

library(tsibble)
library(fable)
library(feasts)
library(tidyverse)
library(lubridate)

# Step 1: Format to monthly tsibble
zillow_ts <- df %>%
  mutate(month = yearmonth(date)) %>%
  group_by(month) %>%
  summarise(zillow_sales = sum(zillow_sales, na.rm = TRUE), .groups = 'drop') %>%
  as_tsibble(index = month)

# Step 2: Fit naive models
naive_models <- zillow_ts %>%
  model(
    naive = NAIVE(zillow_sales),
    seasonal_naive = SNAIVE(zillow_sales),
    drift_naive = NAIVE(zillow_sales ~ drift())
  )

# Step 3: Forecast 6 months ahead
zillow_fc <- naive_models %>%
  forecast(h = "6 months")

# Step 4: Plot
autoplot(zillow_fc, zillow_ts) +
  labs(title = "6-Month Naive Forecasts for Zillow Monthly Home Sales",
       x = "Date", y = "Sales Count") +
  theme_minimal()

Looking at the three forcasts, both naive and drift naive shows a flat value prediction for the next 6 months, where the seasonal_naive predicts a downward and the upward trend. In this case, the seasonal naive captures the value from the same month last year, and due to the data has strong seasonality, it seams to be more appropriate to use the seasonal naive model to conduct the forcast.