San Fransico Zillow Weekly Sales Forecasting

Section 1

Introduction

There are huge finances that are involved while buying a property. In order to make that investment in the right time and right property, it is very important to understand the factors that drives the price and more important what would be the forecast price in future for that property.One of biggest player in the real state industry is Zillow who deals in buying, selling and renting a home.

Data Source

Zillow Weekly Median Sales Price

Why SanFransciso Data Set

The Zestimate is marketed as a tool designed to take the mystery out of real estate for consumers who would otherwise have to rely on brokers and guesswork. However, Zestimates often give a single time point value which might not be an accurate way to base the decision of purchase on.

The dataset provides detailed information on home sales in the area on a weekly basis, which can be used to create accurate short-term forecasts of the real estate market. By analyzing trends in median home prices, it is possible to make predictions about the future performance of the market.

Data Preparation and Visualization

We have median weekly sales data for the houses sold in San Francisco region from 1st week of Jan 2018 to 1st week of Jan 2023.

Code
sales_tbl_ts %>% ggplot() +
  geom_line(aes(date, value)) +
  theme_bw() +
  xlab("Year Week") +
  ylab("Median Sales Price")

Observation:-

  • Seasonality ( Up trend during summer months and downtrend during holiday period)

  • Mean value is increasing over the year

Section 2

Summary Statistics

Code
summary(sales_tbl_ts$value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 846500  927840  977750  971991 1004750 1099750 
Code
x<-sd(sales_tbl_ts$value)
x
[1] 58875.3

Mean of sales price is $971991 and median is $977750 with standard deviation of $58875.3

Exploratory Data Analysis

Code
hist <- sales_tbl_ts %>%
  ggplot() +
  geom_histogram(aes(value),colour = 4, fill = "white", bins = 20) +
  theme_bw()

dens <- sales_tbl_ts %>%
  ggplot() +
  geom_density(aes(value),colour = 4, fill = "white") +
  theme_bw()

violin <- sales_tbl_ts %>%
  ggplot() +
  geom_violin(aes("", value),colour = 4, fill = "white") +
  theme_bw()

boxplot <- sales_tbl_ts %>%
  ggplot() +
  geom_boxplot(aes("", value),colour = 4, fill = "white") +
  theme_bw()

ggarrange(hist, dens, violin,boxplot , 
          labels = c("A", "B", "C","D"),
          ncol = 2, nrow = 2)

  • We don’t see any outlines in the data

  • Mean and Median are very close.

  • Data is evenly distributed and its close to normal distribution

Linear Regression

Code
sales_tbl_ts %>%
  as.data.frame() %>%
  ggplot(aes(x=date, y=value)) +
    ylab("Weekly PriceMedian") +
    xlab("Weekly Date") +
    geom_line() +
    geom_smooth(method="lm", se=FALSE)
`geom_smooth()` using formula = 'y ~ x'

Linear regression is not handling the seasonality, we should check the moving average method to check which rolling period we should use to calculate the MA.

Section 3

Estimation using Moving Average

Code
sales_ma_data <- sales_tbl_ts %>%
  arrange(date) %>%
  mutate(
    ma_13_left = rollapply(
      value,
      13,
      FUN = mean,
      align = "left", fill = NA
    ),
    ma_13_right = rollapply(
      value,
      13,
      FUN = mean,
      align = "right", fill = NA
    ),
    ma_13_center = rollapply(
      value,
      13,
      FUN = mean,
      align = "center", fill = NA
    )
  ) %>%
  mutate(
    value_ma_3 = rollapply(value, 3, FUN = mean, align = "center", fill = NA),
    value_ma_5 = rollapply(value, 5, FUN = mean, align = "center", fill = NA),
    value_ma_7 = rollapply(value, 7, FUN = mean, align = "center", fill = NA),
    value_ma_13 = rollapply(value, 13, FUN = mean, align = "center", fill = NA),
    value_ma_25 = rollapply(value, 25, FUN = mean, align = "center", fill = NA),
    value_ma_49 = rollapply(value, 49, FUN = mean, align = "center", fill = NA)
  )


sales_tbl_pivot <- sales_ma_data %>%
  pivot_longer(
    cols = ma_13_left:value_ma_49,
    values_to = "value_ma",
    names_to = "ma_order"
  ) %>%
  mutate(ma_order = factor(
    ma_order,
    levels = c(
      "ma_13_center",
      "ma_13_left",
      "ma_13_right",
      "value_ma_3",
      "value_ma_5",
      "value_ma_7",
      "value_ma_13",
      "value_ma_25",
      "value_ma_49"
    ),
    labels = c(
      "ma_13_center",
      "ma_13_left",
      "ma_13_right",
      "value_ma_3",
      "value_ma_5",
      "value_ma_7",
      "value_ma_13",
      "value_ma_25",
      "value_ma_49"
    )
  ))

x <- sales_tbl_pivot %>%
  filter(
    !ma_order %in% c(
      "ma_13_center",
      "ma_13_left",
      "ma_13_right",
      "value_ma_7",
      "value_ma_49"
    )
  ) %>%
  mutate(ma_order = case_when(
    ma_order=='value_ma_3'~'3rd Order',
    ma_order=='value_ma_5'~'5th Order',
    ma_order=='value_ma_13'~'13th Order',
    ma_order=='value_ma_25'~'25th Order')
  ) %>%
  mutate(
    ma_order = factor(
      ma_order,
      labels = c('3rd Order',
      '5th Order',
      '13th Order',
      '25th Order'),
      levels = c('3rd Order',
      '5th Order',
      '13th Order',
      '25th Order')
    )
  ) %>%
  ggplot() +
  geom_line(aes(date, value), size = 1) +
  geom_line(aes(date, value_ma, color = ma_order), size = 1) +
    scale_color_discrete(name = 'MA Order')+
  theme_bw()+
  ylab('Median Sales Price')

x

We are trying to see the moving average based on the order. For the analysis we have taken 3rd,5th, 13th and 25th order.

From the graph 13th order moving avg. explains the seasonality. We

Classical Decomposition

Code
sales_tbl_ts %>%
  model(
    classical_decomposition(value)
  ) %>%
  components() %>%
  autoplot()

STL Decomposition

Code
sales_tbl_ts %>%
  model(
    STL(value)
  ) %>%
  components() %>%
  autoplot()

The classical and STL decomposition are able to explain the seasonality but still we see a cyclic trend in the residual as it moves up and down.We won’t be able to certify it as a white noise.

Section 4

Naive Forecast

Code
file_ts <- ts(sales_tbl_ts$value, start = c(2018, 1), frequency = 52)

naive_forecast <- naive(file_ts, h = 6)
autoplot(naive_forecast) + theme_bw() + ylab('Median Sales Price')

Using the Naive forecast we are forecasting for next 6 weeks but it doesn’t look like the correct way to forecast.

ACF and PACF

Code
sales_tbl_ts_1<-ts(sales_tbl_ts$value, sales_tbl_ts$date)
par(mfrow=c(1,2))
acf(sales_tbl_ts_1, lag.max = 30)
pacf(sales_tbl_ts_1,lag.max = 5)

Dampening effect of ACF suggests that our time series has a “long memory” since future values can be explained partially by values deep in past. But when we check the PACF lag period 2 so we need to use AR(2).