Introduction
Description
This notebook tackles a time series forecasting problem using SARIMA and Holt-Winters models. The dataset contains house prices in the USA from 2007 to 2019 (dataset link).
Structure
Here is the structure of the notebook:
- Data Preparation
- Exploratory Data Analysis
- Modelling
- Model Evaluation
- Verdict
Data Preparation
Read the dataset
# packages used throughout the notebook
library(dplyr)
library(lubridate)
library(forecast)
library(tseries)    # adf.test, kpss.test
library(TSstudio)   # test_forecast
hs_data <- read.csv("data/raw_sales.csv")
rmarkdown::paged_table(head(hs_data))
The raw data consists of 5 columns, some of which are correlated with house prices (such as bedrooms and propertyType), but in this notebook we focus on using only the time variable to predict house prices.
Checking dataset column types
str(hs_data)
## 'data.frame': 29580 obs. of 5 variables:
## $ datesold : chr "2007-02-07 00:00:00" "2007-02-27 00:00:00" "2007-03-07 00:00:00" "2007-03-09 00:00:00" ...
## $ postcode : int 2607 2906 2905 2905 2906 2905 2607 2606 2902 2906 ...
## $ price : int 525000 290000 328000 380000 310000 465000 399000 1530000 359000 320000 ...
## $ propertyType: chr "house" "house" "house" "house" ...
## $ bedrooms : int 4 3 3 4 3 4 3 4 3 3 ...
The data contains 29,580 samples, and the datesold column needs to be converted to a date data type.
Check NA Values
colSums(is.na(hs_data))
## datesold postcode price propertyType bedrooms
## 0 0 0 0 0
The data has zero NA values, so we can move on to the next step.
Changing Data Types
hs_data$datesold <- ymd_hms(hs_data$datesold)
hs_data <- hs_data %>% select(-postcode) %>% mutate(propertyType = as.factor(propertyType))
head(hs_data)
## datesold price propertyType bedrooms
## 1 2007-02-07 525000 house 4
## 2 2007-02-27 290000 house 3
## 3 2007-03-07 328000 house 3
## 4 2007-03-09 380000 house 4
## 5 2007-03-21 310000 house 3
## 6 2007-04-04 465000 house 4
Summary of data
summary(hs_data)
## datesold price propertyType bedrooms
## Min. :2007-02-07 00:00:00 Min. : 56500 house:24552 Min. :0.00
## 1st Qu.:2013-02-05 00:00:00 1st Qu.: 440000 unit : 5028 1st Qu.:3.00
## Median :2015-09-30 00:00:00 Median : 550000 Median :3.00
## Mean :2015-02-21 07:35:42 Mean : 609736 Mean :3.25
## 3rd Qu.:2017-07-26 00:00:00 3rd Qu.: 705000 3rd Qu.:4.00
## Max. :2019-07-27 00:00:00 Max. :8000000 Max. :5.00
From the summary we know that the data spans 2007-02-07 to 2019-07-27. We also see that the median price is 550,000 US dollars and the mean is about 609,736 US dollars.
Before we move on we need to check the time series data assumptions (a quick sketch of each check follows this list):
- dates must be sorted from oldest to newest
- the date interval must be constant
- no gaps (missing dates) within the interval
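A minimal sketch of checking each assumption separately (the helper names obs_months and full_months are ours; assumes lubridate is loaded):
# 1. dates sorted from oldest to newest
all(diff(hs_data$datesold) >= 0)
# 2 & 3. constant interval with no gaps: compare the observed months
# against a complete monthly sequence
obs_months <- sort(unique(floor_date(hs_data$datesold, "month")))
full_months <- seq(min(obs_months), max(obs_months), by = "month")
identical(as.Date(obs_months), as.Date(full_months))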
Check Time Series Assumptions
all((hs_data$date) == (seq(ymd("2007-02-01"), ymd("2019-07-01"), by = "month")))
## Warning in `==.default`((hs_data$date), (seq(ymd("2007-02-01"),
## ymd("2019-07-01"), : longer object length is not a multiple of shorter object
## length
## [1] FALSE
Because the raw data does not fulfill these criteria, we need to do some feature engineering.
Feature Engineering
hs_agg <- hs_data %>%
mutate(month_new = month(hs_data$datesold),
year_new = year(hs_data$datesold),
date = ym(paste0(year_new,"-",month_new))) %>%
select(-propertyType, -bedrooms, -datesold) %>%
group_by(date) %>%
summarise(price_monthly_avg = mean(price))
hs_agg
## # A tibble: 150 x 2
## date price_monthly_avg
## <date> <dbl>
## 1 2007-02-01 407500
## 2 2007-03-01 339333.
## 3 2007-04-01 798000
## 4 2007-05-01 339500
## 5 2007-06-01 520333.
## 6 2007-07-01 592079.
## 7 2007-08-01 505609.
## 8 2007-09-01 556875
## 9 2007-10-01 498600
## 10 2007-11-01 505442.
## # ... with 140 more rows
We aggregate the data by month, averaging the prices within each month. This gives a monthly average price series that satisfies all the time series assumptions. Although this differs from the initial goal of predicting individual house prices, forecasting the monthly average instead of the irregular daily sale dates in the raw data (e.g. the 7th or 27th of a month) can yield a more stable and useful prediction.
Check Assumptions
all((hs_agg$date) == (seq(ymd("2007-02-01"), ymd("2019-07-01"), by = "month")))
## [1] TRUE
Exploratory Data Analysis
Time Series Plot
plot(hs_agg$date, hs_agg$price_monthly_avg, type = 'l')
From the plot we can see that the time series is additive and that there is an increasing trend in house prices.
Time Series Object
hp_ts <- ts(data = hs_agg$price_monthly_avg,
start = c(2007, 2),
frequency = 12)
We use frequency = 12 because the data is monthly.
Decomposition
# decomposition
hp_dc <- hp_ts %>%
decompose(type = "additive")
# autoplot
hp_dc %>%
autoplot()
From the decomposition we can see that the data has monthly seasonality with an upward trend.
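To put a rough number on these components, one option (a sketch following the usual strength-of-trend/seasonality idea; the variable name remainder is ours) is to compare the variance of the remainder against the detrended and deseasonalized series:
# strength of trend and seasonality from the decomposition components
# (values near 1 mean a strong component; the moving-average trend
# leaves NAs at the series ends, hence na.rm = TRUE)
remainder <- hp_dc$random
max(0, 1 - var(remainder, na.rm = TRUE) / var(hp_dc$trend + remainder, na.rm = TRUE))    # trend strength
max(0, 1 - var(remainder, na.rm = TRUE) / var(hp_dc$seasonal + remainder, na.rm = TRUE)) # seasonal strength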
Cross Validation
# last 28 months for the test set, the rest for the training set
hp_test <- tail(hp_ts, 28)
hp_train <- head(hp_ts, -28)
We split the series into train and test sets, with the test set consisting of the last 28 months as a benchmark for model evaluation.
Check Stationarity
# assign final ts object
hs_msts <- hp_train %>%
msts(seasonal.periods = c(7*4))
# check for stationary
adf.test(hs_msts)
##
## Augmented Dickey-Fuller Test
##
## data: hs_msts
## Dickey-Fuller = -3.1067, Lag order = 4, p-value = 0.1168
## alternative hypothesis: stationary
Because the p-value > 0.05, we cannot reject the null hypothesis of non-stationarity, so we treat the training series as non-stationary. We can make it stationary by differencing (diff).
# assign final ts object
hs_msts <- diff(hp_train) %>%
msts(seasonal.periods = c(7*4))
# check for stationary
adf.test(hs_msts)
## Warning in adf.test(hs_msts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: hs_msts
## Dickey-Fuller = -8.6053, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
Now that the p-value < 0.05, we can reject the null hypothesis: the differenced series is stationary.
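As a cross-check, the KPSS test (also from the tseries package) reverses the hypotheses: its null is stationarity, so a large p-value here would agree with the ADF result above.
# KPSS test: H0 = stationary, so p > 0.05 supports stationarity
kpss.test(hs_msts)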
Plot: Stationary vs Non-Stationary
hp_ts_diff <- diff(hp_train)
hp_ts %>%
autoplot(series = "Actual") +
autolayer(hp_ts_diff, series = "Diff 1")
Modelling
SARIMA Model
hp_auto <- auto.arima(hp_train, seasonal = T)
hp_auto
## Series: hp_train
## ARIMA(3,1,1)(1,0,2)[12]
##
## Coefficients:
## ar1 ar2 ar3 ma1 sar1 sma1 sma2
## -0.2099 0.1438 0.5862 -0.9470 0.8685 -0.8385 0.2707
## s.e. 0.1065 0.1185 0.1305 0.0503 0.0709 0.1725 0.1614
##
## sigma^2 estimated as 1910218089: log likelihood=-1465.93
## AIC=2947.87 AICc=2949.15 BIC=2970.23
Here auto.arima selects a SARIMA model of order ARIMA(3,1,1)(1,0,2)[12].
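The selected order can also be refit explicitly with forecast::Arima(), which makes the terms easy to tweak by hand; a sketch (the name hp_manual is ours):
# refit the order chosen by auto.arima manually
hp_manual <- Arima(hp_train,
                   order = c(3, 1, 1),    # (p, d, q)
                   seasonal = c(1, 0, 2)) # (P, D, Q), period 12 taken from the ts
hp_manual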
Holt-Winters Model
hp_hw <- stlm(hp_train, method = "ets", lambda = 0)
summary(hp_hw)
## Length Class Mode
## stl 488 mstl numeric
## model 19 ets list
## modelfunction 1 -none- function
## lambda 1 -none- numeric
## x 122 ts numeric
## series 1 -none- character
## m 1 -none- numeric
## fitted 122 ts numeric
## residuals 122 ts numeric
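Note that stlm(method = "ets") fits an ETS model to the STL seasonally adjusted series rather than a textbook Holt-Winters model. For comparison, a classic additive Holt-Winters fit would be a sketch like this (not used in the evaluation below):
# classic triple exponential smoothing (stats package)
hp_hw_classic <- HoltWinters(hp_train, seasonal = "additive")
forecast(hp_hw_classic, h = 28)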
Model Plotting
hp_train%>%
autoplot(series = "Actual") +
autolayer(hp_auto$fitted, series = "SARIMA") +
autolayer(hp_hw$fitted, series = "Holt Winters")
From the graph both models seem to fit well, but we can assess model performance objectively in the model evaluation section.
Model Evaluation
Forecasting
# forecast
hp_sarima_fr <- forecast(hp_auto, h = 28)
hp_hw_fr <- forecast(hp_hw, h = 28)
Here we create forecasts 28 months into the future, matching the length of the test set.
Model Comparison
forecast_list <- list(
"SARIMA" = hp_sarima_fr,
"HW" = hp_hw_fr)
compare_forecast <- function(x, test){
lapply(x, function(x) forecast::accuracy(x, test)["Test set", ]) %>%
lapply(., function(x) x %>% t() %>% as.data.frame) %>%
bind_rows() %>%
mutate(model = names(x)) %>%
select(model, everything())
}
(compare_forecast(forecast_list, hp_test) %>% select(model, MAPE))
## model MAPE
## 1 SARIMA 4.524789
## 2 HW 3.189914
The models achieve good results, with MAPE values of about 4.5 for SARIMA and 3.2 for HW. We choose MAPE because our prices are in the hundreds of thousands, so we need a scale-free metric: MAPE is expressed as a percentage of the actual values, whereas mean squared error or root mean squared error are scale-dependent and unstable when the variance of the data is high.
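For reference, MAPE is just the mean absolute error expressed as a percentage of the actual values; a minimal sketch of computing it by hand (the mape helper is ours):
# MAPE = mean(|actual - forecast| / |actual|) * 100
mape <- function(actual, fc) mean(abs(actual - fc) / abs(actual)) * 100
mape(hp_test, hp_sarima_fr$mean)
mape(hp_test, hp_hw_fr$mean)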
SARIMA Model Performance Plot
test_forecast(actual = hp_ts,
forecast.obj = hp_sarima_fr,
train = hp_train,
test = hp_test)
The SARIMA forecast tracks the data better when the fluctuations in the real data are not at their peaks; it misses the peaks, which means the model is not stable.
Holt-Winters Model Performance Plot
test_forecast(actual = hp_ts,
forecast.obj = hp_hw_fr,
train = hp_train,
test = hp_test)
The HW model suffers from the same problem as SARIMA but shows better overall performance.
Assumption Check
Because of its better MAPE performance, we pick the HW model for the assumption checks.
No Autocorrelation in Residuals
Acf(hp_hw$residuals)
We can also check using the Ljung-Box test.
Box.test(hp_hw$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: hp_hw$residuals
## X-squared = 9.0206, df = 1, p-value = 0.00267
The p-value is lower than 0.05 (alpha), so we reject the null hypothesis of no autocorrelation: the residuals are autocorrelated, and the no-autocorrelation assumption is not satisfied.
Normality of Residuals
hist(hp_hw$residuals, breaks = 15)
From the histogram we can see that most residuals lie around 0, but we can use the Shapiro-Wilk test to check the assumption formally.
shapiro.test(hp_hw$residuals)
##
## Shapiro-Wilk normality test
##
## data: hp_hw$residuals
## W = 0.8245, p-value = 0.00000000009467
Because the p-value is lower than alpha, we reject the null hypothesis of normality: the residuals are not normally distributed, so the normality assumption is not satisfied either.
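A normal Q-Q plot is a quick visual complement to the Shapiro-Wilk test; points straying from the line in the tails are what drives the rejection:
# visual normality check of the residuals
qqnorm(hp_hw$residuals)
qqline(hp_hw$residuals, col = "red")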
Verdict
House prices have changed over the past years for many reasons. Predicting house prices can help people plan when to buy a house, or estimate when to sell one for better value. Using SARIMA and Holt-Winters models we achieved MAPE values of around 4.5 and 3.2, which indicates good predictive capability. We also checked the time series assumptions on the data and the residual assumptions on the models; the residual assumptions were not fully satisfied, so the forecasts should be interpreted with some caution.