INTRODUCTION

This is a time series analysis and forecast modelling exercise using Holt-Winters and Autoregressive Integrated Moving Average (ARIMA) methods. The data was downloaded from:

https://www.kaggle.com/datasets/htagholdings/property-sales?select=raw_sales.csv

The forecast results were finally evaluated with accuracy() and the Ljung-Box test.

Load the required libraries.

library(dplyr) # for data wrangling
library(lubridate) # to deal with dates
library(padr) # for padding
library(forecast) # time series library
library(tseries) # for adf.test
library(MLmetrics) # calculate error
library(timetk) # time series helper functions
library(TSstudio) # for ts_split()

Exploratory Data Analysis

Import the data.

sales <- read.csv('raw_sales.csv')

rmarkdown::paged_table(sales) 

Check for missing values; the data has none.

anyNA(sales)
## [1] FALSE

Check the unique values of sales$propertyType.

unique(sales$propertyType)
## [1] "house" "unit"

Check the unique values of sales$bedrooms.

unique(sales$bedrooms)
## [1] 4 3 5 1 2 0

Create a subset where propertyType == "unit".

sales_unit <- sales %>% filter(propertyType == "unit")

Display the subset where propertyType == "unit".

rmarkdown::paged_table(sales_unit)

Create a subset where propertyType == "house" and bedrooms == 3.

sales_house_3 <- sales %>% filter(propertyType == "house" & bedrooms == 3)

Display the subset sales_house_3 (propertyType == "house" & bedrooms == 3).

rmarkdown::paged_table(sales_house_3)

Check the data types of sales_house_3. This is the time series chosen for modelling.

str(sales_house_3)
## 'data.frame':    11281 obs. of  5 variables:
##  $ datesold    : chr  "2007-02-27 00:00:00" "2007-03-07 00:00:00" "2007-03-21 00:00:00" "2007-04-24 00:00:00" ...
##  $ postcode    : int  2906 2905 2906 2607 2902 2906 2902 2906 2906 2602 ...
##  $ price       : int  290000 328000 310000 399000 359000 320000 385000 305000 336000 427500 ...
##  $ propertyType: chr  "house" "house" "house" "house" ...
##  $ bedrooms    : int  3 3 3 3 3 3 3 3 3 3 ...

Check the format of the date field sales_house_3$datesold.

head(unique(sales_house_3$datesold))
## [1] "2007-02-27 00:00:00" "2007-03-07 00:00:00" "2007-03-21 00:00:00"
## [4] "2007-04-24 00:00:00" "2007-05-24 00:00:00" "2007-05-25 00:00:00"

Convert the sales_house_3$datesold field from "chr" to "datetime".

sales_house_3$datesold <- ymd_hms(sales_house_3$datesold)

View the changes made.

glimpse(sales_house_3)
## Rows: 11,281
## Columns: 5
## $ datesold     <dttm> 2007-02-27, 2007-03-07, 2007-03-21, 2007-04-24, 2007-05-…
## $ postcode     <int> 2906, 2905, 2906, 2607, 2902, 2906, 2902, 2906, 2906, 260…
## $ price        <int> 290000, 328000, 310000, 399000, 359000, 320000, 385000, 3…
## $ propertyType <chr> "house", "house", "house", "house", "house", "house", "ho…
## $ bedrooms     <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …

View a brief statistical summary of the final dataset.

summary(sales_house_3)
##     datesold                         postcode        price        
##  Min.   :2007-02-27 00:00:00.00   Min.   :2600   Min.   :  69000  
##  1st Qu.:2012-08-17 00:00:00.00   1st Qu.:2611   1st Qu.: 432500  
##  Median :2015-05-22 00:00:00.00   Median :2615   Median : 496000  
##  Mean   :2014-11-09 08:28:48.35   Mean   :2740   Mean   : 550264  
##  3rd Qu.:2017-05-01 00:00:00.00   3rd Qu.:2905   3rd Qu.: 609000  
##  Max.   :2019-07-26 00:00:00.00   Max.   :2914   Max.   :6480000  
##  propertyType          bedrooms
##  Length:11281       Min.   :3  
##  Class :character   1st Qu.:3  
##  Mode  :character   Median :3  
##                     Mean   :3  
##                     3rd Qu.:3  
##                     Max.   :3

Time Series Forecasting

Use ts() to convert the chosen dataset into a time series object, then plot it.

sales_ts <- ts(sales_house_3$price, start=c(2007, 2), end=c(2019, 7), frequency=365)

forecast::autoplot(sales_ts)
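Note that sales occur on irregular dates, so mapping them directly onto a frequency-365 ts object is an approximation. A stricter construction would first regularize the data to one observation per day; a minimal sketch using the padr package loaded above (the daily-mean aggregation is an assumption, not part of the analysis below):

# A sketch only, not used below: build a regular daily series before ts()
sales_daily <- sales_house_3 %>%
  group_by(datesold) %>%
  summarise(price = mean(price), .groups = "drop") %>% # average same-day sales
  pad(interval = "day") # padr: insert the missing days as NA rows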

Decompose the time series to view its seasonal, trend and random components.

sales_decom <- sales_ts %>% decompose()

plot(sales_decom) 

The decomposition shows a rising trend in the years up to 2012 and a flat trend from 2015 onwards. It also shows a regular seasonal pattern, with price spikes early and in the middle of each year.
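The individual components returned by decompose() can also be inspected directly:

# Each component is itself a ts object; trend has NAs at the ends
# because it is based on a centred moving average
head(sales_decom$seasonal)
head(sales_decom$trend)
head(sales_decom$random)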

Modelling

Use ts_split() to hold out the last 500 observations as a test set and keep the rest as training data.

split_sales_ts <- ts_split(ts.obj = sales_ts, sample.out = 500)


training <- split_sales_ts$train
testing <- split_sales_ts$test

Check the sizes of the training and test sets.

length(training)
## [1] 3886
length(testing)
## [1] 500

Check the first and last observations of the training set.

head(training)
## Time Series:
## Start = c(2007, 2) 
## End = c(2007, 7) 
## Frequency = 365 
## [1] 290000 328000 310000 399000 359000 320000
tail(training)
## Time Series:
## Start = c(2017, 232) 
## End = c(2017, 237) 
## Frequency = 365 
## [1] 462000 485000 513500 635000 353000 420000

Check the first and last observations of the test set.

head(testing)
## Time Series:
## Start = c(2017, 238) 
## End = c(2017, 243) 
## Frequency = 365 
## [1] 440000 456000 730000 673000 500000 342000
tail(testing)
## Time Series:
## Start = c(2019, 2) 
## End = c(2019, 7) 
## Frequency = 365 
## [1] 850000 490000 413000 415000 475000 537000

Holt-Winters Model

Holt-Winters is a forecasting model suited to time series that exhibit trend and seasonality.

model_hw <- HoltWinters(training)
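The fitted object exposes the smoothing parameters that HoltWinters() estimated from the training data; their exact values depend on the series, so this is just an inspection sketch:

model_hw$alpha # level smoothing parameter
model_hw$beta # trend smoothing parameter
model_hw$gamma # seasonal smoothing parameter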

Plot the Holt-Winters fitted values against the training data.

HW1 <- HoltWinters(training) # full model: level, trend and seasonality
HW2 <- HoltWinters(training, beta=FALSE) # no trend component, for comparison
# Visually evaluate the two fits
plot(training, ylab="Value")
lines(HW1$fitted[,1], lty=2, col="blue")
lines(HW2$fitted[,1], lty=2, col="red")

Display a zoomed-in section of the previous plot.

plot(training, ylab="Value", xlim=c(2016.1, 2017.12))
lines(HW1$fitted[,1], lty=2, col="blue")
lines(HW2$fitted[,1], lty=2, col="red")

Generate a 365-day forecast from the Holt-Winters model.

hw_forecast <- forecast(model_hw, h = 365)
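The forecast object can be plotted directly to inspect the point forecasts together with their prediction intervals (standard plot method from the forecast package):

plot(hw_forecast) # point forecasts with 80%/95% prediction intervals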

ARIMA

Autoregressive Integrated Moving Average (ARIMA) modelling assumes stationary data, that is, a time series that fluctuates around a constant mean. We can check for non-stationarity using acf().

acf(training)

The ACF shows only low correlation (less than 0.2) between the time series and its own lagged values. Hence ARIMA is a suitable option as a forecasting model.
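As a complementary stationarity check, the tseries package loaded earlier provides the Augmented Dickey-Fuller test (a sketch; the reported statistic and p-value depend on the series):

# ADF test: H0 = the series has a unit root (non-stationary);
# a small p-value supports stationarity
adf.test(training)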

Fit the ARIMA model using stlm() and plot the fitted values.

model_arima <- stlm(
  training,
  s.window = 365,
  method = "arima"
)

plot(training, ylab="amount")
lines(model_arima$fitted, lty=2, col="blue")

Generate a 365-day forecast from the ARIMA model.

arima_forecast <- forecast(model_arima, h = 365)
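As a quick visual sanity check, the ARIMA point forecast can be overlaid on the held-out test series (an illustrative sketch; the horizons differ, 365 forecast days against 500 test observations, so the lines overlap only where both are defined):

# Both objects are ts on the same time axis, so lines() aligns them
plot(testing, ylab="price")
lines(arima_forecast$mean, lty=2, col="blue")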

Evaluation

Accuracy()

Use accuracy() to compare the two forecasting models on the “testing” time series dataset.

accuracy(hw_forecast, testing)
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 3263.605 171157.0 104046.1 -4.661103 20.06696 0.7970241 0.06519543
## Test set     9520.788 158895.6 107015.2 -4.092905 19.76847 0.8197681 0.06017566
##              Theil's U
## Training set        NA
## Test set     0.9359353
accuracy(arima_forecast,testing)
##                      ME     RMSE      MAE      MPE     MAPE      MASE
## Training set   104.1316 159672.4 105508.1 -5.40300 21.29978 0.8082237
## Test set     84176.4858 179154.6 113100.1 11.38168 18.71429 0.8663805
##                     ACF1 Theil's U
## Training set -0.05535067        NA
## Test set      0.05840184   1.04001

MAE is computed as the average absolute difference between the values fitted by the model (one-step-ahead in-sample forecasts) and the observed historical data. Here the MAE suggests the predictions are off by approximately 100k.

MASE is the MAE divided by the MAE of the naive model, where the naive model predicts the value at time t as the previous observed value. Scaling by this error shows how good the model is relative to the naive model: a MASE greater than 1 means the model is worse than the naive model, and the lower the MASE, the better. On the test set the Holt-Winters MASE (~0.82) is lower than the ARIMA MASE (~0.87), indicating Holt-Winters is the better model here.

MAPE (mean absolute percentage error) takes the absolute difference between the actual and fitted values as a percentage of the actual value, then averages it. Here the MAPE suggests the predictions are out by about 18-21%.

Root mean squared error (RMSE) is the square root of the mean squared difference between the predicted and actual values, and is on the same scale as the observed data. Here the RMSE suggests the predictions are off by roughly 160k-180k.
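To make the definitions above concrete, here is a minimal sketch of the four metrics (not the exact accuracy() internals; actual, predicted and train are assumed to be plain numeric vectors):

mae  <- function(actual, predicted) mean(abs(actual - predicted))
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape <- function(actual, predicted) 100 * mean(abs((actual - predicted) / actual))
mase <- function(actual, predicted, train) {
  # scale by the in-sample MAE of the naive (previous-value) model
  mae(actual, predicted) / mean(abs(diff(train)))
}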

Ljung-Box Test

The Ljung-Box test is a hypothesis test that checks whether a time series exhibits autocorrelation. The null hypothesis H0 is that the residuals are independently distributed. The alternative hypothesis is that the residuals are not independently distributed and exhibit serial correlation.

Perform a Ljung-Box test on the Holt-Winters forecast.

Box.test(hw_forecast, lag = 1, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  hw_forecast
## X-squared = 0.07052, df = 1, p-value = 0.7906

Here we see a p-value much greater than 0.01, so we cannot reject the null hypothesis, indicating the series shows no significant autocorrelation.

Perform a Ljung-Box test on the ARIMA forecast.

Box.test(arima_forecast, lag = 1, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  arima_forecast
## X-squared = 0.097494, df = 1, p-value = 0.7549

Here we see a p-value much greater than 0.01, so we cannot reject the null hypothesis, indicating the series shows no significant autocorrelation.
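Note that Box.test() above is applied to the forecast objects themselves; the more conventional check applies the Ljung-Box test to the model residuals. A sketch using the forecast package's checkresiduals(), which runs the test on the residual series and also plots residual diagnostics:

checkresiduals(hw_forecast) # Ljung-Box on the Holt-Winters residuals
checkresiduals(arima_forecast) # Ljung-Box on the ARIMA residuals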

Box.test(training, lag = 1, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  training
## X-squared = 39.382, df = 1, p-value = 3.486e-10

The original training time series, by contrast, shows a significant presence of autocorrelation.

Box.test(testing, lag = 1, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  testing
## X-squared = 2.7865, df = 1, p-value = 0.09506

and the testing time series shows no significant autocorrelation (p-value > 0.01).

Conclusion

On the basis of the mean error (ME ~9.5k vs ~84k) and the mean absolute scaled error (MASE ~0.82 vs ~0.87) on the "testing" time series, the Holt-Winters model gives more accurate results than ARIMA. We conclude that Holt-Winters is the better forecasting model for this time series of 3-bedroom house prices. A further factor supporting Holt-Winters is the presence of trend and seasonality, which characterize the type of time series suited to Holt-Winters modelling.