This is a time series analysis and forecasting exercise using the Holt-Winters and Autoregressive Integrated Moving Average (ARIMA) methods. The data was downloaded from:
https://www.kaggle.com/datasets/htagholdings/property-sales?select=raw_sales.csv
The forecast results are evaluated with accuracy() and the Ljung-Box test.
Load the required libraries.
library(dplyr) # for data wrangling
library(lubridate) # to deal with dates
library(padr) # for padding
library(forecast) # time series modelling and forecasting
library(tseries) # for adf.test
library(MLmetrics) # error metrics
library(timetk) # time series toolkit
library(TSstudio) # for ts_split
Import the data.
sales <- read.csv('raw_sales.csv')
rmarkdown::paged_table(sales)
Check for missing values; there are none.
anyNA(sales)
## [1] FALSE
Check the unique values of sales$propertyType.
unique(sales$propertyType)
## [1] "house" "unit"
Check the unique values of sales$bedrooms.
unique(sales$bedrooms)
## [1] 4 3 5 1 2 0
Create a subset with propertyType == "unit".
sales_unit <- sales %>% filter(propertyType == "unit")
Display the sales_unit subset.
rmarkdown::paged_table(sales_unit)
Create a subset with propertyType == "house" and bedrooms == 3.
sales_house_3 <- sales %>% filter(propertyType == "house" & bedrooms == 3)
Display the sales_house_3 subset.
rmarkdown::paged_table(sales_house_3)
Check the data types of sales_house_3. This is the time series we will model.
str(sales_house_3)
## 'data.frame': 11281 obs. of 5 variables:
## $ datesold : chr "2007-02-27 00:00:00" "2007-03-07 00:00:00" "2007-03-21 00:00:00" "2007-04-24 00:00:00" ...
## $ postcode : int 2906 2905 2906 2607 2902 2906 2902 2906 2906 2602 ...
## $ price : int 290000 328000 310000 399000 359000 320000 385000 305000 336000 427500 ...
## $ propertyType: chr "house" "house" "house" "house" ...
## $ bedrooms : int 3 3 3 3 3 3 3 3 3 3 ...
Check the format of the date field sales_house_3$datesold.
head(unique(sales_house_3$datesold))
## [1] "2007-02-27 00:00:00" "2007-03-07 00:00:00" "2007-03-21 00:00:00"
## [4] "2007-04-24 00:00:00" "2007-05-24 00:00:00" "2007-05-25 00:00:00"
Convert sales_house_3$datesold date field from “chr” to “datetime”.
sales_house_3$datesold <- ymd_hms(sales_house_3$datesold)
View the changes made.
glimpse(sales_house_3)
## Rows: 11,281
## Columns: 5
## $ datesold <dttm> 2007-02-27, 2007-03-07, 2007-03-21, 2007-04-24, 2007-05-…
## $ postcode <int> 2906, 2905, 2906, 2607, 2902, 2906, 2902, 2906, 2906, 260…
## $ price <int> 290000, 328000, 310000, 399000, 359000, 320000, 385000, 3…
## $ propertyType <chr> "house", "house", "house", "house", "house", "house", "ho…
## $ bedrooms <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
Check a brief statistical summary of the final dataset.
summary(sales_house_3)
## datesold postcode price
## Min. :2007-02-27 00:00:00.00 Min. :2600 Min. : 69000
## 1st Qu.:2012-08-17 00:00:00.00 1st Qu.:2611 1st Qu.: 432500
## Median :2015-05-22 00:00:00.00 Median :2615 Median : 496000
## Mean :2014-11-09 08:28:48.35 Mean :2740 Mean : 550264
## 3rd Qu.:2017-05-01 00:00:00.00 3rd Qu.:2905 3rd Qu.: 609000
## Max. :2019-07-26 00:00:00.00 Max. :2914 Max. :6480000
## propertyType bedrooms
## Length:11281 Min. :3
## Class :character 1st Qu.:3
## Mode :character Median :3
## Mean :3
## 3rd Qu.:3
## Max. :3
Use the ts() function to convert the chosen dataset to a time series and plot it.
sales_ts <- ts(sales_house_3$price, start = c(2007, 2), end = c(2019, 7), frequency = 365)
forecast::autoplot(sales_ts)
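Note that the sales occur on irregular dates, so the ts() call above simply treats successive sales as equally spaced observations. As an alternative, hypothetical sketch (not used in the analysis below), the padr package loaded earlier could aggregate the sales to a regular monthly series before conversion:
sales_monthly <- sales_house_3 %>%
  mutate(month = as.Date(floor_date(datesold, unit = "month"))) %>% # lubridate
  group_by(month) %>%
  summarise(price = mean(price), .groups = "drop") %>%
  pad(interval = "month") %>% # padr: add rows for months with no sales
  mutate(price = na.interp(price)) # forecast::na.interp fills the gaps
sales_monthly_ts <- ts(sales_monthly$price, start = c(2007, 2), frequency = 12)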
Decompose the time series to view its seasonal, trend and random components.
sales_decom <- sales_ts %>% decompose()
plot(sales_decom)
The decomposition shows a rising trend in the years leading up to 2012 and a roughly stationary trend from 2015 onwards. It also shows a regular seasonal pattern, with price spikes early and in the middle of each year.
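As a quick sketch of what decompose() produces: it fits an additive model in which the observed series is the sum of the trend, seasonal and random components, which we can verify on the rows where the moving-average trend is defined:
ok <- !is.na(sales_decom$random)
all.equal(as.numeric(sales_ts)[ok],
          as.numeric(sales_decom$trend + sales_decom$seasonal + sales_decom$random)[ok])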
Use ts_split() to split the data into a test set of the last 500 observations, with the rest as training data.
split_sales_ts <- ts_split(ts.obj = sales_ts, sample.out = 500)
training <- split_sales_ts$train
testing <- split_sales_ts$test
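ts_split() (from TSstudio) simply reserves the last sample.out observations as the test set; an equivalent base-R sketch using window():
n <- length(sales_ts)
training_alt <- window(sales_ts, end = time(sales_ts)[n - 500]) # all but the last 500 obs
testing_alt <- window(sales_ts, start = time(sales_ts)[n - 499]) # the last 500 obs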
Check the sizes of the training and test sets.
length(training)
## [1] 3886
length(testing)
## [1] 500
Check training dataset begin and end observations.
head(training)
## Time Series:
## Start = c(2007, 2)
## End = c(2007, 7)
## Frequency = 365
## [1] 290000 328000 310000 399000 359000 320000
tail(training)
## Time Series:
## Start = c(2017, 232)
## End = c(2017, 237)
## Frequency = 365
## [1] 462000 485000 513500 635000 353000 420000
Check test dataset begin and end observations.
head(testing)
## Time Series:
## Start = c(2017, 238)
## End = c(2017, 243)
## Frequency = 365
## [1] 440000 456000 730000 673000 500000 342000
tail(testing)
## Time Series:
## Start = c(2019, 2)
## End = c(2019, 7)
## Frequency = 365
## [1] 850000 490000 413000 415000 475000 537000
Holt-Winters is a forecasting model suited to time series that exhibit both trend and seasonality.
model_hw <- HoltWinters(training)
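HoltWinters() estimates its three smoothing parameters by minimising the squared one-step prediction errors; they can be inspected on the fitted object (the exact values depend on the data and are not shown here):
model_hw$alpha # level smoothing
model_hw$beta # trend smoothing
model_hw$gamma # seasonal smoothing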
Plot the Holt-Winters fits against the training data.
HW1 <- model_hw # the full fit from above: level, trend and seasonality
HW2 <- HoltWinters(training, beta = FALSE) # no trend component, for comparison
# Visually evaluate the two fits
plot(training, ylab = "Value")
lines(HW1$fitted[,1], lty = 2, col = "blue")
lines(HW2$fitted[,1], lty = 2, col = "red")
Display a zoomed-in section of the previous plot.
plot(training, ylab="Value", xlim=c(2016.1, 2017.12))
lines(HW1$fitted[,1], lty=2, col="blue")
lines(HW2$fitted[,1], lty=2, col="red")
Generate a 365-day forecast from the Holt-Winters model.
hw_forecast <- forecast(
model_hw,
h = 365)
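The forecast object can be plotted directly to view the point forecasts with their prediction intervals, e.g.:
forecast::autoplot(hw_forecast)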
Autoregressive Integrated Moving Average (ARIMA) modelling assumes stationary data, that is, a time series that fluctuates around a constant mean. We can check for non-stationarity using acf().
acf(training)
The ACF shows only minimal correlation (below 0.2) between the time series and its own lagged values, so ARIMA is a suitable choice of forecasting model.
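As a complementary stationarity check, tseries was loaded above for adf.test(); the Augmented Dickey-Fuller test takes a unit root (non-stationarity) as its null hypothesis, so a small p-value supports stationarity (output not shown here):
adf.test(training)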
Fit the ARIMA model using stlm() and plot the fitted values.
model_arima <- stlm(
  training,
  s.window = 365,
  method = "arima"
)
plot(training, ylab="amount", xlim=c())
lines(model_arima$fitted, lty=2, col="blue")
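To see which ARIMA specification auto.arima() selected for the seasonally adjusted series inside stlm(), the fitted model can be inspected (the chosen order depends on the data and is not shown here):
model_arima$model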
Generate a 365-day forecast from the ARIMA model.
arima_forecast <- forecast(
model_arima,
h = 365)
Use accuracy() to compare the two forecasting models against the "testing" time series.
accuracy(hw_forecast, testing)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 3263.605 171157.0 104046.1 -4.661103 20.06696 0.7970241 0.06519543
## Test set 9520.788 158895.6 107015.2 -4.092905 19.76847 0.8197681 0.06017566
## Theil's U
## Training set NA
## Test set 0.9359353
accuracy(arima_forecast,testing)
## ME RMSE MAE MPE MAPE MASE
## Training set 104.1316 159672.4 105508.1 -5.40300 21.29978 0.8082237
## Test set 84176.4858 179154.6 113100.1 11.38168 18.71429 0.8663805
## ACF1 Theil's U
## Training set -0.05535067 NA
## Test set 0.05840184 1.04001
MAE (mean absolute error) is computed as the average absolute difference between the values fitted by the model (one-step-ahead in-sample forecasts) and the observed data. Here MAE suggests the predictions are off by roughly 100k.
MASE (mean absolute scaled error) is the MAE divided by the MAE of the naive model, which predicts the value at time t as the previous observed value. Scaling by this error shows how good the model is relative to the naive baseline: a MASE above 1 means the model is worse than the naive model, and the lower the MASE, the better. The test-set values (0.82 for Holt-Winters vs 0.87 for ARIMA) indicate that Holt-Winters is the better model.
MAPE (mean absolute percentage error) takes the absolute difference between the actual and fitted values as a percentage of the actual value, then averages over all observations. Here MAPE suggests the predictions are out by roughly 18-20%.
Root mean squared error (RMSE) is the square root of the mean squared difference between the predicted and actual values, and is on the same scale as the observed data. Here RMSE suggests the predictions are off by roughly 150-180k.
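As a minimal sketch of how these test-set metrics are computed by hand from the objects above (accuracy() handles details such as seasonal MASE scaling, so results may differ slightly):
n <- min(length(hw_forecast$mean), length(testing))
pred <- as.numeric(hw_forecast$mean)[1:n]
actual <- as.numeric(testing)[1:n]
mean(abs(actual - pred)) # MAE
sqrt(mean((actual - pred)^2)) # RMSE
mean(abs((actual - pred) / actual)) * 100 # MAPE (%)
mean(abs(actual - pred)) / mean(abs(diff(as.numeric(training)))) # MASE, non-seasonal scaling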
The Ljung-Box test is a hypothesis test for autocorrelation in a time series. The null hypothesis H0 is that the residuals are independently distributed; the alternative is that they exhibit serial correlation.
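Note that the test is conventionally applied to a model's residuals rather than to the forecast object itself; a minimal sketch (outputs not shown):
Box.test(residuals(hw_forecast), lag = 1, type = "Ljung")
Box.test(residuals(arima_forecast), lag = 1, type = "Ljung")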
Perform a Ljung-Box test on the HoltWinters forecast.
Box.test(hw_forecast, lag = 1, type = "Ljung")
##
## Box-Ljung test
##
## data: hw_forecast
## X-squared = 0.07052, df = 1, p-value = 0.7906
The p-value is much greater than 0.01, so we cannot reject the null hypothesis: the series shows no significant autocorrelation.
Perform a Ljung-Box test on the ARIMA forecast.
Box.test(arima_forecast, lag = 1, type = "Ljung")
##
## Box-Ljung test
##
## data: arima_forecast
## X-squared = 0.097494, df = 1, p-value = 0.7549
Again the p-value is much greater than 0.01, so we cannot reject the null hypothesis: no significant autocorrelation.
Box.test(training, lag = 1, type = "Ljung")
##
## Box-Ljung test
##
## data: training
## X-squared = 39.382, df = 1, p-value = 3.486e-10
The original training series, by contrast, shows clear autocorrelation (p-value ≈ 3.5e-10).
Box.test(testing, lag = 1, type = "Ljung")
##
## Box-Ljung test
##
## data: testing
## X-squared = 2.7865, df = 1, p-value = 0.09506
and the testing series shows no significant autocorrelation (p-value > 0.01).
On the basis of the mean error (ME ≈ 9.5k vs ≈ 84k on the test set) and the mean absolute scaled error (MASE ≈ 0.82 vs 0.87) on the "testing" set, the Holt-Winters model gives more accurate results than ARIMA. We conclude that Holt-Winters is the better forecasting model for this series of 3-bedroom house prices. A further factor in its favour is the presence of trend and seasonality, which characterise the type of time series Holt-Winters is designed for.