Introduction
Description
This notebook tackles a time series forecasting problem using SARIMA and Holt-Winters models. The dataset contains house prices in the USA from 2007 to 2019 (dataset link).
Structure
Here is the structure of the notebook:
- Data Preparation
- Exploratory Data Analysis
- Modelling
- Model Evaluation
- Verdict
Data Preparation
Read the dataset
# packages used throughout the notebook
library(dplyr)
library(lubridate)
library(forecast)
library(tseries)    # adf.test, kpss.test
library(TSstudio)   # test_forecast
hs_data <- read.csv("data/raw_sales.csv")
rmarkdown::paged_table(head(hs_data))
The raw data consists of 5 columns, some of which are correlated with house prices (such as bedrooms and propertyType), but in this notebook we focus on using only the time variable to predict house prices.
Checking dataset column types
str(hs_data)
## 'data.frame': 29580 obs. of 5 variables:
## $ datesold : chr "2007-02-07 00:00:00" "2007-02-27 00:00:00" "2007-03-07 00:00:00" "2007-03-09 00:00:00" ...
## $ postcode : int 2607 2906 2905 2905 2906 2905 2607 2606 2902 2906 ...
## $ price : int 525000 290000 328000 380000 310000 465000 399000 1530000 359000 320000 ...
## $ propertyType: chr "house" "house" "house" "house" ...
## $ bedrooms : int 4 3 3 4 3 4 3 4 3 3 ...
The data contains 29,580 samples, and the datesold column needs to be converted to a date data type.
Check NA Values
colSums(is.na(hs_data))
## datesold postcode price propertyType bedrooms
## 0 0 0 0 0
The data has zero NA values, so we can move on to the next step.
Changing Data Types
hs_data$datesold <- ymd_hms(hs_data$datesold)
hs_data <- hs_data %>% select(-postcode) %>% mutate(propertyType = as.factor(propertyType))
head(hs_data)
## datesold price propertyType bedrooms
## 1 2007-02-07 525000 house 4
## 2 2007-02-27 290000 house 3
## 3 2007-03-07 328000 house 3
## 4 2007-03-09 380000 house 4
## 5 2007-03-21 310000 house 3
## 6 2007-04-04 465000 house 4
Summary of data
summary(hs_data)
## datesold price propertyType bedrooms
## Min. :2007-02-07 00:00:00 Min. : 56500 house:24552 Min. :0.00
## 1st Qu.:2013-02-05 00:00:00 1st Qu.: 440000 unit : 5028 1st Qu.:3.00
## Median :2015-09-30 00:00:00 Median : 550000 Median :3.00
## Mean :2015-02-21 07:35:42 Mean : 609736 Mean :3.25
## 3rd Qu.:2017-07-26 00:00:00 3rd Qu.: 705000 3rd Qu.:4.00
## Max. :2019-07-27 00:00:00 Max. :8000000 Max. :5.00
From the summary we know that the data spans 2007-02-07 to 2019-07-27. We also see that the median price is 550,000 US dollars and the mean is about 609,736 US dollars.
Before we move on we need to check the time series data assumptions (a quick sketch of each check follows this list):
- dates must be sorted from oldest to newest
- the date interval must be constant
- no gaps (missing dates) within the interval
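A minimal sketch of checking each assumption separately (the helper names obs_months and full_months are ours; assumes lubridate is loaded):
# 1. dates sorted from oldest to newest
all(diff(hs_data$datesold) >= 0)
# 2 & 3. constant interval with no gaps: compare the observed months
# against a complete monthly sequence
obs_months <- sort(unique(floor_date(hs_data$datesold, "month")))
full_months <- seq(min(obs_months), max(obs_months), by = "month")
identical(as.Date(obs_months), as.Date(full_months))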
Check Time Series Assumptions
all((hs_data$date) == (seq(ymd("2007-02-01"), ymd("2019-07-01"), by = "month")))
## Warning in `==.default`((hs_data$date), (seq(ymd("2007-02-01"),
## ymd("2019-07-01"), : longer object length is not a multiple of shorter object
## length
## [1] FALSE
Because the raw data does not fulfill these criteria, we need to do some feature engineering.
Feature Engineering
hs_agg <- hs_data %>%
mutate(month_new = month(hs_data$datesold),
year_new = year(hs_data$datesold),
date = ym(paste0(year_new,"-",month_new))) %>%
select(-propertyType, -bedrooms, -datesold) %>%
group_by(date) %>%
summarise(price_monthly_avg = mean(price))
hs_agg
## # A tibble: 150 x 2
## date price_monthly_avg
## <date> <dbl>
## 1 2007-02-01 407500
## 2 2007-03-01 339333.
## 3 2007-04-01 798000
## 4 2007-05-01 339500
## 5 2007-06-01 520333.
## 6 2007-07-01 592079.
## 7 2007-08-01 505609.
## 8 2007-09-01 556875
## 9 2007-10-01 498600
## 10 2007-11-01 505442.
## # ... with 140 more rows
We aggregate the data by month, averaging the prices within each month. This gives a monthly average price series that satisfies all the time series assumptions. Although this differs from the initial goal of predicting individual house prices, forecasting the monthly average instead of the irregular daily sale dates in the raw data (e.g. the 7th or 27th of a month) can yield a more stable and useful prediction.
Check Assumptions
all((hs_agg$date) == (seq(ymd("2007-02-01"), ymd("2019-07-01"), by = "month")))
## [1] TRUE
Exploratory Data Analysis
Time Series Plot
plot(hs_agg$date, hs_agg$price_monthly_avg, type = 'l')
From the plot we can see that the time series is additive and that there is an increasing trend in house prices.
Time Series Object
hp_ts <- ts(data = hs_agg$price_monthly_avg,
start = c(2007, 2),
frequency = 12)
We use frequency = 12 because the data is monthly.
Decomposition
# decomposition
hp_dc <- hp_ts %>%
decompose(type = "additive")
# autoplot
hp_dc %>%
autoplot()
From the decomposition we can see that the data has monthly seasonality with an upward trend.
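To put a rough number on these components, one option (a sketch following the usual strength-of-trend/seasonality idea; the variable name remainder is ours) is to compare the variance of the remainder against the detrended and deseasonalized series:
# strength of trend and seasonality from the decomposition components
# (values near 1 mean a strong component; the moving-average trend
# leaves NAs at the series ends, hence na.rm = TRUE)
remainder <- hp_dc$random
max(0, 1 - var(remainder, na.rm = TRUE) / var(hp_dc$trend + remainder, na.rm = TRUE))    # trend strength
max(0, 1 - var(remainder, na.rm = TRUE) / var(hp_dc$seasonal + remainder, na.rm = TRUE)) # seasonal strength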
Cross Validation
# last 28 months for the test set, the rest for the training set
hp_test <- tail(hp_ts, 28)
hp_train <- head(hp_ts, -28)
We split the series into train and test sets, with the test set consisting of the last 28 months as a benchmark for model evaluation.
Check Stationarity
# assign final ts object
hs_msts <- hp_train %>%
msts(seasonal.periods = c(7*4))
# check for stationary
adf.test(hs_msts)
##
## Augmented Dickey-Fuller Test
##
## data: hs_msts
## Dickey-Fuller = -3.1067, Lag order = 4, p-value = 0.1168
## alternative hypothesis: stationary
Because the p-value > 0.05, we cannot reject the null hypothesis of non-stationarity, so we treat the training series as non-stationary. We can make it stationary by differencing (diff).
# assign final ts object
hs_msts <- diff(hp_train) %>%
msts(seasonal.periods = c(7*4))
# check for stationary
adf.test(hs_msts)
## Warning in adf.test(hs_msts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: hs_msts
## Dickey-Fuller = -8.6053, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
Now that the p-value < 0.05, we can reject the null hypothesis: the differenced series is stationary.
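As a cross-check, the KPSS test (also from the tseries package) reverses the hypotheses: its null is stationarity, so a large p-value here would agree with the ADF result above.
# KPSS test: H0 = stationary, so p > 0.05 supports stationarity
kpss.test(hs_msts)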
Plot: Stationary vs Non-Stationary
hp_ts_diff <- diff(hp_train)
hp_ts %>%
autoplot(series = "Actual") +
autolayer(hp_ts_diff, series = "Diff 1")
Modelling
SARIMA Model
hp_auto <- auto.arima(hp_train, seasonal = T)
hp_auto
## Series: hp_train
## ARIMA(3,1,1)(1,0,2)[12]
##
## Coefficients:
## ar1 ar2 ar3 ma1 sar1 sma1 sma2
## -0.2099 0.1438 0.5862 -0.9470 0.8685 -0.8385 0.2707
## s.e. 0.1065 0.1185 0.1305 0.0503 0.0709 0.1725 0.1614
##
## sigma^2 estimated as 1910218089: log likelihood=-1465.93
## AIC=2947.87 AICc=2949.15 BIC=2970.23
Here auto.arima selects a SARIMA model of order ARIMA(3,1,1)(1,0,2)[12].
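The selected order can also be refit explicitly with forecast::Arima(), which makes the terms easy to tweak by hand; a sketch (the name hp_manual is ours):
# refit the order chosen by auto.arima manually
hp_manual <- Arima(hp_train,
                   order = c(3, 1, 1),    # (p, d, q)
                   seasonal = c(1, 0, 2)) # (P, D, Q), period 12 taken from the ts
hp_manual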
Holt-Winters Model
hp_hw <- stlm(hp_train, method = "ets", lambda = 0)
summary(hp_hw)
## Length Class Mode
## stl 488 mstl numeric
## model 19 ets list
## modelfunction 1 -none- function
## lambda 1 -none- numeric
## x 122 ts numeric
## series 1 -none- character
## m 1 -none- numeric
## fitted 122 ts numeric
## residuals 122 ts numeric
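Note that stlm(method = "ets") fits an ETS model to the STL seasonally adjusted series rather than a textbook Holt-Winters model. For comparison, a classic additive Holt-Winters fit would be a sketch like this (not used in the evaluation below):
# classic triple exponential smoothing (stats package)
hp_hw_classic <- HoltWinters(hp_train, seasonal = "additive")
forecast(hp_hw_classic, h = 28)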
Model Plotting
hp_train%>%
autoplot(series = "Actual") +
autolayer(hp_auto$fitted, series = "SARIMA") +
autolayer(hp_hw$fitted, series = "Holt Winters")
From the graph both models seem to fit well, but we can assess model performance objectively in the model evaluation section.
Model Evaluation
Forecasting
# forecast
hp_sarima_fr <- forecast(hp_auto, h = 28)
hp_hw_fr <- forecast(hp_hw, h = 28)
Here we create forecasts 28 months into the future, matching the length of the test set.
Model Comparison
forecast_list <- list(
"SARIMA" = hp_sarima_fr,
"HW" = hp_hw_fr)
compare_forecast <- function(x, test){
lapply(x, function(x) forecast::accuracy(x, test)["Test set", ]) %>%
lapply(., function(x) x %>% t() %>% as.data.frame) %>%
bind_rows() %>%
mutate(model = names(x)) %>%
select(model, everything())
}
(compare_forecast(forecast_list, hp_test) %>% select(model, MAPE))
## model MAPE
## 1 SARIMA 4.524789
## 2 HW 3.189914
The models achieve good results, with MAPE values of about 4.5 for SARIMA and 3.2 for HW. We choose MAPE because our prices are in the hundreds of thousands, so we need a scale-free metric: MAPE is expressed as a percentage of the actual values, whereas mean squared error or root mean squared error are scale-dependent and unstable when the variance of the data is high.
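For reference, MAPE is just the mean absolute error expressed as a percentage of the actual values; a minimal sketch of computing it by hand (the mape helper is ours):
# MAPE = mean(|actual - forecast| / |actual|) * 100
mape <- function(actual, fc) mean(abs(actual - fc) / abs(actual)) * 100
mape(hp_test, hp_sarima_fr$mean)
mape(hp_test, hp_hw_fr$mean)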
SARIMA Model Performance Plot
test_forecast(actual = hp_ts,
forecast.obj = hp_sarima_fr,
train = hp_train,
test = hp_test)
The SARIMA forecast tracks the data better when the fluctuations in the real data are not at their peaks; it misses the peaks, which means the model is not stable.
Holt-Winters Model Performance Plot
test_forecast(actual = hp_ts,
forecast.obj = hp_hw_fr,
train = hp_train,
test = hp_test)
The HW model suffers from the same problem as SARIMA but shows better overall performance.
Assumption Check
Because of its better MAPE performance, we pick the HW model for the assumption checks.
No Autocorrelation in Residuals
Acf(hp_hw$residuals)
We can also check using the Ljung-Box test.
Box.test(hp_hw$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: hp_hw$residuals
## X-squared = 9.0206, df = 1, p-value = 0.00267
The p-value is lower than 0.05 (alpha), so we reject the null hypothesis of no autocorrelation: the residuals are autocorrelated, and the no-autocorrelation assumption is not satisfied.
Normality of Residuals
hist(hp_hw$residuals, breaks = 15)
From the histogram we can see that most residuals lie around 0, but we can use the Shapiro-Wilk test to check the assumption formally.
shapiro.test(hp_hw$residuals)
##
## Shapiro-Wilk normality test
##
## data: hp_hw$residuals
## W = 0.8245, p-value = 0.00000000009467
Because the p-value is lower than alpha, we reject the null hypothesis of normality: the residuals are not normally distributed, so the normality assumption is not satisfied either.
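A normal Q-Q plot is a quick visual complement to the Shapiro-Wilk test; points straying from the line in the tails are what drives the rejection:
# visual normality check of the residuals
qqnorm(hp_hw$residuals)
qqline(hp_hw$residuals, col = "red")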
Verdict
House prices have changed over the past years for many reasons. Predicting house prices can help people plan when to buy a house, or estimate when to sell one for better value. Using SARIMA and Holt-Winters models we achieved MAPE values of around 4.5 and 3.2, which indicates good predictive capability. We also checked the time series assumptions on the data and the residual assumptions on the models; the residual assumptions were not fully satisfied, so the forecasts should be interpreted with some caution.