Introduction
Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in getting through traffic; the app even references Star Trek's "beam me up" in its order button.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve their problems and improve their business processes.
Objective
It's almost the end of 2017 and we need to prepare a forecast model to help Scotty get ready for year-end demand. Unfortunately, Scotty is not old enough to have last year's data for December, so we cannot look back at past demand to prepare a forecast for December. Fortunately, time series analysis is more than enough to help us forecast! But, as an investment in the business's future, we also need to develop an automated forecasting framework so we don't have to meddle with forecast model selection in the future!
Build an automated forecasting model for hourly demand that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017)!
Read Library
library(dplyr) # data manipulation
library(lubridate) # working with dates and times
library(tidyverse) # data wrangling
library(forecast) # time series modelling and forecasting
library(yardstick) # measuring forecasting performance
library(recipes) # data preprocessing
library(magrittr) # extended piping operators (e.g. %<>%)
library(timetk) # additional toolkit for time series models
library(purrr) # nested model fitting
library(tidyquant) # additional ggplot aesthetics
library(padr) # time series padding
library(ggpubr) # arranging plots
library(plotly) # interactive plotting
library(ggplot2) # data visualization
library(tidyr) # data reshaping
About Dataset
Import data
The train dataset contains detailed transaction records from October 1st 2017 to December 2nd 2017:
df <- read.csv("Capstone ML Data/5. scotty-ts/data/data-train.csv")
rmarkdown::paged_table(df)
Meta data
The dataset includes information about:
- id: Transaction ID
- trip_id: Trip ID
- driver_id: Driver ID
- rider_id: Rider ID
- start_time: Start time of the request
- src_lat: Request source latitude
- src_lon: Request source longitude
- src_area: Request source area
- src_sub_area: Request source sub-area
- dest_lat: Requested destination latitude
- dest_lon: Requested destination longitude
- dest_area: Requested destination area
- dest_sub_area: Requested destination sub-area
- distance: Trip distance (in KM)
- status: Trip status (every status is considered a demand)
- confirmed_time_sec: Time difference from request to confirmation (in seconds)
Data Wrangling
Datatype check
glimpse(df)
## Rows: 90,113
## Columns: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708ceb…
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6189…
## $ driver_id <chr> "59a892c5568be44b2734f276", NA, "599dc0dfa5b4fd5471…
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59cd704bcf482f6ce2fadf…
## $ start_time <chr> "2017-10-01T00:00:17Z", "2017-10-01T00:02:34Z", "20…
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760, 4…
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827, 2…
## $ src_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sx…
## $ src_sub_area <chr> "sxk9s", "sxk9e", "sxk9s", "sxk9e", "sxk9e", "sxk9s…
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125, 4…
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316, 2…
## $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sx…
## $ dest_sub_area <chr> "sxk9u", "sxk9e", "sxk9e", "sxk9e", "sxk9q", "sxk9e…
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.693573, …
## $ status <chr> "confirmed", "nodrivers", "confirmed", "confirmed",…
## $ confirmed_time_sec <int> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 106,…
There are some columns that have incorrect data types
# Count unique values in every column
sapply(df, n_distinct)
## id trip_id driver_id rider_id
## 90113 85438 1516 16010
## start_time src_lat src_lon src_area
## 88729 65749 57103 1
## src_sub_area dest_lat dest_lon dest_area
## 3 37496 30055 53
## dest_sub_area distance status confirmed_time_sec
## 164 76240 2 397
We will change some data types that are not correct, such as:
- start_time to date-time
- driver_id, rider_id, src_sub_area, dest_area, dest_sub_area, status to factor
Change datatypes
df <- df %>%
mutate_at(vars(driver_id,
rider_id,
src_sub_area,
dest_area,
dest_sub_area,
status),
as.factor) %>%
mutate(start_time = ymd_hms(start_time))
glimpse(df)
## Rows: 90,113
## Columns: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708ceb…
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6189…
## $ driver_id <fct> 59a892c5568be44b2734f276, NA, 599dc0dfa5b4fd5471ad8…
## $ rider_id <fct> 59ad2d6efba75a581666b506, 59cd704bcf482f6ce2fadfdb,…
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10-…
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760, 4…
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827, 2…
## $ src_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sx…
## $ src_sub_area <fct> sxk9s, sxk9e, sxk9s, sxk9e, sxk9e, sxk9s, sxk9s, sx…
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125, 4…
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316, 2…
## $ dest_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk…
## $ dest_sub_area <fct> sxk9u, sxk9e, sxk9e, sxk9e, sxk9q, sxk9e, sxk9e, sx…
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.693573, …
## $ status <fct> confirmed, nodrivers, confirmed, confirmed, nodrive…
## $ confirmed_time_sec <int> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 106,…
Data Preparation
Floor start_time
df <- df %>%
mutate(datetime = floor_date(start_time, unit = "hour"))
Calculate the demand for each area in every hour
# take 3 sub areas separately <data aggregation>
sxk97 <- df %>%
filter(src_sub_area == 'sxk97') %>%
group_by(src_sub_area, datetime) %>%
summarise(demand = n(), .groups = "drop")
sxk9e <- df %>%
filter(src_sub_area == 'sxk9e') %>%
group_by(src_sub_area, datetime) %>%
summarise(demand = n(), .groups = "drop")
sxk9s <- df %>%
filter(src_sub_area == 'sxk9s') %>%
group_by(src_sub_area, datetime) %>%
summarise(demand = n(), .groups = "drop")
Characteristics of Time Series Data
Time series data is data indexed by time with a fixed time interval.
💡 The main requirements for time series data:
- The data must be sorted from the oldest to the newest observation
- The time interval must be constant
- There must be no missing periods
- There must be no missing values
Check the datetime range in every sub-area
range(sxk9e$datetime)
## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"
range(sxk97$datetime)
## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"
range(sxk9s$datetime)
## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"
Each area has the same time range
Inspect missing periods
all(seq(
from=as.POSIXct("2017-10-01 00:00:00", tz="UTC"),
to=as.POSIXct("2017-12-02 23:00:00", tz="UTC"),
by="hour"
) == sxk9e$datetime)
## [1] FALSE
The data above is not yet ready for time series analysis because several hourly periods are missing.
Padding
Padding ensures that the time intervals are complete: it inserts the missing periods (hours) based on the datetime column and fills them with missing values (NA), which we then replace.
sxk97_use <- sxk97 %>%
arrange(datetime) %>% # sort time
pad(start_val = min(sxk97$datetime), # padding process
end_val = max(sxk97$datetime)) %>%
mutate(demand = ifelse(is.na(demand), 0, demand)) %>% # fill `NA` in demand with 0
replace(is.na(.), "sxk97") # fill `NA` in sub_area with sub_area name
## pad applied on the interval: hour
sxk9e_use <- sxk9e %>%
arrange(datetime) %>%
pad(start_val = min(sxk9e$datetime),
end_val = max(sxk9e$datetime)) %>%
mutate(demand = ifelse(is.na(demand), 0, demand)) %>%
replace(is.na(.), "sxk9e")
## pad applied on the interval: hour
sxk9s_use <- sxk9s %>%
arrange(datetime) %>%
pad(start_val = min(sxk9s$datetime),
end_val = max(sxk9s$datetime)) %>%
mutate(demand = ifelse(is.na(demand), 0, demand)) %>%
replace(is.na(.), "sxk9s")
## pad applied on the interval: hour
# missing value final check
anyNA(sxk9s_use)
## [1] FALSE
anyNA(sxk97_use)
## [1] FALSE
anyNA(sxk9e_use)
## [1] FALSE
When demand equals 0, there was no incoming request in that hour.
The data is now ready for time series analysis.
Gather each sub area into one dataframe
scotty <- rbind(sxk97_use,sxk9e_use,sxk9s_use)
rmarkdown::paged_table(scotty)
Exploratory Data Analysis
We would like to see the demand by sub-area throughout the day:
ggplotly(ggplot(scotty,aes(x = datetime,
y = demand)) +
geom_line(aes(col = src_sub_area)) +
labs(y = "Order Request",
x = "Date",
title = "General Order Demand") +
facet_wrap(~ src_sub_area,
scale = "free_y",
ncol = 1) +
theme_bw() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)))
Based on the graph, on November 3rd, 2017 at 6 pm there was very high demand in the SXK97 (207) and SXK9S (217) sub-areas, visible as a sharp spike at that time. For the SXK9E sub-area, the highest demand occurred on November 24th, 2017 at 6 pm, with 200 requests.
We can see that there is a seasonal pattern, although the trend is not clear yet. To investigate further, we will make daily and weekly polar seasonal plots.
Daily and Weekly Demand by Sub-Area
sxk97
sxk97_plot <- scotty %>%
filter(src_sub_area == "sxk97") %>%
.$demand
sxk97_daily <- ggseasonplot(ts(sxk97_plot,
frequency = 24),
polar = T) +
theme_light() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK97 Daily",
x = "Hour") +
scale_y_sqrt()
sxk97_w <- scotty %>%
filter(src_sub_area == "sxk97") %>%
mutate(date = format(datetime,"%Y/%m/%d")) %>%
group_by(date) %>%
mutate(demand = sum(demand)) %>%
select(c(date,demand)) %>%
distinct()
sxk97_weekly <- ggseasonplot(ts(sxk97_w$demand, frequency = 7),
polar = T) +
theme_light() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK97 Weekly",
x = "Day") +
scale_y_sqrt()
ggarrange(sxk97_daily, sxk97_weekly)
In SXK97, the peak hour is 6 pm and the peak day is Friday. The off-peak hours are 5-6 am, and Sunday is the day with the least demand.
sxk9e
sxk9e_plot<- scotty %>%
filter(src_sub_area == "sxk9e") %>%
.$demand
sxk9e_daily <- ggseasonplot(ts(sxk9e_plot,frequency = 24),
polar = T) +
theme_light() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK9E Daily",
x = "Hour") +
scale_y_sqrt()
sxk9e_w <- scotty %>%
filter(src_sub_area == "sxk9e") %>%
mutate(date = format(datetime,"%Y/%m/%d")) %>%
group_by(date) %>%
mutate(demand = sum(demand)) %>%
select(c(date,demand)) %>%
distinct()
sxk9e_weekly <- ggseasonplot(ts(sxk9e_w$demand,frequency = 7),
polar = T) +
theme_light() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK9E Weekly",
x = "Day") +
scale_y_sqrt()
ggarrange(sxk9e_daily, sxk9e_weekly)
In SXK9E, the peak hours are 6-7 pm and the peak day is Friday. The off-peak hours are 5-6 am, and Sunday is the day with the least demand.
sxk9s
sxk9s_plot<- scotty %>%
filter(src_sub_area == "sxk9s") %>%
.$demand
sxk9s_daily <- ggseasonplot(ts(sxk9s_plot,frequency = 24),
polar = T) +
theme_light() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK9S Daily",
x = "Hour") +
scale_y_sqrt()
sxk9s_w <- scotty %>%
filter(src_sub_area == "sxk9s") %>%
mutate(date = format(datetime,"%Y/%m/%d")) %>%
group_by(date) %>%
mutate(demand = sum(demand)) %>%
select(c(date,demand)) %>%
distinct()
sxk9s_weekly <- ggseasonplot(ts(sxk9s_w$demand,frequency = 7),
polar = T) +
theme_light() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(title = "SXK9S Weekly",
x = "Day") +
scale_y_sqrt()
ggarrange(sxk9s_daily, sxk9s_weekly)
In SXK9S, the peak hour is 6 pm and the peak day is Friday. The off-peak hours are 5-6 am, and Sunday is the day with the least demand.
Decomposition
Decomposition is a stage in time series analysis that breaks the data down into several components.
💡 Components of a time series:
- Trend: the general pattern of the data, tending to increase or decrease. If the extracted trend still shows a clear pattern, some seasonality has not been decomposed properly.
- Seasonal: a repeating pattern over a fixed period of time.
- Error/Remainder/Random: the pattern that cannot be captured by the trend and seasonal components.
Before doing any forecasting modeling, we need to inspect the time series object through its decomposition. The main idea of decomposition is to separate the ts object into these three components (trend, seasonal, remainder).
Daily
area <- scotty %>%
filter(src_sub_area == "sxk97")
ts_day <- ts(data = area$demand,
frequency = 24)
ts_day %>%
decompose() %>%
autoplot() +
labs(title = "Decomposition on Daily Time Series") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.3))
Weekly
ts_week <- ts(data = area$demand,
frequency = 24*7)
ts_week %>%
decompose() %>%
autoplot() +
labs(title = "Decomposition on Weekly Time Series") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.3))
Unfortunately, the trend resulting from the daily and weekly decompositions is not smooth enough. This might be caused by uncaptured extra seasonality, so the data can be considered multi-seasonal.
Therefore, we try another option by creating a multiple-seasonality time series object, msts, with daily and weekly seasonality.
Daily-Weekly (Multi Seasonality)
daily_weekly <- msts(data = area$demand,
seasonal.periods = c(24, 24*7))
daily_weekly %>%
mstl() %>%
autoplot() +
labs(title = "Decomposition on Daily Weekly Time Series") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.3))
This time the trend becomes smoother, which indicates that our data indeed has multiple seasonality.
The conclusions obtained from the EDA will help us choose the best time series model.
Data preprocessing
Cross Validation
Before modeling, we have to separate our data into two parts: a train dataset and a test dataset. The test dataset is the last one week of the data, while the rest is our train dataset.
# size of train data and test data
test_size <- 24 * 7
train_size <- nrow(sxk97_use) - test_size
# initial range and final range for each data
test_end <- max(scotty$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- train_end - hours(train_size) + hours(1)
# interval for each data
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)
Plot train and test
train_test_plot <- scotty %>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
mutate(sample = factor(sample, levels = c("train", "test"))) %>%
ggplot(aes(x = datetime, y = demand, colour = sample)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light()
ggplotly(train_test_plot)
Date time series
Before we form the time series objects, we will first preprocess the data using the recipes library. recipes works column-wise, so we will reshape our data into a wide (tabular) format based on src_sub_area using the spread() function from the tidyr library.
scotty <- scotty %>%
spread(src_sub_area, demand)
rmarkdown::paged_table(scotty)
We will do some data transformations, which aim to make our models more robust and less sensitive to outliers.
Square-root transformation, centering, and scaling will be performed using the recipe() function from the recipes library:
rec <- recipe(~ ., filter(scotty, datetime %within% intrain)) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
# preview bake result
scotty <- bake(rec, scotty)
rmarkdown::paged_table(scotty)
Because we have transformed the data, we must create a function to return it to its original scale. So we create a new function named rec_revert:
rec_revert <- function(vector, rec, varname) {
# store recipe values
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
# convert back based on the recipe
results <- (vector * rec_scale + rec_center) ^ 2
# add additional adjustment if necessary
results <- round(results)
# return the results
results
}
This function reverts the transformed data back to its original form.
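As a quick illustrative check (assuming the objects above are still in memory), reverting the transformed sxk97 column should reproduce the original hourly counts:
# sanity check: undo the sqrt + center + scale transformation for one column
head(rec_revert(scotty$sxk97, rec, "sxk97")) # reverted values
head(sxk97_use$demand) # original hourly demand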
Now we can convert our data into the long format again:
scotty <- scotty %>%
gather(src_sub_area, demand, -datetime)
rmarkdown::paged_table(scotty)
Modelling
In functional programming using purrr, we need to convert our tbl into a nested tbl. Nested data is like a table inside a table, which can be controlled through a key indicator; in other words, we can have a tbl for each sub-area and sample. Using this format, the fitting and forecasting process becomes very versatile, and we can still convert the results back as long as we keep a proper key such as src_sub_area.
Let's start by converting our tbl into a nested tbl. First, we need to add a sample indicator so it can be recognized as a key when we nest() the data:
# label data to train and test
scotty <- scotty %>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na()
# nest train data
scotty <- scotty %>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
spread(sample, data)
## Warning: `.key` is deprecated
rmarkdown::paged_table(scotty)
Time series format
To convert our data into time series form we need the ts() or msts() function. So we will create a list of functions named data_funs that contains both, with the following seasonality settings:
data_funs <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24*7))
)
data_funs
## $ts
## function(x) ts(x$demand, frequency = 24)
##
## $msts
## function(x) msts(x$demand, seasonal.periods = c(24, 24*7))
- ts is used for single seasonality, with daily seasonality (frequency = 24)
- msts is used for multiple seasonality, with daily (frequency = 24) and weekly (frequency = 24 * 7) seasonality
Then we convert the list into a tbl using enframe(). Note that we should also attach a key, which in our case is src_sub_area, replicated with the rep() function:
# list to tbl
data_funs <- data_funs %>%
rep(length(unique(scotty$src_sub_area))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(src_sub_area =
sort(rep(unique(scotty$src_sub_area), length(unique(.$data_fun_name))))
)
data_funs
## # A tibble: 6 × 3
## data_fun_name data_fun src_sub_area
## <chr> <list> <chr>
## 1 ts <fn> sxk97
## 2 msts <fn> sxk97
## 3 ts <fn> sxk9e
## 4 msts <fn> sxk9e
## 5 ts <fn> sxk9s
## 6 msts <fn> sxk9s
Then we join the nested functions with our nested data for model fitting:
# join nested function
scotty <- scotty %>%
left_join(data_funs)
## Joining, by = "src_sub_area"
scotty
## # A tibble: 6 × 5
## # Groups: src_sub_area [3]
## src_sub_area test train data_fun_name data_fun
## <chr> <list> <list> <chr> <list>
## 1 sxk97 <tibble [168 × 2]> <tibble [1,344 × 2]> ts <fn>
## 2 sxk97 <tibble [168 × 2]> <tibble [1,344 × 2]> msts <fn>
## 3 sxk9e <tibble [168 × 2]> <tibble [1,344 × 2]> ts <fn>
## 4 sxk9e <tibble [168 × 2]> <tibble [1,344 × 2]> msts <fn>
## 5 sxk9s <tibble [168 × 2]> <tibble [1,344 × 2]> ts <fn>
## 6 sxk9s <tibble [168 × 2]> <tibble [1,344 × 2]> msts <fn>
Time series model list
We will use five different models:
- Exponential smoothing state space model (ets)
- Autoregressive integrated moving average (auto.arima)
- Seasonal and trend decomposition using Loess (stlm)
- Holt-Winters exponential smoothing (HoltWinters)
- Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components (tbats)
# models list
models <- list(
auto.arima = function(x) auto.arima(x),
ets = function(x) ets(x),
holt.winter = function(x) HoltWinters(x, seasonal = "additive"),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x, use.box.cox = FALSE)
)
models
## $auto.arima
## function(x) auto.arima(x)
##
## $ets
## function(x) ets(x)
##
## $holt.winter
## function(x) HoltWinters(x, seasonal = "additive")
##
## $stlm
## function(x) stlm(x)
##
## $tbats
## function(x) tbats(x, use.box.cox = FALSE)
# combine into nested format: our previous nested model
models <- models %>%
rep(length(unique(scotty$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(scotty$src_sub_area), length(unique(.$model_name))))
)
models
## # A tibble: 15 × 3
## model_name model src_sub_area
## <chr> <list> <chr>
## 1 auto.arima <fn> sxk97
## 2 ets <fn> sxk97
## 3 holt.winter <fn> sxk97
## 4 stlm <fn> sxk97
## 5 tbats <fn> sxk97
## 6 auto.arima <fn> sxk9e
## 7 ets <fn> sxk9e
## 8 holt.winter <fn> sxk9e
## 9 stlm <fn> sxk9e
## 10 tbats <fn> sxk9e
## 11 auto.arima <fn> sxk9s
## 12 ets <fn> sxk9s
## 13 holt.winter <fn> sxk9s
## 14 stlm <fn> sxk9s
## 15 tbats <fn> sxk9s
The next step is to merge the models list into the nested scotty data. Because the ets and auto.arima models cannot handle multi-seasonal time series, we will not pair those two models with the msts data representation:
# drop model that we don't need
scotty <- scotty %>%
left_join(models) %>% # join by "src_sub_area"
filter(
!(model_name == "ets" & data_fun_name == "msts"),
!(model_name == "auto.arima" & data_fun_name == "msts")
)
## Joining, by = "src_sub_area"
scotty
## # A tibble: 24 × 7
## # Groups: src_sub_area [3]
## src_sub_area test train data_fun_name data_fun model_name model
## <chr> <list> <list> <chr> <list> <chr> <list>
## 1 sxk97 <tibble> <tibble> ts <fn> auto.arima <fn>
## 2 sxk97 <tibble> <tibble> ts <fn> ets <fn>
## 3 sxk97 <tibble> <tibble> ts <fn> holt.winter <fn>
## 4 sxk97 <tibble> <tibble> ts <fn> stlm <fn>
## 5 sxk97 <tibble> <tibble> ts <fn> tbats <fn>
## 6 sxk97 <tibble> <tibble> msts <fn> holt.winter <fn>
## 7 sxk97 <tibble> <tibble> msts <fn> stlm <fn>
## 8 sxk97 <tibble> <tibble> msts <fn> tbats <fn>
## 9 sxk9e <tibble> <tibble> ts <fn> auto.arima <fn>
## 10 sxk9e <tibble> <tibble> ts <fn> ets <fn>
## # … with 14 more rows
Create the model
Run the data functions to build the time series objects, fit each model, and store the results in the fitted column.
Since this step is time consuming, we save the fitted models to "scotty_model.RDS" and reuse them in the next steps.
# scotty <- scotty %>%
#   mutate(
#     params = map(train, ~ list(x = .x)),  # wrap the train data as function arguments
#     data = invoke_map(data_fun, params),  # build the ts/msts objects
#     params = map(data, ~ list(x = .x)),   # wrap the ts objects as function arguments
#     fitted = invoke_map(model, params)    # fit each model
#   ) %>%
#   select(-data, -params)
#
#
# saveRDS(scotty, "scotty_model.RDS")
Calculate test and train errors
# read model
scotty <- readRDS("scotty_model.RDS")
# calculate errors
scotty_model_error <- scotty %>%
mutate(MAE_error_test = map(fitted, ~ forecast(.x, h = test_size)) %>%
map2_dbl(test, ~ mae_vec(truth = rec_revert(.y$demand,rec,src_sub_area),
estimate = rec_revert(.x$mean,rec,src_sub_area)))) %>%
arrange(src_sub_area, MAE_error_test) %>%
mutate(MAE_error_train = map(fitted, ~ forecast(.x, h = test_size)) %>%
map2_dbl(train, ~ mae_vec(truth = rec_revert(.y$demand,rec,src_sub_area),
estimate = rec_revert(.x$fitted,rec,src_sub_area)))) %>%
arrange(src_sub_area, MAE_error_train)
scotty_model_error
## # A tibble: 24 × 10
## # Groups: src_sub_area [3]
## src_sub_area test train data_fun_name data_fun model_name model
## <chr> <list> <list> <chr> <list> <chr> <list>
## 1 sxk97 <tibble> <tibble> msts <fn> stlm <fn>
## 2 sxk97 <tibble> <tibble> ts <fn> stlm <fn>
## 3 sxk97 <tibble> <tibble> msts <fn> tbats <fn>
## 4 sxk97 <tibble> <tibble> ts <fn> tbats <fn>
## 5 sxk97 <tibble> <tibble> ts <fn> ets <fn>
## 6 sxk97 <tibble> <tibble> msts <fn> holt.winter <fn>
## 7 sxk97 <tibble> <tibble> ts <fn> holt.winter <fn>
## 8 sxk97 <tibble> <tibble> ts <fn> auto.arima <fn>
## 9 sxk9e <tibble> <tibble> msts <fn> stlm <fn>
## 10 sxk9e <tibble> <tibble> ts <fn> stlm <fn>
## # … with 14 more rows, and 3 more variables: fitted <list>,
## # MAE_error_test <dbl>, MAE_error_train <dbl>
Choose the best model: the model with the smallest MAE_error_test
scotty_best_model <- scotty_model_error %>%
group_by(src_sub_area) %>%
arrange(MAE_error_test) %>%
slice(1) %>% #choose the smallest error & best model for each area directly
ungroup() %>%
select(src_sub_area, ends_with("_name"),MAE_error_test)
scotty_best_model
## # A tibble: 3 × 4
## src_sub_area data_fun_name model_name MAE_error_test
## <chr> <chr> <chr> <dbl>
## 1 sxk97 msts tbats 7.60
## 2 sxk9e msts tbats 9.26
## 3 sxk9s msts tbats 7
scotty_best_model <- scotty_best_model %>%
select(-MAE_error_test) %>%
left_join(scotty_model_error)%>%
select(-MAE_error_test,-MAE_error_train)
## Joining, by = c("src_sub_area", "data_fun_name", "model_name")
scotty_best_model
## # A tibble: 3 × 8
## src_sub_area data_fun_name model_name test train data_fun model fitted
## <chr> <chr> <chr> <list> <list> <list> <lis> <list>
## 1 sxk97 msts tbats <tibble> <tibble> <fn> <fn> <tbats>
## 2 sxk9e msts tbats <tibble> <tibble> <fn> <fn> <tbats>
## 3 sxk9s msts tbats <tibble> <tibble> <fn> <fn> <tbats>
Forecasting
Forecasting is carried out on the test data. We will forecast the next one week, or 168 hours, for every fitted model so we can compare them with the actual values.
Unnesting the result
We need to make a tbl containing our forecasts for the test data, then do some spread-gather tricks to make a set of keys that is unique for each data representation and model, plus one key for the actual data. Once in that format, we can unnest() the data into a proper long format. Let's start by creating the forecasts first:
scotty_test <- scotty_model_error %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
scotty_test
## # A tibble: 24 × 12
## # Groups: src_sub_area [3]
## src_sub_area test train data_fun_name data_fun model_name model
## <chr> <list> <list> <chr> <list> <chr> <list>
## 1 sxk97 <tibble> <tibble> msts <fn> stlm <fn>
## 2 sxk97 <tibble> <tibble> ts <fn> stlm <fn>
## 3 sxk97 <tibble> <tibble> msts <fn> tbats <fn>
## 4 sxk97 <tibble> <tibble> ts <fn> tbats <fn>
## 5 sxk97 <tibble> <tibble> ts <fn> ets <fn>
## 6 sxk97 <tibble> <tibble> msts <fn> holt.winter <fn>
## 7 sxk97 <tibble> <tibble> ts <fn> holt.winter <fn>
## 8 sxk97 <tibble> <tibble> ts <fn> auto.arima <fn>
## 9 sxk9e <tibble> <tibble> msts <fn> stlm <fn>
## 10 sxk9e <tibble> <tibble> ts <fn> stlm <fn>
## # … with 14 more rows, and 5 more variables: fitted <list>,
## # MAE_error_test <dbl>, MAE_error_train <dbl>, forecast <list>, key <chr>
# Create a proper key
scotty_test_key <- scotty_test %>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area)
scotty_test_key
## # A tibble: 27 × 3
## # Groups: src_sub_area [3]
## src_sub_area key value
## <chr> <chr> <list>
## 1 sxk97 actual <tibble [168 × 2]>
## 2 sxk9e actual <tibble [168 × 2]>
## 3 sxk9s actual <tibble [168 × 2]>
## 4 sxk97 msts-holt.winter <tibble [168 × 2]>
## 5 sxk9e msts-holt.winter <tibble [168 × 2]>
## 6 sxk9s msts-holt.winter <tibble [168 × 2]>
## 7 sxk97 msts-stlm <tibble [168 × 2]>
## 8 sxk9e msts-stlm <tibble [168 × 2]>
## 9 sxk9s msts-stlm <tibble [168 × 2]>
## 10 sxk97 msts-tbats <tibble [168 × 2]>
## # … with 17 more rows
Because our forecast values are still on the transformed scale from the recipe, we convert them back to the original scale using the rec_revert function:
# unnest the result
scotty_test_unnest <- scotty_test_key %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
scotty_test_unnest
## # A tibble: 4,536 × 4
## # Groups: src_sub_area [3]
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-11-26 00:00:00 38
## 2 sxk97 actual 2017-11-26 01:00:00 21
## 3 sxk97 actual 2017-11-26 02:00:00 10
## 4 sxk97 actual 2017-11-26 03:00:00 10
## 5 sxk97 actual 2017-11-26 04:00:00 5
## 6 sxk97 actual 2017-11-26 05:00:00 2
## 7 sxk97 actual 2017-11-26 06:00:00 0
## 8 sxk97 actual 2017-11-26 07:00:00 11
## 9 sxk97 actual 2017-11-26 08:00:00 4
## 10 sxk97 actual 2017-11-26 09:00:00 4
## # … with 4,526 more rows
Plot the result
With this result, we can compare the forecasts and the actual test data through an interactive plot:
# interactive plotting forecast and original data test
fplot <- scotty_test_unnest %>%
ggplot(aes(x = datetime,
y = demand,
colour = key)) +
geom_line(data = scotty_test_unnest %>%
filter(key == "actual"),
aes(y = demand),
alpha = 0.3,size = 0.8)+
geom_line(data = scotty_test_unnest %>%
filter(key != "actual"),
aes(frame = key,
col = key)) +
labs(x = "", y = "Demand",
title = "Model Prediction Comparison", frame = "Model") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
tidyquant::theme_tq() +
theme(legend.position = "none",axis.title.y = element_text(size = 12), axis.text.y = element_text(size = 7))
ggplotly(fplot)
Automated model selection
We can automatically select the model with the smallest error by using filter() from dplyr:
scotty_min_error <- scotty_model_error %>%
select(-fitted) %>% # remove unused
group_by(src_sub_area) %>%
filter(MAE_error_test == min(MAE_error_test)) %>%
ungroup()
scotty_min_error
## # A tibble: 3 × 9
## src_sub_area test train data_fun_name data_fun model_name model
## <chr> <list> <list> <chr> <list> <chr> <list>
## 1 sxk97 <tibble> <tibble> msts <fn> tbats <fn>
## 2 sxk9e <tibble> <tibble> msts <fn> tbats <fn>
## 3 sxk9s <tibble> <tibble> msts <fn> tbats <fn>
## # … with 2 more variables: MAE_error_test <dbl>, MAE_error_train <dbl>
From the EDA we already know that our dataset has multiple seasonality, and the automatic model selection above confirms it: for every sub-area the model with the smallest MAE uses the multi-seasonal (msts) representation, and tbats is the model with the lowest MAE.
Performing final forecast
Here we repeat the same process as in model fitting, but this time we use the combined train and test data as our "full data".
Combining train and test data
scotty_combine <- scotty_min_error %>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
scotty_combine
## # A tibble: 3 × 8
## src_sub_area fulldata data_fun_name data_fun model_name model MAE_error_test
## <chr> <list> <chr> <list> <chr> <list> <dbl>
## 1 sxk97 <tibble> msts <fn> tbats <fn> 7.60
## 2 sxk9e <tibble> msts <fn> tbats <fn> 9.26
## 3 sxk9s <tibble> msts <fn> tbats <fn> 7
## # … with 1 more variable: MAE_error_train <dbl>
Create the model
# scotty_combine_nest <- scotty_combine %>%
# mutate(
# params = map(fulldata, ~ list(x = .x)),
# data = invoke_map(data_fun, params),
# params = map(data, ~ list(x = .x)),
# fitted = invoke_map(model, params)
# ) %>%
# select(-data, -params)
#
# saveRDS(scotty_combine_nest, "scottycb1_nest.RDS")
Model result
scotty_combine_nest <- readRDS("scottycb1_nest.RDS")
scotty_combine_nest
## # A tibble: 3 × 9
## src_sub_area fulldata data_fun_name data_fun model_name model MAE_error_test
## <chr> <list> <chr> <list> <chr> <list> <dbl>
## 1 sxk97 <tibble> msts <fn> tbats <fn> 7.60
## 2 sxk9e <tibble> msts <fn> tbats <fn> 9.26
## 3 sxk9s <tibble> msts <fn> tbats <fn> 7
## # … with 2 more variables: MAE_error_train <dbl>, fitted <list>
Forecasting
Then, let’s make a tbl containing the forecast results, and convert it to a long format
scotty_combine_forecast <- scotty_combine_nest %>%
select(-MAE_error_train) %>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7 * 4)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7 * 4),
demand = as.vector(.x$mean)
))
)
scotty_combine_forecast
## # A tibble: 3 × 9
## src_sub_area fulldata data_fun_name data_fun model_name model MAE_error_test
## <chr> <list> <chr> <list> <chr> <list> <dbl>
## 1 sxk97 <tibble> msts <fn> tbats <fn> 7.60
## 2 sxk9e <tibble> msts <fn> tbats <fn> 9.26
## 3 sxk9s <tibble> msts <fn> tbats <fn> 7
## # … with 2 more variables: fitted <list>, forecast <list>
Unnesting the forecast result
scotty_combine_forecast_unnest <- scotty_combine_forecast %>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
scotty_combine_forecast_unnest
## # A tibble: 6,552 × 4
## src_sub_area key datetime demand
## <chr> <chr> <dttm> <dbl>
## 1 sxk97 actual 2017-10-01 00:00:00 6
## 2 sxk97 actual 2017-10-01 01:00:00 4
## 3 sxk97 actual 2017-10-01 02:00:00 9
## 4 sxk97 actual 2017-10-01 03:00:00 2
## 5 sxk97 actual 2017-10-01 04:00:00 5
## 6 sxk97 actual 2017-10-01 05:00:00 4
## 7 sxk97 actual 2017-10-01 06:00:00 1
## 8 sxk97 actual 2017-10-01 07:00:00 0
## 9 sxk97 actual 2017-10-01 08:00:00 3
## 10 sxk97 actual 2017-10-01 09:00:00 3
## # … with 6,542 more rows
Plot forecast
fplot2 <- scotty_combine_forecast_unnest %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_light() +
scale_colour_tq()
ggplotly(fplot2)
Final data test result
data_test <- scotty_combine_forecast_unnest %>%
filter(datetime %within% intest) %>%
select(src_sub_area, datetime, demand)
data_test
## # A tibble: 504 × 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-11-26 00:00:00 38
## 2 sxk97 2017-11-26 01:00:00 21
## 3 sxk97 2017-11-26 02:00:00 10
## 4 sxk97 2017-11-26 03:00:00 10
## 5 sxk97 2017-11-26 04:00:00 5
## 6 sxk97 2017-11-26 05:00:00 2
## 7 sxk97 2017-11-26 06:00:00 0
## 8 sxk97 2017-11-26 07:00:00 11
## 9 sxk97 2017-11-26 08:00:00 4
## 10 sxk97 2017-11-26 09:00:00 4
## # … with 494 more rows
MAE Result
mae_test_result <- scotty_model_error %>%
group_by(src_sub_area) %>%
select(src_sub_area, MAE_error_test) %>%
arrange(MAE_error_test) %>%
slice(1) %>%
ungroup()
# Reached MAE < 10 for all sub-area in test dataset
mae_test_result %<>%
summarise(src_sub_area = "all sub-area", # also count MAE for all sub-area
MAE_error_test = mean(MAE_error_test)) %>%
bind_rows(mae_test_result, .)
mae_test_result
## # A tibble: 4 × 2
## src_sub_area MAE_error_test
## <chr> <dbl>
## 1 sxk97 7.60
## 2 sxk9e 9.26
## 3 sxk9s 7
## 4 all sub-area 7.95
Data submission
Read the submission data that was provided as the challenge in this project:
submission <- read_csv("Capstone ML Data/5. scotty-ts/data/data-test.csv", show_col_types = F) %>%
mutate(datetime = ymd_hms(datetime))
submission
## # A tibble: 504 × 3
## src_sub_area datetime demand
## <chr> <dttm> <lgl>
## 1 sxk97 2017-12-03 00:00:00 NA
## 2 sxk97 2017-12-03 01:00:00 NA
## 3 sxk97 2017-12-03 02:00:00 NA
## 4 sxk97 2017-12-03 03:00:00 NA
## 5 sxk97 2017-12-03 04:00:00 NA
## 6 sxk97 2017-12-03 05:00:00 NA
## 7 sxk97 2017-12-03 06:00:00 NA
## 8 sxk97 2017-12-03 07:00:00 NA
## 9 sxk97 2017-12-03 08:00:00 NA
## 10 sxk97 2017-12-03 09:00:00 NA
## # … with 494 more rows
# Range data check
range(submission$datetime)
## [1] "2017-12-03 00:00:00 UTC" "2017-12-09 23:00:00 UTC"
Our submission data covers a datetime range from 2017-12-03 00:00:00 UTC to 2017-12-09 23:00:00 UTC, which equals 1 week or 168 hours.
Nest data submission
submit_nest <- submission %>% nest(datetime)
submit_nest
## # A tibble: 3 × 3
## src_sub_area demand data
## <chr> <lgl> <list>
## 1 sxk97 NA <tibble [168 × 1]>
## 2 sxk9e NA <tibble [168 × 1]>
## 3 sxk9s NA <tibble [168 × 1]>
Forecast data submission with the best model
submission <- scotty_best_model %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(submit_nest$data, ~ tibble( #map into our data submission nest
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
submission
## # A tibble: 3 × 10
## src_sub_area data_fun_name model_name test train data_fun model fitted
## <chr> <chr> <chr> <list> <list> <list> <lis> <list>
## 1 sxk97 msts tbats <tibble> <tibble> <fn> <fn> <tbats>
## 2 sxk9e msts tbats <tibble> <tibble> <fn> <fn> <tbats>
## 3 sxk9s msts tbats <tibble> <tibble> <fn> <fn> <tbats>
## # … with 2 more variables: forecast <list>, key <chr>
Look at the result
submission <- submission %>%
select(src_sub_area, forecast) %>%
unnest(forecast) %>%
mutate(demand = rec_revert(demand, rec, src_sub_area))
submission
## # A tibble: 504 × 3
## src_sub_area datetime demand
## <chr> <dttm> <dbl>
## 1 sxk97 2017-12-03 00:00:00 18
## 2 sxk97 2017-12-03 01:00:00 15
## 3 sxk97 2017-12-03 02:00:00 11
## 4 sxk97 2017-12-03 03:00:00 7
## 5 sxk97 2017-12-03 04:00:00 5
## 6 sxk97 2017-12-03 05:00:00 3
## 7 sxk97 2017-12-03 06:00:00 2
## 8 sxk97 2017-12-03 07:00:00 6
## 9 sxk97 2017-12-03 08:00:00 12
## 10 sxk97 2017-12-03 09:00:00 12
## # … with 494 more rows
Write CSV for data submission
write_csv(submission, "submit_scotty.csv")
Plot submission forecast
subforcast <- submission %>%
ggplot(aes(x = datetime, y = demand)) +
geom_line() +
labs(x = NULL, y = NULL, colour = NULL) +
facet_wrap(~ src_sub_area, scale = "free", ncol = 1) +
theme_linedraw() +
scale_colour_tq()
ggplotly(subforcast)
Assumption Checking
Time series assumptions are tested to check whether the residuals obtained from the modeling capture the information in the data well enough. Why use the residuals? Because the residuals reflect both the actual data and the model's predictions. A good forecasting method produces residuals with the following properties (see the sketch after this list):
- Uncorrelated residuals. If the residuals are correlated, there is still information left that should have been used for forecasting.
- Residuals with a mean of 0, ideally normally distributed.
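Both properties can also be examined in one call with checkresiduals() from the forecast package, which plots the residuals, their ACF, and a histogram, and runs a Ljung-Box test. A minimal sketch, assuming the fitted models stored in scotty_best_model are available:
# one-call residual diagnostics for the best sxk97 model (sketch only)
checkresiduals(scotty_best_model$fitted[[1]])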
No-autocorrelation
To check for autocorrelation in the forecast residuals, we can use an ACF plot of the residuals or a Ljung-Box test.
\(H_0\): the residuals have no autocorrelation
\(H_1\): the residuals have autocorrelation
sxk97
# Using acf residual plot visualization with function `acf(residual model)`
acf(residuals(scotty_best_model$fitted[[1]]))
# perform a Ljung-box test using the function `Box.test`
Box.test(residuals(scotty_best_model$fitted[[1]]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(scotty_best_model$fitted[[1]])
## X-squared = 0.0041596, df = 1, p-value = 0.9486
sxk9e
acf(residuals(scotty_best_model$fitted[[2]]))
Box.test(residuals(scotty_best_model$fitted[[2]]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(scotty_best_model$fitted[[2]])
## X-squared = 0.0012231, df = 1, p-value = 0.9721
sxk9s
acf(residuals(scotty_best_model$fitted[[3]]))
Box.test(residuals(scotty_best_model$fitted[[3]]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(scotty_best_model$fitted[[3]])
## X-squared = 0.0010346, df = 1, p-value = 0.9743
Based on the ACF plots, very few lags cross the significance threshold, and the p-values of the Ljung-Box tests are above alpha (0.05) for all sub-areas, so we fail to reject \(H_0\).
It can be concluded that the residuals in all sub-areas have no autocorrelation.
Normality of residuals
To check the normality of the residuals, we can perform a Shapiro-Wilk normality test using the shapiro.test(residuals(model)) function, or simply plot a histogram to see the distribution of the residuals.
\(H_0\): the residuals are normally distributed
\(H_1\): the residuals are not normally distributed
sxk97
# plot histogram
hist(residuals(scotty_best_model$fitted[[1]]), breaks = 30,
col = "chocolate", border = "red", main = "sxk97", xlab = NULL)
shapiro.test(residuals(scotty_best_model$fitted[[1]]))
##
## Shapiro-Wilk normality test
##
## data: residuals(scotty_best_model$fitted[[1]])
## W = 0.9851, p-value = 1.6e-10
sxk9e
hist(residuals(scotty_best_model$fitted[[2]]), breaks = 30,
col = "chocolate", border = "red", main = "sxk9e", xlab = NULL)
shapiro.test(residuals(scotty_best_model$fitted[[2]]))
##
## Shapiro-Wilk normality test
##
## data: residuals(scotty_best_model$fitted[[2]])
## W = 0.99493, p-value = 0.0001655
sxk9s
hist(residuals(scotty_best_model$fitted[[3]]), breaks = 30,
col = "chocolate", border = "red", main = "sxk9s", xlab = NULL)
shapiro.test(residuals(scotty_best_model$fitted[[3]]))
##
## Shapiro-Wilk normality test
##
## data: residuals(scotty_best_model$fitted[[3]])
## W = 0.99045, p-value = 1.144e-07
Based on the histograms, the residuals are centered around 0, with the highest frequency near 0. However, the p-values of the Shapiro-Wilk tests for all sub-areas are smaller than alpha (0.05), so \(H_0\) is rejected.
Statistically, the residuals therefore deviate from a normal distribution in every sub-area, even though their distribution is roughly symmetric around 0.
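An additional visual check that supports this reading (a sketch, reusing the fitted model for sxk97 as an example):
# normal Q-Q plot of the sxk97 residuals (sketch)
qqnorm(residuals(scotty_best_model$fitted[[1]]), main = "sxk97 residuals")
qqline(residuals(scotty_best_model$fitted[[1]]), col = "red")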
Rubrics-Scotty: “Bring me the crystal ball!”
Data Preprocess
Demonstrate and explain how to properly do data aggregation.
- Should you floor the date to a specific time level (minutes, hours, or days)?
I floored the timestamps down to the hour. Because the dataset records minutes and seconds, rounding to the hour makes the time series analysis easier and keeps the time values from being too varied. Flooring (rounding down) was chosen because, even if a request is recorded at 05:59, it should still be counted in the 05:00 hour rather than 06:00.
- How do we group the data for aggregation/summarise?
The data is grouped by selecting the columns we want to group by and then computing summary statistics, using the group_by() function. In this case we group by sub-area and by the datetime column that has already been floored to the date-hour level.
Demonstrate how to properly do time series padding.
- Should you do time series padding?
Yes, we must do time series padding, because the data does not necessarily have a complete set of time intervals, and the quantity to forecast may not be present in every interval.
- Do you need to round the datetime into hour or minutes?
I rounded the time to the hour, because keeping the minutes would introduce far too much variation when looking at daily or weekly patterns; rounding to the hour is therefore much more appropriate.
- When is the start and the end of the time interval for time series padding?
Start time = 2017-10-01 00:00:00 UTC
End time = 2017-12-02 23:00:00 UTC
Cross-Validation Scheme
Demonstrate and explain how to prepare cross-validation data for automated model selection.
- How to cross-validate data for time series?
Several components must be prepared before doing cross-validation: the test and train sizes, the start and end time indices for the train and test data, and their intervals. In this case, since we are asked to forecast one week at an hourly granularity, the test size is 24 * 7 and the train data is the remainder. We then determine the datetime ranges where the train and test data start and end and, finally, derive the train and test intervals from those ranges.
- Do you need to group the data by the source area?
Of course we need to group the data by sub-area, so that each sub-area has its own train and test data, because the forecasting is later done for each specified sub-area.
- Do you need to make nested dataframe?
A nested data frame is, simply put, a table inside a table. We need a nested data frame so that each sub-area can store its own train and test data.
- How many observations will you use as the testing dataset?
The testing dataset uses 168 observations, because the objective is to forecast one week at an hourly granularity, i.e. 24 * 7 hours.
(2 Points) Demonstrate and explain how to prepare cross-validation data for “best” model evaluation.
- Do you need to further split the data train into training set and validation set?
In this case two datasets are given: a train dataset that contains demand values, and a test dataset without demand values (the target to forecast), which in this markdown is named submission. For modeling, the train data must first be split into two parts: a training set and a validation set, which in the markdown above I referred to as the test data. The validation set is used to evaluate the models trained on the training set, so that once the best model is found, it can be used to forecast the test data that was given earlier.
- How do you split them? Should you use rolling origin method?
Yes, that method is used. To help compare several models, the train data must be split into train and validation data. The strategy is to take the last one week as the validation data and the larger remaining portion as the train data. This is a simplified version of the rolling origin method, in which only one pair of train and validation data is used (a fuller scheme is sketched below).
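For reference, a fuller rolling-origin scheme could be built with the rsample package; rsample is not loaded in this report, so the following is only a sketch under that assumption:
library(rsample)
# several rolling train/validation pairs for one sub-area (illustrative setup)
roll_splits <- rolling_origin(
  sxk97_use,
  initial = 24 * 7 * 6, # about six weeks of training data
  assess = 24 * 7, # one week of validation data
  cumulative = FALSE, # fixed-size window; TRUE would expand the window instead
  skip = 24 * 7 - 1 # move the origin forward one week at a time
)
# analysis(roll_splits$splits[[1]]) returns a training set,
# assessment(roll_splits$splits[[1]]) the matching validation week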
- How much of the data will be used as the validation set?
The validation set uses 168 observations, because we want to predict demand for one week at an hourly granularity.
- Do we forecast with windowed data or expanding data?
We forecast with windowed data, because the size of the data to forecast was fixed from the start (at the cross-validation stage) at 24 hours * 7 days and does not grow.
Automated Model Selection
Compare multiple preprocess specifications.
- Will different preprocessing give different results?
Of course different preprocessing methods will give different results, because in the preprocessing stage we manipulate the data, for example by scaling it, taking its square root, or even taking its logarithm. Those transformed values go into the models, and each preprocessing method will certainly produce different results.
- How many kinds of preprocessing specifications will you prepare?
In this case I applied three preprocessing steps:
- square-root transformation using step_sqrt
- centering by subtracting the mean using step_center
- scaling using step_scale
- Will you choose 2 different specifications: a log transformation and a square-root transformation? Will you create another preprocessing approach?
For preprocessing I used one approach: normalising the data by taking its square root, then subtracting its mean (centering), and finally scaling it. A log-transformation recipe could be prepared in the same way for comparison (see the sketch below).
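A hypothetical alternative specification using a log transformation, which would be prepared at the same stage as rec in the Data preprocessing section (sketch only, not used for the final model):
rec_log <- recipe(~ ., filter(scotty, datetime %within% intrain)) %>%
  step_log(all_numeric(), offset = 1) %>% # offset of 1 because demand can be 0
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()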
Compare multiple seasonality specifications.
- How many seasonality specifications will you create?
In the modeling I created two seasonality specifications: single seasonality (daily) and multiple seasonality (daily and weekly).
- Will you create a model with daily seasonality only?
I did not only build models with single seasonality; I also used multiple seasonality (daily and weekly). During EDA I saw that the data appears to have more than one seasonality, so the modeling also includes the multi-seasonal specification.
- Will you create multiple seasonality (daily and weekly)?
Yes, I created daily and weekly multiple seasonality, because during EDA the decomposition using the multi-seasonal object produced a smoother trend than using a single daily or weekly seasonality alone.
Compare multiple model specifications.
- How many forecasting models will you use?
I used five forecasting models:
- Exponential smoothing state space model (ets)
- Autoregressive integrated moving average (auto.arima)
- Seasonal and trend decomposition using Loess (stlm)
- Holt-Winters exponential smoothing (HoltWinters)
- Trigonometric seasonality, Box-Cox transformation, ARMA errors, trend and seasonal components (tbats)
- Will you use exponential smoothing? Will you use ARIMA?
I used exponential smoothing models such as ets and Holt-Winters.
I also used an ARIMA model, or more precisely auto-SARIMA, because the data is seasonal and auto.arima automatically searches for the (p,d,q)(P,D,Q) combination.
Automate best specifications selection.
- Since we use multiple preprocessing, seasonality, and model specifications, can you make an automated script to summarise the results?
The answer to this question is already provided in the Modelling section, more precisely in the part where the test and train errors are calculated, so the output of that script serves as the summarised result.
- How do you measure the model performance?
The best model is selected using the Mean Absolute Error (MAE) as the reference metric. MAE is the average of the absolute errors and can be interpreted as how far, on average, the predictions deviate from the actual values. The lower the MAE, the better the model's forecasts (see the formula below).
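For reference, with \(y_t\) the actual demand, \(\hat{y}_t\) the forecast, and \(n\) the number of test hours:
\[
\text{MAE} = \frac{1}{n} \sum_{t=1}^{n} \lvert y_t - \hat{y}_t \rvert
\]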
- Which model and specification has the best performance?
The selected model is tbats with the specification:
- TBATS(1, {2,2}, -, {<24,7>, <168,6>})
Interpretation: in TBATS(omega, {p,q}, phi, <m1,k1>,…,<mJ,kJ>), omega is the Box-Cox parameter, phi is the damping parameter, the errors are modelled as an ARMA(p,q) process, m1,…,mJ list the seasonal periods used in the model, and k1,…,kJ are the corresponding numbers of Fourier terms used for each seasonality. Therefore:
- omega = 1 means no Box-Cox transformation is applied
- {p,q} = {2,2} means the errors follow an ARMA process with 2 autoregressive terms and 2 moving-average terms
- there are 2 different seasonal periods, of length 24 (daily) and 168 (weekly)
- the tbats model uses 7 Fourier terms for the daily seasonality and 6 Fourier terms for the weekly seasonality
- Parameters
- Alpha: 0.02163328
- Gamma-1 Values: -0.0007510731 -0.001416854
- Gamma-2 Values: 0.001190823 0.0002403854
- AR coefficients: -0.245483 0.528058
- MA coefficients: 0.595991 -0.241668
- Sigma: 0.4635748
- AIC: 7694.876
- Note: the parameter values above are only representative of sub-area sxk97, which is used here as an example. The model parameters for the other sub-areas can be accessed in the same way.
Prediction Performance
Reached MAE < 12 for sub-area sxk97 in (your own) evaluation dataset.
Reached MAE < 11 for sub-area sxk9e in (your own) evaluation dataset.
Reached MAE < 10 for sub-area sxk9s in (your own) evaluation dataset.
Reached MAE < 11 for all sub-area in (your own) evaluation dataset.
Reached MAE < 12 for sub-area sxk97 in test dataset.
Reached MAE < 11 for sub-area sxk9e in test dataset.
Reached MAE < 10 for sub-area sxk9s in test dataset.
Reached MAE < 11 for all sub-area in test dataset.
Conclusion
In this case, forecasting with multiple seasonality produces better predictions.
For all three sub-areas, forecasting with the TBATS model gives the best predictions.
The model residuals meet the no-autocorrelation assumption; however, based on the Shapiro-Wilk test, the normality assumption is not strictly met, although the residuals are centered around 0.
Beyond the Scotty case, this automated multiple time series forecasting approach can also be applied to sales, stock markets, or even natural phenomena such as rainfall and sea-level rise.
Assumption Checking
- Does the model meet the autocorrelation assumption? What about the normality of residuals?
The model meets the no-autocorrelation assumption, while the normality assumption is only partly met.
The Ljung-Box tests give p-values greater than alpha (0.05) in every sub-area, which means the forecast residuals are not autocorrelated.
The Shapiro-Wilk tests give p-values smaller than alpha (0.05) in every sub-area, which means that, although the residuals are centered around 0, they are not strictly normally distributed.
Both findings are also supported by the plots shown above.
- If the assumptions are not met, what is the cause? How do you handle that? Based on seasonality, when is the highest demand?
Time series assumptions are tested to check whether the residuals obtained from the modeling describe and capture the information in the data well enough. Why use the residuals? Because the residuals carry information from both the actual data and the model's predictions.
So if an assumption is not met, it is because the model cannot capture some information or pattern in the data. This can be handled by tuning the model, for example: 1. choosing the right type of model for the characteristics of the data (additive or multiplicative); 2. choosing the right seasonality; 3. differencing; 4. applying a log() transformation; 5. choosing the right model configuration (p,d,q and P,D,Q).
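A minimal sketch of some of these adjustments, reusing the area demand series from the Decomposition section (illustrative only, not part of the final pipeline):
demand_msts <- msts(area$demand, seasonal.periods = c(24, 24 * 7))
demand_log <- log1p(demand_msts) # log(x + 1) transformation, since demand can be 0
demand_diff <- diff(demand_log, lag = 24) # daily seasonal differencing
fit_boxcox <- tbats(demand_msts, use.box.cox = TRUE) # let TBATS search for a Box-Cox parameter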
Daily -> the highest demand is at 6 pm
Weekly -> the highest demand is on Friday