The goal of this analysis is to explore how classical machine learning models can be used for forecasting time series with double periodicity. Also to explore how tidymodels framework works for such analysis.
Below is the list of main packages that I have used:
library(tidyverse)
library(lubridate)
library(timetk)
library(tidymodels)
library(modeltime)
options(tibble.print_min = 20)
Data set can be found here. Let’s load the data set directly from its web location:
filepath = "https://archive.ics.uci.edu/ml/machine-learning-databases/00482/dataset.zip"
temp <- tempfile(tmpdir = here::here())
download.file(filepath,temp)
dfs <- read_csv(file = unz(temp, "dataset.csv"))
## Rows: 35717 Columns: 4
## -- Column specification ----------------------------------------------------------------------------
## Delimiter: ","
## chr (1): SystemCodeNumber
## dbl (2): Capacity, Occupancy
## dttm (1): LastUpdated
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
unlink(temp)
I added here::here() to download temporary file in my project directory, but its not necessary. Let’s inspect the data:
dfs
## # A tibble: 35,717 x 4
## SystemCodeNumber Capacity Occupancy LastUpdated
## <chr> <dbl> <dbl> <dttm>
## 1 BHMBCCMKT01 577 61 2016-10-04 07:59:42
## 2 BHMBCCMKT01 577 64 2016-10-04 08:25:42
## 3 BHMBCCMKT01 577 80 2016-10-04 08:59:42
## 4 BHMBCCMKT01 577 107 2016-10-04 09:32:46
## 5 BHMBCCMKT01 577 150 2016-10-04 09:59:48
## 6 BHMBCCMKT01 577 177 2016-10-04 10:26:49
## 7 BHMBCCMKT01 577 219 2016-10-04 10:59:48
## 8 BHMBCCMKT01 577 247 2016-10-04 11:25:47
## 9 BHMBCCMKT01 577 259 2016-10-04 11:59:44
## 10 BHMBCCMKT01 577 266 2016-10-04 12:29:45
## 11 BHMBCCMKT01 577 269 2016-10-04 13:02:48
## 12 BHMBCCMKT01 577 263 2016-10-04 13:29:45
## 13 BHMBCCMKT01 577 238 2016-10-04 14:02:47
## 14 BHMBCCMKT01 577 215 2016-10-04 14:29:49
## 15 BHMBCCMKT01 577 192 2016-10-04 14:57:13
## 16 BHMBCCMKT01 577 165 2016-10-04 15:30:14
## 17 BHMBCCMKT01 577 162 2016-10-04 16:04:12
## 18 BHMBCCMKT01 577 143 2016-10-04 16:31:14
## 19 BHMBCCMKT01 577 54 2016-10-05 07:57:17
## 20 BHMBCCMKT01 577 59 2016-10-05 08:30:15
## # ... with 35,697 more rows
summary(dfs)
## SystemCodeNumber Capacity Occupancy LastUpdated
## Length:35717 Min. : 220 Min. : -8.0 Min. :2016-10-04 07:46:28
## Class :character 1st Qu.: 500 1st Qu.: 210.0 1st Qu.:2016-10-24 16:33:03
## Mode :character Median : 849 Median : 446.0 Median :2016-11-11 14:00:21
## Mean :1398 Mean : 642.2 Mean :2016-11-11 18:11:47
## 3rd Qu.:2009 3rd Qu.: 798.0 3rd Qu.:2016-11-29 13:28:30
## Max. :4675 Max. :4327.0 Max. :2016-12-19 16:30:35
unique(dfs$SystemCodeNumber)
## [1] "BHMBCCMKT01" "BHMBCCPST01" "BHMBCCSNH01" "BHMBCCTHL01" "BHMBRCBRG01"
## [6] "BHMBRCBRG02" "BHMBRCBRG03" "BHMBRTARC01" "BHMEURBRD01" "BHMEURBRD02"
## [11] "BHMMBMMBX01" "BHMNCPHST01" "BHMNCPLDH01" "BHMNCPNHS01" "BHMNCPNST01"
## [16] "BHMNCPPLS01" "BHMNCPRAN01" "Broad Street" "Bull Ring" "NIA Car Parks"
## [21] "NIA North" "NIA South" "Others-CCCPS105a" "Others-CCCPS119a" "Others-CCCPS133"
## [26] "Others-CCCPS135a" "Others-CCCPS202" "Others-CCCPS8" "Others-CCCPS98" "Shopping"
Fields descriptions from Machine Learning Repository:
SystemCodeNumber: Car park ID
Capacity: Car park capacity
Occupancy: Car park occupancy rate
LastUpdated: Date and Time of the measure
I will be working with the SystemCodeNumber = “BHMBCCMKT01” , but overall all parking locations have pretty similar properties. So let’s explore those properties
dfs %>% filter(SystemCodeNumber == "BHMBCCMKT01") -> dfs
Handy plotting function plot_time_series() from timetk package will help with the investigation
plot_time_series(dfs, .date_var = LastUpdated, .value = Occupancy, .smooth = F, .interactive = T, .y_lab = "Occupancy",.title = "Car Park Occupancy")
First thing that is pretty evident on the graph, is that there are missing data points, namely between 2016-10-19 and 2016-10-22, also between 2016-12-02 and 2016-12-05.
Next, the structure periodicity - as per tibble preview of the data, we can see its date time data with approximate half hour intervals, data is also separated by days. The date time part starts at around 08:00 and finishes at around 16:30 - in total 18 observations per day, 7 days a week (excluding those blank spots for now).
Lets look how Car Park Occupancy is distributed per day and per half-hour
boxplot(formula = Occupancy ~ factor(weekdays(LastUpdated), levels = c("Monday","Tuesday","Wednesday","Thursday", "Friday", "Saturday", "Sunday"), labels = c("Monday", "Tuesday", "Wednesday","Thursday", "Friday", "Saturday", "Sunday"),ordered = T), data = dfs, xlab = NULL, main = "Occupancy distribution on each week day")
# Stripping date information from the LastUpdated date time field and then plotting it against target variable
plot(x = as.numeric(as.POSIXct(dfs$LastUpdated)) %% 86400 / 3600, dfs$Occupancy, xlab = "Hour", ylab = "Occupancy", main = "Occupancy distribution in half hour intervals")
Out of weekdays, Saturdays have clearly largest values, other days are pretty similar with Sundays being a bit lower and Tuesdays a bit higher than the remaining ones. The half hour plot shows parabolic share that is not exactly symmetric - the end of the day values for Occupancy are larger than the values at the start of the day.
Let’s take a look which lags are important in this time series. Based on lag importantce we can possibly construct additional predictor variables. I will use plot_acf_diagnostics() from timetk package:
plot_acf_diagnostics(.data = dfs,
.date_var = LastUpdated,
.value = Occupancy,
.show_white_noise_bars = T,
.x_intercept = c(18,126),
.white_noise_line_color = "green")
Besides the first few lags in ACF plot that show the time series points a highly correlated with their nearest neighbors, we see there are a few lags that correspond to periodicity of the time series, specifically, at lag 18(ACF) & 19(PACF) for daily periodicity and at lag 125 (ACF) & 126 (PACF) for the weekly periodicity.
So the plan to prepare the data is as following:
1. Round date time to nearest half hour to be able to spot missing entries/duplicates
2. Check if there are any duplicate values, missing single entries
3. Clean missing days/single entries by imputing values
Will use handy lubridate function round_date() for this, also let’s create Date variable out of our date time variable:
dfs %>% mutate(DateTime = round_date(x = LastUpdated,unit = "30 minutes"), Date = as.Date(LastUpdated)) -> dfs
Discarding fields that are either redundant or don’t hold any prediction value:
dfs %>% select(-LastUpdated, -Capacity, -SystemCodeNumber) -> dfs
dfs
## # A tibble: 1,312 x 3
## Occupancy DateTime Date
## <dbl> <dttm> <date>
## 1 61 2016-10-04 08:00:00 2016-10-04
## 2 64 2016-10-04 08:30:00 2016-10-04
## 3 80 2016-10-04 09:00:00 2016-10-04
## 4 107 2016-10-04 09:30:00 2016-10-04
## 5 150 2016-10-04 10:00:00 2016-10-04
## 6 177 2016-10-04 10:30:00 2016-10-04
## 7 219 2016-10-04 11:00:00 2016-10-04
## 8 247 2016-10-04 11:30:00 2016-10-04
## 9 259 2016-10-04 12:00:00 2016-10-04
## 10 266 2016-10-04 12:30:00 2016-10-04
## 11 269 2016-10-04 13:00:00 2016-10-04
## 12 263 2016-10-04 13:30:00 2016-10-04
## 13 238 2016-10-04 14:00:00 2016-10-04
## 14 215 2016-10-04 14:30:00 2016-10-04
## 15 192 2016-10-04 15:00:00 2016-10-04
## 16 165 2016-10-04 15:30:00 2016-10-04
## 17 162 2016-10-04 16:00:00 2016-10-04
## 18 143 2016-10-04 16:30:00 2016-10-04
## 19 54 2016-10-05 08:00:00 2016-10-05
## 20 59 2016-10-05 08:30:00 2016-10-05
## # ... with 1,292 more rows
Let’s start by looking if there are duplicate values for DateTime:
dfs %>% count(DateTime) %>% filter(n > 1)
## # A tibble: 4 x 2
## DateTime n
## <dttm> <int>
## 1 2016-10-28 08:30:00 2
## 2 2016-10-30 08:00:00 3
## 3 2016-11-18 08:00:00 2
## 4 2016-11-25 08:00:00 2
dfs %>% filter(as.character(DateTime) %in% c("2016-10-28 08:30:00", "2016-10-30 08:00:00", "2016-11-18 08:00:00", "2016-11-25 08:00:00"))
## # A tibble: 9 x 3
## Occupancy DateTime Date
## <dbl> <dttm> <date>
## 1 57 2016-10-28 08:30:00 2016-10-28
## 2 57 2016-10-28 08:30:00 2016-10-28
## 3 43 2016-10-30 08:00:00 2016-10-30
## 4 43 2016-10-30 08:00:00 2016-10-30
## 5 43 2016-10-30 08:00:00 2016-10-30
## 6 35 2016-11-18 08:00:00 2016-11-18
## 7 35 2016-11-18 08:00:00 2016-11-18
## 8 44 2016-11-25 08:00:00 2016-11-25
## 9 44 2016-11-25 08:00:00 2016-11-25
It seems there are some duplicates, but not too many, will remove these by aggregating:
dfs %>% group_by(DateTime, Date) %>% summarise(Occupancy = mean(Occupancy, na.rm = T)) %>% ungroup()-> dfs
dfs %>% count(DateTime) %>% filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: DateTime <dttm>, n <int>
As mentioned previously, we spotted two intervals where entires days were missing, so the plan is to create sort of “frame” which would hold all possible continuous time stamps based on MIN and MAX values of our existing data set.
expand.grid(Date = seq.Date(from = min(as.Date(dfs$DateTime)), to = max(as.Date(dfs$DateTime)), by = 1),
time = seq.int(from = 8*60*60, to = 16*60*60 + 1800, by = 1800 )) %>%
mutate(DateTime = as.Date(Date) + as.duration(time)) %>% select(DateTime) %>% arrange(DateTime) %>%
as_tibble()->frame_dt
frame_dt
## # A tibble: 1,386 x 1
## DateTime
## <dttm>
## 1 2016-10-04 08:00:00
## 2 2016-10-04 08:30:00
## 3 2016-10-04 09:00:00
## 4 2016-10-04 09:30:00
## 5 2016-10-04 10:00:00
## 6 2016-10-04 10:30:00
## 7 2016-10-04 11:00:00
## 8 2016-10-04 11:30:00
## 9 2016-10-04 12:00:00
## 10 2016-10-04 12:30:00
## 11 2016-10-04 13:00:00
## 12 2016-10-04 13:30:00
## 13 2016-10-04 14:00:00
## 14 2016-10-04 14:30:00
## 15 2016-10-04 15:00:00
## 16 2016-10-04 15:30:00
## 17 2016-10-04 16:00:00
## 18 2016-10-04 16:30:00
## 19 2016-10-05 08:00:00
## 20 2016-10-05 08:30:00
## # ... with 1,366 more rows
What I have done is a cross join between all dates and all time stamps (in seconds) in a single day with the help of expand.grid(). Next, is joining the frame to our existing data:
left_join(x = frame_dt, y = dfs) -> dfs
## Joining, by = "DateTime"
sum(is.na(dfs$Occupancy))
## [1] 79
So there 79 empty values, which I will need to deal with. I chose to impute using a handy function timetk::ts_impute_vec(). For the impute period I chose 1 week:
dfs %>% mutate(Occupancy = ts_impute_vec(x = Occupancy, period = 18*7)) -> dfs
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
sum(is.na(dfs$Occupancy))
## [1] 0
plot_time_series(dfs,.date_var = DateTime, .value = Occupancy, .title = "Car Park Occupancy")
No empty values remaining in the time series.
I will also leverage lagged values, as well as some rolling average numbers to create additional predictors. From ACF and PACF plots, it looked like lags at 18 and 126 are of significant importance. I will also add rolling daily averages for one day, three days and one week:
dfs %>%
tk_augment_lags(.value = Occupancy,.lags = c(18,126)) %>%
tk_augment_slidify(.value = starts_with("Occupancy_"),
.period = c(18,3*18, 7*18),
.f = ~mean(., na.rm = T),
.align = "center",
.partial = T) -> dfs
As our new variables introduced empty numbers, I will subset rows, where any of those numbers are missing:
remove_missing(dfs) -> dfs
## Warning: Removed 205 rows containing missing values.
First of, I will create validation data set and then Splits variable that will hold training/testing splits:
validation_set = dfs %>% slice_tail(n = 3*18)
dfs %>% anti_join(validation_set) -> dfs
## Joining, by = c("DateTime", "Date", "Occupancy", "Occupancy_lag18", "Occupancy_lag126", "Occupancy_lag18_roll_18", "Occupancy_lag126_roll_18", "Occupancy_lag18_roll_54", "Occupancy_lag126_roll_54", "Occupancy_lag18_roll_126", "Occupancy_lag126_roll_126")
Splits = dfs %>% initial_time_split( prop = (nrow(dfs) - 4*18) / nrow(dfs))
Next, for the recipe, I am planning to use these steps:
1. Convert date to numeric format
2. Retrieve all possible date time features out of our predictor DateTime
3. Remove those features that do not hold any value to the prediction
4. Remove Date and DateTime predictors as I have extracted all their features
5. Remove those predictors that are highly sparse and unbalanced (in case there are any)
recipe(Occupancy ~ ., data = training(Splits)) %>%
step_mutate(date_num = as.numeric(Date)) %>%
step_timeseries_signature(DateTime) %>%
step_rm(matches("(.xts$)|(.iso$)|(second)|(am.pm)|(lbl)|(_qday)|(_yday)")) %>%
step_rm(DateTime,Date) %>%
step_nzv(all_numeric_predictors()) -> my_recipe
I selected a random forest model, a decision tree model and a boosted tree model to use as examples. I will not do any tuning for these models and run everything on defaults to see how they stack up out of the box:
mod_ranger = rand_forest(mode = "regression") %>% set_engine("ranger", num.threads = parallel::detectCores(),importance = "impurity")
mod_dtree = decision_tree(mode = "regression") %>% set_engine("rpart")
mod_xgboost = boost_tree(mode = "regression") %>% set_engine("xgboost", nthread = parallel::detectCores())
Next, will use workflow_set() to create a workflow for each model and assign my custom recipe for each of the models by setting cross = T:
workflow_set(preproc = list(prep = my_recipe),
models = list(ranger = mod_ranger,dtree = mod_dtree, xgboost = mod_xgboost),
cross = T) -> wf_set
After I have my workflow set, I can fit the models with training data, and I will use modeltime_fit_workflowset() for this purpose. Then, I can do a prediction on the test data set with modeltime_calibrate()
modeltime_fit_workflowset(wf_set, data = training(Splits)) -> wf_set_fit
modeltime_calibrate(object = wf_set_fit, new_data = testing(Splits)) -> wf_calibrated
.calibration_data holds the actuals and predictions, which can be used with modeltime_accuracy():
modeltime_accuracy(wf_calibrated)
## # A tibble: 3 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 RANGER Test 14.8 18.0 0.650 15.3 18.5 0.949
## 2 2 RPART Test 19.4 36.6 0.854 24.8 22.1 0.890
## 3 3 XGBOOST Test 13.3 11.6 0.587 11.4 18.1 0.941
And we can explore the results visually, just need to unnest the calibration table:
# Function for plotting
plotting = function(df){
ggplot(df, aes(x = DateTime, y = .prediction, color = .model_desc)) + geom_line(cex = 0.75) +
facet_wrap(~.model_desc, ncol = 1, scales = "free_y") +
geom_line(aes(x = DateTime, y = .actual, color = "actual"), cex = 0.75, alpha = 0.75) +
scale_color_discrete(type = c("#000000", "#AB5F96", "#64AB5F", "#5FABA5")) +
labs(title = "Forecast", subtitle = "When running on defaults") +
theme(legend.title=element_blank(),
panel.background = element_rect(fill = "#F9F7F4"),
axis.title.y.left = element_blank()
)
}
wf_calibrated %>% select(.model_desc, .calibration_data) %>% mutate(.model_desc = tolower(.model_desc)) %>%
unnest(.calibration_data) %>% plotting()
Let’s take a look at variable importance plots for each model, to see what are the common predictors between the models with highest importance. I will combine plots for each separate model and display them combined using grid.arrange()
library(vip)
library(gridExtra)
library(grid)
p_title = grid.text(label = "Variable importance plots",gp=gpar(fontsize=20), draw = F)
P_ranger = vip(extract_fit_parsnip(pluck(wf_calibrated$.model,1))$fit) + ggtitle(pluck(wf_calibrated$.model_desc,1))
p_rpart = vip(extract_fit_parsnip(pluck(wf_calibrated$.model,2))$fit) + ggtitle(pluck(wf_calibrated$.model_desc,2))
p_xgboost = vip(extract_fit_parsnip(pluck(wf_calibrated$.model,3))$fit) + ggtitle(pluck(wf_calibrated$.model_desc,3))
grid.arrange(p_title, P_ranger, p_rpart, p_xgboost, ncol=2)
All three models have week lagged values as the most important one, which is accurate as the time series exhibited weekly periodicity, however it is interesting that the weekly periodicity is much more important than the daily one (Occupancy_lag18). xgboost model seems to be heavily relying on weekly periodicity and probably would not be a good model for sudden daily changes in occupancy. Ranger and rpart models are more balanced in this regard.
Will start by refitting the training + testing data set, which, in my case was dfs. Then, follows the same steps, using modeltime_calibrate() to get actuals and predictions, which can be investigated forther:
modeltime_refit(object = wf_calibrated, data = dfs) -> dfs_refit
modeltime_calibrate(object = dfs_refit, new_data = validation_set) -> dfs_calibrated
modeltime_accuracy(dfs_calibrated)
## # A tibble: 3 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 RANGER Test 27.2 55.6 0.766 26.3 37.0 0.926
## 2 2 RPART Test 67.2 102. 1.89 43.3 92.8 0.791
## 3 3 XGBOOST Test 38.8 38.4 1.09 24.8 53.8 0.904
dfs_calibrated %>% select(.model_desc, .calibration_data) %>% mutate(.model_desc = tolower(.model_desc)) %>%
unnest(.calibration_data) %>% plotting()
Out of the box performance for ranger and xgboost models was at an acceptable level. Both of these models captured periodicity pretty well. Additional feature engineering, for example taking into account public holidays/event could improve model performance. Adding rolling measures for lagged values and lagged values themselves improved performance of all three models. The most important predictor was the week lagged value, which shows that the models have capability of forecasting of at least one week in advance, although some other predictors would need to be tweaked to accommodate for that.