Using machine learning to create forecasts for Parking Birmingham Data Set

Introduction

The goal of this analysis is to explore how classical machine learning models can be used for forecasting time series with double periodicity. Also to explore how tidymodels framework works for such analysis.

Packages used

Below is the list of main packages that I have used:

library(tidyverse)
library(lubridate)
library(timetk)
library(tidymodels)
library(modeltime)
options(tibble.print_min = 20)

Importing data

Data set can be found here. Let’s load the data set directly from its web location:

filepath = "https://archive.ics.uci.edu/ml/machine-learning-databases/00482/dataset.zip"
temp <- tempfile(tmpdir = here::here())
download.file(filepath,temp)
dfs <- read_csv(file = unz(temp, "dataset.csv"))

## Rows: 35717 Columns: 4

## -- Column specification ----------------------------------------------------------------------------
## Delimiter: ","
## chr  (1): SystemCodeNumber
## dbl  (2): Capacity, Occupancy
## dttm (1): LastUpdated

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

unlink(temp)

I added here::here() to download temporary file in my project directory, but its not necessary. Let’s inspect the data:

dfs

## # A tibble: 35,717 x 4
##    SystemCodeNumber Capacity Occupancy LastUpdated        
##    <chr>               <dbl>     <dbl> <dttm>             
##  1 BHMBCCMKT01           577        61 2016-10-04 07:59:42
##  2 BHMBCCMKT01           577        64 2016-10-04 08:25:42
##  3 BHMBCCMKT01           577        80 2016-10-04 08:59:42
##  4 BHMBCCMKT01           577       107 2016-10-04 09:32:46
##  5 BHMBCCMKT01           577       150 2016-10-04 09:59:48
##  6 BHMBCCMKT01           577       177 2016-10-04 10:26:49
##  7 BHMBCCMKT01           577       219 2016-10-04 10:59:48
##  8 BHMBCCMKT01           577       247 2016-10-04 11:25:47
##  9 BHMBCCMKT01           577       259 2016-10-04 11:59:44
## 10 BHMBCCMKT01           577       266 2016-10-04 12:29:45
## 11 BHMBCCMKT01           577       269 2016-10-04 13:02:48
## 12 BHMBCCMKT01           577       263 2016-10-04 13:29:45
## 13 BHMBCCMKT01           577       238 2016-10-04 14:02:47
## 14 BHMBCCMKT01           577       215 2016-10-04 14:29:49
## 15 BHMBCCMKT01           577       192 2016-10-04 14:57:13
## 16 BHMBCCMKT01           577       165 2016-10-04 15:30:14
## 17 BHMBCCMKT01           577       162 2016-10-04 16:04:12
## 18 BHMBCCMKT01           577       143 2016-10-04 16:31:14
## 19 BHMBCCMKT01           577        54 2016-10-05 07:57:17
## 20 BHMBCCMKT01           577        59 2016-10-05 08:30:15
## # ... with 35,697 more rows

summary(dfs)

##  SystemCodeNumber      Capacity      Occupancy       LastUpdated                 
##  Length:35717       Min.   : 220   Min.   :  -8.0   Min.   :2016-10-04 07:46:28  
##  Class :character   1st Qu.: 500   1st Qu.: 210.0   1st Qu.:2016-10-24 16:33:03  
##  Mode  :character   Median : 849   Median : 446.0   Median :2016-11-11 14:00:21  
##                     Mean   :1398   Mean   : 642.2   Mean   :2016-11-11 18:11:47  
##                     3rd Qu.:2009   3rd Qu.: 798.0   3rd Qu.:2016-11-29 13:28:30  
##                     Max.   :4675   Max.   :4327.0   Max.   :2016-12-19 16:30:35

unique(dfs$SystemCodeNumber)

##  [1] "BHMBCCMKT01"      "BHMBCCPST01"      "BHMBCCSNH01"      "BHMBCCTHL01"      "BHMBRCBRG01"     
##  [6] "BHMBRCBRG02"      "BHMBRCBRG03"      "BHMBRTARC01"      "BHMEURBRD01"      "BHMEURBRD02"     
## [11] "BHMMBMMBX01"      "BHMNCPHST01"      "BHMNCPLDH01"      "BHMNCPNHS01"      "BHMNCPNST01"     
## [16] "BHMNCPPLS01"      "BHMNCPRAN01"      "Broad Street"     "Bull Ring"        "NIA Car Parks"   
## [21] "NIA North"        "NIA South"        "Others-CCCPS105a" "Others-CCCPS119a" "Others-CCCPS133" 
## [26] "Others-CCCPS135a" "Others-CCCPS202"  "Others-CCCPS8"    "Others-CCCPS98"   "Shopping"

Fields descriptions from Machine Learning Repository:
SystemCodeNumber: Car park ID
Capacity: Car park capacity
Occupancy: Car park occupancy rate
LastUpdated: Date and Time of the measure

Data investigation

I will be working with the SystemCodeNumber = “BHMBCCMKT01” , but overall all parking locations have pretty similar properties. So let’s explore those properties

dfs %>% filter(SystemCodeNumber == "BHMBCCMKT01") -> dfs

Handy plotting function plot_time_series() from timetk package will help with the investigation

plot_time_series(dfs, .date_var = LastUpdated, .value = Occupancy, .smooth = F, .interactive = T, .y_lab = "Occupancy",.title = "Car Park Occupancy")

First thing that is pretty evident on the graph, is that there are missing data points, namely between 2016-10-19 and 2016-10-22, also between 2016-12-02 and 2016-12-05.
Next, the structure periodicity - as per tibble preview of the data, we can see its date time data with approximate half hour intervals, data is also separated by days. The date time part starts at around 08:00 and finishes at around 16:30 - in total 18 observations per day, 7 days a week (excluding those blank spots for now).

Lets look how Car Park Occupancy is distributed per day and per half-hour

boxplot(formula = Occupancy ~ factor(weekdays(LastUpdated), levels = c("Monday","Tuesday","Wednesday","Thursday", "Friday", "Saturday", "Sunday"), labels = c("Monday", "Tuesday", "Wednesday","Thursday", "Friday", "Saturday", "Sunday"),ordered = T), data = dfs, xlab = NULL, main = "Occupancy distribution on each week day")

# Stripping date information from the LastUpdated date time field and then plotting it against target variable
plot(x = as.numeric(as.POSIXct(dfs$LastUpdated)) %% 86400 / 3600,  dfs$Occupancy, xlab = "Hour", ylab = "Occupancy", main = "Occupancy distribution in half hour intervals")

Out of weekdays, Saturdays have clearly largest values, other days are pretty similar with Sundays being a bit lower and Tuesdays a bit higher than the remaining ones. The half hour plot shows parabolic share that is not exactly symmetric - the end of the day values for Occupancy are larger than the values at the start of the day.

ACF and PACF plots

Let’s take a look which lags are important in this time series. Based on lag importantce we can possibly construct additional predictor variables. I will use plot_acf_diagnostics() from timetk package:

plot_acf_diagnostics(.data = dfs,
                      .date_var = LastUpdated,
                      .value = Occupancy, 
                      .show_white_noise_bars = T,
                     .x_intercept = c(18,126),
                     .white_noise_line_color = "green")

Besides the first few lags in ACF plot that show the time series points a highly correlated with their nearest neighbors, we see there are a few lags that correspond to periodicity of the time series, specifically, at lag 18(ACF) & 19(PACF) for daily periodicity and at lag 125 (ACF) & 126 (PACF) for the weekly periodicity.

Data cleaning

So the plan to prepare the data is as following:
1. Round date time to nearest half hour to be able to spot missing entries/duplicates
2. Check if there are any duplicate values, missing single entries
3. Clean missing days/single entries by imputing values

Rounding date time values

Will use handy lubridate function round_date() for this, also let’s create Date variable out of our date time variable:

dfs %>% mutate(DateTime = round_date(x = LastUpdated,unit = "30 minutes"), Date = as.Date(LastUpdated)) -> dfs

Discarding fields that are either redundant or don’t hold any prediction value:

dfs %>% select(-LastUpdated, -Capacity, -SystemCodeNumber) -> dfs
dfs

## # A tibble: 1,312 x 3
##    Occupancy DateTime            Date      
##        <dbl> <dttm>              <date>    
##  1        61 2016-10-04 08:00:00 2016-10-04
##  2        64 2016-10-04 08:30:00 2016-10-04
##  3        80 2016-10-04 09:00:00 2016-10-04
##  4       107 2016-10-04 09:30:00 2016-10-04
##  5       150 2016-10-04 10:00:00 2016-10-04
##  6       177 2016-10-04 10:30:00 2016-10-04
##  7       219 2016-10-04 11:00:00 2016-10-04
##  8       247 2016-10-04 11:30:00 2016-10-04
##  9       259 2016-10-04 12:00:00 2016-10-04
## 10       266 2016-10-04 12:30:00 2016-10-04
## 11       269 2016-10-04 13:00:00 2016-10-04
## 12       263 2016-10-04 13:30:00 2016-10-04
## 13       238 2016-10-04 14:00:00 2016-10-04
## 14       215 2016-10-04 14:30:00 2016-10-04
## 15       192 2016-10-04 15:00:00 2016-10-04
## 16       165 2016-10-04 15:30:00 2016-10-04
## 17       162 2016-10-04 16:00:00 2016-10-04
## 18       143 2016-10-04 16:30:00 2016-10-04
## 19        54 2016-10-05 08:00:00 2016-10-05
## 20        59 2016-10-05 08:30:00 2016-10-05
## # ... with 1,292 more rows

Checking for duplicates

Let’s start by looking if there are duplicate values for DateTime:

dfs %>% count(DateTime) %>% filter(n > 1)

## # A tibble: 4 x 2
##   DateTime                n
##   <dttm>              <int>
## 1 2016-10-28 08:30:00     2
## 2 2016-10-30 08:00:00     3
## 3 2016-11-18 08:00:00     2
## 4 2016-11-25 08:00:00     2

dfs %>% filter(as.character(DateTime) %in% c("2016-10-28 08:30:00", "2016-10-30 08:00:00", "2016-11-18 08:00:00", "2016-11-25 08:00:00"))

## # A tibble: 9 x 3
##   Occupancy DateTime            Date      
##       <dbl> <dttm>              <date>    
## 1        57 2016-10-28 08:30:00 2016-10-28
## 2        57 2016-10-28 08:30:00 2016-10-28
## 3        43 2016-10-30 08:00:00 2016-10-30
## 4        43 2016-10-30 08:00:00 2016-10-30
## 5        43 2016-10-30 08:00:00 2016-10-30
## 6        35 2016-11-18 08:00:00 2016-11-18
## 7        35 2016-11-18 08:00:00 2016-11-18
## 8        44 2016-11-25 08:00:00 2016-11-25
## 9        44 2016-11-25 08:00:00 2016-11-25

It seems there are some duplicates, but not too many, will remove these by aggregating:

dfs %>% group_by(DateTime, Date) %>% summarise(Occupancy = mean(Occupancy, na.rm = T)) %>% ungroup()-> dfs
dfs %>% count(DateTime) %>% filter(n > 1)

## # A tibble: 0 x 2
## # ... with 2 variables: DateTime <dttm>, n <int>

Checking and cleaning for any missing values

As mentioned previously, we spotted two intervals where entires days were missing, so the plan is to create sort of “frame” which would hold all possible continuous time stamps based on MIN and MAX values of our existing data set.

expand.grid(Date = seq.Date(from = min(as.Date(dfs$DateTime)), to =  max(as.Date(dfs$DateTime)), by = 1), 
            time = seq.int(from = 8*60*60, to = 16*60*60 + 1800, by = 1800 )) %>%
  mutate(DateTime = as.Date(Date) + as.duration(time)) %>% select(DateTime) %>% arrange(DateTime) %>% 
  as_tibble()->frame_dt
frame_dt

## # A tibble: 1,386 x 1
##    DateTime           
##    <dttm>             
##  1 2016-10-04 08:00:00
##  2 2016-10-04 08:30:00
##  3 2016-10-04 09:00:00
##  4 2016-10-04 09:30:00
##  5 2016-10-04 10:00:00
##  6 2016-10-04 10:30:00
##  7 2016-10-04 11:00:00
##  8 2016-10-04 11:30:00
##  9 2016-10-04 12:00:00
## 10 2016-10-04 12:30:00
## 11 2016-10-04 13:00:00
## 12 2016-10-04 13:30:00
## 13 2016-10-04 14:00:00
## 14 2016-10-04 14:30:00
## 15 2016-10-04 15:00:00
## 16 2016-10-04 15:30:00
## 17 2016-10-04 16:00:00
## 18 2016-10-04 16:30:00
## 19 2016-10-05 08:00:00
## 20 2016-10-05 08:30:00
## # ... with 1,366 more rows

What I have done is a cross join between all dates and all time stamps (in seconds) in a single day with the help of expand.grid(). Next, is joining the frame to our existing data:

left_join(x = frame_dt, y = dfs) -> dfs

## Joining, by = "DateTime"

sum(is.na(dfs$Occupancy))

## [1] 79

So there 79 empty values, which I will need to deal with. I chose to impute using a handy function timetk::ts_impute_vec(). For the impute period I chose 1 week:

dfs %>% mutate(Occupancy = ts_impute_vec(x = Occupancy, period = 18*7)) -> dfs

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

sum(is.na(dfs$Occupancy))

## [1] 0

plot_time_series(dfs,.date_var = DateTime, .value = Occupancy, .title = "Car Park Occupancy")

No empty values remaining in the time series.

Adding additional predictors

I will also leverage lagged values, as well as some rolling average numbers to create additional predictors. From ACF and PACF plots, it looked like lags at 18 and 126 are of significant importance. I will also add rolling daily averages for one day, three days and one week:

dfs %>% 
  tk_augment_lags(.value = Occupancy,.lags = c(18,126)) %>% 
  tk_augment_slidify(.value = starts_with("Occupancy_"), 
                           .period = c(18,3*18, 7*18), 
                           .f = ~mean(., na.rm = T),
                           .align = "center",
                           .partial = T) -> dfs

As our new variables introduced empty numbers, I will subset rows, where any of those numbers are missing:

remove_missing(dfs) -> dfs

## Warning: Removed 205 rows containing missing values.

Preprocessing

First of, I will create validation data set and then Splits variable that will hold training/testing splits:

validation_set = dfs %>% slice_tail(n = 3*18)
 dfs %>% anti_join(validation_set) -> dfs

## Joining, by = c("DateTime", "Date", "Occupancy", "Occupancy_lag18", "Occupancy_lag126", "Occupancy_lag18_roll_18", "Occupancy_lag126_roll_18", "Occupancy_lag18_roll_54", "Occupancy_lag126_roll_54", "Occupancy_lag18_roll_126", "Occupancy_lag126_roll_126")

Splits = dfs %>% initial_time_split( prop = (nrow(dfs) - 4*18) / nrow(dfs))

Next, for the recipe, I am planning to use these steps:
1. Convert date to numeric format
2. Retrieve all possible date time features out of our predictor DateTime
3. Remove those features that do not hold any value to the prediction
4. Remove Date and DateTime predictors as I have extracted all their features
5. Remove those predictors that are highly sparse and unbalanced (in case there are any)

recipe(Occupancy ~ ., data = training(Splits)) %>% 
  step_mutate(date_num = as.numeric(Date)) %>% 
  step_timeseries_signature(DateTime) %>% 
  step_rm(matches("(.xts$)|(.iso$)|(second)|(am.pm)|(lbl)|(_qday)|(_yday)")) %>% 
  step_rm(DateTime,Date) %>% 
  step_nzv(all_numeric_predictors())  -> my_recipe

Models

I selected a random forest model, a decision tree model and a boosted tree model to use as examples. I will not do any tuning for these models and run everything on defaults to see how they stack up out of the box:

mod_ranger = rand_forest(mode = "regression") %>% set_engine("ranger", num.threads = parallel::detectCores(),importance = "impurity")
mod_dtree = decision_tree(mode = "regression") %>% set_engine("rpart")
mod_xgboost = boost_tree(mode = "regression") %>% set_engine("xgboost", nthread = parallel::detectCores())

Next, will use workflow_set() to create a workflow for each model and assign my custom recipe for each of the models by setting cross = T:

workflow_set(preproc = list(prep = my_recipe), 
             models = list(ranger = mod_ranger,dtree =  mod_dtree, xgboost = mod_xgboost), 
             cross = T) -> wf_set

After I have my workflow set, I can fit the models with training data, and I will use modeltime_fit_workflowset() for this purpose. Then, I can do a prediction on the test data set with modeltime_calibrate()

modeltime_fit_workflowset(wf_set, data =  training(Splits)) -> wf_set_fit
modeltime_calibrate(object = wf_set_fit, new_data = testing(Splits)) -> wf_calibrated

Examining results

Accuracy

.calibration_data holds the actuals and predictions, which can be used with modeltime_accuracy():

modeltime_accuracy(wf_calibrated)

## # A tibble: 3 x 9
##   .model_id .model_desc .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         1 RANGER      Test   14.8  18.0 0.650  15.3  18.5 0.949
## 2         2 RPART       Test   19.4  36.6 0.854  24.8  22.1 0.890
## 3         3 XGBOOST     Test   13.3  11.6 0.587  11.4  18.1 0.941

And we can explore the results visually, just need to unnest the calibration table:

# Function for plotting
plotting = function(df){
    ggplot(df, aes(x = DateTime, y = .prediction, color = .model_desc)) + geom_line(cex = 0.75) + 
  facet_wrap(~.model_desc, ncol = 1, scales = "free_y") + 
  geom_line(aes(x = DateTime, y = .actual, color = "actual"), cex = 0.75, alpha = 0.75) +
  scale_color_discrete(type = c("#000000", "#AB5F96", "#64AB5F", "#5FABA5")) +
  labs(title = "Forecast", subtitle = "When running on defaults") + 
  theme(legend.title=element_blank(),
        panel.background = element_rect(fill = "#F9F7F4"),
        axis.title.y.left = element_blank()
        )
}

wf_calibrated %>% select(.model_desc, .calibration_data) %>% mutate(.model_desc = tolower(.model_desc))  %>% 
  unnest(.calibration_data) %>% plotting()

Variable importance plots

Let’s take a look at variable importance plots for each model, to see what are the common predictors between the models with highest importance. I will combine plots for each separate model and display them combined using grid.arrange()

library(vip)
library(gridExtra)
library(grid)
p_title = grid.text(label = "Variable importance plots",gp=gpar(fontsize=20), draw = F)
P_ranger =  vip(extract_fit_parsnip(pluck(wf_calibrated$.model,1))$fit) + ggtitle(pluck(wf_calibrated$.model_desc,1))
p_rpart = vip(extract_fit_parsnip(pluck(wf_calibrated$.model,2))$fit) + ggtitle(pluck(wf_calibrated$.model_desc,2))
p_xgboost = vip(extract_fit_parsnip(pluck(wf_calibrated$.model,3))$fit) + ggtitle(pluck(wf_calibrated$.model_desc,3))
grid.arrange(p_title, P_ranger, p_rpart, p_xgboost, ncol=2)

All three models have week lagged values as the most important one, which is accurate as the time series exhibited weekly periodicity, however it is interesting that the weekly periodicity is much more important than the daily one (Occupancy_lag18). xgboost model seems to be heavily relying on weekly periodicity and probably would not be a good model for sudden daily changes in occupancy. Ranger and rpart models are more balanced in this regard.

Performance in validation data set

Will start by refitting the training + testing data set, which, in my case was dfs. Then, follows the same steps, using modeltime_calibrate() to get actuals and predictions, which can be investigated forther:

modeltime_refit(object = wf_calibrated, data = dfs) -> dfs_refit
modeltime_calibrate(object = dfs_refit, new_data = validation_set) -> dfs_calibrated
modeltime_accuracy(dfs_calibrated)

## # A tibble: 3 x 9
##   .model_id .model_desc .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         1 RANGER      Test   27.2  55.6 0.766  26.3  37.0 0.926
## 2         2 RPART       Test   67.2 102.  1.89   43.3  92.8 0.791
## 3         3 XGBOOST     Test   38.8  38.4 1.09   24.8  53.8 0.904

dfs_calibrated %>% select(.model_desc, .calibration_data) %>% mutate(.model_desc = tolower(.model_desc))  %>% 
  unnest(.calibration_data) %>% plotting()

Conclusions

Out of the box performance for ranger and xgboost models was at an acceptable level. Both of these models captured periodicity pretty well. Additional feature engineering, for example taking into account public holidays/event could improve model performance. Adding rolling measures for lagged values and lagged values themselves improved performance of all three models. The most important predictor was the week lagged value, which shows that the models have capability of forecasting of at least one week in advance, although some other predictors would need to be tweaked to accommodate for that.

References

Birmingham City Council. n.d. https://data.birmingham.gov.uk/dataset.

Stolfi, Daniel H. and Alba, Enrique and Yao, Xin. n.d. “Predicting Car Park Occupancy Rates in Smart Cities.” https://doi.org/10.1007/978-3-319-59513-9_11.