Hyperparameter tuning is a computationally intensive process, even when you have domain knowledge of a particular algorithm's parameters. Bayesian Optimization is a method that uses the results of a small random search to choose parameters that often outperform those selected by a far more intensive grid search. Below, I will outline how to use Bayesian Optimization to tune the hyperparameters in Facebook's Prophet time-series package. We will be predicting property crime in a specific Chicago police district at a weekly level.
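In a nutshell, Bayesian Optimization fits a surrogate model (a Gaussian process, in the rBayesianOptimization package we use later) to the parameter/score pairs evaluated so far, then uses an acquisition function such as the upper confidence bound to pick the next point to try. As a minimal toy sketch of the API, the objective function and bounds here are purely illustrative:
library(rBayesianOptimization)
# Toy 1-d objective: the package maximizes Score, so the optimum is at x = 2
toy_fn = function(x) list(Score = -(x - 2)^2, Pred = 0)
toy_search = BayesianOptimization(toy_fn,
                                  bounds = list(x = c(0, 5)),
                                  init_points = 4, n_iter = 6,
                                  acq = 'ucb', kappa = 1, verbose = FALSE)
toy_search$Best_Par  # should land near 2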
First, we pull all burglaries in the city since 2001, accessing Chicago's Open Data Portal through RSocrata, an R interface to the Socrata open data API. We then aggregate to counts by week and district.
library(RSocrata)
library(dplyr)
library(lubridate)
fbi_code = "'05'"
url = sprintf("https://data.cityofchicago.org/resource/6zsd-86xi.json?$select=*&$where=fbi_code=%s", fbi_code)
burglaries = read.socrata(url)
burglaries$date_clean = as.Date(as.POSIXct(burglaries$date, format = '%Y-%m-%d %H:%M:%S'))
burglaries$week = as.Date(cut(burglaries$date_clean, 'week'))
burglary_counts = burglaries %>%
  group_by(district, week) %>%
  summarise(COUNTCRIMES = n())
head(burglary_counts)
## # A tibble: 6 x 3
## # Groups: district [1]
## district week COUNTCRIMES
## <chr> <date> <int>
## 1 001 2001-01-01 8
## 2 001 2001-01-08 6
## 3 001 2001-01-15 1
## 4 001 2001-01-22 5
## 5 001 2001-01-29 4
## 6 001 2001-02-05 6
Below, I highlight the cross-validation framework for choosing the best parameters. We will use Hyndman's "evaluation on a rolling forecasting origin" method, illustrated by the graph below. The blue points represent training periods and the red points represent validation periods. We roll over several validation periods, adding one new training week each time, and average our evaluation metric across all of the validation periods.
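As a minimal sketch of the fold structure (the dates here are only illustrative), each fold trains on all weeks strictly before the validation week and is scored on that single week:
# Illustrative rolling-origin folds over three consecutive validation weeks
origins = seq(as.Date('2017-06-05'), by = '7 days', length.out = 3)
for (i in seq_along(origins)) {
  cat('fold', i, ': train on weeks <', format(origins[i]),
      '| validate on week of', format(origins[i]), '\n')
}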
Below we set up the initial random grid for cross-validation. We draw random samples of parameter values from ranges we believe to be sensible. Each sampled parameter set is then evaluated with the rolling-origin scheme, and the error metric, in this case Mean Absolute Percentage Error (MAPE), is averaged across the past six weekly validation periods.
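For reference, the MAPE reported by forecast::accuracy() below is just the mean absolute error of the predictions expressed as a percentage of the actuals; a quick sketch (the function name mape is mine):
# MAPE in percent
mape = function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100
mape(c(10, 8, 12), c(9, 9, 13))
## [1] 10.27778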
library(prophet)
### rolling CV period: the six most recent complete weeks
set.seed(12345)
num_weeks = 6
dist = '004'
origin_week = as.Date(cut(Sys.Date(), 'week')) - days(7*num_weeks)
end_week = as.Date(cut(Sys.Date(), 'week')) - days(7)
date_seq = seq(origin_week, end_week, by = '7 days')
cv_set = burglary_counts %>%
  ungroup() %>%
  filter(week <= end_week & district == dist) %>%
  select(week, COUNTCRIMES)
# prophet expects columns named ds (date) and y (value)
names(cv_set) = c('ds', 'y')
# Ten random parameter draws; seasonality_prior_scale mixes small (0.01-0.05)
# and large (1-10) values. 'Value' will hold each draw's score and must be
# the last column for rBayesianOptimization's init_grid_dt.
rand_search_grid = data.frame(
  changepoint_prior_scale = sort(runif(10, 0.01, 0.1)),
  seasonality_prior_scale = c(sort(sample(c(runif(5, 0.01, 0.05), runif(5, 1, 10)), 5, replace = F)),
                              sort(sample(c(runif(5, 0.01, 0.05), runif(5, 1, 10)), 5, replace = F))),
  n_changepoints = sample(5:25, 10, replace = F),
  Value = rep(0, 10)
)
# Random search: score each sampled parameter set with rolling-origin CV
for (i in 1:nrow(rand_search_grid)) {
  parameters = rand_search_grid[i, ]
  error = c()
  for (d in date_seq) {  # 'for' strips the Date class from d, but the comparisons below still behave correctly
    train = subset(cv_set, ds < d)
    test = subset(cv_set, ds == d)
    m = prophet(train, growth = 'linear',
                seasonality.prior.scale = parameters$seasonality_prior_scale,
                changepoint.prior.scale = parameters$changepoint_prior_scale,
                n.changepoints = parameters$n_changepoints,
                weekly.seasonality = F,
                daily.seasonality = F)
    future = make_future_dataframe(m, periods = 1, freq = 'week')
    # NOTE: call forecast::accuracy() with an explicit namespace to avoid
    # name clashes (e.g., with library(caret))
    forecast = predict(m, future)
    forecast$ds = as.Date(forecast$ds)
    error_d = forecast::accuracy(forecast[forecast$ds %in% test$ds, 'yhat'], test$y)[ , 'MAPE']
    error = c(error, error_d)
  }
  mean_error = mean(error)
  print(mean_error)
  # BayesianOptimization maximizes, so store the negative MAPE as 'Value'
  rand_search_grid$Value[i] = -mean_error
}
## (Repetitive Stan optimizer convergence messages omitted; every fit
## terminated normally. The printed values are the mean CV MAPE per parameter set.)
## [1] 22.6018
## [1] 21.22433
## [1] 21.20506
## [1] 21.32177
## [1] 24.36697
## [1] 25.71723
## [1] 25.67149
## [1] 24.3325
## [1] 24.30668
## [1] 25.32095
best_cv_value = -1*rand_search_grid[which.max(rand_search_grid$Value),'Value']
rand_search_grid = arrange(rand_search_grid, -Value)
head(rand_search_grid)
## changepoint_prior_scale seasonality_prior_scale n_changepoints Value
## 1 0.05108329 0.03942740 22 -21.20506
## 2 0.03925858 0.02564813 13 -21.22433
## 3 0.05583019 4.62236628 18 -21.32177
## 4 0.02497346 0.01004546 21 -22.60180
## 5 0.08975121 2.68941201 23 -24.30668
## 6 0.08881959 0.02938231 19 -24.33250
Now we implement Bayesian hyperparameter optimization through the rBayesianOptimization package. It takes as input: a function that, for a given parameter set, returns a cross-validation metric to be maximized (hence the negative of MAPE); the initial tuning grid, whose last column must be named 'Value'; and the search bounds, i.e. the feasible range for each parameter being tuned. Here the bounds are simply the ranges of the values sampled during random search; widening them (say, by 20% on each end) would let the optimizer explore beyond the sampled region.
library(rBayesianOptimization)
library(ggplot2)
prophet_fit_bayes = function(changepoint_prior_scale, seasonality_prior_scale, n_changepoints) {
  error = c()
  for (d in date_seq) {
    train = subset(cv_set, ds < d)
    test = subset(cv_set, ds == d)
    m = prophet(train, growth = 'linear',
                seasonality.prior.scale = seasonality_prior_scale,
                changepoint.prior.scale = changepoint_prior_scale,
                n.changepoints = n_changepoints,
                weekly.seasonality = F,
                daily.seasonality = F)
    future = make_future_dataframe(m, periods = 1, freq = 'week')
    # NOTE: call forecast::accuracy() with an explicit namespace to avoid
    # name clashes (e.g., with library(caret))
    forecast = predict(m, future)
    forecast$ds = as.Date(forecast$ds)
    error_d = forecast::accuracy(forecast[forecast$ds %in% test$ds, 'yhat'], test$y)[ , 'MAPE']
    error = c(error, error_d)
  }
  # BayesianOptimization maximizes the Score, so we return the negative of the
  # resampled MAPE. 'Pred' can carry predicted values; we return a placeholder zero.
  list(Score = -mean(error), Pred = 0)
}
changepoint_bounds = range(rand_search_grid$changepoint_prior_scale)
n_changepoint_bounds = as.integer(range(rand_search_grid$n_changepoints))
seasonality_bounds = range(rand_search_grid$seasonality_prior_scale)
# integer bounds flag n_changepoints as a discrete parameter
bayesian_search_bounds = list(changepoint_prior_scale = changepoint_bounds,
                              seasonality_prior_scale = seasonality_bounds,
                              n_changepoints = as.integer(n_changepoint_bounds))
ba_search = BayesianOptimization(prophet_fit_bayes,
                                 bounds = bayesian_search_bounds,
                                 init_grid_dt = rand_search_grid,
                                 init_points = 0,
                                 n_iter = 15,
                                 acq = 'ucb',
                                 kappa = 1,
                                 eps = 0,
                                 verbose = TRUE)
## 10 points in hyperparameter space were pre-sampled
## elapsed = 18.68 Round = 11 changepoint_prior_scale = 0.0468 seasonality_prior_scale = 9.5649 n_changepoints = 12.0000 Value = -20.1275
## elapsed = 21.00 Round = 12 changepoint_prior_scale = 0.0478 seasonality_prior_scale = 9.5649 n_changepoints = 23.0000 Value = -19.1804
## elapsed = 22.77 Round = 13 changepoint_prior_scale = 0.0442 seasonality_prior_scale = 9.5649 n_changepoints = 23.0000 Value = -20.2467
## elapsed = 21.76 Round = 14 changepoint_prior_scale = 0.0510 seasonality_prior_scale = 9.5649 n_changepoints = 23.0000 Value = -19.3550
## elapsed = 21.46 Round = 15 changepoint_prior_scale = 0.0486 seasonality_prior_scale = 7.7951 n_changepoints = 23.0000 Value = -19.0404
## elapsed = 21.93 Round = 16 changepoint_prior_scale = 0.0490 seasonality_prior_scale = 8.1011 n_changepoints = 23.0000 Value = -19.6954
## elapsed = 21.17 Round = 17 changepoint_prior_scale = 0.0451 seasonality_prior_scale = 7.5197 n_changepoints = 22.0000 Value = -20.4120
## elapsed = 22.41 Round = 18 changepoint_prior_scale = 0.0529 seasonality_prior_scale = 7.7326 n_changepoints = 23.0000 Value = -20.8209
## elapsed = 19.61 Round = 19 changepoint_prior_scale = 0.0453 seasonality_prior_scale = 7.9003 n_changepoints = 15.0000 Value = -19.5333
## elapsed = 22.95 Round = 20 changepoint_prior_scale = 0.0489 seasonality_prior_scale = 9.3493 n_changepoints = 23.0000 Value = -19.9182
## elapsed = 21.44 Round = 21 changepoint_prior_scale = 0.0469 seasonality_prior_scale = 7.8733 n_changepoints = 22.0000 Value = -19.5145
## elapsed = 17.51 Round = 22 changepoint_prior_scale = 0.0484 seasonality_prior_scale = 7.6942 n_changepoints = 6.0000 Value = -20.9759
## elapsed = 20.42 Round = 23 changepoint_prior_scale = 0.0485 seasonality_prior_scale = 7.6972 n_changepoints = 18.0000 Value = -20.0951
## elapsed = 21.42 Round = 24 changepoint_prior_scale = 0.0431 seasonality_prior_scale = 7.7797 n_changepoints = 20.0000 Value = -20.5655
## elapsed = 17.70 Round = 25 changepoint_prior_scale = 0.0424 seasonality_prior_scale = 8.0613 n_changepoints = 9.0000 Value = -22.0842
##
## Best Parameters Found:
## Round = 15 changepoint_prior_scale = 0.0486 seasonality_prior_scale = 7.7951 n_changepoints = 23.0000 Value = -19.0404
best_params_ba = c(ba_search$Best_Par, Value = -1*ba_search$Best_Value)
# Refit on the full series with the best parameters found
m = prophet(cv_set, growth = 'linear',
            seasonality.prior.scale = best_params_ba[['seasonality_prior_scale']],
            changepoint.prior.scale = best_params_ba[['changepoint_prior_scale']],
            n.changepoints = best_params_ba[['n_changepoints']])
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future = make_future_dataframe(m, periods = 1, freq = 'week')
forecast = predict(m, future)
forecast$ds = as.Date(forecast$ds)
# Plot the last year of observed counts, the forecast with its uncertainty
# interval, and the most recent week in green ('test' still holds the final
# validation week from the CV loop above)
p = ggplot() +
  geom_point(data = subset(cv_set, ds >= origin_week - days(7*52)), aes(x = as.Date(ds), y = y), size = 1) +
  geom_line(data = subset(forecast, ds >= origin_week - days(7*52)), aes(x = as.Date(ds), y = yhat), color = "#0072B2", size = 1) +
  geom_ribbon(data = subset(forecast, ds >= origin_week - days(7*52)), aes(x = as.Date(ds), ymin = yhat_lower, ymax = yhat_upper), fill = "#0072B2", alpha = 0.3) +
  geom_point(data = test, aes(x = as.Date(ds), y = y), size = 1, color = '#4daf4a')
p
print(paste('Did district', dist, 'experience a spike in burglaries in the past week?'))
## [1] "Did district 004 experience a spike in burglaries in the past week?"
subset(cv_set, ds == end_week)$y - subset(forecast, ds == end_week)$yhat_upper > 0
## [1] FALSE
Finally, look at the improvement in cross-validated MAPE that Bayesian Optimization delivers over random grid search alone.
print('Improvement in MAPE over cross-validation using Bayesian Optimization')
## [1] "Improvement in MAPE over cross-validation using Bayesian Optimization"
print(paste(round(best_cv_value - best_params_ba[['Value']], 4), 'percentage points'))
## [1] "2.1647 percentage points"