Hyperparameter tuning is a computationally intensive process, even when you have domain knowledge of a particular algorithm's parameters. Bayesian Optimization is a method that uses the results of a small random search to choose parameters that often outperform those selected by a far more intensive grid search. Below, I will outline how to use Bayesian Optimization to tune the hyperparameters in Facebook's Prophet time-series package. We will be predicting property crime in a specific Chicago police district at a weekly level.
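In a nutshell, Bayesian Optimization fits a surrogate model (a Gaussian process, in the rBayesianOptimization package we use later) to the parameter/score pairs evaluated so far, then uses an acquisition function such as the upper confidence bound to pick the next point to try. As a minimal toy sketch of the API, the objective function and bounds here are purely illustrative:
library(rBayesianOptimization)
# Toy 1-d objective: the package maximizes Score, so the optimum is at x = 2
toy_fn = function(x) list(Score = -(x - 2)^2, Pred = 0)
toy_search = BayesianOptimization(toy_fn,
                                  bounds = list(x = c(0, 5)),
                                  init_points = 4, n_iter = 6,
                                  acq = 'ucb', kappa = 1, verbose = FALSE)
toy_search$Best_Par  # should land near 2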
First, we pull all burglaries in the city since 2001, accessing Chicago's Open Data Portal through RSocrata, an R interface to the Socrata open data API. We then aggregate to counts by week and district.
library(RSocrata)
library(dplyr)
library(lubridate)
fbi_code = "'05'"
url = sprintf("https://data.cityofchicago.org/resource/6zsd-86xi.json?$select=*&$where=fbi_code=%s", fbi_code)
burglaries = read.socrata(url)
burglaries$date_clean = as.Date(as.POSIXct(burglaries$date, format = '%Y-%m-%d %H:%M:%S'))
burglaries$week = as.Date(cut(burglaries$date_clean, 'week'))
burglary_counts = burglaries %>%
  group_by(district, week) %>%
  summarise(COUNTCRIMES = n())
head(burglary_counts)
## # A tibble: 6 x 3
## # Groups: district [1]
## district week COUNTCRIMES
## <chr> <date> <int>
## 1 001 2001-01-01 8
## 2 001 2001-01-08 6
## 3 001 2001-01-15 1
## 4 001 2001-01-22 5
## 5 001 2001-01-29 4
## 6 001 2001-02-05 6
Below, I highlight the cross-validation framework for choosing the best parameters. We will use Hyndman's "evaluation on a rolling forecasting origin" method, illustrated by the graph below. The blue points represent training periods and the red points represent validation periods. We roll over several validation periods, adding one new training week each time, and average our evaluation metric across all of the validation periods.
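As a minimal sketch of the fold structure (the dates here are only illustrative), each fold trains on all weeks strictly before the validation week and is scored on that single week:
# Illustrative rolling-origin folds over three consecutive validation weeks
origins = seq(as.Date('2017-06-05'), by = '7 days', length.out = 3)
for (i in seq_along(origins)) {
  cat('fold', i, ': train on weeks <', format(origins[i]),
      '| validate on week of', format(origins[i]), '\n')
}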
Below we set up the initial random grid for cross-validation. We draw random samples of parameter values from ranges we believe to be sensible. Each sampled parameter set is then evaluated with the rolling-origin scheme, and the error metric, in this case Mean Absolute Percentage Error (MAPE), is averaged across the past six weekly validation periods.
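For reference, the MAPE reported by forecast::accuracy() below is just the mean absolute error of the predictions expressed as a percentage of the actuals; a quick sketch (the function name mape is mine):
# MAPE in percent
mape = function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100
mape(c(10, 8, 12), c(9, 9, 13))
## [1] 10.27778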
library(prophet)
### rolling CV period: the six most recent complete weeks
set.seed(12345)
num_weeks = 6
dist = '004'
origin_week = as.Date(cut(Sys.Date(), 'week')) - days(7*num_weeks)
end_week = as.Date(cut(Sys.Date(), 'week')) - days(7)
date_seq = seq(origin_week, end_week, by = '7 days')
cv_set = burglary_counts %>%
  ungroup() %>%
  filter(week <= end_week & district == dist) %>%
  select(week, COUNTCRIMES)
# prophet expects columns named ds (date) and y (value)
names(cv_set) = c('ds', 'y')
# Ten random parameter draws; seasonality_prior_scale mixes small (0.01-0.05)
# and large (1-10) values. 'Value' will hold each draw's score and must be
# the last column for rBayesianOptimization's init_grid_dt.
rand_search_grid = data.frame(
  changepoint_prior_scale = sort(runif(10, 0.01, 0.1)),
  seasonality_prior_scale = c(sort(sample(c(runif(5, 0.01, 0.05), runif(5, 1, 10)), 5, replace = F)),
                              sort(sample(c(runif(5, 0.01, 0.05), runif(5, 1, 10)), 5, replace = F))),
  n_changepoints = sample(5:25, 10, replace = F),
  Value = rep(0, 10)
)
# Random search: score each sampled parameter set with rolling-origin CV
for (i in 1:nrow(rand_search_grid)) {
  parameters = rand_search_grid[i, ]
  error = c()
  for (d in date_seq) {  # 'for' strips the Date class from d, but the comparisons below still behave correctly
    train = subset(cv_set, ds < d)
    test = subset(cv_set, ds == d)
    m = prophet(train, growth = 'linear',
                seasonality.prior.scale = parameters$seasonality_prior_scale,
                changepoint.prior.scale = parameters$changepoint_prior_scale,
                n.changepoints = parameters$n_changepoints,
                weekly.seasonality = F,
                daily.seasonality = F)
    future = make_future_dataframe(m, periods = 1, freq = 'week')
    # NOTE: call forecast::accuracy() with an explicit namespace to avoid
    # name clashes (e.g., with library(caret))
    forecast = predict(m, future)
    forecast$ds = as.Date(forecast$ds)
    error_d = forecast::accuracy(forecast[forecast$ds %in% test$ds, 'yhat'], test$y)[ , 'MAPE']
    error = c(error, error_d)
  }
  mean_error = mean(error)
  print(mean_error)
  # BayesianOptimization maximizes, so store the negative MAPE as 'Value'
  rand_search_grid$Value[i] = -mean_error
}
## (Repetitive Stan optimizer convergence messages omitted; every fit
## terminated normally. The printed values are the mean CV MAPE per parameter set.)
## [1] 22.6018
## [1] 21.22433
## [1] 21.20506
## [1] 21.32177
## [1] 24.36697
## [1] 25.71723
## [1] 25.67149
## [1] 24.3325
## [1] 24.30668
## [1] 25.32095
best_cv_value = -1*rand_search_grid[which.max(rand_search_grid$Value),'Value']
rand_search_grid = arrange(rand_search_grid, -Value)
head(rand_search_grid)
## changepoint_prior_scale seasonality_prior_scale n_changepoints Value
## 1 0.05108329 0.03942740 22 -21.20506
## 2 0.03925858 0.02564813 13 -21.22433
## 3 0.05583019 4.62236628 18 -21.32177
## 4 0.02497346 0.01004546 21 -22.60180
## 5 0.08975121 2.68941201 23 -24.30668
## 6 0.08881959 0.02938231 19 -24.33250
Now we implement Bayesian hyperparameter optimization through the rBayesianOptimization package. It takes as input: a function that, for a given parameter set, returns a cross-validation metric to be maximized (hence the negative of MAPE); the initial tuning grid, whose last column must be named 'Value'; and the search bounds, i.e. the feasible range for each parameter being tuned. Here the bounds are simply the ranges of the values sampled during random search; widening them (say, by 20% on each end) would let the optimizer explore beyond the sampled region.
library(rBayesianOptimization)
library(ggplot2)
prophet_fit_bayes = function(changepoint_prior_scale, seasonality_prior_scale, n_changepoints) {
  error = c()
  for (d in date_seq) {
    train = subset(cv_set, ds < d)
    test = subset(cv_set, ds == d)
    m = prophet(train, growth = 'linear',
                seasonality.prior.scale = seasonality_prior_scale,
                changepoint.prior.scale = changepoint_prior_scale,
                n.changepoints = n_changepoints,
                weekly.seasonality = F,
                daily.seasonality = F)
    future = make_future_dataframe(m, periods = 1, freq = 'week')
    # NOTE: call forecast::accuracy() with an explicit namespace to avoid
    # name clashes (e.g., with library(caret))
    forecast = predict(m, future)
    forecast$ds = as.Date(forecast$ds)
    error_d = forecast::accuracy(forecast[forecast$ds %in% test$ds, 'yhat'], test$y)[ , 'MAPE']
    error = c(error, error_d)
  }
  # BayesianOptimization maximizes the Score, so we return the negative of the
  # resampled MAPE. 'Pred' can carry predicted values; we return a placeholder zero.
  list(Score = -mean(error), Pred = 0)
}
changepoint_bounds = range(rand_search_grid$changepoint_prior_scale)
n_changepoint_bounds = as.integer(range(rand_search_grid$n_changepoints))
seasonality_bounds = range(rand_search_grid$seasonality_prior_scale)
# integer bounds flag n_changepoints as a discrete parameter
bayesian_search_bounds = list(changepoint_prior_scale = changepoint_bounds,
                              seasonality_prior_scale = seasonality_bounds,
                              n_changepoints = as.integer(n_changepoint_bounds))
ba_search = BayesianOptimization(prophet_fit_bayes,
                                 bounds = bayesian_search_bounds,
                                 init_grid_dt = rand_search_grid,
                                 init_points = 0,
                                 n_iter = 15,
                                 acq = 'ucb',
                                 kappa = 1,
                                 eps = 0,
                                 verbose = TRUE)
## 10 points in hyperparameter space were pre-sampled
## elapsed = 18.68 Round = 11 changepoint_prior_scale = 0.0468 seasonality_prior_scale = 9.5649 n_changepoints = 12.0000 Value = -20.1275
## elapsed = 21.00 Round = 12 changepoint_prior_scale = 0.0478 seasonality_prior_scale = 9.5649 n_changepoints = 23.0000 Value = -19.1804
## elapsed = 22.77 Round = 13 changepoint_prior_scale = 0.0442 seasonality_prior_scale = 9.5649 n_changepoints = 23.0000 Value = -20.2467
## elapsed = 21.76 Round = 14 changepoint_prior_scale = 0.0510 seasonality_prior_scale = 9.5649 n_changepoints = 23.0000 Value = -19.3550
## elapsed = 21.46 Round = 15 changepoint_prior_scale = 0.0486 seasonality_prior_scale = 7.7951 n_changepoints = 23.0000 Value = -19.0404
## elapsed = 21.93 Round = 16 changepoint_prior_scale = 0.0490 seasonality_prior_scale = 8.1011 n_changepoints = 23.0000 Value = -19.6954
## elapsed = 21.17 Round = 17 changepoint_prior_scale = 0.0451 seasonality_prior_scale = 7.5197 n_changepoints = 22.0000 Value = -20.4120
## elapsed = 22.41 Round = 18 changepoint_prior_scale = 0.0529 seasonality_prior_scale = 7.7326 n_changepoints = 23.0000 Value = -20.8209
## elapsed = 19.61 Round = 19 changepoint_prior_scale = 0.0453 seasonality_prior_scale = 7.9003 n_changepoints = 15.0000 Value = -19.5333
## elapsed = 22.95 Round = 20 changepoint_prior_scale = 0.0489 seasonality_prior_scale = 9.3493 n_changepoints = 23.0000 Value = -19.9182
## elapsed = 21.44 Round = 21 changepoint_prior_scale = 0.0469 seasonality_prior_scale = 7.8733 n_changepoints = 22.0000 Value = -19.5145
## elapsed = 17.51 Round = 22 changepoint_prior_scale = 0.0484 seasonality_prior_scale = 7.6942 n_changepoints = 6.0000 Value = -20.9759
## elapsed = 20.42 Round = 23 changepoint_prior_scale = 0.0485 seasonality_prior_scale = 7.6972 n_changepoints = 18.0000 Value = -20.0951
## elapsed = 21.42 Round = 24 changepoint_prior_scale = 0.0431 seasonality_prior_scale = 7.7797 n_changepoints = 20.0000 Value = -20.5655
## elapsed = 17.70 Round = 25 changepoint_prior_scale = 0.0424 seasonality_prior_scale = 8.0613 n_changepoints = 9.0000 Value = -22.0842
##
## Best Parameters Found:
## Round = 15 changepoint_prior_scale = 0.0486 seasonality_prior_scale = 7.7951 n_changepoints = 23.0000 Value = -19.0404
best_params_ba = c(ba_search$Best_Par, Value = -1*ba_search$Best_Value)
# Refit on the full series with the best parameters found
m = prophet(cv_set, growth = 'linear',
            seasonality.prior.scale = best_params_ba[['seasonality_prior_scale']],
            changepoint.prior.scale = best_params_ba[['changepoint_prior_scale']],
            n.changepoints = best_params_ba[['n_changepoints']])
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future = make_future_dataframe(m, periods = 1, freq = 'week')
forecast = predict(m, future)
forecast$ds = as.Date(forecast$ds)
# Plot the last year of observed counts, the forecast with its uncertainty
# interval, and the most recent week in green ('test' still holds the final
# validation week from the CV loop above)
p = ggplot() +
  geom_point(data = subset(cv_set, ds >= origin_week - days(7*52)), aes(x = as.Date(ds), y = y), size = 1) +
  geom_line(data = subset(forecast, ds >= origin_week - days(7*52)), aes(x = as.Date(ds), y = yhat), color = "#0072B2", size = 1) +
  geom_ribbon(data = subset(forecast, ds >= origin_week - days(7*52)), aes(x = as.Date(ds), ymin = yhat_lower, ymax = yhat_upper), fill = "#0072B2", alpha = 0.3) +
  geom_point(data = test, aes(x = as.Date(ds), y = y), size = 1, color = '#4daf4a')
p
print(paste('Did district', dist, 'experience a spike in burglaries in the past week?'))
## [1] "Did district 004 experience a spike in burglaries in the past week?"
subset(cv_set, ds == end_week)$y - subset(forecast, ds == end_week)$yhat_upper > 0
## [1] FALSE
Finally, look at the improvement in cross-validated MAPE that Bayesian Optimization delivers over random grid search alone.
print('Improvement in MAPE over cross-validation using Bayesian Optimization')
## [1] "Improvement in MAPE over cross-validation using Bayesian Optimization"
print(paste(round(best_cv_value - best_params_ba[['Value']], 4), 'percentage points'))
## [1] "2.1647 percentage points"