Background
This is my Machine Learning Capstone Project for Algoritma Data Science Academy Batch 4 (Cohort Deragon).
For more information about Algoritma Data Science Academy, please see https://algorit.ma/
This is a project about a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey’s citizens and really values efficiency in traveling through traffic; the app even references Star Trek’s “beam me up” on its order button. In this project, we are going to help them forecast driver demand for the next 5 working days.
Data Preprocessing
We input the Beam_Me dataset into R:
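The loading chunk itself is not shown; a minimal sketch, assuming the data ships as a CSV (the file name and the object name Beam_Me are assumptions):
# read the raw dataset; the file name here is assumed for illustration
Beam_Me <- read.csv("Beam_Me.csv", stringsAsFactors = FALSE)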
These are the libraries we are going to use:
library(dplyr)
library(tidyr)
library(stringr)
library(forecast)
library(lubridate)
library(TTR)
library(fpp)
library(fpp2)
library(xts)
library(zoo)
library(ggplot2)
library(ggthemes)
library(ggthemr) # devtools::install_github("cttobin/ggthemr")
library(plotly)
library(magrittr) # for the %<>% and %$% pipes used in the modelling section
Let’s peek into the original Beam_Me dataset:
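The inspection call itself is not echoed; the output format below suggests dplyr’s glimpse(), so a minimal sketch (using the assumed Beam_Me object):
# peek at the structure of the raw data
glimpse(Beam_Me)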
## Observations: 366,784
## Variables: 10
## $ timeStamp <chr> "2017-11-03 19:02:31", "2017-10-01 17:45:56", "…
## $ driverID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ riderID <chr> "5941083dc01c9f3eeeaac726", "59d0d4363d32b86176…
## $ orderStatus <chr> "nodrivers", "nodrivers", "nodrivers", "nodrive…
## $ confirmedTimeSec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ srcGeohash <chr> "6", "7", "7", "7", "7", "7", "7", "7", "7", "7…
## $ srcLong <dbl> -60.87779, 0.00000, 0.00000, 0.00000, 0.00000, …
## $ srcLat <dbl> -23.04455, 0.00000, 0.00000, 0.00000, 0.00000, …
## $ destLong <dbl> 40.98718, 40.99750, 40.99750, 41.03570, 41.0449…
## $ destLat <dbl> 29.02819, 28.85056, 28.85056, 29.06256, 29.0070…
As timeStamp is the date and hour of each order, we convert its class from character to POSIXct with the ymd_hms() function from the lubridate library.
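The conversion chunk is not echoed; a minimal sketch using lubridate (the working copy Beam_Me_ordered takes its name from the wrangling chunk below):
# parse the character timestamps into POSIXct
Beam_Me <- Beam_Me %>%
  mutate(timeStamp = ymd_hms(timeStamp))
# working copy for the order-cleaning steps below
Beam_Me_ordered <- Beam_Me
# check the new class
class(Beam_Me$timeStamp)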
## [1] "POSIXct" "POSIXt"
The dataset includes some “cancelled” and duplicated “nodrivers” transactions, which do not represent real demand and should not be included in the time series analysis and forecast model. Given this, we first preprocess the data into a proper time series that represents Beam_Me’s true demand.
Based on customer behaviour, we can say that there are 3 orderStatus scenarios that count as 1 legitimate order:
1. Customer orders and gets a “confirmed” status.
2. Customer orders, gets “nodrivers” (possibly more than once), and keeps ordering until he/she gets a “confirmed” status.
3. Customer orders, gets “nodrivers”, and “cancelled” his/her order.
We drop the “cancelled” data, as a “cancelled” order was previously recorded as “nodrivers”. We add Date and Hour columns, since we want to predict the hourly demand, and group the dataset by riderID, Date, and Hour.
Then we throw away the “nodrivers” duplicates. We add prevOrder with the lag() function (so the first order in each group has no previous status), dupOrder to flag whether prevOrder is the same as orderStatus, and we set orderStatus to NA wherever orderStatus == “nodrivers” and dupOrder == “Duplicates”; we can then filter the dataset, dropping all NA rows from orderStatus.
Beam_Me_ordered <- Beam_Me_ordered %>%
# prepare date and hour variable
mutate(
Date = timeStamp %>% format("%Y-%m-%d"),
Hour = timeStamp %>% hour()
) %>%
# arrange and group by : riders, date, and hour
arrange(riderID, Date, Hour) %>%
group_by(riderID, Date, Hour) %>%
mutate(prevOrder = lag(orderStatus, default = "NA"),
dupOrder = ifelse(orderStatus == prevOrder, "Duplicates", "Not Duplicates"),
orderStatus = ifelse(orderStatus == "nodrivers" & dupOrder == "Duplicates", NA, orderStatus)) %>%
# drop NA in orderStatus
filter(!is.na(orderStatus)) %>%
# We're left with "nodrivers" and "confirmed", we still have to throw the "nodrivers" to make it 1 legitimate order
mutate(nextOrder = lead(orderStatus, default = "NA"),
orderStatus = ifelse(orderStatus == "nodrivers" & nextOrder == "confirmed", NA, orderStatus)) %>%
filter(!is.na(orderStatus)) %>%
# stop the grouping
ungroup()
## # A tibble: 6 x 15
## timeStamp driverID riderID orderStatus confirmedTimeSec
## <dttm> <chr> <chr> <chr> <int>
## 1 2017-11-18 13:18:06 599eff5… 58c420… confirmed 31
## 2 2017-11-25 21:48:23 596bb16… 58c52e… confirmed 87
## 3 2017-10-01 14:19:09 58c4320… 58c6e4… confirmed 6
## 4 2017-10-08 14:51:13 null 58c6e4… nodrivers 0
## 5 2017-10-08 18:28:59 5998156… 58c6e4… confirmed 9
## 6 2017-10-11 13:16:24 59c9a6f… 58c6e4… confirmed 21
## # … with 10 more variables: srcGeohash <chr>, srcLong <dbl>, srcLat <dbl>,
## # destLong <dbl>, destLat <dbl>, Date <chr>, Hour <int>,
## # prevOrder <chr>, dupOrder <chr>, nextOrder <chr>
We subset the dataset into Beam_Me_data.
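The subsetting chunk is hidden; based on the columns printed below, a plausible sketch:
# keep only the columns needed for counting hourly demand
Beam_Me_data <- Beam_Me_ordered %>%
  select(timeStamp, riderID, orderStatus, srcGeohash, Date, Hour)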
## # A tibble: 6 x 6
## timeStamp riderID orderStatus srcGeohash Date Hour
## <dttm> <chr> <chr> <chr> <chr> <int>
## 1 2017-11-18 13:18:06 58c420322eb67e3… confirmed s 2017-1… 13
## 2 2017-11-25 21:48:23 58c52e464a1df94… confirmed s 2017-1… 21
## 3 2017-10-01 14:19:09 58c6e4502776161… confirmed s 2017-1… 14
## 4 2017-10-08 14:51:13 58c6e4502776161… nodrivers s 2017-1… 14
## 5 2017-10-08 18:28:59 58c6e4502776161… confirmed s 2017-1… 18
## 6 2017-10-11 13:16:24 58c6e4502776161… confirmed s 2017-1… 13
These are the counts of “confirmed” and “nodrivers” orders, which represent the real demand from 2017-10-01 to 2017-12-01.
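The counting call is not echoed; a simple frequency table would produce the output below:
# tally orders by status
table(Beam_Me_data$orderStatus)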
##
## confirmed nodrivers
## 212224 14956
We calculate the demand for Beam_Me drivers using summarise() and n(), then complete the Hour category (some hours are missing from the data, meaning there were no orders at those hours, so we set their Demand to 0).
Beam_Me_demand <- Beam_Me_data %>%
arrange(Date, Hour) %>%
group_by(Date, Hour) %>%
# count the number of order by the group
summarise(Demand = n()) %>%
# stop the grouping
ungroup() %>%
# complete the Hour category
complete(Date, Hour) %>%
# replace the NA in Demand with 0, means no order at that hour
mutate(Demand = ifelse(is.na(Demand), 0, Demand))
# check
str(Beam_Me_demand)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1488 obs. of 3 variables:
## $ Date : chr "2017-10-01" "2017-10-01" "2017-10-01" "2017-10-01" ...
## $ Hour : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Demand: num 0 0 0 67 31 18 8 10 7 5 ...
Beam_Me_demand <- Beam_Me_demand %>%
mutate(timeStamp = paste(Date, Hour) %>% as.POSIXct(format = "%Y-%m-%d %H", tz = "Turkey"))
## # A tibble: 6 x 4
## Date Hour Demand timeStamp
## <chr> <int> <dbl> <dttm>
## 1 2017-10-01 0 0 2017-10-01 00:00:00
## 2 2017-10-01 1 0 2017-10-01 01:00:00
## 3 2017-10-01 2 0 2017-10-01 02:00:00
## 4 2017-10-01 3 67 2017-10-01 03:00:00
## 5 2017-10-01 4 31 2017-10-01 04:00:00
## 6 2017-10-01 5 18 2017-10-01 05:00:00
## # A tibble: 6 x 4
## Date Hour Demand timeStamp
## <chr> <int> <dbl> <dttm>
## 1 2017-12-01 18 498 2017-12-01 18:00:00
## 2 2017-12-01 19 728 2017-12-01 19:00:00
## 3 2017-12-01 20 462 2017-12-01 20:00:00
## 4 2017-12-01 21 654 2017-12-01 21:00:00
## 5 2017-12-01 22 568 2017-12-01 22:00:00
## 6 2017-12-01 23 376 2017-12-01 23:00:00
Cleaned Data Used for Time Series Analysis and Forecasting Future Demands
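The structure check below looks like glimpse() output, and Date now appears as <date> although it was still a character earlier, so a conversion like this (a sketch, since the chunk is hidden) must have happened in between:
# convert Date from character to Date class
Beam_Me_demand <- Beam_Me_demand %>%
  mutate(Date = as.Date(Date))
# final structure check
glimpse(Beam_Me_demand)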
## Observations: 1,488
## Variables: 4
## $ Date <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 2017-…
## $ Hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ Demand <dbl> 0, 0, 0, 67, 31, 18, 8, 10, 7, 5, 8, 23, 31, 19, 56, 4…
## $ timeStamp <dttm> 2017-10-01 00:00:00, 2017-10-01 01:00:00, 2017-10-01 …
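The range, start/end weekday, and missing-value checks below were also not echoed; a plausible sketch:
# time range covered by the series
range(Beam_Me_demand$timeStamp)
# weekday of the first and the last observation
weekdays(min(Beam_Me_demand$Date))
weekdays(max(Beam_Me_demand$Date))
# count missing values per column
colSums(is.na(Beam_Me_demand))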
## [1] "2017-10-01 00:00:00 +03" "2017-12-01 23:00:00 +03"
## [1] "Sunday"
## [1] "Friday"
## Date Hour Demand timeStamp
## 0 0 0 0
Time Series Model of Beam_Me Drivers Demand
Based on the service offered, Beam_Me’s demand should have seasonality. Let’s try to see it:
# convert to ts, with 24 hours natural period
Beam_Me_ts <- ts(Beam_Me_demand$Demand, start = min(Beam_Me_demand$Date), frequency = 24)
# seasonality check
Beam_Me_ts %>%
# sample data for clearer view - 4 week
head(24 * 7 * 4) %>%
# classical decomposition
decompose() %>%
autoplot() +
theme_calc()
The seasonality shows a strong additive pattern, but we also have to consider the trend. If you look closely, you can see that the trend sometimes shows a pattern too. This pattern in the trend might come from uncaptured extra seasonality at a higher natural period; in this case, weekly or even monthly seasonality. To handle this complex seasonality, we need to convert the data into an msts() object, which accepts multiple frequency settings. Let’s start by trying another, higher natural period: a weekly period.
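The chunk that creates the msts object is not shown; a minimal sketch, with daily and weekly periods (24 and 168 hours, matching the seasonal periods printed later):
# convert to msts with daily (24) and weekly (24 * 7 = 168) periods
Beam_Me_msts <- msts(Beam_Me_demand$Demand, seasonal.periods = c(24, 24 * 7))
Using the msts object, we can use mstl() to decompose the multiple seasonalities: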
# seasonality check
Beam_Me_msts %>%
# sample data for clearer view - 4 week
head(24 * 7 * 4) %>%
# classical decomposition
mstl() %>%
autoplot() +
theme_calc()
Now we can see a clearer trend and can confirm the daily and weekly seasonality in the data.
Another way to see this clearly is to plot the Demand distribution across hours and weekdays:
# number of order distribution between hours and weekdays
Beam_Me_demand %>%
mutate(
Weekdays = timeStamp %>% as.character() %>% as.Date() %>% weekdays()
) %>%
ggplot(aes(x = Hour, y = Demand, fill = factor(Weekdays))) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(
title = "Beam_Me's Order by Hours and Weekdays",
y = "Number of Demand",
x = "Hour of the Day",
fill = ""
) +
theme_calc() +
theme(legend.position = "bottom") +
scale_colour_calc()
From the distribution, we can confirm that not every weekday has the same hourly seasonality pattern.
Forecasting Beam_Me’s Demands per Hour
Based on the previous trend and seasonality analysis, we must consider forecast models that can accept multiple seasonalities. There are some options for this kind of model:
1. Use mstl() for seasonal adjustment, then use ets() to forecast the trend; this can be done simultaneously using stlm().
2. Use tbats() for a more complex combination of ets() and error modelling with auto.arima(), with a complex seasonality setting.
Another thing that needs to be considered is the zero values. With so many zero values, we cannot apply a log transformation to the data. One technique to solve this problem is to add a constant to the data, then readjust the resulting forecast by subtracting the same constant.
I will use a function to do the subtraction for forecast-class objects:
# adding a constant to our data
Beam_Me_ts <- Beam_Me_ts + 1
Beam_Me_msts <- Beam_Me_msts + 1
# forecast value constant readjustment
forecastConsReadj <- function(forecastObj, constant) {
forecastObj[["mean"]] <- forecastObj[["mean"]] - constant
forecastObj[["upper"]] <- forecastObj[["upper"]] - constant
forecastObj[["lower"]] <- forecastObj[["lower"]] - constant
return(forecastObj)
}
Let’s split the dataset into train and test sets. As one of the seasonal periods is 7 days and the series starts on a Sunday, we want the train set to end on a full cycle, so we cut it at the last Saturday, leaving the last 6 days as the test set.
# keep the original ts & msts dataset
tsOri <- Beam_Me_ts
mstsOri <- Beam_Me_msts
# test dataset
tsTest <- Beam_Me_ts %>% tail(24 * 6)
# don't forget to readjust
tsTest <- tsTest - 1
# train dataset (assign, so the models below are fit on the train period only)
Beam_Me_ts <- Beam_Me_ts %>% head(length(.) - 24 * 6)
Beam_Me_msts <- Beam_Me_msts %>% head(length(.) - 24 * 6)
# check
Beam_Me_ts
## Time Series:
## Start = c(17440, 1)
## End = c(17495, 24)
## Frequency = 24
## [1] 1 1 1 68 32 19 9 11 8 6 9 24 32
## [14] 20 57 44 75 76 66 106 59 72 34 38 48 47
## ... (1,344 hourly train observations in total; output truncated)
Beam_Me_msts
## Multi-Seasonal Time Series:
## Start: 1 1
## Seasonal Periods: 24 168
## Data:
## [1] 1 1 1 68 32 19 9 11 8 6 9 24 32
## ... (the same 1,344 train values as above; output truncated)
We will also include a basic ets() with log transformation as a benchmark model. To model the log transformation, we can set lambda = 0 in stlm(). But the tbats() function cannot accept an initial lambda = 0 value, so we need to take the log before passing the data into tbats(), then reset the lambda value manually:
# basic ets
etsMod <- Beam_Me_ts %>%
ets(lambda = 0) %>%
forecast(h = 24 * 6) %>%
forecastConsReadj(constant = 1)
# stlm
stlmMod <- Beam_Me_msts %>%
stlm(lambda = 0) %>%
forecast(h = 24 * 6) %>%
forecastConsReadj(constant = 1)
# tbats
tbatsMod <- Beam_Me_msts %>%
log() %>% # log transformation
tbats(use.box.cox = FALSE)
tbatsMod$lambda <- 0
tbatsMod %<>%
forecast(h = 24 * 6) %>% # forecast
forecastConsReadj(constant = 1)
## Warning in InvBoxCox(y.forecast, object$lambda, biasadj, list(level =
## level, : biasadj information not found, defaulting to FALSE.
Let’s check the accuracy of each forecast model on our test dataset:
# some forecast accuracy metrics
rbind(
# ets' accuracy
accuracy(etsMod$mean %>% as.vector() %>% round(), tsTest),
# stlm's accuracy
accuracy(stlmMod$mean %>% as.vector() %>% round(), tsTest),
# tbats' accuracy
accuracy(tbatsMod$mean %>% as.vector() %>% round(), tsTest)
) %>%
# tidy-up to a dataframe
broom::tidy() %>%
mutate(.rownames = c("ets", "stlm", "tbats")) %>%
rename(models = .rownames)
## Warning: 'tidy.matrix' is deprecated.
## See help("Deprecated")
## # A tibble: 3 x 8
## models ME RMSE MAE MPE MAPE ACF1 Theil.s.U
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets -117. 183. 124. -Inf Inf 0.872 0
## 2 stlm 23.6 80.0 43.2 -Inf Inf 0.751 0
## 3 tbats 3.36 76.0 44.7 -Inf Inf 0.728 0
It seems stlm() shows the best MAE result, with tbats() second. But if we analyze the forecasts in a graph:
# create a data frame of test data and all forecast data
accuracyData <- data.frame(
timeStamp = Beam_Me_demand$timeStamp %>% tail(24 * 6),
actual = tsTest %>% as.vector(),
etsForecast = etsMod$mean %>% as.vector(),
stlmForecast = stlmMod$mean %>% as.vector(),
tbatsForecast = tbatsMod$mean %>% as.vector()
)
# plot the data
accuracyData %>%
gather(key, value, actual, matches("Forecast")) %>%
mutate(key = key %>% str_replace_all("Forecast", "")) %>%
ggplot(aes(x = timeStamp, y = value, colour = factor(key))) +
geom_line() +
labs(
title = "Beam_Me's Order by Hours on Simulated Test Dataset",
y = "Number of Order",
x = "Date",
colour = ""
) +
theme_calc() +
scale_color_calc()
The forecast from tbats() is good because it follows the two seasonalities very well. Meanwhile, the forecast from ets() gives the highest error, deviating the most from the actual data. Last but not least, stlm() shows better seasonality modelling: even though the stlm() function follows the same notion as ets() and settles on a constant mean, instead of one constant it gives a constant mean for each of the two seasonalities, daily and weekly:
# the form of stlm forecast
Beam_Me_msts %>%
stlm(lambda = 0) %>%
forecast(h = 24 * 6) %>%
forecastConsReadj(constant = 1) %$%
mean %>%
plot()
Notes from our teaching assistant: This is why the stlm() results differ slightly from tbats(), which models multiple seasonalities using ARIMA with Fourier terms; we could say it smooths the seasonal patterns. In other words, a tbats() model tries to fit the multiple seasonalities by finding the best values to replicate the patterns, instead of using the mean values produced by the seasonal decomposition:
# tbats
tbatsMod <- Beam_Me_msts %>%
log() %>%
tbats(use.box.cox = FALSE)
tbatsMod$lambda <- 0
tbatsMod %>%
forecast(h = 24 * 6) %>%
forecastConsReadj(constant = 1) %$%
mean %>%
plot()
## Warning in InvBoxCox(y.forecast, object$lambda, biasadj, list(level =
## level, : biasadj information not found, defaulting to FALSE.
Notes from our teaching assistant: You can notice that the seasonalities are smoother than in the forecast from stlm(). But don’t be deceived by the smoothing property of tbats(): if we look back at the principle of “believing the mean”, a tbats() forecast will be more fluctuating in terms of the seasonal means; recall that it does the smoothing using ARIMA, which by nature seeks a better fit to the fluctuations rather than to the mean.
So we are faced with a common trade-off in forecasting problems: trusting a “safe” constant-mean model, ets(), or taking a “risk” on a more fluctuating model, tbats(). It turns out the answer is in between the two: stlm().
Key advice from our teaching assistant: First, refit the models on the full dataset. Recall that the actual evaluation window is the next 5 business days, so we need to forecast 7 days ahead and then subset to only the last 5 days (the first 2 days of the horizon fall on the weekend of Dec 2 and 3):
# basic ets
etsMod <- tsOri %>%
ets(lambda = 0) %>%
forecast(h = 24 * 7) %>%
forecastConsReadj(constant = 1)
# stlm
stlmMod <- mstsOri %>%
stlm(lambda = 0) %>%
forecast(h = 24 * 7) %>%
forecastConsReadj(constant = 1)
# tbats
tbatsMod <- mstsOri %>%
log() %>%
tbats(use.box.cox = FALSE)
tbatsMod$lambda <- 0
tbatsMod %<>%
forecast(h = 24 * 7) %>%
forecastConsReadj(constant = 1)
## Warning in InvBoxCox(y.forecast, object$lambda, biasadj, list(level =
## level, : biasadj information not found, defaulting to FALSE.
# gather all forecast data
Beam_Me_pred <- read.csv("submissionForecast.csv")
Beam_Me_pred <- cbind.data.frame(Beam_Me_pred,
etsForecast = etsMod$mean %>% as.vector() %>% tail(24 * 5) %>% round(),
stlmForecast = stlmMod$mean %>% as.vector() %>% head(24 * 7) %>% tail(24 * 5) %>% round(),
tbatsForecast = tbatsMod$mean %>% as.vector() %>% tail(24 * 5) %>% round()
)
Beam_Me_pred$nOrder <- NULL
## timeStamp etsForecast stlmForecast tbatsForecast
## 1 2017-12-04 00:00:00 271 126 126
## 2 2017-12-04 01:00:00 237 113 116
## 3 2017-12-04 02:00:00 189 98 109
## 4 2017-12-04 03:00:00 152 82 92
## 5 2017-12-04 04:00:00 88 34 59
## 6 2017-12-04 05:00:00 49 25 33
Based on the previous modelling, we will choose the model with the lowest MAE, the stlmForecast, as the demand forecast for the next 5 business days (Monday, 2017/12/04 to Friday, 2017/12/08).
Beam_Me_submit <- Beam_Me_pred[ , c("timeStamp", "stlmForecast")]
names(Beam_Me_submit) <- c("timeStamp", "nOrder")
Beam_Me_submit$timeStamp <- ymd_hms(Beam_Me_submit$timeStamp)
## timeStamp nOrder
## 1 2017-12-04 00:00:00 126
## 2 2017-12-04 01:00:00 113
## 3 2017-12-04 02:00:00 98
## 4 2017-12-04 03:00:00 82
## 5 2017-12-04 04:00:00 34
## 6 2017-12-04 05:00:00 25
## timeStamp nOrder
## 115 2017-12-08 18:00:00 396
## 116 2017-12-08 19:00:00 497
## 117 2017-12-08 20:00:00 538
## 118 2017-12-08 21:00:00 733
## 119 2017-12-08 22:00:00 626
## 120 2017-12-08 23:00:00 356
ggplot(data = Beam_Me_demand, aes(x = timeStamp, y = Demand)) +
geom_line(aes(color = "First line")) +
geom_line(data = Beam_Me_submit, aes(x = timeStamp, y = nOrder, color = "Second line")) +
xlab("Time") +
ylab("Demand") +
ggtitle("Actual and Forecast Beam_Me Drivers Demand") +
theme_calc() +
scale_color_manual(values = c("blue", "red"),
labels = c("Actual", "Forecast"))
The blue line is the actual Beam_Me driver demand from Oct 1, 2017 to Dec 1, 2017. The red line is the forecast for the next 5 working days, from Dec 4, 2017 to Dec 8, 2017. The gap between them is the weekend after Dec 1, 2017, which does not appear in the visualization.