First, let’s set up the environment and load the pacman package, which makes loading libraries easy.
In this project, we are trying to forecast the number of visitors of a business for the next 7 days. The data, gathered from a data science bootcamp, consists of receipts from many branches. Let’s take a glimpse of the data we have.
library(pacman) # load pacman so that p_load() is available
p_load(data.table) # for fast and easy processing of large data
p_load(dplyr) # for data manipulation
cs <- fread("data/cs_train.csv")
cs %>% glimpse()

## Observations: 7,063,969
## Variables: 11
## $ datetime <chr> "2017-12-01 00:47:29", "2017-12-01 00:47:29",...
## $ outlet <chr> "E_46", "E_46", "E_46", "E_46", "E_46", "E_46...
## $ receipt <chr> "A0017765", "A0017765", "A0017765", "A0017765...
## $ item <int> 10100101, 10200029, 10400016, 10500028, 10500...
## $ item_group <chr> "NOODLE_DISH", "RICE_DISH", "SIDE_DISH", "DRI...
## $ item_major_group <chr> "FOOD", "FOOD", "FOOD", "BEVERAGES", "BEVERAG...
## $ qty <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, ...
## $ price <dbl> 3.50, 6.96, 9.92, 4.74, 3.01, 5.48, 3.01, 4.7...
## $ total <dbl> 3.50, 6.96, 9.92, 4.74, 6.01, 5.48, 3.01, 4.7...
## $ payment <chr> "CASH", "CASH", "CASH", "CASH", "CASH", "CASH...
## $ sales_type <chr> "DINE_IN", "DINE_IN", "DINE_IN", "DINE_IN", "...
Here, we can see that we have 7,063,969 observations and 11 variables. Each observation is an item that has been ordered.
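Next, let’s look at the unique outlets. The exact call is not shown in the original; presumably it was something like the line below, which matches the output that follows.
unique(cs$outlet)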
## [1] "E_46" "E_11" "E_34" "E_30" "E_48" "E_44" "E_26" "E_63" "E_22" "C_05"
## [11] "E_42" "E_37" "A_05" "E_43" "A_21" "C_03" "A_10" "C_02" "A_12" "A_19"
## [21] "E_56" "A_08" "E_55" "A_22" "E_02" "D_01" "E_35" "A_02" "E_49" "A_07"
## [31] "E_16" "D_02" "E_08" "A_11" "B_01" "A_18" "E_60" "A_13" "A_20" "E_50"
## [41] "A_01" "E_21" "E_06" "E_45" "E_18" "E_09" "C_01" "A_09" "A_23" "E_27"
## [51] "E_62" "E_20" "E_04" "A_14" "E_32" "E_36" "A_06" "D_03" "E_51" "E_10"
## [61] "A_17" "E_40" "E_33" "E_47" "E_54" "E_14" "E_19" "E_24" "E_61" "E_05"
## [71] "E_57" "E_38" "E_28" "E_13" "E_41" "E_58" "E_07" "E_25" "E_12" "E_53"
## [81] "A_15" "E_17" "E_01" "E_52" "E_31" "E_23" "C_04" "E_29" "E_15" "E_39"
## [91] "A_16" "A_04" "E_03" "A_03" "E_59"
There are 95 unique outlets in the data, but here we only want to analyze the trends by outlet region, given by the first character of the outlet code: A, B, C, D, and E. There is also no variable for the number of visitors, which is what we want to predict, but we can do some feature engineering to derive a visitor variable from what is available, namely the unique receipt codes.
To predict the visitors in each outlet region for each hour, let’s keep just the datetime, outlet, and receipt variables and ignore the others. The datetime is still of character class and contains minutes and seconds, and the outlet contains branch-specific numbers. Here, we convert the class of datetime, round the datetime down to the hour, and take the first character of the outlet.
p_load(lubridate) # for transforming datetime
cs_dor <- cs %>%
distinct(datetime, outlet, receipt) %>%
mutate(
  datetime = ymd_hms(datetime),
  floor_date = floor_date(datetime, "hour"),
  outlet_group = substr(outlet, 1, 1)
) %>%
select(floor_date, outlet_group, receipt)
cs_dor %>% head()

## floor_date outlet_group receipt
## 1 2017-12-01 E A0017765
## 2 2017-12-01 E A0017766
## 3 2017-12-01 E A0017767
## 4 2017-12-01 E A0017768
## 5 2017-12-01 E A0017769
## 6 2017-12-01 E A0017770
Before we proceed further, let’s check whether there are any NAs in the data.
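The check itself is not shown in the original; presumably it was something like the following, which matches the output below.
colSums(is.na(cs_dor))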
## floor_date outlet_group receipt
## 0 0 0
Then, we use the count function to get the visitor variable for each hour and outlet group, counting the receipts that fall within each hour.
cs_vis <- cs_dor %>%
count(floor_date, outlet_group) %>%
select(datetime = floor_date, outlet = outlet_group, visitor = n)
cs_vis %>% head()

## # A tibble: 6 x 3
## datetime outlet visitor
## <dttm> <chr> <int>
## 1 2017-12-01 00:00:00 E 7
## 2 2017-12-01 01:00:00 E 20
## 3 2017-12-01 02:00:00 E 13
## 4 2017-12-01 03:00:00 E 13
## 5 2017-12-01 04:00:00 E 16
## 6 2017-12-01 05:00:00 E 5
Now, we have far fewer rows. Let’s visualize the data to get a good view of what the time series look like.
p_load(ggplot2)
p_load(viridis) # for better visualization
cs_vis %>%
arrange(datetime) %>%
group_by(outlet) %>%
ggplot(aes(datetime, visitor, col = outlet)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_classic(base_size = 12, base_family = "mono")

The plot clearly shows that outlet E has the most visitors among all outlets, followed by outlet A. In outlets A and E, we also see trends and seasonalities, probably higher on weekends, especially after the Christmas and New Year season. For that reason, we will only use the data from the second week of January onwards. Having seen the trend and seasonality, we can confirm that there are correlations among successive observations, so we can proceed further with time series analysis and forecasting.
Then, after seeing the plot of the overall data, let’s zoom in to a smaller range.
cs_vis[4000:4250, ] %>%
arrange(datetime) %>%
group_by(outlet) %>%
ggplot(aes(datetime, visitor, col = outlet)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_classic(base_size = 12, base_family = "mono")

In the zoomed-in view above, we see that there are also seasonalities within each day, but they look different across outlets. Because there is more than one seasonality in our data (daily and weekly), it’s better to use stlm or tbats as our forecasting methods. We will get a better understanding of the trends and seasonalities after creating the time series objects for each outlet.
Before creating the time series objects, note from the last table above that not every outlet has visitors in every hour. Here, we have to create a data frame that contains every hour for each outlet, merge it with the data frame we already have, and fill with 0 wherever no visitors were recorded.
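As an aside, this whole fill-in step can also be written compactly with tidyr’s complete function. The following is just a sketch of the equivalent one-call approach (cs_vis_full is a hypothetical name), not the code used in this project; the step-by-step version follows after it.
p_load(tidyr) # for complete()
# expand every datetime-outlet combination and fill gaps with 0
cs_vis_full <- cs_vis %>%
complete(
datetime = seq(min(datetime), max(datetime), by = "hour"),
outlet,
fill = list(visitor = 0)
)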
# filter by outlet
cs_vis_a <- cs_vis %>%
filter(outlet == "A")
cs_vis_b <- cs_vis %>%
filter(outlet == "B")
cs_vis_c <- cs_vis %>%
filter(outlet == "C")
cs_vis_d <- cs_vis %>%
filter(outlet == "D")
cs_vis_e <- cs_vis %>%
filter(outlet == "E")
# create a data frame contains every hour from the interval of the data
time_int <- data.frame(datetime = seq(
from = range(cs_vis$datetime)[1],
to = range(cs_vis$datetime)[2],
by = "hour"
))
# merge it with each outlets
cs_vis_a_m <- merge(x = time_int, y = cs_vis_a, by = "datetime", all = TRUE) %>%
select(datetime, visitor) %>%
replace(is.na(.), 0)
cs_vis_b_m <- merge(x = time_int, y = cs_vis_b, by = "datetime", all = TRUE) %>%
select(datetime, visitor) %>%
replace(is.na(.), 0)
cs_vis_c_m <- merge(x = time_int, y = cs_vis_c, by = "datetime", all = TRUE) %>%
select(datetime, visitor) %>%
replace(is.na(.), 0)
cs_vis_d_m <- merge(x = time_int, y = cs_vis_d, by = "datetime", all = TRUE) %>%
select(datetime, visitor) %>%
replace(is.na(.), 0)
cs_vis_e_m <- merge(x = time_int, y = cs_vis_e, by = "datetime", all = TRUE) %>%
select(datetime, visitor) %>%
replace(is.na(.), 0)

Then, after binding every outlet together, in order to use only the data from after the second week of January (8 January 2018), we need to find the starting index.
# create column of outlet names
cs_vis_a_m <- cs_vis_a_m %>% mutate(outlet = "A")
cs_vis_b_m <- cs_vis_b_m %>% mutate(outlet = "B")
cs_vis_c_m <- cs_vis_c_m %>% mutate(outlet = "C")
cs_vis_d_m <- cs_vis_d_m %>% mutate(outlet = "D")
cs_vis_e_m <- cs_vis_e_m %>% mutate(outlet = "E")
# bind all outlets
cs_bind <-
rbind(cs_vis_a_m, cs_vis_b_m, cs_vis_c_m, cs_vis_d_m, cs_vis_e_m) %>%
arrange(datetime, outlet)
# find index with datetime of 2018-01-08 00:00:00
which(cs_bind$datetime == "2018-01-08 00:00:00 UTC") + 7 * 5 # + 7 hours x 5 outlets to adjust for the time zone difference

## [1] 4561 4562 4563 4564 4565
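Inspecting the rows at those indices confirms the starting point. The display call is not shown in the original; presumably it was something like:
cs_bind[4561:4565, ]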
## datetime visitor outlet
## 4561 2018-01-08 5 A
## 4562 2018-01-08 0 B
## 4563 2018-01-08 0 C
## 4564 2018-01-08 0 D
## 4565 2018-01-08 193 E
The plots above look like a multiplicative time series, so let’s normalize the data. Here, I take the square root of the visitor column twice (i.e., the fourth root), once for each of the two seasonalities.
# filter by the datetime (after 8 January 2018)
cs_bind_slc <- cs_bind %>%
slice(4561:n())
# normalize data
cs_bind_scale <- cs_bind_slc %>%
mutate(visitor = sqrt(sqrt(visitor)))
cs_bind_scale %>% head(., 10)

## datetime visitor outlet
## 1 2018-01-08 00:00:00 1.495349 A
## 2 2018-01-08 00:00:00 0.000000 B
## 3 2018-01-08 00:00:00 0.000000 C
## 4 2018-01-08 00:00:00 0.000000 D
## 5 2018-01-08 00:00:00 3.727257 E
## 6 2018-01-08 01:00:00 0.000000 A
## 7 2018-01-08 01:00:00 0.000000 B
## 8 2018-01-08 01:00:00 0.000000 C
## 9 2018-01-08 01:00:00 0.000000 D
## 10 2018-01-08 01:00:00 2.932972 E
To test the models we are about to build, we hold out the last 7 days as a test set.
train_idx <- c(1:(nrow(cs_bind_slc) - (5 * 24 * 7)))
cs_train <- cs_bind_scale[train_idx, ]
cs_test <- cs_bind_scale[-train_idx, ]

To handle every outlet at once, we can use the sweep package, which helps us perform the forecast on each outlet together instead of running them one by one.
First, we have to group the data by outlet and bundle each group into a list-column using the nest function from the tidyr package.
p_load(tidyr) # for nesting
cs_nest <- cs_train %>%
group_by(outlet) %>%
nest(.key = "data.tbl")
cs_nest

## # A tibble: 5 x 2
## outlet data.tbl
## <chr> <list>
## 1 A <tibble [912 x 2]>
## 2 B <tibble [912 x 2]>
## 3 C <tibble [912 x 2]>
## 4 D <tibble [912 x 2]>
## 5 E <tibble [912 x 2]>
After that, we create the time series objects with the aid of the map function. Since the data is hourly, we use a frequency of 24.
# create ts data
p_load(purrr) # for mapping
p_load(timetk) # for accessing tk_ts fuction
cs_ts <- cs_nest %>%
mutate(data.ts = map(
.x = data.tbl,
.f = tk_ts,
select = -datetime,
start = 1,
freq = 24
))
cs_ts

## # A tibble: 5 x 3
## outlet data.tbl data.ts
## <chr> <list> <list>
## 1 A <tibble [912 x 2]> <S3: ts>
## 2 B <tibble [912 x 2]> <S3: ts>
## 3 C <tibble [912 x 2]> <S3: ts>
## 4 D <tibble [912 x 2]> <S3: ts>
## 5 E <tibble [912 x 2]> <S3: ts>
Then, for a multiseasonal time series, we have to convert the ts objects with the msts function, using two seasonal periods: 24 (daily) and 24 * 7 = 168 (weekly).
# create multiseasonal ts data
p_load(forecast)
cs_msts <- cs_ts %>%
mutate(data.msts = map(data.ts,
msts,
seasonal.periods = c(24, 24 * 7)
))
cs_msts

## # A tibble: 5 x 4
## outlet data.tbl data.ts data.msts
## <chr> <list> <list> <list>
## 1 A <tibble [912 x 2]> <S3: ts> <S3: msts>
## 2 B <tibble [912 x 2]> <S3: ts> <S3: msts>
## 3 C <tibble [912 x 2]> <S3: ts> <S3: msts>
## 4 D <tibble [912 x 2]> <S3: ts> <S3: msts>
## 5 E <tibble [912 x 2]> <S3: ts> <S3: msts>
Here, I want to compare two models: stlm and tbats.
First, let’s create the stlm model.
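The fitting call itself is missing from the original; based on the cs_stlm object used below and the identical step in the final pipeline at the end of this post, it was presumably:
# fit an stlm model on each multiseasonal series
cs_stlm <- cs_msts %>%
mutate(fit.stlm = map(data.msts, stlm))
cs_stlm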
## # A tibble: 5 x 5
## outlet data.tbl data.ts data.msts fit.stlm
## <chr> <list> <list> <list> <list>
## 1 A <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 2 B <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 3 C <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 4 D <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
## 5 E <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm>
Then, we forecast the next week, which is the period we need to predict.
# forecast the next 7 x 24 hours
cs_fcast_stlm <- cs_stlm %>%
mutate(fcast.stlm = map(fit.stlm, forecast, h = 24 * 7))
cs_fcast_stlm

## # A tibble: 5 x 6
## outlet data.tbl data.ts data.msts fit.stlm fcast.stlm
## <chr> <list> <list> <list> <list> <list>
## 1 A <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 2 B <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 3 C <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 4 D <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
## 5 E <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: stlm> <S3: forecast>
We can apply sw_sweep to get the forecast in a nice “tidy” data frame. Here, we also need to back-transform the normalized visitor values. Note that ^ is right-associative in R, so value^2^2 evaluates to value^(2^2) = value^4, which undoes the double square root.
# get the forecast result of the visitor for each outlet and datetime
p_load(sweep)
cs_fcast_tidy <- cs_fcast_stlm %>%
mutate(sweep = map(fcast.stlm, sw_sweep, fitted = FALSE, timetk_idx = TRUE)) %>%
unnest(sweep) %>%
mutate(
visitor = round(value^2^2),
lo.80 = round(lo.80^2^2),
lo.95 = round(lo.95^2^2),
hi.80 = round(hi.80^2^2),
hi.95 = round(hi.95^2^2)
) %>%
arrange(index, outlet)

## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.
## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.
## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.
## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.
## Warning in sw_sweep.forecast(.x[[i]], ...): Object has no timetk index.
## Using default index.
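The head of the tidied result is shown below; the display call itself is not in the original, but was presumably:
cs_fcast_tidy %>% head()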
## # A tibble: 6 x 9
## outlet index key value lo.80 lo.95 hi.80 hi.95 visitor
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 E 7.42 forecast 5.59 808 729 1165 282429536481 974
## 2 A 7.42 forecast 2.39 23 19 45 130321 33
## 3 B 7.42 forecast 0.000731 0 0 0 0 0
## 4 C 7.42 forecast 0.0626 0 0 0 0 0
## 5 D 7.42 forecast 1.43 3 2 6 16 4
## 6 E 7.42 forecast 4.74 404 358 623 16426010896 505
Then, let’s compare the actual and forecast values.
# create table to compare actual and forecast visitor
cs_fcast_tidy$visitor_actual <- cs_bind_slc$visitor
cs_fcast_vs_actual <- cs_fcast_tidy %>%
filter(key == "forecast") %>%
mutate(visitor_forecast = visitor) %>%
select(c(outlet, index, visitor_actual, visitor_forecast))
cs_fcast_vs_actual

## # A tibble: 840 x 4
## outlet index visitor_actual visitor_forecast
## <chr> <dbl> <dbl> <dbl>
## 1 A 6.43 26 6
## 2 B 6.43 0 0
## 3 C 6.43 0 0
## 4 D 6.43 0 0
## 5 E 6.43 228 211
## 6 A 6.43 4 0
## 7 B 6.43 0 0
## 8 C 6.43 0 0
## 9 D 6.43 0 0
## 10 E 6.43 80 86
## # ... with 830 more rows
Here, we get the mean of the actual and forecast visitors for each outlet.
# summarise the actual and forecast visitor for each outlet
cs_fcast_vs_actual %>%
group_by(outlet) %>%
summarise(mean(visitor_actual), mean(visitor_forecast))

## # A tibble: 5 x 3
## outlet `mean(visitor_actual)` `mean(visitor_forecast)`
## <chr> <dbl> <dbl>
## 1 A 194. 188.
## 2 B 14.7 14.9
## 3 C 31.2 32.3
## 4 D 21.6 22.5
## 5 E 649 652.
Let’s calculate the RMSE of the model with the yardstick package.
p_load(yardstick) # for calculating rmse
## For binary classification, the first factor level is assumed to be the event.
## Set the global option `yardstick.event_first` to `FALSE` to change this.
##
## Attaching package: 'yardstick'
## The following object is masked from 'package:forecast':
##
## accuracy
cs_fcast_rmse <- cs_fcast_vs_actual %>%
# group_by(outlet) %>%
rmse(truth = visitor_actual, estimate = visitor_forecast)
cs_fcast_rmse

## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 43.3
With the same process as for the stlm model, we proceed to forecast with tbats.
# forecast with tbats
cs_tbats <- cs_msts %>%
mutate(fit.tbats = map(data.msts,
tbats,
use.box.cox = FALSE
))
cs_tbats

## # A tibble: 5 x 5
## outlet data.tbl data.ts data.msts fit.tbats
## <chr> <list> <list> <list> <list>
## 1 A <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 2 B <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 3 C <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 4 D <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
## 5 E <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats>
# forecast the next 7 x 24 hours
cs_fcast_tbats <- cs_tbats %>%
mutate(fcast.tbats = map(fit.tbats, forecast, h = 24 * 7))
cs_fcast_tbats

## # A tibble: 5 x 6
## outlet data.tbl data.ts data.msts fit.tbats fcast.tbats
## <chr> <list> <list> <list> <list> <list>
## 1 A <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 2 B <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 3 C <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 4 D <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
## 5 E <tibble [912 x 2]> <S3: ts> <S3: msts> <S3: tbats> <S3: forecast>
# get the forecast result of the visitor for each outlet and datetime
p_load(sweep)
cs_fcast_tidy <- cs_fcast_tbats %>%
mutate(sweep = map(fcast.tbats, sw_sweep, fitted = FALSE)) %>%
unnest(sweep) %>%
arrange(index, outlet) %>%
mutate(
visitor = round(visitor^2^2),
lo.80 = round(lo.80^2^2),
lo.95 = round(lo.95^2^2),
hi.80 = round(hi.80^2^2),
hi.95 = round(hi.95^2^2)
)
cs_fcast_tidy %>% tail()

## # A tibble: 6 x 8
## outlet index key visitor lo.80 lo.95 hi.80 hi.95
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 E 7.42 forecast 978 789 701 1200 241474942801
## 2 A 7.42 forecast 40 25 19 60 130321
## 3 B 7.42 forecast 0 0 0 0 0
## 4 C 7.42 forecast 0 0 0 0 0
## 5 D 7.42 forecast 3 1 1 5 1
## 6 E 7.42 forecast 499 386 335 634 12594450625
# create table to compare actual and forecast visitor
cs_fcast_tidy$visitor_actual <- cs_bind_slc$visitor
cs_fcast_vs_actual <- cs_fcast_tidy %>%
filter(key == "forecast") %>%
mutate(visitor_forecast = visitor) %>%
select(c(outlet, index, visitor_actual, visitor_forecast))
cs_fcast_vs_actual

## # A tibble: 840 x 4
## outlet index visitor_actual visitor_forecast
## <chr> <dbl> <dbl> <dbl>
## 1 A 6.43 26 8
## 2 B 6.43 0 0
## 3 C 6.43 0 0
## 4 D 6.43 0 0
## 5 E 6.43 228 244
## 6 A 6.43 4 0
## 7 B 6.43 0 0
## 8 C 6.43 0 0
## 9 D 6.43 0 0
## 10 E 6.43 80 87
## # ... with 830 more rows
# summarise the actual and forecast visitor for each outlet
cs_fcast_vs_actual %>%
group_by(outlet) %>%
summarise(mean(visitor_actual), mean(visitor_forecast))

## # A tibble: 5 x 3
## outlet `mean(visitor_actual)` `mean(visitor_forecast)`
## <chr> <dbl> <dbl>
## 1 A 194. 184.
## 2 B 14.7 14.5
## 3 C 31.2 32.3
## 4 D 21.6 22.8
## 5 E 649 635.
# calculate the rmse of the model for each outlet
library(yardstick)
cs_fcast_rmse <- cs_fcast_vs_actual %>%
# group_by(outlet) %>%
rmse(truth = visitor_actual, estimate = visitor_forecast)
cs_fcast_rmse

## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 57.5
From the RMSE results, we know that the stlm model (43.3) performs better than the tbats model (57.5). So, let’s use stlm for forecasting the visitors for the 7 days after 22 February 2018. Here, we apply the same process, this time combining the train and test data we split previously.
# to forecast the visitor after 22 February 2018
cs_nest2 <- cs_bind_scale %>%
group_by(outlet) %>%
nest(.key = "data.tbl") %>%
mutate(data.ts = map(
.x = data.tbl,
.f = tk_ts,
select = -datetime,
start = 1,
freq = 24
)) %>%
mutate(data.msts = map(data.ts,
msts,
seasonal.periods = c(24, 24 * 7)
)) %>%
mutate(fit.stlm = map(
data.msts,
stlm
)) %>%
mutate(fcast.stlm = map(fit.stlm, forecast, h = 24 * 7)) %>%
mutate(sweep = map(fcast.stlm, sw_sweep, fitted = FALSE, timetk_idx = FALSE)) %>%
unnest(sweep) %>%
arrange(index, outlet) %>%
mutate(
visitor = round(value^2^2),
lo.80 = round(lo.80^2^2),
lo.95 = round(lo.95^2^2),
hi.80 = round(hi.80^2^2),
hi.95 = round(hi.95^2^2)
) %>%
filter(key == "forecast") %>%
mutate(datetime = index) %>%
select(datetime, outlet, visitor)
cs_nest2

## # A tibble: 840 x 3
## datetime outlet visitor
## <dbl> <chr> <dbl>
## 1 7.43 A 10
## 2 7.43 B 0
## 3 7.43 C 0
## 4 7.43 D 0
## 5 7.43 E 222
## 6 7.43 A 0
## 7 7.43 B 0
## 8 7.43 C 0
## 9 7.43 D 0
## 10 7.43 E 77
## # ... with 830 more rows
Let’s put the visitor values into the submission file.
# load the data for submission
cs_sms <- fread("data/cs_submission.csv")
# input the result of visitor from nest2
cs_sms$visitor <- cs_nest2$visitor
cs_sms

## datetime outlet visitor
## 1: 2018-02-22 00:00:00 A 10
## 2: 2018-02-22 00:00:00 B 0
## 3: 2018-02-22 00:00:00 C 0
## 4: 2018-02-22 00:00:00 D 0
## 5: 2018-02-22 00:00:00 E 222
## ---
## 836: 2018-02-28 23:00:00 A 35
## 837: 2018-02-28 23:00:00 B 0
## 838: 2018-02-28 23:00:00 C 0
## 839: 2018-02-28 23:00:00 D 1
## 840: 2018-02-28 23:00:00 E 475
Finally, let’s write the results for the submission.
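A minimal sketch of this last step, using data.table’s fwrite (the output file name here is a hypothetical choice, not from the original):
# write the final submission; the file name is an assumption
fwrite(cs_sms, "data/cs_submission_final.csv")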