1 Introduction

Most of us would have heard about the new buzz in the market i.e. Cryptocurrency. Many of us would have invested in their coins too. But, is investing money in such a volatile currency safe? How can we make sure that investing in these coins now would surely generate a healthy profit in the future? We can’t be sure but we can surely generate an approximate value based on the previous prices. Time series models is one way to predict them.

In this article, I will talk about a new time series technique that was developed by Facebook named Prophet. Dataset is the credit card spending from January 2016 to April 2018 and I will try to forecast spending amount of credit card in May 2018

2 Prophet for time series

I use a decomposable time series model with three main model components: trend, seasonality, and holidays. They are combined in the following equation:

\(y(t) = g(t) + s(t) + h(t) + Error(t)\)

g(t): piecewise linear or logistic growth curve for modelling non-periodic changes in time series
s(t): periodic changes (e.g. weekly/yearly seasonality)
h(t): effects of holidays (user provided) with irregular schedules
Error(t): error term accounts for any unusual changes not accommodated by the model

library(dplyr)
library(lubridate)
library(magrittr)
library(prophet)
library(ggplot2)
library(ggthemes)
rm(list = ls())
load(file = "D:/4. BA - R/1. TOPIC/49. CREDIT CARD SPENDING ANALYSIS/CREDIT CARD SPENDING ANALYSIS/Data/raw_data.Rda")
raw_data %>% head

##          id yearmonth target_id trans_date trans_curr trans_amount_qd
## 1 239912899    201802  10656943 2018-02-10        VND        675000.0
## 2 239881890    201802   7312783 2018-02-10        VND        347694.6
## 3 239881897    201802   8291980 2018-02-10        VND         60943.2
## 4 239881914    201802  10073216 2018-02-10        VND         62080.2
## 5 239886848    201802   5348936 2018-02-09        VND       4590000.0
## 6 239886858    201802   6237498 2018-02-09        VND       2300000.0
##           trans_details     merchant_id trans_city trans_country
## 1 LUCAS 96 BUI THI XUAN        12401770     HA NOI      Viet Nam
## 2               Uber BV 518898000200042     Vorden   NETHERLANDS
## 3               Uber BV 518898000200042     Vorden   NETHERLANDS
## 4               Uber BV 518898000200042     Vorden   NETHERLANDS
## 5      MAT TROI HA LONG        44800014 QUANG NINH      Viet Nam
## 6       NHA KHOA PHUONG        12001269     HA NOI      Viet Nam
##   bi_trans_type sic_code     cif             product_detail
## 1           POS     5137 1592062   02. TITANIUM LADY CREDIT
## 2 OnlinePayment     4121 1395827   02. TITANIUM LADY CREDIT
## 3 OnlinePayment     4121 1844092        04. PLATINUM CREDIT
## 4 OnlinePayment     4121  806327 03. TITANIUM STEPUP CREDIT
## 5           POS     7996  974699        04. PLATINUM CREDIT
## 6           POS     8021  765716   02. TITANIUM LADY CREDIT
##                                               sic_id    mcc_category
## 1 5137 Men/Women/Child Uniforms/Commercial Clothings FashionShopping
## 2                           4121 Taxicabs/limousines      Travelline
## 3                           4121 Taxicabs/limousines      Travelline
## 4                           4121 Taxicabs/limousines      Travelline
## 5                      7996 Amusement Parks,circuses   Entertainment
## 6                        8021 Dentists,orthodontists HospitalMedical
##   mcc_category_aia  open_date yearmonth_new       date month_transaction
## 1             <NA> 2017-11-07        201802 2018-02-10                 4
## 2             <NA> 2016-09-28        201802 2018-02-10                17
## 3             <NA> 2017-04-10        201802 2018-02-10                11
## 4             <NA> 2017-09-19        201802 2018-02-10                 5
## 5             <NA> 2015-08-13        201802 2018-02-09                31
## 6          Medical 2016-01-12        201802 2018-02-09                26

The dataset contains transactions of credit card users in many areas (called mcc categories), we need to forecast spending of every single mcc category before calculating the performance of the whole month

raw_data %>% 
  group_by(yearmonth_new) %>% 
  summarise(sum = round(sum(trans_amount_qd)/1e9,2)) %>% 
  ggplot(aes(x = as.factor(yearmonth_new), y = sum, group = 1)) +
  geom_line(col = "darkgreen") +
  geom_point(col = "darkred") + 
  geom_text(aes(label = sum), check_overlap = TRUE, vjust = -0.5) +
  theme_economist() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +  
  xlab("Timeline") +
  ylab("Credit card spending amount (bil)") +
  ggtitle("Credit card spending")

2.1 Step 1

Create credit card spending amount in each mcc category

load(file = "D:/4. BA - R/1. TOPIC/49. CREDIT CARD SPENDING ANALYSIS/CREDIT CARD SPENDING ANALYSIS/Data/Prophet_data.Rda")
for (i in unique(raw_data$mcc_category)) {
  df <- raw_data %>% 
    filter(mcc_category == i) %>% 
    group_by(date) %>% 
    summarise(spending = round(sum(trans_amount_qd)/1e9,5)) %>% 
    as.data.frame()
  names(df)[1] <- "ds"
  names(df)[2] <- "y"
  df$y <- log(df$y)
  df$mcc_category <- paste(i)
  name <- paste(i)
  df <- df %>% select(3,1,2)
  assign(name, df)
}

2.2 Step 2

Calculate performce of credit card in May of each mcc category using regression for logarithm value of credit card spending

list_of_dataframe <- list(FashionShopping, Travelline, Entertainment, HospitalMedical, 
                          Departmentstores, Utilities, Dailyitems, Others,
                          Cash, DinningDrinking, Education, Automotivemotorandbikesaleandservice,
                          Electronic, HouseholdFurniture, Jewellery, Advertisingservices, Insurance,
                          Fuel, TelephoneServiceEquipment, BeautySpa, Dutyfreetouristattraction, 
                          Bookstoresnewsstationeryprinting, Realestate)

final_result <- data.frame()

for ( i in c(1:23)) {
  df <- list_of_dataframe[[i]]
  m <- prophet(df)
  future <- make_future_dataframe(m, periods = 32)
  forecast <- predict(m, future)
  result <- forecast %>% 
    mutate(spending = exp(1)^yhat) %>% 
    tail(31) %>% 
    summarise(predict = sum(spending)) %>% 
    as.data.frame()
  result$mcc_category <- unique(list_of_dataframe[[i]]$mcc_category)
  final_result <- rbind(final_result, result)
}

## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 928045447
## initial log joint probability = -50.544
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 2061027555
## initial log joint probability = -28.8111
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 995665898
## initial log joint probability = -159.451
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1392713796
## initial log joint probability = -131.491
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 2017638617
## initial log joint probability = -44.4012
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1276467127
## initial log joint probability = -60.8832
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 776550200
## initial log joint probability = -12.9036
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 570197813
## initial log joint probability = -206.558
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 814904973
## initial log joint probability = -45.6499
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 211907482
## initial log joint probability = -19.6751
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1304807879
## initial log joint probability = -586.116
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 755878985
## initial log joint probability = -291.915
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 957690062
## initial log joint probability = -57.0713
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1539297794
## initial log joint probability = -171.261
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1115245831
## initial log joint probability = -46.3715
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1742770748
## initial log joint probability = -134.498
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1453274203
## initial log joint probability = -2466.52
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 35550647
## initial log joint probability = -404.799
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1163350206
## initial log joint probability = -141.945
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1190466966
## initial log joint probability = -198.14
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 783860143
## initial log joint probability = -1691.26
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1941332230
## initial log joint probability = -63945.3
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
## STAN OPTIMIZATION COMMAND (LBFGS)
## init = user
## save_iterations = 1
## init_alpha = 0.001
## tol_obj = 1e-012
## tol_grad = 1e-008
## tol_param = 1e-008
## tol_rel_obj = 10000
## tol_rel_grad = 1e+007
## history_size = 5
## seed = 1254228559
## initial log joint probability = -4255.61
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance

final_result %<>% select(2,1)

predict <- final_result %>% 
  select(1,2) %>% 
  mutate(year = 2018,
         month = "05" %>% as.factor()) %>% 
  select(1,3,4,2) %>% 
  rename(mcc_category = mcc_category, 
         year = year,
         month = month,
         spending = predict) %>% 
  as.data.frame()

mcc_spending <- raw_data %>% 
  mutate(year = year(date),
         month = month(date)) %>% 
  mutate(month = case_when(
    month == 1 ~ "01",
    month == 2 ~ "02",
    month == 3 ~ "03",
    month == 4 ~ "04",
    month == 5 ~ "05",
    month == 6 ~ "06",
    month == 7 ~ "07",
    month == 8 ~ "08",
    month == 9 ~ "09",
    month == 10 ~ "10",
    month == 11 ~ "11",
    month == 12 ~ "12"
  ) %>% as.factor) %>% 
  group_by(mcc_category, year, month) %>% 
  summarise(spending = round(sum(trans_amount_qd)/1e9,5)) %>% 
  as.data.frame()

mcc_spending <- rbind(mcc_spending, predict)
mcc_spending %>% 
  mutate(yearmonth = paste(year,month, sep = "") %>% as.factor()) %>% 
  group_by(yearmonth) %>% 
  summarise(spending = sum(spending)) %>% 
  ggplot(aes(x = as.factor(yearmonth), y = spending, group = 1)) +
  geom_line(col = "darkgreen") +
  geom_point(col = "darkred") + 
  geom_text(aes(label = round(spending,2)), check_overlap = TRUE, vjust = -0.5) +
  theme_economist() +
  xlab("Timeline") +
  ylab("Credit card spending amount (bil)") +
  ggtitle("Spending per transaction since 2017")

After using prophet for time series forecast, we can see that credit card spending in May 2018 reach more than 3000 billion VND

TIME SERIES WITH PROPHET

Nguyen Minh Thanh - thanhleo92@gmail.com

17 May 2018

1 Introduction

2 Prophet for time series

2.1 Step 1

2.2 Step 2