library("rjson")
library("DT")
library("data.table")
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.4
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
library("usmap")
library("ggplot2")
library("maptools")
## Loading required package: sp
## Checking rgeos availability: FALSE
## Please note that 'maptools' will be retired during 2023,
## plan transition at your earliest convenience;
## some functionality will be moved to 'sp'.
##      Note: when rgeos is not available, polygon geometry     computations in maptools depend on gpclib,
##      which has a restricted licence. It is disabled by default;
##      to enable gpclib, type gpclibPermit()
library("mapview")
library("devtools")
## Loading required package: usethis
library("vcd")
## Loading required package: grid
library("gridExtra")
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library("shiny")
## 
## Attaching package: 'shiny'
## 
## The following objects are masked from 'package:DT':
## 
##     dataTableOutput, renderDataTable
library("MASS")
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library("faraway")
library("rgdal")
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-6, (SVN revision 1201)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.4.2, released 2022/03/08
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/4.2/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE 
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: /Library/Frameworks/R.framework/Versions/4.2/Resources/library/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.6-0
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library("tigris")
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
library("sf")
## Linking to GEOS 3.10.2, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
library("car")
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following objects are masked from 'package:faraway':
## 
##     logit, vif
## 
## The following object is masked from 'package:maptools':
## 
##     pointLabel
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library("GGally")
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## 
## The following object is masked from 'package:faraway':
## 
##     happy
library("patchwork")
## 
## Attaching package: 'patchwork'
## 
## The following object is masked from 'package:MASS':
## 
##     area
library("tibble")
library("corrplot")
## corrplot 0.92 loaded
library("Hmisc")
## 
## Attaching package: 'Hmisc'
## 
## The following object is masked from 'package:maptools':
## 
##     label
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library("ggcorrplot")

Code from Previous EDAs

redemptions_2021 <- 
  fromJSON(file = "/Users/apple/Desktop/2023 Feb Transfer/redemptions_2021.json")

redemptions_2022 <- 
  fromJSON(file = "/Users/apple/Desktop/2023 Feb Transfer/redemptions_2022.json")

redemptions_2023_Jan <- 
  fromJSON(file = "/Users/apple/Desktop/2023 Feb Transfer/redemptions_2023_Jan.json")

revenue_view_2021 <-
  fromJSON(file = "/Users/apple/Desktop/2023 Feb Transfer/revenue_view_2021.json")

revenue_view_2022 <- 
  fromJSON(file = "/Users/apple/Desktop/2023 Feb Transfer/revenue_view_2022.json")

revenue_view_2023_Jan <- 
  fromJSON(file = "/Users/apple/Desktop/2023 Feb Transfer/revenue_view_2023_Jan.json")

redemptions_2021 <- lapply(redemptions_2021, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

df_redemptions_2021 <- as.data.frame(do.call("cbind", redemptions_2021)) |> t()

redemptions_2022 <- lapply(redemptions_2022, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

df_redemptions_2022 <- as.data.frame(do.call("cbind", redemptions_2022)) |> t()

redemptions_2023_Jan <- lapply(redemptions_2023_Jan, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

df_redemptions_2023_Jan <- as.data.frame(do.call("cbind", redemptions_2023_Jan)) |> t()

revenue_view_2021 <- lapply(revenue_view_2021, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

df_revenue_view_2021 <- as.data.frame(do.call("cbind", revenue_view_2021)) |> t()

revenue_view_2022 <- lapply(revenue_view_2022, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

df_revenue_view_2022 <- as.data.frame(do.call("cbind", revenue_view_2022)) |> t()

revenue_view_2023_Jan <- lapply(revenue_view_2023_Jan, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

df_revenue_view_2023_Jan <- as.data.frame(do.call("cbind", revenue_view_2023_Jan)) |> t()
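
The six flattening blocks above repeat the same null-to-NA conversion. As a purely illustrative sketch (json_to_df is a hypothetical helper, not part of the original pipeline), the pattern could be consolidated as follows:

# Sketch: the same null-to-NA flattening as above, wrapped in a helper
json_to_df <- function(json_list) {
  flattened <- lapply(json_list, function(x) {
    x[sapply(x, is.null)] <- NA   # keep NULL fields as NA so unlist() preserves the slot
    unlist(x)
  })
  as.data.frame(do.call("cbind", flattened)) |> t()
}

# Usage would mirror the blocks above, e.g.
# df_redemptions_2021 <- json_to_df(redemptions_2021)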

df_redemptions <- 
  rbind(df_redemptions_2021, df_redemptions_2022, df_redemptions_2023_Jan)

df_revenue_view <- 
  rbind(df_revenue_view_2021, df_revenue_view_2022, df_revenue_view_2023_Jan)

df_redemptions <- data.frame(df_redemptions)

df_redemptions <- transform(df_redemptions, 
                  total_redemption_amount = as.numeric(total_redemption_amount),
                  tip = as.numeric(tip),
                  Venue.Type...Detail = as.factor(Venue.Type...Detail), 
                  Check.Average = as.factor(Check.Average),
                  Service.Type = as.factor(Service.Type))

df_revenue_view <- data.frame(df_revenue_view)

df_revenue_view <- transform(df_revenue_view, 
                  amount_charged_in_usd = as.numeric(amount_charged_in_usd),
                  credit_given_in_usd = as.numeric(credit_given_in_usd), 
                  is_app_purchase = as.factor(is_app_purchase),
                  transaction_type = as.factor(transaction_type),
                  credit_type = as.factor(credit_type),
                  is_excess = as.factor(is_excess),
                  stripe_brand = as.factor(stripe_brand))

df_redemptions_by_user <- group_by(df_redemptions, user_id)

df_revenue_view_by_user <- group_by(df_revenue_view, user_id)

df_rev_redem <- full_join(df_revenue_view_by_user, df_redemptions_by_user, 
                          by = "user_id")
## Warning in full_join(df_revenue_view_by_user, df_redemptions_by_user, by = "user_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 7 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

Before reading: this EDA IV report specifically aims to improve on EDA III in the following aspects:

1. Design a new histogram whose Y-axis displays the number of customers and whose X-axis displays the frequency of purchases (i.e., 1, 2, …, 9+) for the one-project purchasers.

df_revenue_view_by_user_by_proj_id <- df_revenue_view_by_user |> 
  group_by(user_id, project_id, name) |> count() 

df_revenue_view_by_user_by_proj_id <- 
  mutate(df_revenue_view_by_user_by_proj_id, unique = 1)

tbl1 <- aggregate(df_revenue_view_by_user_by_proj_id$unique, 
                  list(df_revenue_view_by_user_by_proj_id$user_id), FUN = sum)

tbl2 <- aggregate(df_revenue_view_by_user_by_proj_id$n, 
                  list(df_revenue_view_by_user_by_proj_id$user_id), FUN = sum)

tbl1 <- rename_at(tbl1, 1, ~ "user_id")

tbl1 <- rename_at(tbl1, 2, ~ "number_diff_projects")

tbl2 <- rename_at(tbl2, 1, ~ "user_id")

tbl2 <- rename_at(tbl2, 2, ~ "number_total_projects")

df_revenue_view_by_user_diff_total <- full_join(tbl1, tbl2, by = "user_id")
df_revenue_view_by_user_diff_total_1tp <- 
  filter(df_revenue_view_by_user_diff_total, number_diff_projects == 1)

df_revenue_view_by_user_diff_dist_1tp <- 
  df_revenue_view_by_user_diff_total_1tp |> group_by(number_total_projects) |> 
  count()

df_revenue_view_by_user_diff_dist_1tp <- 
  rename_at(df_revenue_view_by_user_diff_dist_1tp, 1, 
            ~ "num_times_transc_this_proj")

df_revenue_view_by_user_diff_dist_1tp <- 
  rename_at(df_revenue_view_by_user_diff_dist_1tp, 2, ~ "count_users")
val <- 
  filter(df_revenue_view_by_user_diff_dist_1tp, num_times_transc_this_proj >= 9)$
  count_users |> sum()

df_revenue_view_by_user_diff_dist_1tp <- 
  df_revenue_view_by_user_diff_dist_1tp[-(9:36), ]
df_revenue_view_by_user_diff_dist_1tp <- transform(df_revenue_view_by_user_diff_dist_1tp, 
                    num_times_transc_this_proj = as.character(num_times_transc_this_proj))

df_revenue_view_by_user_diff_dist_1tp[nrow(df_revenue_view_by_user_diff_dist_1tp) + 1, ] <-
  c("9+", val)

df_revenue_view_by_user_diff_dist_1tp <- transform(df_revenue_view_by_user_diff_dist_1tp, 
                                                   count_users = as.integer(count_users))

df_revenue_view_by_user_diff_dist_1tp_no1 <- 
  df_revenue_view_by_user_diff_dist_1tp[-1, ]
ggplot(df_revenue_view_by_user_diff_dist_1tp, 
       aes(x = num_times_transc_this_proj, y = count_users)) + 
  geom_col(fill = "green3", color = "black") + 
  xlab("Number of times for transacting this one project") +
  ylab("Count of customers") + 
  ggtitle("Bar Plot of Customers Who Only Purchase 1 Type of Project")

ggplot(df_revenue_view_by_user_diff_dist_1tp_no1, 
       aes(x = num_times_transc_this_proj, y = count_users)) + 
  geom_col(fill = "green3", color = "black") + 
  xlab("Number of times for transacting this one project") +
  ylab("Count of customers") + 
  ggtitle("Bar Plot of Customers Who Only Purchase 1 Type of Project")

The bar plots above show, among customers who transact on only one project, the count of customers who transact once, twice, …, up to 9 or more times. The second bar plot excludes the one-and-gone customers, and we can still observe that quite a lot of customers are involved in exactly two transactions on one project. This sequence of purchasing behavior is important in guiding us towards forecasting customers' purchasing habits. Customers with three, four, or even five transactions on a single project are also not rare, and these data can be useful for studying other customers if we apply causal methods such as synthetic control (SCM), difference-in-differences (DID), or propensity score matching (PSM).

val2 <- 
  filter(df_revenue_view_by_user_diff_dist_1tp, num_times_transc_this_proj >= 4)$
  count_users |> sum()

df_revenue_view_by_user_diff_dist_1tp_3 <- 
  df_revenue_view_by_user_diff_dist_1tp[-(4:9), ]

df_revenue_view_by_user_diff_dist_1tp_3 <- transform(df_revenue_view_by_user_diff_dist_1tp_3, 
                    num_times_transc_this_proj = as.character(num_times_transc_this_proj))

df_revenue_view_by_user_diff_dist_1tp_3[nrow(df_revenue_view_by_user_diff_dist_1tp_3) + 1, ] <-
  c("4+", val2)

df_revenue_view_by_user_diff_dist_1tp_3 <- transform(df_revenue_view_by_user_diff_dist_1tp_3, 
                                                   count_users = as.integer(count_users))
ggplot(df_revenue_view_by_user_diff_dist_1tp_3, 
       aes(x = "", y = count_users, fill = num_times_transc_this_proj)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(100 * count_users / sum(count_users), 2), "%")), 
            position = position_stack(vjust = 0.5)) +
  labs(x = NULL, y = NULL) +
  theme_classic() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank()) +
  scale_fill_brewer(palette = "Reds") +
  ggtitle("Pie Plot of Customers Who Only Purchase 1 Type of Project")

The pie plot above shows that, among all customers who purchase only one type of project, more than three-quarters purchase only once, which is not very helpful for building a predictive model. However, we still have around 17% of customers who purchase twice and around 4% who purchase three or more times. Given the total sample behind this pie chart, the remaining roughly one-fifth of customers is sizable enough for predictive purposes.
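
As a quick numeric check of the percentages quoted above, the same shares can be computed directly from the underlying counts (a small verification sketch, not part of the original pipeline):

# Share of one-project customers by number of transactions (1, 2, 3, 4+)
with(df_revenue_view_by_user_diff_dist_1tp_3,
     data.frame(num_times_transc_this_proj,
                pct = round(100 * count_users / sum(count_users), 2)))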

2. Redo the whole process excluding the universal UID projects. I will comment on any features that change once the UID equal to -999 is removed. You may briefly skip some details since the logic here is almost the same as in EDA III. Some new findings are emphasized, and others that are not very informative are removed.

df_revenue_view_by_user <- 
  mutate(df_revenue_view_by_user, last_char = substr(UID, nchar(UID), nchar(UID)))

df_revenue_view_by_user_g <- filter(df_revenue_view_by_user, last_char == "g")

df_rev_redem_g <- 
  full_join(df_revenue_view_by_user_g, df_redemptions_by_user, by = "user_id")
## Warning in full_join(df_revenue_view_by_user_g, df_redemptions_by_user, : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 7 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
dim(df_rev_redem_g)
## [1] 585710     33
qqnorm(df_rev_redem_g$total_redemption_amount, pch = 1, frame = FALSE)
qqline(df_rev_redem_g$total_redemption_amount, col = "steelblue", lwd = 2)

library("patchwork")
ggplot(df_rev_redem_g, 
       aes(sample = total_redemption_amount, color = Service.Type)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 20768 rows containing non-finite values (`stat_qq()`).
## Warning: Removed 20768 rows containing non-finite values (`stat_qq_line()`).

ggplot(df_rev_redem_g, 
       aes(sample = total_redemption_amount, color = Check.Average)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 20768 rows containing non-finite values (`stat_qq()`).
## Removed 20768 rows containing non-finite values (`stat_qq_line()`).

ggplot(df_rev_redem_g, 
       aes(sample = total_redemption_amount, color = Venue.Type...Detail)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 20768 rows containing non-finite values (`stat_qq()`).
## Removed 20768 rows containing non-finite values (`stat_qq_line()`).

The exclusion of UIDs starting with i, the universal type of transactions, in fact makes the QQ-plots even harder to interpret. The right tail of the distribution no longer roughly follows an exponential shape as before - it now looks more like a step function. We can also see that the fitted line for the expensive projects is now positive, which means that by excluding the universal projects the pattern seen before becomes more obvious: the normality assumption for the high check-average or fine-dining projects appears to fail.

redem_A <- hist(df_rev_redem_g$total_redemption_amount, breaks = 50, 
              col = 'skyblue2', ylim = c(0, 550000),
          xlab = "Amount of total redemption ($)", 
          main = "Histogram of amount of total redemption")
text(redem_A$mids,redem_A$counts,labels = redem_A$counts, adj=c(0.5, -0.5))

redem_sub <- hist(df_rev_redem_g$total_redemption_amount, breaks = 1800, 
              col = 'skyblue2', ylim = c(0, 135000), xlim = c(0, 500),
          xlab = "Amount of total redemption ($)", 
          main = "Histogram of amount of total redemption (up to $500)")
text(redem_sub$mids,redem_sub$counts,labels = redem_sub$counts, adj=c(0.5, -0.5))

The histogram of the amount of total redemption has a heavy right tail. However, the frequency of the higher redemption amounts is quite low compared to the first few bins. Hence, we restrict the X-axis to total redemption amounts below 500 dollars and shrink each bin to a width of only 20 dollars, which makes the new distribution above much easier to read. There are 123,303 customers who redeem less than 20 dollars; the other bins can be interpreted in the same way. The trend follows a gradually decaying curve.
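
Since breaks in hist() is only a suggestion, an equivalent sketch that forces exact $20-wide bins on the truncated range (assuming redemption amounts are non-negative) would be:

# Restrict to redemptions under $500 and force exact $20-wide bins
redem_under_500 <- df_rev_redem_g$total_redemption_amount[
  !is.na(df_rev_redem_g$total_redemption_amount) & 
    df_rev_redem_g$total_redemption_amount < 500]
hist(redem_under_500, breaks = seq(0, 500, by = 20), col = "skyblue2",
     xlab = "Amount of total redemption ($)", 
     main = "Histogram of amount of total redemption (up to $500)")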

For each customer, compute the average value per transaction / redemption in dollar amounts

unique(df_rev_redem_g$user_id) |> length()
## [1] 91389

The joined data contain 91,389 unique customers involved with the g projects, though some of them are not present in both df_revenue_view and df_redemptions. Even though df_rev_redem_g is already grouped by user_id, we group once more and aggregate to find each customer's mean transaction and redemption amounts.

df_rev_redem_g_avg <- df_rev_redem_g |> group_by(user_id) |> 
  summarize_at(vars(amount_charged_in_usd, total_redemption_amount), list(avg = mean))

df_rev_redem_g_avg
cor(df_rev_redem_g_avg$amount_charged_in_usd_avg, 
    df_rev_redem_g_avg$total_redemption_amount_avg, 
    use="pairwise.complete.obs")
## [1] 0.4505558
ggpairs(df_rev_redem_g_avg[, 2:3])
## Warning: Removed 16501 rows containing non-finite values (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 34801 rows containing missing values
## Warning: Removed 34801 rows containing missing values (`geom_point()`).
## Warning: Removed 18300 rows containing non-finite values (`stat_density()`).

We have formed the tabular averages for both the transaction and redemption amounts. We are interested in their correlation after excluding all NA entries. The correlation between the two mean values is approximately 0.451, neither too high nor too low, and slightly larger than 0.44, the correlation coefficient before we removed the universal projects. From the pairs plot, we can observe that the density plots on the diagonal are almost identical in shape. The lower-left corner is the scatter plot between the two average values, and the upper-right corner displays the Pearson correlation coefficient between them.
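
Because only the two average columns enter the correlation, use = "pairwise.complete.obs" is equivalent here to dropping the incomplete rows first. A small cross-check sketch:

# Same correlation after explicitly removing rows with an NA in either average
complete_avg <- na.omit(df_rev_redem_g_avg[, c("amount_charged_in_usd_avg", 
                                               "total_redemption_amount_avg")])
cor(complete_avg$amount_charged_in_usd_avg, 
    complete_avg$total_redemption_amount_avg)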

For each customer, find the recency of the most recent transaction, relative to the end date of the dataset

lastD1_g <- max(as.Date(df_rev_redem_g$created_at.x), na.rm = TRUE)
lastD2_g <- max(as.Date(df_rev_redem_g$created_at.y), na.rm = TRUE)
lastD_g <- max(lastD1_g, lastD2_g)
lastD_g
## [1] "2023-01-31"

The end date of data recording is 2023-01-31. Now we will compute each customer's recency date for both transaction and redemption. The approach is similar to the one above.

df_rev_redem_recency_g <- df_rev_redem_g |> group_by(user_id) |> 
  summarize_at(vars(created_at.x, created_at.y), list(recency = max))

df_rev_redem_recency_g <- rename_at(df_rev_redem_recency_g, 
                        2, ~ "transaction_recency")

df_rev_redem_recency_g <- rename_at(df_rev_redem_recency_g, 
                        3, ~ "redemption_recency")

df_rev_redem_recency_g <- mutate(df_rev_redem_recency_g, 
transaction_recency_diff = lastD_g - as.Date(df_rev_redem_recency_g$transaction_recency), 
redemption_recency_diff = lastD_g - as.Date(df_rev_redem_recency_g$redemption_recency))

df_rev_redem_recency_g <- transform(df_rev_redem_recency_g, 
              transaction_recency_diff = as.numeric(transaction_recency_diff),
              redemption_recency_diff = as.numeric(redemption_recency_diff))

df_rev_redem_recency_g |> 
  arrange(desc(as.numeric(transaction_recency_diff)))
filter(df_rev_redem_recency_g, transaction_recency_diff == 760) |> nrow()
## [1] 40
df_rev_redem_recency_g |> 
  arrange(as.numeric(transaction_recency_diff)) |> head()
filter(df_rev_redem_recency_g, transaction_recency_diff == 0) |> nrow()
## [1] 129
df_rev_redem_recency_g |> 
  arrange(desc(as.numeric(redemption_recency_diff))) |> head()
filter(df_rev_redem_recency_g, redemption_recency_diff == 760) |> nrow()
## [1] 22
df_rev_redem_recency_g |> 
  arrange(as.numeric(redemption_recency_diff)) |> head()
filter(df_rev_redem_recency_g, redemption_recency_diff == 0) |> nrow()
## [1] 446
as.Date("2023-01-31") - 760
## [1] "2021-01-01"

From the table output above, we find, somewhat surprisingly, that recency in both the revenue and redemption data has a maximum value of 760 days (about 2.08 years before the last date in the dataset) and a minimum value of 0 days. This implies that about two years ago, on Jan 1, 2021 (the first date of record in the dataset), 40 customers made their last purchase via our application, whereas 22 customers made their last redemption. On Jan 31, 2023, the last date covered by the dataset, 129 customers made a transaction and 446 customers made a redemption on that very day. One can observe the decrease in these numbers compared to all customers with UID starting with i. It may be of interest to study the distribution of recency for both transaction and redemption, since this purchasing and redeeming behavior is a potentially significant factor in the CLV model.

ggpairs(df_rev_redem_recency_g[, 4:5])
## Warning: Removed 16501 rows containing non-finite values (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 34801 rows containing missing values
## Warning: Removed 34801 rows containing missing values (`geom_point()`).
## Warning: Removed 18300 rows containing non-finite values (`stat_density()`).

Again, on the diagonal we observe the frequency plots (shown as densities). The shapes for transaction and redemption are very similar. Each has two relatively obvious peaks: for transactions, a global peak at around 50 days followed by a smaller peak at around 250 days; for redemptions, a global peak at around 40 days followed by another fairly clear peak at around 250 days. In fact, the peak-trough pattern over time is also quite similar. We can see from the upper-right corner that the Pearson correlation coefficient is quite high, about 0.853 (it decreases a bit if we include the universal projects). The scatter plot is hard to read because of overplotting, but we can still conclude that the two are highly associated. However, this raises a question: why do we observe such a pattern over time? Is there a particular event on those peak dates that pushed customers to stop purchasing and redeeming, or that made them more inclined to purchase and redeem via our platform? We cannot tell from the dataset itself; instead, we would ask the firm to share its record of historical events and check whether they match the dates of our peaks and troughs. This is an insight that could be deliberately used to increase CLV. Although a retention rate is not defined here, our plots offer a view of a quasi-retention rate. We do not know whether some customers in the dataset stop purchasing or redeeming for good, but once we find predictive variables that can accurately judge whether a customer has ultimately stopped using our platform or will resume purchasing or redeeming in the near or far future, we have essentially determined their "retention rate", an indispensable factor in a CLV model under a subscription setting.
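
As a rough illustration of the quasi-retention idea, one could flag customers whose last transaction is older than some cutoff as likely churned. The 180-day cutoff below is an arbitrary assumption chosen only for this sketch, not a definition used elsewhere in the analysis:

# Hypothetical churn flag based on transaction recency (180-day cutoff is an assumption)
df_rev_redem_recency_g |> 
  mutate(likely_churned = transaction_recency_diff > 180) |> 
  count(likely_churned)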

trans_p_g <- ggplot(df_rev_redem_recency_g, 
                  aes(x = transaction_recency_diff, y = after_stat(density))) + 
  geom_histogram(color = "grey60", fill = "#FFCC99") + 
  geom_density(color = "red2") + 
  xlab("Recency difference between customer's last day of transaction and Jan 31, 2023")

redem_p_g <- ggplot(df_rev_redem_recency_g, 
                  aes(x = redemption_recency_diff, y = after_stat(density))) + 
  geom_histogram(color = "grey", fill = "steelblue") + 
  geom_density(color = "purple3") + 
  xlab("Recency difference between customer's last day of redemption and Jan 31, 2023")

trans_p_g + redem_p_g
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 16501 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 16501 rows containing non-finite values (`stat_density()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 18300 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 18300 rows containing non-finite values (`stat_density()`).

After removing customers who only have universal projects, the peaks in the recency difference between a customer's last day of transaction and Jan 31, 2023 are much more obvious than before - the histogram detects around five peaks. The histograms and density plots are displayed side by side above. The pattern is almost the same for both: a lot of customers stopped purchasing or redeeming about 50 days before the end date. This is a bad signal, since we would expect the firm to attract more and more customers purchasing and redeeming through our platform. Why do we see such a sharp decline starting about 50 days before Jan 31, 2023 (around Dec 10, 2022)? Why would usage drop during the Christmas and New Year holidays? And why does the peak last only a short period instead of a long, sustained stage? All of these questions deserve further investigation.
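
The date behind that roughly 50-day peak can be checked directly against the end date of the data:

# About 50 days before the last recorded date, i.e. mid-December 2022
as.Date("2023-01-31") - 50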

For each customer, compute the average weekly frequency of transaction and redemption from the date of a customer’s own first transaction to the end date of the dataset

This question is very similar to the first task we just performed - each customer's overall average over the whole period. Now the prompt asks for each customer's weekly average. The task sounds hard to handle since each customer has a different usage lifetime on our platform (my proposed definition: \(UL := \Delta T = T_{end} - T_{start}\), where \(\Delta T\) is the difference in days between the starting and ending dates, and \(UL\) is short for usage lifetime). However, further thought suggests another angle: borrowing the idea from our recency table, we can create a new column that calculates \(UL\) for each customer. The average weekly transaction and redemption amounts can then be found very easily, and so can the frequency of both.
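
Concretely, for a customer with \(n\) transactions over a usage lifetime of \(\Delta T\) days, the average weekly transaction frequency computed below is \(f_w = \frac{n}{\Delta T / 7}\), and analogously for redemptions.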

df_rev_redem_UL_g <- df_rev_redem_g |> group_by(user_id) |> 
  summarize_at(vars(created_at.x, created_at.y), 
               list(end = max, start = min))

df_rev_redem_UL_g <- rename_at(df_rev_redem_UL_g, 
                        2, ~ "transaction_end")

df_rev_redem_UL_g <- rename_at(df_rev_redem_UL_g, 
                        3, ~ "redemption_end")

df_rev_redem_UL_g <- rename_at(df_rev_redem_UL_g, 
                        4, ~ "transaction_start")

df_rev_redem_UL_g <- rename_at(df_rev_redem_UL_g, 
                        5, ~ "redemption_start")

df_rev_redem_UL_g <- mutate(df_rev_redem_UL_g, 
UL_trans = as.Date(df_rev_redem_UL_g$transaction_end) - 
  as.Date(df_rev_redem_UL_g$transaction_start), 
UL_redem = as.Date(df_rev_redem_UL_g$redemption_end) - 
  as.Date(df_rev_redem_UL_g$redemption_start))

df_rev_redem_UL_g <- transform(df_rev_redem_UL_g, 
              UL_trans = as.numeric(UL_trans),
              UL_redem = as.numeric(UL_redem))

df_rev_redem_UL_g <- 
  mutate(df_rev_redem_UL_g, UL_trans = UL_trans/7, UL_redem = UL_redem/7)

head(df_rev_redem_UL_g)

The code above just creates four new columns: transaction_end, redemption_end, transaction_start, and redemption_start. Then, based on our definition of usage lifetime (abbr. UL), we calculate UL_trans and UL_redem from the time differences between transaction_end and transaction_start and between redemption_end and redemption_start. Notice that the last step converts the difference from days into weeks.

df_rev_freq_g <- df_revenue_view_by_user_g |> group_by(user_id) |> count()

df_rev_freq_g <- rename_at(df_rev_freq_g, 2, ~ "total_transaction")

df_redem_freq_g <- df_redemptions_by_user |> group_by(user_id) |> count()

df_redem_freq_g <- rename_at(df_redem_freq_g, 2, ~ "total_redemption")

df_rev_redem_freq_g <- full_join(df_rev_freq_g, df_redem_freq_g, by = "user_id")

head(df_rev_redem_freq_g)

We have just created another data frame with total_transaction and total_redemption for each user_id. Since both data frames (df_rev_redem_UL_g and df_rev_redem_freq_g) share the user_id variable, we full-join them.

df_rev_redem_weekly_g <- full_join(df_rev_redem_UL_g, df_rev_redem_freq_g, by = "user_id")

df_rev_redem_weekly_eff_g <- df_rev_redem_weekly_g[, c(1, 6:9)]

df_rev_redem_weekly_eff_g <- 
  mutate(df_rev_redem_weekly_eff_g, weekly_trans = total_transaction/UL_trans, 
                                weekly_redem = total_redemption/UL_redem)

df_rev_redem_weekly_eff_g[sapply(df_rev_redem_weekly_eff_g, is.infinite)] <- NA

df_rev_redem_weekly_eff_sub_g <- df_rev_redem_weekly_eff_g[, c(1, 6:7)] |> 
  filter(!is.na(weekly_trans) | !is.na(weekly_redem))

head(df_rev_redem_weekly_eff_sub_g)

After joining the two data frames, we calculate weekly_trans and weekly_redem. A tricky part is that when UL_trans == 0 or UL_redem == 0, the division makes weekly_trans or weekly_redem infinite (Inf)! To avoid this, we replace all Inf entries with NA. We also exclude the cases where both weekly_trans and weekly_redem are NA - such rows add nothing to the analysis, so we toss them out. Still, some customers have a non-NA entry for only one of weekly_trans and weekly_redem, so we will split the data frame into two parts: one with user_id and weekly_trans, the other with user_id and weekly_redem. Some large entries may simply reflect a small number of transactions or redemptions over a very short period. One example is user_id == 100003 - this customer purchased only twice, on Apr 22, 2022 and Apr 23, 2022, which yields a weekly transaction rate of 14 per week, calculated as \(\frac{2 \,\, \text{units}}{1/7 \,\, \text{weeks}} = 14 \,\, \text{units/week}\). We should be especially careful when a customer makes many transactions or redemptions within a very short time, as this will greatly inflate weekly_trans and weekly_redem.
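
The user_id == 100003 example, and the Inf case that motivates the NA replacement, can be reproduced numerically:

# Two purchases one day apart: UL = 1 day = 1/7 week, so 2 / (1/7) = 14 per week
2 / (as.numeric(as.Date("2022-04-23") - as.Date("2022-04-22")) / 7)

# A customer whose only purchases fall on the same day has UL = 0, which gives Inf;
# those entries are the ones replaced with NA above
2 / (0 / 7)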

df_rev_redem_weekly_trans_g <- df_rev_redem_weekly_eff_sub_g[, c(1, 2)]

df_rev_redem_weekly_redem_g <- df_rev_redem_weekly_eff_sub_g[, c(1, 3)]

df_rev_redem_weekly_trans_g <-
  df_rev_redem_weekly_trans_g |> 
  filter(!is.na(weekly_trans)) |> 
  arrange(desc(weekly_trans))

df_rev_redem_weekly_redem_g <-
  df_rev_redem_weekly_redem_g |>
  filter(!is.na(weekly_redem)) |> 
  arrange(desc(weekly_redem))

head(df_rev_redem_weekly_trans_g)
head(df_rev_redem_weekly_redem_g)

The split data frames are df_rev_redem_weekly_trans_g and df_rev_redem_weekly_redem_g - we again drop the NA entries and sort the weekly transaction and redemption counts in decreasing order. We find that the customer with user_id == 52749 averaged 84 transactions per week and the customer with user_id == 206397 averaged 70 redemptions per week! Let us visualize them.

ggpairs(df_rev_redem_weekly_eff_sub_g[, 2:3])
## Warning: Removed 13465 rows containing non-finite values (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 19872 rows containing missing values
## Warning: Removed 19872 rows containing missing values (`geom_point()`).
## Warning: Removed 6407 rows containing non-finite values (`stat_density()`).

It is interesting to see that the weekly transaction and redemption counts have similar density shapes on the diagonal. Recalling the diagonal plots for the average transaction and redemption amounts, all four distributions are in fact quite similar in shape. This is a useful finding, as it may indicate a strong correlation among them. The Pearson correlation coefficient increases a bit (from 0.451 to 0.506) after we remove customers with only universal projects. Let us take a closer look at the correlation plot below.

df_corr_avg_weekly_times_g <- 
  full_join(df_rev_redem_g_avg[, 1:3], 
            df_rev_redem_weekly_eff_g[, c(1, 6:7)], 
            by = "user_id")

df_corr_avg_weekly_times_nNA_g <- na.omit(df_corr_avg_weekly_times_g)

cor(df_corr_avg_weekly_times_nNA_g[, 2:5]) |> ggcorrplot()

The correlation plot shows a strong correlation between weekly redemption and weekly transaction counts, and a strong correlation between average redemption and average transaction amounts, which agrees with our previous analysis. The weekly purchasing/redeeming counts are only weakly correlated with the average transaction and redemption amounts - so their similarly shaped distributions do not add much further information.
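
To read exact values rather than colors off the plot, the rounded correlation matrix behind it can also be printed (a small sketch over the same data):

# Numeric version of the correlation plot above, rounded to two decimals
round(cor(df_corr_avg_weekly_times_nNA_g[, 2:5]), 2)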