Abstract

This project presents a framework for modeling user engagement states as a Markov chain, using it to predict future engagement states and to calculate a predicted Discounted Lifetime Value for each engagement group. This notebook demonstrates, step by step, how I applied this analysis to ActBlue donations data, so the reader can see exactly how one might apply it to their own data. I apply it to the political/non-profit realm, analyzing patterns of donations to Democratic election campaigns through ActBlue (an online donations platform), based on data of donations to Bernie Sanders' campaign during April 2015 - February 2016.

Data retrieved from the ActBlue Archives. Initial data retrieval and preprocessing before importing into this notebook were done in this separate notebook on GitHub. View and download this notebook and the full project on my GitHub: Github.com/tomereldor/Capstone.


Introduction

This analysis will show you how to understand your users' activity better and to know which types of users are worth more to your business. This tutorial will teach you how to take data on user engagement or contributions over time (for repeat customers) and model it using Markov states to calculate the "Discounted Lifetime Value" (DLV; or CLV, Customer Lifetime Value) of your users. The primary outcome of this framework is predicting your users' Discounted Lifetime Value, in a highly computationally efficient and straightforward manner, based on their level of engagement. The DLV tells you the "Net Present Value" of the contribution of each user's state of engagement, meaning how much money they are "worth" to your organization in present-value dollars. That "worth," or DLV, should inform how much you can afford to spend on users at each level of activity, or how much you would be willing to pay to nudge them to change their level of activity. Beyond that, modeling your users with a Markov model can teach you how users engage with your product/service, what their chances are of changing their level of activity given their current level, and more. It can also help with predicting churn and with understanding how much new users are "worth" to you.

I will walk you step-by-step through carrying out this analysis by showing how I applied it to data on donations to Bernie Sanders' election campaign, to arrive at actionable insights. This analysis can yield interesting conclusions, such as this key takeaway: the highest DLV contributions came from people who donated monthly (the least frequent tier), contrary to the belief or goal behind ActBlue campaigns, which often nudge users to subscribe to a small daily donation.

Below are some key takeaways I got from this analysis in this case study of donations to Bernie Sanders’ campaign during 2015-2016.

  1. GO FOR HIGHER DONATION AMOUNTS, NOT FOR DONATING SMALL AMOUNTS FREQUENTLY. The groups with the highest average DLV were those with the highest average donation amounts, not those with the highest frequency. In other words, even when people donated every day, it still didn't add up over a whole month to the amounts given by the monthly donors.
  2. NUDGE FOR LOW FREQUENCY (LIKE MONTHLY) TO GET PEOPLE TO DONATE MORE. Within the high-amount donors of various frequencies, there was an INVERSE relationship with frequency: monthly donors typically donated more than weekly donors, who in turn donated more than daily donors of the same average-amount tier!
  3. CONSIDER THE SIZE OF EACH GROUP AND THE CONVERSION RATE: IF YOU CAN GET MANY MORE PEOPLE TO DONATE MONTHLY THAN TO DONATE DAILY, THEN THE SHEER SIZE OF THE GROUP WILL RESULT IN MORE DONATIONS. In aggregate discounted lifetime donation value (the per-donor value times the number of donors), most of the lifetime value comes from the monthly donors, even though they donate small amounts, because they were a much larger group. Therefore, for the Sanders campaigners - or Democratic campaigners generally - you might be better off convincing more people to donate monthly at whatever magnitude they can afford than trying to get a higher frequency of contributions that users will later stop, or pushing hard for a higher amount!
  4. PEOPLE MIND LESS ABOUT HOW OFTEN THEY DONATE THAN ABOUT HOW MUCH THEY DONATE EACH TIME. While some of the states are a bit more stable, most states tended to drop down a tier of frequency (e.g., from daily to weekly) while staying at a similar average donation amount!
  5. PEOPLE JOIN MORE AND STAY LONGER AT "LOWER ASKS," I.E., WHEN THEY DONATE LESS OFTEN. THIS RESULTS IN LARGER CONTRIBUTIONS OVER TIME. Over time, the number of monthly donors increased and the number of donors who donated small amounts decreased. Intuitively, it was easier to get people to join or stay in the lower-commitment donation patterns.

Disclaimer: take each conclusion with a grain of salt. The first sentence of each key takeaway assumes that the finding that follows reflects a causal relationship, which it might not, but it helps deliver the message and convey the value of each takeaway.


Motivation Behind This Framework: Discounted Lifetime Value and Markov Models

This analysis is part of my capstone project, in which I explain the full motivation behind using this framework. You can view it at bit.ly/tomercapstonepaper. If you do not intend to read the full motivation in my paper, here is a summary.

Selecting Markov Models for Modeling Discounted Predictive Lifetime Value

There are multiple approaches we can take towards the general goal of quantifying and predicting Discounted Lifetime Value / Customer Lifetime Value. You might want to quantify Discounted Lifetime Value for purely predictive purposes, or for interpretation and understanding of your users.

If your goal is purely predictive and you do not care about interpretation, you might be able to find better predictive models that can predict a DLV value for each user with greater accuracy. For example, you might use Random Forests or Neural Networks over all of the relevant available features. However, there are two main drawbacks: you will not get much understanding or interpretation of your user base, and it might therefore be hard to act upon the conclusions of the model.

If you're interested in interpretation and the ability to get actionable insights, there are very few relevant models. A regression model might work, but there are a few problems. First, you would have to engineer your data, features, and model form wisely and carefully. Second, you would not get an understanding of transitions and patterns of behavior. Third, and most important: with most other models, you would probably end up predicting the next action/state separately for each time step and for each user.

To solve all of these issues, you can model your users using a Markov model. You can sketch a diagram of the engagement states your users might transition between, which helps you understand your users' lifetime flow better. Then you can conveniently convert that diagram into a Markov model of engagement states.
This Markov model enjoys a convenient property: the distribution over states t timesteps ahead is given by the one-step transition matrix raised to the power t, which pairs naturally with the Discounted Lifetime Value formula, where each future timestep is discounted by \(\delta^t\) (t = timestep). Thus you only need to derive the discount factor and transition matrix once; for each desired future timestep, you raise them to the power of that timestep.
The DLV is then calculated for each state using the formula below, where \(S_1\) is State 1, \(C_1\) is the constant amount donated in the current week, \(P\) is the Markovian (1-step) transition matrix, \(\delta\) is our selected discount factor, and \(V_t\) is the expected value for future week \(t\):

\(V_{S_1} = C_1 + \delta\sum P V_1 + \delta^2\sum P^2 V_2 + \delta^3\sum P^3 V_3 + \dots + \delta^t\sum P^t V_t\)
Since the expected value for each future week is the weighted sum of the values received in each state, weighted by the probability of being in that state (taken from the transition matrix), \(V\) reduces to the constant value received in each state.
Using matrix notation, we can express this more concisely, where \(S_1...S_k\) stands for all states.
Then \(V_{S_1...S_k}\) is a column vector in which each entry is the DLV of the corresponding state, \(C_{S_1...S_k}\) is the analogous vector of current amount values ("flows") for the latest time period, and \(P\) is the full transition matrix from each state \(S_1...S_k\) to each other state \(S_1...S_k\).
\(V_{S_1...S_k} = C_{S_1...S_k} + \delta PV_{S_1...S_k}\)
We can then rearrange and arrive at this shorthand specification (see details below or in the paper):
\((I-\delta P)V_{S_1...S_k} = C_{S_1...S_k}\)
This enables us to solve for the DLVs as a system of linear equations and get the desired values with a simple computation!
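To make that last step concrete, here is a minimal, self-contained sketch in R (with made-up toy numbers, not values estimated from the ActBlue data) of solving \((I-\delta P)V = C\) for the per-state DLVs:

# Toy sketch: solving (I - delta * P) V = C for the DLV of each state.
# P, C, and delta are invented illustrative values, not estimates from the data.
states <- c("low", "high", "churned")
P <- matrix(c(0.70, 0.20, 0.10,   # transition probabilities from "low"
              0.15, 0.75, 0.10,   # transition probabilities from "high"
              0.00, 0.00, 1.00),  # "churned" is absorbing
            nrow = 3, byrow = TRUE, dimnames = list(states, states))
C <- c(low = 10, high = 50, churned = 0)  # flow received in each state this period
delta <- 0.95                             # per-period discount factor

V <- solve(diag(length(states)) - delta * P, C)  # solve the linear system
V  # predicted discounted lifetime value per state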


Technical Setup

First, let's install the required libraries (if needed) and load them into the workspace. Throughout this project, I will be working with the "data.table" library. It provides a new structure for dataframes that is indexed and faster to work with, and it has a more convenient and intuitive set of conventions for indexing and data transformations. It can replace most of the cases where you'd use "dplyr" or similar packages, and does so incredibly quickly. Below I will show how one simple aggregation took over a minute with ddply (from the plyr package) but a fraction of a second with data.table. I therefore highly recommend learning this library and speeding up your future R work.

# Loading Libraries
#install.packages(c('data.table','RODBC','tidyr','plyr', 'dplyr', 'ggplot2', 'ggthemes', 'directlabels', 'lubridate', 'car', 'markovchain', 'cowplot', 'plotly'))
libraries_needed <- c('data.table','tidyr','plyr', 'dplyr', 'ggplot2', 'plotly', 'ggthemes', 'directlabels', 'lubridate', 'markovchain')
lapply(libraries_needed, require, character.only = TRUE)
# disable scientific notation:
options(scipen = 999)
# Setting PATH variable as the current working directory
PATH = getwd()
### remove old objects from memory if needed:
# rm(ci, conf.intervals, dfs, dt.tenth, dt_tenth, lm_lin, lm1, lm2, new.dat) # remove by name
# rm(list=setdiff(ls(), "object_to_keep")) # remove all objects except these objects to keep
# Loading the data using fast data.table reader function: fread
df_original <- fread("./data/dt_bernie_repeaters.csv") # (complete or relative) path of dataset
setDT(df_original) #set as data.table object
summary(df_original)
##        V1           donor_uid             date               amount       
##  Min.   :      1   Length:3332208     Length:3332208     Min.   :   1.00  
##  1st Qu.: 833053   Class :character   Class :character   1st Qu.:  10.00  
##  Median :1666104   Mode  :character   Mode  :character   Median :  10.00  
##  Mean   :1666104                                         Mean   :  25.22  
##  3rd Qu.:2499156                                         3rd Qu.:  27.00  
##  Max.   :3332208                                         Max.   :5400.00  
##                                                                           
##  aggregate_amount   n_donations       prev_date         days_since_last 
##  Min.   :    1.0   Min.   :  2.000   Length:3332208     Min.   :  0.0   
##  1st Qu.:   15.0   1st Qu.:  3.000   Class :character   1st Qu.:  7.0   
##  Median :   40.0   Median :  6.000   Mode  :character   Median : 20.0   
##  Mean   :  102.6   Mean   :  8.808                      Mean   : 30.5   
##  3rd Qu.:  100.0   3rd Qu.: 11.000                      3rd Qu.: 31.0   
##  Max.   :26500.0   Max.   :171.000                      Max.   :305.0   
##                                                         NA's   :674884

Define The Time-Period of Each Step

Let's assume we are working with a dataset of actions, such as purchases, requests, donations, visits, clicks, etc. We want to decide on a time-period to model as one time-step that contains the most important patterns and cycles of the data. Then, we transform the data into a dataset where each row is one such time-period per user and each column (e.g., amount) summarizes the values of that variable during that time (e.g., the sum of dollars spent that week, or minutes spent on the website per day).

First, we need to define the time period that we'll call a "state," so that we can later chunk the data into rows that summarize these time periods and call them steps. For this, we'll need to use our own reasoning about the nature of the domain and the problem. On the one hand, we don't want timeframes that are too broad, since we would lose valuable information hidden inside them by summarizing them with just a few metrics. On the other hand, we don't want timeframes that are too small, since they don't reflect true cycles, lack sufficient information, and we usually won't have enough data to justify them. We want to characterize the natural "cycle" period of the space.

For example, consider modeling the requests of a user of a ride-hailing platform such as Uber/Lyft/Via. Starting from smaller timeframes: hours would naturally not be enough to capture ride-hailing patterns; a day might catch some (e.g., commute patterns or evening outings), but each day would have too much random variability to be useful for inference. On the other hand, we would miss bigger-picture information by skipping over the next time-frame scale: weekly patterns. We might model each week as a step, since people have weekly routines and different likelihoods of wanting to take rides not only at different times of day but also at different times of the week (e.g., peaking on Friday and Saturday nights). This is reasonably where we should stop, since there are no strong monthly patterns; and while there might be yearly seasonal effects, a year is too broad for us to use for inference. If we have the data, we should verify all of these assumptions in the data first. Once we have decided on a time-step, we will later reshape the data into a "wide" dataset where each row contains summary statistics (such as the sum or average of each column in the original dataset) for one user and one time period.
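As a toy illustration of this reshaping step (a hypothetical ride-hailing table with invented columns rider_id, timestamp, and fare; not the ActBlue dataset), summarizing raw events into weekly rows might look like this:

# Hypothetical example: summarize raw ride events into one row per rider per week.
library(data.table)
library(lubridate)
rides <- data.table(
  rider_id  = c(1, 1, 1, 2, 2),
  timestamp = as.POSIXct(c("2016-01-04 08:10", "2016-01-06 18:30", "2016-01-14 09:00",
                           "2016-01-05 22:15", "2016-01-20 07:45")),
  fare      = c(12.5, 9.0, 15.0, 30.0, 8.5)
)
rides[, week := floor_date(timestamp, unit = "week")]  # the chosen time-step
weekly <- rides[, .(total_fare = sum(fare), n_rides = .N), by = .(rider_id, week)]
weekly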

ActBlue: Monthly States (time-periods)

In this case, we are investigating patterns of donations to election campaigns through the donation platform "ActBlue." From analyzing the data (which I did previously in Python in this notebook), I observed that the majority of repeat donors donate every month, with a few other tiers such as weekly and daily donations, as well as lower frequencies or occasional donations without a recurring automatic cycle. For that reason, it seems we should model donations per month: anything shorter would miss the common monthly cycles, and anything longer would leave too few "steps" in our data, which spans about a year (2015-2016).


Data Preprocessing

I did much of the data preprocessing already in the above-mentioned Python notebook. You will need to inspect and clean the data, remove irrelevant or messy variables, limit the time scope of the data if it is too long (and perhaps the geographical and other scopes), and make sure you are left with only the population you truly care about studying. For example, for now I only care about "repeating" donors: donors who have donated more than once. To that end, as you can see in the Python notebook, I integrated many datasets from different periods and extracted only the relevant repeat donors.

Now I’ll start cleaning and transforming the dataset to the desired format.

Dataset Formatting

First, let’s make sure each column has the right format:

summary(df_original)
##        V1           donor_uid             date               amount       
##  Min.   :      1   Length:3332208     Length:3332208     Min.   :   1.00  
##  1st Qu.: 833053   Class :character   Class :character   1st Qu.:  10.00  
##  Median :1666104   Mode  :character   Mode  :character   Median :  10.00  
##  Mean   :1666104                                         Mean   :  25.22  
##  3rd Qu.:2499156                                         3rd Qu.:  27.00  
##  Max.   :3332208                                         Max.   :5400.00  
##                                                                           
##  aggregate_amount   n_donations       prev_date         days_since_last 
##  Min.   :    1.0   Min.   :  2.000   Length:3332208     Min.   :  0.0   
##  1st Qu.:   15.0   1st Qu.:  3.000   Class :character   1st Qu.:  7.0   
##  Median :   40.0   Median :  6.000   Mode  :character   Median : 20.0   
##  Mean   :  102.6   Mean   :  8.808                      Mean   : 30.5   
##  3rd Qu.:  100.0   3rd Qu.: 11.000                      3rd Qu.: 31.0   
##  Max.   :26500.0   Max.   :171.000                      Max.   :305.0   
##                                                         NA's   :674884

There are three main types of variable formats that we care about in this case:
  1. Numeric - any numeric value (int or float) that we will make numeric calculations with or should treat as a scale.
  2. Date - date information has its own format, which we use to do calculations with and to extract elements from.
  3. Factor - a categorical variable. A related format is "character," which is good for a column of unique strings of text, such as comments or notes. However, if we want to use such a column for grouping or subsetting (like a username), we should convert it to a factor, since it is really a category.

We see that V1 is just a numeric row index and can be ignored. Donor_uid is currently a "character" column, while it should be either "numeric" or a "factor"; in our case it's a text-based ID, so it can't be numeric and must be a factor. An alternative would be to convert this ID to a numerical ID and operate on that, but here it holds information (name and zip code) that is more helpful to observe. We should convert the "date" column into a date format, as well as the "prev_date" column. Amount, aggregate amount, number of donations, and days_since_last are already in numeric format.

# convert donor ID to a factor (categorical) rather than character:
df_original$donor_uid <- factor(df_original$donor_uid) 
# convert date columns to dates rather than text
df_original$date <- as.POSIXct(df_original$date)
df_original$prev_date <- as.POSIXct(df_original$prev_date)
# view df
head(df_original,3)
##    V1            donor_uid       date amount aggregate_amount n_donations
## 1:  1 ;RICHARD|FAHEY|38548 2015-05-06     10               20          10
## 2:  2 ;RICHARD|FAHEY|38548 2015-06-06     10               20          10
## 3:  3 ;RICHARD|FAHEY|38548 2015-07-06     10               80          10
##     prev_date days_since_last
## 1:       <NA>              NA
## 2: 2015-05-06              31
## 3: 2015-06-06              30
summary(df_original)
##        V1                              donor_uid      
##  Min.   :      1   CHRISTOPHER|FLESNER|72120:    217  
##  1st Qu.: 833053   XITIJ|SHAH|75204         :    201  
##  Median :1666104   NICKOLAS|ZACHARIAS|66046 :    182  
##  Mean   :1666104   ANGELA|BAILEY|32771      :    174  
##  3rd Qu.:2499156   CHI|SEXTON|77450         :    150  
##  Max.   :3332208   MUSTAFA|FARIS|78752      :    149  
##                    (Other)                  :3331135  
##       date                         amount        aggregate_amount 
##  Min.   :2015-04-30 00:00:00   Min.   :   1.00   Min.   :    1.0  
##  1st Qu.:2015-09-30 00:00:00   1st Qu.:  10.00   1st Qu.:   15.0  
##  Median :2015-12-30 00:00:00   Median :  10.00   Median :   40.0  
##  Mean   :2015-12-01 01:54:39   Mean   :  25.22   Mean   :  102.6  
##  3rd Qu.:2016-02-08 00:00:00   3rd Qu.:  27.00   3rd Qu.:  100.0  
##  Max.   :2016-02-29 00:00:00   Max.   :5400.00   Max.   :26500.0  
##                                                                   
##   n_donations        prev_date                   days_since_last 
##  Min.   :  2.000   Min.   :2015-04-30 00:00:00   Min.   :  0.0   
##  1st Qu.:  3.000   1st Qu.:2015-09-21 00:00:00   1st Qu.:  7.0   
##  Median :  6.000   Median :2015-12-16 00:00:00   Median : 20.0   
##  Mean   :  8.808   Mean   :2015-11-16 08:24:24   Mean   : 30.5   
##  3rd Qu.: 11.000   3rd Qu.:2016-01-28 00:00:00   3rd Qu.: 31.0   
##  Max.   :171.000   Max.   :2016-02-29 00:00:00   Max.   :305.0   
##                    NA's   :674884                NA's   :674884

Transforming the Dataset to Wide: Summaries of Activity Per User Per Time-Period

Now we need to start summarizing the dataframe per user per state. I'll use data.table from now on for most such purposes. Our state timeframe is a MONTH. Thus, the first step is to add a MONTH column so that we can group by it. We could either define months with a variable start day per user (according to their first donation) or take the simpler route of grouping by calendar month; since we are summarizing all of the month's information into one row, it doesn't really matter. If someone makes a donation every 15th of the month, it will still be summarized to the same aggregate monthly amount as it would be if she donated every 1st of the month.

I'm using the "data.table" package as my data structure and main data-manipulation package, since it is far more efficient and convenient than base R. I highly recommend working with it! It is actually easy to learn and will make your R coding more efficient. Check it out here: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html

The basic concept to understand is that data.table queries and performs operations by this syntax: DT[i, j, by]

R:   i (rows)              j (columns)          by
SQL: WHERE / ORDER BY      SELECT / UPDATE      GROUP BY

First position (i): empty here, meaning we select all rows.
Second position (j): here we specify what happens to each column (the summaries to compute).
Third position (by): by=list(donor_uid, monthyear) - here we group by donor ID and MONTH.
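Here is a tiny, self-contained illustration of the DT[i, j, by] pattern on toy data (invented values, not the ActBlue table):

# Tiny toy illustration of DT[i, j, by]
library(data.table)
DT <- data.table(donor = c("a", "a", "b", "b"),
                 month = c("2015-05", "2015-06", "2015-05", "2015-05"),
                 amount = c(10, 20, 5, 5))
# i: keep rows with a positive amount; j: compute summaries; by: group per donor per month
DT[amount > 0, .(total = sum(amount), n = .N), by = .(donor, month)]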

  • Remember that if you are about to make dummy / categorical variable operations (like regression) with data.table, be careful with the size of the dataset, vector and memory of your machine, since such operations effectively add another column per category in the background (one-hot encoding).

Below, I’ll take a sample of data and do the same transformation to it using data.table and using ddply, the common alternative, and show the efficiency of each.

# let's take just the year + month. Since the dates are all in the same format YYYY-MM-DD, it's easiest just to take the first 7 characters.
df_original$monthyear <- substr(df_original$date, start = 1, stop = 7)


# I always prefer to experiment first with a smaller sample of the data before doing a heavy or risky operation on the whole dataset. Let's make that sample:
dt_sample <- sample_n(df_original, 100000)
dt_sample <- setDT(dt_sample) # ensure it's a data.table
nrow(dt_sample) # checking: it does have 100000 rows. 
## [1] 100000
setkey(dt_sample, donor_uid, monthyear)

Why Use Data.table? Demonstration of Processing Time in Data.table Versus Ddply

# using data.table:
system.time( 
  dt <- dt_sample[ , list(total_amount = sum(amount), 
                               avg_amount = mean(amount), 
                               n_donations_month = .N,
                               n_donations_life = mean(n_donations)
                                ), 
                        by=list(donor_uid, monthyear)]
)
##    user  system elapsed 
##   0.023   0.003   0.009
# using ddply:
system.time( 
  dt_sum_ddply <- ddply(dt_sample, 
                        ~ donor_uid + monthyear,
                        summarize, 
                         total_amount = sum(amount), 
                         avg_amount = mean(amount), 
                         n_donations_life = mean(n_donations)
                        )
  )   
##    user  system elapsed 
##  68.495   1.033  70.168

Using ddply, this took over a minute (about 70 seconds), versus a fraction of a second with data.table, meaning data.table was several thousand times faster!
Let's run this aggregation with data.table on the entire dataset:

setDT(df_original)
dt <- df_original[ , list(total_amount = sum(amount), 
                               avg_amount = mean(amount), 
                               n_donations_month = .N,
                               n_donations_life = mean(n_donations)
                                ), 
                        by=list(donor_uid, monthyear)]
head(dt,2)
##               donor_uid monthyear total_amount avg_amount
## 1: ;RICHARD|FAHEY|38548   2015-05           10         10
## 2: ;RICHARD|FAHEY|38548   2015-06           10         10
##    n_donations_month n_donations_life
## 1:                 1               10
## 2:                 1               10

We will continue to use data.table whenever possible and I advise you to utilize it yourselves.

Time-Series States Dataset

Now that we have summarized the dataset to the month level, we will use it to create month-long states. We want to create a continuum of states per donor. That is, we want each donor to have a row for what happened during EACH month in the data range; and if they didn't make any contribution during a given month, we want a row saying they had 0 donations!
This creates a continuous representation of states per donor/user.

Let's order the data and then fill in the missing months:

setkey(dt, donor_uid, monthyear)
dt <- dt[order(donor_uid,monthyear)]

Now, we have many people who donated in only one month. They will not add much valuable information.
How many people donated in more than one month?

# let's add a column: "months_donated" - in how many different months did this donor donate?
# meaning, how many listings does this dataset hold of the same donor?
dt[, months_donated := .N, by=donor_uid]

# let's plot this as a histogram
qplot(data = dt, x=months_donated, main = "How many months did people donate in?", xlab = "Number of Months Users Donated In")

Since we are analyzing continuous engagement, and since our states will assume repeated donations, let's only analyze data from "repeating" donors who donated in more than one month. This means we won't have a complete picture of the probability of donating in more than one month, but we limit our analysis TO THOSE WHO DONATED IN MORE THAN ONE MONTH, to inspect the hypotheses about continuous/repeated donations versus one-time donations.

# SUBSET THE DATA TO DONORS WHO DONATED MORE THAN ONCE
dt_repeaters <- dt[months_donated>1, ,]
# let's work with the data subset of repeating donors:
# get a unique list of all IDs and unique list of all possible monthyears
# unique_ids <- levels(dt_repeaters$donor_uid) # the levels are based on the original full dataset. Let's do just for this one:
print(paste("Number of repeated donations:", nrow(dt_repeaters)))
## [1] "Number of repeated donations: 2097445"
unique_ids <- unique(dt_repeaters$donor_uid)
print(paste("Unique repeated donors:", length(unique_ids)))
## [1] "Unique repeated donors: 579825"
unique_monthyears <-  sort(unique(dt_repeaters$monthyear))
print(paste("Months:", length(unique_monthyears)))
## [1] "Months: 11"
print(sort(unique_monthyears))
##  [1] "2015-04" "2015-05" "2015-06" "2015-07" "2015-08" "2015-09" "2015-10"
##  [8] "2015-11" "2015-12" "2016-01" "2016-02"
# Now we need to make a new dataset of all permutations
# there are a number of ways to do this.
# 1) from base R, you can use expand.grid(column_1, column_2).
# 2) from the tidyr package, you can use the slightly improved "crossing": crossing(column_1, column_2), and get a tibble.
# 3) in data.table, you can use "CJ" function, which creates a join data.table, with the "UNIQUE" option on.

# Now, let's use data.table's cross join (CJ) to get a data.table with all permutations.
library(data.table)
dt_combos <-  CJ(donor_uid = unique_ids, monthyear = unique_monthyears, unique = TRUE)
head(dt_combos,2)
##               donor_uid monthyear
## 1: ;RICHARD|FAHEY|38548   2015-04
## 2: ;RICHARD|FAHEY|38548   2015-05

Now that we have our base table with all required IDs and timesteps, we can fill in data where we have recorded values. The rest will be donations of 0 since they didn’t donate.
Since we need ALL rows from the first (LEFT) table of permutations (keeping even the ones without matches in the RIGHT table), while bringing in only the matched rows from the second (RIGHT) table, this is called a LEFT OUTER JOIN, a concept taken mainly from SQL. It is implemented in multiple ways in R: as merge functions in base R, in dplyr, and in our case in data.table. For a more complete guide on joins in data.table, see the tutorial found here.

# now, combine this permutations dataset with the actual info from our dt_repeaters dataset.
# first, we need to set the columns we want to match on as keys:
setkey(dt_combos, donor_uid, monthyear)
setkey(dt_repeaters, donor_uid, monthyear)

# Now, we perform the match!
# The "all.x" part is the "Left Join", meaning: 
# take everything from the LEFT table and match what you can from the RIGHT (X=LEFT, versus RIGHT=Y)
dt.all <- merge(dt_combos, dt_repeaters,  all.x=TRUE)
dt.all
##                     donor_uid monthyear total_amount avg_amount
##       1: ;RICHARD|FAHEY|38548   2015-04           NA         NA
##       2: ;RICHARD|FAHEY|38548   2015-05           10         10
##       3: ;RICHARD|FAHEY|38548   2015-06           10         10
##       4: ;RICHARD|FAHEY|38548   2015-07           10         10
##       5: ;RICHARD|FAHEY|38548   2015-08           10         10
##      ---                                                       
## 6378071:   ZZYZX|CURTIS|47408   2015-10           NA         NA
## 6378072:   ZZYZX|CURTIS|47408   2015-11           NA         NA
## 6378073:   ZZYZX|CURTIS|47408   2015-12           10         10
## 6378074:   ZZYZX|CURTIS|47408   2016-01           10         10
## 6378075:   ZZYZX|CURTIS|47408   2016-02           NA         NA
##          n_donations_month n_donations_life months_donated
##       1:                NA               NA             NA
##       2:                 1               10             10
##       3:                 1               10             10
##       4:                 1               10             10
##       5:                 1               10             10
##      ---                                                  
## 6378071:                NA               NA             NA
## 6378072:                NA               NA             NA
## 6378073:                 1                2              2
## 6378074:                 1                2              2
## 6378075:                NA               NA             NA

GREAT! Now we have the merged data.table. BUT WAIT - there are NAs! Why is that? Of course: we only matched the rows with donations, leaving the rows without donations as NAs. All that remains to complete this stage is to fill the NAs with the relevant values: 0's for the amounts and n_donations_month, and the user's constant values for n_donations_life and months_donated.

# let's fill in the NA lines with relevant values (in place, using data.table's := operator):
dt.all[is.na(total_amount), total_amount := 0]
dt.all[is.na(avg_amount), avg_amount := 0]
dt.all[is.na(n_donations_month), n_donations_month := 0]
dt.all$monthyear <- factor(dt.all$monthyear)
# now fill n_donations_life and months_donated with the constant corresponding statistic per user
# first, get a condensed table of unique matches: total n_donations_life and months_donated of each user (regardless of month). This is like a dictionary we'll later merge back in
dt.totalstats = unique(dt.all[!is.na(n_donations_life), .(donor_uid, n_donations_life, months_donated)])
setkey(dt.totalstats, donor_uid)
setkey(dt.all, donor_uid)
dt.all <- merge(dt.all[,.(donor_uid, monthyear, total_amount, avg_amount, n_donations_month)], dt.totalstats, all.x = TRUE)

# save file; "fwrite" is a quick and efficient writer for data.tables
fwrite(dt.all, "./data/dt_all.csv")
## loading checkpoint if needed:
# dt.all <- fread(paste0(PATH,"/data/"))

How to Define Your States? Start With Data Exploration and Analysis

To define our states, we must know our problem space well, know the analyzed users reasonably well, and know the purpose and intended usage of this analysis (who is going to use it and for what).

Possible approaches

There are a few main directions we can take this in:

1. Manual Definition from Human Knowledge

  1. User: Who is using this and for what? Do they already have groupings of users that we should consider? Do they treat customers differently based on some criteria or thresholds?
  2. Product: Does our product or service offer distinct tiers? If so, we should consider those in the states. If not explicitly, does it do so implicitly?
  3. User Journey: The best state maps would map the life journey of the user with our product/service. For example: new -> first_time_user -> weekly_high_amount_user -> monthly_high_amount_user -> churned. By knowing the natural journey flows of our customers, we can construct user stories and thus build hypotheses for possible stages of a user's life. If we don't know that already, we can learn it by looking at the data.

2. Computational Algorithms to help define states

We can use some algorithms to help us define the states here! Three main ways we can do that are as follows:

1. Clustering into states: decide which few features (columns) of the data are most important for defining a state, for example frequency and amount. We can choose those and cluster the data according to them. If we are lucky enough to be able to cluster according to only 2 (or sometimes 3) features, we can also plot all the data points on a scatter-plot of those features and overlay the resulting clusters or boundaries to help judge the quality of the clustering.

2. Predicting next states - if you use a regression, or better yet a decision tree or random forest, on all the features to predict an outcome such as the next state or the amount donated next month, you can find which features are the most informative.

3. Hidden Markov Models - Hidden Markov Models assume that there are hidden processes driving the processes we observe. We can use a visible variable such as the "amount" as the measured proxy (the visible state), while assuming that hidden states drive it invisibly in the background. This approach should also maximize the stability of those states, although potentially at the expense of their interpretability.
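As a rough, hypothetical sketch of this third option (it is not part of this notebook's analysis, and it assumes the depmixS4 package plus an invented toy data frame of monthly amounts), fitting a simple HMM could look roughly like this:

# Hypothetical sketch only: a 3-state Gaussian HMM on a toy series of monthly amounts,
# using the depmixS4 package (not otherwise used in this notebook).
library(depmixS4)
set.seed(1)
toy <- data.frame(amount = c(rnorm(10, 10, 1), rnorm(10, 0.5, 0.3), rnorm(10, 50, 2)))
mod <- depmix(amount ~ 1, data = toy, nstates = 3, family = gaussian())
fm  <- fit(mod)       # EM estimation of transition and emission parameters
summary(fm)           # estimated hidden-state transition matrix and emission means
head(posterior(fm))   # most likely hidden state for each observation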

3. Combine computational data analysis with domain knowledge to create states

If you don't already have defined states of engagement, you should think about what useful and reasonable "states" of engagement would be for your case. This process shouldn't be entirely automatic, because eventually this sort of analysis needs interpretable, useful states that are relevant for two purposes:
(1) understanding and talking about your customer base more broadly in other uses, and (2) treating these "states" as cohorts and making decisions accordingly, such as treating or incentivizing them differently, or inferring customer behavior patterns from the transitions between them. However, we do want the states to be somewhat stable, meaning the probability of staying in each state should be relatively high (except for states defined differently, such as a "New" state or another inherently temporary state). With that, devising the states should be informed by understanding the data and customer behavior. For inspiration, see below.

Suggested procedure for discerning states:
  1. Decide what the metric for "value" that you want to measure is. In many cases that will be money, but not necessarily: it could be time spent on your platform, clicks, or whatever drives your revenue or key business outcomes.
  2. Decide what the most important variables to differentiate between states are. In many cases, those will be: (1) frequency per time-period (averaged) and (2) value delivered.
  3. Think organizationally and conceptually: do you have organizational relevance to particular types of divisions across these variables? For example, frequencies at each order of magnitude (hourly, daily, weekly, monthly, yearly), or particular value levels? (Are there different tiers of customers according to the value they deliver, such as "gold" status in a frequent-flyer club? Are there currently any promotions or segmentations of the customer base that are likely to persist and could inform these states?)
  4. Explore the distribution of these two variables. Does the data lend itself to some distinct groupings? Do these make sense?
  5. Try to combine the logical and data-driven hypotheses about states to inform your state formation.
  6. Alternatively, or in combination with the above, you can let the data speak for itself and use a clustering algorithm to choose optimal clusters along the measured variables automatically. I'll demonstrate such a technique using a simple and famous clustering algorithm called K-means.
  7. Another possible direction, usually more useful later in the process, is to use Hidden Markov Models (HMMs), which assume that a visible layer of states is driven by an unknown, hidden layer of states. However, we need to start with some "state" definitions as the visible states.

Combining Clustering Insights With Human Insights

If you already have precise domain knowledge and a business use-case describing your states, that is great, and it is not something I can generalize here. Therefore, I'm going to take a more domain-agnostic and generalizable approach and create the first definition of states informed by clustering.

In our case, there are two key variables which would define a state: 1. Amount donated, and 2. Frequency.
Amount donated - we have that information already, as amounts aggregated over the month. We currently have two amount metrics: the average amount donated that month and the total amount donated that month. Which one we choose depends on the other variable of choice, and on which would add the most information when combined with it. Frequency - we currently have the number of donations that month. We might want to include more past information, for example: for how many consecutive months has the donor donated? That might be more predictive.

Why couldn't the states be just cuts of the distribution of the total amount given per month? Why would we want to divide them by frequency and amount donated? There are a few main reasons. First, remember that our goal in this analysis is interpretation and actionable insights. Having someone donate $30 once a month versus $1 every day is a very different behavioral pattern, and also a very different ask. Therefore we want to separate them, to know what people did and what the outcomes of each pattern were. For example, we want to be able to distinguish exactly between donors donating similar totals but in different chunks, so that we can see which pattern was more sustainable or led to a longer lifetime and a higher overall lifetime value. Second, from prior experience with this kind of analysis, you can see patterns of people changing one of these variables but not the other: for example, still donating $10, but once a month instead of every week. I hypothesized that this might happen in this data as well. By separating these two dimensions, we allow the Markov transition matrix to show us exactly that, and adjust the DLV prediction accordingly.

Next, we should remember that the Markovian assumption is that the states are "memoryless": the current state is supposed to be sufficiently predictive of the next ones. Therefore, we will start with a first-order (1-step) Markov process. Later we can try incorporating more memory by wrangling the data into a more memory-based metric, or by implementing a multi-step Markov model.
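To make the memoryless assumption concrete, here is a toy illustration (invented numbers, not estimates from the data): in a first-order chain, where a user is likely to be two months from now depends only on their current state, via the one-step transition matrix raised to the second power.

# Toy illustration (made-up numbers): two-step transition probabilities are simply P %*% P.
P <- matrix(c(0.8, 0.2,
              0.4, 0.6),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("active", "lapsed"), c("active", "lapsed")))
P %*% P  # each row still sums to 1; compare with P to see how "memory" washes out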

For frequency, the existing 1-step variable is n_donations per month. For amount, we can use either the average amount per donation or the total amount donated that month. Considering that we already have the number of contributions as a separate variable, it makes more sense to use the average amount donated, which gives us a more intuitive metric of the typical donation value (e.g., $1 per day), and we can easily multiply it by n_donations per month to get the monthly total if we like.

First, we will explore the data distributions around these variables of interest:

Distribution of N_Donations Per Month

qplot(x = n_donations_month, binwidth=1, data=dt[n_donations_month>0], xlab = "Number of repeated donations", main = "Histogram: Number of repeated donations per donor") 

So the distribution of the number of repeated donations per person is heavily right-skewed, decaying roughly exponentially.

Distribution of Average Amount Donated

qplot(x = avg_amount, binwidth=5, data=dt[n_donations_month>0 & avg_amount<=500], xlab = "Average Amount (for months with donation)", main = "Histogram: Average Amount Donated") 

Although at first glance this distribution looks roughly like a regular Poisson distribution, it is more likely a multimodal distribution composed of the most common (discrete) amounts users donate, plus some continuous spread around those values. This is probably because the Bernie Sanders ActBlue campaign page presented donors with buttons of suggested donation values at these peaked amounts: $1, $2, $5, $25, $100, $500, $1,000, or "Other" (the user chooses a custom amount).
Let's see how that actually looks on the website: ActBlue page for supporting Bernie Sanders

*Note: this screenshot is from the current website, not from 2015-2016, so the suggested amounts may have varied.
This clear connection between the user interface and the amounts donated in practice is an explicit example of how nudging and choice architecture influence users to follow suggestions; the suggested donation amounts account for almost all of the donations.

Scatterplot of the relationship between the main two variables


Let's see if there are any visible divisions of the data.

Scatterplot of avg_donation_amount vs n_donations_month

# let's see the relationship between the average donation amount and the number of donations per month
# we'll sample 500,000 out of the 2.1 Million rows with actual donations we have, since R couldn't produce that image with all those datapoints!
qplot(x=n_donations_month, y=avg_amount, data = sample_n(dt.all[n_donations_month>0],500000), geom="point", main="Monthly donations vs Donation Amount", xlab = "Number of donations that Month", ylab = "Donation Amount", alpha=0.1)

These are the two key axes along which we will create our states, since we are interested in groups of varying (1) frequency and (2) amount; therefore, plotting these variables against each other is essential. In some cases, the data might show more natural divisions by itself. In other cases, you might want to transform the data to get more granularity, or use a different measure of a similar metric so that groups are more easily distinguishable (for example, the number of donations in the past two months, days since the last donation, etc.).

Get States Approximations Using Clustering

If you want to be agnostic to the scale of each feature, you should normalize/standardize all of your features before clustering. In this case, I want the weighting to reflect the original scales, so I'm keeping them as is.
The two most important variables for clustering are the average donation amount and the number of donations per month. We might later want to check other transformations of these variables that could be more informative, depending on the application, such as the number of consecutive months with donations, or that number weighted by the number of donations per month.

Let's see what clustering might suggest for us. We will try to cluster the data according to our 2 (or more) most important variables, usually related to frequency (here: the number of donations per month) and amount (which could also be a count of actions, time spent, clicks, engagement, etc.).

# Make sure to clean your values of NAs and INFs, and, if you want, transform them to make more relevant groupings (for example, squaring them if you want more distinct differences at higher amounts).
summary(dt.all)
##                         donor_uid         monthyear      
##  ;RICHARD|FAHEY|38548        :     11   2015-04: 579825  
##  .SUSAN|HAJIANI|90025        :     11   2015-05: 579825  
##  'STANLEY|LEHTO|98597        :     11   2015-06: 579825  
##  (ANJANEYA)MURTY|KANURY|97330:     11   2015-07: 579825  
##  (JAMES) DAVID|RICH|85308    :     11   2015-08: 579825  
##  (LETITIA) DIANE|TAYEBY|05301:     11   2015-09: 579825  
##  (Other)                     :6378009   (Other):2899125  
##   total_amount        avg_amount       n_donations_month 
##  Min.   :    0.00   Min.   :   0.000   Min.   :  0.0000  
##  1st Qu.:    0.00   1st Qu.:   0.000   1st Qu.:  0.0000  
##  Median :    0.00   Median :   0.000   Median :  0.0000  
##  Mean   :   12.06   Mean   :   8.422   Mean   :  0.4833  
##  3rd Qu.:   10.00   3rd Qu.:  10.000   3rd Qu.:  1.0000  
##  Max.   :12000.00   Max.   :5400.000   Max.   :113.0000  
##                                                          
##  n_donations_life  months_donated  
##  Min.   :  2.000   Min.   : 2.000  
##  1st Qu.:  2.000   1st Qu.: 2.000  
##  Median :  3.000   Median : 3.000  
##  Mean   :  5.083   Mean   : 3.617  
##  3rd Qu.:  6.000   3rd Qu.: 5.000  
##  Max.   :171.000   Max.   :11.000  
## 

There are many clustering algorithms available. For this tutorial, let's use k-means, since it is the most common, the most familiar to readers, easily understandable, and its algorithm answers our needs. We'll use one of the common algorithms for k-means clustering, called "Hartigan-Wong," which updates its centroids every time a point moves and also saves time by making smarter choices in how it checks for the closest cluster. Other algorithms available in R, like Lloyd's, are not as efficient and remain mostly as historical legacy; Lloyd's was the original k-means algorithm.

# now let's see what clustering would bring us
cluster <- kmeans(x=dt.all[, .(avg_amount, n_donations_month)], centers=6, iter.max = 500, nstart = 25, algorithm = "Hartigan-Wong")
cluster$centers
##      avg_amount n_donations_month
## 1    0.07532229        0.03513539
## 2  306.55146601        1.33075390
## 3   27.77653581        1.56763243
## 4    9.86239870        1.48533986
## 5 1233.09930365        1.17112998
## 6   67.99218534        1.29466409


Now that we have clustered algorithmically, let's see where the algorithm divided the data:

# See results

## Now that we have cluster assignments, let's insert them into our data.
dt.all$cluster <- factor(cluster$cluster) # adds a cluster column to inform states 

# now we can plot this and see how it looks. I'm sampling 0.5 Million samples from the data so that the software can handle it. For plotting, I'm plotting average donations below 1000 since above that there are only some outliers (that belong to the same group).
clusterplot <- ggplot(sample_n(dt.all[avg_amount<1000], 500000), aes(x=n_donations_month, y=avg_amount, color=cluster, alpha=0.2)) +
  geom_point() +
  labs(title =  "Avg Amounts vs Number of Donations This Month", 
        x = "Number of Donations This Month", 
        y = "Avg Donation Amount")
# this time I didn't plot it with plotly because there are too many datapoints for interactivity (freezing Rstudio), but later you'll see some interactive plotly plots. 
clusterplot

The clustering algorithm divided the data into 6 different clusters, where we see the following trends:
  1. They are mostly separated by amount, not frequency.
  2. There are many more groups in the lower range, almost as if the division were linearly spaced on a log scale.
We shall consider that, but also other alternatives, when choosing the states.

Choice of the clusters themselves

As I mentioned before, the choice of states should be informed by the data but, more importantly, by our needs for later interpretation. The clustering algorithm chose to cluster mostly by the amount donated, but we care at least as much about the number of donations per month. Looking at the clusters, the data, previous explorations and distributions, and the possible choice set for donations, we should come up with some distinct boundaries.

Choice of the number of clusters

How do you know how many clusters to have? This is a case-by-case question. If you have definitions and cut-offs that it would make sense to cluster by, definitely consider those. You probably don't want too many clusters, because you want them to be interpretable and understandable.
My advice: go with a few clusters that are clear and interpretable in your situation, and mix insights from clustering methods, possibly Hidden Markov Models (see the end of the tutorial), and the real-world relevance of the groups. Computationally, I'd look at the distribution of clusters and see what looks most sensible (the most concise, distinct clustering), while keeping an eye on the reported cluster sum of squares (between_SS / total_SS = 87.8 %). After many trials with various numbers of centers, the clustering shown above seems to have good performance and decent interpretability. We'll continue with that for now.
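One common heuristic worth mentioning (a sketch only; it was not part of the original selection, and it runs k-means on a random sample for speed) is the "elbow" plot of total within-cluster sum of squares against the number of clusters:

# Optional heuristic sketch: "elbow" method on a sample of the data.
set.seed(42)
idx <- sample(nrow(dt.all), 100000)
X <- dt.all[idx, .(avg_amount, n_donations_month)]
wss <- sapply(2:10, function(k)
  kmeans(X, centers = k, nstart = 10, iter.max = 100)$tot.withinss)
plot(2:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
# Look for the "elbow": the k beyond which adding clusters barely reduces the WSS.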

Check cluster boundaries to inform state boundaries, together with external information (the choice set)

Let’s check clusters means on both axes:

cluster$centers
##      avg_amount n_donations_month
## 1    0.07532229        0.03513539
## 2  306.55146601        1.33075390
## 3   27.77653581        1.56763243
## 4    9.86239870        1.48533986
## 5 1233.09930365        1.17112998
## 6   67.99218534        1.29466409

Check cluster boundaries:

## borders in amount (loop only over the clusters that actually exist)
n_clusters <- nrow(cluster$centers)
list_borders <- c()
for (i in seq_len(n_clusters)) {
  list_borders <- c(list_borders, c(min(dt.all[cluster == i, avg_amount]), 
                                  max(dt.all[cluster == i, avg_amount])))
}
list_borders
##  [1]    0.000000    4.920000  187.500000  769.000000   18.727273
##  [6]   47.906667    1.107143   18.815000  770.000000 5400.000000
## [11]   47.944444  187.093333
## borders in n_donations_month
list_borders_nd <- c()
for (i in seq_len(n_clusters)) {
  list_borders_nd <- c(list_borders_nd, c(min(dt.all[cluster == i, n_donations_month]), 
                                  max(dt.all[cluster == i, n_donations_month])))
}

View the list of borders:

list_borders_nd
##  [1]   0  26   1  14   1  84   1 113   1   9   1  52

For determining the boundaries, we should take another close look at the histograms:

qplot(x = n_donations_month, binwidth=1, data=dt.all, xlab = "Number of repeated donations", main = "Histogram: Number of repeated donations per donor") 

qplot(x = avg_amount, binwidth=1, data=dt[n_donations_month>0 & avg_amount<=100], xlab = "Average Amount (for months with donation)", main = "Histogram: Average Amount Donated") 

In addition to the histograms, we should consider that the choices given to donors were: $10, $27, $50, $100, $250, $500, $1000, [custom_value]. But we mainly see the peaks (outside of 0) at $10, $25, $50, and $100, with very few donations above $100.

According to the above means and boundaries, previous data explorations and distributions, and the possible choice set for donations, I will define the states along the following boundaries.
Average amount donated:
- $0: no donation that month ("Non")
- $0 < amount ≤ $10: "1"
- $10 < amount ≤ $25: "10"
- $25 < amount ≤ $50: "25"
- above $50: "50+"

Frequency (number of donations per month):
- 0 donations: "Non"
- 1-4 donations: "monthly"
- 5-10 donations: "weekly"
- 11+ donations: "daily"

Creating states according to boundaries

dt.all$state <- ""

# No donations:
dt.all[n_donations_month == 0]$state <- "Non" #0
dt.all[avg_amount == 0]$state <- "Non"

# low frequency (monthly): 1-4 donations a month
# for each frequency tier, we have 4 amount tiers: 
# $0-$10, $10-$25, $25-$50, $50+
dt.all[n_donations_month>0 & n_donations_month<=4 &
        avg_amount > 0 & avg_amount<=10]$state <- "monthly1"    #31
dt.all[n_donations_month>0 & n_donations_month<=4 &
        avg_amount > 10 & avg_amount<=25]$state <- "monthly10"  #32
dt.all[n_donations_month>0 & n_donations_month<=4 &
        avg_amount > 25 & avg_amount<=50]$state <- "monthly25"  #33
dt.all[n_donations_month>0 & n_donations_month<=4 &
        avg_amount > 50]$state <- "monthly50+"                  #34

# weekly: 5-10 donations
dt.all[n_donations_month>4 & n_donations_month<=10 & 
        avg_amount > 0 & avg_amount<=10]$state <- "weekly1"  #21
dt.all[n_donations_month>4 & n_donations_month<=10 &
        avg_amount > 10 & avg_amount<=25]$state <- "weekly10" #22
dt.all[n_donations_month>4 & n_donations_month<=10 &
        avg_amount > 25 & avg_amount<=50]$state <- "weekly25" #23
dt.all[n_donations_month>4 & n_donations_month<=10 &
        avg_amount > 50]$state <- "weekly50+"               #24


# daily: 11+ donations
dt.all[n_donations_month>10 &
        avg_amount > 0 & avg_amount<=10]$state <- "daily1"  # 11
dt.all[n_donations_month>10 &
        avg_amount > 10 & avg_amount<=25]$state <- "daily10" #12
dt.all[n_donations_month>10 &
        avg_amount > 25 & avg_amount<=50]$state <- "daily25" #13
dt.all[n_donations_month>10 &
        avg_amount > 50]$state <- "daily50+"                 #14

dt.all$state <- factor(dt.all$state)  
fwrite(dt.all, paste0(PATH,"/data/dt_state.csv"))
summary(dt.all$state)
##     daily1    daily10    daily25   daily50+   monthly1  monthly10 
##       1420       2308        849        326    1046867     450043 
##  monthly25 monthly50+        Non    weekly1   weekly10   weekly25 
##     409774     136893    4280630      14526      21681       9259 
##  weekly50+ 
##       3499

According to the above insights, and wanting an intuitive interpretation of the states, I am dividing the states into the following categories per month.
Frequency:
- 1. Monthly: 1-4 donations a month
- 2. Weekly: 5-10 donations per month
- 3. Daily: 11 or more donations per month
Amount:
- A. "_1": average donation between $0.01-$10
- B. "_10": average donation between $10.01-$25
- C. "_25": average donation between $25.01-$50
- D. "_50+": average donation above $50
"Non" means the user did not make any donation that month.

Checkpoint: save the data to disk, and reload it if needed:
### save checkpoint to disk
fwrite(dt.all, "./data/dt.all.state.csv")

dt.all <- fread(paste0(PATH,"/data/dt_state.csv"))
dt.all$state <- factor(dt.all$state)
summary(dt.all$state)
##     daily1    daily10    daily25   daily50+   monthly1  monthly10 
##       1420       2308        849        326    1046867     450043 
##  monthly25 monthly50+        Non    weekly1   weekly10   weekly25 
##     409774     136893    4280630      14526      21681       9259 
##  weekly50+ 
##       3499

Here we have finished setting up the states: a logical division informed by the distribution of the data and, to a lesser extent, by the clustering analysis (which turned out to be less relevant in this particular case). The resulting donor states are graspable, easy to distinguish and understand, and appropriate for this analysis.

Visualize the new states

Let's see what the division into states looks like exactly. From now on, I want to demonstrate plotting in a slightly more visually thoughtful way: more comfortable and intuitive to read, with fewer distractions from the actual data (such as a white background and better sizing), more appropriate for paper-ready graphs, and making the data more easily detected by the human eye. This is based on visualization principles, which you can learn more about from the field expert Claus O. Wilke on his website. Wilke created the "cowplot" package, which automatically transforms your ggplot visualizations into that style. It's not necessarily more visually pleasing, but it helps readers detect and understand the data in the plot more effectively.

# install if necessary and load the COWPLOT package, then plot normally!
# install.packages("cowplot")
require("cowplot")
# if you want to remove it, use this line below:
# unloadNamespace("cowplot")
ggplot(sample_n(dt.all[avg_amount<1000], 500000), aes(x=n_donations_month, y=avg_amount, color=state)) +
  geom_point(alpha=0.3) +
  labs(title =  "Average Donation Amount vs Number of Donations That Month", 
        x = "Number of Donations This Month", 
        y = "Avg Donation Amount") +
  theme_minimal()

Create “Previous State” shifted column for transition

Now that we have created the "state" column, we need another column holding each row's "previous state" for each donor. From this we'll be able to build the transition matrix. For this operation I'll again use data.table. To create a shifted state column, we first need to order the dataset in the order to be shifted, and group by user (so that states don't spill over between users). So the primary sort column is the user, and within each user we order by date. Since we have about 6 million rows, this takes a long time; I've commented the operation out so it doesn't run accidentally, but you can uncomment it and run it on your machine.

# now for each division of the db perform the shift
# First step: "set keys" in data.table in the order we want to sort the columns by. Make sure the first column is the primary group (the user: donor_uid) and the second is the secondary sort (dates: monthyear)
# 
setkey(dt, donor_uid, monthyear) # important for ordering 
dt <- dt[order(donor_uid, monthyear)] 
# verify it's ordered correctly by viewing it:
head(dt,13)
##                 donor_uid monthyear total_amount avg_amount
##  1:  ;RICHARD|FAHEY|38548   2015-05           10         10
##  2:  ;RICHARD|FAHEY|38548   2015-06           10         10
##  3:  ;RICHARD|FAHEY|38548   2015-07           10         10
##  4:  ;RICHARD|FAHEY|38548   2015-08           10         10
##  5:  ;RICHARD|FAHEY|38548   2015-09           10         10
##  6:  ;RICHARD|FAHEY|38548   2015-10           10         10
##  7:  ;RICHARD|FAHEY|38548   2015-11           10         10
##  8:  ;RICHARD|FAHEY|38548   2015-12           10         10
##  9:  ;RICHARD|FAHEY|38548   2016-01           10         10
## 10:  ;RICHARD|FAHEY|38548   2016-02           10         10
## 11: :LAURA|GARRISON|30601   2016-02           10          5
## 12:  .SUSAN|HAJIANI|90025   2015-11          100        100
## 13:  .SUSAN|HAJIANI|90025   2016-01           50         50
##     n_donations_month n_donations_life months_donated
##  1:                 1               10             10
##  2:                 1               10             10
##  3:                 1               10             10
##  4:                 1               10             10
##  5:                 1               10             10
##  6:                 1               10             10
##  7:                 1               10             10
##  8:                 1               10             10
##  9:                 1               10             10
## 10:                 1               10             10
## 11:                 2                2              1
## 12:                 1                2              2
## 13:                 1                2              2
# # CREATE "Previous State" COLUMN, SHIFTED ONCE BEHIND, per group of user id. 
# ***UNCOMMENT THE LINE BELOW TO EXECUTE!!! 
# dt[, prev_state := shift(state, n=1L, fill = NA, type = "lag"), by = donor_uid]
 
# save result: UNCOMMENT
# fwrite(dt, "./data/df_chunks/dt_states_1.csv")
# 
# # make sure it worked!! see below for further description of what to notice:
# dt[1:100, .(donor_uid, monthyear, state, prev_state)]

# mark donors who have no previous state as "New", except in the first month of the data (where no one has a previous state).
# dt[is.na(prev_state) & monthyear != '2015-04', state := "New"]
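To make the shift-by-group operation concrete without touching the 6-million-row table, here is a tiny self-contained toy example (made-up rows, not the real data):

# Toy example of creating a lagged "previous state" column per donor with data.table
library(data.table)
toy <- data.table(donor_uid = c("A", "A", "A", "B", "B"),
                  monthyear = c("2015-05", "2015-06", "2015-07", "2015-05", "2015-06"),
                  state     = c("monthly1", "Non", "monthly10", "weekly1", "weekly1"))
setkey(toy, donor_uid, monthyear)                        # order by donor, then month
toy[, prev_state := shift(state, n = 1L, type = "lag"), by = donor_uid]
toy   # each donor's first month gets NA as prev_state; later rows get last month's state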

Checkpoint to save or load the saved result:

# fwrite(dt.all, "./data/df_states_full.csv")
dt <- fread("./data/df_states_full.csv")
dt$state <- factor(dt$state) # make state from string to factor (category)
#print(summary(dt$state))
head(dt,3)
##    V1            donor_uid monthyear total_amount avg_amount
## 1: 22 'STANLEY|LEHTO|98597   2015-04            0          0
## 2: 23 'STANLEY|LEHTO|98597   2015-05            0          0
## 3: 24 'STANLEY|LEHTO|98597   2015-06            0          0
##    n_donations_month n_donations_life months_donated cluster state
## 1:                 0                3              3       5   New
## 2:                 0                3              3       5   Non
## 3:                 0                3              3       5   Non
##    state_prev state_num state_prev_num
## 1:          -         1             -1
## 2:        New         0              0
## 3:        Non         0              0

Divide into “Train” and “Test” datasets

In machine learning, when we develop predictive models, we have to balance fitting the model well against overfitting. We test this by dividing the dataset into a "training" and a "testing" set: we fit the model on the training data, and only at the end do we evaluate it and report conclusions on the test data. That shows how well the model generalizes to out-of-sample data, which is the whole point of a predictive model (otherwise, we would have known the answers already, no?!). Normally we'd want the training and testing sets to be similar, in the sense of being representative random samples from the population. However, in this case, as in any other time-series or chronological dataset, in reality we will always have a history of data and will try to predict the unseen future from it. Therefore, we split the training and testing sets chronologically into early and late data, pretending we are standing at the midpoint in time and trying to predict the "future" data from everything before it.

# make monthyear into a factor
dt$monthyear <- factor(dt$monthyear)
# let's convert these to dates, since we will also need that later for plotting.
# since we left out the day of month, we append an arbitrary day ("-01") so that R accepts it as a date.
dt$date <-  as.Date(paste0(dt$monthyear, "-01") )

# Now split the dataset to training (2015) and testing datasets (2016)
dt_2015 <- dt[date < as.Date("2016-01-01"), ]
dt_2016 <- dt[date >= as.Date("2016-01-01"),]
print(paste("2015 Dataset N =", nrow(dt_2015)))
## [1] "2015 Dataset N = 5218425"
summary(dt_2015[, .(monthyear)])
##    monthyear      
##  2015-04: 579825  
##  2015-05: 579825  
##  2015-06: 579825  
##  2015-07: 579825  
##  2015-08: 579825  
##  2015-09: 579825  
##  (Other):1739475
print(paste("2016 Dataset N =", nrow(dt_2016)))
## [1] "2016 Dataset N = 1159650"
summary(dt_2016[, .(date)])
##       date           
##  Min.   :2016-01-01  
##  1st Qu.:2016-01-01  
##  Median :2016-01-16  
##  Mean   :2016-01-16  
##  3rd Qu.:2016-02-01  
##  Max.   :2016-02-01
print(paste0("Allocated into training set: ",100*nrow(dt_2015)/nrow(dt),"%."))
## [1] "Allocated into training set: 81.8181818181818%."

Filter First Month (without transitions)

For the upcoming transition calculations, we want to ignore the first month, because no one has a previous state there. So we filter it out:

# filter out April (the first month, which has no transitions)
dt <- dt[dt$date > as.Date("2015-04-30"),]
df <- setDF(dt) # make a data.frame out of dt 
summary(dt$date)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2015-05-01" "2015-07-01" "2015-09-16" "2015-09-15" "2015-12-01" 
##         Max. 
## "2016-02-01"

Build Transition Matrix


Now we can use our dataset and build a Markovian transition matrix based on the frequency of the observed transitions!

Build Overall Transition Matrix: Based On All Data

# First, make sure your states are ordered in your desired order.
# in R, categorical columns are "factors", and you set the order by *specifying the order of the "levels"* in your factor. 

state_levels <- c("New", 'Non'  , 'monthly1' , 'monthly10', 'monthly25', 'monthly50+', 'weekly1' , 'weekly10' , 'weekly25' , 'weekly50+' , 'daily1', 'daily10', 'daily25', 'daily50+')
dt$state <- factor(dt$state, levels = state_levels)
dt$state_prev <- factor(dt$state_prev, levels = state_levels)

# first, let's create a dataframe with all possible combinations of states, as a template of all possible transitions, to ensure that any transition matrix will include the entire combination set even if there is no data for some of them. 
# this is especially needed when we later do this *per month*, since in a smaller sample there is a higher chance that some combinations never occur (and we do see that happen). 
states_combos <- expand.grid(state_prev=state_levels, curr_state=state_levels, KEEP.OUT.ATTRS = TRUE, stringsAsFactors = TRUE)


# This is a function which computes Transition Matrix from data and converts to vector 
f_compute_transitions_as_matrix <- function(df=setDF(dt), 
                                          curr_state_column = "state", 
                                           prev_state_column = "state_prev",
                                          return_ns=F, # if TRUE, also return the N (count) of transitions per row
                                          factorize=F,
                                          states_combos_df = states_combos # needs a df of all state combinations
                                           ) { 
    # sometimes we want to make sure the state columns are factored with the right levels; that can be handled here:
   if (factorize) {
      df[[curr_state_column]] <- factor(df[[curr_state_column]], levels = state_levels)
      df[[prev_state_column]] <- factor(df[[prev_state_column]], levels = state_levels) }
  
    # build a contingency table of (previous state, current state) counts, then melt it from wide into long format
    transitions_long <- melt(table(df[[prev_state_column]], df[[curr_state_column]]))
    colnames(transitions_long) = c("state_prev", "curr_state","n") # change column names 
    
    # now, in case that there are missing combinations, add them:
    # first, merge the transitions counts into the df with full set of possible combinations, 
    # so that if there were any combinations missing they will be there from the combos df without an N value
    transitions_long_complete <- merge(states_combos_df, transitions_long, all.x=T)
    # now convert any missing NAs (without transitions) to 0 count.
    transitions_long_complete[is.na(transitions_long_complete)] <- 0
    
    # here we create the transition matrix by summarizing the long format, where the key is the current state
    trans_matrix <- spread(as.data.frame(transitions_long_complete), 
                           key =  curr_state, value = n, fill = 0,  drop = FALSE) 
    trans_matrix[["state_prev"]] <- NULL # drop the state_prev column (rows remain ordered by the state levels)
    
    # Tabulate to get numbers of each row (from each state), 
    # and divide that row by this total number to get percentage of transitions
    n_per_row <- rowSums(trans_matrix) 
    trans_matrix <- trans_matrix / n_per_row # make into transition probabilities (every row sums to 1)
    trans_matrix <- replace(trans_matrix, is.na(trans_matrix), 0) # replace NA with 0 (never occured)
    trans_matrix <- data.matrix(trans_matrix) # convert into data (numerical) matrix
    #trans_matrix <- replace(trans_matrix, is.na(trans_matrix), 0) # replace NA with 0 (never occured)
    if (return_ns == T) {return(list(trans_matrix, n_per_row))} else {return(trans_matrix)}
}
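To see the core idea in isolation, here is a tiny toy illustration (made-up states, not the project data) of the same computation: tabulate previous-to-current state pairs and normalize each row into probabilities.

# Toy illustration: pairs of (previous, current) states -> row-normalized transition matrix
prev_toy <- c("Non", "Non", "monthly1", "monthly1", "monthly1", "weekly1")
curr_toy <- c("monthly1", "Non", "monthly1", "Non", "weekly1", "weekly1")
counts <- table(prev_toy, curr_toy)          # contingency table of transition counts
round(prop.table(counts, margin = 1), 2)     # divide each row by its row total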

Above is the function we can use to create a transition matrix from any two state columns, or from any subset of the dataframe. Let's apply it to the whole dataframe:

## TRANSITION MATRIX: convert vector to transition matrix
trans_matrix <- f_compute_transitions_as_matrix(dt, curr_state_column = "state", 
                                           prev_state_column = "state_prev", return_ns = F)
# Save matrix
fwrite(as.data.frame(trans_matrix, row.names = unlist(state_levels)), "./data/trans_matrix_all.csv")

# PRINT MATRIX
round(as.data.frame(trans_matrix, row.names = unlist(state_levels)),2)
##            New  Non monthly1 monthly10 monthly25 monthly50+ weekly1
## New          0 0.85     0.07      0.02      0.04       0.02    0.00
## Non          0 0.74     0.12      0.06      0.06       0.02    0.00
## monthly1     0 0.31     0.58      0.07      0.02       0.00    0.01
## monthly10    0 0.36     0.12      0.38      0.10       0.01    0.00
## monthly25    0 0.45     0.06      0.09      0.36       0.03    0.00
## monthly50+   0 0.51     0.03      0.04      0.09       0.32    0.00
## weekly1      0 0.04     0.33      0.11      0.01       0.00    0.25
## weekly10     0 0.04     0.11      0.23      0.08       0.01    0.05
## weekly25     0 0.04     0.04      0.10      0.22       0.08    0.01
## weekly50+    0 0.07     0.03      0.04      0.09       0.29    0.01
## daily1       0 0.01     0.09      0.02      0.00       0.00    0.19
## daily10      0 0.01     0.01      0.04      0.01       0.01    0.04
## daily25      0 0.02     0.01      0.02      0.03       0.02    0.00
## daily50+     0 0.06     0.03      0.03      0.04       0.07    0.01
##            weekly10 weekly25 weekly50+ daily1 daily10 daily25 daily50+
## New            0.00     0.00      0.00   0.00    0.00    0.00     0.00
## Non            0.00     0.00      0.00   0.00    0.00    0.00     0.00
## monthly1       0.01     0.00      0.00   0.00    0.00    0.00     0.00
## monthly10      0.02     0.01      0.00   0.00    0.00    0.00     0.00
## monthly25      0.01     0.01      0.00   0.00    0.00    0.00     0.00
## monthly50+     0.00     0.01      0.01   0.00    0.00    0.00     0.00
## weekly1        0.13     0.01      0.00   0.06    0.03    0.00     0.00
## weekly10       0.27     0.08      0.01   0.01    0.07    0.02     0.00
## weekly25       0.10     0.23      0.08   0.00    0.04    0.05     0.02
## weekly50+      0.02     0.08      0.26   0.00    0.01    0.02     0.06
## daily1         0.09     0.01      0.00   0.43    0.15    0.01     0.00
## daily10        0.20     0.09      0.01   0.06    0.40    0.09     0.03
## daily25        0.07     0.17      0.08   0.01    0.15    0.31     0.13
## daily50+       0.04     0.08      0.22   0.00    0.04    0.15     0.21
Below is a more helpful visualization of the same table, with heatmap-style color-coded cell values (green for higher probabilities).

Transition Matrix

Great! The transition matrix above shows that the diagonal values, meaning the "stable" transitions of staying in the same state, are the highest in each row. That suggests our categorization into states is reasonable, and we may continue. It is not ideal - states are not, say, 80% likely to persist - but at least every state is more likely to repeat itself than to transition to any other single state (except "New", which by definition cannot repeat).
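As a quick sanity check (a sketch that assumes the trans_matrix object from above, with rows and columns ordered by state_levels), we can verify that every row sums to 1 and inspect the self-transition probabilities directly:

# Sanity checks on the transition matrix computed above
range(rowSums(trans_matrix))          # every row should sum to (about) 1
p_stay <- diag(trans_matrix)          # self-transition probabilities (the diagonal)
names(p_stay) <- state_levels         # rows/cols are assumed ordered by state_levels
round(sort(p_stay, decreasing = TRUE), 2)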

Build “Train” Transition Matrix: Based Only on Early/Training Dataset

Now we also want to test the predictive capability of this transition matrix by splitting the dataset into training (through Dec 2015) and test (Jan 2016 onwards) periods. We build the transition matrix on the training set and predict on the test set later. In this particular case we will probably run into a seasonality issue: we don't have multiple years of data to model seasonality, and I don't presume to know enough to manually adjust the predictions for it. The major seasonal change would be a significant increase in donations as the election season approaches. Still, I will demonstrate the principle for the reader's potential use.

### 2015 Only TRANSITION MATRIX + N's of rows
list_trans_matrix_2015 <- f_compute_transitions_as_matrix(dt_2015, curr_state_column = "state",
                                                          prev_state_column = "state_prev", return_ns = T)

trans_matrix_2015 <- list_trans_matrix_2015[[1]] 
trans_matrix_2015_Ns <- list_trans_matrix_2015[[2]] 

# PRINT MATRIX
round(as.data.frame(trans_matrix_2015, row.names = unlist(state_levels)),2)
##            New  Non monthly1 monthly10 monthly25 monthly50+ weekly1
## New          0 0.85     0.07      0.02      0.04       0.02    0.00
## Non          0 0.79     0.10      0.04      0.05       0.02    0.00
## monthly1     0 0.34     0.60      0.04      0.01       0.00    0.00
## monthly10    0 0.43     0.13      0.37      0.05       0.01    0.00
## monthly25    0 0.52     0.06      0.07      0.32       0.02    0.00
## monthly50+   0 0.61     0.03      0.03      0.07       0.25    0.00
## weekly1      0 0.09     0.51      0.11      0.01       0.00    0.19
## weekly10     0 0.07     0.20      0.30      0.09       0.01    0.06
## weekly25     0 0.08     0.07      0.16      0.33       0.07    0.01
## weekly50+    0 0.08     0.08      0.04      0.17       0.33    0.00
## daily1       0 0.05     0.21      0.05      0.00       0.00    0.33
## daily10      0 0.03     0.03      0.13      0.03       0.00    0.08
## daily25      0 0.08     0.08      0.08      0.08       0.00    0.00
## daily50+     0 0.00     0.00      0.12      0.12       0.25    0.00
##            weekly10 weekly25 weekly50+ daily1 daily10 daily25 daily50+
## New            0.00     0.00      0.00   0.00    0.00    0.00     0.00
## Non            0.00     0.00      0.00   0.00    0.00    0.00     0.00
## monthly1       0.00     0.00      0.00   0.00    0.00    0.00     0.00
## monthly10      0.01     0.00      0.00   0.00    0.00    0.00     0.00
## monthly25      0.00     0.00      0.00   0.00    0.00    0.00     0.00
## monthly50+     0.00     0.00      0.00   0.00    0.00    0.00     0.00
## weekly1        0.06     0.00      0.00   0.02    0.01    0.00     0.00
## weekly10       0.19     0.03      0.00   0.01    0.03    0.01     0.00
## weekly25       0.08     0.14      0.03   0.00    0.02    0.02     0.01
## weekly50+      0.02     0.06      0.16   0.00    0.00    0.01     0.03
## daily1         0.09     0.00      0.03   0.24    0.00    0.00     0.00
## daily10        0.34     0.10      0.02   0.05    0.16    0.02     0.02
## daily25        0.08     0.38      0.00   0.00    0.08    0.15     0.00
## daily50+       0.00     0.12      0.38   0.00    0.00    0.00     0.00

Now we have the transition matrix based on the 2015 data! It looks reasonable, as it is fairly close to the full-data matrix.

Build Historical Monthly Transition Matrices

To substantiate the stability and over-time relevance of our model, we'd like to build the same transition-matrix model on each month's data alone, and later compare the values across months, so we can see the history and stability of transitions month by month. If the values change wildly (either oscillating or, worse, showing a clear trend), we'd know that either (1) the monthly data is too sparse, or (2) transition patterns inherently change over time, and thus we might not be able to rely on the model to predict far into the future.

I'll do that by:
1. Creating a new "month" (year-month) column.
2. Grouping the dataset by month and running the same transition-matrix creation function on each group (simple with ddply).

Building Per Month Datasets:
We'll use ddply/dlply from the plyr package, which work with data.frames rather than data.tables, so we'll convert the data.table into a dataframe simply by using setDF().

Let’s start with our monthly transition matrices and values.
First, we’ll build the same concept as before - our Markovian transition matrix, but based on each Month’s data separately.

Create Monthly Transition Matrices

# ALL monthly TRANSITION MATRICES
f_generate_monthly_trans_matrices <- function(df, states_colname="state",
                                               prev_states_colname="state_prev") {
  monthly_trans_matrices <- dlply(df, .(monthyear), .fun = f_compute_transitions_as_matrix, curr_state_column = states_colname, prev_state_column = prev_states_colname)
  #monthly_trans_matrices <- lapply(monthly_trans_matrices[2:length(monthly_trans_matrices)], na.exclude) 
  #monthly_trans_matrices <- monthly_trans_matrices[2:(length(monthly_trans_matrices)-1)]
  monthly_trans_matrices
  }

# Run Function!
monthly_trans_matrices <- f_generate_monthly_trans_matrices(df)

# make sure that each listing is of the same size:

complete_states <- function(avg_amounts, states_levels) {
  statesdf <- data.frame(states_levels)
  colnames(statesdf) = "states"
  states_amounts <- merge(statesdf, avg_amounts, all.x=T)
  states_amounts[is.na(states_amounts)] <- 0
  states_amounts 
  }

# Save Result
fwrite(monthly_trans_matrices, "./data/monthly_trans_matrices.csv")
# View Result for an example
monthly_trans_matrices[5]
## $`2015-09`
##       New        Non   monthly1  monthly10   monthly25  monthly50+
##  [1,]   0 0.00000000 0.00000000 0.00000000 0.000000000 0.000000000
##  [2,]   0 0.71045566 0.14073717 0.05833470 0.067973810 0.021711880
##  [3,]   0 0.20524794 0.71345410 0.05568422 0.013845942 0.002290965
##  [4,]   0 0.24956091 0.12922455 0.53563255 0.060620576 0.008196285
##  [5,]   0 0.37250066 0.09812384 0.08777416 0.412490091 0.018937726
##  [6,]   0 0.47942064 0.06071213 0.04248642 0.098129149 0.309354255
##  [7,]   0 0.00000000 0.36607143 0.05357143 0.008928571 0.000000000
##  [8,]   0 0.02739726 0.13698630 0.23287671 0.054794521 0.000000000
##  [9,]   0 0.07142857 0.07142857 0.07142857 0.214285714 0.035714286
## [10,]   0 0.00000000 0.11111111 0.00000000 0.111111111 0.000000000
## [11,]   0 0.00000000 0.33333333 0.00000000 0.000000000 0.000000000
## [12,]   0 0.00000000 0.00000000 0.00000000 0.000000000 0.000000000
## [13,]   0 0.00000000 0.00000000 0.00000000 0.000000000 0.000000000
## [14,]   0 0.00000000 0.00000000 0.00000000 0.000000000 1.000000000
##            weekly1     weekly10     weekly25     weekly50+         daily1
##  [1,] 0.0000000000 0.0000000000 0.0000000000 0.00000000000 0.000000000000
##  [2,] 0.0002911512 0.0003289342 0.0001200165 0.00002667034 0.000008890114
##  [3,] 0.0055333692 0.0033926314 0.0003254923 0.00002503787 0.000150227219
##  [4,] 0.0026611315 0.0108041939 0.0028740220 0.00010644526 0.000053222630
##  [5,] 0.0006606183 0.0044041223 0.0040958337 0.00070465956 0.000000000000
##  [6,] 0.0001207001 0.0022933011 0.0039831020 0.00301750151 0.000000000000
##  [7,] 0.3571428571 0.1250000000 0.0000000000 0.00000000000 0.044642857143
##  [8,] 0.1095890411 0.2876712329 0.0547945205 0.01369863014 0.013698630137
##  [9,] 0.0000000000 0.1428571429 0.2500000000 0.00000000000 0.000000000000
## [10,] 0.0000000000 0.0000000000 0.0000000000 0.66666666667 0.000000000000
## [11,] 0.0000000000 0.0000000000 0.0000000000 0.00000000000 0.666666666667
## [12,] 0.0000000000 0.0000000000 1.0000000000 0.00000000000 0.000000000000
## [13,] 0.0000000000 0.0000000000 0.0000000000 0.00000000000 0.000000000000
## [14,] 0.0000000000 0.0000000000 0.0000000000 0.00000000000 0.000000000000
##              daily10        daily25      daily50+
##  [1,] 0.000000000000 0.000000000000 0.00000000000
##  [2,] 0.000006667585 0.000004445057 0.00000000000
##  [3,] 0.000050075740 0.000000000000 0.00000000000
##  [4,] 0.000212890521 0.000000000000 0.00005322263
##  [5,] 0.000264247336 0.000044041223 0.00000000000
##  [6,] 0.000362100181 0.000000000000 0.00012070006
##  [7,] 0.044642857143 0.000000000000 0.00000000000
##  [8,] 0.054794520548 0.013698630137 0.00000000000
##  [9,] 0.035714285714 0.107142857143 0.00000000000
## [10,] 0.000000000000 0.000000000000 0.11111111111
## [11,] 0.000000000000 0.000000000000 0.00000000000
## [12,] 0.000000000000 0.000000000000 0.00000000000
## [13,] 0.000000000000 1.000000000000 0.00000000000
## [14,] 0.000000000000 0.000000000000 0.00000000000

Above is an example of one of the monthly transition matrices. We see, for example, that the "rare" columns are not always filled in, since rare donation patterns may simply not occur in a single month of data. That is a weakness inherent to this check, so we shouldn't count on the accuracy of the rarer values.

Average Monthly Amount Per State

Since we want to estimate the 'value' of being in each state, the key input is the actual amount donated in each state per timestep.

# ALL MONTHLY AMOUNT FLOWS (Donor Monthly Amount Averages), PER DONOR STATE, PER MONTH
f_generate_monthly_amount_avgs <- function(df = dt, 
                                          states_column = "state", timeframe = "monthyear") {
    monthly_amount_avgs <- ddply(df[c(timeframe, states_column ,"avg_amount")], 
                              .(monthyear, eval(parse(text=states_column))), 
                              summarise, avg_amount=mean(avg_amount))
    colnames(monthly_amount_avgs) = c(timeframe,"state","avg_amount")
    return(monthly_amount_avgs)
}

# Run Function: Generate Average Monthly Amounts
monthly_amount_avgs <- f_generate_monthly_amount_avgs(df)
# Save and View Result
# fwrite(monthly_amount_avgs, "./data/monthly_amount_avgs.csv")
monthly_amount_avgs[1:15,]
##    monthyear      state avg_amount
## 1    2015-05     daily1   9.666667
## 2    2015-05    daily10  15.871212
## 3    2015-05   monthly1   9.128604
## 4    2015-05  monthly10  21.362677
## 5    2015-05  monthly25  44.118810
## 6    2015-05 monthly50+ 138.822419
## 7    2015-05        Non   0.000000
## 8    2015-05    weekly1   7.123500
## 9    2015-05   weekly10  16.981622
## 10   2015-05   weekly25  33.372008
## 11   2015-05  weekly50+ 109.907123
## 12   2015-06     daily1   7.077381
## 13   2015-06    daily10  12.428657
## 14   2015-06   monthly1   8.610756
## 15   2015-06  monthly10  18.938001

Above is the average “value” (amount donated) per state per time period.

Grab Diagonal Transition Probabilities (Probabilities of repeating the same state)

To get a concise idea of the differences between transition probabilities through history, I’ll use the diagonals of the transition matrix, which means the probability for each state to repeat itself. Then, we could visualize their progress over time.

# Monthly diagonals df
f_grab_monthly_diagnoals <- function(monthly_trans_matrices, state_levels, timeframe="monthyear") {
  monthly_diagonals <- ldply(monthly_trans_matrices, .fun = diag)
  colnames(monthly_diagonals) <- c(timeframe, unlist(state_levels))
  return(monthly_diagonals)
}
# Run function
monthly_diagonals <- f_grab_monthly_diagnoals(monthly_trans_matrices, state_levels, timeframe="monthyear")
# View result
monthly_diagonals
##    monthyear New       Non  monthly1 monthly10 monthly25 monthly50+
## 1    2015-05   0 0.0000000 0.0000000 0.0000000 0.0000000  0.0000000
## 2    2015-06   0 0.8601765 0.4940459 0.3697429 0.2724654  0.2056265
## 3    2015-07   0 0.8684338 0.4475266 0.2392017 0.2427718  0.1725108
## 4    2015-08   0 0.8460426 0.5727861 0.3458286 0.3020536  0.2016585
## 5    2015-09   0 0.7104557 0.7134541 0.5356326 0.4124901  0.3093543
## 6    2015-10   0 0.8159545 0.5151641 0.2918454 0.2729428  0.2070071
## 7    2015-11   0 0.8479445 0.6672665 0.4193409 0.3472743  0.2536180
## 8    2015-12   0 0.5587033 0.6848462 0.4545049 0.3793568  0.4470461
## 9    2016-01   0 0.4697698 0.5800798 0.3722922 0.3749312  0.3659146
## 10   2016-02   0 0.4332018 0.5074471 0.3796755 0.5109247  0.4808587
##      weekly1   weekly10   weekly25  weekly50+    daily1   daily10
## 1  0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
## 2  0.2926829 0.21212121 0.13333333 0.08333333 0.0000000 0.0000000
## 3  0.1157895 0.08629442 0.03797468 0.11538462 0.3333333 0.0000000
## 4  0.2068966 0.12790698 0.04000000 0.23076923 0.3333333 0.0000000
## 5  0.3571429 0.28767123 0.25000000 0.66666667 0.6666667 0.0000000
## 6  0.1120815 0.14487179 0.12500000 0.10937500 0.1600000 0.1333333
## 7  0.1818182 0.21052632 0.13142857 0.12280702 0.2142857 0.1111111
## 8  0.4516129 0.32035928 0.24468085 0.27272727 0.3333333 0.8000000
## 9  0.2888696 0.26736995 0.19289941 0.24045802 0.4202899 0.3968254
## 10 0.2607187 0.31481086 0.29571984 0.29538905 0.4622356 0.4285714
##      daily25  daily50+
## 1  0.0000000 0.0000000
## 2  0.0000000 0.0000000
## 3  0.0000000 0.0000000
## 4  0.0000000 0.0000000
## 5  1.0000000 0.0000000
## 6  0.1250000 0.0000000
## 7  0.0000000 0.0000000
## 8  0.0000000 0.0000000
## 9  0.2156863 0.1666667
## 10 0.3700787 0.2400000
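A quick way to eyeball the stability of these self-transition probabilities is to melt the table above into long format and draw one line per state over time (a sketch assuming the monthly_diagonals data frame just created):

# Sketch: self-transition probability per state, month by month
diag_long <- melt(monthly_diagonals, id.vars = "monthyear",
                  variable.name = "state", value.name = "p_stay")
ggplot(diag_long, aes(x = monthyear, y = p_stay, color = state, group = state)) +
  geom_line() +
  labs(title = "Probability of Repeating the Same State, per Month",
       x = "Month", y = "P(stay in same state)") +
  theme_minimal()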

Monthly Transitions Of Specific States

In addition to the diagonals, I'd like to look more closely at the transitions to or from specific states.
For example, if the state that donates the most over time is "daily50+", we'd like to know how people transition into that state, where they go afterward, and how that has changed over the months.

# monthly Row or Column: To and From Daily
f_grab_trans_fromto <- function(monthly_trans_matrices, rowcolnum = 3, from_or_to = "from", states_list = state_levels, rows_id="monthyear") {
  if (from_or_to == "from") {
    grab_trans <- function(df) {df[rowcolnum, ]}
  } else {
    grab_trans <- function(df) {df[, rowcolnum]}
  }
  monthly_trans_rows = ldply(monthly_trans_matrices, .fun = grab_trans,  .id = rows_id)
  colnames(monthly_trans_rows) <- c(rows_id,unlist(states_list))
  return(monthly_trans_rows)
}
# Run Function, for example from state daily low, since that's interesting
monthly_trans_from_dailylow <- f_grab_trans_fromto(monthly_trans_matrices, rowcolnum = 11, from_or_to = "from") 
# View result
monthly_trans_from_dailylow
##    monthyear New         Non   monthly1  monthly10 monthly25  monthly50+
## 1    2015-05   0 0.000000000 0.00000000 0.00000000         0 0.000000000
## 2    2015-06   0 0.000000000 0.00000000 0.00000000         0 0.000000000
## 3    2015-07   0 0.333333333 0.00000000 0.00000000         0 0.000000000
## 4    2015-08   0 0.000000000 0.33333333 0.00000000         0 0.000000000
## 5    2015-09   0 0.000000000 0.33333333 0.00000000         0 0.000000000
## 6    2015-10   0 0.040000000 0.24000000 0.08000000         0 0.000000000
## 7    2015-11   0 0.000000000 0.21428571 0.07142857         0 0.000000000
## 8    2015-12   0 0.166666667 0.00000000 0.00000000         0 0.000000000
## 9    2016-01   0 0.021739130 0.07971014 0.02173913         0 0.000000000
## 10   2016-02   0 0.003021148 0.06948640 0.02114804         0 0.006042296
##      weekly1   weekly10    weekly25 weekly50+    daily1   daily10
## 1  0.0000000 0.00000000 0.000000000 0.0000000 0.0000000 0.0000000
## 2  0.0000000 1.00000000 0.000000000 0.0000000 0.0000000 0.0000000
## 3  0.0000000 0.00000000 0.000000000 0.3333333 0.3333333 0.0000000
## 4  0.3333333 0.00000000 0.000000000 0.0000000 0.3333333 0.0000000
## 5  0.0000000 0.00000000 0.000000000 0.0000000 0.6666667 0.0000000
## 6  0.3200000 0.12000000 0.000000000 0.0400000 0.1600000 0.0000000
## 7  0.4285714 0.07142857 0.000000000 0.0000000 0.2142857 0.0000000
## 8  0.5000000 0.00000000 0.000000000 0.0000000 0.3333333 0.0000000
## 9  0.2318841 0.05797101 0.000000000 0.0000000 0.4202899 0.1594203
## 10 0.1419940 0.09667674 0.009063444 0.0000000 0.4622356 0.1722054
##        daily25 daily50+
## 1  0.000000000        0
## 2  0.000000000        0
## 3  0.000000000        0
## 4  0.000000000        0
## 5  0.000000000        0
## 6  0.000000000        0
## 7  0.000000000        0
## 8  0.000000000        0
## 9  0.007246377        0
## 10 0.018126888        0

Markov Modeling

Now we can model this as a Markov chain and see what the steady state would look like under these transitions.

#install.packages("markovchain")
require("markovchain")

### Markov Chain
markov <- new("markovchain", states = unlist(state_levels), byrow = T, transitionMatrix = trans_matrix, name = "Donor States Markov Chain")
# print(markov)

#### Steady State
# The steady state is the eigenvector of the transition matrix with eigenvalue 1 (normalized to sum to 1); it gives the proportions of donors in each state that would remain unchanged under the transition matrix.
steady <-  steadyStates(markov)

# reorder the steady-state vector by our state order
order_by_state <- function(vector = steady, new_order=state_levels) {
  vector_dt <- as.data.frame(t(vector))
  vector_dt$state <- rownames(vector_dt)
  vector_dt$state <- factor(vector_dt$state, levels = new_order)
  vector_dt <- vector_dt[order(vector_dt$state), ]  # assign the reordered rows back
  vec <- as.double(vector_dt$V1)
  names(vec) <- new_order
  return(vec)
  }

steady <- order_by_state(steady, state_levels)
print(t(t(steady)))
##                    [,1]
## New        0.0000000000
## Non        0.5711958580
## monthly1   0.2042060065
## monthly10  0.0950029136
## monthly25  0.0802358268
## monthly50+ 0.0250789482
## weekly1    0.0045698099
## weekly10   0.0082197737
## weekly25   0.0041616392
## weekly50+  0.0018542719
## daily1     0.0011507385
## daily10    0.0025117365
## daily25    0.0012269323
## daily50+   0.0005855448
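As a sanity check on the steady state (a sketch assuming steady and trans_matrix are both ordered by state_levels, as set up above), the steady-state distribution should be left unchanged when multiplied by the transition matrix:

# A steady-state distribution pi satisfies pi %*% P = pi
steady_next <- as.vector(steady %*% trans_matrix)
round(max(abs(steady_next - steady)), 8)  # should be approximately 0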

Discounted Lifetime Value Formula and Model


Our user value equation is:
\(X = C + d \cdot (\text{expected future value})\)
where \(C\) is the vector of current average monthly donation amounts per state and \(d\) is the discount factor.

The expected future value is:
\(P \cdot X\)
i.e., the transition matrix \(P\) times the value vector \(X\) itself, assuming the same equation holds at the next timestep.

So, with \(P\) the transition matrix, \(C\) the current monthly donation amounts, and \(X\) the vector of discounted lifetime values per state:
\(X = C + d \cdot P X\)

We need to get this into the form of a matrix times the unknown vector equal to a constant. Rearranging:
\(X - d P X = C\)
\((I - d P) X = C\)

So we can solve a linear system of the form \(A x = b\), where:
\(A = (I - dP), \quad b = C\)
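Here is a tiny numeric illustration of that linear solve (toy numbers, two states, not the campaign data), just to show the mechanics before we wrap it in a function below:

# Toy illustration: solve (I - d*P) X = C for a 2-state chain
P_toy <- matrix(c(0.7, 0.3,
                  0.4, 0.6), nrow = 2, byrow = TRUE)  # toy transition matrix
C_toy <- c(10, 50)                                    # toy average monthly amounts per state
d     <- 0.6                                          # discount factor
X_toy <- solve(diag(2) - d * P_toy, C_toy)            # discounted lifetime values per state
X_toy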

Discount Factor: Determine the Discount Factor Manually

You determine the discount factor manually; there is no exact science to it. One useful question: how many steps into the future do you need to care about? If the answer is only 4 steps, you can set the discount factor so that after more than 4 steps the discounted value is negligible, for whatever threshold you consider "negligible". Say that threshold is 0.05 (5%), a commonly accepted significance level, so at that point the value is not meaningfully different from zero. Then you want a discount factor whose 4th power is around 0.05, which happens at values below roughly 0.47 (0.47^4 = 0.049). In our case, let's say we want to predict no more than 6 months into the future; a discount factor of 0.6 is then appropriate, since 0.6^6 = 0.046656.

discount = 0.8 # manually determined discount factor (we will try several values below)
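If it helps, here is a small hypothetical helper (not part of the original analysis) for the back-of-the-envelope calculation above: given a discount factor and a "negligible" threshold, how many steps until the discounting crosses it?

# Hypothetical helper: number of steps until d^t drops below a chosen threshold
steps_until_negligible <- function(d, threshold = 0.05) ceiling(log(threshold) / log(d))
steps_until_negligible(0.6)    # 6 steps: 0.6^6 is about 0.047
steps_until_negligible(0.47)   # 4 steps: 0.47^4 is about 0.049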

# GET DLV DATASET

state_levels <- c("New", 'Non'  , 'monthly1' , 'monthly10', 'monthly25', 'monthly50+', 'weekly1' , 'weekly10' , 'weekly25' , 'weekly50+' , 'daily1', 'daily10', 'daily25', 'daily50+')

# Here is a little function to make sure that all states are present in a summary dataframe, and if not - add the missing states and 0 as the amount. 

complete_states <- function(avg_amounts, states_levels) {
  statesdf <- data.frame(states_levels)
  colnames(statesdf) = "states"
  states_amounts <- merge(statesdf, avg_amounts, all.x=T)
  states_amounts[is.na(states_amounts)] <- 0
  states_amounts 
  }

### General function with an arbitrary discount factor and arbitrary state columns
generate_DLV_dataset_beta <- function(dataframe = df, 
                                         discount_factor = 0.3, 
                                         curr_states_column = "state", 
                                         prev_state_column = "state_prev", 
                                         stateslist = state_levels,
                                         amount_column = "avg_amount",
                                         factorize_df = F) {
    
   # Flow amounts: average monthly amount donated per state, across all donors in that state and all timesteps
    avg_amounts = aggregate(dataframe[[amount_column]], by=list("states" = dataframe[[curr_states_column]]), FUN=mean)
    avg_amounts[["states"]] <- factor(avg_amounts[["states"]], levels=stateslist, ordered=TRUE)
    
    # if there are missing states that time period, add it and assign 0 with custom function above
    avg_amounts <- complete_states(avg_amounts, stateslist) 

    # Calculate transition Matrix
    trans_matrix <- f_compute_transitions_as_matrix(dataframe, curr_states_column, prev_state_column, factorize = factorize_df )
    
    # Calculate Customer Lifetime Value Solution! 
    ## Getting "A" Matrix: Re-factoring Transition Matrix P to A = (I-dp) = I - discount_factor*trans_matrix
    # get Identity Matrix with the number of dimensions as states:
    I = diag(nrow(trans_matrix)) 
    ## SOLVE Ax=B : EQUATION IS OF THE FORM: Ax = b where A:(I-dP), b = avg_amounts
    A = I - discount_factor * trans_matrix
    # take our constant as average amounts over desired period , as numeric vector
    b = as.vector(avg_amounts[['x']]) 
    # X_Values_named = matrix(data = Xs, ncol = 1, dimnames = list(stateslist))
    DLV_solutions = solve(A, b) 
    DLV_df <-  data.frame(data = cbind(avg_amounts, DLV_solutions))
    DLV_df['DLV_future_remainder'] <- DLV_df$data.DLV_solutions - DLV_df$data.x
    colnames(DLV_df) = c("state","Avg_Amounts_monthly", 
                         "DLV_Values","DLV_future_remainder")
    DLV_df$state <- factor(DLV_df$state, levels = stateslist)
    DLV_df <- DLV_df[order(DLV_df[["state"]]), ]  # reorder rows by the state levels
    DLV_df
}

### Using the General Function
#df <- setDF(df)
DLV_0.6 <- generate_DLV_dataset_beta(dataframe = df, discount_factor = 0.6)
DLV_0.6_2015 <- generate_DLV_dataset_beta(dataframe = setDF(dt_2015), discount_factor = 0.6)
DLV_0.4_2015 <- generate_DLV_dataset_beta(dataframe = setDF(dt_2015), discount_factor = 0.4)
DLV_0.4 <- generate_DLV_dataset_beta(dataframe = setDF(dt), discount_factor = 0.4)

#DLV_0.3_early <- generate_DLV_dataset_beta(dataframe = setDF(dt[dt$date<as.Date("2015-07-02"),]), discount_factor = 0.3)
#DLV_0.3_early
DLV_0.6
##                 state Avg_Amounts_monthly DLV_Values DLV_future_remainder
## New            daily1            6.277852   39.00389             32.72604
## Non           daily10           16.150704   51.49946             35.34875
## monthly1      daily25           33.738643   78.22754             44.48889
## monthly10    daily50+           78.615717  130.70535             52.08964
## monthly25    monthly1            8.230614   42.68653             34.45592
## monthly50+  monthly10           19.976802   53.01237             33.03557
## weekly1     monthly25           40.407341  102.62997             62.22262
## weekly10   monthly50+          133.208988  204.79248             71.58349
## weekly25          New            0.000000   45.20373             45.20373
## weekly50+         Non            0.000000   37.18583             37.18583
## daily1        weekly1            7.239903   58.90015             51.66025
## daily10      weekly10           16.527671   79.24088             62.71321
## daily25      weekly25           34.243213   87.83084             53.58762
## daily50+    weekly50+           97.233739  148.07637             50.84263

For the upcoming calculation (raising the discount factor to the power of the number of steps), we need a column with the ordinal number of each chronological timestep:

# since our steps are defined by monthyear, we'll convert it to its ordinal position.
# that is easy here because monthyear is a string that already sorts in ascending chronological order, so we can convert it to a factor (whose levels will be set in ascending order automatically) and then use the numeric codes of that factor.
dt$monthyear <- factor(dt$monthyear, ordered = TRUE)
dt$timestep <- as.numeric(dt$monthyear)
df <- setDF(dt)

Predict DLV’s Using The Markov Model

Now, finally, we will use our Markov model to predict DLVs per state, and later compare them to the "true" values on the test set to evaluate performance or calibrate our predictions.

generate_donor_DLVs_predictions <- function(df, discount, DLV_dataset, states_column = "state", 
                                             state_levels=state_levels, startdate=as.Date("2015-05-01")) {
   # First, since we are only predicting for the donors/users who were present at that start date, 
    # let's filter out data about other donors/users who were not present then. 
    # first, get a list of donors who were present at the start date
    setDT(df)
    donors_startdate <- unique(df$donor_uid[df$date == startdate])
    df_startdatedonors <- df[date >= startdate & donor_uid %in% as.vector(donors_startdate),]
    # map each donor / user to its starting state
    original_states_startdate <- df_startdatedonors[df_startdatedonors$date == 
                                                      startdate, .(donor_uid,state)]
    colnames(original_states_startdate) <- c("donor_uid","starting_state")
    
    # Merge to create a new dataset of donors with original states column
    df_startdatedonors <- merge(x = df_startdatedonors,
                                       y = original_states_startdate, 
                                by = "donor_uid", all.x = TRUE)
    
    # compute the per-timestep discount factor (discount raised to the power of the timestep number)
    ### NOTE: this assumes the start date is the first timestep of the data
    df_startdatedonors$discount_factor <- discount ^ df_startdatedonors$timestep # discount factor per timestep (month) 
    # discounted amounts
    df_startdatedonors$discounted_week_amounts <- (df_startdatedonors$avg_amount 
                                                        *  df_startdatedonors$discount_factor)
    # aggregate amounts per donor
    true_dlvs <- ddply(df_startdatedonors, .(donor_uid), 
                       summarise, true_dlv_amounts=sum(discounted_week_amounts))
    # merge into dataset of donor_uid, donor_original_state, sum_amounts
    donors_DLVs_bystate <- merge(x = original_states_startdate,
                                  y = true_dlvs, by = "donor_uid", all.x = TRUE)
    # add column of mean true value per state
    donors_DLVs_bystate <- ddply(donors_DLVs_bystate, "starting_state", 
                                  transform, true_dlv_mean  = mean(true_dlv_amounts))
    
    ## COMBINE WITH MARKOV ESTIMATED VALUES: ACCORDING TO PRIOR UNSEEN DATA 2015
    # add predicted values per state
    donors_DLVs_all <-  merge(x = donors_DLVs_bystate, 
                               y=DLV_dataset[,c("state",unlist(grep('DLV_Value',
                                                                    colnames(DLV_dataset),
                                                                    value=TRUE)))], 
                               by.x = 'starting_state', by.y = 'state', all.x = TRUE)
    donors_DLVs_all$starting_state <- factor(donors_DLVs_all$starting_state, 
                                                     levels = state_levels, ordered = T)
    return(donors_DLVs_all)
}


# run function: generate these datasets for the needed discount factors 
# combine true values with predictions for 0.6 discount
donors_DLVs_combined_0.6 <- generate_donor_DLVs_predictions(df=df, discount=0.6, DLV_dataset=DLV_0.6,
                                                            states_column = "state",
                                                            state_levels=state_levels,
                                                            startdate=as.Date("2015-05-01") )
fwrite(donors_DLVs_combined_0.6, "./data/donors_DLVs_combined_0.6.csv") # save datasets

# combine true values with predictions for 0.4, 2015
donors_DLVs_combined_0.4_2015 <- generate_donor_DLVs_predictions(df, discount=0.4, DLV_dataset=DLV_0.4_2015) 
fwrite(donors_DLVs_combined_0.4_2015, "./data/donors_DLVs_combined_0.4_2015.csv")

# create the DLV 0.3 object for the 2015 training data
DLV_0.3_2015 <- generate_DLV_dataset_beta(dataframe = setDF(dt_2015), discount_factor = 0.3)
# combine true values with predictions
donors_DLVs_combined_0.3_2015 <- generate_donor_DLVs_predictions(df, discount=0.3, DLV_dataset=DLV_0.3_2015) 
fwrite(donors_DLVs_combined_0.3_2015, "./data/donors_DLVs_combined_0.3_2015")

# show result
head(donors_DLVs_combined_0.6,3)
##   starting_state             donor_uid true_dlv_amounts true_dlv_mean
## 1         daily1      MARK|BOWEN|91367         12.77084      12.77084
## 2        daily10 BARBARA|BAHNSEN|96734         47.53204      30.30735
## 3        daily10   WAYNE|STAPLES|05491         13.08265      30.30735
##   DLV_Values
## 1   39.00389
## 2   51.49946
## 3   51.49946

Later we will compare these actual discounted total amounts with the predicted ones.

Now we have all the necessary datasets in place, both for communicating some values directly (like the transition matrix) and for creating the visualizations that communicate the trends effectively. Let's start creating those visualizations.


Visualizations of DLV Values and Predictions


From here onwards we use all our generated datasets to create plots that communicate our results and insights.
Let's start by visualizing: (1) the number of donors in the steady state versus reality, (2) the DLV values predicted under various discount factors alongside current donation amounts, and (3) a validation of our predictions, comparing predicted DLVs to the true DLV amounts of donors who started in a given state.

Plot 1: How far are the group sizes in reality from the Steady State?


Compare Steady state to true donor allocation

require(ggplot2)
require(scales)
require(ggthemes)

state_levels <- c("New", 'Non'  , 'monthly1' , 'monthly10', 'monthly25', 'monthly50+', 'weekly1' , 'weekly10' , 'weekly25' , 'weekly50+' , 'daily1', 'daily10', 'daily25', 'daily50+')


plot_steady_state_and_reality_compared <- function(fulldataframe = df, state_levels = state_levels, states_column = "state", steadystates = steady, analysis_title="") {

  # DF of donor proportions
  df_donor_props <- summary(fulldataframe[[states_column]]) / nrow(fulldataframe)
  
  df_donor_props_vertical <- t(rbind(steadystates, df_donor_props))
  colnames(df_donor_props_vertical) <- c("Steady_State", "Reality")
  melted_donors_props <- melt(df_donor_props_vertical)
  colnames(melted_donors_props) <- c("state","scenario","proportion")
  melted_donors_props$proportion <- as.numeric(as.character(melted_donors_props$proportion))
  melted_donors_props$state <- factor(melted_donors_props$state, levels=rev(state_levels), ordered = TRUE)
  melted_donors_props[order(melted_donors_props$state),]
  
  # PLOT 
  gg_steady_state_and_reality_compared <- ggplot(data=melted_donors_props, 
                                                 aes(x=state, y=proportion, 
                                                     fill=scenario)) +
    geom_bar(stat="identity", position=position_dodge(0.8)) + 
    coord_flip() + 
    scale_fill_discrete(name = "Scenario: ", labels = c("Predicted Steady State      ", "Reality (2015-2016 Avg)"), h =  c(10,170)) +
    labs(x = "Donor State (in Steady State vs Reality)", y = "Proportion of Donors", 
         title = paste0(analysis_title, "Donor Proportions in Steady State / Reality")) +
    theme_minimal()  + theme(legend.position = "top") +
    #scale_x_continuous(limits = c(0,0.6)) +
    theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold"), 
          plot.subtitle = element_text(family = "Helvetica"))  + 
    scale_fill_ptol() 
  
  ggsave("./dataviz/gg_steady_state_and_reality_compared.png", height = 6, width = 9)
  return(gg_steady_state_and_reality_compared)
}

gg_steady_state_and_reality_compared <- plot_steady_state_and_reality_compared(fulldataframe=df, state_levels=state_levels)
# the ggplotly wrapper function turns a ggplot object into an interactive web plotly object!
require(plotly)  # needed for ggplotly (assuming it wasn't already loaded earlier)
ggplotly(width = 950, height = 650,  p=gg_steady_state_and_reality_compared)
  • Notice that the bar plot above follows some important visualization principles:
  • The bars are horizontal rather than vertical, since the list of categories is long; the labels would be too compressed and unreadable if the categories were laid out along the horizontal axis.
  • The colors are contrasting colors from the ptol scheme, which is easy on the eye and friendly to color blindness.
  • The legend sits at the top and conveniently tells us exactly what we need to know.
  • The bars are arranged in the manually specified order, from the categories contributing the least to those contributing the most, which also roughly corresponds to the number of donors per category from most to least (an inverse relationship).

Plot 2: Visualize Donor Lifetime Values (DLVs) as stacked bar plots


Now we want to look at the donor lifetime values themselves. Since these depend on the discount factor, let's visualize each state's lifetime value under several discount factors and compare them. If the ranking (or the proportions) between states stays the same, that keeps things simple for us and indicates that the relative profitability of the states is stable regardless of the time horizon.
For that, let's first generate another dataset of discounted lifetime values:

DLV_0.8 <- generate_DLV_dataset_beta(dataframe = df, discount_factor = 0.8)

Now let's plot, on the same chart, both the current donation amounts and two levels of discounted predicted lifetime values.

plot_DLV_stacked_barplot_0.6_0.8 <- function(DLV_0.6,DLV_0.8, beta_1=0.6, beta_2=0.8, analysis_title="") {
    library("ggthemes")
  
    DLV_0.6_08 <- cbind(DLV_0.6,DLV_0.8[c(grep('Value', colnames(DLV_0.8), value=T),
                                            grep('remainder', colnames(DLV_0.8), value=T))])
    colname_DLV_value_1 <- paste0(beta_1,"_DLV_Value")
    colname_DLV_value_2 <- paste0(beta_2,"_DLV_Value")
    colname_DLV_remainder_1 <- paste0(beta_1,"_DLV_remainder")
    colname_DLV_remainder_2 <- paste0(beta_2,"_DLV_remainder")
    
    colnames(DLV_0.6_08) <- c("state", "Avg_monthly_Amounts", colname_DLV_value_1,
                              colname_DLV_remainder_1 , colname_DLV_value_2, colname_DLV_remainder_2)
    DLV_0.6_08_usable <- DLV_0.6_08[,c("state","Avg_monthly_Amounts", 
                                       colname_DLV_remainder_1, colname_DLV_remainder_2)]
    # Melt Dataframe to Long format for ggplot
    melted_dlv <- melt(DLV_0.6_08_usable,id.vars = "state") 
    melted_dlv$variable <- factor(melted_dlv$variable, 
                           levels=c(colname_DLV_remainder_2, colname_DLV_remainder_1,"Avg_monthly_Amounts"), ordered=TRUE)
    
    # Plot stacked bar plot
    gg_stacked_barplot_DLV_amounts_monthly <- ggplot(data = melted_dlv, aes(x = state, y = value,fill=variable)) +
       geom_bar(stat = 'identity') + 
       coord_flip() + scale_x_discrete(limits = rev(levels(melted_dlv$state))) + 
       labs(title = paste0(analysis_title, "Donors Current Donations and Lifetime Values"),
            subtitle = paste0("Discount factors = ",beta_1," and ",beta_2,
                              ". \nDLV: Predicted total discounted lifetime value"),
            y = "Value (Discounted Amounts)", x = "State", fill = "DLV Value Portion   ") +
        theme_fivethirtyeight() + 
        theme(legend.position = "top") +
        guides(fill = guide_legend(reverse=T)) +
        scale_fill_brewer(palette = 15) # scale_fill_ptol() +

    ggsave("./dataviz/gg_stacked_barplot_DLV_amounts.png", width = 6, height = 4)
    gg_stacked_barplot_DLV_amounts_monthly
}

gg_stacked_barplot_DLV_amounts_monthly <- plot_DLV_stacked_barplot_0.6_0.8(DLV_0.6, DLV_0.8, beta_1=0.6, beta_2=0.8)

ggplotly(width = 950, height = 650, p=gg_stacked_barplot_DLV_amounts_monthly)
## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53

Plot 3.a: Visualize Model Prediction Accuracy: histograms of the true DLV distributions, and where the estimated DLVs fall on them


Comparing Markov Predicted DLV with actual amounts of data.

As a methodological check, we could merely report statistical significance or confidence intervals. However, those are not very useful here: they don't clearly show the underlying distributions of the population or of the predictions, nor where our predicted value and the true value sit on them. Showing the distributions is a much more comprehensive and understandable way to convey our confidence than summary statistics like confidence intervals.

donors_DLVs_combined_0.6_2015 <- generate_donor_DLVs_predictions(df, discount=0.6, DLV_dataset=DLV_0.6_2015) 
setDT(donors_DLVs_combined_0.6)
colnames(donors_DLVs_combined_0.6) <- c("starting_state", unlist(colnames(donors_DLVs_combined_0.6)[2:5]))
unique(donors_DLVs_combined_0.6[, c("starting_state", "true_dlv_mean")])
##     starting_state true_dlv_mean
##  1:         daily1     12.770835
##  2:        daily10     30.307347
##  3:       monthly1     10.709987
##  4:      monthly10     21.641112
##  5:      monthly25     38.659149
##  6:     monthly50+    108.321877
##  7:            Non      4.663363
##  8:        weekly1     11.828701
##  9:       weekly10     21.750778
## 10:       weekly25     43.942132
## 11:      weekly50+    131.598858
# save datasets
#fwrite(donors_DLVs_combined_0.4_2015, "./donors_DLVs_combined_0.4_2015.csv")
fwrite(donors_DLVs_combined_0.6_2015, "./donors_DLVs_combined_0.6_2015.csv")

Now that we have the needed dataset of true DLV values, let’s visualize their distributions per state, and our predicted state DLV.

plot_DLV_preds_dists <- function(DLV_dataset, discount, data_timeframe_str = "All data", analysis_title="") {
      
      # rename columns in dataset for convenience
      names(DLV_dataset)[names(DLV_dataset) == grep('DLV_Value', colnames(DLV_dataset), value=T)] <- "DLV_Values" 
      names(DLV_dataset)[names(DLV_dataset) == grep('true_dlv_mean', colnames(DLV_dataset), value=T)] <- "true_dlv_mean"
      errors =  abs(DLV_dataset[['DLV_Values']] - DLV_dataset[['true_dlv_mean']]) 
      
      # build ggplot
      gg_dists <- ggplot(DLV_dataset, aes(x=true_dlv_amounts), group = starting_state) +
        
        # histogram of true values per state, y=..density..:
        geom_histogram(aes(y=..density..), position="identity", 
                       fill='steelblue', alpha=0.5,  binwidth=5, na.rm=TRUE) + #fill = 'steelblue'
        
        # true values mean:
        geom_vline(data=DLV_dataset, aes(xintercept = true_dlv_mean, colour = 'True DLV Mean'), 
                   linetype="dashed", size=1, show.legend=T, na.rm=TRUE) +
        
        
        # estimated markov value:
        geom_vline(data=DLV_dataset, aes(xintercept = DLV_Values, colour='Predicted DLV'),
                   linetype="dashed", size=1, show.legend=T, na.rm=TRUE) +
        
        
        # make into a grid with free scales 
        facet_wrap( ~ starting_state,  scales = "free", labeller =label_both) +
        scale_colour_manual(name='',values=c('True DLV Mean'='gold', 'Predicted DLV'='red')) +
        xlim(0, 205) + # set limits of X axis scale
        #theme_base() + 
        theme(legend.position="top"
              #, plot.title = element_text(size=18, family = "Helvetica", face = "bold"), 
              #plot.subtitle = element_text(size=12, family = "Helvetica")
              ) + 
         labs(x = "Total DLV Amounts", y = "Density",
             title = paste0(analysis_title, "True DLVs Distributions with Predicted DLV, Discount Factor: ", discount),
             subtitle = paste0("Trained on ", data_timeframe_str, ". What were the true total DLV amounts donated by donors who started in 04/2015 at each of the following states?
Mean Error: ", round(mean(errors),2), " . Median error: ", round(median(errors),2), " over the true values") ) 
      
      ggsave(filename = paste0("./dataviz/gg_DLV_states_distributions__",discount,"_", gsub('([[:punct:]])|\\s+','_',data_timeframe_str) ), 
             plot = gg_dists, device = "png", width = 10, height = 8)
      gg_dists
}

Plot distributions for both true values and predictions based on a 0.6 discount factor:

# Plot!
#require(cowplot)
plot_DLV_preds_dists(donors_DLVs_combined_0.6, discount = 0.6, 
                                      data_timeframe_str = "only 2015 data")

And if they are both based on the 0.3 discount factor, and only on training data from 2015:

plot_DLV_preds_dists(donors_DLVs_combined_0.3_2015, discount = 0.3, 
                                      data_timeframe_str = "only 2015 data")

Performance evaluation

The visualization above shows the distributions of true DLVs for all donors, their mean in yellow, and our predicted DLV in red. The predictions consistently overshoot the true means. What can we do about it?
1. The simplest option is to continue as-is and remember to discount our predictions by a factor of roughly a half.
2. We can try to improve the model itself. For example, instead of a 1-step Markov process, we could look further back into the past in a 2-step (or higher-order) Markov process, or redefine the quantity used for prediction so it accounts for more of the past.
3. We can use ensemble models: given our current model's results as the predicted y and the true y labels, we can overlay another model, perhaps a simple regression, that learns to map one onto the other. This is promising if our errors are consistent. We could use factor(state) as a regressor (equivalent to one-hot encoding each state), plus the predicted y, with the true y as the label, essentially a "fixed effects" regression; a sketch follows below. This could improve the results significantly, but it doesn't fix the model itself, which would be preferable if we can manage it.
4. Since we see consistent overestimation, there might be a simpler fix: calibrating the discount factor, as a proxy, so the model generates the results we want. Let's check the predictions obtained with other discount factors and how well they match our desired true values (DLV amounts with a 0.6 discount factor).
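As a sketch of option 3, here is one simple variant of such a calibration layer, assuming the donors_DLVs_combined_0.6 dataset created above (with its true_dlv_amounts and DLV_Values columns): a linear map from the predicted DLV to the true discounted amounts. The fuller version described above would also include factor(starting_state) as regressors; DLV_calibrated is just an illustrative column name.

# Sketch: a simple calibration regression on top of the Markov DLV predictions.
# Assumes donors_DLVs_combined_0.6 as generated above.
calib <- lm(true_dlv_amounts ~ DLV_Values, data = donors_DLVs_combined_0.6)
summary(calib)   # a slope well below 1 would confirm the consistent overestimation
# calibrated per-donor predictions:
donors_DLVs_combined_0.6$DLV_calibrated <- predict(calib, newdata = donors_DLVs_combined_0.6)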

# we need the true DLVs with 0.6 discount factor, 
# and the predictions for 0.4 values (or similar),
# and see if they match better. 
# if they do, we can just change our reference point of discount factor to fit reality. 

f_combine_pred_true_dts_2discounts <- function(preds_dt, true_dlvs_dt) {
  setDT(true_dlvs_dt)
  setDT(preds_dt)
  # renaming columns for compatibility
  # colnames(preds_dt) <- c("starting_state", unlist(colnames(preds_dt)[2:5]))
  DLVs_pred_train_unique <- unique(preds_dt[, .(starting_state, true_dlv_mean, DLV_Values)])
  DLVs_trues_unique <- unique(true_dlvs_dt[, .(starting_state, true_dlv_mean, DLV_Values)])
  
  # change column names to include the name / type of dataset before merging
  colnames(DLVs_pred_train_unique) <- c("starting_state", "pred_dlvs_true_train", 
                                      "dlv_pred_train")
  colnames(DLVs_trues_unique) <- c("starting_state", "true_dlvs", 
                                 "true_dlvs_pred")
  setDF(DLVs_pred_train_unique)
  setDF(DLVs_trues_unique)
  # merge pred train predictions and trues all true values
  #setkey(DLVs_pred_train_unique, starting_state)
  #setkey(DLVs_trues_unique, starting_state)
  DLV_preds_predtrain_truesall <- merge(DLVs_pred_train_unique[c("starting_state", 
                                                                 "dlv_pred_train")], 
                                     DLVs_trues_unique[c("starting_state", "true_dlvs")])
  DLV_preds_predtrain_truesall
}

# run function: merge predictions and true datasets of different discount factors.
DLV_preds_0.4train_0.6all <- f_combine_pred_true_dts_2discounts(donors_DLVs_combined_0.4_2015, 
                                                                donors_DLVs_combined_0.6)
DLV_preds_0.4train_0.6all
##    starting_state dlv_pred_train  true_dlvs
## 1          daily1       19.21751  12.770835
## 2         daily10       29.15468  30.307347
## 3        monthly1       21.74505  10.709987
## 4       monthly10       33.46116  21.641112
## 5       monthly25       67.11091  38.659149
## 6      monthly50+      165.81957 108.321877
## 7             Non       15.15070   4.663363
## 8         weekly1       32.22563  11.828701
## 9        weekly10       52.92302  21.750778
## 10       weekly25       53.39368  43.942132
## 11      weekly50+      118.06194 131.598858

The predictions are still overshooting the true values, by roughly a factor of two. Let’s try a lower discount factor.

# DLV_0.3_2015 <- generate_DLV_dataset_beta(dataframe = setDF(dt_2015), discount_factor = 0.3)
# donors_DLVs_combined_0.3_2015 <- generate_donor_DLVs_predictions(df, discount=0.3, DLV_dataset=DLV_0.3_2015) 

# fwrite(donors_DLVs_combined_0.3_2015, "./data/donors_DLVs_combined_0.3_2015")
# donors_DLVs_combined_0.3_2015

# run function: merge predictions and true datasets of different discount factors.
#DLV_0.4_2015 <- generate_DLV_dataset_beta(dataframe = setDF(dt_2015), discount_factor = 0.4)
colnames(donors_DLVs_combined_0.3_2015) <- c("starting_state", colnames(donors_DLVs_combined_0.3_2015)[2:4], "DLV_Values")
DLV_preds_0.3train_0.6all2 <- f_combine_pred_true_dts_2discounts(preds_dt =
                                                                   donors_DLVs_combined_0.3_2015,
                                                                 true_dlvs_dt =
                                                                   donors_DLVs_combined_0.6)
DLV_preds_0.3train_0.6all2
##    starting_state dlv_pred_train  true_dlvs
## 1          daily1       19.21751  12.770835
## 2         daily10       29.15468  30.307347
## 3        monthly1       21.74505  10.709987
## 4       monthly10       33.46116  21.641112
## 5       monthly25       67.11091  38.659149
## 6      monthly50+      165.81957 108.321877
## 7             Non       15.15070   4.663363
## 8         weekly1       32.22563  11.828701
## 9        weekly10       52.92302  21.750778
## 10       weekly25       53.39368  43.942132
## 11      weekly50+      118.06194 131.598858

Let’s check if one is better correlated:

print(cor(DLV_preds_0.3train_0.6all2$dlv_pred_train, DLV_preds_0.3train_0.6all2$true_dlvs))
## [1] 0.9175808
print(cor(DLV_preds_0.4train_0.6all$dlv_pred_train, DLV_preds_0.4train_0.6all$true_dlvs))
## [1] 0.9175808

In early runs it looked like our first model was slightly better correlated, but both are in fact very highly correlated and very similar; in later runs they were exactly the same. Which is actually closer in absolute terms? Let’s calculate the error metrics (RMSE, R-squared, MAE):

# install.packages("caret")
require(caret)
print(caret::postResample(pred = DLV_preds_0.3train_0.6all2$dlv_pred_train, obs = DLV_preds_0.3train_0.6all2$true_dlvs))
##       RMSE   Rsquared        MAE 
## 23.7221984  0.8419546 18.3135354
print(caret::postResample(pred = DLV_preds_0.4train_0.6all$dlv_pred_train,  obs = DLV_preds_0.4train_0.6all$true_dlvs))
##       RMSE   Rsquared        MAE 
## 23.7221984  0.8419546 18.3135354

So we see that the R-squared is high and, in this run, identical for both (about 0.84), as are the RMSE and MAE. Since the two are effectively interchangeable here, we will go with the 0.3 discount factor for now as our proxy estimate of what a 0.6 discount factor would be.
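More generally, instead of checking candidate discount factors one at a time, we could scan several and compare each against the 0.6 true DLVs. A minimal sketch, assuming generate_DLV_dataset_beta and generate_donor_DLVs_predictions keep the signatures used in the commented code above, and reusing the same column rename applied to the 0.3 run:

# Sketch: scan candidate training discount factors against the 0.6 true DLVs.
candidate_discounts <- c(0.2, 0.3, 0.4, 0.5)
scan_results <- lapply(candidate_discounts, function(d) {
  dlv_d   <- generate_DLV_dataset_beta(dataframe = setDF(dt_2015), discount_factor = d)
  preds_d <- generate_donor_DLVs_predictions(df, discount = d, DLV_dataset = dlv_d)
  # rename the prediction column to "DLV_Values" for compatibility, as done above
  colnames(preds_d) <- c("starting_state", colnames(preds_d)[2:4], "DLV_Values")
  combined <- f_combine_pred_true_dts_2discounts(preds_d, donors_DLVs_combined_0.6)
  c(discount = d,
    caret::postResample(pred = combined$dlv_pred_train, obs = combined$true_dlvs))
})
do.call(rbind, scan_results)  # pick the training discount factor with the lowest RMSE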

Re-evaluating Model Performance: Recalibrating the Discount Factor

Now let’s check performance under the adjusted discount factor, using 0.3 as a stand-in for 0.6. Let’s create a matched combined dataset with the true DLVs at 0.6 and the predictions computed with 0.3:

# take the predictions from here: DLV_Values
# I've made the plotting function take as the prediction value any column with "DLV_Value" in its name
# so I'm going to keep that in the name, and remove that part of the name from the other one.
colnames(donors_DLVs_combined_0.3_2015) <- c(colnames(donors_DLVs_combined_0.3_2015)[1:4], "DLV_Values_Preds_0.3_train")

# and copy into true DLV values (amounts and means) from here - let's rename it so it would be clearer and not treated as predictive DLV Values
colnames(donors_DLVs_combined_0.6_2015) <- c(colnames(donors_DLVs_combined_0.6_2015)[1:4], "old_DLVs_0.6_train")
# we'll set both DT keys as donor_ids and then match on that. 
setDT(donors_DLVs_combined_0.3_2015)
setDT(donors_DLVs_combined_0.6_2015)
setkey(donors_DLVs_combined_0.3_2015, donor_uid)
setkey(donors_DLVs_combined_0.6_2015, donor_uid)
dt_DLVs_preds <- merge(donors_DLVs_combined_0.6_2015, donors_DLVs_combined_0.3_2015[,c("donor_uid","DLV_Values_Preds_0.3_train")])
# show resulting dataframe
head(dt_DLVs_preds,3)
##                       donor_uid starting_state true_dlv_amounts
## 1:         'STANLEY|LEHTO|98597            Non        8.8845466
## 2: (ANJANEYA)MURTY|KANURY|97330      monthly25       23.9719126
## 3:     (JAMES) DAVID|RICH|85308            Non        0.4702925
##    true_dlv_mean old_DLVs_0.6_train DLV_Values_Preds_0.3_train
## 1:      4.663363           34.40281                   15.15070
## 2:     38.659149           93.90574                   67.11091
## 3:      4.663363           34.40281                   15.15070

Plot 3.b: Recalibrated Model Performance: True DLV Values Distributions vs Prediction

Plotting the distributions of the true values with the predicted value overlaid on top:

# set names back to original
colnames(donors_DLVs_combined_0.3_2015) <- c(colnames(donors_DLVs_combined_0.3_2015)[1:4], "DLV_Values")
# now let's plot again using our previous function for plotting
plot_DLV_preds_dists(dt_DLVs_preds, discount = "0.3 (->0.6)", data_timeframe_str = "only 2015 data")

This is much better!
Now we see that the predicted DLV values (red dashed lines) are quite close to the true observed population means (yellow dashed lines) and always fall within a central area of the true data distribution. All of this was achieved with the very simple 1-step Markov model, without any ensembling or added model complexity, just by adjusting the discount factor used as the reference point.

Visualizing Historical Running Values


Now we’d also want to know how our estimates change over time. I will estimate this by computing the relevant values from above (actual donation amounts per state, number of donors per state, transition probabilities, and discounted lifetime values) using only each month’s data separately. Notice that this is conceptually different from our overall values, which are based on the aggregate of many months of history rather than, say, only the latest month. You might think we should instead compute each month’s values from all the history available up to that point (for example, for August’s values, use all months from the start of the data in April through August). However, that would regress toward the mean and smooth away exactly the seasonal fluctuations this plot is meant to reveal. So even though the per-month data are sparser and some values are therefore less reliable, this is still the most straightforward, transparent, and least biased way to go about it. A minimal sketch of the per-month idea follows.
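To make the per-month idea concrete, here is a minimal sketch (assuming the long-format data frame df already carries the date, state and state_prev columns shown in the dt extract further below) of computing transition probabilities from each month’s rows alone:

# Sketch: per-month transition probabilities, computed from each month's rows only.
library(data.table)
dt_tmp <- as.data.table(df)
monthly_transitions_sketch <- dt_tmp[!is.na(state_prev), .N, by = .(date, state_prev, state)]
monthly_transitions_sketch[, prob := N / sum(N), by = .(date, state_prev)]
head(monthly_transitions_sketch[order(date, state_prev, -prob)])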

Preprocessing: Add Date Column for Plotting

Until now we mostly used a “monthyear” column that was only meant to convey ordering. For plotting dates smoothly on the X axis with ggplot we need an actual date column, which not all of our datasets have yet.

# let's create a function that converts month-year column to a date column
convert_monthyear_to_date <- function(any_df, replace=T) {
  # if "date" column doesn't exist yet, create it from monthyaer
  if ( !( any(names(any_df) == 'date')) ) {  
    any_df[['date']] = as.Date(paste0(any_df$monthyear, "-01") )
    if (replace) {any_df$monthyear <- NULL} }
  any_df
}
# replace monthyear with date:
monthly_diagonals <- convert_monthyear_to_date(monthly_diagonals, replace = T)
monthly_trans_from_dailylow <- convert_monthyear_to_date(monthly_trans_from_dailylow, replace = T)
# add date while keeping monthyear column (and filter new ones without history):
monthly_amount_avgs <- convert_monthyear_to_date(monthly_amount_avgs[monthly_amount_avgs$state!='New',], replace = F)

Plot 4: Donations Flow: Monthly Average Donation Amounts


### Flow: monthly Amount Average
require(directlabels)

plot_monthly_amount_flows <- function(df_monthly, state_levels = state_levels, analysis_title="") {
    
    df_monthly$state <- factor(df_monthly$state, levels = state_levels)
    gg_monthly_flow_amount_avg <- 
      ggplot(data = df_monthly, aes(x = date, y = avg_amount, color=state)) +
      #geom_point(size=0.8) +
      geom_line(size=0.8, alpha = 0.6)  + 
       labs(title = paste0(analysis_title, "History of Average Monthly Donations From Each State"), 
        subtitle = "What was the average donation by donors in each state each month?",
        y = "Average Monthly Donation in $USD", x = "Month (2015-2016)", color = "Donor State  ") +
        ##geom_dl(aes(label = state), method = list(dl.trans(x = x + 0.1), "last.points", cex = 0.6)) +
        ##geom_dl(aes(label = state), method = list(dl.trans(x = x - 0.1), "first.points", cex = 0.7)) +
        theme_minimal() +
        theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold"), 
              plot.subtitle = element_text(size=12, family = "Helvetica"))

    ggsave(filename = "./dataviz/gg_monthly_flow_amount_avg", plot = gg_monthly_flow_amount_avg, device = "png", width = 10, height = 6)
    gg_monthly_flow_amount_avg
  }

# plot!

ggplot_monthly_amount_flows <- plot_monthly_amount_flows(df_monthly= monthly_amount_avgs, state_levels = state_levels) 
print("HERE REMOVED GG Direct Lables")
## [1] "HERE REMOVED GG Direct Lables"
ggplotly(width = 950, height = 650, p=ggplot_monthly_amount_flows)

Great. The plot above shows that the actual donations per state didn’t change much over time. Almost all states stayed relatively constant throughout, while only the weekly and daily high-amount donors varied more widely.

This makes sense for two reasons:
1. It is partly a result of how I segmented the states: the 50+ states include all donation amounts above $50, while the other states each span a small range of up to $25, so the 50+ states naturally have much more variance.
2. The daily and weekly states have far fewer donors than the monthly states, so a small change in one or two donors can swing the state’s average considerably. A quick check of the per-state spread is sketched below.
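As a quick sanity check of that variance claim, here is a sketch (assuming monthly_amount_avgs keeps the state and avg_amount columns used in the plot above) of the spread of the monthly average donation within each state:

# Sketch: standard deviation of the monthly average donation within each state.
library(data.table)
spread_by_state <- as.data.table(monthly_amount_avgs)[
  , .(sd_avg_amount = sd(avg_amount, na.rm = TRUE)), by = state][order(-sd_avg_amount)]
spread_by_state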

Seeing that this is acceptable, let’s move on.

Plot 5: Number of donors per state throughout time


### Number of donors per state throughout time
generate_num_donors_groups <- function(df, states_column="state") {
    library(plyr)
    num_donors_groups <- ddply(.data = df[c("donor_uid","date",states_column)],
                                    .variables = .(date, eval(parse(text=states_column))), .fun = nrow)
    colnames(num_donors_groups) <- c("date","state","Num_donors")
    return(num_donors_groups)
}

plot_donor_groupsize_timeline <- function(num_donors_groups, analysis_title = '') {
    
    ### Number of donors per group throughout time history
    gg_donors_groupsizes_timeline <- 
      ggplot(data = num_donors_groups, aes(x = date, y = Num_donors, color=state)) +
       geom_line(size=1, alpha = 0.7)  +
       labs(title = paste0(analysis_title, "Monthly Number of Donors In Each State"), 
        subtitle = "Each month, how many donors were there in each state (group)?",
        y = "Number of Donors (in State S that Month)", x = "Month", color = "Donor State") +
        theme_minimal()  + 
        #geom_dl(aes(label = state), method = list(dl.trans(x = x + 0.1), "last.points", cex = 0.6)) +
        #geom_dl(aes(label = state), method = list(dl.trans(x = x - 0.1), "first.points", cex = 0.7)) + 
        theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold"), 
              plot.subtitle = element_text(family = "Helvetica"))  
    ggsave(filename = "./dataviz/gg_donors_groupsizes_timeline", device = "png", width = 10, height = 6)
    gg_donors_groupsizes_timeline
}

# generate needed dataset
num_donors_groups <- generate_num_donors_groups(setDF(df))

# plot the timeline!
gg_donors_groupsizes_timeline <- plot_donor_groupsize_timeline(num_donors_groups)
ggplotly(width = 950, height = 650, p=gg_donors_groupsizes_timeline)

This group-sizes plot reveals a few main conclusions (see the quick check below):
1. Many new donors join as time passes. We can detect that by the decrease in the “Non” count, which is not surprising.
2. Monthly donors join the most, and fewer donors join or stay as the group’s frequency increases.
3. The lower the donation amount, the more donors join and keep donating.
All in all, it is an intuitive conclusion that there are more donors the lower the commitment is, whether in amount or frequency. It also suggests something important: frequency seems to matter more than amount, since there are more monthly high-amount donors than weekly or daily low-amount donors. Perhaps a higher frequency signals more “commitment” for most people in our sample, which would be an exciting area for further investigation.
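As a quick check of these claims, here is a sketch of the group sizes in the final observed month, using the num_donors_groups table generated above:

# Sketch: group sizes in the last observed month, largest first.
setDT(num_donors_groups)
num_donors_groups[date == max(date)][order(-Num_donors)]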

Plot 6: Transition Probabilities History: Values on the Diagonal (repeating states) each month


# Transition Matrix Throughout Time: Values on the Diagonal (transition to self) each month
plot_transition_matrix_monthly_history <- function(monthly_df, titletext, 
                                                   analysis_title = '', subtitletext='', melt_it=T) {
  # if the dataset is a matrix, melt it first
  if (melt_it) {monthly_df <-  melt(monthly_df, id.vars = "date")}
  colnames(monthly_df) <- c("date", "state", "transition_prob")
  gg_transitions_matrix_timeline <-
    ggplot(data = monthly_df, aes(x = date, y = transition_prob, color=state)) +
     geom_line(size=1, alpha = 0.7)  +
     # geom_smooth(method = 'loess', span = 0.4, alpha = 0.1, size=0.3) + 
     labs(title = paste0(analysis_title, "Monthly Transition Probabilities: ",titletext), 
      subtitle = subtitletext,
      y = "Probability of State Repetition", x = "Month", color = "Donor State") +
      theme_minimal()  + 
      #geom_dl(aes(label = state), method = list(dl.trans(x = x + 0.1), "last.points", cex = 0.4)) +
      #geom_dl(aes(label = state), method = list(dl.trans(x = x - 0.1), "first.points", cex = 0.5)) + 
      theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold"), 
            plot.subtitle = element_text(family = "Helvetica"))  #+ scale_colour_ptol() 
  ggsave(filename = "./dataviz/gg_transitions_matrix_timeline", device = "png", width = 10, height = 6)
  gg_transitions_matrix_timeline
}
# clear out empty column with no transitions to NEW or repetitions of NEW by definitions
# monthly_diagonals <- convert_monthyear_to_date(monthly_diagonals)
# monthly_diagonals$New <- NULL
# Plot DIAGONAL values history
ggplot_plot_transition_matrix_monthly_history1 <- 
plot_transition_matrix_monthly_history(monthly_df = monthly_diagonals, 
                                       titletext = "Repeats", 
                                       subtitletext = "Probability Of Repeating The Same State",
                                       melt_it = T)
ggplotly(width = 950, height = 650, p=ggplot_plot_transition_matrix_monthly_history1)

So the above plot shows us a few things:
1. Most transitions are relatively stable, not moving by much more than 10 percentage points over their history (the weekly and monthly states).
2. Some states had huge fluctuations, particularly daily25, jumping from 0 to 1 in September 2015 and then falling back toward zero.
How can this be? It could very well be an artifact of a too-small sample. Let’s check if that’s the case:

setDT(dt)
dt[state=="daily25" & date>=as.Date("2015-08-30") & date<as.Date("2015-10-02") ,]
##          V1                 donor_uid monthyear total_amount avg_amount
##  1:  726236       BRIAN|PICKENS|85008   2015-09       398.06   30.62000
##  2:  958710    CHARLES|FORWERCK|49684   2015-09       534.86   29.71444
##  3: 3087926      JOSE|RODRIGUEZ|00907   2015-10       458.83   41.71182
##  4: 3477887      KENNETH|LOWDEN|01749   2015-10       600.00   50.00000
##  5: 3731029           LEO|CANTY|06095   2015-09       458.00   32.71429
##  6: 4188277     MARY|SCHWEITZER|89450   2015-09       345.00   31.36364
##  7: 4245599         MATTHEW|LEE|67208   2015-10       346.00   26.61538
##  8: 5237831         ROBERT|PRUS|87505   2015-09       629.86   48.45077
##  9: 5882596      SUZANNE|SEARLE|02026   2015-09       499.95   33.33000
## 10: 5882597      SUZANNE|SEARLE|02026   2015-10       666.60   33.33000
## 11: 6269367     WILLARD|DEMPSEY|66502   2015-09       285.10   25.91818
## 12: 6303423 WILLIAM|MAISTRELLIS|02152   2015-09       545.00   41.92308
##     n_donations_month n_donations_life months_donated cluster   state
##  1:                13               21              6       2 daily25
##  2:                18               68              8       2 daily25
##  3:                11               57              7       4 daily25
##  4:                12               10              5       4 daily25
##  5:                14               15              5       2 daily25
##  6:                11               77             10       2 daily25
##  7:                13               55              7       2 daily25
##  8:                13               57              7       4 daily25
##  9:                15               96             10       2 daily25
## 10:                20               96             10       2 daily25
## 11:                11               31              7       2 daily25
## 12:                13               47              6       4 daily25
##     state_prev state_num state_prev_num       date timestep
##  1:   weekly10        13             22 2015-09-01        5
##  2:   weekly25        13             23 2015-09-01        5
##  3:  weekly50+        13             24 2015-10-01        6
##  4:   weekly25        13             23 2015-10-01        6
##  5:        Non        13              0 2015-09-01        5
##  6:   weekly25        13             23 2015-09-01        5
##  7:    daily10        13             12 2015-10-01        6
##  8:   weekly25        13             23 2015-09-01        5
##  9:    daily25        13             13 2015-09-01        5
## 10:    daily25        13             13 2015-10-01        6
## 11:  monthly25        13             33 2015-09-01        5
## 12:        Non        13              0 2015-09-01        5

Yes, indeed: by filtering we see that there were too few donors making that transition at that time for a 100% transition probability to be meaningful. Let’s drop state-month combinations with fewer than 10 donors:

setDT(num_donors_groups)
num_donors_groups[Num_donors<10]
##           date     state Num_donors
##  1: 2015-05-01    daily1          1
##  2: 2015-05-01   daily10          2
##  3: 2015-06-01    daily1          3
##  4: 2015-06-01   daily10          5
##  5: 2015-07-01    daily1          6
##  6: 2015-07-01   daily10          1
##  7: 2015-07-01  daily50+          1
##  8: 2015-08-01 weekly50+          9
##  9: 2015-08-01    daily1          3
## 10: 2015-08-01   daily10          1
## 11: 2015-08-01   daily25          1
## 12: 2015-08-01  daily50+          1
## 13: 2015-09-01   daily25          8
## 14: 2015-09-01  daily50+          3
## 15: 2015-10-01   daily25          4
## 16: 2015-10-01  daily50+          2
## 17: 2015-11-01    daily1          6
## 18: 2015-11-01   daily10          5
## 19: 2015-11-01  daily50+          1

Sadly, there are 19 such state-month combinations (out of 124) for which we can’t draw a reliable conclusion. Let’s drop those from our matrices:

remove_smallsample_transitions <- function(monthly_matrix, num_donors_groups) {
  matrix_melted <- melt(monthly_matrix, id.vars = "date")
  colnames(matrix_melted) <- c("date", "state", "transition_prob")
  # num_donors_toosmall <- num_donors_groups[Num_donors<10, ]
  num_donors_enough <- num_donors_groups[Num_donors>=10, ]
  # left join for only rows of num_donors_enough:
  matrix_melted_large <- merge(num_donors_enough[, c("date", "state")], matrix_melted, all.x = TRUE)
  matrix_melted_large <- matrix_melted_large[complete.cases(matrix_melted_large), ]
  matrix_melted_large
}

monthly_diagonals_largeNs <- remove_smallsample_transitions(monthly_matrix = monthly_diagonals, num_donors_groups = num_donors_groups)
monthly_diagonals_largeNs
##            date      state transition_prob
##   1: 2015-05-01        Non       0.0000000
##   2: 2015-05-01   monthly1       0.0000000
##   3: 2015-05-01  monthly10       0.0000000
##   4: 2015-05-01  monthly25       0.0000000
##   5: 2015-05-01 monthly50+       0.0000000
##  ---                                      
## 101: 2016-02-01  weekly50+       0.2953890
## 102: 2016-02-01     daily1       0.4622356
## 103: 2016-02-01    daily10       0.4285714
## 104: 2016-02-01    daily25       0.3700787
## 105: 2016-02-01   daily50+       0.2400000
trans_matrix <- spread(as.data.frame(monthly_diagonals_largeNs), key = state, value = transition_prob, fill = 0,  drop = FALSE) 
# show transition matrix
trans_matrix
##          date New       Non  monthly1 monthly10 monthly25 monthly50+
## 1  2015-05-01   0 0.0000000 0.0000000 0.0000000 0.0000000  0.0000000
## 2  2015-06-01   0 0.8601765 0.4940459 0.3697429 0.2724654  0.2056265
## 3  2015-07-01   0 0.8684338 0.4475266 0.2392017 0.2427718  0.1725108
## 4  2015-08-01   0 0.8460426 0.5727861 0.3458286 0.3020536  0.2016585
## 5  2015-09-01   0 0.7104557 0.7134541 0.5356326 0.4124901  0.3093543
## 6  2015-10-01   0 0.8159545 0.5151641 0.2918454 0.2729428  0.2070071
## 7  2015-11-01   0 0.8479445 0.6672665 0.4193409 0.3472743  0.2536180
## 8  2015-12-01   0 0.5587033 0.6848462 0.4545049 0.3793568  0.4470461
## 9  2016-01-01   0 0.4697698 0.5800798 0.3722922 0.3749312  0.3659146
## 10 2016-02-01   0 0.4332018 0.5074471 0.3796755 0.5109247  0.4808587
##      weekly1   weekly10   weekly25  weekly50+    daily1   daily10
## 1  0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000
## 2  0.2926829 0.21212121 0.13333333 0.08333333 0.0000000 0.0000000
## 3  0.1157895 0.08629442 0.03797468 0.11538462 0.0000000 0.0000000
## 4  0.2068966 0.12790698 0.04000000 0.00000000 0.0000000 0.0000000
## 5  0.3571429 0.28767123 0.25000000 0.66666667 0.6666667 0.0000000
## 6  0.1120815 0.14487179 0.12500000 0.10937500 0.1600000 0.1333333
## 7  0.1818182 0.21052632 0.13142857 0.12280702 0.0000000 0.0000000
## 8  0.4516129 0.32035928 0.24468085 0.27272727 0.3333333 0.8000000
## 9  0.2888696 0.26736995 0.19289941 0.24045802 0.4202899 0.3968254
## 10 0.2607187 0.31481086 0.29571984 0.29538905 0.4622356 0.4285714
##      daily25  daily50+
## 1  0.0000000 0.0000000
## 2  0.0000000 0.0000000
## 3  0.0000000 0.0000000
## 4  0.0000000 0.0000000
## 5  0.0000000 0.0000000
## 6  0.0000000 0.0000000
## 7  0.0000000 0.0000000
## 8  0.0000000 0.0000000
## 9  0.2156863 0.1666667
## 10 0.3700787 0.2400000

Now we keep only the better-supported transitions. Let’s plot these more credible transitions again:

ggplot_transition_matrix_monthly_history <- 
  plot_transition_matrix_monthly_history(monthly_df = monthly_diagonals_largeNs, 
                                       titletext = "Repeats", 
                                       subtitletext = "Probability of repeating the same state",
                                       melt_it = F)
ggplotly(width = 950, height = 650, p=ggplot_transition_matrix_monthly_history)

Now we have more credible transitions. While the transitions are still not very stable, at least there are no non-sensible leaps from 0 to 1 and back. The fact that the transitions are quite unstable shows that the states themselves are not very stable.
According to the transition matrix, most donors are generally in “Non” or a low-frequency donation mode: they make one donation at a given time and then return to their less active state. This can explain why these repetition probabilities are relatively low.
It is interesting that September shows a peak in the state-repetition probability, meaning most donors repeated their own state, followed by a drop. There was another such peak in December, followed by a decline in January; these might be consequences of political events or campaigning efforts. A quick numeric check follows below.
One striking change over time is in the “daily” states: their repetition probabilities climbed steadily after they entered the dataset, meaning people became more and more consistent in giving.
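As that quick numeric check, here is a sketch of the average repetition probability per month, using the filtered monthly_diagonals_largeNs table from above:

# Sketch: average probability of repeating the same state each month (after the small-sample filter).
setDT(monthly_diagonals_largeNs)
monthly_diagonals_largeNs[, .(mean_repeat_prob = mean(transition_prob)), by = date][order(date)]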

Plot 7: Monthly History of Average Predicted DLV Values


### DLV VALUES throughout time

### 
plot_dlv_values_timeline <- function(DLV_monthly = DLV_monthly_0.6, state_levels = state_levels, analysis_title = '') {
  names(DLV_monthly) <- c("date", "state", "avg_amount_monthly", "DLV_Values", "DLV_future_remainder")
  DLV_monthly$state <- factor(DLV_monthly$state, levels = state_levels)
  DLV_monthly$date <- as.POSIXct(DLV_monthly$date)
  
  gg_monthly_DLVs <- 
    ggplot(data = DLV_monthly, aes(x = date, y = DLV_Values, color=state)) +
     geom_line(size=1, alpha = 0.6) + 
     labs(title = paste0(analysis_title, "Monthly Estimated Average DLVs Per Donor State"), 
      subtitle = "DLV monthly predictions by the Markov Model, estimated with a 0.3 Discount Factor (equivalent to 0.6)",
      y = "DLV - Discounted donor Lifetime Values (in Discounted Amounts)", x = "Month", color = "Donor State ") +
      #geom_dl(aes(label = state), method = list(dl.trans(x = x + 0.1), "last.points", cex = 0.6)) +
      #geom_dl(aes(label = state), method = list(dl.trans(x = x - 0.1), "first.points", cex = 0.6)) +
      theme_minimal() +
      theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold"), 
            plot.subtitle = element_text(family = "Helvetica"))  # + scale_colour_ptol() 
  ggsave(filename = "./dataviz/gg_monthly_DLVs", device = "png", width = 10, height = 6)
  return(gg_monthly_DLVs)
}

# Generate Monthly DLV Values
# Running monthly DLV estimations and saving all to one dataframe 
DLV_preds_monthly_0.3 <- ddply(.data = df, .variables = .(date), .fun =  generate_DLV_dataset_beta) # pass discount_factor = 0.3 as an extra ddply argument if it is not already the function's default
head(DLV_preds_monthly_0.3,3)
##         date   state Avg_Amounts_monthly DLV_Values DLV_future_remainder
## 1 2015-05-01  daily1            9.666667   13.96529             4.298626
## 2 2015-05-01 daily10           15.871212   15.87121             0.000000
## 3 2015-05-01 daily25            0.000000    0.00000             0.000000
# Plot normal DLV Values History
ggplot_plot_dlv_values_timeline <- plot_dlv_values_timeline(DLV_preds_monthly_0.3)
ggplotly(width = 950, height = 650, p=ggplot_plot_dlv_values_timeline)

This plot shows the following about the data:
1. The DLVs have a somewhat non-obvious distribution: high-amount monthly donors contribute more in total than high-amount weekly and daily donors. Above roughly $50, the monthly donors appear more skewed upward than the daily and weekly ones.
2. It was possible to consider only one month’s data at a time and still obtain per-month DLV values that, although not entirely constant, were mostly quite stable.
That’s not a bad result for the stability of our DLV estimations; we don’t see significant shifts in ranking or trends that would profoundly affect our conclusions (a quick ranking check follows).
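As that quick ranking check, here is a sketch that pulls out the state with the highest estimated DLV in each month from DLV_preds_monthly_0.3:

# Sketch: which state had the highest estimated DLV each month?
library(data.table)
dlv_dt <- as.data.table(DLV_preds_monthly_0.3)
dlv_dt[, .SD[which.max(DLV_Values)], by = date][, .(date, state, DLV_Values)]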

Conclusions from this plot

Most importantly, the plot reveals the following couple of insights:

The ranking of the groups was relatively consistent and showed that the highest DLVs belonged to the groups with the highest average donation amounts (all the 50+ groups), declining after that. Within each such amount tier, there was an INVERSE relationship with frequency: monthly donors typically donated more than weekly donors, who donated more than daily donors of the same average-amount tier!

Now, for the bottom line, let’s see the overall contribution of the group of donors in each state - that is, taking into account both the average amount and the number of donors!

Plot 8: Monthly History of Total Predicted DLV Values, Total Per Group Size


plot_dlv_groupsized_values <- function(DLV_monthly, num_donors_groups, analysis_title='',discount_factor='0.6 (0.3 Adjusted)') {
    
    # merge monthly DLVs with groupsize, multiply these columns
    DLV_monthly_groupsized <- merge(DLV_monthly, num_donors_groups, 
                                   by = c("date","state"))
    DLV_monthly_groupsized['Total_group_DLV'] <- DLV_monthly_groupsized$DLV_Values * DLV_monthly_groupsized$Num_donors
    
  
    ### PLOT ###
    gg_monthly_DLVs_grouptotal <- 
      ggplot(data = DLV_monthly_groupsized, aes(x = date, y = Total_group_DLV, color=state)) +
       geom_line(size=1, alpha = 0.6)  + 
       labs(title = paste0(analysis_title, "Monthly Estimated Group DLVs. Discount = ",discount_factor),
        subtitle = "Total DLV monthly estimations by the model, multiplied by the number of donors in the group that month",
        y = "DLV - Discounted Lifetime Donations", 
        x = "Month", color = "Donor State ") +
        #geom_dl(aes(label = state), method = list(dl.trans(x = x + 0.1), "last.points", cex = 0.6)) +
        #geom_dl(aes(label = state), method = list(dl.trans(x = x - 0.1), "first.points", cex = 0.6)) +
        theme_minimal() +
        theme(plot.title = element_text(size=18, family = "Helvetica", face = "bold"), 
              plot.subtitle = element_text(family = "Helvetica")) 
    ggsave(filename = "gg_monthly_DLVs_grouptotal_smoothed", device = "png", width = 10, height = 6)
    return(gg_monthly_DLVs_grouptotal)
}

# num_donors_groups <- generate_num_donors_groups(df)
ggplot_dlv_groupsized_values <-  plot_dlv_groupsized_values(DLV_preds_monthly_0.3[DLV_preds_monthly_0.3$state!='Non',], num_donors_groups)
ggplotly(width = 950, height = 650, p=ggplot_dlv_groupsized_values)

Since the number of monthly donors increased over time, especially among the smaller donations, we would expect an increase in their total contribution; however, group growth alone does not account for the overall gains. The ranking of which monthly group had the higher total value fluctuated and switched a few times, and ended roughly in order of donation magnitude. The critical conclusion from this plot is that most of the lifetime value comes from the monthly donors, even though they donate small amounts, because they form a much larger group (a quick aggregate check follows below). Therefore, for the Sanders campaigners - or Democratic campaigners generally - you might be better off convincing more people to donate a monthly donation of whatever magnitude they can afford, than trying to get a higher frequency of donations that the donors will later stop, or than trying to push hard for a higher amount!
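As that quick aggregate check, here is a sketch that recomputes the group totals outside the plotting function, from DLV_preds_monthly_0.3 and num_donors_groups:

# Sketch: aggregate discounted value contributed by each state across all months
# (monthly DLV per donor times that month's group size).
group_totals <- merge(DLV_preds_monthly_0.3, num_donors_groups, by = c("date", "state"))
setDT(group_totals)
group_totals[state != "Non",
             .(total_group_DLV = sum(DLV_Values * Num_donors)),
             by = state][order(-total_group_DLV)]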


Conclusion: Key Insights for Decision Makers


To conclude, below are a few of the critical insights resulting from this analysis that relevant decision-makers, such as political campaign groups, should know about:

  1. GO FOR HIGHER DONATION AMOUNTS, NOT FOR DONATING A LITTLE FREQUENTLY. The groups with the highest average DLV came from the highest average donation, not from higher frequency. Meaning, even if people donated every day, it still didn’t sum up to the same amounts over a whole month as the monthly donors.
  2. NUDGE FOR LOW FREQUENCY (LIKE MONTHLY) TO GET PEOPLE TO DONATE MORE. Within the high amount donors of various frequencies, there was an INVERSE relationship with the frequency; monthly donors typically donated more than weekly, and they donated more than daily donors of the same average amount tier!
  3. CONSIDER THE SIZE OF EACH GROUP AND CONVERSION RATE: IF YOU CAN GET MANY MORE PEOPLE TO DONATE MONTHLY THAN GETTING PEOPLE TO DONATE DAILY, THEN THE SHEER SIZE OF THE GROUP WILL RESULT IN MORE DONATIONS. In aggregate discounted lifetime donation values (times the number of donors), most of the lifetime value comes from the monthly donors, even if they donate a small amount, because they were a much larger group. Therefore, for the Sanders’ campaigners - or democratic campaigners - you might be better off convincing more people to donate a monthly donation of whatever magnitude they can afford, than trying to get a higher frequency of contributions that users will later stop, or than trying to push hard for a higher amount!
  4. PEOPLE MIND LESS ABOUT HOW OFTEN THEY DONATE THAN HOW MUCH THEY DONATE EACH TIME. While some of the states are a bit more stable, most states tended to drop down a tier of frequency (as in, from daily to weekly) but stayed at a similar amount of average donation!
  5. PEOPLE JOIN MORE AND STAY LONGER AT “LOWER ASKS”: WHEN THEY DONATE LESS OFTEN. THIS RESULTS IN LARGER CONTRIBUTIONS OVER TIME. Over time, the number of monthly donors increased and the number of donors who donate a small amount decreased. Intuitively, it was easy to get people to join or stay in the lower-commitment donation patterns.

Disclaimer: take each conclusion with a grain of salt. The first sentence of each key takeaway assumes that the point that follows reflects a causal relationship, which it might not; but framing it this way helps deliver the message and makes the value of each takeaway easier to grasp.