library(kableExtra)
library(caret)
library(ggplot2)
library(DescTools) #pseudR2
library(Metrics) #rmse
library(MASS) #glm.nb


Introduction

This blog post is related to the blog Declined Credit Card transactions.

This blog uses the same data set as the blog mentioned above but investigates the number of successful credit card transactions on a given day. This data set does not contain any personally identifiable information. It presents summary data on the number of successful and declined transactions for each day and market type.

Column Description
Day Day of transaction
Status Status of transaction: Decline or Successful
Market Type Describes where transaction was eCommerce or retail.
Count Count of transactions that fall within the category for the given day


Summary of Exercise

This blog is particularly interesting to me because I assist with credit card reconciliation each month at work, and the results of this exercise would provide more insights into the data set that I work with. This data set is going to be used on several blog posts. In this exercise, count regression is going to be performed on the number of successful transactions.

This data set contains transaction counts from August to November. The model is going to be trained with data from August to October and validated with November data.


Response Variable

The response variable in this exercise is Count - specifically for the number of successful transactions for the given day and market type.


Explanatory Variable

There are only two explanatory variables in the regression. Day is the date when the transactions occurred. Market Type describes whether the transaction was an eCommerce or retail transaction.


Process Data

The data set is read in a data frame. The data frame has 472 rows.

df <- read.csv('creditcard_transactions.csv')
colnames(df) <- c('Day', 'Count', 'MarketType', 'Status')
paste('Number of rows: ', nrow(df))
## [1] "Number of rows:  472"

The day column is converted to the date type. Below you will see the data types of each columns.

The Day column contain dates. Count is an integer. MarketType and Status are factors with 2 levels.

df$Day <- as.Date(df$Day, format="%m/%d/%Y")
str(df)
## 'data.frame':    472 obs. of  4 variables:
##  $ Day       : Date, format: "2019-08-01" "2019-08-02" ...
##  $ Count     : int  18 6 5 2 2 8 3 9 11 13 ...
##  $ MarketType: Factor w/ 2 levels "eCommerce","retail": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Status    : Factor w/ 2 levels "Declined","Successful": 1 1 1 1 1 1 1 1 1 1 ...

Below is summary of the data. There are no NA values in the data set.

summary(df)
##       Day                 Count            MarketType         Status   
##  Min.   :2019-08-01   Min.   :  0.00   eCommerce:236   Declined  :228  
##  1st Qu.:2019-08-30   1st Qu.:  2.00   retail   :236   Successful:244  
##  Median :2019-09-29   Median : 22.50                                   
##  Mean   :2019-09-29   Mean   : 50.33                                   
##  3rd Qu.:2019-10-30   3rd Qu.: 79.25                                   
##  Max.   :2019-11-30   Max.   :941.00

Create subset for Successful transactions.

successful <- subset(df, df$Status == 'Successful')

Below is a box plot of of number of successful transactions grouped by MarketType. There tends to be more retail transactions than eCommerce.

ggplot(successful, aes(x=MarketType, y=Count, color=MarketType)) +geom_boxplot()

summary(successful)
##       Day                 Count            MarketType         Status   
##  Min.   :2019-08-01   Min.   :  0.00   eCommerce:122   Declined  :  0  
##  1st Qu.:2019-08-31   1st Qu.: 39.00   retail   :122   Successful:244  
##  Median :2019-09-30   Median : 75.50                                   
##  Mean   :2019-09-30   Mean   : 90.22                                   
##  3rd Qu.:2019-10-31   3rd Qu.:137.00                                   
##  Max.   :2019-11-30   Max.   :340.00

Below is distribution of the number of successful transactions. It roughly resembles a Poisson distribution.

ggplot(successful, aes(Count)) + geom_bar(aes(fill=MarketType))

Convert the Day date value to a Julian date, which is the number of days since 1/1/1970. The regression is going to be based on the Julian date.

successful$julianDate <- julian(successful$Day)
kable(head(successful)) %>% kable_styling()
Day Count MarketType Status julianDate
229 2019-08-01 63 eCommerce Successful 18109
230 2019-08-02 27 eCommerce Successful 18110
231 2019-08-03 38 eCommerce Successful 18111
232 2019-08-04 21 eCommerce Successful 18112
233 2019-08-05 32 eCommerce Successful 18113
234 2019-08-06 53 eCommerce Successful 18114

Create train and test subsets for the successful transactions. Transactions from August through October are going to be part of the train set. November transactions are going to be part of the test set.

successful_train <- subset(successful, successful$Day < '2019-11-01')
successful_test <- subset(successful, successful$Day > '2019-10-31')


Modeling

One of the assumptions of a Poisson regression model is that the variance and mean of the dependent variable are the same. The \(variance\) is 3821.054 and \(mean\) is 94.27717. The variance is much larger than the mean. This suggests that the data may be over dispersed. A Negative Binomial regression model may be more appropriate.

var(successful_train$Count)
## [1] 3821.054
mean(successful_train$Count)
## [1] 94.27717

\(M_1\): Poisson Model

The predictors in this model are the Julian date and market type. The predictors appear to be significant. The pseudo \(R^2\) is 0.5104103 . The \(RMSE\) is 39.99211. The \(AIC\) is 4243.

m_1 <- glm(Count ~ julianDate + MarketType, family = poisson(link = "log"), data=successful_train)
summary(m_1)
## 
## Call:
## glm(formula = Count ~ julianDate + MarketType, family = poisson(link = "log"), 
##     data = successful_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -16.4867   -2.9912   -0.8702    1.8663   28.1915  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      62.1550192  5.2010880   11.95   <2e-16 ***
## julianDate       -0.0032102  0.0002865  -11.20   <2e-16 ***
## MarketTyperetail  1.0637884  0.0173842   61.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7529.5  on 183  degrees of freedom
## Residual deviance: 3112.4  on 181  degrees of freedom
## AIC: 4243
## 
## Number of Fisher Scoring iterations: 5
PseudoR2(m_1, which = 'McFadden')
##  McFadden 
## 0.5104103
pred_1 <- predict(m_1, successful_train, type="response")
Metrics::rmse(successful_train$Count, pred_1)
## [1] 39.99211

Below is a plot of predicted vs. actual. We can see two main clusters: eCommerce market tends to be lower transaction counts than retail.

plot(x=successful_train$Count, y=pred_1, xlab="actual",ylab="predicted")

\(M_2\): Negative Binomial

The predictors in this model are the Julian date and market type. The predictors appear to be significant. The pseudo \(R^2\) is 0.1066597. The \(RMSE\) is 40.03737. The \(AIC\) is 1858.8 (much lower than the Poisson model).

m_2 <- glm.nb(Count ~ julianDate + MarketType, data = successful_train)
summary(m_2)
## 
## Call:
## glm.nb(formula = Count ~ julianDate + MarketType, data = successful_train, 
##     init.theta = 4.594930074, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.6087  -0.8406  -0.1767   0.3639   6.2692  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      54.390221  24.244083   2.243   0.0249 *  
## julianDate       -0.002782   0.001335  -2.083   0.0372 *  
## MarketTyperetail  1.060534   0.070955  14.947   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.5949) family taken to be 1)
## 
##     Null deviance: 418.35  on 183  degrees of freedom
## Residual deviance: 197.37  on 181  degrees of freedom
## AIC: 1858.8
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.595 
##           Std. Err.:  0.508 
## 
##  2 x log-likelihood:  -1850.798
PseudoR2(m_2, which = 'McFadden')
##  McFadden 
## 0.1066597
pred_2 <- predict(m_2, successful_train, type="response")
Metrics::rmse(successful_train$Count, pred_2)
## [1] 40.03737

Below is a plot of predicted vs. actual. This is very similar to the Poisson model. We can see two main clusters: eCommerce market tends to be lower transaction counts than retail.

plot(x=successful_train$Count, y=pred_2, xlab="actual",ylab="predicted")

Evaluation with Test Data

Poisson Model:

test_pred1 <- predict(m_1, successful_test, type="response")
Metrics::rmse(successful_test$Count, test_pred1)
## [1] 42.78124

The plot of actual vs. predicted also shows two main clusters: eCommerce market tends to have lower counts than retail.

plot(x=successful_test$Count, y=test_pred1, xlab="actual",ylab="predicted")

Negative Binomial Model:

test_pred2 <- predict(m_2, successful_test, type="response")
Metrics::rmse(successful_test$Count, test_pred2)
## [1] 43.14785

The plot of actual vs. predicted also shows two main clusters: eCommerce market tends to have lower counts than retail.

plot(x=successful_test$Count, y=test_pred2, xlab="actual",ylab="predicted")

Predictions of both models are very similar although the AIC of Negative Binomial model is much lower.

successful_test$pred1 <- test_pred1
successful_test$pred2 <- test_pred2
successful_retail <- subset(successful_test, successful_test$MarketType == 'retail')
successful_eCommerce <- subset(successful_test, successful_test$MarketType == 'eCommerce')

Test data number of successful transactions highligted by Market Type.

ggplot(successful_test, aes(x=Day, y=Count, color=MarketType, shape=MarketType)) + geom_point()

Poisson model prediction:

ggplot(successful_test, aes(x=Day, y=pred1, color=MarketType, shape=MarketType)) + geom_point()

Negative Binomial model prediction:

ggplot(successful_test, aes(x=Day, y=pred2, color=MarketType, shape=MarketType)) + geom_point()

The predictions look very linear. Both models show the number of successful transactions dropping as time progresses. You can also see two main clusters: eCommerce market has lower counts than retail.