Declined Credit Card Transactions

library(kableExtra)
library(caret)
library(ggplot2)
library(DescTools) #pseudR2
library(Metrics) #rmse
library(MASS) #glm.nb

Introduction

In this blog post I am going to be doing count regression on the number of declined credit card transactions on a small data set that contains four months worth of data. This data set does not contain any personally identifiable information. It presents summary data on the number of successful and declined transactions for each day and market type.

Column	Description
`Day`	Day of transaction
`Status`	Status of transaction: Decline or Successful
`Market Type`	Describes where transaction was eCommerce or retail.
`Count`	Count of transactions that fall within the category for the given day

Summary of Exercise

This blog is particularly interesting to me because I assist with credit card reconciliation each month at work, and the results of this exercise would provide more insights into the data set that I work with. This data set is going to be used on several blog posts. In this exercise, I’ll a simple count regression is going to be performed on the number of declined transactions.

This data set contains transaction counts from August to November. The model is going to be trained with data from August to October and validated with November data.

Response Variable

The response variable in this exercise is Count - specifically for the number of declined transactions for the given day and market type.

Explanatory Variable

There are only two explanatory variables in the regression. Day is the date when the transactions occurred. Market Type describes whether the transaction was an eCommerce or retail transaction.

Process Data

The data set is read in a data frame. The data frame has 472 rows.

df <- read.csv('creditcard_transactions.csv')
colnames(df) <- c('Day', 'Count', 'MarketType', 'Status')
paste('Number of rows: ', nrow(df))

## [1] "Number of rows:  472"

Below is a preview of the first 10 rows of the data.

kable(head(df,10)) %>% kable_styling()

Day	Count	MarketType	Status
8/1/2019	18	eCommerce	Declined
8/2/2019	6	eCommerce	Declined
8/3/2019	5	eCommerce	Declined
8/4/2019	2	eCommerce	Declined
8/5/2019	2	eCommerce	Declined
8/6/2019	8	eCommerce	Declined
8/7/2019	3	eCommerce	Declined
8/8/2019	9	eCommerce	Declined
8/9/2019	11	eCommerce	Declined
8/10/2019	13	eCommerce	Declined

The day column is converted to the date type. Below you will see the data types of each columns.

The Day column contain dates. Count is an integer. MarketType and Status are factors with 2 levels.

df$Day <- as.Date(df$Day, format="%m/%d/%Y")
str(df)

## 'data.frame':    472 obs. of  4 variables:
##  $ Day       : Date, format: "2019-08-01" "2019-08-02" ...
##  $ Count     : int  18 6 5 2 2 8 3 9 11 13 ...
##  $ MarketType: Factor w/ 2 levels "eCommerce","retail": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Status    : Factor w/ 2 levels "Declined","Successful": 1 1 1 1 1 1 1 1 1 1 ...

Below is a summary of the data. There are no NA values in the data set.

summary(df)

##       Day                 Count            MarketType         Status   
##  Min.   :2019-08-01   Min.   :  0.00   eCommerce:236   Declined  :228  
##  1st Qu.:2019-08-30   1st Qu.:  2.00   retail   :236   Successful:244  
##  Median :2019-09-29   Median : 22.50                                   
##  Mean   :2019-09-29   Mean   : 50.33                                   
##  3rd Qu.:2019-10-30   3rd Qu.: 79.25                                   
##  Max.   :2019-11-30   Max.   :941.00

Create subset for declined transactions.

decline <- subset(df, df$Status == 'Declined')

Below is a box plot of of number of declined transactions grouped by MarketType. As you can see, there is an outlier value. The maximum number of declined transactions is 941. The 3rd quartile value is 3.00. So, the max value is an outlier. Perhaps this requires further investigation.

ggplot(decline, aes(x=MarketType, y=Count, color=MarketType)) +geom_boxplot()

summary(decline)

##       Day                 Count             MarketType         Status   
##  Min.   :2019-08-01   Min.   :  0.000   eCommerce:114   Declined  :228  
##  1st Qu.:2019-08-29   1st Qu.:  1.000   retail   :114   Successful:  0  
##  Median :2019-09-28   Median :  2.000                                   
##  Mean   :2019-09-28   Mean   :  7.636                                   
##  3rd Qu.:2019-10-29   3rd Qu.:  5.000                                   
##  Max.   :2019-11-30   Max.   :941.000

On 11/7/2019, 941 declined transactions occurred. A closer look showed that these declined transactions appear to be fraudulent attempts from a dummy customer name. This is not part of normal activity.

subset(decline, decline$Count==941)

##           Day Count MarketType   Status
## 94 2019-11-07   941  eCommerce Declined

The data point with 941 count is dropped.

decline <- subset(decline, decline$Count < 941)

Below is distribution of count, which approximately resembles the Poisson distribution.

ggplot(decline, aes(Count)) + geom_bar(aes(fill=MarketType))

Below is an updated box plot with the count of 941 removed.

ggplot(decline, aes(x=MarketType, y=Count, color=MarketType)) +geom_boxplot()

Convert the Day date value to a Julian date, which is the number of days since 1/1/1970. The regression is going to be based on the Julian date.

decline$julianDate <- julian(decline$Day)
kable(head(decline)) %>% kable_styling()

Day	Count	MarketType	Status	julianDate
2019-08-01	18	eCommerce	Declined	18109
2019-08-02	6	eCommerce	Declined	18110
2019-08-03	5	eCommerce	Declined	18111
2019-08-04	2	eCommerce	Declined	18112
2019-08-05	2	eCommerce	Declined	18113
2019-08-06	8	eCommerce	Declined	18114

Create train and test subsets for the declined transactions. Transactions from August through October are going to be part of the train set. November transactions are going to be part of the test set.

decline_train <- subset(decline, decline$Day < '2019-11-01')
decline_test <- subset(decline, decline$Day > '2019-10-31')

Modeling

One of the assumptions of a Poisson regression model is that the variance and mean of the dependent variable are the same. The \(variance\) is 19.25451 and \(mean\) is 3.823864. The variance is much larger than the mean. This suggests that the data may be over dispersed. A Negative Binomial regression model may be more appropriate.

var(decline_train$Count)

## [1] 19.25451

mean(decline_train$Count)

## [1] 3.823864

\(M_1\): Poisson Model

The predictors in this model are the Julian date and market type. The predictors appear to be significant. The pseudo \(R^2\) is 0.4087017. Te \(RMSE\) is 4.148659. The \(AIC\) is 1091.1.

m_1 <- glm(Count ~ julianDate + MarketType, family = poisson(link = "log"), data=decline_train)
summary(m_1)

## 
## Call:
## glm(formula = Count ~ julianDate + MarketType, family = poisson(link = "log"), 
##     data = decline_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4910  -1.7287  -0.5686   0.9462   6.3049  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      110.860730  26.577460   4.171 3.03e-05 ***
## julianDate        -0.006018   0.001464  -4.110 3.96e-05 ***
## MarketTyperetail  -0.688693   0.081710  -8.428  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 726.07  on 175  degrees of freedom
## Residual deviance: 633.75  on 173  degrees of freedom
## AIC: 1091.1
## 
## Number of Fisher Scoring iterations: 5

PseudoR2(m_1, which = 'Nagelkerke')

## Nagelkerke 
##  0.4087017

pred_1 <- predict(m_1, decline_train, type="response")
Metrics::rmse(decline_train$Count, pred_1)

## [1] 4.148659

Below is a plot of predicted vs. actual. The plot shows that predicted declined number of transactions on a given day roughly range from 2 to 7 when actual values is less than 8. Predictions in the number of declined transactions don’t seem to go beyond 7.

plot(x=decline_train$Count, y=pred_1, xlab="actual",ylab="predicted")

\(M_2\): Negative Binomial

The predictors in this model are the Julian date and market type. The predictors appear to be significant. The pseudo \(R^2\) is 0.137205. Te \(RMSE\) is 4.150194. The \(AIC\) is 849.83.

m_2 <- glm.nb(Count ~ julianDate + MarketType, data = decline_train)
summary(m_2)

## 
## Call:
## glm.nb(formula = Count ~ julianDate + MarketType, data = decline_train, 
##     init.theta = 1.385131254, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1726  -0.9656  -0.2890   0.4655   2.4892  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      120.081979  52.007712   2.309   0.0209 *  
## julianDate        -0.006525   0.002865  -2.278   0.0227 *  
## MarketTyperetail  -0.704328   0.152467  -4.620 3.85e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.3851) family taken to be 1)
## 
##     Null deviance: 219.06  on 175  degrees of freedom
## Residual deviance: 193.29  on 173  degrees of freedom
## AIC: 849.83
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.385 
##           Std. Err.:  0.214 
## 
##  2 x log-likelihood:  -841.826

PseudoR2(m_2, which = 'Nagelkerke')

## Nagelkerke 
##   0.137205

pred_2 <- predict(m_2, decline_train, type="response")
Metrics::rmse(decline_train$Count, pred_2)

## [1] 4.150194

Below is a plot of predicted vs. actual. This is very similar to the Poisson model. The plot shows that predicted declined number of transactions on a given day roughly range from 2 to 7 when actual values is less than 8. Predictions in the number of declined transactions don’t seem to go beyond 7.

plot(x=decline_train$Count, y=pred_2, xlab="actual",ylab="predicted")

Evaluation with Test Data

Poisson Model:

test_pred1 <- predict(m_1, decline_test, type="response")
Metrics::rmse(decline_test$Count, test_pred1)

## [1] 2.361022

plot(x=decline_test$Count, y=test_pred1, xlab="actual",ylab="predicted")

Negative Binomial Model:

test_pred2 <- predict(m_2, decline_test, type="response")
Metrics::rmse(decline_test$Count, test_pred2)

## [1] 2.359342

plot(x=decline_test$Count, y=test_pred2, xlab="actual",ylab="predicted")

Predictions of both models are very similar.

decline_test$pred1 <- test_pred1
decline_test$pred2 <- test_pred2
decline_retail <- subset(decline_test, decline_test$MarketType == 'retail')
decline_eCommerce <- subset(decline_test, decline_test$MarketType == 'eCommerce')

Test data number of declined transactions highligted by Market Type.

ggplot(decline_test, aes(x=Day, y=Count, color=MarketType, shape=MarketType)) + geom_point()

Poisson model prediction:

ggplot(decline_test, aes(x=Day, y=pred1, color=MarketType, shape=MarketType)) + geom_point()

Negative Binomial model prediction:

ggplot(decline_test, aes(x=Day, y=pred2, color=MarketType, shape=MarketType)) + geom_point()

The predictions look very linear. Both models show the number of declined transactions dropping as time progresses.