library(kableExtra)
library(caret)
library(ggplot2)
library(DescTools) #pseudR2
library(Metrics) #rmse
library(MASS) #glm.nb
In this blog post I am going to be doing count regression on the number of declined credit card transactions on a small data set that contains four months worth of data. This data set does not contain any personally identifiable information. It presents summary data on the number of successful and declined transactions for each day and market type.
Column | Description |
---|---|
Day |
Day of transaction |
Status |
Status of transaction: Decline or Successful |
Market Type |
Describes where transaction was eCommerce or retail. |
Count |
Count of transactions that fall within the category for the given day |
This blog is particularly interesting to me because I assist with credit card reconciliation each month at work, and the results of this exercise would provide more insights into the data set that I work with. This data set is going to be used on several blog posts. In this exercise, I’ll a simple count regression is going to be performed on the number of declined transactions.
This data set contains transaction counts from August to November. The model is going to be trained with data from August to October and validated with November data.
The response variable in this exercise is Count - specifically for the number of declined transactions for the given day and market type.
There are only two explanatory variables in the regression. Day is the date when the transactions occurred. Market Type describes whether the transaction was an eCommerce or retail transaction.
The data set is read in a data frame. The data frame has 472 rows.
df <- read.csv('creditcard_transactions.csv')
colnames(df) <- c('Day', 'Count', 'MarketType', 'Status')
paste('Number of rows: ', nrow(df))
## [1] "Number of rows: 472"
Below is a preview of the first 10 rows of the data.
kable(head(df,10)) %>% kable_styling()
Day | Count | MarketType | Status |
---|---|---|---|
8/1/2019 | 18 | eCommerce | Declined |
8/2/2019 | 6 | eCommerce | Declined |
8/3/2019 | 5 | eCommerce | Declined |
8/4/2019 | 2 | eCommerce | Declined |
8/5/2019 | 2 | eCommerce | Declined |
8/6/2019 | 8 | eCommerce | Declined |
8/7/2019 | 3 | eCommerce | Declined |
8/8/2019 | 9 | eCommerce | Declined |
8/9/2019 | 11 | eCommerce | Declined |
8/10/2019 | 13 | eCommerce | Declined |
The day column is converted to the date type. Below you will see the data types of each columns.
The Day column contain dates. Count is an integer. MarketType and Status are factors with 2 levels.
df$Day <- as.Date(df$Day, format="%m/%d/%Y")
str(df)
## 'data.frame': 472 obs. of 4 variables:
## $ Day : Date, format: "2019-08-01" "2019-08-02" ...
## $ Count : int 18 6 5 2 2 8 3 9 11 13 ...
## $ MarketType: Factor w/ 2 levels "eCommerce","retail": 1 1 1 1 1 1 1 1 1 1 ...
## $ Status : Factor w/ 2 levels "Declined","Successful": 1 1 1 1 1 1 1 1 1 1 ...
Below is a summary of the data. There are no NA values in the data set.
summary(df)
## Day Count MarketType Status
## Min. :2019-08-01 Min. : 0.00 eCommerce:236 Declined :228
## 1st Qu.:2019-08-30 1st Qu.: 2.00 retail :236 Successful:244
## Median :2019-09-29 Median : 22.50
## Mean :2019-09-29 Mean : 50.33
## 3rd Qu.:2019-10-30 3rd Qu.: 79.25
## Max. :2019-11-30 Max. :941.00
Create subset for declined transactions.
decline <- subset(df, df$Status == 'Declined')
Below is a box plot of of number of declined transactions grouped by MarketType. As you can see, there is an outlier value. The maximum number of declined transactions is 941. The 3rd quartile value is 3.00. So, the max value is an outlier. Perhaps this requires further investigation.
ggplot(decline, aes(x=MarketType, y=Count, color=MarketType)) +geom_boxplot()
summary(decline)
## Day Count MarketType Status
## Min. :2019-08-01 Min. : 0.000 eCommerce:114 Declined :228
## 1st Qu.:2019-08-29 1st Qu.: 1.000 retail :114 Successful: 0
## Median :2019-09-28 Median : 2.000
## Mean :2019-09-28 Mean : 7.636
## 3rd Qu.:2019-10-29 3rd Qu.: 5.000
## Max. :2019-11-30 Max. :941.000
On 11/7/2019, 941 declined transactions occurred. A closer look showed that these declined transactions appear to be fraudulent attempts from a dummy customer name. This is not part of normal activity.
subset(decline, decline$Count==941)
## Day Count MarketType Status
## 94 2019-11-07 941 eCommerce Declined
The data point with 941 count is dropped.
decline <- subset(decline, decline$Count < 941)
Below is distribution of count, which approximately resembles the Poisson distribution.
ggplot(decline, aes(Count)) + geom_bar(aes(fill=MarketType))
Below is an updated box plot with the count of 941 removed.
ggplot(decline, aes(x=MarketType, y=Count, color=MarketType)) +geom_boxplot()
Convert the Day date value to a Julian date, which is the number of days since 1/1/1970. The regression is going to be based on the Julian date.
decline$julianDate <- julian(decline$Day)
kable(head(decline)) %>% kable_styling()
Day | Count | MarketType | Status | julianDate |
---|---|---|---|---|
2019-08-01 | 18 | eCommerce | Declined | 18109 |
2019-08-02 | 6 | eCommerce | Declined | 18110 |
2019-08-03 | 5 | eCommerce | Declined | 18111 |
2019-08-04 | 2 | eCommerce | Declined | 18112 |
2019-08-05 | 2 | eCommerce | Declined | 18113 |
2019-08-06 | 8 | eCommerce | Declined | 18114 |
Create train and test subsets for the declined transactions. Transactions from August through October are going to be part of the train set. November transactions are going to be part of the test set.
decline_train <- subset(decline, decline$Day < '2019-11-01')
decline_test <- subset(decline, decline$Day > '2019-10-31')
One of the assumptions of a Poisson regression model is that the variance and mean of the dependent variable are the same. The \(variance\) is 19.25451 and \(mean\) is 3.823864. The variance is much larger than the mean. This suggests that the data may be over dispersed. A Negative Binomial regression model may be more appropriate.
var(decline_train$Count)
## [1] 19.25451
mean(decline_train$Count)
## [1] 3.823864
The predictors in this model are the Julian date and market type. The predictors appear to be significant. The pseudo \(R^2\) is 0.4087017. Te \(RMSE\) is 4.148659. The \(AIC\) is 1091.1.
m_1 <- glm(Count ~ julianDate + MarketType, family = poisson(link = "log"), data=decline_train)
summary(m_1)
##
## Call:
## glm(formula = Count ~ julianDate + MarketType, family = poisson(link = "log"),
## data = decline_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4910 -1.7287 -0.5686 0.9462 6.3049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 110.860730 26.577460 4.171 3.03e-05 ***
## julianDate -0.006018 0.001464 -4.110 3.96e-05 ***
## MarketTyperetail -0.688693 0.081710 -8.428 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 726.07 on 175 degrees of freedom
## Residual deviance: 633.75 on 173 degrees of freedom
## AIC: 1091.1
##
## Number of Fisher Scoring iterations: 5
PseudoR2(m_1, which = 'Nagelkerke')
## Nagelkerke
## 0.4087017
pred_1 <- predict(m_1, decline_train, type="response")
Metrics::rmse(decline_train$Count, pred_1)
## [1] 4.148659
Below is a plot of predicted vs. actual. The plot shows that predicted declined number of transactions on a given day roughly range from 2 to 7 when actual values is less than 8. Predictions in the number of declined transactions don’t seem to go beyond 7.
plot(x=decline_train$Count, y=pred_1, xlab="actual",ylab="predicted")
The predictors in this model are the Julian date and market type. The predictors appear to be significant. The pseudo \(R^2\) is 0.137205. Te \(RMSE\) is 4.150194. The \(AIC\) is 849.83.
m_2 <- glm.nb(Count ~ julianDate + MarketType, data = decline_train)
summary(m_2)
##
## Call:
## glm.nb(formula = Count ~ julianDate + MarketType, data = decline_train,
## init.theta = 1.385131254, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1726 -0.9656 -0.2890 0.4655 2.4892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 120.081979 52.007712 2.309 0.0209 *
## julianDate -0.006525 0.002865 -2.278 0.0227 *
## MarketTyperetail -0.704328 0.152467 -4.620 3.85e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.3851) family taken to be 1)
##
## Null deviance: 219.06 on 175 degrees of freedom
## Residual deviance: 193.29 on 173 degrees of freedom
## AIC: 849.83
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.385
## Std. Err.: 0.214
##
## 2 x log-likelihood: -841.826
PseudoR2(m_2, which = 'Nagelkerke')
## Nagelkerke
## 0.137205
pred_2 <- predict(m_2, decline_train, type="response")
Metrics::rmse(decline_train$Count, pred_2)
## [1] 4.150194
Below is a plot of predicted vs. actual. This is very similar to the Poisson model. The plot shows that predicted declined number of transactions on a given day roughly range from 2 to 7 when actual values is less than 8. Predictions in the number of declined transactions don’t seem to go beyond 7.
plot(x=decline_train$Count, y=pred_2, xlab="actual",ylab="predicted")
test_pred1 <- predict(m_1, decline_test, type="response")
Metrics::rmse(decline_test$Count, test_pred1)
## [1] 2.361022
plot(x=decline_test$Count, y=test_pred1, xlab="actual",ylab="predicted")
test_pred2 <- predict(m_2, decline_test, type="response")
Metrics::rmse(decline_test$Count, test_pred2)
## [1] 2.359342
plot(x=decline_test$Count, y=test_pred2, xlab="actual",ylab="predicted")
Predictions of both models are very similar.
decline_test$pred1 <- test_pred1
decline_test$pred2 <- test_pred2
decline_retail <- subset(decline_test, decline_test$MarketType == 'retail')
decline_eCommerce <- subset(decline_test, decline_test$MarketType == 'eCommerce')
Test data number of declined transactions highligted by Market Type.
ggplot(decline_test, aes(x=Day, y=Count, color=MarketType, shape=MarketType)) + geom_point()
Poisson model prediction:
ggplot(decline_test, aes(x=Day, y=pred1, color=MarketType, shape=MarketType)) + geom_point()
Negative Binomial model prediction:
ggplot(decline_test, aes(x=Day, y=pred2, color=MarketType, shape=MarketType)) + geom_point()
The predictions look very linear. Both models show the number of declined transactions dropping as time progresses.