library(kableExtra)
library(caret)
library(ggplot2)
library(DescTools) #pseudR2
library(Metrics) #rmse
library(MASS) #glm.nb
This blog post is related to the blog Declined Credit Card transactions.
This blog uses the same data set as the blog mentioned above but investigates the number of successful credit card transactions on a given day. This data set does not contain any personally identifiable information. It presents summary data on the number of successful and declined transactions for each day and market type.
Column | Description |
---|---|
Day |
Day of transaction |
Status |
Status of transaction: Decline or Successful |
Market Type |
Describes where transaction was eCommerce or retail. |
Count |
Count of transactions that fall within the category for the given day |
This blog is particularly interesting to me because I assist with credit card reconciliation each month at work, and the results of this exercise would provide more insights into the data set that I work with. This data set is going to be used on several blog posts. In this exercise, count regression is going to be performed on the number of successful transactions.
This data set contains transaction counts from August to November. The model is going to be trained with data from August to October and validated with November data.
The response variable in this exercise is Count - specifically for the number of successful transactions for the given day and market type.
There are only two explanatory variables in the regression. Day is the date when the transactions occurred. Market Type describes whether the transaction was an eCommerce or retail transaction.
The data set is read in a data frame. The data frame has 472 rows.
df <- read.csv('creditcard_transactions.csv')
colnames(df) <- c('Day', 'Count', 'MarketType', 'Status')
paste('Number of rows: ', nrow(df))
## [1] "Number of rows: 472"
The day column is converted to the date type. Below you will see the data types of each columns.
The Day column contain dates. Count is an integer. MarketType and Status are factors with 2 levels.
df$Day <- as.Date(df$Day, format="%m/%d/%Y")
str(df)
## 'data.frame': 472 obs. of 4 variables:
## $ Day : Date, format: "2019-08-01" "2019-08-02" ...
## $ Count : int 18 6 5 2 2 8 3 9 11 13 ...
## $ MarketType: Factor w/ 2 levels "eCommerce","retail": 1 1 1 1 1 1 1 1 1 1 ...
## $ Status : Factor w/ 2 levels "Declined","Successful": 1 1 1 1 1 1 1 1 1 1 ...
Below is summary of the data. There are no NA values in the data set.
summary(df)
## Day Count MarketType Status
## Min. :2019-08-01 Min. : 0.00 eCommerce:236 Declined :228
## 1st Qu.:2019-08-30 1st Qu.: 2.00 retail :236 Successful:244
## Median :2019-09-29 Median : 22.50
## Mean :2019-09-29 Mean : 50.33
## 3rd Qu.:2019-10-30 3rd Qu.: 79.25
## Max. :2019-11-30 Max. :941.00
Create subset for Successful transactions.
successful <- subset(df, df$Status == 'Successful')
Below is a box plot of of number of successful transactions grouped by MarketType. There tends to be more retail transactions than eCommerce.
ggplot(successful, aes(x=MarketType, y=Count, color=MarketType)) +geom_boxplot()
summary(successful)
## Day Count MarketType Status
## Min. :2019-08-01 Min. : 0.00 eCommerce:122 Declined : 0
## 1st Qu.:2019-08-31 1st Qu.: 39.00 retail :122 Successful:244
## Median :2019-09-30 Median : 75.50
## Mean :2019-09-30 Mean : 90.22
## 3rd Qu.:2019-10-31 3rd Qu.:137.00
## Max. :2019-11-30 Max. :340.00
Below is distribution of the number of successful transactions. It roughly resembles a Poisson distribution.
ggplot(successful, aes(Count)) + geom_bar(aes(fill=MarketType))
Convert the Day date value to a Julian date, which is the number of days since 1/1/1970. The regression is going to be based on the Julian date.
successful$julianDate <- julian(successful$Day)
kable(head(successful)) %>% kable_styling()
Day | Count | MarketType | Status | julianDate | |
---|---|---|---|---|---|
229 | 2019-08-01 | 63 | eCommerce | Successful | 18109 |
230 | 2019-08-02 | 27 | eCommerce | Successful | 18110 |
231 | 2019-08-03 | 38 | eCommerce | Successful | 18111 |
232 | 2019-08-04 | 21 | eCommerce | Successful | 18112 |
233 | 2019-08-05 | 32 | eCommerce | Successful | 18113 |
234 | 2019-08-06 | 53 | eCommerce | Successful | 18114 |
Create train and test subsets for the successful transactions. Transactions from August through October are going to be part of the train set. November transactions are going to be part of the test set.
successful_train <- subset(successful, successful$Day < '2019-11-01')
successful_test <- subset(successful, successful$Day > '2019-10-31')
One of the assumptions of a Poisson regression model is that the variance and mean of the dependent variable are the same. The \(variance\) is 3821.054 and \(mean\) is 94.27717. The variance is much larger than the mean. This suggests that the data may be over dispersed. A Negative Binomial regression model may be more appropriate.
var(successful_train$Count)
## [1] 3821.054
mean(successful_train$Count)
## [1] 94.27717
The predictors in this model are the Julian date and market type. The predictors appear to be significant. The pseudo \(R^2\) is 0.5104103 . The \(RMSE\) is 39.99211. The \(AIC\) is 4243.
m_1 <- glm(Count ~ julianDate + MarketType, family = poisson(link = "log"), data=successful_train)
summary(m_1)
##
## Call:
## glm(formula = Count ~ julianDate + MarketType, family = poisson(link = "log"),
## data = successful_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -16.4867 -2.9912 -0.8702 1.8663 28.1915
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 62.1550192 5.2010880 11.95 <2e-16 ***
## julianDate -0.0032102 0.0002865 -11.20 <2e-16 ***
## MarketTyperetail 1.0637884 0.0173842 61.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7529.5 on 183 degrees of freedom
## Residual deviance: 3112.4 on 181 degrees of freedom
## AIC: 4243
##
## Number of Fisher Scoring iterations: 5
PseudoR2(m_1, which = 'McFadden')
## McFadden
## 0.5104103
pred_1 <- predict(m_1, successful_train, type="response")
Metrics::rmse(successful_train$Count, pred_1)
## [1] 39.99211
Below is a plot of predicted vs. actual. We can see two main clusters: eCommerce market tends to be lower transaction counts than retail.
plot(x=successful_train$Count, y=pred_1, xlab="actual",ylab="predicted")
The predictors in this model are the Julian date and market type. The predictors appear to be significant. The pseudo \(R^2\) is 0.1066597. The \(RMSE\) is 40.03737. The \(AIC\) is 1858.8 (much lower than the Poisson model).
m_2 <- glm.nb(Count ~ julianDate + MarketType, data = successful_train)
summary(m_2)
##
## Call:
## glm.nb(formula = Count ~ julianDate + MarketType, data = successful_train,
## init.theta = 4.594930074, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.6087 -0.8406 -0.1767 0.3639 6.2692
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 54.390221 24.244083 2.243 0.0249 *
## julianDate -0.002782 0.001335 -2.083 0.0372 *
## MarketTyperetail 1.060534 0.070955 14.947 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.5949) family taken to be 1)
##
## Null deviance: 418.35 on 183 degrees of freedom
## Residual deviance: 197.37 on 181 degrees of freedom
## AIC: 1858.8
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.595
## Std. Err.: 0.508
##
## 2 x log-likelihood: -1850.798
PseudoR2(m_2, which = 'McFadden')
## McFadden
## 0.1066597
pred_2 <- predict(m_2, successful_train, type="response")
Metrics::rmse(successful_train$Count, pred_2)
## [1] 40.03737
Below is a plot of predicted vs. actual. This is very similar to the Poisson model. We can see two main clusters: eCommerce market tends to be lower transaction counts than retail.
plot(x=successful_train$Count, y=pred_2, xlab="actual",ylab="predicted")
test_pred1 <- predict(m_1, successful_test, type="response")
Metrics::rmse(successful_test$Count, test_pred1)
## [1] 42.78124
The plot of actual vs. predicted also shows two main clusters: eCommerce market tends to have lower counts than retail.
plot(x=successful_test$Count, y=test_pred1, xlab="actual",ylab="predicted")
test_pred2 <- predict(m_2, successful_test, type="response")
Metrics::rmse(successful_test$Count, test_pred2)
## [1] 43.14785
The plot of actual vs. predicted also shows two main clusters: eCommerce market tends to have lower counts than retail.
plot(x=successful_test$Count, y=test_pred2, xlab="actual",ylab="predicted")
Predictions of both models are very similar although the AIC of Negative Binomial model is much lower.
successful_test$pred1 <- test_pred1
successful_test$pred2 <- test_pred2
successful_retail <- subset(successful_test, successful_test$MarketType == 'retail')
successful_eCommerce <- subset(successful_test, successful_test$MarketType == 'eCommerce')
Test data number of successful transactions highligted by Market Type.
ggplot(successful_test, aes(x=Day, y=Count, color=MarketType, shape=MarketType)) + geom_point()
Poisson model prediction:
ggplot(successful_test, aes(x=Day, y=pred1, color=MarketType, shape=MarketType)) + geom_point()
Negative Binomial model prediction:
ggplot(successful_test, aes(x=Day, y=pred2, color=MarketType, shape=MarketType)) + geom_point()
The predictions look very linear. Both models show the number of successful transactions dropping as time progresses. You can also see two main clusters: eCommerce market has lower counts than retail.