library(kableExtra)
library(caret)
library(ggplot2)
library(Metrics) #rmse
library(gridExtra)


Introduction

This blog post is related to the blogs:

This blog uses the same data set as the blog mentioned above but investigates the number of successful credit card transactions on a given day. This data set does not contain any personally identifiable information. It presents summary data on the number of successful and declined transactions for each day and market type.

Column Description
Day Day of transaction
Source Describes source of transaction
Daily Total Total for the day


Summary of Exercise

This blog is particularly interesting to me because I assist with credit card reconciliation each month, and the results of this exercise would provide more insights into the data set that I work with.

This data set contains transaction counts from August to November. The model is going to be trained with data from August to October and validated with November data.


Response Variable

The response variable in this exercise is Daily Total, which is the total amount for the day for the given source.


Explanatory Variable

There are only two explanatory variables. Day is the date when the transactions occurred. Source describes where the transaction originated from.


Process Data

The data set is read in a data frame. The data frame has 445 rows.

df <- read.csv('creditcard_dailyTotal_bySource.csv')
colnames(df) <- c('Day', 'Source', 'DailyTotal')
paste('Number of rows: ', nrow(df))
## [1] "Number of rows:  445"

The day column is converted to the date type. Below you will see the data types of each columns.

The Day column contain dates. DailyTotal is a number. Source is a factor with 4 levels.

df$Day <- as.Date(df$Day, format="%m/%d/%Y")
str(df)
## 'data.frame':    445 obs. of  3 variables:
##  $ Day       : Date, format: "2019-08-01" "2019-08-01" ...
##  $ Source    : Factor w/ 4 levels "POS1","POS2",..: 4 1 2 1 4 2 3 4 1 2 ...
##  $ DailyTotal: num  3028 934 845 1183 905 ...

Below is summary of the data. There are no NA values in the data set.

summary(df)
##       Day              Source      DailyTotal     
##  Min.   :2019-08-01   POS1:120   Min.   :   10.0  
##  1st Qu.:2019-08-30   POS2:120   1st Qu.:  444.0  
##  Median :2019-09-29   POS3: 83   Median :  893.8  
##  Mean   :2019-09-28   Web :122   Mean   : 1208.2  
##  3rd Qu.:2019-10-28              3rd Qu.: 1298.4  
##  Max.   :2019-11-30              Max.   :15760.8

As you can see, the maximum daily total is 15,760.8. For the purpose of this exercise, this amount is going to be dropped. Below is a list of observations that are over 5,000. There are 15 such observations.

Day Source DailyTotal
28 2019-08-08 Web 7578.85
36 2019-08-10 Web 5816.90
74 2019-08-20 Web 8004.65
77 2019-08-21 Web 7255.40
81 2019-08-22 Web 5886.69
243 2019-10-04 Web 5084.72
257 2019-10-08 Web 5705.27
272 2019-10-12 Web 15760.76
347 2019-11-02 Web 5272.89
368 2019-11-08 Web 6887.38
396 2019-11-16 Web 8639.29
406 2019-11-19 Web 6123.45
410 2019-11-20 Web 5325.17
430 2019-11-26 Web 7053.81
434 2019-11-27 Web 6159.53


Drop the maximum daily total of 15,760.8.

df <- subset(df, df$DailyTotal < 15000)

Preview data.

Day Source DailyTotal
2019-08-01 Web 3028.29
2019-08-01 POS1 934.00
2019-08-01 POS2 845.24
2019-08-02 POS1 1183.00
2019-08-02 Web 905.05
2019-08-02 POS2 843.64

Below is box plot of Daily Total grouped by Source.

ggplot(df, aes(x=Source, y=DailyTotal, color=Source)) +geom_boxplot()

Below is distribution of daily total. The distribution looks roughly normal but extremely skewed to the right.

ggplot(df, aes(DailyTotal)) + geom_histogram(aes(fill=Source)) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Below is density plot of Daily Totals broken down by Source. The distribution of Web and POS3 are more skewed to the right.

p0 <- ggplot(df, aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('All Sources')
p1 <- ggplot(subset(df, Source == 'Web'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('Web')
p2 <- ggplot(subset(df, Source == 'POS1'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('POS1')
p3 <- ggplot(subset(df, Source == 'POS2'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('POS2')
p4 <- ggplot(subset(df, Source == 'POS3'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('POS3')
grid.arrange(p0, p1, p2, p3, p4, ncol=3, nrow=2)

Convert the Day date value to a Julian date, which is the number of days since 1/1/1970. The regression is going to be based on the Julian date.

df$julianDate <- julian(df$Day)

Create train and test subsets for the successful transactions. Transactions from August through October are going to be part of the train set. November transactions are going to be part of the test set.

train <- subset(df, df$Day < '2019-11-01')
test <- subset(df, df$Day > '2019-10-31')

Below is scatter plot of daily totals of transactions originating from the Web from August through October. The plot shows a general trend of decreasing daily totals.

ggplot(subset(train, train$Source == 'Web'), aes(x=Day, y=DailyTotal, color=Source)) + geom_point() +  geom_smooth(method='lm')

Below is a scatter plot of Daily Total for the rest of the other sources that are not Web. For POS1 and POS2 the plot shows decreasing daily totals. For POS3 the the plot shows increasing trend.

ggplot(subset(train, train$Source != 'Web'), aes(x=Day, y=DailyTotal, color=Source)) + geom_point() +  geom_smooth(method='lm')


Modeling

\(M_1\): Linear-Linear Multiple Linear Regression Full Model

This exercise will only investigate multiple linear regression full model. The adjusted \(R^2\) is 0.3688. The \(AIC\) is 5615.372.

julianDate and Source are used to predict DailyTotal.

The model below is significant. The coefficients are significant.

With each unit increase in julianDate (1 day increment), DailyTotal decreases by 4.477. This shows that overall daily total is decreasing.

Transactions originating from POS1 is about $1087 less than Web.

Transactions originating from POS2 is about $1204 less than Web.

Transactions originating from POS3 is about $1930 less than Web.

train <- within(train, Source <- relevel(Source, ref = 'Web'))
m_1 <- lm(DailyTotal ~ julianDate + Source, data=train)
summary(m_1)
## 
## Call:
## lm(formula = DailyTotal ~ julianDate + Source, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2045.3  -305.7   -61.6   196.1  5817.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 83350.653  32621.223   2.555   0.0111 *  
## julianDate     -4.477      1.797  -2.492   0.0132 *  
## SourcePOS1  -1086.782    130.534  -8.326 2.13e-15 ***
## SourcePOS2  -1204.452    130.180  -9.252  < 2e-16 ***
## SourcePOS3  -1930.104    141.244 -13.665  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 880.5 on 337 degrees of freedom
## Multiple R-squared:  0.3762, Adjusted R-squared:  0.3688 
## F-statistic:  50.8 on 4 and 337 DF,  p-value: < 2.2e-16
paste("AIC: ", AIC(m_1))
## [1] "AIC:  5615.37176983201"


\(M_2\): Log-Linear Multiple Linear Regression Full Model

The adjusted \(R^2\) is 0.6626. \(AIC\) is 777.2032.

m_2 <- lm(log(DailyTotal) ~ julianDate + Source, data=train)
summary(m_2)
## 
## Call:
## lm(formula = log(DailyTotal) ~ julianDate + Source, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.46479 -0.35364  0.05274  0.41458  2.60326 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 64.983069  27.643001   2.351   0.0193 *  
## julianDate  -0.003174   0.001523  -2.084   0.0379 *  
## SourcePOS1  -0.521477   0.110614  -4.714 3.55e-06 ***
## SourcePOS2  -0.754941   0.110313  -6.844 3.66e-11 ***
## SourcePOS3  -2.953982   0.119689 -24.680  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7461 on 337 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6626 
## F-statistic: 168.4 on 4 and 337 DF,  p-value: < 2.2e-16
paste("AIC: ", AIC(m_2))
## [1] "AIC:  777.203230830408"


Evaluation with Test Data

The \(RMSE\) value of the linear-linear model is 1464.747. The \(RMSE\) value of the log-linear model is 1667.16.

The log-linear model has a higher \(R^2\) and lower \(AIC\); however the \(RMSE\) of the log-linear is higher than linear-linear model.

Linear-Linear:

test_pred1 <- predict(m_1, test)
Metrics::rmse(test$DailyTotal, test_pred1)
## [1] 1464.747

The plot of actual vs. predicted shows three main clusters - indicative of the different sources.

plot(x=test$DailyTotal, y=test_pred1, xlab="actual",ylab="predicted")

Log-Linear:

The predictions of the log-linear model is transformed by applying exp function.

test_pred2 <- exp(predict(m_2, test))
Metrics::rmse(test$DailyTotal, test_pred2)
## [1] 1667.16

The plot of actual vs. predicted show four clusters - indicative of the different sources.

plot(x=test$DailyTotal, y=test_pred2, xlab="actual",ylab="predicted")