Predicting Daily Totals

library(kableExtra)
library(caret)
library(ggplot2)
library(Metrics) #rmse
library(gridExtra)

Introduction

This blog post is related to the blogs:

This blog uses the same data set as the blog mentioned above but investigates the number of successful credit card transactions on a given day. This data set does not contain any personally identifiable information. It presents summary data on the number of successful and declined transactions for each day and market type.

Column	Description
`Day`	Day of transaction
`Source`	Describes source of transaction
`Daily Total`	Total for the day

Summary of Exercise

This blog is particularly interesting to me because I assist with credit card reconciliation each month, and the results of this exercise would provide more insights into the data set that I work with.

This data set contains transaction counts from August to November. The model is going to be trained with data from August to October and validated with November data.

Response Variable

The response variable in this exercise is Daily Total, which is the total amount for the day for the given source.

Explanatory Variable

There are only two explanatory variables. Day is the date when the transactions occurred. Source describes where the transaction originated from.

Process Data

The data set is read in a data frame. The data frame has 445 rows.

df <- read.csv('creditcard_dailyTotal_bySource.csv')
colnames(df) <- c('Day', 'Source', 'DailyTotal')
paste('Number of rows: ', nrow(df))

## [1] "Number of rows:  445"

The day column is converted to the date type. Below you will see the data types of each columns.

The Day column contain dates. DailyTotal is a number. Source is a factor with 4 levels.

df$Day <- as.Date(df$Day, format="%m/%d/%Y")
str(df)

## 'data.frame':    445 obs. of  3 variables:
##  $ Day       : Date, format: "2019-08-01" "2019-08-01" ...
##  $ Source    : Factor w/ 4 levels "POS1","POS2",..: 4 1 2 1 4 2 3 4 1 2 ...
##  $ DailyTotal: num  3028 934 845 1183 905 ...

Below is summary of the data. There are no NA values in the data set.

summary(df)

##       Day              Source      DailyTotal     
##  Min.   :2019-08-01   POS1:120   Min.   :   10.0  
##  1st Qu.:2019-08-30   POS2:120   1st Qu.:  444.0  
##  Median :2019-09-29   POS3: 83   Median :  893.8  
##  Mean   :2019-09-28   Web :122   Mean   : 1208.2  
##  3rd Qu.:2019-10-28              3rd Qu.: 1298.4  
##  Max.   :2019-11-30              Max.   :15760.8

As you can see, the maximum daily total is 15,760.8. For the purpose of this exercise, this amount is going to be dropped. Below is a list of observations that are over 5,000. There are 15 such observations.

	Day	Source	DailyTotal
28	2019-08-08	Web	7578.85
36	2019-08-10	Web	5816.90
74	2019-08-20	Web	8004.65
77	2019-08-21	Web	7255.40
81	2019-08-22	Web	5886.69
243	2019-10-04	Web	5084.72
257	2019-10-08	Web	5705.27
272	2019-10-12	Web	15760.76
347	2019-11-02	Web	5272.89
368	2019-11-08	Web	6887.38
396	2019-11-16	Web	8639.29
406	2019-11-19	Web	6123.45
410	2019-11-20	Web	5325.17
430	2019-11-26	Web	7053.81
434	2019-11-27	Web	6159.53

Drop the maximum daily total of 15,760.8.

df <- subset(df, df$DailyTotal < 15000)

Preview data.

Day	Source	DailyTotal
2019-08-01	Web	3028.29
2019-08-01	POS1	934.00
2019-08-01	POS2	845.24
2019-08-02	POS1	1183.00
2019-08-02	Web	905.05
2019-08-02	POS2	843.64

Below is box plot of Daily Total grouped by Source.

ggplot(df, aes(x=Source, y=DailyTotal, color=Source)) +geom_boxplot()

Below is distribution of daily total. The distribution looks roughly normal but extremely skewed to the right.

ggplot(df, aes(DailyTotal)) + geom_histogram(aes(fill=Source))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Below is density plot of Daily Totals broken down by Source. The distribution of Web and POS3 are more skewed to the right.

p0 <- ggplot(df, aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('All Sources')
p1 <- ggplot(subset(df, Source == 'Web'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('Web')
p2 <- ggplot(subset(df, Source == 'POS1'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('POS1')
p3 <- ggplot(subset(df, Source == 'POS2'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('POS2')
p4 <- ggplot(subset(df, Source == 'POS3'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('POS3')
grid.arrange(p0, p1, p2, p3, p4, ncol=3, nrow=2)

Convert the Day date value to a Julian date, which is the number of days since 1/1/1970. The regression is going to be based on the Julian date.

df$julianDate <- julian(df$Day)

Create train and test subsets for the successful transactions. Transactions from August through October are going to be part of the train set. November transactions are going to be part of the test set.

train <- subset(df, df$Day < '2019-11-01')
test <- subset(df, df$Day > '2019-10-31')

Below is scatter plot of daily totals of transactions originating from the Web from August through October. The plot shows a general trend of decreasing daily totals.

ggplot(subset(train, train$Source == 'Web'), aes(x=Day, y=DailyTotal, color=Source)) + geom_point() +  geom_smooth(method='lm')

Below is a scatter plot of Daily Total for the rest of the other sources that are not Web. For POS1 and POS2 the plot shows decreasing daily totals. For POS3 the the plot shows increasing trend.

ggplot(subset(train, train$Source != 'Web'), aes(x=Day, y=DailyTotal, color=Source)) + geom_point() +  geom_smooth(method='lm')

Modeling

$M_1$: Linear-Linear Multiple Linear Regression Full Model

This exercise will only investigate multiple linear regression full model. The adjusted $R^2$ is 0.3688. The $AIC$ is 5615.372.

julianDate and Source are used to predict DailyTotal.

The model below is significant. The coefficients are significant.

With each unit increase in julianDate (1 day increment), DailyTotal decreases by 4.477. This shows that overall daily total is decreasing.

Transactions originating from POS1 is about $1087 less than Web.

Transactions originating from POS2 is about $1204 less than Web.

Transactions originating from POS3 is about $1930 less than Web.

train <- within(train, Source <- relevel(Source, ref = 'Web'))
m_1 <- lm(DailyTotal ~ julianDate + Source, data=train)
summary(m_1)

## 
## Call:
## lm(formula = DailyTotal ~ julianDate + Source, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2045.3  -305.7   -61.6   196.1  5817.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 83350.653  32621.223   2.555   0.0111 *  
## julianDate     -4.477      1.797  -2.492   0.0132 *  
## SourcePOS1  -1086.782    130.534  -8.326 2.13e-15 ***
## SourcePOS2  -1204.452    130.180  -9.252  < 2e-16 ***
## SourcePOS3  -1930.104    141.244 -13.665  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 880.5 on 337 degrees of freedom
## Multiple R-squared:  0.3762, Adjusted R-squared:  0.3688 
## F-statistic:  50.8 on 4 and 337 DF,  p-value: < 2.2e-16

paste("AIC: ", AIC(m_1))

## [1] "AIC:  5615.37176983201"

$M_2$: Log-Linear Multiple Linear Regression Full Model

The adjusted $R^2$ is 0.6626. $AIC$ is 777.2032.

m_2 <- lm(log(DailyTotal) ~ julianDate + Source, data=train)
summary(m_2)

## 
## Call:
## lm(formula = log(DailyTotal) ~ julianDate + Source, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.46479 -0.35364  0.05274  0.41458  2.60326 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 64.983069  27.643001   2.351   0.0193 *  
## julianDate  -0.003174   0.001523  -2.084   0.0379 *  
## SourcePOS1  -0.521477   0.110614  -4.714 3.55e-06 ***
## SourcePOS2  -0.754941   0.110313  -6.844 3.66e-11 ***
## SourcePOS3  -2.953982   0.119689 -24.680  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7461 on 337 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6626 
## F-statistic: 168.4 on 4 and 337 DF,  p-value: < 2.2e-16

paste("AIC: ", AIC(m_2))

## [1] "AIC:  777.203230830408"

Evaluation with Test Data

The $RMSE$ value of the linear-linear model is 1464.747. The $RMSE$ value of the log-linear model is 1667.16.

The log-linear model has a higher $R^2$ and lower $AIC$; however the $RMSE$ of the log-linear is higher than linear-linear model.

Linear-Linear:

test_pred1 <- predict(m_1, test)
Metrics::rmse(test$DailyTotal, test_pred1)

## [1] 1464.747

The plot of actual vs. predicted shows three main clusters - indicative of the different sources.

plot(x=test$DailyTotal, y=test_pred1, xlab="actual",ylab="predicted")

Log-Linear:

The predictions of the log-linear model is transformed by applying exp function.

test_pred2 <- exp(predict(m_2, test))
Metrics::rmse(test$DailyTotal, test_pred2)

## [1] 1667.16

The plot of actual vs. predicted show four clusters - indicative of the different sources.

plot(x=test$DailyTotal, y=test_pred2, xlab="actual",ylab="predicted")