library(kableExtra)
library(caret)
library(ggplot2)
library(Metrics) #rmse
library(gridExtra)
This blog post is related to the blogs:
This blog uses the same data set as the blog mentioned above but investigates the number of successful credit card transactions on a given day. This data set does not contain any personally identifiable information. It presents summary data on the number of successful and declined transactions for each day and market type.
Column | Description |
---|---|
Day |
Day of transaction |
Source |
Describes source of transaction |
Daily Total |
Total for the day |
This blog is particularly interesting to me because I assist with credit card reconciliation each month, and the results of this exercise would provide more insights into the data set that I work with.
This data set contains transaction counts from August to November. The model is going to be trained with data from August to October and validated with November data.
The response variable in this exercise is Daily Total, which is the total amount for the day for the given source.
There are only two explanatory variables. Day is the date when the transactions occurred. Source describes where the transaction originated from.
The data set is read in a data frame. The data frame has 445 rows.
df <- read.csv('creditcard_dailyTotal_bySource.csv')
colnames(df) <- c('Day', 'Source', 'DailyTotal')
paste('Number of rows: ', nrow(df))
## [1] "Number of rows: 445"
The day column is converted to the date type. Below you will see the data types of each columns.
The Day column contain dates. DailyTotal is a number. Source is a factor with 4 levels.
df$Day <- as.Date(df$Day, format="%m/%d/%Y")
str(df)
## 'data.frame': 445 obs. of 3 variables:
## $ Day : Date, format: "2019-08-01" "2019-08-01" ...
## $ Source : Factor w/ 4 levels "POS1","POS2",..: 4 1 2 1 4 2 3 4 1 2 ...
## $ DailyTotal: num 3028 934 845 1183 905 ...
Below is summary of the data. There are no NA values in the data set.
summary(df)
## Day Source DailyTotal
## Min. :2019-08-01 POS1:120 Min. : 10.0
## 1st Qu.:2019-08-30 POS2:120 1st Qu.: 444.0
## Median :2019-09-29 POS3: 83 Median : 893.8
## Mean :2019-09-28 Web :122 Mean : 1208.2
## 3rd Qu.:2019-10-28 3rd Qu.: 1298.4
## Max. :2019-11-30 Max. :15760.8
As you can see, the maximum daily total is 15,760.8. For the purpose of this exercise, this amount is going to be dropped. Below is a list of observations that are over 5,000. There are 15 such observations.
Day | Source | DailyTotal | |
---|---|---|---|
28 | 2019-08-08 | Web | 7578.85 |
36 | 2019-08-10 | Web | 5816.90 |
74 | 2019-08-20 | Web | 8004.65 |
77 | 2019-08-21 | Web | 7255.40 |
81 | 2019-08-22 | Web | 5886.69 |
243 | 2019-10-04 | Web | 5084.72 |
257 | 2019-10-08 | Web | 5705.27 |
272 | 2019-10-12 | Web | 15760.76 |
347 | 2019-11-02 | Web | 5272.89 |
368 | 2019-11-08 | Web | 6887.38 |
396 | 2019-11-16 | Web | 8639.29 |
406 | 2019-11-19 | Web | 6123.45 |
410 | 2019-11-20 | Web | 5325.17 |
430 | 2019-11-26 | Web | 7053.81 |
434 | 2019-11-27 | Web | 6159.53 |
Drop the maximum daily total of 15,760.8.
df <- subset(df, df$DailyTotal < 15000)
Preview data.
Day | Source | DailyTotal |
---|---|---|
2019-08-01 | Web | 3028.29 |
2019-08-01 | POS1 | 934.00 |
2019-08-01 | POS2 | 845.24 |
2019-08-02 | POS1 | 1183.00 |
2019-08-02 | Web | 905.05 |
2019-08-02 | POS2 | 843.64 |
Below is box plot of Daily Total grouped by Source.
ggplot(df, aes(x=Source, y=DailyTotal, color=Source)) +geom_boxplot()
Below is distribution of daily total. The distribution looks roughly normal but extremely skewed to the right.
ggplot(df, aes(DailyTotal)) + geom_histogram(aes(fill=Source))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Below is density plot of Daily Totals broken down by Source. The distribution of Web and POS3 are more skewed to the right.
p0 <- ggplot(df, aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('All Sources')
p1 <- ggplot(subset(df, Source == 'Web'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('Web')
p2 <- ggplot(subset(df, Source == 'POS1'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('POS1')
p3 <- ggplot(subset(df, Source == 'POS2'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('POS2')
p4 <- ggplot(subset(df, Source == 'POS3'), aes(DailyTotal)) + geom_density(color="darkblue", fill="lightblue") + ggtitle('POS3')
grid.arrange(p0, p1, p2, p3, p4, ncol=3, nrow=2)
Convert the Day date value to a Julian date, which is the number of days since 1/1/1970. The regression is going to be based on the Julian date.
df$julianDate <- julian(df$Day)
Create train and test subsets for the successful transactions. Transactions from August through October are going to be part of the train set. November transactions are going to be part of the test set.
train <- subset(df, df$Day < '2019-11-01')
test <- subset(df, df$Day > '2019-10-31')
Below is scatter plot of daily totals of transactions originating from the Web from August through October. The plot shows a general trend of decreasing daily totals.
ggplot(subset(train, train$Source == 'Web'), aes(x=Day, y=DailyTotal, color=Source)) + geom_point() + geom_smooth(method='lm')
Below is a scatter plot of Daily Total for the rest of the other sources that are not Web. For POS1 and POS2 the plot shows decreasing daily totals. For POS3 the the plot shows increasing trend.
ggplot(subset(train, train$Source != 'Web'), aes(x=Day, y=DailyTotal, color=Source)) + geom_point() + geom_smooth(method='lm')
This exercise will only investigate multiple linear regression full model. The adjusted \(R^2\) is 0.3688. The \(AIC\) is 5615.372.
julianDate and Source are used to predict DailyTotal.
The model below is significant. The coefficients are significant.
With each unit increase in julianDate (1 day increment), DailyTotal decreases by 4.477. This shows that overall daily total is decreasing.
Transactions originating from POS1 is about $1087 less than Web.
Transactions originating from POS2 is about $1204 less than Web.
Transactions originating from POS3 is about $1930 less than Web.
train <- within(train, Source <- relevel(Source, ref = 'Web'))
m_1 <- lm(DailyTotal ~ julianDate + Source, data=train)
summary(m_1)
##
## Call:
## lm(formula = DailyTotal ~ julianDate + Source, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2045.3 -305.7 -61.6 196.1 5817.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83350.653 32621.223 2.555 0.0111 *
## julianDate -4.477 1.797 -2.492 0.0132 *
## SourcePOS1 -1086.782 130.534 -8.326 2.13e-15 ***
## SourcePOS2 -1204.452 130.180 -9.252 < 2e-16 ***
## SourcePOS3 -1930.104 141.244 -13.665 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 880.5 on 337 degrees of freedom
## Multiple R-squared: 0.3762, Adjusted R-squared: 0.3688
## F-statistic: 50.8 on 4 and 337 DF, p-value: < 2.2e-16
paste("AIC: ", AIC(m_1))
## [1] "AIC: 5615.37176983201"
The adjusted \(R^2\) is 0.6626. \(AIC\) is 777.2032.
m_2 <- lm(log(DailyTotal) ~ julianDate + Source, data=train)
summary(m_2)
##
## Call:
## lm(formula = log(DailyTotal) ~ julianDate + Source, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.46479 -0.35364 0.05274 0.41458 2.60326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.983069 27.643001 2.351 0.0193 *
## julianDate -0.003174 0.001523 -2.084 0.0379 *
## SourcePOS1 -0.521477 0.110614 -4.714 3.55e-06 ***
## SourcePOS2 -0.754941 0.110313 -6.844 3.66e-11 ***
## SourcePOS3 -2.953982 0.119689 -24.680 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7461 on 337 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6626
## F-statistic: 168.4 on 4 and 337 DF, p-value: < 2.2e-16
paste("AIC: ", AIC(m_2))
## [1] "AIC: 777.203230830408"
The \(RMSE\) value of the linear-linear model is 1464.747. The \(RMSE\) value of the log-linear model is 1667.16.
The log-linear model has a higher \(R^2\) and lower \(AIC\); however the \(RMSE\) of the log-linear is higher than linear-linear model.
test_pred1 <- predict(m_1, test)
Metrics::rmse(test$DailyTotal, test_pred1)
## [1] 1464.747
The plot of actual vs. predicted shows three main clusters - indicative of the different sources.
plot(x=test$DailyTotal, y=test_pred1, xlab="actual",ylab="predicted")
The predictions of the log-linear model is transformed by applying exp function.
test_pred2 <- exp(predict(m_2, test))
Metrics::rmse(test$DailyTotal, test_pred2)
## [1] 1667.16
The plot of actual vs. predicted show four clusters - indicative of the different sources.
plot(x=test$DailyTotal, y=test_pred2, xlab="actual",ylab="predicted")