This report is to analyze and gain insights of monthly financial transactions data. It is to explore the up and down trend of monthly financial amount over years 2013 to 2016. More importantly, a few machine learning models will be developed and deployed to capture the trend of data and help predict the future months or years. A consideration of a few metrics have been pointed out to compare the performance of these models to select the optimal one.
CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven methodology to provide a structured approach to plan a data mining project. As its powerful practicality, flexibility and its usefulness to solve business issues, we take this methodology into our report. 6 steps will be conducted:
Financial transactions data set describes the total transaction amounts for customers each month. Transaction volumes can vary greatly for different industries and locations. There are 10 industries and 10 locations, which are encoded by digit 1 to 10.
## date customer_id industry location
## Length:94248 Length:94248 Min. : 1.000 Min. : 1.000
## Class :character Class :character 1st Qu.: 1.000 1st Qu.: 2.000
## Mode :character Mode :character Median : 2.000 Median : 4.000
## Mean : 2.455 Mean : 4.214
## 3rd Qu.: 3.000 3rd Qu.: 6.000
## Max. :10.000 Max. :10.000
## monthly_amount
## Min. : 0
## 1st Qu.: 95323
## Median : 179399
## Mean : 395397
## 3rd Qu.: 375439
## Max. :100000000
The first step is to understand the data structure. There are 5 fields in the sales data: date / month, customer identification, industry, location and monthly amount of sales. 94248 transactions has been made with industry and location ranging from 1 to 10 and monthly amount from 0 to 100,000,000.
From the plot, it can be found no missing information on any fields of data.
By plotting number of transactions and total monthly amount over locations, location 1 and 2 made the top contributions in terms of sales transactions and the total amount of sales, from 18% to 33% while other locations produced quite similar percentage, less than 10%.
On the number of transactions and total sale amounts over industry plot, the industry 1 were dominant by its contribution in number of transactions, nearly 50%. Meanwhile, the industry 2 only produced 23.4% of transactions but generate as high amount of sales as industry 1, about 22% to 23%. Industry 6 is quite similar to industry 2 in terms of high average amount of sales per transaction. Together with industry 3, they are 2nd group of big players, which produced amount of sales as 14% each.
Number of transactions and total amount of sales rose sharply throughout the years, from 2013 to 2017. The seasonal trend can be found on total amount of sales while the up trend for number of transactions is quite smooth. It is not surprising that total sales of location 1 and 2 increased significantly compared to other locations. Meanwhile, in terms of industry, industry 2, 3 and 1 shows a rapid growing over the years while others’ progress are quite slow.
Boxplot of sale transactions by the industry and location present their high variance with considerable amount of outliers. The median amount of spending per customer for industry 6 and 9 are highest, over 500,000 while the lowest ones belong to industry 1 and 10, less than 200,000. In terms of locations, most of locations had its median amount of spending less than 500,000.
Heatmap plot presents the distribution of total sales amount by location and industry. On the plot, it can be seen that under industry 6, there is no locations generated sales except location 1. It is the same situation in industry 10 & location 6 and industry 9 & location 8. Majority of cells are given the dark blue presenting its high amount of sales while few are bright blue or low amount of sales, which belongs to industry 2 and 3.
In the scope of this report, a linear regression model will be developed and deployed to fit into time series data set of financial transactions. Different models could be developed and justifed by varieties of metrics to measure their performances. Prediction ability of model could be one of major factors to make the choice.
## # A tibble: 47 x 6
## # Groups: date, industry [47]
## date industry location mean_monthly month year
## <date> <fct> <fct> <dbl> <chr> <int>
## 1 2013-01-01 1 1 136081. 01 2013
## 2 2013-02-01 1 1 152964. 02 2013
## 3 2013-03-01 1 1 158481. 03 2013
## 4 2013-04-01 1 1 152341. 04 2013
## 5 2013-05-01 1 1 170330. 05 2013
## 6 2013-06-01 1 1 152575. 06 2013
## 7 2013-07-01 1 1 158755. 07 2013
## 8 2013-08-01 1 1 165313. 08 2013
## 9 2013-09-01 1 1 147199. 09 2013
## 10 2013-10-01 1 1 163409. 10 2013
## # … with 37 more rows
An aggregated data set using the fields date, industry and location, with a mean of monthly amount is created. To train dataset, a subset of data with industry 1 and location 1 is filtered out to form 47 rows of data. Each row presents a mean of monthly amount ranging from 2013 to 2016.
The mean amount of monthly sales over time plot presents its seasonality trend, specifically down trend at the end of year followed by the up trend at the beginning of year. Additionally, during the year, the pattern of fluctuation is repeated year by year. In average, the monthly mean amount of sales is increasing slowly over the years.
First, we split data into train and the test set The train set includes three years of data from 2013 to 2016 while test set includes one last year of data, 2016.
##
## Call:
## lm(formula = mean_monthly ~ date, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17585 -7236 1484 6569 17030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.764e+05 7.681e+04 -2.297 0.0279 *
## date 2.083e+01 4.729e+00 4.405 0.0001 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8966 on 34 degrees of freedom
## Multiple R-squared: 0.3633, Adjusted R-squared: 0.3446
## F-statistic: 19.4 on 1 and 34 DF, p-value: 1e-04
From above financial transaction plot, it exists a relationship between the mean amount of sales and time. So we put a simmple linear model to explore this relationship. The summary of the model presents R square, which indicates about 36% variability in the mean of monthly sales amount has been explained by the model, which leaves 64% unexplained variablity. The up trend of monthly sales over time is captured well in the model but not the seasonality of sales amount. The estimate coefficients represents the mean of monthly amount increases by about 21 day by day, or about 630 monthly. The small p-value indicates that it is unlikely that no relationship exists between date and mean of monthly sales amount.
##
## Call:
## lm(formula = mean_monthly ~ date + month, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12943.4 -1553.1 494.6 2858.6 10733.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.758e+05 5.029e+04 -3.495 0.001954 **
## date 1.981e+01 3.123e+00 6.342 1.80e-06 ***
## month02 1.828e+04 4.561e+03 4.008 0.000551 ***
## month03 2.508e+04 4.563e+03 5.496 1.38e-05 ***
## month04 9.805e+03 4.568e+03 2.146 0.042623 *
## month05 2.188e+04 4.575e+03 4.783 8.00e-05 ***
## month06 1.448e+04 4.584e+03 3.159 0.004389 **
## month07 1.903e+04 4.595e+03 4.142 0.000395 ***
## month08 2.004e+04 4.608e+03 4.349 0.000236 ***
## month09 1.453e+04 4.622e+03 3.144 0.004546 **
## month10 2.475e+04 4.639e+03 5.335 2.04e-05 ***
## month11 1.983e+04 4.658e+03 4.258 0.000296 ***
## month12 4.360e+03 4.678e+03 0.932 0.360955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5584 on 23 degrees of freedom
## Multiple R-squared: 0.8329, Adjusted R-squared: 0.7457
## F-statistic: 9.555 on 12 and 23 DF, p-value: 2.627e-06
We introduce the month variable to capture the seasonality of sales amount. From the summary of linear model with a “formula = mean_monthly ~ date + month”, the R-squared improves significantly. A more 30% of explained varibility increases by the introduction of month variable makes the total of 83% of explained variability.
## Analysis of Variance Table
##
## Model 1: mean_monthly ~ date
## Model 2: mean_monthly ~ date + month
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 34 2733280204
## 2 23 717291619 11 2.016e+09 5.8766 0.0001764 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To test whether this model is significantly better than the previous one, we compare both with the anova() function. The small p value proves its statistically significant. So introduction of month could give a tremendously high association with the mean of monthly amount.
Compared to the previous simple linear model, the fitted data given by this model perform quite impressively. It can capture up and down trend for each month significanly better. However, the seasonality trend are not same every year. So the model perform a little worse during 2014-2015 and 2015-2016 because these years having 3 cycles up/down instead of 4 cycles as other years.
Residuals vs fitted plot presents equal error variances or simply random fluctuations around the true line. The normal probability plot of the residuals is approximately linear supporting the condition that the error terms are normally distributed.
In general, the predicted data for year of 2016 underestimate the actual mean of monthly amount. The growth of 2014 and 2015 is not high but year of 2016 witnesses the significant rise over 2015. So briging the slow-paced of past data could mislead prediction of 2016.
## [1] 0.06164264
## X R2 RMSE MAE
## 1 Predictions vs Test data 0.5543456 11293.81 10457.18
## X R2 RMSE MAE
## 1 Fitted value vs Train data 0.8329209 4463.717 3172.824
The performance of prediction is significantly lower than the model’s performance on train data. The R2 is 0.55 lower than 0.83, which presents its low fit to the actual data. The prediction error RMSE is 11293, representing an error rate of 11293/mean(df_test$mean_monthly) = 11293/183214 ~ 6%, which is still good.
We apply the modelling process we built for industry 1 and location 1 to all industries and locations. Then we calculate evaluation measure for the training data and your testing data, for all models. It is to indentify the two industries and locations with worst performance. Then an investigation could be conducted to see why and how we could improve the model in the future.
We were running all locations and industries through 3 below models:
mean_monthly ~ date + month
mean_monthly ~ date + month + year
mean_monthly ~ date + month * sin(2piyear)
The first model is a combination of date and month to capture the monthly trend of data. In the second model, we introduce the year variable which may catch a global trend of year. The last one is a complex algorithm, in which the interaction between month and year is suggested.
## # A tibble: 27 x 7
## location industry R2 RMSE MAE PVALUE RMSE_Mean
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 5 0.500 28618. 24392. 0.00845 8.26
## 2 8 2 0.492 53696. 46229. 0.0104 16.2
## 3 4 9 0.468 485829. 391556. 0.0181 52.2
## 4 6 5 0.439 66656. 55353. 0.0339 11.5
## 5 7 1 0.420 15276. 11960. 0.0499 7.88
## 6 4 8 0.394 38038. 30286. 0.0797 10.2
## 7 9 10 0.379 7518. 5584. 0.922 7.97
## 8 5 2 0.371 47363. 37861. 0.118 14.1
## 9 1 2 0.371 21778. 17635. 0.119 5.02
## 10 8 7 0.321 28917. 22022. 0.244 10.4
## # … with 17 more rows
## # A tibble: 17 x 7
## location industry R2 RMSE MAE PVALUE RMSE_Mean
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 1 0.460 14742. 12231. 0.0369 7.61
## 2 10 7 0.454 50015. 40807. 0.0412 12.9
## 3 9 7 0.396 88321. 74943. 0.117 15.5
## 4 6 1 0.384 6022. 5278. 0.140 4.73
## 5 9 10 0.379 7518. 5584. 0.922 7.97
## 6 1 2 0.373 21746. 17649. 0.167 5.01
## 7 8 7 0.350 28289. 21204. 0.228 10.2
## 8 5 5 0.324 66135. 34488. 0.336 29.6
## 9 1 3 0.319 80968. 68222. 0.330 12.1
## 10 1 5 0.308 54529. 44824. 0.371 8.47
## 11 10 5 0.241 82884. 72349. 0.649 15.1
## 12 5 3 0.232 125442. 110487. 0.688 26.0
## 13 7 10 0.214 99659. 74356. 0.757 31.3
## 14 8 3 0.213 31049. 23461. 0.927 14.9
## 15 4 7 0.210 49906. 42007. 0.773 11.6
## 16 3 5 0.186 46758. 28867. 0.899 15.1
## 17 6 7 0.178 30855. 27066. 0.875 15.4
## # A tibble: 4 x 7
## location industry R2 RMSE MAE PVALUE RMSE_Mean
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 5 0.497 46491. 36434. 0.596 7.22
## 2 6 5 0.473 64643. 54519. 0.682 11.1
## 3 9 7 0.462 83335. 67208. 0.716 14.6
## 4 6 7 0.445 25346. 21222. 0.769 12.6
To compare these model’s performances, we filter data sets having R2 less than 0.5 so the worse model is the one giving out more results.
Model (1) made worst performance with 27 results while model (2) did better with 17 outputs. Model 3 executed a best performance with significantly few, only 4 industries and locations which R2 each less than 0.5 and most of them are greater than 0.4. Additionally among four above locations and industries, there is no RMSE greater than 100,000 and all less than 15%.
Model 2 proves its best performance so far. We take model 2 to next step to continue to assess its performance.
Among data sets given R2 below 0.5, we picked two worst data sets by its high RMSE:
| location | industry | R2 | RMSE | MAE | PVALUE | RMSE_Mean |
|---|---|---|---|---|---|---|
| 9 | 7 | 0.462 | 83335. | 67208. | 0.716 | 14.6 |
| 6 | 7 | 0.445 | 25346. | 21222. | 0.769 | 12.6 |
Plotting the fitted & real data for two worst datasets, location 9 & industry 7 and location 6 & industry 7, we can see that the fitted line cannot catch the constant up and down trend of the monthly mean, that could be a reason for a low R2. However, the error variances generated by this model is not so bad, which is a good indicator for its predict capability.
## # A tibble: 19 x 7
## location industry R2 RMSE MAE PVALUE RMSE_Mean
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10 10 1 0 0 NaN 0
## 2 3 10 1.000 0 0 NaN 0
## 3 5 9 0.984 74757. 46647. 7.93e- 2 11.1
## 4 8 10 0.978 487762. 392714. 1.69e-24 2.31
## 5 10 9 0.963 43353. 25762. 2.02e- 1 7.73
## 6 9 9 0.931 67821. 52898. 1.59e-11 8.33
## 7 2 9 0.927 62145. 45570. 7.99e-16 7.55
## 8 8 8 0.924 46254. 34857. 1.65e-15 10.9
## 9 1 10 0.920 66324. 37289. 3.67e- 1 17.7
## 10 2 2 0.911 16170. 13147. 2.24e-14 4.23
## 11 4 4 0.908 20317. 15754. 3.89e-14 6.82
## 12 3 4 0.905 14356. 11689. 7.33e-14 5.36
## 13 2 4 0.895 6624. 5255. 3.65e-13 3.24
## 14 7 4 0.890 19176. 14298. 7.16e-13 4.33
## 15 2 8 0.885 50969. 40465. 1.68e-12 9.52
## 16 3 1 0.884 6447. 5192. 1.84e-12 3.69
## 17 4 1 0.878 6306. 4859. 4.21e-12 2.69
## 18 4 3 0.866 59101. 45356. 1.83e-11 7.86
## 19 7 3 0.857 46445. 37433. 5.63e-11 9.66
## # A tibble: 26 x 7
## location industry R2 RMSE MAE PVALUE RMSE_Mean
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10 10 1 0 0 NaN 0
## 2 3 10 1.000 0 0 NaN 0
## 3 1 10 0.986 27346. 14617. 3.08e- 1 7.31
## 4 5 9 0.984 74757. 46647. 7.93e- 2 11.1
## 5 8 10 0.979 476035. 371702. 8.76e-24 2.25
## 6 10 9 0.963 43353. 25762. 2.02e- 1 7.73
## 7 2 9 0.954 49459. 38765. 3.07e-18 6.01
## 8 9 9 0.931 67802. 53026. 8.68e-11 8.33
## 9 8 8 0.927 45371. 33256. 5.39e-15 10.7
## 10 2 2 0.917 15643. 12880. 4.32e-14 4.09
## # … with 16 more rows
## # A tibble: 43 x 7
## location industry R2 RMSE MAE PVALUE RMSE_Mean
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 10 1 0 0 NaN 0
## 2 5 9 1 0 0 NaN 0
## 3 9 10 1 0 0 NaN 0
## 4 10 9 1 0 0 NaN 0
## 5 10 10 1 0 0 NaN 0
## 6 3 10 1.000 0 0 NaN 0
## 7 10 3 0.998 10747. 6725. 2.47e- 5 2.39
## 8 8 10 0.986 393350. 296260. 3.28e-15 1.86
## 9 9 9 0.979 37129. 27745. 1.84e- 7 4.56
## 10 2 2 0.974 8798. 6887. 2.23e-12 2.30
## # … with 33 more rows
To evaluate its performance on best datasets, we filter R2 better than 0.85, model 3 generates 43 rows, a significantly higher than 26 rows of model 2 and 19 of model 1. The we plot a few top datasets to see how it captures very well the data.
| location | industry | R2 | RMSE | MAE | PVALUE | RMSE_Mean |
|---|---|---|---|---|---|---|
| 1 | 10 | 1 | 0 | 0 | NaN | 0 |
| 5 | 9 | 1 | 0 | 0 | NaN | 0 |
| 9 | 10 | 1 | 0 | 0 | NaN | 0 |
| 10 | 9 | 1 | 0 | 0 | NaN | 0 |
| 10 | 10 | 1 | 0 | 0 | NaN | 0 |
| 3 | 10 | 1.000 | 0 | 0 | NaN | 0 |
In general, the total amount of sales for a whole industries and locations presents a growth year by year. Location 1 and 2 witnessed a high amount of sales and considerable rise over the years while there was a rapid jump in industry 2, 3 and 1.
A few linear models has been developed and deployed to explore the trend and predict the future through several locations and industries. The complex model, “mean_monthly ~ date + month * sin(2piyear)” which integrates the interaction between month and year proves its optimal performances.
A. Abhyankar, L. Sarno, G. Valente Exchange rates and fundamentals: evidence on the economic value of predictability J. Int. Econ., 66 (2005), pp. 325-348