Assignment 1 for DAM

0.1 Introduction
0.2 Business understanding
0.3 Exploration Data Analysis
0.4 Model Development
0.5 Model prediction
0.6 Advanced model fitting
0.7 Conclusion
0.8 References

0.1 Introduction

This report is to analyze and gain insights of monthly financial transactions data. It is to explore the up and down trend of monthly financial amount over years 2013 to 2016. More importantly, a few machine learning models will be developed and deployed to capture the trend of data and help predict the future months or years. A consideration of a few metrics have been pointed out to compare the performance of these models to select the optimal one.

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven methodology to provide a structured approach to plan a data mining project. As its powerful practicality, flexibility and its usefulness to solve business issues, we take this methodology into our report. 6 steps will be conducted:

Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment

0.2 Business understanding

Financial transactions data set describes the total transaction amounts for customers each month. Transaction volumes can vary greatly for different industries and locations. There are 10 industries and 10 locations, which are encoded by digit 1 to 10.

0.3 Exploration Data Analysis

##      date           customer_id           industry         location     
##  Length:94248       Length:94248       Min.   : 1.000   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 1.000   1st Qu.: 2.000  
##  Mode  :character   Mode  :character   Median : 2.000   Median : 4.000  
##                                        Mean   : 2.455   Mean   : 4.214  
##                                        3rd Qu.: 3.000   3rd Qu.: 6.000  
##                                        Max.   :10.000   Max.   :10.000  
##  monthly_amount     
##  Min.   :        0  
##  1st Qu.:    95323  
##  Median :   179399  
##  Mean   :   395397  
##  3rd Qu.:   375439  
##  Max.   :100000000

The first step is to understand the data structure. There are 5 fields in the sales data: date / month, customer identification, industry, location and monthly amount of sales. 94248 transactions has been made with industry and location ranging from 1 to 10 and monthly amount from 0 to 100,000,000.

From the plot, it can be found no missing information on any fields of data.

By plotting number of transactions and total monthly amount over locations, location 1 and 2 made the top contributions in terms of sales transactions and the total amount of sales, from 18% to 33% while other locations produced quite similar percentage, less than 10%.

On the number of transactions and total sale amounts over industry plot, the industry 1 were dominant by its contribution in number of transactions, nearly 50%. Meanwhile, the industry 2 only produced 23.4% of transactions but generate as high amount of sales as industry 1, about 22% to 23%. Industry 6 is quite similar to industry 2 in terms of high average amount of sales per transaction. Together with industry 3, they are 2nd group of big players, which produced amount of sales as 14% each.

Number of transactions and total amount of sales rose sharply throughout the years, from 2013 to 2017. The seasonal trend can be found on total amount of sales while the up trend for number of transactions is quite smooth. It is not surprising that total sales of location 1 and 2 increased significantly compared to other locations. Meanwhile, in terms of industry, industry 2, 3 and 1 shows a rapid growing over the years while others’ progress are quite slow.

Boxplot of sale transactions by the industry and location present their high variance with considerable amount of outliers. The median amount of spending per customer for industry 6 and 9 are highest, over 500,000 while the lowest ones belong to industry 1 and 10, less than 200,000. In terms of locations, most of locations had its median amount of spending less than 500,000.

Heatmap plot presents the distribution of total sales amount by location and industry. On the plot, it can be seen that under industry 6, there is no locations generated sales except location 1. It is the same situation in industry 10 & location 6 and industry 9 & location 8. Majority of cells are given the dark blue presenting its high amount of sales while few are bright blue or low amount of sales, which belongs to industry 2 and 3.

0.4 Model Development

In the scope of this report, a linear regression model will be developed and deployed to fit into time series data set of financial transactions. Different models could be developed and justifed by varieties of metrics to measure their performances. Prediction ability of model could be one of major factors to make the choice.

## # A tibble: 47 x 6
## # Groups:   date, industry [47]
##    date       industry location mean_monthly month  year
##    <date>     <fct>    <fct>           <dbl> <chr> <int>
##  1 2013-01-01 1        1             136081. 01     2013
##  2 2013-02-01 1        1             152964. 02     2013
##  3 2013-03-01 1        1             158481. 03     2013
##  4 2013-04-01 1        1             152341. 04     2013
##  5 2013-05-01 1        1             170330. 05     2013
##  6 2013-06-01 1        1             152575. 06     2013
##  7 2013-07-01 1        1             158755. 07     2013
##  8 2013-08-01 1        1             165313. 08     2013
##  9 2013-09-01 1        1             147199. 09     2013
## 10 2013-10-01 1        1             163409. 10     2013
## # … with 37 more rows

An aggregated data set using the fields date, industry and location, with a mean of monthly amount is created. To train dataset, a subset of data with industry 1 and location 1 is filtered out to form 47 rows of data. Each row presents a mean of monthly amount ranging from 2013 to 2016.

The mean amount of monthly sales over time plot presents its seasonality trend, specifically down trend at the end of year followed by the up trend at the beginning of year. Additionally, during the year, the pattern of fluctuation is repeated year by year. In average, the monthly mean amount of sales is increasing slowly over the years.

First, we split data into train and the test set The train set includes three years of data from 2013 to 2016 while test set includes one last year of data, 2016.

## 
## Call:
## lm(formula = mean_monthly ~ date, data = df_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17585  -7236   1484   6569  17030 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.764e+05  7.681e+04  -2.297   0.0279 *  
## date         2.083e+01  4.729e+00   4.405   0.0001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8966 on 34 degrees of freedom
## Multiple R-squared:  0.3633, Adjusted R-squared:  0.3446 
## F-statistic:  19.4 on 1 and 34 DF,  p-value: 1e-04

From above financial transaction plot, it exists a relationship between the mean amount of sales and time. So we put a simmple linear model to explore this relationship. The summary of the model presents R square, which indicates about 36% variability in the mean of monthly sales amount has been explained by the model, which leaves 64% unexplained variablity. The up trend of monthly sales over time is captured well in the model but not the seasonality of sales amount. The estimate coefficients represents the mean of monthly amount increases by about 21 day by day, or about 630 monthly. The small p-value indicates that it is unlikely that no relationship exists between date and mean of monthly sales amount.

## 
## Call:
## lm(formula = mean_monthly ~ date + month, data = df_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12943.4  -1553.1    494.6   2858.6  10733.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.758e+05  5.029e+04  -3.495 0.001954 ** 
## date         1.981e+01  3.123e+00   6.342 1.80e-06 ***
## month02      1.828e+04  4.561e+03   4.008 0.000551 ***
## month03      2.508e+04  4.563e+03   5.496 1.38e-05 ***
## month04      9.805e+03  4.568e+03   2.146 0.042623 *  
## month05      2.188e+04  4.575e+03   4.783 8.00e-05 ***
## month06      1.448e+04  4.584e+03   3.159 0.004389 ** 
## month07      1.903e+04  4.595e+03   4.142 0.000395 ***
## month08      2.004e+04  4.608e+03   4.349 0.000236 ***
## month09      1.453e+04  4.622e+03   3.144 0.004546 ** 
## month10      2.475e+04  4.639e+03   5.335 2.04e-05 ***
## month11      1.983e+04  4.658e+03   4.258 0.000296 ***
## month12      4.360e+03  4.678e+03   0.932 0.360955    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5584 on 23 degrees of freedom
## Multiple R-squared:  0.8329, Adjusted R-squared:  0.7457 
## F-statistic: 9.555 on 12 and 23 DF,  p-value: 2.627e-06

We introduce the month variable to capture the seasonality of sales amount. From the summary of linear model with a “formula = mean_monthly ~ date + month”, the R-squared improves significantly. A more 30% of explained varibility increases by the introduction of month variable makes the total of 83% of explained variability.

## Analysis of Variance Table
## 
## Model 1: mean_monthly ~ date
## Model 2: mean_monthly ~ date + month
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1     34 2733280204                                  
## 2     23  717291619 11 2.016e+09 5.8766 0.0001764 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To test whether this model is significantly better than the previous one, we compare both with the anova() function. The small p value proves its statistically significant. So introduction of month could give a tremendously high association with the mean of monthly amount.

Compared to the previous simple linear model, the fitted data given by this model perform quite impressively. It can capture up and down trend for each month significanly better. However, the seasonality trend are not same every year. So the model perform a little worse during 2014-2015 and 2015-2016 because these years having 3 cycles up/down instead of 4 cycles as other years.

Residuals vs fitted plot presents equal error variances or simply random fluctuations around the true line. The normal probability plot of the residuals is approximately linear supporting the condition that the error terms are normally distributed.

0.5 Model prediction

In general, the predicted data for year of 2016 underestimate the actual mean of monthly amount. The growth of 2014 and 2015 is not high but year of 2016 witnesses the significant rise over 2015. So briging the slow-paced of past data could mislead prediction of 2016.

## [1] 0.06164264

##                          X        R2     RMSE      MAE
## 1 Predictions vs Test data 0.5543456 11293.81 10457.18

##                            X        R2     RMSE      MAE
## 1 Fitted value vs Train data 0.8329209 4463.717 3172.824

The performance of prediction is significantly lower than the model’s performance on train data. The R2 is 0.55 lower than 0.83, which presents its low fit to the actual data. The prediction error RMSE is 11293, representing an error rate of 11293/mean(df_test$mean_monthly) = 11293/183214 ~ 6%, which is still good.

0.6 Advanced model fitting

We apply the modelling process we built for industry 1 and location 1 to all industries and locations. Then we calculate evaluation measure for the training data and your testing data, for all models. It is to indentify the two industries and locations with worst performance. Then an investigation could be conducted to see why and how we could improve the model in the future.

We were running all locations and industries through 3 below models:

mean_monthly ~ date + month
mean_monthly ~ date + month + year
mean_monthly ~ date + month * sin(2piyear)

The first model is a combination of date and month to capture the monthly trend of data. In the second model, we introduce the year variable which may catch a global trend of year. The last one is a complex algorithm, in which the interaction between month and year is suggested.

## # A tibble: 27 x 7
##    location industry    R2    RMSE     MAE  PVALUE RMSE_Mean
##    <fct>    <fct>    <dbl>   <dbl>   <dbl>   <dbl>     <dbl>
##  1 7        5        0.500  28618.  24392. 0.00845      8.26
##  2 8        2        0.492  53696.  46229. 0.0104      16.2 
##  3 4        9        0.468 485829. 391556. 0.0181      52.2 
##  4 6        5        0.439  66656.  55353. 0.0339      11.5 
##  5 7        1        0.420  15276.  11960. 0.0499       7.88
##  6 4        8        0.394  38038.  30286. 0.0797      10.2 
##  7 9        10       0.379   7518.   5584. 0.922        7.97
##  8 5        2        0.371  47363.  37861. 0.118       14.1 
##  9 1        2        0.371  21778.  17635. 0.119        5.02
## 10 8        7        0.321  28917.  22022. 0.244       10.4 
## # … with 17 more rows

## # A tibble: 17 x 7
##    location industry    R2    RMSE     MAE PVALUE RMSE_Mean
##    <fct>    <fct>    <dbl>   <dbl>   <dbl>  <dbl>     <dbl>
##  1 7        1        0.460  14742.  12231. 0.0369      7.61
##  2 10       7        0.454  50015.  40807. 0.0412     12.9 
##  3 9        7        0.396  88321.  74943. 0.117      15.5 
##  4 6        1        0.384   6022.   5278. 0.140       4.73
##  5 9        10       0.379   7518.   5584. 0.922       7.97
##  6 1        2        0.373  21746.  17649. 0.167       5.01
##  7 8        7        0.350  28289.  21204. 0.228      10.2 
##  8 5        5        0.324  66135.  34488. 0.336      29.6 
##  9 1        3        0.319  80968.  68222. 0.330      12.1 
## 10 1        5        0.308  54529.  44824. 0.371       8.47
## 11 10       5        0.241  82884.  72349. 0.649      15.1 
## 12 5        3        0.232 125442. 110487. 0.688      26.0 
## 13 7        10       0.214  99659.  74356. 0.757      31.3 
## 14 8        3        0.213  31049.  23461. 0.927      14.9 
## 15 4        7        0.210  49906.  42007. 0.773      11.6 
## 16 3        5        0.186  46758.  28867. 0.899      15.1 
## 17 6        7        0.178  30855.  27066. 0.875      15.4

## # A tibble: 4 x 7
##   location industry    R2   RMSE    MAE PVALUE RMSE_Mean
##   <fct>    <fct>    <dbl>  <dbl>  <dbl>  <dbl>     <dbl>
## 1 1        5        0.497 46491. 36434.  0.596      7.22
## 2 6        5        0.473 64643. 54519.  0.682     11.1 
## 3 9        7        0.462 83335. 67208.  0.716     14.6 
## 4 6        7        0.445 25346. 21222.  0.769     12.6

To compare these model’s performances, we filter data sets having R2 less than 0.5 so the worse model is the one giving out more results.

Model (1) made worst performance with 27 results while model (2) did better with 17 outputs. Model 3 executed a best performance with significantly few, only 4 industries and locations which R2 each less than 0.5 and most of them are greater than 0.4. Additionally among four above locations and industries, there is no RMSE greater than 100,000 and all less than 15%.

Model 2 proves its best performance so far. We take model 2 to next step to continue to assess its performance.

Among data sets given R2 below 0.5, we picked two worst data sets by its high RMSE:

Location 9 & Industry 7
Location 6 & Industry 7

location	industry	R2	RMSE	MAE	PVALUE	RMSE_Mean
9	7	0.462	83335.	67208.	0.716	14.6
6	7	0.445	25346.	21222.	0.769	12.6

Plotting the fitted & real data for two worst datasets, location 9 & industry 7 and location 6 & industry 7, we can see that the fitted line cannot catch the constant up and down trend of the monthly mean, that could be a reason for a low R2. However, the error variances generated by this model is not so bad, which is a good indicator for its predict capability.

## # A tibble: 19 x 7
##    location industry    R2    RMSE     MAE     PVALUE RMSE_Mean
##    <fct>    <fct>    <dbl>   <dbl>   <dbl>      <dbl>     <dbl>
##  1 10       10       1          0       0  NaN             0   
##  2 3        10       1.000      0       0  NaN             0   
##  3 5        9        0.984  74757.  46647.   7.93e- 2     11.1 
##  4 8        10       0.978 487762. 392714.   1.69e-24      2.31
##  5 10       9        0.963  43353.  25762.   2.02e- 1      7.73
##  6 9        9        0.931  67821.  52898.   1.59e-11      8.33
##  7 2        9        0.927  62145.  45570.   7.99e-16      7.55
##  8 8        8        0.924  46254.  34857.   1.65e-15     10.9 
##  9 1        10       0.920  66324.  37289.   3.67e- 1     17.7 
## 10 2        2        0.911  16170.  13147.   2.24e-14      4.23
## 11 4        4        0.908  20317.  15754.   3.89e-14      6.82
## 12 3        4        0.905  14356.  11689.   7.33e-14      5.36
## 13 2        4        0.895   6624.   5255.   3.65e-13      3.24
## 14 7        4        0.890  19176.  14298.   7.16e-13      4.33
## 15 2        8        0.885  50969.  40465.   1.68e-12      9.52
## 16 3        1        0.884   6447.   5192.   1.84e-12      3.69
## 17 4        1        0.878   6306.   4859.   4.21e-12      2.69
## 18 4        3        0.866  59101.  45356.   1.83e-11      7.86
## 19 7        3        0.857  46445.  37433.   5.63e-11      9.66

## # A tibble: 26 x 7
##    location industry    R2    RMSE     MAE     PVALUE RMSE_Mean
##    <fct>    <fct>    <dbl>   <dbl>   <dbl>      <dbl>     <dbl>
##  1 10       10       1          0       0  NaN             0   
##  2 3        10       1.000      0       0  NaN             0   
##  3 1        10       0.986  27346.  14617.   3.08e- 1      7.31
##  4 5        9        0.984  74757.  46647.   7.93e- 2     11.1 
##  5 8        10       0.979 476035. 371702.   8.76e-24      2.25
##  6 10       9        0.963  43353.  25762.   2.02e- 1      7.73
##  7 2        9        0.954  49459.  38765.   3.07e-18      6.01
##  8 9        9        0.931  67802.  53026.   8.68e-11      8.33
##  9 8        8        0.927  45371.  33256.   5.39e-15     10.7 
## 10 2        2        0.917  15643.  12880.   4.32e-14      4.09
## # … with 16 more rows

## # A tibble: 43 x 7
##    location industry    R2    RMSE     MAE     PVALUE RMSE_Mean
##    <fct>    <fct>    <dbl>   <dbl>   <dbl>      <dbl>     <dbl>
##  1 1        10       1          0       0  NaN             0   
##  2 5        9        1          0       0  NaN             0   
##  3 9        10       1          0       0  NaN             0   
##  4 10       9        1          0       0  NaN             0   
##  5 10       10       1          0       0  NaN             0   
##  6 3        10       1.000      0       0  NaN             0   
##  7 10       3        0.998  10747.   6725.   2.47e- 5      2.39
##  8 8        10       0.986 393350. 296260.   3.28e-15      1.86
##  9 9        9        0.979  37129.  27745.   1.84e- 7      4.56
## 10 2        2        0.974   8798.   6887.   2.23e-12      2.30
## # … with 33 more rows

To evaluate its performance on best datasets, we filter R2 better than 0.85, model 3 generates 43 rows, a significantly higher than 26 rows of model 2 and 19 of model 1. The we plot a few top datasets to see how it captures very well the data.

location	industry	R2	PVALUE
1	10	1	NaN
5	9	1	NaN
9	10	1	NaN
10	9	1	NaN
10	10	1	NaN
3	10	1.000	NaN

0.7 Conclusion

In general, the total amount of sales for a whole industries and locations presents a growth year by year. Location 1 and 2 witnessed a high amount of sales and considerable rise over the years while there was a rapid jump in industry 2, 3 and 1.

A few linear models has been developed and deployed to explore the trend and predict the future through several locations and industries. The complex model, “mean_monthly ~ date + month * sin(2piyear)” which integrates the interaction between month and year proves its optimal performances.

0.8 References

A. Abhyankar, L. Sarno, G. Valente Exchange rates and fundamentals: evidence on the economic value of predictability J. Int. Econ., 66 (2005), pp. 325-348