Read & inspect transactions data

## Warning: Expecting logical in H3783 / R3783C8: got 'Pay by TnG'
## Warning: Expecting logical in H3787 / R3787C8: got 'Pay by card'

Aggregate the transactions by day, computing the sum, minimum, maximum, mean, median, and number of transactions for each day
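The original analysis code is not shown, so the following is only a minimal Python sketch of the daily aggregation step. The sample transactions, the function name `aggregate_daily`, and the output field names (`sum_sales`, `n_transactions`, and so on) are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical transactions as (date, amount) pairs; values are made up.
transactions = [
    ("2023-01-01", 120.0),
    ("2023-01-01", 80.0),
    ("2023-01-01", 100.0),
    ("2023-01-02", 50.0),
    ("2023-01-02", 150.0),
]


def aggregate_daily(rows):
    """Group amounts by date and compute per-day summary statistics."""
    by_day = defaultdict(list)
    for date, amount in rows:
        by_day[date].append(amount)

    daily = []
    for date in sorted(by_day):  # ISO dates sort chronologically as strings
        amounts = by_day[date]
        daily.append({
            "date": date,
            "sum_sales": sum(amounts),
            "min_sales": min(amounts),
            "max_sales": max(amounts),
            "mean_sales": mean(amounts),
            "median_sales": median(amounts),
            "n_transactions": len(amounts),
        })
    return daily


daily = aggregate_daily(transactions)
for row in daily:
    print(row)
```

Each output row corresponds to one calendar day, which is the granularity the later modelling steps work on.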

Create a new column called After_1_day_sum_sales that holds the following day's total sales, so that each day's features (today's metadata) are paired with tomorrow's value of the target variable
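The shift can be sketched as follows. This is an illustrative Python version, not the code used in the analysis: the helper name `add_next_day_target` and the sample rows are hypothetical, and the sketch assumes the rows are sorted by date with no missing days.

```python
def add_next_day_target(daily_rows, source="sum_sales",
                        target="After_1_day_sum_sales"):
    """Pair each day's row with the next day's `source` value.

    The last day has no "tomorrow", so it is dropped from the result.
    Assumes `daily_rows` is sorted by date with no gaps between days.
    """
    out = []
    for today, tomorrow in zip(daily_rows, daily_rows[1:]):
        row = dict(today)               # copy so the input is not mutated
        row[target] = tomorrow[source]  # tomorrow's sales become today's target
        out.append(row)
    return out


# Hypothetical daily totals, made up for illustration.
daily_rows = [
    {"date": "2023-01-01", "sum_sales": 300.0},
    {"date": "2023-01-02", "sum_sales": 200.0},
    {"date": "2023-01-03", "sum_sales": 450.0},
]

shifted = add_next_day_target(daily_rows)
for row in shifted:
    print(row)
```

Dropping the final row avoids training on a day whose target is unknown.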

Train four models: linear regression (LR), support vector machine (SVM), random forest, and decision tree

Findings

  1. SVM has the lowest MAE, whereas LR has the lowest RMSE.
  2. Because RMSE penalizes large errors more heavily than MAE, this suggests that LR handles extreme deviations better than SVM, while SVM is better overall at predicting typical sales.
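To illustrate how one model can win on MAE while another wins on RMSE, here is a small Python sketch. The numbers are invented for illustration and are not the results of this analysis; `model_a` and `model_b` are hypothetical prediction vectors.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: squaring makes large errors dominate."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up data: three typical days and one extreme day.
actual = [5000, 5200, 4800, 15000]

# model_a is exact on typical days but misses the extreme day badly.
model_a = [5000, 5200, 4800, 5000]
# model_b is moderately off everywhere but closer on the extreme day.
model_b = [7000, 7200, 6800, 10000]

for name, pred in [("model_a", model_a), ("model_b", model_b)]:
    print(name, "MAE:", mae(actual, pred), "RMSE:", round(rmse(actual, pred), 1))
```

In this toy case `model_a` has the lower MAE and `model_b` has the lower RMSE, the same pattern as SVM versus LR in the findings above.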

Findings

  1. None of the algorithms is able to handle the outliers (6 data points between roughly 10,000 and 20,000).
  2. Random Forest predicts some points unexpectedly high, up to about 12,000 where the actual value is around 5,000.
  3. Visually, the Decision Tree's predicted values are rigid, owing to the nature of its decision rules.

Next Step: Remove potential outliers and train again
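The report does not say which rule was used to flag the outliers. As one common option, the sketch below uses the 1.5 × IQR rule in Python. The rule itself, the helper name `remove_outliers_iqr`, and the sample values are all assumptions for illustration, not the method used in this analysis.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule).

    This is an assumed rule; the analysis may have used a different one.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Made-up next-day sales totals with one extreme value.
target = [4000, 4500, 5000, 5200, 5500, 6000, 6100, 15000]

kept = remove_outliers_iqr(target)
print("kept:", kept)
print("removed:", [v for v in target if v not in kept])
```

In practice the filter would be applied to whole rows (features and target together) before retraining, so that the models never see the removed days.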

The table below shows the summary of the After_1_day_sum_sales column

Findings

  1. SVM still has the lowest MAE.
  2. Random Forest now has the lowest RMSE.
  3. Overall, RMSE dropped by more than half, from about 4,000 to less than 2,000.
  4. MAE also fell, by about a quarter, from about 2,000 to about 1,500.
  5. The standard deviation of the target column is about 1,700, so an MAE of about 1,500 is below one standard deviation and still in an acceptable range.

Findings

  1. The outliers significantly reduced the performance of the models; with the outliers removed, the models predict with smaller errors.
  2. By MAE, SVM outperforms the other algorithms in predicting the next day’s sales.