1. Outline

This is a follow-up to UMM Kaggle : EDA.

2. Review

Not regarding date

  • Highly correlated variable pairs : (visitStartTime, visitNumber), (visitStartTime, hits), (visitStartTime, pageviews), and (pageviews, hits) - We need to add interaction terms for these pairs.
  • There are few actual buyers - We need to separate zero targets from non-zero targets.
  • If isTrueDirect = TRUE, the target value tends to be larger.
  • hits, visitNumber, and pageviews are right-skewed.
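
The points above (right skew, few buyers, correlated pairs) can be illustrated with a small sketch on simulated data. Everything here is an assumption made for illustration: the `toy` data frame, the `skewness` helper, and the `isBuyer` and `pv_x_hits` columns are hypothetical names, and only the column names hits, pageviews and transactionRevenue come from the notebook.

```r
# Toy data only: column names mirror the notebook, values are simulated.
set.seed(1)
n <- 1000
toy <- data.frame(
  hits      = rexp(n, rate = 0.2),   # right-skewed by construction
  pageviews = rexp(n, rate = 0.3)
)
# Roughly 2% of sessions buy something; the rest have zero revenue.
toy$transactionRevenue <- ifelse(runif(n) < 0.02, rexp(n, rate = 1e-7), 0)

# Sample skewness (third standardized moment), base R only.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(toy$hits)          # positive for a right-skewed variable
skew_log <- skewness(log1p(toy$hits))   # log1p pulls in the long right tail

# Separate zero targets from non-zero targets.
toy$isBuyer <- toy$transactionRevenue > 0
buyers <- toy[toy$isBuyer, ]

# Explicit interaction term for one correlated pair.
toy$pv_x_hits <- toy$pageviews * toy$hits
```

Inside an R model formula, `pageviews * hits` would expand to both main effects plus their interaction, so the explicit product column is only needed when the term is built by hand.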

Regarding date

  • There are some revenueSum peaks. At those peaks,
    • isMobile tends to be FALSE
    • adwordsClickInfo.isVideoAd tends to be TRUE
    • adwordsClickInfo.adNetworkType tends to be 0
  • isTrueDirect has a strong effect

3. Doing Supervised Learning for our target, transactionRevenue

2-1. Non-Time Series

#model1 <- train(transactionRevenue~hits+pageviews+visitNumber+visitNumber*visitStartTime, data=newtrain1, preProcess="scale", method="nb")

2-2. Time Series

Let’s try to forecast the log-transformed daily mean of transactionRevenue as a time series. We divided transactionRevenue by 1E06 before applying log1p because its mean is 1.7042728 × 10^6, which is around 1E06.

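library(dplyr)     # group_by(), summarise(), %>%
library(zoo)       # zoo()
library(forecast)  # auto.arima(), forecast(), autoplot()
library(ggplot2)   # theme_minimal()

timeS <- newtrain %>%
        group_by(date) %>%
        summarise(revenueMean = log1p(mean(transactionRevenue/(1E06)))) %>%
        ungroup() %>%
        with(zoo(revenueMean, order.by=date))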

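# Forecast horizon in days; as.numeric() turns the difftime into a plain number for h
timeRange <- as.numeric(difftime(max(newtest$date), min(newtest$date), units = "days")) + 1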
target_arima <- auto.arima(timeS)
summary(target_arima)
## Series: timeS 
## ARIMA(4,0,3) with non-zero mean 
## 
## Coefficients:
##           ar1      ar2     ar3      ar4     ma1     ma2      ma3    mean
##       -0.2547  -0.4445  0.3578  -0.0995  0.6029  0.5490  -0.3260  0.8924
## s.e.   0.3520   0.2285  0.2290   0.0782  0.3492  0.3354   0.3007  0.0249
## 
## sigma^2 estimated as 0.1451:  log likelihood=-162.27
## AIC=342.54   AICc=343.04   BIC=377.66
## 
## Training set error measures:
##                         ME      RMSE       MAE  MPE MAPE     MASE
## Training set -0.0003557683 0.3766821 0.2882868 -Inf  Inf 0.790288
##                      ACF1
## Training set -0.004423992
forecast(target_arima, h=timeRange) %>%
        autoplot() +
        theme_minimal()

In this model, the forecast converges to a constant value (for a stationary ARIMA model with a non-zero mean, this is the estimated mean). Such a forecast is not informative, so we decided to add some regression terms.
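
The convergence can be reproduced on simulated data. This is a minimal sketch, not the notebook's model: it swaps in base R's `stats::arima` and `predict` for `auto.arima` and `forecast` so that it runs without extra packages, and the series `y` is simulated (the 0.9 mean and AR coefficient are arbitrary assumptions).

```r
set.seed(42)
# Simulated stationary AR(1) series around a mean of 0.9 (toy data).
y <- 0.9 + arima.sim(model = list(ar = 0.5), n = 300, sd = 0.4)

# AR(1) with a mean term; stats::arima reports the mean as "intercept".
fit <- arima(y, order = c(1, 0, 0))

# Point forecasts for the next 60 steps.
fc <- predict(fit, n.ahead = 60)$pred

est_mean  <- unname(coef(fit)["intercept"])
gap_first <- abs(fc[1] - est_mean)    # distance from the mean at step 1
gap_last  <- abs(fc[60] - est_mean)   # distance from the mean at step 60
```

Without an external regressor, the only information the model has about the future is the decaying memory of the last observations, so `gap_last` shrinks toward zero as the horizon grows.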

We built the regression terms from the daily mean values of pageviews, hits, isTrueDirect and adwordsClickInfo.isVideoAd.
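
The code that builds these regressors is not shown here. One way it might be done, as a sketch on simulated session-level data using base R's `aggregate` (the `toy` data frame and the `mean_hits_toy` / `mean_isTrueDirect_toy` names are hypothetical and are not the notebook's `mean_hits_train` objects):

```r
# Toy session-level data; names mirror the notebook, values are made up.
set.seed(7)
toy <- data.frame(
  date         = rep(as.Date("2017-01-01") + 0:9, each = 20),
  hits         = rpois(200, lambda = 5),
  isTrueDirect = runif(200) < 0.3
)

# One row per day holding the mean of each candidate regressor.
# The logical isTrueDirect is coerced to 0/1, so its mean is a daily share.
daily <- aggregate(cbind(hits, isTrueDirect) ~ date, data = toy, FUN = mean)
daily <- daily[order(daily$date), ]

# Vectors aligned with the daily series, usable as xreg.
mean_hits_toy         <- daily$hits
mean_isTrueDirect_toy <- daily$isTrueDirect
```

The regressor vector must have one value per day in the same order as the target series, otherwise the xreg argument will be misaligned with timeS.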

2-2-1. Add pageviews
target_arima_PV <- auto.arima(timeS, xreg=mean_pageviews_train)

summary(target_arima_PV)
## Series: timeS 
## Regression with ARIMA(3,1,2) errors 
## 
## Coefficients:
##          ar1      ar2      ar3      ma1     ma2    xreg
##       0.7053  -0.1867  -0.0618  -1.3535  0.4445  4.2972
## s.e.  0.2955   0.0954   0.0700   0.2922  0.2539  0.5438
## 
## sigma^2 estimated as 0.1203:  log likelihood=-129.08
## AIC=272.15   AICc=272.47   BIC=299.45
## 
## Training set error measures:
##                       ME      RMSE       MAE  MPE MAPE      MASE
## Training set 0.009628873 0.3435698 0.2641214 -Inf  Inf 0.7240426
##                      ACF1
## Training set -0.002529003
  • ARIMA(3,1,2)
  • RMSE in newtrain : 0.344
2-2-2. Add hits
target_arima_HT <- auto.arima(timeS, xreg=mean_hits_train)

summary(target_arima_HT)
## Series: timeS 
## Regression with ARIMA(3,1,2) errors 
## 
## Coefficients:
##          ar1      ar2      ar3      ma1     ma2    xreg
##       0.7101  -0.1896  -0.0587  -1.3525  0.4437  3.9559
## s.e.  0.3000   0.0972   0.0703   0.2966  0.2573  0.5028
## 
## sigma^2 estimated as 0.1201:  log likelihood=-128.76
## AIC=271.52   AICc=271.84   BIC=298.82
## 
## Training set error measures:
##                      ME      RMSE       MAE  MPE MAPE      MASE
## Training set 0.01096295 0.3432809 0.2642414 -Inf  Inf 0.7243716
##                      ACF1
## Training set -0.002888984
  • ARIMA(3,1,2)
  • RMSE in newtrain : 0.343
2-2-3. Add isTrueDirect
target_arima_ITD <- auto.arima(timeS, xreg=mean_isTrueDirect_train)

summary(target_arima_ITD)
## Series: timeS 
## Regression with ARIMA(1,0,0) errors 
## 
## Coefficients:
##          ar1    xreg
##       0.3260  2.8899
## s.e.  0.0497  0.0866
## 
## sigma^2 estimated as 0.1268:  log likelihood=-140.52
## AIC=287.03   AICc=287.1   BIC=298.74
## 
## Training set error measures:
##                       ME      RMSE       MAE  MPE MAPE      MASE
## Training set 0.001884742 0.3551658 0.2695036 -Inf  Inf 0.7387969
##                     ACF1
## Training set 0.002840936
  • ARIMA(1,0,0)
  • RMSE in newtrain : 0.355
2-2-4. Add adwordsClickInfo.isVideoAd
target_arima_VDO <- auto.arima(timeS, xreg=mean_Video_train)

summary(target_arima_VDO)
## Series: timeS 
## Regression with ARIMA(4,1,4) errors 
## 
## Coefficients:
##           ar1      ar2     ar3      ar4      ma1      ma2      ma3     ma4
##       -0.1930  -0.4152  0.3746  -0.1505  -0.4497  -0.0277  -0.8383  0.4240
## s.e.   0.2162   0.1562  0.1521   0.0906   0.2089   0.0835   0.0858  0.1848
##         xreg
##       4.3452
## s.e.  2.3091
## 
## sigma^2 estimated as 0.1441:  log likelihood=-161
## AIC=341.99   AICc=342.61   BIC=380.99
## 
## Training set error measures:
##                       ME      RMSE       MAE  MPE MAPE      MASE
## Training set 0.004220124 0.3743871 0.2850392 -Inf  Inf 0.7813851
##                      ACF1
## Training set -0.004242756
  • ARIMA(4,1,4)
  • RMSE in newtrain : 0.374

Among the four ARIMA models with a regression term above, the model with hits has the lowest RMSE on the newtrain set (0.343), narrowly ahead of the model with pageviews (0.344).
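
The comparison can be tabulated directly from the reported numbers. In this sketch the RMSE and AIC values are copied from the four summary() outputs; the `fits` data frame and the `best` variable are names introduced only for illustration.

```r
# RMSE and AIC values copied from the four summary() outputs.
fits <- data.frame(
  regressor = c("pageviews", "hits", "isTrueDirect",
                "adwordsClickInfo.isVideoAd"),
  errors    = c("ARIMA(3,1,2)", "ARIMA(3,1,2)", "ARIMA(1,0,0)", "ARIMA(4,1,4)"),
  rmse      = c(0.3435698, 0.3432809, 0.3551658, 0.3743871),
  aic       = c(272.15, 271.52, 287.03, 341.99),
  stringsAsFactors = FALSE
)

# Sort by training RMSE, smallest first.
fits <- fits[order(fits$rmse), ]
best <- fits$regressor[1]

print(fits, row.names = FALSE)
```

The gap between hits and pageviews is only about 0.0003 in training RMSE, so the ranking of those two is not decisive. The AIC column is included as reported, but AIC values are not directly comparable between models with different orders of differencing (d = 0 versus d = 1).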

We can see that the forecasts from the hits model and the pageviews model have similar shapes. However, their shapes do not give clear evidence of stationarity. The last one, the adwordsClickInfo.isVideoAd model, performs poorly. The shape of the isTrueDirect model seems quite reasonable, except for one peak.

2-2-5. Add the combination of hits and isTrueDirect

So, we used a combined value of hits and isTrueDirect: their average.

target_arima_INTERACT <- auto.arima(timeS, 
                                    xreg=(mean_hits_train+mean_isTrueDirect_train)/2)

summary(target_arima_INTERACT)
## Series: timeS 
## Regression with ARIMA(1,0,2) errors 
## 
## Coefficients:
##          ar1      ma1      ma2  intercept    xreg
##       0.9328  -0.5821  -0.1668    -2.9901  4.9080
## s.e.  0.0355   0.0655   0.0570     0.4622  0.5757
## 
## sigma^2 estimated as 0.1147:  log likelihood=-120.73
## AIC=253.46   AICc=253.69   BIC=276.87
## 
## Training set error measures:
##                       ME      RMSE       MAE  MPE MAPE      MASE
## Training set 0.005025823 0.3363246 0.2545888 -Inf  Inf 0.6979108
##                     ACF1
## Training set 0.005533488

This gives an RMSE of 0.336 on the newtrain set, which is better than the models with either hits (0.343) or isTrueDirect (0.355) alone.

TSINTERACT <- forecast(target_arima_INTERACT, h=timeRange, 
                       xreg=(mean_hits_test+mean_isTrueDirect_test)/2) %>%
        autoplot()

TSINTERACT

The forecast also appears stationary, which is consistent with auto.arima selecting d = 0 for this model (ARIMA(1,0,2) errors).
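
Forecasting with a regression term needs the future values of the regressor as well as the fitted model. The sketch below shows that mechanic on simulated data. It is an illustration under stated assumptions, not the notebook's pipeline: base R's `stats::arima` and `predict(..., newxreg = )` are swapped in for `auto.arima` and `forecast(..., xreg = )`, the AR(1) error order is fixed by hand, and every `_toy` name and coefficient is made up.

```r
set.seed(123)
n_train <- 200
n_test  <- 30
n <- n_train + n_test

# Simulated daily regressors (toy stand-ins for the daily means).
mean_hits_toy         <- 4 + rnorm(n, sd = 0.5)
mean_isTrueDirect_toy <- 0.3 + rnorm(n, sd = 0.05)

# Combined regressor: the average of the two, as in the model above.
x <- (mean_hits_toy + mean_isTrueDirect_toy) / 2

# Target = linear effect of x plus AR(1) errors.
y <- as.numeric(-3 + 1.8 * x + arima.sim(model = list(ar = 0.6), n = n, sd = 0.3))

y_train    <- y[1:n_train]
y_test     <- y[(n_train + 1):n]
xreg_train <- x[1:n_train]
xreg_test  <- x[(n_train + 1):n]

# Regression with AR(1) errors, fitted on the training period only.
fit <- arima(y_train, order = c(1, 0, 0), xreg = xreg_train)

# Forecast the test period; the future regressor values drive the shape.
fc <- predict(fit, n.ahead = n_test, newxreg = xreg_test)

rmse_test <- sqrt(mean((y_test - fc$pred)^2))
```

Measuring the error on a held-out period, as `rmse_test` does here, is a stricter check than the training-set RMSE reported by summary().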