This post follows on from UMM Kaggle: Cleaning Data.
We first fitted a simple linear regression with lm() (without interaction terms) and then ran step() on it.
Result of Simple Linear Regression
From this, we can see that date, visitNumber, visitStartTime, hits, pageviews, operatingSystem, isTrueDirect and newVisits are the most significant. We also expected browser to play an important role in this dataset. These variables will be used in the prediction step.
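As a rough illustration of the lm() plus step() workflow, here is a minimal sketch on simulated data. The data frame `toy`, its values, and the `noise` column are assumptions made up for this example; only the column names echo the variables above, and this is not the exact call we ran on the Kaggle data.

```r
# Hypothetical toy data: column names mimic the post's variables,
# but the values are simulated and are NOT the Kaggle data.
set.seed(1)
n <- 500
toy <- data.frame(
  hits      = rpois(n, 5),
  pageviews = rpois(n, 4),
  newVisits = rbinom(n, 1, 0.7),
  noise     = rnorm(n)  # unrelated column, included for contrast
)
toy$revenue <- 3 * toy$hits + 2 * toy$pageviews - 4 * toy$newVisits + rnorm(n)

# Full main-effects model (no interaction terms).
full_fit <- lm(revenue ~ ., data = toy)

# Stepwise selection by AIC, starting from the full model.
step_fit <- step(full_fit, direction = "both", trace = 0)

# Terms kept by step() and their coefficient table.
kept_terms <- attr(terms(step_fit), "term.labels")
print(kept_terms)
print(summary(step_fit)$coefficients)
```

The coefficient table is where the significance of each variable can be read off. Whether `noise` survives the selection depends on the simulated draw, so it should not be relied on.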
Before modeling, we must check whether the variables we’re about to use are associated with one another.
print(chi_matrix) ; print(crammerV_matrix)
## h pv nV vN oS b vST
## h 0 0 0 0 0 0 0.1736
## pv 0 0 0 0 0 0 1.0000
## nV 0 0 0 0 0 0 0.1323
## vN 0 0 0 0 0 0 1.0000
## oS 0 0 0 0 0 0 1.0000
## b 0 0 0 0 0 0 1.0000
## vST 0 0 0 0 0 0 0.0000
## h pv nV vN oS b vST
## h 0 8.447 0.14 0.974 0.174 0.146 16.372
## pv 0 0.000 0.14 1.094 0.171 0.144 14.457
## nV 0 0.000 0.00 1.000 0.139 0.149 0.992
## vN 0 0.000 0.00 0.000 0.169 0.167 19.361
## oS 0 0.000 0.00 0.000 0.000 2.288 4.302
## b 0 0.000 0.00 0.000 0.000 0.000 7.100
## vST 0 0.000 0.00 0.000 0.000 0.000 0.000
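The first matrix, chi_matrix, holds the p-values from the chi-square test for each pair of variables; the second, crammerV_matrix, holds the Cramér’s V value for each pair.
If the p-value is smaller than 0.05, we conclude that the two variables are associated. As we can see, every pair that does not involve visitStartTime has a p-value of 0 at the precision shown, while the pairs involving visitStartTime (the last column) have p-values between 0.13 and 1.
The smaller Cramér’s V, the weaker the association. By this measure, visitStartTime and visitNumber are the most strongly associated pair (19.361), followed by visitStartTime and hits (16.372), and so on. Cramér’s V normally lies between 0 and 1, so the values above 1 here suggest the statistic was not normalized in the standard way and should be read only as a relative ranking. We need to add interaction terms for these strongly associated pairs to our model.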
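For reference, the textbook form of Cramér’s V can be sketched in base R as follows. The helper name `cramers_v` and the toy vectors are assumptions for illustration; this is not the code that produced crammerV_matrix.

```r
# Cramér's V for two categorical vectors, from the chi-square statistic:
#   V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
# cramers_v is a hypothetical helper name, not a function from the post.
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  # Small toy tables trigger an "approximation may be incorrect" warning.
  test <- suppressWarnings(chisq.test(tab, correct = FALSE))
  k    <- min(nrow(tab), ncol(tab)) - 1
  v    <- sqrt(unname(test$statistic) / (sum(tab) * k))
  list(p_value = test$p.value, V = v)
}

# Made-up example 1: browser is fully determined by operating system.
os      <- c("Windows", "Windows", "Macintosh", "Macintosh", "Android", "Android")
browser <- c("Chrome", "Chrome", "Safari", "Safari", "Chrome", "Chrome")
dependent <- cramers_v(os, browser)

# Made-up example 2: the two variables are perfectly balanced against each other.
independent <- cramers_v(c("a", "a", "b", "b"), c("u", "v", "u", "v"))

print(dependent$V)
print(independent$V)
```

With this normalization the statistic stays between 0 (no association) and 1 (one variable fully determines the other), which makes values comparable across pairs with different numbers of levels.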
Our target is widely scattered, and no standard distribution seems to fit it. We also plotted the natural log of the target, as described on the data page.
nrow(newtrain);sum(newtrain$transactionRevenue==0)
## [1] 903653
## [1] 892138
We discovered that transactionRevenue contains many zeros, which form the flat band at 0 at the bottom of the plot. The non-zero values are scattered around 17.
As we can see, most of our target, transactionRevenue, is 0 (892,138 of 903,653 rows, about 98.7%). This means there are few actual buyers in this training dataset. That’s why we split our project into two parts: zero targets and non-zero targets.
This plot suggests we should treat a row differently depending on whether its transactionRevenue is 0 or not.
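The share of zeros follows directly from the two counts printed above, and the zero versus non-zero split can be sketched like this. The data frame `toy` and the columns `hasRevenue` and `logRevenue` are hypothetical names used only for this illustration.

```r
# Counts reported above: total rows and rows with transactionRevenue == 0.
n_total <- 903653
n_zero  <- 892138

n_buyers   <- n_total - n_zero   # rows with non-zero revenue
zero_share <- n_zero / n_total   # fraction of rows with zero revenue
print(round(100 * zero_share, 2))

# Hypothetical toy data to show the split into the two parts.
toy <- data.frame(transactionRevenue = c(0, 0, 0, 25990000, 0, 4085500000))
toy$hasRevenue <- toy$transactionRevenue > 0

zero_part    <- toy[!toy$hasRevenue, , drop = FALSE]
nonzero_part <- toy[toy$hasRevenue, , drop = FALSE]

# The natural log is only defined for the non-zero part.
nonzero_part$logRevenue <- log(nonzero_part$transactionRevenue)
print(nonzero_part)
```

One way to use such a split is to model `hasRevenue` first and the log revenue of the non-zero rows second, although other designs are possible.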
transactionRevenue per others
## Selecting by revenueSum
By browser, Chrome clearly dominates.
## Selecting by revenueSum
There are far fewer Android and iOS visitors than we expected. Most of the revenue comes from Macintosh and Windows users. Chrome OS users have the largest revenueMean value.
## Selecting by revenueSum
The United States overwhelmingly dominates both the number of buyers and the sum of revenue.
## Selecting by revenueSum
The probability that visitors arriving through google will buy something seems small, because its revenueMean value is small. On the other hand, visitors arriving through mail.google.com may have a fairly large probability of buying.
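The per-group summaries in this section (number of visits, revenueSum, revenueMean, then the top groups by revenueSum) can be sketched in base R as follows. The toy data frame is made up, and the base R calls are a stand-in: the `Selecting by revenueSum` messages look like what dplyr’s top_n() prints when no weighting column is given, so the original summaries were probably built with dplyr.

```r
# Hypothetical toy data; values are made up for illustration.
toy <- data.frame(
  browser            = c("Chrome", "Chrome", "Chrome", "Safari", "Safari", "Firefox"),
  transactionRevenue = c(0, 30, 10, 0, 20, 0),
  stringsAsFactors   = FALSE
)

# Per-group count, sum and mean of the target.
number      <- tapply(toy$transactionRevenue, toy$browser, length)
revenueSum  <- tapply(toy$transactionRevenue, toy$browser, sum)
revenueMean <- tapply(toy$transactionRevenue, toy$browser, mean)

summary_tbl <- data.frame(
  browser          = names(revenueSum),
  number           = as.vector(number),
  revenueSum       = as.vector(revenueSum),
  revenueMean      = as.vector(revenueMean),
  stringsAsFactors = FALSE
)

# Keep the top groups by revenueSum (here the top 2).
summary_tbl <- summary_tbl[order(-summary_tbl$revenueSum), ]
top2 <- head(summary_tbl, 2)
print(top2)
```

Looking at revenueSum and revenueMean side by side matters because a group can have a large total simply from having many visitors while its mean stays small.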
length(levels(as.factor(newtrain$fullVisitorId))); length(newtrain$fullVisitorId)
## [1] 714167
## [1] 903653
As we can see, there are fewer distinct fullVisitorId values (714,167) than rows (903,653), so some visitors appear in more than one row and have more than one transactionRevenue value. Therefore, fullVisitorId does not uniquely identify a transactionRevenue value.
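A small sketch of the same check on made-up data, which also lists the repeated IDs and sums revenue per visitor. The IDs and values here are hypothetical.

```r
# Hypothetical toy data: visitor "A" appears three times, "B" twice.
toy <- data.frame(
  fullVisitorId      = c("A", "B", "A", "C", "B", "A"),
  transactionRevenue = c(0, 0, 10, 0, 5, 0),
  stringsAsFactors   = FALSE
)

n_rows     <- nrow(toy)
n_visitors <- length(unique(toy$fullVisitorId))

# IDs that occur in more than one row.
visits_per_id <- table(toy$fullVisitorId)
repeat_ids    <- names(visits_per_id[visits_per_id > 1])

# One row per visitor: total revenue over all of that visitor's rows.
per_visitor <- aggregate(transactionRevenue ~ fullVisitorId, data = toy, FUN = sum)

print(c(rows = n_rows, visitors = n_visitors))
print(repeat_ids)
print(per_visitor)
```

Aggregating to one row per visitor is one way to get a target that fullVisitorId does identify uniquely.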
Revenue is clearly higher when the visitor arrives directly.
New visitors, for whom the value is 1, tend to buy less.
Visitors who have visited more often yield a larger mean revenue. However, most visitors have visited only a few times, and most of the revenue is concentrated among them. In other words, the distribution is right-skewed with respect to our target.
The vertical lines mark 0, 25, 50, 100 and 150. We can see that visitors who visit fewer than 25 times have a larger revenueMean than the others, and that the largest value belongs to new visitors. Visitors who visit fewer than 100 times account for most of revenueSum. This is also right-skewed.
The lower the pageviews value, the more visitors there are. The distribution of revenueSum per pageviews also appears to be right-skewed.
So, in the modeling stage, we need to preprocess these right-skewed variables: hits, visitNumber and pageviews. A log transformation is one example.
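A minimal sketch of such a transformation, assuming the three columns are non-negative counts. The toy values are made up, and log1p (the log of 1 + x) is our choice here so that a value of 0 would still be defined.

```r
# Hypothetical toy data with one extreme row, mimicking a right-skewed count.
toy <- data.frame(
  hits        = c(1, 2, 3, 5, 500),
  visitNumber = c(1, 1, 2, 3, 395),
  pageviews   = c(1, 1, 2, 4, 469)
)

skewed <- c("hits", "visitNumber", "pageviews")

# Apply log(1 + x) to each skewed column, leaving the original untouched.
toy_log <- toy
toy_log[skewed] <- lapply(toy[skewed], log1p)

print(round(toy_log, 3))

# expm1() inverts log1p() if the original scale is needed again.
restored <- as.data.frame(lapply(toy_log[skewed], expm1))
```

After the transformation the extreme row is only a few units away from the rest instead of several hundred, which compresses the long right tail.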
date
We already know that this data is time-related. Furthermore, the regression result above suggests that the variable date plays an important role. That’s why we need to look at the data with respect to date, in other words as a time series.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The number of visitors increased between Oct 2016 and Jan 2017. There are also obvious peaks in the revenueSum and revenueMean plots. We’ll look at those peaks.
The vertical dashed line marks 2017-01-01. The dates worth investigating are 2016-08-25, 2016-09-16, 2017-02-14 and 2017-04-05.
## # A tibble: 4 x 4
## date number visitNumberMean revenueMax
## <fct> <int> <dbl> <dbl>
## 1 2016-08-25 2539 2.81 4085500000
## 2 2016-09-16 2603 2.80 16023750000
## 3 2017-02-14 2379 2.41 17855500000
## 4 2017-04-05 2619 2.36 23129500000
The number values are close to the mean number of visits per day, 2468.9972678. So the peaks are not caused by more visitors; rather, there were some big spenders on those days, as the revenueMax column above shows.
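A per-date summary with the same columns as the table above (number, visitNumberMean, revenueMax) can be sketched in base R as follows. The rows in `toy` are made up, apart from reusing two of the revenueMax figures as sample values, and the tibble output suggests the original was computed with dplyr rather than with this code.

```r
# Hypothetical toy data: a few visits on two of the peak dates.
toy <- data.frame(
  date               = as.Date(c("2016-08-25", "2016-08-25",
                                 "2016-09-16", "2016-09-16", "2016-09-16")),
  visitNumber        = c(1, 3, 2, 1, 6),
  transactionRevenue = c(0, 4085500000, 0, 0, 16023750000)
)

# One summary row per date.
daily <- do.call(rbind, lapply(split(toy, toy$date), function(d) {
  data.frame(
    date            = d$date[1],
    number          = nrow(d),                     # visits on that date
    visitNumberMean = mean(d$visitNumber),
    revenueMax      = max(d$transactionRevenue)    # largest single transaction
  )
}))
rownames(daily) <- NULL

# Compare each day's visit count with the overall daily mean.
daily$aboveMeanVisits <- daily$number > mean(daily$number)
print(daily)
```

Comparing `number` against the daily mean while looking at `revenueMax` separates days with unusually many visitors from days with unusually large single purchases.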
## # A tibble: 8 x 3
## # Groups: date [4]
## date isMobile number
## <date> <lgl> <int>
## 1 2016-08-25 FALSE 2050
## 2 2016-08-25 TRUE 489
## 3 2016-09-16 FALSE 2028
## 4 2016-09-16 TRUE 575
## 5 2017-02-14 FALSE 1797
## 6 2017-02-14 TRUE 582
## 7 2017-04-05 FALSE 1836
## 8 2017-04-05 TRUE 783
We can see that there were more non-mobile visitors than mobile visitors on those days.
## # A tibble: 17 x 3
## # Groups: date [4]
## date campaign number
## <date> <chr> <int>
## 1 2016-08-25 0 2460
## 2 2016-08-25 AW - Accessories 2
## 3 2016-08-25 AW - Dynamic Search Ads Whole Site 35
## 4 2016-08-25 AW - Electronics 4
## 5 2016-08-25 Data Share Promo 38
## 6 2016-09-16 0 2528
## 7 2016-09-16 AW - Dynamic Search Ads Whole Site 41
## 8 2016-09-16 Data Share Promo 34
## 9 2017-02-14 0 2289
## 10 2017-02-14 AW - Accessories 18
## 11 2017-02-14 AW - Dynamic Search Ads Whole Site 21
## 12 2017-02-14 Data Share Promo 51
## 13 2017-04-05 0 2511
## 14 2017-04-05 AW - Accessories 28
## 15 2017-04-05 AW - Dynamic Search Ads Whole Site 5
## 16 2017-04-05 Data Share Promo 74
## 17 2017-04-05 test-liyuhz 1
Very few visits on those days came through a campaign: on each date, more than 2,200 visits have a campaign value of 0.
## # A tibble: 8 x 3
## # Groups: date [4]
## date adwordsClickInfo.isVideoAd number
## <date> <lgl> <int>
## 1 2016-08-25 FALSE 41
## 2 2016-08-25 TRUE 2498
## 3 2016-09-16 FALSE 41
## 4 2016-09-16 TRUE 2562
## 5 2017-02-14 FALSE 39
## 6 2017-02-14 TRUE 2340
## 7 2017-04-05 FALSE 33
## 8 2017-04-05 TRUE 2586
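According to adwordsClickInfo.isVideoAd, visitors who arrived by clicking a video advertisement were dominant on those days. Note, however, that these TRUE counts are identical to the adNetworkType 0 counts in the next table, so the flag may reflect how missing values were filled in rather than real video-ad clicks.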
## # A tibble: 8 x 3
## # Groups: date [4]
## date adwordsClickInfo.adNetworkType number
## <date> <chr> <int>
## 1 2016-08-25 0 2498
## 2 2016-08-25 Google Search 41
## 3 2016-09-16 0 2562
## 4 2016-09-16 Google Search 41
## 5 2017-02-14 0 2340
## 6 2017-02-14 Google Search 39
## 7 2017-04-05 0 2586
## 8 2017-04-05 Google Search 33
Very few visitors arrived through Google Search on those days.
Let’s summarise the results.
isTrueDirect
In section 2, the variable isTrueDirect played some role in revealing the characteristics of our training data. With respect to date, it also gives a meaningful result.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'