UMM Kaggle : EDA

1. Outline

This report follows on from UMM Kaggle : Cleaning Data.

2. EDA : Not regarding `date`

2-1. Feature Selection

We fitted a linear regression with lm() (main effects only, without interaction terms) and then ran stepwise variable selection with step().

Result of Simple Linear Regression

From this, we can see that date, visitNumber, visitStartTime, hits, pageViews, operatingSystem, isTrueDirect and newVisits are the most significant. We also expected browser to play an important role, so we include it in this set. These variables will be used in the prediction step.
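As a rough illustration of this selection step, the sketch below fits a main-effects lm() on a small synthetic data frame and lets step() drop uninformative terms. The data frame `toy`, its columns, and the coefficients used to simulate `revenue` are invented for the example; they are not the competition data or the model fitted above.

```r
# Synthetic stand-in for the training data (assumption: not the real newtrain).
set.seed(1)
toy <- data.frame(
  hits      = rpois(200, lambda = 5),
  pageviews = rpois(200, lambda = 4),
  noise     = rnorm(200)              # unrelated to the response by construction
)
toy$revenue <- 2 * toy$hits + 0.5 * toy$pageviews + rnorm(200)

# Main-effects model only, no interaction terms.
full_model <- lm(revenue ~ hits + pageviews + noise, data = toy)

# Stepwise selection by AIC; trace = 0 suppresses the step-by-step log.
selected_model <- step(full_model, direction = "both", trace = 0)

# Coefficient table of the model that step() settled on.
print(summary(selected_model)$coefficients)
```

step() compares models by AIC, so a term can survive selection without being significant at the usual 0.05 level; the coefficient table is still worth reading afterwards.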

2-2. Correlation between Variables

Before modeling, we need to check whether the variables we’re about to use are correlated with each other.

print(chi_matrix) ; print(crammerV_matrix)

##     h pv nV vN oS b    vST
## h   0  0  0  0  0 0 0.1736
## pv  0  0  0  0  0 0 1.0000
## nV  0  0  0  0  0 0 0.1323
## vN  0  0  0  0  0 0 1.0000
## oS  0  0  0  0  0 0 1.0000
## b   0  0  0  0  0 0 1.0000
## vST 0  0  0  0  0 0 0.0000

##     h    pv   nV    vN    oS     b    vST
## h   0 8.447 0.14 0.974 0.174 0.146 16.372
## pv  0 0.000 0.14 1.094 0.171 0.144 14.457
## nV  0 0.000 0.00 1.000 0.139 0.149  0.992
## vN  0 0.000 0.00 0.000 0.169 0.167 19.361
## oS  0 0.000 0.00 0.000 0.000 2.288  4.302
## b   0 0.000 0.00 0.000 0.000 0.000  7.100
## vST 0 0.000 0.00 0.000 0.000 0.000  0.000

The first matrix, chi_matrix, holds the p-values from the chi-square test for each pair of variables. The second, crammerV_matrix, holds Cramér’s V for each pair of variables.

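If the p-value is smaller than 0.05, we reject independence, which means the two variables are related to each other. As we can see, the p-values are significant for every pair except those involving visitStartTime (vST), whose p-values range from 0.1323 to 1.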

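The smaller Cramér’s V, the weaker the association. By this measure, visitStartTime and visitNumber are the most strongly associated pair (19.361), followed by visitStartTime and hits (16.372), and so on. Note that Cramér’s V normally lies between 0 and 1, so the values above 1 here suggest the statistic was not normalised in the usual way and should be read with caution. We need to add interaction terms for these highly associated pairs to our model.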
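For reference, Cramér’s V is usually computed from the chi-square statistic as sqrt(chi2 / (n * (min(r, c) - 1))), where n is the number of observations and r and c are the numbers of rows and columns in the contingency table. The helper below is a minimal sketch of that standard formula; `cramers_v` is a hypothetical name, and this is not the code that produced crammerV_matrix.

```r
# Cramér's V for two categorical vectors (standard formula, no bias correction).
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  # correct = FALSE turns off Yates' continuity correction for 2x2 tables.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

# Perfect association: every level of x maps to exactly one level of y.
x_same <- rep(c("a", "b", "c"), times = 20)
v_perfect <- cramers_v(x_same, x_same)

# No association: every combination of levels occurs equally often.
x_ind <- rep(c("a", "b"), each = 20)
y_ind <- rep(c("u", "v"), times = 20)
v_none <- cramers_v(x_ind, y_ind)

print(c(perfect = v_perfect, none = v_none))
```

Continuous variables such as visitStartTime would need to be binned before a contingency table is meaningful, otherwise almost every cell holds a single observation.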

2-3. See our target’s features

Our target is widely scattered, and no standard distribution seems to fit it. We also plotted the natural log of the target, as described on the competition’s data page.

nrow(newtrain);sum(newtrain$transactionRevenue==0)

## [1] 903653

## [1] 892138

We discovered that there are many zero values in transactionRevenue, which form a flat floor at 0 in the plot. The non-zero values, on the log scale, are scattered around 17.

As we can see, most of our target, transactionRevenue, has the value 0: 892,138 of 903,653 rows, or about 98.7%. This means there are few actual buyers in this training dataset. That’s why we split our project into two parts: zero targets and non-zero targets.

This plot suggests that we should treat the data differently depending on whether transactionRevenue is 0 or not.
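The split can be sketched as follows. The `revenue` vector is invented for illustration; only the idea of separating zero from non-zero targets and viewing the non-zero part on the natural log scale comes from the text. log1p() is used here as an assumption so that a zero would map to zero instead of -Inf.

```r
# Made-up revenue values; most visits generate no revenue.
revenue <- c(0, 0, 0, 0, 25000000, 0, 0, 16990000, 0, 0)

# Share of visits with zero revenue.
zero_share <- mean(revenue == 0)

# Flag the buyers and look at their revenue on the log scale.
is_buyer    <- revenue > 0
log_revenue <- log1p(revenue[is_buyer])

print(zero_share)
print(round(log_revenue, 2))
```

One way to use such a split is a two-stage model: first predict whether a visit buys at all, then predict the log revenue only for predicted buyers.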

2-4. Statistics of `transactionRevenue` per others


By browser, Chrome clearly dominates.


The numbers of Android and iOS users are much smaller than we expected. Most of the revenue comes from Macintosh and Windows users. Chrome OS users have the largest revenueMean value.


The United States overwhelmingly leads in both the number of buyers and the sum of revenue.


The probability that visitors arriving through google will buy something seems small, because its revenueMean value is small. On the other hand, visitors arriving through mail.google.com may be considerably more likely to buy.
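Per-group statistics of this kind can be sketched in base R as below. The `visits` data frame and its values are hypothetical; the column names number, revenueSum and revenueMean mirror the ones used in this report, but this is not the code that produced the original tables.

```r
# Hypothetical visits; one row per visit.
visits <- data.frame(
  browser            = c("Chrome", "Chrome", "Safari", "Firefox", "Chrome", "Safari"),
  transactionRevenue = c(0, 30, 0, 10, 50, 0)
)

# tapply() returns one value per group, in the same (sorted) group order each time.
number      <- tapply(visits$transactionRevenue, visits$browser, length)
revenueSum  <- tapply(visits$transactionRevenue, visits$browser, sum)
revenueMean <- tapply(visits$transactionRevenue, visits$browser, mean)

stats <- data.frame(
  browser     = names(revenueSum),
  number      = as.vector(number),
  revenueSum  = as.vector(revenueSum),
  revenueMean = as.vector(revenueMean)
)

# Largest total revenue first.
stats <- stats[order(-stats$revenueSum), ]
print(stats)
```

Reading revenueSum and revenueMean side by side matters: a group can have a large sum just because it has many visitors, while its mean per visit stays small.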

length(levels(as.factor(newtrain$fullVisitorId))); length(newtrain$fullVisitorId)

## [1] 714167

## [1] 903653

As we can see, there are fewer distinct fullVisitorId values (714,167) than rows (903,653), so some visitors appear in more than one row, each with its own transactionRevenue value. Therefore, fullVisitorId is not a unique key for transactionRevenue.

Revenue is clearly higher when the visitor arrives directly.

New visitors, for whom newVisits is 1, tend not to buy much.

Visitors who have visited more often yield a larger mean revenue. However, most visitors have visited only a few times, and most of the revenue is concentrated among them. In other words, the distribution is right-skewed with respect to our target.

The vertical lines indicate 0, 25, 50, 100 and 150. We can see that visitors with fewer than 25 visits have a larger revenueMean than the others, and the largest value belongs to new visitors. Visitors with fewer than 100 visits account for most of the revenueSum. This is also right-skewed.

The lower the pageviews value, the more visitors there are. The distribution of revenueSum per pageviews also appears to be right-skewed.

So, in the modeling stage, we need to transform these right-skewed variables (hits, visitNumber and pageviews), for example with a log transformation.
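The effect of a log transformation on a right-skewed variable can be checked with a quick sketch like the one below. The simulated `hits` vector and the `skewness` helper are assumptions made for the example; the real hits, visitNumber and pageviews columns would take the place of the simulated data.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated right-skewed variable (assumption: log-normal, not the real data).
set.seed(42)
hits <- rlnorm(1000, meanlog = 1, sdlog = 1)

skew_raw <- skewness(hits)
skew_log <- skewness(log1p(hits))   # log1p(x) = log(1 + x), safe at zero

print(c(raw = skew_raw, log1p = skew_log))
```

A skewness close to 0 after the transformation indicates a roughly symmetric variable, which is usually easier for a linear model to handle.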

3. EDA : Regarding `date`

3-1. Investigating Some Peaks

We already know that this data is time-related. Furthermore, the regression result above suggests that the variable date plays an important role. That’s why we need to look at the data with respect to date, in other words, as a time series.


The number of visitors increased between October 2016 and January 2017. There are also obvious peaks in the revenueSum and revenueMean plots. We’ll look at those peaks.

The vertical dashed line marks 2017-01-01. The dates that look worth investigating are 2016-08-25, 2016-09-16, 2017-02-14 and 2017-04-05.

## # A tibble: 4 x 4
##   date       number visitNumberMean  revenueMax
##   <fct>       <int>           <dbl>       <dbl>
## 1 2016-08-25   2539            2.81  4085500000
## 2 2016-09-16   2603            2.80 16023750000
## 3 2017-02-14   2379            2.41 17855500000
## 4 2017-04-05   2619            2.36 23129500000

The number values are close to the mean daily visit count, 2468.9972678. So the peaks are not explained by unusual traffic; rather, there were some big spenders on those days, as the revenueMax column above shows.
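A per-date summary like the table above could be built along the following lines. The `visits` data frame, its values and the two chosen dates are placeholders for illustration; the column names number, visitNumberMean and revenueMax follow the table, but this is a sketch and not the original code.

```r
# Hypothetical visits; one row per visit.
visits <- data.frame(
  date = as.Date(c("2016-08-25", "2016-08-25", "2016-09-16",
                   "2016-09-16", "2016-09-16", "2016-10-01")),
  visitNumber        = c(1, 3, 2, 2, 5, 1),
  transactionRevenue = c(0, 4.0e9, 0, 1.6e10, 2.0e7, 0)
)

# Keep only the days under investigation.
peak_dates <- as.Date(c("2016-08-25", "2016-09-16"))
peaks <- visits[visits$date %in% peak_dates, ]

# Group by the date written as text, so group names are plain ISO strings.
day <- format(peaks$date)
number          <- tapply(peaks$visitNumber, day, length)
visitNumberMean <- tapply(peaks$visitNumber, day, mean)
revenueMax      <- tapply(peaks$transactionRevenue, day, max)

peak_summary <- data.frame(
  date            = names(number),
  number          = as.vector(number),
  visitNumberMean = as.vector(visitNumberMean),
  revenueMax      = as.vector(revenueMax)
)
print(peak_summary)
```

Comparing number against the overall daily mean, and revenueMax against typical purchases, helps separate traffic spikes from single large orders.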

## # A tibble: 8 x 3
## # Groups:   date [4]
##   date       isMobile number
##   <date>     <lgl>     <int>
## 1 2016-08-25 FALSE      2050
## 2 2016-08-25 TRUE        489
## 3 2016-09-16 FALSE      2028
## 4 2016-09-16 TRUE        575
## 5 2017-02-14 FALSE      1797
## 6 2017-02-14 TRUE        582
## 7 2017-04-05 FALSE      1836
## 8 2017-04-05 TRUE        783

We can see that there were more non-mobile visitors (isMobile = FALSE) than mobile visitors on those days.

## # A tibble: 17 x 3
## # Groups:   date [4]
##    date       campaign                           number
##    <date>     <chr>                               <int>
##  1 2016-08-25 0                                    2460
##  2 2016-08-25 AW - Accessories                        2
##  3 2016-08-25 AW - Dynamic Search Ads Whole Site     35
##  4 2016-08-25 AW - Electronics                        4
##  5 2016-08-25 Data Share Promo                       38
##  6 2016-09-16 0                                    2528
##  7 2016-09-16 AW - Dynamic Search Ads Whole Site     41
##  8 2016-09-16 Data Share Promo                       34
##  9 2017-02-14 0                                    2289
## 10 2017-02-14 AW - Accessories                       18
## 11 2017-02-14 AW - Dynamic Search Ads Whole Site     21
## 12 2017-02-14 Data Share Promo                       51
## 13 2017-04-05 0                                    2511
## 14 2017-04-05 AW - Accessories                       28
## 15 2017-04-05 AW - Dynamic Search Ads Whole Site      5
## 16 2017-04-05 Data Share Promo                       74
## 17 2017-04-05 test-liyuhz                             1

Very few visits on those days came through a campaign; the vast majority have the campaign value 0.

## # A tibble: 8 x 3
## # Groups:   date [4]
##   date       adwordsClickInfo.isVideoAd number
##   <date>     <lgl>                       <int>
## 1 2016-08-25 FALSE                          41
## 2 2016-08-25 TRUE                         2498
## 3 2016-09-16 FALSE                          41
## 4 2016-09-16 TRUE                         2562
## 5 2017-02-14 FALSE                          39
## 6 2017-02-14 TRUE                         2340
## 7 2017-04-05 FALSE                          33
## 8 2017-04-05 TRUE                         2586

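Visitors flagged with adwordsClickInfo.isVideoAd = TRUE were dominant. Note, however, that the TRUE counts match the adNetworkType = 0 counts in the following table exactly, so TRUE may mark rows without ad-click information rather than actual video-ad clicks.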

## # A tibble: 8 x 3
## # Groups:   date [4]
##   date       adwordsClickInfo.adNetworkType number
##   <date>     <chr>                           <int>
## 1 2016-08-25 0                                2498
## 2 2016-08-25 Google Search                      41
## 3 2016-09-16 0                                2562
## 4 2016-09-16 Google Search                      41
## 5 2017-02-14 0                                2340
## 6 2017-02-14 Google Search                      39
## 7 2017-04-05 0                                2586
## 8 2017-04-05 Google Search                      33

Very few visitors came via Google Search.

Let’s summarise the results for these peak days.

More non-mobile than mobile visitors
More non-campaign than campaign visits
Most visitors are flagged as video-ad clicks
Most visitors did not come via Google Search

3-2. `isTrueDirect`

In section 2, the variable isTrueDirect helped to characterise our training data. Viewed with respect to date, it also gives a meaningful result.



Baescott

2018 12 7
