The dataset we are analyzing is a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset. The raw dataset contains 12 columns used to predict the transaction revenue per customer. The outcome of this analysis should help allocate marketing budgets more effectively. There are two versions of the dataset, small and large. The small dataset contains about 1 million transactions and the large one is roughly 10 times bigger. Due to hardware constraints, the small dataset is used in this report.

Preprocessing

The data is a mix of CSV and JSON, where several columns hold JSON strings. That mix requires a fair amount of preprocessing to extract the features and target we need for the analysis. The JSON columns (device, geoNetwork, totals, trafficSource) are converted from JSON strings into JSON objects, parsed and flattened with jsonlite into data frames, and then column-bound to the original dataframe. This preprocessing step is required for both the training and testing sets.

library(jsonlite)  # fromJSON() for parsing the JSON columns
library(dplyr)     # %>% pipe and select()

path <- '/home/kenan/Documents/learning/masters/CUNY-SPS-Masters-DS/DATA_621/final/'

df <- read.csv(paste0(path,"test.csv"), header = T, stringsAsFactors = F, colClasses = c(fullVisitorId = "character"))

df_device <- paste("[", paste(df$device, collapse = ","), "]") %>% fromJSON(flatten = T)
df_geoNetwork <- paste("[", paste(df$geoNetwork, collapse = ","), "]") %>% fromJSON(flatten = T)
df_totals <- paste("[", paste(df$totals, collapse = ","), "]") %>% fromJSON(flatten = T)
df_trafficSource <- paste("[", paste(df$trafficSource, collapse = ","), "]") %>% fromJSON(flatten = T)

df <- df %>%
    cbind(df_device, df_geoNetwork, df_totals, df_trafficSource) %>%
    select(-device, -geoNetwork, -totals, -trafficSource)

factorVars <- c("channelGrouping", "browser", "operatingSystem", "deviceCategory", "country")
df[, factorVars] <- lapply(df[, factorVars], as.factor)
df$transactionRevenue <- as.numeric(df$transactionRevenue)

numVars <- c("visits", "hits", "bounces", "pageviews", "newVisits")
df[, numVars] <- lapply(df[, numVars], as.integer)

df$visitStartTime <- as.POSIXct(df$visitStartTime, tz="UTC", origin='1970-01-01')
write.csv(df, paste0(path, 'test_clean.csv'), row.names = FALSE)

The clean dataset has 903,653 transactions and 55 columns, where each transaction is one visit to the Google Store. The target variable is transactionRevenue, the amount of money the GStore brought in from that visit.

path <- '/home/kenan/Documents/learning/masters/CUNY-SPS-Masters-DS/DATA_621/final/'
library(tidyverse)   # dplyr, ggplot2, tidyr used throughout the EDA
library(lubridate)   # ymd()
library(gridExtra)   # grid.arrange()
library(naniar)      # gg_miss_var()

df <- read.csv(paste0(path, 'train_clean.csv'))
df <- df %>% mutate(date=ymd(date))
rev_count <- sum(!is.na(df$transactionRevenue))  # sessions with a recorded (non-NA) revenue
rev_pct <- rev_count / nrow(df) * 100
total_rev <- sum(df$transactionRevenue, na.rm=T)

Only 11515 sessions in the training set led to a transaction. This is only 1.2742723% of the observations, which of course matches the 98.7% of missing values noted in the previous section. Altogether, the GStore recorded 1,540,071,240,000 in raw transactionRevenue units over that year. The missing transactionRevenue values are replaced with 0 to avoid errors with NAs later on. Filling the missing values this way is acceptable since revenue cannot go below 0.
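
The raw revenue figure above is easier to interpret if the Google Analytics micro-dollar convention applies, i.e. revenue is stored in units of 10^-6 dollars. That convention is an assumption here, not something stated in the cleaned dataset itself.

total_rev_usd <- total_rev / 1e6  # assumption: transactionRevenue is in micro-dollars, so roughly 1.54 million USD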

Exploratory Data Analysis

We begin by looking at transactionRevenue.

p1 <- df %>% filter(transactionRevenue > 0) %>% ggplot(aes(x=transactionRevenue)) + geom_histogram()
p2 <- df %>% filter(transactionRevenue > 0) %>% mutate(transactionRevenue=log(transactionRevenue)) %>% ggplot(aes(x=transactionRevenue)) + geom_histogram()
grid.arrange(p1, p2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most sessions generate no revenue at all, and even among those that do, the distribution has a strong right skew. However, if we take the log of transactionRevenue it looks almost normal. Since the competition metric is computed on the log scale, transactionRevenue will be converted to log scale from here on.

df <- df %>%
  mutate(transactionRevenue=log(transactionRevenue),
         transactionRevenue=replace_na(transactionRevenue,0))
gg_miss_var(df %>% filter(transactionRevenue > 0), show_pct = TRUE)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

df %>% filter(transactionRevenue > 0) %>% ggplot(aes(x=date, y=transactionRevenue)) + geom_line()

It looks like transactionRevenue doesn't show much variation throughout the year.

Sparse columns

Some columns contain only a single value; those columns will be dropped as they carry no row-wise information.

uniq_counts <- df %>% 
  select_if(negate(is.numeric)) %>% 
  apply(2, function(x) length(unique(x)))
uniq_cols <- names(uniq_counts[uniq_counts == 1])

uniq_cols
##  [1] "socialEngagementType"                "browserVersion"                     
##  [3] "browserSize"                         "operatingSystemVersion"             
##  [5] "mobileDeviceBranding"                "mobileDeviceModel"                  
##  [7] "mobileInputSelector"                 "mobileDeviceInfo"                   
##  [9] "mobileDeviceMarketingName"           "flashVersion"                       
## [11] "language"                            "screenColors"                       
## [13] "screenResolution"                    "cityId"                             
## [15] "latitude"                            "longitude"                          
## [17] "networkLocation"                     "adwordsClickInfo.criteriaParameters"

18 columns contain only one value across all 900k rows. The columns below were also dropped after checking their usefulness toward positive revenue, their NA counts, unique value counts, and duplication of other columns:

adwordsClickInfo.isVideoAd - NA (882166) and False ()
adwordsClickInfo.adNetworkType - NA, and only "Google Search" for revenue greater than 0
adwordsClickInfo.gclId - According to this post, gclId is a hash/encryption of info that is available in other features
adwordsClickInfo.slot - 103 Top and 5362 RHS, the rest are NA
adwordsClickInfo.page - NA and 1 for positive revenue
campaignCode - NA for all values with positive revenue and "11251kjhkvahf" for other values
isTrueDirect - NA, and True for only a few rows
visitId - sessionId already has this information
visits - 1 for all positive revenue

drop_cols = c('adwordsClickInfo.isVideoAd', 'adwordsClickInfo.adNetworkType', 'adwordsClickInfo.gclId', 'adwordsClickInfo.slot', 'adwordsClickInfo.page', 'campaignCode', 'isTrueDirect', 'visits', uniq_cols)

df <- df %>% select(-all_of(drop_cols))

After dropping those columns we are left with 29 columns.

Active users

An active user is someone who has visited the Google Merchandise Store.

p1 <- df %>% 
  group_by(date) %>% 
  summarise(dailySessions = n()) %>%
  ggplot(aes(x=date, y=dailySessions)) + 
  geom_line() + 
  geom_smooth(col='red') + 
  labs(x="", y="Sessions per Day")

p2 <- df %>% 
  group_by(date) %>% 
  summarise(revenue=sum(transactionRevenue)) %>%
  ggplot(aes(x=date, y=revenue)) + 
  geom_line() + 
  geom_smooth(col='red') + 
  labs(x="", y="Daily Revenue")

grid.arrange(p1, p2, nrow=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

It looks like there was a peak in active users between Oct 2016 and Jan 2017. Daily revenue peaks slightly around the same time and then plateaus.

Channel Grouping

Channels describe the medium through which users arrived at the GStore.
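
The session() and revenue() helper functions used in the plots below are not defined in this section. The following is a minimal sketch of what they might look like, assuming they take a data frame, an unquoted grouping column, and a top-n cutoff and plot session counts and total (log) revenue for the most common levels; this is a reconstruction, not the author's actual definitions.

# Hypothetical reconstruction of the session()/revenue() plotting helpers.
session <- function(data, col, top_n) {
  data %>%
    count({{ col }}, name = "sessions", sort = TRUE) %>%
    slice_head(n = top_n) %>%
    ggplot(aes(x = reorder({{ col }}, sessions), y = sessions)) +
    geom_col() +
    coord_flip() +
    labs(x = "", y = "Sessions")
}

revenue <- function(data, col, top_n) {
  data %>%
    group_by({{ col }}) %>%
    summarise(revenue = sum(transactionRevenue, na.rm = TRUE)) %>%
    arrange(desc(revenue)) %>%
    slice_head(n = top_n) %>%
    ggplot(aes(x = reorder({{ col }}, revenue), y = revenue)) +
    geom_col() +
    coord_flip() +
    labs(x = "", y = "Revenue (sum of log revenue)")
}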

p1 <- session(df, channelGrouping, 12)
p2 <- revenue(df, channelGrouping, 12)
grid.arrange(p1, p2)

As the plots show, Organic Search and Social led to the most sessions; however, the Social channel delivers very little revenue. Referral, on the other hand, delivers the most revenue from a relatively small number of sessions. This makes sense, since a referral is a session with a purpose, while the other channel groups could be casual browsing.

Device category

Device category is the type of hardware people use to access the GStore; there are three categories.

p1 <- session(df, deviceCategory, 3)
p2 <- revenue(df, deviceCategory, 3)
#p3 <- df %>% filter(transactionRevenue > 0) %>% ggplot(aes(x=transactionRevenue, fill=deviceCategory)) + geom_histogram(position='identity', alpha=0.75, bins=50)
grid.arrange(p1, p2)

The main takeaway here is that mobile and tablet have a lot fewer sessions, and relatively low revenues per session when compared to desktop. Desktops produce the most sessions and revenue.

Operating System

An operating system is the next level down from device category. There are 20 unique operating systems in the data:

"Windows" "Macintosh" "Linux" "Android" "iOS" "Chrome OS"
"BlackBerry" "(not set)" "Samsung" "Windows Phone" "Xbox" "Nintendo Wii" "Firefox OS" "Nintendo WiiU" "FreeBSD" "Nokia" "NTT DoCoMo" "Nintendo 3DS" "SunOS" "OpenBSD"

Most of these have low session counts and little revenue, so they are ignored. The "(not set)" category is most likely a missing value.

p1 <- session(df, operatingSystem, 6)
p2 <- revenue(df, operatingSystem, 6)
grid.arrange(p1, p2)

What stands out is that Macintosh has higher revenue than Windows with fewer sessions. It also confirms again that mobile sessions (Android and iOS) generate little revenue relative to their number of sessions. Macintosh and Windows are desktop operating systems, which is in line with the device category analysis.

Browser

Web browsers are the last step in getting to the GStore. There are 54 web browsers in the data, from the popular Chrome and Safari to lesser-known ones such as Seznam and DoCoMo.

p1 <- session(df, browser, 5)
p2 <- revenue(df, browser, 5)
grid.arrange(p1, p2)

By far the most sessions and revenue come from the Chrome browser. Safari has a lot of sessions, more than Firefox, but generates relatively little revenue. Firefox, by contrast, has a healthier balance between sessions and revenue.

Pageviews, Bounces and Hits

A pageview is recorded each time a visitor views a page on the site.

p1 <- df %>% filter(!is.na(pageviews) & pageviews <= 30) %>% 
    ggplot(aes(x=pageviews)) +
    geom_histogram(binwidth=1)

p2 <- df %>% filter(!is.na(pageviews) & pageviews <= 30) %>% group_by(pageviews) %>% summarise(revenue=sum(transactionRevenue)) %>%
    ggplot(aes(x=pageviews, y=revenue)) +
    geom_col()
grid.arrange(p1, p2)

The distribution of pageviews is heavily right-skewed. This makes sense, since most sessions view only a handful of pages. Most of the revenue is generated by sessions with a relatively small number of pageviews.

Country

There are 222 countries in the dataset, covering most of the world. Most of them have close to zero sessions, likely because of limited internet access.

c1 <- session(df, country, 6)
c2 <- revenue(df, country, 6)
grid.arrange(c1, c2)

Since Google is based in the US, it is no surprise that the US has the highest session count and the largest revenue. Canada and Venezuela follow very far behind.

adContent

AdContent, as the name implies, is the ad shown to the user at the time of the session.

df %>% 
  filter(transactionRevenue > 0) %>% 
  drop_na(adContent) %>% 
  group_by(adContent) %>% 
  summarise(revenue=sum(transactionRevenue)) %>% 
  ggplot(aes(x=adContent, y=revenue)) + 
  geom_col() + 
  labs(x='AdContent', y='Revenue') + 
  coord_flip()

The results here are expected for sessions that produced positive revenue. Ads that were about Google Merchandise Collection generated the most revenue. Since this is a categorical variable the missing values are difficult to impute.

Look at the final dataframe

skimr::skim(df)
Data summary
Name df
Number of rows 903653
Number of columns 29
_______________________
Column type frequency:
character 19
Date 1
logical 1
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
channelGrouping 0 1.00 6 14 0 8 0
sessionId 0 1.00 24 31 0 902755 0
visitStartTime 0 1.00 19 19 0 887159 0
browser 0 1.00 1 43 0 54 0
operatingSystem 0 1.00 3 13 0 20 0
deviceCategory 0 1.00 6 7 0 3 0
continent 0 1.00 4 9 0 6 0
subContinent 0 1.00 9 18 0 23 0
country 0 1.00 4 24 0 222 0
region 0 1.00 4 33 0 376 0
metro 0 1.00 6 41 0 94 0
city 0 1.00 3 33 0 649 0
networkDomain 0 1.00 2 37 0 28064 0
campaign 0 1.00 9 47 0 10 0
source 0 1.00 3 60 0 380 0
medium 0 1.00 3 9 0 7 0
keyword 502929 0.44 1 147 0 3659 0
referralPath 572712 0.37 1 270 0 1475 0
adContent 892707 0.01 8 43 0 44 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2016-08-01 2017-08-01 2017-01-09 366

Variable type: logical

skim_variable n_missing complete_rate mean count
isMobile 0 1 0.26 FAL: 664530, TRU: 239123

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
fullVisitorId 0 1.00 4.505845e+18 3.071128e+18 4.823595e+12 1.593048e+18 4.388191e+18 7.194903e+18 5.198308e+19 ▇▁▁▁▁
visitId 0 1.00 1.485007e+09 9.022124e+06 1.470035e+09 1.477561e+09 1.483949e+09 1.492759e+09 1.501657e+09 ▆▇▅▅▅
visitNumber 0 1.00 2.260000e+00 9.280000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.950000e+02 ▇▁▁▁▁
hits 0 1.00 4.600000e+00 9.640000e+00 1.000000e+00 1.000000e+00 2.000000e+00 4.000000e+00 5.000000e+02 ▇▁▁▁▁
pageviews 100 1.00 3.850000e+00 7.030000e+00 1.000000e+00 1.000000e+00 1.000000e+00 4.000000e+00 4.690000e+02 ▇▁▁▁▁
bounces 453023 0.50 1.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 ▁▁▇▁▁
newVisits 200593 0.78 1.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 ▁▁▇▁▁
transactionRevenue 0 1.00 2.300000e-01 2.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.386000e+01 ▇▁▁▁▁

Columns still worth looking at - keyword

Based on the EDA, I would expect the most revenue from users with the following profile (a quick sanity check follows the list):

browser - Chrome
channelGrouping - Referral
operatingSystem - Macintosh
isMobile - False
deviceCategory - Desktop
continent - Americas
country - United States
city - New York
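
As a rough check, one could filter the cleaned training data for this profile and look at its session count and total revenue. This is only a sketch; the exact level spellings (for example "desktop" vs "Desktop") depend on the cleaned data and are assumptions here.

# Hypothetical sanity check of the high-revenue profile above.
df %>%
  filter(browser == "Chrome",
         channelGrouping == "Referral",
         operatingSystem == "Macintosh",
         isMobile == FALSE,
         deviceCategory == "desktop",
         country == "United States",
         city == "New York") %>%
  summarise(sessions = n(),
            revenue  = sum(transactionRevenue, na.rm = TRUE))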

Modeling

The target in this dataset is revenue, so we will begin with a random forest regression. Given the size of the dataset, the model will be built only on sessions with revenue greater than 0. Predictions will then be thresholded to 0 below 13.12, the midpoint between the minimum (9.21) and the 1st quartile (17.03) of log revenue.
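
A minimal sketch of that thresholding rule, using the values quoted above (the function name and form are assumptions for illustration):

# Hypothetical helper: zero out predicted log revenues below the min/Q1 midpoint.
rev_min <- 9.21    # minimum log revenue among purchasing sessions
rev_q1  <- 17.03   # 1st quartile of log revenue
cutoff  <- (rev_min + rev_q1) / 2   # 13.12
threshold_preds <- function(pred, cutoff) ifelse(pred < cutoff, 0, pred)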

library(tidymodels)
library(randomForest)
library(vip)

Train test split

We begin by splitting the data into training and testing sets.

set.seed(42)
df_split <- df %>% 
  filter(transactionRevenue > 0) %>%
  initial_split(strata = transactionRevenue)

df_train <- training(df_split)
df_test <- testing(df_split)

set.seed(43)
df_folds <- bootstraps(df_train, strata = transactionRevenue, times=10)

Preprocessing

Certain features are selected based on the analysis above. Infrequent levels (below 1% frequency) of several categorical features are pooled into an "other" level, and missing values in hits and pageviews are imputed with k-nearest neighbours. The step order matters here: step_novel runs first so that levels not seen during training get a placeholder level before step_other pools the rare ones, and imputation happens last.

df_rec <- recipe(transactionRevenue ~ channelGrouping + browser + operatingSystem + 
                   isMobile + deviceCategory + continent + subContinent + country +
                   region + metro + city + networkDomain + hits + pageviews + 
                   source + medium, data=df_train) %>%
  step_novel(operatingSystem, country, city, browser, medium, networkDomain, metro, source, region, subContinent) %>%
  step_other(operatingSystem, country, city, browser, medium, networkDomain, subContinent, threshold = 0.01) %>%
  step_impute_knn(pageviews, hits)

smp <- df_rec %>% prep() %>% bake(new_data=NULL)

Tuning

With the data preprocessed, we specify a random forest model with tunable parameters (mtry and min_n).

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees=1500) %>%
  set_mode('regression') %>%
  set_engine('ranger')

rf_workflow <- workflow() %>%
  add_recipe(df_rec) %>%
  add_model(rf_spec)

Training

Using parallel processing, the workflow is tuned over the 10 bootstrap resamples with a grid of 20 candidate hyperparameter combinations to find the best mtry and min_n for the random forest model.

set.seed(44)
doParallel::registerDoParallel()
rf_tune <- tune_grid(rf_workflow, resamples = df_folds, grid=20)
## i Creating pre-processing data to finalize unknown parameter: mtry

Evaluation

The metric used in this competition is RMSE, so the hyperparameter values corresponding to the lowest RMSE are chosen.

show_best(rf_tune, metric='rmse')
## # A tibble: 5 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2    21 rmse    standard    1.10     6 0.00671 Preprocessor1_Model18
## 2     3    11 rmse    standard    1.11     6 0.00601 Preprocessor1_Model20
## 3     3     6 rmse    standard    1.11     6 0.00605 Preprocessor1_Model10
## 4     5    33 rmse    standard    1.11     6 0.00592 Preprocessor1_Model07
## 5     4    12 rmse    standard    1.12     6 0.00584 Preprocessor1_Model13
autoplot(rf_tune)

Selecting the best parameters, we finalize the workflow.

rf_model <- rf_workflow %>%
  finalize_workflow(select_best(rf_tune, metric='rmse'))

With tuning complete, we can now fit the model with the best parameters on the training set and evaluate it on the test set.

Testing

rf_fit <- last_fit(rf_model, df_split)
collect_metrics(rf_fit)
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       1.08  Preprocessor1_Model1
## 2 rsq     standard       0.177 Preprocessor1_Model1

Our model has an RMSE of 1.08 on the test set, which is slightly better than the 1.10 seen on the resamples during tuning.

Finally, we can see how our model's predictions compare with the true revenue.

collect_predictions(rf_fit) %>% 
  ggplot(aes(x=transactionRevenue, y=.pred)) + 
  geom_abline(lty=2) + 
  geom_point(alpha=0.5, color='blue')  + 
  coord_fixed()

Feature importance

imp_spec <- rf_spec %>%
  finalize_model(select_best(rf_tune, metric = 'rmse')) %>%
  set_engine('ranger', importance='permutation')

workflow() %>%
  add_recipe(df_rec) %>%
  add_model(imp_spec) %>%
  fit(df_test) %>%
  extract_fit_parsnip() %>%
  vip() + 
  labs(y='Importance', x='Features')

It looks like the number of hits and pageviews are among the most important features.

Manual prediction

model <- rf_fit$.workflow[[1]]
predict(model, df_test[100,])
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  16.9

With the trained model we can now predict on the Kaggle test set. The results will be uploaded to the leaderboard for evaluation.

test <- read.csv(paste0(path, 'test_clean.csv'))
X <- c('channelGrouping' , 'browser', 'operatingSystem', 'isMobile', 'deviceCategory', 'continent', 'subContinent', 'country', 'region', 'metro', 'city', 'networkDomain', 'hits', 'pageviews', 'source','medium')

#chrvars <- df_test %>% select(where(is.character)) %>% names()
#test[, chrvars] <- lapply(test[, chrvars], as.character)

stest <- test %>% select(all_of(X))

y_hat <- predict(model, new_data=stest)
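
As a final step, the predictions could be assembled into a submission file. The snippet below is a sketch under the assumption that the competition expects one row per fullVisitorId with a PredictedLogRevenue column; the zero threshold from the Modeling section is applied per session before summing per visitor.

# Hypothetical submission assembly (format assumed: fullVisitorId, PredictedLogRevenue).
# Note: fullVisitorId should ideally be read as character to avoid precision loss.
submission <- test %>%
  mutate(pred = y_hat$.pred,
         pred = ifelse(pred < 13.12, 0, pred)) %>%          # zero out below the cutoff
  group_by(fullVisitorId) %>%
  summarise(PredictedLogRevenue = log1p(sum(expm1(pred))))  # back-transform, sum per visitor, re-log

write.csv(submission, paste0(path, 'submission.csv'), row.names = FALSE)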