The dataset we are analyzing is a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset. The raw dataset contains 12 columns used to predict the transaction revenue per customer; the outcome of this analysis should help allocate marketing budgets more effectively. There are two datasets, a small one of roughly 1 million transactions and a large one about 10 times bigger. Due to hardware constraints, the small dataset is used in this report.
The data is a mix of CSV and JSON, where several columns are stored as JSON strings. That mismatch requires substantial preprocessing to extract the features and target we need for the analysis. The JSON columns (device, geoNetwork, totals, trafficSource) are converted from strings into JSON objects, parsed with jsonlite, flattened, and then joined back onto the original dataframe. This preprocessing step is required for both the training and testing sets.
library(jsonlite)   # fromJSON()
library(dplyr)      # %>%, select()

path = '/home/kenan/Documents/learning/masters/CUNY-SPS-Masters-DS/DATA_621/final/'
df <- read.csv(paste0(path, "test.csv"), header = T, stringsAsFactors = F, colClasses = c(fullVisitorId = "character"))

# Wrap each row's JSON string into one JSON array, then parse and flatten it into a dataframe
df_device <- paste("[", paste(df$device, collapse = ","), "]") %>% fromJSON(flatten = T)
df_geoNetwork <- paste("[", paste(df$geoNetwork, collapse = ","), "]") %>% fromJSON(flatten = T)
df_totals <- paste("[", paste(df$totals, collapse = ","), "]") %>% fromJSON(flatten = T)
df_trafficSource <- paste("[", paste(df$trafficSource, collapse = ","), "]") %>% fromJSON(flatten = T)

# Bind the parsed columns back onto the original dataframe and drop the raw JSON columns
df <- df %>%
  cbind(df_device, df_geoNetwork, df_totals, df_trafficSource) %>%
  select(-device, -geoNetwork, -totals, -trafficSource)

# Fix column types
factorVars <- c("channelGrouping", "browser", "operatingSystem", "deviceCategory", "country")
df[, factorVars] <- lapply(df[, factorVars], as.factor)
df$transactionRevenue <- as.numeric(df$transactionRevenue)
numVars <- c("visits", "hits", "bounces", "pageviews", "newVisits")
df[, numVars] <- lapply(df[, numVars], as.integer)
df$visitStartTime <- as.POSIXct(df$visitStartTime, tz = "UTC", origin = '1970-01-01')

write.csv(df, paste0(path, 'test_clean.csv'), row.names = FALSE)
The clean dataset has 903653 transactions and 55 columns, where each transaction is one visit to the Google Merchandise Store. The target variable is transactionRevenue, the amount of money the GStore brought in from that transaction.
library(lubridate)  # ymd()

path = '/home/kenan/Documents/learning/masters/CUNY-SPS-Masters-DS/DATA_621/final/'
df <- read.csv(paste0(path, 'train_clean.csv'))
df <- df %>% mutate(date = ymd(date))

# Sessions that actually produced revenue (non-missing transactionRevenue)
rev_count <- sum(!is.na(df$transactionRevenue))
rev_pct <- rev_count / nrow(df) * 100
total_rev <- sum(df$transactionRevenue, na.rm = T)
Only 11515 sessions in the training set led to a transaction. This is only about 1.27% of the observations, which matches the ~98.7% missing values noted in the previous section. Altogether the GStore recorded 1,540,071,240,000 in raw transactionRevenue units; since this field appears to be expressed in micro-dollars (units of 10^-6 USD, per the Kaggle data description), that corresponds to roughly $1.54 million USD for the year. The missing transactionRevenue values are replaced with 0 to avoid errors with NAs later on. This imputation is acceptable since the lowest a revenue can go is 0.
We begin by looking at transactionRevenue.
p1 <- df %>% filter(transactionRevenue > 0) %>% ggplot(aes(x=transactionRevenue)) + geom_histogram()
p2 <- df %>% filter(transactionRevenue > 0) %>% mutate(transactionRevenue=log(transactionRevenue)) %>% ggplot(aes(x=transactionRevenue)) + geom_histogram()
grid.arrange(p1, p2)
Most transactions generate no revenue, and even among the revenue-generating sessions the distribution has a strong right skew. However, once we log-transform transactionRevenue it looks almost normal. Since the competition metric is computed on the log scale, transactionRevenue will be converted to the log scale from here on.
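For reference, the competition scores submissions with root mean squared error on log-transformed revenue, roughly of the form below (per the Kaggle competition description the target is aggregated per visitor before logging, with a +1 offset; the per-session analysis here simply uses the natural log directly):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad y_i = \ln\!\left(\text{revenue}_i + 1\right)$$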
# Log-transform revenue; sessions with no transaction (NA) become 0
df <- df %>%
  mutate(transactionRevenue = log(transactionRevenue),
         transactionRevenue = replace_na(transactionRevenue, 0))
gg_miss_var(df %>% filter(transactionRevenue > 0), show_pct = TRUE)
df %>% filter(transactionRevenue > 0) %>% ggplot(aes(x=date, y=transactionRevenue)) + geom_line()
It looks like transactionRevenue doesn't show much variation throughout the year.
Some columns contain only a single value; those columns will be dropped since they carry no row-wise information.
uniq_counts <- df %>%
  select_if(negate(is.numeric)) %>%
  apply(2, function(x) length(unique(x)))
uniq_cols <- names(uniq_counts[uniq_counts == 1])
uniq_cols
## [1] "socialEngagementType" "browserVersion"
## [3] "browserSize" "operatingSystemVersion"
## [5] "mobileDeviceBranding" "mobileDeviceModel"
## [7] "mobileInputSelector" "mobileDeviceInfo"
## [9] "mobileDeviceMarketingName" "flashVersion"
## [11] "language" "screenColors"
## [13] "screenResolution" "cityId"
## [15] "latitude" "longitude"
## [17] "networkLocation" "adwordsClickInfo.criteriaParameters"
18 columns contain only one value across all ~900k rows. The columns below were also dropped, based on checks of their usefulness toward positive revenue, NA counts, unique value counts, and duplication of other columns.
- adwordsClickInfo.isVideoAd - NA (882166) or FALSE
- adwordsClickInfo.adNetworkType - NA, and only "Google Search" for revenue greater than 0
- adwordsClickInfo.gclId - according to this post, gclId is a hash/encryption of information that is already available in other features
- adwordsClickInfo.slot - 103 Top and 5362 RHS, the rest are NA
- adwordsClickInfo.page - NA and 1 for positive revenue
- campaignCode - NA for all rows with positive revenue and "11251kjhkvahf" for other values
- isTrueDirect - NA, and TRUE for only a few rows
- visitId - sessionId already has this information
- visits - 1 for all positive revenue
drop_cols = c('adwordsClickInfo.isVideoAd', 'adwordsClickInfo.adNetworkType', 'adwordsClickInfo.gclId', 'adwordsClickInfo.slot', 'adwordsClickInfo.page', 'campaignCode', 'isTrueDirect', 'visits', uniq_cols)
df <- df %>% select(-all_of(drop_cols))
After dropping those columns we are left with 29 columns.
An active user is someone who has visited the Google Merchandise Store.
p1 <- df %>%
group_by(date) %>%
summarise(dailySessions = n()) %>%
ggplot(aes(x=date, y=dailySessions)) +
geom_line() +
geom_smooth(col='red') +
labs(x="", y="Sessions per Day")
p2 <- df %>%
group_by(date) %>%
summarise(revenue=sum(transactionRevenue)) %>%
ggplot(aes(x=date, y=revenue)) +
geom_line() +
geom_smooth(col='red') +
labs(x="", y="Daily Revenue")
grid.arrange(p1, p2, nrow=2)
It looks like there was a peak in active users between Oct 2016 and Jan 2017. Daily revenue peaks slightly around the same time and then plateaus.
Channels describe the medium through which users arrived at the GStore.
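The bar plots in the following sections use two helper functions, session() and revenue(), whose definitions are not shown in this excerpt. A minimal sketch of what they might look like, assuming each takes a dataframe, an unquoted column, and the number of top levels to plot (an assumed reconstruction, not the author's exact code):

# Assumed sketches of the plotting helpers used below (not the original implementation)
session <- function(data, col, n) {
  data %>%
    count({{ col }}, sort = TRUE, name = "sessions") %>%                # sessions per level
    slice_head(n = n) %>%
    ggplot(aes(x = reorder({{ col }}, sessions), y = sessions)) +
    geom_col() +
    coord_flip() +
    labs(x = "", y = "Sessions")
}

revenue <- function(data, col, n) {
  data %>%
    group_by({{ col }}) %>%
    summarise(revenue = sum(transactionRevenue), .groups = "drop") %>%  # revenue per level
    slice_max(revenue, n = n) %>%
    ggplot(aes(x = reorder({{ col }}, revenue), y = revenue)) +
    geom_col() +
    coord_flip() +
    labs(x = "", y = "Revenue")
}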
p1 <- session(df, channelGrouping, 12)
p2 <- revenue(df, channelGrouping, 12)
grid.arrange(p1, p2)
As the plots show, the Organic Search and Social channels led to the most sessions; however, the Social channel delivers very little revenue. Referral, on the other hand, delivers the most revenue with a relatively small number of sessions. This makes sense since a referral is a session with a purpose, while other channel groups could be casual browsing.
Device category is the type of hardware people use to access the GStore; there are three.
p1 <- session(df, deviceCategory, 3)
p2 <- revenue(df, deviceCategory, 3)
#p3 <- df %>% filter(transactionRevenue > 0) %>% ggplot(aes(x=transactionRevenue, fill=deviceCategory)) + geom_histogram(position='identity', alpha=0.75, bins=50)
grid.arrange(p1, p2)
The main takeaway here is that mobile and tablet have a lot fewer sessions, and relatively low revenues per session when compared to desktop. Desktops produce the most sessions and revenue.
An operating system is the next level of detail below device category. There are 20 unique operating systems:

"Windows", "Macintosh", "Linux", "Android", "iOS", "Chrome OS", "BlackBerry", "(not set)", "Samsung", "Windows Phone", "Xbox", "Nintendo Wii", "Firefox OS", "Nintendo WiiU", "FreeBSD", "Nokia", "NTT DoCoMo", "Nintendo 3DS", "SunOS", "OpenBSD"

Most of them have few sessions and little revenue, so they are ignored. The "(not set)" category is most likely a missing value.
p1 <- session(df, operatingSystem, 6)
p2 <- revenue(df, operatingSystem, 6)
grid.arrange(p1, p2)
What stands out is that Macintosh has higher revenue than Windows despite fewer sessions. In addition, it confirms again that mobile sessions (Android and iOS) generate little revenue compared to their number of sessions. Macintosh and Windows are desktop operating systems, which is in line with the device category analysis.
Web browsers are the last step in getting to the GStore. There are 54 web browsers, from the popular Chrome and Safari to lesser-known ones such as Seznam and DoCoMo.
p1 <- session(df, browser, 5)
p2 <- revenue(df, browser, 5)
grid.arrange(p1, p2)
By far the most sessions and revenue come from the Chrome browser. Safari has a lot of sessions, more than Firefox, but generates relatively little revenue. Firefox, by contrast, has a healthy balance between sessions and revenue.
A pageview is recorded each time a visitor views a page on the website.
p1 <- df %>% filter(!is.na(pageviews), pageviews <= 30) %>%
  ggplot(aes(x=pageviews)) +
  geom_histogram(binwidth=1)
p2 <- df %>% filter(!is.na(pageviews), pageviews <= 30) %>%
  group_by(pageviews) %>%
  summarise(revenue=sum(transactionRevenue)) %>%
  ggplot(aes(x=pageviews, y=revenue)) +
  geom_col()
grid.arrange(p1, p2)
The distribution of pageviews is strongly right-skewed: most sessions view only a handful of pages. Most of the revenue also comes from sessions with a relatively small number of pageviews.
There are 222 countries in the dataset, which covers most of the world. Most of them have close to zero sessions, likely in part because of limited internet access.
c1 <- session(df, country, 6)
c2 <- revenue(df, country, 6)
grid.arrange(c1, c2)
Since Google is based in the US, it is no surprise that the US has the highest session count and the largest revenue. Canada and Venezuela follow very far behind.
AdContent, as the name implies, is the ad shown to the user during the session.
df %>%
  filter(transactionRevenue > 0) %>%
  drop_na(adContent) %>%
  group_by(adContent) %>%
  summarise(revenue=sum(transactionRevenue)) %>%
  ggplot(aes(x=adContent, y=revenue)) +
  geom_col() +
  labs(x='AdContent', y='Revenue') +
  coord_flip()
The results here are expected for sessions that produced positive revenue. Ads that were about Google Merchandise Collection generated the most revenue. Since this is a categorical variable the missing values are difficult to impute.
Finally, we take a look at the cleaned dataframe with skimr.
skimr::skim(df)
|                           |        |
|:--------------------------|:-------|
| Name                      | df     |
| Number of rows            | 903653 |
| Number of columns         | 29     |
| _______________________   |        |
| Column type frequency:    |        |
| character                 | 19     |
| Date                      | 1      |
| logical                   | 1      |
| numeric                   | 8      |
| ________________________  |        |
| Group variables           | None   |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| channelGrouping | 0 | 1.00 | 6 | 14 | 0 | 8 | 0 |
| sessionId | 0 | 1.00 | 24 | 31 | 0 | 902755 | 0 |
| visitStartTime | 0 | 1.00 | 19 | 19 | 0 | 887159 | 0 |
| browser | 0 | 1.00 | 1 | 43 | 0 | 54 | 0 |
| operatingSystem | 0 | 1.00 | 3 | 13 | 0 | 20 | 0 |
| deviceCategory | 0 | 1.00 | 6 | 7 | 0 | 3 | 0 |
| continent | 0 | 1.00 | 4 | 9 | 0 | 6 | 0 |
| subContinent | 0 | 1.00 | 9 | 18 | 0 | 23 | 0 |
| country | 0 | 1.00 | 4 | 24 | 0 | 222 | 0 |
| region | 0 | 1.00 | 4 | 33 | 0 | 376 | 0 |
| metro | 0 | 1.00 | 6 | 41 | 0 | 94 | 0 |
| city | 0 | 1.00 | 3 | 33 | 0 | 649 | 0 |
| networkDomain | 0 | 1.00 | 2 | 37 | 0 | 28064 | 0 |
| campaign | 0 | 1.00 | 9 | 47 | 0 | 10 | 0 |
| source | 0 | 1.00 | 3 | 60 | 0 | 380 | 0 |
| medium | 0 | 1.00 | 3 | 9 | 0 | 7 | 0 |
| keyword | 502929 | 0.44 | 1 | 147 | 0 | 3659 | 0 |
| referralPath | 572712 | 0.37 | 1 | 270 | 0 | 1475 | 0 |
| adContent | 892707 | 0.01 | 8 | 43 | 0 | 44 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2016-08-01 | 2017-08-01 | 2017-01-09 | 366 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| isMobile | 0 | 1 | 0.26 | FAL: 664530, TRU: 239123 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| fullVisitorId | 0 | 1.00 | 4.505845e+18 | 3.071128e+18 | 4.823595e+12 | 1.593048e+18 | 4.388191e+18 | 7.194903e+18 | 5.198308e+19 | ▇▁▁▁▁ |
| visitId | 0 | 1.00 | 1.485007e+09 | 9.022124e+06 | 1.470035e+09 | 1.477561e+09 | 1.483949e+09 | 1.492759e+09 | 1.501657e+09 | ▆▇▅▅▅ |
| visitNumber | 0 | 1.00 | 2.260000e+00 | 9.280000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.950000e+02 | ▇▁▁▁▁ |
| hits | 0 | 1.00 | 4.600000e+00 | 9.640000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+00 | 5.000000e+02 | ▇▁▁▁▁ |
| pageviews | 100 | 1.00 | 3.850000e+00 | 7.030000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 4.690000e+02 | ▇▁▁▁▁ |
| bounces | 453023 | 0.50 | 1.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | ▁▁▇▁▁ |
| newVisits | 200593 | 0.78 | 1.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | ▁▁▇▁▁ |
| transactionRevenue | 0 | 1.00 | 2.300000e-01 | 2.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.386000e+01 | ▇▁▁▁▁ |
A column still worth exploring further is keyword.
Based on the EDA, I would expect the most revenue from users with the following profile (a quick hypothetical check follows the list):
- browser - Chrome
- channelGrouping - Referral
- operatingSystem - Macintosh
- isMobile - False
- deviceCategory - Desktop
- continent - Americas
- country - United States
- city - New York
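As a rough, hypothetical sanity check (not part of the original analysis), one could filter the training data for part of that profile and compute its share of total (log-scale) revenue; the level labels used below are assumed to match those shown in the plots above.

# Hypothetical check: revenue captured by a subset of the profile suggested by the EDA
profile_rev <- df %>%
  filter(browser == "Chrome",
         channelGrouping == "Referral",
         operatingSystem == "Macintosh",
         country == "United States") %>%
  summarise(revenue = sum(transactionRevenue)) %>%
  pull(revenue)

# share of total (log-scale) revenue captured by this profile
profile_rev / sum(df$transactionRevenue)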
The target in this dataset is revenue, so we will begin with a random forest regression. Given the size of the dataset, the model is built only on sessions with revenue greater than 0. Predictions will later be thresholded to 0 when they fall below the midpoint between the minimum (9.21) and the 1st quartile (17.03) of the log revenue, which is 13.12.
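A minimal sketch of that thresholding rule, as an assumed helper function (the original code does not show this step):

# Assumed helper: zero out predicted log revenue below the midpoint of the
# minimum (9.21) and the 1st quartile (17.03), i.e. (9.21 + 17.03) / 2 = 13.12
threshold_preds <- function(preds, cutoff = (9.21 + 17.03) / 2) {
  ifelse(preds < cutoff, 0, preds)
}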
library(tidymodels)
library(randomForest)
library(vip)
We begin by splitting the data into training and testing sets.
set.seed(42)
df_split <- df %>%
filter(transactionRevenue > 0) %>%
initial_split(strata = transactionRevenue)
df_train <- training(df_split)
df_test <- testing(df_split)
set.seed(43)
df_folds <- bootstraps(df_train, strata = transactionRevenue, times=10)
Certain features are selected based on the analysis above. Rare levels of some categorical features (frequency below 1%) are pooled into an "other" level, and missing values in other features are imputed with KNN. The recipe applies the steps in the order shown below.
df_rec <- recipe(transactionRevenue ~ channelGrouping + browser + operatingSystem +
                   isMobile + deviceCategory + continent + subContinent + country +
                   region + metro + city + networkDomain + hits + pageviews +
                   source + medium, data=df_train) %>%
  # assign factor levels not seen at training time to a "new" category
  step_novel(operatingSystem, country, city, browser, medium, networkDomain, metro, source, region, subContinent) %>%
  # pool levels occurring in less than 1% of rows into an "other" category
  step_other(operatingSystem, country, city, browser, medium, networkDomain, subContinent, threshold = 0.01) %>%
  # impute missing pageviews and hits with k-nearest neighbours
  step_impute_knn(pageviews, hits)
smp <- df_rec %>% prep() %>% bake(new_data=NULL)
With the data preprocessed, we define the random forest model specification with tunable parameters.
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees=1500) %>%
set_mode('regression') %>%
set_engine('ranger')
rf_workflow <- workflow() %>%
add_recipe(df_rec) %>%
add_model(rf_spec)
Using parallel processing, the workflow is tuned across a grid of 20 hyperparameter candidates, each evaluated on the bootstrap resamples, to find the best settings for the random forest model.
set.seed(44)
doParallel::registerDoParallel()
rf_tune <- tune_grid(rf_workflow, resamples = df_folds, grid=20)
## i Creating pre-processing data to finalize unknown parameter: mtry
The metric used in this competition is RMSE, so the hyperparameter combination with the lowest RMSE is chosen.
show_best(rf_tune, metric='rmse')
## # A tibble: 5 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 21 rmse standard 1.10 6 0.00671 Preprocessor1_Model18
## 2 3 11 rmse standard 1.11 6 0.00601 Preprocessor1_Model20
## 3 3 6 rmse standard 1.11 6 0.00605 Preprocessor1_Model10
## 4 5 33 rmse standard 1.11 6 0.00592 Preprocessor1_Model07
## 5 4 12 rmse standard 1.12 6 0.00584 Preprocessor1_Model13
autoplot(rf_tune)
Selecting the best parameters, we finalize the workflow.
rf_model <- rf_workflow %>%
finalize_workflow(select_best(rf_tune, metric='rmse'))
With tuning complete, we can now train the model with the best parameters on the training set and evaluate it on the test set.
rf_fit <- last_fit(rf_model, df_split)
collect_metrics(rf_fit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.08 Preprocessor1_Model1
## 2 rsq standard 0.177 Preprocessor1_Model1
Our model has an RMSE of 1.08 on the test set, slightly better than the resampling estimate of 1.10.
Finally, we can see how our model's predictions compare against the true revenue.
collect_predictions(rf_fit) %>%
ggplot(aes(x=transactionRevenue, y=.pred)) +
geom_abline(lty=2) +
geom_point(alpha=0.5, color='blue') +
coord_fixed()
Feature importance
imp_spec <- rf_spec %>%
finalize_model(select_best(rf_tune, metric = 'rmse')) %>%
set_engine('ranger', importance='permutation')
workflow() %>%
add_recipe(df_rec) %>%
add_model(imp_spec) %>%
fit(df_test) %>%
extract_fit_parsnip() %>%
vip() +
labs(y='Importance', x='Features')
It looks like the number of hits and pageviews are among the most important features.
Manual prediction
model <- rf_fit$.workflow[[1]]
predict(model, df_test[100,])
## # A tibble: 1 × 1
## .pred
## <dbl>
## 1 16.9
With the trained model we can now predict on the Kaggle test set. The results will be uploaded to the leaderboard for evaluation.
test <- read.csv(paste0(path, 'test_clean.csv'))
X <- c('channelGrouping', 'browser', 'operatingSystem', 'isMobile', 'deviceCategory', 'continent', 'subContinent', 'country', 'region', 'metro', 'city', 'networkDomain', 'hits', 'pageviews', 'source', 'medium')
#chrvars <- df_test %>% select(where(is.character)) %>% names()
#test[, chrvars] <- lapply(test[, chrvars], as.character)
stest <- test %>% select(all_of(X))   # keep only the predictors used by the model
y_hat <- predict(model, new_data=stest)
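The competition expects one row per fullVisitorId, so a final step (not shown above) would be to apply the thresholding rule, roll the per-session predictions up to the visitor level, and write the submission file. A rough sketch, assuming the submission columns are fullVisitorId and PredictedLogRevenue:

cutoff <- (9.21 + 17.03) / 2   # thresholding rule described earlier (= 13.12)

submission <- test %>%
  mutate(pred = y_hat$.pred,
         pred = ifelse(pred < cutoff, 0, pred),            # zero out low predictions
         pred_rev = ifelse(pred > 0, exp(pred), 0)) %>%    # back to the raw revenue scale
  group_by(fullVisitorId) %>%
  summarise(PredictedLogRevenue = log1p(sum(pred_rev)), .groups = 'drop')

write.csv(submission, paste0(path, 'submission.csv'), row.names = FALSE)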