Google Analytics is a well known Web Analytics Platform that allows businesses to track the consumer acquisition, engagement, beahvior and conversion through their website. Even though it might be considered as a basic platform, it follows the standard Data Science process where it collects data from website user’s browser, transforms and cleans based on business input and provides analysis of the data via the Google Analytics User Interface. In my final project, I decided to leverage Google Analytics Data to solve one of the most commont problems Marketing Departments within businesses solve everyday. The high level problem statement is “How should our business allocate their Marketing Budget on weekly, daily, yearly or even daily basis?” . In order to explore this problem further, we would need to look at the Google Analytics Data in detail for Unique visitors, Sessions, Channel Grouping, Transaction Revenue and New vs Returning Consumer. Our purpose of this analysis is to provide actionable insights to the marketing team so they can use their budget based on the business conversion goals. Instead of using a Power Point Presentation, we will build an engaging analysis and shiny app that works with our Google Analytics Data.
In this case, since I wasnt able to access any real business data, I decided to leverage a Google Analytics Customer Revenue Prediction Competion data available by Kaggle. The full Data set can be found here: ( https://www.kaggle.com/c/ga-customer-revenue-prediction/data ) . The raw Toogle Analytics data between the dates of August 1st 2016 and April 30th 2018, for the Google Merchandise store. YOu can access to the Google Merchandise Store data through the Google Analytics Demo Account (https://analytics.google.com/analytics/web/?utm_source=demoaccount&utm_medium=demoaccount&utm_campaign=demoaccount#/report-home/a54516992w87479473p92320289).
##Data Fields
fullVisitorId- A unique identifier for each user of the Google Merchandise Store.
channelGrouping - The channel via which the user came to the Store.
date - The date on which the user visited the Store.
device - The specifications for the device used to access the Store.
geoNetwork - This section contains information about the geography of the user.
socialEngagementType - Engagement type, either “Socially Engaged” or “Not Socially Engaged”.
totals - This section contains aggregate values across the session.
trafficSource - This section contains information about the Traffic Source from which the session originated.
**visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.**
visitNumber - The session number for this user. If this is the first session, then this is set to 1.
visitStartTime - The timestamp (expressed as POSIX time).
hits - This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
customDimensions - This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.
totals - This set of columns mostly includes high-level aggregate data.
## Warning: package 'readxl' was built under R version 3.6.3
## Warning: package 'shiny' was built under R version 3.6.2
#library(googleAuthR) # To prompt for authentication by the user
#library(googleAnalyticsR) # For the pulling of the data
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
##
## Attaching package: 'googleCharts'
## The following object is masked from 'package:shiny':
##
## column
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## Loading required package: shinydashboard
## Warning: package 'shinydashboard' was built under R version 3.6.3
##
## Attaching package: 'shinydashboard'
## The following object is masked from 'package:graphics':
##
## box
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Warning: package 'knitr' was built under R version 3.6.3
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
## Warning: package 'corrplot' was built under R version 3.6.2
## corrplot 0.84 loaded
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
## The following object is masked from 'package:shiny':
##
## validate
## Warning: package 'repr' was built under R version 3.6.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
##
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
##
## alpha, rescale
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Warning: package 'ggpubr' was built under R version 3.6.3
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
For this purpose of this project, we will only use the train data and will not touch the test data provided by Kaggle.
As we can see our dataset is a mess. It is the raw data that is pulled in from Big Query in csv format and needs to be cleaned in order for us to perform any analysis and visualization. The columns device, geoNetwork, totals and Traffic Source has multiple key value pairs.
#JSON columns are "device", "geoNetwork", "totals", "trafficSource"
tr_device <- paste("[", paste(ga_train$device, collapse = ","), "]") %>% fromJSON(flatten = T)
tr_geoNetwork <- paste("[", paste(ga_train$geoNetwork, collapse = ","), "]") %>% fromJSON(flatten = T)
tr_totals <- paste("[", paste(ga_train$totals, collapse = ","), "]") %>% fromJSON(flatten = T)
tr_trafficSource <- paste("[", paste(ga_train$trafficSource, collapse = ","), "]") %>% fromJSON(flatten = T)
#Combine to make the full training and test sets
ga_train <- ga_train %>%
cbind(tr_device, tr_geoNetwork, tr_totals, tr_trafficSource) %>%
select(-device, -geoNetwork, -totals, -trafficSource)
#Remove temporary tr_ and te_ sets
rm(tr_device)
rm(tr_geoNetwork)
rm(tr_totals)
rm(tr_trafficSource)
# convert some character variables to categorical variables
factorVars <- c("channelGrouping", "browser", "operatingSystem", "deviceCategory", "country")
ga_train[, factorVars] <- lapply(ga_train[, factorVars], as.factor)
#converting the date variable from integer to the date format
ga_train$date <- ymd(ga_train$date)
#converting character variables into numeric
numVars <- c("visits", "hits", "bounces", "pageviews", "newVisits")
ga_train[, numVars] <- lapply(ga_train[, numVars], as.integer)
ga_train$transactionRevenue <- as.numeric(ga_train$transactionRevenue)
#converting visit start times to POSIXct
ga_train$visitStartTime <- as.POSIXct(ga_train$visitStartTime, tz="UTC", origin='1970-01-01')
Let’s look at the structure and descriptive statistic of our data.
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
metastats <- tibble::rownames_to_column(metastats, "STATS")
metastats["pct_missing"] <- round(metastats["n"]/903653, 3)
head(metastats)
## STATS vars n mean sd min
## 1 channelGrouping 1 903653 NaN NA Inf
## 2 date 2 903653 NaN NA Inf
## 3 fullVisitorId 3 903653 4.505845e+18 3.071128e+18 4.823595e+12
## 4 sessionId 4 903653 NaN NA Inf
## 5 socialEngagementType 5 903653 NaN NA Inf
## 6 visitId 6 903653 1.485007e+09 9.022124e+06 1.470035e+09
## max range se pct_missing
## 1 -Inf -Inf NA 1
## 2 -Inf -Inf NA 1
## 3 5.198308e+19 5.198307e+19 3.230703e+15 1
## 4 -Inf -Inf NA 1
## 5 -Inf -Inf NA 1
## 6 1.501657e+09 3.162238e+07 9.490911e+03 1
## channelGrouping date fullVisitorId
## Organic Search:381561 Min. :2016-08-01 Min. :4.824e+12
## Social :226117 1st Qu.:2016-10-27 1st Qu.:1.593e+18
## Direct :143026 Median :2017-01-09 Median :4.388e+18
## Referral :104838 Mean :2017-01-20 Mean :4.506e+18
## Paid Search : 25326 3rd Qu.:2017-04-21 3rd Qu.:7.195e+18
## Affiliates : 16403 Max. :2017-08-01 Max. :5.198e+19
## (Other) : 6382
## sessionId socialEngagementType
## 0011338928267756760_1471848731: 2 Not Socially Engaged:903653
## 0014884852016449602_1483257533: 2
## 0019612470994343873_1496731675: 2
## 0031542961723593073_1495261490: 2
## 0039798900163330977_1485331011: 2
## 0047064119019485310_1480924286: 2
## (Other) :903641
## visitId visitNumber visitStartTime
## Min. :1.470e+09 Min. : 1.000 Min. :2016-08-01 07:00:12
## 1st Qu.:1.478e+09 1st Qu.: 1.000 1st Qu.:2016-10-27 09:39:29
## Median :1.484e+09 Median : 1.000 Median :2017-01-09 08:02:14
## Mean :1.485e+09 Mean : 2.265 Mean :2017-01-21 14:01:30
## 3rd Qu.:1.493e+09 3rd Qu.: 1.000 3rd Qu.:2017-04-21 07:12:42
## Max. :1.502e+09 Max. :395.000 Max. :2017-08-02 06:59:53
##
## browser browserVersion browserSize
## Chrome :620364 Length:903653 Length:903653
## Safari :182245 Class :character Class :character
## Firefox : 37069 Mode :character Mode :character
## Internet Explorer: 19375
## Edge : 10205
## Android Webview : 7865
## (Other) : 26530
## operatingSystem operatingSystemVersion isMobile mobileDeviceBranding
## Windows :350072 Length:903653 Mode :logical Length:903653
## Macintosh:253938 Class :character FALSE:664530 Class :character
## Android :123892 Mode :character TRUE :239123 Mode :character
## iOS :107665
## Linux : 35034
## Chrome OS: 26337
## (Other) : 6715
## mobileDeviceModel mobileInputSelector mobileDeviceInfo
## Length:903653 Length:903653 Length:903653
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## mobileDeviceMarketingName flashVersion language
## Length:903653 Length:903653 Length:903653
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## screenColors screenResolution deviceCategory continent
## Length:903653 Length:903653 desktop:664479 Length:903653
## Class :character Class :character mobile :208725 Class :character
## Mode :character Mode :character tablet : 30449 Mode :character
##
##
##
##
## subContinent country region
## Length:903653 United States :364744 Length:903653
## Class :character India : 51140 Class :character
## Mode :character United Kingdom: 37393 Mode :character
## Canada : 25869
## Vietnam : 24598
## Turkey : 20522
## (Other) :379387
## metro city cityId networkDomain
## Length:903653 Length:903653 Length:903653 Length:903653
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## latitude longitude networkLocation visits
## Length:903653 Length:903653 Length:903653 Min. :1
## Class :character Class :character Class :character 1st Qu.:1
## Mode :character Mode :character Mode :character Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
## hits pageviews bounces newVisits
## Min. : 1.000 Min. : 1.00 Min. :1 Min. :1
## 1st Qu.: 1.000 1st Qu.: 1.00 1st Qu.:1 1st Qu.:1
## Median : 2.000 Median : 1.00 Median :1 Median :1
## Mean : 4.596 Mean : 3.85 Mean :1 Mean :1
## 3rd Qu.: 4.000 3rd Qu.: 4.00 3rd Qu.:1 3rd Qu.:1
## Max. :500.000 Max. :469.00 Max. :1 Max. :1
## NA's :100 NA's :453023 NA's :200593
## transactionRevenue campaign source medium
## Min. :1.000e+04 Length:903653 Length:903653 Length:903653
## 1st Qu.:2.493e+07 Class :character Class :character Class :character
## Median :4.945e+07 Mode :character Mode :character Mode :character
## Mean :1.337e+08
## 3rd Qu.:1.077e+08
## Max. :2.313e+10
## NA's :892138
## keyword isTrueDirect referralPath adContent
## Length:903653 Mode:logical Length:903653 Length:903653
## Class :character TRUE:274005 Class :character Class :character
## Mode :character NA's:629648 Mode :character Mode :character
##
##
##
##
## campaignCode adwordsClickInfo.criteriaParameters adwordsClickInfo.page
## Length:903653 Length:903653 Length:903653
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## adwordsClickInfo.slot adwordsClickInfo.gclId adwordsClickInfo.adNetworkType
## Length:903653 Length:903653 Length:903653
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## adwordsClickInfo.isVideoAd
## Mode :logical
## FALSE:21460
## NA's :882193
##
##
##
##
We have 903653 obvervations, with 55 variables. SessionId is the unique identifier for each user. We can see that the number of distinct sessionid is almost same as the number of observations.
Based on the Channel Grouping, we are acquiring majority of the users via Organic Search, followed by Social. Majority of the visitors are using Chrome as their browser, using Desktop as a device and are from United Staes.
Let’s look at the missing values in more detail.
## channelGrouping date
## 0 0
## fullVisitorId sessionId
## 0 0
## socialEngagementType visitId
## 0 0
## visitNumber visitStartTime
## 0 0
## browser browserVersion
## 0 0
## browserSize operatingSystem
## 0 0
## operatingSystemVersion isMobile
## 0 0
## mobileDeviceBranding mobileDeviceModel
## 0 0
## mobileInputSelector mobileDeviceInfo
## 0 0
## mobileDeviceMarketingName flashVersion
## 0 0
## language screenColors
## 0 0
## screenResolution deviceCategory
## 0 0
## continent subContinent
## 0 0
## country region
## 0 0
## metro city
## 0 0
## cityId networkDomain
## 0 0
## latitude longitude
## 0 0
## networkLocation visits
## 0 0
## hits pageviews
## 0 100
## bounces newVisits
## 453023 200593
## transactionRevenue campaign
## 892138 0
## source medium
## 0 0
## keyword isTrueDirect
## 502929 629648
## referralPath adContent
## 572712 892707
## campaignCode adwordsClickInfo.criteriaParameters
## 903652 0
## adwordsClickInfo.page adwordsClickInfo.slot
## 882193 882193
## adwordsClickInfo.gclId adwordsClickInfo.adNetworkType
## 882092 882193
## adwordsClickInfo.isVideoAd
## 882193
We have a lot of missing values in campaigncode, transactionrevenue and some other variables. For the purpose of this project, we wont be handling these missing values. As we can see Channel Grouping variable which gives us the variety of Marketing Channels that we acquire the users from dont have any missing values along with sessionid, visit number.
## channelGrouping date fullVisitorId sessionId
## 1 Organic Search 2016-09-02 1.131660e+18 1131660440785968503_1472830385
## 2 Organic Search 2016-09-02 3.773060e+17 377306020877927890_1472880147
## 3 Organic Search 2016-09-02 3.895546e+18 3895546263509774583_1472865386
## 4 Organic Search 2016-09-02 4.763447e+18 4763447161404445595_1472881213
## 5 Organic Search 2016-09-02 2.729444e+16 27294437909732085_1472822600
## 6 Organic Search 2016-09-02 2.938943e+18 2938943183656635653_1472807194
## socialEngagementType visitId visitNumber visitStartTime browser
## 1 Not Socially Engaged 1472830385 1 2016-09-02 15:33:05 Chrome
## 2 Not Socially Engaged 1472880147 1 2016-09-03 05:22:27 Firefox
## 3 Not Socially Engaged 1472865386 1 2016-09-03 01:16:26 Chrome
## 4 Not Socially Engaged 1472881213 1 2016-09-03 05:40:13 UC Browser
## 5 Not Socially Engaged 1472822600 2 2016-09-02 13:23:20 Chrome
## 6 Not Socially Engaged 1472807194 1 2016-09-02 09:06:34 Chrome
## browserVersion browserSize operatingSystem
## 1 not available in demo dataset not available in demo dataset Windows
## 2 not available in demo dataset not available in demo dataset Macintosh
## 3 not available in demo dataset not available in demo dataset Windows
## 4 not available in demo dataset not available in demo dataset Linux
## 5 not available in demo dataset not available in demo dataset Android
## 6 not available in demo dataset not available in demo dataset Windows
## operatingSystemVersion isMobile mobileDeviceBranding
## 1 not available in demo dataset FALSE not available in demo dataset
## 2 not available in demo dataset FALSE not available in demo dataset
## 3 not available in demo dataset FALSE not available in demo dataset
## 4 not available in demo dataset FALSE not available in demo dataset
## 5 not available in demo dataset TRUE not available in demo dataset
## 6 not available in demo dataset FALSE not available in demo dataset
## mobileDeviceModel mobileInputSelector
## 1 not available in demo dataset not available in demo dataset
## 2 not available in demo dataset not available in demo dataset
## 3 not available in demo dataset not available in demo dataset
## 4 not available in demo dataset not available in demo dataset
## 5 not available in demo dataset not available in demo dataset
## 6 not available in demo dataset not available in demo dataset
## mobileDeviceInfo mobileDeviceMarketingName
## 1 not available in demo dataset not available in demo dataset
## 2 not available in demo dataset not available in demo dataset
## 3 not available in demo dataset not available in demo dataset
## 4 not available in demo dataset not available in demo dataset
## 5 not available in demo dataset not available in demo dataset
## 6 not available in demo dataset not available in demo dataset
## flashVersion language
## 1 not available in demo dataset not available in demo dataset
## 2 not available in demo dataset not available in demo dataset
## 3 not available in demo dataset not available in demo dataset
## 4 not available in demo dataset not available in demo dataset
## 5 not available in demo dataset not available in demo dataset
## 6 not available in demo dataset not available in demo dataset
## screenColors screenResolution deviceCategory
## 1 not available in demo dataset not available in demo dataset desktop
## 2 not available in demo dataset not available in demo dataset desktop
## 3 not available in demo dataset not available in demo dataset desktop
## 4 not available in demo dataset not available in demo dataset desktop
## 5 not available in demo dataset not available in demo dataset mobile
## 6 not available in demo dataset not available in demo dataset desktop
## continent subContinent country region
## 1 Asia Western Asia Turkey Izmir
## 2 Oceania Australasia Australia not available in demo dataset
## 3 Europe Southern Europe Spain Community of Madrid
## 4 Asia Southeast Asia Indonesia not available in demo dataset
## 5 Europe Northern Europe United Kingdom not available in demo dataset
## 6 Europe Southern Europe Italy not available in demo dataset
## metro city
## 1 (not set) Izmir
## 2 not available in demo dataset not available in demo dataset
## 3 (not set) Madrid
## 4 not available in demo dataset not available in demo dataset
## 5 not available in demo dataset not available in demo dataset
## 6 not available in demo dataset not available in demo dataset
## cityId networkDomain latitude
## 1 not available in demo dataset ttnet.com.tr not available in demo dataset
## 2 not available in demo dataset dodo.net.au not available in demo dataset
## 3 not available in demo dataset unknown.unknown not available in demo dataset
## 4 not available in demo dataset unknown.unknown not available in demo dataset
## 5 not available in demo dataset unknown.unknown not available in demo dataset
## 6 not available in demo dataset fastwebnet.it not available in demo dataset
## longitude networkLocation visits hits
## 1 not available in demo dataset not available in demo dataset 1 1
## 2 not available in demo dataset not available in demo dataset 1 1
## 3 not available in demo dataset not available in demo dataset 1 1
## 4 not available in demo dataset not available in demo dataset 1 1
## 5 not available in demo dataset not available in demo dataset 1 1
## 6 not available in demo dataset not available in demo dataset 1 1
## pageviews bounces newVisits transactionRevenue campaign source medium
## 1 1 1 1 NA (not set) google organic
## 2 1 1 1 NA (not set) google organic
## 3 1 1 1 NA (not set) google organic
## 4 1 1 1 NA (not set) google organic
## 5 1 1 NA NA (not set) google organic
## 6 1 1 1 NA (not set) google organic
## keyword isTrueDirect referralPath adContent campaignCode
## 1 (not provided) NA <NA> <NA> <NA>
## 2 (not provided) NA <NA> <NA> <NA>
## 3 (not provided) NA <NA> <NA> <NA>
## 4 google + online NA <NA> <NA> <NA>
## 5 (not provided) TRUE <NA> <NA> <NA>
## 6 (not provided) NA <NA> <NA> <NA>
## adwordsClickInfo.criteriaParameters adwordsClickInfo.page
## 1 not available in demo dataset <NA>
## 2 not available in demo dataset <NA>
## 3 not available in demo dataset <NA>
## 4 not available in demo dataset <NA>
## 5 not available in demo dataset <NA>
## 6 not available in demo dataset <NA>
## adwordsClickInfo.slot adwordsClickInfo.gclId adwordsClickInfo.adNetworkType
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
## adwordsClickInfo.isVideoAd
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
Let’s further perform explanatory data analysis to this data set.
c <- ga_train %>%
ggplot(aes(channelGrouping)) +
geom_bar(aes(fill=channelGrouping), position = position_dodge())
se <- ga_train %>%
ggplot(aes(socialEngagementType)) +
geom_bar(aes(fill=channelGrouping), position = position_dodge())
d <- ga_train %>%
ggplot(aes(deviceCategory)) +
geom_bar(aes(fill=channelGrouping), position = position_dodge())
co <- ga_train %>%
ggplot(aes(continent)) +
geom_bar(aes(fill=channelGrouping), position = position_dodge())
me <- ga_train %>%
ggplot(aes(medium)) +
geom_bar(aes(fill=channelGrouping), position = position_dodge())
ggarrange(c, se, d, co, me, nrow = 5)
We can see that overall Organic Search is providing the biggest consumer acquisition for the Google Store, next channel is Social followed by Referral. I think this makes sense as Organic Search is ruled by Google’s Search Algorithm and they dont neccessarily need any paid or display media for user acquisition.
Let’s look at Sessions and Transaction Revenue, one of the usual business goals of an ecommerce business.
# Assign $0 value to the TransactionRevenu that has missing values.
ga_train$transactionRevenue[is.na(ga_train$transactionRevenue)] <- 0
y <- ga_train$transactionRevenue #saving original values in a vector
ga_train$transactionRevenue <- ga_train$transactionRevenue/1000000
ga_train %>% filter(transactionRevenue >0) %>% summarize('number of transactions'=n(), 'total revenues train set'=sum(transactionRevenue))
## number of transactions total revenues train set
## 1 11515 1540071
We dont neccessarily need to see the transaction values that are 0 or less than $1,000
## [1] 0.01 23129.50
ga_train %>% filter(transactionRevenue>=1000) %>% summarize('number of transactions with at least 1000 USD revenues'=n(), 'sum revenues of transactions with at least 1000 USD revenues'=sum(transactionRevenue))
## number of transactions with at least 1000 USD revenues
## 1 195
## sum revenues of transactions with at least 1000 USD revenues
## 1 418588.3
z1 <- ga_train %>%
group_by(date) %>%
summarise(dailySessions = n()) %>%
ggplot(aes(x=date, y=dailySessions)) + geom_line(col='plum') +
scale_y_continuous(labels=comma) + geom_smooth(col='tomato') +
labs(x="", y="Sessions per Day") + scale_x_date(date_breaks = "1 month", date_labels = "%b %d")
z2 <- ga_train %>%
group_by(date) %>%
summarise(dailyRevenue = sum(transactionRevenue)) %>%
ggplot(aes(x=date, y=dailyRevenue)) + geom_line(col='plum') +
scale_y_continuous(labels=comma) + geom_smooth(col='tomato') +
labs(x="", y="Daily Revenues (USD)") + scale_x_date(date_breaks = "1 month", date_labels = "%b %d")
grid.arrange(z1,z2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We can see that sessions have increased starting October and peaked in November, however went back to normal in January. When we compare the Daily Revenus within the same time frame we see that it stayed flat. This means, users are engaging with the Google eStore, however not contributing to the purchase.
Let’s analyze further to see when consumers are engaging with the website the most.
# creating this function to plot the next two visualizations easily
plotSessions <- function(dataframe, factorVariable, topN=10) {
var_col <- enquo(factorVariable)
dataframe %>% count(!!var_col) %>% top_n(topN, wt=n) %>%
ggplot(aes_(x=var_col, y=~n, fill=var_col)) +
geom_bar(stat='identity')+
scale_y_continuous(labels=comma)+
labs(x="", y="number of sessions")+
theme(legend.position="none")
}
#also creating a function to plot transactionRevenue for a factorvariable
plotRevenue <- function(dataframe, factorVariable, topN=10) {
var_col <- enquo(factorVariable)
dataframe %>% group_by(!!var_col) %>% summarize(rev=sum(transactionRevenue)) %>% filter(rev>0) %>% top_n(topN, wt=rev) %>% ungroup() %>%
ggplot(aes_(x=var_col, y=~rev, fill=var_col)) +
geom_bar(stat='identity')+
scale_y_continuous(labels=comma)+
labs(x="", y="Revenues (USD)")+
theme(legend.position="none")
}
options(repr.plot.height=4)
week1 <- plotSessions(ga_train, weekday)
week2 <- plotRevenue(ga_train, weekday)
grid.arrange(week1, week2)
ga_train$month <- month(ga_train$date, label=TRUE)
m1 <- plotSessions(ga_train, month, 12)
m2 <- plotRevenue(ga_train, month, 12)
grid.arrange(m1, m2)
Based on this analysis, we can see that users engage with the website the most on Tuesday and Wednesday and contribute to the Revenue the most on Tuesday. We had a spike in number of sessions on November, however didnt contribute to the Revenue. However, when we look at the December, even though we didnt have as much user engagement compare to November and October, we have the biggest contribution to the Revenue.
Let’s look at our Marketing Channels and their contribution to the user engagement and Revenue since that is what we want to explore in terms of solving business problem.
#adding reordering of x manually
sessionOrder <- ga_train %>%
count(channelGrouping) %>%
top_n(10, wt=n) %>%
arrange(desc(n))
sessionOrder <- sessionOrder$channelGrouping
channel1 <- plotSessions(ga_train, channelGrouping) + scale_x_discrete(limits=sessionOrder)
channel2 <- plotRevenue(ga_train, channelGrouping) + scale_x_discrete(limits=sessionOrder)
grid.arrange(channel1, channel2)
We can see that Organic Search has the highest number of Sessions however does not bring the biggest Revenue. Even thogh Referral has the 4th place in terms of User Engagement with the site, we see that it brings the highest contribution to the Revenue by far.
Even though as businesses we would like to have higher user acqusition and engagement, that doesnt always means that this meets their business goals. Depending on the website, business goals can mean content consumption, lead generation or higher purchase conversion which is the case in our analysis. We may not neccessarily want to have higher sessions but rather we would like to have higher unique visitors or higher revenue and allcote our Channel Distribution budgets accordingly.
Finally, we have complied these analysis into the below shiny apps to further present the visualization of Channel Performance.
References: https://www.youtube.com/watch?v=4Ht_vEXJ4wo : Create a Talking GA Shiny App, https://medium.com/compassred-data-blog/google-analytics-dashboards-in-r-shiny-fc8e0ebcef2c : Google Analytics Dashboard in R.