Introduction

Google Analytics (GA) is a web analytics tool provided by Google that helps companies analyze and understand how visitors use their websites. It can supply vital information to support decision making, for instance, measuring marketing performance (e.g., revenue by marketing channel, conversion rate by campaign) or assessing website performance to boost conversions (e.g., bounce rate by operating system, page load speed by browser type).

In general, GA works by embedding a block of JavaScript code in the pages of the company website. When a visitor accesses a page, the JavaScript code runs the tracking mechanism, which collects data about the page request, e.g., browser and operating system. The information is then sent to the Google server and stored for later analysis.

Preparations

We will use googleAnalyticsR to collect the data from Google Analytics and tidyverse to manipulate the dataset.

library(googleAnalyticsR)
library(tidyverse)

#authenticate. If you have already completed the Google authentication, you only need to select
#one of the pre-authorized accounts. Otherwise, a browser window will open to perform the authentication
ga_auth()

#initiate parameters
ga_view_id <- 'xxxxxxx' #replace the value with your GA view ID
start_date <- '2020-09-01'
end_date <- '2020-09-30'

#retrieve the list of available dimensions and metrics
ga_dim <- ga_meta()
ga_dim
##                       name      type dataType group
## 1              ga:userType DIMENSION   STRING  User
## 2          ga:sessionCount DIMENSION   STRING  User
## 3  ga:daysSinceLastSession DIMENSION   STRING  User
## 4      ga:userDefinedValue DIMENSION   STRING  User
## 5            ga:userBucket DIMENSION   STRING  User
## 6                 ga:users    METRIC  INTEGER  User
## 7              ga:newUsers    METRIC  INTEGER  User
## 8    ga:percentNewSessions    METRIC  PERCENT  User
## 9             ga:1dayUsers    METRIC  INTEGER  User
## 10            ga:7dayUsers    METRIC  INTEGER  User
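
The metadata table mixes dimensions and metrics, so it can help to filter it before choosing what to request. The following is a minimal sketch that assumes the `name` and `type` columns shown in the output above. To stay runnable without authentication, it uses a small hand-built data frame (`ga_dim_example`, a name invented here) in place of the real `ga_dim` object.

```r
library(dplyr)

# Hand-built stand-in for a few rows of the ga_meta() output
# (illustration only; in practice, use the real ga_dim object)
ga_dim_example <- data.frame(
  name = c('ga:userType', 'ga:sessionCount', 'ga:users', 'ga:newUsers'),
  type = c('DIMENSION', 'DIMENSION', 'METRIC', 'METRIC'),
  dataType = c('STRING', 'STRING', 'INTEGER', 'INTEGER'),
  group = c('User', 'User', 'User', 'User'),
  stringsAsFactors = FALSE
)

# Keep only the dimensions and strip the 'ga:' prefix from their names
available_dimensions <- ga_dim_example %>%
  filter(type == 'DIMENSION') %>%
  mutate(name = sub('^ga:', '', name)) %>%
  pull(name)

available_dimensions
```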

Data Collection

Google Analytics allows a maximum of 7 dimensions in any request (see Google Analytics API). One possible workaround is to split the collection process into multiple requests and merge/join the results at the end. This approach is best suited to cases where a specific dimension can be used to identify sessions/transactions at any given time, e.g., transactionId.
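
To make the split-and-merge idea concrete, here is a minimal sketch that joins the results of two separate requests on a shared key. The two data frames and all of their values are invented for illustration. They stand in for what two API requests might return when both include transactionId as a dimension.

```r
library(dplyr)

# Invented results of two separate requests, both keyed by transactionId
request_user <- data.frame(
  transactionId = c('T001', 'T002', 'T003'),
  deviceCategory = c('desktop', 'mobile', 'tablet'),
  transactionRevenue = c(120.5, 35.0, 80.0),
  stringsAsFactors = FALSE
)
request_source <- data.frame(
  transactionId = c('T001', 'T002', 'T003'),
  channelGrouping = c('Organic Search', 'Social', 'Direct'),
  stringsAsFactors = FALSE
)

# Join the second request onto the first using the shared key
merged <- request_user %>% left_join(request_source, by = 'transactionId')
merged
```

Each transaction keeps a single row, and the dimensions from both requests end up side by side.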

#GA will aggregate the data based on the supplied dimensions. So, dimensions
#need to be carefully selected to avoid misaggregation, e.g., collecting deviceCategory
#(desktop/mobile/tablet) in the same request as mobileDeviceModel will return data
#from mobile/tablet only (the API will omit desktop data).
#Using a list of lists allows us to control the dimensions supplied to the API (a maximum of six
#dimensions is allowed for each list, as every API request also includes transactionId as the key)
dimension_list <- list('user'=list('userType','deviceCategory','city','country',
                                   'browser','operatingSystem'),
                       'source'=list('channelGrouping','medium','source','socialNetwork'),
                       'mobile'=list('mobileDeviceModel','mobileDeviceInfo'))

#identify list of metrics
metric_list <- c('transactionRevenue','itemQuantity','sessions','users')

#initiate the data frame to store the merged dataset
ga_data <- data.frame()

for (i in seq_along(dimension_list)) {
  ga_request <- google_analytics(viewId = ga_view_id,
                                 date_range= c(start_date,end_date),
                                 metrics=metric_list,
                                 dimensions = c('transactionId',unlist(dimension_list[[i]])),
                                 anti_sample = TRUE)
  
  if (i == 1) {
    #the first output will be set as a base
    ga_data <- ga_request
  } else {
    #merge the subsequent output with the base using a specific key
    #(in this case: transactionId).
    #first, remove the metric columns to avoid duplicated metric
    #column names (suffix: .x & .y)
    ga_request <- ga_request %>% select(-all_of(metric_list))
    ga_data <- ga_data %>% left_join(ga_request, by = 'transactionId')
  }
}
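
A left join only keeps one row per transaction if the key is unique in every request. The sketch below shows one way to check this before merging. `has_unique_key` is a hypothetical helper written for this example, not part of googleAnalyticsR, and the data is invented.

```r
library(dplyr)

# Hypothetical helper (not part of googleAnalyticsR): TRUE when every
# value in the key column occurs exactly once
has_unique_key <- function(df, key = 'transactionId') {
  anyDuplicated(df[[key]]) == 0
}

# Invented data: transaction T002 appears twice in the second request
base_request <- data.frame(
  transactionId = c('T001', 'T002'),
  sessions = c(1, 1),
  stringsAsFactors = FALSE
)
mobile_request <- data.frame(
  transactionId = c('T001', 'T002', 'T002'),
  mobileDeviceModel = c('iPhone', 'Pixel 4', 'Pixel 4a'),
  stringsAsFactors = FALSE
)

has_unique_key(base_request)
has_unique_key(mobile_request)

# The duplicated key adds an extra row to the joined result
joined <- base_request %>% left_join(mobile_request, by = 'transactionId')
nrow(joined)
```

If the check fails for a request, the metrics carried over from the base would be repeated for that transaction after the join, so totals computed from the merged data could be inflated.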