Google Analytics (GA) is a web analytics tool from Google that helps companies analyze and understand how customers use their websites. It can provide vital information to support decision making, for instance by measuring marketing performance (e.g., revenue by marketing channel, conversion rate by campaign) or by assessing website performance to boost conversions (e.g., bounce rate by operating system, page load speed by browser type).
In general, GA works by embedding a block of JavaScript code in the company's website pages. When a visitor accesses a page, the JavaScript code runs the tracking mechanism, which collects data about the page request, e.g., browser and operating system. The information is then sent to Google's servers and stored for later analysis.
We will use googleAnalyticsR to collect the data from Google Analytics and tidyverse to manipulate the dataset.
library(googleAnalyticsR)
library(tidyverse)
#authenticate. If you have already completed the Google authentication, you only need to
#select one of the pre-authorized accounts. Otherwise, a browser window opens to perform it
ga_auth()
#initialize parameters
ga_view_id <- 'xxxxxxx' #replace with your GA view ID
start_date <- '2020-09-01'
end_date <- '2020-09-30'
#retrieve the list of available dimensions and metrics
ga_dim <- ga_meta()
ga_dim
##                       name      type dataType group
## 1              ga:userType DIMENSION   STRING  User
## 2          ga:sessionCount DIMENSION   STRING  User
## 3  ga:daysSinceLastSession DIMENSION   STRING  User
## 4      ga:userDefinedValue DIMENSION   STRING  User
## 5            ga:userBucket DIMENSION   STRING  User
## 6                 ga:users    METRIC  INTEGER  User
## 7              ga:newUsers    METRIC  INTEGER  User
## 8    ga:percentNewSessions    METRIC  PERCENT  User
## 9             ga:1dayUsers    METRIC  INTEGER  User
## 10            ga:7dayUsers    METRIC  INTEGER  User
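The metadata table mixes dimensions and metrics, so it can help to separate them before deciding what to request. The following is a minimal sketch, not part of the original workflow: `ga_dim_sample` is a hypothetical, hand-built stand-in for a few rows of the output above, so the snippet runs without authentication. In practice the same filter would be applied to `ga_dim`, assuming it has the columns shown.

```r
# Hypothetical stand-in for a few rows of ga_meta(); in practice use ga_dim.
ga_dim_sample <- data.frame(
  name = c("ga:userType", "ga:sessionCount", "ga:users", "ga:newUsers"),
  type = c("DIMENSION", "DIMENSION", "METRIC", "METRIC"),
  dataType = c("STRING", "STRING", "INTEGER", "INTEGER"),
  group = c("User", "User", "User", "User"),
  stringsAsFactors = FALSE
)

# Keep only the dimensions and strip the "ga:" prefix from their names
dims_only <- ga_dim_sample[ga_dim_sample$type == "DIMENSION", ]
dim_names <- sub("^ga:", "", dims_only$name)
dim_names
```

The unprefixed names have the same form as the dimension names used in the requests later in this post.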
Google Analytics allows a maximum of 7 dimensions in any request (see Google Analytics API). One possible workaround is to split the collection process into multiple requests and join the results at the end. This approach works best when a specific dimension, e.g., transactionId, uniquely identifies each session or transaction and can therefore serve as the join key.
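To make the split-and-join idea concrete, here is a small sketch on made-up data. The data frames `req_user` and `req_source` and all of their values are hypothetical; they stand in for the results of two separate requests that both include transactionId.

```r
library(dplyr)

# Hypothetical results of two separate requests sharing transactionId as the key
req_user <- data.frame(
  transactionId = c("T1", "T2", "T3"),
  deviceCategory = c("desktop", "mobile", "tablet"),
  transactionRevenue = c(10, 25, 40),
  stringsAsFactors = FALSE
)
req_source <- data.frame(
  transactionId = c("T1", "T2", "T3"),
  medium = c("organic", "cpc", "email"),
  transactionRevenue = c(10, 25, 40),
  stringsAsFactors = FALSE
)

# Drop the repeated metric from the second result so it does not appear twice,
# then attach its dimensions to the first result via the shared key
merged <- req_user %>%
  left_join(select(req_source, -transactionRevenue), by = "transactionId")
merged
```

Note that a join like this only keeps one row per transaction if the key is unique within each result. If a transactionId appears on several rows in one of the results, the join multiplies rows, so it is worth checking the row count afterwards.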
#GA aggregates the data based on the supplied dimensions, so the dimensions
#need to be selected carefully to avoid misaggregation, e.g., requesting deviceCategory
#(desktop/mobile/tablet) in the same request as mobileDeviceModel returns data
#from mobile/tablet only (the API omits desktop data)
#Using a list of lists lets us control the dimensions supplied to the API (each list may hold
#at most six dimensions, because transactionId is added to every request as the join key)
dimension_list <- list('user'=list('userType','deviceCategory','city','country',
                                   'browser','operatingSystem'),
                       'source'=list('channelGrouping','medium','source','socialNetwork'),
                       'mobile'=list('mobileDeviceModel','mobileDeviceInfo'))
#identify list of metrics
metric_list <- c('transactionRevenue','itemQuantity','sessions','users')
#initialize the data frame that will store the merged dataset
ga_data <- data.frame()
for (i in seq_along(dimension_list)) {
  ga_request <- google_analytics(viewId = ga_view_id,
                                 date_range = c(start_date, end_date),
                                 metrics = metric_list,
                                 dimensions = c('transactionId', unlist(dimension_list[[i]])),
                                 anti_sample = TRUE)
  if (i == 1) {
    #the first output is used as the base
    ga_data <- ga_request
  } else {
    #merge each subsequent output with the base on the shared key
    #(in this case: transactionId)
    #first, drop the metric columns so they are not duplicated
    #in the join (suffixes: .x & .y)
    ga_request <- ga_request %>% select(-all_of(metric_list))
    ga_data <- ga_data %>% left_join(ga_request, by = 'transactionId')
  }
}
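If the list of dimensions grows, the groups of at most six could also be built mechanically rather than by hand. The sketch below is one way to do it in base R. `chunk_dimensions` is a hypothetical helper, not part of googleAnalyticsR, and the default size of six assumes that transactionId is added to every request as the key.

```r
# Hypothetical helper: split a vector of dimensions into groups of at most `size`
chunk_dimensions <- function(dims, size = 6) {
  stopifnot(size >= 1)
  # Assign each position to a group number (1, 1, ..., 2, 2, ...) and split on it
  unname(split(dims, ceiling(seq_along(dims) / size)))
}

all_dims <- c("userType", "deviceCategory", "city", "browser",
              "operatingSystem", "channelGrouping", "medium", "source",
              "socialNetwork")
chunks <- chunk_dimensions(all_dims)
lengths(chunks)
```

Mechanical chunking ignores how GA aggregates the data, so dimensions that restrict the result set (such as the mobile device dimensions) would still need to be kept in a group of their own, as in the hand-written list above.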