I recently read an article in the New York Times outlining how state lines are becoming less and less valuable for economic and social policy. The article is anchored by a beautiful choropleth map of the United States broken down into larger, socially grouped regions and their respective urban corridors. As I read, I wondered whether a similar sort of map could be used to visualize Google Analytics data (I’m currently training to become a data scientist, but have worked in web analytics for the past few years). So, I set out to see if it’s possible to create a choropleth map in R based on Google Analytics data.
I quickly found that all of the tools are there, but I couldn’t find an example of someone pulling them all together in one place. So, I tried it myself. This is only a proof of concept that GA data can be plotted as a choropleth map in R. It’s far from perfect, but it’s a good start for much more interesting analysis later on.
The first step in analyzing your Google Analytics data is creating a new project in the Google console.
Another important step is to give your API service account access to your Google Analytics data. To do this:
For the following, the dimensions and metrics can be changed, but in order to map the data, latitude and longitude must be included. To see specifications on how to add or subtract dimensions or metrics, check out the Google Analytics Dimensions & Metrics Explorer.
library(RGoogleAnalytics)
# client.id, client.secret and ga.table must be defined beforehand with your own credentials and View ID.
token <- Auth(client.id, client.secret)
save(token, file = "./token_file")
ValidateToken(token)
query.list <- Init(start.date = "2008-01-01",
                   end.date = "2008-06-30",
                   dimensions = "ga:date,ga:pagePath,ga:hour,ga:medium,ga:city,ga:latitude,ga:longitude",
                   metrics = "ga:sessions,ga:pageviews",
                   max.results = 10000,
                   sort = "-ga:date",
                   table.id = ga.table) # this is your GA View ID.
ga.query <- QueryBuilder(query.list)
ga.data <- GetReportData(ga.query, token, split_daywise = TRUE)
# This simply cleans the data
ga.data$longitude <- as.numeric(ga.data$longitude)
ga.data$latitude <- as.numeric(ga.data$latitude)
ga.data <- ga.data[ga.data$latitude != 0,]
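To make the cleaning step concrete, here is a minimal sketch on a made-up data frame. The `toy` object and its values are invented for illustration and are not real GA output. It assumes, as in the code above, that the coordinates arrive as character strings and that a latitude of 0 marks a row with no usable location.

```r
# Toy stand-in for ga.data; the values are invented for illustration.
toy <- data.frame(
  city      = c("Seattle", "(not set)", "Spokane"),
  latitude  = c("47.6062", "0.0000", "47.6588"),
  longitude = c("-122.3321", "0.0000", "-117.4260"),
  sessions  = c(12, 3, 5),
  stringsAsFactors = FALSE
)

# Convert the coordinate columns from character to numeric.
toy$latitude  <- as.numeric(toy$latitude)
toy$longitude <- as.numeric(toy$longitude)

# Keep only rows with a usable latitude (drops the 0 placeholder row,
# and any row whose latitude could not be parsed at all).
toy_clean <- toy[!is.na(toy$latitude) & toy$latitude != 0, ]

print(toy_clean)
```

The extra `!is.na()` check is my own addition: without it, a latitude that failed to convert would come through as a row of NAs.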
Reverse geocoding the longitude and latitude of the GA data is somewhat more difficult than simply downloading and processing the GA data. It requires enabling another Google API, the Google Maps Geocoding API. To do this,
It is important to note that free Google Maps Geocoding API projects have a limit of 2500 requests per day. This is why we take only the unique longitude/latitude pairs: it limits the number of requests each time the script runs. The script connects to the API and gets back a JSON document determined by the parameters listed in the API URL. These can be changed, but must include the latitude and longitude values and your API key. To see the options for formatting this URL, see the Google Maps API documentation.
# only get individual longitude/latitude cases.
long_lat <- unique(ga.data[,c('latitude', 'longitude')])
library("RJSONIO") # Load Library
reverseGeoCode <- function(latlng) {
  # Collapse the latitude/longitude pair into a single "lat,lng" URL parameter
  latlngStr <- gsub(' ','%20', paste(latlng, collapse=","))
  # Open Connection (client.key is your Google API key and must be defined beforehand)
  connectStr <- paste('https://maps.google.com/maps/api/geocode/json?result_type=postal_code&latlng=',latlngStr, "&key=", client.key, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  # Filter the received JSON: keep the first address component of the first result
  if(data.json$status=="OK"){
    pc <- data.json$results[[1]]$address_components[[1]]$long_name
  } else {
    pc <- NA
  }
  return(pc)
}
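The function above takes the first address component of the first result. To show what that step is reading, the sketch below builds by hand a list shaped roughly like a parsed response and picks out the component whose `types` include "postal_code", which is a little more defensive than relying on position. The `mock_response` object is a simplified assumption rather than real API output, and `extract_postal_code` is a hypothetical helper name.

```r
# Hand-built list imitating the structure of a parsed geocoding response.
mock_response <- list(
  status = "OK",
  results = list(
    list(
      address_components = list(
        list(long_name = "98101", short_name = "98101", types = "postal_code"),
        list(long_name = "Seattle", short_name = "Seattle",
             types = c("locality", "political"))
      )
    )
  )
)

# Hypothetical helper: return the postal code of the first result, or NA.
extract_postal_code <- function(resp) {
  if (!identical(resp$status, "OK") || length(resp$results) == 0) {
    return(NA_character_)
  }
  components <- resp$results[[1]]$address_components
  for (comp in components) {
    # Look for the component tagged as a postal code, wherever it sits.
    if ("postal_code" %in% comp[["types"]]) {
      return(comp[["long_name"]])
    }
  }
  NA_character_
}

print(extract_postal_code(mock_response))
print(extract_postal_code(list(status = "ZERO_RESULTS", results = list())))
```

Returning `NA_character_` for every failure case keeps the result type the same whether or not the lookup succeeds.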
The following loop connects to the API for each unique latitude/longitude pair in your selected range. It then appends the zip code values to the table of unique cases.
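p <- character(nrow(long_lat)) # preallocate rather than growing the vector
for(i in seq_len(nrow(long_lat))){
  p[i] <- reverseGeoCode(long_lat[i,])
}
# count only the lookups that actually returned a zip code
paste("You converted", sum(!is.na(p)), "geocodes to zipcodes.")
long_lat$postal_code <- p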
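Because each iteration is a network call against a daily quota, it can help to check the number of requests first and to keep one failed lookup from stopping the whole run. Here is a minimal sketch of that idea. It uses a stand-in `fake_geocode` function (hypothetical, so the example runs offline) in place of `reverseGeoCode`, made-up coordinates in `coords`, and the 2500 figure mentioned above as `daily_limit`.

```r
# Invented coordinate pairs; the last one is a deliberately bad case.
coords <- data.frame(latitude  = c(47.6062, 47.6588, 0),
                     longitude = c(-122.3321, -117.4260, 0))

# Hypothetical stand-in for reverseGeoCode(); fails on purpose for (0, 0).
fake_geocode <- function(latlng) {
  if (latlng$latitude == 0) stop("lookup failed")
  if (latlng$longitude < -120) "98101" else "99201"
}

# Refuse to start if the run would exceed the daily quota.
daily_limit <- 2500
stopifnot(nrow(coords) <= daily_limit)

# Preallocate the result instead of growing it inside the loop.
zips <- rep(NA_character_, nrow(coords))
for (i in seq_len(nrow(coords))) {
  zips[i] <- tryCatch(
    fake_geocode(coords[i, ]),
    error = function(e) NA_character_  # leave NA and carry on
  )
}

message("Converted ", sum(!is.na(zips)), " of ", length(zips), " coordinate pairs.")
```

With the real function, a dropped connection on one pair would then leave a single NA instead of losing every zip code already fetched.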
The following section creates a unique value for each combination of longitude and latitude, namely a character string holding the two measures separated by a comma. This key is then used to search the original GA dataset, identify each latitude/longitude case, and assign the corresponding zip code found by the reverse geocoding function. (Note: I’m sure there is a much more elegant way to do this, but at this stage it’s not completely necessary to figure it out).
for(i in 1:nrow(long_lat)){
  long_lat$comb[i] <- paste(long_lat$longitude[i], long_lat$latitude[i], sep = ", ")
}
for(i in 1:nrow(ga.data)){
  ga.data$comb[i] <- paste(ga.data$longitude[i], ga.data$latitude[i], sep = ", ")
}
for(i in 1:nrow(ga.data)){
  # match() compares whole strings; grep() would also hit keys that merely contain this one
  idx <- match(ga.data$comb[i], long_lat$comb)
  if(!is.na(idx)) {
    ga.data$postal_code[i] <- long_lat$postal_code[idx]
  }
}
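As the note above suggests, the lookup can be written more compactly. One option, sketched here on invented data, is base R’s `merge()`, which joins the zip codes back on the two coordinate columns directly and so avoids building a string key at all. The `visits` and `lookup` objects are made up and stand in for `ga.data` and `long_lat`.

```r
# Invented stand-ins for ga.data (visits) and long_lat (lookup).
visits <- data.frame(latitude  = c(47.6062, 47.6588, 47.6062),
                     longitude = c(-122.3321, -117.4260, -122.3321),
                     sessions  = c(12, 5, 7))
lookup <- data.frame(latitude    = c(47.6062, 47.6588),
                     longitude   = c(-122.3321, -117.4260),
                     postal_code = c("98101", "99201"),
                     stringsAsFactors = FALSE)

# all.x = TRUE keeps visits that have no matching zip code (postal_code is NA).
joined <- merge(visits, lookup, by = c("latitude", "longitude"), all.x = TRUE)

print(joined)
```

Note that `merge()` returns the rows sorted by the join columns, not in their original order, which does not matter here because the data is aggregated by zip code afterwards.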
The following plots the data. There are plenty of ways to plot a choropleth map in R, but this example was helpful for the specific way I chose to plot this.
library(choroplethrZip)
library(ggplot2)
library(RColorBrewer)
ga.data$value <- ga.data$sessions
ga.data$region <- ga.data$postal_code
choro_map <- aggregate(ga.data$value ~ ga.data$region, FUN = sum)
names(choro_map) <- c("region", "value")
choro = ZipChoropleth$new(choro_map)
choro$title = "Sessions Jan 2008 - June 2008"
choro$ggplot_scale = scale_fill_brewer(name="Sessions", palette=15, drop=FALSE)
choro$set_zoom_zip(state_zoom="washington", county_zoom=NULL, msa_zoom=NULL, zip_zoom=NULL)
choro$render()
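The aggregation step is what turns row-level GA data into the shape the map is built from: one row per zip code, in columns named `region` and `value`. Here is a small sketch with made-up numbers. Passing `data =` to `aggregate()` is an alternative spelling that yields those column names directly, without the renaming step.

```r
# Invented row-level data: one row per GA record, several rows per zip code.
sessions_by_row <- data.frame(
  region = c("98101", "98101", "99201", "98052"),
  value  = c(12, 7, 5, 3),
  stringsAsFactors = FALSE
)

# Sum the sessions within each zip code.
zip_totals <- aggregate(value ~ region, data = sessions_by_row, FUN = sum)

print(zip_totals)
```

Keeping the zip codes as character strings instead of numbers also preserves any leading zeros.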
Though this particular use of the Google Analytics API is not especially interesting (notice that the website I connected to only had a handful of zip codes with any data at all!), it does illustrate how the Google Analytics API and the Google Maps Geocoding API can be leveraged to provide more granularity to the Google Analytics geographical data. In my experience, once you’ve read your data into R, the analytical possibilities are basically endless. It would be quite easy to create interesting segments of the data using demographic dimensions, or to use this mapping data to give better context to a multivariate test on specific campaigns.
Another benefit of using this particular system instead of simply accessing the data is that it is very easy to reproduce and share. The script can easily be turned into a PDF, or published some other way - or even used to create a Shiny app that can be shared with team members who do not have the expertise to navigate the Google Analytics GUI.
This script was simply meant to be a proof of concept. I’ve seen samples of people doing each of the three parts - accessing the GA API in R, reverse geocoding in R, and creating choropleth maps in R - but I haven’t yet seen someone combine them all. As such, there are clearly some holes to be addressed in a more formal analysis.