Goals of this script

I recently read an article in the New York Times outlining how state lines are becoming increasingly irrelevant to economic and social policy. The article features a beautiful choropleth map of the United States broken down into larger, socially grouped regions and their respective urban corridors. As I read, it made me wonder whether a similar sort of mapping could be used to visualize Google Analytics data (I’m currently training to become a data scientist, but have worked in web analytics for the past few years). So, I set out to see whether it’s possible to create a choropleth map in R based on Google Analytics data.

I quickly found that all of the tools are there, but I couldn’t find an example of someone pulling them all together in one place. So I tried it myself. This is only a proof of concept that GA data can be mapped onto a choropleth in R. It’s far from perfect, but a good starting point for more interesting analysis later on.

Create a GA Project on the Google Console

The first step in analyzing your Google Analytics data is creating a new project in the Google Console.

  • Navigate to console.developers.google.com/project
  • Create a project
  • Follow the instructions to create the project. Enable the Analytics API.
  • Navigate to the Credentials tab and get your client ID and client secret. Save those as the variables used below (a quick sketch follows this list).
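
For reference, a minimal sketch of what those variables might look like (the values are placeholders, not real credentials):

# Placeholder credentials copied from the Credentials tab of the Google Console
client.id <- "XXXXXXXX.apps.googleusercontent.com"
client.secret <- "XXXXXXXXXXXXXXXX"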

Another important step is to give your API service account access to your Google Analytics data. To do this:

  • Navigate to the Service Accounts section of the IAM & Admin menu of the Google Console.
  • Copy your Compute Engine default service account email.
  • Navigate to your analytics.google.com account and add the default service account email to whichever Google Analytics view you’d like to map (Admin > View > User Management)
  • While in Google Analytics, go back to View Settings and copy the View ID. Save this as a variable to be used below.
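
As with the credentials above, here is a quick sketch of how the View ID might be stored; RGoogleAnalytics expects the table ID in the form "ga:" followed by the View ID (the number here is a placeholder):

# Placeholder View ID copied from Admin > View > View Settings
ga.table <- "ga:12345678"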

Accessing Google Analytics in R

For the following, the dimensions and metrics can be changed, but in order to map the data, latitude and longitude must be included. To see how to add or remove dimensions and metrics, check out the Google Analytics Dimensions & Metrics Explorer.

require(RGoogleAnalytics)

# Authenticate with the client ID/secret from the Google Console and cache the token
token <- Auth(client.id, client.secret)
save(token, file = "./token_file")
ValidateToken(token)

# Build the query; latitude and longitude must be included in order to map the data
query.list <- Init(start.date = "2008-01-01",
                   end.date = "2008-06-30",
                   dimensions = "ga:date,ga:pagePath,ga:hour,ga:medium,ga:city,ga:latitude,ga:longitude",
                   metrics = "ga:sessions,ga:pageviews",
                   max.results = 10000,
                   sort = "-ga:date",
                   table.id = ga.table) # this is your GA View ID, e.g. "ga:12345678"
ga.query <- QueryBuilder(query.list)
ga.data <- GetReportData(ga.query, token, split_daywise = T)


# Clean the data: convert latitude/longitude to numeric and drop rows where
# GA reports a 0 latitude (locations it could not resolve)
ga.data$longitude <- as.numeric(ga.data$longitude)
ga.data$latitude <- as.numeric(ga.data$latitude)
ga.data <- ga.data[ga.data$latitude != 0,]

Reverse Geocoding GA Geographic Data to Zip Code

Reverse geocoding the latitude and longitude from the GA data is somewhat more involved than downloading and processing the GA data itself. It requires enabling another Google API, the Google Maps Geocoding API. To do this,

  • Go to your Google Console and enable the Google Maps Geocoding API.
  • Copy down your key (if this is part of a new Google API project; otherwise, reuse the key from your existing project). Save it as a variable (see the sketch after this list).
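
The reverse geocoding function below expects that key in a variable named client.key; a placeholder sketch:

# Placeholder API key for the Google Maps Geocoding API
client.key <- "XXXXXXXXXXXXXXXX"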

It is important to note that free Google Maps Geocoding API projects have a limit of 2,500 requests per day. This is why we only take the unique latitude/longitude pairs, to limit the number of requests per run of the script. The script works by connecting to the API and retrieving a JSON document based on the parameters in the API URL. These can be changed, but must include the latitude/longitude values and your API key. To see the options for formatting this URL, see the Google Maps API documentation

# Only geocode the unique latitude/longitude pairs (to stay within the daily quota)
long_lat <- unique(ga.data[,c('latitude', 'longitude')])

library("RJSONIO") # for fromJSON()

reverseGeoCode <- function(latlng) {
    # Collapse the lat/lng pair into a single URL parameter, e.g. "47.61,-122.33"
    latlngStr <- gsub(' ', '%20', paste(latlng, collapse = ","))
    # Open a connection to the Geocoding API, restricted to postal_code results
    connectStr <- paste('https://maps.google.com/maps/api/geocode/json?result_type=postal_code&latlng=',
                        latlngStr, "&key=", client.key, sep = "")
    con <- url(connectStr)
    data.json <- fromJSON(paste(readLines(con), collapse = ""))
    close(con)
    # Pull the postal code out of the returned JSON (NA if the lookup failed)
    if (data.json$status == "OK") {
        pc <- data.json$results[[1]]$address_components[[1]]$long_name
    } else {
        pc <- NA
    }
    return(pc)
}

(original function source)

The following loop calls the API for each unique latitude/longitude pair in your selected range, then appends the returned ZIP code to each pair.

p <- c()
for(i in 1:nrow(long_lat)){
    p[i] <- reverseGeoCode(long_lat[i,])
}
paste("You converted", length(p), "geocodes to zipcodes.")

long_lat$postal_code <- p
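
Since reverseGeoCode() returns NA when the API can’t resolve a location, it’s worth checking how many lookups failed and, given the daily quota, caching the geocoded pairs so they aren’t re-requested on the next run. A quick sketch (the file name is arbitrary):

# How many latitude/longitude pairs failed to resolve to a postal code?
sum(is.na(long_lat$postal_code))

# Cache the geocoded pairs so re-running the script doesn't spend quota again
saveRDS(long_lat, file = "./long_lat_geocoded.rds")
# long_lat <- readRDS("./long_lat_geocoded.rds")  # reload on a later run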

Assigning Zip Code to Long/Lat Coordinates

The following section creates a unique key for each latitude/longitude pair, namely a character string of the two values separated by a comma. This key is then used to look up each latitude/longitude pair in the original GA dataset and assign the corresponding ZIP code found by the geocoding function. (Note: I’m sure there is a much more elegant way to do this; one alternative is sketched after the code below, but at this stage it’s not strictly necessary.)

# Build a "longitude, latitude" key on both data frames
for(i in 1:nrow(long_lat)){
    long_lat$comb[i] <- paste(long_lat$longitude[i], long_lat$latitude[i], sep = ", ")
}
for(i in 1:nrow(ga.data)){
    ga.data$comb[i] <- paste(ga.data$longitude[i], ga.data$latitude[i], sep = ", ")
}

# Look up each row's key in long_lat and copy over the matching postal code
for(i in 1:nrow(ga.data)){
    if(ga.data$comb[i] %in% long_lat$comb) {
        ga.data$postal_code[i] <- long_lat$postal_code[long_lat$comb == ga.data$comb[i]]
    }
}
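
As noted above, the lookup loop could also be replaced by a single join. A sketch of one cleaner alternative using merge(), which matches the comb keys exactly:

# Alternative to the loop above: join the geocoded postal codes onto ga.data by the comb key
ga.data <- merge(ga.data, long_lat[, c("comb", "postal_code")], by = "comb", all.x = TRUE)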

Plotting GA Sessions on a Choropleth Map of Washington State

The following plots the data. There are plenty of ways to plot a choropleth map in R, but this example was helpful for the particular approach I chose here.

library(choroplethrZip)
library(ggplot2)
library(RColorBrewer)
# choroplethr expects a data frame with "region" (ZIP code) and "value" columns
ga.data$value <- ga.data$sessions
ga.data$region <- ga.data$postal_code
choro_map <- aggregate(value ~ region, data = ga.data, FUN = sum)
choro = ZipChoropleth$new(choro_map)
choro$title = "Sessions Jan 2008 - June 2008"
choro$ggplot_scale = scale_fill_brewer(name="Sessions", palette=15, drop=FALSE)
choro$set_zoom_zip(state_zoom="washington", county_zoom=NULL, msa_zoom=NULL, zip_zoom=NULL)
choro$render()
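
choro$render() should return a ggplot object, so the map can also be written to disk with ggsave() for sharing (the file name and dimensions here are arbitrary):

# Save the rendered map to a file
wa_map <- choro$render()
ggsave("wa_sessions_choropleth.png", plot = wa_map, width = 8, height = 6)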

Possible Applications

Though this particular use of the Google Analytics API is not especially interesting (notice that the website I connected to only had a handful of ZIP codes with any data at all!), it does illustrate how the Google Analytics API and the Google Maps Geocoding API can be combined to add granularity to Google Analytics geographic data. In my experience, once you’ve read your data into R, the analytical possibilities are basically endless. It would be quite easy to create interesting segments of the data using demographic dimensions, or to use this mapping to give better context to a multivariate test on specific campaigns.

Another benefit of using this approach instead of simply pulling the data by hand is that it is very easy to reproduce and share. The analysis can be turned into a PDF or published in some other way, or even used to build a Shiny app that can be shared with team members who do not have the expertise to navigate the Google Analytics GUI.

Some things to improve on…

This script was simply meant to be a proof of concept. I’ve seen examples of people doing each of the three parts - accessing the GA API in R, reverse geocoding in R, and creating choropleth maps in R - but I haven’t yet seen someone combine them all. As such, there are clearly some holes to be addressed in a more formal analysis.

  1. There are regions of the map that simply didn’t plot correctly. I’m not sure whether this is because the ZCTA numbers don’t map to the postal codes given by the Google Maps API, or whether the shape file is outdated.
  2. The process requires a lot of setup in the Google Console, and a lot of copying and pasting to get the proper keys. It is possible to get those keys from a JSON file generated by the console, so automating that part of the process would probably be helpful (a rough sketch appears after this list). Again, this script is simply meant to offer a proof of concept that one can create a choropleth map from Google Analytics data; it was never my intention to make this a fully automated script.
  3. The science isn’t particularly interesting. For me, this is helpful as a preliminary look at the data, but the insights it will provide are purely surface level. There are certainly much more interesting things that can be done with this data, but those will depend on the specific goals of your analysis.
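
For point 2, here is a rough sketch of what reading the credentials from a downloaded client-secret JSON could look like, assuming the usual Google Console layout with an "installed" (or "web") block; the file name is a placeholder:

library(RJSONIO)
# Read the client-secret JSON downloaded from the Credentials tab (placeholder path)
secrets <- fromJSON("./client_secret.json")
client.id <- secrets$installed[["client_id"]]
client.secret <- secrets$installed[["client_secret"]]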

