R in a digital analytics workflow

Mark Edmondson (@HoloMarkeD)
20th September, 2014 for #MeasureCamp

Demo linking GA and CRM data

Wunderman

What we'll do today

An overview of how to use R in your workflow:

  1. Importing Data
    • CRM data sitting on MS SQL Server
    • Web data via Google Analytics API
  2. Transform; Analyse; Visualise
  3. Presentation

Where time is spent in practice

[Image: where time is spent in practice]

Intro - why use R?

Why not just do it in Excel?

  • Reproducibility: write a function for one client, repeat it for another
  • Flexibility: library() gives access to open-source functions for almost any need
  • Scalability: copes easily beyond Excel's limits, and the same code can scale up in the cloud
  • Interoperability: run (almost) the same code on Windows, Linux and Mac

Scoping

Finding the question to answer

  • What is the business case?
  • What data is available?
  • Do we need to enhance/capture more data?
  • What keys can the data be linked on?

Scoping - Data Available

Demo with a major Scandinavian retailer

  • 1 million+ registered customers
  • 200+ retail shops
  • Online activation for offline customers (key point)
  • Online shop running Google Analytics
  • CRM data sitting on Microsoft SQL Server
  • Every Purchase Receipt for last 2 years
  • Store Location, Customer Loyalty ID, Product Category, Household size, etc.

Scoping - Finding the Question

Data Keys Common to GA and CRM

  • userID
  • product SKU
  • date
  • location

Lots of data analysis possible - what to do?

Brainstorm:

  • Compare online location to customer's regular store
  • Remarketing based on offline purchasing (link userID)
  • Lifetime revenue potential
  • Online Product suggestions based on Offline history
  • etc.

Demo: Can offline product trends predict online trends?

1. Importing Data

GA and CRM data combination

Ask the business question, then see what data you need.

Demo data set had:

[Diagram: the combined GA and CRM data set]

Repeat the process for questions that need the userID link

Import Data (Quick Version)

As we only have 20 minutes, the slides covering the data connections are in the Appendix.

After importing, we have:

  • GA: 363,187 rows (date, productSku, itemRevenue)
  • CRM: 35,250 rows (product_cat, productID/SKU)
  • CRM: 26,452 rows (revenue, category, date)

Big Data?

We now have a standard data.frame in R which can be used with all R functions.

  • The CRM query imported a 10% sample: 4.4 million rows and 24 variables
  • Saves to 120MB on disc, 820MB object size in RAM
  • Runs fine on a MacBook Air with 4GB RAM

R is RAM-based, so memory is the first place to scale.

Run the same code on a GCE instance with 54GB RAM for bigger data sets.

Bigger than that, you need on-disk R libraries such as dplyr() (with a database back end), ff(), RHadoop() etc.
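A quick way to sanity-check whether a data set fits comfortably in memory is to inspect its in-RAM size directly. A minimal sketch, assuming the imported CRM sample is called crmData (a placeholder name):

## size of the imported CRM sample in RAM (crmData is a placeholder name)
print(object.size(crmData), units = "MB")

## rows and variables
dim(crmData)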

2. Transform; Visualise; Analyse

The cycle of data analysis - how to get “flow”

[Diagram: the cycle of data analysis - transform, visualise, analyse]

Explore the data - Transform

Where R shines: manipulating data

## merge GA SKUs to find category

gaCat <- merge(lookup, GAdata, by.x = "ProductID", by.y = "productSku" , all.y = TRUE)

## aggregate on date / category
library(reshape2)

gaCatMelt <- melt(gaCat, measure.vars="itemRevenue")

gaCatCast <- dcast(gaCatMelt, date + CategoryName ~ variable, sum)
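The analysis slides below work with two aggregated data frames, GAdataAgg (online revenue) and DailyCatRev (offline revenue). A minimal sketch of how they might be derived, assuming the CRM purchases sit in a data frame called crmData with Revenue and CategoryName columns (those names are assumptions):

## the online aggregate is essentially the cast table above
GAdataAgg <- gaCatCast

## build the offline equivalent from the CRM purchases
## (crmData, Revenue and CategoryName are assumed names)
crmMelt     <- melt(crmData, measure.vars = "Revenue")
DailyCatRev <- dcast(crmMelt, date + CategoryName ~ variable, sum)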

Explore the data - Visualise

Now we can start plotting and doing exploratory analysis. Oh no, outliers!

plot(DailyCatRev[,"Revenue"], type="s", main="Offline Revenue", yaxt='n', ann=FALSE)

[Plot: Offline Revenue, with obvious outliers]

Explore the data - Analyse

These look like dirty data entries, which we can safely remove

## find outliers
DailyCatRev[DailyCatRev$Revenue > 10000000, c("date","Revenue")]

DailyCatRev[DailyCatRev$Revenue < -10000000, c("date","Revenue")]
        date    Revenue
  2013-08-27  999999999
  2013-08-28 -999999999

Fixing Dirty Data

## remove outliers
DailyCatRev <- DailyCatRev[DailyCatRev$Revenue < 10000000,]

DailyCatRev <- DailyCatRev[DailyCatRev$Revenue > -10000000,]

Fixing Dirty Data Plot

plot(DailyCatRev[,"Revenue"], type="s", main="Offline Revenue", yaxt='n', ann=FALSE)

[Plot: Offline Revenue after removing the outliers]

Statistics

Actual analysis! Cross-correlation function on offline vs online revenue. No relationship found.

ccf(GAdataAgg[,"itemRevenue"], DailyCatRev[,"Revenue"])

[Plot: CCF of online vs offline revenue]

Statistics on All Categories

Write a function to check the cross-correlation for all 600 product categories

The 'R way' is to avoid explicit loops and use vectorised functions

## cross-correlation of online vs offline revenue for one product category
crossCorRevenue <- function(product){
  ccf(GAdataAgg[GAdataAgg$CategoryName == product, "itemRevenue"],
      DailyCatRev[DailyCatRev$CategoryName == product, "Revenue"],
      main = paste("CCF for", product))
}

## apply to all unique Category names
sapply(unique(GAdataAgg$CategoryName), crossCorRevenue)
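With 600 categories this produces 600 plots; a hedged variation could instead return the strongest correlation and its lag for each category, making it easier to spot candidates like Category X (object and column names as above are assumptions):

## strongest cross-correlation and its lag for one category, without plotting
maxCrossCor <- function(product){
  res <- ccf(GAdataAgg[GAdataAgg$CategoryName == product, "itemRevenue"],
             DailyCatRev[DailyCatRev$CategoryName == product, "Revenue"],
             plot = FALSE)
  best <- which.max(abs(res$acf))
  c(lag = res$lag[best], acf = res$acf[best])
}

## one row per category: the strongest lag and its correlation
t(sapply(unique(GAdataAgg$CategoryName), maxCrossCor))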

Statistics on Category X

Cross correlation function on offline vs online revenue.

[Plot: CCF for Category X]

Evidence of a ~20-day correlation lag

Statistics on Category X

Examining offline/online revenue for that category shows where online peaks have preceded offline ones.

[Plot: offline vs online revenue for Category X]

Cross correlation next steps

Not conclusive, but there is some evidence of lags between online and offline shopping behaviour.

Next steps:

  • Deal with the Christmas peak affecting results
  • Repeat per product rather than per product category (now easy with a vectorised function to repeat over 100,000+ products)
  • Get a bigger sample size - the techniques were trained on a 10% sample

The main point: R makes it quick to iterate over the transform, visualise, analyse loop

Another example - Weekday comparison

Days of the week comparison

[Plot: revenue by day of the week]
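A minimal sketch of how such a weekday comparison might be built from the offline data, assuming DailyCatRev$date can be parsed as a Date (object and column names as above are assumptions):

## average offline revenue by day of the week
DailyCatRev$weekday <- weekdays(as.Date(DailyCatRev$date))
weekdayRev <- aggregate(Revenue ~ weekday, data = DailyCatRev, FUN = mean)

barplot(weekdayRev$Revenue, names.arg = weekdayRev$weekday,
        main = "Average Offline Revenue by Weekday")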

3. Presentation

Present your findings

This presentation is made in R using RStudio's RMarkdown

  • Turns markdown and R code into HTML5
  • Customise via CSS

Other alternatives include:

  • PDF exports of ggplot() graphs
  • Interactive web dashboards using Shiny
  • Export processed data and visualise in Tableau, Qlikview etc.

Previous slide's markup


This presentation is made in R using RStudio's RMarkdown

* Turns markdown and R code into HTML5
* Customise via CSS

Other alternatives include:
* PDF exports of ggplot() graphs
* Interactive web dashboards using Shiny
* Export processed data and visualise in Tableau, Qlikview etc.

Hotkeys: hit 'o' to navigate, 'f' for fullscreen (when not on RPubs)
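For the "customise via CSS" point, a hedged sketch of rendering an RMarkdown file to HTML5 slides with a custom stylesheet (file names are placeholders):

library(rmarkdown)

## render the .Rmd to HTML5 slides, pointing at a custom stylesheet
render("slides.Rmd", ioslides_presentation(css = "custom.css"))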

Shiny

Interactive web dashboard - my example app forecasting GA data

[Screenshot: Shiny app forecasting GA data]
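Not the forecasting app itself, but a minimal sketch of what an interactive Shiny dashboard over the demo data could look like (object and column names are assumptions):

library(shiny)

ui <- fluidPage(
  selectInput("category", "Product category",
              choices = unique(DailyCatRev$CategoryName)),
  plotOutput("revPlot")
)

server <- function(input, output){
  output$revPlot <- renderPlot({
    plot(DailyCatRev[DailyCatRev$CategoryName == input$category, "Revenue"],
         type = "s", main = paste("Offline Revenue -", input$category))
  })
}

shinyApp(ui, server)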

Conclusions

Summary

A quick tour of how R helps at every step of data analysis:

  1. Importing data - rga(), RODBC()
  2. Cleaning/transforming data - merge(), reshape2(), dplyr()
  3. Analysis and statistics - ccf(), glm(), forecast()
  4. Visualisation - ggplot2(), ggvis(), shiny()
  5. Presentation - knitr()

We didn't cover

  1. Forecasting / Prediction
  2. Neural Nets / Machine Learning
  3. Text Analysis
  4. Connection with other Web APIs (e.g. twitteR() )
  5. Report Automation (via RScript)
  6. Creation of data feeds
  7. R in the cloud
  8. (New Sep 2014!) Google's CausalImpact package

Resources

Appendix

1. Import CRM data from Microsoft SQL Server

This is (sadly) a lot easier when running R on Windows, as you need to set up Active Directory

2. Import CRM data from Microsoft SQL Server

Now load up R/RStudio and connect using the credentials you've just created

  • Connect to the Microsoft SQL Server and take a look at what tables you have

library("RODBC")

channel <- odbcConnect("nameOfUDS", uid = "username", pwd = "password")

Tables <- sqlTables(channel)

3. Import CRM data from Microsoft SQL Server

Your connection is saved to object channel - use SQL queries to get the data you want

  • Pull in the top 1,000 rows

exampleSQL <- sqlQuery(channel, "SELECT TOP 1000 * FROM dbo.purchase")
  • More complicated example - pull in a 10% sample from the Purchase table and join it to the purchase-line table

purchasesSample <- sqlQuery(channel,
  "SELECT * FROM dbo.purchase TABLESAMPLE (10 PERCENT)
   LEFT OUTER JOIN dbo.purchaseline purline
   ON dbo.purchase.ReceiptID = purline.ReceiptID")

Import CRM data from other systems

We now have a standard data.frame in R which can be used with all R functions.

The workflow is similar for any other type of database with an ODBC driver (Open Database Connectivity).

Other data sources may have their own library e.g. RMySQL, rmongodb, bigrquery, redshift-r, etc.
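As an illustration, a hedged sketch of the same pattern with RMySQL (host, credentials and table names are placeholders):

library(RMySQL)

## open a connection to a MySQL-based CRM (placeholder details)
con <- dbConnect(MySQL(), host = "db.example.com", dbname = "crm",
                 user = "username", password = "password")

purchases <- dbGetQuery(con, "SELECT * FROM purchase LIMIT 1000")

dbDisconnect(con)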

1. Import Google Analytics data

We will now use another library to connect to Google Analytics - skardhamar/rga()

We can merge on Product SKU code, so we pull ga:date, ga:productSku and ga:itemRevenue

## install rga following the instructions on github, then....
library(rga)

## authenticate - this creates the ga object used below
rga.open(instance = "ga")

## pull all the GA profiles - your profile id is in this table
profiles <- ga$getProfiles()

GAdata <- ga$getData(ids = "YOUR_PROFILE_ID",
                     start.date = "2013-01-01", end.date = "2014-06-01",
                     metrics = "ga:itemRevenue",
                     dimensions = "ga:date,ga:productSku", batch = TRUE)

2. Import Google Analytics data

  • The batch flag allows you to pull more than 10,000 rows
  • The walk flag gets around sampled data by pulling day by day and aggregating

Go have a cup of coffee.
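For reference, a hedged sketch of the same query with walk enabled (same placeholder profile id and dates as above):

## walk = TRUE queries day-by-day to avoid sampling - slower, but unsampled
GAdataUnsampled <- ga$getData(ids = "YOUR_PROFILE_ID",
                              start.date = "2013-01-01", end.date = "2014-06-01",
                              metrics = "ga:itemRevenue",
                              dimensions = "ga:date,ga:productSku", walk = TRUE)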