Mark Edmondson (@HoloMarkeD)
20th September, 2014 for #MeasureCamp
Demo linking GA and CRM data
An overview of how to use R in your workflow:
Why not just do it in Excel?
Finding the question to answer
Demo with a major Scandinavian retailer
Data Keys Common to GA and CRM
Lots of data analysis possible - what to do?
Brainstorm:
Demo: Can offline product trends predict online trends?
Ask the business question, then see what data you need.
Demo data set had:
Repeat the process for questions that need a userID link
As we only have 20 mins, slides covering the data connection are in the Appendix:
After importing, we have:
We now have a standard data.frame in R which can be used with all R functions.
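For example, a quick first look at the imported data (a minimal sketch, assuming the imported object is called GAdata as in the Appendix):

## inspect the imported data - structure, first rows, quick summary
str(GAdata)
head(GAdata)
summary(GAdata$itemRevenue)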
R is RAM-based, so the first place to scale is there.
Run the same code on a GCE instance with 54GB of RAM for bigger data sets.
Bigger than that, you need on-disk R libraries such as dplyr (with a database backend), ff, RHadoop, etc.
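A quick way to check whether you are about to hit the RAM limit (a sketch using base R, assuming the GAdata object from above):

## how much memory does the data frame occupy?
print(object.size(GAdata), units = "MB")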
The cycle of data analysis - how to get “flow”
Where R shines: manipulating data
## merge GA SKUs to find category
gaCat <- merge(lookup, GAdata, by.x = "ProductID", by.y = "productSku" , all.y = TRUE)
## aggregate on date / category
library(reshape2)
gaCatMelt <- melt(gaCat, measure.vars="itemRevenue")
gaCatCast <- dcast(gaCatMelt, date + CategoryName ~ variable, sum)
Now we can start plotting and doing exploratory analysis. Oh no, outliers!
## plot daily offline revenue (y-axis values hidden)
plot(DailyCatRev[,"Revenue"], type = "s", main = "Offline Revenue", yaxt = "n", xlab = "", ylab = "")
Looks like dirty data, can safely remove
## find outliers
DailyCatRev[DailyCatRev$Revenue > 10000000, c("date","Revenue")]
DailyCatRev[DailyCatRev$Revenue < -10000000, c("date","Revenue")]
| date | Revenue |
|---|---|
| 2013-08-27 | 999999999 |
| 2013-08-28 | -999999999 |
## remove outliers
DailyCatRev <- DailyCatRev[DailyCatRev$Revenue < 10000000,]
DailyCatRev <- DailyCatRev[DailyCatRev$Revenue > -10000000,]
## re-plot after removing the outliers
plot(DailyCatRev[,"Revenue"], type = "s", main = "Offline Revenue", yaxt = "n", xlab = "", ylab = "")
Actual analysis! Cross-correlation function on offline vs online revenue. No relationship found.
ccf(GAdataAgg[,"itemRevenue"], DailyCatRev[,"Revenue"])
Write a function to check the cross-correlation across all 600 product categories.
The 'R way' is to avoid explicit loops and use vectorised functions.
## cross correlation function
crossCorRevenue <- function(product){
  ccf(GAdataAgg[GAdataAgg$CategoryName == product, "itemRevenue"],
      DailyCatRev[DailyCatRev$CategoryName == product, "Revenue"],
      main = paste("CCF for", product))
}
## apply to all unique Category names
sapply(unique(GAdataAgg$CategoryName), crossCorRevenue)
Cross correlation function on offline vs online revenue.
Evidence of ~20 day correlation lag
Examining offline/online revenue for that category shows where online peaks have preceded offline ones.
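For illustration, the strongest lag can be read off the ccf object rather than eyeballed from the plot (a sketch using the same objects as above):

## find the lag with the largest absolute cross-correlation
cc <- ccf(GAdataAgg[, "itemRevenue"], DailyCatRev[, "Revenue"], plot = FALSE)
cc$lag[which.max(abs(cc$acf))]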
Not conclusive, but some evidence of lags between online and offline shopping behaviour.

Next steps:
The main point: R is quick to iterate over the transform, visualise, analyse loop.
Days of the week comparison
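A minimal sketch of that comparison, assuming the same DailyCatRev data frame:

## average revenue by day of the week
DailyCatRev$weekday <- weekdays(as.Date(DailyCatRev$date))
aggregate(Revenue ~ weekday, data = DailyCatRev, FUN = mean)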
This presentation is made in R using RStudio's RMarkdown
* Turns markdown and R code into HTML5
* Customise via CSS
Other alternatives include:
* PDF exports of ggplot() graphs
* Interactive web dashboards using Shiny
* Export processed data and visualise in Tableau, Qlikview etc.
Hotkeys: hit 'o' to navigate, 'f' for fullscreen (when not on RPubs)
Interactive web dashboard - my example app forecasting GA data
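Not the app itself, but a minimal single-file Shiny sketch of the idea, using the forecast package and a stand-in time series:

library(shiny)
library(forecast)

shinyApp(
  ui = fluidPage(
    sliderInput("h", "Periods to forecast", min = 6, max = 36, value = 12),
    plotOutput("fcPlot")
  ),
  server = function(input, output) {
    output$fcPlot <- renderPlot({
      fit <- auto.arima(AirPassengers)  ## stand-in for the GA time series
      plot(forecast(fit, h = input$h))
    })
  }
)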
A quick tour of how R helps in all steps of data analysis:
This is (sadly) a lot easier when running R on Windows, as you need to set up Active Directory
Now load up R/RStudio and connect using the credentials you've just created
library("RODBC")
channel <- odbcConnect("nameOfUDS", uid="username", pwd='password')
Tables <- sqlTables(channel)
Your connection is saved to object channel - use SQL queries to get the data you want
exampleSQL <- sqlQuery(channel, "SELECT TOP 1000 * FROM dbo.purchase")
## sample 10% of purchases, joined to their purchase lines
purchasesSample <- sqlQuery(channel,
  "SELECT * FROM dbo.purchase TABLESAMPLE (10 PERCENT)
   LEFT OUTER JOIN dbo.purchaseline purline
   ON dbo.purchase.ReceiptID = purline.ReceiptID")
We now have a standard data.frame in R which can be used with all R functions.
Similar workflow if using another type of database with ODBC (Open Database Connectivity)
Other data sources may have their own library e.g. RMySQL, rmongodb, bigrquery, redshift-r, etc.
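For example, a minimal RMySQL sketch (host, credentials and table name are placeholders):

library(RMySQL)
## connect to a hypothetical CRM database
con <- dbConnect(MySQL(), host = "localhost", user = "username",
                 password = "password", dbname = "crm")
purchases <- dbGetQuery(con, "SELECT * FROM purchase LIMIT 1000")
dbDisconnect(con)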
We will now use another library to connect to Google Analytics - skardhamar/rga
We can merge on Product SKU code, so we pull ga:date, ga:productSku and ga:itemRevenue
## install rga following the instructions on GitHub, then....
## authenticate - this creates the ga object used below
rga.open(instance = "ga")
## pull all the GA profiles
## your profile id is in this table
profiles <- ga$getProfiles()
GAdata <- ga$getData(ids = "YOUR_PROFILE_ID",
                     start.date = "2013-01-01",
                     end.date = "2014-06-01",
                     metrics = "ga:itemRevenue",
                     dimensions = "ga:date,ga:productSku",
                     batch = TRUE)
* batch flag allows you to pull data beyond 10,000 rows
* walk flag gets around sampled data by pulling daily and aggregating

Go have a cup of coffee.
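For example, the same pull using walk (a sketch; expect it to take a while, hence the coffee):

## walk = TRUE pulls day by day to avoid sampling, then aggregates
GAdataUnsampled <- ga$getData(ids = "YOUR_PROFILE_ID",
                              start.date = "2013-01-01",
                              end.date = "2014-06-01",
                              metrics = "ga:itemRevenue",
                              dimensions = "ga:date,ga:productSku",
                              walk = TRUE)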