Mark Edmondson (@HoloMarkeD)
20th September, 2014 for #MeasureCamp
Demo linking GA and CRM data
An overview of how to use R in your workflow:
Why not just do it in Excel?
Finding the question to answer
Demo with a major Scandinavian retailer
Data Keys Common to GA and CRM
Lots of data analysis possible - what to do?
Brainstorm:
Demo: Can offline product trends predict online trends?
Ask the business question, then see what data you need.
Demo data set had:
Repeat the process for questions that need a userID link
As we only have 20 mins, slides covering the data connection are in the Appendix:
After importing, we have:
We now have a standard data.frame in R which can be used with all R functions.
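For example, a quick first look at the imported data (a minimal sketch, assuming the imported object is called GAdata as in the Appendix):

## inspect the imported data - structure, first rows, quick summary
str(GAdata)
head(GAdata)
summary(GAdata$itemRevenue)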
R is RAM-based, so the first place to scale is there.
Run the same code on a GCE instance with 54GB of RAM for bigger data sets.
Bigger than that, you need on-disk R libraries such as dplyr (with a database backend), ff, RHadoop, etc.
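A quick way to check whether you are about to hit the RAM limit (a sketch using base R, assuming the GAdata object from above):

## how much memory does the data frame occupy?
print(object.size(GAdata), units = "MB")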
The cycle of data analysis - how to get “flow”
Where R shines: manipulating data
## merge GA SKUs to find category
gaCat <- merge(lookup, GAdata, by.x = "ProductID", by.y = "productSku" , all.y = TRUE)
## aggregate on date / category
library(reshape2)
gaCatMelt <- melt(gaCat, measure.vars="itemRevenue")
gaCatCast <- dcast(gaCatMelt, date + CategoryName ~ variable, sum)
Now we can start plotting and doing exploratory analysis. Oh no, outliers!
## plot daily offline revenue (y-axis values hidden)
plot(DailyCatRev[,"Revenue"], type = "s", main = "Offline Revenue", yaxt = "n", xlab = "", ylab = "")
Looks like dirty data, can safely remove
## find outliers
DailyCatRev[DailyCatRev$Revenue > 10000000, c("date","Revenue")]
DailyCatRev[DailyCatRev$Revenue < -10000000, c("date","Revenue")]
| date | Revenue |
|---|---|
| 2013-08-27 | 999999999 |
| 2013-08-28 | -999999999 |
## remove outliers
DailyCatRev <- DailyCatRev[DailyCatRev$Revenue < 10000000,]
DailyCatRev <- DailyCatRev[DailyCatRev$Revenue > -10000000,]
## re-plot after removing the outliers
plot(DailyCatRev[,"Revenue"], type = "s", main = "Offline Revenue", yaxt = "n", xlab = "", ylab = "")
Actual analysis! Cross-correlation function on offline vs online revenue. No relationship found.
ccf(GAdataAgg[,"itemRevenue"], DailyCatRev[,"Revenue"])
Write a function to check the cross-correlation across all 600 product categories.
The 'R way' is to avoid explicit loops and use vectorised functions.
## cross correlation function
crossCorRevenue <- function(product){
  ccf(GAdataAgg[GAdataAgg$CategoryName == product, "itemRevenue"],
      DailyCatRev[DailyCatRev$CategoryName == product, "Revenue"],
      main = paste("CCF for", product))
}
## apply to all unique Category names
sapply(unique(GAdataAgg$CategoryName), crossCorRevenue)
Cross correlation function on offline vs online revenue.
Evidence of ~20 day correlation lag
Examining offline/online revenue for that category shows where online peaks have preceded offline ones.
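For illustration, the strongest lag can be read off the ccf object rather than eyeballed from the plot (a sketch using the same objects as above):

## find the lag with the largest absolute cross-correlation
cc <- ccf(GAdataAgg[, "itemRevenue"], DailyCatRev[, "Revenue"], plot = FALSE)
cc$lag[which.max(abs(cc$acf))]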
Not conclusive, but some evidence of lags between online and offline shopping behaviour.

Next steps:
The main point: R is quick to iterate over the transform, visualise, analyse loop.
Days of the week comparison
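A minimal sketch of that comparison, assuming the same DailyCatRev data frame:

## average revenue by day of the week
DailyCatRev$weekday <- weekdays(as.Date(DailyCatRev$date))
aggregate(Revenue ~ weekday, data = DailyCatRev, FUN = mean)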
This presentation is made in R using RStudio's RMarkdown
* Turns markdown and R code into HTML5
* Customise via CSS
Other alternatives include:
* PDF exports of ggplot() graphs
* Interactive web dashboards using Shiny
* Export processed data and visualise in Tableau, Qlikview etc.
Hotkeys: hit 'o' to navigate, 'f' for fullscreen (when not on RPubs)
Interactive web dashboard - my example app forecasting GA data
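Not the app itself, but a minimal single-file Shiny sketch of the idea, using the forecast package and a stand-in time series:

library(shiny)
library(forecast)

shinyApp(
  ui = fluidPage(
    sliderInput("h", "Periods to forecast", min = 6, max = 36, value = 12),
    plotOutput("fcPlot")
  ),
  server = function(input, output) {
    output$fcPlot <- renderPlot({
      fit <- auto.arima(AirPassengers)  ## stand-in for the GA time series
      plot(forecast(fit, h = input$h))
    })
  }
)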
A quick tour of how R helps in all steps of data analysis:
This is (sadly) a lot easier when running R on Windows, as you need to set up Active Directory
Now load up R/RStudio and connect using the credentials you've just created
library("RODBC")
channel <- odbcConnect("nameOfUDS", uid="username", pwd='password')
Tables <- sqlTables(channel)
Your connection is saved to object channel - use SQL queries to get the data you want
exampleSQL <- sqlQuery(channel, "SELECT TOP 1000 * FROM dbo.purchase")
## sample 10% of purchases, joined to their purchase lines
purchasesSample <- sqlQuery(channel,
  "SELECT * FROM dbo.purchase TABLESAMPLE (10 PERCENT)
   LEFT OUTER JOIN dbo.purchaseline purline
   ON dbo.purchase.ReceiptID = purline.ReceiptID")
We now have a standard data.frame in R which can be used with all R functions.
Similar workflow if using another type of database with ODBC (Open Database Connectivity)
Other data sources may have their own library e.g. RMySQL, rmongodb, bigrquery, redshift-r, etc.
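For example, a minimal RMySQL sketch (host, credentials and table name are placeholders):

library(RMySQL)
## connect to a hypothetical CRM database
con <- dbConnect(MySQL(), host = "localhost", user = "username",
                 password = "password", dbname = "crm")
purchases <- dbGetQuery(con, "SELECT * FROM purchase LIMIT 1000")
dbDisconnect(con)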
We will now use another library to connect to Google Analytics - skardhamar/rga
We can merge on Product SKU code, so we pull ga:date, ga:productSku and ga:itemRevenue
## install rga following the instructions on GitHub, then....
## authenticate - this creates the ga object used below
rga.open(instance = "ga")
## pull all the GA profiles
## your profile id is in this table
profiles <- ga$getProfiles()
GAdata <- ga$getData(ids = "YOUR_PROFILE_ID",
                     start.date = "2013-01-01",
                     end.date = "2014-06-01",
                     metrics = "ga:itemRevenue",
                     dimensions = "ga:date,ga:productSku",
                     batch = TRUE)
* batch flag allows you to pull data beyond 10,000 rows
* walk flag gets around sampled data by pulling daily and aggregating

Go have a cup of coffee.
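For example, the same pull using walk (a sketch; expect it to take a while, hence the coffee):

## walk = TRUE pulls day by day to avoid sampling, then aggregates
GAdataUnsampled <- ga$getData(ids = "YOUR_PROFILE_ID",
                              start.date = "2013-01-01",
                              end.date = "2014-06-01",
                              metrics = "ga:itemRevenue",
                              dimensions = "ga:date,ga:productSku",
                              walk = TRUE)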