Mark Edmondson - Senior Digital Analyst, Wunderman
18th September 2015
An Introduction: Random Forests and K-Means Clustering
Machine learning gives programs the ability to learn without being explicitly programmed.
They build models from input data to create useful output, commonly predictive analytics.
We look at how this can be applied to web analytics data in particular.
Machine learning is commonly split between supervised and unsupervised learning.
Today we will look at a task of each type.
This recent visualisation from r2d3.net is an excellent introduction.
A decision tree on the training set can be 100% correct.
But it will overfit the data, so it won't perform as well on new test data.
Random forests run many decision trees on subsets of the data.
The model result is an aggregation of all the tree results.
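A minimal sketch of the idea (not from the talk: it uses R's built-in iris data with the rpart and randomForest packages), comparing a single tree to a bagged forest on held-out rows:

library(rpart)
library(randomForest)

set.seed(42)
train_ind <- sample(nrow(iris), 100)
iris_train <- iris[train_ind, ]
iris_test  <- iris[-train_ind, ]

## a single decision tree fit to the training set
tree <- rpart(Species ~ ., data = iris_train)

## a forest of 500 trees, each grown on a bootstrap sample
## of the rows, trying a random subset of columns per split
forest <- randomForest(Species ~ ., data = iris_train)

## compare accuracy on the held-out rows
mean(predict(tree, iris_test, type = "class") == iris_test$Species)
mean(predict(forest, iris_test) == iris_test$Species)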
You are in charge of a reward scheme website, where existing customers log in to spend their points.
You want users to spend as many points as they can, so the points have a high perceived value.
You capture a unique userId on login in custom dimension1, and use GA Enhanced Ecommerce to track which prizes users view and claim.
Can we predict what prizes a user will claim from their view history?
Application: Send email with predicted prize based on view history to users who don't claim.
Use your favourite GA -> R library to get prize views:
library(googleAnalyticsR_public)
gar_auth(new_user = TRUE)

id <- "XXXXX"

## 61607 results, 30049 unique Ids, 185 SKUs
product_views <-
  google_analytics(id,
                   '2015-08-01',
                   '2015-09-01',
                   'productDetailViews',
                   c('productSku', 'dimension1'),
                   samplingLevel = "WALK")
Get the prizes claimed.
We can't query transactionId and productDetailViews in the same call, so we make 2 API calls and link them on userId (dimension1):
## 8855 results, 6336 unique Ids, 169 SKUs
product_trans <-
  google_analytics(id,
                   '2015-08-01',
                   '2015-09-01',
                   'uniquePurchases',
                   c('transactionId', 'productSku', 'dimension1'))
| productSku | dimension1 | productDetailViews |
|---|---|---|
| SKU0023 | UID1234556 | 1 |
| SKU0025 | UID1234557 | 1 |
| SKU0065 | UID1234558 | 2 |
| SKU0066 | UID1234558 | 1 |
| … | … | … |
| transactionId | productSku | dimension1 | uniquePurchases |
|---|---|---|---|
| rfgHgt | SKU0023 | UID1234556 | 1 |
| rfGhkt | SKU0026 | UID1234557 | 1 |
| rtGhjdk | SKU0093 | UID1234558 | 1 |
| rfGhQW | SKU0134 | UID1234558 | 1 |
| … | … | … | … |
The random forest needs a matrix of predictor variables (the prize views) plus a column for the prize each user eventually claimed (the response), one row per user (record).
We need a data frame of this format:
| userId | SKU0001_views | SKU0002_views | … | SKU0185_views | Claimed |
|---|---|---|---|---|---|
| UID1234556 | 0 | 1 | … | 0 | SKU0002 |
| UID1234557 | 2 | 1 | … | 0 | NoSale |
| UID1234558 | 5 | 0 | … | 0 | SKU0001 |
| UID1234558 | 0 | 0 | … | 1 | SKU0185 |
| … | … | … | … | … | … |
Using reshape2 and dplyr to transform the data resulted in a 30,049 row x 187 column matrix:
library(reshape2)
library(dplyr)

## one row per user, one column of views per SKU
product_views <-
  recast(product_views,
         dimension1 ~ productSku + variable,
         fun.aggregate = sum)

model_data <- left_join(product_views,
                        product_trans,
                        by = "dimension1")

## the claimed SKU column from product_trans is assumed
## renamed to boughtSku; NAs are then no sale
model_data$boughtSku[is.na(model_data$boughtSku)] <- "NoSale"
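A quick sanity check on the response is worth doing here (a sketch): with 30,049 users but only 6,336 unique buyers, NoSale will dominate the classes:

## distribution of the response: NoSale dominates,
## since most of the 30,049 users claimed no prize
table(model_data$boughtSku)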
## 75% of the sample size
smp_size <- floor(0.75 * nrow(model_data))

## set seed to make it reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(model_data)),
                    size = smp_size)

train <- model_data[train_ind, ]
test  <- model_data[-train_ind, ]
## only want the product view columns; product_name_only
## is assumed to hold the SKU view column names, e.g.:
product_name_only <- grep("SKU", names(train), value = TRUE)
predictors <- train[ , product_name_only]
response <- as.factor(train[ , "boughtSku"])
## finally run the model
## takes a long time (~30 mins), go get a coffee
library(randomForest)
rf <- randomForest(x = predictors,
                   y = response)
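Before moving to the test set, the fitted object is worth inspecting (a sketch): randomForest reports an out-of-bag error estimate, and variable importance shows which prize views drive the model:

## out-of-bag (OOB) error estimate and per-class confusion matrix
print(rf)

## which SKU view columns the forest leans on most
varImpPlot(rf)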
Once we have the model in rf, we can use it to predict prizes from new data, starting with the test data set:
## split the test set the same as the training set
predictor_test <- test[ , product_name_only]
response_test <- as.factor(test[ , "boughtSku"])

## check the result on the test set
prediction <- predict(rf, predictor_test)

## TRUE if it's right, FALSE if not
predictor_test$correct <- prediction == response_test
An accuracy of ~70% was found on the first attempt, not bad for a first pass.
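For reference, a sketch of how that figure can be computed, along with a confusion matrix to see which prizes get mixed up:

## overall accuracy on the test set
sum(predictor_test$correct) / nrow(predictor_test)

## confusion matrix of predicted vs actual claimed prize
table(predicted = prediction, actual = response_test)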
Next steps:
Are the prize categories on the website suitable? How do our website categories compare to user behaviour?
Application: Possible changes on how prizes are organised on the website.
We'll reuse the model data from before, with some modifications.
Only those users who bought, and only the product view columns (we'll call this data frame k_data):
| SKU0001_views | SKU0002_views | … | SKU0185_views |
|---|---|---|---|
| 0 | 1 | … | 0 |
| 2 | 1 | … | 0 |
| 5 | 0 | … | 0 |
| 0 | 0 | … | 1 |
| … | … | … | … |
185 dimensions will take a long time to cluster, and will probably overfit the data.
We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model (this could be applied to the previous random forest as well).
PCA rotates the dimensions to capture as much of the variance as possible in as few components as possible, then ranks the components by the amount of variance they explain.
## k_data: buyers only, product view columns only
## (assumed, matching the table above)
k_data <- model_data[model_data$boughtSku != "NoSale", product_name_only]

pc <- princomp(k_data)
plot(pc, type = "l")

## look for the number of components covering ~85% of the variance
summary(pc)
loadings(pc)
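One way to pick that cut-off programmatically (a sketch; princomp stores the component standard deviations in pc$sdev):

## cumulative proportion of variance explained per component
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)

## smallest number of components covering ~85% of the variance
which(cum_var >= 0.85)[1]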
We'll choose the top 3 components for this example:
## run the alternative PCA needed for k-means
pc <- prcomp(k_data)

## we have chosen the top 3 components,
## so limit the data to the first 3 columns of scores
comp <- data.frame(pc$x[, 1:3])
Running with the top 3 principal components, looking for 4 clusters:
## apply k-means with k = 4
k <- kmeans(comp, centers = 4, nstart = 25, iter.max = 1000)
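A quick sanity check on the fit (a sketch using the fields kmeans returns):

## how many users fall into each cluster
k$size

## where the centroids sit in the 3 component dimensions
k$centers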
The number of clusters is subjective.
Run k-means several times, gradually increasing the number of centroids, and look for the "elbow" where the within-group sum of squares stops falling sharply:
## total within-group sum of squares for k = 1
wss <- (nrow(comp) - 1) * sum(apply(comp, 2, var))

## repeat for 2 to 15 clusters
for (i in 2:15) wss[i] <-
  sum(kmeans(comp, centers = i)$withinss)

plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
library(d3heatmap)

## put the cluster assignment on the original data
r2 <- data.frame(k_data, cluster = k$cluster)

## columns of cluster, rows of SKU:
## mean views per SKU within each cluster
rl <- as.data.frame(lapply(1:4, function(x){
  r3 <- r2[r2$cluster == x, names(k_data)]
  colSums(r3) / nrow(r3)
}))
names(rl) <- paste("cluster", 1:4)

d3heatmap(rl, theme = "dark", scale = "row")
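To eyeball the segments directly, a sketch plotting users in the first two principal components, coloured by cluster:

## users in the first two components, coloured by cluster
plot(comp$PC1, comp$PC2, col = k$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")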