Machine Learning with Web Analytics and R

Mark Edmondson - Senior Digital Analyst, Wunderman
18th September 2015

An Introduction: Random Forests and K-Means Clustering


Machine learning gives programs the ability to learn without being explicitly programmed.

These programs build models from input data to create useful output, commonly predictive analytics.

We look at how this can be applied to web analytics data in particular.

Introduction to Machine Learning

Types of Machine Learning

Commonly split between supervised and unsupervised learning:

  • Supervised: Train the model on a training set with known outcomes, then check it on a held-out test set
  • Unsupervised: Let the model find structure in the data without known outcomes

We will look at these tasks today:

  • Supervised Learning: Categorisation using Random Forests
  • Unsupervised Learning: Clustering using K-Means

Intro to Random Forest - Decision Trees

This recent visualisation is an excellent introduction.

Many Decision Trees = Random Forest

A single decision tree can be 100% correct on its training set.

But it will overfit that data, so it won't perform as well on new test data.

Random Forests run many decision trees, each on a random subset of the data.

The model result is an aggregation of all the tree results (a majority vote, for classification).
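
The aggregation step can be sketched in base R. This is only an illustration of majority voting, not how the randomForest package is implemented; the three trees and their predictions are made up.

```r
# Hypothetical predictions from three trees for four users (one column per tree).
tree_votes <- data.frame(
  tree1 = c("SKU0001", "NoSale",  "SKU0002", "SKU0001"),
  tree2 = c("SKU0001", "SKU0002", "SKU0002", "NoSale"),
  tree3 = c("SKU0002", "NoSale",  "SKU0002", "SKU0001"),
  stringsAsFactors = FALSE
)

# Majority vote: the most frequent class across the trees for one user.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# One aggregated prediction per row (user).
forest_prediction <- apply(tree_votes, 1, majority_vote)
forest_prediction
```

Each tree can be wrong on its own; the vote only needs most of them to agree.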

Intro to K-means clustering

A video tutorial on k-means is here.

2-Dimensions example:

k-means example

Intro to K-means clustering

  • Pick number of clusters, and start with random centroids.
  • Assign each point to its nearest centroid.
  • Move each centroid to the mean of the points assigned to it.
  • Repeat until there are no new cluster assignments, then stop.

k-means steps
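
The steps can be written out by hand in base R. This is a minimal sketch on made-up 2-D data, assuming Euclidean distance and an arbitrary seed; in practice the built-in kmeans() function does the same job.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Toy 2-D data: two well-separated blobs (made up for illustration).
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

k <- 2
# Step 1: start with k randomly chosen points as centroids.
centroids <- pts[sample(nrow(pts), k), , drop = FALSE]
assignment <- rep(0L, nrow(pts))

repeat {
  # Step 2: squared distance from every point to every centroid (rows = points).
  d <- sapply(seq_len(k), function(j) rowSums(sweep(pts, 2, centroids[j, ])^2))
  new_assignment <- max.col(-d, ties.method = "first")  # index of nearest centroid

  # Step 4: stop when no point changes cluster.
  if (all(new_assignment == assignment)) break
  assignment <- new_assignment

  # Step 3: move each centroid to the mean of its assigned points.
  for (j in seq_len(k)) {
    members <- pts[assignment == j, , drop = FALSE]
    if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
  }
}

table(assignment)
```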


You are in charge of a reward scheme website, where existing customers log in to spend their points.

You want users to spend as many points as they can, so the points have a high perceived value.

You capture a unique userId on login into custom dimension1 and use GA enhanced ecommerce to track which prizes users view and claim.

Machine Learning Project Needs

  1. Pose the question (most important)
  2. Data preparation (majority of work!)
  3. Running the model (sexy statistics)
  4. Assess results (what you'll be judged on)
  5. How to put it into production (the ROI)

Example 1 - Random Forest

Predict which prize from view history

Can we predict what prizes a user will claim from their view history?

Application: Send email with predicted prize based on view history to users who don't claim.

Getting the data - views

Use your favourite GA -> R library to get prize views:


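id <- "XXXXX"

## 61607 results, 30049 unique Ids, 185 Sku's
## call reconstructed assuming googleAnalyticsR (v3 API); dates are placeholders
product_views <- 
  google_analytics_3(id, start = start_date, end = end_date,
                     metrics = "productDetailViews",
                     dimensions = c("productSku", "dimension1"),
                     samplingLevel = "WALK")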

Getting the data - claimed prizes

Get prizes claimed.

Can't query transactionId and productDetailViews in same call, so do 2 API calls and link on userID (dimension1)

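## 8855 results, 6336 unique Ids, 169 Sku's
## call reconstructed assuming googleAnalyticsR (v3 API); dates are placeholders
product_trans <- google_analytics_3(id, start = start_date, end = end_date, metrics = "uniquePurchases",
  dimensions = c("transactionId", "productSku", "dimension1"), samplingLevel = "WALK")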

GA Data Format

Prize Views - 61607 rows

productSku  dimension1  productDetailViews
SKU0023     UID1234556  1
SKU0025     UID1234557  1
SKU0065     UID1234558  2
SKU0066     UID1234558  1

Prizes Claimed - 8855 rows

transactionId  productSku  dimension1  uniquePurchases
rfgHgt         SKU0023     UID1234556  1
rfGhkt         SKU0026     UID1234557  1
rtGhjdk        SKU0093     UID1234558  1
rfGhQW         SKU0134     UID1234558  1

Required Data Format

Random Forest needs a matrix of predictor variables (predictors), plus a column holding the prize eventually claimed (response), with one row for each user (record).

We need a data frame of this format:

userId      SKU0001_views  SKU0002_views  SKU0185_views  Claimed
UID1234556  0              1              0              SKU0002
UID1234557  2              1              0              NoSale
UID1234558  5              0              0              SKU0001
UID1234558  0              0              1              SKU0185

Transforming the data

Using reshape2 and dplyr to transform the data resulted in a matrix of 30,049 rows x 187 columns.

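library(reshape2)
library(dplyr)

## one row per user, one view-count column per Sku
## (melt/dcast wrapper reconstructed; the original call was truncated)
product_views <- 
  dcast(melt(product_views, id.vars = c("dimension1", "productSku")),
         dimension1 ~ productSku + variable, 
         fun.aggregate = sum)

## assumed step: rename the claimed Sku to boughtSku before joining
claimed <- select(product_trans, dimension1, boughtSku = productSku)
model_data <- left_join(product_views, claimed, by = "dimension1")

## NAs are no sale
model_data$boughtSku[is.na(model_data$boughtSku)] <- "NoSale"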
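
The same reshaping can be tried on a few rows with base R only. This is a sketch using the sample view rows from the GA Data Format slide; the claims table is cut down to two users (an assumption, so that the "NoSale" path is exercised), and the names wide and model_toy are hypothetical.

```r
# Sample view rows, long format: one row per user and Sku.
views <- data.frame(
  productSku = c("SKU0023", "SKU0025", "SKU0065", "SKU0066"),
  dimension1 = c("UID1234556", "UID1234557", "UID1234558", "UID1234558"),
  productDetailViews = c(1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Claims for two of the three users only (assumption for illustration).
claims <- data.frame(
  dimension1 = c("UID1234556", "UID1234557"),
  boughtSku  = c("SKU0023", "SKU0026"),
  stringsAsFactors = FALSE
)

# Wide format: one row per user, one view-count column per Sku.
wide <- as.data.frame.matrix(
  xtabs(productDetailViews ~ dimension1 + productSku, data = views)
)
names(wide) <- paste0(names(wide), "_views")
wide$dimension1 <- rownames(wide)

# Left join on the userId; users with no claim become "NoSale".
model_toy <- merge(wide, claims, by = "dimension1", all.x = TRUE)
model_toy$boughtSku[is.na(model_toy$boughtSku)] <- "NoSale"
model_toy
```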

Splitting into training and test

## 75% of the sample size
smp_size <- floor(0.75 * nrow(model_data))

## set seed to make reproducible
## (seed value is arbitrary; the original call was not preserved)
set.seed(123)
train_ind <- sample(seq_len(nrow(model_data)), 
                    size = smp_size)

train <- model_data[train_ind, ]

test <- model_data[-train_ind, ]

Running the Random Forest model

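library(randomForest)

## only want product view columns
## (assumed definition: every column except the userId and the response)
product_name_only <- setdiff(names(train), c("dimension1", "boughtSku"))
predictors <- train[ , product_name_only]

response <- as.factor(train[,"boughtSku"])

## finally run the model
## takes a long time (30mins)
## go get a coffee
rf <- randomForest(x = predictors,
                   y = response)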

Random Forest Raw Results

Once we have the model in rf we can use it to predict prizes from new data, starting with the test data set:

## split test set same as training
predictor_test <- test[ , product_name_only]
response_test <- as.factor(test[,"boughtSku"])

## check result on test set
prediction <- predict(rf, predictor_test)

## TRUE if it's right, FALSE if not
## compare as character: the two factors can have different level sets
predictor_test$correct <- 
  as.character(prediction) == as.character(response_test)

An accuracy of ~70% was found on the first attempt, not bad for a first pass.
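
Accuracy is the share of test users whose predicted prize matches the claimed one. A small sketch with made-up vectors (not the real model output) shows the calculation, plus a confusion table to see where the errors fall:

```r
# Hypothetical actual and predicted prizes for eight test users.
actual    <- c("SKU0001", "NoSale", "SKU0002", "NoSale",
               "SKU0001", "SKU0185", "NoSale", "SKU0002")
predicted <- c("SKU0001", "NoSale", "SKU0001", "NoSale",
               "SKU0001", "NoSale",  "NoSale", "SKU0002")

# Overall accuracy: proportion of matching predictions.
accuracy <- mean(predicted == actual)

# Confusion table: rows are actual prizes, columns are predictions.
confusion <- table(actual = actual, predicted = predicted)

accuracy
confusion
```

With many "NoSale" users, it is worth checking accuracy per prize as well, since a model that always predicts "NoSale" can still score well overall.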

Visualisation of the model results - where were the errors?

actual vs forecast randomForest

Putting the model into deployment

Next steps:

  • Run model on more test sets
  • Train model on more data
  • Try reducing number of parameters
  • Examine large error outliers
  • Compare with simple models (last/first product viewed?)
  • Run model against users who have viewed but not yet claimed
  • Run email campaign with control and model results for final judgement
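
One simple model to compare against is "predict the most-viewed prize". A sketch in base R, using the sample rows from the Required Data Format slide as toy data (the names views, claimed and baseline are hypothetical):

```r
# View counts per user (one row per user) and the prize each user claimed.
views <- data.frame(
  SKU0001_views = c(0, 2, 5, 0),
  SKU0002_views = c(1, 1, 0, 0),
  SKU0185_views = c(0, 0, 0, 1)
)
claimed <- c("SKU0002", "NoSale", "SKU0001", "SKU0185")

# Baseline: for every user, predict the Sku with the most views.
top_col  <- max.col(as.matrix(views), ties.method = "first")
baseline <- sub("_views$", "", names(views)[top_col])

# Share of users where the baseline matches the claimed prize.
baseline_accuracy <- mean(baseline == claimed)
baseline_accuracy
```

This baseline never predicts "NoSale", so it is only a fair comparison on users who did claim; if the Random Forest cannot beat it there, the extra complexity is hard to justify.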

Example 2 - K-Means Clustering

k-means example

Can we categorise users based on their product history?

Are the prize categories on the website suitable? How do our website categories compare to user behaviour?

Application: Possible changes on how prizes are organised on the website.

K-Means Data

We'll reuse the model data from before, with some modifications.

Only those users who bought, and only the product view columns.

SKU0001_views  SKU0002_views  SKU0185_views
0              1              0
2              1              0
5              0              0
0              0              1
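
The filtering can be sketched as below. The toy data frame stands in for the model data, and the userId values and the names model_toy, view_cols and k_toy are hypothetical; the slides call the real result k_data.

```r
# Toy stand-in for the model data: view counts plus the claimed prize.
model_toy <- data.frame(
  userId        = c("U1", "U2", "U3", "U4"),
  SKU0001_views = c(0, 2, 5, 0),
  SKU0002_views = c(1, 1, 0, 0),
  SKU0185_views = c(0, 0, 0, 1),
  boughtSku     = c("SKU0002", "NoSale", "SKU0001", "SKU0185"),
  stringsAsFactors = FALSE
)

# Keep users who claimed a prize, and only the "_views" columns.
view_cols <- grep("_views$", names(model_toy), value = TRUE)
k_toy <- model_toy[model_toy$boughtSku != "NoSale", view_cols]
k_toy
```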

Principal Component Analysis

185 dimensions will take a long time to cluster, and will probably overfit the data.

We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model (this could be applied to the previous Random Forest as well).

PCA rotates the dimensions into new components so that as few as possible are needed, then ranks them by the amount of variance each explains.

Finding number of components from plot

## numeric product view columns only
pc <- princomp(k_data)
plot(pc, type="l")

how many dimensions?

Finding number of components from summary

# look for dimension that is ~ 85% variance
summary(pc)

summary dimensions
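
The same cut-off can be computed rather than read off the summary. A minimal sketch on made-up numeric data, assuming the ~85% target from the slide; the names x, pc_toy and n_comp are hypothetical:

```r
set.seed(1)  # arbitrary seed

# Made-up data: 100 rows, 6 columns, with the first two columns correlated.
x <- matrix(rnorm(600), ncol = 6)
x[, 2] <- x[, 1] + rnorm(100, sd = 0.1)

pc_toy <- prcomp(x)

# Proportion of variance per component, then the running total.
var_explained <- pc_toy$sdev^2 / sum(pc_toy$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components that reaches the ~85% target.
n_comp <- which(cum_var >= 0.85)[1]

round(cum_var, 3)
n_comp
```

The sdev element is also returned by princomp(), so the calculation works with either function.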

Applying choice of components to data

We'll choose top 3 components for this example:

# run alternative pca needed for k-means
pc <- prcomp(k_data)

## We have chosen top 3 dimensions
## limit data to first 3 columns
comp <- data.frame(pc$x[,1:3])

Run K-Means with 4 clusters

Running with top 3 principal components, looking for 4 clusters:

# Apply k-means with k=4
k <- kmeans(comp, centers = 4, nstart=25, iter.max=1000)
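
The object returned by kmeans() holds the cluster assignments and some fit statistics. A sketch on made-up data (two artificial blobs, so 2 centres rather than 4; the names comp_toy and km_toy are hypothetical):

```r
set.seed(7)  # arbitrary seed

# Made-up 3-column data: two well-separated blobs of 20 points each.
comp_toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 3)
)

km_toy <- kmeans(comp_toy, centers = 2, nstart = 25)

km_toy$size                       # number of points in each cluster
km_toy$centers                    # centroid coordinates
head(km_toy$cluster)              # cluster label for each row
km_toy$betweenss / km_toy$totss   # share of variance explained by the clustering
```

Using nstart runs the algorithm from several random starts and keeps the best one, which guards against a poor initial choice of centroids.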

How many clusters?

Number of clusters is subjective.

Run k-means several times, gradually increasing the number of centroids, and look for the point where the within-group sum of squares stops falling sharply (the elbow):

wss <- (nrow(comp)-1)*sum(apply(comp,2,var))

## try 2 to 15 clusters
for (i in 2:15) wss[i] <- 
  sum(kmeans(comp, centers=i)$withinss)

plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

How many clusters - plot

how many clusters?

Heatmap visualisation of clusters

4 clusters

How to present the clustering in heatmap

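library(d3heatmap)

## put cluster on original data
r2 <- data.frame(k_data, cluster = k$cluster)

## columns of cluster, rows of Sku
## (lapply wrapper reconstructed; the original call was truncated)
rl <- as.data.frame(lapply(1:4, function(x){ 
      r3 <- r2[r2$cluster == x,names(k_data)] 
      ## mean views per Sku within the cluster
      r4 <- colSums(r3) / nrow(r3)
      r4
}))

names(rl) <- paste("cluster",1:4)

d3heatmap(rl, theme="dark", scale = 'row')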

K-Means - Next Steps

  • Compare clustered products to their existing category
  • Experiment with varying dimensions and number of clusters
  • Correct for any self-reinforcing results
  • A/B test new product categorisation for uplift


Pitfalls Using ML in Web Analytics

  • Web analytics is messy data (a user is…?)
  • Most practical analysis needs robust unique userIds
  • Time-series techniques are the quickest way in (forecasting / anomalies) - see GA Effect
  • Correlating confounders: e.g. PPC clicks and cost
  • Self reinforcing results: more clicks on a personalised top result
  • No magic: Only assume ML can scale a human expert
  • Overfitting vs Bias: Always judge on test set not training
  • No regularisation: e.g. pageviews + bounce rate in same model

Other Machine Learning Models