Mark Edmondson - Senior Digital Analyst, Wunderman

18th September 2015

An Introduction: Random Forests and K-Means Clustering

Machine learning gives programs the ability to learn without being explicitly programmed.

These programs build models from input data to create useful output, commonly predictive analytics.

We look at how this can be applied to web analytics data in particular.

Commonly split between supervised and unsupervised learning:

- Supervised: Train the model against a training set with known outcomes
- Unsupervised: Let the model find its own structure in the data

Today we will look at these tasks:

- Supervised Learning: Categorisation using Random Forests
- Unsupervised Learning: Clustering using K-Means

This recent visualisation from r2d3.net is an excellent introduction

A decision tree can be 100% correct on its training set.

But it will **overfit the data**, so it won't perform as well on new test data.
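A minimal sketch of overfitting, using made-up data and swapping the decision tree for a polynomial regression in base R so it runs without extra packages. The flexible model always fits the training set at least as well as the simple one; the test set is where the two should be compared.

```r
set.seed(7)
# Made-up data: a straight-line signal plus noise
n <- 30
x <- runif(n, 0, 1)
y <- 2 * x + rnorm(n, sd = 0.3)

# Hold out 10 points as a test set
train_idx <- sample(n, 20)
toy_train <- data.frame(x = x[train_idx], y = y[train_idx])
toy_test  <- data.frame(x = x[-train_idx], y = y[-train_idx])

# Mean squared error of a fitted model on a data set
mse <- function(model, data) mean((data$y - predict(model, data))^2)

simple_fit   <- lm(y ~ x, data = toy_train)            # matches the true signal
flexible_fit <- lm(y ~ poly(x, 10), data = toy_train)  # enough freedom to chase noise

# Compare training error with test error for both models
errors <- rbind(
  simple   = c(train = mse(simple_fit, toy_train),
               test  = mse(simple_fit, toy_test)),
  flexible = c(train = mse(flexible_fit, toy_train),
               test  = mse(flexible_fit, toy_test))
)
print(errors)
```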

Random forests run many decision trees on sub-sets of the data.

The model result is an aggregation of all the tree results.
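The aggregation step can be sketched as a majority vote. The votes below are hypothetical, standing in for the output of five already-trained trees; no trees are actually fitted here.

```r
# Hypothetical votes: one row per tree, one column per user
tree_votes <- rbind(
  tree1 = c("SKU0001", "NoSale",  "SKU0002"),
  tree2 = c("SKU0001", "SKU0002", "SKU0002"),
  tree3 = c("NoSale",  "NoSale",  "SKU0002"),
  tree4 = c("SKU0001", "NoSale",  "SKU0001"),
  tree5 = c("SKU0001", "SKU0002", "SKU0002")
)
colnames(tree_votes) <- c("userA", "userB", "userC")

# The most frequent vote wins (ties go to the alphabetically first label)
majority_vote <- function(votes) names(which.max(table(votes)))

# One aggregated prediction per user
forest_prediction <- apply(tree_votes, 2, majority_vote)
print(forest_prediction)
```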

A video tutorial on k-means is here.

2-Dimensions example:

- Pick the number of clusters, and start with random centroids.
- Assign each point to its nearest centroid.
- Move each centroid to the mean of the points assigned to it.
- Repeat until no cluster assignments change, then stop.
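The steps above can be sketched in base R. This is an illustration on made-up 2-D data, not a replacement for `kmeans()`; the function name `simple_kmeans` is hypothetical.

```r
set.seed(42)
# Made-up 2-D data: two well-separated blobs of 20 points each
pts <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)

simple_kmeans <- function(pts, k, max_iter = 100) {
  # Start with k randomly chosen points as centroids
  centroids <- pts[sample(nrow(pts), k), , drop = FALSE]
  assignment <- rep(0L, nrow(pts))
  for (iter in seq_len(max_iter)) {
    # Squared distance from every point to every centroid (n x k matrix)
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(pts, 2, centroids[j, ])^2)
    })
    # Assign each point to its nearest centroid
    new_assignment <- max.col(-dists, ties.method = "first")
    # Stop when no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      members <- pts[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

fit <- simple_kmeans(pts, k = 2)
print(table(fit$cluster))
```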

You are in charge of a reward scheme website, where existing customers log in to spend their points.

You want users to spend as many points as they can, so that the points have a high perceived value.

You capture a unique userId on login in custom dimension1 and use GA enhanced ecommerce to track which prizes users view and claim.

- Pose the question (most important)
- Data preparation (majority of work!)
- Running the model (sexy statistics)
- Assess results (what you'll be judged on)
- How to put it into production (the ROI)

Can we predict what prizes a user will claim from their view history?

*Application:* Send email with predicted prize based on view history to users who don't claim.

Use your favourite GA -> R library to get prize views:

```
library(googleAnalyticsR_public)
gar_auth(new_user = TRUE)
id <- "XXXXX"
## 61607 results, 30049 unique Ids, 185 Sku's
product_views <-
  google_analytics(id,
                   '2015-08-01',
                   '2015-09-01',
                   'productDetailViews',
                   c('productSku',
                     'dimension1'),
                   samplingLevel = "WALK")
```

Get prizes claimed.

Can't query transactionId and productDetailViews in the same call, so make two API calls and link on userId (dimension1).

```
## 8855 results, 6336 unique Ids, 169 Sku's
product_trans <-
  google_analytics(id,
                   '2015-08-01',
                   '2015-09-01',
                   'uniquePurchases',
                   c('transactionId',
                     'productSku',
                     'dimension1'))
```

productSku | dimension1 | productDetailViews |
---|---|---|
SKU0023 | UID1234556 | 1 |
SKU0025 | UID1234557 | 1 |
SKU0065 | UID1234558 | 2 |
SKU0066 | UID1234558 | 1 |
… | … | … |

transactionId | productSku | dimension1 | uniquePurchases |
---|---|---|---|
rfgHgt | SKU0023 | UID1234556 | 1 |
rfGhkt | SKU0026 | UID1234557 | 1 |
rtGhjdk | SKU0093 | UID1234558 | 1 |
rfGhQW | SKU0134 | UID1234558 | 1 |
… | … | … | … |

Random Forest needs, for each user (record), a row of predictor variables plus a column for the prize they eventually claimed (response).

We need a data frame of this format:

userId | SKU0001_views | SKU0002_views | … | SKU0185_views | Claimed |
---|---|---|---|---|---|
UID1234556 | 0 | 1 | … | 0 | SKU0002 |
UID1234557 | 2 | 1 | … | 0 | NoSale |
UID1234558 | 5 | 0 | … | 0 | SKU0001 |
UID1234558 | 0 | 0 | … | 1 | SKU0185 |
… | … | … | … | … | … |

Using `reshape2` and `dplyr` to transform the data resulted in a 30,049 row x 187 column matrix:

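```
library(reshape2)
library(dplyr)
## one row per user, one column of view counts per SKU
product_views <-
  recast(product_views,
         dimension1 ~ productSku + variable,
         fun.aggregate = sum)
## the claimed prize is the response; keep it as boughtSku
product_trans <- select(product_trans,
                        dimension1,
                        boughtSku = productSku)
model_data <- left_join(product_views,
                        product_trans,
                        by = "dimension1")
## NAs are no sale
model_data$boughtSku[is.na(model_data$boughtSku)] <- "NoSale"
```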
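To make the reshape easier to follow, here is a toy version in base R, swapping `xtabs` and `merge` in for `recast` and `left_join`. The three users and two SKUs are made up, and `boughtSku` is assumed to be the name of the claimed-prize column.

```r
# Made-up long-format data in the shape returned by the two API calls
views <- data.frame(
  dimension1 = c("UID1", "UID1", "UID2", "UID3"),
  productSku = c("SKU0001", "SKU0002", "SKU0001", "SKU0002"),
  productDetailViews = c(2, 1, 3, 1),
  stringsAsFactors = FALSE
)
claims <- data.frame(
  dimension1 = c("UID1", "UID3"),
  boughtSku = c("SKU0002", "SKU0002"),
  stringsAsFactors = FALSE
)

# Long -> wide: one row per user, one column of view counts per SKU
wide <- as.data.frame.matrix(
  xtabs(productDetailViews ~ dimension1 + productSku, data = views)
)
names(wide) <- paste0(names(wide), "_views")
wide$dimension1 <- rownames(wide)

# Left join on the user id; users with no claim get NA, then "NoSale"
toy_model_data <- merge(wide, claims, by = "dimension1", all.x = TRUE)
toy_model_data$boughtSku[is.na(toy_model_data$boughtSku)] <- "NoSale"
print(toy_model_data)
```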

```
## 75% of the sample size
smp_size <- floor(0.75 * nrow(model_data))
## set seed to make reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(model_data)),
size = smp_size)
train <- model_data[train_ind, ]
test <- model_data[-train_ind, ]
```

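```
## only want product view columns
## assumption: every column that is not an id or the response is a product view column
product_name_only <- setdiff(names(train),
                             c("dimension1", "transactionId",
                               "uniquePurchases", "boughtSku"))
predictors <- train[ , product_name_only]
response <- as.factor(train[,"boughtSku"])
## finally run the model
## takes a long time (30mins)
## go get a coffee
library(randomForest)
rf <- randomForest(x = predictors,
                   y = response)
```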

Once we have the model in `rf` we can use it to predict prizes from new data, starting with the test data set:

```
## split test set same as training
predictor_test <- test[ , product_name_only]
response_test <- as.factor(test[,"boughtSku"])
## check result on test set
prediction <- predict(rf, predictor_test)
## TRUE if it's right, FALSE if not
## compare as character: == fails on factors with different level sets
predictor_test$correct <-
  as.character(prediction) == as.character(response_test)
```

The first attempt gave an accuracy of ~70% on the test set, not bad for a first pass.
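A sketch of how an accuracy figure like this can be computed, using hypothetical predictions for six test users rather than the real results. The confusion table shows which prizes get mistaken for which.

```r
# Hypothetical predicted and actual prizes for six test users
predicted <- c("SKU0001", "NoSale", "SKU0002", "NoSale", "SKU0001", "SKU0002")
actual    <- c("SKU0001", "NoSale", "SKU0001", "NoSale", "SKU0002", "SKU0002")

# Overall accuracy: share of users where the prediction matches the claim
accuracy <- mean(predicted == actual)

# Confusion table: rows are actual prizes, columns are predictions
confusion <- table(actual = actual, predicted = predicted)

print(accuracy)
print(confusion)
```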

Next steps:

- Run model on more test sets
- Train model on more data
- Try reducing number of parameters
- Examine large error outliers
- Compare with simple models (last/first product viewed?)
- Run model against users who have viewed but not claimed yet
- Run email campaign with control and model results for final judgement
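As one example of a simple comparison model, the sketch below predicts that each user claims their most viewed prize. The data is made up, and "most viewed" is an assumed stand-in for the last/first product viewed, which would need timestamps.

```r
# Made-up view counts for three users
toy_views <- data.frame(
  SKU0001_views = c(0, 2, 5),
  SKU0002_views = c(1, 1, 0),
  SKU0185_views = c(0, 0, 1)
)
claimed <- c("SKU0002", "NoSale", "SKU0001")

# Column index of the most viewed product for each user
top_col <- max.col(as.matrix(toy_views), ties.method = "first")

# Strip the "_views" suffix to get the predicted SKU
baseline_prediction <- sub("_views$", "", names(toy_views)[top_col])

# Note this baseline can never predict "NoSale"
baseline_accuracy <- mean(baseline_prediction == claimed)
print(baseline_accuracy)
```

If the random forest cannot beat a baseline like this on the same test set, the extra complexity is hard to justify.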

Are the prize categories on the website suitable? How do our website categories compare to user behaviour?

*Application:* Possible changes on how prizes are organised on the website.

We'll reuse the model data from before, with some modifications.

We keep only those users who claimed a prize, and only the product view columns.

SKU0001_views | SKU0002_views | … | SKU0185_views |
---|---|---|---|
0 | 1 | … | 0 |
2 | 1 | … | 0 |
5 | 0 | … | 0 |
0 | 0 | … | 1 |
… | … | … | … |

Clustering on 185 dimensions will take a long time, and will probably overfit the data.

We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model (this could be applied to the previous Random Forest as well).

PCA rotates the dimensions into new components and ranks them by the amount of variance each explains, so most of the information can be kept in far fewer dimensions.

```
## k_data: product view columns only, for users who claimed a prize
pc <- princomp(k_data)
plot(pc, type="l")
```

```
# look for the number of components that reach ~85% cumulative variance
summary(pc)
loadings(pc)
```
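The cumulative variance can also be read off programmatically. A minimal sketch on made-up Poisson view counts, using `prcomp`; the 85% threshold is the one mentioned above, and `n_comp` is a hypothetical name.

```r
set.seed(1)
# Made-up view counts: 200 users x 6 products
toy_counts <- matrix(rpois(200 * 6, lambda = 2), ncol = 6)
colnames(toy_counts) <- paste0("SKU000", 1:6, "_views")

pca <- prcomp(toy_counts)

# Proportion of variance per component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching 85% cumulative variance
n_comp <- which(cum_var >= 0.85)[1]

print(round(cum_var, 3))
print(n_comp)
```

On real view data the first few components usually carry far more of the variance than in this uncorrelated toy example.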

We'll choose the top 3 components for this example:

```
# run alternative pca needed for k-means
pc <- prcomp(k_data)
## We have chosen top 3 dimensions
## limit data to first 3 columns
comp <- data.frame(pc$x[,1:3])
```

Running with the top 3 principal components, looking for 4 clusters:

```
# Apply k-means with k=4
k <- kmeans(comp, centers = 4, nstart=25, iter.max=1000)
```

The number of clusters is subjective.

Run k-means several times, gradually increasing the number of centroids, and look for the "elbow" where the within-groups sum of squares stops falling sharply:

```
wss <- (nrow(comp)-1)*sum(apply(comp,2,var))
## try k = 2 to 15 clusters
for (i in 2:15) wss[i] <-
sum(kmeans(comp, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
```

```
library(d3heatmap)
## put cluster on original data
r2 <- data.frame(k_data, cluster = k$cluster)
## columns of cluster, rows of Sku
rl <-
as.data.frame(lapply(1:4, function(x){
r3 <- r2[r2$cluster == x,names(k_data)]
r4 <- colSums(r3) / nrow(r3)
r4}))
names(rl) <- paste("cluster",1:4)
d3heatmap(rl, theme="dark", scale = 'row')
```

- Compare clustered products to their existing category
- Experiment with varying dimensions and number of clusters
- Correct for any self-reinforcing results
- A/B test new product categorisation for uplift

- Web analytics is messy data (a user is…?)
- Most practical analysis needs robust unique userIds
- Time-series techniques are the quickest way in (forecasting / anomalies) - see GA Effect
- Correlating confounders: e.g. PPC clicks and cost
- Self-reinforcing results: more clicks on a personalised top result
- No magic: Only assume ML can scale a human expert
- Overfitting vs Bias: Always judge on test set not training
- No regularisation: e.g. pageviews + bounce rate in same model