Mark Edmondson - Senior Digital Analyst, Wunderman
18th September 2015
An Introduction: Random Forests and K-Means Clustering
Machine learning gives programs the ability to learn without being explicitly programmed.
They build models from input data to create useful output, commonly predictive analytics.
We look at how this can be applied to web analytics data in particular.
Machine learning is commonly split between supervised and unsupervised learning.
Today we will look at a task of each type.
This recent visualisation from r2d3.net is an excellent introduction.
A decision tree on the training set can be 100% correct.
But it will overfit the data, so it won't perform as well on new test data.
Random forests run many decision trees on subsets of the data.
The model result is an aggregation of all the tree results.
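A minimal sketch of the idea (not from the talk: it uses R's built-in iris data with the rpart and randomForest packages), comparing a single tree to a bagged forest on held-out rows:

library(rpart)
library(randomForest)

set.seed(42)
train_ind <- sample(nrow(iris), 100)
iris_train <- iris[train_ind, ]
iris_test  <- iris[-train_ind, ]

## a single decision tree fit to the training set
tree <- rpart(Species ~ ., data = iris_train)

## a forest of 500 trees, each grown on a bootstrap sample
## of the rows, trying a random subset of columns per split
forest <- randomForest(Species ~ ., data = iris_train)

## compare accuracy on the held-out rows
mean(predict(tree, iris_test, type = "class") == iris_test$Species)
mean(predict(forest, iris_test) == iris_test$Species)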
You are in charge of a reward scheme website, where existing customers log in to spend their points.
You want users to spend as many points as they can, so the points have a high perceived value.
You capture a unique userId on login in custom dimension1, and use GA Enhanced Ecommerce to track which prizes users view and claim.
Can we predict what prizes a user will claim from their view history?
Application: Send email with predicted prize based on view history to users who don't claim.
Use your favourite GA -> R library to get prize views:
library(googleAnalyticsR_public)
gar_auth(new_user = TRUE)

id <- "XXXXX"

## 61607 results, 30049 unique Ids, 185 SKUs
product_views <-
  google_analytics(id,
                   '2015-08-01',
                   '2015-09-01',
                   'productDetailViews',
                   c('productSku', 'dimension1'),
                   samplingLevel = "WALK")
Get the prizes claimed.
We can't query transactionId and productDetailViews in the same call, so we make 2 API calls and link them on userId (dimension1):
## 8855 results, 6336 unique Ids, 169 SKUs
product_trans <-
  google_analytics(id,
                   '2015-08-01',
                   '2015-09-01',
                   'uniquePurchases',
                   c('transactionId', 'productSku', 'dimension1'))
| productSku | dimension1 | productDetailViews |
|---|---|---|
| SKU0023 | UID1234556 | 1 |
| SKU0025 | UID1234557 | 1 |
| SKU0065 | UID1234558 | 2 |
| SKU0066 | UID1234558 | 1 |
| … | … | … |
| transactionId | productSku | dimension1 | uniquePurchases |
|---|---|---|---|
| rfgHgt | SKU0023 | UID1234556 | 1 |
| rfGhkt | SKU0026 | UID1234557 | 1 |
| rtGhjdk | SKU0093 | UID1234558 | 1 |
| rfGhQW | SKU0134 | UID1234558 | 1 |
| … | … | … | … |
The random forest needs a matrix of predictor variables (the prize views) plus a column for the prize each user eventually claimed (the response), one row per user (record).
We need a data frame of this format:
| userId | SKU0001_views | SKU0002_views | … | SKU0185_views | Claimed |
|---|---|---|---|---|---|
| UID1234556 | 0 | 1 | … | 0 | SKU0002 |
| UID1234557 | 2 | 1 | … | 0 | NoSale |
| UID1234558 | 5 | 0 | … | 0 | SKU0001 |
| UID1234558 | 0 | 0 | … | 1 | SKU0185 |
| … | … | … | … | … | … |
Using reshape2 and dplyr to transform the data resulted in a 30,049 row x 187 column matrix:
library(reshape2)
library(dplyr)

## one row per user, one column of views per SKU
product_views <-
  recast(product_views,
         dimension1 ~ productSku + variable,
         fun.aggregate = sum)

model_data <- left_join(product_views,
                        product_trans,
                        by = "dimension1")

## the claimed SKU column from product_trans is assumed
## renamed to boughtSku; NAs are then no sale
model_data$boughtSku[is.na(model_data$boughtSku)] <- "NoSale"
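A quick sanity check on the response is worth doing here (a sketch): with 30,049 users but only 6,336 unique buyers, NoSale will dominate the classes:

## distribution of the response: NoSale dominates,
## since most of the 30,049 users claimed no prize
table(model_data$boughtSku)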
## 75% of the sample size
smp_size <- floor(0.75 * nrow(model_data))

## set seed to make it reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(model_data)),
                    size = smp_size)

train <- model_data[train_ind, ]
test  <- model_data[-train_ind, ]
## only want the product view columns; product_name_only
## is assumed to hold the SKU view column names, e.g.:
product_name_only <- grep("SKU", names(train), value = TRUE)
predictors <- train[ , product_name_only]
response <- as.factor(train[ , "boughtSku"])
## finally run the model
## takes a long time (~30 mins), go get a coffee
library(randomForest)
rf <- randomForest(x = predictors,
                   y = response)
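Before moving to the test set, the fitted object is worth inspecting (a sketch): randomForest reports an out-of-bag error estimate, and variable importance shows which prize views drive the model:

## out-of-bag (OOB) error estimate and per-class confusion matrix
print(rf)

## which SKU view columns the forest leans on most
varImpPlot(rf)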
Once we have the model in rf, we can use it to predict prizes from new data, starting with the test data set:
## split the test set the same as the training set
predictor_test <- test[ , product_name_only]
response_test <- as.factor(test[ , "boughtSku"])

## check the result on the test set
prediction <- predict(rf, predictor_test)

## TRUE if it's right, FALSE if not
predictor_test$correct <- prediction == response_test
An accuracy of ~70% was found on the first attempt, not bad for a first pass.
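For reference, a sketch of how that figure can be computed, along with a confusion matrix to see which prizes get mixed up:

## overall accuracy on the test set
sum(predictor_test$correct) / nrow(predictor_test)

## confusion matrix of predicted vs actual claimed prize
table(predicted = prediction, actual = response_test)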
Next steps:
Are the prize categories on the website suitable? How do our website categories compare to user behaviour?
Application: Possible changes on how prizes are organised on the website.
We'll reuse the model data from before, with some modifications.
Only those users who bought, and only the product view columns (we'll call this data frame k_data):
| SKU0001_views | SKU0002_views | … | SKU0185_views |
|---|---|---|---|
| 0 | 1 | … | 0 |
| 2 | 1 | … | 0 |
| 5 | 0 | … | 0 |
| 0 | 0 | … | 1 |
| … | … | … | … |
185 dimensions will take a long time to cluster, and will probably overfit the data.
We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model (this could be applied to the previous random forest as well).
PCA rotates the dimensions to capture as much of the variance as possible in as few components as possible, then ranks the components by the amount of variance they explain.
## k_data: buyers only, product view columns only
## (assumed, matching the table above)
k_data <- model_data[model_data$boughtSku != "NoSale", product_name_only]

pc <- princomp(k_data)
plot(pc, type = "l")

## look for the number of components covering ~85% of the variance
summary(pc)
loadings(pc)
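One way to pick that cut-off programmatically (a sketch; princomp stores the component standard deviations in pc$sdev):

## cumulative proportion of variance explained per component
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)

## smallest number of components covering ~85% of the variance
which(cum_var >= 0.85)[1]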
We'll choose the top 3 components for this example:
## run the alternative PCA needed for k-means
pc <- prcomp(k_data)

## we have chosen the top 3 components,
## so limit the data to the first 3 columns of scores
comp <- data.frame(pc$x[, 1:3])
Running with the top 3 principal components, looking for 4 clusters:
## apply k-means with k = 4
k <- kmeans(comp, centers = 4, nstart = 25, iter.max = 1000)
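A quick sanity check on the fit (a sketch using the fields kmeans returns):

## how many users fall into each cluster
k$size

## where the centroids sit in the 3 component dimensions
k$centers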
The number of clusters is subjective.
Run k-means several times, gradually increasing the number of centroids, and look for the "elbow" where the within-group sum of squares stops falling sharply:
## total within-group sum of squares for k = 1
wss <- (nrow(comp) - 1) * sum(apply(comp, 2, var))

## repeat for 2 to 15 clusters
for (i in 2:15) wss[i] <-
  sum(kmeans(comp, centers = i)$withinss)

plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
library(d3heatmap)

## put the cluster assignment on the original data
r2 <- data.frame(k_data, cluster = k$cluster)

## columns of cluster, rows of SKU:
## mean views per SKU within each cluster
rl <- as.data.frame(lapply(1:4, function(x){
  r3 <- r2[r2$cluster == x, names(k_data)]
  colSums(r3) / nrow(r3)
}))
names(rl) <- paste("cluster", 1:4)

d3heatmap(rl, theme = "dark", scale = "row")
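To eyeball the segments directly, a sketch plotting users in the first two principal components, coloured by cluster:

## users in the first two components, coloured by cluster
plot(comp$PC1, comp$PC2, col = k$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")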