Load the packages. The dslabs package includes a reduced form of the MovieLens data.
library(dslabs, quietly = TRUE)
library(tidyverse, quietly = TRUE)
library(broom, quietly = TRUE)
options(scipen = 9999)
Load the movielens data.
data(movielens)
movielens <- as_tibble(movielens)
movielens
Blei’s deconfounder model assumes that the effect of some cause on an outcome variable is confounded by a latent third variable. He further assumes the cause variable is a vector.
Blei uses a latent variable model to estimate a proxy for the actual latent confounder. In the recommendation paper, he calls this a “substitute confounder”.
The goal is to construct this proxy so that the elements of the cause vector are conditionally independent of one another given the proxy.
Finally, he takes the expectation of this proxy and adds it to the data as an extra variable.
For the movielens problem, he assumes the cause is userId and the outcome is rating.
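The conditional independence requirement and the expectation step can be written down compactly. The notation here is my own assumption, not taken from the paper: a_i = (a_{i1}, ..., a_{im}) is the cause vector for unit i, z_i is the latent variable of the fitted model, and \hat{z}_i is the substitute confounder.

```latex
% Elements of the cause vector are independent given the latent variable
p(a_{i1}, \dots, a_{im} \mid z_i) = \prod_{j=1}^{m} p(a_{ij} \mid z_i)

% The substitute confounder is the posterior expectation of that variable
\hat{z}_i = \mathbb{E}\left[ z_i \mid a_{i1}, \dots, a_{im} \right]
```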
First, let’s turn userId into a cause vector.
movielens$userId <- as.factor(movielens$userId)
movielens$movieId <- as.factor(movielens$movieId)
userIdVector <- movielens %>%
  select(movieId, userId) %>%
  mutate(watched = 1) %>%
  spread(userId, watched, fill = 0) %>%
  select(-movieId) %>%
  t()
print(class(userIdVector))
## [1] "matrix"
dim(userIdVector)
## [1] 671 9066
This creates a matrix with one row per userId and one column per movieId. An element is 1 if the user has a rating for the film (treated here as having watched it), and 0 otherwise.
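To see what the reshaping does, here is a minimal sketch on a made-up table of three users and three films. The `toy` data and all of its ids are hypothetical, not part of movielens. The only addition to the pipeline above is that I keep the movieId column around long enough to label the columns of the result.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per (user, movie) pair that has a rating.
toy <- tibble(
  userId  = factor(c("u1", "u1", "u2", "u3", "u3")),
  movieId = factor(c("m1", "m2", "m2", "m1", "m3"))
)

# Same reshaping as above: one column per user, 1 if watched, 0 otherwise.
toy_wide <- toy %>%
  mutate(watched = 1) %>%
  spread(userId, watched, fill = 0)

# Drop the id column, convert to a matrix, and transpose so users are on rows.
toy_matrix <- t(as.matrix(select(toy_wide, -movieId)))
colnames(toy_matrix) <- as.character(toy_wide$movieId)

toy_matrix
```

Each row of `toy_matrix` is one user's cause vector: u1 has a 1 under m1 and m2, and a 0 under m3.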
Blei uses a Poisson factorization model to construct the “substitute confounder”. I am going to use k-means clustering instead, only because it is much simpler to explain. I arbitrarily set the number of clusters to 3.
cl <- kmeans(userIdVector, 3)
user_cluster_map <- tibble(
  userId = factor(names(cl$cluster), levels = levels(movielens$userId)),
  cluster = factor(cl$cluster)
)
user_cluster_map
So for every user, we get a cluster. The cluster is our naive “substitute confounder”.
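One caveat is that k-means starts from random centers, so the cluster labels (and sometimes the clusters themselves) can change between runs. Here is a minimal sketch of pinning that down on a hypothetical four-user matrix. The seed value, the `nstart` setting, and the toy data are my own choices for illustration, not something from Blei's paper.

```r
# Hypothetical toy matrix: u1 and u2 watch similar films, as do u3 and u4.
toy_matrix <- rbind(
  u1 = c(1, 1, 1, 0, 0, 0),
  u2 = c(1, 1, 0, 0, 0, 0),
  u3 = c(0, 0, 0, 1, 1, 1),
  u4 = c(0, 0, 0, 0, 1, 1)
)

# Fix the seed so the result is reproducible, and try several random
# starts so a single unlucky initialization does not decide the outcome.
set.seed(42)
toy_cl <- kmeans(toy_matrix, centers = 2, nstart = 10)

# The cluster vector is named by the row names of the input matrix.
toy_map <- data.frame(
  userId  = names(toy_cl$cluster),
  cluster = factor(toy_cl$cluster)
)
toy_map
```

The numeric labels themselves carry no meaning. Only which users share a label matters.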
Blei then augments the original data with the substitute confounder.
augmented <- movielens %>%
  left_join(user_cluster_map)
## Joining, by = "userId"
augmented %>% select(movieId, userId, cluster, genres, rating)
From this point, you fit a recommender system model that predicts the rating given userId, movieId, cluster, and any other variables (e.g. genres). The idea is that including “cluster” as a predictor blocks the backdoor path through the latent confounder, so the model's predictions estimate the effect of do(userId := u).
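As a rough sketch of this last step, here is a plain linear model standing in for the recommender. This is my simplification, not the model Blei fits, and the `toy_aug` data is made up. I leave userId out of the formula because, in this naive setup, cluster is a deterministic function of userId, so a linear model with both cannot separate them.

```r
library(dplyr)
library(broom)

# Hypothetical augmented data: four users, two films, made-up ratings.
toy_aug <- tibble(
  userId  = factor(c("u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4")),
  movieId = factor(c("m1", "m2", "m1", "m2", "m1", "m2", "m1", "m2")),
  cluster = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  rating  = c(4, 3, 5, 3, 2, 4, 1, 5)
)

# Stand-in for the recommender: rating as a function of film and cluster.
fit <- lm(rating ~ movieId + cluster, data = toy_aug)
tidy(fit)

# Predicted rating for film m2 from a user in cluster 1.
new_case <- tibble(
  movieId = factor("m2", levels = levels(toy_aug$movieId)),
  cluster = factor("1", levels = levels(toy_aug$cluster))
)
pred <- predict(fit, newdata = new_case)
pred
```

A real recommender would use something richer than `lm`, such as a matrix factorization model, but the substitute confounder enters in the same way, as one more predictor.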
Note, there are identification issues with Blei’s assumption. See this post.