Credit to Matthew Borthwick for scraping and providing this data.
Data Source: boardgamegeek.com
The data in question are user ratings for boardgames. This notebook covers the cleaning/reshaping of the data, principal component analysis, and ratings prediction with User-User collaborative filtering.
Our tasks are as follows:
1. Clean and reshape the data into a user-by-game ratings table.
2. Explore the ratings with principal component analysis.
3. Predict ratings with User-User collaborative filtering.
Let's first download the data and take a look.
source("requirements.R")
ggthemr("dust")
elite <- read_csv("Data/boardgame-elite-users.csv") %>% rename(userID = `Compiled from boardgamegeek.com by Matt Borthwick`)
titles <- read_csv("Data/boardgame-titles.csv") %>% rename(gameID = `boardgamegeek.com game ID`)
elite[1:5,] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| userID | gameID | rating |
|---|---|---|
| 272 | 9216 | 1 |
| 272 | 9217 | 1 |
| 272 | 39938 | 1 |
| 272 | 15363 | 5 |
| 272 | 24068 | 1 |
titles[1:5,] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| gameID | title |
|---|---|
| 13004 | The Downfall of Pompeii |
| 66188 | Fresco |
| 503 | Through the Desert |
| 66690 | Dominion: Prosperity |
| 150 | PitchCar |
For each row, we have a user ID, a game ID, and that user's rating for the game. We can explore the data a bit more. Let's calculate the average and standard deviation of ratings for each game, as well as how many times it was reviewed. A plot of average rating shows a mild upward trend as the number of reviews increases.
#Group by game, calculate the within-group average, standard deviation, and ratings count, and join with the titles dataframe.
elite <- elite %>% group_by(gameID) %>%
mutate(avg_rating = mean(rating), sd_rating = sd(rating), freq = n()) %>%
left_join(titles)
#kable() function makes nice tables
elite[1:5,] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "responsive"))
| userID | gameID | rating | avg_rating | sd_rating | freq | title |
|---|---|---|---|---|---|---|
| 272 | 9216 | 1 | 7.555253 | 1.362515 | 178 | Goa |
| 272 | 9217 | 1 | 7.078808 | 1.399039 | 193 | Saint Petersburg |
| 272 | 39938 | 1 | 6.643584 | 1.457874 | 173 | Carson City |
| 272 | 15363 | 5 | 7.036125 | 1.317551 | 160 | Nexus Ops |
| 272 | 24068 | 1 | 6.440321 | 1.378874 | 156 | Shadow Hunters |
#Plot mean and standard deviation against review frequency
ggplot(elite) +
  geom_point(aes(x = freq, y = sd_rating, col = "Standard Deviation")) +
  geom_point(aes(x = freq, y = avg_rating, col = "Average")) +
  xlab("Number of Reviews") + ylab("Value") +
  scale_color_manual(values = c("Average" = "blue", "Standard Deviation" = "red"))
Scatterplots of mean rating (blue) and standard deviation (red) against number of reviews
We ultimately want the data to contain a single row per user, with one column per game holding that user's rating.
To get the data into this format, we lean on the tidyverse, and tidyr's spread() function in particular, which converts the data into a wide format:
elite_sparse <- elite %>%
  select(userID, gameID, rating) %>%
  spread(gameID, rating)
elite_sparse[1:5, 1:15] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "responsive"))
| userID | 3 | 5 | 10 | 11 | 12 | 13 | 18 | 41 | 42 | 45 | 46 | 47 | 50 | 51 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 272 | 1 | 1 | 1 | NA | 1 | 7.5 | 1 | 4 | 1 | 6 | 1 | 1 | NA | 3 |
| 388 | 8 | 7 | 8 | 8 | 8 | 7.0 | 8 | 9 | 9 | 8 | 10 | 7 | 10 | 6 |
| 430 | 6 | 6 | 7 | 7 | 6 | 8.0 | NA | 5 | 6 | 9 | 6 | 6 | 3 | 7 |
| 2044 | 8 | 7 | 7 | 7 | NA | 8.0 | 7 | 7 | 7 | 6 | 7 | NA | 8 | 7 |
| 3080 | 9 | NA | 5 | 8 | 7 | 8.0 | 8 | 7 | 8 | 9 | 7 | NA | 7 | 6 |
As you can see, each reviewer does not have a score for every game, and we end up with a lot of missing values even for this set of “elite” reviewers. There are a few methods that can be used to fill the missing cells, including imputing the mean/median rating. Here I will use a set of functions from the missMDA package that uses iterative methods to estimate likely values. Some of the imputed values will not make sense (negative ratings, or ratings above the maximum); however, the result still provides a workable base for PCA.
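For reference, a minimal sketch of the simpler mean-imputation alternative (not what is used below) might look like this:
#For reference only: fill each game's missing ratings with that game's mean rating.
#The missMDA-based imputation below is what the analysis actually uses.
mean_fill <- function(x){ x[is.na(x)] <- mean(x, na.rm = TRUE); x }
elite_mean_filled <- as.data.frame(elite_sparse)
elite_mean_filled[,-1] <- lapply(elite_mean_filled[,-1], mean_fill)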
Two steps are performed to fill the table:
1. Estimate the number of dimensions (components) to be used in PCA.
2. Fill the missing cells assuming the number of dimensions estimated in step 1.
#Estimate dimensions
ncomp <- estim_ncpPCA(as.data.frame(elite_sparse[,2:ncol(elite_sparse)]), ncp.min = 0, ncp.max = 6)
#Fill the matrix
elite_sparse_filled <- imputePCA(as.data.frame(elite_sparse[,2:ncol(elite_sparse)]),
ncp = ncomp$ncp, scale = TRUE,
method = "EM", row.w = NULL, coeff.ridge = 1,
threshold = 1e-06, seed = NULL, nb.init = 1, maxiter = 1000)$completeObs
#Perform PCA
PCA <- prcomp(elite_sparse_filled, retx = TRUE)
#Dataframe for plotting
components <- as.data.frame(PCA$rotation) %>% mutate(gameID = as.numeric(rownames(.))) %>% left_join(titles)
We graph the data in terms of pairs of principal components. We can see how PCA chooses the component axes: The data varies most in the direction of principal component 1, then principal component 2, and so on. In our case, the variance explained by each component appears to stabilize after the first.
#Shared styling for the component-pair scatterplots
pc_style <- list(geom_point(color = "black", fill = "green", pch = 21),
                 theme(plot.margin = unit(c(0.25, 0.25, 0.25, 0.25), "cm"),
                       axis.title.x = element_text(size = 8, vjust = -0.5),
                       axis.title.y = element_text(size = 8)
                       ), xlim(-60, 60), ylim(-60, 60))
grid.arrange(
  ggplot(as.data.frame(PCA$x), aes(x = PC1, y = PC2)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC2, y = PC3)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC3, y = PC4)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC4, y = PC5)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC5, y = PC6)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC6, y = PC7)) + pc_style,
  ncol = 3
)
Pairs of the first 7 principal components
We can look at the top 10 loadings for each principal component and examine what kinds of games are included, to see whether they describe some sort of genre or series of games.
cbind(top_n(components, 10, PC1) %>% arrange(desc(PC1)) %>% select(PC1, title),
top_n(components, 10, PC2) %>% select(PC2, title),
top_n(components, 10, PC3) %>% select(PC3, title)) %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "responsive"))
| PC1 | title | PC2 | title | PC3 | title |
|---|---|---|---|---|---|
| 0.0866010 | Catan: 5-6 Player Extension | 0.0937714 | Tichu | 0.0998507 | War of the Ring (first edition) |
| 0.0754696 | Catan: Cities & Knights | 0.1207736 | Taj Mahal | 0.0964667 | Twilight Imperium (Third Edition) |
| 0.0738649 | Pandemic Legacy: Season 1 | 0.0982333 | The Princes of Florence | 0.1299529 | Through the Ages: A Story of Civilization |
| 0.0736338 | Boss Monster: The Dungeon Building Card Game | 0.1147404 | Genoa | 0.0952696 | Brass: Lancashire |
| 0.0735421 | XCOM: The Board Game | 0.1039938 | Amun-Re | 0.0943403 | Le Havre |
| 0.0702398 | Mansions of Madness: Second Edition | 0.1049687 | Agricola | 0.0976572 | Kemet |
| 0.0679850 | Space Hulk (third edition) | 0.0986251 | In the Year of the Dragon | 0.1001427 | Scythe |
| 0.0662898 | T.I.M.E Stories | 0.1211776 | The Castles of Burgundy | 0.1109133 | Mombasa |
| 0.0662785 | 7 Wonders: Leaders | 0.1003861 | Kingdom Builder | 0.0988313 | Food Chain Magnate |
| 0.0661670 | Catan: Seafarers | 0.1048600 | Bora Bora | 0.1274151 | Through the Ages: A New Story of Civilization |
Looking at the descriptions for the lists of games, there does not appear to be a common theme except that the first PC loads heavily on Settlers of Catan and its expansions. I cannot see a reason for attributing a theme to the other components aside from confirmation bias.
A quick plot of the cumulative explained variance from each component shows greatly diminished returns after the 6th or so PC. The variance explained by the first PC is significantly greater than that of the rest.
explained_variance <- PCA$sdev^2/sum(PCA$sdev^2)
qplot(0:length(explained_variance), c(0, cumsum(explained_variance))) + xlab("Principal Component #") + ylab("Cumulative Explained Variance")
Aside from exploratory analysis, we can use these components for further work such as simple regression or K-means clustering (a quick sketch of the latter is shown below). After that, we will put the PCA aside and move on to building a recommendation engine from scratch.
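As an illustration of the clustering idea (not part of the original analysis), here is a minimal sketch that runs k-means on the users' scores along the first few components; the choices of 6 components and 4 clusters are arbitrary:
#Illustrative only: k-means on user scores along the first 6 principal components.
set.seed(42)
user_scores <- as.data.frame(PCA$x[, 1:6])
user_clusters <- kmeans(user_scores, centers = 4, nstart = 25)
#Visualize the clusters in the plane of the first two components
ggplot(user_scores, aes(x = PC1, y = PC2, color = factor(user_clusters$cluster))) +
  geom_point() +
  labs(color = "Cluster")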
We will now attempt to build a recommendation engine for the elite user data. The process will use a basic User-User collaborative filtering scheme with some experimental adjustments to predict scores.
The similarity between two users will be determined by calculating a scaled Pearson’s Correlation coefficient:
\[s(u,v)=\frac{1}{(\vert D_{u,v}\vert + 1)^\alpha}\left[\frac{\sum_{i\in I_{common}}(r_{u,i}-\overline{r}_{u})(r_{v,i}-\overline{r}_{v})}{\sqrt{\sum_{i\in I_{common}}(r_{u,i}-\overline{r}_{u})^2}\sqrt{\sum_{i\in I_{common}}(r_{v,i}-\overline{r}_{v})^2}}+1\right]\] Where \(r_{u,i}\) and \(\overline{r}_{u}\) are user \(u\)'s rating for item \(i\) and \(u\)'s average rating across all items, respectively. \(I_{common}\) denotes the set of all items that \(u\) and \(v\) have both rated. \(\vert D_{u,v}\vert\) is the absolute difference in the number of ratings between users \(u\) and \(v\) (adding 1 keeps the scaling factor finite when the counts are equal). I include this adjustment under the suspicion that users who rate a lot of games tend to form a group that rates games similarly. The parameter \(\alpha\) controls the magnitude of this effect and will be optimized for predictive power later. For now, we will perform the analysis assuming \(\alpha = 1\).
We will write three functions to do our predictions. The first, similarity(user1, user2), computes the similarity score between two users. The second, predict_score(user, df_train, itemID), predicts a score given a user, training data, and an item ID. The prediction starts from the user in question's average rating and moves up or down depending on other users' ratings of the game in question, weighted by the similarity scores:
\[p_{u,i} = \overline{r}_u+\frac{\sum_{u'\in U'}s(u,u')(r_{u',i}-\overline{r}_{u'})}{\sum_{u'\in U'}\vert s(u,u')\vert}\] where \(p_{u,i}\) is the predicted score for user \(u\) and item \(i\), \(\overline{r}_u\) is the average score for user \(u\). \(U'\) and \(u'\) indicate the set of all other users and an individual other user respectively.
The last function, test_batch(test, train), loops over a batch of test cases. For each test case, it randomly selects an item with a rating and holds it out, and then attempts to predict the score for that item using predict_score(). At the end of the loop a dataframe containing the predicted/true values and the item ID will be returned.
#Similarity between two users: Pearson correlation (shifted to be non-negative)
#over the games both have rated, shrunk by the difference in their rating counts.
similarity <- function(user1, user2, alpha = 1){
  df <- rbind(user1, user2)
  #keep only the games rated by both users
  df <- df[, which((!is.na(df[1,])) & (!is.na(df[2,])))]
  #difference in the number of games each user has rated
  diff <- sum(!is.na(user1)) - sum(!is.na(user2))
  score <- 1/(abs(diff)+1)^alpha * (cor(as.matrix(df)[1,], as.matrix(df)[2,]) + 1)
  return(score)
}
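#Quick sanity check (an illustrative addition, not in the original analysis): the
#similarity between the first two users in the wide table, dropping the userID column.
#The bracketed correlation term lies in [0, 2] before the count-difference scaling.
similarity(elite_sparse[1, -1], elite_sparse[2, -1])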
#Predict a user's score for itemID from the training users' ratings, weighted by similarity.
#If sd = TRUE, each user's deviation from their mean is standardized by their rating standard deviation.
predict_score <- function(user, df_train, itemID, sd = FALSE){
user <- user %>%
rowwise() %>%
mutate(n = sum(!is.na(.))) %>%
ungroup() %>%
mutate(mean = apply(user %>% select(-n), 1, mean, na.rm = TRUE), sd = apply(user %>% select(-n), 1, sd, na.rm = TRUE))
relevant_users <- df_train %>% filter(!is.na(!!sym(itemID)))
#the pipe is broken here because apply() needs to reference the data frame by name
relevant_users <- relevant_users %>%
rowwise() %>%
mutate(n = sum(!is.na(.))) %>%
ungroup() %>%
mutate(mean = apply(as.matrix(relevant_users %>% select(-n)), 1, mean, na.rm = TRUE),
sd = apply(as.matrix(relevant_users %>% select(-n)), 1, sd, na.rm = TRUE)) %>%
filter(sd != 0)
#we must do this part of the dplyr pipe separately due to having to call the data frame within apply()
relevant_users <- relevant_users %>%
mutate(simscore = apply(relevant_users %>% select(c(-n, -mean, -sd)), 1, similarity, user))
if(sd){
predict = user$mean +
user$sd*sum(relevant_users$simscore*((relevant_users %>% select(itemID))-relevant_users$mean)/relevant_users$sd)/
sum(abs(relevant_users$simscore))
}
else{
predict = user$mean +
sum(relevant_users$simscore*((relevant_users %>% select(itemID))-relevant_users$mean))/
sum(abs(relevant_users$simscore))
}
return(predict)
}
test_batch <- function(test, train, ...){
pred <- true <- 0
item <- NA
for(i in 1:nrow(test)){
#get a random column which is not empty
#get the name of that column
#store these in "true" and "item" respectively
index <- sample(which(!is.na(test[i,])), 1)
true[i] <- test[i,index][[1]]
item[i] <- colnames(test[,index])
foo <- test[i,]
foo[item[i]] <- NA
pred[i] <- predict_score(foo, train, item[i], ...)
}
return(data.frame(predicted = pred, true = true, itemID = item))
}
We will randomly select 20% of the elite users and use them as the test cases. The rest will be used alongside each individual test case to calculate a predicted score.
indices <- sample(1:nrow(elite_sparse), floor(.2*nrow(elite_sparse)))
test <- elite_sparse[indices, -1]
train <- elite_sparse[-indices, -1]
preds <- test_batch(test, train)
head(preds) %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "responsive"))
| predicted | true | itemID |
|---|---|---|
| 6.384750 | 5.5 | 37380 |
| 6.412097 | 5.0 | 96848 |
| 7.141000 | 6.0 | 72125 |
| 7.014915 | 9.0 | 822 |
| 7.349141 | 7.0 | 18602 |
| 4.787054 | 6.0 | 2083 |
To see how our engine did, we can look at the root mean squared error (RMSE) of our predictions. This is the square root of the average squared difference between the predicted and true values: \(RMSE = \sqrt{\frac{1}{\vert U_{test}\vert}\sum_{u}(p_{u,i}-r_{u,i})^2}\)
RMSE <- sqrt(mean((preds$predicted - preds$true)^2))
RMSE
## [1] 1.319188
On average, our predictions are off by about 1.3 points, which is not particularly impressive. To improve performance, we can scale each user's deviation from their mean rating by the standard deviation of that user's ratings:
\[p_{u,i} = \overline{r}_u+\sigma_u\frac{\sum_{u'\in U'}s(u,u')(r_{u',i}-\overline{r}_{u'})/\sigma_{u'}}{\sum_{u'\in U'}\vert s(u,u')\vert}\]
Where \(\sigma_u\) and \(\sigma_{u'}\) indicate the standard deviation of the ratings of the user in question and of an arbitrary other user, respectively. Let's see if our predictions improved:
preds <- test_batch(test, train, sd = TRUE)
RMSE <- sqrt(mean((preds$predicted - preds$true)^2))
RMSE
## [1] 1.170208
We see a slight improvement, and our engine is now a little over a point off on average. Not terrible for a basic implementation, though one problem of note is that the process takes a non-trivial amount of computing time. In practice this may be an issue, and it is one of the main reasons why Item-Item filtering is preferred in some cases. Next, we can try tuning the \(\alpha\) parameter to see if it improves performance.
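As a closing sketch (not part of the analysis above), a simple grid search over \(\alpha\) could look like the following. The candidate values are arbitrary, and because predict_score() picks up similarity() from the global environment with its default \(\alpha\), the sketch temporarily overrides that default rather than threading an extra argument through the functions:
#Illustrative grid search over alpha; the candidate values are arbitrary.
#predict_score() looks up similarity() in the global environment, so changing the
#default alpha there changes the similarity used for every prediction.
alphas <- c(0, 0.5, 1, 1.5, 2)
rmse_by_alpha <- numeric(length(alphas))
for(i in seq_along(alphas)){
  formals(similarity)$alpha <- alphas[i]
  set.seed(42)  #hold out the same items for every alpha so the comparison is fair
  preds_a <- test_batch(test, train, sd = TRUE)
  rmse_by_alpha[i] <- sqrt(mean((preds_a$predicted - preds_a$true)^2))
}
formals(similarity)$alpha <- 1  #restore the default
data.frame(alpha = alphas, RMSE = rmse_by_alpha)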