Credit to Matthew Borthwick for scraping and providing this data.
Data Source: boardgamegeek.com
The data in question are user ratings for boardgames. This notebook covers the cleaning/reshaping of the data, principal component analysis, and ratings prediction with User-User collaborative filtering.
Our tasks are as follows:
1. Clean and reshape the data into a user-by-game ratings table.
2. Explore the ratings with principal component analysis.
3. Predict ratings with User-User collaborative filtering.
Let's first download the data and take a look.
source("requirements.R")
ggthemr("dust")
elite <- read_csv("Data/boardgame-elite-users.csv") %>% rename(userID = `Compiled from boardgamegeek.com by Matt Borthwick`)
titles <- read_csv("Data/boardgame-titles.csv") %>% rename(gameID = `boardgamegeek.com game ID`)
elite[1:5,] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| userID | gameID | rating |
|---|---|---|
| 272 | 9216 | 1 |
| 272 | 9217 | 1 |
| 272 | 39938 | 1 |
| 272 | 15363 | 5 |
| 272 | 24068 | 1 |
titles[1:5,] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover"))
| gameID | title |
|---|---|
| 13004 | The Downfall of Pompeii |
| 66188 | Fresco |
| 503 | Through the Desert |
| 66690 | Dominion: Prosperity |
| 150 | PitchCar |
For each row, we have a user ID, a game ID, and that user's rating for the game. We can explore the data a bit more. Let's calculate the average and standard deviation of ratings for each game, as well as how many times it was reviewed. A plot of average rating shows a mild upward trend as the number of reviews increases.
#Group by game, calculate the within-group average, standard deviation, and ratings count, and join with the titles dataframe.
elite <- elite %>% group_by(gameID) %>%
mutate(avg_rating = mean(rating), sd_rating = sd(rating), freq = n()) %>%
left_join(titles)
#kable() function makes nice tables
elite[1:5,] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "responsive"))
| userID | gameID | rating | avg_rating | sd_rating | freq | title |
|---|---|---|---|---|---|---|
| 272 | 9216 | 1 | 7.555253 | 1.362515 | 178 | Goa |
| 272 | 9217 | 1 | 7.078808 | 1.399039 | 193 | Saint Petersburg |
| 272 | 39938 | 1 | 6.643584 | 1.457874 | 173 | Carson City |
| 272 | 15363 | 5 | 7.036125 | 1.317551 | 160 | Nexus Ops |
| 272 | 24068 | 1 | 6.440321 | 1.378874 | 156 | Shadow Hunters |
#Plot mean and standard deviation against review frequency
ggplot(elite) +
  geom_point(aes(x = freq, y = sd_rating, col = "Standard Deviation")) +
  geom_point(aes(x = freq, y = avg_rating, col = "Average")) +
  xlab("Number of Reviews") + ylab("Value") +
  scale_color_manual(values = c("Average" = "blue", "Standard Deviation" = "red"))
Scatterplots of mean rating (blue) and standard deviation (red) against number of reviews
We ultimately want the data to contain a single row per user, with one column per game holding that user's rating.
To get the data into this format, we lean on the tidyverse, and tidyr's spread() function in particular, which converts the data into a wide format:
elite_sparse <- elite %>%
  select(userID, gameID, rating) %>%
  spread(gameID, rating)
elite_sparse[1:5, 1:15] %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "responsive"))
| userID | 3 | 5 | 10 | 11 | 12 | 13 | 18 | 41 | 42 | 45 | 46 | 47 | 50 | 51 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 272 | 1 | 1 | 1 | NA | 1 | 7.5 | 1 | 4 | 1 | 6 | 1 | 1 | NA | 3 |
| 388 | 8 | 7 | 8 | 8 | 8 | 7.0 | 8 | 9 | 9 | 8 | 10 | 7 | 10 | 6 |
| 430 | 6 | 6 | 7 | 7 | 6 | 8.0 | NA | 5 | 6 | 9 | 6 | 6 | 3 | 7 |
| 2044 | 8 | 7 | 7 | 7 | NA | 8.0 | 7 | 7 | 7 | 6 | 7 | NA | 8 | 7 |
| 3080 | 9 | NA | 5 | 8 | 7 | 8.0 | 8 | 7 | 8 | 9 | 7 | NA | 7 | 6 |
As you can see, each reviewer does not have a score for every game, and we end up with a lot of missing values even for this set of “elite” reviewers. There are a few methods that can be used to fill the missing cells, including imputing the mean/median rating. Here I will use a set of functions from the missMDA package that uses iterative methods to estimate likely values. Some of the imputed values will not make sense (negative ratings, or ratings above the maximum); however, the result still provides a workable base for PCA.
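For reference, a minimal sketch of the simpler mean-imputation alternative (not what is used below) might look like this:
#For reference only: fill each game's missing ratings with that game's mean rating.
#The missMDA-based imputation below is what the analysis actually uses.
mean_fill <- function(x){ x[is.na(x)] <- mean(x, na.rm = TRUE); x }
elite_mean_filled <- as.data.frame(elite_sparse)
elite_mean_filled[,-1] <- lapply(elite_mean_filled[,-1], mean_fill)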
Two steps are performed to fill the table:
1. Estimate the number of dimensions (components) to be used in PCA.
2. Fill the missing cells assuming the number of dimensions estimated in step 1.
#Estimate dimensions
ncomp <- estim_ncpPCA(as.data.frame(elite_sparse[,2:ncol(elite_sparse)]), ncp.min = 0, ncp.max = 6)
#Fill the matrix
elite_sparse_filled <- imputePCA(as.data.frame(elite_sparse[,2:ncol(elite_sparse)]),
ncp = ncomp$ncp, scale = TRUE,
method = "EM", row.w = NULL, coeff.ridge = 1,
threshold = 1e-06, seed = NULL, nb.init = 1, maxiter = 1000)$completeObs
#Perform PCA
PCA <- prcomp(elite_sparse_filled, retx = TRUE)
#Dataframe for plotting
components <- as.data.frame(PCA$rotation) %>% mutate(gameID = as.numeric(rownames(.))) %>% left_join(titles)
We graph the data in terms of pairs of principal components. We can see how PCA chooses the component axes: The data varies most in the direction of principal component 1, then principal component 2, and so on. In our case, the variance explained by each component appears to stabilize after the first.
#Shared styling for the component-pair scatterplots
pc_style <- list(geom_point(color = "black", fill = "green", pch = 21),
                 theme(plot.margin = unit(c(0.25, 0.25, 0.25, 0.25), "cm"),
                       axis.title.x = element_text(size = 8, vjust = -0.5),
                       axis.title.y = element_text(size = 8)
                       ), xlim(-60, 60), ylim(-60, 60))
grid.arrange(
  ggplot(as.data.frame(PCA$x), aes(x = PC1, y = PC2)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC2, y = PC3)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC3, y = PC4)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC4, y = PC5)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC5, y = PC6)) + pc_style,
  ggplot(as.data.frame(PCA$x), aes(x = PC6, y = PC7)) + pc_style,
  ncol = 3
)
Pairs of the first 7 principal components
We can look at the top 10 loadings for each principal component and examine what kinds of games are included, to see whether they describe some sort of genre or series of games.
cbind(top_n(components, 10, PC1) %>% arrange(desc(PC1)) %>% select(PC1, title),
top_n(components, 10, PC2) %>% select(PC2, title),
top_n(components, 10, PC3) %>% select(PC3, title)) %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "responsive"))
| PC1 | title | PC2 | title | PC3 | title |
|---|---|---|---|---|---|
| 0.0866010 | Catan: 5-6 Player Extension | 0.0937714 | Tichu | 0.0998507 | War of the Ring (first edition) |
| 0.0754696 | Catan: Cities & Knights | 0.1207736 | Taj Mahal | 0.0964667 | Twilight Imperium (Third Edition) |
| 0.0738649 | Pandemic Legacy: Season 1 | 0.0982333 | The Princes of Florence | 0.1299529 | Through the Ages: A Story of Civilization |
| 0.0736338 | Boss Monster: The Dungeon Building Card Game | 0.1147404 | Genoa | 0.0952696 | Brass: Lancashire |
| 0.0735421 | XCOM: The Board Game | 0.1039938 | Amun-Re | 0.0943403 | Le Havre |
| 0.0702398 | Mansions of Madness: Second Edition | 0.1049687 | Agricola | 0.0976572 | Kemet |
| 0.0679850 | Space Hulk (third edition) | 0.0986251 | In the Year of the Dragon | 0.1001427 | Scythe |
| 0.0662898 | T.I.M.E Stories | 0.1211776 | The Castles of Burgundy | 0.1109133 | Mombasa |
| 0.0662785 | 7 Wonders: Leaders | 0.1003861 | Kingdom Builder | 0.0988313 | Food Chain Magnate |
| 0.0661670 | Catan: Seafarers | 0.1048600 | Bora Bora | 0.1274151 | Through the Ages: A New Story of Civilization |
Looking at the descriptions for the lists of games, there does not appear to be a common theme except that the first PC loads heavily on Settlers of Catan and its expansions. I cannot see a reason for attributing a theme to the other components aside from confirmation bias.
A quick plot of the cumulative explained variance from each component shows greatly diminished returns after the 6th or so PC. The variance explained by the first PC is significantly greater than that of the rest.
explained_variance <- PCA$sdev^2/sum(PCA$sdev^2)
qplot(0:length(explained_variance), c(0, cumsum(explained_variance))) + xlab("Principal Component #") + ylab("Cumulative Explained Variance")
Aside from exploratory analysis, we can use these components for further work such as simple regression or K-means clustering (a quick sketch of the latter is shown below). After that, we will put the PCA aside and move on to building a recommendation engine from scratch.
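As an illustration of the clustering idea (not part of the original analysis), here is a minimal sketch that runs k-means on the users' scores along the first few components; the choices of 6 components and 4 clusters are arbitrary:
#Illustrative only: k-means on user scores along the first 6 principal components.
set.seed(42)
user_scores <- as.data.frame(PCA$x[, 1:6])
user_clusters <- kmeans(user_scores, centers = 4, nstart = 25)
#Visualize the clusters in the plane of the first two components
ggplot(user_scores, aes(x = PC1, y = PC2, color = factor(user_clusters$cluster))) +
  geom_point() +
  labs(color = "Cluster")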
We will now attempt to build a recommendation engine for the elite user data. The process will use a basic User-User collaborative filtering scheme with some experimental adjustments to predict scores.
The similarity between two users will be determined by calculating a scaled Pearson’s Correlation coefficient:
\[s(u,v)=\frac{1}{(\vert D_{u,v}\vert + 1)^\alpha}\left[\frac{\sum_{i\in I_{common}}(r_{u,i}-\overline{r}_{u})(r_{v,i}-\overline{r}_{v})}{\sqrt{\sum_{i\in I_{common}}(r_{u,i}-\overline{r}_{u})^2}\sqrt{\sum_{i\in I_{common}}(r_{v,i}-\overline{r}_{v})^2}}+1\right]\] Where \(r_{u,i}\) and \(\overline{r}_{u}\) are user \(u\)'s rating for item \(i\) and \(u\)'s average rating across all items, respectively. \(I_{common}\) denotes the set of all items that \(u\) and \(v\) have both rated. \(\vert D_{u,v}\vert\) is the absolute difference in the number of ratings between users \(u\) and \(v\) (adding 1 keeps the scaling factor finite when the counts are equal). I include this adjustment under the suspicion that users who rate a lot of games tend to form a group that rates games similarly. The parameter \(\alpha\) controls the magnitude of this effect and will be optimized for predictive power later. For now, we will perform the analysis assuming \(\alpha = 1\).
We will write three functions to do our predictions. The first, similarity(user1, user2), computes the similarity score between two users. The second, predict_score(user, df_train, itemID), predicts a score given a user, training data, and an item ID. The prediction starts from the user in question's average rating and moves up or down depending on other users' ratings of the game in question, weighted by the similarity scores:
\[p_{u,i} = \overline{r}_u+\frac{\sum_{u'\in U'}s(u,u')(r_{u',i}-\overline{r}_{u'})}{\sum_{u'\in U'}\vert s(u,u')\vert}\] where \(p_{u,i}\) is the predicted score for user \(u\) and item \(i\), \(\overline{r}_u\) is the average score for user \(u\). \(U'\) and \(u'\) indicate the set of all other users and an individual other user respectively.
The last function, test_batch(test, train), loops over a batch of test cases. For each test case, it randomly selects an item with a rating and holds it out, and then attempts to predict the score for that item using predict_score(). At the end of the loop a dataframe containing the predicted/true values and the item ID will be returned.
#Similarity between two users: Pearson correlation (shifted to be non-negative)
#over the games both have rated, shrunk by the difference in their rating counts.
similarity <- function(user1, user2, alpha = 1){
  df <- rbind(user1, user2)
  #keep only the games rated by both users
  df <- df[, which((!is.na(df[1,])) & (!is.na(df[2,])))]
  #difference in the number of games each user has rated
  diff <- sum(!is.na(user1)) - sum(!is.na(user2))
  score <- 1/(abs(diff)+1)^alpha * (cor(as.matrix(df)[1,], as.matrix(df)[2,]) + 1)
  return(score)
}
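#Quick sanity check (an illustrative addition, not in the original analysis): the
#similarity between the first two users in the wide table, dropping the userID column.
#The bracketed correlation term lies in [0, 2] before the count-difference scaling.
similarity(elite_sparse[1, -1], elite_sparse[2, -1])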
#Predict a user's score for itemID from the training users' ratings, weighted by similarity.
#If sd = TRUE, each user's deviation from their mean is standardized by their rating standard deviation.
predict_score <- function(user, df_train, itemID, sd = FALSE){
user <- user %>%
rowwise() %>%
mutate(n = sum(!is.na(.))) %>%
ungroup() %>%
mutate(mean = apply(user %>% select(-n), 1, mean, na.rm = TRUE), sd = apply(user %>% select(-n), 1, sd, na.rm = TRUE))
relevant_users <- df_train %>% filter(!is.na(!!sym(itemID)))
#the pipe is broken here because apply() needs to reference the data frame by name
relevant_users <- relevant_users %>%
rowwise() %>%
mutate(n = sum(!is.na(.))) %>%
ungroup() %>%
mutate(mean = apply(as.matrix(relevant_users %>% select(-n)), 1, mean, na.rm = TRUE),
sd = apply(as.matrix(relevant_users %>% select(-n)), 1, sd, na.rm = TRUE)) %>%
filter(sd != 0)
#we must do this part of the dplyr pipe separately due to having to call the data frame within apply()
relevant_users <- relevant_users %>%
mutate(simscore = apply(relevant_users %>% select(c(-n, -mean, -sd)), 1, similarity, user))
if(sd){
predict = user$mean +
user$sd*sum(relevant_users$simscore*((relevant_users %>% select(itemID))-relevant_users$mean)/relevant_users$sd)/
sum(abs(relevant_users$simscore))
}
else{
predict = user$mean +
sum(relevant_users$simscore*((relevant_users %>% select(itemID))-relevant_users$mean))/
sum(abs(relevant_users$simscore))
}
return(predict)
}
test_batch <- function(test, train, ...){
pred <- true <- 0
item <- NA
for(i in 1:nrow(test)){
#get a random column which is not empty
#get the name of that column
#store these in "true" and "item" respectively
index <- sample(which(!is.na(test[i,])), 1)
true[i] <- test[i,index][[1]]
item[i] <- colnames(test[,index])
foo <- test[i,]
foo[item[i]] <- NA
pred[i] <- predict_score(foo, train, item[i], ...)
}
return(data.frame(predicted = pred, true = true, itemID = item))
}
We will randomly select 20% of the elite users and use them as the test cases. The rest will be used alongside each individual test case to calculate a predicted score.
indices <- sample(1:nrow(elite_sparse), floor(.2*nrow(elite_sparse)))
test <- elite_sparse[indices, -1]
train <- elite_sparse[-indices, -1]
preds <- test_batch(test, train)
head(preds) %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "responsive"))
| predicted | true | itemID |
|---|---|---|
| 6.384750 | 5.5 | 37380 |
| 6.412097 | 5.0 | 96848 |
| 7.141000 | 6.0 | 72125 |
| 7.014915 | 9.0 | 822 |
| 7.349141 | 7.0 | 18602 |
| 4.787054 | 6.0 | 2083 |
To see how our engine did, we can look at the root mean squared error (RMSE) of our predictions. This is the square root of the average squared difference between the predicted and true values: \(RMSE = \sqrt{\frac{1}{\vert U_{test}\vert}\sum_{u}(p_{u,i}-r_{u,i})^2}\)
RMSE <- sqrt(mean((preds$predicted - preds$true)^2))
RMSE
## [1] 1.319188
On average, our predictions are off by about 1.3 points, which is not particularly impressive. To improve performance, we can scale each user's deviation from their mean rating by the standard deviation of that user's ratings:
\[p_{u,i} = \overline{r}_u+\sigma_u\frac{\sum_{u'\in U'}s(u,u')(r_{u',i}-\overline{r}_{u'})/\sigma_{u'}}{\sum_{u'\in U'}\vert s(u,u')\vert}\]
Where \(\sigma_u\) and \(\sigma_{u'}\) indicate the standard deviation of the ratings of the user in question and of an arbitrary other user, respectively. Let's see if our predictions improved:
preds <- test_batch(test, train, sd = TRUE)
RMSE <- sqrt(mean((preds$predicted - preds$true)^2))
RMSE
## [1] 1.170208
We see a slight improvement, and our engine is now a little over a point off on average. Not terrible for a basic implementation, though one problem of note is that the process takes a non-trivial amount of computing time. In practice this may be an issue, and it is one of the main reasons why Item-Item filtering is preferred in some cases. Next, we can try tuning the \(\alpha\) parameter to see if it improves performance.
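As a closing sketch (not part of the analysis above), a simple grid search over \(\alpha\) could look like the following. The candidate values are arbitrary, and because predict_score() picks up similarity() from the global environment with its default \(\alpha\), the sketch temporarily overrides that default rather than threading an extra argument through the functions:
#Illustrative grid search over alpha; the candidate values are arbitrary.
#predict_score() looks up similarity() in the global environment, so changing the
#default alpha there changes the similarity used for every prediction.
alphas <- c(0, 0.5, 1, 1.5, 2)
rmse_by_alpha <- numeric(length(alphas))
for(i in seq_along(alphas)){
  formals(similarity)$alpha <- alphas[i]
  set.seed(42)  #hold out the same items for every alpha so the comparison is fair
  preds_a <- test_batch(test, train, sd = TRUE)
  rmse_by_alpha[i] <- sqrt(mean((preds_a$predicted - preds_a$true)^2))
}
formals(similarity)$alpha <- 1  #restore the default
data.frame(alpha = alphas, RMSE = rmse_by_alpha)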