The aim of this paper is to provide personalized recommendations for users based on their interests and preferences. This is possible thanks to review data from the world’s largest board game site. The dataset contains almost 19 million reviews from more than 400K users. The data dates from January 2022 and was downloaded from Kaggle.

I would like to use a recommendation method called collaborative filtering, which relies on similarities in the rating histories and preferences of users.

Image source: https://unsplash.com/photos/MvkyyiZy1JQ

Association Rules

An association rules algorithm is used to discover relations between variables. It can help to find the rules and determinants of the connections that exist in a dataset.

The following basic parameters of association rules are used to determine their quality [1]:

support - for a particular rule A->B, the proportion of transactions in the dataset that contain both A and B: support(A->B) = P(A∩B) = (number of transactions containing both A and B) / (total number of transactions).

confidence - a measure of the accuracy of the rule, given by the percentage of transactions containing A that also contain B: confidence(A->B) = P(B|A) = P(A∩B) / P(A).

lift - this measure quantifies the usefulness of an association rule. It shows how much more likely a customer is to buy good B when buying good A, compared with the entire dataset: lift(A->B) = P(A∩B) / (P(A)P(B)). Events A and B are independent when P(A∩B) = P(A)P(B). Thus a ratio close to 1 suggests that A and B are independent, while values well above 1 indicate a positive association.
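To make the three measures concrete, here is a minimal base R sketch on a made-up set of five transactions. The data and the helper `support_of()` are hypothetical and serve only to illustrate the definitions above; they are not part of the analysis.

```r
# Toy data (hypothetical): each element holds the games one user rated highly.
transactions <- list(
  u1 = c("Pandemic", "Catan"),
  u2 = c("Pandemic", "Catan", "Azul"),
  u3 = c("Pandemic"),
  u4 = c("Azul"),
  u5 = c("Catan", "Pandemic")
)

# Share of transactions that contain every item in `items`.
support_of <- function(items, trans) {
  mean(vapply(trans, function(t) all(items %in% t), logical(1)))
}

A <- "Catan"
B <- "Pandemic"

supp_AB <- support_of(c(A, B), transactions)       # P(A and B)
conf_AB <- supp_AB / support_of(A, transactions)   # P(B | A)
lift_AB <- conf_AB / support_of(B, transactions)   # P(A and B) / (P(A) * P(B))

c(support = supp_AB, confidence = conf_AB, lift = lift_AB)
```

In this toy set every user who likes Catan also likes Pandemic, so the confidence is 1, but because Pandemic is popular overall the lift is only 1.25.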

Loading necessary packages

# Loading necessary packages
library(arules)
library(arulesViz)
library(arulesCBA)

library(readr)
library(tidyverse)
library(digest)
library(knitr)
library(DT)

Dataset

As mentioned before, the dataset contains almost 19 million rows. First I am going to prepare it for the analysis.

There are around 400K unique users and around 21K unique board games.

Some ratings were outliers below 1, so I set every rating lower than 1 to 1.

# Loading dataset
bgg_19m_reviews <- read_csv("bgg-19m-reviews.csv")
# Set ratings below 1 to 1, then round ratings to one decimal place
bgg_19m_reviews <- bgg_19m_reviews %>% 
  mutate(rating=replace(rating, rating<1, 1)) %>% 
  mutate(rating=round(rating,1))

# Summary of dataset
summary(bgg_19m_reviews)
##       ...1              user               rating         comment         
##  Min.   :       0   Length:18964807    Min.   : 1.000   Length:18964807   
##  1st Qu.: 4741202   Class :character   1st Qu.: 6.000   Class :character  
##  Median : 9482403   Mode  :character   Median : 7.000   Mode  :character  
##  Mean   : 9482403                      Mean   : 7.082                     
##  3rd Qu.:14223604                      3rd Qu.: 8.000                     
##  Max.   :18964806                      Max.   :10.000                     
##        ID             name          
##  Min.   :     1   Length:18964807   
##  1st Qu.: 15987   Class :character  
##  Median :107529   Mode  :character  
##  Mean   :110146                     
##  3rd Qu.:181304                     
##  Max.   :350992

The summary shows that ratings range from 1 to 10 and that 75% of ratings are 6 or higher. The character columns (user, comment, name) are summarized only by their length.

After analyzing the distribution of the number of games rated per user, it is evident that there are outliers which should not be taken into consideration. Therefore I checked the quantiles of this variable and decided to remove the 5% of users with the highest number of rated games. Users of this kind may be professional reviewers, whose ratings are not representative for my recommendation analysis.

bgg_19m_reviews_cnt <- bgg_19m_reviews %>% group_by(user) %>% tally(sort=TRUE)
hist(bgg_19m_reviews_cnt$n, xlab = 'Number of games per user')

quantile(bgg_19m_reviews_cnt$n, c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99))
## 25% 50% 75% 90% 95% 99% 
##   2  12  44 115 197 496
bgg_19m_reviews_95 <- bgg_19m_reviews %>% group_by(user) %>% mutate(n = n()) %>% ungroup() %>% filter(n <= 197) 
hist(bgg_19m_reviews_95$n, xlab = 'Number of games per user')

Empirical Analysis

I decided to keep only high ratings for my first analysis. The goal is to take games rated at least 8 (the 3rd quartile) and build association rules for board game preferences based on the games users liked.

bgg_best <- bgg_19m_reviews_95 %>% filter(rating >= 8.00) %>% select(user,name)
bgg_best <- rename(bgg_best, game=name)
summary(bgg_best)
##      user               game          
##  Length:4917044     Length:4917044    
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
head(bgg_best)
## # A tibble: 6 × 2
##   user           game    
##   <chr>          <chr>   
## 1 katrinacarenne Pandemic
## 2 DSpangler      Pandemic
## 3 gregd          Pandemic
## 4 calbearfan     Pandemic
## 5 odustin        Pandemic
## 6 treece keenes  Pandemic

Now it is time to convert the data to a transactions object, a special data structure required for association rule mining. Here each user is one transaction and the highly rated games are its items.

The item frequency plot presents the most popular games in the data set. The top 3 are Pandemic, Terraforming Mars and 7 Wonders.

write.csv(bgg_best, file="bgg_best.csv")

bgg_best_trans <- read.transactions("bgg_best.csv", format="single", sep=",", cols=c("user","game"), header=TRUE)

LIST(head(bgg_best_trans))
## $`- V -`
## [1] "Carcassonne" "Citadels"    "Puerto Rico"
## 
## $`-DE-`
## [1] "Neuroshima Hex! 3.0"
## 
## $`-grizzly-`
## [1] "Memoir '44"     "San Juan"       "Ticket to Ride" "Tikal"         
## 
## $`-Loren-`
##  [1] "13 Dead End Drive"                                                     
##  [2] "Arkham Horror"                                                         
##  [3] "Aton"                                                                  
##  [4] "Babel"                                                                 
##  [5] "Camelot Legends"                                                       
##  [6] "Carcassonne"                                                           
##  [7] "Castle"                                                                
##  [8] "Catan Card Game"                                                       
##  [9] "Condottiere"                                                           
## [10] "Cuba"                                                                  
## [11] "Dragon Strike"                                                         
## [12] "Dungeon Twister"                                                       
## [13] "Dungeons & Dragons Miniatures"                                         
## [14] "Eldritch Horror"                                                       
## [15] "Fireball Island"                                                       
## [16] "HeroQuest Advanced Quest"                                              
## [17] "Magic: The Gathering"                                                  
## [18] "Middle-earth"                                                          
## [19] "Middle-Earth Quest"                                                    
## [20] "Othello"                                                               
## [21] "Puerto Rico"                                                           
## [22] "Rivals for Catan"                                                      
## [23] "Roma"                                                                  
## [24] "Scarab Lords"                                                          
## [25] "Scrabble"                                                              
## [26] "Sherlock Holmes Consulting Detective: The Thames Murders & Other Cases"
## [27] "Star Wars Miniatures"                                                  
## [28] "Stone Age"                                                             
## [29] "Super Fantasy: Ugly Snouts Assault"                                    
## [30] "Talisman: Revised 4th Edition"                                         
## [31] "Through the Ages: A New Story of Civilization"                         
## [32] "War of the Ring"                                                       
## [33] "World of Warcraft: The Boardgame"                                      
## 
## $`-LucaS-`
##  [1] "Amun-Re"                           "Citadels"                         
##  [3] "Domaine"                           "Dschunke"                         
##  [5] "Entdecker: Exploring New Horizons" "Fantasy Pub"                      
##  [7] "Kingsburg"                         "Neuroshima Hex! 3.0"              
##  [9] "Puerto Rico"                       "Race for the Galaxy"              
## [11] "San Juan"                          "Tortuga"                          
## 
## $`-mik-`
##  [1] "7 Wonders"                           
##  [2] "Arkham Horror"                       
##  [3] "Battlestar Galactica: The Board Game"
##  [4] "Brass: Lancashire"                   
##  [5] "Ca$h 'n Guns (Second Edition)"       
##  [6] "Carcassonne"                         
##  [7] "Catan"                               
##  [8] "Cosmic Encounter"                    
##  [9] "El Grande"                           
## [10] "Illuminati"                          
## [11] "Lord of the Rings: The Confrontation"
## [12] "Love Letter"                         
## [13] "Mansions of Madness"                 
## [14] "Pandemic"                            
## [15] "Power Grid"                          
## [16] "Puerto Rico"                         
## [17] "Space Alert"                         
## [18] "Stephenson's Rocket"                 
## [19] "The Resistance: Avalon"              
## [20] "Tigris & Euphrates"                  
## [21] "Wiz-War"
itemFrequencyPlot(bgg_best_trans, topN=30, type="relative", main="Item Frequency") 
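The conversion above goes through a CSV file. As a possible alternative, arules can also coerce a list of item vectors directly into a transactions object. The sketch below shows this on a small made-up data frame (`toy` and its contents are hypothetical); whether it is faster or more convenient on the full dataset has not been checked here.

```r
library(arules)

# Hypothetical miniature version of the (user, game) table
toy <- data.frame(
  user = c("u1", "u1", "u2", "u2", "u2", "u3"),
  game = c("Pandemic", "Catan", "Pandemic", "Catan", "Azul", "Azul"),
  stringsAsFactors = FALSE
)

# One character vector of games per user; unique() guards against
# a user rating the same title twice, which the coercion rejects
games_by_user <- lapply(split(toy$game, toy$user), unique)

# Coerce the list into a transactions object (one transaction per user)
toy_trans <- as(games_by_user, "transactions")

inspect(toy_trans)
```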

The density plot shows that the support of most games in the data set is very low, which is a consequence of the size of the data set. Therefore I decided to set the support threshold at 0.2%. The absolute minimum support count is still high enough and equals 758.

plot(density(itemFrequency(bgg_best_trans)), main = "Item Support Distribution", xlab = "Item Support", xlim=c(0,0.01))
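As a quick check of the threshold, the relative support of 0.2% can be translated into a number of users. The transaction count of 379,007 is taken from the Eclat log shown further down; the variable names are only illustrative.

```r
n_transactions <- 379007   # number of transactions (users) reported by eclat
min_support    <- 0.002    # relative support threshold of 0.2%

# Relative threshold expressed as a number of transactions
min_support * n_transactions            # 758.014

# Truncated value, matching the absolute count reported in the log
abs_count <- floor(min_support * n_transactions)
abs_count                               # 758

# Smallest whole number of users whose relative support reaches 0.002
min_users <- ceiling(min_support * n_transactions)
min_users / n_transactions
```

The last value rounds to 0.002003, which agrees with the minimum support reported in the rule summary.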

The confidence threshold is set at 70%. The minimum lift among the resulting rules is 5.321, which is a good result given that lift shows how much more likely a user is to rate game ‘y’ highly if they also rated game ‘x’ highly, compared with all users in the data set.

freq.items<-eclat(bgg_best_trans, parameter=list(supp=0.002, maxlen=20))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.002      1     20 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 758 
## 
## create itemset ... 
## set transactions ...[20716 item(s), 379007 transaction(s)] done [2.81s].
## sorting and recoding items ... [954 item(s)] done [0.09s].
## creating sparse bit matrix ... [954 row(s), 379007 column(s)] done [0.11s].
## writing  ... [51474 set(s)] done [14.95s].
## Creating S4 object  ... done [0.01s].
freq.rules<-ruleInduction(freq.items, bgg_best_trans, confidence=0.7) 

summary(freq.rules)
## set of 4384 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##   20  468 2620 1251   25 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   4.000   4.181   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence          lift            itemset     
##  Min.   :0.002003   Min.   :0.7000   Min.   :  5.321   Min.   :    1  
##  1st Qu.:0.002142   1st Qu.:0.7200   1st Qu.:  5.751   1st Qu.:15934  
##  Median :0.002383   Median :0.7444   Median :  6.164   Median :25117  
##  Mean   :0.002658   Mean   :0.7583   Mean   :  7.614   Mean   :25594  
##  3rd Qu.:0.002826   3rd Qu.:0.7814   3rd Qu.:  8.340   3rd Qu.:35345  
##  Max.   :0.029279   Max.   :0.9589   Max.   :256.997   Max.   :50392  
## 
## mining info:
##            data ntransactions support
##  bgg_best_trans        379007   0.002
##                                                                       call
##  eclat(data = bgg_best_trans, parameter = list(supp = 0.002, maxlen = 20))
##  confidence
##         0.7

The first 10 rules show that the highest lift is obtained for games and their add-ons, or for different versions of the same game, which is rather intuitive.

rules.by.lift<-sort(freq.rules, by="lift", decreasing=TRUE) # sorting by lift
inspect(head(rules.by.lift,10))
##      lhs                                           rhs                                              support confidence      lift itemset
## [1]  {Disney Villainous,                                                                                                                
##       Disney Villainous: Wicked to the Core}    => {Disney Villainous: Evil Comes Prepared}     0.002218956  0.7736891 256.99699       7
## [2]  {Disney Villainous,                                                                                                                
##       Disney Villainous: Evil Comes Prepared}   => {Disney Villainous: Wicked to the Core}      0.002218956  0.8310277 255.86133       7
## [3]  {Disney Villainous: Evil Comes Prepared}   => {Disney Villainous: Wicked to the Core}      0.002371988  0.7879053 242.58460       9
## [4]  {Disney Villainous: Wicked to the Core}    => {Disney Villainous: Evil Comes Prepared}     0.002371988  0.7303006 242.58460       9
## [5]  {Shadows of Brimstone: Swamps of Death}    => {Shadows of Brimstone: City of the Ancients} 0.002472250  0.8105536 169.25923       3
## [6]  {Smash Up,                                                                                                                         
##       Smash Up: Science Fiction Double Feature} => {Smash Up: Awesome Level 9000}               0.002461696  0.8583257 155.80050      11
## [7]  {Smash Up: Science Fiction Double Feature} => {Smash Up: Awesome Level 9000}               0.002567235  0.8259762 149.92853      13
## [8]  {Unmatched: Robin Hood vs. Bigfoot}        => {Unmatched: Battle of Legends, Volume One}   0.002073840  0.8178980 147.54359       4
## [9]  {Disney Villainous: Evil Comes Prepared,                                                                                           
##       Disney Villainous: Wicked to the Core}    => {Disney Villainous}                          0.002218956  0.9354839  96.79359       7
## [10] {Disney Villainous: Evil Comes Prepared}   => {Disney Villainous}                          0.002670135  0.8869413  91.77094       8
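The reported lift can be cross-checked by hand from the table. Rules [3] and [4] are the two directions of the same pair of items, so lift(A->B) = confidence(A->B) * confidence(B->A) / support(A∩B). The sketch below only re-uses the rounded values printed above, so small rounding differences are expected.

```r
# Values copied from rules [3] and [4] in the table above
supp_pair <- 0.002371988   # support of both expansions together
conf_3    <- 0.7879053     # Evil Comes Prepared -> Wicked to the Core
conf_4    <- 0.7303006     # Wicked to the Core -> Evil Comes Prepared
n_transactions <- 379007

# Implied support of each single item: P(A) = P(A and B) / P(B | A)
supp_evil   <- supp_pair / conf_3
supp_wicked <- supp_pair / conf_4

# Lift from its definition: P(A and B) / (P(A) * P(B))
lift_check <- supp_pair / (supp_evil * supp_wicked)
lift_check

# Approximate number of users who rated both expansions highly
users_both <- round(supp_pair * n_transactions)
users_both
```

The recomputed lift is about 242.58, in line with the value in the table, and it comes from roughly 899 users, so the very high lift rests on a small group of fans.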

The graph below shows that the game titles listed above, which have the highest lift, are separated from the remaining titles, so that two large groups are formed.

Games which appear relatively often and have high confidence are marked with larger orange nodes.

# plot(freq.rules, method = "graph", measure = "support", shading = "confidence", main = 
# "Association Rules Graph")

As seen on the plot below, most rules have support below 1%. The range of lift is broad: the maximum value is 257, but 75% of rules have lift between 5.3 and 8.3, which is still high.

plot(freq.rules, measure=c("support","lift"), shading="confidence")

Now let us check some specific examples of games and the recommendations produced by the model. Recommendations are sorted by lift.

The first game considered is “Terraforming Mars”. It appears on the left-hand side of the rules together with various other games, and these sets are strongly associated with “Great Western Trail”, which is the recommended game for many of them.

Another example is “7 Wonders Duel”, which is strongly associated with Pandemic Legacy. In the left-hand sets it often occurs together with Pandemic Legacy: Season 2, which is probably the reason for the association with Pandemic Legacy: Season 1.

For users who give high ratings to “Robinson Crusoe: Adventures on the Cursed Island” and “Great Western Trail”, together with a third title such as Scythe or Wingspan, the model recommends “Terraforming Mars”.

# specific games
TMars <- sort(subset(freq.rules, lhs %in% "Terraforming Mars"), by="lift", decreasing=TRUE)
inspect(head(TMars,10))
##      lhs                               rhs                       support confidence     lift itemset
## [1]  {Star Realms: Colony Wars,                                                                     
##       Terraforming Mars}            => {Star Realms}         0.002448504  0.8505958 20.96110     429
## [2]  {Barrage,                                                                                      
##       Gaia Project,                                                                                 
##       Terraforming Mars}            => {Brass: Birmingham}   0.002073840  0.7164995 16.37571    3254
## [3]  {Barrage,                                                                                      
##       Great Western Trail,                                                                          
##       Terraforming Mars}            => {Brass: Birmingham}   0.002596258  0.7028571 16.06391    3265
## [4]  {Mombasa,                                                                                      
##       Terraforming Mars,                                                                            
##       Tzolk'in: The Mayan Calendar} => {Great Western Trail} 0.002026348  0.7603960 15.95060    3125
## [5]  {Mombasa,                                                                                      
##       Terraforming Mars,                                                                            
##       The Castles of Burgundy}      => {Great Western Trail} 0.002387819  0.7541667 15.81993    3136
## [6]  {Mombasa,                                                                                      
##       Scythe,                                                                                       
##       Terraforming Mars}            => {Great Western Trail} 0.002158271  0.7463504 15.65597    3137
## [7]  {Mombasa,                                                                                      
##       Terra Mystica,                                                                                
##       Terraforming Mars}            => {Great Western Trail} 0.002058009  0.7317073 15.34880    3135
## [8]  {Brass: Birmingham,                                                                            
##       Terraforming Mars,                                                                            
##       The Castles of Burgundy,                                                                      
##       Tzolk'in: The Mayan Calendar} => {Great Western Trail} 0.002189933  0.7223673 15.15288   34663
## [9]  {Concordia,                                                                                    
##       Gaia Project,                                                                                 
##       Terraforming Mars,                                                                            
##       The Castles of Burgundy}      => {Great Western Trail} 0.002134525  0.7203918 15.11144   19591
## [10] {Clans of Caledonia,                                                                           
##       Terraforming Mars,                                                                            
##       The Voyages of Marco Polo}    => {Great Western Trail} 0.002013155  0.7171053 15.04250   10670
wonders <- sort(subset(freq.rules, lhs %in% "7 Wonders Duel"), by="lift", decreasing=TRUE)
inspect(head(wonders,10))
##      lhs                             rhs                             support confidence     lift itemset
## [1]  {7 Wonders Duel,                                                                                   
##       Star Realms: Colony Wars}   => {Star Realms}               0.002532935  0.8384279 20.66125     431
## [2]  {7 Wonders Duel,                                                                                   
##       Codenames,                                                                                        
##       Pandemic Legacy: Season 2}  => {Pandemic Legacy: Season 1} 0.002218956  0.9292818 11.74602    6530
## [3]  {7 Wonders Duel,                                                                                   
##       Gloomhaven,                                                                                       
##       Pandemic Legacy: Season 2}  => {Pandemic Legacy: Season 1} 0.002744013  0.9285714 11.73704    6507
## [4]  {7 Wonders Duel,                                                                                   
##       Pandemic,                                                                                         
##       Pandemic Legacy: Season 2}  => {Pandemic Legacy: Season 1} 0.003084376  0.9226519 11.66222    6543
## [5]  {7 Wonders Duel,                                                                                   
##       Azul,                                                                                             
##       Pandemic Legacy: Season 2}  => {Pandemic Legacy: Season 1} 0.002332411  0.9141675 11.55497    6525
## [6]  {7 Wonders Duel,                                                                                   
##       Pandemic Legacy: Season 2,                                                                        
##       Wingspan}                   => {Pandemic Legacy: Season 1} 0.002411565  0.9140000 11.55286    6533
## [7]  {7 Wonders,                                                                                        
##       7 Wonders Duel,                                                                                   
##       Pandemic Legacy: Season 2}  => {Pandemic Legacy: Season 1} 0.002625281  0.9120073 11.52767    6541
## [8]  {7 Wonders Duel,                                                                                   
##       Pandemic Legacy: Season 2,                                                                        
##       Scythe}                     => {Pandemic Legacy: Season 1} 0.002722905  0.9100529 11.50297    6537
## [9]  {7 Wonders Duel,                                                                                   
##       Pandemic Legacy: Season 2}  => {Pandemic Legacy: Season 1} 0.005968227  0.8916043 11.26978    6547
## [10] {7 Wonders Duel,                                                                                   
##       Pandemic Legacy: Season 2,                                                                        
##       Terraforming Mars}          => {Pandemic Legacy: Season 1} 0.003358777  0.8871080 11.21294    6542
robinson <- sort(subset(freq.rules, lhs %in% "Robinson Crusoe: Adventures on the Cursed Island"), by="lift", decreasing=TRUE)
inspect(head(robinson,10))
##      lhs                                                    rhs                             support confidence      lift itemset
## [1]  {Pandemic Legacy: Season 2,                                                                                                
##       Robinson Crusoe: Adventures on the Cursed Island}  => {Pandemic Legacy: Season 1} 0.003174084  0.9072398 11.467408    6475
## [2]  {Dominion: Intrigue,                                                                                                       
##       Robinson Crusoe: Adventures on the Cursed Island}  => {Dominion}                  0.002765120  0.8111455  8.434289   23383
## [3]  {7 Wonders,                                                                                                                
##       Azul,                                                                                                                     
##       Robinson Crusoe: Adventures on the Cursed Island}  => {7 Wonders Duel}            0.002063286  0.7057762  6.467459   38869
## [4]  {Gaia Project,                                                                                                             
##       Robinson Crusoe: Adventures on the Cursed Island,                                                                         
##       Scythe}                                            => {Terraforming Mars}         0.002142441  0.8136273  6.360383   19625
## [5]  {Great Western Trail,                                                                                                      
##       Robinson Crusoe: Adventures on the Cursed Island,                                                                         
##       Through the Ages: A New Story of Civilization}     => {Terraforming Mars}         0.002224233  0.7733945  6.045870   27589
## [6]  {Great Western Trail,                                                                                                      
##       Robinson Crusoe: Adventures on the Cursed Island,                                                                         
##       Scythe}                                            => {Terraforming Mars}         0.003287538  0.7686613  6.008870   38635
## [7]  {Great Western Trail,                                                                                                      
##       Robinson Crusoe: Adventures on the Cursed Island,                                                                         
##       Viticulture Essential Edition}                     => {Terraforming Mars}         0.002269087  0.7658059  5.986548   38627
## [8]  {Great Western Trail,                                                                                                      
##       Robinson Crusoe: Adventures on the Cursed Island,                                                                         
##       Wingspan}                                          => {Terraforming Mars}         0.002263810  0.7647059  5.977949   38634
## [9]  {Gloomhaven,                                                                                                               
##       Great Western Trail,                                                                                                      
##       Robinson Crusoe: Adventures on the Cursed Island}  => {Terraforming Mars}         0.002306026  0.7547496  5.900117   38629
## [10] {Great Western Trail,                                                                                                      
##       Robinson Crusoe: Adventures on the Cursed Island,                                                                         
##       The Castles of Burgundy}                           => {Terraforming Mars}         0.002754567  0.7510791  5.871424   38631
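The tables above list rules for one game at a time. One way such rules could be turned into recommendations for a single user is sketched below in base R. The helper `recommend()` and the small rule table are hypothetical; the three rules and their lift values are copied from the Robinson Crusoe output above. A rule applies when all games on its left-hand side are among the user's liked games, and games the user already likes are not recommended again.

```r
# Three rules copied from the output above (rhs and lift); illustrative subset only
rules <- data.frame(
  rhs  = c("Pandemic Legacy: Season 1", "Dominion", "Terraforming Mars"),
  lift = c(11.467408, 8.434289, 6.008870),
  stringsAsFactors = FALSE
)
rules$lhs <- list(
  c("Pandemic Legacy: Season 2",
    "Robinson Crusoe: Adventures on the Cursed Island"),
  c("Dominion: Intrigue",
    "Robinson Crusoe: Adventures on the Cursed Island"),
  c("Great Western Trail",
    "Robinson Crusoe: Adventures on the Cursed Island",
    "Scythe")
)

# Hypothetical helper: games recommended for a set of liked games,
# ordered by the lift of the rule that produced them
recommend <- function(liked, rules) {
  applies <- vapply(rules$lhs, function(l) all(l %in% liked), logical(1))
  hits <- rules[applies & !(rules$rhs %in% liked), ]
  hits <- hits[order(-hits$lift), ]
  unique(hits$rhs)
}

liked <- c("Robinson Crusoe: Adventures on the Cursed Island",
           "Great Western Trail", "Scythe", "Dominion: Intrigue")
recs <- recommend(liked, rules)
recs
```

For this example user the sketch returns Dominion first and Terraforming Mars second, because the Pandemic Legacy rule does not apply without Pandemic Legacy: Season 2.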

Conclusion

The low level of support observed after applying the Eclat algorithm is caused by the large data set with many unique users who have different preferences. Therefore it is not possible to find a game that suits everyone; instead, the rules are local and hold among fans of specific games. Such rules may be very useful in a recommendation tool, for example one implemented by a company selling games that would like to send offers of recommended products to its customers.

Bibliography


  1. Larose, D.T., Larose, C.D. (2014). Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition. John Wiley & Sons, Inc. Pages 250-251, 259-260. doi: https://doi.org/10.1002/9781118874059.ch12