Board Game Geek is the most popular website aggregating all board games ever published, as well as all the addons.
For many games when you do a Google Search, the BGG comes in TOP 5 results, frequently as the first spot. This means that the when someone tries to find out if a particular game is good, they frequently land on BGG.
However, unlike other mediums (books, movies etc.) the reviews on BGG do not represent the generic population. Because the website is targeted to enthusiasts, a significant portion of grades are from only a (geeky) part of the gaming community.
Since reviews on BGG are mostly prepared by enthusiasts, but are used to drive buying decisions of the general population, we want to see what it takes to achieve a favorable position in the general ranking.
Are some genres more favored (e.g. strategy/war games)? Does complexity matter? Are controversial games more, or less popular?
We start with the standard libraries (tidyverse, ggplot2) and add some interesting extensions.
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(ggrepel)
library(cowplot)
library(gganimate)
library(ggridges)
library(viridis)
library(ggtech) # For the airbnb theme
library(extrafont) # To add the font for Airbnb theme
library(GGally) # Grapihical correlogram
library(knitr) # Nice tables-kables
library(kableExtra)
# Download this fonts and install manually
# https://github.com/ricardo-bion/ggtech/blob/master/Circular%20Air-Medium%203.46.45%20PM.ttf
# https://github.com/ricardo-bion/ggtech/blob/master/Circular%20Air-Bold%203.46.45%20PM.ttf
# Import all fonts, takes a few minutes but removes the annoying warnings
# font_import()
# font_import(pattern = 'Circular', prompt=FALSE)
# loadfonts(device = "win")
# Lets set the theme to this nice one from Airbnb
# Colors: c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051")
theme_set(theme_airbnb_fancy())
pink = "#FF5A5F"
orange = "#FFB400"
blueGreen = "#007A87"
flesh = "#FFAA91"
purple = "#7B0051"
options(scipen=999) #avoiding e10 notation
Some facts about the dataset:
reviews <- 'data/bgg-13m-reviews.csv' %>%
read_csv()
games <- 'data/games_detailed_info.csv' %>%
read_csv()
We need to drop some columns and rename others to make life easier with ggplot.
# Drop the index
reviews <- reviews %>%
select(-X1)
# Rename and drop some variables
games <- games %>%
select(
"Abstract_Rank" = "Abstract Game Rank",
"Rank" = "Board Game Rank",
"Childrens_Rank" = "Children's Game Rank",
"Family_Rank" = "Family Game Rank",
"Party_Rank" = "Party Game Rank",
"Strategy_Rank" = "Strategy Game Rank",
"Thematic_Rank" = "Thematic Rank",
"War_Game_Rank" = "War Game Rank",
"Ratings_average" = "average",
"Complexity" = "averageweight",
"Geekscore" = "bayesaverage",
"Category" = "boardgamecategory",
"Mechanics" = "boardgamemechanic",
'Publisher' = "boardgamepublisher",
'Description' = "description",
"id",
"image",
"maxplayers",
"maxplaytime",
"minage",
"minplayers",
"minplaytime",
"numcomments",
"numweights",
"owned",
"playingtime",
"name" = "primary",
"Ratings_std_dev" = "stddev",
"thumbnail",
"trading",
"usersrated",
"wanting",
"wishing",
"yearpublished"
)
There are some ‘joke’ games, e.g. a war game that takes 1 500 hours to complete (The Campaign for North Africa).
# Lets remove some outliers to make life easier
# Some outliers are ok, eg. for most popular games the "wishing" will be very high
games <- games %>%
filter(yearpublished > 1950,
maxplayers < 20,
maxplaytime < 600,
maxplaytime < 600,
playingtime < 600)
For each game, the participants can vote on the complexity rating (from 1 to 5). These votes are later averaged and presented as a single score.
We want to bin this variable, to make plotting easier. We dropped the 5th category, because only a few games have a score of > 4.25.
games %>%
ggplot(aes(x = Complexity)) +
geom_histogram(bins = 60, color = orange, fill = orange, alpha = 0.5) +
labs(title = 'Distribution on average complexity', x = 'Complexity', y = 'Count')
# Values are from 1-5, but there is very little after 4.25
games$ComplexityBinned <- games$Complexity %>%
cut(
breaks = c(-Inf, 1.5, 2.5, 3.25, Inf),
labels = c(1, 2, 3, 4))
a <- games$ComplexityBinned %>%
table()
kable(a, col.names = c('Complexity rating', 'Number of games')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F)
| Complexity rating | Number of games |
|---|---|
| 1 | 5487 |
| 2 | 6586 |
| 3 | 2992 |
| 4 | 1274 |
In addition to the main ranking, BGG provides additional ratings for some main categories.
We want to see if there is some overlapping between this rankings.
# Create binary variables
# A game can have multiple main categories, so we can't create one column
games$isAbstract <- as.integer( !is.na(games$Abstract_Rank) )
games$isChildrens <- as.integer( !is.na(games$Childrens_Rank) )
games$isFamily <- as.integer( !is.na(games$Family_Rank) )
games$isParty <- as.integer( !is.na(games$Party_Rank) )
games$isStrategy <- as.integer( !is.na(games$Strategy_Rank) )
games$isThematic <- as.integer( !is.na(games$Thematic_Rank) )
games$isWarGame <- as.integer( !is.na(games$War_Game_Rank) )
# Some overlap between categories
a <- games %>%
group_by(isAbstract,
isChildrens,
isFamily,
isParty,
isStrategy,
isThematic,
isWarGame) %>%
summarise(n = n()) %>%
arrange(desc(n))
a[a == 0] <- ''
kable(a, col.names = c('Abstract',
'Children',
'Family',
'Party',
'Strategy',
'Thematic',
'War game',
'Number of games'),
align = 'c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
| Abstract | Children | Family | Party | Strategy | Thematic | War game | Number of games |
|---|---|---|---|---|---|---|---|
| 8094 | |||||||
| 1 | 2548 | ||||||
| 1 | 1247 | ||||||
| 1 | 1036 | ||||||
| 1 | 719 | ||||||
| 1 | 600 | ||||||
| 1 | 576 | ||||||
| 1 | 1 | 311 | |||||
| 1 | 302 | ||||||
| 1 | 1 | 209 | |||||
| 1 | 1 | 117 | |||||
| 1 | 1 | 108 | |||||
| 1 | 1 | 96 | |||||
| 1 | 1 | 88 | |||||
| 1 | 1 | 87 | |||||
| 1 | 1 | 70 | |||||
| 1 | 1 | 39 | |||||
| 1 | 1 | 33 | |||||
| 1 | 1 | 20 | |||||
| 1 | 1 | 10 | |||||
| 1 | 1 | 6 | |||||
| 1 | 1 | 5 | |||||
| 1 | 1 | 5 | |||||
| 1 | 1 | 1 | 4 | ||||
| 1 | 1 | 4 | |||||
| 1 | 1 | 1 | 1 | ||||
| 1 | 1 | 1 | |||||
| 1 | 1 | 1 | 1 | ||||
| 1 | 1 | 1 | |||||
| 1 | 1 | 1 | 1 |
There is - this creates a challenge. How to choose a main category for each game?
For each game we create a quantile rank - 0.01 means that the game is in top 1% in the category.
# Convert category ranks to quantiles
games$quantileAbstract <- games %>%
.$Abstract_Rank / max(games$Abstract_Rank, na.rm = TRUE)
games$quantileChildrens <- games %>%
.$Childrens_Rank / max(games$Childrens_Rank, na.rm = TRUE)
games$quantileFamily <- games %>%
.$Family_Rank / max(games$Family_Rank, na.rm = TRUE)
games$quantileParty <- games %>%
.$Party_Rank / max(games$Party_Rank, na.rm = TRUE)
games$quantileStrategy <- games %>%
.$Strategy_Rank / max(games$Strategy_Rank, na.rm = TRUE)
games$quantileThematic <- games %>%
.$Thematic_Rank / max(games$Thematic_Rank, na.rm = TRUE)
games$quantileWarGame <- games %>%
.$War_Game_Rank / max(games$War_Game_Rank, na.rm = TRUE)
To find the main category of the game, we look at all the quantiles and choose the lowest one.
Our logic is that it is much more significant if the game is 5th in ranking that has 100 games, than in one that has 1000.
# Find the primary category
# How? By the lowest quantile, eg. the most important in this catergory
find_lowest_quantile <- function(row, output) {
abstract = row["quantileAbstract"]
family = row["quantileChildrens"]
child = row["quantileFamily"]
party = row["quantileParty"]
strategy = row["quantileStrategy"]
thematic = row["quantileThematic"]
war = row["quantileWarGame"]
other = 1
lowest <- min(
abstract,
family,
child,
party,
strategy,
thematic,
war,
other,
na.rm = TRUE)
case_when(
lowest == abstract ~ 'Abstract',
lowest == family ~ 'Family',
lowest == child ~ "Children",
lowest == party ~ 'Party',
lowest == strategy ~ 'Strategy',
lowest == thematic ~ 'Thematic',
lowest == war ~ 'War Game',
lowest == 1 ~ 'Other'
) %>%
return()
}
# Takes a while
games$cat <- games %>%
apply(1, find_lowest_quantile) %>%
as.factor()
# A nice distribution, a lot of other
a <- games$cat %>% table()
kable(a, col.names = c('Category',
'Number of games')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
| Category | Number of games |
|---|---|
| Abstract | 852 |
| Children | 1453 |
| Family | 718 |
| Other | 8098 |
| Party | 445 |
| Strategy | 1249 |
| Thematic | 791 |
| War Game | 2733 |
This plot will help us choose next steps. From the graphs we can see, that in some situations we’ll need to use log(x).
my_dens <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_density_ridges(aes(y = games$ComplexityBinned, fill = games$ComplexityBinned), scale = 3, rel_min_height = 0.01, alpha = 0.7, color = NA ) +
scale_fill_manual(values = c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051"))
}
my_dots <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_point(alpha = 0.1) +
scale_color_tech(theme="airbnb")
}
games %>%
select( "Rank",
"Ratings_average",
"yearpublished",
"usersrated",
"playingtime",
"usersrated") %>%
ggpairs(
mapping = aes(color = games$ComplexityBinned, fill = games$ComplexityBinned),
diag = list(continuous = my_dens),
lower = list(continuous = my_dots),
progress = TRUE,
axisLabels = 'none',
legend = 1) +
theme(text = element_text(size = 10)) +
labs(title = '')
Definitely.
Compared to 2000, the number of games published in 2019 nearly tripled. The market has never been so big and so competitive.
games %>%
ggplot(aes(x = yearpublished)) +
geom_density(color = orange, fill = orange, alpha = 0.5) +
labs(title = 'Board games published per year', x = 'Year', y = 'Density')
It would certainly seem so. The ratings are quite stable for older games, but they start to increase if we go 3-4 years back.
Notice an interesting pattern - the standard deviation of newer games is bigger. This may be due to the fact than newer games have fewer reviews.
There also is an interesting ‘preorder hype’ effect. This dataset is from 2019, and games from 2020 have significantly higher scores.
plot <- games %>%
filter(yearpublished > 2009) %>%
ggplot(aes(x = Ratings_average)) +
geom_density(color = orange, fill = orange, alpha = 0.5) +
labs(title = 'Ratings of games published in: {floor(frame_time)}', x = 'Ratings', y = 'Density') +
transition_time(yearpublished) +
ease_aes('linear')
plot %>% animate(width = 900,
height = 400,
renderer = gifski_renderer())
Yes. Simple party games have an average score of 6, while the most complex games have a score of 7.5. This is a hint of a pattern that we’ll analyse further.
games %>%
ggplot(aes(x = Ratings_average, y = ComplexityBinned, fill = ComplexityBinned)) +
geom_density_ridges(scale = 3, rel_min_height = 0.01, alpha = 0.7, color = NA) +
scale_fill_manual(values = c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051")) +
scale_x_continuous(limits = c(3, 9.3)) +
labs(title = 'Does the complexity of the game influence the ratings?',
x = 'Ratings',
y = 'Complexity of the game',
fill = 'Complexity')
Rank vs rating has an interesting distribution - there are some clear patterns.
It seems that for a high rank, the average rating has to be quite high. However, just having high reviews does not guarantee a good position.
There is a very curious contraction near the ~ 6.0 average. This is a similar score to the scores of very low complexity games.
games %>%
ggplot(aes(x = Ratings_average, y = Rank)) +
geom_point(aes(colour = Rank), alpha = 0.2) +
scale_color_gradient(low = orange, high = purple) +
labs(title = 'Does the average rating influence the final rank?',
subtitle = 'Chicken Drumstick distribution',
x = 'Average rating',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
Log number of reviews is correlated with the rank, but there are some interesting points here:
games %>%
ggplot(aes(x = log10(usersrated), y = Rank)) +
geom_point(aes(colour = Rank), alpha = 0.3) +
scale_color_gradient(low = orange, high = purple) +
labs(title = 'Does the number of user ratings influence the final rank?',
subtitle = 'Jaws distribution',
x = 'Log 10 of the number of ratings ',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
For all games – yes. Strategy is highly favored, while family games are disfavored.
Generally speaking the choice of category will influence the rank in a significant way.
games$isTop <- ifelse(games$Rank < 300, 1, 0)
games %>%
ggplot(aes(x = Rank, fill = cat)) +
geom_density(alpha = 0.7, color = NA) +
facet_wrap('cat') +
scale_fill_manual(values = c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051", "#FF5A5F", "#FFB400", "#007A87")) +
theme(legend.position = 'none') +
labs(title = 'Does the main category of the game influence the rank?',
subtitle = 'All games',
x = 'Rank (lower is better)',
y = 'Density')
The top 1% is dominated by strategy games. Other categories have only a few positions, with the exception of thematic games. Surprisingly, there are some children’s games too.
games %>%
filter(isTop == 1) %>%
ggplot(aes(x = Rank, fill = cat)) +
geom_histogram(alpha = 0.7, color = NA, bins = 30) +
facet_wrap('cat') +
scale_fill_manual(values = c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051", "#FF5A5F", "#FFB400", "#007A87")) +
theme(legend.position = 'none') +
labs(title = 'Does the main category of the game influence the rank?',
subtitle = 'Top 300 games',
x = 'Rank (lower is better)',
y = 'Count')
By this we mean people adding games to their wishlists.
There is a strong correlation between these two variables. However, we believe the causality here may be reversed. A high position in the ranking will boosts the popularity and sales of a game.
games %>%
filter(wishing > 10) %>%
ggplot(aes(x = log10(wishing), y = Rank)) +
geom_point(aes(colour = Rank), alpha = 0.3) +
geom_smooth(color = purple) +
scale_color_gradient(low = orange, high = purple) +
labs(title = 'Does the number of user wishlisting influence the final rank?',
subtitle = 'Removed observations with < 10 users',
x = 'Log 10 of the number of users wishlisting',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
Generally speaking, it doesn’t matter if the game has mostly textual reviews, or just numerical grades.
There is some pattern in the very top part of the ranking. However, we believe that this is due to the number of reviews.
games %>%
filter(numcomments / usersrated < 1) %>%
ggplot(aes(x = numcomments / usersrated , y = Rank)) +
geom_point(aes(colour = Rank), alpha = 0.5) +
scale_color_gradient(low = orange, high = purple) +
labs(title = '% of ratings with written reviews',
x = 'Number of comments / Number of ratings',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
We interpret the standard deviation of the ratings to represent how controversial the game is. Higher deviation means more disagreement. Perhaps surprisingly, this has no significant effect on the ranking.
games %>%
ggplot(aes(x = Ratings_std_dev, y = Rank))+
geom_point(aes(colour = Rank), alpha = 0.5) +
scale_color_gradient(low = orange, high = purple) +
labs(title = 'How controversial the game is',
x = 'Standard deviation of ratings',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
When we removed the outliers and used a density plot, a pattern emerged. Still, we believe the effects are not significant, taking into account the deviation.
games %>%
filter(Ratings_std_dev > 0.75,
Ratings_std_dev < 2.5) %>%
ggplot(aes(x = Ratings_std_dev, y = Rank)) +
stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_fill_gradient(high = orange, low = purple) +
labs(title = 'How controversial the game is',
subtitle = 'Outliers removed',
x = 'Standard deviation of ratings',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
Coming back to the point made in the introduction. The results confirmed that the people who review games on Board Game Geek are a specific part of the whole population.
In normal circumstances, this would create a problem, because one cannot generalize on the population from biased data.
However, this wasn’t our goal.
We wanted to analyze what factors influence the position in the Board Game Geek ranking system with the idea that this would influence the purchasing decisions the same way as searching for reviews on Amazon does. And since the whole population uses BGG to check game reviews, we believe we can use this data for this purpose.
By analyzing the 17 000 board games from the Board Game Geek database, we were able to extract some insights that could be useful to designers and publishers preparing their own board games. To maximize the chances of a game achieving a favorable position in the ranking, several factors have to be considered.
The most important one is the genre of the game. We strongly recommend creating a strategy game, maybe with some thematic elements. It should be a complex game, requiring a long time to play.
Creating ‘hype’ before publishing is advised, as this will create a first wave of positive reviews, even before the game is actually published.
Number of people wishlisting the game is also a very important factor. However, we do not advise using shady practices such as creating fake accounts. Such a practice runs the risk of receiving a ban from BGG, which could have negative influence on the sales.