Board games have been around for centuries, from ancient games such as Go in China, Chess in India and Senet in Ancient Egypt to modern titles like Terraforming Mars, Pandemic and many more. They are a great way to connect people, relieve stress and practice strategic and problem-solving skills. But have you ever wondered what makes some games more popular than others?
In this paper I try to uncover what makes a board game popular by applying Principal Component Analysis (PCA) to a dataset from BoardGameGeek (BGG), the most popular website about board games.
First, we need to load the data and check its structure.
library(xlsx)   # provides read.xlsx2()
library(dplyr)  # provides %>%, select(), filter(), mutate()
boardgames_data <- read.xlsx2('BGG_Data_Set.xlsx', 1)
head(boardgames_data)
## ID Name Year_Published Min.Players
## 1 174430 Gloomhaven 2017 1
## 2 161936 Pandemic Legacy: Season 1 2015 2
## 3 224517 Brass: Birmingham 2018 2
## 4 167791 Terraforming Mars 2016 1
## 5 233078 Twilight Imperium: Fourth Edition 2017 3
## 6 291457 Gloomhaven: Jaws of the Lion 2020 1
## Max.Players Play.Time Min.Age Users.Rated Rating.Average BGG.Rank
## 1 4 120 14 42055 8.79244 1
## 2 4 60 13 41643 8.61278 2
## 3 4 120 14 19217 8.66337 3
## 4 5 120 12 64864 8.43254 4
## 5 6 480 14 13468 8.69649 5
## 6 4 120 14 8392 8.87363 6
## Complexity.Average Owned.Users
## 1 3.8604 68323
## 2 2.8405 65294
## 3 3.9129 28785
## 4 3.2406 87099
## 5 4.2219 16831
## 6 3.5472 21609
## Mechanics
## 1 Action Queue, Action Retrieval, Campaign / Battle Card Driven, Card Play Conflict Resolution, Communication Limits, Cooperative Game, Deck Construction, Deck Bag and Pool Building, Grid Movement, Hand Management, Hexagon Grid, Legacy Game, Modular Board, Once-Per-Game Abilities, Scenario / Mission / Campaign Game, Simultaneous Action Selection, Solo / Solitaire Game, Storytelling, Variable Player Powers
## 2 Action Points, Cooperative Game, Hand Management, Legacy Game, Point to Point Movement, Set Collection, Trading, Variable Player Powers
## 3 Hand Management, Income, Loans, Market, Network and Route Building, Score-and-Reset Game, Tech Trees / Tech Tracks, Turn Order: Stat-Based, Variable Set-up
## 4 Card Drafting, Drafting, End Game Bonuses, Hand Management, Hexagon Grid, Income, Set Collection, Solo / Solitaire Game, Take That, Tile Placement, Turn Order: Progressive, Variable Player Powers
## 5 Action Drafting, Area Majority / Influence, Area-Impulse, Dice Rolling, Follow, Grid Movement, Hexagon Grid, Modular Board, Trading, Variable Phase Order, Variable Player Powers, Voting
## 6 Action Queue, Campaign / Battle Card Driven, Communication Limits, Cooperative Game, Critical Hits and Failures, Deck Construction, Grid Movement, Hand Management, Hexagon Grid, Legacy Game, Line of Sight, Once-Per-Game Abilities, Scenario / Mission / Campaign Game, Simultaneous Action Selection, Solo / Solitaire Game, Variable Player Powers
## Domains
## 1 Strategy Games, Thematic Games
## 2 Strategy Games, Thematic Games
## 3 Strategy Games
## 4 Strategy Games
## 5 Strategy Games, Thematic Games
## 6 Strategy Games, Thematic Games
str(boardgames_data)
## 'data.frame': 20343 obs. of 14 variables:
## $ ID : chr "174430" "161936" "224517" "167791" ...
## $ Name : chr "Gloomhaven" "Pandemic Legacy: Season 1" "Brass: Birmingham" "Terraforming Mars" ...
## $ Year_Published : chr "2017" "2015" "2018" "2016" ...
## $ Min.Players : chr "1" "2" "2" "1" ...
## $ Max.Players : chr "4" "4" "4" "5" ...
## $ Play.Time : chr "120" "60" "120" "120" ...
## $ Min.Age : chr "14" "13" "14" "12" ...
## $ Users.Rated : chr "42055" "41643" "19217" "64864" ...
## $ Rating.Average : chr "8.79244" "8.61278" "8.66337" "8.43254" ...
## $ BGG.Rank : chr "1" "2" "3" "4" ...
## $ Complexity.Average: chr "3.8604" "2.8405" "3.9129" "3.2406" ...
## $ Owned.Users : chr "68323" "65294" "28785" "87099" ...
## $ Mechanics : chr "Action Queue, Action Retrieval, Campaign / Battle Card Driven, Card Play Conflict Resolution, Communication Lim"| __truncated__ "Action Points, Cooperative Game, Hand Management, Legacy Game, Point to Point Movement, Set Collection, Trading"| __truncated__ "Hand Management, Income, Loans, Market, Network and Route Building, Score-and-Reset Game, Tech Trees / Tech Tra"| __truncated__ "Card Drafting, Drafting, End Game Bonuses, Hand Management, Hexagon Grid, Income, Set Collection, Solo / Solita"| __truncated__ ...
## $ Domains : chr "Strategy Games, Thematic Games" "Strategy Games, Thematic Games" "Strategy Games" "Strategy Games" ...
The dataset contains 20,343 games described by 14 columns: ID, Name, Year_Published, Min.Players, Max.Players, Play.Time, Min.Age, Users.Rated, Rating.Average, BGG.Rank, Complexity.Average, Owned.Users, Mechanics and Domains. All of them were read in as character strings.
PCA requires numerical input, so I keep only the columns that can contribute to the analysis; the name of the game, its mechanics and its domains are dropped.
I also filtered the data to remove outliers, keeping games that were published after 1900, support at most 20 players and have a play time of at most 300 minutes (5 hours).
boardgames <- boardgames_data %>% select(Year_Published, Min.Players, Max.Players, Play.Time, Min.Age, Users.Rated, Rating.Average, Complexity.Average)
boardgames <- boardgames %>% filter(Year_Published > 1900)
boardgames <- boardgames %>% filter(Max.Players <= 20)
boardgames <- boardgames %>% filter(Play.Time <= 300)
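One caveat worth checking at this step: according to the str() output above, the columns are still of type character when these filters run, and R compares a character value with a number as strings, not as numbers. The sketch below uses a few made-up values (not rows from the dataset) to show the effect, together with one possible way around it, which is to convert with as.numeric() before comparing.

```r
# Hypothetical values for illustration only; they are not taken from the dataset.
year_chr <- c("2017", "1850", "400")
players_chr <- c("4", "20", "163")

# Comparing character with numeric coerces the number to a string,
# so the comparison is lexicographic ("4" sorts after "1", "1" before "2").
chr_year_ok <- year_chr > 1900        # TRUE FALSE TRUE
chr_players_ok <- players_chr <= 20   # FALSE TRUE TRUE

# Converting first gives the intended numeric comparison.
num_year_ok <- as.numeric(year_chr) > 1900        # TRUE FALSE FALSE
num_players_ok <- as.numeric(players_chr) <= 20   # TRUE TRUE FALSE

print(data.frame(year_chr, chr_year_ok, num_year_ok))
print(data.frame(players_chr, chr_players_ok, num_players_ok))
```

This would be consistent with the summary() output further down, where Year_Published still has a minimum of 400 and Max.Players a maximum of 163.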
Although the selected columns hold numbers, they were stored as character strings, so I converted them to numeric and removed rows with NA values. I also applied a logarithmic transformation to the Users.Rated column. The number of ratings per game varies widely: some games have only a handful of ratings while others accumulate tens of thousands. This produces a highly skewed distribution in which a few extremely popular games would otherwise dominate the dataset.
boardgames_numeric <- boardgames %>%
  mutate(across(everything(), as.numeric))  # character -> numeric
boardgames_numeric <- na.omit(boardgames_numeric)  # drop rows containing NA
boardgames_numeric$Users.Rated <- log(boardgames_numeric$Users.Rated + 1)  # log(x + 1) reduces skew
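To illustrate why the transformation helps, here is a small sketch. The last two values are ratings counts taken from the head() output above; 30 is an assumed low count, chosen because the smallest transformed value in the summary shown later (3.434) equals log(31).

```r
# Ratings counts: 30 is an assumed small value, the other two come from head() above.
users_rated <- c(30, 8392, 64864)

raw_ratio <- max(users_rated) / min(users_rated)   # spread on the raw scale
logged <- log(users_rated + 1)                     # same transform as in the text
log_ratio <- max(logged) / min(logged)             # spread after the transform

print(round(logged, 3))
print(c(raw = raw_ratio, logged = log_ratio))

# log1p(x) is a built-in equivalent of log(x + 1)
print(all.equal(logged, log1p(users_rated)))
```

On the raw scale the largest count is thousands of times the smallest, while after the transform the values differ only by a small factor, so a few very popular games no longer dominate the variance.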
Because PCA is driven by variance, the variables need to be standardized so that they are on the same scale.
We can also compute the correlation matrix to inspect the relationships between variables. Unlike covariance, correlation does not depend on the units of measurement, so all features contribute on an equal footing.
boardgames_stand <- scale(boardgames_numeric)
summary(boardgames_numeric)
## Year_Published Min.Players Max.Players Play.Time
## Min. : 400 Min. : 0.000 Min. : 0.000 Min. : 0
## 1st Qu.:1993 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 20
## Median :2007 Median : 2.000 Median : 2.000 Median : 30
## Mean :2001 Mean : 1.901 Mean : 3.678 Mean : 104
## 3rd Qu.:2014 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 120
## Max. :2021 Max. :10.000 Max. :163.000 Max. :22500
## Min.Age Users.Rated Rating.Average Complexity.Average
## Min. : 0.00 Min. : 3.434 Min. :1.427 Min. :0.000
## 1st Qu.: 8.00 1st Qu.: 3.871 1st Qu.:5.956 1st Qu.:1.500
## Median :10.00 Median : 4.431 Median :6.557 Median :2.125
## Mean : 9.26 Mean : 4.776 Mean :6.527 Mean :2.173
## 3rd Qu.:12.00 3rd Qu.: 5.297 3rd Qu.:7.138 3rd Qu.:2.818
## Max. :18.00 Max. :11.007 Max. :9.463 Max. :5.000
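As a sketch of what scale() does, the example below standardizes a small made-up data frame (the values are illustrative, not from the dataset) and checks two properties: every standardized column has mean 0 and standard deviation 1, and the covariance matrix of the standardized data equals the correlation matrix of the original data.

```r
# Made-up values on very different scales, for illustration only.
toy <- data.frame(
  Play.Time      = c(30, 60, 120, 45, 240),
  Rating.Average = c(6.1, 7.4, 8.0, 5.9, 7.7),
  Min.Age        = c(8, 10, 14, 8, 12)
)

toy_stand <- scale(toy)   # subtract column means, divide by column sds

print(round(colMeans(toy_stand), 10))   # all 0
print(apply(toy_stand, 2, sd))          # all 1

# Covariance of standardized data is the correlation of the raw data.
print(all.equal(cov(toy_stand), cor(toy), check.attributes = FALSE))
```

This is why a PCA on standardized data is equivalent to a PCA based on the correlation matrix.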
cor(boardgames_numeric)
## Year_Published Min.Players Max.Players Play.Time
## Year_Published 1.0000000000 -0.0009422675 0.02022352 -0.003818487
## Min.Players -0.0009422675 1.0000000000 0.25095281 -0.028648889
## Max.Players 0.0202235234 0.2509528128 1.00000000 -0.031813362
## Play.Time -0.0038184872 -0.0286488887 -0.03181336 1.000000000
## Min.Age -0.0181721227 0.0940379051 0.04525981 0.047669059
## Users.Rated -0.0370120117 0.0867748940 0.04288666 0.003297534
## Rating.Average 0.1004036709 -0.1510540289 -0.07860953 0.131721306
## Complexity.Average -0.0434511961 -0.1863024363 -0.19574513 0.227456447
## Min.Age Users.Rated Rating.Average Complexity.Average
## Year_Published -0.01817212 -0.037012012 0.10040367 -0.04345120
## Min.Players 0.09403791 0.086774894 -0.15105403 -0.18630244
## Max.Players 0.04525981 0.042886660 -0.07860953 -0.19574513
## Play.Time 0.04766906 0.003297534 0.13172131 0.22745645
## Min.Age 1.00000000 0.124529668 0.01499469 0.13771600
## Users.Rated 0.12452967 1.000000000 0.14804709 -0.03034979
## Rating.Average 0.01499469 0.148047086 1.00000000 0.42110551
## Complexity.Average 0.13771600 -0.030349788 0.42110551 1.00000000
From the correlation matrix above we can draw the following insights:
Complexity and Rating (0.42 correlation) - more complex games tend to have higher user ratings, which may suggest that the players rating games on BGG prefer deeper, more strategic games.
Complexity and Playtime (0.23 correlation) - more complex games tend to take longer to play.
Users Rated and Rating (0.15 correlation) - popular games tend to have higher ratings, but the correlation is weak.
Year Published has only weak correlations (about 0.10 at most in absolute value) - a game’s age has little linear relationship with the other variables.
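To pick out the strongest relationships without scanning the full matrix by eye, the off-diagonal entries can be sorted by absolute value. The sketch below does this for a four-variable excerpt of the matrix printed above, with values rounded to three decimals; the object names are illustrative.

```r
vars <- c("Play.Time", "Users.Rated", "Rating.Average", "Complexity.Average")
# Excerpt of the correlation matrix above, rounded to three decimals.
cor_excerpt <- matrix(c(
  1.000,  0.003, 0.132,  0.227,
  0.003,  1.000, 0.148, -0.030,
  0.132,  0.148, 1.000,  0.421,
  0.227, -0.030, 0.421,  1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) and sort by absolute correlation.
idx <- which(upper.tri(cor_excerpt), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = cor_excerpt[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```

With the full data, the same steps could be applied directly to cor(boardgames_numeric).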
No single variable dominates the dataset, which makes PCA a useful technique here. We’ll perform PCA using the prcomp() function.
pca_result <- prcomp(boardgames_stand, center = TRUE, scale. = TRUE)
print(summary(pca_result))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3174 1.1360 1.0250 0.9955 0.9520 0.9040 0.85451
## Proportion of Variance 0.2169 0.1613 0.1313 0.1239 0.1133 0.1022 0.09127
## Cumulative Proportion 0.2169 0.3783 0.5096 0.6335 0.7468 0.8489 0.94017
## PC8
## Standard deviation 0.69182
## Proportion of Variance 0.05983
## Cumulative Proportion 1.00000
From the summary above and the scree plot below, we can see that the first six components (PC1 to PC6) explain about 85% of the variance, making them a reasonable choice for dimensionality reduction.
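The proportions in the summary follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. A short sketch using the values printed above (typed in by hand, so small rounding differences are expected):

```r
# Standard deviations of PC1..PC8, copied from the summary above.
sdev <- c(1.3174, 1.1360, 1.0250, 0.9955, 0.9520, 0.9040, 0.85451, 0.69182)

variances <- sdev^2
prop_var <- variances / sum(variances)   # proportion of variance per component
cum_var <- cumsum(prop_var)              # cumulative proportion

print(round(rbind(prop_var, cum_var), 4))

# Smallest number of components reaching a chosen threshold
# (80% is an assumed example threshold, not one stated in the text).
n_components <- which(cum_var >= 0.80)[1]
print(n_components)
```

With the fitted object, the same standard deviations are available as pca_result$sdev.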
library(factoextra)  # provides fviz_eig() and fviz_pca_var()
fviz_eig(pca_result, barfill = "#beedd4")
fviz_pca_var(pca_result, col.var="#7a319e")
pca_result$rotation
## PC1 PC2 PC3 PC4 PC5
## Year_Published -0.02200395 0.02071852 -0.88274854 0.03983853 0.34288485
## Min.Players 0.37693338 -0.44851927 -0.04725960 0.22184683 -0.06649300
## Max.Players 0.34988682 -0.38329479 -0.22896035 0.25817173 -0.26974233
## Play.Time -0.31317132 -0.19950284 0.04376372 0.63744806 -0.33523783
## Min.Age -0.07742229 -0.53019084 0.24506497 0.06155596 0.74574410
## Users.Rated -0.02920612 -0.51932488 0.01606381 -0.64347670 -0.28877103
## Rating.Average -0.52316585 -0.21467739 -0.31078383 -0.19168657 -0.21922290
## Complexity.Average -0.59698431 -0.12164654 0.08533444 0.14700022 0.07267518
## PC6 PC7 PC8
## Year_Published -0.263733868 -0.01080419 -0.1761027
## Min.Players -0.005492906 0.77492636 0.0184352
## Max.Players 0.518865670 -0.49569985 -0.1545574
## Play.Time -0.544625438 -0.17830606 0.1143360
## Min.Age -0.013107521 -0.21316507 0.2175134
## Users.Rated -0.382687022 -0.12236101 -0.2651755
## Rating.Average 0.346719095 0.13551236 0.6002036
## Complexity.Average 0.312634818 0.20757133 -0.6736563
The PCA variable correlation plot visualizes how the features contribute to the first two principal components. Variables with longer arrows, such as Complexity.Average, Rating.Average and Min.Players, are better represented by these two components, while the direction of the arrows shows how the variables relate to each other. The pca_result$rotation matrix provides the exact numerical loadings, showing which variables contribute the most to each principal component.
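For a PCA on standardized data, the arrow coordinates in such a plot are, to my understanding, the loadings multiplied by the standard deviation of the corresponding component. As a rough check, the sketch below recomputes the arrow lengths in the PC1-PC2 plane from the values printed above, rounded to three decimals; the object names are illustrative.

```r
vars <- c("Year_Published", "Min.Players", "Max.Players", "Play.Time",
          "Min.Age", "Users.Rated", "Rating.Average", "Complexity.Average")
# Loadings on PC1 and PC2 from pca_result$rotation, rounded to three decimals.
load_pc1 <- c(-0.022, 0.377, 0.350, -0.313, -0.077, -0.029, -0.523, -0.597)
load_pc2 <- c(0.021, -0.449, -0.383, -0.200, -0.530, -0.519, -0.215, -0.122)
sdev <- c(1.3174, 1.1360)   # standard deviations of PC1 and PC2

# Coordinates of each variable: loading times component standard deviation.
coord_pc1 <- load_pc1 * sdev[1]
coord_pc2 <- load_pc2 * sdev[2]
arrow_length <- sqrt(coord_pc1^2 + coord_pc2^2)
names(arrow_length) <- vars

print(round(sort(arrow_length, decreasing = TRUE), 2))
```

A length close to 1 means the variable is almost fully described by the first two components; a length close to 0, as for Year_Published, means it is mostly captured by later components.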
As mentioned before, we’ll choose the first six principal components.
pca_selected <- as.data.frame(pca_result$x[, 1:6])
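The scores stored in the x element of a prcomp object are the standardized data projected onto the loadings. The sketch below checks this identity on R's built-in USArrests data, used here only as a stand-in because it ships with R; it is unrelated to the board game dataset.

```r
# USArrests is a built-in dataset, used purely as a stand-in example.
toy_stand <- scale(USArrests)
toy_pca <- prcomp(toy_stand)

# Scores = standardized data %*% loadings (rotation matrix).
manual_scores <- toy_stand %*% toy_pca$rotation
print(all.equal(manual_scores, toy_pca$x, check.attributes = FALSE))

# Keeping the first two components, analogous to selecting columns 1:6 above.
toy_selected <- as.data.frame(toy_pca$x[, 1:2])
print(head(toy_selected, 3))
```

Each row of the selected data frame is therefore one observation described by a few weighted combinations of the original variables.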
We can now create a scatter plot to visualize how board games are distributed along the first two components, which mainly reflect complexity and quality (PC1) and popularity and player count (PC2).
library(ggplot2)
ggplot(pca_selected, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.6) +
  labs(title = "PCA: Board Game Clustering",
       x = "PC1 (Complexity & Quality)",
       y = "PC2 (Popularity & Players)") +
  theme_light()
Most games are clustered near the center, suggesting that they share similar characteristics in terms of complexity and player engagement. The overall trend suggests that popularity might decrease as complexity increases.
Even though the correlation analysis showed that publication year has little relationship with the other variables, including popularity, I decided to visualize how board games from different decades are distributed across the principal components, as this offers another interesting perspective.
#generated with the help of AI
pca_selected$Year_Published <- boardgames_numeric$Year_Published
pca_selected$Decade <- cut(pca_selected$Year_Published,
                           breaks = seq(1950, 2020, by = 10),
                           labels = paste0(seq(1950, 2010, by = 10), "s"))
ggplot(pca_selected, aes(x = PC1, y = PC2, color = Decade)) +
  geom_point(alpha = 0.8, size = 1.5) +
  scale_color_brewer(palette = "Set1") + # Use different colors for decades
  labs(title = "PCA: Board Game Clustering (Colored by Decade)",
       x = "PC1 (Complexity & Quality)",
       y = "PC2 (Popularity & Players)",
       color = "Decade") +
  theme_minimal()
## Warning: Removed 31 rows containing missing values or values outside the scale range
## (`geom_point()`).
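How cut() treats the interval boundaries matters for the decade labels. By default the intervals are closed on the right, so a year such as 1960 is labelled with the preceding decade, and years outside the breaks become NA, which is the likely source of the removed rows in the warning. A sketch with made-up years follows; right = FALSE is shown as one possible alternative, not as what was used above.

```r
years <- c(1950, 1959, 1960, 2019, 2020, 2021)   # made-up example years
breaks <- seq(1950, 2020, by = 10)
labels <- paste0(seq(1950, 2010, by = 10), "s")

# Default: right-closed intervals (1950, 1960], (1960, 1970], ...
default_decade <- cut(years, breaks = breaks, labels = labels)

# Left-closed intervals [1950, 1960), ..., with 2020 kept in the last bin.
left_closed_decade <- cut(years, breaks = breaks, labels = labels,
                          right = FALSE, include.lowest = TRUE)

print(data.frame(years, default_decade, left_closed_decade))
```

With the default, 1950 itself falls outside the first interval and 1960 is counted as "1950s"; with left-closed intervals each label matches the calendar decade, and only years beyond 2020 remain NA.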
We can also visualize how complexity of board games has evolved over time.
boardgames_numeric$Decade <- cut(boardgames_numeric$Year_Published,
                                 breaks = seq(1950, 2020, by = 10),
                                 labels = c("1950s", "1960s", "1970s", "1980s", "1990s", "2000s", "2010s"),
                                 include.lowest = TRUE)
ggplot(boardgames_numeric, aes(x = Decade, y = Complexity.Average, fill = Decade)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(alpha = 0.2, width = 0.2, color = "black") +
  theme_minimal() +
  labs(title = "Board Game Complexity Over Time",
       x = "Decade",
       y = "Complexity (Average)") +
  theme(legend.position = "none")
This visualization shows that the complexity of board games has generally grown over time, with recent games covering a wider range of complexity. This trend is consistent with the growing interest in strategic gameplay mechanics in modern board game design.
For the final part of this paper, I fit a linear model to understand how the principal components relate to game popularity, measured by the log-transformed number of user ratings (Users.Rated). Because Year_Published and Decade were added to pca_selected earlier, they enter the model as predictors as well.
regression_data <- cbind(boardgames_numeric$Users.Rated, pca_selected)
colnames(regression_data)[1] <- "Users.Rated"
lm_model <- lm(Users.Rated ~ ., data = regression_data)
summary(lm_model)
##
## Call:
## lm(formula = Users.Rated ~ ., data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72016 -0.04695 -0.01191 0.04520 1.32369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.311e+01 1.581e+00 -46.251 < 2e-16 ***
## PC1 6.964e-03 1.476e-03 4.718 2.48e-06 ***
## PC2 -6.754e-01 1.677e-03 -402.794 < 2e-16 ***
## PC3 1.737e+00 1.541e-02 112.718 < 2e-16 ***
## PC4 -8.626e-01 1.976e-03 -436.473 < 2e-16 ***
## PC5 -1.019e+00 6.035e-03 -168.761 < 2e-16 ***
## PC6 4.401e-02 4.873e-03 9.030 < 2e-16 ***
## Year_Published 3.899e-02 8.037e-04 48.516 < 2e-16 ***
## Decade1960s -9.277e-02 3.768e-02 -2.462 0.01388 *
## Decade1970s -1.018e-01 3.661e-02 -2.782 0.00543 **
## Decade1980s -1.163e-01 3.944e-02 -2.949 0.00321 **
## Decade1990s -1.253e-01 4.319e-02 -2.902 0.00373 **
## Decade2000s -1.337e-01 4.788e-02 -2.792 0.00526 **
## Decade2010s -1.174e-01 5.239e-02 -2.242 0.02505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1104 on 3481 degrees of freedom
## (31 observations deleted due to missingness)
## Multiple R-squared: 0.9918, Adjusted R-squared: 0.9918
## F-statistic: 3.254e+04 on 13 and 3481 DF, p-value: < 2.2e-16
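When reading the R-squared, it may help to keep in mind that Users.Rated was itself one of the eight variables used to build the components, so the components necessarily carry much of its variance. For a standardized variable, the share captured by component k is its squared loading times that component's variance. The sketch below applies this to the values printed earlier (copied by hand); it is an approximation, since the regression was fit on slightly fewer rows than the PCA.

```r
# Loadings of Users.Rated on PC1..PC8, from pca_result$rotation above.
loading_users <- c(-0.02920612, -0.51932488, 0.01606381, -0.64347670,
                   -0.28877103, -0.382687022, -0.12236101, -0.2651755)
# Standard deviations of PC1..PC8, from the PCA summary above.
sdev <- c(1.3174, 1.1360, 1.0250, 0.9955, 0.9520, 0.9040, 0.85451, 0.69182)

# Share of the standardized variable's variance carried by each component.
share <- loading_users^2 * sdev^2
print(round(share, 3))

total_share <- sum(share)        # close to 1 across all eight components
share_first6 <- sum(share[1:6])  # part retained by PC1..PC6
print(round(c(total = total_share, first6 = share_first6), 3))
```

Under these assumptions about 95% of the variance of Users.Rated is already contained in PC1 to PC6, so a high R-squared is expected here and does not by itself show that the other variables explain popularity.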
Insights from the model:
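PC1 (Complexity and Quality) has a small positive coefficient (0.007). Since Complexity.Average and Rating.Average load negatively on PC1, higher PC1 scores correspond to simpler, lower-rated games, so along this component the link between complexity and the number of ratings is slightly negative and very weak.
PC2 (Players) has a negative coefficient (-0.68). Min.Players, Max.Players and Users.Rated all load negatively on PC2, so lower PC2 scores, which correspond to larger player counts, go together with more ratings.
Year and decade - the Year_Published coefficient is positive, so older games tend to have fewer ratings than newer games; the decade terms are small negative adjustments relative to the 1950s baseline.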
This analysis suggests that board game popularity is related to a mix of complexity, quality, time trends and number of players. More recent games tend to receive more ratings, with quality and complexity also playing a part, and the number of players is an important factor in whether a game becomes popular. Popularity therefore appears to be driven both by game features and by external factors such as market trends and community engagement.