Introduction

Board games have been around for centuries, from ancient games like Go in China, Chess in India and Senet in Ancient Egypt to modern titles like Terraforming Mars, Pandemic and many more. They are a great way to connect people, relieve stress and practice strategic and problem-solving skills. But have you ever wondered what makes some games more popular than others?

In this paper I try to uncover what makes a board game popular by applying Principal Component Analysis (PCA) to a dataset from BoardGameGeek (BGG), the most popular website about board games.

Data

First, we load the data and check its structure.

library(xlsx)  # provides read.xlsx2()

boardgames_data <- read.xlsx2('BGG_Data_Set.xlsx', 1)

head(boardgames_data)
##       ID                              Name Year_Published Min.Players
## 1 174430                        Gloomhaven           2017           1
## 2 161936         Pandemic Legacy: Season 1           2015           2
## 3 224517                 Brass: Birmingham           2018           2
## 4 167791                 Terraforming Mars           2016           1
## 5 233078 Twilight Imperium: Fourth Edition           2017           3
## 6 291457      Gloomhaven: Jaws of the Lion           2020           1
##   Max.Players Play.Time Min.Age Users.Rated Rating.Average BGG.Rank
## 1           4       120      14       42055        8.79244        1
## 2           4        60      13       41643        8.61278        2
## 3           4       120      14       19217        8.66337        3
## 4           5       120      12       64864        8.43254        4
## 5           6       480      14       13468        8.69649        5
## 6           4       120      14        8392        8.87363        6
##   Complexity.Average Owned.Users
## 1             3.8604       68323
## 2             2.8405       65294
## 3             3.9129       28785
## 4             3.2406       87099
## 5             4.2219       16831
## 6             3.5472       21609
##                                                                                                                                                                                                                                                                                                                                                                                                                Mechanics
## 1 Action Queue, Action Retrieval, Campaign / Battle Card Driven, Card Play Conflict Resolution, Communication Limits, Cooperative Game, Deck Construction, Deck Bag and Pool Building, Grid Movement, Hand Management, Hexagon Grid, Legacy Game, Modular Board, Once-Per-Game Abilities, Scenario / Mission / Campaign Game, Simultaneous Action Selection, Solo / Solitaire Game, Storytelling, Variable Player Powers
## 2                                                                                                                                                                                                                                                                                Action Points, Cooperative Game, Hand Management, Legacy Game, Point to Point Movement, Set Collection, Trading, Variable Player Powers
## 3                                                                                                                                                                                                                                                            Hand Management, Income, Loans, Market, Network and Route Building, Score-and-Reset Game, Tech Trees / Tech Tracks, Turn Order: Stat-Based, Variable Set-up
## 4                                                                                                                                                                                                                    Card Drafting, Drafting, End Game Bonuses, Hand Management, Hexagon Grid, Income, Set Collection, Solo / Solitaire Game, Take That, Tile Placement, Turn Order: Progressive, Variable Player Powers
## 5                                                                                                                                                                                                                              Action Drafting, Area Majority / Influence, Area-Impulse, Dice Rolling, Follow, Grid Movement, Hexagon Grid, Modular Board, Trading, Variable Phase Order, Variable Player Powers, Voting
## 6                                                                Action Queue, Campaign / Battle Card Driven, Communication Limits, Cooperative Game, Critical Hits and Failures, Deck Construction, Grid Movement, Hand Management, Hexagon Grid, Legacy Game, Line of Sight, Once-Per-Game Abilities, Scenario / Mission / Campaign Game, Simultaneous Action Selection, Solo / Solitaire Game, Variable Player Powers
##                          Domains
## 1 Strategy Games, Thematic Games
## 2 Strategy Games, Thematic Games
## 3                 Strategy Games
## 4                 Strategy Games
## 5 Strategy Games, Thematic Games
## 6 Strategy Games, Thematic Games
str(boardgames_data)
## 'data.frame':    20343 obs. of  14 variables:
##  $ ID                : chr  "174430" "161936" "224517" "167791" ...
##  $ Name              : chr  "Gloomhaven" "Pandemic Legacy: Season 1" "Brass: Birmingham" "Terraforming Mars" ...
##  $ Year_Published    : chr  "2017" "2015" "2018" "2016" ...
##  $ Min.Players       : chr  "1" "2" "2" "1" ...
##  $ Max.Players       : chr  "4" "4" "4" "5" ...
##  $ Play.Time         : chr  "120" "60" "120" "120" ...
##  $ Min.Age           : chr  "14" "13" "14" "12" ...
##  $ Users.Rated       : chr  "42055" "41643" "19217" "64864" ...
##  $ Rating.Average    : chr  "8.79244" "8.61278" "8.66337" "8.43254" ...
##  $ BGG.Rank          : chr  "1" "2" "3" "4" ...
##  $ Complexity.Average: chr  "3.8604" "2.8405" "3.9129" "3.2406" ...
##  $ Owned.Users       : chr  "68323" "65294" "28785" "87099" ...
##  $ Mechanics         : chr  "Action Queue, Action Retrieval, Campaign / Battle Card Driven, Card Play Conflict Resolution, Communication Lim"| __truncated__ "Action Points, Cooperative Game, Hand Management, Legacy Game, Point to Point Movement, Set Collection, Trading"| __truncated__ "Hand Management, Income, Loans, Market, Network and Route Building, Score-and-Reset Game, Tech Trees / Tech Tra"| __truncated__ "Card Drafting, Drafting, End Game Bonuses, Hand Management, Hexagon Grid, Income, Set Collection, Solo / Solita"| __truncated__ ...
##  $ Domains           : chr  "Strategy Games, Thematic Games" "Strategy Games, Thematic Games" "Strategy Games" "Strategy Games" ...

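The dataset has 20,343 rows and 14 columns: ID, Name, Year_Published, Min.Players, Max.Players, Play.Time, Min.Age, Users.Rated, Rating.Average, BGG.Rank, Complexity.Average, Owned.Users, Mechanics and Domains. All of them were read in as character.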

PCA works only on numerical values, so we keep just the columns that can contribute to the analysis. That means dropping text columns such as the name of the game, its mechanics and its domains.

I also filtered the data to remove outliers, keeping games that were published after 1900, support at most 20 players and have a play time of at most 300 minutes (5 hours).

library(dplyr)  # provides %>%, select(), filter(), mutate() and across()

boardgames <- boardgames_data %>% select(Year_Published, Min.Players, Max.Players, Play.Time, Min.Age, Users.Rated, Rating.Average, Complexity.Average)

boardgames <- boardgames %>% filter(Year_Published > 1900)

boardgames <- boardgames %>% filter(Max.Players <= 20)

boardgames <- boardgames %>% filter(Play.Time <= 300)
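One detail worth checking in the filters above: they run before the columns are converted to numeric, and R compares a character vector with a number as strings. The summary printed later in this section (earliest year 400, up to 163 players, play time up to 22500 minutes) is consistent with that. The sketch below uses made-up toy values, not rows from the BGG file, to show the difference.

```r
# Hypothetical toy values standing in for the character columns
# returned by read.xlsx2(); they are not taken from the BGG data
toy <- data.frame(Year_Published = c("2017", "400", "1850"),
                  Max.Players    = c("4", "163", "2"),
                  stringsAsFactors = FALSE)

# Comparing a character column with a number turns the number into a
# string, so the comparison is alphabetical rather than numeric
chr_year_ok    <- toy$Year_Published > 1900   # TRUE  TRUE  FALSE
chr_players_ok <- toy$Max.Players <= 20       # FALSE TRUE  TRUE

# Converting first gives the numeric comparison the filter describes
num_year_ok    <- as.numeric(toy$Year_Published) > 1900   # TRUE  FALSE FALSE
num_players_ok <- as.numeric(toy$Max.Players) <= 20       # TRUE  FALSE TRUE
```

Converting the columns with as.numeric() before filtering would be one way to get the intended numeric cut-offs.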

Even though the chosen columns represent numbers, their type was character rather than numeric, so I converted them to numeric and removed rows with NA values. I also applied a logarithmic transformation, log(x + 1), to the Users.Rated column. The number of users rating a game varies enormously: some games receive only a handful of ratings while others accumulate tens of thousands. This creates a highly skewed distribution in which a few extremely popular games dominate, and the log transformation compresses that range.

boardgames_numeric <- boardgames %>%
  mutate(across(everything(), as.numeric))

boardgames_numeric <- na.omit(boardgames_numeric)

boardgames_numeric$Users.Rated <- log(boardgames_numeric$Users.Rated + 1)
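To illustrate why the log transformation helps, here is a small sketch on simulated counts. The data are hypothetical (lognormal draws, not BGG ratings), and skewness() is a helper I define here only for illustration.

```r
set.seed(42)  # reproducible toy data

# Hypothetical rating counts: lognormal draws give a long right tail,
# loosely imitating a few very popular games among many obscure ones
ratings <- round(rlnorm(1000, meanlog = 4, sdlog = 1.5))

# Simple moment-based skewness (close to 0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(ratings)
skew_log <- skewness(log1p(ratings))  # log1p(x) computes log(x + 1)

round(c(raw = skew_raw, logged = skew_log), 2)
```

On data like this the logged values should be far less skewed than the raw counts, which keeps a handful of extreme games from dominating the variance that PCA works with.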

Since PCA relies on variance, we now need to standardize the variables so that they are all on the same scale.

We can also compute the correlation matrix to check the relationships between variables. Correlations are computed on standardized values, so no variable carries more weight simply because of its units.

boardgames_stand <- scale(boardgames_numeric)

summary(boardgames_numeric)
##  Year_Published  Min.Players      Max.Players        Play.Time    
##  Min.   : 400   Min.   : 0.000   Min.   :  0.000   Min.   :    0  
##  1st Qu.:1993   1st Qu.: 2.000   1st Qu.:  2.000   1st Qu.:   20  
##  Median :2007   Median : 2.000   Median :  2.000   Median :   30  
##  Mean   :2001   Mean   : 1.901   Mean   :  3.678   Mean   :  104  
##  3rd Qu.:2014   3rd Qu.: 2.000   3rd Qu.:  2.000   3rd Qu.:  120  
##  Max.   :2021   Max.   :10.000   Max.   :163.000   Max.   :22500  
##     Min.Age       Users.Rated     Rating.Average  Complexity.Average
##  Min.   : 0.00   Min.   : 3.434   Min.   :1.427   Min.   :0.000     
##  1st Qu.: 8.00   1st Qu.: 3.871   1st Qu.:5.956   1st Qu.:1.500     
##  Median :10.00   Median : 4.431   Median :6.557   Median :2.125     
##  Mean   : 9.26   Mean   : 4.776   Mean   :6.527   Mean   :2.173     
##  3rd Qu.:12.00   3rd Qu.: 5.297   3rd Qu.:7.138   3rd Qu.:2.818     
##  Max.   :18.00   Max.   :11.007   Max.   :9.463   Max.   :5.000
cor(boardgames_numeric)
##                    Year_Published   Min.Players Max.Players    Play.Time
## Year_Published       1.0000000000 -0.0009422675  0.02022352 -0.003818487
## Min.Players         -0.0009422675  1.0000000000  0.25095281 -0.028648889
## Max.Players          0.0202235234  0.2509528128  1.00000000 -0.031813362
## Play.Time           -0.0038184872 -0.0286488887 -0.03181336  1.000000000
## Min.Age             -0.0181721227  0.0940379051  0.04525981  0.047669059
## Users.Rated         -0.0370120117  0.0867748940  0.04288666  0.003297534
## Rating.Average       0.1004036709 -0.1510540289 -0.07860953  0.131721306
## Complexity.Average  -0.0434511961 -0.1863024363 -0.19574513  0.227456447
##                        Min.Age  Users.Rated Rating.Average Complexity.Average
## Year_Published     -0.01817212 -0.037012012     0.10040367        -0.04345120
## Min.Players         0.09403791  0.086774894    -0.15105403        -0.18630244
## Max.Players         0.04525981  0.042886660    -0.07860953        -0.19574513
## Play.Time           0.04766906  0.003297534     0.13172131         0.22745645
## Min.Age             1.00000000  0.124529668     0.01499469         0.13771600
## Users.Rated         0.12452967  1.000000000     0.14804709        -0.03034979
## Rating.Average      0.01499469  0.148047086     1.00000000         0.42110551
## Complexity.Average  0.13771600 -0.030349788     0.42110551         1.00000000

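From the correlation matrix above we can draw the following insights: the strongest relationship is between Rating.Average and Complexity.Average (0.42), followed by Min.Players and Max.Players (0.25) and Play.Time and Complexity.Average (0.23). Complexity.Average is weakly negatively correlated with both player counts (about -0.19), and Year_Published is almost uncorrelated with every other variable (at most 0.10, with Rating.Average). All remaining correlations are below 0.2 in absolute value.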

No single variable dominates the dataset, which makes PCA a useful technique here. We’ll perform PCA using the prcomp() function.
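To make the link between the correlation matrix and PCA concrete, here is a minimal sketch on R's built-in USArrests data (used only because it ships with R; it is not part of this analysis). With centering and scaling, the variances of the components returned by prcomp() should match the eigenvalues of the correlation matrix, and the loadings should match its eigenvectors up to sign.

```r
# Built-in example data, standing in for the standardized board game table
X <- USArrests

pca <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- eigen(cor(X))  # eigen-decomposition of the correlation matrix

# Component variances (sdev squared) versus eigenvalues
variances_match <- isTRUE(all.equal(pca$sdev^2, eig$values))

# Loadings versus eigenvectors; signs are arbitrary, so compare magnitudes
loadings_match <- isTRUE(all.equal(abs(unname(pca$rotation)),
                                   abs(eig$vectors)))

c(variances = variances_match, loadings = loadings_match)
```

Because the eigenvalues of a correlation matrix sum to the number of variables, each component's share of variance is its eigenvalue divided by that count.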

Principal Component Analysis

pca_result <- prcomp(boardgames_stand, center = TRUE, scale. = TRUE)

print(summary(pca_result))
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6     PC7
## Standard deviation     1.3174 1.1360 1.0250 0.9955 0.9520 0.9040 0.85451
## Proportion of Variance 0.2169 0.1613 0.1313 0.1239 0.1133 0.1022 0.09127
## Cumulative Proportion  0.2169 0.3783 0.5096 0.6335 0.7468 0.8489 0.94017
##                            PC8
## Standard deviation     0.69182
## Proportion of Variance 0.05983
## Cumulative Proportion  1.00000

From the summary above and the scree plot below, we can see that the first six components (PC1 to PC6) explain about 85% of the variance (cumulative proportion 0.8489), making them a reasonable choice for dimensionality reduction.
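The cumulative proportions can be reproduced directly from the standard deviations. The sketch below copies the values printed in the summary and applies an 80% cut-off, which is my own assumption for illustration rather than a rule used in this paper.

```r
# Standard deviations copied from the summary(pca_result) output above;
# with the fitted object at hand, pca_result$sdev holds the same values
sdev <- c(1.3174, 1.1360, 1.0250, 0.9955, 0.9520, 0.9040, 0.85451, 0.69182)

prop_var <- sdev^2 / sum(sdev^2)  # proportion of variance per component
cum_var  <- cumsum(prop_var)      # cumulative proportion

threshold <- 0.80                 # assumed cut-off, chosen for illustration
n_keep <- which(cum_var >= threshold)[1]  # first component reaching it

round(cum_var, 4)
n_keep
```

With these values the threshold is first reached at the sixth component, which agrees with the choice of PC1 to PC6.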

library(factoextra)  # provides fviz_eig() and fviz_pca_var()

fviz_eig(pca_result, barfill = "#beedd4")

fviz_pca_var(pca_result, col.var="#7a319e")

pca_result$rotation
##                            PC1         PC2         PC3         PC4         PC5
## Year_Published     -0.02200395  0.02071852 -0.88274854  0.03983853  0.34288485
## Min.Players         0.37693338 -0.44851927 -0.04725960  0.22184683 -0.06649300
## Max.Players         0.34988682 -0.38329479 -0.22896035  0.25817173 -0.26974233
## Play.Time          -0.31317132 -0.19950284  0.04376372  0.63744806 -0.33523783
## Min.Age            -0.07742229 -0.53019084  0.24506497  0.06155596  0.74574410
## Users.Rated        -0.02920612 -0.51932488  0.01606381 -0.64347670 -0.28877103
## Rating.Average     -0.52316585 -0.21467739 -0.31078383 -0.19168657 -0.21922290
## Complexity.Average -0.59698431 -0.12164654  0.08533444  0.14700022  0.07267518
##                             PC6         PC7        PC8
## Year_Published     -0.263733868 -0.01080419 -0.1761027
## Min.Players        -0.005492906  0.77492636  0.0184352
## Max.Players         0.518865670 -0.49569985 -0.1545574
## Play.Time          -0.544625438 -0.17830606  0.1143360
## Min.Age            -0.013107521 -0.21316507  0.2175134
## Users.Rated        -0.382687022 -0.12236101 -0.2651755
## Rating.Average      0.346719095  0.13551236  0.6002036
## Complexity.Average  0.312634818  0.20757133 -0.6736563

The PCA variable correlation plot helps visualize how the variables contribute to the first two principal components. Variables with longer arrows have a stronger influence on these two components, while the direction of the arrows shows how the variables relate to each other. The pca_result$rotation matrix provides the exact numerical loadings: Complexity.Average (-0.60) and Rating.Average (-0.52) contribute most to PC1, while Min.Age (-0.53), Users.Rated (-0.52) and Min.Players (-0.45) contribute most to PC2.
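The relationship between the loadings and the arrows can be sketched in base R. Assuming the plot draws each variable at its correlation with the components, which is how correlation circles are usually built, the arrow coordinates are the loadings multiplied by the component standard deviations. The example uses the built-in USArrests data rather than the BGG data, and var_coord and arrow_len are names I made up for illustration.

```r
# Built-in example data; PCA on standardized variables
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Scale each column of loadings by its component's standard deviation
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Arrow length on the PC1-PC2 plane (at most 1 on a correlation circle)
arrow_len <- sqrt(rowSums(var_coord[, 1:2]^2))

# Cross-check: the same numbers are the correlations between the
# standardized variables and the component scores
cor_check <- cor(scale(USArrests), pca$x)

round(arrow_len, 3)
```

An arrow close to length 1 means the variable is almost fully described by the first two components, while a short arrow means most of its variance lives in later components.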

As mentioned before, we’ll choose the first six principal components.

pca_selected <- as.data.frame(pca_result$x[, 1:6])

We can now create a scatter plot to visualize how board games are distributed based on their complexity, quality and player count.

library(ggplot2)  # provides ggplot() and the geoms used below

ggplot(pca_selected, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.6) +
  labs(title = "PCA: Board Game Clustering",
       x = "PC1 (Complexity & Quality)", 
       y = "PC2 (Popularity & Players)") +
  theme_light()

Most games are clustered near the center, suggesting that they share similar characteristics in terms of complexity and player engagement. The overall trend suggests that as complexity increases, popularity might decrease.

Even though the correlation matrix showed that publication year is almost unrelated to the other variables, including popularity, I decided to visualize how board games from different decades are distributed across the principal components, to offer another interesting perspective.

#generated with the help of AI
pca_selected$Year_Published <- boardgames_numeric$Year_Published
pca_selected$Decade <- cut(pca_selected$Year_Published,
                           breaks = seq(1950, 2020, by = 10), 
                           labels = paste0(seq(1950, 2010, by = 10), "s"))

ggplot(pca_selected, aes(x = PC1, y = PC2, color = Decade)) +
  geom_point(alpha = 0.8, size = 1.5) +
  scale_color_brewer(palette = "Set1") +  # Use different colors for decades
  labs(title = "PCA: Board Game Clustering (Colored by Decade)",
       x = "PC1 (Complexity & Quality)", 
       y = "PC2 (Popularity & Players)",
       color = "Decade") +
  theme_minimal()
## Warning: Removed 31 rows containing missing values or values outside the scale range
## (`geom_point()`).
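The warning about removed rows most likely comes from years that fall outside the breaks passed to cut(), which become NA. It is also worth knowing that cut() uses right-closed intervals by default, so a game from 1960 lands in the bin labelled "1950s". The sketch below uses a few hand-picked years (not the BGG data) to show both effects, together with one possible left-closed alternative; the extra 2030 break and the "2020s" label are my own assumptions.

```r
years <- c(1950, 1955, 1960, 2019, 2020, 2021)  # hand-picked test years

# Same breaks and labels as in the decade plot above
default_bins <- cut(years,
                    breaks = seq(1950, 2020, by = 10),
                    labels = paste0(seq(1950, 2010, by = 10), "s"))
# Intervals are (1950,1960], ..., (2010,2020]:
# 1950 and 2021 become NA, 1960 is labelled "1950s"

# Possible alternative: left-closed bins [1950,1960), ..., [2020,2030)
left_closed <- cut(years,
                   breaks = seq(1950, 2030, by = 10),
                   labels = paste0(seq(1950, 2020, by = 10), "s"),
                   right = FALSE)

data.frame(years, default_bins, left_closed)
```

With left-closed bins every year from 1950 to 2029 gets the decade it starts with; games published before 1950 would still be NA unless the breaks are extended further back.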

We can also visualize how the complexity of board games has evolved over time.

boardgames_numeric$Decade <- cut(boardgames_numeric$Year_Published, 
                         breaks = seq(1950, 2020, by = 10), 
                         labels = c("1950s", "1960s", "1970s", "1980s", "1990s", "2000s", "2010s"),
                         include.lowest = TRUE)

ggplot(boardgames_numeric, aes(x = Decade, y = Complexity.Average, fill = Decade)) +
  geom_boxplot(outlier.shape = NA) + 
  geom_jitter(alpha = 0.2, width = 0.2, color = "black") +  
  theme_minimal() +
  labs(title = "Board Game Complexity Over Time", 
       x = "Decade", 
       y = "Complexity (Average)") +
  theme(legend.position = "none")  

This visualization shows that the complexity of board games has generally grown over time, with recent games showing a wider range of complexity. This trend aligns with the growing interest in strategic gameplay mechanics in modern board game design.

For the final part of this paper, I fitted a linear model to understand how the principal components influence game popularity, measured by the log-transformed number of user ratings (Users.Rated).

regression_data <- cbind(boardgames_numeric$Users.Rated, pca_selected)
colnames(regression_data)[1] <- "Users.Rated"

lm_model <- lm(Users.Rated ~ ., data = regression_data)

summary(lm_model)
## 
## Call:
## lm(formula = Users.Rated ~ ., data = regression_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72016 -0.04695 -0.01191  0.04520  1.32369 
## 
## Coefficients:
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    -7.311e+01  1.581e+00  -46.251  < 2e-16 ***
## PC1             6.964e-03  1.476e-03    4.718 2.48e-06 ***
## PC2            -6.754e-01  1.677e-03 -402.794  < 2e-16 ***
## PC3             1.737e+00  1.541e-02  112.718  < 2e-16 ***
## PC4            -8.626e-01  1.976e-03 -436.473  < 2e-16 ***
## PC5            -1.019e+00  6.035e-03 -168.761  < 2e-16 ***
## PC6             4.401e-02  4.873e-03    9.030  < 2e-16 ***
## Year_Published  3.899e-02  8.037e-04   48.516  < 2e-16 ***
## Decade1960s    -9.277e-02  3.768e-02   -2.462  0.01388 *  
## Decade1970s    -1.018e-01  3.661e-02   -2.782  0.00543 ** 
## Decade1980s    -1.163e-01  3.944e-02   -2.949  0.00321 ** 
## Decade1990s    -1.253e-01  4.319e-02   -2.902  0.00373 ** 
## Decade2000s    -1.337e-01  4.788e-02   -2.792  0.00526 ** 
## Decade2010s    -1.174e-01  5.239e-02   -2.242  0.02505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1104 on 3481 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.9918, Adjusted R-squared:  0.9918 
## F-statistic: 3.254e+04 on 13 and 3481 DF,  p-value: < 2.2e-16
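A note on reading the R-squared above: Users.Rated was one of the eight variables that went into the PCA, so the components already contain it as part of their linear combinations. The sketch below uses made-up noise data (not the BGG data) with a hypothetical column named target to illustrate the effect: regressing an input variable on all of its own components reproduces it exactly, and dropping a component only lowers the fit partially.

```r
set.seed(1)  # reproducible toy data

# Four independent noise variables; "target" plays the role of Users.Rated
toy <- data.frame(a = rnorm(200), b = rnorm(200),
                  c = rnorm(200), target = rnorm(200))

# PCA on all columns, including the variable we later try to predict
pca <- prcomp(toy, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x)
scores$target <- toy$target

# R-squared computed by hand from the residuals
r_squared <- function(fit, y) {
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

fit_all  <- lm(target ~ PC1 + PC2 + PC3 + PC4, data = scores)  # every component
fit_some <- lm(target ~ PC1 + PC2 + PC3, data = scores)        # one dropped

r2_all  <- r_squared(fit_all, scores$target)
r2_some <- r_squared(fit_some, scores$target)

round(c(all_components = r2_all, three_components = r2_some), 4)
```

A high R-squared in this setting may therefore say more about how the components were built than about what drives popularity. If the goal is prediction, one option would be to leave Users.Rated out of the PCA and regress it on components built from the remaining variables.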

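Insights from the model: all six principal components and Year_Published are statistically significant (p < 0.001), and the model explains about 99% of the variance in the log-transformed number of ratings (R-squared 0.9918). The largest effects in absolute t-value belong to PC4 and PC2, the two components on which Users.Rated loads most heavily (-0.64 and -0.52). The coefficient of Year_Published is positive (0.039), meaning that, holding the other terms fixed, more recent games tend to receive more ratings.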

Conclusion

This paper suggests that board game popularity is influenced by a mix of complexity, quality, time trends and number of players. More recent games, especially those with higher complexity and quality, receive significantly more ratings. The number of players is an important factor in whether a game becomes popular. Overall, popularity appears to be driven both by game features and by external factors such as market trends and community engagement.