1 Introduction

BoardGameGeek (BGG) is a comprehensive database and community centered around board games. BGG provides extensive details on a myriad of games, including game mechanics, themes, and user interactions such as ratings and reviews. Our analysis uses a data set extracted from BGG, focusing on several key variables that offer insight into the popularity and reception of these games among the gaming community. The goal of this report is to provide a statistical analysis of board games, particularly examining how factors such as game complexity and intended audience age affect a game’s average rating.

1.1 Data Source

The data set under consideration is sourced from BoardGameGeek, an authoritative and extensive database for board games. It contains detailed information on thousands of games, including user ratings, recommended ages, play times, and much more. This platform is not only a repository of game information but also a community for board game enthusiasts.

1.2 Variables of Interest

Our analysis focuses on the following variables as provided by BGG:

  • AvgRating: The average rating given to a game, which serves as our response variable (Y-variable).
  • GameWeight: An index that measures game complexity and difficulty.
  • MinPlayers: Minimum number of players recommended for the game.
  • MaxPlayers: Maximum number of players recommended for the game.
  • ComAgeRec: The community’s recommended age for players.
  • MfgPlaytime: The Manufacturer’s stated playtime for a game.

These variables were selected to explore relationships between the game’s complexity, suitability for various age groups, playtime, and its overall rating within the community.

2 Exploratory Data Analysis

For our exploratory data analysis, we review descriptive statistics, generate histograms to assess general patterns, trends, and shape, and finally perform a correlation analysis.

2.1 Descriptive Statistics

Descriptive Statistics
games
N: 21925

                    AvgRating   GameWeight   MinPlayers   MaxPlayers   ComAgeRec   MfgPlaytime
----------------- ----------- ------------ ------------ ------------ ----------- -------------
             Mean        6.42         1.98         2.01         5.71       10.00         90.51
          Std.Dev        0.93         0.85         0.69        15.01        3.27        529.66
              Min        1.04         0.00         0.00         0.00        2.00          0.00
               Q1        5.84         1.33         2.00         4.00        8.00         25.00
           Median        6.45         1.97         2.00         4.00       10.00         45.00
               Q3        7.05         2.53         2.00         6.00       12.00         90.00
              Max        9.91         5.00        10.00       999.00       21.00      60000.00
              MAD        0.90         0.94         0.00         2.97        2.97         37.06
              IQR        1.22         1.19         0.00         2.00        4.00         65.00
               CV        0.15         0.43         0.35         2.63        0.33          5.85
         Skewness       -0.31         0.40         1.70        42.38        0.14         74.73
      SE.Skewness        0.02         0.02         0.02         0.02        0.02          0.02
         Kurtosis        0.62         0.05        10.72      2646.43       -0.38       7728.10
          N.Valid    21925.00     21925.00     21925.00     21925.00    16395.00      21925.00
        Pct.Valid      100.00       100.00       100.00       100.00       74.78        100.00
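As a rough illustration of how several of the statistics above are defined, the following Python sketch computes them for a small list. The playtime values are made up for the example and are not taken from the BGG data, and the quartile and skewness conventions used here are assumptions that may differ slightly from those of the software that produced the table.

```python
import statistics

def describe(values):
    """Summary statistics for a list of numbers; None entries count as missing."""
    valid = [v for v in values if v is not None]
    n = len(valid)
    mean = statistics.fmean(valid)
    sd = statistics.stdev(valid)  # sample standard deviation (n - 1 denominator)
    q1, median, q3 = statistics.quantiles(valid, n=4)
    # Moment-based skewness: third central moment over the cubed population SD
    m2 = sum((v - mean) ** 2 for v in valid) / n
    m3 = sum((v - mean) ** 3 for v in valid) / n
    return {
        "Mean": mean, "Std.Dev": sd, "Q1": q1, "Median": median, "Q3": q3,
        "IQR": q3 - q1, "CV": sd / mean, "Skewness": m3 / m2 ** 1.5,
        "N.Valid": n, "Pct.Valid": 100 * n / len(values),
    }

# Hypothetical playtimes with one extreme value and one missing entry
playtime = [15, 20, 30, 30, 45, 45, 60, 90, 120, 600, None]
summary = describe(playtime)
for name, value in summary.items():
    print(f"{name:>10}: {value:.2f}")
```

A single extreme value is enough to pull the mean well above the median and produce a large positive skewness, which is the pattern seen for MaxPlayers and MfgPlaytime.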

2.2 Histograms

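Looking at the histograms, we see that three of the distributions (Board Game Ratings, Minimum Player Counts, and BGG Community Age Recommendation) look relatively normal. Minimum Player Counts is, however, heavily concentrated at 2 (Q1 = median = Q3 = 2), so its histogram is dominated by a single spike.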

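The histogram for BGG Game Weights appears moderately right-skewed (skewness 0.40). Because the sample is large (N = 21925), the Central Limit Theorem implies that the sampling distributions of means and of the regression coefficient estimates are approximately normal even though the variable itself is not, so this mild skew should not be a problem for inference.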

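The histograms for Maximum Player Counts and Manufacturer’s Stated Playtime, however, are heavily right-skewed, with maxima of 999 and 60000 respectively. Linear regression does not require normally distributed predictors, but values this extreme can act as high-leverage points and can distort the residuals, whose normality the model does assume. We therefore explore transformations in the models below.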

2.3 Correlation Analysis

  • AvgRating and GameWeight: A correlation of 0.48 suggests a moderate positive relationship, indicating that games with higher complexity (or GameWeight) tend to have higher ratings.
  • GameWeight and MfgPlaytime: A correlation of 0.65 is relatively strong, suggesting that more complex games also tend to have longer play times.
  • MinPlayers and MaxPlayers: These variables have very little correlation with most other variables.

Overall, our pairwise plots suggest that, other than Game Weight and Community Age Recommendation, most of these variables contribute little to the variability in the response variable.
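For reference, the correlations quoted above are Pearson coefficients. A minimal Python sketch of the calculation follows. The two short lists are invented for illustration, and dropping incomplete pairs is an assumption about how missing values (such as those in ComAgeRec) might be handled.

```python
import math

def pearson_r(x, y):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical complexity scores and ratings; the last pair is incomplete
weights = [1.2, 1.8, 2.1, 2.5, 3.0, 3.6, None]
ratings = [5.9, 6.1, 6.6, 6.4, 7.0, 7.4, 6.0]
r = pearson_r(weights, ratings)
print(f"r = {r:.2f}")
```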

3 Our Models

Now, we’ll look at the candidate models and their residual analyses.

3.1 Full Model

The initial full linear model will consist of 5 predictor variables. The linear model is represented as the following equation:

\(\text{AvgRating} = \beta_0 + \beta_1 \cdot \text{GameWeight} + \beta_2 \cdot \text{MinPlayers} + \beta_3 \cdot \text{MaxPlayers} + \beta_4 \cdot \text{ComAgeRec} + \beta_5 \cdot \text{MfgPlaytime} + \varepsilon\)

Statistics of Regression Coefficients

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       5.60        0.03   202.53      0.00
GameWeight        0.55        0.01    55.95      0.00
MinPlayers       -0.13        0.01   -13.85      0.00
MaxPlayers        0.00        0.00     2.00      0.05
ComAgeRec         0.00        0.00     1.52      0.13
MfgPlaytime       0.00        0.00    -0.21      0.84

We now look at the residual analysis.

The conditions seem to be satisfied for the most part. The Residuals vs Fitted plot suggests that the linearity condition holds, and the Scale-Location plot suggests roughly constant variance. The Residuals vs Leverage plot shows a handful of influential points (large outliers or leverage points), but these should not affect the model’s performance significantly.

However, the points on the Q-Q plot do not follow a straight line, meaning that the normality assumption for the residuals is not satisfied. As a result, we will explore other models.
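To show what the Q-Q plot compares, here is a small Python sketch that pairs sorted, standardized residuals with the corresponding standard normal quantiles. The residuals are invented and deliberately left-skewed, and the plotting positions (i + 0.5)/n are one common convention, assumed here rather than taken from the report.

```python
from statistics import NormalDist, fmean, pstdev

def qq_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs for a normal Q-Q plot."""
    n = len(residuals)
    mu, sigma = fmean(residuals), pstdev(residuals)
    sample = sorted((r - mu) / sigma for r in residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))

# Hypothetical residuals with a long lower tail
residuals = [-2.9, -1.4, -0.6, -0.2, 0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
points = qq_points(residuals)
for theo, obs in points:
    print(f"{theo:6.2f}  {obs:6.2f}")
```

If the residuals were normal, the two columns would be close to each other. Here the smallest residual falls well below its theoretical quantile, which is the kind of departure from the straight line that the Q-Q plot reveals.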

3.2 Stepwise Regression Model

We use bidirectional stepwise regression to create our second candidate model. Stepwise regression is a method of automatic variable selection, particularly useful when a data set contains multiple potential predictors. It simplifies model building by systematically adding or removing variables according to a specified criterion and re-evaluating the model at each iteration. In the bidirectional variant, predictors are both added and removed iteratively until no single change improves the criterion, which leaves us with only the appropriate candidate variables.
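The procedure can be sketched in plain Python as follows. This is an illustration under several assumptions, not the code used for the report: the selection criterion is taken to be AIC, the search starts from the full model, and the data are simulated, with the names `weight`, `min_players`, and `noise` chosen only for the example.

```python
import math
import random

def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y stored as an augmented matrix
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2 for r in range(n))

def aic(columns, y):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * (number of coefficients)."""
    n = len(y)
    return n * math.log(ols_rss(columns, y) / n) + 2 * (len(columns) + 1)

def stepwise_aic(data, y):
    """Bidirectional search: at each step, try adding or dropping every single predictor."""
    selected = set(data)  # assumption: start from the full model
    best = aic([data[v] for v in sorted(selected)], y)
    while True:
        trials = [selected ^ {name} for name in data]  # toggle one predictor at a time
        scored = [(aic([data[v] for v in sorted(t)], y), t) for t in trials]
        score, trial = min(scored, key=lambda s: s[0])
        if score >= best:  # no single change lowers the AIC, so stop
            return sorted(selected), best
        selected, best = trial, score

# Simulated data: two real predictors and one pure-noise column
random.seed(1)
n = 200
data = {
    "weight": [random.uniform(1, 5) for _ in range(n)],
    "min_players": [random.choice([1, 2, 3, 4]) for _ in range(n)],
    "noise": [random.gauss(0, 1) for _ in range(n)],
}
y = [5.6 + 0.55 * w - 0.13 * m + random.gauss(0, 0.3)
     for w, m in zip(data["weight"], data["min_players"])]
kept, score = stepwise_aic(data, y)
print(kept)
```

Under an AIC criterion, a predictor is dropped only when removing it lowers the AIC, so a weak predictor can survive the search even if its p-value is above 0.05.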

Statistics of Regression Coefficients

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       5.60        0.03   203.98      0.00
GameWeight        0.55        0.01    56.54      0.00
MinPlayers       -0.13        0.01   -13.89      0.00
MaxPlayers        0.00        0.00     2.00      0.05
ComAgeRec         0.00        0.00     1.52      0.13
ComAgeRec 0.00 0.00 1.52 0.13

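Running bidirectional stepwise regression removed MfgPlaytime from the model. The overall F-statistic increased from F(5, 16389) = 1323 to F(4, 16390) = 1654. Because 1323 × 5/4 ≈ 1654, this increase mostly reflects the smaller number of predictors: the reduced model explains essentially the same variance with one fewer term, which makes it the more parsimonious choice.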

Let’s look now at the Residual plots.

The conditions seem to be satisfied. The Residuals vs Fitted plot suggests that the linearity condition holds, and the Scale-Location plot suggests roughly constant variance. The Residuals vs Leverage plot no longer shows influential points. The Q-Q plot shows a slight skew, but with a sample this large the Central Limit Theorem makes inference on the coefficients robust to mild departures from normality.

3.3 Box-Cox Transformed Model

The Box-Cox transformation is a statistical technique used to stabilize variance and normalize data, which is particularly useful for enhancing the performance of models that assume normality and homoscedasticity (constant variance). This transformation applies a power function to the response variable, parameterized by a lambda (λ) value, which determines the specific transformation applied.

When λ equals zero, the Box-Cox transformation becomes a natural logarithm transformation, and for other values, it modifies the data by raising it to a power specified by λ. By transforming the data in this way, the Box-Cox transformation can make skewed data more symmetric and normally distributed, improving the accuracy and validity of statistical analyses and predictive models.
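Written out, the standard one-parameter Box-Cox transformation of a positive response \(y\) is:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\ln y, & \lambda = 0.
\end{cases}
\]

For \(\lambda = 2\) this gives \((y^{2} - 1)/2\), which is a linear rescaling of \(y^{2}\). Regressing \(y^{2}\) directly therefore yields the same t-statistics, \(R^{2}\), and F-statistic as regressing the fully transformed response.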

Our Box-Cox plot indicates that λ is roughly 2, which means that we square our response variable, Average Rating.
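A value of λ is typically chosen by maximizing a profile log-likelihood over a grid, which is what a Box-Cox plot displays. The Python sketch below does this for an intercept-only model on simulated data. Both simplifications are assumptions made for brevity: the report's plot would profile over the regression model, and the simulated ratings are constructed so that their square is roughly normal.

```python
import math
import random

def boxcox(y, lam):
    """One-parameter Box-Cox transform of a positive value."""
    return math.log(y) if abs(lam) < 1e-12 else (y ** lam - 1) / lam

def profile_loglik(ys, lam):
    """Profile log-likelihood of lambda for an intercept-only normal model."""
    n = len(ys)
    t = [boxcox(y, lam) for y in ys]
    mean = sum(t) / n
    sigma2 = sum((v - mean) ** 2 for v in t) / n
    # The second term is the log-Jacobian of the transformation
    return -0.5 * n * math.log(sigma2) + (lam - 1) * sum(math.log(y) for y in ys)

# Simulated ratings whose square is approximately normal (hypothetical data)
random.seed(0)
ratings = [math.sqrt(max(random.gauss(40, 8), 1.0)) for _ in range(5000)]

grid = [i / 10 for i in range(-20, 41)]  # lambda from -2.0 to 4.0
best_lambda = max(grid, key=lambda lam: profile_loglik(ratings, lam))
print(best_lambda)
```

In practice the estimate is usually rounded to a convenient value inside its confidence interval, such as 2 here.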

We’ll apply this transformation to both the original model as well as the stepwise model.

Call:
lm(formula = (AvgRating)^2 ~ ., data = new_games)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.552  -6.064  -0.022   6.079  48.023 

Coefficients:
               Estimate  Std. Error t value Pr(>|t|)    
(Intercept) 31.40302675  0.35026027  89.656   <2e-16 ***
GameWeight   7.16643926  0.12434936  57.631   <2e-16 ***
MinPlayers  -1.76154271  0.11540766 -15.264   <2e-16 ***
MaxPlayers   0.01065303  0.00467436   2.279   0.0227 *  
ComAgeRec    0.05208694  0.03027162   1.721   0.0853 .  
MfgPlaytime  0.00006905  0.00012719   0.543   0.5872    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.556 on 16389 degrees of freedom
  (5530 observations deleted due to missingness)
Multiple R-squared:  0.3029,  Adjusted R-squared:  0.3027 
F-statistic:  1425 on 5 and 16389 DF,  p-value: < 2.2e-16

Call:
lm(formula = (AvgRating)^2 ~ . - MfgPlaytime, data = new_games)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.587  -6.066  -0.028   6.080  48.029 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.380627   0.347814  90.222   <2e-16 ***
GameWeight   7.176395   0.122987  58.351   <2e-16 ***
MinPlayers  -1.757699   0.115188 -15.259   <2e-16 ***
MaxPlayers   0.010666   0.004674   2.282   0.0225 *  
ComAgeRec    0.052181   0.030270   1.724   0.0848 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.556 on 16390 degrees of freedom
  (5530 observations deleted due to missingness)
Multiple R-squared:  0.3029,  Adjusted R-squared:  0.3028 
F-statistic:  1781 on 4 and 16390 DF,  p-value: < 2.2e-16

The Box-Cox transformation on the stepwise model performed the best, with an F-statistic of F(4, 16390) = 1781.
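The overall F-statistic is a function of R², the number of predictors, and the residual degrees of freedom, so the two values reported in the output above can be checked by hand. The short Python sketch below does so; the function name is ours, and the inputs are the rounded figures from the output.

```python
def f_statistic(r_squared, n_predictors, residual_df):
    """Overall F-statistic of a linear model computed from its R-squared."""
    return (r_squared / n_predictors) / ((1 - r_squared) / residual_df)

# Values copied from the two Box-Cox model summaries
f_full = f_statistic(0.3029, 5, 16389)
f_step = f_statistic(0.3029, 4, 16390)
print(round(f_full), round(f_step))
```

The results agree with the reported 1425 and 1781 up to the rounding of R² to four decimals. Both models have the same R², so the gap between the two F-statistics comes almost entirely from dividing by four predictors instead of five.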

The residual plots for our Box-Cox transformed stepwise model show that all conditions are met.
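Because this model predicts the square of AvgRating, a prediction has to be mapped back to the rating scale by taking a square root. The Python sketch below uses the coefficients from the stepwise Box-Cox output above; the example game is hypothetical. Note that the square root of a predicted mean of AvgRating² is only an approximation to the expected rating itself.

```python
import math

# Coefficients copied from the Box-Cox stepwise model (response: AvgRating squared)
COEF = {
    "(Intercept)": 31.380627,
    "GameWeight": 7.176395,
    "MinPlayers": -1.757699,
    "MaxPlayers": 0.010666,
    "ComAgeRec": 0.052181,
}

def predict_avg_rating(game):
    """Predict AvgRating by back-transforming the fitted value of AvgRating squared."""
    squared = COEF["(Intercept)"] + sum(COEF[name] * value for name, value in game.items())
    return math.sqrt(squared)

# Hypothetical game: medium complexity, 2 to 4 players, recommended for ages 10 and up
example_game = {"GameWeight": 2.0, "MinPlayers": 2, "MaxPlayers": 4, "ComAgeRec": 10}
predicted = predict_avg_rating(example_game)
print(f"{predicted:.2f}")
```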

3.4 Interaction Term Model

Next, we explore possible interaction or polynomial terms. We use MaxPlayers and MinPlayers as an interaction term, since they potentially have an impact on one another. We create three interaction models: one on the untransformed response with MfgPlaytime, one with the Box-Cox transformation and MfgPlaytime, and one with the Box-Cox transformation but without MfgPlaytime, which was removed when we conducted stepwise regression.

Call:
lm(formula = AvgRating ~ MaxPlayers * MinPlayers + GameWeight + 
    ComAgeRec + MfgPlaytime, data = new_games)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5954 -0.4360  0.0432  0.4945  3.2186 

Coefficients:
                         Estimate  Std. Error t value Pr(>|t|)    
(Intercept)            5.67703164  0.03036631 186.952  < 2e-16 ***
MaxPlayers            -0.00798460  0.00140771  -5.672 1.43e-08 ***
MinPlayers            -0.16457814  0.01089116 -15.111  < 2e-16 ***
GameWeight             0.55210218  0.00980992  56.280  < 2e-16 ***
ComAgeRec              0.00223260  0.00239472   0.932    0.351    
MfgPlaytime           -0.00000259  0.00001002  -0.258    0.796    
MaxPlayers:MinPlayers  0.00430822  0.00067104   6.420 1.40e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7529 on 16388 degrees of freedom
  (5530 observations deleted due to missingness)
Multiple R-squared:  0.2894,  Adjusted R-squared:  0.2891 
F-statistic:  1112 on 6 and 16388 DF,  p-value: < 2.2e-16

Call:
lm(formula = (AvgRating)^2 ~ MaxPlayers * MinPlayers + GameWeight + 
    ComAgeRec + MfgPlaytime, data = new_games)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.521  -6.028  -0.031   6.080  48.056 

Coefficients:
                         Estimate  Std. Error t value Pr(>|t|)    
(Intercept)           32.47512176  0.38492894  84.367  < 2e-16 ***
MaxPlayers            -0.10427380  0.01784437  -5.844 5.21e-09 ***
MinPlayers            -2.26870295  0.13805838 -16.433  < 2e-16 ***
GameWeight             7.20953393  0.12435237  57.977  < 2e-16 ***
ComAgeRec              0.03376989  0.03035586   1.112    0.266    
MfgPlaytime            0.00006213  0.00012702   0.489    0.625    
MaxPlayers:MinPlayers  0.05676144  0.00850627   6.673 2.59e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.543 on 16388 degrees of freedom
  (5530 observations deleted due to missingness)
Multiple R-squared:  0.3048,  Adjusted R-squared:  0.3046 
F-statistic:  1198 on 6 and 16388 DF,  p-value: < 2.2e-16

Call:
lm(formula = (AvgRating)^2 ~ MaxPlayers * MinPlayers + GameWeight + 
    ComAgeRec, data = new_games)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.552  -6.030  -0.034   6.077  48.062 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           32.455608   0.382847  84.774  < 2e-16 ***
MaxPlayers            -0.104331   0.017844  -5.847 5.10e-09 ***
MinPlayers            -2.265548   0.137904 -16.428  < 2e-16 ***
GameWeight             7.218517   0.122986  58.694  < 2e-16 ***
ComAgeRec              0.033844   0.030355   1.115    0.265    
MaxPlayers:MinPlayers  0.056795   0.008506   6.677 2.51e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.543 on 16389 degrees of freedom
  (5530 observations deleted due to missingness)
Multiple R-squared:  0.3048,  Adjusted R-squared:  0.3046 
F-statistic:  1437 on 5 and 16389 DF,  p-value: < 2.2e-16

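Judged by the overall F-statistic, the criterion used in the previous sections, none of the three interaction models (F = 1112, 1198, and 1437) matches the Box-Cox stepwise model (F = 1781). The interaction term itself is statistically significant, but it raises the adjusted R² only slightly, from 0.3028 to 0.3046. Because this gain is small relative to the added complexity, we choose not to include an interaction term or polynomial term in our final model.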
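To make the interaction coefficient concrete: with a MaxPlayers × MinPlayers term, the effect of one additional minimum player is no longer a single number but depends on MaxPlayers. The Python sketch below evaluates that slope using the coefficients from the third interaction model above (response: AvgRating squared). The function name and the chosen MaxPlayers values are ours.

```python
# Coefficients copied from the third interaction model (response: AvgRating squared)
B_MIN = -2.265548          # MinPlayers main effect
B_INTERACTION = 0.056795   # MaxPlayers:MinPlayers

def min_players_slope(max_players):
    """Change in predicted AvgRating squared per extra minimum player, at a given MaxPlayers."""
    return B_MIN + B_INTERACTION * max_players

slopes = {m: round(min_players_slope(m), 3) for m in (2, 4, 6)}
print(slopes)
```

Over the typical range of MaxPlayers (the quartiles are 4 and 6), the slope stays close to -2, and it would only change sign at roughly 40 maximum players. This is consistent with the interaction term adding little in practice.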

4 Results and Conclusions

After evaluating various models, the Box-Cox transformed stepwise regression model provided the best fit for the data by our criterion, with an F-statistic of 1781. This model excludes the MfgPlaytime variable, which raised the overall F-statistic relative to the full model while leaving R² essentially unchanged at 0.3029. The stepwise regression helped refine the model by systematically removing predictors that were not significant contributors to the response variable, AvgRating.

The analysis revealed several key insights:

  • Game Complexity (GameWeight): There is a moderate positive correlation between game complexity and rating, indicating that more complex games tend to receive higher ratings. This may suggest that the target audience of BGG consists of enthusiasts who appreciate challenging games.

  • Minimum and Maximum Players: These variables showed weak correlations with the other predictors. In the final model, MinPlayers has a significant negative coefficient and MaxPlayers a small positive one (p = 0.0225), but adding an interaction between the two did not lead to a better model by our criterion.

  • Recommended Age (ComAgeRec): The community’s recommended age for players was retained in the final model, but its coefficient is not significant at the 5% level (p = 0.0848). Any tendency for games intended for older audiences to be rated higher, perhaps due to their thematic complexity or challenging mechanics, is therefore weak once game complexity is accounted for.

The Box-Cox transformation on the response variable AvgRating improved model performance by stabilizing variance and enhancing normality, demonstrating the importance of preprocessing and transformation in statistical modeling.

5 Limitations and Suggestions

Several limitations must be considered:

  • Limited Scope of Variables: The chosen variables might not comprehensively capture all aspects influencing a game’s rating. Factors such as whether the board game was kick-started, or the ratio between owners of the game and those who want to own it, were not considered.

  • Community Bias: The data source, BGG, primarily caters to enthusiasts who may have biases towards specific game types or complexities. This community’s ratings may not represent broader public opinion.

  • Non-Normal Data Distribution: Despite transformations, some variables remained skewed, potentially affecting model performance. Additionally, the ratings might contain some subjective biases that can’t be accounted for entirely in a model.

  • Influence of Outliers: Although some attempts were made to remove outliers, the data may still contain influential points affecting model accuracy. Further analysis could delve into removing extreme outliers based on robust statistical techniques.

Future research could benefit from incorporating a more comprehensive set of predictors, such as user demographics, marketing strategies, or data from other platforms. Testing different modeling techniques like random forests or deep learning may also yield better predictions by capturing more complex relationships in the data.