BoardGameGeek (BGG) is a comprehensive database and community centered on board games. BGG provides extensive details on a myriad of games, including game mechanics, themes, and user interactions such as ratings and reviews. Our analysis uses a data set extracted from BGG, focusing on several key variables that offer insight into the popularity and reception of these games among the gaming community. The goal of this report is to provide a statistical analysis of board games, particularly examining how factors such as game complexity and intended audience age affect a game’s average rating.
The data set under consideration is sourced from BoardGameGeek, which is both an extensive repository of game information and a community for board game enthusiasts. It contains detailed information on 21,925 games, including user ratings, recommended ages, play times, and much more.
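Our analysis focuses on the following variables as provided by BGG:
AvgRating: the average user rating of the game (the response variable).
GameWeight: the BGG game weight, a measure of complexity.
MinPlayers: the minimum player count.
MaxPlayers: the maximum player count.
ComAgeRec: the BGG community’s age recommendation.
MfgPlaytime: the manufacturer’s stated playtime.
These variables were selected to explore relationships between a game’s complexity, player counts, suitability for various age groups, playtime, and its overall rating within the community.
Our exploratory data analysis proceeds in three steps: we review descriptive statistics, generate histograms to assess general patterns, trends, and shape, and finally perform a correlation analysis.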
Descriptive statistics for the six variables in games (N = 21925 rows):

| | AvgRating | GameWeight | MinPlayers | MaxPlayers | ComAgeRec | MfgPlaytime |
|---|---|---|---|---|---|---|
| Mean | 6.42 | 1.98 | 2.01 | 5.71 | 10.00 | 90.51 |
| Std.Dev | 0.93 | 0.85 | 0.69 | 15.01 | 3.27 | 529.66 |
| Min | 1.04 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 |
| Q1 | 5.84 | 1.33 | 2.00 | 4.00 | 8.00 | 25.00 |
| Median | 6.45 | 1.97 | 2.00 | 4.00 | 10.00 | 45.00 |
| Q3 | 7.05 | 2.53 | 2.00 | 6.00 | 12.00 | 90.00 |
| Max | 9.91 | 5.00 | 10.00 | 999.00 | 21.00 | 60000.00 |
| MAD | 0.90 | 0.94 | 0.00 | 2.97 | 2.97 | 37.06 |
| IQR | 1.22 | 1.19 | 0.00 | 2.00 | 4.00 | 65.00 |
| CV | 0.15 | 0.43 | 0.35 | 2.63 | 0.33 | 5.85 |
| Skewness | -0.31 | 0.40 | 1.70 | 42.38 | 0.14 | 74.73 |
| SE.Skewness | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
| Kurtosis | 0.62 | 0.05 | 10.72 | 2646.43 | -0.38 | 7728.10 |
| N.Valid | 21925.00 | 21925.00 | 21925.00 | 21925.00 | 16395.00 | 21925.00 |
| Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 | 74.78 | 100.00 |
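The report does not show the call that produced these summaries. As a rough sketch, several of the same quantities can be computed in base R. The data frame `games_sim` and the helper `describe_column()` are hypothetical stand-ins built on simulated values, not the BGG data.

```r
# Simulated stand-in for the BGG data (assumption: the real `games` data is not reproduced here)
set.seed(1)
n <- 500
games_sim <- data.frame(
  AvgRating  = rnorm(n, mean = 6.4, sd = 0.9),
  GameWeight = runif(n, min = 1, max = 5),
  # The first 100 rows are missing, to mimic the incomplete ComAgeRec column
  ComAgeRec  = c(rep(NA, 100), sample(6:16, n - 100, replace = TRUE))
)

# Hypothetical helper: a subset of the statistics reported above, computed on non-missing values
describe_column <- function(x) {
  valid <- x[!is.na(x)]
  c(
    Mean      = mean(valid),
    Std.Dev   = sd(valid),
    Median    = median(valid),
    IQR       = IQR(valid),
    CV        = sd(valid) / mean(valid),
    N.Valid   = length(valid),
    Pct.Valid = 100 * length(valid) / length(x)
  )
}

# One column of statistics per variable
stats_table <- round(sapply(games_sim, describe_column), 2)
print(stats_table)
```

Reporting N.Valid and Pct.Valid alongside the other statistics makes it visible early that any model using ComAgeRec will be fitted on fewer rows.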
Looking at the histograms, three of the distributions (Board Game Ratings, Minimum Player Counts, and BGG Community Age Recommendation) look roughly bell-shaped, although Minimum Player Counts is concentrated at 2 (Q1, median, and Q3 all equal 2) with a long right tail (skewness 1.70).
The histogram for BGG Game Weights appears moderately right-skewed (skewness 0.40). This is not a problem for regression: the model does not require normally distributed predictors, and with N = 21925 the Central Limit Theorem makes the sampling distributions of the coefficient estimates approximately normal.
The histograms for Maximum Player Counts and Manufacturer’s Stated Playtime, however, are heavily right-skewed, driven by extreme values (a maximum of 999 players and a maximum playtime of 60000). Skew of this size can create high-leverage points and non-normal residuals, which would violate the assumptions of linear regression, so we will try transformations to adjust for it.
Overall, our pairwise plots suggest that, other than Game Weight and Community Age Recommendation, most of these variables contribute little to the variability in the response variable, AvgRating.
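A minimal sketch of the correlation step, assuming a data frame with some missing values in one column. The data frame `sim` is simulated for illustration, so the correlations it prints are not the report's results.

```r
# Simulated data (assumption): rating and age recommendation both rise with game weight
set.seed(2)
n <- 500
weight <- runif(n, min = 1, max = 5)
sim <- data.frame(
  GameWeight = weight,
  ComAgeRec  = round(6 + 1.5 * weight + rnorm(n, sd = 2)),
  AvgRating  = 5.5 + 0.5 * weight + rnorm(n, sd = 0.7)
)
sim$ComAgeRec[sample(n, 100)] <- NA  # mimic missing community age recommendations

# Pairwise-complete correlations use every row available for each pair of variables
cor_matrix <- cor(sim, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))

# A scatterplot matrix of the same variables could be drawn with: pairs(sim)
```

With `use = "pairwise.complete.obs"`, each correlation may be based on a different number of rows, which is worth keeping in mind when comparing entries of the matrix.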
Now, we’ll look at the candidate models and their residual analyses.
The initial full linear model will consist of 5 predictor variables. The linear model is represented as the following equation:
\(\text{AvgRating} = \beta_0 + \beta_1 \cdot \text{GameWeight} + \beta_2 \cdot \text{MinPlayers} + \beta_3 \cdot \text{MaxPlayers} + \beta_4 \cdot \text{ComAgeRec} + \beta_5 \cdot \text{MfgPlaytime} + \varepsilon\)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 5.60 | 0.03 | 202.53 | 0.00 |
| GameWeight | 0.55 | 0.01 | 55.95 | 0.00 |
| MinPlayers | -0.13 | 0.01 | -13.85 | 0.00 |
| MaxPlayers | 0.00 | 0.00 | 2.00 | 0.05 |
| ComAgeRec | 0.00 | 0.00 | 1.52 | 0.13 |
| MfgPlaytime | 0.00 | 0.00 | -0.21 | 0.84 |
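As an illustration of how a coefficient table of this shape can be produced, the sketch below fits the same five-predictor formula with `lm()`. The data are simulated (an assumption made so the example runs on its own), so the estimates will not match the table above.

```r
# Simulated stand-in for the modeling data (assumption: not the BGG data)
set.seed(3)
n <- 1000
sim <- data.frame(
  GameWeight  = runif(n, min = 1, max = 5),
  MinPlayers  = sample(1:4, n, replace = TRUE),
  MaxPlayers  = sample(2:8, n, replace = TRUE),
  ComAgeRec   = sample(6:16, n, replace = TRUE),
  MfgPlaytime = rexp(n, rate = 1 / 60)
)
# Only GameWeight and MinPlayers drive the simulated response
sim$AvgRating <- 5.6 + 0.55 * sim$GameWeight - 0.13 * sim$MinPlayers +
  rnorm(n, sd = 0.75)

# Full additive model with all five predictors
full_model <- lm(
  AvgRating ~ GameWeight + MinPlayers + MaxPlayers + ComAgeRec + MfgPlaytime,
  data = sim
)

# Estimate, standard error, t value, and p-value for each term
coef_table <- round(coef(summary(full_model)), 2)
print(coef_table)
```

Rounding to two decimals, as in the table above, can display small but nonzero coefficients as 0.00, so the t values and p-values are the more reliable guide to which terms matter.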
We now look at the residual analysis.
The conditions seem to be satisfied for the most part. The Residuals vs Fitted plot shows no clear curvature, which supports the linearity condition. The Scale-Location plot shows a roughly even spread, which supports constant variance. The Residuals vs Leverage plot shows a handful of influential points (large outliers or leverage points), but these should not affect the model’s performance significantly.
However, the points in the Q-Q plot do not fall along a straight line, meaning that the assumption of normally distributed residuals is not fully met. As a result, we will explore other models.
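The quantities behind these four diagnostic plots can also be inspected numerically. The sketch below uses a small simulated model; the 4 / n cutoff for Cook's distance is a common rule of thumb adopted here as an assumption, not a threshold taken from this analysis.

```r
# Simulated one-predictor model (assumption: illustrative data only)
set.seed(4)
n <- 500
x <- runif(n, min = 1, max = 5)
y <- 5.6 + 0.55 * x + rnorm(n, sd = 0.75)
model <- lm(y ~ x)

# Per-observation quantities that the diagnostic plots are built from
diagnostics <- data.frame(
  fitted    = fitted(model),
  std_resid = rstandard(model),
  leverage  = hatvalues(model),
  cooks_d   = cooks.distance(model)
)

# Flag potentially influential observations using the assumed 4 / n rule of thumb
influential <- which(diagnostics$cooks_d > 4 / n)
print(length(influential))

# Draw the four standard diagnostic plots when running interactively
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(model)
}
```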
We will use bidirectional stepwise regression to create our second candidate model. Stepwise regression is a method of automatic variable selection that is particularly useful when a data set contains multiple potential predictors. It simplifies model building by systematically adding or removing variables and re-evaluating the model against a selection criterion (for example, AIC or a significance test) at each iteration. In the bidirectional variant, predictors can be both added and removed at every step, and the procedure stops when no single change improves the criterion, leaving only the retained candidate variables.
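One common way to run such a search in R is `step()` from the stats package, which selects by AIC. Whether this analysis used `step()` is an assumption; the sketch below runs it on simulated data in which MfgPlaytime is unrelated to the response by construction.

```r
# Simulated data (assumption): MfgPlaytime has no effect on the response
set.seed(5)
n <- 1000
sim <- data.frame(
  GameWeight  = runif(n, min = 1, max = 5),
  MinPlayers  = sample(1:4, n, replace = TRUE),
  MfgPlaytime = rexp(n, rate = 1 / 60)
)
sim$AvgRating <- 5.6 + 0.55 * sim$GameWeight - 0.13 * sim$MinPlayers +
  rnorm(n, sd = 0.75)

# Start from the full model and allow terms to be dropped or re-added at each step
full_model <- lm(AvgRating ~ ., data = sim)
step_model <- step(full_model, direction = "both", trace = 0)

# The formula of the selected model
print(formula(step_model))
```

Because `step()` compares models by AIC, all candidate models must be fitted on the same rows; with missing values in a predictor, dropping incomplete rows first (for example with `na.omit()`) keeps the comparison valid.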
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 5.60 | 0.03 | 203.98 | 0.00 |
| GameWeight | 0.55 | 0.01 | 56.54 | 0.00 |
| MinPlayers | -0.13 | 0.01 | -13.89 | 0.00 |
| MaxPlayers | 0.00 | 0.00 | 2.00 | 0.05 |
| ComAgeRec | 0.00 | 0.00 | 1.52 | 0.13 |
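Running a bidirectional stepwise regression removed MfgPlaytime from the model. The overall F-statistic increased from F(5, 16389) = 1323 to F(4, 16390) = 1654, which we take as an improvement, although part of this increase comes from having one fewer predictor in the numerator degrees of freedom; the remaining coefficient estimates are essentially unchanged.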
Let’s look now at the Residual plots.
The conditions seem to be satisfied. The Residuals vs Fitted plot supports the linearity condition, and the Scale-Location plot supports constant variance. The Residuals vs Leverage plot shows that there are no more influential points. The Q-Q plot shows a slight skew, but with a sample this large the Central Limit Theorem keeps inference on the coefficients approximately valid.
The Box-Cox transformation is a statistical technique used to stabilize variance and normalize data, which is particularly useful for enhancing the performance of models that assume normality and homoscedasticity (constant variance). This transformation applies a power function to the response variable, parameterized by a lambda (λ) value, which determines the specific transformation applied.
When λ equals zero, the Box-Cox transformation becomes a natural logarithm transformation, and for other values, it modifies the data by raising it to a power specified by λ. By transforming the data in this way, the Box-Cox transformation can make skewed data more symmetric and normally distributed, improving the accuracy and validity of statistical analyses and predictive models.
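For reference, the standard one-parameter Box-Cox family for a positive response \(y\) can be written as follows. The symbol \(y^{(\lambda)}\) for the transformed response is notation introduced here, not taken from the analysis output.

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\
\ln y, & \lambda = 0.
\end{cases}
\]

With \(\lambda = 2\), for example, this gives \((y^{2} - 1)/2\), a linear rescaling of \(y^{2}\), so regressing on \(y^{2}\) directly yields the same fit up to a shift and scale of the coefficients.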
Our Box-Cox plot indicates that the optimal λ is roughly 2, which means that we will raise our response variable, Average Rating, to the second power.
We’ll apply this transformation to both the original model as well as the stepwise model.
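A sketch of how λ can be estimated from a fitted model, assuming the `boxcox()` function from the MASS package (the report does not name the function it used). The response is simulated so that its square is linear in the predictor, which means the profile likelihood should peak near λ = 2.

```r
library(MASS)

# Simulated data (assumption): y^2 is linear in x with normal errors
set.seed(6)
n <- 2000
x <- runif(n, min = 1, max = 5)
sim <- data.frame(x = x, y = sqrt(4 + 12 * x + rnorm(n, sd = 2)))

fit <- lm(y ~ x, data = sim)

# Profile log-likelihood over a grid of lambda values, without drawing the plot
bc <- boxcox(fit, lambda = seq(-1, 4, by = 0.05), plotit = FALSE)

# The lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# Refit with the response raised to the estimated power
fit_transformed <- lm(I(y^lambda_hat) ~ x, data = sim)
```

In practice the estimate is usually rounded to a convenient value (such as 2) when it falls inside the plotted confidence interval, which keeps the transformed response interpretable.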
Box-Cox transformed full model, `lm((AvgRating)^2 ~ ., data = new_games)`:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 31.40302675 | 0.35026027 | 89.656 | <2e-16 |
| GameWeight | 7.16643926 | 0.12434936 | 57.631 | <2e-16 |
| MinPlayers | -1.76154271 | 0.11540766 | -15.264 | <2e-16 |
| MaxPlayers | 0.01065303 | 0.00467436 | 2.279 | 0.0227 |
| ComAgeRec | 0.05208694 | 0.03027162 | 1.721 | 0.0853 |
| MfgPlaytime | 0.00006905 | 0.00012719 | 0.543 | 0.5872 |

Residuals: Min -49.552, 1Q -6.064, Median -0.022, 3Q 6.079, Max 48.023. Residual standard error: 9.556 on 16389 degrees of freedom (5530 observations deleted due to missingness). Multiple R-squared: 0.3029, Adjusted R-squared: 0.3027. F-statistic: 1425 on 5 and 16389 DF, p-value: < 2.2e-16.

Box-Cox transformed stepwise model, `lm((AvgRating)^2 ~ . - MfgPlaytime, data = new_games)`:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 31.380627 | 0.347814 | 90.222 | <2e-16 |
| GameWeight | 7.176395 | 0.122987 | 58.351 | <2e-16 |
| MinPlayers | -1.757699 | 0.115188 | -15.259 | <2e-16 |
| MaxPlayers | 0.010666 | 0.004674 | 2.282 | 0.0225 |
| ComAgeRec | 0.052181 | 0.030270 | 1.724 | 0.0848 |

Residuals: Min -49.587, 1Q -6.066, Median -0.028, 3Q 6.080, Max 48.029. Residual standard error: 9.556 on 16390 degrees of freedom (5530 observations deleted due to missingness). Multiple R-squared: 0.3029, Adjusted R-squared: 0.3028. F-statistic: 1781 on 4 and 16390 DF, p-value: < 2.2e-16.
The Box-Cox transformation on the stepwise model performed the best, with an F-statistic of F(4, 16390) = 1781.
The residual plots for our Box-Cox transformed stepwise model show that all conditions are met.
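Because the response in this model is the squared rating, its coefficients are on the squared scale, and a prediction must be square-rooted to return to the rating scale. The sketch below does this arithmetic with the coefficients reported for the Box-Cox transformed stepwise model; the game's predictor values are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the Box-Cox transformed stepwise model output
coefs <- c(
  "(Intercept)" = 31.380627,
  GameWeight    = 7.176395,
  MinPlayers    = -1.757699,
  MaxPlayers    = 0.010666,
  ComAgeRec     = 0.052181
)

# Hypothetical game (assumption): weight 2, 2 to 4 players, recommended age 10
game <- c(
  "(Intercept)" = 1,
  GameWeight    = 2,
  MinPlayers    = 2,
  MaxPlayers    = 4,
  ComAgeRec     = 10
)

# Linear predictor on the squared-rating scale
pred_squared <- sum(coefs * game[names(coefs)])

# Back-transform to the original rating scale
pred_rating <- sqrt(pred_squared)

print(round(c(squared = pred_squared, rating = pred_rating), 2))
```

The back-transformed value is an approximation: the square root of a predicted mean of squared ratings is not exactly the predicted mean rating, though the two are close when the residual spread is small relative to the prediction.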
Next, we explore whether interaction or polynomial terms improve the model. We use MaxPlayers and MinPlayers as an interaction term, as they potentially have an impact on one another. We create three interaction models: one without the Box-Cox transformation, one with it, and one with the transformation but without MfgPlaytime, the variable removed by stepwise regression.
Interaction model without transformation, `lm(AvgRating ~ MaxPlayers * MinPlayers + GameWeight + ComAgeRec + MfgPlaytime, data = new_games)`:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 5.67703164 | 0.03036631 | 186.952 | < 2e-16 |
| MaxPlayers | -0.00798460 | 0.00140771 | -5.672 | 1.43e-08 |
| MinPlayers | -0.16457814 | 0.01089116 | -15.111 | < 2e-16 |
| GameWeight | 0.55210218 | 0.00980992 | 56.280 | < 2e-16 |
| ComAgeRec | 0.00223260 | 0.00239472 | 0.932 | 0.351 |
| MfgPlaytime | -0.00000259 | 0.00001002 | -0.258 | 0.796 |
| MaxPlayers:MinPlayers | 0.00430822 | 0.00067104 | 6.420 | 1.40e-10 |

Residuals: Min -5.5954, 1Q -0.4360, Median 0.0432, 3Q 0.4945, Max 3.2186. Residual standard error: 0.7529 on 16388 degrees of freedom (5530 observations deleted due to missingness). Multiple R-squared: 0.2894, Adjusted R-squared: 0.2891. F-statistic: 1112 on 6 and 16388 DF, p-value: < 2.2e-16.

Interaction model with Box-Cox transformation, `lm((AvgRating)^2 ~ MaxPlayers * MinPlayers + GameWeight + ComAgeRec + MfgPlaytime, data = new_games)`:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 32.47512176 | 0.38492894 | 84.367 | < 2e-16 |
| MaxPlayers | -0.10427380 | 0.01784437 | -5.844 | 5.21e-09 |
| MinPlayers | -2.26870295 | 0.13805838 | -16.433 | < 2e-16 |
| GameWeight | 7.20953393 | 0.12435237 | 57.977 | < 2e-16 |
| ComAgeRec | 0.03376989 | 0.03035586 | 1.112 | 0.266 |
| MfgPlaytime | 0.00006213 | 0.00012702 | 0.489 | 0.625 |
| MaxPlayers:MinPlayers | 0.05676144 | 0.00850627 | 6.673 | 2.59e-11 |

Residuals: Min -49.521, 1Q -6.028, Median -0.031, 3Q 6.080, Max 48.056. Residual standard error: 9.543 on 16388 degrees of freedom (5530 observations deleted due to missingness). Multiple R-squared: 0.3048, Adjusted R-squared: 0.3046. F-statistic: 1198 on 6 and 16388 DF, p-value: < 2.2e-16.

Interaction model with Box-Cox transformation and without MfgPlaytime, `lm((AvgRating)^2 ~ MaxPlayers * MinPlayers + GameWeight + ComAgeRec, data = new_games)`:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 32.455608 | 0.382847 | 84.774 | < 2e-16 |
| MaxPlayers | -0.104331 | 0.017844 | -5.847 | 5.10e-09 |
| MinPlayers | -2.265548 | 0.137904 | -16.428 | < 2e-16 |
| GameWeight | 7.218517 | 0.122986 | 58.694 | < 2e-16 |
| ComAgeRec | 0.033844 | 0.030355 | 1.115 | 0.265 |
| MaxPlayers:MinPlayers | 0.056795 | 0.008506 | 6.677 | 2.51e-11 |

Residuals: Min -49.552, 1Q -6.030, Median -0.034, 3Q 6.077, Max 48.062. Residual standard error: 9.543 on 16389 degrees of freedom (5530 observations deleted due to missingness). Multiple R-squared: 0.3048, Adjusted R-squared: 0.3046. F-statistic: 1437 on 5 and 16389 DF, p-value: < 2.2e-16.
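Looking at all three models, none of them beats the Box-Cox transformed stepwise model on the overall F-statistic: the interaction models reach F = 1112, 1198, and 1437, compared with 1781. The interaction term itself is statistically significant (p < 0.001 in each model) and adjusted R-squared rises slightly (from 0.3028 to 0.3046 for the transformed models), but by the F-statistic criterion used throughout this report the added term does not pay for itself. As a result, we choose not to include an interaction term or polynomial term in our final model.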
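When two candidate models are nested, as the additive and interaction models are, they can also be compared directly. The sketch below shows one way to do so on simulated data with no true interaction (an assumption); it tabulates R-squared, adjusted R-squared, and AIC, and runs a partial F-test with `anova()`.

```r
# Simulated data (assumption): no true interaction between the player counts
set.seed(7)
n <- 1000
min_players <- sample(1:4, n, replace = TRUE)
sim <- data.frame(
  GameWeight = runif(n, min = 1, max = 5),
  MinPlayers = min_players,
  MaxPlayers = min_players + sample(0:6, n, replace = TRUE)
)
sim$AvgRating <- 5.6 + 0.55 * sim$GameWeight - 0.15 * sim$MinPlayers +
  rnorm(n, sd = 0.75)

additive_model    <- lm(AvgRating ~ MaxPlayers + MinPlayers + GameWeight, data = sim)
interaction_model <- lm(AvgRating ~ MaxPlayers * MinPlayers + GameWeight, data = sim)

# Side-by-side fit summaries for the two models
comparison <- data.frame(
  model     = c("additive", "interaction"),
  r_squared = c(summary(additive_model)$r.squared,
                summary(interaction_model)$r.squared),
  adj_r2    = c(summary(additive_model)$adj.r.squared,
                summary(interaction_model)$adj.r.squared),
  aic       = c(AIC(additive_model), AIC(interaction_model))
)
print(comparison)

# Partial F-test: does the interaction term explain significantly more variance?
nested_test <- anova(additive_model, interaction_model)
print(nested_test)
```

The overall F-statistic divides explained variance by the number of predictors, so it can fall when a term is added even if that term is significant; adjusted R-squared, AIC, and the partial F-test judge the added term on its own contribution.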
After evaluating various models, the Box-Cox transformed stepwise regression model provided the best fit for the data by our F-statistic criterion, with F(4, 16390) = 1781 and an adjusted R-squared of 0.3028. This model excludes the MfgPlaytime variable; dropping it left R-squared essentially unchanged (0.3029 in both transformed models) while simplifying the model. The stepwise regression helped refine the model by systematically removing predictors that did not contribute meaningfully to explaining the response variable, AvgRating.
The analysis revealed several key insights:
Game Complexity (GameWeight): There is a moderate positive correlation between game complexity and rating, indicating that more complex games tend to receive higher ratings. This may suggest that the target audience of BGG consists of enthusiasts who appreciate challenging games.
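Minimum and Maximum Players: MinPlayers has a significant negative coefficient in every model, meaning that, with the other predictors held fixed, games requiring more players tend to be rated lower. The MaxPlayers coefficient is close to zero and significant only at the 5% level (p = 0.0225 in the final model). Adding a MinPlayers by MaxPlayers interaction did not improve the overall F-statistic.
Recommended Age (ComAgeRec): The community’s recommended age was retained in the final model, but its positive coefficient is only marginally significant (p = 0.0848), so there is at most weak evidence that games intended for older audiences are rated higher once complexity is accounted for. Any such effect could reflect thematic complexity or challenging mechanics.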
The Box-Cox transformation on the response variable AvgRating improved model performance by stabilizing variance and enhancing normality, demonstrating the importance of preprocessing and transformation in statistical modeling.
Several limitations must be considered:
Limited Scope of Variables: The chosen variables might not capture all aspects influencing a game’s rating. Other factors, such as whether the board game was crowdfunded (for example, through Kickstarter) or the ratio of users who own the game to those who want to own it, may also matter.
Community Bias: The data source, BGG, primarily caters to enthusiasts who may have biases towards specific game types or complexities. This community’s ratings may not represent broader public opinion.
Non-Normal Data Distribution: Despite transformations, some variables remained skewed, potentially affecting model performance. Additionally, the ratings might contain some subjective biases that can’t be accounted for entirely in a model.
Influence of Outliers: Although some attempts were made to remove outliers, the data may still contain influential points affecting model accuracy. Further analysis could apply robust statistical techniques to identify and handle extreme outliers.
Future research could benefit from incorporating a more comprehensive set of predictors, such as user demographics, marketing strategies, or data from other platforms. Testing different modeling techniques like random forests or deep learning may also yield better predictions by capturing more complex relationships in the data.