py_install(c("pandas", "numpy", "matplotlib", "seaborn"))
This project analyzes and visualizes board game data collected from BoardGameGeek (BGG). BGG is the largest online collection of board game data, covering more than 100,000 games. This analysis focuses on board games that have at least 30 votes on BGG, approximately 20,000 games. The dataset includes games invented as early as 2650 BC and as recently as 2022 AD. It includes interesting game features, such as domains (strategy, family, etc.), mechanics (dice, card, etc.), minimum and maximum players, playing time, and complexity and rating values.
Exploring these features led to many interesting insights about the wonderful world of board games.
Source: https://www.kaggle.com/andrewmvd/board-games
This dataset contains data collected on board games from the BoardGameGeek (BGG) website in February 2021. BGG is the largest online collection of board game data which consists of data on more than 100,000 total games (ranked and unranked).
The voluntary online community contributes to the site with reviews, ratings, images, videos, session reports and live discussion forums on the expanding database of board games.
This data set contains all ranked games (~20,000) as of the date of collection from the BGG database. Unranked games are ignored as they have not been rated by enough BGG users (a game should receive at least 30 votes to be eligible for ranking).
Data on board games collected include:
Dilini Samarasinghe, July 5, 2021, “BoardGameGeek Dataset on Board Games”, IEEE Dataport, doi: https://dx.doi.org/10.21227/9g61-bs59. License CC BY 4.0
We made several observations about the dataset:
Year Published includes negative values, spanning from -3500 to 2022. On investigation, we learned that these values are accurate and point to ancient board games such as Senet, from 2650 BC.
We created new fields to aid in further exploration:
Decade Published column based on Year Published.
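The decade derivation can be illustrated with a small sketch. It assumes the same floor-division approach used later in the notebook; `to_decade` is a hypothetical helper name introduced only for this example.

```python
def to_decade(year: int) -> int:
    """Return the lower bound of the ten-year bin containing `year`."""
    # Floor division rounds toward negative infinity, so BC years
    # (stored as negative numbers) also land on the lower bound of their bin.
    return (year // 10) * 10

# Map a few sample years, including negative (BC) values, to their decade.
examples = {year: to_decade(year) for year in (2021, 1999, 5, -5, -3500)}
print(examples)
```

Because floor division is used rather than truncation, a year such as -5 falls in the -10 bin instead of the 0 bin, so years just before and just after year 0 are not merged into one decade.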
### Import dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import os
import matplotlib.patches as mpatches
filename = 'C:/Users/mishabella/DS736_DataViz/bgg_dataset.csv'
size = os.stat(filename).st_size
print("BoardGame File Size: ", round(size/10**6,2), "MB")
df = pd.read_csv(filename, skiprows=0, sep=";", decimal=",")  # add nrows=5 to load only 5 rows as a sanity check
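The `sep=";"` and `decimal=","` arguments matter because the file uses semicolons between fields and a comma as the decimal mark. A minimal sketch of the effect, using a hypothetical two-row sample (the game names and values are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the source file:
# semicolon-separated fields, comma as the decimal mark.
sample = io.StringIO(
    "Name;Rating Average;Complexity Average\n"
    "Game A;8,79;3,86\n"
    "Game B;7,5;2,0\n"
)

# With decimal=",", values such as "8,79" are parsed as floats
# instead of being left as strings.
parsed = pd.read_csv(sample, sep=";", decimal=",")
print(parsed.dtypes)
```

Without `decimal=","`, the two average columns would be read as text and would need to be converted before any numeric aggregation.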
### Explore dataset
df.isna().sum()
# we see a lot of nulls in Domains and Mechanics, but generally the dataset looks complete
# inspect year published
df['Year Published'].describe()
# Right away, we see something odd about the Year Published, with negative values, etc.
# look closer at Year Published
np.unique(df['Year Published'])
# We see unusual values in this dataset.
# investigate the min value returned
#df.loc[df['Year Published'].idxmin()]
# A quick wiki search (https://en.wikipedia.org/wiki/Senet) confirmed that Senet is an early board game from 2650 BC.
# This shows the data is not necessarily "impossible", but I am not sure it will be useful in this analysis.
np.unique(df['Min Players'])
# reasonable results
np.unique(df['Max Players'])
# the 999 definitely looks questionable
# investigate the max value returned
df.loc[df['Max Players'].idxmax()]
# Could not find on wiki, a card game/party game could conceivably be large, but 999 still seems impossible.
df.describe()
# Seeing min values of 0 across many of these (min/max players, play time, and age)
# I feel that 0 values should be removed
df.describe(include = [object])
# rating avg and complexity avg look numerical, should investigate to determine if they need to be transformed
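### Remove rows with Max Players >= 100 and rows containing 0 values
df = df[(df['Max Players'] < 100)]
df = df[(df != 0).all(axis=1)]
# reset_index returns a new DataFrame, so the result must be assigned back
df = df.reset_index(drop=True)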
# dropped 2220 rows
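The row-removal step can be checked on a small example. This is a sketch on hypothetical toy data, and it differs slightly from the notebook in that it tests only the numeric columns for zeros:

```python
import pandas as pd

# Hypothetical toy data: B and D contain a 0, C has an implausible Max Players.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Min Players": [2, 0, 1, 2],
    "Max Players": [4, 6, 999, 5],
    "Play Time": [30, 45, 60, 0],
})

rows_before = len(toy)

# Keep rows with a plausible player count (drops C).
cleaned = toy[toy["Max Players"] < 100]

# Keep rows where every numeric column is non-zero (drops B and D).
numeric = cleaned.select_dtypes("number")
cleaned = cleaned[(numeric != 0).all(axis=1)]

# Assign the result of reset_index; it does not modify the frame in place.
cleaned = cleaned.reset_index(drop=True)

rows_dropped = rows_before - len(cleaned)
print(rows_dropped)
```

Comparing the row count before and after the filter is a simple way to confirm how many rows a cleaning step removed.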
df['Decade Published'] = df['Year Published'].apply(lambda x: int((x//10)*10))
df['Decade Published']
np.unique(df['Decade Published'])
# convert Year Published to Int
df['Year Published'] = df['Year Published'].astype('Int64', errors='ignore')
domains = pd.DataFrame(df['Domains'].str.split(',', expand=True).rename(columns={
0:'First_Domain'}))
mechanics = pd.DataFrame(df['Mechanics'].str.split(',', expand=True).rename(columns={
0:'First_Mechanic'}))
df = df.join(domains['First_Domain'])
df = df.join(mechanics['First_Mechanic'])
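Extracting the first listed domain can also be done without building a separate expanded frame. The sketch below is an alternative, not the notebook's method, and runs on hypothetical toy data; it also strips surrounding whitespace, which the notebook does not do.

```python
import pandas as pd

# Hypothetical toy data; the third game has no domain listed.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Domains": ["Strategy Games, Thematic Games", "Family Games", None],
})

# n=1 splits only at the first comma; .str[0] keeps the leading entry.
# Missing values stay missing rather than raising an error.
toy["First_Domain"] = toy["Domains"].str.split(",", n=1).str[0].str.strip()
print(toy)
```

Games without a domain keep a missing `First_Domain`, which is why later steps call `dropna()` before pivoting.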
df_2000 = df.loc[df['Year Published'] > 1999]
df_2000 = df_2000.groupby(['Year Published', 'First_Domain'])['Year Published'].count().reset_index(name = "count")
df_2000 = pd.DataFrame(df_2000)
### plot scatterplot of year to domain for board game trend analysis
plt.figure(figsize=(18, 10))
# s = size,
plt.scatter(df_2000['First_Domain'], df_2000['Year Published'],
marker='8',
cmap='viridis',
c=df_2000['count'],
s=df_2000['count']*5,
edgecolors='black')
plt.title('Published 21st Century Board Games by Domain', fontsize = 18)
plt.xlabel('Domains', fontsize = 14)
plt.ylabel('Year Published', fontsize = 14, labelpad=30)
cbar = plt.colorbar()
cbar.set_label('Number of Games Published', rotation = 270, fontsize=15, color='black', labelpad=30)
my_colorbar_ticks = [*range(1, int(df_2000['count'].max()), 7)]
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = [*range(1, int(df_2000['count'].max()), 7)]
cbar.set_ticklabels(my_colorbar_tick_labels)
plt.xticks(fontsize = 12, color = 'darkorchid', rotation=45)
my_y_ticks = [ *range(df_2000['Year Published'].min(), df_2000['Year Published'].max()+1, 1 )]
plt.yticks(my_y_ticks, fontsize = 14, color = 'darkblue')
plt.show()
Insights:
import seaborn as sns
from matplotlib.ticker import FuncFormatter
# create heatmap data based on Owned Users
hm_df = df.groupby(['First_Domain','Min Players'])['Owned Users'].sum().reset_index()
hm_df = hm_df.dropna()
hm_df_pivot = hm_df.pivot(index='First_Domain', columns='Min Players', values='Owned Users')
# create heatmap data based on Average Rating
hm_df2 = df.groupby(['First_Domain','Min Players'])['Rating Average'].mean().reset_index()
hm_df2 = hm_df2.dropna()
hm_df_pivot2 = hm_df2.pivot(index='First_Domain', columns='Min Players', values='Rating Average')
# create heatmap data based on Complexity Rating
hm_df3 = df.groupby(['First_Domain','Min Players'])['Complexity Average'].mean().reset_index()
hm_df3 = hm_df3.dropna()
hm_df_pivot3 = hm_df3.pivot(index='First_Domain', columns='Min Players', values='Complexity Average')
### Create Heatmap based on Owned Games
fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(1,1,1)
comma_fmt = FuncFormatter(lambda x, p: format(int(x), ','))
sns.heatmap(round(hm_df_pivot,2), linewidth=0.2, annot = True, cmap = 'BuPu',
fmt = ',.0f', square = True, annot_kws={'size':11},
cbar_kws={'orientation':'vertical'})
plt.title('Heatmap of the Number of Owned Games \nby Primary Domain and Minimum Players', fontsize=18, pad=15)
plt.xlabel('Minimum Players', fontsize=14, labelpad=10)
plt.ylabel('Game Primary Domain', fontsize=16, labelpad=10)
plt.yticks(rotation=0, size=14)
plt.xticks(size=14)
min_count = int(hm_df['Owned Users'].min())
max_count = int(hm_df['Owned Users'].max())
cbar = ax.collections[0].colorbar
my_colorbar_ticks =[*range(min_count, max_count, 500000)]
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)
cbar.set_label('Number of Owned Games', rotation=270, fontsize=14, color='black', labelpad=20)
plt.show();
We find that games with a minimum of two players are the most popular by ownership, which was an interesting insight, especially considering that the most popular two-player domain is family games.
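The heatmap input is a domain-by-minimum-players grid built with a group-by followed by a pivot. A minimal sketch of that reshaping on hypothetical toy data (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook.
toy = pd.DataFrame({
    "First_Domain": ["Family Games", "Family Games", "Strategy Games", "Strategy Games"],
    "Min Players": [2, 2, 1, 2],
    "Owned Users": [100, 250, 400, 50],
})

# Sum ownership for each (domain, minimum players) pair.
totals = (
    toy.groupby(["First_Domain", "Min Players"])["Owned Users"]
    .sum()
    .reset_index()
)

# Reshape to one row per domain and one column per minimum player count.
grid = totals.pivot(index="First_Domain", columns="Min Players", values="Owned Users")
print(grid)
```

Combinations that never occur in the data, such as a family game with a minimum of one player in this toy sample, appear as missing cells in the grid and are left blank by the heatmap.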
### Create Heatmap based on Average Rating
fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(1,1,1)
round_fmt = FuncFormatter(lambda x, p: format(round(x, 2), '.2f'))
sns.heatmap(round(hm_df_pivot2,2), linewidth=0.2, annot = True, cmap ='RdPu',
fmt = ',.2f', square = True, annot_kws={'size':11},
cbar_kws={'orientation':'vertical'})
plt.title('Heatmap of Average Game Rating (scale of 1-10) \nby Primary Domain and Minimum Players', fontsize=18, pad=15)
plt.xlabel('Minimum Players', fontsize=14, labelpad=10)
plt.ylabel('Game Primary Domain', fontsize=16, labelpad=10)
plt.yticks(rotation=0, size=14)
plt.xticks(size=14)
cbar = ax.collections[0].colorbar
cbar.set_label('Average Game Rating', rotation=270, fontsize=14, color='black', labelpad=20)
plt.show();
We find that strategy games with a minimum of 5 players are the highest rated. Overall, children's games appear to be the lowest-rated domain, while there is no clear pattern based on minimum players.
### Create Heatmap based on Complexity Rating
fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(1,1,1)
round_fmt = FuncFormatter(lambda x, p: format(round(x, 2), '.2f'))
sns.heatmap(round(hm_df_pivot3,2), linewidth=0.2, annot = True, cmap ='YlGnBu',
fmt = ',.2f', square = True, annot_kws={'size':11},
cbar_kws={'orientation':'vertical'})
plt.title('Heatmap of the Average Complexity Rating (scale of 1-5) \nby Primary Domain and Minimum Players', fontsize=18, pad=15)
plt.xlabel('Minimum Players', fontsize=14, labelpad=10)
plt.ylabel('Game Primary Domain', fontsize=16, labelpad=10)
plt.yticks(rotation=0, size=14)
plt.xticks(size=14)
cbar = ax.collections[0].colorbar
cbar.set_label('Average Complexity Rating', rotation=270, fontsize=14, color='black', labelpad=20)
plt.show();
We find that wargames with a minimum of 8 players are the most complex, which makes intuitive sense. Children's games, family games, and party games are the least complex.
# generate subset of data for past century
df_century = df.loc[df['Year Published'] > 1920]
# Group data by year and aggregate the count, rating average mean, and complexity average mean
x = df_century.groupby(['Year Published']).agg({'Year Published': ['count'], 'Rating Average':['mean'],'Complexity Average':['mean']}).reset_index()
x.columns = ['YearPublished', 'Count', 'RatingAverage_mean','ComplexityAverage_mean']
x = x.sort_values('Count', ascending=False).reset_index(drop=True)
def pick_colors_according_to_mean(this_data):
    colors = []
    avg = this_data.mean()
    for each in this_data:
        if each > avg*1.01:
            colors.append('orchid')
        elif each < avg*0.99:
            colors.append('darkorchid')
        else:
            colors.append('navy')
    return colors
# Generate plot for Average Rating and Average Complexity Analysis
# Build legend handles
Above = mpatches.Patch(color='orchid', label = 'Above Average')
At = mpatches.Patch(color='navy', label = 'Within 1% of Average')
Below = mpatches.Patch(color='darkorchid', label = 'Below Average')
# Set figure size and title
fig = plt.figure(figsize =(16,12))
fig.suptitle('Average Rating and Average Complexity Analysis of\n Board Games from Last Century',
fontsize=18, fontweight='bold')
#first figure
ax1 = fig.add_subplot(2,1,1)
ax1.bar(x.YearPublished, x.RatingAverage_mean, label='Rating Average Mean',
color = pick_colors_according_to_mean(x.RatingAverage_mean))
ax1.set_title('Average Rating', size = 20)
plt.axhline(x.RatingAverage_mean.mean(), color='black', linestyle='dotted')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False) # outerbox gridline
ax1.text(2003, x.RatingAverage_mean.mean()-.5, 'Mean = '+ str(round(x.RatingAverage_mean.mean(),2)),
rotation=0, fontsize=14, backgroundcolor='white')
ax1.legend(handles=[Above, At, Below], loc='upper left', fontsize=13)
# second figure
ax2 = fig.add_subplot(2,1,2)
ax2.bar(x.YearPublished, x.ComplexityAverage_mean, label='Complexity Average Mean',
color = pick_colors_according_to_mean(x.ComplexityAverage_mean))
ax2.set_title('Average Complexity', size = 20)
plt.axhline(x.ComplexityAverage_mean.mean(), color='black', linestyle='dotted')
ax2.text(2003, x.ComplexityAverage_mean.mean()-.18, 'Mean = '+ str(round(x.ComplexityAverage_mean.mean(),2)),
rotation=0, fontsize=14, backgroundcolor='white')
ax2.axes.xaxis.set_visible(True)
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False) # outerbox gridline
ax2.legend(handles=[Above, At, Below], loc='upper left', fontsize=13)
ax2.set_ylim([0, x.ComplexityAverage_mean.max()+.6]);
fig.subplots_adjust(hspace = 0.25)
plt.show();
The transition from below-average to above-average values occurs in roughly the same period for both ratings and complexity (around 1970). I wonder if this is related to the rise of tabletop role-playing games, such as D&D, and other similar-style games that became popular in this time frame.
Complexity increased more noticeably in the 70s and 80s than the ratings (again, potentially related to the rise in complex role playing games). The average ratings had a slower growth trend that primarily rose in the 2000s.
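The above-average and below-average coloring used in the bar charts could also be written in vectorized form. This is a sketch assuming the same 1% band around the mean; `colors_relative_to_mean` and its `tolerance` parameter are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def colors_relative_to_mean(values: pd.Series, tolerance: float = 0.01) -> list:
    """Return one color per value: above, below, or within the band around the mean."""
    avg = values.mean()
    conditions = [
        values > avg * (1 + tolerance),   # clearly above the mean
        values < avg * (1 - tolerance),   # clearly below the mean
    ]
    # Anything matching neither condition is within the band and gets the default.
    return np.select(conditions, ["orchid", "darkorchid"], default="navy").tolist()

# Hypothetical sample: the mean is 6.005, so 6.0 and 6.02 fall inside the 1% band.
sample = pd.Series([5.0, 6.0, 7.0, 6.02])
sample_colors = colors_relative_to_mean(sample)
print(sample_colors)
```

Keeping the band as a parameter makes it easy to check how sensitive the "within 1% of average" group is to the chosen width.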
df_2010 = df.loc[(df['Year Published'] > 2009) & (df['Year Published'] < 2021) ]
bump_df_owned = df_2010.groupby(['First_Domain','Year Published'])['Owned Users'].sum().reset_index()
bump_df_owned = bump_df_owned.pivot(index='First_Domain', columns='Year Published', values = 'Owned Users')
bump_df_owned = bump_df_owned.dropna()
bump_df_rating = df_2010.groupby(['First_Domain','Year Published'])['Rating Average'].mean().reset_index()
bump_df_rating = bump_df_rating.pivot(index='First_Domain', columns='Year Published', values = 'Rating Average')
bump_df_rating = bump_df_rating.dropna()
# axis=0 ranks the domains against each other within each year (column)
bump_df_ranked_owned = bump_df_owned.rank(axis=0, ascending=True, method='dense')
bump_df_ranked_owned = bump_df_ranked_owned.T
# axis=0 ranks the domains against each other within each year (column)
bump_df_ranked_rating = bump_df_rating.rank(axis=0, ascending=True, method='dense')
bump_df_ranked_rating = bump_df_ranked_rating.T
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)
bump_df_ranked_rating.plot(kind='line', ax=ax, marker='o', markeredgewidth=1, linewidth=6,
markersize=44, markerfacecolor='white')
num_rows = bump_df_ranked_rating.shape[0]
num_cols = bump_df_ranked_rating.shape[1]
plt.ylabel('Game Primary Domain Ranking', fontsize=18, labelpad=10)
plt.title('Ranking of Average Ratings by \nGame Domain and Year', fontsize=24, pad=10)
plt.xticks(np.arange(2010, 2021, 1), fontsize=14)
plt.yticks(range(1, num_cols+1), fontsize=14)
ax.set_xlabel('Year Published', labelpad = 20, fontsize=18)
ax.set_xlim(2009.5, 2020.5)
handles, labels = ax.get_legend_handles_labels()
handles = [ handles[5], handles[7], handles[6], handles[2], handles[0], handles[3], handles[4], handles[1] ]
labels = [labels[5], labels[7], labels[6], labels[2], labels[0], labels[3], labels[4], labels[1] ]
ax.legend(handles, labels,
bbox_to_anchor=(1.01, 1.01),
fontsize=14,
labelspacing=1,
markerscale=0.4,
borderpad = 1,
handletextpad=0.8)
i = 0
j = 0
for eachcol in bump_df_ranked_rating.columns:
    for eachrow in bump_df_ranked_rating.index: # 11
        this_rank = bump_df_ranked_rating.iloc[i,j]
        ax.text(eachrow, this_rank, str(round(bump_df_rating.iloc[j,i],2)), ha='center', va='center', fontsize=12)
        i += 1
    j+=1
    i=0
plt.tight_layout()
plt.show();
### Second plot
fig2 = plt.figure(figsize=(18,10))
ax2 = fig2.add_subplot(1,1,1)
bump_df_ranked_owned.plot(kind='line', ax=ax2, marker='o', markeredgewidth=1, linewidth=6,
markersize=44, markerfacecolor='w')
num_rows = bump_df_ranked_owned.shape[0]
num_cols = bump_df_ranked_owned.shape[1]
plt.ylabel('Number of Games Owned Ranking', fontsize=18, labelpad=10)
plt.title('Ranking of Games Owned by \nGame Domain and Year', fontsize=24, pad=10)
plt.xticks(np.arange(2010, 2021, 1), fontsize=14)
plt.yticks(range(1, num_cols+1), fontsize=14)
ax2.set_xlabel('Year Published', labelpad = 20, fontsize=18)
ax2.set_xlim(2009.5, 2020.5)
handles, labels = ax2.get_legend_handles_labels()
handles = [ handles[3], handles[5], handles[6], handles[7], handles[4], handles[1], handles[0], handles[2] ]
labels = [labels[3], labels[5], labels[6], labels[7], labels[4], labels[1], labels[0], labels[2] ]
ax2.legend(handles, labels,
bbox_to_anchor=(1.01, 1.01),
fontsize=14,
labelspacing=1,
markerscale=0.4,
borderpad = 1,
handletextpad=0.8)
i = 0
j = 0
for eachcol in bump_df_ranked_owned.columns:
    for eachrow in bump_df_ranked_owned.index: # 11
        this_rank = bump_df_ranked_owned.iloc[i,j]
        # print(bump_df_owned.iloc[j,i])
        ax2.text(eachrow, this_rank, str(round(bump_df_owned.iloc[j,i]/1e3,1)) +'K', ha='center', va='center', fontsize=12)
        i += 1
    j+=1
    i=0
plt.tight_layout()
plt.show();
Insights:
This demonstrates that within this dataset, the popularity of a game is not directly reflected in the rating of a game.
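The bump charts plot ranks rather than raw values. A minimal sketch of the ranking step on hypothetical toy ratings (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical toy ratings: one row per domain, one column per year.
ratings = pd.DataFrame(
    {2010: [7.1, 6.4, 6.9], 2011: [6.8, 6.5, 7.2]},
    index=["Strategy Games", "Family Games", "Wargames"],
)

# axis=0 ranks the domains against each other within each year.
# ascending=True gives the highest value the largest rank number.
ranks = ratings.rank(axis=0, ascending=True, method="dense")

# Transpose so that years form the index (x-axis) and each domain is one line.
bump_input = ranks.T
print(bump_input)
```

With `ascending=True`, the best-rated domain in a given year sits at the top of the chart, and `method="dense"` keeps ranks consecutive when two domains tie.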
The pie chart grid below adapts code from the following source: https://sharkcoder.com/data-visualization/mpl-pie-charts
df_100 = df.loc[(df['Year Published'] > 1959) & (df['Year Published'] < 2020) ]
pie_df = df_100.groupby(['Decade Published','First_Domain'])['Name'].size().reset_index() # 'Year Published'
pie_df.sort_values(by=['Decade Published'], inplace=True)
pie_df.reset_index(inplace = True, drop=True)
df_size_pivot = pd.pivot_table(pie_df, index='Decade Published', columns='First_Domain', values='Name')
df_size_pivot = df_size_pivot.fillna(0.1)
### generate subset of data since 1960 (examine interesting growth trends over 1970-1990s)
number_outside_colors = df_size_pivot.shape[1]
outside_color_ref_number = np.arange(number_outside_colors)
colormap = plt.get_cmap('Set3')
outer_colors = colormap(outside_color_ref_number)
labels = df_size_pivot.columns.values.tolist()
color_dict = dict(zip(labels, outer_colors))
fig, axes = plt.subplots(3, 2, figsize=(20,20))
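# remove only the grid cells that will not hold a pie chart
for unused_ax in axes.flat[len(df_size_pivot):]:
    fig.delaxes(unused_ax)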
# leveraged code that helped break out into a grid: https://sharkcoder.com/data-visualization/mpl-pie-charts
for i, (idx, row) in enumerate(df_size_pivot.iterrows()):
    ax = axes[i // 2, i % 2]
    all_published = int(df_size_pivot.loc[idx].sum())
    # keep only domains that account for more than 1% of the decade's games
    row = row[row.gt(row.sum() * .01)]
    ax.pie(row,
           labels=[x for x in row.index],
           startangle=90,
           pctdistance = .85,
           labeldistance = 1.1,
           wedgeprops = dict(width=.5),
           autopct='%1.0f%%',
           colors=[color_dict[x] for x in row.index],
           textprops={'fontsize':14})
    ax.set_title(idx, fontsize=18, color='black')
    ax.text(0,0, 'Total Games:\n' + str(round(all_published)), ha = 'center', va='center', fontsize=14)
    # store the legend in its own variable instead of overwriting ax1.legend
    legend = plt.legend([x for x in row.index],
                        bbox_to_anchor=(2.2, 0.8),
                        #bbox_to_anchor=(2.5, .50),#Legend position
                        loc='best',
                        ncol=1,
                        mode='expand',
                        fontsize = 20,
                        fancybox=True)
fig.subplots_adjust(wspace=.2) # Space between charts
plt.tight_layout() #good autofixer
title = fig.suptitle('Breakdown of Primary Domain for \nNewly Published Games by Decade', y=.95, fontsize=24, color='black')
# To prevent the title from being cropped
plt.subplots_adjust(top=0.85, bottom=0.15, wspace=0.1)
plt.show();
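Each pie chart hides domains that account for 1% or less of a decade's games. A minimal sketch of that filter on hypothetical counts (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical counts of games published in one decade, by primary domain.
counts = pd.Series({
    "Wargames": 480,
    "Strategy Games": 300,
    "Family Games": 215,
    "Party Games": 5,
})

# Keep only domains that make up more than 1% of the decade's total.
threshold = counts.sum() * 0.01
visible = counts[counts.gt(threshold)]
print(visible)
```

Since the pie normalizes whatever values it receives, the displayed percentages are shares of the visible domains only, while the "Total Games" label in the center still reflects the whole decade.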
Insights:
We saw enormous growth in newly published wargames in the 1970s, 1980s, and 1990s. Games in this category include Quebec 1759, Blitzkrieg, Empires in Arms, Axis and Allies, and Warhammer. I found it especially interesting that the significant growth in wargames aligned so closely with the end of conflict in the 1970s. This period follows multiple decades of conflict, i.e., World War II in the 40s, the Korean War in the 50s, and Vietnam in the 60s/70s. Further, the Desert Storm, Somalia, Bosnia, and Kosovo armed conflicts began in the early 90s, which aligns with a decline in wargames.
Based on the findings within this dataset, the pattern of decline and growth in wargames seems to inversely follow the US involvement in armed conflict.
This was an interesting dataset to analyze and produced multiple insights, some expected and some unexpected or surprising.
Expected:
Unexpected:
Most Surprising: