For this term project I used the Board Games dataset from the Tidy Tuesday GitHub repository. In the next part of the paper I explore the data with various plots made with ggplot2; the code for each plot is displayed alongside it.
In this section I prepare the data for plotting.
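# packages used throughout the paper
library(ggbiplot) # loaded first so that the dplyr verbs attached below are not masked
library(data.table)
library(tidyr)
library(dplyr)
library(ggplot2)
library(scales) # comma()
library(gganimate)
# merging ratings and details
# (both tables are assumed to be already loaded from the Tidy Tuesday CSV files)
data <- merge(ratings, details, by = 'id')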
# removing all columns but the ones in keep_cols
# (named keep_cols rather than list, to avoid shadowing base::list)
keep_cols <- c('id', 'name', 'year', 'rank', 'average', 'users_rated', 'minplayers',
'maxplayers', 'playingtime', 'minage', 'boardgamecategory')
data <- data.table(data[ , colnames(data) %in% keep_cols])
# filtering
data <- data[playingtime < 750] # removing games with 12.5 hrs (750 minutes) or more of playing time
data <- data[year != 0] # removing games with a release year of 0
data <- data[rank < 1000] # keeping only games ranked better than 1000 (ranks 1 to 999)
data <- data[year < 2022 & year > 2000] # keeping games released from 2001 through 2021
data <- data[minage > 3] # removing games with a minimum age of 3 years or below
# selecting a main category for the top 20 games (for later analysis)
data2 <- data
data2 <- data2[rank < 21]
data2 <- data2[, c(2,11)]
data2$category <- c("Wargame", "Economic", "Territory Building",
"Territory Building", "Territory Building", "Economic",
"Environmental", "Environmental", "Adventure",
"Territory Building", "Economic", "Economic",
"Adventure", "Economic", "Wargame", "Economic",
"Territory Building", "Economic", "Wargame", "Adventure")
# cleaning the category column in the merged data table
data$boardgamecategory <- gsub("\\[", "", data$boardgamecategory)
data$boardgamecategory <- gsub("\\]", "", data$boardgamecategory)
# keeping only the first word of the first listed category
# (a multi-word category is truncated to its first word)
data <- separate(data, boardgamecategory, into = c('boardgamecategory', 'garbage'),
sep = ' ', extra = 'drop', fill = 'right')
data <- data %>% select(-garbage)
# removing the leftover quotes and commas
data$boardgamecategory <- gsub("[',]", "", data$boardgamecategory)
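# pooling average scores into bins labelled with the nearest 0.5 value
# (with right = FALSE the game with the maximum average falls outside the last bin and becomes NA,
# which is why the plots using average_2 filter out NA values)
data <- data %>% mutate(
average_2 = cut(average, c(min(average),6.75,7.25,7.75,8.25,8.75,max(average)),
labels=c(6.5,7,7.5,8,8.5,9), right = FALSE) )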
# creating separate table for the top 20 games & merging it with the main category names selected above
data1 <- data[rank < 21]
data1 <- merge(data1, data2, by = 'name')
data1 <- data1[,-c(11,12)]
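The category cleaning above splits the string on the first space, so only the first word of the first category survives. As a hedged alternative (not the method used in this paper), the whole first category could be captured with a single regular expression. The sketch below uses made-up example strings in the bracketed format that the cleaning code implies; it assumes categories are wrapped in single quotes, so an entry wrapped in double quotes would need extra handling.

```r
# Hypothetical example strings in the raw bracketed-list format (not taken from the dataset)
raw <- c("['Economic', 'Fantasy']", "['Wargame']", "['Science Fiction', 'Adventure']")

# Drop the square brackets first
no_brackets <- gsub("\\[|\\]", "", raw)

# Capture everything between the first pair of single quotes
first_category <- sub("^'([^']*)'.*$", "\\1", no_brackets)

print(first_category)
```

With this approach a multi-word first category is kept whole instead of being cut at the space.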
The number of observations after preparing the data is 919. Unless indicated otherwise, this is the number of observations I used for the following plots.
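The pooling step in the data preparation relies on `cut()` with left-closed intervals, which is why the plots that use `average_2` filter out missing values. The following minimal sketch, on made-up scores rather than the real data, shows how the maximum value ends up as `NA` and how `include.lowest = TRUE` would keep it. It is only an illustration of the behaviour of `cut()`, not a change to the analysis.

```r
# Hypothetical average scores, not taken from the dataset
avg <- c(6.6, 7.0, 7.3, 8.1, 8.8, 9.0)
breaks <- c(min(avg), 6.75, 7.25, 7.75, 8.25, 8.75, max(avg))
bin_labels <- c(6.5, 7, 7.5, 8, 8.5, 9)

# Left-closed intervals [a, b): the maximum value is excluded from the last bin
pooled <- cut(avg, breaks, labels = bin_labels, right = FALSE)

# include.lowest = TRUE also closes the last interval on the right
pooled_all <- cut(avg, breaks, labels = bin_labels, right = FALSE, include.lowest = TRUE)

print(data.frame(avg, pooled, pooled_all))
```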
ggplot(subset(data, !is.na(average_2)), aes(users_rated, rank)) +
geom_point(aes(color = average_2)) +
geom_smooth(method = 'loess') +
scale_y_continuous(breaks=seq(from = 250, to = 1000, by = 250),
limits=c(0, 1000)) +
labs(title = 'Number of ratings and ranking', subtitle = 'Scatterplot' ,
x = "Number of user ratings", y="Rank" , color = 'Average score') +
scale_color_viridis_d() +
theme_minimal() +
theme(legend.position = "top")
## `geom_smooth()` using formula 'y ~ x'
The plot above shows that the number of user ratings a game receives is correlated with its rank, although the highest-rated games (yellow on the chart) do not have the most ratings.
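The relationship read off the scatterplot could also be quantified with a rank correlation. The sketch below uses a handful of made-up values (not the real data) and Spearman's coefficient, which only assumes a monotonic relationship. Because a smaller rank number means a better rank, a negative coefficient would indicate that games with more ratings tend to rank better.

```r
# Hypothetical values for six games, not taken from the dataset
users_rated <- c(45000, 90000, 60000, 20000, 8000, 12000)
rank <- c(5, 40, 120, 300, 650, 900)

# Spearman's rank correlation between the number of ratings and the rank
rho <- cor(users_rated, rank, method = "spearman")

print(rho)
```

On the prepared table the same call would be `cor(data$users_rated, data$rank, method = "spearman")`.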
ggplot(data, aes(factor(minage), playingtime)) +
geom_boxplot(fill = '#FFAEBC', color = 'green', alpha = 0.8) +
labs(title = 'Playing time per minimum age', subtitle = 'Boxplot' ,
x = "Minimum required age", y="Playing time (minutes)" ) +
theme_minimal()
The boxplot shows that games with a higher minimum age tend to have longer playing times. Games for ages 12 and above have higher typical playing times, and the games with outlier playing times also belong to these groups.
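The centre line of each box is the group median, so the comparison above can be checked numerically by computing the median playing time per minimum age. The sketch below does this on made-up values (not the real data) with base R only.

```r
# Hypothetical playing times in minutes for three age groups, not taken from the dataset
toy <- data.frame(
  minage      = c(8, 8, 8, 12, 12, 12, 14, 14, 14),
  playingtime = c(30, 45, 60, 60, 90, 120, 90, 120, 240)
)

# Median playing time for each minimum age (the value drawn as the line inside each box)
medians <- sapply(split(toy$playingtime, toy$minage), median)

print(medians)
```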
ggplot(data, aes(average, users_rated)) +
geom_hex(bins = 10) +
scale_y_continuous(breaks=seq(from = 5000, to = 85000, by = 10000), limits=c(0, 90000)) +
labs(title = 'Average score per number of ratings', subtitle = 'Hexagon-map' ,
x = "Average rating score", y="Number of ratings" , fill = 'Number of games') +
scale_fill_gradientn(colors = cm.colors(8)) +
theme_minimal()
The plot above shows the average rating of a game against its number of ratings. The colors indicate the number of games in each bin. The bin size was selected so that the largest bins contain around 100 games. The games in the most populated bin have around 10,000 ratings and an average rating of about 7.25.
ggplot(data, aes(year, minplayers)) +
geom_tile(aes(fill = average)) +
labs(title = 'Minimum required players per year', subtitle = 'Heatmap' ,
x = "", y="Minimum required players" , fill = 'Average score') +
scale_fill_gradientn(colours = heat.colors(7, rev = TRUE)) +
theme_minimal() +
theme(legend.position = "bottom")
The heatmap above shows the minimum required number of players per release year. The colors indicate the average score. For example, according to the plot, the games released in 2021 required three or fewer players and had an average score of 8.5 or higher.
The following plot is the one animated with gganimate.
ggplot(subset(data, !is.na(average_2)), aes(average_2, users_rated)) +
geom_col(fill = '#FFAEBC') +
labs(title = 'Number of ratings per score', subtitle = '{closest_state}' ,
x = "Average user rating", y="Number of ratings" ) +
scale_y_continuous(labels = comma) +
theme_minimal() +
transition_states(year,
transition_length = 2,
state_length = 2) +
enter_fade() +
exit_shrink()
The bar chart above shows the total number of ratings per average user rating, with one animation state per release year. The average ratings are rounded to the nearest 0.5.
In this part I perform a principal component analysis (PCA) on the top 20 games. First, let’s take a look at a bar chart of these games.
ggplot(data1, aes(year)) +
geom_bar(aes(fill = category)) +
labs(title = 'Top 20 ranked boardgames', subtitle = 'Release year and category' ,
x = "Release year", y="" , fill = 'Main category') +
scale_y_continuous(breaks = NULL) +
theme_minimal()
This chart shows that most of these games were released between 2015 and 2018. The most common main category is ‘Economic’.
Now let’s take a look at the PCA.
# selecting the numeric columns for prcomp
data1.pca <- prcomp(data1[, c(3:10)], center = TRUE,scale. = TRUE)
# plotting with ggbiplot, labels set to game names and groups set to the main categories selected in the data prep part
ggbiplot(data1.pca,ellipse=TRUE, labels=data1$name, groups=data1$category) +
theme_void()
The PCA plot shows that the games in different categories overlap. In other words, games in the same category do not necessarily have similar characteristics, for example playing time or rank. That being said, there are games that are very similar and are in the same category, for example ‘Pandemic Legacy: Season 1’ and ‘Spirit Island’, or ‘Through the Ages: A New Story of Civilization’ and ‘Great Western Trail’. On the other hand, ‘Twilight Imperium: Fourth Edition’ and ‘Twilight Struggle’ are quite different despite belonging to the same category: their minimum required players are similar, but their ranks and playing times differ.
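To back up a reading of the biplot with numbers, one can inspect how much variance each principal component explains and which variables load on it. The sketch below runs `prcomp()` on a small made-up table (the values are hypothetical, not the top 20 games) with the same `center` and `scale.` settings as above.

```r
# Hypothetical values for five games, not taken from the dataset
toy <- data.frame(
  rank        = c(1, 2, 3, 4, 5),
  playingtime = c(120, 60, 240, 90, 180),
  minplayers  = c(1, 2, 3, 2, 1),
  minage      = c(14, 13, 14, 12, 14)
)

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of the total variance explained by each principal component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(var_explained, 2))

# Loadings of the original variables on the first two components
print(round(toy_pca$rotation[, 1:2], 2))
```

Applied to `data1.pca`, the same two steps would show how much of the variation the two plotted axes capture and which variables, such as rank or playing time, separate the games along them.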