From the GitHub Repo:
Board Games Database This week’s data comes from the Board Game Geek database. The site’s database has more than 90,000 games, with crowd-sourced ratings. There is also an R package with the fulldataset (bggAnalysis) but it hasn’t been updated in ~2 years.
To follow along with a fivethirtyeight article, I limited to only games with at least 50 ratings and for games between 1950 and 2016. This still leaves us with 10,532 games!
https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12.
library(tidyverse)
library(tidytext)
library(caret)
library(randomForest)
library(cowplot)
games<-read_csv("board_games.csv") %>%
mutate(category = gsub(" \\/ ", "", category), category=gsub("\\/", "", category))
glimpse(games)
## Observations: 10,532
## Variables: 22
## $ game_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ description <chr> "Die Macher is a game about seven sequential po...
## $ image <chr> "//cf.geekdo-images.com/images/pic159509.jpg", ...
## $ max_players <int> 5, 4, 4, 4, 6, 6, 2, 5, 4, 6, 7, 5, 4, 4, 6, 4,...
## $ max_playtime <int> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, ...
## $ min_age <int> 14, 12, 10, 12, 12, 12, 8, 12, 13, 10, 13, 12, ...
## $ min_players <int> 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3,...
## $ min_playtime <int> 240, 30, 30, 60, 90, 240, 20, 120, 90, 60, 45, ...
## $ name <chr> "Die Macher", "Dragonmaster", "Samurai", "Tal d...
## $ playing_time <int> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, ...
## $ thumbnail <chr> "//cf.geekdo-images.com/images/pic159509_t.jpg"...
## $ year_published <int> 1986, 1981, 1998, 1992, 1964, 1989, 1978, 1993,...
## $ artist <chr> "Marcus Gschwendtner", "Bob Pepper", "Franz Voh...
## $ category <chr> "Economic,Negotiation,Political", "Card Game,Fa...
## $ compilation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ designer <chr> "Karl-Heinz Schmiel", "G. W. \"Jerry\" D'Arcey"...
## $ expansion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Elfengold,...
## $ family <chr> "Country: Germany,Valley Games Classic Line", "...
## $ mechanic <chr> "Area Control / Area Influence,Auction/Bidding,...
## $ publisher <chr> "Hans im Gl\xfcck Verlags-GmbH,Moskito Spiele,V...
## $ average_rating <dbl> 7.66508, 6.60815, 7.44119, 6.60675, 7.35830, 6....
## $ users_rated <int> 4498, 478, 12019, 314, 15195, 73, 2751, 186, 12...
The thing I enjoy most about the weekly #TidyTuesday challenge is seeing how different people, with different interests and skill levels attack the data. This data set offers so many different areas of analysis, from popularity by category, how playing time has changed over time, to how prolific specific designers are. I encourage you to check the #TidyTuesday hashtag on Twitter weekly.
This week’s data included a lot of text, both in the mechanics field and the description field as well as plenty of numeric data. I wanted to see how accurately we can categorize a game first with just numeric data, then adding in more and more text analysis.
The category field is untidy. Many games are assigned to more than one category, and the collection of categories is contained within a single record
head(games$category)
## [1] "Economic,Negotiation,Political" "Card Game,Fantasy"
## [3] "Abstract Strategy,Medieval" "Ancient"
## [5] "Economic" "Civilization,Nautical"
In fact, there are 3861 unique combinations of categories. For simplicity’s sake, I’m going to tease out games that are assigned to only one category.
First, I’ll split the category field to see how many categories are assigned to each game. I’ll add the column cnt to hold that info…
games$cnt<-unlist(lapply(str_split(games$category, ","), length))
…and filter to select only those games assigned a single category:
single_cat<-games %>%
filter(cnt==1)
glimpse(unique(single_cat$category))
## chr [1:72] "Ancient" "Economic" "Abstract Strategy" "Exploration" ...
This leaves us with 2007 games and 72 unique categories. Again, for simplicity’s sake, I’ll find the top ten categories by number of games and create the games_df dataframe for use going forward.
top_cat<-single_cat %>%
filter(!is.na(category)) %>%
group_by(category) %>%
tally() %>%
top_n(8, n) %>%
arrange(desc(n))
top_cat
## # A tibble: 8 x 2
## category n
## <chr> <int>
## 1 Card Game 438
## 2 Abstract Strategy 284
## 3 Economic 116
## 4 Dice 107
## 5 Party Game 77
## 6 ActionDexterity 55
## 7 Fantasy 54
## 8 Medieval 51
games_df<-games %>% filter(category %in% top_cat$category)
We now have 1182 games to work with.
How accurately can variables such as max_players, max_playtime, average_rating, etc. assign a game to a category? First let’s create a dataframe that incudes our target category and numeric data
games_num_df<-games_df %>%
select_if(is.numeric) %>%
left_join(games_df %>% select(game_id, category)) %>%
select(-cnt) %>%
mutate_if(is.integer, as.numeric) %>%
mutate_if(is.character, as.factor) %>%
filter(complete.cases(.)) %>%
select(category, 2:10) %>%
magrittr::set_colnames(paste0(colnames(.), "_c",""))
games_num_df$category_c<-factor(games_num_df$category_c)
num_rf<-randomForest(category_c ~.,data=games_num_df,ntree=200)
knitr::kable(broom::tidy(num_rf$confusion))
| .rownames | Abstract.Strategy | ActionDexterity | Card.Game | Dice | Economic | Fantasy | Medieval | Party.Game | class.error |
|---|---|---|---|---|---|---|---|---|---|
| Abstract Strategy | 204 | 7 | 58 | 6 | 7 | 1 | 1 | 0 | 0.2816901 |
| ActionDexterity | 23 | 1 | 28 | 3 | 0 | 0 | 0 | 0 | 0.9818182 |
| Card Game | 52 | 4 | 319 | 18 | 23 | 3 | 5 | 14 | 0.2716895 |
| Dice | 23 | 2 | 57 | 17 | 2 | 2 | 2 | 2 | 0.8411215 |
| Economic | 8 | 0 | 38 | 0 | 59 | 3 | 8 | 0 | 0.4913793 |
| Fantasy | 9 | 0 | 25 | 1 | 7 | 3 | 8 | 1 | 0.9444444 |
| Medieval | 4 | 1 | 16 | 1 | 16 | 7 | 5 | 1 | 0.9019608 |
| Party Game | 4 | 0 | 32 | 0 | 3 | 0 | 1 | 37 | 0.5194805 |
Not so great: the accuracy measures 54.6%
The numeric data didn’t help much, let’s see if the mechanics field can improve accuracy. Mechanics is how the game is played, e.g., roll a dice, draw a card, etc.
Since we’re using text, it’s time to tokenize and analyze with Tidytext.
#select the variables
mech_df<-games_df %>% select(game_id, category, mechanic)
#tokenize the mechanics variable
mech_tokens<-mech_df%>%
unnest_tokens(output=word, input=mechanic) %>%
anti_join(tidytext::stop_words) %>%
mutate(word=SnowballC::wordStem(words = `word`)) %>%
mutate(word=tm::removeNumbers(word)) %>%
mutate(word=tm::removePunctuation(word))
Our corpus includes 82 words.
We’ll turn the tokenized list into a document term matrix to find out the term frequency and the inverse document term frequency. Term frequency tells us which terms appear most often, but that can be less useful than finding out which terms are frequent in ONE category, but not overall.
mech_dtm<-mech_tokens %>%
count(game_id, word) %>%
filter(word !="") %>%
tidytext::cast_dtm(document=game_id, term=word, value=n) %>%
tm::removeSparseTerms(sparse=.99)
mech_tfidf<-mech_tokens %>%
count(category, word) %>%
bind_tf_idf(term=word, document=category, n=n)
mech_terms<-mech_tfidf %>%
group_by(category) %>%
top_n(10,tf_idf) %>%
arrange(category, tf_idf) %>%
ungroup()
ggplot(mech_terms, aes(x=fct_reorder(word, tf_idf), y=tf_idf, group=category))+
geom_col()+
labs(y="Term Frequency - Inverse Document Frequency", x="")+
coord_flip()+
facet_wrap(~category, scales="free_y")
Because some of the tokens are also function words, we’ll add a suffix “_c" to each one in the mech_dtm matrix so that it doesn’t throw an error.
mech_fit_df<-as.data.frame(as.matrix(mech_dtm)) %>%
mutate(game_id=as.numeric(row.names(.))) %>%
left_join(games_df %>% select(category, game_id)) %>%
select(-NA, -game_id)
colnames(mech_fit_df)<-paste0(colnames(mech_fit_df), "_c", "")
#make sure the category field is a factor
mech_fit_df$category_c<-factor(mech_fit_df$category_c)
mechfit<-randomForest(category_c~., mech_fit_df, ntree=500)
mech_accuracy<-scales::percent(sum(diag(mechfit$confusion))/nrow(mech_fit_df))
mip<-as.data.frame(varImp(mechfit)) %>%
mutate(pred_var=gsub("_c", "", row.names(.))) %>%
filter(Overall>15) %>%
arrange(Overall) %>%
mutate(pred_var=fct_inorder(pred_var))
The accuracy of a model using the tokens from the mechanic field are somewhat better: 66.8%. Card Game, Dice and Abstract Strategy were the most accurately classified.
| class.error | |
|---|---|
| Card Game | 0.0988506 |
| Dice | 0.1308411 |
| Abstract Strategy | 0.2535211 |
| Economic | 0.4568966 |
| Party Game | 0.7012987 |
| Medieval | 0.9019608 |
| ActionDexterity | 1.0000000 |
| Fantasy | 1.0000000 |
The top variables by Mean Decrease Gini:
Description FieldI’m supressing the code here because it is essentially the same as for mechanic
Our new corpus contains Our corpus includes 7580 unique words.
The accuracy of a model using the tokens from the description field are somewhat better: 75.6%. Card Game, Dice and Abstract Strategy were still the most accurately classified.
| class.error | |
|---|---|
| Card Game | 0.0616438 |
| Dice | 0.0841121 |
| Abstract Strategy | 0.1021127 |
| Economic | 0.4137931 |
| Party Game | 0.4805195 |
| ActionDexterity | 0.7636364 |
| Medieval | 0.8823529 |
| Fantasy | 0.9444444 |
The top description variables by Mean Decrease Gini:
Can we increase further by pulling all the variables, both text and numeric?
The accuracy of a model using the tokens from the description and mechanic field added to the numeric variables boosts our accuracy to: 79.2%. Card Game, Dice and Abstract Strategy were still the most accurately classified though we did gain a bit on the category “Economic.”
| class.error | |
|---|---|
| Card Game | 0.0321839 |
| Dice | 0.0654206 |
| Abstract Strategy | 0.0739437 |
| Economic | 0.2672414 |
| Party Game | 0.4025974 |
| ActionDexterity | 0.7636364 |
| Medieval | 0.9019608 |
| Fantasy | 0.9814815 |
The top variables by Mean Decrease Gini:
Note: the suffix “_mech" indicates that the word was extracted from the mechanic field, not description
The final confusion matrix:
| Abstract Strategy | ActionDexterity | Card Game | Dice | Economic | Fantasy | Medieval | Party Game | class.error | |
|---|---|---|---|---|---|---|---|---|---|
| Abstract Strategy | 263 | 0 | 16 | 4 | 1 | 0 | 0 | 0 | 0.0739437 |
| ActionDexterity | 20 | 13 | 19 | 3 | 0 | 0 | 0 | 0 | 0.7636364 |
| Card Game | 5 | 0 | 421 | 3 | 4 | 0 | 0 | 2 | 0.0321839 |
| Dice | 4 | 0 | 3 | 100 | 0 | 0 | 0 | 0 | 0.0654206 |
| Economic | 6 | 0 | 20 | 5 | 85 | 0 | 0 | 0 | 0.2672414 |
| Fantasy | 7 | 0 | 30 | 7 | 8 | 1 | 1 | 0 | 0.9814815 |
| Medieval | 12 | 0 | 18 | 2 | 14 | 0 | 5 | 0 | 0.9019608 |
| Party Game | 3 | 0 | 25 | 0 | 3 | 0 | 0 | 46 | 0.4025974 |