Tidy Tuesday 2019 Week 11

From the GitHub Repo:

Board Games Database This week’s data comes from the Board Game Geek database. The site’s database has more than 90,000 games, with crowd-sourced ratings. There is also an R package with the fulldataset (bggAnalysis) but it hasn’t been updated in ~2 years.

To follow along with a fivethirtyeight article, I limited to only games with at least 50 ratings and for games between 1950 and 2016. This still leaves us with 10,532 games!

https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12.

library(tidyverse)
library(tidytext)
library(caret)
library(randomForest)
library(cowplot)
games<-read_csv("board_games.csv") %>% 
  mutate(category = gsub(" \\/ ", "", category), category=gsub("\\/", "", category))
glimpse(games)
## Observations: 10,532
## Variables: 22
## $ game_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ description    <chr> "Die Macher is a game about seven sequential po...
## $ image          <chr> "//cf.geekdo-images.com/images/pic159509.jpg", ...
## $ max_players    <int> 5, 4, 4, 4, 6, 6, 2, 5, 4, 6, 7, 5, 4, 4, 6, 4,...
## $ max_playtime   <int> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, ...
## $ min_age        <int> 14, 12, 10, 12, 12, 12, 8, 12, 13, 10, 13, 12, ...
## $ min_players    <int> 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3,...
## $ min_playtime   <int> 240, 30, 30, 60, 90, 240, 20, 120, 90, 60, 45, ...
## $ name           <chr> "Die Macher", "Dragonmaster", "Samurai", "Tal d...
## $ playing_time   <int> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, ...
## $ thumbnail      <chr> "//cf.geekdo-images.com/images/pic159509_t.jpg"...
## $ year_published <int> 1986, 1981, 1998, 1992, 1964, 1989, 1978, 1993,...
## $ artist         <chr> "Marcus Gschwendtner", "Bob Pepper", "Franz Voh...
## $ category       <chr> "Economic,Negotiation,Political", "Card Game,Fa...
## $ compilation    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ designer       <chr> "Karl-Heinz Schmiel", "G. W. \"Jerry\" D'Arcey"...
## $ expansion      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Elfengold,...
## $ family         <chr> "Country: Germany,Valley Games Classic Line", "...
## $ mechanic       <chr> "Area Control / Area Influence,Auction/Bidding,...
## $ publisher      <chr> "Hans im Gl\xfcck Verlags-GmbH,Moskito Spiele,V...
## $ average_rating <dbl> 7.66508, 6.60815, 7.44119, 6.60675, 7.35830, 6....
## $ users_rated    <int> 4498, 478, 12019, 314, 15195, 73, 2751, 186, 12...

TidyTuesday

The thing I enjoy most about the weekly #TidyTuesday challenge is seeing how different people, with different interests and skill levels attack the data. This data set offers so many different areas of analysis, from popularity by category, how playing time has changed over time, to how prolific specific designers are. I encourage you to check the #TidyTuesday hashtag on Twitter weekly.

This week’s data included a lot of text, both in the mechanics field and the description field as well as plenty of numeric data. I wanted to see how accurately we can categorize a game first with just numeric data, then adding in more and more text analysis.

The Challenge

The category field is untidy. Many games are assigned to more than one category, and the collection of categories is contained within a single record

head(games$category)
## [1] "Economic,Negotiation,Political" "Card Game,Fantasy"             
## [3] "Abstract Strategy,Medieval"     "Ancient"                       
## [5] "Economic"                       "Civilization,Nautical"

In fact, there are 3861 unique combinations of categories. For simplicity’s sake, I’m going to tease out games that are assigned to only one category.

First, I’ll split the category field to see how many categories are assigned to each game. I’ll add the column cnt to hold that info…

games$cnt<-unlist(lapply(str_split(games$category, ","), length))

…and filter to select only those games assigned a single category:

single_cat<-games %>% 
  filter(cnt==1)
glimpse(unique(single_cat$category))
##  chr [1:72] "Ancient" "Economic" "Abstract Strategy" "Exploration" ...

This leaves us with 2007 games and 72 unique categories. Again, for simplicity’s sake, I’ll find the top ten categories by number of games and create the games_df dataframe for use going forward.

top_cat<-single_cat %>% 
  filter(!is.na(category)) %>% 
  group_by(category) %>% 
  tally() %>% 
  top_n(8, n) %>% 
  arrange(desc(n))

top_cat
## # A tibble: 8 x 2
##   category              n
##   <chr>             <int>
## 1 Card Game           438
## 2 Abstract Strategy   284
## 3 Economic            116
## 4 Dice                107
## 5 Party Game           77
## 6 ActionDexterity      55
## 7 Fantasy              54
## 8 Medieval             51
games_df<-games %>% filter(category %in% top_cat$category)

We now have 1182 games to work with.

Classification By Numeric Data Only

How accurately can variables such as max_players, max_playtime, average_rating, etc. assign a game to a category? First let’s create a dataframe that incudes our target category and numeric data

games_num_df<-games_df %>% 
  select_if(is.numeric) %>% 
  left_join(games_df %>% select(game_id, category)) %>% 
  select(-cnt) %>% 
  mutate_if(is.integer, as.numeric) %>% 
  mutate_if(is.character, as.factor) %>% 
  filter(complete.cases(.)) %>% 
  select(category, 2:10) %>% 
  magrittr::set_colnames(paste0(colnames(.), "_c",""))

games_num_df$category_c<-factor(games_num_df$category_c)

num_rf<-randomForest(category_c ~.,data=games_num_df,ntree=200)
knitr::kable(broom::tidy(num_rf$confusion))
.rownames Abstract.Strategy ActionDexterity Card.Game Dice Economic Fantasy Medieval Party.Game class.error
Abstract Strategy 204 7 58 6 7 1 1 0 0.2816901
ActionDexterity 23 1 28 3 0 0 0 0 0.9818182
Card Game 52 4 319 18 23 3 5 14 0.2716895
Dice 23 2 57 17 2 2 2 2 0.8411215
Economic 8 0 38 0 59 3 8 0 0.4913793
Fantasy 9 0 25 1 7 3 8 1 0.9444444
Medieval 4 1 16 1 16 7 5 1 0.9019608
Party Game 4 0 32 0 3 0 1 37 0.5194805

Not so great: the accuracy measures 54.6%

Text/Tokens as Predictors

The numeric data didn’t help much, let’s see if the mechanics field can improve accuracy. Mechanics is how the game is played, e.g., roll a dice, draw a card, etc.

Since we’re using text, it’s time to tokenize and analyze with Tidytext.

#select the variables
mech_df<-games_df %>% select(game_id, category, mechanic)

#tokenize the mechanics variable
mech_tokens<-mech_df%>% 
  unnest_tokens(output=word, input=mechanic) %>%
  anti_join(tidytext::stop_words) %>%
  mutate(word=SnowballC::wordStem(words = `word`)) %>%
  mutate(word=tm::removeNumbers(word)) %>%
  mutate(word=tm::removePunctuation(word))

Our corpus includes 82 words.

We’ll turn the tokenized list into a document term matrix to find out the term frequency and the inverse document term frequency. Term frequency tells us which terms appear most often, but that can be less useful than finding out which terms are frequent in ONE category, but not overall.

mech_dtm<-mech_tokens %>%
  count(game_id, word) %>%
  filter(word !="") %>%
  tidytext::cast_dtm(document=game_id, term=word, value=n) %>%
  tm::removeSparseTerms(sparse=.99)

mech_tfidf<-mech_tokens %>%
  count(category, word) %>%
  bind_tf_idf(term=word, document=category, n=n)

mech_terms<-mech_tfidf %>% 
  group_by(category) %>% 
  top_n(10,tf_idf) %>% 
  arrange(category, tf_idf) %>% 
  ungroup()

ggplot(mech_terms, aes(x=fct_reorder(word, tf_idf), y=tf_idf, group=category))+
  geom_col()+
  labs(y="Term Frequency - Inverse Document Frequency", x="")+
  coord_flip()+
  facet_wrap(~category, scales="free_y")

Turn Tokens into Predictor Variables

Because some of the tokens are also function words, we’ll add a suffix “_c" to each one in the mech_dtm matrix so that it doesn’t throw an error.

mech_fit_df<-as.data.frame(as.matrix(mech_dtm)) %>%
  mutate(game_id=as.numeric(row.names(.))) %>%
  left_join(games_df %>% select(category, game_id)) %>% 
  select(-NA, -game_id)

colnames(mech_fit_df)<-paste0(colnames(mech_fit_df), "_c", "")

#make sure the category field is a factor 
mech_fit_df$category_c<-factor(mech_fit_df$category_c)

mechfit<-randomForest(category_c~., mech_fit_df, ntree=500)
mech_accuracy<-scales::percent(sum(diag(mechfit$confusion))/nrow(mech_fit_df))
mip<-as.data.frame(varImp(mechfit)) %>% 
  mutate(pred_var=gsub("_c", "", row.names(.))) %>% 
  filter(Overall>15) %>% 
  arrange(Overall) %>% 
  mutate(pred_var=fct_inorder(pred_var))

The accuracy of a model using the tokens from the mechanic field are somewhat better: 66.8%. Card Game, Dice and Abstract Strategy were the most accurately classified.

class.error
Card Game 0.0988506
Dice 0.1308411
Abstract Strategy 0.2535211
Economic 0.4568966
Party Game 0.7012987
Medieval 0.9019608
ActionDexterity 1.0000000
Fantasy 1.0000000

The top variables by Mean Decrease Gini:

Using Description Field

I’m supressing the code here because it is essentially the same as for mechanic

Our new corpus contains Our corpus includes 7580 unique words.

The accuracy of a model using the tokens from the description field are somewhat better: 75.6%. Card Game, Dice and Abstract Strategy were still the most accurately classified.

class.error
Card Game 0.0616438
Dice 0.0841121
Abstract Strategy 0.1021127
Economic 0.4137931
Party Game 0.4805195
ActionDexterity 0.7636364
Medieval 0.8823529
Fantasy 0.9444444

The top description variables by Mean Decrease Gini:

All Together (?)

Can we increase further by pulling all the variables, both text and numeric?

The accuracy of a model using the tokens from the description and mechanic field added to the numeric variables boosts our accuracy to: 79.2%. Card Game, Dice and Abstract Strategy were still the most accurately classified though we did gain a bit on the category “Economic.”

class.error
Card Game 0.0321839
Dice 0.0654206
Abstract Strategy 0.0739437
Economic 0.2672414
Party Game 0.4025974
ActionDexterity 0.7636364
Medieval 0.9019608
Fantasy 0.9814815

The top variables by Mean Decrease Gini:
Note: the suffix “_mech" indicates that the word was extracted from the mechanic field, not description

The final confusion matrix:

Abstract Strategy ActionDexterity Card Game Dice Economic Fantasy Medieval Party Game class.error
Abstract Strategy 263 0 16 4 1 0 0 0 0.0739437
ActionDexterity 20 13 19 3 0 0 0 0 0.7636364
Card Game 5 0 421 3 4 0 0 2 0.0321839
Dice 4 0 3 100 0 0 0 0 0.0654206
Economic 6 0 20 5 85 0 0 0 0.2672414
Fantasy 7 0 30 7 8 1 1 0 0.9814815
Medieval 12 0 18 2 14 0 5 0 0.9019608
Party Game 3 0 25 0 3 0 0 46 0.4025974