Recolumnize vignette

John Veech

2019-12-07

library(recolumnize)
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.6.1

Introduction

I will give a quick introduction to using the recolumnize package for data wrangling and cleaning. The intended use cases for this package arise when dealing with very messy data found “in the wild.” Sometimes one encounters data whose columns are transposed or in a meaningless order. That is, a value found in row 1, column 1 may have no connection to the value found in row 2, column 1.

Often this occurs due to data entry errors, data not intended to be stored in table format, or user/survey data that has not been properly validated (e.g., a user puts their ZIP code in the state field and their state in the address 2 field).

Example

In this example, we will use data collected from high-level League of Legends games. League of Legends is a competitive video game played between two teams of 5 players each. Each player selects a character (called a “champion”) from a roster of 146 different characters.

data("games_small")
head(games_small, 10)
#>      match_id blue_win team          c1          c2         c3       c4
#> 1  3029804371     TRUE blue    renekton    nocturne     thresh    corki
#> 2  3029804371     TRUE  red    jarvaniv     caitlyn    morgana     ekko
#> 3  3046779313    FALSE blue        sona      ezreal      nasus masteryi
#> 4  3046779313    FALSE  red       kaisa       sylas cassiopeia    teemo
#> 5  3046814532     TRUE blue       teemo     caitlyn      braum    yasuo
#> 6  3046814532     TRUE  red twistedfate     kalista   malzahar   rengar
#> 7  3046864276     TRUE blue       yuumi       sivir    sejuani   rumble
#> 8  3046864276     TRUE  red     leblanc       vayne    karthus pantheon
#> 9  3052841499     TRUE blue       akali        olaf      sivir    yuumi
#> 10 3052841499     TRUE  red       teemo twistedfate     ezreal   leesin
#>          c5
#> 1     varus
#> 2   camille
#> 3     taric
#> 4    khazix
#> 5    reksai
#> 6    thresh
#> 7     teemo
#> 8     riven
#> 9  vladimir
#> 10     pyke

We see that values can appear in any column (for example, “teemo” appears in 3 different columns in just the first 10 rows!). In fact, the order is determined solely by the order in which the champions were selected, which is not very meaningful for most analysis purposes. Our first step will be to try one-hot encoding the data using recolumnize::one_hot_encode.

one_hot_encode()

The one_hot_encode function creates a new column for each value that appears in the columns to be encoded. It then populates each row with a 1 if that value appears anywhere in the row and a 0 otherwise. (Note that if we set keep = "sum", we can instead store the number of times the value appears in each row.)
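As a rough sketch, a call might look like the following (only the keep argument is described above; the cols argument name is an assumption about the interface, so check ?one_hot_encode for the actual signature):

# Sketch: encode the five champion columns; argument names other than
# keep are assumptions about the interface
encoded <- one_hot_encode(games_small, cols = paste0("c", 1:5))
# or keep = "sum" to store counts instead of 0/1 indicators
dim(encoded)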

In this case, the usual implementation of one-hot encoding (for example, in the vtreat package) would not work well for this data: each value would get encoded up to 5 separate times, once for each of the original columns it can appear in.

So this encoding requires substantially fewer variables. It also generalizes more easily (a pattern learned for a champion applies no matter which of the original columns it appeared in). Can we do better? For modelling/prediction purposes, we are probably best served by one-hot encoding (many algorithms require it, and the reduced dimensionality is quite valuable). However, for data exploration, we may instead want to create some sort of meaningful ordering.

Generally, the 5 champions on each team fill 5 separate roles (similar to positions you might see in sports): top, jungle, middle, bottom, and support.

We need a data frame whose row names are the values we want to match and whose columns give the probability that each value falls into each category. Every value needs a probability for every category, so this method is not well suited to datasets without much structure (in such cases, the recategorize function will probably be more fruitful).
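For illustration, a toy version of such a table for the five champions in the first row of games_small might look like this (the probabilities below are invented for the example, not estimated from real games):

# Hypothetical role probabilities; row names are the values to match,
# columns are the categories, and each row sums to 1
champ_probs <- data.frame(
  top     = c(0.80, 0.05, 0.02, 0.03, 0.05),
  jungle  = c(0.05, 0.80, 0.03, 0.02, 0.05),
  middle  = c(0.05, 0.05, 0.05, 0.50, 0.10),
  bottom  = c(0.05, 0.05, 0.05, 0.40, 0.70),
  support = c(0.05, 0.05, 0.85, 0.05, 0.10),
  row.names = c("renekton", "nocturne", "thresh", "corki", "varus")
)
rowSums(champ_probs)  # each value's probabilities sum to 1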

best_categories_brute_force()

Note that calling best_categories_brute_force() will be very slow on large datasets, as it is O(n · k!) (for n rows and k columns), since it calculates probabilities for every permutation of the columns. However, it can be useful to run it on a subset of your data and compare it to the approximated version, as we will do below.
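To make the factorial cost concrete, here is a minimal sketch of the underlying idea for a single row (an illustration, not the package's internals; gtools::permutations is just one way to enumerate permutations):

# For one row of k values, score all k! assignments of values to
# categories and keep the best; probs is a table like champ_probs above
library(gtools)
best_assignment <- function(values, probs) {
  perms <- gtools::permutations(length(values), length(values))
  scores <- apply(perms, 1, function(p) {
    # joint probability of assigning values[p[j]] to the j-th category
    prod(mapply(function(v, k) probs[v, k], values[p], colnames(probs)))
  })
  values[perms[which.max(scores), ]]
}
best_assignment(c("thresh", "varus", "renekton", "corki", "nocturne"), champ_probs)
# returns the values reordered as top, jungle, middle, bottom, support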

Now, instead of seeing “teemo” in 3 different columns, it appears almost exclusively in the column for “top.” Likewise, we can already see other examples of values recurring. This allows us to do better analysis. For one, we can now compare a player to their counterpart on the other team. It also lets us capture the interaction between champion and role, rather than just the presence of the champion.

We have still somewhat reduced the number of predictors needed relative to the original dataset, although not by as much as one-hot encoding would.

best_categories_approximate()

Now, let’s use the faster best_categories_approximate() function. This function works by checking each category, finding the one with the largest percentage difference between its most likely and second most likely value, assigning that value, and then proceeding on the reduced problem with one fewer category and one fewer value. This is substantially faster than checking every possible permutation.
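A minimal sketch of this greedy idea (again an illustration rather than the package's internals; whether the margin is absolute or relative is an assumption here):

# Greedy: repeatedly pick the category whose most likely value beats its
# runner-up by the widest margin, assign it, and shrink the problem
greedy_assignment <- function(values, probs) {
  p <- as.matrix(probs[values, , drop = FALSE])
  out <- setNames(rep(NA_character_, ncol(p)), colnames(p))
  while (ncol(p) > 0) {
    margins <- apply(p, 2, function(col) {
      s <- sort(col, decreasing = TRUE)
      if (length(s) > 1) s[1] - s[2] else s[1]
    })
    k <- names(which.max(margins))       # most decisive remaining category
    v <- rownames(p)[which.max(p[, k])]  # its most likely remaining value
    out[k] <- v
    p <- p[rownames(p) != v, colnames(p) != k, drop = FALSE]
  }
  out
}
greedy_assignment(c("thresh", "varus", "renekton", "corki", "nocturne"), champ_probs)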

At first glance, it looks quite similar to the brute force output from above. But we may want to check how often the two differ.

Now we can check the mean difference in probabilities between the two methods, which is 1.804852 × 10^-5. We may also want to check how often the two methods disagree: this happens 47 times, with a mean difference of 0.0019201. So when the two methods disagree (at least in this dataset), the two answers are usually close to equally likely.
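For reference, the comparison might be computed along these lines (the shape of the two functions' outputs is assumed here; adapt the names to whatever they actually return):

# Placeholder objects standing in for the two functions' outputs:
# per-row assignment probabilities and assigned values for each method
mean(abs(brute_probs - approx_probs))  # mean difference in probabilities
sum(brute_values != approx_values)     # number of disagreements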

We can see that, overall, we seem to be putting most values into their most likely category (and there doesn’t appear to be much of a difference based on which category).
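One way to produce such a picture with the ggplot2 we loaded earlier (a sketch; the result object and its column names are placeholders for whatever the package returns):

# Hypothetical columns: the assigned category and the probability of the
# value assigned to it
ggplot(approx_result, aes(x = category, y = prob)) +
  geom_boxplot() +
  labs(x = "Assigned category", y = "Probability of assigned value")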

At this point, we could proceed with modelling or more data exploration with fairly strong confidence in our encoding method.