Recolumnize vignette

John Veech

2019-12-07

library(recolumnize)
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.6.1

Introduction

I will give a quick introduction to using the recolumnize package for data wrangling and cleaning. The intended use cases for this package arise when dealing with very messy data found “in the wild.” Sometimes one encounters data whose columns are transposed or in a meaningless order. That is, a value found in row 1, column 1 may have no connection to the value found in row 2, column 1.

Often this occurs due to data entry errors, data not intended to be stored in table format, or user/survey data that has not been properly validated (e.g., a user puts their ZIP code in the state field and their state in the address 2 field).

Example

In this example, we will use data collected from high-level League of Legends games. League of Legends is a competitive video game played between two teams of 5 players each. Each player selects a character (called a “champion”) from a roster of 146 different characters.

data("games_small")
head(games_small, 10)
#>      match_id blue_win team          c1          c2         c3       c4
#> 1  3029804371     TRUE blue    renekton    nocturne     thresh    corki
#> 2  3029804371     TRUE  red    jarvaniv     caitlyn    morgana     ekko
#> 3  3046779313    FALSE blue        sona      ezreal      nasus masteryi
#> 4  3046779313    FALSE  red       kaisa       sylas cassiopeia    teemo
#> 5  3046814532     TRUE blue       teemo     caitlyn      braum    yasuo
#> 6  3046814532     TRUE  red twistedfate     kalista   malzahar   rengar
#> 7  3046864276     TRUE blue       yuumi       sivir    sejuani   rumble
#> 8  3046864276     TRUE  red     leblanc       vayne    karthus pantheon
#> 9  3052841499     TRUE blue       akali        olaf      sivir    yuumi
#> 10 3052841499     TRUE  red       teemo twistedfate     ezreal   leesin
#>          c5
#> 1     varus
#> 2   camille
#> 3     taric
#> 4    khazix
#> 5    reksai
#> 6    thresh
#> 7     teemo
#> 8     riven
#> 9  vladimir
#> 10     pyke

We see that values can appear in any column (for example, “teemo” appears in 3 different columns in just the first 10 rows!). In fact, the order is determined solely by the order in which the champions were selected, which is not very meaningful for most analysis purposes. Our first step will be to try one-hot encoding the data using recolumnize::one_hot_encode.

one_hot_encode()

The one_hot_encode function creates a new column for each value that appears in the columns to be encoded. It then populates each row with a 1 if that value appears anywhere in the row and a 0 otherwise. (Note that if we set keep = "sum", we can instead store the number of times the value appears in each row.)
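As a rough sketch, a call might look like the following (only the keep argument is described above; the cols argument name is an assumption about the interface, so check ?one_hot_encode for the actual signature):

# Sketch: encode the five champion columns; argument names other than
# keep are assumptions about the interface
encoded <- one_hot_encode(games_small, cols = paste0("c", 1:5))
# or keep = "sum" to store counts instead of 0/1 indicators
dim(encoded)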

In this case, the usual implementation of one-hot encoding (for example, in the vtreat package) would not work well for this data: each value would get encoded up to 5 separate times, once for each of the original columns it can appear in.

So this encoding requires substantially fewer variables. It also generalizes more easily (a pattern learned for a champion applies no matter which of the original columns it appeared in). Can we do better? For modelling/prediction purposes, we are probably best served by one-hot encoding (many algorithms require it, and the reduced dimensionality is quite valuable). However, for data exploration, we may instead want to create some sort of meaningful ordering.

Generally, the 5 champions on each team fill 5 separate roles (similar to positions you might see in sports): top, jungle, middle, bottom, and support.

We need a data frame whose row names are the values we want to match and whose columns give the probability that each value falls into each category. Every value needs a probability for every category, so this method is not well suited to datasets without much structure (in such cases, the recategorize function will probably be more fruitful).
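For illustration, a toy version of such a table for the five champions in the first row of games_small might look like this (the probabilities below are invented for the example, not estimated from real games):

# Hypothetical role probabilities; row names are the values to match,
# columns are the categories, and each row sums to 1
champ_probs <- data.frame(
  top     = c(0.80, 0.05, 0.02, 0.03, 0.05),
  jungle  = c(0.05, 0.80, 0.03, 0.02, 0.05),
  middle  = c(0.05, 0.05, 0.05, 0.50, 0.10),
  bottom  = c(0.05, 0.05, 0.05, 0.40, 0.70),
  support = c(0.05, 0.05, 0.85, 0.05, 0.10),
  row.names = c("renekton", "nocturne", "thresh", "corki", "varus")
)
rowSums(champ_probs)  # each value's probabilities sum to 1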

best_categories_brute_force()

Note that calling best_categories_brute_force() will be very slow on large datasets, as it is O(n · k!) (for n rows and k columns), since it calculates probabilities for every permutation of the columns. However, it can be useful to run it on a subset of your data and compare it to the approximated version, as we will do below.
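To make the factorial cost concrete, here is a minimal sketch of the underlying idea for a single row (an illustration, not the package's internals; gtools::permutations is just one way to enumerate permutations):

# For one row of k values, score all k! assignments of values to
# categories and keep the best; probs is a table like champ_probs above
library(gtools)
best_assignment <- function(values, probs) {
  perms <- gtools::permutations(length(values), length(values))
  scores <- apply(perms, 1, function(p) {
    # joint probability of assigning values[p[j]] to the j-th category
    prod(mapply(function(v, k) probs[v, k], values[p], colnames(probs)))
  })
  values[perms[which.max(scores), ]]
}
best_assignment(c("thresh", "varus", "renekton", "corki", "nocturne"), champ_probs)
# returns the values reordered as top, jungle, middle, bottom, support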

Now, instead of seeing “teemo” in 3 different columns, it appears almost exclusively in the column for “top.” Likewise, we can already see other examples of values recurring. This allows us to do better analysis. For one, we can now compare a player to their counterpart on the other team. It also lets us capture the interaction between champion and role, rather than just the presence of the champion.

We have still somewhat reduced the number of predictors needed relative to the original dataset, although not by as much as one-hot encoding would.

best_categories_approximate()

Now, let’s use the faster best_categories_approximate() function. This function works by checking each category, finding the one with the largest percentage difference between its most likely and second most likely value, assigning that value, and then proceeding on the reduced problem with one fewer category and one fewer value. This is substantially faster than checking every possible permutation.
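A minimal sketch of this greedy idea (again an illustration rather than the package's internals; whether the margin is absolute or relative is an assumption here):

# Greedy: repeatedly pick the category whose most likely value beats its
# runner-up by the widest margin, assign it, and shrink the problem
greedy_assignment <- function(values, probs) {
  p <- as.matrix(probs[values, , drop = FALSE])
  out <- setNames(rep(NA_character_, ncol(p)), colnames(p))
  while (ncol(p) > 0) {
    margins <- apply(p, 2, function(col) {
      s <- sort(col, decreasing = TRUE)
      if (length(s) > 1) s[1] - s[2] else s[1]
    })
    k <- names(which.max(margins))       # most decisive remaining category
    v <- rownames(p)[which.max(p[, k])]  # its most likely remaining value
    out[k] <- v
    p <- p[rownames(p) != v, colnames(p) != k, drop = FALSE]
  }
  out
}
greedy_assignment(c("thresh", "varus", "renekton", "corki", "nocturne"), champ_probs)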

At first glance, it looks quite similar to the brute force output from above. But we may want to check how often the two differ.

Now we can check the mean difference in probabilities between the two methods, which is 1.804852 × 10^-5. We may also want to check how often the two methods disagree: this happens 47 times, with a mean difference of 0.0019201. So when the two methods disagree (at least in this dataset), the two answers are usually close to equally likely.
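For reference, the comparison might be computed along these lines (the shape of the two functions' outputs is assumed here; adapt the names to whatever they actually return):

# Placeholder objects standing in for the two functions' outputs:
# per-row assignment probabilities and assigned values for each method
mean(abs(brute_probs - approx_probs))  # mean difference in probabilities
sum(brute_values != approx_values)     # number of disagreements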

We can see that, overall, we seem to be putting most values into their most likely category (and there doesn’t appear to be much of a difference based on which category).
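One way to produce such a picture with the ggplot2 we loaded earlier (a sketch; the result object and its column names are placeholders for whatever the package returns):

# Hypothetical columns: the assigned category and the probability of the
# value assigned to it
ggplot(approx_result, aes(x = category, y = prob)) +
  geom_boxplot() +
  labs(x = "Assigned category", y = "Probability of assigned value")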

At this point, we could proceed with modelling or more data exploration with fairly strong confidence in our encoding method.