The purpose of this vignette is to explore the “forcats” package. I will be using a dataset from 538, called “ELO”, that gives the expected win percentage for every NBA game. Forcats provides some nice options for working with and displaying categorical data. I’m going to focus on “team1” as the variable.

We’ll start by reading the ELO dataset and taking a look at the columns.

library(forcats)
library(dplyr)
library(readr)
elo <- read_csv("https://projects.fivethirtyeight.com/nba-model/nba_elo.csv")
## Parsed with column specification:
## cols(
##   .default = col_logical(),
##   date = col_date(format = ""),
##   season = col_double(),
##   neutral = col_double(),
##   team1 = col_character(),
##   team2 = col_character(),
##   elo1_pre = col_double(),
##   elo2_pre = col_double(),
##   elo_prob1 = col_double(),
##   elo_prob2 = col_double(),
##   elo1_post = col_double(),
##   elo2_post = col_double(),
##   score1 = col_double(),
##   score2 = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 47423 parsing failures.
##   row            col           expected              actual                                                         file
## 63158 carm-elo1_pre  1/0/T/F/TRUE/FALSE 1564.372491         'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## 63158 carm-elo2_pre  1/0/T/F/TRUE/FALSE 1732.025482         'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## 63158 carm-elo_prob1 1/0/T/F/TRUE/FALSE 0.40307778555455737 'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## 63158 carm-elo_prob2 1/0/T/F/TRUE/FALSE 0.5969222144454427  'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## 63158 carm-elo1_post 1/0/T/F/TRUE/FALSE 1570.47393668307    'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## ..... .............. .................. ................... ............................................................
## See problems(...) for more details.
str(elo)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 69636 obs. of  24 variables:
##  $ date          : Date, format: "1946-11-01" "1946-11-02" ...
##  $ season        : num  1947 1947 1947 1947 1947 ...
##  $ neutral       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ playoff       : logi  NA NA NA NA NA NA ...
##  $ team1         : chr  "TRH" "CHS" "PRO" "STB" ...
##  $ team2         : chr  "NYK" "NYK" "BOS" "PIT" ...
##  $ elo1_pre      : num  1300 1300 1300 1300 1300 ...
##  $ elo2_pre      : num  1300 1307 1300 1300 1300 ...
##  $ elo_prob1     : num  0.64 0.631 0.64 0.64 0.64 ...
##  $ elo_prob2     : num  0.36 0.369 0.36 0.36 0.36 ...
##  $ elo1_post     : num  1293 1310 1305 1305 1280 ...
##  $ elo2_post     : num  1307 1297 1295 1295 1320 ...
##  $ carm-elo1_pre : logi  NA NA NA NA NA NA ...
##  $ carm-elo2_pre : logi  NA NA NA NA NA NA ...
##  $ carm-elo_prob1: logi  NA NA NA NA NA NA ...
##  $ carm-elo_prob2: logi  NA NA NA NA NA NA ...
##  $ carm-elo1_post: logi  NA NA NA NA NA NA ...
##  $ carm-elo2_post: logi  NA NA NA NA NA NA ...
##  $ raptor1_pre   : logi  NA NA NA NA NA NA ...
##  $ raptor2_pre   : logi  NA NA NA NA NA NA ...
##  $ raptor_prob1  : logi  NA NA NA NA NA NA ...
##  $ raptor_prob2  : logi  NA NA NA NA NA NA ...
##  $ score1        : num  66 63 59 56 33 71 56 55 49 81 ...
##  $ score2        : num  68 47 53 51 50 60 71 57 53 75 ...
##  - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 47423 obs. of  5 variables:
##   ..$ row     : int  63158 63158 63158 63158 63158 63158 63159 63159 63159 63159 ...
##   ..$ col     : chr  "carm-elo1_pre" "carm-elo2_pre" "carm-elo_prob1" "carm-elo_prob2" ...
##   ..$ expected: chr  "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
##   ..$ actual  : chr  "1564.372491" "1732.025482" "0.40307778555455737" "0.5969222144454427" ...
##   ..$ file    : chr  "'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'" "'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'" "'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'" "'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   season = col_double(),
##   ..   neutral = col_double(),
##   ..   playoff = col_logical(),
##   ..   team1 = col_character(),
##   ..   team2 = col_character(),
##   ..   elo1_pre = col_double(),
##   ..   elo2_pre = col_double(),
##   ..   elo_prob1 = col_double(),
##   ..   elo_prob2 = col_double(),
##   ..   elo1_post = col_double(),
##   ..   elo2_post = col_double(),
##   ..   `carm-elo1_pre` = col_logical(),
##   ..   `carm-elo2_pre` = col_logical(),
##   ..   `carm-elo_prob1` = col_logical(),
##   ..   `carm-elo_prob2` = col_logical(),
##   ..   `carm-elo1_post` = col_logical(),
##   ..   `carm-elo2_post` = col_logical(),
##   ..   raptor1_pre = col_logical(),
##   ..   raptor2_pre = col_logical(),
##   ..   raptor_prob1 = col_logical(),
##   ..   raptor_prob2 = col_logical(),
##   ..   score1 = col_double(),
##   ..   score2 = col_double()
##   .. )
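
The parsing failures above come from type guessing: readr only inspects the first rows of the file (1,000 by default, controlled by guess_max), and columns such as carm-elo1_pre are empty there, so they were guessed as logical and the later numeric values could not be parsed. One way to avoid this is to declare the column types yourself with col_types. The sketch below uses a made-up miniature file rather than the real dataset; the file contents and the column names game and rating are assumptions for illustration only.

```r
library(readr)

# Hypothetical miniature file: `rating` is empty in the early rows and only
# holds numbers later, much like the carm-elo columns in the real data.
path <- tempfile(fileext = ".csv")
writeLines(c("game,rating", "1,", "2,", "3,1564.37"), path)

# Declaring the types up front means readr does not have to guess them.
games <- read_csv(
  path,
  col_types = cols(game = col_integer(), rating = col_double())
)

games$rating  # empty fields become NA, the number is kept as a double
```

Since this vignette only uses team1 and date, the parsing failures do not affect anything that follows.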

In order to use forcats, we need the variable(s) we’re going to use to be encoded as factors. Let’s take the character variable team1 and convert it to a factor.

elo$team1 <- as.factor(elo$team1)

Let’s take a look at the levels of the variable.

table(elo$team1)
## 
##  ANA  AND  ATL  BAL  BLB  BOS  BRK  BUF  CAP  CAR  CHH  CHI  CHO  CHP  CHS 
##   41   38 2244  486  278 3333  342  346   44  226  579 2389  655   54  120 
##  CHZ  CIN  CLE  CLR  DAL  DEN  DET  DLC  DNA  DNN  DNR  DTF  FLO  FTW  GSW 
##   50  656 2143   31 1710 1865 2741  225   98   34  307   30   89  370 2098 
##  HOU  HSM  INA  IND  INJ  INO  KCK  KCO  KEN  LAC  LAL  LAS  MEM  MIA  MIL 
## 2129   79  433 1895   31  153  421  126  431 1496 2764   89  808 1405 2223 
##  MIN  MLH  MMF  MMP  MMS  MMT  MNL  MNM  MNP  NJA  NJN  NOB  NOJ  NOK  NOP 
## 1272  146   87   83   44   82  460   43   41   40 1466  132  205   82  672 
##  NYA  NYK  NYN  OAK  OKC  ORL  PHI  PHO  PHW  PIT  POR  PRO  PTC  PTP  ROC 
##  373 3061   41   86  538 1311 2431 2241  518   30 2154   80   77   88  307 
##  SAA  SAC  SAS  SDA  SDC  SDR  SDS  SEA  SFW  SHE  SSL  STB  STL  SYR  TEX 
##  135 1448 1981  132  246  166    3 1756  353   32   88  124  497  482   45 
##  TOR  TRH  TRI  UTA  UTS  VAN  VIR  WAS  WAT  WSA  WSB  WSC 
## 1055   30   69 1792  251  230  267  951   31   34  995  147
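
To see where this ordering comes from, here is a minimal sketch on a made-up vector of team codes (not the real data): as.factor() sorts the unique values to build the levels, regardless of the order in which they appear.

```r
# Made-up team codes, deliberately not in alphabetical order
x <- c("NYK", "BOS", "NYK", "LAL")

f <- as.factor(x)

levels(f)  # levels are the sorted unique values
table(f)   # table() reports counts in level order
```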

The levels are in alphabetical order. Let’s see how it looks if we sort by the first game the team played instead. We’ll use fct_reorder. We need to specify that “date” is the column we want to use, and the aggregate function will be “min”.

table(fct_reorder(.f = elo$team1, .x = elo$date, .fun = min))
## 
##  TRH  CHS  DTF  PRO  STB  CLR  PIT  BOS  PHW  NYK  WSC  BLB  INJ  FTW  ROC 
##   30  120   30   80  124   31   30 3333  518 3061  147  278   31  370  307 
##  MNL  TRI  INO  WAT  AND  SHE  SYR  DNN  MLH  STL  DET  CIN  LAL  CHP  CHZ 
##  460   69  153   31   38   32  482   34  146  497 2741  656 2764   54   50 
##  SFW  BAL  PHI  CHI  OAK  INA  SDR  DNR  DLC  NOB  KEN  SEA  MNM  HSM  NJA 
##  353  486 2431 2389   86  433  166  307  225  132  431 1756   43   79   40 
##  PTP  ANA  ATL  MIL  PHO  NYA  MNP  LAS  MMF  CAR  WSA  BUF  UTS  FLO  PTC 
##   88   41 2244 2223 2241  373   41   89   87  226   34  346  251   89   77 
##  POR  VIR  MMP  TEX  CLE  HOU  GSW  KCO  MMT  SDA  SAA  CAP  SSL  DNA  WSB 
## 2154  267   83   45 2143 2129 2098  126   82  132  135   44   88   98  995 
##  MMS  NOJ  SDS  KCK  IND  DEN  NYN  SAS  NJN  SDC  UTA  DAL  LAC  SAC  CHH 
##   44  205    3  421 1895 1865   41 1981 1466  246 1792 1710 1496 1448  579 
##  MIA  ORL  MIN  TOR  VAN  WAS  MEM  NOP  CHO  NOK  OKC  BRK 
## 1405 1311 1272 1055  230  951  808  672  655   82  538  342

Knowing a bit about NBA history can help us confirm this makes sense. The last code on the list, “BRK”, represents the Brooklyn Nets. It is the newest team code in the data because the franchise moved from New Jersey (“NJN”) to Brooklyn in 2012.
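
The same idea is easier to follow on a tiny example. The sketch below uses a made-up data frame (the team names and dates are assumptions, not taken from the ELO data) and reorders the levels by each team's earliest date.

```r
library(forcats)

# Made-up games: team "B" has the earliest date, then "A", then "C"
games <- data.frame(
  team = factor(c("B", "A", "B", "C", "A")),
  date = as.Date(c("2001-01-05", "2000-03-01", "1999-12-31",
                   "2002-07-04", "2003-01-01"))
)

# For each level, take the minimum date, then sort the levels by that value
by_debut <- fct_reorder(games$team, games$date, .fun = min)

levels(by_debut)
```

Only the level order changes; the values stored in the factor stay the same.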

It might be easier to read if the levels were sorted by frequency. For that we’ll use “fct_infreq”.

table(fct_infreq(elo$team1))
## 
##  BOS  NYK  LAL  DET  PHI  CHI  ATL  PHO  MIL  POR  CLE  HOU  GSW  SAS  IND 
## 3333 3061 2764 2741 2431 2389 2244 2241 2223 2154 2143 2129 2098 1981 1895 
##  DEN  UTA  SEA  DAL  LAC  NJN  SAC  MIA  ORL  MIN  TOR  WSB  WAS  MEM  NOP 
## 1865 1792 1756 1710 1496 1466 1448 1405 1311 1272 1055  995  951  808  672 
##  CIN  CHO  CHH  OKC  PHW  STL  BAL  SYR  MNL  INA  KEN  KCK  NYA  FTW  SFW 
##  656  655  579  538  518  497  486  482  460  433  431  421  373  370  353 
##  BUF  BRK  DNR  ROC  BLB  VIR  UTS  SDC  VAN  CAR  DLC  NOJ  SDR  INO  WSC 
##  346  342  307  307  278  267  251  246  230  226  225  205  166  153  147 
##  MLH  SAA  NOB  SDA  KCO  STB  CHS  DNA  FLO  LAS  PTP  SSL  MMF  OAK  MMP 
##  146  135  132  132  126  124  120   98   89   89   88   88   87   86   83 
##  MMT  NOK  PRO  HSM  PTC  TRI  CHP  CHZ  TEX  CAP  MMS  MNM  ANA  MNP  NYN 
##   82   82   80   79   77   69   54   50   45   44   44   43   41   41   41 
##  NJA  AND  DNN  WSA  SHE  CLR  INJ  WAT  DTF  PIT  TRH  SDS 
##   40   38   34   34   32   31   31   31   30   30   30    3
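
As a minimal sketch on made-up values, fct_infreq puts the most common level first, and wrapping it in fct_rev flips the order if you would rather have the rarest level first.

```r
library(forcats)

# Made-up factor: "c" appears three times, "b" twice, "a" once
x <- factor(c("a", "b", "b", "c", "c", "c"))

most_first <- fct_infreq(x)
least_first <- fct_rev(fct_infreq(x))

levels(most_first)
levels(least_first)
```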

Because this dataset goes all the way back to the 1947 season, we have a lot of teams that only have a small number of games. One way we can shorten the list is with “fct_lump”. We have three choices concerning which values will get “lumped” together: specify “n”, “prop”, or neither. First we’ll try it with both “n” and “prop” left blank.

table(fct_infreq(fct_lump(elo$team1)))
## 
##   BOS   NYK   LAL   DET   PHI   CHI   ATL   PHO   MIL   POR   CLE   HOU 
##  3333  3061  2764  2741  2431  2389  2244  2241  2223  2154  2143  2129 
##   GSW   SAS   IND   DEN   UTA   SEA   DAL   LAC   NJN   SAC   MIA   ORL 
##  2098  1981  1895  1865  1792  1756  1710  1496  1466  1448  1405  1311 
##   MIN   TOR   WSB   WAS   MEM   NOP   CIN   CHO   CHH   OKC   PHW   STL 
##  1272  1055   995   951   808   672   656   655   579   538   518   497 
##   BAL   SYR   MNL   INA   KEN   KCK   NYA   FTW   SFW   BUF   BRK   DNR 
##   486   482   460   433   431   421   373   370   353   346   342   307 
##   ROC   BLB   VIR   UTS   SDC   VAN   CAR   DLC   NOJ   SDR   INO   WSC 
##   307   278   267   251   246   230   226   225   205   166   153   147 
##   MLH   SAA   NOB   SDA   KCO   STB   CHS   DNA   FLO   LAS   PTP   SSL 
##   146   135   132   132   126   124   120    98    89    89    88    88 
##   MMF   OAK   MMP   MMT   NOK   PRO   HSM   PTC   TRI   CHP   CHZ   TEX 
##    87    86    83    82    82    80    79    77    69    54    50    45 
##   CAP   MMS   MNM   ANA   MNP   NYN   NJA   AND   DNN   WSA   SHE   CLR 
##    44    44    43    41    41    41    40    38    34    34    32    31 
##   INJ   WAT   DTF   PIT   TRH Other 
##    31    31    30    30    30     3

As you can see, all it did was change “SDS” to “Other”. That’s because if you don’t specify “n” or “prop”, the function will combine as many of the least frequent levels as possible while making sure that the “Other” category is still the smallest. Because adding “TRH” would have made the “Other” category larger than “PIT”, it stopped there.
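
The default behaviour is easier to see when the counts are more lopsided than ours. Here is a small sketch with made-up counts (the levels a to e are assumptions for illustration). In forcats 0.5.0 and later the same behaviour is also available under the more explicit name fct_lump_lowfreq, with fct_lump_n and fct_lump_prop covering the other two cases; the sketch sticks with fct_lump to match the rest of this vignette.

```r
library(forcats)

# Made-up counts: a = 35, b = 30, c = 5, d = 3, e = 2
x <- factor(rep(c("a", "b", "c", "d", "e"), times = c(35, 30, 5, 3, 2)))

# With no "n" or "prop", the smallest levels are combined as long as
# "Other" stays smaller than every level that is kept
lumped <- fct_lump(x)

table(lumped)
```

Here c, d and e together make 10, which is still less than b, so all three are lumped. Adding b as well would make “Other” larger than a, so it stops there.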

Instead of this, let’s just grab the top 10 by specifying “n”.

table(fct_infreq(fct_lump(elo$team1, n = 10)))
## 
## Other   BOS   NYK   LAL   DET   PHI   CHI   ATL   PHO   MIL   POR 
## 44055  3333  3061  2764  2741  2431  2389  2244  2241  2223  2154

As you can see, we have 11 levels now: the 10 most frequent, and then another level that combines the rest. Since we’re still using “fct_infreq” to sort, the “Other” level ends up at the start of the list.

The third option is to specify the minimum proportion of the total that a level must make up. Let’s keep any level that accounts for more than 5% of the total.

table(fct_infreq(fct_lump(elo$team1, prop = 0.05)))
## 
## Other 
## 69636

And here we ended up with everything grouped together. Because our levels are fairly evenly spread, we don’t have any that account for more than 5% of the games. Let’s try 0.5%.

table(fct_infreq(fct_lump(elo$team1, prop = 0.005)))
## 
## Other   BOS   NYK   LAL   DET   PHI   CHI   ATL   PHO   MIL   POR   CLE 
##  6555  3333  3061  2764  2741  2431  2389  2244  2241  2223  2154  2143 
##   HOU   GSW   SAS   IND   DEN   UTA   SEA   DAL   LAC   NJN   SAC   MIA 
##  2129  2098  1981  1895  1865  1792  1756  1710  1496  1466  1448  1405 
##   ORL   MIN   TOR   WSB   WAS   MEM   NOP   CIN   CHO   CHH   OKC   PHW 
##  1311  1272  1055   995   951   808   672   656   655   579   538   518 
##   STL   BAL   SYR   MNL   INA   KEN   KCK   NYA   FTW   SFW 
##   497   486   482   460   433   431   421   373   370   353
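
To compare “n” and “prop” side by side, here is a minimal sketch on made-up counts (the levels a to e are assumptions for illustration, not part of the ELO data).

```r
library(forcats)

# Made-up counts: a = 35, b = 30, c = 5, d = 3, e = 2 (75 values in total)
x <- factor(rep(c("a", "b", "c", "d", "e"), times = c(35, 30, 5, 3, 2)))

# Keep the 2 most frequent levels and lump everything else
top2 <- fct_lump(x, n = 2)
table(top2)

# Keep levels that make up more than 5% of the values:
# c is 5/75 (about 6.7%) and stays, d (4%) and e (about 2.7%) are lumped
over5 <- fct_lump(x, prop = 0.05)
table(over5)
```

In both cases the new “Other” level is placed last, which is why we needed fct_infreq above to sort it by size.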

Now suppose we want to reorder the levels manually. Let’s say we are a West Coast person. We’ll take the list from above but move the California teams to the beginning of the list.

relevel <- fct_infreq(fct_lump(elo$team1, prop = 0.005))
relevel <- fct_relevel(relevel, c("LAL", "LAC"))
table(relevel)
## relevel
##   LAL   LAC Other   BOS   NYK   DET   PHI   CHI   ATL   PHO   MIL   POR 
##  2764  1496  6555  3333  3061  2741  2431  2389  2244  2241  2223  2154 
##   CLE   HOU   GSW   SAS   IND   DEN   UTA   SEA   DAL   NJN   SAC   MIA 
##  2143  2129  2098  1981  1895  1865  1792  1756  1710  1466  1448  1405 
##   ORL   MIN   TOR   WSB   WAS   MEM   NOP   CIN   CHO   CHH   OKC   PHW 
##  1311  1272  1055   995   951   808   672   656   655   579   538   518 
##   STL   BAL   SYR   MNL   INA   KEN   KCK   NYA   FTW   SFW 
##   497   486   482   460   433   431   421   373   370   353

Ah, but we missed Golden State. Let’s move that to the third position using the “after” option.

relevel <- fct_relevel(relevel, "GSW", after = 2)
table(relevel)
## relevel
##   LAL   LAC   GSW Other   BOS   NYK   DET   PHI   CHI   ATL   PHO   MIL 
##  2764  1496  2098  6555  3333  3061  2741  2431  2389  2244  2241  2223 
##   POR   CLE   HOU   SAS   IND   DEN   UTA   SEA   DAL   NJN   SAC   MIA 
##  2154  2143  2129  1981  1895  1865  1792  1756  1710  1466  1448  1405 
##   ORL   MIN   TOR   WSB   WAS   MEM   NOP   CIN   CHO   CHH   OKC   PHW 
##  1311  1272  1055   995   951   808   672   656   655   579   538   518 
##   STL   BAL   SYR   MNL   INA   KEN   KCK   NYA   FTW   SFW 
##   497   486   482   460   433   431   421   373   370   353

“Other” now looks a little weird in the middle of the list, so let’s put it at the end. We can just set “after” to “Inf”.

relevel <- fct_relevel(relevel, "Other", after = Inf)
table(relevel)
## relevel
##   LAL   LAC   GSW   BOS   NYK   DET   PHI   CHI   ATL   PHO   MIL   POR 
##  2764  1496  2098  3333  3061  2741  2431  2389  2244  2241  2223  2154 
##   CLE   HOU   SAS   IND   DEN   UTA   SEA   DAL   NJN   SAC   MIA   ORL 
##  2143  2129  1981  1895  1865  1792  1756  1710  1466  1448  1405  1311 
##   MIN   TOR   WSB   WAS   MEM   NOP   CIN   CHO   CHH   OKC   PHW   STL 
##  1272  1055   995   951   808   672   656   655   579   538   518   497 
##   BAL   SYR   MNL   INA   KEN   KCK   NYA   FTW   SFW Other 
##   486   482   460   433   431   421   373   370   353  6555
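
The three fct_relevel steps can be traced on a small made-up factor, where it is easy to check each intermediate order by eye. The level names here are assumptions for illustration.

```r
library(forcats)

# Made-up factor whose levels start out as: Other, a, b, c, d
f <- factor(c("a", "b", "c", "d", "Other"),
            levels = c("Other", "a", "b", "c", "d"))

# Move "c" and "d" to the front: c, d, Other, a, b
step1 <- fct_relevel(f, "c", "d")

# Put "b" after the second level: c, d, b, Other, a
step2 <- fct_relevel(step1, "b", after = 2)

# Send "Other" to the very end: c, d, b, a, Other
step3 <- fct_relevel(step2, "Other", after = Inf)

levels(step3)
```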

This should give you a basic idea of how to use forcats to control how your factors are displayed. There are more options that you can use, and this package really shines when used in conjunction with ggplot2 for making plots.
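
As a parting sketch of the ggplot2 pairing, the example below draws a bar chart whose bars are ordered by frequency rather than alphabetically. It uses a made-up data frame (the counts are assumptions, not taken from the ELO data) and assumes ggplot2 is installed.

```r
library(forcats)
library(ggplot2)

# Made-up data: BOS appears 3 times, LAL twice, NYK once
teams <- data.frame(
  team = factor(rep(c("BOS", "NYK", "LAL"), times = c(3, 1, 2)))
)

# fct_infreq() reorders the levels, and ggplot2 draws bars in level order
p <- ggplot(teams, aes(x = fct_infreq(team))) +
  geom_bar() +
  labs(x = "team", y = "games")

p
```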