The purpose of this vignette is to explore the “forcats” package. I will be using a dataset from FiveThirtyEight, called “ELO”, that gives the expected win probability for every NBA game. Forcats provides some nice options for working with and displaying categorical data. I’m going to focus on “team1” as the variable.
We’ll start by reading the ELO dataset and taking a look at the columns.
library(forcats)
library(dplyr)
library(readr)
elo <- read_csv("https://projects.fivethirtyeight.com/nba-model/nba_elo.csv")
## Parsed with column specification:
## cols(
## .default = col_logical(),
## date = col_date(format = ""),
## season = col_double(),
## neutral = col_double(),
## team1 = col_character(),
## team2 = col_character(),
## elo1_pre = col_double(),
## elo2_pre = col_double(),
## elo_prob1 = col_double(),
## elo_prob2 = col_double(),
## elo1_post = col_double(),
## elo2_post = col_double(),
## score1 = col_double(),
## score2 = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 47423 parsing failures.
## row col expected actual file
## 63158 carm-elo1_pre 1/0/T/F/TRUE/FALSE 1564.372491 'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## 63158 carm-elo2_pre 1/0/T/F/TRUE/FALSE 1732.025482 'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## 63158 carm-elo_prob1 1/0/T/F/TRUE/FALSE 0.40307778555455737 'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## 63158 carm-elo_prob2 1/0/T/F/TRUE/FALSE 0.5969222144454427 'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## 63158 carm-elo1_post 1/0/T/F/TRUE/FALSE 1570.47393668307 'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
## ..... .............. .................. ................... ............................................................
## See problems(...) for more details.
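The parsing failures above happen because readr guesses each column's type from the first rows of the file. The carm-elo columns are empty for the early seasons, so they get guessed as logical and the later numeric values fail to parse. One way around this is to state the column types yourself with `col_types`. Here is a minimal sketch on a made-up two-column file (the file and its `rating` column are hypothetical stand-ins, not part of the ELO data):

```r
library(readr)

# Hypothetical toy file: "rating" is empty in the early rows,
# much like the carm-elo columns in the real data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("team1,rating", "TRH,", "CHS,", "BOS,1564.37"), csv_path)

# Declaring the types up front means readr does not have to guess
# from the empty leading rows.
toy <- read_csv(
  csv_path,
  col_types = cols(team1 = col_character(), rating = col_double())
)

toy$rating  # two missing values followed by the parsed number
```

For the real file the same idea would apply to the carm-elo and raptor columns. Raising `guess_max` so that readr looks at more rows before guessing is another option.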
str(elo)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 69636 obs. of 24 variables:
## $ date : Date, format: "1946-11-01" "1946-11-02" ...
## $ season : num 1947 1947 1947 1947 1947 ...
## $ neutral : num 0 0 0 0 0 0 0 0 0 0 ...
## $ playoff : logi NA NA NA NA NA NA ...
## $ team1 : chr "TRH" "CHS" "PRO" "STB" ...
## $ team2 : chr "NYK" "NYK" "BOS" "PIT" ...
## $ elo1_pre : num 1300 1300 1300 1300 1300 ...
## $ elo2_pre : num 1300 1307 1300 1300 1300 ...
## $ elo_prob1 : num 0.64 0.631 0.64 0.64 0.64 ...
## $ elo_prob2 : num 0.36 0.369 0.36 0.36 0.36 ...
## $ elo1_post : num 1293 1310 1305 1305 1280 ...
## $ elo2_post : num 1307 1297 1295 1295 1320 ...
## $ carm-elo1_pre : logi NA NA NA NA NA NA ...
## $ carm-elo2_pre : logi NA NA NA NA NA NA ...
## $ carm-elo_prob1: logi NA NA NA NA NA NA ...
## $ carm-elo_prob2: logi NA NA NA NA NA NA ...
## $ carm-elo1_post: logi NA NA NA NA NA NA ...
## $ carm-elo2_post: logi NA NA NA NA NA NA ...
## $ raptor1_pre : logi NA NA NA NA NA NA ...
## $ raptor2_pre : logi NA NA NA NA NA NA ...
## $ raptor_prob1 : logi NA NA NA NA NA NA ...
## $ raptor_prob2 : logi NA NA NA NA NA NA ...
## $ score1 : num 66 63 59 56 33 71 56 55 49 81 ...
## $ score2 : num 68 47 53 51 50 60 71 57 53 75 ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 47423 obs. of 5 variables:
## ..$ row : int 63158 63158 63158 63158 63158 63158 63159 63159 63159 63159 ...
## ..$ col : chr "carm-elo1_pre" "carm-elo2_pre" "carm-elo_prob1" "carm-elo_prob2" ...
## ..$ expected: chr "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
## ..$ actual : chr "1564.372491" "1732.025482" "0.40307778555455737" "0.5969222144454427" ...
## ..$ file : chr "'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'" "'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'" "'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'" "'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'" ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. season = col_double(),
## .. neutral = col_double(),
## .. playoff = col_logical(),
## .. team1 = col_character(),
## .. team2 = col_character(),
## .. elo1_pre = col_double(),
## .. elo2_pre = col_double(),
## .. elo_prob1 = col_double(),
## .. elo_prob2 = col_double(),
## .. elo1_post = col_double(),
## .. elo2_post = col_double(),
## .. `carm-elo1_pre` = col_logical(),
## .. `carm-elo2_pre` = col_logical(),
## .. `carm-elo_prob1` = col_logical(),
## .. `carm-elo_prob2` = col_logical(),
## .. `carm-elo1_post` = col_logical(),
## .. `carm-elo2_post` = col_logical(),
## .. raptor1_pre = col_logical(),
## .. raptor2_pre = col_logical(),
## .. raptor_prob1 = col_logical(),
## .. raptor_prob2 = col_logical(),
## .. score1 = col_double(),
## .. score2 = col_double()
## .. )
Forcats functions work on factors (character vectors get converted along the way), so let’s take the character variable team1 and convert it to a factor up front.
elo$team1 <- as.factor(elo$team1)
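How you convert matters for the starting level order. As a small sketch on a made-up vector of team codes (the vector is just for illustration), base R's `as.factor` sorts the levels alphabetically, while forcats' `as_factor` keeps them in order of first appearance:

```r
library(forcats)

# Hypothetical toy vector of team codes, in the order the games were played.
teams <- c("TRH", "CHS", "PRO", "CHS")

levels(as.factor(teams))   # alphabetical: "CHS" "PRO" "TRH"
levels(as_factor(teams))   # first appearance: "TRH" "CHS" "PRO"
```

This vignette sticks with `as.factor`, so the levels start out alphabetical.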
Let’s take a look at the levels of the variable.
table(elo$team1)
##
## ANA AND ATL BAL BLB BOS BRK BUF CAP CAR CHH CHI CHO CHP CHS
## 41 38 2244 486 278 3333 342 346 44 226 579 2389 655 54 120
## CHZ CIN CLE CLR DAL DEN DET DLC DNA DNN DNR DTF FLO FTW GSW
## 50 656 2143 31 1710 1865 2741 225 98 34 307 30 89 370 2098
## HOU HSM INA IND INJ INO KCK KCO KEN LAC LAL LAS MEM MIA MIL
## 2129 79 433 1895 31 153 421 126 431 1496 2764 89 808 1405 2223
## MIN MLH MMF MMP MMS MMT MNL MNM MNP NJA NJN NOB NOJ NOK NOP
## 1272 146 87 83 44 82 460 43 41 40 1466 132 205 82 672
## NYA NYK NYN OAK OKC ORL PHI PHO PHW PIT POR PRO PTC PTP ROC
## 373 3061 41 86 538 1311 2431 2241 518 30 2154 80 77 88 307
## SAA SAC SAS SDA SDC SDR SDS SEA SFW SHE SSL STB STL SYR TEX
## 135 1448 1981 132 246 166 3 1756 353 32 88 124 497 482 45
## TOR TRH TRI UTA UTS VAN VIR WAS WAT WSA WSB WSC
## 1055 30 69 1792 251 230 267 951 31 34 995 147
The levels are in alphabetical order. Let’s see how it looks if we order them by the date of each team’s first game instead. We’ll use fct_reorder, passing “date” as the variable to order by and “min” as the summary function, so each team is placed according to its earliest game.
table(fct_reorder(.f = elo$team1, .x = elo$date, .fun = min))
##
## TRH CHS DTF PRO STB CLR PIT BOS PHW NYK WSC BLB INJ FTW ROC
## 30 120 30 80 124 31 30 3333 518 3061 147 278 31 370 307
## MNL TRI INO WAT AND SHE SYR DNN MLH STL DET CIN LAL CHP CHZ
## 460 69 153 31 38 32 482 34 146 497 2741 656 2764 54 50
## SFW BAL PHI CHI OAK INA SDR DNR DLC NOB KEN SEA MNM HSM NJA
## 353 486 2431 2389 86 433 166 307 225 132 431 1756 43 79 40
## PTP ANA ATL MIL PHO NYA MNP LAS MMF CAR WSA BUF UTS FLO PTC
## 88 41 2244 2223 2241 373 41 89 87 226 34 346 251 89 77
## POR VIR MMP TEX CLE HOU GSW KCO MMT SDA SAA CAP SSL DNA WSB
## 2154 267 83 45 2143 2129 2098 126 82 132 135 44 88 98 995
## MMS NOJ SDS KCK IND DEN NYN SAS NJN SDC UTA DAL LAC SAC CHH
## 44 205 3 421 1895 1865 41 1981 1466 246 1792 1710 1496 1448 579
## MIA ORL MIN TOR VAN WAS MEM NOP CHO NOK OKC BRK
## 1405 1311 1272 1055 230 951 808 672 655 82 538 342
Knowing a bit about NBA history can help us confirm this makes sense. The last team on the list, “BRK”, represents the Brooklyn Nets, who took that name when the franchise moved from New Jersey in 2012, so it is the most recent team code to appear in the data.
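To see what fct_reorder is doing without scrolling through 100+ teams, here is a minimal sketch on a tiny made-up data frame (the `games` data and its values are hypothetical). Each team is summarised by its earliest date, and the levels are then sorted by that summary:

```r
library(forcats)

# Hypothetical toy data: one row per game, with the home team and the game date.
games <- data.frame(
  team = c("B", "A", "C", "A", "B", "C"),
  date = as.Date(c("2001-03-01", "2002-01-15", "2000-06-30",
                   "1999-12-25", "2003-07-04", "2004-02-02"))
)

# Order the team levels by each team's earliest game date.
by_debut <- fct_reorder(games$team, games$date, .fun = min)

levels(by_debut)  # "A" "C" "B"
```

Team A has the earliest game (1999), then C (2000), then B (2001), which is the order the levels come out in.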
It might be easier to read if the levels were sorted by frequency. For that we’ll use “fct_infreq”.
table(fct_infreq(elo$team1))
##
## BOS NYK LAL DET PHI CHI ATL PHO MIL POR CLE HOU GSW SAS IND
## 3333 3061 2764 2741 2431 2389 2244 2241 2223 2154 2143 2129 2098 1981 1895
## DEN UTA SEA DAL LAC NJN SAC MIA ORL MIN TOR WSB WAS MEM NOP
## 1865 1792 1756 1710 1496 1466 1448 1405 1311 1272 1055 995 951 808 672
## CIN CHO CHH OKC PHW STL BAL SYR MNL INA KEN KCK NYA FTW SFW
## 656 655 579 538 518 497 486 482 460 433 431 421 373 370 353
## BUF BRK DNR ROC BLB VIR UTS SDC VAN CAR DLC NOJ SDR INO WSC
## 346 342 307 307 278 267 251 246 230 226 225 205 166 153 147
## MLH SAA NOB SDA KCO STB CHS DNA FLO LAS PTP SSL MMF OAK MMP
## 146 135 132 132 126 124 120 98 89 89 88 88 87 86 83
## MMT NOK PRO HSM PTC TRI CHP CHZ TEX CAP MMS MNM ANA MNP NYN
## 82 82 80 79 77 69 54 50 45 44 44 43 41 41 41
## NJA AND DNN WSA SHE CLR INJ WAT DTF PIT TRH SDS
## 40 38 34 34 32 31 31 31 30 30 30 3
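The same idea on a toy factor makes the reordering easy to check by eye. This is just a sketch with made-up values. It also uses `fct_rev`, another forcats function, to flip the order in case we want the least frequent level first:

```r
library(forcats)

# Hypothetical toy factor: A appears once, B twice, C three times.
teams <- factor(c("B", "A", "C", "C", "B", "C"))

levels(teams)                       # alphabetical: "A" "B" "C"
levels(fct_infreq(teams))           # most frequent first: "C" "B" "A"
levels(fct_rev(fct_infreq(teams)))  # least frequent first: "A" "B" "C"
```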
Because this dataset covers seasons all the way back to 1947, we have a lot of teams that only have a small number of games. One way we can shorten the list is with “fct_lump”. We have three choices for deciding which levels get “lumped” together: specify “n”, specify “prop”, or specify neither. First we’ll try it with both “n” and “prop” left blank.
table(fct_infreq(fct_lump(elo$team1)))
##
## BOS NYK LAL DET PHI CHI ATL PHO MIL POR CLE HOU
## 3333 3061 2764 2741 2431 2389 2244 2241 2223 2154 2143 2129
## GSW SAS IND DEN UTA SEA DAL LAC NJN SAC MIA ORL
## 2098 1981 1895 1865 1792 1756 1710 1496 1466 1448 1405 1311
## MIN TOR WSB WAS MEM NOP CIN CHO CHH OKC PHW STL
## 1272 1055 995 951 808 672 656 655 579 538 518 497
## BAL SYR MNL INA KEN KCK NYA FTW SFW BUF BRK DNR
## 486 482 460 433 431 421 373 370 353 346 342 307
## ROC BLB VIR UTS SDC VAN CAR DLC NOJ SDR INO WSC
## 307 278 267 251 246 230 226 225 205 166 153 147
## MLH SAA NOB SDA KCO STB CHS DNA FLO LAS PTP SSL
## 146 135 132 132 126 124 120 98 89 89 88 88
## MMF OAK MMP MMT NOK PRO HSM PTC TRI CHP CHZ TEX
## 87 86 83 82 82 80 79 77 69 54 50 45
## CAP MMS MNM ANA MNP NYN NJA AND DNN WSA SHE CLR
## 44 44 43 41 41 41 40 38 34 34 32 31
## INJ WAT DTF PIT TRH Other
## 31 31 30 30 30 3
As you can see, all it did was change “SDS” to “Other”. That’s because if you don’t specify “n” or “prop”, the function lumps together as many of the least frequent levels as possible while making sure that “Other” is still the smallest level. Because adding “TRH” (30 games) would have made “Other” larger than “PIT” (also 30 games), it stopped there.
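The “Other must stay the smallest” rule is easier to see on a toy factor where more than one level gets lumped. A minimal sketch with made-up counts:

```r
library(forcats)

# Hypothetical toy factor with counts A = 10, B = 6, C = 3, D = 2.
teams <- factor(rep(c("A", "B", "C", "D"), times = c(10, 6, 3, 2)))

# With neither n nor prop given, the rarest levels are lumped
# for as long as "Other" stays smaller than every level that is kept.
lumped <- fct_lump(teams)

table(lumped)  # A = 10, B = 6, Other = 5
```

C and D are combined (3 + 2 = 5), which is still smaller than B. Adding B as well would push “Other” to 11, larger than A, so the lumping stops. As far as I know, forcats 0.5.0 and later also offer `fct_lump_lowfreq`, `fct_lump_n` and `fct_lump_prop` as more explicit versions of the three behaviours; the code here sticks with `fct_lump` to match the rest of the vignette.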
Instead of this, let’s just grab the top 10 by specifying “n”.
table(fct_infreq(fct_lump(elo$team1, n = 10)))
##
## Other BOS NYK LAL DET PHI CHI ATL PHO MIL POR
## 44055 3333 3061 2764 2741 2431 2389 2244 2241 2223 2154
As you can see, we have 11 levels now: the 10 most frequent teams, plus an “Other” level that combines the rest. Since we’re still using “fct_infreq” to sort, “Other” ends up at the top of the list.
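Here is the “n” version on a small made-up factor, as a sketch. It also uses the `other_level` argument of fct_lump to give the combined level a different label, which can be handy when “Other” is already a real category in your data:

```r
library(forcats)

# Hypothetical toy factor with counts A = 9, B = 7, C = 4, D = 2, E = 1.
teams <- factor(rep(c("A", "B", "C", "D", "E"), times = c(9, 7, 4, 2, 1)))

# Keep the 2 most frequent levels and combine everything else under a custom label.
top2 <- fct_lump(teams, n = 2, other_level = "Everyone else")

levels(top2)                   # "A" "B" "Everyone else"
sum(top2 == "Everyone else")   # 7, i.e. 4 + 2 + 1
```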
The third option is to specify “prop”, the minimum proportion of the total that a level must make up in order to be kept. Let’s keep any team that makes up more than 5% of the total.
table(fct_infreq(fct_lump(elo$team1, prop = 0.05)))
##
## Other
## 69636
And here we ended up with everything grouped together. Because our games are fairly evenly spread across teams, no single team accounts for more than 5% of them (even Boston, the most frequent, is just under 4.8%). Let’s try 0.5%.
table(fct_infreq(fct_lump(elo$team1, prop = 0.005)))
##
## Other BOS NYK LAL DET PHI CHI ATL PHO MIL POR CLE
## 6555 3333 3061 2764 2741 2431 2389 2244 2241 2223 2154 2143
## HOU GSW SAS IND DEN UTA SEA DAL LAC NJN SAC MIA
## 2129 2098 1981 1895 1865 1792 1756 1710 1496 1466 1448 1405
## ORL MIN TOR WSB WAS MEM NOP CIN CHO CHH OKC PHW
## 1311 1272 1055 995 951 808 672 656 655 579 538 518
## STL BAL SYR MNL INA KEN KCK NYA FTW SFW
## 497 486 482 460 433 431 421 373 370 353
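Rather than guessing at a threshold, we can look at each level's share of the total first. A minimal sketch on a made-up factor, using base R's `prop.table` to compute the shares before choosing “prop”:

```r
library(forcats)

# Hypothetical toy factor with counts A = 50, B = 30, C = 12, D = 5, E = 3 (100 in total).
teams <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 12, 5, 3)))

# Share of the total for each level: 0.50, 0.30, 0.12, 0.05, 0.03.
shares <- prop.table(table(teams))
shares

# Keep levels that make up more than 10% of the total and lump the rest.
lumped <- fct_lump(teams, prop = 0.10)
levels(lumped)           # "A" "B" "C" "Other"

# A threshold above the largest share lumps everything, as happened with 5% above.
all_other <- fct_lump(teams, prop = 0.60)
levels(all_other)        # "Other"
```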
Let’s say we want to reorder the levels manually. Say we are a west coast person. Let’s take the list from above but move the California teams to the beginning of the list.
relevel <- fct_infreq(fct_lump(elo$team1, prop = 0.005))
relevel <- fct_relevel(relevel, c("LAL", "LAC"))
table(relevel)
## relevel
## LAL LAC Other BOS NYK DET PHI CHI ATL PHO MIL POR
## 2764 1496 6555 3333 3061 2741 2431 2389 2244 2241 2223 2154
## CLE HOU GSW SAS IND DEN UTA SEA DAL NJN SAC MIA
## 2143 2129 2098 1981 1895 1865 1792 1756 1710 1466 1448 1405
## ORL MIN TOR WSB WAS MEM NOP CIN CHO CHH OKC PHW
## 1311 1272 1055 995 951 808 672 656 655 579 538 518
## STL BAL SYR MNL INA KEN KCK NYA FTW SFW
## 497 486 482 460 433 431 421 373 370 353
Ah, but we missed Golden State. Let’s move that to the third level using the “after” argument (“after = 2” places it right after the second level).
relevel <- fct_relevel(relevel, "GSW", after = 2)
table(relevel)
## relevel
## LAL LAC GSW Other BOS NYK DET PHI CHI ATL PHO MIL
## 2764 1496 2098 6555 3333 3061 2741 2431 2389 2244 2241 2223
## POR CLE HOU SAS IND DEN UTA SEA DAL NJN SAC MIA
## 2154 2143 2129 1981 1895 1865 1792 1756 1710 1466 1448 1405
## ORL MIN TOR WSB WAS MEM NOP CIN CHO CHH OKC PHW
## 1311 1272 1055 995 951 808 672 656 655 579 538 518
## STL BAL SYR MNL INA KEN KCK NYA FTW SFW
## 497 486 482 460 433 431 421 373 370 353
“Other” now looks a little weird in the middle of the list, so let’s put it at the end. We can just set “after” to “Inf”.
relevel <- fct_relevel(relevel, "Other", after = Inf)
table(relevel)
## relevel
## LAL LAC GSW BOS NYK DET PHI CHI ATL PHO MIL POR
## 2764 1496 2098 3333 3061 2741 2431 2389 2244 2241 2223 2154
## CLE HOU SAS IND DEN UTA SEA DAL NJN SAC MIA ORL
## 2143 2129 1981 1895 1865 1792 1756 1710 1466 1448 1405 1311
## MIN TOR WSB WAS MEM NOP CIN CHO CHH OKC PHW STL
## 1272 1055 995 951 808 672 656 655 579 538 518 497
## BAL SYR MNL INA KEN KCK NYA FTW SFW Other
## 486 482 460 433 431 421 373 370 353 6555
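The three fct_relevel moves above can be replayed on a toy factor where the result is easy to verify. This is just a sketch with made-up levels:

```r
library(forcats)

# Hypothetical toy factor whose levels start with "Other" in front.
f <- factor(c("a", "b", "c", "d", "Other"),
            levels = c("Other", "a", "b", "c", "d"))

# Named levels move to the front by default.
f1 <- fct_relevel(f, "c", "d")
levels(f1)  # "c" "d" "Other" "a" "b"

# after = 2 places the level right after the second one, so it becomes third.
f2 <- fct_relevel(f1, "b", after = 2)
levels(f2)  # "c" "d" "b" "Other" "a"

# after = Inf sends the level to the very end.
f3 <- fct_relevel(f2, "Other", after = Inf)
levels(f3)  # "c" "d" "b" "a" "Other"
```

Only the level order changes in each step; the underlying values stay the same.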
This should give you some basic idea of how to use the forcats functionality for displaying your factors. There are more options that you can use, and this package really shines when used in conjunction with ggplot2 for making plots.
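As a rough sketch of what that can look like (assuming ggplot2 is installed, and using a made-up factor rather than the ELO data), the level order we set with forcats is the order ggplot2 uses for the axis:

```r
library(forcats)
library(ggplot2)

# Hypothetical toy factor with counts A = 9, B = 7, C = 4, D = 2, E = 1.
teams <- factor(rep(c("A", "B", "C", "D", "E"), times = c(9, 7, 4, 2, 1)))

# Keep the top 3, sort by frequency, then reverse so that the most frequent
# level ends up at the top once the axes are flipped.
plot_data <- data.frame(team = fct_rev(fct_infreq(fct_lump(teams, n = 3))))

levels(plot_data$team)  # "Other" "C" "B" "A"

# Horizontal bar chart of counts per level, drawn in the factor's level order.
p <- ggplot(plot_data, aes(x = team)) +
  geom_bar() +
  coord_flip()
```

Printing `p` draws the chart. Changing the level order with any of the functions above changes the order of the bars.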