Introduction

This document shows how to make certain functional programming tasks in R more efficient using the Tidyverse’s purrr library. The data comes from FiveThirtyEight’s data set of predictions for the 2022-2023 NBA season, which can be found on GitHub. To make the predictions, FiveThirtyEight uses an Elo rating methodology, which is described in more detail on their website.

Load and Clean Data

The cell below loads the data from GitHub and reads it into an R data frame.

library(RCurl)      # provides getURL()
library(tidyverse)  # provides glimpse(), %>%, and purrr's map functions

csv_data <- getURL("https://projects.fivethirtyeight.com/nba-model/nba_elo_latest.csv")
df <- read.csv(text = csv_data)
glimpse(df)
## Rows: 1,230
## Columns: 27
## $ date           <chr> "2022-10-18", "2022-10-18", "2022-10-19", "2022-10-19",…
## $ season         <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
## $ neutral        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ playoff        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ team1          <chr> "BOS", "GSW", "IND", "DET", "MEM", "MIA", "ATL", "TOR",…
## $ team2          <chr> "PHI", "LAL", "WAS", "ORL", "NYK", "CHI", "HOU", "CLE",…
## $ elo1_pre       <dbl> 1657.640, 1660.620, 1399.202, 1393.525, 1605.025, 1617.…
## $ elo2_pre       <dbl> 1582.247, 1442.352, 1440.077, 1366.089, 1520.387, 1447.…
## $ elo_prob1      <dbl> 0.7329497, 0.8620114, 0.5842751, 0.6755904, 0.7432364, …
## $ elo_prob2      <dbl> 0.2670503, 0.1379886, 0.4157249, 0.3244096, 0.2567636, …
## $ elo1_post      <dbl> 1662.199, 1663.449, 1388.883, 1397.249, 1607.526, 1598.…
## $ elo2_post      <dbl> 1577.688, 1439.523, 1450.396, 1362.366, 1517.886, 1466.…
## $ carm.elo1_pre  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo2_pre  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo_prob1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo_prob2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo1_post <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo2_post <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ raptor1_pre    <dbl> 1693.243, 1615.718, 1462.353, 1308.970, 1612.012, 1649.…
## $ raptor2_pre    <dbl> 1641.877, 1472.174, 1472.018, 1349.865, 1549.909, 1494.…
## $ raptor_prob1   <dbl> 0.6706123, 0.7765022, 0.5995097, 0.5632696, 0.6918508, …
## $ raptor_prob2   <dbl> 0.32938773, 0.22349780, 0.40049033, 0.43673038, 0.30814…
## $ score1         <int> 126, 123, 107, 113, 115, 108, 117, 108, 108, 115, 102, …
## $ score2         <int> 117, 109, 114, 109, 112, 116, 107, 105, 130, 108, 129, …
## $ quality        <int> 96, 67, 37, 3, 80, 76, 24, 86, 80, 34, 37, 79, 92, 42, …
## $ importance     <int> 13, 20, 28, 1, 25, 19, 1, 40, 44, 4, 34, 32, 19, 49, 28…
## $ total_rating   <int> 55, 44, 33, 2, 53, 48, 13, 63, 62, 19, 36, 56, 56, 46, …

To limit the scope of the data, the following cell filters df so that it includes only predictions for games in which the New York Knicks are playing:

df <- df %>%
  filter(team1 == 'NYK' | team2 == 'NYK')
nrow(df)
## [1] 82

The data is now limited to the 82 regular season games that the Knicks will play. Next, the cell below takes the data cleaning one step further by transforming the data frame so that it includes only stats pertaining to the Knicks (as opposed to their opponents):

df <- df %>%
  transmute(
    opponent = ifelse(team1=='NYK', team2, team1),
    win_percentage = ifelse(team1=='NYK', elo_prob1, elo_prob2),
    rating_pre = ifelse(team1=='NYK', elo1_pre, elo2_pre),
    #rating_post = ifelse(team1=='NYK', elo1_post, elo2_post),
    #points_scored = ifelse(team1=='NYK', score1, score2),
    quality = quality,
    importantance = importance
  )

glimpse(df)
## Rows: 82
## Columns: 5
## $ opponent       <chr> "MEM", "DET", "ORL", "CHO", "MIL", "CLE", "ATL", "PHI",…
## $ win_percentage <dbl> 0.2567636, 0.7807578, 0.8241579, 0.6008730, 0.2777317, …
## $ rating_pre     <dbl> 1520.387, 1517.886, 1524.826, 1528.374, 1532.596, 1532.…
## $ quality        <int> 80, 21, 19, 36, 77, 69, 77, 80, 88, 78, 76, 32, 32, 67,…
## $ importantance  <int> 25, 8, 7, 44, 38, 67, 52, 61, 38, 50, 74, 12, 12, 67, 2…

The data in its final format only includes the following columns:

  1. opponent - Who the Knicks are playing against.
  2. win_percentage - The predicted probability that the Knicks will win.
  3. rating_pre - The predicted pre-game Elo rating of the Knicks.
  4. quality - A relative score of how “fun the game will be to watch”, scored from 0-100.
  5. importantance (importance, as spelled in the code above) - A relative score of “how important will this game be in getting either team to the post-season”, scored from 0-100.
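
To make the home/away logic in the transmute call easier to see, here is a minimal sketch on a made-up two-game schedule. The data frame toy and its values are hypothetical; only the column names mirror the real data.

```r
library(dplyr)

# Hypothetical schedule: NYK appears once as team1 and once as team2
toy <- data.frame(
  team1     = c("NYK", "BOS"),
  team2     = c("PHI", "NYK"),
  elo_prob1 = c(0.60, 0.70),
  elo_prob2 = c(0.40, 0.30)
)

# For each row, pick whichever side belongs to the Knicks
toy_nyk <- toy %>%
  transmute(
    opponent       = ifelse(team1 == "NYK", team2, team1),
    win_percentage = ifelse(team1 == "NYK", elo_prob1, elo_prob2)
  )

toy_nyk
```

In the first row the Knicks are team1, so the opponent is PHI and the probability comes from elo_prob1. In the second row the roles are reversed, so the values come from team1 and elo_prob2.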

Functional Programming

This section will go through an example of how one might use R functions to answer questions about the data, and how those functions can be enhanced using the Tidyverse purrr library. The goal in this case: how does one quickly calculate the mean of each column?

The old way

The code chunk below shows how we might carry out this task using base R and a for loop:

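calc_means <- function(df_input) {
  # Pre-allocate one slot per column
  means <- vector("double", length(df_input))
  for (i in seq_along(df_input)) {
    # mean() returns NA (with a warning) for non-numeric columns
    means[i] <- mean(df_input[[i]])
  }
  # Package the means as a one-row data frame with the original column names
  results <- data.frame(matrix(ncol = ncol(df_input), nrow = 1))
  colnames(results) <- colnames(df_input)
  results[1,] <- means
  results
}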

calc_means(df)
##   opponent win_percentage rating_pre quality importantance
## 1       NA      0.5330998   1532.122 67.0122      60.81707

The code above defines a function, calc_means, that computes the mean of each column with a for loop and returns the results as a one-row data frame. The opponent column is character rather than numeric, so its mean is reported as NA.
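
For comparison, base R also has a functional helper, vapply, that can replace the explicit loop. The sketch below uses a small made-up data frame (toy is hypothetical) rather than the Knicks data.

```r
# Hypothetical numeric-only data frame standing in for df
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# vapply applies mean to each column and checks that every result
# is a single double, as declared by numeric(1)
toy_means <- vapply(toy, mean, numeric(1))
toy_means
```

The result is a named double vector with one element per column, which is the same shape that purrr's map_dbl returns.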

The purrr way

Using the purrr package, applying a function to every column of a data frame becomes very easy thanks to the package’s map functions. There is a separate map variant for each atomic type (e.g. map_int for integers), and each one returns a vector of the specified type. In this case we use map_dbl, since the mean of each column is a double. This is done in the cell below:

map_dbl(df, mean)
##       opponent win_percentage     rating_pre        quality  importantance 
##             NA      0.5330998   1532.1218598     67.0121951     60.8170732
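
To illustrate how the typed variants differ, the following sketch applies map_int, map_chr, map_lgl and map_dbl to a small made-up data frame (toy and its values are hypothetical). Each variant is chosen to match the type of value the mapped function returns.

```r
library(purrr)

# Hypothetical data frame with one character and one numeric column
toy <- data.frame(
  team  = c("NYK", "BOS", "NYK"),
  score = c(101, 99, 110),
  stringsAsFactors = FALSE
)

# length() returns an integer, so map_int fits
n_unique_vals <- map_int(toy, function(col) length(unique(col)))

# class() returns a single string for these columns, so map_chr fits
col_classes <- map_chr(toy, class)

# is.numeric() returns TRUE/FALSE, so map_lgl fits
is_num <- map_lgl(toy, is.numeric)

# max() of a double column returns a double, so map_dbl fits
col_max <- map_dbl(toy["score"], max)
```

If the mapped function returns a value that cannot be safely converted to the declared type, purrr raises an error rather than converting silently, which is part of the appeal of the typed variants.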

The Benefit

As the code above shows, computing and reporting the mean of every column takes only one line with the purrr package. The function applied to each column here is mean, and similar descriptive statistics functions (such as sd and median) can be used in the same way:

map_dbl(df, median)
##       opponent win_percentage     rating_pre        quality  importantance 
##             NA       0.562258    1532.596404      76.000000      66.000000
map_dbl(df, sd)
##       opponent win_percentage     rating_pre        quality  importantance 
##             NA      0.1720806      2.2911732     19.2510984     23.1767913
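
One way to avoid the NA reported for the non-numeric opponent column is to keep only the numeric columns with purrr's keep before mapping. The sketch below does this on a made-up data frame (toy and its values are hypothetical) and also loops over several summary functions at once.

```r
library(purrr)

# Hypothetical data frame with one character and two numeric columns
toy <- data.frame(
  opponent       = c("MEM", "DET", "ORL"),
  win_percentage = c(0.2, 0.8, 0.5),
  quality        = c(80, 20, 50)
)

# Drop non-numeric columns so every summary is meaningful
toy_num <- keep(toy, is.numeric)

# Apply each summary function to every remaining column
stats <- list(mean = mean, median = median, sd = sd)
summaries <- map(stats, function(f) map_dbl(toy_num, f))

summaries$mean
```

The result, summaries, is a named list with one double vector per statistic, each holding one value per numeric column.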

Calling the calc_means function defined earlier also takes only one line once the function has been written, but calling map_dbl turned out to be faster in this run:

start_time_base <- Sys.time()
calc_means(df)
end_time_base <- Sys.time()

start_time_purrr <- Sys.time()
map_dbl(df, mean)
end_time_purrr <- Sys.time()

time_base <- end_time_base - start_time_base
time_purrr <- end_time_purrr - start_time_purrr
time_base - time_purrr
## Time difference of 0.001914978 secs

As the output above shows, the purrr map function was about 0.002 seconds faster than the user-defined calc_means function in this run. This is an almost immaterial time saving for this example, but that could change as the size of the data or the number of function calls grows.
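
A single Sys.time() measurement is noisy at this scale, so a steadier comparison repeats each call many times. The sketch below is one way to do that with base R's system.time. The toy data frame, the repetition count n_reps, and the simplified loop function loop_means are assumptions for illustration, and no particular timings are claimed.

```r
library(purrr)

# Hypothetical stand-in for df: 5 numeric columns of 82 rows each
set.seed(1)
toy <- as.data.frame(matrix(runif(82 * 5), nrow = 82))

# Simplified for-loop version (returns a plain vector, unlike calc_means)
loop_means <- function(d) {
  out <- vector("double", length(d))
  for (i in seq_along(d)) out[i] <- mean(d[[i]])
  out
}

# Time many repetitions of each approach and keep the elapsed seconds
n_reps <- 1000
time_loop  <- system.time(for (k in seq_len(n_reps)) loop_means(toy))[["elapsed"]]
time_purrr <- system.time(for (k in seq_len(n_reps)) map_dbl(toy, mean))[["elapsed"]]

c(loop = time_loop, purrr = time_purrr)
```

Dividing each total by n_reps gives an average cost per call, which is easier to compare across machines than a single measurement.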

Conclusion

The purrr library, part of the larger Tidyverse set of packages, provides useful tools for functional programming in R. Though this document focused solely on the map family of functions, be sure to check the documentation for others that may suit your use case.