This document shows how to make certain functional programming tasks in
R more efficient using the Tidyverse’s purrr library. The data used here
comes from FiveThirtyEight’s data set of predictions for the 2022-2023
NBA season, which can be found on GitHub.
To make the predictions, FiveThirtyEight uses an Elo rating methodology,
which is described in more detail on their website.
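For intuition, the standard Elo formula turns a rating gap into a win probability. The sketch below is not taken from FiveThirtyEight’s code: elo_win_prob is a hypothetical helper, and the 100-point home-court bonus is an assumption. With that assumption, it reproduces the elo_prob1 value in the first row of the data shown further down.

```r
# Hypothetical helper (not FiveThirtyEight's code): standard Elo expected score.
# home_adv is an ASSUMED home-court bonus, in Elo points, added to the first team.
elo_win_prob <- function(elo_home, elo_away, home_adv = 100) {
  1 / (1 + 10^((elo_away - (elo_home + home_adv)) / 400))
}

# Pre-game ratings from the first row of the data (BOS vs. PHI)
p_home <- elo_win_prob(1657.640, 1582.247)
round(p_home, 3)  # 0.733, in line with elo_prob1 = 0.7329497
```

Two equally rated teams with no bonus come out at exactly 0.5, which is a quick sanity check on the formula.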
The cell below loads the data from GitHub and reads it into an R dataframe.
library(RCurl)      # provides getURL()
library(tidyverse)  # provides glimpse(), %>%, and purrr
csv_data <- getURL("https://projects.fivethirtyeight.com/nba-model/nba_elo_latest.csv")
df <- read.csv(text = csv_data)
glimpse(df)
## Rows: 1,230
## Columns: 27
## $ date <chr> "2022-10-18", "2022-10-18", "2022-10-19", "2022-10-19",…
## $ season <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
## $ neutral <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ playoff <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ team1 <chr> "BOS", "GSW", "IND", "DET", "MEM", "MIA", "ATL", "TOR",…
## $ team2 <chr> "PHI", "LAL", "WAS", "ORL", "NYK", "CHI", "HOU", "CLE",…
## $ elo1_pre <dbl> 1657.640, 1660.620, 1399.202, 1393.525, 1605.025, 1617.…
## $ elo2_pre <dbl> 1582.247, 1442.352, 1440.077, 1366.089, 1520.387, 1447.…
## $ elo_prob1 <dbl> 0.7329497, 0.8620114, 0.5842751, 0.6755904, 0.7432364, …
## $ elo_prob2 <dbl> 0.2670503, 0.1379886, 0.4157249, 0.3244096, 0.2567636, …
## $ elo1_post <dbl> 1662.199, 1663.449, 1388.883, 1397.249, 1607.526, 1598.…
## $ elo2_post <dbl> 1577.688, 1439.523, 1450.396, 1362.366, 1517.886, 1466.…
## $ carm.elo1_pre <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo2_pre <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo_prob1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo_prob2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo1_post <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ carm.elo2_post <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ raptor1_pre <dbl> 1693.243, 1615.718, 1462.353, 1308.970, 1612.012, 1649.…
## $ raptor2_pre <dbl> 1641.877, 1472.174, 1472.018, 1349.865, 1549.909, 1494.…
## $ raptor_prob1 <dbl> 0.6706123, 0.7765022, 0.5995097, 0.5632696, 0.6918508, …
## $ raptor_prob2 <dbl> 0.32938773, 0.22349780, 0.40049033, 0.43673038, 0.30814…
## $ score1 <int> 126, 123, 107, 113, 115, 108, 117, 108, 108, 115, 102, …
## $ score2 <int> 117, 109, 114, 109, 112, 116, 107, 105, 130, 108, 129, …
## $ quality <int> 96, 67, 37, 3, 80, 76, 24, 86, 80, 34, 37, 79, 92, 42, …
## $ importance <int> 13, 20, 28, 1, 25, 19, 1, 40, 44, 4, 34, 32, 19, 49, 28…
## $ total_rating <int> 55, 44, 33, 2, 53, 48, 13, 63, 62, 19, 36, 56, 56, 46, …
To limit the scope of the data, the following cell filters
df to include only predictions for games in which the New
York Knicks are playing:
df <- df %>%
  filter(team1 == 'NYK' | team2 == 'NYK')
nrow(df)
## [1] 82
We see now that the data has been limited to the 82 regular season games that the Knicks will play. Next, the cell below takes the data cleaning one step further by transforming the dataframe to include only stats pertaining to the Knicks (as opposed to their opponents):
df <- df %>%
  transmute(
    opponent = ifelse(team1 == 'NYK', team2, team1),
    win_percentage = ifelse(team1 == 'NYK', elo_prob1, elo_prob2),
    rating_pre = ifelse(team1 == 'NYK', elo1_pre, elo2_pre),
    #rating_post = ifelse(team1 == 'NYK', elo1_post, elo2_post),
    #points_scored = ifelse(team1 == 'NYK', score1, score2),
    quality = quality,
    importantance = importance
  )
glimpse(df)
## Rows: 82
## Columns: 5
## $ opponent <chr> "MEM", "DET", "ORL", "CHO", "MIL", "CLE", "ATL", "PHI",…
## $ win_percentage <dbl> 0.2567636, 0.7807578, 0.8241579, 0.6008730, 0.2777317, …
## $ rating_pre <dbl> 1520.387, 1517.886, 1524.826, 1528.374, 1532.596, 1532.…
## $ quality <int> 80, 21, 19, 36, 77, 69, 77, 80, 88, 78, 76, 32, 32, 67,…
## $ importantance <int> 25, 8, 7, 44, 38, 67, 52, 61, 38, 50, 74, 12, 12, 67, 2…
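The transformation above relies on ifelse() being vectorised: for each row, it takes the value from whichever side of the matchup is the Knicks. A minimal sketch of that pattern in base R follows. The games dataframe and its values are made up for illustration and are not rows from the real data.

```r
# Toy schedule (made-up rows, NOT the real data)
games <- data.frame(
  team1 = c("NYK", "BOS", "NYK"),
  team2 = c("MIA", "NYK", "CHI"),
  elo_prob1 = c(0.60, 0.70, 0.55),
  elo_prob2 = c(0.40, 0.30, 0.45),
  stringsAsFactors = FALSE
)

# TRUE where the Knicks are listed as team1, FALSE where they are team2
nyk_is_team1 <- games$team1 == "NYK"

# ifelse() picks element-wise: second argument where TRUE, third where FALSE
opponent <- ifelse(nyk_is_team1, games$team2, games$team1)
win_percentage <- ifelse(nyk_is_team1, games$elo_prob1, games$elo_prob2)

data.frame(opponent, win_percentage)
```

In the second toy row the Knicks are team2, so the opponent is BOS and the Knicks’ win probability is taken from elo_prob2.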
The data in its final format includes only the following columns:
opponent - Who the Knicks are playing against.
win_percentage - The predicted probability that the Knicks will win.
rating_pre - The predicted pre-game Elo rating of the Knicks.
quality - A relative score of how “fun the game will be to watch”, scored from 0-100.
importance (stored as importantance in df) - A relative score of “how important will this game be in getting either team to the post-season”, scored from 0-100.

This section goes through an example of how one might use R
functions to answer questions about the data, and how those functions
can be enhanced using the Tidyverse purrr library. The goal
in this case: how does one quickly calculate the mean of each
column?
The code chunk below shows how we might carry out this task using typical R functions:
calc_means <- function(df_input) {
  # Preallocate one slot per column
  means <- vector("double", length(df_input))
  for (i in seq_along(df_input)) {
    # mean() returns NA (with a warning) for non-numeric columns
    means[i] <- mean(df_input[[i]])
  }
  # Package the means as a one-row dataframe with the original column names
  results <- data.frame(matrix(ncol = ncol(df_input), nrow = 1))
  colnames(results) <- colnames(df_input)
  results[1, ] <- means
  results
}
calc_means(df)
## opponent win_percentage rating_pre quality importantance
## 1 NA 0.5330998 1532.122 67.0122 60.81707
The code above defines a function, calc_means, that
calculates the mean of each column using a for loop and returns the
results as a one-row dataframe. The opponent column shows NA because
mean() returns NA (with a warning) for a character vector.
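The same loop can also be written with base R’s own functional tools. As a minimal sketch, assuming a small made-up dataframe named toy, vapply() applies a function to every column and checks that each result has the declared type and length:

```r
# Toy numeric dataframe (made-up values)
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# vapply() applies mean() to each column and verifies that every result
# is a single double, so no explicit loop or preallocation is needed
col_means <- vapply(toy, mean, FUN.VALUE = numeric(1))
col_means
```

The result is a named double vector with one entry per column, rather than the one-row dataframe that calc_means builds.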
purrr way
Using the purrr package, running a function over every
column in a dataframe becomes very easy via the package’s
map functions. There is a typed variant of map
for each of several R datatypes (e.g. map_int for integers),
and each one returns a vector of the datatype
specified. In this case, we will use map_dbl, since
calculating the mean of each column returns a vector of
doubles. This is done in the cell below:
map_dbl(df, mean)
## opponent win_percentage rating_pre quality importantance
## NA 0.5330998 1532.1218598 67.0121951 60.8170732
As can be seen in the code above, finding and reporting the mean of
each column takes only one line of code with the
purrr package. The function we are mapping over the columns
is mean, and similar descriptive statistics
functions (such as sd and median) can be used
in the same way:
map_dbl(df, median)
## opponent win_percentage rating_pre quality importantance
## NA 0.562258 1532.596404 76.000000 66.000000
map_dbl(df, sd)
## opponent win_percentage rating_pre quality importantance
## NA 0.1720806 2.2911732 19.2510984 23.1767913
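The NA under opponent in the outputs above comes from applying a numeric summary to a character column. The sketch below, which uses a small made-up dataframe named toy rather than the real data, shows a few of the typed map variants side by side and one way to skip non-numeric columns with purrr’s keep():

```r
library(purrr)

# Toy dataframe (made-up values) with one character and two numeric columns
toy <- data.frame(
  opponent = c("MEM", "DET", "ORL"),
  quality = c(80L, 21L, 19L),
  win_percentage = c(0.25, 0.75, 0.80),
  stringsAsFactors = FALSE
)

# Each typed variant returns an atomic vector of the named type
col_types <- map_chr(toy, typeof)       # character vector
col_is_num <- map_lgl(toy, is.numeric)  # logical vector
col_lengths <- map_int(toy, length)     # integer vector

# keep() retains only the columns for which is.numeric() is TRUE,
# so mean() never sees the character column
numeric_means <- map_dbl(keep(toy, is.numeric), mean)
numeric_means
```

Filtering first means the result contains only the numeric columns, with no NA entry and no warning.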
Calling the calc_means function we defined earlier
also takes only one line now that its code has been written,
but in the single timing below, the map_dbl call turns out to
run faster:
start_time_base <- Sys.time()
calc_means(df)
end_time_base <- Sys.time()
start_time_purrr <- Sys.time()
map_dbl(df, mean)
end_time_purrr <- Sys.time()
time_base <- end_time_base - start_time_base
time_purrr <- end_time_purrr - start_time_purrr
time_base - time_purrr
## Time difference of 0.001914978 secs
As can be seen in the output above, the purrr map
function finished about 0.002 seconds faster than the
user-defined calc_means function in this run. Of course, this is
an almost immaterial time saving for this example, and a single timing
this short is noisy, but the difference could matter
should the size of the data or number of function calls increase.
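Because one Sys.time() difference of a few milliseconds can be dominated by noise, a steadier comparison repeats each call many times. The following is a minimal sketch under stated assumptions: toy is a made-up numeric dataframe, loop_means is a simplified stand-in for calc_means that returns a plain vector, and n_reps is an arbitrary repetition count.

```r
library(purrr)

# Toy numeric dataframe (random values), with 82 rows like the Knicks data
set.seed(1)
toy <- data.frame(x = runif(82), y = runif(82), z = runif(82))

# Loop-based version, simplified from calc_means to return a plain vector
loop_means <- function(df_input) {
  means <- vector("double", length(df_input))
  for (i in seq_along(df_input)) {
    means[i] <- mean(df_input[[i]])
  }
  means
}

# Repeat each call many times so the total is well above timer resolution
n_reps <- 1000
time_loop <- system.time(for (k in seq_len(n_reps)) loop_means(toy))[["elapsed"]]
time_purrr <- system.time(for (k in seq_len(n_reps)) map_dbl(toy, mean))[["elapsed"]]

c(loop = time_loop, purrr = time_purrr)
```

Which approach comes out ahead, and by how much, may vary with the machine, the package versions, and the shape of the data, so the numbers are best read as a rough guide.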
The purrr library, as part of the larger Tidyverse set of
packages, provides a number of great tools to help users with
functional programming. Though this document focused solely
on the map suite of functions, be sure to check the
documentation for others that may
suit your use case.