The purpose of this vignette is to provide examples of different tidyverse functions to help with tidying and transforming data.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The example data set is FiveThirtyEight’s Elo ratings for every NBA season with available data, going back to 1946.
link <- 'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
bball_csv <- read_csv(link, show_col_types = FALSE)
dplyr verbs can be chained with a pipe operator, either %>% from magrittr or the native |> available since R 4.1, so that several steps read as one logical flow from top to bottom.
bball_csv |>
  dplyr::group_by(season) |>
  dplyr::summarize(cnt = n()) |>
  dplyr::arrange(desc(season))
## # A tibble: 77 × 2
## season cnt
## <dbl> <int>
## 1 2023 1230
## 2 2022 1323
## 3 2021 1171
## 4 2020 1143
## 5 2019 1312
## 6 2018 1312
## 7 2017 1309
## 8 2016 1316
## 9 2015 1311
## 10 2014 1319
## # … with 67 more rows
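To make the benefit of chaining concrete, here is a minimal sketch that compares a nested call with the equivalent piped call. The `games` tibble is a small hypothetical stand-in for the Elo file, not the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the Elo file
games <- tibble(
  season = c(2021, 2021, 2022, 2022, 2022),
  team1  = c("BOS", "LAL", "BOS", "MIA", "LAL")
)

# Nested form: has to be read from the inside out
nested <- arrange(summarize(group_by(games, season), cnt = n()), desc(season))

# Piped form: same steps, read from top to bottom
piped <- games |>
  group_by(season) |>
  summarize(cnt = n()) |>
  arrange(desc(season))

# Both forms should produce the same result
all.equal(nested, piped)
```

Both versions run the same three verbs; the pipe only changes the order in which they appear on the page.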
Given that the current season has only just begun in October 2022, we will look at the prior three seasons for these examples.
last_three_seasons <- bball_csv |>
  dplyr::filter(season %in% c(2020, 2021, 2022))
# alternatively, filter on a range of seasons
last_three_seasons_v2 <- bball_csv |>
  dplyr::filter(season >= 2020 & season < 2023)
dim(last_three_seasons) == dim(last_three_seasons_v2)
## [1] TRUE TRUE
The filter function offers several ways to slice the same data, which gives the end user considerable flexibility. The comparison above confirms that both approaches return a data frame with the same dimensions.
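As a further illustration, the sketch below shows three equivalent filters on a small hypothetical `games` tibble (the values are illustrative only). It also uses `identical()` for the comparison, which returns a single TRUE or FALSE rather than one value per dimension.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one row per season
games <- tibble(
  season = c(2018, 2019, 2020, 2021, 2022, 2023),
  cnt    = c(1312, 1312, 1143, 1171, 1323, 1230)
)

# Three ways to keep seasons 2020 through 2022
by_set     <- games |> filter(season %in% c(2020, 2021, 2022))
by_range   <- games |> filter(season >= 2020, season < 2023)
by_between <- games |> filter(between(season, 2020, 2022))

# identical() compares the whole object and returns one logical value
identical(by_set, by_range)
identical(by_set, by_between)
```

Separating conditions with a comma inside `filter()` combines them with a logical AND, so it behaves like `&` here.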
Like dplyr’s filter, the select function provides several ways to accomplish the same task, depending on user preference. It keeps a custom subset of columns from a data frame.
last_three_seasons |> select(date,season,team1,team2,elo1_pre,elo2_pre,elo1_post,elo2_post,score1,score2)
## # A tibble: 3,637 × 10
## date season team1 team2 elo1_pre elo2_pre elo1_…¹ elo2_…² score1 score2
## <date> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-10-22 2020 TOR NOP 1673. 1415. 1675. 1414. 130 122
## 2 2019-10-22 2020 LAC LAL 1517. 1473. 1522. 1467. 112 102
## 3 2019-10-23 2020 CHO CHI 1497. 1350. 1499. 1349. 126 125
## 4 2019-10-23 2020 ORL CLE 1543. 1350. 1546. 1347. 94 85
## 5 2019-10-23 2020 IND DET 1510. 1476. 1495. 1491. 110 119
## 6 2019-10-23 2020 MIA MEM 1499. 1459. 1508. 1450. 120 101
## 7 2019-10-23 2020 PHI BOS 1582. 1578. 1591. 1569. 107 93
## 8 2019-10-23 2020 BRK MIN 1495. 1465. 1489. 1472. 126 127
## 9 2019-10-23 2020 SAS NYK 1554. 1319. 1556. 1317. 120 111
## 10 2019-10-23 2020 DAL WAS 1462. 1435. 1468. 1429. 108 100
## # … with 3,627 more rows, and abbreviated variable names ¹​elo1_post, ²​elo2_post
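The column list above can also be expressed with tidyselect helpers. The sketch below is a minimal example on a one-row hypothetical tibble; the `extra` column is invented purely to have something to drop, and the numbers are rounded illustrative values.

```r
library(dplyr)
library(tibble)

# Hypothetical one-row tibble with the same column names as the example
games <- tibble(
  date = as.Date("2019-10-22"), season = 2020,
  team1 = "TOR", team2 = "NOP",
  elo1_pre = 1673, elo2_pre = 1415,
  elo1_post = 1675, elo2_post = 1414,
  score1 = 130, score2 = 122,
  extra = NA  # hypothetical column, present only so it can be dropped
)

# Explicit column names
by_name <- games |>
  select(date, season, team1, team2, elo1_pre, elo2_pre,
         elo1_post, elo2_post, score1, score2)

# Selection helpers that match on name patterns
by_helper <- games |>
  select(date, season, starts_with("team"),
         matches("^elo[12]_(pre|post)$"), starts_with("score"))

# Negative selection: keep everything except the unwanted column
by_drop <- games |> select(-extra)

identical(names(by_name), names(by_helper))
```

Helpers keep the call short when column names follow a pattern, while explicit names are easier to audit; which to use is a matter of preference.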
One question we can try to answer is which teams outperformed their Elo ratings and increased their total Elo over the season. With group_by and summarize, we choose the variables to group on and then calculate the minimum and maximum Elo rating for each combination of team1 and season. Note that this summary only uses games in which a team appears as team1.
sum_stats <- last_three_seasons |>
group_by(team1,season) |>
summarize(min_elo = min(elo1_pre),
max_elo = max(elo1_post),
season_diff = max_elo - min_elo) |>
arrange(-season_diff)
## `summarise()` has grouped output by 'team1'. You can override using the
## `.groups` argument.
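The message above appears because summarize drops only the last grouping variable by default. A minimal sketch of one way to make the grouping explicit is shown below, using a small hypothetical `games` tibble with made-up ratings; setting `.groups = "drop"` returns an ungrouped result and silences the message.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with made-up Elo values
games <- tibble(
  team1     = c("BOS", "BOS", "BOS", "LAL", "LAL"),
  season    = c(2021, 2021, 2022, 2021, 2021),
  elo1_pre  = c(1500, 1520, 1550, 1480, 1470),
  elo1_post = c(1510, 1530, 1545, 1475, 1490)
)

toy_stats <- games |>
  group_by(team1, season) |>
  summarize(
    min_elo     = min(elo1_pre),
    max_elo     = max(elo1_post),
    season_diff = max_elo - min_elo,  # later columns can reuse earlier ones
    .groups     = "drop"              # return an ungrouped tibble
  ) |>
  arrange(desc(season_diff))

is_grouped_df(toy_stats)
```

`arrange(desc(season_diff))` sorts the same way as `arrange(-season_diff)` but also works for non-numeric columns.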
We can sort the data frame to review the teams with the largest improvement, but a sorted table only shows the top values and does not give the full picture.
Starting from a base plot built with ggplot, geom_point is added as a layer that draws one point per team and season at the x and y coordinates given in the initial aesthetic mapping, here the minimum and maximum Elo rating. Aesthetics specific to the points are set inside the geom_point call: shape is mapped to the season (converted to a character so that it is treated as discrete) and size is mapped to the season's Elo difference. geom_text then labels each point with the team abbreviation.
sum_stats |>
  ggplot(mapping = aes(min_elo, max_elo)) +
  geom_point(aes(shape = as.character(season), size = season_diff)) +
  geom_text(aes(label = team1), vjust = -1.5, size = 2) +
  labs(title = 'Elo Ratings over 3 prior NBA Seasons',
       caption = '538 Data Set',
       size = 'Season Elo Difference',
       shape = 'Season')
Based on the sorted summary from the earlier chunk and this plot, it is much more apparent that teams with lower initial ratings are typically the ones with the largest in-season swings in Elo. This makes some sense: the model was bearish on these teams, and they outperformed expectations that, at their lowest point, did not yet account for the improvement. There is a clear linear trend in the data overall, and in general teams improve their ratings over the full season.
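One possible way to check the linear trend visually is to overlay a fitted line and a reference line where the maximum equals the minimum. The sketch below uses a hypothetical `toy_stats` data frame with invented ratings, not the vignette's results; a fitted slope below 1 would be consistent with lower-rated teams showing larger gains.

```r
library(ggplot2)

# Hypothetical toy summary with invented Elo values
toy_stats <- data.frame(
  team1   = c("AAA", "BBB", "CCC", "DDD"),
  min_elo = c(1350, 1450, 1550, 1650),
  max_elo = c(1480, 1540, 1620, 1700)
)

p <- ggplot(toy_stats, aes(min_elo, max_elo)) +
  geom_point() +
  # Dashed reference line: points above it have max_elo > min_elo
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  # Least-squares fit to show the overall linear trend
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The same fit as a model, to read off the slope numerically
fit <- lm(max_elo ~ min_elo, data = toy_stats)
slope <- coef(fit)[["min_elo"]]
slope
```

Printing `p` draws the plot; the distance of each point above the dashed line corresponds to that team's season difference.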