Data607: Tidyverse Create

Displaying Tidyverse Functionality

The purpose of this vignette is to provide examples of different tidyverse functions to help with tidying and transforming data.

Load Libraries

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Read in FiveThirtyEight NBA Elo data

The example data set is FiveThirtyEight’s Elo ratings for every NBA season with available data, going back to 1946.

link = 'https://projects.fivethirtyeight.com/nba-model/nba_elo.csv'
bball_csv <- read_csv(link,show_col_types=FALSE)

Reviewing season game counts for entire dataset

dplyr verbs can be chained with a pipe operator (magrittr’s %>%, which the tidyverse loads, or the native |> available in R 4.1 and later) so that several steps read as one logical flow from top to bottom.

bball_csv |>
    dplyr::group_by(season) |>
    dplyr::summarize(cnt=n()) |>
    dplyr::arrange(desc(season))

## # A tibble: 77 × 2
##    season   cnt
##     <dbl> <int>
##  1   2023  1230
##  2   2022  1323
##  3   2021  1171
##  4   2020  1143
##  5   2019  1312
##  6   2018  1312
##  7   2017  1309
##  8   2016  1316
##  9   2015  1311
## 10   2014  1319
## # … with 67 more rows
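As a minimal sketch of why the piped form is easier to follow, the same count can be written three ways. The `games` tibble below is hypothetical toy data (made-up values, not the FiveThirtyEight file), and `count()` is shown as a shorthand rather than the method used above.

```r
library(dplyr)

# Hypothetical toy data standing in for the Elo file (values are made up)
games <- tibble(
  season = c(2021, 2021, 2022, 2022, 2022),
  team1  = c("AAA", "BBB", "AAA", "CCC", "BBB")
)

# Nested form: has to be read from the inside out
nested <- arrange(summarize(group_by(games, season), cnt = n()), desc(season))

# Piped form: reads top to bottom, one step per line
piped <- games |>
  group_by(season) |>
  summarize(cnt = n()) |>
  arrange(desc(season))

# count() is a shorthand for group_by() followed by summarize(n())
counted <- games |>
  count(season, name = "cnt") |>
  arrange(desc(season))

identical(nested, piped)
all.equal(piped, counted)
```

All three objects should hold the same rows; the piped version is simply the easiest to extend with another step.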

1A: Filtering dataframes to review past 3 complete seasons

Given that the current season only began in October 2022, we will look at the prior 3 seasons for these examples.

last_three_seasons <- bball_csv |>
    dplyr::filter(season %in% c(2020,2021,2022))
#alternatively
last_three_seasons_v2 <- bball_csv |>
    dplyr::filter(season>=2020 & season<2023)

dim(last_three_seasons) == dim(last_three_seasons_v2)

## [1] TRUE TRUE

The filter function offers several ways to slice the same data, which gives the end user considerable flexibility. Here the dimensions of the two results are confirmed to be the same.
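A small sketch of a few more equivalent filters, run on hypothetical toy data (the `games` tibble and its values are made up for illustration). It also uses `identical()`, which is a stricter check than comparing dimensions, since two data frames can share a shape and still hold different rows.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
games <- tibble(
  season = c(2019, 2020, 2021, 2022, 2023),
  score1 = c(101, 98, 110, 120, 95)
)

# Membership in an explicit set of seasons
by_set <- games |> filter(season %in% c(2020, 2021, 2022))

# Range written with a logical AND
by_range <- games |> filter(season >= 2020 & season < 2023)

# Comma-separated conditions are combined with AND
by_commas <- games |> filter(season >= 2020, season < 2023)

# between() is inclusive on both ends
by_between <- games |> filter(between(season, 2020, 2022))

# Compare contents, not just dimensions
identical(by_set, by_range)
identical(by_set, by_commas)
identical(by_set, by_between)
```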

1B: Select only the columns that are useful for the specified analysis

Like dplyr’s filter, the select function offers several ways to accomplish the same task, depending on user preference. It keeps a custom subset of columns from a dataframe.

last_three_seasons |> select(date,season,team1,team2,elo1_pre,elo2_pre,elo1_post,elo2_post,score1,score2)

## # A tibble: 3,637 × 10
##    date       season team1 team2 elo1_pre elo2_pre elo1_…¹ elo2_…² score1 score2
##    <date>      <dbl> <chr> <chr>    <dbl>    <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
##  1 2019-10-22   2020 TOR   NOP      1673.    1415.   1675.   1414.    130    122
##  2 2019-10-22   2020 LAC   LAL      1517.    1473.   1522.   1467.    112    102
##  3 2019-10-23   2020 CHO   CHI      1497.    1350.   1499.   1349.    126    125
##  4 2019-10-23   2020 ORL   CLE      1543.    1350.   1546.   1347.     94     85
##  5 2019-10-23   2020 IND   DET      1510.    1476.   1495.   1491.    110    119
##  6 2019-10-23   2020 MIA   MEM      1499.    1459.   1508.   1450.    120    101
##  7 2019-10-23   2020 PHI   BOS      1582.    1578.   1591.   1569.    107     93
##  8 2019-10-23   2020 BRK   MIN      1495.    1465.   1489.   1472.    126    127
##  9 2019-10-23   2020 SAS   NYK      1554.    1319.   1556.   1317.    120    111
## 10 2019-10-23   2020 DAL   WAS      1462.    1435.   1468.   1429.    108    100
## # … with 3,627 more rows, and abbreviated variable names ¹elo1_post, ²elo2_post
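As a hedged illustration of those alternatives, the sketch below selects the same ten columns by name, with tidyselect helpers, and by dropping the unwanted column. The `games` tibble is hypothetical toy data; its `neutral` column is only a stand-in for whatever extra columns the real file contains.

```r
library(dplyr)

# Hypothetical toy data (values are made up); `neutral` stands in for
# the columns we do not want to keep
games <- tibble(
  date      = as.Date(c("2021-10-19", "2021-10-20")),
  season    = c(2022, 2022),
  neutral   = c(0, 0),
  team1     = c("AAA", "BBB"),
  team2     = c("CCC", "DDD"),
  elo1_pre  = c(1500, 1550),
  elo2_pre  = c(1480, 1520),
  elo1_post = c(1506, 1543),
  elo2_post = c(1474, 1527),
  score1    = c(110, 98),
  score2    = c(102, 104)
)

# Explicit column names, as in the chunk above
by_name <- games |>
  select(date, season, team1, team2, elo1_pre, elo2_pre,
         elo1_post, elo2_post, score1, score2)

# Helpers that match on column-name prefixes
by_helper <- games |>
  select(date, season, starts_with("team"), starts_with("elo"),
         starts_with("score"))

# Negative selection: keep everything except the named column
by_drop <- games |> select(-neutral)

identical(by_name, by_helper)
identical(by_name, by_drop)
```

On the real data a prefix helper may match more columns than intended, so it is worth checking `names()` of the result.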

1C: Aggregate data for review of comparative statistics

One question we might answer is which teams outperformed their Elo ratings and increased their Elo over the course of a season. With group_by and summarize, we choose the variables to group on and then calculate the minimum pre-game and maximum post-game Elo rating for each combination of team1 and season. Because the grouping uses team1 only, games in which a team appears as team2 are not included.

sum_stats <- last_three_seasons |>
    group_by(team1,season) |>
    summarize(min_elo = min(elo1_pre),
              max_elo = max(elo1_post),
              season_diff = max_elo - min_elo) |>
    arrange(-season_diff)

## `summarise()` has grouped output by 'team1'. You can override using the
## `.groups` argument.

Sorting the dataframe lets us review the teams with the largest improvement, but the top rows alone do not give a full picture of the league.
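The message printed by summarise() refers to the `.groups` argument. A minimal sketch of using it, on hypothetical toy data with made-up ratings: setting `.groups = "drop"` returns an ungrouped result and silences the message, and `desc()` is an equivalent way to sort in descending order.

```r
library(dplyr)

# Hypothetical toy data (team codes and ratings are made up)
games <- tibble(
  team1     = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  season    = c(2022, 2022, 2022, 2022, 2022, 2022),
  elo1_pre  = c(1500, 1510, 1525, 1600, 1590, 1580),
  elo1_post = c(1510, 1525, 1540, 1590, 1580, 1595)
)

toy_stats <- games |>
  group_by(team1, season) |>
  summarize(
    min_elo     = min(elo1_pre),
    max_elo     = max(elo1_post),
    season_diff = max_elo - min_elo,  # can refer to columns created just above
    .groups     = "drop"              # return an ungrouped tibble, no message
  ) |>
  arrange(desc(season_diff))

toy_stats
```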

2: Build a ggplot scatterplot to review the league-wide changes in these metrics

Starting from a base ggplot, geom_point is added as a layer that draws one point per row at the x and y coordinates given in the initial aesthetic mapping, which here are a team’s minimum and maximum Elo rating. Aesthetics specific to the points are set inside the geom_point call: shape is mapped to the season (converted to a character so that it is treated as categorical) and size is mapped to the numeric season_diff. geom_text then labels each point with its team abbreviation.

sum_stats %>%
    ggplot(mapping=aes(min_elo,max_elo)) +
    geom_point(aes(shape=as.character(season),size=season_diff)) +
    geom_text(aes(label=team1),vjust=-1.5, size=2) +
    labs(title='Elo Ratings over 3 prior NBA Seasons',
         caption='538 Data Set',
         size='Season Elo Difference',
         shape='Season')

Based on the sorted summary from the earlier chunk and this plot, it is much more apparent that teams with lower initial expectations are typically the ones with the largest swings in their in-season Elo ratings. This makes some sense: the model was somewhat bearish on these teams, and they then outperformed expectations that were set at their lowest point. There is a clear linear trend in the data overall. Because season_diff is a maximum minus a minimum, it is almost always positive, so it is better read as each team’s in-season range than as proof that teams in general improve over the full season.
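The layering described in this section can be sketched on a small scale. The `toy` data frame below is hypothetical (made-up team codes and ratings), and the dashed y = x reference line is an addition that is not part of the plot above; it is included only as one possible way to judge how far each point sits from "no change".

```r
library(ggplot2)

# Hypothetical toy summary (values are made up)
toy <- data.frame(
  team1   = c("AAA", "BBB", "CCC", "AAA", "BBB", "CCC"),
  season  = c(2021, 2021, 2021, 2022, 2022, 2022),
  min_elo = c(1400, 1500, 1600, 1380, 1520, 1590),
  max_elo = c(1520, 1560, 1650, 1470, 1600, 1640)
)
toy$season_diff <- toy$max_elo - toy$min_elo

p <- ggplot(toy, aes(min_elo, max_elo)) +
  # Layer 1: reference line where max_elo equals min_elo (an added assumption)
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  # Layer 2: points, with shape driven by season and size by season_diff
  geom_point(aes(shape = factor(season), size = season_diff)) +
  # Layer 3: team labels nudged above each point
  geom_text(aes(label = team1), vjust = -1.5, size = 2) +
  labs(title = "Toy Elo ranges by team and season",
       size  = "Season Elo Difference",
       shape = "Season")

p
```

Each `+` adds one layer on top of the previous ones, so layers can be removed or reordered without touching the base mapping.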