Overview: This vignette is a programming sample that uses one or more tidyverse packages, together with a dataset from fivethirtyeight.com or Kaggle, to demonstrate some of the capabilities of the selected package on that dataset.
The core tidyverse includes the following packages:
1. ggplot2
2. dplyr
3. tidyr
4. readr
5. purrr
6. tibble
7. stringr
8. forcats
I will focus on the dplyr package in this exercise. dplyr has close to 70 functions; I will cover the following basic single-table functions:
1. select(), rename() - Select/rename variables by name
2. arrange() - Arrange rows by variables
3. filter() - Return rows with matching conditions
4. pull() - Pull out a single variable
5. mutate(), transmute() - Create or transform variables
6. summarise(), summarize() - Reduce multiple values down to a single value
7. group_by(), ungroup() - Group by one or more variables
Load the tidyverse package:
library(tidyverse)
We need to read a CSV file from FiveThirtyEight.com. RAPTOR is a player rating that uses play-by-play and player-tracking data to calculate each player’s individual plus-minus and wins above replacement, which accounts for playing time. The read_csv() function reads the CSV file and converts it into a data frame (a tibble).
theUrl1 <- "https://projects.fivethirtyeight.com/nba-model/2020/latest_RAPTOR_by_team.csv"
raptor <- read_csv(file=theUrl1, na = c("", "NA"))
## Parsed with column specification:
## cols(
## .default = col_double(),
## player_name = col_character(),
## player_id = col_character(),
## season_type = col_character(),
## team = col_character()
## )
## See spec(...) for full column specifications.
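The column specification that read_csv() guessed above can also be declared explicitly through its col_types argument. The following is a minimal sketch, assuming a small temporary CSV (three rows copied from the printed output, with only a subset of the columns) so that it runs without a network connection; the names csv_path and raptor_small are made up for this illustration.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example runs offline.
# The three rows are copied from the printed raptor output; only a
# subset of the real columns is included.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "player_name,player_id,season,season_type,team,poss,mp",
  "Bam Adebayo,adebaba01,2020,RS,MIA,4573,2235",
  "Buddy Hield,hieldbu01,2020,RS,SAC,4280,2045",
  "Chris Paul,paulch01,2020,RS,OKC,4191,2003"
), csv_path)

# Declaring col_types documents the expected schema and avoids the
# column-specification message that is printed when types are guessed.
raptor_small <- read_csv(
  csv_path,
  na = c("", "NA"),
  col_types = cols(
    .default    = col_double(),
    player_name = col_character(),
    player_id   = col_character(),
    season_type = col_character(),
    team        = col_character()
  )
)

raptor_small
```

Declaring the types also means a malformed value is reported as a parsing problem for that column, rather than the column being read as a different type.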
#Table raptor is a dataframe
raptor
## # A tibble: 570 x 23
## player_name player_id season season_type team poss mp raptor_box_offe~
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Steven Ada~ adamsst01 2020 RS OKC 3280 1564 0.806
## 2 Bam Adebayo adebaba01 2020 RS MIA 4573 2235 -1.13
## 3 LaMarcus A~ aldrila01 2020 RS SAS 3648 1754 -0.697
## 4 Nickeil Al~ alexani01 2020 RS NOP 1098 501 -3.10
## 5 Grayson Al~ allengr01 2020 RS MEM 1100 498 -0.316
## 6 Jarrett Al~ allenja01 2020 RS BRK 3484 1647 -1.29
## 7 Kadeem All~ allenka01 2020 RS NYK 252 117 0.937
## 8 Al-Farouq ~ aminual01 2020 RS ORL 793 380 -4.24
## 9 Justin And~ anderju01 2020 RS BRK 38 17 -7.86
## 10 Kyle Ander~ anderky01 2020 RS MEM 2489 1140 -1.18
## # ... with 560 more rows, and 15 more variables: raptor_box_defense <dbl>,
## # raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## # raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## # raptor_defense <dbl>, raptor_total <dbl>, war_total <dbl>,
## # war_reg_season <dbl>, war_playoffs <dbl>, predator_offense <dbl>,
## # predator_defense <dbl>, predator_total <dbl>, pace_impact <dbl>
#1. select(): there are too many columns to read comfortably, so I keep only the basic columns and the raptor columns. This drops all war, predator and pace columns.
raptor1 <- raptor %>%
select(player_name:raptor_total)
raptor1
## # A tibble: 570 x 16
## player_name player_id season season_type team poss mp raptor_box_offe~
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Steven Ada~ adamsst01 2020 RS OKC 3280 1564 0.806
## 2 Bam Adebayo adebaba01 2020 RS MIA 4573 2235 -1.13
## 3 LaMarcus A~ aldrila01 2020 RS SAS 3648 1754 -0.697
## 4 Nickeil Al~ alexani01 2020 RS NOP 1098 501 -3.10
## 5 Grayson Al~ allengr01 2020 RS MEM 1100 498 -0.316
## 6 Jarrett Al~ allenja01 2020 RS BRK 3484 1647 -1.29
## 7 Kadeem All~ allenka01 2020 RS NYK 252 117 0.937
## 8 Al-Farouq ~ aminual01 2020 RS ORL 793 380 -4.24
## 9 Justin And~ anderju01 2020 RS BRK 38 17 -7.86
## 10 Kyle Ander~ anderky01 2020 RS MEM 2489 1140 -1.18
## # ... with 560 more rows, and 8 more variables: raptor_box_defense <dbl>,
## # raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## # raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## # raptor_defense <dbl>, raptor_total <dbl>
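The select() call above uses a column range. The sketch below shows two related options on a toy tibble: dropping columns by prefix with starts_with(), and rename(), which is listed in the overview but not otherwise shown. The first four columns are copied from the printed output; the war_total values and the new name minutes_played are hypothetical.

```r
library(dplyr)

# Two rows copied from the printed output; war_total is a made-up value
toy <- tibble(
  player_name        = c("Bam Adebayo", "Chris Paul"),
  team               = c("MIA", "OKC"),
  mp                 = c(2235, 2003),
  raptor_box_offense = c(-1.13, 5.50),
  war_total          = c(1, 2)  # hypothetical
)

# Keep a contiguous range of columns, as in the vignette
basic <- toy %>% select(player_name:mp)

# Drop every column whose name starts with "war"
no_war <- toy %>% select(-starts_with("war"))

# rename() uses new_name = old_name and keeps all other columns
renamed <- toy %>% rename(minutes_played = mp)

names(basic)
names(no_war)
names(renamed)
```

Selecting by prefix tends to be more robust than a range if the column order of the source file ever changes.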
#2. arrange(): sort the data frame by player_name
raptor1 <- raptor1 %>%
arrange(player_name)
raptor1
## # A tibble: 570 x 16
## player_name player_id season season_type team poss mp raptor_box_offe~
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Aaron Gord~ gordoaa01 2020 RS ORL 3971 1914 -0.467
## 2 Aaron Holi~ holidaa01 2020 RS IND 2853 1368 -0.156
## 3 Abdel Nader naderab01 2020 RS OKC 1603 756 -2.59
## 4 Adam Mokoka mokokad01 2020 RS CHI 234 112 0.0400
## 5 Admiral Sc~ schofad01 2020 RS WAS 631 293 -1.82
## 6 Al-Farouq ~ aminual01 2020 RS ORL 793 380 -4.24
## 7 Al Horford horfoal01 2020 RS PHI 3860 1848 -0.0712
## 8 Alec Burks burksal01 2020 RS PHI 464 222 0.445
## 9 Alec Burks burksal01 2020 RS GSW 2960 1390 1.51
## 10 Alen Smail~ smailal01 2020 RS GSW 304 139 -1.32
## # ... with 560 more rows, and 8 more variables: raptor_box_defense <dbl>,
## # raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## # raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## # raptor_defense <dbl>, raptor_total <dbl>
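arrange() sorts in ascending order by default. As a small sketch on four rows copied from the printed output, desc() reverses the order, and additional columns act as tie-breakers, which is useful for players such as Alec Burks who appear once per team.

```r
library(dplyr)

# Four rows copied from the printed output
toy <- tibble(
  player_name = c("Alec Burks", "Al Horford", "Alec Burks", "Adam Mokoka"),
  team        = c("PHI", "PHI", "GSW", "CHI"),
  mp          = c(222, 1848, 1390, 112)
)

# Most minutes first
by_minutes <- toy %>% arrange(desc(mp))

# Sort by name, then break ties by minutes (largest first)
by_name_then_minutes <- toy %>% arrange(player_name, desc(mp))

by_minutes
by_name_then_minutes
```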
#3. filter(): keep the players who played at least 2,000 minutes in the season (mp >= 2000)
raptor1 <- raptor1 %>%
filter(mp>=2000)
raptor1
## # A tibble: 27 x 16
## player_name player_id season season_type team poss mp raptor_box_offe~
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Bam Adebayo adebaba01 2020 RS MIA 4573 2235 -1.13
## 2 Bojan Bogd~ bogdabo02 2020 RS UTA 4316 2083 0.803
## 3 Bradley Be~ bealbr01 2020 RS WAS 4433 2053 5.38
## 4 Buddy Hield hieldbu01 2020 RS SAC 4280 2045 1.64
## 5 Chris Paul paulch01 2020 RS OKC 4191 2003 5.50
## 6 CJ McCollum mccolcj01 2020 RS POR 4712 2229 2.28
## 7 Collin Sex~ sextoco01 2020 RS CLE 4481 2143 0.949
## 8 Damian Lil~ lillada01 2020 RS POR 4551 2140 8.47
## 9 De'Andre H~ huntede01 2020 RS ATL 4371 2018 -2.94
## 10 DeMar DeRo~ derozde01 2020 RS SAS 4373 2091 2.62
## # ... with 17 more rows, and 8 more variables: raptor_box_defense <dbl>,
## # raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## # raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## # raptor_defense <dbl>, raptor_total <dbl>
#There are only 27 players
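A short sketch of a few other filter() patterns, using four rows copied from the printed output. Note that >= keeps a player with exactly 2,000 minutes, whereas > would drop that player.

```r
library(dplyr)

# Four rows copied from the printed output
toy <- tibble(
  player_name = c("Bam Adebayo", "Buddy Hield", "Chris Paul", "Abdel Nader"),
  team        = c("MIA", "SAC", "OKC", "OKC"),
  mp          = c(2235, 2045, 2003, 756)
)

# Single condition, as in the vignette
at_least_2000 <- toy %>% filter(mp >= 2000)

# Comma-separated conditions are combined with AND
okc_regulars <- toy %>% filter(team == "OKC", mp >= 2000)

# %in% matches any of several values
two_teams <- toy %>% filter(team %in% c("MIA", "SAC"))

at_least_2000
okc_regulars
two_teams
```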
#4. pull(): extract the player names of these “high usage” players as a character vector (pull(1) takes the first column, player_name)
raptor2 <- raptor1 %>%
pull(1)
raptor2
## [1] "Bam Adebayo" "Bojan Bogdanovic"
## [3] "Bradley Beal" "Buddy Hield"
## [5] "Chris Paul" "CJ McCollum"
## [7] "Collin Sexton" "Damian Lillard"
## [9] "De'Andre Hunter" "DeMar DeRozan"
## [11] "Devin Booker" "Devonte' Graham"
## [13] "Domantas Sabonis" "Donovan Mitchell"
## [15] "Harrison Barnes" "James Harden"
## [17] "Jayson Tatum" "Julius Randle"
## [19] "LeBron James" "Nikola Jokic"
## [21] "P.J. Tucker" "Rudy Gobert"
## [23] "Shai Gilgeous-Alexander" "Terry Rozier"
## [25] "Tobias Harris" "Trae Young"
## [27] "Zach LaVine"
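pull(1) selects the column by position. As a sketch on two rows copied from the printed output, pulling by name gives the same result and keeps working if the column order changes; it also shows how pull() differs from select().

```r
library(dplyr)

# Two rows copied from the printed output
toy <- tibble(
  player_name = c("Bam Adebayo", "Chris Paul"),
  team        = c("MIA", "OKC"),
  mp          = c(2235, 2003)
)

names_by_position <- toy %>% pull(1)
names_by_name     <- toy %>% pull(player_name)

# Both forms return the same character vector
identical(names_by_position, names_by_name)

# pull() returns a plain vector; select() returns a one-column tibble
is.character(names_by_name)
is.data.frame(toy %>% select(player_name))
```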
#5. mutate() and transmute(): transmute() drops the existing variables, so I will use mutate() instead. I want to create a new column holding the ratio of raptor_box_total to raptor_onoff_total
raptor1 <- raptor1 %>%
mutate(ratio=raptor_box_total/raptor_onoff_total)
raptor1
## # A tibble: 27 x 17
## player_name player_id season season_type team poss mp raptor_box_offe~
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Bam Adebayo adebaba01 2020 RS MIA 4573 2235 -1.13
## 2 Bojan Bogd~ bogdabo02 2020 RS UTA 4316 2083 0.803
## 3 Bradley Be~ bealbr01 2020 RS WAS 4433 2053 5.38
## 4 Buddy Hield hieldbu01 2020 RS SAC 4280 2045 1.64
## 5 Chris Paul paulch01 2020 RS OKC 4191 2003 5.50
## 6 CJ McCollum mccolcj01 2020 RS POR 4712 2229 2.28
## 7 Collin Sex~ sextoco01 2020 RS CLE 4481 2143 0.949
## 8 Damian Lil~ lillada01 2020 RS POR 4551 2140 8.47
## 9 De'Andre H~ huntede01 2020 RS ATL 4371 2018 -2.94
## 10 DeMar DeRo~ derozde01 2020 RS SAS 4373 2091 2.62
## # ... with 17 more rows, and 9 more variables: raptor_box_defense <dbl>,
## # raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## # raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## # raptor_defense <dbl>, raptor_total <dbl>, ratio <dbl>
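To make the difference between mutate() and transmute() concrete, here is a minimal sketch on a toy tibble. All player names and rating values in it are hypothetical. It also shows one possible way to guard the ratio against a zero denominator, which is an addition for illustration and not part of the code above.

```r
library(dplyr)

# Hypothetical players and rating values, for illustration only
toy <- tibble(
  player_name        = c("Player A", "Player B", "Player C"),
  raptor_box_total   = c(4, -2, 1.5),
  raptor_onoff_total = c(2, 0, -3)
)

# mutate() keeps every existing column and adds the new one
with_ratio <- toy %>%
  mutate(ratio = raptor_box_total / raptor_onoff_total)

# transmute() keeps only the columns named in the call
only_ratio <- toy %>%
  transmute(player_name, ratio = raptor_box_total / raptor_onoff_total)

# Dividing by zero yields -Inf for Player B; returning NA instead is
# one way to flag such rows
safe_ratio <- toy %>%
  mutate(ratio = if_else(raptor_onoff_total == 0,
                         NA_real_,
                         raptor_box_total / raptor_onoff_total))

with_ratio
only_ratio
safe_ratio
```

A ratio of two ratings that can each be close to zero or negative is hard to interpret, so the guard only removes the infinite values; it does not make the ratio itself more meaningful.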
#6 and 7. group_by() and summarise(): compute the average raptor_box_total by team. Because raptor1 was filtered in step 3, only players with at least 2,000 minutes are included.
raptor1 <- raptor1 %>%
group_by(team) %>%
summarise(mean(raptor_box_total))
raptor1
## # A tibble: 19 x 2
## team `mean(raptor_box_total)`
## <chr> <dbl>
## 1 ATL -0.452
## 2 BOS 4.30
## 3 CHA 0.0207
## 4 CHI 0.987
## 5 CLE -2.78
## 6 DEN 5.17
## 7 HOU 5.16
## 8 IND 0.863
## 9 LAL 6.44
## 10 MIA 1.05
## 11 NYK -1.78
## 12 OKC 2.18
## 13 PHI -0.665
## 14 PHO 0.332
## 15 POR 3.39
## 16 SAC -0.373
## 17 SAS 0.245
## 18 UTA 2.02
## 19 WAS 2.45
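The summary column above is labelled `mean(raptor_box_total)` because it was not given a name. A small sketch, on hypothetical values, of naming the summary columns and reporting how many rows went into each average:

```r
library(dplyr)

# Hypothetical rating values, for illustration only
toy <- tibble(
  team             = c("OKC", "OKC", "POR", "MIA"),
  raptor_box_total = c(2, 4, 3, 1)
)

team_summary <- toy %>%
  group_by(team) %>%
  summarise(
    mean_box_total = mean(raptor_box_total),  # named summary column
    n_players      = n()                      # rows behind each mean
  ) %>%
  ungroup()  # no effect after a single grouping variable, but explicit

team_summary
```

Including n() next to a mean makes it visible when an average rests on only one or two players.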
First, I’d like to admit that I’m not actually a basketball fan, so these calculations aren’t very clear to me. From what I can tell, the raptor dataset is an evaluation of individual players based on different metrics.
The code above, by Chun San Yip, ultimately summarizes the average box total by team for players who have at least 2,000 minutes of play in the 2020 season.
I’ll create a few visualizations with ggplot2, another tidyverse package, to try to understand what’s being calculated above.
The first thing I want to look at is why 2,000 minutes of play time was used above. Let’s use ggplot to see the distribution of play time across the players in the raptor dataset, and then see how many players go into each team’s overall average box total.
# plot a histogram of each player's minutes played
ggplot(raptor, aes(x = mp)) +
geom_histogram() +
theme_bw() +
labs(title = 'Histogram of Minutes Played',
x = NULL,
y = NULL)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
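The message above appears because no bin width was chosen. A minimal sketch of setting binwidth explicitly and marking the 2,000-minute cutoff follows. It uses the ten mp values printed at the top of the raptor table; the bin width of 250 and the axis labels are arbitrary choices for this illustration.

```r
library(ggplot2)

# The ten mp values from the first rows of the printed raptor table
toy <- data.frame(
  mp = c(1564, 2235, 1754, 501, 498, 1647, 117, 380, 17, 1140)
)

p <- ggplot(toy, aes(x = mp)) +
  # an explicit binwidth avoids the default of 30 bins
  geom_histogram(binwidth = 250, boundary = 0) +
  # dashed line at the 2,000-minute cutoff used in the filter step
  geom_vline(xintercept = 2000, linetype = "dashed") +
  theme_bw() +
  labs(title = "Histogram of Minutes Played",
       x = "Minutes played",
       y = "Number of players")

p
```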
It looks like the 2,000-minute cutoff keeps only a very small fraction of the players in this dataset. Let’s determine exactly how many by creating a flag for the most active players and summarizing it with dplyr:
# add a flag for players with 2000 or more minutes of playtime
raptor_aflag <- raptor %>%
mutate(active_flag = ifelse(mp >= 2000, 1, 0))
# calculate the percent of the most active players
raptor_aflag %>%
summarize(active_players = sum(active_flag),
total_players = n(),
pct_active_players = active_players / total_players)
## # A tibble: 1 x 3
## active_players total_players pct_active_players
## <dbl> <int> <dbl>
## 1 27 570 0.0474
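The same counts can be obtained without a separate flag column, because sum() of a logical vector counts the TRUE values and mean() gives their share. A sketch using the ten mp values printed at the top of the raptor table:

```r
library(dplyr)

# The ten mp values from the first rows of the printed raptor table
toy <- tibble(
  mp = c(1564, 2235, 1754, 501, 498, 1647, 117, 380, 17, 1140)
)

result <- toy %>%
  summarise(
    active_players     = sum(mp >= 2000),   # TRUE counts as 1
    total_players      = n(),
    pct_active_players = mean(mp >= 2000)   # share of TRUE values
  )

result
```

Here active_players is an integer count rather than a double, since it is a sum over logical values.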
So we’re looking at just under 5% of the rows in this dataset (27 of 570). This makes for a curious team-level average, since it seems possible that only a few players, perhaps even a single player, make up each team average calculated above. Let’s use dplyr again to see how many active players there are per team:
# show the number of active players per team
team_pcts <- raptor_aflag %>%
group_by(team) %>%
summarize(active_players = sum(active_flag),
total_players = n(),
pct_active_players = active_players / total_players) %>%
arrange(desc(pct_active_players))
head(team_pcts)
## # A tibble: 6 x 4
## team active_players total_players pct_active_players
## <chr> <dbl> <int> <dbl>
## 1 UTA 3 20 0.15
## 2 CHA 2 16 0.125
## 3 OKC 2 17 0.118
## 4 POR 2 18 0.111
## 5 HOU 2 20 0.1
## 6 ATL 2 21 0.0952
# plot how many teams have each number of active players
ggplot(team_pcts, aes(x = active_players)) +
geom_bar() +
scale_y_continuous(limits = c(0,12),
breaks = seq(0,12,1)) +
labs(title = 'Distribution of Active Players by Teams',
x = "Number of Active Players",
y = "Number of Teams") +
theme_bw()
# table version of visualization
team_pcts %>%
group_by(active_players) %>%
summarize(number_of_teams = n(),
pct_of_grp = n() / nrow(team_pcts),
min_pct_active_players = min(pct_active_players),
max_pct_active_players = max(pct_active_players))
## # A tibble: 4 x 5
## active_players number_of_teams pct_of_grp min_pct_active_pl~ max_pct_active_p~
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 0 11 0.367 0 0
## 2 1 12 0.4 0.0476 0.0588
## 3 2 6 0.2 0.0909 0.125
## 4 3 1 0.0333 0.15 0.15
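The group_by() plus n() tally above can also be written with count(). This is a minimal sketch on a toy version of team_pcts: the UTA, CHA and OKC counts are taken from the head() output, while "Team X", "Team Y" and "Team Z" are hypothetical rows added so that every group is represented.

```r
library(dplyr)

toy_team_pcts <- tibble(
  team           = c("UTA", "CHA", "OKC", "Team X", "Team Y", "Team Z"),
  active_players = c(3, 2, 2, 1, 1, 0)  # last three rows are hypothetical
)

teams_by_count <- toy_team_pcts %>%
  # count() is shorthand for group_by() followed by summarise(n = n())
  count(active_players, name = "number_of_teams") %>%
  mutate(pct_of_grp = number_of_teams / sum(number_of_teams))

teams_by_count
```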
It looks like 11 teams, over a third of the 30 teams listed here, won’t be included in the average team box totals above. Another 12 teams, or 40% of the teams in the dataset, are averaged using only one player. Even for the one team with 3 players at or above 2,000 minutes of play time, we’re only looking at 15% of that team’s players.
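One possible alternative, sketched here purely as an illustration and not as either author’s method, is to keep every player and weight each rating by minutes played, so that no team drops out of the summary. All teams and values in the toy data are hypothetical.

```r
library(dplyr)

# Hypothetical teams, minutes and ratings, for illustration only
toy <- tibble(
  team             = c("Team X", "Team X", "Team X", "Team Y", "Team Y"),
  mp               = c(2100, 900, 300, 1500, 1500),
  raptor_box_total = c(4, 0, -2, 1, 3)
)

team_compare <- toy %>%
  group_by(team) %>%
  summarise(
    n_all             = n(),
    n_active          = sum(mp >= 2000),
    # mean over players at or above the cutoff; NaN when there are none
    mean_active_only  = mean(raptor_box_total[mp >= 2000]),
    # every player contributes in proportion to minutes played
    weighted_mean_all = weighted.mean(raptor_box_total, w = mp)
  )

team_compare
```

In this toy data Team Y has no player at or above the cutoff, so its active-only mean is NaN, while the weighted mean is still defined. Whether a minutes-weighted average of this rating is meaningful for basketball is a separate question that this sketch does not answer.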
I’m still unclear as to what this metric is and why it’s important to basketball, but the calculation suggests an interest in the most active players rather than in the team as a whole. Perhaps it would be better to look at box scores by player rather than by team.