Data Acquisition and Management Week 9 Assignment

Overview: The task is to create a programming sample “vignette” that uses one or more TidyVerse packages and a dataset from fivethirtyeight.com or Kaggle to demonstrate one or more capabilities of the selected package.

The core tidyverse includes the following packages: 1. ggplot2 2. dplyr 3. tidyr 4. readr 5. purrr 6. tibble 7. stringr 8. forcats

I will focus on the dplyr package in this exercise. dplyr has close to 70 functions; I will focus on the following basic single-table functions: 1. select(), rename() - Select/rename variables by name 2. arrange() - Arrange rows by variables 3. filter() - Return rows with matching conditions 4. pull() - Pull out a single variable 5. mutate(), transmute() - Create or transform variables 6. summarise(), summarize() - Reduce multiple values down to a single value 7. group_by(), ungroup() - Group by one or more variables

Load tidyverse package

library(tidyverse)

We need to read a CSV file from FiveThirtyEight.com. RAPTOR is a rating that uses play-by-play and player-tracking data to calculate each player’s individual plus-minus measurements and wins above replacement, which accounts for playing time. The function read_csv() reads the CSV file into a data frame (a tibble).

theUrl1 <- "https://projects.fivethirtyeight.com/nba-model/2020/latest_RAPTOR_by_team.csv"

raptor <- read_csv(file=theUrl1,  na = c("", "NA"))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   player_name = col_character(),
##   player_id = col_character(),
##   season_type = col_character(),
##   team = col_character()
## )

## See spec(...) for full column specifications.
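The column-specification message above appears because read_csv() had to guess the column types. As a minimal sketch, the types can be declared up front with col_types. The small temporary file below is a hypothetical stand-in for the real RAPTOR CSV (made-up rows, and only a few of the real columns), so the example runs without a network connection.

```r
library(readr)

# Hypothetical three-row file standing in for the real RAPTOR CSV.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "player_name,player_id,season,season_type,team,mp",
  "Player A,playera01,2020,RS,AAA,1500",
  "Player B,playerb01,2020,RS,BBB,",
  "Player C,playerc01,2020,RS,AAA,2100"
), tmp)

# Declare the character columns explicitly; every other column is parsed as double.
demo <- read_csv(
  tmp,
  na = c("", "NA"),
  col_types = cols(
    .default    = col_double(),
    player_name = col_character(),
    player_id   = col_character(),
    season_type = col_character(),
    team        = col_character()
  )
)

# Inspect the full column specification that was applied.
spec(demo)
```

The empty mp field for Player B is read as NA because "" is listed in the na argument.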

#raptor is a tibble, the tidyverse version of a data frame
raptor

## # A tibble: 570 x 23
##    player_name player_id season season_type team   poss    mp raptor_box_offe~
##    <chr>       <chr>      <dbl> <chr>       <chr> <dbl> <dbl>            <dbl>
##  1 Steven Ada~ adamsst01   2020 RS          OKC    3280  1564            0.806
##  2 Bam Adebayo adebaba01   2020 RS          MIA    4573  2235           -1.13 
##  3 LaMarcus A~ aldrila01   2020 RS          SAS    3648  1754           -0.697
##  4 Nickeil Al~ alexani01   2020 RS          NOP    1098   501           -3.10 
##  5 Grayson Al~ allengr01   2020 RS          MEM    1100   498           -0.316
##  6 Jarrett Al~ allenja01   2020 RS          BRK    3484  1647           -1.29 
##  7 Kadeem All~ allenka01   2020 RS          NYK     252   117            0.937
##  8 Al-Farouq ~ aminual01   2020 RS          ORL     793   380           -4.24 
##  9 Justin And~ anderju01   2020 RS          BRK      38    17           -7.86 
## 10 Kyle Ander~ anderky01   2020 RS          MEM    2489  1140           -1.18 
## # ... with 560 more rows, and 15 more variables: raptor_box_defense <dbl>,
## #   raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## #   raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## #   raptor_defense <dbl>, raptor_total <dbl>, war_total <dbl>,
## #   war_reg_season <dbl>, war_playoffs <dbl>, predator_offense <dbl>,
## #   predator_defense <dbl>, predator_total <dbl>, pace_impact <dbl>

#1. As there are too many columns, I only want to look at the basic data and the RAPTOR columns. Selecting the range player_name:raptor_total drops all war, predator and pace columns.

raptor1 <- raptor %>%
  select(player_name:raptor_total)

raptor1

## # A tibble: 570 x 16
##    player_name player_id season season_type team   poss    mp raptor_box_offe~
##    <chr>       <chr>      <dbl> <chr>       <chr> <dbl> <dbl>            <dbl>
##  1 Steven Ada~ adamsst01   2020 RS          OKC    3280  1564            0.806
##  2 Bam Adebayo adebaba01   2020 RS          MIA    4573  2235           -1.13 
##  3 LaMarcus A~ aldrila01   2020 RS          SAS    3648  1754           -0.697
##  4 Nickeil Al~ alexani01   2020 RS          NOP    1098   501           -3.10 
##  5 Grayson Al~ allengr01   2020 RS          MEM    1100   498           -0.316
##  6 Jarrett Al~ allenja01   2020 RS          BRK    3484  1647           -1.29 
##  7 Kadeem All~ allenka01   2020 RS          NYK     252   117            0.937
##  8 Al-Farouq ~ aminual01   2020 RS          ORL     793   380           -4.24 
##  9 Justin And~ anderju01   2020 RS          BRK      38    17           -7.86 
## 10 Kyle Ander~ anderky01   2020 RS          MEM    2489  1140           -1.18 
## # ... with 560 more rows, and 8 more variables: raptor_box_defense <dbl>,
## #   raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## #   raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## #   raptor_defense <dbl>, raptor_total <dbl>
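The range player_name:raptor_total depends on the column order in the file. As a hedged alternative, select() also accepts helpers such as starts_with(), and rename() (listed earlier but not demonstrated) changes a column name without dropping anything. The toy tibble below uses made-up values; only the column names mirror the RAPTOR file.

```r
library(dplyr)

# Made-up rows; only the column names mirror the RAPTOR file.
toy <- tibble(
  player_name  = c("Player A", "Player B"),
  team         = c("AAA", "BBB"),
  mp           = c(1500, 2100),
  raptor_total = c(1.2, -0.4),
  war_total    = c(2.0, 0.5),
  pace_impact  = c(0.1, -0.2)
)

# Keep columns by name pattern rather than by position.
toy %>% select(player_name, team, mp, starts_with("raptor_"))

# Drop columns by prefix or name, then rename mp to something more readable.
slim <- toy %>%
  select(-starts_with("war_"), -pace_impact) %>%
  rename(minutes_played = mp)

names(slim)
```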

#2. Next is arrange(). I would like to sort the data frame by player_name.

#Sort rows alphabetically by player_name
raptor1 <- raptor1 %>%
  arrange(player_name)

raptor1

## # A tibble: 570 x 16
##    player_name player_id season season_type team   poss    mp raptor_box_offe~
##    <chr>       <chr>      <dbl> <chr>       <chr> <dbl> <dbl>            <dbl>
##  1 Aaron Gord~ gordoaa01   2020 RS          ORL    3971  1914          -0.467 
##  2 Aaron Holi~ holidaa01   2020 RS          IND    2853  1368          -0.156 
##  3 Abdel Nader naderab01   2020 RS          OKC    1603   756          -2.59  
##  4 Adam Mokoka mokokad01   2020 RS          CHI     234   112           0.0400
##  5 Admiral Sc~ schofad01   2020 RS          WAS     631   293          -1.82  
##  6 Al-Farouq ~ aminual01   2020 RS          ORL     793   380          -4.24  
##  7 Al Horford  horfoal01   2020 RS          PHI    3860  1848          -0.0712
##  8 Alec Burks  burksal01   2020 RS          PHI     464   222           0.445 
##  9 Alec Burks  burksal01   2020 RS          GSW    2960  1390           1.51  
## 10 Alen Smail~ smailal01   2020 RS          GSW     304   139          -1.32  
## # ... with 560 more rows, and 8 more variables: raptor_box_defense <dbl>,
## #   raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## #   raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## #   raptor_defense <dbl>, raptor_total <dbl>
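Some players appear more than once in the output above (Alec Burks has one row for PHI and one for GSW), so a second sort key can be useful. A minimal sketch with made-up rows, showing desc() for descending order and a tie-breaking second column:

```r
library(dplyr)

# Made-up rows: "Player A" appears twice, as a traded player would.
toy <- tibble(
  player_name = c("Player B", "Player A", "Player A"),
  team        = c("BBB", "CCC", "AAA"),
  mp          = c(2100, 300, 1500)
)

# Most minutes first.
by_minutes <- toy %>% arrange(desc(mp))

# Alphabetical by name; within a name, the stint with more minutes first.
by_name_then_minutes <- toy %>% arrange(player_name, desc(mp))

by_name_then_minutes
```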

#3. Next is filter(). I would like to look at players who played at least 2,000 minutes in the season.

#Keep rows where minutes played (mp) is at least 2000
raptor1 <- raptor1 %>%
  filter(mp>=2000)

raptor1

## # A tibble: 27 x 16
##    player_name player_id season season_type team   poss    mp raptor_box_offe~
##    <chr>       <chr>      <dbl> <chr>       <chr> <dbl> <dbl>            <dbl>
##  1 Bam Adebayo adebaba01   2020 RS          MIA    4573  2235           -1.13 
##  2 Bojan Bogd~ bogdabo02   2020 RS          UTA    4316  2083            0.803
##  3 Bradley Be~ bealbr01    2020 RS          WAS    4433  2053            5.38 
##  4 Buddy Hield hieldbu01   2020 RS          SAC    4280  2045            1.64 
##  5 Chris Paul  paulch01    2020 RS          OKC    4191  2003            5.50 
##  6 CJ McCollum mccolcj01   2020 RS          POR    4712  2229            2.28 
##  7 Collin Sex~ sextoco01   2020 RS          CLE    4481  2143            0.949
##  8 Damian Lil~ lillada01   2020 RS          POR    4551  2140            8.47 
##  9 De'Andre H~ huntede01   2020 RS          ATL    4371  2018           -2.94 
## 10 DeMar DeRo~ derozde01   2020 RS          SAS    4373  2091            2.62 
## # ... with 17 more rows, and 8 more variables: raptor_box_defense <dbl>,
## #   raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## #   raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## #   raptor_defense <dbl>, raptor_total <dbl>

#There are only 27 players
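The choice between >= and > matters at the boundary, and filter() can take several conditions at once. The sketch below uses made-up rows to show both; conditions separated by commas are combined with AND.

```r
library(dplyr)

# Made-up rows; Player A sits exactly on the 2000-minute boundary.
toy <- tibble(
  player_name = c("Player A", "Player B", "Player C", "Player D"),
  team        = c("AAA", "BBB", "AAA", "CCC"),
  mp          = c(2000, 2100, 1500, 2400)
)

at_least_2000 <- toy %>% filter(mp >= 2000)  # includes exactly 2000
over_2000     <- toy %>% filter(mp > 2000)   # excludes exactly 2000

# Comma-separated conditions must all hold.
aaa_regulars <- toy %>% filter(mp >= 2000, team == "AAA")

aaa_regulars
```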

#4. The next function is pull(). I would like to look at all the player names for these “high usage” players.

#Extract the player_name column as a character vector
raptor2 <- raptor1 %>%
  pull(player_name)

raptor2

##  [1] "Bam Adebayo"             "Bojan Bogdanovic"       
##  [3] "Bradley Beal"            "Buddy Hield"            
##  [5] "Chris Paul"              "CJ McCollum"            
##  [7] "Collin Sexton"           "Damian Lillard"         
##  [9] "De'Andre Hunter"         "DeMar DeRozan"          
## [11] "Devin Booker"            "Devonte' Graham"        
## [13] "Domantas Sabonis"        "Donovan Mitchell"       
## [15] "Harrison Barnes"         "James Harden"           
## [17] "Jayson Tatum"            "Julius Randle"          
## [19] "LeBron James"            "Nikola Jokic"           
## [21] "P.J. Tucker"             "Rudy Gobert"            
## [23] "Shai Gilgeous-Alexander" "Terry Rozier"           
## [25] "Tobias Harris"           "Trae Young"             
## [27] "Zach LaVine"
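pull() accepts a column name as well as a position, and negative positions count from the right. As a small sketch on made-up rows, the name argument (available in dplyr 1.0.0 and later, as far as I know) returns a named vector, which is handy for lookups:

```r
library(dplyr)

# Made-up rows for illustration.
toy <- tibble(
  player_name = c("Player A", "Player B"),
  team        = c("AAA", "BBB"),
  mp          = c(2000, 2100)
)

toy %>% pull(player_name)  # by name
toy %>% pull(-1)           # last column (mp)

# Named vector: minutes played, labelled by player.
minutes <- toy %>% pull(mp, name = player_name)
minutes[["Player B"]]
```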

#5. The next functions are mutate() and transmute(). As transmute() drops the existing variables, I will use mutate() instead. I want to create a new column holding the ratio of raptor_box_total to raptor_onoff_total.

raptor1 <- raptor1 %>%
  mutate(ratio=raptor_box_total/raptor_onoff_total)

raptor1

## # A tibble: 27 x 17
##    player_name player_id season season_type team   poss    mp raptor_box_offe~
##    <chr>       <chr>      <dbl> <chr>       <chr> <dbl> <dbl>            <dbl>
##  1 Bam Adebayo adebaba01   2020 RS          MIA    4573  2235           -1.13 
##  2 Bojan Bogd~ bogdabo02   2020 RS          UTA    4316  2083            0.803
##  3 Bradley Be~ bealbr01    2020 RS          WAS    4433  2053            5.38 
##  4 Buddy Hield hieldbu01   2020 RS          SAC    4280  2045            1.64 
##  5 Chris Paul  paulch01    2020 RS          OKC    4191  2003            5.50 
##  6 CJ McCollum mccolcj01   2020 RS          POR    4712  2229            2.28 
##  7 Collin Sex~ sextoco01   2020 RS          CLE    4481  2143            0.949
##  8 Damian Lil~ lillada01   2020 RS          POR    4551  2140            8.47 
##  9 De'Andre H~ huntede01   2020 RS          ATL    4371  2018           -2.94 
## 10 DeMar DeRo~ derozde01   2020 RS          SAS    4373  2091            2.62 
## # ... with 17 more rows, and 9 more variables: raptor_box_defense <dbl>,
## #   raptor_box_total <dbl>, raptor_onoff_offense <dbl>,
## #   raptor_onoff_defense <dbl>, raptor_onoff_total <dbl>, raptor_offense <dbl>,
## #   raptor_defense <dbl>, raptor_total <dbl>, ratio <dbl>
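To make the difference between the two functions concrete, here is a sketch on made-up values: mutate() appends the new column, while transmute() keeps only the columns it is given. The third row also shows a caveat of this ratio: when the denominator is zero (or close to it), the result is infinite (or very large).

```r
library(dplyr)

# Made-up values; the last row has a zero denominator on purpose.
toy <- tibble(
  player_name        = c("Player A", "Player B", "Player C"),
  raptor_box_total   = c(2, -1, 3),
  raptor_onoff_total = c(4, 0.5, 0)
)

# mutate() keeps every existing column and adds ratio.
with_ratio <- toy %>%
  mutate(ratio = raptor_box_total / raptor_onoff_total)

# transmute() keeps only the columns named in the call.
only_ratio <- toy %>%
  transmute(player_name, ratio = raptor_box_total / raptor_onoff_total)

with_ratio
only_ratio
```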

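#6 and 7. I would like to use group_by() and then summarise() to look at the average raptor_box_total by team. Because raptor1 now holds only the 27 players kept by the filter above, the averages cover only those players.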

raptor1 <- raptor1 %>%
  group_by(team) %>%
  summarise(mean(raptor_box_total))

raptor1

## # A tibble: 19 x 2
##    team  `mean(raptor_box_total)`
##    <chr>                    <dbl>
##  1 ATL                    -0.452 
##  2 BOS                     4.30  
##  3 CHA                     0.0207
##  4 CHI                     0.987 
##  5 CLE                    -2.78  
##  6 DEN                     5.17  
##  7 HOU                     5.16  
##  8 IND                     0.863 
##  9 LAL                     6.44  
## 10 MIA                     1.05  
## 11 NYK                    -1.78  
## 12 OKC                     2.18  
## 13 PHI                    -0.665 
## 14 PHO                     0.332 
## 15 POR                     3.39  
## 16 SAC                    -0.373 
## 17 SAS                     0.245 
## 18 UTA                     2.02  
## 19 WAS                     2.45
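The summary column above ends up with the awkward name `mean(raptor_box_total)` because it was not named in the call. A hedged sketch on made-up values shows a named summary with a player count, plus group_by() combined with mutate() and ungroup(), which was listed earlier but not demonstrated:

```r
library(dplyr)

# Made-up values for two hypothetical teams.
toy <- tibble(
  team             = c("AAA", "AAA", "BBB"),
  raptor_box_total = c(1, 3, -2)
)

# Name the summary columns and record how many players feed each average.
team_avg <- toy %>%
  group_by(team) %>%
  summarise(avg_box_total = mean(raptor_box_total),
            players = n())

# A grouped mutate keeps every row; ungroup() removes the grouping afterwards.
centered <- toy %>%
  group_by(team) %>%
  mutate(diff_from_team = raptor_box_total - mean(raptor_box_total)) %>%
  ungroup()

team_avg
centered
```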

Extend Assignment - Leo Yi

First, I’d like to admit that I’m not actually a basketball fan, so these calculations aren’t very clear to me. From what I can tell, the raptor dataset is an evaluation of individual players based on different metrics.

The code above, by Chun San Yip, seems to ultimately summarize the average box total by team for players who have at least 2,000 minutes of play in the 2020 season.

I’ll be using the tidyverse package to create a few visualizations using ggplot2 to try and understand what’s being calculated above.

2,000 Minutes of Play Time

The first thing I want to take a look at is why 2,000 minutes of play time was used above. Let’s use ggplot to see the distribution of play time across the players in the raptor dataset and see how many players go into each team’s overall average box total.

# plot a histogram of each player's minutes played
ggplot(raptor, aes(x = mp)) +
  geom_histogram() +
  theme_bw() +
  labs(title = 'Histogram of Minutes Played',
       x = NULL,
       y = NULL)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
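The message above is ggplot2 asking for an explicit bin width. As a sketch only, the version below sets binwidth and marks the 2,000-minute cutoff with a dashed line. The minutes are randomly generated stand-ins, not the real data, and the bin width of 100 is my own assumption rather than a recommended value.

```r
library(ggplot2)

# Randomly generated stand-in for the mp column (not real player data).
set.seed(1)
toy <- data.frame(mp = runif(200, min = 0, max = 2500))

p <- ggplot(toy, aes(x = mp)) +
  geom_histogram(binwidth = 100, boundary = 0) +        # 100-minute bins starting at 0
  geom_vline(xintercept = 2000, linetype = "dashed") +  # the 2,000-minute cutoff
  theme_bw() +
  labs(title = "Histogram of Minutes Played",
       x = "Minutes played",
       y = "Players")

p
```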

It looks like we’re only looking at a very small fraction of the players in this dataset. Let’s determine how many, exactly. Let’s do this by creating a flag for the most active players and showing the results, using dplyr:

# add a flag for players with 2000 or more minutes of playtime
raptor_aflag <- raptor %>%
  mutate(active_flag = ifelse(mp >= 2000, 1, 0))

# calculate the percent of the most active players
raptor_aflag %>%
  summarize(active_players = sum(active_flag),
            total_players = n(),
            pct_active_players = active_players / total_players)

## # A tibble: 1 x 3
##   active_players total_players pct_active_players
##            <dbl>         <int>              <dbl>
## 1             27           570             0.0474
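The same share can be computed without a separate flag column, because the mean of a logical vector is the proportion of TRUE values. A minimal sketch on made-up minutes:

```r
library(dplyr)

# Made-up minutes for four hypothetical players.
toy <- tibble(mp = c(2000, 2100, 1500, 300))

share <- toy %>%
  summarize(active_players = sum(mp >= 2000),       # TRUE counts as 1
            total_players = n(),
            pct_active_players = mean(mp >= 2000))  # proportion of TRUE

share
```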

It seems like we’re looking at barely 5% of the players in this dataset. This makes for a curious team-level average, as it seems possible that only a few players, perhaps even just one, make up each team average calculated above. Let’s use dplyr again to take a look at how many active players there are per team:

# show the number of active players per team
team_pcts <- raptor_aflag %>%  
  group_by(team) %>%
  summarize(active_players = sum(active_flag),
            total_players = n(),
            pct_active_players = active_players / total_players) %>%
  arrange(desc(pct_active_players))

head(team_pcts)

## # A tibble: 6 x 4
##   team  active_players total_players pct_active_players
##   <chr>          <dbl>         <int>              <dbl>
## 1 UTA                3            20             0.15  
## 2 CHA                2            16             0.125 
## 3 OKC                2            17             0.118 
## 4 POR                2            18             0.111 
## 5 HOU                2            20             0.1   
## 6 ATL                2            21             0.0952

# plot the distribution of teams by number of active players
ggplot(team_pcts, aes(x = active_players)) +
  geom_bar() +
  scale_y_continuous(limits = c(0,12),
                     breaks = seq(0,12,1)) +
  labs(title = 'Distribution of Active Players by Teams',
       x = "Number of Active Players",
       y = "Number of Teams") +
  theme_bw()

# table version of visualization
team_pcts %>%
  group_by(active_players) %>%
  summarize(number_of_teams = n(),
            pct_of_grp = n() / nrow(team_pcts),
            min_pct_active_players = min(pct_active_players),
            max_pct_active_players = max(pct_active_players))

## # A tibble: 4 x 5
##   active_players number_of_teams pct_of_grp min_pct_active_pl~ max_pct_active_p~
##            <dbl>           <int>      <dbl>              <dbl>             <dbl>
## 1              0              11     0.367              0                 0     
## 2              1              12     0.4                0.0476            0.0588
## 3              2               6     0.2                0.0909            0.125 
## 4              3               1     0.0333             0.15              0.15
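For the count and share columns of the table above, count() is a shorthand for group_by() followed by summarize(n()). A sketch on a made-up team table (the team codes and counts are hypothetical):

```r
library(dplyr)

# Made-up per-team table with the same active_players column as team_pcts.
team_pcts_toy <- tibble(
  team           = c("AAA", "BBB", "CCC", "DDD"),
  active_players = c(0, 1, 1, 2)
)

teams_by_active <- team_pcts_toy %>%
  count(active_players, name = "number_of_teams") %>%
  mutate(pct_of_grp = number_of_teams / sum(number_of_teams))

teams_by_active
```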

Results

It looks like 11 teams, or over a third of all the teams listed here, won’t be included in the average team box totals above. Also, 12 teams, or 40% of the teams in the dataset, are averaged using only one player on their team. Even with 3 players with at least 2,000 minutes of play time on one team, we’re only looking at 15% of that team’s players.

I’m still unclear as to what this metric is and why it’s important to basketball, but it suggests interest in the most active players rather than the team as a whole. Perhaps it would be better to look at box scores by player rather than by team.
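A player-level view could look something like the sketch below: keep the high-minute players and rank them by raptor_box_total instead of averaging by team. The rows are made up, and the cutoff of two rows in head() is arbitrary.

```r
library(dplyr)

# Made-up rows; Player C has the best rating but too few minutes to qualify.
toy <- tibble(
  player_name      = c("Player A", "Player B", "Player C", "Player D"),
  team             = c("AAA", "BBB", "AAA", "CCC"),
  mp               = c(2100, 2300, 400, 2050),
  raptor_box_total = c(1.5, 4.0, 6.0, -0.5)
)

top_players <- toy %>%
  filter(mp >= 2000) %>%                 # same threshold as above
  select(player_name, team, mp, raptor_box_total) %>%
  arrange(desc(raptor_box_total)) %>%    # best rating first
  head(2)

top_players
```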

Data Acquisition and Management Week 9 Assignment - TidyVerse

Chun San Yip

2020/03/29
