Introduction


This capstone project will utilize historical NBA statistics from 1937 to 2012 to predict All-NBA Teams. Data used can be found here. The github repository for this project can also be found here.

Loading the data sets


# Store data as a tbl
players <- as_tibble(read.csv("basketball_players.csv"))
awards <- as_tibble(read.csv("basketball_awards_players.csv"))
allstar <- as_tibble(read.csv("basketball_player_allstar.csv"))
master <- as_tibble(read.csv("basketball_master.csv"))

# Check the players data
players
## # A tibble: 23,751 x 42
##    playerID  year stint tmID  lgID     GP    GS minutes points oRebounds
##    <fct>    <int> <int> <fct> <fct> <int> <int>   <int>  <int>     <int>
##  1 abramjo~  1946     1 PIT   NBA      47     0       0    527         0
##  2 aubucch~  1946     1 DTF   NBA      30     0       0     65         0
##  3 bakerno~  1946     1 CHS   NBA       4     0       0      0         0
##  4 baltihe~  1946     1 STB   NBA      58     0       0    138         0
##  5 barrjo01  1946     1 STB   NBA      58     0       0    295         0
##  6 baumhfr~  1946     1 CLR   NBA      45     0       0    631         0
##  7 beckemo~  1946     1 PIT   NBA      17     0       0    108         0
##  8 beckemo~  1946     2 BOS   NBA       6     0       0     13         0
##  9 beckemo~  1946     3 DTF   NBA      20     0       0     41         0
## 10 beendha~  1946     1 PRO   NBA      58     0       0    713         0
## # ... with 23,741 more rows, and 32 more variables: dRebounds <int>,
## #   rebounds <int>, assists <int>, steals <int>, blocks <int>,
## #   turnovers <int>, PF <int>, fgAttempted <int>, fgMade <int>,
## #   ftAttempted <int>, ftMade <int>, threeAttempted <int>,
## #   threeMade <int>, PostGP <int>, PostGS <int>, PostMinutes <int>,
## #   PostPoints <int>, PostoRebounds <int>, PostdRebounds <int>,
## #   PostRebounds <int>, PostAssists <int>, PostSteals <int>,
## #   PostBlocks <int>, PostTurnovers <int>, PostPF <int>,
## #   PostfgAttempted <int>, PostfgMade <int>, PostftAttempted <int>,
## #   PostftMade <int>, PostthreeAttempted <int>, PostthreeMade <int>,
## #   note <fct>

Player data already appears to be tidy. As expected, nearly all of our variables of interest are numeric, and it looks like several variables (e.g. oRebounds, dRebounds, threeAttempted, threeMade) were not tracked in the early seasons.

Data Cleaning


First, the setequal() and distinct() functions are used to check for duplicate observations (just in case).

# Check for duplicate observations
setequal(master, distinct(master))
## TRUE
setequal(players, distinct(players))
## TRUE
setequal(awards, distinct(awards))
## TRUE
setequal(allstar, distinct(allstar))
## TRUE

The players Data Set

We’ll be working with players first. The data set currently contains data from other leagues (e.g., ABA), as well as data from seasons before the first All-Star Game (pre-1950). We can start by removing these extra observations. Post-season stats, players with no games played, and miscellaneous features are also removed (Note that lgID is also removed after filtering).

# Remove extraneous rows and columns 
players <- players %>% 
  filter(lgID == "NBA", year >= 1950, GP != 0) %>% 
  select(-contains("Post"), -c(note, stint, lgID, GS))

players
## # A tibble: 19,458 x 20
##    playerID  year tmID     GP minutes points oRebounds dRebounds rebounds
##    <chr>    <int> <fct> <int>   <int>  <int>     <int>     <int>    <int>
##  1 arizipa~  1950 PHW      65       0   1121         0         0      640
##  2 armstcu~  1950 FTW      38       0    202         0         0       89
##  3 barkecl~  1950 INO      56       0    152         0         0      100
##  4 barnhle~  1950 INO      68       0    546         0         0      296
##  5 barteed~  1950 WSC      17       0     72         0         0       84
##  6 beardra~  1950 INO      66       0   1111         0         0      251
##  7 blevile~  1950 INO       3       0      2         0         0        2
##  8 bobbne01  1950 PHW      53       0    148         0         0      101
##  9 borsaik~  1950 PHW      24       0     64         0         0       24
## 10 borylvi~  1950 NYK      66       0    982         0         0      249
## # ... with 19,448 more rows, and 11 more variables: assists <int>,
## #   steals <int>, blocks <int>, turnovers <int>, PF <int>,
## #   fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## #   threeAttempted <int>, threeMade <int>

Now that the data set has been trimmed a bit, we can use the summary() function to view some general summary statistics.

##       year            GP           minutes         points    
##  Min.   :1950   Min.   : 1.00   Min.   :   0   Min.   :   0  
##  1st Qu.:1978   1st Qu.:28.00   1st Qu.: 332   1st Qu.: 109  
##  Median :1992   Median :59.00   Median :1083   Median : 384  
##  Mean   :1989   Mean   :51.43   Mean   :1235   Mean   : 531  
##  3rd Qu.:2002   3rd Qu.:76.00   3rd Qu.:2026   3rd Qu.: 813  
##  Max.   :2011   Max.   :82.00   Max.   :3882   Max.   :4029  
##    oRebounds        dRebounds         rebounds       assists      
##  Min.   :  0.00   Min.   :   0.0   Min.   :   0   Min.   :   0.0  
##  1st Qu.:  3.00   1st Qu.:   7.0   1st Qu.:  52   1st Qu.:  19.0  
##  Median : 28.00   Median :  75.0   Median : 167   Median :  71.0  
##  Mean   : 55.02   Mean   : 125.6   Mean   : 233   Mean   : 119.5  
##  3rd Qu.: 83.00   3rd Qu.: 190.0   3rd Qu.: 333   3rd Qu.: 169.0  
##  Max.   :587.00   Max.   :1111.0   Max.   :2149   Max.   :1164.0  
##      steals           blocks         turnovers            PF       
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:  2.00   1st Qu.:  0.00   1st Qu.:  1.00   1st Qu.: 41.0  
##  Median : 21.00   Median :  7.00   Median : 35.00   Median :115.0  
##  Mean   : 34.65   Mean   : 21.34   Mean   : 60.27   Mean   :120.9  
##  3rd Qu.: 55.00   3rd Qu.: 25.00   3rd Qu.: 99.00   3rd Qu.:190.0  
##  Max.   :301.00   Max.   :456.00   Max.   :366.00   Max.   :386.0  
##   fgAttempted         fgMade        ftAttempted         ftMade     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 100.0   1st Qu.:  42.0   1st Qu.:  28.0   1st Qu.: 19.0  
##  Median : 334.0   Median : 148.0   Median :  95.0   Median : 68.0  
##  Mean   : 446.9   Mean   : 204.1   Mean   : 144.3   Mean   :108.2  
##  3rd Qu.: 691.0   3rd Qu.: 314.0   3rd Qu.: 213.0   3rd Qu.:159.0  
##  Max.   :3159.0   Max.   :1597.0   Max.   :1363.0   Max.   :840.0  
##  threeAttempted     threeMade     
##  Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :  2.00   Median :  0.00  
##  Mean   : 42.19   Mean   : 14.67  
##  3rd Qu.: 34.00   3rd Qu.: 10.00  
##  Max.   :678.00   Max.   :269.00

Unfortunately, it looks like there are still several statistics that were not recorded since the inception of the All-Star game (which plenty of NBA historians would love to remind us of). We’re concerned specifically with rebounds and 3-pointers. A bit of research reveals that the NBA didn’t begin recording offensive and defensive rebounds until the 1973-1974 season, and the 3-pointer wasn’t adopted until 1979. This means that we should only keep data from 1979 and later, but let’s confirm that with which.min() just in case:

# Group by year and sort by maxima
check <- players %>% 
  select(year, oRebounds, dRebounds, threeAttempted) %>% 
  group_by(year) %>% 
  summarize_if(is.numeric, max)

# Store the indices for the earliest year that each stat maxed out at 0
keep <- c(
  which.min(check$oRebounds == 0), 
  which.min(check$dRebounds == 0), 
  which.min(check$threeAttempted == 0))

# What's the earliest year we can keep?
check$year[max(keep)]
## [1] 1979

To prevent bias we need to retain data only for years in which every statistic is available, so we should get rid of observations before the 1979-1980 season.

players <- players %>% 
  filter(year >= 1979)

players
## # A tibble: 14,577 x 20
##    playerID  year tmID     GP minutes points oRebounds dRebounds rebounds
##    <chr>    <int> <fct> <int>   <int>  <int>     <int>     <int>    <int>
##  1 abdulka~  1979 LAL      82    3143   2034       190       696      886
##  2 abernto~  1979 GSW      67    1222    362        62       129      191
##  3 adamsal~  1979 PHO      75    2168   1118       158       451      609
##  4 archina~  1979 BOS      80    2864   1131        59       138      197
##  5 awtrede~  1979 CHI      26     560     86        29        86      115
##  6 bailegu~  1979 WSB      20     180     38         6        22       28
##  7 baileja~  1979 SEA      67     726    312        71       126      197
##  8 ballagr~  1979 WSB      82    2438   1277       240       398      638
##  9 bantomi~  1979 IND      77    2330    908       192       264      456
## 10 barnema~  1979 SDC      20     287     64        34        43       77
## # ... with 14,567 more rows, and 11 more variables: assists <int>,
## #   steals <int>, blocks <int>, turnovers <int>, PF <int>,
## #   fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## #   threeAttempted <int>, threeMade <int>

Look’s like we’ve shaved the players data set down from 23,751 x 42 to 14,577 x 20!

The master Data Set

master
## # A tibble: 5,061 x 26
##    bioID useFirst firstName middleName lastName nameGiven fullGivenName
##    <fct> <fct>    <fct>     <fct>      <fct>    <fct>     <fct>        
##  1 abde~ Alaa     Alaa      ""         Abdelna~ ""        ""           
##  2 abdu~ Kareem   Kareem    ""         Abdul-J~ ""        Ferdinand Le~
##  3 abdu~ Mahdi    Mahdi     ""         Abdul-R~ ""        Walter Rapha~
##  4 abdu~ Mahmoud  Mahmoud   ""         Abdul-R~ ""        Chris Wayne ~
##  5 abdu~ Tariq    Tariq     ""         Abdul-W~ ""        Olivier Mich~
##  6 abdu~ Zaid     Zaid      ""         Abdul-A~ ""        Donald A. Sm~
##  7 able~ Forest   Forest    Edward     Able     ""        ""           
##  8 abra~ John     John      ""         Abramov~ ""        ""           
##  9 acke~ Alex     Alex      ""         Acker    ""        ""           
## 10 acre~ Mark     Mark      Richard    Acres    ""        ""           
## # ... with 5,051 more rows, and 19 more variables: nameSuffix <fct>,
## #   nameNick <fct>, pos <fct>, firstseason <int>, lastseason <int>,
## #   height <dbl>, weight <int>, college <fct>, collegeOther <fct>,
## #   birthDate <fct>, birthCity <fct>, birthState <fct>,
## #   birthCountry <fct>, highSchool <fct>, hsCity <fct>, hsState <fct>,
## #   hsCountry <fct>, deathDate <fct>, race <fct>

This data set is the master reference for first and last name data and, more importantly, the associated player ID. We’ll use this data set to cross-check any unexpected behavior down the road. It also contains additional biographical data, such as height, weight, and college attended. The only relevant variable for us (other than player ID) is the position played, so we’ll omit everything else.

# Select relevant variables, rename, and create indicators
pos <- master %>% 
  select(bioID, pos) %>% 
  rename("playerID" = "bioID") %>% 
  mutate(
    center = case_when(
      grepl("C", pos) ~ as.integer(1),
      TRUE ~ as.integer(0)),
    
    forward = case_when(
      grepl("F", pos) ~ as.integer(1),
      TRUE ~ as.integer(0)),
    
    guard = case_when(
      grepl("G", pos) ~ as.integer(1),
      TRUE ~ as.integer(0))
  ) %>% 
  select(-pos)

pos$playerID <- as.character(pos$playerID) # Initialize as a character vector
pos
## # A tibble: 5,061 x 4
##    playerID  center forward guard
##    <chr>      <int>   <int> <int>
##  1 abdelal01      1       1     0
##  2 abdulka01      1       0     0
##  3 abdulma01      0       0     1
##  4 abdulma02      0       0     1
##  5 abdulta01      0       1     1
##  6 abdulza01      1       1     0
##  7 ablefo01       0       0     1
##  8 abramjo01      0       1     0
##  9 ackeral01      0       0     1
## 10 acresma01      1       1     0
## # ... with 5,051 more rows

The awards Data Set

Now we can make similar adjustments to the data sets for end-of-season awards and All-Star games, which are much smaller than the players data set:

awards
## # A tibble: 1,719 x 6
##    playerID  award                year lgID  note  pos  
##    <chr>     <fct>               <int> <fct> <fct> <fct>
##  1 feeribo01 All-NBA First Team   1946 NBA   ""    ""   
##  2 fulksjo01 All-NBA First Team   1946 NBA   ""    ""   
##  3 mckinho01 All-NBA First Team   1946 NBA   ""    ""   
##  4 miasest01 All-NBA First Team   1946 NBA   ""    ""   
##  5 zasloma01 All-NBA First Team   1946 NBA   ""    ""   
##  6 baumhfr01 All-NBA Second Team  1946 NBA   ""    ""   
##  7 calveer01 All-NBA Second Team  1946 NBA   ""    ""   
##  8 halbech01 All-NBA Second Team  1946 NBA   ""    ""   
##  9 loganjo01 All-NBA Second Team  1946 NBA   ""    ""   
## 10 scolafr01 All-NBA Second Team  1946 NBA   ""    ""   
## # ... with 1,709 more rows

Similar to players, the awards data set has data for non-NBA leagues. It also has awards other than the All-NBA teams, such as “Most Valuable Player” and “Rookie of the Year.” We’ll only keep NBA data and relevant awards (Note that grepl() is used here, along with an appropriate regular expression).

# Filter for league and year
awards <- awards %>% 
  filter(lgID == "NBA", year >= 1979) %>% 
  select(-c(note, lgID))

# Keep relevant awards
keep <- grepl("All-NBA\\sFirst|All-NBA\\sSecond|All-D|Valuable|Defensive", awards$award)
awards <- awards[keep,]

# Check for duplicate entries
setequal(awards, distinct(awards))
## TRUE

To prevent potential bias, “All-NBA Third Team” honors are disregarded, since the Third Team has only been selected from 1988 and on. “Defensive Player of the Year” is retained as there are only 3 seasons for which data is missing (Defensive POTY has been awarded since 1982). The pos variable is also retained, as each team must consist of two guards, two forwards, and one center.

As a final step, we create indicator variables for each award. However, the awards data set currently stores multiple rows for years in which a player won multiple awards. For instance:

awards %>% filter(playerID == "abdulka01", year == 1979) %>% group_by(year)
## # A tibble: 3 x 4
## # Groups:   year [1]
##   playerID  award                     year pos  
##   <chr>     <fct>                    <int> <fct>
## 1 abdulka01 All-Defensive First Team  1979 ""   
## 2 abdulka01 All-NBA First Team        1979 C    
## 3 abdulka01 Most Valuable Player      1979 ""

To properly associate awards with the years they were awarded (as well as prevent any odd behavior when merging with the original data set), we need to combine such observations into a single entry per player, per year. This is accomplished with summarize(award = paste(award, collapse = ", ")) after grouping by playerID and year. case_when() and grepl() are then used to create the desired indicator variables.

# Create indicator variables for each award
awards <- awards %>% 
  group_by(playerID, year) %>% 
  summarize(award = paste(award, collapse = ", ")) %>% 
  mutate(
    allDefFirstTeam = case_when(
      grepl("All-Defensive\\sFirst", award) ~ as.integer(1),
      TRUE ~ as.integer(0)),
    
    allDefSecondTeam = case_when(
      grepl("All-Defensive\\sSecond", award) ~ as.integer(1),
      TRUE ~ as.integer(0)),
    
    allNBAFirstTeam = case_when(
      grepl("All-NBA\\sFirst", award) ~ as.integer(1),
      TRUE ~ as.integer(0)),
    
    allNBASecondTeam = case_when(
      grepl("All-NBA\\sSecond", award) ~ as.integer(1),
      TRUE ~ as.integer(0)),
    
    MVP = case_when(
      grepl("Most", award) ~ as.integer(1),
      TRUE ~ as.integer(0)),
    
    defPOTY = case_when(
      grepl("Defensive\\sPlayer", x = award) ~ as.integer(1),
      TRUE ~ as.integer(0)))

awards
## # A tibble: 542 x 9
## # Groups:   playerID [156]
##    playerID  year award allDefFirstTeam allDefSecondTeam allNBAFirstTeam
##    <chr>    <int> <chr>           <int>            <int>           <int>
##  1 abdulka~  1979 All-~               1                0               1
##  2 abdulka~  1980 All-~               1                0               1
##  3 abdulka~  1982 All-~               0                0               0
##  4 abdulka~  1983 All-~               0                1               1
##  5 abdulka~  1984 All-~               0                0               0
##  6 abdulka~  1985 All-~               0                0               1
##  7 allenra~  2004 All-~               0                0               0
##  8 allento~  2010 All-~               0                1               0
##  9 allento~  2011 All-~               1                0               0
## 10 anthoca~  2009 All-~               0                0               0
## # ... with 532 more rows, and 3 more variables: allNBASecondTeam <int>,
## #   MVP <int>, defPOTY <int>

The allstar Data Set

allstar
## # A tibble: 1,609 x 23
##    player_id last_name first_name season_id conference league_id
##    <chr>     <fct>     <fct>          <int> <fct>      <fct>    
##  1 abdulka01 Abdul-Ja~ Kareem          1978 West       NBA      
##  2 abdulka01 Abdul-Ja~ Kareem          1969 East       NBA      
##  3 abdulka01 Abdul-Ja~ Kareem          1988 West       NBA      
##  4 abdulka01 Abdul-Ja~ Kareem          1987 West       NBA      
##  5 abdulka01 Abdul-Ja~ Kareem          1986 West       NBA      
##  6 abdulka01 Abdul-Ja~ Kareem          1985 West       NBA      
##  7 abdulka01 Abdul-Ja~ Kareem          1984 West       NBA      
##  8 abdulka01 Abdul-Ja~ Kareem          1983 West       NBA      
##  9 abdulka01 Abdul-Ja~ Kareem          1982 West       NBA      
## 10 abdulka01 Abdul-Ja~ Kareem          1980 West       NBA      
## # ... with 1,599 more rows, and 17 more variables: games_played <int>,
## #   minutes <int>, points <int>, o_rebounds <int>, d_rebounds <int>,
## #   rebounds <int>, assists <int>, steals <int>, blocks <int>,
## #   turnovers <int>, personal_fouls <int>, fg_attempted <int>,
## #   fg_made <int>, ft_attempted <int>, ft_made <int>,
## #   three_attempted <int>, three_made <int>

We wish to use All-Star data as an additional explanatory variable. To avoid conflicts with existing player statistics, we can disregard actual game performance and simply create an indicator variable for team membership.

# Filter for league and year
allstar <- allstar %>% 
  filter(league_id == "NBA", season_id >= 1979) %>%
  select(player_id, season_id) %>%
  mutate(allstar = as.integer(1)) %>% 
  rename("playerID" = "player_id", "year" = "season_id") # Rename variables to match format of players

# Check for duplicate entries
setequal(allstar, distinct(allstar))
## FALSE: Different number of rows

Oops, looks like there are duplicate entries. After a prolonged investigation, we find that Marques Johnson was incorrectly coded as johnsma02 instead of johnsma01, causing a discrepancy for the All-Star team in 1979, which had both Marques Johnson and Magic Johnson.

## # A tibble: 5 x 23
##   player_id last_name first_name season_id conference league_id
##   <fct>     <fct>     <fct>          <int> <fct>      <fct>    
## 1 johnsma01 Johnson   Marques         1979 West       NBA      
## 2 johnsma01 Johnson   Marques         1978 West       NBA      
## 3 johnsma02 Johnson   Marques         1985 West       NBA      
## 4 johnsma02 Johnson   Marques         1982 East       NBA      
## 5 johnsma02 Johnson   Marques         1980 East       NBA      
## # ... with 17 more variables: games_played <int>, minutes <int>,
## #   points <int>, o_rebounds <int>, d_rebounds <int>, rebounds <int>,
## #   assists <int>, steals <int>, blocks <int>, turnovers <int>,
## #   personal_fouls <int>, fg_attempted <int>, fg_made <int>,
## #   ft_attempted <int>, ft_made <int>, three_attempted <int>,
## #   three_made <int>

We can correct this quirk with allstar$player_id[allstar$first_name == "Marques"] <- "johnsma01" and re-run our code.

# Check for duplicate entries
setequal(allstar, distinct(allstar))
## FALSE: Different number of rows

Much better! We should (hopefully) be fine when we merge our data.

Merging the Data


Now that we’ve cleaned up each data set a bit, we can left_join() all of our data into a single data set. left_join() takes two arguments, x and y, and augments the primary data frame (x) with the data from the secondary data frame (y). The join is performed based on a common key which should be present in both data frames (in our case, playerID and year). In the following code, we can join the data frames separately to check whether rows are added, which would indicate that the join didn’t go as expected.

# Merge player data with awards data
players %>% 
  left_join(awards, by = c("playerID", "year"))
## # A tibble: 14,577 x 27
##    playerID  year tmID     GP minutes points oRebounds dRebounds rebounds
##    <chr>    <int> <fct> <int>   <int>  <int>     <int>     <int>    <int>
##  1 abdulka~  1979 LAL      82    3143   2034       190       696      886
##  2 abernto~  1979 GSW      67    1222    362        62       129      191
##  3 adamsal~  1979 PHO      75    2168   1118       158       451      609
##  4 archina~  1979 BOS      80    2864   1131        59       138      197
##  5 awtrede~  1979 CHI      26     560     86        29        86      115
##  6 bailegu~  1979 WSB      20     180     38         6        22       28
##  7 baileja~  1979 SEA      67     726    312        71       126      197
##  8 ballagr~  1979 WSB      82    2438   1277       240       398      638
##  9 bantomi~  1979 IND      77    2330    908       192       264      456
## 10 barnema~  1979 SDC      20     287     64        34        43       77
## # ... with 14,567 more rows, and 18 more variables: assists <int>,
## #   steals <int>, blocks <int>, turnovers <int>, PF <int>,
## #   fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## #   threeAttempted <int>, threeMade <int>, award <chr>,
## #   allDefFirstTeam <int>, allDefSecondTeam <int>, allNBAFirstTeam <int>,
## #   allNBASecondTeam <int>, MVP <int>, defPOTY <int>
# Merge player data with position data
players %>% 
  left_join(pos, by = c("playerID"))
## # A tibble: 14,577 x 23
##    playerID  year tmID     GP minutes points oRebounds dRebounds rebounds
##    <chr>    <int> <fct> <int>   <int>  <int>     <int>     <int>    <int>
##  1 abdulka~  1979 LAL      82    3143   2034       190       696      886
##  2 abernto~  1979 GSW      67    1222    362        62       129      191
##  3 adamsal~  1979 PHO      75    2168   1118       158       451      609
##  4 archina~  1979 BOS      80    2864   1131        59       138      197
##  5 awtrede~  1979 CHI      26     560     86        29        86      115
##  6 bailegu~  1979 WSB      20     180     38         6        22       28
##  7 baileja~  1979 SEA      67     726    312        71       126      197
##  8 ballagr~  1979 WSB      82    2438   1277       240       398      638
##  9 bantomi~  1979 IND      77    2330    908       192       264      456
## 10 barnema~  1979 SDC      20     287     64        34        43       77
## # ... with 14,567 more rows, and 14 more variables: assists <int>,
## #   steals <int>, blocks <int>, turnovers <int>, PF <int>,
## #   fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## #   threeAttempted <int>, threeMade <int>, center <int>, forward <int>,
## #   guard <int>
# Merge player data with allstar data
players %>% 
  left_join(allstar, by = c("playerID", "year"))
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
## # A tibble: 14,580 x 21
##    playerID  year tmID     GP minutes points oRebounds dRebounds rebounds
##    <chr>    <int> <fct> <int>   <int>  <int>     <int>     <int>    <int>
##  1 abdulka~  1979 LAL      82    3143   2034       190       696      886
##  2 abernto~  1979 GSW      67    1222    362        62       129      191
##  3 adamsal~  1979 PHO      75    2168   1118       158       451      609
##  4 archina~  1979 BOS      80    2864   1131        59       138      197
##  5 awtrede~  1979 CHI      26     560     86        29        86      115
##  6 bailegu~  1979 WSB      20     180     38         6        22       28
##  7 baileja~  1979 SEA      67     726    312        71       126      197
##  8 ballagr~  1979 WSB      82    2438   1277       240       398      638
##  9 bantomi~  1979 IND      77    2330    908       192       264      456
## 10 barnema~  1979 SDC      20     287     64        34        43       77
## # ... with 14,570 more rows, and 12 more variables: assists <int>,
## #   steals <int>, blocks <int>, turnovers <int>, PF <int>,
## #   fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## #   threeAttempted <int>, threeMade <int>, allstar <int>

The awards and pos data sets seemed to merge correctly, but it looks like the number of rows changed after the allstar merge. To investigate this error, we can use the anti_join() function to compare the original allstar data set to the entries that were actually transferred over. anti_join() only returns rows from the primary data frame that are not in the secondary, while keeping the columns from the primary.

anti_join(allstar, players)
## Joining, by = c("playerID", "year")
## Warning: Column `playerID` joining factor and character vector, coercing
## into character vector
## # A tibble: 50 x 3
##    playerID   year allstar
##    <fct>     <int>   <int>
##  1 anderke02  1993       1
##  2 antonca01  2007       1
##  3 floyder01  1986       1
##  4 ginobma01  2004       1
##  5 johnske01  1993       1
##  6 johnske01  1990       1
##  7 johnske01  1989       1
##  8 johnsla01  1994       1
##  9 johnsla01  1992       1
## 10 johnsma01  1988       1
## # ... with 40 more rows

After manually cross-referencing the excluded All-Star rows and playerID’s with Basketball-Reference, Wikipedia, and the master list, we find that all of the missing rows had their playerID values incorrectly coded per the players and master data sets, which resulted in skipped entries during the join (e.g., o’neaje01, o’neasha01). We can correct these errors and try joining the data again.

# Fix coding errors for all players in the anti_join
allstar$player_id[allstar$first_name == "Marques" & allstar$last_name == "Johnson"] <- "johnsma01"
allstar$player_id[allstar$first_name == "Magic" & allstar$last_name == "Johnson"] <- "johnsma02"
allstar$player_id[allstar$first_name == "Kenny" & allstar$last_name == "Anderson"] <- "anderke01"
allstar$player_id[allstar$first_name == "Carmelo" & allstar$last_name == "Anthony"] <- "anthoca01"
allstar$player_id[allstar$first_name == "Eric" & allstar$last_name == "Floyd"] <- "floydsl01"
allstar$player_id[allstar$first_name == "Manu" & allstar$last_name == "Ginobili"] <- "ginobem01"
allstar$player_id[allstar$first_name == "Kevin" & allstar$last_name == "Johnson"] <- "johnske02"
allstar$player_id[allstar$first_name == "Larry" & allstar$last_name == "Johnson"] <- "johnsla02"
allstar$player_id[allstar$first_name == "Eddie" & allstar$last_name == "Jones"] <- "jonesed02"
allstar$player_id[allstar$first_name == "Lafayette" & allstar$last_name == "Lever"] <- "leverfa01"
allstar$player_id[allstar$first_name == "Rashard" & allstar$last_name == "Lewis"] <- "lewisra02"
allstar$player_id[allstar$first_name == "Jermaine" & allstar$last_name == "O'Neal"] <- "onealje01"
allstar$player_id[allstar$first_name == "Shaquille" & allstar$last_name == "O'Neal"] <- "onealsh01"
allstar$player_id[allstar$first_name == "Jim" & allstar$last_name == "Paxson"] <- "paxsoji02"
allstar$player_id[allstar$first_name == "Jeff" & allstar$last_name == "Rutland"] <- "rulanje01"
allstar$player_id[allstar$first_name == "Wally" & allstar$last_name == "Szczerbiak"] <- "szczewa02"
allstar$player_id[allstar$first_name == "David" & allstar$last_name == "Thompson"] <- "thompda01"
allstar$player_id[allstar$first_name == "Antoine" & allstar$last_name == "Walker"] <- "walkean02"
allstar$player_id[allstar$first_name == "Kevin" & allstar$last_name == "Willis"] <- "willike02"

# Apply previous filters, etc.
allstar <- allstar %>% 
  filter(league_id == "NBA", season_id >= 1979) %>%
  select(player_id, season_id) %>%
  mutate(allstar = as.integer(1)) %>% 
  rename("playerID" = "player_id", "year" = "season_id") 

# Retry the join and save to allstar.join
allstar.join <- players %>% 
  left_join(allstar, by = c("playerID", "year"))
allstar.join
## # A tibble: 14,577 x 21
##    playerID  year tmID     GP minutes points oRebounds dRebounds rebounds
##    <chr>    <int> <fct> <int>   <int>  <int>     <int>     <int>    <int>
##  1 abdulka~  1979 LAL      82    3143   2034       190       696      886
##  2 abernto~  1979 GSW      67    1222    362        62       129      191
##  3 adamsal~  1979 PHO      75    2168   1118       158       451      609
##  4 archina~  1979 BOS      80    2864   1131        59       138      197
##  5 awtrede~  1979 CHI      26     560     86        29        86      115
##  6 bailegu~  1979 WSB      20     180     38         6        22       28
##  7 baileja~  1979 SEA      67     726    312        71       126      197
##  8 ballagr~  1979 WSB      82    2438   1277       240       398      638
##  9 bantomi~  1979 IND      77    2330    908       192       264      456
## 10 barnema~  1979 SDC      20     287     64        34        43       77
## # ... with 14,567 more rows, and 12 more variables: assists <int>,
## #   steals <int>, blocks <int>, turnovers <int>, PF <int>,
## #   fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## #   threeAttempted <int>, threeMade <int>, allstar <int>

The resulting data frame, allstar.join, has the correct amount of 14577 observations, which confirms that there are no extranneous rows added. As a quick aside, summing up the allstar column will show a number greater than 743:

# Add all observations for which the player was an All-Star that year
sum(allstar.join$allstar, na.rm = TRUE)
## [1] 751

The could indicate that some players were incorrectly coded as All-Stars. However, after a bit of digging with the duplicated() function, we find that these additional allstar entries are an artifact of the players structure; players who were traded during the season have multiple rows for the year that they were traded, so players who were traded that ALSO happened to make the All-Star team roster will have both rows coded with allstar = 1.

# Filter All-Stars and select rows to match the allstar data set
allstar.join <- allstar.join %>% 
  select(playerID, year, allstar) %>% 
  filter(allstar == 1)

# Check rows with duplicate entries
allstar.join[duplicated(allstar.join),]
## # A tibble: 10 x 3
##    playerID   year allstar
##    <chr>     <int>   <int>
##  1 piercri01  1990       1
##  2 mannida01  1993       1
##  3 wilkido01  1993       1
##  4 gatlich01  1996       1
##  5 mutomdi01  2000       1
##  6 paytoga01  2002       1
##  7 cartevi01  2004       1
##  8 kiddja01   2007       1
##  9 billuch01  2008       1
## 10 iversal01  2008       1

The astute reader will notice that there are still 2 observations missing from the join (summing the allstar column resulted in 751 rows, but there are 10 additional rows added from traded All-Star players).

anti_join(allstar, allstar.join, by = c("playerID", "year"))
## # A tibble: 2 x 3
##   playerID   year allstar
##   <chr>     <int>   <int>
## 1 johnsma02  1991       1
## 2 thomsda01  1982       1

The line of code allstar$player_id[allstar$first_name == "David" & allstar$last_name == "Thompson"] <- "thompda01" should have corrected David Thompson’s 1982 All-Star game appearance. But, after reviewing the original allstar data (again), we find that there were two entries for the thomsda01 player ID: one for “David Thompson” and one for “David Thomspon”. This is an error that won’t affect our players data. The second observation (for Magic Johnson) is for the 1992 All-Star game. From Wikipedia,

The game is most remembered for the return of Los Angeles Lakers legend Magic Johnson, who retired before the 1991-92 NBA season after contracting HIV.

This explains why this particular row from the allstar data wasn’t transferred over - even though he played in the All-Star game, Magic Johnson didn’t actually play during the regular season.

Finally, we’ve taken care of the many discrepancies that arose from the attempted joining of the allstar and player data. At this point we can be confident that all of our data sets will safely merge with players.

# Join all data sets
players.joined <- players %>% 
  left_join(pos, by = "playerID") %>% 
  left_join(awards, by = c("playerID", "year")) %>% 
  left_join(allstar, by = c("playerID", "year")) %>% 
  mutate_if(is.numeric, coalesce, 0L) %>%  # Return non-All-Star entries as integer zeroes
  mutate_if(is.character, coalesce, "None") # Replace NA values in award with "None"
players.joined
## # A tibble: 14,577 x 31
##    playerID  year tmID     GP minutes points oRebounds dRebounds rebounds
##    <chr>    <int> <fct> <int>   <int>  <int>     <int>     <int>    <int>
##  1 abdulka~  1979 LAL      82    3143   2034       190       696      886
##  2 abernto~  1979 GSW      67    1222    362        62       129      191
##  3 adamsal~  1979 PHO      75    2168   1118       158       451      609
##  4 archina~  1979 BOS      80    2864   1131        59       138      197
##  5 awtrede~  1979 CHI      26     560     86        29        86      115
##  6 bailegu~  1979 WSB      20     180     38         6        22       28
##  7 baileja~  1979 SEA      67     726    312        71       126      197
##  8 ballagr~  1979 WSB      82    2438   1277       240       398      638
##  9 bantomi~  1979 IND      77    2330    908       192       264      456
## 10 barnema~  1979 SDC      20     287     64        34        43       77
## # ... with 14,567 more rows, and 22 more variables: assists <int>,
## #   steals <int>, blocks <int>, turnovers <int>, PF <int>,
## #   fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## #   threeAttempted <int>, threeMade <int>, center <int>, forward <int>,
## #   guard <int>, award <chr>, allDefFirstTeam <int>,
## #   allDefSecondTeam <int>, allNBAFirstTeam <int>, allNBASecondTeam <int>,
## #   MVP <int>, defPOTY <int>, allstar <int>

Save the Cleaned Data


Now all that’s left is to save the cleaned data set!

write.csv(players.joined, "players_clean.csv", row.names = FALSE)

With the data cleaned, we’re ready for exploratory data analysis, which will be continued in another report.