This capstone project will utilize historical NBA statistics from 1937 to 2012 to predict All-NBA Teams. Data used can be found here. The github repository for this project can also be found here.
# Store data as a tbl
players <- as_tibble(read.csv("basketball_players.csv"))
awards <- as_tibble(read.csv("basketball_awards_players.csv"))
allstar <- as_tibble(read.csv("basketball_player_allstar.csv"))
master <- as_tibble(read.csv("basketball_master.csv"))
# Check the players data
players
## # A tibble: 23,751 x 42
## playerID year stint tmID lgID GP GS minutes points oRebounds
## <fct> <int> <int> <fct> <fct> <int> <int> <int> <int> <int>
## 1 abramjo~ 1946 1 PIT NBA 47 0 0 527 0
## 2 aubucch~ 1946 1 DTF NBA 30 0 0 65 0
## 3 bakerno~ 1946 1 CHS NBA 4 0 0 0 0
## 4 baltihe~ 1946 1 STB NBA 58 0 0 138 0
## 5 barrjo01 1946 1 STB NBA 58 0 0 295 0
## 6 baumhfr~ 1946 1 CLR NBA 45 0 0 631 0
## 7 beckemo~ 1946 1 PIT NBA 17 0 0 108 0
## 8 beckemo~ 1946 2 BOS NBA 6 0 0 13 0
## 9 beckemo~ 1946 3 DTF NBA 20 0 0 41 0
## 10 beendha~ 1946 1 PRO NBA 58 0 0 713 0
## # ... with 23,741 more rows, and 32 more variables: dRebounds <int>,
## # rebounds <int>, assists <int>, steals <int>, blocks <int>,
## # turnovers <int>, PF <int>, fgAttempted <int>, fgMade <int>,
## # ftAttempted <int>, ftMade <int>, threeAttempted <int>,
## # threeMade <int>, PostGP <int>, PostGS <int>, PostMinutes <int>,
## # PostPoints <int>, PostoRebounds <int>, PostdRebounds <int>,
## # PostRebounds <int>, PostAssists <int>, PostSteals <int>,
## # PostBlocks <int>, PostTurnovers <int>, PostPF <int>,
## # PostfgAttempted <int>, PostfgMade <int>, PostftAttempted <int>,
## # PostftMade <int>, PostthreeAttempted <int>, PostthreeMade <int>,
## # note <fct>
Player data already appears to be tidy. As expected, nearly all of our variables of interest are numeric, and it looks like several variables (e.g. oRebounds, dRebounds, threeAttempted, threeMade) were not tracked in the early seasons.
First, the setequal() and distinct() functions are used to check for duplicate observations (just in case).
# Check for duplicate observations
setequal(master, distinct(master))
## TRUE
setequal(players, distinct(players))
## TRUE
setequal(awards, distinct(awards))
## TRUE
setequal(allstar, distinct(allstar))
## TRUE
We’ll be working with players first. The data set currently contains data from other leagues (e.g., ABA), as well as data from seasons before the first All-Star Game (pre-1950). We can start by removing these extra observations. Post-season stats, players with no games played, and miscellaneous features are also removed (Note that lgID is also removed after filtering).
# Remove extraneous rows and columns
players <- players %>%
filter(lgID == "NBA", year >= 1950, GP != 0) %>%
select(-contains("Post"), -c(note, stint, lgID, GS))
players
## # A tibble: 19,458 x 20
## playerID year tmID GP minutes points oRebounds dRebounds rebounds
## <chr> <int> <fct> <int> <int> <int> <int> <int> <int>
## 1 arizipa~ 1950 PHW 65 0 1121 0 0 640
## 2 armstcu~ 1950 FTW 38 0 202 0 0 89
## 3 barkecl~ 1950 INO 56 0 152 0 0 100
## 4 barnhle~ 1950 INO 68 0 546 0 0 296
## 5 barteed~ 1950 WSC 17 0 72 0 0 84
## 6 beardra~ 1950 INO 66 0 1111 0 0 251
## 7 blevile~ 1950 INO 3 0 2 0 0 2
## 8 bobbne01 1950 PHW 53 0 148 0 0 101
## 9 borsaik~ 1950 PHW 24 0 64 0 0 24
## 10 borylvi~ 1950 NYK 66 0 982 0 0 249
## # ... with 19,448 more rows, and 11 more variables: assists <int>,
## # steals <int>, blocks <int>, turnovers <int>, PF <int>,
## # fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## # threeAttempted <int>, threeMade <int>
Now that the data set has been trimmed a bit, we can use the summary() function to view some general summary statistics.
## year GP minutes points
## Min. :1950 Min. : 1.00 Min. : 0 Min. : 0
## 1st Qu.:1978 1st Qu.:28.00 1st Qu.: 332 1st Qu.: 109
## Median :1992 Median :59.00 Median :1083 Median : 384
## Mean :1989 Mean :51.43 Mean :1235 Mean : 531
## 3rd Qu.:2002 3rd Qu.:76.00 3rd Qu.:2026 3rd Qu.: 813
## Max. :2011 Max. :82.00 Max. :3882 Max. :4029
## oRebounds dRebounds rebounds assists
## Min. : 0.00 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 3.00 1st Qu.: 7.0 1st Qu.: 52 1st Qu.: 19.0
## Median : 28.00 Median : 75.0 Median : 167 Median : 71.0
## Mean : 55.02 Mean : 125.6 Mean : 233 Mean : 119.5
## 3rd Qu.: 83.00 3rd Qu.: 190.0 3rd Qu.: 333 3rd Qu.: 169.0
## Max. :587.00 Max. :1111.0 Max. :2149 Max. :1164.0
## steals blocks turnovers PF
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 41.0
## Median : 21.00 Median : 7.00 Median : 35.00 Median :115.0
## Mean : 34.65 Mean : 21.34 Mean : 60.27 Mean :120.9
## 3rd Qu.: 55.00 3rd Qu.: 25.00 3rd Qu.: 99.00 3rd Qu.:190.0
## Max. :301.00 Max. :456.00 Max. :366.00 Max. :386.0
## fgAttempted fgMade ftAttempted ftMade
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 100.0 1st Qu.: 42.0 1st Qu.: 28.0 1st Qu.: 19.0
## Median : 334.0 Median : 148.0 Median : 95.0 Median : 68.0
## Mean : 446.9 Mean : 204.1 Mean : 144.3 Mean :108.2
## 3rd Qu.: 691.0 3rd Qu.: 314.0 3rd Qu.: 213.0 3rd Qu.:159.0
## Max. :3159.0 Max. :1597.0 Max. :1363.0 Max. :840.0
## threeAttempted threeMade
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 2.00 Median : 0.00
## Mean : 42.19 Mean : 14.67
## 3rd Qu.: 34.00 3rd Qu.: 10.00
## Max. :678.00 Max. :269.00
Unfortunately, it looks like there are still several statistics that were not recorded since the inception of the All-Star game (which plenty of NBA historians would love to remind us of). We’re concerned specifically with rebounds and 3-pointers. A bit of research reveals that the NBA didn’t begin recording offensive and defensive rebounds until the 1973-1974 season, and the 3-pointer wasn’t adopted until 1979. This means that we should only keep data from 1979 and later, but let’s confirm that with which.min() just in case:
# Group by year and sort by maxima
check <- players %>%
select(year, oRebounds, dRebounds, threeAttempted) %>%
group_by(year) %>%
summarize_if(is.numeric, max)
# Store the indices for the earliest year that each stat maxed out at 0
keep <- c(
which.min(check$oRebounds == 0),
which.min(check$dRebounds == 0),
which.min(check$threeAttempted == 0))
# What's the earliest year we can keep?
check$year[max(keep)]
## [1] 1979
To prevent bias we need to retain data only for years in which every statistic is available, so we should get rid of observations before the 1979-1980 season.
players <- players %>%
filter(year >= 1979)
players
## # A tibble: 14,577 x 20
## playerID year tmID GP minutes points oRebounds dRebounds rebounds
## <chr> <int> <fct> <int> <int> <int> <int> <int> <int>
## 1 abdulka~ 1979 LAL 82 3143 2034 190 696 886
## 2 abernto~ 1979 GSW 67 1222 362 62 129 191
## 3 adamsal~ 1979 PHO 75 2168 1118 158 451 609
## 4 archina~ 1979 BOS 80 2864 1131 59 138 197
## 5 awtrede~ 1979 CHI 26 560 86 29 86 115
## 6 bailegu~ 1979 WSB 20 180 38 6 22 28
## 7 baileja~ 1979 SEA 67 726 312 71 126 197
## 8 ballagr~ 1979 WSB 82 2438 1277 240 398 638
## 9 bantomi~ 1979 IND 77 2330 908 192 264 456
## 10 barnema~ 1979 SDC 20 287 64 34 43 77
## # ... with 14,567 more rows, and 11 more variables: assists <int>,
## # steals <int>, blocks <int>, turnovers <int>, PF <int>,
## # fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## # threeAttempted <int>, threeMade <int>
Look’s like we’ve shaved the players data set down from 23,751 x 42 to 14,577 x 20!
master
## # A tibble: 5,061 x 26
## bioID useFirst firstName middleName lastName nameGiven fullGivenName
## <fct> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 abde~ Alaa Alaa "" Abdelna~ "" ""
## 2 abdu~ Kareem Kareem "" Abdul-J~ "" Ferdinand Le~
## 3 abdu~ Mahdi Mahdi "" Abdul-R~ "" Walter Rapha~
## 4 abdu~ Mahmoud Mahmoud "" Abdul-R~ "" Chris Wayne ~
## 5 abdu~ Tariq Tariq "" Abdul-W~ "" Olivier Mich~
## 6 abdu~ Zaid Zaid "" Abdul-A~ "" Donald A. Sm~
## 7 able~ Forest Forest Edward Able "" ""
## 8 abra~ John John "" Abramov~ "" ""
## 9 acke~ Alex Alex "" Acker "" ""
## 10 acre~ Mark Mark Richard Acres "" ""
## # ... with 5,051 more rows, and 19 more variables: nameSuffix <fct>,
## # nameNick <fct>, pos <fct>, firstseason <int>, lastseason <int>,
## # height <dbl>, weight <int>, college <fct>, collegeOther <fct>,
## # birthDate <fct>, birthCity <fct>, birthState <fct>,
## # birthCountry <fct>, highSchool <fct>, hsCity <fct>, hsState <fct>,
## # hsCountry <fct>, deathDate <fct>, race <fct>
This data set is the master reference for first and last name data and, more importantly, the associated player ID. We’ll use this data set to cross-check any unexpected behavior down the road. It also contains additional biographical data, such as height, weight, and college attended. The only relevant variable for us (other than player ID) is the position played, so we’ll omit everything else.
# Select relevant variables, rename, and create indicators
pos <- master %>%
select(bioID, pos) %>%
rename("playerID" = "bioID") %>%
mutate(
center = case_when(
grepl("C", pos) ~ as.integer(1),
TRUE ~ as.integer(0)),
forward = case_when(
grepl("F", pos) ~ as.integer(1),
TRUE ~ as.integer(0)),
guard = case_when(
grepl("G", pos) ~ as.integer(1),
TRUE ~ as.integer(0))
) %>%
select(-pos)
pos$playerID <- as.character(pos$playerID) # Initialize as a character vector
pos
## # A tibble: 5,061 x 4
## playerID center forward guard
## <chr> <int> <int> <int>
## 1 abdelal01 1 1 0
## 2 abdulka01 1 0 0
## 3 abdulma01 0 0 1
## 4 abdulma02 0 0 1
## 5 abdulta01 0 1 1
## 6 abdulza01 1 1 0
## 7 ablefo01 0 0 1
## 8 abramjo01 0 1 0
## 9 ackeral01 0 0 1
## 10 acresma01 1 1 0
## # ... with 5,051 more rows
Now we can make similar adjustments to the data sets for end-of-season awards and All-Star games, which are much smaller than the players data set:
awards
## # A tibble: 1,719 x 6
## playerID award year lgID note pos
## <chr> <fct> <int> <fct> <fct> <fct>
## 1 feeribo01 All-NBA First Team 1946 NBA "" ""
## 2 fulksjo01 All-NBA First Team 1946 NBA "" ""
## 3 mckinho01 All-NBA First Team 1946 NBA "" ""
## 4 miasest01 All-NBA First Team 1946 NBA "" ""
## 5 zasloma01 All-NBA First Team 1946 NBA "" ""
## 6 baumhfr01 All-NBA Second Team 1946 NBA "" ""
## 7 calveer01 All-NBA Second Team 1946 NBA "" ""
## 8 halbech01 All-NBA Second Team 1946 NBA "" ""
## 9 loganjo01 All-NBA Second Team 1946 NBA "" ""
## 10 scolafr01 All-NBA Second Team 1946 NBA "" ""
## # ... with 1,709 more rows
Similar to players, the awards data set has data for non-NBA leagues. It also has awards other than the All-NBA teams, such as “Most Valuable Player” and “Rookie of the Year.” We’ll only keep NBA data and relevant awards (Note that grepl() is used here, along with an appropriate regular expression).
# Filter for league and year
awards <- awards %>%
filter(lgID == "NBA", year >= 1979) %>%
select(-c(note, lgID))
# Keep relevant awards
keep <- grepl("All-NBA\\sFirst|All-NBA\\sSecond|All-D|Valuable|Defensive", awards$award)
awards <- awards[keep,]
# Check for duplicate entries
setequal(awards, distinct(awards))
## TRUE
To prevent potential bias, “All-NBA Third Team” honors are disregarded, since the Third Team has only been selected from 1988 and on. “Defensive Player of the Year” is retained as there are only 3 seasons for which data is missing (Defensive POTY has been awarded since 1982). The pos variable is also retained, as each team must consist of two guards, two forwards, and one center.
As a final step, we create indicator variables for each award. However, the awards data set currently stores multiple rows for years in which a player won multiple awards. For instance:
awards %>% filter(playerID == "abdulka01", year == 1979) %>% group_by(year)
## # A tibble: 3 x 4
## # Groups: year [1]
## playerID award year pos
## <chr> <fct> <int> <fct>
## 1 abdulka01 All-Defensive First Team 1979 ""
## 2 abdulka01 All-NBA First Team 1979 C
## 3 abdulka01 Most Valuable Player 1979 ""
To properly associate awards with the years they were awarded (as well as prevent any odd behavior when merging with the original data set), we need to combine such observations into a single entry per player, per year. This is accomplished with summarize(award = paste(award, collapse = ", ")) after grouping by playerID and year. case_when() and grepl() are then used to create the desired indicator variables.
# Create indicator variables for each award
awards <- awards %>%
group_by(playerID, year) %>%
summarize(award = paste(award, collapse = ", ")) %>%
mutate(
allDefFirstTeam = case_when(
grepl("All-Defensive\\sFirst", award) ~ as.integer(1),
TRUE ~ as.integer(0)),
allDefSecondTeam = case_when(
grepl("All-Defensive\\sSecond", award) ~ as.integer(1),
TRUE ~ as.integer(0)),
allNBAFirstTeam = case_when(
grepl("All-NBA\\sFirst", award) ~ as.integer(1),
TRUE ~ as.integer(0)),
allNBASecondTeam = case_when(
grepl("All-NBA\\sSecond", award) ~ as.integer(1),
TRUE ~ as.integer(0)),
MVP = case_when(
grepl("Most", award) ~ as.integer(1),
TRUE ~ as.integer(0)),
defPOTY = case_when(
grepl("Defensive\\sPlayer", x = award) ~ as.integer(1),
TRUE ~ as.integer(0)))
awards
## # A tibble: 542 x 9
## # Groups: playerID [156]
## playerID year award allDefFirstTeam allDefSecondTeam allNBAFirstTeam
## <chr> <int> <chr> <int> <int> <int>
## 1 abdulka~ 1979 All-~ 1 0 1
## 2 abdulka~ 1980 All-~ 1 0 1
## 3 abdulka~ 1982 All-~ 0 0 0
## 4 abdulka~ 1983 All-~ 0 1 1
## 5 abdulka~ 1984 All-~ 0 0 0
## 6 abdulka~ 1985 All-~ 0 0 1
## 7 allenra~ 2004 All-~ 0 0 0
## 8 allento~ 2010 All-~ 0 1 0
## 9 allento~ 2011 All-~ 1 0 0
## 10 anthoca~ 2009 All-~ 0 0 0
## # ... with 532 more rows, and 3 more variables: allNBASecondTeam <int>,
## # MVP <int>, defPOTY <int>
allstar
## # A tibble: 1,609 x 23
## player_id last_name first_name season_id conference league_id
## <chr> <fct> <fct> <int> <fct> <fct>
## 1 abdulka01 Abdul-Ja~ Kareem 1978 West NBA
## 2 abdulka01 Abdul-Ja~ Kareem 1969 East NBA
## 3 abdulka01 Abdul-Ja~ Kareem 1988 West NBA
## 4 abdulka01 Abdul-Ja~ Kareem 1987 West NBA
## 5 abdulka01 Abdul-Ja~ Kareem 1986 West NBA
## 6 abdulka01 Abdul-Ja~ Kareem 1985 West NBA
## 7 abdulka01 Abdul-Ja~ Kareem 1984 West NBA
## 8 abdulka01 Abdul-Ja~ Kareem 1983 West NBA
## 9 abdulka01 Abdul-Ja~ Kareem 1982 West NBA
## 10 abdulka01 Abdul-Ja~ Kareem 1980 West NBA
## # ... with 1,599 more rows, and 17 more variables: games_played <int>,
## # minutes <int>, points <int>, o_rebounds <int>, d_rebounds <int>,
## # rebounds <int>, assists <int>, steals <int>, blocks <int>,
## # turnovers <int>, personal_fouls <int>, fg_attempted <int>,
## # fg_made <int>, ft_attempted <int>, ft_made <int>,
## # three_attempted <int>, three_made <int>
We wish to use All-Star data as an additional explanatory variable. To avoid conflicts with existing player statistics, we can disregard actual game performance and simply create an indicator variable for team membership.
# Filter for league and year
allstar <- allstar %>%
filter(league_id == "NBA", season_id >= 1979) %>%
select(player_id, season_id) %>%
mutate(allstar = as.integer(1)) %>%
rename("playerID" = "player_id", "year" = "season_id") # Rename variables to match format of players
# Check for duplicate entries
setequal(allstar, distinct(allstar))
## FALSE: Different number of rows
Oops, looks like there are duplicate entries. After a prolonged investigation, we find that Marques Johnson was incorrectly coded as johnsma02 instead of johnsma01, causing a discrepancy for the All-Star team in 1979, which had both Marques Johnson and Magic Johnson.
## # A tibble: 5 x 23
## player_id last_name first_name season_id conference league_id
## <fct> <fct> <fct> <int> <fct> <fct>
## 1 johnsma01 Johnson Marques 1979 West NBA
## 2 johnsma01 Johnson Marques 1978 West NBA
## 3 johnsma02 Johnson Marques 1985 West NBA
## 4 johnsma02 Johnson Marques 1982 East NBA
## 5 johnsma02 Johnson Marques 1980 East NBA
## # ... with 17 more variables: games_played <int>, minutes <int>,
## # points <int>, o_rebounds <int>, d_rebounds <int>, rebounds <int>,
## # assists <int>, steals <int>, blocks <int>, turnovers <int>,
## # personal_fouls <int>, fg_attempted <int>, fg_made <int>,
## # ft_attempted <int>, ft_made <int>, three_attempted <int>,
## # three_made <int>
We can correct this quirk with allstar$player_id[allstar$first_name == "Marques"] <- "johnsma01" and re-run our code.
# Check for duplicate entries
setequal(allstar, distinct(allstar))
## FALSE: Different number of rows
Much better! We should (hopefully) be fine when we merge our data.
Now that we’ve cleaned up each data set a bit, we can left_join() all of our data into a single data set. left_join() takes two arguments, x and y, and augments the primary data frame (x) with the data from the secondary data frame (y). The join is performed based on a common key which should be present in both data frames (in our case, playerID and year). In the following code, we can join the data frames separately to check whether rows are added, which would indicate that the join didn’t go as expected.
# Merge player data with awards data
players %>%
left_join(awards, by = c("playerID", "year"))
## # A tibble: 14,577 x 27
## playerID year tmID GP minutes points oRebounds dRebounds rebounds
## <chr> <int> <fct> <int> <int> <int> <int> <int> <int>
## 1 abdulka~ 1979 LAL 82 3143 2034 190 696 886
## 2 abernto~ 1979 GSW 67 1222 362 62 129 191
## 3 adamsal~ 1979 PHO 75 2168 1118 158 451 609
## 4 archina~ 1979 BOS 80 2864 1131 59 138 197
## 5 awtrede~ 1979 CHI 26 560 86 29 86 115
## 6 bailegu~ 1979 WSB 20 180 38 6 22 28
## 7 baileja~ 1979 SEA 67 726 312 71 126 197
## 8 ballagr~ 1979 WSB 82 2438 1277 240 398 638
## 9 bantomi~ 1979 IND 77 2330 908 192 264 456
## 10 barnema~ 1979 SDC 20 287 64 34 43 77
## # ... with 14,567 more rows, and 18 more variables: assists <int>,
## # steals <int>, blocks <int>, turnovers <int>, PF <int>,
## # fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## # threeAttempted <int>, threeMade <int>, award <chr>,
## # allDefFirstTeam <int>, allDefSecondTeam <int>, allNBAFirstTeam <int>,
## # allNBASecondTeam <int>, MVP <int>, defPOTY <int>
# Merge player data with position data
players %>%
left_join(pos, by = c("playerID"))
## # A tibble: 14,577 x 23
## playerID year tmID GP minutes points oRebounds dRebounds rebounds
## <chr> <int> <fct> <int> <int> <int> <int> <int> <int>
## 1 abdulka~ 1979 LAL 82 3143 2034 190 696 886
## 2 abernto~ 1979 GSW 67 1222 362 62 129 191
## 3 adamsal~ 1979 PHO 75 2168 1118 158 451 609
## 4 archina~ 1979 BOS 80 2864 1131 59 138 197
## 5 awtrede~ 1979 CHI 26 560 86 29 86 115
## 6 bailegu~ 1979 WSB 20 180 38 6 22 28
## 7 baileja~ 1979 SEA 67 726 312 71 126 197
## 8 ballagr~ 1979 WSB 82 2438 1277 240 398 638
## 9 bantomi~ 1979 IND 77 2330 908 192 264 456
## 10 barnema~ 1979 SDC 20 287 64 34 43 77
## # ... with 14,567 more rows, and 14 more variables: assists <int>,
## # steals <int>, blocks <int>, turnovers <int>, PF <int>,
## # fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## # threeAttempted <int>, threeMade <int>, center <int>, forward <int>,
## # guard <int>
# Merge player data with allstar data
players %>%
left_join(allstar, by = c("playerID", "year"))
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
## # A tibble: 14,580 x 21
## playerID year tmID GP minutes points oRebounds dRebounds rebounds
## <chr> <int> <fct> <int> <int> <int> <int> <int> <int>
## 1 abdulka~ 1979 LAL 82 3143 2034 190 696 886
## 2 abernto~ 1979 GSW 67 1222 362 62 129 191
## 3 adamsal~ 1979 PHO 75 2168 1118 158 451 609
## 4 archina~ 1979 BOS 80 2864 1131 59 138 197
## 5 awtrede~ 1979 CHI 26 560 86 29 86 115
## 6 bailegu~ 1979 WSB 20 180 38 6 22 28
## 7 baileja~ 1979 SEA 67 726 312 71 126 197
## 8 ballagr~ 1979 WSB 82 2438 1277 240 398 638
## 9 bantomi~ 1979 IND 77 2330 908 192 264 456
## 10 barnema~ 1979 SDC 20 287 64 34 43 77
## # ... with 14,570 more rows, and 12 more variables: assists <int>,
## # steals <int>, blocks <int>, turnovers <int>, PF <int>,
## # fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## # threeAttempted <int>, threeMade <int>, allstar <int>
The awards and pos data sets seemed to merge correctly, but it looks like the number of rows changed after the allstar merge. To investigate this error, we can use the anti_join() function to compare the original allstar data set to the entries that were actually transferred over. anti_join() only returns rows from the primary data frame that are not in the secondary, while keeping the columns from the primary.
anti_join(allstar, players)
## Joining, by = c("playerID", "year")
## Warning: Column `playerID` joining factor and character vector, coercing
## into character vector
## # A tibble: 50 x 3
## playerID year allstar
## <fct> <int> <int>
## 1 anderke02 1993 1
## 2 antonca01 2007 1
## 3 floyder01 1986 1
## 4 ginobma01 2004 1
## 5 johnske01 1993 1
## 6 johnske01 1990 1
## 7 johnske01 1989 1
## 8 johnsla01 1994 1
## 9 johnsla01 1992 1
## 10 johnsma01 1988 1
## # ... with 40 more rows
After manually cross-referencing the excluded All-Star rows and playerID’s with Basketball-Reference, Wikipedia, and the master list, we find that all of the missing rows had their playerID values incorrectly coded per the players and master data sets, which resulted in skipped entries during the join (e.g., o’neaje01, o’neasha01). We can correct these errors and try joining the data again.
# Fix coding errors for all players in the anti_join
allstar$player_id[allstar$first_name == "Marques" & allstar$last_name == "Johnson"] <- "johnsma01"
allstar$player_id[allstar$first_name == "Magic" & allstar$last_name == "Johnson"] <- "johnsma02"
allstar$player_id[allstar$first_name == "Kenny" & allstar$last_name == "Anderson"] <- "anderke01"
allstar$player_id[allstar$first_name == "Carmelo" & allstar$last_name == "Anthony"] <- "anthoca01"
allstar$player_id[allstar$first_name == "Eric" & allstar$last_name == "Floyd"] <- "floydsl01"
allstar$player_id[allstar$first_name == "Manu" & allstar$last_name == "Ginobili"] <- "ginobem01"
allstar$player_id[allstar$first_name == "Kevin" & allstar$last_name == "Johnson"] <- "johnske02"
allstar$player_id[allstar$first_name == "Larry" & allstar$last_name == "Johnson"] <- "johnsla02"
allstar$player_id[allstar$first_name == "Eddie" & allstar$last_name == "Jones"] <- "jonesed02"
allstar$player_id[allstar$first_name == "Lafayette" & allstar$last_name == "Lever"] <- "leverfa01"
allstar$player_id[allstar$first_name == "Rashard" & allstar$last_name == "Lewis"] <- "lewisra02"
allstar$player_id[allstar$first_name == "Jermaine" & allstar$last_name == "O'Neal"] <- "onealje01"
allstar$player_id[allstar$first_name == "Shaquille" & allstar$last_name == "O'Neal"] <- "onealsh01"
allstar$player_id[allstar$first_name == "Jim" & allstar$last_name == "Paxson"] <- "paxsoji02"
allstar$player_id[allstar$first_name == "Jeff" & allstar$last_name == "Rutland"] <- "rulanje01"
allstar$player_id[allstar$first_name == "Wally" & allstar$last_name == "Szczerbiak"] <- "szczewa02"
allstar$player_id[allstar$first_name == "David" & allstar$last_name == "Thompson"] <- "thompda01"
allstar$player_id[allstar$first_name == "Antoine" & allstar$last_name == "Walker"] <- "walkean02"
allstar$player_id[allstar$first_name == "Kevin" & allstar$last_name == "Willis"] <- "willike02"
# Apply previous filters, etc.
allstar <- allstar %>%
filter(league_id == "NBA", season_id >= 1979) %>%
select(player_id, season_id) %>%
mutate(allstar = as.integer(1)) %>%
rename("playerID" = "player_id", "year" = "season_id")
# Retry the join and save to allstar.join
allstar.join <- players %>%
left_join(allstar, by = c("playerID", "year"))
allstar.join
## # A tibble: 14,577 x 21
## playerID year tmID GP minutes points oRebounds dRebounds rebounds
## <chr> <int> <fct> <int> <int> <int> <int> <int> <int>
## 1 abdulka~ 1979 LAL 82 3143 2034 190 696 886
## 2 abernto~ 1979 GSW 67 1222 362 62 129 191
## 3 adamsal~ 1979 PHO 75 2168 1118 158 451 609
## 4 archina~ 1979 BOS 80 2864 1131 59 138 197
## 5 awtrede~ 1979 CHI 26 560 86 29 86 115
## 6 bailegu~ 1979 WSB 20 180 38 6 22 28
## 7 baileja~ 1979 SEA 67 726 312 71 126 197
## 8 ballagr~ 1979 WSB 82 2438 1277 240 398 638
## 9 bantomi~ 1979 IND 77 2330 908 192 264 456
## 10 barnema~ 1979 SDC 20 287 64 34 43 77
## # ... with 14,567 more rows, and 12 more variables: assists <int>,
## # steals <int>, blocks <int>, turnovers <int>, PF <int>,
## # fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## # threeAttempted <int>, threeMade <int>, allstar <int>
The resulting data frame, allstar.join, has the correct amount of 14577 observations, which confirms that there are no extranneous rows added. As a quick aside, summing up the allstar column will show a number greater than 743:
# Add all observations for which the player was an All-Star that year
sum(allstar.join$allstar, na.rm = TRUE)
## [1] 751
The could indicate that some players were incorrectly coded as All-Stars. However, after a bit of digging with the duplicated() function, we find that these additional allstar entries are an artifact of the players structure; players who were traded during the season have multiple rows for the year that they were traded, so players who were traded that ALSO happened to make the All-Star team roster will have both rows coded with allstar = 1.
# Filter All-Stars and select rows to match the allstar data set
allstar.join <- allstar.join %>%
select(playerID, year, allstar) %>%
filter(allstar == 1)
# Check rows with duplicate entries
allstar.join[duplicated(allstar.join),]
## # A tibble: 10 x 3
## playerID year allstar
## <chr> <int> <int>
## 1 piercri01 1990 1
## 2 mannida01 1993 1
## 3 wilkido01 1993 1
## 4 gatlich01 1996 1
## 5 mutomdi01 2000 1
## 6 paytoga01 2002 1
## 7 cartevi01 2004 1
## 8 kiddja01 2007 1
## 9 billuch01 2008 1
## 10 iversal01 2008 1
The astute reader will notice that there are still 2 observations missing from the join (summing the allstar column resulted in 751 rows, but there are 10 additional rows added from traded All-Star players).
anti_join(allstar, allstar.join, by = c("playerID", "year"))
## # A tibble: 2 x 3
## playerID year allstar
## <chr> <int> <int>
## 1 johnsma02 1991 1
## 2 thomsda01 1982 1
The line of code allstar$player_id[allstar$first_name == "David" & allstar$last_name == "Thompson"] <- "thompda01" should have corrected David Thompson’s 1982 All-Star game appearance. But, after reviewing the original allstar data (again), we find that there were two entries for the thomsda01 player ID: one for “David Thompson” and one for “David Thomspon”. This is an error that won’t affect our players data. The second observation (for Magic Johnson) is for the 1992 All-Star game. From Wikipedia,
The game is most remembered for the return of Los Angeles Lakers legend Magic Johnson, who retired before the 1991-92 NBA season after contracting HIV.
This explains why this particular row from the allstar data wasn’t transferred over - even though he played in the All-Star game, Magic Johnson didn’t actually play during the regular season.
Finally, we’ve taken care of the many discrepancies that arose from the attempted joining of the allstar and player data. At this point we can be confident that all of our data sets will safely merge with players.
# Join all data sets
players.joined <- players %>%
left_join(pos, by = "playerID") %>%
left_join(awards, by = c("playerID", "year")) %>%
left_join(allstar, by = c("playerID", "year")) %>%
mutate_if(is.numeric, coalesce, 0L) %>% # Return non-All-Star entries as integer zeroes
mutate_if(is.character, coalesce, "None") # Replace NA values in award with "None"
players.joined
## # A tibble: 14,577 x 31
## playerID year tmID GP minutes points oRebounds dRebounds rebounds
## <chr> <int> <fct> <int> <int> <int> <int> <int> <int>
## 1 abdulka~ 1979 LAL 82 3143 2034 190 696 886
## 2 abernto~ 1979 GSW 67 1222 362 62 129 191
## 3 adamsal~ 1979 PHO 75 2168 1118 158 451 609
## 4 archina~ 1979 BOS 80 2864 1131 59 138 197
## 5 awtrede~ 1979 CHI 26 560 86 29 86 115
## 6 bailegu~ 1979 WSB 20 180 38 6 22 28
## 7 baileja~ 1979 SEA 67 726 312 71 126 197
## 8 ballagr~ 1979 WSB 82 2438 1277 240 398 638
## 9 bantomi~ 1979 IND 77 2330 908 192 264 456
## 10 barnema~ 1979 SDC 20 287 64 34 43 77
## # ... with 14,567 more rows, and 22 more variables: assists <int>,
## # steals <int>, blocks <int>, turnovers <int>, PF <int>,
## # fgAttempted <int>, fgMade <int>, ftAttempted <int>, ftMade <int>,
## # threeAttempted <int>, threeMade <int>, center <int>, forward <int>,
## # guard <int>, award <chr>, allDefFirstTeam <int>,
## # allDefSecondTeam <int>, allNBAFirstTeam <int>, allNBASecondTeam <int>,
## # MVP <int>, defPOTY <int>, allstar <int>
Now all that’s left is to save the cleaned data set!
write.csv(players.joined, "players_clean.csv", row.names = FALSE)
With the data cleaned, we’re ready for exploratory data analysis, which will be continued in another report.