How well does your NBA team draft?

Comparing team’s draft success over the years is a tricky task. While you could easily compile a list of all drafted players by team and match, you’d be missing one very important aspect - where in the draft was the player picked? Drafting a player like Kawhi Leonard at 15 is a much more succesful draft pick than Lebron James at 1. The chance of selecting a Hall of Fame player is much greater the higher your draft selection is.

In this project, I have measured the success of a draft pick by comparing their career success to all others drafted at that draft position. I will be using a combination of the player’s PER, VORP and WS (all evenly weighted) and then utilzing the zscore of this combined stat as a measurement of success. I am only looking at first round picks (picks 1:30 in years where there weren’t 30 teams) between the years of 1980-2012.

I have also weighted their zscore based on the draft class. Selecting a player in a draft full of stars is not as difficult as selecting a player in a draft with only a select amount of stars. Draft classes with a higher combined score have been penalized while the scores of players in classes who have a lower average are bumped up.

The goal of this project is to see how teams compare in their draft results. While there are many factors that go into the success of a player’s career, I felt the combined PER, VORP and WS gives a decent representation of what a team would be looking for when selecting a player. Note that these stats are for the player’s whole career and therefore do not represent what the player actually contributed for their drafted team.

Let’s get started:

First i need to load the packages I will be using:

Here I define a function created to convert previous team names to their current names when cleaning the data (ex. SEA -> OKC):

teamsub <- function(pattern, replacement, x, ...) {
  if (length(pattern)!=length(replacement)) {
    stop("pattern and replacement do not have the same length.")
  }
  result <- x
  for (i in 1:length(pattern)) {
    result <- gsub(pattern[i], replacement[i], result, ...)
  }
  result
}

The next function I define is to convert factor variables to numeric:

as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}

Step 1: Setting up our dataframes

I initially downloaded datasets from basketballreference.com of each draft class and their relevant stats. Here I load these csv files in my project and combine them into one main dataframe. I also clean up the df by:

OGnames <- sprintf("%sclass.csv", 1980:2012)
for (i in OGnames){
  name <- gsub("class.csv", "", i)
name <- paste("class", name, sep = "")
assign(name, read.csv(i, header = T))
}

dfnames <- sprintf("class%s", 1980:2012)


Alldrafted <- rbind.fill(list(class1980, class1981, class1982, class1983, class1984, class1985, class1986, class1987, class1988, class1989, class1990, class1991, class1992, class1993, class1994, class1995, class1996, class1997, class1998, class1999, class2000, class2001, class2002, class2003, class2004, class2005, class2006, class2007, class2008, class2009, class2010, class2011, class2012))

Alldrafted$Player <- gsub("\\\\.*", "", Alldrafted$Player)
Alldrafted <- subset(Alldrafted, select = -c(Rk))
Alldrafted$Tm <- as.character(Alldrafted$Tm)
Alldrafted$Yrs[is.na(Alldrafted$Yrs)] <- 0
Alldrafted$G[is.na(Alldrafted$G)] <- 0
Alldrafted$WS[is.na(Alldrafted$WS)] <- 0
Alldrafted$WS.48[is.na(Alldrafted$WS.48)] <- 0

Alldrafted$Tm <- teamsub(c("SDC", "VAN", "NJN", "CHH", "NOH", "NOK", "KCK", "WSB", "SEA", "CHA", "PHX"), c("LAC", "MEM", "BRK", "NOP", "NOP", "NOP", "SAC", "WAS", "OKC", "CHO", "PHO"), Alldrafted$Tm)

head(Alldrafted)
##   Pk  Tm            Player Yrs   G    WS  WS.48
## 1  1 GSW Joe Barry Carroll  10 705  35.6  0.075
## 2  2 UTA  Darrell Griffith  10 765  22.0  0.049
## 3  3 BOS      Kevin McHale  13 971 113.0  0.180
## 4  4 CHI     Kelvin Ransey   6 474  17.9  0.074
## 5  5 DEN         James Ray   3 103  -0.3 -0.017
## 6  6 BRK      Mike O'Koren   8 407  14.6  0.080

I also utilized data from Real GM that had more information on every first round draft selection which included their draft year, height, weight, Pre NBA Source. I add the data and clean up some of its columns by:

playerinfo <- read.csv("draftedplayerinfo.csv", header = T, fill = T, stringsAsFactors=FALSE, fileEncoding="latin1")
playerinfo$Height <- gsub("11-May", "5-11", playerinfo$Height)
playerinfo$Height <- gsub("11-Jun", "6-11", playerinfo$Height)
playerinfo$Height <- gsub("10-Jun", "6-10", playerinfo$Height)
playerinfo$Height <- gsub("09-Jun", "6-9", playerinfo$Height)
playerinfo$Height <- gsub("08-Jun", "6-8", playerinfo$Height)
playerinfo$Height <- gsub("07-Jun", "6-7", playerinfo$Height)
playerinfo$Height <- gsub("06-Jun", "6-6", playerinfo$Height)
playerinfo$Height <- gsub("05-Jun", "6-5", playerinfo$Height)
playerinfo$Height <- gsub("04-Jun", "6-4", playerinfo$Height)
playerinfo$Height <- gsub("03-Jun", "6-3", playerinfo$Height)
playerinfo$Height <- gsub("02-Jun", "6-2", playerinfo$Height)
playerinfo$Height <- gsub("01-Jun", "6-1", playerinfo$Height)
playerinfo$Height <- gsub("10-May", "5-10", playerinfo$Height)
playerinfo$Height <- gsub("09-May", "5-9", playerinfo$Height)
playerinfo$Height <- gsub("08-May", "5-8", playerinfo$Height)
playerinfo$Height <- gsub("07-May", "5-7", playerinfo$Height)
playerinfo$Height <- gsub("06-May", "5-6", playerinfo$Height)
playerinfo$Height <- gsub("05-May", "5-5", playerinfo$Height)
playerinfo$Height <- gsub("03-May", "5-3", playerinfo$Height)
playerinfo$Height <- gsub("01-Jul", "7-1", playerinfo$Height)
playerinfo$Height <- gsub("02-Jul", "7-2", playerinfo$Height)
playerinfo$Height <- gsub("03-Jul", "7-3", playerinfo$Height)
playerinfo$Height <- gsub("04-Jul", "7-4", playerinfo$Height)
playerinfo$Height <- gsub("05-Jul", "7-5", playerinfo$Height)
playerinfo$Height <- gsub("06-Jul", "7-6", playerinfo$Height)
playerinfo$Height <- gsub("07-Jul", "7-7", playerinfo$Height)

playerinfo$Team <- teamsub(c("Portland Trail Blazers", "Milwaukee Bucks", "Toronto Raptors", "Houston Rockets", "Washington Wizards", "Cleveland Cavaliers", "Golden State Warriors" , "Orlando Magic", "New Jersey Nets", "Chicago Bulls", "Los Angeles Clippers", "Philadelphia Sixers", "San Antonio Spurs", "New York Knicks", "Sacramento Kings", "Charlotte Hornets (1988)", "Los Angeles Lakers", "Dallas Mavericks", "New Orleans Pelicans", "Minnesota Timberwolves", "Seattle SuperSonics", "Atlanta Hawks", "Vancouver Grizzlies", "Detroit Pistons", "Charlotte Hornets", "Miami Heat", "Indiana Pacers", "Phoenix Suns", "San Diego Clippers", "Boston Celtics", "Memphis Grizzlies", "Utah Jazz", "Denver Nuggets", "Oklahoma City Thunder", "Kansas City Kings", "Brooklyn Nets" ), c("POR", "MIL", "TOR", "HOU", "WAS", "CLE", "GSW", "ORL", "BRK", "CHI", "LAC", "PHI", "SAS", "NYK", "SAC", "NOP", "LAL", "DAL", "NOP", "MIN", "OKC", "ATL", "MEM", "DET", "CHO", "MIA", "IND", "PHO", "LAC", "BOS", "MEM", "UTA", "DEN", "OKC", "SAC", "BRK"), playerinfo$Team)

playerinfo$Draft.Trades <- teamsub(c("UTH", "SAN", "PHL", "GOS", "SDC", "VAN", "NJN", "CHH", "NOH", "NOK", "KCK", "WSB", "SEA", "CHA", "PHX"), c("UTA", "SAS", "PHI", "GSW", "LAC", "MEM", "BRK", "NOP", "NOP", "NOP", "SAC", "WAS", "OKC", "CHO", "PHO"), playerinfo$Draft.Trades)

head(Alldrafted)
##   Pk  Tm            Player Yrs   G    WS  WS.48
## 1  1 GSW Joe Barry Carroll  10 705  35.6  0.075
## 2  2 UTA  Darrell Griffith  10 765  22.0  0.049
## 3  3 BOS      Kevin McHale  13 971 113.0  0.180
## 4  4 CHI     Kelvin Ransey   6 474  17.9  0.074
## 5  5 DEN         James Ray   3 103  -0.3 -0.017
## 6  6 BRK      Mike O'Koren   8 407  14.6  0.080

More cleaning of the playerinfo df:

playerinfo <- subset(playerinfo, playerinfo$Year != 2013)
playerinfo <- subset(playerinfo, playerinfo$Year != 2014)
playerinfo <- subset(playerinfo, playerinfo$Year != 2015)
playerinfo <- subset(playerinfo, playerinfo$Year != 2016)

namecheck <- merge(Alldrafted, playerinfo, all.x = T, by = "Player")
investigate <- namecheck[complete.cases(namecheck) == F,]

playerinfo$Player[playerinfo$Player == "Alek Radojevic"] <- "Aleksandar Radojevic"
playerinfo$Player[playerinfo$Player == "Penny Hardaway"] <- "Anfernee Hardaway"
playerinfo$Player[playerinfo$Player == "Charles Daniel Smith"] <- "Charles Smith"
playerinfo$Player[playerinfo$Player == "Dontae Jones"] <- "Dontae' Jones"
playerinfo$Player[playerinfo$Player == "Efthimios Rentzias"] <- "Efthimi Rentzias"
playerinfo$Player[playerinfo$Player == "Frankie Williams"] <- "Frank Williams"
playerinfo$Player[playerinfo$Player == "Freddie Banks"] <- "Fred Banks"
playerinfo$Player[playerinfo$Player == "George Muresan"] <- "Gheorghe Muresan"
playerinfo$Player[playerinfo$Player == "Cadillac Anderson"] <- "Greg Anderson"
playerinfo$Player[playerinfo$Player == "J.R. Rider"] <- "Isaiah Rider"
playerinfo$Player[playerinfo$Player == "Jose Rafael Ortiz-Rijos"] <- "Jose Ortiz"
playerinfo$Player[playerinfo$Player == "Marcus D. Williams"] <- "Marcus Williams"
playerinfo$Player[playerinfo$Player == "Mo Taylor"] <- "Maurice Taylor"
playerinfo$Player[playerinfo$Player == "Mel Turpin"] <- "Melvin Turpin"
playerinfo$Player[playerinfo$Player == "Babby Araujo"] <- "Rafael Araujo"
playerinfo$Player[playerinfo$Player == "Saer Sene"] <- "Mouhamed Sene"
playerinfo$Player[playerinfo$Player == "Maybyner Nene"] <- "Nene Hilario"
playerinfo$Player[playerinfo$Player == "Roy Rogers, Jr."] <- "Roy Rogers"
playerinfo$Player[playerinfo$Player == "Jianlian Yi"] <- "Yi Jianlian"

timmy <- c("1997", "1", "1", "Tim Duncan", "SAS", "", "C", "83", "250", "21", "19", "Wake Forest")
playerinfo <-rbind(playerinfo, timmy)

head(playerinfo)
##   Year Rnd Pick          Player Team Draft.Trades   Pos Height Weight Age
## 1 2007   1    1       Greg Oden  POR                  C    7-0    273  19
## 2 2005   1    1    Andrew Bogut  MIL                  C    7-0    260  20
## 3 2006   1    1 Andrea Bargnani  TOR               PF/C    7-0    245  20
## 4 2002   1    1        Yao Ming  HOU                  C    7-6    310  21
## 5 2001   1    1     Kwame Brown  WAS                  C   6-11    290  19
## 6 2003   1    1    LeBron James  CLE              SF/PF    6-8    250  18
##   YOS                          Pre.Draft.Team
## 1   6                              Ohio State
## 2  11                                    Utah
## 3  10      Universo De'Longhi Treviso (Italy)
## 4   9               Shanghai Dongfang (China)
## 5  12                 Glynn Academy (Georgia)
## 6  13 St. Vincent St. Mary High School (Ohio)

The final set of data I use is scraped from the Basketball Reference website. This data holds the season by season records (1980-2016) of every player and their stats. I will be using this to find the career stats of players. Once loaded, I will be cleaning the data by: - Converting team names

***I’m using this data instead of set with career totals/averages because I will be looking at data for players on their drafted team in a future project.

URLs <- sprintf("http://www.basketball-reference.com/leagues/NBA_%s_advanced.html", 1981:2017)
tabla <- list()
for(v in seq_along(URLs))
{
  total <- readHTMLTable(URLs[v])
  n.rows <- unlist(lapply(total, function(t) dim(t)[1]))
  tabla[[v]] <- as.data.frame(total[[which.max(n.rows)]])
}

allplayers <- do.call(rbind.data.frame, tabla)

allplayers$Player <- gsub("\\*", "", allplayers$Player)
allplayers$Tm <- teamsub(c("SDC", "VAN", "NJN", "CHH", "NOH", "NOK", "KCK", "WSB", "SEA", "CHA", "PHX"), c("LAC", "MEM", "BRK", "NOP", "NOP", "NOP", "SAC", "WAS", "OKC", "CHO", "PHO"), allplayers$Tm)

allplayers <- subset(allplayers, Tm != "Tm")
allplayers <- subset(allplayers, Tm != "TOT")

head(allplayers, n=3)
##   Rk              Player Pos Age  Tm  G   MP  PER  TS% 3PAr   FTr ORB%
## 1  1 Kareem Abdul-Jabbar   C  33 LAL 80 2976 25.5 .616 .001  .379  7.6
## 3  2       Tom Abernethy  SF  26 GSW 10   39  3.2 .463 .000 1.000  2.8
## 4  2       Tom Abernethy  SF  26 IND 29  259  8.7 .458 .018  .339  7.8
##   DRB% TRB% AST% STL% BLK% TOV% USG%   OWS DWS   WS  WS/48  .1 OBPM DBPM
## 1 21.5 15.0 13.6  0.9  4.0 12.8 26.3   9.6 4.6 14.3   .230      3.9  1.4
## 3 20.2 11.4  2.9  1.2  0.0 31.6  6.4   0.0 0.0  0.0 -0.004     -6.0 -0.2
## 4  9.1  8.4  8.8  1.1  0.7  8.5 10.9   0.2 0.2  0.4   .072     -2.0 -0.8
##    BPM VORP
## 1  5.3  5.4
## 3 -6.2  0.0
## 4 -2.8 -0.1

Step 2: Cleaning the dataframes.

With all three datasets in place, I continue cleaning the data. First off, I want to make sure I uniquely define any players with the same names (ex. Tim Hardaway and Tim Hardaway JR.).

I’ll first find duplicates in the Alldrafted and playerinfo df’s:

unique_test <- data.frame(table(Alldrafted$Player))
unique_test[unique_test$Freq > 1,]
##              Var1 Freq
## 140 Charles Smith    2
Alldrafted$Player[Alldrafted$Player == "Charles Smith" & Alldrafted$Pk == 26] <- "Charles Cornelius Smith"

unique_test2 <- data.frame(table(playerinfo$Player))
unique_test2[unique_test2$Freq > 1,]
## [1] Var1 Freq
## <0 rows> (or 0-length row.names)
playerinfo$Player[playerinfo$Player == "Charles Daniel Smith" & playerinfo$Pick == 3] <- "Charles Smith"

Now for the allplayers dataframe. This one is a bit more difficult as there are records for each season played by a player, so there will for sure be duplicates of player names. I will take two approaches to combat this.

The allplayers dataset does not include the season played but does include the age of the player. When I compile the career stats, I want to make sure I am not adding the stats of two seperate players into one. Here I aggregate by player and age and sum games played stats. If there are players at an age with more than 84 games played, I investigate to see if there are duplicate names:

c1 <- allplayers
c1$G <- as.numeric(levels(c1$G))[c1$G]
c2 <- aggregate(c1$G, list(c1$Player, c1$Age), sum)
c2 <- subset(c2, x>84)
names(c2) <- c("Player", "Age", "Games")
c2test <- merge(c2, Alldrafted, all.x = T)
head(c2test)
##          Player Age Games Pk   Tm Yrs   G   WS WS.48
## 1 Charles Jones  23   107 NA <NA>  NA  NA   NA    NA
## 2 Charles Jones  24    99 NA <NA>  NA  NA   NA    NA
## 3 Charles Smith  26   109  3  PHI   9 564 40.9  0.12
## 4 Charles Smith  22    94  3  PHI   9 564 40.9  0.12
## 5 Charles Smith  30    95  3  PHI   9 564 40.9  0.12
## 6 Charles Smith  23    99  3  PHI   9 564 40.9  0.12

Now with the list of players with more than 84 games played in a season, I attempt to remove any players from allplayers who aren’t in the targeted draft group (Round 1, 1980-2012) or change the names of any duplicated players.

allplayers <- allplayers[!(allplayers$Player == "Charles Smith" & allplayers$Tm == "BOS"),]
allplayers <- allplayers[!(allplayers$Player == "Charles Smith" & allplayers$Tm == "MIN"),]
allplayers$Player[allplayers$Player == "Charles Smith" & allplayers$Pos == "SG"] <- "Charles Cornelius Smith"
allplayers <- allplayers[!(allplayers$Player == "Dee Brown" & allplayers$Pos == "PG" & allplayers$Tm == "UTA"),]
allplayers <- allplayers[!(allplayers$Player == "Dee Brown" & allplayers$Pos == "PG" & allplayers$Tm == "WAS"),]
allplayers <- allplayers[!(allplayers$Player == "Dee Brown" & allplayers$Pos == "PG" & allplayers$Tm == "PHO"),]
allplayers <- allplayers[!(allplayers$Player == "Dee Brown" & allplayers$Pos == "PG" & allplayers$Tm == "TOT"),]
allplayers <- allplayers[!(allplayers$Player == "Gerald Henderson" & allplayers$Pos == "PG"),]
allplayers <- allplayers[!(allplayers$Player == "Glen Rice" & allplayers$Pos == "SG"),]
allplayers <- allplayers[!(allplayers$Player == "Glenn Robinson" & allplayers$Tm == "IND"),]
allplayers <- allplayers[!(allplayers$Player == "Glenn Robinson" & allplayers$Pos == "SG"),]
allplayers <- allplayers[!(allplayers$Player == "Glenn Robinson" & allplayers$Tm == "PHI" & allplayers$Age == 21),]
allplayers <- allplayers[!(allplayers$Player == "Larry Drew" & allplayers$Tm == "PHI"),]
allplayers <- allplayers[!(allplayers$Player == "Marcus Williams" & allplayers$Tm == "LAC"),]
allplayers <- allplayers[!(allplayers$Player == "Marcus Williams" & allplayers$Tm == "SAS"),]
allplayers <- allplayers[!(allplayers$Player == "Michael Smith" & allplayers$Tm == "SAC"),]
allplayers <- allplayers[!(allplayers$Player == "Michael Smith" & allplayers$Tm == "MEM"),]
allplayers <- allplayers[!(allplayers$Player == "Michael Smith" & allplayers$Tm == "WAS"),]
allplayers <- allplayers[!(allplayers$Player == "Patrick Ewing" & allplayers$Tm == "NOP"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "GSW"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "NOP"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "CHO"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "OKC"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "SAS" & allplayers$G ==20),]
allplayers$Player[allplayers$Player == "Tim Hardaway" & allplayers$Tm == "NYK"] <- "Tim Hardaway Jr"
allplayers$Player[allplayers$Player == "Tim Hardaway" & allplayers$Tm == "ATL"] <- "Tim Hardaway Jr"
allplayers <- allplayers[!(allplayers$Player == "Mike Dunleavy" & allplayers$Tm == "HOU"),]
allplayers <- allplayers[!(allplayers$Player == "Mike Dunleavy" & allplayers$Tm == "SAS"),]
allplayers <- allplayers[!(allplayers$Player == "Mike Dunleavy" & allplayers$Tm == "PHI"),]
allplayers <- allplayers[!(allplayers$Player == "Mike Dunleavy" & allplayers$Tm == "MIL" & allplayers$Pos == "PG"),]
allplayers <- allplayers[!(allplayers$Player == "Eddie Johnson" & allplayers$Tm == "ATL"),]
allplayers <- allplayers[!(allplayers$Player == "Eddie Johnson" & allplayers$Tm == "CLE"),]
allplayers <- allplayers[!(allplayers$Player == "Eddie Johnson" & allplayers$Tm == "SEA" & allplayers$G == 24),]

Second Attempt at finding duplicate names. In the allplayers dataset, if a player was traded midseason, they would have three records for that season: one for the stats on their initial team, one for the stats on their second team. Knowing this, I find rows that have the same player and age. I then match these players against players who have been traded midseason. If they were traded midseason, there is the possibility that there were two players with the same name who played at the same age.

### I have saved the allplayers file to a csv and have loaded the original data
c3 <- read.csv("allplayers.csv")
c3$Player <- gsub("\\*", "", c3$Player)
c3$Tm <- teamsub(c("SDC", "VAN", "NJN", "CHH", "NOH", "NOK", "KCK", "WSB", "SEA", "CHA", "PHX"), c("LAC", "MEM", "BRK", "NOP", "NOP", "NOP", "SAC", "WAS", "OKC", "CHO", "PHO"), c3$Tm)
c3$Player <- as.character(c3$Player)
c3$Age <- as.numeric.factor(c3$Age)
c3 <- c3[c3$Tm != "Tm",]

TradedPlayers <- subset(c3, Tm == "TOT")
TradedPlayers <- TradedPlayers[c(3,5,6)]

c3dupl <- c3[c(3,5,6)]
c3dupl <- c3dupl[c3dupl$Tm != "TOT",]
c3dupl$Freq <- 1
c3dupl <- aggregate(c3dupl$Freq, by = list(c3dupl$Player, c3dupl$Age), sum)
c3dupl <- subset(c3dupl, x >1)
c3check <- merge(c3dupl, TradedPlayers, by.x = c("Group.1", "Group.2"), by.y = c("Player", "Age"), all.x = T)
head(c3check[complete.cases(c3check) == F,])
##           Group.1 Group.2 x   Tm
## 241 Charles Jones      23 2 <NA>
## 242 Charles Jones      24 2 <NA>
## 243 Charles Jones      26 2 <NA>
## 246 Charles Smith      23 3 <NA>
## 247 Charles Smith      26 2 <NA>
## 248 Charles Smith      27 2 <NA>

After further investigation, I have ensured that the allplayers dataframe has unique player names for the targeted group.

We now have data frame with every draft selection, their draft team and their pick position. Some players were however traded on draft day or the subsequent days after to another team. To account for this, I will utilize some of the information provided in the playerinfo df on draft day trades. For players traded, I will change their Team variable in the Alldrafted df to the corresponding team they’ve been traded to:

draft_day_trades <- subset(playerinfo, select = c("Player", "Pick", "Draft.Trades", "Year"))
draft_day_trades$Draft.Trades[draft_day_trades$Draft.Trades == ""] <- NA
draft_day_trades <- draft_day_trades[complete.cases(draft_day_trades),]
draft_day_trades$Team <- substr(draft_day_trades$Draft.Trades, 8, 10)
draft_day_trades <- subset(draft_day_trades, select = -Draft.Trades)
names(draft_day_trades) <- c("Player", "Pk", "Year", "Team")
Alldrafted <- merge(Alldrafted, draft_day_trades, by = c("Player", "Pk"), all.x = T)
Alldrafted$Tm <- as.character(Alldrafted$Tm)
Alldrafted$Tm[is.na(Alldrafted$Team) == F] <- Alldrafted$Team[is.na(Alldrafted$Team) == F]
Alldrafted <- subset(Alldrafted, select = -c(8, 9))

I just want to check once more to see if there are any players that may have the wrong draft team. I check by seeing which players have no season stats with their drafted team and then investigate further to see if they were traded immediately after the draft.

drafttradetest <- merge(Alldrafted, allplayers, all.x = T, by.x = c("Player", "Tm"), by.y = c("Player", "Tm"))
drafttradetest2 <- drafttradetest[complete.cases(drafttradetest) == F,]
head(drafttradetest2, n=2)
##            Player  Tm Pk Yrs G.x WS.x WS.48   Rk  Pos  Age  G.y   MP  PER
## 223    Ansu Sesay DAL 30   4 127  1.4 0.051 <NA> <NA> <NA> <NA> <NA> <NA>
## 229 Anthony Avent DEN 15   6 352  0.3 0.002 <NA> <NA> <NA> <NA> <NA> <NA>
##      TS% 3PAr  FTr ORB% DRB% TRB% AST% STL% BLK% TOV% USG%       OWS  DWS
## 223 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 229 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
##     WS.y WS/48   .1 OBPM DBPM  BPM VORP
## 223 <NA>  <NA> <NA> <NA> <NA> <NA> <NA>
## 229 <NA>  <NA> <NA> <NA> <NA> <NA> <NA>
Alldrafted$Tm[Alldrafted$Player == "Brendan Haywood"] <-"WAS" 
Alldrafted$Tm[Alldrafted$Player == "Byron Houston" ] <- "GSW" 
Alldrafted$Tm[Alldrafted$Player == "Craig Brackins" ] <- "PHI"
Alldrafted$Tm[Alldrafted$Player == "Darrell Arthur" ] <- "MEM" 
Alldrafted$Tm[Alldrafted$Player == "Anthony Avent"] <- "MIL"
Alldrafted$Tm[Alldrafted$Player == "Efthimi Rentzias"] <- "DEN"   
Alldrafted$Tm[Alldrafted$Player == "Fred Roberts" ] <- "MIL"
Alldrafted$Tm[Alldrafted$Player == "Jamaal Tinsley" ] <- "IND" 
Alldrafted$Tm[Alldrafted$Player == "Larry Krystkowiak" ] <- "SAS"
Alldrafted$Tm[Alldrafted$Player == "Leandro Barbosa" ] <- "PHO"
Alldrafted$Tm[Alldrafted$Player == "Kenny Battle" ] <- "PHO"
Alldrafted$Tm[Alldrafted$Player == "Mirsad Turkcan" ] <- "HOU" 
Alldrafted$Tm[Alldrafted$Player == "Nikola Mirotic" ] <- "CHI"
Alldrafted$Tm[Alldrafted$Player == "Norris Cole" ] <- "MIA"
Alldrafted$Tm[Alldrafted$Player == "Pat Garrity" ] <- "PHO"
Alldrafted$Tm[Alldrafted$Player == "Randy Foye" ] <- "MIN"
Alldrafted$Tm[Alldrafted$Player == "Tracy Murray" ] <- "POR"

Next I want to prepare the playerinfo dataframe to be merged with our Alldrafted df. I am going to remove some of the unnecessary columns first.

playerinfo <- subset(playerinfo, select =  -c(Rnd, Team, Draft.Trades))
names(playerinfo) <- c("Draft Year", "Pk", "Player", "POS", "Height", "Weight", "Age", "YOS", "PreNBA")

I’m adding a new column that details whether or not the draftee came from College (c), Overseas (o) or High school (h). Players from High School or Overseas have the region of their school or team in brackets (ex. Glynn Academy (Georgia)). I start by finding only the players from Overseas or High School as they have a the region of their former team in brackets. I use regex to extract only players with brackets in their PreNba variable:

IorH <- playerinfo[which(grepl("\\(", playerinfo$PreNBA) == T),]

There are however some draftees who have brackets in their PreNBA column:

playerinfo$PreNBA[playerinfo$Player == "Alfredrick Hughes"] <- "Loyola"
playerinfo$PreNBA[playerinfo$Player == "John Salmons"] <- "University of Miami"
playerinfo$PreNBA[playerinfo$Player == "Ron Harper"] <- "Miami University"
playerinfo$PreNBA[playerinfo$Player == "Tim James"] <- "University of Miami"
playerinfo$PreNBA[playerinfo$Player == "Wally Szczerbiak"] <- "Miami University"

Let’s try again:

IorH <- playerinfo[which(grepl("\\(", playerinfo$PreNBA) == T),]
head(IorH)
##    Draft Year Pk          Player   POS Height Weight Age YOS
## 3        2006  1 Andrea Bargnani  PF/C    7-0    245  20  10
## 4        2002  1        Yao Ming     C    7-6    310  21   9
## 5        2001  1     Kwame Brown     C   6-11    290  19  12
## 6        2003  1    LeBron James SF/PF    6-8    250  18  13
## 8        2004  1   Dwight Howard  C/PF   6-11    265  18  12
## 41       2003  2   Darko Milicic  C/PF    7-0    275  18  10
##                                           PreNBA
## 3             Universo De'Longhi Treviso (Italy)
## 4                      Shanghai Dongfang (China)
## 5                        Glynn Academy (Georgia)
## 6        St. Vincent St. Mary High School (Ohio)
## 8  Southwest Atlanta Christian Academy (Georgia)
## 41                   KK Vrsac Swisslion (Serbia)

To figure out whether or not they came from High School or Overseas, I will match what’s in the brackets to a list of American states. I have checked and ensured there were no players who came from playing in Georgia (country not state). I create a new column in the IorH df that gives a TRue or False of whether or not it is a US state. I then add this column to the playerinfo df and give an “O” for False, an “H” for True and a “C” for NA.

states <- read.csv("states.csv", header = T)
head(states)
##        State Abbreviation
## 1    Alabama           AL
## 2     Alaska           AK
## 3    Arizona           AZ
## 4   Arkansas           AR
## 5 California           CA
## 6   Colorado           CO
states <- subset(states, select = State)

IorH$a <- str_extract(IorH$PreNBA, "\\(.*\\)")
IorH$a <- gsub("\\(", "", IorH$a)
IorH$a <- gsub("\\)", "", IorH$a)
IorH$b <- IorH$a %in% states$State
IorH <- subset(IorH, select = c(Player, Pk, b))

playerinfo <- merge(playerinfo, IorH, by = c("Player", "Pk"), all.x = T)


UHI <- function (x) {
      ifelse(is.na(x) == T, "C",
             ifelse(x == "TRUE", "H", "O"))
}
playerinfo$b <- UHI(playerinfo$b)
names(playerinfo) <- c("Player", "Pk", "Draft.Year", "POS", "Height", "Weight", "Age", "YOS", "PreNBA", "Source")


head(playerinfo)
##         Player Pk Draft.Year POS Height Weight Age YOS       PreNBA Source
## 1   A.C. Green 23       1985  PF    6-9    220  21  16 Oregon State      C
## 2 Aaron Brooks 26       2007  PG    6-0    161  22   8       Oregon      C
## 3  Aaron McKie 17       1994   G    6-5    209  21  14       Temple      C
## 4    Acie Earl 19       1993  PF   6-10    240  23   4         Iowa      C
## 5     Acie Law 11       2007  SG    6-3    202  22   4    Texas A&M      C
## 6   Adam Keefe 10       1992   F    6-9    230  22   9     Stanford      C

Now to assign one position for each of the players. I want to break it down into the classic 5 positions used. I will use the playerinfo df but for any players whose position isn’t one of the classic 5, I will assign the position of their rookie year (from the allplayers df).

allplayers$Age <- as.numeric(levels(allplayers$Age))[allplayers$Age]
POS_rookie <- allplayers %>% 
    group_by(Player) %>% 
    slice(which.min(Age))

POS_check <- merge(playerinfo, POS_rookie, all.x = T, by.x = "Player", by.y = "Player")
POS_check$Pos <- as.character(POS_check$Pos)
POS_check$Pos[is.na(POS_check$Pos) ==T] <- POS_check$POS
head(POS_check[complete.cases(POS_check) == F,], n=2)
##            Player Pk Draft.Year POS Height Weight Age.x YOS     PreNBA
## 108 Brian Jackson 26       1981  PF    6-8    200    22   0 Utah State
## 126 Calvin Duncan 30       1985   G    6-4    195    24   1        VCU
##     Source   Rk Pos Age.y   Tm    G   MP  PER  TS% 3PAr  FTr ORB% DRB%
## 108      C <NA>  PF    NA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 126      C <NA>  PG    NA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
##     TRB% AST% STL% BLK% TOV% USG%       OWS  DWS   WS WS/48   .1 OBPM DBPM
## 108 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>  <NA> <NA> <NA> <NA>
## 126 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>  <NA> <NA> <NA> <NA>
##      BPM VORP
## 108 <NA> <NA>
## 126 <NA> <NA>
POS_check$POS[POS_check$POS == "C"] <- 5
POS_check$POS[POS_check$POS == "PF"] <- 4
POS_check$POS[POS_check$POS == "SF"] <- 3
POS_check$POS[POS_check$POS == "SG"] <- 2
POS_check$POS[POS_check$POS == "PG"] <- 1


POS_check$Pos[POS_check$Pos == "C"] <- 5
POS_check$Pos[POS_check$Pos == "PF"] <- 4
POS_check$Pos[POS_check$Pos == "SF"] <- 3
POS_check$Pos[POS_check$Pos == "SG"] <- 2
POS_check$Pos[POS_check$Pos == "PG"] <- 1

head(POS_check, n=2)
##         Player Pk Draft.Year POS Height Weight Age.x YOS       PreNBA
## 1   A.C. Green 23       1985   4    6-9    220    21  16 Oregon State
## 2 Aaron Brooks 26       2007   1    6-0    161    22   8       Oregon
##   Source Rk Pos Age.y  Tm  G   MP  PER  TS% 3PAr  FTr ORB% DRB% TRB% AST%
## 1      C 98   4    22 LAL 82 1542 11.8 .564 .015 .430 12.4 15.5 14.0  4.2
## 2      C 60   1    23 HOU 51  608 13.1 .535 .484 .218  2.5  7.9  5.2 23.5
##   STL% BLK% TOV% USG%   OWS DWS  WS WS/48  .1 OBPM DBPM  BPM VORP
## 1  1.5  1.7 17.7 14.7   1.4 2.0 3.3  .103     -0.5  0.8  0.3  0.9
## 2  1.1  0.6 15.1 21.8   0.6 0.8 1.4  .112      0.0 -1.8 -1.8  0.0
POS_Func <- function (x) {
  ifelse(POS_check$POS != 1|2|3|4|5, POS_check$Pos, POS_check$POS)
}
POS_check$POS <- POS_Func(POS_check$POS)

POS_check <- POS_check[c(1,4)]

POS_check$POS[POS_check$POS == 5] <- "C"
POS_check$POS[POS_check$POS == 4] <- "PF"
POS_check$POS[POS_check$POS == 3] <- "SF"
POS_check$POS[POS_check$POS == 2] <- "SG"
POS_check$POS[POS_check$POS == 1] <-"PG"

playerinfo <- merge(playerinfo, POS_check, by.x = "Player", by.y = "Player") 
playerinfo <- playerinfo[,c(1,2,3,11,5,6,7,8,9,10)]
names(playerinfo) <- c("Player", "Pk", "Draft.Year", "POS", "Height", "Weight", "Age", "YOS", "PreNBA", "Source")
head(playerinfo)
##         Player Pk Draft.Year POS Height Weight Age YOS       PreNBA Source
## 1   A.C. Green 23       1985  PF    6-9    220  21  16 Oregon State      C
## 2 Aaron Brooks 26       2007  PG    6-0    161  22   8       Oregon      C
## 3  Aaron McKie 17       1994  SG    6-5    209  21  14       Temple      C
## 4    Acie Earl 19       1993   C   6-10    240  23   4         Iowa      C
## 5     Acie Law 11       2007  PG    6-3    202  22   4    Texas A&M      C
## 6   Adam Keefe 10       1992  PF    6-9    230  22   9     Stanford      C

Step 3: Aggregating Stats

I can now begin the process of aggregating season stats to each player to find their total career stats. I begin by creating a new df careerstats and converting names and classes of the variables:

careerstats <- subset(allplayers, select = c(Player, Age, Tm, G, MP, PER, WS, `WS/48`, BPM, VORP))
names(careerstats) <- c("Player", "Age", "Team", "Games", "Minutes", "PER", "WS", "WS.48", "BPM", "VORP")
lapply(careerstats, class)
## $Player
## [1] "character"
## 
## $Age
## [1] "numeric"
## 
## $Team
## [1] "character"
## 
## $Games
## [1] "factor"
## 
## $Minutes
## [1] "factor"
## 
## $PER
## [1] "factor"
## 
## $WS
## [1] "factor"
## 
## $WS.48
## [1] "factor"
## 
## $BPM
## [1] "factor"
## 
## $VORP
## [1] "factor"
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
careerstats$Games <- as.numeric.factor(careerstats$Games)
careerstats$WS <- as.numeric.factor(careerstats$WS)
careerstats$WS.48 <- as.numeric.factor(careerstats$WS.48)
careerstats$PER <- as.numeric.factor(careerstats$PER)
careerstats$BPM <- as.numeric.factor(careerstats$BPM)
careerstats$VORP <- as.numeric.factor(careerstats$VORP)
careerstats$Minutes <- as.numeric.factor(careerstats$Minutes)
head(careerstats)
##                Player Age Team Games Minutes  PER   WS  WS.48  BPM VORP
## 1 Kareem Abdul-Jabbar  33  LAL    80    2976 25.5 14.3  0.230  5.3  5.4
## 3       Tom Abernethy  26  GSW    10      39  3.2  0.0 -0.004 -6.2  0.0
## 4       Tom Abernethy  26  IND    29     259  8.7  0.4  0.072 -2.8 -0.1
## 5         Alvan Adams  26  PHO    75    2054 20.3  7.7  0.180  5.3  3.8
## 6      Darrell Allums  22  DAL    22     276  5.3 -0.3 -0.054 -5.6 -0.2
## 7      Tiny Archibald  32  BOS    80    2820 14.3  6.9  0.118 -1.3  0.5

Check for any NA’s. We run into one of the most challenging decisions of the project. After this check, I realize there are NA’s value for players who did not play a minute in the NBA. For some values like Seasons or Win Shares, I could easily assign a 0. The problem was that for a stat like PER and VORP, it is relevant to the other players in the league - and these both represent an impact you had(whether negative or positive). If a player never plays a game, is that as bad as playing the worst game possible? Or should they find the sweet spot where they had no impact on the game at all (this is completely further down on a different dataframe):

careerstats[complete.cases(careerstats) == F,]
##               Player Age Team Games Minutes PER WS WS.48  BPM VORP
## 12153     Josh Davis  25  HOU     1       0  NA  0    NA -6.3    0
## 12484    Alex Scales  27  SAS     1       0  NA  0    NA -4.4    0
## 13485 Stephane Lasme  25  GSW     1       0  NA  0    NA -6.4    0
## 14491  JamesOn Curry  24  LAC     1       0  NA  0    NA -6.9    0
## 16452   Damion James  25  BRK     2       0  NA  0    NA -5.2    0
careerstats$WS[is.na(careerstats$WS) == T] <- 0
careerstats$WS.48[is.na(careerstats$WS.48) == T] <- 0
careerstats$PER[careerstats$Minutes == 0] <- mean(careerstats$PER[careerstats$Games == 1 & careerstats$Minutes < 5], na.rm = T)

Merge records for players traded midseason. If a there is a record with the same Player and Age, I know this is an instance where a player was traded. So I add up (WS, Seasons, Minutes, Games) and average (BPM, VORP, PER, WS/48) for every instance of a unique player and age combination.:

sumstats <- aggregate(careerstats[c(4,5,7)], list(careerstats$Player, careerstats$Age), sum)
meanstats <- aggregate(careerstats[c(6, 8, 9, 10)], list(careerstats$Player, careerstats$Age), mean) 
newcareerstats <- merge(sumstats,meanstats)
names(newcareerstats) <- c("Player", "Age", "Games", "Minutes", "WS", "PER", "WS.48", "BPM", "VORP")

Calculate the number of seasons played by each player:

careerseasonsplayed <- count(newcareerstats, c('Player'))
names(careerseasonsplayed) <-  c('Player', 'Seasons')
head(careerseasonsplayed)
##          Player Seasons
## 1    A.C. Green      16
## 2 A.J. Bramlett       1
## 3  A.J. English       2
## 4   A.J. Guyton       3
## 5  A.J. Hammons       1
## 6    A.J. Price       6

Aggregating each player’s stats into one record (their total career stats):

final_careerstats <- merge(aggregate(newcareerstats[c(3,4,5)], list(newcareerstats$Player), sum),
                           aggregate(newcareerstats[c(6, 7, 8, 9)], list(newcareerstats$Player), mean))
names(final_careerstats) <- c("Player", "Games", "Minutes", "WS", "PER", "WS.48", "BPM", "VORP")
final_careerstats <- merge(careerseasonsplayed, final_careerstats)
head(final_careerstats)
##          Player Seasons Games Minutes   WS       PER       WS.48
## 1    A.C. Green      16  1278   36552 99.4 14.075000  0.12618750
## 2 A.J. Bramlett       1     8      61 -0.2 -0.400000 -0.12900000
## 3  A.J. English       2   151    3108  1.1 11.550000  0.01650000
## 4   A.J. Guyton       3    80    1246  0.3  4.366667 -0.14133333
## 5  A.J. Hammons       1    22     163  0.0  8.400000 -0.00100000
## 6    A.J. Price       6   261    3929  4.7 11.005556  0.03288889
##          BPM        VORP
## 1   0.093750  1.38437500
## 2 -16.800000 -0.20000000
## 3  -5.100000 -1.20000000
## 4 -11.000000 -0.33333333
## 5  -5.600000 -0.10000000
## 6  -3.522222 -0.05555556

Now I append the career stats to the drafted players from 1980-2012:

final_careerstats <- merge(Alldrafted, final_careerstats, by.x = "Player", by.y = "Player", all.x = T)
winstatscareer <- subset(final_careerstats, select = -c(4,5,6,7))
head(winstatscareer[complete.cases(winstatscareer) == F, ], n=3)
##            Player Pk  Tm Seasons Games Minutes WS.y PER WS.48.y BPM VORP
## 108 Brian Jackson 26 POR      NA    NA      NA   NA  NA      NA  NA   NA
## 126 Calvin Duncan 30 CHI      NA    NA      NA   NA  NA      NA  NA   NA
## 233   Deon Thomas 28 DAL      NA    NA      NA   NA  NA      NA  NA   NA
names(winstatscareer) <- c("Player", "Pk", "Tm", "Seasons", "Games", "Minutes", "WS", "PER", "WS.48", "BPM", "VORP")

head(winstatscareer, n=3)
##         Player Pk  Tm Seasons Games Minutes   WS      PER      WS.48
## 1   A.C. Green 23 LAL      16  1278   36552 99.4 14.07500 0.12618750
## 2 Aaron Brooks 26 HOU       9   613   13250 19.1 12.52222 0.05938889
## 3  Aaron McKie 17 POR      13   793   19156 40.0 11.75769 0.09123077
##          BPM      VORP
## 1  0.0937500 1.3843750
## 2 -2.3888889 0.1222222
## 3 -0.1384615 0.9346154

I need to account for any players with NA values. This was a bit difficult because for stats like PER, BPM and VORP, you cannot just assign players who have never played an NBA minute 0. Instead I decided to average out these stats for players who have only played one game and assign those players with no games the average.

winstatscareer$Seasons[is.na(winstatscareer$Seasons) == T] <- 0
winstatscareer$Games[is.na(winstatscareer$Games) == T] <- 0
winstatscareer$Minutes[is.na(winstatscareer$Minutes) == T] <- 0
winstatscareer$WS[is.na(winstatscareer$WS) == T] <- 0
winstatscareer$PER[is.na(winstatscareer$PER) == T] <- mean(winstatscareer$PER[careerstats$Games ==1], na.rm = T) 
winstatscareer$BPM[is.na(winstatscareer$BPM) == T] <- mean(winstatscareer$BPM[careerstats$Games ==1], na.rm = T) 
winstatscareer$WS.48[is.na(winstatscareer$WS.48) == T] <- 0
winstatscareer$VORP[is.na(winstatscareer$VORP) == T] <- 0

I begin the process of comparing the zscores of each player’s stats. I will find the average and standard deviation for each draft position and then find each player’s zscore for each recorded stat.

avestatscareer <- aggregate(winstatscareer[,4:11], list(winstatscareer$Pk), mean)
avestatscareersd <- aggregate(winstatscareer[,4:11], list(winstatscareer$Pk), sd)
names(avestatscareer) <- c("Pk", "Avg.Seasons", "Avg.G", "Avg.Minutes", "Avg.WS", "Avg.PER", "Avg.WS.48", "Avg.BPM", "Avg.VORP")
names(avestatscareersd) <- c("Pk", "sd.Seasons", "sd.G", "sd.Minutes", "sd.WS", "sd.PER", "sd.WS.48", "sd.BPM", "sd.VORP")

total_career_stats <- merge(winstatscareer, avestatscareer)
total_career_stats <- merge(total_career_stats, avestatscareersd)
head(total_career_stats, n=3)
##   Pk        Player  Tm Seasons Games Minutes    WS      PER      WS.48
## 1  1    Tim Duncan SAS      19  1392   47368 206.3 23.93158 0.20610526
## 2  1     Joe Smith GSW      16  1030   27022  60.1 14.01250 0.09565625
## 3  1 Dwight Howard ORL      13   954   33291 121.7 21.52308 0.17284615
##         BPM     VORP Avg.Seasons    Avg.G Avg.Minutes   Avg.WS  Avg.PER
## 1  5.278947 4.705263    11.87879 750.8788    24802.36 75.66061 18.53653
## 2 -1.950000 0.400000    11.87879 750.8788    24802.36 75.66061 18.53653
## 3  2.415385 3.000000    11.87879 750.8788    24802.36 75.66061 18.53653
##   Avg.WS.48  Avg.BPM Avg.VORP sd.Seasons     sd.G sd.Minutes    sd.WS
## 1 0.1270197 1.447196 2.299122   4.128843 300.0325   11197.47 56.64677
## 2 0.1270197 1.447196 2.299122   4.128843 300.0325   11197.47 56.64677
## 3 0.1270197 1.447196 2.299122   4.128843 300.0325   11197.47 56.64677
##     sd.PER  sd.WS.48   sd.BPM  sd.VORP
## 1 4.617105 0.0581278 2.902805 1.840581
## 2 4.617105 0.0581278 2.902805 1.840581
## 3 4.617105 0.0581278 2.902805 1.840581
total_career_stats <- mutate(total_career_stats, Z.Season = (Seasons - Avg.Seasons) / sd.Seasons)
total_career_stats <- mutate(total_career_stats, Z.Games = (Games - Avg.G) / sd.G)
total_career_stats <- mutate(total_career_stats, Z.Minutes = (Minutes - Avg.Minutes) / sd.Minutes)
total_career_stats <- mutate(total_career_stats, Z.WS = (WS - Avg.WS) / sd.WS)
total_career_stats <- mutate(total_career_stats, Z.WS.48 = (WS.48 - Avg.WS.48) / sd.WS.48)
total_career_stats <- mutate(total_career_stats, Z.PER = (PER - Avg.PER) / sd.PER)
total_career_stats <- mutate(total_career_stats, Z.BPM = (BPM - Avg.BPM) / sd.BPM)
total_career_stats <- mutate(total_career_stats, Z.VORP = (VORP - Avg.VORP) / sd.VORP)

total_career_stats <- total_career_stats[,-c(12:27)]
head(total_career_stats, n=3)
##   Pk        Player  Tm Seasons Games Minutes    WS      PER      WS.48
## 1  1    Tim Duncan SAS      19  1392   47368 206.3 23.93158 0.20610526
## 2  1     Joe Smith GSW      16  1030   27022  60.1 14.01250 0.09565625
## 3  1 Dwight Howard ORL      13   954   33291 121.7 21.52308 0.17284615
##         BPM     VORP  Z.Season   Z.Games Z.Minutes       Z.WS    Z.WS.48
## 1  5.278947 4.705263 1.7247474 2.1368394 2.0152443  2.3062109  1.3605464
## 2 -1.950000 0.400000 0.9981517 0.9303033 0.1982266 -0.2746954 -0.5395600
## 3  2.415385 3.000000 0.2715560 0.6769974 0.7580852  0.8127453  0.7883743
##        Z.PER      Z.BPM     Z.VORP
## 1  1.1684918  1.3200169  1.3072732
## 2 -0.9798410 -1.1703151 -1.0318057
## 3  0.6468442  0.3335355  0.3807921

Adding the rest of the information for each player from the playerinfo df:

tc_stat <- merge(total_career_stats, playerinfo, by.x = "Player", by.y = "Player", all.x = T) 
tc_stat[complete.cases(tc_stat) == F,]
##  [1] Player     Pk.x       Tm         Seasons    Games      Minutes   
##  [7] WS         PER        WS.48      BPM        VORP       Z.Season  
## [13] Z.Games    Z.Minutes  Z.WS       Z.WS.48    Z.PER      Z.BPM     
## [19] Z.VORP     Pk.y       Draft.Year POS        Height     Weight    
## [25] Age        YOS        PreNBA     Source    
## <0 rows> (or 0-length row.names)
tc_stat <- tc_stat[,c(1,2,22,3,21,23,24,25,27,28,4,5,6,7,9,8,10,11,12,13,14,15,16,17,18,19)]
colnames(tc_stat)[2] <- "Pk"

And finally computing the combined zscores of the PER, VORP and Win Share stats:

tc_stat <- mutate(tc_stat, ZCUM = (Z.WS + Z.PER + Z.VORP)/3)
head(tc_stat, n=3)
##         Player Pk POS  Tm Draft.Year Height Weight Age       PreNBA Source
## 1   A.C. Green 23  PF LAL       1985    6-9    220  21 Oregon State      C
## 2 Aaron Brooks 26  PG HOU       2007    6-0    161  22       Oregon      C
## 3  Aaron McKie 17  SG POR       1994    6-5    209  21       Temple      C
##   Seasons Games Minutes   WS      WS.48      PER        BPM      VORP
## 1      16  1278   36552 99.4 0.12618750 14.07500  0.0937500 1.3843750
## 2       9   613   13250 19.1 0.05938889 12.52222 -2.3888889 0.1222222
## 3      13   793   19156 40.0 0.09123077 11.75769 -0.1384615 0.9346154
##    Z.Season   Z.Games Z.Minutes      Z.WS   Z.WS.48      Z.PER     Z.BPM
## 1 2.0329221 2.6335190 3.0282707 3.9402446 1.1141943 0.74676450 1.0768054
## 2 0.7552549 0.9557381 0.7323317 0.1734760 0.1561328 0.24401134 0.3350919
## 3 1.3674761 1.2115115 0.9632230 0.9280125 0.7559631 0.05872295 0.9346929
##       Z.VORP       ZCUM
## 1  2.1924935 2.29316756
## 2 -0.1884345 0.07635097
## 3  0.7427288 0.57648808

Before I begin comparing, I need to weigh thhe scores based on the performance of the draft classes - bumping up players from poor classes and down players from great classes. First I need to compare the draft classes. I will compute the average ZCUM scores for each draft class and then find each class’ zscore:

game_test <- subset(tc_stat, select = c(1,2,5,27))

ClassMean <- aggregate(game_test$ZCUM, list(game_test$Draft.Year), mean)
names(ClassMean) <- c("Draft.Year", "Mean.Class")
ClassSD <- aggregate(game_test$ZCUM, list(game_test$Draft.Year), sd)
names(ClassSD) <- c("Draft.Year", "SD.Class")

dclass <- merge(ClassMean, ClassSD)
dclass <- mutate(dclass, AVG = mean(dclass$Mean.Class))
dclass <- mutate(dclass, SD = mean(dclass$SD.Class))
dclass <- mutate(dclass, ZScore = (dclass$Mean.Class - dclass$AVG) / dclass$SD)
head(dclass)
##   Draft.Year  Mean.Class  SD.Class           AVG        SD      ZScore
## 1       1980 -0.23500672 0.6758249 -3.837419e-18 0.8646639 -0.27178968
## 2       1981  0.15293530 0.9492109 -3.837419e-18 0.8646639  0.17687254
## 3       1982 -0.09570833 0.7081607 -3.837419e-18 0.8646639 -0.11068848
## 4       1983 -0.01803811 0.8286382 -3.837419e-18 0.8646639 -0.02086141
## 5       1984  0.10263854 1.2817262 -3.837419e-18 0.8646639  0.11870340
## 6       1985  0.09501193 1.1637063 -3.837419e-18 0.8646639  0.10988308
game_test <- merge(game_test, dclass, all.x = T)
game_test <- game_test[,-c(5:8)]

I create a grading system based on the zscores of the draft classes where better performing classes receive a higher grade. I then calculate a new zscore that will add 1/32 of the standard deviation of the distribution of class averages for each grade jump (lower grades receiving the highest bump in score):

as.numeric(mean(game_test$ZCUM))
## [1] -2.188528e-18
sd(game_test$ZCUM)/32
## [1] 0.02763328
game_test$Grade <- cut(game_test$ZScore, breaks = c(-Inf, -.3, -.2, -.1, 0, .1, .2, .3, +Inf), right = T, include.lowest = T, labels = c("H", "G", "F", "E", "D", "C", "B", "A"))

game_test <- mutate(game_test, 
                    New.ZCUM = ifelse(game_test$Grade == "A", game_test$ZCUM -0.09651498, 
                                ifelse(game_test$Grade == "B", game_test$ZCUM - 0.06893927,
                                  ifelse(game_test$Grade == "C", game_test$ZCUM - 0.04136356,
                                    ifelse(game_test$Grade == "D", game_test$ZCUM - 0.01378785, 
                                     ifelse(game_test$Grade == "E", game_test$ZCUM + 0.01378785,
                                       ifelse(game_test$Grade == "F", game_test$ZCUM + 0.04136356,
                                        ifelse(game_test$Grade == "G", game_test$ZCUM + 0.06893927,
                                          ifelse(game_test$Grade == "H", game_test$ZCUM + 0.09651498,
                                                 0)))))))))

head(game_test)
##   Draft.Year          Player Pk        ZCUM     ZScore Grade   New.ZCUM
## 1       1980     Jeff Ruland 25  1.56558689 -0.2717897     G  1.6345262
## 2       1980 Hawkeye Whitney 16 -0.32611397 -0.2717897     G -0.2571747
## 3       1980     Monti Davis 21  0.04723096 -0.2717897     G  0.1161702
## 4       1980    Wes Matthews 14 -0.23262998 -0.2717897     G -0.1636907
## 5       1980       Louis Orr 29  0.60506290 -0.2717897     G  0.6740022
## 6       1980      John Duren 19 -1.35892299 -0.2717897     G -1.2899837
game_test <- game_test[,c(1,2,7)]
tc_stat <- merge(tc_stat, game_test)

Step 4: Visualizing the Data

With my scores finalized, I can now begin exploring and plotting the data. My first two plots I want to create deal with comparing team totals. I create a new dataframe teamresults that finds the averages combined ZCUM for each of the teams. I also add which conference the team is a part of:

teamresults <- aggregate(tc_stat[,c(19:26,28)], list(tc_stat$Tm), mean)
names(teamresults) <- c("Tm", "Season", "Games", "Minutes", "WS", "WS.48", "PER", "BPM", "VORP", "ZCUM")
conferences <- c("E", "E", "E", "E", "E", "E", "W", "W", "E", "W", "W", "E", "W", "W", "W", "E", "E", "W", "W", "E", "W", "E", "E", "W", "W", "W", "W", "E", "W", "E")
teamresults$CONF <- conferences
teamresults$CONF[teamresults$CONF == "E"] <- "East"
teamresults$CONF[teamresults$CONF == "W"] <- "West"

And now to create the visualizations to help us compare teams drafting performances.

Our first bar chart will break it down by team and show an overall comparison of zscores for Seasons, WS, PER, BPM and VORP.

tbyt <- plot_ly(teamresults, x = teamresults$Tm, y = teamresults$Season, type = 'bar', name = 'Seasons') %>%
  add_trace(y = ~teamresults$WS, name = 'Win Shares') %>%
  add_trace(y = ~teamresults$PER, name = 'PER') %>%
  add_trace(y = ~teamresults$BPM, name = 'BPM') %>%
  add_trace(y = ~teamresults$VORP, name = 'VORP') %>%
  layout(title = "Team by Team Drafting Performance",
         xaxis = list(title = "Team", tickangle = -45), 
         yaxis = list(title = "Z Score"), 
         barmode = 'group',
         bargap = 0.35,
         autosize = F, width = 1400, height = 900)

The second bar chart will combine the WS, PER and WS stats to create my cumulative stat I will use to compare on a player by player basis. I also seperate the Eastern and Western conference teams in this chart.

zcum.order <- order(teamresults$ZCUM)
ordered.teamresults <- teamresults[zcum.order,]

Combined.ZCUM <- plot_ly(ordered.teamresults, x = ordered.teamresults$Tm, y = ordered.teamresults$ZCUM, 
        type = 'bar', color = ordered.teamresults$CONF, colors = "Set1") %>%
  layout(title = "Combined WS-PER-VORP ZScores by Team", 
         yaxis = list(title = "Combined ZScore"),
         xaxis = list(title = "Team",
                      type = "category",
                      categoryorder = "array",
                      categoryarray = sort(ordered.teamresults$ZCUM),
                      tickangle = -45,
                      ticks = "inside",
                      ticklen = 7),
         autosize = F, width = 1400, height = 900,
         showlegend = T)

Now I will create a chart for each team that breaks down their average Z statistic for every player drafted. First i create a new dataframe for every team, sort them by draft year and then create a list with the names of each of these dataframes. I run these dataframe created a for loop that will run through the list of each team and create the chart.

teamnames <- list(teamresults$Tm)
teamnames <- unlist(teamnames)

###Create the df for each team
for (l in 1:30) {
  teamname <- teamnames[l]
  assign(teamname, subset(tc_stat, tc_stat$Tm == teamnames[l]))
}

###Create the list of dataframes
team.list <- list()
for (h in 1:length(teamnames))
{
  team.list[[h]] <- get(teamnames[h])
}

###Run the list through the sorting function
Z.SORT <- function(c) {
  sort <- order(c$Draft.Year)
  c <- c[sort,]
}

team.list <- lapply(team.list, Z.SORT)


###Run the list of teams through the for loop to create a plot for each team.
teamplot <- for (q in 1:30) {
  teamdf <- team.list[q]
  teamdf <- as.data.frame.list(teamdf)
  q.plot <- plot_ly(
    teamdf,
    x = ~teamdf$Player,
    y = ~teamdf$ZCUM,
    color = teamdf$Pk,
    type = "bar",
    text = paste(teamdf$Draft.Year, "</br> Pk", teamdf$Pk),
    textposition = "outside",
    textfont = list(size = 25)) %>%
    colorbar(title = "Draft Position", len = 2.5) %>%
    layout(title = paste(teamdf$Tm, "Z Score Breakdown"), 
           margin = list(b = 130),
           yaxis = list(title = "Averaged PER-WS/48-VORP ZScore"),
           xaxis = list(title = "Draft Pick",
                        type = "category",
                        categoryorder = "array",
                        categoryarray = teamdf$Draft.Year,
                        tickangle = -45,
                        ticks = "inside",
                        ticklen = 7),
           autosize = F, width = 1400, height = 900,
           showlegend = TRUE, barmode = 'relative')
}