How well does your NBA team draft?
Comparing team’s draft success over the years is a tricky task. While you could easily compile a list of all drafted players by team and match, you’d be missing one very important aspect - where in the draft was the player picked? Drafting a player like Kawhi Leonard at 15 is a much more succesful draft pick than Lebron James at 1. The chance of selecting a Hall of Fame player is much greater the higher your draft selection is.
In this project, I have measured the success of a draft pick by comparing their career success to all others drafted at that draft position. I will be using a combination of the player’s PER, VORP and WS (all evenly weighted) and then utilzing the zscore of this combined stat as a measurement of success. I am only looking at first round picks (picks 1:30 in years where there weren’t 30 teams) between the years of 1980-2012.
I have also weighted their zscore based on the draft class. Selecting a player in a draft full of stars is not as difficult as selecting a player in a draft with only a select amount of stars. Draft classes with a higher combined score have been penalized while the scores of players in classes who have a lower average are bumped up.
The goal of this project is to see how teams compare in their draft results. While there are many factors that go into the success of a player’s career, I felt the combined PER, VORP and WS gives a decent representation of what a team would be looking for when selecting a player. Note that these stats are for the player’s whole career and therefore do not represent what the player actually contributed for their drafted team.
Let’s get started:
First i need to load the packages I will be using:
Here I define a function created to convert previous team names to their current names when cleaning the data (ex. SEA -> OKC):
teamsub <- function(pattern, replacement, x, ...) {
if (length(pattern)!=length(replacement)) {
stop("pattern and replacement do not have the same length.")
}
result <- x
for (i in 1:length(pattern)) {
result <- gsub(pattern[i], replacement[i], result, ...)
}
result
}
The next function I define is to convert factor variables to numeric:
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
Step 1: Setting up our dataframes
I initially downloaded datasets from basketballreference.com of each draft class and their relevant stats. Here I load these csv files in my project and combine them into one main dataframe. I also clean up the df by:
Removing some columns;
Giving 0 to any NA’s in the Seasons, Games, Win Shares or WS/48 stats;
Converting the older team names to the current team names:
OGnames <- sprintf("%sclass.csv", 1980:2012)
for (i in OGnames){
name <- gsub("class.csv", "", i)
name <- paste("class", name, sep = "")
assign(name, read.csv(i, header = T))
}
dfnames <- sprintf("class%s", 1980:2012)
Alldrafted <- rbind.fill(list(class1980, class1981, class1982, class1983, class1984, class1985, class1986, class1987, class1988, class1989, class1990, class1991, class1992, class1993, class1994, class1995, class1996, class1997, class1998, class1999, class2000, class2001, class2002, class2003, class2004, class2005, class2006, class2007, class2008, class2009, class2010, class2011, class2012))
Alldrafted$Player <- gsub("\\\\.*", "", Alldrafted$Player)
Alldrafted <- subset(Alldrafted, select = -c(Rk))
Alldrafted$Tm <- as.character(Alldrafted$Tm)
Alldrafted$Yrs[is.na(Alldrafted$Yrs)] <- 0
Alldrafted$G[is.na(Alldrafted$G)] <- 0
Alldrafted$WS[is.na(Alldrafted$WS)] <- 0
Alldrafted$WS.48[is.na(Alldrafted$WS.48)] <- 0
Alldrafted$Tm <- teamsub(c("SDC", "VAN", "NJN", "CHH", "NOH", "NOK", "KCK", "WSB", "SEA", "CHA", "PHX"), c("LAC", "MEM", "BRK", "NOP", "NOP", "NOP", "SAC", "WAS", "OKC", "CHO", "PHO"), Alldrafted$Tm)
head(Alldrafted)
## Pk Tm Player Yrs G WS WS.48
## 1 1 GSW Joe Barry Carroll 10 705 35.6 0.075
## 2 2 UTA Darrell Griffith 10 765 22.0 0.049
## 3 3 BOS Kevin McHale 13 971 113.0 0.180
## 4 4 CHI Kelvin Ransey 6 474 17.9 0.074
## 5 5 DEN James Ray 3 103 -0.3 -0.017
## 6 6 BRK Mike O'Koren 8 407 14.6 0.080
I also utilized data from Real GM that had more information on every first round draft selection which included their draft year, height, weight, Pre NBA Source. I add the data and clean up some of its columns by:
Converting the height variable to inches
Converting the team names to the proper acronyms:
playerinfo <- read.csv("draftedplayerinfo.csv", header = T, fill = T, stringsAsFactors=FALSE, fileEncoding="latin1")
playerinfo$Height <- gsub("11-May", "5-11", playerinfo$Height)
playerinfo$Height <- gsub("11-Jun", "6-11", playerinfo$Height)
playerinfo$Height <- gsub("10-Jun", "6-10", playerinfo$Height)
playerinfo$Height <- gsub("09-Jun", "6-9", playerinfo$Height)
playerinfo$Height <- gsub("08-Jun", "6-8", playerinfo$Height)
playerinfo$Height <- gsub("07-Jun", "6-7", playerinfo$Height)
playerinfo$Height <- gsub("06-Jun", "6-6", playerinfo$Height)
playerinfo$Height <- gsub("05-Jun", "6-5", playerinfo$Height)
playerinfo$Height <- gsub("04-Jun", "6-4", playerinfo$Height)
playerinfo$Height <- gsub("03-Jun", "6-3", playerinfo$Height)
playerinfo$Height <- gsub("02-Jun", "6-2", playerinfo$Height)
playerinfo$Height <- gsub("01-Jun", "6-1", playerinfo$Height)
playerinfo$Height <- gsub("10-May", "5-10", playerinfo$Height)
playerinfo$Height <- gsub("09-May", "5-9", playerinfo$Height)
playerinfo$Height <- gsub("08-May", "5-8", playerinfo$Height)
playerinfo$Height <- gsub("07-May", "5-7", playerinfo$Height)
playerinfo$Height <- gsub("06-May", "5-6", playerinfo$Height)
playerinfo$Height <- gsub("05-May", "5-5", playerinfo$Height)
playerinfo$Height <- gsub("03-May", "5-3", playerinfo$Height)
playerinfo$Height <- gsub("01-Jul", "7-1", playerinfo$Height)
playerinfo$Height <- gsub("02-Jul", "7-2", playerinfo$Height)
playerinfo$Height <- gsub("03-Jul", "7-3", playerinfo$Height)
playerinfo$Height <- gsub("04-Jul", "7-4", playerinfo$Height)
playerinfo$Height <- gsub("05-Jul", "7-5", playerinfo$Height)
playerinfo$Height <- gsub("06-Jul", "7-6", playerinfo$Height)
playerinfo$Height <- gsub("07-Jul", "7-7", playerinfo$Height)
playerinfo$Team <- teamsub(c("Portland Trail Blazers", "Milwaukee Bucks", "Toronto Raptors", "Houston Rockets", "Washington Wizards", "Cleveland Cavaliers", "Golden State Warriors" , "Orlando Magic", "New Jersey Nets", "Chicago Bulls", "Los Angeles Clippers", "Philadelphia Sixers", "San Antonio Spurs", "New York Knicks", "Sacramento Kings", "Charlotte Hornets (1988)", "Los Angeles Lakers", "Dallas Mavericks", "New Orleans Pelicans", "Minnesota Timberwolves", "Seattle SuperSonics", "Atlanta Hawks", "Vancouver Grizzlies", "Detroit Pistons", "Charlotte Hornets", "Miami Heat", "Indiana Pacers", "Phoenix Suns", "San Diego Clippers", "Boston Celtics", "Memphis Grizzlies", "Utah Jazz", "Denver Nuggets", "Oklahoma City Thunder", "Kansas City Kings", "Brooklyn Nets" ), c("POR", "MIL", "TOR", "HOU", "WAS", "CLE", "GSW", "ORL", "BRK", "CHI", "LAC", "PHI", "SAS", "NYK", "SAC", "NOP", "LAL", "DAL", "NOP", "MIN", "OKC", "ATL", "MEM", "DET", "CHO", "MIA", "IND", "PHO", "LAC", "BOS", "MEM", "UTA", "DEN", "OKC", "SAC", "BRK"), playerinfo$Team)
playerinfo$Draft.Trades <- teamsub(c("UTH", "SAN", "PHL", "GOS", "SDC", "VAN", "NJN", "CHH", "NOH", "NOK", "KCK", "WSB", "SEA", "CHA", "PHX"), c("UTA", "SAS", "PHI", "GSW", "LAC", "MEM", "BRK", "NOP", "NOP", "NOP", "SAC", "WAS", "OKC", "CHO", "PHO"), playerinfo$Draft.Trades)
head(Alldrafted)
## Pk Tm Player Yrs G WS WS.48
## 1 1 GSW Joe Barry Carroll 10 705 35.6 0.075
## 2 2 UTA Darrell Griffith 10 765 22.0 0.049
## 3 3 BOS Kevin McHale 13 971 113.0 0.180
## 4 4 CHI Kelvin Ransey 6 474 17.9 0.074
## 5 5 DEN James Ray 3 103 -0.3 -0.017
## 6 6 BRK Mike O'Koren 8 407 14.6 0.080
More cleaning of the playerinfo df:
Taking away the draft picks post 2012
Checking to see if the names from the Real Gm data match the names from the Basketball Reference data (ex. Penny Hardaway vs. Afernee Hardaway)
For some reason, Tim Duncan’s stats were not included so I add them in :
playerinfo <- subset(playerinfo, playerinfo$Year != 2013)
playerinfo <- subset(playerinfo, playerinfo$Year != 2014)
playerinfo <- subset(playerinfo, playerinfo$Year != 2015)
playerinfo <- subset(playerinfo, playerinfo$Year != 2016)
namecheck <- merge(Alldrafted, playerinfo, all.x = T, by = "Player")
investigate <- namecheck[complete.cases(namecheck) == F,]
playerinfo$Player[playerinfo$Player == "Alek Radojevic"] <- "Aleksandar Radojevic"
playerinfo$Player[playerinfo$Player == "Penny Hardaway"] <- "Anfernee Hardaway"
playerinfo$Player[playerinfo$Player == "Charles Daniel Smith"] <- "Charles Smith"
playerinfo$Player[playerinfo$Player == "Dontae Jones"] <- "Dontae' Jones"
playerinfo$Player[playerinfo$Player == "Efthimios Rentzias"] <- "Efthimi Rentzias"
playerinfo$Player[playerinfo$Player == "Frankie Williams"] <- "Frank Williams"
playerinfo$Player[playerinfo$Player == "Freddie Banks"] <- "Fred Banks"
playerinfo$Player[playerinfo$Player == "George Muresan"] <- "Gheorghe Muresan"
playerinfo$Player[playerinfo$Player == "Cadillac Anderson"] <- "Greg Anderson"
playerinfo$Player[playerinfo$Player == "J.R. Rider"] <- "Isaiah Rider"
playerinfo$Player[playerinfo$Player == "Jose Rafael Ortiz-Rijos"] <- "Jose Ortiz"
playerinfo$Player[playerinfo$Player == "Marcus D. Williams"] <- "Marcus Williams"
playerinfo$Player[playerinfo$Player == "Mo Taylor"] <- "Maurice Taylor"
playerinfo$Player[playerinfo$Player == "Mel Turpin"] <- "Melvin Turpin"
playerinfo$Player[playerinfo$Player == "Babby Araujo"] <- "Rafael Araujo"
playerinfo$Player[playerinfo$Player == "Saer Sene"] <- "Mouhamed Sene"
playerinfo$Player[playerinfo$Player == "Maybyner Nene"] <- "Nene Hilario"
playerinfo$Player[playerinfo$Player == "Roy Rogers, Jr."] <- "Roy Rogers"
playerinfo$Player[playerinfo$Player == "Jianlian Yi"] <- "Yi Jianlian"
timmy <- c("1997", "1", "1", "Tim Duncan", "SAS", "", "C", "83", "250", "21", "19", "Wake Forest")
playerinfo <-rbind(playerinfo, timmy)
head(playerinfo)
## Year Rnd Pick Player Team Draft.Trades Pos Height Weight Age
## 1 2007 1 1 Greg Oden POR C 7-0 273 19
## 2 2005 1 1 Andrew Bogut MIL C 7-0 260 20
## 3 2006 1 1 Andrea Bargnani TOR PF/C 7-0 245 20
## 4 2002 1 1 Yao Ming HOU C 7-6 310 21
## 5 2001 1 1 Kwame Brown WAS C 6-11 290 19
## 6 2003 1 1 LeBron James CLE SF/PF 6-8 250 18
## YOS Pre.Draft.Team
## 1 6 Ohio State
## 2 11 Utah
## 3 10 Universo De'Longhi Treviso (Italy)
## 4 9 Shanghai Dongfang (China)
## 5 12 Glynn Academy (Georgia)
## 6 13 St. Vincent St. Mary High School (Ohio)
The final set of data I use is scraped from the Basketball Reference website. This data holds the season by season records (1980-2016) of every player and their stats. I will be using this to find the career stats of players. Once loaded, I will be cleaning the data by: - Converting team names
***I’m using this data instead of set with career totals/averages because I will be looking at data for players on their drafted team in a future project.
URLs <- sprintf("http://www.basketball-reference.com/leagues/NBA_%s_advanced.html", 1981:2017)
tabla <- list()
for(v in seq_along(URLs))
{
total <- readHTMLTable(URLs[v])
n.rows <- unlist(lapply(total, function(t) dim(t)[1]))
tabla[[v]] <- as.data.frame(total[[which.max(n.rows)]])
}
allplayers <- do.call(rbind.data.frame, tabla)
allplayers$Player <- gsub("\\*", "", allplayers$Player)
allplayers$Tm <- teamsub(c("SDC", "VAN", "NJN", "CHH", "NOH", "NOK", "KCK", "WSB", "SEA", "CHA", "PHX"), c("LAC", "MEM", "BRK", "NOP", "NOP", "NOP", "SAC", "WAS", "OKC", "CHO", "PHO"), allplayers$Tm)
allplayers <- subset(allplayers, Tm != "Tm")
allplayers <- subset(allplayers, Tm != "TOT")
head(allplayers, n=3)
## Rk Player Pos Age Tm G MP PER TS% 3PAr FTr ORB%
## 1 1 Kareem Abdul-Jabbar C 33 LAL 80 2976 25.5 .616 .001 .379 7.6
## 3 2 Tom Abernethy SF 26 GSW 10 39 3.2 .463 .000 1.000 2.8
## 4 2 Tom Abernethy SF 26 IND 29 259 8.7 .458 .018 .339 7.8
## DRB% TRB% AST% STL% BLK% TOV% USG% OWS DWS WS WS/48 .1 OBPM DBPM
## 1 21.5 15.0 13.6 0.9 4.0 12.8 26.3 9.6 4.6 14.3 .230 3.9 1.4
## 3 20.2 11.4 2.9 1.2 0.0 31.6 6.4 0.0 0.0 0.0 -0.004 -6.0 -0.2
## 4 9.1 8.4 8.8 1.1 0.7 8.5 10.9 0.2 0.2 0.4 .072 -2.0 -0.8
## BPM VORP
## 1 5.3 5.4
## 3 -6.2 0.0
## 4 -2.8 -0.1
Step 2: Cleaning the dataframes.
With all three datasets in place, I continue cleaning the data. First off, I want to make sure I uniquely define any players with the same names (ex. Tim Hardaway and Tim Hardaway JR.).
I’ll first find duplicates in the Alldrafted and playerinfo df’s:
unique_test <- data.frame(table(Alldrafted$Player))
unique_test[unique_test$Freq > 1,]
## Var1 Freq
## 140 Charles Smith 2
Alldrafted$Player[Alldrafted$Player == "Charles Smith" & Alldrafted$Pk == 26] <- "Charles Cornelius Smith"
unique_test2 <- data.frame(table(playerinfo$Player))
unique_test2[unique_test2$Freq > 1,]
## [1] Var1 Freq
## <0 rows> (or 0-length row.names)
playerinfo$Player[playerinfo$Player == "Charles Daniel Smith" & playerinfo$Pick == 3] <- "Charles Smith"
Now for the allplayers dataframe. This one is a bit more difficult as there are records for each season played by a player, so there will for sure be duplicates of player names. I will take two approaches to combat this.
The allplayers dataset does not include the season played but does include the age of the player. When I compile the career stats, I want to make sure I am not adding the stats of two seperate players into one. Here I aggregate by player and age and sum games played stats. If there are players at an age with more than 84 games played, I investigate to see if there are duplicate names:
c1 <- allplayers
c1$G <- as.numeric(levels(c1$G))[c1$G]
c2 <- aggregate(c1$G, list(c1$Player, c1$Age), sum)
c2 <- subset(c2, x>84)
names(c2) <- c("Player", "Age", "Games")
c2test <- merge(c2, Alldrafted, all.x = T)
head(c2test)
## Player Age Games Pk Tm Yrs G WS WS.48
## 1 Charles Jones 23 107 NA <NA> NA NA NA NA
## 2 Charles Jones 24 99 NA <NA> NA NA NA NA
## 3 Charles Smith 26 109 3 PHI 9 564 40.9 0.12
## 4 Charles Smith 22 94 3 PHI 9 564 40.9 0.12
## 5 Charles Smith 30 95 3 PHI 9 564 40.9 0.12
## 6 Charles Smith 23 99 3 PHI 9 564 40.9 0.12
Now with the list of players with more than 84 games played in a season, I attempt to remove any players from allplayers who aren’t in the targeted draft group (Round 1, 1980-2012) or change the names of any duplicated players.
allplayers <- allplayers[!(allplayers$Player == "Charles Smith" & allplayers$Tm == "BOS"),]
allplayers <- allplayers[!(allplayers$Player == "Charles Smith" & allplayers$Tm == "MIN"),]
allplayers$Player[allplayers$Player == "Charles Smith" & allplayers$Pos == "SG"] <- "Charles Cornelius Smith"
allplayers <- allplayers[!(allplayers$Player == "Dee Brown" & allplayers$Pos == "PG" & allplayers$Tm == "UTA"),]
allplayers <- allplayers[!(allplayers$Player == "Dee Brown" & allplayers$Pos == "PG" & allplayers$Tm == "WAS"),]
allplayers <- allplayers[!(allplayers$Player == "Dee Brown" & allplayers$Pos == "PG" & allplayers$Tm == "PHO"),]
allplayers <- allplayers[!(allplayers$Player == "Dee Brown" & allplayers$Pos == "PG" & allplayers$Tm == "TOT"),]
allplayers <- allplayers[!(allplayers$Player == "Gerald Henderson" & allplayers$Pos == "PG"),]
allplayers <- allplayers[!(allplayers$Player == "Glen Rice" & allplayers$Pos == "SG"),]
allplayers <- allplayers[!(allplayers$Player == "Glenn Robinson" & allplayers$Tm == "IND"),]
allplayers <- allplayers[!(allplayers$Player == "Glenn Robinson" & allplayers$Pos == "SG"),]
allplayers <- allplayers[!(allplayers$Player == "Glenn Robinson" & allplayers$Tm == "PHI" & allplayers$Age == 21),]
allplayers <- allplayers[!(allplayers$Player == "Larry Drew" & allplayers$Tm == "PHI"),]
allplayers <- allplayers[!(allplayers$Player == "Marcus Williams" & allplayers$Tm == "LAC"),]
allplayers <- allplayers[!(allplayers$Player == "Marcus Williams" & allplayers$Tm == "SAS"),]
allplayers <- allplayers[!(allplayers$Player == "Michael Smith" & allplayers$Tm == "SAC"),]
allplayers <- allplayers[!(allplayers$Player == "Michael Smith" & allplayers$Tm == "MEM"),]
allplayers <- allplayers[!(allplayers$Player == "Michael Smith" & allplayers$Tm == "WAS"),]
allplayers <- allplayers[!(allplayers$Player == "Patrick Ewing" & allplayers$Tm == "NOP"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "GSW"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "NOP"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "CHO"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "OKC"),]
allplayers <- allplayers[!(allplayers$Player == "Reggie Williams" & allplayers$Tm == "SAS" & allplayers$G ==20),]
allplayers$Player[allplayers$Player == "Tim Hardaway" & allplayers$Tm == "NYK"] <- "Tim Hardaway Jr"
allplayers$Player[allplayers$Player == "Tim Hardaway" & allplayers$Tm == "ATL"] <- "Tim Hardaway Jr"
allplayers <- allplayers[!(allplayers$Player == "Mike Dunleavy" & allplayers$Tm == "HOU"),]
allplayers <- allplayers[!(allplayers$Player == "Mike Dunleavy" & allplayers$Tm == "SAS"),]
allplayers <- allplayers[!(allplayers$Player == "Mike Dunleavy" & allplayers$Tm == "PHI"),]
allplayers <- allplayers[!(allplayers$Player == "Mike Dunleavy" & allplayers$Tm == "MIL" & allplayers$Pos == "PG"),]
allplayers <- allplayers[!(allplayers$Player == "Eddie Johnson" & allplayers$Tm == "ATL"),]
allplayers <- allplayers[!(allplayers$Player == "Eddie Johnson" & allplayers$Tm == "CLE"),]
allplayers <- allplayers[!(allplayers$Player == "Eddie Johnson" & allplayers$Tm == "SEA" & allplayers$G == 24),]
Second Attempt at finding duplicate names. In the allplayers dataset, if a player was traded midseason, they would have three records for that season: one for the stats on their initial team, one for the stats on their second team. Knowing this, I find rows that have the same player and age. I then match these players against players who have been traded midseason. If they were traded midseason, there is the possibility that there were two players with the same name who played at the same age.
### I have saved the allplayers file to a csv and have loaded the original data
c3 <- read.csv("allplayers.csv")
c3$Player <- gsub("\\*", "", c3$Player)
c3$Tm <- teamsub(c("SDC", "VAN", "NJN", "CHH", "NOH", "NOK", "KCK", "WSB", "SEA", "CHA", "PHX"), c("LAC", "MEM", "BRK", "NOP", "NOP", "NOP", "SAC", "WAS", "OKC", "CHO", "PHO"), c3$Tm)
c3$Player <- as.character(c3$Player)
c3$Age <- as.numeric.factor(c3$Age)
c3 <- c3[c3$Tm != "Tm",]
TradedPlayers <- subset(c3, Tm == "TOT")
TradedPlayers <- TradedPlayers[c(3,5,6)]
c3dupl <- c3[c(3,5,6)]
c3dupl <- c3dupl[c3dupl$Tm != "TOT",]
c3dupl$Freq <- 1
c3dupl <- aggregate(c3dupl$Freq, by = list(c3dupl$Player, c3dupl$Age), sum)
c3dupl <- subset(c3dupl, x >1)
c3check <- merge(c3dupl, TradedPlayers, by.x = c("Group.1", "Group.2"), by.y = c("Player", "Age"), all.x = T)
head(c3check[complete.cases(c3check) == F,])
## Group.1 Group.2 x Tm
## 241 Charles Jones 23 2 <NA>
## 242 Charles Jones 24 2 <NA>
## 243 Charles Jones 26 2 <NA>
## 246 Charles Smith 23 3 <NA>
## 247 Charles Smith 26 2 <NA>
## 248 Charles Smith 27 2 <NA>
After further investigation, I have ensured that the allplayers dataframe has unique player names for the targeted group.
We now have data frame with every draft selection, their draft team and their pick position. Some players were however traded on draft day or the subsequent days after to another team. To account for this, I will utilize some of the information provided in the playerinfo df on draft day trades. For players traded, I will change their Team variable in the Alldrafted df to the corresponding team they’ve been traded to:
draft_day_trades <- subset(playerinfo, select = c("Player", "Pick", "Draft.Trades", "Year"))
draft_day_trades$Draft.Trades[draft_day_trades$Draft.Trades == ""] <- NA
draft_day_trades <- draft_day_trades[complete.cases(draft_day_trades),]
draft_day_trades$Team <- substr(draft_day_trades$Draft.Trades, 8, 10)
draft_day_trades <- subset(draft_day_trades, select = -Draft.Trades)
names(draft_day_trades) <- c("Player", "Pk", "Year", "Team")
Alldrafted <- merge(Alldrafted, draft_day_trades, by = c("Player", "Pk"), all.x = T)
Alldrafted$Tm <- as.character(Alldrafted$Tm)
Alldrafted$Tm[is.na(Alldrafted$Team) == F] <- Alldrafted$Team[is.na(Alldrafted$Team) == F]
Alldrafted <- subset(Alldrafted, select = -c(8, 9))
I just want to check once more to see if there are any players that may have the wrong draft team. I check by seeing which players have no season stats with their drafted team and then investigate further to see if they were traded immediately after the draft.
drafttradetest <- merge(Alldrafted, allplayers, all.x = T, by.x = c("Player", "Tm"), by.y = c("Player", "Tm"))
drafttradetest2 <- drafttradetest[complete.cases(drafttradetest) == F,]
head(drafttradetest2, n=2)
## Player Tm Pk Yrs G.x WS.x WS.48 Rk Pos Age G.y MP PER
## 223 Ansu Sesay DAL 30 4 127 1.4 0.051 <NA> <NA> <NA> <NA> <NA> <NA>
## 229 Anthony Avent DEN 15 6 352 0.3 0.002 <NA> <NA> <NA> <NA> <NA> <NA>
## TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK% TOV% USG% OWS DWS
## 223 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 229 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## WS.y WS/48 .1 OBPM DBPM BPM VORP
## 223 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 229 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Alldrafted$Tm[Alldrafted$Player == "Brendan Haywood"] <-"WAS"
Alldrafted$Tm[Alldrafted$Player == "Byron Houston" ] <- "GSW"
Alldrafted$Tm[Alldrafted$Player == "Craig Brackins" ] <- "PHI"
Alldrafted$Tm[Alldrafted$Player == "Darrell Arthur" ] <- "MEM"
Alldrafted$Tm[Alldrafted$Player == "Anthony Avent"] <- "MIL"
Alldrafted$Tm[Alldrafted$Player == "Efthimi Rentzias"] <- "DEN"
Alldrafted$Tm[Alldrafted$Player == "Fred Roberts" ] <- "MIL"
Alldrafted$Tm[Alldrafted$Player == "Jamaal Tinsley" ] <- "IND"
Alldrafted$Tm[Alldrafted$Player == "Larry Krystkowiak" ] <- "SAS"
Alldrafted$Tm[Alldrafted$Player == "Leandro Barbosa" ] <- "PHO"
Alldrafted$Tm[Alldrafted$Player == "Kenny Battle" ] <- "PHO"
Alldrafted$Tm[Alldrafted$Player == "Mirsad Turkcan" ] <- "HOU"
Alldrafted$Tm[Alldrafted$Player == "Nikola Mirotic" ] <- "CHI"
Alldrafted$Tm[Alldrafted$Player == "Norris Cole" ] <- "MIA"
Alldrafted$Tm[Alldrafted$Player == "Pat Garrity" ] <- "PHO"
Alldrafted$Tm[Alldrafted$Player == "Randy Foye" ] <- "MIN"
Alldrafted$Tm[Alldrafted$Player == "Tracy Murray" ] <- "POR"
Next I want to prepare the playerinfo dataframe to be merged with our Alldrafted df. I am going to remove some of the unnecessary columns first.
playerinfo <- subset(playerinfo, select = -c(Rnd, Team, Draft.Trades))
names(playerinfo) <- c("Draft Year", "Pk", "Player", "POS", "Height", "Weight", "Age", "YOS", "PreNBA")
I’m adding a new column that details whether or not the draftee came from College (c), Overseas (o) or High school (h). Players from High School or Overseas have the region of their school or team in brackets (ex. Glynn Academy (Georgia)). I start by finding only the players from Overseas or High School as they have a the region of their former team in brackets. I use regex to extract only players with brackets in their PreNba variable:
IorH <- playerinfo[which(grepl("\\(", playerinfo$PreNBA) == T),]
There are however some draftees who have brackets in their PreNBA column:
playerinfo$PreNBA[playerinfo$Player == "Alfredrick Hughes"] <- "Loyola"
playerinfo$PreNBA[playerinfo$Player == "John Salmons"] <- "University of Miami"
playerinfo$PreNBA[playerinfo$Player == "Ron Harper"] <- "Miami University"
playerinfo$PreNBA[playerinfo$Player == "Tim James"] <- "University of Miami"
playerinfo$PreNBA[playerinfo$Player == "Wally Szczerbiak"] <- "Miami University"
Let’s try again:
IorH <- playerinfo[which(grepl("\\(", playerinfo$PreNBA) == T),]
head(IorH)
## Draft Year Pk Player POS Height Weight Age YOS
## 3 2006 1 Andrea Bargnani PF/C 7-0 245 20 10
## 4 2002 1 Yao Ming C 7-6 310 21 9
## 5 2001 1 Kwame Brown C 6-11 290 19 12
## 6 2003 1 LeBron James SF/PF 6-8 250 18 13
## 8 2004 1 Dwight Howard C/PF 6-11 265 18 12
## 41 2003 2 Darko Milicic C/PF 7-0 275 18 10
## PreNBA
## 3 Universo De'Longhi Treviso (Italy)
## 4 Shanghai Dongfang (China)
## 5 Glynn Academy (Georgia)
## 6 St. Vincent St. Mary High School (Ohio)
## 8 Southwest Atlanta Christian Academy (Georgia)
## 41 KK Vrsac Swisslion (Serbia)
To figure out whether or not they came from High School or Overseas, I will match what’s in the brackets to a list of American states. I have checked and ensured there were no players who came from playing in Georgia (country not state). I create a new column in the IorH df that gives a TRue or False of whether or not it is a US state. I then add this column to the playerinfo df and give an “O” for False, an “H” for True and a “C” for NA.
states <- read.csv("states.csv", header = T)
head(states)
## State Abbreviation
## 1 Alabama AL
## 2 Alaska AK
## 3 Arizona AZ
## 4 Arkansas AR
## 5 California CA
## 6 Colorado CO
states <- subset(states, select = State)
IorH$a <- str_extract(IorH$PreNBA, "\\(.*\\)")
IorH$a <- gsub("\\(", "", IorH$a)
IorH$a <- gsub("\\)", "", IorH$a)
IorH$b <- IorH$a %in% states$State
IorH <- subset(IorH, select = c(Player, Pk, b))
playerinfo <- merge(playerinfo, IorH, by = c("Player", "Pk"), all.x = T)
UHI <- function (x) {
ifelse(is.na(x) == T, "C",
ifelse(x == "TRUE", "H", "O"))
}
playerinfo$b <- UHI(playerinfo$b)
names(playerinfo) <- c("Player", "Pk", "Draft.Year", "POS", "Height", "Weight", "Age", "YOS", "PreNBA", "Source")
head(playerinfo)
## Player Pk Draft.Year POS Height Weight Age YOS PreNBA Source
## 1 A.C. Green 23 1985 PF 6-9 220 21 16 Oregon State C
## 2 Aaron Brooks 26 2007 PG 6-0 161 22 8 Oregon C
## 3 Aaron McKie 17 1994 G 6-5 209 21 14 Temple C
## 4 Acie Earl 19 1993 PF 6-10 240 23 4 Iowa C
## 5 Acie Law 11 2007 SG 6-3 202 22 4 Texas A&M C
## 6 Adam Keefe 10 1992 F 6-9 230 22 9 Stanford C
Now to assign one position for each of the players. I want to break it down into the classic 5 positions used. I will use the playerinfo df but for any players whose position isn’t one of the classic 5, I will assign the position of their rookie year (from the allplayers df).
allplayers$Age <- as.numeric(levels(allplayers$Age))[allplayers$Age]
POS_rookie <- allplayers %>%
group_by(Player) %>%
slice(which.min(Age))
POS_check <- merge(playerinfo, POS_rookie, all.x = T, by.x = "Player", by.y = "Player")
POS_check$Pos <- as.character(POS_check$Pos)
POS_check$Pos[is.na(POS_check$Pos) ==T] <- POS_check$POS
head(POS_check[complete.cases(POS_check) == F,], n=2)
## Player Pk Draft.Year POS Height Weight Age.x YOS PreNBA
## 108 Brian Jackson 26 1981 PF 6-8 200 22 0 Utah State
## 126 Calvin Duncan 30 1985 G 6-4 195 24 1 VCU
## Source Rk Pos Age.y Tm G MP PER TS% 3PAr FTr ORB% DRB%
## 108 C <NA> PF NA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 126 C <NA> PG NA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## TRB% AST% STL% BLK% TOV% USG% OWS DWS WS WS/48 .1 OBPM DBPM
## 108 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 126 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## BPM VORP
## 108 <NA> <NA>
## 126 <NA> <NA>
POS_check$POS[POS_check$POS == "C"] <- 5
POS_check$POS[POS_check$POS == "PF"] <- 4
POS_check$POS[POS_check$POS == "SF"] <- 3
POS_check$POS[POS_check$POS == "SG"] <- 2
POS_check$POS[POS_check$POS == "PG"] <- 1
POS_check$Pos[POS_check$Pos == "C"] <- 5
POS_check$Pos[POS_check$Pos == "PF"] <- 4
POS_check$Pos[POS_check$Pos == "SF"] <- 3
POS_check$Pos[POS_check$Pos == "SG"] <- 2
POS_check$Pos[POS_check$Pos == "PG"] <- 1
head(POS_check, n=2)
## Player Pk Draft.Year POS Height Weight Age.x YOS PreNBA
## 1 A.C. Green 23 1985 4 6-9 220 21 16 Oregon State
## 2 Aaron Brooks 26 2007 1 6-0 161 22 8 Oregon
## Source Rk Pos Age.y Tm G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST%
## 1 C 98 4 22 LAL 82 1542 11.8 .564 .015 .430 12.4 15.5 14.0 4.2
## 2 C 60 1 23 HOU 51 608 13.1 .535 .484 .218 2.5 7.9 5.2 23.5
## STL% BLK% TOV% USG% OWS DWS WS WS/48 .1 OBPM DBPM BPM VORP
## 1 1.5 1.7 17.7 14.7 1.4 2.0 3.3 .103 -0.5 0.8 0.3 0.9
## 2 1.1 0.6 15.1 21.8 0.6 0.8 1.4 .112 0.0 -1.8 -1.8 0.0
POS_Func <- function (x) {
ifelse(POS_check$POS != 1|2|3|4|5, POS_check$Pos, POS_check$POS)
}
POS_check$POS <- POS_Func(POS_check$POS)
POS_check <- POS_check[c(1,4)]
POS_check$POS[POS_check$POS == 5] <- "C"
POS_check$POS[POS_check$POS == 4] <- "PF"
POS_check$POS[POS_check$POS == 3] <- "SF"
POS_check$POS[POS_check$POS == 2] <- "SG"
POS_check$POS[POS_check$POS == 1] <-"PG"
playerinfo <- merge(playerinfo, POS_check, by.x = "Player", by.y = "Player")
playerinfo <- playerinfo[,c(1,2,3,11,5,6,7,8,9,10)]
names(playerinfo) <- c("Player", "Pk", "Draft.Year", "POS", "Height", "Weight", "Age", "YOS", "PreNBA", "Source")
head(playerinfo)
## Player Pk Draft.Year POS Height Weight Age YOS PreNBA Source
## 1 A.C. Green 23 1985 PF 6-9 220 21 16 Oregon State C
## 2 Aaron Brooks 26 2007 PG 6-0 161 22 8 Oregon C
## 3 Aaron McKie 17 1994 SG 6-5 209 21 14 Temple C
## 4 Acie Earl 19 1993 C 6-10 240 23 4 Iowa C
## 5 Acie Law 11 2007 PG 6-3 202 22 4 Texas A&M C
## 6 Adam Keefe 10 1992 PF 6-9 230 22 9 Stanford C
Step 3: Aggregating Stats
I can now begin the process of aggregating season stats to each player to find their total career stats. I begin by creating a new df careerstats and converting names and classes of the variables:
careerstats <- subset(allplayers, select = c(Player, Age, Tm, G, MP, PER, WS, `WS/48`, BPM, VORP))
names(careerstats) <- c("Player", "Age", "Team", "Games", "Minutes", "PER", "WS", "WS.48", "BPM", "VORP")
lapply(careerstats, class)
## $Player
## [1] "character"
##
## $Age
## [1] "numeric"
##
## $Team
## [1] "character"
##
## $Games
## [1] "factor"
##
## $Minutes
## [1] "factor"
##
## $PER
## [1] "factor"
##
## $WS
## [1] "factor"
##
## $WS.48
## [1] "factor"
##
## $BPM
## [1] "factor"
##
## $VORP
## [1] "factor"
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
careerstats$Games <- as.numeric.factor(careerstats$Games)
careerstats$WS <- as.numeric.factor(careerstats$WS)
careerstats$WS.48 <- as.numeric.factor(careerstats$WS.48)
careerstats$PER <- as.numeric.factor(careerstats$PER)
careerstats$BPM <- as.numeric.factor(careerstats$BPM)
careerstats$VORP <- as.numeric.factor(careerstats$VORP)
careerstats$Minutes <- as.numeric.factor(careerstats$Minutes)
head(careerstats)
## Player Age Team Games Minutes PER WS WS.48 BPM VORP
## 1 Kareem Abdul-Jabbar 33 LAL 80 2976 25.5 14.3 0.230 5.3 5.4
## 3 Tom Abernethy 26 GSW 10 39 3.2 0.0 -0.004 -6.2 0.0
## 4 Tom Abernethy 26 IND 29 259 8.7 0.4 0.072 -2.8 -0.1
## 5 Alvan Adams 26 PHO 75 2054 20.3 7.7 0.180 5.3 3.8
## 6 Darrell Allums 22 DAL 22 276 5.3 -0.3 -0.054 -5.6 -0.2
## 7 Tiny Archibald 32 BOS 80 2820 14.3 6.9 0.118 -1.3 0.5
Check for any NA’s. We run into one of the most challenging decisions of the project. After this check, I realize there are NA’s value for players who did not play a minute in the NBA. For some values like Seasons or Win Shares, I could easily assign a 0. The problem was that for a stat like PER and VORP, it is relevant to the other players in the league - and these both represent an impact you had(whether negative or positive). If a player never plays a game, is that as bad as playing the worst game possible? Or should they find the sweet spot where they had no impact on the game at all (this is completely further down on a different dataframe):
careerstats[complete.cases(careerstats) == F,]
## Player Age Team Games Minutes PER WS WS.48 BPM VORP
## 12153 Josh Davis 25 HOU 1 0 NA 0 NA -6.3 0
## 12484 Alex Scales 27 SAS 1 0 NA 0 NA -4.4 0
## 13485 Stephane Lasme 25 GSW 1 0 NA 0 NA -6.4 0
## 14491 JamesOn Curry 24 LAC 1 0 NA 0 NA -6.9 0
## 16452 Damion James 25 BRK 2 0 NA 0 NA -5.2 0
careerstats$WS[is.na(careerstats$WS) == T] <- 0
careerstats$WS.48[is.na(careerstats$WS.48) == T] <- 0
careerstats$PER[careerstats$Minutes == 0] <- mean(careerstats$PER[careerstats$Games == 1 & careerstats$Minutes < 5], na.rm = T)
Merge records for players traded midseason. If a there is a record with the same Player and Age, I know this is an instance where a player was traded. So I add up (WS, Seasons, Minutes, Games) and average (BPM, VORP, PER, WS/48) for every instance of a unique player and age combination.:
sumstats <- aggregate(careerstats[c(4,5,7)], list(careerstats$Player, careerstats$Age), sum)
meanstats <- aggregate(careerstats[c(6, 8, 9, 10)], list(careerstats$Player, careerstats$Age), mean)
newcareerstats <- merge(sumstats,meanstats)
names(newcareerstats) <- c("Player", "Age", "Games", "Minutes", "WS", "PER", "WS.48", "BPM", "VORP")
Calculate the number of seasons played by each player:
careerseasonsplayed <- count(newcareerstats, c('Player'))
names(careerseasonsplayed) <- c('Player', 'Seasons')
head(careerseasonsplayed)
## Player Seasons
## 1 A.C. Green 16
## 2 A.J. Bramlett 1
## 3 A.J. English 2
## 4 A.J. Guyton 3
## 5 A.J. Hammons 1
## 6 A.J. Price 6
Aggregating each player’s stats into one record (their total career stats):
final_careerstats <- merge(aggregate(newcareerstats[c(3,4,5)], list(newcareerstats$Player), sum),
aggregate(newcareerstats[c(6, 7, 8, 9)], list(newcareerstats$Player), mean))
names(final_careerstats) <- c("Player", "Games", "Minutes", "WS", "PER", "WS.48", "BPM", "VORP")
final_careerstats <- merge(careerseasonsplayed, final_careerstats)
head(final_careerstats)
## Player Seasons Games Minutes WS PER WS.48
## 1 A.C. Green 16 1278 36552 99.4 14.075000 0.12618750
## 2 A.J. Bramlett 1 8 61 -0.2 -0.400000 -0.12900000
## 3 A.J. English 2 151 3108 1.1 11.550000 0.01650000
## 4 A.J. Guyton 3 80 1246 0.3 4.366667 -0.14133333
## 5 A.J. Hammons 1 22 163 0.0 8.400000 -0.00100000
## 6 A.J. Price 6 261 3929 4.7 11.005556 0.03288889
## BPM VORP
## 1 0.093750 1.38437500
## 2 -16.800000 -0.20000000
## 3 -5.100000 -1.20000000
## 4 -11.000000 -0.33333333
## 5 -5.600000 -0.10000000
## 6 -3.522222 -0.05555556
Now I append the career stats to the drafted players from 1980-2012:
final_careerstats <- merge(Alldrafted, final_careerstats, by.x = "Player", by.y = "Player", all.x = T)
winstatscareer <- subset(final_careerstats, select = -c(4,5,6,7))
head(winstatscareer[complete.cases(winstatscareer) == F, ], n=3)
## Player Pk Tm Seasons Games Minutes WS.y PER WS.48.y BPM VORP
## 108 Brian Jackson 26 POR NA NA NA NA NA NA NA NA
## 126 Calvin Duncan 30 CHI NA NA NA NA NA NA NA NA
## 233 Deon Thomas 28 DAL NA NA NA NA NA NA NA NA
names(winstatscareer) <- c("Player", "Pk", "Tm", "Seasons", "Games", "Minutes", "WS", "PER", "WS.48", "BPM", "VORP")
head(winstatscareer, n=3)
## Player Pk Tm Seasons Games Minutes WS PER WS.48
## 1 A.C. Green 23 LAL 16 1278 36552 99.4 14.07500 0.12618750
## 2 Aaron Brooks 26 HOU 9 613 13250 19.1 12.52222 0.05938889
## 3 Aaron McKie 17 POR 13 793 19156 40.0 11.75769 0.09123077
## BPM VORP
## 1 0.0937500 1.3843750
## 2 -2.3888889 0.1222222
## 3 -0.1384615 0.9346154
I need to account for any players with NA values. This was a bit difficult because for stats like PER, BPM and VORP, you cannot just assign players who have never played an NBA minute 0. Instead I decided to average out these stats for players who have only played one game and assign those players with no games the average.
winstatscareer$Seasons[is.na(winstatscareer$Seasons) == T] <- 0
winstatscareer$Games[is.na(winstatscareer$Games) == T] <- 0
winstatscareer$Minutes[is.na(winstatscareer$Minutes) == T] <- 0
winstatscareer$WS[is.na(winstatscareer$WS) == T] <- 0
winstatscareer$PER[is.na(winstatscareer$PER) == T] <- mean(winstatscareer$PER[careerstats$Games ==1], na.rm = T)
winstatscareer$BPM[is.na(winstatscareer$BPM) == T] <- mean(winstatscareer$BPM[careerstats$Games ==1], na.rm = T)
winstatscareer$WS.48[is.na(winstatscareer$WS.48) == T] <- 0
winstatscareer$VORP[is.na(winstatscareer$VORP) == T] <- 0
I begin the process of comparing the zscores of each player’s stats. I will find the average and standard deviation for each draft position and then find each player’s zscore for each recorded stat.
avestatscareer <- aggregate(winstatscareer[,4:11], list(winstatscareer$Pk), mean)
avestatscareersd <- aggregate(winstatscareer[,4:11], list(winstatscareer$Pk), sd)
names(avestatscareer) <- c("Pk", "Avg.Seasons", "Avg.G", "Avg.Minutes", "Avg.WS", "Avg.PER", "Avg.WS.48", "Avg.BPM", "Avg.VORP")
names(avestatscareersd) <- c("Pk", "sd.Seasons", "sd.G", "sd.Minutes", "sd.WS", "sd.PER", "sd.WS.48", "sd.BPM", "sd.VORP")
total_career_stats <- merge(winstatscareer, avestatscareer)
total_career_stats <- merge(total_career_stats, avestatscareersd)
head(total_career_stats, n=3)
## Pk Player Tm Seasons Games Minutes WS PER WS.48
## 1 1 Tim Duncan SAS 19 1392 47368 206.3 23.93158 0.20610526
## 2 1 Joe Smith GSW 16 1030 27022 60.1 14.01250 0.09565625
## 3 1 Dwight Howard ORL 13 954 33291 121.7 21.52308 0.17284615
## BPM VORP Avg.Seasons Avg.G Avg.Minutes Avg.WS Avg.PER
## 1 5.278947 4.705263 11.87879 750.8788 24802.36 75.66061 18.53653
## 2 -1.950000 0.400000 11.87879 750.8788 24802.36 75.66061 18.53653
## 3 2.415385 3.000000 11.87879 750.8788 24802.36 75.66061 18.53653
## Avg.WS.48 Avg.BPM Avg.VORP sd.Seasons sd.G sd.Minutes sd.WS
## 1 0.1270197 1.447196 2.299122 4.128843 300.0325 11197.47 56.64677
## 2 0.1270197 1.447196 2.299122 4.128843 300.0325 11197.47 56.64677
## 3 0.1270197 1.447196 2.299122 4.128843 300.0325 11197.47 56.64677
## sd.PER sd.WS.48 sd.BPM sd.VORP
## 1 4.617105 0.0581278 2.902805 1.840581
## 2 4.617105 0.0581278 2.902805 1.840581
## 3 4.617105 0.0581278 2.902805 1.840581
total_career_stats <- mutate(total_career_stats, Z.Season = (Seasons - Avg.Seasons) / sd.Seasons)
total_career_stats <- mutate(total_career_stats, Z.Games = (Games - Avg.G) / sd.G)
total_career_stats <- mutate(total_career_stats, Z.Minutes = (Minutes - Avg.Minutes) / sd.Minutes)
total_career_stats <- mutate(total_career_stats, Z.WS = (WS - Avg.WS) / sd.WS)
total_career_stats <- mutate(total_career_stats, Z.WS.48 = (WS.48 - Avg.WS.48) / sd.WS.48)
total_career_stats <- mutate(total_career_stats, Z.PER = (PER - Avg.PER) / sd.PER)
total_career_stats <- mutate(total_career_stats, Z.BPM = (BPM - Avg.BPM) / sd.BPM)
total_career_stats <- mutate(total_career_stats, Z.VORP = (VORP - Avg.VORP) / sd.VORP)
total_career_stats <- total_career_stats[,-c(12:27)]
head(total_career_stats, n=3)
## Pk Player Tm Seasons Games Minutes WS PER WS.48
## 1 1 Tim Duncan SAS 19 1392 47368 206.3 23.93158 0.20610526
## 2 1 Joe Smith GSW 16 1030 27022 60.1 14.01250 0.09565625
## 3 1 Dwight Howard ORL 13 954 33291 121.7 21.52308 0.17284615
## BPM VORP Z.Season Z.Games Z.Minutes Z.WS Z.WS.48
## 1 5.278947 4.705263 1.7247474 2.1368394 2.0152443 2.3062109 1.3605464
## 2 -1.950000 0.400000 0.9981517 0.9303033 0.1982266 -0.2746954 -0.5395600
## 3 2.415385 3.000000 0.2715560 0.6769974 0.7580852 0.8127453 0.7883743
## Z.PER Z.BPM Z.VORP
## 1 1.1684918 1.3200169 1.3072732
## 2 -0.9798410 -1.1703151 -1.0318057
## 3 0.6468442 0.3335355 0.3807921
Adding the rest of the information for each player from the playerinfo df:
tc_stat <- merge(total_career_stats, playerinfo, by.x = "Player", by.y = "Player", all.x = T)
tc_stat[complete.cases(tc_stat) == F,]
## [1] Player Pk.x Tm Seasons Games Minutes
## [7] WS PER WS.48 BPM VORP Z.Season
## [13] Z.Games Z.Minutes Z.WS Z.WS.48 Z.PER Z.BPM
## [19] Z.VORP Pk.y Draft.Year POS Height Weight
## [25] Age YOS PreNBA Source
## <0 rows> (or 0-length row.names)
tc_stat <- tc_stat[,c(1,2,22,3,21,23,24,25,27,28,4,5,6,7,9,8,10,11,12,13,14,15,16,17,18,19)]
colnames(tc_stat)[2] <- "Pk"
And finally computing the combined zscores of the PER, VORP and Win Share stats:
tc_stat <- mutate(tc_stat, ZCUM = (Z.WS + Z.PER + Z.VORP)/3)
head(tc_stat, n=3)
## Player Pk POS Tm Draft.Year Height Weight Age PreNBA Source
## 1 A.C. Green 23 PF LAL 1985 6-9 220 21 Oregon State C
## 2 Aaron Brooks 26 PG HOU 2007 6-0 161 22 Oregon C
## 3 Aaron McKie 17 SG POR 1994 6-5 209 21 Temple C
## Seasons Games Minutes WS WS.48 PER BPM VORP
## 1 16 1278 36552 99.4 0.12618750 14.07500 0.0937500 1.3843750
## 2 9 613 13250 19.1 0.05938889 12.52222 -2.3888889 0.1222222
## 3 13 793 19156 40.0 0.09123077 11.75769 -0.1384615 0.9346154
## Z.Season Z.Games Z.Minutes Z.WS Z.WS.48 Z.PER Z.BPM
## 1 2.0329221 2.6335190 3.0282707 3.9402446 1.1141943 0.74676450 1.0768054
## 2 0.7552549 0.9557381 0.7323317 0.1734760 0.1561328 0.24401134 0.3350919
## 3 1.3674761 1.2115115 0.9632230 0.9280125 0.7559631 0.05872295 0.9346929
## Z.VORP ZCUM
## 1 2.1924935 2.29316756
## 2 -0.1884345 0.07635097
## 3 0.7427288 0.57648808
Before I begin comparing, I need to weigh thhe scores based on the performance of the draft classes - bumping up players from poor classes and down players from great classes. First I need to compare the draft classes. I will compute the average ZCUM scores for each draft class and then find each class’ zscore:
game_test <- subset(tc_stat, select = c(1,2,5,27))
ClassMean <- aggregate(game_test$ZCUM, list(game_test$Draft.Year), mean)
names(ClassMean) <- c("Draft.Year", "Mean.Class")
ClassSD <- aggregate(game_test$ZCUM, list(game_test$Draft.Year), sd)
names(ClassSD) <- c("Draft.Year", "SD.Class")
dclass <- merge(ClassMean, ClassSD)
dclass <- mutate(dclass, AVG = mean(dclass$Mean.Class))
dclass <- mutate(dclass, SD = mean(dclass$SD.Class))
dclass <- mutate(dclass, ZScore = (dclass$Mean.Class - dclass$AVG) / dclass$SD)
head(dclass)
## Draft.Year Mean.Class SD.Class AVG SD ZScore
## 1 1980 -0.23500672 0.6758249 -3.837419e-18 0.8646639 -0.27178968
## 2 1981 0.15293530 0.9492109 -3.837419e-18 0.8646639 0.17687254
## 3 1982 -0.09570833 0.7081607 -3.837419e-18 0.8646639 -0.11068848
## 4 1983 -0.01803811 0.8286382 -3.837419e-18 0.8646639 -0.02086141
## 5 1984 0.10263854 1.2817262 -3.837419e-18 0.8646639 0.11870340
## 6 1985 0.09501193 1.1637063 -3.837419e-18 0.8646639 0.10988308
game_test <- merge(game_test, dclass, all.x = T)
game_test <- game_test[,-c(5:8)]
I create a grading system based on the zscores of the draft classes where better performing classes receive a higher grade. I then calculate a new zscore that will add 1/32 of the standard deviation of the distribution of class averages for each grade jump (lower grades receiving the highest bump in score):
as.numeric(mean(game_test$ZCUM))
## [1] -2.188528e-18
sd(game_test$ZCUM)/32
## [1] 0.02763328
game_test$Grade <- cut(game_test$ZScore, breaks = c(-Inf, -.3, -.2, -.1, 0, .1, .2, .3, +Inf), right = T, include.lowest = T, labels = c("H", "G", "F", "E", "D", "C", "B", "A"))
game_test <- mutate(game_test,
New.ZCUM = ifelse(game_test$Grade == "A", game_test$ZCUM -0.09651498,
ifelse(game_test$Grade == "B", game_test$ZCUM - 0.06893927,
ifelse(game_test$Grade == "C", game_test$ZCUM - 0.04136356,
ifelse(game_test$Grade == "D", game_test$ZCUM - 0.01378785,
ifelse(game_test$Grade == "E", game_test$ZCUM + 0.01378785,
ifelse(game_test$Grade == "F", game_test$ZCUM + 0.04136356,
ifelse(game_test$Grade == "G", game_test$ZCUM + 0.06893927,
ifelse(game_test$Grade == "H", game_test$ZCUM + 0.09651498,
0)))))))))
head(game_test)
## Draft.Year Player Pk ZCUM ZScore Grade New.ZCUM
## 1 1980 Jeff Ruland 25 1.56558689 -0.2717897 G 1.6345262
## 2 1980 Hawkeye Whitney 16 -0.32611397 -0.2717897 G -0.2571747
## 3 1980 Monti Davis 21 0.04723096 -0.2717897 G 0.1161702
## 4 1980 Wes Matthews 14 -0.23262998 -0.2717897 G -0.1636907
## 5 1980 Louis Orr 29 0.60506290 -0.2717897 G 0.6740022
## 6 1980 John Duren 19 -1.35892299 -0.2717897 G -1.2899837
game_test <- game_test[,c(1,2,7)]
tc_stat <- merge(tc_stat, game_test)
Step 4: Visualizing the Data
With my scores finalized, I can now begin exploring and plotting the data. My first two plots I want to create deal with comparing team totals. I create a new dataframe teamresults that finds the averages combined ZCUM for each of the teams. I also add which conference the team is a part of:
teamresults <- aggregate(tc_stat[,c(19:26,28)], list(tc_stat$Tm), mean)
names(teamresults) <- c("Tm", "Season", "Games", "Minutes", "WS", "WS.48", "PER", "BPM", "VORP", "ZCUM")
conferences <- c("E", "E", "E", "E", "E", "E", "W", "W", "E", "W", "W", "E", "W", "W", "W", "E", "E", "W", "W", "E", "W", "E", "E", "W", "W", "W", "W", "E", "W", "E")
teamresults$CONF <- conferences
teamresults$CONF[teamresults$CONF == "E"] <- "East"
teamresults$CONF[teamresults$CONF == "W"] <- "West"
And now to create the visualizations to help us compare teams drafting performances.
Our first bar chart will break it down by team and show an overall comparison of zscores for Seasons, WS, PER, BPM and VORP.
tbyt <- plot_ly(teamresults, x = teamresults$Tm, y = teamresults$Season, type = 'bar', name = 'Seasons') %>%
add_trace(y = ~teamresults$WS, name = 'Win Shares') %>%
add_trace(y = ~teamresults$PER, name = 'PER') %>%
add_trace(y = ~teamresults$BPM, name = 'BPM') %>%
add_trace(y = ~teamresults$VORP, name = 'VORP') %>%
layout(title = "Team by Team Drafting Performance",
xaxis = list(title = "Team", tickangle = -45),
yaxis = list(title = "Z Score"),
barmode = 'group',
bargap = 0.35,
autosize = F, width = 1400, height = 900)
The second bar chart will combine the WS, PER and WS stats to create my cumulative stat I will use to compare on a player by player basis. I also seperate the Eastern and Western conference teams in this chart.
zcum.order <- order(teamresults$ZCUM)
ordered.teamresults <- teamresults[zcum.order,]
Combined.ZCUM <- plot_ly(ordered.teamresults, x = ordered.teamresults$Tm, y = ordered.teamresults$ZCUM,
type = 'bar', color = ordered.teamresults$CONF, colors = "Set1") %>%
layout(title = "Combined WS-PER-VORP ZScores by Team",
yaxis = list(title = "Combined ZScore"),
xaxis = list(title = "Team",
type = "category",
categoryorder = "array",
categoryarray = sort(ordered.teamresults$ZCUM),
tickangle = -45,
ticks = "inside",
ticklen = 7),
autosize = F, width = 1400, height = 900,
showlegend = T)
Now I will create a chart for each team that breaks down their average Z statistic for every player drafted. First i create a new dataframe for every team, sort them by draft year and then create a list with the names of each of these dataframes. I run these dataframe created a for loop that will run through the list of each team and create the chart.
teamnames <- list(teamresults$Tm)
teamnames <- unlist(teamnames)
###Create the df for each team
for (l in 1:30) {
teamname <- teamnames[l]
assign(teamname, subset(tc_stat, tc_stat$Tm == teamnames[l]))
}
###Create the list of dataframes
team.list <- list()
for (h in 1:length(teamnames))
{
team.list[[h]] <- get(teamnames[h])
}
###Run the list through the sorting function
Z.SORT <- function(c) {
sort <- order(c$Draft.Year)
c <- c[sort,]
}
team.list <- lapply(team.list, Z.SORT)
###Run the list of teams through the for loop to create a plot for each team.
teamplot <- for (q in 1:30) {
teamdf <- team.list[q]
teamdf <- as.data.frame.list(teamdf)
q.plot <- plot_ly(
teamdf,
x = ~teamdf$Player,
y = ~teamdf$ZCUM,
color = teamdf$Pk,
type = "bar",
text = paste(teamdf$Draft.Year, "</br> Pk", teamdf$Pk),
textposition = "outside",
textfont = list(size = 25)) %>%
colorbar(title = "Draft Position", len = 2.5) %>%
layout(title = paste(teamdf$Tm, "Z Score Breakdown"),
margin = list(b = 130),
yaxis = list(title = "Averaged PER-WS/48-VORP ZScore"),
xaxis = list(title = "Draft Pick",
type = "category",
categoryorder = "array",
categoryarray = teamdf$Draft.Year,
tickangle = -45,
ticks = "inside",
ticklen = 7),
autosize = F, width = 1400, height = 900,
showlegend = TRUE, barmode = 'relative')
}