Data Selection

The dataset I chose is from Kaggle (https://www.kaggle.com/mamadoudiallo/nba-players-stats-19802017). It is a list of NBA player statistics from 1980 to 2017. It shows each player's stats for a given year on the team they played for. I chose this dataset because it is very robust: it has a lot of data from a wide range of years, which leaves a lot to explore, and it is from a league I don’t watch much. I wanted to use the data as an opportunity to learn a little more about the league.

What Each Column Means in the Dataset

Year
Player
Pos: Player Position
Age
Tm: Team
G: Games
MP: Minutes played
PER: Player Efficiency Rating
TS%: True shooting percentage
OWS: Offensive Win Shares
DWS: Defensive Win Shares
WS: Win shares
WS/48: Win shares per 48 minutes
BPM: Box Plus/Minus
VORP: Value over replacement player
FG: Field goals
FGA: Field goal attempts
FG%: Field goal percentage
3P: 3 point field goals
3PA: 3 point field goal attempts
3P%: 3 point field goal percentage
2P: 2 point field goals
2PA: 2 point field goal attempts
2P%: 2 point field goal percentage
eFG%: Effective field goal percentage
FT: Free throws
FTA: Free throw attempts
FT%: Free throw percentage
ORB: Offensive Rebounds
DRB: Defensive Rebounds
TRB: Total Rebounds
AST: Assists
STL: Steals
BLK: Blocks
TOV: Turnovers
PF: Personal Fouls
PTS: Points

Background: NBA

Formed in 1946, the National Basketball Association (NBA) is the youngest of the American sports associations but has the highest salaries. The league is composed of 30 teams, 29 from the United States and 1 from Canada, split into 2 conferences. Each team plays 82 regular season games, then the best teams from each conference enter a 16-team knockout playoff, and the winners of each conference play each other in the championship. Over the years, the NBA has expanded its marketing all over the world and, as a result, has been able to add players from all over the world (Dewey, 2020).

Load the Data

Once the dataset was selected, I then put it in the folder that would be set as my working directory. I then converted it to a dataframe and looked at what I had to work with. The majority of the dataset is numeric with a couple of columns containing characters.

setwd("~/Desktop/MC Data Science /DATA 110 /DataSets ")
library(tidyverse)
nba <- read_csv("player_df.csv")
# head of data 
head(nba)
## # A tibble: 6 x 38
##      X1  Year Player Pos     Age Tm        G    MP   PER `TS%`   OWS   DWS    WS
##   <dbl> <dbl> <chr>  <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   689  1980 Karee… C        32 LAL      82  3143  25.3 0.639   9.5   5.3  14.8
## 2   690  1980 Tom A… PF       25 GSW      67  1222  11   0.511   1.2   0.8   2  
## 3   691  1980 Alvan… C        25 PHO      75  2168  19.2 0.571   3.1   3.9   7  
## 4   692  1980 Tiny … PG       31 BOS      80  2864  15.3 0.574   5.9   2.9   8.9
## 5   694  1980 Gus B… SG       28 WSB      20   180   9.3 0.467   0     0.2   0.2
## 6   696  1980 Greg … SF       25 WSB      82  2438  18.1 0.532   4.1   2.8   6.9
## # … with 25 more variables: `WS/48` <dbl>, BPM <dbl>, VORP <dbl>, FG <dbl>,
## #   FGA <dbl>, `FG%` <dbl>, `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>, `2P` <dbl>,
## #   `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
## #   ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>,
## #   TOV <dbl>, PF <dbl>, PTS <dbl>
# tail of data 
tail(nba)
## # A tibble: 6 x 38
##      X1  Year Player Pos     Age Tm        G    MP   PER `TS%`   OWS   DWS    WS
##   <dbl> <dbl> <chr>  <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19609  2017 Nick … SG       31 LAL      60  1556  14.1 0.588   3     0.2   3.2
## 2 19610  2017 Thadd… PF       28 IND      74  2237  14.9 0.562   1.9   2.7   4.6
## 3 19611  2017 Cody … PF       24 CHO      62  1725  16.7 0.604   3.4   2.2   5.6
## 4 19612  2017 Tyler… C        27 BOS      51   525  13   0.508   0.5   0.6   1  
## 5 19614  2017 Paul … SF       22 CHI      44   843   6.9 0.503  -0.3   0.8   0.5
## 6 19615  2017 Ivica… C        19 LAL      38   609  17   0.547   0.6   0.5   1.1
## # … with 25 more variables: `WS/48` <dbl>, BPM <dbl>, VORP <dbl>, FG <dbl>,
## #   FGA <dbl>, `FG%` <dbl>, `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>, `2P` <dbl>,
## #   `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
## #   ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>,
## #   TOV <dbl>, PF <dbl>, PTS <dbl>
# data structure 
str(nba)
## tibble [15,107 × 38] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ X1    : num [1:15107] 689 690 691 692 694 696 697 699 703 705 ...
##  $ Year  : num [1:15107] 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ Player: chr [1:15107] "Kareem Abdul-Jabbar*" "Tom Abernethy" "Alvan Adams" "Tiny Archibald*" ...
##  $ Pos   : chr [1:15107] "C" "PF" "C" "PG" ...
##  $ Age   : num [1:15107] 32 25 25 31 28 25 28 35 23 25 ...
##  $ Tm    : chr [1:15107] "LAL" "GSW" "PHO" "BOS" ...
##  $ G     : num [1:15107] 82 67 75 80 20 82 77 72 16 73 ...
##  $ MP    : num [1:15107] 3143 1222 2168 2864 180 ...
##  $ PER   : num [1:15107] 25.3 11 19.2 15.3 9.3 18.1 13.7 14.8 24.1 13.1 ...
##  $ TS%   : num [1:15107] 0.639 0.511 0.571 0.574 0.467 0.532 0.533 0.517 0.552 0.513 ...
##  $ OWS   : num [1:15107] 9.5 1.2 3.1 5.9 0 4.1 2.1 2.2 0.7 0.4 ...
##  $ DWS   : num [1:15107] 5.3 0.8 3.9 2.9 0.2 2.8 1.9 1.2 0.3 2.7 ...
##  $ WS    : num [1:15107] 14.8 2 7 8.9 0.2 6.9 3.9 3.4 0.9 3.2 ...
##  $ WS/48 : num [1:15107] 0.227 0.08 0.155 0.148 0.043 0.136 0.081 0.09 0.188 0.08 ...
##  $ BPM   : num [1:15107] 6.7 -1.6 4.4 0 -2.4 2.5 0.3 0.6 3.3 0.5 ...
##  $ VORP  : num [1:15107] 6.8 0.1 3.5 1.5 0 2.7 1.4 1.2 0.3 1.2 ...
##  $ FG    : num [1:15107] 835 153 465 383 16 545 384 325 72 299 ...
##  $ FGA   : num [1:15107] 1383 318 875 794 35 ...
##  $ FG%   : num [1:15107] 0.604 0.481 0.531 0.482 0.457 0.495 0.505 0.422 0.493 0.484 ...
##  $ 3P    : num [1:15107] 0 0 0 4 1 16 1 73 8 1 ...
##  $ 3PA   : num [1:15107] 1 1 2 18 1 47 3 221 19 5 ...
##  $ 3P%   : num [1:15107] 0 0 0 0.222 1 0.34 0.333 0.33 0.421 0.2 ...
##  $ 2P    : num [1:15107] 835 153 465 379 15 529 383 252 64 298 ...
##  $ 2PA   : num [1:15107] 1382 317 873 776 34 ...
##  $ 2P%   : num [1:15107] 0.604 0.483 0.533 0.488 0.441 0.502 0.506 0.458 0.504 0.486 ...
##  $ eFG%  : num [1:15107] 0.604 0.481 0.531 0.485 0.471 0.502 0.506 0.469 0.521 0.485 ...
##  $ FT    : num [1:15107] 364 56 188 361 5 171 139 143 28 99 ...
##  $ FTA   : num [1:15107] 476 82 236 435 13 227 209 153 39 141 ...
##  $ FT%   : num [1:15107] 0.765 0.683 0.797 0.83 0.385 0.753 0.665 0.935 0.718 0.702 ...
##  $ ORB   : num [1:15107] 190 62 158 59 6 240 192 53 13 126 ...
##  $ DRB   : num [1:15107] 696 129 451 138 22 398 264 183 16 327 ...
##  $ TRB   : num [1:15107] 886 191 609 197 28 638 456 236 29 453 ...
##  $ AST   : num [1:15107] 371 87 322 671 26 159 279 268 31 178 ...
##  $ STL   : num [1:15107] 81 35 108 106 7 90 85 80 14 73 ...
##  $ BLK   : num [1:15107] 280 12 55 10 4 36 49 28 2 92 ...
##  $ TOV   : num [1:15107] 297 39 218 242 11 133 189 152 20 157 ...
##  $ PF    : num [1:15107] 216 118 237 218 18 197 268 182 26 246 ...
##  $ PTS   : num [1:15107] 2034 362 1118 1131 38 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_double(),
##   ..   Year = col_double(),
##   ..   Player = col_character(),
##   ..   Pos = col_character(),
##   ..   Age = col_double(),
##   ..   Tm = col_character(),
##   ..   G = col_double(),
##   ..   MP = col_double(),
##   ..   PER = col_double(),
##   ..   `TS%` = col_double(),
##   ..   OWS = col_double(),
##   ..   DWS = col_double(),
##   ..   WS = col_double(),
##   ..   `WS/48` = col_double(),
##   ..   BPM = col_double(),
##   ..   VORP = col_double(),
##   ..   FG = col_double(),
##   ..   FGA = col_double(),
##   ..   `FG%` = col_double(),
##   ..   `3P` = col_double(),
##   ..   `3PA` = col_double(),
##   ..   `3P%` = col_double(),
##   ..   `2P` = col_double(),
##   ..   `2PA` = col_double(),
##   ..   `2P%` = col_double(),
##   ..   `eFG%` = col_double(),
##   ..   FT = col_double(),
##   ..   FTA = col_double(),
##   ..   `FT%` = col_double(),
##   ..   ORB = col_double(),
##   ..   DRB = col_double(),
##   ..   TRB = col_double(),
##   ..   AST = col_double(),
##   ..   STL = col_double(),
##   ..   BLK = col_double(),
##   ..   TOV = col_double(),
##   ..   PF = col_double(),
##   ..   PTS = col_double()
##   .. )
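As a small illustration of the "mostly numeric with a couple of character columns" check described above, here is a minimal base R sketch. It assumes a tiny stand-in file (the name `player_df_toy.csv` and the four-column layout are hypothetical, with two rows copied from the output above) written to a temporary directory, so it runs without the real CSV or a fixed working directory.

```r
# Write a tiny stand-in CSV to a temporary folder (hypothetical file name)
csv_path <- file.path(tempdir(), "player_df_toy.csv")
writeLines(
  c("Year,Player,Pos,PTS",
    "1980,Kareem Abdul-Jabbar*,C,2034",
    "1980,Tom Abernethy,PF,362"),
  csv_path
)

# Read it back using a full path instead of relying on setwd()
toy <- read.csv(csv_path, stringsAsFactors = FALSE)

# One class per column: shows which columns are numeric and which are text
col_classes <- sapply(toy, class)
col_classes
```

Building the path with `file.path()` is only one option; it keeps the script independent of whichever folder happens to be the working directory.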

Cleaning

There was one unnamed column in the dataset (read in as X1). I did not know what it was for or how to use it in the analysis, so I removed it. I also noticed that some of the players' names had a "*" next to them, so I removed that as well to make the player column more uniform. I then installed the janitor package, which is useful for cleaning data, to rename the columns, since many of them contained symbols that make them difficult to call in R. The package converted all of the names to lowercase, replaced symbols such as % with words, and prefixed names that start with a digit with x, leaving me with my cleaned dataframe, nba_clean.

#install.packages("janitor")
library(janitor)
#Remove X1 column and make all columns lower case and remove %
nba_clean <- subset( nba, select = -X1) %>% clean_names()
#Remove extra character from player names 
nba_clean$player <- str_remove_all(nba_clean$player, "[*]")
#Sorting the data
nba_clean<- nba_clean %>% group_by(pos) %>% arrange(desc(pts))
#Look at new dataset 
head(nba_clean)
## # A tibble: 6 x 37
## # Groups:   pos [2]
##    year player pos     age tm        g    mp   per ts_percent   ows   dws    ws
##   <dbl> <chr>  <chr> <dbl> <chr> <dbl> <dbl> <dbl>      <dbl> <dbl> <dbl> <dbl>
## 1  1987 Micha… SG       23 CHI      82  3281  29.8      0.562  11.9   5    16.9
## 2  1988 Micha… SG       24 CHI      82  3311  31.7      0.603  15.2   6.1  21.2
## 3  2006 Kobe … SG       27 LAL      80  3277  28        0.559  11.6   3.7  15.3
## 4  1990 Micha… SG       26 CHI      82  3197  31.2      0.606  14.7   4.3  19  
## 5  1989 Micha… SG       25 CHI      81  3255  31.1      0.614  14.6   5.2  19.8
## 6  2014 Kevin… SF       25 OKC      81  3122  29.8      0.635  14.8   4.4  19.2
## # … with 25 more variables: ws_48 <dbl>, bpm <dbl>, vorp <dbl>, fg <dbl>,
## #   fga <dbl>, fg_percent <dbl>, x3p <dbl>, x3pa <dbl>, x3p_percent <dbl>,
## #   x2p <dbl>, x2pa <dbl>, x2p_percent <dbl>, e_fg_percent <dbl>, ft <dbl>,
## #   fta <dbl>, ft_percent <dbl>, orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>,
## #   stl <dbl>, blk <dbl>, tov <dbl>, pf <dbl>, pts <dbl>
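To make the cleaning steps above concrete, here is a minimal base R sketch on a toy table. The helper `clean_one()` is hypothetical and only approximates what janitor's `clean_names()` produced for the column names seen in this report; it is not janitor's actual implementation.

```r
# Toy stand-in for the raw table (two rows, a few awkward column names)
toy <- data.frame(
  X1 = c(689, 690),
  Player = c("Kareem Abdul-Jabbar*", "Tom Abernethy"),
  stringsAsFactors = FALSE
)
toy[["TS%"]] <- c(0.639, 0.511)
toy[["3P"]] <- c(0, 0)

# Drop the unexplained index column
toy <- toy[, setdiff(names(toy), "X1"), drop = FALSE]

# Remove the literal "*" from player names (fixed = TRUE avoids regex meaning)
toy$Player <- gsub("*", "", toy$Player, fixed = TRUE)

# Hypothetical helper: rough approximation of the renaming seen in the output
clean_one <- function(nm) {
  nm <- gsub("%", "_percent", nm, fixed = TRUE)  # spell out the percent sign
  nm <- tolower(nm)                              # lowercase everything
  nm <- gsub("[^a-z0-9]+", "_", nm)              # other symbols become "_"
  nm <- gsub("^_+|_+$", "", nm)                  # trim stray underscores
  if (grepl("^[0-9]", nm)) nm <- paste0("x", nm) # names cannot start with a digit
  nm
}
names(toy) <- vapply(names(toy), clean_one, character(1), USE.NAMES = FALSE)
names(toy)
```

The resulting names match the pattern in the cleaned output above (`ts_percent`, `x3p`), which is why columns such as `x3pa` appear later in the analysis.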

Exploring Data

Some questions I wanted answers to were:
- Based on the players in the dataset, which positions had the most points scored?
- Which teams have the best three point averages?
- Which years have the most points scored?

I used graphs to see what the results would be.

Positions That Score the Most Points

Each team has 5 players on the court: a point guard (PG), shooting guard (SG), small forward (SF), power forward (PF), and center (C). I wanted to see which positions in this dataset had generated the most overall points, so I summarized the data to show the total points by player position. In doing that, I noticed that this dataset has some position hybrids, where players had multiple roles on the court. For the sake of the graph, I removed those to show only the traditional positions in the bar graph.

library(ggplot2)
library(hrbrthemes)
library(plotly)
# Set up data to show points by position 
nba_p <- nba_clean %>% group_by(pos) %>% summarise(pts= sum(pts)) %>% arrange(desc(pts))
nba_p #contains hybrid positions condense results to show five main positions 
## # A tibble: 16 x 2
##    pos       pts
##    <chr>   <dbl>
##  1 SG    2003515
##  2 SF    1904686
##  3 PG    1776981
##  4 PF    1678907
##  5 C     1144053
##  6 SF-SG   15126
##  7 PG-SG   12960
##  8 SG-SF   10463
##  9 SG-PG   10269
## 10 SF-PF    9088
## 11 PF-SF    6958
## 12 C-PF     5911
## 13 PF-C     4668
## 14 SG-PF    3456
## 15 PG-SF    1022
## 16 C-SF      178
nba_p1 <-head(nba_p,5)
nba_p1
## # A tibble: 5 x 2
##   pos       pts
##   <chr>   <dbl>
## 1 SG    2003515
## 2 SF    1904686
## 3 PG    1776981
## 4 PF    1678907
## 5 C     1144053
#Create Graph
nba_v1 <- nba_p1 %>% ggplot(aes(x=pos, y= pts)) +
  geom_bar(stat="identity", width = 0.2,fill="#f68060") + ggtitle("Total Points Scored by Each Position") + xlab("Player Position") + ylab("Points Scored") 
nba_v1 + theme_gray() + theme(plot.title = element_text(hjust = 0.5, face = "bold.italic"), axis.title.x = element_text(size = 8, face = "italic"), axis.title.y = element_text(size=8, face = "italic"), axis.text.x = element_text(color = "blue", face = "bold"))

Based on the graph, shooting guards have generated the most points in this dataset. This makes sense to me, as shooting is in the name of the position. Centers generated the fewest points. However, I am missing part of the overall picture, because the players listed under hybrid positions were left out of the graph.
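One way the hybrid positions could be kept rather than dropped is to fold each one into its first-listed position before summing. The sketch below uses base R on made-up numbers; treating the first-listed position as the primary one is an assumption for illustration, not something the dataset documents.

```r
# Toy data: made-up point totals, including two hybrid positions
toy <- data.frame(
  pos = c("SG", "SF", "SF-SG", "C", "C-PF", "PF"),
  pts = c(100, 80, 20, 60, 5, 70),
  stringsAsFactors = FALSE
)

# Assumption: the position listed first is the primary one ("SF-SG" -> "SF")
toy$primary_pos <- sub("-.*$", "", toy$pos)

# Total points per primary position, highest first
totals <- aggregate(pts ~ primary_pos, data = toy, FUN = sum)
totals <- totals[order(-totals$pts), ]
totals
```

Filtering on the position names themselves, or recoding them as here, also avoids depending on the five main positions happening to sort into the first five rows.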

Teams With the Best Three Point Averages

A three point shot is the most valuable field goal attempt. I wanted to see which teams have had players that help their overall three point numbers. I summarized the data to show a three point total for each team; note that the column I summed, x3pa, holds three point attempts, so the totals reflect attempt volume rather than a shooting percentage. When I looked at the results of the first dataframe, the team with the highest total was “TOT”. I created another dataframe to look at some of the players in TOT. Looking up the statistics of the first player in the group, Carmelo Anthony, I found that he played on 2 teams in 2011, with statistics for both Denver and New York (Basketball Reference). So my conclusion is that TOT is for players who accumulated stats in one year on more than one team. To make sure that the graph would represent single teams, I updated the dataframe to remove the first row. I then had the dataset show the top ten teams for the scatterplot.

#Set up data to show the teams and their cumulative three point attempts (x3pa)
nba_b3 <- nba_clean %>% group_by(tm) %>% summarise(x3pa= sum(x3pa)) %>% arrange(desc(x3pa))
head(nba_b3)
## # A tibble: 6 x 2
##   tm     x3pa
##   <chr> <dbl>
## 1 TOT   99794
## 2 HOU   48866
## 3 GSW   43023
## 4 DAL   42685
## 5 PHO   41725
## 6 NYK   41496
#Figure out which team is TOT  
nba_tot <- nba_clean %>% select(year,player,tm) %>%filter(tm == "TOT")
head(nba_tot)
## # A tibble: 6 x 4
## # Groups:   pos [4]
##   pos    year player            tm   
##   <chr> <dbl> <chr>             <chr>
## 1 SF     2011 Carmelo Anthony   TOT  
## 2 C      2017 DeMarcus Cousins  TOT  
## 3 SF     1994 Dominique Wilkins TOT  
## 4 SF-SG  2005 Vince Carter      TOT  
## 5 SG     1983 World B.          TOT  
## 6 SF     1982 Mike Mitchell     TOT
#Remove TOT 
nba_b3 <- nba_b3[-1,]
nba_b3
## # A tibble: 40 x 2
##    tm     x3pa
##    <chr> <dbl>
##  1 HOU   48866
##  2 GSW   43023
##  3 DAL   42685
##  4 PHO   41725
##  5 NYK   41496
##  6 LAL   41295
##  7 BOS   40125
##  8 POR   39542
##  9 IND   39352
## 10 DEN   38875
## # … with 30 more rows
# Minimize teams selected for graph 
nba_b3 <- head(nba_b3,10)
nba_b3
## # A tibble: 10 x 2
##    tm     x3pa
##    <chr> <dbl>
##  1 HOU   48866
##  2 GSW   43023
##  3 DAL   42685
##  4 PHO   41725
##  5 NYK   41496
##  6 LAL   41295
##  7 BOS   40125
##  8 POR   39542
##  9 IND   39352
## 10 DEN   38875
#Create graph 
nba_v2 <- nba_b3 %>% ggplot(aes(x=tm, y=x3pa)) + 
  geom_point(shape=21, color="black", fill="#E6F130", size=6) +
  theme_dark() + ggtitle("Teams With Best Overall Three Point Average") + xlab("Teams") + ylab("Three Point Average") 
nba_v2 + theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5), axis.title = element_text(color = "blue", size = 8, face = "italic"))

The top three teams by accumulated three point total (summed x3pa) were the Houston Rockets (HOU), Golden State Warriors (GSW), and Dallas Mavericks (DAL). Before TOT was removed, the combined group of players who played on multiple teams in a year had the highest total of all. As with the previous graph, this is not a full picture of the answer, since I took out a big chunk of the data.
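Because x3pa counts attempts, a measure closer to an "average" would divide made threes by attempts for each team. The following is a minimal base R sketch on made-up numbers, assuming the cleaned column names `tm`, `x3p`, and `x3pa`; it also removes TOT by name rather than by row position.

```r
# Toy data: made-up makes (x3p) and attempts (x3pa) per player-season
toy <- data.frame(
  tm   = c("HOU", "HOU", "GSW", "GSW", "TOT"),
  x3p  = c(100, 50, 120, 40, 40),
  x3pa = c(300, 100, 300, 100, 120),
  stringsAsFactors = FALSE
)

# Drop the multi-team rows by name, so the result does not depend on sort order
single <- toy[toy$tm != "TOT", ]

# Sum makes and attempts per team, then compute the team three point percentage
team <- aggregate(cbind(x3p, x3pa) ~ tm, data = single, FUN = sum)
team$x3p_percent <- team$x3p / team$x3pa
team <- team[order(-team$x3p_percent), ]
team
```

Dividing summed makes by summed attempts weights each player by how often they shot, which differs from averaging the individual percentages.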

Points Scored Over The Years

One last thing I wanted to look at was the overall scope of points scored for every year of the dataset. To do that, I made a dataframe that showed the total points for every year and used it to create a line graph. I used plotly so that I would be able to see the specific numbers for each year to accompany the visual.

#Set up data to list the years and total points 
nba_y <- nba_clean %>% group_by(year) %>% summarise(pts = sum(pts)) %>% arrange(desc(year))
nba_y
## # A tibble: 38 x 2
##     year    pts
##    <dbl>  <dbl>
##  1  2017 274407
##  2  2016 258432
##  3  2015 261249
##  4  2014 256170
##  5  2013 242268
##  6  2012 186201
##  7  2011 263879
##  8  2010 257619
##  9  2009 257911
## 10  2008 252005
## # … with 28 more rows
#Create Graph 
nba_v3 <- nba_y %>% ggplot(aes(x=year, y=pts)) + 
  geom_line(color="dark green") +
  geom_point(shape=21, color="black", fill="#f17113", size=2) +
  theme_minimal() + ggtitle("Points Scored Over the Years") + xlab("Year") + ylab("Total Points") 
#Add Plotly  
nba_v4 <- ggplotly(nba_v3)
nba_v4

Something that is very obvious in this graph is that there are 2 years with a significant drop in overall points, 1999 and 2012. These coincide with the 2 seasons shortened by an NBA lockout. In the late 90s, player salaries were getting increasingly outrageous. One example is Kevin Garnett, who was drafted right out of high school and in 1997 signed a deal worth $126 million over six seasons when he was only 21. Deals like this were causing a ripple effect in the league, to the point where losses were going to be claimed prior to the start of the next season. To get this under control, league owners and the players association tried to come to an agreement on how salaries should be handled in the future. The talks weren’t resolved in a timely fashion, so there was a lockout. The first 32 games of the 82-game season were canceled (Ringold, 2000). Another breakdown in talks between the league, owners, and players in 2011 caused another lockout. When the season resumed, only 66 games were scheduled to be played (Lopa, 2019). These significant delays really show in this graph. Fewer games played, and most likely less accurate shooting from players, means fewer points scored overall.
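The "fewer games means fewer points" reasoning can be checked roughly by dividing each season's total by the number of games scheduled per team. The sketch below uses the 2011 to 2013 totals from the table above and the schedule lengths mentioned in the text (82 and 66); it is only a rough normalization, since the totals come from the players present in this dataset.

```r
# Yearly totals copied from the table above; games = schedule length per team
seasons <- data.frame(
  year  = c(2011, 2012, 2013),
  pts   = c(263879, 186201, 242268),
  games = c(82, 66, 82)
)

# Points per scheduled game, so shortened seasons are comparable to full ones
seasons$pts_per_scheduled_game <- seasons$pts / seasons$games

# How much of a full 82-game schedule each season covered
seasons$share_of_full_schedule <- seasons$games / 82
seasons
```

Comparing the two new columns side by side gives a sense of how much of a dip is explained by the shorter schedule alone.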

Statistical Analysis

The columns from the dataset that drew my attention were win shares and PER. A player's win shares are an estimate of how much they contributed to team wins, while the player efficiency rating (PER) is a per-minute rating of a player’s positive impact (Basketball Reference). With that in mind, I wanted to see what the correlation would be between these two columns. I created a dataframe that focused on a portion of the data, then arranged it to show the players with the highest win shares at the top. When I first attempted to graph the data it looked very cluttered, so I created another dataframe that would only show a specific year. I chose 1998 as it was the middle of the range of years in the dataset. I then added a regression line and calculated the correlation.

# ws vs per 
nba_s <- nba_clean %>% group_by(age) %>% select(year,player,age,ws,per) %>% arrange(desc(ws))
head(nba_s)
## # A tibble: 6 x 5
## # Groups:   age [5]
##    year player           age    ws   per
##   <dbl> <chr>          <dbl> <dbl> <dbl>
## 1  1988 Michael Jordan    24  21.2  31.7
## 2  1996 Michael Jordan    32  20.4  29.4
## 3  1991 Michael Jordan    27  20.3  31.6
## 4  2009 LeBron James      24  20.3  31.7
## 5  1994 David Robinson    28  20    30.7
## 6  1989 Michael Jordan    25  19.8  31.1
# Condense to focus on a specific year 
nba_s1 <- nba_s %>% filter(year == 1998)
head(nba_s1)
## # A tibble: 6 x 5
## # Groups:   age [4]
##    year player           age    ws   per
##   <dbl> <chr>          <dbl> <dbl> <dbl>
## 1  1998 Karl Malone       34  16.4  27.9
## 2  1998 Michael Jordan    34  15.8  25.2
## 3  1998 David Robinson    32  13.8  27.8
## 4  1998 Tim Duncan        21  12.8  22.6
## 5  1998 Gary Payton       29  12.5  21.6
## 6  1998 Reggie Miller     32  12    19.8
# Create graph 
nba_v5 <- nba_s1 %>% ggplot(aes(x=per, y= ws )) + 
  geom_point() + xlim(0,41) + ylim(0,25) + 
  ggtitle("Win Shares to PER in 1998") + 
  xlab("Player Efficiency Rating (PER)") +
  ylab("Win Shares")
nba_v5

# Show correlation 
nba_v6 <- nba_v5 + geom_smooth(method=lm , color="yellow", fill="#69b3a2", se=TRUE) + theme_grey() + theme(plot.title = element_text(hjust = 0.5, face = "bold.italic", size = 16), axis.title = element_text(face = "italic", size = 8))
nba_v6

# Correlation values 
cor(nba_s$per, nba_s$ws) 
## [1] 0.7367184
cor(nba_s1$per, nba_s1$ws)
## [1] 0.706774

In theory, the higher a player’s efficiency rating, the higher their win shares. This shows in the upward trend of the regression line and in the correlation values. Whether looking at all the years represented (about 0.74) or just the year selected for the plot (about 0.71), the correlation is relatively strong.
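As a small worked example of how the correlation and the regression line relate, the sketch below uses only the six 1998 rows printed above (so the numbers it produces will not match the full-season values). For a simple linear regression with one predictor, the squared correlation equals the model's R-squared.

```r
# PER and win shares for the six 1998 players shown in the output above
per <- c(27.9, 25.2, 27.8, 22.6, 21.6, 19.8)
ws  <- c(16.4, 15.8, 13.8, 12.8, 12.5, 12.0)

# Pearson correlation; use = "complete.obs" skips rows with missing values
r <- cor(per, ws, use = "complete.obs")

# Straight-line fit of win shares on PER (what geom_smooth(method = lm) draws)
fit <- lm(ws ~ per)
slope <- unname(coef(fit)["per"])
r_squared <- summary(fit)$r.squared

c(correlation = r, slope = slope, r_squared = r_squared)
```

A positive slope and a positive correlation always go together here, since both take their sign from the covariance of the two columns.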

Final Visual

For my final visual, I wanted to show which player positions have the best win shares. I created a dataframe limited to the five main player positions and grouped it by year. To minimize the scope, I wanted the top ten win shares from each year. I used that dataframe to create a highchart area graph.
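A minimal sketch of a "top n win shares per year" selection on a toy table, assuming dplyr is installed and using made-up values with n = 2 to keep it short. The ordering column is ws: within a single year group every row has the same year, so ordering by year would tie all rows, and `slice_max()` keeps ties by default.

```r
library(dplyr)

# Toy data: made-up win shares for two seasons
toy <- tibble(
  year = c(1980, 1980, 1980, 1981, 1981, 1981),
  pos  = c("C", "SG", "SF", "C", "PG", "PF"),
  ws   = c(14.8, 10.6, 12.5, 9.0, 11.0, 7.5)
)

# Keep the five main positions, then the top 2 rows by win shares in each year
top2 <- toy %>%
  filter(pos %in% c("PG", "SG", "SF", "PF", "C")) %>%
  group_by(year) %>%
  slice_max(order_by = ws, n = 2) %>%
  ungroup()

top2
```

Calling `ungroup()` at the end keeps the grouping from carrying over into later steps, such as a `select()` that would otherwise retain the grouping column.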

Player Positions

PG: Point Guard
SG: Shooting Guard
SF: Small Forward
PF: Power Forward
C: Center

library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(RColorBrewer)
#Create dataframe 
nba_f <- nba_clean %>% select(ws,year,pos,player) %>% group_by(year) %>%  filter(pos == "PG" | pos == "SG" | pos == "SF" | pos == "PF"| pos == "C") %>% slice_max(order_by = year, n= 10)

head(nba_f)
## # A tibble: 6 x 4
## # Groups:   year [1]
##      ws  year pos   player             
##   <dbl> <dbl> <chr> <chr>              
## 1  10.6  1980 SG    George Gervin      
## 2  11.9  1980 C     Moses Malone       
## 3  12.5  1980 SF    Julius Erving      
## 4   8    1980 SG    World B.           
## 5  14.8  1980 C     Kareem Abdul-Jabbar
## 6  11.5  1980 C     Dan Issel
#graph 
cute <- brewer.pal(5, "Set3")
nba_v7 <- highchart() %>% hc_add_series(data = nba_f, 
                                        type = "area", hcaes(x=year, 
                                                             y= ws,
                                                             group= pos)) %>% 
  hc_colors(cute) %>% 
    hc_plotOptions(series = list(stacking = "normal",
                               marker = list(enabled = FALSE,
                               states = list(hover = list(enabled = FALSE))),
                               lineWidth = 0.5,
                               lineColor = "black")) %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="Win Shares")) %>%
  hc_legend(title= list(text= "Player Position"),align = "right", verticalAlign = "center",
            layout = "vertical") %>%
  hc_tooltip(enabled = FALSE) %>%
  hc_title(
    text = "Impact of Player Position on Wins",
    margin = 20,
    align = "center",
    style = list(color = "#2B1A84", useHTML = TRUE)
    )

nba_v7

You can see that the NBA lockouts did have an impact on win shares, with two significant dips in the data. From the chart, centers (C) don’t seem to have as much of an impact on win shares as the other positions, while power forwards (PF) seem to have the most.

More to Explore

This was a pretty robust dataset, but while working with it I felt that anything I found in exploration was missing a small portion of data that would have made some answers a lot more definitive. For example, in a lot of the graphs I filtered out portions of the data that had double meanings, such as hybrid positions and multi-team totals. If I had more time, I would have done more manipulation to sort that data into broader groupings and get a better scope of the data. Another question that would have been fun to look at is whether I could use this dataset to compare the stats of some of the best players in the league against each other in their primes.

Works Cited

Carmelo Anthony. Basketball Reference. https://www.basketball-reference.com/players/a/anthoca01.html

Dewey, J., PhD. (2020). National Basketball Association (NBA). Salem Press Encyclopedia.

Glossary. Basketball Reference. https://www.basketball-reference.com/about/glossary.html

Lopa, M. C. C. (2019). There Are No Winners in Lockouts: Understanding the 2011 NBA Lockout Through the Lens of Philippine Laws and Jurisprudence on Collective Bargaining. Ateneo Law Journal, 63(3), 807–831.

Ringold, D. (2000). Full Court Pressure: A Look at the 1998-1999 National Basketball Association Lockout. Texas Review of Entertainment & Sports Law, 1(1), 101.