Week 3 Data Dive

Loading Data

nba <- nba %>%
  distinct(Year, Player, Tm, .keep_all = T)

Group by 1

For this group by I created a categorical variable based on the continuous variable 3PA. The reasoning for the cutoffs is based on the idea of attempts/game.

#nba$`3P%`[nba$`3PA` == 0] <- 0
nba$`3PVolume` <- cut(nba$`3PA`, breaks = c(0,39,99,229,399, max(nba$`3PA`)),labels = c('very low', 'low', 'medium', 'high', 'very high'), include.lowest = T)

nba <- nba %>%
  relocate(`3PVolume`, .after = `3P%`)

Find probabilities of each category

row_cnt <- length(nba$`3PVolume`)
nba %>%
  group_by(`3PVolume`) %>%
  summarise(volume_cnt = n(),
            prob = volume_cnt/sum(row_cnt))

## # A tibble: 5 × 3
##   `3PVolume` volume_cnt   prob
##   <fct>           <int>  <dbl>
## 1 very low        13220 0.615 
## 2 low              2846 0.132 
## 3 medium           3169 0.147 
## 4 high             1820 0.0846
## 5 very high         456 0.0212

Map Prob Values back onto original data frame

nba <- nba %>%
  mutate(`3PVprobability` = case_when(`3PVolume` == 'very low' ~0.02119846,`3PVolume` == 'low' ~ 0.08460788, `3PVolume` == 'medium' ~ 0.14731998,`3PVolume` == 'high' ~ 0.13230440,`3PVolume` == 'very high' ~ 0.61456929)) %>%
  relocate(`3PVprobability`, .after = `3PVolume`)

I think it would be really interesting to look at how the conditional probabilities by position have changed over the years. For example, is the probability of very high 3PVolume given the player is a center higher in 2022 than it was in 1980. My guess would be yes based on the increased shooting over the years. You could used a hypothesis test to see if these proportions are significantly different.

Group by 2

nba %>%
  filter(MP > 240) %>%
  summarise(`meanUSG%` = mean(`USG%`),
            `sdUSG%` = sd(`USG%`))

## # A tibble: 1 × 2
##   `meanUSG%` `sdUSG%`
##        <dbl>    <dbl>
## 1       18.9     4.84

Create new data that only includes players with more than 240 minutes in a season. This is equivalent to 5 full games played. The reason for this is because the USG% can be inaccurate is a player has only played a small amount of minutes. In general, the more minutes played = the more realiable the value is.

usage <- nba %>%
  filter(MP > 240)
usage$`USGcategory` <- cut(usage$`USG%`,
                         breaks = c(0,4.40,9.25,14.09,23.78,28.63,33.47,max(usage$`USG%`)), include.lowest = T, labels = c('way below average', 'below average', 'slightly below average', 'average', 'slightly above average', 'above average', 'way above average'))

usage <- usage %>%
  relocate(`USGcategory`, .after = `USG%`)

Now find the probability distribution for the categorical variable we just created

total_rows <- length(usage$USGcategory)
usage %>%
  group_by(USGcategory) %>%
  summarise(n = n(),
            prob = n/total_rows)

## # A tibble: 7 × 3
##   USGcategory                n     prob
##   <fct>                  <int>    <dbl>
## 1 way below average          4 0.000239
## 2 below average            223 0.0133  
## 3 slightly below average  2346 0.140   
## 4 average                11512 0.688   
## 5 slightly above average  2097 0.125   
## 6 above average            472 0.0282  
## 7 way above average         71 0.00425

Looking at the probabilities it is not surprising to see that it looks normally distributed as I based the breaks of the calculated mean and standard deviation for the USG% column. These probabilities also give us a good insight into what outliers we have in the data set and spark more questions about the make up of the teams the highly used players are on and make us question what the 4 players with extremely low usage were doing out there on the court.

Group by 3

nba %>%
  summarise(no_college = sum(is.na(Colleges)),
            yes_college = sum(!is.na(Colleges)),
            prob_college = yes_college/n())

## # A tibble: 1 × 3
##   no_college yes_college prob_college
##        <int>       <int>        <dbl>
## 1       2005       19506        0.907

nba %>%
  group_by(Pos) %>%
  summarise(no_college = sum(is.na(Colleges)),
            yes_college = sum(!is.na(Colleges)),
            pos_count = n(),
            pos_prob_college = yes_college/pos_count)

## # A tibble: 5 × 5
##   Pos   no_college yes_college pos_count pos_prob_college
##   <chr>      <int>       <int>     <int>            <dbl>
## 1 C            738        3532      4270            0.827
## 2 PF           399        4145      4544            0.912
## 3 PG           222        4004      4226            0.947
## 4 SF           377        3700      4077            0.908
## 5 SG           269        4125      4394            0.939

Add these probabilities to data set

nba %>%
  mutate(CollegeProb = ifelse(!is.na(Colleges), 0.9067919, 0.0932081 )) %>%
  mutate(Pos_CollegeProb = case_when(Pos == "C" & !is.na(Colleges) ~ 0.8271663,
                                     Pos == "PF" & !is.na(Colleges) ~ 0.9121919,
                                     Pos == "SF" & !is.na(Colleges) ~ 0.9075300,
                                     Pos == "SG" & !is.na(Colleges) ~ 0.9387802,
                                     Pos == "PG" & !is.na(Colleges) ~ 0.9474681,
                                     Pos == "C" & is.na(Colleges) ~ 1-0.8271663,
                                     Pos == "PF" & is.na(Colleges) ~ 1-0.9121919,
                                     Pos == "SF" & is.na(Colleges) ~ 1-0.9075300,
                                     Pos == "SG" & is.na(Colleges) ~ 1-0.9387802,
                                     Pos == "PG" & is.na(Colleges) ~ 1-0.9474681,))

## # A tibble: 21,511 × 58
##     Year Player      Pos     Age Tm        G    GS    MP    FG   FGA `FG%`  `3P`
##    <dbl> <chr>       <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1980 Kareem Abd… C        32 LAL      82    NA  3143   835  1383 0.604     0
##  2  1981 Kareem Abd… C        33 LAL      80    NA  2976   836  1457 0.574     0
##  3  1982 Kareem Abd… C        34 LAL      76    76  2677   753  1301 0.579     0
##  4  1983 Kareem Abd… C        35 LAL      79    79  2554   722  1228 0.588     0
##  5  1984 Kareem Abd… C        36 LAL      80    80  2622   716  1238 0.578     0
##  6  1985 Kareem Abd… C        37 LAL      79    79  2630   723  1207 0.599     0
##  7  1986 Kareem Abd… C        38 LAL      79    79  2629   755  1338 0.564     0
##  8  1987 Kareem Abd… C        39 LAL      78    78  2441   560   993 0.564     1
##  9  1988 Kareem Abd… C        40 LAL      80    80  2308   480   903 0.532     0
## 10  1989 Kareem Abd… C        41 LAL      74    74  1695   313   659 0.475     0
## # ℹ 21,501 more rows
## # ℹ 46 more variables: `3PA` <dbl>, `3P%` <dbl>, `3PVolume` <fct>,
## #   `3PVprobability` <dbl>, `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>,
## #   FT <dbl>, FTA <dbl>, `FT%` <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>,
## #   AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>, PER <dbl>,
## #   `TS%` <dbl>, `3PAr` <dbl>, FTr <dbl>, `ORB%` <dbl>, `DRB%` <dbl>,
## #   `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, …

I would would curious to look into why the difference in proportion of centers that went to college and point guards is so different. My initial idea is that it takes more time for PGs to develop and college is a great place to do that, whereas if you are a big dominant post man there is no need to “waste” time in college if you are already “NBA ready”. Also to look at these probabilities on a year to year scale would be interesting as the rules regarding college have changed over time and have forced players to go the college route, but now there are some more options such as going to the NBL in Australia or Overtime Elite.

Last Part

Pick 2-3 categorical variables for which you know all possible combinations.

USGcategory and 3PVolume, which were both created using the cut function above. Have to use usage data set, also created above, because the USGcategory variable only makes sense for certain values of MP, which the usage data set is already filtered on. There are 7 categories for USGcategory and 5 for 3PVolume so there would be 35 possible combinations of these variables.

usage %>%
  group_by(USGcategory, `3PVolume`) %>%
  count()

## # A tibble: 30 × 3
## # Groups:   USGcategory, 3PVolume [30]
##    USGcategory            `3PVolume`     n
##    <fct>                  <fct>      <int>
##  1 way below average      very low       4
##  2 below average          very low     199
##  3 below average          low           17
##  4 below average          medium         6
##  5 below average          high           1
##  6 slightly below average very low    1576
##  7 slightly below average low          358
##  8 slightly below average medium       314
##  9 slightly below average high          94
## 10 slightly below average very high      4
## # ℹ 20 more rows

Of the 35 possible pairs there are 5 that never appear in the data, which are (way below average, low), (way below average, medium), (way below average, high), (way below average,very high ), and (below average, high). These all make sense. If someone has a way below average usage rate is is unlikely that they even had the opportunity to shoot even a “low” amount of threes let alone more. Similarly, if you have a below average usage rate it is possible to have up to high 3PVolume (players are primarily shooting 3 pointers when they touch the ball), however if you got the chance to shoot a very high amount of 3s then you are being used more than below average amount. The most common is average usage and very low 3PVolume. This makes sense, because there is about 68% of the data in the average usage column and then 61% in the very low 3PVolume category. This is due to the fact that the three pointer was not very popular for big men to shoot until very recently.

usage %>%
  group_by(USGcategory, `3PVolume`) %>%
  count() %>%
  ggplot() +
  geom_tile(aes(x = USGcategory, y = `3PVolume`, fill = n)) +
  scale_fill_gradient2(high = "red", mid = "white", low = "blue", midpoint = 2900, guide = "colorbar", aesthetics = "fill") + 
  theme(axis.text.x = element_text(angle = 11))

Out of curiosity I wanted to see how the above changed when grouped by position. Here you can see a lot of the bright red is due to the C and PF positions, as expected. As you can see, as you go from C -> PF -> SF-> SG -> PG the graph distribution begins t shirt to the top right corner.

usage %>%
  group_by(USGcategory, `3PVolume`, Pos) %>%
  count() %>%
  ggplot() +
  geom_tile(aes(x = `3PVolume`, y = USGcategory, fill = n)) +
  scale_fill_gradient2(high = "red", mid = "white", low = "blue", midpoint = 900, guide = "colorbar", aesthetics = "fill") + 
  theme(axis.text.x = element_text(angle = 25)) +
  facet_wrap(~Pos)

Still curious about how different these graphs, especially the C and PF ones, would look if you looked across decades.