nba <- nba %>%
distinct(Year, Player, Tm, .keep_all = T)
For this group by I created a categorical variable based on the continuous variable 3PA. The reasoning for the cutoffs is based on the idea of attempts/game.
#nba$`3P%`[nba$`3PA` == 0] <- 0
nba$`3PVolume` <- cut(nba$`3PA`, breaks = c(0,39,99,229,399, max(nba$`3PA`)),labels = c('very low', 'low', 'medium', 'high', 'very high'), include.lowest = T)
nba <- nba %>%
relocate(`3PVolume`, .after = `3P%`)
Find probabilities of each category
row_cnt <- length(nba$`3PVolume`)
nba %>%
group_by(`3PVolume`) %>%
summarise(volume_cnt = n(),
prob = volume_cnt/sum(row_cnt))
## # A tibble: 5 × 3
## `3PVolume` volume_cnt prob
## <fct> <int> <dbl>
## 1 very low 13220 0.615
## 2 low 2846 0.132
## 3 medium 3169 0.147
## 4 high 1820 0.0846
## 5 very high 456 0.0212
Map Prob Values back onto original data frame
nba <- nba %>%
mutate(`3PVprobability` = case_when(`3PVolume` == 'very low' ~0.02119846,`3PVolume` == 'low' ~ 0.08460788, `3PVolume` == 'medium' ~ 0.14731998,`3PVolume` == 'high' ~ 0.13230440,`3PVolume` == 'very high' ~ 0.61456929)) %>%
relocate(`3PVprobability`, .after = `3PVolume`)
I think it would be really interesting to look at how the conditional probabilities by position have changed over the years. For example, is the probability of very high 3PVolume given the player is a center higher in 2022 than it was in 1980. My guess would be yes based on the increased shooting over the years. You could used a hypothesis test to see if these proportions are significantly different.
nba %>%
filter(MP > 240) %>%
summarise(`meanUSG%` = mean(`USG%`),
`sdUSG%` = sd(`USG%`))
## # A tibble: 1 × 2
## `meanUSG%` `sdUSG%`
## <dbl> <dbl>
## 1 18.9 4.84
Create new data that only includes players with more than 240 minutes in a season. This is equivalent to 5 full games played. The reason for this is because the USG% can be inaccurate is a player has only played a small amount of minutes. In general, the more minutes played = the more realiable the value is.
usage <- nba %>%
filter(MP > 240)
usage$`USGcategory` <- cut(usage$`USG%`,
breaks = c(0,4.40,9.25,14.09,23.78,28.63,33.47,max(usage$`USG%`)), include.lowest = T, labels = c('way below average', 'below average', 'slightly below average', 'average', 'slightly above average', 'above average', 'way above average'))
usage <- usage %>%
relocate(`USGcategory`, .after = `USG%`)
Now find the probability distribution for the categorical variable we just created
total_rows <- length(usage$USGcategory)
usage %>%
group_by(USGcategory) %>%
summarise(n = n(),
prob = n/total_rows)
## # A tibble: 7 × 3
## USGcategory n prob
## <fct> <int> <dbl>
## 1 way below average 4 0.000239
## 2 below average 223 0.0133
## 3 slightly below average 2346 0.140
## 4 average 11512 0.688
## 5 slightly above average 2097 0.125
## 6 above average 472 0.0282
## 7 way above average 71 0.00425
Looking at the probabilities it is not surprising to see that it looks normally distributed as I based the breaks of the calculated mean and standard deviation for the USG% column. These probabilities also give us a good insight into what outliers we have in the data set and spark more questions about the make up of the teams the highly used players are on and make us question what the 4 players with extremely low usage were doing out there on the court.
nba %>%
summarise(no_college = sum(is.na(Colleges)),
yes_college = sum(!is.na(Colleges)),
prob_college = yes_college/n())
## # A tibble: 1 × 3
## no_college yes_college prob_college
## <int> <int> <dbl>
## 1 2005 19506 0.907
nba %>%
group_by(Pos) %>%
summarise(no_college = sum(is.na(Colleges)),
yes_college = sum(!is.na(Colleges)),
pos_count = n(),
pos_prob_college = yes_college/pos_count)
## # A tibble: 5 × 5
## Pos no_college yes_college pos_count pos_prob_college
## <chr> <int> <int> <int> <dbl>
## 1 C 738 3532 4270 0.827
## 2 PF 399 4145 4544 0.912
## 3 PG 222 4004 4226 0.947
## 4 SF 377 3700 4077 0.908
## 5 SG 269 4125 4394 0.939
Add these probabilities to data set
nba %>%
mutate(CollegeProb = ifelse(!is.na(Colleges), 0.9067919, 0.0932081 )) %>%
mutate(Pos_CollegeProb = case_when(Pos == "C" & !is.na(Colleges) ~ 0.8271663,
Pos == "PF" & !is.na(Colleges) ~ 0.9121919,
Pos == "SF" & !is.na(Colleges) ~ 0.9075300,
Pos == "SG" & !is.na(Colleges) ~ 0.9387802,
Pos == "PG" & !is.na(Colleges) ~ 0.9474681,
Pos == "C" & is.na(Colleges) ~ 1-0.8271663,
Pos == "PF" & is.na(Colleges) ~ 1-0.9121919,
Pos == "SF" & is.na(Colleges) ~ 1-0.9075300,
Pos == "SG" & is.na(Colleges) ~ 1-0.9387802,
Pos == "PG" & is.na(Colleges) ~ 1-0.9474681,))
## # A tibble: 21,511 × 58
## Year Player Pos Age Tm G GS MP FG FGA `FG%` `3P`
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1980 Kareem Abd… C 32 LAL 82 NA 3143 835 1383 0.604 0
## 2 1981 Kareem Abd… C 33 LAL 80 NA 2976 836 1457 0.574 0
## 3 1982 Kareem Abd… C 34 LAL 76 76 2677 753 1301 0.579 0
## 4 1983 Kareem Abd… C 35 LAL 79 79 2554 722 1228 0.588 0
## 5 1984 Kareem Abd… C 36 LAL 80 80 2622 716 1238 0.578 0
## 6 1985 Kareem Abd… C 37 LAL 79 79 2630 723 1207 0.599 0
## 7 1986 Kareem Abd… C 38 LAL 79 79 2629 755 1338 0.564 0
## 8 1987 Kareem Abd… C 39 LAL 78 78 2441 560 993 0.564 1
## 9 1988 Kareem Abd… C 40 LAL 80 80 2308 480 903 0.532 0
## 10 1989 Kareem Abd… C 41 LAL 74 74 1695 313 659 0.475 0
## # ℹ 21,501 more rows
## # ℹ 46 more variables: `3PA` <dbl>, `3P%` <dbl>, `3PVolume` <fct>,
## # `3PVprobability` <dbl>, `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>,
## # FT <dbl>, FTA <dbl>, `FT%` <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>,
## # AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>, PER <dbl>,
## # `TS%` <dbl>, `3PAr` <dbl>, FTr <dbl>, `ORB%` <dbl>, `DRB%` <dbl>,
## # `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, …
I would would curious to look into why the difference in proportion of centers that went to college and point guards is so different. My initial idea is that it takes more time for PGs to develop and college is a great place to do that, whereas if you are a big dominant post man there is no need to “waste” time in college if you are already “NBA ready”. Also to look at these probabilities on a year to year scale would be interesting as the rules regarding college have changed over time and have forced players to go the college route, but now there are some more options such as going to the NBL in Australia or Overtime Elite.
USGcategory and 3PVolume, which were both created using the cut function above. Have to use usage data set, also created above, because the USGcategory variable only makes sense for certain values of MP, which the usage data set is already filtered on. There are 7 categories for USGcategory and 5 for 3PVolume so there would be 35 possible combinations of these variables.
usage %>%
group_by(USGcategory, `3PVolume`) %>%
count()
## # A tibble: 30 × 3
## # Groups: USGcategory, 3PVolume [30]
## USGcategory `3PVolume` n
## <fct> <fct> <int>
## 1 way below average very low 4
## 2 below average very low 199
## 3 below average low 17
## 4 below average medium 6
## 5 below average high 1
## 6 slightly below average very low 1576
## 7 slightly below average low 358
## 8 slightly below average medium 314
## 9 slightly below average high 94
## 10 slightly below average very high 4
## # ℹ 20 more rows
Of the 35 possible pairs there are 5 that never appear in the data, which are (way below average, low), (way below average, medium), (way below average, high), (way below average,very high ), and (below average, high). These all make sense. If someone has a way below average usage rate is is unlikely that they even had the opportunity to shoot even a “low” amount of threes let alone more. Similarly, if you have a below average usage rate it is possible to have up to high 3PVolume (players are primarily shooting 3 pointers when they touch the ball), however if you got the chance to shoot a very high amount of 3s then you are being used more than below average amount. The most common is average usage and very low 3PVolume. This makes sense, because there is about 68% of the data in the average usage column and then 61% in the very low 3PVolume category. This is due to the fact that the three pointer was not very popular for big men to shoot until very recently.
usage %>%
group_by(USGcategory, `3PVolume`) %>%
count() %>%
ggplot() +
geom_tile(aes(x = USGcategory, y = `3PVolume`, fill = n)) +
scale_fill_gradient2(high = "red", mid = "white", low = "blue", midpoint = 2900, guide = "colorbar", aesthetics = "fill") +
theme(axis.text.x = element_text(angle = 11))
Out of curiosity I wanted to see how the above changed when grouped by position. Here you can see a lot of the bright red is due to the C and PF positions, as expected. As you can see, as you go from C -> PF -> SF-> SG -> PG the graph distribution begins t shirt to the top right corner.
usage %>%
group_by(USGcategory, `3PVolume`, Pos) %>%
count() %>%
ggplot() +
geom_tile(aes(x = `3PVolume`, y = USGcategory, fill = n)) +
scale_fill_gradient2(high = "red", mid = "white", low = "blue", midpoint = 900, guide = "colorbar", aesthetics = "fill") +
theme(axis.text.x = element_text(angle = 25)) +
facet_wrap(~Pos)
Still curious about how different these graphs, especially the C and PF ones, would look if you looked across decades.