Loading Data
A numeric summary of data for at least 10 columns of data
3P%
First, filter to keep only player-seasons with at least 25 three-point attempts and where Tm does not equal “TOT”. A “TOT” row holds a player's cumulative stats for a season in which they played for multiple teams, i.e. if it were included, those stats would be counted twice.
nba %>%
filter(`3PA`>=25, Tm != 'TOT') %>%
summarise(max = max(`3P%`, na.rm = T),
'75%' = quantile(`3P%`, probs = c(0.75)),
med = median(`3P%`, na.rm = T),
mean = mean(`3P%`, na.rm = T),
'25%' = quantile(`3P%`, probs = c(0.25)),
min = min(`3P%`, na.rm = T))
## # A tibble: 1 × 6
## max `75%` med mean `25%` min
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.594 0.382 0.346 0.337 0.3 0.036
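To see why the “TOT” rows have to be dropped, here is a minimal sketch on made-up data (the player name, teams, and attempt counts are hypothetical, not taken from the dataset). A traded player has one row per team plus a “TOT” row, so summing without the filter counts every attempt twice.

```r
# Toy data: one traded player with a row per team plus a cumulative "TOT" row.
# All values are invented for illustration.
toy <- data.frame(
  Player = c("Player A", "Player A", "Player A"),
  Tm     = c("TOT", "BOS", "LAL"),
  `3PA`  = c(100, 40, 60),
  check.names = FALSE
)

# Without the filter, the attempts are double counted
with_tot <- sum(toy$`3PA`)

# Dropping the "TOT" row keeps only the per-team rows
without_tot <- sum(toy$`3PA`[toy$Tm != "TOT"])

print(c(with_tot = with_tot, without_tot = without_tot))
```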
2P%
nba %>%
filter(`2PA`>=25, Tm != 'TOT') %>%
summarise(max = max(`2P%`, na.rm = T),
'75%' = quantile(`2P%`, probs = c(0.75)),
med = median(`2P%`, na.rm = T),
mean = mean(`2P%`, na.rm = T),
'25%' = quantile(`2P%`, probs = c(0.25)),
min = min(`2P%`, na.rm = T))
## # A tibble: 1 × 6
## max `75%` med mean `25%` min
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.897 0.512 0.476 0.475 0.439 0.163
This column is interesting when coupled with the 3P% column because it gives some insight into why data-driven teams are looking to shoot more 3s. Based on the means above, the average 2-point attempt yields less than 1 point (0.950768), whereas the average 3-point attempt yields more than 1 point (1.011334).
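The arithmetic behind those two figures is just shot value times make percentage. A small sketch, using the rounded means from the summary tables (so the results differ slightly from the unrounded figures quoted above); note that these are unweighted means across player-seasons, not league-wide shooting percentages:

```r
# Rounded means taken from the 2P% and 3P% summary tables
mean_2p <- 0.475
mean_3p <- 0.337

# Expected points per attempt = shot value * make percentage
exp_2pt <- 2 * mean_2p
exp_3pt <- 3 * mean_3p

print(c(two_point = exp_2pt, three_point = exp_3pt))
```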
Free Throw Percentage (FT%) Summary
nba %>%
filter(`FTA`>=25, Tm != 'TOT') %>%
summarise(max = max(`FT%`, na.rm = T),
'75%' = quantile(`FT%`, probs = c(0.75)),
med = median(`FT%`, na.rm = T),
mean = mean(`FT%`, na.rm = T),
'25%' = quantile(`FT%`, probs = c(0.25)),
min = min(`FT%`, na.rm = T))
## # A tibble: 1 × 6
## max `75%` med mean `25%` min
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.815 0.758 0.741 0.686 0.16
College Summary
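First, split the Colleges column so that each college a player attended gets its own column; the counts below group on the first listed college only. The first output shows the 25 colleges that have produced the most NBA players (26 rows are returned because top_n() keeps ties, and the <NA> row covers players with no college listed). The second output shows the number of distinct first-listed colleges that have produced NBA players; this count includes the <NA> group.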
# Find 25 colleges with most NBA Players
nba %>%
separate_wider_delim(Colleges,',' ,names = c('college1', 'college2', 'college3','college4'), too_few = "align_start") %>%
group_by(college1) %>%
count(Player) %>%
summarise(total = sum(n_distinct(Player))) %>%
top_n(25) %>%
arrange(desc(total))
## Selecting by total
## # A tibble: 26 × 2
## college1 total
## <chr> <int>
## 1 <NA> 287
## 2 Kentucky 85
## 3 UCLA 74
## 4 UNC 71
## 5 Duke 69
## 6 Kansas 56
## 7 Arizona 51
## 8 Indiana 42
## 9 Louisville 42
## 10 Michigan 42
## # ℹ 16 more rows
nba %>%
separate_wider_delim(Colleges,',' ,names = c('college1', 'college2', 'college3','college4'), too_few = "align_start") %>%
group_by(college1) %>%
summarise(n_school = n_distinct(college1)) %>%
summarise(college_count = sum(n_school))
## # A tibble: 1 × 1
## college_count
## <int>
## 1 416
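The count above uses only the first listed college and treats missing values as a group. As an alternative, here is a minimal base R sketch that counts distinct colleges across every entry in the comma-separated string and drops missing values. The toy data frame is invented for illustration; only the Colleges column name and the comma delimiter come from the code above.

```r
# Toy data; the Colleges strings are invented examples of the comma-separated format
toy <- data.frame(
  Player   = c("A", "B", "C", "D"),
  Colleges = c("Kentucky", "UCLA,Duke", NA, "Duke"),
  stringsAsFactors = FALSE
)

# Split each string on commas and trim stray whitespace
college_list <- lapply(strsplit(toy$Colleges, ",", fixed = TRUE), trimws)

# Flatten, drop missing values, and count the distinct colleges
all_colleges <- unlist(college_list)
n_colleges <- length(unique(all_colleges[!is.na(all_colleges)]))

print(n_colleges)
```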
Minutes Played (MP)
nba %>%
filter(Tm != 'TOT') %>%
summarise(max = max(MP, na.rm = T),
'75%' = quantile(MP, probs = c(0.75)),
med = median(MP, na.rm = T),
mean = mean(MP, na.rm = T),
'25%' = quantile(MP, probs = c(0.25)),
min = min(MP, na.rm = T))
## # A tibble: 1 × 6
## max `75%` med mean `25%` min
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3533 1851 954 1135. 286 0
Position (Pos)
nba %>%
group_by(Player, Pos) %>%
count(Pos) %>%
group_by(Pos) %>%
summarise(pos_count = n_distinct(Player)) %>%
arrange(desc(pos_count))
## # A tibble: 5 × 2
## Pos pos_count
## <chr> <int>
## 1 SG 973
## 2 PF 920
## 3 SF 893
## 4 C 748
## 5 PG 704
Age
nba %>%
filter(Tm != 'TOT') %>%
summarise(max = max(Age, na.rm = T),
'75%' = quantile(Age, probs = c(0.75)),
med = median(Age, na.rm = T),
mean = mean(Age, na.rm = T),
'25%' = quantile(Age, probs = c(0.25)),
min = min(Age, na.rm = T))
## # A tibble: 1 × 6
## max `75%` med mean `25%` min
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 44 29 26 26.7 24 18
Height
nba %>%
mutate(`height(in)` = Feet*12 + Inches) %>%
filter(Tm != 'TOT') %>%
summarise(max = max(`height(in)`, na.rm = T),
'75%' = quantile(`height(in)`, probs = c(0.75)),
med = median(`height(in)`, na.rm = T),
mean = mean(`height(in)`, na.rm = T),
'25%' = quantile(`height(in)`, probs = c(0.25)),
min = min(`height(in)`, na.rm = T))
## # A tibble: 1 × 6
## max `75%` med mean `25%` min
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 91 81 79 78.8 76 63
TOV
Summary statistics for turnovers, followed by mean turnovers per season by position.
nba %>%
filter(Tm != 'TOT') %>%
summarise(max = max(TOV, na.rm = T),
'75%' = quantile(TOV, probs = c(0.75)),
med = median(TOV, na.rm = T),
mean = mean(TOV, na.rm = T),
'25%' = quantile(TOV, probs = c(0.25)),
min = min(TOV, na.rm = T))
## # A tibble: 1 × 6
## max `75%` med mean `25%` min
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 464 107 51 70.2 16 0
nba %>%
summarise(pos_mean = mean(TOV, na.rm = T),.by = Pos) %>%
arrange(desc(pos_mean))
## # A tibble: 5 × 2
## Pos pos_mean
## <chr> <dbl>
## 1 PG 82.6
## 2 SG 68.4
## 3 SF 68.0
## 4 PF 64.0
## 5 C 63.6
Weight (Wt)
Summary statistics for weight, followed by the mean and standard deviation by position.
nba %>%
filter(Tm != 'TOT') %>%
summarise(max = max(Wt, na.rm = T),
'75%' = quantile(Wt, probs = c(0.75)),
med = median(Wt, na.rm = T),
mean = mean(Wt, na.rm = T),
'25%' = quantile(Wt, probs = c(0.25)),
min = min(Wt, na.rm = T))
## # A tibble: 1 × 6
## max `75%` med mean `25%` min
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 360 235 215 216. 195 133
nba %>%
summarise(pos_mean = mean(Wt, na.rm = T), pos_sd = sd(Wt, na.rm = T),.by = Pos) %>%
arrange(desc(pos_mean))
## # A tibble: 5 × 3
## Pos pos_mean pos_sd
## <chr> <dbl> <dbl>
## 1 C 246. 20.5
## 2 PF 231. 16.1
## 3 SF 215. 14.5
## 4 SG 200. 14.5
## 5 PG 184. 14.3
A set of at least 5 novel questions to investigate informed by the following:
Questions:
Use of aggregation functions (other than the ones used from the first bullet, above)
A visual summary of at least 5 columns of your data
Age vs Position throughout the years
nba %>%
filter(Tm != 'TOT', Year %in% c(1980, 1985, 1990,1995,2000,2005,2010,2015,2020)) %>%
ggplot(mapping = aes(x = Pos,y = Age)) +
geom_boxplot(aes(color = Pos)) +
facet_wrap(~Year)
Height by Position
Distribution of heights by position. It would be interesting to see how this has changed over the years: for example, have point guards gotten taller while power forwards and centers have gotten shorter? It is also worth noting that, within each position, heights look roughly normally distributed.
# Height is expressed in feet so the reference lines at 6, 6.5 and 7 ft line up
nba %>%
mutate(`Height(ft)` = (Feet*12 + Inches)/12) %>%
ggplot()+
geom_bar(mapping = aes(`Height(ft)`)) +
geom_vline(xintercept = c(6,6.5,7), alpha = 0.3)+
facet_wrap(~Pos)
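One way to start on the question of whether heights have shifted over time is to average height by position within each year. The sketch below uses base R on invented data; only the Year, Pos, Feet, and Inches column names come from the dataset, and `height_in` and `height_trend` are names chosen for this example.

```r
# Toy data; values are invented for illustration
toy <- data.frame(
  Year   = c(1980, 1980, 2020, 2020),
  Pos    = c("PG", "PG", "PG", "PG"),
  Feet   = c(6, 6, 6, 6),
  Inches = c(0, 2, 3, 5)
)

# Height in inches, computed the same way as in the Height summary
toy$height_in <- toy$Feet * 12 + toy$Inches

# Mean height for each position within each year
height_trend <- aggregate(height_in ~ Pos + Year, data = toy, FUN = mean)

print(height_trend)
```

Applied to the full data, the resulting table could be plotted as one line per position to show any drift.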
3 point attempts over the years
As the game has evolved, the 3-point shot is being taken more and more. These graphs show that not only has the shot been taken more league-wide, but it has also been taken more by the “big men”: power forwards and centers.
nba %>%
filter(Tm != 'TOT', Year %in% c(1980,2000,2020)) %>%
ggplot() +
geom_histogram(aes(`3PA`),binwidth = 25, position = "dodge") +
facet_wrap(~Year)
nba %>%
filter(Tm != 'TOT', Pos %in% c('C','PF'), Year %in% c(1980,2000,2020)) %>%
ggplot() +
geom_histogram(aes(`3PA`),binwidth = 25, position = "dodge") +
facet_wrap(~Year)
OWS/DWS vs. GS: is there a correlation?
These graphs show the relationship between offensive/defensive win shares and games started. Negative values for both OWS and DWS are concentrated near 0 GS, while higher values are more often seen towards 82 GS.
nba %>%
ggplot(mapping = aes(x = GS, y = OWS)) +
geom_point(aes(color = OWS>0),alpha = 0.1) +
# Axis labels match the aesthetics: GS on x, OWS on y
labs(x = "Games Started",
y = "Offensive Win Shares",
title = "Offensive Win Shares by Games Started")
## Warning: Removed 707 rows containing missing values (`geom_point()`).
nba %>%
ggplot(mapping = aes(x = GS, y = DWS)) +
geom_point(aes(color = DWS>0),alpha = 0.1) +
# Axis labels match the aesthetics: GS on x, DWS on y
labs(x = "Games Started",
y = "Defensive Win Shares",
title = "Defensive Win Shares by Games Started")
## Warning: Removed 707 rows containing missing values (`geom_point()`).
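To put a number on the relationship, a correlation coefficient can be computed alongside the scatter plots. This is a minimal sketch on invented data; it assumes the same GS and OWS column names and uses `use = "complete.obs"` because the warning above indicates that some rows have missing values.

```r
# Toy data; values are invented for illustration, with one missing GS value
toy <- data.frame(
  GS  = c(0, 10, 41, 82, NA),
  OWS = c(-0.2, 0.5, 2.0, 6.1, 1.0)
)

# Pearson correlation, using only rows where both values are present
r_ows <- cor(toy$GS, toy$OWS, use = "complete.obs")

print(r_ows)
```

The same call with DWS in place of OWS would give the defensive counterpart.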
Assist/ TO Ratio by Position
The graph shows assists against turnovers, colored by position. Each line is a linear regression fit for one position, so its slope estimates how many additional assists to expect for each additional turnover at that position, which serves as a rough proxy for the position's AST/TOV ratio.
nba %>%
ggplot(mapping = aes(x= TOV,y = AST)) +
geom_point(mapping = aes(color = Pos), alpha = 0.1)+
# method = "lm" requests a straight-line fit; without it geom_smooth() picked a GAM smoother
geom_smooth(aes(color = Pos),method = "lm",formula = y~x,se = F)
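The per-position slopes described above can also be computed directly instead of being read off the plot. The sketch below fits a separate linear model of AST on TOV for each position using base R; the data are invented so that the slopes are easy to check, and `slopes` is a name chosen for this example.

```r
# Toy data; values are invented so that the slopes are easy to verify
toy <- data.frame(
  Pos = rep(c("PG", "C"), each = 3),
  TOV = c(10, 20, 30, 10, 20, 30),
  AST = c(30, 60, 90, 5, 10, 15)
)

# Fit AST ~ TOV separately for each position and keep the TOV slope
slopes <- sapply(split(toy, toy$Pos), function(d) {
  coef(lm(AST ~ TOV, data = d))[["TOV"]]
})

print(slopes)
```

In this toy data the point guards average 3 assists per turnover and the centers 0.5, and the fitted slopes recover those values.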