Week 2 Data Dive

Kael Ecord

Loading Data

A numeric summary of data for at least 10 columns of data

For categorical columns, this should include unique values and counts
For numeric columns, this includes min/max, central tendency, and some notion of distribution (e.g., quantiles)
These summaries can be combined

3P%

First, filter to players who attempted at least 25 three-pointers in a season and drop rows where Tm equals “TOT”. A “TOT” row holds a player's cumulative stats for a season in which he played for multiple teams, i.e., if it were included, those stats would be counted twice.

nba %>%
  filter(`3PA`>=25, Tm != 'TOT') %>%
  summarise(max = max(`3P%`, na.rm = T),
            '75%' = quantile(`3P%`, probs = c(0.75)),
            med = median(`3P%`, na.rm = T),
            mean = mean(`3P%`, na.rm = T),
            '25%' = quantile(`3P%`, probs = c(0.25)),
            min = min(`3P%`, na.rm = T))

## # A tibble: 1 × 6
##     max `75%`   med  mean `25%`   min
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.594 0.382 0.346 0.337   0.3 0.036
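
To make the “TOT” double counting concrete, here is a minimal sketch on made-up rows. The player names, team codes, and attempt counts are hypothetical, and it uses base R rather than the dplyr pipeline above so that it runs on its own.

```r
# Hypothetical rows for one player traded mid-season (made-up numbers).
toy <- data.frame(
  Player = c("Player A", "Player A", "Player A", "Player B"),
  Tm     = c("TOT", "AAA", "BBB", "CCC"),
  `3PA`  = c(100, 60, 40, 80),
  check.names = FALSE
)

# Summing every row counts Player A's attempts twice.
sum(toy$`3PA`)        # 280

# Dropping the cumulative "TOT" row leaves one row per team stint.
per_team <- toy[toy$Tm != "TOT", ]
sum(per_team$`3PA`)   # 180
```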

2P%

nba %>%
  filter(`2PA`>=25, Tm != 'TOT') %>%
  summarise(max = max(`2P%`, na.rm = T),
            '75%' = quantile(`2P%`, probs = c(0.75)),
            med = median(`2P%`, na.rm = T),
            mean = mean(`2P%`, na.rm = T),
            '25%' = quantile(`2P%`, probs = c(0.25)),
            min = min(`2P%`, na.rm = T))

## # A tibble: 1 × 6
##     max `75%`   med  mean `25%`   min
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.897 0.512 0.476 0.475 0.439 0.163

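This column is interesting when coupled with the 3P% column because it gives a little insight into why data-driven teams are looking to shoot more 3s. Multiplying each mean percentage by the value of the shot, the average 2-point attempt yields slightly less than 1 point (0.950768), whereas the average 3-point attempt yields slightly more than 1 point (1.011334). Note that these are unweighted means over player-season rows, not league-wide shooting percentages.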
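
The arithmetic behind those two figures can be sketched as follows. The percentages are the rounded means printed in the 3P% and 2P% summaries above, so the results differ slightly from the unrounded values quoted in the text; the variable names are just for illustration.

```r
# Mean shooting percentages, rounded as printed in the summaries above.
mean_2p_pct <- 0.475
mean_3p_pct <- 0.337

# Expected points per attempt = shot value * probability of making it.
exp_pts_2 <- 2 * mean_2p_pct
exp_pts_3 <- 3 * mean_3p_pct

exp_pts_2   # 0.95
exp_pts_3   # 1.011
```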

Free Throw Percentage (FT%) Summary

nba %>%
  filter(`FTA`>=25, Tm != 'TOT') %>%
  summarise(max = max(`FT%`, na.rm = T),
            '75%' = quantile(`FT%`, probs = c(0.75)),
            med = median(`FT%`, na.rm = T),
            mean = mean(`FT%`, na.rm = T),
            '25%' = quantile(`FT%`, probs = c(0.25)),
            min = min(`FT%`, na.rm = T))

## # A tibble: 1 × 6
##     max `75%`   med  mean `25%`   min
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1 0.815 0.758 0.741 0.686  0.16

College Summary

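First, split the Colleges column so that each college a player attended gets its own column (some players attended more than one). The first output shows the 25 colleges that have produced the most NBA players, counting each player under the first college listed. The <NA> row represents players with no college listed, and a tie at the cutoff yields 26 rows rather than 25. The second output shows the number of distinct first-listed colleges, with <NA> counted as one value.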

# Find the 25 colleges with the most NBA players (by first-listed college)
nba %>% 
  separate_wider_delim(Colleges, ',', names = c('college1', 'college2', 'college3', 'college4'), too_few = "align_start") %>%
  group_by(college1) %>%
  summarise(total = n_distinct(Player)) %>%
  top_n(25) %>%
  arrange(desc(total))

## Selecting by total

## # A tibble: 26 × 2
##    college1   total
##    <chr>      <int>
##  1 <NA>         287
##  2 Kentucky      85
##  3 UCLA          74
##  4 UNC           71
##  5 Duke          69
##  6 Kansas        56
##  7 Arizona       51
##  8 Indiana       42
##  9 Louisville    42
## 10 Michigan      42
## # ℹ 16 more rows

# Count distinct first-listed colleges (NA counts as one value)
nba %>% 
  separate_wider_delim(Colleges, ',', names = c('college1', 'college2', 'college3', 'college4'), too_few = "align_start") %>%
  summarise(college_count = n_distinct(college1))

## # A tibble: 1 × 1
##   college_count
##           <int>
## 1           416

Minutes Played (MP)

nba %>%
  filter(Tm != 'TOT') %>%
  summarise(max = max(MP, na.rm = T),
            '75%' = quantile(MP, probs = c(0.75)),
            med = median(MP, na.rm = T),
            mean = mean(MP, na.rm = T),
            '25%' = quantile(MP, probs = c(0.25)),
            min = min(MP, na.rm = T))

## # A tibble: 1 × 6
##     max `75%`   med  mean `25%`   min
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  3533  1851   954 1135.   286     0

Position (Pos)

# Count the distinct players who appear at each position
nba %>%
  group_by(Pos) %>%
  summarise(pos_count = n_distinct(Player)) %>%
  arrange(desc(pos_count))

## # A tibble: 5 × 2
##   Pos   pos_count
##   <chr>     <int>
## 1 SG          973
## 2 PF          920
## 3 SF          893
## 4 C           748
## 5 PG          704

Age

nba %>%
  filter(Tm != 'TOT') %>%
  summarise(max = max(Age, na.rm = T),
            '75%' = quantile(Age, probs = c(0.75)),
            med = median(Age, na.rm = T),
            mean = mean(Age, na.rm = T),
            '25%' = quantile(Age, probs = c(0.25)),
            min = min(Age, na.rm = T))

## # A tibble: 1 × 6
##     max `75%`   med  mean `25%`   min
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    44    29    26  26.7    24    18

Height

nba %>%
  mutate(`height(in)` = Feet*12 + Inches) %>%
  filter(Tm != 'TOT') %>%
  summarise(max = max(`height(in)`, na.rm = T),
            '75%' = quantile(`height(in)`, probs = c(0.75)),
            med = median(`height(in)`, na.rm = T),
            mean = mean(`height(in)`, na.rm = T),
            '25%' = quantile(`height(in)`, probs = c(0.25)),
            min = min(`height(in)`, na.rm = T))

## # A tibble: 1 × 6
##     max `75%`   med  mean `25%`   min
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    91    81    79  78.8    76    63

TOV

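Summary stats for turnovers, followed by mean turnovers per season by position. Note that the by-position means do not filter out “TOT” rows, so players who changed teams mid-season contribute extra rows there.

# Overall turnover summary, excluding 'TOT' rows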
nba %>%
  filter(Tm != 'TOT') %>%
  summarise(max = max(TOV, na.rm = T),
            '75%' = quantile(TOV, probs = c(0.75)),
            med = median(TOV, na.rm = T),
            mean = mean(TOV, na.rm = T),
            '25%' = quantile(TOV, probs = c(0.25)),
            min = min(TOV, na.rm = T))

## # A tibble: 1 × 6
##     max `75%`   med  mean `25%`   min
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   464   107    51  70.2    16     0

nba %>%
  summarise(pos_mean = mean(TOV, na.rm = T),.by = Pos) %>%
  arrange(desc(pos_mean))

## # A tibble: 5 × 2
##   Pos   pos_mean
##   <chr>    <dbl>
## 1 PG        82.6
## 2 SG        68.4
## 3 SF        68.0
## 4 PF        64.0
## 5 C         63.6

Weight (Wt)

Summary stats for weight, followed by the mean and standard deviation by position.

nba %>%
  filter(Tm != 'TOT') %>%
  summarise(max = max(Wt, na.rm = T),
            '75%' = quantile(Wt, probs = c(0.75)),
            med = median(Wt, na.rm = T),
            mean = mean(Wt, na.rm = T),
            '25%' = quantile(Wt, probs = c(0.25)),
            min = min(Wt, na.rm = T))

## # A tibble: 1 × 6
##     max `75%`   med  mean `25%`   min
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   360   235   215  216.   195   133

nba %>%
 summarise(pos_mean = mean(Wt, na.rm = T), pos_sd = sd(Wt, na.rm = T),.by = Pos) %>%
    arrange(desc(pos_mean))

## # A tibble: 5 × 3
##   Pos   pos_mean pos_sd
##   <chr>    <dbl>  <dbl>
## 1 C         246.   20.5
## 2 PF        231.   16.1
## 3 SF        215.   14.5
## 4 SG        200.   14.5
## 5 PG        184.   14.3

A set of at least 5 novel questions to investigate informed by the following:

column summaries (i.e., the above bullet)
data documentation
your project’s goals/purpose

Questions:

What is the average free throw percentage (FT%) by height and is there a correlation between FT% and position?
How has the distribution of 3 point attempts (3PAs) by each position changed throughout the years?
Do minutes per game (MPG) tend to increase or decrease as a player progresses throughout their career?
Which college has produced the most NBA players?
What are the trends in height by position since 1980?
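
As one possible starting point for the last question, the sketch below computes mean height for each year and position. The rows are made up, the column names (Year, Pos, Feet, Inches) are assumed to match the ones used elsewhere in this report, and it uses base R `aggregate()` rather than dplyr so it runs on its own.

```r
# Made-up rows standing in for the real data (hypothetical values).
toy <- data.frame(
  Year   = c(1980, 1980, 1980, 2020, 2020, 2020),
  Pos    = c("PG", "PG", "C", "PG", "PG", "C"),
  Feet   = c(6, 6, 7, 6, 6, 6),
  Inches = c(0, 2, 0, 2, 4, 11)
)
toy$height_in <- toy$Feet * 12 + toy$Inches

# Mean height (in inches) for each Year x Pos group.
height_trend <- aggregate(height_in ~ Year + Pos, data = toy, FUN = mean)
height_trend
```

On the real data, the same grouping could feed a line plot of mean height against Year, with one line per position.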

Use of aggregation functions (other than the ones used from the first bullet, above)

I.e., use these to explore something interesting about your data

A visual summary of at least 5 columns of your data

This should include distributions at least
In addition, you should consider trends, correlations, and interactions between variables
Use different channels (e.g., color) to show how categorical variables interact with continuous variables

Age vs Position throughout the years

nba %>%
  filter(Tm != 'TOT', Year %in% c(1980, 1985, 1990,1995,2000,2005,2010,2015,2020)) %>%
  ggplot(mapping = aes(x = Pos,y = Age)) +
    geom_boxplot(aes(color = Pos)) +
    facet_wrap(~Year)

Height by Position

Distribution of heights by position, shown in feet with reference lines at 6, 6.5, and 7 feet. It would be interesting to see how this has changed over the years: for example, have point guards gotten taller, and have power forwards and centers gotten shorter? It is also worth noting that within each position the heights look roughly normally distributed.

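nba %>%
  # Dividing total inches by 12 gives height in feet, so name the column accordingly
  mutate(`Height(ft)` = (Feet*12 + Inches)/12) %>%
  ggplot() +
  geom_bar(mapping = aes(`Height(ft)`)) +
  # Reference lines at 6, 6.5, and 7 feet
  geom_vline(xintercept = c(6, 6.5, 7), alpha = 0.3) +
  facet_wrap(~Pos)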

3 point attempts over the years

As the game has evolved, the 3-point shot is being taken more and more. These graphs show that not only has the shot been taken more league-wide, but it has also been taken more by the “big men”: power forwards and centers.

nba %>%
  filter(Tm != 'TOT', Year %in% c(1980,2000,2020)) %>%
  ggplot() +
  geom_histogram(aes(`3PA`),binwidth = 25, position = "dodge") +
  facet_wrap(~Year)

nba %>%
  filter(Tm != 'TOT', Pos %in% c('C','PF'), Year %in% c(1980,2000,2020)) %>%
  ggplot() +
  geom_histogram(aes(`3PA`),binwidth = 25, position = "dodge") +
  facet_wrap(~Year)

OWS/DWS vs. GS: is there a correlation?

These graphs show the relationship between offensive/defensive win shares and games started. Negative values of both OWS and DWS are concentrated near 0 games started, while higher values appear more often as GS approaches 82.

nba %>%
  ggplot(mapping = aes(x = GS, y = OWS)) +
  geom_point(aes(color = OWS>0),alpha = 0.1) +
  # Axis labels match the aesthetics: GS on x, OWS on y
  labs(x = "Games Started",
       y = "Offensive Win Shares",
       title = "Offensive Win Shares by Games Started")

## Warning: Removed 707 rows containing missing values (`geom_point()`).

nba %>%
  ggplot(mapping = aes(x = GS, y = DWS)) +
  geom_point(aes(color = DWS>0),alpha = 0.1) +
  # Axis labels match the aesthetics: GS on x, DWS on y
  labs(x = "Games Started",
       y = "Defensive Win Shares",
       title = "Defensive Win Shares by Games Started")

## Warning: Removed 707 rows containing missing values (`geom_point()`).
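
To put a number on the relationship the scatterplots suggest, one option is a Pearson correlation that skips rows with missing values. This is only a sketch on made-up numbers, not a result from the real data; the GS and OWS column names follow the plots above.

```r
# Made-up values; NA stands in for a row with missing games started.
toy <- data.frame(
  GS  = c(0, 10, 41, 70, 82, NA),
  OWS = c(-0.2, 0.3, 1.5, 4.0, 6.1, 0.5)
)

# Pearson correlation using only rows where both values are present.
r <- cor(toy$GS, toy$OWS, use = "complete.obs")
r
```

On the real data the same call would be `cor(nba$GS, nba$OWS, use = "complete.obs")`, and likewise for DWS.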

Assist/TOV Ratio by Position

The graph shows assists against turnovers, with color indicating position. Each line is a linear regression fit for one position. The slope of a line estimates how many additional assists to expect for each additional turnover; it matches the position's overall AST/TOV ratio only when the line passes close to the origin.

nba %>%
  ggplot(mapping = aes(x = TOV, y = AST)) +
  geom_point(mapping = aes(color = Pos), alpha = 0.1) +
  # method = "lm" requests a straight-line fit; the default here would be a GAM smooth
  geom_smooth(aes(color = Pos), method = "lm", formula = y~x, se = F)
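
The difference between a regression slope and an assist-to-turnover ratio can be checked on a small example. The totals below are made up so that AST is exactly 10 + 2 * TOV. On the real data the fit would be run separately for each position.

```r
# Made-up assist and turnover totals for one position (AST = 10 + 2 * TOV).
toy <- data.frame(
  TOV = c(20, 50, 100, 150, 200),
  AST = c(50, 110, 210, 310, 410)
)

# Linear fit with an intercept: the slope is extra assists per extra turnover.
fit   <- lm(AST ~ TOV, data = toy)
slope <- unname(coef(fit)["TOV"])

# Overall ratio: total assists divided by total turnovers.
ratio <- sum(toy$AST) / sum(toy$TOV)

slope   # about 2 for this made-up data
ratio   # about 2.1, higher than the slope because the intercept is positive
```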