Project 2 - NBA Team Statistics

Author

Kristoff Oliphant

Introduction

My goal is to transform a wide-format dataset of NBA team statistics with multiple header rows into a tidy format that supports my analysis. The raw data, shared by one of my classmates in Discussion 5A (Untidy Data), covers multiple seasons and groups its statistics into categories such as Shooting, Advanced, and Per Game. Each statistic is stored as its own column rather than as a value within a single variable column.

Planned Workflow

My workflow aims to be reproducible and relies on the tidyverse. I plan to load the raw wide NBA data from a CSV file into R, then tidy and transform it with dplyr: removing redundant columns and using rename() to give the remaining columns a consistent format. I also want to make sure the statistical metrics are converted from strings to numeric values so that I can run mathematical operations on them. Following my classmate’s suggestion, I will use this dataset to look at trends in team performance over time, such as how three-point shooting has changed NBA offenses.
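The cleaning steps described above can be sketched on a small stand-in table. This is only an illustration under assumptions: the `toy_raw` rows and the snake_case names (`season`, `fg`, `three_p`) are hypothetical and not taken from the real file, and which columns count as redundant is a guess.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows standing in for the raw file; the values are made up
toy_raw <- tibble(
  Rk     = c("1", "2"),
  Season = c("season-a", "season-b"),
  Lg     = c("NBA", "NBA"),
  FG     = c("36.2", "35.7"),
  `3P`   = c("5.2", NA)
)

toy_clean <- toy_raw %>%
  # Drop columns assumed to be redundant for a trend analysis
  select(-Rk, -Lg) %>%
  # Rename to one consistent convention (names are illustrative)
  rename(season = Season, fg = FG, three_p = `3P`) %>%
  # Convert the metric columns from strings to numbers; NA stays NA
  mutate(across(c(fg, three_p), as.numeric))

print(toy_clean)
```

Converting with `across()` keeps the type change in one place, so adding another metric column only means adding its name to the selection.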

Anticipated Challenges

One challenge I expect is restructuring the data so that it is readable and easy to work with. The raw CSV groups its columns under category headers such as ‘Per Game’, ‘Shooting’, and ‘Advanced’, so I will have to make sure each statistic is matched to the right category. Some columns will also contain empty values, either because a statistic was not tracked in that season or because the value is missing.
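One possible way to deal with a two-row header is to read the two header rows on their own, paste them together, and then read the body with the combined names. The sketch below assumes a layout where the category row sits above the statistic row. The `csv_text` content is invented for illustration and is not the real file.

```r
library(readr)

# Hypothetical two-row header plus two made-up data rows
csv_text <- ",Per Game,Per Game,Shooting
Season,FG,3P,FG%
season-a,36.0,4.8,0.449
season-b,34.2,4.5,0.437
"

# Read only the two header rows, keeping everything as text
header <- read_csv(
  I(csv_text),
  col_names = FALSE,
  n_max = 2,
  col_types = cols(.default = col_character())
)

top    <- unlist(header[1, ], use.names = FALSE)  # category row
bottom <- unlist(header[2, ], use.names = FALSE)  # statistic row

# Prefix each statistic with its category when a category is present
new_names <- ifelse(is.na(top), bottom, paste(top, bottom, sep = "_"))

# Read the data rows, skipping both header rows
body <- read_csv(
  I(csv_text),
  skip = 2,
  col_names = new_names,
  show_col_types = FALSE
)

print(names(body))
```

A simpler alternative is to drop the category row entirely with `skip = 1`, which works as long as the statistic names are unique without their category prefix.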

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

nba_raw <- read_csv("nba_raw_wide.csv", skip = 1)

Rows: 80 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Season, Lg, Ht
dbl (30): Rk, Age, Wt, G, MP, FG, FGA, 3P, 3PA, FT, FTA, ORB, DRB, TRB, AST,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(nba_raw)

# A tibble: 6 × 33
     Rk Season Lg      Age Ht       Wt     G    MP    FG   FGA  `3P` `3PA`    FT
  <dbl> <chr>  <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 2025-… NBA    26.2 6-7     216   910  241.  41.8  89.2  13.3  37    18.4
2     2 2024-… NBA    26.3 6-7     216  1230  241.  41.7  89.2  13.5  37.6  16.9
3     3 2023-… NBA    26.4 6-6     217  1230  241.  42.2  88.9  12.8  35.1  17  
4     4 2022-… NBA    26.1 6-6     217  1230  242.  42    88.3  12.3  34.2  18.4
5     5 2021-… NBA    26.1 6-6     216  1230  241.  40.6  88.1  12.4  35.2  16.9
6     6 2020-… NBA    26.1 6-6     218  1080  241.  41.2  88.4  12.7  34.6  17  
# ℹ 20 more variables: FTA <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>,
#   STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>, `FG%` <dbl>,
#   `3P%` <dbl>, `FT%` <dbl>, Pace <dbl>, `eFG%` <dbl>, `TOV%` <dbl>,
#   `ORB%` <dbl>, `FT/FGA` <dbl>, ORtg <dbl>, `TS%` <dbl>

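nba_tidy <- nba_raw %>%
  mutate(
    # Seasons with no recorded three-pointers have a missing 3P value; treat it as 0
    `3P` = replace_na(`3P`, 0),
    # Two-point makes are total field goals minus three-point makes
    `2P` = FG - `3P`,
    # Use the first four-digit year in the season label as the numeric year
    year = as.numeric(str_extract(Season, "\\d{4}"))
  ) %>%
  select(year, `2P`, `3P`) %>%
  # Reshape to one row per year and shot type
  pivot_longer(
    cols = c(`2P`, `3P`),
    names_to = "Shot_Type",
    values_to = "Made_Per_Game"
  )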

ggplot(nba_tidy, aes(x = year, y = Made_Per_Game, color = Shot_Type)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Revolution of the NBA Shot",
    subtitle = "Evolution of 2-Pointers vs. 3-Pointers (1946 - Present)",
    x = "Year",
    y = "Average Shots Made Per Game",
    color = "Shot Type"
  ) +
  theme_minimal()

Conclusion

Based on the graph, no 3-pointers were recorded in the early years of the NBA; the shot first appears in the 1979-1980 season. Even after it was introduced, few 3-pointers were made per game, while the 2-pointer had surged between 1960 and 1980. It wasn’t until the 2000s that the 3-point shot began to rise while the 2-point shot gradually fell. NBA teams and game plans then moved further away from the 2-pointer as the 3-pointer gained prominence during the 2010s, reaching an all-time high in the 2020s.
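To put a rough number on that prominence, the share of made field goals that are 3-pointers can be computed as 3P / FG. The sketch below uses only the FG and 3P values shown in the first two rows of the head(nba_raw) output; the column name `three_share` is an illustrative choice. For those two seasons the share comes out a little under a third.

```r
library(dplyr)
library(tibble)

# FG and 3P values copied from the first two rows of head(nba_raw)
recent <- tibble(
  Rk   = c(1, 2),
  FG   = c(41.8, 41.7),
  `3P` = c(13.3, 13.5)
)

recent_share <- recent %>%
  mutate(
    `2P`        = FG - `3P`,   # two-point makes per game
    three_share = `3P` / FG    # fraction of makes that are 3-pointers
  )

print(recent_share)
```

Applying the same `mutate()` to the full `nba_raw` table would give this share for every season, which could be plotted against year as a companion to the chart above.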