library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load Dataset

nba <- read.csv("nba.csv")

Playoffs (Categorical)

nba |>
  count(Playoffs)
##   Playoffs    n
## 1    false 1655
## 2     true   48

The dataset contains far more regular-season games (1,655) than playoff games (48), which is expected given the structure of an NBA season. The distinction matters because playoff games often differ in rotations and defensive pressure due to the higher stakes. Any analysis that compares playoff and regular-season performances should account for this imbalance, since averages computed from only 48 games are far noisier than those computed from 1,655. Future models will most likely either analyze playoff games separately or use a weighted or normalized comparison.
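One way to make the imbalance explicit is to report each group's share of rows next to its count. The following is a minimal sketch, assuming `Playoffs` holds the strings "true" and "false" as in the output above; the `group_shares` helper and the `toy` data frame are hypothetical and stand in for the real `nba.csv`.

```r
library(dplyr)

# Hypothetical helper: count rows per group and add each group's share of the total.
group_shares <- function(data, group_col) {
  data |>
    count({{ group_col }}) |>
    mutate(share = n / sum(n))
}

# Toy data for illustration only (not the real nba.csv).
toy <- data.frame(Playoffs = c("false", "false", "false", "true"))

shares <- group_shares(toy, Playoffs)
print(shares)
```

Applied to the counts shown above, the playoff share would be 48 / 1703, or roughly 2.8% of all rows.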

Points (Numeric)

summary(nba$PTS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   19.00   24.00   26.06   32.00   81.00
quantile(nba$PTS, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75% 
##  19  24  32

Points scored in this dataset range widely, from 4 to 81, with the maximum being Kobe Bryant’s historic 81-point game. The mean (26.06) sits above the median (24), which is consistent with a right-skewed distribution in which elite scoring performances are rare but highly influential. The low end of the range also shows that an unexpected performance does not always mean a high scoring output: players can affect the game in other ways, and even superstars can have an off night when their shots are not falling. These extremes should be treated as meaningful outliers rather than noise, although thresholds could be set when examining other factors that influence these games.
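If a threshold for extreme scoring games is needed, one conventional option (not something this analysis has committed to) is the 1.5 × IQR rule. With the quartiles printed above, the upper fence would be 32 + 1.5 × (32 − 19) = 51.5 points. The sketch below assumes that rule; `upper_fence` is a hypothetical helper, and the toy vector simply reuses the five-number summary for illustration.

```r
# Hypothetical helper: upper fence of the 1.5 * IQR rule for a numeric vector.
upper_fence <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q[2] + k * (q[2] - q[1])
}

# Toy vector built from the five-number summary above (illustration only).
pts <- c(4, 19, 24, 32, 81)

fence <- upper_fence(pts)     # Q3 + 1.5 * IQR
flagged <- pts[pts > fence]   # values above the fence
print(fence)
print(flagged)
```

On the real data the same call would be `upper_fence(nba$PTS)`. The multiplier `k` is a judgment call and could be raised if too many games are flagged.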

Exploratory Questions

1. Do players tend to perform better or worse in playoff games compared to regular season games?
2. What is the relationship between scoring and a player’s overall impact on the game?
3. Are there any trends or patterns in these performances across seasons?

Aggregation (Question 1)

nba |>
  group_by(Playoffs) |>
  summarize(
    avg_pts = mean(PTS, na.rm = TRUE),
    avg_gmsc = mean(GmSc, na.rm = TRUE),
    games = n()
  )
## # A tibble: 2 × 4
##   Playoffs avg_pts avg_gmsc games
##   <chr>      <dbl>    <dbl> <int>
## 1 false       25.9     25.0  1655
## 2 true        32.9     30.3    48

Playoff games show higher average points (32.9 vs. 25.9) and higher average game scores (30.3 vs. 25.0) than regular-season games, suggesting that players in this dataset tend to raise their level of play in the postseason. Another possible explanation is that superstar players usually get more minutes in the playoffs and may take more shots as well. Because the playoff averages rest on only 48 games, the gap should be read with caution. Evaluating players purely on points may also undervalue playoff contributors who are not high scorers but affect the game in other ways, so later analysis could draw on a variety of performance metrics when assessing these standout performances.
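Since the playoff group is so small, it may help to report a standard error next to each group mean. This is a rough sketch rather than part of the original analysis; `mean_with_se` and the toy data are hypothetical, and the column names mirror those used above.

```r
library(dplyr)

# Hypothetical helper: group mean plus its standard error (sd / sqrt(n)),
# so that small groups visibly carry more uncertainty.
mean_with_se <- function(data, group_col, value_col) {
  data |>
    group_by({{ group_col }}) |>
    summarize(
      n   = sum(!is.na({{ value_col }})),
      avg = mean({{ value_col }}, na.rm = TRUE),
      se  = sd({{ value_col }}, na.rm = TRUE) / sqrt(n),
      .groups = "drop"
    )
}

# Toy data for illustration only (not the real nba.csv).
toy <- data.frame(
  Playoffs = c("false", "false", "false", "true", "true"),
  PTS      = c(20, 24, 28, 30, 34)
)

pts_summary <- mean_with_se(toy, Playoffs, PTS)
print(pts_summary)
```

On the real data, `mean_with_se(nba, Playoffs, PTS)` would show how much wider the uncertainty is for the 48 playoff games. A Welch test via `t.test(PTS ~ Playoffs, data = nba)` would be a natural follow-up.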

Visualization (Points vs. Game Score)

nba |>
  ggplot(aes(x = PTS, y = GmSc, color = Playoffs)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Total Points vs Game Score",
    x = "Points",
    y = "Game Score",
    color = "Playoffs"
  )

There is a clear positive relationship between points scored and game score, but game scores still vary considerably at similar point totals. While the highest point totals occurred in the regular season, many playoff performances have higher game scores at the same number of points, which suggests that players find additional ways to contribute in the playoffs. In the future, I would like to determine which statistics other than scoring most increase a player’s overall game impact.
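The visual impression that playoff games score higher at equal point totals could be checked with a simple linear model, where the `Playoffs` coefficient estimates the game-score gap after holding points fixed. The sketch below is only an illustration: the toy data are constructed with a slope of 0.8 and a 2-point playoff bump so that the fitted coefficients are easy to read, and none of these numbers come from the real dataset.

```r
# Toy data for illustration only: GmSc is built as 0.8 * PTS plus a
# 2-point bump for playoff games, so the model should recover both values.
toy <- data.frame(
  PTS      = c(10, 20, 30, 40, 15, 25, 35),
  Playoffs = c("false", "false", "false", "false", "true", "true", "true")
)
toy$GmSc <- 0.8 * toy$PTS + ifelse(toy$Playoffs == "true", 2, 0)

# The Playoffstrue coefficient is the game-score gap at equal points.
fit <- lm(GmSc ~ PTS + Playoffs, data = toy)
coefs <- coef(fit)
print(round(coefs, 3))
```

On the real data the equivalent call would be `lm(GmSc ~ PTS + Playoffs, data = nba)`, with the usual caveat that 48 playoff games give an imprecise estimate.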

Visualization (Point Averages Over Time)

nba |>
  group_by(Year) |>
  summarize(avg_pts = mean(PTS, na.rm = TRUE)) |>
  ggplot(aes(x = Year, y = avg_pts)) +
  geom_line() +
  labs(
    title = "Average PPG Over Time",
    x = "Year",
    y = "Average Points"
  )

Although averages fluctuate from year to year, which is unsurprising given that these are all unexpected performances, there is still a noticeable upward trend in points per game over time. A faster pace of play and more three-point shooting across the NBA are likely contributors, although this dataset alone cannot confirm the cause. A player’s scoring output should therefore be interpreted within its historical context: a 30-point game in earlier seasons may carry more weight than it does in the modern era.
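One simple way to put scoring in historical context is to divide each game's points by the average for its season. This is a minimal sketch under that assumption; `add_era_adjusted_pts` and the toy data are hypothetical, and the season average here is the average within the dataset, not a league-wide figure.

```r
library(dplyr)

# Hypothetical helper: express each game's points relative to the
# average points of all games from the same season in the data.
add_era_adjusted_pts <- function(data) {
  data |>
    group_by(Year) |>
    mutate(pts_vs_season = PTS / mean(PTS, na.rm = TRUE)) |>
    ungroup()
}

# Toy data for illustration only (not the real nba.csv).
toy <- data.frame(
  Year = c(1990, 1990, 2020, 2020),
  PTS  = c(20, 30, 30, 50)
)

adjusted <- add_era_adjusted_pts(toy)
print(adjusted)
```

In the toy data, the same 30-point game is above its season's average in 1990 (ratio 1.2) and below it in 2020 (ratio 0.75), which is the kind of contrast the paragraph above describes.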