library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
NBA Dataset
NBA_Data <- read_csv("NBA Dataset for Submission.csv")
## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date (1): game_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Numerical Summaries
How many different types of games are there, and what are they?
unique(NBA_Data$season_type)
## [1] "Playoffs" "All-Star" "All Star" "Regular Season"
## [5] "Pre Season"
summary(NBA_Data$season_type)
## Length Class Mode
## 46977 character character
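`summary()` on a character column only reports its length and class, so a per-category count answers the question above more directly. The sketch below uses a small invented data frame (`toy_games`, a hypothetical stand-in for `NBA_Data`) to show the idea with `dplyr::count()`.

```r
library(dplyr)

# Toy stand-in for NBA_Data; these rows are invented for illustration only.
toy_games <- data.frame(
  season_type = c("Playoffs", "All-Star", "All Star",
                  "Regular Season", "Regular Season", "Pre Season")
)

# One row per season type with the number of games in each, largest first.
type_counts <- toy_games |>
  count(season_type, sort = TRUE)

type_counts
```

Run against the real data, the same `count(season_type, sort = TRUE)` call would show how many games fall under each label, including the two spellings of the All-Star label.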
For the rest of the analysis, All-Star and preseason games will be
filtered out, since they do not feature “legitimate” competition among
teams' everyday starters.
NBA_Data <- NBA_Data |>
  filter(season_type != 'All-Star',
         season_type != 'All Star',
         season_type != 'Pre Season')
That removed roughly 1,600 records from the original data set, leaving
about 45,400 records.
Average Home Scoring Margin
NBA_Data <- NBA_Data |>
  mutate(home_score_margin = pts_home - pts_away)
summary(NBA_Data$home_score_margin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -58.000 -6.000 4.000 3.367 12.000 73.000
Home teams outscore away teams by an average of 3.4 points per game
(median 4). That alone does not prove a home court advantage, but it
raises interesting questions about how large the effect is.
Lingering Questions
1. What impact does home court advantage have on NBA games?
2. How has the introduction of the three-point line changed the game
of basketball?
3. Has there been an increase in transition points or up-tempo game
play in the last 5-10 years?
Home Court Evaluation
How strong is the advantage for teams winning at home? Is there
a sizable difference?
NBA_Data |>
  select(wl_home) |>
  table()
## wl_home
## L W
## 17927 27471
The home team wins about 60.5% of its home games (27,471 of 45,398).
NBA_Data |>
  filter(season_type == "Playoffs") |>
  select(wl_home) |>
  table()
## wl_home
## L W
## 997 1737
That number increases to about 63.5% in the playoffs (1,737 of 2,734).
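The win percentages can be computed rather than worked out by hand. This is a minimal sketch that re-enters the counts printed in the two tables above and converts them to shares with base R's `prop.table()`; the vector names `overall` and `playoffs` are only illustrative.

```r
# Counts copied from the two wl_home tables above.
overall  <- c(L = 17927, W = 27471)
playoffs <- c(L = 997,   W = 1737)

# Share of each outcome as a percentage, rounded to one decimal place.
overall_pct  <- round(100 * prop.table(overall), 1)
playoffs_pct <- round(100 * prop.table(playoffs), 1)

overall_pct
playoffs_pct
```

On the data frame itself, `mean(NBA_Data$wl_home == "W", na.rm = TRUE)` should give the same overall share without retyping any counts.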
Is there a correlation between attendance and winning?
NBA_Data2 <- NBA_Data |>
  filter(!is.na(attendance))
NBA_Data2 <- NBA_Data2 |>
  mutate(win_binary = ifelse(wl_home == "W", 1, 0))
cor_val <- cor(NBA_Data2$attendance, NBA_Data2$win_binary, use = "complete.obs")
cor_val
## [1] 0.02911234
ggplot(NBA_Data2, aes(x = wl_home, y = attendance, fill = wl_home)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.15, alpha = 0.4) +
  labs(
    title = "Attendance vs Home Team Result",
    subtitle = paste("Point-biserial correlation:", round(cor_val, 3)),
    x = "Home Team Result (W/L)",
    y = "Attendance"
  ) +
  theme_minimal()

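Not according to this analysis. A correlation of about 0.03 is very
close to zero, so attendance has essentially no linear relationship
with whether the home team wins.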
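A point-biserial correlation is just a Pearson correlation in which one variable is coded 0/1, so `cor.test()` can attach a confidence interval and p-value to the estimate. The sketch below runs on simulated data (`toy`, an invented stand-in for `NBA_Data2`), so its numbers say nothing about the real attendance figures.

```r
set.seed(1)

# Simulated stand-in for NBA_Data2; all values are invented.
toy <- data.frame(
  attendance = round(rnorm(200, mean = 17000, sd = 2500)),
  win_binary = rbinom(200, size = 1, prob = 0.6)
)

# Pearson correlation test; with a 0/1 variable this is the point-biserial case.
ct <- cor.test(toy$attendance, toy$win_binary)

ct$estimate   # same value cor() returns
ct$conf.int   # 95% confidence interval for the correlation
```

With tens of thousands of games, even a tiny correlation can come out statistically significant, so the size of the estimate matters more than the p-value here.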
Changes Due to the Three Point Line
I think the league has made more threes as the years have gone
on, but let’s see.
NBA_Data |>
  mutate(total_threes = fg3m_home + fg3m_away) |>
  group_by(season) |>
  summarise(total_threes = sum(total_threes)) |>
  ggplot(aes(x = as.numeric(substr(season, 1, 4)), y = total_threes)) +
  geom_line(linewidth = 1.2, color = "steelblue") +
  geom_point(color = "steelblue") +
  labs(
    title = "Total Threes by Season",
    x = "Season",
    y = "Total Threes"
  ) +
  theme_minimal()

This graph supports my hypothesis, but it also reminds me of an
important detail I had forgotten: the 2011 lockout, which shortened the
2011-12 season to 66 games per team and so lowers that season's total.
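Because season totals depend on how many games were played, a per-game average is one way to keep a shortened season from looking like a drop in three-point shooting. The following is a sketch on an invented data frame (`toy_games`, a hypothetical stand-in for `NBA_Data`), reusing the `fg3m_home` and `fg3m_away` column names from the data set.

```r
library(dplyr)

# Toy stand-in for NBA_Data; rows are invented for illustration only.
toy_games <- data.frame(
  season    = c(2010, 2010, 2010, 2011, 2011),
  fg3m_home = c(6, 8, 7, 9, 7),
  fg3m_away = c(5, 7, 6, 8, 6)
)

# Games played and average made threes per game (both teams combined).
threes_per_game <- toy_games |>
  mutate(total_threes = fg3m_home + fg3m_away) |>
  group_by(season) |>
  summarise(
    games = n(),
    threes_per_game = mean(total_threes, na.rm = TRUE)
  )

threes_per_game
```

In this toy example the second season has fewer games and a lower total, yet a higher per-game average, which is the pattern a season total would hide.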
Transition Points in the Last 5 Seasons
NBA_Data |>
  filter(
    !is.na(pts_fb_home),
    !is.na(pts_fb_away),
    as.numeric(substr(season, 1, 4)) >= 2017
  ) |>
  mutate(total_transition = pts_fb_home + pts_fb_away) |>
  group_by(season) |>
  summarise(total_transition = sum(total_transition)) |>
  ggplot(aes(x = as.numeric(substr(season, 1, 4)), y = total_transition)) +
  geom_line(linewidth = 1.2, color = "steelblue") +
  geom_point(color = "steelblue") +
  labs(
    title = "Total Fastbreak Points by Season",
    x = "Season",
    y = "Total Transition Points"
  ) +
  theme_minimal()

The dip in fast-break/transition points coinciding with the global
pandemic is fascinating. I wonder whether the reduced number of games or
the NBA’s bubble format affected the pace of play, whereas neither
seemed to affect the three-point volume discussed earlier. Because the
chart shows season totals, a shorter season would lower the total on its
own, so a per-game comparison is needed to tell the two explanations
apart.
I want to explore the dip further in later analyses to see what other
statistical measures or game states could have been affected in this
range of seasons.