library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

NBA Dataset

Loading in the data:
NBA_Data <- read_csv("NBA Dataset for Submission.csv")
## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl  (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date  (1): game_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Filtering out unnecessary games:
NBA_Data <- NBA_Data |>
  filter(season_type != 'All-Star',
         season_type != 'All Star',
         season_type != 'Pre Season')

Creating New Variables

new_cols <- NBA_Data |>
  mutate(fg_pct_home = (fgm_home / fga_home))

Home FG% Distribution

summary(new_cols$fg_pct_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2338  0.4267  0.4658  0.4667  0.5062  0.6966

Creating a New Column based off FG%

new_cols <- new_cols |>
  mutate(shooting_level = case_when(
    fg_pct_home > 0.55 ~ 'elite',
    fg_pct_home >= 0.5 & fg_pct_home <= 0.55 ~ 'very good',
    fg_pct_home >= 0.45 & fg_pct_home < 0.5 ~ 'good',
    fg_pct_home >= 0.42 & fg_pct_home < 0.45 ~ 'average',
    fg_pct_home >= 0.38 & fg_pct_home < 0.42 ~ 'below average',
    fg_pct_home < 0.38 ~ 'bad',
  ))

Shooting Levels Indicating Shooting Success (in terms of FG%)

new_cols <- new_cols |>
  mutate(
    shooting_level = factor(
      shooting_level,
      levels = c(
        "bad",
        "below average",
        "average",
        "good",
        "very good",
        "elite"
      ),
      ordered = TRUE
    )
  )

Home FG% vs. Home Points

new_cols |>
  ggplot(aes(x = fg_pct_home, y = pts_home, color = shooting_level)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_brewer(palette = "Blues") +
  labs(
    title = "Home FG% vs Home Points (Colored by Shooting Level)",
    x = "Home Field Goal Percentage",
    y = "Home Points",
    color = "Shooting Level"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

A better field goal percentage inherently means the team is shooting more efficiently - they have made most of their shots. However, this graph displays whether or not this also leads to an increase in points scored as well. Unsurprisingly, more points are scored when teams shoot more efficiently from the field. However, is there a strong correlation as well?

Correlation

cor_pair1 <- cor(new_cols$fg_pct_home, new_cols$pts_home, use = "complete.obs")
cor_pair1
## [1] 0.6786808

The correlation of nearly 0.68 indicates a strong positive correlation. This is not a surprise considering both conventional wisdom (the better you shoot the more points you should score), but also the graph above.

Confidence Interval

pts <- new_cols$pts_home
n1 <- length(pts)
mean1 <- mean(pts, na.rm = TRUE)
sd1 <- sd(pts, na.rm = TRUE)
se1 <- sd1 / sqrt(n1)
tcrit1 <- qt(0.975, df = n1 - 1)

lower1 <- mean1 - tcrit1 * se1
upper1 <- mean1 + tcrit1 * se1
c(lower1, upper1)
## [1] 103.8530 104.1036

We are 95% confident that the true average points per home team in a game is between 103.8530 and 104.1036. This number is representative of the mean scoring output for all games in the data set, in other words it is the average points per home team in a game in the three point era. However, I would be willing to guess that if we restricted that number to more recent years (e.g. after 2000), that number would inflate. But in the scope of this analysis, that number does make sense since basketball in the 1980s and 1990s was a slower paced game and hitting 100 points was a rarer occurrence.

Home FG% vs. Field Goals Made

new_cols |>
  ggplot(aes(x = fg_pct_home, y = fgm_home, color = shooting_level)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_brewer(palette = "Blues") +
  labs(
    title = "Home FG% vs Field Goals Made (Colored by Shooting Level)",
    x = "Home Field Goal Percentage",
    y = "Field Goals Made",
    color = "Shooting Level"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

A better field goal percentage inherently means the team is shooting more efficiently - they have made most of their shots. However, this graph displays whether or not this also leads to an increase in total shots made. There is an argument to be made that if you are being more efficient shooting the ball, you may not need to take as many shots. However, rather unsurprisingly, when teams shoot more efficiently from the field, they tend to keep shooting according to the graph. While the visual evidence is rather compelling, we have to ask how strong the correlation is?

Correlation

cor_pair2 <- cor(new_cols$fg_pct_home, new_cols$fgm_home, use = "complete.obs")
cor_pair2
## [1] 0.7448839

The correlation of nearly 0.75 indicates a very strong positive correlation. This is not a surprise considering both conventional wisdom (the better you shoot the more tempted the team is to keep shooting), but also the graph above.

Confidence Interval

fgm <- new_cols$fgm_home
n2 <- length(fgm)
mean2 <- mean(fgm, na.rm = TRUE)
sd2 <- sd(fgm, na.rm = TRUE)
se2 <- sd2 / sqrt(n2)
tcrit2 <- qt(0.975, df = n2 - 1)

lower2 <- mean2 - tcrit2 * se2
upper2 <- mean2 + tcrit2 * se2
c(lower2, upper2)
## [1] 38.96476 39.07304

We are 95% confident that the true average field goals made per home team in a game is between 38.96476 and 39.07304. This number is representative of the mean shooting output for all games in the data set, in other words it is the average field goals made per home team in a game in the three point era. However, I would be willing to guess that if we restricted that number to more recent years (e.g. after 2000), that number would inflate. But in the scope of this analysis, that number does make sense since basketball in the 1980s and 1990s was a slower paced game and offense was more limited. In the modern game, defense is not as strong and these numbers would likely grow.