library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
NBA_Data <- read_csv("NBA Dataset for Submission.csv")
## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date (1): game_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NBA_Data <- NBA_Data |>
filter(season_type != 'All-Star',
season_type != 'All Star',
season_type != 'Pre Season')
new_cols <- NBA_Data |>
mutate(fg_pct_home = (fgm_home / fga_home))
summary(new_cols$fg_pct_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2338 0.4267 0.4658 0.4667 0.5062 0.6966
new_cols <- new_cols |>
mutate(shooting_level = case_when(
fg_pct_home > 0.55 ~ 'elite',
fg_pct_home >= 0.5 & fg_pct_home <= 0.55 ~ 'very good',
fg_pct_home >= 0.45 & fg_pct_home < 0.5 ~ 'good',
fg_pct_home >= 0.42 & fg_pct_home < 0.45 ~ 'average',
fg_pct_home >= 0.38 & fg_pct_home < 0.42 ~ 'below average',
fg_pct_home < 0.38 ~ 'bad',
))
new_cols <- new_cols |>
mutate(
shooting_level = factor(
shooting_level,
levels = c(
"bad",
"below average",
"average",
"good",
"very good",
"elite"
),
ordered = TRUE
)
)
new_cols |>
ggplot(aes(x = fg_pct_home, y = pts_home, color = shooting_level)) +
geom_point(alpha = 0.7, size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
scale_color_brewer(palette = "Blues") +
labs(
title = "Home FG% vs Home Points (Colored by Shooting Level)",
x = "Home Field Goal Percentage",
y = "Home Points",
color = "Shooting Level"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
A better field goal percentage inherently means the team is shooting more efficiently - they have made most of their shots. However, this graph displays whether or not this also leads to an increase in points scored as well. Unsurprisingly, more points are scored when teams shoot more efficiently from the field. However, is there a strong correlation as well?
cor_pair1 <- cor(new_cols$fg_pct_home, new_cols$pts_home, use = "complete.obs")
cor_pair1
## [1] 0.6786808
The correlation of nearly 0.68 indicates a strong positive correlation. This is not a surprise considering both conventional wisdom (the better you shoot the more points you should score), but also the graph above.
pts <- new_cols$pts_home
n1 <- length(pts)
mean1 <- mean(pts, na.rm = TRUE)
sd1 <- sd(pts, na.rm = TRUE)
se1 <- sd1 / sqrt(n1)
tcrit1 <- qt(0.975, df = n1 - 1)
lower1 <- mean1 - tcrit1 * se1
upper1 <- mean1 + tcrit1 * se1
c(lower1, upper1)
## [1] 103.8530 104.1036
We are 95% confident that the true average points per home team in a game is between 103.8530 and 104.1036. This number is representative of the mean scoring output for all games in the data set, in other words it is the average points per home team in a game in the three point era. However, I would be willing to guess that if we restricted that number to more recent years (e.g. after 2000), that number would inflate. But in the scope of this analysis, that number does make sense since basketball in the 1980s and 1990s was a slower paced game and hitting 100 points was a rarer occurrence.
new_cols |>
ggplot(aes(x = fg_pct_home, y = fgm_home, color = shooting_level)) +
geom_point(alpha = 0.7, size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
scale_color_brewer(palette = "Blues") +
labs(
title = "Home FG% vs Field Goals Made (Colored by Shooting Level)",
x = "Home Field Goal Percentage",
y = "Field Goals Made",
color = "Shooting Level"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
A better field goal percentage inherently means the team is shooting more efficiently - they have made most of their shots. However, this graph displays whether or not this also leads to an increase in total shots made. There is an argument to be made that if you are being more efficient shooting the ball, you may not need to take as many shots. However, rather unsurprisingly, when teams shoot more efficiently from the field, they tend to keep shooting according to the graph. While the visual evidence is rather compelling, we have to ask how strong the correlation is?
cor_pair2 <- cor(new_cols$fg_pct_home, new_cols$fgm_home, use = "complete.obs")
cor_pair2
## [1] 0.7448839
The correlation of nearly 0.75 indicates a very strong positive correlation. This is not a surprise considering both conventional wisdom (the better you shoot the more tempted the team is to keep shooting), but also the graph above.
fgm <- new_cols$fgm_home
n2 <- length(fgm)
mean2 <- mean(fgm, na.rm = TRUE)
sd2 <- sd(fgm, na.rm = TRUE)
se2 <- sd2 / sqrt(n2)
tcrit2 <- qt(0.975, df = n2 - 1)
lower2 <- mean2 - tcrit2 * se2
upper2 <- mean2 + tcrit2 * se2
c(lower2, upper2)
## [1] 38.96476 39.07304
We are 95% confident that the true average field goals made per home team in a game is between 38.96476 and 39.07304. This number is representative of the mean shooting output for all games in the data set, in other words it is the average field goals made per home team in a game in the three point era. However, I would be willing to guess that if we restricted that number to more recent years (e.g. after 2000), that number would inflate. But in the scope of this analysis, that number does make sense since basketball in the 1980s and 1990s was a slower paced game and offense was more limited. In the modern game, defense is not as strong and these numbers would likely grow.