Final_Project

library(httr)
library(dplyr)

library(readr)
library(skimr)
library(ggplot2)
library(tidyverse)
library(magrittr)
library(lubridate)
library(stringr)
library(knitr)


harden_games <- read_csv(
  "https://myxavier-my.sharepoint.com/:x:/g/personal/tullise_xavier_edu/IQBfZsTxRpALRqb-wDviwwTvAb-tZAJ6GV6oet0ngdPv7oM?download=1")
Rows: 1205 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Team, Opp
dbl  (8): Season_End_Year, PTS, TRB, AST, TOV, FG_percent, ThreeP_percent, H...
date (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
harden_playoff_games <- read_csv(
  "https://myxavier-my.sharepoint.com/:x:/g/personal/tullise_xavier_edu/IQAO3SlbFTrHRYmGTTi8W7RaAdCJDC0ys-Me_N61a_N1riY?download=1")
Rows: 199 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Team, Opp
dbl  (7): PTS, TRB, AST, TOV, FG_percent, ThreeP_percent, Harden_Score
date (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Introduction
# For this project, I wanted to find James Harden's best statistical season so far in his
# NBA career and compare his playoff stats to his regular-season stats. Instead of looking
# at points, rebounds, and assists separately, I created a new variable called the Harden
# score, which adds up those three categories and then subtracts turnovers. This gives me
# a single number for ranking his seasons. I wanted to do this because I am a fan of the
# Cleveland Cavaliers, and Harden is currently playing for them (and causing them to lose
# playoff games). I am interested in seeing whether he is actually worse in the playoffs
# or whether I am just scapegoating him because of his reputation as a bad playoff player.

# This means my research question is: What is James Harden's best statistical season based
# on his Harden score, and how does it compare to his playoff performances?

# The data come from Harden's regular-season and playoff game log pages on Basketball
# Reference. They include his points, assists, rebounds, turnovers, team, opposing team,
# field goal percentage, three-point percentage, the year the season ended, and the date
# each game was played.

# Here are links to my data sources:
# regular season: https://www.basketball-reference.com/players/h/hardeja01.html 
# playoffs: https://www.basketball-reference.com/players/h/hardeja01/gamelog-playoffs/ 


# Data Dictionary:
# PTS - Points
# TRB - Rebounds
# AST - Assists
# TOV - Turnovers
# FG_percent - Field Goal Percentage
# ThreeP_percent - Three Point Percentage
# Season_End_Year - Year the NBA Season Ended
# Date - Date of the Game
# Team - Harden's Team
# Opp - Opposing Team
# Harden_Score - Harden Score, computed as PTS + TRB + AST - TOV
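The Harden Score definition in the data dictionary can be illustrated with a minimal sketch. The three-game log below is hypothetical (the numbers are made up, and `toy_games` is not part of the project data); it only shows how the score follows from the four box-score columns and how a row with missing stats stays missing.

```r
library(dplyr)

# Hypothetical three-game log; the values are invented for illustration.
# The third row stands in for a game with no recorded stats.
toy_games <- data.frame(
  PTS = c(30, 22, NA),
  TRB = c(6, 8, NA),
  AST = c(10, 7, NA),
  TOV = c(4, 5, NA)
)

# Harden Score = points + rebounds + assists - turnovers.
toy_games <- toy_games %>%
  mutate(Harden_Score = PTS + TRB + AST - TOV)

toy_games
```

Any row with an NA in one of the four inputs gets an NA score, which is why the later summaries use `na.rm = TRUE`.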
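# Data Cleaning: This collapses the regular-season game log into per-season averages.
# Games = n() counts every row in the log, including rows with NA stats (such as games
# Harden missed), while the averages use na.rm = TRUE and cover only games with stats.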
harden_seasons <- 
  harden_games %>%
  filter(str_detect(Date, "[0-9]{4}-[0-9]{2}-[0-9]{2}")) %>%
  group_by(Season_End_Year) %>%
  summarise(
    Games = n(),
    Avg_PTS = mean(PTS, na.rm = TRUE),
    Avg_TRB = mean(TRB, na.rm = TRUE),
    Avg_AST = mean(AST, na.rm = TRUE),
    Avg_TOV = mean(TOV, na.rm = TRUE),
    Avg_FG_percent = mean(FG_percent, na.rm = TRUE),
    Avg_ThreeP_percent = mean(ThreeP_percent, na.rm = TRUE),
    Avg_Harden_Score = mean(Harden_Score, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(Avg_Harden_Score))

harden_seasons
# A tibble: 15 × 9
   Season_End_Year Games Avg_PTS Avg_TRB Avg_AST Avg_TOV Avg_FG_percent
             <dbl> <int>   <dbl>   <dbl>   <dbl>   <dbl>          <dbl>
 1            2019    82   36.1     6.64    7.51    4.96          0.447
 2            2020    72   34.3     6.56    7.53    4.53          0.444
 3            2017    82   29.1     8.14   11.2     5.73          0.443
 4            2018    82   30.4     5.40    8.75    4.38          0.448
 5            2021    68   24.6     7.91   10.8     4.02          0.460
 6            2016    82   29.0     6.11    7.46    4.56          0.436
 7            2015    82   27.4     5.67    6.98    3.96          0.439
 8            2022    83   22.0     7.69   10.3     4.37          0.412
 9            2023    82   21.0     6.10   10.7     3.36          0.444
10            2013    82   25.9     4.86    5.83    3.78          0.442
11            2014    82   25.4     4.71    6.11    3.63          0.445
12            2024    81   16.6     5.12    8.53    2.57          0.420
13            2012    66   16.8     4.06    3.69    2.21          0.485
14            2011    82   12.2     3.11    2.15    1.29          0.431
15            2010    82    9.91    3.21    1.80    1.39          0.407
# ℹ 2 more variables: Avg_ThreeP_percent <dbl>, Avg_Harden_Score <dbl>
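Because `n()` counts every row in a group, a season's `Games` value includes rows whose stats are missing. A minimal sketch of the difference, using a hypothetical four-row log (`toy_log` and its values are assumptions, not project data):

```r
library(dplyr)

# Hypothetical four-row game log; the last row stands in for a missed game.
toy_log <- data.frame(
  Season_End_Year = c(2019, 2019, 2019, 2019),
  PTS = c(30, 40, 35, NA)
)

toy_summary <- toy_log %>%
  group_by(Season_End_Year) %>%
  summarise(
    Rows = n(),                         # every row, played or not
    Games_Played = sum(!is.na(PTS)),    # only rows with recorded points
    Avg_PTS = mean(PTS, na.rm = TRUE),  # average over rows with recorded points
    .groups = "drop"
  )

toy_summary
```

Counting with `sum(!is.na(PTS))` is one possible way to report games actually played alongside the averages.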
harden_playoff_seasons <- 
  harden_playoff_games %>%
  filter(
    str_detect(Date, "[0-9]{4}-[0-9]{2}-[0-9]{2}"),
    !is.na(PTS),
    !is.na(TRB),
    !is.na(AST),
    !is.na(TOV)
  ) %>%
  mutate(
    Season_End_Year = year(as.Date(Date))
  ) %>%
  group_by(Season_End_Year) %>%
  summarise(
    Games = n(),
    Avg_PTS = mean(PTS, na.rm = TRUE),
    Avg_TRB = mean(TRB, na.rm = TRUE),
    Avg_AST = mean(AST, na.rm = TRUE),
    Avg_TOV = mean(TOV, na.rm = TRUE),
    Avg_FG_percent = mean(FG_percent, na.rm = TRUE),
    Avg_ThreeP_percent = mean(ThreeP_percent, na.rm = TRUE),
    Avg_Harden_Score = mean(Harden_Score, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(Avg_Harden_Score))

harden_playoff_seasons
# A tibble: 17 × 9
   Season_End_Year Games Avg_PTS Avg_TRB Avg_AST Avg_TOV Avg_FG_percent
             <dbl> <int>   <dbl>   <dbl>   <dbl>   <dbl>          <dbl>
 1            2019    11   31.6     6.82    6.64    4.64          0.417
 2            2020    12   29.6     5.58    7.67    3.75          0.467
 3            2017    11   28.5     5.45    8.45    5.36          0.397
 4            2018    17   28.6     5.24    6.82    3.82          0.404
 5            2015    17   27.2     5.71    7.53    4.53          0.423
 6            2016     5   26.6     5.2     7.6     5.2           0.398
 7            2014     6   26.8     4.67    5.83    3.5           0.389
 8            2013     6   26.3     6.67    4.5     4.5           0.396
 9            2021     9   20.2     6.33    8.56    2.89          0.463
10            2023    11   20.3     6.18    8.27    3.18          0.365
11            2024     6   21.2     4.5     8       2.33          0.437
12            2025     7   18.7     5.43    9.14    3             0.411
13            2022    12   18.6     5.67    8.58    4.17          0.407
14            2026     9   19.6     5.67    5.89    5.22          0.418
15            2012    20   16.3     5.1     3.4     2.1           0.424
16            2011    17   13       5.35    3.59    1.65          0.462
17            2010     6    7.67    2.5     1.83    0.5           0.301
# ℹ 2 more variables: Avg_ThreeP_percent <dbl>, Avg_Harden_Score <dbl>
# These are some summary statistics for James Harden's averages by season across his regular
# season and playoff games.
# Here are Harden's top 5 regular seasons by Harden score:
harden_seasons %>%
  slice_max(Avg_Harden_Score, n = 5) %>%
  kable()
| Season_End_Year | Games | Avg_PTS | Avg_TRB | Avg_AST | Avg_TOV | Avg_FG_percent | Avg_ThreeP_percent | Avg_Harden_Score |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 2019 | 82 | 36.12821 | 6.641026 | 7.512821 | 4.961538 | 0.4471667 | 0.3727821 | 45.32051 |
| 2020 | 72 | 34.33824 | 6.558823 | 7.529412 | 4.529412 | 0.4442059 | 0.3558235 | 43.89706 |
| 2017 | 82 | 29.08642 | 8.135803 | 11.197531 | 5.728395 | 0.4432716 | 0.3390988 | 42.69136 |
| 2018 | 82 | 30.43056 | 5.402778 | 8.750000 | 4.375000 | 0.4476389 | 0.3653750 | 40.20833 |
| 2021 | 68 | 24.61364 | 7.909091 | 10.795454 | 4.022727 | 0.4599773 | 0.3383409 | 39.29545 |
# Visualization 1: Average Harden score across all of his regular seasons:
ggplot(harden_seasons, aes(x = Season_End_Year, y = Avg_Harden_Score)) +
  geom_col() +
  labs(
    title = "James Harden's Average Harden Score by Regular Season",
    x = "Season Ending Year",
    y = "Average Harden Score"
  )

# Harden's Harden score peaked from 2017 to 2020, with all four of these seasons averaging
# a score greater than 40.
# Visualization 2 - This shows Harden's average points per game by season of his career
ggplot(harden_seasons, aes(x = Season_End_Year, y = Avg_PTS)) +
  geom_line() +
  geom_point() +
  labs(
    title = "James Harden's Average Points by Season",
    x = "Season Ending Year",
    y = "Average Points"
  )

# Harden's points per game were highest in 2019 and 2020. His 2017 season still ranks
# third by Harden score because he averaged more assists and rebounds that year than in
# any other season.
# Visualization 3 - This shows Harden's average assists per game by season of his career
ggplot(harden_seasons, aes(x = Season_End_Year, y = Avg_AST)) +
  geom_line() +
  geom_point() +
  labs(
    title = "James Harden's Average Assists by Season",
    x = "Season Ending Year",
    y = "Average Assists"
  )

# Harden's assists peaked in 2017, which is why his Harden score for that year is so
# high even though he averaged fewer points than in 2019 and 2020.
# Visualization 4 - This shows Harden's average assists and turnovers side by side for
# each regular season.
assist_turnover_summary <- 
  harden_seasons %>%
  
  select(
    Season_End_Year,
    Avg_AST,
    Avg_TOV
  ) %>%
  
  pivot_longer(
    cols = c(Avg_AST, Avg_TOV),
    names_to = "Stat",
    values_to = "Average"
  )

ggplot(assist_turnover_summary,
       aes(
         x = factor(Season_End_Year),
         y = Average,
         fill = Stat
       )) +
  
  geom_col(position = "dodge") +
  
  labs(
    title = "James Harden Average Assists vs. Turnovers by Season",
    x = "Season Ending Year",
    y = "Average Per Game",
    fill = "Statistic"
  ) +
  
  scale_fill_discrete(
    labels = c("Assists", "Turnovers")
  ) +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Harden's turnover numbers track his assist numbers, which makes sense. His best passing
# season (2017) is also his worst turnover season. However, as he ages, he appears to be
# turning the ball over less.
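The chart compares assists and turnovers side by side; an actual assist-to-turnover ratio could be computed by dividing the two season averages. A minimal sketch, assuming the rounded averages for three seasons as printed in the regular-season table above (`ast_tov` and `AST_TOV_Ratio` are names introduced here for illustration):

```r
library(dplyr)

# Season averages copied (rounded) from the regular-season table above.
ast_tov <- data.frame(
  Season_End_Year = c(2017, 2019, 2024),
  Avg_AST = c(11.2, 7.51, 8.53),
  Avg_TOV = c(5.73, 4.96, 2.57)
)

# Ratio of season averages; higher means more assists per turnover.
ast_tov_ratio <- ast_tov %>%
  mutate(AST_TOV_Ratio = Avg_AST / Avg_TOV) %>%
  arrange(desc(AST_TOV_Ratio))

ast_tov_ratio
```

This is a ratio of season averages, not an average of per-game ratios; the two can differ, so the choice should be stated when reporting it.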
# Visualization 5 - This shows the relationship between points scored and Harden score in
# each regular-season game.
ggplot(harden_games, aes(x = PTS, y = Harden_Score)) +
  geom_point(alpha = 0.4) +
  labs(
    title = "Relationship Between Points and Harden Score",
    x = "Points",
    y = "Harden Score"
  )
Warning: Removed 118 rows containing missing values or values outside the scale range
(`geom_point()`).

# This visualization helps identify games in which Harden had more turnovers than usual
# or fewer assists and rebounds than usual: for a given point total, those games sit
# lower on the plot.
# Combining regular season and playoff stats for further comparison.

regular_clean <- 
  harden_games %>%
  filter(!is.na(PTS), !is.na(TRB), !is.na(AST), !is.na(TOV)) %>%
  mutate(Game_Type = "Regular Season")

playoff_clean <- 
  harden_playoff_games %>%
  filter(!is.na(PTS), !is.na(TRB), !is.na(AST), !is.na(TOV)) %>%
  mutate(Game_Type = "Playoffs")

combined_games <- 
  bind_rows(regular_clean, playoff_clean)
# Visualization 6 - Average Harden Score by season (Playoffs vs regular season)
regular_summary <- 
  harden_seasons %>%
  mutate(Game_Type = "Regular Season") %>%
  select(
    Season_End_Year,
    Game_Type,
    Avg_Harden_Score
  )

playoff_summary <- 
  harden_playoff_seasons %>%
  mutate(Game_Type = "Playoffs") %>%
  select(
    Season_End_Year,
    Game_Type,
    Avg_Harden_Score
  )

combined_season_summary <- 
  bind_rows(regular_summary, playoff_summary)

ggplot(combined_season_summary,
       aes(
         x = interaction(Season_End_Year, Game_Type),
         y = Avg_Harden_Score,
         fill = Game_Type
       )) +
  
  geom_col() +
  
  labs(
    title = "James Harden Average Harden Score by Season and Game Type",
    x = "Season and Game Type",
    y = "Average Harden Score",
    fill = "Game Type"
  ) +
  
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1)
  )

# This shows that, in most seasons, Harden's average Harden score is lower in the playoffs
# than in the regular season.
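The playoff and regular-season averages can also be compared season by season with a join. The sketch below is an illustration under stated assumptions: it uses the rounded averages for three seasons as printed in the two summary tables above, and it rebuilds the score from those averages (which matches the average Harden score only when all four stats are present for the same games). The helper `add_score` and the column `Playoff_Minus_Regular` are hypothetical names.

```r
library(dplyr)

# Rounded season averages copied from the printed summary tables above.
regular <- data.frame(
  Season_End_Year = c(2014, 2019, 2024),
  Avg_PTS = c(25.4, 36.1, 16.6),
  Avg_TRB = c(4.71, 6.64, 5.12),
  Avg_AST = c(6.11, 7.51, 8.53),
  Avg_TOV = c(3.63, 4.96, 2.57)
)
playoffs <- data.frame(
  Season_End_Year = c(2014, 2019, 2024),
  Avg_PTS = c(26.8, 31.6, 21.2),
  Avg_TRB = c(4.67, 6.82, 4.5),
  Avg_AST = c(5.83, 6.64, 8),
  Avg_TOV = c(3.5, 4.64, 2.33)
)

# Hypothetical helper: rebuild the score from the component averages.
add_score <- function(df) {
  df %>%
    mutate(Score = Avg_PTS + Avg_TRB + Avg_AST - Avg_TOV) %>%
    select(Season_End_Year, Score)
}

# Join on season and take playoff minus regular; negative means a playoff drop.
comparison <- inner_join(
  add_score(regular), add_score(playoffs),
  by = "Season_End_Year", suffix = c("_regular", "_playoffs")
) %>%
  mutate(Playoff_Minus_Regular = Score_playoffs - Score_regular)

comparison
```

With the full `harden_seasons` and `harden_playoff_seasons` tables, the same join on `Avg_Harden_Score` would give the difference for every season that appears in both.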
# Conclusion: By looking at these visualizations, it can be concluded that Harden's
# best statistical season was 2019. This makes sense, as this was also the year he
# averaged the most points in his career. His worst seasons were early in his career, in
# 2010 and 2011. This also makes sense because these were the seasons when he was
# establishing himself in the NBA, so he wasn't playing as much. When it comes to the
# playoffs, Harden's Harden score was lower than in the corresponding regular season in
# 2010 and in every postseason from 2015 through 2023. The averages in the tables above
# show the opposite in 2011 through 2014 and in 2024, when his playoff numbers were
# higher. So all in all, Harden does statistically play worse in the playoffs than in the
# regular season in most years, including every year of his 2017-2020 peak.
