James Harden Game Log Analysis

library(readr)

Warning: package 'readr' was built under R version 4.5.2

library(ggplot2)
library(tidyverse)

Warning: package 'lubridate' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ stringr   1.5.1
✔ forcats   1.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.5     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rvest)


Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

harden_games <- read_csv(
  "https://myxavier-my.sharepoint.com/:x:/g/personal/tullise_xavier_edu/IQBfZsTxRpALRqb-wDviwwTvAb-tZAJ6GV6oet0ngdPv7oM?download=1"
)

Rows: 1205 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Team, Opp
dbl  (8): Season_End_Year, PTS, TRB, AST, TOV, FG_percent, ThreeP_percent, H...
date (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# For this assignment, I wanted to find James Harden's best statistical season so far in
# his NBA career. Instead of looking at points, rebounds, and assists separately, I
# created a new per-game variable called the Harden score, which adds up those three
# categories and then subtracts turnovers. Averaging it by season lets me compare seasons
# on a single number. I chose this topic because I am a fan of the Cleveland Cavaliers, and
# Harden is currently playing for them (and causing them to lose playoff games) :(

# This means my research question is: What is James Harden's best statistical season based 
# on his Harden score?
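
The Harden score described above can be written out as a one-line calculation. This is a minimal sketch on a hypothetical three-game sample (the numbers are invented, not taken from the game log), assuming the score is computed per game as points plus rebounds plus assists minus turnovers:

```r
# Hypothetical three-game sample; the numbers are invented for illustration.
sample_games <- data.frame(
  PTS = c(30, 25, 40),
  TRB = c(5, 7, 6),
  AST = c(10, 8, 12),
  TOV = c(4, 3, 6)
)

# Harden score = points + rebounds + assists - turnovers, computed per game.
sample_games$Harden_Score <- with(sample_games, PTS + TRB + AST - TOV)

print(sample_games)
```

With this definition a turnover costs exactly as much as a point, rebound, or assist adds, so every category carries equal weight.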

# This data comes from Harden's Basketball Reference game log page.

# Data wrangling

harden_seasons <- 
  harden_games %>%
  
  filter(
    str_detect(Date, "[0-9]{4}-[0-9]{2}-[0-9]{2}")
  ) %>%
  
# This filter keeps only rows with a valid game date, which removes the season-total rows
# so they do not distort the per-game averages.
  
  group_by(Season_End_Year) %>%
  
  summarise(
    Games = n(),
    
    Avg_PTS = mean(PTS, na.rm = TRUE),
    Avg_TRB = mean(TRB, na.rm = TRUE),
    Avg_AST = mean(AST, na.rm = TRUE),
    Avg_TOV = mean(TOV, na.rm = TRUE),
    
    Avg_FG_percent = mean(FG_percent, na.rm = TRUE),
    Avg_ThreeP_percent = mean(ThreeP_percent, na.rm = TRUE),
    
    Avg_Harden_Score = mean(Harden_Score, na.rm = TRUE)
  ) %>%
  
  arrange(desc(Avg_Harden_Score))

# This sorts the season summaries in descending order by the average Harden
# score from each season
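
The grouping step can be checked on a small example. The sketch below uses a hypothetical five-row game log (invented values) with one row whose stats are missing, to show that `n()` counts every row while `na.rm = TRUE` leaves the missing row out of the average. `Rows` and `Games_Played` are illustrative column names that are not in the original summary.

```r
library(dplyr)

# Hypothetical game log: two seasons, one row with a missing score.
toy_games <- data.frame(
  Season_End_Year = c(2019, 2019, 2019, 2020, 2020),
  Harden_Score    = c(50, NA, 40, 30, 36)
)

toy_seasons <- toy_games %>%
  group_by(Season_End_Year) %>%
  summarise(
    Rows             = n(),                        # counts every row, including the NA one
    Games_Played     = sum(!is.na(Harden_Score)),  # counts only rows with a recorded score
    Avg_Harden_Score = mean(Harden_Score, na.rm = TRUE)
  ) %>%
  arrange(desc(Avg_Harden_Score))

print(toy_seasons)
```

If the rows with missing stats in the real data are games Harden did not play, counting non-missing scores would give games played rather than rows in the log.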

# visualization 1 - this shows Harden's average Harden score per game by season of his career
ggplot(harden_seasons, aes(x = Season_End_Year, y = Avg_Harden_Score)) +
  geom_col() +
  labs(
    title = "James Harden's Average Statistical Score by Season",
    x = "Season Ending Year",
    y = "Average Harden Score"
  )

# Harden's average Harden score peaked in the 2017-2020 seasons, with three of these four
# seasons having an average greater than 40.

# visualization 2 - This shows Harden's average points per game by season of his career
ggplot(harden_seasons, aes(x = Season_End_Year, y = Avg_PTS)) +
  geom_line() +
  geom_point() +
  labs(
    title = "James Harden's Average Points by Season",
    x = "Season Ending Year",
    y = "Average Points"
  )

# Harden's points per game were at their highest in 2019 and 2020. For 2017 to still rank
# among the highest Harden scores of his career, he must have made up the difference with
# more assists and rebounds (or fewer turnovers) than usual that season.

# visualization 3 - This shows Harden's average assists per game by season of his career
ggplot(harden_seasons, aes(x = Season_End_Year, y = Avg_AST)) +
  geom_line() +
  geom_point() +
  labs(
    title = "James Harden's Average Assists by Season",
    x = "Season Ending Year",
    y = "Average Assists"
  )

# Harden's assists peaked in 2017, which helps explain why his Harden score for that year is so
# high despite averaging fewer points.
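
Because the Harden score is a plain sum, a season's average score splits into the averages of its parts, which is what makes it possible to trace a high score back to assists or points. A minimal sketch with two hypothetical games (invented numbers), assuming every game has all four stats recorded:

```r
# Two hypothetical games; the numbers are invented for illustration.
toy <- data.frame(
  PTS = c(30, 20),
  TRB = c(8, 6),
  AST = c(11, 9),
  TOV = c(5, 3)
)
toy$Harden_Score <- with(toy, PTS + TRB + AST - TOV)

# Average of the per-game scores.
avg_score <- mean(toy$Harden_Score)

# Same number rebuilt from the component averages.
from_parts <- mean(toy$PTS) + mean(toy$TRB) + mean(toy$AST) - mean(toy$TOV)

print(c(avg_score = avg_score, from_parts = from_parts))
```

The two values agree here because no game is missing a stat. If some games had a missing value in only one column, the component averages would be taken over different sets of games and the match would no longer be exact.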

# visualization 4 - assists vs turnovers
ggplot(harden_games, aes(x = TOV, y = AST)) +
  geom_point(alpha = 0.4) +
  labs(
    title = "James Harden's Assists Compared to Turnovers",
    x = "Turnovers",
    y = "Assists"
  )

Warning: Removed 118 rows containing missing values or values outside the scale range
(`geom_point()`).

# Each point in this graph is a single game, plotting Harden's assists against his turnovers in that game.
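
One way to put a number on the pattern in this scatterplot is a correlation coefficient. This is a sketch on hypothetical per-game values (invented, with one missing row), not a result from the real game log. Setting `use = "complete.obs"` drops rows with missing values, much as `geom_point()` does in the warning above.

```r
# Hypothetical per-game turnovers and assists; one row is missing both values.
toy <- data.frame(
  TOV = c(2, 4, NA, 6, 3),
  AST = c(6, 9, NA, 13, 8)
)

# Pearson correlation using only the rows where both values are present.
r <- cor(toy$TOV, toy$AST, use = "complete.obs")

print(r)
```

Because assists and turnovers are whole numbers, many games land on the same point. `geom_count()` from ggplot2 is one option for showing how many games share each position.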

# visualization 5 - this shows the distribution of Harden's per-game Harden score by season
ggplot(harden_games, aes(x = factor(Season_End_Year), y = Harden_Score)) +
  geom_boxplot() +
  labs(
    title = "Distribution of James Harden's Per-Game Harden Scores by Season",
    x = "Season Ending Year",
    y = "Harden Score"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Warning: Removed 118 rows containing non-finite outside the scale range
(`stat_boxplot()`).

# Conclusion: Based on these visualizations, Harden's best statistical season by average
# Harden score was the season ending in 2019. This makes sense, as it was also the season
# in which he averaged the most points in his career. His worst seasons were 2010 and 2011,
# early in his career. This also makes sense because he was still establishing himself in
# the NBA then, so he wasn't playing as much.
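
The best and worst seasons can also be read off the summary table directly rather than from the plots. A minimal sketch, assuming a table shaped like `harden_seasons`; the three rows here are placeholders with invented values, not Harden's actual seasons:

```r
library(dplyr)

# Placeholder season summaries; years and scores are invented for illustration.
toy_seasons <- data.frame(
  Season_End_Year  = c(2001, 2002, 2003),
  Avg_Harden_Score = c(41, 39, 43)
)

# Season with the highest average score.
best_season <- toy_seasons %>% slice_max(Avg_Harden_Score, n = 1)

# Season with the lowest average score.
worst_season <- toy_seasons %>% slice_min(Avg_Harden_Score, n = 1)

print(best_season)
print(worst_season)
```

By default both functions keep tied rows, so a tie for best or worst season would show up as more than one row.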
