Introduction

For this assignment, I analyzed an NBA free throw data set covering ten seasons, from 2006-07 through 2015-16. I explored overall free throw percentages, year-to-year trends, and the effect free throws have on team success. Using line graphs, scatter plots, and a density plot, I identified shifts in performance and examined how free throw success changed over the decade. These observations shed light on player efficiency and hint at broader game strategies that continue to shape the sport.

Reworking the dataset and creating new variables for descriptive stats

setwd("/Users/jonahgoodman/Desktop/IS460")

file.path(getwd(), "Data/free_throws.csv")
## [1] "/Users/jonahgoodman/Desktop/IS460/Data/free_throws.csv"
suppressMessages(library(tidyr))
suppressMessages(library(ggplot2))
suppressMessages(library(data.table))
suppressMessages(library(dplyr))
suppressMessages(library(lubridate))
suppressMessages(library(httr))
suppressMessages(library(DescTools))
suppressMessages(library(plotly))

file1 <- "Data/free_throws.csv"

df <- fread(file1)
ColsToDrop <- c("play")
df <- as.data.frame(df)
df <- df[, !names(df) %in% ColsToDrop]
colnames(df)
##  [1] "end_result" "game"       "game_id"    "period"     "player"    
##  [6] "playoffs"   "score"      "season"     "shot_made"  "time"
df_free_throws_attempted <- df %>%
  group_by( playoffs, game_id, season) %>%
  summarise(free_throw_attempts_per_game= n(), .groups = "drop") 
  
df_free_throws_made <- df %>%
  filter(shot_made == 1) %>%
  group_by(playoffs, game_id, season) %>%
  summarise(free_throws_made_per_game = n(), .groups = "drop")

df_free_throw <- df_free_throws_attempted %>% 
  select(playoffs, game_id, season, free_throw_attempts_per_game) %>%
  left_join(df_free_throws_made %>% select(game_id, season, 
            free_throws_made_per_game), by  = c("game_id", "season"))%>%
  data.frame()  

df_free_throw <- df_free_throw %>%
  mutate(free_throw_percentage = ifelse(free_throw_attempts_per_game > 0,
                                        (free_throws_made_per_game / 
                                        free_throw_attempts_per_game)*100, NA)) %>%
  data.frame()


df_seasonal_avg <- df_free_throw %>%
  group_by(season, playoffs) %>%
  summarise(avg_attempts = mean(free_throw_attempts_per_game, na.rm = TRUE),
            avg_made = mean(free_throws_made_per_game, na.rm = TRUE),
            avg_percentage = mean(free_throw_percentage, na.rm = TRUE),
            .groups = "drop")  # return an ungrouped result, as in the summaries above

player_summary <- df %>%
  group_by(player) %>%
  summarise(
    attempts = n(),
    made = sum(shot_made),
    percentage = made / attempts
  ) %>%
  filter(attempts > 200) %>%  # Filter for players with a minimum number of attempts
  arrange(desc(percentage))

summary(df_free_throw)
##    playoffs            game_id             season         
##  Length:12874       Min.   :261031013   Length:12874      
##  Class :character   1st Qu.:290119029   Class :character  
##  Mode  :character   Median :310410028   Mode  :character  
##                     Mean   :336085907                     
##                     3rd Qu.:400489596                     
##                     Max.   :400878160                     
##  free_throw_attempts_per_game free_throws_made_per_game free_throw_percentage
##  Min.   :  9.00               Min.   : 6.00             Min.   : 37.21       
##  1st Qu.: 40.00               1st Qu.:30.00             1st Qu.: 71.15       
##  Median : 47.00               Median :36.00             Median : 75.93       
##  Mean   : 48.01               Mean   :36.33             Mean   : 75.68       
##  3rd Qu.: 56.00               3rd Qu.:42.00             3rd Qu.: 80.56       
##  Max.   :113.00               Max.   :92.00             Max.   :100.00
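
The per-game attempts, makes, and percentage can also be computed in one grouped pass instead of two summaries and a join. The sketch below uses base R on a small made-up data frame (`toy_shots`, `per_game`, and all values are hypothetical, not taken from free_throws.csv). One side effect of this approach is that a game with zero makes gets a percentage of 0 where a join would leave NA.

```r
# Hypothetical toy data with the same columns used above (values are made up)
toy_shots <- data.frame(
  game_id   = c(1, 1, 1, 1, 2, 2, 2),
  playoffs  = c("regular", "regular", "regular", "regular",
                "playoffs", "playoffs", "playoffs"),
  season    = "2006 - 2007",
  shot_made = c(1, 0, 1, 1, 0, 0, 0)
)

# One row per game: attempts = number of rows, made = sum of the 0/1 flag
per_game <- aggregate(
  shot_made ~ game_id + playoffs + season,
  data = toy_shots,
  FUN  = function(x) c(attempts = length(x), made = sum(x))
)

# aggregate() returns a matrix column here; flatten it into regular columns
per_game <- do.call(data.frame, per_game)
names(per_game)[4:5] <- c("attempts", "made")

per_game$percentage <- 100 * per_game$made / per_game$attempts
print(per_game)
```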

Visualization 1:

#1
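# Change the season to factor so seasons are ordered correctly
df_seasonal_avg$season <- factor(df_seasonal_avg$season, levels = unique(df_seasonal_avg$season))

# df_seasonal_avg holds two rows per season (regular and playoffs), which would
# zigzag with group = 1, so average over all games in each season for this plot
df_season_overall <- df_free_throw %>%
  group_by(season) %>%
  summarise(avg_attempts = mean(free_throw_attempts_per_game, na.rm = TRUE),
            avg_made = mean(free_throws_made_per_game, na.rm = TRUE),
            avg_percentage = mean(free_throw_percentage, na.rm = TRUE),
            .groups = "drop")
max_attempts <- max(df_season_overall$avg_attempts)

ggplot(df_season_overall, aes(x = season)) +
  geom_line(aes(y = avg_attempts, group = 1, color = "Attempts"), size = 1) +
  geom_line(aes(y = avg_made, group = 1, color = "Made"), size = 1) +
  # Rescale the percentage onto the attempts axis; the secondary axis undoes this
  geom_line(aes(y = avg_percentage * max_attempts / 100, 
                group = 1, color = "Percentage (scaled)"), size = 1) +
  scale_y_continuous(sec.axis = sec_axis(~ . * 100 / max_attempts, 
                                         name = "Free Throw Percentage")) +
  labs(title = "Free Throw Stats Over Seasons (2006-2016)",
       x = "Season",
       y = "Average Free Throws") +
  scale_color_manual(values = c("Attempts" = "blue", "Made" = "green", "Percentage (scaled)" = "red")) +
  theme_minimal()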

Explanation 1:

For this first graph, I created a standard line graph of free throw attempts, made free throws, and free throw percentage. I wanted to see whether free throw attempts changed over time and how that affected made free throws as well as the percentage. The graph shows that free throw attempts per game declined from 2006 to 2015, while free throw percentage stayed nearly flat and even rose slightly over the 10 years. This suggests that the number of free throws taken has little effect on the percentage a team or player shoots over a season. With this in mind, teams should try to drive to the basket and draw fouls more than they currently do. As long as the free throw percentage holds steady, more shots in the paint lead to more free throws, which are among the most efficient ways to score.
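
One rough way to put a number on free throw efficiency: if each attempt is made with probability p, a two-shot trip to the line is worth 2p points on average. The sketch below plugs in the mean per-game percentage from the summary table (75.68%). The two-shot trip and the single shared make probability are simplifying assumptions, and a comparison against field goals would need shot data that this data set does not contain.

```r
# Mean per-game free throw percentage from summary(df_free_throw) above
p_make <- 75.68 / 100

# Expected points from a two-shot trip, assuming every attempt has the
# same make probability
shots_per_trip  <- 2
expected_points <- shots_per_trip * p_make
print(expected_points)  # 1.5136
```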

Visualization 2:

#2
ggplot(df_free_throw, aes(x = free_throw_attempts_per_game, y = free_throw_percentage)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +  # Trend line
  theme_minimal() +
  labs(
    title = "Free Throw Percentage vs. Free Throw Attempts Per Game",
    x = "Free Throw Attempts Per Game",
    y = "Free Throw Percentage")

Explanation 2:

Each dot on this scatter plot represents one game, positioned by its number of free throw attempts and its free throw percentage. I included this graph because it gives another perspective on the point from the previous graph. The regression line is nearly horizontal, showing that free throw percentage barely changes as the number of attempts in a game increases. This supports the idea that taking more free throws would not hurt a team's free throw percentage and would lead to more points scored at an efficient rate.
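
How flat the trend line is can be checked numerically by fitting the same linear model that `geom_smooth(method = "lm")` draws and reading off the slope. The sketch below runs on simulated games (`sim_games` and its values are hypothetical, generated so that percentage does not depend on attempts), so it only illustrates the mechanics and says nothing about the real data.

```r
# Simulated games (hypothetical values, not from free_throws.csv)
set.seed(460)
sim_games <- data.frame(attempts = sample(20:80, 200, replace = TRUE))
# Percentage is drawn independently of attempts, so the true slope is 0
sim_games$percentage <- 75 + rnorm(200, mean = 0, sd = 6)

# Same model as the red trend line: percentage as a linear function of attempts
fit <- lm(percentage ~ attempts, data = sim_games)
slope    <- unname(coef(fit)["attempts"])
slope_ci <- confint(fit)["attempts", ]

print(slope)     # change in percentage points per extra attempt
print(slope_ci)  # 95% confidence interval for the slope
```

With the real data, the equivalent call would be `lm(free_throw_percentage ~ free_throw_attempts_per_game, data = df_free_throw)`; a slope whose confidence interval sits close to zero would back up the reading of the plot.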

Visualization 3:

#3
# Change season to factor so the seasons are ordered correctly
df_seasonal_avg$season <- factor(df_seasonal_avg$season, levels = unique(df_seasonal_avg$season))

# Reshape data for plotting 
df_long <- df_seasonal_avg %>%
  select(season, playoffs, avg_attempts, avg_made, avg_percentage) %>%
  pivot_longer(cols = c(avg_attempts, avg_made, avg_percentage), 
               names_to = "metric", 
               values_to = "value")

ggplot(df_long, aes(x = season, y = value, color = interaction(playoffs, metric), group = interaction(playoffs, metric))) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme_minimal() +
  scale_y_continuous(
    name = "Average Free Throws (Attempts & Made)",
    sec.axis = sec_axis(~ ., name = "Free Throw Percentage (%)")  # Secondary axis for percentage
  ) +
  labs(
    title = "Free Throw Attempts, Made, & Percentage: Regular vs. Playoffs",
    x = "Season"
  ) +
  scale_color_manual(values = c(
    "regular.avg_attempts" = "blue",
    "playoffs.avg_attempts" = "red",
    "regular.avg_made" = "green",
    "playoffs.avg_made" = "orange",
    "regular.avg_percentage" = "purple",
    "playoffs.avg_percentage" = "brown"
  )) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(color = "Legend")

Explanation 3:

For this graph, I wanted to look at the playoffs and how the stats change from the regular season to the postseason. In the playoffs, makes and attempts per game are both up compared with the regular season. However, the free throw percentage in the playoffs is nearly identical to the regular season percentage. This suggests that players neither tighten up nor excel at the line in the playoffs; overall, they shoot the same in the playoffs as they do in the regular season. With this in mind, it is even more important for teams to get to the line in the playoffs. Scoring tends to drop in the playoffs as teams tighten up their defenses, making points harder to come by. The free throw line, however, is the one place where a shooter cannot be defended, which helps explain why the percentage is the same in the playoffs and the regular season. So teams should take more free throws in the playoffs, where points are at an even greater premium than in the regular season.
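
Whether two make rates are "nearly identical" can be checked with a two-sample test of proportions. The sketch below uses `prop.test()` from base R on made-up totals (the counts in `made` and `attempts` are hypothetical, not computed from the data). It treats every free throw as independent, which is a simplification because the same players take many of the shots.

```r
# Hypothetical totals (made up for illustration, not computed from the data)
made     <- c(regular = 7570, playoffs = 752)
attempts <- c(regular = 10000, playoffs = 1000)

# Two-sample test for equality of proportions
result <- prop.test(x = made, n = attempts)
print(result$estimate)  # make rate in each group
print(result$p.value)   # a large p-value means no detectable difference
```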

Visualization 4a:

#4a
ggplot(player_summary, aes(x = attempts, y = percentage)) +
  geom_point(aes(color = percentage), size = 3) +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +  # Trend line
  labs(title = "Free Throw Performers: Percentage vs Attempts",
       x = "Attempts", y = "Free Throw Percentage") +
  scale_color_gradient(low = "red", high = "darkgreen")

Visualization 4b:

#4b
p <- ggplot(player_summary, aes(x = attempts, y = percentage, text = player)) +
  geom_point(aes(color = percentage), size = 1) +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +  # Trend line
  labs(title = "Free Throw Performers: Percentage vs Attempts",
       x = "Attempts", y = "Free Throw Percentage") +
  scale_color_gradient(low = "red", high = "darkgreen")

ggplotly(p, tooltip = "text")

Explanation 4:

I wanted to look deeper into individual player stats. I created a new data frame called player_summary and plotted each player's free throw percentage against their attempts. (There are two plots because I could not show the regression line while also being able to hover over a data point and see the player's name, so I created two graphs to show both the names and the regression line clearly.) I wanted to identify players who should be more or less aggressive about getting to the free throw line. The regression line serves as the threshold for identifying these players. Players who are significantly below the line should try harder to avoid the free throw line, while players who are significantly above the line should try to draw more fouls. An example is Ben Gordon, who shoots 86% from the free throw line on under 2000 attempts. He is well above the average, meaning he should prioritize getting to the line more often, since his numbers indicate he is more effective than a majority of the players in the league. On the other side, Josh Smith shoots 62% from the line (13 percentage points below the average) on over 3500 attempts. This is not an effective shot for him, so he should try to alter his game so that he takes fewer free throws and prioritizes shots where he is less likely to be fouled.
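
The "above or below the line" comparison can be made explicit by computing each player's residual from the fitted regression, which is the vertical distance between the player's percentage and the line. The sketch below uses made-up players (`toy_players` and its values are hypothetical placeholders, not rows from player_summary).

```r
# Hypothetical players (values are made up for illustration)
toy_players <- data.frame(
  player     = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  attempts   = c(500, 1200, 1900, 2600, 3500),
  percentage = c(0.78, 0.70, 0.86, 0.74, 0.62)
)

# Same model as the blue trend line in the scatter plots
fit <- lm(percentage ~ attempts, data = toy_players)

# Positive residual = above the line, negative residual = below the line
toy_players$residual <- residuals(fit)

# Sort so the players furthest above the line come first
ranked <- toy_players[order(-toy_players$residual), c("player", "residual")]
print(ranked)
```

Applied to player_summary, the same steps would give a ranked list of players by how far they sit from the trend line, instead of reading positions off the chart by eye.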

Visualization 5:

#5
df$score_diff <- abs(as.numeric(sub(" -.*", "", df$end_result)) - as.numeric(sub(".*- ", "", df$end_result)))
close_games <- df[df$score_diff <= 5, ]

ggplot(close_games, aes(x = shot_made, fill = playoffs)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = mean(close_games$shot_made), color = "red", linetype = "dashed") +
  labs(title = "Free Throw Performance in Close Games",
       subtitle = "Games Decided by 5 Points or Less",
       x = "Shot Made (0 = miss, 1 = make)",
       y = "Density",
       fill = "Game Type") +
  theme_minimal()

Explanation 5:

Lastly, I wanted to learn more about free throw performance in close games, especially the difference between the regular season and the playoffs. This density plot covers free throws in close games (decided by 5 points or fewer), comparing the regular season (blue) and the playoffs (red). The plotted variable is shot_made, which is 0 for a miss and 1 for a make, so the bimodal shape with peaks at 0 and 1 reflects the two possible outcomes of a single shot, not players making all or none of their free throws. The mean (red dashed line) is around 75%, which matches the overall free throw percentage in the data set. So even in games that end up close, the average is about the same as in any other game. This may hide differences between players, since some may raise their percentage in crunch time while others lower theirs, and the two effects would even out. The playoff curve is more spread out, with tails at both ends that are nearly twice as long as in the regular season. Because the default density bandwidth widens as the number of observations falls, and there are far fewer playoff free throws, this extra spread is most likely a smoothing effect and not evidence of more variance in playoff shooting. Measuring variance directly would require computing a free throw percentage per player or per game first. Even so, the analysis points to the value of choosing reliable shooters in close contests.
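
To study how free throw percentages vary in close games, the shot-level rows first need to be collapsed to one percentage per player per game. The sketch below does that on made-up rows (`toy_close` and its values are hypothetical). It then compares the default density bandwidth for a large and a small sample with the same make rate, to show how sample size alone changes the spread of a density curve.

```r
# Hypothetical shot-level rows from close games (values are made up)
toy_close <- data.frame(
  game_id   = c(1, 1, 1, 1, 2, 2, 2, 2),
  player    = c("Player A", "Player A", "Player B", "Player B",
                "Player A", "Player A", "Player C", "Player C"),
  playoffs  = c(rep("regular", 4), rep("playoffs", 4)),
  shot_made = c(1, 1, 1, 0, 0, 0, 1, 1)
)

# One row per player per game: the mean of the 0/1 flag is that player's
# make rate in that game, which can take values between 0 and 1
player_game <- aggregate(shot_made ~ game_id + player + playoffs,
                         data = toy_close, FUN = mean)
names(player_game)[names(player_game) == "shot_made"] <- "ft_rate"
print(player_game)

# The default bandwidth rule (bw.nrd0) shrinks as the sample grows, so a
# smaller group gets a wider, smoother curve even at the same make rate
shots_large <- rep(c(0, 1), times = c(2500, 7500))
shots_small <- rep(c(0, 1), times = c(250, 750))
bandwidths  <- c(large = bw.nrd0(shots_large), small = bw.nrd0(shots_small))
print(bandwidths)
```

A density plot of `ft_rate` would then show a distribution of percentages, which is what a claim about variance between players needs.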

Conclusion

Working with a free throw data set gave me valuable practice with creating and manipulating data frames. By structuring the data effectively, I was able to filter and analyze specific game situations, such as free throw performance in close games, individual player performance, and playoff versus regular season performance. Through visualizations, patterns emerged that were not obvious in the raw data, such as how little free throw percentage changes between the regular season and the playoffs, or how little effect the number of free throw attempts has on free throw percentage. Experimenting with different graph types, from density plots to line plots, helped highlight key insights, like how free throw shooting in close games compares with the overall average. This process reinforced the importance of data-driven decision-making in sports analytics and deepened my understanding of how to translate statistical data into meaningful interpretations. Moving forward, refining and expanding these techniques could further my ability to uncover trends and make insightful observations.