Assignment 2

Introduction/ Description

To help me determine who is the “best” kicker, the first thing I did was download all relevant data relating to kickers. I merged this data into one data frame and removed the exess data frames. From this I filtered it to only include “Feild Goals” and “Extra Points”. I did this to narrow down the results as those are a) the most impact full plays, and most directly measurable to examine kicker without outside factors. In order to only examine experienced kickers to avoid low data skewing results, I also filtered for only kickers who had been in multiple seasons. Additionally, I noticed that the “kick length” for Extra Points was NA, so I replaced it with 33 since all extra points take place on the 15 yard line (15 + 18 yards being the average for extra points length).

This gave me complete and filtered data only referring to the field goals and extra points I was interested in. Using this data, I was able to take quartiles for Good and Bad results. I did not want kickers scores to be unfairly skewed by very high probability kicks (such as from >10 yard lines with 99.9% success rates) or from very unlikely kicks (such as from 60 + yards that had very low probabilities) as this would penalize those who took risks. I then filtered the longest 10% of Fails and shortest 10% of successes out to balance this out and give data I felt was more representative of kickers true skills. Leaving in sucsesses that occurred outside of the 90% quartile and failures inside of the 10% quartile, properly marked down/up kickers who went against the averages. Summing this new filtered data by player, gave me their total attempted, and total completed yards which I then averaged into their adjusted accuracy (adj_accuracy).

I then analyzed it through graphs to see the comparisons across all 3 seasons, as well as the best 10 across each individual season. The Results I found show the most yard-accurate kickers in the NFL across both parameters. This is a important metric to evaluate as you want reliability in your kickers. A unreliable kicker is more of a burden then a asset as often teams depend on kickers to be accurate a high percentage of time. The results I provided show which kickers are most reliable when it comes to overall yards after factoring out less relevant data. Due to varying play time/opportunities/attempt lengths, I believe finding the accuracy of yards was the best way to take away the variability of individual kicks.

Data Visualization

First, I downloaded all relivant libraries and data sources, and created my initial data frame

#Set working Directory/Load libraries
setwd("C:/Users/bradk/OneDrive/Documents/IS470")

suppressMessages(library(data.table))
suppressMessages(library(lubridate))
suppressMessages(library(dplyr))
suppressMessages(library(httr))
suppressMessages(library(ggplot2))
suppressMessages(library(DescTools))
suppressMessages(library(scales))
suppressMessages(library(tidytext))
suppressMessages(library(RColorBrewer))
suppressMessages(library(kableExtra))

#Reading in Data
plays = fread("Data/NFLBDB2022/NFL2022/plays.csv")
players = fread("Data/NFLBDB2022/NFL2022/players.csv")
games = fread("Data/NFLBDB2022/NFL2022/games.csv")
pff = fread("Data/NFLBDB2022/NFL2022/pffScoutingData.csv")

#Merging Data and removing exess dataframes
merged1 = left_join(plays, pff, by = c("gameId", "playId"))
merged2 = left_join(merged1, games, by = c("gameId"))
df = left_join(merged2, players, by = c("kickerId" = "nflId"))

rm(merged1)
rm(merged2)
rm(plays)
rm(players)
rm(games)
rm(pff)

I initially filtered my data and producted quartiles for each kickers attempts

relivant = df %>%
  select(specialTeamsResult, specialTeamsPlayType, yardlineNumber, kickLength, displayName, possessionTeam, kickerId, season) %>%
  filter((specialTeamsResult %in% c("Kick Attempt Good", "Kick Attempt No Good"))) %>%
  group_by(displayName, possessionTeam) %>%
  filter(n_distinct(season) > 1) %>%
  mutate(seasons = n_distinct(season)) %>%
  mutate(kickLength = ifelse(is.na(kickLength) & specialTeamsPlayType == "Extra Point", 33, kickLength)) %>%
  data.frame()
  
Quartiles = relivant %>%
  group_by(kickerId, displayName) %>%
  summarise(
    Good_Q_10 = quantile(kickLength[specialTeamsResult == "Kick Attempt Good"], 0.1),
    Bad_Q_90 = quantile(kickLength[specialTeamsResult == "Kick Attempt No Good"], 0.9)) %>%
  data.frame()
  
  knitr::kable(head(Quartiles, 8), caption = 'Kicker Range Quartiles') %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Kicker Range Quartiles
kickerId	displayName	Good_Q_10	Bad_Q_90
21213	Adam Vinatieri	26.4	55.0
27091	Matt Bryant	33.0	53.0
30403	Robbie Gould	28.4	55.0
30932	Stephen Gostkowski	28.0	52.0
31446	Matt Prater	27.0	55.0
32371	Nick Folk	25.0	49.5
32386	Mason Crosby	33.0	51.7
33469	Stephen Hauschka	31.0	54.0

Using those calculated quartiles, I then removed those values from the relevant data to remove data outside of those ranges. This was done to avoid penalizing kickers for missing kicks with objectively low probabilities, as well as not inflate their statistics with kicks well within their abilities as it does not portray skill accurately.

Using this newly filtered data, I then created a new statistic to track the percentage of total yards attempted and made.

filtered = relivant %>%
  left_join(Quartiles, by = c("kickerId", "displayName")) %>%
  filter(!(specialTeamsResult == "Kick Attempt Good" & kickLength <= Good_Q_10)) %>%
  filter(!(specialTeamsResult == "Kick Attempt  No Good" & kickLength >= Bad_Q_90)) %>%
  group_by(kickerId, displayName) %>%
  summarise(
    total_kickLength = sum(kickLength),
    total_kickLength_good = sum(kickLength[specialTeamsResult == "Kick Attempt Good"]),
    adj_accuracy = total_kickLength_good / total_kickLength,
  ) %>%
  arrange(-adj_accuracy) %>%
  data.frame()

knitr::kable(head(filtered, 8), caption = 'Adjusted Kicker Accuracy Percentages') %>%
kable_styling(bootstrap_options = c("striped", "hover"))

Adjusted Kicker Accuracy Percentages
kickerId	displayName	total_kickLength	total_kickLength_good	adj_accuracy
39470	Justin Tucker	7360	6999	0.9509511
43689	Wil Lutz	7420	6987	0.9416442
43068	Josh Lambo	3503	3264	0.9317728
45046	Harrison Butker	7586	6900	0.9095703
32371	Nick Folk	2664	2420	0.9084084
41953	Chris Boswell	5313	4784	0.9004329
45576	Younghoe Koo	3311	2970	0.8970100
46298	Jason Sanders	5827	5196	0.8917110

The resulting variable adj_accuracy shows the percent of yards the kicker attempted and made.

Visualization 1: A graph that shows the adjusted percentage of yards made by quarterbacks

ggplot(filtered, aes(x = reorder(displayName, -adj_accuracy), y = adj_accuracy, fill = adj_accuracy)) + geom_bar(stat = "identity") + labs(title = "Adjusted Accuracy by Player", x = "Player", y = "Adjusted Accuracy") + geom_text(aes( label = label_percent(accuracy = 1L) (adj_accuracy)), vjust = -0.5, angle = 33) + scale_fill_continuous(breaks = seq(.6, 1, 0.1), limits = c(0.6, 1), labels = paste0(100*seq(.6, 1, 0.1), "%"), low = "blue", high = "orange") + theme_minimal() + theme(plot.title = element_text(hjust = 0.50, face="bold", size=18), axis.text.x=element_text(size=12, angle = 45, hjust = 1), axis.text.y=element_text(size=15), axis.title=element_text(size=18,face="bold"))

Then, I broke it down and arranged it by each season for a new ranking:

filtered2 <- relivant %>% left_join(Quartiles, by = c("kickerId", "displayName")) %>% filter(!(specialTeamsResult == "Kick Attempt Good" & kickLength <= Good_Q_10)) %>% filter(!(specialTeamsResult == "Kick Attempt No Good" & kickLength >= Bad_Q_90)) %>% group_by(kickerId, displayName, season) %>% summarise( total_kickLength = sum(kickLength), total_kickLength_good = sum(kickLength[specialTeamsResult == "Kick Attempt Good"]), adj_accuracy = total_kickLength_good / total_kickLength, .groups = 'drop' ) %>% group_by(season) %>% arrange(desc(adj_accuracy)) %>% slice_head(n = 10) %>% ungroup() %>% data.frame() knitr::kable(head(filtered2, 12), caption = 'top 10 Kicker Adjusted Accuracy by season') %>% kable_styling(bootstrap_options = c("striped", "hover"))

top 10 Kicker Adjusted Accuracy by season

kickerId displayName season total_kickLength total_kickLength_good adj_accuracy

43689 Wil Lutz 2018 2517 2484 0.9868892

39470 Justin Tucker 2018 2177 2144 0.9848415

46342 Michael Badgley 2018 1366 1333 0.9758419

43068 Josh Lambo 2018 1364 1322 0.9692082

46236 Daniel Carlson 2018 1145 1100 0.9606987

43937 Aldrick Rosas 2018 1984 1899 0.9571573

31446 Matt Prater 2018 1897 1805 0.9515024

45046 Harrison Butker 2018 2680 2538 0.9470149

34707 Ryan Succop 2018 1635 1541 0.9425076

38691 Randy Bullock 2018 1956 1838 0.9396728

41953 Chris Boswell 2019 1539 1495 0.9714100

39470 Justin Tucker 2019 2672 2563 0.9592066

I then applied it into a horizontally stacked trellis chart for season by season comparison.

top 10 Kicker Adjusted Accuracy by season
kickerId	displayName	season	total_kickLength	total_kickLength_good	adj_accuracy
43689	Wil Lutz	2018	2517	2484	0.9868892
39470	Justin Tucker	2018	2177	2144	0.9848415
46342	Michael Badgley	2018	1366	1333	0.9758419
43068	Josh Lambo	2018	1364	1322	0.9692082
46236	Daniel Carlson	2018	1145	1100	0.9606987
43937	Aldrick Rosas	2018	1984	1899	0.9571573
31446	Matt Prater	2018	1897	1805	0.9515024
45046	Harrison Butker	2018	2680	2538	0.9470149
34707	Ryan Succop	2018	1635	1541	0.9425076
38691	Randy Bullock	2018	1956	1838	0.9396728
41953	Chris Boswell	2019	1539	1495	0.9714100
39470	Justin Tucker	2019	2672	2563	0.9592066

Visualization 1: A graph that shows the top 10 kickers ranked by Adj_accuracy by season

ggplot(data = filtered2, aes(x = reorder_within(displayName, -adj_accuracy, season), y = adj_accuracy, fill = adj_accuracy)) + geom_bar(stat = "identity", width = 0.7) + labs(x = "Players", y = "Adjusted Accuracy", title = "Top 10 Players Adjusted Accuracy by Season", fill = "Adj. Acc.") + geom_text(aes(label = scales::percent(adj_accuracy, accuracy = 0.1)), hjust = 0.5, size = 4, angle =10) + scale_fill_continuous(breaks = seq(0.9, 1, 0.02), limits = c(0.9, 1), labels = paste0(100 * seq(0.9, 1, 0.02), "%"), low = "red", high = "forestgreen") + scale_x_reordered() + facet_wrap(~season, scales = "free", ncol = 1) + theme_minimal(base_size = 15) + theme(plot.title = element_text(hjust = 0.50, face = "bold", size = 18), axis.text.x = element_text(size = 10, angle = 15, vjust = 1.25, hjust = 0.5), axis.text.y = element_text(size = 15), axis.title = element_text(size = 18, face = "bold"), strip.text = element_text(size = 18))

Conclusion

In conclusion, I filtered my data twice and created a new variable to find the percentage of yards attempted that the kicker was able to make. I believe this will help rank them as how successful they are at gaining yards, while also helping to balance out extremes on both sides.

Assignment 2

B. Kramer

10/14/2024

Introduction/ Description

Data Visualization

Visualization 1: A graph that shows the adjusted percentage of yards made by quarterbacks

Visualization 1: A graph that shows the top 10 kickers ranked by Adj_accuracy by season

Conclusion