To help me determine who is the “best” kicker, the first thing I did was download all relevant data relating to kickers. I merged this data into one data frame and removed the exess data frames. From this I filtered it to only include “Feild Goals” and “Extra Points”. I did this to narrow down the results as those are a) the most impact full plays, and most directly measurable to examine kicker without outside factors. In order to only examine experienced kickers to avoid low data skewing results, I also filtered for only kickers who had been in multiple seasons. Additionally, I noticed that the “kick length” for Extra Points was NA, so I replaced it with 33 since all extra points take place on the 15 yard line (15 + 18 yards being the average for extra points length).
This gave me complete and filtered data only referring to the field goals and extra points I was interested in. Using this data, I was able to take quartiles for Good and Bad results. I did not want kickers scores to be unfairly skewed by very high probability kicks (such as from >10 yard lines with 99.9% success rates) or from very unlikely kicks (such as from 60 + yards that had very low probabilities) as this would penalize those who took risks. I then filtered the longest 10% of Fails and shortest 10% of successes out to balance this out and give data I felt was more representative of kickers true skills. Leaving in sucsesses that occurred outside of the 90% quartile and failures inside of the 10% quartile, properly marked down/up kickers who went against the averages. Summing this new filtered data by player, gave me their total attempted, and total completed yards which I then averaged into their adjusted accuracy (adj_accuracy).
I then analyzed it through graphs to see the comparisons across all 3 seasons, as well as the best 10 across each individual season. The Results I found show the most yard-accurate kickers in the NFL across both parameters. This is a important metric to evaluate as you want reliability in your kickers. A unreliable kicker is more of a burden then a asset as often teams depend on kickers to be accurate a high percentage of time. The results I provided show which kickers are most reliable when it comes to overall yards after factoring out less relevant data. Due to varying play time/opportunities/attempt lengths, I believe finding the accuracy of yards was the best way to take away the variability of individual kicks.
First, I downloaded all relivant libraries and data sources, and created my initial data frame
#Set working Directory/Load libraries
setwd("C:/Users/bradk/OneDrive/Documents/IS470")
suppressMessages(library(data.table))
suppressMessages(library(lubridate))
suppressMessages(library(dplyr))
suppressMessages(library(httr))
suppressMessages(library(ggplot2))
suppressMessages(library(DescTools))
suppressMessages(library(scales))
suppressMessages(library(tidytext))
suppressMessages(library(RColorBrewer))
suppressMessages(library(kableExtra))
#Reading in Data
plays = fread("Data/NFLBDB2022/NFL2022/plays.csv")
players = fread("Data/NFLBDB2022/NFL2022/players.csv")
games = fread("Data/NFLBDB2022/NFL2022/games.csv")
pff = fread("Data/NFLBDB2022/NFL2022/pffScoutingData.csv")
#Merging Data and removing exess dataframes
merged1 = left_join(plays, pff, by = c("gameId", "playId"))
merged2 = left_join(merged1, games, by = c("gameId"))
df = left_join(merged2, players, by = c("kickerId" = "nflId"))
rm(merged1)
rm(merged2)
rm(plays)
rm(players)
rm(games)
rm(pff)
I initially filtered my data and producted quartiles for each kickers attempts
relivant = df %>%
select(specialTeamsResult, specialTeamsPlayType, yardlineNumber, kickLength, displayName, possessionTeam, kickerId, season) %>%
filter((specialTeamsResult %in% c("Kick Attempt Good", "Kick Attempt No Good"))) %>%
group_by(displayName, possessionTeam) %>%
filter(n_distinct(season) > 1) %>%
mutate(seasons = n_distinct(season)) %>%
mutate(kickLength = ifelse(is.na(kickLength) & specialTeamsPlayType == "Extra Point", 33, kickLength)) %>%
data.frame()
Quartiles = relivant %>%
group_by(kickerId, displayName) %>%
summarise(
Good_Q_10 = quantile(kickLength[specialTeamsResult == "Kick Attempt Good"], 0.1),
Bad_Q_90 = quantile(kickLength[specialTeamsResult == "Kick Attempt No Good"], 0.9)) %>%
data.frame()
knitr::kable(head(Quartiles, 8), caption = 'Kicker Range Quartiles') %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| kickerId | displayName | Good_Q_10 | Bad_Q_90 |
|---|---|---|---|
| 21213 | Adam Vinatieri | 26.4 | 55.0 |
| 27091 | Matt Bryant | 33.0 | 53.0 |
| 30403 | Robbie Gould | 28.4 | 55.0 |
| 30932 | Stephen Gostkowski | 28.0 | 52.0 |
| 31446 | Matt Prater | 27.0 | 55.0 |
| 32371 | Nick Folk | 25.0 | 49.5 |
| 32386 | Mason Crosby | 33.0 | 51.7 |
| 33469 | Stephen Hauschka | 31.0 | 54.0 |
Using those calculated quartiles, I then removed those values from the relevant data to remove data outside of those ranges. This was done to avoid penalizing kickers for missing kicks with objectively low probabilities, as well as not inflate their statistics with kicks well within their abilities as it does not portray skill accurately.
Using this newly filtered data, I then created a new statistic to track the percentage of total yards attempted and made.
filtered = relivant %>%
left_join(Quartiles, by = c("kickerId", "displayName")) %>%
filter(!(specialTeamsResult == "Kick Attempt Good" & kickLength <= Good_Q_10)) %>%
filter(!(specialTeamsResult == "Kick Attempt No Good" & kickLength >= Bad_Q_90)) %>%
group_by(kickerId, displayName) %>%
summarise(
total_kickLength = sum(kickLength),
total_kickLength_good = sum(kickLength[specialTeamsResult == "Kick Attempt Good"]),
adj_accuracy = total_kickLength_good / total_kickLength,
) %>%
arrange(-adj_accuracy) %>%
data.frame()
knitr::kable(head(filtered, 8), caption = 'Adjusted Kicker Accuracy Percentages') %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| kickerId | displayName | total_kickLength | total_kickLength_good | adj_accuracy |
|---|---|---|---|---|
| 39470 | Justin Tucker | 7360 | 6999 | 0.9509511 |
| 43689 | Wil Lutz | 7420 | 6987 | 0.9416442 |
| 43068 | Josh Lambo | 3503 | 3264 | 0.9317728 |
| 45046 | Harrison Butker | 7586 | 6900 | 0.9095703 |
| 32371 | Nick Folk | 2664 | 2420 | 0.9084084 |
| 41953 | Chris Boswell | 5313 | 4784 | 0.9004329 |
| 45576 | Younghoe Koo | 3311 | 2970 | 0.8970100 |
| 46298 | Jason Sanders | 5827 | 5196 | 0.8917110 |
The resulting variable adj_accuracy shows the percent of yards the kicker attempted and made.
ggplot(filtered, aes(x = reorder(displayName, -adj_accuracy), y = adj_accuracy, fill = adj_accuracy)) +
geom_bar(stat = "identity") +
labs(title = "Adjusted Accuracy by Player", x = "Player", y = "Adjusted Accuracy") +
geom_text(aes( label = label_percent(accuracy = 1L) (adj_accuracy)), vjust = -0.5, angle = 33) +
scale_fill_continuous(breaks = seq(.6, 1, 0.1),
limits = c(0.6, 1),
labels = paste0(100*seq(.6, 1, 0.1), "%"),
low = "blue",
high = "orange") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.50, face="bold", size=18),
axis.text.x=element_text(size=12, angle = 45, hjust = 1),
axis.text.y=element_text(size=15),
axis.title=element_text(size=18,face="bold"))
Then, I broke it down and arranged it by each season for a new ranking:
filtered2 <- relivant %>%
left_join(Quartiles, by = c("kickerId", "displayName")) %>%
filter(!(specialTeamsResult == "Kick Attempt Good" & kickLength <= Good_Q_10)) %>%
filter(!(specialTeamsResult == "Kick Attempt No Good" & kickLength >= Bad_Q_90)) %>%
group_by(kickerId, displayName, season) %>%
summarise(
total_kickLength = sum(kickLength),
total_kickLength_good = sum(kickLength[specialTeamsResult == "Kick Attempt Good"]),
adj_accuracy = total_kickLength_good / total_kickLength,
.groups = 'drop'
) %>%
group_by(season) %>%
arrange(desc(adj_accuracy)) %>%
slice_head(n = 10) %>%
ungroup() %>%
data.frame()
knitr::kable(head(filtered2, 12), caption = 'top 10 Kicker Adjusted Accuracy by season') %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| kickerId | displayName | season | total_kickLength | total_kickLength_good | adj_accuracy |
|---|---|---|---|---|---|
| 43689 | Wil Lutz | 2018 | 2517 | 2484 | 0.9868892 |
| 39470 | Justin Tucker | 2018 | 2177 | 2144 | 0.9848415 |
| 46342 | Michael Badgley | 2018 | 1366 | 1333 | 0.9758419 |
| 43068 | Josh Lambo | 2018 | 1364 | 1322 | 0.9692082 |
| 46236 | Daniel Carlson | 2018 | 1145 | 1100 | 0.9606987 |
| 43937 | Aldrick Rosas | 2018 | 1984 | 1899 | 0.9571573 |
| 31446 | Matt Prater | 2018 | 1897 | 1805 | 0.9515024 |
| 45046 | Harrison Butker | 2018 | 2680 | 2538 | 0.9470149 |
| 34707 | Ryan Succop | 2018 | 1635 | 1541 | 0.9425076 |
| 38691 | Randy Bullock | 2018 | 1956 | 1838 | 0.9396728 |
| 41953 | Chris Boswell | 2019 | 1539 | 1495 | 0.9714100 |
| 39470 | Justin Tucker | 2019 | 2672 | 2563 | 0.9592066 |
I then applied it into a horizontally stacked trellis chart for season by season comparison.
ggplot(data = filtered2, aes(x = reorder_within(displayName, -adj_accuracy, season), y = adj_accuracy, fill = adj_accuracy)) +
geom_bar(stat = "identity", width = 0.7) +
labs(x = "Players", y = "Adjusted Accuracy", title = "Top 10 Players Adjusted Accuracy by Season", fill = "Adj. Acc.") +
geom_text(aes(label = scales::percent(adj_accuracy, accuracy = 0.1)), hjust = 0.5, size = 4, angle =10) +
scale_fill_continuous(breaks = seq(0.9, 1, 0.02),
limits = c(0.9, 1),
labels = paste0(100 * seq(0.9, 1, 0.02), "%"),
low = "red",
high = "forestgreen") +
scale_x_reordered() +
facet_wrap(~season, scales = "free", ncol = 1) +
theme_minimal(base_size = 15) +
theme(plot.title = element_text(hjust = 0.50, face = "bold", size = 18),
axis.text.x = element_text(size = 10, angle = 15, vjust = 1.25, hjust = 0.5),
axis.text.y = element_text(size = 15),
axis.title = element_text(size = 18, face = "bold"),
strip.text = element_text(size = 18))
In conclusion, I filtered my data twice and created a new variable to find the percentage of yards attempted that the kicker was able to make. I believe this will help rank them as how successful they are at gaining yards, while also helping to balance out extremes on both sides.