Start by setting up the packages to manipulate data.
suppressPackageStartupMessages({
library(tidyverse)
library(rio)
source("aptheme.R") #Code that helps format graphs
})
Import data
data <- import("plays.csv")
There are two columns, targetX and targetY,
that are not obvious. Intuitively they look like coordinates, but which
coordinates is vague. The documentation clarifies that these refer to
the respective X and Y coordinates of the targeted receiver. This is the
simplest way to tell where the targeted receiver is, but without
documentation, I would have assumed this was where the ball was spotted
on the field. Without the clarification, this would have completely
changed where I believed the ball was on the field, by at least 10
yards.
Additionally, there are two columns, pff_runConceptPrimary
and pff_runConceptSecondary, that are not clear even with
the documentation.
#Check how many values are in both
sum(unique(data$pff_runConceptPrimary) %in% unique(data$pff_runConceptSecondary))
## [1] 1
#Check the number of unique values in each
length(unique(data$pff_runConceptPrimary))
## [1] 13
length(unique(data$pff_runConceptSecondary))
## [1] 44
#Share of plays with a primary concept that also have a secondary concept
sum(!is.na(data$pff_runConceptPrimary) & !is.na(data$pff_runConceptSecondary))/sum(!is.na(data$pff_runConceptPrimary))
## [1] 0.3109911
This makes it look like the primary concept is mandatory for a run play, while a secondary concept is recorded on only about 31% of those plays. Perhaps an initial run play is called, and then, based on how the defense is playing, the running back has the option to select a modification.
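One caveat about the overlap count above: in R, `unique()` keeps `NA` as a value and `%in%` treats `NA` as matching `NA`, so the single shared value may simply be `NA` rather than a real concept label. A minimal sketch on made-up vectors (the concept names are placeholders, not values taken from plays.csv) shows the difference between the two ways of counting:

```r
# Hypothetical toy vectors standing in for the two run-concept columns
primary   <- c("INSIDE ZONE", "POWER", NA, "INSIDE ZONE")
secondary <- c(NA, "LEAD", NA, "PULL")

# unique() keeps NA, and %in% treats NA as matching NA
shared_with_na <- sum(unique(primary) %in% unique(secondary))

# Dropping NA first counts only real concept labels
primary_vals   <- unique(primary[!is.na(primary)])
secondary_vals <- unique(secondary[!is.na(secondary)])
shared_without_na <- length(intersect(primary_vals, secondary_vals))

shared_with_na     # counts the NA match
shared_without_na  # counts only real labels
```

If the same pattern holds in the real data, the unique-value counts of 13 and 44 would each include `NA` as one of the values.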
run_data <- data %>%
select(pff_runConceptPrimary,
pff_runConceptSecondary) %>%
filter(!is.na(pff_runConceptPrimary)) %>%
mutate(run_combo = paste(pff_runConceptPrimary, pff_runConceptSecondary),
secondary_exists = !is.na(pff_runConceptSecondary))
#Bars show the number of plays with a secondary concept for each primary concept
ggplot(data = run_data, aes(x = reorder(pff_runConceptPrimary, secondary_exists), y = secondary_exists)) +
geom_col(fill = "#674875") +
theme_ap(family = "sans") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Primary Run Concept",
y = "Secondary Concept Exists",
title = "Primary Run Concept w/ \n Secondary Concept")
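The bar chart shows counts, so it can hide how common a secondary concept is relative to how often each primary concept is called. A minimal sketch of the same comparison as a table of shares, using a made-up data frame (the column names and the concept labels "A", "B", "C" are placeholders) and base R so it runs without extra packages:

```r
# Hypothetical toy data; labels are placeholders, not values from plays.csv
run_toy <- data.frame(
  primary   = c("A", "A", "A", "B", "B", "C"),
  secondary = c("x", NA, NA, "y", "z", NA)
)
run_toy$secondary_exists <- !is.na(run_toy$secondary)

# Plays per primary concept, plays with a secondary concept, and the share
n_plays     <- tapply(run_toy$secondary_exists, run_toy$primary, length)
n_secondary <- tapply(run_toy$secondary_exists, run_toy$primary, sum)

summary_tbl <- data.frame(
  primary         = names(n_plays),
  n_plays         = as.vector(n_plays),
  n_secondary     = as.vector(n_secondary),
  share_secondary = as.vector(n_secondary / n_plays)
)

# Sort so the concepts that most often carry a secondary concept come first
summary_tbl[order(-summary_tbl$share_secondary), ]
```

A primary concept with a share near 1 would suggest the secondary concept is effectively part of the call, while a share near 0 would fit the "optional modification" reading.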
Further, there is a column called expectedPoints which
is described as the expected points on a given play. On the surface it
sounds pretty straightforward, but there are negative values. It
doesn’t seem possible for points to be taken off the board, so those
values don’t make much sense. My first thought was that this has
something to do with how close to the goal line a team is, so here I
plot the absolute yard line against expected points.
#Looking at the correlation between the absolute yard line and the expected points
ggplot(data = data, aes(x = absoluteYardlineNumber, y = expectedPoints, color = expectedPoints)) +
geom_point() +
scale_color_gradient2(midpoint=0, high = "#146994", mid = "white", low = "#C83728", space="Lab") +
theme_ap(family = "sans") +
geom_hline(yintercept = 0) +
labs(x = "Absolute Yard Line",
y = "Expected Points",
title = "Expected Points by Field Position")
This is even stranger. There appear to be two really strong correlations,
one positive and one negative, as if the direction depends on some other
variable. There are no binary variables in the dataset (other than
playAction, which doesn’t make any sense in context), so I’m
not sure what is causing this. I wondered whether this variable is
negative when the opposing team is expected to score, so I looked at
whether or not the pass was intercepted. I also looked at whether or not
the play occurred on fourth down (when there could be a turnover on downs).
data %>%
mutate(interception = passResult == "IN") %>%
group_by(interception) %>%
summarise(avg_points_expected = mean(expectedPoints),
n = n())
## # A tibble: 2 × 3
## interception avg_points_expected n
## <lgl> <dbl> <int>
## 1 FALSE 2.25 15931
## 2 TRUE 1.91 193
data %>%
mutate(fourth = down == 4) %>%
group_by(fourth) %>%
summarise(avg_points_expected = mean(expectedPoints),
n = n())
## # A tibble: 2 × 3
## fourth avg_points_expected n
## <lgl> <dbl> <int>
## 1 FALSE 2.26 15807
## 2 TRUE 1.63 317
Looking at the results here, I don’t see any obvious support for the
idea that these negative numbers occur when the opposing team is
expected to score. Between this and the graph, I think this variable is
mis-coded. Just looking at the graph, there appear to be similar numbers
of points positively correlated with field position as negatively
correlated. All the other variables are obviously associated with the
team on offense or defense, and I think the expectedPoints
variable sometimes refers to the offense and other times refers to the
defense. Unless I could get confirmation of what actually is happening
here, I would recommend not using this column since there’s no way of
telling exactly what is going on.
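If the column really does mix two perspectives, some variable should split the scatter plot into two branches whose correlations with field position have opposite signs. A minimal sketch of that check on made-up data, where `group` is a hypothetical stand-in for whatever candidate variable is being tested:

```r
# Hypothetical toy data: two branches with opposite slopes
yard  <- rep(seq(10, 110, by = 10), times = 2)
group <- rep(c("g1", "g2"), each = 11)
ep    <- ifelse(group == "g1", (yard - 60) / 10, -(yard - 60) / 10)
toy   <- data.frame(yard, group, ep)

# Correlation between field position and expected points within each group
cor_by_group <- sapply(split(toy, toy$group), function(d) cor(d$yard, d$ep))

# Pooled correlation, ignoring the split
cor_pooled <- cor(toy$yard, toy$ep)

cor_by_group
cor_pooled
```

In this toy case the within-group correlations are strong and opposite while the pooled correlation cancels out. A real candidate variable that produced the same pattern in plays.csv would be evidence about what separates the two branches; one that left both groups looking like the pooled plot could be ruled out.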
We start by looking at the possession team variable to check for implicitly and explicitly missing values.
data %>%
mutate(posessionTeamNA = is.na(possessionTeam)) %>%
group_by(possessionTeam) %>%
summarise(team_appearance = n(),
posessionTeamNA = sum(posessionTeamNA))
## # A tibble: 32 × 3
## possessionTeam team_appearance posessionTeamNA
## <chr> <int> <int>
## 1 ARI 569 0
## 2 ATL 511 0
## 3 BAL 517 0
## 4 BUF 469 0
## 5 CAR 458 0
## 6 CHI 510 0
## 7 CIN 551 0
## 8 CLE 512 0
## 9 DAL 453 0
## 10 DEN 490 0
## # ℹ 22 more rows
Here we can see there are 32 groups, which all match up to the 32 NFL
teams, so there are no missing groups. We also check for explicitly
missing values by using is.na(), which shows there are
none. Because this is an unordered categorical variable, there are no
obvious implicitly missing values. Next we look at the
pff_manZone variable.
data %>%
mutate(coverageNA = is.na(pff_manZone)) %>%
group_by(pff_manZone) %>%
summarise(coverage = n(),
coverageNA = sum(coverageNA))
## # A tibble: 4 × 3
## pff_manZone coverage coverageNA
## <chr> <int> <int>
## 1 Man 4145 0
## 2 Other 818 0
## 3 Zone 10969 0
## 4 <NA> 192 192
Here we can see that there are 192 explicitly missing values in the
coverage. There are no empty groups, since the only options for coverage
are Man and Zone. The odd thing here is the coverage listed as
Other. Coverage should be either Man or Zone, so we should probably
consider the 818 plays with coverage marked as Other as a kind of missing
value as well.
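Treating Other as missing can be sketched as a recode step followed by a recount. The vector below is made up for illustration, and the choice to fold Other into `NA` is the assumption discussed above rather than something the documentation states:

```r
# Hypothetical toy vector standing in for pff_manZone
coverage <- c("Man", "Zone", "Other", NA, "Zone", "Other")

# Treat "Other" as missing alongside the explicit NA values
coverage_clean <- ifelse(coverage == "Other", NA, coverage)

# useNA = "ifany" keeps the NA count visible in the table
table(coverage_clean, useNA = "ifany")
n_missing <- sum(is.na(coverage_clean))

# Implicitly missing groups: expected levels that never appear
expected_levels <- c("Man", "Zone")
missing_levels  <- setdiff(expected_levels, coverage_clean)
```

Keeping the original column and writing the recode to a new one makes it easy to compare results with and without the Other plays.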
### Continuous outliers
Finally, we check the yards gained variable for outliers.
ggplot(data = data, aes(x = yardsGained)) +
geom_boxplot() +
theme_ap(family = "sans")
ggplot(data = data, aes(x = yardsGained)) +
geom_histogram(binwidth = 5, fill = "black") +
coord_cartesian(xlim = c(-70, 100)) +
theme_ap(family = "sans")
percentiles <- c(quantile(data$yardsGained, 0.01),
quantile(data$yardsGained, 0.99))
ggplot(data = data) +
geom_jitter(mapping = aes(x = yardsGained, y = ""),
width = 0, height = 0.1) +
geom_vline(mapping = aes(xintercept = percentiles["1%"],
color = "1% percentile")) +
geom_vline(mapping = aes(xintercept = percentiles["99%"],
color = "99% percentile"))
This is a tricky variable in which to identify outliers. The vast majority of plays gain between 0 and 10 yards, but there are plays that lose more than 50 yards and plays that gain nearly 100 yards. Looking at the box-and-whisker plot, there are plenty of values that fall more than 1.5 times the interquartile range beyond the quartiles, but it doesn’t make much sense to exclude all of them. Looking at the percentiles, there are also plenty of values that fall well beyond the 1st and 99th percentiles.
I would probably not call any of these values outliers, since most of the tail shows no obvious break. The best candidate for an outlier is the row where yards gained is 98: that value is separated from the rest by a visible gap in both the histogram and the percentile dot plot.
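The two ideas used here, the 1.5 times interquartile range rule and looking for a break in the data, can both be computed directly instead of read off the plots. A minimal sketch on a made-up vector (the values are illustrative, not taken from plays.csv):

```r
# Hypothetical toy vector standing in for yardsGained
yards <- c(-8, -2, 0, 1, 2, 3, 3, 4, 5, 6, 7, 9, 12, 20, 45, 98)

# Tukey fences: 1.5 times the interquartile range beyond the quartiles
q   <- unname(quantile(yards, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
flagged <- yards[yards < lower_fence | yards > upper_fence]

# Gaps between consecutive sorted values; a large gap marks a break in the data
sorted_yards    <- sort(yards)
gaps            <- diff(sorted_yards)
largest_gap     <- max(gaps)
value_after_gap <- sorted_yards[which.max(gaps) + 1]

flagged
value_after_gap
```

The fence rule flags every value in the long tail, while the gap check singles out only the value that sits apart from the rest, which matches the reasoning for treating the 98-yard play differently from the other extreme plays.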