Data Dive Week 5 - Documentation

Start by setting up the packages to manipulate data.

suppressPackageStartupMessages({
  library(tidyverse)
  library(rio)
  source("aptheme.R") #Code that helps format graphs
  })

Import data

data <- import("plays.csv")

Unclear Variables

There are two columns, targetX and targetY, that are not obvious. Intuitively they look like coordinates, but what they are coordinates of is vague. The documentation clarifies that they refer to the X and Y coordinates of the targeted receiver. This is the simplest way to tell where the targeted receiver is, but without the documentation I would have assumed this was where the ball was spotted on the field. That assumption would have shifted where I believed the ball was on the field by at least 10 yards.
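
To see how much that misreading would matter, one could compare targetX with the line of scrimmage. The sketch below uses a small made-up data frame rather than plays.csv, and it assumes that targetX and absoluteYardlineNumber are measured along the same axis of the field, which the documentation would need to confirm.

```r
# Toy rows standing in for plays.csv (values are invented for illustration)
toy_plays <- data.frame(
  absoluteYardlineNumber = c(35, 60, 80),
  targetX                = c(48, 72, 85)
)

# Assumption: both columns use the same field axis, so their difference is
# the distance between the spot of the ball and the targeted receiver
toy_plays$target_offset <- abs(toy_plays$targetX - toy_plays$absoluteYardlineNumber)

# Typical size of the error made by reading targetX as the ball spot
mean(toy_plays$target_offset)
```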

Additionally, there are two columns, pff_runConceptPrimary and pff_runConceptSecondary, that are not clear even with the documentation.

#Check how many values are in both
sum(unique(data$pff_runConceptPrimary) %in% unique(data$pff_runConceptSecondary))
## [1] 1
#Check the number of unique values in each
length(unique(data$pff_runConceptPrimary))
## [1] 13
length(unique(data$pff_runConceptSecondary))
## [1] 44
#Share of plays with a primary concept that also have a secondary concept
sum(!is.na(data$pff_runConceptPrimary) & !is.na(data$pff_runConceptSecondary))/sum(!is.na(data$pff_runConceptPrimary))
## [1] 0.3109911

This makes it look like the primary concept is mandatory for a run play, but a secondary concept is recorded only in some situations (about 31% of plays with a primary concept). So perhaps this is a case where an initial run play is called, and then, based on how the defense is playing, the running back has the option to select a modification.
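
One caveat on the counts above: unique() keeps NA as a value and %in% treats NA as matching NA, so the single shared value and the totals of 13 and 44 may each include NA rather than a real run concept. A minimal sketch on made-up vectors (the concept names are invented) shows the difference:

```r
# Invented concept labels; NA stands for plays with no concept recorded
primary   <- c("ZONE", "POWER", NA, "ZONE", "COUNTER")
secondary <- c(NA, "LEAD", NA, "PULL", NA)

# Overlap computed the same way as above: NA counts as a shared value
overlap_with_na <- sum(unique(primary) %in% unique(secondary))

# Overlap among non-missing concepts only
shared <- intersect(primary[!is.na(primary)], secondary[!is.na(secondary)])
overlap_without_na <- length(shared)

# Number of distinct concepts, excluding NA
n_primary   <- length(unique(primary[!is.na(primary)]))
n_secondary <- length(unique(secondary[!is.na(secondary)]))
```

If the two overlap counts differ on the real data, the shared value is NA and the two columns have no run concept in common.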

#Flag whether each run play also has a secondary concept
run_data <- data %>%
  select(pff_runConceptPrimary, 
         pff_runConceptSecondary) %>%
  filter(!is.na(pff_runConceptPrimary)) %>%
  mutate(run_combo = paste(pff_runConceptPrimary, pff_runConceptSecondary), 
         secondary_exists = !is.na(pff_runConceptSecondary))
#Plot the primary concepts, ordered by how often a secondary concept exists
ggplot(data = run_data, aes(x = reorder(pff_runConceptPrimary, secondary_exists), y = secondary_exists)) + 
  geom_col(fill = "#674875") + 
  theme_ap(family = "sans") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = "Primary Run Concept",
       y = "Secondary Concept Exists",
       title = "Primary Run Concept w/ \n Secondary Concept")
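
A tabular version of the same comparison can be easier to read than the bars. This sketch uses an invented data frame with hypothetical concept names, and computes the share of plays with a secondary concept for each primary concept:

```r
# Invented run plays; NA in `secondary` means no secondary concept recorded
runs <- data.frame(
  primary   = c("ZONE", "ZONE", "POWER", "POWER", "POWER", "COUNTER"),
  secondary = c("LEAD", NA, NA, NA, "PULL", NA)
)

# Mean of a logical vector is the proportion of TRUE values,
# so this gives the share of plays with a secondary concept per primary concept
share <- tapply(!is.na(runs$secondary), runs$primary, mean)

# Largest share first
sort(share, decreasing = TRUE)
```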

Further, there is a column called expectedPoints, which is described as the expected points on a given play. On the surface it sounds pretty straightforward, but there are negative values. It doesn’t seem possible for points to be taken off the board, so those values don’t make much sense. My first thought was that this has something to do with how close a team is to the goal line, so here I plot the absolute yard line against expected points.

#Looking at the correlation between the absolute yard line and the expected points 
ggplot(data = data, aes(x = absoluteYardlineNumber, y = expectedPoints, color = expectedPoints)) + 
  geom_point() + 
  scale_color_gradient2(midpoint=0, high = "#146994", mid = "white", low = "#C83728", space="Lab") + 
  theme_ap(family = "sans") + 
  geom_hline(yintercept = 0) + 
  labs(x = "Absolute Yard Line",
        y = "Expected Points",
    title = "Expected Points by Field Position") 

This is even stranger. There appear to be two really strong correlations, one positive and one negative, which suggests the direction depends on another variable. There are no binary variables in the dataset (other than playAction, which doesn’t make sense in this context), so I’m not sure what is causing this. I wondered whether this variable is negative when the opposing team is expected to score, so I looked at whether or not the pass was intercepted. I also looked at whether or not the play occurred on fourth down (when there could be a turnover on downs).

data %>%
  mutate(interception = passResult == "IN") %>%
  group_by(interception) %>%
  summarise(avg_points_expected = mean(expectedPoints), 
            n = n())
## # A tibble: 2 × 3
##   interception avg_points_expected     n
##   <lgl>                      <dbl> <int>
## 1 FALSE                       2.25 15931
## 2 TRUE                        1.91   193
data %>%
  mutate(fourth = down == 4) %>%
  group_by(fourth) %>%
  summarise(avg_points_expected = mean(expectedPoints), 
            n = n())
## # A tibble: 2 × 3
##   fourth avg_points_expected     n
##   <lgl>                <dbl> <int>
## 1 FALSE                 2.26 15807
## 2 TRUE                  1.63   317

Looking at the results here, I don’t see any obvious support for the idea that these negative numbers occur when the opposing team is expected to score. Between this and the graph, I think this variable is miscoded. Just looking at the graph, there appear to be similar numbers of data points positively correlated with field position as negatively correlated. All the other variables are obviously associated with the team on offense or defense, and I think the expectedPoints variable sometimes refers to the offense and other times refers to the defense. Unless I could get confirmation of what is actually happening here, I would recommend not using this column, since there’s no way of telling exactly what is going on.
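
Before dropping the column, it may help to quantify how much of it is affected. A minimal sketch, using made-up expected point values rather than the real data:

```r
# Invented expected point values standing in for data$expectedPoints
toy_ep <- c(2.4, -0.8, 3.1, 0.5, -1.6, 4.0, 1.2, -0.3)

# Count and share of plays with negative expected points
n_negative     <- sum(toy_ep < 0)
share_negative <- mean(toy_ep < 0)

# Smallest and largest values, to see how far below zero the column goes
range(toy_ep)
```

A small share would point to a handful of special cases, while a large share would be more consistent with the variable switching between offense and defense.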

Checking for Missing Values

We start by looking at the possession team variable to check for implicitly and explicitly missing values.

data %>%
  mutate(posessionTeamNA = is.na(possessionTeam)) %>%
  group_by(possessionTeam) %>%
  summarise(team_appearance = n(),
            posessionTeamNA = sum(posessionTeamNA))
## # A tibble: 32 × 3
##    possessionTeam team_appearance posessionTeamNA
##    <chr>                    <int>           <int>
##  1 ARI                        569               0
##  2 ATL                        511               0
##  3 BAL                        517               0
##  4 BUF                        469               0
##  5 CAR                        458               0
##  6 CHI                        510               0
##  7 CIN                        551               0
##  8 CLE                        512               0
##  9 DAL                        453               0
## 10 DEN                        490               0
## # ℹ 22 more rows

Here we can see there are 32 groups, which match the 32 NFL teams, so there are no missing groups. We also check for explicitly missing values using is.na(), which shows there are none. Because this is an unordered categorical variable, there are no obvious implicitly missing values. Next we look at the pff_manZone variable.
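
The check for missing groups can also be made explicit by comparing the observed values against a reference list. The sketch below uses a hypothetical four-team reference list and invented rows; with the real data the reference would hold all 32 team abbreviations:

```r
# Hypothetical reference list and invented observations
expected_teams <- c("ARI", "ATL", "BAL", "BUF")
toy_possession <- c("ARI", "BAL", "ARI", NA, "BUF")

# Explicitly missing: NA values in the column
n_explicit_na <- sum(is.na(toy_possession))

# Implicitly missing: expected groups that never appear
missing_groups <- setdiff(expected_teams, toy_possession)

# Unexpected values: observed groups that are not in the reference list
unexpected <- setdiff(toy_possession[!is.na(toy_possession)], expected_teams)
```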

data %>%
  mutate(coverageNA = is.na(pff_manZone)) %>%
  group_by(pff_manZone) %>%
  summarise(coverage = n(),
            coverageNA = sum(coverageNA))
## # A tibble: 4 × 3
##   pff_manZone coverage coverageNA
##   <chr>          <int>      <int>
## 1 Man             4145          0
## 2 Other            818          0
## 3 Zone           10969          0
## 4 <NA>             192        192

Here we can see that there are 192 explicitly missing values in the coverage variable. There are no empty groups, since the only options for coverage are Man and Zone. The odd thing here is the 818 plays with coverage listed as Other. Coverage should be either Man or Zone, so we should probably treat the plays marked as Other as a kind of missing value as well.
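
If Other is treated as missing, the recoding could look like the following sketch, shown on an invented vector rather than the real pff_manZone column:

```r
# Invented coverage labels, including one value that is already NA
toy_coverage <- c("Man", "Zone", "Other", NA, "Zone", "Other")

# Recode "Other" as missing. %in% returns FALSE (not NA) for missing
# entries, so the index is a clean logical vector.
coverage_clean <- toy_coverage
coverage_clean[coverage_clean %in% "Other"] <- NA

# Counts per group, with the missing values shown as their own group
table(coverage_clean, useNA = "ifany")

# Total number of plays now treated as missing
n_missing <- sum(is.na(coverage_clean))
```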

Continuous Outliers

Finally we check the yards gained variable for outliers.

ggplot(data = data, aes(x = yardsGained)) + 
  geom_boxplot() + 
  theme_ap(family = "sans")

ggplot(data = data, aes(x = yardsGained)) + 
  geom_histogram(binwidth = 5, fill = "black") + 
  coord_cartesian(xlim = c(-70, 100)) + 
  theme_ap(family = "sans")

#1st and 99th percentiles of yards gained
percentiles <- quantile(data$yardsGained, c(0.01, 0.99))

ggplot(data = data) + 
  geom_jitter(mapping = aes(x = yardsGained, y = ""),
              width = 0, height = 0.1) +
  geom_vline(mapping = aes(xintercept = percentiles["1%"],
                           color = "1% percentile")) +
  geom_vline(mapping = aes(xintercept = percentiles["99%"],
                           color = "99% percentile"))

This is a tricky variable for identifying outliers. The vast majority of plays gain between 0 and 10 yards; however, there are plays that lose more than 50 yards and plays that gain nearly 100 yards. Looking at the box-and-whisker plot, there are plenty of values that fall more than 1.5 times the interquartile range beyond the quartiles, but it doesn’t make much sense to exclude all of them. Looking at the percentiles, there are also plenty of values that fall well beyond the 1st and 99th percentiles. I would probably not call any of these values outliers, since most of the data shows no obvious break. The best candidate for an outlier would be the row where yards gained is 98, which is separated from the rest of the data by a visible gap in both the histogram and the percentiles dot plot.
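
The two rules of thumb used above, the 1.5 times interquartile range fences and the search for a break in the data, can also be computed directly. This is a minimal sketch on invented yardage values, not the real yardsGained column:

```r
# Invented yardage values with one extreme gain
yards <- c(-5, 0, 2, 3, 4, 5, 6, 8, 12, 98)

# Box plot fences: 1.5 times the interquartile range beyond the quartiles
q     <- quantile(yards, c(0.25, 0.75))
iqr   <- q[["75%"]] - q[["25%"]]
lower <- q[["25%"]] - 1.5 * iqr
upper <- q[["75%"]] + 1.5 * iqr

# Values that the box plot would draw as individual points
outside <- yards[yards < lower | yards > upper]

# Gaps between neighbouring sorted values; a large gap marks a break
sorted_yards <- sort(yards)
gaps         <- diff(sorted_yards)
largest_gap  <- max(gaps)
value_after_gap <- sorted_yards[which.max(gaps) + 1]
```

On the real data the fences would flag many plays, which is why the gap check is the more useful of the two for singling out a value such as 98.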