Data Dive Week 6 - Confidence

Start by setting up the packages to manipulate data.

suppressPackageStartupMessages({
  library(tidyverse)
  library(rio)
  source("aptheme.R") #Code that helps format graphs
  })

Import data

games <- import("games.csv")
data <- import("plays.csv")

Yardage and Penatlies

We start out looking at the relationship between penatly yards and the net yardage gained. There’s common wisdom in footbal that pre-snap penatly yards are drive killers, so we test this idea with the below code.

data <- data %>%
  mutate(net_yards = yardsToGo - yardsGained) %>% 
  mutate(penaltyYards = ifelse(is.na(penaltyYards), 0, penaltyYards)) 

data_penalty <- data %>%
  filter(penaltyYards != 0)

ggplot(data = data_penalty, aes(x = penaltyYards, y = net_yards)) + 
  geom_point() + 
   geom_smooth(method = "lm") +
  theme_ap(family = "sans") + 
  labs(x = "Penalty Yards",
        y = "Net Yards Gained",
    title = "Net Yards Gained by Penatly Yards", 
    subtitle = paste("Correlation:", round(cor(data_penalty$net_yards, data_penalty$penaltyYards), digits = 2))) 
## `geom_smooth()` using formula = 'y ~ x'

Here we can see a pretty strong correlation between penatly yards, which we are treating as the explanatory variable, and the net yards gained, which is being treated as the response variable. There’s some pretty strong clustering around the 15, 10, and 5 yard penalties., which makes sense considering most of the pre-snap penalties are going to be for the same offense (offsides, encrochment and false start).There are a handful of data points in between these numbers, but I suspect these are likely due to penalties near the goal line which result in a “half the distance” penalty.

When we calculate pearsons’s correlation coefficent, we get a value of -0.68. This number makes sense looking at the graph, because we can see a strong negative slope in the dot plot (this is emphasized by adding the regression line). However,the clustering around specific yardage gives me pause to draw any specific conclusions about penalty yards and net yards gained. There are just as many dots above the regression line at each cluster as there are below it. We can use confidence intervals to dive in deeper.

Confidence Intervals

To start off, we can build a confidence interval for the whole of the net_yards variable, and then we can look at the individual penalty yardage situations. (Note: the above analysis was done only on plays that had a penalty, and the below confidence interval is being constructed for all plays. This is done since we are interested in determining if having specific yardage penalties is statistically different than any given play.)

no_fs_data <- data %>%
  filter(penaltyYards != 5)

yard_mean <- mean(no_fs_data$net_yards)
se <- sd(no_fs_data$net_yards)
upper_95 <- yard_mean + se * 1.96
lower_95 <- yard_mean - se * 1.96

ggplot(data = no_fs_data, aes(x = net_yards)) + 
   stat_function(fun = dnorm, args = list(mean = yard_mean, sd = se), color = "blue") + 
  geom_vline(xintercept = lower_95) + 
  annotate("text", x = lower_95, y = 0.03, label = round(lower_95, digits = 2), hjust = 1.5) + 
  geom_vline(xintercept = upper_95) + 
  annotate("text", x = lower_95, y = 0.03, label = round(upper_95, digits = 2), hjust = -4) + 

  theme_ap(family = "sans") + 
   labs(x = "Net Yards Gained",
        y = "",
    title = "Sampling Distribution of Net Yards") 

Looking at this plot, we can say with 95% confidence that the true mean of yards gained lies between -15.18 and 21.33. so a pretty wide confidence band. Now we want to look at this distribution in comparison with the false start penalties (a 5 yard infraction).

fs_data <- data %>%
  filter(penaltyYards == 5)

fs_yard_mean <- mean(fs_data$net_yards)
fs_se <- sd(fs_data$net_yards)
fs_upper_95 <- fs_yard_mean + fs_se * 1.96
fs_lower_95 <- fs_yard_mean - fs_se * 1.96

ggplot(data = fs_data, aes(x = net_yards)) + 
   stat_function(fun = dnorm, args = list(mean = fs_yard_mean, sd = se), color = "#C43D49") + 
  geom_vline(xintercept = fs_lower_95) + 
  annotate("text", x = fs_lower_95, y = 0.03, label = round(fs_lower_95, digits = 2), hjust = -1) + 
  geom_vline(xintercept = fs_upper_95) + 
  annotate("text", x = fs_lower_95, y = 0.03, label = round(fs_upper_95, digits = 2), hjust = -14) + 

  theme_ap(family = "sans") + 
   labs(x = "Net Yards Gained",
        y = "",
    title = "Sampling Distribution of Net Yards \n /w 15 Yard Penalty") 

Here we can see the distribution of the net yards gained when there is a false start penatly pre-snap. We can say with 95% confidence that the true mean is between -20 and 16.48. The next step is to plot the two distributions on top of each other.

ggplot(data = fs_data, aes(x = net_yards)) + 
  stat_function(fun = dnorm, args = list(mean = fs_yard_mean, sd = fs_se), color = "#C43D49") + 
  geom_vline(xintercept = fs_lower_95, color = "#C43D49") + 
  geom_vline(xintercept = fs_upper_95, color = "#C43D49") + 


  stat_function(fun = dnorm, args = list(mean = yard_mean, sd = se), color =  "#669900") + 
  geom_vline(xintercept = lower_95, color =  "#669900") + 
  geom_vline(xintercept = upper_95, color =  "#669900") + 
  theme_ap(family = "sans") + 
   labs(x = "Net Yards Gained",
        y = "",
    title = "Sampling Distribution of Net Yards") 

Here we can see the interval for plays without the false start penalty (in green) is overlapping substantially with the interval for plays with the penalty. The means are obviously different, but despite the conventional wisdom that false starts are drive killers, there’s no evidence here that there is any difference in the average net yard gained with or without the false start.

Lead and Win Probability

It’s not suprising that the likelihood of a team winning any given game will be correlated with the lead, but here we look to see if the quarter has any effect on that correlation. For example does the lead and win probability change together more tightly in the third quarter than the second.

qtr <- data %>% 
  mutate(home_score_delta = preSnapHomeScore - preSnapVisitorScore) %>%
  filter(quarter != 5)

qtr %>%
  group_by(quarter) %>%
  summarise(cor(home_score_delta, preSnapHomeTeamWinProbability))
## # A tibble: 4 Ă— 2
##   quarter `cor(home_score_delta, preSnapHomeTeamWinProbability)`
##     <int>                                                  <dbl>
## 1       1                                                  0.564
## 2       2                                                  0.870
## 3       3                                                  0.903
## 4       4                                                  0.888
ggplot(data = qtr, aes(x = home_score_delta, y = preSnapHomeTeamWinProbability)) + 
  geom_point(aes(color = quarter)) + 
  facet_wrap(~ quarter) + 
  # geom_smooth(method = "lm") +
  theme_ap(family = "sans") + 
  labs(x = "Lead",
        y = "Win Probability",
    title = "Lead and Win Probability by Quarter")

Here we can see that in the first quarter, the win probability is loosely correlated with the lead, but in subsequent quarters it becomes much tighter. This intuitively makes sense, since as there is less time for the opposing team to catch up, but the interesting thing is that the correlation between lead and win probability doesn’t change a whole lot from the second quarter to the third quarter. This suggests that the first quarter is a little fuzy, but the win percentage as a funciton of lead pretty much locks in further into the game.