In this lab, we’ll explore NFL play-by-play data, looking at penalties, the choice to run or pass, and the success of offenses, defenses and passers.
Please answer the numbered questions on a separate sheet.

Loading the Data and good old dplyr

n <- read.csv('/home/rstudioshared/shared_files/data/NFLPlaybyPlay2015.csv')
library(dplyr)

It’s always a good idea to start by simply looking at the data. Here are three ways:

head(n)
summary(n)
View(n)

Penalties

In Whistle Swallowing we read that, due to omission bias, we may see fewer penalties late in close games. Let’s explore the data to see whether it supports that claim.

First, here is the total number of plays by quarter:

n %>% group_by(qtr) %>% summarize(length(qtr))

Q1: Which quarters see more plays? Why?

Now, let’s add a new column to our data to note whether a penalty took place. We’ll use the PenaltyType column. Notice that when no penalty took place, PenaltyType takes the value NA, R’s marker for a missing (“not available”) value. Put another way, a penalty took place whenever PenaltyType is not NA. We can write this in code as follows (note that “!” means “not” and that we are overwriting our original data frame, n):

n <- n %>% mutate(Penalty = !is.na(PenaltyType))
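
To see why this works, here is a small sketch on a made-up vector (the values are invented for illustration and are not taken from the data set). Because R treats TRUE as 1 and FALSE as 0, summing a logical vector counts the penalties, and taking its mean gives the proportion of plays with a penalty:

```r
# Hypothetical penalty types for five plays; NA means no penalty was called
penalty_type <- c(NA, "False Start", NA, "Offensive Holding", NA)

# TRUE wherever a penalty type was recorded
penalty <- !is.na(penalty_type)

penalty        # FALSE  TRUE FALSE  TRUE FALSE
sum(penalty)   # number of plays with a penalty: 2
mean(penalty)  # proportion of plays with a penalty: 0.4
```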

Now, we can get the number of penalties by quarter:

n %>%  
  group_by(qtr) %>% 
  summarize(num = length(qtr), penalties = sum(Penalty)) %>% 
  arrange(qtr)

We might be most interested in what proportion of the plays are penalties:

n %>%  
  group_by(qtr) %>% 
  summarize(num = length(qtr), penalty_rate = mean(Penalty)) %>% 
  arrange(qtr)

Q2: Are there fewer penalties in the 4th quarter?

It might be instructive to look at what types of penalties take place in greater numbers in each quarter. Let’s create a data frame with those numbers and then take a look at it.

penalties_by_quarter_type <- n %>% filter(Penalty==TRUE) %>% 
  group_by(PenaltyType, qtr) %>% 
  summarize(num = length(qtr)) %>% 
  arrange(PenaltyType, qtr)

View(penalties_by_quarter_type)

Q3: Which quarters have the most defensive pass interference calls? Neutral zone infractions? Offensive holding calls? False starts?

Now, let’s try to take the game situation, namely how close the game is, into account. The column AbsScoreDiff has the absolute value of the difference in team scores. Let’s break this into 7-point chunks using the cut function.

n <- n %>% mutate(diff_group = cut(AbsScoreDiff, seq(0, 42, 7), include.lowest=TRUE))
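
It can help to try cut on a few made-up score differences first (these values are invented for illustration). With these arguments, each chunk includes its upper endpoint, include.lowest=TRUE puts 0 into the first chunk, and anything above 42 becomes NA:

```r
# Hypothetical absolute score differences
score_diff <- c(0, 3, 7, 8, 14, 21, 42, 45)

# Same breaks as in the lab: 0, 7, 14, ..., 42
groups <- cut(score_diff, seq(0, 42, 7), include.lowest = TRUE)

as.character(groups)
# "[0,7]" "[0,7]" "[0,7]" "(7,14]" "(7,14]" "(14,21]" "(35,42]" NA
```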

Now, we can find the penalty rate split by how close the game is:

n %>% filter(!is.na(diff_group)) %>%
  group_by(diff_group) %>% 
  summarize(num_plays = length(Penalty), penalty_rate = mean(Penalty)) %>% 
  arrange(diff_group)

Q4: Are fewer penalties called in closer games or in games where one team is well ahead?

Let’s split the data by both quarter and closeness. This will be a larger table, so once again we’ll assign it to a data frame and then view it.

penalties_by_quarter_score <- n %>% filter(!is.na(diff_group)) %>%
  group_by(qtr, diff_group) %>% 
  summarize(num_plays = length(Penalty), penalty_rate = mean(Penalty)) %>% 
  arrange(qtr, diff_group)

View(penalties_by_quarter_score)

Q5: Does our data support the claim in Whistle Swallowing about omission bias? Are there additional calculations we should make?

We might also be interested in whether penalties are handed out to the offensive or defensive team. Let’s split up penalties based on whether the penalized team was the defensive team and find the total number of penalty yards:

n %>% filter(!is.na(PenalizedTeam)) %>% 
  group_by(DefensiveTeam==PenalizedTeam) %>% summarize(sum(Penalty.Yards))

We could also look at which side of the ball gets penalized more on rushing attempts and on passing attempts. Notice that when looking at passing attempts we include sacks.

n %>% filter(PlayType=="Pass" | PlayType=="Sack") %>% 
  group_by(DefensiveTeam==PenalizedTeam) %>% summarize(sum(Penalty.Yards))

n %>% filter(PlayType=="Run") %>% 
  group_by(RushAttempt, DefensiveTeam==PenalizedTeam) %>% summarize(sum(Penalty.Yards))

To Run or to Pass

Are runs and passes equally effective? Should teams pass more? Perhaps we can use this data set to find out.

Let’s look at the average number of yards gained on run plays versus pass plays. We’ll count sacks as pass plays, although this unfortunately means that a designed quarterback run stopped behind the line of scrimmage is also counted as a sack. Take a long look at the following code until you understand how it works:

n %>% filter(PlayType %in% c("Run","Pass", "Sack")) %>% group_by(PlayType=="Run") %>% 
  summarize(num_plays=length(Fumble), yds_per_play=mean(Yards.Gained))
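
If the grouping step is hard to follow, the sketch below runs the same kind of pipeline on a tiny made-up data frame (the plays are invented, and the names toy, toy_summary and is_run are only for illustration). The expression PlayType == "Run" is TRUE for runs and FALSE for passes and sacks, so group_by splits the plays into those two groups; the punt is dropped by filter:

```r
library(dplyr)

# Six hypothetical plays
toy <- data.frame(
  PlayType     = c("Run", "Pass", "Sack", "Run", "Punt", "Pass"),
  Yards.Gained = c(4, 12, -6, 2, 0, 0)
)

toy_summary <- toy %>%
  filter(PlayType %in% c("Run", "Pass", "Sack")) %>%  # keep runs, passes and sacks
  group_by(is_run = PlayType == "Run") %>%            # FALSE = pass or sack, TRUE = run
  summarize(num_plays = length(Yards.Gained),
            yds_per_play = mean(Yards.Gained))

toy_summary
# is_run = FALSE: 3 plays, 2 yards per play
# is_run = TRUE:  2 plays, 3 yards per play
```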

Q6: How does the mean number of yards per pass compare to the mean number of yards per run?

Perhaps mean yards is the wrong metric. Let’s look at the median yards and the percentage of plays with positive yardage:

n %>% filter(PlayType %in% c("Run","Pass", "Sack")) %>% group_by(PlayType=="Run") %>% 
  summarize(num_plays=length(Fumble), median_yds_per_play=median(Yards.Gained), positive_rate = mean(Yards.Gained>0))

Maybe these metrics don’t tell us what we want either. Sometimes analysts categorize plays as simply successful or unsuccessful. On first down, a play is considered a success if it gains at least 45 percent of the needed yards; on second down, it needs to gain at least 60 percent; on third or fourth down, only gaining a new first down counts as a success.

Let’s code each play according to whether it was successful under this definition. Recall that “|” stands for “or” and “&” stands for “and”, and that the ifelse function takes a value to assign if the statement is true and a second value to assign if it is false. Our code assigns the value 1 if the play is successful and 0 if it is not:

n <- n %>% mutate(successful_play = ifelse(
  (down == 1 & Yards.Gained >= .45*ydstogo) | (down == 2 & Yards.Gained >= .6*ydstogo) | (down >= 3 & Yards.Gained >= ydstogo),
  1,0))
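
As a quick check of the rule, here is the same ifelse logic applied to six made-up plays (the data frame plays is invented for illustration). For example, 5 yards on first-and-10 clears the 4.5 yards needed, while 4 yards does not:

```r
# Six hypothetical plays: down, yards needed for a first down, yards gained
plays <- data.frame(
  down         = c(1, 1, 2, 2, 3, 4),
  ydstogo      = c(10, 10, 5, 5, 2, 1),
  Yards.Gained = c(5, 4, 4, 2, 2, 0)
)

# Same success rule as in the lab, written with base R
plays$successful_play <- with(plays, ifelse(
  (down == 1 & Yards.Gained >= .45 * ydstogo) |
    (down == 2 & Yards.Gained >= .6 * ydstogo) |
    (down >= 3 & Yards.Gained >= ydstogo),
  1, 0))

plays$successful_play        # 1 0 1 0 1 0
mean(plays$successful_play)  # success rate: 0.5
```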

Now, we can look at the success rate of run and pass plays:

n %>% filter(PlayType %in% c("Run","Pass", "Sack") & !is.na(successful_play)) %>% group_by(PlayType=="Run") %>% 
  summarize(num_plays=length(Fumble), successful_play_rate=mean(successful_play))

We could dig a little deeper and look at the success rate of run and pass plays, split by down:

n %>% filter(PlayType %in% c("Run","Pass", "Sack") & !is.na(successful_play)) %>% group_by(down, PlayType=="Run") %>% 
  summarize(num_plays=length(Fumble), successful_play_rate=mean(successful_play)) %>%
  arrange(down)

Q7: Are run plays or pass plays more likely to be successful on first down? On third down?

Perhaps there’s a simple explanation for this. Let’s look at the number of yards to go (for a first down) in each of those categories:

n %>% filter(PlayType %in% c("Run","Pass", "Sack") & !is.na(successful_play)) %>% group_by(down, PlayType=="Run") %>% 
  summarize(num_plays=length(Fumble), successful_play_rate=mean(successful_play), mean_to_go=mean(ydstogo)) %>%
  arrange(down)

Q8: How does the mean number of yards to go help explain the success rates?

Best Offensive and Defensive Units and Best Passers

First, let’s look at the offenses with the best success rates on running plays:

n %>% filter(PlayType=="Run" & !is.na(successful_play)) %>% 
  group_by(posteam) %>% 
  summarize(num_plays=length(Fumble), successful_play_rate=mean(successful_play)) %>%
  arrange(desc(successful_play_rate))

And the defenses with the best success rates against the run, meaning those that allow the lowest offensive success rate:

n %>% filter(PlayType=="Run" & !is.na(successful_play)) %>% 
  group_by(DefensiveTeam) %>% 
  summarize(num_plays=length(Fumble), successful_play_rate=mean(successful_play)) %>%
  arrange(successful_play_rate)

Q9: Find the offense that has the most success when passing and the defense that has the most success against the pass.

Finally, let’s put this all together to look at the passing game for every team:

passing_stats <- n %>% filter(PlayType %in% c("Pass", "Sack") & !is.na(successful_play)) %>%
  group_by(posteam) %>%
  summarize(num_plays=length(Fumble), int_rate=mean(InterceptionThrown), mean_yds=mean(Yards.Gained),
            median_yds=median(Yards.Gained), success_rate=mean(successful_play), sack_rate=mean(Sack),
            fumble_rate=mean(Fumble), td_rate=mean(Touchdown))

View(passing_stats)

Q10: Which team do you think had the best passing game in 2015? Why?