## NFL Play Analysis in R

I first started using R to analyze baseball statistics, so it is good to see that American football can benefit from statistical analysis as well.

The data set for this exercise can be found here.

If you click on “Sample Data (Week 1 - 8 2015)”, choose PLAY.csv, which is renamed nfl_plays.csv in the code below. The dim() output shows this is a large file: 22,347 rows and 30 columns. It classifies all plays in all NFL games in the first half of the 2015 season.

And here is the code!

#### Open File and Examine Data

Once you have downloaded the file and located its path, open it and examine the data. It’s generally good to have an idea of the data’s dimensions, structure, and headings before getting started.

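# Adjust the path to your own download location.
# stringsAsFactors = TRUE reproduces the Factor columns in the str() output on R 4.0 and later,
# where read.csv() no longer converts strings to factors by default.
nfldata <- read.csv('C:\\Users\\Steven\\Desktop\\Data Files Misc\\nfl_plays.csv', header = TRUE, stringsAsFactors = TRUE)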
dim(nfldata)
## [1] 22347    30
str(nfldata)
## 'data.frame':    22347 obs. of  30 variables:
##  $ gid : int  3990 3990 3990 3990 3990 3990 3990 3990 3990 3990 ...
##  $ pid : int  652367 652368 652369 652370 652371 652372 652373 652374 652375 652376 ...
##  $ off : Factor w/ 32 levels "ARI","ATL","BAL",..: 25 25 25 25 25 25 25 25 25 25 ...
##  $ def : Factor w/ 32 levels "ARI","ATL","BAL",..: 19 19 19 19 19 19 19 19 19 19 ...
##  $ type: Factor w/ 8 levels "CONV","FGXP",..: 3 8 6 8 6 8 6 4 8 6 ...
##  $ dseq: int  0 1 2 3 4 5 6 6 7 8 ...
##  $ len : int  6 35 17 38 44 37 45 27 25 44 ...
##  $ qtr : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ min : int  15 15 14 14 13 12 12 11 10 10 ...
##  $ sec : int  0 0 21 4 26 42 5 20 53 28 ...
##  $ ptso: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ptsd: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ timo: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ timd: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ dwn : int  0 1 1 2 1 1 1 2 2 3 ...
##  $ ytg : int  0 10 10 1 10 10 10 18 28 22 ...
##  $ yfog: int  0 20 38 47 51 65 76 68 58 64 ...
##  $ zone: int  0 1 2 3 3 4 4 4 3 4 ...
##  $ fd  : int  0 1 0 1 1 1 0 0 0 0 ...
##  $ sg  : int  0 0 0 0 0 1 0 0 0 1 ...
##  $ nh  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pts : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tck : int  0 1 1 1 1 1 0 0 1 1 ...
##  $ sk  : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ pen : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ ints: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fum : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ saf : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ blk : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ olid: int  0 1854 1854 1854 1854 1854 1854 1854 1854 1854 ...
head(nfldata)
##    gid    pid off def type dseq len qtr min sec ptso ptsd timo timd dwn
## 1 3990 652367 PIT  NE KOFF    0   6   1  15   0    0    0    3    3   0
## 2 3990 652368 PIT  NE RUSH    1  35   1  15   0    0    0    3    3   1
## 3 3990 652369 PIT  NE PASS    2  17   1  14  21    0    0    3    3   1
## 4 3990 652370 PIT  NE RUSH    3  38   1  14   4    0    0    3    3   2
## 5 3990 652371 PIT  NE PASS    4  44   1  13  26    0    0    3    3   1
## 6 3990 652372 PIT  NE RUSH    5  37   1  12  42    0    0    3    3   1
##   ytg yfog zone fd sg nh pts tck sk pen ints fum saf blk olid
## 1   0    0    0  0  0  0   0   0  0   0    0   0   0   0    0
## 2  10   20    1  1  0  0   0   1  0   0    0   0   0   0 1854
## 3  10   38    2  0  0  0   0   1  0   0    0   0   0   0 1854
## 4   1   47    3  1  0  0   0   1  0   0    0   0   0   0 1854
## 5  10   51    3  1  0  0   0   1  0   0    0   0   0   0 1854
## 6  10   65    4  1  1  0   0   1  0   0    0   0   0   0 1854
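
Before narrowing the data down, it can help to count how many plays of each type there are and how they spread across downs. The following is only a minimal sketch: `toy` is a made-up stand-in for `nfldata` with two of its columns, and the values are illustrative rather than taken from the file.

```r
# Toy stand-in for nfldata; the values are made up for illustration only.
toy <- data.frame(
  type = c("KOFF", "RUSH", "PASS", "RUSH", "PASS", "PUNT"),
  dwn  = c(0, 1, 2, 3, 1, 4)
)

# Count plays of each type, most frequent first.
type_counts <- sort(table(toy$type), decreasing = TRUE)
print(type_counts)

# Cross-tabulate play type (rows) against down (columns).
type_by_down <- table(toy$type, toy$dwn)
print(type_by_down)
```

Running the same two calls on `nfldata$type` and `nfldata$dwn` should give a quick picture of the real file.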

We are interested in learning more about passing and rushing plays. We want to see which plays teams call in certain situations. In this case, when and where do they tend to call a pass play? When and where do they tend to rely on rushing?

#### Use a Pivot Table to Analyze Pass and Rushing Plays by Field Position and Down

A pivot table is a good way to organize this information. First, we will create a basic pivot table that counts plays by field position, play type, and down. We can see from its structure that it is a large table, so let’s just look at the first few rows to start.

library(plyr)
playcomp <- ddply(nfldata, .(yfog, type, dwn), .fun = summarize, playsran = length(pid))
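# Keep only pass and rush plays away from yfog 0. Use %in% rather than ==:
# `type == c('PASS', 'RUSH')` recycles the two values down the rows and silently drops matches.
# The output printed below came from the == version, so a corrected run returns more than 356 rows.
playcomp <- subset(playcomp, yfog != 0 & type %in% c('PASS', 'RUSH'))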
str(playcomp)
## 'data.frame':    356 obs. of  4 variables:
##  $ yfog    : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ type    : Factor w/ 8 levels "CONV","FGXP",..: 6 6 8 6 6 8 6 6 8 6 ...
##  $ dwn     : int  1 3 3 1 3 2 1 3 2 1 ...
##  $ playsran: int  5 3 3 8 1 3 6 3 6 6 ...
head(playcomp)
##    yfog type dwn playsran
## 7     1 PASS   1        5
## 9     1 PASS   3        3
## 12    1 RUSH   3        3
## 15    2 PASS   1        8
## 17    2 PASS   3        1
## 20    2 RUSH   2        3
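
When a filter has to match more than one play type, it matters whether the comparison uses `==` or `%in%`. The sketch below uses a made-up vector of play types (not values from the file) to show how the two behave:

```r
# Made-up play types, for illustration only.
type <- c("PASS", "PASS", "RUSH", "RUSH", "PUNT", "PASS")

# `==` recycles c("PASS", "RUSH") along the vector, so element 1 is compared
# with "PASS", element 2 with "RUSH", element 3 with "PASS", and so on.
by_equals <- type == c("PASS", "RUSH")

# `%in%` asks, for each element, whether it matches either value.
by_in <- type %in% c("PASS", "RUSH")

print(by_equals)
print(by_in)
```

With `==`, a pass play that happens to sit on a row lined up against "RUSH" is dropped, which is why `%in%` is usually the safer choice for this kind of subset.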

We can tidy up the table’s appearance with the kable() function from the knitr package. Again, we’ll show only the first few rows. If you like the results and want the entire table, just use kable(playcomp).

library(knitr)
kable(head(playcomp))
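
|    | yfog | type | dwn | playsran |
|:---|-----:|:-----|----:|---------:|
| 7  |    1 | PASS |   1 |        5 |
| 9  |    1 | PASS |   3 |        3 |
| 12 |    1 | RUSH |   3 |        3 |
| 15 |    2 | PASS |   1 |        8 |
| 17 |    2 | PASS |   3 |        1 |
| 20 |    2 | RUSH |   2 |        3 |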
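
For a layout closer to a spreadsheet pivot table, with downs spread across the columns, base R’s `xtabs()` is one option. This is a minimal sketch, not the method used above: `toy_comp` is a made-up data frame in the same shape as `playcomp`, and its counts are illustrative only.

```r
# Toy counts in the same shape as playcomp; the numbers are made up.
toy_comp <- data.frame(
  yfog     = c(20, 20, 20, 20, 50, 50),
  type     = c("PASS", "PASS", "RUSH", "RUSH", "PASS", "RUSH"),
  dwn      = c(1, 3, 1, 2, 1, 1),
  playsran = c(12, 9, 15, 7, 10, 11)
)

# Sum playsran for each play type (rows) and down (columns),
# collapsing over field position.
wide <- xtabs(playsran ~ type + dwn, data = toy_comp)
print(wide)
```

Combinations that never occur in the data, such as a pass on second down in this toy example, appear as zeros.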

#### Visualize the Data

Tables are good, but graphics are better. Here, the plot has four panels, one per down, and each panel shows the number of pass and rush plays by field position. The down is shown in the strip across the top of each panel.

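library(ggplot2)
# qplot() is deprecated as of ggplot2 3.4.0, so build the same scatter plot with ggplot().
# yfog counts yards from the offense's own goal line, so larger values are closer to scoring.
plot1 <- ggplot(playcomp, aes(x = yfog, y = playsran, color = as.factor(type))) +
  geom_point() +
  xlab('Yards from Own Goal') +
  ylab('No. of Plays') +
  ylim(0, 100)
# One panel per down, with a readable legend for play type
plot1 + facet_wrap(~ dwn) + scale_color_discrete(name = 'Play Type', labels = c('Pass', 'Rush'))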
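
To put a number on what the panels suggest, one option is to compute the share of pass plays on each down. The following is a rough sketch under stated assumptions: `toy_comp` is a made-up data frame in the same shape as `playcomp`, and its counts are illustrative, not results from the 2015 data.

```r
# Made-up counts in the same shape as playcomp, for illustration only.
toy_comp <- data.frame(
  type     = c("PASS", "RUSH", "PASS", "RUSH", "PASS", "RUSH"),
  dwn      = c(1, 1, 2, 2, 3, 3),
  playsran = c(40, 60, 50, 50, 75, 25)
)

# Total plays per down (rows) and play type (columns).
totals <- xtabs(playsran ~ dwn + type, data = toy_comp)

# Pass share of each down: pass plays divided by all plays on that down.
pass_share <- totals[, "PASS"] / rowSums(totals)
print(round(pass_share, 2))
```

Swapping `toy_comp` for `playcomp` should give the corresponding shares for the real data.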