I got started using R to analyze baseball statistics. It is good to see that American football can benefit from statistical analysis as well.
The data set for this exercise can be found here.
If you click on “Sample Data (Week 1 - 8 2015)”, choose PLAY.csv, which is renamed nfl_plays.csv in the code below. As the dim() output shows, this is a large file: 22,347 rows and 30 columns. It classifies every play in every NFL game in the first half of the 2015 season.
Once you have downloaded the file and located its path, open the file and examine the data. It’s generally good to have an idea of the data’s dimensions, structure, and headings before getting started.
nfldata <- read.csv('C:\\Users\\Steven\\Desktop\\Data Files Misc\\nfl_plays.csv', header = TRUE)
dim(nfldata)
## [1] 22347 30
str(nfldata)
## 'data.frame': 22347 obs. of 30 variables:
## $ gid : int 3990 3990 3990 3990 3990 3990 3990 3990 3990 3990 ...
## $ pid : int 652367 652368 652369 652370 652371 652372 652373 652374 652375 652376 ...
## $ off : Factor w/ 32 levels "ARI","ATL","BAL",..: 25 25 25 25 25 25 25 25 25 25 ...
## $ def : Factor w/ 32 levels "ARI","ATL","BAL",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ type: Factor w/ 8 levels "CONV","FGXP",..: 3 8 6 8 6 8 6 4 8 6 ...
## $ dseq: int 0 1 2 3 4 5 6 6 7 8 ...
## $ len : int 6 35 17 38 44 37 45 27 25 44 ...
## $ qtr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ min : int 15 15 14 14 13 12 12 11 10 10 ...
## $ sec : int 0 0 21 4 26 42 5 20 53 28 ...
## $ ptso: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ptsd: int 0 0 0 0 0 0 0 0 0 0 ...
## $ timo: int 3 3 3 3 3 3 3 3 3 3 ...
## $ timd: int 3 3 3 3 3 3 3 3 3 3 ...
## $ dwn : int 0 1 1 2 1 1 1 2 2 3 ...
## $ ytg : int 0 10 10 1 10 10 10 18 28 22 ...
## $ yfog: int 0 20 38 47 51 65 76 68 58 64 ...
## $ zone: int 0 1 2 3 3 4 4 4 3 4 ...
## $ fd : int 0 1 0 1 1 1 0 0 0 0 ...
## $ sg : int 0 0 0 0 0 1 0 0 0 1 ...
## $ nh : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pts : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tck : int 0 1 1 1 1 1 0 0 1 1 ...
## $ sk : int 0 0 0 0 0 0 1 0 0 0 ...
## $ pen : int 0 0 0 0 0 0 0 1 0 0 ...
## $ ints: int 0 0 0 0 0 0 0 0 0 0 ...
## $ fum : int 0 0 0 0 0 0 0 0 0 0 ...
## $ saf : int 0 0 0 0 0 0 0 0 0 0 ...
## $ blk : int 0 0 0 0 0 0 0 0 0 0 ...
## $ olid: int 0 1854 1854 1854 1854 1854 1854 1854 1854 1854 ...
head(nfldata)
## gid pid off def type dseq len qtr min sec ptso ptsd timo timd dwn
## 1 3990 652367 PIT NE KOFF 0 6 1 15 0 0 0 3 3 0
## 2 3990 652368 PIT NE RUSH 1 35 1 15 0 0 0 3 3 1
## 3 3990 652369 PIT NE PASS 2 17 1 14 21 0 0 3 3 1
## 4 3990 652370 PIT NE RUSH 3 38 1 14 4 0 0 3 3 2
## 5 3990 652371 PIT NE PASS 4 44 1 13 26 0 0 3 3 1
## 6 3990 652372 PIT NE RUSH 5 37 1 12 42 0 0 3 3 1
## ytg yfog zone fd sg nh pts tck sk pen ints fum saf blk olid
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 10 20 1 1 0 0 0 1 0 0 0 0 0 0 1854
## 3 10 38 2 0 0 0 0 1 0 0 0 0 0 0 1854
## 4 1 47 3 1 0 0 0 1 0 0 0 0 0 0 1854
## 5 10 51 3 1 0 0 0 1 0 0 0 0 0 0 1854
## 6 10 65 4 1 1 0 0 1 0 0 0 0 0 0 1854
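The str() output above lists off, def, and type as factors. That was the read.csv() default before R 4.0.0; newer versions return text columns as character unless stringsAsFactors = TRUE is set. Here is a minimal sketch of the difference, assuming a throwaway CSV whose rows are invented for illustration and not taken from the data set:

```r
# Toy stand-in for nfl_plays.csv; the rows are invented for illustration.
toy <- data.frame(
  off  = c("PIT", "PIT", "NE"),
  type = c("KOFF", "RUSH", "PASS"),
  yfog = c(0L, 20L, 38L)
)
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# stringsAsFactors = TRUE gives factor columns on any R version.
as_factor <- read.csv(tmp, header = TRUE, stringsAsFactors = TRUE)
# stringsAsFactors = FALSE keeps text columns as plain character vectors.
as_text <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)

class(as_factor$type)  # "factor"
class(as_text$type)    # "character"
dim(as_factor)         # 3 rows, 3 columns
unlink(tmp)
```

Either form works for the steps that follow; the choice mainly affects how str() reports the columns.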
We are interested in learning more about passing and rushing plays. We want to see which plays teams call in certain situations. In this case, when and where do they tend to call a pass play? When and where do they tend to rely on rushing?
A pivot table is a good way to organize this information. First, we will create a basic pivot table that counts plays by field position, play type, and down. We can see from its structure that it is a large table, so let’s just check the first few rows to start.
library(plyr)
playcomp <- ddply(nfldata, .(yfog, type, dwn), .fun = summarize, playsran = length(pid))
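# %in% tests each row for membership in the set of play types. Using == here would
# recycle c('PASS', 'RUSH') down the rows and silently drop some matching rows,
# so expect more rows from %in% than a == filter returns.
playcomp <- subset(playcomp, yfog != 0 & type %in% c('PASS', 'RUSH'))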
str(playcomp)
## 'data.frame': 356 obs. of 4 variables:
## $ yfog : int 1 1 1 2 2 2 3 3 3 4 ...
## $ type : Factor w/ 8 levels "CONV","FGXP",..: 6 6 8 6 6 8 6 6 8 6 ...
## $ dwn : int 1 3 3 1 3 2 1 3 2 1 ...
## $ playsran: int 5 3 3 8 1 3 6 3 6 6 ...
head(playcomp)
## yfog type dwn playsran
## 7 1 PASS 1 5
## 9 1 PASS 3 3
## 12 1 RUSH 3 3
## 15 2 PASS 1 8
## 17 2 PASS 3 1
## 20 2 RUSH 2 3
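When a filter should keep several play types at once, the choice between == and %in% matters. The sketch below counts plays per group with base R's aggregate() (a stand-in for the ddply() call, not the same function) and then compares the two filters. The toy data frame and its values are hypothetical.

```r
# Toy plays; invented values, not taken from nfl_plays.csv.
toy <- data.frame(
  yfog = c(20, 20, 20, 35, 35, 35, 35, 50),
  type = c("PASS", "PASS", "RUSH", "RUSH", "PASS", "PUNT", "RUSH", "PASS"),
  dwn  = c(1, 1, 1, 2, 2, 4, 2, 3)
)

# Count plays per (yfog, type, dwn) combination by summing a column of ones.
counts <- aggregate(playsran ~ yfog + type + dwn,
                    data = transform(toy, playsran = 1L),
                    FUN = sum)
# Sort by the grouping columns so the row order is explicit.
counts <- counts[order(counts$yfog, counts$type, counts$dwn), ]

# %in% checks every row against the whole set of play types.
keep_in <- subset(counts, type %in% c("PASS", "RUSH"))

# == recycles c("PASS", "RUSH") down the rows (PASS, RUSH, PASS, ...),
# so a row survives only if it lines up with the recycled value.
keep_eq <- subset(counts, type == c("PASS", "RUSH"))

nrow(keep_in)  # 5: every PASS and RUSH group
nrow(keep_eq)  # 3: two matching groups were dropped
```

R gives no warning here because the number of rows is a multiple of two, which makes the dropped rows easy to miss.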
We can tidy up the table with the knitr package and its kable() function. We’ll stick with the first six rows again. If you like the results and want the entire table, just use kable(playcomp).
library(knitr)
kable(head(playcomp))
| | yfog | type | dwn | playsran |
|---|---|---|---|---|
| 7 | 1 | PASS | 1 | 5 |
| 9 | 1 | PASS | 3 | 3 |
| 12 | 1 | RUSH | 3 | 3 |
| 15 | 2 | PASS | 1 | 8 |
| 17 | 2 | PASS | 3 | 1 |
| 20 | 2 | RUSH | 2 | 3 |
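The first column in the table above holds the data frame's row names, which are left over from the subset step. If they are not wanted, kable() accepts a row.names argument. A minimal sketch, assuming a small hypothetical data frame shaped like playcomp:

```r
library(knitr)

# Toy rows shaped like playcomp; values are invented for illustration.
toy <- data.frame(
  yfog = c(1, 1, 2),
  type = c("PASS", "RUSH", "PASS"),
  dwn = c(1, 3, 1),
  playsran = c(5, 3, 8),
  row.names = c("7", "12", "15")
)

# Default: non-sequential row names are printed as a leading column.
with_ids <- kable(toy)
# row.names = FALSE keeps only the four data columns.
without_ids <- kable(toy, row.names = FALSE)

with_ids
without_ids
```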
Tables are good, but graphics are better. Here, the four panels plot the number of pass and rush plays against field position (yfog), one panel per down. The down is shown in the strip across the top of each panel.
library(ggplot2)
# yfog counts yards from the offense's own goal line, so larger values are closer to scoring
plot1 <- qplot(data = playcomp, x = yfog, xlab = 'Yards from Own Goal', y = playsran,
               ylim = c(0, 100), ylab = 'No. of Plays',
               color = as.factor(type))
plot1 + facet_wrap(~ dwn) + scale_color_discrete(name = 'Play Type', labels = c('Pass', 'Rush'))
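qplot() has been deprecated since ggplot2 3.4.0, so on a recent install the same chart may be easier to maintain in the full ggplot() form. The sketch below is one way to write it, assuming a hypothetical toy data frame shaped like playcomp; the values are invented.

```r
library(ggplot2)

# Toy counts shaped like playcomp; values are invented for illustration.
toy <- data.frame(
  yfog = c(20, 20, 45, 45, 80, 80),
  type = c("PASS", "RUSH", "PASS", "RUSH", "PASS", "RUSH"),
  dwn = c(1, 1, 2, 2, 3, 3),
  playsran = c(12, 15, 9, 7, 6, 2)
)

# One point per (yfog, type, dwn) group, colored by play type, one panel per down.
p <- ggplot(toy, aes(x = yfog, y = playsran, color = type)) +
  geom_point() +
  facet_wrap(~ dwn) +
  scale_color_discrete(name = "Play Type", labels = c("Pass", "Rush")) +
  labs(x = "Field Position (yfog)", y = "No. of Plays") +
  ylim(0, 100)

p
```

Swapping toy for playcomp should give the same layout as the qplot() version.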