One of the better Super Bowl stories published recently was A Tale of Two Cities by Bill Barnwell on January 27th, 2014. The article uses standard scores to analyze how the Denver offense and Seattle defense match up against some of the great teams going back to 1940. In his words, after describing Denver's record 606 points this year:
The problem with those raw numbers is that we’re playing in an era of particularly high scoring. NFL teams averaged 23.4 points per game this season, more than any other year in league history. And when you look at the list of teams with the most points in a single season, four of the top six seasons belong to teams from 2011 (the Packers and Saints), 2012 (Patriots), or 2013. Those were great offenses, but the best way to judge an offense is by what it did in relation to its peers.
Within the story he presents two tables showing the top 10 teams in Points Scored and Points Allowed from 1940 - 2013 based on standard score and adding an adjustment for era based on 2013, the highest scoring season in NFL history. I thought it would be a fun exercise to recreate the tables using publicly accessible data and make the whole process reproducible.
For historical data I used the files accessible from this site. To begin:
nfllines.In the code below we'll grab the 2013 file separately (it's not in the .zip file). Also note this analysis only looks back to 1978, and the article looks at data back to 1940. The reason for this is that I've used these files before and they're in any easily accessible format for analysis. Also 19 of the 20 teams in the two tables (the one exception: the 1970 Minnesota Vikings in Points Allowed) were teams from 1978-forward, so close enough for this exercise. Let's get into it:
## import files make sure you've saved the 1978 - 2012 files in folder called
## nfllines under your current working directory
# grab the 2013 file, write to nfllines folder to simplify 'for loop'
df <- read.csv("http://www.repole.com/sun4cast/stats/nfl2013lines.csv", stringsAsFactors = FALSE)
write.table(df, file = paste0(getwd(), "/nfllines/nfl2013lines.csv"), row.names = FALSE,
quote = FALSE, sep = ",")
# create vector of csv file names and years for each file
scores_list <- grep(".csv", list.files((paste0(getwd(), "/nfllines"))), value = TRUE)
# load dplyr package for manipulations within the for loop
library(dplyr)
# loop through each file name and read csv file
i <- 1
for (i in i:length(scores_list)) {
df <- read.csv(paste0(getwd(), "/nfllines/", scores_list[i]), stringsAsFactors = FALSE)
# each file has one row for each game, split into home/visitor data frames
# then combine into one for easier manipulation
df1 <- select(df, Visitor, Visitor.Score, Home.Score)
df2 <- select(df, Home.Team, Home.Score, Visitor.Score)
# change column names for rbind
names(df1) <- c("Team", "PF", "PA")
names(df2) <- c("Team", "PF", "PA")
# combine data frames
df <- rbind_list(df1, df2)
# count number of games for season, generally 16 games but 1982 strike
# shortened season and older 14 game schedules (pre-1978) need to be
# accounted for
num_games <- df %.% group_by(Team) %.% tally()
# use game count for first team since all should be equal
num_games <- num_games[1, 2]
# create total season points data frame for offense and defense
df <- group_by(df, Team)
off <- summarise(df, Score = sum(PF))
def <- summarise(df, Score = sum(PA))
# get standard deviations for points for and points allowed, then mean of
# points by team (mean will be same for points scored and allowed, standard
# deviations will not)
pts_names <- list(off, def)
stdev <- lapply(pts_names, function(x) sd(x$Score))
avg <- mean(off$Score)
# calculate StdScores
off$StdScore <- (off$Score - avg)/stdev[[1]]
def$StdScore <- (def$Score - avg)/stdev[[2]]
# add year from file name
off$Year <- as.numeric(substr(scores_list[i], 4, 7))
def$Year <- as.numeric(substr(scores_list[i], 4, 7))
# add PPG
off$PPG <- round(off$Score/num_games, 1)
def$PPG <- round(def$Score/num_games, 1)
# create or append to 1978 - 2013 data frames
if (i == 1) {
pts_for <- off
pts_allow <- def
} else {
pts_for <- rbind_list(pts_for, off)
pts_allow <- rbind_list(pts_allow, def)
}
}
At this point the two data frames (pts_for and pts_allow) have most of what we need:
summary(pts_for)
## Team Score StdScore Year
## Length:1073 Min. :113 Min. :-2.482 Min. :1978
## Class :character 1st Qu.:281 1st Qu.:-0.712 1st Qu.:1987
## Mode :character Median :325 Median :-0.055 Median :1997
## Mean :328 Mean : 0.000 Mean :1996
## 3rd Qu.:375 3rd Qu.: 0.625 3rd Qu.:2005
## Max. :606 Max. : 3.320 Max. :2013
## PPG
## Min. : 8.8
## 1st Qu.:17.9
## Median :20.7
## Mean :20.9
## 3rd Qu.:23.7
## Max. :37.9
summary(pts_allow)
## Team Score StdScore Year
## Length:1073 Min. :128 Min. :-2.704 Min. :1978
## Class :character 1st Qu.:287 1st Qu.:-0.653 1st Qu.:1987
## Mode :character Median :327 Median :-0.080 Median :1997
## Mean :328 Mean : 0.000 Mean :1996
## 3rd Qu.:370 3rd Qu.: 0.655 3rd Qu.:2005
## Max. :533 Max. : 3.196 Max. :2013
## PPG
## Min. :10.3
## 1st Qu.:18.4
## Median :20.7
## Mean :20.9
## 3rd Qu.:23.5
## Max. :33.3
The last piece is adding points per game and the adjustments for era:
# clean up workspace a little
rm(df, df1, df2, i, off, def, avg, stdev, pts_names)
## add 2013 adjustments recreate 2013 standard deviations
off2013 <- subset(pts_for, pts_for$Year == 2013)
def2013 <- subset(pts_allow, pts_allow$Year == 2013)
# repeat certain steps for mean and standard deviations for just 2013
pts_names <- list(off2013, def2013)
stdev2013 <- lapply(pts_names, function(x) sd(x$Score))
avg2013 <- mean(off2013$Score)
# calculate adjusted fields
pts_for$Score_Adj <- round(pts_for$StdScore * stdev2013[[1]] + avg2013)
pts_for$PPG_Adj <- round(pts_for$Score_Adj/16, 1)
pts_allow$Score_Adj <- round(pts_allow$StdScore * stdev2013[[2]] + avg2013)
pts_allow$PPG_Adj <- round(pts_allow$Score_Adj/16, 1)
# clean up everything but two final data frames
rm(def2013, off2013, stdev2013, avg2013, pts_names, scores_list)
# re-order data frames from best-to-worst bases on StdScore
pts_for <- pts_for[order(-pts_for$StdScore), ]
pts_allow <- pts_allow[order(pts_allow$StdScore), ]
And the final table views:
# show final tables in similar format to story
pts_for[1:10, c("Year", "Team", "Score", "PPG", "StdScore", "Score_Adj", "PPG_Adj")]
## Year Team Score PPG StdScore Score_Adj PPG_Adj
## 1046 2013 Denver Broncos 606 37.9 3.320 606 37.9
## 678 2001 St Louis Rams 503 31.4 3.095 590 36.9
## 453 1994 San Francisco 49ers 505 31.6 3.074 589 36.8
## 871 2007 New England Patriots 589 36.8 3.027 586 36.6
## 597 1999 St Louis Rams 526 32.9 2.858 574 35.9
## 436 1993 San Francisco 49ers 473 29.6 2.807 570 35.6
## 544 1997 Denver Broncos 472 29.5 2.697 563 35.2
## 144 1983 Washington Redskins 541 33.8 2.637 558 34.9
## 575 1998 Minnesota Vikings 556 34.8 2.611 557 34.8
## 1035 2012 New England Patriots 557 34.8 2.598 556 34.8
pts_allow[1:10, c("Year", "Team", "Score", "PPG", "StdScore", "Score_Adj", "PPG_Adj")]
## Year Team Score PPG StdScore Score_Adj PPG_Adj
## 699 2002 Tampa Bay Buccaneers 196 12.2 -2.704 198 12.4
## 845 2006 Baltimore Ravens 201 12.6 -2.704 198 12.4
## 496 1995 Kansas City Chiefs 241 15.1 -2.577 206 12.9
## 474 1994 Cleveland Browns 204 12.8 -2.346 221 13.8
## 198 1985 Chicago Bears 198 12.4 -2.293 225 14.1
## 225 1986 Chicago Bears 187 11.7 -2.285 225 14.1
## 657 2000 Baltimore Ravens 165 10.3 -2.217 230 14.4
## 293 1988 Chicago Bears 215 13.4 -2.201 231 14.4
## 1065 2013 Seattle Seahawks 231 14.4 -2.199 231 14.4
## 811 2005 Chicago Bears 202 12.6 -2.161 233 14.6
One thing that jumps out is the Points Allowed table, in the article it has 2006 Baltimore at the top while this analysis has 2002 Tampa Bay. In both sets the two teams are extremely close. In this view their standard scores look exactly the same, but without the rounded view I can see where Tampa Bay's score is -2.704378 and 2006 Baltimore is -2.703853. The article has Tampa Bay at -2.70 and Baltimore at -2.71. My guess is some sort of rounding error or a score in one of our data sources is off ever so slightly, and if I had to guess my source would be the culprit. Regardless, my intention isn't to double check work, rather show a reproducible example of some analysis from a great writer who I've been reading since his Football Outsiders days.
Lastly here's a plot of each team's points per game average (each point on the plot is a team's PPG for that year) and line showing the league average points per game (red line).
# create means and standard deviations for graphs
pts_for <- group_by(pts_for, Year)
pf_mean_sd <- summarise(pts_for, PPG = mean(PPG))
pf_stdev <- summarise(pts_for, stdev = sd(PPG))
pf_mean_sd$stdev <- pf_stdev$stdev
library(ggplot2)
p <- ggplot(pts_for, aes(x = factor(Year), y = PPG)) + geom_point(alpha = 7/10) +
theme(axis.text.x = element_text(angle = 90), axis.title.x = element_blank()) +
geom_line(data = pf_mean_sd, aes(x = factor(Year), y = PPG, group = 1),
size = 1, colour = "#F8766D") + ggtitle("NFL Points Per Game for each Team, 1978 - 2013")
p