Reproducible Standard Scores from Grantland.com using R Markdown

One of the better Super Bowl stories published recently was A Tale of Two Cities by Bill Barnwell on January 27th, 2014. The article uses standard scores to analyze how the Denver offense and Seattle defense match up against some of the great teams going back to 1940. In his words, after describing Denver's record 606 points this year:

The problem with those raw numbers is that we’re playing in an era of particularly high scoring. NFL teams averaged 23.4 points per game this season, more than any other year in league history. And when you look at the list of teams with the most points in a single season, four of the top six seasons belong to teams from 2011 (the Packers and Saints), 2012 (Patriots), or 2013. Those were great offenses, but the best way to judge an offense is by what it did in relation to its peers.

Within the story he presents two tables showing the top 10 teams in Points Scored and Points Allowed from 1940 - 2013 based on standard score and adding an adjustment for era based on 2013, the highest scoring season in NFL history. I thought it would be a fun exercise to recreate the tables using publicly accessible data and make the whole process reproducible.

For historical data I used the files accessible from this site. To begin:

In the code below we'll grab the 2013 file separately (it's not in the .zip file). Also note this analysis only looks back to 1978, and the article looks at data back to 1940. The reason for this is that I've used these files before and they're in any easily accessible format for analysis. Also 19 of the 20 teams in the two tables (the one exception: the 1970 Minnesota Vikings in Points Allowed) were teams from 1978-forward, so close enough for this exercise. Let's get into it:

## import files make sure you've saved the 1978 - 2012 files in folder called
## nfllines under your current working directory

# grab the 2013 file, write to nfllines folder to simplify 'for loop'
df <- read.csv("http://www.repole.com/sun4cast/stats/nfl2013lines.csv", stringsAsFactors = FALSE)
write.table(df, file = paste0(getwd(), "/nfllines/nfl2013lines.csv"), row.names = FALSE, 
    quote = FALSE, sep = ",")

# create vector of csv file names and years for each file
scores_list <- grep(".csv", list.files((paste0(getwd(), "/nfllines"))), value = TRUE)

# load dplyr package for manipulations within the for loop
library(dplyr)

# loop through each file name and read csv file
i <- 1
for (i in i:length(scores_list)) {
    df <- read.csv(paste0(getwd(), "/nfllines/", scores_list[i]), stringsAsFactors = FALSE)

    # each file has one row for each game, split into home/visitor data frames
    # then combine into one for easier manipulation
    df1 <- select(df, Visitor, Visitor.Score, Home.Score)
    df2 <- select(df, Home.Team, Home.Score, Visitor.Score)

    # change column names for rbind
    names(df1) <- c("Team", "PF", "PA")
    names(df2) <- c("Team", "PF", "PA")

    # combine data frames
    df <- rbind_list(df1, df2)

    # count number of games for season, generally 16 games but 1982 strike
    # shortened season and older 14 game schedules (pre-1978) need to be
    # accounted for
    num_games <- df %.% group_by(Team) %.% tally()

    # use game count for first team since all should be equal
    num_games <- num_games[1, 2]

    # create total season points data frame for offense and defense
    df <- group_by(df, Team)
    off <- summarise(df, Score = sum(PF))
    def <- summarise(df, Score = sum(PA))

    # get standard deviations for points for and points allowed, then mean of
    # points by team (mean will be same for points scored and allowed, standard
    # deviations will not)
    pts_names <- list(off, def)
    stdev <- lapply(pts_names, function(x) sd(x$Score))
    avg <- mean(off$Score)

    # calculate StdScores
    off$StdScore <- (off$Score - avg)/stdev[[1]]
    def$StdScore <- (def$Score - avg)/stdev[[2]]

    # add year from file name
    off$Year <- as.numeric(substr(scores_list[i], 4, 7))
    def$Year <- as.numeric(substr(scores_list[i], 4, 7))

    # add PPG
    off$PPG <- round(off$Score/num_games, 1)
    def$PPG <- round(def$Score/num_games, 1)

    # create or append to 1978 - 2013 data frames
    if (i == 1) {
        pts_for <- off
        pts_allow <- def
    } else {
        pts_for <- rbind_list(pts_for, off)
        pts_allow <- rbind_list(pts_allow, def)
    }
}

At this point the two data frames (pts_for and pts_allow) have most of what we need:

summary(pts_for)
##      Team               Score        StdScore           Year     
##  Length:1073        Min.   :113   Min.   :-2.482   Min.   :1978  
##  Class :character   1st Qu.:281   1st Qu.:-0.712   1st Qu.:1987  
##  Mode  :character   Median :325   Median :-0.055   Median :1997  
##                     Mean   :328   Mean   : 0.000   Mean   :1996  
##                     3rd Qu.:375   3rd Qu.: 0.625   3rd Qu.:2005  
##                     Max.   :606   Max.   : 3.320   Max.   :2013  
##       PPG      
##  Min.   : 8.8  
##  1st Qu.:17.9  
##  Median :20.7  
##  Mean   :20.9  
##  3rd Qu.:23.7  
##  Max.   :37.9
summary(pts_allow)
##      Team               Score        StdScore           Year     
##  Length:1073        Min.   :128   Min.   :-2.704   Min.   :1978  
##  Class :character   1st Qu.:287   1st Qu.:-0.653   1st Qu.:1987  
##  Mode  :character   Median :327   Median :-0.080   Median :1997  
##                     Mean   :328   Mean   : 0.000   Mean   :1996  
##                     3rd Qu.:370   3rd Qu.: 0.655   3rd Qu.:2005  
##                     Max.   :533   Max.   : 3.196   Max.   :2013  
##       PPG      
##  Min.   :10.3  
##  1st Qu.:18.4  
##  Median :20.7  
##  Mean   :20.9  
##  3rd Qu.:23.5  
##  Max.   :33.3

The last piece is adding points per game and the adjustments for era:

# clean up workspace a little
rm(df, df1, df2, i, off, def, avg, stdev, pts_names)

## add 2013 adjustments recreate 2013 standard deviations
off2013 <- subset(pts_for, pts_for$Year == 2013)
def2013 <- subset(pts_allow, pts_allow$Year == 2013)

# repeat certain steps for mean and standard deviations for just 2013
pts_names <- list(off2013, def2013)
stdev2013 <- lapply(pts_names, function(x) sd(x$Score))
avg2013 <- mean(off2013$Score)

# calculate adjusted fields
pts_for$Score_Adj <- round(pts_for$StdScore * stdev2013[[1]] + avg2013)
pts_for$PPG_Adj <- round(pts_for$Score_Adj/16, 1)

pts_allow$Score_Adj <- round(pts_allow$StdScore * stdev2013[[2]] + avg2013)
pts_allow$PPG_Adj <- round(pts_allow$Score_Adj/16, 1)

# clean up everything but two final data frames
rm(def2013, off2013, stdev2013, avg2013, pts_names, scores_list)

# re-order data frames from best-to-worst bases on StdScore
pts_for <- pts_for[order(-pts_for$StdScore), ]
pts_allow <- pts_allow[order(pts_allow$StdScore), ]

And the final table views:

# show final tables in similar format to story
pts_for[1:10, c("Year", "Team", "Score", "PPG", "StdScore", "Score_Adj", "PPG_Adj")]
##      Year                 Team Score  PPG StdScore Score_Adj PPG_Adj
## 1046 2013       Denver Broncos   606 37.9    3.320       606    37.9
## 678  2001        St Louis Rams   503 31.4    3.095       590    36.9
## 453  1994  San Francisco 49ers   505 31.6    3.074       589    36.8
## 871  2007 New England Patriots   589 36.8    3.027       586    36.6
## 597  1999        St Louis Rams   526 32.9    2.858       574    35.9
## 436  1993  San Francisco 49ers   473 29.6    2.807       570    35.6
## 544  1997       Denver Broncos   472 29.5    2.697       563    35.2
## 144  1983  Washington Redskins   541 33.8    2.637       558    34.9
## 575  1998    Minnesota Vikings   556 34.8    2.611       557    34.8
## 1035 2012 New England Patriots   557 34.8    2.598       556    34.8
pts_allow[1:10, c("Year", "Team", "Score", "PPG", "StdScore", "Score_Adj", "PPG_Adj")]
##      Year                 Team Score  PPG StdScore Score_Adj PPG_Adj
## 699  2002 Tampa Bay Buccaneers   196 12.2   -2.704       198    12.4
## 845  2006     Baltimore Ravens   201 12.6   -2.704       198    12.4
## 496  1995   Kansas City Chiefs   241 15.1   -2.577       206    12.9
## 474  1994     Cleveland Browns   204 12.8   -2.346       221    13.8
## 198  1985        Chicago Bears   198 12.4   -2.293       225    14.1
## 225  1986        Chicago Bears   187 11.7   -2.285       225    14.1
## 657  2000     Baltimore Ravens   165 10.3   -2.217       230    14.4
## 293  1988        Chicago Bears   215 13.4   -2.201       231    14.4
## 1065 2013     Seattle Seahawks   231 14.4   -2.199       231    14.4
## 811  2005        Chicago Bears   202 12.6   -2.161       233    14.6

One thing that jumps out is the Points Allowed table, in the article it has 2006 Baltimore at the top while this analysis has 2002 Tampa Bay. In both sets the two teams are extremely close. In this view their standard scores look exactly the same, but without the rounded view I can see where Tampa Bay's score is -2.704378 and 2006 Baltimore is -2.703853. The article has Tampa Bay at -2.70 and Baltimore at -2.71. My guess is some sort of rounding error or a score in one of our data sources is off ever so slightly, and if I had to guess my source would be the culprit. Regardless, my intention isn't to double check work, rather show a reproducible example of some analysis from a great writer who I've been reading since his Football Outsiders days.

Lastly here's a plot of each team's points per game average (each point on the plot is a team's PPG for that year) and line showing the league average points per game (red line).

# create means and standard deviations for graphs
pts_for <- group_by(pts_for, Year)
pf_mean_sd <- summarise(pts_for, PPG = mean(PPG))
pf_stdev <- summarise(pts_for, stdev = sd(PPG))
pf_mean_sd$stdev <- pf_stdev$stdev

library(ggplot2)
p <- ggplot(pts_for, aes(x = factor(Year), y = PPG)) + geom_point(alpha = 7/10) + 
    theme(axis.text.x = element_text(angle = 90), axis.title.x = element_blank()) + 
    geom_line(data = pf_mean_sd, aes(x = factor(Year), y = PPG, group = 1), 
        size = 1, colour = "#F8766D") + ggtitle("NFL Points Per Game for each Team, 1978 - 2013")
p

plot of chunk unnamed-chunk-5