By Srihari Mohan
With the NFL Playoffs underway, the frenzy to predict favorites to advance to the Super Bowl has reached its highest point of the season.
Sabermetrics, a term coined around 1980 by baseball writer Bill James, uses statistical analysis to generate objective insight about baseball. James' works, such as The Bill James Handbook, have created great interest in the use of statistics in sports analysis. Although his work focused on baseball, his methods have spread to other sports, particularly the NFL, where analysts like Nate Silver have adapted Elo ratings (a system originally devised for chess) to predict outcomes of entire seasons based on team results.
Vegas oddsmakers and ESPN analysts have their own opinions on who will advance through the playoffs, and NFL viewership is at an all-time high: CBS News reports that a record 111.5 million viewers tuned in to watch the Seahawks defeat the Broncos 43-8 in Super Bowl 48. With that much interest, I wanted to come up with a better way to quantitatively assess which teams have the most talented offensive and defensive units in football and which team, at least on paper, has the best chance to advance through the playoffs and win the Super Bowl.
I expect that a statistical analysis will yield greater insight into how each team should perform in the playoffs. In fact, ESPN and FiveThirtyEight data analyst Nate Silver even claims that “one blind spot of the analytics movement has been its underappreciation of postseason play” (Silver).
Rather than conduct a formal statistical test, I treated this research project as a data exploration. Specifically, I wanted to come up with a way to assess which positions on offense and defense contribute most to that unit's overall success (e.g., does a running back play a more important role in an offense's production than the receiving corps?). Ultimately, my analysis weighted these results by the average player ratings of each playoff team's starters (from the Madden NFL 25 dataset) to generate scores that could be added together and thus provide a better picture of the overall talent of the offenses and defenses in this year's playoffs.
Although this exploration isn't a traditional statistical test, I can still predict that there will be a strong positive correlation between the strength of a team's quarterback and that offense's overall success (measured by the offense's yards per game from NFL.com). The quarterback handles the ball on every play, so a team whose best player touches the ball most often should post better numbers than a team whose best players touch it less often. On defense, I predict that linebackers will contribute most to the team's defensive success because they play both the run and the pass (and so are more active in a game), whereas the defensive line only rushes the passer and plays the run, and the secondary mostly plays the pass.
I got the dataset of player ratings from the Madden video game (an Excel spreadsheet with multiple tabs) from this link: http://gomadden.operationsports.com/content/madden/madden-25-player-ratings-complete/
The Roster tab of the spreadsheet is where the data is. I saved this worksheet as a CSV file under the name Madden2014.csv.
I had no missing values in the raw data.
Since my analysis only accounts for the starters on each playoff team, I needed to subset the data using a vector of Excel row IDs identifying which players were starters. Similarly, I filtered the data frame so that the only teams left were the 2015 playoff teams (using a vector of team names built with c()).
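One detail worth checking when indexing by Excel row IDs: assuming the header occupies Excel row 1 (an assumption, since the spreadsheet layout isn't described here), Excel row n corresponds to data frame row n - 1 once the file is read with read.csv(). A minimal sketch with a made-up three-player table (the player names, ratings, and the excel_rows vector are all hypothetical):

```r
# Toy stand-in for the roster; names and ratings are made up for illustration.
toy <- data.frame(Name = c("Player A", "Player B", "Player C"),
                  Overall = c(90, 70, 85),
                  stringsAsFactors = FALSE)

# Hypothetical Excel row numbers as read off the spreadsheet's row gutter,
# where row 1 holds the column headers.
excel_rows <- c(2, 4)

# Shift by one so that Excel row 2 maps to the first data row.
starters <- toy[excel_rows - 1, ]
print(starters$Name)
```

Whether the row IDs used in this analysis already account for the header row is not stated, so this is only a check to keep in mind.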
I also needed to build a new data frame (using data.frame()) that holds each team's yards gained per game (offense) and yards given up per game (defense), using the team statistics found on NFL.com. I then created another data frame that grouped corresponding positions together (e.g., free safety and strong safety both counted as just a safety).
I then used a custom WinShares algorithm to weight the impact of each position group's average rating on the offensive and defensive units' production respectively. The WinShares algorithm multiplies each position group's average rating by a position weight and sums the results to generate a net WinShares score per team on offense and on defense (the net WinShares score approximates a ranking of the offensive and defensive units while accounting for the relative influence of each position group on its unit's overall production).
Using ggplot() and facet_grid(), I then plotted my results. The first two graphs are faceted scatterplots of each team's average position group rating against overall unit production, with a linear regression line fit to each facet using method = "lm". The last two graphs are bar plots of each team's offensive and defensive WinShares.
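As a minimal sketch of that weighted sum, the snippet below uses the offensive weights that appear in the analysis code (QB 10, WR 8, OL 6, RB 5). The ratings are hypothetical values chosen only to show the arithmetic, and the win_shares() helper is an illustrative name, not part of the original script.

```r
# Position weights taken from the offense formula in the analysis code.
offense_weights <- c(QB = 10, WR = 8, OL = 6, RB = 5)

# Hypothetical average ratings for one team (illustrative values only).
example_ratings <- c(QB = 90, WR = 85, OL = 80, RB = 75)

# WinShares = sum over position groups of (average rating * weight).
win_shares <- function(ratings, weights) {
  sum(ratings[names(weights)] * weights)
}

example_score <- win_shares(example_ratings, offense_weights)
print(example_score)
```

Here the score works out to 900 + 680 + 480 + 375 = 2435. Because the weights are fixed constants, a one-point change in the quarterback rating moves the score twice as much as a one-point change in the running back rating.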
library(ggplot2)
library(grid)
library(gridExtra)
library(stringr)
library(data.table)
setwd("c:/")
#Raw data
# setInternet2() was Windows-only and is defunct in current R releases; it is not needed to read a local file.
# default.stringsAsFactors() is likewise no longer available, so strings are kept as character here.
rawdata <- read.csv("Madden2014.csv", header = TRUE, stringsAsFactors = FALSE)
#To update the starters from excel row id
startersubset<-c(37,41,42,45,46,48,49,52,53,57,63,64,65,69,73,77,83,85,88,89,96,101,102,108,187,191,
194,195,199,202,206,211,214,216,218,219,225,229,232,234,240,247,248,250,393,396,399,
400,403,407,410,415,417,419,422,425,430,434,438,441,443,447,451,453,456,457,592,594,
595,601,605,615,620,622,624,629,634,639,643,646,651,653,654,658,659,661,664,669,670,
673,674,678,680,685,687,691,693,696,698,700,703,710,711,713,715,720,728,730,734,1229,
1235,1237,1238,1239,1241,1246,1252,1258,1264,1265,1269,1272,1274,1277,1279,1283,1284,
1297,1298,1301,1304,1306,1312,1315,1318,1323,1324,1326,1330,1331,1334,1336,1341,1348,
1349,1354,1358,1361,1362,1368,1370,1373,1375,1381,1382,1385,1391,1396,1398,1399,1401,
1404,1409,1412,1414,1418,1420,1426,1428,1429,1433,1437,1439,1440,1443,1445,1448,1449,
1453,1457,1463,1465,1468,1474,1476,1478,1479,1481,1484,1490,1493,1494,1498,1636,1638,
1640,1641,1646,1650,1655,1658,1661,1663,1666,1672,1677,1678,1682,1685,1688,1689,1693,
1698,1700,1703,1854,1859,1861,1868,1873,1877,1881,1884,1886,1890,1891,1896,1897,1899,
1900,1905,1908,1913,1923,1924,1926,1929,1930,1931,1940,1944,1949,1951,1954,1956,1958,
1965,1970,1974,1977,1979,1980,1984,1990,1997)
madden<-rawdata[row.names(rawdata)[startersubset],]
#Filter playoff teams
playoffteams<- c("Seahawks","Cardinals","Panthers","Cowboys","Packers","Lions","Patriots",
"Bengals","Ravens","Steelers","Broncos","Colts")
madden<-madden[madden$Team %in% playoffteams,]
#Load NFL Team defense and offense stats
offensestatsteam<-c("Saints","Steelers","Colts","Broncos","Eagles","Packers","Cowboys","Falcons","Seahawks",
"Giants","Patriots","Ravens","Redskins","Dolphins","Bengals","Panthers","Texans","Chargers",
"Lions","49ers","Bears","Jets","Browns","Cardinals","Chiefs","Bills","Vikings","Rams","Titans",
"Buccaneers","Jaguars","Raiders")
offenseydsg<-c(411.4,411.1,406.6,402.9,396.8,386.1,383.6,378.2,375.8,367.2,365.5,364.9,358.6,350.1,
348,346.7,344.6,341.6,340.8,327.4,327.1,326.6,324.6,319.8,318.8,318.5,315.5,314.7,
303.7,292,289.6,282.2)
offensestats<-data.frame(Team=offensestatsteam, Yds.G=offenseydsg)
defensestatsteam<-c("Seahawks","Lions","Broncos","Bills","49ers","Jets","Chiefs","Ravens","Chargers",
"Panthers","Colts","Dolphins","Patriots","Vikings","Packers","Texans","Rams","Steelers",
"Cowboys","Redskins","Raiders","Bengals","Browns","Cardinals","Buccaneers","Jaguars",
"Titans","Eagles","Giants","Bears","Saints","Falcons")
defenseydsg<-c(267.1,300.9,305.2,312.2,321.4,327.2,330.5,336.9,338.3,339.8,342.7,343.4,344.1,344.7,
346.4,348.2,351.6,353.4,355.1,357,357.6,359.3,366.1,368.2,368.9,370.8,373,375.6,375.8,
377.1,384,398.2)
defensestats<-data.frame(Team=defensestatsteam,Yds.G=defenseydsg)
#NFL Team Offense and Defense Yds per game Statistics
combinedstats<-defensestats[,c("Team","Yds.G")]
names(combinedstats)[2]="DefenseYds.G"
combinedstats$OffenseYds.G<-offensestats[match(combinedstats$Team,offensestats$Team),c("Yds.G")]
combinedstats<-combinedstats[combinedstats$Team %in% playoffteams,]
#create a new data frame to group the positions
position=c("C","CB","DT","FS","HB","LE","LG","LOLB","LT","MLB","QB","RE","RG","ROLB","RT","SS","TE","WR")
positionGroup=c("OL","CB","DL","S","RB","DL","OL","LB","OL","LB","QB","DL","OL","LB","OL","S","WR","WR")
Unit=c("Offense","Defense","Defense","Defense","Offense","Defense","Offense","Defense","Offense","Defense",
"Offense","Defense","Offense","Defense","Offense","Defense","Offense","Offense")
positionGroupStats=data.frame(position=position, positionGroup=positionGroup,Unit=Unit)
#Defense and Offense Position Grouping
madden$positionGroup<-positionGroupStats[match(madden$Position,positionGroupStats$position),2]
madden$Unit<-positionGroupStats[match(madden$Position,positionGroupStats$position),3]
#Compute mean - by Team,positionGroup
AvgStatsByPosition<-aggregate(madden$Overall,list(team=madden$Team,
unit=madden$Unit,positionGroup=madden$positionGroup),FUN="mean")
names(AvgStatsByPosition)[4] <- "OverallMean"
names(AvgStatsByPosition)[1] <- "Team"
#Round to one decimal but keep the column numeric (format() would turn it into text)
AvgStatsByPosition$OverallMean <- round(AvgStatsByPosition$OverallMean, 1)
#YPG - update the data frame from nflstats
AvgStatsByPosition$TeamOffenseYPG<-combinedstats[match(AvgStatsByPosition$Team,combinedstats$Team),3]
AvgStatsByPosition$TeamDefenseYPG<-combinedstats[match(AvgStatsByPosition$Team,combinedstats$Team),2]
defensesubset <- AvgStatsByPosition[AvgStatsByPosition$unit == "Defense", ]
offensesubset <- AvgStatsByPosition[AvgStatsByPosition$unit == "Offense", ]
#WinShares algorithm
#Offense
QBdata<-offensesubset[offensesubset$positionGroup=="QB",c("Team","OverallMean")]
names(QBdata)[2]="QB"
QBdata$QB<-as.numeric(QBdata$QB)
WRdata<-offensesubset[offensesubset$positionGroup=="WR",c("Team","OverallMean")]
names(WRdata)[2]="WR"
WRdata$WR<-as.numeric(WRdata$WR)
RBdata<-offensesubset[offensesubset$positionGroup=="RB",c("Team","OverallMean")]
names(RBdata)[2]="RB"
RBdata$RB<-as.numeric(RBdata$RB)
OLdata<-offensesubset[offensesubset$positionGroup=="OL",c("Team","OverallMean")]
names(OLdata)[2]="OL"
OLdata$OL<-as.numeric(OLdata$OL)
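#Join on Team rather than relying on every table listing the teams in the same row order
offenseshares <- Reduce(function(a, b) merge(a, b, by = "Team"), list(QBdata, WRdata, RBdata, OLdata))
offenseshares$WinSharesOff <- (offenseshares$QB * 10) + (offenseshares$WR * 8) +
  (offenseshares$OL * 6) + (offenseshares$RB * 5)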
#Defense
CBdata<-defensesubset[defensesubset$positionGroup=="CB",c("Team","OverallMean")]
names(CBdata)[2]="CB"
CBdata$CB<-as.numeric(CBdata$CB)
DLdata<-defensesubset[defensesubset$positionGroup=="DL",c("Team","OverallMean")]
names(DLdata)[2]="DL"
DLdata$DL<-as.numeric(DLdata$DL)
LBdata<-defensesubset[defensesubset$positionGroup=="LB",c("Team","OverallMean")]
names(LBdata)[2]="LB"
LBdata$LB<-as.numeric(LBdata$LB)
Sdata<-defensesubset[defensesubset$positionGroup=="S",c("Team","OverallMean")]
names(Sdata)[2]="S"
Sdata$S<-as.numeric(Sdata$S)
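#Join on Team rather than relying on every table listing the teams in the same row order
defenseshares <- Reduce(function(a, b) merge(a, b, by = "Team"), list(CBdata, DLdata, LBdata, Sdata))
#The column keeps the name WinSharesOff because the plotting code below refers to it by that name
defenseshares$WinSharesOff <- (defenseshares$DL * 6) + (defenseshares$CB * 5) +
  (defenseshares$S * 8) + (defenseshares$LB * 10)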
#create graphics plots
#One colored point per team and position group, plus a linear fit without a confidence band
firstPart1 <- ggplot(defensesubset, aes(as.numeric(OverallMean), TeamDefenseYPG)) +
  geom_point(aes(color = Team), size = 4, na.rm = TRUE) +
  geom_smooth(method = "lm", se = FALSE, na.rm = TRUE)
firstPart2 <- ggplot(offensesubset, aes(as.numeric(OverallMean), TeamOffenseYPG)) +
  geom_point(aes(color = Team), size = 4, na.rm = TRUE) +
  geom_smooth(method = "lm", se = FALSE, na.rm = TRUE)
#Shared styling applied to all four plots
finalPart <- theme(
  axis.text.x = element_text(colour = "slateblue4", size = 12, face = "bold", angle = 90, vjust = 0.5, hjust = 1),
  axis.text.y = element_text(colour = "slateblue4", size = 12, face = "bold"),
  axis.title.x = element_text(colour = "slateblue4", size = 16, face = "bold"),
  axis.title.y = element_text(colour = "slateblue4", size = 16, face = "bold"),
  plot.title = element_text(colour = "slateblue4", face = "bold", size = 20),
  axis.ticks = element_line(colour = "slateblue4"),
  strip.text = element_text(size = 12, face = "bold"),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank()
)
p1<- firstPart1 + facet_grid(positionGroup ~ .)+geom_jitter(na.rm=TRUE)+
xlab("Average Scores")+ylab("Yards Given Up per Game")+
ggtitle("Defense")+theme_bw()+xlim(0, 100)+finalPart
p2<- firstPart2 + facet_grid(positionGroup ~ .)+geom_jitter(na.rm=TRUE)+
xlab("Average Scores")+ylab("Yards Gained per Game")+
ggtitle("Offense" )+theme_bw()+xlim(0, 100)+finalPart
p3<-ggplot(offenseshares, aes(x = Team, y = WinSharesOff)) +
geom_bar(stat = "identity", fill="darkblue",width=0.3)+ xlab("Teams")+ylab("WinShares")+
geom_text(color="black",fontface="bold",aes(label=round(WinSharesOff,2)), vjust = -0.5)
p3<-p3+ggtitle("Offense")+theme_bw()+finalPart
p4<-ggplot(defenseshares, aes(x = Team, y = WinSharesOff)) +
geom_bar(stat = "identity", fill="darkgreen",width=0.3)+ xlab("Teams")+ylab("WinShares")+
geom_text(color="black",fontface="bold",aes(label=round(WinSharesOff,2)), vjust = -0.5)
p4<-p4+ggtitle("Defense")+theme_bw()+finalPart
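#Current gridExtra releases take the overall title via `top`; older versions called this argument `main`
grid.arrange(p1, p2, p4, p3, nrow = 2, ncol = 2,
  top = textGrob("Playoff Teams WinShares", vjust = 1,
    gp = gpar(fontface = "bold", cex = 1.5, col = "slateblue4")))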
The results suggest that on offense, the strength of the quarterback correlates most strongly with the team's offensive production (as evidenced by the most obvious linear pattern). Likewise, the linebackers on defense contributed most to the team's overall defensive production: the linebackers' average ratings had a stronger correlation with “yards per game given up” than those of any other defensive position group. Thus, both of my original predictions were supported by the plots.
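The comparison above rests on reading the faceted plots by eye. One way to put a number on "most obvious linear pattern" is a Pearson correlation per position group. The following is a sketch on made-up data laid out like the offense subset (one row per team and position group); the team names, ratings, and yardage values are all hypothetical, and this calculation is not part of the original analysis.

```r
# Toy data in the same shape as the offense subset. All values are made up.
toy <- data.frame(
  Team = rep(c("Team A", "Team B", "Team C"), times = 2),
  positionGroup = rep(c("QB", "RB"), each = 3),
  OverallMean = c(80, 85, 90, 85, 80, 90),
  TeamOffenseYPG = rep(c(340, 360, 380), times = 2),
  stringsAsFactors = FALSE
)

# Pearson correlation between average rating and yards per game,
# computed separately for each position group.
r_by_group <- sapply(split(toy, toy$positionGroup),
                     function(g) cor(g$OverallMean, g$TeamOffenseYPG))
print(round(r_by_group, 2))
```

In this toy data the quarterback ratings line up perfectly with yardage while the running back ratings do not. With only 12 playoff teams per facet, coefficients computed on the real data would be noisy and are best read as a rough ranking.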
At the same time, it is interesting that on offense, running backs had the least obvious linear relationship with offensive production. In fact, most of the highest-powered offenses in this dataset had running backs with an average rating below 80. The Seahawks have the highest-rated back among the playoff teams (Marshawn Lynch, 96) yet have an average overall offense when compared to the other 11 teams. Similarly, I was surprised that the strength of a team's cornerbacks contributed least to the team's overall defense. Considering that the NFL is a passing league, where strong quarterbacks impact offensive production most, you would assume that the defensive players who guard against the pass would play a critical role in that team's defensive stats; however, the cornerbacks' average ratings had the weakest linear relationship with “yards given up” among the defensive positions analyzed.
Overall, the order of position importance for offensive production is as follows: quarterback, wide receiver, offensive line, running back. On defense, the order is linebacker, safety, defensive line, and cornerback.
The overall WinShares scores on offense and defense were calculated by multiplying each team's average starter rating (by position group) by a position weight and then adding all these results together. The WinShares plots suggest that the Seahawks defense was strongest overall while the Patriots offense was strongest overall. The Seahawks defense had the highest WinShares of any unit (on either offense or defense), which suggests that the Seattle defense is the most dominant individual unit among the playoff teams. The highest combined WinShares (when the offense and defense scores are added together) in the NFC belongs to Seattle at 4935.2. In the AFC, the Ravens surprisingly have the highest combined WinShares at 4909.6. This is surprising considering that the Ravens were the last Wild Card team to get into the playoffs (the lowest seed in the AFC), but my results point to them upsetting the #1 seeded Patriots in the AFC, while NFC favorite and #1 seeded Seattle advances to the Super Bowl.
My analysis points to a Super Bowl 49 contest between the Seattle Seahawks and the Baltimore Ravens, where Seattle is just good enough to beat Baltimore and capture their second consecutive Super Bowl title.
There were several problems that I solved in this research project. To ensure that my method was reproducible, I needed to find a way to filter the raw dataset down to just the starters without altering the original dataset (or creating a new spreadsheet) myself. So, I decided to create a numeric vector that held all of the Excel row IDs for the starters (using row.names() and a vector built with c()). I needed to include only the starters in my analysis because including backups would have skewed all of my results. For example, Denver Broncos quarterback Peyton Manning is arguably the best in the NFL with a rating of 97. Since he plays every game, the Broncos never took the time to invest in a quality backup. If I had instead averaged every quarterback on each playoff team's roster, the average quarterback score for the Broncos would be in the 70s (because of a fantastic starter and a poor backup); however, since Manning played every game, the Broncos quarterback score should be 97. Without this vector of starters, my research project would have failed to produce any meaningful insights, so I'm proud that I was able to solve this problem.
Something that didn't go as well as I wanted in this research project concerned the accuracy of the raw dataset itself. For the most part, the raw data was very accurate, but since it takes the rosters from last year, players with a breakout season this year (like Dallas Cowboys running back and leading NFL rusher DeMarco Murray) had far lower ratings in this research project than they should have. Breakout players like Murray could well have changed my results, so I wish I could have found a better way to factor in the extent to which certain players “broke out” during the season.
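The effect of including backups can be illustrated with a two-row sketch. The starter rating of 97 comes from the text above; the backup rating of 58 is a made-up value picked only so that the roster-wide average lands in the 70s, and the column names are hypothetical.

```r
# One team's quarterbacks. 97 is the starter rating quoted in the text;
# 58 is a hypothetical backup rating used purely for illustration.
qbs <- data.frame(
  Player = c("Starter", "Backup"),
  Overall = c(97, 58),
  Starter = c(TRUE, FALSE)
)

# Averaging the whole depth chart drags the position score down ...
all_qbs_mean <- mean(qbs$Overall)

# ... while restricting to starters keeps the rating of the player who plays.
starter_mean <- mean(qbs$Overall[qbs$Starter])

print(c(all_qbs = all_qbs_mean, starters_only = starter_mean))
```

Under these assumed numbers the roster-wide average is 77.5, compared with 97 when only the starter is counted.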
I'd like to use the WinShares scores generated from my analysis to predict the entire NFL playoffs and see how accurate my statistical model really is. In the Wild Card Round, the WinShares model correctly predicted 3 of the 4 first-round playoff games: Panthers over Cardinals, Colts over Bengals, and the upset pick of Ravens over Steelers. However, my model predicted that the Lions would beat the Cowboys on the road, which didn't happen. I'd like to track how accurately the WinShares model predicts the rest of the playoff schedule. As of now, it predicts that in the Divisional Round the Seahawks will beat the Panthers, the Packers will beat the Cowboys, the Ravens will upset #1 seed New England, and the Broncos will beat the Colts.
Since the first half of the research project concerned finding out how each position (on the offensive and defensive units respectively) affected that unit's overall production, future research on this topic could include using WinShares to predict which positions NFL teams should prioritize in the 2015 NFL draft.
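A sketch of how such picks can be generated, assuming the rule is simply "the team with the higher combined WinShares wins" (the write-up implies this rule but does not spell it out). Only the two combined totals reported earlier, Seattle at 4935.2 and Baltimore at 4909.6, are used; pick_winner() is a hypothetical helper name.

```r
# Combined (offense + defense) WinShares quoted in the text for two teams.
combined <- c(Seahawks = 4935.2, Ravens = 4909.6)

# Assumed pick rule: the team with the higher combined score advances.
# A tie goes to the first team named, which is an arbitrary choice.
pick_winner <- function(team_a, team_b, scores) {
  if (scores[[team_a]] >= scores[[team_b]]) team_a else team_b
}

super_bowl_pick <- pick_winner("Seahawks", "Ravens", combined)
print(super_bowl_pick)
```

The same function could be applied to each matchup in a round once every playoff team's combined score is in the vector.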