Exploring NFL Data with GGPLOT2

Our last homework assignment focused on using lattice to make informative graphs for the GapMinder Data. For this week, we will explore GGPLOT2 using a data set of our choosing. Because I am football nut, I decided that I could make my homework fun by exploring NFL data from 2010-2012. Nathan Brixius has generously posted the raw data here.

Jenny Bryan has graciously helped organize and consolidate the data. The R code and new data files can be found here.

My focus for this assignment is to gain proficiency using GGPLOT2. Therefore, most of the graphs displayed may seem obvious and uninteresting but I hope to have more time to explore the data in future assignments. I limited my analysis to the WR data for simplicity.

Before, getting started I noticed that for some of the data files there was white space in front the player or team name. Therefore, I had to do some additional preprocessing steps before beginning my analysis. I used the trim function from

library(ggplot2)
library(plyr)
library(xtable)
wrDat <- read.csv("data/WR.csv")
wrDat <- trim(wrDat)
str(wrDat)
## 'data.frame':    614 obs. of  25 variables:
##  $ Name    : Factor w/ 297 levels "A.J. Green","A.J. Jenkins",..: 30 256 248 115 226 8 102 191 44 267 ...
##  $ Team    : Factor w/ 32 levels "ARI","ATL","BAL",..: 10 2 14 12 25 13 16 1 11 32 ...
##  $ G       : int  16 16 16 16 16 13 16 16 16 16 ...
##  $ Rec     : int  77 115 111 76 60 86 72 90 77 93 ...
##  $ Targets : int  153 179 176 124 99 138 133 173 137 145 ...
##  $ Rec.Yds : int  1448 1389 1355 1265 1257 1216 1162 1137 1120 1115 ...
##  $ Rec.YG  : num  90.5 86.8 84.7 79.1 78.6 93.5 72.6 71.1 70 69.7 ...
##  $ Rec.Avg : num  18.8 12.1 12.2 16.6 21 14.1 16.1 12.6 14.5 12 ...
##  $ Lng     : int  71 46 50 86 56 60 75 41 87 56 ...
##  $ YAC     : num  2.9 3.3 3.9 5.7 6.3 4 5.6 2.5 4.6 5.4 ...
##  $ Rec.1stD: int  72 73 72 52 48 59 55 59 57 61 ...
##  $ Rec.TD  : int  11 10 6 12 10 8 15 6 12 6 ...
##  $ KR      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ KR.Yds  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ KR.Avg  : num  NA NA 0 NA NA 0 NA 0 0 NA ...
##  $ KR.Long : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ KR.TD   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PR      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PR.Yds  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PR.Avg  : num  NA NA 0 NA NA 0 NA 0 0 NA ...
##  $ PR.Long : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PR.TD   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fum     : int  0 1 1 2 1 1 1 0 1 3 ...
##  $ FumL    : int  0 1 1 0 0 0 0 0 0 2 ...
##  $ year    : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
printTable(head(wrDat))
Name Team G Rec Targets Rec.Yds Rec.YG Rec.Avg Lng YAC Rec.1stD Rec.TD KR KR.Yds KR.Avg KR.Long KR.TD PR PR.Yds PR.Avg PR.Long PR.TD Fum FumL year
Brandon Lloyd DEN 16 77 153 1448 90.50 18.80 71 2.90 72 11 0 0 0 0 0 0 0 0 0 0 2010
Roddy White ATL 16 115 179 1389 86.80 12.10 46 3.30 73 10 0 0 0 0 0 0 0 0 1 1 2010
Reggie Wayne IND 16 111 176 1355 84.70 12.20 50 3.90 72 6 0 0 0.00 0 0 0 0 0.00 0 0 1 1 2010
Greg Jennings GNB 16 76 124 1265 79.10 16.60 86 5.70 52 12 0 0 0 0 0 0 0 0 2 0 2010
Mike Wallace PIT 16 60 99 1257 78.60 21.00 56 6.30 48 10 0 0 0 0 0 0 0 0 1 0 2010
Andre Johnson HOU 13 86 138 1216 93.50 14.10 60 4.00 59 8 0 0 0.00 0 0 0 0 0.00 0 0 1 0 2010

Targets vs Receiving Yards

The first basic question I will examine is how do the number of targets affect total receiving yards for a wide receiver?

Note: There is no 2011 data for the number of targets for each WR

wrDat$year <- as.factor(wrDat$year)
ggplot(subset(wrDat, year != 2011), aes(x = Targets, y = Rec.Yds, color = year)) + 
    geom_point(shape = 1) + geom_smooth(method = lm, se = F)

plot of chunk unnamed-chunk-4

This is somewhat interesting although not unexpected. It appears that there is a linear relationship between targets and receiving yards. This is extremely useful information for fantasy football. If you want to draft or start a reliable wide receiver, look for the one who receives the most targets on their team.

MostTgts <- function(x) {
    wr <- which.max(x$Targets)
    best <- as.character(x$Name[wr])
    names(best) <- "MostTgts"
    return(best)
}

MostTargets <- ddply(subset(wrDat, year == 2012), ~Team, MostTgts)
printTable(MostTargets)
Team MostTgts
ARI Larry Fitzgerald
ATL Roddy White
BAL Anquan Boldin
BUF Stevie Johnson
CAR Steve Smith
CHI Brandon Marshall
CIN A.J. Green
CLE Josh Gordon
DAL Dez Bryant
DEN Demaryius Thomas
DET Calvin Johnson
GNB Randall Cobb
HOU Andre Johnson
IND Reggie Wayne
JAC Justin Blackmon
KAN Dwayne Bowe
MIA Brian Hartline
MIN Percy Harvin
NOR Marques Colston
NWE Wes Welker
NYG Victor Cruz
NYJ Jeremy Kerley
OAK Denarius Moore
PHI Jeremy Maclin
PIT Mike Wallace
SDG Malcom Floyd
SEA Sidney Rice
SFO Michael Crabtree
STL Danny Amendola
TAM Vincent Jackson
TEN Kendall Wright
WAS Joshua Morgan

While some of the names on this list are obvious (Calvin Johnson & AJ Green), if I hadlooked at this prior to this year's draf,t I might have had an opportunity to make some late round value picks (Justin Blackmon, Brian Hartline, Kendall Wright & Jeremy Kerley).

Are teams that spread the ball around more successful?

Aaron Rodgers, Drew Brees and Peyton Manning are notorious for spreading the ball around. One week, Jordy Nelson could be Rodger's favorite target and the next it's all about Randall Cobb. Is there anything interesting about the distribution of receiving touchdowns on an NFL Team?

ggplot(wrDat, aes(x = Team, y = Rec.TD, fill = Team)) + geom_boxplot(alpha = 0.4) + 
    theme(legend.position = "none")

plot of chunk unnamed-chunk-6

I like this graph because you can learn a lot from it. You can see what teams score a lot of passing touch and also the distribution of passing touchdowns among receivers on that team. You can see the obvious affect that Rodgers and Brees have on their receivers but you can also see a possible reason why Miami, Kansas City and Cleveland have been struggling over the past year.

Maybe these teams actual throw quite a bit but prefer to run ball once they get into the end zone. It might be better to look at the same plot in terms of receiving yardage but this time use a violin plot.

ggplot(wrDat, aes(x = Team, y = Rec.Yds, fill = Team)) + geom_violin(alpha = 0.4) + 
    theme(legend.position = "none")

plot of chunk unnamed-chunk-7

Hmm..St Louis is interesting. Perhaps that led to their decision to draft Tavon Austin in the first round?

TotalTD <- ddply(wrDat, ~Team, summarize, Rec_TDs = sum(Rec.TD))

ggplot(TotalTD, aes(x = Team, y = Rec_TDs, fill = Team)) + geom_bar(stat = "identity") + 
    theme(legend.position = "none")

plot of chunk unnamed-chunk-8


TotalYD <- ddply(wrDat, ~Team, summarize, Rec_YDs = sum(Rec.Yds))
ggplot(TotalYD, aes(x = Team, y = Rec_YDs, fill = Team)) + geom_bar(stat = "identity") + 
    theme(legend.position = "none")

plot of chunk unnamed-chunk-8

Note: I plan on putting these graphs in descending order later on

Lastly, let's quickly examine the relationship between year and number of receiving first downs with a stripplot.

ggplot(wrDat, aes(x = year, y = Rec.1stD, color = Team)) + geom_jitter()

plot of chunk unnamed-chunk-9

This one is still a work in progress. I would like to overlay the mean receiving 1st Downs by year and then compare it against 1st Downs by running backs.

So far, we have simply been analyzing teams based on their receiving stats. However, this is bias because tight ends are having an increased effect on the receiving game. We are simply not accounting for the impact of players like Rob Gronkowski, Jimmy Graham and Tony Gonzalez. For teams like, New England it drastically affects how the two above bar graphs look. This is definitely something to come back to at a later time in addition to analyzing the impact of the other key positions.