Exploring NFL Data with GGPLOT2

Our last homework assignment focused on using lattice to make informative graphs for the GapMinder Data. For this week, we will explore GGPLOT2 using a data set of our choosing. Because I am football nut, I decided that I could make my homework fun by exploring NFL data from 2010-2012. Nathan Brixius has generously posted the raw data here.

Jenny Bryan has graciously helped organize and consolidate the data. The R code and new data files can be found here.

My focus for this assignment is to gain proficiency using GGPLOT2. Therefore, most of the graphs displayed may seem obvious and uninteresting but I hope to have more time to explore the data in future assignments. I limited my analysis to the WR data for simplicity.

Before, getting started I noticed that for some of the data files there was white space in front the player or team name. Therefore, I had to do some additional preprocessing steps before beginning my analysis. I used the trim function from

library(ggplot2)
library(plyr)
library(xtable)
wrDat <- read.csv("data/WR.csv")
wrDat <- trim(wrDat)
str(wrDat)

## 'data.frame':    614 obs. of  25 variables:
##  $ Name    : Factor w/ 297 levels "A.J. Green","A.J. Jenkins",..: 30 256 248 115 226 8 102 191 44 267 ...
##  $ Team    : Factor w/ 32 levels "ARI","ATL","BAL",..: 10 2 14 12 25 13 16 1 11 32 ...
##  $ G       : int  16 16 16 16 16 13 16 16 16 16 ...
##  $ Rec     : int  77 115 111 76 60 86 72 90 77 93 ...
##  $ Targets : int  153 179 176 124 99 138 133 173 137 145 ...
##  $ Rec.Yds : int  1448 1389 1355 1265 1257 1216 1162 1137 1120 1115 ...
##  $ Rec.YG  : num  90.5 86.8 84.7 79.1 78.6 93.5 72.6 71.1 70 69.7 ...
##  $ Rec.Avg : num  18.8 12.1 12.2 16.6 21 14.1 16.1 12.6 14.5 12 ...
##  $ Lng     : int  71 46 50 86 56 60 75 41 87 56 ...
##  $ YAC     : num  2.9 3.3 3.9 5.7 6.3 4 5.6 2.5 4.6 5.4 ...
##  $ Rec.1stD: int  72 73 72 52 48 59 55 59 57 61 ...
##  $ Rec.TD  : int  11 10 6 12 10 8 15 6 12 6 ...
##  $ KR      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ KR.Yds  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ KR.Avg  : num  NA NA 0 NA NA 0 NA 0 0 NA ...
##  $ KR.Long : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ KR.TD   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PR      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PR.Yds  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PR.Avg  : num  NA NA 0 NA NA 0 NA 0 0 NA ...
##  $ PR.Long : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PR.TD   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fum     : int  0 1 1 2 1 1 1 0 1 3 ...
##  $ FumL    : int  0 1 1 0 0 0 0 0 0 2 ...
##  $ year    : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...

printTable(head(wrDat))

Name	Team	G	Rec	Targets	Rec.Yds	Rec.YG	Rec.Avg	Lng	YAC	Rec.1stD	Rec.TD	KR.Avg	PR.Avg	Fum	FumL	year
Brandon Lloyd	DEN	16	77	153	1448	90.50	18.80	71	2.90	72	11			0	0	2010
Roddy White	ATL	16	115	179	1389	86.80	12.10	46	3.30	73	10			1	1	2010
Reggie Wayne	IND	16	111	176	1355	84.70	12.20	50	3.90	72	6	0.00	0.00	1	1	2010
Greg Jennings	GNB	16	76	124	1265	79.10	16.60	86	5.70	52	12			2	0	2010
Mike Wallace	PIT	16	60	99	1257	78.60	21.00	56	6.30	48	10			1	0	2010
Andre Johnson	HOU	13	86	138	1216	93.50	14.10	60	4.00	59	8	0.00	0.00	1	0	2010

Targets vs Receiving Yards

The first basic question I will examine is how do the number of targets affect total receiving yards for a wide receiver?

Note: There is no 2011 data for the number of targets for each WR

wrDat$year <- as.factor(wrDat$year)
ggplot(subset(wrDat, year != 2011), aes(x = Targets, y = Rec.Yds, color = year)) + 
    geom_point(shape = 1) + geom_smooth(method = lm, se = F)

plot of chunk unnamed-chunk-4

This is somewhat interesting although not unexpected. It appears that there is a linear relationship between targets and receiving yards. This is extremely useful information for fantasy football. If you want to draft or start a reliable wide receiver, look for the one who receives the most targets on their team.

MostTgts <- function(x) {
    wr <- which.max(x$Targets)
    best <- as.character(x$Name[wr])
    names(best) <- "MostTgts"
    return(best)
}

MostTargets <- ddply(subset(wrDat, year == 2012), ~Team, MostTgts)
printTable(MostTargets)

Team	MostTgts
ARI	Larry Fitzgerald
ATL	Roddy White
BAL	Anquan Boldin
BUF	Stevie Johnson
CAR	Steve Smith
CHI	Brandon Marshall
CIN	A.J. Green
CLE	Josh Gordon
DAL	Dez Bryant
DEN	Demaryius Thomas
DET	Calvin Johnson
GNB	Randall Cobb
HOU	Andre Johnson
IND	Reggie Wayne
JAC	Justin Blackmon
KAN	Dwayne Bowe
MIA	Brian Hartline
MIN	Percy Harvin
NOR	Marques Colston
NWE	Wes Welker
NYG	Victor Cruz
NYJ	Jeremy Kerley
OAK	Denarius Moore
PHI	Jeremy Maclin
PIT	Mike Wallace
SDG	Malcom Floyd
SEA	Sidney Rice
SFO	Michael Crabtree
STL	Danny Amendola
TAM	Vincent Jackson
TEN	Kendall Wright
WAS	Joshua Morgan

While some of the names on this list are obvious (Calvin Johnson & AJ Green), if I hadlooked at this prior to this year's draf,t I might have had an opportunity to make some late round value picks (Justin Blackmon, Brian Hartline, Kendall Wright & Jeremy Kerley).

Are teams that spread the ball around more successful?

Aaron Rodgers, Drew Brees and Peyton Manning are notorious for spreading the ball around. One week, Jordy Nelson could be Rodger's favorite target and the next it's all about Randall Cobb. Is there anything interesting about the distribution of receiving touchdowns on an NFL Team?

ggplot(wrDat, aes(x = Team, y = Rec.TD, fill = Team)) + geom_boxplot(alpha = 0.4) + 
    theme(legend.position = "none")

plot of chunk unnamed-chunk-6

I like this graph because you can learn a lot from it. You can see what teams score a lot of passing touch and also the distribution of passing touchdowns among receivers on that team. You can see the obvious affect that Rodgers and Brees have on their receivers but you can also see a possible reason why Miami, Kansas City and Cleveland have been struggling over the past year.

Maybe these teams actual throw quite a bit but prefer to run ball once they get into the end zone. It might be better to look at the same plot in terms of receiving yardage but this time use a violin plot.

ggplot(wrDat, aes(x = Team, y = Rec.Yds, fill = Team)) + geom_violin(alpha = 0.4) + 
    theme(legend.position = "none")

plot of chunk unnamed-chunk-7

Hmm..St Louis is interesting. Perhaps that led to their decision to draft Tavon Austin in the first round?

TotalTD <- ddply(wrDat, ~Team, summarize, Rec_TDs = sum(Rec.TD))

ggplot(TotalTD, aes(x = Team, y = Rec_TDs, fill = Team)) + geom_bar(stat = "identity") + 
    theme(legend.position = "none")

plot of chunk unnamed-chunk-8


TotalYD <- ddply(wrDat, ~Team, summarize, Rec_YDs = sum(Rec.Yds))
ggplot(TotalYD, aes(x = Team, y = Rec_YDs, fill = Team)) + geom_bar(stat = "identity") + 
    theme(legend.position = "none")

plot of chunk unnamed-chunk-8

Note: I plan on putting these graphs in descending order later on

Lastly, let's quickly examine the relationship between year and number of receiving first downs with a stripplot.

ggplot(wrDat, aes(x = year, y = Rec.1stD, color = Team)) + geom_jitter()

plot of chunk unnamed-chunk-9

This one is still a work in progress. I would like to overlay the mean receiving 1st Downs by year and then compare it against 1st Downs by running backs.

So far, we have simply been analyzing teams based on their receiving stats. However, this is bias because tight ends are having an increased effect on the receiving game. We are simply not accounting for the impact of players like Rob Gronkowski, Jimmy Graham and Tony Gonzalez. For teams like, New England it drastically affects how the two above bar graphs look. This is definitely something to come back to at a later time in addition to analyzing the impact of the other key positions.