Our last homework assignment focused on using lattice to make informative graphs for the GapMinder Data. For this week, we will explore GGPLOT2 using a data set of our choosing. Because I am football nut, I decided that I could make my homework fun by exploring NFL data from 2010-2012. Nathan Brixius has generously posted the raw data here.
Jenny Bryan has graciously helped organize and consolidate the data. The R code and new data files can be found here.
My focus for this assignment is to gain proficiency using GGPLOT2. Therefore, most of the graphs displayed may seem obvious and uninteresting but I hope to have more time to explore the data in future assignments. I limited my analysis to the WR data for simplicity.
Before, getting started I noticed that for some of the data files there was white space in front the player or team name. Therefore, I had to do some additional preprocessing steps before beginning my analysis. I used the trim function from
library(ggplot2)
library(plyr)
library(xtable)
wrDat <- read.csv("data/WR.csv")
wrDat <- trim(wrDat)
str(wrDat)
## 'data.frame': 614 obs. of 25 variables:
## $ Name : Factor w/ 297 levels "A.J. Green","A.J. Jenkins",..: 30 256 248 115 226 8 102 191 44 267 ...
## $ Team : Factor w/ 32 levels "ARI","ATL","BAL",..: 10 2 14 12 25 13 16 1 11 32 ...
## $ G : int 16 16 16 16 16 13 16 16 16 16 ...
## $ Rec : int 77 115 111 76 60 86 72 90 77 93 ...
## $ Targets : int 153 179 176 124 99 138 133 173 137 145 ...
## $ Rec.Yds : int 1448 1389 1355 1265 1257 1216 1162 1137 1120 1115 ...
## $ Rec.YG : num 90.5 86.8 84.7 79.1 78.6 93.5 72.6 71.1 70 69.7 ...
## $ Rec.Avg : num 18.8 12.1 12.2 16.6 21 14.1 16.1 12.6 14.5 12 ...
## $ Lng : int 71 46 50 86 56 60 75 41 87 56 ...
## $ YAC : num 2.9 3.3 3.9 5.7 6.3 4 5.6 2.5 4.6 5.4 ...
## $ Rec.1stD: int 72 73 72 52 48 59 55 59 57 61 ...
## $ Rec.TD : int 11 10 6 12 10 8 15 6 12 6 ...
## $ KR : int 0 0 0 0 0 0 0 0 0 0 ...
## $ KR.Yds : int 0 0 0 0 0 0 0 0 0 0 ...
## $ KR.Avg : num NA NA 0 NA NA 0 NA 0 0 NA ...
## $ KR.Long : int 0 0 0 0 0 0 0 0 0 0 ...
## $ KR.TD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PR : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PR.Yds : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PR.Avg : num NA NA 0 NA NA 0 NA 0 0 NA ...
## $ PR.Long : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PR.TD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Fum : int 0 1 1 2 1 1 1 0 1 3 ...
## $ FumL : int 0 1 1 0 0 0 0 0 0 2 ...
## $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
printTable(head(wrDat))
| Name | Team | G | Rec | Targets | Rec.Yds | Rec.YG | Rec.Avg | Lng | YAC | Rec.1stD | Rec.TD | KR | KR.Yds | KR.Avg | KR.Long | KR.TD | PR | PR.Yds | PR.Avg | PR.Long | PR.TD | Fum | FumL | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brandon Lloyd | DEN | 16 | 77 | 153 | 1448 | 90.50 | 18.80 | 71 | 2.90 | 72 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2010 | ||
| Roddy White | ATL | 16 | 115 | 179 | 1389 | 86.80 | 12.10 | 46 | 3.30 | 73 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2010 | ||
| Reggie Wayne | IND | 16 | 111 | 176 | 1355 | 84.70 | 12.20 | 50 | 3.90 | 72 | 6 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 1 | 1 | 2010 |
| Greg Jennings | GNB | 16 | 76 | 124 | 1265 | 79.10 | 16.60 | 86 | 5.70 | 52 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2010 | ||
| Mike Wallace | PIT | 16 | 60 | 99 | 1257 | 78.60 | 21.00 | 56 | 6.30 | 48 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2010 | ||
| Andre Johnson | HOU | 13 | 86 | 138 | 1216 | 93.50 | 14.10 | 60 | 4.00 | 59 | 8 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 1 | 0 | 2010 |
The first basic question I will examine is how do the number of targets affect total receiving yards for a wide receiver?
Note: There is no 2011 data for the number of targets for each WR
wrDat$year <- as.factor(wrDat$year)
ggplot(subset(wrDat, year != 2011), aes(x = Targets, y = Rec.Yds, color = year)) +
geom_point(shape = 1) + geom_smooth(method = lm, se = F)
This is somewhat interesting although not unexpected. It appears that there is a linear relationship between targets and receiving yards. This is extremely useful information for fantasy football. If you want to draft or start a reliable wide receiver, look for the one who receives the most targets on their team.
MostTgts <- function(x) {
wr <- which.max(x$Targets)
best <- as.character(x$Name[wr])
names(best) <- "MostTgts"
return(best)
}
MostTargets <- ddply(subset(wrDat, year == 2012), ~Team, MostTgts)
printTable(MostTargets)
| Team | MostTgts |
|---|---|
| ARI | Larry Fitzgerald |
| ATL | Roddy White |
| BAL | Anquan Boldin |
| BUF | Stevie Johnson |
| CAR | Steve Smith |
| CHI | Brandon Marshall |
| CIN | A.J. Green |
| CLE | Josh Gordon |
| DAL | Dez Bryant |
| DEN | Demaryius Thomas |
| DET | Calvin Johnson |
| GNB | Randall Cobb |
| HOU | Andre Johnson |
| IND | Reggie Wayne |
| JAC | Justin Blackmon |
| KAN | Dwayne Bowe |
| MIA | Brian Hartline |
| MIN | Percy Harvin |
| NOR | Marques Colston |
| NWE | Wes Welker |
| NYG | Victor Cruz |
| NYJ | Jeremy Kerley |
| OAK | Denarius Moore |
| PHI | Jeremy Maclin |
| PIT | Mike Wallace |
| SDG | Malcom Floyd |
| SEA | Sidney Rice |
| SFO | Michael Crabtree |
| STL | Danny Amendola |
| TAM | Vincent Jackson |
| TEN | Kendall Wright |
| WAS | Joshua Morgan |
While some of the names on this list are obvious (Calvin Johnson & AJ Green), if I hadlooked at this prior to this year's draf,t I might have had an opportunity to make some late round value picks (Justin Blackmon, Brian Hartline, Kendall Wright & Jeremy Kerley).
Aaron Rodgers, Drew Brees and Peyton Manning are notorious for spreading the ball around. One week, Jordy Nelson could be Rodger's favorite target and the next it's all about Randall Cobb. Is there anything interesting about the distribution of receiving touchdowns on an NFL Team?
ggplot(wrDat, aes(x = Team, y = Rec.TD, fill = Team)) + geom_boxplot(alpha = 0.4) +
theme(legend.position = "none")
I like this graph because you can learn a lot from it. You can see what teams score a lot of passing touch and also the distribution of passing touchdowns among receivers on that team. You can see the obvious affect that Rodgers and Brees have on their receivers but you can also see a possible reason why Miami, Kansas City and Cleveland have been struggling over the past year.
Maybe these teams actual throw quite a bit but prefer to run ball once they get into the end zone. It might be better to look at the same plot in terms of receiving yardage but this time use a violin plot.
ggplot(wrDat, aes(x = Team, y = Rec.Yds, fill = Team)) + geom_violin(alpha = 0.4) +
theme(legend.position = "none")
Hmm..St Louis is interesting. Perhaps that led to their decision to draft Tavon Austin in the first round?
TotalTD <- ddply(wrDat, ~Team, summarize, Rec_TDs = sum(Rec.TD))
ggplot(TotalTD, aes(x = Team, y = Rec_TDs, fill = Team)) + geom_bar(stat = "identity") +
theme(legend.position = "none")
TotalYD <- ddply(wrDat, ~Team, summarize, Rec_YDs = sum(Rec.Yds))
ggplot(TotalYD, aes(x = Team, y = Rec_YDs, fill = Team)) + geom_bar(stat = "identity") +
theme(legend.position = "none")
Note: I plan on putting these graphs in descending order later on
Lastly, let's quickly examine the relationship between year and number of receiving first downs with a stripplot.
ggplot(wrDat, aes(x = year, y = Rec.1stD, color = Team)) + geom_jitter()
This one is still a work in progress. I would like to overlay the mean receiving 1st Downs by year and then compare it against 1st Downs by running backs.
So far, we have simply been analyzing teams based on their receiving stats. However, this is bias because tight ends are having an increased effect on the receiving game. We are simply not accounting for the impact of players like Rob Gronkowski, Jimmy Graham and Tony Gonzalez. For teams like, New England it drastically affects how the two above bar graphs look. This is definitely something to come back to at a later time in addition to analyzing the impact of the other key positions.