Sean Jewell

Visualize anything with ggplot2 – NFL data

In this homework assignment we will look into a new dataset, the NFL dataset prepared by Jenny and Leah. In particular, we will look at one subset of the whole dataset, the quarterback (QB) data. I am trying to find some interesting patterns relating QBs to other statistics. Some questions of interest: how do sacks per game affect the number of TDs, which types of teams have quarterbacks who get sacked often, and how many yards do QBs lose on average when they are sacked? We will endeavor to answer at least some of these questions, or related ones, as we go along.

Looking at this data we see that it is organized as:

library(ggplot2)   # plotting
library(reshape2)  # melt(), used further down
wdir <- getwd()
nDat <- read.csv(paste(wdir, "/NFL data/data/QB.csv", sep = ""))
str(nDat)

## 'data.frame':    227 obs. of  22 variables:
##  $ Name    : Factor w/ 180 levels " A.J. Feeley",..: 172 150 76 147 81 180 80 126 127 174 ...
##  $ Team    : Factor w/ 32 levels "ARI","ATL","BAL",..: 20 26 12 24 26 31 25 30 30 9 ...
##  $ G       : int  16 16 15 12 3 9 12 16 11 6 ...
##  $ QBRat   : num  111 102 101 100 100 ...
##  $ Comp    : int  324 357 312 233 1 93 240 291 14 148 ...
##  $ Att     : int  492 541 475 372 1 156 389 474 16 213 ...
##  $ Pct     : num  65.9 66 65.7 62.6 100 59.6 61.7 61.4 87.5 69.5 ...
##  $ Pass.Yds: int  3900 4710 3922 3018 8 1255 3200 3451 111 1605 ...
##  $ Pass.YG : num  243.8 294.4 261.5 251.5 2.7 ...
##  $ Yds.Att : num  7.9 8.7 8.3 8.1 8 8 8.2 7.3 6.9 7.5 ...
##  $ TD      : int  36 30 28 21 0 10 17 25 0 11 ...
##  $ Int     : int  4 13 11 6 0 3 5 6 0 7 ...
##  $ Rush    : int  31 29 64 100 6 25 34 68 4 6 ...
##  $ Rush.Yds: int  30 52 356 676 -5 125 176 364 39 38 ...
##  $ Rush.YG : num  1.9 3.3 23.7 56.3 -1.7 13.9 14.7 22.8 3.5 6.3 ...
##  $ Rush.Avg: num  1 1.8 5.6 6.8 -0.8 5 5.2 5.4 9.8 6.3 ...
##  $ Rush.TD : int  1 0 4 9 0 0 2 0 0 0 ...
##  $ Sack    : int  25 38 31 34 0 13 32 28 2 7 ...
##  $ Sack.Yds: int  175 227 193 210 0 80 220 195 8 41 ...
##  $ Fum     : int  3 7 4 11 0 6 7 7 0 0 ...
##  $ FumL    : int  1 4 1 3 0 4 3 3 0 0 ...
##  $ year    : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...

a flat file with 227 records and 22 variables. The variables include player name, team, and year, plus season statistics such as passing yards, rushing yards, sacks, and fumbles. Let's see how many years of data we have:

unique(nDat$year)

## [1] 2010 2011 2012

so we have 3 years of data across 32 teams. Since this is our first time looking at the data, some preliminary graphical analysis will be informative. The idea is to generate a few figures that may suggest an interesting story, and then explore those storylines in more detail.
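A quick tabulation can back up these counts before any plotting. The sketch below uses a small made-up data frame (`toy`, with invented names and values) in place of `nDat`, so the numbers are purely illustrative; only the column names `Team` and `year` are taken from the real data.

```r
# Hypothetical stand-in for nDat: invented rows, real column names
toy <- data.frame(
  Name = c("QB A", "QB B", "QB A", "QB C"),
  Team = c("IND", "NYJ", "IND", "BUF"),
  year = c(2010, 2010, 2011, 2012)
)

# Number of records in each season
rows_per_year <- table(toy$year)

# Number of distinct teams overall and within each season
n_teams <- length(unique(toy$Team))
teams_per_year <- tapply(toy$Team, toy$year, function(x) length(unique(x)))

print(rows_per_year)
print(teams_per_year)
```

Running the same three lines on `nDat` would show whether every team appears in every season, which matters later when comparing teams.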

ggplot(nDat) + geom_density(aes(x = Sack))

[Figure: density of Sack over all QB records]

OK, this is interesting: there seems to be some bimodality in the sack data. Do certain teams have more sacks than others, or is it a player-specific quality? Or, as we will see later, is this just too naive a start? Do we need to normalize the data first and then look at this more carefully?

ggplot(nDat, aes(x = Sack, color = Team)) + geom_density()

[Figure: density of Sack, one colored curve per Team]

This looks terrible, so we need to aggregate the data a bit for a cleaner visualization (or facet the chart). What we are really seeking, and it is surprising that it is not in the dataset, is the number of wins in each year, so we can group teams into losing and winning teams. It is also becoming apparent that some scaling needs to happen before we can draw any real conclusions. For example, the number of games played is clearly going to affect most of the other statistics, especially the season totals! Take a look at the following chart for an illustration:

ex <- melt(nDat, id.vars = c("Name", "Team", "G", "year"))
ggplot(data = ex, aes(x = G, y = value)) + geom_point() + facet_wrap(~variable, 
    scales = "free") + geom_smooth()

## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

[Figure: each statistic plotted against G (games played), one facet per variable, with loess smooths]
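For readers unfamiliar with `melt()`, the reshaping step above can be sketched on a toy data frame. The rows below are invented for illustration (only the column names mirror `nDat`), and the sketch assumes the reshape2 package is installed.

```r
library(reshape2)

# Hypothetical wide data: one row per player, one column per statistic
toy <- data.frame(
  Name = c("QB A", "QB B"),
  G    = c(16, 3),
  Sack = c(25, 0),
  TD   = c(36, 0)
)

# Long format: the id columns are repeated, and every other column is
# stacked into a (variable, value) pair, one row per player and statistic
long <- melt(toy, id.vars = c("Name", "G"))
print(long)
```

Because every statistic ends up in the single `value` column, with its name in `variable`, `facet_wrap(~variable)` can draw one panel per statistic against `G`.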

There is clearly an increasing pattern for a number of these series, so for further analysis we will want to scale the metrics of interest by the number of games played. Let us revisit the earlier figure showing the distribution of the sack data. Recall that it looked bimodal. With our new knowledge, perhaps the bimodality came from players who played a lot versus those who hardly played? The answer should be clear from the plot below:

ggplot(nDat, aes(x = Sack/G)) + geom_density()

[Figure: density of Sack/G]

This is still interesting: there are clearly a number of players with no sacks, and the majority have around 2 sacks per game. Let's look at a strip plot that compares sacks per game by team.

# geom_jitter() draws the points itself; an extra geom_point() layer would plot every record twice
ggplot(nDat, aes(x = Sack/G, y = Team)) + geom_jitter(position = position_jitter(width = 0.2))

[Figure: Sack/G by Team, jittered points]

ggplot(nDat, aes(x = Sack/G)) + geom_density() + facet_wrap(~Team)

[Figure: density of Sack/G, one facet per Team]

Looks like the Jets and Bills have some pretty bad QBs based on this quick and dirty analysis. What is going on with IND?

ggplot(subset(nDat, Team == "IND"), aes(x = Name, y = Sack/G, color = Name, 
    size = G)) + geom_point()

[Figure: Sack/G for each IND quarterback, point size by G]

The Colts basically have two outliers who are skewing the density plot. In particular, both Peyton and Luck have played in many games, but while Peyton averages 1 sack per game, Luck averages over 2.5 sacks per game.
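One way to keep per-player rates from distorting a team-level comparison is to pool sacks and games before dividing, rather than averaging each player's rate. The following is a minimal sketch on invented rows (the names and numbers are hypothetical and only loosely echo the IND situation); note that summing `G` over quarterbacks counts QB appearances, not team games.

```r
# Hypothetical records: two heavy-use starters and one barely used backup
toy <- data.frame(
  Team = c("IND", "IND", "IND", "NYJ"),
  Name = c("Starter A", "Starter B", "Backup C", "Starter D"),
  G    = c(16, 16, 1, 16),
  Sack = c(16, 41, 0, 40)
)

# Per-player rate, as plotted in the figures above
toy$SackPerG <- toy$Sack / toy$G

# Unweighted team mean: the one-game backup counts as much as a full season
unweighted <- tapply(toy$SackPerG, toy$Team, mean)

# Pooled team rate: total sacks divided by total QB appearances
totals <- aggregate(cbind(Sack, G) ~ Team, data = toy, FUN = sum)
totals$SackPerG <- totals$Sack / totals$G

print(unweighted)
print(totals)
```

In this toy example the backup's zero drags the unweighted IND mean well below the pooled rate, which is the same effect that sizing points by `G` makes visible in the plots.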

The last analysis that we will look into involves the relationship between the so-called QBRat (quarterback rating) and the number of passing yards per game. Part of this analysis is a simple attempt to see how QBRat is designed and whether passing yards per game is an important feature.

ggplot(nDat, aes(y = QBRat, x = Pass.Yds/G)) + geom_point() + geom_smooth()

## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

[Figure: QBRat against Pass.Yds/G with a loess smooth]

After looking at this chart, it is surprising to see that some QBs have very high ratings but almost no passing yards! How can this be? If we size the points by the number of games played we see:

ggplot(nDat, aes(y = QBRat, x = Pass.Yds/G)) + geom_point(aes(size = G)) + geom_smooth()

## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

[Figure: QBRat against Pass.Yds/G, point size by G, with a loess smooth]

This plot actually provides a lot of additional information and adds confidence in the loess fit: players who have played many games (and thus have more data) sit quite close to the smooth fit, whereas those with few games show larger errors. I'm still interested in those QBs who have almost no passing yards but still have high ratings! What's happening?

ggplot(subset(nDat, Pass.Yds < 10), aes(x = QBRat)) + geom_histogram()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

[Figure: histogram of QBRat for records with Pass.Yds < 10]

htmlPrint(subset(nDat, Pass.Yds < 10 & QBRat > 0))

Name	Team	G	QBRat	Comp	Att	Pct	Pass.Yds	Pass.YG	Yds.Att	Rush	Rush.Yds	Rush.YG	Rush.Avg	Rush.TD	year
Billy Volek	SDG	3	100	1	1	100	8	3	8	6	-5	-2	-1	0	2010
Kellen Clemens	NYJ	1	56	1	2	50	6	6	3	2	9	9	4	1	2010
Tom Brandstater	STL	1	40	0	2	0	0	0	0	0	0	0		0	2011

We see that these players simply did not play enough to accumulate any “negative” statistics that would pull down their QBRat. Future work on this dataset would include adding the number of wins each team had, as well as investigating joint relationships between all of the positions.
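The three ratings in the table are consistent with the standard NFL passer rating formula, which combines four per-attempt components, each clamped to the range 0 to 2.375. The sketch below assumes that `QBRat` is this rating, which the dataset itself does not document, and assumes zero passing touchdowns and interceptions for all three players (those columns are not shown in the table above). `passer_rating` is a helper name introduced here for illustration.

```r
# Standard NFL passer rating; assumes att > 0.
# Assumption: QBRat in the dataset is computed this way.
passer_rating <- function(comp, att, yds, td, int) {
  clamp <- function(x) pmin(pmax(x, 0), 2.375)   # each part is limited to [0, 2.375]
  comp_part <- clamp((comp / att - 0.3) * 5)     # completion percentage
  yds_part  <- clamp((yds / att - 3) * 0.25)     # yards per attempt
  td_part   <- clamp((td / att) * 20)            # touchdowns per attempt
  int_part  <- clamp(2.375 - (int / att) * 25)   # interceptions per attempt
  (comp_part + yds_part + td_part + int_part) / 6 * 100
}

# Comp, Att and Pass.Yds taken from the three rows of the table above;
# td = 0 and int = 0 are assumed
ratings <- passer_rating(
  comp = c(1, 1, 0),
  att  = c(1, 2, 2),
  yds  = c(8, 6, 0),
  td   = 0,
  int  = 0
)
print(ratings)  # 100, 56.25 and about 39.58
```

With one completion on one attempt for 8 yards, the completion and interception components both hit the 2.375 cap and the yards component contributes 1.25, giving exactly 100. A single attempt is therefore enough to produce a high rating, and throwing no interceptions is by itself worth about 39.6 points.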

Note that we did not use the lattice package, since I instead spent the time exploring this new dataset. This is as per Jenny's instructions.

date()

## [1] "Sun Oct  6 14:26:50 2013"

sessionInfo()

## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xtable_1.7-1    reshape2_1.2.2  ggplot2_0.9.3.1 plyr_1.8       
## [5] knitr_1.4.1    
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
##  [4] evaluate_0.4.7     formatR_0.9        grid_3.0.1        
##  [7] gtable_0.1.2       labeling_0.2       MASS_7.3-26       
## [10] munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5
## [13] scales_0.2.3       stringr_0.6.2      tools_3.0.1