Sean Jewell
In this homework assignment we will look into a new dataset, the NFL dataset prepared by Jenny and Leah. In particular, we will look at a subset of the whole dataset: the quarterback (QB) data. I am trying to find interesting patterns relating QBs to other statistics. Some questions of interest: how do sacks per game affect the number of TDs, which types of teams have quarterbacks who get sacked often, and how many yards do QBs lose on average when they are sacked? We will endeavor to answer at least some of these or related questions as we go along.
Looking at this data we see that it is organized as:
library(ggplot2)   # plotting
library(reshape2)  # melt()
wdir <- getwd()
nDat <- read.csv(file.path(wdir, "NFL data", "data", "QB.csv"))
str(nDat)
## 'data.frame': 227 obs. of 22 variables:
## $ Name : Factor w/ 180 levels " A.J. Feeley",..: 172 150 76 147 81 180 80 126 127 174 ...
## $ Team : Factor w/ 32 levels "ARI","ATL","BAL",..: 20 26 12 24 26 31 25 30 30 9 ...
## $ G : int 16 16 15 12 3 9 12 16 11 6 ...
## $ QBRat : num 111 102 101 100 100 ...
## $ Comp : int 324 357 312 233 1 93 240 291 14 148 ...
## $ Att : int 492 541 475 372 1 156 389 474 16 213 ...
## $ Pct : num 65.9 66 65.7 62.6 100 59.6 61.7 61.4 87.5 69.5 ...
## $ Pass.Yds: int 3900 4710 3922 3018 8 1255 3200 3451 111 1605 ...
## $ Pass.YG : num 243.8 294.4 261.5 251.5 2.7 ...
## $ Yds.Att : num 7.9 8.7 8.3 8.1 8 8 8.2 7.3 6.9 7.5 ...
## $ TD : int 36 30 28 21 0 10 17 25 0 11 ...
## $ Int : int 4 13 11 6 0 3 5 6 0 7 ...
## $ Rush : int 31 29 64 100 6 25 34 68 4 6 ...
## $ Rush.Yds: int 30 52 356 676 -5 125 176 364 39 38 ...
## $ Rush.YG : num 1.9 3.3 23.7 56.3 -1.7 13.9 14.7 22.8 3.5 6.3 ...
## $ Rush.Avg: num 1 1.8 5.6 6.8 -0.8 5 5.2 5.4 9.8 6.3 ...
## $ Rush.TD : int 1 0 4 9 0 0 2 0 0 0 ...
## $ Sack : int 25 38 31 34 0 13 32 28 2 7 ...
## $ Sack.Yds: int 175 227 193 210 0 80 220 195 8 41 ...
## $ Fum : int 3 7 4 11 0 6 7 7 0 0 ...
## $ FumL : int 1 4 1 3 0 4 3 3 0 0 ...
## $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
a flat file with 227 records and 22 variables. The variables include player name and team, followed by season statistics such as passing yards, rushing yards, sacks, and fumbles. Let's see how many years of data we have:
unique(nDat$year)
## [1] 2010 2011 2012
so we have 3 years of data across 32 teams. Since this is our first time looking at the data some preliminary graphical analysis will be informative. The idea is to generate a few figures that may show an interesting story, and then we can explore those story lines in more detail.
ggplot(nDat) + geom_density(aes(x = Sack))
Ok, this is interesting: there seems to be some bimodality in the sack data. Do certain teams have more sacks than other teams, or is it a player-specific quality? Or, as we will see later, is this just too naive a start? Do we need to normalize the data first and then look at this more carefully?
ggplot(nDat, aes(x = Sack, color = Team)) + geom_density()
This looks terrible, so we need to aggregate the data a bit for a cleaner visualization (or facet the chart). What we are really seeking, and it is surprising that it is not in the dataset, is the number of wins in each year, so that we can group teams into losing teams and winning teams. It is also becoming apparent that some scaling needs to happen before we can draw any real conclusions. For example, the number of games played is clearly going to affect nearly every other statistic! Take a look at the following chart for an illustration:
ex <- melt(nDat, id.vars = c("Name", "Team", "G", "year"))
ggplot(data = ex, aes(x = G, y = value)) + geom_point() + facet_wrap(~variable,
scales = "free") + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
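To make the reshaping step concrete, here is a minimal sketch of what melt does, using a small made-up data frame (the toy rows and the QB names in it are hypothetical, not taken from QB.csv). Every column not listed in id.vars is stacked into a variable/value pair, which is what lets facet_wrap(~variable) draw one panel per statistic.

```r
library(reshape2)  # provides melt(); listed in the session info at the end

# Hypothetical toy rows, not taken from QB.csv.
toy <- data.frame(
  Name = c("QB A", "QB B"),
  G    = c(16, 4),
  Sack = c(30, 5),
  TD   = c(25, 2)
)

# Each non-id column becomes a (variable, value) pair, one row per pair.
long <- melt(toy, id.vars = c("Name", "G"))
print(long)
```

With two players and two measured columns, the long table has four rows, each still carrying its Name and G.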
In the faceted chart there is clearly an increasing pattern for a number of these series, so for further analysis we will want to scale the metrics of interest by the number of games played. Let us revisit the earlier figure showing the distribution of the sack data. Recall that it looked bimodal; with our new knowledge, perhaps the bimodality came from players who played a lot versus those who rarely played? The answer should be clear from the plot below:
ggplot(nDat, aes(x = Sack/G)) + geom_density()
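The scaling behind this plot can be written out explicitly. The sketch below uses the first five values of G, Sack and Sack.Yds exactly as printed in the str() output above; the column names SackPerG and YdsPerSack are introduced here for illustration only. It also touches on one of the opening questions, namely how many yards are lost per sack, guarding against QBs who were never sacked.

```r
# First five rows of G, Sack and Sack.Yds as printed by str(nDat) above.
first <- data.frame(
  G        = c(16, 16, 15, 12, 3),
  Sack     = c(25, 38, 31, 34, 0),
  Sack.Yds = c(175, 227, 193, 210, 0)
)

# Sacks per game: the quantity plotted on the x axis.
first$SackPerG <- first$Sack / first$G

# Yards lost per sack; NA when a QB was never sacked (avoids 0/0).
first$YdsPerSack <- ifelse(first$Sack > 0, first$Sack.Yds / first$Sack, NA)

print(round(first, 2))
```

On these five rows the sacked QBs lose roughly 6 to 7 yards per sack, though five rows are of course not enough to say anything about the full dataset.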
The per-game density is still interesting: there are clearly a number of players with no sacks, and most of the rest sit around 2 sacks per game. Let's look at a strip plot of sacks per game by team, followed by per-team densities.
ggplot(nDat, aes(x = Sack/G, y = Team)) + geom_jitter(position = position_jitter(width = 0, height = 0.2))
ggplot(nDat, aes(x = Sack/G)) + geom_density() + facet_wrap(~Team)
Looks like the Jets and Bills have some pretty bad QBs based on this quick and dirty analysis. What is going on with IND?
ggplot(subset(nDat, Team == "IND"), aes(x = Name, y = Sack/G, color = Name,
size = G)) + geom_point()
The Colts basically have two players who drive the shape of the density plot. Both Peyton Manning and Andrew Luck played in many games, but while Manning averages about 1 sack per game, Luck averages over 2.5 sacks per game.
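Because a per-team density treats every QB row equally, a backup with a single game can move a team's picture as much as a 16-game starter. One possible alternative, sketched here on made-up rows (the teams "AAA" and "BBB" and all numbers are hypothetical), is to pool sacks and games within a team before dividing. Note that summing G across QBs counts QB-appearances rather than team games, so this is only a rough weighting.

```r
# Hypothetical rows: a starter and a backup on team "AAA", one starter on "BBB".
toy <- data.frame(
  Team = c("AAA", "AAA", "BBB"),
  G    = c(16, 1, 16),
  Sack = c(16, 4, 40)
)

# Unweighted: average each QB's own rate, so the one-game backup
# counts as much as the starter.
toy$SackPerG <- toy$Sack / toy$G
unweighted <- aggregate(SackPerG ~ Team, data = toy, FUN = mean)

# Pooled: total sacks over total QB-games, which weights each QB
# by the number of games played.
totals <- aggregate(cbind(Sack, G) ~ Team, data = toy, FUN = sum)
totals$SackPerG <- totals$Sack / totals$G

print(unweighted)
print(totals)
```

For team "AAA" the unweighted average is 2.5 sacks per game while the pooled rate is 20/17, or about 1.18, which shows how much a single short appearance can distort the unweighted view.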
The last analysis that we will look into involves the relationship between the so-called QBRat and the number of passing yards per game. Part of this analysis is a simple attempt to see how QBRat is designed and whether passing yards per game is an important feature.
ggplot(nDat, aes(y = QBRat, x = Pass.Yds/G)) + geom_point() + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
After looking at this chart it is surprising to see that some QBs have very high ratings but almost no passing yards! How can this be? If we size the points by the number of games played we see:
ggplot(nDat, aes(y = QBRat, x = Pass.Yds/G, size = G)) + geom_point() + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
This plot actually provides a lot of additional information and confidence in the loess fit: players who have played many games (and thus have more data) are quite close to the smooth fit, whereas those with few games have larger errors. I'm still interested in those QBs who have almost no passing yards but still have high ratings! What's happening?
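Before digging into those low-yardage QBs, the point about games played and the loess fit can be checked numerically rather than by eye. The sketch below runs on synthetic data only: the data frame sim, the column PassYG, and the assumed relationship between rating, yards and noise are all made up for illustration. It fits loess directly, compares the average absolute residual for players with few games against those with many, and shows the weights argument of loess as one optional way to let well-observed players count for more.

```r
set.seed(1)  # synthetic data only; nothing here comes from QB.csv

n <- 80
sim <- data.frame(
  G      = sample(1:16, n, replace = TRUE),
  PassYG = runif(n, 0, 300)
)
# Assumed model: rating rises with yards per game, and the noise
# shrinks as the number of games played grows.
sim$QBRat <- 50 + 0.15 * sim$PassYG + rnorm(n, sd = 30 / sqrt(sim$G))

# Same smoother family that geom_smooth reports using, fitted directly.
fit <- loess(QBRat ~ PassYG, data = sim)

# Mean absolute residual for rarely used versus regular players.
abs_res <- abs(residuals(fit))
few  <- mean(abs_res[sim$G <= 4])
many <- mean(abs_res[sim$G >= 12])
print(c(few_games = few, many_games = many))

# Optional: let players with more games pull the fit harder.
fit_w <- loess(QBRat ~ PassYG, data = sim, weights = G)
```

Running the same residual comparison on nDat, with Pass.Yds/G in place of PassYG, would put a number on the visual impression from the sized scatterplot.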
ggplot(subset(nDat, Pass.Yds < 10), aes(x = QBRat)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
htmlPrint(subset(nDat, Pass.Yds < 10 & QBRat > 0))
| Name | Team | G | QBRat | Comp | Att | Pct | Pass.Yds | Pass.YG | Yds.Att | TD | Int | Rush | Rush.Yds | Rush.YG | Rush.Avg | Rush.TD | Sack | Sack.Yds | Fum | FumL | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Billy Volek | SDG | 3 | 100 | 1 | 1 | 100 | 8 | 3 | 8 | 0 | 0 | 6 | -5 | -2 | -1 | 0 | 0 | 0 | 0 | 0 | 2010 |
| Kellen Clemens | NYJ | 1 | 56 | 1 | 2 | 50 | 6 | 6 | 3 | 0 | 0 | 2 | 9 | 9 | 4 | 1 | 0 | 0 | 0 | 0 | 2010 |
| Tom Brandstater | STL | 1 | 40 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2011 |
We see that these players simply did not play enough to accumulate any “negative” statistics that would pull down their QBRat. Future work on this dataset would include adding the number of wins each team had, as well as investigating joint relationships across all of the positions.
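This can be made concrete under one assumption: that QBRat is the standard NFL passer rating, which the dataset itself does not state. That rating combines four per-attempt components (completion percentage, yards, touchdowns, interceptions), each clamped to the range 0 to 2.375. The function name passer_rating and its argument names are introduced here for illustration. Applied to the three rows in the table above, the sketch reproduces the rounded QBRat values.

```r
# Standard NFL passer rating; assumed (not confirmed) to be what QBRat holds.
passer_rating <- function(comp, att, yds, td, int) {
  clamp <- function(x) pmin(pmax(x, 0), 2.375)  # each component lies in [0, 2.375]
  comp_part <- clamp((comp / att - 0.3) * 5)
  yds_part  <- clamp((yds / att - 3) * 0.25)
  td_part   <- clamp((td / att) * 20)
  int_part  <- clamp(2.375 - (int / att) * 25)
  (comp_part + yds_part + td_part + int_part) / 6 * 100
}

# Comp, Att and Pass.Yds for Volek, Clemens and Brandstater, from the table above;
# all three have zero TDs and zero interceptions.
ratings <- passer_rating(
  comp = c(1, 1, 0),
  att  = c(1, 2, 2),
  yds  = c(8, 6, 0),
  td   = 0,
  int  = 0
)
print(round(ratings))  # 100 56 40
```

Under this formula a QB with no interceptions receives the full 2.375 on the interception component no matter how few passes he threw, so even 0 completions on 2 attempts yields a rating of about 40, and a single 8-yard completion yields 100.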
date()
## [1] "Sun Oct 6 14:26:50 2013"
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xtable_1.7-1 reshape2_1.2.2 ggplot2_0.9.3.1 plyr_1.8
## [5] knitr_1.4.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-2 dichromat_2.0-0 digest_0.6.3
## [4] evaluate_0.4.7 formatR_0.9 grid_3.0.1
## [7] gtable_0.1.2 labeling_0.2 MASS_7.3-26
## [10] munsell_0.4.2 proto_0.3-10 RColorBrewer_1.0-5
## [13] scales_0.2.3 stringr_0.6.2 tools_3.0.1