ggplot2
Because these plots are drawn with ggplot2 rather than base R graphics, we first load the ggplot2 package so that its plotting functions are available when the R Markdown file is knit.
library(ggplot2)
Now we can load the data we want to plot. I found a .csv file containing every play from every game of the 2014 National Football League (NFL) season. The data can be found at http://nflsavant.com/about.php
nflplays <- read.csv('https://raw.githubusercontent.com/Logan213/MSDA_R_Bridge_Wk4/master/pbp-2014.csv')
Here we can see the first six rows and the first seven columns of the data. The square brackets after nflplays, the object that stores the data from the .csv file, select columns 1 through 7.
head(nflplays[1:7])
## GameId GameDate Quarter Minute Second OffenseTeam DefenseTeam
## 1 2014090709 2014-09-07 2 1 57 MIN
## 2 2014090711 2014-09-07 4 0 0 CAR
## 3 2014090702 2014-09-07 4 4 2 BUF CHI
## 4 2014090703 2014-09-07 2 0 4 WAS
## 5 2014091100 2014-09-11 4 0 31 BAL PIT
## 6 2014091100 2014-09-11 4 0 0 PIT
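As a small illustration of the bracket indexing used above, the sketch below builds a made-up three-row data frame (the object toy and its values are hypothetical, not taken from the NFL file) and selects columns by position. With a single index inside the brackets, a data frame is treated as a list of columns, so toy[1:2] and toy[, 1:2] select the same columns.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  GameId  = c(1, 2, 3),
  Quarter = c(2, 4, 4),
  Yards   = c(5, 0, -2)
)

# A single index selects columns, returning a smaller data frame
first_two <- toy[1:2]
names(first_two)

# Equivalent form with an explicit (empty) row index
all.equal(first_two, toy[, 1:2])

# head() limits the number of rows shown (six by default)
head(toy[1:2], n = 2)
```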
Using the summary function, we can get a summary of the data contained in the different columns. Because the file has so many columns, the following is a preview of only the columns containing information such as Yards (gained or lost on the play), play formation, and play type.
summary(nflplays[20:25])
## Yards Formation PlayType
## Min. :-20.000 : 692 PASS :18881
## 1st Qu.: 0.000 FIELD GOAL : 1006 RUSH :12765
## Median : 0.000 NO HUDDLE : 873 KICK OFF: 2635
## Mean : 4.212 NO HUDDLE SHOTGUN: 3518 PUNT : 2511
## 3rd Qu.: 6.000 PUNT : 2397 TIMEOUT : 1715
## Max. :102.000 SHOTGUN :17111 : 1462
## UNDER CENTER :20098 (Other) : 5726
## IsRush IsPass IsIncomplete
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2951 Mean :0.4132 Mean :0.1504
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
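One way to read the indicator columns in this summary: because IsRush, IsPass, and IsIncomplete are coded 0/1, their means are proportions, so a mean of 0.4132 for IsPass says that roughly 41% of the rows are pass plays. A minimal sketch with made-up values (the vector is_pass is hypothetical):

```r
# Hypothetical 0/1 indicator: 1 = pass play, 0 = anything else
is_pass <- c(1, 0, 0, 1, 0)

# The mean of a 0/1 vector is the share of ones (here 2 of 5 plays)
mean(is_pass)

# Same result computed explicitly as a proportion
sum(is_pass == 1) / length(is_pass)
```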
First, I created a series of boxplots showing the yards gained or lost from each play formation (formations are general, so there are no 2-tight end sets, etc.). A large share of NFL plays result in zero yards (the overall median is 0), and most plays are run from the Quarterback “Under Center” formation, which is why the median line is pretty much at zero for that boxplot.
# One box per formation; the stray size argument is dropped because ggplot() does not use it
ggplot(nflplays, aes(x = Formation, y = Yards)) + geom_boxplot()
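Since the center line of each box is the median, a per-formation table of medians and means can be a useful numeric companion to the plot. The sketch below uses a small invented data frame (toy_plays and its values are hypothetical); on the real data the same calls would presumably take nflplays$Yards and nflplays$Formation.

```r
# Hypothetical plays; values are invented for illustration only
toy_plays <- data.frame(
  Formation = c("SHOTGUN", "SHOTGUN", "SHOTGUN",
                "UNDER CENTER", "UNDER CENTER", "UNDER CENTER"),
  Yards     = c(0, 7, 20, 0, 0, 3)
)

# Median yards per formation (what the center line of each box shows)
med <- tapply(toy_plays$Yards, toy_plays$Formation, median)

# Mean yards per formation (pulled upward by the long gains)
avg <- tapply(toy_plays$Yards, toy_plays$Formation, mean)

med
avg
```

In this toy data the mean exceeds the median for both formations, the same pattern the summary shows for the full Yards column (median 0, mean 4.212).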
The following is a histogram of the yards gained on every NFL play. The very large number of plays resulting in no gain produces a spike at zero that dominates the plot.
qplot(Yards, data=nflplays, binwidth = 1, main='Distribution of Yards Gained')
We can use the subset function to create a separate data set that excludes any play with no yardage gained or lost:
nonzeroyds <- subset(nflplays, Yards != 0)
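To make the filtering step concrete, here is a minimal sketch on invented data (the data frame toy is hypothetical). It also shows the logical-indexing form, which keeps the same rows when the column has no missing values.

```r
# Hypothetical yardage values, including two zero-yard plays
toy <- data.frame(Yards = c(0, 5, -3, 0, 12))

# subset() keeps only the rows where the condition is TRUE
nonzero_subset <- subset(toy, Yards != 0)

# Equivalent logical indexing; drop = FALSE keeps the data frame structure
nonzero_index <- toy[toy$Yards != 0, , drop = FALSE]

nrow(nonzero_subset)   # the two zero-yard rows are gone, leaving 3
all.equal(nonzero_subset, nonzero_index)
```

One difference worth knowing: subset() drops rows where the condition evaluates to NA, while logical indexing returns NA-filled rows for them.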
We then plot that data with a bin width of 1, so that each yardage value gets its own bin. Removing the zero-yard plays makes the rest of the distribution easier to see.
qplot(Yards, data=nonzeroyds, binwidth = 1)
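qplot() has been deprecated since ggplot2 3.4.0, so a sketch of the same kind of histogram written with ggplot() and geom_histogram() may be useful. The data frame toy, its values, and the plot title are invented for illustration; with the real data, nonzeroyds would take the place of toy.

```r
library(ggplot2)

# Hypothetical yardage values with the zero-yard plays already removed
toy <- data.frame(Yards = c(-3, 1, 1, 2, 5, 5, 5, 12))

# binwidth = 1 gives each whole-yard value its own bar
p <- ggplot(toy, aes(x = Yards)) +
  geom_histogram(binwidth = 1) +
  ggtitle("Distribution of Non-Zero Yards Gained")

p
```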
The last plot is a scatterplot of yards gained against the “absolute” yard line (0-100 yards on the field), rather than the fixed yard line (0-50, then designated by “Own” or “Opponent’s”). Because there are so many points, this may not be the best way to present the data. However, the plot is interesting to look at for a few reasons.
There are only a finite number of yards to gain on a play, so any play resulting in a touchdown from a given yard line is plotted at Yards = 100 - YardLine. This produces a straight line representing all scoring plays by the offense in possession of the ball. Any points above this line are turnovers resulting in a score (a fumble or interception returned for a score).
qplot(YardLine, Yards, data=nflplays, color = YardLine, main='Field Location and Length of Play')
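The relationship behind that line can be sketched numerically. The example below uses invented plays (the data frame toy and the columns ToGoal and Position are hypothetical) and classifies each play by where it falls relative to Yards = 100 - YardLine.

```r
# Hypothetical plays; YardLine is the absolute position (0-100)
toy <- data.frame(
  YardLine = c(20, 80, 95, 30),
  Yards    = c(5, 20, 5, 85)
)

# Yards remaining to the goal line from each starting position
toy$ToGoal <- 100 - toy$YardLine

# Classify each play relative to the line Yards = 100 - YardLine
toy$Position <- ifelse(toy$Yards == toy$ToGoal, "on line",
                ifelse(toy$Yards >  toy$ToGoal, "above line", "below line"))

toy
```

To draw the same line on the scatterplot, one option is to add geom_abline(intercept = 100, slope = -1) to the plot object.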