As a football fan, I want to find out whether there is a correlation between passing plays and whether a team wins. I have chosen the 2019 NFL regular season data set to look for such a correlation. It contains play-by-play data: 45546 observations of 256 variables. The data set has no Win/Loss or Score variable, so I will join a table from another data set in order to compare play counts with results. I want to answer the following questions for the NYJ (New York Jets), my favorite football team: Does the team run more pass plays or run plays? Based on the count of each play type, is the team more likely to win or lose? Is the team gaining big yardage, or are most gains less than 10 yards?
Loading dplyr for data manipulation, RCurl for reading the GitHub URL, and ggplot2 for plotting.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RCurl)
library(ggplot2)
Reading and importing the CSV from the nflscrapR-data repository on GitHub.
data <- getURL("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2019.csv", ssl.verifypeer=0L, followlocation=1L)
df=read.csv(text=data)
I am checking the shape of the data set with dim() rather than head(), because printing the head of df would show all 256 variables.
dim(df)
## [1] 45546 256
df=data.frame(df)
Here I am selecting specific columns out of the 256 variables and then taking the head (the first 6 rows by default) of the data set.
df1=df[,c(1,2,3,4,5,6,7,8,9,10,13,15,16,17,18,19,20,21,22,23,24,26,27,28,34,35,36,37,38)]
head(df1)
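Selecting columns by position is brittle if the column order of the source file ever changes. A minimal sketch of selecting by name instead, using a small made-up data frame: the column names come from the summary output in this report, but the values, `extra_col`, `df_toy`, `keep_cols` and `df_small` are hypothetical and only for illustration.

```r
# Toy stand-in for the play-by-play data frame (made-up values).
df_toy <- data.frame(
  play_id = c(36, 51, 79),
  game_id = c(1, 1, 2),
  posteam = c("NYJ", "BUF", "NYJ"),
  play_type = c("pass", "run", "pass"),
  yards_gained = c(7, 3, -2),
  extra_col = c(NA, NA, NA)  # a column we do not want to keep
)

# Columns to keep, referenced by name rather than by position.
keep_cols <- c("play_id", "game_id", "posteam", "play_type", "yards_gained")

# Guard against typos: stop early if a requested column is absent.
missing_cols <- setdiff(keep_cols, names(df_toy))
stopifnot(length(missing_cols) == 0)

df_small <- df_toy[, keep_cols]
head(df_small)
```

The same idea should carry over to the real data by replacing `df_toy` with `df` and listing the wanted column names in `keep_cols`.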
From the 45546 observations, I want to select only the plays where the New York Jets had possession and then take a summary of the data.
df1=filter(df1, posteam =="NYJ")
summary(df1)
## play_id game_id home_team away_team
## Min. : 36.0 Min. :2.019e+09 Length:1312 Length:1312
## 1st Qu.: 985.2 1st Qu.:2.019e+09 Class :character Class :character
## Median :2042.0 Median :2.019e+09 Mode :character Mode :character
## Mean :2084.9 Mean :2.019e+09
## 3rd Qu.:3192.2 3rd Qu.:2.019e+09
## Max. :4546.0 Max. :2.019e+09
##
## posteam posteam_type defteam side_of_field
## Length:1312 Length:1312 Length:1312 Length:1312
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## yardline_100 game_date game_seconds_remaining quarter_end
## Min. : 1.00 Length:1312 Min. : 2 Min. :0
## 1st Qu.:35.00 Class :character 1st Qu.: 863 1st Qu.:0
## Median :56.00 Mode :character Median :1834 Median :0
## Mean :53.52 Mean :1771 Mean :0
## 3rd Qu.:75.00 3rd Qu.:2700 3rd Qu.:0
## Max. :99.00 Max. :3600 Max. :0
##
## drive sp qtr down
## Min. : 1.00 Min. :0.00000 Min. :1.000 Min. :1.000
## 1st Qu.: 5.00 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.:1.000
## Median :11.00 Median :0.00000 Median :2.000 Median :2.000
## Mean :11.58 Mean :0.05793 Mean :2.498 Mean :2.062
## 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :26.00 Max. :1.00000 Max. :4.000 Max. :4.000
## NA's :162
## goal_to_go time yrdln ydstogo
## Min. :0.0000 Length:1312 Length:1312 Min. : 0.000
## 1st Qu.:0.0000 Class :character Class :character 1st Qu.: 4.000
## Median :0.0000 Mode :character Mode :character Median :10.000
## Mean :0.0343 Mean : 7.861
## 3rd Qu.:0.0000 3rd Qu.:10.000
## Max. :1.0000 Max. :33.000
##
## ydsnet play_type yards_gained shotgun
## Min. :-13.00 Length:1312 Min. :-13.000 Min. :0.0000
## 1st Qu.: 5.00 Class :character 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 26.00 Mode :character Median : 0.000 Median :1.0000
## Mean : 32.83 Mean : 3.334 Mean :0.5191
## 3rd Qu.: 60.00 3rd Qu.: 5.000 3rd Qu.:1.0000
## Max. : 96.00 Max. : 92.000 Max. :1.0000
##
## pass_length pass_location air_yards yards_after_catch
## Length:1312 Length:1312 Min. :-9.000 Min. :-4.000
## Class :character Class :character 1st Qu.: 1.000 1st Qu.: 1.000
## Mode :character Mode :character Median : 5.000 Median : 4.000
## Mean : 7.812 Mean : 5.427
## 3rd Qu.:13.000 3rd Qu.: 7.000
## Max. :43.000 Max. :59.000
## NA's :791 NA's :989
## run_location
## Length:1312
## Class :character
## Mode :character
##
##
##
##
For my first chart, I want to graph the pass count for each game over the duration of the season. In the 12th game of the season there were over 50 pass plays. We will see later whether the team won that game or not.
df_plays=df1 %>% count(game_id, play_type)
df_plays=filter(df_plays, play_type=="pass")
df_plays$game_id=as.character(df_plays$game_id)
summary(df_plays)
## game_id play_type n
## Length:16 Length:16 Min. :27.00
## Class :character Class :character 1st Qu.:32.75
## Mode :character Mode :character Median :34.50
## Mean :35.81
## 3rd Qu.:38.25
## Max. :52.00
df_plays_chart=ggplot(data=df_plays, aes(x=game_id, y=n))+
geom_bar(stat="identity", fill="blue")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
df_plays_chart
Here we are creating a box plot of the per-game count of each play type, to see the median and the interquartile range and to look at the spread of the data. The plot shows that the team passes more often than it runs.
df_plays1=df1 %>% count(game_id, play_type)
df_plays1$game_id=as.character(df_plays1$game_id)
summary(df_plays1)
## game_id play_type n
## Length:116 Length:116 Min. : 1.00
## Class :character Class :character 1st Qu.: 3.00
## Mode :character Mode :character Median : 5.50
## Mean :11.31
## 3rd Qu.:17.50
## Max. :52.00
df_plays1_chart=ggplot(data=df_plays1, aes(x=play_type, y=n, fill=play_type))+
geom_boxplot()+
theme(axis.text.x = element_text(angle = 90))
df_plays1_chart
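The pass-versus-run preference can also be expressed as a single number per game: the share of pass plays among pass and run plays. A minimal base R sketch, assuming a table shaped like the output of count(game_id, play_type); the game labels and counts below are made up, and `counts`, `tab` and `pass_share` are hypothetical names.

```r
# Hypothetical per-game counts in the same shape as count(game_id, play_type).
counts <- data.frame(
  game_id = c("g1", "g1", "g2", "g2"),
  play_type = c("pass", "run", "pass", "run"),
  n = c(40, 20, 30, 30)
)

# Cross-tabulate: one row per game, one column per play type.
tab <- xtabs(n ~ game_id + play_type, data = counts)

# Share of pass plays among pass + run plays for each game.
pass_share <- tab[, "pass"] / rowSums(tab[, c("pass", "run")])
pass_share
```

A share above 0.5 means the game was pass-heavy; a share near 0.5 means a balanced game.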
Here I am creating a pie chart for an exploratory analysis to see the ratios of the different play types.
df_plays2_chart=ggplot(data=df_plays1, aes(x="", y=n, fill=play_type))+
geom_bar(stat="identity", width=1)+
coord_polar("y", start=0)
df_plays2_chart
I am importing another data set, which includes game results and scores.
nyj <- getURL("https://github.com/mianshariq/SPS_Bridge/raw/27d4151e8dfee571968ccd0819d24efec8263f40/NYJ_Games.csv", ssl.verifypeer=0L, followlocation=1L)
df_nyj=read.csv(text=nyj)
df_nyj
Here I am merging the NFL data set with the results data set.
df1=merge(x = df1, y = df_nyj, by = "game_id", all = TRUE)
dim(df1)
## [1] 1312 31
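Before relying on a merged table, it can help to check that every game in the play-by-play data found a match in the results table. A minimal sketch with made-up data: `plays`, `results`, `unmatched` and `joined` are hypothetical names, and only the `game_id` and `Result` column names are taken from this report.

```r
# Made-up play-level rows and a results table that lacks game 3.
plays <- data.frame(
  game_id = c(1, 1, 2, 3),
  play_type = c("pass", "run", "pass", "run")
)
results <- data.frame(
  game_id = c(1, 2),
  Result = c("W", "L")
)

# Game ids present in the plays but absent from the results table.
unmatched <- setdiff(unique(plays$game_id), results$game_id)

# all.x = TRUE keeps every play (a left join); unmatched games get NA in Result.
joined <- merge(plays, results, by = "game_id", all.x = TRUE)

# A left join on a unique key should not change the number of plays.
stopifnot(nrow(joined) == nrow(plays))

# Number of plays left without a result.
sum(is.na(joined$Result))
```

If the row count stays the same and no `Result` is NA, every play has been matched to exactly one game result.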
Here I am combining two sets of points in one chart: pass plays per game (circles) and run plays per game (triangles), colored by whether the game was a win or a loss. The chart shows that when the team has many pass plays in a game they tend to lose, and when they run more they tend to win. This is merely a correlation, so a deeper dive is needed before claiming causation. The chart also shows that over the first 8 games the team was pass-heavy and won only 1 game, while over the remaining 8 games there was a better balance between passing and running plays and the team won 6 of 8.
df_plays_w=df1 %>% count(game_id, play_type, Result)
df_plays_w=filter(df_plays_w, play_type=="pass")
df_plays_w$game_id=as.character(df_plays_w$game_id)
summary(df_plays_w)
## game_id play_type Result n
## Length:16 Length:16 Length:16 Min. :27.00
## Class :character Class :character Class :character 1st Qu.:32.75
## Mode :character Mode :character Mode :character Median :34.50
## Mean :35.81
## 3rd Qu.:38.25
## Max. :52.00
df_plays_r=df1 %>% count(game_id, play_type, Result)
df_plays_r=filter(df_plays_r, play_type=="run")
df_plays_r$game_id=as.character(df_plays_r$game_id)
summary(df_plays_r)
## game_id play_type Result n
## Length:16 Length:16 Length:16 Min. :13.00
## Class :character Class :character Class :character 1st Qu.:19.00
## Mode :character Mode :character Mode :character Median :22.50
## Mean :23.12
## 3rd Qu.:28.25
## Max. :31.00
df_plays_w_chart=ggplot()+
geom_point(data=df_plays_w, aes(x=game_id, y=n, color=Result))+
geom_point(data=df_plays_r, aes(x=game_id, y=n, color=Result), shape=17)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
df_plays_w_chart
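The visual impression from the chart can be backed by a number. A minimal sketch, assuming one row per game with pass and run counts and a `Result` column coded "W"/"L": it computes the correlation between the share of pass plays and a 0/1 win indicator. The values below are made up and are not the 2019 results; `games`, `pass_n`, `run_n`, `win`, `pass_share` and `r` are hypothetical names.

```r
# Made-up per-game counts and results, for illustration only.
games <- data.frame(
  pass_n = c(45, 38, 41, 30, 28, 33),
  run_n  = c(15, 20, 18, 29, 31, 27),
  Result = c("L", "L", "L", "W", "W", "W")
)

# Code the outcome as 1 for a win and 0 for a loss.
games$win <- as.integer(games$Result == "W")

# Share of pass plays among pass + run plays in each game.
games$pass_share <- games$pass_n / (games$pass_n + games$run_n)

# Pearson correlation between pass share and winning
# (with a 0/1 variable this is the point-biserial correlation).
r <- cor(games$pass_share, games$win)
r
```

A negative value would agree with the pattern in the chart. With only 16 games in a season the estimate is noisy, so `cor.test()` can be used to get a confidence interval alongside it.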
Last, I am adding a histogram of the yards gained on passing plays. The distribution is skewed to the right rather than normal: most plays are bunched at the low end, with the bulk of passing plays gaining under 10 yards and a long tail of big gains.
df_yards=df1[,c("play_type", "yards_gained", "Result")]
summary(df_yards)
## play_type yards_gained Result
## Length:1312 Min. :-13.000 Length:1312
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 3.334
## 3rd Qu.: 5.000
## Max. : 92.000
df_yards=filter(df_yards, play_type=="pass")
df_yards_chart=ggplot()+
geom_histogram(data=df_yards, aes(x=yards_gained, color=Result))
df_yards_chart
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
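Two quick checks can put numbers behind the histogram: the share of plays gaining fewer than 10 yards, and whether the mean exceeds the median (a simple sign of right skew). A minimal sketch on a made-up vector of yardages; `yards`, `share_under_10` and `right_skewed` are hypothetical names.

```r
# Made-up yards gained on ten pass plays, for illustration only.
yards <- c(-3, 0, 0, 4, 5, 6, 7, 8, 12, 45)

# Share of plays that gained fewer than 10 yards.
share_under_10 <- mean(yards < 10, na.rm = TRUE)

# In a right-skewed distribution a few big gains pull the mean above the median.
right_skewed <- mean(yards, na.rm = TRUE) > median(yards, na.rm = TRUE)

share_under_10
right_skewed
```

On the real data the same two lines could be applied to `df_yards$yards_gained` after filtering to pass plays.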
In conclusion, I believe the team should take a more balanced approach between the passing and running game in the future, since based on the past data they won more games when they did. Obviously more analysis needs to be done, such as on the team's personnel and their skill set in the passing and running game, but this is a good start towards a deeper dive. Next steps would be to apply a KNN or Random Forest algorithm to predict wins and losses based on the many variables.
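As a rough idea of what the KNN next step might look like, here is a minimal sketch using leave-one-out cross-validation from the class package. It assumes one row per game with pass and run counts as the only features. The data are made up and are not the 2019 season, and `games`, `features`, `pred` and `accuracy` are hypothetical names.

```r
library(class)  # provides knn.cv(); a recommended package shipped with R

# Made-up per-game features and outcomes, for illustration only.
games <- data.frame(
  pass_n = c(45, 38, 41, 44, 30, 28, 33, 31),
  run_n  = c(15, 20, 18, 16, 29, 31, 27, 30),
  Result = factor(c("L", "L", "L", "L", "W", "W", "W", "W"))
)

# KNN is distance-based, so put the features on a common scale first.
features <- scale(games[, c("pass_n", "run_n")])

# Leave-one-out cross-validation: each game is predicted from the other games.
pred <- knn.cv(train = features, cl = games$Result, k = 3)

# Fraction of games whose outcome was predicted correctly.
accuracy <- mean(pred == games$Result)
accuracy
```

With only 16 games per team and season, any accuracy figure from a model like this would be very uncertain, so pooling several teams or seasons would likely be needed before drawing conclusions.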