As a football fan, I want to find out whether there is a correlation between passing plays and whether a team wins or not. I have chosen the 2019 NFL regular season data set to look for such a correlation. The data set contains play-by-play data adding up to 45546 observations and 256 variables. It is missing the Win/Loss and Score variables, so I will join a table from another data set in order to relate play calling to results. These are the questions I want to answer for NYJ (the New York Jets), my favorite football team: Does the team run more pass plays or run plays? Based on the count of each play type, is the team more likely to win or lose? Is the team gaining big yardage, or are most of the yardage gains less than 10 yards?

Loading dplyr for data manipulation, RCurl for reading the GitHub URL, and ggplot2 for the charts.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RCurl)
library(ggplot2)

Reading and importing the play-by-play CSV from GitHub.

data <- getURL("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2019.csv", ssl.verifypeer=0L, followlocation=1L)
df=read.csv(text=data)

I am printing only the dimensions of the data set, because head(df) would print all 256 variables.

dim(df)
## [1] 45546   256
df=data.frame(df)

Here I am selecting specific columns out of the 256 variables and then using head() to show the first 6 rows of the data set.

df1=df[,c(1,2,3,4,5,6,7,8,9,10,13,15,16,17,18,19,20,21,22,23,24,26,27,28,34,35,36,37,38)]
head(df1)
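Selecting by position works, but it breaks silently if the column order of the source file ever changes. As a minimal sketch, assuming a small synthetic stand-in for the play-by-play table (the values and the `extra_col` column below are made up for illustration), the same kind of selection can be written by column name:

```r
library(dplyr)

# Toy stand-in for the play-by-play data frame (synthetic values, not real plays)
toy <- data.frame(
  play_id      = c(36, 51, 79),
  game_id      = c(2019090500, 2019090500, 2019090800),
  posteam      = c("GB", "CHI", "NYJ"),
  play_type    = c("pass", "run", "pass"),
  yards_gained = c(0, 5, 12),
  extra_col    = c(1, 2, 3)   # hypothetical column we do not need
)

# Select by name so the code keeps working if column positions change
toy_small <- toy %>% select(play_id, game_id, posteam, play_type, yards_gained)
names(toy_small)
```

The names used here (play_id, game_id, posteam, play_type, yards_gained) all appear in the summary output of the real data, so the same pattern should carry over to df.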

From the 45546 observations, I want to keep only the plays where the New York Jets have possession (posteam == "NYJ") and then take a summary of the data.

df1=filter(df1, posteam =="NYJ")
summary(df1)
##     play_id          game_id           home_team          away_team        
##  Min.   :  36.0   Min.   :2.019e+09   Length:1312        Length:1312       
##  1st Qu.: 985.2   1st Qu.:2.019e+09   Class :character   Class :character  
##  Median :2042.0   Median :2.019e+09   Mode  :character   Mode  :character  
##  Mean   :2084.9   Mean   :2.019e+09                                        
##  3rd Qu.:3192.2   3rd Qu.:2.019e+09                                        
##  Max.   :4546.0   Max.   :2.019e+09                                        
##                                                                            
##    posteam          posteam_type         defteam          side_of_field     
##  Length:1312        Length:1312        Length:1312        Length:1312       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   yardline_100    game_date         game_seconds_remaining  quarter_end
##  Min.   : 1.00   Length:1312        Min.   :   2           Min.   :0   
##  1st Qu.:35.00   Class :character   1st Qu.: 863           1st Qu.:0   
##  Median :56.00   Mode  :character   Median :1834           Median :0   
##  Mean   :53.52                      Mean   :1771           Mean   :0   
##  3rd Qu.:75.00                      3rd Qu.:2700           3rd Qu.:0   
##  Max.   :99.00                      Max.   :3600           Max.   :0   
##                                                                        
##      drive             sp               qtr             down      
##  Min.   : 1.00   Min.   :0.00000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 5.00   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :11.00   Median :0.00000   Median :2.000   Median :2.000  
##  Mean   :11.58   Mean   :0.05793   Mean   :2.498   Mean   :2.062  
##  3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :26.00   Max.   :1.00000   Max.   :4.000   Max.   :4.000  
##                                                    NA's   :162    
##    goal_to_go         time              yrdln              ydstogo      
##  Min.   :0.0000   Length:1312        Length:1312        Min.   : 0.000  
##  1st Qu.:0.0000   Class :character   Class :character   1st Qu.: 4.000  
##  Median :0.0000   Mode  :character   Mode  :character   Median :10.000  
##  Mean   :0.0343                                         Mean   : 7.861  
##  3rd Qu.:0.0000                                         3rd Qu.:10.000  
##  Max.   :1.0000                                         Max.   :33.000  
##                                                                         
##      ydsnet        play_type          yards_gained        shotgun      
##  Min.   :-13.00   Length:1312        Min.   :-13.000   Min.   :0.0000  
##  1st Qu.:  5.00   Class :character   1st Qu.:  0.000   1st Qu.:0.0000  
##  Median : 26.00   Mode  :character   Median :  0.000   Median :1.0000  
##  Mean   : 32.83                      Mean   :  3.334   Mean   :0.5191  
##  3rd Qu.: 60.00                      3rd Qu.:  5.000   3rd Qu.:1.0000  
##  Max.   : 96.00                      Max.   : 92.000   Max.   :1.0000  
##                                                                        
##  pass_length        pass_location        air_yards      yards_after_catch
##  Length:1312        Length:1312        Min.   :-9.000   Min.   :-4.000   
##  Class :character   Class :character   1st Qu.: 1.000   1st Qu.: 1.000   
##  Mode  :character   Mode  :character   Median : 5.000   Median : 4.000   
##                                        Mean   : 7.812   Mean   : 5.427   
##                                        3rd Qu.:13.000   3rd Qu.: 7.000   
##                                        Max.   :43.000   Max.   :59.000   
##                                        NA's   :791      NA's   :989      
##  run_location      
##  Length:1312       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

For my first chart, I want to graph the pass count for each game over the duration of the season. In the 12th game of the season, there were over 50 pass attempts. We will see later whether the team won that game or not.

df_plays=df1 %>% count(game_id, play_type)
df_plays=filter(df_plays, play_type=="pass")
df_plays$game_id=as.character(df_plays$game_id)
summary(df_plays)
##    game_id           play_type               n        
##  Length:16          Length:16          Min.   :27.00  
##  Class :character   Class :character   1st Qu.:32.75  
##  Mode  :character   Mode  :character   Median :34.50  
##                                        Mean   :35.81  
##                                        3rd Qu.:38.25  
##                                        Max.   :52.00
df_plays_chart=ggplot(data=df_plays, aes(x=game_id, y=n))+
  geom_bar(stat="identity", fill="blue")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
df_plays_chart

Here we are creating a box plot to see the median and the interquartile range and to look at the spread of the per-game counts for each play type. We see the team passes more than it runs.

df_plays1=df1 %>% count(game_id, play_type)
df_plays1$game_id=as.character(df_plays1$game_id)
summary(df_plays1)
##    game_id           play_type               n        
##  Length:116         Length:116         Min.   : 1.00  
##  Class :character   Class :character   1st Qu.: 3.00  
##  Mode  :character   Mode  :character   Median : 5.50  
##                                        Mean   :11.31  
##                                        3rd Qu.:17.50  
##                                        Max.   :52.00
df_plays1_chart=ggplot(data=df_plays1, aes(x=play_type, y=n, fill=play_type))+
  geom_boxplot()+
  theme(axis.text.x = element_text(angle = 90))
df_plays1_chart
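To put a number on the pass versus run preference, the per-game counts can be turned into a pass share. This is a rough sketch on synthetic counts (the game ids and numbers below are hypothetical); it assumes a table shaped like the output of count(game_id, play_type):

```r
library(dplyr)

# Synthetic play counts per game (hypothetical numbers, for illustration only)
toy_counts <- data.frame(
  game_id   = c("g1", "g1", "g2", "g2"),
  play_type = c("pass", "run", "pass", "run"),
  n         = c(40, 20, 30, 30)
)

# Share of pass plays among pass + run plays, one row per game
pass_share <- toy_counts %>%
  filter(play_type %in% c("pass", "run")) %>%
  group_by(game_id) %>%
  summarise(pass_share = sum(n[play_type == "pass"]) / sum(n))
pass_share
```

A share above 0.5 means the team passed more than it ran in that game.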

Here I am creating a pie chart for an exploratory analysis to see the ratios of the different play types.

# coord_polar() has to be added with + so that it becomes part of the chart
df_plays2_chart=ggplot(data=df_plays1, aes(x="", y=n, fill=play_type))+
  geom_bar(stat="identity", width=1)+
  coord_polar("y", start=0)
df_plays2_chart

I am importing another data set, which includes game results and scores.

nyj <- getURL("https://github.com/mianshariq/SPS_Bridge/raw/27d4151e8dfee571968ccd0819d24efec8263f40/NYJ_Games.csv", ssl.verifypeer=0L, followlocation=1L)
df_nyj=read.csv(text=nyj)
df_nyj

Here I am merging the NFL data set with the results data set.

df1=merge(x = df1, y = df_nyj, by = "game_id", all = TRUE)
dim(df1)
## [1] 1312   31
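Because all = TRUE keeps rows from either table even when there is no match, it can be worth checking that every game in the plays has a result. The following is a small sketch on synthetic tables (the game ids and results are made up); only the column names game_id and Result come from the data above:

```r
library(dplyr)

# Synthetic stand-ins for the plays and the results table
plays   <- data.frame(game_id = c(1, 1, 2, 3),
                      play_type = c("pass", "run", "pass", "run"))
results <- data.frame(game_id = c(1, 2), Result = c("W", "L"))

# Games that appear in the plays but have no row in the results table
unmatched <- anti_join(distinct(plays, game_id), results, by = "game_id")
unmatched

# After a left join, unmatched plays show up as NA in Result
joined <- left_join(plays, results, by = "game_id")
sum(is.na(joined$Result))
```

If unmatched has zero rows on the real data, every play received a result and the join did not introduce missing values.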

Here I am combining two charts to show the pass and run plays for each game and whether more pass or run plays go along with a win or a loss. Circles are pass counts and triangles are run counts. From the chart you can see that when the team has many pass plays in a game they tend to lose, and the opposite when they call more run plays: they tend to win those games. This is merely a correlation, so a deeper dive is needed before claiming causation. You can also see from the chart that for the first 8 games the team was more of a passing team, with only 1 win over those 8 games, while for the remaining 8 games there was a better balance between passing and running plays and they won 6 out of 8 games.

df_plays_w=df1 %>% count(game_id, play_type, Result)
df_plays_w=filter(df_plays_w, play_type=="pass")
df_plays_w$game_id=as.character(df_plays_w$game_id)
summary(df_plays_w)
##    game_id           play_type            Result                n        
##  Length:16          Length:16          Length:16          Min.   :27.00  
##  Class :character   Class :character   Class :character   1st Qu.:32.75  
##  Mode  :character   Mode  :character   Mode  :character   Median :34.50  
##                                                           Mean   :35.81  
##                                                           3rd Qu.:38.25  
##                                                           Max.   :52.00
df_plays_r=df1 %>% count(game_id, play_type, Result)
df_plays_r=filter(df_plays_r, play_type=="run")
df_plays_r$game_id=as.character(df_plays_r$game_id)
summary(df_plays_r)
##    game_id           play_type            Result                n        
##  Length:16          Length:16          Length:16          Min.   :13.00  
##  Class :character   Class :character   Class :character   1st Qu.:19.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :22.50  
##                                                           Mean   :23.12  
##                                                           3rd Qu.:28.25  
##                                                           Max.   :31.00
df_plays_w_chart=ggplot()+
  geom_point(data=df_plays_w, aes(x=game_id, y=n, color=Result))+
  geom_point(data=df_plays_r, aes(x=game_id, y=n, color=Result), shape=17)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
df_plays_w_chart
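The pattern in the chart can also be summarized as average play counts by result. This is only a sketch on synthetic games (the counts below are invented, and I am assuming Result is coded as "W" and "L"):

```r
library(dplyr)

# Synthetic per-game counts (hypothetical, not the real NYJ season)
toy_games <- data.frame(
  game_id = as.character(1:6),
  Result  = c("L", "L", "L", "W", "W", "W"),
  pass    = c(45, 40, 38, 30, 33, 36),
  run     = c(15, 18, 20, 28, 27, 25)
)

# Average pass and run counts for wins and for losses
by_result <- toy_games %>%
  group_by(Result) %>%
  summarise(mean_pass = mean(pass), mean_run = mean(run), games = n())
by_result
```

With only 16 games in a season, a difference in these averages is still just descriptive and does not show that play calling caused the result.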

Last, I am adding a histogram of the yards gained in the passing game. You can see a distribution that is skewed to the right rather than normal. Most of the data sits in the low range, as most of the plays gain under 10 yards.

df_yards=df1[,c("play_type", "yards_gained", "Result")]
summary(df_yards)
##   play_type          yards_gained        Result         
##  Length:1312        Min.   :-13.000   Length:1312       
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  3.334                     
##                     3rd Qu.:  5.000                     
##                     Max.   : 92.000
df_yards=filter(df_yards, play_type=="pass")
df_yards_chart=ggplot()+
  geom_histogram(data=df_yards, aes(x=yards_gained, color=Result))
df_yards_chart
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
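The "under 10 yards" reading of the histogram can be checked directly. Here is a minimal sketch in base R, assuming a made-up vector of yards gained (these values are synthetic, not taken from the data set):

```r
# Synthetic yards gained on pass plays (hypothetical values)
yards <- c(-3, 0, 0, 2, 4, 5, 6, 7, 8, 9, 12, 15, 25, 60)

# Share of plays that gained fewer than 10 yards
share_under_10 <- mean(yards < 10)

# A mean above the median is a quick sign of right skew
right_skewed <- mean(yards) > median(yards)

share_under_10
right_skewed
```

On the real data the same two lines could be applied to df_yards$yards_gained after filtering to pass plays.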

In conclusion, I believe that the team should take a more balanced approach to the passing and running game in the future, since based on the past data they won more games when they did so. Obviously more analysis needs to be done, such as looking at the team personnel and their skill set in the passing and running game, but this is a good start towards a deeper dive. Next steps would be to apply a KNN or Random Forest algorithm to predict W and L based on the many variables.
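As a starting point for that next step, here is a rough KNN sketch using knn.cv() from the class package, which performs leave-one-out cross-validation. The per-game counts are synthetic and the choice of k = 3 is an assumption for illustration, not a tuned value:

```r
library(class)  # recommended package that ships with standard R installs

# Synthetic per-game features (hypothetical, not the real NYJ season)
toy_games <- data.frame(
  pass   = c(45, 40, 38, 42, 30, 33, 36, 29),
  run    = c(15, 18, 20, 16, 28, 27, 25, 30),
  Result = factor(c("L", "L", "L", "L", "W", "W", "W", "W"))
)

# Scale the features so pass and run counts contribute comparably to distance
features <- scale(toy_games[, c("pass", "run")])

# Leave-one-out KNN: each game is predicted from the other games
pred <- knn.cv(train = features, cl = toy_games$Result, k = 3)

# Fraction of games whose result was predicted correctly
accuracy <- mean(pred == toy_games$Result)
accuracy
```

With a single season of 16 games any accuracy estimate will be very noisy, so more seasons or more teams would be needed before trusting such a model.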