Introduction

In 1977, the idea of applying statistical analysis to American professional sports was popularized by Bill James and the introduction of Sabermetrics. However, professional sports tend to be conservative and slow to change, and you still see articles and comments from players and coaches arguing that their favored sport cannot be boiled down to just data. Baseball began to explore the use of metrics in the 1990s and early 2000s, and over the last 10 years we have slowly seen the NFL, NBA, NHL, and professional soccer leagues begin to apply data analytics to their own sports. In this project we will apply the tools that we have learned this semester to build a recommender system for NFL play calling.

My interest in this topic is personal. I have spent 13 of the last 15 years as a high school football coach, including 3 years as an offensive coordinator and play caller. Currently, calling plays in football is based on a mixture of hours of video work spent looking for tendencies and weaknesses in the opponent’s defense, and an intuition for how your opponent will adjust throughout the game. You will also often take the time during the week to script the first 10 to 15 plays to get a feel for how a team will adjust to you. This leads to the production of massive play calling cards (although not nearly as large in high school) listing everything that you think will work in any given situation. The idea for this recommender is to provide a type of play and a direction to run the play in. For example, we would like to indicate that it’s first and 10

Data

For this project we will be using the play-by-play data for the 2013 through 2016 seasons, sourced from NFL Savant. The data is contained in a series of four CSV files, one per year. Let’s load one of the files and see what the data contains.

# Load the 2016 play-by-play file and preview the first few rows
pbp_2016 <- read.csv("../data/nfl/pbp-2016.csv")
head(pbp_2016)

We can see that the play-by-play data contains the following variables which may be of interest in our recommender:

Description of the Recommender System

We can see from the previous section that the data set contains a number of variables, some of which will be useful to our recommender and some of which will not. Our plan is to take the down and distance information, using either the actual distance or a distance binned into short, medium, and long, and recommend the play (play type and direction) that results in the most yards. Given that pass plays routinely gain more yards than runs, we will recommend the best 3 passes and the best 3 runs in that situation. While it would be nice to provide more information about what play to call, this is not included in our data and varies on a team-by-team basis. We will also use the past 4 years of data to provide enough observations to avoid cold start issues.
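As a rough illustration of the idea above, the sketch below bins yards to go into short, medium, and long and then ranks play directions by average yards gained, separately for passes and runs. It is only a baseline ranking on toy data, not the recommender itself. The column names (down, to_go, play_type, direction, yards), the cut points for the bins, and the helper functions are all assumptions made for this example and are not taken from the NFL Savant files.

```r
# Toy plays; column names and values are illustrative assumptions,
# not the actual NFL Savant headers or data.
plays <- data.frame(
  down      = c(1, 1, 1, 1, 3, 3, 3, 3),
  to_go     = c(10, 10, 10, 10, 2, 2, 8, 8),
  play_type = c("PASS", "PASS", "RUSH", "RUSH", "RUSH", "RUSH", "PASS", "PASS"),
  direction = c("SHORT LEFT", "DEEP RIGHT", "LEFT END", "CENTER",
                "CENTER", "RIGHT GUARD", "SHORT MIDDLE", "DEEP LEFT"),
  yards     = c(6, 15, 4, 3, 1, 3, 9, 0),
  stringsAsFactors = FALSE
)

# Bin yards to go: short (1-3), medium (4-7), long (8+).
# These cut points are an assumption chosen for the example.
bin_distance <- function(to_go) {
  as.character(cut(to_go, breaks = c(0, 3, 7, Inf),
                   labels = c("short", "medium", "long")))
}

# Top n directions for one play type in one down/distance situation,
# ranked by mean yards gained.
top_plays <- function(plays, down, distance, type, n = 3) {
  situation <- plays[plays$down == down &
                       bin_distance(plays$to_go) == distance &
                       plays$play_type == type, ]
  # No history for this situation: return an empty table
  if (nrow(situation) == 0) return(situation[, c("direction", "yards")])
  avg <- aggregate(yards ~ direction, data = situation, FUN = mean)
  head(avg[order(-avg$yards), ], n)
}

# Best n passes and best n runs for a situation
recommend <- function(plays, down, distance, n = 3) {
  list(pass = top_plays(plays, down, distance, "PASS", n),
       run  = top_plays(plays, down, distance, "RUSH", n))
}

rec <- recommend(plays, down = 1, distance = "long")
print(rec)
```

Ranking passes and runs separately keeps the higher-variance pass plays from crowding runs out of the list, which is the reason for recommending three of each.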

Given that each team is a unique entity, there is no reason to believe that what is successful for the New England Patriots will also be successful for the Carolina Panthers. Therefore, our recommender system will be based on a single team’s play-by-play data. However, we will also compute the recommendations across all teams to see whether they add useful information to the recommender. We will implement the recommender for a single NFL team first and then investigate the feasibility of implementing it for all teams.

We also recognize that some plays work better in some parts of the field than others, so we would like to incorporate field position into the data. We are concerned that this may make the matrix too sparse even with 4 years of data, so we may end up binning field position into categories like “Red Zone” (inside the opponent’s 20-yard line) or “Backed Up” (inside our own 20). We also plan to look at some of our other variables to see whether they help us provide a high-quality recommendation.
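A minimal sketch of the field position binning described above is shown here. It assumes a hypothetical yard_line value that counts yards from the offense's own goal line (1 to 99), so that 81 and above is inside the opponent's 20. The middle category name, “Open Field”, is also an assumption added for the example.

```r
# Assumes yard_line counts yards from the offense's own goal line (1 to 99).
# "Open Field" is a placeholder label for everything between the two 20s.
bin_field_position <- function(yard_line) {
  as.character(cut(yard_line,
                   breaks = c(0, 19, 80, 99),
                   labels = c("Backed Up", "Open Field", "Red Zone")))
}

yard_line <- c(5, 19, 20, 50, 80, 81, 99)
zones <- bin_field_position(yard_line)
print(data.frame(yard_line, zones))
```

Collapsing up to 99 distinct yard lines into 3 zones reduces the number of situations by the same factor, which is the point of the binning when the matrix is sparse.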

Implementation Plan

To implement this project we will need to complete the following tasks:

  1. Combine the data from the 2013-2016 seasons.
  2. Clean and augment the data to include our “user” variables, the down and distance combinations. Our “item” variables will be the play type and play direction.
  3. Use the yards gained on each play as our rating metric.
  4. Pick a team to be
  5. Use a Spark implementation of ALS (the best-performing method from the semester, and one that can be implemented in Spark).
  6. Build our recommender using the prediction matrix for our team, the prediction matrix for all teams, and any other interesting variables that we find or create.
  7. Evaluate the recommendation system.
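
Steps 2 and 3 of the plan can be sketched as follows. The example builds a small “user” by “item” matrix in which each row is a down and distance combination, each column is a play type and direction, and each cell is the mean yards gained. Empty cells are the ones a method such as ALS would try to predict. The data and column names are toy assumptions, and the long-format table at the end is only a guess at a convenient input shape, since ALS implementations generally expect integer user and item IDs with a numeric rating.

```r
# Toy plays; column names and values are illustrative assumptions.
plays <- data.frame(
  down      = c(1, 1, 1, 2, 2, 3),
  distance  = c("long", "long", "long", "medium", "medium", "short"),
  play_type = c("PASS", "RUSH", "PASS", "RUSH", "PASS", "RUSH"),
  direction = c("SHORT LEFT", "CENTER", "SHORT LEFT",
                "LEFT END", "DEEP RIGHT", "CENTER"),
  yards     = c(6, 3, 10, 5, 12, 2),
  stringsAsFactors = FALSE
)

# "User" = down and distance situation; "item" = play type and direction
plays$user <- paste(plays$down, plays$distance, sep = "_")
plays$item <- paste(plays$play_type, plays$direction, sep = "_")

# Rating matrix: mean yards per (situation, play); NA where never observed
ratings <- tapply(plays$yards, list(plays$user, plays$item), mean)
print(ratings)

# Long format with integer IDs, one row per observed cell
observed <- which(!is.na(ratings), arr.ind = TRUE)
triples <- data.frame(
  user_id = as.integer(observed[, "row"]),
  item_id = as.integer(observed[, "col"]),
  rating  = ratings[observed]
)
print(triples)

# Share of cells with no observation, a quick check on sparsity
sparsity <- mean(is.na(ratings))
```

Checking the sparsity value before fitting gives an early signal of whether adding variables such as field position leaves enough observed cells to work with.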
