-A survey of NFL field goal attempts from 2005 to 2015-

This report explores a compilation of NFL data covering more than eleven thousand field goal attempts. It addresses intuitive questions of interest, the statistical significance of the explanatory variables offered by the data source, and the predictive ability that modeling this information provides.

Abstract

The information in this data file is analyzed with Generalized Linear Modeling (GLM) in the hope of discovering, specifically, which factors most significantly influence the outcome of a given try and, generally, how accurately that influence can be quantified.

Of particular interest to the researcher was the relationship among the explanatory variables that may persuade or dissuade a field goal attempt.
Without question, this decision hinges on the likelihood of each of the two possible outcomes of a field goal try. The chess moves required of opposing teams on game day balance necessity against probability.

Ultimately, the purpose of a field goal try is to add three points to the kicking team’s score. In order to gauge the likelihood of accomplishing this feat, specific inputs offered within the data allow for sound mathematical approaches toward the discovery of what factors most significantly influence the probabilities at hand. With what accuracy this causality can be assigned is of primary importance throughout the analysis.

The purpose of this study having been revealed, the methods for achieving these ends will be explored in the ensuing section. These efforts were fruitful.
In the graphs, formulas, and broader analysis that follow, an equation is derived by which the odds of a particular attempt can be assessed from such factors as the distance of the attempt (in yards), the temperature at the time of the kick, and more.

Data Cleaning

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(RCurl) 
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
nfl_kick <- read.csv("nflkick.csv")
attach(nfl_kick)

The data for this study were obtained from GitHub.
https://raw.githubusercontent.com/statsbylopez/StatsSports/master/Data/nfl_fg.csv

The source data was extended with an additional column giving the predicted percentage likelihood of success for each observed attempt. The formula for calculating this ‘Prediction_Percent’ will be described later in the report. For preliminary disclosure, the predicted percentage is derived, per row, from the corresponding input variables, weighted according to regression modeling on the eleven years (ten seasons) covered by this data set.

The data manipulation shown immediately below transforms the object Success for purposes of regression modeling. Binary outcomes are labeled and made into factors. The first output confirms the class change in Success.

nfl_kick$Success  = ifelse(nfl_kick$Success == 0, 'Miss','Make') # Create levels for Success Variable
nfl_kick$Success = as.factor(nfl_kick$Success)

class(nfl_kick$Success)
## [1] "factor"

The levels of Success are then ordered explicitly, and the resulting contrasts are confirmed.

nfl_kick$Success <- factor(nfl_kick$Success, levels =c('Miss', 'Make'))
contrasts(nfl_kick$Success)
##      Make
## Miss    0
## Make    1
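
The level order matters because `glm()` with `family = binomial` treats the first factor level as failure and every other level as success. A minimal sketch of this, using a made-up vector (`toy` is illustrative and not taken from the data set):

```r
# Hypothetical toy outcomes, not taken from the field goal data
toy <- c(1, 0, 1, 1, 0, 1)

# Same recoding as in the report
toy_f <- factor(ifelse(toy == 0, "Miss", "Make"), levels = c("Miss", "Make"))

# "Miss" is the first (reference) level, so the model estimates P(Make)
levels(toy_f)

# An intercept-only logistic fit recovers the observed share of makes
fit <- glm(toy_f ~ 1, family = binomial)
p_make <- unname(plogis(coef(fit)[1]))
p_make
```

With four makes in six tries, the fitted probability should sit at about two thirds, which confirms that the model is describing makes rather than misses.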

The column manipulation described above is reflected in the abridged view of the data below. Note that the last column of the extended field goal data is labeled ‘Prediction_Percent’. Again, the process for ascertaining the weighting of these parameters will be covered in more detail.

ext.fg.d8a <- mutate(nfl_kick, Prediction_Percent = round(exp(-106.2 - (.1046 *Distance) + (.05574 *Year)) / 
                                                   (1 + exp(-106.2 - (.1046 *Distance) + (.05574 *Year))), 4) *100)

head(ext.fg.d8a, 15)
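
The expression inside `mutate()` is the inverse logit of the linear predictor. As a sketch, the same calculation can be written with base R's `plogis()`. The helper name `predict_pct` is hypothetical, and the coefficients are the rounded values used above, so the results are approximate.

```r
# Rounded coefficients as used in the Prediction_Percent formula above
b0 <- -106.2
b_dist <- -0.1046
b_year <- 0.05574

# Hypothetical helper: predicted make percentage for a kick
predict_pct <- function(distance, year) {
  eta <- b0 + b_dist * distance + b_year * year  # linear predictor (log-odds)
  round(plogis(eta), 4) * 100                    # inverse logit, as a percentage
}

# Explicit exp(eta) / (1 + exp(eta)) form, for comparison
eta <- b0 + b_dist * 45 + b_year * 2012
manual <- round(exp(eta) / (1 + exp(eta)), 4) * 100

predict_pct(45, 2012)
manual
```

Because the function is vectorized, it could also be applied to whole columns, for example `predict_pct(nfl_kick$Distance, nfl_kick$Year)`.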

Graphical Interpretations

Initial graphical reviews of the inputs provide insight into the relationships among these influencing components and shed light on which variables contribute most to estimating the outcome of a field goal attempt. A few basic statistics are presented first.

Number of kicks observed:

nrow(nfl_kick)
## [1] 11187

Distinct distances of field goal attempts (in yards):

all.dist.attempted <- sort(unique(Distance))

In the scatter plot that follows, we explore the relationship between the score differential and distance. The question was whether teams tend to attempt longer field goals when trailing than when ahead, which would appear as a negative relationship. The relationship is indeed negative and statistically significant, but it is far weaker than supposed (correlation coefficient -.0513; see the output after the scatter plot).

ggplot(nfl_kick, aes(x =ScoreDiff, y =Distance)) +
  geom_point(color ='red', size =1) +
  geom_smooth(method ='lm', formula =y ~x, se =TRUE, level =.95)

cor.test(nfl_kick$ScoreDiff, nfl_kick$Distance)
## 
##  Pearson's product-moment correlation
## 
## data:  nfl_kick$ScoreDiff and nfl_kick$Distance
## t = -5.4355, df = 11185, p-value = 5.58e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06979159 -0.03282712
## sample estimates:
##         cor 
## -0.05132693
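
The reported t statistic follows directly from the correlation and its degrees of freedom, which illustrates how a very small effect can still be highly significant with this many observations. A short check using only the numbers printed above:

```r
r  <- -0.05132693   # sample correlation from the output above
df <- 11185         # degrees of freedom (n - 2)

# t statistic for testing a correlation of zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of the variance in Distance linearly associated with ScoreDiff
r_squared <- r^2

c(t = t_stat, r_squared = r_squared)
```

The squared correlation is well under one percent, so score differential explains almost none of the variation in attempted distance despite the small p-value.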

One relationship to consider is the distribution of attempted field goals relative to temperature. Here, temperature may serve as a partial proxy for weather and stadium-dependent factors. A histogram to this effect is provided here.

ggplot(nfl_kick, aes(x =Temp, color =Success, fill =Success)) +
  geom_histogram(alpha =.4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The negative skew reflects a long left tail: extremely low temperatures occur less frequently. Curiously, a non-linear relationship may exist here, including an inability to maintain field goal percentage (all else equal) once temperature exceeds a certain threshold. A quadratic term for temperature was tested but did not prove statistically significant.
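
The report does not show how the quadratic term was tested. One common approach, sketched here on simulated data, is to add `I(Temp^2)` to the model and compare the two fits. The data frame `sim` and its generating coefficients are made up so the block runs on its own.

```r
set.seed(7)

# Simulated stand-in for nfl_kick (made-up values, linear Temp effect only)
n <- 2000
sim <- data.frame(
  Distance = sample(18:60, n, replace = TRUE),
  Temp     = round(runif(n, 10, 95))
)
sim$Success <- rbinom(n, 1, plogis(5.2 - 0.10 * sim$Distance + 0.008 * sim$Temp))

linear    <- glm(Success ~ Distance + Temp,             data = sim, family = binomial)
quadratic <- glm(Success ~ Distance + Temp + I(Temp^2), data = sim, family = binomial)

# Wald test for the squared term, then a likelihood-ratio test of the two fits
coef(summary(quadratic))["I(Temp^2)", ]
anova(linear, quadratic, test = "Chisq")
```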

Another lens into the possibility of a stadium effect is captured, at least in part, by the relationship between temperature and field surface, denoted true and false for grass and artificial turf, respectively. Notice the similar but unequal distributions for grass and turf fields.

boxplot(nfl_kick$Temp ~nfl_kick$Grass, ylab ='Temperature in degrees F', 
        xlab ='Field Surface: F =Turf, T =Sod',
        main ='Boxplot of Temperature relative to Field Surface', 
        boxlwd =2, outlwd =2, col ='darkorange1', outpch =21, outbg ='cyan')

Kicking percentage is expected to decrease as the distance of the attempt increases. A density plot of all kicks follows, and then a printout of the correlation between the distance of attempts and temperature.

ggplot(nfl_kick, aes(x =Distance, fill =Success)) +
  geom_density(alpha =.4)

cor.test(nfl_kick$Temp, nfl_kick$Distance)
## 
##  Pearson's product-moment correlation
## 
## data:  nfl_kick$Temp and nfl_kick$Distance
## t = 3.8978, df = 9126, p-value = 9.776e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02027001 0.06123177
## sample estimates:
##        cor 
## 0.04076802

Methodologies

The initial process was to establish linear strength between the most intuitive factors influencing the outcomes of field goal tries. Distance, as was shown in the early findings, seemed to have the strongest association with success, the designated column for attempt outcome.

This binary response variable was first fitted against yardage with a basic GLM, and the relationship proved significant (p-value < 2e-16).

fg.glm <- glm(nfl_kick$Success ~nfl_kick$Distance, family =binomial)

summary(fg.glm)
## 
## Call:
## glm(formula = nfl_kick$Success ~ nfl_kick$Distance, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7193   0.2479   0.4086   0.6297   1.5497  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        5.724620   0.137223   41.72   <2e-16 ***
## nfl_kick$Distance -0.102615   0.003135  -32.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10105.0  on 11186  degrees of freedom
## Residual deviance:  8748.4  on 11185  degrees of freedom
## AIC: 8752.4
## 
## Number of Fisher Scoring iterations: 5
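
To make the coefficients concrete, the sketch below converts the fitted intercept and slope into an odds ratio per yard, predicted make probabilities at a few distances, and the distance at which a kick becomes an even bet. The distances chosen are arbitrary illustrations.

```r
b0 <- 5.724620    # intercept from summary(fg.glm)
b1 <- -0.102615   # slope on Distance

# Multiplicative change in the odds of a make for each additional yard
odds_ratio_per_yard <- exp(b1)

# Predicted probability of a make at illustrative distances
yards  <- c(20, 30, 40, 50, 60)
p_make <- plogis(b0 + b1 * yards)

# Distance at which the fitted probability equals 0.5 (log-odds of zero)
even_odds_yards <- -b0 / b1

odds_ratio_per_yard
round(setNames(p_make, yards), 3)
even_odds_yards
```

Each added yard multiplies the odds of a make by roughly 0.90, and the fitted probability crosses one half a little short of 56 yards.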

A more inclusive logistic regression was then attempted in order to arrive at an equation that anticipates field goal outcomes as a function of the pertinent explanatory variables. A fuller model was rendered as follows.

fg.fuller.glm <- glm(nfl_kick$Success ~Distance + Year + 
                    Grass + Temp, family =binomial)

summary(fg.fuller.glm)
## 
## Call:
## glm(formula = nfl_kick$Success ~ Distance + Year + Grass + Temp, 
##     family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8171   0.2351   0.3905   0.6400   1.7264  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.185e+02  1.935e+01  -6.123 9.20e-10 ***
## Distance    -1.095e-01  3.596e-03 -30.454  < 2e-16 ***
## Year         6.175e-02  9.636e-03   6.408 1.47e-10 ***
## GrassTRUE   -1.818e-01  6.294e-02  -2.888  0.00387 ** 
## Temp         8.386e-03  1.876e-03   4.469 7.85e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8256.5  on 9127  degrees of freedom
## Residual deviance: 7028.7  on 9123  degrees of freedom
##   (2059 observations deleted due to missingness)
## AIC: 7038.7
## 
## Number of Fisher Scoring iterations: 5

Incomplete data (NA values) for temperature, together with overlap between temperature and the grass (field surface) variable, prompted the decision to pare the model down. The aim was an efficient predictive model that would also reflect the league's trend toward building ever more weather-neutral stadiums. With this consideration, distance and year constituted an ideal model for constructing the Prediction_Percent output, which allowed for generalized mapping of the odds and percentage likelihoods of kicks in the data assessed. This model is presented here.

fg.optimal.glm <- glm(nfl_kick$Success ~Distance + Year, family =binomial)

summary(fg.optimal.glm)
## 
## Call:
## glm(formula = nfl_kick$Success ~ Distance + Year, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7744   0.2491   0.3980   0.6431   1.5753  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.062e+02  1.731e+01  -6.137 8.39e-10 ***
## Distance    -1.046e-01  3.171e-03 -32.973  < 2e-16 ***
## Year         5.574e-02  8.620e-03   6.467 1.00e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10105.0  on 11186  degrees of freedom
## Residual deviance:  8706.3  on 11184  degrees of freedom
## AIC: 8712.3
## 
## Number of Fisher Scoring iterations: 5
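
Because the fuller model drops 2059 rows with missing values, its deviance and AIC are not directly comparable with those of the models fit to all 11187 kicks. One way to compare like with like, sketched here on simulated data, is to refit both models on the rows that are complete for every variable. The data frame `sim` is made up so the block runs on its own.

```r
set.seed(1)

# Simulated stand-in for nfl_kick (values are made up, not the real data)
n <- 500
sim <- data.frame(
  Distance = sample(18:60, n, replace = TRUE),
  Year     = sample(2005:2015, n, replace = TRUE),
  Temp     = round(rnorm(n, mean = 60, sd = 15))
)
sim$Success <- rbinom(n, 1, plogis(5.7 - 0.10 * sim$Distance))
sim$Temp[sample(n, 100)] <- NA   # mimic missing temperature readings

# Keep only rows where every model variable is observed
cc <- sim[complete.cases(sim[, c("Success", "Distance", "Year", "Temp")]), ]

# Both models now use exactly the same kicks, so their AIC values are comparable
small <- glm(Success ~ Distance + Year,        data = cc, family = binomial)
full  <- glm(Success ~ Distance + Year + Temp, data = cc, family = binomial)

c(n_small = nobs(small), n_full = nobs(full))
AIC(small, full)
```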

Lastly, this model was compared with the simple model with which this section started. A likelihood-ratio test (analysis of deviance) was conducted. The two models differ by a single parameter, so the drop in deviance of 42.087 is referred to a chi-square distribution with one degree of freedom. The test supported (p-value of roughly 8.7e-11) the inclusion of Year in modeling the percentage to assign to a specified kick.

anova(fg.glm, fg.optimal.glm, test ="Chisq")
pchisq(42.087, df =1, lower.tail =FALSE)
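
The test statistic can also be recovered from the residual deviances printed in the two summaries. Assuming the models differ only by the Year term, the difference is referred to a chi-square distribution with one degree of freedom. The rounded deviances give 42.1 rather than the more precise 42.087.

```r
# Residual deviances and residual degrees of freedom from the two summaries
dev_simple  <- 8748.4; df_simple  <- 11185   # Success ~ Distance
dev_optimal <- 8706.3; df_optimal <- 11184   # Success ~ Distance + Year

# Likelihood-ratio statistic and its degrees of freedom
lr_stat <- dev_simple - dev_optimal
lr_df   <- df_simple - df_optimal

# Upper-tail chi-square probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p_value = p_value)
```

Using `lower.tail = FALSE` avoids the loss of precision that `1 - pchisq(...)` can suffer when the p-value is very small.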

Results

Pictorial Revelations

As the various regression models demonstrated, a competent model can be built that estimates the likelihood of a successful field goal attempt with respectable accuracy from a handful of variables. Distance is the primary determinant of these odds, as the plots below make evident. They show the odds of a successful field goal against the distance of the kick in yards.

nfl.odds <- mutate(nfl_kick, Odds = exp(5.72462 - (.102615 * Distance)))

ggplot(nfl.odds, aes(x =Distance, y =Odds)) +
  geom_point(color ='deeppink3', size =3.5, pch =19)

fg.odds <- mutate(nfl_kick, Odds =exp(-1.062e2 - (1.046e-1 *Distance) + (5.574e-2 *Year)))
#head(fg.odds, 3)
ggplot(fg.odds, aes(x =Distance, y =Odds)) +
  geom_point(color ='deeppink3', size =3.5, pch =21)
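
Typing rounded coefficients by hand can shift the result noticeably here, because the Year coefficient is multiplied by values near 2010. As an alternative sketch, the odds can be taken from a fitted model object with `predict()`. The data frame `sim` below is simulated only so the block runs on its own; with the real data, `fg.optimal.glm` would take the place of `fit`.

```r
set.seed(42)

# Simulated stand-in for nfl_kick (made-up values)
n <- 1000
sim <- data.frame(
  Distance = sample(18:60, n, replace = TRUE),
  Year     = sample(2005:2015, n, replace = TRUE)
)
sim$Success <- rbinom(n, 1, plogis(-106 - 0.10 * sim$Distance + 0.056 * sim$Year))

fit <- glm(Success ~ Distance + Year, data = sim, family = binomial)

# Odds from the fitted object: exponentiate the linear predictor
sim$Odds <- exp(predict(fit, type = "link"))

# Same odds from the full-precision coefficients, for comparison
b <- coef(fit)
manual_odds <- exp(b[["(Intercept)"]] + b[["Distance"]] * sim$Distance +
                   b[["Year"]] * sim$Year)

all.equal(unname(sim$Odds), manual_odds)
```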

A similar plot of the predicted percentages of a field goal try, as explained by the two principal variables (namely, distance and year), shows a progressive improvement in accuracy from year to year. This trend is showcased beneath. Note the improvement in field goal percentages across the span of the eleven years supplied by the data set.

ggplot(ext.fg.d8a, aes(x =Distance, y =Prediction_Percent, color =Year)) +
  geom_point(size =2.5, pch =21) 
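
The size of the year effect can be read from the Year coefficient of the distance-and-year model. A brief sketch, holding distance fixed:

```r
b_year <- 0.05574   # Year coefficient from summary(fg.optimal.glm)

# Multiplicative change in the odds of a make
or_one_year  <- exp(b_year)        # one season later
or_ten_years <- exp(10 * b_year)   # 2005 versus 2015

c(per_year = or_one_year, per_decade = or_ten_years)
```

Under this model the odds of a make rise by a little under six percent per season, which compounds to roughly a 75 percent increase in the odds between 2005 and 2015.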

Conclusions

From the years observed and the variables collected, we are able to devise a model with which to forecast the outcome of an attempted field goal in the year(s) immediately following the 2015 NFL season. What was revealed, even if it was given less than its due consideration at the outset of the study, is the improvement in accuracy over the timeline in question.

As this aspect of special teams play continues to develop, the chess pieces within the game of football will be employed with enhanced attention. The valuation of a kick, based on its expected outcome, changes the mechanics of real-time decisions. The strength of the model revealed by this study invites even more comprehensive data collection and subsequent analysis.

