library(ggplot2)
library(GGally)## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(RCurl)
library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
nfl_kick <- read.csv("nflkick.csv")
attach(nfl_kick)
data(nfl_kick)
names(nfl_kick)## [1] "Team" "Year" "GameMinute" "Kicker" "Distance"
## [6] "ScoreDiff" "Grass" "Temp" "Success"
head(nfl_kick, 3)nfl_kick$Success = ifelse(nfl_kick$Success == 0, 'Miss','Make') # Create levels for Success Variable
nfl_kick$Success = as.factor(nfl_kick$Success)
class(nfl_kick$Success)## [1] "factor"
levels(nfl_kick$Success)## [1] "Make" "Miss"
nfl_kick$Success <- factor(nfl_kick$Success, levels =c('Miss', 'Make'))
contrasts(nfl_kick$Success)## Make
## Miss 0
## Make 1
This review disects information pertaining to field goal attempts between 2005 and 2015. The information relates 11,187 kicks over eleven years in the NFL. Pre and post graphical exploration of the dataset allows for several questions of interest. In this report, the major question is the relationship of Success, or outcome, of kicks to explanatory variables offered by the dataset.
nrow(nfl_kick) # how many attempted field goals included## [1] 11187
all.dist.attempted <- sort(unique(Distance)) # sorted list of fg attempts by length
all.dist.attempted## [1] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
## [26] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 68
## [51] 69 71 76
Initial graphical interpretation sets the bar for further study. Consider the relationship between temperature (Temp) and field surface (Grass). What follows is a boxplot depicting this relationship. Notice how the median as well as interquartile values are higher for the Grass variable.
boxplot(nfl_kick$Temp ~nfl_kick$Grass, ylab ='Temperature in degrees F',
xlab ='Field Surface: F =Turf, T =Sod',
main ='Boxplot of Temperature relative to Field Surface',
boxlwd =2, outlwd =2, col ='green', outpch =21, outbg ='red')The most intuitive question relates outcome of any given field goal try to distance. The graph is a density plot relating exactly this, distinguished for Success level, namely ‘make’ or ‘miss’.
ggplot(nfl_kick, aes(x =nfl_kick$Distance, fill =nfl_kick$Success)) +
geom_density(alpha =.4)Additionally, consider the distribution of attempts relative to temperature. Note that colder temperatures seem to disuade kicking field goals, thus lending a negative skew to the histogram. Clearly, though, attempt volume to temperature is not nearly linear, whether this is strictly due to the distribution of temperature or whether or not temperature has a non-linear degree of influence on Success.
ggplot(nfl_kick, aes(x =nfl_kick$Distance, color =nfl_kick$Success, fill =nfl_kick$Success)) +
geom_histogram(alpha =.4)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A simple regression model plots Success against Distance. Due to the binary composition of the response, Logistic regression is suited to discovering an equation to fit Success to the optimal subset of explanatory variables. The statistical significance of Distance (p-value, 2e-16) follows.
# Simple: Success:Distance
fg.glm <- glm(nfl_kick$Success ~nfl_kick$Distance, family =binomial)
summary(fg.glm)##
## Call:
## glm(formula = nfl_kick$Success ~ nfl_kick$Distance, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7193 0.2479 0.4086 0.6297 1.5497
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.724620 0.137223 41.72 <2e-16 ***
## nfl_kick$Distance -0.102615 0.003135 -32.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10105.0 on 11186 degrees of freedom
## Residual deviance: 8748.4 on 11185 degrees of freedom
## AIC: 8752.4
##
## Number of Fisher Scoring iterations: 5
A fuller model incorporates Year as well as Distance.
fg.prime.glm <- glm(nfl_kick$Success ~Distance + Year, family =binomial)
summary(fg.prime.glm)##
## Call:
## glm(formula = nfl_kick$Success ~ Distance + Year, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7744 0.2491 0.3980 0.6431 1.5753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.062e+02 1.731e+01 -6.137 8.39e-10 ***
## Distance -1.046e-01 3.171e-03 -32.973 < 2e-16 ***
## Year 5.574e-02 8.620e-03 6.467 1.00e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10105.0 on 11186 degrees of freedom
## Residual deviance: 8706.3 on 11184 degrees of freedom
## AIC: 8712.3
##
## Number of Fisher Scoring iterations: 5
Almost as interesting to me as what predictive variables suitably constitute the regression model for the log odds of made field goals is the question of what prompts a kick in the first place. One such question entertatins Distance relative to socre differential (Score Diff). The correlation coefficient follows, after which a scatterplot and linear regression line demonstrate the truth of the interaction. However, statistical significance was found insufficient to warrant a fuller model that incorporated this variable in the regression of Success
cor.test(nfl_kick$ScoreDiff, nfl_kick$Distance)##
## Pearson's product-moment correlation
##
## data: nfl_kick$ScoreDiff and nfl_kick$Distance
## t = -5.4355, df = 11185, p-value = 5.58e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06979159 -0.03282712
## sample estimates:
## cor
## -0.05132693
ggplot(nfl_kick, aes(x =ScoreDiff, y =Distance)) +
geom_point(color ='red', size =1) +
geom_smooth(method ='lm', formula =y ~x, se =TRUE, level =.95)If the outcome of a field goal try is of prime importance, it would benefit the dataset to include a percentage as predicted by the regression of Success on Distance and Year.
ext.fg.d8a <- mutate(nfl_kick, Pred_Perc = round(exp(-106.2 - (.1046 *Distance) + (.05574 *Year)) /
(1 + exp(-106.2 - (.1046 *Distance) + (.05574 *Year))), 4) *100)
head(ext.fg.d8a, 4)Here is a scatterplot of the new Predictive Percentage (Pred_Perc) value against Distance.
attach(ext.fg.d8a)## The following objects are masked from nfl_kick:
##
## Distance, GameMinute, Grass, Kicker, ScoreDiff, Success, Team,
## Temp, Year
ggplot(ext.fg.d8a, aes(x =Distance, y =Pred_Perc, color =Year)) +
geom_point(size =2.5, pch =21) A Scatterplot of Temperature’s influence on Percentage shows a nearly normal distribution.
ggplot(ext.fg.d8a, aes(x =Temp, y =Pred_Perc, color =Year)) +
geom_point(size =2.5, pch =21) Correlation Coefficient for Temp and Distance
cor.test(nfl_kick$Temp, nfl_kick$Distance)##
## Pearson's product-moment correlation
##
## data: nfl_kick$Temp and nfl_kick$Distance
## t = 3.8978, df = 9126, p-value = 9.776e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02027001 0.06123177
## sample estimates:
## cor
## 0.04076802
Whatever the confounding variables, progressive seasons are benefiting from increased accuracy. Speculate as to the degree of influence will not improve the findings. But additional data and/or the discovery of strong surrogate inputs for confounding variables would improve the ability of modeling techniques in predicting the outcome of a field goal attempt. The enormous improvements in field goal kicking continues to alter the tactics and results of games and play-by-play decisions.