1 Introduction

1.0.1 Description

This data set contains field goal attempts from the 2008 NFL season, along with factors that may affect the outcome of each attempt. Variables include the kicking team, Name, Distance, timerem, defscore, and GOOD.

Kicking team - Name of the kicking team (categorical)
Name - Name of the kicker
Distance - How far the ball is from the goal
Time Remaining - How much time remains on the game clock
Defensive Score - The score of the opposing team
GOOD - Whether the field goal is made: 1 for a make and 0 for a miss

1.0.2 Question

Most fans assume that the longer a field goal attempt is, the less likely it is to be made. Our question for this analysis is whether the data support that assumption. We explore the association between a made field goal and distance.

1.0.3 Data Cleaning

library(dplyr)   # provides %>% and select()
library(knitr)   # provides kable() for the tables below

fieldgoals <- read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/nfl2008_fga.csv", header = TRUE)
# Drop observations with any missing value
clean_fieldgoals <- na.omit(fieldgoals)
# Drop variables that duplicate information kept in other columns
clean_fieldgoals <- clean_fieldgoals %>% select(-GameDate, -AwayTeam, -HomeTeam, -qtr, -min, -sec, -def, -down, -togo, -kicker, -ydline, -homekick, -offscore, -season, -Missed, -Blocked)
# Numeric 0/1 copy of the outcome; GOOD is already coded 1 (make) / 0 (miss)
clean_fieldgoals$fieldgoal.01 <- as.integer(clean_fieldgoals$GOOD == 1)
head(clean_fieldgoals)
  kickteam        name distance kickdiff timerem defscore GOOD fieldgoal.01
1      IND A.Vinatieri       30       -3    2822        3    1            1
2      IND A.Vinatieri       46        0    3287        0    1            1
3      IND A.Vinatieri       28        7    2720        0    1            1
4      IND A.Vinatieri       37       14    2742        0    1            1
5      IND A.Vinatieri       39        0    3056        0    1            1
6      IND A.Vinatieri       40       -3    3043        3    1            1

We remove any observations with a missing value. We also drop many variables because of a high likelihood of multicollinearity. We already have a variable for time remaining, so we eliminate the other time-related variables. GOOD already records a make, so the variables for a miss or a block would only repeat information in our data. Most of the remaining dropped variables identify the kicker or the teams, which the variables we keep already describe.
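A quick sanity check after this kind of cleaning is to cross-tabulate a recoded indicator against the column it was built from. The sketch below uses a small made-up data frame (`toy`, with invented values) rather than the real data, so the names and numbers are illustrative assumptions only.

```r
# Hypothetical toy data: one row has a missing distance
toy <- data.frame(
  distance = c(30, 46, NA, 52),
  GOOD     = c(1, 1, 0, 0)
)

# na.omit() drops every row that contains at least one NA
toy.clean <- na.omit(toy)

# Recode the outcome as an integer 0/1 indicator
toy.clean$made <- as.integer(toy.clean$GOOD == 1)

# Every row should fall on the diagonal if the recode is correct
table(GOOD = toy.clean$GOOD, made = toy.clean$made)
```

If any counts appear off the diagonal, the recode does not match the original outcome and should be revisited before modeling.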

2 Data Analysis

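# Peak of the kernel density estimate, kept for reference when choosing axis limits
ylimit = max(density(clean_fieldgoals$distance)$y)
# Histogram of distance on the density scale, with a smoothed density curve overlaid
hist(clean_fieldgoals$distance, probability = TRUE, main = "Distance", xlab = "Distance",
     col = "azure1", border = "lightseagreen")
lines(density(clean_fieldgoals$distance, adjust = 2), col = "blue")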

We begin with an exploratory data analysis of our predictor variable. The histogram above shows no strong skew in distance, so the attempts are not concentrated at one end of the range.

s.logit = glm(GOOD ~ distance, 
          family = binomial(link = "logit"),
          data = clean_fieldgoals)                 
result = summary(s.logit)
result

Call:
glm(formula = GOOD ~ distance, family = binomial(link = "logit"), 
    data = clean_fieldgoals)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   6.7056     0.5480  12.236   <2e-16 ***
distance     -0.1194     0.0124  -9.631   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 809.65  on 1036  degrees of freedom
Residual deviance: 686.12  on 1035  degrees of freedom
AIC: 690.12

Number of Fisher Scoring iterations: 6
model.coef.stats = summary(s.logit)$coef      
conf.ci = confint(s.logit)                    
Waiting for profiling to be done...
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci)
kable(sum.stats,caption = "The summary stats of regression coefficients") 
The summary stats of regression coefficients
              Estimate  Std. Error    z value  Pr(>|z|)       2.5 %      97.5 %
(Intercept)  6.7056029   0.5480144  12.236179         0   5.6755150   7.8271583
distance    -0.1194428   0.0124020  -9.630942         0  -0.1445903  -0.0958982

From the output above we see that distance is negatively associated with a made field goal. The estimated coefficient is -0.1194 on the log-odds scale, with a 95% CI of [-0.145, -0.096]. The interval lies entirely below zero, which also supports our hypothesis.
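The interval in the table comes from `confint()`, which profiles the likelihood for a `glm` fit (hence the "Waiting for profiling to be done..." message). As a rough cross-check, a Wald interval can be computed by hand from the estimate and standard error. The sketch below hard-codes both values from the summary output above, so it is an approximation and not a replacement for the profile interval.

```r
# Estimate and standard error for distance, copied from the summary output
est <- -0.1194428
se  <- 0.0124020

# Wald 95% interval: estimate +/- z * SE, with z = qnorm(0.975)
z <- qnorm(0.975)
wald.ci <- c(lower = est - z * se, upper = est + z * se)

round(wald.ci, 4)
```

The Wald and profile intervals differ slightly in the third decimal place, and both exclude zero.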

model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios
              Estimate  Std. Error    z value  Pr(>|z|)   odds.ratio
(Intercept)  6.7056029   0.5480144  12.236179         0  816.9704676
distance    -0.1194428   0.0124020  -9.630942         0    0.8874148

Now we convert our estimate to an odds ratio. The odds ratio associated with distance is 0.887, meaning that each additional yard of distance lowers the odds of a made field goal by about 11.3%.
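Because the model is linear on the log-odds scale, the odds ratio for a larger change in distance is the one-unit odds ratio raised to that power. The sketch below illustrates this with the slope copied from the table above; the helper `odds.ratio.k` is a name introduced here for illustration only.

```r
# Slope for distance, copied from the model summary
beta.distance <- -0.1194428

# Odds ratio for a k-unit increase in distance: exp(k * beta)
odds.ratio.k <- function(k, beta = beta.distance) exp(k * beta)

or.1  <- odds.ratio.k(1)    # one-unit increase
or.10 <- odds.ratio.k(10)   # ten-unit increase

# Percentage drop in the odds for each case
pct.drop <- 100 * (1 - c(or.1 = or.1, or.10 = or.10))
round(pct.drop, 1)
```

Under this model a ten-unit increase in distance cuts the odds of a make by roughly 70%, which is much larger than ten times the one-unit effect would suggest on a percentage scale, since the effects multiply and do not add.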

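# Grid of distances spanning the observed range
distance.range = range(clean_fieldgoals$distance)
x = seq(distance.range[1], distance.range[2], length = 200)
# Linear predictor (log-odds) and the implied probabilities
beta.x = coef(s.logit)[1] + coef(s.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
# Rate of change of the success probability with respect to distance
beta1 = coef(s.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
# Plot the fitted probability curve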
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
     main = "The probability of being \n a made field goal", 
     ylim=c(0, 1.1*ylimit),
     xlab = "distance",
     ylab = "probability",
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)

The graph above is the fitted logistic (S-shaped) curve. It slopes downward, as we expected: the probability of a made field goal decreases as distance increases.
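To read specific values off the curve, the fitted probability can be computed directly from the coefficients. The sketch below hard-codes the estimates from the summary output and uses a hypothetical helper, `predict.make`, introduced here for illustration; with the fitted object available, `predict(s.logit, newdata, type = "response")` would serve the same purpose.

```r
# Coefficients copied from the model summary
b0 <- 6.7056029    # intercept
b1 <- -0.1194428   # slope for distance

# Fitted probability of a make; plogis() is the inverse logit
predict.make <- function(distance) plogis(b0 + b1 * distance)

# Probabilities at a few illustrative distances
distances <- c(20, 30, 40, 50)
probs <- predict.make(distances)
round(setNames(probs, distances), 3)

# Distance at which the fitted probability equals 0.5 (log-odds of zero)
d.half <- -b0 / b1
round(d.half, 1)
```

The break-even distance works out to about 56, which may sit near the edge of the observed range, so it should be read as an extrapolation of the fitted curve and not as an observed result.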

3 Conclusion

We used a real-world data set of field goal attempts from the 2008 NFL season. The analysis above supports our hypothesis: as the distance of a field goal attempt increases, the probability of a make decreases.

