1 Introduction

1.0.1 Description

This data set contains field goal attempts from the 2008 NFL season, along with factors that may affect whether a kick is made. The variables used here include kickteam, name, distance, timerem, defscore, and GOOD.

- Kicking team (kickteam) - Name of the kicking team (categorical)
- Name (name) - Name of the kicker
- Distance (distance) - How far the kick is from the goal, in yards
- Time Remaining (timerem) - How much time remains on the game clock
- Defensive Score (defscore) - The score of the opposing team
- GOOD - Whether the field goal was made: 1 for a make and 0 for a miss

1.0.2 Question

Most fans assume that the longer a field goal attempt is, the less likely it is to be made. Our question for this analysis is whether that assumption holds in the data, so we explore the association between distance and a made field goal.

1.0.3 Data Cleaning

fieldgoals <- read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/nfl2008_fga.csv", header = TRUE)
clean_fieldgoals <- na.omit(fieldgoals)
clean_fieldgoals <- clean_fieldgoals %>% select(-GameDate, -AwayTeam, -HomeTeam, -qtr, -min, -sec, -def, -down, -togo, -kicker, -ydline, -homekick, -offscore, -season, -Missed, -Blocked)
# GOOD is already coded 1 (made) / 0 (missed); store an explicit integer copy
clean_fieldgoals$fieldgoal.01 = as.integer(clean_fieldgoals$GOOD == 1)
head(clean_fieldgoals)
  kickteam        name distance kickdiff timerem defscore GOOD fieldgoal.01
1      IND A.Vinatieri       30       -3    2822        3    1            1
2      IND A.Vinatieri       46        0    3287        0    1            1
3      IND A.Vinatieri       28        7    2720        0    1            1
4      IND A.Vinatieri       37       14    2742        0    1            1
5      IND A.Vinatieri       39        0    3056        0    1            1
6      IND A.Vinatieri       40       -3    3043        3    1            1

We remove every observation with a missing value. We also drop many variables because they are likely to be collinear with the ones we keep. We already have a variable for time remaining, so we drop the other time-related variables. We already have a variable for a make, so variables for a miss or a block would only repeat that information. The other dropped variables are categorical identifiers for the kicker or the teams, which the variables we keep already describe.

2 Data Analysis

# Histogram of kick distance with a smoothed density overlay
hist(clean_fieldgoals$distance, probability = TRUE, main = "Distance",
     xlab = "Distance (yards)", col = "azure1", border = "lightseagreen")
lines(density(clean_fieldgoals$distance, adjust = 2), col = "blue")

We start with an exploratory look at our predictor variable. The histogram above shows no pronounced skew in distance, so the kicks are not concentrated at one end of the range.

s.logit = glm(GOOD ~ distance, 
          family = binomial(link = "logit"),
          data = clean_fieldgoals)                 
result = summary(s.logit)
result

Call:
glm(formula = GOOD ~ distance, family = binomial(link = "logit"), 
    data = clean_fieldgoals)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   6.7056     0.5480  12.236   <2e-16 ***
distance     -0.1194     0.0124  -9.631   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 809.65  on 1036  degrees of freedom
Residual deviance: 686.12  on 1035  degrees of freedom
AIC: 690.12

Number of Fisher Scoring iterations: 6
model.coef.stats = summary(s.logit)$coef      
conf.ci = confint(s.logit)                    
Waiting for profiling to be done...
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci)
kable(sum.stats,caption = "The summary stats of regression coefficients") 
The summary stats of regression coefficients
              Estimate  Std. Error    z value  Pr(>|z|)       2.5 %      97.5 %
(Intercept)  6.7056029   0.5480144  12.236179         0   5.6755150   7.8271583
distance    -0.1194428   0.0124020  -9.630942         0  -0.1445903  -0.0958982

From the output above we see that distance is negatively associated with a made field goal. The estimated coefficient is -0.1194 on the log-odds scale, with a 95% CI of [-0.145, -0.096]. Because the interval lies entirely below zero, it supports our hypothesis.

model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios
              Estimate  Std. Error    z value  Pr(>|z|)   odds.ratio
(Intercept)  6.7056029   0.5480144  12.236179         0  816.9704676
distance    -0.1194428   0.0124020  -9.630942         0    0.8874148

Now we convert our estimate to an odds ratio. The odds ratio associated with distance is 0.887, meaning that each additional yard of distance multiplies the odds of a made field goal by 0.887, a decrease of about 11.3%.

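# Grid of distances spanning the observed range
dist.range = range(clean_fieldgoals$distance)
x = seq(dist.range[1], dist.range[2], length = 200)
# Linear predictor and the implied probabilities of a make and a miss
beta.x = coef(s.logit)[1] + coef(s.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
# Rate of change of the success probability with respect to distance
beta1 = coef(s.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
# Single panel: only the success-probability curve is drawn
par(mfrow = c(1,1))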
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
     main = "The probability of being \n a made field goal", 
     ylim=c(0, 1.1*ylimit),
     xlab = "distance",
     ylab = "probability",
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)

The graph above is the fitted logistic (S-shaped) curve for the probability of a made field goal. It slopes downward, as expected: the probability of a make falls as distance increases.

3 Conclusion

We used a real-world data set of field goal attempts from the 2008 NFL season. The analysis above supports our hypothesis: as the distance of a field goal attempt increases, the probability of a make decreases.

---
title: "Factors Influencing NFL Field Goals"
author: 'Tyler Battaglini'
date: "2024-10-11"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
always_allow_html: true
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. It is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-weight: bold;
    font-family: system-ui;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# Detect, install and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("nleqslv")) {
   install.packages("nleqslv")
   library(nleqslv)
}
#
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}

if (!require("psych")) {
   install.packages("psych")
   library(psych)
}
if (!require("ggplot2")) {   
  install.packages("ggplot2")
   library(ggplot2)
}
if (!require("GGally")) {   
  install.packages("GGally")
   library(GGally)
}
if (!require("car")) {   
  install.packages("car")
   library(car)
}
if (!require("dplyr")) {   
  install.packages("dplyr")
   library(dplyr)
}
if (!require("caret")) {   
  install.packages("caret")
   library(caret)
}

# specifications of outputs of code in code chunks
knitr::opts_chunk$set(echo = TRUE,        # include code chunks in the output file
                      warning = FALSE,    # do not print warning messages in the
                                          # output file
                      message = FALSE,    # do not print messages (e.g. package
                                          # start-up notes) in the output file
                      results = "markup", # show printed results as formatted text
                      comment = NA        # no comment prefix in front of printed
                                          # output
                      )
```

# Introduction

### Description
This data set contains field goal attempts from the 2008 NFL season, along with factors that may affect whether a kick is made. The variables used here include `kickteam`, `name`, `distance`, `timerem`, `defscore`, and `GOOD`.

- Kicking team (`kickteam`) - Name of the kicking team (categorical)
- Name (`name`) - Name of the kicker
- Distance (`distance`) - How far the kick is from the goal, in yards
- Time Remaining (`timerem`) - How much time remains on the game clock
- Defensive Score (`defscore`) - The score of the opposing team
- `GOOD` - Whether the field goal was made: 1 for a make and 0 for a miss

### Question

Most fans assume that the longer a field goal attempt is, the less likely it is to be made. Our question for this analysis is whether that assumption holds in the data, so we explore the association between distance and a made field goal.

### Data Cleaning

```{r}

fieldgoals <- read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/nfl2008_fga.csv", header = TRUE)

```

```{r}
clean_fieldgoals <- na.omit(fieldgoals)
clean_fieldgoals <- clean_fieldgoals %>% select(-GameDate, -AwayTeam, -HomeTeam, -qtr, -min, -sec, -def, -down, -togo, -kicker, -ydline, -homekick, -offscore, -season, -Missed, -Blocked)
# GOOD is already coded 1 (made) / 0 (missed); store an explicit integer copy
clean_fieldgoals$fieldgoal.01 = as.integer(clean_fieldgoals$GOOD == 1)
head(clean_fieldgoals)
```
We remove every observation with a missing value. We also drop many variables because they are likely to be collinear with the ones we keep. We already have a variable for time remaining, so we drop the other time-related variables. We already have a variable for a make, so variables for a miss or a block would only repeat that information. The other dropped variables are categorical identifiers for the kicker or the teams, which the variables we keep already describe.
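
As a rough illustration of the cleaning steps described above, the sketch below applies them to a small made-up data frame. The `toy` object and its values are hypothetical and are not taken from the NFL file. It counts missing values per column before dropping incomplete rows, removes a redundant column, and checks that the recoded outcome agrees with `GOOD`.

```r
# Hypothetical toy data standing in for the field goal file (values are made up)
toy <- data.frame(
  kickteam = c("IND", "IND", "NYG", "NYG", "DAL"),
  distance = c(30, 46, NA, 52, 28),
  Missed   = c(0, 0, 0, 1, 0),
  GOOD     = c(1, 1, 1, 0, 1)
)

# Count missing values per column before removing incomplete rows
na_counts <- colSums(is.na(toy))

# Drop rows with any missing value, then drop the redundant Missed column
toy_clean <- na.omit(toy)
toy_clean <- toy_clean[, setdiff(names(toy_clean), "Missed")]

# Recode the outcome as an explicit 0/1 integer and check it agrees with GOOD
toy_clean$fieldgoal.01 <- as.integer(toy_clean$GOOD == 1)
recode_ok <- all(toy_clean$fieldgoal.01 == toy_clean$GOOD)
```

Running `colSums(is.na(fieldgoals))` on the real data before `na.omit()` would show which columns are responsible for the removed rows.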



# Data Analysis

```{r}
# Histogram of kick distance with a smoothed density overlay
hist(clean_fieldgoals$distance, probability = TRUE, main = "Distance",
     xlab = "Distance (yards)", col = "azure1", border = "lightseagreen")
lines(density(clean_fieldgoals$distance, adjust = 2), col = "blue")
```

We start with an exploratory look at our predictor variable. The histogram above shows no pronounced skew in distance, so the kicks are not concentrated at one end of the range.
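
A visual read of skew can be backed up with a number. The sketch below defines a moment-based sample skewness and tries it on simulated data. The helper `sample_skewness` and the simulated vectors are assumptions introduced for illustration and are not part of the original analysis.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

set.seed(321)                                   # arbitrary seed for reproducibility
symmetric_x <- rnorm(1000, mean = 37, sd = 10)  # simulated, roughly symmetric
skewed_x    <- rexp(1000, rate = 1 / 10)        # simulated, right-skewed

skew_symmetric <- sample_skewness(symmetric_x)  # expected to be near 0
skew_right     <- sample_skewness(skewed_x)     # expected to be clearly positive
```

On the real data, `sample_skewness(clean_fieldgoals$distance)` close to zero would be consistent with the histogram.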

```{r}
s.logit = glm(GOOD ~ distance, 
          family = binomial(link = "logit"),
          data = clean_fieldgoals)                 
result = summary(s.logit)
result


```
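
The deviances printed by `summary(s.logit)` (null 809.65 on 1036 degrees of freedom, residual 686.12 on 1035) allow a quick likelihood-ratio check of whether distance improves on the intercept-only model. This is a minimal sketch that uses the reported values directly rather than the fitted object.

```r
# Deviances and degrees of freedom as reported in the model summary
null_dev  <- 809.65
null_df   <- 1036
resid_dev <- 686.12
resid_df  <- 1035

# Likelihood-ratio statistic: drop in deviance from adding distance
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df

# Upper-tail chi-square probability for the drop in deviance
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
```

With the fitted model in memory, `anova(s.logit, test = "Chisq")` should report the same comparison.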

```{r}
model.coef.stats = summary(s.logit)$coef      
conf.ci = confint(s.logit)                    
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci)
kable(sum.stats,caption = "The summary stats of regression coefficients") 

```

From the output above we see that distance is negatively associated with a made field goal. The estimated coefficient is -0.1194 on the log-odds scale, with a 95% CI of [-0.145, -0.096]. Because the interval lies entirely below zero, it supports our hypothesis.
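
The interval in the table comes from `confint()`, which profiles the likelihood for `glm` fits. For comparison, a Wald interval can be computed by hand from the reported estimate and standard error. The sketch below does this. The variable names are chosen for illustration, and the two intervals are expected to be close but not identical.

```r
# Estimate and standard error for distance, as reported in the summary table
est <- -0.1194428
se  <- 0.0124020

# Wald 95% interval: estimate plus or minus the normal quantile times the SE
z_crit  <- qnorm(0.975)
wald_ci <- est + c(-1, 1) * z_crit * se

# The interval excludes zero when both limits have the same sign
excludes_zero <- prod(sign(wald_ci)) > 0
```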

```{r}
model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")

```

Now we convert our estimate to an odds ratio. The odds ratio associated with distance is 0.887, meaning that each additional yard of distance multiplies the odds of a made field goal by 0.887, a decrease of about 11.3%.
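
Because the model is linear on the log-odds scale, the same coefficient gives the odds ratio for any change in distance. The sketch below works through the one-yard case and, as an illustrative assumption, a ten-yard difference, using the reported estimate.

```r
# Coefficient for distance on the log-odds scale, from the summary table
beta_distance <- -0.1194428

# Odds ratio and percentage change in odds for one additional yard
or_1yd      <- exp(beta_distance)
pct_drop_1  <- (1 - or_1yd) * 100

# Odds ratio for a kick that is 10 yards longer (10 is an arbitrary example)
or_10yd     <- exp(10 * beta_distance)
pct_drop_10 <- (1 - or_10yd) * 100
```

Under this model, a kick that is ten yards longer has odds of success roughly 0.30 times as large, a drop of about 70%.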

```{r}
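# Grid of distances spanning the observed range
dist.range = range(clean_fieldgoals$distance)
x = seq(dist.range[1], dist.range[2], length = 200)
# Linear predictor and the implied probabilities of a make and a miss
beta.x = coef(s.logit)[1] + coef(s.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
# Rate of change of the success probability with respect to distance
beta1 = coef(s.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
# Single panel: only the success-probability curve is drawn
par(mfrow = c(1,1))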
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
     main = "The probability of being \n a made field goal", 
     ylim=c(0, 1.1*ylimit),
     xlab = "distance",
     ylab = "probability",
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)
```

The graph above is the fitted logistic (S-shaped) curve for the probability of a made field goal. It slopes downward, as expected: the probability of a make falls as distance increases.
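
To put numbers on the curve, the sketch below evaluates the fitted probability at a few distances using the reported coefficients. The chosen distances (20, 40 and 50 yards) and the helper `prob_make` are illustrative assumptions. It also finds the distance at which the fitted probability equals 0.5, which is where the linear predictor is zero.

```r
# Reported coefficients from the fitted logistic regression
b0 <- 6.7056029    # intercept
b1 <- -0.1194428   # distance

# Fitted probability of a make at a given distance (inverse logit)
prob_make <- function(distance) plogis(b0 + b1 * distance)

# Probabilities at a few example distances, in yards
example_dist <- c(20, 40, 50)
example_prob <- prob_make(example_dist)

# Distance at which the fitted probability of a make is exactly one half
half_dist <- -b0 / b1
```

With the fitted model in memory, `predict(s.logit, newdata = data.frame(distance = c(20, 40, 50)), type = "response")` should return the same probabilities up to rounding of the coefficients.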

# Conclusion

We used a real-world data set of field goal attempts from the 2008 NFL season. The analysis above supports our hypothesis: as the distance of a field goal attempt increases, the probability of a make decreases.



