“There are four seasons, really: winter, spring, summer, and football”

# Load your libraries
library(car)
library(tidyverse)
library(mosaic)
library(DT)
library(pander)
library(ggplot2)
library(plotly)


# Load the quarterback data from the Excel workbook in the Data folder.
library(readxl)
quarterbackdata <- read_excel("~/Statistics-Notebook-master/Data/quarterbackdata.xlsx")

Background

The National Football League (NFL) is the major professional American football league in the United States. The NFL is composed of 32 teams split into two conferences, the AFC and the NFC. The NFL publicly posts the statistics for each player and team for each year of play, also known as a season, on its official website.

Question for Analysis:

Is there a correlation between the number of passing yards and the number of touchdowns for quarterbacks in the NFL?

To perform the statistical analysis for this question, I took the following data from the official NFL website for the 2025 season: the name of each quarterback, the team played for, the number of touchdowns, and the number of passing yards. The data include only quarterbacks who completed at least one touchdown during the 2025 season, for a total of 62 quarterbacks. Passing yards and touchdowns are cumulative totals for the season, not per-game figures. Quarterbacks with no completed touchdowns for the 2025 season were left out of the data set to prevent skew. Players whose teams played more games are likely to have more passing yards as well as more touchdowns, but this should not affect the distribution.

Table of Data:

datatable(quarterbackdata)
tdpy <- select(quarterbackdata, c(`PassingYrds`,`TD`))

Mathematical Model:

\[ \underbrace{Y_i}_\text{Touchdowns} = \overbrace{\beta_0}^\text{y-int} + \overbrace{\beta_1}^\text{slope} \underbrace{X_i}_\text{Passing Yards} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]

The estimated regression line is written as:

\[ \underbrace{\hat{Y}_i}_\text{Touchdowns} = \overbrace{b_0}^\text{est. y-int} + \overbrace{b_1}^\text{est. slope} \underbrace{X_i}_\text{Passing Yards} \]

Hypothesis:

My null hypothesis is that the slope of the linear regression model is zero, meaning that passing yards have no linear relationship with the number of touchdowns achieved by a quarterback. My alternative hypothesis is that the slope is not equal to zero, meaning that there is a relationship, or correlation, between passing yards and touchdowns achieved by a quarterback. I will use a significance level of 0.05.

\[ \begin{aligned} H_0&: \beta_1 = 0 \quad \text{(No relationship between passing yards and touchdowns)}\\ H_a&: \beta_1 \neq 0 \quad \text{(Relationship between passing yards and touchdowns exists)}\\ \alpha &= 0.05 \end{aligned} \]

Assumptions:

When creating a linear regression model, five assumptions are made:

1. The true regression relationship between the Y and X variables is linear.
2. The error terms are normally distributed with a mean of zero.
3. The variance of the error terms is constant for all values of X.
4. The X values are fixed and measured without error.
5. The error terms are independent.

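# Fit the simple linear regression of touchdowns on passing yards.
tdpy.lm <- lm(TD ~ PassingYrds, data=tdpy)
# Diagnostic plots: residuals vs. fitted, normal Q-Q, and residuals vs. order.
par(mfrow=c(1,3))
plot(tdpy.lm, which=1:2)
plot(tdpy.lm$residuals, col="firebrick", pch=19)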

For this linear regression model, all of the assumptions above appear to be met. The diagnostic plots flag data points 1, 25, and 17 as potential outliers. These data points represent the quarterbacks Matthew Stafford, Tyler Shough, and Cam Ward. The outliers were not removed from the analysis, because they are real observations whose effects were expected to even out across the data set.

Analysis:

tdpy.lm <- lm(TD ~ PassingYrds, data=tdpy)
pander(summary(tdpy.lm))      
|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1.013 | 0.5835 | -1.736 | 0.08776 |
| PassingYrds | 0.007235 | 0.0002409 | 30.03 | 7.54e-38 |

Fitting linear model: TD ~ PassingYrds

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 62 | 2.724 | 0.9376 | 0.9366 |
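
As a rough hand check on the output above, the t value for the slope is the estimate divided by its standard error, and \(R^2\) can be recovered from that t value and the \(n - 2 = 60\) residual degrees of freedom. This is only a sketch that relies on standard identities for simple linear regression and on the rounded values in the table, so the last digit may differ slightly. The symbols \(s_{b_1}\) (standard error of the slope) and \(r\) (sample correlation) are notation introduced for this check and do not appear in the output:

\[ \begin{aligned} t &= \frac{b_1}{s_{b_1}} = \frac{0.007235}{0.0002409} \approx 30.03 \\ R^2 &= \frac{t^2}{t^2 + (n - 2)} \approx \frac{901.8}{901.8 + 60} \approx 0.9376 \\ r &= +\sqrt{R^2} \approx 0.968 \end{aligned} \]

The sign of \(r\) is positive because the estimated slope is positive.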

Interpretation of the Analysis:

The hypothesis test provides a p-value of 7.54e-38 for the slope, which is far below the significance level of 0.05. We therefore reject the null hypothesis and conclude that there is a positive relationship between passing yards and touchdowns for quarterbacks in the NFL for the 2025 season. The estimated slope is 0.007235, meaning that each additional passing yard is associated with an average increase of 0.007235 touchdowns. The intercept estimate is -1.013, meaning that a quarterback with no passing yards is predicted to have -1.013 touchdowns. This is not possible, since a player cannot score a negative number of touchdowns, so the intercept should not be interpreted literally; its p-value of 0.08776 also shows that it is not significantly different from zero at the 0.05 level. The R squared value is 0.9376 (adjusted R squared 0.9366), meaning that 93.76% of the variation in touchdowns is explained by passing yards.
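
Because a single passing yard is a very small unit, the slope may be easier to read after rescaling. The arithmetic below only restates the estimated slope of 0.007235, assuming the linear fit holds across the observed range; it is not a separate result:

\[ \begin{aligned} 100 \times 0.007235 &= 0.7235 \ \text{touchdowns per 100 passing yards} \\ \frac{1}{0.007235} &\approx 138.2 \ \text{passing yards per additional touchdown} \end{aligned} \]

In other words, on average roughly 138 additional passing yards go with one additional touchdown.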

Regression Graphic:

library(ggplot2)
library(plotly)
plot <- ggplot(quarterbackdata, aes( x = `PassingYrds`, y = `TD`)) +
  geom_smooth(method= "lm", se= TRUE, color= "firebrick", linetype= "solid") +
  geom_point(aes(text = paste("Player:", `QuarterbackName`, "\nTeam:", `Team`)),
             size = 3, color="midnightblue", alpha = 0.1) +
  labs(title = "Regression Model: Passing Yards VS Touchdowns",
       x = "Passing Yards",
       y = "Touchdowns") +
  theme_minimal()
plot2 <- ggplotly(plot, tooltip = "text")
plot2

Assuming that the relationship is linear, the equation of the fitted line shown in the plot above is:

\[ \underbrace{\hat{Y}_i}_\text{Touchdowns} = -1.013 + 0.007235 \underbrace{X_i}_\text{Passing Yards} \]
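
To illustrate how the fitted line is used, here is a worked prediction for a hypothetical quarterback with 3,000 passing yards. The 3,000-yard figure is chosen only for illustration, is assumed to fall within the range of the observed data, and does not refer to any player in the data set:

\[ \hat{Y} = -1.013 + 0.007235(3000) = -1.013 + 21.705 = 20.692 \]

So the model predicts about 20.7 touchdowns at 3,000 passing yards. The residual standard error of 2.724 suggests that an individual quarterback's total can reasonably differ from this prediction by a few touchdowns.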

Conclusion:

The hypothesis test provides a p-value of 7.54e-38, so the null hypothesis is rejected: there is a positive relationship between passing yards and touchdowns for quarterbacks in the NFL for the 2025 season. The estimated value of the slope is 0.007235, meaning that each additional passing yard is associated with an average of 0.007235 more touchdowns completed by quarterbacks. The R squared value is 0.9376 (adjusted R squared 0.9366), which indicates a high correlation, or relationship, between passing yards and touchdowns completed by quarterbacks in the NFL. It also means that 93.76% of the variance in the dependent variable (touchdowns) is explained by the independent variable (passing yards). The other 6.24% of variance may be explained by factors such as “the red zone squeeze,” in which offensive players work closer and closer to the end zone to complete a touchdown and therefore gain fewer passing yards on those plays.