Type of study: Observational Cases: The cases in this dataset are quarterbacks in the NFL who play over 13 games. There are 242 cases from 2009-2018. Hypotheses: Null: Age has an impact on quarterback passing yardage performance. Alternate: Age does not have an impact on quarterback passing yardage performance.
# load data
library(tidyverse)
library(readr)
library(curl)
library(ggplot2)
library(dplyr)
library(scales)
library(openintro)
pass_2009 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2009.csv"))
pass_2010 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2010.csv"))
pass_2011 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2011.csv"))
pass_2012 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2012.csv"))
pass_2013 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2013.csv"))
pass_2014 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2014.csv"))
pass_2015 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2015.csv"))
pass_2016 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2016.csv"))
pass_2017 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2017.csv"))
pass_2018 <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data606/main/pass-2018.csv"))
nfl_passing_2009_2018 <- rbind(pass_2009,pass_2010,pass_2011,pass_2012,pass_2013,pass_2014,pass_2015,pass_2016,pass_2017,pass_2018)
nfl_passing_2009_2018_min13games <- nfl_passing_2009_2018 %>%
filter(Pos=="QB", G > 12)
Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(nfl_passing_2009_2018_min13games)
## Rk Player Tm Age
## Min. : 1.00 Length:242 Length:242 Min. :21.00
## 1st Qu.: 7.00 Class :character Class :character 1st Qu.:25.00
## Median :13.00 Mode :character Mode :character Median :28.00
## Mean :12.83 Mean :28.79
## 3rd Qu.:19.00 3rd Qu.:32.00
## Max. :30.00 Max. :41.00
##
## Pos G GS QBrec
## Length:242 Min. :13.00 Min. :10.00 Length:242
## Class :character 1st Qu.:15.00 1st Qu.:15.00 Class :character
## Mode :character Median :16.00 Median :16.00 Mode :character
## Mean :15.36 Mean :15.22
## 3rd Qu.:16.00 3rd Qu.:16.00
## Max. :16.00 Max. :16.00
##
## Cmp Att Cmp. Yds TD
## Min. :126.0 Min. :271.0 Min. :46.50 Min. :1558 Min. : 3.00
## 1st Qu.:284.2 1st Qu.:471.2 1st Qu.:60.20 1st Qu.:3301 1st Qu.:19.00
## Median :337.5 Median :528.5 Median :62.90 Median :3825 Median :24.00
## Mean :329.3 Mean :522.4 Mean :62.82 Mean :3814 Mean :24.55
## 3rd Qu.:367.8 3rd Qu.:579.5 3rd Qu.:65.97 3rd Qu.:4324 3rd Qu.:29.75
## Max. :471.0 Max. :727.0 Max. :74.40 Max. :5477 Max. :55.00
##
## TD. Int Int. Lng
## Min. :1.000 Min. : 2.00 Min. :0.300 Min. :45.00
## 1st Qu.:3.725 1st Qu.: 9.00 1st Qu.:1.900 1st Qu.:65.00
## Median :4.600 Median :12.00 Median :2.400 Median :74.00
## Mean :4.671 Mean :12.53 Mean :2.421 Mean :72.87
## 3rd Qu.:5.500 3rd Qu.:15.00 3rd Qu.:2.875 3rd Qu.:80.00
## Max. :9.000 Max. :27.00 Max. :5.500 Max. :99.00
##
## Y.A AY.A Y.C Y.G
## Min. :5.200 Min. : 4.100 Min. : 8.90 Min. :119.8
## 1st Qu.:6.825 1st Qu.: 6.400 1st Qu.:11.00 1st Qu.:218.6
## Median :7.300 Median : 7.150 Median :11.60 Median :249.2
## Mean :7.290 Mean : 7.135 Mean :11.61 Mean :247.4
## 3rd Qu.:7.700 3rd Qu.: 7.800 3rd Qu.:12.10 3rd Qu.:273.0
## Max. :9.300 Max. :10.500 Max. :14.20 Max. :342.3
##
## Rate QBR Sk Yds.1
## Min. : 58.40 Min. :13.20 Min. :10.00 Min. : 74.0
## 1st Qu.: 82.45 1st Qu.:50.30 1st Qu.:26.00 1st Qu.:172.0
## Median : 90.85 Median :59.45 Median :32.00 Median :209.0
## Mean : 90.30 Mean :58.64 Mean :32.87 Mean :215.7
## 3rd Qu.: 97.35 3rd Qu.:68.28 3rd Qu.:39.00 3rd Qu.:255.8
## Max. :122.50 Max. :84.50 Max. :62.00 Max. :420.0
##
## NY.A ANY.A Sk. X4QC
## Min. :4.020 Min. :2.980 Min. : 1.700 Min. :0.000
## 1st Qu.:5.970 1st Qu.:5.585 1st Qu.: 4.700 1st Qu.:1.000
## Median :6.425 Median :6.320 Median : 5.900 Median :2.000
## Mean :6.456 Mean :6.311 Mean : 6.043 Mean :2.095
## 3rd Qu.:6.947 3rd Qu.:7.030 3rd Qu.: 7.100 3rd Qu.:3.000
## Max. :8.250 Max. :9.390 Max. :11.300 Max. :8.000
## NA's :11
## GWD
## Min. :0.000
## 1st Qu.:2.000
## Median :3.000
## Mean :2.697
## 3rd Qu.:4.000
## Max. :8.000
## NA's :11
nfl_passing_2009_2018_min13games %>%
select(Player,Age,G,Yds) %>%
arrange(desc(Yds))
nfl_passing_2009_2018_min13games %>%
ggplot(aes(x=Age, y=Yds, color=Yds)) +
geom_point() +
labs(title = "Passing Yds vs. Age in the NFL", x="Age",y="Yards Thrown") +
theme(legend.position = "none")
ggsave("scatter_nofilter.png",plot=last_plot())
## Saving 7 x 5 in image
nfl_passing_2009_2018_min13games %>%
group_by(Age) %>%
summarise(avg_yds = mean(Yds)) %>%
ggplot(aes(x = Age, y = avg_yds, fill=avg_yds)) +
geom_bar(stat = "identity") +
theme(legend.position="none")
ggsave("barplot.png",plot=last_plot())
## Saving 7 x 5 in image
#scale_fill_gradient2(high = muted("blue"), low=muted("grey"))
qbs_over34_13gamesmin <- nfl_passing_2009_2018_min13games %>%
filter(Age>34)
summary(qbs_over34_13gamesmin)
## Rk Player Tm Age
## Min. : 1.000 Length:31 Length:31 Min. :35.00
## 1st Qu.: 3.000 Class :character Class :character 1st Qu.:36.00
## Median : 8.000 Mode :character Mode :character Median :37.00
## Mean : 8.161 Mean :36.97
## 3rd Qu.:10.500 3rd Qu.:38.00
## Max. :26.000 Max. :41.00
## Pos G GS QBrec
## Length:31 Min. :13.00 Min. :13.00 Length:31
## Class :character 1st Qu.:15.00 1st Qu.:15.00 Class :character
## Mode :character Median :16.00 Median :16.00 Mode :character
## Mean :15.55 Mean :15.55
## 3rd Qu.:16.00 3rd Qu.:16.00
## Max. :16.00 Max. :16.00
## Cmp Att Cmp. Yds TD
## Min. :217.0 Min. :358.0 Min. :59.90 Min. :2509 Min. :11.0
## 1st Qu.:350.5 1st Qu.:533.5 1st Qu.:62.45 1st Qu.:4068 1st Qu.:25.0
## Median :373.0 Median :578.0 Median :65.80 Median :4343 Median :32.0
## Mean :370.7 Mean :567.1 Mean :65.33 Mean :4287 Mean :29.1
## 3rd Qu.:397.5 3rd Qu.:611.0 3rd Qu.:68.30 3rd Qu.:4699 3rd Qu.:33.5
## Max. :471.0 Max. :675.0 Max. :74.40 Max. :5477 Max. :55.0
## TD. Int Int. Lng
## Min. :2.700 Min. : 2.00 Min. :0.300 Min. :45.00
## 1st Qu.:4.300 1st Qu.: 9.00 1st Qu.:1.500 1st Qu.:66.00
## Median :5.100 Median :11.00 Median :2.000 Median :75.00
## Mean :5.071 Mean :11.81 Mean :2.135 Mean :73.45
## 3rd Qu.:5.750 3rd Qu.:14.50 3rd Qu.:2.500 3rd Qu.:80.00
## Max. :8.300 Max. :21.00 Max. :5.300 Max. :98.00
## Y.A AY.A Y.C Y.G
## Min. :6.100 Min. :5.200 Min. : 9.90 Min. :193.0
## 1st Qu.:7.200 1st Qu.:7.100 1st Qu.:11.15 1st Qu.:259.7
## Median :7.600 Median :7.800 Median :11.60 Median :274.1
## Mean :7.555 Mean :7.606 Mean :11.57 Mean :274.6
## 3rd Qu.:7.900 3rd Qu.:8.300 3rd Qu.:11.90 3rd Qu.:293.6
## Max. :8.700 Max. :9.300 Max. :13.70 Max. :342.3
## Rate QBR Sk Yds.1
## Min. : 69.90 Min. :35.20 Min. :17.00 Min. :118.0
## 1st Qu.: 90.15 1st Qu.:57.15 1st Qu.:21.00 1st Qu.:140.5
## Median : 97.40 Median :68.00 Median :27.00 Median :175.0
## Mean : 95.99 Mean :65.18 Mean :28.16 Mean :188.1
## 3rd Qu.:102.50 3rd Qu.:74.00 3rd Qu.:34.50 3rd Qu.:214.5
## Max. :115.70 Max. :81.90 Max. :49.00 Max. :358.0
## NY.A ANY.A Sk. X4QC
## Min. :5.450 Min. :4.570 Min. :2.700 Min. :0.000
## 1st Qu.:6.330 1st Qu.:6.170 1st Qu.:3.500 1st Qu.:1.000
## Median :7.000 Median :7.040 Median :4.400 Median :2.000
## Mean :6.872 Mean :6.923 Mean :4.784 Mean :1.903
## 3rd Qu.:7.295 3rd Qu.:7.605 3rd Qu.:5.900 3rd Qu.:2.000
## Max. :8.040 Max. :8.870 Max. :8.900 Max. :6.000
## GWD
## Min. :0.000
## 1st Qu.:2.000
## Median :2.000
## Mean :2.613
## 3rd Qu.:3.000
## Max. :7.000
qbs_over34_13gamesmin %>%
ggplot(aes(x=Age, y=Yds, color=Yds)) +
geom_point() +
labs(title = "Passing Yds by Age for QBs 35+ years old", x="Age",y="Yards Thrown")+
theme(legend.position = "none") +
geom_text(aes(label=Player), size=2) +
xlim(35,41) +
ylim(2500, 5500)
ggsave("scatter_over34.png",plot=last_plot())
## Saving 7 x 5 in image
set.seed(5678)
yds_thrown <- qbs_over34_13gamesmin$Yds
#simulate yds thrown for the QBs 35+ years old using the actual yards thrown in the data set, with 100 random samples, with replacement
sim_yds_thrown <- sample(yds_thrown, size = 100, replace = TRUE)
sim_yds_thrown <- as.data.frame(sim_yds_thrown)
table(sim_yds_thrown)
## sim_yds_thrown
## 2509 2926 3001 3468 3571 3753 3992 4027 4109 4202 4233 4251 4308 4334 4343 4355
## 7 7 5 1 2 1 1 3 4 2 2 4 4 5 4 5
## 4386 4442 4515 4577 4659 4671 4727 4770 4827 4870 4952 5129 5477
## 2 3 2 1 5 2 4 3 3 4 2 5 7
#Distribution of yards thrown for QBs 35+
ggplot(qbs_over34_13gamesmin, aes(x=Yds)) +
geom_histogram(binwidth=100, position="identity")
ggsave("actualdistr.png",plot=last_plot())
## Saving 7 x 5 in image
yds_avg <- mean(sim_yds_thrown$sim_yds_thrown)
yds_sd <- sd(as.double(sim_yds_thrown$sim_yds_thrown))
ggplot(data = sim_yds_thrown, aes(x = sim_yds_thrown)) +
geom_blank() +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm, args = c(mean = yds_avg, sd = yds_sd), col = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave("distr.png",plot=last_plot())
## Saving 7 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Construct a normal probability plot quantile-quantile plot to determine the distribution of the samples.
ggplot(data = qbs_over34_13gamesmin, aes(sample = Yds)) +
geom_line(stat = "qq")
ggsave("qq.png",plot=last_plot())
## Saving 7 x 5 in image
qqnormsim(sample = Yds, data = qbs_over34_13gamesmin)
Assuming that the passing yards are normally distributed, what is the probability that a randomly chose quarterback 35+ years old throws for over 4500 yards in a season?
1 - pnorm(q = 4500, mean = yds_avg, sd = yds_sd)
## [1] 0.3609715
#Add a column to denote whether the QB threw for over 4500 yards
qbs_over34_13gamesmin$over_4500 <- ifelse(qbs_over34_13gamesmin$Yds>4499,"Y","N")
#Obtain summary statistics to show probability of throwing over 4500 for QBs 35+
qbs_over34_13gamesmin %>%
count(over_4500) %>%
mutate(p = n /sum(n))
From the dataset, there is a 38.9% probability that a quarterback 35+ years old throws for over 4500 yards in a season.
We can take a look at the distribution by team and drilldown on those that appear to drive the probability.
ggplot(qbs_over34_13gamesmin, aes(x=Yds)) +
facet_wrap(~Tm) +
geom_histogram(binwidth=100, position="identity")
#Teams with lower
qbs_over34_13gamesmin %>%
filter(Tm %in% c("SEA","NYJ","TEN")) %>%
group_by(Tm)
qbs_over34_13gamesmin %>%
filter(Tm %in% c("NWE","DEN","NOR")) %>%
group_by(Tm)
By looking at this subset, with given knowledge of the NFL, you’d know Tom Brady and Peyton Manning will thrown 4500+ yards when they’re 50.
Let’s determine if this can be proven with a sampling method. We’ll use a sample of 10, which is 34% of the population (10/29).
library(infer)
set.seed(5678)
samp <- qbs_over34_13gamesmin %>%
sample_n(10)
samp %>%
count(over_4500) %>%
mutate(p = n /sum(n))
We can see with a simple random sample, the probability of throwing 4500 yards for QBs 35+ is off by approximately 1.3%, although results can be skewed by the small population.
Let’s attempt to get a sense of how much variability we should expect when estimating the population mean through sampling. We’ll take 15,000 different samples of size 10 from the population, calculate the proportion of the responses in each sample, filter for only when the yards are over 4,500 and construct the distributions with replacement.
set.seed(5678)
sample_props10 <- qbs_over34_13gamesmin %>%
rep_sample_n(size = 10, reps = 15000, replace = TRUE) %>%
count(over_4500) %>%
mutate(p_hat = n /sum(n)) %>%
filter(over_4500 == "Y")
And we can visualize the distribution of these proportions with a histogram.
ggplot(data = sample_props10, aes(x = p_hat)) +
geom_histogram(binwidth = 0.1) +
labs(
x = "p_hat (Over 4500 yards)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 10, Number of samples = 15000"
)
ggsave("sampledistr.png",plot=last_plot())
## Saving 7 x 5 in image
mean(sample_props10$p_hat)
## [1] 0.389755
range(sample_props10$p_hat)
## [1] 0.1 0.9
There are 15,000 elements in sample_props10. The sampling distribution follows that of a normal distribution, with a center of 0.403. The range of values for p_hat extend from 1 QB out of 10 that are 35+ years old and throw for 4500+ yards (10%) to 10/10 that do.
Boostrapping:
set.seed(1234)
n <- 10
sample2 <- qbs_over34_13gamesmin %>%
sample_n(size = n)
sample2 %>%
count(over_4500) %>%
mutate(p = n /sum(n))
sample2 %>%
specify(response = over_4500, success = "Y") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)
In our second sample of 10, the probability a quarterback 35+ years old throws over 4500 yards is 60%. Using sampling, we’re 95% confident that the true proportion of quarterbacks 35+ years old that throw for 4500+ years is between 10% and 70%, which includes the true population proportion calculated above (38%).
Let’s take a look on how decreasing the confidence interval affects this range.
set.seed(1234)
sample2 %>%
specify(response = over_4500, success = "Y") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.75)
Based on a CI of 75%, we can state that we are 75% confident that the proportion of quarterbacks 35+ years old that throw for 4500+ years is between 20% and 60%.
me_0.95 <- 1.96*(sqrt((.1*(1-.1)/1000)+.7*(1-.7)/1000))
The margin of error at the 95% CI equates to 0.0339.
Given the analysis of “older” quarterbacks above, one would be tempted to find out if there is a correlation between age and yards thrown. One would expect a negative linear regression, but does that hold true?
nfl_passing_2009_2018_min13games %>%
summarise(cor(Age, Yds, use = "complete.obs"))
nfl_passing_2009_2018_min13games %>%
ggplot(aes(x=Age, y=Yds, color=Yds)) +
geom_point() +
labs(title = "Passing Yds vs. Age in the NFL", x="Age",y="Yards Thrown") +
stat_smooth(method = "lm", se = FALSE) +
theme(legend.position="none")
## `geom_smooth()` using formula 'y ~ x'
ggsave("regression.png",plot=last_plot())
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
There actually does not seem to be a strong correlation between a quarterback’s age and passing yardage in a season.
Yards as a function of age.
m2 <- lm(Yds ~ Age, data = nfl_passing_2009_2018_min13games)
summary(m2)
##
## Call:
## lm(formula = Yds ~ Age, data = nfl_passing_2009_2018_min13games)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2044.60 -446.94 21.12 432.25 1633.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2071.577 266.476 7.774 2.23e-13 ***
## Age 60.537 9.142 6.622 2.30e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 653.4 on 240 degrees of freedom
## Multiple R-squared: 0.1545, Adjusted R-squared: 0.151
## F-statistic: 43.85 on 1 and 240 DF, p-value: 2.301e-10
15% of the variability is explained by age. With a p-value (2.301e-10) < 0.05, we reject the null hypothesis.
#Linearity
ggplot(data = m2, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
#Nearly Normal Residuals
ggplot(data = m2, aes(x = .resid)) +
geom_histogram(binwidth = 25) +
xlab("Residuals")
ggsave("residuals.png",plot=last_plot())
## Saving 7 x 5 in image
#Normal Probability of the residuals
ggplot(data = m2, aes(sample = .resid)) +
stat_qq()
https://www.kaggle.com/datasets/omzqwonxei/nfl-passing-statistics-20092018
summary(qbs_over34_13gamesmin)
## Rk Player Tm Age
## Min. : 1.000 Length:31 Length:31 Min. :35.00
## 1st Qu.: 3.000 Class :character Class :character 1st Qu.:36.00
## Median : 8.000 Mode :character Mode :character Median :37.00
## Mean : 8.161 Mean :36.97
## 3rd Qu.:10.500 3rd Qu.:38.00
## Max. :26.000 Max. :41.00
## Pos G GS QBrec
## Length:31 Min. :13.00 Min. :13.00 Length:31
## Class :character 1st Qu.:15.00 1st Qu.:15.00 Class :character
## Mode :character Median :16.00 Median :16.00 Mode :character
## Mean :15.55 Mean :15.55
## 3rd Qu.:16.00 3rd Qu.:16.00
## Max. :16.00 Max. :16.00
## Cmp Att Cmp. Yds TD
## Min. :217.0 Min. :358.0 Min. :59.90 Min. :2509 Min. :11.0
## 1st Qu.:350.5 1st Qu.:533.5 1st Qu.:62.45 1st Qu.:4068 1st Qu.:25.0
## Median :373.0 Median :578.0 Median :65.80 Median :4343 Median :32.0
## Mean :370.7 Mean :567.1 Mean :65.33 Mean :4287 Mean :29.1
## 3rd Qu.:397.5 3rd Qu.:611.0 3rd Qu.:68.30 3rd Qu.:4699 3rd Qu.:33.5
## Max. :471.0 Max. :675.0 Max. :74.40 Max. :5477 Max. :55.0
## TD. Int Int. Lng
## Min. :2.700 Min. : 2.00 Min. :0.300 Min. :45.00
## 1st Qu.:4.300 1st Qu.: 9.00 1st Qu.:1.500 1st Qu.:66.00
## Median :5.100 Median :11.00 Median :2.000 Median :75.00
## Mean :5.071 Mean :11.81 Mean :2.135 Mean :73.45
## 3rd Qu.:5.750 3rd Qu.:14.50 3rd Qu.:2.500 3rd Qu.:80.00
## Max. :8.300 Max. :21.00 Max. :5.300 Max. :98.00
## Y.A AY.A Y.C Y.G
## Min. :6.100 Min. :5.200 Min. : 9.90 Min. :193.0
## 1st Qu.:7.200 1st Qu.:7.100 1st Qu.:11.15 1st Qu.:259.7
## Median :7.600 Median :7.800 Median :11.60 Median :274.1
## Mean :7.555 Mean :7.606 Mean :11.57 Mean :274.6
## 3rd Qu.:7.900 3rd Qu.:8.300 3rd Qu.:11.90 3rd Qu.:293.6
## Max. :8.700 Max. :9.300 Max. :13.70 Max. :342.3
## Rate QBR Sk Yds.1
## Min. : 69.90 Min. :35.20 Min. :17.00 Min. :118.0
## 1st Qu.: 90.15 1st Qu.:57.15 1st Qu.:21.00 1st Qu.:140.5
## Median : 97.40 Median :68.00 Median :27.00 Median :175.0
## Mean : 95.99 Mean :65.18 Mean :28.16 Mean :188.1
## 3rd Qu.:102.50 3rd Qu.:74.00 3rd Qu.:34.50 3rd Qu.:214.5
## Max. :115.70 Max. :81.90 Max. :49.00 Max. :358.0
## NY.A ANY.A Sk. X4QC
## Min. :5.450 Min. :4.570 Min. :2.700 Min. :0.000
## 1st Qu.:6.330 1st Qu.:6.170 1st Qu.:3.500 1st Qu.:1.000
## Median :7.000 Median :7.040 Median :4.400 Median :2.000
## Mean :6.872 Mean :6.923 Mean :4.784 Mean :1.903
## 3rd Qu.:7.295 3rd Qu.:7.605 3rd Qu.:5.900 3rd Qu.:2.000
## Max. :8.040 Max. :8.870 Max. :8.900 Max. :6.000
## GWD over_4500
## Min. :0.000 Length:31
## 1st Qu.:2.000 Class :character
## Median :2.000 Mode :character
## Mean :2.613
## 3rd Qu.:3.000
## Max. :7.000