Start by setting up the packages to manipulate data.
suppressPackageStartupMessages({
  library(tidyverse)
  library(rio)
  library(boot)
  library(broom)
  source("aptheme.R") # Code that helps format graphs
})
Next, import the data.
data <- import("plays.csv")
In an earlier data dive, we saw that yards gained differ by offensive formation. On average, shotgun and empty backfields lead to above-average gains in yardage, while jumbo and wildcat lead to below-average gains. Here we test the strength of that difference with a one-way ANOVA. The null hypothesis is that all formations have the same mean yards gained.
res <- aov(yardsGained ~ offenseFormation, data)
summary(res)
##                     Df  Sum Sq Mean Sq F value   Pr(>F)
## offenseFormation     6    2368   394.6    5.04 3.58e-05 ***
## Residuals        15929 1247327    78.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 188 observations deleted due to missingness
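As a cross-check, the F statistic and p-value in the table can be rebuilt from the printed sums of squares and degrees of freedom. The sketch below uses only the rounded numbers shown in the output above, so it should agree with the table up to rounding. The variable names are illustrative, and the eta-squared line is an added effect-size measure that the original output does not report.

```r
# Values copied from the summary(res) output above (rounded as printed).
ss_formation <- 2368      # Sum Sq for offenseFormation
ss_residual  <- 1247327   # Sum Sq for Residuals
df_formation <- 6
df_residual  <- 15929

# Mean squares are sums of squares divided by their degrees of freedom.
ms_formation <- ss_formation / df_formation
ms_residual  <- ss_residual / df_residual

# F is the ratio of between-group to within-group variance.
f_stat <- ms_formation / ms_residual

# The upper tail of the F distribution gives the p-value.
p_value <- pf(f_stat, df_formation, df_residual, lower.tail = FALSE)

# Eta squared: share of total variation attributed to formation
# (an added illustration, not part of the original output).
eta_sq <- ss_formation / (ss_formation + ss_residual)

print(round(f_stat, 2))
print(signif(p_value, 3))
print(round(eta_sq, 4))
```

The eta-squared value comes out at roughly 0.002, which suggests that formation accounts for only about 0.2% of the variation in yards gained even though the test is highly significant.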
The ANOVA returns an F value of 5.04 and a very small p-value (3.58e-05). If every formation had the same mean yards gained, an F value this large would be very unlikely, so it is very unlikely that all the formations gain the same yardage on average. The test does not say which formations differ, so next we use pairwise t-tests with a Bonferroni correction to find out.
pairwise.t.test(data$yardsGained, data$offenseFormation,
                p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$yardsGained and data$offenseFormation
##
##            EMPTY   I_FORM  JUMBO   PISTOL  SHOTGUN SINGLEBACK
## I_FORM     0.68817 -       -       -       -       -
## JUMBO      0.00064 0.03041 -       -       -       -
## PISTOL     1.00000 1.00000 0.03570 -       -       -
## SHOTGUN    1.00000 0.23530 0.00041 1.00000 -       -
## SINGLEBACK 1.00000 1.00000 0.00353 1.00000 0.52637 -
## WILDCAT    1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
##
## P value adjustment method: bonferroni
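The Bonferroni adjustment named above multiplies each raw p-value by the number of comparisons and caps the result at 1. With 7 formations there are 21 possible pairs. The sketch below illustrates the arithmetic with made-up raw p-values; they are hypothetical and not taken from the data.

```r
# Number of pairwise comparisons among the 7 formations in the table above.
n_formations <- 7
n_pairs <- choose(n_formations, 2)

# Hypothetical raw p-values, for illustration only (not from plays.csv).
raw_p <- c(0.00003, 0.0015, 0.03, 0.2)

# Bonferroni: multiply by the number of comparisons and cap at 1.
manual_adj <- pmin(1, raw_p * n_pairs)

# p.adjust() does the same when told the total number of comparisons.
builtin_adj <- p.adjust(raw_p, method = "bonferroni", n = n_pairs)

print(n_pairs)
print(manual_adj)
print(builtin_adj)
```

This is why so many entries in the table are exactly 1: with 21 comparisons, any raw p-value above about 0.048 (1/21) is capped at 1 after adjustment.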
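Here we can see that every significant pair involves the jumbo formation. After the Bonferroni correction, jumbo differs from empty (p = 0.00064), I-form (p = 0.030), pistol (p = 0.036), shotgun (p = 0.00041) and singleback (p = 0.0035). None of the other pairs are distinguishable, and wildcat cannot be told apart from any formation, jumbo included. Based on this we can reject the null hypothesis that all formations gain the same number of yards on average, although the evidence points to jumbo as the one formation that stands apart rather than to differences among all of them.
Next, we look at how dropback distance relates to yards gained, starting with a scatter plot and then a linear model.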
ggplot(data, aes(x = dropbackDistance, y = yardsGained)) +
  geom_point() +
  theme_ap(family = "sans") +
  labs(title = "Yards Gained by Dropback Distance",
       x = "Dropback Distance",
       y = "Yards Gained")
## Warning: Removed 5966 rows containing missing values or values outside the scale range
## (`geom_point()`).
model <- lm(yardsGained ~ dropbackDistance, data)
summary(model)
##
## Call:
## lm(formula = yardsGained ~ dropbackDistance, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -74.472  -6.109  -2.184   3.857  91.637
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)       5.34641    0.19619  27.251  < 2e-16 ***
## dropbackDistance  0.23749    0.05037   4.715 2.46e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.801 on 10156 degrees of freedom
## (5966 observations deleted due to missingness)
## Multiple R-squared: 0.002184, Adjusted R-squared: 0.002085
## F-statistic: 22.23 on 1 and 10156 DF, p-value: 2.455e-06
Here we can see the relationship between dropback distance and the yards gained on a specific play. Each additional yard a quarterback drops back is associated with about 0.24 more yards gained on average. The p-value (2.46e-06) makes it very unlikely that we would see a slope this large if there were no relationship between the two variables. However, the R-squared value is tiny (about 0.002), meaning dropback distance explains only about 0.2% of the variation in yards gained. Note also that 5,966 plays with a missing dropback distance were dropped from the model. Further analysis is needed to find out what is actually affecting the number of yards gained on any given play.
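To make the size of the effect concrete, the fitted line and a 95% confidence interval for the slope can be worked out from the coefficient table. This is a minimal sketch that uses only the printed estimates. The dropback distances are hypothetical values chosen for illustration, and `predict_yards` is an illustrative helper name, not part of the original analysis.

```r
# Estimates copied from the summary(model) output above.
intercept <- 5.34641
slope     <- 0.23749
slope_se  <- 0.05037
resid_df  <- 10156

# Illustrative helper: expected yards gained at a given dropback distance.
predict_yards <- function(dropback) intercept + slope * dropback

# Hypothetical dropback distances (yards), chosen only for illustration.
example_dropbacks <- c(0, 3, 6)
predicted <- predict_yards(example_dropbacks)

# 95% confidence interval for the slope using the t distribution.
t_crit   <- qt(0.975, resid_df)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

print(round(predicted, 2))
print(round(slope_ci, 3))
```

Under these estimates, a 6-yard difference in dropback shifts the expected gain by only about 1.4 yards. That is small next to the residual standard error of 9.8 yards, which is consistent with the tiny R-squared.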