Data-Dive-8

Data Dive Week 8 - Regression

Start by setting up the packages to manipulate data.

suppressPackageStartupMessages({
  library(tidyverse)
  library(rio)
  library(boot)
  library(broom)
  source("aptheme.R") #Code that helps format graphs
  })

Import data

data <- import("plays.csv")

Yards Gained by formation

In an earlier data dive, we saw the difference in yards gained based on each offensive formation. On average, shotgun and empty backfields lead to above-average gains in yardage, while jumbo and wildcat lead to below-average gains. Here we will test the strength of that difference. The null hypothesis is that all formations lead to the same mean number of yards gained.

res <- aov(yardsGained ~ offenseFormation, data)
summary(res)

##                     Df  Sum Sq Mean Sq F value   Pr(>F)    
## offenseFormation     6    2368   394.6    5.04 3.58e-05 ***
## Residuals        15929 1247327    78.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 188 observations deleted due to missingness

The results of the ANOVA test show a very small p value (3.58e-05) and an F value of 5.04, well above 1. If every formation had the same mean yards gained, a result this extreme would be very unlikely, so we have evidence that at least one formation differs. ANOVA does not say which one, so next we use pairwise t tests with a Bonferroni adjustment to determine which formations may be different.
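To make the F value less of a black box, here is a minimal sketch of how it is built: the between-group mean square divided by the within-group mean square (in the table above, 394.6 / 78.3 ≈ 5.04). The data below is simulated with made-up group names, not taken from plays.csv, and the variable names are only for illustration.

```r
# Toy data (simulated, NOT from plays.csv): three hypothetical groups
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 50)),
  y     = c(rnorm(50, mean = 5), rnorm(50, mean = 5.5), rnorm(50, mean = 6))
)

grand_mean  <- mean(toy$y)
group_means <- tapply(toy$y, toy$group, mean)
group_n     <- tapply(toy$y, toy$group, length)

# Sum of squares between groups and within groups
ss_between <- sum(group_n * (group_means - grand_mean)^2)
ss_within  <- sum((toy$y - group_means[toy$group])^2)

# Degrees of freedom: (groups - 1) and (observations - groups)
df_between <- nlevels(toy$group) - 1
df_within  <- nrow(toy) - nlevels(toy$group)

# F is the ratio of the two mean squares
f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

# Compare against aov() on the same toy data
f_aov <- summary(aov(y ~ group, data = toy))[[1]][["F value"]][1]
c(manual = f_manual, aov = f_aov)
```

The two numbers should agree, which is a quick check that the F value is only a ratio of how much the group means vary to how much plays vary within a group.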

pairwise.t.test(data$yardsGained, data$offenseFormation, 
                p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$yardsGained and data$offenseFormation 
## 
##            EMPTY   I_FORM  JUMBO   PISTOL  SHOTGUN SINGLEBACK
## I_FORM     0.68817 -       -       -       -       -         
## JUMBO      0.00064 0.03041 -       -       -       -         
## PISTOL     1.00000 1.00000 0.03570 -       -       -         
## SHOTGUN    1.00000 0.23530 0.00041 1.00000 -       -         
## SINGLEBACK 1.00000 1.00000 0.00353 1.00000 0.52637 -         
## WILDCAT    1.00000 1.00000 1.00000 1.00000 1.00000 1.00000   
## 
## P value adjustment method: bonferroni

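Here we can see that the empty backfield is not significantly different from pistol, shotgun, singleback, or wildcat (adjusted p = 1), but it does appear to be different from the jumbo formation (p = 0.00064). Similarly, singleback appears to be different from jumbo (p = 0.00353), but not distinct from any of the other formations. In fact, every comparison with an adjusted p value below 0.05 involves jumbo, and wildcat cannot be distinguished from any formation. Based on this, we can reject the null hypothesis and conclude that mean yards gained is not the same across all formations.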
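Many entries in the table are exactly 1.00000, which comes from the Bonferroni adjustment: each raw p value is multiplied by the number of comparisons and capped at 1. The sketch below shows that arithmetic with hypothetical raw p values (they are not the unadjusted values from this data set); the count of 21 comes from the 7 formations compared pairwise.

```r
# Hypothetical raw (unadjusted) p values, for illustration only
raw_p <- c(0.00003, 0.002, 0.03, 0.2)

# 7 formations give choose(7, 2) = 21 pairwise comparisons
n_comparisons <- choose(7, 2)

# Bonferroni by hand: multiply by the number of tests, cap at 1
manual <- pmin(1, raw_p * n_comparisons)

# Same adjustment with the built-in helper
built_in <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)

data.frame(raw_p, manual, built_in)
```

With 21 comparisons, any raw p value above about 0.048 is reported as 1, so a 1.00000 in the table means "no evidence of a difference after adjustment", not that the two formations are identical.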

Linear Regression

ggplot(data, aes(x = dropbackDistance, y = yardsGained)) + 
  geom_point() + 
  theme_ap(family = "sans") + 
  labs(title = "Yards Gained by Dropback Distance",
       x = "Dropback Distance",
       y = "Yards Gained")

## Warning: Removed 5966 rows containing missing values or values outside the scale range
## (`geom_point()`).

model <- lm(yardsGained ~ dropbackDistance, data)
summary(model)

## 
## Call:
## lm(formula = yardsGained ~ dropbackDistance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.472  -6.109  -2.184   3.857  91.637 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.34641    0.19619  27.251  < 2e-16 ***
## dropbackDistance  0.23749    0.05037   4.715 2.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.801 on 10156 degrees of freedom
##   (5966 observations deleted due to missingness)
## Multiple R-squared:  0.002184,   Adjusted R-squared:  0.002085 
## F-statistic: 22.23 on 1 and 10156 DF,  p-value: 2.455e-06

Here we can see the relationship between dropback distance and the yards gained on a specific play. The slope says that each additional yard a quarterback drops back is associated with about 0.24 more yards gained on the play. Looking at the p value (2.46e-06), it seems very unlikely that there is no positive relationship between the two variables. However, the R squared value is tiny (0.002), meaning dropback distance explains only about 0.2% of the variation in yards gained. Far more goes into yards gained than dropback distance alone, so further analysis is needed into what is actually affecting the number of yards gained on any given play.
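A significant slope alongside a tiny R squared can look contradictory, so here is a minimal sketch on simulated data (not plays.csv) in which a small true slope is buried in a lot of noise. The sample size, slope, and noise level are assumptions chosen only to resemble the situation above.

```r
# Simulated data (NOT from plays.csv): a small true slope buried in noise
set.seed(42)
n <- 10000
x <- runif(n, min = 0, max = 8)
y <- 5 + 0.25 * x + rnorm(n, sd = 10)

fit <- lm(y ~ x)

# 95% confidence interval for the slope
slope_ci <- confint(fit)["x", ]

# R squared as reported by summary()
r_squared <- summary(fit)$r.squared

# R squared by hand: 1 - residual sum of squares / total sum of squares
r_squared_manual <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)

slope_ci
c(summary = r_squared, manual = r_squared_manual)
```

The p value answers "is the slope distinguishable from zero?", which a large sample makes easy, while R squared answers "how much of the play-to-play variation does this one variable explain?", which can stay near zero.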

Serena Hawkins

2026-03-08
