data

taken from Pro Football Focus https://www.pff.com/

manually entered: speed, draft selection, PFF grade from the last season played (current year if still active), age at that PFF grade (e.g. if currently in the league, age = current age), and PFF grade 3 years post draft year

there were some caveats. for example, Peyton Manning does not have a PFF grade from 3 years out (no PFF grade from 2001), so his year 3 value is the next closest available grade, which was 6 years after he was drafted

some players were also not even in the league 3 years, and for some players the grade 3 years post draft is from the same year as their most recent PFF grade

some players are missing speed, probably because they were not invited to the combine

finally, if a player went undrafted, I entered their selection number as 257, one past the roughly 256 picks in a typical NFL draft (the exact number varies by year)
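the undrafted rule above can be written as a one-line recode. this is only a sketch on made-up values: `draft_raw` is a hypothetical vector where NA marks an undrafted player, not a column from the actual file.

```r
# Hypothetical input: NA marks an undrafted player (not the real data)
draft_raw <- c(63, NA, 191)

# Recode undrafted players to 257, one past the last typical pick
draft <- ifelse(is.na(draft_raw), 257, draft_raw)
draft
```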

there are a lot of other factors to think about (covid, overall team performance, injury etc) that are simply beyond the scope of this analysis

PFF grade:

“Each player is given a grade of -2 to +2 in 0.5 increments on a given play with 0 generally being the average or “expected” grade. There are a few exceptions as each position group has different rules, but those are the basics. The zero grade is important as most plays feature many players doing their job at a reasonable, or expected, level, so not every player on every play needs to earn a positive or a negative.

At one end of the scale you have a catastrophic game-ending interception or pick-six from a quarterback, and at the other a perfect deep bomb into a tight window in a critical game situation.

Each position has its own grading rubric so our analysts know how to put a grade on the various expectations for a quarterback on a 10-yard pass beyond the sticks or what the range of grades might look like for a frontside offensive tackle down blocking on a “power” play.

There is then an adjustment made to the “raw” grades to adjust for what the player is “expected” to earn given his situation on the field. For instance, a player’s grade may be adjusted down slightly if he plays in a situation that is historically more favorable while a player in more unfavorable circumstances may get an adjustment the other way. We collect over 200 fields of data on each play, and that data helps to determine what the baseline, or expectation, is for each player on every play.

Each grade goes into a specific “facet” of play in order to properly assess each player’s skillset. The facets include passing, rushing, receiving, pass blocking, run blocking, pass-rushing, run defense and coverage. Special teamers also have their own facets of kicking, punting, returning and general special teams play. Facets are important in order to have a clear view of where a player’s strengths and weaknesses lie.”

library(data.table)  # fread()
pff <- fread("/Users/claire/Desktop/nfl_sibs.csv", header=TRUE, data.table=FALSE)
head(pff)

##   familyID           player  PFF age speed draft PFF_yr_3
## 1        1      Jason Kelce 88.5  35  4.89   191     78.0
## 2        1     Travis Kelce 91.1  33    NA    63     89.0
## 3        2 Shaquill Griffin 61.8  27  4.38    90     64.1
## 4        2 Shaqueem Griffin 66.5  24  4.38   141     66.5
## 5        3   Devin McCourty 70.0  35  4.38    27     87.3
## 6        3   Jason McCourty 72.1  33    NA   203     81.4

visualize data

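library(Hmisc)     # rcorr()
library(corrplot)  # corrplot()
# pairwise correlations among the numeric columns (PFF, age, speed, draft, PFF_yr_3)
rgs <- rcorr(as.matrix(pff[, 3:ncol(pff)]))

corrplot(rgs$r, method = "color",
         type = "lower",  number.cex = 0.9,
         addCoef.col = "black", 
         tl.col = "black", tl.srt = 90,
         tl.cex = .7)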

interestingly, draft selection number correlates positively with most recent PFF (r=.21) but negatively with year 3 PFF. this would suggest that a later draft pick (a higher selection number) is associated with a higher (better) most recent PFF grade, while an earlier draft pick (a lower selection number) is associated with a higher (better) PFF grade at year 3. this might make sense looking across a range of data: in year 3, probably only those who were drafted quite early would be expected to perform well that soon. mid and later round picks might develop into better players (higher PFF grades) with more time in the league, hence the positive correlation between draft number and most recent PFF.
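it is worth checking how much weight a correlation of this size can carry in a sample this small. the sketch below converts r into a t statistic by hand. it assumes n = 38 complete pairs for draft and most recent PFF, which is an assumption on my part (rcorr uses pairwise n, so the true count could differ); `r` is the value quoted above.

```r
# Assumption: 38 complete pairs; r is the draft vs most recent PFF correlation
r <- 0.21
n <- 38

# t statistic and two-sided p-value for H0: correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(t = t_stat, p = p_val), 2)
```

under that assumption t is about 1.29 on 36 degrees of freedom, which is not significant at the 0.05 level, so the sign of this correlation should be read cautiously.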

what is the distribution of draft picks in this sample? obviously it is a very small sample, so is it even representative?

library(ggplot2)
ggplot(pff, aes(x=draft)) +
  geom_density(color="blue", fill="deepskyblue")

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
quantiles <- quantile(pff$draft, probs=probs)  # argument name spelled out in full
pff$quant <- factor(findInterval(pff$draft, quantiles))  # quantile bin for each player
ggplot(pff, aes(draft)) + 
  geom_density() +
  scale_x_continuous(breaks=quantiles)

the density curve does not look very normal. again, very small sample, but interestingly it seems like the bulk of the siblings get drafted top 100. if we look at the plot with quantiles added that seems to be the case.
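to make the binning step concrete, here is a small sketch of what quantile() and findInterval() do together, plus a direct check of the "top 100" share. the pick numbers are hypothetical stand-ins, not the full sibling data, and `share_top100` is a name introduced only for this example.

```r
# Hypothetical pick numbers for illustration only
draft <- c(1, 11, 27, 30, 63, 90, 141, 191, 198, 203)

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
cuts  <- quantile(draft, probs = probs)

# findInterval() returns, for each pick, how many cut points it has reached
# (0 = below the 10th percentile, 5 = at or above the 90th)
bin <- findInterval(draft, cuts)
table(bin)

# Share of picks inside the top 100
share_top100 <- mean(draft <= 100)
share_top100
```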

analysis

phenotypic

the first step in a within sibling analysis is to simply see what the phenotypic model looks like. that is, do the variables even relate to each other?

library(lme4)    # lmer()
library(sjPlot)  # tab_model()
pheno <- lmer(PFF ~ draft + age + (1|familyID), data=pff)
tab_model(pheno)

                               PFF
Predictors                     Estimates   CI               p
(Intercept)                    33.62       -3.53 – 70.77    0.075
draft                          0.04        -0.01 – 0.09     0.123
age                            0.86        -0.29 – 2.01     0.139
Random Effects
σ²                             122.73
τ₀₀ familyID                   16.37
ICC                            0.12
N familyID                     18
Observations                   38
Marginal R² / Conditional R²   0.096 / 0.202

they do not! above I used age and draft to predict most recent PFF, and neither is a significant predictor (draft p=0.123, age p=0.139). it seems important to control for age because players are likely to be at very different points in their careers when their most recent grade was recorded.

also important to note is the random effects allowed for family. each family has their own intercept, because siblings are more likely to be similar than non siblings, which means the data has non-independence that must be accounted for.
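as a quick check on the random effects above, the ICC reported by tab_model can be reproduced from the two variance components. this is a minimal sketch that just plugs in the rounded values from the first table; the variable names are only for illustration.

```r
# Variance components copied from the table above (rounded values)
tau00  <- 16.37   # between-family (random intercept) variance
sigma2 <- 122.73  # residual, within-family variance

# ICC = share of total variance that sits between families
icc <- tau00 / (tau00 + sigma2)
round(icc, 2)
```

an ICC of about 0.12 means roughly 12% of the variance in most recent PFF grade is shared within families, which is the non-independence the random intercept accounts for.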

we can run another model that assesses whether PFF at year 3 is predicted by draft selection or age.

pheno<- lmer(PFF_yr_3~draft+age + (1|familyID), data=pff)
tab_model(pheno)

                               PFF_yr_3
Predictors                     Estimates   CI               p
(Intercept)                    43.70       6.86 – 80.54     0.022
draft                          -0.02       -0.07 – 0.03     0.359
age                            0.92        -0.22 – 2.06     0.110
Random Effects
σ²                             111.05
τ₀₀ familyID                   23.15
ICC                            0.17
N familyID                     18
Observations                   38
Marginal R² / Conditional R²   0.119 / 0.271

once again, neither predictor is significant! this makes sense given the sheer dearth of data.

plot

this plot shows the draft selection number and PFF year 3 grade for all siblings in the data set

the light green zone indicates a top 50 draft pick with a high (>70) PFF year 3 grade

ggplot(pff, aes(x=draft, y=PFF_yr_3)) +
  geom_point() + # Show dots
  ylab("PFF grade in year 3")+
  xlab("Draft Selection Number")+
  # geom_label() has no check_overlap argument (only geom_text() does), so it is dropped
  geom_label(
    label=pff$player, 
    nudge_x = 0.25, nudge_y = 0.25, 
    color=pff$familyID)+
  annotate(geom="rect", ymin=70, ymax=Inf, xmin=-Inf, xmax=50, fill="lightgreen", alpha=0.4)

within sibling

the next step is to create variables that represent the between and within sibling scores. the between sibling score is the average draft selection for each family (e.g. for the mannings it would be 1, but for the watts it would be 79.7). the within family score then captures how different each sibling is from that family average.

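library(dplyr)  # group_by(), mutate()
pff <- pff %>% group_by(familyID) %>%
  mutate(
    draft_avg_family = mean(draft, na.rm=TRUE),    # between family score: family mean pick
    draft_indiv_diff = draft - draft_avg_family)   # within family score: deviation from that mean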

if the between family variable significantly predicts PFF, it is indicating that what predicts PFF is something that makes families different from other families. this would likely mean there is some sort of familial confounding (genetic, environmental, or both) between draft number and PFF, so we could not rule out genetics and the family environment having a profound effect on the relationship between draft and PFF grade.

the mannings are an interesting example because they were both selected 1 overall. this means their difference score is 0, the smallest absolute value it could be

fams<- grep("Manning",  pff$player)
pff[fams,c(2:4,6:7,9:10)]

## # A tibble: 2 × 7
##   player           PFF   age draft PFF_yr_3 draft_avg_family draft_indiv_diff
##   <chr>          <dbl> <dbl> <int>    <dbl>            <dbl>            <dbl>
## 1 Eli Manning     62.8    38     1     67.2                1                0
## 2 Peyton Manning  62.6    38     1     94                  1                0

if the within family variable significantly predicts PFF, it is indicating that what predicts PFF is something that makes siblings different from each other. this variable allows us to control for the roughly 50% of genetic variation that full siblings share on average, but importantly for the familial environment as well. we are assuming that siblings in this analysis were raised in similar conditions, and thus can serve as matched environmental controls for each other. therefore, if this within family draft pick variable significantly predicts PFF grade, it increases our confidence in an exposure pathway: that draft selection number induces liability for a better or worse PFF grade. this is a pattern consistent with causality, but we are not able to make a causal claim just from this.

on the other hand, the kelces were selected 128 picks apart (63 and 191), so each is 64 picks from the family average of 127. that average is a fairly late selection, which is even more interesting considering how high their PFF grades are.

the watts have a very wide draft range (11 to 198), but their family average (79.7) is actually not that late. however, derek watt has a quite large individual difference score (about 118), which means he is quite different from his siblings. this is a case where, if that difference significantly predicted differences in PFF grades as well, we might have support for an exposure pathway from draft number to PFF grade.

fams<- grep("Kelce",  pff$player)
pff[fams,c(2:4,6:7,9:10)]

## # A tibble: 2 × 7
##   player         PFF   age draft PFF_yr_3 draft_avg_family draft_indiv_diff
##   <chr>        <dbl> <dbl> <int>    <dbl>            <dbl>            <dbl>
## 1 Jason Kelce   88.5    35   191       78              127               64
## 2 Travis Kelce  91.1    33    63       89              127              -64

fams<- grep("Watt",  pff$player)
pff[fams,c(2:4,6:7,9:10)]

## # A tibble: 3 × 7
##   player       PFF   age draft PFF_yr_3 draft_avg_family draft_indiv_diff
##   <chr>      <dbl> <dbl> <int>    <dbl>            <dbl>            <dbl>
## 1 JJ Watt     68.3    34    11     92.7             79.7            -68.7
## 2 Derek Watt  73.4    30   198     57.5             79.7            118. 
## 3 TJ Watt     82.1    28    30     91.6             79.7            -49.7
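the between and within scores in the Watt rows above can be checked by hand. this is a small base R sketch using only the three draft numbers shown in that output; the `watt` data frame is built here just for the check.

```r
# Draft numbers copied from the Watt output above
watt <- data.frame(
  player = c("JJ Watt", "Derek Watt", "TJ Watt"),
  draft  = c(11, 198, 30)
)

# Between family score: the family mean pick
watt$draft_avg_family <- mean(watt$draft)

# Within family score: each sibling's deviation from that mean
watt$draft_indiv_diff <- watt$draft - watt$draft_avg_family

round(watt[, c("draft_avg_family", "draft_indiv_diff")], 1)
```

by construction the within family scores sum to zero in every family, which is why the mannings (both picked 1st) each get a score of 0.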

run the analysis

here we let both the between and within family variables predict PFF at year 3. since age did not predict PFF at year 3, we are not controlling for it in this model

withinsib<- lmer(PFF_yr_3~draft_avg_family+draft_indiv_diff+(1|familyID), data=pff)
tab_model(withinsib)

                               PFF_yr_3
Predictors                     Estimates   CI               p
(Intercept)                    69.22       62.02 – 76.43    <0.001
draft avg family               0.01        -0.06 – 0.09     0.682
draft indiv diff               -0.08       -0.14 – -0.01    0.016
Random Effects
σ²                             97.37
τ₀₀ familyID                   36.01
ICC                            0.27
N familyID                     18
Observations                   38
Marginal R² / Conditional R²   0.117 / 0.355

interestingly, we do see a significant association between the within family score and PFF grade at year 3 (p=0.016), but not for the between family score (p=0.682).

this finding in itself indicates there is some evidence for an exposure pathway, such that an earlier selection (a lower draft number) confers environmental liability for a higher year 3 PFF grade. we can rule out potential confounding due to the familial environment, such as nutrition or the resources that went into football (playing competitively, private coaches etc), assuming siblings were raised the same. given that large assumption, we are able to conclude that, even in this very, very small sample, the difference between siblings in draft order significantly predicts PFF grade at year 3.
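to get a feel for the size of the within family estimate, it can be applied to one sibling pair. this is a rough sketch using the rounded coefficient (-0.08) and the kelce values printed earlier; because the coefficient is rounded, the result is only approximate.

```r
# Rounded within family slope from the model table (points of PFF per pick)
b_within <- -0.08

# Within family draft scores and year 3 grades for the kelces (from the output above)
diff_jason  <- 64
diff_travis <- -64

# Siblings share the family average and intercept, so the model-implied gap
# (Travis minus Jason) depends only on the difference in their within scores
expected_gap <- b_within * (diff_travis - diff_jason)
observed_gap <- 89.0 - 78.0

c(expected = expected_gap, observed = observed_gap)
```

the model-implied gap of about 10 points is close to the observed 11 point gap for this one pair, though a single family cannot say much about fit.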

test to see if the coefficients are significantly different

by comparing the 2 models, we can test whether the two coefficients are significantly different. if the model comparison is significant, it indicates that breaking up the “draft” variable into the between and within variables is giving us power to detect this effect, because the two coefficients are significantly different.

anova(pheno, withinsib)

## refitting model(s) with ML (instead of REML)

## Data: pff
## Models:
## pheno: PFF_yr_3 ~ draft + age + (1 | familyID)
## withinsib: PFF_yr_3 ~ draft_avg_family + draft_indiv_diff + (1 | familyID)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## pheno        5 300.20 308.38 -145.10   290.20                     
## withinsib    5 299.03 307.22 -144.52   289.03 1.1647  0

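the comparison does not actually return a p-value. both models have 5 parameters (the phenotypic model includes age, the within sibling model does not), so Df = 0 and the likelihood-ratio test cannot be evaluated. the within sibling model fits slightly better (AIC 299.03 vs 300.20), but from this output we cannot say whether the within family predictor is significantly different from the between family predictor.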
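a likelihood-ratio comparison of this kind needs nested models with different numbers of parameters. one way to set that up, sketched here on simulated data (not the sibling data), is to compare a model with a single draft slope and no age term against the between/within model, so the comparison has 1 degree of freedom. everything in `sim`, including the effect sizes, is made up for illustration.

```r
library(lme4)

set.seed(1)
# Simulated stand-in data: 30 hypothetical families with 2 siblings each
n_fam <- 30
sim <- data.frame(familyID = rep(seq_len(n_fam), each = 2))
sim$draft <- round(runif(nrow(sim), 1, 257))
sim$draft_avg_family <- ave(sim$draft, sim$familyID)          # family mean pick
sim$draft_indiv_diff <- sim$draft - sim$draft_avg_family      # deviation from it
fam_effect <- rnorm(n_fam, sd = 5)
sim$PFF_yr_3 <- 70 - 0.08 * sim$draft_indiv_diff +
  fam_effect[sim$familyID] + rnorm(nrow(sim), sd = 8)

# Restricted model: one slope for draft (between slope forced equal to within slope)
m_single <- lmer(PFF_yr_3 ~ draft + (1 | familyID), data = sim, REML = FALSE)

# Full model: separate between and within slopes
m_split <- lmer(PFF_yr_3 ~ draft_avg_family + draft_indiv_diff + (1 | familyID),
                data = sim, REML = FALSE)

# Nested comparison: one extra parameter, so the test has 1 degree of freedom
cmp <- anova(m_single, m_split)
cmp
```

the single slope model is the special case of the split model where the between and within coefficients are equal, so a significant test here would mean the two coefficients differ.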

plot findings

fam<- pff %>% group_by(familyID) %>%
  select(familyID, player, draft, PFF_yr_3, draft_avg_family) %>%
  slice(1) %>%
  mutate(player="family draft avgerage",
         variable="draft avgerage \nper family") %>%
  rename(value=draft_avg_family)

individ<- pff %>% select(familyID, player, draft, PFF_yr_3, draft_indiv_diff) %>%
  mutate(variable="draft individual \ndifference from \nfamily average") %>%
  rename(value=draft_indiv_diff)

full<- rbind(fam,individ)

#plot_list<- list()
#for (i in unique(full$familyID)) {
 # dat<- full %>% filter(familyID==i)
#  plt<- ggplot(dat, aes(color=variable, y=value, x=player)) + 
 #   geom_point(stat="identity", size=3)+
#    ylab("Draft selection")+
#    coord_flip() +
#    ggtitle(sub(".* ", "", dat$player[2]))
 #     theme(legend.title = element_blank(),
  #    panel.border = element_blank(),
   #   panel.grid.major = element_blank(),
 #     panel.grid.minor = element_blank())
#  plot_list[[i]]<- plt
#  print(plot_list[[i]])
#}



## on average here, we expect the lower bar (earlier draft pick) to be more green (better year 3 PFF grade)

draft_pff <- list()
for (i in unique(pff$familyID)) {
  dat <- full %>% filter(familyID==i) %>%
    filter(player!="family draft avgerage")
  plt <- ggplot(dat, aes(fill=PFF_yr_3, y=draft, x=player)) + 
    geom_bar(position="dodge", stat="identity") +
    scale_fill_gradient(low = "firebrick1", high = "springgreen1") +
    theme(legend.title = element_blank(),
          panel.border = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
  draft_pff[[i]] <- plt 
  print(draft_pff[[i]])
}

library(ggpubr)  # ggarrange()
# draft_pff holds the 18 family plots in familyID order, so pass the whole list
p2 <- ggarrange(plotlist = draft_pff,
                nrow=6, ncol=3, common.legend = TRUE)

Within Sib NFL Analysis

2023-02-15