Jack Wright
11/30/2020
Research Question
I want to create a model to predict the difference in the final score of NFL games from the "modern era" of football, which I define as everything after the 1992 season. The cutoff is somewhat arbitrary, but it falls just after the introduction of free agency, which was a large change in how teams were built and how the games were played.
This is an observational study. The data were collected by the NFL, and the Elo ratings were created by fivethirtyeight.com. The response variable will be the numerical difference in score at the end of the game, and I will cycle through several explanatory variables that I will create from the data. There are 7326 cases (games) after filtering for games played after 1992.
First look at the data:
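A sketch of the kind of chunk that would produce this preview (assuming the fivethirtyeight games file has already been read into dat; the original chunk is not shown):

#preview the pre-game Elo ratings and final scores
head(dat[, c("elo1_pre", "elo2_pre", "score1", "score2")])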
## elo1_pre elo2_pre score1 score2
## 1 1503.947 1300.000 48 0
## 2 1503.420 1300.000 10 0
## 3 1493.002 1504.908 14 0
## 4 1516.108 1478.004 45 0
## 5 1368.333 1300.000 20 0
## 6 1504.688 1300.000 48 0
Note that two teams play in a single game; by always taking the score difference from the perspective of the team with the higher Elo rating, I am implicitly creating a target team in each case. This lets me add an is_home variable to test whether there is a relationship between the target team playing "at home" versus "away."
#dplyr is used for the wrangling below (assumed loaded when the report was knit)
library(dplyr)

#filter for modern era
dat<-dat%>%
  filter(season>1992)

dat1<-dat%>%
  #get the difference in the ELO scores
  mutate(elo_dif=elo1_pre-elo2_pre)%>%
  #filter out elo differences of zero
  filter(abs(elo_dif)>0)%>%
  #get the score difference from the perspective of the higher-elo team
  mutate(score_dif=ifelse(elo_dif>0,score1-score2,score2-score1))%>%
  #convert all elo differences to positive, now that the scores are ordered correctly
  mutate(elo_dif=abs(elo_dif))%>%
  #flag whether the higher-elo team is the home team (team 1 is the home team in this data)
  mutate(is_home=ifelse(elo1_pre>elo2_pre,1,0))%>%
  #get the higher-elo team's elo rating
  mutate(target_elo=ifelse(is_home==1,elo1_pre,elo2_pre))%>%
  #get the higher-elo team's qb rating
  mutate(target_qb_value=ifelse(is_home==1,qb1_value_pre,qb2_value_pre))

I will take a first look at the relationship between elo difference and final score difference.
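The plot itself is not embedded in this writeup; a minimal sketch of a chunk that would produce it (assuming ggplot2 is loaded) is:

library(ggplot2)

#scatterplot of elo difference against score difference, with a linear trend line
ggplot(dat1, aes(x=elo_dif, y=score_dif)) +
  geom_point(alpha=0.2) +
  geom_smooth(method="lm", se=FALSE) +
  xlab("Elo difference") +
  ylab("Score difference")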
It does look like there is a positive correlation between elo_dif and score_dif. It is also worth noting that as elo_dif trends toward zero, the score difference trends toward zero as well. This is a good sign that the variable is telling us something about score_dif.
Now I will take a look at the is_home variable
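As above, the figure is not embedded here; a hypothetical chunk for this comparison could be:

#compare score differences when the higher-elo team is home (1) vs. away (0)
ggplot(dat1, aes(x=factor(is_home), y=score_dif)) +
  geom_boxplot() +
  xlab("Higher-Elo team at home") +
  ylab("Score difference")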
This also appears related to the outcome: when the higher-Elo team is at home, it tends to win by a larger margin.
The requirements for linear modeling are met by this data (a linear trend, no overly influential points, and constant variability), so let's give it a try!
#elo_dif and score_dif
m1<-lm(score_dif~ elo_dif, data=dat1)
#is_home and score_dif
m2<-lm(score_dif~is_home, data=dat1)
(table<-rbind("elo_dif"=m1$coefficients,"is_home"=m2$coefficients))
##         (Intercept)    elo_dif
## elo_dif  -0.1741653 0.04151956
## is_home   1.6961249 5.10508158
Note that rbind labels the second column with the first model's term name, so the second entry in the is_home row is actually the is_home slope. I will use the summary function to retrieve the R^2 for each model.
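For example (the exact chunk is not shown in the report; the values quoted below come from its output):

#retrieve R^2 for each single-variable model
summary(m1)$r.squared #elo_dif model
summary(m2)$r.squared #is_home model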
The R-squared for the elo_dif model is only about .06, which means that only 6% of the variance in the score difference is accounted for by the difference in Elo. The R^2 for is_home is even lower, at about .03.
I will now run a multiple regression with both explanatory variables to see whether the fit improves, as judged by the adjusted R^2.
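A sketch of the chunk producing the output below (the object name m3 is my assumption; only the summary appears in the report):

#regress score_dif on both explanatory variables at once
m3<-lm(score_dif~is_home+elo_dif, data=dat1)
summary(m3)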
##
## Call:
## lm(formula = score_dif ~ is_home + elo_dif, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.643 -8.449 -0.508 8.410 51.639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.659548 0.306482 -8.678 <2e-16 ***
## is_home 5.017160 0.317095 15.822 <2e-16 ***
## elo_dif 0.041108 0.001963 20.944 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.48 on 7231 degrees of freedom
## (79 observations deleted due to missingness)
## Multiple R-squared: 0.08802, Adjusted R-squared: 0.08777
## F-statistic: 349 on 2 and 7231 DF, p-value: < 2.2e-16
When I run a multiple regression with these two variables, the R^2 and adjusted R^2 both increase, but by less than you might expect. The individual R^2 values do not simply add together, and the adjusted R^2 includes a penalty for complexity that discourages piling on variables to unnecessarily complicate the model.
I will make one last attempt to see whether any other variables I created have a stronger relationship with score_dif, using the "backward selection" model-building strategy.
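A sketch of the full model being fit (the object name full.lm is my assumption; only the summary output appears in the report):

#fit the full model with all four candidate explanatory variables
full.lm<-lm(score_dif~elo_dif+is_home+target_elo+target_qb_value, data=dat1)
summary(full.lm)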
##
## Call:
## lm(formula = score_dif ~ elo_dif + is_home + target_elo + target_qb_value,
## data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.970 -8.389 -0.431 8.600 50.432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.286935 3.517505 1.219 0.2230
## elo_dif 0.040757 0.002199 18.538 < 2e-16 ***
## is_home 5.013688 0.316206 15.856 < 2e-16 ***
## target_elo -0.006117 0.002408 -2.540 0.0111 *
## target_qb_value 0.020024 0.003031 6.607 4.2e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.44 on 7229 degrees of freedom
## (79 observations deleted due to missingness)
## Multiple R-squared: 0.09351, Adjusted R-squared: 0.093
## F-statistic: 186.4 on 4 and 7229 DF, p-value: < 2.2e-16
When performing backward selection, you fit the model with all candidate variables and then remove the variable with the highest p value. Here, each p value measures the evidence that an explanatory variable is related to the response once all of the other variables are accounted for.
I will rerun the regression without the variable “target_elo”
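A sketch of the refit; the object name back.lm matches the model used in the residual checks below, though the assignment itself does not appear in the report:

#drop target_elo, the variable with the largest p value, and refit
back.lm<-lm(score_dif~elo_dif+is_home+target_qb_value, data=dat1)
summary(back.lm)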
##
## Call:
## lm(formula = score_dif ~ elo_dif + is_home + target_qb_value,
## data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.285 -8.422 -0.468 8.525 50.411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.579301 0.438605 -10.441 < 2e-16 ***
## elo_dif 0.038463 0.002005 19.181 < 2e-16 ***
## is_home 5.005570 0.316309 15.825 < 2e-16 ***
## target_qb_value 0.016746 0.002743 6.104 1.09e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.45 on 7230 degrees of freedom
## (79 observations deleted due to missingness)
## Multiple R-squared: 0.0927, Adjusted R-squared: 0.09232
## F-statistic: 246.2 on 3 and 7230 DF, p-value: < 2.2e-16
Dropping "target_elo" barely changes the adjusted R^2 (.093 in the full model versus .092 here), while adding "target_qb_value" to our original multiple regression raises the adjusted R^2 from .088 to .092, so I will select this simpler three-variable model as my best model.
To confirm that our model fits the data, we will make some checks on the residuals.
ggplot(data = back.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

There is no trend in the residuals versus the fitted values, which suggests the linear form is appropriate and the variability is roughly constant. The residuals are also centered on zero and roughly symmetric.
QQ plot
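A minimal sketch of the chunk (assuming ggplot2, which fortifies the lm object the same way the residual plot above does):

#QQ plot of residuals against the normal distribution
ggplot(back.lm, aes(sample=.resid)) +
  stat_qq() +
  stat_qq_line()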
Using a QQ plot, we can further confirm that our residuals are approximately normal, meaning a multiple linear regression model is appropriate for interpreting our data.
On the whole, the relationship between score_dif and the variables I have created is noisy, but it is important to understand the context of the study. While models of physical processes can reach an R^2 of .9, a "good" R^2 for human processes may be below .3. This is further complicated by the fact that there are 22 humans on a football field at all times, so predicting the outcome of their collective behavior is even more difficult.
I am not sure I would lay any money down based on a model with an R^2 this low, but I know that most sportsbooks develop their betting lines with "home" vs. "away" as one of the largest factors (usually worth about 3 full points in score difference). It is also common knowledge that the quarterback is the most impactful player on a football team, so I was a little surprised that target_qb_value didn't have a larger effect in the model. I wonder if this is due to collinearity with the Elo ratings.
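One quick, hypothetical check of that suspicion (not part of the original analysis) is the correlation between the QB value and the Elo-based predictors; use="complete.obs" skips the games with missing values:

#check collinearity between qb value and the elo-based predictors
cor(dat1$target_qb_value, dat1$target_elo, use="complete.obs")
cor(dat1$target_qb_value, dat1$elo_dif, use="complete.obs")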