Final Project - DATA 606

Jack Wright

11/30/2020

Part 1 - Introduction

Research Question

I want to create a model to predict the difference in the final score of NFL games from the "modern era" of football, which I define as the seasons after 1992. This cutoff is somewhat arbitrary, but it falls just after the introduction of free agency, which was a large change in the way the game was played.

Part 2 - Data

This is an observational study. The game data were collected by the NFL, and the Elo ratings are produced by fivethirtyeight.com. The response variable is the numerical difference in score at the end of the game, and I will cycle through several explanatory variables that I will create from the data. There are 7326 cases (games) after filtering for games played after the 1992 season.

Part 3 - Exploratory data analysis

First look at the data:

head(dat%>%select(elo1_pre,elo2_pre,score1,score2))
##   elo1_pre elo2_pre score1 score2
## 1 1503.947 1300.000     48      0
## 2 1503.420 1300.000     10      0
## 3 1493.002 1504.908     14      0
## 4 1516.108 1478.004     45      0
## 5 1368.333 1300.000     20      0
## 6 1504.688 1300.000     48      0

Note that two teams play in each game. By always orienting the score difference from the perspective of the team with the higher Elo rating, I implicitly create a target team in each observation. This also allows me to add an is_home variable, to test whether there is a correlation with the target team being "at home" or "away."

#filter for modern era
dat<-dat%>%
  filter(season>1992)
dat1<-dat%>%
  #get the difference in the ELO scores
  mutate(elo_dif=elo1_pre-elo2_pre)%>%
  #filter out elo differences of zero
  filter(abs(elo_dif)>0)%>%
  #get the score difference based on who had the higher elo
  mutate(score_dif=ifelse(elo_dif>0,score1-score2,score2-score1))%>%
  #convert all elo to positive, now that the scores are ordered correctly
  mutate(elo_dif=abs(elo_dif))%>%
  #flag whether the higher-Elo team is home or away (team 1 is the home team in the data)
  mutate(is_home=ifelse(elo1_pre>elo2_pre,1,0))%>%
  #get the higher-Elo team's rating (is_home==1 means team 1 has the higher Elo)
  mutate(target_elo=ifelse(is_home==1,elo1_pre,elo2_pre))%>%
  #get the higher-Elo team's QB rating
  mutate(target_qb_value=ifelse(is_home==1,qb1_value_pre,qb2_value_pre))

I will take a first look at the relationship between elo difference and final score difference.

ggplot(data = dat1, aes(x = elo_dif, y = score_dif)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

It does look like there is a positive correlation between elo_dif and score_dif. It is also worth noting that as elo_dif approaches zero, the score difference also trends toward zero. This is a good sign that the variable is telling us something about score_dif.
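
As a quick numeric check on that impression (assuming dat1 as built above), the correlation can be computed directly:

#correlation between Elo difference and score difference
cor(dat1$elo_dif, dat1$score_dif, use = "complete.obs")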

Now I will take a look at the is_home variable.

ggplot(dat1,aes(x=is_home, y=score_dif)) +
  geom_point()+ 
  stat_smooth(method = "lm", se=FALSE)

This also appears to have a linear relationship, with the home team benefiting from a higher score difference.
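
Because is_home is binary, group means may be easier to read than the scatterplot; a quick sketch, again assuming dat1 as built above:

#mean score difference for the higher-Elo team, split by home (1) and away (0)
dat1 %>%
  group_by(is_home) %>%
  summarise(mean_score_dif = mean(score_dif, na.rm = TRUE))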

Part 4 - Inference

The requirements for linear modeling are met by this data (a linear trend, no overly influential points, and constant variability), so let's give it a try!

#elo_dif and score_dif
m1<-lm(score_dif~ elo_dif, data=dat1)


#is_home and score_dif
m2<-lm(score_dif~is_home, data=dat1)

(table<-rbind("elo_dif"=m1$coefficients,"is_home"=m2$coefficients))
##         (Intercept)    elo_dif
## elo_dif  -0.1741653 0.04151956
## is_home   1.6961249 5.10508158

Note that the column heading above comes from m1, so the slope in the is_home row is the coefficient on is_home, not elo_dif. I will use the summary function to retrieve the R^2.

#summary(m1)
#summary(m2)
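
Rather than printing the full summaries, the R^2 values can also be pulled out directly:

#extract just the R-squared value from each model summary
c(elo_dif = summary(m1)$r.squared,
  is_home = summary(m2)$r.squared)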

The R^2 for the elo_dif model is only about .06, which means that only 6% of the variance in the score difference is accounted for by the difference in Elo. The R^2 for is_home is even lower, at about .03.

Multiple Regression

I will now run a multiple regression with both explanatory variables to see whether the fit improves and the adjusted R^2 is higher.

score.lm.full<-lm(score_dif~is_home+elo_dif, data=dat1)
summary(score.lm.full)
## 
## Call:
## lm(formula = score_dif ~ is_home + elo_dif, data = dat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.643  -8.449  -0.508   8.410  51.639 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.659548   0.306482  -8.678   <2e-16 ***
## is_home      5.017160   0.317095  15.822   <2e-16 ***
## elo_dif      0.041108   0.001963  20.944   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.48 on 7231 degrees of freedom
##   (79 observations deleted due to missingness)
## Multiple R-squared:  0.08802,    Adjusted R-squared:  0.08777 
## F-statistic:   349 on 2 and 7231 DF,  p-value: < 2.2e-16

When I run a multiple regression with these two variables, you can see that the R^2 and adjusted R^2 have both increased, but not by as much as you might expect. The individual models' R^2 values do not simply add together; the adjusted R^2 builds in a penalty for complexity, which keeps you from stacking up variables that unnecessarily complicate the model.
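
As a sanity check on that penalty, the adjusted R^2 can be reproduced by hand from the summary output above:

#adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- 7234     #7231 residual degrees of freedom + 3 estimated coefficients
p <- 2        #number of explanatory variables
r2 <- 0.08802
1 - (1 - r2) * (n - 1) / (n - p - 1)    #about 0.08777, matching the summary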

I will make one last check to see if there are any other variables with a stronger relationship to score_dif, using the "backward selection" model-building strategy.

back.lm<-lm(score_dif~elo_dif+is_home+target_elo+target_qb_value, data=dat1)
summary(back.lm)
## 
## Call:
## lm(formula = score_dif ~ elo_dif + is_home + target_elo + target_qb_value, 
##     data = dat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.970  -8.389  -0.431   8.600  50.432 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.286935   3.517505   1.219   0.2230    
## elo_dif          0.040757   0.002199  18.538  < 2e-16 ***
## is_home          5.013688   0.316206  15.856  < 2e-16 ***
## target_elo      -0.006117   0.002408  -2.540   0.0111 *  
## target_qb_value  0.020024   0.003031   6.607  4.2e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.44 on 7229 degrees of freedom
##   (79 observations deleted due to missingness)
## Multiple R-squared:  0.09351,    Adjusted R-squared:  0.093 
## F-statistic: 186.4 on 4 and 7229 DF,  p-value: < 2.2e-16

When performing backward selection, you remove the variable with the highest p-value and refit the model. Each p-value here measures the evidence that the explanatory variable helps account for the response once all of the other variables are accounted for. In this model, target_elo has the highest p-value (.0111); it is still significant at the .05 level, but removing it gives a simpler model.
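
As an aside, R can automate a similar search with step(), which compares models by AIC rather than by p-values (an alternative to the manual elimination used here, not the same procedure). step() needs a constant set of rows, so incomplete cases are dropped first:

#automated backward selection by AIC (alternative to manual p-value elimination)
dat1_complete <- dat1 %>%
  filter(complete.cases(elo_dif, is_home, target_elo, target_qb_value, score_dif))
full.lm <- lm(score_dif ~ elo_dif + is_home + target_elo + target_qb_value,
              data = dat1_complete)
step(full.lm, direction = "backward")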

I will rerun the regression without the variable "target_elo".

back.lm<-lm(score_dif~elo_dif+is_home+target_qb_value, data=dat1)
summary(back.lm)
## 
## Call:
## lm(formula = score_dif ~ elo_dif + is_home + target_qb_value, 
##     data = dat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.285  -8.422  -0.468   8.525  50.411 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4.579301   0.438605 -10.441  < 2e-16 ***
## elo_dif          0.038463   0.002005  19.181  < 2e-16 ***
## is_home          5.005570   0.316309  15.825  < 2e-16 ***
## target_qb_value  0.016746   0.002743   6.104 1.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.45 on 7230 degrees of freedom
##   (79 observations deleted due to missingness)
## Multiple R-squared:  0.0927, Adjusted R-squared:  0.09232 
## F-statistic: 246.2 on 3 and 7230 DF,  p-value: < 2.2e-16

Adding the variable "target_qb_value" to our original multiple regression raises the adjusted R^2 from about .088 to about .092. (Strictly, the four-variable model's adjusted R^2 of .093 was slightly higher still, so dropping target_elo trades a small amount of fit for simplicity.) I will select this three-variable model as my best model.
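
To compare the candidates side by side (the four-variable fit was overwritten above, so it would need to be refit under its own name to be included here):

#adjusted R^2 for the two models still in memory
sapply(list(two_vars = score.lm.full, three_vars = back.lm),
       function(m) summary(m)$adj.r.squared)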

To confirm that the model fits the data, I will make some checks on the residuals.

ggplot(data = back.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

There is no trend in the residuals versus the fitted values, which supports the linearity and constant-variance assumptions.

ggplot(data = back.lm, aes(x = .resid)) +
  geom_histogram(binwidth = 1) +
  xlab("Residuals")

The residuals are distributed approximately normally around zero.

QQ plot

ggplot(data = back.lm, aes(sample = .resid)) +
  stat_qq()

Using a QQ plot, we can further confirm that the residuals are nearly normal, meaning a multiple linear regression model is appropriate for interpreting this data.
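
A reference line makes departures from normality easier to judge; stat_qq_line() (available in ggplot2 3.0.0 and later) adds one:

#QQ plot with a reference line for comparison
ggplot(data = back.lm, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()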

Part 5 - Conclusion

On the whole, the relationship between score_dif and the variables I have created is noisy, but it is important to understand the context of the study. While models of physical processes can have an R^2 of .9, a "good" R^2 for human processes can be less than .3. Prediction here is further complicated by the fact that 22 humans are on the football field at all times, so predicting the outcome of their collective behavior is more difficult still.

I am not sure I would lay any money down on a model with an R^2 this low, but I know that most sportsbooks develop their betting lines with "home" vs. "away" as one of the largest factors (usually worth about 3 full points of score difference). It is also common knowledge that the quarterback is the most impactful player on a football team, so I was a little surprised that QB value did not have a larger effect in the model. I wonder if this is due to collinearity with the Elo ratings.
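
One quick way to probe that suspicion (assuming dat1 as built above) is to look at the pairwise correlations among the predictors; strongly correlated predictors can share their explanatory power and shrink each other's apparent effects:

#pairwise correlations among the explanatory variables
cor(dat1[, c("elo_dif", "target_elo", "target_qb_value")],
    use = "complete.obs")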