Predicting NFL running back rushing yards using Hierarchical Bayesian Linear Regression

Click the Report and Code tabs to view the Project Report and the Project Code.

Report

Introduction

Aims/Objectives

The aim of this project is to produce a Bayesian linear regression model that can accurately predict rushing yards for NFL running backs. In NFL fantasy football, fantasy points are awarded to running backs based on the number of rushing yards they gain in a game. Being able to accurately predict the number of rushing yards an NFL running back will gain in a game would be valuable information for an NFL fantasy football manager as it would allow them to select players for their team that will maximise the points their fantasy team will score, and their chances of beating their opponent. Analysis in this project focuses on Derrick Henry, a high-performing running back playing for the Tennessee Titans, who also happens to be on the author’s fantasy team.

This report summarises the process followed to conduct Bayesian analysis of running back rushing yards in the NFL Running Back dataset, including the creation of a hierarchical Bayesian regression model. The report covers descriptive analysis, creating a JAGS model diagram, specification of prior distributions, generating a hierarchical Bayesian regression JAGS model, diagnostic checking, estimation, and interpretation of model parameters, and making predictions using the model.

All R codes used in this report are provided in the Code tab

Methodology

The NFL Running Back dataset

The NFL Running back dataset used in this project was based on the “NFL Offensive Stats 2019 – 2022” dataset, compiled and published to the Kaggle website in August 2022 by the Kaggle user Daniel Fernandez (Fernandez, 2022). According to Fernandez (2022), the original dataset included NFL offensive statistics such as rushing yards, passing yards, and receiving yards for all players from the 2019-2021 seasons. The NFL Running Back dataset was produced by filtering the original dataset to observations for running backs, and selecting features relevant to running backs (e.g., rushing yards, type of field the game was being played on).

Descriptive look, and the type of data.

After splitting the NFL Running back dataset into dependent (i.e., rush_yds) and independent variables (predictors), descriptive statistics were produced for the dependent variable (Table 1).

##   vars    n  mean    sd median trimmed   mad min max range skew kurtosis   se
## 1    1 4447 33.94 36.38     23   28.11 31.13 -12 253   265 1.48     2.62 0.55
##   Q0.25 Q0.5 Q0.75
## 1     5   23    53

Table 1: Descriptive statistics for the dependent variable, rush_yds, in the NFL Running Back dataset.

Descriptive statistics of rushing yards showed the data included 4,447 observations of yards gained by running backs in NFL games. The average number of rushing yards gained in a game by a running back was 33.9, while the median was 23, and three-quarters of the observations had 53 or fewer rushing yards. As the median was less than the mean, and the skewness was greater than 1.0, the descriptive statistics indicated the data was positively skewed. The number of rushing yards had a range of 265, with a minimum of -12, and a maximum of 253, indicating the domain was large and could take negative and positive values.

A histogram and a density plot (Figure 1 and Figure 2) were then produced to explore the distribution visually.

Figure 1: Histogram of the dependent variable, rush_yds, in the NFL Running Back dataset.

Figure 2: Density plot of the dependent variable, rush_yds, in the NFL Running Back dataset.

As there appeared to be a small number of very large values in the dataset, values greater than 200 were examined to confirm they were valid and determine if they were relevant to the goal of predicting running back rushing yards.

##              player rush_yds
## 1   Jonathan Taylor      253
## 2    Derrick Henry       250
## 3 Leonard Fournette      225
## 4    Raheem Mostert      220
## 5    Derrick Henry       215
## 6    Derrick Henry       212
## 7    Derrick Henry       211
## 8       Dalvin Cook      206
## 9       Dalvin Cook      205

Table 2: Observations in the NFL Running Back dataset where the number of rushing yards was greater than 200.

Analysis of the large rushing yard values identified 9 observations where the number of rushing yards in a game was greater than 200 (Table 2), with just 2 players achieving this feat more than once in the dataset (Derrick Henry and Dalvin Cook). Although achieving over 200 rushing yards was very uncommon, dropping all observations over 200 yards was deemed inappropriate as Derrick Henry rushed for more than 200 yards multiple times, indicating that rushing for more than 200 yards was not a unique event for Derrick Henry.

However, further review identified that Derrick Henry only rushed for more than 220 yards in one game. Based on this, all observations greater than or equal to 220 rushing yards were deemed unique events and were dropped from the dataset, as they were deemed highly unlikely to be repeated by any player.

Further analysis of the dataset identified 103 players who had participated in fewer than 15 games. A review of the 103 players identified that many had a small number of observations because they had been dropped from their team after playing a handful of games. Producing accurate predictions for these players would be difficult as they had a small number of observations. Furthermore, being able to make predictions for these player would not be as useful as many stopped playing after they were dropped. As such, players who had participated in fewer than 15 games were dropped from the dataset.

After removing observations of 220 rushing yards or more, and observations for players who had played in fewer than 15 games, the resulting dataset had 3,853 observations, with summary statistics shown in Table 3.

#Data Exploration - games
y <- nfl_games$rush_yds
x <- nfl_games[, !names(nfl_games) %in% c("rush_yds", "player_id", "game_id")]
numeric_cols <- unlist(lapply(x, is.numeric), use.names = FALSE)
x_num <- x[,numeric_cols]
x_cat <- x[,!names(x) %in% names(x_num)]
#__________________________
# dependent variable
#__________________________
describe(y, quant=c(.25, .5, .75))

##   vars    n  mean    sd median trimmed   mad min max range skew kurtosis   se
## 1    1 3853 36.52 36.58     26   31.08 32.62 -11 215   226 1.31     1.76 0.59
##   Q0.25 Q0.5 Q0.75
## 1     6   26    57

Table 3: Descriptive statistics for the dependent variable, rush_yds, in the NFL Running Back dataset after removing observations >220 and players with fewer than 15 games played.

The descriptive statistics in Table 3 showed that the dependent variable had a range of 226, with a minimum of -11, and a maximum of 215, indicating the domain was still large and could take negative and positive values. The average number of rushing yards in a game for a running back was 36.5, while the median remained at 26. As the median was still less than the mean, and the skewness was still greater than 1.0, the descriptive statistics indicated the data was still positively skewed. Finally, with a third-quartile of 57, the descriptive statistics showed that three-quarters of all observations were of 57 or fewer rushing yards.

The histogram in Figure 3 and density plot in Figure 4 showed that after removing observations with rushing yards of 220 yards or more, and players with fewer than 15 games, the distribution of rushing yards was still highly positively skewed, with a long tail and many outliers.

Figure 3: Histogram of the dependent variable, rush_yds, in the NFL Running Back dataset after removing players with less than 15 games played and observations >220.

Figure 4: Density plot of the dependent variable, rush_yds, in the NFL Running Back dataset after removing players with less than 15 games played and observations >220.

length(unique(nfl_games$player_id))

## [1] 123

Table 4: Number of players in the NFL Running Back dataset after removing observations >220 and players with fewer than 15 games played.

After removing observations of 220 rushing yards or more, and observations for players who had played in fewer than 15 games, the final dataset had 123 unique players, who would be used as subjects in the NFL Running Back model.

Feature selection

Aside from the dependent variable, the original dataset included 68 columns of potential features. While some columns were irrelevant to rushing yards (e.g., how many passing yards did a player achieve in a game), fields such as the Temperature, humidity, and whether a player’s team was favoured to win, were identified as potentially relevant to how many rushing yards a running back would gain. After reviewing the features, including visual inspection of several borderline features, 7 predictors (outlined in Table 5) were selected for the final dataset.

Table 5: The variables selected in the final dataset for the NFL Running Back model.

After selecting the independent variables to be included in the final dataset, descriptive statistics were produced for the numerical independent variables, presented in Table 6.

#__________________________
# Independent variables
#__________________________
describe(x_num, quant=c(.25, .5, .75))

##             vars    n  mean   sd median trimmed  mad   min  max range  skew
## TempC          1 3853 17.19 8.51  22.20   17.85 6.52 -13.9 33.9  47.8 -0.77
## HumidityPct    2 3853  0.56 0.17   0.48    0.55 0.12   0.0  1.0   1.0  0.41
## Wind_Speed     3 3853  5.80 5.98   5.00    5.01 7.41   0.0 35.0  35.0  0.91
## Vegas_Line     4 3853 -5.68 3.70  -4.50   -5.25 2.97 -22.0  0.0  22.0 -1.18
##             kurtosis   se Q0.25  Q0.5 Q0.75
## TempC          -0.13 0.14 11.10 22.20 22.20
## HumidityPct     0.04 0.00  0.45  0.48  0.67
## Wind_Speed      0.50 0.10  0.00  5.00 10.00
## Vegas_Line      1.38 0.06 -7.50 -4.50 -3.00

Table 6: Descriptive statistics for the numerical independent variables in the NFL Running Back dataset.

The descriptive statistics for the numerical independent variables showed that:

The average temperature in the dataset was 17.2 degrees (Celsius), with a median of 22.2 degrees, and a third quartile of 22.2 degrees, indicating that the distribution of Temperature was slightly negatively skewed, with three-quarters of all Temperatures at or below 22.2 degrees. The range of Temperatures was 47.8, with Temperatures as low as -13.9 degrees, and as high as 33.9.
The average humidity was 56%, with a median of 48% and a third quartile of 67%, indicating that the distribution of humidity was slightly positively skewed, with three-quarters of all humidity values at or below 67%. As humidity is a percentage, it was unsurprising that the minimum and maximum values for humidity were 0% and 100%, respectively.
The average wind speed recorded in the dataset was 5.8 knots, with a median of 5.0 knots, and a third quartile of 10 knots, indicating that the distribution of wind speed was slightly positively skewed, and three-quarters of all wind speeds were at or below 10 knots. The fastest recorded wind speed was 35 knots, leading to a range of 35 (wind speed cannot be negative).
Vegas lines represent the expected margin for a game, or “how close” betting agencies expect the final scores to be. A Vegas line is always in favour of the team expected to win, so it only takes negative values (e.g., “Vegas expects the Buffalo Bills to win by 6 points” would mean a margin -6 in favour of the Bills). The average Vegas line was -5.70 (points), with a median of -4.50 and a third-quartile of -3.0. The distribution of Vegas lines was positively skewed, and three-quarters of all Vegas lines were at or below -3.0, indicating most games were expected to be decided by at least a field goal (worth 3 points). The largest margin was 22 points (-22) while the smallest margin was 0 (even, no favourite to win).

Figure 5: Plots of Rushing yards by running backs for numerical independent variables, by whether the game was played at the player’s home ground or away, in the NFL Running Back dataset.

The plot of Humidity (%) and rushing yards (Figure 5, top-left) appeared to show that as humidity increased, the number of rushing yards a running back gained tended to increase. As a higher humidity can be associated with a higher chance and volume of rain, and rain can make it difficult to catch a ball, it is possible that higher humidity levels could lead teams to favour running plays, leading to more rushing attempts, and therefore, more rushing yards. There was no clear difference in the relationship between humidity and rushing yards based on whether the player’s team was favoured to win or not.

The plot of Vegas Line and rushing yards (Figure 5, top-right) appeared to show that as the Vegas Line moved closer to zero (the betting agencies expected a game to be close/competitive), the number of rushing yards increased. This was not surprising, as teams in the lead in close games tend to favour running plays over passing plays because the clock is usually not stopped after a running play. This means a running play allows more time pass, compared to a passing play, reducing the amount of time an opponent may have to attempt a comeback. There was no clear difference in any relationship based on whether the player’s team was favoured to win or not.

Although the plot of Wind Speed and rushing yards (Figure 5, bottom-left) appeared to show that higher wind speeds were associated with a decrease in the number of rushing yards gained, many of the observations recorded no wind, which made the relationship difficult to identify. There was no clear difference between wind speed and rushing yards based on whether the player’s team was favoured to win or not.

The plot of Temperature and rushing yards (Figure 5, bottom-right) showed no clear relationship between the number of rushing yards a running back gained and the temperature of the game. There was also no clear relationship between temperature and rushing yards based on whether or not the player’s team was favoured to win.

Table 7: Correlation matrix for the continuous variables in the NFL Running Back dataset.

The correlation matrix in Table 7 showed none of the numerical variables were highly correlated with the other numerical variables, with the strongest correlation being a moderate-to-weak strength negative correlation (-0.341) between Temperature and Humidity. Although the presence of a correlation between temperature and humidity was not surprising, the strength of the correlation being moderate-to-weak was somewhat unexpected given temperature and humidity are known to be inversely proportional.

Frequencies were then produced for the categorical variables, shown in Table 8.

## x_cat$HomeAway : 
##         Frequency Percent
## away         1930    50.1
## home         1923    49.9
##   Total      3853   100.0

## x_cat$Field : 
##               Frequency Percent
## Natural Grass      2339    60.7
## Artificial         1514    39.3
##   Total            3853   100.0

## x_cat$FvrdToWin : 
##         Frequency Percent
## Yes          1977    51.3
## No           1876    48.7
##   Total      3853   100.0

Table 8: Descriptive statistics for the categorical independent variables in the NFL Running Back dataset.

The descriptive statistics for the categorical variables (Table 8) showed that the number of home and away observations was balanced, as was the number of games where a player’s team was/was not favoured to win. Table 8 also showed that there were more observations played on natural grass fields (60.7%), than artificial grass (39.3%). The categorical variables were also plotted to explore their distributions visually (Figure 6, Figure 7, and Figure 8).

Figure 6: Running back rushing yards by Field Type in the NFL Running Back dataset.

From the plot of Field type and rushing yards (Figure 6) there was no clear difference in the number of rushing yards gained when playing on Natural Grass, compared to playing on Artificial Grass. This was not surprising, as advancements in manufacturing have meant artificial grass is very similar to natural grass, reducing the impact of field type on rushing yards gained. Nonetheless, it was interesting to note that there actually appeared to be no difference.

Figure 7: Running back rushing yards by whether they were playing at home or away in the NFL Running Back dataset.

The plot of Home/Away and rushing yards (Figure 7) showed no clear difference in the number of rushing yards gained when playing at home, compared to playing at an opponent’s field (away). This was a little surprising given the concept of “home field advantage” suggests that a player should perform better (gain more yards) when playing at home due to factors such as being familiar with the pitch and having a majority of the crowd cheering for you.

Figure 8: Running back rushing yards by whether the player’s team was favoured to win the game in the NFL Running Back dataset.

The plot of Favoured to Win and rushing yards (Figure 8) appeared to show a slightly higher median for rushing yards gained when the player’s team was favoured to win, compared to when their opponent was favoured to win. The range for rushing yards was also larger for players on teams favoured to win, compared to those on teams not favoured to win. These results were not surprising considering being favoured to win often reflects how strong a team is, compared to their opponent, and a strong team may have advantages over their opponent that allow their running backs to gain more yards, compared to teams that are not favoured to win.

The following columns were initially considered for inclusion in the final dataset but were dropped after exploratory analysis identified issues that could impact the viability of the Bayesian analysis.

Opponent

While some NFL teams tend to be better at “stopping the run” than others, the plot of rushing yards by opponent (Figure 9) showed no opponent was uniquely good or bad at preventing rushing yards. Furthermore, because of the way an NFL season is structured, a team can only play a maximum of 18 unique opponents in a season, meaning the sample size for estimating the effect of each opponent on a player would be very small. Finally, to model the Opponent feature would have required the use of dummy variables, and because there are 32 teams, including the Opponent feature would have resulted in a model with more than 32 parameters. As such, the Opponent feature was not included in the final dataset as it would have resulted in a cumbersome model and would have made achieving acceptable diagnostics difficult.

Figure 9: Running back rushing yards by opponent in the NFL Running Back dataset.

Roof

Although the type of stadium (roof type) a team is playing in could influence the play style (e.g., it is considered easier to pass the ball in a domed stadium than an outdoor stadium), there was a large imbalance in the distribution of Roof type, with more than 2/3 of all observations played outdoors (Figure 10). Furthermore, visualising the relationship between Roof type and rushing yards (Figure 10) showed there was no clear relationship between the Roof type and the number of rushing yards. Because there was an imbalance in Roof types, and no relationship could be observed between Roof type and rushing yards, the Roof variable was not included in the final dataset.

Figure 10: Frequency of Roof type (left), and distribution of rushing yards by Roof Type (right), in the NFL Running Back dataset.

Vegas Over/Under

The Vegas Over/Under indicates the total number of points expected to be scored in a game, with a high Over/Under indicating there should be many scoring opportunities. It also implies a high number of yards will be accumulated as each team moves the ball from one end of the field to the other to score. However, the Vegas Over/Under does not indicate how betting agencies expect the points to be distributed, meaning a high Over/Under could be the result of one team scoring a lot of points and their opponent scoring very few. As such, it was not surprising that the plot of rushing yards and Vegas Over/Under (Figure 11), showed no clear relationship, and because of this, Vegas Over/Under was not included in the final dataset.

Figure 11: Running back rushing yards by Vegas Over/Under in the NFL Running Back dataset.

Mathematical model

Although the dependent variable (rush_yds) was a continuous variable and had a distribution that appeared similar to a Gamma distribution, because the domain of the dependent variable was (-∞, +∞) a Gamma distribution, which has a domain of [0,+∞), was deemed inappropriate.

A student t distribution was chosen for modelling the dependent variable (rush_yds) as the domains were aligned, and the distribution was skewed, making a normal distribution inappropriate. As outliers were observed in the histogram and density plot of rush_yds (Figure 1 and Figure 2), a gamma distribution was chosen to model sigma.

A broad exponential distribution was chosen to model nu because the degrees of freedom (the normality parameter) can only take values from zero to infinity.

As NFL running backs are individuals with varying levels of skill as well as varying height and weight, which can all impact their ability to gain rushing yards, a hierarchy was added to the model to capture individual effects.

The resulting model for rush_yds was written in the form:

This model assumes:

The mean of the residuals is zero.
Homogeneity of variances (variance of 𝑌 doesn’t change for all values of the independent variables).
There is no autocorrelation of residuals.
The independent variables are uncorrelated.
There is no probability distribution assumed for describing the independent variables.

JAGS model diagram

The model diagram was drawn-up as:

Figure 12: JAGS model diagram showing the model hierarchy and the multiple linear regression setting for modelling the NFL Running Back dataset.

Specification of prior distributions

Non-informative priors

When building the model for the first time, non-informative or “vague” prior distributions were used to mitigate the risk that errors in the model or a failure to compile would be due to errors in specifying prior information. Using non-informative priors would also provide a “baseline” for beta coefficients, so that when informative priors were applied, their impact on beta coefficients could be observed by comparing coefficient values with the baseline.

A non-informative prior assigns an equal, or very close to equal, probability to all possible outcomes (Syversveen, 1998). To establish a non-informative Normal prior, the values of mean and variance are set so the resulting distribution is “flat”. The Normal distribution app (Figure 13), shows how a normal distribution with a relatively high variance (green) is closer to a flat distribution than a normal distribution with a small variance (red) (Demirhan H. , Normal Distribution, 2022). Therefore, to establish a non-informative Normal prior, the variance should be set to a relatively high value. Because the distribution is flat, the mean can be set to any value as there is no “peak” in the non-informative distribution.

Figure 13: Probability Density Functions for normal distributions with different variances.

The following R code was used to specify non-informative priors for each parameter. The value of Var0 was set to 1 to improve the stability of the Markov chains.

#====================================
# Priors on original scale – non-informative.
#====================================
mu0 <- ym                             #set to mean of dependent variable
for (a in 1:Nx) { 
     mu[a] <- 1                       #set all prior information to 1
}
#====================================
# Variances to reflect non-informative priors
#====================================
Var0 <- 1                             #Set to 1 to improve stability of the chains
for (q in 1:Nx) { 
     Var[q] <- 500                    #Set all variances to 500 (large, flat, non-informative)
}

Table 9: R code used to specify non-informative prior distributions for each predictor.

As the variance (sigma) was modelled using a Gamma(A,B) distribution, the Gamma distribution specified by the mean and standard deviation app was used to identify values for “A” and “B” that result in a non-informative prior (Demirhan H. , Gamma Distribution Specified by Mean and Standard Deviation, 2022). A Gamma distribution with a mean of 1 and a standard deviation of 10 is often used to produce a non-informative Gamma distribution with a mode of zero (Kruschke, 2015), and Figure 14 shows how a non-informative Gamma prior can be produced using these values, resulting in values for “A” and “B” of 0.01 and 0.01, respectively.

Figure 14: Probability Density Function for Gamma distribution, including values of parameters α and β.

The following R code was used to specify non-informative priors for each parameter. The value of Var0 was set to 1 to improve the stability of the Markov chains.

sigma ~ dgamma(0.01, 0.01)    #sigma distributed as gamma (A,B)

Table 10: R code used to specify non-informative prior distribution for sigma.

The degrees of freedom (normality parameter) was given a broad exponential prior as the normality parameter can only take values from zero to infinity. The Exponential distribution app was used to identify that an Exponential distribution with a small parameter value (blue line) is flatter, and non-informative, compared to Exponential distributions with larger parameter values (yellow and grey lines) (Demirhan H. , 2022). Based on the results from the Exponential distribution app and the normality parameter used by Kruschke (2015), a parameter value of 1/30 was used in the JAGS model to produce a non-informative exponential distribution (Kruschke, 2015).

Figure 15: Probability Density Functions for Exponential distribution with different parameter values.

nu ~ dexp(1/30.0)           #nu distributed as exponential (1/K)

Table 11: R code used to specify non-informative prior distribution for nu.

Informative priors

Once the model had been compiled and a baseline established, informative priors were specified for beta parameters using the R code in Table 13.

The prior information is summarised in Table 12, with information sourced from fantasy football experts, and personal knowledge/experience of fantasy football.

Expert information for the informative priors was obtained from “The Fantasy Footballers” podcast, a sports and entertainment podcast and website, focused on fantasy NFL football (Holloway, Moore, & Wright, The Fantasy Footballers podcast, 2022).

Research conducted by The Fantasy Footballers in 2021 showed that running backs playing in games where their team are favoured to win, average an additional 27 rushing yards, compared to running backs on teams who are not favoured to win (DiSorbo, 2021). Given the statistical analysis this expert knowledge is based on, the degree of belief in this expert knowledge was considered very strong.
Cold temperatures are widely considered on the Fantasy Footballers podcast as having a positive impact on rushing yards (Holloway, Moore, & Wright, Week 3 Matchups + Wheel of Shame, The Yeti Rises?, 2022). Although not explicitly stated, this effect was estimated as an additional 0.5 rushing yards for each 1 degree reduction in temperature (or -0.5 for each 1 degree increase). The belief in this expert knowledge was considered strong.

From my own experience observing NFL running back performances, I have also observed the following:

Running backs tend to perform better when playing at home, as the player is familiar with the field.
Games with high humidity tend to have more rushing yards, as high humidity often leads to rainy conditions that favour running plays over passing plays.
Games with high winds tend to have more rushing yards, as wind makes passing the ball difficult, leading to more running plays and more opportunities to gain rushing yards.
When a game is expected to be a “blow-out” (i.e., the Vegas line is more negative), the number of rushing yards tends to decrease. This is because the team that is losing/expected to lose the game is required to pass the ball more (run the ball less) to catch-up to their opponent. Although the team that is winning/expected to win tends to favour rushing plays (rushing plays take longer, leaving less time for the losing team to mount a comeback), there is an incentive for the winning team to rush for small gains, as this creates more stoppages and consumes more time.

Based on my knowledge, I have proposed the following additional prior information:

The expected number of rushing yards will increase by 10 yards if a running back is playing at home compared to away, and my belief in this information is moderate.
The number of rushing yards will increase by 3 yards for each 10% increase in humidity. As humidity ranges from 0 to 1, the prior information will be 30 (30x0.1 = 3). Degree of belief in this information is moderate.
The number of rushing yards will increase by 1.5 yards for each additional knot of wind speed, although my belief in this information is weak.
The number of rushing yards will reduce by 1 yard for each additional point the Vegas line is from zero. As all Vegas lines are negative, the sign on the prior information will be positive. Given Vegas lines are negatively skewed, and games often do not play-out as expected, my belief in this information is very weak.

Table 12: Summary of informative prior information for the NFL Running Back model.

The expert prior information was implemented using the code presented in Table 13.

#====================================
# Priors on original scale:
#====================================
#Informative
  mu0 <- ym                       #set to mean of y
  mu[1] <- 10                     #Home/Away
  mu[2] <-    1                   #Field
  mu[3] <- -0.5                   #TempC
  mu[4] <- 30                     #HumidityPct (+10% increase in humidity = an extra 5 yards)
  mu[5] <- 1.5                    #Wind_Speed
  mu[6] <- 1                      #Vegas_Line
  mu[7] <- 27                     #FvrdToWin  
#====================================
# Prior variances to reflect the confidence in the expert information
#====================================
#_________________________________________________________________
# Informative = Very Weak | Weak | Moderate | Strong | Very Strong
# Variance    =   100     | 10   |   1.0    |  0.1   |    0.01
#_________________________________________________________________
  Var0 <- 1                       #Set to 1
  Var[1] <- 1.0                   #Home/Away
  Var[2] <-     500               #Field
  Var[3] <- 0.1                   #TempC
  Var[4] <- 1.0                   #HumidityPCt
  Var[5] <- 10                    #Wind_Speed
  Var[6] <- 100                   #Vegas_Line
  Var[7] <- 0.01                  #FvrdToWin

Table 13: R code used to specify informative prior distributions for each predictor.

Creating JAGS data and model blocks

The following R code was used to create the JAGS data block for modelling the NFL Running Back dataset. The data was not standardised due to challenges encountered when attempting to back-transform standardised values to original scale from the hierarchical model.

# The data being modelled
y <- data[,yName]
p <- data[,sName]   # hierarchy data
x <- as.matrix(data[, !names(data) %in% c(yName, sName)])
Nsubj <- length(unique(p)) 
#Specify the data in a list to be sent to JAGS
dataList = list(
  x = x
  ,y = y
  ,p = p              #subjects (player ids)
  ,Nsubj = Nsubj
  ,Nx = dim(x)[2]
  ,Ntotal = dim(x)[1] 
  ,xPred = xPred  # Data for predictions
)
data {
#__________________________________ > create variables using y
  ym  <- mean(y)
}

Table 14: R code used to create JAGS data block.

The following R code was used to create the JAGS model block for modelling the NFL Running Back dataset.

model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dt( dmu[i], 1/sigma^2, nu )            #Likelihood: student t (overall)
    
    dmu[i] <- beta0[p[i]] + beta[p[i],1]*x[i,1] + beta[p[i],2]*x[i,2] + beta[p[i],3]*x[i,3] 
                          + beta[p[i],4]*x[i,4] + beta[p[i],5]*x[i,5] + beta[p[i],6]*x[i,6]
                          + beta[p[i],7]*x[i,7]
  }
  for ( sIdx in 1:Nsubj ) {  #Because p is a vector, we need a *different index* in the for-loop.
     beta0[sIdx] ~ dnorm( mu0 , 1/Var0 )          #beta0 distributed as normal  
     for ( j in 1:Nx ) {
       beta[sIdx,j] ~ dnorm( mu[j], 1/Var[j] )    #beta[j] distributed as normal
     }
  }
  sigma ~ dgamma(0.01, 0.01)                      #sigma distributed as gamma (A,B)
  nu ~ dexp(1/30.0)                               #nu distributed as exponential

#==================================== 
# Compute predictions at every step of the MCMC
#====================================
  for ( i in 1:5) {
    pred[i] <- beta0[xPred[i,1]]    #Column 1 of xPred is 'player_id'. xPred[1,1]=50 >> beta0[50]
               + beta[ xPred[i,1], 1]  * xPred[i,2] #xPred[1,2] col2 of xPred corresponds to beta1
               + beta[ xPred[i,1], 2]  * xPred[i,3] #xPred[1,3] col3 of xPred corresponds to beta2
               + beta[ xPred[i,1], 3]  * xPred[i,4] #xPred[1,4] col4 of xPred corresponds to beta3
               + beta[ xPred[i,1], 4]  * xPred[i,5]
               + beta[ xPred[i,1], 5]  * xPred[i,6] 
               + beta[ xPred[i,1], 6]  * xPred[i,7]
               + beta[ xPred[i,1], 7]  * xPred[i,8] 
  }
}

Table 15: R code used to create JAGS model block.

Compiling the model and using it to create Markov chains

The “run_model_JAGS” function (Table 16) was used to create the JAGS data and model blocks, compile the model, and create Markov chains (refer appendix). The following R code was used to set model inputs and settings, and to execute the “run_model_JAGS” function.

# set the values to be predicted
#predictions
xPred <- as.matrix(predictions[,c(2, 4, 6:(ncol(predictions)-1))])
Actual_Yards <- as.data.frame( predictions[,1] )
#____________________________________________
# set the parameters for the model to monitor
parameters <- c("beta0", "beta", "sigma", "nu", "pred")
numberChains <- 2
adapt_steps <- 1500    # Number of steps to "tune" the samplers
burn_In_Steps <- 1500  # number of burn-in steps
thinningSteps <- 21
numberSavedSteps <- 5000
enrvironmentName <- paste("GamesH_Non-Informative", numberChains, adapt_steps, burn_In_Steps, thinningSteps, numberSavedSteps, sep="-")
#____________________________________________
#Run JAGS Model
run_model_JAGS(data = nfl_games,
               yName = "rush_yds",                #name of dependent variable (must have)
               sName = "player_id",               #subjects in first hierarchy
               modelparameters = parameters,      #parameters to monitor
               initsList = NULL,                  #initial values (none)
               nChains = numberChains,            #Number of MCMC chains
               adaptSteps = adapt_steps,          #Number of steps to "tune" the samplers
               burnInSteps = burn_In_Steps,       #Number of burn-in steps
               numSavedSteps = numberSavedSteps,  #Number of saved steps
               thinSteps = thinningSteps,         #default thinning steps of 1
               runjagsMethod = "parallel",        #e.g., parallel
               xPred = xPred,
               saveName = enrvironmentName)

Table 16: R code used to execute function for compiling model and creating Markov chains.

Initial values were not specified as the large number of beta coefficients produced by the hierarchical structure created challenges for back-transforming. Although standardised values would improve model efficiency (faster run time), the time required to implement back-transformation correctly was deemed too costly, especially given the model ran successfully without standardisation.

Analysis & Results

Assess appropriateness of Markov chains for each parameter using MCMC diagnostics

Because the model was hierarchical, a set of beta coefficients were produced for each of the 123 players in the dataset. With 7 predictors plus an intercept, this meant there were over 800 diagnostic plots produced. As there were too many plots to assess diagnostics for each beta coefficient individually, it was assumed that if the beta coefficients for a particular player were deemed acceptable, and the minimum ESS across all coefficients was sufficiently large, then the diagnostics for the chosen player would be representative of all diagnostic plots. In assessing the diagnostic plots, the beta coefficients for player 52 (Derrick Henry) were chosen for assessing whether the chains were representative, accurate, and efficient (acceptable).

The settings used in the first implementation, and the resulting run time, are shown in Table 17.

Table 17: Settings and run time for the first implementation of the JAGS model of the NFL Running Back dataset.

In the first implementation, 2 chains were used to observe whether either chain was unduly influenced by its initial value. Having 2 chains would also make it easier to identify if a chain, or part of a chain, was not representative, compared to having 1 chain. The number of thinning steps was set to 2 to observe whether this setting addressed autocorrelation in the diagnostic plots, and the number burn-in steps and saved steps were set to observe the amount of overlapping these settings produced in the parameter density plots and trace plots.

Using these settings, the JAGS model took 617 seconds to run (~10 minutes). The diagnostic plots produced using these settings are shown in Figure 16.

Figure 16: Diagnostic plots for model parameters, produced in the first implementation of the JAGS model of the NFL Running Back dataset.

Although the trace plot of β_0 in Figure 16 showed some evidence of mixing among the chains, there was room for improvement in most trace plots. The parameter density plots in Figure 16 also showed there was room for improvement in the amount of overlap in the chains, and the shrink factors for β_2 and β_4 were unacceptably large (>1.10). As such, it was clear that the settings used in the first implementation did not produce representative chains.

Many of the diagnostic plots in Figure 16 showed evidence of autocorrelation, and the Monte Carlo Standard Errors (MCSE) in the diagnostic plots were unacceptably large, with the largest MCSE coming from β_4 (0.678). The Effective Sample Sizes (ESS) were also unacceptably small, with β_4 having the smallest ESS at 593. As the MCSE and ESS values were unacceptable, and there was evidence of autocorrelation in many of the autocorrelation plots, the accuracy of the chains was deemed unacceptable.

The minimum ESS across all beta coefficients for all players was 210 (β_67,3). While the minimum ESS was small, it was not a surprise given the model settings did not produce representative or accurate chains.

As there was plenty of room for improvement in the diagnostic plots, a second implementation was conducted using an increased number of burn-in steps, thinning steps, and saved steps, to observe their impact on the diagnostics. The settings used in the second implementation, and the resulting run time, are shown in Table 18.

Table 18: Settings and run time for the second implementation of the JAGS model of the NFL Running Back dataset.

With the settings listed in Table 18, the JAGS model took 3,907 seconds to run (~1 hour). The diagnostic plots produced using these settings are shown in Figure 17.

Figure 17: Diagnostic plots for model parameters, produced in the second implementation of the JAGS model of the NFL Running Back dataset.

The second implementation showed that increasing the number of burn-in steps and saved steps improved the amount of mixing among the chains in all trace plots and overlap in the parameter density plots (Figure 17). Increasing the number of thinning steps also improved the shrink factors in the second implementation, with all shrink factors less than 1.10. As the mixing and overlapping of chains was acceptable, and the shrink factors were less than 1.10, the chains in the second implementation were deemed representative.

Although increasing the number of thinning steps in the second implementation addressed most of the autocorrelation observed in the diagnostic plots (Figure 17), many autocorrelation plots still showed some evidence of autocorrelation. Although the additional thinning steps improved the MCSEs in the second implementation (MCSE < 0.22), smaller MCSEs would have been preferable. Finally, increasing the number of thinning and saved steps did increase the Effective Sample Sizes (ESS > 4,250), but larger ESS values would still have been preferable. Although the accuracy of the chains in the second implementation had improved relative to the first implementation, as autocorrelation was still present in the diagnostic plots, the accuracy of the chains in the second implementation was deemed unacceptable.

The minimum ESS across all beta coefficients for all players was 1,839.7 (β_28,3). While the minimum ESS was too small, given the chains were not deemed accurate, the minimum ESS for all betas was not yet a concern.

As the chains were not deemed accurate in the second implementation, a third implementation was conducted using an increased number of thinning steps to address the autocorrelation and improve MCSE and ESS. The settings used in the third implementation, and the resulting run time, are shown in Table 19.

Table 19: Settings and run time for the third implementation of the JAGS model of the NFL Running Back dataset.

With the settings listed in Table 19, the JAGS model took 10,988 seconds to run (~3 hours). The diagnostic plots produced using these settings are shown in Figure 18.

Figure 18: Diagnostic plots for model parameters, produced in the third implementation of the JAGS model of the NFL Running Back dataset.

The trace plots in Figure 18 showed the chains for all parameters were mixing well, with no trends observed in any trace plot. The parameter density plots also showed that there was plenty of overlap among the chains for all parameters in the model. Although the shrink factor for β_0 began at 1.10, it quickly dropped and remained at 1.0, and the shrink factors for all remaining parameters were below 1.10. As the chains showed good mixing and overlap, and the shrink factors were acceptable, the chains were deemed representative.

Increasing the number of thinning steps to 21 addressed the autocorrelation identified in the second implementation (Figure 17), with all autocorrelation plots in Figure 18 showing very little or no autocorrelation. The ESS values in the third implementation were sufficiently large, with all beta parameters for Derrick Henry having an ESS of at 10,000 or more. Although sigma did have an ESS of 8,580, this was still considered large and acceptable. The MCSE values in the third implementation were reasonably small, with the largest MCSE value for Derrick Henry’s beta parameters coming from β_4 (0.161). While smaller MCSE values would have improved confidence in the accuracy of the chains, given the large ESS values and the minimal evidence of autocorrelation, the chains were deemed accurate.

The minimum ESS value across all beta coefficients for all players was 5,451 for β_67,3. As this ESS was still reasonably large, it was deemed reasonable to assume that all diagnostic plots for beta parameters in the model were acceptable. Nonetheless, the diagnostics for player 67 were reviewed (Figure 19). While there was a small amount of autocorrelation in the autocorrelation plot for β_67,3, the ESS was large and the MCSE was small, indicating there was a large amount of independent information in the chains. As such, all diagnostic plots for beta parameters for this player were deemed acceptable.

Figure 19: Diagnostic plots for beta parameters for player 67, produced in the third implementation of the JAGS model of the NFL Running Back dataset.

As the settings in the third implementation produced chains that were deemed representative and accurate, and the run time was not excessive or unreasonable, the model settings were chosen for use in the NFL running back model with informative priors.

Assess the appropriateness of Markov chains for each parameter using MCMC diagnostics with informative priors

With the informative priors (as noted in Table 13) applied to the JAGS model, the settings used to model the dataset, and the resulting run time, are shown in Table 20.

Table 20: Settings and run time for implementation of the JAGS model of the NFL Running Back dataset with informative priors.

With the settings listed in Table 20, the JAGS model took 8,185 seconds to run (2 hours and 16 minutes). The diagnostic plots produced using these settings are shown in Figure 20.

Figure 20: Diagnostic plots for model parameters, produced from the JAGS model of the NFL Running Back dataset with informative priors.

Representativeness

The trace plots in Figure 20 showed both chains were mixing well for all parameters, with no evidence of any chains isolating or lingering, and the parameter value density plots showed the chains were also mostly overlapping, indicating the chains had converged. All shrink factors in the informative prior implementation were less than 1.10, which was particularly reassuring for β_0 as this parameter had a shrink factor close to 1.10 in the third non-informative implementation (Figure 18), which had improved to well below 1.10 in the informative implementation (Figure 20). Furthermore, the posterior distributions (Figure 21) for the beta parameters were smooth and symmetrical. As such, the chains in the informative prior implementation were deemed representative.

Accuracy

There was no evidence of autocorrelation in the autocorrelation plots from the informative implementation (Figure 20), and the ESS values for all parameters were large, with the minimum ESS value being 8,589 (Table 21, nu). The majority of MCSE values were small, with β_2 the only parameter with an MCSE greater than 0.1. That said, the MCSE for β_2 was 0.1, so it was still rather small. As there was no evidence of autocorrelation, the ESS values were large, and most of the MCSE were small, there was plenty of evidence that there was a large amount of independent information present in the chains, and as such, the chains were deemed accurate.

Efficiency

From the summary in Table 20, the computation time for the informative prior implementation of the MCMC was 2 hours and 16 minutes. This run time was deemed reasonable considering the size of the dataset, and shorter run times were only achieved with settings that produced chains that were not representative.

Finally, the minimum ESS across all beta coefficients in the model was 8,158 for β_27,1. As the diagnostic plots for Derrick Henry showed the chains were representative and accurate, and the minimum ESS for a beta parameter in the model was sufficiently large, it was assumed that all diagnostic plots for all beta parameters were acceptable.

Figure 21: Posterior distributions of the beta parameters for player 52 in the NFL Running Back model.

Figure 22: Posterior distributions for Sigma, and log10(Nu), in the NFL Running Back model.

Table 21: Summary of MCMC results for Derrick Henry (player 52) in the NFL Running Back model.

Assess the posterior distribution of each parameter and draw inferences

Figure 23: Posterior distributions of the beta parameters for Derrick Henry (player 52) in the NFL Running Back model.

β_0 – The Bayesian point estimate for β_0,52 (intercept for Derrick Henry) was 36.7 (taken from the mode), and the 95% HDI for β_0,52 had a lower bound of 34.7 and an upper bound of 38.6, meaning there was a 95% probability that the value of β_0,52 lay between these values. As the 95% HDI did not include zero, we could conclude that the value of β_0,52 was significant.

β_1 – The Bayesian point estimate for β_1,52, the coefficient for whether the game was played at Derrick Henry’s home stadium or away, was 9.75 (taken from the mode), indicating that when Derrick Henry was playing at home, their predicted rushing yards increased by 9.75 yards compared to playing away, ceteris paribus. The 95% HDI for β_1,52 had a lower bound of 7.92 and an upper bound of 11.8, meaning there was a 95% probability that the value of β_1,52 lay between these values. As the 95% HDI did not include zero, we could conclude that the value of β_1,52 was significant in predicting the number of rushing yards Derrick Henry will achieve in a game, reflecting the prior information that the number of rushing yards increases by 10 when a player is playing at home, relative to playing away.

β_2 – The Bayesian point estimate for β_2,52, the coefficient for field type for Derrick Henry, was -28.3 (taken from the mode), indicating that when Derrick Henry plays on natural grass (Field=1), their predicted rushing yards reduces by 28 yards, ceteris paribus. The 95% HDI for β_2,52 had a lower bound of -48.7 and an upper bound of -9.26, meaning there was a 95% probability that the value of β_2,52 lay between these values. As the 95% HDI did not include zero, we could conclude that the value of β_2,52 was significant in predicting the number of rushing yards Derrick Henry will achieve in a game, with grass fields reducing Derrick Henry’s predicted rushing yards, relative to playing on artificial turf.

β_3 – The Bayesian point estimate for β_3,52, the coefficient for Temperature (Celsius) for Derrick Henry, was -0.273 (taken from the mode), indicating that for each additional degree of Temperature (Celsius) Derrick Henry’s predicted rushing yards reduces by 0.3 yards, ceteris paribus. The 95% HDI for β_3,52 had a lower bound of -0.804 and an upper bound of 0.23, meaning there was a 95% probability that the value of β_3,52 lay between these values. As the 95% HDI included zero, the posterior distribution suggested that Temperature was not significant in determining the number of rushing yards Derrick Henry will gain. However, zero was not near the centre of the 95% HDI, meaning there was a greater probability that the value of β_3,52 was less than zero, compared to being zero/greater than zero. Therefore, although Temperature was insignificant in determining Derrick Henry’s rushing yards, it was not completely insignificant (i.e., more likely to have some impact, than no impact at all).

β_4 – The Bayesian point estimate for β_4,52, the coefficient for Humidity (%) for Derrick Henry, was 29.9 (taken from the mode), indicating that for each additional 10% of humidity (up to a maximum of 100%) Derrick Henry’s predicted rushing yards increases by 2.99 yards, ceteris paribus. The 95% HDI for β_4,52 had a lower bound of 28.2 and an upper bound of 32.0, meaning there was a 95% probability that the value of β_4,52 lay between these values. As the 95% HDI did not include zero, we could conclude that Humidity was significant in predicting the number of rushing yards Derrick Henry will achieve in a game, with higher humidity leading to higher predicted rushing yards (ceteris paribus). This also aligned with the trend observed in Figure 5, with rushing yards tending to increase as humidity increased.

β_5 – The Bayesian point estimate for β_5,52, the coefficient for Wind Speed (knots) for Derrick Henry, was 0.82 (taken from the mode), indicating that for each additional knot of wind speed, Derrick Henry’s predicted rushing yards increases by 0.82 yards, ceteris paribus. The 95% HDI for β_5,52 had a lower bound of -1.13 and an upper bound of 2.87, meaning there was a 95% probability that the value of β_5,52 lay between these values. As the 95% HDI included zero, we could conclude that wind speed was insignificant in predicting the number of rushing yards Derrick Henry would gain, which was surprising given a trend was observed in the plot of rushing yards and wind speed (Figure 5). That said, zero was closer to the lower bound than the centre of the 95% HDI, meaning there was a greater probability that the value of β_5,52 was positive, compared to zero or negative. Therefore, while the effect of wind speed appeared to be insignificant, we could conclude that it was not completely insignificant (i.e., wind speed had some impact, as opposed to no impact at all).

β_6 – The Bayesian point estimate for β_6,52, the coefficient for the Vegas Line for Derrick Henry, was -12.9 (taken from the mode), indicating that a 1 unit decrease in the Vegas Line (more negative) tends to increase the predicted rushing yards for Derrick Henry by an average of 12.9 yards, ceteris paribus. The 95% HDI for β_6,52 had a lower bound of -17.3 and an upper bound of -9.46, meaning there was a 95% probability that the value of β_6,52 lay between these values. As the 95% HDI did not include zero, we could conclude that the Vegas Line was significant in predicting the number of rushing yards Derrick Henry will achieve in a game. It was surprising to find that Derrick Henry’s predicted rushing yards increased as the Vegas Line grew larger (became more negative), as the trend observed in Figure 5 indicated that rushing yards tended to decrease as the Vegas Line grew larger (became more negative). It is possible that the coefficient of β_6,52 reflects specific characteristics for Derrick Henry, who is a high-performing running back who may be relied upon more in games with a wide margin, unlike other “average” players on other teams.

β_7 – The Bayesian point estimate for β_7,52, the coefficient for whether Derrick Henry’s team was favoured to win, was 27 (taken from the mode), indicating that when Derrick Henry’s team was favoured to win, their predicted rushing yards increased by 27 yards compared to when their opponent was favoured to win, ceteris paribus. This was not surprising in the context of the expert prior information (that running backs gain more yards in games where their team are favoured to win). This also aligned with the plot of rushing yards and being favoured to win (Figure 8) which showed a higher median number of rushing yards when a player’s team was favoured to win. The 95% HDI for β_7,52 had a lower bound of 26.8 and an upper bound of 27.2, meaning there was a 95% probability that the value of β_7,52 lay between these values. As the 95% HDI did not include zero, we could conclude that being favoured to win was significant in predicting the number of rushing yards Derrick Henry will achieve in a game.

Figure 24: Posterior distributions for Sigma, and log10(Nu) in the NFL Running Back model.

From the distribution of sigma (Figure 24), the Bayesian estimate of sigma (scale) was 32, and the 95% HDI had a lower bound of 30.8 and an upper bound of 33.2. These values suggested that the distribution of rushing yards had a large scale and rushing yard values in the running back model were highly spread, which was not surprising considering the distribution of observed rushing yards had a large range (Table 3).

From the distribution of nu (Figure 24), the Bayesian estimate of nu (the normality parameter) was 1.26, with the 95% HDI having a lower bound of 1.06 and an upper bound of 1.58. As the normality parameter was fairly large, this suggests that there are not many outliers in the model. Initially this was a surprise considering the number of outliers observed in the distribution of observed data (Figure 3) however, the large sigma value indicated the rushing yards in the model were highly spread, which could explain why the normality parameter suggested there were few outliers.

As a linear regression was implemented within the hierarchical model, the R-squared would normally be used to identify the proportion of variance in rushing yards that is explained by the model. However, in Bayesian analysis, if the beta values in each iteration of the MCMC are not perfectly consistent, the R-squared of the model can become unreasonable (i.e., >1.0). The NFL running back model did not produce perfectly consistent betas in each iteration of MCMC, and so the R-squared for the model was unreasonable (>1.0). As such, it was omitted from the discussion of model parameters.

Figure 25: Pairs plot of the beta parameters of numerical predictors for Derrick Henry, sigma and log10(Nu) in the NFL Running Back model.

The pairs plot (Figure 24) showed a moderate-strong, positive correlation between sigma (σ) and 〖log〗_10 (Nu). Among the beta coefficients of numerical predictors for Derrick Henry, the strongest correlation was a moderate-weak strength (0.36) positive correlation between β_52,5 and β_52,6. As the majority of parameters showed no strong correlations between one-another, and the diagnostic plots showed the beta coefficients had converged, correlations between parameters were not a concern.

Because a hierarchy was used to capture differences among players, the Bayesian point estimates of several beta parameters were also compared across players.

Figure 26: Comparison of posterior distributions for HomeAway for 2 players in the NFL Running Back model.

Figure 26 shows a comparison of the coefficient value for HomeAway (whether a game was played at home or away) for Derrick Henry, and Saquon Barkley, with the beta parameters β_1,52 and β_1,7 corresponding to Derrick Henry and Saquon Barkley, respectively. The comparison plot in the top-right of Figure 26 showed that the difference between the coefficient values was very small (<0.01) and because the 95% HDI in the comparison plot included zero, and zero was close to the centre of the 95% HDI, we could conclude that there was no significant difference in the value of the coefficient of HomeAway between Derrick Henry and Saquon Barkley. That is, playing at home has the same impact on the predicted number of rushing yards for Derrick Henry as it does for Saquon Barkley.

Figure 27: Comparison of posterior distributions for Field type for 2 players in the NFL Running Back model.

Figure 27 shows a comparison of the coefficient value for Field type (whether game was played on artificial turf or natural grass) for Derrick Henry, and Saquon Barkley, with the beta parameters β_2,52 and β_2,7 corresponding to Derrick Henry and Saquon Barkley, respectively. The comparison plot in the top-right of Figure 27 showed that the difference between the coefficient values was very large (-40.3), and because the 95% HDI in the comparison plot did not include zero, we could conclude that there was a significant difference in the impact that field type has on the predicted number of rushing yards that Derrick Henry will gain, compared to Saquon Barkley.

The JAGS model (Table 15) only produced beta coefficients for individual players, meaning comparisons could not be made to the impact of a predictor for NFL running backs overall. Attempts were made to adjust the JAGS model to produce overall beta coefficients however, all attempts resulted in the player-specific beta coefficients having the same values.

Comparing the impact of prior information

Table 22: Summary of beta coefficients for Derrick Henry in the NFL Running Back model with informative, and non-informative priors.

The coefficient modes in Table 22 show how the expert information influenced the beta coefficients for Derrick Henry to varying degrees, based on the degree of belief in the expert prior information. The very strong belief in the prior information that being favoured to win increases the number of yards a running back will gain by 27 yards is reflected in the value of the coefficient for FvrdToWin (β_8,52) changing from 9.99 to 27.02 when expert information with a very strong degree of belief was applied to the model. Meanwhile, the very weak degree of belief in the prior information that a large (more negative) Vegas Line reduces the number of rushing yards a running back will gain is reflected in the coefficient of Vegas Line shifting towards the mean of the prior information (1) by a very small amount (+0.69).

The strength of the degree of belief is also reflected in the 95% HDI of each posterior distribution (Figure 23). The very strong belief in the expert information for FvrdToWin is results in a very small difference between the lower and upper bounds of the 95% HDI, 26.8 and 27.2, respectively. On the other hand, the weak degree of belief in the prior information for Wind Speed resulted in lower and upper bound values of -1.13 and 2.87, respectively. As such, prior information with a weaker degree of belief results in a posterior distribution with a wider 95% HDI.

Figure 28: Posterior distributions of Temperature for Derrick Henry (player 52) using non-informative priors (left) and informative priors (right) including a comparison value of zero, from the NFL Running Back model.

Figure 28 compares the posterior distributions for the coefficient of Temperature for Derrick Henry, produced with non-informative priors (left) and informative priors (right). The comparison value in the posterior with expert prior information (Figure 28, right) was set to the mode of β_3,52 when no expert prior information had been applied, as a hypothesis test could then be used to determine if the expert information had resulted in a statistically significant change in the value of β_3,52. The hypothesis test was written in the following form,

Using the comparison value on the posterior distribution of β_3,52 with expert prior information (Figure 28, right), the probability that the value of β_3,52 was less than 0.64 was 100%. This meant that, at the α=0.05 level, there was sufficient evidence to reject the null hypothesis and conclude that the value of β_3,52 was less than 0.64 after applying expert prior information.

Predictions model using the Bayesian point estimates of the model parameters

The model beta coefficients for Derrick Henry were extracted, with the results presented in Table 23.

Table 23: Summary of results for player 88 in the NFL Running Back model.

Using the coefficients extracted in Table 23, the predictions model for rushing yards for Derrick Henry (player 52) was:

Predicting rushing yards gained for running backs using the Bayesian Hierarchical linear model

Using results from the 2022 NFL season, 5 new observations were created to test the predictive validity of the model. These observations are shown in Table 24, and the posterior distributions for the number of rushing yards predicted by the model for each new observation are shown in Figure 29.

Table 24: New observations extracted from the 2022 NFL season, to test the predictive validity of the model.

Figure 29: Posterior distributions for the predicted rushing yards of 5 players, using the NFL Running Back model.

Using the given information, the JAGS model predicted the following rushing yards for each running back.

AJ Dillon – The Bayesian estimate for the number of rushing yards achieved by AJ Dillon was 30.2 rushing yards, with the 95% HDI having a lower bound of 6.36 and an upper bound of 53.6 yards, meaning there was a 95% probability that the number of rushing yards AJ Dillon gained was between these values. This was a surprise as AJ Dillon achieved 61 yards, more than the upper bound of the 95% HDI for the prediction.

Derrick Henry – The Bayesian estimate for the number of rushing yards gained by Derrick Henry was 90.6 rushing yards, with the 95% HDI having a lower bound of 74.7 and an upper bound of 106 yards, meaning there was a 95% probability that the number of rushing yards Derrick Henry gained was between these values. Again, this was a surprise considering Derrick Henry gained 114 yards, which was outside of the 95% HDI.

Nick Chubb – The Bayesian estimate for the number of rushing yards gained by Nick Chubb was 82 rushing yards, with the 95% HDI having a lower bound of 67.6 and an upper bound of 95.1 yards, meaning there was a 95% probability that the number of rushing yards Nick Chubb gained was between these values. Nick Chubb gained 118 yards in the 2022 observation, which was greater than the upper bound of the 95% HDI.

Saquon Barkley – The Bayesian estimate for the number of rushing yards gained by Saquon Barkley was 77.5 rushing yards, with the 95% HDI having a lower bound of 57.5 and an upper bound of 103 yards, meaning there was a 95% probability that the number of rushing yards Saquon Barkley gained was between these values. In the 2022 game, Saquon gained 146 yards, which was well above the upper bound of the 95% HDI.

Leonard Fournette – The Bayesian estimate for the number of rushing yards gained by Leonard Fournette was 53.6 rushing yards, with the 95% HDI having a lower bound of 27.5 and an upper bound of 79.4 yards, meaning there was a 95% probability that the number of rushing yards Leonard Fournette gained was between these values. In reality, Leonard Fournette gained 56 rushing yards which, relative to the other predictions, was very close to the Bayesian point estimate and fell inside of the 95% HDI for this prediction.

Although the rushing yard predictions were not unreasonably large or small, they were inaccurate, with the actual rushing yards falling outside of the 95% HDI for all predictions except that of Leonard Fournette. In short, there was plenty of room for improvement in the predictions made by the model.

Figure 30: Density plots of observed data and posterior distribution for rushing yards by player 52 (Derrick Henry).

The model’s inaccuracy in predicting rushing yards was highlighted by the density plot of observed and predicted values for player 52 (Derrick Henry) (Figure 30). The plot showed that the model was a poor predictor for most rushing yard observations of this player, over-predicting rushing yards between ~20 and ~45 yards, and underpredicting rushing yards greater than 50 yards. This could be because Derrick Henry is a high-performing running back who regularly gains a large amount of yards, as shown by the small “peak” in observed values around 80 rushing yards.

As a contrast, Figure 31 shows the density plot of observed and predicted values for player 56 (Leonard Fournette). As Leonard Fournette averages a small number of rushing yards (relative to the average running back), the model tends to overpredict rushing yards for this player.

Figure 31: Density plots of observed data and posterior distribution for rushing yards by player 56 (Leonard Fournette).

This suggested the model would only be a reasonable predictor for rushing yards for a running back who’s average rushing yards are close to the group average. This was supported by the density plot in Figure 32, which showed the predictions for player 39 were closer to the observed values, partly because this player’s average yards per game (40) was closer to the average rushing yards of all running observations in the dataset (36.5).

Figure 32: Density plots of observed data and posterior distribution for rushing yards by player 39 (Devonta Freeman).

As many factors can affect the number of rushing yards a player will gain (e.g., injury status, how rushing attempts are distributed among running backs on the same team), it was not surprising that the NFL Running Back model was a poor predictor of rushing yards gained by running backs. The model could be improved by adding information to the dataset, such as the number of active running backs on a player’s team. Adding a hierarchy to capture the effect of which opponent a player was facing in a given game could also improve the model however, as there were only a limited number of observations for each running back (the most observations for a single player was 51), adding a hierarchy for opponent could risk quantifying the effect of an opponent based on a very small number of games. As such, creating a hierarchy for the effect of each opponent would necessitate expanding the date range of the dataset to increase the sample size for each player.

Summary & Conclusion

The objective of this report and the project was to summarise the process undertaken to conduct Bayesian analysis of the NFL Running Back dataset and produce a model to predict rushing yards for NFL running backs.

A histogram of rushing yards (the dependent variable) showed the distribution of rushing yards was positively skewed, with a long right tail and many outliers. Based on the histogram, and the fact that rushing yards was a continuous variable with a distribution that included positive and negative values, a student t distribution was chosen to model the dependent variable.

Plots of rushing yards and each dependent variable showed there was no clear relationship between rushing yards and temperature, field type, or whether the player was playing at home or away. However, there did appear to be a relationship between the number of rushing yards and humidity, wind speed, Vegas line, and whether the player’s team was favoured to win the game, with rushing yards appearing to increase (on average) with higher humidity, a smaller (less negative) Vegas Line, and being favoured to win, and appearing to decrease (on average) with higher wind speeds.

As each running back has unique characteristics, a hierarchical model was used to capture player-specific differences and produce a Bayesian Hierarchical Linear Regression model. Initial values were not used, and the data was not standardised, as these created challenges for specifying values and back transforming, which out-weighed the potential benefits of faster convergence and reduced run times. Furthermore, the chains converged successfully, and the model run time was not excessive, indicating that specifying initial values and standardizing the data were not required to produce a model.

Using non-informative priors, different settings for burn-in, thinning, and saved steps, were trialled with diagnostic plots produced and reviewed at each stage to determine whether the chains produced were representative and accurate. The model diagnostics were initially problematic, however, after increasing burn-in, thinning, and saved steps, it was identified that using 2 chains, 1,500 burn-in steps, 21 thinning steps, and 5,000 saved steps produced chains that were representative, accurate, and reasonably efficient. As such, the chains were deemed acceptable.

Expert information obtained from fantasy football analysts and personal observations was then applied to the model using informative priors and the model settings identified as producing representative and accurate chains. The successful implementation of the expert prior information was confirmed by comparing the posterior distributions from the non-informative model and the informative model and confirming that the modes of parameters with informative priors had moved closer to the value specified in the expert information, and to varying degrees based on the degree of belief in the expert information.

Using the posterior distributions for player 52 (Derrick Henry) extracted from the informative model, it was identified that the coefficients for Temperature and Wind Speed were insignificant for Derrick Henry, as their 95% HDIs included zero. However, zero was not at the centre of these 95% HDIs, so it was possible that the effect of Temperature and Wind Speed on Derrick Henry’s rushing yards was not completely insignificant. Meanwhile, the coefficients for HomeAway, Field, Humidity, Vegas Line, and Favoured To Win did not include zero, and so were deemed significant in determining the number of rushing yards Derrick Henry would gain.

Finally, after specifying the predictions model for Derrick Henry, the number of rushing yards gained by 5 players were predicted using new observations taken from the 2022 NFL season. While the predicted values were not unreasonably large or small, they were highly inaccurate, with the actual number of yards gained falling outside of the 95% HDI for 4 of the 5 predictions. Plots of observed and predicted values also indicated that the NFL running back model tended to under-predict rushing yards for above-average players such as Derrick Henry, while over-predicting rushing yards for below-average players, such as Leonard Fournette. As such, there was a lot of room for improvement in the NFL Running Back model.

To improve the model, additional information could be added to the dataset to capture more factors that affect rushing yards. Adding a hierarchy to capture the effect of opponent was also recommended, although this would require expanding the date range of the data to ensure sample sizes were sufficiently large.

References

Demirhan, H. (2022, 10 01). Exponential Distribution. Retrieved from RMIT Canvas: file:///C:/Users/james/OneDrive/Desktop/Uni%20Studies/RMIT/Applied%20Bayesian%20Statistics%20-%20MATH2269/Interactive%20Apps%20-%20Shiny/Continuous%20Distributions%20(Shiny)/exponential.html

Demirhan, H. (2022, 08 20). Gamma Distribution Specified by Mean and Standard Deviation. Retrieved from RMIT Canvas: https://rmit.instructure.com/courses/91350/files/26025139?fd_cookie_set=1

Demirhan, H. (2022, 08 20). Normal Distribution. Retrieved from RMIT Canvas: https://rmit.instructure.com/courses/91350/files/26025114?fd_cookie_set=1

DiSorbo, M. (2021, 07 15). The Fantasy Football Mythbusters: Flip the (Game) Script. Retrieved from The Fantasy Footballers: https://www.thefantasyfootballers.com/articles/the-fantasy-football-mythbusters-flip-the-game-script/

Fernandez, D. (2022, 08 01). NFL Offensive Stats 2019 - 2022. Retrieved from Kaggle.com: https://www.kaggle.com/datasets/dtrade84/nfl-offensive-stats-2019-2022

Holloway, A., Moore, J., & Wright, M. (2022, 10 15). The Fantasy Footballers podcast. Retrieved from The Fantasy Footballers: https://www.thefantasyfootballers.com/

Holloway, A., Moore, J., & Wright, M. (2022, 09 23). Week 3 Matchups + Wheel of Shame, The Yeti Rises? Retrieved from The Fantasy Footballers: https://www.thefantasyfootballers.com/episodes/week-3-matchups-wheel-of-shame/

Kruschke, J. (2015, 12 02). Prior on df (normality) parameter in t distribution. Retrieved from Doing Bayesian Data Analysis: http://doingbayesiandataanalysis.blogspot.com/2015/12/prior-on-df-normality-parameter-in-t.html

Syversveen, A. R. (1998). Noninformative bayesian priors. interpretation and problems with construction and applications. . Preprint statistics, 3(3), 1-11.

Code

Setup

# Clear the environment and the console
rm(list = ls()); cat("\f")
#----------------------------#
# Install required libraries #
#----------------------------#
packages <- c("TSA", "tseries", "lmtest", "FSAdata", "forecast", "urca", "rugarch", "fGarch",
              "tswge", "imputeTS", "ggplot2", "readxl", "readr", "dplyr", "tidyr", "psych",
              "epiDisplay",
              "stringr", "lubridate", "knitr", "shiny", "coda", "rjags", "runjags", "ggpubr")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) { install.packages(packages[!installed_packages]) }
# Load libraries #
invisible(lapply(packages, library, character.only = TRUE))
#-------------------------#
# Clear the environment and the console
rm(list = ls()); cat("\f")

Custom Functions

# Function: Summary MCMC ####
smryMCMC_HD = function(  codaSamples , compVal = NULL,  saveName=NULL) {
  summaryInfo = NULL
  mcmcMat = as.matrix(codaSamples, chains=TRUE)
  paramName = colnames(mcmcMat)
  for ( pName in paramName ) {
    if (pName %in% colnames(compVal)){
      if (!is.na(compVal[pName])) {
        summaryInfo = rbind( summaryInfo, summarizePost( paramSampleVec = mcmcMat[,pName], compVal = as.numeric(compVal[pName]) ))
      }
      else {
        summaryInfo = rbind( summaryInfo, summarizePost( paramSampleVec = mcmcMat[,pName] ) )
      }
    } else {
      summaryInfo = rbind( summaryInfo, summarizePost( paramSampleVec = mcmcMat[,pName] ) )
    }
  }
  rownames(summaryInfo) = paramName
  if ( !is.null(saveName) ) {
    write.csv( summaryInfo , file=paste(saveName,"SummaryInfo.csv",sep="") )
  }
  return( summaryInfo )
}
###################-------------------------------------------------------------
# Function: Plot MCMC #
plotMCMC = function( codaSamples , data , xName="x" , yName="y" , sName="s", numericals="v",
                     compVal=0.5 , pKey=player_names, xPred=xPred, pIDx=52, betaComp=1 , 
                     diffSList=NULL ,showCurve=FALSE ,  pairsPlot=FALSE ,
                     saveName=NULL , saveType="jpg" ) {
  # showCurve is TRUE or FALSE and indicates whether the posterior should
  #   be displayed as a histogram (by default) or by an approximate curve.
  # pairsPlot is TRUE or FALSE and indicates whether scatterplots of pairs
  #   of parameters should be displayed.
  #-----------------------------------------------------------------------------
  graphics.off() #Sys.sleep(1) #pause execution for specified number of seconds
  y <- data[,yName]    
  x <- as.matrix(data[, !names(data) %in% c(yName, sName)])
  # data specific to selected player
  pid_data <- data[data[,2] == pIDx, ]         # z <- nfl_games[nfl_games[,2] == pIDx,"rush_yds"]
  y_pid <- pid_data[, yName]
  x_pid <- as.matrix(pid_data[, !names(pid_data) %in% c(yName, sName)])
  num_x_pid <- as.matrix(x_pid[, numericals])
  # MCMC data
  mcmcMat <- as.matrix(codaSamples, chains=TRUE)
  chainLength <- NROW( mcmcMat )
  beta0 <- mcmcMat[,paste0("beta0[",pIDx,"]")]
  beta  <- as.matrix(mcmcMat[,paste0("beta[",pIDx,",1]")]) %>% `colnames<-`(c(paste0("beta[",pIDx,",1]")))
  for ( i in 2:ncol(x) ) {
    beta <- cbind(beta, mcmcMat[, paste0("beta[", pIDx, ",", i, "]")])
    colnames(beta)[ncol(beta)] <- paste0("beta[", pIDx, ",", i, "]")
  }
  num_beta <- beta[,c(paste0("beta[",pIDx,",3]"), paste0("beta[",pIDx,",4]"),
                      paste0("beta[",pIDx,",5]"), paste0("beta[",pIDx,",6]"))]
  if ( ncol(x)==1 ) { beta = matrix( beta , ncol=1 ) }
  sigma   <- mcmcMat[,"sigma"]
  nu      <- mcmcMat[,"nu"]
  log10nu <- log10(nu)
  # --- predictions --- #
  pred1 <- mcmcMat[,"pred[1]"]
  pred2 <- mcmcMat[,"pred[2]"]
  pred3 <- mcmcMat[,"pred[3]"]
  pred4 <- mcmcMat[,"pred[4]"]
  pred5 <- mcmcMat[,"pred[5]"]
  # --- names --- #
  pIDx_name  <- player_names[player_names$player_id == pIDx, 2]
  pred1_Name <- player_names[player_names$player_id == xPred[1,1], 2]
  pred2_Name <- player_names[player_names$player_id == xPred[2,1], 2]
  pred3_Name <- player_names[player_names$player_id == xPred[3,1], 2]
  pred4_Name <- player_names[player_names$player_id == xPred[4,1], 2]
  pred5_Name <- player_names[player_names$player_id == xPred[5,1], 2]
  #_______________________________________________________________
  # Compute R^2 for credible parameters:
  YcorX <- cor( y , x ) # correlation of y with each x predictor
  Rsq <- beta %*% matrix( YcorX , ncol=1 )
  Rsq <- Rsq[,1]
  #_______________________________________________________________
  if ( pairsPlot ) {
    # Plot the parameters pairwise, to see correlations:
    openGraph()
    nPtToPlot = 1000
    plotIdx = floor(seq(1, chainLength, by=chainLength/nPtToPlot))
    panel.cor = function(num_x_pid, y_pid, digits=2, prefix="", cex.cor, ...) {
      usr = par("usr"); on.exit(par(usr))
      par(usr = c(0, 1, 0, 1))
      r = (cor(num_x_pid , y_pid))    #r = (cor(x, y))
      txt = format(c(r, 0.123456789), digits=digits)[1]
      txt = paste(prefix, txt, sep="")
      if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
      text(0.5, 0.5, txt, cex=1.25 ) # was cex=cex.cor*r
    }
    pairs( cbind( beta0 , num_beta , sigma , log10nu )[plotIdx,] ,
           labels=c( "beta[0]" 
                     ,paste0(colnames(num_beta), "\n", numericals)
                     ,expression(sigma)  
                     ,expression(log10(nu)) ),
           lower.panel=panel.cor,
           col="skyblue" )
    if ( !is.null(saveName) ) {
      saveGraph( file=paste(saveName,"PostPairs",sep=""), type=saveType)
    }
  }
  #-----------------------------------------------------------------------------
  # Marginal histograms:
  
  decideOpenGraph = function( panelCount, saveName, finished=FALSE, nRow=2, nCol=4 ) {
    # If finishing a set:
    if ( finished==TRUE ) {
      if ( !is.null(saveName) ) {
        saveGraph( file=paste0(saveName,ceiling((panelCount-1)/(nRow*nCol))), type=saveType)
      }
      panelCount = 1 # re-set panelCount
      return(panelCount)
    } else {
      # If this is first panel of a graph:
      if ( ( panelCount %% (nRow*nCol) ) == 1 ) {
        # If previous graph was open, save previous one:
        if ( panelCount>1 & !is.null(saveName) ) {
          saveGraph( file=paste0(saveName,(panelCount%/%(nRow*nCol))), type=saveType)
        }
        # Open new graph
        openGraph(width=nCol*7.0/3,height=nRow*2.0)
        layout( matrix( 1:(nRow*nCol) , nrow=nRow, byrow=TRUE ) )
        par( mar=c(4,4,2.5,0.5) , mgp=c(2.5,0.7,0) )
      }
      # Increment and return panel count:
      panelCount = panelCount+1
      return(panelCount)
    }
  }
  #______________________________________________________
  # Original scale:
  panelCount <- 1
  panelCount <- decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMarg") )
  # beta parameters
  histInfo <- plotPost( beta0, cex.lab=1.75, compVal=compVal, showCurve=showCurve, 
                        xlab=bquote(beta[0]), 
                        main=paste0("Intercept \n Player ", pIDx, " - ", pIDx_name))
  for ( bIDx in 1:ncol(beta) ) {
    panelCount <- decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMarg") )
    histInfo <- plotPost( beta[,bIDx], cex.lab=1.75, compVal=compVal, showCurve=showCurve, 
                          xlab=bquote(beta[.(paste0(bIDx, ",", pIDx))]),  
                          main=paste0(colnames(x)[bIDx],"\nPlayer ",pIDx," - ",pIDx_name))
  }
  # Finish the new panel for displaying prediction posteriors
  panelCount <- decideOpenGraph( panelCount, finished=TRUE, saveName=paste0(saveName, "PostMarg") )
  #______________________________________________________
  # Begin a new panel for displaying Scale and R-squared
  panelCount <- decideOpenGraph( panelCount, nRow=1, nCol=3, saveName=paste0(saveName,"PostMargSigmaNu") )
  #______________________________________________________
  # Sigma
  histInfo <- plotPost( sigma, cex.lab=1.75, showCurve=showCurve, xlab=bquote(sigma), main=paste("Scale"))
  panelCount <- decideOpenGraph( panelCount, saveName=paste0(saveName,"PostMarg") )
  
  # log10(nu)
  histInfo <- plotPost( log10nu, cex.lab=1.75 , showCurve=showCurve, xlab=bquote(log10(nu)), main=paste("Normality"))
  #panelCount <- decideOpenGraph( panelCount, saveName=paste0(saveName,"PostMarg") )
  
  # R-squared  
  #histInfo <- plotPost( Rsq, cex.lab=1.75, showCurve=showCurve, xlab=bquote(R^2), main=paste("Prop Var Accntd"))
  # Finish the new panel for displaying prediction posteriors
  panelCount <- decideOpenGraph( panelCount, finished=TRUE, saveName=paste0(saveName, "PostMargSigmaNu") )
  
  #______________________________________________________
  # Begin a new panel for displaying posterior distributions of predictions
  panelCount <- decideOpenGraph( panelCount ,  nRow=1 , nCol=5 , saveName=paste0(saveName,"PostMargPred") )
  #______________________________________________________
  # prediction 1
  histInfo <- plotPost( pred1, cex.lab=1.75, showCurve=showCurve, xlab="prediction 1", 
                        main=paste0("Predicted yards gained for", "\n", pred1_Name ))
  panelCount <- decideOpenGraph( panelCount, saveName=paste0(saveName, "PostMargPred") )
  
  # prediction 2
  histInfo <- plotPost( pred2, cex.lab=1.75, showCurve=showCurve, xlab="prediction 2", 
                        main=paste0("Predicted yards gained for", "\n", pred2_Name ))
  panelCount <- decideOpenGraph( panelCount, saveName=paste0(saveName, "PostMargPred") )
  
  # prediction 3
  histInfo <- plotPost( pred3, cex.lab=1.75, showCurve=showCurve, xlab="prediction 3",
                        main=paste0("Predicted yards gained for", "\n", pred3_Name ))
  panelCount <- decideOpenGraph( panelCount, saveName=paste0(saveName, "PostMargPred") )
  
  # prediction 4
  histInfo <- plotPost( pred4, cex.lab=1.75, showCurve=showCurve, xlab="prediction 4",
                        main=paste0("Predicted yards gained for", "\n", pred4_Name ))
  panelCount <- decideOpenGraph( panelCount, saveName=paste0(saveName, "PostMargPred") )
  
  # prediction 5
  histInfo <- plotPost( pred5, cex.lab=1.75, showCurve=showCurve, xlab="prediction 5",
                        main=paste0("Predicted yards gained for", "\n", pred5_Name ))
  # End of panel
  panelCount <- decideOpenGraph( panelCount, finished=TRUE, saveName=paste0(saveName, "PostMargPred") )
  #-----------------------------------------------------------------------------
  # Comparison plots
  if ( !is.null(diffSList) ) {
    nToPlot <- length(mcmcMat[,1]) #700
    ptIdx <- round( seq(1, chainLength, length=nToPlot) )
    for ( compIdx in 1:length(diffSList) ) {
      diffSVec <- diffSList[[compIdx]] #length( diffSList[[1]] )
      Nidx <- length(diffSVec)
      openGraph(width=3.0*Nidx, height=2.0*Nidx) #openGraph(width=2.5*Nidx, height=2.0*Nidx)
      par( mfrow=c(Nidx, Nidx) )
      for ( t1Idx in 1:Nidx ) {
        for ( t2Idx in 1:Nidx ) {
          player1_name <- player_names[player_names$player_id == diffSVec[t1Idx], 2]
          player2_name <- player_names[player_names$player_id == diffSVec[t2Idx], 2]
          parName1 <- paste0("beta[",diffSVec[t1Idx], ",", betaComp, "]")
          parName2 <- paste0("beta[",diffSVec[t2Idx], ",", betaComp, "]")
          #posterior distribution of player with lower player_id
          if ( t1Idx == t2Idx ) {
            par( mar=c(3,1.5,3,1.5) , mgp=c(2.0,0.7,0) , pty="m" )
            plotPost( mcmcMat[ptIdx, parName1], cex.lab=1.75, 
                      xlab=bquote(beta[.(paste0(betaComp, ",", diffSVec[t1Idx]))]),  
                      main=paste0(colnames(x)[betaComp],
                                  "\nPlayer ",diffSVec[t1Idx]," (",player1_name,")"))
          #Comparison plots of 1st posterior subtracted from 2nd
          } else if ( t1Idx < t2Idx ) {
              par( mar=c(3,1.5,3,1.5) , mgp=c(2.0,0.7,0) , pty="m" )
              plotPost( mcmcMat[, parName1] - mcmcMat[, parName2]
                        ,cex.main=1.25 , cex.lab=1.25
                        ,xlab = bquote("Difference of "*beta[.(betaComp)]*"'s")
                        ,main = paste0(player1_name , "\n - ", player2_name) )
          #scatter plot comparing beta values in each chain
          } else if ( t1Idx > t2Idx) {  
              par( mar=c(3,3,3,1) , mgp=c(2.0,0.7,0) , pty="s" )
              plot ( mcmcMat[ptIdx, parName2] , mcmcMat[ptIdx, parName1] , cex.lab=1.25
                     ,xlab = player2_name ,ylab = player1_name ,col = "skyblue" )
            abline(0,1,lty="dotted")
          }
        }
      }
      if ( !is.null(saveName) ) {
        saveGraph( file=paste(saveName,"PlayerBetaComparison",sep=""), type=saveType)
      }
    }
  }
}

Project code

#################################################################################
# close all of R's graphics windows.
graphics.off()
# Load Bayesian Data Analysis Utilities file
source('DBDA2E-utilities.R')
# DBDA2E-utilities.R comes from 'Doing Bayesian Data Analysis' by John K. Kruschke
#################################################################################
# set the values to be predicted
wd <- getwd() #save working directory so it can be reset once predictions have been generated
setwd("C:/Users/james/OneDrive/Desktop/Uni Studies/RMIT/Applied Bayesian Statistics - MATH2269/Assignment 3 - Final Project/Data")
nfl_games <- read.csv("nfl_offensive_stats_predict.csv", header=TRUE) %>%
  filter(position == "RB") %>% 
  mutate(HomeAway = ifelse(team == home_team, "home", "away"),
         Roof = ifelse( tolower(Roof) == "dome"                      , "Dome", 
                ifelse( tolower(Roof) == "retractable roof (closed)" , "Indoors", 
                ifelse( tolower(Roof) == "retractable roof (open)"   , "Indoors",
                ifelse( tolower(Roof) == "outdoors"                  , "Outdoors",
                "Other")))),
         Field = ifelse(startsWith(tolower(Surface), "grass"), "Natural Grass", "Artificial"),
         TempC = round( (Temperature-32)*(5/9), 1),
         HumidityPct = Humidity/100,
         Vegas_Favorite = ifelse( Vegas_Favorite == "ERROR - abbrev_team", home_team, 
                          ifelse( Vegas_Favorite == "", "EVEN", Vegas_Favorite )),
         FvrdToWin = ifelse(Vegas_Favorite == team, "Yes", "No")) %>%
  rename(Opponent = Opponent_abbrev)
#
player_names <- nfl_games[, c("player_id", "player")]
player_names$player_id <- as.numeric(as.factor(player_names$player_id))
player_names <-  unique(player_names)
#
cols <- c("rush_yds", "player_id", #"game_id",# "team", 
          "Opponent", "HomeAway", "Roof", "Field", 
          "TempC", "HumidityPct", "Wind_Speed", 
          "Vegas_Line", "FvrdToWin", "Over_Under")
####
nfl_games %>% filter(rush_yds > 200) %>% select(c(player, rush_yds)) %>% arrange(desc(rush_yds))
nfl_games <- nfl_games[,cols] %>% filter(rush_yds < 220)
####
player_count <- nfl_games[,c("player_id", "player")] %>% 
 group_by(player_id) %>% summarise(count = n())
nfl_games <- nfl_games %>% left_join(player_count, by=c("player_id")) %>% filter(count > 14)
####
remove(cols)
#
setwd(wd)
# convert categorical data to numerical values
nfl_games$player_id <- as.numeric(as.factor(nfl_games$player_id))   #1=Ameer Abdullah, 88=Derrick Henry...
nfl_games$Opponent <- as.numeric(as.factor(nfl_games$Opponent))     #1=Arizona, 2=Atlanta, ...
nfl_games$HomeAway <- as.numeric(as.factor(nfl_games$HomeAway))-1   #0=Away, 1=Home
nfl_games$Roof <- as.numeric(as.factor(nfl_games$Roof))             #1=Indoors, 2=Outdoors, 3=Retractable Roof
nfl_games$Field <- as.numeric(as.factor(nfl_games$Field)) -1        #0=Artificial, 1=Natural Grass
nfl_games$FvrdToWin <- as.numeric(as.factor(nfl_games$FvrdToWin))-1 #0=No, 1=Yes
###################-------------------------------------------------------------
predictions <- tail(nfl_games, 5)
###################################################################################
xPred <- as.matrix(predictions[,c(2, 4, 6:(ncol(predictions)-1))]) 
Actual_Yards <- as.data.frame( predictions[,1] )
#____________________________________________
# set the parameters for the model to monitor
parameters <- c("beta0", "beta", "sigma", "nu", "pred")
numberChains <- 2
adapt_steps <- 1500   # Number of steps to "tune" the samplers
burn_In_Steps <- 1500 # number of burn-in steps
thinningSteps <- 21
numberSavedSteps <- 5000
enrvironmentName <- paste("GamesH_Informative_500Vars", numberChains, adapt_steps, burn_In_Steps, thinningSteps, numberSavedSteps, sep="-")
###################-------------------------------------------------------------
# Function to create model and run JAGS ####
run_model_JAGS = function(  data, 
                            yName = "y",          #default name for dependent variable is "y"
                            sName,                #Name of field containing subject identifier
                            modelparameters,      #parameters to monitor
                            initsList,            #initial values
                            nChains = 2,          #default number of MCMC chains
                            adaptSteps = 1500,    #default number of steps to "tune" the samplers
                            burnInSteps = 500,    #default number of burn-in steps
                            numSavedSteps = 1000, #default number of saved steps
                            thinSteps = 1,        #default number of thinning steps
                            runjagsMethod = "parallel", #default run method is parallel
                            xPred = xPred,
                            saveName=NULL) {        #name for saved output
  require(runjags)
  # The data being modeled
  y <- data[,yName]
  p <- data[,sName]   # hierarchy data
  x <- as.matrix(data[, !names(data) %in% c(yName, sName)])
  Nsubj <- length(unique(p))
  #extract non-categorical columns
  x_num <- nfl_games[, c("TempC", "HumidityPct", "Wind_Speed", "Vegas_Line")]
  # Check that data makes sense:
  if ( any( !is.finite(y) ) ) { stop("All y values must be finite.") }
  if ( any( !is.finite(x) ) ) { stop("All x values must be finite.") }
  cat("\nCORRELATION MATRIX OF NUMERICAL PREDICTORS:\n ")
  show( round(cor(x_num),3) )
  cat("\n")
  flush.console()
  
  #Specify the data in a list to be sent to JAGS
  dataList = list(
    x = x
    ,y = y
    ,p = p              #subjects (player ids)
    ,Nsubj = Nsubj
    ,Nx = dim(x)[2]     # v<-as.matrix(nfl_games[,c(-1,-2)]) ; dim(v)[2]
    ,Ntotal = dim(x)[1] 
    ,xPred = xPred  # Data for predictions
  )
  # ****************************************************************************
  # THE MODEL.
  modelString = "
  #==================================== 
  # Prepare the data:
  #====================================
  data {
  #__________________________________ > create variables using y
    ym  <- mean(y)
  #====================================
  # Priors on original scale:
  #====================================
  # #Non-informative
    # mu0 <- ym                          #set to mean of y
    # for (a in 1:Nx) { mu[a] <- 1 }     #set all prior information to 1
  #____________________________________
  #Informative
    mu0 <- ym                        #set to mean of y
    mu[1] <- 10                      #Home/Away
    mu[2] <-    1                    #Field
    mu[3] <- -0.5                    #TempC
    mu[4] <- 30                      #HumidityPct (+10% increase in humidity = an extra 3 yards)
    mu[5] <- 1.5                     #Wind_Speed
    mu[6] <- 1                       #Vegas_Line
    mu[7] <- 27                      #FvrdToWin

  #====================================
  # Prior variances to reflect the confidence in the expert information
  #====================================
  # # #Non-informative
    # Var0 <- 500                        #Set to 500 to make non-informative
    # for (q in 1:Nx) { Var[q] <- 500 }  #Set all variances to 500 (large, flat, non-informative)
  #_________________________________________________________________
  # Informative = Very Weak | Weak | Moderate | Strong | Very Strong
  # Variance    =     100   |  10  |   1.0    |  0.1   |    0.01
  #_________________________________________________________________
    Var0 <- 500                      #Set to large value to make it non-informative
    Var[1] <- 1.0                    #Home/Away
    Var[2] <-     500                #Field
    Var[3] <- 0.1                    #TempC
    Var[4] <- 1.0                    #HumidityPCt
    Var[5] <- 10                     #Wind_Speed
    Var[6] <- 100                    #Vegas_Line
    Var[7] <- 0.01                   #FvrdToWin
    
  } # ________________________ END OF DATA BLOCK ________________________________
    # ===========================================================================
  #====================================
  # Specify the model: 
  #====================================
  model {
    for ( i in 1:Ntotal ) {
      y[i] ~ dt( dmu[i], 1/sigma^2, nu )              #Likelihood: student t (overall)
      
      dmu[i] <- beta0[p[i]] + beta[p[i],1]*x[i,1] + beta[p[i],2]*x[i,2] + beta[p[i],3]*x[i,3] 
                            + beta[p[i],4]*x[i,4] + beta[p[i],5]*x[i,5] + beta[p[i],6]*x[i,6]
                            + beta[p[i],7]*x[i,7]
    }
    for ( sIdx in 1:Nsubj ) { #Because p is a vector we need a *different index* in the for-loop.
       beta0[sIdx] ~ dnorm( mu0 , 1/Var0 )            #zbeta0 distributed as normal
       
       for ( j in 1:Nx ) {
         beta[sIdx,j] ~ dnorm( mu[j], 1/Var[j] )      #zbeta[x] distributed as normal
       }
    }
    
    sigma ~ dgamma(0.01, 0.01)                        #zsigma distributed as gamma (A,B)
    nu ~ dexp(1/30.0) 

  #==================================== 
  # Compute predictions at every step of the MCMC
  #====================================
    for ( i in 1:5) {
      pred[i] <- beta0[xPred[i,1]]                    #Column 1 of xPred is 'player_id'.
      xPred[1,1]=50 >> beta0[50]
                 + beta[ xPred[i,1], 1]  * xPred[i,2] #xPred[1,2] col2 of xPred is beta1
                 + beta[ xPred[i,1], 2]  * xPred[i,3] #xPred[1,3] col3 of xPred is beta2
                 + beta[ xPred[i,1], 3]  * xPred[i,4] #xPred[1,4] col4 of xPred is beta3
                 + beta[ xPred[i,1], 4]  * xPred[i,5]
                 + beta[ xPred[i,1], 5]  * xPred[i,6] 
                 + beta[ xPred[i,1], 6]  * xPred[i,7]
                 + beta[ xPred[i,1], 7]  * xPred[i,8] 
    }
  }
  " # close quote for modelString
  # Write modelString to a text file
  writeLines( modelString , con="TEMPmodel.txt" )
  # ............................................
  # Prepare the chains
  nIter <- ceiling( (numSavedSteps * thinSteps) / nChains)
  # ............................................
  # time the procedure started
  startTime <- proc.time() 
  # ********************************************
  # run the JAGS model
  runJagsOut <- run.jags( method = runjagsMethod ,
                          model = "TEMPmodel.txt" ,
                          monitor = modelparameters,
                          data = dataList ,
                          inits = initsList ,
                          n.chains = nChains ,
                          adapt = adaptSteps ,
                          burnin = burnInSteps ,
                          sample = numSavedSteps ,
                          thin = thinSteps , 
                          summarise = FALSE , 
                          plots = FALSE)
  # Assign JAGS output to variable in the global environment
  codaSamples <<- as.mcmc.list( runJagsOut )
  # save the model environment
  if ( !is.null(saveName) ) {
    save( codaSamples, file=paste(saveName, "Mcmc.Rdata", sep="") )
  }
  # ********************************************
  stopTime <- proc.time() # time the procedure finished
  # Assign the 'run time' to a variable in the global environment
  model_runtime <<- stopTime - startTime
  # What the function will print in the console
  return(list(model_runtime))
}
###################-------------------------------------------------------------
">> Run JAGS Model"         #####
run_model_JAGS(data = nfl_games,
               yName = "rush_yds",               #name of dependent variable (must have)
               sName = "player_id",              #subjects in first hierarchy
               modelparameters = parameters,     #parameters to monitor
               initsList = NULL,                 #initial values (none)
               nChains = numberChains,           #Number of MCMC chains
               adaptSteps = adapt_steps,         #Number of steps to "tune" the samplers
               burnInSteps = burn_In_Steps,      #Number of burn-in steps
               numSavedSteps = numberSavedSteps, #Number of saved steps
               thinSteps = thinningSteps,        #default thinning steps of 1
               runjagsMethod = "parallel",       #e.g., parallel
               xPred = xPred,
               saveName = enrvironmentName)
###################-------------------------------------------------------------
"Produce and save model diagnostics" #####
fileNameRoot <- "Ass3_games"
# Display diagnostics of chain, for specified parameters:
parameterNames <- varnames(codaSamples) # get all parameter names
specifiedParameters <- c("beta0[52]", "beta[52,1]", "beta[52,2]", "beta[52,3]", "beta[52,4]", 
                         "beta[52,5]","beta[52,6]", "beta[52,7]",
                         "sigma", "nu",
                         "pred[1]", "pred[2]", "pred[3]", "pred[4]", "pred[5]")
for ( parName in specifiedParameters ) { 
  diagMCMC( codaObject=codaSamples, parName=parName, saveName=fileNameRoot, saveType="jpeg" )
}
graphics.off()
###################-------------------------------------------------------------
"Display the posterior distribution of each parameter" #####
summaryInfo <- smryMCMC_HD( codaSamples = codaSamples ) #, compVal = compVal )
print(summaryInfo)
#
for ( i in 52 ) {
  PlayerParameters <- c(paste0("beta0[",i,"]"), paste0("beta[",i,",1]"),paste0("beta[",i,",2]"),
                        paste0("beta[",i,",3]"),paste0("beta[",i,",4]"),paste0("beta[",i,",5]"),
                        paste0("beta[",i,",6]"),paste0("beta[",i,",7]"))
}
summaryInfo[PlayerParameters, c(3,4)]
head(summaryInfo[order(summaryInfo[,4], decreasing=FALSE), c(1:5)], 30)
#
for ( parName in PlayerParameters ) { diagMCMC( codaObject=codaSamples, parName=parName, saveName=NULL, saveType=NULL) }
graphics.off()
#_____________________________________________________________
MyData <- nfl_games
#_____________________________________________________________
plotMCMC(codaSamples = codaSamples
         ,data = MyData
         ,xName = names(MyData[c(-1,-2)]) ,yName = names(MyData[1])
         ,sName = "player_id" ,compVal = NULL#0.64
         ,numericals = c("TempC", "HumidityPct", "Wind_Speed", "Vegas_Line")
         ,pIDx = 52 ,pKey = player_names 
         ,diffSList = list( c(52, 7) ) #, c(26, 18) )
         ,betaComp = 2 
         ,xPred = xPred
         ,pairsPlot = TRUE ,showCurve = FALSE
         ,saveName= "A3_" ,saveType="jpeg" )
Actual_Yards
graphics.off()
###################-------------------------------------------------------------
"Model coefficients"  #####
#mean(nfl_games$rush_yds) #avg_rush <- nfl_games %>% group_by(player_id) %>% summarise(mean = mean(rush_yds))
player_id <- 7
player_games <- nfl_games[nfl_games$player_id == player_id, ]
y <- player_games[, 1]
x <- as.matrix(player_games[, !names(player_games) %in% c("rush_yds", "player_id")])
for ( i in player_id ) {
  PlayerParameters <- c(paste0("beta0[",i,"]"), paste0("beta[",i,",1]"),paste0("beta[",i,",2]"),
                        paste0("beta[",i,",3]"),paste0("beta[",i,",4]"),paste0("beta[",i,",5]"),
                        paste0("beta[",i,",6]"),paste0("beta[",i,",7]"))
}
#overallParameters <- c("mu0", "mu1", "mu2", "mu3", "mu4", "mu5", "mu6", "mu7")
#
coefficients <- summaryInfo[PlayerParameters,3] # Get the model coefficients (mode of each parameter)
coefficients
modelSig <- summaryInfo["sigma",3] # Get sigma from the model
modelNu <- summaryInfo["nu",3] # Get nu from the model
# Since we imposed the regression model on the mean of the student likelihood,
# we use the model (X*beta) to generate the mean of student population for each 
# observed x vector.

#meanStudent <- as.matrix( cbind(rep(1,nrow(x)),  x)) %*% as.vector(coefficients)

# Generate random data from the posterior distribution. Take the reparameterisation back to alpha and beta.
randomData <- rt(n=300, df=modelNu, ncp=modelSig) #experiment with 1/sigma^2
# Display the density plot of observed data and posterior distribution:
predicted <- data.frame(player_games = randomData)
observed <- data.frame(player_games = y)
predicted$type <- "Predicted"
observed$type <- "Observed"
dataPred <- rbind(predicted, observed)

ggplot(dataPred, aes(player_games, fill = type)) + geom_density(alpha = 0.2) + 
  ggtitle(paste0("Observed and Predicted rushing yards for player ", player_id)) +
  xlab("Rushing Yards")

Predicting NFL running back rushing yards using Hierarchical Bayesian Linear Regression

MATH2269 – Applied Bayesian Statistics

James Angus

Report

Code