Fantasy Football Score Composition

The Data

The data was downloaded from the FantasyPros website and loaded into a GitHub repository for easy access. The FantasyPros data is inaccurate for players like Cordarrelle Patterson, who are listed as running backs or tight ends in the FantasyPros database. I will attempt to clean this up by removing all players with a score of 0.0 from the calculations.
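The cleaning step can be sketched as follows. This is a minimal illustration, not the original loading code: the toy data frame stands in for the FantasyPros export, the player names and numbers are made up, and it assumes the "score" being filtered is the FPTS column.

```r
# Toy stand-in for the FantasyPros export; names and numbers are made up.
wr <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D"),
  REC    = c(80, 0, 35, 0),
  YDS    = c(1020, 0, 410, 0),
  FPTS   = c(150.2, 0.0, 62.5, 0.0),
  stringsAsFactors = FALSE
)

# Keep only players with a nonzero fantasy score (assumed to be FPTS).
wrclean <- subset(wr, FPTS != 0)
print(wrclean)
```

Filtering on the score rather than on a position label avoids relying on the position field, which is the part of the data described as unreliable.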

This data is from last year’s NFL season. I will use it to fit a multiple regression model and determine which components best predict the total number of yards that a player will accrue.

##       Rank          Player               REC             TGT        
##  Min.   :  1.0   Length:223         Min.   :  1.0   Min.   :  1.00  
##  1st Qu.: 56.5   Class :character   1st Qu.:  5.0   1st Qu.: 10.50  
##  Median :112.0   Mode  :character   Median : 24.0   Median : 37.00  
##  Mean   :112.1                      Mean   : 30.3   Mean   : 47.96  
##  3rd Qu.:167.5                      3rd Qu.: 44.5   3rd Qu.: 69.50  
##  Max.   :224.0                      Max.   :145.0   Max.   :191.00  
##       YDS              YPR              LG             X20.         TD        
##  Min.   :   5.0   Min.   : 4.00   Min.   : 5.00   Min.   :0   Min.   : 0.000  
##  1st Qu.:  54.5   1st Qu.: 9.90   1st Qu.:21.50   1st Qu.:0   1st Qu.: 0.000  
##  Median : 250.0   Median :12.20   Median :39.00   Median :0   Median : 1.000  
##  Mean   : 378.9   Mean   :12.39   Mean   :38.15   Mean   :0   Mean   : 2.336  
##  3rd Qu.: 562.0   3rd Qu.:14.10   3rd Qu.:52.50   3rd Qu.:0   3rd Qu.: 4.000  
##  Max.   :1947.0   Max.   :38.00   Max.   :91.00   Max.   :0   Max.   :16.000  
##       ATT              RuYD             RuTD               FL        
##  Min.   : 0.000   Min.   :-13.00   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median : 0.000   Median :  0.00   Median :0.00000   Median :0.0000  
##  Mean   : 1.865   Mean   : 12.03   Mean   :0.09866   Mean   :0.2466  
##  3rd Qu.: 2.000   3rd Qu.: 11.50   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :59.000   Max.   :365.00   Max.   :8.00000   Max.   :3.0000  
##        G              FPTS            FPTS.G          ROST          
##  Min.   : 1.00   Min.   :  0.80   Min.   : 0.10   Length:223        
##  1st Qu.: 7.00   1st Qu.: 14.15   1st Qu.: 2.60   Class :character  
##  Median :12.00   Median : 60.30   Median : 5.40   Mode  :character  
##  Mean   :10.75   Mean   : 84.03   Mean   : 6.73                     
##  3rd Qu.:16.00   3rd Qu.:126.90   3rd Qu.: 9.85                     
##  Max.   :17.00   Max.   :439.50   Max.   :25.90

Multiple Linear Regression

## 
## Call:
## lm(formula = YDS ~ REC + TGT + YPR + LG + G + TD + RuYD + FL + 
##     RuTD, data = wrclean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -229.701  -30.869    1.518   29.173  254.200 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -100.4678    16.6332  -6.040 6.78e-09 ***
## REC            6.2877     1.0099   6.226 2.51e-09 ***
## TGT            2.8591     0.6793   4.209 3.79e-05 ***
## YPR            6.3830     1.2352   5.168 5.44e-07 ***
## LG             0.9832     0.3741   2.628  0.00922 ** 
## G             -1.3021     1.4085  -0.924  0.35629    
## TD            19.5365     2.7476   7.110 1.72e-11 ***
## RuYD           0.2022     0.2439   0.829  0.40800    
## FL            -5.5996     9.8307  -0.570  0.56955    
## RuTD          24.7996    12.3871   2.002  0.04655 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.24 on 213 degrees of freedom
## Multiple R-squared:  0.9696, Adjusted R-squared:  0.9683 
## F-statistic: 754.3 on 9 and 213 DF,  p-value: < 2.2e-16

In this model, I will eliminate any variable with a p-value greater than 0.05. The adjusted R-squared for the full model is 0.9683. FL (p = 0.570), RuYD (p = 0.408), and G (p = 0.356) each have a p-value above 0.05, so they will be dropped. FL has the greatest p-value of any variable, so it will be removed first. I will drop the variables one at a time and recalculate “mlrdrop” after each removal. With an adjusted R-squared of 0.9685, the reduced model explains the variability in the YDS statistic very well.
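The one-at-a-time elimination described above can be sketched as a small loop. The helper name `backward_eliminate` is hypothetical, and the built-in `mtcars` data set stands in for `wrclean` so the example runs on its own. It assumes all predictors are numeric, so coefficient names match term names.

```r
# Backward elimination by p-value: repeatedly drop the least significant
# predictor and refit, until every remaining predictor has p <= alpha.
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    coefs <- summary(model)$coefficients
    keep  <- rownames(coefs) != "(Intercept)"
    pvals <- setNames(coefs[keep, "Pr(>|t|)"], rownames(coefs)[keep])
    if (length(pvals) == 0 || max(pvals) <= alpha) break
    worst <- names(pvals)[which.max(pvals)]
    # Refit without the predictor that has the largest p-value.
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Stand-in example on a built-in data set (not the football data).
full_model    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced_model <- backward_eliminate(full_model)
print(formula(reduced_model))
```

Refitting after each removal matters because the p-values of the remaining predictors change once a correlated variable leaves the model.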

## 
## Call:
## lm(formula = YDS ~ REC + TGT + YPR + LG + TD + RuTD, data = wrclean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -224.947  -31.946    1.299   31.155  250.819 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -107.6196    14.4726  -7.436 2.41e-12 ***
## REC            6.2604     0.9857   6.351 1.24e-09 ***
## TGT            2.7695     0.6551   4.227 3.49e-05 ***
## YPR            6.3548     1.2187   5.215 4.30e-07 ***
## LG             0.9301     0.3522   2.641  0.00887 ** 
## TD            19.9664     2.6983   7.400 3.00e-12 ***
## RuTD          31.8624     7.7705   4.100 5.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.04 on 216 degrees of freedom
## Multiple R-squared:  0.9693, Adjusted R-squared:  0.9685 
## F-statistic:  1138 on 6 and 216 DF,  p-value: < 2.2e-16

The equation for this linear model is:

\[ \begin{aligned} \widehat{YDS} &= \hat{\beta}_0 + \hat{\beta}_1 \times REC + \hat{\beta}_2 \times TGT +\hat{\beta}_3 \times YPR + \hat{\beta}_4 \times LG + \hat{\beta}_5 \times TD + \hat{\beta}_6 \times RuTD\end{aligned} \]

Fully expanded, this becomes

\[ \begin{aligned} \widehat{YDS} &= -107.62 + 6.26 \times REC + 2.77 \times TGT + 6.35 \times YPR + 0.93 \times LG + 19.97 \times TD + 31.86 \times RuTD\end{aligned} \]
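As a worked example, the expanded equation can be evaluated for a single stat line. The function name `predict_yds` and the player's numbers are hypothetical, chosen only for illustration; the coefficients are the rounded values from the equation above.

```r
# Predicted receiving yards from the rounded coefficients of the reduced model.
predict_yds <- function(REC, TGT, YPR, LG, TD, RuTD) {
  -107.62 + 6.26 * REC + 2.77 * TGT + 6.35 * YPR +
    0.93 * LG + 19.97 * TD + 31.86 * RuTD
}

# Hypothetical receiver: 50 catches on 80 targets, 12 yards per reception,
# a long of 40 yards, 4 receiving touchdowns, no rushing touchdowns.
example_yds <- predict_yds(REC = 50, TGT = 80, YPR = 12, LG = 40, TD = 4, RuTD = 0)
print(example_yds)  # 620.26
```

For comparison, 50 receptions at 12 yards per reception is 600 actual yards, so the prediction for this made-up player is about 20 yards high.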

Knowing what I know about football, rushing touchdowns are few and far between for wide receivers. In fact, the mean is 0.0987 per player, and the maximum is 8. Only 13 of the 223 players remaining in the trimmed version of the dataframe scored a rushing touchdown. Because rushing touchdowns are so infrequent, including them may let a handful of players skew the model.

wrclean %>%
  select(c(3:14)) %>%
  ggpairs()

[Figure: Regression Analysis (histogram of residuals, residuals vs. fitted values, normal probability plot)]

The residuals appear to be roughly normally distributed. The residuals vs. fitted values plot takes a cornucopia (funnel) shape, with the spread of the residuals changing across the fitted values, so the residuals look heteroscedastic. To predict yards more accurately, it may be wiser to restrict the model to a narrower range of players, for example by removing outliers above 1000 receiving yards. The normal probability plot is mostly linear, with some deviation throughout.
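The three diagnostics discussed here can be reproduced with base R graphics. This is a sketch under assumptions: the original plots were probably drawn with ggplot2, the helper name `diagnostic_plots` is hypothetical, and a model on the built-in `mtcars` data stands in for the football model.

```r
# Draw a histogram of residuals, residuals vs. fitted values, and a normal
# probability (Q-Q) plot side by side; return the plotted values invisibly.
diagnostic_plots <- function(model) {
  res <- resid(model)
  fit <- fitted(model)
  old <- par(mfrow = c(1, 3))
  on.exit(par(old))
  hist(res, main = "Histogram of residuals", xlab = "Residual")
  plot(fit, res, main = "Residuals vs. fitted",
       xlab = "Fitted value", ylab = "Residual")
  abline(h = 0, lty = 2)  # reference line at zero
  qqnorm(res, main = "Normal probability plot")
  qqline(res)
  invisible(data.frame(fitted = fit, residual = res))
}

# Stand-in model on a built-in data set (not the football data).
demo_fit  <- lm(mpg ~ wt + hp, data = mtcars)
diag_data <- diagnostic_plots(demo_fit)
```

A funnel in the middle panel, where the vertical spread widens or narrows along the x-axis, is the visual sign of heteroscedasticity described above.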

I recalculated everything using a maximum yardage threshold of 1000, and the difference was not noticeable. Instead, I will refit the model after removing players with 250 yards or fewer.

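# Keep only players with more than 250 receiving yards
wr_trim <- subset(wrclean, YDS > 250)

# Pairwise plots for columns 3 through 14 of the trimmed data
wr_trim %>%
  select(c(3:14)) %>%
  ggpairs()

# Refit the reduced model on the trimmed data
trim_mod <- lm(YDS ~ REC + TGT + YPR + LG + TD + RuTD, data = wr_trim)
summary(trim_mod)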
## 
## Call:
## lm(formula = YDS ~ REC + TGT + YPR + LG + TD + RuTD, data = wr_trim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -200.525  -27.519    4.467   34.959  160.627 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -523.7371    34.2312 -15.300  < 2e-16 ***
## REC            9.6873     0.8898  10.887  < 2e-16 ***
## TGT            1.4667     0.5988   2.450 0.015974 *  
## YPR           41.2497     2.8173  14.641  < 2e-16 ***
## LG            -0.2821     0.4492  -0.628 0.531409    
## TD             9.7183     2.4499   3.967 0.000134 ***
## RuTD          20.7281     6.5359   3.171 0.001994 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.96 on 104 degrees of freedom
## Multiple R-squared:  0.9754, Adjusted R-squared:  0.974 
## F-statistic: 687.5 on 6 and 104 DF,  p-value: < 2.2e-16

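In the trimmed model, receptions (t = 10.89) and yards per reception (t = 14.64) appear to be the strongest predictors of receiving yards, while targets contribute less (p = 0.016) and LG is no longer significant (p = 0.531). With that being said, when omitting players with 250 yards or fewer, the adjusted R-squared increases from 0.9685 to 0.974.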
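A side-by-side comparison of the full and trimmed fits can be sketched as below. The helper name `fit_summary` is hypothetical, and the built-in `mtcars` data stands in for `wrclean` and `wr_trim`. Because the two models are fit to different sets of rows, their adjusted R-squared values are measured against different amounts of total variability, so the comparison is suggestive rather than strict.

```r
# Collect sample size, adjusted R-squared, and residual standard error.
fit_summary <- function(model) {
  s <- summary(model)
  data.frame(n = length(resid(model)),
             adj_r2 = s$adj.r.squared,
             resid_se = s$sigma)
}

# Stand-in models on a built-in data set (not the football data):
# the same formula fit to all rows and to a trimmed subset.
full_fit <- lm(mpg ~ wt + hp, data = mtcars)
trim_fit <- lm(mpg ~ wt + hp, data = subset(mtcars, mpg > 15))

comparison <- rbind(full = fit_summary(full_fit),
                    trimmed = fit_summary(trim_fit))
print(comparison)
```

The residual standard error is in the units of the response, which makes it a useful companion to adjusted R-squared when the subsets differ.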

[Figure: Regression Analysis (histogram of residuals, residuals vs. fitted values, normal probability plot)]

This model is better than the one for the full range. The histogram of residuals is nearly normal, the normal probability plot is mostly linear, and the scatterplot of the residuals shows much more constant variance than the original.