The data were downloaded from the FantasyPros website and loaded into a GitHub repository for easy access. The FantasyPros data are inaccurate for players like Cordarrelle Patterson, who are listed as running backs or tight ends in the FantasyPros database. I will attempt to clean this up by removing all players with a score of 0.0 from the calculations.
The data are from last year’s NFL season. I will use multiple regression to find an appropriate model for the statistics that predict the total yards (YDS) a player will accrue.
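The zero-score filter described above could be sketched as follows. This is only an illustration on a made-up data frame: `wr_raw` and its rows are hypothetical, and the sketch assumes that the "score" is the `FPTS` column. Only the name `wrclean` comes from the analysis itself.

```r
# Hypothetical stand-in for the raw FantasyPros table; all values are invented.
wr_raw <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D"),
  YDS    = c(1200, 0, 310, 0),
  FPTS   = c(250.4, 0.0, 61.2, 0.0),
  stringsAsFactors = FALSE
)

# Keep only players with a nonzero fantasy score (assumed here to be FPTS).
wrclean <- subset(wr_raw, FPTS != 0)

# Number of players that survive the filter.
nrow(wrclean)
```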
## Rank Player REC TGT
## Min. : 1.0 Length:223 Min. : 1.0 Min. : 1.00
## 1st Qu.: 56.5 Class :character 1st Qu.: 5.0 1st Qu.: 10.50
## Median :112.0 Mode :character Median : 24.0 Median : 37.00
## Mean :112.1 Mean : 30.3 Mean : 47.96
## 3rd Qu.:167.5 3rd Qu.: 44.5 3rd Qu.: 69.50
## Max. :224.0 Max. :145.0 Max. :191.00
## YDS YPR LG X20. TD
## Min. : 5.0 Min. : 4.00 Min. : 5.00 Min. :0 Min. : 0.000
## 1st Qu.: 54.5 1st Qu.: 9.90 1st Qu.:21.50 1st Qu.:0 1st Qu.: 0.000
## Median : 250.0 Median :12.20 Median :39.00 Median :0 Median : 1.000
## Mean : 378.9 Mean :12.39 Mean :38.15 Mean :0 Mean : 2.336
## 3rd Qu.: 562.0 3rd Qu.:14.10 3rd Qu.:52.50 3rd Qu.:0 3rd Qu.: 4.000
## Max. :1947.0 Max. :38.00 Max. :91.00 Max. :0 Max. :16.000
## ATT RuYD RuTD FL
## Min. : 0.000 Min. :-13.00 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 0.000 Median : 0.00 Median :0.00000 Median :0.0000
## Mean : 1.865 Mean : 12.03 Mean :0.09866 Mean :0.2466
## 3rd Qu.: 2.000 3rd Qu.: 11.50 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :59.000 Max. :365.00 Max. :8.00000 Max. :3.0000
## G FPTS FPTS.G ROST
## Min. : 1.00 Min. : 0.80 Min. : 0.10 Length:223
## 1st Qu.: 7.00 1st Qu.: 14.15 1st Qu.: 2.60 Class :character
## Median :12.00 Median : 60.30 Median : 5.40 Mode :character
## Mean :10.75 Mean : 84.03 Mean : 6.73
## 3rd Qu.:16.00 3rd Qu.:126.90 3rd Qu.: 9.85
## Max. :17.00 Max. :439.50 Max. :25.90
##
## Call:
## lm(formula = YDS ~ REC + TGT + YPR + LG + G + TD + RuYD + FL +
## RuTD, data = wrclean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -229.701 -30.869 1.518 29.173 254.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -100.4678 16.6332 -6.040 6.78e-09 ***
## REC 6.2877 1.0099 6.226 2.51e-09 ***
## TGT 2.8591 0.6793 4.209 3.79e-05 ***
## YPR 6.3830 1.2352 5.168 5.44e-07 ***
## LG 0.9832 0.3741 2.628 0.00922 **
## G -1.3021 1.4085 -0.924 0.35629
## TD 19.5365 2.7476 7.110 1.72e-11 ***
## RuYD 0.2022 0.2439 0.829 0.40800
## FL -5.5996 9.8307 -0.570 0.56955
## RuTD 24.7996 12.3871 2.002 0.04655 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.24 on 213 degrees of freedom
## Multiple R-squared: 0.9696, Adjusted R-squared: 0.9683
## F-statistic: 754.3 on 9 and 213 DF, p-value: < 2.2e-16
In this model, I will eliminate any predictor with a p-value greater than 0.05. FL has the largest p-value of any variable (0.570), so it will be removed first. The adjusted R-squared for the full model is 0.9683. FL, RuYD, and G each have a p-value above 0.05, so I will drop them one at a time, recalculating “mlrdrop” after each variable is removed. With an adjusted R-squared of 0.9685, the reduced model explains the variability in the YDS statistic very well.
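The one-at-a-time elimination described above can be sketched as a loop. This is a minimal illustration on simulated data, not the code used for the analysis: the data frame `toy` and its values are invented, and only the name `mlrdrop` and the 0.05 threshold come from the text.

```r
set.seed(1)

# Simulated stand-in data: YDS depends on REC and TD; G and FL are pure noise.
n <- 200
toy <- data.frame(
  REC = rpois(n, 30),
  TD  = rpois(n, 2),
  G   = sample(1:17, n, replace = TRUE),
  FL  = rbinom(n, 3, 0.1)
)
toy$YDS <- 12 * toy$REC + 20 * toy$TD + rnorm(n, sd = 50)

# Start from the full model, then repeatedly drop the least significant term.
mlrdrop <- lm(YDS ~ REC + TD + G + FL, data = toy)
repeat {
  pvals <- summary(mlrdrop)$coefficients[, 4]        # column 4 is Pr(>|t|)
  pvals <- pvals[names(pvals) != "(Intercept)"]      # ignore the intercept
  if (length(pvals) == 0 || max(pvals) <= 0.05) break
  worst <- names(pvals)[which.max(pvals)]            # largest p-value
  mlrdrop <- update(mlrdrop, as.formula(paste(". ~ . -", worst)))
}

# Terms that remain after elimination.
names(coef(mlrdrop))
```

Refitting after each removal matters because the remaining p-values change whenever a correlated predictor leaves the model.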
##
## Call:
## lm(formula = YDS ~ REC + TGT + YPR + LG + TD + RuTD, data = wrclean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -224.947 -31.946 1.299 31.155 250.819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -107.6196 14.4726 -7.436 2.41e-12 ***
## REC 6.2604 0.9857 6.351 1.24e-09 ***
## TGT 2.7695 0.6551 4.227 3.49e-05 ***
## YPR 6.3548 1.2187 5.215 4.30e-07 ***
## LG 0.9301 0.3522 2.641 0.00887 **
## TD 19.9664 2.6983 7.400 3.00e-12 ***
## RuTD 31.8624 7.7705 4.100 5.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.04 on 216 degrees of freedom
## Multiple R-squared: 0.9693, Adjusted R-squared: 0.9685
## F-statistic: 1138 on 6 and 216 DF, p-value: < 2.2e-16
The equation for this linear model is:
\[ \begin{aligned} \widehat{YDS} &= \hat{\beta}_0 + \hat{\beta}_1 \times REC + \hat{\beta}_2 \times TGT +\hat{\beta}_3 \times YPR + \hat{\beta}_4 \times LG + \hat{\beta}_5 \times TD + \hat{\beta}_6 \times RuTD\end{aligned} \]
Fully expanded, this becomes
\[ \begin{aligned} \widehat{YDS} &= -107.62 + 6.26 \times REC + 2.77 \times TGT + 6.35 \times YPR + 0.93 \times LG + 19.97 \times TD + 31.86 \times RuTD\end{aligned} \]
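As a worked example, the fitted equation can be evaluated by hand for a single stat line. The coefficients below are the rounded values from the equation above; the player's statistics are hypothetical and invented purely for illustration.

```r
# Coefficients copied from the fitted equation (rounded to two decimals).
coefs <- c(Intercept = -107.62, REC = 6.26, TGT = 2.77, YPR = 6.35,
           LG = 0.93, TD = 19.97, RuTD = 31.86)

# Hypothetical stat line, not a real player.
player <- c(REC = 50, TGT = 80, YPR = 12, LG = 40, TD = 4, RuTD = 0)

# Predicted yards = intercept + sum of (coefficient * statistic).
yds_hat <- unname(coefs["Intercept"] + sum(coefs[names(player)] * player))
yds_hat
```

For this stat line the terms are 313 + 221.6 + 76.2 + 37.2 + 79.88 + 0 - 107.62, or about 620 yards, which is close to the 600 yards implied by 50 receptions at 12 yards each.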
Knowing what I know about football, rushing touchdowns are few and far between for wide receivers. In fact, the mean is 0.0987 and the maximum is 8. Only 13 of the 223 players remaining in the trimmed version of the data frame scored a rushing touchdown. Because rushing touchdowns are so infrequent, including them may skew the model.
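library(dplyr)   # provides %>% and select()
library(GGally)  # provides ggpairs()
# Pairwise scatterplot matrix of columns 3 through 14 (REC through G)
wrclean %>%
  select(3:14) %>%
  ggpairs()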
## Regression Analysis
[Figure: residual diagnostic plots for the model above (residuals vs. fitted values, normal probability plot)]
The residuals appear to be approximately normally distributed. The residuals vs. fitted values plot is fan-shaped, like a cornucopia: the spread of the residuals grows with the fitted values, which suggests heteroscedasticity. To predict yards more accurately, it may be wiser to restrict the model by removing outliers above 1000 receiving yards. The normal probability plot is mostly linear, with some deviation throughout.
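The checks described above can be sketched in base R. The example below uses simulated data whose noise grows with the predictor, so it reproduces the fan shape on purpose; `toy_mod` and all values are invented. The auxiliary regression of squared residuals on fitted values is a rough numeric check added here for illustration and is not part of the original analysis.

```r
set.seed(42)

# Simulated data whose noise grows with x, giving a fan-shaped residual plot.
n <- 300
x <- runif(n, min = 1, max = 100)
y <- 5 * x + rnorm(n, sd = 0.5 * x)
toy_mod <- lm(y ~ x)

# Visual checks: residuals vs. fitted values, and a normal Q-Q plot.
plot(fitted(toy_mod), resid(toy_mod),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(toy_mod))
qqline(resid(toy_mod))

# Rough numeric check: do the squared residuals trend with the fitted values?
aux <- lm(I(resid(toy_mod)^2) ~ fitted(toy_mod))
aux_p <- summary(aux)$coefficients[2, "Pr(>|t|)"]
aux_p
```

A small p-value for the auxiliary slope indicates that the residual variance changes with the fitted values, which is what a fan-shaped plot shows visually.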
I refit the model using a maximum yardage threshold of 1000, and the difference was not noticeable. Instead, I will refit it after removing players with 250 yards or fewer.
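# Keep only players with more than 250 yards
wr_trim <- subset(wrclean, YDS > 250)
# Pairwise scatterplot matrix for the trimmed data
wr_trim %>%
  select(3:14) %>%
  ggpairs()
# Refit the reduced model on the trimmed data
trim_mod <- lm(YDS ~ REC + TGT + YPR + LG + TD + RuTD, data = wr_trim)
summary(trim_mod)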
##
## Call:
## lm(formula = YDS ~ REC + TGT + YPR + LG + TD + RuTD, data = wr_trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -200.525 -27.519 4.467 34.959 160.627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -523.7371 34.2312 -15.300 < 2e-16 ***
## REC 9.6873 0.8898 10.887 < 2e-16 ***
## TGT 1.4667 0.5988 2.450 0.015974 *
## YPR 41.2497 2.8173 14.641 < 2e-16 ***
## LG -0.2821 0.4492 -0.628 0.531409
## TD 9.7183 2.4499 3.967 0.000134 ***
## RuTD 20.7281 6.5359 3.171 0.001994 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.96 on 104 degrees of freedom
## Multiple R-squared: 0.9754, Adjusted R-squared: 0.974
## F-statistic: 687.5 on 6 and 104 DF, p-value: < 2.2e-16
In the trimmed model, receptions and yards per reception have the largest t-statistics, while LG is no longer significant (p = 0.531). When players with 250 yards or fewer are omitted, the adjusted R-squared increases from 0.9685 to 0.974.
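The fit statistics being compared can be read directly from the summary objects instead of from the printed output. The sketch below uses simulated data with invented values; the column names and the 250-yard cutoff mirror the analysis, and `fit_stats` is a hypothetical helper.

```r
set.seed(7)

# Simulated stand-in data; values are invented.
n <- 200
toy <- data.frame(REC = rpois(n, 30), TD = rpois(n, 3))
toy$YDS <- 12 * toy$REC + 20 * toy$TD + rnorm(n, sd = 50)

# Same formula fitted to the full data and to the trimmed subset.
full_mod <- lm(YDS ~ REC + TD, data = toy)
trim_mod <- lm(YDS ~ REC + TD, data = subset(toy, YDS > 250))

# Hypothetical helper: sample size, adjusted R-squared, residual standard error.
fit_stats <- function(mod) {
  s <- summary(mod)
  c(n = length(resid(mod)), adj_r2 = s$adj.r.squared, rse = s$sigma)
}

comparison <- rbind(full = fit_stats(full_mod), trimmed = fit_stats(trim_mod))
comparison
```

Because the two models are fitted to different sets of players, it may help to look at the residual standard error alongside the adjusted R-squared; in the outputs above it falls from 68.04 to 54.96.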
[Figure: residual diagnostic plots for the trimmed model (histogram of residuals, normal probability plot, residuals vs. fitted values)]
This model fits better than the one for the full range. The histogram of residuals is nearly normal, the normal probability plot is mostly linear, and the scatterplot of the residuals shows much more constant variance than the original.