Fantasy Football Regression Model

##Introduction

Fantasy football is a very unpredictable hobby. Most player season projections can end up being wildly inaccurate due to the unpredictability of the sport as an single injury or coaching change can have a large effect on a players fantasy production. The goal of this model is to use a player’s previous year production to make a projection of a players points per game in the following season. This model is intended to complement other information that is not captured in the data used for this model such as team changes, coaching changes, or team personal changes in the offseason.

##Gathering data

The main source of data for this model is obtained from Pro Football Focus. The data contains end of season game stats for offensive players and their fantasy points from 2008-2018. The other data comes from the NFLScrapR package that gathers play by play data for every NFL game. This data is used to add expected points added, win probability added, and total snaps for each player. This data only covers the 2009-2017 seasons which means is does limit the size of the data but that is a fair trade off for more detailed data.

library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(rebus)

## 
## Attaching package: 'rebus'

## The following object is masked from 'package:stringr':
## 
##     regex

## The following object is masked from 'package:ggplot2':
## 
##     alpha

library(Metrics)
library(readxl)

Raw_Data10yrs_v3 <- read_excel("Raw Data10yrs_v3.xlsx")


#gathering data from NFLscrapR for play by play Expected points added and win probability added per play, along with snaps per season
stats2017 <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2017.csv"))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   home_team = col_character(),
##   away_team = col_character(),
##   posteam = col_character(),
##   posteam_type = col_character(),
##   defteam = col_character(),
##   side_of_field = col_character(),
##   game_date = col_date(format = ""),
##   game_half = col_character(),
##   time = col_time(format = ""),
##   yrdln = col_character(),
##   desc = col_character(),
##   play_type = col_character(),
##   pass_length = col_character(),
##   pass_location = col_character(),
##   run_location = col_character(),
##   run_gap = col_character(),
##   field_goal_result = col_character(),
##   extra_point_result = col_character(),
##   two_point_conv_result = col_character(),
##   timeout_team = col_character()
##   # ... with 88 more columns
## )

## See spec(...) for full column specifications.

## Warning: 224 parsing failures.
##  row                        col           expected       actual         file
## 1095 blocked_player_id          1/0/T/F/TRUE/FALSE 00-0032366   <connection>
## 1095 blocked_player_name        1/0/T/F/TRUE/FALSE T.Matakevich <connection>
## 2589 blocked_player_id          1/0/T/F/TRUE/FALSE 00-0031270   <connection>
## 2589 blocked_player_name        1/0/T/F/TRUE/FALSE S.Harris     <connection>
## 2768 lateral_receiver_player_id 1/0/T/F/TRUE/FALSE 00-0027973   <connection>
## .... .......................... .................. ............ ............
## See problems(...) for more details.

stats2016 <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2016.csv"))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   home_team = col_character(),
##   away_team = col_character(),
##   posteam = col_character(),
##   posteam_type = col_character(),
##   defteam = col_character(),
##   side_of_field = col_character(),
##   game_date = col_date(format = ""),
##   game_half = col_character(),
##   time = col_time(format = ""),
##   yrdln = col_character(),
##   desc = col_character(),
##   play_type = col_character(),
##   pass_length = col_character(),
##   pass_location = col_character(),
##   run_location = col_character(),
##   run_gap = col_character(),
##   field_goal_result = col_character(),
##   extra_point_result = col_character(),
##   two_point_conv_result = col_character(),
##   timeout_team = col_character()
##   # ... with 88 more columns
## )
## See spec(...) for full column specifications.

## Warning: 81 parsing failures.
##  row                                col           expected     actual         file
## 3010 lateral_receiver_player_id         1/0/T/F/TRUE/FALSE 00-0030068 <connection>
## 3010 lateral_receiver_player_name       1/0/T/F/TRUE/FALSE M.Goodwin  <connection>
## 5439 forced_fumble_player_2_team        1/0/T/F/TRUE/FALSE GB         <connection>
## 5439 forced_fumble_player_2_player_id   1/0/T/F/TRUE/FALSE 00-0028002 <connection>
## 5439 forced_fumble_player_2_player_name 1/0/T/F/TRUE/FALSE R.Cobb     <connection>
## .... .................................. .................. .......... ............
## See problems(...) for more details.

stats2015 <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2015.csv"))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   home_team = col_character(),
##   away_team = col_character(),
##   posteam = col_character(),
##   posteam_type = col_character(),
##   defteam = col_character(),
##   side_of_field = col_character(),
##   game_date = col_date(format = ""),
##   game_half = col_character(),
##   time = col_time(format = ""),
##   yrdln = col_character(),
##   desc = col_character(),
##   play_type = col_character(),
##   pass_length = col_character(),
##   pass_location = col_character(),
##   run_location = col_character(),
##   run_gap = col_character(),
##   field_goal_result = col_character(),
##   extra_point_result = col_character(),
##   two_point_conv_result = col_character(),
##   timeout_team = col_character()
##   # ... with 88 more columns
## )
## See spec(...) for full column specifications.

## Warning: 169 parsing failures.
##  row                        col           expected     actual         file
## 1226 blocked_player_id          1/0/T/F/TRUE/FALSE 00-0021577 <connection>
## 1226 blocked_player_name        1/0/T/F/TRUE/FALSE C.Clemons  <connection>
## 2058 blocked_player_id          1/0/T/F/TRUE/FALSE 00-0023509 <connection>
## 2058 blocked_player_name        1/0/T/F/TRUE/FALSE J.Tuck     <connection>
## 2464 lateral_receiver_player_id 1/0/T/F/TRUE/FALSE 00-0027450 <connection>
## .... .......................... .................. .......... ............
## See problems(...) for more details.

stats2013 <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2014.csv"))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   home_team = col_character(),
##   away_team = col_character(),
##   posteam = col_character(),
##   posteam_type = col_character(),
##   defteam = col_character(),
##   side_of_field = col_character(),
##   game_date = col_date(format = ""),
##   game_half = col_character(),
##   time = col_time(format = ""),
##   yrdln = col_character(),
##   desc = col_character(),
##   play_type = col_character(),
##   pass_length = col_character(),
##   pass_location = col_character(),
##   run_location = col_character(),
##   run_gap = col_character(),
##   field_goal_result = col_character(),
##   extra_point_result = col_character(),
##   two_point_conv_result = col_character(),
##   timeout_team = col_character()
##   # ... with 88 more columns
## )
## See spec(...) for full column specifications.

## Warning: 194 parsing failures.
##  row                              col           expected       actual         file
## 2400 pass_defense_2_player_id         1/0/T/F/TRUE/FALSE 00-0027983   <connection>
## 2400 pass_defense_2_player_name       1/0/T/F/TRUE/FALSE R.Moore      <connection>
## 2525 own_kickoff_recovery_player_id   1/0/T/F/TRUE/FALSE 00-0026343   <connection>
## 2525 own_kickoff_recovery_player_name 1/0/T/F/TRUE/FALSE A.Studebaker <connection>
## 3277 pass_defense_2_player_id         1/0/T/F/TRUE/FALSE 00-0026945   <connection>
## .... ................................ .................. ............ ............
## See problems(...) for more details.

stats2012 <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2013.csv"))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   home_team = col_character(),
##   away_team = col_character(),
##   posteam = col_character(),
##   posteam_type = col_character(),
##   defteam = col_character(),
##   side_of_field = col_character(),
##   game_date = col_date(format = ""),
##   game_half = col_character(),
##   time = col_time(format = ""),
##   yrdln = col_character(),
##   desc = col_character(),
##   play_type = col_character(),
##   pass_length = col_character(),
##   pass_location = col_character(),
##   run_location = col_character(),
##   run_gap = col_character(),
##   field_goal_result = col_character(),
##   extra_point_result = col_character(),
##   two_point_conv_result = col_logical(),
##   timeout_team = col_character()
##   # ... with 88 more columns
## )
## See spec(...) for full column specifications.

## Warning: 154 parsing failures.
##  row                          col           expected     actual         file
## 1217 two_point_conv_result        1/0/T/F/TRUE/FALSE failure    <connection>
## 2263 two_point_conv_result        1/0/T/F/TRUE/FALSE success    <connection>
## 2645 two_point_conv_result        1/0/T/F/TRUE/FALSE failure    <connection>
## 3424 lateral_receiver_player_id   1/0/T/F/TRUE/FALSE 00-0027758 <connection>
## 3424 lateral_receiver_player_name 1/0/T/F/TRUE/FALSE R.Cooper   <connection>
## .... ............................ .................. .......... ............
## See problems(...) for more details.

stats2011 <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2012.csv"))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   home_team = col_character(),
##   away_team = col_character(),
##   posteam = col_character(),
##   posteam_type = col_character(),
##   defteam = col_character(),
##   side_of_field = col_character(),
##   game_date = col_date(format = ""),
##   game_half = col_character(),
##   time = col_time(format = ""),
##   yrdln = col_character(),
##   desc = col_character(),
##   play_type = col_character(),
##   pass_length = col_character(),
##   pass_location = col_character(),
##   run_location = col_character(),
##   run_gap = col_character(),
##   field_goal_result = col_character(),
##   extra_point_result = col_character(),
##   two_point_conv_result = col_character(),
##   timeout_team = col_character()
##   # ... with 88 more columns
## )
## See spec(...) for full column specifications.

## Warning: 285 parsing failures.
##  row                            col           expected     actual         file
## 1070 lateral_receiver_player_id     1/0/T/F/TRUE/FALSE 00-0025761 <connection>
## 1070 lateral_receiver_player_name   1/0/T/F/TRUE/FALSE R.Turner   <connection>
## 1715 qb_hit_2_player_id             1/0/T/F/TRUE/FALSE 00-0022793 <connection>
## 1715 qb_hit_2_player_name           1/0/T/F/TRUE/FALSE A.Smith    <connection>
## 1850 lateral_interception_player_id 1/0/T/F/TRUE/FALSE 00-0027943 <connection>
## .... .............................. .................. .......... ............
## See problems(...) for more details.

stats2010 <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2010.csv"))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   home_team = col_character(),
##   away_team = col_character(),
##   posteam = col_character(),
##   posteam_type = col_character(),
##   defteam = col_character(),
##   side_of_field = col_character(),
##   game_date = col_date(format = ""),
##   game_half = col_character(),
##   time = col_time(format = ""),
##   yrdln = col_character(),
##   desc = col_character(),
##   play_type = col_character(),
##   pass_length = col_character(),
##   pass_location = col_character(),
##   run_location = col_character(),
##   run_gap = col_character(),
##   field_goal_result = col_character(),
##   extra_point_result = col_character(),
##   two_point_conv_result = col_character(),
##   timeout_team = col_character()
##   # ... with 88 more columns
## )
## See spec(...) for full column specifications.

## Warning: 106 parsing failures.
##  row                                col           expected     actual         file
## 1209 tackle_for_loss_2_player_id        1/0/T/F/TRUE/FALSE 00-0022008 <connection>
## 1209 tackle_for_loss_2_player_name      1/0/T/F/TRUE/FALSE T.McGee    <connection>
## 2021 forced_fumble_player_2_team        1/0/T/F/TRUE/FALSE STL        <connection>
## 2021 forced_fumble_player_2_player_id   1/0/T/F/TRUE/FALSE 00-0026241 <connection>
## 2021 forced_fumble_player_2_player_name 1/0/T/F/TRUE/FALSE J.King     <connection>
## .... .................................. .................. .......... ............
## See problems(...) for more details.

stats2009 <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2009.csv"))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   home_team = col_character(),
##   away_team = col_character(),
##   posteam = col_character(),
##   posteam_type = col_character(),
##   defteam = col_character(),
##   side_of_field = col_character(),
##   game_date = col_date(format = ""),
##   game_half = col_character(),
##   time = col_time(format = ""),
##   yrdln = col_character(),
##   desc = col_character(),
##   play_type = col_character(),
##   pass_length = col_character(),
##   pass_location = col_character(),
##   run_location = col_character(),
##   run_gap = col_character(),
##   field_goal_result = col_character(),
##   extra_point_result = col_character(),
##   two_point_conv_result = col_character(),
##   timeout_team = col_character()
##   # ... with 88 more columns
## )
## See spec(...) for full column specifications.

## Warning: 244 parsing failures.
##  row                                col           expected      actual         file
## 1081 pass_defense_2_player_id           1/0/T/F/TRUE/FALSE 00-0025828  <connection>
## 1081 pass_defense_2_player_name         1/0/T/F/TRUE/FALSE W.Woodyard  <connection>
## 1619 forced_fumble_player_2_team        1/0/T/F/TRUE/FALSE MIA         <connection>
## 1619 forced_fumble_player_2_player_id   1/0/T/F/TRUE/FALSE 00-0023408  <connection>
## 1619 forced_fumble_player_2_player_name 1/0/T/F/TRUE/FALSE G.Camarillo <connection>
## .... .................................. .................. ........... ............
## See problems(...) for more details.

##Staging NFLScrapR data

All the data from NFLScrapR needs to be joined together and a new year column is created to match with the year column in the Pro Football Focus to be joined later. For the Pro Football Focus data, the name column needs to be changed to match the NFLScrapR name column so they can be joined together later. To do this, the first character is selected and added to the last name to match the format of NFLScrapR.

#joining all years together
Epa <- rbind(stats2017, stats2016, stats2015, stats2013, stats2012, stats2011, stats2010, stats2009)

#creates a column to be used in join with other dataset
Epa <- Epa %>%
  mutate(Year = as.character(format(game_date,"%Y")))

#add new column to match the NFLScrapR name format to be used in joining datasets
Raw_Data10yrs_v3$joinname = substr(Raw_Data10yrs_v3$Name, 1, 1) 
Raw_Data10yrs_v3$joinname = paste(Raw_Data10yrs_v3$joinname, str_extract(Raw_Data10yrs_v3$Name, pattern = 
                                                 capture(one_or_more(SPC)) %R%
                                                 capture(one_or_more(WRD))
))
Raw_Data10yrs_v3$joinname = gsub("  ", ".", Raw_Data10yrs_v3$joinname)
Raw_Data10yrs_v3$Year = as.character(Raw_Data10yrs_v3$Year)

##Running Back Data

To start, all data for running backs will be gathered. To do this, the NFLScrapR data is filtered to select only run plays and each players EPA, WPA, and snaps are calculated for each season. Then the Pro Football Focus data is filtered to only select running backs and only the top 75 runningbacks in each season to make sure only fantasy relevant players are selected. As a result of fantasy football projections mostly being based on a player’s previous year production, the lag function is used to select data from the players previous season. For rookie player seasons there is nothing to project in this model unless college stats are used which are beyond the scope of this model so those seasons are removed from the dataset.

The data is then joined to the NFLScrapR dataset using the specific name and year columns made for the join in each dataset. Once they are joined the lag function is used to gather the stats from the NFLScrapR data in the same way as before. Finally, only players with NFLScrapR data are selected and NA values are removed as the NFL ScrapR package occasionally has errors in gathering data due how the data is reported before being scraped.

#selects only RBs
rbepa <- Epa %>%
  filter(play_type == "run" & qb_scramble == 0 & !is.na(epa) & !is.na(wpa)) %>% #selects only plays where ball is run and removes few plays where EPA is not calculated
  group_by(rusher_player_name, Year ) %>%
  summarise(averageEpa = mean(epa), totalEpa = sum(epa), averageWpa = mean(wpa), totalWpa = sum(wpa), snaps = n()) %>%
  arrange(rusher_player_name, Year) 


rbs <- Raw_Data10yrs_v3 %>%
  filter(FantPos == "RB" & PosRank <= 75 & !is.na(PosRank)) %>% #selects top 75 RBs to only select fantasy relevant RBs
  arrange(Name, Year) %>%
  mutate(prev_year_rank = ifelse(lag(Name) != Name, NA, lag(PosRank)), #gathers data from previous years to make prediction using lag function
         prev_year_rankdiff = prev_year_rank - PosRank,
         prev_year_RushingY_A = ifelse(lag(Name) != Name, NA, lag(RushingY_A)),
         prev_year_Rushingatt = ifelse(lag(Name) != Name, NA, lag(RushingAtt)),
         prev_year_RushingTD = ifelse(lag(Name) != Name, NA, lag(RushingTD)),
         prev_year_G = ifelse(lag(Name) != Name, NA, lag(G)),
         prev_year_GS = ifelse(lag(Name) != Name, NA, lag(GS)),
         prev_year_PPRPG = ifelse(lag(Name) != Name, NA, lag(PPRPG)),
         prev_year_ReceivingTgt = ifelse(lag(Name) != Name, NA, lag(ReceivingTgt)),
         prev_year_ReceivingY_R = ifelse(lag(Name) != Name, NA, lag(ReceivingY_R)),
         prev_year_ReceivingTD = ifelse(lag(Name) != Name, NA, lag(ReceivingTD)),
         )

#joins fantasy data with EPA, WPA, and snaps per season
rbsfull <- left_join(rbs, rbepa, by = c("joinname" = "rusher_player_name", "Year" = "Year"))

rbsfull <- rbsfull %>%   
  mutate(prev_year_averageEpa = ifelse(lag(Name) != Name, NA, lag(averageEpa)), #gathers data from previous years to make prediction using lag function
                                prev_year_totalEpa = ifelse(lag(Name) != Name, NA, lag(totalEpa)),
                                prev_year_averageWpa = ifelse(lag(Name) != Name, NA, lag(averageWpa)),
                                prev_year_totalWpa = ifelse(lag(Name) != Name, NA, lag(totalWpa)),
                                prev_year_snaps = ifelse(lag(Name) != Name, NA, lag(snaps)),
)

#removes any player's rookie year since we have no stats to predict and players that only played one year
rbsfull <- rbsfull %>% 
  filter(!is.na(prev_year_averageWpa))

head(rbsfull)

##Running Back Model

The data is split into test and train sets and then plugged into the linear model with all the possible predictor variables to project the points per game for each running back. After calibrating the model to select the features found significant, it has an R-squared of 0.4245 and an RMSE of 4.69065. Not bad for a sport as unpredictable as the NFL. The most useful features were previous year points per game, targets, snaps, and age. Previous year points per game is expected as that is the variable being predicted and is a good base for the model to use. Targets having a high p-value seems to confirm that running backs that are involved in the passing game are likely to have more touches on the ball and more chances for points while running backs that do not receive targets will have less chances for points. Snaps is also related to this as running backs that can catch generally are on the field more plays, which means more chances for points as well. Age is also an expected variable, as football players tend to perform at a lower level at higher ages, their fantasy points also decline which means that younger players are more likely to have more points while older players are more likely to have less points.

set.seed(999)
#splitting test/train data
sample <- sample.int(n = nrow(rbsfull), size = floor(.75*nrow(rbsfull)), replace = F)
rbtrain <- rbsfull[sample, ]
rbtest  <- rbsfull[-sample, ]

#starting with all previous year variables
rbmod <- lm(PPRPG ~ prev_year_PPRPG + prev_year_rank + Age + prev_year_G + prev_year_GS + prev_year_RushingY_A + prev_year_Rushingatt +
                    prev_year_averageEpa + prev_year_totalEpa + prev_year_averageWpa + prev_year_totalWpa + prev_year_RushingTD +
                    prev_year_ReceivingTgt + prev_year_ReceivingY_R + prev_year_ReceivingTD + prev_year_snaps, data = rbtrain)

summary(rbmod)

## 
## Call:
## lm(formula = PPRPG ~ prev_year_PPRPG + prev_year_rank + Age + 
##     prev_year_G + prev_year_GS + prev_year_RushingY_A + prev_year_Rushingatt + 
##     prev_year_averageEpa + prev_year_totalEpa + prev_year_averageWpa + 
##     prev_year_totalWpa + prev_year_RushingTD + prev_year_ReceivingTgt + 
##     prev_year_ReceivingY_R + prev_year_ReceivingTD + prev_year_snaps, 
##     data = rbtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0937 -2.4932 -0.3048  2.2562 13.1233 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.407e+01  5.539e+00   2.539 0.011646 *  
## prev_year_PPRPG         1.870e-01  2.461e-01   0.760 0.448019    
## prev_year_rank          9.199e-04  3.383e-02   0.027 0.978326    
## Age                    -3.218e-01  8.217e-02  -3.916 0.000113 ***
## prev_year_G            -2.134e-01  2.282e-01  -0.935 0.350442    
## prev_year_GS            1.705e-02  7.052e-02   0.242 0.809155    
## prev_year_RushingY_A    3.194e-01  3.757e-01   0.850 0.395965    
## prev_year_Rushingatt    1.235e-02  1.056e-02   1.170 0.242929    
## prev_year_averageEpa    1.214e-01  2.796e+00   0.043 0.965390    
## prev_year_totalEpa      5.118e-02  3.744e-02   1.367 0.172688    
## prev_year_averageWpa   -4.300e+01  1.076e+02  -0.400 0.689725    
## prev_year_totalWpa     -1.043e+00  1.203e+00  -0.867 0.386830    
## prev_year_RushingTD    -3.032e-02  1.384e-01  -0.219 0.826784    
## prev_year_ReceivingTgt  6.504e-02  2.663e-02   2.443 0.015194 *  
## prev_year_ReceivingY_R -5.025e-02  1.015e-01  -0.495 0.620969    
## prev_year_ReceivingTD   2.465e-02  2.430e-01   0.101 0.919263    
## prev_year_snaps         5.665e-03  4.363e-03   1.299 0.195188    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.635 on 278 degrees of freedom
## Multiple R-squared:  0.4496, Adjusted R-squared:  0.4179 
## F-statistic: 14.19 on 16 and 278 DF,  p-value: < 2.2e-16

#trimmed unneeded variables
rbmodtrimmed <- lm(PPRPG ~ prev_year_PPRPG + Age + prev_year_G + prev_year_GS + prev_year_RushingY_A 
                   + prev_year_averageEpa + prev_year_averageWpa + prev_year_ReceivingTgt  + prev_year_snaps, data = rbtrain)

summary(rbmodtrimmed)

## 
## Call:
## lm(formula = PPRPG ~ prev_year_PPRPG + Age + prev_year_G + prev_year_GS + 
##     prev_year_RushingY_A + prev_year_averageEpa + prev_year_averageWpa + 
##     prev_year_ReceivingTgt + prev_year_snaps, data = rbtrain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.975 -2.468 -0.256  2.180 12.834 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             11.450840   2.886543   3.967 9.22e-05 ***
## prev_year_PPRPG          0.339185   0.094395   3.593 0.000385 ***
## Age                     -0.307754   0.079695  -3.862 0.000139 ***
## prev_year_G             -0.090530   0.108811  -0.832 0.406108    
## prev_year_GS             0.053768   0.058522   0.919 0.358998    
## prev_year_RushingY_A     0.379910   0.318858   1.191 0.234460    
## prev_year_averageEpa     1.365826   2.277267   0.600 0.549139    
## prev_year_averageWpa   -89.987317  82.640734  -1.089 0.277119    
## prev_year_ReceivingTgt   0.047919   0.014544   3.295 0.001109 ** 
## prev_year_snaps          0.006621   0.003576   1.852 0.065110 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.614 on 285 degrees of freedom
## Multiple R-squared:  0.4421, Adjusted R-squared:  0.4245 
## F-statistic:  25.1 on 9 and 285 DF,  p-value: < 2.2e-16

rbpred <- predict(rbmodtrimmed, rbtest)

rbresults <- cbind(rbtest, rbpred) %>%
  select(Name, Year, PPRPG, rbpred)

rbrsme <- rbresults %>%
  filter(!is.na(rbpred))

rmse(rbrsme$PPRPG,rbrsme$rbpred)

## [1] 4.69065

head(rbresults)

##Wide Reciever Data

This data is wrangled mostly the same way as the running back data except the NFLScrapR play type is changed to pass and the filtered position is changed to wide receiver.

wrepa <- Epa %>%
  filter(play_type == "pass" & qb_scramble == 0 & !is.na(epa) & !is.na(wpa)) %>% #selects only plays where ball is passed and removes few plays where EPA is not calculated
  group_by(receiver_player_name, Year ) %>%
  summarise(averageEpa = mean(epa), totalEpa = sum(epa), averageWpa = mean(wpa), totalWpa = sum(wpa), snaps = n()) %>%
  arrange(receiver_player_name, Year) 

wrs <- Raw_Data10yrs_v3 %>%
  filter(FantPos == "WR" & PosRank <= 75 & !is.na(PosRank)) %>% #selects top 75 WRs to only select fantasy relevant WRs
  arrange(Name, Year) %>%
  mutate(prev_year_rank = ifelse(lag(Name) != Name, NA, lag(PosRank)), #gathers data from previous years to make prediction using lag function
         prev_year_rankdiff = prev_year_rank - PosRank,
         prev_year_Rushingatt = ifelse(lag(Name) != Name, NA, lag(RushingAtt)),
         prev_year_RushingTD = ifelse(lag(Name) != Name, NA, lag(RushingTD)),
         prev_year_G = ifelse(lag(Name) != Name, NA, lag(G)),
         prev_year_GS = ifelse(lag(Name) != Name, NA, lag(GS)),
         prev_year_PPRPG = ifelse(lag(Name) != Name, NA, lag(PPRPG)),
         prev_year_ReceivingTgt = ifelse(lag(Name) != Name, NA, lag(ReceivingTgt)),
         prev_year_ReceivingY_R = ifelse(lag(Name) != Name, NA, lag(ReceivingY_R)),
         prev_year_ReceivingTD = ifelse(lag(Name) != Name, NA, lag(ReceivingTD)),
  )

#joins fantasy data with EPA, WPA, and snaps per season
wrsfull <- left_join(wrs, wrepa, by = c("joinname" = "receiver_player_name", "Year" = "Year"))

wrsfull <- wrsfull %>%   
  mutate(prev_year_averageEpa = ifelse(lag(Name) != Name, NA, lag(averageEpa)), #gathers data from previous years to make prediction using lag function
         prev_year_totalEpa = ifelse(lag(Name) != Name, NA, lag(totalEpa)),
         prev_year_averageWpa = ifelse(lag(Name) != Name, NA, lag(averageWpa)),
         prev_year_totalWpa = ifelse(lag(Name) != Name, NA, lag(totalWpa)),
         prev_year_snaps = ifelse(lag(Name) != Name, NA, lag(snaps)),
  )

#removes any player's rookie year since we have no stats to predict and players that only played one year
wrsfull <- wrsfull %>% 
  filter(!is.na(prev_year_averageWpa))

##Wide Reciever Model

The wide receiver model after being optimized comes to an R-squared of 0.3696 and a RMSE of 3.320411 The most useful features in this model were previous year points per game and age again, as well as rushing touchdowns, average win probability added, and receiving targets. Rushing touchdowns tells us that receivers that can get touchdowns on the few carries they get are a big indicator of their ability. Win probability added likely means that wide receivers that are or are not making big catches in important moments explains a large portion of their points. Receiving targets is a clear statistic that shows that the amount of opportunity a player has is an important indicator of points.

#splitting test/train data
set.seed(999)
sample <- sample.int(n = nrow(wrsfull), size = floor(.75*nrow(wrsfull)), replace = F)
wrtrain <- wrsfull[sample, ]
wrtest  <- wrsfull[-sample, ]

#starting with all previous year variables
wrmod <- lm(PPRPG ~ prev_year_PPRPG + prev_year_rank + Age + prev_year_G + prev_year_GS + prev_year_Rushingatt + prev_year_RushingTD +
                    prev_year_averageEpa + prev_year_totalEpa + prev_year_averageWpa + prev_year_totalWpa + prev_year_ReceivingTgt +
                    prev_year_ReceivingY_R + prev_year_ReceivingTD + prev_year_snaps, data = wrtrain)

summary(wrmod)

## 
## Call:
## lm(formula = PPRPG ~ prev_year_PPRPG + prev_year_rank + Age + 
##     prev_year_G + prev_year_GS + prev_year_Rushingatt + prev_year_RushingTD + 
##     prev_year_averageEpa + prev_year_totalEpa + prev_year_averageWpa + 
##     prev_year_totalWpa + prev_year_ReceivingTgt + prev_year_ReceivingY_R + 
##     prev_year_ReceivingTD + prev_year_snaps, data = wrtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8701 -2.0795 -0.3897  1.6031 10.3997 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            11.208274   6.067412   1.847  0.06570 .  
## prev_year_PPRPG         0.356176   0.229107   1.555  0.12110    
## prev_year_rank          0.017970   0.029789   0.603  0.54680    
## Age                    -0.212726   0.058815  -3.617  0.00035 ***
## prev_year_G            -0.204413   0.244625  -0.836  0.40404    
## prev_year_GS           -0.023885   0.059562  -0.401  0.68870    
## prev_year_Rushingatt    0.068164   0.052923   1.288  0.19876    
## prev_year_RushingTD    -0.681197   0.688273  -0.990  0.32312    
## prev_year_averageEpa   -2.300185   1.821790  -1.263  0.20773    
## prev_year_totalEpa      0.042994   0.028225   1.523  0.12876    
## prev_year_averageWpa   46.340524  63.323026   0.732  0.46486    
## prev_year_totalWpa     -1.219992   0.846943  -1.440  0.15079    
## prev_year_ReceivingTgt  0.043056   0.020921   2.058  0.04046 *  
## prev_year_ReceivingY_R  0.047638   0.092775   0.513  0.60800    
## prev_year_ReceivingTD   0.107654   0.117291   0.918  0.35945    
## prev_year_snaps        -0.005137   0.005721  -0.898  0.36999    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.055 on 296 degrees of freedom
## Multiple R-squared:  0.3949, Adjusted R-squared:  0.3643 
## F-statistic: 12.88 on 15 and 296 DF,  p-value: < 2.2e-16

#trimmed unneeded variables
wrmodtrimmed <- lm(PPRPG ~ prev_year_PPRPG + prev_year_rank + Age + prev_year_Rushingatt + prev_year_RushingTD + 
                      prev_year_averageWpa + prev_year_totalWpa + prev_year_ReceivingTgt + prev_year_snaps, data = wrtrain)

summary(wrmodtrimmed)

## 
## Call:
## lm(formula = PPRPG ~ prev_year_PPRPG + prev_year_rank + Age + 
##     prev_year_Rushingatt + prev_year_RushingTD + prev_year_averageWpa + 
##     prev_year_totalWpa + prev_year_ReceivingTgt + prev_year_snaps, 
##     data = wrtrain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.333 -2.046 -0.477  1.659 10.317 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.483795   2.266107   4.185 3.74e-05 ***
## prev_year_PPRPG          0.570761   0.088836   6.425 5.13e-10 ***
## prev_year_rank           0.014663   0.017535   0.836   0.4037    
## Age                     -0.228929   0.056426  -4.057 6.33e-05 ***
## prev_year_Rushingatt     0.053623   0.050684   1.058   0.2909    
## prev_year_RushingTD     -0.882819   0.660184  -1.337   0.1822    
## prev_year_averageWpa   -30.830362  25.330869  -1.217   0.2245    
## prev_year_totalWpa      -0.127498   0.391854  -0.325   0.7451    
## prev_year_ReceivingTgt   0.020547   0.011098   1.851   0.0651 .  
## prev_year_snaps         -0.002768   0.005291  -0.523   0.6013    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.042 on 302 degrees of freedom
## Multiple R-squared:  0.3878, Adjusted R-squared:  0.3696 
## F-statistic: 21.26 on 9 and 302 DF,  p-value: < 2.2e-16

wrpred <- predict(wrmodtrimmed, wrtest)

wrresults <- cbind(wrtest, wrpred) %>%
  select(Name, Year, PPRPG, wrpred)

wrrsme <- wrresults %>%
  filter(!is.na(wrpred))

rmse(wrrsme$PPRPG,wrrsme$wrpred)

## [1] 3.320411

head(wrresults)

##Quarterback Data

This data for quarterbacks is also wrangled mostly the same way as the running backs and wide receivers but with the filtered position as Quarterback.

#qb
qbepa <- Epa %>%
  filter(play_type == "pass" & !is.na(epa) & !is.na(wpa)) %>% #this time qb scrambles do not need to be removed 
  group_by(passer_player_name, Year ) %>%
  summarise(averageEpa = mean(epa), totalEpa = sum(epa), averageWpa = mean(wpa), totalWpa = sum(wpa), snaps = n()) %>%
  arrange(passer_player_name, Year) 

qbs <- Raw_Data10yrs_v3 %>%
  filter(FantPos == "QB" & PosRank <= 75 & !is.na(PosRank)) %>% #selects top 75 Qbs to only select fantasy relevant QBs
  arrange(Name, Year) %>%
  mutate(prev_year_rank = ifelse(lag(Name) != Name, NA, lag(PosRank)), #gathers data from previous years to make prediction using lag function
         prev_year_rankdiff = prev_year_rank - PosRank,
         prev_year_Rushingatt = ifelse(lag(Name) != Name, NA, lag(RushingAtt)),
         prev_year_RushingTD = ifelse(lag(Name) != Name, NA, lag(RushingTD)),
         prev_year_G = ifelse(lag(Name) != Name, NA, lag(G)),
         prev_year_GS = ifelse(lag(Name) != Name, NA, lag(GS)),
         prev_year_PPRPG = ifelse(lag(Name) != Name, NA, lag(PPRPG)),
         prev_year_PassingCmp = ifelse(lag(Name) != Name, NA, lag(PassingCmp)),
         prev_year_PassingAtt = ifelse(lag(Name) != Name, NA, lag(PassingAtt)),
         prev_year_PassingYds = ifelse(lag(Name) != Name, NA, lag(PassingYds)),
         prev_year_PassingTD = ifelse(lag(Name) != Name, NA, lag(PassingTD))
  )

#joins fantasy data with EPA, WPA, and snaps per season
qbsfull <- left_join(qbs, qbepa, by = c("joinname" = "passer_player_name", "Year" = "Year"))

qbsfull <- qbsfull %>%   
  mutate(prev_year_averageEpa = ifelse(lag(Name) != Name, NA, lag(averageEpa)), #gathers data from previous years to make prediction using lag function
         prev_year_totalEpa = ifelse(lag(Name) != Name, NA, lag(totalEpa)),
         prev_year_averageWpa = ifelse(lag(Name) != Name, NA, lag(averageWpa)),
         prev_year_totalWpa = ifelse(lag(Name) != Name, NA, lag(totalWpa)),
         prev_year_snaps = ifelse(lag(Name) != Name, NA, lag(snaps)),
  )

#removes any player's rookie year since we have no stats to predict and players that only played one year
qbsfull <- qbsfull %>% 
  filter(!is.na(prev_year_averageWpa))

##Quarterback Model

The quarterback model after being optimized comes to an R-squared of 0.5002 and a RMSE of 4.369192. The most useful features in this model were again previous year points per game and age, along with passing touchdowns, rushing attempts, and passing completions. Passing touchdowns being found useful in the quarterback model while rushing touchdowns and receiving touchdowns were not found significant in the running back and wide receiver models is interesting. It seems to suggest that passing touchdowns are more consistent for quarterbacks. Rushing attempts likely means that quarterbacks that rush have more chances to earn points with the ball as they can run when a play does not work while quarterbacks unwilling to rush will absorb more negative plays when a play does not work. Passing completions is not a very exciting feature but likely indicates that a quarterback’s accuracy is important to their score.

#splitting test/train data
set.seed(999)
sample <- sample.int(n = nrow(qbsfull), size = floor(.75*nrow(qbsfull)), replace = F)
qbtrain <- qbsfull[sample, ]
qbtest  <- qbsfull[-sample, ]

#starting with all previous year variables
qbmod <- lm(PPRPG ~ prev_year_PPRPG + prev_year_rank + Age + prev_year_G + prev_year_GS + prev_year_Rushingatt + prev_year_RushingTD +
             prev_year_averageEpa + prev_year_totalEpa + prev_year_averageWpa + prev_year_totalWpa + prev_year_PassingCmp +
           prev_year_PassingAtt + prev_year_PassingYds + prev_year_PassingTD + prev_year_snaps, data = qbtrain)

summary(qbmod)

## 
## Call:
## lm(formula = PPRPG ~ prev_year_PPRPG + prev_year_rank + Age + 
##     prev_year_G + prev_year_GS + prev_year_Rushingatt + prev_year_RushingTD + 
##     prev_year_averageEpa + prev_year_totalEpa + prev_year_averageWpa + 
##     prev_year_totalWpa + prev_year_PassingCmp + prev_year_PassingAtt + 
##     prev_year_PassingYds + prev_year_PassingTD + prev_year_snaps, 
##     data = qbtrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9677  -3.0213  -0.2222   2.8507  20.8530 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          14.6495790  4.5843081   3.196  0.00155 **
## prev_year_PPRPG       0.0507074  0.1318409   0.385  0.70080   
## prev_year_rank       -0.0593470  0.0628868  -0.944  0.34609   
## Age                  -0.2038066  0.0760849  -2.679  0.00781 **
## prev_year_G          -0.1233547  0.1739066  -0.709  0.47869   
## prev_year_GS         -0.4618473  0.2620667  -1.762  0.07905 . 
## prev_year_Rushingatt  0.0410837  0.0229455   1.790  0.07440 . 
## prev_year_RushingTD   0.1660550  0.2589205   0.641  0.52180   
## prev_year_averageEpa -0.6435303  1.0990447  -0.586  0.55863   
## prev_year_totalEpa    0.0008555  0.0167005   0.051  0.95918   
## prev_year_averageWpa 27.1378310 45.0844451   0.602  0.54768   
## prev_year_totalWpa   -0.5091855  0.5184374  -0.982  0.32683   
## prev_year_PassingCmp  0.0147158  0.0242650   0.606  0.54467   
## prev_year_PassingAtt -0.0126857  0.0175422  -0.723  0.47016   
## prev_year_PassingYds  0.0027353  0.0018089   1.512  0.13158   
## prev_year_PassingTD   0.1300665  0.0854595   1.522  0.12909   
## prev_year_snaps       0.0052379  0.0034488   1.519  0.12989   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.598 on 295 degrees of freedom
## Multiple R-squared:  0.5253, Adjusted R-squared:  0.4996 
## F-statistic: 20.41 on 16 and 295 DF,  p-value: < 2.2e-16

#trimmed unneeded variables
qbmodtrimmed <- lm(PPRPG ~ prev_year_PPRPG + Age + prev_year_Rushingatt + prev_year_RushingTD + prev_year_totalWpa +
                     prev_year_PassingCmp + prev_year_PassingAtt  + prev_year_PassingTD + prev_year_snaps, data = qbtrain)

summary(qbmodtrimmed)

## 
## Call:
## lm(formula = PPRPG ~ prev_year_PPRPG + Age + prev_year_Rushingatt + 
##     prev_year_RushingTD + prev_year_totalWpa + prev_year_PassingCmp + 
##     prev_year_PassingAtt + prev_year_PassingTD + prev_year_snaps, 
##     data = qbtrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5600  -3.0584  -0.1121   2.7716  20.2828 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.213773   2.279730   4.480 1.06e-05 ***
## prev_year_PPRPG       0.174539   0.085536   2.041   0.0422 *  
## Age                  -0.194646   0.075176  -2.589   0.0101 *  
## prev_year_Rushingatt  0.023024   0.020274   1.136   0.2570    
## prev_year_RushingTD   0.249549   0.254617   0.980   0.3278    
## prev_year_totalWpa   -0.225112   0.280148  -0.804   0.4223    
## prev_year_PassingCmp  0.026884   0.021945   1.225   0.2215    
## prev_year_PassingAtt -0.013768   0.013899  -0.991   0.3227    
## prev_year_PassingTD   0.164397   0.077647   2.117   0.0351 *  
## prev_year_snaps       0.003282   0.003321   0.988   0.3238    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.595 on 302 degrees of freedom
## Multiple R-squared:  0.5147, Adjusted R-squared:  0.5002 
## F-statistic: 35.59 on 9 and 302 DF,  p-value: < 2.2e-16

qbpred <- predict(qbmodtrimmed, qbtest)

qbresults <- cbind(qbtest, qbpred) %>%
  select(Name, Year, PPRPG, qbpred)

qbrsme <- qbresults %>%
  filter(!is.na(qbpred))

rmse(qbrsme$PPRPG,qbrsme$qbpred)

## [1] 4.369192

head(qbresults)

##Results

The accuracy of the models is not perfect but I believe it will still be useful in predicting which players will have good seasons based on their previous season production. The models are able to explain 35 to 50 percent of the variance in points per game which make them a useful tool considering how many events not captured in this data can have an effect on a player’s points per game. This model will have a strong impact when combined with other information like team changes, coaching changes, and team personal changes in the offseason.