Abstract

This research was devised to give the reader insight into which offensive statistics are most significant in relation to the total offensive points scored by a team. Those statistics are then used as predictors in a linear regression model built to project teams' offensive point output in the years to come.

Upon creating a model that was not only accurate but also simple enough (using a reasonable number of significant statistics), we checked that it met all 5 linear regression assumptions. With the assumptions addressed, the model was ready for prediction. The 8 significant variables included in it were: TotYds (total yards of offense), FL (fumbles lost), PTD (passing touchdowns), YPA (yards per pass attempt), P1stD (passing first downs), RuTD (rushing touchdowns), Ru1stD (rushing first downs), and Pen (penalties). In context it makes sense that these are among the most significant variables: they are all critical statistics that signal either a juggernaut offense that can keep the chains moving or a weak offense that struggles to get first downs and constantly turns the ball over.

With this 8-variable model we constructed a function that, given values for each of these significant variables, produces a rough projection of the average points scored per team per game as well as the combined points scored per game.

Introduction

Problem

Points per game in the NFL have been on a steady upward trend essentially since the league's inception. While it's easy to say that "players have just become more skilled than they were back then," and there is certainly some truth to that statement, there's definitely much more to it. Therefore we'll take a deep dive into NFL offensive averages per game by season to uncover which statistics have played the biggest role in putting more points up on the scoreboard for us viewers.

Focus

To dive into this data we'll be using Pro Football Reference's Team Offense League Averages Per Team Game dataset, restricted to the 1949-2020 seasons for completeness of recorded statistics. The statistics range from total combined offensive stats per game to passing/rushing splits. Additionally, statistics regarding turnovers and penalties are taken into account, so there will be no statistical stone left unturned throughout this analysis.

Methodology

To analyze this data we'll use a combination of visualization and statistical techniques. We'll plot the primary statistics against average points per game to see whether they show any distinct positive or negative correlation. Additionally, we'll run a regression on the full dataset and then create and test several other fits/models to see whether we can construct one that combines predictive accuracy with simplicity in the number of variables used. The variables of our dataset, along with descriptions, are listed below.

| Variable Name | Variable Type | Variable Description |
|---------------|---------------|----------------------------------------------|
| PtsFor  | Numeric | Points For (per team) |
| TotYds  | Numeric | Total Yards (Rush + Receiving) |
| Ply     | Numeric | Total Plays (per team) |
| YPP     | Numeric | Yards Per Play |
| TO      | Numeric | Turnovers |
| FL      | Numeric | Fumbles Lost |
| Tot1stD | Numeric | Total First Downs |
| Cmp     | Numeric | Completions |
| Att     | Numeric | Pass Attempts |
| PYds    | Numeric | Pass Yards |
| PTD     | Numeric | Pass Touchdowns |
| Int     | Numeric | Interceptions |
| YPA     | Numeric | Yards Per Attempt (Passing) |
| P1stD   | Numeric | Passing First Downs |
| RuAtt   | Numeric | Rush Attempts |
| RuYds   | Numeric | Rush Yards |
| RuTD    | Numeric | Rush Touchdowns |
| YPC     | Numeric | Yards Per Carry |
| Ru1stD  | Numeric | Rushing First Downs |
| Pen     | Numeric | Penalties Committed (accepted by opposition) |
| PenYds  | Numeric | Penalty Yardage |
| 1stPy   | Numeric | Penalties Resulting in First Downs |

Data Cleaning

library(readxl)
NFL <- read_xlsx("NFLFinalProj.xlsx", col_names = TRUE, na = "NA")

head(NFL, 50)

To begin, we'll remove columns 1, 2, and 3, which correspond to rank, year, and number of teams in the league. Rank simply orders the years, so it adds nothing here. Since we are analyzing which statistics correspond to more points being scored on average, the year itself carries no football significance either. Finally, although the number of teams in the league has expanded over the years, we are analyzing league-wide per-team averages, so as long as all teams are included the total number isn't of much concern.

clean1 <- NFL[,-c(1:3)]

head(clean1, 50)

Next we'll omit columns 23-28, which cover drive length, scoring percentage, turnover percentage, average number of plays per drive, average yardage per drive, and average points per drive. These statistics have only been recorded since roughly 2000, and given that the data goes back to 1922 we'd lose a large share of the seasons in our sample if we kept them.

clean2 <- clean1[,-c(23:28)]

head(clean2, 50)

To continue, we'll run na.omit on the remaining data to drop any rows that still contain NA values.

FinalNFL <- na.omit(clean2)
head(FinalNFL, 50)

Now that our data is fully cleaned, it consists of the NFL Team Offense League Averages Per Team Game for the 1949-2020 seasons; the final cleaning step removed the years 1922-1948 due to their missing statistics.

Nonetheless, we now have 72 years of NFL average offensive statistics with which to find out which variables contribute most to a team putting up points.
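
As a quick sanity check (a small sketch, not part of the original write-up), we can confirm how many seasons and variables remain after cleaning:

dim(FinalNFL) # expect 72 rows (the 1949-2020 seasons) and the 22 variables listed above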

Packages Used

library(readxl) #Used to read data in via excel file
library(ggplot2) # Used to help create and organize both box plots and scatter plots
library(reshape2) # Used to plot and fit boxplots of all variables appropriately
library(leaps) # Used to help determine model's most significant variables
library(car) # Used as a part of the prediction function created

Exploratory Data Analysis

library(ggplot2)
library(reshape2)

NFLPoints <- melt(data = FinalNFL)
## No id variables; using all as measure variables
ggplot(data = NFLPoints, aes(x=variable, y=value)) + geom_boxplot() +facet_wrap(~variable, scales = "free")

As we can see, all of the variables appear roughly symmetric and approximately normally distributed, with only a few containing an outlier or two.

ggplot(FinalNFL, aes(x=TotYds, y=PtsFor)) + geom_point(color="green")

ggplot(FinalNFL, aes(x=PTD, y=PtsFor)) + geom_point(color="blue")

ggplot(FinalNFL, aes(x=Pen, y=PtsFor)) + geom_point(color="red")

ggplot(FinalNFL, aes(x=YPA, y=PtsFor)) + geom_point(color="cyan")

It wouldn't be feasible to show a scatter plot for all 21 predictors; the plots would either occupy too much space or be too small to read. However, these four variables chosen from the dataset all follow a clear linear trend with PtsFor, and the remaining variables behave similarly, so we can reasonably attempt to fit this data with a linear regression model.
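
As a compact alternative to plotting every variable (a small sketch, assuming the cleaned FinalNFL data frame above), we can rank each statistic by its linear correlation with PtsFor:

# Pearson correlation of every statistic with points scored, strongest first
round(sort(cor(FinalNFL)[, "PtsFor"], decreasing = TRUE), 2)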

Creating the Model

To begin we’ll run a regression on our data using all 21 variables.

options(scipen=4)
MLR <- lm(formula = PtsFor ~ ., data=FinalNFL)
summary(MLR)
## 
## Call:
## lm(formula = PtsFor ~ ., data = FinalNFL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77090 -0.22923  0.02343  0.20991  0.93498 
## 
## Coefficients:
##               Estimate Std. Error t value     Pr(>|t|)    
## (Intercept) 31.8203560 14.6514728   2.172       0.0346 *  
## TotYds       0.2695916  0.9698411   0.278       0.7822    
## Ply          0.1428058  0.3893455   0.367       0.7153    
## YPP          0.0002126  1.9739941   0.000       0.9999    
## TO           1.5333739  1.6149700   0.949       0.3469    
## FL          -3.3832308  1.5835883  -2.136       0.0376 *  
## Tot1stD     -0.4237103  1.1353244  -0.373       0.7106    
## Cmp          0.0606371  0.2159394   0.281       0.7800    
## Att         -0.2042333  0.3119737  -0.655       0.5157    
## PYds        -0.1813786  0.9762030  -0.186       0.8534    
## PTD          5.6853297  0.8715506   6.523 0.0000000334 ***
## Int         -1.7924614  1.5883578  -1.128       0.2645    
## YPA         -2.0366638  1.6082082  -1.266       0.2112    
## P1stD       -0.6570512  1.1667418  -0.563       0.5758    
## RuAtt       -0.5731412  0.4728462  -1.212       0.2312    
## RuYds       -0.1151497  0.9710158  -0.119       0.9061    
## RuTD         4.5086568  1.0407345   4.332 0.0000711003 ***
## YPC         -3.2024224  2.0073370  -1.595       0.1169    
## Ru1stD      -0.2422972  1.1145780  -0.217       0.8288    
## Pen         -0.3169721  0.2332310  -1.359       0.1802    
## PenYds      -0.0231319  0.0318336  -0.727       0.4708    
## `1stPy`      0.6263404  1.2596146   0.497       0.6212    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4214 on 50 degrees of freedom
## Multiple R-squared:  0.9329, Adjusted R-squared:  0.9047 
## F-statistic:  33.1 on 21 and 50 DF,  p-value: < 2.2e-16

Although this model provides accurate predictions, as evidenced by the R-squared/adjusted R-squared values, 21 variables is a rather large number for one regression model. Thus we'll run a stepwise selection on the data to see if we can find a model that maintains a similar level of accuracy while using fewer significant variables.
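
For reference, the stepwise search summarized below (its intermediate output is suppressed for brevity; the call also appears in the Full Code section) was run as:

step <- step(MLR, direction = "both", test="F")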

summary(step)
## 
## Call:
## lm(formula = PtsFor ~ TotYds + FL + Tot1stD + PYds + PTD + YPA + 
##     RuAtt + RuTD + YPC + Pen + `1stPy`, data = FinalNFL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74547 -0.25041  0.04895  0.22151  1.05136 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.09525    7.42370   3.785 0.000358 ***
## TotYds       0.15007    0.06230   2.409 0.019092 *  
## FL          -1.74502    0.57895  -3.014 0.003773 ** 
## Tot1stD     -0.74372    0.17363  -4.283 6.75e-05 ***
## PYds        -0.07923    0.05792  -1.368 0.176440    
## PTD          5.43341    0.61023   8.904 1.43e-12 ***
## YPA         -1.81473    0.38195  -4.751 1.30e-05 ***
## RuAtt       -0.40866    0.23109  -1.768 0.082074 .  
## RuTD         4.55816    0.80361   5.672 4.30e-07 ***
## YPC         -2.77170    1.73684  -1.596 0.115781    
## Pen         -0.43326    0.13696  -3.163 0.002447 ** 
## `1stPy`      0.94169    0.45796   2.056 0.044112 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3965 on 60 degrees of freedom
## Multiple R-squared:  0.9287, Adjusted R-squared:  0.9157 
## F-statistic: 71.07 on 11 and 60 DF,  p-value: < 2.2e-16

The stepwise selection provided a nearly identical model accuracy-wise while using only 11 of the 21 variables. For simplicity's sake we will keep testing to see if we can find a model that uses even fewer variables; however, reducing the number of variables must not come at the expense of model accuracy.

For the next three models, I chose the variables myself, drawing on football knowledge accumulated over many years of watching, to see whether I could build an even more accurate model.

randlm1 <- lm(formula = PtsFor ~ TotYds+Ply+TO+Att+PTD+RuTD+Pen+PenYds , data=FinalNFL)
summary(randlm1)
## 
## Call:
## lm(formula = PtsFor ~ TotYds + Ply + TO + Att + PTD + RuTD + 
##     Pen + PenYds, data = FinalNFL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20473 -0.32844  0.01517  0.27351  1.20777 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.983787   2.628269   2.277   0.0262 *  
## TotYds       0.002304   0.009925   0.232   0.8172    
## Ply          0.005953   0.064663   0.092   0.9269    
## TO          -0.415252   0.263968  -1.573   0.1207    
## Att          0.126701   0.057660   2.197   0.0317 *  
## PTD          5.884595   0.773575   7.607 1.74e-10 ***
## RuTD         4.124290   0.951266   4.336 5.34e-05 ***
## Pen         -0.524228   0.247492  -2.118   0.0381 *  
## PenYds       0.048013   0.026735   1.796   0.0773 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4777 on 63 degrees of freedom
## Multiple R-squared:  0.8913, Adjusted R-squared:  0.8776 
## F-statistic:  64.6 on 8 and 63 DF,  p-value: < 2.2e-16
randlm2 <- lm(formula = PtsFor ~ TotYds+Ply+YPP+Tot1stD+Cmp+YPA+YPC , data=FinalNFL)
summary(randlm2)
## 
## Call:
## lm(formula = PtsFor ~ TotYds + Ply + YPP + Tot1stD + Cmp + YPA + 
##     YPC, data = FinalNFL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52911 -0.47881 -0.06348  0.29589  2.24158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 32.63530   19.33200   1.688 0.096250 .  
## TotYds       0.22093    0.06476   3.411 0.001124 ** 
## Ply         -0.51053    0.30589  -1.669 0.100001    
## YPP          5.87291    3.12933   1.877 0.065116 .  
## Tot1stD     -0.70463    0.27426  -2.569 0.012533 *  
## Cmp         -0.55277    0.14331  -3.857 0.000269 ***
## YPA         -6.33532    1.88559  -3.360 0.001318 ** 
## YPC         -4.75050    1.74464  -2.723 0.008330 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7843 on 64 degrees of freedom
## Multiple R-squared:  0.7025, Adjusted R-squared:   0.67 
## F-statistic: 21.59 on 7 and 64 DF,  p-value: 1.205e-14
randlm3 <- lm(formula = PtsFor ~ TotYds+Ply+PTD+RuTD+FL+Int+Pen , data=FinalNFL)
summary(randlm3)
## 
## Call:
## lm(formula = PtsFor ~ TotYds + Ply + PTD + RuTD + FL + Int + 
##     Pen, data = FinalNFL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39555 -0.25000  0.02328  0.22411  1.12568 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.018159   2.370464   2.961  0.00430 ** 
## TotYds       0.003016   0.008361   0.361  0.71954    
## Ply          0.052600   0.062418   0.843  0.40254    
## PTD          6.796104   0.625071  10.873 3.52e-16 ***
## RuTD         3.361831   0.762273   4.410 4.04e-05 ***
## FL          -1.839797   0.635373  -2.896  0.00517 ** 
## Int          0.222068   0.477314   0.465  0.64333    
## Pen         -0.169123   0.144473  -1.171  0.24609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4762 on 64 degrees of freedom
## Multiple R-squared:  0.8903, Adjusted R-squared:  0.8783 
## F-statistic: 74.21 on 7 and 64 DF,  p-value: < 2.2e-16

Unfortunately, all three of the models I picked by hand had noticeably lower R-squared values than the first two models. Thus we must continue searching for the ideal combination of variables.

Using the leaps package, we'll generate a handful of additional candidate models and decide whether any of them offers a good enough combination of accuracy and simplicity to use for our prediction model.

library(leaps)
best <- regsubsets(PtsFor~., data=FinalNFL)
plot(best, scale="adjr2", main="Best-fit: Adjusted R square")

best.sum<- summary(best)
out.table <- data.frame(best.sum$outmat, best.sum$adjr2, best.sum$rss, best.sum$cp, best.sum$bic)
out.table

Looking through each of these 8 candidate models, we see that fit8 is the best of the group: its adjusted R-squared is similar to our first two models while using only 8 of the 21 variables. We will therefore fit and summarize this model to confirm it is adequate before checking the regression assumptions.
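
If you would rather read the chosen subset directly off the regsubsets object than off the plot, leaps can also report the coefficients of the best eight-variable fit; here is a small sketch using the best object created above:

coef(best, 8) # coefficients of the best 8-predictor subset identified by regsubsets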

BestModel <- lm(PtsFor ~ TotYds+FL+PTD+YPA+P1stD+RuTD+Ru1stD+Pen, data = FinalNFL)
summary(BestModel)
## 
## Call:
## lm(formula = PtsFor ~ TotYds + FL + PTD + YPA + P1stD + RuTD + 
##     Ru1stD + Pen, data = FinalNFL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84121 -0.28561  0.03454  0.27031  0.94456 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.02499    1.96410   7.650 1.46e-10 ***
## TotYds       0.05313    0.01235   4.301 6.03e-05 ***
## FL          -1.64012    0.42781  -3.834 0.000294 ***
## PTD          5.81081    0.54924  10.580 1.33e-15 ***
## YPA         -1.51224    0.36113  -4.188 8.93e-05 ***
## P1stD       -0.44992    0.16037  -2.805 0.006675 ** 
## RuTD         4.40183    0.91018   4.836 8.88e-06 ***
## Ru1stD      -0.78746    0.19202  -4.101 0.000120 ***
## Pen         -0.39789    0.11611  -3.427 0.001080 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4023 on 63 degrees of freedom
## Multiple R-squared:  0.9229, Adjusted R-squared:  0.9131 
## F-statistic: 94.31 on 8 and 63 DF,  p-value: < 2.2e-16

Reading the summary, everything about fit8, now titled BestModel, checks out: every predictor is statistically significant and the adjusted R-squared remains above 0.91. It is now time to verify that the model meets all of the regression assumptions before we go forward with building a prediction function.

plot(BestModel)

Meeting Assumptions

We now must make sure that our model meets all 5 assumptions of multiple linear regression.

  • Assumption 1 - the probability distribution of the random errors (epsilon's) has a mean of 0.
    • There is no direct way to check this condition; the least squares calculations ensure it holds for the sample, so we assume it is reasonable for the population as well.
  • Assumption 2 - the probability distribution of the random errors (epsilon's) has a constant variance.
    • The residual versus fitted plot shows a random scatter except for a few data values. It may be that our model does not fit these extreme values well; we may want to consider splitting the dataset into two populations and fitting a model to each.
  • Assumption 3: the probability distribution of the random errors (epsilon's) follows a normal distribution.
    • The normal probability plot of the residuals follows a relatively linear path, so we do not have enough evidence to say the random errors are not normally distributed.
  • Assumption 4: the random errors (epsilon) are independent (little to no autocorrelation).
library(car)
## Loading required package: carData
dwt <- durbinWatsonTest(BestModel)
dwt
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1851193      1.612345   0.032
##  Alternative hypothesis: rho != 0

The Durbin-Watson statistic (1.61) is reasonably close to 2, but the reported p-value (0.032) is below the usual 0.05 level, so there is some evidence of mild positive autocorrelation (the estimated lag-1 autocorrelation is only 0.19). Because that autocorrelation is small, we treat the independence assumption as approximately satisfied and proceed, while noting the caveat.

  • Assumption 5: Little to no multicollinearity.
    • No pair of predictors shares a correlation coefficient over 0.8, and no VIF value exceeds 5, so there is not sufficient evidence of multicollinearity (a sketch of how these checks can be run is shown below).
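
The correlation and VIF checks are not printed elsewhere in this report, so here is a minimal sketch of how they could be run, assuming BestModel and FinalNFL as defined above:

library(car)
vif(BestModel) # variance inflation factors; values above 5 would flag collinearity
preds <- c("TotYds", "FL", "PTD", "YPA", "P1stD", "RuTD", "Ru1stD", "Pen")
round(cor(FinalNFL[, preds]), 2) # pairwise correlations among the 8 predictors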

We see that our chosen model has met all of the assumptions and thus is ready to be used for prediction! The function created below takes as inputs the 8 significant variables/parameters used by this fit. It feeds those values into the fitted model with predict(), and the output reports both the projected average points per team per game in the NFL and the projected combined score per game if those values were the league's season averages.

library(car)
PredPtsFor <- function(TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, Pen){
  inputs <- data.frame(TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, Pen)
  PointsPred <- predict(BestModel, newdata = inputs)
  PointsForNew <- as.integer(PointsPred)
  
  cat("Teams are projected to score on average",PointsForNew, "points per game this season.\n" )
  
  cat("Games are projected to have a total score of",2*PointsForNew, "points this season." )
}
PredPtsFor(300, .9, 1.6, 6.3, 13, 1.2, 7, 4.8)
## Teams are projected to score on average 21 points per game this season.
## Games are projected to have a total score of 42 points this season.
PredPtsFor(360, .6, 1.9, 7.1, 15, 1.4, 9, 5.5)
## Teams are projected to score on average 23 points per game this season.
## Games are projected to have a total score of 46 points this season.
PredPtsFor(347, .8, 2.1, 6.8, 14, 1.4, 10, 4)
## Teams are projected to score on average 24 points per game this season.
## Games are projected to have a total score of 48 points this season.

Summary/Conclusion

Findings

To summarize, we took a set of league-wide NFL offensive statistics from the years 1922-2020 and cleaned it down to the seasons spanning 1949-2020. With this cleaned dataset we set out to determine which offensive statistics are most strongly associated with the gradual year-over-year increase in points scored per game in the NFL. To do so we created and tested numerous linear regression models until we arrived at one that balanced accuracy with simplicity. The significant variables in this model were TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, and Pen. Before using the model for prediction, the regression assumptions were checked and judged adequate, allowing us to proceed. With all of this in mind, the model should give you a rough idea of what the NFL's offensive landscape will look like in upcoming seasons given the statistic values that are entered.

This research isn't necessarily groundbreaking or world changing, but it certainly has a niche: coaches or analysts could use a model like this to determine which areas a team needs to shore up in order to put more points on the board, and a casual fan could plug in their favorite team's statistics to see how many points that team, or its opponent, is projected to score in a given week.

Where to Look Next?

The next, and most realistic, step to take with this data would be to run smaller-sample regressions using some of the more advanced statistics introduced in the late 1990s, such as drive duration, average number of plays per drive, and average yardage per drive. If those regressions consistently show over the next few years that any of these statistics carry a great deal of significance, they could be added to our current prediction model to enhance its accuracy even further.
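
As a rough illustration of that next step (a sketch only, assuming PtsFor is the first column of clean1 as the variable list suggests, and that columns 23-28 of clean1 are the drive-level statistics dropped earlier), one could restrict to the seasons where those statistics are recorded and fit a smaller regression:

# Sketch: refit on the drive-level statistics recorded since roughly 2000
DriveNFL <- na.omit(clean1[, c(1, 23:28)]) # PtsFor plus the six drive columns dropped earlier
DriveModel <- lm(PtsFor ~ ., data = DriveNFL)
summary(DriveModel)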

Personal Information

James Weinblatt is a senior Mathematical Science major here at Clemson University. He'll be receiving his Bachelor of Science degree next week at graduation and will then continue his search for an employment opportunity back home in New York.

This project combines two of James's favorite things: sports and analyzing data. He plans to show this project to prospective employers to demonstrate his coding and analysis skills, in hopes of receiving an entry-level position at a company.

Additionally James hopes that anyone else who happens to be looking at this project enjoys its contents and is able to learn something they find valuable within it.

Full Code

library(readxl)
NFL <- read_xlsx("NFLFinalProj.xlsx", col_names = TRUE, na = "NA")
#Reading data into R
head(NFL, 50)

clean1 <- NFL[,-c(1:3)]

head(clean1, 50)

clean2 <- clean1[,-c(23:28)]

head(clean2, 50)

FinalNFL <- na.omit(clean2)
head(FinalNFL, 50)
# 3 steps of data cleaning to get the final dataset to be used

library(readxl) #Used to read data in via excel file
library(ggplot2) # Used to help create and organize both box plots and scatter plots
library(reshape2) # Used to plot and fit boxplots of all variables appropriately
library(leaps) # Used to help determine model's most significant variables
library(car) # Used as a part of the prediction function created

NFLPoints <- melt(data = FinalNFL)

ggplot(data = NFLPoints, aes(x=variable, y=value)) + geom_boxplot() +facet_wrap(~variable, scales = "free")
#Creates individual boxplots of each variable/statistic

ggplot(FinalNFL, aes(x=TotYds, y=PtsFor)) + geom_point(color="green")
ggplot(FinalNFL, aes(x=PTD, y=PtsFor)) + geom_point(color="blue")
ggplot(FinalNFL, aes(x=Pen, y=PtsFor)) + geom_point(color="red")
ggplot(FinalNFL, aes(x=YPA, y=PtsFor)) + geom_point(color="cyan")

# scatter plots of a few statistics to illustrate that the variables in the dataset follow a linear trend and are appropriate for a linear model

options(scipen=4)
MLR <- lm(formula = PtsFor ~ ., data=FinalNFL)
summary(MLR)
#linear regression with all variables included

step <- step(MLR, direction = "both", test="F")
summary(step)
#step test to try and drop less significant variables from the model

randlm1 <- lm(formula = PtsFor ~ TotYds+Ply+TO+Att+PTD+RuTD+Pen+PenYds , data=FinalNFL)
summary(randlm1)

randlm2 <- lm(formula = PtsFor ~ TotYds+Ply+YPP+Tot1stD+Cmp+YPA+YPC , data=FinalNFL)
summary(randlm2)

randlm3 <- lm(formula = PtsFor ~ TotYds+Ply+PTD+RuTD+FL+Int+Pen , data=FinalNFL)
summary(randlm3)
# 3 models chosen by hand, using my own "football knowledge", to be tested

best <- regsubsets(PtsFor~., data=FinalNFL)
plot(best, scale="adjr2", main="Best-fit: Adjusted R square")

best.sum<- summary(best)
out.table <- data.frame(best.sum$outmat, best.sum$adjr2, best.sum$rss, best.sum$cp, best.sum$bic)
out.table
# 8 additional models created to see if any are as accurate yet simpler than the previous ones

BestModel <- lm(PtsFor ~ TotYds+FL+PTD+YPA+P1stD+RuTD+Ru1stD+Pen, data = FinalNFL)
summary(BestModel)

plot(BestModel)
# graphs that help with verifying MLR assumptions 1-3

dwt <- durbinWatsonTest(BestModel)
dwt
# test to check assumption 4 of MLR

PredPtsFor <- function(TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, Pen){
  inputs <- data.frame(TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, Pen)
  PointsPred <- predict(BestModel, newdata = inputs)
  PointsForNew <- as.integer(PointsPred)
  
  cat("Teams are projected to score on average",PointsForNew, "points per game this season.\n" )
  
  cat("Games are projected to have a total score of",2*PointsForNew, "points this season." )
  
  # function that uses the fitted model (BestModel) to predict average points per team per game and the combined game total from the 8 input statistics
}

PredPtsFor(300, .9, 1.6, 6.3, 13, 1.2, 7, 4.8)

PredPtsFor(360, .6, 1.9, 7.1, 15, 1.4, 9, 5.5)

PredPtsFor(347, .8, 2.1, 6.8, 14, 1.4, 10, 4)
---
title: "NFL Scoring History: What Affects It?"
author: "James Weinblatt"
date: "4/22/2021"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    number_sections: FALSE
    df_print: paged
    theme: united
    code_download: TRUE
---
![](NFLProj.png){width=50% align=center}

## <span style="color:blue;"> Abstract </span>

This research was devised to give the reader insight into which offensive statistics are most significant in relation to the total offensive points scored by a team. Those statistics are then used as predictors in a linear regression model built to project teams' offensive point output in the years to come.

Upon creating a model that was not only accurate but also simple enough (using a reasonable number of significant statistics), we checked that it met all 5 linear regression assumptions. With the assumptions addressed, the model was ready for prediction. The 8 significant variables included in it were: TotYds (total yards of offense), FL (fumbles lost), PTD (passing touchdowns), YPA (yards per pass attempt), P1stD (passing first downs), RuTD (rushing touchdowns), Ru1stD (rushing first downs), and Pen (penalties). In context it makes sense that these are among the most significant variables: they are all critical statistics that signal either a juggernaut offense that can keep the chains moving or a weak offense that struggles to get first downs and constantly turns the ball over.

With this 8-variable model we constructed a function that, given values for each of these significant variables, produces a rough projection of the average points scored per team per game as well as the combined points scored per game.


## <span style="color:blue;"> Introduction </span>


<font size ="5" color ="blue">Problem</font>

Points per game in the NFL have been on a steady upward trend essentially since the league's inception. While it's easy to say that "players have just become more skilled than they were back then," and there is certainly some truth to that statement, there's definitely much more to it. Therefore we'll take a deep dive into NFL offensive averages per game by season to uncover which statistics have played the biggest role in putting more points up on the scoreboard for us viewers.

<font size ="5" color ="blue">Focus</font>

To dive into this data we'll be using Pro Football Reference's [Team Offense League Averages Per Team Game](https://www.pro-football-reference.com/years/NFL/index.htm) dataset, restricted to the 1949-2020 seasons for completeness of recorded statistics. The statistics range from total combined offensive stats per game to passing/rushing splits. Additionally, statistics regarding turnovers and penalties are taken into account, so there will be no statistical stone left unturned throughout this analysis.



<font size ="5" color ="blue">Methodology</font>

To analyze this data we'll use a combination of visualization and statistical techniques. We'll plot the primary statistics against average points per game to see whether they show any distinct positive or negative correlation. Additionally, we'll run a regression on the full dataset and then create and test several other fits/models to see whether we can construct one that combines predictive accuracy with simplicity in the number of variables used. The variables of our dataset, along with descriptions, are listed below.

:::::::::::::: {.columns align=right totalwidth=5em}
::: {.column width="33%" align=left}

Variable Name

* PtsFor

* TotYds

* Ply

* YPP

* TO

* FL

* Tot1stD

* Cmp

* Att

* PYds

* PTD

* Int

* YPA

* P1stD

* RuAtt

* RuYds

* RuTD

* YPC

* Ru1stD

* Pen

* PenYds

* 1stPy
:::
:::{.column width="33%" align=left}

Variable Type

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric
  
  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric

  Numeric
  
  Numeric
:::
::: {.column width="33%" align=left}

Variable Description

Points For (Per team)

Total Yds (Rush+Receiving)

Total Plays (Per team)

Yards Per Play

Turnovers

Fumbles Lost

Total First Downs

Completions 

Pass Attempts

Pass Yards

Pass Touchdowns

Interceptions

Yards Per Attempt (Passing)

Passing 1st Downs

Rush Attempts

Rush Yards

Rush Touchdowns

Yards Per Carry

Rushing First Downs

Penalties Committed (Accepted by opposition)

Penalty Yardage

Penalties Resulting in First Downs
  
:::
::::::::::::::



## <span style="color:blue;"> Data Cleaning </span>

```{r}

library(readxl)
NFL <- read_xlsx("NFLFinalProj.xlsx", col_names = TRUE, na = "NA")

head(NFL, 50)

```


To begin, we'll remove columns 1, 2, and 3, which correspond to rank, year, and number of teams in the league. Rank simply orders the years, so it adds nothing here. Since we are analyzing which statistics correspond to more points being scored on average, the year itself carries no football significance either. Finally, although the number of teams in the league has expanded over the years, we are analyzing league-wide per-team averages, so as long as all teams are included the total number isn't of much concern.


```{r}

clean1 <- NFL[,-c(1:3)]

head(clean1, 50)

```


Next we'll omit columns 23-28, which cover drive length, scoring percentage, turnover percentage, average number of plays per drive, average yardage per drive, and average points per drive. These statistics have only been recorded since roughly 2000, and given that the data goes back to 1922 we'd lose a large share of the seasons in our sample if we kept them.


```{r}
clean2 <- clean1[,-c(23:28)]

head(clean2, 50)
```


To continue, we'll run na.omit on the remaining data to drop any rows that still contain NA values.


```{r}
FinalNFL <- na.omit(clean2)
head(FinalNFL, 50)
```


Now that our data is fully cleaned, it consists of the NFL Team Offense League Averages Per Team Game for the 1949-2020 seasons; the final cleaning step removed the years 1922-1948 due to their missing statistics.

Nonetheless, we now have 72 years of NFL average offensive statistics with which to find out which variables contribute most to a team putting up points.
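
As a quick sanity check (a small sketch, not part of the original write-up), we can confirm how many seasons and variables remain after cleaning:

```{r}
dim(FinalNFL)   # expect 72 rows (the 1949-2020 seasons) and the 22 variables listed above
```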

## <span style="color:blue;"> Packages Used </span> 


```{r eval=FALSE}

library(readxl) #Used to read data in via excel file
library(ggplot2) # Used to help create and organize both box plots and scatter plots
library(reshape2) # Used to plot and fit boxplots of all variables appropriately
library(leaps) # Used to help determine model's most significant variables
library(car) # Used as a part of the prediction function created


```


## <span style="color:blue;"> Exploratory Data Analysis </span> 


```{r}
library(ggplot2)
library(reshape2)

NFLPoints <- melt(data = FinalNFL)

ggplot(data = NFLPoints, aes(x=variable, y=value)) + geom_boxplot() +facet_wrap(~variable, scales = "free")

```

As we can see, all of the variables appear roughly symmetric and approximately normally distributed, with only a few containing an outlier or two.

```{r}
ggplot(FinalNFL, aes(x=TotYds, y=PtsFor)) + geom_point(color="green")
ggplot(FinalNFL, aes(x=PTD, y=PtsFor)) + geom_point(color="blue")
ggplot(FinalNFL, aes(x=Pen, y=PtsFor)) + geom_point(color="red")
ggplot(FinalNFL, aes(x=YPA, y=PtsFor)) + geom_point(color="cyan")
```

It wouldn't be feasible to show a scatter plot for all 21 predictors; the plots would either occupy too much space or be too small to read. However, these four variables chosen from the dataset all follow a clear linear trend with PtsFor, and the remaining variables behave similarly, so we can reasonably attempt to fit this data with a linear regression model.
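
As a compact alternative to plotting every variable (a small sketch, assuming the cleaned `FinalNFL` data frame above), we can rank each statistic by its linear correlation with PtsFor:

```{r}
# Pearson correlation of every statistic with points scored, strongest first
round(sort(cor(FinalNFL)[, "PtsFor"], decreasing = TRUE), 2)
```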

<font size ="5" color ="blue">Creating the Model</font>

To begin we'll run a regression on our data using all 21 variables.


```{r}
options(scipen=4)
MLR <- lm(formula = PtsFor ~ ., data=FinalNFL)
summary(MLR)
```


Although this model provides accurate predictions, as evidenced by the R-squared/adjusted R-squared values, 21 variables is a rather large number for one regression model. Thus we'll run a stepwise selection on the data to see if we can find a model that maintains a similar level of accuracy while using fewer significant variables.


```{r include=FALSE}
step <- step(MLR, direction = "both", test="F")
```

```{r}
summary(step)
```


The stepwise selection provided a nearly identical model accuracy-wise while using only 11 of the 21 variables. For simplicity's sake we will keep testing to see if we can find a model that uses even fewer variables; however, reducing the number of variables must not come at the expense of model accuracy.

For the next three models, I chose the variables myself, drawing on football knowledge accumulated over many years of watching, to see whether I could build an even more accurate model.


```{r}
randlm1 <- lm(formula = PtsFor ~ TotYds+Ply+TO+Att+PTD+RuTD+Pen+PenYds , data=FinalNFL)
summary(randlm1)

randlm2 <- lm(formula = PtsFor ~ TotYds+Ply+YPP+Tot1stD+Cmp+YPA+YPC , data=FinalNFL)
summary(randlm2)

randlm3 <- lm(formula = PtsFor ~ TotYds+Ply+PTD+RuTD+FL+Int+Pen , data=FinalNFL)
summary(randlm3)
```


Unfortunately, all three of the models I picked by hand had noticeably lower R-squared values than the first two models. Thus we must continue searching for the ideal combination of variables.

Using the leaps package, we'll generate a handful of additional candidate models and decide whether any of them offers a good enough combination of accuracy and simplicity to use for our prediction model.


```{r}
library(leaps)
best <- regsubsets(PtsFor~., data=FinalNFL)
plot(best, scale="adjr2", main="Best-fit: Adjusted R square")

best.sum<- summary(best)
out.table <- data.frame(best.sum$outmat, best.sum$adjr2, best.sum$rss, best.sum$cp, best.sum$bic)
out.table
```


Looking through each of these 8 candidate models, we see that fit8 is the best of the group: its adjusted R-squared is similar to our first two models while using only 8 of the 21 variables. We will therefore fit and summarize this model to confirm it is adequate before checking the regression assumptions.


```{r}
BestModel <- lm(PtsFor ~ TotYds+FL+PTD+YPA+P1stD+RuTD+Ru1stD+Pen, data = FinalNFL)
summary(BestModel)
```


Reading the summary, everything about fit8, now titled BestModel, checks out: every predictor is statistically significant and the adjusted R-squared remains above 0.91. It is now time to verify that the model meets all of the regression assumptions before we go forward with building a prediction function.


```{r}
plot(BestModel)
```

<font size ="5" color ="blue">Meeting Assumptions</font>

We now must make sure that our model meets all 5 assumptions of multiple linear regression.

* Assumption 1 - the probability distribution of the random errors (epsilon's) has a mean of 0.
  * There is no direct way to check this condition; the least squares calculations ensure it holds for the sample, so we assume it is reasonable for the population as well.

* Assumption 2 - the probability distribution of the random errors (epsilon's) has a constant variance.
  * The residual versus fitted plot shows a random scatter except for a few data values. It may be that our model does not fit these extreme values well; we may want to consider splitting the dataset into two populations and fitting a model to each.

* Assumption 3: the probability distribution of the random errors (epsilon's) follows a normal distribution.
  * The normal probability plot of the residuals follows a relatively linear path, so we do not have enough evidence to say the random errors are not normally distributed.
  
* Assumption 4: the random errors (epsilon) are independent (little to no autocorrelation).
```{r}
library(car)
dwt <- durbinWatsonTest(BestModel)
dwt
```
The Durbin-Watson statistic (1.61) is reasonably close to 2, but the reported p-value (0.032) is below the usual 0.05 level, so there is some evidence of mild positive autocorrelation (the estimated lag-1 autocorrelation is only 0.19). Because that autocorrelation is small, we treat the independence assumption as approximately satisfied and proceed, while noting the caveat.
  
* Assumption 5: Little to no multicollinearity.
  * No pair of predictors shares a correlation coefficient over 0.8, and no VIF value exceeds 5, so there is not sufficient evidence of multicollinearity (a sketch of how these checks can be run is shown below).
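
The correlation and VIF checks are not printed elsewhere in the report, so here is a minimal sketch of how they could be run, assuming `BestModel` and `FinalNFL` as defined above:

```{r}
library(car)
vif(BestModel)                   # variance inflation factors; values above 5 would flag collinearity
preds <- c("TotYds", "FL", "PTD", "YPA", "P1stD", "RuTD", "Ru1stD", "Pen")
round(cor(FinalNFL[, preds]), 2) # pairwise correlations among the 8 predictors
```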

We see that our chosen model has met all of the assumptions and thus is ready to be used for prediction! The function created below takes as inputs the 8 significant variables/parameters used by this fit. It feeds those values into the fitted model with predict(), and the output reports both the projected average points per team per game in the NFL and the projected combined score per game if those values were the league's season averages.


```{r}
library(car)
PredPtsFor <- function(TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, Pen){
  inputs <- data.frame(TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, Pen)
  PointsPred <- predict(BestModel, newdata = inputs)
  PointsForNew <- as.integer(PointsPred)
  
  cat("Teams are projected to score on average",PointsForNew, "points per game this season.\n" )
  
  cat("Games are projected to have a total score of",2*PointsForNew, "points this season." )
}

```
```{r}
PredPtsFor(300, .9, 1.6, 6.3, 13, 1.2, 7, 4.8)
```
```{r}
PredPtsFor(360, .6, 1.9, 7.1, 15, 1.4, 9, 5.5)
```
```{r}
PredPtsFor(347, .8, 2.1, 6.8, 14, 1.4, 10, 4)
```
## <span style="color:blue;"> Summary/Conclusion </span> 

<font size ="5" color ="blue">Findings</font>

To summarize, we took a set of league-wide NFL offensive statistics from the years 1922-2020 and cleaned it down to the seasons spanning 1949-2020. With this cleaned dataset we set out to determine which offensive statistics are most strongly associated with the gradual year-over-year increase in points scored per game in the NFL. To do so we created and tested numerous linear regression models until we arrived at one that balanced accuracy with simplicity. The significant variables in this model were TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, and Pen. Before using the model for prediction, the regression assumptions were checked and judged adequate, allowing us to proceed. With all of this in mind, the model should give you a rough idea of what the NFL's offensive landscape will look like in upcoming seasons given the statistic values that are entered.

This research isn't necessarily groundbreaking or world changing, but it certainly has a niche: coaches or analysts could use a model like this to determine which areas a team needs to shore up in order to put more points on the board, and a casual fan could plug in their favorite team's statistics to see how many points that team, or its opponent, is projected to score in a given week.

<font size ="5" color ="blue">Where to Look Next?</font>

The next, and most realistic, step to take with this data would be to run smaller-sample regressions using some of the more advanced statistics introduced in the late 1990s, such as drive duration, average number of plays per drive, and average yardage per drive. If those regressions consistently show over the next few years that any of these statistics carry a great deal of significance, they could be added to our current prediction model to enhance its accuracy even further.

## <span style="color:blue;"> Personal Information </span>


:::::::::::::: {.columns align=right totalwidth=8em}
::: {.column width="40%"}
![](IMG_4325.jpeg)
:::
:::{.column width="10%" align=center}
:::
::: {.column width="50%" align=left}
  James Weinblatt is a senior Mathematical Science major here at Clemson University. He'll be receiving his Bachelor of Science degree next week at graduation and will then continue his search for an employment opportunity back home in New York.
  
  This project combines two of James's favorite things: sports and analyzing data. He plans to show this project to prospective employers to demonstrate his coding and analysis skills, in hopes of receiving an entry-level position at a company.
  
  Additionally James hopes that anyone else who happens to be looking at this project enjoys its contents and is able to learn something they find valuable within it.
:::
::::::::::::::

## <span style="color:blue;"> References </span> 

<https://www.pro-football-reference.com/years/NFL/index.htm>

## <span style="color:blue;"> Full Code </span> 

```{r eval = FALSE} 
library(readxl)
NFL <- read_xlsx("NFLFinalProj.xlsx", col_names = TRUE, na = "NA")
#Reading data into R
head(NFL, 50)

clean1 <- NFL[,-c(1:3)]

head(clean1, 50)

clean2 <- clean1[,-c(23:28)]

head(clean2, 50)

FinalNFL <- na.omit(clean2)
head(FinalNFL, 50)
# 3 steps of data cleaning to get the final dataset to be used

library(readxl) #Used to read data in via excel file
library(ggplot2) # Used to help create and organize both box plots and scatter plots
library(reshape2) # Used to plot and fit boxplots of all variables appropriately
library(leaps) # Used to help determine model's most significant variables
library(car) # Used as a part of the prediction function created

NFLPoints <- melt(data = FinalNFL)

ggplot(data = NFLPoints, aes(x=variable, y=value)) + geom_boxplot() +facet_wrap(~variable, scales = "free")
#Creates individual boxplots of each variable/statistic

ggplot(FinalNFL, aes(x=TotYds, y=PtsFor)) + geom_point(color="green")
ggplot(FinalNFL, aes(x=PTD, y=PtsFor)) + geom_point(color="blue")
ggplot(FinalNFL, aes(x=Pen, y=PtsFor)) + geom_point(color="red")
ggplot(FinalNFL, aes(x=YPA, y=PtsFor)) + geom_point(color="cyan")

# scatter plots of a few statistics to illustrate that the variables in the dataset follow a linear trend and are appropriate for a linear model

options(scipen=4)
MLR <- lm(formula = PtsFor ~ ., data=FinalNFL)
summary(MLR)
#linear regression with all variables included

step <- step(MLR, direction = "both", test="F")
summary(step)
#step test to try and drop less significant variables from the model

randlm1 <- lm(formula = PtsFor ~ TotYds+Ply+TO+Att+PTD+RuTD+Pen+PenYds , data=FinalNFL)
summary(randlm1)

randlm2 <- lm(formula = PtsFor ~ TotYds+Ply+YPP+Tot1stD+Cmp+YPA+YPC , data=FinalNFL)
summary(randlm2)

randlm3 <- lm(formula = PtsFor ~ TotYds+Ply+PTD+RuTD+FL+Int+Pen , data=FinalNFL)
summary(randlm3)
# 3 models chosen by hand, using my own "football knowledge", to be tested

best <- regsubsets(PtsFor~., data=FinalNFL)
plot(best, scale="adjr2", main="Best-fit: Adjusted R square")

best.sum<- summary(best)
out.table <- data.frame(best.sum$outmat, best.sum$adjr2, best.sum$rss, best.sum$cp, best.sum$bic)
out.table
# 8 additional models created to see if any are as accurate yet simpler than the previous ones

BestModel <- lm(PtsFor ~ TotYds+FL+PTD+YPA+P1stD+RuTD+Ru1stD+Pen, data = FinalNFL)
summary(BestModel)

plot(BestModel)
# graphs that help with verifying MLR assumptions 1-3

dwt <- durbinWatsonTest(BestModel)
dwt
# test to check assumption 4 of MLR

PredPtsFor <- function(TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, Pen){
  inputs <- data.frame(TotYds, FL, PTD, YPA, P1stD, RuTD, Ru1stD, Pen)
  PointsPred <- predict(BestModel, newdata = inputs)
  PointsForNew <- as.integer(PointsPred)
  
  cat("Teams are projected to score on average",PointsForNew, "points per game this season.\n" )
  
  cat("Games are projected to have a total score of",2*PointsForNew, "points this season." )
  
  # function that uses the fitted model (BestModel) to predict average points per team per game and the combined game total from the 8 input statistics
}

PredPtsFor(300, .9, 1.6, 6.3, 13, 1.2, 7, 4.8)

PredPtsFor(360, .6, 1.9, 7.1, 15, 1.4, 9, 5.5)

PredPtsFor(347, .8, 2.1, 6.8, 14, 1.4, 10, 4)
```
