The Philadelphia Eagles had a shocking season in 2017-18 when they ultimately beat the New England Patriots and won the Super Bowl on February 4, 2018. “Could anyone have predicted that?” is a question that went wild during and after the season. We will use league data from the 2016-17 season to build your model, then use that model to predict the wins for the Eagles in the 2017-18 season.

Team: Team Name
Code: Official league abbreviation of the team name
Wins: Games won in the 2016 regular season (RESPONSE VARIABLE)
OffPY: Offensive Passing Yards in the 2016 regular season
OffRY: Offensive Rushing Yards in the 2016 regular season
PA: Points Scored Against in the 2016 regular season
FPG: Field Goal Percentage (FG made/FG attempted) in the 2016 regular season
Penalty: Total Offensive and Defensive Penalty Yards in the 2016 regular season
TurnoverDiff: Turnover Differential (turnovers acquired - turnovers lost) in the 2016 regular season
OppPY: Opponents’ Passing Yards in the 2016 regular season
OppRY: Opponents’ Rushing Yards in the 2016 regular season

Step 1: Collect the Data

The data were compiled from https://www.pro-football-reference.com/years/2016/ and ESPN. We have no reason to question the validity of these sources.
Are the data organized for MLR- each row contains information about one observation including all of the explanatory predictors and the response.
Check the appropriateness of response variable for regression: View a histogram of response variable. It should be continuous, and approximately unimodal and symmetric, with few outliers.

# Load in the data from a csv
nfl2016<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/nfl2016_updated.csv")
# Produce the first 6 rows of the data
head(nfl2016)

# Create a histogram of the response
hist(nfl2016$Wins, xlab="Wins", main="Histogram of Wins")

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, or none.

# Generate the names and positions of the columns in the data
names(nfl2016)

 [1] "Team"         "Code"         "Wins"         "OffPY"        "OffRY"       
 [6] "PA"           "FGP"          "Penalty"      "TurnoverDiff" "OppPY"       
[11] "OppRY"

# Create one scatter plot
# plot(x,y)
plot(nfl2016[,"OffPY"], nfl2016$Wins,xlab="Offensive Passing Yards",ylab="Wins")

# Create several plots at once
# plot(y~x+x+...,data=DATATABLENAME)
plot(Wins~OffPY+OffRY,data=nfl2016)

# Create your plots in a grid
# the par() functions allows us to put plots on the same grid
# don't worry too much about syntax
# mfrow=c(ROWS, COLUMNS), pin=size of plot in inches
# mar = margins of plotting area
par(mfrow=c(2,4),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(nfl2016)[4:11]) {
  plot(nfl2016[,i], nfl2016$Wins,xlab=i,ylab="Wins")
}

# compute correlation simply
cor(nfl2016$OffPY,nfl2016$Wins)

[1] 0.3745733

# compute correlations and round to three decimal places
round(cor(nfl2016[4:11],nfl2016$Wins),3)

               [,1]
OffPY         0.375
OffRY         0.151
PA           -0.675
FGP           0.328
Penalty       0.198
TurnoverDiff  0.760
OppPY         0.180
OppRY        -0.421

We see Turnover Differential (r=0.759) and Points Against (r=-0.675) have the strongest linear relationships.
Potentially, Opponents’ Rushing Yards has a curvilinear relationship.
Therefore, our hypothesized model is:

\(wins=\beta_0+\beta_1 TurnoverDiff+\beta_2 PA+\beta_3 OppRy+\beta_4 OppRy^2+\epsilon\)

Step 3: Estimate the model parameters (fit the model using R)

# store the model as an object
# lm(RESPONSE~x1+...,data=DATAFRAME)
nflmod1<-lm(Wins~TurnoverDiff+PA+OppRY+I(OppRY^2),data=nfl2016)
summary(nflmod1)


Call:
lm(formula = Wins ~ TurnoverDiff + PA + OppRY + I(OppRY^2), data = nfl2016)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2532 -1.1877  0.3448  1.1127  3.1319 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.778e+01  9.809e+00   1.813   0.0810 .  
TurnoverDiff  2.082e-01  3.991e-02   5.218 1.69e-05 ***
PA           -1.637e-02  7.584e-03  -2.159   0.0399 *  
OppRY        -2.036e-03  1.011e-02  -0.201   0.8420    
I(OppRY^2)   -1.057e-07  2.654e-06  -0.040   0.9685    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.759 on 27 degrees of freedom
Multiple R-squared:  0.7373,    Adjusted R-squared:  0.6984 
F-statistic: 18.94 on 4 and 27 DF,  p-value: 1.595e-07

# Extract just the coefficients and round them 4 decimal places
round(coef(nflmod1),4)

 (Intercept) TurnoverDiff           PA        OppRY   I(OppRY^2) 
     17.7827       0.2082      -0.0164      -0.0020       0.0000

The prediction equation is:

\(\widehat{wins}=17.78+0.21TurnoverDiff-0.016PA-0.002OppRy-.0000OppRy^2\)

Interpretation of coefficients:

For example: For a one unit increase in turnover differential, the predicted number of wins is expected to increase by 0.21, given the other explanatory variables are held constant.
The coefficients for the linear and curvilinear components of opponents’ rushing yard are not practical to interpret individually.

Step 4: Specify the distribution of the errors and find the estimate of the variance

\(\epsilon \overset{\mathrm{iid}}{\sim} N(0,\sigma^2 )\)
estimate of \(\sigma^2\) is \(MSE=3.09\)
estimate of \(\sigma\) is \(\sqrt{MSE}=1.76\)
Approximately 95% of our predictions will be within 2*1.76=3.52 games of the actual number of wins. (This is not great considering there are only 16 games (at the time).)

Step 5: Evaluate the Utility of the model

\(wins=\beta_0+\beta_1 TurnoverDiff+\beta_2 PA+\beta_3 OppRy+\beta_4 OppRy^2+\epsilon\)

First we Perform the Global F Test:

Hypotheses:
- \(H_0: \beta_1= \beta_2=\beta_3=\beta_4=0\) (the model is not adequate)
- \(H_a\):at least one of \(\beta_1 , \beta_2 , \beta_3,\beta_4\) does not equal 0 (the model is adequate)
Distribution of test statistic: F with 4, 27 DF
Test Statistic: F=18.94
Pvalue: <0.0001
Decision: 0.0001<0.05 -> REJECT H0
Conclusion: The model with Turnover Differential, Points Against, Opponents Rushing Yards, and Opponents Rushing Yards squared is adequate at predicting wins.

Then we test “the most important predictors”: Test the Individual Significance of OppRy^2

Hypotheses:
- \(H_0: \beta_4=0\) (the quadratic relationship does not contribute to predicting wins)
- \(H_a:\beta_4 \neq 0\) (the quadratic relationship contributes to predicting wins)
Distribution of test statistic: T with 27 DF
Test Statistic: t=-0.04
Pvalue: 0.9685
Decision: 0.97>0.05 -> FAIL TO REJECT H0
Conclusion: The quadratic relationship of opponents rushing yards does not contribute information for predicting a team’s wins. We will remove just the higher order term and refit the model.

Refit the model

Refitting the model can change significance of other variables and will always change the beta estimates. Here the estimates changed only slightly because the term we removed was not significant.

nflmod2<-lm(Wins~TurnoverDiff+PA+OppRY,data=nfl2016)
summary(nflmod2)


Call:
lm(formula = Wins ~ TurnoverDiff + PA + OppRY, data = nfl2016)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2464 -1.1798  0.3529  1.1188  3.1307 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.159540   2.546624   7.131 9.27e-08 ***
TurnoverDiff  0.208215   0.039182   5.314 1.18e-05 ***
PA           -0.016401   0.007417  -2.211   0.0353 *  
OppRY        -0.002436   0.001171  -2.081   0.0467 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.727 on 28 degrees of freedom
Multiple R-squared:  0.7373,    Adjusted R-squared:  0.7091 
F-statistic: 26.19 on 3 and 28 DF,  p-value: 2.812e-08

round(coef(nflmod2),4)

 (Intercept) TurnoverDiff           PA        OppRY 
     18.1595       0.2082      -0.0164      -0.0024

round(confint(nflmod2),4)

               2.5 %  97.5 %
(Intercept)  12.9430 23.3761
TurnoverDiff  0.1280  0.2885
PA           -0.0316 -0.0012
OppRY        -0.0048  0.0000

\(\widehat{wins}=18.16+0.21TurnoverDiff-0.016PA-0.002OppRy\)

We see from the global F test this is significant, and OppRy is significant in the model.

Further Assessment:

Root MSE: 1.73 (slightly better than the model with higher order term)
Adjusted R-Sq: 0.7091 (better than model with higher order term, and it is closer to R2)
- After accounting for the number of predictors and observations, 70% of the variation in wins explained by the model with turnover differential, points against, and opponent’s rushing yards.
Confidence Interval for Betas: We are 95% confident that for each increase in turnover differential, the true number of wins increases by between 0.13 and 0.29, while point against and opponents rushing yards remains constant.

Step 6: Check the Model Assumptions

We will cover this in Unit 3

Step 7: Use the model for prediction or estimation

Now it is time to go back to our original research question- Could we have predicted the number of wins for the Eagles in the 2017 season from the 2016 season? In the 2016 season the Eagles had 3585 offensive passing yards; 1813 offensive rushing yards; 331 points against; 85.4% FPG; 212 penalty yards; 6 turnover differential; 3832 opponents’ passing yards; and 1652 opponents’ rushing yards. Note this is also observation 25 in the data.

nflmod2<-lm(Wins~TurnoverDiff+PA+OppRY,data=nfl2016)
round(coef(nflmod2),4)

 (Intercept) TurnoverDiff           PA        OppRY 
     18.1595       0.2082      -0.0164      -0.0024

# The long way to get y-hat
18.1595 +0.2082*6-0.0164*331-0.0024*1652

[1] 10.0155

NOTE: THE COEFFICIENTS HAVE BEEN ROUDNED, so there is slight error from the true estimate below. If we use this model and the statistics from the 2016 season, we would have predicted the Eagles to win ~10.0155 games.

# A much easier and exact way
# The Eagles are observation 25 in the data table
# we can pull values from the model object
nflmod2$fitted.values

        1         2         3         4         5         6         7         8 
 8.524117  9.717652 10.452034  8.020043  7.580750  2.700531  9.203390  2.685731 
        9        10        11        12        13        14        15        16 
10.927055  8.625321  7.935686  9.770712  7.436671  5.997037  4.118728 11.668834 
       17        18        19        20        21        22        23        24 
 5.362968  6.869346 11.246377 13.105708  6.127438  9.633048  3.435469 10.593994 
       25        26        27        28        29        30        31        32 
 9.955402  9.939474  5.946754  2.780175  9.955950  7.956021  8.517495  7.210088

nflmod2$fitted.values[25]

      25 
9.955402

# or we can use the predict function
predict(nflmod2)

        1         2         3         4         5         6         7         8 
 8.524117  9.717652 10.452034  8.020043  7.580750  2.700531  9.203390  2.685731 
        9        10        11        12        13        14        15        16 
10.927055  8.625321  7.935686  9.770712  7.436671  5.997037  4.118728 11.668834 
       17        18        19        20        21        22        23        24 
 5.362968  6.869346 11.246377 13.105708  6.127438  9.633048  3.435469 10.593994 
       25        26        27        28        29        30        31        32 
 9.955402  9.939474  5.946754  2.780175  9.955950  7.956021  8.517495  7.210088

# Or we can create a data frame with the new values.
# the explanatory variable names must match those in the model, but there is no response variable.
newnfl<-data.frame(TurnoverDiff=6,PA=331,OppRY=1652)
newnfl

predict(nflmod2,newnfl, interval="prediction")

       fit      lwr      upr
1 9.955402 6.332131 13.57867

# to compute confidence intervals for estimation we would use interval="confidence"

If we use this model and the statistics from the 2016 season, we would have predicted the Eagles to win 9.9554 games exactly. Further, we are 95% confident that the Eagles would win between 6.33 and 13.58 games, given their statistics in 2016.
The Eagles won 13 regular season games in 2017. How accurate was this model at predicting?
- Although this model did well for prediction because our confidence interval contained the realized response, and overall, this model is good for predicting, we should not be used for prediction beyond this season. We should update this model with recent data if we want to continue using it.

STAT 3220 Unit 1