Team members: Nadia Shchetnikova, Daniel Zhang, Esubalew Milam, Aubrey Cook
Project 1
Question 1:
NFL=read.csv("NFL Team Scoring.csv")NFL=NFL[,-1]summary(NFL)
Year PTS TA PAVG
Min. :2009 Min. :10.90 Min. :0.9375 Min. :5.100
1st Qu.:2009 1st Qu.:18.77 1st Qu.:1.3750 1st Qu.:6.500
Median :2010 Median :22.20 Median :1.6250 Median :7.000
Mean :2010 Mean :21.90 Mean :1.6504 Mean :7.042
3rd Qu.:2011 3rd Qu.:24.60 3rd Qu.:1.8906 3rd Qu.:7.700
Max. :2011 Max. :35.00 Max. :2.5000 Max. :9.300
TO SCK RAVG FD
Min. :0.625 Min. :0.8125 Min. :3.300 Min. :14.10
1st Qu.:1.375 1st Qu.:1.7969 1st Qu.:4.000 1st Qu.:17.48
Median :1.594 Median :2.1875 Median :4.200 Median :18.95
Mean :1.655 Mean :2.2259 Mean :4.231 Mean :19.02
3rd Qu.:1.938 3rd Qu.:2.6875 3rd Qu.:4.400 3rd Qu.:20.70
Max. :2.625 Max. :3.5000 Max. :5.400 Max. :26.00
P20 R20 COMP THRD
Min. :1.562 Min. :0.1875 Min. :49.40 Min. :25.80
1st Qu.:2.500 1st Qu.:0.4375 1st Qu.:57.58 1st Qu.:34.02
Median :2.812 Median :0.6875 Median :60.45 Median :37.65
Mean :3.024 Mean :0.7201 Mean :60.36 Mean :38.33
3rd Qu.:3.578 3rd Qu.:0.8750 3rd Qu.:62.95 3rd Qu.:41.73
Max. :4.500 Max. :1.6875 Max. :71.30 Max. :56.70
PA RA RY PY
Min. :24.60 Min. :20.00 Min. : 80.9 Min. :129.8
1st Qu.:31.20 1st Qu.:25.20 1st Qu.:100.7 1st Qu.:192.3
Median :33.95 Median :27.00 Median :114.5 Median :219.0
Mean :33.67 Mean :27.33 Mean :116.1 Mean :223.2
3rd Qu.:36.02 3rd Qu.:29.10 3rd Qu.:127.0 3rd Qu.:253.9
Max. :42.40 Max. :37.90 Max. :172.3 Max. :334.2
attach(NFL)
Question 2:
NFL.lm1=lm(PTS ~ ., data=NFL)summary(NFL.lm1)
Call:
lm(formula = PTS ~ ., data = NFL)
Residuals:
Min 1Q Median 3Q Max
-5.5866 -1.3118 0.0109 1.3825 4.0220
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 110.79041 577.63096 0.192 0.8484
Year -0.05770 0.28673 -0.201 0.8410
TA 3.69956 0.63061 5.867 9.6e-08 ***
PAVG 0.58548 2.46659 0.237 0.8130
TO -1.50645 0.57465 -2.622 0.0105 *
SCK -0.46959 0.65403 -0.718 0.4748
RAVG 0.65070 1.04548 0.622 0.5355
FD 0.58207 0.36352 1.601 0.1133
P20 0.29124 0.82689 0.352 0.7256
R20 0.06099 1.14600 0.053 0.9577
COMP -0.07057 0.11360 -0.621 0.5362
THRD 0.07478 0.06683 1.119 0.2665
PA -0.17546 0.49597 -0.354 0.7244
RA 0.01212 0.11852 0.102 0.9188
RY 0.01517 0.03393 0.447 0.6560
PY 0.04865 0.06784 0.717 0.4754
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.032 on 80 degrees of freedom
Multiple R-squared: 0.8518, Adjusted R-squared: 0.8241
F-statistic: 30.67 on 15 and 80 DF, p-value: < 2.2e-16
library(boot)
Warning: package 'boot' was built under R version 4.5.3
The MSE of PTS is 6.737557, but since this MSE is the squared error of PTS in points, it’s units are currently in points squared. To put the MSE into the the same scale as PTS (points), we take the square root of the MSE and get 2.596, indicating that the predicted points scored per game will be off by approximtaely 2.596 points from the actual points scored per game, more or less.
Question 3:
plot(NFL.lm1, which=1)
library(car)
Warning: package 'car' was built under R version 4.5.3
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:boot':
logit
residualPlots(NFL.lm1)
Test stat Pr(>|Test stat|)
Year -1.1992 0.234030
TA 0.6917 0.491169
PAVG 0.1056 0.916155
TO 0.3792 0.705553
SCK 0.8677 0.388165
RAVG 0.3998 0.690410
FD 1.9734 0.051948 .
P20 -0.6348 0.527388
R20 0.5836 0.561123
COMP 3.0311 0.003294 **
THRD 2.4008 0.018713 *
PA 2.0456 0.044124 *
RA 0.8955 0.373217
RY 1.8072 0.074544 .
PY 1.3956 0.166750
Tukey test 0.7276 0.466865
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Year TA PAVG TO SCK RAVG FD
1.273739 1.273692 99.727090 1.383903 3.559683 4.293351 17.548656
P20 R20 COMP THRD PA RA RY
8.251755 3.002759 5.960095 3.505388 64.810355 3.133505 10.982873
PY
188.392735
RAVG R20 RA RY
RAVG 1.0000000 0.7574716 0.3139383 0.6789253
R20 0.7574716 1.0000000 0.4563373 0.7020973
RA 0.3139383 0.4563373 1.0000000 0.7442925
RY 0.6789253 0.7020973 0.7442925 1.0000000
Linearity: Satisfied; The residuals are distributed relatively evenly across the residuals vs fitted plot, implying linearity.
Normality: NotSatisfied; The residuals closely follow the diagonal line with some slight skew at the beginning and end, indicating a not normal pattern. To resolve this issue, we could linearize the model by squaring the response variable, PTS.
Homoscedasticity: Satisfied; the NCV test of the model results in a p-value greater than 0.05, meaning we fail to reject the null hypothesis that there is no heteroscedasticity or non-constant variance between residuals.
Independence of Errors: Satisfied; the Durbin Watson test of the model results in a p-value greater than 0.05, meaning we fail to reject the null hypothesis that there is a no significant evidence of autocorrelation between residuals.
No Highly Influencial Observations: Satisfied; No points venture beyond Cook’s distance.
No Multicollinearity: Not Satisfied; multiple predictor variables have VIF statistics above 5 or 10, indicating moderate or high multicollinearity with other predictor variables. To resolve this issue, we could eliminate the predictors with high VIF statistics one at a time starting with the greatest VIF and rerun the model until all VIF’s are below 10 or 5.
Question 4:
pred.lm1=predict(NFL.lm1,interval="prediction")
Warning in predict.lm(NFL.lm1, interval = "prediction"): predictions on current data refer to _future_ responses
Cp (left plot)optimal predictors: 6 The minimum Cp is at index 6, meaning a 6-predictor model offers the best balance of fit and complexity by this criterion. / BIC (middle plot) → optimal predictors: 4 BIC penalizes model complexity more heavily than Cp, so it selects a more parsimonious 4-predictor model. / Adjusted R^2 (right plot) optimal predictors: 7, The adjusted R^2 keeps increasing until 7 predictors, where it levels off — adding more predictors beyond that doesn’t meaningfully improve the model. / The three criteria suggest slightly different optimal model sizes. Cp selects 6 predictors, BIC selects 4, and adjusted R^2 selects 7. BIC tends to favor smaller models due to its stronger penalty for complexity, while adjusted R^2 allows for a few more predictors. Overall, a model with around 4–7 predictors appears to be a reasonable choice for predicting NFL team points scored ##
model2 <-lm(PTS ~ TA + TO + FD + PA + RY + PY, data = NFL)model3 <-lm(PTS ~ TA + PAVG + TO + FD, data = NFL)model4 <-lm(PTS ~ TA + TO + FD + THRD + PA + RY + PY, data = NFL)cv2 <-cv.glm(NFL, glm(PTS ~ TA + TO + FD + PA + RY + PY, data = NFL))$delta[1]cv3 <-cv.glm(NFL, glm(PTS ~ TA + PAVG + TO + FD, data = NFL))$delta[1]cv4 <-cv.glm(NFL, glm(PTS ~ TA + TO + FD + THRD + PA + RY + PY, data = NFL))$delta[1]cv2; cv3; cv4
[1] 4.145494
[1] 4.322625
[1] 4.152935
AIC(model2); AIC(model3); AIC(model4)
[1] 410.4483
[1] 414.2749
[1] 410.8583
Model 2 (Cp, 6 predictors) has the best prediction accuracy with the lowest LOOCV error (4.145) and lowest AIC (410.45). / Model 4 (adjusted R^2, 7 predictors) performs similarly with a LOOCV of 4.153 and AIC of 410.86. / Model 3 (BIC, 4 predictors) has the weakest prediction accuracy, with the highest LOOCV (4.323) and AIC (414.27).
library(leaps) x =model.matrix(PTS ~ (TA + TO + FD + PA + RY + PY)^2, data = NFL)y = NFL$PTSall.interact.fit =regsubsets(PTS ~ (TA + TO + FD + PA + RY + PY)^2, data = NFL, nvmax =21)all.interact.sum =summary(all.interact.fit)best.cp.index =which.min(all.interact.sum$cp)coef(all.interact.fit, best.cp.index)
(Intercept) TA PA TA:TO TA:RY FD:RY
12.504544487 10.398222190 -0.736362590 -0.795752994 -0.045227617 0.005504967
PA:PY
0.002056339
The six predictors that are selected in the model are the the following:
RAVG R20 RA RY
RAVG 1.0000000 0.7574716 0.3139383 0.6789253
R20 0.7574716 1.0000000 0.4563373 0.7020973
RA 0.3139383 0.4563373 1.0000000 0.7442925
RY 0.6789253 0.7020973 0.7442925 1.0000000
Linearity: Satisfied; The residuals are distributed relatively evenly across the residuals vs fitted plot, implying linearity.
Normality: Satisfied; At first, the residuals appear to have a slight skew at the edges of the Q-Q residuals plot, indicating a non-normal pattern. Normally, to resolve this issue, we could make the model more linear by finding out which variables need to be squared with residualPlots(). However, after using residualPlots(), none of the predictors have a p-value less than 0.05, meaning the skewness is statistically insignificant at a = 0.05. PA and RY residuals have p statistics insignificant at a = 0.05, but significant at a = 0.10.
Homoscedasticity: Satisfied; the NCV test of Model5 has a p-value greater than 0.05, meaning we fail to reject the null hypothesis that there is no heteroscedasticity or non-constant variance between residuals.
Independence of Errors: Not Satisfied; the Durbin Watson test of the model results in a p-value less than 0.05, meaning there is autocorrelation between residuals. To fix this, generalized least squares (GLS) functions can be used.
No Highly Influential Observations: Satisfied; No points are beyond Cook’s distance.
No Multicollinearity: Not Satisfied; FD and PY have VIF statistics above 5, indicating multicollinearity with other predictor variables. To resolve this issue, we could eliminate the predictors with high VIF statistics one at a time starting with the greatest VIF (FD) and rerun the model until all VIF’s are below 10 or 5.
Fixing the model
Multicollinearity: Remove the FD predictor since it has the greatest VIF score.
Model6 <-lm(PTS ~ TA + TO + PA + RY + PY, data = NFL)vif(Model6)
TA TO PA RY PY
1.082188 1.209053 3.236346 1.513007 2.711652
The new VIF statistics show no scores above 5, so no further removals are necessary.
Independence of errors:
durbinWatsonTest(Model6)
lag Autocorrelation D-W Statistic p-value
1 0.1484058 1.665536 0.072
Alternative hypothesis: rho != 0
After removing the FD predictor, the new test statistics show no violation of independence of errors, so no further changes are necessary.
Question 10:
Model6_glm <-glm(PTS ~ TA + TO + PA + RY + PY, data = NFL)cv.glm(NFL, Model6_glm)$delta
[1] 4.305078 4.302199
AIC(Model6)
[1] 413.988
Model5_glm <-glm(PTS ~ TA + TO + FD + PA + RY + PY, data = NFL)cv.glm(NFL, Model5_glm)$delta
[1] 4.145494 4.142310
AIC(Model5)
[1] 410.4483
Model6 is not the best model in terms of prediction accuracy. Models 2 and 5 have the lowest test MSE estimate and AIC, making them the best for predictions.