This data set contains the results of all NFL regular season field goal attempts for the 2008 season. There are 1039 observations of 23 variables. The variables are:
GameDate
AwayTeam
HomeTeam
qtr (quarter, 5=overtime)
min (minutes remaining)
sec (seconds remaining, added to minutes)
kickteam (team kicking field goal)
def (defending team)
down
togo (yards to go for 1st down)
kicker (ID #)
ydline (yardline of kicking team)
name (kicker’s name)
distance (yards)
homekick (1 if kicker at Home, 0 if Away)
kickdiff (kicking team lead +, or deficit -, prior to kick)
timerem (Time remaining in seconds, negative = overtime)
offscore (kicking team’s score prior to kick)
defscore (defense team’s score prior to kick)
season (2008)
GOOD (1 is Success, 0 is Miss)
Missed (-1 if missed and not blocked, 0 otherwise)
Blocked (1 if blocked, 0 otherwise)
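The variable list does not spell out how timerem relates to qtr, min, and sec. The Python sketch below shows one plausible reading, assuming 15-minute quarters and that min and sec count down within the current quarter. The function name `time_remaining` is hypothetical, and the formula is an assumption rather than something documented with the data.

```python
QUARTER_SECONDS = 15 * 60  # assumption: each quarter (and overtime) lasts 15 minutes


def time_remaining(qtr, minutes, seconds):
    """Seconds left in regulation; zero or negative when qtr = 5 (overtime).

    Assumes `minutes` and `seconds` count down within the current quarter.
    """
    full_quarters_left = 4 - qtr  # -1 in overtime, which pushes the result below zero
    return full_quarters_left * QUARTER_SECONDS + minutes * 60 + seconds


print(time_remaining(1, 15, 0))  # opening kickoff
print(time_remaining(4, 0, 3))   # three seconds left in regulation
print(time_remaining(5, 10, 0))  # ten minutes left in overtime
```

Under this reading, timerem is fully determined by qtr, min, and sec, so it carries no information beyond those three variables.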
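The variables Missed, Blocked, season, GameDate, qtr, and ydline are removed from the data set because they are not useful as predictors of field goal success: Missed and Blocked record the outcome itself, season is constant at 2008, and GameDate, qtr, and ydline are dropped as uninformative. The GOOD variable, which records whether a field goal is good, is the response variable.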
Using the data, can we build a linear model that correctly predicts whether a field goal is good?
First, the following pairwise scatter plots are made to inspect potential issues with the predictor variables, including collinearity.
Two of the predictor variables do not appear to be unimodal: down and homekick. These variables will be removed from the model.
Two pairs of variables are highly correlated: kicker with kickteam, and min with timerem. The variables kicker and timerem will be removed.
The variables AwayTeam and HomeTeam are redundant, so they will be removed.
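A correlation screen of this kind can be sketched in a few lines. The Python example below computes Pearson correlations for every pair of numeric columns and flags the pairs above a cutoff. The 0.8 threshold, the function names, and the toy values are all assumptions for illustration; the report does not state what cutoff was used, and the original plots appear to have been produced in R.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| at or above the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up illustrative values, not taken from the field goal data.
toy = {
    "u": [1, 2, 3, 4, 5],
    "v": [2, 4, 6, 8, 11],
    "w": [5, 1, 4, 2, 3],
}
for a, b, r in correlated_pairs(toy):
    print(a, b, round(r, 3))
```

Pearson correlation only applies to numeric columns, so a categorical variable such as kickteam would need to be encoded or compared another way.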
Another set of pairwise scatter plots is made to check that there are no potential issues with the remaining variables.
The variables kickdiff and offscore appear to be moderately correlated, so offscore will be removed. The name, kickteam, and def variables are not numeric, so they will be removed.
We randomly split the data into two subsets. 80% of the data will be used as training data. We will use the training data to search the candidate models, validate them, and identify the final model using cross-validation. The remaining 20%, the hold-out sample, will be used to assess the performance of the final model.
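The splitting scheme described here can be sketched as follows. This Python example shuffles row indices, takes 80% for training, and then partitions the training indices into folds for cross-validation. The seed, the choice of five folds, and the function names are assumptions; the report does not state how many folds were used.

```python
import random


def train_test_split(n_rows, train_fraction=0.8, seed=1):
    """Shuffle row indices and split them into training and hold-out lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_rows))
    return indices[:cut], indices[cut:]


def k_folds(indices, k=5):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validate in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate


train, holdout = train_test_split(1039)  # 1039 observations in the data set
print(len(train), len(holdout))
for fit, validate in k_folds(train):
    print(len(fit), len(validate))
```

Each candidate model would be fitted on `fit` and scored on `validate` in every fold, and the hold-out indices stay untouched until the final model has been chosen.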
Table: Average of prediction errors of candidate models
|       |       |     |
|------:|------:|----:|
|    NA |    NA |     |
| FALSE | FALSE |  54 |
|  TRUE | FALSE | 112 |