Goal

Our goal is to create a logistic regression model from the NFL’s 2008 Field Goals Attempted (FGA) data that can accurately classify missed or made field goals.

NFL Field Goals Attempted (FGA) Dataset

# load packages
library(tidyverse)
library(lubridate)
library(stringr)
# load in the data
df <- read_csv("nfl2008_fga.csv")

Initial Observations of the Dataset

# specify dimensions
dim(df)
## [1] 1039   23

The NFL FGA dataset is from the 2008 football season. There are 1039 observations and 23 variables in the dataset. The next code chunk lists the names of the 23 variables included in the dataset.

# list variable names
names(df)
##  [1] "GameDate" "AwayTeam" "HomeTeam" "qtr"      "min"      "sec"     
##  [7] "kickteam" "def"      "down"     "togo"     "kicker"   "ydline"  
## [13] "name"     "distance" "homekick" "kickdiff" "timerem"  "offscore"
## [19] "defscore" "season"   "GOOD"     "Missed"   "Blocked"

Next, we combine this information together and add details on the types of variables with the glimpse() function.

# glimpse of the dataframe
glimpse(df)
## Observations: 1,039
## Variables: 23
## $ GameDate <int> 20081130, 20081005, 20081228, 20081012, 20080907, 200...
## $ AwayTeam <chr> "IND", "IND", "TEN", "BAL", "CHI", "HOU", "IND", "KC"...
## $ HomeTeam <chr> "CLE", "HOU", "IND", "IND", "IND", "IND", "SD", "DEN"...
## $ qtr      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ min      <int> 47, 54, 45, 45, 50, 50, 46, 52, 46, 49, 48, 55, 51, 4...
## $ sec      <int> 2, 47, 20, 42, 56, 43, 45, 34, 12, 46, 7, 48, 49, 2, ...
## $ kickteam <chr> "IND", "IND", "IND", "IND", "IND", "IND", "IND", "KC"...
## $ def      <chr> "CLE", "HOU", "TEN", "BAL", "CHI", "HOU", "SD", "DEN"...
## $ down     <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ togo     <int> 11, 3, 3, 1, 21, 7, 5, 7, 7, 9, 9, 3, 9, 2, 11, 6, 5,...
## $ kicker   <int> 15, 15, 15, 15, 15, 15, 15, 18, 18, 29, 29, 29, 29, 2...
## $ ydline   <int> 12, 28, 10, 19, 21, 22, 5, 8, 20, 27, 26, 16, 26, 33,...
## $ name     <chr> "A.Vinatieri", "A.Vinatieri", "A.Vinatieri", "A.Vinat...
## $ distance <int> 30, 46, 28, 37, 39, 40, 23, 26, 38, 45, 44, 34, 43, 5...
## $ homekick <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,...
## $ kickdiff <int> -3, 0, 7, 14, 0, -3, 0, 0, -3, -7, -4, 0, 0, 3, 0, 4,...
## $ timerem  <int> 2822, 3287, 2720, 2742, 3056, 3043, 2805, 3154, 2772,...
## $ offscore <int> 0, 0, 7, 14, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 7, 0, 7...
## $ defscore <int> 3, 0, 0, 0, 0, 3, 0, 0, 3, 7, 7, 0, 0, 0, 0, 3, 0, 0,...
## $ season   <int> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008,...
## $ GOOD     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,...
## $ Missed   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ Blocked  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Before listing a description of the variables in the dataset, we should look at the first and last 10 rows of the data to see if we notice any patterns and compute the summary statistics of the variables.

# first 10 rows
head(df, n = 10)
## # A tibble: 10 × 23
##    GameDate AwayTeam HomeTeam   qtr   min   sec kickteam   def  down  togo
##       <int>    <chr>    <chr> <int> <int> <int>    <chr> <chr> <int> <int>
## 1  20081130      IND      CLE     1    47     2      IND   CLE     4    11
## 2  20081005      IND      HOU     1    54    47      IND   HOU     4     3
## 3  20081228      TEN      IND     1    45    20      IND   TEN     4     3
## 4  20081012      BAL      IND     1    45    42      IND   BAL     4     1
## 5  20080907      CHI      IND     1    50    56      IND   CHI     4    21
## 6  20081116      HOU      IND     1    50    43      IND   HOU     4     7
## 7  20081123      IND       SD     1    46    45      IND    SD     4     5
## 8  20081207       KC      DEN     1    52    34       KC   DEN     4     7
## 9  20081130       KC      OAK     1    46    12       KC   OAK     4     7
## 10 20090118      PHI      ARI     1    49    46      PHI   ARI     4     9
## # ... with 13 more variables: kicker <int>, ydline <int>, name <chr>,
## #   distance <int>, homekick <int>, kickdiff <int>, timerem <int>,
## #   offscore <int>, defscore <int>, season <int>, GOOD <int>,
## #   Missed <int>, Blocked <int>
# last 10 rows
tail(df, n = 10)
## # A tibble: 10 × 23
##    GameDate AwayTeam HomeTeam   qtr   min   sec kickteam   def  down  togo
##       <int>    <chr>    <chr> <int> <int> <int>    <chr> <chr> <int> <int>
## 1  20080914       SF      SEA     5    -5    24       SF   SEA     4     1
## 2  20080928      HOU      JAC     5    -4    30      JAC   HOU     2    10
## 3  20080929      BAL      PIT     5    -6     1      PIT   BAL     4     7
## 4  20080921       TB      CHI     5   -11    24       TB   CHI     2     3
## 5  20081102       TB       KC     5    -5    27       TB    KC     4     1
## 6  20081102       GB      TEN     5    -6    41      TEN    GB     3     2
## 7  20081211       NO      CHI     5    -3    19      CHI    NO     2    13
## 8  20081222       GB      CHI     5    -4    33      CHI    GB     3    10
## 9  20081116      PHI      CIN     5   -15    13      CIN   PHI     4     5
## 10 20081019      NYJ      OAK     5   -13    35      OAK   NYJ     4    11
## # ... with 13 more variables: kicker <int>, ydline <int>, name <chr>,
## #   distance <int>, homekick <int>, kickdiff <int>, timerem <int>,
## #   offscore <int>, defscore <int>, season <int>, GOOD <int>,
## #   Missed <int>, Blocked <int>
# summary statistics of variables
summary(df)
##     GameDate          AwayTeam           HomeTeam              qtr       
##  Min.   :20080904   Length:1039        Length:1039        Min.   :1.000  
##  1st Qu.:20081005   Class :character   Class :character   1st Qu.:2.000  
##  Median :20081106   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :20081399                                         Mean   :2.498  
##  3rd Qu.:20081201                                         3rd Qu.:4.000  
##  Max.   :20090201                                         Max.   :5.000  
##                                                                          
##       min              sec          kickteam             def           
##  Min.   :-15.00   Min.   : 0.00   Length:1039        Length:1039       
##  1st Qu.: 14.00   1st Qu.: 9.50   Class :character   Class :character  
##  Median : 30.00   Median :25.00   Mode  :character   Mode  :character  
##  Mean   : 27.97   Mean   :26.55                                        
##  3rd Qu.: 42.00   3rd Qu.:43.00                                        
##  Max.   : 58.00   Max.   :59.00                                        
##                                                                        
##       down            togo            kicker          ydline     
##  Min.   :1.000   Min.   : 1.000   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:4.000   1st Qu.: 4.000   1st Qu.: 9.00   1st Qu.:10.00  
##  Median :4.000   Median : 6.000   Median :20.00   Median :19.00  
##  Mean   :3.819   Mean   : 6.739   Mean   :19.54   Mean   :18.77  
##  3rd Qu.:4.000   3rd Qu.: 9.000   3rd Qu.:30.00   3rd Qu.:26.00  
##  Max.   :4.000   Max.   :25.000   Max.   :37.00   Max.   :86.00  
##  NA's   :2       NA's   :2                                       
##      name              distance        homekick         kickdiff       
##  Length:1039        Min.   :18.00   Min.   :0.0000   Min.   :-41.0000  
##  Class :character   1st Qu.:28.00   1st Qu.:0.0000   1st Qu.: -6.0000  
##  Mode  :character   Median :37.00   Median :0.0000   Median :  0.0000  
##                     Mean   :36.67   Mean   :0.4947   Mean   :  0.3821  
##                     3rd Qu.:44.00   3rd Qu.:1.0000   3rd Qu.:  6.0000  
##                     Max.   :76.00   Max.   :1.0000   Max.   : 44.0000  
##                                                                        
##     timerem          offscore        defscore          season    
##  Min.   :-887.0   Min.   : 0.00   Min.   : 0.000   Min.   :2008  
##  1st Qu.: 895.5   1st Qu.: 0.00   1st Qu.: 3.000   1st Qu.:2008  
##  Median :1808.0   Median : 7.00   Median : 7.000   Median :2008  
##  Mean   :1704.6   Mean   : 9.95   Mean   : 9.568   Mean   :2008  
##  3rd Qu.:2556.5   3rd Qu.:16.00   3rd Qu.:14.000   3rd Qu.:2008  
##  Max.   :3507.0   Max.   :48.00   Max.   :44.000   Max.   :2008  
##                                                                  
##       GOOD            Missed          Blocked       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.8662   Mean   :0.1338   Mean   :0.02117  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
## 

It is clear from briefly looking at the dataset that the unit of observation is each, individual field goal taken throughout the course of the 2008 season. Now that we have a general idea of what the dataset looks like and includes, we can start describing the variables.

Description of Variables

  • GameDate – This variable represents the date the football game took place. It is listed as an integer in year-month-day (ymd) format but with no separator. We will clean this up later. All games occurred between September 4, 2008 (The first day of the 2008 NFL season) and February 1, 2009 (Super Bowl XLIII). The Regular Season was from September 4 to December 28, 2008 and the Playoffs started January 3, 2009.1

  • AwayTeam – An abbreviation that represents the name of the away team for the given game.

  • HomeTeam – An abbreviation that represents the name of the home team for the given game.

  • qtr – An integer that represents the quarter of the game the corresponding field goal was taken. There are 4 quarters in every football game. Notice, however, that the max of this variable is listed as 5. I hypothesize this is either because the people who generated the codes for the dataset used a value of 5 to represent overtime or a mistake in the data entry process. It is most likely not a mistake as the last ten rows all have a qtr value of 5.

  • min – An integer that represents the minute the kick was taken in. There are 4 quarters and each quarter is 15 minutes. Thus there are 60 minutes of regulation (non-overtime) in a given football game. Looking at the first ten rows of the data helps us see that the minute of the football game is actually listed as minutes remaining in the game, i.e. the first quarter is minute 60 to minute 45. I conjecture this is why the summary and tail of the data both show negative values for minutes. All negative values seem to correspond to the fifth quarter of a football game (overtime).

  • sec – An integer that represents how many seconds were remaining in the minute the ball was kicked. This implies that seconds also count down from 60.2

  • kickteam – An abbreviation that represents the name of the team kicking the field goal. This should match either the home or away team in every case. A quick look at the head and tail suggests this might be the case, but we will check more thoroughly later on.

  • def – An abbreviation that represents the name of the team defending the field goal. Similarly, this should match either the home or away team in every case and we will check more thoroughly later on.

  • down – An integer (in the range 1 to 4) that represents the down the ball was kicked on. Note there are 2 missing values.

  • togo – An integer that represents the number of yards the kicking team needed to gain in order to get a first down. I assume this number is rounded to an integer for ease of entry and standardization of measurement as you can have a non-integer value of yards to-go in a football game. This does not necessarily introduce any blatant biases to the data.

  • kicker – An integer that corresponds to the name of the player kicking the field goal. We will check later on if this is assigned alphabetically, by rank (success rate), etc.

  • ydline – The yard line the ball was kicked from on the football field. I am particularly troubled that the maximum value of this variable is 86 as 75% of field goals were taken within 26 yards from the end zone. Moreover, a kick from 86 yards away from the end zone corresponds to a 96 yard field goal attempt before you factor in the snapping distance. So, there is reason to believe this variable has at least 1 value error.

  • name – A character string that holds the first letter of the first name and full last name of the kicker.

  • distance – An integer that represents the total distance of the field goal. We can represent the total distance of the field goal with the following equation: \(distance = ydline + endzone + snapdist\), where ydline is the ydline variable, endzone is a constant equal to 10 that represents the distance from the yard line to the goal post, and snapdist is a constant equal to the snapping distance (the distance from the ydline to the place the kick was actually taken). This equation implies that distance is a linear combination of ydline and we should be careful not to include both variables in our model as this would introduce near-perfect multicollinearity—a big problem. The glimpse of the data seems to suggest distance is usually 18 yards greater than ydline.

  • homekick – A binary variable that equals 0 if the away team kicked the field goal or 1 if the home team kicked the field goal.

  • kickdiff – An integer that represents how many points ahead the kicking team is at the time the field goal is kicked. This number is positive when the kicking team (offensive team) is winning, negative when they are losing, and equal to zero when the teams are tied. This implies that \(kickdiff = offscore - defscore\). Similar to distance, we should be careful which variables we include in the model as this is a linear combination of two other variables in the dataset.

  • timerem – An integer that represents the amount of time remaining in the game (in seconds). This value is negative when the game is in overtime.

  • offscore – An integer that represents the number of points the offensive team has already scored at the time the field goal is taken.

  • defscore – An integer that represents the number of points the defensive team has already scored at the time the field goal is taken.

  • season – An integer that represents the season the game was played. All values should—and do—equal 2008.

  • GOOD – A binary variable that equals 0 if the kick was missed and 1 if the kick was made.

  • Missed – A binary variable that equals 0 if the kick was made and 1 if the kick was missed. This is merely the complement of the GOOD variable and one of the two will be dropped later on.

  • Blocked – A binary variable that equals 0 if the kick was not blocked and 1 if it was blocked.

Data Cleaning & Exploratory Data Analysis (EDA)

First, let’s standardize the names of our variables so that all of them are lowercase.

# Change all variable names to lowercase 
names(df) <- tolower(names(df))

Now, for all variables that contain two words and could be confusing, we add an underscore in between words for readability.

# add underscore before team & score to all column names that contain words team or score
names(df) <- str_replace(names(df), "team", "_team")
names(df) <- str_replace(names(df), "score", "_score")
# rename some other variables
df <- df %>% 
  rename(game_date = gamedate, 
         to_go = togo, 
         yd_line = ydline, 
         home_kick = homekick, 
         time_rem = timerem)

We should change the names of the poorly defined variables as well.

# rename poorly defined variables
df <- df %>% 
  rename(kick_score_diff = kickdiff, 
         min_rem = min, 
         sec_rem = sec, 
         def_team = def,
         made = good)

You can see that the names are much easier to read now.

# show column names
names(df)
##  [1] "game_date"       "away_team"       "home_team"      
##  [4] "qtr"             "min_rem"         "sec_rem"        
##  [7] "kick_team"       "def_team"        "down"           
## [10] "to_go"           "kicker"          "yd_line"        
## [13] "name"            "distance"        "home_kick"      
## [16] "kick_score_diff" "time_rem"        "off_score"      
## [19] "def_score"       "season"          "made"           
## [22] "missed"          "blocked"

The game_date variable

The game date isn’t very useful in integer form, so let’s change it to a date object before we continue.

# change game date to a date variable in yyyy-mm-dd format 
df$game_date <- ymd(df$game_date)

In my opinion, the game_date still isn’t particularly helpful. I suggest we change it to an integer that represents what week of the NFL season the game was played or to a binary variable to indicate whether the game was a regular season game or a playoff game. Additionally, it would be nice to have a variable that tells us which game a field goal was taken in, so we can use the game date and home team to variables to compute this. 3

# creates a dataframe that shows number of FGA and made for each game
by_game <- df %>% 
  group_by(game_date, home_team) %>% 
  summarise(num_FGA = n(), num_made = sum(made))
# create an index of game numbers (1 to 263)
game <- seq(1,263)
# create a new variable in game dataframe to represent the game number in the season
by_game$game_num <- game
# create a lookup table that holds the date and home team for each game played
lookup <- by_game %>% 
  select(game_date, home_team, game_num)
lookup <- lookup %>% 
  unite(game_index, game_date, home_team, sep = " ")
# create a column similar to game_index (titled game_index) in the main dataframe (df)
df$temp1 <- df$game_date
df$temp2 <- df$home_team
df <- df %>% unite(game_index, temp1, temp2, sep = " ")
# arrange the dataframe by game_index (same order as lookup dataframe)
df <- df %>% arrange(game_index)
# left join lookup to df on the game_index
df <- left_join(df, lookup, by = "game_index")

# create a new variable playoff that is equal to game date
df <- df %>% mutate(playoff = game_date, week = game_date)
# if game date is during regular season, coerce the value of playoff to zero; else coerce to 1
df$playoff <- ifelse(df$game_date <= "2008-12-28", 0, 1)
df$playoff <- as.integer(df$playoff)
# nested if-else statement to code week of NFL season
df$week <- ifelse(df$week <= "2008-09-08", 1, 
  ifelse(df$week > "2008-09-08" & df$week <= "2008-09-15", 2, 
  ifelse(df$week > "2008-09-15" & df$week <= "2008-09-22", 3,
  ifelse(df$week > "2008-09-22" & df$week <= "2008-09-29", 4,
  ifelse(df$week > "2008-09-29" & df$week <= "2008-10-06", 5, 
  ifelse(df$week > "2008-10-06" & df$week <= "2008-10-13", 6,
  ifelse(df$week > "2008-10-13" & df$week <= "2008-10-20", 7,
  ifelse(df$week > "2008-10-20" & df$week <= "2008-10-27", 8, 
  ifelse(df$week > "2008-10-27" & df$week <= "2008-11-03", 9,
  ifelse(df$week > "2008-11-03" & df$week <= "2008-11-10", 10,
  ifelse(df$week > "2008-11-10" & df$week <= "2008-11-17", 11,
  ifelse(df$week > "2008-11-17" & df$week <= "2008-11-24", 12,
  ifelse(df$week > "2008-11-17" & df$week <= "2008-11-24", 12,
  ifelse(df$week > "2008-11-24" & df$week <= "2008-12-01", 13,
  ifelse(df$week > "2008-12-01" & df$week <= "2008-12-08", 14,
  ifelse(df$week > "2008-12-01" & df$week <= "2008-12-08", 14,
  ifelse(df$week > "2008-12-08" & df$week <= "2008-12-15", 15,
  ifelse(df$week > "2008-12-15" & df$week <= "2008-12-22", 16,
  ifelse(df$week > "2008-12-22" & df$week <= "2008-12-28", 17, 18)))))))))))))))))))
df$week <- as.integer(df$week)
# drop game_date variable
df$game_date <- NULL
df$game_index <- NULL
df$season <- NULL

After the conversion, I dropped the game_date, game_index, and the season variables as we already know it is the 2008 season. We can see from the new by_game dataframe that, on average, there are 3.9505703 field goals attempted per game and 3.4220532 made field goals per game.

Obviously, there is going to be a large difference in the number of kicks during regular season games and playoff games since there were 252 regular season games and 11 playoff games.4

To be precise, only 38 of the 1039 kicks were taken in a playoff game (3.66%). Since the dependent variable in this model will be whether the field goal was made or missed, we should compare the success rates for each type of game. I’ll also use this as an opportunity to drop the missed variable as it is the complement of made.

# drop the missed variable
df$missed <- NULL
# show success rate and number of field goal attempts by game type
temp <- df %>% 
  group_by(playoff) %>% 
  summarise(success_rate = mean(made) * 100, 
            num_attempts = n(), num_games = sum(n_distinct(game_num))) %>% 
  mutate(attempts_per_game = round(num_attempts/num_games, 2))
temp
## # A tibble: 2 × 5
##   playoff success_rate num_attempts num_games attempts_per_game
##     <int>        <dbl>        <int>     <int>             <dbl>
## 1       0     86.71329         1001       252              3.97
## 2       1     84.21053           38        11              3.45

The success rate for playoff games is lower than that of regular season games, but I suspect this has more to do with the sample size of playoff games than anything else.

The away_team and home_team variables

For these two variables, I’m just going to that all teams are represented in the dataset. There are 32 teams in the NFL so this is pretty easy to check.

# check to see if 32 teams are represented in each variable
n_distinct(df$away_team) == 32
## [1] TRUE
n_distinct(df$home_team) == 32
## [1] TRUE

The qtr variable

ggplot(data = df, aes(x = as.factor(qtr))) + geom_bar(aes(fill = as.factor(qtr))) + 
  labs(x = "Quarter", y = "FGA", title = "The Number of Field Goals Attempted by Quarter", fill = "Quarter") + theme(plot.title = element_text(hjust = 0.5))

It seems that most field goals are kicked in the second and fourth quarters, respectively. This can also be seen in the table below.

temp <- df %>% 
  group_by(qtr) %>% 
  summarise(success_rate = mean(made) * 100, 
            num_attempts = n(), num_games = sum(n_distinct(game_num))) %>% 
  mutate(attempts_per_game = round(num_attempts/num_games, 2))
temp
## # A tibble: 5 × 5
##     qtr success_rate num_attempts num_games attempts_per_game
##   <int>        <dbl>        <int>     <int>             <dbl>
## 1     1     89.71963          214       158              1.35
## 2     2     84.26667          375       209              1.79
## 3     3     84.61538          182       144              1.26
## 4     4     88.62745          255       181              1.41
## 5     5     92.30769           13        13              1.00

The min_rem variable

ggplot(data = df, aes(x = min_rem)) + geom_histogram(color="grey30", fill="lightskyblue3", bins = 20) + 
  labs(x = "Minutes Remaining", y = "FGA", title = "The Number of Field Goals Attempted by Minutes Remaining") + theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df, aes(x = min_rem)) + geom_histogram(color="grey30", fill="lightskyblue3", bins = 15) + facet_grid(. ~ made) + labs(x = "Minutes Remaining", y = "FGA", title = "The Number of Field Goals Made by Minutes Remaining") + theme(plot.title = element_text(hjust = 0.5))

The number of field goals attempted is relatively uniform across regulation, but there is a massive spike right around 30 minutes. I’m guessing this is because teams attempt more field goals with little-to-no time remaining in the first half since there is no consequence for missing the field goal—the other team won’t get the ball with plenty of time and good field position.

# calculate number of kicks taken in last minute of first half
df %>% filter(min_rem == 30) %>% nrow()
## [1] 121
# sort number of field goal attempts by minutes remaining in descending order
df %>% group_by(min_rem) %>% summarise(n = n()) %>% arrange(desc(n))
## # A tibble: 67 × 2
##    min_rem     n
##      <int> <int>
## 1       30   121
## 2        0    37
## 3       31    37
## 4       45    31
## 5        2    27
## 6       23    25
## 7       32    24
## 8       18    23
## 9       51    23
## 10      33    21
## # ... with 57 more rows

This seems fairly in line with the data as there were 121 field goals attempted in the last minute of the first half. From the sorted list, the next highest numbers of field goals attempted were in the last minute of the game and the second to last minute of the first half.

The sec_rem variable

It doesn’t really make much sense to keep the second variable as is, so I am going to bin it into 15-second half-open intervals that are open on the left and closed on the right.

df$sec_rem <- ifelse(df$sec_rem > 0 & df$sec_rem <= 15,1, 
                     ifelse(df$sec_rem > 15 & df$sec_rem <= 30,2,
                     ifelse(df$sec_rem > 30 & df$sec_rem <= 45,3, 4)))
df$sec_rem <- as.integer(df$sec_rem)

I’m not going to go very in-depth for this variable but I am going to look at the number of field goals attempted in the last 30 seconds of each half.

df %>% 
  group_by(sec_rem) %>% 
  filter((min_rem == 30 | min_rem == 0) & sec_rem <= 2) %>% 
  nrow()
## [1] 131

The chunk above shows that 131 of the 1039 field goals attempted (12.61%) were in the last 30 seconds of the first or second half. Since we are keeping the min_rem and sec_rem variables we can go ahead and drop the time_rem variable.

df$time_rem <- NULL

The down variable

ggplot(data = df, aes(x = as.factor(down))) + geom_bar(aes(fill = as.factor(down))) + 
  labs(x = "Down", y = "FGA", title = "The Number of Field Goals Attempted by Down", fill = "Down") + theme(plot.title = element_text(hjust = 0.5))

The vast majority of field goals are taken on fourth down. This makes sense intuitively. I’m guessing that most field goals taken on first, second, or third down were with little-to-no time left or in overtime to win the game.

table(df$down)
## 
##   1   2   3   4 
##  35  27  29 946

The table above shows that 946 of the 1039 field goals (91.05%) were attempted on fourth down.

df %>% 
  group_by(down, min_rem) %>% 
  filter(down == 1) %>% 
  summarise(n = n())
## Source: local data frame [3 x 3]
## Groups: down [?]
## 
##    down min_rem     n
##   <int>   <int> <int>
## 1     1       0     7
## 2     1      15     1
## 3     1      30    27

Breaking this down a little further, you can see that 34 of the 35 field goals attempted on first down were in the last minute of the first or second half and 33 of those were in the last 15 seconds of the first or second half. The kick taken on first down at the end of the 3rd quarter is rather odd, so let’s take a closer look.

df %>% 
  filter(down == 1 & min_rem == 15)
## # A tibble: 1 × 22
##   away_team home_team   qtr min_rem sec_rem kick_team def_team  down to_go
##       <chr>     <chr> <int>   <int>   <int>     <chr>    <chr> <int> <int>
## 1       DAL       ARI     3      15       1       DAL      ARI     1     5
## # ... with 13 more variables: kicker <int>, yd_line <int>, name <chr>,
## #   distance <int>, home_kick <int>, kick_score_diff <int>,
## #   off_score <int>, def_score <int>, made <int>, blocked <int>,
## #   game_num <int>, playoff <int>, week <int>

Looking closer at this observation, we can see it was the week 6 matchup between the Arizona Cardinals and Dallas Cowboys. Moreover, it says the Dallas Cowboys kicked the ball when they were losing 21 to 24. ESPN’s summary of the scoring plays for this game clarifies that N.Folk kicked a 52 yard field goal in the fourth quarter of the game with 4 seconds to go.5 Let’s fix this mistake.

index <- which(df$down == 1 & df$min_rem == 15)
df[["qtr"]][[index]] <- 4
df[["min_rem"]][[index]] <- 0
df[index, ]
## # A tibble: 1 × 22
##   away_team home_team   qtr min_rem sec_rem kick_team def_team  down to_go
##       <chr>     <chr> <dbl>   <dbl>   <int>     <chr>    <chr> <int> <int>
## 1       DAL       ARI     4       0       1       DAL      ARI     1     5
## # ... with 13 more variables: kicker <int>, yd_line <int>, name <chr>,
## #   distance <int>, home_kick <int>, kick_score_diff <int>,
## #   off_score <int>, def_score <int>, made <int>, blocked <int>,
## #   game_num <int>, playoff <int>, week <int>

Now that we fixed the error, let’s see if there are any unusual observations on second or third down.

df %>% 
  group_by(down, min_rem) %>% 
  filter(down == 2) %>% 
  summarise(n = n())
## Source: local data frame [6 x 3]
## Groups: down [?]
## 
##    down min_rem     n
##   <int>   <dbl> <int>
## 1     2     -11     1
## 2     2      -7     1
## 3     2      -4     1
## 4     2      -3     1
## 5     2       0     3
## 6     2      30    20

There doesn’t seem to be any obvious mistakes as all of the field goals attempted on second down were at the end of a half or in overtime.

df %>% 
  group_by(down, min_rem) %>% 
  filter(down == 3) %>% 
  summarise(n = n())
## Source: local data frame [6 x 3]
## Groups: down [?]
## 
##    down min_rem     n
##   <int>   <dbl> <int>
## 1     3     -11     1
## 2     3      -6     1
## 3     3      -4     1
## 4     3       0    10
## 5     3      30    15
## 6     3      39     1

It doesn’t sound right that someone attempted a field goal on third down with about 9 minutes to go in the second quarter. Let’s check this observation out.

df[which(df$down == 3 & df$min_rem == 39), ]
## # A tibble: 1 × 22
##   away_team home_team   qtr min_rem sec_rem kick_team def_team  down to_go
##       <chr>     <chr> <dbl>   <dbl>   <int>     <chr>    <chr> <int> <int>
## 1       CHI       CAR     2      39       1       CHI      CAR     3     8
## # ... with 13 more variables: kicker <int>, yd_line <int>, name <chr>,
## #   distance <int>, home_kick <int>, kick_score_diff <int>,
## #   off_score <int>, def_score <int>, made <int>, blocked <int>,
## #   game_num <int>, playoff <int>, week <int>

Looking at the summary of scoring plays, we can see that the only field goal the Chicago Bears took was with 39 minutes remaining, but they were ahead 7-0 and they actually took the kick on fourth down with 8 yards to go.6 Let’s fix these mistakes.

index <- which(df$down == 3 & df$min_rem == 39)
df[["down"]][[index]] <- 4
df[["kick_score_diff"]][[index]] <- 7
df[["off_score"]][[index]] <- 7
df[index, ]
## # A tibble: 1 × 22
##   away_team home_team   qtr min_rem sec_rem kick_team def_team  down to_go
##       <chr>     <chr> <dbl>   <dbl>   <int>     <chr>    <chr> <dbl> <int>
## 1       CHI       CAR     2      39       1       CHI      CAR     4     8
## # ... with 13 more variables: kicker <int>, yd_line <int>, name <chr>,
## #   distance <int>, home_kick <int>, kick_score_diff <dbl>,
## #   off_score <dbl>, def_score <int>, made <int>, blocked <int>,
## #   game_num <int>, playoff <int>, week <int>

Next we should take care of the two missing values.

which(is.na(df$down))
## [1] 650 964
index <- 650
df[index, ]
## # A tibble: 1 × 22
##   away_team home_team   qtr min_rem sec_rem kick_team def_team  down to_go
##       <chr>     <chr> <dbl>   <dbl>   <int>     <chr>    <chr> <dbl> <int>
## 1       NYG       ARI     2      30       1       ARI      NYG    NA    NA
## # ... with 13 more variables: kicker <int>, yd_line <int>, name <chr>,
## #   distance <int>, home_kick <int>, kick_score_diff <dbl>,
## #   off_score <dbl>, def_score <int>, made <int>, blocked <int>,
## #   game_num <int>, playoff <int>, week <int>
df[["down"]][[index]] <- 1
df[["to_go"]][[index]] <- 10
df[["yd_line"]][[index]] <- 58
df[index, ]
## # A tibble: 1 × 22
##   away_team home_team   qtr min_rem sec_rem kick_team def_team  down to_go
##       <chr>     <chr> <dbl>   <dbl>   <int>     <chr>    <chr> <dbl> <dbl>
## 1       NYG       ARI     2      30       1       ARI      NYG     1    10
## # ... with 13 more variables: kicker <int>, yd_line <dbl>, name <chr>,
## #   distance <int>, home_kick <int>, kick_score_diff <dbl>,
## #   off_score <dbl>, def_score <int>, made <int>, blocked <int>,
## #   game_num <int>, playoff <int>, week <int>
index <- 964
df[index, ]
## # A tibble: 1 × 22
##   away_team home_team   qtr min_rem sec_rem kick_team def_team  down to_go
##       <chr>     <chr> <dbl>   <dbl>   <int>     <chr>    <chr> <dbl> <dbl>
## 1       DET        GB     2      30       1        GB      DET    NA    NA
## # ... with 13 more variables: kicker <int>, yd_line <dbl>, name <chr>,
## #   distance <int>, home_kick <int>, kick_score_diff <dbl>,
## #   off_score <dbl>, def_score <int>, made <int>, blocked <int>,
## #   game_num <int>, playoff <int>, week <int>
df[["down"]][[index]] <- 1
df[["to_go"]][[index]] <- 10
df[["yd_line"]][[index]] <- 59

Mistakes were fixed by the same method as earlier.7 8 9 Now that we fixed some mistakes, let’s look at the distribution again.

table(df$down)
## 
##   1   2   3   4 
##  37  27  28 947
df %>% 
  group_by(down, min_rem) %>% 
  filter(min_rem == 30 | min_rem <= 0) %>% 
  summarise(n = n())
## Source: local data frame [20 x 3]
## Groups: down [?]
## 
##     down min_rem     n
##    <dbl>   <dbl> <int>
## 1      1       0     8
## 2      1      30    29
## 3      2     -11     1
## 4      2      -7     1
## 5      2      -4     1
## 6      2      -3     1
## 7      2       0     3
## 8      2      30    20
## 9      3     -11     1
## 10     3      -6     1
## 11     3      -4     1
## 12     3       0    10
## 13     3      30    15
## 14     4     -15     1
## 15     4     -13     1
## 16     4      -8     1
## 17     4      -6     1
## 18     4      -5     2
## 19     4       0    17
## 20     4      30    57

This tells the story pretty well: all field goals that were not attempted on fourth down were either attempted as the time in the first or second half was expiring or during overtime. I am going to go ahead and drop the to_go variable as the down variable tells the whole story.

df$to_go <- NULL

The kick_score_diff, off_score, and def_score variables

We are more concerned with whether a team is ahead, behind, or tied when they attempt a field goal and the margin by which they are ahead or behind. For this reason, I chose to drop the off_score and def_score variables in favor of the all-encompassing kick_score_diff variable.

df$off_score <- NULL
df$def_score <- NULL

The yd_line and distance variables

Since distance is a linear combination of yd_line and other factors, I am going to drop yd_line.

df$yd_line <- NULL

I’ll also take this time to quickly look at the distribute of field goals attempted by distance.

ggplot(data = df, aes(x = distance)) + geom_histogram(color="grey30", fill="lightskyblue3", bins = 20) + 
  labs(x = "Distance (yards)", y = "FGA", title = "The Number of Field Goals Attempted by Distance") + theme(plot.title = element_text(hjust = 0.5))

# filter df by min_rem
df %>% filter(min_rem == 30 | min_rem == 0) %>% 
  ggplot(aes(x = distance)) + geom_histogram(color="grey30", fill="lightskyblue3", bins = 20) + 
  labs(x = "Distance (yards)", y = "FGA", title = "The Distance of Field Goals Attempted in the Last Minute of a Half") + theme(plot.title = element_text(hjust = 0.5))

The blocked variable

table(df$blocked)
## 
##    0    1 
## 1017   22
# two-way table
xtabs(~ made + blocked, data = df)
##     blocked
## made   0   1
##    0 139   0
##    1 878  22
# list of blocked field goals
df %>% filter(blocked == 1)
## # A tibble: 22 × 18
##    away_team home_team   qtr min_rem sec_rem kick_team def_team  down
##        <chr>     <chr> <dbl>   <dbl>   <int>     <chr>    <chr> <dbl>
## 1        SEA       BUF     4      11       3       BUF      SEA     4
## 2        MIN        GB     2      30       1        GB      MIN     3
## 3        ARI       NYJ     1      50       1       ARI      NYJ     4
## 4         SD       OAK     2      31       1        SD      OAK     4
## 5        MIN        NO     1      51       3        NO      MIN     4
## 6        DET       MIN     4      10       1       MIN      DET     4
## 7        PHI        SF     2      30       1       PHI       SF     1
## 8        MIN       CHI     2      33       4       CHI      MIN     4
## 9        IND        GB     3      24       1       IND       GB     4
## 10        SF       NYG     3      17       1       NYG       SF     4
## # ... with 12 more rows, and 10 more variables: kicker <int>, name <chr>,
## #   distance <int>, home_kick <int>, kick_score_diff <dbl>, made <int>,
## #   blocked <int>, game_num <int>, playoff <int>, week <int>

Only 22 of the 1039 (2.12%) field goals attempted in the 2008 NFL season were blocked. For some reason, the two-way table shows that all blocked field goals were made. Just looking at the first game in which a field goal was blocked, we can see the kick was actually not made.10 Therefore, I am going to assume—to save the time of checking all 22 games—that all blocked field goals were actually missed.

# save indexes of blocked field goals 
index <- which(df$blocked == 1)
# change made to missed
df$made[index] <- 0
xtabs(~ made + blocked, data = df)
##     blocked
## made   0   1
##    0 139  22
##    1 878   0

The kicker and name variables

It is more than likely that at least one of the 32 teams was not happy with a kicker’s performance and tried a second out, so let’s organize the kickers by the kicker ID.

# check to see if any names are duplicated
df %>% 
  group_by(name, kicker, kick_team) %>% 
  summarise(n = n(), made = mean(made)) %>% 
  arrange(name)
## Source: local data frame [41 x 5]
## Groups: name, kicker [41]
## 
##           name kicker kick_team     n      made
##          <chr>  <int>     <chr> <int>     <dbl>
## 1  A.Vinatieri     15       IND    26 0.8076923
## 2      C.Barth     18        KC    12 0.8333333
## 3      D.Akers     29       PHI    50 0.8400000
## 4  D.Carpenter     19       MIA    26 0.8461538
## 5     D.Rayner      9       CIN     1 1.0000000
## 6    G.Hartley     24        NO    13 1.0000000
## 7      J.Brown     34       STL     1 0.0000000
## 8      J.Brown     35       STL    35 0.8857143
## 9     J.Carney     25       NYG    43 0.8837209
## 10      J.Elam      2       ATL    32 0.9375000
## # ... with 31 more rows
# fix duplicate names and their IDs
index <- which(df$name == "S.Hauschka" & df$kicker == 3)
df[["kicker"]][[index]] <- 4
index <- which(df$name == "J.Brown" & df$kicker == 34)
df[["kicker"]][[index]] <- 35
# check to see if any numbers are duplicated
df %>% 
  group_by(name, kicker, kick_team) %>% 
  summarise(n = n(), made = mean(made)) %>% 
  arrange(kicker)
## Source: local data frame [39 x 5]
## Groups: name, kicker [39]
## 
##          name kicker kick_team     n      made
##         <chr>  <dbl>     <chr> <int>     <dbl>
## 1   N.Rackers      1       ARI    35 0.8571429
## 2      J.Elam      2       ATL    32 0.9375000
## 3    M.Stover      3       BAL    37 0.8378378
## 4  S.Hauschka      4       BAL     2 0.5000000
## 5   R.Lindell      5       BUF    38 0.7894737
## 6     J.Kasay      6       CAR    31 0.9032258
## 7     R.Gould      7       CHI    29 0.8965517
## 8    S.Graham      8       CIN    24 0.8750000
## 9    D.Rayner      9       CIN     1 1.0000000
## 10   P.Dawson      9       CLE    36 0.8333333
## # ... with 29 more rows
# fix duplicate kicker IDs
index <- which(df$kicker == 9 & df$name == "D.Rayner")
df[["kicker"]][[index]] <- 38
index <- which(df$kicker == 26 & df$name == "M.Nugent")
df[["kicker"]][[index]] <- 39
# final check
setequal(n_distinct(df$name), n_distinct(df$kicker))
## [1] TRUE

This table shows us a few things. The kicker column is determined by team name in alphabetical order, not kicker rank. Most kickers took at least 10 field goals and it probably only makes sense to keep kickers that attempted at least 5 field goals.

# dataframe of kickers with less than 5 attempts
temp <- df %>% 
  group_by(name, kicker) %>% 
  summarise(n = n(), made = mean(made)) %>% 
  filter(n < 5)
# list of names of kickers with less than 5 attempts
names = temp$name
# filter dataset to not include the kickers with less than 5 attempts
for (n in names) {
  df <- df %>% filter(name != n)
}

Now I’m going to rank kickers by made field goal percentage from highest to lowest. For kickers that have the same made field goal percentage, I am going to break the tie by giving the higher rank to the kicker with more field goals attempted. We could also try to incorporate whether or not a kicker only attempted short distance kicks and whether 60 yard field goals in the last second of a half should count against them, but I chose to save this exercise for another time.

temp <- df %>% 
  group_by(name) %>% 
  summarise(n= n(), made_pct = mean(made)) %>% 
  arrange(desc(made_pct), desc(n)) %>% 
  mutate(kicker_rank = rank(desc(made_pct), ties.method = "first")) %>% 
  arrange(kicker_rank)
temp
## # A tibble: 34 × 4
##            name     n  made_pct kicker_rank
##           <chr> <int>     <dbl>       <int>
## 1     G.Hartley    13 1.0000000           1
## 2      J.Hanson    22 0.9545455           2
## 3        J.Elam    32 0.9375000           3
## 4        N.Folk    23 0.9130435           4
## 5       J.Kasay    31 0.9032258           5
## 6  S.Gostkowski    40 0.9000000           6
## 7       R.Gould    29 0.8965517           7
## 8       Je.Reed    36 0.8888889           8
## 9        O.Mare    27 0.8888889           9
## 10     J.Carney    43 0.8837209          10
## # ... with 24 more rows

Now that we have a kicker rank, I am going to suggest we incorporate it into the dataframe and then drop the name and kicker columns.

temp <- temp %>% 
  select(name, kicker_rank)
df <- left_join(df, temp, by = "name")
df$name <- NULL
df$kicker <- NULL

The kick_team and def_team variables

Since we already have variables for the home and away team and a binary variable that represents whether the home team kicked the field goal, we can drop the kick_team and def_team variables.

df$kick_team <- NULL
df$def_team <- NULL

Getting variables ready to model

df$game_num <- NULL
df$week <- NULL
df$home_team <- NULL
df$away_team <- NULL
df$qtr <- as.integer(df$qtr)
df$min_rem <- as.integer(df$min_rem)
df$down <- as.integer(df$down)
df$kick_score_diff <- as.integer(df$kick_score_diff)
df$made <- as.integer(df$made)

Model Building

I’m going to start off with a ‘kitchen sink’ logistic regression model and then use backward selection to minimize the Akaike Information Criterion (AIC). I’ll also start with a model with just an intercept and use forward selection. Lastly, I will fit some models on my own and then evaluate them and pick one.

Kitchen Sink

kitchen_sink <- glm(made ~ ., data = df, family = "binomial")
summary(kitchen_sink)
## 
## Call:
## glm(formula = made ~ ., family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9687   0.1740   0.3102   0.5273   1.4773  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       7.344320   1.652651   4.444 8.83e-06 ***
## qtr               0.073455   0.372370   0.197    0.844    
## min_rem           0.003755   0.024413   0.154    0.878    
## sec_rem           0.111009   0.094151   1.179    0.238    
## down              0.010477   0.151632   0.069    0.945    
## distance         -0.122022   0.013014  -9.376  < 2e-16 ***
## home_kick        -0.261013   0.208709  -1.251    0.211    
## kick_score_diff  -0.001208   0.010801  -0.112    0.911    
## blocked         -18.612726 484.104941  -0.038    0.969    
## playoff          -0.385053   0.502211  -0.767    0.443    
## kicker_rank      -0.052030   0.011390  -4.568 4.92e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 882.83  on 1029  degrees of freedom
## Residual deviance: 636.77  on 1019  degrees of freedom
## AIC: 658.77
## 
## Number of Fisher Scoring iterations: 15

Stepwise Selection (Forward and Backward)

backward <- step(kitchen_sink, direction = "backward")
nothing <- glm(made ~ 1, data = df, family = "binomial")
forward = step(nothing,
scope=list(lower=formula(nothing),upper=formula(kitchen_sink)), direction="forward")
summary(forward)
## 
## Call:
## glm(formula = made ~ distance + blocked + kicker_rank, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9405   0.1747   0.3114   0.5290   1.4818  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   7.83916    0.61385  12.771  < 2e-16 ***
## distance     -0.12387    0.01277  -9.701  < 2e-16 ***
## blocked     -18.63706  482.43798  -0.039    0.969    
## kicker_rank  -0.05144    0.01125  -4.572 4.84e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 882.83  on 1029  degrees of freedom
## Residual deviance: 640.67  on 1026  degrees of freedom
## AIC: 648.67
## 
## Number of Fisher Scoring iterations: 15
summary(backward)
## 
## Call:
## glm(formula = made ~ distance + blocked + kicker_rank, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9405   0.1747   0.3114   0.5290   1.4818  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   7.83916    0.61385  12.771  < 2e-16 ***
## distance     -0.12387    0.01277  -9.701  < 2e-16 ***
## blocked     -18.63706  482.43798  -0.039    0.969    
## kicker_rank  -0.05144    0.01125  -4.572 4.84e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 882.83  on 1029  degrees of freedom
## Residual deviance: 640.67  on 1026  degrees of freedom
## AIC: 648.67
## 
## Number of Fisher Scoring iterations: 15

Forward and backward selection yielded the same logistic regression model. Let’s look at the correlations to see which variables are strongly correlated and which variables are not correlated with made, the dependent variable.

# correlation matrix
cor(df)
##                         qtr       min_rem     sec_rem         down
## qtr              1.00000000 -0.9649035798  0.02355001 -0.022040995
## min_rem         -0.96490358  1.0000000000  0.01402472  0.107379212
## sec_rem          0.02355001  0.0140247204  1.00000000  0.316634431
## down            -0.02204100  0.1073792122  0.31663443  1.000000000
## distance         0.03226611 -0.0482647852 -0.09346194 -0.157767104
## home_kick        0.02103755 -0.0009982006  0.01589280 -0.008984271
## kick_score_diff  0.08629239 -0.0932438740 -0.03365168 -0.012594241
## made            -0.00691875  0.0196371092  0.07621336  0.097224990
## blocked          0.01198707 -0.0157763452 -0.01844813 -0.029809613
## playoff         -0.02805134  0.0279569414  0.01721630  0.016847084
## kicker_rank      0.03271421 -0.0354921955  0.03068206  0.008019830
##                    distance     home_kick kick_score_diff         made
## qtr              0.03226611  0.0210375486      0.08629239 -0.006918750
## min_rem         -0.04826479 -0.0009982006     -0.09324387  0.019637109
## sec_rem         -0.09346194  0.0158928046     -0.03365168  0.076213359
## down            -0.15776710 -0.0089842706     -0.01259424  0.097224990
## distance         1.00000000  0.0537687554      0.02308362 -0.350951337
## home_kick        0.05376876  1.0000000000      0.19035036 -0.055112702
## kick_score_diff  0.02308362  0.1903503638      1.00000000 -0.017160075
## made            -0.35095134 -0.0551127020     -0.01716007  1.000000000
## blocked          0.06360226  0.0157271807      0.01616502 -0.347064963
## playoff         -0.01670827 -0.0484708852      0.03709635 -0.002442183
## kicker_rank      0.04306622  0.0014345914     -0.03412495 -0.161181053
##                     blocked      playoff  kicker_rank
## qtr              0.01198707 -0.028051337  0.032714210
## min_rem         -0.01577635  0.027956941 -0.035492195
## sec_rem         -0.01844813  0.017216296  0.030682055
## down            -0.02980961  0.016847084  0.008019830
## distance         0.06360226 -0.016708272  0.043066218
## home_kick        0.01572718 -0.048470885  0.001434591
## kick_score_diff  0.01616502  0.037096351 -0.034124949
## made            -0.34706496 -0.002442183 -0.161181053
## blocked          1.00000000 -0.028914590  0.029391920
## playoff         -0.02891459  1.000000000 -0.018426342
## kicker_rank      0.02939192 -0.018426342  1.000000000

Let’s drop some variables and fit some more models.

# Drop some variables
df$min_rem <- NULL
df$sec_rem <- NULL
df$qtr <- NULL
df$kick_score_diff <- NULL
df$playoff <- NULL
# model 3
model3 <- glm(made ~ ., data = df, family = "binomial")
summary(model3)
## 
## Call:
## glm(formula = made ~ ., family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9833   0.1742   0.3130   0.5318   1.4832  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   7.63142    0.86058   8.868  < 2e-16 ***
## down          0.07573    0.13339   0.568    0.570    
## distance     -0.12264    0.01300  -9.437  < 2e-16 ***
## home_kick    -0.25621    0.20412  -1.255    0.209    
## blocked     -18.61862  482.06588  -0.039    0.969    
## kicker_rank  -0.05120    0.01126  -4.549 5.38e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 882.83  on 1029  degrees of freedom
## Residual deviance: 638.78  on 1024  degrees of freedom
## AIC: 650.78
## 
## Number of Fisher Scoring iterations: 15
# model 4
model4 <- glm(made ~ . - home_kick, data = df, family = "binomial")
summary(model4)
## 
## Call:
## glm(formula = made ~ . - home_kick, family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9396   0.1769   0.3117   0.5324   1.4660  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   7.50779    0.85320   8.800  < 2e-16 ***
## down          0.07409    0.13354   0.555    0.579    
## distance     -0.12264    0.01295  -9.474  < 2e-16 ***
## blocked     -18.62511  482.02358  -0.039    0.969    
## kicker_rank  -0.05139    0.01126  -4.565 4.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 882.83  on 1029  degrees of freedom
## Residual deviance: 640.36  on 1025  degrees of freedom
## AIC: 650.36
## 
## Number of Fisher Scoring iterations: 15

I have chose to use the model found in both forward and backward selection as I also came to the conclusion that the model was the best classifier from the summaries.

# final model
final_model <- glm(made ~ distance + kicker_rank + blocked, data = df, family = "binomial")
summary(final_model)
## 
## Call:
## glm(formula = made ~ distance + kicker_rank + blocked, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9405   0.1747   0.3114   0.5290   1.4818  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   7.83916    0.61385  12.771  < 2e-16 ***
## distance     -0.12387    0.01277  -9.701  < 2e-16 ***
## kicker_rank  -0.05144    0.01125  -4.572 4.84e-06 ***
## blocked     -18.63706  482.43798  -0.039    0.969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 882.83  on 1029  degrees of freedom
## Residual deviance: 640.67  on 1026  degrees of freedom
## AIC: 648.67
## 
## Number of Fisher Scoring iterations: 15

Which variables contribute most to the model?

From the regression summary, we can see that kicker_rank and distance are statistically significant predictors given any confidence level. It is worth noting that I kept the blocked variable in the model even though it isn’t significant. In my opinion, the only reason it doesn’t appear significant is because only about 2% of all field goals attempted were blocked. The estimated regression coefficient for blocked is much larger than the others, but it has such a high standard error that the test statistic does not appear significantly different. However, this probably has more to do with the large number of field goals that weren’t blocked and every blocked field goal was missed. This is why I chose to leave it in the model.

To actually interpret these coefficients, we should exponentiate them.

odds ratios and 95% CI

round(exp(coef(final_model)), 4)
## (Intercept)    distance kicker_rank     blocked 
##   2538.0803      0.8835      0.9499      0.0000

These coefficients tell us that for every 1 yard increase in distance, the odds of making a field goal decrease by a factor of 0.1165. Similary, for every increase in the kicker_rank (the kicker has a worse made field goal percentage), the odds of making a field goal decrease by a factor of 0.0501.

What is the “hit rate” if we use 50% as the break point?

To figure out the “hit rate,” I am going to compute the predicted probabilities for each observation and see which ones were correctly classified.

pred_prob <- predict.glm(final_model, type = "response")
df$pred_prob = pred_prob
df$predicted = ifelse(df$pred_prob >= .5, 1, 0)
xtabs(~ made + predicted, data = df)
##     predicted
## made   0   1
##    0  39 119
##    1  16 856
FN <- 16
FP <- 119
TN <- 39
TP <- 856
N <- sum(39, 119)
P <- sum(16, 856)
N_star <- sum(39, 16)
P_star <- sum(119, 856)
false_pos_rate <- FP/N
true_pos_rate <- TP/P
pos_pred_rate <- TP/P_star
neg_pred_rate <- TN/N_star

Using 50% as the hit rate, we correctly classified 895 of the 1030 field goals attempted. This means our model had an accuracy level of 86.89 percent. Our model boasts a true positive rate of 98.17 percent. But, we also have an extremely high false positive rate of 75.32 percent. Clearly, our model is much better at classifying made field goals than it is at classifying missed field goals. Much of this has to do with the fact that there are many more made field goals than there are missed field goals. Another reason is the small number of predictors. I’m sure adding more predictors could alleviate some of these problems, but I am wary of doing so to overfit the model.

Calculate some test values to show whether our model is well-defined.

Akaike Information Criterion (AIC)

AIC(final_model)
## [1] 648.6675

We already know based on the feature selection process that our model has the lowest AIC possible.

Goodness of Fit

Pseudo-\(R^2\)

library(survey)
library(ResourceSelection)
library(pscl)
library(pROC)
pR2(final_model)
##          llh      llhNull           G2     McFadden         r2ML 
## -320.3337536 -441.4151108  242.1627143    0.2743027    0.2095157 
##         r2CU 
##    0.3639851

The pseudo-\(R^2\) is not very high for our model which implies that the predictive power of our model is not very strong.

Hosmer-Lemeshow Goodness of Fit Test

hoslem.test(df$made, fitted(final_model))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  df$made, fitted(final_model)
## X-squared = 5.9557, df = 8, p-value = 0.6522

The Hosmer-Lemeshow test examines whether the observed proportion of events are similar to the predicted probabilities of occurences in subgroups of the dataset using a pearson chi-square statistic of observed and expected frequencies. In other words, we are testing to see whether the model we specified is a better fit than a model with just an intercept. The large p-value helps us feel a little more confident about our model being a good fit as we do not reject the null hypothesis.

Area under the Receiver Operating Curve

f <- roc(df$made, df$predicted)
f
## 
## Call:
## roc.default(response = df$made, predictor = df$predicted)
## 
## Data: df$predicted in 158 controls (df$made 0) < 872 cases (df$made 1).
## Area under the curve: 0.6142
plot(f)

Holding all other variables constant, what are the chances of making a 50 yard field goal?

newdata1 <- with(df, data_frame(kicker_rank = mean(kicker_rank), blocked = mean(blocked), distance = 50))
newdata1$pred <- predict(final_model, newdata = newdata1, type = "response")
newdata1
## # A tibble: 1 × 4
##   kicker_rank    blocked distance      pred
##         <dbl>      <dbl>    <dbl>     <dbl>
## 1    17.27864 0.02135922       50 0.5887768

All else equal, the model predicts a 58.88% chance of making a 50 yard field goal.

References


  1. https://en.wikipedia.org/wiki/2008_NFL_season

  2. Note however that since we are referring to seconds remaining in a minute that seconds only takes on values in the range [0, 59]. This implies seconds are recorded in modulo 60. Thus, 0 and 60 are congruent since they share the same equivalence class.

  3. The week variable was coded using the information found at http://www.databasefootball.com/boxscores/scheduleyear.htm?yr=2008&lg=NFL.

  4. It is worth noting that each team usually plays 16 games in the regular season which means in a normal season that 256 regular season games are played.

  5. http://www.espn.com/nfl/playbyplay?gameId=281012022

  6. http://www.espn.com/nfl/playbyplay?gameId=280914029

  7. http://www.espn.com/nfl/playbyplay?gameId=281123022

  8. http://www.espn.com/nfl/playbyplay?gameId=281228009

  9. Note that the yard lines were changed to 58 and 59 because the field goals were attempted from their own 42 and 41 yard lines, respectively. I thought counting yard line from zero and not switching scales once the 50 yard line was crossed was less confusing than the data collector’s method of multiplying the yard line by a factor of 2.

  10. http://www.espn.com/nfl/playbyplay?gameId=280907002