Introduction

For this assignment I chose FiveThirtyEight’s NFL Elo dataset. This dataset uses historical NFL data dating back to the early day’s of the league to asses which team has the highest likelihood of winning individual games as well as playoff and superbowl games. For this dataset, the focus is specifically on the 2022 NFL season, with the predicted winner being the Kansas City Chiefs. The link to the article is below.

https://projects.fivethirtyeight.com/2022-nfl-predictions/

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.3.3
elo <- "https://projects.fivethirtyeight.com/nfl-api/nfl_elo.csv"
elo_latest <- "https://projects.fivethirtyeight.com/nfl-api/nfl_elo_latest.csv"

nfl_elo <- read_csv(elo)
## Rows: 17379 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): playoff, team1, team2, qb1, qb2
## dbl  (27): season, neutral, elo1_pre, elo2_pre, elo_prob1, elo_prob2, elo1_p...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nfl_elo_latest <- read_csv(elo_latest)
## Rows: 284 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): playoff, team1, team2, qb1, qb2
## dbl  (27): season, neutral, elo1_pre, elo2_pre, elo_prob1, elo_prob2, elo1_p...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(nfl_elo)
## spc_tbl_ [17,379 × 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date          : Date[1:17379], format: "1920-09-26" "1920-10-03" ...
##  $ season        : num [1:17379] 1920 1920 1920 1920 1920 1920 1920 1920 1920 1920 ...
##  $ neutral       : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ playoff       : chr [1:17379] NA NA NA NA ...
##  $ team1         : chr [1:17379] "RII" "AKR" "BFF" "DAY" ...
##  $ team2         : chr [1:17379] "STP" "WHE" "WBU" "COL" ...
##  $ elo1_pre      : num [1:17379] 1504 1503 1478 1493 1516 ...
##  $ elo2_pre      : num [1:17379] 1300 1300 1300 1505 1478 ...
##  $ elo_prob1     : num [1:17379] 0.825 0.824 0.802 0.576 0.644 ...
##  $ elo_prob2     : num [1:17379] 0.175 0.176 0.198 0.424 0.356 ...
##  $ elo1_post     : num [1:17379] 1516 1515 1490 1515 1542 ...
##  $ elo2_post     : num [1:17379] 1288 1288 1288 1482 1452 ...
##  $ qbelo1_pre    : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qbelo2_pre    : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qb1           : chr [1:17379] NA NA NA NA ...
##  $ qb2           : chr [1:17379] NA NA NA NA ...
##  $ qb1_value_pre : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qb2_value_pre : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qb1_adj       : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qb2_adj       : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qbelo_prob1   : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qbelo_prob2   : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qb1_game_value: num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qb2_game_value: num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qb1_value_post: num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qb2_value_post: num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qbelo1_post   : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ qbelo2_post   : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ score1        : num [1:17379] 48 43 32 14 45 10 20 48 26 25 ...
##  $ score2        : num [1:17379] 0 0 6 0 0 0 0 0 0 7 ...
##  $ quality       : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ importance    : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  $ total_rating  : num [1:17379] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   season = col_double(),
##   ..   neutral = col_double(),
##   ..   playoff = col_character(),
##   ..   team1 = col_character(),
##   ..   team2 = col_character(),
##   ..   elo1_pre = col_double(),
##   ..   elo2_pre = col_double(),
##   ..   elo_prob1 = col_double(),
##   ..   elo_prob2 = col_double(),
##   ..   elo1_post = col_double(),
##   ..   elo2_post = col_double(),
##   ..   qbelo1_pre = col_double(),
##   ..   qbelo2_pre = col_double(),
##   ..   qb1 = col_character(),
##   ..   qb2 = col_character(),
##   ..   qb1_value_pre = col_double(),
##   ..   qb2_value_pre = col_double(),
##   ..   qb1_adj = col_double(),
##   ..   qb2_adj = col_double(),
##   ..   qbelo_prob1 = col_double(),
##   ..   qbelo_prob2 = col_double(),
##   ..   qb1_game_value = col_double(),
##   ..   qb2_game_value = col_double(),
##   ..   qb1_value_post = col_double(),
##   ..   qb2_value_post = col_double(),
##   ..   qbelo1_post = col_double(),
##   ..   qbelo2_post = col_double(),
##   ..   score1 = col_double(),
##   ..   score2 = col_double(),
##   ..   quality = col_double(),
##   ..   importance = col_double(),
##   ..   total_rating = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(nfl_elo_latest)
## spc_tbl_ [284 × 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date          : Date[1:284], format: "2022-09-08" "2022-09-11" ...
##  $ season        : num [1:284] 2022 2022 2022 2022 2022 ...
##  $ neutral       : num [1:284] 0 0 0 0 0 0 0 0 0 0 ...
##  $ playoff       : chr [1:284] NA NA NA NA ...
##  $ team1         : chr [1:284] "LAR" "MIA" "CIN" "ATL" ...
##  $ team2         : chr [1:284] "BUF" "NE" "PIT" "NO" ...
##  $ elo1_pre      : num [1:284] 1615 1540 1558 1437 1365 ...
##  $ elo2_pre      : num [1:284] 1614 1537 1510 1544 1509 ...
##  $ elo_prob1     : num [1:284] 0.593 0.597 0.657 0.44 0.388 ...
##  $ elo_prob2     : num [1:284] 0.407 0.403 0.343 0.56 0.612 ...
##  $ elo1_post     : num [1:284] 1577 1561 1539 1431 1344 ...
##  $ elo2_post     : num [1:284] 1652 1516 1529 1550 1530 ...
##  $ qbelo1_pre    : num [1:284] 1583 1518 1550 1379 1391 ...
##  $ qbelo2_pre    : num [1:284] 1613 1514 1473 1516 1546 ...
##  $ qb1           : chr [1:284] "Matthew Stafford" "Tua Tagovailoa" "Joe Burrow" "Marcus Mariota" ...
##  $ qb2           : chr [1:284] "Josh Allen" "Mac Jones" "Mitch Trubisky" "Jameis Winston" ...
##  $ qb1_value_pre : num [1:284] 200 139 174 112 110 ...
##  $ qb2_value_pre : num [1:284] 260 144 168 151 163 ...
##  $ qb1_adj       : num [1:284] 5.939 -0.539 3.743 -12.436 19.853 ...
##  $ qb2_adj       : num [1:284] 9.145 0.149 3.347 3.306 -3.618 ...
##  $ qbelo_prob1   : num [1:284] 0.534 0.582 0.675 0.357 0.384 ...
##  $ qbelo_prob2   : num [1:284] 0.466 0.418 0.325 0.643 0.616 ...
##  $ qb1_game_value: num [1:284] -47.5 179.6 23.3 314.4 116.2 ...
##  $ qb2_game_value: num [1:284] 402 122 112 152 112 ...
##  $ qb1_value_post: num [1:284] 176 143 159 132 111 ...
##  $ qb2_value_post: num [1:284] 274 141 163 151 158 ...
##  $ qbelo1_post   : num [1:284] 1549 1540 1530 1375 1371 ...
##  $ qbelo2_post   : num [1:284] 1647 1492 1493 1521 1567 ...
##  $ score1        : num [1:284] 10 20 20 26 9 24 28 35 19 20 ...
##  $ score2        : num [1:284] 31 7 23 27 24 26 22 38 10 20 ...
##  $ quality       : num [1:284] 91 56 56 19 34 10 18 33 37 21 ...
##  $ importance    : num [1:284] 59 85 75 54 51 48 41 64 59 51 ...
##  $ total_rating  : num [1:284] 75 71 66 37 43 29 30 49 48 36 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   season = col_double(),
##   ..   neutral = col_double(),
##   ..   playoff = col_character(),
##   ..   team1 = col_character(),
##   ..   team2 = col_character(),
##   ..   elo1_pre = col_double(),
##   ..   elo2_pre = col_double(),
##   ..   elo_prob1 = col_double(),
##   ..   elo_prob2 = col_double(),
##   ..   elo1_post = col_double(),
##   ..   elo2_post = col_double(),
##   ..   qbelo1_pre = col_double(),
##   ..   qbelo2_pre = col_double(),
##   ..   qb1 = col_character(),
##   ..   qb2 = col_character(),
##   ..   qb1_value_pre = col_double(),
##   ..   qb2_value_pre = col_double(),
##   ..   qb1_adj = col_double(),
##   ..   qb2_adj = col_double(),
##   ..   qbelo_prob1 = col_double(),
##   ..   qbelo_prob2 = col_double(),
##   ..   qb1_game_value = col_double(),
##   ..   qb2_game_value = col_double(),
##   ..   qb1_value_post = col_double(),
##   ..   qb2_value_post = col_double(),
##   ..   qbelo1_post = col_double(),
##   ..   qbelo2_post = col_double(),
##   ..   score1 = col_double(),
##   ..   score2 = col_double(),
##   ..   quality = col_double(),
##   ..   importance = col_double(),
##   ..   total_rating = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
nfl_elo_clean <- nfl_elo %>%
  select(season, date, team1, team2, elo1_pre, elo2_pre, elo_prob1, elo_prob2, score1, score2, qb1, qb2, qbelo1_pre, qbelo2_pre,)

nfl_elo_clean <- nfl_elo_clean %>%
  rename(
    year = season,
    game_date = date,
    home_team = team1,
    away_team = team2,
    home_elo_pre = elo1_pre,
    away_elo_pre = elo2_pre,
    home_win_prob = elo_prob1,
    away_win_prob = elo_prob2,
    home_score = score1,
    away_score = score2,
    home_qb = qb1,
    away_qb = qb2,
    home_qb_elo_pre = qbelo1_pre,
    away_qb_elo_pre = qbelo2_pre,
  )

head(nfl_elo_clean)
## # A tibble: 6 × 14
##    year game_date  home_team away_team home_elo_pre away_elo_pre home_win_prob
##   <dbl> <date>     <chr>     <chr>            <dbl>        <dbl>         <dbl>
## 1  1920 1920-09-26 RII       STP              1504.        1300          0.825
## 2  1920 1920-10-03 AKR       WHE              1503.        1300          0.824
## 3  1920 1920-10-03 BFF       WBU              1478.        1300          0.802
## 4  1920 1920-10-03 DAY       COL              1493.        1505.         0.576
## 5  1920 1920-10-03 RII       MUN              1516.        1478.         0.644
## 6  1920 1920-10-03 RCH       ABU              1503.        1300          0.824
## # ℹ 7 more variables: away_win_prob <dbl>, home_score <dbl>, away_score <dbl>,
## #   home_qb <chr>, away_qb <chr>, home_qb_elo_pre <dbl>, away_qb_elo_pre <dbl>
avg_home_elo <- nfl_elo_clean %>%
  summarize(avg_home_elo = mean(home_elo_pre, na.rm = TRUE))

avg_away_elo <- nfl_elo_clean %>%
  summarize(avg_away_elo = mean(away_elo_pre, na.rm = TRUE))

avg_home_elo
## # A tibble: 1 × 1
##   avg_home_elo
##          <dbl>
## 1        1503.
avg_away_elo
## # A tibble: 1 × 1
##   avg_away_elo
##          <dbl>
## 1        1499.
elo_diff <- nfl_elo_clean %>%
  mutate(elo_diff = abs(home_elo_pre - away_elo_pre)) %>%
  summarize(avg_elo = mean(elo_diff, na.rm = TRUE))

elo_diff
## # A tibble: 1 × 1
##   avg_elo
##     <dbl>
## 1    113.
library(dplyr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.3.3
avg_elo_by_year <- nfl_elo_clean %>%
  group_by(year) %>%
  summarize(
    avg_home_elo = mean(home_elo_pre, na.rm = TRUE),
    avg_away_elo = mean(away_elo_pre, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(avg_home_elo, avg_away_elo), names_to = "team_type", values_to = "avg_elo")

head(avg_elo_by_year)
## # A tibble: 6 × 3
##    year team_type    avg_elo
##   <dbl> <chr>          <dbl>
## 1  1920 avg_home_elo   1453.
## 2  1920 avg_away_elo   1420.
## 3  1921 avg_home_elo   1497.
## 4  1921 avg_away_elo   1487.
## 5  1922 avg_home_elo   1526.
## 6  1922 avg_away_elo   1463.
ggplot(avg_elo_by_year, aes(x = year, y = avg_elo, fill = team_type)) +
  geom_histogram(stat = "identity", position = "dodge", color = "black", binwidth = 1) +
  labs(
    title = "Average Elo Ratings for Home and Away Teams Over the Years",
    x = "Year",
    y = "Average Elo Rating"
  ) +
  scale_fill_manual(values = c("avg_home_elo" = "skyblue", "avg_away_elo" = "orange"), 
                    name = "Team Type", labels = c("Home Team", "Away Team")) +
  theme_minimal()
## Warning in geom_histogram(stat = "identity", position = "dodge", color =
## "black", : Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

Conclusion

This dataset of NFL statistics and player performances provides great insight into the projected skill level of different teams. Through basic exploratory analysis, it can be seen that the Elo rating does not change much between a home team and an away team, and on average, there is not a large difference in the average Elo score between two competing teams. This dataset also reveals that throughout the past 100 years or so, the average Elo ratings of teams do not drastically change, which surprised me due to the physical and technical advancements in the sport. To add to this dataset, it would be interesting to include findings related to what some fans believe to be “superstitions” in games. Adding factors such as division rivalries, average home stadium climate, fan attendance, etc., would be interesting to see how a team’s Elo may be affected.