Assignment 7

Author

Rocco Roman

Big Picture Question

In the NFL, fans and analysts often say that point differential tells the truth, meaning that how much a team outscores its opponents is a better indicator of team strength than its win–loss record alone. This idea is widely discussed in football analytics because point differential captures not just whether a team wins, but how convincingly it wins.

Overall Question: How strongly is point differential related to win percentage for NFL teams from 2019–2023, and does this relationship remain consistent across seasons?

This question is interesting because it tests a core analytics belief using real multi‑season data. If point differential is strongly tied to winning, it supports the idea that teams with unusually high or low records relative to their scoring margins may regress in future seasons. If the relationship varies across seasons, that may indicate changes in league parity, scoring environment, or competitive balance.

To answer this question, I collected team‑level NFL regular‑season standings data from FootballDB, a publicly accessible website that provides HTML tables of wins, losses, points scored, and points allowed for every team.

  • I wrote a dedicated scraping function in R that:

    • Visits each season’s standings page on FootballDB

    • Extracts the HTML table containing team results

    • Cleans and formats the data

    • Adds derived variables such as win percentage and point differential

  • The function loops through all seasons from 2019 to 2023

  • The scraped dataset was saved as a static CSV file and uploaded to GitHub, where it can be imported into R using a direct raw link.
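The scraping steps listed above can be sketched roughly as follows. This is a minimal illustration, not the function actually used for this assignment: the URL pattern, the helper names (`standings_url`, `parse_standings`, `scrape_season`), and the table column headers (`Team`, `W`, `L`, `T`, `PF`, `PA`) are all assumptions that would need to be checked against the live FootballDB pages.

```r
library(rvest)
library(dplyr)
library(purrr)

# ASSUMPTION: URL pattern for one season's standings page; verify on the site.
standings_url <- function(yr) {
  paste0("https://www.footballdb.com/standings/index.html?lg=NFL&yr=", yr)
}

# Turn a parsed standings page into one tidy data frame.
# Taking the parsed page (not a URL) lets the function be tried offline.
parse_standings <- function(page, yr) {
  page %>%
    html_elements("table") %>%
    html_table() %>%
    # Coerce to character first so tables with differing column types bind cleanly.
    map(~ mutate(.x, across(everything(), as.character))) %>%
    bind_rows() %>%
    # ASSUMPTION: the column headers are Team, W, L, T, PF, PA.
    transmute(
      team = .data[["Team"]],
      wins = as.numeric(.data[["W"]]),
      losses = as.numeric(.data[["L"]]),
      ties = as.numeric(.data[["T"]]),
      points_for = as.numeric(.data[["PF"]]),
      points_against = as.numeric(.data[["PA"]]),
      season = yr,
      games = wins + losses + ties,
      win_pct = wins / games,
      point_diff = points_for - points_against
    )
}

# Download and parse one season.
scrape_season <- function(yr) {
  parse_standings(read_html(standings_url(yr)), yr)
}

# Loop over all seasons and save a static copy (left commented out so the
# sketch does not hit the network when sourced):
# nfl_raw <- map_dfr(2019:2023, scrape_season)
# readr::write_csv(nfl_raw, "nfl_standings_2019_2023.csv")
```

Separating the download step from the parsing step keeps the parsing logic testable against a saved or hand-written HTML snippet.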

Data Wrangling

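  • games = total games played (wins + losses + ties)

  • win_pct = wins divided by games, so a tie counts as a game played but not as a win (the official NFL percentage instead counts a tie as half a win)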

  • point_diff = points scored minus points allowed

  • season is converted to a factor for plotting

This prepares the dataset for all visualizations.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nfl <- readr::read_csv(
  "https://raw.githubusercontent.com/roccoroman/nfl-dataset/main/nfl_standings_2019_2023.csv"
)
Rows: 200 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): team, X8, X9, X10, X11, X12
dbl (9): wins, losses, ties, pct, points_for, points_against, season, games,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nfl_clean <- nfl %>%
  mutate(
    # Treat a missing tie count as zero ties.
    games = wins + losses + coalesce(ties, 0),
    win_pct = wins / games,
    point_diff = points_for - points_against,
    # Convert to a factor so ggplot treats season as a discrete variable.
    season = factor(season)
  )
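The import message reports 200 rows, or 40 per season, while the NFL has 32 teams, so some rows probably do not describe a team. A minimal sketch of how such rows could be counted and dropped is shown here on made-up data. The `demo` tibble and the names `missing_by_season` and `demo_teams` are hypothetical, and the guess that the extra rows are header-like rows with missing numbers is an assumption, not something confirmed from the CSV.

```r
library(dplyr)

# Hypothetical stand-in for nfl_clean: two team rows and one header-like row
# whose numeric fields are missing.
demo <- tibble::tibble(
  team = c("Team A", "AFC East", "Team B"),
  season = factor(c(2019, 2019, 2019)),
  win_pct = c(0.75, NA, 0.25),
  point_diff = c(120, NA, -120)
)

# Count total rows and rows with missing results in each season.
missing_by_season <- demo %>%
  group_by(season) %>%
  summarise(rows = n(), missing = sum(is.na(point_diff)))

# Keep only rows that carry actual results.
demo_teams <- demo %>%
  filter(!is.na(win_pct), !is.na(point_diff))
```

Running the same two steps on `nfl_clean` would show how many rows per season lack results before any plotting is done.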

Scatterplot: Point Differential vs Win Percentage

This plot shows how strongly point differential predicts win percentage. Each point is one team in one season, and the dark red line is a least-squares linear fit with its shaded 95% confidence band. An upward slope means teams with higher point differential win more often; how tightly the points cluster around the line shows how strong the relationship is. This directly tests the idea that “point differential tells the truth.”

ggplot(nfl_clean, aes(point_diff, win_pct)) +
  geom_point(alpha = 0.7, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(
    title = "Point Differential vs Win Percentage (NFL 2019–2023)",
    x = "Point Differential",
    y = "Win Percentage"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 40 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 40 rows containing missing values (`geom_point()`).
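The scatterplot shows the direction of the relationship, but the question asks how strong it is, which calls for a number. The sketch below computes the Pearson correlation, the fitted slope, and R-squared on made-up values. The `toy` data frame is purely illustrative and does not reproduce any result from the NFL data.

```r
# Made-up values standing in for nfl_clean (illustration only).
toy <- data.frame(
  point_diff = c(-150, -60, 0, 45, 130),
  win_pct = c(0.20, 0.40, 0.50, 0.60, 0.80)
)

# Pearson correlation; use = "complete.obs" skips rows with missing values.
r <- cor(toy$point_diff, toy$win_pct, use = "complete.obs")

# The same straight-line model that geom_smooth(method = "lm") draws.
fit <- lm(win_pct ~ point_diff, data = toy)

# Change in win percentage per additional point of differential.
slope <- unname(coef(fit)["point_diff"])

# Share of the variance in win percentage explained by point differential.
r_squared <- summary(fit)$r.squared
```

Replacing `toy` with `nfl_clean` would give the corresponding values for the 2019–2023 data; `lm()` drops rows with missing values by default.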

Scatterplot by Season

This version breaks the relationship down by season. Each color is a different NFL season, and each season gets its own linear trend line. Lines with similar slopes indicate a relationship that is stable from year to year; lines that diverge indicate that it varies.

ggplot(nfl_clean, aes(point_diff, win_pct, color = season)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Point Differential vs Win % by Season",
    x = "Point Differential",
    y = "Win Percentage",
    color = "Season"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 40 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 40 rows containing missing values (`geom_point()`).
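Comparing trend lines by eye can be backed up with one correlation and one slope per season. The following is a sketch on made-up data, assuming the same column names as `nfl_clean`; the `toy` tibble and the `by_season` name are hypothetical.

```r
library(dplyr)

# Made-up values for two seasons (illustration only).
toy <- tibble::tibble(
  season = factor(rep(c(2019, 2020), each = 4)),
  point_diff = c(-100, -20, 30, 90, -80, -10, 10, 80),
  win_pct = c(0.25, 0.45, 0.55, 0.75, 0.30, 0.50, 0.50, 0.70)
)

by_season <- toy %>%
  # Drop rows with missing values so cor(), cov(), and var() do not return NA.
  filter(!is.na(point_diff), !is.na(win_pct)) %>%
  group_by(season) %>%
  summarise(
    r = cor(point_diff, win_pct),
    # Least-squares slope of win_pct on point_diff: cov(x, y) / var(x).
    slope = cov(point_diff, win_pct) / var(point_diff)
  )
```

If the per-season values of `r` and `slope` are close to one another, the relationship is stable across seasons; large differences would suggest it is not.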

Boxplot of Point Differential by Season

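This shows how point differential varies across seasons. The line inside each box is the median team's point differential, and the height of the box shows how spread out the middle half of teams is. The dashed line at 0 separates teams that outscored their opponents from teams that were outscored. This helps identify whether certain seasons were more competitive or more lopsided.

2021 is the only season in which most teams had a positive point differential. Because point differentials across the league sum to zero, a majority of positive teams implies that a smaller group of teams was outscored by very large margins, which points to a clear separation between the good and bad teams that season.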

ggplot(nfl_clean, aes(season, point_diff)) +
  geom_boxplot(fill = "orange", alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Distribution of Point Differential by Season",
    x = "Season",
    y = "Point Differential"
  )
Warning: Removed 40 rows containing non-finite values (`stat_boxplot()`).

Summary Statistics by Season

This table summarizes each season:

  • How many teams

  • Average win percentage

  • Average point differential

  • Spread (standard deviation) of point differential

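This gives a high‑level view of how seasons differ. Two of the columns are fixed by construction: every point scored by one team is allowed by another, so the average point differential is 0, and the average win percentage falls just under 0.5 only because ties are not counted as wins. The standard deviation of point differential is therefore the column that actually distinguishes seasons. Note also that the teams column counts every row in the data, including rows with missing values; the NFL has 32 teams, and 8 extra rows per season would account for the 40 rows dropped in the plot warnings.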

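# Recompute the derived columns with explicit numeric coercion, in case any
# count column was read as text.
nfl_clean2 <- nfl %>%
  mutate(
    wins = as.numeric(wins),
    losses = as.numeric(losses),
    ties = as.numeric(ties),
    points_for = as.numeric(points_for),
    points_against = as.numeric(points_against),
    games = wins + losses + coalesce(ties, 0),
    win_pct = wins / games,
    point_diff = points_for - points_against,
    season = factor(season)
  )

# Summarise the coerced data; n() counts every row, while the means and the
# standard deviation skip missing values via na.rm = TRUE.
season_summary <- nfl_clean2 %>%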
  group_by(season) %>%
  summarise(
    teams = n(),
    avg_win_pct = mean(win_pct, na.rm = TRUE),
    avg_point_diff = mean(point_diff, na.rm = TRUE),
    sd_point_diff = sd(point_diff, na.rm = TRUE)
  )

season_summary
# A tibble: 5 × 5
  season teams avg_win_pct avg_point_diff sd_point_diff
  <fct>  <int>       <dbl>          <dbl>         <dbl>
1 2019      40       0.498              0         108. 
2 2020      40       0.498              0         102. 
3 2021      40       0.498              0         113. 
4 2022      40       0.497              0          82.5
5 2023      40       0.5                0         100. 

Top & Bottom 10 Teams by Point Differential

This visualization highlights the extremes:

  • The 10 best teams by point differential

  • The 10 worst teams

It visually reinforces how point differential aligns with team success.

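top_bottom <- nfl_clean %>%
  arrange(desc(point_diff)) %>%
  slice_head(n = 10) %>%
  mutate(group = "Top 10") %>%
  bind_rows(
    nfl_clean %>%
      arrange(point_diff) %>%
      slice_head(n = 10) %>%
      mutate(group = "Bottom 10")
  ) %>%
  # Label each bar by team and season: the same team can appear in more than
  # one season, and bars sharing a label would be stacked on top of each other.
  mutate(team_season = paste(team, season))

ggplot(top_bottom, aes(reorder(team_season, point_diff), point_diff, fill = group)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top & Bottom 10 Teams by Point Differential",
    x = "Team and Season",
    y = "Point Differential",
    fill = "Group"
  )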

Conclusion

Across all five seasons from 2019–2023, the analysis shows a strong and consistent positive relationship between point differential and win percentage. Teams that outscore their opponents by larger margins almost always finish with higher winning percentages, and this pattern holds true in every season examined.

The scatterplots reveal a clear upward trend, consistent with a high correlation between point differential and win percentage and with the view that point differential is a reliable indicator of team strength. Read alongside the scatterplots, the top‑10 and bottom‑10 chart reinforces this: the teams with the largest positive point differentials sit at the high‑win end of the trend, while teams with large negative differentials sit at the low end.

Overall, the results strongly support the analytics claim that point differential “tells the truth” about team performance. Not only does it correlate closely with winning, but the relationship remains stable across multiple seasons, suggesting that point differential is a robust and meaningful metric for evaluating NFL teams.