All NFL teams aim for success, but how can they enhance their chances of achieving it?
The initial step is to define success for an NFL team. Most NFL players and coaches might associate a successful season with winning the Super Bowl.
However, given that only one out of 32 teams claims the Super Bowl each year, this leaves 31 teams short of their initial goal. Using the Super Bowl as the sole criterion for success makes it challenging to gauge how a team can enhance its prospects for future victories.
The objective of this project is to develop a comprehensive model that can effectively evaluate an NFL team’s overall performance, offering a broader and more nuanced measure of success beyond the binary outcome of winning or losing the Super Bowl. This model will help teams identify strengths and areas for improvement, enabling more targeted strategies to enhance their competitiveness and chances of success in future seasons.
Success will be broken down into multiple targets. The targets will be built by categorizing teams based on total number of wins, whether they made the playoffs, and whether they won a Super Bowl. The model will be considered complete when a team can compare themselves to a target group and understand their level of performance and in what areas they could improve.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(standings)
## # A tibble: 6 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
Pro Football Reference offers a glossary of terms which describes the variables in each column.
Variable | Class | Description |
---|---|---|
team | character | Team city |
team_name | character | Team name |
year | integer | season year |
wins | double | Wins (0 to 16) |
loss | double | Losses (0 to 16) |
points_for | double | points for (offensive performance) |
points_against | double | points for (defensive performance) |
points_differential | double | Point differential (points_for - points_against) |
margin_of_victory | double | (Points Scored - Points Allowed)/ Games Played |
strength_of_schedule | double | Average quality of opponent as measured by SRS (Simple Rating System) |
simple_rating | double | Team quality relative to average (0.0) as measured by SRS (Simple
Rating System) SRS = MoV + SoS = OSRS + DSRS |
offensive_ranking | double | Team offense quality relative to average (0.0) as measured by SRS (Simple Rating System) |
defensive_ranking | double | Team defense quality relative to average (0.0) as measured by SRS (Simple Rating System) |
playoffs | character | Made playoffs or not |
sb_winner | character | Won Super Bowl or not |
An analysis of the data and the documentation table suggests a simplification in approach and a reduction in the number of variables is required.
To maintain consistency in the results, teams should be refered to
using the team_name
variable. Over the past 20 years, some
teams have relocated to a different city. For instance, between 2015 and
2016, the Rams moved from St. Louis to Los Angeles.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
standings |>
filter(team_name == 'Rams') |>
filter(year == 2015 | year == 2016) |>
select(team, team_name, year)
## # A tibble: 2 × 3
## team team_name year
## <chr> <chr> <dbl>
## 1 St. Louis Rams 2015
## 2 Los Angeles Rams 2016
For accurate modeling, it is crucial to ensure that the variables are independent. Independent variables should not be linearly correlated with each other.
Upon reviewing the data and documentation, it is observed that some variables are dependent on each other. For example…
\[ \text{POINTS FOR - POINTS AGAINST = POINTS DIFFERENTIAL} \\ \]
points_differential
should be excluded from any models
developed because it can be fully explained using the
points_for
and points_against
variables.
Furthermore, margin_of_victory
is also dependent on both
points_for
, points_against
, and
wins
+ loss
to calculate the number of games
played.
\[ \text{(Points For - Points Against) / Games Played} \\ \]
Below is a correlation matrix displaying the relationships between each variable.
standings |>
select(points_for, points_against, points_differential, margin_of_victory) |>
cor()
## points_for points_against points_differential
## points_for 1.0000000 -0.1852656 0.8154306
## points_against -0.1852656 1.0000000 -0.7199052
## points_differential 0.8154306 -0.7199052 1.0000000
## margin_of_victory 0.8153530 -0.7199803 0.9999894
## margin_of_victory
## points_for 0.8153530
## points_against -0.7199803
## points_differential 0.9999894
## margin_of_victory 1.0000000
The variables
offensive_ranking
,defensive_ranking
,
strength_of_schedule
, and simple_rating
are
not independent variables either. Variable simple_rating
is
used to calculate the others, making them dependent on
simple_rating
, and simple_rating
is calculated
based on the average performance of all teams. This means that
simple_rating
is not independent of all other data points.
For these reasons, these variables should be excluded from the
model.
Variables loss
and wins
are interchangeable
when explaining a result, consider that the sum of wins
and
loss
typically totals 16. Knowing either the number of
win
or loss
allows for the calculation of the
other. An exception exists for certain teams who played a game which
ended in a draw causing game to not be a win
or a
loss
.
standings |>
filter(wins + loss != 16)
## # A tibble: 20 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Pittsburgh Steelers 2002 10 5 390 345
## 2 Atlanta Falcons 2002 9 6 402 314
## 3 Cincinnati Bengals 2008 4 11 204 364
## 4 Philadelphia Eagles 2008 9 6 416 289
## 5 San Francisco 49ers 2012 11 4 397 273
## 6 St. Louis Rams 2012 7 8 299 348
## 7 Green Bay Packers 2013 8 7 417 428
## 8 Minnesota Vikings 2013 5 10 391 480
## 9 Cincinnati Bengals 2014 10 5 365 344
## 10 Carolina Panthers 2014 7 8 339 374
## 11 Cincinnati Bengals 2016 6 9 325 315
## 12 Washington Redskins 2016 8 7 396 383
## 13 Seattle Seahawks 2016 10 5 354 292
## 14 Arizona Cardinals 2016 7 8 418 362
## 15 Pittsburgh Steelers 2018 9 6 428 360
## 16 Cleveland Browns 2018 7 8 359 392
## 17 Minnesota Vikings 2018 8 7 360 341
## 18 Green Bay Packers 2018 6 9 376 400
## 19 Detroit Lions 2019 3 12 341 423
## 20 Arizona Cardinals 2019 5 10 361 442
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
A new column, named draws
, can be created to represent
the difference between the total number of games in a season (16) and
the sum of win
and loss
. Subsequently, the
loss
column can be removed from the model since its
variance can be explained using the wins
and
draws
columns.
standings$draws <- 16 - (standings$wins + standings$loss)
draws
lacks any substantial amount of data for when the
value is above 0. Roughly 3.1% of teams experienced a draw and for this
reason, draws should not be included in the model.
standings |>
filter(draws > 0)
## # A tibble: 20 × 16
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Pittsburgh Steelers 2002 10 5 390 345
## 2 Atlanta Falcons 2002 9 6 402 314
## 3 Cincinnati Bengals 2008 4 11 204 364
## 4 Philadelphia Eagles 2008 9 6 416 289
## 5 San Francisco 49ers 2012 11 4 397 273
## 6 St. Louis Rams 2012 7 8 299 348
## 7 Green Bay Packers 2013 8 7 417 428
## 8 Minnesota Vikings 2013 5 10 391 480
## 9 Cincinnati Bengals 2014 10 5 365 344
## 10 Carolina Panthers 2014 7 8 339 374
## 11 Cincinnati Bengals 2016 6 9 325 315
## 12 Washington Redskins 2016 8 7 396 383
## 13 Seattle Seahawks 2016 10 5 354 292
## 14 Arizona Cardinals 2016 7 8 418 362
## 15 Pittsburgh Steelers 2018 9 6 428 360
## 16 Cleveland Browns 2018 7 8 359 392
## 17 Minnesota Vikings 2018 8 7 360 341
## 18 Green Bay Packers 2018 6 9 376 400
## 19 Detroit Lions 2019 3 12 341 423
## 20 Arizona Cardinals 2019 5 10 361 442
## # ℹ 9 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>, draws <dbl>
The remaining columns are selected to undergo further analysis.
standings <- standings |>
select(team_name,year,wins,draws,points_for, points_against, playoffs, sb_winner)
This section will visualize the distributions of selected variables to identify any outliers in the data.
No outliers exist in the data in terms of the number of wins.
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = wins, y = "")) +
labs(title = "Wins from Standings Data",
x = "Wins", y = "Team") +
theme_classic()
Several teams could be considered outliers in terms of their number of points scored. While that data points fall outside of the normal range, they are not exceptionally far from this range and will not have a negative impact on the result. For now, these data points can stay.
library(ggrepel)
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = points_for, y = "")) +
geom_text_repel(data = filter(standings, points_for > 540),
mapping = aes(x = points_for,
y = "",
label = paste(team_name, year, sep = " ")),
color = "darkred") +
labs(title = "Points For from Standings Data",
x = "Points For",
y = "Team") +
theme_classic()
Two teams could be classified as outliers in terms of the number of points they conceded. Similar to the teams who were outliers in the points for section, these data points do not fall exceptionally far from the normal range of values, and will not negatively effect the analysis. Additionally, these two teams are separate teams from the ones which were outliers in the points for section.
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = points_against, y = "")) +
geom_text_repel(data = filter(standings, points_against < 190 | points_against > 500),
mapping = aes(x = points_against,
y = "",
label = paste(team_name, year, sep = " ")),
color = "darkred") +
labs(title = "Points Against from Standings Data",
x = "Points Against",
y = "Team") +
theme_classic()
The above analysis of outliers suggests there are a few teams that had exceptional numbers in terms of points_for and points_against, but nothing that indicates an error in the data, or any indication that any data point should be removed from further analysis.
This analysis will break down success into 4 separate levels. These levels of success are winning more than half the games, also known as having an above .500 record, making the Playoffs, and winning the Super Bowl. When comparing Super Bowl winning teams to teams that did not, you can break down the ‘did not’ group into two separate groups. These groups are teams that made the playoffs, and those that did not. This allows for the comparison of playoff teams that won the Super Bowl and those that did not.
The first set of hypothesis tests will be comparing teams that have a .500 an above record and those that do not. These groups will be compared in terms of points_for and points_against.
The following plot shows the distribution of points_for for teams with .500 and above record and those that do not.
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = points_for, y = as.factor(wins>=8))) +
labs(
title = "Boxplot of Points Scored for .500 and Above Record",
x = "Points For",
y = "Wins >= 8"
) +
theme_classic()
There appears to be a significant difference between the two groups. A hypothesis test can be conducted to test this assumption.
\[ H_0: \text{The average points scored is equal between teams with 8 or more wins and teams with less than 8 wins.} \\ H_0 : \text{PF}_0 = \text{PF}_a \to |\text{PF}_a - \text{PF}_0| = 0 \]
result <- t.test(points_for ~ as.factor(wins>=8), data = standings)
print(result)
##
## Welch Two Sample t-test
##
## data: points_for by as.factor(wins >= 8)
## t = -19.001, df = 620.22, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## -94.90020 -77.12132
## sample estimates:
## mean in group FALSE mean in group TRUE
## 302.5599 388.5706
The hypothesis test resulted in a p-value of 2.2e-16, suggesting there is little evidence to support that the means are the same.
A bootstrapping example can be done to further demonstrate the differences in means.
# Improved Bootstrap Function
bootstrap <- function(x, func, n_iter=10^3) {
func_values <- replicate(n_iter, {
x_sample <- x[sample(nrow(x), replace = TRUE), , drop = FALSE]
func(x_sample)
})
return(func_values)
}
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(win_group = as.factor(wins >= 8)) |>
group_by(win_group) |>
summarize(avg_points_for = mean(points_for, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(standings, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_points_for[2] - data$avg_points_for[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(standings)
observed_diff <- diff_in_avg_result$avg_points_for[1] -diff_in_avg_result$avg_points_for[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-100, 100)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Average Points For",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group. This line is far below the expected difference if the means were equal.
The following plot shows the distribution of points_against for groups of teams with 8 or more wins, and groups of teams with less than 8 wins, also known as an above .500 record.
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = points_against, y = as.factor(wins>=8))) +
labs(
title = "Boxplot of Points Conceded for .500 and Above Record",
x = "Points Against",
y = "Wins >= 8"
) +
theme_classic()
There appears to be a significant difference between the two groups. A hypothesis test can be conducted to test this assumption.
\[ H_0: \text{The average points aginst is equal between teams with 8 or more wins and teams with less than 8 wins.} \\ H_0 : \text{PA}_0 = \text{PA}_a \to |\text{PA}_a - \text{PA}_0| = 0 \]
result <- t.test(points_against ~ as.factor(wins>=8), data = standings)
print(result)
##
## Welch Two Sample t-test
##
## data: points_against by as.factor(wins >= 8)
## t = 17.546, df = 615.57, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 60.54686 75.80844
## sample estimates:
## mean in group FALSE mean in group TRUE
## 388.1127 319.9350
The hypothesis test resulted in a p-value of 2.2e-16, suggesting there is little evidence to support that the means are the same.
A bootstrapping example can be done to further demonstrate the differences in means.
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(wins_group = as.factor(wins>=8)) |>
group_by(wins_group) |>
summarize(avg_points_against = mean(points_against, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(standings, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_points_against[2] - data$avg_points_against[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(standings)
observed_diff <- diff_in_avg_result$avg_points_against[1] -diff_in_avg_result$avg_points_against[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-100, 100)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Average Points Against",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group. This line is far above the expected difference if the means were equal.
The next set of hypothesis tests will be comparing teams that made the playoffs and those that do not. These groups will be compared in terms of points_for and points_against. Wins will not be compared between groups because playoff teams are set based on the number of wins that a team has, compared to the other teams. This means that wins is not independent of whether a team makes the playoffs.
The following plot shows the distribution of points_for between teams that made the playoffs, and those that did not.
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = points_for, y = playoffs)) +
labs(
title = "Boxplot of Points Scored for Playoff Teams and Not Playoff Teams",
x = "Points For",
y = "Playoff Result"
) +
theme_classic()
There appears to be a difference between the two groups. A hypothesis test can be conducted to test this assumption.
\[ H_0: \text{The average points scored is equal between teams with playoffs and teams without} \\ H_0 : \text{PF}_0 = \text{PF}_a \to |\text{PF}_a - \text{PF}_0| = 0 \]
result <- t.test(points_for ~ playoffs, data = standings)
print(result)
##
## Welch Two Sample t-test
##
## data: points_for by playoffs
## t = -17.087, df = 497.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No Playoffs and group Playoffs is not equal to 0
## 95 percent confidence interval:
## -92.37487 -73.32228
## sample estimates:
## mean in group No Playoffs mean in group Playoffs
## 319.1181 401.9667
The hypothesis test resulted in a p-value of 2.2e-16, suggesting there is little evidence to support that the means are the same.
A bootstrapping example can be done to further demonstrate the differences in means.
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(playoff_group = playoffs) |>
group_by(playoff_group) |>
summarize(avg_points_for = mean(points_for, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(standings, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_points_for[2] - data$avg_points_for[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(standings)
observed_diff <- diff_in_avg_result$avg_points_for[1] -diff_in_avg_result$avg_points_for[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-100, 100)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Average Points For",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group. This line is far below the expected difference if the means were equal.
The following plot shows the distribution of points_against between teams that made the playoffs, and those that did not.
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = points_against, y = playoffs)) +
labs(
title = "Boxplot of Points Conceded for Playoff Teams and Not Playoff Teams",
x = "Points Against",
y = "Playoff Result"
) +
theme_classic()
There appears to be a difference between the two groups. A hypothesis test can be conducted to test this assumption.
\[ H_0: \text{The average points against is equal between teams with playoffs and teams without} \\ H_0 : \text{PA}_0 = \text{PA}_a \to |\text{PA}_a - \text{PA}_0| = 0 \]
result <- t.test(points_against ~ playoffs, data = standings)
print(result)
##
## Welch Two Sample t-test
##
## data: points_against by playoffs
## t = 15.951, df = 516.33, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No Playoffs and group Playoffs is not equal to 0
## 95 percent confidence interval:
## 57.24858 73.33152
## sample estimates:
## mean in group No Playoffs mean in group Playoffs
## 374.8442 309.5542
The hypothesis test resulted in a p-value of 2.2e-16, suggesting there is little evidence to support that the means are the same.
A bootstrapping example can be done to further demonstrate the differences in means.
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(playoff_group = playoffs) |>
group_by(playoff_group) |>
summarize(avg_points_against = mean(points_against, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(standings, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_points_against[2] - data$avg_points_against[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(standings)
observed_diff <- diff_in_avg_result$avg_points_against[1] -diff_in_avg_result$avg_points_against[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-100, 100)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Average Points Against",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group. This line is far above the expected difference if the means were equal.
The next set of hypothesis tests will be comparing teams that made won the Super Bowl and those that do not. These groups will be compared in terms of points_for, points_against, and wins.
The following plot shows the distribution of points_for between teams that won the Super Bowl and teams that did not.
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = points_for, y = sb_winner )) +
labs(
title = "Boxplot of Points Scored",
subtitle = "Super Bowl winning teams vs not Super Bowl winning teams",
x = "Points For",
y = "Super Bowl Result"
) + theme_classic()
here appears to be a difference between the two groups. A hypothesis test can be conducted to test this assumption.
\[ H_0: \text{The average points scored is equal between teams who won a Super Bowl and teams that did not} \\ H_0 : \text{PF}_0 = \text{PF}_a \to |\text{PF}_a - \text{PF}_0| = 0 \]
result <- t.test(points_for ~ sb_winner, data = standings)
print(result)
##
## Welch Two Sample t-test
##
## data: points_for by sb_winner
## t = -4.9868, df = 21.773, p-value = 5.597e-05
## alternative hypothesis: true difference in means between group No Superbowl and group Won Superbowl is not equal to 0
## 95 percent confidence interval:
## -78.96948 -32.55932
## sample estimates:
## mean in group No Superbowl mean in group Won Superbowl
## 348.5356 404.3000
The hypothesis test resulted in a p-value of 5.597e-05, suggesting there is little evidence to support the null hypothesis that the means are the same.
A bootstrapping example can be done to further demonstrate the differences in means.
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(sb_group = sb_winner) |>
group_by(sb_group) |>
summarize(avg_points_for = mean(points_for, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(standings, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_points_for[2] - data$avg_points_for[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(standings)
observed_diff <- diff_in_avg_result$avg_points_for[1] -diff_in_avg_result$avg_points_for[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-100, 100)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Average Points For",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group. This line is far above the expected difference if the means were equal.
The following plot shows the distribution of points_against between teams that won the Super Bowl and teams that did not.
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = points_against, y = sb_winner )) +
labs(
title = "Boxplot of Points Conceded",
subtitle = "Super Bowl winning teams vs not Super Bowl winning teams",
x = "Points Against",
y = "Super Bowl Result"
) + theme_classic()
There appears to be a difference between the two groups. A hypothesis test can be conducted to test this assumption.
\[ H_0: \text{The average points against is equal between teams who won a Super Bowl and teams that did not} \\ H_0 : \text{PA}_0 = \text{PA}_a \to |\text{PA}_a - \text{PA}_0| = 0 \]
result <- t.test(points_against ~ sb_winner, data = standings)
print(result)
##
## Welch Two Sample t-test
##
## data: points_against by sb_winner
## t = 5.0571, df = 20.168, p-value = 5.883e-05
## alternative hypothesis: true difference in means between group No Superbowl and group Won Superbowl is not equal to 0
## 95 percent confidence interval:
## 40.64319 97.65972
## sample estimates:
## mean in group No Superbowl mean in group Won Superbowl
## 352.4515 283.3000
The hypothesis test resulted in a p-value of 5.883e-05, suggesting there is little evidence to support that the means are equal.
A bootstrapping example can be done to further demonstrate the differences in means.
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(sb_group = sb_winner) |>
group_by(sb_group) |>
summarize(avg_points_against = mean(points_against, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(standings, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_points_against[2] - data$avg_points_against[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(standings)
observed_diff <- diff_in_avg_result$avg_points_against[1] -diff_in_avg_result$avg_points_against[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-100, 100)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Average Points Against",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group. This line is far above the expected difference if the means were equal.
Here wins
is used as a a explanatory metric, similar to
how points_for
and points_against
.
The following plot shows the distribution of wins between teams that won the Super Bowl and teams that did not.
standings |>
ggplot() +
geom_boxplot(mapping = aes(x = wins, y = sb_winner)) +
labs(
title = "Boxplot of Wins",
subtitle = "Super Bowl winning teams vs not Super Bowl winning teams",
x = "Number of Wins",
y = "Super Bowl Result"
) + theme_classic()
There appears to be a difference between the two groups. A hypothesis test can be conducted to test this assumption.\[ H_0: \text{Average seasonal win totals is equal between teams who won a Super Bowl and teams that did not .} \\ H_0 : \text{wins}_0 = \text{wins}_a \to |\text{wins}_a - \text{wins}_0| = 0 \]
results <- t.test(wins ~ sb_winner, data = standings)
print(results)
##
## Welch Two Sample t-test
##
## data: wins by sb_winner
## t = -11.696, df = 25.007, p-value = 1.238e-11
## alternative hypothesis: true difference in means between group No Superbowl and group Won Superbowl is not equal to 0
## 95 percent confidence interval:
## -4.693501 -3.288052
## sample estimates:
## mean in group No Superbowl mean in group Won Superbowl
## 7.859223 11.850000
The hypothesis test resulted in a p-value of 1.238e-11, suggesting there is little evidence to support that the means are equal.
A bootstrapping example can be done to further demonstrate the differences in means.
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(sb_group = sb_winner) |>
group_by(sb_group) |>
summarize(avg_wins = mean(wins, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(standings, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_wins[2] - data$avg_wins[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(standings)
observed_diff <- diff_in_avg_result$avg_wins[1] -diff_in_avg_result$avg_wins[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-5, 5)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Average Wins",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group. This line is far below the expected difference if the means were equal.
In this section playoff teams that did not win a Superbowl are compared to playoff teams that did win a Super Bowl. These groups will be compared in terms of points_for, points_against, and wins.
playoff_mask <- standings |>
filter(playoffs == 'Playoffs' & sb_winner == 'No Superbowl')
sb_mask <- standings |>
filter(sb_winner == 'Won Superbowl')
combined_mask <- rbind(playoff_mask, sb_mask)
The following plot shows the distribution of points_for between playoff teams that won the Super Bowl and playoff teams that did not.
combined_mask |>
ggplot() +
geom_boxplot(mapping = aes(x = points_for, y = sb_winner )) +
labs(
title = "Boxplot of Points Scored",
subtitle = "Playoff Teams That Won a Super Bowl vs Playoff Teams That Did Not",
x = "Points For",
y = "Super Bowl Result"
) + theme_classic()
There appears to be no difference between the two groups. A hypothesis test can be conducted to test this assumption.
\[ H_0: \text{The average points scored is equal between playoff teams that won a Super Bowl and playoff teams that did not win a Superbowl.} \\ H_0 : \text{PF}_0 = \text{PF}_a \to |\text{PF}_a - \text{PF}_0| = 0 \]
result <- t.test(points_for ~ sb_winner, data = combined_mask)
print(result)
##
## Welch Two Sample t-test
##
## data: points_for by sb_winner
## t = -0.22025, df = 24.799, p-value = 0.8275
## alternative hypothesis: true difference in means between group No Superbowl and group Won Superbowl is not equal to 0
## 95 percent confidence interval:
## -26.35740 21.26649
## sample estimates:
## mean in group No Superbowl mean in group Won Superbowl
## 401.7545 404.3000
The hypothesis test resulted in a p-value of 0.8275, suggesting there is evidence to support that the means are equal.
A bootstrapping example can be done to further demonstrate the differences in means.
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(sb_group = sb_winner) |>
group_by(sb_group) |>
summarize(avg_points_for = mean(points_for, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(combined_mask, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_points_for[2] - data$avg_points_for[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(combined_mask)
observed_diff <- diff_in_avg_result$avg_points_for[1] -diff_in_avg_result$avg_points_for[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-100, 100)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Average Points For",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group. This line is close to an average difference in points for of 0, suggesting that the difference in means between the two groups are equal.
The following plot shows the distribution of points_against between playoff teams that won the Super Bowl and playoff teams that did not.
combined_mask |>
ggplot() +
geom_boxplot(mapping = aes(x = points_against, y = sb_winner)) +
labs(
title = "Boxplot of Points Conceded",
subtitle = "Playoff Teams That Won a Super Bowl vs Playoff Teams That Did Not",
x = "Points Against",
y = "Super Bowl Result"
) + theme_classic()
There appears to be a difference between the two groups. A hypothesis test can be conducted to test this assumption.
\[ H_0: \text{The average points conceded is equal between teams who won the Super Bowl and playoff teams who did not win the Superbowl.} \\ H_0 : \text{PA}_0 = \text{PA}_a \to |\text{PA}_a - \text{PA}_0| = 0 \]
result <- t.test(points_against ~ sb_winner, data = combined_mask)
print(result)
##
## Welch Two Sample t-test
##
## data: points_against by sb_winner
## t = 2.0676, df = 21.235, p-value = 0.05108
## alternative hypothesis: true difference in means between group No Superbowl and group Won Superbowl is not equal to 0
## 95 percent confidence interval:
## -0.1468964 57.4287146
## sample estimates:
## mean in group No Superbowl mean in group Won Superbowl
## 311.9409 283.3000
The hypothesis test resulted in a p-value of 0.05108. Depending on the threshold set by a team, this result could be interpreted one way or the other. More data should be collected to further prove that the means are not equal.
A bootstrapping example can be done to further demonstrate the differences in means.
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(sb_group = sb_winner) |>
group_by(sb_group) |>
summarize(avg_points_against = mean(points_against, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(combined_mask, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_points_against[2] - data$avg_points_against[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(combined_mask)
observed_diff <- diff_in_avg_result$avg_points_against[1] -diff_in_avg_result$avg_points_against[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-100, 100)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Points For",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group.
The following plot shows the distribution of wins between playoff teams that won the Super Bowl and playoff teams that did not.
combined_mask |>
ggplot() +
geom_boxplot(mapping = aes(x = wins, y = sb_winner)) +
labs(
title = "Boxplot of Wins",
subtitle = "Playoff Teams That Won a Super Bowl vs Playoff Teams That Did Not",
x = "Wins",
y = "Super Bowl Result"
) + theme_classic()
There appears to be a difference between the two groups. A hypothesis test can be conducted to test this assumption.
\[ H_0: \text{The average number of wins is equal between teams who won the Super Bowl and playoff teams who did not win the Superbowl.} \\ H_0 : \text{Wins}_0 = \text{Wins}_a \to |\text{Wins}_a - \text{Wins}_0| = 0 \]
result <- t.test(wins ~ sb_winner, data = combined_mask)
print(result)
##
## Welch Two Sample t-test
##
## data: wins by sb_winner
## t = -2.5201, df = 23.363, p-value = 0.019
## alternative hypothesis: true difference in means between group No Superbowl and group Won Superbowl is not equal to 0
## 95 percent confidence interval:
## -1.5388664 -0.1520427
## sample estimates:
## mean in group No Superbowl mean in group Won Superbowl
## 11.00455 11.85000
The hypothesis test resulted in a p-value of 0.019, suggesting there is little evidence to support that the means are equal.
A bootstrapping example can be done to further demonstrate the differences in means.
# Difference in averages function
diff_in_avg <- function(x_data) {
result <- x_data |>
mutate(sb_group = sb_winner) |>
group_by(sb_group) |>
summarize(avg_wins = mean(wins, na.rm = TRUE)) |>
ungroup() # Ensure it returns a dataframe
return(result)
}
# Applying bootstrap
bootstrap_result <- bootstrap(combined_mask, function(x) {
data <- diff_in_avg(x)
diff <- data$avg_wins[2] - data$avg_wins[1]
return(diff)
}, n_iter = 1000)
# Observed difference calculation
diff_in_avg_result <- diff_in_avg(combined_mask)
observed_diff <- diff_in_avg_result$avg_wins[1] -diff_in_avg_result$avg_wins[2]
# Plotting
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = sd(bootstrap_result)), xlim = c(-5, 5)) +
geom_vline(aes(xintercept = observed_diff, color = "Observed Difference")) +
labs(title = "Bootstrapped Sampling Distribution of Differences",
x = "Difference in Average Wins",
y = "Probability Density") +
scale_color_manual(name = "", values = "red") +
theme_classic()
In the above visualization, the red line indicates the observed differences in means between the two group. This line is below the expected difference if the means were equal.
Four models can be constructed. They are above .500 teams, playoff teams, Super Bowl winning teams, and playoff teams who win the Super Bowl.
From the prior analysis points_for and points_against were shown to have a significant effect on the whether a team’s record was .500 and above or not.
model_500 <- glm(as.factor(wins>=8) ~ points_for + points_against,
data = standings,
family = binomial(link = 'logit'))
summary(model_500)
##
## Call:
## glm(formula = as.factor(wins >= 8) ~ points_for + points_against,
## family = binomial(link = "logit"), data = standings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.532309 1.183761 1.294 0.196
## points_for 0.040132 0.003680 10.905 <2e-16 ***
## points_against -0.042660 0.004025 -10.599 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 876.76 on 637 degrees of freedom
## Residual deviance: 333.34 on 635 degrees of freedom
## AIC: 339.34
##
## Number of Fisher Scoring iterations: 7
Both coefficients of points_for and points_against were shown to be significant in the model.
# Create a grid of values for points_for and points_against
points_for <- seq(min(standings$points_for), max(standings$points_for), length.out = 100)
points_against <- seq(min(standings$points_against), max(standings$points_against), length.out = 100)
grid <- expand.grid(points_for = points_for, points_against = points_against)
# Calculate probabilities using the logistic regression equation
grid$probability <- 1 / (1 + exp(-(1.53230870 + 0.04013196 * grid$points_for - 0.04265988 * grid$points_against)))
# Create the heat map
ggplot(grid, aes(x = points_for, y = points_against, fill = probability)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0.5,
name = "Probability\nof ≥8 Wins") +
labs(x = "Points Scored (For)", y = "Points Allowed (Against)",
title = "Probability of Winning at Least 8 Games",
subtitle = "Based on Points Scored and Allowed") +
theme_classic() +
theme(axis.title = element_text(size = 12, face = "bold"),
title = element_text(size = 14, face = "bold"))
From the prior analysis, points_for and points_against were shown to have a significant effect on the whether a team made the playoffs or not.
standings$playoff_logit <- ifelse(standings$playoffs == 'Playoffs', 1, 0)
model_playoff <- glm(playoff_logit ~ points_for + points_against,
data = standings,
family = binomial(link = 'logit'))
summary(model_playoff)
##
## Call:
## glm(formula = playoff_logit ~ points_for + points_against, family = binomial(link = "logit"),
## data = standings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.371117 1.095377 -0.339 0.735
## points_for 0.030934 0.002857 10.828 <2e-16 ***
## points_against -0.033448 0.003192 -10.480 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 844.92 on 637 degrees of freedom
## Residual deviance: 396.06 on 635 degrees of freedom
## AIC: 402.06
##
## Number of Fisher Scoring iterations: 6
Both coefficients of points_for and points_against were shown to be significant in the model.
# Create a grid of values for points_for and points_against
points_for <- seq(min(standings$points_for), max(standings$points_for), length.out = 100)
points_against <- seq(min(standings$points_against), max(standings$points_against), length.out = 100)
grid <- expand.grid(points_for = points_for, points_against = points_against)
# Calculate probabilities using the logistic regression equation
grid$probability <- 1 / (1 + exp(-(-0.37111746 + 0.03093391 * grid$points_for - 0.03344767 * grid$points_against)))
# Create the heat map
ggplot(grid, aes(x = points_for, y = points_against, fill = probability)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0.5,
name = "Probability\nof Playoffs") +
labs(x = "Points Scored (For)", y = "Points Allowed (Against)",
title = "Probability of Making the Playoffs",
subtitle = "Based on Points Scored and Allowed") +
theme_classic() +
theme(axis.title = element_text(size = 12, face = "bold"),
title = element_text(size = 14, face = "bold"))
From the prior analysis there is a significant difference in the average number of wins, points_for, and points_against between teams who won the Super Bowl and those that did not.
standings$sb_winner_logit <- ifelse(standings$sb_winner == 'Won Superbowl', 1, 0)
model_sb <- glm(sb_winner_logit ~ wins + points_for + points_against,
data = standings,
family = binomial(link = 'logit'))
summary(model_sb)
##
## Call:
## glm(formula = sb_winner_logit ~ wins + points_for + points_against,
## family = binomial(link = "logit"), data = standings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.699438 2.694418 -1.744 0.0811 .
## wins 0.436277 0.210832 2.069 0.0385 *
## points_for 0.001244 0.006172 0.201 0.8403
## points_against -0.011610 0.007350 -1.580 0.1142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.87 on 637 degrees of freedom
## Residual deviance: 135.76 on 634 degrees of freedom
## AIC: 143.76
##
## Number of Fisher Scoring iterations: 8
Interestingly, only the wins coefficient appears to be significant. This could be related to the fact that in order to win the Super Bowl a team needs to make the playoffs, which are determined by the number of wins a team has. This could mean that wins are not independent of a team winning the Super Bowl. This could cause the wins to overpower the other coefficients in the model.
# Generate a range of wins values (assuming realistic values based on your data)
wins_range <- seq(min(standings$wins), max(standings$wins), by = 1)
# Predict probabilities across the range of wins while holding other variables at their means
data_for_plot <- data.frame(wins = wins_range,
points_for = mean(standings$points_for),
points_against = mean(standings$points_against))
data_for_plot$probability <- predict(model_sb, newdata = data_for_plot, type = "response")
# Create the plot
ggplot(data_for_plot, aes(x = wins, y = probability)) +
geom_line() +
labs(title = "Effect of Wins on Super Bowl Winning Probability",
x = "Wins",
y = "Probability of Winning Super Bowl") +
theme_classic()
# Generate a range of points_for values
points_for_range <- seq(min(standings$points_for), max(standings$points_for), by = 10)
# Predict probabilities across the range of points_for
data_for_plot <- data.frame(points_for = points_for_range,
wins = mean(standings$wins),
points_against = mean(standings$points_against))
data_for_plot$probability <- predict(model_sb, newdata = data_for_plot, type = "response")
# Create the plot
ggplot(data_for_plot, aes(x = points_for, y = probability)) +
geom_line() +
labs(title = "Effect of Points For on Super Bowl Winning Probability",
x = "Points For",
y = "Probability of Winning Super Bowl") +
theme_classic()
# Generate a range of points_against values
points_against_range <- seq(min(standings$points_against), max(standings$points_against), by = 10)
# Predict probabilities across the range of points_against
data_for_plot <- data.frame(points_against = points_against_range,
wins = mean(standings$wins),
points_for = mean(standings$points_for))
data_for_plot$probability <- predict(model_sb, newdata = data_for_plot, type = "response")
# Create the plot
ggplot(data_for_plot, aes(x = points_against, y = probability)) +
geom_line() +
labs(title = "Effect of Points Against on Super Bowl Winning Probability",
x = "Points Against",
y = "Probability of Winning Super Bowl") +
theme_classic()
From the prior analysis there is a significant difference in the average number of wins and points_against between playoff teams who won the Super Bowl and playoff teams that did not.
combined_mask$sb_winner_logit <- ifelse(combined_mask$sb_winner == 'Won Superbowl', 1, 0)
model_playoff_sb <- glm(sb_winner_logit ~ wins + points_against,
data = combined_mask,
family = binomial(link = 'logit'))
summary(model_playoff_sb)
##
## Call:
## glm(formula = sb_winner_logit ~ wins + points_against, family = binomial(link = "logit"),
## data = combined_mask)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.483777 2.755192 -0.901 0.3673
## wins 0.253513 0.162778 1.557 0.1194
## points_against -0.009456 0.005215 -1.813 0.0698 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 137.68 on 239 degrees of freedom
## Residual deviance: 129.04 on 237 degrees of freedom
## AIC: 135.04
##
## Number of Fisher Scoring iterations: 5
Interestingly, the p-values of both variables suggests that their coefficients are not significant. Further analysis could be done to figure out why this is the case.
# Generate data for points_against and wins
points_against <- seq(min(combined_mask$points_against), max(combined_mask$points_against), length.out = 100)
wins <- unique(combined_mask$wins) # Use actual discrete wins values from your data
# Create a data frame for prediction
expand_grid <- expand.grid(points_against = points_against, wins = wins)
expand_grid$log_wins <- log(expand_grid$wins)
expand_grid$log_points_against <- log(expand_grid$points_against)
# Predict probabilities
expand_grid$probability <- predict(model_playoff_sb, newdata = expand_grid, type = "response")
# Plot
ggplot(expand_grid, aes(x = points_against, y = probability, color = as.factor(wins))) +
geom_line() + # Using line plots to show the trend
labs(x = "Points Against", y = "Probability of Winning Super Bowl", color = "Number of Wins",
title = "Probability of Playoff Teams",
subtitle = "Winning the Super Bowl by Points Against and Wins") +
scale_color_brewer(type = 'div', palette = "Spectral") + # Using a color palette
theme_classic()
All NFL teams aim for success, but the areas of their team they choose to strengthen can play a significant factor in that success
This project has defined 4 levels success for an NFL team. Each of these models can effectively evaluate an NFL team’s performance, offering a broader and more nuanced measure of success beyond the binary outcome of winning or losing a Super Bowl. These model help teams identify their strengths and areas for improvement, enabling more targeted strategies to enhance their competitiveness and chances of success in future seasons.
Ultimately, these models are only part of the picture. Many more variables determine what actually makes an NFL team successful. This analysis is limited by the number of variables in the data and the fact that the data is sourced from seasons 2000 - 2019. Future seasons worth of data may prove the models irrelevant.