A Pythagorean expectation is a statistic used to measure how many wins a team should expect, based on how many points they score and how many they allow.
In this project, we delve into the world of sports analytics by exploring the Pythagorean Expectation as a predictive tool for NHL team performance. Focusing on the 2022-2023 NHL season, we embark on a comprehensive analysis of the Pythagorean Expectation across all teams, and expand our investigation to encompass historical data from previous seasons. The ultimate goal is to assess the efficacy of this model by comparing its predictive capacity between the first and second halves of the ongoing season.
Overview:
Pythagorean Expectation: Derived from the Pythagorean theorem, the Pythagorean Expectation is a formula used in sports analytics to predict a team’s win-loss record based on their points scored and allowed. It is a valuable tool for evaluating the efficiency of a team’s performance relative to their actual win-loss record.
Data Collection and Analysis: We gather detailed statistics for all NHL teams during the 2022-2023 season, including goals scored, goals allowed, and actual win-loss records. By applying the Pythagorean Expectation formula, we calculate the expected win-loss records for each team and compare them with their actual records. This provides insight into over-performing and under-performing teams in terms of wins and losses.
Historical Context: To understand the consistency of the Pythagorean Expectation across seasons, we extend our analysis to previous NHL seasons. By aggregating data from multiple seasons, we aim to identify trends, outliers, and whether the Pythagorean Expectation remains a relevant predictive tool over time.
Seasonal Dynamics: One key aspect of this project involves assessing whether the Pythagorean Expectation’s predictive capacity changes between the first and second halves of the ongoing NHL season. By dividing the season into two parts, we can analyze whether teams’ performance becomes more aligned with their expected records as the season progresses.
### Data
shots_2007<- read_csv(here("MP_data", "CAN_shots_2007.csv"))
shots_2008<- read_csv(here("MP_data", "CAN_shots_2008.csv"))
shots_2009<- read_csv(here("MP_data", "CAN_shots_2009.csv"))
shots_2010<- read_csv(here("MP_data", "CAN_shots_2010.csv"))
shots_2011<- read_csv(here("MP_data", "CAN_shots_2011.csv"))
shots_2012<- read_csv(here("MP_data", "CAN_shots_2012.csv"))
shots_2013<- read_csv(here("MP_data", "CAN_shots_2013.csv"))
shots_2014<- read_csv(here("MP_data", "CAN_shots_2014.csv"))
shots_2015<- read_csv(here("MP_data", "CAN_shots_2015.csv"))
shots_2016<- read_csv(here("MP_data", "CAN_shots_2016.csv"))
shots_2017<- read_csv(here("MP_data", "CAN_shots_2017.csv"))
shots_2018<- read_csv(here("MP_data", "CAN_shots_2018.csv"))
shots_2019<- read_csv(here("MP_data", "CAN_shots_2019.csv"))
shots_2020<- read_csv(here("MP_data", "CAN_shots_2020.csv"))[, colnames(shots_2019)]
shots_2021<- read_csv(here("MP_data", "CAN_shots_2021.csv"))
shots_2022<- read_csv(here("MP_data", "CAN_shots_2022.csv"))
shot_data<- rbind(shots_2009, shots_2010, shots_2011, shots_2012, shots_2013, shots_2014, shots_2015, shots_2016, shots_2017, shots_2018, shots_2019, shots_2020, shots_2021, shots_2022)## What we need in the data frame:
# identity of team
# points scored
# poitns scored against them
# aggregate of thos values across the entire season
pe_nhl<- shot_data %>%
filter(season == 2022) %>%
clean_names() %>%
select(home_team_code, away_team_code, home_team_goals, away_team_goals, game_id)
## variable where 1 --> if home team won the game and 0 --> if away team didnt win the game
pe_nhl_score<- pe_nhl %>%
group_by(game_id) %>%
mutate(hwin = case_when (sum(home_team_goals)> sum(away_team_goals) ~ 1,
sum(away_team_goals)>sum(home_team_goals) ~ 0)) %>%
mutate(awin = case_when (sum(home_team_goals)< sum(away_team_goals) ~ 1,
sum(away_team_goals)<sum(home_team_goals) ~ 0)) %>%
mutate(count =1)
### aggregation
nhl_home<- pe_nhl_score %>% ### record of the team as a home team
group_by(home_team_code) %>%
summarise(hwin = sum(hwin),
home_team_goals = sum(home_team_goals),
away_team_goals = sum(away_team_goals),
count = sum(count)) %>%
rename(team = home_team_code, away_team_goals_h = away_team_goals, home_team_goals_h = home_team_goals, Gh = count)
nhl_away<- pe_nhl_score %>% ### record of the team as an away team
group_by(away_team_code) %>%
summarise(awin = sum(awin),
home_team_goals = sum(home_team_goals),
away_team_goals = sum(away_team_goals),
count = sum(count)) %>%
rename(team = away_team_code, away_team_goals_a = away_team_goals, home_team_goals_a = home_team_goals, Ga = count)
NHL22 <- merge(nhl_home, nhl_away, by = "team")
# Now we create the total wins, games, played, runs scored and run conceded by summing the totals as home team and away team
NHL22$W <- NHL22$hwin + NHL22$awin
NHL22$G <- NHL22$Gh + NHL22$Ga
NHL22$R <- NHL22$home_team_goals_h + NHL22$away_team_goals_a
NHL22$RA <- NHL22$away_team_goals_h + NHL22$home_team_goals_a
#NHL22
#define win percentage and the Pythagorean Expectation
NHL22$wpc <- NHL22$W / NHL22$G
NHL22$pyth <- NHL22$R^2 / (NHL22$R^2 + NHL22$RA^2)
#NHL22
## regression
pyth_lm <- lm(wpc ~ pyth, data = NHL22)
summary(pyth_lm)##
## Call:
## lm(formula = wpc ~ pyth, data = NHL22)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20828 -0.11902 -0.02924 0.11852 0.33060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.11477 0.06938 -1.654 0.109
## pyth 1.32684 0.11328 11.713 1.02e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 30 degrees of freedom
## Multiple R-squared: 0.8206, Adjusted R-squared: 0.8146
## F-statistic: 137.2 on 1 and 30 DF, p-value: 1.02e-12
wpc = Intercept + coef x pyth
# Create the linear regression model
pyth_lm <- lm(wpc ~ pyth, data = NHL22)
# Create a scatter plot with regression line using ggplot2
regression_plot <- ggplot(NHL22, aes(x = pyth, y = wpc)) +
geom_point(size=3) +
geom_smooth(method = "lm", se = FALSE, color = "blue", size= 1.2) +
labs(x = "Pythagorean Expectation", y = "Winning Percentage") +
theme_minimal()
# Print the plot
print(regression_plot)### comparing teams
# Create a scatter plot with regression line for each team
regression_plot_teams <- ggplot(NHL22, aes(x = pyth, y = wpc, label = team)) +
geom_point(size=3) +
geom_smooth(method = "lm", se = FALSE, color = "lightblue", size=1.2) +
geom_text_repel(
nudge_x = 0.02, nudge_y = 0.0005,
segment.size = 0.2, segment.color = "grey50"
) +
labs(x = "Pythagorean Expectation", y = "Winning Percentage") +
theme_minimal()
# Print the plot
print(regression_plot_teams)## What we need in the data frame:
# identity of team
# points scored
# poitns scored against them
# aggregate of thos values across the entire season
pe_nhl<- shot_data %>%
#filter(season == 2022) %>%
clean_names() %>%
select(home_team_code, away_team_code, home_team_goals, away_team_goals, game_id)
## variable where 1 --> if home team won the game and 0 --> if away team didnt win the game
pe_nhl_score<- pe_nhl %>%
group_by(game_id) %>%
mutate(hwin = case_when (sum(home_team_goals)> sum(away_team_goals) ~ 1,
sum(away_team_goals)>sum(home_team_goals) ~ 0)) %>%
mutate(awin = case_when (sum(home_team_goals)< sum(away_team_goals) ~ 1,
sum(away_team_goals)<sum(home_team_goals) ~ 0)) %>%
mutate(count =1)
### aggregation
nhl_home<- pe_nhl_score %>% ### record of the team as a home team
group_by(home_team_code) %>%
summarise(hwin = sum(hwin),
home_team_goals = sum(home_team_goals),
away_team_goals = sum(away_team_goals),
count = sum(count)) %>%
rename(team = home_team_code, away_team_goals_h = away_team_goals, home_team_goals_h = home_team_goals, Gh = count)
nhl_away<- pe_nhl_score %>% ### record of the team as an away team
group_by(away_team_code) %>%
summarise(awin = sum(awin),
home_team_goals = sum(home_team_goals),
away_team_goals = sum(away_team_goals),
count = sum(count)) %>%
rename(team = away_team_code, away_team_goals_a = away_team_goals, home_team_goals_a = home_team_goals, Ga = count)
NHL22 <- merge(nhl_home, nhl_away, by = "team")
# Now we create the total wins, games, played, runs scored and run conceded by summing the totals as home team and away team
NHL22$W <- NHL22$hwin + NHL22$awin
NHL22$G <- NHL22$Gh + NHL22$Ga
NHL22$R <- NHL22$home_team_goals_h + NHL22$away_team_goals_a
NHL22$RA <- NHL22$away_team_goals_h + NHL22$home_team_goals_a
#NHL22
#define win percentage and the Pythagorean Expectation
NHL22$wpc <- NHL22$W / NHL22$G
NHL22$pyth <- NHL22$R^2 / (NHL22$R^2 + NHL22$RA^2)
#NHL22
## regression
pyth_lm <- lm(wpc ~ pyth, data = NHL22)
summary(pyth_lm)##
## Call:
## lm(formula = wpc ~ pyth, data = NHL22)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22462 -0.06325 -0.01492 0.07677 0.22397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05434 0.06144 -0.884 0.385
## pyth 1.14875 0.11283 10.181 1.46e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09762 on 26 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.7995, Adjusted R-squared: 0.7918
## F-statistic: 103.7 on 1 and 26 DF, p-value: 1.457e-10
wpc = Intercept + coef x pyth
# Create the linear regression model
pyth_lm <- lm(wpc ~ pyth, data = NHL22)
# Create a scatter plot with regression line using ggplot2
regression_plot <- ggplot(NHL22, aes(x = pyth, y = wpc)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "navyblue") +
labs(x = "Pyth", y = "Winning Percentage", title = "Regression: wpc ~ pyth") +
theme_minimal()
# Print the plot
print(regression_plot)### comparing teams
# Create a scatter plot with regression line for each team
regression_plot <- ggplot(NHL22, aes(x = pyth, y = wpc, label = team)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "skyblue") +
geom_text_repel(
nudge_x = 0.02, nudge_y = 0.005,
segment.size = 0.2, segment.color = "grey50"
) +
labs(x = "Pyth", y = "Winning Percentage", title = "Regression: wpc ~ pyth") +
theme_minimal()
# Print the plot
print(regression_plot)One of the main reasons that people are interested in sports analytics is that they want to predict the outcome of events that have not yet occurred. Thus we want to go beyond “explanation” - finding the model that best fits the data (i.e. history) and to use our model to forecast the outcome of games in the future. Pythagorean Expectation can be thought of as a forecast. At any point in the season, it can be calculated based on the games already played. Using it as a forecast would amount to saying that from that point onward the win percentage of the team would equal the Pythagorean Expectation to date. In this notebook, we’re going to see if it is a good forecasting model in the context of the NHL data we examined earlier.
# Create a subset of NHL22 data frame containing selected columns
nhl_home_c <- pe_nhl_score[c('home_team_code', 'home_team_goals', 'away_team_goals', 'count', 'game_id')]
# Add a new column 'home' with value 1
nhl_home_c$home <- 1
# Rename columns
colnames(nhl_home_c) <- c('team', 'R', 'RA', 'count', 'game_id')
# Create a subset of NHL22 data frame containing selected columns
nhl_away_c <- pe_nhl_score[c('away_team_code', 'away_team_goals', 'home_team_goals', 'count', 'game_id')]
# Add a new column 'home' with value 1
nhl_away_c$home <- 0
# Rename columns
colnames(nhl_away_c) <- c('team', 'R', 'RA', 'count', 'game_id')
NHL22_c <- rbind(nhl_home_c, nhl_away_c) %>%
select(-'...6')
### define a win
NHL22_c$win <- ifelse(NHL22_c$R > NHL22_c$RA, 1, 0)
### split data
# Generate a random vector of indices to split the data frame
split_indices <- sample(nrow(NHL22_c), nrow(NHL22_c) / 2)
# Create the first half
Half1 <- NHL22_c[split_indices, ]
describe(Half1)## vars n mean sd median trimmed mad min max range
## team* 1 96143 25.32 11.30 34 26.84 1.48 1 37 36
## R 2 96143 1.29 1.35 1 1.10 1.48 0 9 9
## RA 3 96143 1.30 1.35 1 1.10 1.48 0 9 9
## count 4 96143 1.00 0.00 1 1.00 0.00 1 1 0
## game_id 5 96143 21173.71 2309.03 20624 20632.93 489.26 20003 30416 10413
## win 6 96143 0.32 0.47 0 0.28 0.00 0 1 1
## skew kurtosis se
## team* -0.82 -0.90 0.04
## R 1.13 1.10 0.00
## RA 1.13 1.11 0.00
## count NaN NaN 0.00
## game_id 3.57 11.11 7.45
## win 0.77 -1.41 0.00
# Create the second half
Half2 <- NHL22_c[-split_indices, ]
describe(Half2)## vars n mean sd median trimmed mad min max range
## team* 1 96143 25.38 11.25 34 26.88 0.00 1 37 36
## R 2 96143 1.30 1.36 1 1.11 1.48 0 9 9
## RA 3 96143 1.30 1.36 1 1.10 1.48 0 8 8
## count 4 96143 1.00 0.00 1 1.00 0.00 1 1 0
## game_id 5 96143 21195.87 2352.22 20623 20632.92 492.22 20003 30416 10413
## win 6 96143 0.32 0.47 0 0.28 0.00 0 1 1
## skew kurtosis se
## team* -0.82 -0.90 0.04
## R 1.12 1.05 0.00
## RA 1.12 1.05 0.00
## count NaN NaN 0.00
## game_id 3.49 10.51 7.59
## win 0.76 -1.42 0.00
# Group by 'team' and summarize data for the first half of the season
Half1perf <- Half1 %>%
group_by(team) %>%
summarize(count1 = sum(count),
win1 = sum(win),
R1 = sum(R),
RA1 = sum(RA)) %>%
ungroup() %>%
rename(count = count1, win = win1, R = R1, RA = RA1)
# Calculate win percentage and Pythagorean Expectation for the first half of the season
Half1perf <- Half1perf %>%
mutate(wpc1 = win / count,
pyth1 = R^2 / (R^2 + RA^2))
# Group by 'team' and summarize data for the second half of the season
Half2perf <- Half2 %>%
group_by(team) %>%
summarize(count2 = sum(count),
win2 = sum(win),
R2 = sum(R),
RA2 = sum(RA)) %>%
ungroup() %>%
rename(count = count2, win = win2, R = R2, RA = RA2)
# Calculate win percentage and Pythagorean Expectation for the second half of the season
Half2perf <- Half2perf %>%
mutate(wpc2 = win / count,
pyth2 = R^2 / (R^2 + RA^2))
# Merge the two data frames on the 'team' column
Half2predictor <- merge(Half1perf, Half2perf, by = "team")
# Create a scatter plot
scatter_plot <- ggplot(data = Half2predictor, aes(x = pyth1, y = wpc2)) +
geom_point(color = "#1f77b4", size = 4, alpha = 0.7) +
theme_minimal() +
labs(x = "Pythagorean Expectation (First Half)",
y = "Win Percentage (Second Half)",
title = "Pythagorean Expectation vs. Win Percentage") +
theme(plot.title = element_text(hjust = 0.5))
# Display the scatter plot
print(scatter_plot)## create a scatter plot comparing win percentage from the first half of the season against win percentage in the second half of the season
# Create a scatter plot
scatter_plot_1 <- ggplot(data = Half2predictor, aes(x = wpc1, y = wpc2)) +
geom_point(color = "#1f77b4", size = 4, alpha = 0.7) +
theme_minimal() +
labs(x = "Win% (Fall-Winter)",
y = "Win% (Winter-Spring)") +
theme(plot.title = element_text(hjust = 0.5))
# Display the scatter plot
print(scatter_plot_1)# The two plots look similar
# We can be more precise still if we compare the correlation coefficients. The first row of the table shows the
# correlation of win percentage in second half of the season against itself, win percentage in the first half of the season,
# Pythagorean Expectation in the first half of the season, and Pythagorean Expectation in the second half of the season.
# Our focus is on comparing the second and third columns.
# Extract the relevant columns
keyvars <- Half2predictor[c('team', 'wpc2', 'wpc1', 'pyth1', 'pyth2')]
# Calculate the correlation coefficients
correlation_matrix <- cor(keyvars[, c('wpc2', 'wpc1', 'pyth1', 'pyth2')])
# Print the correlation matrix
print(correlation_matrix)## wpc2 wpc1 pyth1 pyth2
## wpc2 1.0000000 0.9841039 0.9440588 0.9553679
## wpc1 0.9841039 1.0000000 0.9617343 0.9536036
## pyth1 0.9440588 0.9617343 1.0000000 0.9840436
## pyth2 0.9553679 0.9536036 0.9840436 1.0000000
# We can also sort the variables to show for each club how close the relationships are between the first and second half
# of the season
# Sort the data frame by 'wpc2' column in descending order
keyvars <- keyvars %>%
arrange(desc(wpc2))
# Print the sorted keyvars data frame
print(keyvars)## team wpc2 wpc1 pyth1 pyth2
## 1 TBL 0.78453039 0.68987342 0.89948082 0.9111842
## 2 NJD 0.59375000 0.49382716 0.70758123 0.7958848
## 3 MTL 0.51994498 0.50939306 0.70672138 0.7243709
## 4 ANA 0.44373576 0.45089481 0.68504348 0.6610888
## 5 WPG 0.44237175 0.43618421 0.66018500 0.6741355
## 6 T.B 0.44212219 0.44687500 0.67413341 0.6586442
## 7 VGK 0.43502304 0.43320236 0.62486265 0.6360283
## 8 N.J 0.39542484 0.38485317 0.60975610 0.6217893
## 9 S.J 0.38305785 0.39464068 0.60887172 0.5991963
## 10 WSH 0.37995595 0.38478747 0.56840402 0.5603843
## 11 FLA 0.37473684 0.38461538 0.56728221 0.5755606
## 12 MIN 0.36526946 0.37341475 0.55361232 0.5400345
## 13 TOR 0.36274510 0.35695187 0.53134566 0.5379332
## 14 BOS 0.35785953 0.35108153 0.58081994 0.5651861
## 15 PIT 0.35321637 0.30568182 0.41717047 0.4646130
## 16 DET 0.35214286 0.34634852 0.52634478 0.5499324
## 17 NYI 0.34926052 0.32051282 0.55127746 0.5668245
## 18 NYR 0.34521158 0.35258724 0.56086308 0.5999856
## 19 STL 0.34046997 0.35248447 0.60829783 0.5845219
## 20 BUF 0.33259177 0.31990521 0.49776287 0.4945870
## 21 CAR 0.32300358 0.33660131 0.50195847 0.5143636
## 22 CGY 0.31914251 0.31180470 0.46311493 0.4674723
## 23 EDM 0.31624214 0.32108573 0.50209533 0.4948948
## 24 SEA 0.31404959 0.32515337 0.48903086 0.5053638
## 25 VAN 0.30899612 0.30835417 0.48343139 0.4876765
## 26 LAK 0.29473684 0.27797834 0.41442904 0.5094331
## 27 DAL 0.28496197 0.27881885 0.41339329 0.4269574
## 28 CBJ 0.28453453 0.28623188 0.41983193 0.3973601
## 29 L.A 0.27549272 0.28523490 0.47524680 0.4785066
## 30 COL 0.27542579 0.27883137 0.44631103 0.4240710
## 31 CHI 0.26918018 0.25725594 0.45455043 0.4639067
## 32 NSH 0.26371414 0.24175824 0.42861598 0.4357576
## 33 PHI 0.25421822 0.28555556 0.36021601 0.3339415
## 34 ARI 0.24360830 0.24548736 0.34915556 0.3396058
## 35 OTT 0.20276163 0.20083391 0.27548070 0.2859064
## 36 SJS 0.01960784 0.01831502 0.24253348 0.2431256
## 37 ATL 0.00000000 0.00000000 0.08431689 0.1111699
keyvars_ext<-keyvars %>%
mutate(difference = (wpc2-wpc1)*100)We can see from the correlation matrix that win percentage for the second half of games is correlated with win percentage in the first half of games - the correlation coefficient is +0.983. That is a heavy correlation and suggests that the first half of games is highly predictive of the second half of games. Pythagorean Expectation is slightly less effective at forecasting - the correlation coefficient is lower, at +0.960. Though, the difference between predictor variables is not large.
When we sort the teams from highest to lowest send half of season win percentage, we find a mixed picture. Some clubs perform with less than one percentage point difference in each half, e.g. The ATL (Atlanta Thrashers), ARI (Arizona Coyotes), MIN (Minnesota Wild), while others differed up to 6 percentage points, e.g. BUF (Buffalo Sabres), and VGK (Vegas Golden Knights).
correlation_matrix <- cor(keyvars[, c('wpc2', 'wpc1', 'pyth1', 'pyth2')])
library(reshape2)
correlation_long <- melt(correlation_matrix)
correlation_long## Var1 Var2 value
## 1 wpc2 wpc2 1.0000000
## 2 wpc1 wpc2 0.9841039
## 3 pyth1 wpc2 0.9440588
## 4 pyth2 wpc2 0.9553679
## 5 wpc2 wpc1 0.9841039
## 6 wpc1 wpc1 1.0000000
## 7 pyth1 wpc1 0.9617343
## 8 pyth2 wpc1 0.9536036
## 9 wpc2 pyth1 0.9440588
## 10 wpc1 pyth1 0.9617343
## 11 pyth1 pyth1 1.0000000
## 12 pyth2 pyth1 0.9840436
## 13 wpc2 pyth2 0.9553679
## 14 wpc1 pyth2 0.9536036
## 15 pyth1 pyth2 0.9840436
## 16 pyth2 pyth2 1.0000000
heatmap_plot <- ggplot(correlation_long, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient(low = "#699FA1", high = "#C3E5E7") +
theme_minimal() +
labs()
# Print the heatmap plot
print(heatmap_plot)# Arrange the plots using cowplot
combined_plot <- plot_grid(
plot_grid(heatmap_plot, scatter_plot_1, ncol = 2, labels = c("A", "B")),
regression_plot_teams,
ncol = 1, nrow = 2,
rel_heights = c(1, 2),
labels = c("", "C")
)
# Display the combined plot
print(combined_plot)