The Pythagorean Expectation is a statistic used to estimate how many games a team should expect to win, based on how many points it scores and how many it allows.

In this project, we delve into the world of sports analytics by exploring the Pythagorean Expectation as a predictive tool for NHL team performance. Focusing on the 2022-2023 NHL season, we embark on a comprehensive analysis of the Pythagorean Expectation across all teams, and expand our investigation to encompass historical data from previous seasons. The ultimate goal is to assess the efficacy of this model by comparing its predictive capacity between the first and second halves of the ongoing season.

Overview:

Pythagorean Expectation: Named after the Pythagorean theorem because of the form of its formula, the Pythagorean Expectation is used in sports analytics to predict a team’s win-loss record from the points it scores and allows. It is a valuable tool for evaluating the efficiency of a team’s performance relative to its actual win-loss record.
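As a quick illustration of the formula, using hypothetical goal totals rather than anything from the data below:

# Pythagorean Expectation = points scored^2 / (points scored^2 + points allowed^2)
goals_for     <- 250   # hypothetical goals scored over a season
goals_against <- 200   # hypothetical goals allowed
goals_for^2 / (goals_for^2 + goals_against^2)   # ~0.61, an expected win percentage of about 61%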

Data Collection and Analysis: We gather detailed statistics for all NHL teams during the 2022-2023 season, including goals scored, goals allowed, and actual win-loss records. By applying the Pythagorean Expectation formula, we calculate the expected win-loss records for each team and compare them with their actual records. This provides insight into over-performing and under-performing teams in terms of wins and losses.

Historical Context: To understand the consistency of the Pythagorean Expectation across seasons, we extend our analysis to previous NHL seasons. By aggregating data from multiple seasons, we aim to identify trends, outliers, and whether the Pythagorean Expectation remains a relevant predictive tool over time.

Seasonal Dynamics: One key aspect of this project involves assessing whether the Pythagorean Expectation’s predictive capacity changes between the first and second halves of the ongoing NHL season. By dividing the season into two parts, we can analyze whether teams’ performance becomes more aligned with their expected records as the season progresses.

### Data

shots_2007<- read_csv(here("MP_data", "CAN_shots_2007.csv"))
shots_2008<- read_csv(here("MP_data", "CAN_shots_2008.csv"))
shots_2009<- read_csv(here("MP_data", "CAN_shots_2009.csv"))
shots_2010<- read_csv(here("MP_data", "CAN_shots_2010.csv"))
shots_2011<- read_csv(here("MP_data", "CAN_shots_2011.csv"))
shots_2012<- read_csv(here("MP_data", "CAN_shots_2012.csv"))
shots_2013<- read_csv(here("MP_data", "CAN_shots_2013.csv"))
shots_2014<- read_csv(here("MP_data", "CAN_shots_2014.csv"))
shots_2015<- read_csv(here("MP_data", "CAN_shots_2015.csv"))
shots_2016<- read_csv(here("MP_data", "CAN_shots_2016.csv"))
shots_2017<- read_csv(here("MP_data", "CAN_shots_2017.csv"))
shots_2018<- read_csv(here("MP_data", "CAN_shots_2018.csv"))
shots_2019<- read_csv(here("MP_data", "CAN_shots_2019.csv"))
shots_2020<- read_csv(here("MP_data", "CAN_shots_2020.csv"))[, colnames(shots_2019)] # keep only the columns shared with the other season files
shots_2021<- read_csv(here("MP_data", "CAN_shots_2021.csv"))
shots_2022<- read_csv(here("MP_data", "CAN_shots_2022.csv"))

# Combine the 2009-2022 seasons into one data frame (the 2007 and 2008 files are read above but not included here)
shot_data<- rbind(shots_2009, shots_2010, shots_2011, shots_2012, shots_2013, shots_2014, shots_2015, shots_2016, shots_2017, shots_2018, shots_2019, shots_2020, shots_2021, shots_2022)
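As a side note, the per-season reads could also be done in a single pass; a sketch using purrr, assuming the same here("MP_data", "CAN_shots_<year>.csv") naming and that only the columns shared by all seasons are wanted:

library(purrr)

# Read each season file once, keep only the columns common to every season
# (the 2020 file carries extra columns), then stack them into one data frame.
season_list <- map(2009:2022, ~ read_csv(here("MP_data", paste0("CAN_shots_", .x, ".csv"))))
shared_cols <- reduce(map(season_list, colnames), intersect)
shot_data   <- map(season_list, ~ select(.x, all_of(shared_cols))) %>% bind_rows()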

### Pythagorean Expectation for the 2022 season

## What we need in the data frame: 

# identity of team 
# points scored
# points scored against them
# aggregate of those values across the entire season 

pe_nhl<- shot_data %>% 
  filter(season == 2022) %>% 
  clean_names() %>% 
  select(home_team_code, away_team_code, home_team_goals, away_team_goals, game_id)

## hwin = 1 if the home team won the game, 0 if it lost; awin is the mirror image for the away team

pe_nhl_score <- pe_nhl %>% 
  group_by(game_id) %>% 
  mutate(hwin = case_when(sum(home_team_goals) > sum(away_team_goals) ~ 1,
                          sum(home_team_goals) < sum(away_team_goals) ~ 0),
         awin = case_when(sum(away_team_goals) > sum(home_team_goals) ~ 1,
                          sum(away_team_goals) < sum(home_team_goals) ~ 0),
         count = 1)
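One caveat: when a game's aggregated goals are level, neither case_when() condition matches, so hwin and awin become NA and then propagate through sum() during aggregation, which can later surface as observations deleted for missingness. A sketch of one way to make that case explicit (how ties should be scored is an analytical choice, not part of the original analysis):

# Sketch: credit half a win to each side when the aggregated goals are level,
# so the per-team sums stay numeric.
pe_nhl_score_ties <- pe_nhl %>% 
  group_by(game_id) %>% 
  mutate(hwin = case_when(sum(home_team_goals) > sum(away_team_goals) ~ 1,
                          sum(home_team_goals) < sum(away_team_goals) ~ 0,
                          TRUE ~ 0.5),
         awin = 1 - hwin,
         count = 1)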


### aggregation 

nhl_home <- pe_nhl_score %>% ### record of the team as a home team
  group_by(home_team_code) %>% 
  summarise(hwin = sum(hwin),
            home_team_goals = sum(home_team_goals),
            away_team_goals = sum(away_team_goals),
            count = sum(count)) %>%
  rename(team = home_team_code, away_team_goals_h = away_team_goals, home_team_goals_h = home_team_goals, Gh = count)

nhl_away<- pe_nhl_score %>% ### record of the team as an away team 
  group_by(away_team_code) %>% 
  summarise(awin = sum(awin),
            home_team_goals = sum(home_team_goals),
            away_team_goals = sum(away_team_goals),
            count = sum(count)) %>%
  rename(team = away_team_code, away_team_goals_a = away_team_goals, home_team_goals_a = home_team_goals, Ga = count)


NHL22 <- merge(nhl_home, nhl_away, by = "team")


# Now we compute total wins, games played, goals scored and goals conceded by summing each team's totals as the home team and as the away team


NHL22$W <- NHL22$hwin + NHL22$awin
NHL22$G <- NHL22$Gh + NHL22$Ga
NHL22$R <- NHL22$home_team_goals_h + NHL22$away_team_goals_a
NHL22$RA <- NHL22$away_team_goals_h + NHL22$home_team_goals_a
#NHL22

#define win percentage and the Pythagorean Expectation

NHL22$wpc <- NHL22$W / NHL22$G
NHL22$pyth <- NHL22$R^2 / (NHL22$R^2 + NHL22$RA^2)
#NHL22


## regression

pyth_lm <- lm(wpc ~ pyth, data = NHL22)
summary(pyth_lm)
## 
## Call:
## lm(formula = wpc ~ pyth, data = NHL22)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20828 -0.11902 -0.02924  0.11852  0.33060 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.11477    0.06938  -1.654    0.109    
## pyth         1.32684    0.11328  11.713 1.02e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 30 degrees of freedom
## Multiple R-squared:  0.8206, Adjusted R-squared:  0.8146 
## F-statistic: 137.2 on 1 and 30 DF,  p-value: 1.02e-12

wpc = intercept + coefficient × pyth

  • For every one-unit increase in pyth, wpc increases by about 1.33 (see the quick check below)
  • Pythagorean Expectation accounts for about 81.5% of the variation in win percentage (adjusted R-squared = 0.815)
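For example, the fitted model can be used to read off an expected win percentage for a hypothetical Pythagorean Expectation of 0.55:

# Expected win percentage at pyth = 0.55, using the coefficients above:
# -0.115 + 1.327 * 0.55 is roughly 0.61
predict(pyth_lm, newdata = data.frame(pyth = 0.55))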
# Create the linear regression model
pyth_lm <- lm(wpc ~ pyth, data = NHL22)

# Create a scatter plot with regression line using ggplot2
regression_plot <- ggplot(NHL22, aes(x = pyth, y = wpc)) +
  geom_point(size=3) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", size= 1.2) +
  labs(x = "Pythagorean Expectation", y = "Winning Percentage") +
  theme_minimal()

# Print the plot
print(regression_plot)

### comparing teams 

# Create a scatter plot with regression line for each team
regression_plot_teams <- ggplot(NHL22, aes(x = pyth, y = wpc, label = team)) +
  geom_point(size=3) +
  geom_smooth(method = "lm", se = FALSE, color = "lightblue", size=1.2) +
  geom_text_repel(
    nudge_x = 0.02, nudge_y = 0.0005,
    segment.size = 0.2, segment.color = "grey50"
  ) +
  labs(x = "Pythagorean Expectation", y = "Winning Percentage") +
  theme_minimal()

# Print the plot
print(regression_plot_teams)
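To connect this back to the over- and under-performing teams mentioned in the overview, one can also look at the residuals of the fit; a short sketch:

# Teams whose actual win percentage sits furthest above the fitted line
# "over-perform" their goal ratio; large negative residuals are the reverse.
NHL22 %>%
  mutate(residual = wpc - predict(pyth_lm, newdata = NHL22)) %>%
  select(team, wpc, pyth, residual) %>%
  arrange(desc(residual)) %>%
  head(5)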

### Pythagorean Expectation for all seasons

## What we need in the data frame: 

# identity of team 
# points scored
# points scored against them
# aggregate of those values across the entire season 

pe_nhl<- shot_data %>% 
  #filter(season == 2022) %>% 
  clean_names() %>% 
  select(home_team_code, away_team_code, home_team_goals, away_team_goals, game_id)

## hwin = 1 if the home team won the game, 0 if it lost; awin is the mirror image for the away team

pe_nhl_score <- pe_nhl %>% 
  group_by(game_id) %>% 
  mutate(hwin = case_when(sum(home_team_goals) > sum(away_team_goals) ~ 1,
                          sum(home_team_goals) < sum(away_team_goals) ~ 0),
         awin = case_when(sum(away_team_goals) > sum(home_team_goals) ~ 1,
                          sum(away_team_goals) < sum(home_team_goals) ~ 0),
         count = 1)


### aggregation 

nhl_home <- pe_nhl_score %>% ### record of the team as a home team
  group_by(home_team_code) %>% 
  summarise(hwin = sum(hwin),
            home_team_goals = sum(home_team_goals),
            away_team_goals = sum(away_team_goals),
            count = sum(count)) %>%
  rename(team = home_team_code, away_team_goals_h = away_team_goals, home_team_goals_h = home_team_goals, Gh = count)

nhl_away<- pe_nhl_score %>% ### record of the team as an away team 
  group_by(away_team_code) %>% 
  summarise(awin = sum(awin),
            home_team_goals = sum(home_team_goals),
            away_team_goals = sum(away_team_goals),
            count = sum(count)) %>%
  rename(team = away_team_code, away_team_goals_a = away_team_goals, home_team_goals_a = home_team_goals, Ga = count)


NHL22 <- merge(nhl_home, nhl_away, by = "team") # note: the NHL22 name is reused here even though this aggregation now covers all seasons


# Now we compute total wins, games played, goals scored and goals conceded by summing each team's totals as the home team and as the away team


NHL22$W <- NHL22$hwin + NHL22$awin
NHL22$G <- NHL22$Gh + NHL22$Ga
NHL22$R <- NHL22$home_team_goals_h + NHL22$away_team_goals_a
NHL22$RA <- NHL22$away_team_goals_h + NHL22$home_team_goals_a
#NHL22

#define win percentage and the Pythagorean Expectation

NHL22$wpc <- NHL22$W / NHL22$G
NHL22$pyth <- NHL22$R^2 / (NHL22$R^2 + NHL22$RA^2)
#NHL22


## regression

pyth_lm <- lm(wpc ~ pyth, data = NHL22)
summary(pyth_lm)
## 
## Call:
## lm(formula = wpc ~ pyth, data = NHL22)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22462 -0.06325 -0.01492  0.07677  0.22397 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.05434    0.06144  -0.884    0.385    
## pyth         1.14875    0.11283  10.181 1.46e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09762 on 26 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.7995, Adjusted R-squared:  0.7918 
## F-statistic: 103.7 on 1 and 26 DF,  p-value: 1.457e-10

wpc = intercept + coefficient × pyth

  • For every one-unit increase in pyth, wpc increases by about 1.15
  • Pythagorean Expectation accounts for about 80% of the variation in win percentage (R-squared = 0.80, adjusted 0.79)
# Create the linear regression model
pyth_lm <- lm(wpc ~ pyth, data = NHL22)

# Create a scatter plot with regression line using ggplot2
regression_plot <- ggplot(NHL22, aes(x = pyth, y = wpc)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "navyblue") +
  labs(x = "Pyth", y = "Winning Percentage", title = "Regression: wpc ~ pyth") +
  theme_minimal()

# Print the plot
print(regression_plot)

### comparing teams 

# Create a scatter plot with regression line for each team
regression_plot <- ggplot(NHL22, aes(x = pyth, y = wpc, label = team)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "skyblue") +
  geom_text_repel(
    nudge_x = 0.02, nudge_y = 0.005,
    segment.size = 0.2, segment.color = "grey50"
  ) +
  labs(x = "Pyth", y = "Winning Percentage", title = "Regression: wpc ~ pyth") +
  theme_minimal()

# Print the plot
print(regression_plot)

### Forecasting

One of the main reasons people are interested in sports analytics is that they want to predict the outcome of events that have not yet occurred. Thus we want to go beyond “explanation” (finding the model that best fits the historical data) and use our model to forecast the outcomes of future games. The Pythagorean Expectation can be thought of as a forecast: at any point in the season it can be calculated from the games already played, and using it as a forecast amounts to saying that, from that point onward, the team’s win percentage will equal its Pythagorean Expectation to date. In this notebook we ask whether it is a good forecasting model in the context of the NHL data examined above.
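As a toy illustration of that idea, with hypothetical numbers not drawn from the data:

# If a team's Pythagorean Expectation over the games played so far is 0.60 and it
# has 41 games left, the forecast is roughly 0.60 * 41, i.e. about 25 more wins.
pyth_to_date    <- 0.60
games_remaining <- 41
pyth_to_date * games_remaining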

# Create a subset of pe_nhl_score containing the home-team columns
nhl_home_c <- pe_nhl_score[c('home_team_code', 'home_team_goals', 'away_team_goals', 'count', 'game_id')]

# Add a new column 'home' with value 1
nhl_home_c$home <- 1
# Rename columns (only five names are supplied for six columns, so the 'home'
# flag is silently renamed to '...6' by tibble's name repair)
colnames(nhl_home_c) <- c('team', 'R', 'RA', 'count', 'game_id')


# Create a subset of pe_nhl_score containing the away-team columns
nhl_away_c <- pe_nhl_score[c('away_team_code', 'away_team_goals', 'home_team_goals', 'count', 'game_id')]

# Add a new column 'home' with value 0
nhl_away_c$home <- 0
# Rename columns (again, the unnamed sixth column becomes '...6')
colnames(nhl_away_c) <- c('team', 'R', 'RA', 'count', 'game_id')


NHL22_c <- rbind(nhl_home_c, nhl_away_c) %>% 
  select(-'...6')   # drop the home/away flag, which ended up named '...6' above
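A slightly more defensive way to build the same long table would be to rename the columns as part of the selection, so the home/away flag keeps its name instead of being repaired to '...6'; a sketch:

# Sketch: one row per team per game, built with transmute() so every column keeps
# an explicit name (and the home/away flag is retained rather than dropped).
nhl_home_c2 <- pe_nhl_score %>%
  ungroup() %>%
  transmute(team = home_team_code, R = home_team_goals, RA = away_team_goals,
            count, game_id, home = 1)

nhl_away_c2 <- pe_nhl_score %>%
  ungroup() %>%
  transmute(team = away_team_code, R = away_team_goals, RA = home_team_goals,
            count, game_id, home = 0)

NHL22_c2 <- bind_rows(nhl_home_c2, nhl_away_c2)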
  
### define a win

NHL22_c$win <- ifelse(NHL22_c$R > NHL22_c$RA, 1, 0)


### split data 

# Split the game records into two equal halves at random
# (note: this is a random 50/50 split of rows, not a strictly chronological split of the schedule;
#  see the chronological sketch a little further below)
split_indices <- sample(nrow(NHL22_c), nrow(NHL22_c) / 2)

# Create the first half
Half1 <- NHL22_c[split_indices, ]
describe(Half1)
##         vars     n     mean      sd median  trimmed    mad   min   max range
## team*      1 96143    25.32   11.30     34    26.84   1.48     1    37    36
## R          2 96143     1.29    1.35      1     1.10   1.48     0     9     9
## RA         3 96143     1.30    1.35      1     1.10   1.48     0     9     9
## count      4 96143     1.00    0.00      1     1.00   0.00     1     1     0
## game_id    5 96143 21173.71 2309.03  20624 20632.93 489.26 20003 30416 10413
## win        6 96143     0.32    0.47      0     0.28   0.00     0     1     1
##          skew kurtosis   se
## team*   -0.82    -0.90 0.04
## R        1.13     1.10 0.00
## RA       1.13     1.11 0.00
## count     NaN      NaN 0.00
## game_id  3.57    11.11 7.45
## win      0.77    -1.41 0.00
# Create the second half
Half2 <- NHL22_c[-split_indices, ]
describe(Half2)
##         vars     n     mean      sd median  trimmed    mad   min   max range
## team*      1 96143    25.38   11.25     34    26.88   0.00     1    37    36
## R          2 96143     1.30    1.36      1     1.11   1.48     0     9     9
## RA         3 96143     1.30    1.36      1     1.10   1.48     0     8     8
## count      4 96143     1.00    0.00      1     1.00   0.00     1     1     0
## game_id    5 96143 21195.87 2352.22  20623 20632.92 492.22 20003 30416 10413
## win        6 96143     0.32    0.47      0     0.28   0.00     0     1     1
##          skew kurtosis   se
## team*   -0.82    -0.90 0.04
## R        1.12     1.05 0.00
## RA       1.12     1.05 0.00
## count     NaN      NaN 0.00
## game_id  3.49    10.51 7.59
## win      0.76    -1.42 0.00
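Since the project's framing is about the first versus second half of the season, it is worth noting that the split above is random rather than chronological. A chronological alternative could look like this (a sketch, assuming game_id increases through a season's schedule):

# Sketch: order the game records and cut at the midpoint instead of sampling.
# (With several seasons pooled, a truly per-season split would also need the
# season column retained upstream.)
NHL22_chrono <- NHL22_c[order(NHL22_c$game_id), ]
midpoint     <- floor(nrow(NHL22_chrono) / 2)
Half1_chrono <- NHL22_chrono[seq_len(midpoint), ]
Half2_chrono <- NHL22_chrono[(midpoint + 1):nrow(NHL22_chrono), ]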
# Group by 'team' and summarize data for the first half of the season
Half1perf <- Half1 %>%
  group_by(team) %>%
  summarize(count1 = sum(count),
            win1 = sum(win),
            R1 = sum(R),
            RA1 = sum(RA)) %>%
  ungroup() %>%
  rename(count = count1, win = win1, R = R1, RA = RA1)

# Calculate win percentage and Pythagorean Expectation for the first half of the season
Half1perf <- Half1perf %>%
  mutate(wpc1 = win / count,
         pyth1 = R^2 / (R^2 + RA^2))

# Group by 'team' and summarize data for the second half of the season
Half2perf <- Half2 %>%
  group_by(team) %>%
  summarize(count2 = sum(count),
            win2 = sum(win),
            R2 = sum(R),
            RA2 = sum(RA)) %>%
  ungroup() %>%
  rename(count = count2, win = win2, R = R2, RA = RA2)

# Calculate win percentage and Pythagorean Expectation for the second half of the season
Half2perf <- Half2perf %>%
  mutate(wpc2 = win / count,
         pyth2 = R^2 / (R^2 + RA^2))


# Merge the two data frames on the 'team' column
Half2predictor <- merge(Half1perf, Half2perf, by = "team")

# Create a scatter plot
scatter_plot <- ggplot(data = Half2predictor, aes(x = pyth1, y = wpc2)) +
  geom_point(color = "#1f77b4", size = 4, alpha = 0.7) +
  theme_minimal() +
  labs(x = "Pythagorean Expectation (First Half)",
       y = "Win Percentage (Second Half)",
       title = "Pythagorean Expectation vs. Win Percentage") +
  theme(plot.title = element_text(hjust = 0.5))


# Display the scatter plot
print(scatter_plot)

## create a scatter plot comparing win percentage from the first half of the season against win percentage in the second half of the season

# Create a scatter plot
scatter_plot_1 <- ggplot(data = Half2predictor, aes(x = wpc1, y = wpc2)) +
  geom_point(color = "#1f77b4", size = 4, alpha = 0.7) +
  theme_minimal() +
  labs(x = "Win% (Fall-Winter)",
       y = "Win% (Winter-Spring)") +
  theme(plot.title = element_text(hjust = 0.5))

# Display the scatter plot
print(scatter_plot_1)

# The two plots look similar.
# We can be more precise by comparing correlation coefficients. The first row of the table shows the
# correlation of second-half win percentage against itself, against first-half win percentage,
# against first-half Pythagorean Expectation, and against second-half Pythagorean Expectation.
# Our focus is on comparing the second and third columns.

# Extract the relevant columns
keyvars <- Half2predictor[c('team', 'wpc2', 'wpc1', 'pyth1', 'pyth2')]

# Calculate the correlation coefficients
correlation_matrix <- cor(keyvars[, c('wpc2', 'wpc1', 'pyth1', 'pyth2')])

# Print the correlation matrix
print(correlation_matrix)
##            wpc2      wpc1     pyth1     pyth2
## wpc2  1.0000000 0.9841039 0.9440588 0.9553679
## wpc1  0.9841039 1.0000000 0.9617343 0.9536036
## pyth1 0.9440588 0.9617343 1.0000000 0.9840436
## pyth2 0.9553679 0.9536036 0.9840436 1.0000000
# We can also sort the table to show, for each club, how closely the first and second halves
# of the season track each other.

# Sort the data frame by 'wpc2' column in descending order
keyvars <- keyvars %>%
  arrange(desc(wpc2))

# Print the sorted keyvars data frame
print(keyvars)
##    team       wpc2       wpc1      pyth1     pyth2
## 1   TBL 0.78453039 0.68987342 0.89948082 0.9111842
## 2   NJD 0.59375000 0.49382716 0.70758123 0.7958848
## 3   MTL 0.51994498 0.50939306 0.70672138 0.7243709
## 4   ANA 0.44373576 0.45089481 0.68504348 0.6610888
## 5   WPG 0.44237175 0.43618421 0.66018500 0.6741355
## 6   T.B 0.44212219 0.44687500 0.67413341 0.6586442
## 7   VGK 0.43502304 0.43320236 0.62486265 0.6360283
## 8   N.J 0.39542484 0.38485317 0.60975610 0.6217893
## 9   S.J 0.38305785 0.39464068 0.60887172 0.5991963
## 10  WSH 0.37995595 0.38478747 0.56840402 0.5603843
## 11  FLA 0.37473684 0.38461538 0.56728221 0.5755606
## 12  MIN 0.36526946 0.37341475 0.55361232 0.5400345
## 13  TOR 0.36274510 0.35695187 0.53134566 0.5379332
## 14  BOS 0.35785953 0.35108153 0.58081994 0.5651861
## 15  PIT 0.35321637 0.30568182 0.41717047 0.4646130
## 16  DET 0.35214286 0.34634852 0.52634478 0.5499324
## 17  NYI 0.34926052 0.32051282 0.55127746 0.5668245
## 18  NYR 0.34521158 0.35258724 0.56086308 0.5999856
## 19  STL 0.34046997 0.35248447 0.60829783 0.5845219
## 20  BUF 0.33259177 0.31990521 0.49776287 0.4945870
## 21  CAR 0.32300358 0.33660131 0.50195847 0.5143636
## 22  CGY 0.31914251 0.31180470 0.46311493 0.4674723
## 23  EDM 0.31624214 0.32108573 0.50209533 0.4948948
## 24  SEA 0.31404959 0.32515337 0.48903086 0.5053638
## 25  VAN 0.30899612 0.30835417 0.48343139 0.4876765
## 26  LAK 0.29473684 0.27797834 0.41442904 0.5094331
## 27  DAL 0.28496197 0.27881885 0.41339329 0.4269574
## 28  CBJ 0.28453453 0.28623188 0.41983193 0.3973601
## 29  L.A 0.27549272 0.28523490 0.47524680 0.4785066
## 30  COL 0.27542579 0.27883137 0.44631103 0.4240710
## 31  CHI 0.26918018 0.25725594 0.45455043 0.4639067
## 32  NSH 0.26371414 0.24175824 0.42861598 0.4357576
## 33  PHI 0.25421822 0.28555556 0.36021601 0.3339415
## 34  ARI 0.24360830 0.24548736 0.34915556 0.3396058
## 35  OTT 0.20276163 0.20083391 0.27548070 0.2859064
## 36  SJS 0.01960784 0.01831502 0.24253348 0.2431256
## 37  ATL 0.00000000 0.00000000 0.08431689 0.1111699
keyvars_ext <- keyvars %>% 
  mutate(difference = (wpc2 - wpc1) * 100)   # change in win percentage between halves, in percentage points
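That difference column can then be sorted to show which clubs moved the most between the two halves; a small follow-up:

# Clubs with the largest absolute swing in win percentage between the halves
keyvars_ext %>%
  arrange(desc(abs(difference))) %>%
  select(team, wpc1, wpc2, difference) %>%
  head(5)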

### Conclusion

We can see from the correlation matrix that win percentage in the second half of games is strongly correlated with win percentage in the first half of games: the correlation coefficient is +0.984. This suggests that the first half of games is highly predictive of the second half. The Pythagorean Expectation is slightly less effective as a forecaster, with a lower correlation coefficient of +0.944, though the difference between the two predictors is not large.

When we sort the teams from highest to lowest second-half win percentage, we find a mixed picture. Some clubs show less than one percentage point of difference between the two halves, e.g. ATL (Atlanta Thrashers), ARI (Arizona Coyotes) and MIN (Minnesota Wild), while others differ by roughly ten percentage points, e.g. TBL (Tampa Bay Lightning) and NJD (New Jersey Devils).

correlation_matrix <- cor(keyvars[, c('wpc2', 'wpc1', 'pyth1', 'pyth2')])

library(reshape2)

correlation_long <- melt(correlation_matrix)

correlation_long
##     Var1  Var2     value
## 1   wpc2  wpc2 1.0000000
## 2   wpc1  wpc2 0.9841039
## 3  pyth1  wpc2 0.9440588
## 4  pyth2  wpc2 0.9553679
## 5   wpc2  wpc1 0.9841039
## 6   wpc1  wpc1 1.0000000
## 7  pyth1  wpc1 0.9617343
## 8  pyth2  wpc1 0.9536036
## 9   wpc2 pyth1 0.9440588
## 10  wpc1 pyth1 0.9617343
## 11 pyth1 pyth1 1.0000000
## 12 pyth2 pyth1 0.9840436
## 13  wpc2 pyth2 0.9553679
## 14  wpc1 pyth2 0.9536036
## 15 pyth1 pyth2 0.9840436
## 16 pyth2 pyth2 1.0000000
heatmap_plot <- ggplot(correlation_long, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "#699FA1", high = "#C3E5E7") +
  theme_minimal() +
  labs(x = NULL, y = NULL, fill = "Correlation")

# Print the heatmap plot
print(heatmap_plot)

# Arrange the plots using cowplot
combined_plot <- plot_grid(
  plot_grid(heatmap_plot, scatter_plot_1, ncol = 2, labels = c("A", "B")),
  regression_plot_teams,
  ncol = 1, nrow = 2,
  rel_heights = c(1, 2),
  labels = c("", "C")
)

# Display the combined plot
print(combined_plot)