library(tidyverse)
library(stringdist)
library(ineq)
library(DT)
library(broom)
library(jtools)

theme_update(plot.title = element_text(hjust = 0.5)) 

Overview

The objective of this project is to take a look at how professional sports teams distribute their budgets among players, and to see if there’s a significant relationship between the distribution of wages and the team’s success. Salary should be a reasonable proxy for skill, since the market value of a player is largely (but not entirely) determined by their ability. One could argue that soccer is more of a team sport and basketball is more individual. It would then follow that soccer teams benefit more from having a distributed investment in their players, whereas individual standouts might be weighted more heavily in basketball.

I scraped data from hoopshype.com, fbref.com and basketball-reference.com. These datasets contain information about player salaries, team salary budgets, and team season statistics (wins, losses, points, etc.). Basketball data is limited to the NBA, whereas soccer data is pulled from the Premier League, La Liga, Serie A, the Bundesliga, and Major League Soccer. NBA data goes back to 1990, whereas soccer data is more recent. Data on the European leagues goes back to 2013, and MLS data goes back to 2007.

First Look

Team Salary Budgets

All salary numbers have been converted to USD. The average basketball team’s annual budget is much higher than the average soccer team’s, regardless of league. Budgets have generally been trending upwards in both sports, though interestingly La Liga and Serie A have seen a decline in recent years.

my_color_palette <- c("#3498db", "#e74c3c", "#2ecc71", "#f39c12", "#9b59b6")

basketball_teams %>%
  group_by(year_2) %>%
  summarize(average_salary_budget = mean(salary, na.rm = TRUE)) %>%
  ggplot(aes(year_2, average_salary_budget)) +
  geom_line(size = .7, color =  "#3498db") +
  geom_point(shape = 17, color = "#3498db") +
  geom_text(aes(family = "serif", label = ifelse(year_2 == max(year_2) | year_2 == min(year_2), scales::comma(average_salary_budget), "")),
            hjust = .5, vjust = -1.5, size = 3) +
  scale_y_continuous(labels = scales::comma, limits = c(0, 170000000)) +
  scale_x_continuous(breaks = seq(min(basketball_teams$year_2), max(basketball_teams$year_2), by = 2)) +
  labs(x = "Year", y = "Average Salary Budget", title = "NBA Average Team Salary Budget by Year") +
  theme(text = element_text(family = "serif"))

soccer_teams %>%
  mutate(league = factor(league, levels = c('premier_league', 'la_liga', 'serie_a', 'bundesliga', 'mls'))) %>%
  group_by(year_2, league) %>%
  summarize(average_salary_budget = mean(annual_wages, na.rm = TRUE)) %>%
  ggplot(aes(year_2, average_salary_budget, color = league)) +
  geom_line() +
  geom_point(shape = 17) +
  scale_y_continuous(labels = scales::comma) +
  scale_color_manual(values = my_color_palette,
                     labels = c("Premier League", "La Liga", "Serie A", "Bundesliga", "MLS")) +
  labs(x = "Year", y = "Average Salary Budget", title = "Soccer Club Salary Budgets by Year", color = "League") +
  theme(text = element_text(family = "serif"))

Total Salary Distributions

While the average NBA team budget is much higher than the average soccer team’s, the overall distribution of budgets among soccer teams is heavily skewed right. The 2019-2020 Real Madrid budget was over 356 million, dwarfing the NBA’s highest budgets which fall below 200 million.

basketball_teams %>%
  filter(year_2 >= 2015) %>%
  ggplot(aes(salary)) +
  geom_histogram(bins = 40, color = "white", fill = "#3498db", alpha = 0.7) +
  scale_x_continuous(labels = scales::comma, limits = c(50000000, 200000000)) +
  labs(x = "Team Salary", y = "Count", title = "Distribution of NBA Team Salaries Since 2015") +
  theme_minimal() +
  theme(
    text = element_text(family = "serif"),
    plot.title = element_text(hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    axis.text = element_text(size = 10),
    plot.background = element_blank()
  )

soccer_teams %>%
  filter(year_2 >= 2015) %>%
  ggplot(aes(annual_wages)) +
  geom_histogram(bins = 60, color = "white", fill = "#3498db", alpha = 0.7) +
  scale_x_continuous(labels = scales::comma, limits = c(0, 400000000)) +
  labs(x = "Team Salary", y = "Count", title = "Distribution of Soccer Team Salaries Since 2015") +
  theme_minimal() +
  theme(
    text = element_text(family = "serif"),
    plot.title = element_text(hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    axis.text = element_text(size = 10),
    plot.background = element_blank()
  )

Gini Coefficients and Pay Inequality

The Gini Coefficient is a measure of how unevenly income or wealth is distributed across a population. The coefficient ranges from 0 to 1. A value of 1 represents perfect inequality, where a single individual controls 100% of the wealth, and a value of 0 represents a perfectly even distribution of wealth.

Since professional teams keep many players on their payroll who may never see playing time, I decided to calculate Gini coefficients only for the top earners of each team: the 16 highest earners for soccer teams and the 10 highest earners for basketball teams.
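As a rough illustration of the top-earner calculation, the sketch below computes a Gini coefficient for the highest-paid players on a single team. It is a minimal base-R version under stated assumptions, not necessarily the code behind this report: the `gini()` and `top_n_gini()` helpers and the roster figures are hypothetical, and the formula is the standard uncorrected sample Gini (which should agree with the default behaviour of `ineq::Gini()`, though that is an assumption worth checking against the package documentation).

```r
# Uncorrected sample Gini coefficient for a vector of non-negative salaries.
gini <- function(x) {
  x <- sort(x[!is.na(x)])           # ascending order, missing values dropped
  n <- length(x)
  2 * sum(x * seq_len(n)) / (n * sum(x)) - (n + 1) / n
}

# Gini restricted to the n highest-paid players (hypothetical helper).
top_n_gini <- function(salaries, n) {
  top <- head(sort(salaries, decreasing = TRUE), n)
  gini(top)
}

# Hypothetical 12-player roster, salaries in millions of USD.
roster <- c(40, 12, 10, 8, 6, 5, 4, 3, 2, 2, 1, 1)
top_n_gini(roster, 10)

# Sanity checks: equal pay gives 0; a single earner among n players gives (n - 1) / n.
gini(rep(5, 10))
gini(c(0, 0, 0, 10))
```

One consequence of this formula is that a group of n players can never exceed (n - 1) / n, so a top-10 Gini is capped at 0.9 and a top-16 Gini at 0.9375. The caps are close, but it is one reason to read comparisons between the two sports with some care.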

The mean Gini Coefficient among basketball teams is 0.334, and the mean among soccer teams is 0.281. Interestingly, MLS teams distribute their wages much more unevenly compared to their European counterparts. The team with the highest Gini Coefficient is the 2008-2009 LA Galaxy at 0.78, where David Beckham had a salary of 5.5 million, but the team’s entire budget was little more than 8 million. The mean Gini Coefficient for European clubs is 0.225, and for the MLS it is 0.413.

Distributions of Gini Coefficients

basketball_teams %>%
  ggplot(aes(gini)) + 
  geom_histogram(bins = 50, color = "white", fill = "#3498db", alpha = 0.7) + 
  geom_vline(xintercept = mean(basketball_teams$gini), color = "#e74c3c", linetype = "dashed", size = 1) +  
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  labs(
    x = "Gini Coefficient",
    y = "Count",
    title = "Distribution of NBA Team Gini Coefficients",
    subtitle = paste("Mean: ", round(mean(basketball_teams$gini), 3),
                      ", SD: ", scales::number(sd(basketball_teams$gini), accuracy = 0.01)),
  )  +
  theme_minimal() +
  theme(
    text = element_text(family = "serif"),
    plot.title = element_text(hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    axis.text = element_text(size = 10),
    plot.background = element_blank()
  )

european_teams <- filter(soccer_teams, league != "mls")

european_teams %>%
  ggplot(aes(gini)) + 
  geom_histogram(bins = 40, color = "white", fill = "#3498db", alpha = 0.7) + 
  geom_vline(xintercept = mean(european_teams$gini, na.rm = TRUE), color = "#e74c3c", linetype = "dashed", size = 1) +  
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  labs(
    x = "Gini Coefficient",
    y = "Count",
    title = "Distribution of European Soccer Team Gini Coefficients",
    subtitle = paste("Mean: ", round(mean(european_teams$gini, na.rm = TRUE), 3),
                      ", SD: ", scales::number(sd(european_teams$gini, na.rm = TRUE), accuracy = 0.01)),
  )  +
  theme_minimal() +
  theme(
    text = element_text(family = "serif"),
    plot.title = element_text(hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    axis.text = element_text(size = 10),
    plot.background = element_blank()
  )

mls_teams <- soccer_teams %>% filter(league == "mls")

mls_teams %>%
  ggplot(aes(gini)) + 
  geom_histogram(bins = 40, color = "white", fill = "#3498db", alpha = 0.7) + 
  geom_vline(xintercept = mean(mls_teams$gini, na.rm = TRUE), color = "#e74c3c", linetype = "dashed", size = 1) +  
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  labs(
    x = "Gini Coefficient",
    y = "Count",
    title = "Distribution of MLS Team Gini Coefficients",
    subtitle = paste("Mean: ", round(mean(mls_teams$gini, na.rm = TRUE), 3),
                      ", SD: ", scales::number(sd(mls_teams$gini, na.rm = TRUE), accuracy = 0.01)),
  )  +
  theme_minimal() +
  theme(
    text = element_text(family = "serif"),
    plot.title = element_text(hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    axis.text = element_text(size = 10),
    plot.background = element_blank()
  )

Inequality Within the MLS

The difference in salary distribution within the MLS compared to other soccer teams is striking. Pay inequality within the MLS has been written about before. The primary culprit behind this inequity is the “designated player rule,” also known as the “Beckham Rule,” which allows teams to sign up to three players outside of their salary cap in order to compete with international markets. Lionel Messi is a recent example of this, earning about half of Inter Miami’s entire 2023 budget. While this rule allows teams to sign international talent, the inequity drives dissatisfaction among local talent and disincentivizes young American athletes from pursuing the MLS as a career, or soccer entirely.

Player Salaries

Below are three Pareto charts, one each for the MLS, European soccer clubs, and the NBA. For each, I took player salaries from 2020 onward, sorted them, and split them into 100 equal-sized buckets. The height of each column is the average salary within that bucket. The red line is the cumulative percentage of total wages accounted for by that bucket and every higher-paid bucket. There are two main points to look out for.

  1. The skewness of the data. Each graph is skewed to the right, as expected, but the steepness corresponds to the difference between top earners and the rest.

  2. The concentration of earnings. The faster the cumulative percentage line approaches 100, the higher the concentration of earnings is in the top earners of the group.
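To make the difference between these two points concrete, here is a small sketch with made-up numbers. The `top_share()` helper and both ten-player "leagues" are hypothetical and only illustrate how a steep drop-off and a high concentration of earnings can come apart.

```r
# Share of total wages earned by the top `top_frac` of players (hypothetical helper).
top_share <- function(wages, top_frac) {
  wages <- sort(wages, decreasing = TRUE)
  k <- max(1, round(length(wages) * top_frac))   # number of players in the top group
  sum(wages[seq_len(k)]) / sum(wages)
}

# Two made-up ten-player leagues with the same total payroll of 100.
star_league <- c(55, 5, 5, 5, 5, 5, 5, 5, 5, 5)       # one outlier, flat otherwise
deep_league <- c(20, 18, 16, 12, 10, 8, 6, 4, 3, 3)   # larger well-paid cohort

top_share(star_league, 0.1)   # share held by the single top earner
top_share(deep_league, 0.1)
top_share(star_league, 0.5)   # share held by the top half
top_share(deep_league, 0.5)
```

In this toy example the single star takes 55% of the payroll in one league against 20% in the other, yet the top half of each league ends up with almost the same share (75% and 76%). A steep drop-off at the very top does not by itself imply a higher overall concentration.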

mls_buckets <- player_wages_soccer %>%
  filter(year_2 >= 2020, league == "mls") %>%
  arrange(desc(annual_wages)) %>% 
  mutate(bucket = ntile(annual_wages, 100)) %>% 
  group_by(bucket) %>%
  summarize(n = n(), 
            bucket_wages = sum(annual_wages),
            bucket_avg = mean(annual_wages)) %>%
  arrange(desc(bucket)) %>%
  mutate(pct_total = bucket_wages / sum(bucket_wages) * 100,
         cumulative_pct = cumsum(pct_total))

mls_buckets %>%
  arrange(bucket) %>%
  ggplot(aes(x = bucket)) +
  geom_col(aes(y = bucket_avg), fill = "#3498db", color = "white") +
  scale_x_reverse() +
  geom_line(aes(y = cumulative_pct * max(mls_buckets$bucket_avg) / 100), color = "red") +
  scale_y_continuous(sec.axis = sec_axis(~ . * 100 / max(mls_buckets$bucket_avg), name = "Cumulative Percentage"), labels = scales::comma) +
  labs(x = "Bucket", y = "Bucket Average", title = "MLS Pareto Chart") +
  theme(text = element_text(family = "serif"))

eu_buckets <- player_wages_soccer %>%
  filter(year_2 >= 2020, league != "mls") %>%
  arrange(desc(annual_wages)) %>% 
  mutate(bucket = ntile(annual_wages, 100)) %>% 
  group_by(bucket) %>%
  summarize(n = n(), 
            bucket_wages = sum(annual_wages),
            bucket_avg = mean(annual_wages)) %>%
  arrange(desc(bucket)) %>%
  mutate(pct_total = bucket_wages / sum(bucket_wages) * 100,
         cumulative_pct = cumsum(pct_total))

eu_buckets %>%
  arrange(bucket) %>%
  ggplot(aes(x = bucket)) +
  geom_col(aes(y = bucket_avg), fill = "#3498db", color = "white") +
  scale_x_reverse() +
  geom_line(aes(y = cumulative_pct * max(eu_buckets$bucket_avg) / 100), color = "red") +
  scale_y_continuous(
    sec.axis = sec_axis(~ . * 100 / max(eu_buckets$bucket_avg), name = "Cumulative Percentage"),
    labels = scales::comma,
    breaks = seq(0, max(eu_buckets$bucket_avg), by = 5000000) 
  ) +
  labs(x = "Bucket", y = "Bucket Average", title = "European Clubs Pareto Chart") +
  theme(text = element_text(family = "serif"))

nba_buckets <- basketball_players %>% 
  ungroup() %>%
  filter(year_2 >= 2020) %>%
  arrange(desc(salary)) %>% 
  mutate(bucket = ntile(salary, 100)) %>% 
  group_by(bucket) %>%
  summarize(n = n(), 
            bucket_wages = sum(salary),
            bucket_avg = mean(salary)) %>%
  arrange(desc(bucket)) %>%
  mutate(pct_total = bucket_wages / sum(bucket_wages) * 100,
         cumulative_pct = cumsum(pct_total))

nba_buckets %>%
  arrange(bucket) %>%
  ggplot(aes(x = bucket)) +
  geom_col(aes(y = bucket_avg), fill = "#3498db", color = "white") +
  scale_x_reverse() +
  geom_line(aes(y = cumulative_pct * max(nba_buckets$bucket_avg) / 100), color = "red") +
  scale_y_continuous(sec.axis = sec_axis(~ . * 100 / max(nba_buckets$bucket_avg), name = "Cumulative Percentage"), labels = scales::comma) +
  labs(x = "Bucket", y = "Bucket Average", title = "NBA Pareto Chart") +
  theme(text = element_text(family = "serif"))

These graphs offer some interesting observations. First, compare the MLS to European clubs. The MLS appears to be considerably more skewed, with a much steeper drop-off from top earners to the rest of the league. At the same time, the concentration of earnings (the cumulative percentage line) is almost the same as that of the European clubs. This is fairly consistent with what I understand of the “designated player rule.” There are a few players that far out-earn their peers, but since they are few in number they don’t dramatically affect the overall concentration of earnings.

When comparing the NBA graph to those of the soccer teams, two things seem clear. The skew appears to be less dramatic than either the MLS or European clubs, and there’s a higher concentration of earnings among the top players. The top cohort of players is larger, so they command a higher percentage of the total earnings of the league.

Gini Coefficients Over Time

Below are graphs of median Gini Coefficients over time. I would have been interested to see MLS data from before the “designated player rule” was introduced in 2007. Sadly the dataset doesn’t go back that far. With some exceptions, Gini Coefficients have stayed relatively steady over time.

soccer_teams %>%
  mutate(league = factor(league, levels = c("mls", "la_liga", "bundesliga", "serie_a", "premier_league"))) %>%
  group_by(year_2, league) %>%
  summarize(median_gini = median(gini, na.rm = TRUE)) %>%
  ggplot(aes(year_2, median_gini, color = league)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(limits = c(.1, .5)) +
  labs(x = "Year", y = "Median Gini Coefficient", title = "Soccer Team Median Gini Coefficients Over Time", color = "League") +
  scale_color_manual(values = my_color_palette, labels = c("MLS", "La Liga", "Bundesliga", "Serie A", "Premier League")) +
  theme(text = element_text(family = "serif"))

basketball_teams %>%
  group_by(year_2) %>%
  summarize(median_gini = median(gini, na.rm = TRUE)) %>%
  ggplot(aes(year_2, median_gini)) +
  geom_line(color = "#3498db") +
  geom_point(color = "#3498db") +
  scale_y_continuous(limits = c(.2, .5)) +
  labs(x = "Year", y = "Median Gini Coefficient", title = "NBA Median Gini Coefficients Over Time") +
  theme(text = element_text(family = "serif"))

Impacts on Team Results

Relative Wages

With a higher budget, a team can acquire more talent, and a more talented team tends to win more games. There have been many examples of well-funded and talented teams that have underperformed, but that relationship is generally true. I’ll start by plotting team budgets against success metrics for basketball and soccer. For the NBA I’m using win percentages, and for soccer I’m using points. Since draws are possible and common in soccer, using win percentages wouldn’t be sufficient. I also calculated relative wages for each team using the following equations, since it doesn’t make much sense to compare raw budgets against each other over time.

  • \(\text{Soccer Data: } \text{Relative Wage} = \frac{\text{Team Wage}}{\text{Average Team Wage in Year y and League l}}\)
  • \(\text{NBA Data: } \text{Relative Wage} = \frac{\text{Team Wage}}{\text{Average Team Wage in Year y}}\)

# Average team wage bill per league and season: the denominator of relative wage
avg_wages <- soccer_teams %>%
  group_by(year_2, league) %>%
  summarize(avg_wages = mean(annual_wages, na.rm = TRUE))

soccer_teams <- soccer_teams %>% left_join(avg_wages, by = c("year_2", "league"))
soccer_teams <- soccer_teams %>% mutate(relative_wage = annual_wages / avg_wages)

basketball_avg_salary <- basketball_teams %>%
  group_by(year_2) %>%
  summarize(avg_wages = mean(salary, na.rm = TRUE))
basketball_teams <- basketball_teams %>%
  left_join(basketball_avg_salary, by = "year_2") %>%
  mutate(relative_wage = salary / avg_wages)

ggplot(soccer_teams, aes(x = relative_wage, y = Pts)) +
  geom_point(color = "#3498db", alpha = 0.7) + 
  geom_smooth(color = "#e74c3c") + 
  labs(
    title = "Relative Wage and Points in Professional Soccer",
    x = "Relative Wage",
    y = "Points"
  ) +
  theme(text = element_text(family = "serif")
  )

ggplot(basketball_teams, aes(x = relative_wage, y = win_pct)) +
  geom_point(color = "#3498db", alpha = 0.7) + 
  geom_smooth(color = "#e74c3c") + 
  labs(
    title = "Relative Wage and Win Percentage in the NBA",
    x = "Relative Wage",
    y = "Win Percentage"
  ) +
  theme(text = element_text(family = "serif")
  )

In both the NBA and professional soccer, a higher relative wage is associated with more success. The effect diminishes as relative wage increases, which is intuitive: there are diminishing returns to spending. In the NBA data the relationship actually reverses at the highest relative wages. Perhaps teams with extremely high relative wages care more about how much they spend than about how efficiently they spend it. Because so few teams have very high relative wages, I would not place too much weight on this observation. Many of those teams are the New York Knicks, and no amount of money can fix that franchise.

Regressions

Now I’ll run multiple linear regressions to see whether Gini Coefficients have a significant relationship with success, holding relative wage constant.

Soccer Regression

\[ \text{Points} = \beta_0 + \beta_1 \cdot \text{GiniCoefficient} + \beta_2 \cdot \text{RelativeWage} + \varepsilon \]

soccer_model <- lm(Pts ~ gini + relative_wage, data = soccer_teams)
summ(soccer_model)
Observations         1095
Dependent variable   Pts
Type                 OLS linear regression

F(2,1092)   407.47
R²          0.43
Adj. R²     0.43

                 Est.     S.E.   t val.   p
(Intercept)      43.84    0.91   48.11    0.00
gini            -27.46    2.85   -9.63    0.00
relative_wage    13.05    0.46   28.31    0.00

Standard errors: OLS

The results of this regression are very interesting. Both the Gini Coefficient and relative wage are statistically significant, and there’s a strong negative relationship between the Gini Coefficient and points. This suggests that wage inequality among soccer teams has a negative impact on success, when holding relative wage constant. A team is seemingly better off distributing its funds more equally than focusing on acquiring expensive individual talent that will eat a high proportion of their total budget.
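To get a feel for the size of the Gini effect, the sketch below plugs the rounded coefficients from the output above into the fitted equation. The `predict_points()` helper and the two example teams are hypothetical. The Gini values are the mean European and mean MLS coefficients quoted earlier (0.225 and 0.413), and both teams are assumed to have an exactly average budget.

```r
# Coefficients copied from the regression output above (rounded as reported).
soccer_coefs <- c(intercept = 43.84, gini = -27.46, relative_wage = 13.05)

# Fitted points for a given Gini coefficient and relative wage.
predict_points <- function(gini, relative_wage, coefs = soccer_coefs) {
  unname(coefs["intercept"] + coefs["gini"] * gini +
           coefs["relative_wage"] * relative_wage)
}

# Two hypothetical average-budget teams (relative wage = 1) that differ
# only in how unevenly they pay their top earners.
even_team   <- predict_points(gini = 0.225, relative_wage = 1)
uneven_team <- predict_points(gini = 0.413, relative_wage = 1)

even_team - uneven_team   # gap in fitted points attributable to the Gini term
```

With these inputs the gap works out to roughly 5 points per season. This is an association within a pooled model, not a causal estimate, and the leagues do not all play the same number of matches, so the figure is best read as an order of magnitude.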

Basketball Regression

\[ \text{WinPercentage} = \beta_0 + \beta_1 \cdot \text{GiniCoefficient} + \beta_2 \cdot \text{RelativeWage} + \varepsilon \]

basketball_model <- lm(win_pct ~  gini + relative_wage, data = basketball_teams)
summ(basketball_model)
Observations         920 (46 missing obs. deleted)
Dependent variable   win_pct
Type                 OLS linear regression

F(2,917)   77.84
R²         0.15
Adj. R²    0.14

                 Est.   S.E.   t val.   p
(Intercept)      0.15   0.03   5.16     0.00
gini             0.27   0.05   5.22     0.00
relative_wage    0.26   0.03   10.37    0.00

Standard errors: OLS

The results of this regression are also interesting, for different reasons. The Gini Coefficient and relative wage variables are both statistically significant, but the model explains little: the adjusted R-squared is 0.14, meaning that the explanatory variables account for a relatively small share of the variability in win rates. The coefficients also look small at first glance, although win percentage is measured on a 0-to-1 scale, so they are not directly comparable to the soccer coefficients, which are measured in points. So this model suggests that wage inequality and relative wages have a statistically significant but limited ability to explain success in the NBA. This is surprising to me, since I would expect relative wages to always have a very strong relationship with success. One could interpret this weak relationship in different ways, barring some problem with the data. Perhaps the spending priorities of NBA teams are not always to acquire the best talent, but to sign big names that fill seats even though they may not integrate well with the team. There have been examples of aging superstars being signed to large contracts, despite their decline in performance.
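Because win percentage runs from 0 to 1, it can help to translate the coefficients into wins. The sketch below does that arithmetic with the rounded coefficients from the output above. The `wins_change()` helper is hypothetical, and the conversion assumes a standard 82-game regular season, which does not hold for shortened seasons.

```r
# Coefficients copied from the regression output above (rounded as reported).
nba_coefs <- c(intercept = 0.15, gini = 0.27, relative_wage = 0.26)

# Convert a change in a predictor into an approximate change in wins,
# assuming a standard 82-game regular season.
wins_change <- function(coef, delta, games = 82) {
  unname(coef * delta * games)
}

wins_change(nba_coefs["gini"], delta = 0.1)            # Gini up by 0.1
wins_change(nba_coefs["relative_wage"], delta = 0.2)   # relative wage up by 0.2, e.g. 1.0 to 1.2
```

Under these assumptions a 0.1 increase in the Gini Coefficient corresponds to a little over 2 extra wins, and a 0.2 increase in relative wage to a little over 4. These are fitted associations from a model with a low R-squared, so individual teams can deviate from them substantially.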

Conclusion

Earlier I suggested that soccer is more of a team sport than basketball. The findings in this report are consistent with that narrative. Soccer teams tend to distribute their funds more evenly among their players than basketball teams do. The MLS is an exception, largely due to the “designated player rule.” The regression results are also consistent with this idea, finding a strong negative relationship between the Gini Coefficient of a team and their success in professional soccer, but only a mild positive relationship among NBA teams. Thanks for reading. I hope you found this interesting.