Introduction

The data set that I chose to analyze comes from the highest men's soccer league in the United States, Major League Soccer (MLS). MLS does not compare in quality or popularity to any of the five major European soccer leagues, but the league is on the rise in both talent and popularity. The number of foreign players coming to America to play in this league is increasing, which is drawing more attention and raising the level of play.

Dataset

The purpose of this data set is to observe the salaries of MLS players in 2021. It consists of 587 observations described by just 6 variables. The variables describe the teams, the players on each team, each player's position, and 2 measures of each player's salary. The 4 main variables used in the visualizations are the club, the player's position, and the 2 salary measures: the base salary and the guaranteed salary.
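
The cleaning code later in this report strips '$' and ',' from the two salary columns before converting them to numbers, which implies they are stored as text. A minimal sketch of that conversion is shown here on two made-up rows. The club names, the values, and the helper name currency_to_number are assumptions for illustration only; the column names are the ones used in the rest of the report.

```python
import pandas as pd

# Hypothetical rows that mimic the salary formatting; not taken from the real file.
sample = pd.DataFrame({
    'Club': ['Club A', 'Club B'],
    'Playing Position': ['F', 'GK'],
    'Base Salary': ['$1,500,000', '$85,000'],
    'Base Guaranteed Comp.': ['$1,612,500', '$85,000'],
})

def currency_to_number(column):
    """Strip '$' and ',' as literal characters, then convert the text to numbers."""
    cleaned = column.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
    return pd.to_numeric(cleaned)

for salary_column in ['Base Salary', 'Base Guaranteed Comp.']:
    sample[salary_column] = currency_to_number(sample[salary_column])

print(sample.dtypes)
```

Passing regex=False matters because '$' is the end-of-string anchor in a regular expression, so treating it as a pattern could leave the dollar sign in place.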

Findings

Analyzing this data helped build a better understanding of which players have larger salaries than others and which teams spend the most on player salaries. It is clear that some teams pay their players more and that players are paid more depending on the position they play. Teams in big cities tend to spend more in total salary, and midfielders and forwards are paid more than the other positions.

import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt
import warnings
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter
warnings.filterwarnings("ignore")
path = "U:/"
df = pd.read_csv(path + 'mls_salaries.csv')
df.to_csv(path + 'mls_salaries2.csv', index = False)
df = pd.read_csv(path + 'mls_salaries2.csv')

Amount That Each Club Spends on Paying Their Players

The analysis of MLS salaries starts with a visualization of the amount that each team paid its players in 2021, found by adding up each player's guaranteed salary on each team. This is shown through a horizontal bar chart that also marks the mean of the data. The chart covers all 27 teams in the league and shows which ones are above and below the league-wide average of total team spending.

The teams at the top of this list are not much of a surprise. Miami, Toronto and Los Angeles are all big cities with large markets, and they are the top 3 spenders. Clubs in these big cities are able to attract better players, who command higher salaries. The gap between the highest- and lowest-spending teams in 2021 is large in relative terms. In absolute terms it is only around $9 million, which may not be a lot for other American sports, but Inter Miami, the highest-spending team, spends more than double what the Vancouver Whitecaps, the lowest-spending team, spend. The average team spends just over $12 million, and the majority of teams fall below that mark.
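
The comparisons in this section (club totals, the league-wide mean, the absolute gap and the ratio between the highest and lowest spender) can be sketched on a tiny example. The three clubs and their salaries below are made up for illustration and are not taken from the data set; only the column names match the report.

```python
import pandas as pd

# Hypothetical guaranteed salaries (already numeric) for three made-up clubs.
players = pd.DataFrame({
    'Club': ['Club A', 'Club A', 'Club B', 'Club B', 'Club C'],
    'Base Guaranteed Comp.': [6_000_000, 4_000_000, 3_000_000, 2_000_000, 3_000_000],
})

# Total guaranteed salary per club, highest spender first.
club_totals = (players.groupby('Club')['Base Guaranteed Comp.']
               .sum()
               .sort_values(ascending=False))

league_mean = club_totals.mean()                     # average total per club
spread = club_totals.max() - club_totals.min()       # absolute gap in dollars
ratio = club_totals.max() / club_totals.min()        # relative gap
clubs_below_mean = int((club_totals < league_mean).sum())

print(club_totals)
print(f'mean={league_mean:,.0f} spread={spread:,.0f} ratio={ratio:.2f} below mean={clubs_below_mean}')
```

Reporting both the absolute gap and the ratio helps explain how a difference that looks small in dollars can still mean one club spends more than double another.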


df['Last Name'] = df['Last Name'].fillna('Unknown')
df = df.dropna()
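# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
df['Base Salary'] = df['Base Salary'].str.replace('$', "", regex=False)
df['Base Salary'] = df['Base Salary'].str.replace(',', "", regex=False)
df['Base Guaranteed Comp.'] = df['Base Guaranteed Comp.'].str.replace('$', "", regex=False)
df['Base Guaranteed Comp.'] = df['Base Guaranteed Comp.'].str.replace(',', "", regex=False)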
df['Base Salary'] = pd.to_numeric(df['Base Salary'])
df['Base Guaranteed Comp.'] = pd.to_numeric(df['Base Guaranteed Comp.'])
x = df.groupby(['Club']).agg({'Base Salary':['sum'],'Base Guaranteed Comp.':['sum']}).reset_index()
x.columns = ['Club', 'Salary', 'Guaranteed']
x = x.sort_values('Salary', ascending=False)
x.reset_index(inplace=True, drop=True)
def pick_colors_according_to_mean_salary(this_data):
    colors=[]
    avg= this_data.Guaranteed.mean()
    for each in this_data.Guaranteed:
        if each > avg*1.01:
            colors.append('lightcoral')
        elif each < avg*0.99:
            colors.append('green')
        else:
            colors.append('black')
    return colors
bottom3= 0
top3= 26
d3= x.loc[bottom3:top3]
d3= d3.sort_values('Guaranteed', ascending=True)
d3.reset_index(inplace=True, drop=True)
my_colors3 = pick_colors_according_to_mean_salary(d3)
Above =mpatches.Patch(color='lightcoral', label='Above Average')
At =mpatches.Patch(color='black', label='Within 1% of the Average')
Below =mpatches.Patch(color='green', label='Below Average')
fig = plt.figure(figsize= (20,12))
ax1 = fig.add_subplot(1,1,1)
ax1.barh(d3.Club, d3.Guaranteed, color= my_colors3)
for row_counter, value_at_row_counter in enumerate(d3.Guaranteed):
    if value_at_row_counter > d3.Guaranteed.mean()*1.01:
        color= 'lightcoral'
    elif value_at_row_counter < d3.Guaranteed.mean()*0.99:
        color= 'green'
    else: 
        color= 'black'
    ax1.text(value_at_row_counter+2, row_counter, 
             '{:,}'.format(value_at_row_counter),
             color= color, size=12, fontweight='bold',
             ha='left', va='center', backgroundcolor= 'white')
plt.xlim(0, d3.Guaranteed.max()*1.1)
ax1.legend(loc='lower right', handles=[Above, At, Below], fontsize=12)
plt.axvline(d3.Guaranteed.mean(),color='black', linestyle='dashed')
ax1.text(d3.Guaranteed.mean()+1e5, 0, 'Mean = ' + str("{:,.0f}".format(d3.Guaranteed.mean())), rotation=0, fontsize=14, fontweight='bold')
ax1.xaxis.set_major_formatter(FuncFormatter( lambda x, pos:('$%1.1fM')%(x*1e-6)))
ax1.set_title('MLS Clubs Guaranteed Salaries', size=20)
ax1.set_xlabel('Guaranteed Base Salary in Millions (M)', fontsize=16)
ax1.set_ylabel('Clubs', fontsize=16)
plt.xticks()
plt.yticks(fontsize=14)
plt.show()

Top 10 Amounts a Club Spends on a Single Position

The next visualization is a dual-axis bar chart showing the 10 largest amounts that any club allocates to a single position group. It reports both the base and the guaranteed salary for each group. All 10 pairs of bars represent either the midfielders or the forwards of one of the listed clubs, which suggests that clubs value forwards and midfielders above the other positions.

The majority of the clubs on this list spend around 50% of their total salary budget on a single position. Most of the clubs in this visualization are no surprise, because 9 of them spent more than the league average on total salaries. The surprise is Real Salt Lake, which spends more than $6.7 million on just one position while spending only a little over $10.5 million in total.
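
The share of a club's total that goes to one position can be computed directly. For Real Salt Lake, the figures quoted above give roughly 6.7 / 10.5, or about 64%. The sketch below shows one way to compute that share for every club and position group. The data is made up for illustration, and the column names 'Club Total' and 'Share' are assumptions introduced here.

```python
import pandas as pd

# Hypothetical guaranteed salaries for two made-up clubs.
players = pd.DataFrame({
    'Club': ['Club A', 'Club A', 'Club A', 'Club B', 'Club B'],
    'Playing Position': ['F', 'M', 'D', 'M', 'GK'],
    'Base Guaranteed Comp.': [5_000_000, 3_000_000, 2_000_000, 1_500_000, 500_000],
})

# Total per (club, position) group.
by_position = (players
               .groupby(['Club', 'Playing Position'], as_index=False)['Base Guaranteed Comp.']
               .sum())

# Attach each club's overall total to its rows, then divide to get the share.
by_position['Club Total'] = by_position.groupby('Club')['Base Guaranteed Comp.'].transform('sum')
by_position['Share'] = by_position['Base Guaranteed Comp.'] / by_position['Club Total']

# The largest position groups by dollar amount, with their share of the club total.
top_groups = by_position.nlargest(2, 'Base Guaranteed Comp.')
print(top_groups)
```

Sorting by dollar amount and by share can give different rankings: in this example Club B's midfielders take the largest share of their club's budget without being one of the largest groups in dollars.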


x1 = df.groupby(['Club', 'Playing Position']).agg({'Base Salary':['sum'],'Base Guaranteed Comp.':['sum']}).reset_index()

x1.columns = ['Club','Position', 'Salary', 'Guaranteed']
x1 = x1.sort_values('Salary', ascending=False)
x1.reset_index(inplace=True, drop=True)

def autolabel(these_bars, this_ax, place_of_decimal, symbol):
    for each_bar in these_bars:
        height= each_bar.get_height()
        this_ax.text(each_bar.get_x()+each_bar.get_width()/2, 
                     height*1.005, 
                     symbol+format(height/1e6, place_of_decimal)+'M',
                     fontsize=15, 
                     color='black', 
                     ha='center', 
                     va='bottom')
                     
bottom1= 0
top1= 9
d1= x1.loc[bottom1:top1]

fig = plt.figure(figsize=(26,16))
ax1 = fig.add_subplot(1,1,1)
ax2 = ax1.twinx()
bar_width = 0.45

x1_pos = np.arange(10)
base_bars = ax1.bar(x1_pos-(0.5*bar_width), d1.Salary, bar_width, color='grey', edgecolor='black', label='Base salary')
guaranteed_bars = ax2.bar(x1_pos+(0.5*bar_width), d1.Guaranteed, bar_width, color='green', edgecolor='black', label='Guaranteed salary')

ax1.set_xlabel('Club Paying the Position', fontsize= 18)
ax1.set_ylabel('Base Salary in Millions (M)', fontsize= 20, labelpad= 20)
ax2.set_ylabel('Guaranteed Salary in Millions (M)', fontsize= 20, rotation= 270, labelpad= 20)
ax1.tick_params(axis='y', labelsize= 18)
ax2.tick_params(axis='y', labelsize= 18)

plt.title('Top 10 Most Money a Club Allocates \n to a Single Position', fontsize =24)
ax1.set_xticks(x1_pos)
ax1.set_xticklabels(d1.Club,  fontsize= 18)

base_color, base_label = ax1.get_legend_handles_labels()
guaranteed_color, guaranteed_label = ax2.get_legend_handles_labels()
legend= ax1.legend(base_color + guaranteed_color, base_label + guaranteed_label, loc='upper right', frameon=True,
                   ncol=1, shadow=True, borderpad= 1, fontsize=18)

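# Give both y-axes the same limits so the base and guaranteed bars are drawn on one scale
shared_top = max(d1.Salary.max(), d1.Guaranteed.max())*1.26
ax1.set_ylim(0, shared_top)
ax2.set_ylim(0, shared_top)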
ax1.yaxis.set_major_formatter(FuncFormatter( lambda x, pos:('$%1.fM')%(x*1e-6)))
ax2.yaxis.set_major_formatter(FuncFormatter( lambda x, pos:('$%1.fM')%(x*1e-6)))

autolabel(base_bars, ax1, '.2f', '$')
autolabel(guaranteed_bars, ax2, '.2f', '$')

plt.show()

Number of Players in Each Position for Each Club

The next visualization shows the number of players that each club has in each position. The largest single group belongs to Montreal, which has 14 midfielders. Midfielders and defenders are the two positions that clubs carry the most of. It makes sense that each club has few goalkeepers, since only one can play at a time. The count of forwards falls between the counts of defenders and goalkeepers.

This chart shows very simple information, but it helps clarify how clubs value each position. Since clubs carry noticeably more defenders than forwards, defenders would be expected to account for more of a team's salary than forwards. Since this is not the case, it is clear that clubs consider investing salary money in forwards more important than investing in defenders.
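
The counts behind a chart like this can also be read as a plain table. The sketch below builds a club-by-position table with pd.crosstab and finds the largest single group. The roster is made up for illustration, and only the 'M-D' to 'D-M' relabeling mirrors what the report's code does.

```python
import pandas as pd

# Hypothetical roster: one row per player.
roster = pd.DataFrame({
    'Club': ['Club A'] * 5 + ['Club B'] * 4,
    'Playing Position': ['M', 'M', 'D', 'F', 'GK', 'M-D', 'D', 'M', 'F'],
})

# Write the hybrid label one way so 'M-D' and 'D-M' are counted together.
roster['Playing Position'] = roster['Playing Position'].str.replace('M-D', 'D-M', regex=False)

# Rows are clubs, columns are positions, cells are player counts.
counts = pd.crosstab(roster['Club'], roster['Playing Position'])

# The (club, position) pair with the most players.
group_sizes = roster.groupby(['Club', 'Playing Position']).size()
largest_group = group_sizes.idxmax()

print(counts)
print(largest_group, group_sizes.max())
```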


df3 = pd.read_csv(path + 'mls_salaries.csv', usecols = ['Club', 'Playing Position'])

df3['Playing Position'] = df3['Playing Position'].str.replace('M-D','D-M')
df3['Playing Position'] = df3['Playing Position'].str.replace('M-F','F-M')

x4 = df3.groupby(['Club', 'Playing Position'])['Playing Position'].count().reset_index(name='count')

x4= x4.rename(columns={'Club': 'Club', 'Playing Position': 'Position',})

x5= x4.groupby(['Club', 'Position'])['count'].sum().reset_index()
x5= pd.DataFrame(x5)

x5['count_hundreds'] = round(x5['count']/.01,0)

x5 = x5.sort_values('Club', ascending=False)

plt.figure(figsize=(18,19))

# Color by the raw player count so the colorbar reads in players; the scaled value only sets marker size
plt.scatter(x5['Position'], x5['Club'], marker= '8', cmap='viridis',
           c=x5['count'], s=x5['count_hundreds'], edgecolors='black')

plt.title('Count of Each Position for Each Club', fontsize=18)
plt.xlabel('Positions', fontsize=14)
plt.ylabel('Clubs', fontsize=14)

cbar = plt.colorbar()
cbar.set_label('Number of Players in Each Position', rotation= 270, fontsize= 14, color='black', labelpad=30)
               
plt.show()

Positional Breakdown of Guaranteed Salary for the Top 10 Clubs

This next visualization is a stacked bar chart. It shows the top 10 teams by total guaranteed salary and how each splits that spending across positions. It is clear that midfielders and forwards are paid the most by all of the teams. Part of this may be because they are known as the goalscorers. Goalscorers tend to get the most publicity and attract the most hype, which may lead to higher salaries. In addition, more foreign attackers are coming into the league, often bought for millions of dollars, meaning clubs are investing in them more heavily than in domestic players.
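
A stacked bar chart needs one column per position and, for each segment, the height at which it starts. The sketch below shows one way to derive both with a pivot table and a cumulative sum. The clubs and amounts are made up, and the names wide, tops and bottoms are assumptions introduced for this illustration.

```python
import pandas as pd

# Hypothetical guaranteed salary per (club, position) group.
spending = pd.DataFrame({
    'Club': ['Club A', 'Club A', 'Club B', 'Club B', 'Club B'],
    'Playing Position': ['F', 'M', 'D', 'F', 'M'],
    'Base Guaranteed Comp.': [4_000_000, 2_000_000, 1_000_000, 3_000_000, 1_000_000],
})

# One row per club, one column per position; missing groups become 0.
wide = spending.pivot_table(index='Club', columns='Playing Position',
                            values='Base Guaranteed Comp.', aggfunc='sum', fill_value=0)

# Running total across positions gives the top of each stacked segment;
# subtracting the segment's own height gives where it starts.
tops = wide.cumsum(axis=1)
bottoms = tops - wide

print(wide)
print(bottoms)
```

Each column of bottoms could be passed as the bottom argument of a bar call for the matching column of wide, and the last column of tops is each club's total.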


df['Playing Position'] = df['Playing Position'].str.replace('M-D','D-M')
df['Playing Position'] = df['Playing Position'].str.replace('M-F','F-M')

# d3 is sorted ascending by guaranteed salary, so the top 10 spenders are its last 10 rows
top_df = df[ df.Club.isin(d3.Club.tail(10)) ]

stacked_df1 = top_df.groupby(['Club', 'Playing Position'])['Base Guaranteed Comp.'].sum().reset_index(name= 'Guaranteed Salary')

stacked_df1 = stacked_df1.pivot(index= 'Club', columns= 'Playing Position', values= 'Guaranteed Salary')
stacked_df1 = stacked_df1.reset_index(drop=False)
stacked_df1 = stacked_df1.fillna(0)

fig=plt.figure(figsize= (18, 10));
ax=fig.add_subplot(1, 1, 1);

# Stack one bar segment per position, tracking the running height so each
# segment starts where the previous one ended; labels feed the legend.
position_colors = {'D': 'red', 'D-M': 'blue', 'F': 'green', 'F-M': 'black', 'GK': 'purple', 'M': 'brown'}
running_bottom = np.zeros(len(stacked_df1))
for position, color in position_colors.items():
    ax.bar(stacked_df1.Club, stacked_df1[position], bottom=running_bottom, color=color, label=position)
    running_bottom = running_bottom + stacked_df1[position].to_numpy()
ax.legend(title='Playing Position', fontsize=14)


ax.set_ylabel('Total Guaranteed Salary in Millions (M)', fontsize=18);
plt.title('Total Guaranteed Salary by Position\n Stacked Bar Plot', fontsize=18)
ax.yaxis.set_major_formatter(FuncFormatter( lambda x, pos:('$%1.fM')%(x*1e-6)))
#ax.xticks(fontsize=10, rotation=45)
plt.yticks(fontsize=14)
plt.xticks(rotation= 45, ha='right') 
ax.set_xlabel('Clubs', fontsize=18)

plt.show()

Share of Guaranteed Salary by Position

The final visualization is a pie chart that shows the percentage of total salary that each position receives league-wide. Midfielders receive 36.1% of the $323.12 million in salary throughout the league, while players listed as both defenders and midfielders (D-M) receive the lowest share of the total at only 2.6%. This chart really shows how much forwards are valued. The forward position takes up 29% of the total salary while having far fewer players than either the midfielders or the defenders. This suggests that forwards are paid the most on average: their share is only about 7 percentage points below that of midfielders, despite their much smaller count, as seen in the third visualization.
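
The argument in this section rests on the difference between a position's share of total salary and its average salary per player. A minimal sketch of that comparison follows, using made-up salaries chosen so that midfielders hold the largest share while forwards have the highest average. The column names total, players, average and share_pct are assumptions introduced here.

```python
import pandas as pd

# Hypothetical guaranteed salaries: many midfielders, few but well-paid forwards.
players = pd.DataFrame({
    'Playing Position': ['M', 'M', 'M', 'M', 'F', 'F', 'D', 'D', 'D', 'GK'],
    'Base Guaranteed Comp.': [900_000, 900_000, 900_000, 900_000,
                              1_500_000, 1_500_000,
                              400_000, 400_000, 400_000,
                              200_000],
})

# Total, head count and per-player average for each position.
summary = (players.groupby('Playing Position')['Base Guaranteed Comp.']
           .agg(total='sum', players='count', average='mean'))

# Share of the overall total, in percent (what a pie chart displays).
summary['share_pct'] = 100 * summary['total'] / summary['total'].sum()

print(summary.sort_values('share_pct', ascending=False))
```

In this toy example midfielders take the largest slice because there are more of them, while forwards earn the most per player, which is the same pattern the report describes.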


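# Note: d3 is sorted ascending by guaranteed salary, so the slice [1:27] skips
# the club at position 0 (the lowest spender); d3.Club would include all 27 clubs.
pie_df = df[ df.Club.isin(d3.Club[1:27]) ].reset_index(drop=True)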
pie_df = pie_df.groupby(['Playing Position'])['Base Guaranteed Comp.'].sum().reset_index(name= 'Guaranteed Salary')
pie_df = pie_df.sort_values(by='Guaranteed Salary', ascending=False).reset_index(drop=True)

all_guaranteed = pie_df['Guaranteed Salary'].sum()

colormap = plt.get_cmap("tab20c")
mycolors = colormap([1, 5, 9, 13, 17, 19])

fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(1,1,1)

ax.pie(pie_df['Guaranteed Salary'], labels=pie_df['Playing Position'], autopct='%1.1f%%',
       shadow=False, startangle=100,
       colors=mycolors, textprops={'color':"black"},
       pctdistance=0.85, labeldistance=1.1)

# Draw a white circle over the center to turn the pie into a donut
hole = plt.Circle((0,0), 0.5, fc ='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Amount of Salary Allocated to each Position', fontsize=18)
ax.axis('equal')

# Print the total guaranteed salary in the center of the donut
ax.text(0,0, 'Guaranteed Salary\n' + '$' + str(round(all_guaranteed/1e+6,2))+'M', size = 14, ha='center', va='center')


plt.show()

Conclusion

After analyzing this data, a lot was learned about how MLS teams pay their players. Forwards are paid the most, midfielders the second most, and defenders and goalkeepers follow. Clubs in large markets also tend to spend more on players. This makes sense as the league is growing beyond a purely domestic league: foreign players with impressive careers come to the United States to finish their careers, and they tend to pick a large city. This is why forwards and midfielders at clubs in cities like Los Angeles, Toronto and New York are paid the most in the league.